Server admin log/Archive 5

October 31

  • 21:09 brion: added some tor ips from [1] manually to mwblocker.log
  • 12:10 mark: Started squid on will
  • 10:53 Tim: set up hourly apache-restart on yaseo apaches
  • 10:05 sleeeepy-brion: started tarball-of-doom from khaldun->albert for enwiki images (non-thumbs) trickling
  • 09:49 zzzz-brion: restarted last of search servers in tampa with data updated from snapshots
  • 07:25 tired-brion: restarting tarball-of-doom on bacon under trickle so it doesn't slow things down
  • 07:05 wacky-brion: creating giant tarball-of-doom on bacon to snapshot commons files for archive/copy
  • 06:24 scary-brion: lifted restriction on reuploads. commons main files and archives now updated, thumbs seem to work (and copying in updates to maybe save some render time). may need to re-touch permissions at end
  • 05:30 fiendish-brion: mounting bacon's /var/upload2 on /mnt/upload2 on zwinger, apaches
  • 02:42 evil-brion: disabled reload priv for all users, all wikis, to try to get this image crap over and done with soon. going to migrate live commons files to bacon to try to reduce albert load
  • 02:25 goblin-brion: humboldt set up and running as an apache, in lvs
  • 01:44 ghoul-brion: srv6 also refusing connections, squid stuck, had to kill and restart
  • 01:15 pirate-brion: srv7 refusing connections on port 80, but squid seemed to be stuck (restart complained squid was already running). killed and restarted squid, seems ok now
  • 00:55 ghost-brion: bugs:3838 set localtime to UTC on ixia, lomaria, thistle
  • 00:45 daemon-brion: added 'bugs'/'bugzilla' interwiki prefix on wikitech
  • 00:39 zombie-brion: bugs:3839 installed ntp on humboldt to sync time

October 30

  • 21:40-22:10 Tim: deployed LVS in front of squid at yaseo.
  • 21:59 hashar: created dsh group apaches_yaseo (synced from amaryllis), moved apaches to apaches_pmtpa and set symlink for backward compatibility.
  • 16:00 mark: I moved anthony from internal to external VLAN, gave it ip .233, and wanted to make it a temporary Squid. However, it's giving disk errors, so that might not be such a good idea. Added to datacentre tasks for Kyle to look at.
  • 15:00 mark: Upgraded all yaseo squids to the new squid RPM.
  • 14:40 mark: Upgraded all pmtpa squids (except will, which is running FC2) to the new squid RPM.
  • 14:10 mark: Upgraded all knams squids (except clematis, for comparison) to a new squid RPM, squid-2.5.STABLE12-1wm. This is a somewhat newer upstream Squid version, and also has a cron job added that checks whether Squid is still (supposed to be) running, and restarts it if it's not.
  • 14:10 hashar: fixed a bug with server inventory, was putting larousse data every time.
  • 13:50 mark: Installed NTP on vandale
  • 13:40 mark: Moved LVS back from iris to pascal
  • 13:25 hashar: started the Server inventory bot on wikitech site. Need feedback.
  • 10:29 brion: running lucene wikipedia index rebuilds in pmtpa
  • 09:23 brion: restarted yaseo apaches; extreme slowness in HTTP connection and response time, and some segfaults in logs. Seems better after restart.
  • 06:50 brion: activated lucene search daemon in yaseo. running non-wikipedia index rebuilds in pmtpa
  • 06:00 brion: restarted search daemons in pmtpa with wikisource. building ms/th/ko/ja indexes in yaseo, going to start more rebuilds in pmtpa...
  • 04:10 Tim: installed PHP on yf1007, for some reason it wasn't there
  • 04:05 Tim: Took yf1006, yf1008, yf1010, yf1017 out of rotation, they were segfaulting on form submission, e.g. save and move.
  • 02:45 Tim: Set up mxircecho at yaseo
  • 02:30 brion: setting up yf1017 as search server for yaseo
  • 02:14 Tim: moved jawiki to yaseo
  • 02:00 brion: running lucene index builds from last dumps for *wikisource
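
The 14:10 entry above mentions a cron job that restarts Squid when it is supposed to be running but is not. The real job shipped in the RPM is not shown in this log; the following is only a minimal sketch of that decision, with every name (watchdog_tick and its parameters) made up for illustration. How "supposed to be running" is determined is left to the caller.

```python
from typing import Callable


def watchdog_tick(expected_running: bool,
                  process_running: bool,
                  restart: Callable[[], None]) -> bool:
    """One pass of a hypothetical watchdog.

    Calls restart() only when the service is supposed to be up but no
    process was found. Returns True if a restart was triggered.
    """
    if expected_running and not process_running:
        restart()
        return True
    return False
```

A cron wrapper would call this once per run, passing a restart callable that invokes the init script. Passing False for expected_running keeps a deliberately stopped Squid from being started again.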

--Hashar 06:10, 30 October 2005 (PST)

October 29

  • 23:53 hashar: renamed squids dsh file to squids_pmtpa (and put a symlink)
  • 22:35 Tim: moved thwiki and mswiki to yaseo
  • 21:19 hashar: BUG srv24 & srv27 out of apache group (see October 19) but are still in ganglia Apache group.
  • 20:29 Tim: unmounted dead NFS mount srv2:/usr/local/fcache everywhere
  • 20:00 Tim: took webster (and srv9) out of dsh ALL until someone can work out how to set up LDAP
  • 15:41 hashar: made squid error message validate (size is invalid for hr element)
  • 15:25 hashar: some people on irc told me that search on sources wikis doesn't work. Looking at MWDaemon.log, the search indexes do not exist and need to be created. Need some documentation on LuceneSearch.
  • 07:40 brion: starting dump jobs on benet, srv35, srv36
  • 07:06 Tim: Moved benet's increasingly large collection of NFS mount points to /var/backup/public/mnt, with symlinks left behind. They were previously scattered all over the place. There's a bug in lighttpd which requires them to be mounted in the document root. Mounted a directory from bacon, with some image dumps in it.
  • 06:22 brion: fixing up more broken site_stats tables; fixed addwiki.php to use correct row id
  • 04:04 Tim: unmounted dead NFS share /home/wikipedia/backup on bacon
  • 00:51 Tim: Started copy of jawiki's external storage to yaseo.
  • 00:47 Tim: Copied mswiki to yaseo
  • 00:30 Tim: Copied thwiki to yaseo, now replicating. Images still need to be copied.

October 28

  • 22:35 hashar: following the samuel trouble 2 days ago, there are still some ghost articles on at least frwiki. Will manually fix that Oct. 29 if I have time.
  • 21:49 Tim: LVS now in service in front of yaseo apaches. yf1010-1017 are now in service as apaches, they were previously idle but with apache installed. yf1018 is LVS, yf1019 is the experimental wikimaps installation and could be considered a spare load balancer.
  • 21:42 Tim: noticed that perlbal was still taking a fair bit of load on bacon. Killed icpagent on bacon, increased icpagent delay time from 1 to 5ms on holbach.
  • 21:03 Tim: Running sync-to-seoul, about to set up LVS on yf1018
  • 18:35 mark: Installed a squid RPM without epoll (otherwise identical) on clematis, to compare memory leak behaviour
  • 17:02 ævar: Turned off automatic capitalization on fowiktionary.
  • 14:40 mark: srv8 Squid had crashed, restarted it. Please pay attention to this until we have a better solution!
  • 06:52 Tim: copying dump of jawiki to henbane:/a/backup .
  • 02:20 Tim: changed the password for wikiuser. DB load glitch experienced due to a migration bug.

October 27

  • 17:16 ævar: disabled uploads on ndswiki [2]
  • 05:43 various: killed evil special pages job that broke wiki

October 26

  • 20:18 midom: rolled forward changes made to samuel on other db nodes.
  • 19:18 mark: Routing problems fixed, switched DNS back.
  • 19:00 mark: Routing problems in florida, but knams can reach out. Sent pmtpa traffic via knams.
  • 04:41 kate: put lomaria and thistle into db service
  • 02:40 kate: took suda out of rotation to turn it into fileserver. ixia is in rotation, thistle & lomaria are waiting for mysql to be set up
  • 11:16 brion: installed easytimeline on wikitech ;)
  • 11:00 brion: paused file copy onto bacon for peak hours

October 25

  • 20:15 mark: restarted srv7's squid. Was crashed at 13:00
  • 10:28 kate: ran post-installation setup on ixia, thistle & lomaria
  • 7:00 brion: paused image copy from albert to bacon for the next few hours
  • 6:00 Solar: Ixia, Thistle, and Lomaria, the three new db's are racked and ready!
  • 5:00 Tim: fixed srv11
  • 4:30 Tim: started HTML dump of all Wikipedias, running in 4 threads on srv31

October 24

  • 20:12 various: fixed pascal
  • 18:34 hashar: updated http://noc.wikimedia.org/ with the new wikidev URL.
  • 12:10 mark: Got knams back up with iris as LVS load balancer, with no NFS mounts that can block it
  • 11:30 Pascal went down
  • 07:02 ævar: created bug, pih, vec, lmo and udm wiki
  • 03:37 kate: uploaded the old images from wikidev
  • 02:58 brion: Moved this site from wp.wikidev.net to http://wikitech.leuksman.com/

October 23

  • 22:57 brion: starting trickle copy from khaldun to bacon
  • 21:58 brion: shutting off bacon's broken mysql; clearing out its disk space and making an upload data copy

October 22

  • 23:51 brion: hacked BoardVote to read from master; the boardvote2005 database is missing from suda and there was a lot of whinging in the database error log about it
  • 23:22 brion: fixed group ownership on php-1.5 on yaseo servers (GlobalFunctions.php was unwritable)
  • 22:25 hashar: zwitter fixed the issue (adding a live hack), merged back my changes.
  • 22:00 hashar: broke site search by cvs updating and syncing extensions/LuceneSearch.php
  • 05:37 brion: samuel caught up, back in rotation
  • 05:18 brion: replication broke on samuel due to "Last_error: Error 'Lock wait timeout exceeded; Try restarting transaction'". Took out of rotation to fix

October 21

  • 23:52 Tim: moved wikimedia button (the one in the footer) to the /images directory, to support offline browsing of the HTML dump
  • 19:01 Tim: Increased tcp_rmem on henbane and yf1010, for faster copying.

October 20

  • 06:13 brion: did '/sbin/ip addr add 145.97.39.155 dev eth0' on pascal; got one port 80 connection to go through to vandale, but others still refused
    • routing problems in general; some level3 issue. pmtpa having connection problems, and freenode is splitting
  • 05:56 brion: rr.knams.wikimedia.org (145.97.39.155) does not respond on port 80
  • 05:04 brion: reopened ssh tunnel from yaseo to pmtpa master db and restarted replication on henbane. commons copy is catching up; hopefully kowiki will remain working
  • 01:53 brion: investigating reports that kowiki is COMPLETELY BROKEN DUE TO DATABASE LOCK since 10 hours ago
  • 23:59 mark: Florida squids all seem to run out of memory after a few days... memleak. Will have to investigate.
  • 14:52 kate: setup LVS at knams, on pascal.. no failover yet
  • 11:20 mark: mkfs'd /dev/sdb1 on srv10. Tried to rm -rf /var/spool/squid but this failed, probably due to a corrupted filesystem. Probably needs a reinstall/thorough hw check, but in the meantime, is running with 1 squid service ip and 1 cache_dir.
  • 09:40 future-brion: data dumps scheduled to start on benet, srv35, srv36 pulling from samuel
    • (This is the db-friendly dump with (most?) bugs fixed. May have some extra newlines in old text with \r\n in the raw database, will continue tracking down libxml2 problems later so future dumps will be clean.)

October 19

  • 16:00 mark: Installed the squid RPM on all yaseo squids. Updated hosts-lists.php...
  • 14:15 mark: Squid on will had crashed; restarted it.
  • 06:44 brion: running dump tests on srv35 to confirm that bugs are fixed
  • 03:09 ævar: Translated /home/wikipedia/conf/squid/errors/English/error_utf8.htm into Icelandic, someone with root access might need to run /home/wikipedia/conf/squid/deploy, might.
    • No, they need to be put in the next version of the squid RPM, which can then be deployed... -- mark
  • 01:18 brion: turned apache back off on srv24, srv27 and took them out of apache nodegroup, as avar claims tim said they shouldn't be apaches. [they seem to be memcached and external storage]
  • 01:13 brion: turned srv24, srv27 back on; several gigs have appeared and recopy of settings file succeeded
  • 01:00 brion: turned off apache on srv24, srv27: out of disk space
  • 00:44 brion: added Wikiportal, Wikiproyecto namespaces on eswiki

October 18

  • 11:40 Tim: dewiki and enwiki have been defragmented, unused columns and indexes have been removed. Now starting compression of jawiki.
  • 08:04 Tim: compressOld.php is finished with en, I've now taken adler out of rotation to defragment tables. Running null alter table on dewiki.text first.

October 17

  • 23:56 brion: added recently set up machines to mediawiki-installation nodegroup. (THEY WERE ALREADY IN APACHES GROUP. NOTHING SHOULD BE IN APACHES THAT'S NOT IN MEDIAWIKI-INSTALLATION, EVAR)
  • 22:28 brion: set up gmond on srv49, appears in ganglia now
  • 21:00 mark: Danny says that the 3 new DB servers have been delivered, and he dispatched Kyle for their physical installation tomorrow.
  • 18:15 brion: stopped dumps due to confirmed bug[3]. srv35 and srv36 available for apache for now (added back to node group)
  • 16:20 mark: Remounted /a on srv6 and clematis with the reiserfs nolog option, to disable journaling. Saves disk writes, just mkfs it when it crashes...
  • 13:50 mark: Deployed the new squid RPM on all knams squids
  • 11:40 mark: Deployed the new squid RPM on all Florida squids, except will, which is still running FC1. Can we please reinstall will?
  • 08:47 brion: rebuilt apaches now online: alrazi avicenna friedrich goeje harris hypatia kluge diderot srv49
  • 08:40 brion: enwiki full dump is being run semimanually with the prefetch on an older version. will want to manually fix up links when it's done
  • 07:00ish brion: trying apache setup on alrazi; will mass-run on other machines soonish
  • 01:21 brion: current state of dump runs:
    • srv36: enwiki
    • srv35: all other wikipedias
    • benet: non-wikipedias
    • They're set to pull table dumps and page+rev from adler, but should only touch occasional bits of text.
  • 00:55 brion: took srv35 and srv36 out of apaches node_group so nobody starts Apache on them by accident. :D
  • 00:30 brion: setting up for database dumps using srv35 and srv36. Domas, these should hit the dbs a lot less so please try not to kill them too hard! Thanks.

October 16

  • 21:30 midom: added ariel into enwiki service, made from fresh dump with fresh ibdata!!!!!
  • 17:20 Tim and Domas: deployed LVS-DR to load balance between the squids and the apaches
    • Halved median miss service time, 1100->500ms!
    • Please don't try to add apaches into the LVS realserver pool until I've documented the procedure. A simple mistake, like running the commands in the wrong order, could crash the site.
  • 14:00 mark: The new squid is serving peak load at roughly 20% of the cpu usage of the old squid...
  • 12:00 mark: Put clematis with squid+epoll rpm in production. In case of severe problems, just kill it and start the old squid.
  • 02:22 ævar: Took srv26 out of /usr/local/dsh/node_groups/mediawiki-installation and /usr/local/dsh/node_groups/apaches, wasn't working.

October 15

  • 16:20 mark: Clematis is now running my experimental Squid RPM, with epoll support and my HTCP patch. It's not pooled yet, because I want to discuss and test inclusion of other patches first...
  • 14:45 mark: Depooled clematis as squid, because I want to use it for testing my new Squid RPM.
  • 09:59 Tim: set up srv31 as an NFS server, exporting its /var/static directory; mounted it on benet
  • 07:50 brion: noticed minor bug in PHP on AMD64
  • 03:30 Tim: Set up ntpd on amaryllis, dryas, henbane, yf1000-1004, by copying configuration from yf1007, and running chkconfig ntpd on;/etc/init.d/ntpd start

October 13

  • 23:54 ævar: Kate fixed the tingxi issue, woo.
  • 23:52 ævar: I can't ssh to tingxi (10.0.0.12) which means language/LanguagePt.php is out of sync, nothing severe, but it will cause interface mismatches for pt*.wiki*
  • 23:22 kate: reinstalled vandale and added it as a squid since it's not doing anything else
  • 22:46 kate: reinstalled clematis and put it in squid pool
  • 21:30 brion: fiddling with albert's MaxClients again
  • 17:40 mark: Turned off log_mime_hdrs on all squids, as we're not using it anymore
  • 17:00 mark: Added fuchsia as a squid in knams
  • 15:00 mark: Clematis's disk has been replaced and should be fixed. Needs a reinstall...
  • 14:00 mark: Implemented and deployed a new udpmcast daemon with forwarding rules on amaryllis and larousse. This should solve our purging problems with wikis on separate clusters.
  • 13:00 midom: srv34,srv33,srv32 joined ExternalStore service as cluster3
  • 07:00ish brion: holbach back in business.
  • 06:28 brion: disabled Special:Makesysop steward bits until I get the database problem resolved. Still poking at holbach, skipping the enwiki bits.
  • 06:10 brion: took holbach out of rotation; replication broke with what looks like a steward bot application

October 12

  • 23:00ish brion: trying to restart dump process, because some idiot canceled them
  • 22:30ish brion: updated IRC channels in squid error page
  • 21:43 ævar: the code had some discrepancy because some of it had been cvs up-ed recently and some had not. Ran scap and resolved a live hack in includes/SkinTemplate.php; whoever wrote it might want to take a look.
  • 21:15 jamesday: started gzip of first 50/50GB/25 days of samuel binary logs, had 41GB free. About 2GB/day used.
  • 08:40 brion: updated internalwiki on lucene search index, restarted lucene daemons. had to clear out old logs from maurus, out of disk space.
  • 07:22 brion: cleaning up dupe and missing site_stats rows; removed dupe 'warwiki' entry in all.dblist
  • 05:37 ævar: Installing the CrossNamespaceLinks specialpage extension.
  • 04:05 Tim: henbane and dryas were reporting high lag times, probably due to the low replication rate (it's only replicating commons and ko). The database was locked automatically when the lag was more than 30 seconds, which was usually. I increased the maximum lag to 6 hours.
  • 02:20 Solar: Initialized raid and reinstalled OS on webster. Only eth0 is plugged in and is private.
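
The 04:05 entry above describes a lag threshold: the database was locked whenever replication lag exceeded 30 seconds, and the maximum was raised to 6 hours. As a rough illustration only (the function and constant names are invented here, not MediaWiki's actual settings), the check amounts to:

```python
OLD_MAX_LAG_SECONDS = 30
NEW_MAX_LAG_SECONDS = 6 * 60 * 60  # 6 hours expressed in seconds


def is_locked_for_lag(lag_seconds: float, max_lag_seconds: float) -> bool:
    """Return True when the reported lag exceeds the configured maximum."""
    return lag_seconds > max_lag_seconds
```

With the old limit, a slave reporting 45 seconds of lag would lock the database; with the new limit it would not.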

October 11

  • 22:00ish brion: changed project ns for plwikisource
  • 19:55 kate: changed root password of zedler, someone who wants it should ask me or elian
  • 17:45 Tim: When they got more load, all three pound instances started using large amounts of memory. Enough excitement for one night, switched back to perlbal.
  • 17:30 Tim: we were having oscillating load between the three pound hosts, so I switched squid to round-robin, and cut the perlbals out of the list at the same time.
  • 16:56 Tim: pound was reaching its fd limit during high concurrency ab testing, raising it to 100000 seems to have fixed it
  • 16:08 Tim: brought dalembert, friedrich and harris into service as pound servers
  • 09:44 brion: briefly stopped mailman to edit vereinde-l archives to remove improperly forwarded email
  • 02:30 Tim: moved static.wikipedia.org to srv31, proxied via the squids.
  • 01:55 Tim: did apache-restart-all to fix high memory usage

October 10

  • 18:14 kate: fs corruption on srv10 again, moved its IPs elsewhere
  • 14:41 Tim: wrote some new tools to allow srv31 to restrict its copying from albert to times of low NFS server load (<1300 req/s)
  • 08:45 Tim: Installed ganglia on srv32-35 (did srv31 earlier)
  • 06:29 Tim: Copying HTML dumps to srv31
  • 03:41 Tim: Added new ganglia metric "nfs_server_calls" to albert and zwinger. It's a perl script, /usr/local/bin/nfs-gmetric
  • 01:32 ævar: Reverted my changes in CVS; cvs up-ed, and synced the affected files, just in case.
  • 01:07 ævar: Checked all the apaches for Language::linkPrefix() and it turns out they all had it (see /home/avar/report (1 = has the function; 0 = does not have the function))
  • 00:46 ævar: Tried syncing again, same error, spooky, off to manually check the apaches.
  • 00:30 ævar: cvs up and scap broke the wiki, which should not have happened but did for some reason. The error was: Call to undefined function: linkprefix() in /usr/local/apache/common-local/php-1.5/includes/Parser.php on line 1232, but function linkPrefix was defined in the Language class, and no problems were reported with syncing. Applied a live hack to Parser.php to fix the issue; investigating.
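
The 14:41 entry above restricts copying from albert to periods when the NFS server handles fewer than 1300 requests per second. The actual tools are not shown in this log; this is a hedged sketch of one way such a gate could work, assuming the load is derived from two samples of a cumulative call counter. All names below are hypothetical.

```python
NFS_LOAD_THRESHOLD = 1300.0  # req/s, the figure quoted in the log entry


def calls_per_second(prev_calls: int, curr_calls: int,
                     interval_seconds: float) -> float:
    """Turn two samples of a cumulative call counter into a rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_calls - prev_calls) / interval_seconds


def may_copy(rate: float, threshold: float = NFS_LOAD_THRESHOLD) -> bool:
    """Copying is allowed only while the rate is strictly below the threshold."""
    return rate < threshold
```

A copy loop would sample the counter, sleep for the interval, sample again, and pause whenever may_copy returns False.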

October 9

  • 21:51 midom: killed backups. haha. unkilled site. though adler is good boy, lots of RAM does not help with backups. serial reads do.
  • 19:02 kate: disk failed on clematis. added mint as squid.
  • 11:09 Tim: changed master for ko
  • 08:46 Tim: copied ko upload directory to amaryllis. Set up dryas, with chained replication from henbane.
  • 07:40 ævar: Installed extensions/Renameuser/Renameuserlog.php
  • 05:10 Tim: restored 245 and 248 to srv10
  • 04:28 Tim: removed bogus entries from zwinger's /etc/exports, with an RCS backup
  • 03:13 brion: starting weekly data dumps on benet, srv35, srv36; pulling from adler for primary data. (live; so table dumps will be slightly inconsistent. xml dumps are self-consistent internally.)
  • 02:50 Tim: srv10 down, moved virtual IPs: 245 to srv5, 248 to will and 210 to srv7
  • 02:30 jeronim: added missing mount points /mnt/upload and /mnt/wikipedia on humboldt and some machines in the apaches and mediawiki-installation groups
  • 02:23 Tim: Changed squid configuration to have no-query for albert. This might reduce the latency some people were experiencing when requesting images.

October 8

  • 22:52 ævar: Ran a script (/home/avar/3631.sh) to confirm that bug 3631 wasn't exploited on any wiki besides enwiki, it wasn't.
  • 22:29 ævar: de-sysopped myself on enwiki
  • 22:27 ævar: sysopped myself on enwiki and banned the users exploiting bug 3631
  • 08:22 brion: removed 'srv9' dupe entry in zwinger exports; for some reason srv9 couldn't mount with that in place (the ip is also in)

October 7

  • 15:28 jeronim: turned off and disabled swap on knams squids (clematis hawthorn iris lily mayflower ragweed sage)
  • 13:34 ævar: created ilowiki
  • 11:49 jeronim: chmod/chowned zwinger:/usr/local/etc/powerdns/langlist-cnames to 664 root:wikidev on avar's request
  • 06:41 ævar: / on zwinger filled up (reported by jeronim). I deleted an old log I didn't need anymore, freeing 2GB; more stuff still needs to be cleaned out.
  • 02:49 ævar: Added a live hack to Special:Export, a notice explaining that exporting of full histories is disabled, it can't be translated, boo hoo;)
  • 00:05 brion: changed pawiki sitename/meta namespace to 'ਵਿਕਿਪੀਡਿਆ'

October 6

  • 07:20 midom: re-enabled steward interface
  • yesterday kate: installed solaris on vandale because mysql wanted to test something. finished with it now, should have linux put back.

October 5

  • 18:57 brion: image server has been very slow lately. fixed a broken thumb file or two which had a subdirectory in the way (one on the wikipedia portal was being requested *very* often, producing extra redirect load)
  • 05:25 Solar: Rebooted srv26, bumped temp. threshold to 80C. Will investigate further.
  • 05:20 Solar: webster is back up for now, but will fail again. Will call SM to get replacement drives.
  • 02:56 Tim: srv24 in rotation as part of cluster2. Restarted compressOld.
  • 01:00 Tim: Setting up srv24 as an external storage server, to replace srv26 which is down again. Stopped compressOld and stopped slave on srv25 for data directory copy.

October 4

  • 23:50 Tim: Started compressOld.php, started mysqld on srv26.
  • 23:30 Tim: restarted evil resource-eating program (with kate's permission)
  • 20:02 kate: stopped evil resource-eating tim program on albert started ~ 06:20.
  • 14:40 mark: Increased DB load on samuel in an attempt to solve DB availability problems
  • 13:13 Webster broke.
  • 06:22 Tim: HTML dump post-process running on albert. It'll spend most of its time in sed, with a perl controlling script.
  • 05:13 Tim: static HTML dump of English Wikipedia is pretty much finished. I'm currently running a huge find command on albert, to get a list of files to post-process.
  • 00:20 Solar: Uploaded pictures. Take a look at User:Solar

October 3

  • 23:07 brion: srv28 shutdown broke dewiki and enwiki dumps, have to restart them. non-wikipedias finished before this.
  • 19:45 Solar: srv11 and srv28 moved to new racks for power distribution requirements.
  • 03:30 jeronim: pmtpa squids were mostly running with max FDs of 1024 and starving, so rebuilt them with limit of 8192 and restarted

October 2

  • 21:41 brion: taking srv35 out of apache loop to run additional dump processing
  • 13:10 brion: running wikipedia backups from bacon via srv36, nonwikipedia backups from bacon via benet
  • 12:57 brion: replication halted on bacon due to missing tables on the new wikis (napwiki, warwiki etc) -- this will need to get fixed. in the meantime doing dumps from other wikis ...
  • 09:30 brion: srv31-35 in apache service (in perlbal list)
  • 08:45 jeronim: srv31-35 ready for apache deployment
  • 07:40 Tim: fixed exif bug (http://bugs.php.net/bug.php?id=34704) and deployed the updated tree on all florida apaches
  • 06:30 brion: running cleanupTitles.php on various wikis
  • 00:10 Tim: Running fixSlaveDesync.php on en.

October 1

  • 21:07 Tim: Told dalembert to stop echoing its syslog spam to zwinger and larousse. Apparently temperature warnings were appearing in terminals on larousse.
  • 19:40 Tim: Added Internode proxies to the trusted XFF list
  • 11:00 brion: bacon and adler catching up last couple hours' data
  • 08:30 brion: stopping bacon, adler to copy current data over to bacon
  • 08:15 brion: continued replication catchup on bacon
  • 08:10 brion: stopped backups; benet's out of space (going to do cleanup) and I'm testing an improved backup dump script that eliminates the overhead of mwdumper on the initial dump-split-compress job.
  • 08:07 Tim: re-enabled Special:Makesysop, minus steward interface

September 30

  • 19:00-20:00 mark: Deployed the fixed HTCP-CLR patch to all squids, and restarted them
  • 19:18 ævar: disabled PageCSS because of potential XSS issues.
  • 16:06 ævar: Installed the PageCSS extension on the cluster for per-page CSS.
  • 13:12 Tim: installed apache, php etc. on dalembert, by modifying /home/wikipedia/deployment/apache/prepare-host until it kind of worked. Not sure if it's all set up right, but it's probably good enough for dumpHTML, which is what I'm using it for.
  • 12:30 Tim: installed gmond on various reinstalled machines
  • 07:22 midom: adler in service
  • 01:25 brion: did some scripted despamming crosswiki (some deleted pages by '127.0.0.1'...)
  • Solar: Replaced ram in srv42

September 29

  • 19:50 mark: Fixed a memleak in my HTCP CLR squid patch, and testing it on clematis. If it works well, I will deploy it to all other squids...
  • 17:52 Tim: made some more tweaks to http://mail.wikipedia.org/index.html . Now it displays properly in IE, and it works with small screens
  • 17:12 Tim: Returned text on http://mail.wikipedia.org/ to a comfortably readable size. Apologies to optometrists everywhere for the reduced pay cheque.
  • 07:25 brion: Ran initStats on warwiki, napwiki, ladwiki.
  • 05:15ish brion: ntp setup on ariel
  • 05:00 jeronim: clean fc3 on ariel; it has had a drive swapped and is hopefully not faulty now
  • 04:30 Solar: srv33, srv34, and srv35 have ip's and are ready for service. srv32 and srv31 are pending a bomis server move

September 28

  • - jeronim: srv49, alrazi, diderot, hypatia, avicenna, goeje, harris, dalembert, humboldt, kluge, friedrich all freshly set up with fc3 - but no ntp setup, and no apache. alrazi's old host keys lost.
  • 19:00 jeronim: on zwinger, moved squid errors directory and sync-errors back into /h/w/conf/squid from /h/w/conf/old-squid, and updated sync-errors to also sync to lopar, yaseo, and knams. Updated all squids to use shiny new error page from mark_ryan.
  • 10:00 mark: Added ragweed back to the knams squid pool because of overload on the other squids
  • 09:10 brion: dewiki backup running on benet while others continue (from bacon 20050921)
  • 07:20 brion: backups switched to use bzip2 for xml dumps; 'articles' instead of 'public' name change; image dumps disabled
  • 06:52 brion: starting bzip2 filter/output of 20050924 enwiki dump on srv36
  • 01:00 Solar: alrazi avicenna diderot friedrich goeje harris hypatia humboldt kluge srv42 srv49 are back on the netgear switch

September 27

  • - jeronim: dhcp still not working so I've asked Kyle to put most fc2 boxes on a different switch
  • 23:53 jeronim: commented out icpagent in /etc/rc.local on dalembert in case it's rebooted
  • 22:10 mark: The new switch appears to be Fast Ethernet only! It's accessible on 10.0.1.1. I configured some parts of it to make it somewhat usable: all ports in access mode, vlan 2.
  • 20:15 midom: disabled steward interface, needs rewriting to select databases instead of specifying their names directly in queries -- breaks replication
  • 18:00 midom: ariel gone down:
LSI MegaRAID SCSI BIOS           Version  G112 May 20, 2003
Copyright(c) LSI Logic Corp.
HA -0 (Bus 3 Dev 1) MegaRAID SCSI 320-2
Standard FW 1L26 DRAM=64MB (SDRAM)
Battery module is present on adapter
Following SCSI ID's are not responding
  Channel-2: 0, 1, 2
1 Logical Drives found on the host adapter.
1 Logical Drive(s) Failed
1 Logical Drive(s) handled by BIOS
Press <Ctrl><M> or <Enter> to Run Configuration Utility
  • 08:53 Tim: Stopped icpagent on dalembert for now, pending examination of pound's problems
  • 07:11 brion: added wikipedia.nl alias in powerdns in prep for changing master servers for that domain (jason has that info)
  • 02:54 brion: new8 machines (except srv50) don't have ntp working, still. punching at it again (were up to about 15 seconds slow)
    • copied /etc/ntp.conf and /etc/ntp/step-tickers from srv50 to the others in the group, ran /etc/init.d/ntpd start
  • 02:40 brion: starting test cur-only dewiki dump to double-check dump processing bugs while other backups continue
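
The 20:15 entry above says the steward interface breaks replication by naming databases directly in queries instead of selecting them. The log does not say how the slaves are filtered; the sketch below assumes they filter statements by the default (selected) database, which is how MySQL's statement-based replicate-do-db option behaves. The function name and the example database sets are illustrative only.

```python
def slave_applies(default_db: str, replicated_dbs: set) -> bool:
    """Model of a filter keyed on the statement's default database.

    Assumption: the slave looks only at the database selected with USE,
    not at database names written inside the query text.
    """
    return default_db in replicated_dbs


# A hypothetical slave that carries every wiki except enwiki.
non_enwiki_slave = {"metawiki", "frwiki", "dewiki"}

# Write issued while metawiki is selected, naming enwiki inside the query:
# the slave applies it, then fails because it has no enwiki tables.
applied_cross_db = slave_applies("metawiki", non_enwiki_slave)

# Same write issued after selecting enwiki first: the slave skips it cleanly.
applied_after_use = slave_applies("enwiki", non_enwiki_slave)
```

Under that assumption, selecting the target database before writing lets each slave decide correctly whether the statement concerns it.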

September 26

  • 21:59 brion: killed commons image dump again; too slow, too big. need to rework that...
  • 19:30 jeronim: turned off swap and commented it out in /etc/fstab on all pmtpa squids after kate noticed srv7 was swapping and restarted its squid
  • 17:30 jeluf: skipped some insert statements to enwiki on the slaves not replicating enwiki. Steward tool running on metawiki tries to write to enwiki and mysql replicates these transactions.
  • 09:30 brion: stopped bacon again, running backup of everything but enwiki/dewiki (backdated to 20050921) from bacon
  • 09:22 brion: added refresh-dblist script to update the split .dblist files in /h/w/c
  • 08:46 brion: started replication catchup on bacon (about 5 days behind)
  • 08:40 brion: restarted mwdumper on the enwiki dump, which had broken with a funky file locking problem
  • 07:37 Tim: deleted srv27 binlogs 020-026, the rest are needed for srv26 when it starts working again.
  • 07:37 brion: locking fiwiktionary for case conversion
  • 04:49 brion: turned off srv41, srv26 apaches due to segfaults; turned off exif log for commons due to giant >2gb log file
  • 01:04 brion: created wikiro-l, wikimediaro-l lists, iulianu as list admin

September 25

  • 13:45 hashar: made nap language inherit from italian language instead of english (rebuildMessages.php nap --update).
  • 13:10 hashar: created nap, war & lad wikipedia using the updated howto Add a language. Thanks Tim for the technical assistance.
  • 06:17 kate: copying bacon's mysql data to zedler

September 24

  • 23:44 brion: changed squid config to use bacon and holbach's .wikimedia.org names instead of .pmtpa.wmnet on kate's advice
  • 23:31 brion: pound didn't seem to be working; 503 errors, other problems, was unkillable without -9. unable to run site on holbach's perlbal; squids couldn't find it? restarted pound and icpagent on dalembert, working now
  • 23:23 brion: tried restarting pound. (there's also weird cyclic load between rose and anthony every few minutes)
  • 23:15 brion: slow site performance reported; ganglia showed unusually high load on srv50, srv37, dalembert. Stopped dumpHTML on dalembert (pound machine), restarted apache on 50 & 37
  • 22:12 brion: srv8 failed on squid restart due to broken symlink to config file. added srv8 to pmtpa squid lists for new config list and relinked its config file
  • 20:40 midom: webster is up with non-enwiki dbset, ariel is up with enwiki only.
  • 19:36 kate: zedler is up with mysql installed; waiting for replication to be sorted out somehow
  • 18:20 Tim: deployed new squid configuration generator
  • 16:11 jeronim: diderot, harris, alrazi, avicenna out
  • 13:30 jeronim: kluge & friedrich out too, for reinstall
  • 12:12 jeronim: took goeje out of mediawiki-installation dsh group; putting fc3 on it
  • Tim: stopped icpagent on bacon. Load balancers are now holbach (perlbal) and dalembert (pound)
  • 07:55 brion: started enwiki xml dump with five parallel readers; experimental (on srv36 pulling from samuel)
  • 07:04 brion: trying to fix ntp again on humboldt and new8 machines
  • 02:14 brion: disabled Special:Undelete toplevel list; code needs rewriting or just dump it for Special:Log (added link as temp hack)

September 23

  • 21:17 brion: added /^Lynx/ to unicode browser blacklist
  • 15:37 Tim: Deployed pound/icpagent on dalembert. It is currently running alongside perlbal instances on bacon and holbach.

September 22

  • 23:18 brion: turning off capitallinks on tawiktionary
  • 18:48 brion: updated pmtpa squid error messages to remove obsolete openfacts and wikisearch references. master copies now in /h/w/conf/squid/errors
  • 18:21 brion: wikinews backup done. enwiki backup halted due to some nfs/large file problem. investigating
  • 11:58 Tim: brought srv26 back into service
  • 11:28 Tim: started deleting thumbnails still in their obsolete locations, 180,000 to delete.
  • 09:10 brion: starting *wikinews backups on srv36 pulling from bacon. [installed mwdumper]
  • 08:25 brion: running enwiki backup on srv36 pulling from a halted bacon, saving on benet
  • 08:08 brion: taking srv36 off perlbal nodelist to try running backups with it
  • 07:26 brion: adding new machines to perlbal; ready for service... hopefully
  • 07:00 Tim: restarted dumpHTML.php, I had stopped it for a while due to high DB load. I'll stop it again when we get closer to peak time.
  • 06:45 brion: running setup-apache script on remaining new8 machines (srv36-41, srv42-48)

September 21

  • 23:54 brion: recompiling librsvg with correction to security fix; it had accidentally disabled data: urls as well
  • 20:35 brion: set european tz for nlwikimedia
  • 18:46 Solar: webster and ariel have rebuilt raid and FC3 installed although they do not have IPs. They are accessible via console.
  • 17:00 midom: disabled all bloat in albert's http configuration (mod_perl, php, jk, ssl, ...), which freed lots of memory and allows more effective caching of directory trees and file metadata. And yes, it eased the performance issues a bit (uh oh, yet another image server overload).
  • 08:59 brion: disabled wikidiff PHP extension sitewide; there are numerous reports of bad diff output in some cases, and dammit alleges it may be crashy or futex-y. InitialiseSettings.php is set to enable it in the wiki if it's on in php.ini and ignore it if not.
  • 07:50 brion: tim is doing ongoing debugging on srv50 trying to identify source of segfaults
  • 07:00 brion: installed patch for apache rewrite bug on amd64, but still getting segfaults on srv50
  • 06:08 brion: clocks are wrong on new8 boxen; working on correcting
  • 06:00 brion: setting up APC instead of Turck on srv50 experimentally
  • 00:35 brion: srv50 back out; some apache child process segfaults, which don't look too good
  • 00:34 brion: srv50 back in
  • 00:11 brion: srv50 out for further adjustments (tidy, proctitle)
  • 00:09 brion: putting srv50 into apache rotation to test it out before installing all others

September 20

  • 23:20 ævar: changed wgSitename to Vichipedie on furwiki
  • 23:14 ævar: ran php namespaceDupes.php --fix --suffix=/broken furwiki to fix namespaces on furry wiki
  • 22:39 ævar: changed $wgMetaNamespace on furwiki from Wikipedia to Vichipedie.
  • 20:14 brion: reverted Parser.php change temporarily due to reports of massive template breakage
  • 19:45 brion: fixed internal wiki (whoops, typo in config change last night)
  • 19:13 brion: removed bogus entries from master robots.txt ("/?", "/wiki?", "/wiki/?")
  • 14:16 Tim: Disabled context display for full text search results as an emergency optimisation measure. It was taking more than its fair share of our precious DB time. $wgDisableSearchContext in CommonSettings.php.
    • Note: This caused a large reduction in CPU usage on the master DB server, from 100% down to 70%. In the future, it might be worthwhile to ensure text for context display is loaded from the slaves.
  • 10:30 brion: doing experimental software installs on srv50 [amd64]
  • 10:08 brion: Added sync-apache script to rsync the apache config files from zwinger to pmtpa apaches. Don't forget to use it after making changes and before restarting apaches!
  • 09:30 brion: moving apache configs a) into /h/w/conf/httpd subdir, and b) into local copies on each server which will be rsync'd
  • 08:05 brion: new apache configs on all
  • 07:23 brion: fixed up apache configs on *.wikimedia.org
  • 07:00 jeronim: added acpi=off panic=5 to adler's kernel params and rebooted, because apparently there are some ACPI problems, and so that it reboots on kernel panic instead of freezing
  • 06:53 brion: cleaning up apache config files; replacing ampescape rewrite usage with aliases to remove our patch dependency (tested on wikimediafoundation.org)
  • 06:40 jeronim: installed same kernel on adler as is on samuel and set it as default; also samuel's default kernel was changed to a newer one (by yum?) in /etc/grub.conf, so changed it back to match the current kernel
  • 05:30 brion: put suda back in rotation; toned down its share of enwiki hits a bit
  • 05:02 brion: adler crashed again at some point
  • 02:36 brion: adler was rebooted by colo; running innodb recovery
  • 01:58 brion: adler is down, seems to have crashed (panic bits on scs output). taking out of rotation too
  • 01:45 brion: lots of delays trying to open suda from wiki; taking out of db rotation
  • 01:11 brion: halted backup; benet ran out of space. en_text_table.gz is much larger than expected (49gb), perhaps external storage has not been used correctly as expected? will remove file and continue.

September 19

  • 22:10 ævar: uninstalled nogomatch on enwiki, who's going to sort through all that gibberish data? Not me!
  • 21:07 brion: rebooting new8 machines to make sure they're running current kernel
  • 21:02 brion: new8 group status: srv47 online but borked; 31-35 and 49 offline. others to be set up as apaches
  • 20:46 brion: running special pages update on frwiki by request... will update others on cronjob if there's not already one?
  • 19:40 mark: Replaced udpmcast.py by a properly daemonized version. Set it up at knams to forward to a multicast group instead of all unicast IPs forwarded by larousse...
  • 18:45 mark: Removed miss_access line from knams squids to solve the cache peer errors. Repeat at yaseo if it works...
  • 13:49 ævar: Installed the nogomatch extension experimentally on enwiki.
  • 08:00 Tim: Removed all NFS mounts from srv1's fstab. Set up a simple /home directory on its local hard drive.
  • 06:06 kate: reverted root prompt on zwinger so it's not invisible on a white background
  • 04:47 James: stop slave on bacon while dumper is running. Slave will restart when done.
  • 02:45 Tim: changed root prompt on zwinger. Started sync-to-seoul, with -u option this time so we don't accidentally overwrite stuff
  • 01:50 brion: seems to be mostly back up at this point. boot seemed to be aided by disabling named and letting it lookup from albert
  • 01:36 brion: zwinger boot still going on; nfs init is *very* slow doing the exportfs -r; seems to be slow dns lookups
  • 00:38 brion: jeronim did this: [root@zwinger srv38]# reboot - unfortunately it was not srv38, but zwinger.
  • 00:05 brion: mounted /home on srv1; couldn't login, caused sync-file failures
  • 00:05 brion: enabled Nuke extension on meta & mediawiki.org
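
The 02:45 entry above mentions running sync-to-seoul with -u so that nothing gets overwritten by accident. Assuming that flag behaves like rsync's --update (skip files that are newer on the receiving side), the rule can be sketched roughly as below. `should_copy` and `sync_update` are hypothetical helpers written for illustration, not part of any script named in this log, and the sketch only handles the top level of a directory.

```python
import os
import shutil


def should_copy(src_path, dst_path):
    """Return True unless the destination exists and is newer than the source."""
    if not os.path.exists(dst_path):
        return True
    return os.path.getmtime(dst_path) <= os.path.getmtime(src_path)


def sync_update(src_dir, dst_dir):
    """Copy regular files from src_dir to dst_dir, skipping newer destination files.

    Returns the sorted names of the files that were copied. Only the top level
    of src_dir is handled, to keep the sketch short.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, name)
        dst_path = os.path.join(dst_dir, name)
        if os.path.isfile(src_path) and should_copy(src_path, dst_path):
            shutil.copy2(src_path, dst_path)  # copy2 preserves the modification time
            copied.append(name)
    return copied
```

A file edited more recently on the destination is left alone, which is the accident the -u option is meant to prevent.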

September 18

  • 14:00 jeronim: rebooted zwinger by mistake and it needed a manual reset by colo staff to come back up. Site was offline for about an hour.
  • 04:34 brion: vandale kernel panic, frozen
  • 04:30 Solar: srv36-srv50 are racked, have IPs, and are ready for production
  • 03:10 Tim: moved compressOld.php to dalembert (where dumpHTML.php has been running), on complaints that it was causing problems on zwinger.

September 17

  • 22:17 brion: running unique-ip counter on fuchsia with saved logs (into uniqueip table on vandale)
  • 22:02 brion: disabled disused info-de-l list by request of list admins
  • 11:05 brion: ran initStats on all wikisources to initialise those not already set
  • 07:06 brion: canceled upload dump for commons backup due to size and slowness; too big to fit
  • 06:30 jeronim: on larousse, removed fedora netcat and installed from source into /usr/local
  • 04:30 Tim: used ntpdate -u pool.ntp.org to set the times on all the yaseo machines, some were a long way out. Then set all their timezones to UTC. This apparently caused ganglia to think yf1000 and yf1002 were down, fixed by restarting the local gmond.
  • 04:10 Tim: Started replication on henbane
  • 01:10 brion: enabled wikidiff on all wikis. (can be disabled selectively w/ wgUseExternalDiffEngine in InitialiseSettings)
  • Tim: Set up mysql on henbane, made a consistent dump of kowiki and commonswiki using bacon, copied dump to henbane ready to start replication

September 16

  • 22:20 Tim: started mysqld on srv26, it had been off for 12 hours or so. The compression script had been running all that time, srv26 caught up to the master without incident.
  • Colo (Solar):
    • supposedly bart is brought back up
    • borrowed HP switch connected to gi0/4 on the cisco
    • moreri was moved, and is trying to netboot (fails)
    • 10 of the 20 new servers have been racked and wired to the borrowed HP switch, but don't have IPs yet
  • 11:37 brion: updating sitenames on he, el, ru wikisource
  • 11:30 brion: started backup run
  • 03:17 brion: frwiki reimport done
  • 02:47 brion: frwiki reimport started
  • 02:35 brion: jawiki reimport done
  • 01:49 brion: started jawiki reimport
  • 01:33 brion: bacon catching up; suda is fine as it is partial mirror
  • 01:29 brion: took bacon, suda out of rotation for further investigation
  • 01:23 brion: nlwiki open for editing
  • 01:03 brion: reimporting nlwiki on samuel
  • 00:41 brion: nl/fr/ja dumps done (in /var/backup/private/recovery). going to try reimporting soon
  • 00:16 brion: running attachLatest on *wikisource

September 15

  • 23:14 brion: 3 dumps from adler done; doing extra backups from samuel too. setting adler to read-only
  • 22:37 dumping nlwiki, frwiki, jawiki databases from adler onto sql files on benet
  • 22:18 put load back on samuel for enwiki with adler disabled. fr, nl, ja wikipedias are locked while we work this out
  • 22:09 commented out adler from db.php; adler appears to be misconfigured and all kinds of breakage is going on. it's not read-only, and has some revisions that others don't have
  • 21:56 brion: took load off bacon (was 100 load on fr, nl, ja; nl and fr reporting weird editing problems possibly freak lag problems, and it was consistently lagging a few seconds at least)
  • 17:25 mark: Setup IPsec between bacon and vandale. Who wants to setup replication?
  • 16:50 mark: Altered geodns: pointed Malaysia at yaseo, and Israel, Turkey, Cyprus at knams
  • 13:04 Tim: Shutting down apache on dalembert temporarily so that I can use it for HTML dump testing and generation
  • 12:35 Tim: Restarted compressOld.php, it stopped when I shut down bacon to do the copy to adler.
  • 11:30 mark: Restarted some knams squids to increase FDs, changed /etc/rc.local startup script
  • 11:15 mark: Deployed squid on yf1003 and yf1004, and added them to the DNS pool
  • 11:10 mark: Recompiled squid on yaseo to increase filedescriptors to 8192 and restarted all squids with 4096
  • 07:37 brion: running importDumpFixPages.php on wikisources to fix bogus rev_page items
  • 02:30 kate: ariel's down
  • 02:29 brion: recompiling mono 1.1.9 on benet for xml bugfix
  • 00:15 brion: removed humboldt and hypatia from mediawiki-installation node group, neither has port 80 on:
    • humboldt prompts for password, not configured correctly?
    • hypatia shows host key changed; was reinstalled?
  • 00:10 brion: disabled MWSearchUpdater plugin as the daemon is broken; briefly broke the wiki due to bad include_path; need to fix config for MWBlockerHook to make sure the path is right even w/o the lucene include

September 14

  • 21:30 mark: Setup log rotation at yaseo to knams, routed japanese and chinese clients to yaseo squids.
  • 20:30 midom: adler online, bacon catching up
  • 20:15 mark: Deployed squid on yf1001, and routed Korean clients to the Florida squid cluster.
  • 18:15 mark: Deployed squid on yf1000.
  • 18:10 mark: Wrote a YASEO squid deploy script /home/wikipedia/deployment/yaseo-squid/prepare-host (yahoo cluster only, should I put it at florida?) after Tim's apache prepare-host script
  • 17:48 ævar: de-opped myself on ruwiki and stopped my revert bot, the russians hate me even more now.
  • 16:30 mark: Set up a squid on yf1001. Same setup as knams, except it's in /usr/local/squid as in florida. Adapted florida's squid and mediawiki configs accordingly.
  • 13:19 ævar: ran INSERT INTO user_groups VALUES (1165, "sysop"); on ruwiki to make myself temp. sysop to fix the MediaWiki: fsckup.
  • 11:15 brion: halted nlwiki partial temp backup as enough was run to test problem
    • (identified problem as [4])
  • 10:41 brion: running another nlwiki backup to get raw dumpBackup.php output for testing
  • 10:39 brion: halted old backup sequence (at nlwiki, with a mystery breakage in output that needs examining)
  • 10:33 brion: hacking dumpBackup.php to load php_utfnormal.so extension (not yet enabled sitewide)
  • 10:05 brion: running kowikisource and zhwikisource imports on formerly broken parts
  • 08:55 brion: updated messages on jawikisource
  • 08:30ish brion: updated messages on *wikisource
  • 01:30 jeronim: access to yaseo console server should be back hopefully within a few hours - eam is dealing with it

September 14

  • 13:32 Tim: Shut down mysql on bacon, started copying data directory to adler

September 13

  • 23:23 brion: set logo on dewikiquote to commons version
  • 23:ish brion: installing mono 1.1.9 with xml patch on benet to fix future dumps ([5])
  • 17:23 ævar: Logging Exif debug information to /home/wikipedia/logs/exif.log using wgDebugLogGroups.
  • 16:40 jeronim: yf1000 - yf1004 are all set up with reiserfs now. The only yaseo machine not working is yf1013 which is in an unknown state as the console server (konsoler04.krs.yahoo.com (10.11.1.186)) is unreachable.
  • 16:18 Tim: Started moving some text to cluster2, starting with frwiki.

September 12

  • 11:59 brion: killed search update daemon; going to replace this (again) with a more robust queuing system
  • 15:00 or so kate: upgraded perlbal to 1.37
  • 13:24 jeronim/kyle: lots of machines connected to SCS, port labels corrected. The APC has apparently vanished - Kyle couldn't find it.
  • 09:40 brion: installed ICU 3.4 on zwinger and mediawiki-installation from RPMs built from the ICU-provided spec file. Source and binary rpms in /home/wikipedia/src/icu
  • 09:34 brion: fixed misnamed krwikisource -> kowikisource db
  • 8:50 Tim: rebuilt interwiki tables
  • 02:15 brion: replaced old php.ini on zwinger with symlink to the common one. added /usr/local/lib/php back into the default include_path (for PEAR stuff sometimes used)
  • 01:04 brion: blocked leech enciclopedia.ipg.com.br

September 11

  • 22:05 brion: trying batch clears in parallel overloaded zwinger; canceled, running in serial again
  • 21:35 brion: running batch operation to remove bad cached messages
  • 21:00 brion: reconfigured blocker daemon to log to samuel. had to set up permission grant again on samuel
  • 18:19 Tim: finally managed to fix the message problem, except for some erroneous values stored in cache
  • ~18:00 ævar: To get interwiki links working on hrwikisource: sourced the output of maintenance/rebuildInterwiki.php and sourced maintenance/interwiki.sql on all wikis. Some interwiki prefixes appear to have been lost in the process, e.g. bugzilla: (only mediazilla: exists in interwiki.sql). Looks like we need better interwiki update scripts...
    • Don't run interwiki.sql, under any circumstances. Add new prefixes to m:Interwiki map. -- Tim 08:52, 12 Sep 2005 (UTC)
  • 16:05 Tim: switched master to samuel. Adler asks for root pw after reboot due to failed fsck.
  • 15:10 Adler crashed. Tim and JeLuF on the scene, wiki switched to read-only mode
  • 14:59 Tim: Non-default language message caching completely f****d up. Blank messages everywhere
  • 07:10 brion: now using blocker list
  • 07:00 brion: installed limited librsvg on apache cluster, svg back on
  • 15:40 Tim: Installed apache, php, turck and mediawiki on yf1005. Put all required commands in /home/wikipedia/deployment/yaseo-apache/prepare-host. Still needs database, memcached and mediawiki configuration.
  • 05:05 brion: restarted MWUpdateDaemon, hung again at 1gb used memory
  • 02:38 brion: disabled svg for further security work
  • 01:20 brion: reconfiguring wikisource to allow en.wikisource.org to work (hr ja kr sv zh en now imported)
  • 01:09 brion: installed librsvg 2.11.1 on the apaches; it's in /usr/local. (old librsvg versions seemed to muck up text pretty bad)

September 10

  • 22:49 brion: importing wikisource nl ro ru
  • 22:34 ævar: deinstalled the wgDebugLogFile on commonswiki, got enough debug output to see if anything was wrong.
  • --:-- jeronim: yaseo stuff:
    • reinstalled FC4 on yf1000, yf1001, yf1003, yf1004 with reiserfs
    • reinstalled FC4 on dryas & henbane with 10GB ext3 root partition and the bulk of the disk as jfs on /a
    • rsyncing /home, /tftpboot, /root, /var/www, /usr/local, and /etc from amaryllis to dryas in preparation for reinstalling amaryllis with reiserfs. It's a script, /root/amaryllis-rsync.sh, running in a screen on dryas.
  • 14:14 ævar: installed a wgDebugLogFile for commonswiki in /home/wikipedia/logs/commonswiki.log to monitor Exif debug output.
  • 13:26 ævar: ran maintenance/deleteImageMemcached.php on all wikis fixing bug 3410
  • 10:44 brion: cleaning out old mysql data from benet to free up space for current backups (40 days+ out of date, not too useful)
  • 10:00 brion: restored working frame-breakout code (pending cached wikibits.js)
  • 07:58 Tim: moved some ancient rubbish from /home/wikipedia/htdocs to /var/backup/home/wikipedia/htdocs
  • 07:10 brion: running data split for additional wikisource languages
  • 02:40 Tim: Changed names of Seoul machines
  • 02:15 brion: set edit rate limit for new accounts to same as ip rate limit
  • 01:40 brion: installed rsvg (librsvg2) on mediawiki-installation machines, enabled SVG uploads

September 9

  • 06:30 brion: restarted stalled de,en dumps

September 8

  • 19:18 brion: checker daemon running
  • 10:50 brion: setting up vandal checker daemon on larousse
  • 10:42 hashar: enabled subpages for portal (100) and portal discussion (101) on dewiki.
  • 7:45 hashar: added two namespaces for frwiki : 100=>Portail, 101=>Discussion_Portail .

September 7

  • 22:00 jeronim: fixed avar's login problem on servers in the mediawiki-installation group -
    • nscd -i passwd did not work
    • /etc/init.d/nscd restart ; /etc/init.d/sshd restart did solve the problem on each machine except for benet; for benet, problem was finally solved after doing the restarts twice more, then nscd -i passwd, then doing the 2 restarts with a pause in the middle
  • 21:30 jeronim: killed everyone's ssh sessions and sshd on zwinger (sorry)
  • 10:25 midom: After Tim put the memcached patch live, the site's sessions were switched from NFS to memcached.
  • 06:54 brion: killed stalled backup -- memcached send hang for the last day or so. It's continuing w/ dkwiki; will rerun stalled dewiki and enwiki

September 6

  • 19:55 brion: tgwiktionary to lowercase
  • 05:30 brion: set up experimental upload verification hook
  • 04:02 koko: removed firewall

September 5

  • 12:40 brion: set up to shut down search builder daemon every hour (at 47 minutes) to protect against memory leaks in builder; search-update-daemon wrapper script set to auto-restart 5 seconds after shutdown/crash of the daemon
  • 09:05 brion: rebuildMessages.php --update on all wikis to add various new messages
  • 06:09 brion: starting mass lucene updates of pages edited in august
  • 05:18 brion: lucene back-deletions done, reoptimizing build index
  • 01:10 brion: search updater up; running queued deletions
  • 00:45 brion: vincent back in active search rotation
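
The 12:40 entry above describes a wrapper that restarts the search daemon five seconds after it shuts down or crashes. A minimal sketch of that restart pattern follows; it assumes nothing about the real search-update-daemon wrapper. `supervise`, its parameters, and the `max_restarts` cap are hypothetical names, the cap being there only so the example terminates.

```python
import subprocess
import sys
import time


def supervise(command, restart_delay=5.0, max_restarts=None, sleep=time.sleep):
    """Run `command` repeatedly, pausing `restart_delay` seconds after each exit.

    Returns the list of exit codes seen. `max_restarts` is a hypothetical
    safety cap; with None the loop never ends, as a real wrapper usually would.
    """
    exit_codes = []
    while True:
        # Blocks until the child exits, whether cleanly or by crashing.
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes
        sleep(restart_delay)


if __name__ == "__main__":
    # Stand-in for the daemon: a child process that exits at once with status 3.
    child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(child, restart_delay=0.1, max_restarts=2))
```

Paired with an hourly scheduled shutdown, a loop like this bounds how long a leaking process can keep growing without leaving the service down for more than the restart delay.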

September 4

  • 23:55 brion: splitting lucene config to lucene.php. putting coronelli on search, with optimized index
  • 19:30 jeronim: created helpdesk-l
  • 17:20 jeronim: fuchsia does not boot on the latest kernel (see below), but it does boot on the 2.6.11-1.33_FC3smp kernel, so switched it to boot that kernel by default
  • 16:27 mark: Because of cascading incidents in knams, we moved all traffic to florida and lopar via DNS.
  • 14:30 jeronim: fuchsia was dead or very close, so power-cycled it using the IPMI. It is broken:

  • 13:16 Tim: made /home/wikipedia/lib/install.sh ignore x86_64 machines, added a part to clean up rubbish left in /usr/lib, then ran it everywhere with dsh -a -f
  • 04:20 Tim: reinstalling PHP 4.4.0 with exif support. Using php-upgrade-440, which calls the new script /home/wikipedia/lib/install.sh to set up shared libraries in /usr/local/lib.

September 3

  • 18:40 jeronim: removed body of mailman archive messages here and here on yannf's request
  • 06:40 brion: relaunch updated backup script with some of the broken bits fixed.
  • 04:50 Tim: Finished benchmarking PHP 4.4.0, see GCC benchmarking. Now deploying the new binaries, from source tree /home/wikipedia/src/php/php-4.4.0-gcc4
  • sometime brion: added .log to text/plain on benet's lighty

September 2

  • 12:00 brion: ran backup test on aawiki using the new dump splitter and partial new backup script. (script is in ~brion/run-backup.sh if anyone wants to examine it)
  • 07:19 Tim: compiling GCC 4.0.1 on zwinger. It will be installed with a program suffix, so gcc is still the old compiler, and gcc-4.0.1 is the new one. Source directory is /home/wikipedia/src/gcc/gcc-4.0.1, build directory is /home/wikipedia/src/gcc/gcc-4.0.1-build.
  • 06:21 Tim: removing hypatia from perlbal nodelist for an hour or so, for some benchmarking

September 1

  • 07:45 brion: set sitename/meta namespace on mtwiki
  • 07:00 brion: running cleanupTitles.php to rename broken pages. Will be at Special:Prefixindex/Broken/ at each wiki.

August 30

  • 17:30 jeronim: made a robots.txt on larousse (noc/kohl) to disallow some dynamic pages and a few others
  • 16:40 jeronim: created wikimediapl-l

August 29

  • 21:30 brion: blocked wissens-schatz.de for remote loading
  • 17:30 jeluf: anonymized a name in the archive of wikide-l
  • 11:30 brion: running a batch job checking for invalid titles on various wikis (cleanupTitles). shouldn't interfere with anything, making no changes.

August 28

  • 22:15 brion: locking plwiktionary for capitalization change
  • 15:18 hashar: created wikimk-l mailing list.
  • 15:15 mark: Brought mayflower back up. Repaired the filesystems, and rebooted it. It was reporting lines like
Aug 28 04:22:34 mayflower kernel: swap_free: Bad swap file entry 7800007ffffff00f
  • 14:30 mark: Another Kennisnet V-20 went down, this time it was mayflower dying sometime this morning. Depooled it... As it's not critical and we still have SP access, I will have a look at it first.

August 27

  • 00:45 brion: turned on wegge's experimental watchlist bot thingy on dawiki

August 26

  • sometime: lots of data imported on wikisources

August 25

  • 16:02 jeronim: added fc-mirror.wikimedia.org DNS entry for fedora mirror
    • fc-mirror 1H IN CNAME albert
  • 15:40 hashar: created wikials-l mailing list. TODO: delete /h/w/htdocs/mail/.index.html.sw(o|p) (swap files by fire).
  • 19:00 mark: PowerDNS on pascal appeared corrupted. Most probably because of an overlapping zones problem in bindbackend (not bindbackend2). I integrated rev.wikimedia.org into the wikimedia.org zone to avoid that.
  • 16:09 hashar: blacklisted www . izynews . com on florida squids (using acl badbadip src 62.75.174.182/32). Needs to be done on kennisnet and paris cluster too.
  • 11:00 brion: set up https on kohl. (old ssl key files backed up; wasn't using the established password, nobody knew what it might have been)
  • 07:05 brion: rebuilt interwiki tables; using correct interwikis for the new wikisources.
  • 06:51 brion: added sr.wikisource.org
  • 02:02 hashar: updated in HEAD LanguagePt.php from meta. Watch out when synchronising.

August 24

  • 14:04 hashar: disabled lucene search. Daemon runs on maurus but times out / doesn't give any output.
  • 04:00 Jamesday: started nice bzip2 for slow query log and first 72 binary logs on adler to free 40GB of disk. Can archive them on another box later.
    • use avicenna for binlog archives -- Tim 05:53, 25 Aug 2005 (UTC)
  • 00:43 brion: trying out an older version of MWDaemon on vincent to see if memory leak is a new code problem

August 23

August 22

  • 22:12 brion: upped max post size to 75mb on squids; were problems posting large videos to commons (or something)
  • 21:50 brion: renamed presswiki to internalwiki

August 21

  • 22:53 brion: bugzilla up; removed ssl-ticket.wikimedia.org from pascal's apache conf.d dir
  • 22:48 brion: bugzilla.wikimedia.org appears to be offline.
  • 13:30 Tim: reduced lucene load on vincent to 1/4, maybe that will stop it from locking up (which it did again)
  • 13:00 Tim: restarted lucene on vincent, it was closing connections as soon as they were established
  • 06:27 brion: otrs now accessible again on https://ticket.wikimedia.org/ ; now with redirect for the index page! For reference: Apache is in /usr/local/otrs
  • 06:00 brion: trying to start otrs on ragweed. apache configuration appears to be borked.

August 20

  • 10:00 jeluf: finished OTRS transition to ragweed. SpamAssassin setup finished.
  • 09:53 Tim: Switched site to 1.6alpha
  • 08:16 Tim: Applying schema update for 1.6alpha, basically an ALTER TABLE watchlist
  • 01:00 Tim: ran update-special-pages

August 19

  • 23:30 brion: changed postfix 'myhostname' setting from zwinger.wikimedia.org to mail.wikimedia.org, should prevent the mail loop errors reported sending to the full addr
  • 23:00 brion: ran namespace conflict checks for updates on tawiki and gawiki
  • 21:40 brion: updated rebuildInterwiki

August 18

  • 23:30 jeluf: OTRS status: Installed apache/php/perl/postfix/mysql client on ragweed. Using pascal as DB server. Problems with sessions, sessions seem to be mixed up, sometimes I get logged in as presroi, sometimes as JeLuF :-/ Stopped apache for now. Postfix still accepting new tickets.
  • 22:30 mark: Changed DNS CNAME ticket.wikimedia.org to point to ragweed
  • 22:17 brion: disabled account creation throttle on press wiki; this is closed wiki and all accounts are created by an admin
  • 10:00 midom: suda is back again, with enwiki and commonswiki databases
  • 05:00 jeluf: copied OTRS tables to pascal, copied otrs binaries to pascal, configured pascal to serve https. Can access old tickets again. Currently can't send new tickets to otrs. DNS change needs to be done.
  • 00:55 brion: recreated wikimediasr-l list on zwinger

August 17

  • 19:27 brion: fixed bug in db.php that set all database load factors to NULL

August 16

  • 20:15 jeluf: renamed project namespace on cswikibooks to Wikiknihy.
  • 15:30 midom: resumed idle bacon's mysql replication, we might need to do external store migration soon, and bring back suda with smaller dataset.

August 15

  • 21:46 kate: always_bcc on zwinger was set to "quagga" and its mbox was full, so it generated lots of bounce messages. i removed the setting.
  • 12:30 mark: Mint seems to have at least a bad disk, possibly other problems. Sun will look at it. In the meantime, we can *try* to network boot it and recover data.
  • 10:30 jeronim: had a look at mint via the IPMI - tried to power cycle it but it wouldn't switch off. Mark will tell the kennisnet guys about it. There's a dump of the OTRS DB from before the transfer to mint in albert:/root. If mailman is to be put back to zwinger, chapter-l and the new Serbian list will need to be re-created (and maybe some other lists?).
  • 09:00 mark: Mint apparently is fucked, RAID and SP settings were reverted to factory defaults. Trying to do data recovery now. Possibly a power problem?

August 14

  • 19:51 brion: mail config on zwinger broken or funky or otherwise annoying; just leaving it off for now. moved dns for mail back to mint (which is still dead) sighhhh
  • 19:26 brion: moved mail.wikimedia.org back to zwinger due to extended outage on mint. With our limited support contract on knams we can't afford to have this critical service there.
  • 14:30 midom: srv27,srv26,srv25 joined external storage service, waiting for payload
  • 09:30 brion: mint is offline, no ping
  • 00:20 brion: stopped bacon to run backup dump
  • 01:00 jeluf: enabled spamassassin for OTRS on mint (~otrs/.procmailrc)

August 13

  • sometime kate: moved otrs to mint
  • 23:25 brion: added wikimediasr-l aliases to mailman on mint
  • sometime someone: Apparently mail.wikimedia.org has been moved to mint.
  • 10:42 jeronim: set ticket.wikimedia.org to CNAME mint.knams.wikimedia.org. (move of OTRS to mint is in progress)
  • 00:58 Tim: started update-special-pages
  • 00:19 Tim: it happened again so I disabled otrs's crontab. Original crontab is in /opt/otrs/crontab

August 12

  • 23:18-23:30 Tim: An OTRS process on albert (PostMaster.pl) developed a runaway memory leak, causing heavy swapping. This slowed down albert sufficiently to cause the entire apache cluster to lock up with high load. Killed the process at 23:30 and the site soon returned to normal.
  • 09:30 brion: took srv1 out of 'apaches' node group and shut off apache on it. DON'T RUN APACHE ON SRV1

August 11

  • 21:26 Tim: TICK TICK TICK, that's the sound of 58 servers with their clocks ticking in synchrony, maximum offset 80ms.
  • 20:30 Tim: Added the missing restrict line for 10.0.0.200 to ntp.conf on (almost) all machines
  • 19:30 Tim: Synchronised ntp.conf on hypatia, humboldt, rose, anthony, rabanus, diderot and srv1 with /home/config/others/etc/ntp.conf.vlan2 . This made them remotely queryable, for easier debugging in the future, and also switched their preferred server from zwinger to the cisco (in broadcastclient mode).
  • 18:35 Tim: Fixed tingxi's resolv.conf
  • 17:45 mark: Fixed inconsistent favicons on apaches. Older apaches had symlinks to a common (wikipedia) favicon, which got overwritten with the new wikinews favicon by brion. Removed the symlinks, and put the correct favicons in place.
  • 12:20 brion: set up pl.wikimedia.org and press.wikimedia.org (press is locked, and currently has no user accounts. a sysop/bureaucrat will need to be added for it to be used)
  • 07:28 brion: updated wikinews.org favicon

August 9

  • 23:20 mark: Rerouted Europe back to knams, because all sorts of weird problems were occurring. Fixed a typo (pmpta) in DNS. Some nameservers report TTL 0 for some of our DNS records - need to investigate that.
  • 22:20 mark: Moved Squid service IP 207.142.131.246 from overloaded srv10 to srv5. Cleared the ARP entry on the l3 switch.
  • 22:00 mark: Reroute everything from knams to pmtpa directly, because of routing problems
  • 13:35 mark: changed biruni's hostname from biruni.wikimedia.org to biruni
  • 13:30 mark: added avicenna and biruni to node_groups/apaches
  • 13:00 mark: Restarted apaches on avicenna, alrazi and biruni with -DSLOW, and changed startup scripts
  • 08:52 jeronim: blocked 61.48.105.65 spammer IP from all wikis using block-ip-all - so ipblocklist message will speak of "vandalism" instead of "spam"
  • 08:25 jeronim: created chapter-l for mailman on mint

August 8

  • 09:22 kate: enabled greylisting on mail.wm.org
  • 20:54 hashar: readded srv2 (with ip x.x.0.1 ) to the apache pool
  • 18:25 hashar: avicenna & biruni readded. Monitoring error log, #wikipedia and memory.
  • 17:43 brion: added /mnt/upload mounts on avicenna and biruni
  • 17:32 hashar: forgot sync-common on avicenna and biruni :/ I thought scap would do the job ... They are both missing the upload directory.
  • 15:45 brion: stopped apache on avicenna and biruni pending more information on reported errors
  • 15:36 hashar: TODO: biruni hostname seems wrong: /etc/sysconfig/network lists HOSTNAME=biruni.wikimedia.org whereas other servers just get HOSTNAME=zwinger or HOSTNAME=srv30 ...
  • 15:36 hashar: removed srv1 from mediawiki-installation dsh file (as apache is not meant to run on it).
  • 15:24 hashar: brought biruni back into mediawiki-installation pool
  • 15:12 hashar: brought avicenna back into mediawiki-installation pool
  • 14:30 hashar: started apache on srv11.
  • 06:30 kate: moved mailing lists to mint. let's see if it starts sucking less.

August 7

  • 20:50 brion: postfix hung zombified on zwinger, wouldn't restart automatically. had to remove master.pid and restart.
  • 16:25 brion: installed DynamicPageList on wikiquote per [6]
  • 15:50 brion: locked tlhwiki
  • 07:47 brion: added application/ogg as mime type for ogg files on albert
  • 00:59 brion: set localized logo for ptwiktionary

August 3

  • 14:15 mark: Switched over upload.wikimedia.org to lighttpd instead of apache on albert
  • 12:00 brion: added frankfurt city map to wikimania whitelist. whoops!

August 2

  • 15:45 mark: Bound albert's apache to a single IP, instead of INADDR_ANY
  • 09:40 brion: added wildcard subdomains for wiktionary.com redirection

August 1

  • 22:30 all: samuel's disk filled up. Switched master to adler. Re-syncing samuel from suda.
  • 14:50 mark: Put all kennisnet squids back into DNS, updated DNS on pascal

Archives