Server Admin Log/Archive 5

October 31

21:09 brion: added some tor ips from [1] manually to mwblocker.log
12:10 mark: Started squid on will
10:53 Tim: set up hourly apache-restart on yaseo apaches
10:05 sleeeepy-brion: started tarball-of-doom from khaldun->albert for enwiki images (non-thumbs) trickling
09:49 zzzz-brion: restarted last of search servers in tampa with data updated from snapshots
07:25 tired-brion: restarting tarball-of-doom on bacon under trickle so it doesn't slow things down
07:05 wacky-brion: creating giant tarball-of-doom on bacon to snapshot commons files for archive/copy
06:24 scary-brion: lifted restriction on reuploads. commons main files and archives now updated, thumbs seem to work (and copying in updates to maybe save some render time). may need to re-touch permissions at end
05:30 fiendish-brion: mounting bacon's /var/upload2 on /mnt/upload2 on zwinger, apaches
02:42 evil-brion: disabled reupload priv for all users, all wikis, to try to get this image crap over and done with soon. going to migrate live commons files to bacon to try to reduce albert load
02:25 goblin-brion: humboldt set up and running as an apache, in lvs
01:44 ghoul-brion: srv6 also refusing connections, squid stuck, had to kill and restart
01:15 pirate-brion: srv7 refusing connections on port 80, but squid seemed to be stuck (restart complained squid was already running). killed and restarted squid, seems ok now
00:55 ghost-brion: bugs:3838 set localtime to UTC on ixia, lomaria, thistle
00:45 daemon-brion: added 'bugs'/'bugzilla' interwiki prefix on wikitech
00:39 zombie-brion: bugs:3839 installed ntp on humboldt to sync time

October 30

21:40-22:10 Tim: deployed LVS in front of squid at yaseo.
21:59 hashar: created dsh group apaches_yaseo (synced from amaryllis), moved apaches to apaches_pmtpa and set symlink for backward compatibility.
16:00 mark: I moved anthony from internal to external VLAN, gave it ip .233, and wanted to make it a temporary Squid. However, it's giving disk errors, so that might not be such a good idea. Added to datacentre tasks for Kyle to look at.
15:00 mark: Upgraded all yaseo squids to the new squid RPM.
14:40 mark: Upgraded all pmtpa squids (except will, which is running FC2) to the new squid RPM.
14:10 mark: Upgraded all knams squids (except clematis, for comparison) to a new squid RPM, squid-2.5.STABLE12-1wm. This is a somewhat newer upstream Squid version, and also has a cron job added that checks whether Squid is still (supposed to be) running, and restarts it if it's not.
14:10 hashar: fixed a bug with server inventory, was putting larousse data every time.
13:50 mark: Installed NTP on vandale
13:40 mark: Moved LVS back from iris to pascal
13:25 hashar: started the Server inventory bot on wikitech site. Need feedback.
10:29 brion: running lucene wikipedia index rebuilds in pmtpa
09:23 brion: restarted yaseo apaches; extreme slowness in HTTP connection and response time, and some segfaults in logs. Seems better after restart.
06:50 brion: activated lucene search daemon in yaseo. running non-wikipedia index rebuilds in pmtpa
06:00 brion: restarted search daemons in pmtpa with wikisource. building ms/th/ko/ja indexes in yaseo, going to start more rebuilds in pmtpa...
04:10 Tim: installed PHP on yf1007, for some reason it wasn't there
04:05 Tim: Took yf1006, yf1008, yf1010, yf1017 out of rotation, they were segfaulting on form submission, e.g. save and move.
02:45 Tim: Set up mxircecho at yaseo
02:30 brion: setting up yf1017 as search server for yaseo
02:14 Tim: moved jawiki to yaseo
02:00 brion: running lucene index builds from last dumps for *wikisource

--Hashar 06:10, 30 October 2005 (PST)
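
The cron check that mark mentions in the 14:10 entry (restart Squid if it is supposed to be running but is not) could work roughly as follows. This is only a sketch: the three callables are hypothetical stand-ins, and the log does not say how the real RPM detects "supposed to be running" or how it restarts the daemon.

```python
from typing import Callable


def squid_watchdog(
    supposed_to_run: Callable[[], bool],
    is_running: Callable[[], bool],
    restart: Callable[[], None],
) -> str:
    """One cron-style pass: restart only if Squid should be up but is not."""
    if not supposed_to_run():
        return "skip"  # stopped on purpose, so leave it alone
    if is_running():
        return "ok"
    restart()
    return "restarted"


# Usage with stand-in probes (assumed; real ones might look at a flag file and a pid).
actions = []
result = squid_watchdog(lambda: True, lambda: False, lambda: actions.append("restart"))
print(result, actions)
```

Keeping the "supposed to run" probe separate from the "is running" probe is what lets an admin stop Squid deliberately without the cron job bringing it straight back.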

October 29

23:53 hashar: renamed squids dsh file to squids_pmtpa (and put a symlink)
22:35 Tim: moved thwiki and mswiki to yaseo
21:19 hashar: BUG srv24 & srv27 out of apache group (see October 19) but are still in ganglia Apache group.
20:29 Tim: unmounted dead NFS mount srv2:/usr/local/fcache everywhere
20:00 Tim: took webster (and srv9) out of dsh ALL until someone can work out how to set up LDAP
15:41 hashar: made squid error message validate (size is invalid for hr element)
15:25 hashar: some people on irc told me that search on sources wikis doesn't work. Looking at MWDaemon.log, the search indexes do not exist and need to be created. Need some documentation on LuceneSearch.
07:40 brion: starting dump jobs on benet, srv35, srv36
07:06 Tim: Moved benet's increasingly large collection of NFS mount points to /var/backup/public/mnt, with symlinks left behind. They were previously scattered all over the place. There's a bug in lighttpd which requires them to be mounted in the document root. Mounted a directory from bacon, with some image dumps in it.
06:22 brion: fixing up more broken site_stats tables; fixed addwiki.php to use correct row id
04:04 Tim: unmounted dead NFS share /home/wikipedia/backup on bacon
00:51 Tim: Started copy of jawiki's external storage to yaseo.
00:47 Tim: Copied mswiki to yaseo
00:30 Tim: Copied thwiki to yaseo, now replicating. Images still need to be copied.

October 28

22:35 hashar: following samuel trouble 2 days ago, there are still some ghost articles on at least frwiki. Will manually fix that Oct. 29 if I have time.
21:49 Tim: LVS now in service in front of yaseo apaches. yf1010-1017 are now in service as apaches, they were previously idle but with apache installed. yf1018 is LVS, yf1019 is the experimental wikimaps installation and could be considered a spare load balancer.
21:42 Tim: noticed that perlbal was still taking a fair bit of load on bacon. Killed icpagent on bacon, increased icpagent delay time from 1 to 5ms on holbach.
21:03 Tim: Running sync-to-seoul, about to set up LVS on yf1018
18:35 mark: Installed a squid RPM without epoll (otherwise identical) on clematis, to compare memory leak behaviour
17:02 ævar: Turned off automatic capitalization on fowiktionary.
14:40 mark: srv8 Squid had crashed, restarted it. Please pay attention to this until we have a better solution!
06:52 Tim: copying dump of jawiki to henbane:/a/backup .
02:20 Tim: changed the password for wikiuser. DB load glitch experienced due to a migration bug.

October 27

17:16 ævar: disabled uploads on ndswiki [2]
05:43 various: killed evil special pages job that broke wiki

October 26

20:18 midom: rolled forward changes made to samuel on other db nodes.
19:18 mark: Routing problems fixed, switched DNS back.
19:00 mark: Routing problems in florida, but knams can reach out. Sent pmtpa traffic via knams.
04:41 kate: put lomaria and thistle into db service
02:40 kate: took suda out of rotation to turn it into fileserver. ixia is in rotation, thistle & lomaria are waiting for mysql to be set up
11:16 brion: installed easytimeline on wikitech ;)
11:00 brion: paused file copy onto bacon for peak hours

October 25

20:15 mark: restarted srv7's squid. Was crashed at 13:00
10:28 kate: ran post-installation setup on ixia, thistle & lomaria
07:00 brion: paused image copy from albert to bacon for the next few hours
06:00 Solar: Ixia, Thistle, and Lomaria, the three new db's are racked and ready!
05:00 Tim: fixed srv11
04:30 Tim: started HTML dump of all Wikipedias, running in 4 threads on srv31

October 24

20:12 various: fixed pascal
18:34 hashar: updated http://noc.wikimedia.org/ with the new wikidev URL.
12:10 mark: Got knams back up with iris as LVS load balancer, with no NFS mounts that can block it
11:30 Pascal went down
07:02 ævar: created bug, pih, vec, lmo and udm wiki
03:37 kate: uploaded the old images from wikidev
02:58 brion: Moved this site from wp.wikidev.net to http://wikitech.leuksman.com/

October 23

22:57 brion: starting trickle copy from khaldun to bacon
21:58 brion: shutting off bacon's broken mysql; clearing out its disk space and making an upload data copy

October 22

23:51 brion: hacked BoardVote to read from master; the boardvote2005 database is missing from suda and there was a lot of whinging in the database error log about it
23:22 brion: fixed group ownership on php-1.5 on yaseo servers (GlobalFunctions.php was unwritable)
22:25 hashar: zwitter fixed the issue (adding a live hack), merged back my changes.
22:00 hashar: broke site search by cvs updating and syncing extensions/LuceneSearch.php
05:37 brion: samuel caught up, back in rotation
05:18 brion: replication broke on samuel due to "Last_error: Error 'Lock wait timeout exceeded; Try restarting transaction'". Took out of rotation to fix

October 21

23:52 Tim: moved wikimedia button (the one in the footer) to the /images directory, to support offline browsing of the HTML dump
19:01 Tim: Increased tcp_rmem on henbane and yf1010, for faster copying.

October 20

06:13 brion: did '/sbin/ip addr add 145.97.39.155 dev eth0' on pascal; got one port 80 connection to go through to vandale, but others still refused
- routing problems in general; some level3 issue. pmtpa having connection problems, and freenode is splitting
05:56 brion: rr.knams.wikimedia.org (145.97.39.155) does not respond on port 80
05:04 brion: reopened ssh tunnel from yaseo to pmtpa master db and restarted replication on henbane. commons copy is catching up; hopefully kowiki will remain working
01:53 brion: investigating reports that kowiki is COMPLETELY BROKEN DUE TO DATABASE LOCK since 10 hours ago
23:59 mark: Florida squids all seem to run out of memory after a few days... memleak. Will have to investigate.
14:52 kate: setup LVS at knams, on pascal.. no failover yet
11:20 mark: mkfs'd /dev/sdb1 on srv10. Tried to rm -rf /var/spool/squid but this failed, probably due to a corrupted filesystem. Probably needs a reinstall/thorough hw check, but in the meantime, is running with 1 squid service ip and 1 cache_dir.
09:40 future-brion: data dumps scheduled to start on benet, srv35, srv36 pulling from samuel
- (This is the db-friendly dump with (most?) bugs fixed. May have some extra newlines in old text with \r\n in the raw database, will continue tracking down libxml2 problems later so future dumps will be clean.)

October 19

16:00 mark: Installed the squid RPM on all yaseo squids. Updated hosts-lists.php...
14:15 mark: Squid on will had crashed; restarted it.
06:44 brion: running dump tests on srv35 to confirm that bugs are fixed
03:09 ævar: Translated /home/wikipedia/conf/squid/errors/English/error_utf8.htm into Icelandic, someone with root access might need to run /home/wikipedia/conf/squid/deploy, might.

No, they need to be put in the next version of the squid RPM, which can then be deployed... -- mark

01:18 brion: turned apache back off on srv24, srv27 and took them out of apache nodegroup, as avar claims tim said they shouldn't be apaches. [they seem to be memcached and external storage]
01:13 brion: turned srv24, srv27 back on; several gigs have appeared and recopy of settings file succeeded
01:00 brion: turned off apache on srv24, srv27: out of disk space
00:44 brion: added Wikiportal, Wikiproyecto namespaces on eswiki

October 18

11:40 Tim: dewiki and enwiki have been defragmented, unused columns and indexes have been removed. Now starting compression of jawiki.
08:04 Tim: compressOld.php is finished with en, I've now taken adler out of rotation to defragment tables. Running null alter table on dewiki.text first.

October 17

23:56 brion: added recently set up machines to mediawiki-installation nodegroup. (THEY WERE ALREADY IN APACHES GROUP. NOTHING SHOULD BE IN APACHES THAT'S NOT IN MEDIAWIKI-INSTALLATION, EVAR)
22:28 brion: set up gmond on srv49, appears in ganglia now
21:00 mark: Danny says that the 3 new DB servers have been delivered, and he dispatched Kyle for their physical installation tomorrow.
18:15 brion: stopped dumps due to confirmed bug[3]. srv35 and srv36 available for apache for now (added back to node group)
16:20 mark: Remounted /a on srv6 and clematis with the reiserfs nolog option, to disable journaling. Saves disk writes, just mkfs it when it crashes...
13:50 mark: Deployed the new squid RPM on all knams squids
11:40 mark: Deployed the new squid RPM on all Florida squids, except will, which is still running FC1. Can we please reinstall will?
08:47 brion: rebuilt apaches now online: alrazi avicenna friedrich goeje harris hypatia kluge diderot srv49
08:40 brion: enwiki full dump is being run semimanually with the prefetch on an older version. will want to manually fix up links when it's done
07:00ish brion: trying apache setup on alrazi; will mass-run on other machines soonish
01:21 brion: current state of dump runs:
- srv36: enwiki
- srv35: all other wikipedias
- benet: non-wikipedias
- They're set to pull table dumps and page+rev from adler, but should only touch occasional bits of text.
00:55 brion: took srv35 and srv36 out of apaches node_group so nobody starts Apache on them by accident. :D
00:30 brion: setting up for database dumps using srv35 and srv36. Domas, these should hit the dbs a lot less so please try not to kill them too hard! Thanks.

October 16

21:30 midom: added ariel into enwiki service, made from fresh dump with fresh ibdata!!!!!
17:20 Tim and Domas: deployed LVS-DR to load balance between the squids and the apaches
- Halved median miss service time, 1100->500ms!
- Please don't try to add apaches into the LVS realserver pool until I've documented the procedure. A simple mistake, like running the commands in the wrong order, could crash the site.
14:00 mark: The new squid is serving peak load at roughly 20% of the cpu usage of the old squid...
12:00 mark: Put clematis with squid+epoll rpm in production. In case of severe problems, just kill it and start the old squid.
02:22 ævar: Took srv26 out of /usr/local/dsh/node_groups/mediawiki-installation and /usr/local/dsh/node_groups/apaches, wasn't working.

October 15

16:20 mark: Clematis is now running my experimental Squid RPM, with epoll support and my HTCP patch. It's not pooled yet, because I want to discuss and test inclusion of other patches first...
14:45 mark: Depooled clematis as squid, because I want to use it for testing my new Squid RPM.
09:59 Tim: set up srv31 as an NFS server, exporting its /var/static directory; mounted it on benet
07:50 brion: noticed minor bug in PHP on AMD64
03:30 Tim: Set up ntpd on amaryllis, dryas, henbane, yf1000-1004, by copying configuration from yf1007, and running chkconfig ntpd on;/etc/init.d/ntpd start

October 13

23:54 ævar: Kate fixed the tingxi issue, woo.
23:52 ævar: I can't ssh to tingxi (10.0.0.12) which means language/LanguagePt.php is out of sync, nothing severe, but it will cause interface mismatches for pt*.wiki*
23:22 kate: reinstalled vandale and added it as a squid since it's not doing anything else
22:46 kate: reinstalled clematis and put it in squid pool
21:30 brion: fiddling with albert's MaxClients again
17:40 mark: Turned off log_mime_hdrs on all squids, as we're not using it anymore
17:00 mark: Added fuchsia as a squid in knams
15:00 mark: Clematis's disk has been replaced and should be fixed. Needs a reinstall...
14:00 mark: Implemented and deployed a new udpmcast daemon with forwarding rules on amaryllis and larousse. This should solve our purging problems with wikis on separate clusters.
13:00 midom: srv34,srv33,srv32 joined ExternalStore service as cluster3
07:00ish brion: holbach back in business.
06:28 brion: disabled Special:Makesysop steward bits until I get the database problem resolved. Still poking at holbach, skipping the enwiki bits.
06:10 brion: took holbach out of rotation; replication broke with what looks like a steward bot application

October 12

23:00ish brion: trying to restart dump process, because some idiot canceled them
22:30ish brion: updated IRC channels in squid error page
21:43 ævar: the code had some discrepancy due to some of it being cvs uped recently and some of it not, ran scap and solved a live hack in includes/SkinTemplate.php; whoever wrote it might want to take a look.
21:15 jamesday: started gzip of first 50/50GB/25 days of samuel binary logs, had 41GB free. About 2GB/day used.
08:40 brion: updated internalwiki on lucene search index, restarted lucene daemons. had to clear out old logs from maurus, out of disk space.
07:22 brion: cleaning up dupe and missing site_stats rows; removed dupe 'warwiki' entry in all.dblist
05:37 ævar: Installing the CrossNamespaceLinks specialpage extension.
04:05 Tim: henbane and dryas were reporting high lag times, probably due to the low replication rate (it's only replicating commons and ko). The database was locked automatically when the lag was more than 30 seconds, which was usually. I increased the maximum lag to 6 hours.
02:20 Solar: Initialized raid and reinstalled OS on webster. Only eth0 is plugged in and is private.
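
To illustrate Tim's 04:05 entry: the automatic lock is a threshold test on replication lag, so a slave that is routinely more than 30 seconds behind stays locked almost all the time, while a 6 hour limit lets the same slave keep serving. A minimal sketch, with an assumed function name and a made-up lag value (the log gives only the two thresholds):

```python
MAX_LAG_OLD = 30           # seconds, the earlier limit from the log entry
MAX_LAG_NEW = 6 * 60 * 60  # 6 hours, the raised limit


def is_locked(lag_seconds: float, max_lag: float) -> bool:
    """Return True if the database would be locked because the slave lags too far."""
    return lag_seconds > max_lag


observed_lag = 45 * 60  # hypothetical: a slave 45 minutes behind
print(is_locked(observed_lag, MAX_LAG_OLD))
print(is_locked(observed_lag, MAX_LAG_NEW))
```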

October 11

22:00ish brion: changed project ns for plwikisource
19:55 kate: changed root password of zedler, someone who wants it should ask me or elian
17:45 Tim: When they got more load, all three pound instances started using large amounts of memory. Enough excitement for one night, switched back to perlbal.
17:30 Tim: we were having oscillating load between the three pound hosts, so I switched squid to round-robin, and cut the perlbals out of the list at the same time.
16:56 Tim: pound was reaching its fd limit during high concurrency ab testing, raising it to 100000 seems to have fixed it
16:08 Tim: brought dalembert, friedrich and harris into service as pound servers
09:44 brion: briefly stopped mailman to edit vereinde-l archives to remove improperly forwarded email
02:30 Tim: moved static.wikipedia.org to srv31, proxied via the squids.
01:55 Tim: did apache-restart-all to fix high memory usage

October 10

18:14 kate: fs corruption on srv10 again, moved its IPs elsewhere
14:41 Tim: wrote some new tools to allow srv31 to restrict its copying from albert to times of low NFS server load (<1300 req/s)
08:45 Tim: Installed ganglia on srv32-35 (did srv31 earlier)
06:29 Tim: Copying HTML dumps to srv31
03:41 Tim: Added new ganglia metric "nfs_server_calls" to albert and zwinger. It's a perl script, /usr/local/bin/nfs-gmetric
01:32 ævar: Reverted my changes in CVS; cvs up-ed, and synced the affected files, just in case.
01:07 ævar: Checked all the apaches for Language::linkPrefix() and it turns out they all had it (see /home/avar/report (1 = has the function; 0 = does not have the function))
00:46 ævar: Tried syncing again, same error, spooky, off to manually check the apaches.
00:30 ævar: cvs up and scap broke the wiki, which should not have happened but did for some reason. The error was: Call to undefined function: linkprefix() in /usr/local/apache/common-local/php-1.5/includes/Parser.php on line 1232, but function linkPrefix was defined in the Language class and no problems were reported with syncing. Applied a live hack to Parser.php to fix the issue, investigating.
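
The load-gated copying from the 14:41 entry can be pictured as: sample a cumulative NFS call counter twice, turn the difference into requests per second, and copy only while the rate is under 1300 req/s. The sketch below is an assumption about the shape of such a tool, not the actual scripts; the function names and the sample counter values are invented for illustration.

```python
def request_rate(calls_before: int, calls_after: int, interval_s: float) -> float:
    """Requests per second from two readings of a cumulative call counter."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (calls_after - calls_before) / interval_s


def may_copy(rate: float, limit: float = 1300.0) -> bool:
    """Allow copying only while the NFS server is below the load limit."""
    return rate < limit


# Hypothetical readings taken 60 seconds apart.
quiet = request_rate(1_000_000, 1_060_000, 60.0)  # 1000 req/s
busy = request_rate(1_000_000, 1_090_000, 60.0)   # 1500 req/s
print(quiet, may_copy(quiet))
print(busy, may_copy(busy))
```

A counter like the "nfs_server_calls" ganglia metric added at 03:41 would be a natural source for the two readings, though the log does not say the copy tools use it.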

October 9

21:51 midom: killed backups. haha. unkilled site. though adler is good boy, lots of RAM does not help with backups. serial reads do.
19:02 kate: disk failed on clematis. added mint as squid.
11:09 Tim: changed master for ko
08:46 Tim: copied ko upload directory to amaryllis. Set up dryas, with chained replication from henbane.
07:40 ævar: Installed extensions/Renameuser/Renameuserlog.php
05:10 Tim: restored 245 and 248 to srv10
04:28 Tim: removed bogus entries from zwinger's /etc/exports, with an RCS backup
03:13 brion: starting weekly data dumps on benet, srv35, srv36; pulling from adler for primary data. (live; so table dumps will be slightly inconsistent. xml dumps are self-consistent internally.)
02:50 Tim: srv10 down, moved virtual IPs: 245 to srv5, 248 to will and 210 to srv7
02:30 jeronim: added missing mount points /mnt/upload and /mnt/wikipedia on humboldt and some machines in the apaches and mediawiki-installation groups
02:23 Tim: Changed squid configuration to have no-query for albert. This might reduce the latency some people were experiencing when requesting images.

October 8

22:52 ævar: Ran a script (/home/avar/3631.sh) to confirm that bug 3631 wasn't exploited on any wiki besides enwiki, it wasn't.
22:29 ævar: de-sysopped myself on enwiki
22:27 ævar: sysopped myself on enwiki and banned the users exploiting bug 3631
08:22 brion: removed 'srv9' dupe entry in zwinger exports; for some reason srv9 couldn't mount with that in place (the ip is also in)

October 7

15:28 jeronim: turned off and disabled swap on knams squids (clematis hawthorn iris lily mayflower ragweed sage)
13:34 ævar: created ilowiki
11:49 jeronim: chmod/chowned zwinger:/usr/local/etc/powerdns/langlist-cnames to 664 root:wikidev on avar's request
06:41 ævar: / on zwinger filled up (reported by jeronim). I deleted an old log I didn't need anymore, freeing 2GB; more stuff needs to be cleaned out still.
02:49 ævar: Added a live hack to Special:Export, a notice explaining that exporting of full histories is disabled, it can't be translated, boo hoo;)
00:05 brion: changed pawiki sitename/meta namespace to 'ਵਿਕਿਪੀਡਿਆ'

October 6

07:20 midom: re-enabled steward interface
yesterday kate: installed solaris on vandale because mysql wanted to test something. finished with it now, should have linux put back.

October 5

18:57 brion: image server has been very slow lately. fixed a broken thumb file or two which had a subdirectory in the way (one on the wikipedia portal was being requested *very* often, producing extra redirect load)
05:25 Solar: Rebooted srv26, bumped temp. threshold to 80C. Will investigate further.
05:20 Solar: webster is back up for now, but will fail again. Will call SM to get replacement drives.
02:56 Tim: srv24 in rotation as part of cluster2. Restarted compressOld.
01:00 Tim: Setting up srv24 as an external storage server, to replace srv26 which is down again. Stopped compressOld and stopped slave on srv25 for data directory copy.

October 4

23:50 Tim: Started compressOld.php, started mysqld on srv26.
23:30 Tim: restarted evil resource-eating program (with kate's permission)
20:02 kate: stopped evil resource-eating tim program on albert started ~ 06:20.
14:40 mark: Increased DB load on samuel in an attempt to solve DB availability problems
13:13 Webster broke.
06:22 Tim: HTML dump post-process running on albert. It'll spend most of its time in sed, with a perl controlling script.
05:13 Tim: static HTML dump of English Wikipedia is pretty much finished. I'm currently running a huge find command on albert, to get a list of files to post-process.
00:20 Solar: Uploaded pictures. Take a look at User:Solar

October 3

23:07 brion: srv28 shutdown broke dewiki and enwiki dumps, have to restart them. non-wikipedias finished before this.
19:45 Solar: srv11 and srv28 moved to new racks for power distribution requirements.
03:30 jeronim: pmtpa squids were mostly running with max FDs of 1024 and starving, so rebuilt them with limit of 8192 and restarted

October 2

21:41 brion: taking srv35 out of apache loop to run additional dump processing
13:10 brion: running wikipedia backups from bacon via srv36, nonwikipedia backups from bacon via benet
12:57 brion: replication halted on bacon due to missing tables on the new wikis (napwiki, warwiki etc) -- this will need to get fixed. in the meantime doing dumps from other wikis ...
09:30 brion: srv31-35 in apache service (in perlbal list)
08:45 jeronim: srv31-35 ready for apache deployment
07:40 Tim: fixed exif bug (http://bugs.php.net/bug.php?id=34704) and deployed the updated tree on all florida apaches
06:30 brion: running cleanupTitles.php on various wikis
00:10 Tim: Running fixSlaveDesync.php on en.

October 1

21:07 Tim: Told dalembert to stop echoing its syslog spam to zwinger and larousse. Apparently temperature warnings were appearing in terminals on larousse.
19:40 Tim: Added Internode proxies to the trusted XFF list
11:00 brion: bacon and adler catching up last couple hours' data
08:30 brion: stopping bacon, adler to copy current data over to bacon
08:15 brion: continued replication catchup on bacon
08:10 brion: stopped backups; benet's out of space (going to do cleanup) and I'm testing an improved backup dump script that eliminates the overhead of mwdumper on the initial dump-split-compress job.
08:07 Tim: re-enabled Special:Makesysop, minus steward interface

September 30

19:00-20:00 mark: Deployed the fixed HTCP-CLR patch to all squids, and restarted them
19:18 ævar: disabled PageCSS because of potential XSS issues.
16:06 ævar: Installed the PageCSS extension on the cluster for per-page CSS.
13:12 Tim: installed apache, php etc. on dalembert, by modifying /home/wikipedia/deployment/apache/prepare-host until it kind of worked. Not sure if it's all set up right, but it's probably good enough for dumpHTML, which is what I'm using it for.
12:30 Tim: installed gmond on various reinstalled machines
07:22 midom: adler in service
01:25 brion: did some scripted despamming crosswiki (some deleted pages by '127.0.0.1'...)
Solar: Replaced ram in srv42

September 29

19:50 mark: Fixed a memleak in my HTCP CLR squid patch, and testing it on clematis. If it works well, I will deploy it to all other squids...
17:52 Tim: made some more tweaks to http://mail.wikipedia.org/index.html . Now it displays properly in IE, and it works with small screens
17:12 Tim: Returned text on http://mail.wikipedia.org/ to a comfortably readable size. Apologies to optometrists everywhere for the reduced pay cheque.
07:25 brion: Ran initStats on warwiki, napwiki, ladwiki.
05:15ish brion: ntp setup on ariel
05:00 jeronim: clean fc3 on ariel; it has had a drive swapped and is hopefully not faulty now
04:30 Solar: srv33, srv34, and srv35 have ip's and are ready for service. srv32 and srv31 are pending a bomis server move

September 28

- jeronim: srv49, alrazi, diderot, hypatia, avicenna, goeje, harris, dalembert, humboldt, kluge, friedrich all freshly set up with fc3 - but no ntp setup, and no apache. alrazi's old host keys lost.
19:00 jeronim: on zwinger, moved squid errors directory and sync-errors back into /h/w/conf/squid from /h/w/conf/old-squid, and updated sync-errors to also sync to lopar, yaseo, and knams. Updated all squids to use shiny new error page from mark_ryan.
10:00 mark: Added ragweed back to the knams squid pool because of overload on the other squids
09:10 brion: dewiki backup running on benet while others continue (from bacon 20050921)
07:20 brion: backups switched to use bzip2 for xml dumps; 'articles' instead of 'public' name change; image dumps disabled
06:52 brion: starting bzip2 filter/output of 20050924 enwiki dump on srv36
01:00 Solar: alrazi avicenna diderot friedrich goeje harris hypatia humboldt kluge srv42 srv49 are back on the netgear switch

September 27

- jeronim: dhcp still not working so I've asked Kyle to put most fc2 boxes on a different switch
23:53 jeronim: commented out icpagent in /etc/rc.local on dalembert in case it's rebooted
22:10 mark: The new switch appears to be Fast Ethernet only! It's accessible on 10.0.1.1. I configured some parts of it to make it somewhat usable: all ports in access mode, vlan 2.
20:15 midom: disabled steward interface, needs rewriting to select databases instead of specifying their names directly in queries -- breaks replication
18:00 midom: ariel gone down:

LSI MegaRAID SCSI BIOS           Version  G112 May 20, 2003
Copyright(c) LSI Logic Corp.
HA -0 (Bus 3 Dev 1) MegaRAID SCSI 320-2
Standard FW 1L26 DRAM=64MB (SDRAM)
Battery module is present on adapter
Following SCSI ID's are not responding
  Channel-2: 0, 1, 2
1 Logical Drives found on the host adapter.
1 Logical Drive(s) Failed
1 Logical Drive(s) handled by BIOS
Press <Ctrl><M> or <Enter> to Run Configuration Utility

08:53 Tim: Stopped icpagent on dalembert for now, pending examination of pound's problems
07:11 brion: added wikipedia.nl alias in powerdns in prep for changing master servers for that domain (jason has that info)
02:54 brion: new8 machines (except srv50) don't have ntp working, still. punching at it again (were up to about 15 seconds slow)
- copied /etc/ntp.conf and /etc/ntp/step-tickers from srv50 to the others in the group, ran /etc/init.d/ntpd start
02:40 brion: starting test cur-only dewiki dump to double-check dump processing bugs while other backups continue

September 26

21:59 brion: killed commons image dump again; too slow, too big. need to rework that...
19:30 jeronim: turned off swap and commented it out in /etc/fstab on all pmtpa squids after kate noticed srv7 was swapping and restarted its squid
17:30 jeluf: skipped some insert statements to enwiki on the slaves not replicating enwiki. Steward tool running on metawiki tries to write to enwiki and mysql replicates these transactions.
09:30 brion: stopped bacon again, running backup of everything but enwiki/dewiki (backdated to 20050921) from bacon
09:22 brion: added refresh-dblist script to update the split .dblist files in /h/w/c
08:46 brion: started replication catchup on bacon (about 5 days behind)
08:40 brion: restarted mwdumper on the enwiki dump, which had broken with a funky file locking problem
07:37 Tim: deleted srv27 binlogs 020-026, the rest are needed for srv26 when it starts working again.
07:37 brion: locking fiwiktionary for case conversion
04:49 brion: turned off srv41, srv26 apaches due to segfaults; turned off exif log for commons due to giant >2gb log file
01:04 brion: created wikiro-l, wikimediaro-l lists, iulianu as list admin
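
The replication breakage in the 17:30 entry (and midom's September 27 note about selecting databases instead of naming them in queries) matches how MySQL's statement-based `replicate-do-db` filter behaves: the slave decides from the default database of the session, not from the database named inside the statement. The sketch below simulates only that rule. It assumes the affected slaves used such a filter, which the log does not state, and the database set and the SQL in the comments are illustrative.

```python
from typing import AbstractSet


def slave_applies(default_db: str, replicate_do_db: AbstractSet[str]) -> bool:
    """Statement-based filter: look at the session's default database only."""
    return default_db in replicate_do_db


# Hypothetical slave that carries metawiki and commonswiki but not enwiki.
slave_filter = frozenset({"metawiki", "commonswiki"})

# Steward tool connected to metawiki, writing with a qualified name, e.g.
#   INSERT INTO enwiki.user_groups ...
# Default database is metawiki, so the slave tries to run it and has no enwiki.
print(slave_applies("metawiki", slave_filter))

# Same write after selecting the target database first (USE enwiki):
# default database is enwiki, so this slave skips the statement.
print(slave_applies("enwiki", slave_filter))
```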

September 25

13:45 hashar: made nap language inherit from italian language instead of english (rebuildMessages.php nap --update).
13:10 hashar: created nap, war & lad wikipedia using the updated howto Add a language. Thanks Tim for the technical assistance.
06:17 kate: copying bacon's mysql data to zedler

September 24

23:44 brion: changed squid config to use bacon and holbach's .wikimedia.org names instead of .pmtpa.wmnet on kate's advice
23:31 brion: pound didn't seem to be working; 503 errors, other problems, was unkillable without -9. unable to run site on holbach's perlbal; squids couldn't find it? restarted pound and icpagent on dalembert, working now
23:23 brion: tried restarting pound. (there's also weird cyclic load between rose and anthony every few minutes)
23:15 brion: slow site performance reported; ganglia showed unusually high load on srv50, srv37, dalembert. Stopped dumpHTML on dalembert (pound machine), restarted apache on 50 & 37
22:12 brion: srv8 failed on squid restart due to broken symlink to config file. added srv8 to pmtpa squid lists for new config list and relinked its config file
20:40 midom: webster is up with non-enwiki dbset, ariel is up with enwiki only.
19:36 kate: zedler is up with mysql installed; waiting for replication to be sorted out somehow
18:20 Tim: deployed new squid configuration generator
16:11 jeronim: diderot, harris, alrazi, avicenna out
13:30 jeronim: kluge & friedrich out too, for reinstall
12:12 jeronim: took goeje out of mediawiki-installation dsh group; putting fc3 on it
Tim: stopped icpagent on bacon. Load balancers are now holbach (perlbal) and dalembert (pound)
07:55 brion: started enwiki xml dump with five parallel readers; experimental (on srv36 pulling from samuel)
07:04 brion: trying to fix ntp again on humboldt and new8 machines
02:14 brion: disabled Special:Undelete toplevel list; code needs rewriting or just dump it for Special:Log (added link as temp hack)

September 23

21:17 brion: added /^Lynx/ to unicode browser blacklist
15:37 Tim: Deployed pound/icpagent on dalembert. It is currently running alongside perlbal instances on bacon and holbach.

September 22

23:18 brion: turning off capitallinks on tawiktionary
18:48 brion: updated pmtpa squid error messages to remove obsolete openfacts and wikisearch references. master copies now in /h/w/conf/squid/errors
18:21 brion: wikinews backup done. enwiki backup halted due to some nfs/large file problem. investigating
11:58 Tim: brought srv26 back into service
11:28 Tim: started deleting thumbnails still in their obsolete locations, 180,000 to delete.
09:10 brion: starting *wikinews backups on srv36 pulling from bacon. [installed mwdumper]
08:25 brion: running enwiki backup on srv36 pulling from a halted bacon, saving on benet
08:08 brion: taking srv36 off perlbal nodelist to try running backups with it
07:26 brion: adding new machines to perlbal; ready for service... hopefully
07:00 Tim: restarted dumpHTML.php, I had stopped it for a while due to high DB load. I'll stop it again when we get closer to peak time.
06:45 brion: running setup-apache script on remaining new8 machines (srv36-41, srv42-48)

September 21

23:54 brion: recompiling librsvg with correction to security fix; it had accidentally disabled data: urls as well
20:35 brion: set european tz for nlwikimedia
18:46 Solar: webster and ariel have rebuilt raid and FC3 installed although they do not have IP's. They are accessible via console.
17:00 midom: disabled all bloat in albert's http configuration (mod_perl, php, jk, ssl, ...), which freed lots of memory and allows more effective caching of directory trees and file metadata. And yes, it eased the performance issues a bit (uh oh, yet another image server overload).
08:59 brion: disabled wikidiff PHP extension sitewide; there are numerous reports of bad diff output in some cases, and dammit alleges it may be crashy or futex-y. InitialiseSettings.php is set to enable it in the wiki if it's on in php.ini and ignore it if not.
07:50 brion: tim is doing ongoing debugging on srv50 trying to identify source of segfaults
07:00 brion: installed patch for apache rewrite bug on amd64, but still getting segfaults on srv50
06:08 brion: clocks are wrong on new8 boxen; working on correcting
06:00 brion: setting up APC instead of Turck on srv50 experimentally
00:35 brion: srv50 back out; some apache child process segfaults, which don't look too good
00:34 brion: srv50 back in
00:11 brion: srv50 out for further adjustments (tidy, proctitle)
00:09 brion: putting srv50 into apache rotation to test it out before installing all others

September 20

23:20 ævar: changed wgSitename to Vichipedie on furwiki
23:14 ævar: ran php namespaceDupes.php --fix --suffix=/broken furwiki to fix namespaces on furry wiki
22:39 ævar: changed $wgMetaNamespace on furwiki from Wikipedia to Vichipedie.
20:14 brion: reverted Parser.php change temporarily due to reports of massive template breakage
19:45 brion: fixed internal wiki (whoops, typo in config change last night)
19:13 brion: removed bogus entries from master robots.txt ("/?", "/wiki?", "/wiki/?")
14:16 Tim: Disabled context display for full text search results as an emergency optimisation measure. It was taking more than its fair share of our precious DB time. $wgDisableSearchContext in CommonSettings.php.
- Note: This caused a large reduction in CPU usage on the master DB server, from 100% down to 70%. In the future, it might be worthwhile to ensure text for context display is loaded from the slaves.
10:30 brion: doing experimental software installs on srv50 [amd64]
10:08 brion: Added sync-apache script to rsync the apache config files from zwinger to pmtpa apaches. Don't forget to use it after making changes and before restarting apaches!
09:30 brion: moving apache configs a) into /h/w/conf/httpd subdir, and b) into local copies on each server which will be rsync'd
08:05 brion: new apache configs on all
07:23 brion: fixed up apache configs on *.wikimedia.org
07:00 jeronim: added acpi=off panic=5 to adler's kernel params and rebooted, because apparently there are some ACPI problems, and so that it reboots on kernel panic instead of freezing
06:53 brion: cleaning up apache config files; replacing ampescape rewrite usage with aliases to remove our patch dependency (tested on wikimediafoundation.org)
06:40 jeronim: installed same kernel on adler as is on samuel and set it as default; also samuel's default kernel was changed to a newer one (by yum?) in /etc/grub.conf, so changed it back to match the current kernel
05:30 brion: put suda back in rotation; toned down its share of enwiki hits a bit
05:02 brion: adler crashed again at some point
02:36 brion: adler was rebooted by colo; running innodb recovery
01:58 brion: adler is down, seems to have crashed (panic bits on scs output). taking out of rotation too
01:45 brion: lots of delays trying to open suda from wiki; taking out of db rotation
01:11 brion: halted backup; benet ran out of space. en_text_table.gz is much larger than expected (49gb), perhaps external storage has not been used correctly as expected? will remove file and continue.

September 19

22:10 ævar: uninstalled nogomatch on enwiki, who's going to sort through all that gibberish data? Not me!
21:07 brion: rebooting new8 machines to make sure they're running current kernel
21:02 brion: new8 group status: srv47 online but borked; 31-35 and 49 offline. others to be set up as apaches
20:46 brion: running special pages update on frwiki by request... will update others on cronjob if there's not already one?
19:40 mark: Replaced udpmcast.py by a properly daemonized version. Set it up at knams to forward to a multicast group instead of all unicast IPs forwarded by larousse...
18:45 mark: Removed miss_access line from knams squids to solve the cache peer errors. Repeat at yaseo if it works...
13:49 ævar: Installed the nogomatch extension experimentally on enwiki.
08:00 Tim: Removed all NFS mounts from srv1's fstab. Set up a simple /home directory on its local hard drive.
06:06 kate: reverted root prompt on zwinger so it's not invisible on a white background
04:47 James: stop slave on bacon while dumper is running. Slave will restart when done.
02:45 Tim: changed root prompt on zwinger. Started sync-to-seoul, with -u option this time so we don't accidentally overwrite stuff
01:50 brion: seems to be mostly back up at this point. boot seemed to be aided by disabling named and letting it lookup from albert
01:36 brion: zwinger boot still going on; nfs init is *very* slow doing the exportfs -r; seems to be slow dns lookups
00:38 brion: jeronim did this: [root@zwinger srv38]# reboot - unfortunately it was not srv38, but zwinger.
00:05 brion: mounted /home on srv1; couldn't login, caused sync-file failures
00:05 brion: enabled Nuke extension on meta & mediawiki.org

September 18

14:00 jeronim: rebooted zwinger by mistake and it needed a manual reset by colo staff to come back up. Site was offline for about an hour.
04:34 brion: vandale kernel panic, frozen
04:30 Solar: srv36-srv50 are racked, have ip's, and are ready for production
03:10 Tim: moved compressOld.php to dalembert (where dumpHTML.php has been running), on complaints that it was causing problems on zwinger.

September 17

22:17 brion: running unique-ip counter on fuchsia with saved logs (into uniqueip table on vandale)
22:02 brion: disabled disused info-de-l list by request of list admins
11:05 brion: ran initStats on all wikisources to initialise those not already set
07:06 brion: canceled upload dump for commons backup due to size and slowness; too big to fit
06:30 jeronim: on larousse, removed fedora netcat and installed from source into /usr/local
04:30 Tim: used ntpdate -u pool.ntp.org to set the times on all the yaseo machines, some were a long way out. Then set all their timezones to UTC. This apparently caused ganglia to think yf1000 and yf1002 were down, fixed by restarting the local gmond.
04:10 Tim: Started replication on henbane
01:10 brion: enabled wikidiff on all wikis. (can be disabled selectively w/ wgUseExternalDiffEngine in InitialiseSettings)
Tim: Set up mysql on henbane, made a consistent dump of kowiki and commonswiki using bacon, copied dump to henbane ready to start replication

September 16

22:20 Tim: started mysqld on srv26, it had been off for 12 hours or so. The compression script had been running all that time, srv26 caught up to the master without incident.
Colo (Solar):
- supposedly bart is brought back up
- borrowed HP switch connected to gi0/4 on the cisco
- moreri was moved, and is trying to netboot (fails)
- 10 of the 20 new servers have been racked and wired to the borrowed HP switch, but don't have IPs yet
11:37 brion: updating sitenames on he, el, ru wikisource
11:30 brion: started backup run
03:17 brion: frwiki reimport done
02:47 brion: frwiki reimport started
02:35 brion: jawiki reimport done
01:49 brion: started jawiki reimport
01:33 brion: bacon catching up; suda is fine as it is partial mirror
01:29 brion: took bacon, suda out of rotation for further investigation
01:23 brion: nlwiki open for editing
01:03 brion: reimporting nlwiki on samuel
00:41 brion: nl/fr/ja dumps done (in /var/backup/private/recovery). going to try reimporting soon
00:16 brion: running attachLatest on *wikisource

September 15

23:14 brion: 3 dumps from adler done; doing extra backups from samuel too. setting adler to read-only
22:37 dumping nlwiki, frwiki, jawiki databases from adler onto sql files on benet
22:18 put load back on samuel for enwiki with adler disabled. fr, nl, ja wikipedias are locked while we work this out
22:09 commented out adler from db.php; adler appears to be misconfigured and all kinds of breakage is going on. it's not read-only, and has some revisions that others don't have
21:56 brion: took load off bacon (was 100 load on fr, nl, ja; nl and fr reporting weird editing problems possibly freak lag problems, and it was consistently lagging a few seconds at least)
17:25 mark: Set up IPsec between bacon and vandale. Who wants to set up replication?
16:50 mark: Altered geodns: pointed Malaysia at yaseo, and Israel, Turkey, Cyprus at knams
13:04 Tim: Shutting down apache on dalembert temporarily so that I can use it for HTML dump testing and generation
12:35 Tim: Restarted compressOld.php, it stopped when I shut down bacon to do the copy to adler.
11:30 mark: Restarted some knams squids to increase FDs, changed /etc/rc.local startup script
11:15 mark: Deployed squid on yf1003 and yf1004, and added them to the DNS pool
11:10 mark: Recompiled squid on yaseo to increase filedescriptors to 8192 and restarted all squids with 4096
07:37 brion: running importDumpFixPages.php on wikisources to fix bogus rev_page items
02:30 kate: ariel's down
02:29 brion: recompiling mono 1.1.9 on benet for xml bugfix
00:15 brion: removed humboldt and hypatia from mediawiki-installation node group, neither has port 80 on:
- humboldt prompts for password, not configured correctly?
- hypatia shows host key changed; was reinstalled?
00:10 brion: disabled MWSearchUpdater plugin as the daemon is broken; briefly broke the wiki due to bad include_path; need to fix config for MWBlockerHook to make sure the path is right even w/o the lucene include

September 14

21:30 mark: Set up log rotation at yaseo to knams, routed Japanese and Chinese clients to yaseo squids.
20:30 midom: adler online, bacon catching up
20:15 mark: Deployed squid on yf1001, and routed Korean clients to the Florida squid cluster.
18:15 mark: Deployed squid on yf1000.
18:10 mark: Wrote a YASEO squid deploy script /home/wikipedia/deployment/yaseo-squid/prepare-host (yahoo cluster only, should I put it at florida?) after Tim's apache prepare-host script
17:48 ævar: de-opped myself on ruwiki and stopped my revert bot, the russians hate me even more now.
16:30 mark: Set up a squid on yf1001. Same setup as knams, except it's in /usr/local/squid as in florida. Adapted florida's squid and mediawiki configs accordingly.
13:19 ævar: ran INSERT INTO user_groups VALUES (1165, "sysop"); on ruwiki to make myself temp. sysop to fix the MediaWiki: fsckup.
11:15 brion: halted nlwiki partial temp backup as enough was run to test problem
- (identified problem as [4])
10:41 brion: running another nlwiki backup to get raw dumpBackup.php output for testing
10:39 brion: halted old backup sequence (at nlwiki, with a mystery breakage in output that needs examining)
10:33 brion: hacking dumpBackup.php to load php_utfnormal.so extension (not yet enabled sitewide)
10:05 brion: running kowikisource and zhwikisource imports on formerly broken parts
08:55 brion: updated messages on jawikisource
08:30ish brion: updated messages on *wikisource
01:30 jeronim: access to yaseo console server should be back hopefully within a few hours - eam is dealing with it

September 14

13:32 Tim: Shut down mysql on bacon, started copying data directory to adler

September 13

23:23 brion: set logo on dewikiquote to commons version
23:ish brion: installing mono 1.1.9 with xml patch on benet to fix future dumps ([5])
17:23 ævar: Logging Exif debug information to /home/wikipedia/logs/exif.log using wgDebugLogGroups.
16:40 jeronim: yf1000 - yf1004 are all set up with reiserfs now. The only yaseo machine not working is yf1013 which is in an unknown state as the console server (konsoler04.krs.yahoo.com (10.11.1.186)) is unreachable.
16:18 Tim: Started moving some text to cluster2, starting with frwiki.

September 12

11:59 brion: killed search update daemon; going to replace this (again) with a more robust queuing system
15:00 or so kate: upgraded perlbal to 1.37
13:24 jeronim/kyle: lots of machines connected to SCS, port labels corrected. The APC has apparently vanished - Kyle couldn't find it.
09:40 brion: installed ICU 3.4 on zwinger and mediawiki-installation from RPMs built from the ICU-provided spec file. Source and binary rpms in /home/wikipedia/src/icu
09:34 brion: fixed misnamed krwikisource -> kowikisource db
8:50 Tim: rebuilt interwiki tables
02:15 brion: replaced old php.ini on zwinger with symlink to the common one. added /usr/local/lib/php back into the default include_path (for PEAR stuff sometimes used)
01:04 brion: blocked leech enciclopedia.ipg.com.br

September 11

22:05 brion: trying batch clears in parallel overloaded zwinger; canceled, running in serial again
21:35 brion: running batch operation to remove bad cached messages
21:00 brion: reconfigured blocker daemon to log to samuel. had to set up permission grant again on samuel
18:19 Tim: finally managed to fix the message problem, except for some erroneous values stored in cache
~18:00 ævar: To get interwiki links working on hrwikisource: sourced the output of maintenance/rebuildInterwiki.php and sourced maintenance/interwiki.sql on all wikis. Some interwiki prefixes appear to have been lost in the process, e.g. bugzilla: (only mediazilla: exists in interwiki.sql). Looks like we need better interwiki update scripts...
- Don't run interwiki.sql, under any circumstances. Add new prefixes to m:Interwiki map. -- Tim 08:52, 12 Sep 2005 (UTC)
16:05 Tim: switched master to samuel. Adler asks for root pw after reboot due to failed fsck.
15:10 Adler crashed. Tim and JeLuF on the scene, wiki switched to read-only mode
14:59 Tim: Non-default language message caching completely f****d up. Blank messages everywhere
07:10 brion: now using blocker list
07:00 brion: installed limited librsvg on apache cluster, svg back on
15:40 Tim: Installed apache, php, turck and mediawiki on yf1005. Put all required commands in /home/wikipedia/deployment/yaseo-apache/prepare-host. Still needs database, memcached and mediawiki configuration.
05:05 brion: restarted MWUpdateDaemon, hung again at 1gb used memory
02:38 brion: disabled svg for further security work
01:20 brion: reconfiguring wikisource to allow en.wikisource.org to work (hr ja kr sv zh en now imported)
01:09 brion: installed librsvg 2.11.1 on the apaches; it's in /usr/local. (old librsvg versions seemed to muck up text pretty bad)

September 10

22:49 brion: importing wikisource nl ro ru
22:34 ævar: deinstalled the wgDebugLogFile on commonswiki, got enough debug output to see if anything was wrong.
--:-- jeronim: yaseo stuff:
- reinstalled FC4 on yf1000, yf1001, yf1003, yf1004 with reiserfs
- reinstalled FC4 on dryas & henbane with 10GB ext3 root partition and the bulk of the disk as jfs on /a
- rsyncing /home, /tftpboot, /root, /var/www, /usr/local, and /etc from amaryllis to dryas in preparation for reinstalling amaryllis with reiserfs. It's a script, /root/amaryllis-rsync.sh, running in a screen on dryas.
14:14 ævar: installed a wgDebugLogFile for commonswiki in /home/wikipedia/logs/commonswiki.log to monitor Exif debug output.
13:26 ævar: ran maintenance/deleteImageMemcached.php on all wikis fixing bug 3410
10:44 brion: cleaning out old mysql data from benet to free up space for current backups (40 days+ out of date, not too useful)
10:00 brion: restored working frame-breakout code (pending cached wikibits.js)
07:58 Tim: moved some ancient rubbish from /home/wikipedia/htdocs to /var/backup/home/wikipedia/htdocs
07:10 brion: running data split for additional wikisource languages
02:40 Tim: Changed names of Seoul machines
02:15 brion: set edit rate limit for new accounts to same as ip rate limit
01:40 brion: installed rsvg (librsvg2) on mediawiki-installation machines, enabled SVG uploads

September 9

06:30 brion: restarted stalled de,en dumps

September 8

19:18 brion: checker daemon running
10:50 brion: setting up vandal checker daemon on larousse
10:42 hashar: enabled subpages for portal (100) and portal discussion (101) on dewiki.
7:45 hashar: added two namespaces for frwiki : 100=>Portail, 101=>Discussion_Portail .

September 7

22:00 jeronim: fixed avar's login problem on servers in the mediawiki-installation group -
- nscd -i passwd did not work
- /etc/init.d/nscd restart ; /etc/init.d/sshd restart did solve the problem on each machine except for benet; for benet, problem was finally solved after doing the restarts twice more, then nscd -i passwd, then doing the 2 restarts with a pause in the middle
21:30 jeronim: killed everyone's ssh sessions and sshd on zwinger (sorry)
10:25 midom: After Tim put the memcached patch live, the site's sessions were switched from NFS to memcached.
06:54 brion: killed stalled backup -- memcached send hang for the last day or so. It's continuing w/ dkwiki; will rerun stalled dewiki and enwiki

September 6

19:55 brion: tgwiktionary to lowercase
05:30 brion: set up experimental upload verification hook
04:02 koko: removed firewall

September 5

12:40 brion: set up to shut down search builder daemon every hour (at 47 minutes) to protect against memory leaks in builder; search-update-daemon wrapper script set to auto-restart 5 seconds after shutdown/crash of the daemon
09:05 brion: rebuildMessages.php --update on all wikis to add various new messages
06:09 brion: starting mass lucene updates of pages edited in august
05:18 brion: lucene back-deletions done, reoptimizing build index
01:10 brion: search updater up; running queued deletions
00:45 brion: vincent back in active search rotation
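- Note on the 12:40 entry: the restart-on-exit behaviour can be sketched as a small supervisor loop. This is an illustration in Python, not the actual search-update-daemon wrapper (whose contents are not recorded here); the function name supervise, its parameters, and the injectable run/sleep hooks are assumptions made so the loop can be exercised without a real daemon. The hourly shutdown would be a separate scheduled job, for example a cron entry whose minute field is 47.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def supervise(command: Sequence[str],
              restart_delay: float = 5.0,
              max_runs: Optional[int] = None,
              run: Callable[[Sequence[str]], int] = subprocess.call,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `command` again each time it exits, waiting `restart_delay` seconds.

    With max_runs=None the loop never ends, as a real wrapper would;
    a finite max_runs exists only to make the sketch testable.
    Returns the number of times the command was started.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        run(command)              # blocks until the daemon exits or crashes
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        sleep(restart_delay)      # pause before bringing the daemon back up
    return runs


if __name__ == "__main__":
    # Demonstration with stubbed-out run/sleep so nothing is actually executed.
    started = []
    supervise(["hypothetical-daemon"], max_runs=3,
              run=lambda cmd: started.append(list(cmd)) or 0,
              sleep=lambda seconds: None)
    print(len(started))
```

Pairing a periodic forced shutdown with an unconditional restart bounds how much memory a leaking daemon can accumulate, at the cost of a few seconds of downtime each hour.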

September 4

23:55 brion: splitting lucene config to lucene.php. putting coronelli on search, with optimized index
19:30 jeronim: created helpdesk-l
17:20 jeronim: fuchsia does not boot on the latest kernel (see below), but it does boot on the 2.6.11-1.33_FC3smp kernel, so switched it to boot that kernel by default
16:27 mark: Because of cascading incidents in knams, we moved all traffic to florida and lopar via DNS.
14:30 jeronim: fuchsia was dead or very close, so power-cycled it using the IPMI. It is broken.
13:16 Tim: made /home/wikipedia/lib/install.sh ignore x86_64 machines, added a part to clean up rubbish left in /usr/lib, then ran it everywhere with dsh -a -f
04:20 Tim: reinstalling PHP 4.4.0 with exif support. Using php-upgrade-440, which calls the new script /home/wikipedia/lib/install.sh to set up shared libraries in /usr/local/lib.

September 3

18:40 jeronim: removed body of mailman archive messages here and here on yannf's request
06:40 brion: relaunch updated backup script with some of the broken bits fixed.
04:50 Tim: Finished benchmarking PHP 4.4.0, see GCC benchmarking. Now deploying the new binaries, from source tree /home/wikipedia/src/php/php-4.4.0-gcc4
sometime brion: added .log to text/plain on benet's lighty

September 2

12:00 brion: ran backup test on aawiki using the new dump splitter and partial new backup script. (script is in ~brion/run-backup.sh if anyone wants to examine it)
07:19 Tim: compiling GCC 4.0.1 on zwinger. It will be installed with a program suffix, so gcc is still the old compiler, and gcc-4.0.1 is the new one. Source directory is /home/wikipedia/src/gcc/gcc-4.0.1, build directory is /home/wikipedia/src/gcc/gcc-4.0.1-build.
06:21 Tim: removing hypatia from perlbal nodelist for an hour or so, for some benchmarking

September 1

07:45 brion: set sitename/meta namespace on mtwiki
07:00 brion: running cleanupTitles.php to rename broken pages. Will be at Special:Prefixindex/Broken/ at each wiki.

August 30

17:30 jeronim: made a robots.txt on larousse (noc/kohl) to disallow some dynamic pages and a few others
16:40 jeronim: created wikimediapl-l

August 29

21:30 brion: blocked wissens-schatz.de for remote loading
17:30 jeluf: anonymized a name in the archive of wikide-l
11:30 brion: running a batch job checking for invalid titles on various wikis (cleanupTitles). shouldn't interfere with anything, making no changes.

August 28

22:15 brion: locking plwiktionary for capitalization change
15:18 hashar: created wikimk-l mailing list.
15:15 mark: Brought mayflower back up. Repaired the filesystems, and rebooted it. It was reporting lines like

Aug 28 04:22:34 mayflower kernel: swap_free: Bad swap file entry 7800007ffffff00f

14:30 mark: Another Kennisnet V-20 went down, this time it was mayflower dying sometime this morning. Depooled it... As it's not critical and we still have SP access, I will have a look at it first.

August 27

00:45 brion: turned on wegge's experimental watchlist bot thingy on dawiki

August 26

sometime: lots of data imported on wikisources

August 25

16:02 jeronim: added fc-mirror.wikimedia.org DNS entry for fedora mirror
- fc-mirror 1H IN CNAME albert
15:40 hashar: created wikials-l mailing list. TODO: delete /h/w/htdocs/mail/.index.html.sw(o|p) (swap files by fire).
19:00 mark: PowerDNS on pascal appeared corrupted. Most probably because of an overlapping zones problem in bindbackend (not bindbackend2). I integrated rev.wikimedia.org into wikimedia.org to avoid that.
16:09 hashar: blacklisted www . izynews . com on florida squids (using acl badbadip src 62.75.174.182/32). Needs to be done on kennisnet and paris cluster too.
11:00 brion: set up https on kohl. (old ssl key files backed up; wasn't using the established password, nobody knew what it might have been)
07:05 brion: rebuilt interwiki tables; using correct interwikis for the new wikisources.
06:51 brion: added sr.wikisource.org
02:02 hashar: updated in HEAD LanguagePt.php from meta. Watch out when synchronising.
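- Note on the 16:09 entry: a src ACL with a /32 mask matches exactly one client address. The check can be sketched in Python with the standard ipaddress module. This only illustrates the CIDR match, not how squid implements it; BLOCKED_SOURCES and is_blocked are hypothetical names, and the ACL would presumably be paired with a deny rule that the log does not show.

```python
import ipaddress

# The network comes from the log entry; the list name is made up for this sketch.
BLOCKED_SOURCES = [ipaddress.ip_network("62.75.174.182/32")]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_SOURCES)


if __name__ == "__main__":
    print(is_blocked("62.75.174.182"))  # the address named in the ACL
    print(is_blocked("62.75.174.183"))  # its immediate neighbour
```

Widening the mask (for example to /24) would block the whole surrounding range, which is why a single-host /32 is the conservative choice for one misbehaving client.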

August 24

14:04 hashar: disabled lucene search. Daemon runs on maurus but times out / doesn't give any output.
04:00 Jamesday: started nice bzip2 for slow query log and first 72 binary logs on adler to free 40GB of disk. Can archive them on another box later.
- use avicenna for binlog archives -- Tim 05:53, 25 Aug 2005 (UTC)
00:43 brion: trying out an older version of MWDaemon on vincent to see if memory leak is a new code problem

August 23

16:17 jeluf: removed 10.0.0.17 (vincent) from MWDaemon pool. Was always reporting errors.
09:39 brion: added http://ar.wikisource.org http://da.wikisource.org http://de.wikisource.org http://el.wikisource.org http://es.wikisource.org http://fr.wikisource.org http://gl.wikisource.org http://it.wikisource.org http://la.wikisource.org http://nl.wikisource.org http://pl.wikisource.org http://pt.wikisource.org http://ro.wikisource.org http://ru.wikisource.org
05:23 Tim: Reports from users of frequent "connection refused" errors reported by the browser. Investigated, found squid was crashing once every 10 minutes or so, on 4 out of 6 squids. The two that weren't crashing were running a newer version of squid, I upgraded them all to that.

August 22

22:12 brion: upped max post size to 75mb on squids; were problems posting large videos to commons (or something)
21:50 brion: renamed presswiki to internalwiki

August 21

22:53 brion: bugzilla up; removed ssl-ticket.wikimedia.org from pascal's apache conf.d dir
22:48 brion: bugzilla.wikimedia.org appears to be offline.
13:30 Tim: reduced lucene load on vincent to 1/4, maybe that will stop it from locking up (which it did again)
13:00 Tim: restarted lucene on vincent, it was closing connections as soon as they were established
06:27 brion: otrs now accessible again on https://ticket.wikimedia.org/ ; now with redirect for the index page! For reference: Apache is in /usr/local/otrs
06:00 brion: trying to start otrs on ragweed. apache configuration appears to be borked.

August 20

10:00 jeluf: finished OTRS transition to ragweed. SpamAssassin setup finished.
09:53 Tim: Switched site to 1.6alpha
08:16 Tim: Applying schema update for 1.6alpha, basically an ALTER TABLE watchlist
01:00 Tim: ran update-special-pages

August 19

23:30 brion: changed postfix 'myhostname' setting from zwinger.wikimedia.org to mail.wikimedia.org, should prevent the mail loop errors reported sending to the full addr
23:00 brion: ran namespace conflict checks for updates on tawiki and gawiki
21:40 brion: updated rebuildInterwiki

August 18

23:30 jeluf: OTRS status: Installed apache/php/perl/postfix/mysql client on ragweed. Using pascal as DB server. Problems with sessions, sessions seem to be mixed up, sometimes I get logged in as presroi, sometimes as JeLuF :-/ Stopped apache for now. Postfix still accepting new tickets.
22:30 mark: Changed DNS CNAME ticket.wikimedia.org to point to ragweed
22:17 brion: disabled account creation throttle on press wiki; this is closed wiki and all accounts are created by an admin
10:00 midom: suda is back again, with enwiki and commonswiki databases
05:00 jeluf: copied OTRS tables to pascal, copied otrs binaries to pascal, configured pascal to serve https. Can access old tickets again. Currently can't send new tickets to otrs. DNS change needs to be done.
00:55 brion: recreated wikimediasr-l list on zwinger

August 17

19:27 brion: fixed bug in db.php that set all database load factors to NULL

August 16

20:15 jeluf: renamed project namespace on cswikibooks to Wikiknihy.
15:30 midom: resumed idle bacon's mysql replication, we might need to do external store migration soon, and bring back suda with smaller dataset.

August 15

21:46 kate: always_bcc on zwinger was set to "quagga" and its mbox was full, so it generated lots of bounce messages. i removed the setting.
12:30 mark: Mint seems to have at least a bad disk, possibly other problems. Sun will look at it. In the meantime, we can *try* to network boot it and recover data.
10:30 jeronim: had a look at mint via the IPMI - tried to power cycle it but it wouldn't switch off. Mark will tell the kennisnet guys about it. There's a dump of the OTRS DB from before the transfer to mint in albert:/root. If mailman is to be put back to zwinger, chapter-l and the new Serbian list will need to be re-created (and maybe some other lists?).
09:00 mark: Mint apparently is fucked, RAID and SP settings were reverted to factory defaults. Trying to do data recovery now. Possibly a power problem?

August 14

19:51 brion: mail config on zwinger broken or funky or otherwise annoying; just leaving it off for now. moved dns for mail back to mint (which is still dead) sighhhh
19:26 brion: moved mail.wikimedia.org back to zwinger due to extended outage on mint. With our limited support contract on knams we can't afford to have this critical service there.
14:30 midom: srv27,srv26,srv25 joined external storage service, waiting for payload
09:30 brion: mint is offline, no ping
00:20 brion: stopped bacon to run backup dump
01:00 jeluf: enabled spamassassin for OTRS on mint (~otrs/.procmailrc)

August 13

sometime kate: moved otrs to mint
23:25 brion: added wikimediasr-l aliases to mailman on mint
sometime someone: Apparently mail.wikimedia.org has been moved to mint.
10:42 jeronim: set ticket.wikimedia.org to CNAME mint.knams.wikimedia.org. (move of OTRS to mint is in progress)
00:58 Tim: started update-special-pages
00:19 Tim: it happened again so I disabled otrs's crontab. Original crontab is in /opt/otrs/crontab

August 12

23:18-23:30 Tim: An OTRS process on albert (PostMaster.pl) developed a runaway memory leak, causing heavy swapping. This slowed down albert sufficiently to cause the entire apache cluster to lock up with high load. Killed the process at 23:30 and the site soon returned to normal.
09:30 brion: took srv1 out of 'apaches' node group and shut off apache on it. DON'T RUN APACHE ON SRV1

August 11

21:26 Tim: TICK TICK TICK, that's the sound of 58 servers with their clocks ticking in synchrony, maximum offset 80ms.
20:30 Tim: Added the missing restrict line for 10.0.0.200 to ntp.conf on (almost) all machines
19:30 Tim: Synchronised ntp.conf on hypatia, humboldt, rose, anthony, rabanus, diderot and srv1 with /home/config/others/etc/ntp.conf.vlan2 . This made them remotely queryable, for easier debugging in the future, and also switched their preferred server from zwinger to the cisco (in broadcastclient mode).
18:35 Tim: Fixed tingxi's resolv.conf
17:45 mark: Fixed inconsistent favicons on apaches. Older apaches had symlinks to a common (wikipedia) favicon, which got overwritten with the new wikinews favicon by brion. Removed the symlinks, and put the correct favicons in place.
12:20 brion: set up pl.wikimedia.org and press.wikimedia.org (press is locked, and currently has no user accounts. a sysop/bureaucrat will need to be added for it to be used)
07:28 brion: updated wikinews.org favicon

August 9

23:20 mark: Rerouted Europe back to knams, because all sorts of weird problems were occurring. Fixed a typo (pmpta) in DNS. Some nameservers report TTL 0 for some of our DNS records - need to investigate that.
22:20 mark: Moved Squid service IP 207.142.131.246 from overloaded srv10 to srv5. Cleared the ARP entry on the l3 switch.
22:00 mark: Reroute everything from knams to pmtpa directly, because of routing problems
13:35 mark: changed biruni's hostname from biruni.wikimedia.org to biruni
13:30 mark: added avicenna and biruni to node_groups/apaches
13:00 mark: Restarted apaches on avicenna, alrazi and biruni with -DSLOW, and changed startup scripts
08:52 jeronim: blocked 61.48.105.65 spammer IP from all wikis using block-ip-all - so ipblocklist message will speak of "vandalism" instead of "spam"
08:25 jeronim: created chapter-l for mailman on mint

August 8

09:22 kate: enabled greylisting on mail.wm.org
20:54 hashar: readded srv2 (with ip x.x.0.1 ) to the apache pool
18:25 hashar: avicenna & biruni readded. Monitoring error log, #wikipedia and memory.
17:43 brion: added /mnt/upload mounts on avicenna and biruni
17:32 hashar: forgot sync-common on avicenna and biruni :/ I thought scap would do the job ... They are both missing the upload directory.
15:45 brion: stopped apache on avicenna and biruni pending more information on reported errors
15:36 hashar: TODO: biruni hostname seems wrong /etc/sysconfig/network list HOSTNAME=biruni.wikimedia.org whereas other servers just get HOSTNAME=zwinger or HOSTNAME=srv30 ...
15:36 hashar: removed srv1 from mediawiki-installation dsh file (as apache is not meant to run on it).
15:24 hashar: brought biruni back into mediawiki-installation pool
15:12 hashar: brought avicenna back into mediawiki-installation pool
14:30 hashar: started apache on srv11.
06:30 kate: moved mailing lists to mint. let's see if it starts sucking less.

August 7

20:50 brion: postfix hung zombified on zwinger, wouldn't restart automatically. had to remove master.pid and restart.
16:25 brion: installed DynamicPageList on wikiquote per [6]
15:50 brion: locked tlhwiki
07:47 brion: added application/ogg as mime type for ogg files on albert
00:59 brion: set localized logo for ptwiktionary

August 3

14:15 mark: Switched over upload.wikimedia.org to lighttpd instead of apache on albert
12:00 brion: added frankfurt city map to wikimania whitelist. whoops!

August 2

15:45 mark: Bound albert's apache to a single IP, instead of INADDR_ANY
09:40 brion: added wildcard subdomains for wiktionary.com redirection

August 1

22:30 all: samuel's disk filled up. Switched master to adler. Re-syncing samuel from suda.
14:50 mark: Put all kennisnet squids back into DNS, updated DNS on pascal

2000s

Archive 1: 2004 Jun - 2004 Sep
Archive 2: 2004 Oct - 2004 Nov
Archive 3: 2004 Dec - 2005 Mar
Archive 4: 2005 Apr - 2005 Jul
Archive 5: 2005 Aug - 2005 Oct, with revision history 2004-06-23 to 2005-11-25
Archive 6: 2005 Nov - 2006 Feb
Archive 7: 2006 Mar - 2006 Jun
Archive 8: 2006 Jul - 2006 Sep
Archive 9: 2006 Oct - 2007 Jan, with revision history 2005-11-25 to 2007-02-21
Archive 10: 2007 Feb - 2007 Jun
Archive 11: 2007 Jul - 2007 Dec
Archive 12: 2008 Jan - 2008 Jul
Archive 12a: 2008 Aug
Archive 12b: 2008 Sept
Archive 13: 2008 Oct - 2009 Jun
Archive 14: 2009 Jun - 2009 Dec