Server admin log/Archive 12

July 31

  • 23:27 mark: Installed db12 (146G drives) and db13 (72 G).
  • 21:45 RobH: srv101 back online.
  • 20:12 RobH: srv104 back online.
  • 10:17 mark: Disabled switchports srv101 and srv104.

July 30

  • 21:15 brion: wasted a lot of time there. most of the segfaults have stopped as mysteriously as they came. 101 still reporting some, but can't get into it anyway, it's pretty broken atm, should get rebooted. can't ssh in, but it's a memcached so don't want to just shut it off anyway without rearranging shit
  • 20:30 brion: seeing a lot of segfaults on apaches, trying to track it down
  • 20:08 mark: Installed Ubuntu on db11
  • 18:26 brion: mobile.wikipedia.org temporarily moved to yongle on backend
  • 17:59 brion: mobile.wikipedia.org (anthony) down
  • 17:30ish rob is moving servers around
  • 04:50 Tim: removed comcomwiki from all.dblist. Obsolete, internal.wikimedia.org is used instead.
  • 03:00 Tim: Firewalled srv101/srv104 from the slave servers as well, to prevent pollution of the revision cache.
  • 02:30 Tim: Report on the village pump of ongoing database corruption due to job queue runners on srv101/srv104 with bad database configuration. Tried for half an hour to work out how to disable their switch ports, eventually gave up when I couldn't work out how to log in to asw3-pmtpa (the bulk of the half hour was used in determining which switch srv101 is connected to). Instead, firewalled them from db2, ixia and adler.

July 29

  • 19:15 JeLuF: moved enwiki thumbs to storage1
  • 05:00 Tim: disabled NRPE arguments since it allows arbitrary shell commands to be invoked by any user with access to the NRPE port
  • 04:35 Tim: samuel appears to be up, started mysqld on it
  • ~4:00 Tim: installed NRPE on adler, ixia, thistle, webster
  • 03:50 Tim: ran yum update on holbach
  • ~03:30 Tim: installed NRPE on FC4 servers db2 and lomaria using custom built RPM.

July 28

  • 23:00 - 02:30 Rob & Tim: installed NRPE on storage1, storage2 and all ubuntu DB core DB servers. Installed NRPE from source on bart. Switched nagios to use NRPE for disk space monitoring on these servers.
  • 21:40 brion: seen a huge rash of errors where a non-object is passed to Title::equals(). Added a live hack exception to try to track it down, but it's not being logged in exception.log
  • 21:05ish brion: fixed the categoryfinder infinite recursion bug -- lots less segfaults on labs...
  • 20:00 jeluf & mark: Stopped all services on friedrich, preparing for decommissioning. If you miss anything, be quick.
  • 06:18 brion: mass panic erupts as update involves several fatal errors from calls in 'skins' to things not yet in 'includes', etc, increasing the pain level during the code update
  • 06:09 brion: stopped srv71 apache, weird php fatal errs in log ('missing LocalFile' etc)

July 27

  • 23:33 brion: db9 mysql 5 test server has at least some bad entries in user_group table
  • 19:45 JeLuF: reinstalled knsq30, back in the pool
  • 14:30 JeLuF: rr.yaseo is down, switched to DNS-scenario yaseo-down, investigating

July 26

  • 02:20 Tim: changed master on cluster17 from srv101 to srv102. Took srv101 out of ext. store rotation. Pre-emptive action before it dies completely.
  • 02:14 Tim: Deployed a new version of wmerrors. Segfaults started spewing out everywhere, so I disabled it. srv101 went down, took it out of rotation.
    • srv101 is a current ext store master, apparently it's still doing mysql, but it's segfaulting with apache and not responding on ssh

July 25

  • 23:00 Tim: Attempting to upgrade librsvg on srv37. Involves upgrading the system from FC3 to FC4.
  • 21:35 Tim: added squid ACLs to make /w/thumb.php go to the rendering cluster. The ubuntu apaches aren't set up properly for rendering. Also escaped the dots in the existing url_regex ACLs.
  • 20:40 RobH: Reinstalled sq21-sq30 with ubuntu 8.04.
  • 20:06 RobH: Reinstalled sq11-sq20 with ubuntu 8.04.
  • 19:14 RobH: Reinstalled sq1-sq10 with ubuntu 8.04.
  • 18:23 RobH: replaced /c1/p9 in storage2 and put in the rebuild for the array.
  • 18:17 RobH: replaced /c0/p3 and /c0/p6 in amane and put in the rebuild for the array.
  • 08:31 Tim: disabled UsernameBlacklist, nasty regexes cause the servers to crash. By some reports, 8/10 of account creation requests were failing.
  • 07:36 Tim: temporarily enabling core dumps on all apache servers, to see if I can track down the abort()s we're seeing

July 24

  • 20:29 RobH: knsq23-knsq29 reinstalled to 8.04.
  • 16:45 brion: fixed ownership on new dirs on upload4 -- fixes upload problem 14906
  • 16:18 RobH: knsq21-knsq22 reinstalled to 8.04.
  • 15:23 RobH: knsq16-knsq19 reinstalled to 8.04.
  • 13:45 Tim: installed ganglia on storage2, added storage1 and storage2 to nagios (but no working services), freed up a little bit of space on storage2 (was full)
  • 13:06 Tim: returned srv128 to the pool
  • 07:53 Tim: srv128 showing spurious OOM errors. Took it out of rotation (DID NOT RESTART), so that I can have a look at it with gdb in a couple of hours when I have time.

July 23

  • 21:47 RobH: knsq11-knsq15 reinstalled to 8.04.
  • 21:10 RobH: knsq1-knsq10 reinstalled to 8.04.
  • 20:43 brion: "fixed itself" -- may have been overload on amane/storage2, or from those slow procs on srv38, who knows. seems resolved now
  • 20:30 brion: massive thumbnailing errors on commons (thumb servers return 500 err)
  • 16:17 RobH: sq36-sq40 reinstalled to 8.04.
  • 15:51 RobH: sq31-sq35 reinstalled to 8.04.
  • 14:44 RobH: sq46-sq49 reinstalled to 8.04.
  • 14:00 RobH: sq41-sq45 had wrong lvs ip, fixed.
  • 13:50 brion: sighted but non-stable pages on de were being marked as 'noindex,nofollow' due to a logic buglet in FlaggedRevs. FlaggedArticle.php has been updated, but they'll be in cache. Sigh.

July 22

  • 23:35 RobH: reinstalled sq41 - sq45
  • 20:43 Tim: increased the number of job runners from 10 to 39.
  • 19:20 Tim: fixing redirect table on all wikis
  • 19:10ish brion: redirected donate.wikimedia.org to [1]; blank drupal page confuses people and davidstrauss didn't seem interested in fixing it to look nice
  • 19:05 RobH: srv65 kernel panic, reboot.
  • 19:00 RobH: srv78 kernel panic, reboot.
  • 15:37 mark: Upgraded yf1019 and bayle to Ubuntu 8.04 hardy.

July 21

  • 21:00ish brion: enabled collection extension, now with proper temp file usage, on test/labs
  • 15:57 mark: Deployed srv150 as an Ubuntu 8.04 Hardy application server for testing (it's pooled)

July 19

  • 11:03 mark: Pages containing timelines were giving HTTP 500 errors since some recent sync; reverted the live timeline extension to r31101 for now.
  • 07:30 Tim: setting up authentication and write commands on nagios.wikimedia.org

July 18

  • 08:00 brion: fixed img_auth on SSL sites
  • 07:35 Tim: removed srv61 from memcached rotation, is down. Added some servers to the mc-pmtpa.php spares list.
  • 07:19 Tim: cleaned up binlogs on adler, was 14GB free, now 84GB free

July 17

  • 13:45 brion: disabled collection again, there's stupid race conditions in the extension's CURL usage. wtf guys
  • 13:33 brion: got pediapress collection thingy at least sort of working. WSGI self-hosting just doesn't work at all. Fought a lot with Python's horrible eggs to tweak up all its awful ugly directories so the damn modules import. Enabled on test only atm. There's a warning that it causes segfaults with category loads.

July 16

  • 09:36 river: shut down HGTN peering, excessive failure to route packets anywhere useful

July 15

  • 22:00ish brion: popped in briefly to do code review and update live code. seems non-exploding so far

July 14

  • 19:00 mark: Built a test setup for my new LVS kernel module: mirrored all live traffic to alrazi (upload.wm.org) to lvs3 in a separate test VLAN with no outside connectivity
  • 06:00 Tim: installing mysql on db1 using the source tree in /home/midom/mysql/server
  • 05:15 Tim: samuel down, removed from rotation

July 12

  • 16:13 brion: trimming more stuff from storage2 (upload old versions, timelines not vital to backup)

July 11

  • 19:45 mark: srv78 went down again
  • 12:45 mark: srv39 (image rendering cluster) was out-of-memory and image rendering was mostly down, restarted apache to clear it up

July 10

  • 22:08 mark, rob: Reinstalled yaseo servers yf1000 - yf1017 with Ubuntu 8.04 Hardy and installed squid 2.6.21 (now in the repository). Switched back traffic
  • 21:48 brion: srv144 has been reporting 'read-only filesystem'. Have shut it down remotely.
  • 17:59 mark: srv37 is out of date and giving internal server errors, depooled it in LVS
  • 17:29 mark: Brought sq5 back up with a newly built squid-2.6.21 deb (hardy only), not yet in the repository.
  • 14:25 RobH: srv46 apache process was not responsive, synced it and restarted it.
  • 14:20 RobH: srv36 rebooted due to PDU swap, HDD died upon reboot.
  • 14:20 RobH: Swapped out the PDU for the switch feeds in C4.

July 9

  • 19:07 brion: srv37 appears to have a db.php from february -- something very wrong. (it's a scaler)
    • it's commented out of mediawiki-installation, so has not been receiving any scaps since! this probably explains the intermittent image thumb breakages
  • 16:13 brion: fixing mobile.wikipedia.org dns entry
  • 16:00 jeluf: unpooled knsq30, setting up some performance and config tests with OSM.
  • 08:34 mark: yaseo text squids under strong memory pressure again, moved traffic to pmtpa until I have time to look into it

July 8

  • 14:31 RobH: yf1016 reinstalled.
  • 14:08 RobH: yf1015 reinstalled.
  • 10:18 Tim: removed the firewall rule on adler which was preventing srv31 from connecting to it.
  • 8:45 mark: yaseo text squids ran out of socket memory

July 7

  • 21:49 RobH: reinstalled yf1010(mark), yf1011, yf1012, yf1013
  • 18:51 RobH: db1 reinstalled.
  • 18:07 brion: restarting apache on srv135, let's see if it continues
  • 18:04 brion: huge rash of 'out of memory in WebStart' on srv135; shutting off its apache.
  • 09:24 mark: yf1000 has no SSH host keys and appears not to have a root password set; inaccessible, needs reinstall. Blocked its Squid process by sysrq force unmount.
  • 09:15 mark: yf1001 had no SSH host keys, created them, rebooted it, synced, back online
  • 08:40 mark: yaseo text squid cluster in trouble, run into swap. Decreased cache_mem to 1000 and restarted backends

July 5

  • 22:45 domas: added gmond on db7, thistle, db4. stopped db10 for data copy to db7, trying to make db7 to rotate two hourly snapshots
  • 22:00 brion: KNAMS seems to have fallen off of ganglia; pascal is down

July 3

  • 21:51 brion: finally got the damn software updated. yay!
  • 20:23 mark: Added AAAA record for hume.wikimedia.org (and therefore, static.wikipedia.org).

July 2

  • 15:29 RobH: srv124 crashed, back online.
  • 15:28 RobH: srv78 kernel panic, back online.
  • 15:27 RobH: srv107 back online and in lvs.
  • 15:11 RobH: srv107 broken, rerunning the bootstrap.
  • 14:52 RobH: db7 back online, reinstalled, replaced network cable. Needs db setups.
  • 03:21 Tim: deployed wmerrors extension, see wikitech-l

July 1

  • 22:45 domas: Collection was causing Categoryfinder to go into infinite recursion, thus causing segfaults. I need stickers 'I love Magnus' :)
  • 18:45 mark: Implemented the Cloud.
  • ~13:00 Tim: resynced /usr/local/dsh/node_groups/apaches and dalembert:/usr/local/etc/apaches. Symlinked node_groups/apaches_pmtpa -> apaches. Resynced nagios accordingly. Deleted the "Apaches 1 CPU" group from gmetad. Reassigned srv2 to the miscellaneous group, stopped apache on it, removed apache-specific lines from /etc/rc.local.
  • 8:00 jeluf: stopped apaches on srv101 and srv112, they are segfaulting several times per second.

June 30

  • 19:00 domas: restarted few segfaulting httpds, as well as few using additional number of cpu cycles, started few httpds on boxes that didn't have httpd running. killed stale child processes yet again.

June 29

  • 18:45 jeluf: srv107 depooled, needs a reboot.
  • 18:42 jeluf,rob: Switch in rack C4 powered up again, rack reachable again
  • 18:30 jeluf: added 10 more memcached candidates to mc-pmtpa.php
  • 18:00 jeluf: Rack C4 outage again, like last sunday. Contacted Powermedium. Shut down postfix on bart. going to change memcached.

June 27

  • 07:39 Tim: killed srv129 with sysrq-trigger, was segfaulting

June 26

  • 22:38 Tim: Set mysql root password on srv179 (OTRS), was blank, now standard
  • 18:34 RobH: srv140 needed a poke, apache had crashed.

June 24

  • 20:20 brion: unmounted dead albert dirs from srv31.
  • 20:15 brion: poking at storage2 -- disk full again. trimming some old stuff
  • 20:14 RobH: Synced and restarted apache on srv31

June 22

  • 23:10 brion: updated static dump robots.txt to exclude 'new' and 'downloads'
  • 17:48 mark: All servers reachable again, reinstated Exim config on mchenry, repooled apaches
  • 15:07 mark: Added temporary hack to Exim config on mchenry to have it not check the OTRS db for address existence, but just defer addresses not accepted earlier
  • 14:54 mark, domas: Installed new memcached instances on srv151..168, replaced all down instances, depooled down servers in apache LVS
  • 14:20 mark, jeluf: One rack in Florida seems to be down. Mark informs hostway. OTRS can't reach its DB, postfix on bart stopped.

June 21

  • 17:00 brion: added 'shared' dir on *.planet.wikimedia.org for shared styles/icons/etc
  • SSL cert on the IMAP server has expired.

June 20

  • 23:58 brion: experimentally enabled pdf download (special:collection pediapress thingy) on test.wikipedia.org. Still some major problems with this system; many pages don't render properly (eg, at all)
  • 21:13 Tim: recompiled ircd with nicklen=20 and maxclients=10000
  • 20:30ish brion: tweaked dir permissions on all top-level upload dirs so the automatic subdir creation for new wikis should work consistently
  • 18:05 brion: syslog was broken on suda -- nothing going to /var/log/messages. restarted syslog, seems ok now.
  • 17:55 brion: leuksman.com was offline for several hours. Sago sends automatic hourly emails "your server doesn't respond to ping!" but doesn't reboot it until you ask them. yay!

June 19

  • 21:15 Tim: spike on enwiki DB servers, possible outage, blocked offending client in squid
  • 16:15 RobH: db5 hdd replaced, reinstalled, needs setup.
  • 16:00 RobH: sparky pulled and boxed for return.
  • 15:28 RobH: amane disk replaced, raid needs rebuild.

June 17

  • 19:28 RobH: srv133 back online.
  • 18:54 RobH: srv51 moved and re-racked, back online.
  • 17:33 brion: srv133 hanging, probably 'read-only filesystem' or similar. shutdown via sysrq
  • 17:00 RobH: moved srv126 - srv123 to B1
  • 16:38 mark: Made catchall aliases work for CiviCRM on mchenry, added MX records to donate.wikimedia.org.
  • 15:40 RobH: srv76 shutdown due to primary HDD failure.
  • 15:35 RobH: srv141 shutdown briefly to move into its new home in B1.
  • 15:30 RobH: srv142 shutdown briefly to move a switch out of rack. Back online.

June 16

  • 14:52 mark: Moved test.wikipedia.org from srv3 to srv35 so the former can be decommissioned.

June 12

  • 20:19 RobH: srv27 shutdown per rainman, as it's not working and needs to be decommissioned.

June 11

  • 19:45 domas: srv101-srv109 are part of ES duty as cluster17-19. all raid1, myisam, set up by http://p.defau.lt/?GA97Jegyef8uQGQJTTFILg
  • 18:00 RobH: srv141 rebooted, back online.
  • 17:04 mark: Moved rendering cluster LVS from avicenna to alrazi so avicenna can be decommissioned
  • 12:22 mark: Upgraded khaldun to Ubuntu 8.04 hardy
  • 11:54 mark: Shut down albert's switchport in preparation for decommissioning it
  • 11:32 mark: Changed static.wikipedia.org to point to hume, removed Squid configuration for it
  • 11:08 mark: Moved mirror.pmtpa.wmnet (Fedora mirror) to a proxy setup on khaldun so we can retire albert
  • 03:42 Tim: Switched back to Preprocessor_DOM, uses 4 times less memory on Chicago test case

June 10

  • 19:56 brion: disabled spam blacklist on internal private/fishbowl wikis so jay can use tinyurl *rolleyes*
  • 16:37 RobH: srv78 rebooted, back online.
  • 00:01 brion: added a generic wgReadOnlyFile setting to all wikis in 'closed' group that don't have one specifically set.

June 9

  • 20:50 brion: removes srv81 from cluster6 db rotation. downed slave was causing timeouts in transwiki imports due to crappy failover in ES connections
  • 19:04 brion: re-cleared localnames and globalnames tables to fix CentralAuth unattached lists. Only localnames had been cleared, but nothing got lazy-populated since there was already a globalnames entry.
  • 16:44 brion: srv141 rebooted (not online yet)
  • 16:39 brion: srv141 read-only filesystem
  • 16:30ish brion: setting up private collab.wikimedia.org

June 8

  • 16:00 - 20:00 mark, gmaxwell: IPv6 AAAA reachability testing, which lets us determine how much would break if we'd put an AAAA record on our main hostnames
    • Installed Hardy on iris
    • Set up lighttpd on v4 and v6
    • Added hostnames ipv6.labs.wikimedia.org, ipv4., ipv6and4. and results. to DNS
    • Put a modified version of this script in [[w:en:Special:Watchlist]]'s javascript
  • 14:00 domas: db10 died (02 disk went 'foreign', after recreating array came back up), db1 died few hours before, telling about dimm7 errors. few MCE events complained about L2 cache though.

June 7

  • 09:21 Tim: /mnt/ganglia_tmp on zwinger was full, fixed.
  • 02:00 Tim: started new static HTML dump on hume

June 6

  • 12:00 - 12:45 mark: Network migration and firmware update. Split vlan 100 into 100 (Squids/LVS) and 101 (public services). Reloaded csw5-pmtpa with newer image.
  • 12:00 RobH: srv147 hdd dead, powered down pending rma.

June 5

  • 21:45 RobH: ssl enabled on srv9
  • 02:34 Tim: Danny B. asked me to delete the entire archive for wikimediacz-l, so I moved the private archive directory to wikimediacz-oldprivate, moved the mbox to wikimediacz-oldprivate.mbox, recreated a blank mbox, and regenerated the empty index with "arch".

June 4

  • 22:56 brion: updated PHP didn't change things. shutting srv134 back off
  • 22:51 brion: srv134 still giving mystery errors. reinstalling PHP
  • 19:43 RobH: sq5 reinstalled.
  • 19:40 RobH: srv134 memory tests passed, restarted.
  • 19:34 RobH: db1 reinstalled with Ubuntu 8.04.
  • 18:45 RobH: db1 mainboard and cpu replaced.
  • 17:30 RobH: hume memory upgraded from 2 GB to 8 GB.
  • 07:38 Tim: TorBlock extension enabled
  • 02:35 Tim: cleaned up binlogs on srv139

June 3

  • 22:16 brion: started two more dump threads to keep things churning while the big boys run
  • 20:46 mark: Brought knsq28 back up after its broken disk has been replaced

June 2

  • 20:24 brion: refreshing category counts for enwiki > 'Unprintworthy_redirects', was last one that was reached in the original batch process. Some inconsistent counts reported in the Ws.
  • 09:40 robert: removed srv30 from search_pool_1 on diderot. decommissioned and seem to cause balancer problems
  • 02:59 brion: manually depooled srv30 from diderot's LVS balancer for search... again... why isn't pybal taking care of this?

May 30

  • 19:36 domas: depooled db9 -- mysql 5 test server had bad dataset, missing revisions
  • 16:19 brion: added more wikis to closed.dblist -- somebody started closing things by hardcoding wgReadOnly instead of setting lock files, so they weren't seen on the first pass. Sigh.
  • 15:00 RobH: yongle updated to Ubuntu 8.04.

May 29

  • 19:33 brion: disabled "apc.stat=off" in php.ini on srv3 and srv2. srv3 was breaking updates to InitialiseSettings.php on test.wikipedia.org
  • 17:42 RobH: srv146 fsck and back online.
  • 17:23 brion: shutting down srv134, has been reporting mysterious "Possible integer overflow" errors which may indicate bad RAM, corrupt software, or some other mystery problem
  • 16:30 domas: upgraded db1 to hardy, datadir severely desynced, will work on it later

May 28

  • 21:09 brion: excluded closed wikis from CentralAuth to avoid interference. closed.dblist provides a "closed" group for InitialiseSettings.php
  • 17:23 brion: reenabling $wgCentralAuthCreateOnView for now
  • 16:28 RobH: srv150 to srv130 now have IPMI connected via msw-b1-pmtpa
  • 15:59 RobH: srv67 to srv80 moved off asw2-pmtpa to asw-b2-pmtpa
  • 15:29 RobH: srv78 kernel panic, rebooted.
  • 15:29 RobH: srv79 accidentally rebooted.
  • 15:17 RobH: srv130 kernel panic, back online. Synced back into cluster in db.php.
  • 14:47 RobH: srv146 was shutdown in rack? Back online, cluster16 back online.
  • 14:32 RobH: db1 dimm4 swapped, back online.
  • 02:31 Tim: Site went down briefly due to ext store master srv139's disk filling up. Fixed.

May 27

  • 23:50 brion: disabling local account autocreation on page view for now (controlled via $wgCentralAuthCreateOnView); it's too darned annoying
  • 23:11 brion: running batch-updates of all globalname/localname records (some missing entries are messing up due to lazy initialization)
  • 22:30 brion: CentralAuth tweaks: disabled centralauth special perms for meta sysops (should just be bcrats for now); fixed global cookies for SSL
  • 20:18 brion: reverted Whatlinkshere to state of r35370; was broken by use of subqueries in 35371 and following updates.
  • 19:00 domas: db9 is in persistent state of testing, upgraded to hardy, 5.0.64, etc. not for production use yet.
  • 18:04 RobH: renewed secure.wikimedia.org cert on bart
  • 15:40 domas: srv146 is down, cluster16 master, disabled cluster16 for now.

May 26

  • 21:30 brion: tweaked up wap portal to fix JPEG images (PHP recompile -- missing JPEG libraries) and have a safer, more reliable image loading method
  • ~06:15 - 08:15 Tim: created wikis listed on bugzilla:13264 and bugzilla:14252.

May 25

  • 01:44 brion: upgraded wap portal server to PHP 5.2.6, seems to fix some crashing cases

May 24

  • 23:05 brion: updated wap portal so the search engine actually works
  • 19:51 brion: freeing up some space on storage2, restarting dump runners
  • 19:33 brion: storage2 full; poking...
  • 19:29 brion: remounted storage2 on the dump runners; NFS got broken by the reboot and stuck the clients
  • 11:43 mark: Whoops, I sent UAE upload to rr.knams a few days ago... corrected.

May 23

  • 21:42 brion: disabled write api on testwiki for now... it seems to accept edits over GET which is very.... not good.

May 22

  • 21:11 brion: ran migrateStewards on meta -- global steward flags active
  • 17:49 mark: Upgraded storage2 from Ubuntu 6.10 to 8.04 and rebooted it
  • 17:14 brion: stuck 79.115.44.59 in the proxy blocker set for now (mass link spamming)
  • 17:06 brion: commonswiki dump was stuck again (was on old stickable code, iirc). killed the fetch thread.... seems to be stopped/over/eh?
  • 05:00 brion: enabled SimpleAntiSpam for test logging

May 21

  • 19:09 brion: hacked bugzilla templates to avoid the big scary error when you log in after a password reset. got one too many complaints about the damn thing. :D
  • 18:22 RobH: srv133 went read-only. Ran FSCK to correct errors, restarted.
  • 18:20 RobH: srv149 back online, no issues in fsck.
  • 15:11 mark: Upgraded GnuTLS on Exim servers, restarted Exim, replaced ssh keys for mail sync jobs

May 20

  • 22:51 mark: Shutdown srv133's switchport

May 19

  • 19:52 brion: $wgEnableWriteAPI on testwiki
  • 15:52 brion: poking at srv133, read-only filesystem
  • 15:40 brion: poked stuck dump thread (commonswiki)... had a stuck fetchText subprocess. (should be fixed in svn now)
  • 02:40 brion: took srv149 off network (ifconfig eth2 down :D)
  • 02:10 brion: disabled $wgShowUpdatedMarker for now, it seems wonky
  • 01:33 brion: recovered some disk space from rabanus, but seem unable to account for a bunch of space o_O
  • 01:00 brion: srv149 borked with read-only filesystem. srv150 borked in some unspecified way (login problems). rabanus disk full

May 18

  • 8:20 JeLuF: Enabled the replaced disk on amane so that the disk is also being used.

May 16

  • 17:12 RobH: db1 memtest passed, no memory errors reported, but OS detects memory errors! As OS tests better than memtest, RMA has been placed for memory in slot 4. Server currently offline.
  • 17:00 RobH: srv130 kernel panic. Restarted.
  • 17:00 RobH: srv101 had a stalled apache, synced and restarted.

May 15

  • 23:55 brion: commented out down srv130 and srv127 from cluster13. failover to srv129 was working, but vveerrryyy slowly. [2] was taking 9s for a handful of revisions; this broke transwiki special:import due to our strict HTTP timeouts internally
  • 20:16 brion: got cs.planet actually working now
  • 18:27 brion: setting up cs.planet.wikimedia.org DNS, adding it to generation/update list
  • 14:22 mark: Upgraded ssh on mayflower so weak keys are detected and denied access

May 14

  • 21:10 brion: restarted lsearchd on maurus, was borked -- is currently the only non-main-namespace search server in enwiki cluster, so it broke some searches. also restarted sshd, which was mysteriously on the fritz and rejected rainman's logins
  • 17:27 RobH: replaced disk in amane
  •  ?? mark: restarted pybal on diderot
  • 16:55 brion: manually depooled srv30 from LVS on diderot. Either pybal isn't being used to do health checks on search pools or it's not working.
  • 16:50 brion: per report on WP:VPT, seem to be seeing a lot of search failures on enwiki. investigating

May 13

  • 22:30 brion: logevents api back on, allegedly fixed
  • 21:39 brion: putting flaggedrevs back on. api has been reenabled, with logevents query disabled by a nice clean exception
  • 21:26 brion: scapped everything up to date, flaggedrevs still off. DISABLED API due to still having bad queries
  • 20:58 brion: db5 and webster up to date. samuel in process...
  • 20:53 brion, jeluf, domas: we managed to (hopefully) reconstruct the broken statement from adler binlog file, and have got db5 resyncing up from that point. if it seems to be going well we'll put the other two s3 slaves in shortly. (s3 still in read-only)
  • 20:21 brion: removed a bunch of bogus old keys for srv150-srv170 from zwinger's /etc/ssh/ssh_known_hosts file. hopefully will clear up the broken sync issues
  • 20:15ish brion: disabled flaggedrevs; old patch is bogus. still busy with other problems before resyncing to current
  • warning: new db.php config doesn't appear to allow marking a cluster as all read-only. this is being a problem for maintenance
  • 20:00ish jeluf, brion: replication is broken on s3; corrupted binlog. investigating
  • 19:30-50ish brion: some fun times w/ DB overload as bad API queries flooded DBs. reverted updated code to r34539 plus domas's patch for flaggedrevs. still having some ssh key problems, want that sorted out before tackling it again
  • 18:37 RobH: Upgraded packages on yf1000-yf1009 and regenerated keys.
  • 17:33 mark: Ran weak key detection script on mayflower and did chmod 000 on matching authorized_keys files - Brion will contact these users and ask for a new key
  • 17:31 RobH: Upgraded packages on knsq1 - knsq30, and regenerated keys.
  • 17:18 RobH: Upgraded packages on db5, db8, db10, mchenry, sanger, srv8, srv10, srv151-srv170, sage, mint, mayflower, fuchsia, and regenerated keys.
  • 15:50 RobH: Upgraded packages on bayle, did not regenerate key, as it used to be 6.10 with original key generation.
  • 15:49 RobH: Upgraded packages on webster, adler, ixia, thistle, hume, db1, db3, db4 and regenerated keys.
  • 15:33 RobH: Upgraded packages on isidore and regenerated keys.
  • 14:51 RobH: Upgraded packages on sq46 through sq50 and regenerated keys.
  • 14:46 RobH: Upgraded packages on sq31 through sq45 and regenerated keys.

May 12

  • 22:20 brion: wikimania2009.wikimedia.org and se.wikimedia.org set up
  • 19:19 brion: adding DNS stub for se.wikimedia.org
  • 17:02 brion: query.php now consistently disabled when current API is

May 11

May 9

  • 18:05 brion: load spike due to unindexed API recentchanges queries. reverted circa last day's changes to API rather than dig around in there

May 8

May 7

  • 19:55 brion: srv30 read-only filesystem; needs to be taken out and examined for disk problems. (I shut it down for the time bein')
  • 17:50 brion: created new global groups tables; trying full scap again
  • 17:40 brion: site broken by CentralAuth upgrades which silently added use of a table that's not present. reverting pending addition of tables

May 6

  • 19:40 brion: fixed bugzilla upgrade
  • 00:15ish tim: activated FlaggedRevs on dewiki again

May 5

  • 00:00ish brion: restarted leuksman.com; server was down most of the day (Sago sends you an email every hour your server is down, but doesn't reboot it until you ask :)

May 3

  • 14:01 mark: Restarted lsearchd on maurus
  • 13:56 mark: Upgraded will to Ubuntu 8.04 Hardy
  • 01:12 brion: disabled flaggedrevs on dewiki. Some problems with reviewed pages list and the UI disrupting page layout, which need to be resolved before we put it back on.
  • 00:55 brion: starting up test deployment of flaggedrevs on dewiki. FlaggedRevs config is in flaggedrevs.php. Disable the section if it's causing trouble over the next couple days!
  • 00:34 brion: clearing off old log files from maurus again. we need some log rotation or to dump those logs :)
  • 00:24 brion: maurus out of space again
  • 00:15 brion: updating test, labs wikis for separate flaggedpages table. debugging some deployment issues

May 1

  • 23:10 brion: removed commons' foreign repo config from itself, so we don't get dupe file warnings :)
  • 22:50ish brion: reenabled newpages user filter for non-affected wikis. index use is right now \o/
  • 19:15 RobH: srv149 rebooted.
  • 19:00 RobH: srv36 rebooted.
  • 18:33 RobH: srv78 kernel panic. Rebooted.
  • 00:08 brion: switching it back off, doesn't seem right.... insanely slow
    • The index code has a typo, forcing it to use one of two bad indexes ;) Aaron 13:24, 1 May 2008 (UTC)
  • 00:05 brion: putting newpages username search back except for the four wikis affected by bad rc_user_text indexes; wgEnableNewpagesUserFilter is off for them

April 30

  • 18:35 brion: added $wmgUseDualLicense switch to InitialiseSettings.php. Set this to true for new wikis which should be created with the GFDL+CC-BY-SA 3 dual-license mode to set their default copyright link properly.

April 28

  • 18:00 brion: turned off the double-diff-then-log (no hits since saturday, yay). turned on a logging log to check issues with updated log code
  • 15:00 jeluf: mysqld on db5 hanging. Couldn't shut it down or even kill -9 it. Had to reboot the box. mysqld is currently recovering.

April 26

  • 21:56 brion: bad diff logging indicated that problems were only on fc4 apaches. possibly a c++ version mismatch? recompiled wikidiff2 RPMs fresh on fc3 and fc4; upgraded the fc4 boxes, log's stopped dead. so far so good
  • 21:38 brion: cleared old mwsearch indexes off rabanus, resyncing mw inst.
  • 21:23 brion: bumped diff cache version to force diff regeneration
  • 21:20 brion: enabled bad diff hack -- runs every diff twice, logging in baddiff.log if they don't match. (the shorter text is then returned, which may reduce the incidence of visible diff errors)
  • 21:17 brion: rabanus disk full
  • 17:50 jeronim: on db2 in ntp.conf, changed "restrict 66.230.200.234 nomodify notrap" to "restrict 208.80.152.189 nomodify notrap" to match the "server 208.80.152.189" line below it. Output from ntpq -p looks much better now, showing an IP address in the refid column instead of ".RSTR."
  • 16:53 mark: Installed lvs2, lvs3 and lvs4 for testing
  • 15:58 mark: Installed Ubuntu 8.04 on lvs1 for testing
  • 15:58 mark: Ubuntu 8.04 Hardy Heron installs are now possible on all VLANs
  • 12:09 jeronim: did /etc/init.d/ntpd restart on db2 which fixed clock offset of about 6 seconds; underlying problem not fixed
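
The double-diff check described in the 21:20 entry can be sketched roughly as below. This is a minimal illustration and not the deployed hack: the function names, the "baddiff" logger name and the difflib stand-in engine are assumptions made for the example (the live code wrapped wikidiff2 and wrote to baddiff.log).

```python
import difflib
import logging

# Hypothetical logger standing in for baddiff.log.
log = logging.getLogger("baddiff")


def python_diff(old, new):
    """Stand-in diff engine (assumption); the real hack wrapped wikidiff2."""
    lines = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    return "\n".join(lines)


def checked_diff(old, new, diff_fn=python_diff):
    """Run the diff twice; on mismatch, log it and return the shorter result."""
    first = diff_fn(old, new)
    second = diff_fn(old, new)
    if first != second:
        # A deterministic engine should never get here; a mismatch points at
        # a broken build or memory corruption on that host.
        log.warning("diff mismatch: %d vs %d chars", len(first), len(second))
        return min(first, second, key=len)
    return first
```

Returning the shorter output follows the entry's reasoning that the shorter text is less likely to be the visibly broken one; it is a heuristic, not a guarantee of correctness.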

April 25

  • 20:09 mark: lily under extreme load, investigating
  • 19:28 brion: added 'Vary: Cookie' HTTP header to blogs... don't know if it'll do a damn thing, I can't even clear things from these squids using the normal methods
  • 18:33 brion: upgraded blog.wikimedia.org and whygive.wikimedia.org to wordpress 2.5.1
  • 17:13 brion: fixed MWSearch extension to use Http::get() instead of file() to hit the backend. This should resolve the load spikes we've been seeing around 7:30-8:00 UTC daily; the servers slow down while indexes are being resynced, and the long default timeouts caused things to back up on the front end instead of failing out gracefully.
  • 16:55 brion: upgraded utfnormal extension on srv42 so dumps will work again. (note that dumpBackup.php no longer works when autoselecting database connections, probably a bug due to the new load balancer. works in live use as a server is explicitly passed on command line.)
  • 07:56 river: /var/lock on lily became full from the spamd bayes database; moved it to /var/spamd. expired the old bayes database because its size was causing spamd to be very slow (30+ seconds per mail).

April 24

  • 05:50 Tim: fixed wikidiff2 on fedora apaches, was missing since 5.2.5 upgrade.

April 23

  • 19:54 brion: restarted apache on bart (secure proxy), seems happier
  • 19:50 brion: secure.wikimedia.org connections hanging
  • 00:25 brion: resynced db2's clock; was 7 seconds slow, causing all s1 slaves to think they were lagged, causing all enwiki job runners to sit waiting for them to catch up

April 22

  • 00:42 brion: enabling wgEnableMWSuggest globally for a few minutes to evaluate DB impact

April 21

  • 23:18 brion: enabled $wgCookieHttpOnly -- new session & user token/name/id cookies should be sent HttpOnly, so supporting browsers won't expose them to JavaScript as an additional protection against some categories of XSS
  • 23:10 brion: upgrading php on srv141, was down during 5.2 updates
  • 22:26 brion: got a report of a commons image with missing archive versions. Files are present on upload4 but not on upload3... which is odd because as far as I can tell, only the thumbs are used on upload4 for commons. Why is there a full copy of commons, and why don't they match?
  • 22:13 brion: getting lots of complaints from scap about time sync. clock offsets mostly <1s but some >3s
  • 20:55 RobH: lvs3 and lvs4 racked and remote access enabled.
  • 19:44 RobH: db4 reinstalled.
  • 19:44 RobH: lvs1 and lvs2 racked and remote access enabled.
  • 19:26 RobH: thistle reinstalled.
  • 17:30 RobH: db1 unresponsive, rebooted.
  • 17:30 RobH: racked srv141 and brought back online

April 20

  • 15:15 mark: squid on khaldun had disappeared due to an upgrade a few days ago, and dependency conflict with the Wikimedia packages
  • 13:00 mark: Depooled srv2 and srv4, the only remaining 32 bit apaches in rotation.

April 19

  • 18:00 mark: srv133's time was off, corrected

April 18

April 17

  • 21:50 brion: lowering db4 priority from 150 to 50; still loaded
  • 21:10 brion: lowering db4 priority from 200 to 150; seems very highly loaded compared to db3 with same priority
  • 20:25 RobH: Relocated srv136 & srv135.
  • 19:40 RobH: Relocated srv137
  • 19:25 RobH: Relocated srv138. Put ext store cluster 14 back in service.
  • 19:19 brion: applying pt_title encoding fixes
  • 18:59 RobH: Relocated srv141, srv140, srv139, srv138.
  • 18:50 RobH: Removed ext store cluster 14 from active use.
  • 18:44 mark: Removed AAAA record on khaldun.wikimedia.org, apparently apt doesn't even try v4 when it has a proxy hostname with an AAAA record and a v6 route is not available.
  • 18:44 mark: Fixed httpd on pascal
  • 18:20 brion: fixed ganglia reporting knams -> pmtpa (old zwinger IP in trusted list on pascal); detail reporting still down due to broken httpd on pascal
  • 18:10 mark: Fixed MySQL group in ganglia by making ixia an aggregator again
  • 18:11 RobH: srv143 and srv142 relocated.
  • 17:58 brion: enabled search suggestion drop-down on testwiki
  • 17:13 RobH: srv144 relocated.
  • 17:00 RobH: srv145 relocated.

April 16

  • 22:46 brion: enabled TitleKey extension, search suggestions, and HttpOnly cookies on wikitech
  • 21:40ish brion: hopefully fixed the php5.1 bug with global sessions on secure.wikimedia.org
  • 21:21 RobH: srv150 relocated.
  • 21:11 RobH: srv149 relocated.
  • 21:06 brion: enabling global sessions on secure.wikimedia.org
  • 20:57 srv148 relocated.
  • 20:47 brion: restarted data dumps on srv31 and srv42
  • 20:45 srv147 relocated.
  • 20:31 brion: cluster16 back in rotation; tim restarted mysql
  • 20:25 brion: rash of complaints of db errors due to srv146 being out (cluster16 ES master). took cluster16 out of $wgDefaultExternalStore while it's being fixed
  • 20:11 RobH: srv146 relocated.
  • 16:52 brion: fixed ticket.wikimedia.org redirect to otrs
  • 10:50 brion: got a mystery SMS complaining of 5-minute lag on dewiki

April 15

  • 23:55 brion: giving planet its own little user account :)
  • 22:24 brion: PMTPA databases, all KNAMS, and all YASEO are missing from ganglia and have been for a while. What's going on?
  • 19:00 mark: Cleaned up csw5-pmtpa's config, added BGP inbound filtering on prefix lists and known bogons
  • 17:35 brion: rc_user_text index is missing from frwiki, nlwiki, plwiki, and svwiki. Special:newpages was using it in some cases; have disabled the index and the username lookup feature for it pending fixes.
  • 00:25 brion: updated SpecialNewpages.php to tweak index forces per domas's request; new pages was causing some sort of problem

April 14

  • 23:55 brion: gettin' ready to svn up! applied flaggedrevs_promote table on test & labs, and the centralauth gu_token field
  • 19:10 brion: restarted IRCD, was hanging mysteriously
  • 16:25 RobH: srv130 synced and apache restarted.
  • 16:00 RobH: srv0 and benet powered down pending drive wipe for decommissioning.

April 13

  • 10:46 Tim: pybal on diderot was depooling servers due to name lookup failure (timeout). Traced the problem back to nscd and restarted it, that fixed it.

April 12

  • 00:15 brion: robots.txt may or may not be fixed for blog.wikimedia.org; some kind of freakish default, probably from wordpress 404 handling, redirected it to robots.txt/ (with final slash) which disallowed all by default apparently (?!). added a plain file... but caching is still taking the redir that i can see
  • 00:10 brion: sql script doesn't work for non-wiki dbs such as 'centralauth' and 'oai' at the moment; lookup fails
  • 00:02 brion: setting up sr.planet.wikimedia.org

April 11

  • 12:48 mark: Discovered that lighttpd does not allow caching of unknown content-type responses. amane was serving quite a lot of unknown content types, which were consequently not cached by the Squid clusters. Fixed this by adding a lot of content types to lighttpd.conf, as well as a default content-type in case any are missed.

April 10

  • 21:30 jeluf: fixed nagios' conf.php, to reflect the latest db.php changes.
  • 16:50 brion: restricted wfNoDeleteMainPage to enwiki which I presume it was added for. It's a huge nuisance for other wikis which quite legitimately are rearranging their content.

April 9

  • (all day) mark: Restarted various daemons on lots of servers to get DNS resolver libs to use the new DNS IPs (mostly nscd, apache, some mysql)

April 8

  • 21:35 brion: fixed (?) bad nsswitch.conf on bart (nis -> ldap)
  • 16:48 brion: adjusted new $wgExpensiveParserFunctionLimit to match old $wgMaxIfExistCount
  • 16:38 Tim reenabled search
  •  ?????? Tim disabled search sitewide
  • 7:40-8:40 Tim: the lack of a FORCE INDEX caused LogPager queries to be extremely slow. The site eventually went down when the cumulative query load built up sufficiently. Took a bit of time to disable the queries properly, kill the MySQL threads, and get the site back up.
  • 07:40 Tim: updated to r32943
  • 07:30 jeluf: restored .procmailrc for OTRS. We've lost all mails coming in between 0:38 and 7:30 UTC. I can't find them in /var/spool/mail, and they didn't go to OTRS. Any idea where postfix has put them?
  • 07:19 Tim: deleted 100GB of binlogs on ixia
  • 04:30 jeluf: migrated some of the changes that I've made to our OTRS. Installed a big red MOTD message on the login screen.
  • 01:10 brion: reinstalled OTRS FAQ module, fixing broken ticket zoom.
  • 00:40 brion: upgraded OTRS to 2.1.8. If you have information about the patches that were previously applied, please provide them! They have not been copied over since it's unclear what's what.

April 7

  • 18:29 RobH: srv117 shutdown due to failed HDD. RMA placed.
  • 18:18 RobH: db1 rebooted due to hard lockup.
  • 17:25 Tim: running maintenance/archives/upgradeLogging.php on various (eventually all) wikis
  • 00:10 brion: running a bzip2 integrity check on enwiki-20080312-pages-meta-history.xml.bz2; .7z is cut off

April 6

  • 11:24 mark: Changing resolver IPs on all servers
  • 05:10 Tim: cleaned up binlogs on srv139 and srv146

April 5

  • 17:42 mark: lighttpd on storage2 had run out of FDs and crashed. Increased the limit.
  • 16:52 mark: Stopped announcing prefix 66.230.200.0/24 in BGP.
  • 16:00 mark: Removed old IPs from various servers.

April 4

  • 19:52 brion: srv117 is borked; logins hanging

April 3

  • 21:18 brion: moved dump monitor thread to srv31; stale ruwiki dump marked correctly as aborted now. NOTE: IPs for storage NFS mounts should be changed when enwiki and dewiki dumps finish..........
  • 21:15 brion: killed dump & sitemap processes on benet. we're retiring it...
  • 15:59 RobH: Removed vincent, biruni, kludge, humboldt, & hypatia from all dsh groups and apache pool for decommissioning.

April 2

  • 22:01 RobH: isidore updated with newest wordpress installation for blog and donation blog.
  • 17:55 RobH: db1 rebooted.
  • 17:45 brion: added bart's new ip to known proxy list
  • 17:32 mark: Renumbered friedrich
  • 16:07 mark: Renumbered srv8, bayle
  • 15:57 mark: Renumbered srv9 and srv10
  • 15:43 mark: Renumbered yongle
  • 15:34 mark: Renumbered isidore
  • 15:26 mark: Renumbered browne
  • 15:10 mark: Renumbered storage1, anthony
  • 14:16 mark: Renumbered storage2, will
  • 14:00 mark: Restored symlinks in /etc/powerdns/templates/, be careful when working on/copying those files, they are heavily symlinked!
  • 13:15 mark: Renumbering bart to new IP range
  • 11:00 - 11:30 mark: Reloaded csw1-knams with new firmware; temporarily moved traffic to florida

April 1

  • 08:00 domas: db1 didn't like oracle migration, crashed

March 31

  • 4:30 JeLuF: Added srv145 back to external storage pool 'cluster16'. Added srv130 back to external storage pool 'cluster13'.
  • 4:00 JeLuF: Fixed mysql on srv81 and srv145. On srv138, resolved "out of diskspace" situation. The second disk was not mounted and both mysql datafiles were on one disk only.

March 28

  • 18:57 RobH: sq12 back online from lockup.
  • 18:46 RobH: Replaced DIMM4 in srv166
  • 18:09 RobH: srv51 back online from kernel panic.
  • 17:59 RobH: srv78 & srv81 back online from kernel panic.
  • 17:55 RobH: srv130 & srv131 back online from kernel panic.
  • 17:46 RobH: srv145 back online, was powered down?

March 26

  • 19:00 brion: previous fix had a bug which broke wikis with language variants. fixed.
  • 18:20 brion: Worked around mystery segfaults with voodoo fix (r32477)
  • 17:26 brion: mysterious crashes on private wiki root redirects, still trying to diagnose. (backtrace)
  • 15:26 mark: Set up sq50 as temporary LVS balancer instead of avicenna, so it's not a squid atm.
  • 15:00 mark: PyBal's configuration file had a syntax error, causing LVS to go down. Avicenna completely swamped and unreachable.
  • 14:08 mark: Rendering cluster down due to OOM kills on all 3 servers. Killed apaches and restarted them.

March 25

  • 22:31 brion: disabled CentralAuth debug log; found the bug i was looking for :)
  • 22:22 brion: enabled CentralAuth debug log

March 24

  • 23:11 brion: set default perms for upload to autoconfirmed except on commonswiki... this may be rolled back or changed if unpopular
  • 17:50 brion: restarting category builds on commons and enwiki
  • 17:45 brion: poked around old paypal post urls

March 21

March 20

  • 19:25 brion: restarted lighty on storage2; was down mysteriously
  • 16:53 storage2's lighty appears to have died... had lots of errors about too many open files etc
  • 12:53 RobH: srv150 back online.
  • 12:46 RobH: srv81 rebooted from kernel panic.

March 19

  • 23:55 brion: starting batch category table population...
  • 23:27 brion: updating code; stub updatelog and category tables applied. will populate tables after gone live...

March 18

  • 17:49 brion: benet crashed again. moving DNS for dumps.wikimedia.org over to storage2. it had a lighty pointing to a now-empty backups directory; pointed it at the currently-used dir for dump storage instead.
  • 17:00 and earlier -- some network issue with PowerMedium? large packets dying on routes through HGTN. mark did something to the network to cut our PowerMedium route? can't reach 66.230.200.* network from outside now; secure.wikimedia.org and planet.wikimedia.org at least using these addrs publicly still
  • 08:45 mark, JeLuF: Routing knams-pmtpa switched to another provider, dns switched to "normal". Everything looks fine. During the "knams-down" time, request rate in pmtpa dropped, needs further investigation.
  • 08:30 JeLuF: Lost connection pmtpa-knams, switched DNS to scenario "knams-down".
  • 07:23 Tim: hume's v1 partition is 92% full, set up a symlink farm to start filling v2.
  • 01:18 brion: secondary problem was some kind of overload on avicenna (pmtpa text LVS). river managed to tweak it into submission by taking it off net for a couple minutes. things appear up for now
  • 01:06 brion: packet loss down from 33%+ to about 4%... can reach ganglia consistently, still some outage issues
  • 00:18ish brion: major net issues in tampa? lots of packet loss; cpu down dramatically

March 17

  • 19:48 brion: fixed upload dir on wikimania2008wiki
  • 18:00 jeluf: srv51 is down. Replaced by memcached on srv65.

March 16

  • 15:28 mark: Renumbered mchenry to the new v4 IP range
  • 14:47 mark: Renumbered sanger to the new v4 IP range
  • 14:18 mark: Bound IPv6 IPs on csw5-pmtpa's vlan routing interfaces - so most if not all servers will have acquired one or more IPv6 addresses. Renumbered khaldun to the new IP range and published its IPv6 record as AAAA record in DNS (for apt.wikimedia.org)

March 13

  • 21:19 mark: Shutdown srv150's switchport, it has a ro fs and doesn't react to IPMI.
  • 19:55 brion: reenabled search result context for anons on LuceneSearch wikis
  • 04:28 Tim: enabled CentralAuth in dry-run mode on all wikis

March 12

  • 21:26 brion: de.labs thumbs mysteriously broken again. who knows...
  • 21:05 brion: poked at thumb-handler.php ... it was apparently pointing to the wrong backend URL for de-labs (de.labs) etc. Hacked in a special case for non-wikipedias.... which may well be even more broken. Look at this again... :P
  • 17:10 brion: dissolved mediawiki-ng-l list. Too much forced moderation and no mission meant it was never seriously used.

March 11

  • 18:57 brion: swapped LuceneSearch for MWSearch plugin on test.wikipedia.org and commons.wikimedia.org. Search front-end now includes thumbnails for image page results, which is kind of handy. :) Will do a little more testing before swapping wholesale; there are still UI differences and things which should be improved.

March 10

  • 20:25 brion: arbcom_enwiki was missing from dblist files (except private.dblist). Added it back to all.dblist and special.dblist, works again.
  • 19:07 brion: installed svn 1.4.6 on zwinger in /usr/local/svn; use this to svn up if the old version keeps whining
  • 18:36 brion: zwinger's old copy of svn (1.2.3) has decided that it can't deal with something in our repository (extensions/DumpHTML/wm-scripts). :(
  • 18:02 brion: removed the evil transclusion at Server admin log/All which caused updates of this log page to be insanely slow, by forcing links refresh of 12 huge log pages all combined into a giant page of death
  • 17:47 brion: set chapcom lang to 'en' instead of defaulting to 'chapcom'. special: page links now working instead of ':Userlogin' etc. not sure why it did that; seemed fine in command-line tests
  • 17:32 brion: reported language config issues on chapcom; examining
  • 16:54 brion: fixed spider blocks. :P
  • 16:37 brion: blocked an evil spider IP from mayflower; SVN http back up
  • 16:28 brion: mayflower overloaded in some way; load avg 147 o_O

March 9

  • 17:13 brion: en.labs.wikimedia.org and de.labs.wikimedia.org have FlaggedRevs testing configurations enabled. Still doing imports from en.wikibooks on en.labs, though. (Internal names are de_labswikimedia and en_labswikimedia.)

March 8

  • 08:45 Tim: cluster14 was inexplicably missing mywiki. No data loss, it's been missing since the cluster was created, apparently. Added it.
  • 11:09 Tim: srv81 is down. Removed it from external storage rotation.
  • 11:00 brion: updated hawhaw; WAP portal now looks nice in Mobile Safari on the iPhone SDK simulator app

March 7

  • 23:34 brion: importing de.wikibooks to de.labs.wikimedia.org....
  • 21:59 brion: setting up stub en.labs.wikimedia.org and de.labs.wikimedia.org for flaggedrevisions testing
  • 12:05 domas: srv25 has 40GB of lucene logs. disk full.
  • 12:00 domas: resynced samuel from db1, db5 remaining
  • 11:46 Tim: running dumpHTML on hume with 16 threads
  • 08:00 domas: s3 master switch, samuel_bin_log.171:224349875 to adler-bin.002:3522
  • 00:28 Tim: Updated zwinger:/etc/ntp.conf
  • 00:19 Tim: updated MySQL grants for new subnet

March 6

  • 23:26 Tim: added 208.80.152.128/26 to suda:/etc/exports and srv1:/var/yp/securenets. Created checklist at IP addresses
  • 06:49 brion: noticed zwinger can't access database servers since the IP renumbering. :P
  • 00:48 RobH: hume installation complete.

March 5

  • 23:57 brion: leuksman.com was offline for a while (net problems at sago)
  • 14:12 RobH: srv65 back online.
  • 13:59 RobH: srv150 back online from kernel panic.
  • 13:38 RobH: upgraded kernel in storage2
  • 13:28 RobH: srv127 back online from kernel panic.
  • 13:27 RobH: upgraded kernel in storage1

March 4

  • 22:30 mark: Changed dhcpd.conf on zwinger, firewall setup on khaldun and dhcp forwarding on csw5-pmtpa to make installs work from the new IP ranges.
  • 22:00 mark: Migrated zwinger onto the new IP range, changed its DNS entry to 208.80.152.189.
  • 19:08 brion: took out read-only
  • 19:05ish brion: put in temporary limit of Special:Newpages to 200; lots of reads with limit 5000 on dewiki were bogging down holbach. DB overload cleared up.
  • 18:53 brion: taking s2 and s2a to read-only temporarily while we work out this overload issue
  • 18:40 jeluf: DB servers for s2a cluster (dewiki) overloaded. ixia logs: [5100027.207458] Machine check events logged
  • 18:25 (large CPU spike up on mysql and apaches; continuing...)
  • 11:00 domas: db1 and adler are running compacted/fixed schema/tablespaces - next targets are db5 and samuel, master switch imminent

March 3

  • 21:18 brion: removed the special-case in lucene configuration for testwiki to use srv79. That seems to have an experimental version of the lucene server which is currently broken. search now works on testwiki
  • 18:57 mark: srv65 went offline, taking its memcached instance with it. Replaced the memcached slot by the last spare one.
  • 16:00 RobH: yf1019 kernel upgraded.
  • 16:00 RobH: yf1018 kernel upgraded.
  • 15:36 RobH: yf1016 kernel upgraded.
  • 15:36 RobH: yf1015 kernel upgraded.
  • 15:27 RobH: henbane kernel upgraded.
  • 14:59 RobH: sage kernel upgraded.
  • 14:51 RobH: mayflower kernel upgraded.
  • 14:41 RobH: hawthorn kernel upgraded.
  • 14:35 RobH: lily kernel upgraded.

March 2

  • 11:30 Tim: Not sure what the deal was. Cleaned up the mount options a bit: reduced timeout, switched from TCP to UDP mode (lost TCP connections cause temporary hangs), removed "intr" (useless when in soft mode). Remounted.
  • 11:17 Tim: amane immediately locked up again due to hang on NFS read of storage1. Unmounted /mnt/upload4 temporarily to restore service.
  • 11:09 Tim: restarted lighttpd on amane, was broken

February 29

  • 21:15 RobH: restarted ssh and put srv61 back in pool.
  • 21:15 RobH: brought srv130 back from kernel panic.
  • 19:56 RobH: Racked hume, new static-dump server. DRAC: 10.1.252.190 DHCPD needs modification to netboot this subnet.
  • 14:26 Tim: Removed /etc/cron.daily/find from all ubuntu apache servers that had it. Killed all long-running sort commands.

February 28

February 27

  • 22:22 RobH: Shutdown srv11-srv20 + srv6. (Old, warranty expiring, causing heat issues in that rack, per mark)
  • 18:34 RobH: upgraded kernel on will
  • 18:23 RobH: upgraded kernel on mchenry & sanger
  • 18:05 RobH: upgraded kernel on bayle
  • 18:00 RobH: upgraded kernel on khaldun
  • 17:45 RobH: upgraded kernel on srv9 & srv10
  • 17:37 RobH: upgraded kernel on yongle

February 26

  • 23:59 RobH: upgraded kernel on yf1009
  • 22:48 RobH: upgraded kernel on yf1005 to yf1008
  • 22:14 brion: rebuilding enwiki-20080103-pages-meta-current.xml.bz2 (as -2 for now) on srv31
  • 21:30 to 22:10 RobH: upgraded kernel on yf1002 to yf1004
  • 19:45 RobH: fixed replication on srv77 to srv8
  • 14:12 Tim: started lighttpd on benet, had crashed again

February 25

  • 23:51 brion: someone mucked up wgRemoveGroups on srwiki, listing pretty much every permission they could think of. pared it down to array( 'bot', 'patroller', 'rollbacker', 'autopatrolled')
  • 20:00 RobH: yf1001 security updates.
  • 19:58 RobH: yf1000 security updates.
  • 19:45 brion: maurus disk space filled up for a bit; there's a 39gb log file in /usr/local/search/log. Freed up some space from old index data; recommend adding some log rotation to search servers!

February 22

  • 21:33 RobH: srv171-srv175 kernel and security updates.
  • 20:32 RobH: srv161-srv170 kernel and security updates.
  • 20:00 RobH: srv151-srv160 kernel and security updates.
  • 16:53 RobH: sq33-sq40 kernel and security updates.
  • 16:34 RobH: sq24-sq32 kernel and security updates.
  • 16:09 RobH: sq16-sq23 kernel and security updates.
  • 15:52 RobH: sq41-sq50 kernel and security updates.
  • 05:15 Tim: Applying schema updates patch-page_props.sql and patch-ipb_by_text.sql
  • 02:00 - 04:45 mark: Migration of office DSL connections to Cisco 2841 - server is policy routed over the lower speed connection.

February 21

  • 22:42 RobH: sq10 - sq15 updated (kernel and security updates.)
  • 21:45 RobH: sq2 - sq9 updated (kernel and security updates.)
  • 20:08 RobH: sq1 updated (kernel and security updates.)

February 20

  • 23:53 RobH: knsq28 seems to not be rebuilding. Letting mark know.
  • 23:45 RobH: Upgraded kernel and such on knsq16 through knsq22 (apt-get upgrade). Not distro upgrade.
  • 23:21 RobH: Upgraded kernel and such on knsq8 through knsq15 (apt-get upgrade). Not distro upgrade.
  • 22:15 RobH: fuchsia back up by mark. All traffic remains routed to PMTPA (while rob finishes squid upgrades.)
  • 22:15 RobH: fuchsia down. All traffic routed to PMTPA.
  • 21:56 RobH: Upgraded kernel and such on knsq23 through knsq26 (apt-get upgrade). Not distro upgrade.
  • 21:30 RobH: Upgraded kernel and such on knsq1 through knsq7 (apt-get upgrade). Not distro upgrade.

February 18

  • 21:15 brion: manually mounted upload4 on srv189. Was not created in /mnt or listed in fstab.

February 17

  • 7:30 jeluf: suda's root FS was 100% full. Changed logrotate.conf to rotate logs daily instead of weekly, added switch.log to the log rotation.

February 13

  • After 18:44 RobH: Reinstalled db1 OS.
  • 18:44 RobH: rebooted srv37 from crash, back online.
  • 18:35 RobH: Restarted apache on srv166 per domas.
  • 15:03 RobH: storage2 disk 12 replaced. and is rebuilding

February 11

  • 03:38 Tim: srv61 is refusing ssh connections, still serving HTTP. Depooled.

February 10

  • 10:40 domas: db1 still needs fixing..
  • 07:30 Tim: upgrading the remaining squids with ~tstarling/squid/squid-upgrade.php. The script will upgrade one squid every two hours, in random order. This mitigates the effect of the cache clear for items with a Vary header (i.e. text). sq17 and sq18 were done during script testing.
  • 06:18 Tim: upgraded squid on sq16, including XVO feature
  • 05:40 Tim: srv150 accepts connections on SSH or HTTP and then hangs for a long time. Removed it from mediawiki_installation and apaches and depooled it.
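
The staggered rollout described in the 07:30 entry above (one squid every two hours, in random order, skipping hosts already upgraded) could be planned along these lines. This is a hedged sketch in Python and not the contents of squid-upgrade.php; the function name staggered_schedule and its parameters are hypothetical.

```python
import random
from datetime import datetime, timedelta


def staggered_schedule(hosts, start, interval=timedelta(hours=2), done=(), rng=None):
    """Return (time, host) pairs: one host per interval, in random order.

    Hosts listed in `done` are skipped. Spreading the upgrades out means
    only one cache is cold at any time.
    """
    rng = rng or random.Random()
    already = set(done)
    pending = [h for h in hosts if h not in already]
    rng.shuffle(pending)
    return [(start + i * interval, host) for i, host in enumerate(pending)]


if __name__ == "__main__":
    # Illustrative host range only; sq16 to sq18 were already done per the log.
    hosts = [f"sq{n}" for n in range(16, 24)]
    plan = staggered_schedule(hosts, datetime(2008, 2, 10, 7, 30),
                              done=["sq16", "sq17", "sq18"])
    for when, host in plan:
        print(when.strftime("%m-%d %H:%M"), host)
```

Random ordering avoids taking neighbouring hosts out in sequence, while the fixed interval gives each upgraded cache time to refill before the next one is cleared.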

February 8

  • 01:40 Tim: added "hidden" table (oversight) on wikis that didn't have it. Added it to addwiki.php.

February 7

  • 17:43 mark: Wrote a Mailman withlist script to change the embedded web_page_url variable to use https, as this is not possible using config_list.
  • 15:00ish to 16:30ish RobH: lily lighttpd.conf changed to support/redirect mailman with SSL certificate.

February 6

  • 17:45 brion: updated bugzilla to 3.0.3
  • 16:13 Tim: MW configuration changes:
    • Renamed some wikimedia-specific globals from $wgXxxx to $wmgXxxx. Some of them had rather obvious names that could potentially conflict with extension configuration in the future.
    • Moved passwords and private keys out to PrivateSettings.php
    • Changed SiteConfiguration.php to allow "tags" such as "fishbowl" and "private" to be applied to wikis. These tags can be used to specify settings in InitialiseSettings.php.
    • Used these tags to full effect by using fishbowl.dblist and private.dblist to set the fishbowl and private tags, and then removing all the fishbowl/private wiki lists from InitialiseSettings.php. This will make adding new private wikis easier.
    • Fixed some whitespace and removed some old commented-out code
    • Moved various ancient subdirectories of /h/w/common to /h/w/junk/common
  • 14:43 RobH: srv166 had a memory error, reseated memory, and restarted server.
  • 14:22 RobH: storage2 disk 2 replaced. Not rebuilding? (please show rob how to force this.)
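
The tag mechanism from the 16:13 entry above (dblist files assign tags such as "fishbowl" and "private" to wikis, and settings can then be keyed by tag) can be illustrated with a small Python sketch. This is not SiteConfiguration.php; the function names, the precedence order (wiki name, then tag, then default) and the wiki "examplewiki" are assumptions made for the example.

```python
def tags_for(wiki, dblists):
    """Return the tags whose dblist contains this wiki."""
    return [tag for tag, members in dblists.items() if wiki in members]


def resolve_setting(values, wiki, tags, default_key="default"):
    """Pick a setting value for a wiki.

    Assumed precedence: an entry for the wiki itself wins, then the
    first matching tag, then the default.
    """
    if wiki in values:
        return values[wiki]
    for tag in tags:
        if tag in values:
            return values[tag]
    return values.get(default_key)


if __name__ == "__main__":
    # Each dblist is modelled as a set of database names.
    dblists = {"private": {"arbcom_enwiki"}, "fishbowl": {"examplewiki"}}
    # Hypothetical setting: whether anonymous users may read the wiki.
    anon_read = {"default": True, "private": False}
    for wiki in ("enwiki", "arbcom_enwiki"):
        print(wiki, resolve_setting(anon_read, wiki, tags_for(wiki, dblists)))
```

With this shape, adding a new private wiki means adding one line to the private list rather than touching every setting that treats private wikis specially.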

February 4

  • 21:11 RobH: isidore now running bugzilla.wikimedia.org with a SSL Cert.

February 3

  • 11:47 mark: lighttpd disappeared on storage1 and was also inaccessible from the new IP range due to an old and broken firewall. Why was it there? Removed it.
  • 11:25 mark: Move traffic back to pmtpa

February 2

  • 20:30 mark: Added new service IPs to bayle and mchenry being the pmtpa DNS resolvers, and a new service IP for ns0.wikimedia.org on bayle.
  • 20:15 mark: Forgot that we have some DNS records pointing at 66.230.200.100 directly, so those were down for a while until I updated DNS.
  • 17:52 mark: Moved all text.* traffic to knams as well
  • 17:04 mark: Put Canadian traffic on pmtpa, to seed those caches a bit
  • 14:40 jeluf: storage1 overloaded. Killed static dump processes on srv136, srv135, srv134, srv133, srv132, srv131, srv42
  • 13:15 mark: Updated upload Squid configs to use the new pmtpa IP range, causing immediate pmtpa CARP cache clear, but mitigated by the knams squids.
  • 11:37 mark: Moved all upload.* traffic to knams, to prevent an effective CARP cache clear due to IP address changes swamping amane.

February 1

  • 20:19 brion: reverted r30405 which broke boardvote and re-enabled the ext
  • 20:10 brion: broken boardvote extension... was breaking all special pages; temporarily disabled the ext
    • Feb 1 20:08:18 kluge httpd[12208]: PHP Fatal error: Call to undefined function wfBoardVoteInitMessages() in /usr/local/apache/common-local/php-1.5/extensions/BoardVote/GoToBoardVote_body.php on line 3
  • 11:15 domas: restarted lighty on benet, did run away?

January 31

  • 10:53 Tim: deleted binlogs on srv146
  • 00:12 brion: svn.wikimedia.org resolved to old 145.* addy from anthony... since that doesn't work anymore, this is making svn access a pain for seeing about updating the wap interface. Tried to update resolv.conf with current values from zwinger, but still no dice.
    • have temporarily resorted to /etc/hosts hack

January 30

  • 22:25 brion: various reports of "blank pages" and/or 503 errors from Peru. Nothing narrowed down yet on our end.
  • 20:35 brion: switched Apple Dictionary app backend to OpenSearch. bumped MaxClients on yongle up to 20, may resolve the 'gets really slow for no reason' issue
  • 20:10 brion: enabling TitleKey sitewide. (Indexes should be rebuilt overnight to ensure they're up to date for changes in the last 15 hours.)
  • 05:54 brion: building TitleKey indexes generally (not fully enabled yet so opensearch isn't useless until done; want them built first)
  • 05:25 brion: experimenting with TitleKey ext on testwiki
  • 04:50 Tim: Fixed thumb-handler to not attempt to "cache" files locally on storage1. Removed bacon from /h/w/upload-scripts/sync.

January 29

  • 21:58 mark: Raised persistent_request_timeout on the backend squids from the default 2 minutes to 10 minutes, to make existing connection reuse even more likely between all communicating pairs of squids
  • 10:30 Tim: Setting up storage1 as a static HTML dump storage server. Installed ganglia on it.
  • 09:10 Tim: updatedb was running on storage1, attempting to index millions of files. Killed it, added /export to PRUNEPATHS, and re-ran it. Seems to work.

January 28

  • 22:30 brion: csw5-pmtpa has been spewing alarms about 5/3 and 5/4 optical connections for a while. :(
    • domas says this is harmless -- an unused port
  • 18:50 brion: svn revert'd some live hack in Parser.php which apparently added a $clearState parameter to Parser::internalParse() which never gets passed to it, thus spewing error logs with billions of lines of PHP warnings

January 24

  • 21:00 jeluf: installed lighty on storage1, configured squid so that all dewiki image requests and all commons thumb image requests go to storage1. Images fast again, backend request rate down to normal level.
  • 18:40 brion: images still very slow :(
  • 14:00 mark: Assigned new, extra IP addresses to Florida Squids, and added the new IP range to all squid.conf's. Also removed the old knams IP range, which has been unused over 2 months. This seems to have caused a massive cache clear in knams upload squids, causing a huge increase of image requests and overload of Amane. A real explanation is as of yet unknown... speculation is that old objects in knams caches have been invalidated somehow because they had the (now removed) old IP prefix in their caching info.

January 23

  • 02:09 Tim: reverted refresh_pattern changes in squid (ignore-reload) to fix user JS/CSS problems. With Brion's blessing.

January 22

  • 20:46 mark: Set $wgUserEmailReplyTo back to false, as mchenry will now rewrite envelope sender addresses from MediaWiki to wiki@wikimedia.org
  • 16:12 Rob: srv11 back online
  • 15:55 Rob: srv130,srv132,srv134 back online, see detailed server pages for crash information.

January 21

  • 12:30 jeluf: mark reports twice as many backend requests as usual. live-patched opensearch_desc.php to send proper Cache-Control headers. Needs to be updated in SVN. Backend request rate back to normal levels.
  • 07:10 brion: set $wgUserEmailUseReplyTo to protect against SPF failures and privacy leakage due to bounce messages in user-to-user emails. (Caused by sSMTP, which forces the envelope sender and From: address to be the same.) This uglifies user-to-user emails but keeps the same. In the long term I recommend replacing sSMTP with a minimal postfix or something like we used to use, which should work in a safe manner.
  • 03:24 brion: taking srv184 out of apache rotation to test ssmtp config issues

January 20

  • 21:45 jeluf: unpooled srv183, investigated why NFS mounts were missing after a reboot. Seems to be related to https://bugs.launchpad.net/ubuntu/+source/sysvinit/+bug/44836 . The fix suggested in that bug seems to help. Have to package it tomorrow.
  • 21:40 brion: mounted NFS shares on srv183
  • 21:39 brion: srv183 was rebooted 2h55m ago. its apaches are running, but NFS shares aren't mounted. nothing works properly. led to several reports of captcha failures, and might have led to some upload-related issues
  • 18:30 jeluf: rebooted srv183, un-killable convert jobs were blocking port 80
  • 18:29 brion: apache not restarting on srv164, srv176, srv183, srv184 -- "(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80"
  • 18:25 brion: killed job runner jobs on srv90-99, they were the error-spewers. syslog is clean. :D
  • 18:18 brion: several apaches in srv90-99 range still spewing errors, but seem to have the right file. stuck apc?
  • 18:11 brion: removed the random '$key' parameter from MessageCache::transform
  • 18:06 brion: space was filled by /var/log/messages and /var/log/syslog; runaway PHP warnings from some live hack extra parameter. truncating the log files and resyncing
  • 17:56 brion: turned off their apaches. looking for the space culprit.... they have most of their space wasted in a /a partition and a tiny / where all the stuff is
  • 17:53 brion: lots of srv's in 150-190 range out of disk space; broken (LocalRepo.php update failed)
  • 11:12 brion: file histories were broken for a few minutes (bad commit got through)
  • 07:08 brion: enabling $wgFileRedirects on test.wikipedia

January 19

  • 06:29 and a bit before - brion: some brief segfaulting due to a bad recursion in my SiteConfiguration update. Note: non-string values in InitialiseSettings.php (false, null, ints, etc) will now work.

January 18

  • 22:46 brion: wikibugs was idle for an hour or so due to being autoblocked for bounces again...
  • 22:40 brion: srv11 is hung; no ssh, HTTP opens but doesn't respond
  • 18:40 brion: created wikimedia-sf mailing list

January 16

  • 22:30ish brion: someone tried to delete sandbox on en.wikipedia, leading to various DB error warnings (transactions full) and breakage of most editing for nearly an hour. Have hacked in a 5000-revision limit on deletions, will prettify it shortly.
  • 21:39 brion: Added a default "Cache-control: no-cache" header on output in CommonSettings.php. This will protect PHP Fatal Error blank pages and such from getting cached due to a 200 result code and lack of cache-control headers. Actual cache-control output will override the default one. (Had to manually purge a Special:Random on en.wikipedia... various issues with editing etc)
  • 07:32 brion: fixed IRC recentchanges name for wikimania2008.wikimedia (was sending to the 2007 channel)

January 15

  • 21:00 jeluf: removed memcached on srv56,57,58 on rainman-sr's request. Memcached was causing problems with the indexer.

January 14

  • 21:33 brion: clearing a giant watchlist on users' request; may cause some s1 replag
  • 21:00ish brion: we seem to be getting blank PHP fatal error pages stuck in squid caches. :( latest php should mark these as 500...
  • 20:00 Rob: All yaseo upload squids upgraded.
  • 19:45 Rob: All yaseo text squids upgraded.
  • 18:45 Rob: Upgraded squid on sq41-sq50
  • 17:45 Rob: Upgraded squid on sq11-sq15
  • 17:00 Rob: Upgraded squid on sq6-sq10
  • 17:00 Rob: Upgraded squid on sq1-sq4
  • 16:20 Rob: Upgraded squid on sq32-sq40
  • 16:20 Rob: Upgraded squid on sq24-sq31
  • 16:03 Rob: Upgraded squid on sq16-sq23
  • 15:26 Rob: Upgraded squid on knsq16,knsq17, knsq18, knsq20, knsq21, knsq22.
  • 15:00 Rob: Upgraded squid on knsq8,knsq8, knsq9, knsq10, knsq11, knsq12, knsq13, knsq14, knsq15

January 13

  • 20:34 mark: Enabled access log on mayflower's apache (why was it disabled?)
  • 18:12 mark: Upgraded all knams text squids to new squid version
  • 17:30 mark: Set refresh_pattern . 60 50% 3600 ignore-reload on all text squids to override reload headers
  • 17:00 mark: Upgraded knsq1 to the new Squid
  • 16:15 mark: Brought up knsq19, and installed a new squid 2.6.18-1wm1 on it, including Domas' Accept-Encoding normalization patches. If you notice anything weird, notify Mark or Domas...
  • 04:25 Tim: Updated MW from r29455 to r29682.

January 12

  • 11:00 domas: removing titleblacklist. there's certain level of crap beyond which I won't fix stuff.
  • 03:10 brion: importing checkuser logs
  • 02:59 brion: upgrading to current CheckUser code (per-wiki logs for now)

January 11

  • 12:00 domas: installed lighty on zwinger for ganglia use

January 10

  • 17:00 domas: disabled CentralNotice

January 9

  • 21:00 domas: increased revtext ttl to 1w, fixed parser cache ttl problem, where magicwords were causing most of enwiki (and other template-aware wiki) pages to be cached for 1h only (r29511)
  • 09:00 domas: memcached arena increased to 158GB, 79 active nodes, ES instances getting lower buffer pools on servers running memcached (1000M to 100M), full cache drop
  • 00:14 brion: we've expanded storage2's size and removed a bunch of useless thumb and temp files from the amane backup, so there's room again; have restarted dump runs, including a continuation run of enwiki (which should start up from meta-current)

January 8

  • 22:33 jeluf: extended storage2:/export by 650 GB
  • 22:03 brion: uploads broken for several minutes by r29361 (reverted)
  • 21:48 brion: srv17 and srv18 are whining about high temperatures
  • 21:00 Rob: srv17 segfaults in httpd, resynced and restarted apache.
  • 17:10 Rob: srv78 Kernel Panic, rebooted and back online.
  • 16:45 Rob: srv177 cpu overheating, pulled, replaced thermal paste, back online.
  • 16:20 Rob: srv15 cpu overheating, pulled, replaced thermal paste, back online.
  • 16:15 Rob: srv189 back in rotation.
  • 14:59 Rob: srv189 reinstalled, needs apache setup.
  • 14:54 Rob: srv130 rebooted and back online.
  • 07:50 domas: added db8 and db10 to ganglia

January 7

  • 08:34 Tim: mounted upload4 on albert for static.wikipedia.org symlinks

January 6

  • 21:33 mark: Enabled TCP ECN on lily and mayflower
  • 21:03 mark: Added mayflower's EUI-64 address to DNS - svn may use it.
  • 20:06 mark: Added a v6 service IP to lily (lists.wikimedia.org) and put it in DNS.

January 4

  • 00:34 brion: restarting backup syncs from amane to storage2; was broken by bad script... trimming more thumbnails out of storage2 to clear up space

January 3

  • 19:29 brion: starting enwiki dump on srv42, will continue with general worker thread
  • 19:13 brion: Setting up srv42 to run dump worker threads as well as general batches, since it seems idle.
  • 15:05 mark: Rebooted fuchsia with an LVS optimized kernel, moved all LVS services back onto it
  • 13:45 mark: LVS on fuchsia overloaded, moved LVS for upload to mint
  • 00:26 brion: http://download.wikimedia.org/ now running off storage2. will restart dump runs aiming at it until we have a better place to put the backend (with benet still not checked for its disk issues)

Archives