Server Admin Log/Archive 12a

August 31

23:05 mark: A parser bug in the PowerDNS Bind backend caused unavailability of the wikimedia.org zone for a few minutes, ouch...
22:55 mark: Deployed a PowerDNS pipebackend instance with this script on ns2.wikimedia.org (lily) only. Just one out of three nameservers for stability testing for now. Should there be major trouble, remove all "pipe" backend references from /etc/powerdns/pdns.conf.
18:38 Tim: Going to bed. Status is: srv107 replicating but locked with slow alter table. Can be re-added after it catches up. cluster18 is working, for no apparent reason, and should be migrated to max_rows=20M ASAP. cluster17 needs a master switch so that srv102 can be fixed, after that it should be re-added to the write list. Once srv142 is done copying, it can be restarted and repooled, as can srv145. No need to fix the replication there since it's an old cluster.
18:30 Tim: re-adding cluster19 to the write list, without srv107 which is still altering.
16:22 Tim: srv141 didn't work out, out of disk space, trying copy to srv142 instead (from srv145)
14:44 Tim: srv103 and srv110 done, repooling.
14:02 Tim: srv108 done, changed master to srv108, started max_rows change on srv107
13:51 Tim: started max_rows change on srv110. Not patient enough to do them one at a time.
13:38 Tim: copy to srv110 finished. Put srv110 in, srv103 left out for now for max_rows change
13:27 Tim: taking srv145 out of rotation for copy to new ext store srv141 (has same partitioning)
12:45 Tim: srv109 finished, starting on srv108
11:45 Tim: taking srv103 out of rotation for copy to new ext store srv110
11:37 Tim: alter table blobs max_rows=10000000; on srv109.
11:34 Tim: cluster is too much of a mongrel undocumented mess to set up new ext store servers, and we don't have that many candidates left anyway. Going to try saving the existing clusters.
10:27 Tim: received reports that cluster19 has gone the same way. Most likely all slaves and masters set up that time are affected and will fail roughly simultaneously. Will set up new clusters.
10:15 Tim: set mysql root password on external storage servers where it was blank
10:07 Tim: cluster17 master srv102 has stopped being writable for enwiki due to exhausted MyISAM index table size (max_rows=1000000). Removed from write list, working on it.
07:00 Tim: On srv189: added ddebs.ubuntu.com to sources.list. Installed debug symbols for apache.
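
Note on the max_rows entries above (10:07, 11:37): as far as the MySQL documentation describes it, MyISAM uses the product of MAX_ROWS and AVG_ROW_LENGTH to decide how large a table must be able to grow, which in turn fixes the width of its row pointer. The Python sketch below assumes the pointer is the smallest whole number of bytes (between 2 and 7) able to address that product. The function names and the 3000-byte average row length are made up for illustration and are not measurements from the blobs table.

```python
def pointer_bytes(max_rows, avg_row_length):
    """Smallest pointer width in bytes (2..7) that can address
    max_rows * avg_row_length bytes. Assumed sizing rule, see lead-in."""
    needed = max_rows * avg_row_length
    size = 2
    while size < 7 and 256 ** size < needed:
        size += 1
    return size


def max_data_bytes(max_rows, avg_row_length):
    """Upper bound on data file size implied by the pointer width."""
    return 256 ** pointer_bytes(max_rows, avg_row_length)


if __name__ == "__main__":
    # Hypothetical average row length; real blob rows will differ.
    for rows in (1_000_000, 10_000_000, 20_000_000):
        limit = max_data_bytes(rows, 3000)
        print(rows, pointer_bytes(rows, 3000), limit // 2 ** 30, "GiB")
```

Under these assumptions, raising max_rows by a factor of ten moves the table from a 4-byte to a 5-byte pointer, which is why the ALTER TABLE rewrites the whole table and takes so long.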

August 30

22:11 mark: Set up an experimental IPv6 to IPv4 proxy on iris
17:13 Tim: killed long-running convert processes on srv152-189

August 29

21:00 jeluf: checked srv104, added it back to its ES pool, added cluster18 back to wgDefaultExternalStore
16:12 RobH: moved srv52 and srv56 from B2 to C4 for heat issues.
15:32 RobH: srv149 reinstalled as apache core.
13:08 Tim: images on kuwiki were actually broken because the move from amane to storage2 failed. The directory on amane was probably recreated by the thumbnail handler before the migration script created the symlink, resulting in a new writable image directory with no images in it. Merged the two directories and fixed the symlink.
12:00 domas: did space cleanups on amaryllis, and all DBs (all <80% disk usage now :) - preparing for vacation. VACATION!!! :)

August 28

22:50 mark: Set up a dirty, temporary test setup of PyBal on lvs2 doing SSH logins on all apaches for health checking.
21:43 RobH: reinstalled srv134 back online as apache core.
21:10 RobH: reinstalled srv130 back online as apache core.
20:09 RobH: searchidx1, search1, search2, search3, search4, search5, search6, & search7 racked with remote management enabled.
16:09 RobH: db9 reinstalled for misc db role.
13:28 Tim: removed dkwiktionary and dkwikibooks from all.dblist. Apparently they're visible on the web when they were previously removed. They were created accidentally years ago due to dk being an alias for da.
- They became visible due to Rob's changes to langlist.
05:59 Tim: Following complaint about bad uploads on kuwiki, running "find -type d -not -perm 777 -exec chmod 777 {} \;" in various upload directories with various maxdepth options.
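
For reference, a rough Python equivalent of the find/chmod run at 05:59. It is a sketch only: the function name and the max_depth parameter are hypothetical stand-ins for find's -maxdepth option, and directories that the calling user cannot read are silently skipped by os.walk.

```python
import os
import stat


def open_up_dirs(root, max_depth=None):
    """Set mode 0777 on every directory under root that is not already
    exactly 0777, like: find -type d -not -perm 777 -exec chmod 777 {} \\;
    Returns the list of directories that were changed."""
    changed = []
    root = os.path.abspath(root)
    base_depth = root.count(os.sep)
    for dirpath, dirnames, _ in os.walk(root):
        depth = dirpath.count(os.sep) - base_depth
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # do not descend any further
        mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        if mode != 0o777:
            os.chmod(dirpath, 0o777)
            changed.append(dirpath)
    return changed
```

Limiting the depth keeps the walk away from the deep hashed subdirectories, which is presumably why the original command was run with various maxdepth values.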

August 27

22:57 RobH: srv127 reinstalled and back online as apache.
22:34 RobH: srv36 reinstalled and back online as apache.
22:09 RobH: srv117 reinstalled and back online as apache.
22:00 mark: Commented out most LVS related checks in /home/wikipedia/bin/apache-sanity-check which are no longer relevant
22:00 mark: Various changes to the Ubuntu installer, to make SM apache installs work, and for preseeding of NTP config.
21:48 RobH: srv81 reinstalled and back online as apache.
19:07 RobH: Purged cz.wikimedia.org redirect from all knams squids.
18:10 RobH: srv147 reinstalled and deployed as apache.
16:30 RobH: sq48 had a possible issue with hdc. Tested fine, cleaned and back online.
15:19 RobH: srv146 was read-only. Rebooted, fsck, restarted.
08:38 Tim: added FlaggedRevs stats update to crontab on hume
08:03 Tim: running FlaggedRevs/maintenance/updateLinks.php on dewiki

August 26

20:00 RobH: moved srv84 and srv85 from B4 to B3 rack.
18:39 RobH: moved srv82 and srv83 from B4 to B3 rack.
17:30 RobH: srv81 reinstalled and running apache. Needs ext store setup.
16:35 RobH: srv103 restarted and synced.
16:01 brion: srv103 serving pages with stale software but unreachable. needs to be shut down
14:53 RobH: reinstalled db10 for misc. db tasks.
13:27 Tim: disabled some user account on otrs-wiki
11:15 mark: Added coronelli to search pool 3 on lvs3
00:26 RobH: fixed my own typo in redirects.conf, pushed, graceful all apache.
00:15 RobH: pushed some fixes on InitialiseSettings.php for a private wiki.

August 25

23:07 brion: enabled write API, let's see what happens!
22:41 brion: query.php disabled as scheduled.
22:07 brion: a SiteConfiguration code change broke upload dirs for a bit. reverted it.
20:15 brion: set wgNewUserSuppressRC to true. It was false, unsure why; it's annoying.
14:30 RobH: pushed dns changes to langlist to support cz. as well as a number of other langlist redirects not added to dns.
14:15 RobH: Fixed an error in my additions for the cz.wikistuff, pushed out the redirects to apaches.
12:10 domas: mark stealing db10 for stuff
11:00 domas: reenabled db10, added db14 to s1, db9 given away to non-core tasks, added full contributions load to db16 (as it has covering index)
09:55 domas: reverted an instance where 'IndexPager' was causing filesorts... :)
08:00 domas: cleaned up hume / diskspace, was full, added /a to updatedb prunepaths, apt-get clean too - 4.5G released
08:00 domas: disabled db10 for db14 bootstrap
07:36 domas: updating FlaggedRevs schema on ruwiki.
02:26 brion: updating MW, including FlaggedRevs schema update (fp_pending_since, flaggedrevs_tracking)

August 24

17:15 domas: removing db9 entirely, crashed, disk gone...
07:20 Tim: deployed the TrustedXFF extension that I just wrote.
02:56 Tim: removed db9 from the contributions, watchlist and recentchangeslinked query groups. Long running queries (2000 seconds) from IndexPager::reallyDoQuery and ApiQueryContributions::execute, probably needs index fixes. Removed general load from the remaining query group server, db7.

August 22

21:34 RobH: will moved from A4 to A2.
21:00 RobH: diderot unracked
00:27 brion: FR feedback on enwikinews as well
00:24 brion: Deleting email record rows from cu_changes; some had slipped through before we disabled the privacy breakage

August 21

23:47 brion: FlaggedRevs feedback enabled on test & labs
23:35 brion: Enabled experimental HTML diff on test.wikipedia.org, en.labs.wikimedia.org, and de.labs.wikimedia.org
18:17 RobH: Updated DNS entries to add a number of .cz domains. Also updated redirects.conf to support the added domains.
11:43 Tim: installing GlobalBlocking
02:42 Tim: returned db16 to general load, a less critical role
02:30 Tim: installed mysql-client-5.0 on db11-16. Installed ganglia-metrics on thistle, db1, db4, db7, db12, db13, db14, db15, db16.
02:20 Tim: offloaded query group read load from db16. System+user CPU disappeared.
- Recovery spike in I/O shows that replication was suppressed due to read activity. Caught up in ~8 minutes.
02:11 Tim: db16 is chronically lagged, probably overloaded with inflexible query group load
- db16 shows high flat system+user CPU since ~01:05

August 20

04:15 Tim: attempting to upgrade hume from Ubuntu 7.10 to 8.04
01:24 brion: experimentally lifting $wgExportMaxLimit from 1000 to infinity on enwiki -- testing hack to SpecialExport.php to use unbuffered query

August 19

08:38 Tim: done with lomaria
07:42 Tim: taking lomaria out of rotation to drop non-s2a databases and change its replication to s2a-only.
04:45 Tim: increased load on db13 to relieve db8, stressed by removal of lomaria from s2
04:10 Tim: A hotlinking mirror, getting images from thumb.php, was being visited at high rate, DoSing our storage servers. Referer blocked.
03:50 Tim: ixia disk space critical, fixed
03:45 Tim: Older s3 slave servers are showing signs of strain. Adding more s3 load to db11 to test its capacity.
- db11 is fine at 47% load ratio, reporting 80-90% disk util, await 5-7ms, load ~6
- 96% load ratio, reporting disk util ~90%, await ~6ms, load ~7.5. Wait CPU ~12%. Yawning in mock-boredom.
03:37 Tim: lomaria was relatively overloaded. Adjusted loads, put it in an s2a role since we haven't had any s2a servers since holbach was decommissioned
02:40 Tim: removed holbach, webster and bacon from db.php, decommissioned. Removed decommissioned servers from $wgSquidServersNoPurge.
02:27 Tim: compiled udpprofile on zwinger, started collector. Firewalled port 3811 inbound, /etc/init.d/iptables save. Updated MediaWiki configuration. Updated report.py on bart.
01:40 Tim: reduced apache "TimeOut" on srv38/39 from 300 to 10, to limit the impact of LVS flapping

August 18

23:00 RobH: added the image scaling servers back into the apache node group and updated their config files. This fixes the thumbnail generation issue evident on both uploads. and se.wikimedia (may have existed elsewhere as well, in fact, it most certainly must have.) All apaches restarted.

August 17

22:30 jeluf: restarted apaches on srv38/39 due to user reports about broken thumbnails.

August 16

13:20 mark: Reenabled ProxyFetch monitor on rendering cluster on lvs3, and set depool_threshold = .5.
12:58 Tim: removed ProxyFetch monitor from rendering cluster in pybal on lvs3
12:50 Tim: thumbnailing broke completely, at ~03:00 UTC. The apache processes on srv38/39 were stuck waiting for writes to the storage servers. Couldn't find the associated PHP threads on the storage servers to see if something was holding them up, so I tried restarting apache on srv38/39 instead. Suspect broken connections due to regular depooling by pybal
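
The depool_threshold = .5 setting at 13:20 is meant to stop the monitor from depooling too much of the rendering cluster at once. The sketch below shows the general idea only and is not PyBal's actual code: the function name and the exact comparison are assumptions. A failing server may be depooled only if at least the threshold fraction of all servers would still be pooled afterwards.

```python
def can_depool(total_servers, pooled_servers, depool_threshold=0.5):
    """Return True if one more server may be depooled without the pooled
    fraction dropping below depool_threshold (hypothetical rule)."""
    remaining = pooled_servers - 1
    return remaining >= total_servers * depool_threshold


if __name__ == "__main__":
    # Two rendering servers (srv38/39): the first failure may be depooled,
    # the second may not, so one server always stays in rotation.
    print(can_depool(2, 2))
    print(can_depool(2, 1))
```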

August 14

18:55 domas: fixed db16 replication
18:50 brion: db16 replication is broken -- contribs/watchlists/recentchangeslinked for enwiki stopped at about 4 hours ago
??? ??? db16 crashed

August 13

17:10 Tim: Changed http://noc.wikimedia.org/conf/ to use a PHP script to highlight the source files from NFS on request, instead of them being updated periodically. Added a warning header to all affected files.
06:17 Tim: Removed old ExtensionDistributor snapshots (find -mtime +1 -exec rm {} \;), synced r39273
02:40 brion: fixed permissions on dewiki thumb dir -- root-owned directory not writable by apache worked for existing directories, but failed for the 'archive' directory needed for old-version thumbnails used by FlaggedRevs
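
For reference, a Python sketch of the snapshot cleanup at 06:17 (find -mtime +1 -exec rm {} \;). find truncates a file's age to whole days before comparing, so +1 matches files last modified at least two full days ago; the sketch follows that rule. The function name and the now parameter are hypothetical, and only regular directory entries are removed, since plain rm fails on directories anyway.

```python
import os
import time


def remove_older_than(root, days=1, now=None):
    """Delete files under root whose age in whole days exceeds `days`.
    Returns the list of removed paths."""
    now = time.time() if now is None else now
    removed = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            age_days = int((now - os.lstat(path).st_mtime) // 86400)
            if age_days > days:
                os.remove(path)
                removed.append(path)
    return removed
```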

August 12

21:06 mark: Moved LVS load balancing of apaches to lvs3 as well, using a new service IP (10.2.1.1)
18:10 brion: fixed up security config that disabled PHP execution in extension directories; several configs had this wrong and non-functional
12:45 tfinc: removed /srv/org.wikimedia.dev.donate & /srv/org.wikimedia.donate on srv9 and removed the apache confs that mention them.

August 11

23:53 mark: Moved traffic from Russia (iso code 643) to knams
23:53 mark: Moved the rendering cluster LVS to lvs3 as well.
22:45 mark: Deployed lvs3 as the first new internal LVS cluster host, and moved over the search pools to it using new service IPs (outside the subnet). The rest of the LVS cluster as well as the documentation are a work in progress - let me know if there are any problems.

August 10

17:43 Tim: freed up another 100GB or so by deleting all dumps from February 2008.
17:27 Tim: freed up a few GB on storage2 by deleting failed dumps: enwiki/{20080425,20080521,20080618,20080629}, dewiki/20080629.

August 8

22:46 RobH: set up network access LOM for db13, db14, db15, & db16
22:40 brion: set up 'inactive' group on private wikis; this is just "for show" to indicate disabled accounts, adding a user to the group doesn't actually disable them :)
21:15 brion: can't seem to reach the 'oai' audit database on adler from the wiki command-line scripts. This is rather annoying; permissions wrong maybe?

August 6

17:25 brion: updated dump index page to indicate dumps are halted atm

August 5

22:09 mark: Shut down BGP session to XO for maintenance
18:27 RobH: db14, db15, db16 installed with Ubuntu.
18:24 brion: enabling flaggedrevs on ruwiki per [1]
17:09 brion: enabling flaggedrevs on enwikinews per [2]
06:20 jeluf: set wgEnotifUserTalk to true on all but the top wikis, see bugzilla

August 4

05:58 brion: dewiki homepage broken for a few minutes due to a bogus i18n update in imagemap breaking the 'desc' alignment options

August 3

14:15 robert: got reports about lots of failed searches on nl and pl.wiki, looks like diderot (again) failed to depool a dead server (rabanus), removed manually.

August 1

21:05 brion: forcing display_errors on for CLI so I don't keep discovering my command-line scripts are broken _after_ I run them, they don't show any errors, and I thought they worked. :)
06:39 Tim: wrote a PHP syntax check for scap, using parsekit, that runs about 6 times faster than the old one
04:58 Tim: installing PHP on suda (CLI only) for syntax check speed test
01:46 Tim: removed db1 from rotation, it's stopped in gdb at a segfault.
00:22 brion: aha! found the problem. MaxClients was turned down to 10 from default of 150 long ago, while the old prefix search was being tested. :) now back to 150
00:19 brion: just turning off the mobile gateway on yongle for now, it just doesn't appear to be working at full load. (files moved to subdir -- in /x/ it works fine seemingly). Server doesn't appear overly loaded -- CPU and load are low -- just the requests stick.
00:10 brion: installing APC on yongle, php bits are ungodly slow sometimes
