Server Admin Log/Archive 12a
August 31
- 23:05 mark: A parser bug in the PowerDNS Bind backend made the wikimedia.org zone unavailable for a few minutes, ouch...
- 22:55 mark: Deployed a PowerDNS pipebackend instance with this script on ns2.wikimedia.org (lily) only. Only one of the three nameservers, for stability testing for now. Should there be major trouble, remove all "pipe" backend references from /etc/powerdns/pdns.conf (a rollback sketch follows this day's entries).
- 18:38 Tim: Going to bed. Status is: srv107 replicating but locked with slow alter table. Can be re-added after it catches up. cluster18 is working, for no apparent reason, and should be migrated to max_rows=20M ASAP. cluster17 needs a master switch so that srv102 can be fixed, after that it should be re-added to the write list. Once srv142 is done copying, it can be restarted and repooled, as can srv145. No need to fix the replication there since it's an old cluster.
- 18:30 Tim: re-adding cluster19 to the write list, without srv107 which is still altering.
- 16:22 Tim: srv141 didn't work out, out of disk space, trying copy to srv142 instead (from srv145)
- 14:44 Tim: srv103 and srv110 done, repooling.
- 14:02 Tim: srv108 done, changed master to srv108, started max_rows change on srv107
- 13:51 Tim: started max_rows change on srv110. Not patient enough to do them one at a time.
- 13:38 Tim: copy to srv110 finished. Put srv110 in, srv103 left out for now for max_rows change
- 13:27 Tim: taking srv145 out of rotation for copy to new ext store srv141 (has same partitioning)
- 12:45 Tim: srv109 finished, starting on srv108
- 11:45 Tim: taking srv103 out of rotation for copy to new ext store srv110
- 11:37 Tim: alter table blobs max_rows=10000000; on srv109.
- 11:34 Tim: cluster is too much of a mongrel undocumented mess to set up new ext store servers, and we don't have that many candidates left anyway. Going to try saving the existing clusters.
- 10:27 Tim: received reports that cluster19 has gone the same way. Most likely all slaves and masters set up that time are affected and will fail roughly simultaneously. Will set up new clusters.
- 10:15 Tim: set mysql root password on external storage servers where it was blank
- 10:07 Tim: cluster17 master srv102 has stopped being writable for enwiki: its MyISAM blobs table hit the size limit implied by max_rows=1000000. Removed it from the write list, working on it (see the max_rows sketch after this day's entries).
- 07:00 Tim: On srv189: added ddebs.ubuntu.com to sources.list. Installed debug symbols for apache.
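A note on the max_rows work above: MyISAM sizes its internal data pointer from the MAX_ROWS/AVG_ROW_LENGTH hints given at table creation, so a blobs table created with max_rows=1000000 eventually reports "table is full" even with plenty of disk free, and the fix is an ALTER TABLE that rebuilds it with a larger hint. A minimal sketch of the change on one depooled slave; the schema name and login details are placeholders, not from the log:

    # "cluster19" is a placeholder database name; the real external storage schema isn't recorded here.
    # Max_data_length in the first output is the current pointer-derived ceiling.
    mysql -h srv109 -e "SHOW TABLE STATUS LIKE 'blobs'" cluster19
    # Rebuilds the table with a wider data pointer; slow on a large blobs table.
    mysql -h srv109 -e "ALTER TABLE blobs MAX_ROWS=10000000" cluster19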
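And for the 22:55/23:05 pipebackend experiment, the rollback amounts to stripping the pipe backend out of /etc/powerdns/pdns.conf and restarting pdns. A sketch only, assuming the stock launch= and pipe-command= settings; the actual config on lily isn't recorded here:

    cp /etc/powerdns/pdns.conf /etc/powerdns/pdns.conf.bak
    # drop "pipe" from the backend list and comment out the pipe-command line
    sed -i -e 's/^launch=bind,pipe$/launch=bind/' \
           -e 's/^pipe-command=/# pipe-command=/' /etc/powerdns/pdns.conf
    /etc/init.d/pdns restart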
August 30
- 22:11 mark: Set up an experimental IPv6 to IPv4 proxy on iris
- 17:13 Tim: killed long-running convert processes on srv152-189
August 29
- 21:00 jeluf: checked srv104, added it back to its ES pool, added cluster18 back to wgDefaultExternalStore
- 16:12 RobH: moved srv52 and srv56 from B2 to C4 for heat issues.
- 15:32 RobH: srv149 reinstalled as apache core.
- 13:08 Tim: images on kuwiki were actually broken because the move from amane to storage2 failed. The directory on amane was probably recreated by the thumbnail handler before the migration script created the symlink, resulting in a new writable image directory with no images in it. Merged the two directories and fixed the symlink (a rough sketch follows this day's entries).
- 12:00 domas: did space cleanups on amaryllis, and all DBs (all <80% disk usage now :) - preparing for vacation. VACATION!!! :)
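A rough sketch of the 13:08 kuwiki fix above, i.e. merge the stray directory into the migrated copy and leave a symlink behind; all paths here are placeholders, not the real amane/storage2 mount points:

    SRC=/mnt/amane/upload/wikipedia/ku      # stray directory recreated by the thumbnail handler (placeholder)
    DST=/mnt/storage2/upload/wikipedia/ku   # migrated copy that should be canonical (placeholder)
    rsync -a "$SRC"/ "$DST"/                # merge anything written to the stray directory
    rm -rf "$SRC"
    ln -s "$DST" "$SRC"                     # replace the directory with the intended symlink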
August 28
- 22:50 mark: Set up a dirty, temporary test setup of PyBal on lvs2 doing SSH logins on all apaches for health checking.
- 21:43 RobH: reinstalled srv134 back online as apache core.
- 21:10 RobH: reinstalled srv130 back online as apache core.
- 20:09 RobH: searchidx1, search1, search2, search3, search4, search5, search6, & search7 racked with remote management enabled.
- 16:09 RobH: db9 reinstalled for misc db role.
- 13:28 Tim: removed dkwiktionary and dkwikibooks from all.dblist. Apparently they were visible on the web even though they had previously been removed. They were created accidentally years ago due to dk being an alias for da (a removal sketch follows this day's entries).
- They became visible due to Rob's changes to langlist.
- 05:59 Tim: Following complaint about bad uploads on kuwiki, running "find -type d -not -perm 777 -exec chmod 777 {} \;" in various upload directories with various maxdepth options.
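The 13:28 dblist removal above is just deleting two lines from all.dblist and syncing the file out; a sketch, with the /home/wikipedia/common path assumed:

    cd /home/wikipedia/common               # assumed location of all.dblist
    cp all.dblist all.dblist.bak
    grep -v -E '^(dkwiktionary|dkwikibooks)$' all.dblist.bak > all.dblist
    # then sync the file out to the apaches (site-specific sync scripts not shown)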
August 27
- 22:57 RobH: srv127 reinstalled and back online as apache.
- 22:34 RobH: srv36 reinstalled and back online as apache.
- 22:09 RobH: srv117 reinstalled and back online as apache.
- 22:00 mark: Commented out most LVS-related checks in /home/wikipedia/bin/apache-sanity-check that are no longer relevant.
- 22:00 mark: Various changes to the Ubuntu installer, to make SM apache installs work, and for preseeding of NTP config.
- 21:48 RobH: srv81 reinstalled and back online as apache.
- 19:07 RobH: Purged cz.wikimedia.org redirect from all knams squids.
- 18:10 RobH: srv147 reinstalled and deployed as apache.
- 16:30 RobH: sq48 had a possible issue with hdc. Tested fine, cleaned and back online.
- 15:19 RobH: srv146 was read-only. Rebooted, fsck, restarted.
- 08:38 Tim: added FlaggedRevs stats update to crontab on hume
- 08:03 Tim: running FlaggedRevs/maintenance/updateLinks.php on dewiki
August 26
- 20:00 RobH: moved srv84 and srv85 from B4 to B3 rack.
- 18:39 RobH: moved srv82 and srv83 from B4 to B3 rack.
- 17:30 RobH: srv81 reinstalled and running apache. Needs ext store setup.
- 16:35 RobH: srv103 restarted and synced.
- 16:01 brion: srv103 is serving pages with stale software but is otherwise unreachable; needs to be shut down.
- 14:53 RobH: reinstalled db10 for misc. db tasks.
- 13:27 Tim: disabled some user account on otrs-wiki
- 11:15 mark: Added coronelli to search pool 3 on lvs3
- 00:26 RobH: fixed my own typo in redirects.conf, pushed it out, and ran a graceful restart on all apaches.
- 00:15 RobH: pushed some fixes on InitialiseSettings.php for a private wiki.
August 25
- 23:07 brion: enabled write API, let's see what happens!
- 22:41 brion: query.php disabled as scheduled.
- 22:07 brion: a SiteConfiguration code change broke upload dirs for a bit. reverted it.
- 20:15 brion: set wgNewUserSuppressRC to true; it was false, unsure why, and it's annoying.
- 14:30 RobH: pushed langlist DNS changes to support cz., as well as a number of other langlist redirects that had not been added to DNS.
- 14:15 RobH: Fixed an error in my additions for the cz.* wiki stuff and pushed the redirects out to the apaches.
- 12:10 domas: mark stealing db10 for stuff
- 11:00 domas: reenabled db10, added db14 to s1, db9 given away to non-core tasks, added full contributions load to db16 (as it has covering index)
- 09:55 domas: reverted an instance where 'IndexPager' was causing filesorts... :)
- 08:00 domas: cleaned up disk space on hume (it was full): added /a to updatedb's prunepaths and ran apt-get clean; 4.5G released (a sketch follows this day's entries).
- 08:00 domas: disabled db10 for db14 bootstrap
- 07:36 domas: updating FlaggedRevs schema on ruwiki.
- 02:26 brion: updating MW, including FlaggedRevs schema update (fp_pending_since, flaggedrevs_tracking)
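The 08:00 hume cleanup combines two standard moves: clear the apt package cache and tell updatedb to skip /a so the nightly locate indexing leaves it alone. A sketch; the updatedb config path is an assumption for that Ubuntu release:

    apt-get clean                           # empty /var/cache/apt/archives
    # prepend /a to updatedb's prune list (assumes mlocate-style /etc/updatedb.conf)
    sed -i 's|^PRUNEPATHS="|PRUNEPATHS="/a |' /etc/updatedb.conf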
August 24
- 17:15 domas: removing db9 entirely, crashed, disk gone...
- 07:20 Tim: deployed the TrustedXFF extension that I just wrote.
- 02:56 Tim: removed db9 from the contributions, watchlist and recentchangeslinked query groups. Long running queries (2000 seconds) from IndexPager::reallyDoQuery and ApiQueryContributions::execute, probably needs index fixes. Removed general load from the remaining query group server, db7.
August 22
- 21:34 RobH: will was moved from A4 to A2.
- 21:00 RobH: diderot unracked
- 00:27 brion: FR feedback enabled on enwikinews as well
- 00:24 brion: Deleting email record rows from cu_changes; some had slipped through before we disabled the privacy breakage
August 21
- 23:47 brion: FlaggedRevs feedback enabled on test & labs
- 23:35 brion: Enabled experimental HTML diff on test.wikipedia.org, en.labs.wikimedia.org, and de.labs.wikimedia.org
- 18:17 RobH: Updated DNS entries to add a number of .cz domains. Also updated redirects.conf to support the added domains.
- 11:43 Tim: installing GlobalBlocking
- 02:42 Tim: returned db16 to general load, a less critical role
- 02:30 Tim: installed mysql-client-5.0 on db11-16. Installed ganglia-metrics on thistle, db1, db4, db7, db12, db13, db14, db15, db16.
- 02:20 Tim: offloaded query group read load from db16. System+user CPU disappeared.
- Recovery spike in I/O shows that replication was suppressed due to read activity. Caught up in ~8 minutes.
- 02:11 Tim: db16 is chronically lagged, probably overloaded with inflexible query group load
- db16 shows high flat system+user CPU since ~01:05
August 20
- 04:15 Tim: attempting to upgrade hume from Ubuntu 7.10 to 8.04 (a sketch of the usual upgrade path follows below)
- 01:24 brion: experimentally lifting $wgExportMaxLimit from 1000 to infinity on enwiki -- testing hack to SpecialExport.php to use unbuffered query
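For the 04:15 hume upgrade, 7.10 (gutsy) to 8.04 (hardy) is a single supported release step, so the usual route at the time was update-manager-core's do-release-upgrade; whether the upgrade was actually driven this way or by editing sources.list by hand isn't recorded:

    apt-get update
    apt-get install update-manager-core    # provides do-release-upgrade
    do-release-upgrade                     # walks through the gutsy -> hardy upgrade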
August 19
- 08:38 Tim: done with lomaria
- 07:42 Tim: taking lomaria out of rotation to drop non-s2a databases and change its replication to s2a-only.
- 04:45 Tim: increased load on db13 to relieve db8, stressed by removal of lomaria from s2
- 04:10 Tim: A hotlinking mirror, getting images from thumb.php, was being visited at high rate, DoSing our storage servers. Referer blocked.
- 03:50 Tim: ixia disk space critical, fixed
- 03:45 Tim: Older s3 slave servers are showing signs of strain. Adding more s3 load to db11 to test its capacity.
- db11 is fine at 47% load ratio, reporting 80-90% disk util, await 5-7ms, load ~6
- 96% load ratio, reporting disk util ~90%, await ~6ms, load ~7.5. Wait CPU ~12%. Yawning in mock-boredom.
- 03:37 Tim: lomaria was relatively overloaded. Adjusted loads, put it in an s2a role since we haven't had any s2a servers since holbach was decommissioned
- 02:40 Tim: removed holbach, webster and bacon from db.php, decommissioned. Removed the decommissioned servers from $wgSquidServersNoPurge.
- 02:27 Tim: compiled udpprofile on zwinger, started the collector. Firewalled port 3811 inbound, /etc/init.d/iptables save (a sketch follows this day's entries). Updated MediaWiki configuration. Updated report.py on bart.
- 01:40 Tim: reduced apache "TimeOut" on srv38/39 from 300 to 10, to limit the impact of LVS flapping
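A sketch of the 02:27 "firewalled port 3811 inbound" step for the udpprofile collector; the permitted source range is an assumption, and the save command is the one noted in the entry:

    # drop profiling UDP on 3811 unless it comes from the internal network (10.0.0.0/8 assumed)
    iptables -A INPUT -p udp --dport 3811 ! -s 10.0.0.0/8 -j DROP
    /etc/init.d/iptables save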
August 18
- 23:00 RobH: added the image scaling servers back into the apache node group and updated their config files. This fixes the thumbnail generation issue evident on both uploads and se.wikimedia (it may have existed elsewhere as well; in fact, it almost certainly did). All apaches restarted.
August 17
- 22:30 jeluf: restarted apaches on srv38/39 due to user reports about broken thumbnails.
August 16
- 13:20 mark: Reenabled ProxyFetch monitor on rendering cluster on lvs3, and set depool_threshold = .5.
- 12:58 Tim: removed ProxyFetch monitor from rendering cluster in pybal on lvs3
- 12:50 Tim: thumbnailing broke completely, at ~03:00 UTC. The apache processes on srv38/39 were stuck waiting for writes to the storage servers. Couldn't find the associated PHP threads on the storage servers to see if something was holding them up, so I tried restarting apache on srv38/39 instead. Suspect broken connections due to regular depooling by pybal
August 14
- 18:55 domas: fixed db16 replication
- 18:50 brion: db16 replication is broken -- contribs/watchlists/recentchangeslinked for enwiki stopped about 4 hours ago (a diagnostic sketch follows these entries)
- ??? ??? db16 crashed
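When a slave like db16 stalls, the first diagnostic is SHOW SLAVE STATUS; a minimal check along these lines (what the 18:55 fix actually involved isn't recorded):

    mysql -h db16 -e 'SHOW SLAVE STATUS\G' \
      | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'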
August 13
- 17:10 Tim: Changed http://noc.wikimedia.org/conf/ to use a PHP script to highlight the source files from NFS on request, instead of them being updated periodically. Added a warning header to all affected files.
- 06:17 Tim: Removed old ExtensionDistributor snapshots (find -mtime +1 -exec rm {} \;), synced r39273
- 02:40 brion: fixed permissions on dewiki thumb dir -- a root-owned directory not writable by apache worked for existing directories, but failed for the 'archive' directory needed for old-version thumbnails used by FlaggedRevs
August 12
- 21:06 mark: Moved LVS load balancing of apaches to lvs3 as well, using a new service IP (10.2.1.1)
- 18:10 brion: fixed up security config that disabled PHP execution in extension directories; several configs had this wrong and non-functional
- 12:45 tfinc: removed /srv/org.wikimedia.dev.donate & /srv/org.wikimedia.donate on srv9 and removed the apache confs that mention them.
August 11
- 23:53 mark: Moved traffic from Russia (iso code 643) to knams
- 23:53 mark: Moved the rendering cluster LVS to lvs3 as well.
- 22:45 mark: Deployed lvs3 as the first new internal LVS cluster host, and moved over the search pools to it using new service IPs (outside the subnet). The rest of the LVS cluster as well as the documentation are a work in progress - let me know if there are any problems.
August 10
- 17:43 Tim: freed up another 100GB or so by deleting all dumps from February 2008.
- 17:27 Tim: freed up a few GB on storage2 by deleting failed dumps: enwiki/{20080425,20080521,20080618,20080629}, dewiki/20080629.
August 8
- 22:46 RobH: setup network access LOM for db13, db14, db15, & db16
- 22:40 brion: set up 'inactive' group on private wikis; this is just "for show" to indicate disabled accounts; adding a user to the group doesn't actually disable them :)
- 21:15 brion: can't seem to reach the 'oai' audit database on adler from the wiki command-line scripts. This is rather annoying; permissions wrong maybe?
August 6
- 17:25 brion: updated dump index page to indicate dumps are halted atm
August 5
- 22:09 mark: Shutdown BGP session to XO for maintenance
- 18:27 RobH: db14, db15, db16 installed with Ubuntu.
- 18:24 brion: enabling flaggedrevs on ruwiki per [1]
- 17:09 brion: enabling flaggedrevs on enwikinews per [2]
- 06:20 jeluf: set wgEnotifUserTalk to true on all but the top wikis, see bugzilla
August 4
- 05:58 brion: dewiki homepage broken for a few minutes due to a bogus i18n update in imagemap breaking the 'desc' alignment options
August 3
- 14:15 robert: got reports about lots of failed searches on nl and pl.wiki, looks like diderot (again) failed to depool a dead server (rabanus), removed manually.
August 1
- 21:05 brion: forcing display_errors on for CLI so I stop discovering that my command-line scripts are broken only _after_ I run them because they showed no errors and I assumed they worked. :)
- 06:39 Tim: wrote a PHP syntax check for scap, using parsekit, that runs about 6 times faster than the old one
- 04:58 Tim: installing PHP on suda (CLI only) for syntax check speed test
- 01:46 Tim: removed db1 from rotation, it's stopped in gdb at a segfault.
- 00:22 brion: aha! found the problem. MaxClients was turned down to 10 from the default of 150 long ago, while the old prefix search was being tested. :) Now back to 150 (a sketch follows this day's entries).
- 00:19 brion: turning off the mobile gateway on yongle for now; it just doesn't appear to work at full load (files moved to a subdir -- in /x/ it seemingly works fine). The server doesn't appear overly loaded -- CPU and load are low -- the requests just stick.
- 00:10 brion: installing APC on yongle, php bits are ungodly slow sometimes
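The 00:22 fix is a one-line Apache config change plus a graceful restart; a sketch, with the config file location on yongle assumed:

    # restore MaxClients from the old prefix-search test value to the default
    sed -i 's/^MaxClients[[:space:]]*10$/MaxClients 150/' /etc/apache2/apache2.conf
    apache2ctl graceful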
2000s
- Archive 1: 2004 Jun - 2004 Sep
- Archive 2: 2004 Oct - 2004 Nov
- Archive 3: 2004 Dec - 2005 Mar
- Archive 4: 2005 Apr - 2005 Jul
- Archive 5: 2005 Aug - 2005 Oct, with revision history 2004-06-23 to 2005-11-25
- Archive 6: 2005 Nov - 2006 Feb
- Archive 7: 2006 Mar - 2006 Jun
- Archive 8: 2006 Jul - 2006 Sep
- Archive 9: 2006 Oct - 2007 Jan, with revision history 2005-11-25 to 2007-02-21
- Archive 10: 2007 Feb - 2007 Jun
- Archive 11: 2007 Jul - 2007 Dec
- Archive 12: 2008 Jan - 2008 Jul
- Archive 12a: 2008 Aug
- Archive 12b: 2008 Sept
- Archive 13: 2008 Oct - 2009 Jun
- Archive 14: 2009 Jun - 2009 Dec
2010s
- Archive 15: 2010 Jan - 2010 Jun
- Archive 16: 2010 Jul - 2010 Oct
- Archive 17: 2010 Nov - 2010 Dec
- Archive 18: 2011 Jan - 2011 Jun
- Archive 19: 2011 Jul - 2011 Dec
- Archive 20: 2011 Dec - 2012 Jun, with revision history 2007-02-21 to 2012-03-27
- Archive 21: 2012 Jul - 2013 Jan
- Archive 22: 2013 Jan - 2013 Jul
- Archive 23: 2013 Aug - 2013 Dec
- Archive 24: 2014 Jan - 2014 Mar
- Archive 25: 2014 April - 2014 September
- Archive 26: 2014 October - 2014 December
- Archive 27: 2015 January - 2015 July
- Archive 28: 2015 August - 2015 December
- Archive 29: 2016 January - 2016 May
- Archive 30: 2016 June - 2016 August
- Archive 31: 2016 September - 2016 December
- Archive 32: 2017 January - 2017 July
- Archive 33: 2017 August - 2017 December
- Archive 34: 2018 January - 2018 April
- Archive 35: 2018 May - 2018 August
- Archive 36: 2018 September - 2018 December
- Archive 37: 2019 January - 2019 April
- Archive 38: 2019 May - 2019 August
- Archive 39: 2019 September - 2019 December
2020s
- Archive 40: 2020 January - 2020 April
- Archive 41: 2020 May - 2020 July
- Archive 42: 2020 August - 2020 November
- Archive 43: 2020 December
- Archive 44: 2021 January - 2021 April
- Archive 45: 2021 May - 2021 July
- Archive 46: 2021 August - 2021 October
- Archive 47: 2021 November - 2021 December
- Archive 48: 2022 January
- Archive 49: 2022 February
- Archive 50: 2022 March
- Archive 51: 2022 April 1-15
- Archive 52: 2022 April 16-30
- Archive 53: 2022 May
- Archive 54: 2022 June
- Archive 55: 2022 July
- Archive 56: 2022 August
- Archive 57: 2022 September
- Archive 58: 2022 October
- Archive 59: 2022 November 1-15
- Archive 60: 2022 November 16-30
- Archive 61: 2022 December
- Archive 62: 2023 January
- Archive 63: 2023 February
- Archive 64: 2023 March
- Archive 65: 2023 April
- Archive 66: 2023 May
- Archive 67: 2023 June
- Archive 68: 2023 July
- Archive 69: 2023 August 1-15
- Archive 70: 2023 August 16-31
- Archive 71: 2023 September
- Archive 72: 2023 October
- Archive 73: 2023 November
- Archive 74: 2023 December
- Archive 75: 2024 January
- Archive 76: 2024 February
- Archive 77: 2024 March
- Archive 78: 2024 April
- Archive 79: 2024 May 1-15
- Archive 80: 2024 May 16-31
- Archive 81: 2024 June 1-15
- Archive 82: 2024 June 16-30
- Archive 83: 2024 July
- Archive 84: 2024 August
- Archive 85: 2024 September
- Archive 86: 2024 October