Server admin log/2008-08

From Wikitech

August 31

  • 23:05 mark: A parser bug in the PowerDNS Bind backend caused unavailability of the wikimedia.org zone for a few minutes, ouch...
  • 22:55 mark: Deployed a PowerDNS pipebackend instance with this script on ns2.wikimedia.org (lily) only. Just one out of three nameservers for stability testing for now. Should there be major trouble, remove all "pipe" backend references from /etc/powerdns/pdns.conf.
  • 18:38 Tim: Going to bed. Status is: srv107 replicating but locked with slow alter table. Can be re-added after it catches up. cluster18 is working, for no apparent reason, and should be migrated to max_rows=20M ASAP. cluster17 needs a master switch so that srv102 can be fixed, after that it should be re-added to the write list. Once srv142 is done copying, it can be restarted and repooled, as can srv145. No need to fix the replication there since it's an old cluster.
  • 18:30 Tim: re-adding cluster19 to the write list, without srv107 which is still altering.
  • 16:22 Tim: srv141 didn't work out, out of disk space, trying copy to srv142 instead (from srv145)
  • 14:44 Tim: srv103 and srv110 done, repooling.
  • 14:02 Tim: srv108 done, changed master to srv108, started max_rows change on srv107
  • 13:51 Tim: started max_rows change on srv110. Not patient enough to do them one at a time.
  • 13:38 Tim: copy to srv110 finished. Put srv110 in, srv103 left out for now for max_rows change
  • 13:27 Tim: taking srv145 out of rotation for copy to new ext store srv141 (has same partitioning)
  • 12:45 Tim: srv109 finished, starting on srv108
  • 11:45 Tim: taking srv103 out of rotation for copy to new ext store srv110
  • 11:37 Tim: alter table blobs max_rows=10000000; on srv109.
  • 11:34 Tim: cluster is too much of a mongrel undocumented mess to set up new ext store servers, and we don't have that many candidates left anyway. Going to try saving the existing clusters.
  • 10:27 Tim: received reports that cluster19 has gone the same way. Most likely all slaves and masters set up that time are affected and will fail roughly simultaneously. Will set up new clusters.
  • 10:15 Tim: set mysql root password on external storage servers where it was blank
  • 10:07 Tim: cluster17 master srv102 has stopped being writable for enwiki due to exhausted MyISAM index table size (max_rows=1000000). Removed from write list, working on it.
  • 07:00 Tim: On srv189: added ddebs.ubuntu.com to sources.list. Installed debug symbols for apache.
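
For readers unfamiliar with the pipe backend mentioned in the 22:55 entry: PowerDNS starts an external program and exchanges tab-separated lines with it over stdin/stdout. The following is a minimal sketch of that exchange, assuming ABI version 1 of the pipe protocol. It is not the script that was deployed on lily; the record table and the function names are hypothetical.

```python
import sys

# Hypothetical static data for illustration: (qname, qtype) -> (ttl, content)
RECORDS = {
    ("example.org", "A"): (300, "192.0.2.10"),
}


def handle_line(line):
    """Return the reply lines for one tab-separated request line."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "HELO":
        # This sketch only speaks ABI version 1.
        if len(fields) > 1 and fields[1] == "1":
            return ["OK\tsketch backend ready"]
        return ["FAIL"]
    if fields[0] == "Q" and len(fields) >= 6:
        _, qname, qclass, qtype, qid, _remote = fields[:6]
        replies = []
        for (name, rtype), (ttl, content) in RECORDS.items():
            if name == qname.lower() and qtype in (rtype, "ANY"):
                replies.append("\t".join(
                    ["DATA", qname, qclass, rtype, str(ttl), qid, content]))
        # Every answer, including an empty one, is terminated by END.
        replies.append("END")
        return replies
    # Anything this sketch does not understand (e.g. AXFR) is refused.
    return ["FAIL"]


def serve(stdin, stdout):
    """Answer requests until stdin closes, flushing after each reply."""
    for line in stdin:
        for reply in handle_line(line):
            stdout.write(reply + "\n")
        stdout.flush()
```

To run it as a backend, a script would end with `serve(sys.stdin, sys.stdout)`. Flushing after every reply matters because the server waits for the answer before sending the next question.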

August 30

  • 22:11 mark: Set up an experimental IPv6 to IPv4 proxy on iris
  • 17:13 Tim: killed long-running convert processes on srv152-189

August 29

  • 21:00 jeluf: checked srv104, added it back to its ES pool, added cluster18 back to wgDefaultExternalStore
  • 16:12 RobH: moved srv52 and srv56 from B2 to C4 for heat issues.
  • 15:32 RobH: srv149 reinstalled as apache core.
  • 13:08 Tim: images on kuwiki were actually broken because the move from amane to storage2 failed. The directory on amane was probably recreated by the thumbnail handler before the migration script created the symlink, resulting in a new writable image directory with no images in it. Merged the two directories and fixed the symlink.
  • 12:00 domas: did space cleanups on amaryllis, and all DBs (all <80% disk usage now :) - preparing for vacation. VACATION!!! :)

August 28

  • 22:50 mark: Set up a dirty, temporary test setup of PyBal on lvs2 doing SSH logins on all apaches for health checking.
  • 21:43 RobH: srv134 reinstalled and back online as apache core.
  • 21:10 RobH: srv130 reinstalled and back online as apache core.
  • 20:09 RobH: searchidx1, search1, search2, search3, search4, search5, search6, & search7 racked with remote management enabled.
  • 16:09 RobH: db9 reinstalled for misc db role.
  • 13:28 Tim: removed dkwiktionary and dkwikibooks from all.dblist. Apparently they had become visible on the web even though they were previously removed. They were created accidentally years ago due to dk being an alias for da.
    • They became visible due to Rob's changes to langlist.
  • 05:59 Tim: Following complaint about bad uploads on kuwiki, running "find -type d -not -perm 777 -exec chmod 777 {} \;" in various upload directories with various maxdepth options.
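
The `find` command in the 05:59 entry can be sketched in Python as below. This is only an illustration of what the command does (chmod every directory whose permission bits are not exactly 777, optionally limited by depth); the function name and its parameters are hypothetical, and the 777 mode simply mirrors the log entry.

```python
import os
import stat


def fix_directory_modes(root, maxdepth=None, mode=0o777):
    """chmod directories under root whose mode differs; return changed paths.

    Depth is counted like find's -maxdepth: root itself is depth 0.
    Symlinks are left alone, as with find's -type d.
    """
    changed = []

    def fix(path):
        st = os.lstat(path)
        if stat.S_ISDIR(st.st_mode) and stat.S_IMODE(st.st_mode) != mode:
            os.chmod(path, mode)
            changed.append(path)

    fix(root)
    for dirpath, dirnames, _files in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if maxdepth is not None and depth >= maxdepth:
            dirnames[:] = []  # children would exceed maxdepth: do not descend
            continue
        # Fix children before os.walk descends into them, so a directory
        # that was previously unreadable can still be traversed.
        for name in dirnames:
            fix(os.path.join(dirpath, name))
    return changed
```

Calling `fix_directory_modes(path, maxdepth=2)` corresponds roughly to running the logged command with `-maxdepth 2` in that directory.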

August 27

  • 22:57 RobH: srv127 reinstalled and back online as apache.
  • 22:34 RobH: srv36 reinstalled and back online as apache.
  • 22:09 RobH: srv117 reinstalled and back online as apache.
  • 22:00 mark: Commented out most LVS related checks in /home/wikipedia/bin/apache-sanity-check which are no longer relevant
  • 22:00 mark: Various changes to the Ubuntu installer, to make SM apache installs work, and for preseeding of NTP config.
  • 21:48 RobH: srv81 reinstalled and back online as apache.
  • 19:07 RobH: Purged cz.wikimedia.org redirect from all knams squids.
  • 18:10 RobH: srv147 reinstalled and deployed as apache.
  • 16:30 RobH: sq48 had a possible issue with hdc. Tested fine, cleaned and back online.
  • 15:19 RobH: srv146 was read-only. Rebooted, fsck, restarted.
  • 08:38 Tim: added FlaggedRevs stats update to crontab on hume
  • 08:03 Tim: running FlaggedRevs/maintenance/updateLinks.php on dewiki

August 26

  • 20:00 RobH: moved srv84 and srv85 from B4 to B3 rack.
  • 18:39 RobH: moved srv82 and srv83 from B4 to B3 rack.
  • 17:30 RobH: srv81 reinstalled and running apache. Needs ext store setup.
  • 16:35 RobH: srv103 restarted and synced.
  • 16:01 brion: srv103 is serving pages with stale software but is unreachable. Needs to be shut down.
  • 14:53 RobH: reinstalled db10 for misc. db tasks.
  • 13:27 Tim: disabled some user account on otrs-wiki
  • 11:15 mark: Added coronelli to search pool 3 on lvs3
  • 00:26 RobH: fixed my own typo in redirects.conf, pushed, graceful all apache.
  • 00:15 RobH: pushed some fixes on InitialiseSettings.php for a private wiki.

August 25

  • 23:07 brion: enabled write API, let's see what happens!
  • 22:41 brion: query.php disabled as scheduled.
  • 22:07 brion: a SiteConfiguration code change broke upload dirs for a bit. reverted it.
  • 20:15 brion: set wgNewUserSuppressRC to true. It was false, unsure why; it's annoying.
  • 14:30 RobH: pushed dns changes to langlist to support cz. as well as a number of other langlist redirects not added to dns.
  • 14:15 RobH: Fixed an error in my additions for the cz.wikistuff, pushed out the redirects to apaches.
  • 12:10 domas: mark stealing db10 for stuff
  • 11:00 domas: reenabled db10, added db14 to s1, db9 given away to non-core tasks, added full contributions load to db16 (as it has covering index)
  • 09:55 domas: reverted an instance where 'IndexPager' was causing filesorts... :)
  • 08:00 domas: cleaned up hume / diskspace, was full, added /a to updatedb prunepaths, apt-get clean too - 4.5G released
  • 08:00 domas: disabled db10 for db14 bootstrap
  • 07:36 domas: updating FlaggedRevs schema on ruwiki.
  • 02:26 brion: updating MW, including FlaggedRevs schema update (fp_pending_since, flaggedrevs_tracking)

August 24

  • 17:15 domas: removing db9 entirely, crashed, disk gone...
  • 07:20 Tim: deployed the TrustedXFF extension that I just wrote.
  • 02:56 Tim: removed db9 from the contributions, watchlist and recentchangeslinked query groups. Long running queries (2000 seconds) from IndexPager::reallyDoQuery and ApiQueryContributions::execute, probably needs index fixes. Removed general load from the remaining query group server, db7.

August 22

  • 21:34 RobH: will moved from A4 to A2.
  • 21:00 RobH: diderot unracked
  • 00:27 brion: FR feedback enabled on enwikinews as well
  • 00:24 brion: Deleting email record rows from cu_changes; some had slipped through before we disabled the privacy breakage

August 21

  • 23:47 brion: FlaggedRevs feedback enabled on test & labs
  • 23:35 brion: Enabled experimental HTML diff on test.wikipedia.org, en.labs.wikimedia.org, and de.labs.wikimedia.org
  • 18:17 RobH: Updated DNS entries to add a number of .cz domains. Also updated redirects.conf to support the added domains.
  • 11:43 Tim: installing GlobalBlocking
  • 02:42 Tim: returned db16 to general load, a less critical role
  • 02:30 Tim: installed mysql-client-5.0 on db11-16. Installed ganglia-metrics on thistle, db1, db4, db7, db12, db13, db14, db15, db16.
  • 02:20 Tim: offloaded query group read load from db16. System+user CPU disappeared.
    • Recovery spike in I/O shows that replication was suppressed due to read activity. Caught up in ~8 minutes.
  • 02:11 Tim: db16 is chronically lagged, probably overloaded with inflexible query group load
    • db16 shows high flat system+user CPU since ~01:05

August 20

  • 04:15 Tim: attempting to upgrade hume from Ubuntu 7.10 to 8.04
  • 01:24 brion: experimentally lifting $wgExportMaxLimit from 1000 to infinity on enwiki -- testing hack to SpecialExport.php to use unbuffered query

August 19

  • 08:38 Tim: done with lomaria
  • 07:42 Tim: taking lomaria out of rotation to drop non-s2a databases and change its replication to s2a-only.
  • 04:45 Tim: increased load on db13 to relieve db8, stressed by removal of lomaria from s2
  • 04:10 Tim: A hotlinking mirror, getting images from thumb.php, was being visited at high rate, DoSing our storage servers. Referer blocked.
  • 03:50 Tim: ixia disk space critical, fixed
  • 03:45 Tim: Older s3 slave servers are showing signs of strain. Adding more s3 load to db11 to test its capacity.
    • db11 is fine at 47% load ratio, reporting 80-90% disk util, await 5-7ms, load ~6
    • 96% load ratio, reporting disk util ~90%, await ~6ms, load ~7.5. Wait CPU ~12%. Yawning in mock-boredom.
  • 03:37 Tim: lomaria was relatively overloaded. Adjusted loads, put it in an s2a role since we haven't had any s2a servers since holbach was decommissioned
  • 02:40 Tim: removed holbach, webster and bacon from db.php, decommissioned. Removed decommissioned servers from $wgSquidServersNoPurge.
  • 02:27 Tim: compiled udpprofile on zwinger, started collector. Firewalled port 3811 inbound, /etc/init.d/iptables save. Updated MediaWiki configuration. Updated report.py on bart.
  • 01:40 Tim: reduced apache "TimeOut" on srv38/39 from 300 to 10, to limit the impact of LVS flapping

August 18

  • 23:00 RobH: added the image scaling servers back into the apache node group and updated their config files. This fixes the thumbnail generation issue evident on both uploads. and se.wikimedia (may have existed elsewhere as well, in fact, it most certainly must have.) All apaches restarted.

August 17

  • 22:30 jeluf: restarted apaches on srv38/39 due to user reports about broken thumbnails.

August 16

  • 13:20 mark: Reenabled ProxyFetch monitor on rendering cluster on lvs3, and set depool_threshold = .5.
  • 12:58 Tim: removed ProxyFetch monitor from rendering cluster in pybal on lvs3
  • 12:50 Tim: thumbnailing broke completely, at ~03:00 UTC. The apache processes on srv38/39 were stuck waiting for writes to the storage servers. Couldn't find the associated PHP threads on the storage servers to see if something was holding them up, so I tried restarting apache on srv38/39 instead. Suspect broken connections due to regular depooling by pybal
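
The `depool_threshold = .5` setting in the 13:20 entry limits how far health checks may shrink a pool. The sketch below shows the general idea under an assumed reading of that setting (a server may be depooled only if the servers left pooled still make up at least the threshold fraction of the pool). It is not PyBal's actual code, and the function name is hypothetical.

```python
def can_depool(servers, depool_threshold):
    """Decide whether one more server may be removed from the pool.

    servers: dict mapping server name -> True if currently pooled.
    depool_threshold: minimum fraction of all servers that must stay pooled.
    """
    pooled = sum(1 for is_pooled in servers.values() if is_pooled)
    remaining = pooled - 1  # pool size if the failing server were removed
    return remaining >= len(servers) * depool_threshold


# With a two-server pool and a threshold of 0.5, one server may be
# depooled, but the last one stays pooled even if its check fails.
two_up = {"srv38": True, "srv39": True}
one_up = {"srv38": True, "srv39": False}
print(can_depool(two_up, 0.5), can_depool(one_up, 0.5))
```

Under this reading, a monitor that flaps on every server at once can no longer empty the pool, at the cost of occasionally keeping a genuinely broken server in rotation.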

August 14

  • 18:55 domas: fixed db16 replication
  • 18:50 brion: db16 replication is broken -- contribs/watchlists/recentchangeslinked for enwiki stopped about 4 hours ago
  •  ??? ??? db16 crashed

August 13

  • 17:10 Tim: Changed http://noc.wikimedia.org/conf/ to use a PHP script to highlight the source files from NFS on request, instead of them being updated periodically. Added a warning header to all affected files.
  • 06:17 Tim: Removed old ExtensionDistributor snapshots (find -mtime +1 -exec rm {} \;), synced r39273
  • 02:40 brion: fixed permissions on dewiki thumb dir -- root-owned directory not writable by apache worked for existing directories, but failed for the 'archive' directory needed for old-version thumbnails used by FlaggedRevs
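
The snapshot cleanup in the 06:17 entry relies on `find -mtime +1`, which is easy to misread: find truncates a file's age to whole 24-hour periods, so `+1` only matches files at least two full days old. A rough Python equivalent, with a hypothetical function name and parameters, makes that rule explicit.

```python
import os
import time


def remove_older_than(root, days=1, now=None):
    """Delete files under root whose age in whole days exceeds `days`.

    Mirrors `find -mtime +N`: the age is truncated to whole 24-hour
    periods before comparing, so days=1 removes files that are at
    least 48 hours old. Directories are left in place.
    """
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            age_days = int((now - os.lstat(path).st_mtime) // 86400)
            if age_days > days:
                os.remove(path)
                removed.append(path)
    return removed
```

A file modified 36 hours ago has a truncated age of 1 day and is therefore kept; one modified 72 hours ago is removed.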

August 12

  • 21:06 mark: Moved LVS load balancing of apaches to lvs3 as well, using a new service IP (10.2.1.1)
  • 18:10 brion: fixed up security config that disabled PHP execution in extension directories; several configs had this wrong and non-functional
  • 12:45 tfinc: removed /srv/org.wikimedia.dev.donate & /srv/org.wikimedia.donate on srv9 and removed the apache confs that mention them.

August 11

  • 23:53 mark: Moved traffic from Russia (iso code 643) to knams
  • 23:53 mark: Moved the rendering cluster LVS to lvs3 as well.
  • 22:45 mark: Deployed lvs3 as the first new internal LVS cluster host, and moved over the search pools to it using new service IPs (outside the subnet). The rest of the LVS cluster as well as the documentation are a work in progress - let me know if there are any problems.

August 10

  • 17:43 Tim: freed up another 100GB or so by deleting all dumps from February 2008.
  • 17:27 Tim: freed up a few GB on storage2 by deleting failed dumps: enwiki/{20080425,20080521,20080618,20080629}, dewiki/20080629.

August 8

  • 22:46 RobH: setup network access LOM for db13, db14, db15, & db16
  • 22:40 brion: set up 'inactive' group on private wikis; this is just "for show" to indicate disabled accounts, adding a user to the group doesn't actually disable them :)
  • 21:15 brion: can't seem to reach the 'oai' audit database on adler from the wiki command-line scripts. This is rather annoying; permissions wrong maybe?

August 6

August 5

  • 22:09 mark: Shutdown BGP session to XO for maintenance
  • 18:27 RobH: db14, db15, db16 installed with Ubuntu.
  • 18:24 brion: enabling flaggedrevs on ruwiki per [1]
  • 17:09 brion: enabling flaggedrevs on enwikinews per [2]
  • 06:20 jeluf: set wgEnotifUserTalk to true on all but the top wikis, see bugzilla

August 4

  • 05:58 brion: dewiki homepage broken for a few minutes due to a bogus i18n update in imagemap breaking the 'desc' alignment options

August 3

  • 14:15 robert: got reports about lots of failed searches on nl and pl.wiki, looks like diderot (again) failed to depool a dead server (rabanus), removed manually.

August 1

  • 21:05 brion: forcing display_errors on for CLI so I don't keep discovering my command-line scripts are broken _after_ I run them, they don't show any errors, and I thought they worked. :)
  • 06:39 Tim: wrote a PHP syntax check for scap, using parsekit, that runs about 6 times faster than the old one
  • 04:58 Tim: installing PHP on suda (CLI only) for syntax check speed test
  • 01:46 Tim: removed db1 from rotation, it's stopped in gdb at a segfault.
  • 00:22 brion: aha! found the problem. MaxClients was turned down to 10 from default of 150 long ago, while the old prefix search was being tested. :) now back to 150
  • 00:19 brion: just turning off the mobile gateway on yongle for now, it just doesn't appear to be working at full load. (files moved to subdir -- in /x/ it works fine seemingly). Server doesn't appear overly loaded -- CPU and load are low -- just the requests stick.
  • 00:10 brion: installing APC on yongle, php bits are ungodly slow sometimes
