Server admin log/2008-09

September 30

  • 23:45 brion: apt is borked on mayflower due to smartmontools refusing to load
  • 20:46 tomasz: cp'd phpmyadmin from dev.civicrm to prod civicrm for david
  • 20:06 brion: replaced 'wikimedia.org' with 'meta.wikimedia.org' in local VHosts list in wgConf.php. The general 'wikimedia.org' was causing CodeReview's diff loads (via codereview-proxy.wikimedia.org) to fail as they were hitting localhost instead of the proxy. Do we need to add more vhosts to this list, or redo how it works?
  • 19:45 brion: test-deploying CodeReview on mediawiki.org
  • 19:?? brion: set up temporary limited SVN JSON proxy as codereview-proxy.wikimedia.org
  • 19:17 RobH: updated DNS to add something for Brion.
  • 18:21 mark: cache cleaning complete
  • 15:05 Tim: doing some manual purges of URLs requested on #wikimedia-tech
  • 15:00 mark: Cleaning caches of all backend text squids one by one, starting with pmtpa
  • 14:20 mark: pooled all squids manually to fix the issues.
  • 14:10 RobH: Site back up, slow as squids play catchup.
  • 14:06 RobH: Pushed out old redirects.conf and restarted apaches.
  • 14:01 RobH: Site is down, go me =[
  • 14:00 RobH: updated redirects.conf and pushed change for orphaned domains.
  • 13:38 RobH: updated dns for more orphaned domains.
  • 13:11 Tim: cluster13 and cluster14 both have only one server left in rotation. Shut down apache on srv129 and srv139 out of fear that it might hasten their doom.
  • 10:12 Tim: Switched ES cluster 3-10 to use Ubuntu servers (again)
  • 10:03 Tim: depooled ES on srv127, has been wiped
  • 10:00 Tim: depooled thistle, is down
  • 09:20 Tim: Set up MediaWiki UDP logging
  • 08:05 Tim: removed the ORDER BY clauses from the ApiQueryCategoryMembers queries, to work around MySQL bug, probably involving truncated indexes
  • 07:08 Tim: re-enabled the API
  • 06:56 Tim: ixia (s2 master) overloaded due to ApiQueryCategoryMembers queries. Disabled the API and killed the offending queries
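
The 20:06 entry above turns on how a URL's host is matched against the local VHosts list. A minimal Python sketch of one plausible matching rule, assuming (this log does not confirm it) that a URL counts as local when its host or any parent domain appears in the list. The function name `is_local_url`, the URL path, and the list contents are hypothetical:

```python
from urllib.parse import urlsplit

def is_local_url(url, local_vhosts):
    """Return True if the URL's host, or any parent domain, is in local_vhosts.

    Assumption: matching walks from the TLD towards the full host name.
    """
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    # Candidates: "org", "wikimedia.org", "codereview-proxy.wikimedia.org"
    for i in range(len(parts) - 1, -1, -1):
        if ".".join(parts[i:]) in local_vhosts:
            return True
    return False

# Hypothetical proxy URL, used only for illustration.
proxy_url = "http://codereview-proxy.wikimedia.org/index.php"
# A broad entry swallows every subdomain, including the proxy.
broad = is_local_url(proxy_url, {"wikimedia.org"})
# A specific entry leaves the proxy to be fetched remotely.
narrow = is_local_url(proxy_url, {"meta.wikimedia.org"})
```

Under this rule the broad 'wikimedia.org' entry treats the proxy as local, which would match the symptom in the log; listing specific hosts avoids that at the cost of having to enumerate them.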

September 29

  • 22:20 brion: reenabled history export ($wgExportAllowHistory), but put $wgExportMaxHistory back to 1000 instead of experimental 0 for enwiki. (sorry enwiki)
  • 21:27 RobH: fixed the mounts on srv163 and started apache back up.
  • 20:20 brion: srv163 has bad NFS config, missing upload and math mounts. I've shut off its apache so it stops polluting the parser cache with math errors.
  • 17:01 RobH: updated apache redirects.conf for orphaned domains, restarted all apaches.
  • 15:06 RobH: updated DNS to reflect a number of orphaned domains.
  • 08:48 Tim: put db7 back into watchlist rotation (99%)
  • 08:08 domas: enabled ipblocks replication on db7, resynced from db16
  • 08:00 domas: Replaced gcc-4.2 build on db7 with gcc-4.1 one, from /home/wikipedia/src/mysql-4.0.40-r9-hardy-x86_64-gcc4.1.tar.gz

September 28

  • 17:52 mark: Upgraded mchenry to Hardy.
  • 17:15 mark: Upgraded sanger to Hardy.
  • 13:43 mark: Repooled srv150
  • 13:25 mark: Upgraded php5 and APC on all ubuntu apaches... got tired of restarting them. ;)
  • 12:06 Tim: on db7: replicate-ignore-table=enwiki.ipblocks. Good enough for now.
  • 11:51 Tim: schema update at 04:44 made db7 segfault. Replication stopped, watchlists stopped working after code referencing the new schema was synced. Switched to db16 for watchlist and RCL. Tried INSERT SELECT, that segfaulted too.
  • 09:37 mark: Made syslog-ng on db20 filter the flood of 404s in /var/log/remote
  • 09:15 mark: Restarted all (and only) segfaulting apaches
  • 05:38 Tim: svn up/scap to r41337.
  • 04:44 Tim: applying patch-ipb_allow_usertalk.sql on all DBs. No master switches.

September 27

  • 20:41 mark: Packaged a newer PHP5 (5.2.4 from Ubuntu Hardy, with CDB support) and a new APC (3.0.19). Deployed it on srv153 for testing.
  • 18:15 brion: srv100 looks particularly crashy.
  • 18:09 brion: got some complaints about ERROR_ZERO_SIZED_OBJECT on saves, seeing a lot of segfaults in log. Restarting all apaches to see what they do.

September 26

  • 22:49 RobH: repooled sq49.
  • 22:00 RobH: depooled sq49 for power testing.
  • 21:50 RobH: pulled search7 for power testing and left off, as the power circuit would trip if it was left on there.
  • 21:18 RobH: put srv189 back into rotation.
  • 19:51 RobH: Pulled srv189 for power testing.

September 25

  • 21:41 RobH: had to recreate /home/wikipedia/logs/jobqueue/error as it was lost and job queue runners failed due to it not being there. Restarted runners.
  • 19:08 domas: fixed clear-profile by replacing 'zwinger' with 'zwinger.wikimedia.org' - apparently datagrams to 127.1 used to fail.
  • 18:44 brion: manually applied r41264 to MimeMagic.php to fix uploads of OpenDocument files to private/internal wikis
  • 15:25 RobH: bayes minimally installed.
  • 15:23 RobH: reverted statistics1 to bayes in dns, pushed dns change.
  • 14:04 RobH: bayes racked and ready for install.
  • 05:00 mark: Flapped BGP session to HGTN, to resolve blackholing of traffic
  • 03:20 Tim: stopped apache on srv167, was segfaulting again. I suspect binary version mismatch between compile and deployment, e.g. APC was compiled for libc 2.5-0ubuntu1, deployed on libc 2.7-10ubuntu3.
  • 03:03 Tim: restarted segfaulting apaches srv111,srv168,srv154,srv167,srv46
  • 02:28 Tim: srv35 was segfaulting again, probably because it was in both the test.wikipedia.org pool and the main apache pool. Having two copies of everything tends to make the APC cache overflow, which triggers bugs in APC and leads to segfaulting. Removed it from the main apache pool.

September 24

  • 20:23 RobH: restarted srv186 apache due to segfault.
  • 20:21 RobH: restarted srv179 apache due to segfault.
  • 20:05 brion: restarted srv35's apache (test.wikipedia.org) was segfaulting
  • 19:25 tomasz: restricted grant for 'exim'@'208.80.152.186' to 150 MAX_USER_CONNECTIONS
  • 18:40 mark: Increased TCP backlog setting on mchenry from 20 to 128.
  • 18:19 brion: restoring ApiQueryDeletedrevs and Special:Export since they're not at issue. Domas thinks some of the hangs may be caused by mails getting stuck via ssmtp when the mail server is overloaded; auto mails on account creation etc may hold funny transactions open
  • 17:52 brion: disabling SiteStats::update() actual update query since it's blocking for reasons we can't identify and generally breaking shit
  • 17:50 RobH: updated nagios files/node groups for raid checking on hosts without 3ware present
  • 17:37 brion: domas thinks the problem is some kind of lock contention on site_stats, causing all the edit updates to hang -- as a result the ES connections stack up while waiting on the core master. I'm disabling ss_active_users update for now, that sounds slow...
  • 17:34 RobH: srv131 apache setup is borked, removing from lvs.
  • 17:33 RobH: added proper ip info for lo device on srv131
  • 17:24 brion: temporarily disabling special:export
  • 17:22 brion: the revert got us back to being able to read the site most of the time, but still lots of problems saving -- ES master on cluster18 still has lots of sleeper connections and refuses new saves
  • 17:10 brion: trying a set of reverts to recent ES changes
  • 16:43 brion: temporarily disabling includes/api/ApiQueryDeletedrevs.php, it may or may not be hitting too much ES or something?
  • 16:38 brion: seeing lots of long-delayed sleeping connections on ES masters, not running queries. trying to figure out w/ Aaron what could cause these
  • 16:36 mark: Set up a syslog server on db20, logging messages from other servers to /var/log/remote.
  • 16:31 brion: confirmed PHP fatal error during connection error (backend connection error "too many connections"). Manually merging r41230 to live copy to skip around the frontend PHP error
  • 16:20 brion: we're getting reports of eg "(Can't contact the database server: Unknown error (10.0.2.104))" on save. Trying to investigate, but MediaWiki was borked by the previous reversions of core DB-related files to a 6-month-old version with incompatible paths. Trying to re-sanitize MW to r41097 straight
  • 15:45 Rob: setup wikimedia-task-appserver on srv141.
  • 15:09 mark: The problem reappeared, looks like a bug in MediaWiki, possibly triggered by some issue in ES. Reverted the files includes/ExternalStore.php includes/ExternalStoreDB.php includes/Revision.php includes/db/Database.php includes/db/LoadBalancer.php to r35098 and ran scap.
  • 14:50 mark: Reports of most/all saves failing with PHP fatal error in /usr/local/apache/common-local/php-1.5/includes/ExternalStoreDB.php line 127: Call to a member function nextSequenceValue() on a non-object. Suspected APC cache corruption, did a hard restart of all apaches which appeared to resolve the problem.
  • 07:15 Tim: installed wikimedia-nis-client on db20

September 23

  • 20:03 RobH: srv170 reporting apache down, synced, restarted.
  • 20:02 RobH: srv188 was not running apache, synced and started.
  • 19:59 RobH: Installed memcached on srv183, updated mc-pmtpa.php.
  • 19:57 RobH: Installed memcached on srv66, updated mc-pmtpa.php.
  • 19:54 RobH: Installed memcached on srv141, updated mc-pmtpa.php.
  • 19:52 RobH: srv106 back up, apache synced and memcached running.
  • 19:45 RobH: srv127 complained of port in use starting apache, rebooted, all is fine.
  • 19:27 RobH: removed srv106 from active memcached, replaced with srv127, sync-file mc-pmtpa.php
  • 18:00 RobH: srv127 had booting issues into the OS, reinstalled and redeployed.
  • 17:08 RobH: srv138 was locked up, restarted.
  • 16:53 RobH: srv136 was locked up, restarted, synced, added correct lvs ip info.
  • 16:45 RobH: srv126 was locked up, restarted, synced, added correct lvs ip info.
  • 16:29 RobH: rebooted srv106, was locked up.
  • 16:25 RobH: reinstalled srv101, was old ubuntu with no ES data.
  • 16:13 RobH: reinstalled srv143 and srv148 from FC to Ubuntu, redeployed as apache
  • 15:57 RobH: reinstalled srv128 and srv140 from FC to Ubuntu, redeployed as apache.
  • 14:00-14:50 Tim: cleaned up /home/wikipedia somewhat, put various things in /home/wikipedia/junk or /home/wikipedia/backup, moved some lock files to lockfiles, deleted ancient /h/w/c/*.png symlinks, etc.
  • 14:50 Tim: Made sync-common-file use rsync instead of NFS since some mediawiki-installation servers still have a stale NFS handle for /home
  • 14:31 RobH: srv189 back in apache rotation
  • 14:20 RobH: srv130 back in apache rotation
  • 13:56 Tim: started rsync daemon on db20
  • 13:49 Tim: restored dsh node groups on zwinger
  • 13:40 Tim: installed udplog 1.3 on henbane
  • 00:05 - 01:20 Tim: copying everything from the recovered suda image except /home/kate/xx, /home/from-zwinger and /home/wikipedia/logs. Will copy /home/wikipedia/logs selectively.

September 22

  • 21:30 brion: noting that ExtensionDistributor extension is disabled for now due to the NFS problem
  • 18:59 RobH: srv131 offline due to kernel panic. Cannot bring back until /home issue is resolved.
  • 18:00 brion: things seem at least semi-working.
    1. everything hung
    2. suda had some kind of kernel crash
    3. after reboot, it was found to have a couple flaky disks
    4. brion hacked up MW config files to skip the NFS logging
    5. mark set up an alternate /home NFS server
  • 17:50 mark: Set up db20 as an (empty) temporary suda replacement. Set up NFS server for /home.
  • 17:20 mark: suda died.
  • 17:25 RobH: srv130 not working right, removed from pool.
  • 16:32 RobH: removed srv8 and srv10 from nagios, resynced.
  • 15:00 mark: Site down completely. Post-mortem:
    1. Rob is untangling power cables in rack B2, and both asw-b2-pmtpa and asw3-pmtpa (in B4) lose power
    2. Two racks unreachable, PyBal sees too many hosts down and won't depool more
    3. Rob brings power to asw-b2-pmtpa back up, but connectivity loss to B4 is not noticed
    4. Mark investigates why LVS isn't working, adjusts PyBal parameters, until PyBal pools not a single server
    5. Apaches are unhappy about completely missing ES clusters
    6. Connectivity loss to B4 discovered, restored
    7. Site back online
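
Step 2 of the 15:00 post-mortem describes a safety limit: once too many hosts look down, the load balancer refuses to depool more. A rough Python sketch of that kind of guard, under stated assumptions. It is not PyBal's actual code; `choose_pooled`, the `depool_threshold` parameter, and the host names are hypothetical:

```python
def choose_pooled(health, depool_threshold=0.5):
    """Pick hosts to keep pooled, given a {host: is_healthy} mapping.

    Hypothetical guard: at least depool_threshold of all hosts stay
    pooled, even if that means keeping hosts that fail health checks.
    """
    hosts = sorted(health)
    pooled = [h for h in hosts if health[h]]
    min_pooled = int(len(hosts) * depool_threshold)
    # Too many hosts look down: refuse to depool past the limit by
    # keeping some unhealthy hosts in rotation.
    for h in hosts:
        if len(pooled) >= min_pooled:
            break
        if not health[h]:
            pooled.append(h)
    return sorted(pooled)

# One host down out of four: only the healthy hosts stay pooled.
one_down = choose_pooled({"h1": True, "h2": True, "h3": False, "h4": True})
# Three of four down: the guard keeps an unreachable host pooled.
rack_loss = choose_pooled({"h1": False, "h2": False, "h3": False, "h4": True})
# Threshold lowered to zero with every host down: nothing is pooled.
no_guard = choose_pooled({"h1": False, "h2": False}, depool_threshold=0.0)
```

The trade-off is the one the post-mortem hints at: the guard keeps some capacity pooled during a monitoring or network failure, but it can also route traffic to hosts that really are unreachable, and relaxing it can leave no servers pooled at all.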

September 21

  • 10:10 Tim: disabled srv106's switch port. Was running the job queue with old configuration, inaccessible by ssh.

September 20

  • 14:45 Tim: re-enabled Special:Export with $wgExportAllowHistory=false. Please find some way of doing transwiki requests which doesn't involve crashing the site.
  • 14:30 Tim: People were reporting ES current master overload, no ability to save pages at all. This was apparently due to the small number of max connections on srv103/srv104. Most threads were sleeping. The real culprit was apparently db2 being slow due to a long-running (1 hour) Special:Export request. Disabled Special:Export entirely.
  • 12:00 mark: Restored zwinger's IPv6 connectivity; removed svn.wikimedia.org from /etc/hosts
  • 11:40 mark: Found an IP conflict; 208.80.152.136 was assigned to srv9 but not listed in DNS
  • 10:09 Tim: removed srv69 and srv118 from the memcached list, down
  • 09:02 Tim: ES on srv84 had new passwords, was not accepting connections from 3.23 clients on srv32-34. Fixed.
  • 08:45 Tim: depooled ES srv110, reformatted by Rob while it was still a current ES slave. Depooled srv137, mysqld was shut down on it for some reason. One server left in cluster14.
    • srv137 has a corrupt read-only file system on /usr/local/mysql/data2
  • 05:34 Tim: svn.wikimedia.org not reachable from zwinger via IPv6, causing very slow operation due to timeouts. Hacked /etc/hosts.
  • 04:58 Tim: svn up/scap to r41053
  • 01:06 Tim: ES migration failed on all clusters except cluster3 (the cluster I used to test the script), due to MySQL 4.0-4.1 version differences. Restarting with mysqldump --default-character-set=latin1.
  • 00:14 Tim: restarted segfaulting apaches: srv167,srv152,srv172,srv171,srv153,srv151,srv176,srv155,srv112,srv119,srv111,srv113
  • 00:10 Tomasz: upgraded public and private depot to svn 1.5 data format.
  • 00:00 Tomasz: svn installed ubuntu 8.04 along with svn 1.5.

September 19

  • 23:00 Tomasz: svn installed ubuntu 7.10, ready
  • 22:55 RobH: db20 installed, ready for next upgrade.
  • 22:38 RobH: db19 installed, ready for setup.
  • 22:26 RobH: db18 installed, ready for setup.
  • 18:00 brion: updated mwlib on bindery.wikimedia.org and Collection extension
  • 15:59 RobH: reinstalled srv70, srv100, srv110-srv119 from FC to ubuntu, redeployed.
  • 07:30 Tim: srv38 was hanging while attempting to write to log files on /home. Fixed permissions on /mnt/upload4/en/thumb which was causing a high log write rate, restarted apache, disabled search-restart cron job, restarted pybal. Seems to be fixed.
  • 01:55 Tim: the issue with ES was the lack of a master pos wait between transfer and slave shutdown. Fixing.
  • 01:00 Tim: restarting possibly segfaulting apaches on srv158,srv177,srv178,srv173,srv51,srv187,srv182,srv44,srv117. Keeping srv139 for debugging, it has kindly depooled itself by segfaulting on pybal health checks.

September 18

  • 17:39 RobH: srv35, srv37, srv55 & srv59 bootstrapped with ganglia.
  • 17:37 RobH: srv40, srv41, srv43-srv53 bootstrapped with ganglia.
  • 17:36 RobH: srv60-srv68 bootstrapped with ganglia.
  • 17:31 RobH: srv151-srv188 bootstrapped with ganglia.
  • 11:45 Tim: reverted db.php change, still has issues.
  • 11:18 Tim: removed apaches_yaseo from nagios config, changed apaches_pmtpa to apaches.
  • 11:09 Tim: in db.php, switched ES clusters 3-10 to use the ubuntu servers

September 17

  • 23:57 brion: set $wgLogo to $stdpath for wikinews -- old local /upload path failed to redirect properly on secure.wikimedia.org interface
  • 22:19 mark: Deployed the rest of the new search servers, search2 - search7.
  • 19:25 JeLuF: changed robots.php to send both Mediawiki:robots.txt and /apache/common/robots.txt
  • 19:23 RobH: Removed srv63 from memcache list, put in spare memcache and synced file.
  • 19:14 RobH: restarted memcached on srv74
  • 19:00 RobH: reinstalled srv62, srv64, srv65, srv66, srv67, & srv68 from FC to Ubuntu.
  • 18:26 RobH: srv63 shutdown due to hdd failure.
  • 18:25 RobH: srv61 shutdown due to overheating issue.
  • 18:16 RobH: Reinstalled srv51, srv52, srv53, srv54, srv55, srv56, srv57, srv58, srv59, srv60, srv61 as ubuntu apache servers.
  • 16:56 RobH: Reinstalled srv44, srv45, srv46, srv47, srv48, srv49, & srv50 as ubuntu apache servers.
  • 16:00 RobH: Reinstalled srv35, srv37, srv40, srv41, srv43 as ubuntu apache servers.
  • 16:00 RobH: moved srv37 from pybal render group to apache group
  • 01:50 brion: killed obsolete juriwiki-l list per delphine

September 16

  • 22:59 mark: srv133 is giving Bus errors, read-only file systems, and was therefore automatically depooled by PyBal. Good times.
  • 22:59 mark: Installed memcached on srv182 (was missing?), restarted memcached on srv70, srv169 and replaced instance of srv141 by srv142.
  • 22:36 mark: Prepared searchidx1 and search1 for production, if things work sufficiently well I'll deploy the others tomorrow
  • 21:30 brion: found a bunch of memcache machines down or not running memcached: 170, 141, 70, 169, 182
  • 21:01 mark: Building search deployment with rainman, with search1 as test host
  • 20:33 brion: fixed secure.wikimedia.org for Wikimania wikis -- wikimedia-ssl-backend.conf rewrite rules were mistakenly excluding digits from the wiki pseudodir
  • 18:00 JeLuF: made the main page of https://secure.wikimedia.org/ editable via http://meta.wikimedia.org/wiki/Secure.wikimedia.org_template using extract2.php

September 15

  • 22:45 Tim: rebooted srv151. Shut down mysqld and then gave it a sync; sysrq b.
  • 21:11 RobH: Installed Ubuntu on searchidx1, search1, search2, search3, search4, search5, search6, search7.
  • 19:00 RobH: searchidx1 installed.

September 14

  • 18:45 mark: Upgraded PyBal on lvs3 to a newer version, and set up SSH checking (once a minute) of all apaches, see LVS.
  • 18:42 mark: srv170 is doing OOM kills
  • 18:28 mark: Upgraded wikimedia-task-appserver on all Ubuntu app servers, which creates a limited ssh account pybal-check for use by PyBal. Create the account manually on all Fedora apaches
  • 17:01 mark: Apache on srv151 is stuck on an NFS mountpoint and cannot be restarted. I'm not rebooting the box as I'm not sure what's going on with ES atm.

September 12

  • 23:30 jeluf: apache on srv37 doesn't restart, libhistory.so.4 is missing
  • 23:15 mark: NTP ip missing on zwinger, readded
  • 23:00 jeluf: proxy robots.txt requests through live-1.5/robots.php, which delivers Mediawiki:robots.txt if it exists and /apache/common/robots.txt otherwise.
  • 15:30 Tim: set read_only=0 on srv108 (Rob rebooted it)
  • 15:00 RobH: bart crashed, rebooted.
  • 14:56 Tim: pulling out all the stops now, running migrate.php migrate-all.
  • 14:45 RobH: synced srv104, back online.
  • 14:40 RobH: synced db.php.
  • 14:32 RobH: srv105 unresponsive, rebooted.
  • 14:25 Tim: Removed the corrupted ES installations on srv151-176
  • 14:18 RobH: Installed NRPE plugins on db9-db16.
  • 09:01 Tim: reverted, blob corruption due to charset conversion observed
  • 07:58 Tim: Experimentally switched db.php to use the ubuntu servers for cluster3/4.
  • 07:50 Tim: Stopping replication on the ubuntu cluster3 and cluster4 servers, and changing the file permissions on the MyISAM files to prevent any kind of modification by the mysql daemon. This is done by the new lock/unlock commands in ~tstarling/migrateExtToUbuntu/migrate.php.
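
The 23:00 jeluf entry above describes a simple fallback: serve the on-wiki robots page if it exists, else the shared static file. A minimal Python sketch of that selection logic only; the real robots.php is not shown in this log, and `robots_txt` and its parameters are hypothetical names:

```python
def robots_txt(wiki_page_text, static_text):
    """Choose the robots.txt body to serve.

    wiki_page_text: content of the on-wiki robots page, or None if the
                    page does not exist (assumed representation).
    static_text:    content of the shared static robots.txt file.
    """
    if wiki_page_text is None:
        # No on-wiki page: fall back to the static file.
        return static_text
    return wiki_page_text

with_page = robots_txt("User-agent: *\nDisallow: /wiki/Special:\n",
                       "User-agent: *\nDisallow: /w/\n")
without_page = robots_txt(None, "User-agent: *\nDisallow: /w/\n")
```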

September 11

  • 05:30 Tim: Migrating cluster4. Testing new binlog deletion feature.

September 10

  • 15:40 RobH: Racktables database moved from will to db9.
  • 15:00 RobH: Reinstalled srv185, srv186, srv187 to newest ubuntu, online as apache.
  • 05:00 - 10:10 Tim: copied cluster3 to srv151, srv163 and srv175, second attempt, seems to have worked this time

September 9

  • 23:25 brion: for a few minutes got some complaints about 'Can't contact the database server: Unknown error (10.0.6.22)' (db12). This box seems to be semi-down pending some data recovery, but load wasn't disabled from it. May have gotten load due to other servers being lagged at the time. Set its load to 0.
  • 18:49 RobH: Moved maurus from A4 to A2.
  • 18:05 mark: Made lvs2 a temporary LVS host for upload.pmtpa.wikimedia.org to be able to remove alrazi from its rack. Will redo this setup soon.
  • 17:50 RobH: srv61 reinstalled and setup as apache and memcached.
  • 17:50 RobH: srv144 reinstalled, needs ES setup.
  • 17:50ish brion: updated planet to 2.0, cleared en feed caches. Something was broken in them which caused updates to fail since September 5.
  • 17:42 RobH: Updated DNS to reflect new search servers.
  • 15:11 RobH: Moved isidore; upon reboot, noticed the wordpress update didn't take, reapplied it to blog and whygive installations.
  • 14:49 RobH: zwinger and khaldun moved from A4 to A2.
  • 10:26 Tim: copying ES data from srv32 to srv151, srv163 and srv175
  • 01:30-10:20 Tim: testing and debugging the ubuntu ES migration script on srv151, srv163 and srv175
  • 02:15 Tomasz: Added bugzilla reporting cron on isidore.
  • 00:48 Tim: granted root access to zwinger on all ES servers, useful for migration

September 8

September 7

  • 15:48 mark: alrazi overloaded, switch traffic back to knams and hope it can take the load
  • 14:37 mark: knams partially back up, broken line card still down. Moved some important servers to another line card. knsq16 - knsq30 will be down for the upcoming days, as well as most management.
  • 10:20 domas: copied in mysql build from db16 to db12 - db12 was running gcc-4.2 one, and in crashloop. next crash will bring up proper build :)

September 6

  • 20:15 river: failure of many hosts at knams (including lvs), moved to authdns-scenario knams-down
  • 12:05 hashar: merged r40433 to fix &editintro
  • 05:30 JeLuF: image upload on enwiki enabled again. Slowly deleting images from amane.
  • 03:00 JeLuF: image upload on enwiki disabled, copying enwiki images to storage1

September 5

  • 22:00-00:00 Hashar: gmaxwell provided a backup of files (downloaded in ~/files/); I recovered the ones that no longer existed.
  • 17:03 Tim: Updated trusted-xff.cdb. Fixes AOL problems.
  • 14:45 JeLuF: started to rsync enwiki images from amane to storage1 in preparation of tomorrow's final move of the image directory
  • 04:24 Tim: sync-file screwup caused thumbnails to be created in the source image directory. Will try to repair.
  • 03:13 Tim: srv151 is depooled for some reason. No indication as to why in the logs or config files. Using it to test the new wikimedia-task-appserver package. Will repool once I get it working properly.

September 4

  • 22:15 JeLuF: Switched srv179's mysql to read_only
  • 22:10 JeLuF: OTRS back online, switched to db9. Changed exim config on mchenry, too.
  • 20:00 JeLuF,RobH: Shut down OTRS, migrating its DB from srv179 to db9
  • 19:49 RobH: db10 replication slave of db9
  • 17:58 RobH: civicrm and dev civicrm database now located on db9 (was on srv10)
  • 17:19 RobH: Bugzilla database is now located on db9 (was on srv8)
  • 16:52 RobH: Both the wikimedia blog and donation blog databases are now residing on db9 (was on srv8)
  • 16:43 Tim: re-enabled thumb.php after some of the culprits came to talk to me on #wikimedia-tech and promised to reform their ways
  • 11:09 Tim: fixed APC on srv38 and srv39, was broken.
  • 10:35 Tim: srv38 and srv39 have been overloaded since 05:50. Blocked thumb.php for external clients.
  • 05:30 Tim: restarted srv138 with sysrq-trigger. Was reporting "bus error" on sync-file.
  • 04:03 Tim: upgrading to wmerrors-1.0.2 on all mediawiki-installation

September 3

  • 23:00 jeluf: moved enwiki's upload archive from amane to storage1, freeing up some 20G on amane.
  • 16:54 brion: tweaking ApiOpenSearchXml to hopefully fix the rendering-thumbs-on-text-apaches problem
  • 14:01 RobH: updated libtiff4 on all apaches
  • 04:23 Tim: svn up/scap to r40356
  • 04:13 Tim: populating ss_active_users
  • 03:21 Tim: applying patch-ss_active_users.sql

September 2

  • 19:50 mark: Repooled srv181
  • 19:31 mark: Many boxes still in inconsistent state because of OOM kills. Some background processes not running (e.g. ntpd). Rebooted srv159, srv182, srv154, srv156, srv157, srv158, srv181, srv188
  • 19:28 mark: scap
  • 19:01 mark: Killed all stuck convert processes on srv151..srv188 (but left srv189 intact for debugging)
  • 18:51 mark: Rebooted srv169, srv180
  • 18:48 mark: Remounted /mnt/upload4 on srv151..srv188 (not srv189)
  • 18:33 mark: Many application servers are running out of memory, one by one. This seems to be caused by stuck thumbnail convert processes which end up there. The thumbnail convert processes on the regular apaches are indirectly caused by the API, and are opensearch/prefixsearch/allpages related - but I get lost in that code. One sample url is http://en.wikipedia.org/w/api.php?action=opensearch&search=Gina&format=xml Another interesting and likely related question is why many apaches can no longer reach storage1 NFS...
  • 17:07 RobH: Restarted ssh process which had stalled on srv188.
  • 16:52 mark: Rebooted srv186
  • 16:00 RobH: Pushed a number of dns changes for CZ chapter redirects.
  • 15:25 RobH: Updated dns for arbcom.de.wikimedia.org. Also added wiki to the cluster.

September 1

  • 23:10 mark: Added upload.v4.wikimedia.org hostname (explicitly A-record only), and allowed it in Squid frontend.conf
  • 17:40 jeluf: unpooled apache srv138, srv181 ssh not working
  • 17:30 jeluf: re-enabled srv124 in ES cluster12
  • 17:15 jeluf: re-enabled srv86 in ES cluster7
  • 16:32 mark: Deployed the PowerDNS pipe backend with the selective-answer script on all authoritative servers
  • 09:38 Tim: srv102 done, re-added cluster17 to the write list
  • 04:09 Tim: repooled ES on srv107, schema change done
  • 03:50 Tim: depooled apache on srv105, had old MW configuration, no ssh
  • 03:45 Tim: starting max_rows change on srv102. srv107 is actually stopped due to disk full, fixing.
  • 03:37 Tim: switching masters on cluster17 to srv103.
  • 02:14 Tim: Killed job runner on srv107 to speed up schema change.
  • 02:10 Tim: Brought srv142 and srv145 into ES rotation in cluster16.
