Server Admin Log/Archive 12b
(Redirected from Server Admin Log/2008-09)
- 23:45 brion: apt is borked on mayflower due to smartmontools refusing to load
- 20:46 tomasz: cp'd phpmyadmin from dev.civicrm to prod civicrm for david
- 20:06 brion: replaced 'wikimedia.org' with 'meta.wikimedia.org' in local VHosts list in wgConf.php. The general 'wikimedia.org' was causing CodeReview's diff loads (via codereview-proxy.wikimedia.org) to fail as they were hitting localhost instead of the proxy. Do we need to add more vhosts to this list, or redo how it works?
- 19:45 brion: test-deploying CodeReview on mediawiki.org
- 19:?? brion: set up temporary limited SVN JSON proxy as codereview-proxy.wikimedia.org
- 19:17 RobH: updated DNS to add something for Brion.
- 18:21 mark: cache cleaning complete
- 15:05 Tim: doing some manual purges of URLs requested on #wikimedia-tech
- 15:00 mark: Cleaning caches of all backend text squids one by one, starting with pmtpa
- 14:20 mark: pooled all squids manually to fix the issues.
- 14:10 RobH: Site back up, slow as squids play catchup.
- 14:06 RobH: Pushed out old redirects.conf and restarted apaches.
- 14:01 RobH: Site is down, go me =[
- 14:00 RobH: updated redirects.conf and pushed change for orphaned domains.
- 13:38 RobH: updated dns for more orphaned domains.
- 13:11 Tim: cluster13 and cluster14 both have only one server left in rotation. Shut down apache on srv129 and srv139 out of fear that it might hasten their doom.
- 10:12 Tim: Switched ES cluster 3-10 to use Ubuntu servers (again)
- 10:03 Tim: depooled ES on srv127, has been wiped
- 10:00 Tim: depooled thistle, is down
- 09:20 Tim: Set up MediaWiki UDP logging
- 08:05 Tim: removed the ORDER BY clauses from the ApiQueryCategoryMembers queries, to work around MySQL bug, probably involving truncated indexes
- 07:08 Tim: re-enabled the API
- 06:56 Tim: ixia (s2 master) overloaded due to ApiQueryCategoryMembers queries. Disabled the API and killed the offending queries
- 22:20 brion: reenabled history export ($wgExportAllowHistory), but put $wgExportMaxHistory back to 1000 instead of experimental 0 for enwiki. (sorry enwiki)
- 21:27 RobH: fixed the mounts on srv163 and started apache back up.
- 20:20 brion: srv163 has bad NFS config, missing upload and math mounts. I've shut off its apache so it stops polluting the parser cache with math errors.
- 17:01 RobH: updated apache redirects.conf for orphaned domains, restarted all apaches.
- 15:06 RobH: updated DNS to reflect a number of orphaned domains.
- 08:48 Tim: put db7 back into watchlist rotation (99%)
- 08:08 domas: enabled ipblocks replication on db7, resynced from db16
- 08:00 domas: Replaced gcc-4.2 build on db7 with gcc-4.1 one, from /home/wikipedia/src/mysql-4.0.40-r9-hardy-x86_64-gcc4.1.tar.gz
- 17:52 mark: Upgraded mchenry to Hardy.
- 17:15 mark: Upgraded sanger to Hardy.
- 13:43 mark: Repooled srv150
- 13:25 mark: Upgraded php5 and APC on all ubuntu apaches... got tired of restarting them. ;)
- 12:06 Tim: on db7: replicate-ignore-table=enwiki.ipblocks. Good enough for now.
- 11:51 Tim: schema update at 04:44 made db7 segfault. Replication stopped, watchlists stopped working after code referencing the new schema was synced. Switched to db16 for watchlist and RCL. Tried INSERT SELECT, that segfaulted too.
- 09:37 mark: Made syslog-ng on db20 filter the flood of 404s in /var/log/remote
- 09:15 mark: Restarted all (and only) segfaulting apaches
- 05:38 Tim: svn up/scap to r41337.
- 04:44 Tim: applying patch-ipb_allow_usertalk.sql on all DBs. No master switches.
- 20:41 mark: Packaged a newer PHP5 (5.2.4 from Ubuntu Hardy, with CDB support) and a new APC (3.0.19). Deployed it on srv153 for testing.
- 18:15 brion: srv100 looks particularly crashy.
- 18:09 brion: got some complaints about ERROR_ZERO_SIZED_OBJECT on saves, seeing a lot of segfaults in log. Restarting all apaches to see what they do.
- 22:49 RobH: repooled sq49.
- 22:00 RobH: depooled sq49 for power testing.
- 21:50 RobH: pulled search7 for power testing and left off, as the power circuit would trip if it was left on there.
- 21:18 RobH: put srv189 back into rotation.
- 19:51 RobH: Pulled srv189 for power testing.
- 21:41 RobH: had to recreate /home/wikipedia/logs/jobqueue/error as it was lost and job queue runners failed due to it not being there. Restarted runners.
- 19:08 domas: fixed clear-profile by replacing 'zwinger' with 'zwinger.wikimedia.org' - apparently datagrams to 127.1 used to fail.
- 18:44 brion: manually applied r41264 to MimeMagic.php to fix uploads of OpenDocument files to private/internal wikis
- 15:25 RobH: bayes minimally installed.
- 15:23 RobH: reverted statistics1 to bayes in dns, pushed dns change.
- 14:04 RobH: bayes racked and ready for install.
- 05:00 mark: Flapped BGP session to HGTN, to resolve blackholing of traffic
- 03:20 Tim: stopped apache on srv167, was segfaulting again. I suspect binary version mismatch between compile and deployment, e.g. APC was compiled for libc 2.5-0ubuntu1, deployed on libc 2.7-10ubuntu3.
- 03:03 Tim: restarted segfaulting apaches srv111,srv168,srv154,srv167,srv46
- 02:28 Tim: srv35 was segfaulting again, probably because it was in both the test.wikipedia.org pool and the main apache pool. Having two copies of everything tends to make the APC cache overflow, which triggers bugs in APC and leads to segfaulting. Removed it from the main apache pool.
- 20:23 RobH: restarted srv186 apache due to segfault.
- 20:21 RobH: restarted srv179 apache due to segfault.
- 20:05 brion: restarted srv35's apache (test.wikipedia.org) was segfaulting
- 19:25 tomasz: restricted grant for 'exim'@'220.127.116.11' to 150 MAX_USER_CONNECTIONS
- 18:40 mark: Increased TCP backlog setting on mchenry from 20 to 128.
- 18:19 brion: restoring ApiQueryDeletedrevs and Special:Export since they're not at issue. Domas thinks some of the hangs may be caused by mails getting stuck via ssmtp when the mail server is overloaded; auto mails on account creation etc may hold funny transactions open
- 17:52 brion: disabling SiteStats::update() actual update query since it's blocking for reasons we can't identify and generally breaking shit
- 17:50 RobH: updated nagios files/node groups for raid checking on hosts without 3ware present
- 17:37 brion: domas thinks the problem is some kind of lock contention on site_stats, causing all the edit updates to hang -- as a result the ES connections stack up while waiting on the core master. I'm disabling ss_active_users update for now, that sounds slow...
- 17:34 RobH: srv131 apache setup is borked, removing from lvs.
- 17:33 RobH: added proper ip info for lo device on srv131
- 17:24 brion: temporarily disabling special:export
- 17:22 brion: the revert got us back to being able to read the site most of the time, but still lots of problems saving -- ES master on cluster18 still has lots of sleeper connections and refuses new saves
- 17:10 brion: trying a set of reverts to recent ES changes
- 16:43 brion: temporarily disabling includes/api/ApiQueryDeletedrevs.php, it may or may not be hitting too much ES or something?
- 16:38 brion: seeing lots of long-delayed sleeping connections on ES masters, not running queries. trying to figure out w/ Aaron what could cause these
- 16:36 mark: Set up a syslog server on db20, logging messages from other servers to /var/log/remote.
- 16:31 brion: confirmed PHP fatal error during connection error (backend connection error "too many connections"). Manually merging r41230 to live copy to skip around the frontend PHP error
- 16:20 brion: we're getting reports of eg "(Can't contact the database server: Unknown error (10.0.2.104))" on save. Trying to investigate, but MediaWiki was borked by the previous reversions of core DB-related files to a 6-month-old version with incompatible paths. Trying to re-sanitize MW to r41097 straight
- 15:45 Rob: setup wikimedia-task-appserver on srv141.
- 15:09 mark: The problem reappeared, looks like a bug in MediaWiki, possibly triggered by some issue in ES. Reverted the files includes/ExternalStore.php includes/ExternalStoreDB.php includes/Revision.php includes/db/Database.php includes/db/LoadBalancer.php to r35098 and ran scap.
- 14:50 mark: Reports of most/all saves failing with PHP fatal error in /usr/local/apache/common-local/php-1.5/includes/ExternalStoreDB.php line 127: Call to a member function nextSequenceValue() on a non-object. Suspected APC cache corruption, did a hard restart of all apaches which appeared to resolve the problem.
- 07:15 Tim: installed wikimedia-nis-client on db20
- 20:03 RobH: srv170 reporting apache down, synced, restarted.
- 20:02 RobH: srv188 was not running apache, synced and started.
- 19:59 RobH: Installed memcached on srv183, updated mc-pmtpa.php.
- 19:57 RobH: Installed memcached on srv66, updated mc-pmtpa.php.
- 19:54 RobH: Installed memcached on srv141, updated mc-pmtpa.php.
- 19:52 RobH: srv106 back up, apache synced and memcached running.
- 19:45 RobH: srv127 complained of port in use starting apache, rebooted, all is fine.
- 19:27 RobH: removed srv106 from active memcached, replaced with srv127, sync-file mc-pmtpa.php
- 18:00 RobH: srv127 had booting issues into the OS, reinstalled and redeployed.
- 17:08 RobH: srv138 was locked up, restarted.
- 16:53 RobH: srv136 was locked up, restarted, synced, added correct lvs ip info.
- 16:45 RobH: srv126 was locked up, restarted, synced, added correct lvs ip info.
- 16:29 RobH: rebooted srv106, was locked up.
- 16:25 RobH: reinstalled srv101, was old ubuntu with no ES data.
- 16:13 RobH: reinstalled srv143 and srv148 from FC to Ubuntu, redeployed as apache
- 15:57 RobH: reinstalled srv128 and srv140 from FC to Ubuntu, redeployed as apache.
- 14:00-14:50 Tim: cleaned up /home/wikipedia somewhat, put various things in /home/wikipedia/junk or /home/wikipedia/backup, moved some lock files to lockfiles, deleted ancient /h/w/c/*.png symlinks, etc.
- 14:50 Tim: Made sync-common-file use rsync instead of NFS since some mediawiki-installation servers still have a stale NFS handle for /home
- 14:31 RobH: srv189 back in apache rotation
- 14:20 RobH: srv130 back in apache rotation
- 13:56 Tim: started rsync daemon on db20
- 13:49 Tim: restored dsh node groups on zwinger
- 13:40 Tim: installed udplog 1.3 on henbane
- 00:05 - 01:20 Tim: copying everything from the recovered suda image except /home/kate/xx, /home/from-zwinger and /home/wikipedia/logs. Will copy /home/wikipedia/logs selectively.
- 21:30 brion: noting that ExtensionDistributor extension is disabled for now due to the NFS problem
- 18:59 RobH: srv131 offline due to kernel panic. Cannot bring back until /home issue is resolved.
- 18:00 brion: things seem at least semi-working.
- everything hung
- suda had some kind of kernel crash
- after reboot, it was found to have a couple flaky disks
- brion hacked up MW config files to skip the NFS logging
- mark set up an alternate /home NFS server
- 17:50 mark: Set up db20 as an (empty) temporary suda replacement. Set up NFS server for /home.
- 17:20 mark: suda died.
- 17:25 RobH: srv130 not working right, removed from pool.
- 16:32 RobH: removed srv8 and srv10 from nagios, resynced.
- 15:00 mark: Site down completely. Post-mortem:
- Rob is untangling power cables in rack B2, and both asw-b2-pmtpa and asw3-pmtpa (in B4) lose power
- Two racks unreachable, PyBal sees too many hosts down and won't depool more
- Rob brings power to asw-b2-pmtpa back up, but connectivity loss to B4 is not noticed
- Mark investigates why LVS isn't working, adjusts PyBal parameters, until PyBal pools not a single server
- Apaches are unhappy about completely missing ES clusters
- Connectivity loss to B4 discovered, restored
- Site back online
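The post-mortem above hinges on a depool floor: once too many hosts look down, the balancer refuses to depool any more, and loosening that parameter can leave nothing pooled at all. The following is a minimal Python sketch of that kind of guard. It is not PyBal's actual code; the function name, the `depool_threshold` parameter and the server names are hypothetical and only illustrate the idea.

```python
import math


def servers_to_keep_pooled(health, depool_threshold=0.5):
    """Decide which servers stay pooled.

    health: dict mapping server name -> True if its health check passed.
    depool_threshold: hypothetical fraction of all servers that must stay
    pooled no matter what the health checks say.
    """
    total = len(health)
    min_pooled = math.ceil(total * depool_threshold)

    # Healthy servers are always pooled.
    pooled = {name for name, ok in health.items() if ok}

    # If that leaves fewer than the floor, keep failing servers pooled
    # (in name order) until the floor is met, rather than depooling them.
    for name in sorted(n for n, ok in health.items() if not ok):
        if len(pooled) >= min_pooled:
            break
        pooled.add(name)
    return pooled
```

With a floor of 0.5 and three of four servers failing, one failing server stays pooled alongside the healthy one. With the floor lowered to 0 and every check failing, the result is an empty pool, which is the failure mode the post-mortem warns about.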
- 10:10 Tim: disabled srv106's switch port. Was running the job queue with old configuration, inaccessible by ssh.
- 14:45 Tim: re-enabled Special:Export with $wgExportAllowHistory=false. Please find some way of doing transwiki requests which doesn't involve crashing the site.
- 14:30 Tim: People were reporting ES current master overload, no ability to save pages at all. This was apparently due to the small number of max connections on srv103/srv104. Most threads were sleeping. The real culprit was apparently db2 being slow due to a long-running (1 hour) Special:Export request. Disabled Special:Export entirely.
- 12:00 mark: Restored zwinger's IPv6 connectivity; removed svn.wikimedia.org from /etc/hosts
- 11:40 mark: Found an IP conflict; 18.104.22.168 was assigned to srv9 but not listed in DNS
- 10:09 Tim: removed srv69 and srv118 from the memcached list, down
- 09:02 Tim: ES on srv84 had new passwords, was not accepting connections from 3.23 clients on srv32-34. Fixed.
- 08:45 Tim: depooled ES srv110, reformatted by Rob while it was still a current ES slave. Depooled srv137, mysqld was shut down on it for some reason. One server left in cluster14.
- srv137 has a corrupt read-only file system on /usr/local/mysql/data2
- 05:34 Tim: svn.wikimedia.org not reachable from zwinger via IPv6, causing very slow operation due to timeouts. Hacked /etc/hosts.
- 04:58 Tim: svn up/scap to r41053
- 01:06 Tim: ES migration failed on all clusters except cluster3 (the cluster I used to test the script), due to MySQL 4.0-4.1 version differences. Restarting with mysqldump --default-character-set=latin1.
- 00:14 Tim: restarted segfaulting apaches: srv167,srv152,srv172,srv171,srv153,srv151,srv176,srv155,srv112,srv119,srv111,srv113
- 00:10 Tomasz: upgraded public and private depot to svn 1.5 data format.
- 00:00 Tomasz: svn installed ubuntu 8.04 along with svn 1.5.
- 23:00 Tomasz: svn installed ubuntu 7.10, ready
- 22:55 RobH: db20 installed, ready for next upgrade.
- 22:38 RobH: db19 installed, ready for setup.
- 22:26 RobH: db18 installed, ready for setup.
- 18:00 brion: updated mwlib on bindery.wikimedia.org and Collection extension
- 15:59 RobH: reinstalled srv70, srv100, srv110-srv119 from FC to ubuntu, redeployed.
- 07:30 Tim: srv38 was hanging while attempting to write to log files on /home. Fixed permissions on /mnt/upload4/en/thumb which was causing a high log write rate, restarted apache, disabled search-restart cron job, restarted pybal. Seems to be fixed.
- 01:55 Tim: the issue with ES was the lack of a master pos wait between transfer and slave shutdown. Fixing.
- 01:00 Tim: restarting possibly segfaulting apaches on srv158,srv177,srv178,srv173,srv51,srv187,srv182,srv44,srv117. Keeping srv139 for debugging, it has kindly depooled itself by segfaulting on pybal health checks.
- 17:39 RobH: srv35, srv37, srv55 & srv59 bootstrapped with ganglia.
- 17:37 RobH: srv40, srv41, srv43-srv53 bootstrapped with ganglia.
- 17:36 RobH: srv60-srv68 bootstrapped with ganglia.
- 17:31 RobH: srv151-srv188 bootstrapped with ganglia.
- 11:45 Tim: reverted db.php change, still has issues.
- 11:18 Tim: removed apaches_yaseo from nagios config, changed apaches_pmtpa to apaches.
- 11:09 Tim: in db.php, switched ES clusters 3-10 to use the ubuntu servers
- 23:57 brion: set $wgLogo to $stdpath for wikinews -- old local /upload path failed to redirect properly on secure.wikimedia.org interface
- 22:19 mark: Deployed the rest of the new search servers, search2 - search7.
- 19:25 JeLuF: changed robots.php to send both Mediawiki:robots.txt and /apache/common/robots.txt
- 19:23 RobH: Removed srv63 from memcache list, put in spare memcache and synced file.
- 19:14 RobH: restarted memcached on srv74
- 19:00 RobH: reinstalled srv62, srv64, srv65, srv66, srv67, & srv68 from FC to Ubuntu.
- 18:26 RobH: srv63 shutdown due to hdd failure.
- 18:25 RobH: srv61 shutdown due to overheating issue.
- 18:16 RobH: Reinstalled srv51, srv52, srv53, srv54, srv55, srv56, srv57, srv58, srv59, srv60, srv61 as ubuntu apache servers.
- 16:56 RobH: Reinstalled srv44, srv45, srv46, srv47, srv48, srv49, & srv50 as ubuntu apache servers.
- 16:00 RobH: Reinstalled srv35, srv37, srv40, srv41, srv43 as ubuntu apache servers.
- 16:00 RobH: moved srv37 from pybal render group to apache group
- 01:50 brion: killed obsolete juriwiki-l list per delphine
- 22:59 mark: srv133 is giving Bus errors, read-only file systems, and was therefore automatically depooled by PyBal. Good times.
- 22:59 mark: Installed memcached on srv182 (was missing?), restarted memcached on srv70, srv169 and replaced instance of srv141 by srv142.
- 22:36 mark: Prepared searchidx1 and search1 for production, if things work sufficiently well I'll deploy the others tomorrow
- 21:30 brion: found a bunch of memcache machines down or not running memcached: 170, 141, 70, 169, 182
- 21:01 mark: Building search deployment with rainman, with search1 as test host
- 20:33 brion: fixed secure.wikimedia.org for Wikimania wikis -- wikimedia-ssl-backend.conf rewrite rules were mistakenly excluding digits from the wiki pseudodir
- 18:00 JeLuF: made the main page of https://secure.wikimedia.org/ editable via http://meta.wikimedia.org/wiki/Secure.wikimedia.org_template using extract2.php
- 22:45 Tim: rebooted srv151. Shut down mysqld and then gave it a sync; sysrq b.
- 21:11 RobH: Installed Ubuntu on searchidx1, search1, search2, search3, search4, search5, search6, search7.
- 19:00 RobH: searchidx1 installed.
- 18:45 mark: Upgraded PyBal on lvs3 to a newer version, and set up SSH checking (once a minute) of all apaches, see LVS.
- 18:42 mark: srv170 is doing OOM kills
- 18:28 mark: Upgraded wikimedia-task-appserver on all Ubuntu app servers, which creates a limited ssh account pybal-check for use by PyBal. Create the account manually on all Fedora apaches
- 17:01 mark: Apache on srv151 is stuck on an NFS mountpoint and cannot be restarted. I'm not rebooting the box as I'm not sure what's going on with ES atm.
- 23:30 jeluf: apache on srv37 doesn't restart, libhistory.so.4 is missing
- 23:15 mark: NTP ip missing on zwinger, readded
- 23:00 jeluf: proxy robots.txt requests through live-1.5/robots.php, which delivers Mediawiki:robots.txt if it exists and /apache/common/robots.txt else.
- 15:30 Tim: set read_only=0 on srv108 (Rob rebooted it)
- 15:00 RobH: bart crashed, rebooted.
- 14:56 Tim: pulling out all the stops now, running migrate.php migrate-all.
- 14:45 RobH: synced srv104, back online.
- 14:40 RobH: synced db.php.
- 14:32 RobH: srv105 unresponsive, rebooted.
- 14:25 Tim: Removed the corrupted ES installations on srv151-176
- 14:18 RobH: Installed NRPE plugins on db9-db16.
- 09:01 Tim: reverted, blob corruption due to charset conversion observed
- 07:58 Tim: Experimentally switched db.php to use the ubuntu servers for cluster3/4.
- 07:50 Tim: Stopping replication on the ubuntu cluster3 and cluster4 servers, and changing the file permissions on the MyISAM files to prevent any kind of modification by the mysql daemon. This is done by the new lock/unlock commands in ~tstarling/migrateExtToUbuntu/migrate.php.
- 05:30 Tim: Migrating cluster4. Testing new binlog deletion feature.
- 15:40 RobH: Racktables database moved from will to db9.
- 15:00 RobH: Reinstalled srv185, srv186, srv187 to newest ubuntu, online as apache.
- 05:00 - 10:10 Tim: copied cluster3 to srv151, srv163 and srv175, second attempt, seems to have worked this time
- 23:25 brion: for a few minutes got some complaints about 'Can't contact the database server: Unknown error (10.0.6.22)' (db12). This box seems to be semi-down pending some data recovery, but load wasn't disabled from it. May have gotten load due to other servers being lagged at the time. Set its load to 0.
- 18:49 RobH: Moved maurus from A4 to A2.
- 18:05 mark: Made lvs2 a temporary LVS host for upload.pmtpa.wikimedia.org to be able to remove alrazi from its rack. Will redo this setup soon.
- 17:50 RobH: srv61 reinstalled and setup as apache and memcached.
- 17:50 RobH: srv144 reinstalled, needs ES setup.
- 17:50ish brion: updated planet to 2.0, cleared en feed caches. Something was broken in them which caused updates to fail since September 5.
- 17:42 RobH: Updated DNS to reflect new search servers.
- 15:11 RobH: Moved isidore; upon reboot, noticed the wordpress update didn't take, reapplied it to blog and whygive installations.
- 14:49 RobH: zwinger and khaldun moved from A4 to A2.
- 10:26 Tim: copying ES data from srv32 to srv151, srv163 and srv175
- 01:30-10:20 Tim: testing and debugging the ubuntu ES migration script on srv151, srv163 and srv175
- 02:15 Tomasz: Added bugzilla reporting cron on isidore.
- 00:48 Tim: granted root access to zwinger on all ES servers, useful for migration
- 22:20 RobH: reinstalled srv178, srv179, srv180, srv181, srv182, srv184.
- 21:20 RobH: reinstalled srv175, srv176, & srv177.
- 20:30 RobH: reinstalled srv172, srv173, & srv174.
- 19:23 RobH: reinstalled srv169, srv170, & srv171.
- 18:23 RobH: reinstalled srv166, srv167, & srv168.
- 18:00 RobH: reinstalled srv163, srv164, & srv165.
- 16:40 RobH: reinstalled srv160, srv161, & srv162.
- 15:40 RobH: reinstalled srv157, srv158, & srv159.
- 15:05 RobH: reinstalled srv154, srv155, & srv156.
- 14:36 mark: Exchanged down srv126 for srv140, and down srv137 for srv141 in mc-test.php
- 14:12 RobH: reinstalled srv151, srv152, & srv153.
- 06:16 Tim: Gave myself a RackTables account
- 05:33 Tim: srv146 down, removed from ES rotation
- 05:08 Tim: accidentally crashed srv37. Needs restart.
- 15:48 mark: alrazi overloaded, switch traffic back to knams and hope it can take the load
- 14:37 mark: knams partially back up, broken line card still down. Moved some important servers to another line card. knsq16 - knsq30 will be down for the upcoming days, as well as most management.
- 10:20 domas: copied in mysql build from db16 to db12 - db12 was running gcc-4.2 one, and in crashloop. next crash will bring up proper build :)
- 20:15 river: failure of many hosts at knams (including lvs), moved to authdns-scenario knams-down
- 12:05 hashar: merged r40433 to fix &editintro
- 5:30 JeLuF: image upload on enwiki enabled again. Slowly deleting images from amane.
- 3:00 JeLuF: image upload on enwiki disabled, copying enwiki images to storage1
- 22:00-00:00 Hashar : gmaxwell provided backup of files (downloaded in ~/files/), I recovered non existent one.
- run ~/check_missing_pics.pl for hints (output example)
- 17:03 Tim: Updated trusted-xff.cdb. Fixes AOL problems.
- 14:45 JeLuF: started to rsync enwiki images from amane to storage1 in preparation of tomorrow's final move of the image directory
- 04:24 Tim: sync-file screwup caused thumbnails to be created in the source image directory. Will try to repair.
- 03:13 Tim: srv151 is depooled for some reason. No indication as to why in the logs or config files. Using it to test the new wikimedia-task-appserver package. Will repool once I get it working properly.
- 22:15 JeLuF: Switched srv179's mysql to read_only
- 22:10 JeLuF: OTRS back online, switched to db9. Changed exim config on mchenry, too.
- 20:00 JeLuF,RobH: Shut down OTRS, migrating its DB from srv179 to db9
- 19:49 RobH: db10 replication slave of db9
- 17:58 RobH: civicrm and dev civicrm database now located on db9 (was on srv10)
- 17:19 RobH: Bugzilla database is now located on db9 (was on srv8)
- 16:52 RobH: Both the wikimedia blog and donation blog databases are now residing on db9 (was on srv8)
- 16:43 Tim: re-enabled thumb.php after some of the culprits came to talk to me on #wikimedia-tech and promised to reform their ways
- 11:09 Tim: fixed APC on srv38 and srv39, was broken.
- 10:35 Tim: srv38 and srv39 have been overloaded since 05:50. Blocked thumb.php for external clients.
- 05:30 Tim: restarted srv138 with sysrq-trigger. Was reporting "bus error" on sync-file.
- 04:03 Tim: upgrading to wmerrors-1.0.2 on all mediawiki-installation
- 23:00 jeluf: moved enwiki's upload archive from amane to storage1, freeing up some 20G on amane.
- 16:54 brion: tweaking ApiOpenSearchXml to hopefully fix the rendering-thumbs-on-text-apaches problem
- 14:01 RobH: updated libtiff4 on all apaches
- 04:23 Tim: svn up/scap to r40356
- 04:13 Tim: populating ss_active_users
- 03:21 Tim: applying patch-ss_active_users.sql
- 19:50 mark: Repooled srv181
- 19:31 mark: Many boxes still in inconsistent state because of OOM kills. Some background processes not running (e.g. ntpd). Rebooted srv159, srv182, srv154, srv156, srv157, srv158, srv181, srv188
- 19:28 mark: scap
- 19:01 mark: Killed all stuck convert processes on srv151..srv188 (but left srv189 intact for debugging)
- 18:51 mark: Rebooted srv169, srv180
- 18:48 mark: Remounted /mnt/upload4 on srv151..srv188 (not srv189)
- 18:33 mark: Many application servers are running out of memory, one by one. This seems to be caused by stuck thumbnail convert processes which end up there. The thumbnail convert processes on the regular apaches are indirectly caused by the API, and is opensearch/prefixsearch/allpages related - but I get lost in that code. One sample url is http://en.wikipedia.org/w/api.php?action=opensearch&search=Gina&format=xml Another interesting and likely related question is why many apaches can no longer reach storage1 NFS...
- 17:07 RobH: Restarted ssh process which had stalled on srv188.
- 16:52 mark: Rebooted srv186
- 16:00 RobH: Pushed a number of dns changes for CZ chapter redirects.
- 15:25 RobH: Updated dns for arbcom.de.wikimedia.org. Also added wiki to the cluster.
- 23:10 mark: Added upload.v4.wikimedia.org hostname (explicitly A-record only), and allowed it in Squid frontend.conf
- 17:40 jeluf: unpooled apache srv138, srv181 ssh not working
- 17:30 jeluf: re-enabled srv124 in ES cluster12
- 17:15 jeluf: re-enabled srv86 in ES cluster7
- 16:32 mark: Deployed the PowerDNS pipe backend with the selective-answer script on all authoritative servers
- 09:38 Tim: srv102 done, re-added cluster17 to the write list
- 04:09 Tim: repooled ES on srv107, schema change done
- 03:50 Tim: depooled apache on srv105, had old MW configuration, no ssh
- 03:45 Tim: starting max_rows change on srv102. srv107 is actually stopped due to disk full, fixing.
- 03:37 Tim: switching masters on cluster17 to srv103.
- 02:14 Tim: Killed job runner on srv107 to speed up schema change.
- 02:10 Tim: Brought srv142 and srv145 into ES rotation in cluster16.
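The two robots.txt entries above (23:00 jeluf, then 19:25 JeLuF on a later day) describe two behaviours: first serving Mediawiki:robots.txt if it exists and /apache/common/robots.txt otherwise, later sending both. A minimal Python sketch of that selection logic follows. It is not the real robots.php; the function name, its parameters and the order in which the two texts are joined are assumptions made for illustration.

```python
from typing import Optional


def build_robots_txt(wiki_text: Optional[str], static_text: str,
                     combine: bool = False) -> str:
    """Return the robots.txt body to serve.

    wiki_text: content of the on-wiki robots.txt page, or None if the
    page does not exist (hypothetical input).
    static_text: content of the shared static robots.txt file.
    combine: False mimics the earlier fallback behaviour, True the later
    "send both" behaviour.
    """
    if wiki_text is None:
        # No on-wiki page: the static file is all there is to send.
        return static_text
    if not combine:
        # Earlier behaviour: the on-wiki page replaces the static file.
        return wiki_text
    # Later behaviour: send both. Putting the wiki text first is an
    # assumption; the log does not state the order.
    return wiki_text.rstrip("\n") + "\n" + static_text
```

Passing the two texts in as strings keeps the sketch free of any file or database access, so the selection rule can be checked on its own.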