Server Admin Log/Archive 20

July 23

  • 22:16 mark: Decommissioned all yaseo servers, wiped their disks
  • 20:35 mark: Updated the glue record for ns1.wikimedia.org (a verification sketch follows this list)
  • 20:20 mark: Changed IP of ns1.wikimedia.org to 208.80.152.142 (a svc ip on linne)
  • 19:58 mark: Installed linne.wikimedia.org as auth DNS server
  • 19:12 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'restore FancyCaptcha now that image crisis is diverted'
  • 15:38 Rob: set up strategywiki for the strategy planning whatever
  • 15:37 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 15:32 logmsgbot: robh ran sync-common-all
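
The 20:20 and 20:35 entries above change the address of ns1.wikimedia.org and its glue record. A minimal shell sketch for checking both, assuming dig is available; a0.org.afilias-nst.info is just one of the .org parent servers, pick any from the first query:

    dig +short NS org.                                              # the parent (.org) nameservers
    dig +norecurse ns1.wikimedia.org A @a0.org.afilias-nst.info     # glue shows up in the referral's additional section
    dig +short ns1.wikimedia.org A                                  # what the world resolves after the change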

July 22

  • 21:20 Fred: installing wikimedia-task-appserver on srv122. Incoming reboot
  • 15:11 Tim: fixed firewall on browne to deny RC->IRC UDP packets from outside the local network (see the sketch after this list)
  • 09:57 logmsgbot: midom synchronized php-1.5/wmf-config/../StartProfiler.php
  • 09:51 domas: apparently someone decided that our profiling is not useful and should be disabled? :)
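
The 15:11 entry restricts the RC-to-IRC UDP feed on browne to the local network. A hedged iptables sketch of that kind of rule; the port number and the source range are assumptions, not taken from the log:

    # accept the UDP feed from the internal network, drop it from anywhere else
    iptables -A INPUT -p udp --dport 9390 -s 10.0.0.0/8 -j ACCEPT
    iptables -A INPUT -p udp --dport 9390 -j DROP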

July 21

  • 23:56 Fred: rebooted pascal (for realz this time)
  • 23:15 tomaszf: fred is pulling backups from ms4 onto storage2.
  • 23:07 Fred: rebooting pascal as he fell over again
  • 22:45 tomaszf: adding snapshot1,2,3 to DHCP
  • 22:03 mark: Increased large object cache dir size to 120 GB on eiximenis
  • 18:28 domas: srv122 booted into netinstall, apparently
  • 17:39 Rob: updated both blog and techblog to newest stable release of wordpress
  • 16:36 brion: internal UDP logging broken since 17 July; looks like udp2log isn't running on db20 since reboot?
  • 16:21 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php 'death to captcha'
  • 16:15 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'death to the uploader group'
  • 16:06 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php
  • 16:00 Rob: updated the sync-file script for new file locations
  • 16:00 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php
  • 15:50 Rob: removing old whygive blog data from dns and archiving the database.
  • 15:07 Rob: updated planet with http://meta.wikimedia.org/wiki/Planet_Wikimedia#Requests_for_inclusion
  • 14:43 mark: Increased COSS cache dirs on pmtpa upload squids
  • 11:30 domas: for i in $(ssh db20 findevilapaches); do ssh $i invoke-rc.d apache2 restart; done \o/ (expanded in the sketch after this list)
  • 11:29 domas: killed brion's sync processes on zwinger, hanging since July17 :)
  • 09:15 domas: mgmt-restarted srv156
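
The 11:30 one-liner above, spelled out; findevilapaches is a site-local helper on db20 that is not shown here, and the connect timeout is an illustrative addition:

    for host in $(ssh db20 findevilapaches); do           # db20 decides which apaches look broken
        echo "restarting apache2 on ${host}"
        ssh -o ConnectTimeout=10 "${host}" 'invoke-rc.d apache2 restart' \
            || echo "restart failed on ${host}" >&2       # keep going even if one host is unreachable
    done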

July 20

  • 22:18 mark: Rebooted pascal
  • 21:15 mark: Doubled cache dir sizes on eiximenis, upped carp load from 20 to 30
  • 18:09 hcatlin: restarted mobile1 cluster to load in new software
  • 15:55 Fred: bounced apache on srv193
  • 07:18 Tim: re-enabled CentralNotice
  • 07:17 logmsgbot: tstarling synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 07:13 apergos: enough data removed from ms1 to feel safe for a few days; started mass copy of remaining thumbs to ms4 in prep for complete repo switchover (running in root screen on ms1)
  • 04:48 Tim: copying up all available MW release files from my laptop
  • 04:21 Tim: mounted ms4:/export/dumps on zwinger
  • 04:16 Tim: changed export options for ms4:/export/dumps to allow root access for the local subnet (see the sketch after this list)
  • 01:19 hcatlin: On mobile1 we are now gzipping the log files after rotation in /srv/wikimedia-mobile/logs
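
The 04:16 and 04:21 entries export ms4:/export/dumps with root access for the local subnet and mount it on zwinger. A hedged sketch of the idea on a Solaris/ZFS host; the subnet and the client mount point are assumptions:

    # on ms4: share the dataset over NFS, read-write with root access for the local subnet
    zfs set sharenfs='rw=@10.0.0.0/16,root=@10.0.0.0/16' export/dumps

    # on zwinger: mount the export
    mkdir -p /mnt/dumps
    mount -t nfs ms4:/export/dumps /mnt/dumps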

July 19

  • 22:02 Fred: restarted memcached on srv159
  • 20:43 mark: eiximenis backend squid pooled
  • 20:10 mark: Restarted deadlocked powerdns on bayle
  • 19:14 mark: Installed eiximenis with a Squid OS install
  • 18:58 mark: Moved eiximenis to vlan 100 (squids)
  • 18:55 mark: Changed eiximenis' IP into 208.80.152.119 for Squid testing
  • 17:41 hcatlin: Mobile1's web stack just got switched from Phusion Passenger to Nginx/Thin/Rack.

July 18

  • 15:23 apergos: some thumb directories on ms4, created at the request of the img scalers, ended up with owner root and perms 700... fixing (see the sketch after this list)
  • 03:55 river: ms5 is ready
  • 01:20 atglenn: continuing with removals of thumbs on ms1. 789 GB free now; we need to reach about 1450 GB before we can just "maintain", but we're gaining on it.
  • 00:22 brion: set up temporary data dump index, copied the dvd index (it's just offsite links). still need to track some MW releases
  • 00:07 brion: recovering MediaWiki 1.6 through 1.10 release files and re-uploading them...
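
For the 15:23 entry, a minimal cleanup sketch; the base path, the web user/group and the target mode are assumptions:

    # hand root-owned, mode-700 thumb directories back to the web user
    find /export/thumbs/wikipedia -type d -user root -perm 700 \
        -exec chown www-data:www-data {} \; \
        -exec chmod 755 {} \;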

July 17

  • 23:42 brion: added a 404 page and recovered index.php for our temp download.wikimedia.org
  • 22:05 brion: set wikitech to use vector skin by default :D
  • 22:03 Andrew: Fixed morebots, which was relying on a fragile version check. Just deleted it :)
  • 20:43 brion: fixed paths for noc.wikimedia.org/conf file highlighting
  • 20:38 domas: ms2 has broken disks..
  • 20:31 brion: We're going to see about setting up the previously-idle ms5 so we can get our thumbnailing on
  • 20:01 brion: rob's poking raid rebuild on storage2 (dumps server)
  • 19:03 RobH_A90: eiximenis and dobson pulled for solid state drive testing, do not use for other tasks
  • 18:28 logmsgbot: brion synchronized wmf-deployment/wmf-config/InitialiseSettings.php 'enabling vector for rtl'
  • 18:25 atglenn: started the mass move of thumbnail dirs out of the way, replacing them with symlinks to ms4 (see the sketch after this list)
  • 18:25 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'bump style version'
  • 18:25 logmsgbot: brion ran sync-common-all
  • 18:24 brion: running sync-common-all for UI updates. need to poke the style ver too :)
  • 18:08 brion: svn up'ing wmf-deployment for test.wikipedia.org. Merged UI fixes from usability team
  • 18:03 Fred: spun a couple more apache servers into image scalers: srv219..srv224.
  • 17:28 rainman-sr: putting the new location of InitialiseSettings into lsearch-global-2.1.conf so the incremental updater works again
  • 17:20 Fred: srv224 is now an image_scaler. Adjusted on lvs3, ganglia and dsh's node_list.
  • 17:14 Fred: db20 back online
  • 16:50 Fred: rebooting db20 as it is in a "state"
  • 16:45 brion: looks like we've lost internal /home NFS, which makes some of our internal services very unhappy. investigating...
  • 16:43 brion: ganglia out.
  • 13:44 apergos1: doing next round of removals on ms1 (/export/upload/wikipedia/en/thumb/2) to keep ahead of the game
  • 04:15 apergos: starting removal of /export/upload/wikipedia/en/thumb/1 on ms1 (moved away and symlink to ms4 done already) for more space
  • 03:54 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'Disabling sitenotice from maintenance'
  • 03:29 brion: reenabling uploads & image deletion/undeletion
  • 03:29 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
  • 03:28 brion: remounting ms1 on apaches
  • 00:49 atglenn: only about 1 GB gained on each, so doing all of /export/upload/wikipedia/en/thumb/0
  • 00:39 atglenn: removing more directories in /export/upload/wikipedia/en/thumb/0 on ms1 and replacing with symlinks to ms4
  • 00:30 logmsgbot: brion synchronized wmf-deployment/includes/specials/SpecialUpload.php
  • 00:30 logmsgbot: brion synchronized wmf-deployment/includes/ImagePage.php
  • 00:27 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
  • 00:20 brion: temporarily disabling image delete/rename during maintenance
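
The 18:25 move-and-symlink entry (and the later per-directory removals) follow the same pattern: move the local thumb directory on ms1 aside and point a symlink at the ms4 share, so requests resolve to ms4. A sketch using the mount point from the July 16 22:37 entry; the ".moved" suffix is an assumption:

    src=/export/upload/wikipedia/en/thumb/1            # local copy on ms1
    dst=/mnt/thumbs/wikipedia/en/thumb/1               # same tree on the ms4 share
    mv "$src" "${src}.moved"                           # move the local directory out of the way
    ln -s "$dst" "$src"                                # clients now follow the symlink to ms4
    # once things look healthy, "${src}.moved" can be removed to reclaim space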

July 16

  • 23:56 logmsgbot: fvassard synchronized php-1.5/wmf-config/CommonSettings.php 'Disabling uploads and setting captcha to not-fancy.'
  • 23:29 atglenn: removing the images in /export/upload/wikipedia/en/thumb/0/00 on ms1 (real dir is a symlink to ms4) to get back some space
  • 22:51 atglenn: sym link back in place, let's see what happens
  • 22:47 atglenn: reverting temporarily while we resolve mount issues for the ms4 share
  • 22:40 atglenn: ...whether the image scalers will fall over if we force them to do (some) regeneration.
  • 22:37 atglenn: on ms1, /export/upload/wikipedia/en/thumb/0/00 symlinked to (shared from ms4) /mnt/thumbs/wikipedia/en/thumb/0/00 to test
  • 21:29 brion: robots.php for robots.txt generation now also working. yay!
  • 21:28 logmsgbot: brion synchronized live-1.5/robots.php
  • 21:28 brion: extract2.php now fixed up for new deployment; portal pages ok (www.wikipedia.org)
  • 21:27 logmsgbot: brion synchronized live-1.5/robots.php
  • 21:26 logmsgbot: brion synchronized extract2.php
  • 21:26 logmsgbot: brion synchronized extract2.php
  • 21:22 logmsgbot: brion synchronized extract2.php
  • 21:18 logmsgbot: brion ran sync-common-all
  • 21:18 brion: rsync messed up the php-1.5 directory-to-symlink conversion; retrying as root
  • 21:14 logmsgbot: brion synchronized extract2.php
  • 21:13 logmsgbot: brion synchronized extract2.php
  • 21:13 atglenn: started copy of thumbnails to ms4, symlinks going in on ms1 (but no data removal yet)
  • 21:11 logmsgbot: brion synchronized live-1.5/extract2.php
  • 21:10 logmsgbot: brion synchronized live-1.5/robots.php
  • 21:09 logmsgbot: brion ran sync-common-all
  • 21:08 brion: attempting to replace the old php-1.5 dir with a wmf-deployment symlink (see the sketch after this list)
  • 21:02 logmsgbot: brion synchronized wmf-deployment/wmf-config/InitialiseSettings.php 'I think touching the new master InitialiseSettings will fix it'
  • 21:01 logmsgbot: brion synchronized wmf-deployment/includes/GlobalFunctions.php 'mkdir error trackdown hack'
  • 20:54 logmsgbot: brion synchronized wmf-deployment/wmf-config/missing.php
  • 20:52 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
  • 20:52 logmsgbot: brion synchronized wmf-deployment/wmf-config/reporting-setup.php
  • 20:48 brion: switching all sites to wmf-deployment branch
  • 20:48 logmsgbot: brion synchronized live-1.5/MWVersion.php
  • 19:06 Tim: copying ExtensionDistributor stuff to ms4:/export/ext-dist, from root screen on ms1
  • 19:01 brion: Now running test.wikipedia.org, www.mediawiki.org, and meta.wikimedia.org on new deployment checkout
  • 19:01 logmsgbot: brion synchronized live-1.5/MWVersion.php
  • 18:58 logmsgbot: brion ran sync-common-all
  • 18:39 Tim: restarted xinetd on zwinger
  • 18:24 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php
  • 17:57 brion: also restarted 186, 196 which had some funkiness in php err log
  • 17:56 brion: srv186 also bad sudo
  • 17:55 brion: srv171 has some borkage; sudo config is broken can't run apache-restart as user
  • 17:52 logmsgbot: brion ran sync-common-all
  • 17:51 brion: running updated sync-common-all friendly to non-NFS boxes
  • 17:49 brion: swapped private SVN-managed /home/wikipedia/bin into place
  • 15:09 apergos: removing the last of our snapshots on ms1 :-( getting us a little more space
  • 14:47 apergos: disabled snapshots on ms1 in preparation for move of thumbnails to ms4
  • 14:38 brion: updated wikibugs-l list config to allow bugzilla-daemon@wikimedia.org to post
  • 14:34 brion: restarted wikibugs bot
  • 14:27 brion: ms1 performance seems to be sucking again
  • 14:17 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'adjusting throttle temporarily for outreach event'
  • 11:55 RoanKattouw: ExtensionDistributor repeatedly reported broken in the past 48 hrs
  • 07:08 Fred: traffic profile switched back to normal. Esams is back to normal.
  • 06:11 hcatlin: Mobile1 has returned to normal function.
  • 05:58 hcatlin: An error after restarting mobile1 stopped stats logging from working. Stats will be low for July 15th and higher for July 16th. Parsing the 6-hour log file (about 1 GB) might slow the server for the next few minutes until it has caught up.
  • 04:24 Rob: outage for esams servers started at approx 3:20 gmt
  • 04:15 Rob: still waiting on esams to update us about the rack(s), moving traffic to pmtpa
  • 00:59 tomaszf: started backup for latest xml snapshots from storage2 to ms4
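
The 21:08 entry replaces the old php-1.5 directory with a symlink to the wmf-deployment checkout. A hedged sketch of that swap; the parent path is an assumption, and the old tree is kept around for rollback:

    cd /usr/local/apache/common            # assumed parent directory of php-1.5
    mv php-1.5 php-1.5.old                 # keep the old tree in case of rollback
    ln -s wmf-deployment php-1.5           # paths under php-1.5 now resolve to the new branch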

July 15

  • 22:30 Rob: updated DNS for the new snapshot servers because tomasz did not want to be in charge of dump servers.
  • 22:10 brion: checking around for 0-byte files (not thumbs) to see if we can recover them
  • 21:33 atglenn: verified that zfs patch is in place on ms4 (it got sucked in during river's update yesterday)
  • 21:26 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'Restore fancy captcha mode'
  • 21:16 logmsgbot: I_am_not_root synchronized php-1.5/CommonSettings.php 're-enabling Uploads and removing site notice.'
  • 21:01 atglenn: rebooting ms1 after applying zfs patch. *cross fingers*
  • 20:51 logmsgbot: brion synchronized php-1.5/CommonSettings.php
  • 20:51 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php
  • 20:42 brion: reenabled captcha in simple mode (no images; math q)
  • 20:37 brion: captcha system broken while images are offline, need to disable it temporarily
  • 20:18 brion: updated http://en.wikipedia.org/wiki/MediaWiki:Uploaddisabledtext & http://commons.wikimedia.org/wiki/MediaWiki:Uploaddisabledtext
  • 19:43 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Disabling Uploads while ms1 gets fixed (again with an s after upload).'
  • 19:40 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Disabling Uploads while ms1 gets fixed.'
  • 19:40 atglenn: bringing solaris up to current patch level on ms1
  • 19:34 brion: Ok, we're going to temporarily shut off uploading and unmount the uploads dir while we muck about with ms1.
  • 19:14 brion: dropping export/upload@daily-2009-07-11_03:10:00
  • 19:08 brion: restarting web server on ms1, see if that resets some connections to the backend scalers
  • 19:05 brion: restarting nfsd on ms1
  • 18:58 brion: dropping zfs snapshot export/upload@daily-2009-07-09_03:10:00 (see the sketch after this list)
  • 18:25 RobH_A90: drac and physical setup done for dump1,2,3, will install remotely
  • 17:52 RobH_A90: updated dns for new dump processing servers public and management ips
  • 17:41 Fred: bounced apache on srv45
  • 17:37 Fred: bounced apache on srv47
  • 17:09 RobH_A90: pdf1 is not coming back, working on it
  • 16:56 RobH_A90: shutting down pdf1 and mobile1 to move their power too, weee
  • 16:55 RobH_A90: shutting down spence to move
  • 16:50 RobH_A90: shutting down singer to move its power, blogs and other associated services will be offline for approx. 5 minutes
  • 16:47 Andrew: Restarting apache on prototype
  • 16:46 RobH_A90: shutting down grosley for power move
  • 16:45 RobH_A90: all these power moves are to add the new dump processing servers to the rack
  • 16:45 RobH_A90: shutting down fenari for power move
  • 16:43 RobH_A90: shut down eiximenis and erzurumi to move their power
  • 16:34 RobH_A90: shutting down some servers and moving power around in a4-sdtpa
  • 16:17 Andrew: Changed morebots to tell you through a channel message instead of a private notice when the logging is successful.
  • 15:54 Fred: kernel updated on wikitech from 2.6.18.8 to 2.6.29 (latest available on linode)
  • 15:49 Andrew: Fixed auto-submission of honeypot data, was broken because it needed my perl include path.
  • 15:40 Fred: rebooting wikitech to install new kernel
  • 14:04 Ariel: stopped apaches on image scalers, stopped nfs on ms1, restarting nfs and apaches...
  • 13:52 Ariel: removing more snapshots on ms1 (lockstat showed it hung up in metaslab_alloc again)
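
Several entries above (19:14, 18:58, 13:52) reduce load and reclaim space on ms1 by dropping ZFS snapshots. A minimal sketch; the snapshot name is the one from the 18:58 entry:

    # list snapshots of the upload filesystem with their space usage, oldest first
    zfs list -t snapshot -o name,used -s creation | grep '^export/upload@'
    # drop one of them
    zfs destroy export/upload@daily-2009-07-09_03:10:00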

July 14

  • 23:15 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fix fix to enwiki confirmed group :D'
  • 22:24 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fix to confirmed group for en'
  • 20:44 logmsgbot: brion synchronized wmf-deployment/cache/trusted-xff.cdb
  • 20:41 logmsgbot: brion synchronized wmf-deployment/cache/trusted-xff.cdb
  • 20:40 logmsgbot: brion synchronized wmf-deployment/AdminSettings.php
  • 20:22 Fred: restarted a bunch of dead apaches
  • 20:10 brion: doing a sync-common-all w/ attempt to put test.wikipedia on wmf-deployment branch
  • 19:50 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
  • 19:11 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611 forgot one thing'
  • 19:09 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611'
  • 19:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611'
  • 14:16 domas: dropped all june snapshots on ms1, thus providing some relief
  • 01:52 river: patched ms4 in preparation for upload copy

July 13

  • 21:31 Rob: pushing dns update to fix management ips for new apaches
  • 19:05 Fred: added storage3 to ganglia monitor.
  • 18:50 logmsgbot: brion synchronized php-1.5/abusefilter.php 'Disable dewiki missingsummary, mysteriously in abusefilter section. Per bug 19208'
  • 16:30 Fred: installed wikimedia-nis-client on srv66 and mounted /home.
  • 16:28 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fixing wikispecies RC-IRC prefix to species.wikimedia'
  • 16:27 brion: test wiki was apparently moved from dead srv35 to srv66, which has the new NFS-less config; thus it fails, since test runs from NFS
  • 16:24 brion: test wiki borked; reported down for several days now :) investigating
  • 15:12 logmsgbot: midom synchronized php-1.5/db.php 'db26 raid issues'
  • 14:55 logmsgbot: midom synchronized php-1.5/db.php 'db3 and db5 coming live as commons servers'
  • 14:13 domas: dropped few more snapshots, as %sys was increasing on ms1...
  • 11:16 domas: manually restarted plethora of failing apaches (direct segfaults and other possible APC corruptions, leading to php OOM errors)
  • 09:50 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialBlockip.php
  • 09:00 Tim: restarted apache2 on image scalers
  • 08:39 logmsgbot: tstarling synchronized php-1.5/includes/Math.php 'statless render hack'
  • 08:05 Tim: killed all image scalers to see if that helps with ms1 load
  • 08:00 Tim: killed waiting apache processes
  • 07:35 logmsgbot: midom synchronized php-1.5/mc-pmtpa.php
  • 07:24 logmsgbot: midom synchronized php-1.5/mc-pmtpa.php 'swapping out srv81'
  • 04:11 Tim: fixed /opt/local/bin/zfs-replicate on ms1 to write the snapshot number before starting replication, to avoid the permanent error "dataset already exists" after a failure (see the sketch after this list)
  • 02:16 brion: -> https://bugzilla.wikimedia.org/show_bug.cgi?id=19683
  • 02:12 brion: sync-common script doesn't work on nfs-free apaches; language lists etc not being updated. Deployment scripts need to be fixed?
  • 02:03 brion: srv159 is absurdly loaded/lagged wtf?
  • 01:58 brion: reports of servers with old config, seeing "doesn't exist" for new mhr.wikipedia. checking...
  • 01:16 brion: so far so good; CPU graphs on image scalers and ms1 look clean, and I can purge thumbs on commons ok
  • 01:10 brion: trying switching image scalers back in for a few, see if they go right back to old pattern or not
  • 01:03 brion: load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving out http images fine... of course scaling's dead :P
  • 00:59 brion: stopping apache on image scaler boxes, see what that does
  • 00:49 brion: attempting to replicate domas's earlier temp success dropping oldest snapshot (last was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00
  • 00:45 brion: restarting nfs server
  • 00:44 brion: stopping nfs server, restarting web server
  • 00:40 brion: restarting nfs server on ms1
  • 00:36 brion: doesn't seem so far to have changed the NFS access delays on image scalers.
  • 00:31 brion: shutting down webserver7 on ms1
  • 00:23 brion: investigating site problem reports. image server stack seems overloaded, so intermittent timeouts on nfs to apaches or http/squid to outside
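
The 04:11 entry changes /opt/local/bin/zfs-replicate to record the snapshot name before replication, so a failed run is not retried against a snapshot that already exists on the target. A hedged sketch of that ordering only, not of the real script; the state file, target host and target dataset are assumptions, and the send shown is a full (non-incremental) one just to keep the example short:

    snap="export/upload@replicate-$(date +%Y%m%d%H%M)"
    zfs snapshot "$snap"
    echo "$snap" > /var/run/zfs-replicate.last          # record the snapshot *before* sending
    zfs send "$snap" | ssh ms4 zfs receive -F d0/upload # a retried run picks a fresh name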

July 12

  • 20:30 domas: dropped few snapshots on ms1, observed sharp %sys decrease and much better nfs properties immediately
  • 20:05 domas: we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1
  • 18:55 domas: zil_disable=1 on ms1
  • 18:34 mark: Upgraded pybal on lvs3
  • 18:16 mark: Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s
  • 17:58 domas: scaler stampedes caused scalers to be depooled by pybal, directing the stampede at the remaining servers in round-robin fashion, all of them blocking and consuming ms1 SJSWS slots. Of course, high I/O load contributed to this.
  • 17:55 domas: investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-)
  • 17:54 domas: bumped up ms1 SJSWS thread count
  • 11:00 domas: hehehehehe, disabled peer verification on zwinger for now (an expiry check is sketched at the end of this section); the offending certificate:
      Issuer: C=US, ST=Florida, L=Tampa, O=Wikimedia Foundation Inc., OU=Operations, CN=srv1.pmtpa.wmnet
       Validity
           Not Before: Jul  8 08:03:52 2006 GMT
           Not After : Jul 12 08:03:52 2009 GMT
  • 08:43 tomaszf: rebooted wikitech due to out of memory
Jul 12 14:17:32 <TimStarling>	!log reduced MaxClients on wikitech.wikimedia.org from 150 to 5
Jul 12 14:06:33 <domas>	!log srv1 certificate expired
Jul 12 11:31:58 <tomaszf>	!log rebooted wikitech due to out of memory
Jul 12 11:07:58 <tomaszf>	!log rebooting wikitech
Jul 12 08:41:30 <logmsgbot>	!log tstarling synchronized php-1.5/InitialiseSettings.php 
Jul 12 08:40:31 <logmsgbot>	!log tstarling synchronized php-1.5/includes/ImagePage.php 
Jul 12 08:40:15 <logmsgbot>	!log tstarling synchronized php-1.5/includes/DefaultSettings.php 
Jul 12 08:39:55 <TimStarling>	!log merging and deploying r53130, will disable archive thumbnails and see if it has an impact on ms1 load
Jul 12 00:31:07 <logmsgbot>	!log midom synchronized php-1.5/db.php 
Jul 11 22:17:15 <logmsgbot>	!log andrew synchronized php-1.5/InitialiseSettings.php 
Jul 11 22:15:46 <werdna>	!log Still very slow, going to disable CentralNotice again
Jul 11 22:07:30 <RoanKattouw>	!log wikitech.wikimedia.org is down
Jul 11 20:40:26 <logmsgbot>	!log tstarling synchronized php-1.5/InitialiseSettings.php  're-enabling CentralNotice'
Jul 11 19:32:06 <TimStarling>	!log killed waiting processes again
Jul 11 19:24:11 <TimStarling>	!log killed all processes in the rpc_wait state, to buy us some time
Jul 11 19:12:06 <mark>	!log Reverted cache_mem reduction on upload squids; the cause of memory pressure is a memleak
Jul 11 19:07:47 <TimStarling>	!log apaches took a while to restart due to some shell processes hanging on to listening *:80 filehandles while waiting for NFS, should be fixed now
Jul 11 19:03:02 <mark>	!log Restarting memory leaking frontend squids in upload pmtpa cluster
Jul 11 18:57:48 <TimStarling>	!log restarting apaches
Jul 11 18:56:15 <mark>	!log Reduced cache_mem from 3000 to 2000 MB on pmtpa upload cache squids
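
The 11:00 entry and the raw IRC lines above note an expired srv1 certificate (Not After: Jul 12 08:03:52 2009 GMT). A minimal sketch for checking expiry; the certificate path is an assumption:

    # inspect a certificate file
    openssl x509 -in /etc/ssl/certs/srv1.pem -noout -subject -dates
    # or check what the server is actually presenting
    echo | openssl s_client -connect srv1.pmtpa.wmnet:443 2>/dev/null | openssl x509 -noout -dates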

July 11

  • 15:45 mark: Rebooting sq1
  • 15:31 Tim: rebooting ms1
  • 14:54 Tim: disabled CentralNotice temporarily
  • 14:54 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'disabling CentralNotice'
  • 14:53 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'disabling CentralAuth'
  • 14:36 Tim: restarted webserver7 on ms1
  • 14:22 Tim: some kind of overload, seems to be image related
  • 10:09 logmsgbot: midom synchronized php-1.5/db.php 'db8 doing commons read load, full write though'
  • 09:22 domas: restarted job queue with externallinks purging code, <3
  • 09:22 domas: installed nrpe on db2 :)
  • 09:22 logmsgbot: midom synchronized php-1.5/db.php 'giving db24 just negligible load for now'
  • 08:38 logmsgbot: midom synchronized php-1.5/includes/parser/ParserOutput.php 'livemerging r53103:53105'
  • 08:37 logmsgbot: midom synchronized php-1.5/includes/DefaultSettings.php

July 10

  • 21:21 Fred: added ganglia to db20
  • 19:58 logmsgbot: azafred synchronized php-1.5/CommonSettings.php 'removed border=0 from wgCopyrightIcon'
  • 18:58 Fred: synched nagios config to reflect cleanup.
  • 18:52 Fred: cleaned up the node_files for dsh and removed all decommissioned hosts.
  • 18:36 mark: Added DNS entries for srv251-500
  • 18:18 logmsgbot: fvassard synchronized php-1.5/mc-pmtpa.php 'Added a couple spare memcache hosts.'
  • 18:16 RobH_DC: moved test to srv66 instead.
  • 18:08 RobH_DC: turning srv210 into test.wikipedia.org
  • 17:56 Andrew: Reactivating UsabilityInitiative globally, too.
  • 17:55 Andrew: Scapping, back-out diff is in /home/andrew/usability-diff
  • 17:43 Andrew: Apply r52926, r52930, and update Resources and EditToolbar/images
  • 16:44 Fred: reinstalled and configured gmond on storage1.
  • 15:08 Rob: upgraded blog and techblog to wordpress 2.8.1
  • 13:58 logmsgbot: midom synchronized php-1.5/includes/api/ApiQueryCategoryMembers.php 'hello, fix!'
  • 12:40 Tim: prototype.wikimedia.org is in OOM death, nagios reports down 3 hours, still responsive on shell so I will try a light touch
  • 11:07 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php 'more'
  • 10:58 Tim: installed memcached on srv200-srv209 (reachability check sketched after this list)
  • 10:51 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php 'deployed the 11 available spares, will make some more'
  • 10:48 Tim: mctest.php reports 17 servers down out of 78, most from the range that Rob decommissioned
  • 10:37 Tim: installed memcached on srv120, srv121, srv122, srv123
  • 10:32 Tim: found rogue server srv101, missing puppet configuration and so skipping syncs. Uninstalled apache on it.
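
After the 10:37/10:58 memcached installs and the mctest.php report of down servers, a quick reachability check can be sketched as follows; port 11211 is the memcached default and the host range follows the 10:58 entry:

    for n in $(seq 200 209); do
        # send "stats" over a short-lived nc connection; a healthy memcached answers with STAT lines
        if echo stats | nc -w 2 "srv${n}" 11211 | grep -q '^STAT uptime'; then
            echo "srv${n}: up"
        else
            echo "srv${n}: unreachable"
        fi
    done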

July 9

  • 23:56 RoanKattouw: Rebooted prototype around 16:30, got stuck around 15:30
  • 21:43 Rob: srv35 (test.wikipedia.org) is not POSTing; I think it's dead, Jim.
  • 21:35 Rob: decommissioned srv55 and put srv35 in its place in C4, test.wikipedia.org should be back online shortly
  • 20:04 Rob: removed decommissioned servers from node groups, getting error on syncing up nagios.
  • 20:03 Rob: updated dns for new apache servers
  • 19:54 Rob: decommissioned all old apaches in rack pmtpa b2
  • 16:22 Tim: creating mhrwiki (bug 19515)
  • 13:27 domas: db13 controller battery failed, s2 needs master switch eventually

July 8

  • 15:48 domas: frontend.conf changes: fixed cache-control headers for /w/extensions/ assets, did some RE optimizations %)
  • 13:31 logmsgbot: midom synchronized php-1.5/InitialiseSettings.php 'disabling usability initiative on all wikis, except test and usability. someone who enabled this and left at this state should be shot'

July 7

  • 19:06 Fred: adjusted www.wikipedia.org apache conf file to remove a redirect-loop to www.wikibooks.org. (bug #19460)
  • 17:34 Fred: found the cause of Ganglia issues: Puppet. Seems like the configuration of the master hosts gets reverted to being deaf automagically...
  • 17:05 Fred: ganglia fixed. For some reason the master cluster nodes were set to deaf mode (i.e. the aggregator couldn't gather data from them); see the check sketched after this list.
  • 15:02 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19470 Rollback on pt.wikipedia'
  • 03:37 Fred: fixing ganglia. Expect disruption
  • 00:27 tomaszf: starting six worker threads for xml snapshots
  • 00:12 Fred: srv142 and srv55 will need manual power-cycle.
  • 00:10 Fred: Rolling reboot has finally completed.
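
The 17:05 and 17:34 entries trace the Ganglia problem to master-cluster gmond instances being switched to deaf mode, so the aggregator received no data. A hedged check; the config path and exact layout vary by Ganglia version:

    # a deaf gmond never aggregates data from its peers; the masters should say "deaf = no"
    grep -A 10 '^globals' /etc/gmond.conf | grep deaf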

July 6

  • 23:57 Fred: restarted ganglia since it is acting up...
  • 23:54 tomaszf: restarting all xml snapshots due to kernel upgrades
  • 18:49 Rob: upgraded spam detection plugins on blog and techblog
  • 18:47 Fred: starting a rolling reboot of servers in the Apaches cluster (see the sketch after this list).
  • 17:53 tomaszf: cleaning out space on storage2. lowering retention for xml snapshots to 10
  • 17:53 Fred: upgrading kernel on cluster. This will take a while!
  • 17:46 Fred: rebooting srv220 to test kernel update.
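
The 18:47 entry starts a rolling reboot of the apache cluster. A hedged sketch of the pattern; the dsh group file and the fixed pause are assumptions (the real procedure presumably waited for each host to come back):

    while read -r host; do
        echo "rebooting ${host}"
        ssh "${host}" 'shutdown -r now' || echo "could not reach ${host}" >&2
        sleep 120                      # crude pacing so the pool is never drained all at once
    done < /etc/dsh/group/apaches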

July 3

  • 12:51 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewEdit.php 'Re-activating abuse filter public logging in the logging table now that log_type and log_action have been expanded.'
  • 11:45 mark: Kicked iris so it would boot
  • 10:11 logmsgbot: andrew synchronized php-1.5/skins/common/htmlform.js 'IE7 fixes for new preference system'
  • 05:51 Tim: restarted squid instances on sq28
  • 05:47 Tim: restarted squid instances on sq2
  • 05:46 Tim: started squid backend on sq10 and sq23, sq24, sq31, restarted frontend on most of those to reduce memory usage
  • 05:35 Tim: restarted squid backend on sq16, was reporting "gateway timeout" apparently for all requests. Seemed to fix it. Will try that for a few more that nagios is complaining about.

July 2

  • 21:38 Rob: sq24 won't accept ssh; depooling.
  • 21:34 Rob: rebooting sq21
  • 21:26 Rob: ran changes to push dns back to normal scenario
  • 19:52 mark: Power outage at esams, moving traffic
  • 19:44 Andrew: Knams down, Rob is looking into it
  • 19:41 Andrew: Reports of problems from Europe
  • 19:25 Andrew: running sync-common-all to deploy mobileRedirect.php to fix hcatlin's mobile redirect/cookie bug
  • 19:22 logmsgbot: andrew synchronized live-1.5/mobileRedirect.php
  • 17:15 mark: Rebooted srv159
  • 16:13 Fred: shutting srv217 back down, as it is not supposed to be up; a faulty timer is causing issues.
  • 16:12 Fred: rebooted srv217. Was unpingable.
  • 14:09 Andrew: Started sending updates of spam.log to the Project Honeypot folks every 5 minutes, via my crontab on hume (see the sketch after this list).
  • 11:20 logmsgbot: andrew synchronized php-1.5/skins/common/shared.css 'Live-merging r52669, r52684 at rainman's request, search fixes.'
  • 11:18 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialSearch.php 'Live-merging r52669, r52684 at rainman's request, search fixes.'
  • 00:03 logmsgbot: brion synchronized php-1.5/CommonSettings.php
  • 00:02 logmsgbot: brion synchronized php-1.5/extensions/MWSearch/MWSearch_body.php 'de-merge broken r52664'
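
The 14:09 entry adds a five-minute job to a personal crontab on hume. A minimal sketch of such an entry; the script path is hypothetical:

    # append a five-minute job to the current user's crontab
    ( crontab -l; echo '*/5 * * * * /home/andrew/bin/submit-spamlog-to-honeypot.sh' ) | crontab -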

July 1

  • 23:40 brion: poking in tweaks to search and updates to vector
  • 23:22 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'bump wgStyleVersion'
  • 23:21 logmsgbot: brion synchronized php-1.5/skins/vector/main-rtl.css
  • 23:21 logmsgbot: brion synchronized php-1.5/skins/vector/main-ltr.css
  • 23:10 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'set vector skin, new toolbar on for usability wiki'
  • 23:07 mark: Kicked pascal
  • 23:05 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.php 'bumping the js ver no'
  • 23:01 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.js
  • 22:59 logmsgbot: brion synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php 'to 52659'
  • 22:57 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.i18n.php
  • 22:44 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'enabling UsabilityInitiative (for optional EditToolbar)'
  • 22:43 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'disabling EditWarning pending addl talk'
  • 22:40 brion-codereview: updating UsabilityInitiative ext to r52657 in prep for enabling new toolbar option
  • 22:10 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'Enabling new search UI formatting sitewide'
  • 22:02 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fixing the RTL disable for vector'
  • 21:58 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'Vector should now be available in prefs for non-RTL sites'
  • 21:57 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'vector config tweak'
  • 21:42 brion-codereview: updating Vector to current
  • 19:38 logmsgbot: midom synchronized php-1.5/db.php
  • 16:13 Fred: bayes is running out of memory on a regular basis. Enabled process accounting / sar to gather more data.
  • 15:48 Fred: rebooting Bayes as it locked up again.
  • 11:48 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'trying a lower value for $wgMaxMsgCacheEntrySize'
  • 11:19 domas: cleaned up srv100
  • 11:18 domas: noticed that imagemagick tempfiles are currently created in /u/l/a/c-l/p/ :)
  • 09:24 domas: pinned mysqlds to half of the cores on the 8-core boxes (expanded in the sketch below): for i in {11..30}; do ssh db$i 'taskset -pc 0-15:2 $(pidof mysqld)' ; done
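
The 09:24 one-liner above, spelled out: the CPU list 0-15:2 uses taskset's stride syntax and selects every second logical CPU (0, 2, ..., 14), confining each mysqld to half of the logical CPUs on those boxes:

    for i in {11..30}; do
        # -p: operate on an existing PID; -c: give the affinity as a CPU list
        ssh "db${i}" 'taskset -pc 0-15:2 $(pidof mysqld)'
    done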

