Server Admin Log/Archive 20
Revision as of 22:16, 23 July 2009
July 23
- 22:16 mark: Decommissioned all yaseo servers, wiped their disks
- 20:35 mark: Updated the glue record for ns1.wikimedia.org
- 20:20 mark: Changed IP of ns1.wikimedia.org to 208.80.152.142 (a svc ip on linne)
- 19:58 mark: Installed linne.wikimedia.org as auth DNS server
- 19:12 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'restore FancyCaptcha now that image crisis is diverted'
- 15:38 Rob: setup strategywiki for the strategy planning whatever
- 15:37 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php
- 15:32 logmsgbot: robh ran sync-common-all
July 22
- 21:20 Fred: installing wikimedia-task-appserver on srv122. Incoming reboot
- 15:11 Tim: fixed firewall on browne to deny RC->IRC UDP packets from outside the local network
- 09:57 logmsgbot: midom synchronized php-1.5/wmf-config/../StartProfiler.php
- 09:51 domas: apparently someone decided that our profiling is not useful and should be disabled? :)
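The 15:11 firewall change on browne (accept the recent-changes UDP feed only from the local network, drop it from outside) corresponds to a rule pair of roughly this shape. This is a sketch in iptables-save format: the port number and subnet are placeholders, not the actual feed port or Wikimedia network.

```
# iptables rules fragment (sketch; port 9390 and 10.0.0.0/8 are
# placeholder values, not the real RC->IRC feed port or subnet)
*filter
-A INPUT -p udp --dport 9390 -s 10.0.0.0/8 -j ACCEPT
-A INPUT -p udp --dport 9390 -j DROP
COMMIT
```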
July 21
- 23:56 Fred: rebooted pascal (for realz this time)
- 23:15 tomaszf: fred is pulling backups from ms4 onto storage2.
- 23:07 Fred: rebooting pascal as he fell over again
- 22:45 tomaszf: adding snapshot1,2,3 to DHCP
- 22:03 mark: Increased large object cache dir size to 120 GB on eiximenis
- 18:28 domas: srv122 booted into netinstall, apparently
- 17:39 Rob: updated both blog and techblog to newest stable release of wordpress
- 16:36 brion: internal UDP logging broken since 17 July; looks like udp2log isn't running on db20 since reboot?
- 16:21 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php 'death to captcha'
- 16:15 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'death to the uploader group'
- 16:06 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php
- 16:00 Rob: updated sync-file script for new file locations
- 16:00 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php
- 15:50 Rob: removing old whygive blog data from dns and archiving the database.
- 15:07 Rob: updated planet with http://meta.wikimedia.org/wiki/Planet_Wikimedia#Requests_for_inclusion
- 14:43 mark: Increased COSS cache dirs on pmtpa upload squids
- 11:30 domas: for i in $(ssh db20 findevilapaches); do ssh $i invoke-rc.d apache2 restart; done \o/
- 11:29 domas: killed brion's sync processes on zwinger, hanging since July17 :)
- 09:15 domas: mgmt-restarted srv156
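The COSS cache-dir increase at 14:43 refers to Squid's cyclic object storage system, configured via the `cache_dir coss` directive in squid.conf. A fragment of the general shape, with device path and sizes that are purely illustrative (the values actually used on the pmtpa upload squids are not in the log):

```
# squid.conf sketch -- COSS cache_dir (path and sizes are assumptions).
# coss packs many small objects into one file, which suits upload thumbs;
# objects above max-size would go to a conventional cache_dir instead.
cache_dir coss /dev/sda3 16384 max-size=524288 block-size=512
```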
July 20
- 22:18 mark: Rebooted pascal
- 21:15 mark: Doubled cache dir sizes on eiximenis, upped carp load from 20 to 30
- 18:09 hcatlin: restarted mobile1 cluster to load in new software
- 15:55 Fred: bounced apache on srv193
- 07:18 Tim: re-enabled CentralNotice
- 07:17 logmsgbot: tstarling synchronized php-1.5/wmf-config/InitialiseSettings.php
- 07:13 apergos: enough data removed from ms1 to feel safe for a few days; started mass copy of remaining thumbs to ms4 in prep for complete repo switchover (running in root screen on ms1)
- 04:48 Tim: copying up all available MW release files from my laptop
- 04:21 Tim: mounted ms4:/export/dumps on zwinger
- 04:16 Tim: changed export options for ms4:/export/dumps to allow root access for the local subnet
- 01:19 hcatlin: On mobile1 we are now gzipping the log files after rotation in /srv/wikimedia-mobile/logs
July 19
- 22:02 Fred: restarted memcached on srv159
- 20:43 mark: eiximenis backend squid pooled
- 20:10 mark: Restarted deadlocked powerdns on bayle
- 19:14 mark: Installed eiximenis with a Squid OS install
- 18:58 mark: Moved eiximenis to vlan 100 (squids)
- 18:55 mark: Changed eiximenis' IP into 208.80.152.119 for Squid testing
- 17:41 hcatlin: Mobile1's web stack just got switched from Phusion Passenger to Nginx/Thin/Rack.
July 18
- 15:23 apergos: some thumb directories on ms4 created at request of img scalers were created with owner root and perms 700... fixing
- 03:55 river: ms5 is ready
- 01:20 atglenn: continuing with removals of thumbs on ms1. 789G free now, need to reach about 1450 before we can just "maintain". but we're gaining on it.
- 00:22 brion: set up temporary data dump index, copied the dvd index (it's just offsite links). still need to track some MW releases
- 00:07 brion: recovering MediaWiki 1.6 through 1.10 release files and re-uploading them...
July 17
- 23:42 brion: added a 404 page and recovered index.php for our temp download.wikimedia.org
- 22:05 brion: set wikitech to use vector skin by default :D
- 22:03 Andrew: Fixed morebots, which was relying on a fragile version check. Just deleted it :)
- 20:43 brion: fixed paths for noc.wikimedia.org/conf file highlighting
- 20:38 domas: ms2 has broken disks..
- 20:31 brion: We're going to see about setting up the previously-idle ms5 so we can get our thumbnailing on
- 20:01 brion: rob's poking raid rebuild on storage2 (dumps server)
- 19:03 RobH_A90: eiximenis and dobson pulled for solid state drive testing, do not use for other tasks
- 18:28 logmsgbot: brion synchronized wmf-deployment/wmf-config/InitialiseSettings.php 'enabling vector for rtl'
- 18:25 atglenn: started mass move out of the way of thumbnail dirs and replacing with symlinks to ms4
- 18:25 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'bump style version'
- 18:25 logmsgbot: brion ran sync-common-all
- 18:24 brion: running sync-common-all for UI updates. need to poke the style ver too :)
- 18:08 brion: svn up'ing wmf-deployment for test.wikipedia.org. Merged UI fixes from usability team
- 18:03 Fred: spun a couple more apache server into image scalers: srv219..srv224.
- 17:28 rainman-sr: putting new location of initialisesettings to lsearch-global-2.1.conf so the incremental updater works again
- 17:20 Fred: srv224 is now an image_scaler. Adjusted on lvs3, ganglia and dsh's node_list.
- 17:14 Fred: db20 back online
- 16:50 Fred: rebooting db20 as it is in a "state"
- 16:45 brion: looks like we've lost internal /home NFS, which makes some of our internal services very unhappy. investigating...
- 16:43 brion: ganglia out.
- 13:44 apergos1: doing next round of removals on ms1 (/export/upload/wikipedia/en/thumb/2) to keep ahead of the game
- 04:15 apergos: starting removal of /export/upload/wikipedia/en/thumb/1 on ms1 (moved away and symlink to ms4 done already) for more space
- 03:54 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'Disabling sitenotice from maintenance'
- 03:29 brion: reenabling uploads & image deletion/undeletion
- 03:29 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
- 03:28 brion: remounting ms1 on apaches
- 00:49 atglenn: only about 1gb gain on each so doing all of /export/upload/wikipedia/en/thumb/0
- 00:39 atglenn: removing more directories in /export/upload/wikipedia/en/thumb/0 on ms1 and replacing with symlinks to ms4
- 00:30 logmsgbot: brion synchronized wmf-deployment/includes/specials/SpecialUpload.php
- 00:30 logmsgbot: brion synchronized wmf-deployment/includes/ImagePage.php
- 00:27 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
- 00:20 brion: temporarily disabling image delete/rename during maintenance
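Several entries above (18:25, 04:15, 00:39) repeat one pattern: copy a thumbnail directory onto the ms4 mount, move the ms1 original aside, and leave a symlink so the old path keeps resolving. A minimal sketch of that pattern, with the copy flags and error handling of the real runs unknown; the helper name and example paths are illustrative only:

```shell
#!/bin/sh
# migrate_dir SRC DST: sketch of the move-then-symlink step from the log.
# Copy contents to the new store, park the original until verified, then
# point a symlink at the new location so clients keep using the old path.
migrate_dir() {
    src=$1 dst=$2
    mkdir -p "$dst"
    cp -a "$src/." "$dst/"        # copy contents onto the NFS mount
    mv "$src" "$src.migrated"     # keep the original data until verified
    ln -s "$dst" "$src"           # old path now resolves to the new store
}

# e.g. migrate_dir /export/upload/wikipedia/en/thumb/0/00 \
#                  /mnt/thumbs/wikipedia/en/thumb/0/00
```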
July 16
- 23:56 logmsgbot: fvassard synchronized php-1.5/wmf-config/CommonSettings.php 'Disabling uploads and setting captcha to not-fancy.'
- 23:29 atglenn: removing the images in /export/upload/wikipedia/en/thumb/0/00 on ms1 (real dir is a symlink to ms4) to get back some space
- 22:51 atglenn: sym link back in place, let's see what happens
- 22:47 atglenn: reverting temporarily while we resolve mount issues for the ms4 share
- 22:40 atglenn: ...whether the image scalers will fall over if we force them to do (some) regeneration.
- 22:37 atglenn: on ms1, /export/upload/wikipedia/en/thumb/0/00 symlinked to (shared from ms4) /mnt/thumbs/wikipedia/en/thumb/0/00 to test
- 21:29 brion: robots.php for robots.txt generation now also working. yay!
- 21:28 logmsgbot: brion synchronized live-1.5/robots.php
- 21:28 brion: extract2.php now fixed up for new deployment; portal pages ok (www.wikipedia.org)
- 21:27 logmsgbot: brion synchronized live-1.5/robots.php
- 21:26 logmsgbot: brion synchronized extract2.php
- 21:26 logmsgbot: brion synchronized extract2.php
- 21:22 logmsgbot: brion synchronized extract2.php
- 21:18 logmsgbot: brion ran sync-common-all
- 21:18 brion: rsync messed up the php-1.5 directory-to-symlink translation. retrying as root
- 21:14 logmsgbot: brion synchronized extract2.php
- 21:13 logmsgbot: brion synchronized extract2.php
- 21:13 atglenn: started copy of thumbnails to ms4, symlinks going in on ms1 (but no data removal yet)
- 21:11 logmsgbot: brion synchronized live-1.5/extract2.php
- 21:10 logmsgbot: brion synchronized live-1.5/robots.php
- 21:09 logmsgbot: brion ran sync-common-all
- 21:08 brion: attempting to replace the old php-1.5 dir with wmf-deployment symlink
- 21:02 logmsgbot: brion synchronized wmf-deployment/wmf-config/InitialiseSettings.php 'I think touching the new master InitialiseSettings will fix it'
- 21:01 logmsgbot: brion synchronized wmf-deployment/includes/GlobalFunctions.php 'mkdir error trackdown hack'
- 20:54 logmsgbot: brion synchronized wmf-deployment/wmf-config/missing.php
- 20:52 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
- 20:52 logmsgbot: brion synchronized wmf-deployment/wmf-config/reporting-setup.php
- 20:48 brion: switching all sites to wmf-deployment branch
- 20:48 logmsgbot: brion synchronized live-1.5/MWVersion.php
- 19:06 Tim: copying ExtensionDistributor stuff to ms4:/export/ext-dist, from root screen on ms1
- 19:01 brion: Now running test.wikipedia.org, www.mediawiki.org, and meta.wikimedia.org on new deployment checkout
- 19:01 logmsgbot: brion synchronized live-1.5/MWVersion.php
- 18:58 logmsgbot: brion ran sync-common-all
- 18:39 Tim: restarted xinetd on zwinger
- 18:24 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php
- 17:57 brion: also restarted 186, 196 which had some funkiness in php err log
- 17:56 brion: srv186 also bad sudo
- 17:55 brion: srv171 has some borkage; sudo config is broken can't run apache-restart as user
- 17:52 logmsgbot: brion ran sync-common-all
- 17:51 brion: running updated sync-common-all friendly to non-NFS boxes
- 17:49 brion: swapped private SVN-managed /home/wikipedia/bin into place
- 15:09 apergos: removing the last of our snapshots on ms1 :-( getting us a little more space
- 14:47 apergos: disabled snapshots on ms1 in preparation for move of thumbnails to ms4
- 14:38 brion: updated wikibugs-l list config to allow bugzilla-daemon@wikimedia.org to post
- 14:34 brion: restarted wikibugs bot
- 14:27 brion: ms1 performance seems to be sucking again
- 14:17 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'adjusting throttle temporarily for outreach event'
- 11:55 RoanKattouw: ExtensionDistributor repeatedly reported broken in the past 48 hrs
- 07:08 Fred: traffic profile switched back to normal. Esams is back to normal.
- 06:11 hcatlin: Mobile1 has returned to normal function.
- 05:58 hcatlin: Error after restarting mobile1 stopped stats logging from working. Stats will be low for July 15th and higher for July 16th. Parsing of the 6 hour log file (about 1GB) might slow server for next few minutes until caught up.
- 04:24 Rob: outage for esams servers started at approx 3:20 gmt
- 04:15 Rob: still waiting on esams to update us about the rack(s), moving traffic to pmtpa
- 00:59 tomaszf: started backup for latest xml snapshots from storage2 to ms4
July 15
- 22:30 Rob: updated dns for new snapshot servers because tomasz did not want to be in charge of dump servers.
- 22:10 brion: brion checking around for 0-byte files (not thumbs) to see if we can recover
- 21:33 atglenn: verified that zfs patch is in place on ms4 (it got sucked in during river's update yesterday)
- 21:26 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'Restore fancy captcha mode'
- 21:16 logmsgbot: I_am_not_root synchronized php-1.5/CommonSettings.php 're-enabling Uploads and removing site notice.'
- 21:01 atglenn: rebooting ms1 after applying zfs patch. *cross fingers*
- 20:51 logmsgbot: brion synchronized php-1.5/CommonSettings.php
- 20:51 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php
- 20:42 brion: reenabled captcha in simple mode (no images; math q)
- 20:37 brion: captcha system broken while images are offline, need to disable it temporarily
- 20:18 brion: updated http://en.wikipedia.org/wiki/MediaWiki:Uploaddisabledtext & http://commons.wikimedia.org/wiki/MediaWiki:Uploaddisabledtext
- 19:43 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Disabling Uploads while ms1 gets fixed (again with an s after upload).'
- 19:40 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Disabling Uploads while ms1 gets fixed.'
- 19:40 atglenn: bringing solaris up to current patch level on ms1
- 19:34 brion: Ok, we're going to temporarily shut off uploading and unmount the uploads dir while we muck about with ms1.
- 19:14 brion: dropping export/upload@daily-2009-07-11_03:10:00
- 19:08 brion: restarting web server on ms1, see if that resets some connections to the backend scalers
- 19:05 brion: restarting nfsd on ms1
- 18:58 brion: dropping zfs snapshot export/upload@daily-2009-07-09_03:10:00
- 18:25 RobH_A90: drac and physical setup done for dump1,2,3, will install remotely
- 17:52 RobH_A90: updated dns for new dump processing servers public and management ips
- 17:41 Fred: bounced apache on srv45
- 17:37 Fred: bounced apache on srv47
- 17:09 RobH_A90: pdf1 is not coming back, working on it
- 16:56 RobH_A90: shutting down pdf1 and mobile1 to move their power too, weee
- 16:55 RobH_A90: shutting down spence to move
- 16:50 RobH_A90: shutting down singer to move its power, blogs and other associated services will be offline for approx. 5 minutes
- 16:47 Andrew: Restarting apache on prototype
- 16:46 RobH_A90: shutting down grosley for power move
- 16:45 RobH_A90: all these power moves are to add the new dump processing servers to the rack
- 16:45 RobH_A90: shutting down fenari for power move
- 16:43 RobH_A90: shut down eiximenis and erzurumi to move their power
- 16:34 RobH_A90: shutting down some servers and moving power around in a4-sdtpa
- 16:17 Andrew: Changed morebots to tell you through a channel message instead of a private notice when the logging is successful.
- 15:54 Fred: kernel updated on wikitech from 2.6.18.8 to 2.6.29 (latest available on linode)
- 15:49 Andrew: Fixed auto-submission of honeypot data, was broken because it needed my perl include path.
- 15:40 Fred: rebooting wikitech to install new kernel
- 14:04 Ariel: stopped apaches on image scalers, stopped nfs on ms1, restarting nfs and apaches...
- 13:52 Ariel: removing more snapshots on ms1 (lockstat showed it hung up in metaslab_alloc again)
July 14
- 23:15 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fix fix to enwiki confirmed group :D'
- 22:24 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fix to confirmed group for en'
- 20:44 logmsgbot: brion synchronized wmf-deployment/cache/trusted-xff.cdb
- 20:41 logmsgbot: brion synchronized wmf-deployment/cache/trusted-xff.cdb
- 20:40 logmsgbot: brion synchronized wmf-deployment/AdminSettings.php
- 20:22 Fred: restarted a bunch of dead apaches
- 20:10 brion: doing a sync-common-all w/ attempt to put test.wikipedia on wmf-deployment branch
- 19:50 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
- 19:11 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611 forgot one thing'
- 19:09 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611'
- 19:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611'
- 14:16 domas: dropped all june snapshots on ms1, thus providing some relief
- 01:52 river: patched ms4 in preparation for upload copy
July 13
- 21:31 Rob: pushing dns update to fix management ips for new apaches
- 19:05 Fred: added storage3 to ganglia monitor.
- 18:50 logmsgbot: brion synchronized php-1.5/abusefilter.php 'Disable dewiki missingsummary, mysteriously in abusefilter section. Per bug 19208'
- 16:30 Fred: install wikimedia-nis-client to srv66 and mounted /home.
- 16:28 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fixing wikispecies RC-IRC prefix to species.wikimedia'
- 16:27 brion: test wiki was apparently moved from dead srv35 to srv66, which has new NFS-less config. thus fail since test runs from nfs
- 16:24 brion: test wiki borked; reported down for several days now :) investigating
- 15:12 logmsgbot: midom synchronized php-1.5/db.php 'db26 raid issues'
- 14:55 logmsgbot: midom synchronized php-1.5/db.php 'db3 and db5 coming live as commons servers'
- 14:13 domas: dropped few more snapshots, as %sys was increasing on ms1...
- 11:16 domas: manually restarted plethora of failing apaches (direct segfaults and other possible APC corruptions, leading to php OOM errors)
- 09:50 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialBlockip.php
- 09:00 Tim: restarted apache2 on image scalers
- 08:39 logmsgbot: tstarling synchronized php-1.5/includes/Math.php 'statless render hack'
- 08:05 Tim: killed all image scalers to see if that helps with ms1 load
- 08:00 Tim: killed waiting apache processes
- 07:35 logmsgbot: midom synchronized php-1.5/mc-pmtpa.php
- 07:24 logmsgbot: midom synchronized php-1.5/mc-pmtpa.php 'swapping out srv81'
- 04:11 Tim: fixed /opt/local/bin/zfs-replicate on ms1 to write the snapshot number before starting replication, to avoid permanent error "dataset already exists" after failure
- 02:16 brion: -> https://bugzilla.wikimedia.org/show_bug.cgi?id=19683
- 02:12 brion: sync-common script doesn't work on nfs-free apaches; language lists etc not being updated. Deployment scripts need to be fixed?
- 02:03 brion: srv159 is absurdly loaded/lagged wtf?
- 01:58 brion: reports of servers with old config, seeing "doesn't exist" for new mhr.wikipedia. checking...
- 01:16 brion: so far so good; CPU graphs on image scalers and ms1 look clean, and I can purge thumbs on commons ok
- 01:10 brion: trying switching image scalers back in for a few, see if they go right back to old pattern or not
- 01:03 brion: load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving out http images fine... of course scaling's dead :P
- 00:59 brion: stopping apache on image scaler boxes, see what that does
- 00:49 brion: attempting to replicate domas's earlier temp success dropping oldest snapshot (last was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00
- 00:45 brion: restarting nfs server
- 00:44 brion: stopping nfs server, restarting web server
- 00:40 brion: restarting nfs server on ms1
- 00:36 brion: doesn't seem so far to have changed the NFS access delays on image scalers.
- 00:31 brion: shutting down webserver7 on ms1
- 00:23 brion: investigating site problem reports. image server stack seems overloaded, so intermittent timeouts on nfs to apaches or http/squid to outside
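The 04:11 zfs-replicate fix above is an ordering change: persist the snapshot number before starting the transfer, so a failure mid-send doesn't leave the script retrying a snapshot the receiving side already holds (the permanent "dataset already exists" error). The real script's internals are not in the log; this sketch abstracts the send/receive into a caller-supplied command and shows only the ordering:

```shell
#!/bin/sh
# record_then_replicate NUM CMD...: write the snapshot number to the state
# file *first*, then run the replication command -- the ordering fix
# described in the log. STATE, the helper name, and the command are
# assumptions; the actual /opt/local/bin/zfs-replicate differs.
record_then_replicate() {
    num=$1; shift
    echo "$num" > "$STATE"   # persist before sending, not after
    "$@"                     # e.g. zfs send -i @prev @num | ssh ms4 zfs receive ...
}
```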
July 12
- 20:30 domas: dropped few snapshots on ms1, observed sharp %sys decrease and much better nfs properties immediately
- 20:05 domas: we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1
- 18:55 domas: zil_disable=1 on ms1
- 18:34 mark: Upgraded pybal on lvs3
- 18:16 mark: Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s
- 17:58 domas: scaler stampedes caused scalers to be depooled by pybal, thus directing stampede to other server in round-robin fashion, all blocking and consuming ms1 SJSWS slots. of course, high I/O load contributed to this.
- 17:55 domas: investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-)
- 17:54 domas: bumped up ms1 SJSWS thread count
- 11:00 domas: hehehehehe, disabled peer verification on zwinger for now:
Issuer: C=US, ST=Florida, L=Tampa, O=Wikimedia Foundation Inc., OU=Operations, CN=srv1.pmtpa.wmnet
Validity: Not Before: Jul 8 08:03:52 2006 GMT, Not After: Jul 12 08:03:52 2009 GMT
- 08:43 tomaszf: rebooted wikitech due to out of memory
- Jul 12 14:17:32 <TimStarling> !log reduced MaxClients on wikitech.wikimedia.org from 150 to 5
- Jul 12 14:06:33 <domas> !log srv1 certificate expired
- Jul 12 11:31:58 <tomaszf> !log rebooted wikitech due to out of memory
- Jul 12 11:07:58 <tomaszf> !log rebooting wikitech
- Jul 12 08:41:30 <logmsgbot> !log tstarling synchronized php-1.5/InitialiseSettings.php
- Jul 12 08:40:31 <logmsgbot> !log tstarling synchronized php-1.5/includes/ImagePage.php
- Jul 12 08:40:15 <logmsgbot> !log tstarling synchronized php-1.5/includes/DefaultSettings.php
- Jul 12 08:39:55 <TimStarling> !log merging and deploying r53130, will disable archive thumbnails and see if it has an impact on ms1 load
- Jul 12 00:31:07 <logmsgbot> !log midom synchronized php-1.5/db.php
- Jul 11 22:17:15 <logmsgbot> !log andrew synchronized php-1.5/InitialiseSettings.php
- Jul 11 22:15:46 <werdna> !log Still very slow, going to disable CentralNotice again
- Jul 11 22:07:30 <RoanKattouw> !log wikitech.wikimedia.org is down
- Jul 11 20:40:26 <logmsgbot> !log tstarling synchronized php-1.5/InitialiseSettings.php 're-enabling CentralNotice'
- Jul 11 19:32:06 <TimStarling> !log killed waiting processes again
- Jul 11 19:24:11 <TimStarling> !log killed all processes in the rpc_wait state, to buy us some time
- Jul 11 19:12:06 <mark> !log Reverted cache_mem reduction on upload squids; the cause of memory pressure is a memleak
- Jul 11 19:07:47 <TimStarling> !log apaches took a while to restart due to some shell processes hanging on to listening *:80 filehandles while waiting for NFS, should be fixed now
- Jul 11 19:03:02 <mark> !log Restarting memory leaking frontend squids in upload pmtpa cluster
- Jul 11 18:57:48 <TimStarling> !log restarting apaches
- Jul 11 18:56:15 <mark> !log Reduced cache_mem from 3000 to 2000 MB on pmtpa upload cache squids
July 11
- 15:45 mark: Rebooting sq1
- 15:31 Tim: rebooting ms1
- 14:54 Tim: disabled CentralNotice temporarily
- 14:54 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'disabling CentralNotice'
- 14:53 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'disabling CentralAuth'
- 14:36 Tim: restarted webserver7 on ms1
- 14:22 Tim: some kind of overload, seems to be image related
- 10:09 logmsgbot: midom synchronized php-1.5/db.php 'db8 doing commons read load, full write though'
- 09:22 domas: restarted job queue with externallinks purging code, <3
- 09:22 domas: installed nrpe on db2 :)
- 09:22 logmsgbot: midom synchronized php-1.5/db.php 'giving db24 just negligible load for now'
- 08:38 logmsgbot: midom synchronized php-1.5/includes/parser/ParserOutput.php 'livemerging r53103:53105'
- 08:37 logmsgbot: midom synchronized php-1.5/includes/DefaultSettings.php
July 10
- 21:21 Fred: added ganglia to db20
- 19:58 logmsgbot: azafred synchronized php-1.5/CommonSettings.php 'removed border=0 from wgCopyrightIcon'
- 18:58 Fred: synched nagios config to reflect cleanup.
- 18:52 Fred: cleaned up the node_files for dsh and removed all decommissioned hosts.
- 18:36 mark: Added DNS entries for srv251-500
- 18:18 logmsgbot: fvassard synchronized php-1.5/mc-pmtpa.php 'Added a couple spare memcache hosts.'
- 18:16 RobH_DC: moved test to srv66 instead.
- 18:08 RobH_DC: turning srv210 into test.wikipedia.org
- 17:56 Andrew: Reactivating UsabilityInitiative globally, too.
- 17:55 Andrew: Scapping, back-out diff is in /home/andrew/usability-diff
- 17:43 Andrew: Apply r52926, r52930, and update Resources and EditToolbar/images
- 16:44 Fred: reinstalled and configured gmond on storage1.
- 15:08 Rob: upgraded blog and techblog to wordpress 2.8.1
- 13:58 logmsgbot: midom synchronized php-1.5/includes/api/ApiQueryCategoryMembers.php 'hello, fix\!'
- 12:40 Tim: prototype.wikimedia.org is in OOM death, nagios reports down 3 hours, still responsive on shell so I will try a light touch
- 11:07 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php 'more'
- 10:58 Tim: installed memcached on srv200-srv209
- 10:51 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php 'deployed the 11 available spares, will make some more'
- 10:48 Tim: mctest.php reports 17 servers down out of 78, most from the range that Rob decommissioned
- 10:37 Tim: installed memcached on srv120, srv121, srv122, srv123
- 10:32 Tim: found rogue server srv101, missing puppet configuration and so skipping syncs. Uninstalled apache on it.
July 9
- 23:56 RoanKattouw: Rebooted prototype around 16:30, got stuck around 15:30
- 21:43 Rob: srv35 (test.wikipedia.org) is not posting, i think its dead jim.
- 21:35 Rob: decommissioned srv55 and put srv35 in its place in C4, test.wikipedia.org should be back online shortly
- 20:04 Rob: removed decommissioned servers from node groups, getting error on syncing up nagios.
- 20:03 Rob: updated dns for new apache servers
- 19:54 Rob: decommissioned all old apaches in rack pmtpa b2
- 16:22 Tim: creating mhrwiki (bug 19515)
- 13:27 domas: db13 controller battery failed, s2 needs master switch eventually
July 8
- 15:48 domas: frontend.conf changes: fixed cache-control headers for /w/extensions/ assets, did some RE optimizations %)
- 13:31 logmsgbot: midom synchronized php-1.5/InitialiseSettings.php 'disabling usability initiative on all wikis, except test and usability. someone who enabled this and left at this state should be shot'
July 7
- 19:06 Fred: adjusted www.wikipedia.org apache conf file to remove a redirect-loop to www.wikibooks.org. (bug #19460)
- 17:34 Fred: found the cause of Ganglia issues: Puppet. Seems like the configuration of the master hosts gets reverted to being deaf automagically...
- 17:05 Fred: ganglia fixed. For some reason the master cluster nodes were set to Deaf mode... (ie the aggregator couldn't gather data from them).
- 15:02 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19470 Rollback on pt.wikipedia'
- 03:37 Fred: fixing ganglia. Expect disruption
- 00:27 tomaszf: starting six worker threads for xml snapshots
- 00:12 Fred: srv142 and srv55 will need manual power-cycle.
- 00:10 Fred: Rolling reboot has finally completed.
July 6
- 23:57 Fred: restarted ganglia since it is acting up...
- 23:54 tomaszf: restarting all xml snapshots due to kernel upgrades
- 18:49 Rob: upgraded spam detection plugins on blog and techblog
- 18:47 Fred: starting rolling reboot of servers in Apaches cluster.
- 17:53 tomaszf: cleaning out space on storage2. lowering retention for xml snapshots to 10
- 17:53 Fred: upgrading kernel on cluster. This will take a while!
- 17:46 Fred: rebooting srv220 to test kernel update.
July 3
- 12:51 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewEdit.php 'Re-activating abuse filter public logging in the logging table now that log_type and log_action have been expanded.'
- 11:45 mark: Kicked iris so it would boot
- 10:11 logmsgbot: andrew synchronized php-1.5/skins/common/htmlform.js 'IE7 fixes for new preference system
- 05:51 Tim: restarted squid instances on sq28
- 05:47 Tim: restarted squid instances on sq2
- 05:46 Tim: started squid backend on sq10 and sq23, sq24, sq31, restarted frontend on most of those to reduce memory usage
- 05:35 Tim: restarted squid backend on sq16, was reporting "gateway timeout" apparently for all requests. Seemed to fix it. Will try that for a few more that nagios is complaining about.
July 2
- 21:38 Rob: sq24 wont accept ssh, depooling.
- 21:34 Rob: rebooting sq21
- 21:26 Rob: ran changes to push dns back to normal scenario
- 19:52 mark: Power outage at esams, moving traffic
- 19:44 Andrew: Knams down, Rob is looking into it
- 19:41 Andrew: Reports of problems from Europe
- 19:25 Andrew: running sync-common-all to deploy mobileRedirect.php to fix hcatlin's mobile redirect/cookie bug
- 19:22 logmsgbot: andrew synchronized live-1.5/mobileRedirect.php
- 17:15 mark: Rebooted srv159
- 16:13 Fred: shutting 217 back down as it is not supposed to be up due to faulty timer causing issues.
- 16:12 Fred: rebooted srv217. Was unpingable.
- 14:09 Andrew: Started sending updates of spam.log to Project Honeypot folks every 5 minutes, in my crontab on hume.
- 11:20 logmsgbot: andrew synchronized php-1.5/skins/common/shared.css 'Live-merging r52669, r52684 at rainman's request, search fixes.'
- 11:18 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialSearch.php 'Live-merging r52669, r52684 at rainman's request, search fixes.'
- 00:03 logmsgbot: brion synchronized php-1.5/CommonSettings.php
- 00:02 logmsgbot: brion synchronized php-1.5/extensions/MWSearch/MWSearch_body.php 'de-merge broken r52664'
July 1
- 23:40 brion: poking in tweaks to search and updates to vector
- 23:22 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'bump wgStyleVersion'
- 23:21 logmsgbot: brion synchronized php-1.5/skins/vector/main-rtl.css
- 23:21 logmsgbot: brion synchronized php-1.5/skins/vector/main-ltr.css
- 23:10 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'set vector skin, new toolbar on for usability wiki'
- 23:07 mark: Kicked pascal
- 23:05 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.php 'bumping the js ver no'
- 23:01 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.js
- 22:59 logmsgbot: brion synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php 'to 52659'
- 22:57 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.i18n.php
- 22:44 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'enabling UsabilityInitiative (for optional EditToolbar)'
- 22:43 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'disabling EditWarning pending addl talk'
- 22:40 brion-codereview: updating UsabilityInitiative ext to r52657 in prep for enabling new toolbar option
- 22:10 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'Enabling new search UI formatting sitewide'
- 22:02 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fixing the RTL disable for vector'
- 21:58 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'Vector should now be available in prefs for non-RTL sites'
- 21:57 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'vector config tweak'
- 21:42 brion-codereview: updating Vector to current
- 19:38 logmsgbot: midom synchronized php-1.5/db.php
- 16:13 Fred: bayes is running out of memory on a regular basis. Enabled process accounting / sar to gather more data.
- 15:48 Fred: rebooting Bayes as it locked up again.
- 11:48 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'trying a lower value for $wgMaxMsgCacheEntrySize'
- 11:19 domas: cleaned up srv100
- 11:18 domas: noticed that imagemagick tempfiles are currently created in /u/l/a/c-l/p/ :)
- 09:24 domas: pinned mysqlds on half of cores on 8-core boxes: for i in {11..30}; do ssh db$i 'taskset -pc 0-15:2 $(pidof mysqld)' ; done
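The cpu list in the last entry uses taskset's stride syntax: "0-15:2" means every second CPU in the range 0..15, that is 0,2,4,...,14, half of the 16 logical CPUs on those boxes. The same set can be expanded by hand:

```shell
# "0-15:2" in taskset -c is a range with a stride: start 0, stop 15,
# step 2. seq generates the identical CPU set:
pinned=$(seq -s, 0 2 15)
echo "$pinned"    # 0,2,4,6,8,10,12,14
```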
Archives
- Server admin log/Archive 1 (2004 Jun - 2004 Sep)
- Server admin log/Archive 2 (2004 Oct - 2004 Nov)
- Server admin log/Archive 3 (2004 Dec - 2005 Mar)
- Server admin log/Archive 4 (2005 Apr - 2005 Jul)
- Server admin log/Archive 5 (2005 Aug - 2005 Oct)
- Server admin log/Archive 6 (2005 Nov - 2006 Feb)
- Server admin log/Archive 7 (2006 Mar - 2006 Jun)
- Server admin log/Archive 8 (2006 Jul - 2006 Sep)
- Server admin log/Archive 9 (2006 Oct - 2007 Jan)
- Server admin log/Archive 10 (2007 Feb - 2007 Jun)
- Server admin log/Archive 11 (2007 Jul - 2007 Dec)
- Server admin log/Archive 12 (2008 Jan - 2008 Jul)
- Server admin log/2008-08
- Server admin log/2008-09
- Server admin log/Archive 13 (2008 Oct - 2009 Jun)