Server Admin Log/Archive 20: Difference between revisions

Revision as of 22:16, 23 July 2009

July 23

22:16 mark: Decommissioned all yaseo servers, wiped their disks
20:35 mark: Updated the glue record for ns1.wikimedia.org
20:20 mark: Changed IP of ns1.wikimedia.org to 208.80.152.142 (a svc ip on linne)
19:58 mark: Installed linne.wikimedia.org as auth DNS server
19:12 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'restore FancyCaptcha now that image crisis is diverted'
15:38 Rob: setup strategywiki for the strategy planning whatever
15:37 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php
15:32 logmsgbot: robh ran sync-common-all

July 22

21:20 Fred: installing wikimedia-task-appserver on srv122. Incoming reboot
15:11 Tim: fixed firewall on browne to deny RC->IRC UDP packets from outside the local network
09:57 logmsgbot: midom synchronized php-1.5/wmf-config/../StartProfiler.php
09:51 domas: apparently someone decided that our profiling is not useful and should be disabled? :)

July 21

23:56 Fred: rebooted pascal (for realz this time)
23:15 tomaszf: fred is pulling backups from ms4 onto storage2.
23:07 Fred: rebooting pascal as he fell over again
22:45 tomaszf: adding snapshot1,2,3 to DHCP
22:03 mark: Increased large object cache dir size to 120 GB on eiximenis
18:28 domas: srv122 booted into netinstall, apparently
17:39 Rob: updated both blog and techblog to newest stable release of wordpress
16:36 brion: internal UDP logging broken since 17 July; looks like udp2log isn't running on db20 since reboot?
16:21 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php 'death to captcha'
16:15 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'death to the uplaoder group'
16:06 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php
16:00 Rob: updated sync-file script for new file locations
16:00 logmsgbot: robh synchronized php-1.5/wmf-config/CommonSettings.php
15:50 Rob: removing old whygive blog data from dns and archiving the database.
15:07 Rob: updated planet with http://meta.wikimedia.org/wiki/Planet_Wikimedia#Requests_for_inclusion
14:43 mark: Increased COSS cache dirs on pmtpa upload squids
11:30 domas: for i in $(ssh db20 findevilapaches); do ssh $i invoke-rc.d apache2 restart; done \o/
11:29 domas: killed brion's sync processes on zwinger, hanging since July17 :)
09:15 domas: mgmt-restarted srv156

July 20

22:18 mark: Rebooted pascal
21:15 mark: Doubled cache dir sizes on eiximenis, upped carp load from 20 to 30
18:09 hcatlin: restarted mobile1 cluster to load in new software
15:55 Fred: bounced apache on srv193
07:18 Tim: re-enabled CentralNotice
07:17 logmsgbot: tstarling synchronized php-1.5/wmf-config/InitialiseSettings.php
07:13 apergos: enough data removed from ms1 to feel safe for a few days; started mass copy of remaining thumbs to ms4 in prep for complete repo switchover (running in root screen on ms1)
04:48 Tim: copying up all available MW release files from my laptop
04:21 Tim: mounted ms4:/export/dumps on zwinger
04:16 Tim: changed export options for ms4:/export/dumps to allow root access for the local subnet
01:19 hcatlin: On mobile1 we are now gzipping the log files after rotation in /srv/wikimedia-mobile/logs

July 19

22:02 Fred: restarted memcached on srv159
20:43 mark: eiximenis backend squid pooled
20:10 mark: Restarted deadlocked powerdns on bayle
19:14 mark: Installed eiximenis with a Squid OS install
18:58 mark: Moved eiximenis to vlan 100 (squids)
18:55 mark: Changed eiximenis' IP into 208.80.152.119 for Squid testing
17:41 hcatlin: Mobile1's web stack just got switched from Phusion Passenger to Nginx/Thin/Rack.

July 18

15:23 apergos: some thumb directories on ms4 created at request of img scalers were created with owner root and perms 700... fixing
03:55 river: ms5 is ready
01:20 atglenn: continuing with removals of thumbs on ms1. 789G free now, need to reach about 1450 before we can just "maintain". but we're gaining on it.
00:22 brion: set up temporary data dump index, copied the dvd index (it's just offsite links). still need to track some MW releases
00:07 brion: recovering MediaWiki 1.6 through 1.10 release files and re-uploading them...

July 17

23:42 brion: added a 404 page and recovered index.php for our temp download.wikimedia.org
22:05 brion: set wikitech to use vector skin by default :D
22:03 Andrew: Fixed morebots, which was relying on a fragile version check. Just deleted it :)
20:43 brion: fixed paths for noc.wikimedia.org/conf file highlighting
20:38 domas: ms2 has broken disks..
20:31 brion: We're going to see about setting up the previously-idle ms5 so we can get our thumbnailing on
20:01 brion: rob's poking raid rebuild on storage2 (dumps server)
19:03 RobH_A90: eiximenis and dobson pulled for solid state drive testing, do not use for other tasks
18:28 logmsgbot: brion synchronized wmf-deployment/wmf-config/InitialiseSettings.php 'enabling vector for rtl'
18:25 atglenn: started mass move out of the way of thumbnail dirs and replacing with symlinks to ms4
18:25 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'bump style version'
18:25 logmsgbot: brion ran sync-common-all
18:24 brion: running sync-common-all for UI updates. need to poke the style ver too :)
18:08 brion: svn up'ing wmf-deployment for test.wikipedia.org. Merged UI fixes from usability team
18:03 Fred: spun a couple more apache servers into image scalers: srv219..srv224.
17:28 rainman-sr: putting new location of initialisesettings to lsearch-global-2.1.conf so the incremental updater works again
17:20 Fred: srv224 is now an image_scaler. Adjusted on lvs3, ganglia and dsh's node_list.
17:14 Fred: db20 back online
16:50 Fred: rebooting db20 as it is in a "state"
16:45 brion: looks like we've lost internal /home NFS, which makes some of our internal services very unhappy. investigating...
16:43 brion: ganglia out.
13:44 apergos1: doing next round of removals on ms1 (/export/upload/wikipedia/en/thumb/2) to keep ahead of the game
04:15 apergos: starting removal of /export/upload/wikipedia/en/thumb/1 on ms1 (moved away and symlink to ms4 done already) for more space
03:54 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php 'Disabling sitenotice from maintenance'
03:29 brion: reenabling uploads & image deletion/undeletion
03:29 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
03:28 brion: remounting ms1 on apaches
00:49 atglenn: only about 1gb gain on each so doing all of /export/upload/wikipedia/en/thumb/0
00:39 atglenn: removing more directories in /export/upload/wikipedia/en/thumb/0 on ms1 and replacing with symlinks to ms4
00:30 logmsgbot: brion synchronized wmf-deployment/includes/specials/SpecialUpload.php
00:30 logmsgbot: brion synchronized wmf-deployment/includes/ImagePage.php
00:27 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
00:20 brion: temporarily disabling image delete/rename during maintenance

July 16

23:56 logmsgbot: fvassard synchronized php-1.5/wmf-config/CommonSettings.php 'Disabling uploads and setting captcha to not-fancy.'
23:29 atglenn: removing the images in /export/upload/wikipedia/en/thumb/0/00 on ms1 (real dir is a symlink to ms4) to get back some space
22:51 atglenn: sym link back in place, let's see what happens
22:47 atglenn: reverting temporarily while we resolve mount issues for the ms4 share
22:40 atglenn: ...whether the image scalers will fall over if we force them to do (some) regeneration.
22:37 atglenn: on ms1, /export/upload/wikipedia/en/thumb/0/00 symlinked to (shared from ms4) /mnt/thumbs/wikipedia/en/thumb/0/00 to test
21:29 brion: robots.php for robots.txt generation now also working. yay!
21:28 logmsgbot: brion synchronized live-1.5/robots.php
21:28 brion: extract2.php now fixed up for new deployment; portal pages ok (www.wikipedia.org)
21:27 logmsgbot: brion synchronized live-1.5/robots.php
21:26 logmsgbot: brion synchronized extract2.php
21:26 logmsgbot: brion synchronized extract2.php
21:22 logmsgbot: brion synchronized extract2.php
21:18 logmsgbot: brion ran sync-common-all
21:18 brion: rsync messed up the php-1.5 directory to symlink translation. retrying as root
21:14 logmsgbot: brion synchronized extract2.php
21:13 logmsgbot: brion synchronized extract2.php
21:13 atglenn: started copy of thumbnails to ms4, symlinks going in on ms1 (but no data removal yet)
21:11 logmsgbot: brion synchronized live-1.5/extract2.php
21:10 logmsgbot: brion synchronized live-1.5/robots.php
21:09 logmsgbot: brion ran sync-common-all
21:08 brion: attempting to replace the old php-1.5 dir with wmf-deployment symlink
21:02 logmsgbot: brion synchronized wmf-deployment/wmf-config/InitialiseSettings.php 'I think touching the new master InitialiseSettings will fix it'
21:01 logmsgbot: brion synchronized wmf-deployment/includes/GlobalFunctions.php 'mkdir error trackdown hack'
20:54 logmsgbot: brion synchronized wmf-deployment/wmf-config/missing.php
20:52 logmsgbot: brion synchronized wmf-deployment/wmf-config/CommonSettings.php
20:52 logmsgbot: brion synchronized wmf-deployment/wmf-config/reporting-setup.php
20:48 brion: switching all sites to wmf-deployment branch
20:48 logmsgbot: brion synchronized live-1.5/MWVersion.php
19:06 Tim: copying ExtensionDistributor stuff to ms4:/export/ext-dist, from root screen on ms1
19:01 brion: Now running test.wikipedia.org, www.mediawiki.org, and meta.wikimedia.org on new deployment checkout
19:01 logmsgbot: brion synchronized live-1.5/MWVersion.php
18:58 logmsgbot: brion ran sync-common-all
18:39 Tim: restarted xinetd on zwinger
18:24 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php
17:57 brion: also restarted 186, 196 which had some funkiness in php err log
17:56 brion: srv186 also bad sudo
17:55 brion: srv171 has some borkage; sudo config is broken can't run apache-restart as user
17:52 logmsgbot: brion ran sync-common-all
17:51 brion: running updated sync-common-all friendly to non-NFS boxes
17:49 brion: swapped private SVN-managed /home/wikipedia/bin into place
15:09 apergos: removing the last of our snapshots on ms1 :-( getting us a little more space
14:47 apergos: disabled snapshots on ms1 in preparation for move of thumbnails to ms4
14:38 brion: updated wikibugs-l list config to allow bugzilla-daemon@wikimedia.org to post
14:34 brion: restarted wikibugs bot
14:27 brion: ms1 performance seems to be sucking again
14:17 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'adjusting throttle temporarily for outreach event'
11:55 RoanKattouw: ExtensionDistributor repeatedly reported broken in the past 48 hrs
07:08 Fred: traffic profile switched back to normal. Esams is back to normal.
06:11 hcatlin: Mobile1 has returned to normal function.
05:58 hcatlin: Error after restarting mobile1 stopped stats logging from working. Stats will be low for July 15th and higher for July 16th. Parsing of the 6 hour log file (about 1GB) might slow server for next few minutes until caught up.
04:24 Rob: outage for esams servers started at approx 3:20 gmt
04:15 Rob: still waiting on esams to update us about the rack(s), moving traffic to pmtpa
00:59 tomaszf: started backup for latest xml snapshots from storage2 to ms4

July 15

22:30 Rob: updated dns for new snapshot servers because tomasz did not want to be in charge of dump servers.
22:10 brion: checking around for 0-byte files (not thumbs) to see if we can recover
21:33 atglenn: verified that zfs patch is in place on ms4 (it got sucked in during river's update yesterday)
21:26 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'Restore fancy captcha mode'
21:16 logmsgbot: I_am_not_root synchronized php-1.5/CommonSettings.php 're-enabling Uploads and removing site notice.'
21:01 atglenn: rebooting ms1 after applying zfs patch. *cross fingers*
20:51 logmsgbot: brion synchronized php-1.5/CommonSettings.php
20:51 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php
20:42 brion: reenabled captcha in simple mode (no images; math q)
20:37 brion: captcha system broken while images are offline, need to disable it temporarily
20:18 brion: updated http://en.wikipedia.org/wiki/MediaWiki:Uploaddisabledtext & http://commons.wikimedia.org/wiki/MediaWiki:Uploaddisabledtext
19:43 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Disabling Uploads while ms1 gets fixed (again with an s after upload).'
19:40 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Disabling Uploads while ms1 gets fixed.'
19:40 atglenn: bringing solaris up to current patch level on ms1
19:34 brion: Ok, we're going to temporarily shut off uploading and unmount the uploads dir while we muck about with ms1.
19:14 brion: dropping export/upload@daily-2009-07-11_03:10:00
19:08 brion: restarting web server on ms1, see if that resets some connections to the backend scalers
19:05 brion: restarting nfsd on ms1
18:58 brion: dropping zfs snapshot export/upload@daily-2009-07-09_03:10:00
18:25 RobH_A90: drac and physical setup done for dump1,2,3, will install remotely
17:52 RobH_A90: updated dns for new dump processing servers public and management ips
17:41 Fred: bounced apache on srv45
17:37 Fred: bounced apache on srv47
17:09 RobH_A90: pdf1 is not coming back, working on it
16:56 RobH_A90: shutting down pdf1 and mobile1 to move their power too, weee
16:55 RobH_A90: shutting down spence to move
16:50 RobH_A90: shutting down singer to move its power, blogs and other associated services will be offline for approx. 5 minutes
16:47 Andrew: Restarting apache on prototype
16:46 RobH_A90: shutting down grosley for power move
16:45 RobH_A90: all these power moves are to add the new dump processing servers to the rack
16:45 RobH_A90: shutting down fenari for power move
16:43 RobH_A90: shut down eiximenis and erzurumi to move their power
16:34 RobH_A90: shutting down some servers and moving power around in a4-sdtpa
16:17 Andrew: Changed morebots to tell you through a channel message instead of a private notice when the logging is successful.
15:54 Fred: kernel updated on wikitech from 2.6.18.8 to 2.6.29 (latest available on linode)
15:49 Andrew: Fixed auto-submission of honeypot data, was broken because it needed my perl include path.
15:40 Fred: rebooting wikitech to install new kernel
14:04 Ariel: stopped apaches on image scalers, stopped nfs on ms1, restarting nfs and apaches...
13:52 Ariel: removing more snapshots on ms1 (lockstat showed it hung up in metaslab_alloc again)

July 14

23:15 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fix fix to enwiki confirmed gruop :D'
22:24 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fix to confirmed group for en'
20:44 logmsgbot: brion synchronized wmf-deployment/cache/trusted-xff.cdb
20:41 logmsgbot: brion synchronized wmf-deployment/cache/trusted-xff.cdb
20:40 logmsgbot: brion synchronized wmf-deployment/AdminSettings.php
20:22 Fred: restarted a bunch of dead apaches
20:10 brion: doing a sync-common-all w/ attempt to put test.wikipedia on wmf-deployment branch
19:50 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
19:11 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611 forgot one thing'
19:09 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611'
19:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19611'
14:16 domas: dropped all june snapshots on ms1, thus providing some relief
01:52 river: patched ms4 in preparation for upload copy

July 13

21:31 Rob: pushing dns update to fix management ips for new apaches
19:05 Fred: added storage3 to ganglia monitor.
18:50 logmsgbot: brion synchronized php-1.5/abusefilter.php 'Disable dewiki missingsummary, mysteriously in abusefilter section. Per bug 19208'
16:30 Fred: installed wikimedia-nis-client on srv66 and mounted /home.
16:28 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fixing wikispecies RC-IRC prefix to species.wikimedia'
16:27 brion: test wiki was apparently moved from dead srv35 to srv66, which has new NFS-less config. thus fail since test runs from nfs
16:24 brion: test wiki borked; reported down for several days now :) investigating
15:12 logmsgbot: midom synchronized php-1.5/db.php 'db26 raid issues'
14:55 logmsgbot: midom synchronized php-1.5/db.php 'db3 and db5 coming live as commons servers'
14:13 domas: dropped few more snapshots, as %sys was increasing on ms1...
11:16 domas: manually restarted plethora of failing apaches (direct segfaults and other possible APC corruptions, leading to php OOM errors)
09:50 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialBlockip.php
09:00 Tim: restarted apache2 on image scalers
08:39 logmsgbot: tstarling synchronized php-1.5/includes/Math.php 'statless render hack'
08:05 Tim: killed all image scalers to see if that helps with ms1 load
08:00 Tim: killed waiting apache processes
07:35 logmsgbot: midom synchronized php-1.5/mc-pmtpa.php
07:24 logmsgbot: midom synchronized php-1.5/mc-pmtpa.php 'swapping out srv81'
04:11 Tim: fixed /opt/local/bin/zfs-replicate on ms1 to write the snapshot number before starting replication, to avoid permanent error "dataset already exists" after failure
02:16 brion: -> https://bugzilla.wikimedia.org/show_bug.cgi?id=19683
02:12 brion: sync-common script doesn't work on nfs-free apaches; language lists etc not being updated. Deployment scripts need to be fixed?
02:03 brion: srv159 is absurdly loaded/lagged wtf?
01:58 brion: reports of servers with old config, seeing "doesn't exist" for new mhr.wikipedia. checking...
01:16 brion: so far so good; CPU graphs on image scalers and ms1 look clean, and I can purge thumbs on commons ok
01:10 brion: trying switching image scalers back in for a few, see if they go right back to old pattern or not
01:03 brion: load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving out http images fine... of course scaling's dead :P
00:59 brion: stopping apache on image scaler boxes, see what that does
00:49 brion: attempting to replicate domas's earlier temp success dropping oldest snapshot (last was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00
00:45 brion: restarting nfs server
00:44 brion: stopping nfs server, restarting web server
00:40 brion: restarting nfs server on ms1
00:36 brion: doesn't seem so far to have changed the NFS access delays on image scalers.
00:31 brion: shutting down webserver7 on ms1
00:23 brion: investigating site problem reports. image server stack seems overloaded, so intermittent timeouts on nfs to apaches or http/squid to outside

July 12

20:30 domas: dropped few snapshots on ms1, observed sharp %sys decrease and much better nfs properties immediately
20:05 domas: we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1
18:55 domas: zil_disable=1 on ms1
18:34 mark: Upgraded pybal on lvs3
18:16 mark: Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s
17:58 domas: scaler stampedes caused scalers to be depooled by pybal, thus directing stampede to other server in round-robin fashion, all blocking and consuming ms1 SJSWS slots. of course, high I/O load contributed to this.
17:55 domas: investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-)
17:54 domas: bumped up ms1 SJSWS thread count
11:00 domas: hehehehehe, disabled peer verification on zwinger for now:

      Issuer: C=US, ST=Florida, L=Tampa, O=Wikimedia Foundation Inc., OU=Operations, CN=srv1.pmtpa.wmnet
       Validity
           Not Before: Jul  8 08:03:52 2006 GMT
           Not After : Jul 12 08:03:52 2009 GMT

08:43 tomaszf: rebooted wikitech due to out of memory
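
The 11:00 entry above quotes the validity window of the srv1 certificate, whose "Not After" time had passed a few hours before peer verification was disabled. As a minimal sketch of that check, assuming the dates are in the OpenSSL text format shown and that log times are UTC (the function names and the sample "now" value are illustrative, not part of the original tooling):

```python
from datetime import datetime, timezone

# Format of the "Not Before" / "Not After" lines as quoted in the log entry.
CERT_TIME_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_cert_time(text: str) -> datetime:
    """Parse a certificate validity timestamp into an aware UTC datetime."""
    # Single-digit days are padded with an extra space ("Jul  8"); collapse it.
    normalized = " ".join(text.split())
    return datetime.strptime(normalized, CERT_TIME_FORMAT).replace(tzinfo=timezone.utc)


def is_expired(not_after: str, now: datetime) -> bool:
    """Return True if 'now' is later than the certificate's Not After time."""
    return now > parse_cert_time(not_after)


# Hypothetical check time: 11:00 on July 12, matching the log entry's timestamp.
check_time = datetime(2009, 7, 12, 11, 0, tzinfo=timezone.utc)
print(is_expired("Jul 12 08:03:52 2009 GMT", check_time))  # True
```

Running a comparison like this against a date some weeks in the future, rather than the current time, would flag a certificate before it lapses.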

Jul 12 14:17:32 <TimStarling>	!log reduced MaxClients on wikitech.wikimedia.org from 150 to 5
Jul 12 14:06:33 <domas>	!log srv1 certificate expired
Jul 12 11:31:58 <tomaszf>	!log rebooted wikitech due to out of memory
Jul 12 11:07:58 <tomaszf>	!log rebooting wikitech
Jul 12 08:41:30 <logmsgbot>	!log tstarling synchronized php-1.5/InitialiseSettings.php 
Jul 12 08:40:31 <logmsgbot>	!log tstarling synchronized php-1.5/includes/ImagePage.php 
Jul 12 08:40:15 <logmsgbot>	!log tstarling synchronized php-1.5/includes/DefaultSettings.php 
Jul 12 08:39:55 <TimStarling>	!log merging and deploying r53130, will disable archive thumbnails and see if it has an impact on ms1 load
Jul 12 00:31:07 <logmsgbot>	!log midom synchronized php-1.5/db.php 
Jul 11 22:17:15 <logmsgbot>	!log andrew synchronized php-1.5/InitialiseSettings.php 
Jul 11 22:15:46 <werdna>	!log Still very slow, going to disable CentralNotice again
Jul 11 22:07:30 <RoanKattouw>	!log wikitech.wikimedia.org is down
Jul 11 20:40:26 <logmsgbot>	!log tstarling synchronized php-1.5/InitialiseSettings.php  're-enabling CentralNotice'
Jul 11 19:32:06 <TimStarling>	!log killed waiting processes again
Jul 11 19:24:11 <TimStarling>	!log killed all processes in the rpc_wait state, to buy us some time
Jul 11 19:12:06 <mark>	!log Reverted cache_mem reduction on upload squids; the cause of memory pressure is a memleak
Jul 11 19:07:47 <TimStarling>	!log apaches took a while to restart due to some shell processes hanging on to listening *:80 filehandles while waiting for NFS, should be fixed now
Jul 11 19:03:02 <mark>	!log Restarting memory leaking frontend squids in upload pmtpa cluster
Jul 11 18:57:48 <TimStarling>	!log restarting apaches
Jul 11 18:56:15 <mark>	!log Reduced cache_mem from 3000 to 2000 MB on pmtpa upload cache squids

July 11

15:45 mark: Rebooting sq1
15:31 Tim: rebooting ms1
14:54 Tim: disabled CentralNotice temporarily
14:54 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'disabling CentralNotice'
14:53 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'disabling CentralAuth'
14:36 Tim: restarted webserver7 on ms1
14:22 Tim: some kind of overload, seems to be image related
10:09 logmsgbot: midom synchronized php-1.5/db.php 'db8 doing commons read load, full write though'
09:22 domas: restarted job queue with externallinks purging code, <3
09:22 domas: installed nrpe on db2 :)
09:22 logmsgbot: midom synchronized php-1.5/db.php 'giving db24 just negligible load for now'
08:38 logmsgbot: midom synchronized php-1.5/includes/parser/ParserOutput.php 'livemerging r53103:53105'
08:37 logmsgbot: midom synchronized php-1.5/includes/DefaultSettings.php

July 10

21:21 Fred: added ganglia to db20
19:58 logmsgbot: azafred synchronized php-1.5/CommonSettings.php 'removed border=0 from wgCopyrightIcon'
18:58 Fred: synched nagios config to reflect cleanup.
18:52 Fred: cleaned up the node_files for dsh and removed all decommissioned hosts.
18:36 mark: Added DNS entries for srv251-500
18:18 logmsgbot: fvassard synchronized php-1.5/mc-pmtpa.php 'Added a couple spare memcache hosts.'
18:16 RobH_DC: moved test to srv66 instead.
18:08 RobH_DC: turning srv210 into test.wikipedia.org
17:56 Andrew: Reactivating UsabilityInitiative globally, too.
17:55 Andrew: Scapping, back-out diff is in /home/andrew/usability-diff
17:43 Andrew: Apply r52926, r52930, and update Resources and EditToolbar/images
16:44 Fred: reinstalled and configured gmond on storage1.
15:08 Rob: upgraded blog and techblog to wordpress 2.8.1
13:58 logmsgbot: midom synchronized php-1.5/includes/api/ApiQueryCategoryMembers.php 'hello, fix\!'
12:40 Tim: prototype.wikimedia.org is in OOM death, nagios reports down 3 hours, still responsive on shell so I will try a light touch
11:07 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php 'more'
10:58 Tim: installed memcached on srv200-srv209
10:51 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php 'deployed the 11 available spares, will make some more'
10:48 Tim: mctest.php reports 17 servers down out of 78, most from the range that Rob decommissioned
10:37 Tim: installed memcached on srv120, srv121, srv122, srv123
10:32 Tim: found rogue server srv101, missing puppet configuration and so skipping syncs. Uninstalled apache on it.

July 9

23:56 RoanKattouw: Rebooted prototype around 16:30, got stuck around 15:30
21:43 Rob: srv35 (test.wikipedia.org) is not posting, i think its dead jim.
21:35 Rob: decommissioned srv55 and put srv35 in its place in C4, test.wikipedia.org should be back online shortly
20:04 Rob: removed decommissioned servers from node groups, getting error on syncing up nagios.
20:03 Rob: updated dns for new apache servers
19:54 Rob: decommissioned all old apaches in rack pmtpa b2
16:22 Tim: creating mhrwiki (bug 19515)
13:27 domas: db13 controller battery failed, s2 needs master switch eventually

July 8

15:48 domas: frontend.conf changes: fixed cache-control headers for /w/extensions/ assets, did some RE optimizations %)
13:31 logmsgbot: midom synchronized php-1.5/InitialiseSettings.php 'disabling usability initiative on all wikis, except test and usability. someone who enabled this and left at this state should be shot'

July 7

19:06 Fred: adjusted www.wikipedia.org apache conf file to remove a redirect-loop to www.wikibooks.org. (bug #19460)
17:34 Fred: found the cause of Ganglia issues: Puppet. Seems like the configuration of the master hosts gets reverted to being deaf automagically...
17:05 Fred: ganglia fixed. For some reason the master cluster nodes were set to Deaf mode... (ie the aggregator couldn't gather data from them).
15:02 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19470 Rollback on pt.wikipedia'
03:37 Fred: fixing ganglia. Expect disruption
00:27 tomaszf: starting six worker threads for xml snapshots
00:12 Fred: srv142 and srv55 will need manual power-cycle.
00:10 Fred: Rolling reboot has finally completed.

July 6

23:57 Fred: restarted ganglia since it is acting up...
23:54 tomaszf: restarting all xml snapshots due to kernel upgrades
18:49 Rob: upgraded spam detection plugins on blog and techblog
18:47 Fred: starting rolling reboot of servers in Apaches cluster.
17:53 tomaszf: cleaning out space on storage2. lowering retention for xml snapshots to 10
17:53 Fred: upgrading kernel on cluster. This will take a while!
17:46 Fred: rebooting srv220 to test kernel update.

July 3

12:51 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewEdit.php 'Re-activating abuse filter public logging in the logging table now that log_type and log_action have been expanded.'
11:45 mark: Kicked iris so it would boot
10:11 logmsgbot: andrew synchronized php-1.5/skins/common/htmlform.js 'IE7 fixes for new preference system'
05:51 Tim: restarted squid instances on sq28
05:47 Tim: restarted squid instances on sq2
05:46 Tim: started squid backend on sq10 and sq23, sq24, sq31, restarted frontend on most of those to reduce memory usage
05:35 Tim: restarted squid backend on sq16, was reporting "gateway timeout" apparently for all requests. Seemed to fix it. Will try that for a few more that nagios is complaining about.

July 2

21:38 Rob: sq24 won't accept ssh, depooling.
21:34 Rob: rebooting sq21
21:26 Rob: ran changes to push dns back to normal scenario
19:52 mark: Power outage at esams, moving traffic
19:44 Andrew: Knams down, Rob is looking into it
19:41 Andrew: Reports of problems from Europe
19:25 Andrew: running sync-common-all to deploy mobileRedirect.php to fix hcatlin's mobile redirect/cookie bug
19:22 logmsgbot: andrew synchronized live-1.5/mobileRedirect.php
17:15 mark: Rebooted srv159
16:13 Fred: shutting 217 back down as it is not supposed to be up due to faulty timer causing issues.
16:12 Fred: rebooted srv217. Was unpingable.
14:09 Andrew: Started sending updates of spam.log to Project Honeypot folks every 5 minutes, in my crontab on hume.
11:20 logmsgbot: andrew synchronized php-1.5/skins/common/shared.css 'Live-merging r52669, r52684 at rainman's request, search fixes.'
11:18 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialSearch.php 'Live-merging r52669, r52684 at rainman's request, search fixes.'
00:03 logmsgbot: brion synchronized php-1.5/CommonSettings.php
00:02 logmsgbot: brion synchronized php-1.5/extensions/MWSearch/MWSearch_body.php 'de-merge broken r52664'

July 1

23:40 brion: poking in tweaks to search and updates to vector
23:22 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'bump wgStyleVersion'
23:21 logmsgbot: brion synchronized php-1.5/skins/vector/main-rtl.css
23:21 logmsgbot: brion synchronized php-1.5/skins/vector/main-ltr.css
23:10 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'set vector skin, new toolbar on for usability wiki'
23:07 mark: Kicked pascal
23:05 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.php 'bumping the js ver no'
23:01 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.js
22:59 logmsgbot: brion synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php 'to 52659'
22:57 logmsgbot: brion synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.i18n.php
22:44 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'enabling UsabilityInitiative (for optional EditToolbar)'
22:43 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'disabling EditWarning pending addl talk'
22:40 brion-codereview: updating UsabilityInitiative ext to r52657 in prep for enabling new toolbar option
22:10 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'Enabling new search UI formatting sitewide'
22:02 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'fixing the RTL disable for vector'
21:58 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'Vector should now be available in prefs for non-RTL sites'
21:57 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'vector config tweak'
21:42 brion-codereview: updating Vector to current
19:38 logmsgbot: midom synchronized php-1.5/db.php
16:13 Fred: bayes is running out of memory on a regular basis. Enabled process accounting / sar to gather more data.
15:48 Fred: rebooting Bayes as it locked up again.
11:48 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'trying a lower value for $wgMaxMsgCacheEntrySize'
11:19 domas: cleaned up srv100
11:18 domas: noticed that imagemagick tempfiles are currently created in /u/l/a/c-l/p/ :)
09:24 domas: pinned mysqlds on half of cores on 8-core boxes: for i in {11..30}; do ssh db$i 'taskset -pc 0-15:2 $(pidof mysqld)' ; done
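
The 09:24 entry passes taskset the CPU list `0-15:2`, where the `:2` suffix is a stride, so every second CPU in the range 0 to 15 is selected. The sketch below expands such a list to show which CPUs the mask covers. It is an illustration only: `expand_cpu_list` is a hypothetical helper and handles just the comma, range, and stride forms.

```python
def expand_cpu_list(spec: str) -> list[int]:
    """Expand a taskset-style CPU list such as "0-15:2" or "0,2,4-6"."""
    cpus: set[int] = set()
    for part in spec.split(","):
        # Each part is "N", "N-M", or "N-M:S" (range with stride S).
        cpu_range, _, stride = part.partition(":")
        first, _, last = cpu_range.partition("-")
        start = int(first)
        end = int(last) if last else start
        step = int(stride) if stride else 1
        cpus.update(range(start, end + 1, step))
    return sorted(cpus)


print(expand_cpu_list("0-15:2"))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The result is eight CPU numbers out of sixteen, which fits the entry's description of pinning mysqld to half of the available cores.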
