Server Admin Log/Archive 20

From Wikitech

Revision as of 01:47, 21 February 2009

February 21

  • 01:47 domas: I FOUND HOW TO REVIVE APACHES
  • 01:46 brion: think i killed em, now trying to restart apache procs
  • 01:43 brion: poking to see if we can restart apaches...
  • 01:42 brion: syncing fixed InitialiseSettings/CommonSettings to apaches
  • 01:14 brion: and flyingparchment
  • 01:14 brion: domas and mark are attempting to restart the NFS server, but aren't mentioning any details in the public channel or log
  • 00:52 domas: http://p.defau.lt/?_M1iGbA0PCz2OOt2_KKPug
  • 00:52 mark: db20 in trouble
  • 00:39 mark: @brion you don't need to wake up
  • 00:36 domas: disabled 2006 fundraising cronjob on amane :-)

February 20

  • 23:31 Rob: upgraded squid and kernel on sq34-sq36
  • 23:12 Rob: upgraded kernel and squid on sq31-sq33, redeployed and online
  • 23:08 brion: updating CentralNotice for improved test script (plus i18n update)
  • 22:54 Rob: upgraded kernel + squid on sq28-sq30
  • 22:29 Rob: completed upgrades to sq25-sq27
  • 22:12 Rob: upgrading kernel and squid versions on sq25-sq27 (if i crash the site, i apologize in advance)
  • 22:08 Rob: upgraded kernel and squid on sq24
  • 21:59 river: added current patches to ms4, set zil_disable=1 and rebooted
  • 21:30 brion: srv31 seems to be down, so no dump activity
  • 21:08 brion: scapping to update FlaggedRevs to r47588 (fixing fatal err)
  • 21:01 Rob: updated kernel and squid on sq23
  • 20:58 Rob: updated kernel and squid on sq22
  • 20:36 Rob: updated kernel and squid on sq20 and sq21
  • 20:25 domas: some apaches in crashloop like this: http://p.defau.lt/?s9YhHD_0qHroVhauBdQb_g
  • 20:09 Rob: restarted apache on srv74
  • 20:03 Rob: upgraded kernel and squid on sq19
  • 19:50 Rob: upgraded kernel + squid on sq18
  • 19:34 Rob: upgraded kernel + squid on sq17
  • 19:19 brion: updating FlaggedRevs to r47574
  • 18:16 river: set zil_disable on ms1 to improve nfs write performance (see the sketch below this list)
  • 18:15 mark: Raised max-conns to 50
  • 18:03 mark: Cut down max conns even more (25) for pmtpa upload backend squids
  • 17:40 mark: Limited maximum connections to backend (ms1) to 50 per squid on upload squids, 1000 per squid on text
  • 16:17 domas: plenty of fedoras had futex deadlocks
  • 16:16 Rob: upgraded kernel and squid on sq14 and sq15
  • 15:49 Rob: updated squid and kernel on sq13, rebooted, back online
  • 15:26 Rob: upgraded squid and kernel on sq9-sq12 (not all at the same time)
  • 14:59 Rob: upgraded squid and kernel on sq5, sq6, sq7, sq8
  • 14:51 Rob: upgraded squid and kernel on sq2-sq4
  • 14:50 Tim: updated ContactPage extension, will deploy it on nlwiki shortly
  • 10:52 mark: Reduced cache_mem from 3000 to 2500 for pmtpa upload backend squids - no restart, will take effect with the 2.7 upgrade later today
  • 10:00 mark: Started backend squid on sq26, it was gone
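
The zil_disable entries above (18:16 and 21:59) turn off the ZFS intent log to cut synchronous NFS write latency on the Solaris media servers. A minimal sketch of how such a tunable is typically applied; whether it was set live or via /etc/system here is an assumption:

 # persistent: add the tunable to /etc/system, takes effect on the next reboot
 echo 'set zfs:zil_disable = 1' >> /etc/system
 # or poke the running kernel (immediate, but lost on reboot)
 echo 'zil_disable/W 1' | mdb -kw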

February 19

  • 23:54 brion: updating AbuseFilter to r47523 :P
  • 23:51 brion: updating AbuseFilter to r47522
  • 23:40 brion: updating FlaggedRevs to r47522
  • 23:39 Andrew: Enabled Abuse Filter on MediaWiki.org
  • 23:17 mark: Stopped experimental varnish on sq1, please keep Squid off as well
  • 22:52 Andrew: Allowed bureaucrats to remove 'sysop' right on testwiki.
  • 22:42 brion: updating includes/api to r47522 to fix a couple regressions
  • 22:15 mark: Started an experimental varnish instance on sq1 port 80
  • 21:22 mark: Stopped Squids on sq1
  • 14:23 Tim: removing memcached from srv154,srv155,srv157,srv158,srv169,srv170
  • 14:18 Tim: started memcached on srv190-199
  • 14:06 mark: Added "vport=80" to the http_host directive on all backend squids, to force Squid to use the default HTTP port, 80
  • 10:53 domas: livemerged r47483 (backlinks cache read explicit order, :( )
  • 07:56 Tim: restarted job runners with 4 processes per server instead of 1. Db2 is now heavily loaded, apparently due to the SELECT queries involved in the large numbers of unnecessary refreshLinks2 jobs that were queued before r47478 went live. But they should be done in a few hours at this rate.
  • 05:00 Brion: enabling Collection on fr, pl, nl, pt, es, simple Wikipedias
  • 02:12 Tim: deploying r47478
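
Squid 2.x accepts vport as an option on its http_port line, so the 14:06 change above presumably boils down to a backend listener line along these lines (the port number and the accel/vhost options are assumptions; only vport=80 comes from the log entry):

 # backend squid.conf listener -- hypothetical except for vport=80
 http_port 3128 accel vhost vport=80

followed by a squid -k reconfigure (or restart) on each backend.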

February 18

  • 22:41 Andrew: morebots back up, now logs to identi.ca with the name wikimediatech
  • 22:38 tomaszf: installed srv208 with Ubuntu 8.10.1 and installed app server software.
  • 22:12 domas: Andrew killed morebots. let's see how he fixes it... :)
  • 21:59 Rob: PDF creation moved to pdf1
  • 21:58 Rob: changed pdf generation from erzurumi to pdf1, testing.
  • 19:21 Rob: srv255 changed to pdf1 and moved, drac setup along with dns resolution
  • 19:19 brion: scapping
  • 19:18 brion: svn up'ing test to r47457
  • 18:37 Rob: reinstalling srv209 due to dhcp misconfiguration making it think it was srv208
  • 15:13 mark: Restarted all upload frontend squids to get rid of the memleaking
  • 14:20 mark: Blocked all non-GET/HEAD HTTP methods in requests to upload frontend squids (see the sketch below this list)
  • 12:46 Tim: put r47447 live for temporary proposed fix of bug 17552
  • 08:38 Tim: svn up r47434 to fix Special:BrokenRedirects
  • 08:04 Tim: cleaned up binlogs on db2
  • 06:33 brion: note there's a live hack in api categorymembers query which may be breaking lookups
  • 05:54 Tim: set up bugzilla attachment_base, pointing to the new domain http://bug-attachment.wikimedia.org/, and set allow_attachment_display=on
  • 05:51 brion: disabling $wgTorTagChanges in CommonSettings after the ext gets loaded (needs fix for testwiki)
  • 05:46 brion: syncing reverted expr.php w/o bc stuff
  • 05:25 brion: syncing extensions/FlaggedRevs/specialpages/OldReviewedPages_body.php fix
  • 05:24 brion: syncing fix to Expr.php for bcpow() error
  • 05:16 brion: syncing fix to extensions/ParserFunctions/Expr.php
  • 04:59 brion: starting scap process...
  • 04:52 brion: svn up'ing test to r47418
  • 04:45 brion: svn up'd test to 47417
  • 04:30 brion: removing editor, reviewer from add/remove for all users in test. that was an old test not needed anymore :D
  • 03:42 brion: rc tags tables created sitewide; should be safe to scap and check for final problems if we're brave
  • 03:35 brion: applying patch-change-tags to all wikis
  • 02:57 brion: ran patch-change_tag.sql on testwiki
  • 02:52 brion: full svn up'ing for test wiki
  • 02:06 brion: worked around breakage with pager base class incompat with latest codereview :P
  • 01:52 brion: svn up'ing CodeReview to aid in completing code review ;)
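
Blocking everything except GET/HEAD (14:20 entry above) is a short ACL in squid.conf; a minimal sketch, with a made-up acl name:

 acl safe_methods method GET HEAD
 http_access deny !safe_methods

applied with squid -k reconfigure on each upload frontend.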

February 17

  • 23:58 Rob: srv217-srv223 installed and online as apache servers. Updated dsh groups and nagios, as well as pybal
  • 23:24 Rob: installed OS on srv217-srv223, moving on to package installation.
  • 21:12 Rob: reinstalling srv209, which thought it was srv208. silly server. srv208 has not been installed, gave to tomasz to check against setup checklist.
  • 21:05 Rob: actually, srv209 installed as 208, bad dhcp entry. Fixing
  • 21:04 Rob: pulling srv208 and srv209 for quick reboots, their drac ips are wrong.
  • 21:04 Rob: racked srv217-223 (also racked srv224/225 but no power yet)
  • 18:30 brion: starting a batch run of update-special-pages-small just to ensure it actually works
  • 18:25 brion: fixed hardcoded /usr/local path for PHP and use of obsolete /etc/cluster in update-special-pages and update-special-pages-small; removing misleading log files (bugzilla:17534)
  • 03:19 Tim: removed live hack updating MW_DIFF_VERSION, changed on December 30 and the cache expiry is a week. Should not cause a significant amount of load.
  • 03:01 Tim: removed live hacks from extension/Cite, updated to r47350.
  • 01:49 Tim: deleting all enotif jobs from the job queue, there is still a huge backlog
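
Clearing a backlogged job type like the enotif jobs (01:49 entry) normally means deleting rows from MediaWiki's job table on each wiki's master; a hedged single-wiki sketch, with the host and database name as placeholders:

 mysql -h db2 enwiki -e "DELETE FROM job WHERE job_cmd = 'enotifNotify';"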

February 16

  • 16:46 mark: Did emergency rollback of squid 2.7.6 to squid 2.6.21 because of incompatible HTTP Host: header
  • 16:21 Rob: stopped upgrades, sq36 completed before stop
  • 16:17 Rob: performing upgrades to sq35-sq38 (not depooling in pybal, letting pybal handle that automatically)
  • 16:16 Rob: performed dist-upgrade on sq31-34
  • 15:35 Rob: depooled sq31-sq34 for upgrade
  • 08:12 Tim: patched in r47309, Article.php tweak
  • 05:00 Tim: made runJobs.php log to UDP instead of via stdout and NFS
  • 04:53 Tim: fixed incorrect host keys in /etc/ssh/ssh_known_hosts for srv38, srv39 and srv77 (see the sketch below this list)
  • 04:13 Tim: removing all refreshLinks2 jobs from the job queue, duplicate removal is broken so to clear the backlog it's better to just run maintenance/refreshLinks.php
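
For the 04:53 known-hosts fix, the usual recipe is to drop the stale entries and re-scan the hosts' current keys; a sketch, assuming the shared file was fixed in place rather than regenerated some other way:

 # remove the stale entries, then append freshly scanned keys
 for h in srv38 srv39 srv77; do
   ssh-keygen -R "$h" -f /etc/ssh/ssh_known_hosts
 done
 ssh-keyscan -t rsa srv38 srv39 srv77 >> /etc/ssh/ssh_known_hosts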

February 15

  • 21:59 mark: Experimentally blocked non GET/HEAD HTTP methods on sq3 frontend squid
  • 16:15 mark: Upgraded PyBal on lvs2 - others will follow
  • 13:11 domas: db23 has multiple MCEs for same dimm logged: http://p.defau.lt/?IarKD4gbFhe5RmaV0RB_Xg
  • 12:38 domas: in wikistats, placed files older than 10 days into ./archive/yyyy/mm/ - maybe it will make flack crash less :))
  • 11:56 mark: Doing Squid memleak searching on sq1 with valgrind, pooled with weight 1 in LVS
  • 03:09 Andrew: CentralNotice still not working properly, and when we tried to set it to testwiki-only, it never came up. Left it on testwiki only for the time being, until somebody who knows CentralNotice can take a look at it.
  • 02:21 Tim: fixed permissions on the rest of the logs in /home/wikipedia/logs/norotate (fixes centralnotice)

February 14

  • 19:19 Az1568_: re-enabled CentralNotice on testwiki to try and find the problem (we've had this before, but fixed it somehow...possibly with a regen? See November 16th log.)
  • 18:34 domas: filed a bug at https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/329489 - could use some Canonical escalation too
  • 18:26 domas: same affected srv47 - this is related to switching locking to fcntl() - this drives apparmor crazy
  • 17:47 domas: srv178 kernel memleaked few gigs. blame: apparmor
  • 14:34 domas: srv215 very much dead, doesn't show vitality signs even after serveractionhardreset
  • 14:28 domas: correction, srv208.mgmt is pointing to uninstalled box
  • 14:27 domas: DRAC serial on all new boxes is ttyS1 which is not in securetty (see the sketch below this list)
  • 14:24 domas: srv209.mgmt is actually srv208's SP, and srv208.mgmt is pointing to dead box
  • 14:15 domas: srv209,215 down?
  • 13:43 domas: installing php5-apc-3.0.19-1wm2 (no more futexes) on all ubuntu appservers.
  • 10:02 Andrew: Reports that CentralNotice broke on all wikis, displaying just the message name in angle brackets, even though the message existed on meta. I have no idea what caused it and I couldn't find anybody who knows anything about it, so I disabled the notice itself on Special:CentralNotice on meta. Somebody who knows what they're doing should probably look into it later.
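
The 14:27 note means root logins on the DRAC serial console get refused because ttyS1 is not listed in /etc/securetty; the usual fix is simply to add it, e.g.:

 # allow root login on the second serial port (DRAC serial-over-LAN console)
 echo ttyS1 >> /etc/securetty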

February 13

  • 22:10 mark: esams squid upgrade complete
  • 21:05 RobH: deployed srv207-srv216 in apaches cluster
  • 20:34 RobH: added new servers to nagios and restarted it
  • 20:15 RobH: setup all node groups, ganglia, apache, so on for srv199-srv206 and added into rotation
  • 19:38 mark: Upgrading esams squids to 2.7.6
  • 18:36 mark: Upgraded squid on sq1 to 2.7.6 and rebooted the box
  • 18:03 mark: Memory leak issues on the upload frontend squids, which started in November
  • 18:01 RobH: sq13 back online, seems there is a memory leak, go mark for finding =]
  • 17:54 RobH: lomaria install done for domas
  • 17:49 RobH: rebooting sq13 due to it failing out in ganglia, OOM error evident.
  • 17:48 RobH: reinstalling lomaria per domas request
  • 17:37 RobH: sq8 was unresponsive to console, locked up, rebooted, cleaned cache, and bringing back online
  • 17:34 RobH: srv38 and srv39 back in rotation
  • 17:23 RobH: srv38 and srv39 reinstalled, installing packages now
  • 16:57 RobH: reinstalling srv38/srv39
  • 16:57 RobH: srv80 reinstalled as ubuntu apache and back in rotation
  • 16:31 RobH: srv79 back in rotation
  • 16:21 RobH: srv79 reinstalled, installing packages and ganglia
  • 16:12 RobH: reinstalling srv79
  • 16:00 RobH: ganglia installed on srv77, back in rotation
  • 15:55 RobH: srv77 redeployed as ubuntu apache server
  • 15:48 RobH: reinstalling srv77 to ubuntu

February 12

  • 23:59 brion: adding 'helppage' to ui-content messages on commons per bugzilla:5925
  • 23:01 RobH: racked and setup drac for srv208-srv216
  • 21:20 mark: Killed blocked apache processes on srv180, and restarted apache
  • 21:19 mark: Killed blocked apache processes on srv172, and restarted apache
  • 21:07 brion: fixed ownership on log files for updateSpecialPages cronjob, which likely is what broke it
  • 20:28 mark: Upgraded experimental squid 2.7.5 on knsq1 to squid 2.7.6
  • 20:00 brion: fixed typo which broke access to revision deletion log for oversighters. tx to aaron for the spot :D
  • 19:45 mark: Replaced "2 cpu apaches" group aggregator srv32 by srv35
  • 18:55 RobH: racked, wired, and remote management setup for srv199-srv207
  • 09:51 domas: added srv190-srv198 to apaches dsh group, as they seem to be alive and kicking
  • 09:48 domas: changed weights for srv190-srv198 80->100 (to account for 1.85->2.5 ghz cpu step )
  • 00:29 brion: running updateRestrictions on wikis to clean up remaining funky restrictions entries per bugzilla:16846
  • 00:22 Tim: restarted apache on srv172

February 11

  • 23:23 mark: Pooled srv190-198
  • 23:23 Tim: re-enabling search suggestions
  • 23:19 mark: Installed Ganglia on srv190-198
  • 23:17 mark: Installed MediaWiki application server packages on srv190-198
  • 23:02 mark: Added srv190-198 to mediawiki_installation node_group (not any others)
  • 22:55 mark: Ran dist-upgrade && reboot on srv190-198
  • 22:46 mark: OS installed on srv190-198
  • 22:19 RobH: racked and setup drac on srv195-srv198
  • 22:11 RobH: racked and setup drac on srv192, srv193, srv194
  • 22:00 RobH: racked and setup drac on srv190, srv191
  • 21:24 brion: putting ixia back in rotation, it's caught up
  • 20:05 brion: depooling ixia while it catches up
  • 20:05 brion: ixia lagged 8810 secs
  • 20:00 brion: ixia replication is broken -- causing contribs lag on itwiki (see the sketch below this list)
  • 19:19 RobH: setup msw-a5-sdtpa like 30 minutes ago, oops ;]
  • 19:00 mark: Added srv190-225 to DNS & DHCP
  • 18:55 mark: set up RANCID for asw-a4-sdtpa and asw-a5-sdtpa
  • 18:54 brion: disabled srv38,39,77,79,80 in lvs3 pybal config to ensure they don't go back into service accidentally until fixed up
  • 18:37 brion: stopping apache on those bad machines for the moment
  • 18:35 brion: srv38, 39, 77, 79, and 80 appear to have been prematurely put into apaches pool, running old version of PHP. need to be halted and upgraded
  • 17:26 domas: restarted apache on srv154 after the deadlock in apc
  • 16:04 Tim: disabled checkers.php hack, using mwsuggest.js hack instead
  • 15:52 Tim: emergency optimisation: disabled search suggest via checkers.php
  • 15:41 domas: srv159 restarted as proper apache, not -DSCALER
  • 09:02 domas: moved morebots to ~morebots@wikitech.wikimedia, startup line in rc.local :)
  • 07:05 Tim: running maintenance/fixBug17442.php
  • 06:56 Tim: restarted job runners
  • 04:31 Tim: upgraded bugzilla to 3.0.8 with cvs up, and copied in the docs directory from the 3.0.8 tarball
  • 03:31 Tim: gave myself an account on isidore, cleaned up some crap in /srv/org/wikimedia to /srv/org/wikimedia/backup
  • 02:58 Tim: apt-get upgrade on isidore
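
The ixia entries above (20:00-21:24) are standard replication-lag triage; checking a slave's lag and thread state looks roughly like this (the host name is from the log, the rest is generic MySQL):

 mysql -h ixia -e 'SHOW SLAVE STATUS\G' \
   | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'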

February 10

  • 23:47 mark: Moved upload esams LVS from mint to hawthorn
  • 23:41 mark: Installed a specially compiled LVS Feisty kernel on hawthorn (running Hardy) & rebooted
  • 22:33 RobH: updated mwlib on erzurumi per brion
  • 22:25 RobH: some resets and such on searchidx1 to get ssh working. system is very sluggish.
  • 19:28 brion: wikitech server crashed; CPU pegged and OOM. rob rebooted it, yay
  • 02:46 Tim: running maintenance/fixBug17300.php to create missing redirect table entries
  • 01:18 Tim: reverted PP caching patch
  • 01:14 Tim: re-enabled search suggestions

February 9

  • 23:13 domas: grunt session finished
  • 23:10 domas: brought up srv80 from hibernation and made it work.
  • 22:53 domas: added srv61 too
  • 22:23 domas: added srv144 and srv147 to duty, added ganglia stuff too
  • 22:01 domas: started appserver work on srv77,srv79
  • 21:54 domas: started srv35,38,49 as appservers, restarted deadlocked srv49 processes
  • 16:14 mark: Moved upload LVS back from hawthorn to mint - even an optimized 2.6.24 kernel is not fast enough to serve upload LVS
  • 16:03 Tim: disabled search suggest as an emergency optimisation measure
  • 16:02 mark: Rebooted hawthorn with an LVS optimized kernel, moved upload LVS back to it
  • 15:53 mark: Moved upload esams LVS back to mint
  • 15:37 mark: Moved upload.esams LVS from mint to hawthorn
  • 15:28 mark: Reinstalled server hawthorn with Hardy 8.04
  • 13:55 domas: fixed ganglia group for srv159 (it is scaler, not appserv)
  • 13:51 domas: brought srv182 up
  • 13:32 domas: repooled srv104 and srv105, after a few months of vacation
  • 13:20 domas: killed a few orphaned tidy processes that had been very, very busy since Feb 1
  • 13:13 domas: heeheee, extorted this: [15:11] <rainman-sr> so, srv77,79,80, rose, coronelli and maurus could be converted to apaches
  • 12:36 Tim: trying apc.localcache=1 on srv176 (see the sketch below this list)
  • 04:27 Tim: patching in r46936
  • 03:48 Tim: attempting to reproduce APC lock contention on srv188
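
apc.localcache (12:36 entry above) is a php.ini-level APC setting; enabling it on a single box would look something like this, with the ini path being an assumption:

 echo 'apc.localcache = 1' >> /etc/php5/conf.d/apc.ini
 apache2ctl graceful    # reload so the new APC setting takes effect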

February 8

  • 22:43 brion: may or may not have fixed that -- log file was unwritable. hard to test the command since 'su' bitches about the apache account not being allowed to log in on hume :P
  • 22:39 brion: investigating why centralnotice update is still broken. getting fatal php errors wtf?
  • 20:17 domas: we were hitting APC lock contention after some CPU peak. Dear Ops Team, please upgrade to APC with localcache support. :)))))

February 7

  • 22:49 domas: db17 came up, but it crashed with different symptoms than other boxes, and it was running 2.6.28.1 kernel. might be previous hardware problems resurfacing
  • 22:47 brion: chmod'ing centralnotice JS output on ms1 so batch processes running as 'apache' user can actually update them. hadn't been getting updated since february 5, leading to complaints when the swedes updated a translation on the steward banner
  • 21:23 domas: db17 down

February 6

  • 12:33 brion: stopped that process since it was taking a while and just saved it as an hourly cronjob. :) log to /opt/mwlib/var/log/cache-cleaning (see the sketch below this list)
  • 12:28 brion: running mw-serve cache cleanup for files older than 24h
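
The 12:28-12:33 entries describe an hourly cleanup of the mw-serve (Collection/PDF) render cache; a hedged sketch of such a cron job, where the cache directory is an assumption and the log path comes from the entry:

 # /etc/cron.d/mw-serve-cache-clean -- hypothetical
 0 * * * * root find /opt/mwlib/var/cache -type f -mmin +1440 -delete >> /opt/mwlib/var/log/cache-cleaning 2>&1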

February 5

  • 18:19 brion: put ulimit back with -v 1024000; that's better :D (see the sketch below this list)
  • 18:18 brion: removed the ulimit; was unable to reach server with it in place
  • 18:15 brion: hacked mw-serve to ulimit -v 102400 on erzurumi, see if this helps with the leaks for now
  • 16:56 domas: rebooted erzurumi, placed swap-watchdog ( http://p.defau.lt/?mELQFcwRSvYRYdiIR9pvKQ ) into rc.local
  • 16:03 mark: Added Qatar (634) to the list of esams countries
  • 01:27 Tim: migrated arzwiki upload directory from amane to ms1
  • 01:00 Tim: fixed arzwiki upload directory permissions
  • 00:56 Tim: moved most cron jobs from admin user cron tabs to /etc/cron.d on hume
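
The ulimit changes above (18:15-18:19) cap mw-serve's virtual memory so a leak can't take the whole box down; as a wrapper-script sketch, with the actual mw-serve invocation as a placeholder:

 #!/bin/sh
 # cap virtual memory at ~1 GB (value is in KiB) before starting the render server
 ulimit -v 1024000
 exec mw-serve    # placeholder for however mw-serve is really launched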

February 4

  • 22:33 tomaszf: Adding cron for torblock under tfinc@hume
  • 22:20 tomaszf: ran loadExitNodes() to update tor block list
  • 18:36 brion: running TorBlock/loadExitNodes.php
  • 17:25 brion: stripped BOM from en.planet config.ini; re-running.
  • 17:24 brion_: attempting to run planet update for en.planet manually..... there's a config error
  • 16:30 domas: stealing db27 for moar tests

February 3

  • 13:05 mark: Remote-hands replaced some cables, fuchsia is back up but idling
  • 06:57 Tim: doing some schema changes on the otrs database. Some fields should be blobs and are text instead, perhaps due to a previous 4.0 -> 5.0 MySQL upgrade
  • 01:48 Tim: added blob_tracking table to ukwikimedia
  • 01:42 Tim: repooled db3 and db4
  • 00:34 mark: Moved traffic back
  • 00:28 mark: Shutdown switchport of fuchsia in order to prevent it from interfering with mint (which took up text LVS as well as upload)
  • 00:20 mark: Moved European traffic to pmtpa - text LVS unreachable

February 2

  • 23:54 domas: took out db29 for some testing
  • 22:07 mark: Modified Exim configuration on williams to not discard spam-recognized messages but deliver them to OTRS with an X-OTRS-Queue: Junk header, as well as SpamAssassin headers
  • 21:35 brion: reverting change to Cite_body.php
  • 21:28 brion: caching for cite refs is known to cause problems with links randomly replacing with other links; likely strip marker problem. andrew is investigating
  • 19:31 domas: merged in Andrew's Cite cache to live site
  • 16:47 brion-sick: syncing update to Collection to do more efficient sidebar lookups
  • 16:18 brion-sick: large spike in text backend service times
  • 16:15 brion-sick: secure.wikimedia.org is returning 503 Service Temporarily Unavailable
  • 08:11 Tim: removing ancient static HTML dump from srv31
  • 08:05 Tim: removed cluster13 and cluster14 from db.php, will watch exception.log for attempted connections
  • 08:02 Tim: removed srv130 from LVS and the apaches node group, not accessible by ssh but still serving pages
  • 07:56 Tim: find /home/wikipedia/logs -size 0 -delete
  • 07:43 Tim: re-added db22 to s1 rotation, no explanation for its removal in server admin log
  • 06:39 Tim: dropped the otrs_test database
  • 06:38 Tim: moved the OTRS database from otrs_real back to otrs. Updated exim4 config on mchenry
  • 04:23 Tim: db10's relay log was corrupted, did a flush slave/change master
  • 01:10 Tim: started mysqld on db23, doing recovery
  • 00:59 Tim: rebooted db23
  • 00:56 Tim: db23 down, depooled
  • 00:05 Tim: adjusted innodb configuration on db10, restarted, starting replication

February 1

  • 23:40 Tim: OTRS recovery script done
  • 22:13 brion: updating rowikibooks logo bugzilla:17273 (note the log bot is down again)
  • 21:25 Tim: running script to copy deleted OTRS data from db10
  • 20:40 mark: Lily was overloaded due to the long downtime of mchenry, stalling all mailing list deliveries
  • 20:39 mark: Granted SELECT access to mchenry and williams for database otrs_real - they've been giving temp rejects for hours
  • 11:24 Tim: mysqld on db10 crashes when it tries to run the current replicated query. Probably needs a resync. Set --skip-slave-start
  • 10:05 Tim: updated OTRS DB name on mchenry
  • 09:53 Tim: reading in SQL backup
  • 09:33 Tim: moving the otrs database to otrs_real to allow easier binlog import
  • 03:52 Tim: done 1 and 2
  • 03:10 Tim: recovery plan is as follows: 1. re-enable r/w web access, 2. compile a list of deleted IDs from the binlogs (confirmed that this is possible; see the sketch below this list), 3. read in the pre-upgrade backup to a separate DB and execute binlogs to the appropriate point, 4. copy affected IDs from the backup to the live DB
  • 02:52 Tim: patched GenericAgent.pm to prevent ticket deletion
  • 02:27 Tim: it seems some admin inserted a GenericAgent job called "temp1" at 09:46 with the effect of deleting all tickets older than 30 days. The binlogs show a duplicate "Valid" key, with one row setting it to 0 and the next setting it to 1, so it's possible the user set valid=0 in the UI but due to a bug in OTRS, the job was considered valid. The job appears to have been run first at 09:46, probably from the web, then regularly at 10 minute intervals, most likely due to the cron job on bart which was not deactivated. I've now removed the relevant crontab and revoked bart's OTRS permissions.
  • 01:11 Tim: put an explanatory note on the OTRS login screen and deleted all sessions to send users there
  • 00:38 Tim: revoked write access from the otrs mysql user, to prevent any further damage. Making a copy of the binlogs. The plan is to do forensics first and then recovery second.
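
Step 2 of the recovery plan (03:10 entry above) pulls the deleted ticket IDs back out of the binary logs; roughly, with the binlog file glob and match pattern as assumptions:

 # decode the binlogs and extract the deletions issued by the rogue GenericAgent job
 mysqlbinlog db10-bin.[0-9]* | grep -i 'delete from ticket' > deleted-ticket-statements.sql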

January 31

  • 18:17 mark: Following reports of OTRS rapidly deleting old tickets/emails every ~ 10 minutes, I disabled (set to invalid) all GenericAgent jobs pending investigation
  • 15:43 mark: Set local_from_check = false in exim.conf on williams, to prevent Sender headers from being added (annoying for Outlook users)
  • 07:11 Tim: converting OTRS database to proper UTF-8 (instead of UTF-8 in latin1 fields) using ~/fix-schema.php
  • 01:30 brion: updating eswikibooks logo bugzilla:17078
  • 00:55 brion: setting mswikibooks logo bugzilla:17263
  • 00:53 brion: copied wikimedia favicon to blog.wikimedia.org bugzilla:17171
  • 00:51 domas: lomaria needs reinstall, db24 and db30 are live in s2 duty

January 30

  • 17:54 domas: *giggle*, booted up lomaria with SMP kernel
  • 17:43 domas: lomaria kernel detects just one CPU (out of four)
  • 17:26 domas: converted lomaria into dewiki-only server
  • 14:20 Tim: Done with OTRS for now. Some bugs remain, particularly the missing ticket list in AgentTicketCustomer. I'll probably have to downgrade to 2.3.x tomorrow.
  • 12:51 mark: Installed ganglia on williams
  • 11:50 mark: Letting OTRS mail through to williams on mchenry
  • 10:50 Tim: running upgrade of OTRS DB
  • 10:44 mark: Removed all OTRS test copies in the queue of williams
  • 10:42 mark: Deferring all OTRS mail on the queue of mchenry
  • 10:30 mark: Put in a quick hack to forward misrouted OTRS mails from williams to bart
  • 08:52 Tim: sent upgrade warning email to all OTRS agents
  • 06:56 Tim: RCT should be finished now, no more connections are expected on cluster13 or 14. Current connection counts: 123943575, 295618929.
  • 02:36 Tim: set up SSL on williams and switched ticket.wikimedia.org DNS to point to there
  • 02:21 brion: set up new SSL cert for ticket.wikimedia.org; tim's poking at installing it
  • 02:19 brion: updated password on tridge *cough*
  • 01:43 brion: syncing update to Drafts with IE 7 fix (r46571 and style ver update)
  • 00:16 brion: live-merging r46570 -- fixes to DB access in revisiondelete

January 29

  • 22:55 mark: Did s/knams/esams/ on the selective AAAA answer config of ns0/ns1/ns2.wikimedia.org
  • 22:47 mark: While messages are held in the queue on williams, use "mailq" to view the queue, and "exim -M <messageid>" to let an individual message through for testing
  • 22:44 mark: SpamAssassin training from the OTRS Junk queue not yet setup
  • 22:43 mark: Note: Exim on williams queries for mail addresses from the live OTRS database, not the test database
  • 22:42 mark: Completed OTRS mail setup on williams. wikitech documentation updated in OTRS and Mail. OTRS mail is still copied to williams, and then held on the queue.
  • 22:00 mark: Added db10 as secondary DB to query for Exim on mchenry
  • 21:59 mark: Granted SELECT privileges on otrs.system_address to exim@williams on db9/db10
  • 21:58 brion: enabling revision & log suppression for oversighters
  • 21:12 brion: live-merging r46429 change to Special:Contributions -- stub marking fix
  • 21:01 mark: Copying OTRS mail to williams, where it's automatically held in the queue without extra processing; useful for testing
  • 21:00 mark: Installed SpamAssassin on williams for OTRS, copied training data from bart
  • 20:14 recompressTracked.php finished
  • 19:18 brion: aborted old enwiki dump so a fresh one can start, since that old history will never finish on the old system
  • 19:17 brion: updated data dump scripts
  • 17:57 brion: disabled the 'mark patrolled' link for views without a specific rcid param; it's back when we actually ask for it, so actual rc/new pages patrol works again http://rafb.net/p/puGHC095.html
  • 17:54 brion: poking at patrol link live hack
  • 17:40 brion: erzurumi is rebooted and serving out PDFs again. need to implement some resource limits...
  • 17:35 brion: rebooting erzurumi via drac
  • 17:32 brion: i hate the drac shell
  • 17:24 brion: erzurumi appears to have been victim to a massive memory leak. seeing if we can reboot it
  • 17:17 brion: poking at mw-serve on erzurumi; not responding
  • 16:15 domas: livehacked out 'patrol' link on article views %)
  • 04:02 Tim: added DNS entry for OTRS test
  • 03:19 tomaszf: installed grosley
  • 01:31 Tim: fixed srv76 and the wikimedia-task-appserver package
  • 01:31 brion-busy: syncing r46513 -- fix for categoryfinder, update to fix for Collection
  • 01:14 brion-busy: updating Collection ext -- compat issue with changed category
  • 00:56 brion-busy: stopped apache on srv76 for the moment
  • 00:55 brion-busy: srv76 doesn't have upload5 mounted
  • 00:41 brion: live-hacking out a broken check in getDupeWarning() which broke uploading if you had a duplicate file
  • 00:34 mark: DOM readouts on br1-knams:
br1-knams#sh optic 1
 Port Temperature    Tx Power       Rx Power    Tx Bias Current Monitor
+----+-----------+--------------+--------------+---------------+-------+
  1/1   24.0078 C    000.7776 dBm                  84.360 mA    Disabled
  1/2   N/A            N/A            N/A            N/A            
  1/3   37.0000 C   -003.4582 dBm  -003.8111 dBm   58.470 mA    Disabled
  1/4   32.0234 C    000.4669 dBm                  71.928 mA    Disabled
  • 00:22 Tim: synced nagios config

January 28

  • 23:40 mark: s/knams/esams/ in DNS geobackend files
  • 23:25 mark: Deployed fix in /lib/lsb/init-functions on sanger, mchenry, williams and lily which caused (amongst others) Exim reloads (-HUP) to be turned into a kill -TERM (Debian bug #434756)
  • 23:15 mark: Set up basic mail system for OTRS on williams. Still incomplete and needs fine tuning and testing, spam checking is not yet implemented amongst other things.
  • 22:30 mark: Restarted Exim on sanger, disappeared mysteriously
  • 21:50 mark: Raised Dovecot max login process count from 128 to 1024
  • 21:04 brion: merging reupload fixes: r46479, r46483, r46487
  • 20:49 mark: Base OS install finished on williams.wikimedia.org
  • 20:02 brion: merging r46472 (FlaggedRevs autopromote fix), r46464-46476 (feed RTL style fix, re-upload disabled field fix)
  • 18:05 RobH: setup mail relay for wikimedia.cz for Danny and Co  ;]
  • 08:43 domas: s3 replication switched from db1-bin.325:437169827 to db11-bin.026:79 (see the sketch below this list)
  • 08:35 domas: s2 rep switched from ixia-bin.150:119337662 to db13-bin.004:79
  • 06:15 Tim: creating backup of db10 on storage2
  • 04:29 brion: svn up'ing and scapping to r46424 consistently
  • 04:22 brion: updating FlaggedRevs to r46422
  • 04:17 brion: merging r46419, r46421 -- search display fixlets
  • 03:51 brion: attempting scap again; tweaking DataCenter.ui.php since the scap syntax checks are whinging about the abstract static method o_O
  • 03:40 brion: scapping to r46413
  • 01:35 brion: svn up'ing to r46413 on test...
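
The 08:35 and 08:43 replication switches above move slaves onto a new master at an explicit binlog position; per slave that is essentially the following (using the s3 coordinates from the log, with the master host name abbreviated and the replication user/password omitted):

 mysql -e "STOP SLAVE;
           CHANGE MASTER TO MASTER_HOST='db11',
                            MASTER_LOG_FILE='db11-bin.026',
                            MASTER_LOG_POS=79;
           START SLAVE;"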

January 27

  • 19:28 brion: syncing updates to Collection
  • 19:04 brion: scapping update to AbuseFilter for test. updated its schema...
  • 18:44 brion: db16 lagged 2188s
  • 18:44 brion: restarting slave thread on db16. it got stopped with a lock wait timeout on a page_touched update (wtf?!)
  • 18:43 brion: slave stopped on db16
  • 17:41 mark: knsq1 Up and serving requests with squid 2.7.5
  • 17:25 mark: Trying squid 2.7.5 on knsq1 - might be unstable in the mean time
  • 17:22 mark: Reduced cache_mem on backend esams text squids from 3000 to 2500
  • 16:23 RobH: srv76 had a failed hdd, replaced, reinstalled, and bringing back into rotation
  • 16:18 RobH: srv146 was powered down (heat issue?), powered back up, synced and now in rotation.
  • 16:09 RobH: srv139 didn't have apache running, synced and started
  • 16:01 RobH: srv129 didn't have apache running, synced and started
  • 15:59 RobH: sq11 back online, cleaned
  • 15:40 RobH: srv126 back online. possible bad disk, if it crashes again, the disk needs replacement. (it went read only before, which seems to sometimes happen even when the disks are not bad.)
  • 15:25 RobH: srv76 wont boot up, reinstalling.
  • 15:12 RobH: srv130 coming back online, updated fstab, synced, putting it back in rotation.
  • 15:05 RobH: moved ts-array4 to its dedicated ports, now it's kate's problem ;]
  • 14:49 Tim: restarted recompressTracked.php
  • 14:33 Tim: henbane's disk has been full for 8 days due to donate-campaign.log, starting cleanup
  • 14:18 Tim: killed recompressTracked.php
  • 14:08 domas: removed unnecessary ms1 stat from CommonSettings.php. Recovery observed. ( diff )
  • 13:44 mark: CARP weight redistribution caused large load spike in upload backend request, causing ms1 overload, probably causing issues on apaches via NFS, etc etc...
  • 13:29 mark: Lowered CARP weight from 10 to 5 for sq1-10.wikimedia.org, from 15 to 10 for sq11-15
  • 08:20 Tim: depooled db3 and db4 to improve recompressTracked speed
  • 07:09 Tim: There was a bug in recompressTracked.php which caused the last batch of orphans for any given wiki to be skipped. Re-running recompressTracked.php to repair it.
  • 05:55 Tim: killed all job runners, changed the job-runners group to srv151-180, started job runners on those servers
  • 05:50 Tim: migrated job runner scripts to ubuntu and started job runners on srv110-119
  • 05:29 Tim: started job runner on srv89
  • 02:13 brion: updating extensions/AbuseFilter/Views/AbuseFilterViewList.php (mysql 4 compat issue)
  • 02:04 brion: installed release versions of mwlib on erzurumi and restarted. these should have updated localizations
  • 01:48 brion: turning AbuseFilter on on test.... having some mysql 4.0 compat issues. poking
  • 01:47 brion: srv31 seems very sad; slow/borked login?
  • 01:39 brion: scapping to update AbuseFilter to current
  • 01:27 brion: prepping testing of AbuseFilter on test.wikipedia
  • 00:46 brion: enabling Collection also for de.wikisource per frank's req passed on from community
  • 00:36 brion: adding NS_HELP to $wgCollectionArticleNamespaces
  • 00:12 brion: Collection extension being enabled on dewiki

January 26

  • 22:39 RobH: UK Chapter wiki setup per https://bugzilla.wikimedia.org/show_bug.cgi?id=16996
  • 22:18 RobH: pushed apache changes for uk chapter wiki
  • 22:13 RobH: updated dns for uk chapter wiki
  • 19:29 brion: going to update Collection to current trunk in prep for further activation today
  • 17:01 RobH: added support for the phone server to dns

January 25

  • 12:18 mark: Announcing routes to AS16265 again
  • 10:17 domas: our deadlocks are described in X4240 manuals. the fix is either disabling MSI or setting 'options forcedeth max_interrupt_work=15' in modprobe.conf. product notes
  • 09:31 domas: db17 live, with 2.6.28.1 kernel

January 24

January 23

  • 18:04 brion: putting load back on db3, it's up to date
  • 17:49 brion: taking some load off db3 until it catches up
  • 17:46 brion: also killed a WantedTemplatesPage::recache query which had been running for a day. that ain't sustainable. :P
  • 17:44 brion: domas restarted morebots a few minutes ago :D
  • 17:43 brion: syncing update to ApiQueryBacklinks.php with the USE INDEX that was added for this problem
  • 17:41 brion: killing some stray backlinks queries
  • 17:38 brion: ~1-hour lag on db3
  • morebots is broken/down? unable to edit

January 22

  • 00:10 brion: whitelisting .ott (OpenDocument templates) for private-wiki uploads

January 21

  • 20:25 RobH: some tinkering on http redirects, rollback
  • 17:51 RobH: setup https for wikitech
  • 17:23 RobH: setup wikitech to stream weekly backups to tridge
  • 10:29 domas: db28 powered down because of temperature reading over threshold (45C???)

January 20

  • 21:45 RobH: killed some runaway processes on db9 that were killing bugzilla
  • 21:44 brion: stuck long queries on bz again. got rob poking em
  • 20:31 brion: putting $wgEnotifUseJobQ back for now. change postdates some of the spikes i'm seeing, but it'll be easier to not have to consider it
  • 20:19 mark: Upgraded kernel to 2.6.24-22 on sq22
  • 19:57 brion: disabling $wgEnotifUseJobQ since the lag is ungodly
  • 17:58 JeLuF: db2 overloaded, error messages about unreachable DB server have been reported. Nearly all connections on DB2 are in status "Sleep"
  • 17:21 JeLuF: srv154 is reachable again, current load average is 25, no obvious CPU consuming processes visible
  • 17:10 JeLuF: srv154 went down. Replaced its memcached by srv144's memcached
  • 03:02 brion: syncing InitialiseSettings -- reenabling CentralNotice which we'd taken temporarily out during the upload breakage
  • 01:50 Tim: exim4 on lily died while I examined reports of breakage, restarted it

January 19

  • 21:28 mark: Distribution upgrade on lily complete
  • 21:27 mark: Letting mail through again on lily
  • 21:01 JeLuF: Bugzilla didn't work. Some long-running (>3h) requests were locking some tables. Killed all long running jobs.
  • 20:05 mark: Put mail delivery on hold on lily
  • 20:03 mark: Upgrading lily (Mailing list server) to Ubuntu 8.04 Hardy
  • 14:04 mark: Set a static ARP entry for 85.17.163.246 on csw1-esams to see if it helps with the inbound packet loss effects

January 18

  • 20:25 mark: Cut outbound announcements to AS16265 to counter the inbound packet loss on that link
  • 17:50 river: started copying ms1:/export/upload to ms4
  • 00:21 Tim: restarted apache on srv158,srv177,srv106,srv66,srv109,srv140,srv86,srv90,srv133,srv172
  • 00:19 Tim: cleaned up binlogs on db1

January 17

  • 12:43 mark: Shut down transit link to 16265 due to intermittent packet loss

January 16

  • 23:25 brion: activating Drafts extension on testwiki
  • 21:18 brion: updating english/default wikibooks logo bugzilla:17034
  • 19:50 brion: uncommented srv101 from apache nodelist
  • 19:41 mark: Fixed authentication on srv101, and mounted /mnt/upload5
  • 19:25 brion: srv101 is commented out of 'apaches' node group so didn't show up on my earlier sweep
  • 19:23 brion: poking around, srv101 at least is missing upload5 mount still

January 15

  • 21:16 brion: seems magically better now
  • 20:48 brion: ok webserver7 started
  • 20:43 brion: per mark's recommendation, retrying webserver7 now that we've reduced hit rate and are past peak...
  • 20:28 brion: bumping styles back to apaches
  • 20:25 brion: restarted w/ some old server config bits commented out
  • 20:24 brion: tom recompiled lighty w/ the solaris bug patch. may or may not be workin' better, but still not throwing a lot of reqs through. checking config...
  • 19:48 brion: trying webserver7 again to see if it's still doing the funk and if we can measure something useful
  • 19:47 brion: we're gonna poke around http://redmine.lighttpd.net/issues/show/673 but we're really not sure what the original problem was to begin with yet
  • 19:39 brion: turning lighty back on, gonna poke it some more
  • 19:31 brion: stopping lighty again. not sure what the hell is going on, but it seems not to respond to most requests
  • 19:27 brion: image scalers are still doing wayyy under what they're supposed to, but they are churning some stuff out. not overloaded that i can see...
  • 19:20 brion: seems to spawn its php-cgi's ok
  • 19:19 brion: trying to stop lighty to poke at fastcgi again
  • 19:15 brion: looks like ms1+lighty is successfully serving images, but failing to hit the scaling backends. possible fastcgi buggage
  • 19:12 brion: started lighty on ms1 a bit ago. not really sure if it's configured right
  • 19:00 brion: stopping it again. confirmed load spike still going on
  • 18:58 brion: restarting webserver on ms1, see what happens
  • 18:56 brion: apache load seems to have dropped back to normal
  • 18:48 brion: switching stylepath back to upload (should be cached), seeing if that affects apache load
  • 18:40 brion: switching $wgStylePath to apaches for the moment
  • 18:39 brion: load dropping on ms1; ping time stabilizing also
  • 18:38 RobH: sq14, sq15, sq16 back up and serving requests
  • 18:38 brion: trying stopping/starting webserver on ms1
  • 18:27 brion: nfs upload5 is not happy :(
  • 18:27 brion: some sort of issues w/ media fileserver, we think, perhaps pressure due to some upload squid cache clearing?
  • 18:23 RobH: sq14-sq16 offline, rebooting and cleaning cache
  • 18:16 RobH: sq2, sq4, and sq10 were unresponsive and down. Restarted, cleaned cache, and brought back online.
  • 04:32 Tim: increased squid max post size from 75MB to 110MB so that people can actually upload 100MB files as advertised in the media
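
The 04:32 post-size bump maps to Squid's request body limit; in stock Squid that is a single squid.conf directive (whether it was set exactly this way locally is an assumption):

 request_body_max_size 110 MB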

January 14

January 13

  • 23:32 Tim: fixed NRPE on db29
  • 22:56 Tim: cleaned up binlogs on db1 and ixia
  • 22:54 brion: poking WP alias on frwiki bugzilla:16887
  • 21:11 RobH: setup ganglia on erzurumi
  • 20:42 brion: setting all pdf generators to use the new server
  • 20:40 brion: testing pdf gen on erzurumi on testwiki
  • 20:35 RobH: setup erzurumi for dev testing
  • 20:35 RobH: some random updates on server roles to clean it up
  • 19:37 mark: Restored normal situation, with 14907 -> 43821 traffic downpreffed to HGTN to avoid peering network congestion
  • 18:40 mark: Retracted outbound announcement to all AMS-IX peers, 16265 and 13030 to force inbound via 1299
  • 18:25 mark: Undid any routing changes as they were not having the desired effect
  • 18:14 mark: Prepended 43821 twice on outgoing announcements to 16265 to make pmtpa-esams path via nycx less attractive
  • 11:38 Tim: reducing innodb_buffer_pool_size on db19, db21, db22, db29
  • 09:15 Tim: restarting mysqld on db23 again
  • 09:09 Tim: restarting mysqld on db18 again
  • 07:08 Tim: removed db23 from rotation, since I'm bringing it up soon and it will be lagged
  • 07:02 Tim: shutting down mysqld on db18 for further mem usage tweak
  • 06:53 Tim: fixed broken /etc/fstab on db23 via serial console
  • 06:42 Tim: restarting db23
  • 00:08 Tim: repooling db18, has caught up

January 12

  • 21:50 brion: testing a scap after touching MessagesWuu.php to see if that clears borked serialized bits
  • 21:22 RobH: erzurumi installed
  • 21:00 tomaszf: moved erzurumi to vlan 101 on asw-a4-sdtpa
  • 17:55 brion: temporarily stopped apache on srv78, srv118
  • 17:54 brion: srv78 doesn't have upload5 mounted
  • 17:54 brion: srv118 doesn't have upload5 mounted
  • 17:46 RobH: fixed some settings for flaggedrevs in https://bugzilla.wikimedia.org/show_bug.cgi?id=14648
  • 17:31 RobH: per brion commented out db18 in db.php cuz it's making other crap lag too much (bugzilla:16993)
  • 17:26 RobH: updated flaggedrevs.php for https://bugzilla.wikimedia.org/show_bug.cgi?id=16365
  • 17:23 RobH: updated apache config on yongle for wap => mobile forwarding oversight per https://bugzilla.wikimedia.org/show_bug.cgi?id=16692
  • 17:05 brion: db18 is backlogged 191k seconds. depooling it; complaints of hella lag
  • 15:32 Tim: restarted mysqld on db18 with reduced memory usage, repooled
  • 14:12 Tim: rebooting db18
  • 13:20 Tim: depooled db18 (is down)

January 10

  • 16:08 domas: rotated 300g sampled-1000.log ;-)
  • 07:09 river: applied current OS patches to ms2 and rebooted
  • 01:21 Tim: restarted apache on srv95,srv114,srv37,srv49
  • 01:19 Tim: cleaned up disk space on db1. Still looks suspiciously like the master...
  • 00:33 brion: redirecting old bylaws.pdf to wiki page bylaws on wikimediafoundation.org (foundation.conf update)
  • 00:13 brion: reconfigured exim on wikitech to hopefully actually send mail out. whether it reaches anything, we'll see
  • 00:12 tomaszf: turned off fundraising banners
  • 00:08 brion: installed a mail server on wikitech server, hopefully

January 9

January 8

  • 22:08 brion: putting db12 back in service, caught up
  • 21:42 RobH: changed the ip address for the management interfaces on sq31-sq50
  • 21:30 RobH: updated dns with the squids and srv management info for pmtpa
  • 21:16 brion: taking load off db12 while it updates
  • 21:15 brion: killing stuck query threads on db12 (lagged 13k seconds)
  • 20:23 RobH: updated dns removing a large number of decommissioned servers from records.
  • 20:08 RobH: pushed updates to dns for management ip allocations, changed management ips of search8-search12
  • 19:42 RobH: changed the management ip addresses of db5-db10 to fit into current ip scheme
  • 18:20 RobH: updated dns for the management name resolution of db11-db30
  • 18:11 RobH: ms5 has lom access enabled and is ready for testing. (Only one ethernet connection in lieu of the typical 3 on the thumper/thors)
  • 15:50 RobH: srv118 reinstalled
  • 15:46 RobH: srv136 is borked. Even after reinstall, it will run for a few minutes, then lock hard. Going to RMA it.
  • 15:38 RobH: reinstalled srv136 and srv118 cuz they were pissing me off (a valid reinstallation reason if there ever was one.)
  • 15:08 RobH: and srv118 back down, thing is borked.
  • 15:06 RobH: srv118 back online and serving requests.
  • 15:01 RobH: pushed db13 back into cluster, same with db14, from yesterdays work
  • 14:26 RobH: srv101 back online and in lvs
  • 14:15 RobH: reinstalled srv101, installing wikimedia-task-app packages now
  • 06:37 JeLuF: rebooted db18. Mysqld was stuck but couldn't be killed.
  • 04:08 Tim: migrated all locked wikis from $wgReadOnly(File) to permissions-based locking, so that stewards can edit the alternate project links, and so that various MediaWiki components don't break on page view
  • 03:57 river: set up ms3/ms4 with solaris 10 update 6

January 7

  • 22:50 RobH: db13 and db14 are replicating but not in the cluster (not sure if they are caught up)
  • 22:35 RobH: updated power strip information for ps1-a1-sdtpa and balanced load
  • 22:35 RobH: reseated mrj cable for csw1-sdtpa_1/13
  • 21:36 RobH: started up db13 and db14
  • 21:19 RobH: updating firmware on db13-db14
  • 21:14 RobH: shutdown db13 and db14 to fix lom lockup issue.
  • 20:52 RobH: depooled db13 and db14 in db.php to reboot them and fix the SP lockup issue.
  • 20:49 RobH: updating firmware on db16.
  • 20:43 RobH: started mysql back up on db15
  • 20:42 RobH: cold reset of db16 to resolve lom issue. will update firmware upon boot.
  • 20:39 RobH: swapped hostnames on ms3 and ms4, updated racktables and dns to reflect change
  • 20:24 brion: disabled wikidiff2 on wikitech since it's not installed, and this apparently is nicely broken
  • 20:21 RobH: db15 now responsive to lom and ready to be re-integrated into the cluster
  • 20:12 RobH: db15 cold reset fixes the LOM non-responsive issue. Upgrading its firmware to prevent future issues.
  • 20:06 brion: removed stray whitespace from wikitech config file which was breaking rss feeds
  • 19:22 mark: Possibility that esams LVS was overloaded, split over 2 boxes (fuchsia & mint)
  • 19:19 RobH: ms3 and ms4 are accessible via LOM and ready for setup/deployment
  • 19:05 RobH: updated dns for ms3-ms5, updated dns for management for all media servers.
  • 19:03 brion: touching MessagesZh.php and re-trying scap; may not have properly updated
  • 17:40 brion-plague: scapping -- merged r45507 zh specialpage alias fix to live. also r45499 (revert of Cite error thingy) seems to already have been merged
  • 13:58 Tim: ran updateAutoPromote.php on all flaggedRevs wikis
  • 13:41 Tim: scap
  • 13:21 Tim: repooled db3 and db4
  • 12:47 Tim: recompressTracked.php complete. Recompressed 628 GB of data to 30GB, a 21x reduction over per-revision compression.
  • 04:36 brion-codereview: svn up'ing testwiki to r45489

January 6

  • 16:01 mark: Changed 'knams' into 'esams' in DNS, kept a lot of old names in place
  • 15:26 Tim: cleaned up binlogs on db1
  • 13:09 mark: Did some Traffic Engineering on the Amsterdam network
  • 11:58 Tim: installed NRPE on new ES servers
  • 11:47 domas: added db29 to s3 duty
  • 11:32 Tim: locked clusters 18 and 19, updated nagios
  • 11:27 Tim: fixed lack of schema on srv161
  • 11:21 Tim: retired cluster18 from the write list, added cluster20 and cluster21
  • 11:15 Tim: cleaned up binlogs on srv105
  • 00:04 tomaszf: built out eiximenis with ubuntu-8.04 for mobile server

January 5

  • 20:47 brion: re-updating SpecialSearch.php and MWSearch.php for better fix of the XSS
  • 20:40 brion: updating SpecialSearch.php for XSS issue
  • 20:00 RobH: wikitech is moved to new host. Still needs HTTPS setup. Redirects from old host are in place.
  • 13:17 domas: setting up db24-db26 LVMs per http://p.defau.lt/?eAOimTjd9r_QvSDiIhHjng
  • 12:56 mark: Brought down BGP transit session to AS 1145 / Kennisnet
  • 12:29 domas: db16 had our special deadlock, didn't come up after reboot, SP not responding, needs datacenter activity
  • 12:07 domas: upgraded BIOS firmware on db29,db30 and accidentally on db19 (damn .29 ip :)
  • 11:47 domas: added 208.80.152.185 to noc.wikimedia.org vhost ServerAlias
  • 10:33 mark: Brought BGP session to AS 16265 back up
  • 00:04 Tim: cleaned up binlogs on ixia and db1

January 4

  • 17:08 mark: Restored traffic to esams
  • 16:38 mark: Moved route sourcing from br1-knams to csw1-esams
  • 15:55 mark: Moving esams traffic to pmtpa (scenario knams-down)

January 3

  • 23:57 mark: Restored AAAA record on upload.wikimedia.org
  • 12:04 domas: db17, db18 had OS/firmware updates, rebooted
  • 10:50 domas: db19 RAID complaining about temperature, check-raid/kswapd/mysqld deadlock. upgrading RAID firmware, rebooting, etc
  • 01:23 Tim: removed db3 and db4 from rotation again, to allow recompressTracked to go faster
  • 00:36 Tim: depooled db19, is down
  • 00:32 Tim: restarting recompressTracked with an extra wfWaitForSlaves()
  • 00:08 Tim: repooled db3 and db4

January 2

  • 22:35 Tim: depooled db3 and db4 temporarily
  • 21:56 Tim: killed recompressTracked for now, not waiting for slaves properly. db3 and db4 lagged.
  • 20:54 mark: Set db4 s1 load to 0, 4368s lagged
  • 00:42 Tim: restarting recompressTracked.php on hume

January 1

  • 20:34 brion: live-merging file delete fatal error fix from r45278
  • 19:47 brion: bumped meter image to 7
  • 01:59 brion: scapping!
  • 01:39 brion: svn up'ing test.wiki to r45274
  • 00:55 brion: svn up'ing on test.wikipedia

December 31

  • 18:40 brion: fixed old whygive.wikimedia.org blog by copying de-conflicted WordPress source files out of the active blog where we fixed it after the 2.7 upgrade

December 30

  • 23:02 RobH: is leaving on a jet plane, weeeeeeeee.. in 8 hours.
  • 23:01 RobH: all knams squids are now online.
  • 22:49 RobH: knsq23-26 back in rotation, 3 more to go.
  • 22:33 RobH: enabled knsq16-knsq22 in lvs, almost time to go back to hotel and die.
  • 22:22 brion: attempting to purge affected pages on dawiktionary, dawiki
  • 22:21 brion: taking dawiki, dawiktionary out of read-only because the rest of the fixes won't work until it's disabled :P
  • 22:14 brion: poking diff version in live DifferenceEngine.php to eliminate bogus cache entries for dawiki/dawiktionary
  • 22:11 RobH: stopping and clearing the cache on knsq16-knsq30.
  • 22:06 brion: trying it again, but this time with the right variable names
  • 22:02 brion: attempting to clear revision text loading cache entries for dawiktionary, dawiki
  • 21:47 brion: live-merging r45206 so bugzilla:16841 corrupted entries will be loaded properly on dawiki/dawiktionary. need to clear revision, diff, parser caches...
  • 21:15 brion: locking dawiki, dawiktionary ($wgReadOnly) pending encoding fix
  • 20:07 brion: killed recompressTracked.php processes on hume pending investigation of encoding breakage
  • 20:02 brion: commenting ariel out of pmtpa also
  • 19:58 brion: trying to clear no-longer-in-dns hosts from ALL node group
  • 19:57 brion: PLEASE SAY WHAT SERVER YOU'RE RUNNING BATCH PROCESSES ON IF THEY'RE NOT ON ZWINGER. thanks
  • 19:56 RobH: power disconnection for primary routing rack in esams. power restored, and totally was not robh's fault regardless of what lies mark may say to the contrary.
  • 19:54 brion: encoding issues reported with some old edits on dawiki. wondering if this is recompression-related?
  • 18:46 brion: added PMTPA nameserver back in mayflower's resolv.conf so DNS actually works on it until things are fixed
  • 17:42 brion: internal DNS for knams seems to be down (at least on mayflower), this is breaking at least SVN update notifications
  • 17:14 brion: updating logo for pmswiki bugzilla:16587
  • 13:29 Tim: starting recompressTracked.php on all wikis
  • 11:22 mark: Shutting down knsq16-30
  • 10:59 mark: In case of overload problems, please move traffic to pmtpa (scenario knams-down)
  • 10:54 mark: Depooled knsq16-30
  • 10:47 mark: Set DNS timeout on fuchsia (LVS) to 1s, PyBal timeout to 8s
  • 10:21 mark: Unracking pascal, mint, lily
  • 09:57 Tim: testing recompressTracked on huwiki
  • 09:38 mark: ts-array3/A --> yarrow/0
  • 09:23 TimStarling: testing recompressTracked on testwiki
  • 09:20 mark: hemlock/eth1 <--> clematis/eth1
  • 09:17 mark: ts-array2 -> zedler scsi B, ts-array1/0 -> zedler scsi A
  • 08:47 Tim: running FlaggedRevs/maintenance/clearCachedText.php on all FlaggedRevs wikis

December 29

  • 11:24 mark: Shutting down and unracking mayflower (subversion)
  • 11:21 mark: Temporarily disabled AAAA record upload.wikimedia.org for ipv6 participants
  • 11:19 mark: Unracked fuchsia
  • 11:16 mark: In case of overload problems, move traffic to pmtpa!
  • 11:11 mark: Moving all LVS to mint
  • 09:56 mark: Depooled knsq8-15
  • 09:56 mark: Unracked knsq1-7
  • 09:43 mark: Repooled knsq23-30, depooled knsq1-7
  • 09:23 mark: Depooled knsq23-30
  • 08:47 Tim: deleted some binlogs on srv108.
  • 04:50-05:32 Tim: set up external storage on the remaining 9 servers in srv151-186: srv160, srv161, srv162, srv172, srv173, srv174, srv184, srv185, srv186
  • 03:41 Tim: running orphanStats.php on all wikis
  • 03:26 Tim: restarted apache on srv33, srv146, srv169, srv172
  • 03:00 Tim: cleaned up binlogs on srv105

December 28

  • 21:33 brion: tweaked namespace robot policies for hewiki bugzilla:16247
  • 20:52 brion: tweaking it correctly this time
  • 20:50 brion: tweaking centralnotice loader path for secure.wm.o
  • 20:20ish brion: copied a couple image files for Bugzilla skin to local dir, since Firefox 3.1b whinges about loading images via http: from an https: page
  • 18:21 brion: we've been getting reports of difficulties reaching PMTPA via Level3
  • 18:03 brion: updating thwiki logo bugzilla:16008
  • 17:54 mark: csw1-esams racked and configured; link established with br1-knams
  • 12:14 mark: Moving equipment to EvoSwitch
  • 11:55 mark: Moved udpmcast from pascal to lily
  • 11:48 mark: sage stays at knams, to be racked into J-13 later
  • 11:44 mark: Unracking ragweed
  • 11:38 mark: Unracking hawthorn
  • 11:37 mark: Unracking sage
  • 11:37 mark: Unracked csw1-knams
  • 11:25 mark: Directed traffic back to knams
  • 10:52 mark: knams network should be back up
  • 09:05 mark: Moving knams traffic to pmtpa

December 27

  • 21:50 brion: removed stale sitemaps dirs for several private wikis

December 26

  • 00:50 Tim: started mysqld on db19, repooled
  • 00:44 Tim: got connection on db19 and assumed it was still broken, initiated shutdown
  • 00:44 domas: db19 had jfs/kswapd/etc deadlock, came up after reboot
  • 00:34 Tim: noticed db19 was down, depooled it.

December 25

  • 23:59 domas: restarted db19 with sysrq without telling anyone
  • 19:37 brion: adjusted subpage namespaces for arbcom_enwiki
  • 19:11 brion: disabled magic_quotes_gpc on yongle -- mobile.wikimedia.org gateway doesn't compensate for quoted input. :P
  • 19:09 brion: merry christmas!
  • 01:09 brion: re-running SVN metadata import for CodeReview to fix comment encoding (bugzilla:16640)

December 24

  • 21:55 brion: merging r45005 (restoring default font for Safari textarea)

December 23

  • 23:35 brion: svn up'd to r44990 (serialization updates broken by Setup.php change)
  • 23:28 brion: starting scap!
  • 23:24 brion: svn up'ing to r44989, prep for scap!
  • 22:41 brion: think i tweaked scap script to update skin files on upload.wikimedia.org ...hopefully :)
  • 22:09 brion-codereview: svn up'ing test.wikipedia.org to r44982 -- DO NOT SCAP UNTIL TESTED!
  • 02:38 Tim: cleaned up binlogs on db1, db2. Removed cluster19 from the write list, it's almost full.
  • 02:28 brion: clearing out bogus page_restrictions entries (bugzilla:16629)

December 22

  • 22:56 brion: updated timezone for huwikinews (bugzilla:14343)

December 21

  • 03:05 Tim: depooled db4 temporarily to speed up a long running trackBlobs query

December 20

  • 01:08 brion: starting a cleanupImages run on all wikis
  • 00:57 brion: set UI lang for mainpage on meta bugzilla:16701

December 19

  • 23:52 brion: removing MessageCache::get profiling hack, all done
  • 22:16 brion: adding profiling hack for MessageCache::get
  • 13:48 mark: Found knsq12 turned off, brought it back up
  • 12:17 mark: Unracking knsq15 to make room for the new router
  • 08:53 Tim: changed crontab on hume to run rebuildTemplates.php every 30 minutes instead of every 10 minutes, since it's taking about 30 minutes to finish each run (a shell sketch of the change is at the end of this day's list)
  • 07:42 Tim: started trackBlobs.php running on hume, for all wikis
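    • One way to make the 08:53 crontab change above from a shell; a sketch only, assuming the existing entry starts with a "*/10" minute field (not the command actually used):
      crontab -l | sed 's|^\*/10\(.*rebuildTemplates\.php.*\)|*/30\1|' | crontab -   # bump the interval from 10 to 30 minutes
      crontab -l | grep rebuildTemplates.php                                         # verify the new entry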

December 18

  • 23:16 brion: updating MessagesLij.php, MessagesMt.php -- namespace breakage
  • 21:53 brion: bugzilla:16597 spam regex update
  • 21:01 RobH: added wikitech subdomain for future setup/migration of wikitech mediawiki
  • 20:33 RobH: added commons to meta imports allowed per https://bugzilla.wikimedia.org/show_bug.cgi?id=16665
  • 14:50 RobH: pushed dns change to correct spence.mgmt.pmtpa.wmnet.
  • 03:09 TimStarling: killed long-running query on db9, 5762 seconds, plain select query probably with a read lock held by the thread, all read queries were waiting for the lock
  • 02:27 TimStarling: deleted binlogs on srv105 and srv108
  • 01:16 brion: briefly experimented with changing wgLogo on testwiki via Configure and it didn't explode. yay! setting it back to default and just letting it be. only stewards can edit config, and only wgLogo is configable atm.
  • 01:12 brion: testing Configure on testwiki only
  • 01:10 brion: created test Configure ext tables in 'wikiconfig' db
  • 00:49 brion: scapping for update of Configure extension prior to small-scale test deployment
  • 00:48 Danny_B: wikibugs-l stopped sending mails to the wikibugs-irc mailbox due to excessive bounces. re-enabling sending
  • 00:28 RobH: fixed part of the revert for lucene that i missed.
  • 00:24 RobH: reverted lucene.php changes from rainman's testing.

December 17

  • 23:18 RobH: more lucene changes
  • 22:36 brion: applied fix for Android browser on mobile gateway (also did the pl language setup recently)
  • 22:05 RobH: more lucene.php changes
  • 21:12 RobH: additions to lucene.php per rainman
  • 20:39 mark: Corrected LVS service IPs on search2, search10-12
  • 20:03 brion: hacked mw-serve init script on yongle into shape. will commit it in a bit and update docs
  • 19:38 brion: pdf server seems to have eaten all temp space on yongle. clearing...
  • 19:26 mark: Set up search2, search8-12
  • 18:57 RobH: pushing dns changes for new misc. servers management resolution
  • 18:30 RobH: updated lucene.php with rainman to do things that I really do not get but he knows about.
  • 16:28 RobH: new servers auth1, nfs2, streber and williams are racked, IPs allocated, DRAC working. No DHCP entries or OS installed yet.
  • 16:08 mark: restarted lighttpd on zwinger
  • 15:59 RobH: added williams to dns records, updated dns
  • 15:50 TimStarling: removed some binlogs on ixia
  • 01:17 brion: scapping a couple more fixes to r44698
  • 00:36 brion-codereview: srv126 is borked -- read-only filesystem
  • 00:23 brion-codereview: scapping to 44696
  • 00:15 brion-codereview: svn up'ing on test...

December 16

  • 23:09 brion-codereview: disabling FixedImage extension -- was used for old 2006 and 2007 fundraisers; images no longer exist and are not applicable to current fundraisers
  • 20:34 RobH: ariel is dead, will decommission later.
  • 20:29 RobH: ariel is fubar, rebooting and investigating.
  • 20:25 RobH: restarted services on sq13
  • 20:21 RobH: took down sq13 to clean its cache
  • 20:09 RobH: replaced bad /c0/p0 in amane
  • 19:45 RobH: setup drac access for nfs1, brewster, auth2, dobson, eiximenis, erzurumi, fenari, grosley, loudon, singer, & spence. The other 3 misc. servers will be setup later. OS not installed, just remote access setup and IP space allocated. (Not setup in DHCP yet.)
  • 18:47 brion: applying temporary resource limit lift on enwiki for an IP for workshop in SF
  • 17:40 RobH: updated dns for misc. servers project.
  • 01:08 brion: deploying r44643 update to CodeReview subversion proxy (swapped encoding protocol to avoid bugs in json_decode with some diffs)
  • 00:04 brion: running cleanupTitles.php in the background on all wikis...
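    • A sketch of what a "run it on all wikis" maintenance loop like this looks like, assuming a per-wiki invocation of the script and an all.dblist file listing the databases (both assumptions, not taken from this log; the log path is illustrative):
      for db in $(cat all.dblist); do
          php maintenance/cleanupTitles.php "$db" >> /home/wikipedia/logs/cleanupTitles.log 2>&1
      done &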

December 15

  • 23:20 brion: going to test fixes for FiveUpgrade.inc, which backs cleanupTitles.php, cleanupImages.php etc
  • 22:21 RobH: changed settings on metawiki to allow banned users to edit their talk pages per https://bugzilla.wikimedia.org/show_bug.cgi?id=16621
  • 21:25 brion: reenabling handheld skin setting, was turned off during overload emergencies on 11-17
  • 21:13 brion: rsyncd appears to be running on srv56. does anything else need to be done for index updates?
  • 20:10 brion: yongle hanging again, restarting apache
  • 18:58 RobH: started rsync daemon on srv56 per rainman
  • 18:35 RobH: setup new planet per https://bugzilla.wikimedia.org/show_bug.cgi?id=16511.
  • 01:39 brion-weekend: applying API deletion log fix from r44541 (bugzilla:16626)
  • 00:09 rainman-sr: rsyncd is not running on srv56, updates for wikis served by old indexer halted since Oct 7. Run rsync --daemon on srv56

December 14

  • 02:04 Platonides: Connections timing out

December 13

  • 02:04 brion: applied patch-rfb_ratings.sql to flaggedrevs wikis
  • 01:46 brion: did some debugging on RatingHistory graph generation with Aaron and got it working yay!

December 12

  • 22:47 brion: patched Bugzilla so we can exclude CC-only mails from wikibugs-l (bugzilla:15585)
  • 21:52 brion: scapping to r44509
  • 19:19 brion: put all the themes and plugins and patches back on wordpress for blog.wm.o. whee
  • 19:15 brion: restarted apache on isidore while fiddling with php error logging settings and blog started magically working again. sigh. going back to tweak its config back to normal
  • 18:04 brion: we managed to fix the svn update conflict on blog.wm.o (to wordpress 2.7) but it's still showing main page as blank
  • 17:42 mark: Telia connection / BGP session was up for 20 hours; problem seems resolved. Removed route filters
  • 00:29 brion: bumping to r44485 for more NS fixes for ms, ast
  • 00:12 brion: scapping bump to r44484, fixing a few issues w/ hu
  • 00:06 brion: updated wikibugs irc script to r44483, fixes issues w/ users w/o real name setting

December 11

  • 23:19 brion: shutting down srv118; bad config. missing upload5 mount, seems to have bogus authentication (local su to root fails with "Authentication service cannot retrieve authentication info")
  • 23:10 brion: restarted apache on 134, it's scary/corrupt
  • 22:55 brion: manually syncing updated skin files to upload.wm.o ...
  • 22:53 brion: scapping to r44474
  • 21:31 brion: don't sync yet; RC regression in r44033 being worked on
  • 19:41 brion-codereview: removed conflicting live profiling hack from AutoLoader.php. Put this stuff in SVN, huh guys?
  • 19:39 brion-codereview: applying flaggedrevs schema updates
  • 19:38 brion-codereview: starting svn up for testwiki
  • 13:41 mark: configured asw-a4-sdtpa and asw-a5-sdtpa, but no link
  • 10:41 mark: bart out of disk space, removed some old cruft (mailman)
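    • A sketch of the usual hunt for that kind of cruft when a box fills up, assuming only standard coreutils (the path is illustrative):
      df -h                                            # confirm which filesystem is actually full
      du -xsk /var/lib/mailman/* | sort -n | tail -20  # biggest offenders under the suspect tree, in KB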

December 10

  • 23:50 RobH: pulled srv76 due to two dead fans (yay for da bot)
  • 23:35 RobH: srv78 reinstalled and in apache pool
  • 22:57 RobH: srv78 kernel panic, old FC install, pulled for reinstall
  • 22:49 RobH: sq1, sq3, sq6 cache cleaned and back online serving requests.
  • 22:35 RobH: sq1, sq3, sq6 all unresponsive to console, flashing leds on kvm. rebooted.
  • 20:40 RobH: srv118 installation completed.
  • 20:00 RobH: reinstalled srv118 after replacing dead parts. installing packages now.
  • 19:48 RobH: started rebuild of storage1 /c1/p0 into array
  • 19:47 RobH: replaced disk /c1/p0 in storage1. /c1/p13 is now bad as well, placing rma for it.
  • 19:14 RobH: db13-db16 responsive to ssh.
  • 19:13 RobH: db15 rebooted.
  • 18:05 RobH: temp probes installed in a3-sdtpa

December 9

  • 18:46 RobH: fixed group names in add/remove groups per https://bugzilla.wikimedia.org/show_bug.cgi?id=16248
  • 18:42 RobH: updated some settings for no.wikimedia.org and pushed to cluster.
  • 15:23 RobH: backed up blog frontend/database and upgraded to 2.6.5 successfully
  • 14:21 RobH: updated InitialiseSettings for nowikimedia wiki
  • 06:47 Tim: srv146 did not have /mnt/upload5 mounted. Fixed.
  • 02:03 brion: dropped loading of obsolete RenderHash ext (bug 16114)

December 8

  • 23:30 RobH: updated enwiktionary group settings per https://bugzilla.wikimedia.org/show_bug.cgi?id=16248
  • 23:24 brion: updating Oversight for bug 16065
  • 22:44 RobH: no.wikimedia.org is now functioning per https://bugzilla.wikimedia.org/show_bug.cgi?id=15383
  • 22:35 RobH: made changes to InitialiseSettings.php for cswikisource per https://bugzilla.wikimedia.org/show_bug.cgi?id=16277
  • 21:37 RobH: authdns-update for no.wikimedia.org
  • 21:20 RobH: running sync-common-all for wikimedia norge (found the php error)
  • 21:01 RobH: it's all back up now.
  • 20:59 RobH: I stupidly crashed the site with a php typo, rolling back my changes since i was ignorant and did not php -l ;_;
  • 20:58 RobH: setup wikimedia norge wiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=15383
  • 19:23 brion: updating OggHandler for fix for bug 15920 (chopped oggs)
  • 15:57 mark: Set up mirroring of traffic of e7/2 to e7/14 for testing the fiber patch loop/optics
  • 13:16 Tim: added some IWF proxies to the trusted XFF list. These proxies are probably about 30% of the IWF traffic; the other 70% comes from proxies that pass through the XFF header without adding the client address.

December 5

  • 22:42 domas: srv47 is running scaler usr.sbin.apache2 aa profile in learning mode
  • 22:33 RobH: sq50 reinstalled and back in rotation
  • 22:25 RobH: finished setup on srv146, back in apache pool
  • 21:32 RobH: setting up packages on srv146
  • 21:32 RobH: reinstalling sq50
  • 21:27 brion: pointing SiteMatrix at local copy, not NFS master, of langlist file
  • 19:19 RobH: added sq48 and sq49 back into the pool. sq50 pending reinstallation.
  • 18:58 mark: depooled broken squids sq1 and sq3
  • 18:26 RobH: depooled sq48-sq50 for relocation
  • 18:17 RobH: added sq44-sq47 back into pybal, relocation complete.
  • 17:45 brion: sync-common-all to add w/test-headers.php
  • 17:28 RobH: shutting down sq44-sq47 for relocation.
  • 17:27 RobH: sq41 - sq43 back online.
  • 17:17 RobH: sq40 oddness, but it's back up now
  • 16:44 RobH: accidentally pulled power for sq38, oops!
  • 15:36 RobH: removed sq41 - sq43 from pybal to relocate from pmtpa to sdtpa
  • 15:34 domas: srv178 running usr.sbin.apache2 aa profile in complain mode
  • 15:34 RobH: removed sq40 from pybal to relocate from pmtpa to sdtpa

December 4

  • 22:50 domas: job runners are no longer blue on ganglia CPU graphs :(((((((
  • 22:45 domas: fc4 maintenance, reniced job runners to 20 (10 behind apaches), installed APC 3.0.19 (APC 3.0.13 seems to have hit severe lock contention/busylooping at overloads)
  • 22:04 RobH: re-enabled sq38 in pybal. all is well
  • 22:02 RobH: fired sq37-sq39 back up
  • 21:58 RobH: shutdown sq37-sq39, cuz I need to balance the power distribution a bit better.
  • 21:40 RobH: sq38 is trying to break my spirit, so i reinstalled it to show it who is boss (me!)
  • 21:02 RobH: setup asw-a4-sdtpa and asw-a5-sdtpa on scs-a1-sdtpa
  • 20:52 mark: Increased TCP buffers on srv88 (a Fedora), matching the Ubuntus - Fedora Apaches appear to get stuck/deadlocked on writes to Squids
  • 19:39 RobH: pulled sq38 back out, as it is giving me issues. need to fix the msw-a3-sdtpa before i can fix sq38.
  • 19:35 RobH: added sq38, sq39 back into pybal
  • 19:25 RobH: added sq36, sq37 back into pybal
  • 18:14 RobH: I need to stop forgetting about lunch and stop working through it, oh well.
  • 18:13 RobH: depooled sq36-sq39 for move from pmtpa to sdtpa.
  • 18:12 RobH: some tinkering with lvs4; the idle-connection timer issue was fixed by mark.
  • 17:46 RobH: racked sq21-sq35 in sdtpa-a3. added back to pybal.
  • 16:31 RobH: depooled sq31-sq35 from lvs4 to move from pmtpa to sdtpa
  • 15:15 RobH: reinstalled storage1 to ubuntu 8.04, left data partition intact and untouched.

December 3

  • 23:46 JeLuF: performing importImage.php imports to commons for Duesentrieb
  • 19:13 RobH: tested i/o on db17, issue where it pauses disk access is gone.
  • 19:02 mark: Shut down TeliaSonera (AS1299) BGP session; the link is flaky, resulting in unidirectional traffic only for most of the day
  • 19:02 RobH: replaced hardware in db17, reinstalled.
  • 18:58 mark: Prepared search10, search11 and search12 as search servers
  • 17:26 brion: investigating ploticus config breakage bugzilla:16085
  • 17:18 brion: ploticus seems to be missing from most new apaches
  • 17:12 RobH_DC: search10, search11, search12 racked and installed.
  • 14:29 RobH_DC: srv136 was unresponsive, rebooted, synced, back in rotation.

December 2

  • 23:57 Tim: added CNAME poke.wikimedia.org for SMS notification project
  • 23:33 brion: scapping to update ContributionReporting ext
  • 23:11 Tim: db7 wasn't deleting its relay logs for some reason, since August 21. Disk critical. Did a reset slave.
  • 20:03 brion: rebuilt public_reporting with fixed encoding
  • 19:53 brion: fudged charsets in triggers for donation db update, let's see if that helps
  • 12:11 Tim: started squid (backend instance) on sq40, stopped for 13 days for no apparent reason
  • 12:08 Tim: restarted apache on srv161, srv122, srv137, attempted on srv123 but it is waiting for dead NFS mount
  • 11:48: srv183 made a miraculous recovery
  • 11:44 Tim: took srv183 out of memcached rotation
  • 11:10-11:35: a spike in backend requests (as seen in lvs3 network) caused the application cluster to overload. Due to the extra threads, srv183 went into swap and died.
  • 10:50 Tim: purged binlogs on ixia and db1 (both critical)

December 1

  • 23:49 brion: sync-common-all'ing to add a wikispecies little icon for sul shared session login, since people keep asking for it :)
  • 20:31 RobH: synced and restarted apache on srv89
  • 19:33 RobH: manually setup apache-check for pybal on srv138, synced, enabled.
  • 19:29 RobH: manually setup the apache_check stuff for srv126 and pybal.
  • 17:19 RobH: synced and restarted apache on srv176 & srv176
  • 17:18 RobH: did the sync and restart thing for apache on srv162
  • 17:16 RobH: synced and restarted apache on srv145
  • 17:13 RobH: synced and restarted apache on srv121 and srv125
  • 17:00 RobH: apache wasn't working on srv102 and srv106, restarted them after syncing
  • 15:10 mark: Restarted stuck pdns_server on bayle, lots of stale selective_answer.py processes
  • 14:44 domas: restored Roma article on itwiki, had orphaned revision entries after deleting it, manually inserted page entry
  • 14:40 mark: Setup Telia transit at knams, but all inbound routes filtered
  • 14:35 RobH: removed images from plwiki flaggedrevs per request from Leinad

November 30

  • 12:14 mark: restarted flapping apache on srv119, looks like memory corruption going on

November 28

  • 18:58 brion-holiday: updating User-Agent blacklists to block 'WebCapture' download tool but not the Library of Congress's www.loc.gov/webcapture/ spider
  • 18:17 yksinaisyyteni: fixed broken upload/deletion/timeline on jawiki
  • 07:11 JeLuF: succeeded in umounting /mnt
  • 07:10 JeLuF: killed hanging cron entries on db22. updatedb.mlocate. Might be related to broken mount db16:/a -> /mnt
  • 07:05 JeLuF: killed lots of jobs running on db22, "SELECT /* ApiQueryBacklinks::run XX.XXX.XXX.X */ page_id,page_title,page_namespace,page_is_redirect" which were in status "copying to tmp table"
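    • A sketch of the usual recovery for jobs wedged on a dead NFS mount like the db16:/a -> /mnt one above, assuming the mount really is dead and nothing on it needs flushing:
      fuser -vm /mnt          # list the processes holding the mount (updatedb.mlocate, cron children, ...)
      fuser -k -9 -m /mnt     # kill the wedged jobs (blunt version; picking pids by hand also works)
      umount -l /mnt          # lazy unmount so the path goes away even if db16 never answers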

November 27

  • 13:10 mark: hungover, headache, lack of voice

November 26

  • 17:00 RobH: fixed flaggedrevs to work on ruwikiquote, due to my own mistake in earlier implementation, per https://bugzilla.wikimedia.org/show_bug.cgi?id=14863
  • 02:38 brion: updated Math.php to r43966 which both fixes 0-byte math PNGs and generates correct URLs *cough*
  • 02:36 brion: broke math temporarily woops
  • 02:29 brion: bumped Math.php to r43965 to hopefully clear out those 0-byte math images (bugzilla:16440)
  • 02:01 brion: updating CentralNotice to r43962 to fix sitenames again :P
  • 01:57 brion: poking centralNotice to r43961 for evil hacks to bump limits temporarily :D
  • 01:31 brion: updating CentralNotice to r43959

November 25

  • 19:25 brion: syncing update to CentralNotice
  • 18:28 RobH: root password changed across all servers. if you didn't get a copy and you should have one, talk to another tech team member.
  • 17:58 RobH: added bayes to allowed nfs connections to storage2, setup fstab for nfs mounts on bayes, revoked shell access for ezachte on storage2 (not needed for what he wanted)
  • 15:49 RobH: updated some points for huwiki flaggedrevs and removed an outdated user group per https://bugzilla.wikimedia.org/show_bug.cgi?id=15568
  • 15:38 RobH: gave erik zachte login rights to storage2
  • 15:16 RobH: updated dns for survey software
  • 01:35 brion: updating ContributionReporting ext
  • 01:06 brion: forcing a manual run of centralnotice batch update on hume
  • 01:04 brion: restarting memcached on srv64
  • 01:02 brion: memcache bad on srv64
  • 01:01 brion: notice texts borked on at least wikimedia, wiktionary

November 24

  • 22:45 brion: updated ContributionReporting for some silly bugs
  • 22:20 RobH: portal and portal_talk namespaces added to dvwiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=16403
  • 22:04 RobH: added two new namespaces to dewikinews per https://bugzilla.wikimedia.org/show_bug.cgi?id=16263
  • 21:29 RobH: removed a group and granted further permission customization for huwiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=15568
  • 21:09 RobH: pushed a bad flaggedrevs.php that rendered blank pages for all wikis with flaggedrevs enabled. fixed it, it's working properly now, oops ;]
  • 21:06 RobH: appended page and dossier namespaces into the frwikinews flagged revisions per https://bugzilla.wikimedia.org/show_bug.cgi?id=15346
  • 20:36 RobH: enabled flaggedrevs on ukwiktionary per https://bugzilla.wikimedia.org/show_bug.cgi?id=15335, and ran sync-common-all
  • 20:27 RobH: ran sync-common-all
  • 20:27 RobH: enabled flaggedrevs on dewiktionary
  • 20:07 mark: moved upload knams LVS to mint
  • 20:05 brion: mark is on the case -- LVS overload
  • 19:58 brion: seem to be getting heavy packet loss on some routes to knams
  • 19:47 RobH: changed nameservers for wikimedia.li to WMCH administered name servers.
  • 19:30 RobH: re-enabled arzwiki, cannot find the bugzilla entry.
  • 15:43 RobH: search2 reinstalled and ready for search setup and deployment

November 22

  • 18:28 yksinaisyyteni: srv108 (cluster19) disk full, removing old logs
  • 00:37 brion: bumped php.ini post/file upload limit to 100mb, we'll see how well uploads to that size actually work  :)
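    • The php.ini side of that bump is just two directives; a sketch, assuming both limits are meant to move together (values as logged above):
      # in php.ini:
      #   post_max_size = 100M
      #   upload_max_filesize = 100M
      php -r 'echo ini_get("post_max_size"), " / ", ini_get("upload_max_filesize"), "\n";'   # verify after the apache restart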

November 21

  • 23:11 brion: dropping the 'Wikipedia: a non-profit project' banner from rotation, as it's apparently not a winner
  • 22:56 brion: updated logo for cr.wikipedia (bugzilla:16417)
  • 18:34 brion: running updateAutoPromote on new flaggedrevs wikis (bugzilla:16415)

November 20

  • 01:00 brion: updating ContributionHistory
  • 00:34 brion: moving $wgStyleSheetPath back to upload.wikimedia.org

November 19

  • 22:47 brion: updating Tomas skin to r43752 for toc fix
  • 22:41 brion: scapping for ContributionReporting update to 43750 (localization bugs)
  • 22:40 brion: ran namespaceDupes --prefix=D on enwiki and dewiki -- some 'D:blah' pages conflicted with iw prefix 'd' for wiktionary
  • 15:53 brion: updated centralnotice templates with user-targeted lightweight collapsed notice (wish it was for everybody)
  • 01:38 brion: updating CentralNotice to r43697 for anon/user collapsed variants
  • 00:35 yksinaisyyteni: unmounted storage1:/export/upload on all hosts
  • 00:32 yksinaisyyteni: rebooted srv{114,184,166} to fix stuck nfs mount

November 18

  • 23:52 brion: enabling new search UI on testwiki
  • 21:35 brion: switching css/js back to text temporarily to reduce load on upload squids
  • 21:27 brion: request -- squid conf deploy script should do a config file dry-run before actually deploying
  • 21:26 brion: there's load on ms1...
  • 21:25 brion: started more... most... all? squids in squids_upload
  • 21:24 brion: restarted squid manually on 46
  • 21:17 brion: uploads still borked, we're investigating the squid config problem
  • 21:16 brion: rebuilding squid conf, was a little funky
  • 21:12 brion: updating squid config to send centralnotice to ms1 instead of storage1
  • 20:41 RobH: db24 reinstalled, awaiting domas to do the magic db stuff
  • 20:38 RobH: replaced disk /c0/p7 in amane and started rebuild
  • 20:34 RobH: replaced controller in search2, search2 requires reinstall
  • 20:34 RobH: replaced controller in db24, db24 reinstalling.
  • 20:03 mark: installed gmond on db9 and db10
  • 19:59 brion: scapping to update Collection for regression fix
  • 01:51 mark: Moved text LVS to temporary LVS host lvs4, with an optimized kernel
  • 01:48 brion: setting $wgStyleSheetPath to point at upload.wikimedia.org/skins for non-SSL hosts
  • 01:30 brion: disabling handheld stylesheet; one less thing to load, should have little impact
  • 01:15 brion: another crappy slow squid this time in pmtpa

November 17

November 16

  • 17:24 brion: notices are becoming unborked with new regen. should be done and recached within 10 minutes
  • 17:17 brion: srv120 memcached now functional according to test: 10.0.2.120:11000 set: 100 incr: 100 get: 100 time: 0.0809991359711
  • 17:16 brion: restarting memcached on srv120
  • 17:14 brion: srv120's memcached seems broken: 10.0.2.120:11000 set: 100 incr: 0 get: 0 time: 0.0769970417023
  • 17:05 brion: investigating centralnotice borkage on non-wikipedia sites

November 15

  • 01:03 brion: scapping to r43514 -- regression in CodeReview :)
  • 00:49 brion: enabled UDP->IRC logging for CentralAuth user creations, now that it works instead of crashing PHP
  • 00:45 brion: set up ariel on isidore for blog maint
  • 00:24 brion: starting scap from r42593 to r43512
  • 00:02 brion: preparing for general svn up && scap

November 14

  • 23:24 RobH: updated flaggedrevs: $wgFlaggedRevValues to 4 from 2 for enwikibooks, synced files out to cluster.
  • 23:11 RobH: FlaggedRevs deployed on enwikibooks.
  • 23:00 RobH: removed the crap for specific seoul servers in sync-common-all
  • 22:43 brion: tweaked flaggedrevs.php to have cleaner default behavior
  • 20:27 RobH: setup the backend stuff for arz wiki but not enabled yet.
  • 19:59 brion: yongle is back up! yay
  • 19:48 RobH: fixed authdns-update script, was not rsyncing over the langlist file
  • 19:47 brion: swapping codereview-proxy to isidore since yongle's still down
  • 18:01 brion: requesting reboot on yongle from PM support
  • 17:14 domas: yongle is hanging, apple dictionary searches staled
  • 16:12 RobH: upgraded installation of blog.wikimedia.org and whygive.wikimedia.org to newest stable versions.
  • 15:14 RobH: limesurvey.wikimedia.org online on isidore, initial users created and deployed.
  • 02:03 brion: pascal down again
  • 00:00 brion: syncing to update InputBox extension (note: renamed from inputbox)

November 13

  • 23:41 brion: scapping to update CodeReview
  • 20:26 brion: scapping updates to Collection and ContributionReporting exts
  • 17:33 brion: set up TrevorParscal with access to reporting database so he can grab updates to test with
  • 17:03 river: upgraded ms1 to solaris 10 update 6 + rebooted
  • 09:57 Tim: db10 sync worked just fine this time, it's now replicating all DBs
  • 08:27 Tim: db10 slave start potentially botched, going to re-read the dump and try again
  • 06:43 Tim: loading data into mysqld on db10
  • 06:35 Tim: copy finished, restored r/w on bugzilla
  • 05:43 Tim: copying data from db9 to db10 (a replication-restore sketch is at the end of this day's list) using: mysqldump -h db9 --master-data --single-transaction --all-databases | gzip --fast > db9-master-data-2008-11-13.sql.gz
  • 05:34 Tim: switching bugzilla into read-only mode for copy to db10. Queries will be denied by user permissions for all tables except logincookies.
  • 05:02 Tim: converting all tables in bugzilla to InnoDB except longdescs
  • 04:53 Tim: converting the MyISAM tables in otrs to InnoDB (the large ones are done already)
  • 04:49 Tim: converted donateblog and newsblog to innodb
  • 03:34 Tim: converted racktables DB to InnoDB
  • 01:59 atglenn: changed wireless network password
  • 01:43 Tim: doing lockless backup of db9 to db10. This will give us a fallback in case disaster strikes during the considerably more complex replication synchronised dump which will follow.
  • 00:45 brion: poked it again
  • 00:29 brion: updating for ContributionReporting
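    • A sketch of how the 05:43 db9 -> db10 copy above typically gets turned back into a running slave, assuming the --master-data header in the dump supplies the binlog coordinates and the replication grants already exist (user/password are illustrative placeholders):
      zcat db9-master-data-2008-11-13.sql.gz | mysql                                              # load on db10; the embedded CHANGE MASTER TO sets MASTER_LOG_FILE/POS
      mysql -e "CHANGE MASTER TO MASTER_HOST='db9', MASTER_USER='repl', MASTER_PASSWORD='...';"   # point db10 at db9
      mysql -e "START SLAVE; SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind'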

November 12

  • 23:38 brion: XHTML fixes for Collection made the broken 'Random book' link on en.wikibooks.org work again (it very inefficiently loads a giant page of links via JS, and needs it to be clean XML to parse it)
  • 23:16 brion: updated mw-serve
  • 22:48 brion: scapping for Collection ext updates
  • 20:10 brion: updated wgNoticeProject to wikimedia for incubator
  • 18:46 brion: added "uploader" group so we can bump known-good people into being able to upload without waiting for the autoconfirm heuristic
  • 03:14 river: didn't reboot ms1 as its lom is unreachable
  • 01:20 Tim: an error in the cron job on hume caused the r43398 bug to persist until this time, delivering incorrect language text in some site notices.
  • 01:08 Tim: Fixed those 50 servers with a couple of sed commands. Many of them were attempting to send data to larousse and zwinger. Tested srv125.
  • 00:56 Tim: srv125 was spewing PHP fatal errors without reporting them to the syslog on db20. Restarted it. A quick check (ddsh -cM -g apaches -- 'grep -q @syslog /etc/syslog.conf || echo help') suggests that there are 50 apache servers in the same situation.
  • 00:27 Tim: updated ExtensionDistributor configuration to account for amane -> ms1 storage move. (bug 16308)
  • 00:13 Tim: some language issues caused by r43398, reverted at 23:50 and resynced in fixed form at 00:12.

November 11

  • 23:47 Tim: restored FlaggedRevs stats job as per Batch jobs, removal was not documented.
  • 23:35 Tim: r43398 worked just fine, memory usage dropped from ~4GB to 90MB. Adding rebuildTemplates.php to my crontab on hume, removing it permanently from Brion's on zwinger.
  • 23:28 Tim: updated CentralNotice templates on hume (which has enough memory to do it, unlike zwinger)
  • 22:11 Tim: deleted some binlogs on db1. Remaining disk space is still only 48 GB with negligible InnoDB free space.
  • 16:20 RobH: search2 still down, drives will not detect reliably. Ticket with sun reopened.
  • 15:56 RobH: replaced backplane on search2, reinstalling.
  • 15:13 RobH: srv137 back online. apache and memcached back up.
  • 14:49 RobH: srv100 back online.
  • 10:44 river: removed centralnotice php from brion's crontab as it was breaking zwinger
    • Core dump suggests the memory usage may be dominated by the localisation cache. wfMsgExt() loads the localisation for the requested language, and all languages are requested. -- Tim 12:07, 11 November 2008 (UTC)
  • 01:19 brion: swapped Commons to use $wgNoticeProject 'wikimedia' rather than having separate 'Commons needs you' notices
  • 00:57 brion: swapped in fundraiser to all projects

November 10

  • 19:18 mark: Shutdown AMS-IX route server 1 session as it's been flapping for hours

November 9

  • 16:11 river: removed nfsfind cronjob on ms1

November 7

  • 22:52 brion_: tossing 2008_meter_2b notice into partial rotation on enwiki -- has reduced collapsed version
  • 22:49 brion_: adding "_collapsed" to banner source tracking for collapsed view
  • 22:27 brion: scapping updates to ContributionReporting and CentralNotice
  • 01:43 Tim: experimentally reading the civicrm database into db10 with --master-data=1
  • 01:19 brion: db9 temporarily (hopefully) messed up. tim's fiddling with it to put it back
  • 01:05 Tim: my.cnf on db10 had an error in it, replicate-wild-do-tables instead of replicate-wild-do-table. Fixed it. The OTRS snapshot is now hopelessly out of date anyway, so I might wipe the data directory and start again. The idea is to set it up to replicate civicrm first. It's 100% InnoDB so should be easy to copy.
  • 00:09 river: upgraded ms2 to solaris 10 update 6

November 6

  • 21:03 Tim: switched GIFs to use Bitmap_ClientOnly (client-side scaling)
  • 17:23 brion: restarting apache on srv47, seems mysteriously stuck
  • 17:15 brion: setting $wgMaxAnimatedGifArea to 1 to prevent animated thumbnailing of GIFs for now, see if that helps
  • 17:10 brion: river complaining of image scaler issues -- load spikes, depooling?
  • 02:35 mark: disabled BGP, now using lvs2 only
  • 02:25 mark: restarting lvs2 with new kernel
  • 01:52 due to switch issues, load balancing to lvs2/lvs4 stopped working. Mark restarted the BGP session which fixed it temporarily.
  • 01:42 Tim: restarting squids
  • 01:42 mark: Setup lvs4 as temp LVS support for lvs2, balancing the load
  • 01:07 brion: updated ContributionReporting to add paging links to ContributionHistory (might be a little funky w/ caching, we'll work it out :)
  • 00:45 Tim: progressively clearing /a on the remaining image scalers
  • 00:37 Tim: wiping /a on srv44
  • ~00:30 lvs2 went into overload and started losing packets. Upload squid slowly went down over the next half hour.
  • 00:00 brion: scapping for update to ContributionReporting

November 5

  • 23:38 brion: set yongle to restart apache every hour since it still seems to bork up and get stuck sometimes
  • 22:01 RobH: srv100 rebooted, was down.
  • 18:28 mark: tech team is procrastinating
  • 18:16 atglenn: added dhelps to office@wikimedia.org alias, redirected office@wikipedia.org to him also
  • 18:14 brion: disabling centralnotice on private wikis, we don't need to be told to donate to ourselves ;)
  • 18:03 brion: poking sitenotices off wikibooks, on *.wikipedia
  • 18:03 brion: set up ariel on mchenry for mail admin
  • 05:38 brion_: opera users may rejoice ;)
  • 05:38 brion_: tweaked storage1 lighttpd config so centralnotice.js is served with utf-8 charset
  • 05:17 brion_: for reference -- load spikes are page rendering on enwiki and dewiki mostly :)
  • 05:16 brion_: bumping enwiki notice to 100%
  • 05:06 Tim: killed various mysqld_safe processes which were using 100% CPU on ES servers
  • 04:50 brion_: fixed morebots -- bots now allowed to edit again at wikitech
  • 04:50 brion_: enabling enwiki notice at about 10% sampling
  • 03:27 brion_: squids are... i think.... looking better :D
  • ... brion: cleaned up movepage attack, restricted editing here for convenience
  • 02:47 brion_: seems happier after restart of front-end squids
  • 02:43 brion_: tim's doing hard restarts of more squids, we're kinda offline briefly
  • 02:34 brion_: disabling centralnotices on remaining sites just for good measure while we debug
  • 02:29 brion_: current status: the squids which borked are still kind of borked, but perhaps slightly better. mark is examining squid memory reports
  • 02:14 brion: tim's attempting to restart borked squids
  • 02:01 brion: disabling enwiki centralnotice while investigating hits dropoff

November 4

  • 21:36 Tim: added nagios monitoring of HTTP on image backends
  • 21:14 Tim: installed NRPE stuff on db19
  • 19:37 Tim: killed the broken NFS mount on db21:/mnt with umount -l. The processes that are waiting for it will probably hang until system restart
  • 18:33 brion_: enabling ja-wikipedia notice for testing :D
  • 18:32 Tim: installed nagios stuff on db21,db22,db23
  • 18:27 Tim: srv104 done, cluster18 re-added to the write list
  • 18:15 Tim: installed NRPE on srv159,srv171,srv183
  • 17:25 domas: bounced db16 after jfs deadlock
  • 17:24 brion: settin' centralnotice on wikibooks to test, should show up in a few minutes
  • 16:00 Tim: fixing max_rows on srv104
  • 15:41 Tim: switching cluster18 master from srv104 to srv105
  • 01:33 Tim: fixing max_rows on srv105 and srv106
  • 01:28 Tim: removed cluster17 from the write list, is full.

November 3

  • 23:28 Tim: installed xdiff and gmp on hume. Used a source install of libxdiff since it's not packaged, and pecl install for the pecl module. Used the stock libgmp, a source install from the debian sources for the PHP GMP module.
  • 22:05 brion: enabled extra file upload types for foundationwiki, since it's restricted-write-access
  • 21:42 Tim: initialising srv159/171/183 as cluster20.
  • 21:24 Tim: srv159 needs to be an ext store, and so will be moved from the disk-intensive image scaler role back to an ordinary apache.
  • 20:46 brion: Special:ContributionTracking form submission intermediary live on foundationwiki
  • 20:33 brion: scapping for ContributionTracking extension
  • 19:59 brion: enabled mp3 and aiff uploads for private wikis so jay can upload some radio PSAs for fundraiser
  • 19:46 brion: poking $wgSquidMaxage from 31 days to 1 hour on wikimediafoundation.org, since templates and funky page URLs may do funky things and not get purged (extra parameters)
  • 19:32 brion: note there's no notice up yet ;)
  • 19:31 brion: enabling centralnotice loader on all wikis
  • 11:00 domas: mount -o remount,nobarrier /a on db15, observed 20x more performance. I am an idiot. :)
  • 02:36 brion-away: got a test centralnotice notice running on test.wikipedia.org. rock on
  • 02:18 brion: set up every-10-minute cronjob on zwinger to regen the centralnotice template JS files
  • 02:10 brion: centralnotice .js file loader up on test and meta for poking at
  • 01:12 mark: level 3 blackholing of traffic disappeared, brought BGP sessions back up
  • 00:59 mark: shutdown BGP session to AS 30217, for blackholing of traffic behind it (L3?)
  • 00:58 brion: network problems at pmtpa
  • 00:44 brion: for fun, did some load-time optimization on wikitech. trimmed out unneeded user/site .js, consolidated several .js files, and enabled mod_deflate for .css/.js. ssl setup time still sucks, and it's still a 1.7GHz Celeron. :)
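    • A sketch of the mod_deflate part of that tune-up, assuming a Debian-style Apache 2 layout (the directives are stock mod_deflate; the conf path and MIME type list are illustrative):
      a2enmod deflate
      echo 'AddOutputFilterByType DEFLATE text/css application/x-javascript application/javascript' > /etc/apache2/conf.d/deflate-static.conf
      apache2ctl configtest && apache2ctl graceful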

November 2

  • 23:43 brion: added bot flag to domas's log bot so it doesn't get hit by the URL captcha
  • 23:29 domas: db19 jfs deadlocked: http://p.defau.lt/?hC8C7MTk9BdTKBEHFgcsqA
  • 23:28 brion: scapping for CentralNotice tweak update
  • 23:11 brion: setting up ContactFormFundraiser on wikimediafoundation.org for fundraiser templates
  • 22:52 brion: scapping for ContactPageFundraiser setup
  • 22:41 brion: poked spamregex update
  • 22:14 brion: added 403 block in checkers.php for 'speichern' GET parameter -- bug in a common dewiki user script allowing CSRF-type vandalism
  • 17:13 Tim: Unmounted /tmp, cleaned up /tmp. Deleting /a/tmp on all image scalers.
  • 16:48 Tim: set ImageMagick temporary directory to /a/magick-tmp. Will unbind the /tmp -> /a/tmp mount.
  • 15:06 river: added missing /mnt/upload5 mount on several apaches: srv37 srv61 srv76 srv69 srv63 srv118 srv132 srv135 srv133 srv138 srv136
  • 14:49 domas: a few missing .frm files on db18 were causing trouble, resynced them from db19, resumed replication
  • 13:02 river: copying en from storage1 to ms1
  • 10:49 domas: replaced XFS with JFS on db18, installed ganglia on db17-db30
  • 10:36 river: completed move of commons, now being served from ms1 (except archive/)

November 1

  • 22:48 brion: fixed ContributionReporting to force a utf8 connection, now loads names in right charset
  • 22:20 brion: fixed $wgNoticeInfrastructure setting; defaults must have changed at some point
  • 22:15 domas: installed wikimedia-mysql4 on db21-23, established s1,s2,s3 replication. we now have full database copy in sdtpa \o/
  • 20:53 brion: deploying CentralNotice editing system on meta, woo
  • 20:27 brion: scapping to update reporting and centralnotice bits internally
  • 19:38 brion: rescapping to make sure 159 is unbroken
  • 19:27 brion: svn up'ing on wikitech just for domas
  • 19:25 brion: srv159 is out of space
    • We need to clean out the damn temp files somehow, eh?
  • 19:20 brion: scapping to update ContributionReporting ext
  • 12:56 mark: uppreffed traffic from knams to pmtpa via 6908/2828, as existing peering path had slight packet loss
  • 11:25 Tim: enabled subpages in the main namespace by default for all Wikisource wikis. This appears to be a de facto standard and is used by all wikisources with an entry in wgNamespacesWithSubpages.
  • 07:55 Tim: disabled ParserDiffTest, obsolete
  • 07:06 mark: XO circuit back up:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now up
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <w005.z207088246.xo.cnc.net>, session is now up

October 31

  • 23:11 brion: set up some logs for fundraising banner campaign clicks for later mining
  • 17:44 brion: adding support for Tomas skin on wikimediafoundation.org for new fundraiser templates
  • 14:24 mark: XO circuit went down:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <207.88.246.5>, session is now down because <Port State Down>
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now down because <Port State Down>

October 30

  • 23:11 Tim: fixed disk space on srv159, db1, srv103
  • 19:03 brion: updated triggers for donation reporting database a few minutes ago
  • 18:14 RobH: moved ms1 from pmtpa:a4 to sdtpa:a1, it's back online.
  • 17:46 RobH: db26 OS installed and online
  • 17:28 brion: added a spam filter rule for private-l messages :)
  • 04:54 river: testing sun web server on ms1
  • 03:56 brion: updating squid conf to send upload /centralnotice to storage1 for testing
  • 03:53 brion: tweaked lighttpd config on storage1 for centralnotice static file testing, since amane's configuration is too crappy to support regexes needed to set headers on a directory
  • 02:59 brion: poking experimental expires options on amane for static centralnotice tests
  • 02:44 brion: brion broke lighttpd.conf briefly

October 29

  • 22:39 brion: enabling $wgCodeReviewENotif experimentally
  • 18:35 brion: disabled bitmap fonts in fontconfig on image scalers, seems to help with the "mad helvetica" problem
  • 18:02 RobH: db28 & db29 OS installed and online.
  • 17:59 brion: fixed some upload directory perms on foundationwiki
  • 17:12 RobH: db27 OS installed and online.
  • 16:54 RobH: db21 OS installed and online.
  • 16:38 RobH: db22, db23, db25, db30 were installed yesterday, forgot to admin log it, sorry ;/
  • 14:44 _mary_kate_: copying wikipedia/commons/thumb/4 from storage1 to ms1

October 28

  • 20:02 domas: re-enabled db16
  • 18:03 mark: Removed blackholes.securitysage.com from lily's spamassassin configuration
  • 17:52 domas: db16 fubar'ed by queries that built 100GB temporary tables, leading to jfs hangs, leading to unhappy kernel.
  • 15:23 RobH: updated dsh node group ALL, added backup of frontend data for bugzilla and blogs from isidore to tridge.
  • 12:33 rainman-sr: experimentally turning on "did you mean.." on search8,9 for enwiki
  • 10:44 mark: Reverted yesterday's search changes

October 27

  • 23:24 mark: Switched to lucenesearch 2.1 for all wikis
  • 23:06 mark: pooled search8 as the only search server in search pool 3
  • 22:25 mark: rainman-sr is making me do more ugly things to lucene.php
  • 22:22 mark: Pointed search for "all other wikis" hardcoded to search7 in lucene.php
  • 22:14 mark: Added zhwiki and plwiki to lucene search 2.1 pool 2

October 26

  • 15:43 mark: Set up OpenGear serial console server scs-a1-sdtpa
  • 13:37 mark: Set up iBGP between csw1-sdtpa and csw5-pmtpa (IPv4/IPv6)
  • 13:36 mark: Prepared csw1-sdtpa for production deployment (general configuration)
  • 09:56 domas: updated db18 firmware to 2.1.1 (September 2008)
  • 04:31 Tim: fixed the "service_ips" hostgroup in nagios
  • 03:03 Tim: hardware reboot of db18
  • 02:47 Tim: mysqld on db18 apparently hit a kernel bug. It was reported as a zombie but was still using 200% CPU in top. kswapd was simultaneously using 100% CPU. Did not respond to SIGKILL. The non-zombie parent, mysqld_safe, also did not respond to SIGKILL (wchan=flush_cpu_workqueue). Attempted a reboot with shutdown -r.
  • 02:47 brion: tweaked MaxClientsPerChild on yongle to see if that helps with the mysterious hangs i sometimes see where requests seem to get backed up; it's disrupting the CodeReview proxy as well as mobile & Mac Dictionary search

October 25

  • 20:46 brion: scapped to r42573
  • 08:17 Tim: svn up to 42536 for API overload fix. Re-enabling disabled query modules.
  • 05:55 Tim: svn up/scap to 42531 (for properly tested Interwiki.php fix).
  • 05:09 Tim: DB overload on many enwiki slave servers. Long running queries attributed to ApiQueryAllpages, ApiQueryBacklinks, ApiQueryCategoryMembers and ApiQueryLogEvents. Disabled those modules and killed related running threads.
  • 05:01 Tim: Interwiki links were broken due to a totally broken and untested getInterwikiCached() function. Live patch deployed at this time.
  • 04:33 Tim: Fixed svn conflicts in two files. Scap to r42524.
  • 04:20 Tim: disabled Drafts extension on test.wikipedia.org. Trevor, please contact me for code review.
  • 04:11 Tim: synced php-1.5 to srv35 and ran "make -B" in the serialized directory. Seems to have fixed test. Will scap.
  • 01:01 ariel: preemptively up mail quota to 7GB from 1GB for cbass, dmenard
  • 00:59 brion: testwiki is borked until we figure out how to get it to load updated message files. tried disabling $wgLocalMessageCache and $wgCheckSerialized to no effect
  • 00:51 brion: temporarily blocking scap during testing :) ... running serialized language file updates for test, broken by need to get magic word updates
  • 00:44 brion: preparing a svn up...
  • 00:37 ariel: up msecoquian's mail quota from 1GB to 6.9GB

October 24

  • 23:12 brion: set up ariel (the person) on sanger to do mail administration -- quota fixes etc
  • 16:24 TimStarling: reloaded ourusers.sql on all core and ext. mysql servers, adding a nagios user
  • 15:39 mark: slacking
  • 15:36 TimStarling: added special nagios user to ES instances on clematis
  • 14:00 domas: re-enabled db5, added db18 to s3
  • 10:45 domas: taking out db5 for copy to db18
  • 10:44 domas: fixed ntpd on bart, was pointing to multicast address that doesn't work
  • 09:57 Tim: removed decommissioned servers from monitoring: dryas, alrazi, diderot, friedrich, samuel
  • 07:50 Tim: added monitoring for toolserver ES clusters 17-19
  • 07:40 Tim: regenerated trusted XFF list with extra SAIX proxies
  • 05:00 Tim: fixed nagios check script handling of MySQL connection errors
  • 01:37 brion: setting $wgLicenseURL for Collection to point at GFDL English text
  • 01:01 brion: enabling Drafts on testwiki, but it seems to not be saving there... works on my local test, not sure what the issue is
  • 01:03 brion: disabling logentry, still borken?

October 23

  • 22:33 brion: trying to re-enable logentry ext on wikitech, now with caching disabled to avoid the edittoken issue for now
  • 21:34 brion: updating ipblocks table definition
  • 21:25 brion: re-ran svnImport to update path listings for CodeReview
  • 20:11 mark: Set up search7 - search9
  • 17:05 mark: Pooled search4 as a s1 search server to help with dead search2
  • 16:33 brion: updated mw-serve
  • 15:38 Tim: On the image scalers, temporarily mounted /a/tmp as /tmp with --bind to stop the disk full problem while we figure out some better solution
  • 15:24 Tim: removed temporary files on image scalers again
  • 14:54 RobH: Replaced dead disk in amane, rebuilding array.
  • 11:04 Tim: Added disk space monitoring for image scalers. Also added apache monitoring which was also missing.
  • 10:53 Tim: freed up disk space on image scalers, magick-* temporary files were filling their root partitions
  • 10:50 Tim: re-added cluster19 to the default write list. Not sure who took it out or why.
  • 10:32 Tim: freed up some space on srv103 (was down to 500MB)
  • 10:29 Tim: fixed monitoring for MegaRAID SAS
  • 07:10 Tim: Set up monitoring of RAID status for all Ubuntu DB servers using the wikimedia-raid-utils package that I just wrote. It doesn't do anything on the MegaRAID servers yet, but the Adaptec ones should work.
  • 05:05 Tim: running CodeReview svnImport.php

October 22

  • 18:26 brion: enabling ODT output for collection
  • 18:17 brion: updating collection and codereview extensions
  • 18:13 Brion: updated mw-serve code and configured to send error emails per jojo's request
  • 17:15 Brion: Changed bugzilla's mail delivery from local sendmail (SSMTP) to direct SMTP, per Mark's recommendation

October 21

  • 19:29 RobH: Bayes upgraded from 2GB to 10GB.
  • 13:49 Tim: Did a demonstration hack of nagios from CSRF to arbitrary shell. Disabled cmd.cgi.
  • 04:13 Tim: Brought srv43-47 up as image scalers with mem limit 6 x 200MB = 1200MB (2GB physical)

October 20

  • 18:11 RobH: srv118 rebooted, back online.
  • 17:25 RobH: srv79 was in kernel panic, rebooted.
  • 05:10 Tim: increased concurrency on srv159 to 15, for mem limit 15 x 200MB = 3000MB
  • 02:40 Tim: installed NRPE on khaldun and db20
  • 02:20 Tim: moved disk space checks on the ext stores from the "apaches" service group to the relevant ext store service group
  • 01:53 Tim: installed NRPE on the new ext stores
  • 01:45 Tim: Updated /etc/ssh/ssh_known_hosts on bart (copied from zwinger).
  • 00:30-01:30 Tim: Listed down servers on DC tasks. Removed broken servers from memcached rotation. Restarted apache on srv99, srv109, srv123. Purged master binlogs on srv102.

October 18

  • 21:45 RobH's mighty index finger brought amane and the site back up.
  • 21:00 river: Ran 'nc -l -p 623' command, amane's kernel panic'ed. Rob was called.
  • 20:55 mark, river: diagnosed the NFS communication problems to be caused by NIC hardware packet interception of port 623 packets... amane wasn't receiving NFS replies from ms1.
  • 19:40 mark: Upload got unhappy, ms1 NFS mount on amane was unreachable and stalling things
  • 13:40 Tim: down again, single process allocating all memory
  • 07:35 Tim: took it down again, while recording /proc/vmstat and /proc/stat
  • 06:27 Tim: restarted srv160
  • 05:45 Tim: took srv160 into the purple for a much more convincing overload, and different oprofile results
  • 03:40 Tim: used oprofile to determine what part of the kernel is responsible for the system CPU spike. Looks like a spinlock in dnotify.
  • 03:12 Tim: simulated a memory-intensive request rate spike to srv160. Large system CPU response spike, but it didn't go down completely. Will try a bigger one.

October 17

  • 21:10 brion: enabled Commons foreign image repo on Wikitech
  • 18:45 brion: created Wikimedia-Boston list for SJ
  • 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
  • 16:45 brion: deleted some junk comments from bugzilla
  • 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right
  • 14:22 RobH: srv131 back up.
  • 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
  • 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
  • 01:59 Tim: removing cups on all servers where it is running
  • 00:00 RobH: restarted srv43-47

October 16

  • 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
  • 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
  • 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
  • 18:40 RobH: rebooted scs-ext, oops!
  • 18:26 RobH: srv61 reinstalled and redeployed.
  • 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
  • 17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger
  • 17:02 brion: thumbnails on commons are insanely slow and/or broken
  • 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
  • 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
  • 13:38 Tim: disabled directory index on amane. Was generating massive amounts of NFS traffic by generating a directory index for some timeline directories.
  • 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
  • 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
  • 11:25 Tim: synced upload scripts (including to ms1)
  • 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
  • 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
  • 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
  • 08:26 Tim: Restarted ntpd on search7, was broken
  • 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.
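    • The fix above comes down to one mask in the restrict line; a sketch, assuming stock ntpd restrict syntax (a /26 mask is 255.255.255.192, a /25 is 255.255.255.128):
      # in /etc/ntp.conf on zwinger:
      #   restrict 208.80.152.0 mask 255.255.255.128 nomodify notrap
      ntpq -p    # on an affected squid afterwards, check it is actually syncing and the offset is shrinking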

October 15

  • 19:49 brion: added '<span onmouseover="_tipon' to spam regex; some kind of weird edit submissions coming with this stuff like [1]
  • 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki.
  • 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
  • 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
  • 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
  • 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
  • 02:30 Tim: moved image scalers into their own ganglia cluster
  • 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
  • 02:13 tomasz: upgraded srv9 to ubuntu 8.04
  • 02:00 tomasz: upgraded srv9 to ubuntu 7.10

October 14

  • 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
  • 19:11 mark: Pooled old scaling servers srv38, srv39
  • 18:50 brion: at least four of new image scalers are down -- can't reach by SSH. thumbnailing is borked
  • 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty. ok so far
  • 16:20 brion: image scaling is broken in some way, investigating
  • 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
  • 00:10 Tim: oops, forgot to add VIPs, switched back.
  • 00:05 Tim: switched image scaling LVS to srv43-47

October 13

  • 23:45 Tim: prepping srv43-47 as image scaling servers
  • 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
  • 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
  • 21:30 RobH: search8 and search9 are online, awaiting configuration.
  • 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
    • rev:42058 will force GIFs over 1 megapixel to render a single frame instead of animations as a quick hackaround...
  • 20:48 domas: thistle serving as s2a server
  • 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
  • 19:53 RobH: search7 back online, awaiting addition to the search cluster.
  • 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
  • 17:00 RobH: srv21-srv29 decommissioned and unracked.
  • 12:05 domas: put lomaria back in rotation
  • 11:50 domas: Enabled write-behind caching on db15. Restarted.
  • 10:40 domas: restarted replication on db15 and lomaria
  • 10:27 domas: loading dewiki data from SQL dump into thistle
  • 09:09 Tim: restarted logmsgbot
  • 08:27 Tim: folded s2b back into s2
  • 08:06 Tim: db13 in rotation
  • 08:02 domas: copying from db15 to lomaria
  • 07:38 Tim: started replication on db13
  • 04:51 Tim: copying
  • 03:27 Tim: Preparing for copy from db15 to db13
  • 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.

October 12

  • 23:58 brion: updated CodeReview to add a commit so loadbalancer saves our master position. playing with serverstatus extension on yongle to find out wtf it keeps getting stuck
  • 22:05 brion: db15 sucks hard. putting categories back to db13
  • 22:01 brion: db15 got all laggy with the load. taking back out of general rotation, leaving it on categories/recentchangeslinked
  • 21:58 brion: db15 seems all happy. swapping it in in place of db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
  • 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2. But db8 seems to be handling the load now so I'm going to bed.
  • 19:20 Tim: depooled db15.
  • 19:09 Tim: split off some wikis into s2b and put db8 on it. To reduce I/O and hopefully stop the lag.
  • 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13.
  • 18:38 Tim: offloading commons RCL queries to db13
  • 18:36 Tim: dewiki r/w with ixia (master) only
  • 18:33 Tim: offloading commons category queries to db13
  • 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
  • 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
  • 17:46 Tim: db8->db15 copy finished, deploying
  • 17:33 Tim: installed NRPE on thistle.
  • 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
  • 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
  • 16:07 Tim: repooled ixia/db8 r/o
  • 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
  • 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
  • 14:37 Tim: deployed r41995 as a live patch to replace buggy temp hack.
  • 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
  • 14:11 Tim: copy now in progress as planned.
  • 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
  • 13:40 Tim: ixia (s2 master) disk full. Classic scenario, binlogs stopped first, writing continued for 10 minutes before replag was reported.
  • 13:00 jeluf: moved wikipedia/m* image directories to ms1
  • 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
  • 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
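    • For reference, the lighttpd knob behind the 08:00 and 02:55 entries above is a one-liner; a sketch, assuming a stock lighttpd config on ms1:
      # in lighttpd.conf:
      #   dir-listing.activate = "disable"
      /etc/init.d/lighttpd reload    # or a full restart, as the 08:00 entry did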

October 11

  • 07:00 jeluf: moved wikipedia/s* image directories to ms1

October 10

  • 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
  • 19:20 RobH: Bayes online.
  • 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
  • 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
  • 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
  • 12:17 domas: killed long running threads
  • ~12:04: s2 down due to slave server overload

October 9

  • 22:52 brion: enabled Collection on de.wikibooks so they can try it out
  • 20:00 jeluf: moved wikipedia/i* images to ms1
  • 17:05 RobH: thistle RAID died due to a failed hdd; replaced the hdd, reinstalled as RAID 10.
  • 12:00 domas: switched s3 master to db1; accidentally erased a bunch of db.php stuff (don't know how :), restored it from db.php~ :-)
  • 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...

October 8

  • 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i suspect svn-proxy but still can't easily prove it. using a separate svn command (in theory) but it's not showing me stuck processes.
  • ??:?? rob fixed srv37, and later srv133, by adding them to the mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
    • Speculation: the rumored ongoing image disappearances may have been caused by the image-destruction bug still being in place on srv133 for the last month.
  • 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
  • 18:59 brion: aaron complaining that srv37 wasn't properly updated (it doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...

October 7

  • 21:47 brion: started two dump threads (srv31)
  • 21:16 RobH: installed and configured gmond on all knams squids.
  • 21:00 jeluf: moved wikipedia/g* to ms1
  • 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
  • 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
  • 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on the new servers. Test run till tomorrow; if anything goes wrong, reroute search_pool_1 to the old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning the config files.
  • 15:54 RobH: srv101 crashed again, running tests.
  • 15:45 RobH: srv146 was powered down for no reason. Powered back up.
  • 15:42 RobH: srv138 locked up, rebooted, back online.
  • 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
  • 15:31 RobH: srv101 back up and synced.
  • 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
  • 15:21 RobH: updated lucene.php and synced.
  • 15:04 RobH: updated memcached to remove srv110 and add in spare srv137.
  • 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman

October 6

  • 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
  • 14:29 domas: amane lighty was closing connections immediately; worked properly after a restart. upgraded to 1.4.20 on the way.
  • 14:36 RobH: set up ganglia on all pmtpa squids.
  • 13:50 mark: The slow page loading on the frontend squids appears to be limited to the English main page only, for unknown reasons. Set another article as the PyBal check URL to prevent pooling/depooling oscillation by PyBal for now.
  • 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down

October 5

  • 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
  • 20:35 jeluf: wikipedia/b* moved, too
  • 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1 (a rough config sketch follows at the end of this list)
  • 14:30 jeluf: Moving all wikipedia/a* image directories to ms1
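
The 19:00 squid change boils down to an ACL that routes those URLs to an ms1 backend. A hedged sketch only: the peer name and regex are assumptions, and the real squid configs were generated from templates rather than edited by hand like this.

    # Directives of the kind needed (squid.conf syntax):
    #   acl ms1_paths urlpath_regex ^/wikipedia/a
    #   cache_peer ms1.wikimedia.org parent 80 0 no-query originserver name=ms1
    #   cache_peer_access ms1 allow ms1_paths
    #   cache_peer_access ms1 deny all
    # (the existing image backend also needs a matching deny), then reload:
    squid -k reconfigure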

October 4

  • 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia over being LVS master, due to this afternoon's reboot.
  • 14:30 mark: Several servers in J-16 were shutting down or going down around this time. Reason unknown: possibly automatic shutdown because of high temperature, possibly they were turned off by someone locally.
  • 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
  • 00:26 mark: Depooled srv61
  • 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!); a quick check is sketched after this list
    • updating packages on srv37. srv61 seems to have internal auth breakage
    • updated packages on srv61 too. su still borked, may need LDAP fix or something?
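
A quick way to spot the broken json_decode mentioned at 00:07, nothing site-specific:

    # A NULL result for valid JSON, or an undefined-function fatal, means the
    # json extension on that apache is broken and its package needs fixing.
    php -r 'var_dump(extension_loaded("json"));'
    php -r 'var_dump(json_decode("{\"ok\":1}"));'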

October 3

  • 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
  • 20:01 brion: running updateRestrictions on all wikis (done; a per-wiki loop is sketched after this list)
  • 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
  • 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
  • 17:13 RobH: srv130 back online.
  • 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
  • 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure
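
The 20:01 run across all wikis amounts to looping MediaWiki's updateRestrictions.php maintenance script over every database. A hypothetical loop: the dblist path, script path, and invocation style are assumptions about the setup at the time.

    # Run the maintenance script once per wiki database listed in all.dblist.
    for db in $(cat /home/wikipedia/common/all.dblist); do
        php /home/wikipedia/common/php/maintenance/updateRestrictions.php "$db"
    done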

October 1

  • 20:02 RobH: srv63 back online.
  • 19:35 RobH: srv61 and srv133 back online.
  • 18:22 RobH: storage3 online and handed off to brion.
  • 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
  • 17:32 RobH: srv61 faulty fan replaced, back online.
  • 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
  • 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
  • 23:46 mark: Reinstalled knsq24
  • 22:55 mark: Reenabled switchports of knsq16 - knsq30
  • 20:45 jeluf: fixed resolv.conf on srv131
  • 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1 (a mount/start sketch follows at the end of this list)
  • 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
  • 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
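
For reference, the 20:45 mount and lighttpd start correspond roughly to the following. The export and mountpoint are from the entry itself; the mount options, config line, and init path are assumptions (and directory listings were only actually disabled later, per the October 12 entries above).

    # On the web servers: mount the ms1 upload export where it is expected.
    mkdir -p /mnt/upload5
    mount -t nfs ms1:/export/upload /mnt/upload5

    # On ms1: serve the same tree over HTTP; listings stay off once
    #   dir-listing.activate = "disable"
    # is set in the lighttpd config.
    /etc/init.d/lighttpd start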
