Server Admin Log/Archive 20

From Wikitech

Revision as of 01:47, 21 February 2009

February 21

  • 01:47 domas: I FOUND HOW TO REVIVE APACHES
  • 01:46 brion: think i killed em, now trying to restart apache procs
  • 01:43 brion: poking to see if we can restart apaches...
  • 01:42 brion: syncing fixed InitialiseSettings/CommonSettings to apaches
  • 01:14 brion: and flyingparchment
  • 01:14 brion: domas and mark are attempting to restart the NFS server, but aren't mentioning any details in the public channel or log
  • 00:52 domas: http://p.defau.lt/?_M1iGbA0PCz2OOt2_KKPug
  • 00:52 mark: db20 in trouble
  • 00:39 mark: @brion you don't need to wake up
  • 00:36 domas: disabled 2006 fundraising cronjob on amane :-)

February 20

  • 23:31 Rob: upgraded squid and kernel on sq34-sq36
  • 23:12 Rob: upgraded kernel and squid on sq31-sq33, redeployed and online
  • 23:08 brion: updating CentralNotice for improved test script (plus i18n update)
  • 22:54 Rob: upgraded kernel + squid on sq28-sq30
  • 22:29 Rob: completed upgrades to sq25-sq27
  • 22:12 Rob: upgrading kernel and squid versions on sq25-sq27 (if i crash the site, i apologize in advance)
  • 22:08 Rob: upgraded kernel and squid on sq24
  • 21:59 river: added current patches to ms4, set zil_disable=1 and rebooted
  • 21:30 brion: srv31 seems to be down, so no dump activity
  • 21:08 brion: scapping to update FlaggedRevs to r47588 (fixing fatal err)
  • 21:01 Rob: updated kernel and squid on sq23
  • 20:58 Rob: updated kernel and squid on sq22
  • 20:36 Rob: updated kernel and squid on sq20 and sq21
  • 20:25 domas: some apaches in crashloop like this: http://p.defau.lt/?s9YhHD_0qHroVhauBdQb_g
  • 20:09 Rob: restarted apache on srv74
  • 20:03 Rob: upgraded kernel and squid on sq19
  • 19:50 Rob: upgraded kernel + squid on sq18
  • 19:34 Rob: upgraded kernel + squid on sq17
  • 19:19 brion: updating FlaggedRevs to r47574
  • 18:16 river: set zil_disable on ms1 to improve nfs write performance (see the sketch below this list)
  • 18:15 mark: Raised max-conns to 50
  • 18:03 mark: Cut down max conns even more (25) for pmtpa upload backend squids
  • 17:40 mark: Limited maximum connections to backend (ms1) to 50 per squid on upload squids, 1000 per squid on text
  • 16:17 domas: plenty of fedoras had futex deadlocks
  • 16:16 Rob: upgraded kernel and squid on sq14 and sq15
  • 15:49 Rob: updated squid and kernel on sq13, rebooted, back online
  • 15:26 Rob: upgraded squid and kernel on sq9-sq12 (not all at the same time)
  • 14:59 Rob: upgraded squid and kernel on sq5, sq6, sq7, sq8
  • 14:51 Rob: upgraded squid and kernel on sq2-sq4
  • 14:50 Tim: updated ContactPage extension, will deploy it on nlwiki shortly
  • 10:52 mark: Reduced cache_mem from 3000 to 2500 for pmtpa upload backend squids - no restart, will take effect with the 2.7 upgrade later today
  • 10:00 mark: Started backend squid on sq26, it was gone
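
The zil_disable entries above (18:16 and 21:59) turn off the ZFS intent log to cut synchronous NFS write latency on the Solaris media servers. A minimal sketch of how such a tunable is typically applied; whether it was set live or via /etc/system here is an assumption:

 # persistent: add the tunable to /etc/system, takes effect on the next reboot
 echo 'set zfs:zil_disable = 1' >> /etc/system
 # or poke the running kernel (immediate, but lost on reboot)
 echo 'zil_disable/W 1' | mdb -kw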

February 19

  • 23:54 brion: updating AbuseFilter to r47523 :P
  • 23:51 brion: updating AbuseFilter to r47522
  • 23:40 brion: updating FlaggedRevs to r47522
  • 23:39 Andrew: Enabled Abuse Filter on MediaWiki.org
  • 23:17 mark: Stopped experimental varnish on sq1, please keep Squid off as well
  • 22:52 Andrew: Allowed bureaucrats to remove 'sysop' right on testwiki.
  • 22:42 brion: updating includes/api to r47522 to fix a couple regressions
  • 22:15 mark: Started an experimental varnish instance on sq1 port 80
  • 21:22 mark: Stopped Squids on sq1
  • 14:23 Tim: removing memcached from srv154,srv155,srv157,srv158,srv169,srv170
  • 14:18 Tim: started memcached on srv190-199
  • 14:06 mark: Added "vport=80" to the http_host directive on all backend squids, to force Squid to use the default HTTP port, 80
  • 10:53 domas: livemerged r47483 (backlinks cache read explicit order, :( )
  • 07:56 Tim: restarted job runners with 4 processes per server instead of 1. Db2 is now heavily loaded, apparently due to the SELECT queries involved in the large numbers of unnecessary refreshLinks2 jobs that were queued before r47478 went live. But they should be done in a few hours at this rate.
  • 05:00 Brion: enabling Collection on fr, pl, nl, pt, es, simple Wikipedias
  • 02:12 Tim: deploying r47478
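
Squid 2.x accepts vport as an option on its http_port line, so the 14:06 change above presumably boils down to a backend listener line along these lines (the port number and the accel/vhost options are assumptions; only vport=80 comes from the log entry):

 # backend squid.conf listener -- hypothetical except for vport=80
 http_port 3128 accel vhost vport=80

followed by a squid -k reconfigure (or restart) on each backend.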

February 18

  • 22:41 Andrew: morebots back up, now logs to identi.ca with the name wikimediatech
  • 22:38 tomaszf: installed srv208 with Ubuntu 8.10.1 and installed app server software.
  • 22:12 domas: Andrew killed morebots. let's see how he fixes it... :)
  • 21:59 Rob: PDF creation moved to pdf1
  • 21:58 Rob: changed pdf generation from erzurumi to pdf1, testing.
  • 19:21 Rob: srv255 changed to pdf1 and moved, drac setup along with dns resolution
  • 19:19 brion: scapping
  • 19:18 brion: svn up'ing test to r47457
  • 18:37 Rob: reinstalling srv209 due to dhcp misconfiguration making it think it was srv208
  • 15:13 mark: Restarted all upload frontend squids to get rid of the memleaking
  • 14:20 mark: Blocked all non-GET/HEAD HTTP methods in requests to upload frontend squids (see the sketch below this list)
  • 12:46 Tim: put r47447 live for temporary proposed fix of bug 17552
  • 08:38 Tim: svn up r47434 to fix Special:BrokenRedirects
  • 08:04 Tim: cleaned up binlogs on db2
  • 06:33 brion: note there's a live hack in api categorymembers query which may be breaking lookups
  • 05:54 Tim: set up bugzilla attachment_base, pointing to the new domain http://bug-attachment.wikimedia.org/, and set allow_attachment_display=on
  • 05:51 brion: disabling $wgTorTagChanges in CommonSettings after the ext gets loaded (needs fix for testwiki)
  • 05:46 brion: syncing reverted expr.php w/o bc stuff
  • 05:25 brion: syncing extensions/FlaggedRevs/specialpages/OldReviewedPages_body.php fix
  • 05:24 brion: syncing fix to Expr.php for bcpow() error
  • 05:16 brion: syncing fix to extensions/ParserFunctions/Expr.php
  • 04:59 brion: starting scap process...
  • 04:52 brion: svn up'ing test to r47418
  • 04:45 brion: svn up'd test to 47417
  • 04:30 brion: removing editor, reviewer from add/remove for all users in test. that was an old test not needed anymore :D
  • 03:42 brion: rc tags tables created sitewide; should be safe to scap and check for final problems if we're brave
  • 03:35 brion: applying patch-change-tags to all wikis
  • 02:57 brion: ran patch-change_tag.sql on testwiki
  • 02:52 brion: full svn up'ing for test wiki
  • 02:06 brion: worked around breakage with pager base class incompat with latest codereview :P
  • 01:52 brion: svn up'ing CodeReview to aid in completing code review ;)
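
Blocking everything except GET/HEAD (14:20 entry above) is a short ACL in squid.conf; a minimal sketch, with a made-up acl name:

 acl safe_methods method GET HEAD
 http_access deny !safe_methods

applied with squid -k reconfigure on each upload frontend.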

February 17

  • 23:58 Rob: srv217-srv223 installed and online as apache servers. Updated dsh groups and nagios, as well as pybal
  • 23:24 Rob: installed OS on srv217-srv223, moving on to package installation.
  • 21:12 Rob: reinstalling srv209, which thought it was srv208. silly server. srv208 has not been installed, gave to tomasz to check against setup checklist.
  • 21:05 Rob: actually, srv209 installed as 208, bad dhcp entry. Fixing
  • 21:04 Rob: pulling srv208 and srv209 for quick reboots, their drac ips are wrong.
  • 21:04 Rob: racked srv217-223 (also racked srv224/225 but no power yet)
  • 18:30 brion: starting a batch run of update-special-pages-small just to ensure it actually works
  • 18:25 brion: fixed hardcoded /usr/local path for PHP and use of obsolete /etc/cluster in update-special-pages and update-special-pages-small; removing misleading log files (bugzilla:17534)
  • 03:19 Tim: removed live hack updating MW_DIFF_VERSION, changed on December 30 and the cache expiry is a week. Should not cause a significant amount of load.
  • 03:01 Tim: removed live hacks from extension/Cite, updated to r47350.
  • 01:49 Tim: deleting all enotif jobs from the job queue, there is still a huge backlog
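
Clearing a backlogged job type like the enotif jobs (01:49 entry) normally means deleting rows from MediaWiki's job table on each wiki's master; a hedged single-wiki sketch, with the host and database name as placeholders:

 mysql -h db2 enwiki -e "DELETE FROM job WHERE job_cmd = 'enotifNotify';"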

February 16

  • 16:46 mark: Did emergency rollback of squid 2.7.6 to squid 2.6.21 because of incompatible HTTP Host: header
  • 16:21 Rob: stopped upgrades, sq36 completed before stop
  • 16:17 Rob: performing upgrades to sq35-sq38 (not depooling in pybal, letting pybal handle that automatically)
  • 16:16 Rob: performed dist-upgrade on sq31-34
  • 15:35 Rob: depooled sq31-sq34 for upgrade
  • 08:12 Tim: patched in r47309, Article.php tweak
  • 05:00 Tim: made runJobs.php log to UDP instead of via stdout and NFS
  • 04:53 Tim: fixed incorrect host keys in /etc/ssh/ssh_known_hosts for srv38, srv39 and srv77 (see the sketch below this list)
  • 04:13 Tim: removing all refreshLinks2 jobs from the job queue, duplicate removal is broken so to clear the backlog it's better to just run maintenance/refreshLinks.php
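
For the 04:53 known-hosts fix, the usual recipe is to drop the stale entries and re-scan the hosts' current keys; a sketch, assuming the shared file was fixed in place rather than regenerated some other way:

 # remove the stale entries, then append freshly scanned keys
 for h in srv38 srv39 srv77; do
   ssh-keygen -R "$h" -f /etc/ssh/ssh_known_hosts
 done
 ssh-keyscan -t rsa srv38 srv39 srv77 >> /etc/ssh/ssh_known_hosts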

February 15

  • 21:59 mark: Experimentally blocked non GET/HEAD HTTP methods on sq3 frontend squid
  • 16:15 mark: Upgraded PyBal on lvs2 - others will follow
  • 13:11 domas: db23 has multiple MCEs for same dimm logged: http://p.defau.lt/?IarKD4gbFhe5RmaV0RB_Xg
  • 12:38 domas: in wikistats, placed files older than 10 days into ./archive/yyyy/mm/ - maybe it will make flack crash less :))
  • 11:56 mark: Doing Squid memleak searching on sq1 with valgrind, pooled with weight 1 in LVS
  • 03:09 Andrew: CentralNotice still not working properly, and when we tried to set it to testwiki-only, it never came up. Left it on testwiki only for the time being, until somebody who knows CentralNotice can take a look at it.
  • 02:21 Tim: fixed permissions on the rest of the logs in /home/wikipedia/logs/norotate (fixes centralnotice)

February 14

  • 19:19 Az1568_: re-enabled CentralNotice on testwiki to try and find the problem (we've had this before, but fixed it somehow...possibly with a regen? See November 16th log.)
  • 18:34 domas: filed a bug at https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/329489 - could use some Canonical escalation too
  • 18:26 domas: same affected srv47 - this is related to switching locking to fcntl() - this drives apparmor crazy
  • 17:47 domas: srv178 kernel memleaked few gigs. blame: apparmor
  • 14:34 domas: srv215 very much dead, doesn't show vitality signs even after serveractionhardreset
  • 14:28 domas: correction, srv208.mgmt is pointing to uninstalled box
  • 14:27 domas: DRAC serial on all new boxes is ttyS1 which is not in securetty (see the sketch below this list)
  • 14:24 domas: srv209.mgmt is actually srv208's SP, and srv208.mgmt is pointing to dead box
  • 14:15 domas: srv209,215 down?
  • 13:43 domas: installing php5-apc-3.0.19-1wm2 (no more futexes) on all ubuntu appservers.
  • 10:02 Andrew: Reports that CentralNotice broke on all wikis, displaying just the message name in angle brackets, even though the message existed on meta. I have no idea what caused it and I couldn't find anybody who knows anything about it, so I disabled the notice itself on Special:CentralNotice on meta. Somebody who knows what they're doing should probably look into it later.
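
The 14:27 note means root logins on the DRAC serial console get refused because ttyS1 is not listed in /etc/securetty; the usual fix is simply to add it, e.g.:

 # allow root login on the second serial port (DRAC serial-over-LAN console)
 echo ttyS1 >> /etc/securetty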

February 13

  • 22:10 mark: esams squid upgrade complete
  • 21:05 RobH: deployed srv207-srv216 in apaches cluster
  • 20:34 RobH: added new servers to nagios and restarted it
  • 20:15 RobH: setup all node groups, ganglia, apache, so on for srv199-srv206 and added into rotation
  • 19:38 mark: Upgrading esams squids to 2.7.6
  • 18:36 mark: Upgraded squid on sq1 to 2.7.6 and rebooted the box
  • 18:03 mark: Memory leak issues on the upload frontend squids, which started in November
  • 18:01 RobH: sq13 back online, seems there is a memory leak, go mark for finding =]
  • 17:54 RobH: lomaria install done for domas
  • 17:49 RobH: rebooting sq13 due to it failing out in ganglia, OOM error evident.
  • 17:48 RobH: reinstalling lomaria per domas request
  • 17:37 RobH: sq8 was unresponsive to console, locked up, rebooted, cleaned cache, and bringing back online
  • 17:34 RobH: srv38 and srv39 back in rotation
  • 17:23 RobH: srv38 and srv39 reinstalled, installing packages now
  • 16:57 RobH: reinstalling srv38/srv39
  • 16:57 RobH: srv80 reinstalled as ubuntu apache and back in rotation
  • 16:31 RobH: srv79 back in rotation
  • 16:21 RobH: srv79 reinstalled, installing packages and ganglia
  • 16:12 RobH: reinstalling srv79
  • 16:00 RobH: ganglia installed on srv77, back in rotation
  • 15:55 RobH: srv77 redeployed as ubuntu apache server
  • 15:48 RobH: reinstalling srv77 to ubuntu

February 12

  • 23:59 brion: adding 'helppage' to ui-content messages on commons per bugzilla:5925
  • 23:01 RobH: racked and setup drac for srv208-srv216
  • 21:20 mark: Killed blocked apache processes on srv180, and restarted apache
  • 21:19 mark: Killed blocked apache processes on srv172, and restarted apache
  • 21:07 brion: fixed ownership on log files for updateSpecialPages cronjob, which likely is what broke it
  • 20:28 mark: Upgraded experimental squid 2.7.5 on knsq1 to squid 2.7.6
  • 20:00 brion: fixed typo which broke access to revision deletion log for oversighters. tx to aaron for the spot :D
  • 19:45 mark: Replaced "2 cpu apaches" group aggregator srv32 by srv35
  • 18:55 RobH: racked, wired, and remote management setup for srv199-srv207
  • 09:51 domas: added srv190-srv198 to apaches dsh group, as they seem to be alive and kicking
  • 09:48 domas: changed weights for srv190-srv198 80->100 (to account for 1.85->2.5 ghz cpu step )
  • 00:29 brion: running updateRestrictions on wikis to clean up remaining funky restrictions entries per bugzilla:16846
  • 00:22 Tim: restarted apache on srv172

February 11

  • 23:23 mark: Pooled srv190-198
  • 23:23 Tim: re-enabling search suggestions
  • 23:19 mark: Installed Ganglia on srv190-198
  • 23:17 mark: Installed MediaWiki application server packages on srv190-198
  • 23:02 mark: Added srv190-198 to mediawiki_installation node_group (not any others)
  • 22:55 mark: Ran dist-upgrade && reboot on srv190-198
  • 22:46 mark: OS installed on srv190-198
  • 22:19 RobH: racked and setup drac on srv195-srv198
  • 22:11 RobH: racked and setup drac on srv192, srv193, srv194
  • 22:00 RobH: racked and setup drac on srv190, srv191
  • 21:24 brion: putting ixia back in rotation, it's caught up
  • 20:05 brion: depooling ixia while it catches up
  • 20:05 brion: ixia lagged 8810 secs
  • 20:00 brion: ixia replication is broken -- causing contribs lag on itwiki (see the sketch below this list)
  • 19:19 RobH: setup msw-a5-sdtpa like 30 minutes ago, oops ;]
  • 19:00 mark: Added srv190-225 to DNS & DHCP
  • 18:55 mark: set up RANCID for asw-a4-sdtpa and asw-a5-sdtpa
  • 18:54 brion: disabled srv38,39,77,79,80 in lvs3 pybal config to ensure they don't go back into service accidentally until fixed up
  • 18:37 brion: stopping apache on those bad machines for the moment
  • 18:35 brion: srv38, 39, 77, 79, and 80 appear to have been prematurely put into apaches pool, running old version of PHP. need to be halted and upgraded
  • 17:26 domas: restarted apache on srv154 after the deadlock in apc
  • 16:04 Tim: disabled checkers.php hack, using mwsuggest.js hack instead
  • 15:52 Tim: emergency optimisation: disabled search suggest via checkers.php
  • 15:41 domas: srv159 restarted as proper apache, not -DSCALER
  • 09:02 domas: moved morebots to ~morebots@wikitech.wikimedia, startup line in rc.local :)
  • 07:05 Tim: running maintenance/fixBug17442.php
  • 06:56 Tim: restarted job runners
  • 04:31 Tim: upgraded bugzilla to 3.0.8 with cvs up, and copied in the docs directory from the 3.0.8 tarball
  • 03:31 Tim: gave myself an account on isidore, cleaned up some crap in /srv/org/wikimedia to /srv/org/wikimedia/backup
  • 02:58 Tim: apt-get upgrade on isidore
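
The ixia entries above (20:00-21:24) are standard replication-lag triage; checking a slave's lag and thread state looks roughly like this (the host name is from the log, the rest is generic MySQL):

 mysql -h ixia -e 'SHOW SLAVE STATUS\G' \
   | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'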

February 10

  • 23:47 mark: Moved upload esams LVS from mint to hawthorn
  • 23:41 mark: Installed a specially compiled LVS Feisty kernel on hawthorn (running Hardy) & rebooted
  • 22:33 RobH: updated mwlib on erzurumi per brion
  • 22:25 RobH: some resets and such on searchidx1 to get ssh working. system is very sluggish.
  • 19:28 brion: wikitech server crashed; CPU pegged and OOM. rob rebooted it, yay
  • 02:46 Tim: running maintenance/fixBug17300.php to create missing redirect table entries
  • 01:18 Tim: reverted PP caching patch
  • 01:14 Tim: re-enabled search suggestions

February 9

  • 23:13 domas: grunt session finished
  • 23:10 domas: brought up srv80 from hibernation and made it work.
  • 22:53 domas: added srv61 too
  • 22:23 domas: added srv144 and srv147 to duty, added ganglia stuff too
  • 22:01 domas: started appserver work on srv77,srv79
  • 21:54 domas: started srv35,38,49 as appservers, restarted deadlocked srv49 processes
  • 16:14 mark: Moved upload LVS back from hawthorn to mint - even an optimized 2.6.24 kernel is not fast enough to serve upload LVS
  • 16:03 Tim: disabled search suggest as an emergency optimisation measure
  • 16:02 mark: Rebooted hawthorn with an LVS optimized kernel, moved upload LVS back to it
  • 15:53 mark: Moved upload esams LVS back to mint
  • 15:37 mark: Moved upload.esams LVS from mint to hawthorn
  • 15:28 mark: Reinstalled server hawthorn with Hardy 8.04
  • 13:55 domas: fixed ganglia group for srv159 (it is scaler, not appserv)
  • 13:51 domas: brought srv182 up
  • 13:32 domas: repooled srv104 and srv105, after a few months of vacation
  • 13:20 domas: killed a few orphaned tidy processes that had been very, very busy since Feb 1
  • 13:13 domas: heeheee, extorted this: [15:11] <rainman-sr> so, srv77,79,80, rose, coronelli and maurus could be converted to apaches
  • 12:36 Tim: trying apc.localcache=1 on srv176 (see the sketch below this list)
  • 04:27 Tim: patching in r46936
  • 03:48 Tim: attempting to reproduce APC lock contention on srv188
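
apc.localcache (12:36 entry above) is a php.ini-level APC setting; enabling it on a single box would look something like this, with the ini path being an assumption:

 echo 'apc.localcache = 1' >> /etc/php5/conf.d/apc.ini
 apache2ctl graceful    # reload so the new APC setting takes effect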

February 8

  • 22:43 brion: may or may not have fixed that -- log file was unwritable. hard to test the command since 'su' bitches about the apache account not being allowed to log in on hume :P
  • 22:39 brion: investigating why centralnotice update is still broken. getting fatal php errors wtf?
  • 20:17 domas: we were hitting APC lock contention after some CPU peak. Dear Ops Team, please upgrade to APC with localcache support. :)))))

February 7

  • 22:49 domas: db17 came up, but it crashed with different symptoms than other boxes, and it was running 2.6.28.1 kernel. might be previous hardware problems resurfacing
  • 22:47 brion: chmod'ing centralnotice JS output on ms1 so batch processes running as 'apache' user can actually update them. hadn't been getting updated since february 5, leading to complaints when the swedes updated a translation on the steward banner
  • 21:23 domas: db17 down

February 6

  • 12:33 brion: stopped that process since it was taking a while and just saved it as an hourly cronjob. :) log to /opt/mwlib/var/log/cache-cleaning (see the sketch below this list)
  • 12:28 brion: running mw-serve cache cleanup for files older than 24h
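
The 12:28-12:33 entries describe an hourly cleanup of the mw-serve (Collection/PDF) render cache; a hedged sketch of such a cron job, where the cache directory is an assumption and the log path comes from the entry:

 # /etc/cron.d/mw-serve-cache-clean -- hypothetical
 0 * * * * root find /opt/mwlib/var/cache -type f -mmin +1440 -delete >> /opt/mwlib/var/log/cache-cleaning 2>&1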

February 5

  • 18:19 brion: put ulimit back with -v 1024000; that's better :D (see the sketch below this list)
  • 18:18 brion: removed the ulimit; was unable to reach server with it in place
  • 18:15 brion: hacked mw-serve to ulimit -v 102400 on erzurumi, see if this helps with the leaks for now
  • 16:56 domas: rebooted erzurumi, placed swap-watchdog ( http://p.defau.lt/?mELQFcwRSvYRYdiIR9pvKQ ) into rc.local
  • 16:03 mark: Added Qatar (634) to the list of esams countries
  • 01:27 Tim: migrated arzwiki upload directory from amane to ms1
  • 01:00 Tim: fixed arzwiki upload directory permissions
  • 00:56 Tim: moved most cron jobs from admin user cron tabs to /etc/cron.d on hume
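
The ulimit changes above (18:15-18:19) cap mw-serve's virtual memory so a leak can't take the whole box down; as a wrapper-script sketch, with the actual mw-serve invocation as a placeholder:

 #!/bin/sh
 # cap virtual memory at ~1 GB (value is in KiB) before starting the render server
 ulimit -v 1024000
 exec mw-serve    # placeholder for however mw-serve is really launched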

February 4

  • 22:33 tomaszf: Adding cron for torblock under tfinc@hume
  • 22:20 tomaszf: ran loadExitNodes() to update tor block list
  • 18:36 brion: running TorBlock/loadExitNodes.php
  • 17:25 brion: stripped BOM from en.planet config.ini; re-running.
  • 17:24 brion_: attempting to run planet update for en.planet manually..... there's a config error
  • 16:30 domas: stealing db27 for moar tests

February 3

  • 13:05 mark: Remote-hands replaced some cables, fuchsia is back up but idling
  • 06:57 Tim: doing some schema changes on the otrs database. Some fields should be blobs and are text instead, perhaps due to a previous 4.0 -> 5.0 MySQL upgrade
  • 01:48 Tim: added blob_tracking table to ukwikimedia
  • 01:42 Tim: repooled db3 and db4
  • 00:34 mark: Moved traffic back
  • 00:28 mark: Shutdown switchport of fuchsia in order to prevent it from interfering with mint (which took up text LVS as well as upload)
  • 00:20 mark: Moved European traffic to pmtpa - text LVS unreachable

February 2

  • 23:54 domas: took out db29 for some testing
  • 22:07 mark: Modified Exim configuration on williams to not discard spam-recognized messages but deliver them to OTRS with an X-OTRS-Queue: Junk header, as well as SpamAssassin headers
  • 21:35 brion: reverting change to Cite_body.php
  • 21:28 brion: caching for cite refs is known to cause problems with links randomly replacing with other links; likely strip marker problem. andrew is investigating
  • 19:31 domas: merged in Andrew's Cite cache to live site
  • 16:47 brion-sick: syncing update to Collection to do more efficient sidebar lookups
  • 16:18 brion-sick: large spike in text backend service times
  • 16:15 brion-sick: secure.wikimedia.org is returning 503 Service Temporarily Unavailable
  • 08:11 Tim: removing ancient static HTML dump from srv31
  • 08:05 Tim: removed cluster13 and cluster14 from db.php, will watch exception.log for attempted connections
  • 08:02 Tim: removed srv130 from LVS and the apaches node group, not accessible by ssh but still serving pages
  • 07:56 Tim: find /home/wikipedia/logs -size 0 -delete
  • 07:43 Tim: re-added db22 to s1 rotation, no explanation for its removal in server admin log
  • 06:39 Tim: dropped the otrs_test database
  • 06:38 Tim: moved the OTRS database from otrs_real back to otrs. Updated exim4 config on mchenry
  • 04:23 Tim: db10's relay log was corrupted, did a flush slave/change master
  • 01:10 Tim: started mysqld on db23, doing recovery
  • 00:59 Tim: rebooted db23
  • 00:56 Tim: db23 down, depooled
  • 00:05 Tim: adjusted innodb configuration on db10, restarted, starting replication

February 1

  • 23:40 Tim: OTRS recovery script done
  • 22:13 brion: updating rowikibooks logo bugzilla:17273 (note the log bot is down again)
  • 21:25 Tim: running script to copy deleted OTRS data from db10
  • 20:40 mark: Lily was overloaded due to the long downtime of mchenry, stalling all mailing list deliveries
  • 20:39 mark: Granted SELECT access to mchenry and williams for database otrs_real - they've been giving temp rejects for hours
  • 11:24 Tim: mysqld on db10 crashes when it tries to run the current replicated query. Probably needs a resync. Set --skip-slave-start
  • 10:05 Tim: updated OTRS DB name on mchenry
  • 09:53 Tim: reading in SQL backup
  • 09:33 Tim: moving the otrs database to otrs_real to allow easier binlog import
  • 03:52 Tim: done 1 and 2
  • 03:10 Tim: recovery plan is as follows: 1. re-enable r/w web access, 2. compile a list of deleted IDs from the binlogs (confirmed that this is possible; see the sketch below this list), 3. read in the pre-upgrade backup to a separate DB and execute binlogs to the appropriate point, 4. copy affected IDs from the backup to the live DB
  • 02:52 Tim: patched GenericAgent.pm to prevent ticket deletion
  • 02:27 Tim: it seems some admin inserted a GenericAgent job called "temp1" at 09:46 with the effect of deleting all tickets older than 30 days. The binlogs show a duplicate "Valid" key, with one row setting it to 0 and the next setting it to 1, so it's possible the user set valid=0 in the UI but due to a bug in OTRS, the job was considered valid. The job appears to have been run first at 09:46, probably from the web, then regularly at 10 minute intervals, most likely due to the cron job on bart which was not deactivated. I've now removed the relevant crontab and revoked bart's OTRS permissions.
  • 01:11 Tim: put an explanatory note on the OTRS login screen and deleted all sessions to send users there
  • 00:38 Tim: revoked write access from the otrs mysql user, to prevent any further damage. Making a copy of the binlogs. The plan is to do forensics first and then recovery second.
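
Step 2 of the recovery plan (03:10 entry above) pulls the deleted ticket IDs back out of the binary logs; roughly, with the binlog file glob and match pattern as assumptions:

 # decode the binlogs and extract the deletions issued by the rogue GenericAgent job
 mysqlbinlog db10-bin.[0-9]* | grep -i 'delete from ticket' > deleted-ticket-statements.sql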

January 31

  • 18:17 mark: Following reports of OTRS rapidly deleting old tickets/emails every ~ 10 minutes, I disabled (set to invalid) all GenericAgent jobs pending investigation
  • 15:43 mark: Set local_from_check = false in exim.conf on williams, to prevent Sender headers from being added (annoying for Outlook users)
  • 07:11 Tim: converting OTRS database to proper UTF-8 (instead of UTF-8 in latin1 fields) using ~/fix-schema.php
  • 01:30 brion: updating eswikibooks logo bugzilla:17078
  • 00:55 brion: setting mswikibooks logo bugzilla:17263
  • 00:53 brion: copied wikimedia favicon to blog.wikimedia.org bugzilla:17171
  • 00:51 domas: lomaria needs reinstall, db24 and db30 are live in s2 duty

January 30

  • 17:54 domas: *giggle*, booted up lomaria with SMP kernel
  • 17:43 domas: lomaria kernel detects just one CPU (out of four)
  • 17:26 domas: converted lomaria into dewiki-only server
  • 14:20 Tim: Done with OTRS for now. Some bugs remain, particularly the missing ticket list in AgentTicketCustomer. I'll probably have to downgrade to 2.3.x tomorrow.
  • 12:51 mark: Installed ganglia on williams
  • 11:50 mark: Letting OTRS mail through to williams on mchenry
  • 10:50 Tim: running upgrade of OTRS DB
  • 10:44 mark: Removed all OTRS test copies in the queue of williams
  • 10:42 mark: Deferring all OTRS mail on the queue of mchenry
  • 10:30 mark: Put in a quick hack to forward misrouted OTRS mails from williams to bart
  • 08:52 Tim: sent upgrade warning email to all OTRS agents
  • 06:56 Tim: RCT should be finished now, no more connections are expected on cluster13 or 14. Current connection counts: 123943575, 295618929.
  • 02:36 Tim: set up SSL on williams and switched ticket.wikimedia.org DNS to point to there
  • 02:21 brion: set up new SSL cert for ticket.wikimedia.org; tim's poking at installing it
  • 02:19 brion: updated password on tridge *cough*
  • 01:43 brion: syncing update to Drafts with IE 7 fix (r46571 and style ver update)
  • 00:16 brion: live-merging r46570 -- fixes to DB access in revisiondelete

January 29

  • 22:55 mark: Did s/knams/esams/ on the selective AAAA answer config of ns0/ns1/ns2.wikimedia.org
  • 22:47 mark: While messages are held in the queue on williams, use "mailq" to view the queue, and "exim -M <messageid>" to let an individual message through for testing
  • 22:44 mark: SpamAssassin training from the OTRS Junk queue not yet setup
  • 22:43 mark: Note: Exim on williams queries for mail addresses from the live OTRS database, not the test database
  • 22:42 mark: Completed OTRS mail setup on williams. wikitech documentation updated in OTRS and Mail. OTRS mail is still copied to williams, and then held on the queue.
  • 22:00 mark: Added db10 as secondary DB to query for Exim on mchenry
  • 21:59 mark: Granted SELECT privileges on otrs.system_address to exim@williams on db9/db10
  • 21:58 brion: enabling revision & log suppression for oversighters
  • 21:12 brion: live-merging r46429 change to Special:Contributions -- stub marking fix
  • 21:01 mark: Copying OTRS mail to williams, where it's automatically held in the queue without extra processing; useful for testing
  • 21:00 mark: Installed SpamAssassin on williams for OTRS, copied training data from bart
  • 20:14 recompressTracked.php finished
  • 19:18 brion: aborted old enwiki dump so a fresh one can start, since that old history will never finish on the old system
  • 19:17 brion: updated data dump scripts
  • 17:57 brion: disabled the 'mark patrolled' link for views without a specific rcid param; it's back when we actually ask for it, so actual rc/new pages patrol works again http://rafb.net/p/puGHC095.html
  • 17:54 brion: poking at patrol link live hack
  • 17:40 brion: erzurumi is rebooted and serving out PDFs again. need to implement some resource limits...
  • 17:35 brion: rebooting erzurumi via drac
  • 17:32 brion: i hate the drac shell
  • 17:24 brion: erzurumi appears to have been victim to a massive memory leak. seeing if we can reboot it
  • 17:17 brion: poking at mw-serve on erzurumi; not responding
  • 16:15 domas: livehacked out 'patrol' link on article views %)
  • 04:02 Tim: added DNS entry for OTRS test
  • 03:19 tomaszf: installed grosley
  • 01:31 Tim: fixed srv76 and the wikimedia-task-appserver package
  • 01:31 brion-busy: syncing r46513 -- fix for categoryfinder, update to fix for Collection
  • 01:14 brion-busy: updating Collection ext -- compat issue with changed category
  • 00:56 brion-busy: stopped apache on srv76 for the moment
  • 00:55 brion-busy: srv76 doesn't have upload5 mounted
  • 00:41 brion: live-hacking out a broken check in getDupeWarning() which broke uploading if you had a duplicate file
  • 00:34 mark: DOM readouts on br1-knams:
br1-knams#sh optic 1
 Port Temperature    Tx Power       Rx Power    Tx Bias Current Monitor
+----+-----------+--------------+--------------+---------------+-------+
  1/1   24.0078 C    000.7776 dBm                  84.360 mA    Disabled
  1/2   N/A            N/A            N/A            N/A            
  1/3   37.0000 C   -003.4582 dBm  -003.8111 dBm   58.470 mA    Disabled
  1/4   32.0234 C    000.4669 dBm                  71.928 mA    Disabled
  • 00:22 Tim: synced nagios config

January 28

  • 23:40 mark: s/knams/esams/ in DNS geobackend files
  • 23:25 mark: Deployed fix in /lib/lsb/init-functions on sanger, mchenry, williams and lily which caused (amongst others) Exim reloads (-HUP) to be turned into a kill -TERM (Debian bug #434756)
  • 23:15 mark: Set up basic mail system for OTRS on williams. Still incomplete and needs fine tuning and testing, spam checking is not yet implemented amongst other things.
  • 22:30 mark: Restarted Exim on sanger, disappeared mysteriously
  • 21:50 mark: Raised Dovecot max login process count from 128 to 1024
  • 21:04 brion: merging reupload fixes: r46479, r46483, r46487
  • 20:49 mark: Base OS install finished on williams.wikimedia.org
  • 20:02 brion: merging r46472 (FlaggedRevs autopromote fix), r46464-46476 (feed RTL style fix, re-upload disabled field fix)
  • 18:05 RobH: setup mail relay for wikimedia.cz for Danny and Co  ;]
  • 08:43 domas: s3 replication switched from db1-bin.325:437169827 to db11-bin.026:79 (see the sketch below this list)
  • 08:35 domas: s2 rep switched from ixia-bin.150:119337662 to db13-bin.004:79
  • 06:15 Tim: creating backup of db10 on storage2
  • 04:29 brion: svn up'ing and scapping to r46424 consistently
  • 04:22 brion: updating FlaggedRevs to r46422
  • 04:17 brion: merging r46419, r46421 -- search display fixlets
  • 03:51 brion: attempting scap again; tweaking DataCenter.ui.php since the scap syntax checks are whinging about the abstract static method o_O
  • 03:40 brion: scapping to r46413
  • 01:35 brion: svn up'ing to r46413 on test...
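
The 08:35 and 08:43 replication switches above move slaves onto a new master at an explicit binlog position; per slave that is essentially the following (using the s3 coordinates from the log, with the master host name abbreviated and the replication user/password omitted):

 mysql -e "STOP SLAVE;
           CHANGE MASTER TO MASTER_HOST='db11',
                            MASTER_LOG_FILE='db11-bin.026',
                            MASTER_LOG_POS=79;
           START SLAVE;"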

January 27

  • 19:28 brion: syncing updates to Collection
  • 19:04 brion: scapping update to AbuseFilter for test. updated its schema...
  • 18:44 brion: db16 lagged 2188s
  • 18:44 brion: restarting slave thread on db16. it got stopped with a lock wait timeout on a page_touched update (wtf?!)
  • 18:43 brion: slave stopped on db16
  • 17:41 mark: knsq1 Up and serving requests with squid 2.7.5
  • 17:25 mark: Trying squid 2.7.5 on knsq1 - might be unstable in the mean time
  • 17:22 mark: Reduced cache_mem on backend esams text squids from 3000 to 2500
  • 16:23 RobH: srv76 had a failed hdd, replaced, reinstalled, and bringing back into rotation
  • 16:18 RobH: srv146 was powered down (heat issue?), powered back up, synced and now in rotation.
  • 16:09 RobH: srv139 didn't have apache running, synced and started
  • 16:01 RobH: srv129 didn't have apache running, synced and started
  • 15:59 RobH: sq11 back online, cleaned
  • 15:40 RobH: srv126 back online. possible bad disk, if it crashes again, the disk needs replacement. (it went read only before, which seems to sometimes happen even when the disks are not bad.)
  • 15:25 RobH: srv76 wont boot up, reinstalling.
  • 15:12 RobH: srv130 coming back online, updated fstab, synced, putting it back in rotation.
  • 15:05 RobH: moved ts-array4 to its dedicated ports, now it's kate's problem ;]
  • 14:49 Tim: restarted recompressTracked.php
  • 14:33 Tim: henbane's disk has been full for 8 days due to donate-campaign.log, starting cleanup
  • 14:18 Tim: killed recompressTracked.php
  • 14:08 domas: removed unnecessary ms1 stat from CommonSettings.php. Recovery observed. ( diff )
  • 13:44 mark: CARP weight redistribution caused large load spike in upload backend request, causing ms1 overload, probably causing issues on apaches via NFS, etc etc...
  • 13:29 mark: Lowered CARP weight from 10 to 5 for sq1-10.wikimedia.org, from 15 to 10 for sq11-15
  • 08:20 Tim: depooled db3 and db4 to improve recompressTracked speed
  • 07:09 Tim: There was a bug in recompressTracked.php which caused the last batch of orphans for any given wiki to be skipped. Re-running recompressTracked.php to repair it.
  • 05:55 Tim: killed all job runners, changed the job-runners group to srv151-180, started job runners on those servers
  • 05:50 Tim: migrated job runner scripts to ubuntu and started job runners on srv110-119
  • 05:29 Tim: started job runner on srv89
  • 02:13 brion: updating extensions/AbuseFilter/Views/AbuseFilterViewList.php (mysql 4 compat issue)
  • 02:04 brion: installed release versions of mwlib on erzurumi and restarted. these should have updated localizations
  • 01:48 brion: turning AbuseFilter on on test.... having some mysql 4.0 compat issues. poking
  • 01:47 brion: srv31 seems very sad; slow/borked login?
  • 01:39 brion: scapping to update AbuseFilter to current
  • 01:27 brion: prepping testing of AbuseFilter on test.wikipedia
  • 00:46 brion: enabling Collection also for de.wikisource per frank's req passed on from community
  • 00:36 brion: adding NS_HELP to $wgCollectionArticleNamespaces
  • 00:12 brion: Collection extension being enabled on dewiki

January 26

  • 22:39 RobH: UK Chapter wiki setup per https://bugzilla.wikimedia.org/show_bug.cgi?id=16996
  • 22:18 RobH: pushed apache changes for uk chapter wiki
  • 22:13 RobH: updated dns for uk chapter wiki
  • 19:29 brion: going to update Collection to current trunk in prep for further activation today
  • 17:01 RobH: added support for the phone server to dns

January 25

  • 12:18 mark: Announcing routes to AS16265 again
  • 10:17 domas: our deadlocks are described in X4240 manuals. the fix is either disabling MSI or setting 'options forcedeth max_interrupt_work=15' in modprobe.conf. product notes
  • 09:31 domas: db17 live, with 2.6.28.1 kernel

January 24

January 23

  • 18:04 brion: putting load back on db3, it's up to date
  • 17:49 brion: taking some load off db3 until it catches up
  • 17:46 brion: also killed a WantedTemplatesPage::recache query which had been running for a day. that ain't sustainable. :P
  • 17:44 brion: domas restarted morebots a few minutes ago :D
  • 17:43 brion: syncing update to ApiQueryBacklinks.php with the USE INDEX that was added for this problem
  • 17:41 brion: killing some stray backlinks queries
  • 17:38 brion: ~1-hour lag on db3
  • morebots is broken/down? unable to edit

January 22

  • 00:10 brion: whitelisting .ott (OpenDocument templates) for private-wiki uploads

January 21

  • 20:25 RobH: some tinkering on http redirects, rollback
  • 17:51 RobH: setup https for wikitech
  • 17:23 RobH: setup wikitech to stream weekly backups to tridge
  • 10:29 domas: db28 powered down because of temperature reading over threshold (45C???)

January 20

  • 21:45 RobH: killed some runaway processes on db9 that were killing bugzilla
  • 21:44 brion: stuck long queries on bz again. got rob poking em
  • 20:31 brion: putting $wgEnotifUseJobQ back for now. change postdates some of the spikes i'm seeing, but it'll be easier to not have to consider it
  • 20:19 mark: Upgraded kernel to 2.6.24-22 on sq22
  • 19:57 brion: disabling $wgEnotifUseJobQ since the lag is ungodly
  • 17:58 JeLuF: db2 overloaded, error messages about unreachable DB server have been reported. Nearly all connections on DB2 are in status "Sleep"
  • 17:21 JeLuF: srv154 is reachable again, current load average is 25, no obvious CPU consuming processes visible
  • 17:10 JeLuF: srv154 went down. Replaced its memcached by srv144's memcached
  • 03:02 brion: syncing InitialiseSettings -- reenabling CentralNotice which we'd taken temporarily out during the upload breakage
  • 01:50 Tim: exim4 on lily died while I examined reports of breakage, restarted it

January 19

  • 21:28 mark: Distribution upgrade on lily complete
  • 21:27 mark: Letting mail through again on lily
  • 21:01 JeLuF: Bugzilla didn't work. Some long-running (>3h) requests were locking some tables. Killed all long running jobs.
  • 20:05 mark: Put mail delivery on hold on lily
  • 20:03 mark: Upgrading lily (Mailing list server) to Ubuntu 8.04 Hardy
  • 14:04 mark: Set a static ARP entry for 85.17.163.246 on csw1-esams to see if it helps with the inbound packet loss effects

January 18

  • 20:25 mark: Cut outbound announcements to AS16265 to counter the inbound packet loss on that link
  • 17:50 river: started copying ms1:/export/upload to ms4
  • 00:21 Tim: restarted apache on srv158,srv177,srv106,srv66,srv109,srv140,srv86,srv90,srv133,srv172
  • 00:19 Tim: cleaned up binlogs on db1

January 17

  • 12:43 mark: Shut down transit link to 16265 due to intermittent packet loss

January 16

  • 23:25 brion: activating Drafts extension on testwiki
  • 21:18 brion: updating english/default wikibooks logo bugzilla:17034
  • 19:50 brion: uncommented srv101 from apache nodelist
  • 19:41 mark: Fixed authentication on srv101, and mounted /mnt/upload5
  • 19:25 brion: srv101 is commented out of 'apaches' node group so didn't show up on my earlier sweep
  • 19:23 brion: poking around, srv101 at least is missing upload5 mount still

January 15

  • 21:16 brion: seems magically better now
  • 20:48 brion: ok webserver7 started
  • 20:43 brion: per mark's recommendation, retrying webserver7 now that we've reduced hit rate and are past peak...
  • 20:28 brion: bumping styles back to apaches
  • 20:25 brion: restarted w/ some old server config bits commented out
  • 20:24 brion: tom recompiled lighty w/ the solaris bug patch. may or may not be workin' better, but still not throwing a lot of reqs through. checking config...
  • 19:48 brion: trying webserver7 again to see if it's still doing the funk and if we can measure something useful
  • 19:47 brion: we're gonna poke around http://redmine.lighttpd.net/issues/show/673 but we're really not sure what the original problem was to begin with yet
  • 19:39 brion: turning lighty back on, gonna poke it some more
  • 19:31 brion: stopping lighty again. not sure what the hell is going on, but it seems not to respond to most requests
  • 19:27 brion: image scalers are still doing wayyy under what they're supposed to, but they are churning some stuff out. not overloaded that i can see...
  • 19:20 brion: seems to spawn its php-cgi's ok
  • 19:19 brion: trying to stop lighty to poke at fastcgi again
  • 19:15 brion: looks like ms1+lighty is successfully serving images, but failing to hit the scaling backends. possible fastcgi buggage
  • 19:12 brion: started lighty on ms1 a bit ago. not really sure if it's configured right
  • 19:00 brion: stopping it again. confirmed load spike still going on
  • 18:58 brion: restarting webserver on ms1, see what happens
  • 18:56 brion: apache load seems to have dropped back to normal
  • 18:48 brion: switching stylepath back to upload (should be cached), seeing if that affects apache load
  • 18:40 brion: switching $wgStylePath to apaches for the moment
  • 18:39 brion: load dropping on ms1; ping time stabilizing also
  • 18:38 RobH: sq14, sq15, sq16 back up and serving requests
  • 18:38 brion: trying stopping/starting webserver on ms1
  • 18:27 brion: nfs upload5 is not happy :(
  • 18:27 brion: some sort of issues w/ media fileserver, we think, perhaps pressure due to some upload squid cache clearing?
  • 18:23 RobH: sq14-sq16 offline, rebooting and cleaning cache
  • 18:16 RobH: sq2, sq4, and sq10 were unresponsive and down. Restarted, cleaned cache, and brought back online.
  • 04:32 Tim: increased squid max post size from 75MB to 110MB so that people can actually upload 100MB files as advertised in the media
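
The 04:32 post-size bump maps to Squid's request body limit; in stock Squid that is a single squid.conf directive (whether it was set exactly this way locally is an assumption):

 request_body_max_size 110 MB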

January 14

January 13

  • 23:32 Tim: fixed NRPE on db29
  • 22:56 Tim: cleaned up binlogs on db1 and ixia
  • 22:54 brion: poking WP alias on frwiki bugzilla:16887
  • 21:11 RobH: setup ganglia on erzurumi
  • 20:42 brion: setting all pdf generators to use the new server
  • 20:40 brion: testing pdf gen on erzurumi on testwiki
  • 20:35 RobH: setup erzurumi for dev testing
  • 20:35 RobH: some random updates on server roles to clean it up
  • 19:37 mark: Restored normal situation, with 14907 -> 43821 traffic downpreffed to HGTN to avoid peering network congestion
  • 18:40 mark: Retracted outbound announcement to all AMS-IX peers, 16265 and 13030 to force inbound via 1299
  • 18:25 mark: Undid any routing changes as they were not having the desired effect
  • 18:14 mark: Prepended 43821 twice on outgoing announcements to 16265 to make pmtpa-esams path via nycx less attractive
  • 11:38 Tim: reducing innodb_buffer_pool_size on db19, db21, db22, db29
  • 09:15 Tim: restarting mysqld on db23 again
  • 09:09 Tim: restarting mysqld on db18 again
  • 07:08 Tim: removed db23 from rotation, since I'm bringing it up soon and it will be lagged
  • 07:02 Tim: shutting down mysqld on db18 for further mem usage tweak
  • 06:53 Tim: fixed broken /etc/fstab on db23 via serial console
  • 06:42 Tim: restarting db23
  • 00:08 Tim: repooling db18, has caught up

January 12

  • 21:50 brion: testing a scap after touching MessagesWuu.php to see if that clears borked serialized bits
  • 21:22 RobH: erzurumi installed
  • 21:00 tomaszf: moved erzurumi to vlan 101 on asw-a4-sdtpa
  • 17:55 brion: temporarily stopped apache on srv78, srv118
  • 17:54 brion: srv78 doesn't have upload5 mounted
  • 17:54 brion: srv118 doesn't have upload5 mounted
  • 17:46 RobH: fixed some settings for flaggedrevs in https://bugzilla.wikimedia.org/show_bug.cgi?id=14648
  • 17:31 RobH: per brion commented out db18 in db.php cuz it's making other crap lag too much (bugzilla:16993)
  • 17:26 RobH: updated flaggedrevs.php for https://bugzilla.wikimedia.org/show_bug.cgi?id=16365
  • 17:23 RobH: updated apache config on yongle for wap => mobile forwarding oversight per https://bugzilla.wikimedia.org/show_bug.cgi?id=16692
  • 17:05 brion: db18 is backlogged 191k seconds. depooling it; complaints of hella lag
  • 15:32 Tim: restarted mysqld on db18 with reduced memory usage, repooled
  • 14:12 Tim: rebooting db18
  • 13:20 Tim: depooled db18 (is down)

January 10

  • 16:08 domas: rotated 300g sampled-1000.log ;-)
  • 07:09 river: applied current OS patches to ms2 and rebooted
  • 01:21 Tim: restarted apache on srv95,srv114,srv37,srv49
  • 01:19 Tim: cleaned up disk space on db1. Still looks suspiciously like the master...
  • 00:33 brion: redirecting old bylaws.pdf to wiki page bylaws on wikimediafoundation.org (foundation.conf update)
  • 00:13 brion: reconfigured exim on wikitech to hopefully actually send mail out. whether it reaches anything, we'll see
  • 00:12 tomaszf: turned off fundraising banners
  • 00:08 brion: installed a mail server on wikitech server, hopefully

January 9

January 8

  • 22:08 brion: putting db12 back in service, caught up
  • 21:42 RobH: changed the ip address for the management interfaces on sq31-sq50
  • 21:30 RobH: updated dns with the squids and srv management info for pmtpa
  • 21:16 brion: taking load off db12 while it updates
  • 21:15 brion: killing stuck query threads on db12 (lagged 13k seconds)
  • 20:23 RobH: updated dns removing a large number of decommissioned servers from records.
  • 20:08 RobH: pushed updates to dns for management ip allocations, changed management ips of search8-search12
  • 19:42 RobH: changed the management ip addresses of db5-db10 to fit into current ip scheme
  • 18:20 RobH: updated dns for the management name resolution of db11-db30
  • 18:11 RobH: ms5 has lom access enabled and is ready for testing. (Only one ethernet connection in lieu of the typical 3 on the thumper/thors)
  • 15:50 RobH: srv118 reinstalled
  • 15:46 RobH: srv136 is borked. Even after reinstall, it will run for a few minutes, then lock hard. Going to RMA it.
  • 15:38 RobH: reinstalled srv136 and srv118 cuz they were pissing me off (a valid reinstallation reason if there ever was one.)
  • 15:08 RobH: and srv118 back down, thing is borked.
  • 15:06 RobH: srv118 back online and serving requests.
  • 15:01 RobH: pushed db13 back into cluster, same with db14, from yesterdays work
  • 14:26 RobH: srv101 back online and in lvs
  • 14:15 RobH: reinstalled srv101, installing wikimedia-task-app packages now
  • 06:37 JeLuF: rebooted db18. Mysqld was stuck but couldn't be killed.
  • 04:08 Tim: migrated all locked wikis from $wgReadOnly(File) to permissions-based locking, so that stewards can edit the alternate project links, and so that various MediaWiki components don't break on page view
  • 03:57 river: set up ms3/ms4 with solaris 10 update 6

January 7

  • 22:50 RobH: db13 and db14 are replicating but not in the cluster (not sure if they are caught up)
  • 22:35 RobH: updated power strip information for ps1-a1-sdtpa and balanced load
  • 22:35 RobH: reseated mrj cable for csw1-sdtpa_1/13
  • 21:36 RobH: started up db13 and db14
  • 21:19 RobH: updating firmware on db13-db14
  • 21:14 RobH: shutdown db13 and db14 to fix lom lockup issue.
  • 20:52 RobH: depooled db13 and db14 in db.php to reboot them and fix the SP lockup issue.
  • 20:49 RobH: updating firmware on db16.
  • 20:43 RobH: started mysql back up on db15
  • 20:42 RobH: cold reset of db16 to resolve lom issue. will update firmware upon boot.
  • 20:39 RobH: swapped hostnames on ms3 and ms4, updated racktables and dns to reflect change
  • 20:24 brion: disabled wikidiff2 on wikitech since it's not installed, and this apparently is nicely broken
  • 20:21 RobH: db15 now responsive to lom and ready to be re-integrated into the cluster
  • 20:12 RobH: db15 cold reset fixes the LOM non-responsive issue. Upgrading its firmware to prevent future issues.
  • 20:06 brion: removed stray whitespace from wikitech config file which was breaking rss feeds
  • 19:22 mark: Possibility that esams LVS was overloaded, split over 2 boxes (fuchsia & mint)
  • 19:19 RobH: ms3 and ms4 are accessible via LOM and ready for setup/deployment
  • 19:05 RobH: updated dns for ms3-ms5, updated dns for management for all media servers.
  • 19:03 brion: touching MessagesZh.php and re-trying scap; may not have properly updated
  • 17:40 brion-plague: scapping -- merged r45507 zh specialpage alias fix to live. also r45499 (revert of Cite error thingy) seems to already have been merged
  • 13:58 Tim: ran updateAutoPromote.php on all flaggedRevs wikis
  • 13:41 Tim: scap
  • 13:21 Tim: repooled db3 and db4
  • 12:47 Tim: recompressTracked.php complete. Recompressed 628 GB of data to 30GB, a 21x reduction over per-revision compression.
  • 04:36 brion-codereview: svn up'ing testwiki to r45489

January 6

  • 16:01 mark: Changed 'knams' into 'esams' in DNS, kept a lot of old names in place
  • 15:26 Tim: cleaned up binlogs on db1
  • 13:09 mark: Did some Traffic Engineering on the Amsterdam network
  • 11:58 Tim: installed NRPE on new ES servers
  • 11:47 domas: added db29 to s3 duty
  • 11:32 Tim: locked clusters 18 and 19, updated nagios
  • 11:27 Tim: fixed lack of schema on srv161
  • 11:21 Tim: retired cluster18 from the write list, added cluster20 and cluster21
  • 11:15 Tim: cleaned up binlogs on srv105
  • 00:04 tomaszf: built out eiximenis with ubuntu-8.04 for mobile server

January 5

  • 20:47 brion: re-updating SpecialSearch.php and MWSearch.php for better fix of the XSS
  • 20:40 brion: updating SpecialSearch.php for XSS issue
  • 20:00 RobH: wikitech is moved to new host. Still needs HTTPS setup. Redirects from old host are in place.
  • 13:17 domas: setting up db24-db26 LVMs per http://p.defau.lt/?eAOimTjd9r_QvSDiIhHjng
  • 12:56 mark: Brought down BGP transit session to AS 1145 / Kennisnet
  • 12:29 domas: db16 had our special deadlock, didn't come up after reboot, SP not responding, needs datacenter activity
  • 12:07 domas: upgraded BIOS firmware on db29,db30 and accidentally on db19 (damn .29 ip :)
  • 11:47 domas: added 208.80.152.185 to noc.wikimedia.org vhost ServerAlias
  • 10:33 mark: Brought BGP session to AS 16265 back up
  • 00:04 Tim: cleaned up binlogs on ixia and db1

January 4

  • 17:08 mark: Restored traffic to esams
  • 16:38 mark: Moved route sourcing from br1-knams to csw1-esams
  • 15:55 mark: Moving esams traffic to pmtpa (scenario knams-down)

January 3

  • 23:57 mark: Restored AAAA record on upload.wikimedia.org
  • 12:04 domas: db17, db18 had OS/firmware updates, rebooted
  • 10:50 domas: db19 RAID complaining about temperature, check-raid/kswapd/mysqld deadlock. upgrading RAID firmware, rebooting, etc
  • 01:23 Tim: removed db3 and db4 from rotation again, to allow recompressTracked to go faster
  • 00:36 Tim: depooled db19, is down
  • 00:32 Tim: restarting recompressTracked with an extra wfWaitForSlaves()
  • 00:08 Tim: repooled db3 and db4

January 2

  • 22:35 Tim: depooled db3 and db4 temporarily
  • 21:56 Tim: killed recompressTracked for now, not waiting for slaves properly. db3 and db4 lagged.
  • 20:54 mark: Set db4 s1 load to 0, 4368s lagged
  • 00:42 Tim: restarting recompressTracked.php on hume

January 1

  • 20:34 brion: live-merging file delete fatal error fix from r45278
  • 19:47 brion: bumped meter image to 7
  • 01:59 brion: scapping!
  • 01:39 brion: svn up'ing test.wiki to r45274
  • 00:55 brion: svn up'ing on test.wikipedia

December 31

  • 18:40 brion: fixed old whygive.wikimedia.org blog by copying de-conflicted WordPress source files out of the active blog where we fixed it after the 2.7 upgrade

December 30

  • 23:02 RobH: is leaving on a jet plane, weeeeeeeee.. in 8 hours.
  • 23:01 RobH: all knams squids are now online.
  • 22:49 RobH: knsq23-26 back in rotation, 3 more to go.
  • 22:33 RobH: enabled knsq16-knsq22 in lvs, almost time to go back to hotel and die.
  • 22:22 brion: attempting to purge affected pages on dawiktionary, dawiki
  • 22:21 brion: taking dawiki, dawiktionary out of read-only because the rest of the fixes won't work until it's disabled :P
  • 22:14 brion: poking diff version in live DifferenceEngine.php to eliminate bogus cache entries for dawiki/dawiktionary
  • 22:11 RobH: stopping and clearing the cache on knsq16-knsq30.
  • 22:06 brion: trying it again, but this time with the right variable names
  • 22:02 brion: attempting to clear revision text loading cache entries for dawiktionary, dawiki
  • 21:47 brion: live-merging r45206 so bugzilla:16841 corrupted entries will be loaded properly on dawiki/dawiktionary. need to clear revision, diff, parser caches...
  • 21:15 brion: locking dawiki, dawiktionary ($wgReadOnly) pending encoding fix
  • 20:07 brion: killed recompressTracked.php processes on hume pending investigation of encoding breakage
  • 20:02 brion: commenting ariel out of pmtpa also
  • 19:58 brion: trying to clear no-longer-in-dns hosts from ALL node group
  • 19:57 brion: PLEASE SAY WHAT SERVER YOU'RE RUNNING BATCH PROCESSES ON IF THEY'RE NOT ON ZWINGER. thanks
  • 19:56 RobH: power disconnection for primary routing rack in esams. power restored, and totally was not robh's fault regardless of what lies mark may say to the contrary.
  • 19:54 brion: encoding issues reported with some old edits on dawiki. wondering if this is recompression-related?
  • 18:46 brion: added PMTPA nameserver back in mayflower's resolv.conf so DNS actually works on it until things are fixed
  • 17:42 brion: internal DNS for knams seems to be down (at least on mayflower), this is breaking at least SVN update notifications
  • 17:14 brion: updating logo for pmswiki bugzilla:16587
  • 13:29 Tim: starting recompressTracked.php on all wikis
  • 11:22 mark: Shutting down knsq16-30
  • 10:59 mark: In case of overload problems, please move traffic to pmtpa (scenario knams-down)
  • 10:54 mark: Depooled knsq16-30
  • 10:47 mark: Set DNS timeout on fuchsia (LVS) to 1s, PyBal timeout to 8s
  • 10:21 mark: Unracking pascal, mint, lily
  • 09:57 Tim: testing recompressTracked on huwiki
  • 09:38 mark: ts-array3/A --> yarrow/0
  • 09:23 TimStarling: testing recompressTracked on testwiki
  • 09:20 mark: hemlock/eth1 <--> clematis/eth1
  • 09:17 mark: ts-array2 -> zedler scsi B, ts-array1/0 -> zedler scsi A
  • 08:47 Tim: running FlaggedRevs/maintenance/clearCachedText.php on all FlaggedRevs wikis

December 29

  • 11:24 mark: Shutting down and unracking mayflower (subversion)
  • 11:21 mark: Temporarily disabled AAAA record upload.wikimedia.org for ipv6 participants
  • 11:19 mark: Unracked fuchsia
  • 11:16 mark: In case of overload problems, move traffic to pmtpa!
  • 11:11 mark: Moving all LVS to mint
  • 09:56 mark: Depooled knsq8-15
  • 09:56 mark: Unracked knsq1-7
  • 09:43 mark: Repooled knsq23-30, depooled knsq1-7
  • 09:23 mark: Depooled knsq23-30
  • 08:47 Tim: deleted some binlogs on srv108.
  • 04:50-05:32 Tim: set up external storage on the remaining 9 servers in srv151-186: srv160, srv161, srv162, srv172, srv173, srv174, srv184, srv185, srv186
  • 03:41 Tim: running orphanStats.php on all wikis
  • 03:26 Tim: restarted apache on srv33, srv146, srv169, srv172
  • 03:00 Tim: cleaned up binlogs on srv105

December 28

  • 21:33 brion: tweaked namespace robot policies for hewiki bugzilla:16247
  • 20:52 brion: tweaking it correctly this time
  • 20:50 brion: tweaking centralnotice loader path for secure.wm.o
  • 20:20ish brion: copied a couple image files for Bugzilla skin to local dir, since Firefox 3.1b whinges about loading images via http: from an https: page
  • 18:21 brion: we've been getting reports of difficulties reaching PMTPA via Level3
  • 18:03 brion: updating thwiki logo bugzilla:16008
  • 17:54 mark: csw1-esams racked and configured; link established with br1-knams
  • 12:14 mark: Moving equipment to EvoSwitch
  • 11:55 mark: Moved udpmcast from pascal to lily
  • 11:48 mark: sage stays at knams, to be racked into J-13 later
  • 11:44 mark: Unracking ragweed
  • 11:38 mark: Unracking hawthorn
  • 11:37 mark: Unracking sage
  • 11:37 mark: Unracked csw1-knams
  • 11:25 mark: Directed traffic back to knams
  • 10:52 mark: knams network should be back up
  • 09:05 mark: Moving knams traffic to pmtpa

December 27

  • 21:50 brion: removed stale sitemaps dirs for several private wikis

December 26

  • 00:50 Tim: started mysqld on db19, repooled
  • 00:44 Tim: got connection on db19 and assumed it was still broken, initiated shutdown
  • 00:44 domas: db19 had jfs/kswapd/etc deadlock, came up after reboot
  • 00:34 Tim: noticed db19 was down, depooled it.

December 25

  • 23:59 domas: restarted db19 with sysrq without telling anyone
  • 19:37 brion: adjusted subpage namespaces for arbcom_enwiki
  • 19:11 brion: disabled magic_quotes_gpc on yongle -- mobile.wikimedia.org gateway doesn't compensate for quoted input. :P
  • 19:09 brion: merry christmas!
  • 01:09 brion: re-running SVN metadata import for CodeReview to fix comment encoding (bugzilla:16640)

December 24

  • 21:55 brion: merging r45005 (restoring default font for Safari textarea)

December 23

  • 23:35 brion: svn up'd to r44990 (serialization updates broken by Setup.php change)
  • 23:28 brion: starting scap!
  • 23:24 brion: svn up'ing to r44989, prep for scap!
  • 22:41 brion: think i tweaked scap script to update skin files on upload.wikimedia.org ...hopefully :)
  • 22:09 brion-codereview: svn up'ing test.wikipedia.org to r44982 -- DO NOT SCAP UNTIL TESTED!
  • 02:38 Tim: cleaned up binlogs on db1, db2. Removed cluster19 from the write list, it's almost full.
  • 02:28 brion: clearing out bogus page_restrictions entries (bugzilla:16629)

December 22

  • 22:56 brion: updated timezone for huwikinews (bugzilla:14343)

December 21

  • 03:05 Tim: depooled db4 temporarily to speed up a long running trackBlobs query

December 20

  • 01:08 brion: starting a cleanupImages run on all wikis
  • 00:57 brion: set UI lang for mainpage on meta bugzilla:16701

December 19

  • 23:52 brion: removing MessageCache::get profiling hack, all done
  • 22:16 brion: adding profiling hack for MessageCache::get
  • 13:48 mark: Found knsq12 turned off, brought it back up
  • 12:17 mark: Unracking knsq15 to make room for the new router
  • 08:53 Tim: changed crontab on hume to run rebuildTemplates.php every 30 minutes instead of every 10 minutes, since it's taking about 30 minutes to finish each run (a shell sketch of the change is at the end of this day's list)
  • 07:42 Tim: started trackBlobs.php running on hume, for all wikis
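    • One way to make the 08:53 crontab change above from a shell; a sketch only, assuming the existing entry starts with a "*/10" minute field (not the command actually used):
      crontab -l | sed 's|^\*/10\(.*rebuildTemplates\.php.*\)|*/30\1|' | crontab -   # bump the interval from 10 to 30 minutes
      crontab -l | grep rebuildTemplates.php                                         # verify the new entry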

December 18

  • 23:16 brion: updating MessagesLij.php, MessagesMt.php -- namespace breakage
  • 21:53 brion: bugzilla:16597 spam regex update
  • 21:01 RobH: added wikitech subdomain for future setup/migration of wikitech mediawiki
  • 20:33 RobH: added commons to meta imports allowed per https://bugzilla.wikimedia.org/show_bug.cgi?id=16665
  • 14:50 RobH: pushed dns change to correct spence.mgmt.pmtpa.wmnet.
  • 03:09 TimStarling: killed long-running query on db9, 5762 seconds, plain select query probably with a read lock held by the thread, all read queries were waiting for the lock
  • 02:27 TimStarling: deleted binlogs on srv105 and srv108
  • 01:16 brion: briefly experimented with changing wgLogo on testwiki via Configure and it didn't explode. yay! setting it back to default and just letting it be. only stewards can edit config, and only wgLogo is configable atm.
  • 01:12 brion: testing Configure on testwiki only
  • 01:10 brion: created test Configure ext tables in 'wikiconfig' db
  • 00:49 brion: scapping for update of Configure extension prior to small-scale test deployment
  • 00:48 Danny_B: wikibugs-l stopped sending mails to the wikibugs-irc mailbox due to excessive bounces. re-enabling sending
  • 00:28 RobH: fixed part of the revert for lucene that i missed.
  • 00:24 RobH: reverted lucene.php changes from rainman's testing.

December 17

  • 23:18 RobH: more lucene changes
  • 22:36 brion: applied fix for Android browser on mobile gateway (also did the pl language setup recently)
  • 22:05 RobH: more lucene.php changes
  • 21:12 RobH: additions to lucene.php per rainman
  • 20:39 mark: Corrected LVS service IPs on search2, search10-12
  • 20:03 brion: hacked mw-serve init script on yongle into shape. will commit it in a bit and update docs
  • 19:38 brion: pdf server seems to have eaten all temp space on yongle. clearing...
  • 19:26 mark: Set up search2, search8-12
  • 18:57 RobH: pushing dns changes for new misc. servers management resolution
  • 18:30 RobH: updated lucene.php with rainman to do things that I really do not get but he knows about.
  • 16:28 RobH: new servers auth1, nfs2, streber and williams are racked, IPs allocated, DRAC working. No DHCP entries or OS installed yet.
  • 16:08 mark: restarted lighttpd on zwinger
  • 15:59 RobH: added williams to dns records, updated dns
  • 15:50 TimStarling: removed some binlogs on ixia
  • 01:17 brion: scapping a couple more fixes to r44698
  • 00:36 brion-codereview: srv126 is borked -- read-only filesystem
  • 00:23 brion-codereview: scapping to 44696
  • 00:15 brion-codereview: svn up'ing on test...

December 16

  • 23:09 brion-codereview: disabling FixedImage extension -- was used for old 2006 and 2007 fundraisers; images no longer exist and are not applicable to current fundraisers
  • 20:34 RobH: ariel is dead, will decommission later.
  • 20:29 RobH: ariel is fubar, rebooting and investigating.
  • 20:25 RobH: restarted services on sq13
  • 20:21 RobH: took down sq13 to clean its cache
  • 20:09 RobH: replaced bad /c0/p0 in amane
  • 19:45 RobH: setup drac access for nfs1, brewster, auth2, dobson, eiximenis, erzurumi, fenari, grosley, loudon, singer, & spence. The other 3 misc. servers will be setup later. OS not installed, just remote access setup and IP space allocated. (Not setup in DHCP yet.)
  • 18:47 brion: applying temporary resource limit lift on enwiki for an IP for workshop in SF
  • 17:40 RobH: updated dns for misc. servers project.
  • 01:08 brion: deploying r44643 update to CodeReview subversion proxy (swapped encoding protocol to avoid bugs in json_decode with some diffs)
  • 00:04 brion: running cleanupTitles.php in the background on all wikis...
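    • A sketch of what a "run it on all wikis" maintenance loop like this looks like, assuming a per-wiki invocation of the script and an all.dblist file listing the databases (both assumptions, not taken from this log; the log path is illustrative):
      for db in $(cat all.dblist); do
          php maintenance/cleanupTitles.php "$db" >> /home/wikipedia/logs/cleanupTitles.log 2>&1
      done &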

December 15

  • 23:20 brion: going to test fixes for FiveUpgrade.inc, which backs cleanupTitles.php, cleanupImages.php etc
  • 22:21 RobH: changed settings on metawiki to allow banned users to edit their talk pages per https://bugzilla.wikimedia.org/show_bug.cgi?id=16621
  • 21:25 brion: reenabling handheld skin setting, was turned off during overload emergencies on 11-17
  • 21:13 brion: rsyncd appears to be running on srv56. does anything else need to be done for index updates?
  • 20:10 brion: yongle hanging again, restarting apache
  • 18:58 RobH: started rsync daemon on srv56 per rainman
  • 18:35 RobH: setup new planet per https://bugzilla.wikimedia.org/show_bug.cgi?id=16511.
  • 01:39 brion-weekend: applying API deletion log fix from r44541 (bugzilla:16626)
  • 00:09 rainman-sr: rsyncd is not running on srv56, updates for wikis served by old indexer halted since Oct 7. Run rsync --daemon on srv56

December 14

  • 02:04 Platonides: Connections timing out

December 13

  • 02:04 brion: applied patch-rfb_ratings.sql to flaggedrevs wikis
  • 01:46 brion: did some debugging on RatingHistory graph generation with Aaron and got it working yay!

December 12

  • 22:47 brion: patched Bugzilla so we can exclude CC-only mails from wikibugs-l (bugzilla:15585)
  • 21:52 brion: scapping to r44509
  • 19:19 brion: put all the themes and plugins and patches back on wordpress for blog.wm.o. whee
  • 19:15 brion: restarted apache on isidore while fiddling with php error logging settings and blog started magically working again. sigh. going back to tweak its config back to normal
  • 18:04 brion: we managed to fix the svn update conflict on blog.wm.o (to wordpress 2.7) but it's still showing main page as blank
  • 17:42 mark: Telia connection / BGP session was up for 20 hours; problem seems resolved. Removed route filters
  • 00:29 brion: bumping to r44485 for more NS fixes for ms, ast
  • 00:12 brion: scapping bump to r44484, fixing a few issues w/ hu
  • 00:06 brion: updated wikibugs irc script to r44483, fixes issues w/ users w/o real name setting

December 11

  • 23:19 brion: shutting down srv118; bad config. missing upload5 mount, seems to have bogus authentication (local su to root fails with "Authentication service cannot retrieve authentication info")
  • 23:10 brion: restarted apache on 134, it's scary/corrupt
  • 22:55 brion: manually syncing updated skin files to upload.wm.o ...
  • 22:53 brion: scapping to r44474
  • 21:31 brion: don't sync yet; RC regression in r44033 being worked on
  • 19:41 brion-codereview: removed conflicting live profiling hack from AutoLoader.php. Put this stuff in SVN, huh guys?
  • 19:39 brion-codereview: applying flaggedrevs schema updates
  • 19:38 brion-codereview: starting svn up for testwiki
  • 13:41 mark: configured asw-a4-sdtpa and asw-a5-sdtpa, but no link
  • 10:41 mark: bart out of disk space, removed some old cruft (mailman)
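    • A sketch of the usual hunt for that kind of cruft when a box fills up, assuming only standard coreutils (the path is illustrative):
      df -h                                            # confirm which filesystem is actually full
      du -xsk /var/lib/mailman/* | sort -n | tail -20  # biggest offenders under the suspect tree, in KB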

December 10

  • 23:50 RobH: pulled srv76 due to two dead fans (yay for da bot)
  • 23:35 RobH: srv78 reinstalled and in apache pool
  • 22:57 RobH: srv78 kernel panic, old FC install, pulled for reinstall
  • 22:49 RobH: sq1, sq3, sq6 cache cleaned and back online serving requests.
  • 22:35 RobH: sq1, sq3, sq6 all unresponsive to console, flashing leds on kvm. rebooted.
  • 20:40 RobH: srv118 installation completed.
  • 20:00 RobH: reinstalled srv118 after replacing dead parts. installing packages now.
  • 19:48 RobH: started rebuild of storage1 /c1/p0 into array
  • 19:47 RobH: replaced disk /c1/p0 in storage1. /c1/p13 is now bad as well, placing rma for it.
  • 19:14 RobH: db13-db16 responsive to ssh.
  • 19:13 RobH: db15 rebooted.
  • 18:05 RobH: temp probes installed in a3-sdtpa

December 9

  • 18:46 RobH: fixed group names in add/remove groups per https://bugzilla.wikimedia.org/show_bug.cgi?id=16248
  • 18:42 RobH: updated some settings for no.wikimedia.org and pushed to cluster.
  • 15:23 RobH: backed up blog frontend/database and upgraded to 2.6.5 successfully
  • 14:21 RobH: updated InitialiseSettings for nowikimedia wiki
  • 06:47 Tim: srv146 did not have /mnt/upload5 mounted. Fixed.
  • 02:03 brion: dropped loading of obsolete RenderHash ext (bug 16114)

December 8

  • 23:30 RobH: updated enwiktionary group settings per https://bugzilla.wikimedia.org/show_bug.cgi?id=16248
  • 23:24 brion: updating Oversight for bug 16065
  • 22:44 RobH: no.wikimedia.org is now functioning per https://bugzilla.wikimedia.org/show_bug.cgi?id=15383
  • 22:35 RobH: made changes to InitialiseSettings.php for cswikisource per https://bugzilla.wikimedia.org/show_bug.cgi?id=16277
  • 21:37 RobH: authdns-update for no.wikimedia.org
  • 21:20 RobH: running sync-common-all for wikimedia norge (found the php error)
  • 21:01 RobH: it's all back up now.
  • 20:59 RobH: I stupidly crashed the site with a php typo, rolling back my changes since i was ignorant and did not php -l ;_;
  • 20:58 RobH: setup wikimedia norge wiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=15383
  • 19:23 brion: updating OggHandler for fix for bug 15920 (chopped oggs)
  • 15:57 mark: Set up mirroring of traffic of e7/2 to e7/14 for testing the fiber patch loop/optics
  • 13:16 Tim: added some IWF proxies to the trusted XFF list. These proxies are probably about 30% of the IWF traffic; the other 70% comes from proxies that pass through the XFF header without adding the client address.

December 5

  • 22:42 domas: srv47 is running scaler usr.sbin.apache2 aa profile in learning mode
  • 22:33 RobH: sq50 reinstalled and back in rotation
  • 22:25 RobH: finished setup on srv146, back in apache pool
  • 21:32 RobH: setting up packages on srv146
  • 21:32 RobH: reinstalling sq50
  • 21:27 brion: pointing SiteMatrix at local copy, not NFS master, of langlist file
  • 19:19 RobH: added sq48 and sq49 back into the pool. sq50 pending reinstallation.
  • 18:58 mark: depooled broken squids sq1 and sq3
  • 18:26 RobH: depooled sq48-sq50 for relocation
  • 18:17 RobH: added sq44-sq47 back into pybal, relocation complete.
  • 17:45 brion: sync-common-all to add w/test-headers.php
  • 17:28 RobH: shutting down sq44-sq47 for relocation.
  • 17:27 RobH: sq41 - sq43 back online.
  • 17:17 RobH: sq40 oddness, but it's back up now
  • 16:44 RobH: accidentally pulled power for sq38, oops!
  • 15:36 RobH: removed sq41 - sq43 from pybal to relocate from pmtpa to sdtpa
  • 15:34 domas: srv178 running usr.sbin.apache2 aa profile in complain mode
  • 15:34 RobH: removed sq40 from pybal to relocate from pmtpa to sdtpa

December 4

  • 22:50 domas: job runners are no longer blue on ganglia CPU graphs :(((((((
  • 22:45 domas: fc4 maintenance, reniced job runners to 20 (10 behind apaches), installed APC 3.0.19 (APC 3.0.13 seems to have hit severe lock contention/busylooping at overloads)
  • 22:04 RobH: re-enabled sq38 in pybal. all is well
  • 22:02 RobH: fired sq37-sq39 back up
  • 21:58 RobH: shutdown sq37-sq39, cuz I need to balance the power distribution a bit better.
  • 21:40 RobH: sq38 is trying to break my spirit, so i reinstalled it to show it who is boss (me!)
  • 21:02 RobH: setup asw-a4-sdtpa and asw-a5-sdtpa on scs-a1-sdtpa
  • 20:52 mark: Increased TCP buffers on srv88 (a Fedora), matching the Ubuntus - Fedora Apaches appear to get stuck/deadlocked on writes to Squids
  • 19:39 RobH: pulled sq38 back out, as it is giving me issues. need to fix the msw-a3-sdtpa before i can fix sq38.
  • 19:35 RobH: added sq38, sq39 back into pybal
  • 19:25 RobH: added sq36, sq37 back into pybal
  • 18:14 RobH: I need to stop forgetting about lunch and stop working through it, oh well.
  • 18:13 RobH: depooled sq36-sq39 for move from pmtpa to sdtpa.
  • 18:12 RobH: some tinkering with lvs4; the idle-connection timer issue was fixed by mark.
  • 17:46 RobH: racked sq21-sq35 in sdtpa-a3. added back to pybal.
  • 16:31 RobH: depooled sq31-sq35 from lvs4 to move from pmtpa to sdtpa
  • 15:15 RobH: reinstalled storage1 to ubuntu 8.04, left data partition intact and untouched.

December 3

  • 23:46 JeLuF: performing importImage.php imports to commons for Duesentrieb
  • 19:13 RobH: tested i/o on db17, issue where it pauses disk access is gone.
  • 19:02 mark: Shut down TeliaSonera (AS1299) BGP session; the link is flaky, resulting in unidirectional traffic only for most of the day
  • 19:02 RobH: replaced hardware in db17, reinstalled.
  • 18:58 mark: Prepared search10, search11 and search12 as search servers
  • 17:26 brion: investigating ploticus config breakage bugzilla:16085
  • 17:18 brion: ploticus seems to be missing from most new apaches
  • 17:12 RobH_DC: search10, search11, search12 racked and installed.
  • 14:29 RobH_DC: srv136 was unresponsive, rebooted, synced, back in rotation.

December 2

  • 23:57 Tim: added CNAME poke.wikimedia.org for SMS notification project
  • 23:33 brion: scapping to update ContributionReporting ext
  • 23:11 Tim: db7 wasn't deleting its relay logs for some reason, since August 21. Disk critical. Did a reset slave.
  • 20:03 brion: rebuilt public_reporting with fixed encoding
  • 19:53 brion: fudged charsets in triggers for donation db update, let's see if that helps
  • 12:11 Tim: started squid (backend instance) on sq40, stopped for 13 days for no apparent reason
  • 12:08 Tim: restarted apache on srv161, srv122, srv137, attempted on srv123 but it is waiting for dead NFS mount
  • 11:48: srv183 made a miraculous recovery
  • 11:44 Tim: took srv183 out of memcached rotation
  • 11:10-11:35: a spike in backend requests (as seen in lvs3 network) caused the application cluster to overload. Due to the extra threads, srv183 went into swap and died.
  • 10:50 Tim: purged binlogs on ixia and db1 (both critical)

December 1

  • 23:49 brion: sync-common-all'ing to add a wikispecies little icon for sul shared session login, since people keep asking for it :)
  • 20:31 RobH: synced and restarted apache on srv89
  • 19:33 RobH: manually setup apache-check for pybal on srv138, synced, enabled.
  • 19:29 RobH: manually setup the apache_check stuff for srv126 and pybal.
  • 17:19 RobH: synced and restarted apache on srv176 & srv176
  • 17:18 RobH: did the sync and restart thing for apache on srv162
  • 17:16 RobH: synced and restarted apache on srv145
  • 17:13 RobH: synced and restarted apache on srv121 and srv125
  • 17:00 RobH: apache wasn't working on srv102 and srv106, restarted them after syncing
  • 15:10 mark: Restarted stuck pdns_server on bayle, lots of stale selective_answer.py processes
  • 14:44 domas: restored Roma article on itwiki, had orphaned revision entries after deleting it, manually inserted page entry
  • 14:40 mark: Setup Telia transit at knams, but all inbound routes filtered
  • 14:35 RobH: removed images from plwiki flaggedrevs per request from Leinad

November 30

  • 12:14 mark: restarted flapping apache on srv119, looks like memory corruption going on

November 28

  • 18:58 brion-holiday: updating User-Agent blacklists to block 'WebCapture' download tool but not the Library of Congress's www.loc.gov/webcapture/ spider
  • 18:17 yksinaisyyteni: fixed broken upload/deletion/timeline on jawiki
  • 07:11 JeLuF: succeeded in umounting /mnt
  • 07:10 JeLuF: killed hanging cron entries on db22. updatedb.mlocate. Might be related to broken mount db16:/a -> /mnt
  • 07:05 JeLuF: killed lots of jobs running on db22, "SELECT /* ApiQueryBacklinks::run XX.XXX.XXX.X */ page_id,page_title,page_namespace,page_is_redirect" which were in status "copying to tmp table"
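    • A sketch of the usual recovery for jobs wedged on a dead NFS mount like the db16:/a -> /mnt one above, assuming the mount really is dead and nothing on it needs flushing:
      fuser -vm /mnt          # list the processes holding the mount (updatedb.mlocate, cron children, ...)
      fuser -k -9 -m /mnt     # kill the wedged jobs (blunt version; picking pids by hand also works)
      umount -l /mnt          # lazy unmount so the path goes away even if db16 never answers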

November 27

  • 13:10 mark: hungover, headache, lack of voice

November 26

  • 17:00 RobH: fixed flaggedrevs to work on ruwikiquote, due to my own mistake in earlier implementation, per https://bugzilla.wikimedia.org/show_bug.cgi?id=14863
  • 02:38 brion: updated Math.php to r43966 which both fixes 0-byte math PNGs and generates correct URLs *cough*
  • 02:36 brion: broke math temporarily woops
  • 02:29 brion: bumped Math.php to r43965 to hopefully clear out those 0-byte math images (bugzilla:16440)
  • 02:01 brion: updating CentralNotice to r43962 to fix sitenames again :P
  • 01:57 brion: poking centralNotice to r43961 for evil hacks to bump limits temporarily :D
  • 01:31 brion: updating CentralNotice to r43959

November 25

  • 19:25 brion: syncing update to CentralNotice
  • 18:28 RobH: root password changed across all servers. if you didn't get a copy and you should have one, talk to another tech team member.
  • 17:58 RobH: added bayes to allowed nfs connections to storage2, setup fstab for nfs mounts on bayes, revoked shell access for ezachte on storage2 (not needed for what he wanted)
  • 15:49 RobH: updated some points for huwiki flaggedrevs and removed an outdated user group per https://bugzilla.wikimedia.org/show_bug.cgi?id=15568
  • 15:38 RobH: gave erik zachte login rights to storage2
  • 15:16 RobH: updated dns for survey software
  • 01:35 brion: updating ContributionReporting ext
  • 01:06 brion: forcing a manual run of centralnotice batch update on hume
  • 01:04 brion: restarting memcached on srv64
  • 01:02 brion: memcache bad on srv64
  • 01:01 brion: notice texts borked on at least wikimedia, wiktionary

November 24

  • 22:45 brion: updated ContributionReporting for some silly bugs
  • 22:20 RobH: portal and portal_talk namespaces added to dvwiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=16403
  • 22:04 RobH: added two new namespaces to dewikinews per https://bugzilla.wikimedia.org/show_bug.cgi?id=16263
  • 21:29 RobH: removed a group and granted further permission customization for huwiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=15568
  • 21:09 RobH: pushed a bad flaggedrevs.php that rendered blank pages for all wikis with flaggedrevs enabled. fixed it, it's working properly now, oops ;]
  • 21:06 RobH: appended page and dossier namespaces into the frwikinews flagged revisions per https://bugzilla.wikimedia.org/show_bug.cgi?id=15346
  • 20:36 RobH: enabled flaggedrevs on ukwiktionary per https://bugzilla.wikimedia.org/show_bug.cgi?id=15335, and ran sync-common-all
  • 20:27 RobH: ran sync-common-all
  • 20:27 RobH: enabled flaggedrevs on dewiktionary
  • 20:07 mark: moved upload knams LVS to mint
  • 20:05 brion: mark is on the case -- LVS overload
  • 19:58 brion: seem to be getting heavy packet loss on some routes to knams
  • 19:47 RobH: changed nameservers for wikimedia.li to WMCH administered name servers.
  • 19:30 RobH: re-enabled arzwiki, cannot find the bugzilla entry.
  • 15:43 RobH: search2 reinstalled and ready for search setup and deployment

November 22

  • 18:28 yksinaisyyteni: srv108 (cluster19) disk full, removing old logs
  • 00:37 brion: bumped php.ini post/file upload limit to 100mb, we'll see how well uploads to that size actually work  :)
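    • The php.ini side of that bump is just two directives; a sketch, assuming both limits are meant to move together (values as logged above):
      # in php.ini:
      #   post_max_size = 100M
      #   upload_max_filesize = 100M
      php -r 'echo ini_get("post_max_size"), " / ", ini_get("upload_max_filesize"), "\n";'   # verify after the apache restart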

November 21

  • 23:11 brion: dropping the 'Wikipedia: a non-profit project' banner from rotation, as it's apparently not a winner
  • 22:56 brion: updated logo for cr.wikipedia (bugzilla:16417)
  • 18:34 brion: running updateAutoPromote on new flaggedrevs wikis (bugzilla:16415)

November 20

  • 01:00 brion: updating ContributionHistory
  • 00:34 brion: moving $wgStyleSheetPath back to upload.wikimedia.org

November 19

  • 22:47 brion: updating Tomas skin to r43752 for toc fix
  • 22:41 brion: scapping for ContributionReporting update to 43750 (localization bugs)
  • 22:40 brion: ran namespaceDupes --prefix=D on enwiki and dewiki -- some 'D:blah' pages conflicted with iw prefix 'd' for wiktionary
  • 15:53 brion: updated centralnotice templates with user-targeted lightweight collapsed notice (wish it was for everybody)
  • 01:38 brion: updating CentralNotice to r43697 for anon/user collapsed variants
  • 00:35 yksinaisyyteni: unmounted storage1:/export/upload on all hosts
  • 00:32 yksinaisyyteni: rebooted srv{114,184,166} to fix stuck nfs mount

November 18

  • 23:52 brion: enabling new search UI on testwiki
  • 21:35 brion: switching css/js back to text temporarily to reduce load on upload squids
  • 21:27 brion: request -- squid conf deploy script should do a config file dry-run before actually deploying
  • 21:26 brion: there's load on ms1...
  • 21:25 brion: started more... most... all? squids in squids_upload
  • 21:24 brion: restarted squid manually on 46
  • 21:17 brion: uploads still borked, we're investigating the squid config problem
  • 21:16 brion: rebuilding squid conf, was a little funky
  • 21:12 brion: updating squid config to send centralnotice to ms1 instead of storage1
  • 20:41 RobH: db24 reinstalled, awaiting domas to do the magic db stuff
  • 20:38 RobH: replaced disk /c0/p7 in amane and started rebuild
  • 20:34 RobH: replaced controller in search2, search2 requires reinstall
  • 20:34 RobH: replaced controller in db24, db24 reinstalling.
  • 20:03 mark: installed gmond on db9 and db10
  • 19:59 brion: scapping to update Collection for regression fix
  • 01:51 mark: Moved text LVS to temporary LVS host lvs4, with an optimized kernel
  • 01:48 brion: setting $wgStyleSheetPath to point at upload.wikimedia.org/skins for non-SSL hosts
  • 01:30 brion: disabling handheld stylesheet; one less thing to load, should have little impact
  • 01:15 brion: another crappy slow squid this time in pmtpa

November 17

November 16

  • 17:24 brion: notices are becoming unborked with new regen. should be done and recached within 10 minutes
  • 17:17 brion: srv120 memcached now functional according to test: 10.0.2.120:11000 set: 100 incr: 100 get: 100 time: 0.0809991359711
  • 17:16 brion: restarting memcached on srv120
  • 17:14 brion: srv120's memcached seems broken: 10.0.2.120:11000 set: 100 incr: 0 get: 0 time: 0.0769970417023
  • 17:05 brion: investigating centralnotice borkage on non-wikipedia sites

November 15

  • 01:03 brion: scapping to r43514 -- regression in CodeReview :)
  • 00:49 brion: enabled UDP->IRC logging for CentralAuth user creations, now that it works instead of crashing PHP
  • 00:45 brion: set up ariel on isidore for blog maint
  • 00:24 brion: starting scap from r42593 to r43512
  • 00:02 brion: preparing for general svn up && scap

November 14

  • 23:24 RobH: updated flaggedrevs: $wgFlaggedRevValues to 4 from 2 for enwikibooks, synced files out to cluster.
  • 23:11 RobH: FlaggedRevs deployed on enwikibooks.
  • 23:00 RobH: removed the crap for specific seoul servers in sync-common-all
  • 22:43 brion: tweaked flaggedrevs.php to have cleaner default behavior
  • 20:27 RobH: setup the backend stuff for arz wiki but not enabled yet.
  • 19:59 brion: yongle is back up! yay
  • 19:48 RobH: fixed authdns-update script, was not rsyncing over the langlist file
  • 19:47 brion: swapping codereview-proxy to isidore since yongle's still down
  • 18:01 brion: requesting reboot on yongle from PM support
  • 17:14 domas: yongle is hanging, apple dictionary searches staled
  • 16:12 RobH: upgraded installation of blog.wikimedia.org and whygive.wikimedia.org to newest stable versions.
  • 15:14 RobH: limesurvey.wikimedia.org online on isidore, initial users created and deployed.
  • 02:03 brion: pascal down again
  • 00:00 brion: syncing to update InputBox extension (note: renamed from inputbox)

November 13

  • 23:41 brion: scapping to update CodeReview
  • 20:26 brion: scapping updates to Collection and ContributionReporting exts
  • 17:33 brion: set up TrevorParscal with access to reporting database so he can grab updates to test with
  • 17:03 river: upgraded ms1 to solaris 10 update 6 + rebooted
  • 09:57 Tim: db10 sync worked just fine this time, it's now replicating all DBs
  • 08:27 Tim: db10 slave start potentially botched, going to re-read the dump and try again
  • 06:43 Tim: loading data into mysqld on db10
  • 06:35 Tim: copy finished, restored r/w on bugzilla
  • 05:43 Tim: copying data from db9 to db10 (a replication-restore sketch is at the end of this day's list) using: mysqldump -h db9 --master-data --single-transaction --all-databases | gzip --fast > db9-master-data-2008-11-13.sql.gz
  • 05:34 Tim: switching bugzilla into read-only mode for copy to db10. Queries will be denied by user permissions for all tables except logincookies.
  • 05:02 Tim: converting all tables in bugzilla to InnoDB except longdescs
  • 04:53 Tim: converting the MyISAM tables in otrs to InnoDB (the large ones are done already)
  • 04:49 Tim: converted donateblog and newsblog to innodb
  • 03:34 Tim: converted racktables DB to InnoDB
  • 01:59 atglenn: changed wireless network password
  • 01:43 Tim: doing lockless backup of db9 to db10. This will give us a fallback in case disaster strikes during the considerably more complex replication synchronised dump which will follow.
  • 00:45 brion: poked it again
  • 00:29 brion: updating for ContributionReporting
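    • A sketch of how the 05:43 db9 -> db10 copy above typically gets turned back into a running slave, assuming the --master-data header in the dump supplies the binlog coordinates and the replication grants already exist (user/password are illustrative placeholders):
      zcat db9-master-data-2008-11-13.sql.gz | mysql                                              # load on db10; the embedded CHANGE MASTER TO sets MASTER_LOG_FILE/POS
      mysql -e "CHANGE MASTER TO MASTER_HOST='db9', MASTER_USER='repl', MASTER_PASSWORD='...';"   # point db10 at db9
      mysql -e "START SLAVE; SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind'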

November 12

  • 23:38 brion: XHTML fixes for Collection made the broken 'Random book' link on en.wikibooks.org work again (it very inefficiently loads a giant page of links via JS, and needs it to be clean XML to parse it)
  • 23:16 brion: updated mw-serve
  • 22:48 brion: scapping for Collection ext updates
  • 20:10 brion: updated wgNoticeProject to wikimedia for incubator
  • 18:46 brion: added "uploader" group so we can bump known-good people into being able to upload without waiting for the autoconfirm heuristic
  • 03:14 river: didn't reboot ms1 as its lom is unreachable
  • 01:20 Tim: an error in the cron job on hume caused the r43398 bug to persist until this time, delivering incorrect language text in some site notices.
  • 01:08 Tim: Fixed those 50 servers with a couple of sed commands. Many of them were attempting to send data to larousse and zwinger. Tested srv125.
  • 00:56 Tim: srv125 was spewing PHP fatal errors without reporting them to the syslog on db20. Restarted it. A quick check (ddsh -cM -g apaches -- 'grep -q @syslog /etc/syslog.conf || echo help') suggests that there are 50 apache servers in the same situation.
  • 00:27 Tim: updated ExtensionDistributor configuration to account for amane -> ms1 storage move. (bug 16308)
  • 00:13 Tim: some language issues caused by r43398, reverted at 23:50 and resynced in fixed form at 00:12.

November 11

  • 23:47 Tim: restored FlaggedRevs stats job as per Batch jobs, removal was not documented.
  • 23:35 Tim: r43398 worked just fine, memory usage dropped from ~4GB to 90MB. Adding rebuildTemplates.php to my crontab on hume, removing it permanently from Brion's on zwinger.
  • 23:28 Tim: updated CentralNotice templates on hume (which has enough memory to do it, unlike zwinger)
  • 22:11 Tim: deleted some binlogs on db1. Remaining disk space is still only 48 GB with negligible InnoDB free space.
  • 16:20 RobH: search2 still down, drives will not detect reliably. Ticket with sun reopened.
  • 15:56 RobH: replaced backplane on search2, reinstalling.
  • 15:13 RobH: srv137 back online. apache and memcached back up.
  • 14:49 RobH: srv100 back online.
  • 10:44 river: removed centralnotice php from brion's crontab as it was breaking zwinger
    • Core dump suggests the memory usage may be dominated by the localisation cache. wfMsgExt() loads the localisation for the requested language, and all languages are requested. -- Tim 12:07, 11 November 2008 (UTC)
  • 01:19 brion: swapped Commons to use $wgNoticeProject 'wikimedia' rather than having separate 'Commons needs you' notices
  • 00:57 brion: swapped in fundraiser to all projects

November 10

  • 19:18 mark: Shutdown AMS-IX route server 1 session as it's been flapping for hours

November 9

  • 16:11 river: removed nfsfind cronjob on ms1

November 7

  • 22:52 brion_: tossing 2008_meter_2b notice into partial rotation on enwiki -- has reduced collapsed version
  • 22:49 brion_: adding "_collapsed" to banner source tracking for collapsed view
  • 22:27 brion: scapping updates to ContributionReporting and CentralNotice
  • 01:43 Tim: experimentally reading the civicrm database into db10 with --master-data=1
  • 01:19 brion: db9 temporarily (hopefully) messed up. tim's fiddling with it to put it back
  • 01:05 Tim: my.cnf on db10 had an error in it, replicate-wild-do-tables instead of replicate-wild-do-table. Fixed it. The OTRS snapshot is now hopelessly out of date anyway, so I might wipe the data directory and start again. The idea is to set it up to replicate civicrm first. It's 100% InnoDB so should be easy to copy.
  • 00:09 river: upgraded ms2 to solaris 10 update 6

November 6

  • 21:03 Tim: switched GIFs to use Bitmap_ClientOnly (client-side scaling)
  • 17:23 brion: restarting apache on srv47, seems mysteriously stuck
  • 17:15 brion: setting $wgMaxAnimatedGifArea to 1 to prevent animated thumbnailing of GIFs for now, see if that helps
  • 17:10 brion: river complaining of image scaler issues -- load spikes, depooling?
  • 02:35 mark: disabled BGP, now using lvs2 only
  • 02:25 mark: restarting lvs2 with new kernel
  • 01:52 due to switch issues, load balancing to lvs2/lvs4 stopped working. Mark restarted the BGP session which fixed it temporarily.
  • 01:42 Tim: restarting squids
  • 01:42 mark: Setup lvs4 as temp LVS support for lvs2, balancing the load
  • 01:07 brion: updated ContributionReporting to add paging links to ContributionHistory (might be a little funky w/ caching, we'll work it out :)
  • 00:45 Tim: progressively clearing /a on the remaining image scalers
  • 00:37 Tim: wiping /a on srv44
  • ~00:30 lvs2 went into overload and started losing packets. Upload squid slowly went down over the next half hour.
  • 00:00 brion: scapping for update to ContributionReporting

November 5

  • 23:38 brion: set yongle to restart apache every hour since it still seems to bork up and get stuck sometimes
  • 22:01 RobH: srv100 rebooted, was down.
  • 18:28 mark: tech team is procrastinating
  • 18:16 atglenn: added dhelps to office@wikimedia.org alias, redirected office@wikipedia.org to him also
  • 18:14 brion: disabling centralnotice on private wikis, we don't need to be told to donate to ourselves ;)
  • 18:03 brion: poking sitenotices off wikibooks, on *.wikipedia
  • 18:03 brion: set up ariel on mchenry for mail admin
  • 05:38 brion_: opera users may rejoice ;)
  • 05:38 brion_: tweaked storage1 lighttpd config so centralnotice.js is served with utf-8 charset
  • 05:17 brion_: for reference -- load spikes are page rendering on enwiki and dewiki mostly :)
  • 05:16 brion_: bumping enwiki notice to 100%
  • 05:06 Tim: killed various mysqld_safe processes which were using 100% CPU on ES servers
  • 04:50 brion_: fixed morebots -- bots now allowed to edit again at wikitech
  • 04:50 brion_: enabling enwiki notice at about 10% sampling
  • 03:27 brion_: squids are... i think.... looking better :D
  • ... brion: cleaned up movepage attack, restricted editing here for convenience
  • 02:47 brion_: seems happier after restart of front-end squids
  • 02:43 brion_: tim's doing hard restarts of more squids, we're kinda offline briefly
  • 02:34 brion_: disabling centralnotices on remaining sites just for good measure while we debug
  • 02:29 brion_: current status: the squids which borked are still kind of borked, but perhaps slightly better. mark is examining squid memory reports
  • 02:14 brion: tim's attempting to restart borked squids
  • 02:01 brion: disabling enwiki centralnotice while investigating hits dropoff

November 4

  • 21:36 Tim: added nagios monitoring of HTTP on image backends
  • 21:14 Tim: installed NRPE stuff on db19
  • 19:37 Tim: killed the broken NFS mount on db21:/mnt with umount -l. The processes that are waiting for it will probably hang until system restart
  • 18:33 brion_: enabling ja-wikipedia notice for testing :D
  • 18:32 Tim: installed nagios stuff on db21,db22,db23
  • 18:27 Tim: srv104 done, cluster18 re-added to the write list
  • 18:15 Tim: installed NRPE on srv159,srv171,srv183
  • 17:25 domas: bounced db16 after jfs deadlock
  • 17:24 brion: settin' centralnotice on wikibooks to test, should show up in a few minutes
  • 16:00 Tim: fixing max_rows on srv104
  • 15:41 Tim: switching cluster18 master from srv104 to srv105
  • 01:33 Tim: fixing max_rows on srv105 and srv106
  • 01:28 Tim: removed cluster17 from the write list, is full.

November 3

  • 23:28 Tim: installed xdiff and gmp on hume. Used a source install of libxdiff since it's not packaged, and pecl install for the pecl module. Used the stock libgmp, a source install from the debian sources for the PHP GMP module.
  • 22:05 brion: enabled extra file upload types for foundationwiki, since it's restricted-write-access
  • 21:42 Tim: initialising srv159/171/183 as cluster20.
  • 21:24 Tim: srv159 needs to be an ext store, and so will be moved from the disk-intensive image scaler role back to an ordinary apache.
  • 20:46 brion: Special:ContributionTracking form submission intermediary live on foundationwiki
  • 20:33 brion: scapping for ContributionTracking extension
  • 19:59 brion: enabled mp3 and aiff uploads for private wikis so jay can upload some radio PSAs for fundraiser
  • 19:46 brion: poking $wgSquidMaxage from 31 days to 1 hour on wikimediafoundation.org, since templates and funky page URLs may do funky things and not get purged (extra parameters)
  • 19:32 brion: note there's no notice up yet ;)
  • 19:31 brion: enabling centralnotice loader on all wikis
  • 11:00 domas: mount -o remount,nobarrier /a on db15, observed 20x more performance. I am an idiot. :)
  • 02:36 brion-away: got a test centralnotice notice running on test.wikipedia.org. rock on
  • 02:18 brion: set up every-10-minute cronjob on zwinger to regen the centralnotice template JS files
  • 02:10 brion: centralnotice .js file loader up on test and meta for poking at
  • 01:12 mark: level 3 blackholing of traffic disappeared, brought BGP sessions back up
  • 00:59 mark: shutdown BGP session to AS 30217, for blackholing of traffic behind it (L3?)
  • 00:58 brion: network problems at pmtpa
  • 00:44 brion: for fun, did some load-time optimization on wikitech. trimmed out unneeded user/site .js, consolidated several .js files, and enabled mod_deflate for .css/.js. ssl setup time still sucks, and it's still a 1.7GHz Celeron. :)
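    • A sketch of the mod_deflate part of that tune-up, assuming a Debian-style Apache 2 layout (the directives are stock mod_deflate; the conf path and MIME type list are illustrative):
      a2enmod deflate
      echo 'AddOutputFilterByType DEFLATE text/css application/x-javascript application/javascript' > /etc/apache2/conf.d/deflate-static.conf
      apache2ctl configtest && apache2ctl graceful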

November 2

  • 23:43 brion: added bot flag to domas's log bot so it doesn't get hit by the URL captcha
  • 23:29 domas: db19 jfs deadlocked: http://p.defau.lt/?hC8C7MTk9BdTKBEHFgcsqA
  • 23:28 brion: scapping for CentralNotice tweak update
  • 23:11 brion: setting up ContactFormFundraiser on wikimediafoundation.org for fundraiser templates
  • 22:52 brion: scapping for ContactPageFundraiser setup
  • 22:41 brion: poked spamregex update
  • 22:14 brion: added 403 block in checkers.php for 'speichern' GET parameter -- bug in a common dewiki user script allowing CSRF-type vandalism
  • 17:13 Tim: Unmounted /tmp, cleaned up /tmp. Deleting /a/tmp on all image scalers.
  • 16:48 Tim: set ImageMagick temporary directory to /a/magick-tmp. Will unbind the /tmp -> /a/tmp mount.
  • 15:06 river: added missing /mnt/upload5 mount on several apaches: srv37 srv61 srv76 srv69 srv63 srv118 srv132 srv135 srv133 srv138 srv136
  • 14:49 domas: a few missing .frm files on db18 were causing trouble, resynced them from db19, resumed replication
  • 13:02 river: copying en from storage1 to ms1
  • 10:49 domas: replaced XFS with JFS on db18, installed ganglia on db17-db30
  • 10:36 river: completed move of commons, now being served from ms1 (except archive/)

November 1

  • 22:48 brion: fixed ContributionReporting to force a utf8 connection, now loads names in right charset
  • 22:20 brion: fixed $wgNoticeInfrastructure setting; defaults must have changed at some point
  • 22:15 domas: installed wikimedia-mysql4 on db21-23, established s1,s2,s3 replication. we now have full database copy in sdtpa \o/
  • 20:53 brion: deploying CentralNotice editing system on meta, woo
  • 20:27 brion: scapping to update reporting and centralnotice bits internally
  • 19:38 brion: rescapping to make sure 159 is unbroken
  • 19:27 brion: svn up'ing on wikitech just for domas
  • 19:25 brion: srv159 is out of space
    • We need to clean out the damn temp files somehow, eh?
  • 19:20 brion: scapping to update ContributionReporting ext
  • 12:56 mark: uppreffed traffic from knams to pmtpa via 6908/2828, as existing peering path had slight packet loss
  • 11:25 Tim: enabled subpages in the main namespace by default for all Wikisource wikis. This appears to be a de facto standard and is used by all wikisources with an entry in wgNamespacesWithSubpages.
  • 07:55 Tim: disabled ParserDiffTest, obsolete
  • 07:06 mark: XO circuit back up:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now up
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <w005.z207088246.xo.cnc.net>, session is now up

October 31

  • 23:11 brion: set up some logs for fundraising banner campaign clicks for later mining
  • 17:44 brion: adding support for Tomas skin on wikimediafoundation.org for new fundraiser templates
  • 14:24 mark: XO circuit went down:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <207.88.246.5>, session is now down because <Port State Down>
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now down because <Port State Down>

October 30

  • 23:11 Tim: fixed disk space on srv159, db1, srv103
  • 19:03 brion: updated triggers for donation reporting database a few minutes ago
  • 18:14 RobH: moved ms1 from pmtpa:a4 to sdtpa:a1, it's back online.
  • 17:46 RobH: db26 OS installed and online
  • 17:28 brion: added a spam filter rule for private-l messages :)
  • 04:54 river: testing sun web server on ms1
  • 03:56 brion: updating squid conf to send upload /centralnotice to storage1 for testing
  • 03:53 brion: tweaked lighttpd config on storage1 for centralnotice static file testing, since amane's configuration is too crappy to support regexes needed to set headers on a directory
  • 02:59 brion: poking experimental expires options on amane for static centralnotice tests
  • 02:44 brion: brion broke lighttpd.conf briefly

October 29

  • 22:39 brion: enabling $wgCodeReviewENotif experimentally
  • 18:35 brion: disabled bitmap fonts in fontconfig on image scalers, seems to help with the "mad helvetica" problem
  • 18:02 RobH: db28 & db29 OS installed and online.
  • 17:59 brion: fixed some upload directory perms on foundationwiki
  • 17:12 RobH: db27 OS installed and online.
  • 16:54 RobH: db21 OS installed and online.
  • 16:38 RobH: db22, db23, db25, db30 were installed yesterday, forgot to admin log it, sorry ;/
  • 14:44 _mary_kate_: copying wikipedia/commons/thumb/4 from storage1 to ms1

October 28

  • 20:02 domas: re-enabled db16
  • 18:03 mark: Removed blackholes.securitysage.com from lily's spamassassin configuration
  • 17:52 domas: db16 fubar'ed by queries that built 100GB temporary tables, leading to jfs hangs, leading to unhappy kernel.
  • 15:23 RobH: updated dsh node group ALL, added backup of frontend data for bugzilla and blogs from isidore to tridge.
  • 12:33 rainman-sr: experimentally turning on "did you mean.." on search8,9 for enwiki
  • 10:44 mark: Reverted yesterday's search changes

October 27

  • 23:24 mark: Switched to lucenesearch 2.1 for all wikis
  • 23:06 mark: pooled search8 as the only search server in search pool 3
  • 22:25 mark: rainman-sr is making me do more ugly things to lucene.php
  • 22:22 mark: Pointed search for "all other wikis" hardcoded to search7 in lucene.php
  • 22:14 mark: Added zhwiki and plwiki to lucene search 2.1 pool 2

October 26

  • 15:43 mark: Set up OpenGear serial console server scs-a1-sdtpa
  • 13:37 mark: Set up iBGP between csw1-sdtpa and csw5-pmtpa (IPv4/IPv6)
  • 13:36 mark: Prepared csw1-sdtpa for production deployment (general configuration)
  • 09:56 domas: updated db18 firmware to 2.1.1 (September 2008)
  • 04:31 Tim: fixed the "service_ips" hostgroup in nagios
  • 03:03 Tim: hardware reboot of db18
  • 02:47 Tim: mysqld on db18 apparently hit a kernel bug. It was reported as a zombie but was still using 200% CPU in top. kswapd was simultaneously using 100% CPU. Did not respond to SIGKILL. The non-zombie parent, mysqld_safe, also did not respond to SIGKILL (wchan=flush_cpu_workqueue). Attempted a reboot with shutdown -r.
  • 02:47 brion: tweaked MaxClientsPerChild on yongle to see if that helps with the mysterious hangs i sometimes see where requests seem to get backed up; it's disrupting the CodeReview proxy as well as mobile & Mac Dictionary search

October 25

  • 20:46 brion: scapped to r42573
  • 08:17 Tim: svn up to 42536 for API overload fix. Re-enabling disabled query modules.
  • 05:55 Tim: svn up/scap to 42531 (for properly tested Interwiki.php fix).
  • 05:09 Tim: DB overload on many enwiki slave servers. Long running queries attributed to ApiQueryAllpages, ApiQueryBacklinks, ApiQueryCategoryMembers and ApiQueryLogEvents. Disabled those modules and killed related running threads.
  • 05:01 Tim: Interwiki links were broken due to a totally broken and untested getInterwikiCached() function. Live patch deployed at this time.
  • 04:33 Tim: Fixed svn conflicts in two files. Scap to r42524.
  • 04:20 Tim: disabled Drafts extension on test.wikipedia.org. Trevor, please contact me for code review.
  • 04:11 Tim: synced php-1.5 to srv35 and ran "make -B" in the serialized directory. Seems to have fixed test. Will scap.
  • 01:01 ariel: preemptively up mail quota to 7GB from 1GB for cbass, dmenard
  • 00:59 brion: testwiki is borked until we figure out how to get it to load updated message files. tried disabling $wgLocalMessageCache and $wgCheckSerialized to no effect
  • 00:51 brion: temporarily blocking scap during testing :) ... running serialized language file updates for test, broken by need to get magic word updates
  • 00:44 brion: preparing a svn up...
  • 00:37 ariel: up msecoquian's mail quota from 1GB to 6.9GB

October 24

  • 23:12 brion: set up ariel (the person) on sanger to do mail administration -- quota fixes etc
  • 16:24 TimStarling: reloaded ourusers.sql on all core and ext. mysql servers, adding a nagios user
  • 15:39 mark: slacking
  • 15:36 TimStarling: added special nagios user to ES instances on clematis
  • 14:00 domas: re-enabled db5, added db18 to s3
  • 10:45 domas: taking out db5 for copy to db18
  • 10:44 domas: fixed ntpd on bart, was pointing to multicast address that doesn't work
  • 09:57 Tim: removed decommissioned servers from monitoring: dryas, alrazi, diderot, friedrich, samuel
  • 07:50 Tim: added monitoring for toolserver ES clusters 17-19
  • 07:40 Tim: regenerated trusted XFF list with extra SAIX proxies
  • 05:00 Tim: fixed nagios check script handling of MySQL connection errors
  • 01:37 brion: setting $wgLicenseURL for Collection to point at GFDL English text
  • 01:01 brion: enabling Drafts on testwiki, but it seems to not be saving there... works on my local test, not sure what the issue is
  • 01:03 brion: disabling logentry, still borken?

October 23

  • 22:33 brion: trying to re-enable logentry ext on wikitech, now with caching disabled to avoid the edittoken issue for now
  • 21:34 brion: updating ipblocks table definition
  • 21:25 brion: re-ran svnImport to update path listings for CodeReview
  • 20:11 mark: Set up search7 - search9
  • 17:05 mark: Pooled search4 as a s1 search server to help with dead search2
  • 16:33 brion: updated mw-serve
  • 15:38 Tim: On the image scalers, temporarily mounted /a/tmp as /tmp with --bind to stop the disk full problem while we figure out some better solution
  • 15:24 Tim: removed temporary files on image scalers again
  • 14:54 RobH: Replaced dead disk in amane, rebuilding array.
  • 11:04 Tim: Added disk space monitoring for image scalers. Also added apache monitoring which was also missing.
  • 10:53 Tim: freed up disk space on image scalers, magick-* temporary files were filling their root partitions
  • 10:50 Tim: re-added cluster19 to the default write list. Not sure who took it out or why.
  • 10:32 Tim: freed up some space on srv103 (was down to 500MB)
  • 10:29 Tim: fixed monitoring for MegaRAID SAS
  • 07:10 Tim: Set up monitoring of RAID status for all Ubuntu DB servers using the wikimedia-raid-utils package that I just wrote. It doesn't do anything on the MegaRAID servers yet, but the Adaptec ones should work.
  • 05:05 Tim: running CodeReview svnImport.php

October 22

  • 18:26 brion: enabling ODT output for collection
  • 18:17 brion: updating collection and codereview extensions
  • 18:13 Brion: updated mw-serve code and configured to send error emails per jojo's request
  • 17:15 Brion: Changed bugzilla's mail delivery from local sendmail (SSMTP) to direct SMTP, per Mark's recommendation

October 21

  • 19:29 RobH: Bayes upgraded from 2GB to 10GB.
  • 13:49 Tim: Did a demonstration hack of nagios from CSRF to arbitrary shell. Disabled cmd.cgi.
  • 04:13 Tim: Brought srv43-47 up as image scalers with mem limit 6 x 200MB = 1200MB (2GB physical)

October 20

  • 18:11 RobH: srv118 rebooted, back online.
  • 17:25 RobH: srv79 was in kernel panic, rebooted.
  • 05:10 Tim: increased concurrency on srv159 to 15, for mem limit 15 x 200MB = 3000MB
  • 02:40 Tim: installed NRPE on khaldun and db20
  • 02:20 Tim: moved disk space checks on the ext stores from the "apaches" service group to the relevant ext store service group
  • 01:53 Tim: installed NRPE on the new ext stores
  • 01:45 Tim: Updated /etc/ssh/ssh_known_hosts on bart (copied from zwinger).
  • 00:30-01:30 Tim: Listed down servers on DC tasks. Removed broken servers from memcached rotation. Restarted apache on srv99, srv109, srv123. Purged master binlogs on srv102.

October 18

  • 21:45 RobH's mighty index finger brought amane and the site back up.
  • 21:00 river: Ran 'nc -l -p 623' command, amane's kernel panic'ed. Rob was called.
  • 20:55 mark, river: diagnosed the NFS communication problems to be caused by NIC hardware packet interception of port 623 packets... amane wasn't receiving NFS replies from ms1.
  • 19:40 mark: Upload got unhappy, ms1 NFS mount on amane was unreachable and stalling things
  • 13:40 Tim: down again, single process allocating all memory
  • 07:35 Tim: took it down again, while recording /proc/vmstat and /proc/stat
  • 06:27 Tim: restarted srv160
  • 05:45 Tim: took srv160 into the purple for a much more convincing overload, and different oprofile results
  • 03:40 Tim: used oprofile to determine what part of the kernel is responsible for the system CPU spike. Looks like a spinlock in dnotify.
  • 03:12 Tim: simulated a memory-intensive request rate spike to srv160. Large system CPU response spike, but it didn't go down completely. Will try a bigger one.

October 17

  • 21:10 brion: enabled Commons foreign image repo on Wikitech
  • 18:45 brion: created Wikimedia-Boston list for SJ
  • 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
  • 16:45 brion: deleted some junk comments from bugzilla
  • 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right
  • 14:22 RobH: srv131 back up.
  • 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
  • 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
  • 01:59 Tim: removing cups on all servers where it is running
  • 00:00 RobH: restarted srv43-47

October 16

  • 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
  • 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
  • 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
  • 18:40 RobH: rebooted scs-ext, oops!
  • 18:26 RobH: srv61 reinstalled and redeployed.
  • 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
  • 17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger
  • 17:02 brion: thumbnails on commons are insanely slow and/or broken
  • 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
  • 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
  • 13:38 Tim: disabled directory index on amane. Was generating massive amounts of NFS traffic by generating a directory index for some timeline directories.
  • 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
  • 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
  • 11:25 Tim: synced upload scripts (including to ms1)
  • 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
  • 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
  • 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
  • 08:26 Tim: Restarted ntpd on search7, was broken
  • 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.
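    • The fix above comes down to one mask in the restrict line; a sketch, assuming stock ntpd restrict syntax (a /26 mask is 255.255.255.192, a /25 is 255.255.255.128):
      # in /etc/ntp.conf on zwinger:
      #   restrict 208.80.152.0 mask 255.255.255.128 nomodify notrap
      ntpq -p    # on an affected squid afterwards, check it is actually syncing and the offset is shrinking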

October 15

  • 19:49 brion: added '<span onmouseover="_tipon' to spam regex; some kind of weird edit submissions coming with this stuff like [1]
  • 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki.
  • 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
  • 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
  • 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
  • 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
  • 02:30 Tim: moved image scalers into their own ganglia cluster
  • 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
  • 02:13 tomasz: upgraded srv9 to ubuntu 8.04
  • 02:00 tomasz: upgraded srv9 to ubuntu 7.10

October 14

  • 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
  • 19:11 mark: Pooled old scaling servers srv38, srv39
  • 18:50 brion: at least four of new image scalers are down -- can't reach by SSH. thumbnailing is borked
  • 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty. ok so far
  • 16:20 brion: image scaling is broken in some way, investigating
  • 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
  • 00:10 Tim: oops, forgot to add VIPs, switched back.
  • 00:05 Tim: switched image scaling LVS to srv43-47

October 13

  • 23:45 Tim: prepping srv43-47 as image scaling servers
  • 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
  • 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
  • 21:30 RobH: search8 and search9 are online, awaiting configuration.
  • 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
    • rev:42058 will force GIFs over 1 megapixel to render a single frame instead of animations as a quick hackaround...
  • 20:48 domas: thistle serving as s2a server
  • 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
  • 19:53 RobH: search7 back online, awaiting addition to the search cluster.
  • 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
  • 17:00 RobH: srv21-srv29 decommissioned and unracked.
  • 12:05 domas: put lomaria back in rotation
  • 11:50 domas: Enabled write-behind caching on db15. Restarted.
  • 10:40 domas: restarted replication on db15 and lomaria
  • 10:27 domas: loading dewiki data from SQL dump into thistle
  • 09:09 Tim: restarted logmsgbot
  • 08:27 Tim: folded s2b back into s2
  • 08:06 Tim: db13 in rotation
  • 08:02 domas: copying from db15 to lomaria
  • 07:38 Tim: started replication on db13
  • 04:51 Tim: copying
  • 03:27 Tim: Preparing for copy from db15 to db13
  • 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.

October 12

  • 23:58 brion: updated CodeReview to add a commit so loadbalancer saves our master position. playing with serverstatus extension on yongle to find out wtf it keeps getting stuck
  • 22:05 brion: db15 sucks hard. putting categories back to db13
  • 22:01 brion: db15 got all laggy with the load. taking back out of general rotation, leaving it on categories/recentchangeslinked
  • 21:58 brion: db15 seems all happy. swapping it in in place of db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
  • 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2. But db8 seems to be handling the load now so I'm going to bed.
  • 19:20 Tim: depooled db15.
  • 19:09 Tim: split off some wikis into s2b and put db8 on it. To reduce I/O and hopefully stop the lag.
  • 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13.
  • 18:38 Tim: offloading commons RCL queries to db13
  • 18:36 Tim: dewiki r/w with ixia (master) only
  • 18:33 Tim: offloading commons category queries to db13
  • 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
  • 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
  • 17:46 Tim: db8->db15 copy finished, deploying
  • 17:33 Tim: installed NRPE on thistle.
  • 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
  • 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
  • 16:07 Tim: repooled ixia/db8 r/o
  • 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
  • 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
  • 14:37 Tim: deployed r41995 as a live patch to replace buggy temp hack.
  • 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
  • 14:11 Tim: copy now in progress as planned.
  • 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
  • 13:40 Tim: ixia (s2 master) disk full. Classic scenario, binlogs stopped first, writing continued for 10 minutes before replag was reported.
  • 13:00 jeluf: moved wikipedia/m* image directories to ms1
  • 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
  • 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
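    • For reference, the lighttpd knob behind the 08:00 and 02:55 entries above is a one-liner; a sketch, assuming a stock lighttpd config on ms1:
      # in lighttpd.conf:
      #   dir-listing.activate = "disable"
      /etc/init.d/lighttpd reload    # or a full restart, as the 08:00 entry did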

October 11

  • 07:00 jeluf: moved wikipedia/s* image directories to ms1

October 10

  • 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
  • 19:20 RobH: Bayes online.
  • 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
  • 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
  • 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
  • 12:17 domas: killed long running threads
  • ~12:04: s2 down due to slave server overload

October 9

  • 22:52 brion: enabled Collection on de.wikibooks so they can try it out
  • 20:00 jeluf: moved wikipedia/i* images to ms1
  • 17:05 RobH: thistle RAID died due to a failed hdd; replaced the hdd, reinstalled as RAID 10.
  • 12:00 domas: switched s3 master to db1; accidentally erased a bunch of db.php stuff (don't know how :), restored it from db.php~ :-)
  • 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...

October 8

  • 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i suspect svn-proxy but still can't easily prove it. using a separate svn command (in theory) but it's not showing me stuck processes.
  • ??:?? rob fixed srv37, and later srv133, by adding them to the mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
    • Speculation: the rumored ongoing image disappearances may have been caused by the image-destruction bug still being in place on srv133 for the last month.
  • 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
  • 18:59 brion: aaron complaining that srv37 wasn't properly updated (it doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...

October 7

  • 21:47 brion: started two dump threads (srv31)
  • 21:16 RobH: installed and configured gmond on all knams squids.
  • 21:00 jeluf: moved wikipedia/g* to ms1
  • 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
  • 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
  • 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on the new servers. Test run till tomorrow; if anything goes wrong, reroute search_pool_1 to the old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning the config files.
  • 15:54 RobH: srv101 crashed again, running tests.
  • 15:45 RobH: srv146 was powered down for no reason. Powered back up.
  • 15:42 RobH: srv138 locked up, rebooted, back online.
  • 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
  • 15:31 RobH: srv101 back up and synced.
  • 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
  • 15:21 RobH: updated lucene.php and synced.
  • 15:04 RobH: updated memcached to remove srv110 and add in spare srv137.
  • 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman

October 6

  • 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
  • 14:29 domas: amane lighty was closing connections immediately; worked properly after a restart. upgraded to 1.4.20 on the way.
  • 14:36 RobH: set up ganglia on all pmtpa squids.
  • 13:50 mark: The slow page loading on the frontend squids appears to be limited to the English main page only, for unknown reasons. Set another article as the PyBal check URL to prevent pooling/depooling oscillation by PyBal for now.
  • 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down

October 5

  • 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
  • 20:35 jeluf: wikipedia/b* moved, too
  • 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1 (a rough config sketch follows at the end of this list)
  • 14:30 jeluf: Moving all wikipedia/a* image directories to ms1
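
The 19:00 squid change boils down to an ACL that routes those URLs to an ms1 backend. A hedged sketch only: the peer name and regex are assumptions, and the real squid configs were generated from templates rather than edited by hand like this.

    # Directives of the kind needed (squid.conf syntax):
    #   acl ms1_paths urlpath_regex ^/wikipedia/a
    #   cache_peer ms1.wikimedia.org parent 80 0 no-query originserver name=ms1
    #   cache_peer_access ms1 allow ms1_paths
    #   cache_peer_access ms1 deny all
    # (the existing image backend also needs a matching deny), then reload:
    squid -k reconfigure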

October 4

  • 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia over being LVS master, due to this afternoon's reboot.
  • 14:30 mark: Several servers in J-16 were shutting down or going down around this time. Reason unknown: possibly automatic shutdown because of high temperature, possibly they were turned off by someone locally.
  • 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
  • 00:26 mark: Depooled srv61
  • 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!); a quick check is sketched after this list
    • updating packages on srv37. srv61 seems to have internal auth breakage
    • updated packages on srv61 too. su still borked, may need LDAP fix or something?
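
A quick way to spot the broken json_decode mentioned at 00:07, nothing site-specific:

    # A NULL result for valid JSON, or an undefined-function fatal, means the
    # json extension on that apache is broken and its package needs fixing.
    php -r 'var_dump(extension_loaded("json"));'
    php -r 'var_dump(json_decode("{\"ok\":1}"));'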

October 3

  • 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
  • 20:01 brion: running updateRestrictions on all wikis (done; a per-wiki loop is sketched after this list)
  • 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
  • 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
  • 17:13 RobH: srv130 back online.
  • 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
  • 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure
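
The 20:01 run across all wikis amounts to looping MediaWiki's updateRestrictions.php maintenance script over every database. A hypothetical loop: the dblist path, script path, and invocation style are assumptions about the setup at the time.

    # Run the maintenance script once per wiki database listed in all.dblist.
    for db in $(cat /home/wikipedia/common/all.dblist); do
        php /home/wikipedia/common/php/maintenance/updateRestrictions.php "$db"
    done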

October 1

  • 20:02 RobH: srv63 back online.
  • 19:35 RobH: srv61 and srv133 back online.
  • 18:22 RobH: storage3 online and handed off to brion.
  • 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
  • 17:32 RobH: srv61 faulty fan replaced, back online.
  • 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
  • 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
  • 23:46 mark: Reinstalled knsq24
  • 22:55 mark: Reenabled switchports of knsq16 - knsq30
  • 20:45 jeluf: fixed resolv.conf on srv131
  • 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1 (a mount/start sketch follows at the end of this list)
  • 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
  • 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
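
For reference, the 20:45 mount and lighttpd start correspond roughly to the following. The export and mountpoint are from the entry itself; the mount options, config line, and init path are assumptions (and directory listings were only actually disabled later, per the October 12 entries above).

    # On the web servers: mount the ms1 upload export where it is expected.
    mkdir -p /mnt/upload5
    mount -t nfs ms1:/export/upload /mnt/upload5

    # On ms1: serve the same tree over HTTP; listings stay off once
    #   dir-listing.activate = "disable"
    # is set in the lighttpd config.
    /etc/init.d/lighttpd start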
