Server admin log/Archive 9

January 30

  • 14:27 brion: moving some old dump files off benet to free dump space
  • 13:04 mark: Move traffic back to knams
  • 10:20 jeluf: added external storage cluster12, nodes srv126 (master), srv125, srv124. Using clusters 10, 11 and 12 as wgDefaultExternalStore.
  • 08:50 jeluf: added external storage cluster11, nodes srv123 (master), srv122, srv120
  • 05:24 mark: knams down for 15+ mins, moved all traffic to pmtpa. Here ya go Ben ;)

January 29

  • 23:00 jeluf: setting up external storage cluster10 on srv95 (master), srv94, srv93
  • 22:40 mark: rebooted srv27
  • 22:30 mark: Brought up the new 10G link on csw5-pmtpa e5/1 using new subnet 84.40.25.100/30, brought down both 1 Gig E links (needed one fiber pair for the 10G link).
  • 21:05 jeluf: removed binlogs 80-89 on adler
  • mark: Installed csw5-pmtpa SFM, 2x PSU, 4x 10G line card
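
As a quick sanity check on the /30 used for the new 10G link, Python's standard ipaddress module can enumerate the subnet. This is only an illustrative sketch: the log does not record which end of the link holds which address, so none is assumed here.

```python
import ipaddress

# Subnet quoted in the 22:30 entry for the 10G link on csw5-pmtpa e5/1.
link = ipaddress.ip_network("84.40.25.100/30")

# A /30 holds four addresses: network, two usable hosts, broadcast.
usable = [str(host) for host in link.hosts()]

print("network:  ", link.network_address)
print("usable:   ", ", ".join(usable))
print("broadcast:", link.broadcast_address)
```

A /30 leaves exactly two usable addresses, which is the usual sizing for a point-to-point link.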

January 27

  • 01:22 Kyle: sq3 is in. Ready for OS.

January 26

  • 11:05 brion: installed djvulibre on srv53,srv54,srv61,srv68,srv120; install-djvulibre script failed on 53,54,61,68 due to missing FC3 x86_64 RPM; the other FC3 boxen have the FC4 rpm installed so I symlinked it to let the script work.

January 24

  • 21:12 Tim: taking db1 out of rotation for defragmentation of jawiki table space
  • 15:56 Tim: upgrading to squid 2.6.9
  • 09:30 brion: starting dumps on benet/srv31 again, need to watch it and see about disk usage, and get back to fixing up storage2 some time
  • 01:15 brion: setting up revision compression on wikitech.leuksman.com

January 23

  • 19:00 Tim: restarted ntpd on knsq*, many had no server in their peer list except the local clock.
  • 18:53 Tim: stepped clock on knsq2, was off by one hour.

January 22

  • 23:ish brion: seems to be some kind of problem with squid purging, with old or inconsistent versions of pages being returned from pmtpa squids; difficult to reproduce, but we could see several old versions of en:Barack Obama including a badly vandalized one from a couple users w/ Safari. Were not able to trace it to a particular squid before it finally got purged... somehow... Still hoping to figure this out.
  • 20:08 Tim: phasing in UDP logging configuration
  • 19:40 jeluf: deleted binlogs 100-109 on srv92
  • 18:00 jeluf: configured srv52 as apache
  • 16:58 Tim: upgrading all squids to 2.6.8 RC with UDP logging patch.
  • 15:40 jeluf: restarted squid on sq6, deleted binlogs 50-59 on srv88

January 21

  • 22:20 jeluf: db1 and samuel added to the pool.
  • 15:15 jeluf: shutting down db1 and samuel, copying the DB from db1 to samuel.

January 20

  • 20:20 mark: Raised the route cache on avicenna and alrazi by setting net.ipv4.route.max_size = 65536
  • 20:05 mark: Florida was down due to troubles on avicenna:
    printk: 23355 messages suppressed.
    dst cache overflow
  • 09:00 brion: switched on nofollow for enwiki article space in response to Jimmy's earlier request and the rumor of a spam championship targeting WP

January 19

  • 16:30 jeluf: fixed disk space on samuel and srv88, restarted squid on yf1000 and yf1003
  • 16:30 jeluf: fixed nagios config to reflect the master/slave change, the removal of harris, the new role of db10, ....
  • 14:26 brion: going read-write
  • 14:15-20ish brion: switching masters, disabling samuel from db.php
  • 14:00 brion: switched samuel to explicit read-only while working on it ftm
  • 13:50 brion: got woken up, some problem about full dbs
    • samuel disk full
    • all non-en locked

January 17

  • 20:15 Tim: phasing in UDP squid logging, trialling it on a few hosts
  • 03:27 Tim: set up henbane as a log host. Started udp2log on it. For now it is logging the access log of hawthorn, as a test.
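
For readers unfamiliar with the idea behind UDP logging, here is a minimal sketch of a receiver that collects log lines sent as UDP datagrams. It is not the udp2log implementation; the function names, the loopback address and the sample log line are all hypothetical and chosen only for illustration.

```python
import socket


def open_log_socket(host="127.0.0.1", port=0):
    """Bind a UDP socket for incoming log datagrams (port 0 = any free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def collect_lines(sock, max_datagrams, timeout=2.0):
    """Read up to max_datagrams datagrams and return the log lines they carry."""
    sock.settimeout(timeout)
    lines = []
    for _ in range(max_datagrams):
        try:
            data, _sender = sock.recvfrom(65535)
        except socket.timeout:
            break  # nothing more arrived within the timeout
        lines.extend(data.decode("utf-8", errors="replace").splitlines())
    return lines
```

Because UDP is connectionless, a sender can fire log lines at the log host without waiting for it, at the cost of losing lines if datagrams are dropped.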

January 16

  • 09:19 brion: got isidore moved over to harris' old external IP and on the public vlan, woo
  • 05:30 brion: broke isidore when experimenting with external IP on it. :D will poke it more shortly
  • 02:59 brion: bringing isidore back up to speed, plan to replace downed harris

January 14

  • 21:00 mark: Brought up db10 as a Squid with 16 GB memory and 8 disks, both to test and to help with the current upload load
  • 19:00 mark: Installed Ganglia on a few misc Ubuntu servers
  • 16:14 mark: Updated DNS names in zone wikimedia.org, removed the AAAA record for ns2 as it's reducing service quality for IPv6 users (geobackend doesn't work with ipv6)
  • 16:05 mark: Made knams use lily as the primary DNS resolver, secondary bayle in Tampa.
  • 15:45 mark: Set up PowerDNS Recursor on yf1019 as the new primary resolver for yaseo. Secondary is bayle in Tampa, no real point in having a secondary locally there...
  • 13:40 mark: Reducing cache_mem on sq1 - sq10 to give Linux more disk cache memory for the beating of tomorrow. Restarting those squids to have it take effect.
  • 10:56 brion: indeed that seemed to be most of it. restoring it for cascading pages only (english Main Page for now)
  • 10:18 brion: db cpu way up; testing removing the templatelinks update on page view to see if that's it
  • 09:55 brion: svn up'd to version with updated tooltips and cascading protection.

January 13

  • 23:15 mark, Rob: Kernel upgrades on sq1 - sq10 seem to cause hangs at bootup. Had to reinstall sq1, sq6 and sq8 to get them to boot again. Put package linux-image-2.6.17-10-server on hold to keep them from upgrading. Other types of servers don't seem to have issues...
  • 21:49 Rob: srv126 locked up again. Rebooting to bring back online.
  • 20:32 Rob: Memory test ran 38 passes on srv144 with no errors, rebooting srv144. Back online.
  • 19:26 mark: Replaced PowerDNS from Edgy on bayle by our custom package. Wasn't ours because the i386 variant wasn't built and in our repository before...
  • 18:46 mark: DNS migration complete. ns0.wikimedia.org migrated to bayle, ns1 to yf1019 and ns2 to lily, all using the new wikimedia-task-dns-auth package, and newer PowerDNS. Procedure for updating DNS has changed slightly, please read!
  • 17:08 mark: Starting authoritative DNS migration; please don't do any DNS updates until I'm done later.
  • 16:45 mark: Reinstalling yf1019 for use as a DNS server

January 11

  • 23:30 brion: stopped storage2 dump tests
  • 23:09 mark: reverted bacon change, overloading amane
  • 23:00 mark: memory frag issues back on florida image squids - many boxes down. restarting doesn't help. bacon is very loaded too - removed it from the squid conf to see if it helps anything
  • 18:13 Tim: replaced loreley with perlbal on diderot. Loreley was only staying up for a few minutes at a time. Perlbal is only using 30% CPU.
  • 18:04 RobH: srv52 was crashed. Rebooted. HTTPD still offline per JeLuF, but ssh is now working.
  • 17:55 RobH: sq8 locks on boot reading 'Starting Up' Attached to serial 10 for further troubleshooting.
  • 17:51 RobH: Started memtest on srv144 per JeLuF request.
  • 17:38 RobH: srv111 same error as srv126. Server is online after reboot.
  • 17:33 RobH: srv126 was down in nagios. Looked like an OS crash/lock. Rebooted and is now online.
  • 17:21 RobH: sq11 returned from SM and connected. Set console redirect to port 9. No OS present.
  • 16:55 RobH: thistle issues, would not detect mbr/boot media. Booted in to raid management for JeLuF to access via serial console.
  • 16:44 RobH: srv129 ssh not accessible. Rebooted and now works.
  • 16:26 RobH: srv133 ssh not accessible. Connected console and system rebooted. Server now online and working.
  • 16:16 RobH: Corrected db6 DRAC settings, should work now.
  • 16:00 RobH: Replaced bad drive in array for db8. Did not rebuild.
  • 15:45 RobH: Replaced the secondary powercord for db10.

January 10

  • 19:14 Tim: restarted loreley on diderot
  • 19:11 brion: spot-tests seem to retrieve bits out of ES just fine on storage2. possibly temporary problems such as overload, but not really sure. will investigate further later, but at least storage looks safe
  • 18:46 brion: external.log was being flooded by errors with enwiki blobs from the dump running on storage2 (have paused it). looking into whether it's a bug w/ storage2 setup or if we've got corrupt stuff
  • 18:01 brion: rotated external.log, oversized

January 9

  • river: borrowing fuchsia + vandale to dry-run zedler reinstall
  • 23:15 brion: replaced SSL cert on friedrich so fundraising.wikimedia.org has a non-annoying cert
    • expires in a month, will want to upgrade to a paid one by then :)
  • 22:12 mark: Set up header address rewriting on To: and CC: headers rewriting old mailing list addresses to new ones on lily
  • 21:30 river: v6 outgoing smtp connections are experimentally enabled on lily
  • 21:15 jeluf: removed binlogs 10 to 29 on srv88.
  • 19:00ish: big eswiki template issue
  • 18:55 river: repaired some broken knams hosts (fuchsia, mayflower, mint) to fix v6 config
  • 17:30 jeluf, domas: Current status: Loreley broke, apache waited for loreley, apaches overloaded, lvsmon depooled some, load was too high for the remaining apaches, lvs pooled some again, vicious circle -> *boom*. manually pooled all working apaches, site still slow.
  • 16:45 jeluf: restarted loreley.
  • 16:15 jeluf: killed loreley on diderot since it was no longer answering requests. Domas disabled lucene search.
  • 01:52 brion: restarted pdns on browne
  • 01:45 brion: some kind of internal dns breakage in pmtpa
  • 01:10 river: fixed ipv6 at knams (assigned proper IP to every host, disabled autoconf, fixed dns, fixed netmask (removes the annoying "wrong prefix" message))
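
The "vicious circle" described in the 17:30 entry can be illustrated with a toy model: when overloaded servers are depooled, the same load is spread over fewer machines, which can push the rest over the edge. The sketch below is a simplification with made-up numbers; it does not model lvsmon's actual behaviour or the repooling step.

```python
def simulate_depooling(total_load, capacities):
    """Depool servers whose even share of the load exceeds their capacity.

    Returns the number of pooled servers after each round, starting with
    the initial pool size. All figures are hypothetical units.
    """
    pooled = list(capacities)
    history = [len(pooled)]
    while pooled:
        share = total_load / len(pooled)
        survivors = [cap for cap in pooled if share <= cap]
        if len(survivors) == len(pooled):
            break  # stable: nobody is overloaded
        pooled = survivors
        history.append(len(pooled))
    return history


# Ten servers, two of them slightly weaker than the rest.
capacities = [100] * 8 + [90] * 2
print(simulate_depooling(800, capacities))  # load fits: pool stays intact
print(simulate_depooling(950, capacities))  # weak servers drop, then all do
```

In this model the only way out of the collapse is to override the automatic depooling, which matches the manual pooling of all working apaches noted in the entry.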

January 8

  • 21:25 brion: all.dblist was corrupt, with many missing entries and a bunch of nonexistent wiktionaries. (cf bugzilla:8544) copied pmtpa.dblist back over it, which seemed ok. have a copy in my home dir if someone's interested in a postmortem
  • 14:18 mark: We were saturating Kennisnet's uplink, moving some more images to pmtpa
  • 13:14 mark: Installed rng-tools on lily; apparently it has a hardware RNG :)
  • 12:07 mark: Many Exim processes blocked on /dev/random because of a starved entropy pool on lily. Disabled outbound TLS for now; the real fix is to get a better random source or link Exim to OpenSSL instead of GnuTLS.

January 7

  • 20:36 mark: Removed mailing list info-de-l by akl's request
  • 18:40 Tim: restarting mysql on db7, changing innodb_flush_log_at_trx_commit from 1 to 2.
  • 17:09 mark: Moved a couple more countries from knams to pmtpa for images
  • 16:40 mark: The new Mailman hit a crash with some messages (~ 5 / 24h) in i18n.py that caused these messages to be shunted. Deployed a dirty fix/workaround to prevent this from happening.
  • 14:05 Tim: srv52 is down, replaced its memcached slot with srv53
  • 12:45 brion: fixed crond on leuksman
  • 12:00 brion: srv53,srv54,srv61,srv68 sudoers files fixed
  • 12:00 brion: srv3,srv53,srv54,srv61,srv68 do not update correctly due to broken sudo; poking at them (shut down apaches)
  • 11:37 brion: running deleteDefaultMessages on all wikis serially (did a scap)
  • 11:30 thistle down
  • 02:27 Tim: Experimentally configured ariel to serve only enwiki's watchlist queries. It is the sole member of the watchlist query group.

January 6

  • 21:46 mark: Set up a privacy filter for mailing lists, but in freeze mode instead of bouncing.
  • 19:02 Tim: brought srv89 into ext. storage rotation
  • 18:30 Tim: installed memcached on about 5 reinstalled servers, added them to the spare list
  • 17:58 Tim: srv129 is broken, returning 404 errors via apache with no access via ssh. Took it out of rotation. Requires manual restart and reinstall.
  • 17:30 Tim: due to a configuration error, adler had no load and samuel (the everything else master) had lots. Fixed it.
  • 17:00 Tim: all enwiki servers were lagged by about 5 minutes. Sent STOP signal to backup job running on storage2.
  • 16:57 Tim: brought srv68 into apache rotation
  • 16:45 Tim: removed srv111 from memcached rotation, it's down. Deleted binlogs on srv89 to free up space.
  • 16:00 hashar: srv89 / partition is full.
  • 15:55 hashar: added new namespaces for itwikibooks (see bug 7354 & 8408).
  • 10:00 mark: Starting migration of mailing lists
  • 7:15 jeluf: Starting maintenance of OTRS. Migrating to new version and new DB servers srv7 and srv8

January 5

  • 19:39 brion: i think i bashed civicrm urls into shape. broke it for a while when trying to update the serialized config array in database (CR-LF sucks!)
  • 17:15 brion: updated wikibugs, now in svn (under tools/wikibugs)
  • 16:42 RobH: storage1 would not boot reliably. Reseated all cards and memory, it now boots just fine. No OS currently loaded.
  • 16:31 RobH: srv134 was in read-only filesystem. Ran manual FSCK and rebooted.
  • 16:19 RobH: Rebooted Will and set console redirection to 9600
  • 6:20 jeluf: SCSI errors on db8. / was remounted read-only due to these errors. Rebooting.
  • 6:00 jeluf: Cleaned up disk space on srv92, removed old binlogs 40-69

January 4

  • 18:00ish robchurch: ...seems to have fixed itself?
    • no, still throwing errors - on commit, it's going nuts at the top with "insufficient disk space, please try later" repeated over and over
    • Fixed by domas: pascal:/var/log/ldap.log took all the disk space. It needs to be rotated / zipped.
  • 17:35 robchurch: BugZilla is dead, e.g. "my bugs" produces ./data/versioncache did not return a true value at globals.pl line 358
  • 15:00 Tim: made Rob Church a bugzilla admin
  • 14:40 Tim: reset mysql root password on pascal. New root password is in /root/.my.cnf

January 3

January 2

  • 15:24 mark: Increased cache dir size of knams text squids to 10 GB per disk.
  • 15:24 mark: Installed a new DNS recursor/resolver on lily
  • 10:40 brion: set up temporary web server on storage2 to watch the dump testing... it's pulling from ES directly (no previous XML to pull from) so may be extra slow, but should get cleaner copies from it in case old errors have accumulated
  • 10:30ish brion: made another internal chair wiki for anthere
  • 10:20 brion: browne dns temporarily broke... or something... after updating dns
  • 09:53 brion: shutting srv134 down, didn't come up after reboot
  • 09:38 brion: rebooting srv134 via ipmi, hopefully

January 1

  • 20:45 jeluf: db3's replication slave is broken. Processlist shows the same query all the time. show slave status is hanging. Restarting mysql.
  • 20:45 jeluf: srv134 is running apache, but doesn't allow SSH logins. Can't be updated by scap any more.

December 30

  • 19:35 mark: Shutdown BGP session to ar1, as ar1 seems to be the culprit of the packet loss. Uplink will be reconnected to a PM core router in the upcoming days.
  • 19:13 mark: udpmcast.py wasn't running on goeje, started it.
  • ~14:00 mark: There seems to be 4-6% packet loss outgoing to Hostway. Routed some problematic traffic over TWTC.
  • 12:40 mark: Rebooted csw5-pmtpa as an attempt to solve some strange issues we've been seeing
  • 04:30 brion: running dump-generation tests on storage2

December 29

  • ~16:00 Tim: setting up srv53, will do the others soon. Various problems experienced probably due to FC3, FC4 would have been easier.
  • 13:25 Tim: installed ganglia-metrics on ubuntu squids
  • 02:39 mark: Shutdown gi0/8 on csw1-pmtpa (srv128's port) by request of brion
  • 02:35 brion: whining about srv128 being generally slow and broken
  • 01:55ish brion: site notices and general briefly broken by bad interwiki database file generation. not entirely sure how that happened o_O
  • 01:40 brion: rotated 5gb db error log file :P; setting up internal office wiki
  • 01:11 Kyle: srv144 was off, not sure why. Kill apache just in case.
  • 01:05 Kyle: srv126, srv129, and srv144 have had their ram replaced and apache killed for sanity. (Commented rc.local)
  • 00:55ish brion: updating DNS for office.wm.o
  • 00:46 Kyle: srv53, srv54, srv61, and srv68 have a fresh FC3 and a new raid card and are ready for apache service.

December 28

  • 22:44 mark: Installed storage2 for backup purposes by request of Brion. has a 3TB RAID-10 array, JFS, with some space left in the volume group.
  • ~19:55 Tim: installing gmond on various squid servers
  • 18:05 Tim: running updateSpecialPages.php on all wikis, with some of the more expensive pages disabled.
  • 16:50 mark: Noticed that goeje was about to crash again due to overload (most likely not hardware failure). There were lots of smtp and python (mailman) processes running. After killing them, load dropped and the box was under control again.
  • 10:56 Tim: running resolveStubs.php on jawiki

December 27

  • 12:38 Tim: remounting db1:/a with noatime
  • 11:53 Tim: running moveToExternal.php on jawiki (to cluster6)
  • 10:32 Tim: removed fedora mirror from srv81
  • 10:05 Tim: started replication on srv89 (old cluster8 master), from position srv88_log_bin.000001, 0.

December 26

  • 15:30 to 19:00 Rob: Replaced Fans with SM Warranty Tech for Storage1 and Storage2. Storage1 will not boot correctly. Storage2 is online and ready for testing once again.
  • 15:30 Tim: enabled variant aliases (e.g. http://sr.wikipedia.org/sr-el) on all serbian wikis
  • 12:14 Tim: disabled firewall on db7
  • 12:05 Tim: put db7 into rotation
  • 08:52 Tim: shut down mysql on db7, will shortly reboot it to enable write-behind caching

December 25

  • 06:02 brion: added a little more sanity checking for error messages. man, this cache script sucks :D
  • 05:37 brion: adjusted thumb cache script for better validity checking, de-escaping of input filenames so that images w/ punctuation or non-ascii chars should be less problematic (bugzilla:8367)
  • 02:48 brion: migrating files from benet to amane again to free space
  • 02:48 brion: requesting reboot on srv85
  • 02:40 brion: srv85 not responding; benet disk full; removing srv85 from slave rotation on es to lower load and examine it

December 24

  • 11:27 Tim: started replication on db7

December 23

  • 22:00 brion: reenabled CIA-bot notifications on svn commit, in e-mail mode to hopefully not hang
  • 05:47 brion: srv15 freaking out about cpu temperature and 'running in modulated clock mode'. taking out of service ... no wait, it stopped. odd. leaving it
  • 00:41 brion: running title cleanups; RLM/LRM marks now stripped from titles, and any other mystery borkages...

December 22

  • 23:47 brion: batch-initialising user_editcount fields
  • 19:00 mark: Shut down apache and lighttpd on amaryllis
  • 07:16 brion: aaaand it's up!
  • 07:00ish brion: colo guys trying to swap hardware back with harris to see if that works; if not we'll use the backup
  • 06:35 brion: goeje not coming back after reboot attempts. restoring its pre-move backup to harris, going to put it into place for now if we can't get it back up soon
  • 05:54 brion: goeje is not online. what happened?
    • It crashed after being online for about half an hour.
  • 04:20 Kyle: I'm not so sure mail is flowing correctly...
  • 04:01 Kyle: srv3 had no memory errors. Brought back up with no apache.
  • 03:53 Kyle: goeje's chassis swapped with harris's. Mail seems to be flowing again.

December 21

  • 22:06 brion: killed a stuck search index rebuild dump process... had been stuck since november 11! sigh... was holding up the build loop for small wikis

December 20

  • 23:56 mark: Prepending our ASN once on TWTC's link
  • 22:47 mark: Gave Adrian Chadd (adri) access to knsq15 (like yf1010), as he needed a busier server to test.
  • 19:45 brion: goeje back online; had to force mailman to restart again, its stupid lockfiles get left
  • 19:30 brion: requested reboot for goeje, dead again. kyle plans to transplant the drive into another mobo/chassis tomorrow, mark plans to replace the whole shebang in a week or two when we have a fresh new machine
  • 04:05 Kyle: Tests performed on Storage1 <- Results are on the page.
  • 02:50ish domas: been doing horrible things to db1
  • 00:25 mark: Loreley on diderot was stuck on a futex again, had to restart it.

December 19

  • 20:32 mark: Cleaned up the Squid leechers block list, updated some IPs. Most entries had long expired, domains no longer existed, URLs invalid or IPs were reassigned.
  • 19:36 brion: mostly recovered from MASSIVE SLAVE OVERLOAD due to bad sorting in Special:Categories query change
  • 19:16 brion: all dbs updated, so scapping to current mw
  • 18:12 mark: Installed db8.
  • 16:40 mark: Automated Ubuntu installs on internal servers are now possible. Installed db9.
  • 15:12 brion: set up apc on friedrich; it was disabled, making load pretty high since people have been linking the new fundraiser report pages which are a bit php-intense
  • 14:54 mark: Set up a forward Squid (the 2.6 version from Edgy, not our Wikimedia variant) on khaldun TCP port 8080 for use by internal servers, to let them access external webservers like security.ubuntu.com.
  • 08:44 Tim: reading enwiki dump into mysql on db7
  • 05:29 brion: master switch done. there was a brief period of 'write lock' errors on wikis not in the s2 or s3 group due to my slip-up. the use of read-only mode will have prevented this from causing data integrity problems (yay)
  • 05:17 brion: starting master switch for s2/s3 adler -> samuel
  • 04:54 Tim: started slave on ariel
  • 00:17 brion: running schema updates on db3

December 18

  • 15:08 mark: Doubled cache_dir size for knsq1 as a trial
  • 15:07 mark: Why does sq29 have a weird cache_dir setting?
  • 13:05 Tim: using ariel for SQL dump. Stopped slave.
  • 10:26 Tim: brought db6 back into rotation
  • 08:50 Tim: zwinger's root partition was full. Switched off debug-level logging in syslog on zwinger, deleted debug log.
  • 08:22 Tim: setenforce 0 on db6. This was the reason /etc/init.d/mysql wasn't working.
  • 07:26 Tim: mysqld on db6 was apparently running directly from a bash prompt, instead of via mysqld_safe. Restarting it, and increasing the default maximum number of FDs by editing /etc/init.d/mysql appropriately.
  • 06:31 Tim: starting master switch from db3 to db2
  • 05:30 Tim: installed ganglia on ariel
  • 03:40 brion: webster was missing its local socket for mysqld so couldn't be root-logged in locally in mysql. restarting daemon.
    • cron.daily/tmpwatch is suspected; could clear socket files after 10 days of no detected use...?
  • 03:20 brion: the following slaves were running with read_only OFF in violation of reliability guidelines:
    • db2 ariel db6 webster holbach
  • 03:11 brion: noticed ntp seems broken on db6; selinux is denying access to files?
  • 03:09 brion: depooled db6; replication broken
    Error 'Can't find file: './enwiki/text.frm' (errno: 24)' on query. Default database: 'enwiki'.
  • 00:32 brion: running db schema updates on slave servers (in a screen session on zwinger)

December 17

  • 23:26 brion: lowered ariel's priority from 100 to 50; it's consistently 15-30 seconds lagged
  • 19:15 jeluf: pooling db6
  • 18:07 mark: Users were reporting out-of-sync watchlists, nagios reported slave not running on db6, SHOW SLAVE STATUS confirmed. I depooled db6 by commenting out in db.php.
    • Odd error message: Error 'Can't find file: './enwiki/recentchanges.frm' (errno: 24)' on query ...
    • jeluf: Stopped slave, started slave, works fine. File was in place, no idea why mysql didn't see it
    • domas said that db6 was out of file descriptors
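
The "errno: 24" in the MySQL error is consistent with domas's diagnosis: on Linux, errno 24 is EMFILE, "too many open files", so the .frm file was present but could not be opened. A small sketch, assuming a Unix-like host, that decodes the number and shows the current process's file descriptor limits:

```python
import errno
import os
import resource  # Unix-only module

code = 24  # the errno value quoted in the MySQL error message

# Map the number back to its symbolic name and OS message.
name = errno.errorcode.get(code, "unknown")
message = os.strerror(code)
print(f"errno {code} = {name}: {message}")

# Per-process limits on open file descriptors (soft limit, hard limit).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file limit: soft={soft}, hard={hard}")
```

This reports the limits of the Python process itself, not of mysqld; it only shows where such a limit lives.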

December 16

  • 21:54 brion: disabled CIA hit from SVN post-commit script; it's been hanging a lot lately
  • 05:40 Kyle: db5, db6, and db7 are at 1G and have the normal root password and are ready for mysql service.
  • 04:36 Kyle: Replaced cables for db5-10
  • 00:44 brion: set default sitenotice w/ basic fundraising info; tweaked the old one from last year a bit as the text is a little cleaner than the tiny anonnotice from en.wikipedia

December 15

  • 21ish brion: enabled UsernameBlacklist extension on dewiki by request
  • 19ish brion: enabled DismissableSiteNotice extension sitewide (now in svn, and with localization fixed for button)
  • 19ish RobH: Re-installed FC5 on db5. Confirmed cables for db5-db10 need replacement prefabs for gigabit operation.
  • 17:05 jeluf: increased retry count for "mysql running threads" and "lucene"
  • 15:35 Tim: moved srv41 from apache to search, in enwiki pool.
  • 15:10 Tim: re-added srv40 to the search pool
  • 08:47 Tim: updated FixedImage configuration
  • 08:40 Tim: stepped clock on amane (167s off)
  • 08:26 Tim: set up cron job for fundraising meter in amane:/etc/cron.d/fundraising . Configured lighttpd to send Cache-Control: max-age=300,s-maxage=300 for the relevant file.
  • 8:20 jeluf: removed binlogs 50-69 on adler
  • 6:20 jeluf: db6 added to enwiki pool
  • 5:30 jeluf: removed binlogs 1..29 on srv92
  • 5:00 jeluf: copying enwiki DB to db6
  • 12:00 onward RobH: Racked db5-db10. Enabled drac, installed fc5 on db5-db7
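
The Cache-Control value set for the fundraising meter in the 08:26 entry tells both browsers (max-age) and shared caches such as the squids (s-maxage) to keep the file for 300 seconds. Below is a minimal parser sketch for such a header value; the function name is hypothetical and it deliberately ignores quoted commas and other edge cases.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a dict of directives.

    Numeric arguments become ints, other arguments stay strings, and
    bare directives (e.g. "no-cache") map to True.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        name = name.strip().lower()
        arg = arg.strip().strip('"')
        if not sep:
            directives[name] = True
        elif arg.isdigit():
            directives[name] = int(arg)
        else:
            directives[name] = arg
    return directives


meter = parse_cache_control("max-age=300,s-maxage=300")
print(meter["s-maxage"] // 60, "minutes in shared caches")
```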

December 14

  • 7:45 jeluf: added ariel to the mysql pool.
  • 5:25 jeluf: copying mysql from db2 to ariel. Ariel has a broken disk in its RAID. Rob set up a new array without the broken disk.
  • 04:51 Tim: added names recursor0.wikimedia.org and recursor1.wikimedia.org for the new resolver VIPs, and also reverse DNS.

December 13

  • 23:35 mark: switch traffic back to yaseo

December 12

  • 14:29 mark: Doubling the size of the cache dirs of knams upload squids - it seems they can take it, others will follow if successful.
  • 13:35 mark: yaseo traffic suddenly dropped quite a bit, which seems like routing trouble. Sending all yaseo traffic to pmtpa.

December 11

  • 21:00 brion: updated search-rebuild-wiki script to use getSlaveServer to force slave use; an enwiki build was slurping from master, which made domas complain
  • 20:35 brion: restarted mailman, was accidentally left off a couple hours ago after a list archive modification
  • 18:45 jeluf: rebooted srv120. Its apache stopped answering several times.
  • 18:00 jeluf: locked mowiki (switched to readonly mode), according to resolution [1] and the voting at [2]
  • 17:40 Tim: enabled oversight on all wikis. The policy issue can be decided by stewards when they grant access, I don't have time to read yet another set of 600 debates.
  • 17:30 jeluf: added eth2 to /etc/sysctl.conf on srv147, started apache
  • 17:25 jeluf: restarted crashed squid on sq6
  • 16:45 Tim: recompiled FSS on srv145, was using old version
  • 16:40 Tim: took db4 out of rotation
  • 15:11 Tim: Updated index pages for download.wikimedia.org and static.wikimedia.org.
  • 12:00 db4 went down

December 10

  • 22:00 mark: Disabled options rotate in all srv*'s /etc/resolv.conf to use only the primary nameserver in normal circumstances. Also changed all nameserver lines to the new resolver service IPs 66.230.200.17 and 66.230.200.18 earlier, which caused some weirdness with the Foundry (overflowing CAM table?)
  • 21:04 mark: ariel.pmtpa.wmnet resolved to suda's ip due to my mistake, fixed
  • 19:51 mark: Set up a secondary DNS resolver temporarily on khaldun - until we have a new mailserver.
  • 18:56 mark: Setting up a new DNS resolver (pdns-recursor) on bayle. Made it forward internal zones to ns0.wikimedia.org. srv1 now slaves these zones from ns0 as well, so do not edit zonefiles on srv1! albert doesn't even seem to have the internal zones, I'm not fixing that, redoing the entire setup.
  • 10:00 Domas: yesterday db4 was deployed with 4.0.28/tcmalloc - seems to be still working, but performance difference does not seem to be very huge. Needs proper benchmarking.
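
To make the 22:00 resolv.conf change concrete: with "options rotate" the resolver spreads queries across the listed nameservers, and without it the first nameserver is tried first every time. The sketch below generates the two variants. The helper name is hypothetical, and the real files on the srv* hosts may have contained other lines (search domains, for example) that the log does not show.

```python
def build_resolv_conf(nameservers, rotate=False):
    """Render a minimal resolv.conf body for the given nameserver IPs."""
    lines = [f"nameserver {ip}" for ip in nameservers]
    if rotate:
        # Round-robin queries across the nameservers listed above.
        lines.append("options rotate")
    return "\n".join(lines) + "\n"


# Resolver service IPs quoted in the log entry.
resolvers = ["66.230.200.17", "66.230.200.18"]
print(build_resolv_conf(resolvers))               # primary first, always
print(build_resolv_conf(resolvers, rotate=True))  # previous behaviour
```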

December 9

  • 23:59 Mark: Installed Ubuntu on bayle
  • 23:00 Kyle, Mark: Tried to install the 2 new storage servers, but there's something seriously wrong with the write performance of their OS arrays:
    1048576000 bytes (1.0 GB) copied, 1475.95 seconds, 710 kB/s
    • storage2.wikimedia.org is up as a temporary test.

  • 15:30 - 17:44 Kyle, Mark: Reinstalled all remaining Squids (sq14 - sq30) with Ubuntu Edgy so they run tcmalloc.
  • 17:28 Kyle: srv78 switched to correct kernel and rebooted. Killed apache just in case it's old.
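
The throughput in the 23:00 write test can be reproduced from the two raw numbers in the quoted output, which shows how slow the arrays were: 1000 MiB took nearly 25 minutes to write. A short worked calculation using only the figures from that line:

```python
bytes_copied = 1_048_576_000  # from the quoted output (1000 * 1024 * 1024)
seconds = 1475.95             # elapsed time from the same line

rate = bytes_copied / seconds  # bytes per second

print(f"{rate / 1000:.0f} kB/s")         # decimal kilobytes, as in the output
print(f"{rate / 1024 ** 2:.2f} MiB/s")   # same rate in binary megabytes
print(f"{seconds / 60:.1f} minutes elapsed")
```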

December 8

  • 16:11 Tim: started new static HTML dump
  • 11:45 jeluf: started srv(117|120|121|145|148|149) apaches after scap.
  • 04:38 Kyle: srv117 has acpi off. I would like to see how it runs. Killed apache on this too just in case.
  • 04:28 Kyle: srv120 ram replaced. Killed apache and awaiting sanity check before service.
  • 04:22 Kyle: srv121 ram replaced. Killed apache and awaiting sanity check before service.
  • 04:05 Tim: deleted adler binlogs 40-49 (to November 28)
  • 02:00 mark: yf1010 was not in the Mediawiki trusted XFF list. Added all yaseo servers just to be sure.

December 7

  • 23:38 mark: Pooled Adri's testserver yf1010
  • 22:58 mark: Reinstalled yf1010 with Ubuntu Edgy, for temporary use by Squid developer Adrian Chadd - he has root access on the box.
  • 21:10 hashar : adler is running out of disk space ( 12/400GB free) [3]
  • jens: rebooted goeje at some point

December 6

  • 01:10 mark: Reinstalled yf1000 - yf1009 with Ubuntu Edgy to run the latest Squid deb. Just sq14 - sq30 left...
  • 22:16 mark: Deployed squid_2.6.5-1wm6 (with tcmalloc) on all Ubuntu Edgy Squids. Dapper Squids need to be upgraded, libgoogle-perftools is only available in Edgy.
  • 21:20 mark: Disabled Squid's coredumps again, they were causing more problems (filled up filesystems) than helpful information. I'll enable them selectively on certain debug-Squids from now on.
  • 21:19 brion: rebuilt stats table for frwikiquote, was empty/broken
  • 20:30ish brion: fixed info.txt with updated version
  • 17:55 brion: restarted leuksman web server, was mysteriously crashed again

December 5

  • 23:20 jeluf: added ariel back to the pool after mark reinstalled it to 64bit and domas set up mysql.
  • 19:23 mark: sq13 was running with 100% CPU, probably memory fragmentation. Installed the experimental tcmalloc squid deb.
  • 19:00 mark: ariel was installed with Fedora 32 bit, which is "not helpful". Remotely reinstalled it with Ubuntu Edgy AMD64. Had to move it to public VLAN for that, so new hostname is ariel.wikimedia.org.
  • 07:08 Tim: started mysqld on db2
  • 05:57 jeluf: OS configuration of ariel. Currently copying mysql from db2 to ariel
  • 05:26 jeluf: switched DNS back to use all three datacenters.
  • 02:58 Kyle: sq29 rebooted. Down for unknown reason.
  • 02:51 Kyle: srv117 brought up, and apache killed for sanity.
  • 02:46 Kyle: ariel is available.
  • 01:47 brion: killed srv3's apache and removed its LVS address so it won't restart itself. it doesn't sync scripts properly...

December 4

  • 23:48 mark: Moved knams traffic to pmtpa
  • 23:33 brion: knams down
  • 23:15 mark: Experimenting with bigger COSS cache dirs on knsq15
  • 23:15 brion: rerunning SUL pass 0 migration test with new schema
  • 20:45 mark: Running Squid linked to tcmalloc on knsq2 and hawthorn to try to solve the malloc fragmentation problems
  • 19:50 brion: reopened fr.wikiquote on the board's orders

December 3

  • 20:24 mark: loreley on diderot was blocked on a FUTEX. After killing and starting it wouldn't keep running, so started perlbal instead.

December 2

  • 14:59 mark: Fixed yf1001 and yf1013, yf1001 is up as a text squid.
  • 09:15 brion: completed manual tweaks for blob recovery and did a bunch of purges of affected pages
  • 02:39 brion: running disambiguation recovery for blobs
  • 00:39 brion: put srv89 into read-only while i continue working w/ it
  • 00:31 brion: srv3, 117, 121 also had bad config files and were saving into srv89. sigh. stopped those, and now poking dbs

December 1

  • 23:43 brion: bad page saves discovered on frwiki and perhaps others. bad blobs saved onto srv89 former ES master, accidentally its Apache was brought up with non-updated config files. have updated files on srv89, will need to find and clean up affected blobs.... somehow... :D
  • 01:23 river: replaced perlbal on diderot with loreley
  • 01:00 mark: Running an unoptimized Squid (-O0) on sq8 and knsq1 to get useful coredumps

November 30

  • 23:59 mark: Set coredump_dir /var/spool/squid in squid.conf
  • 14:59 mark: knsq15's hardware has been replaced. Installed it, it's up as an image squid
  • 12:44 river: testing loreley on diderot next to perlbal
  • 06:57 Tim: changed access rules on the text squids to allow queries to the bot entry points even from user agents on the stayaway list
  • 06:15 Tim: relaxed restrictions for missing user agents in checkers.php: allow for query.php, api.php and action=raw
  • 05:49 Tim: re-added srv82 to ext storage
  • 04:45 jeluf: synced nagios config to reflect the changes in the MySQL setup (i.e. Master of cluster 8)
  • 04:30 jeluf: Changed "root reserve" of /a on db1 from 5% to 0%
  • 04:25 jeluf: squid on sq8 crashed, restarted it
  • 03:55 brion: restarted data dumps on srv31 and benet
  • 03:14 Tim: upgraded FSS on srv88 and srv82, were using the old segfaulting version

November 29

  • 12:00 domas: note from yesterday, lucene perlbal was swapping with 500MB VM - memory leak in there. used 20MB after restart.
  • 10:50 mark: oprofiling squid on knsq1 and knsq3
  • 04:37 brion: rotated spam blacklist log - hit 2gb limit

November 28

  • 20:38 sq12 disappeared
  • 18:40 mark: Installed knsq2 (which was unreachable before) as squid.
  • 16:57 Tim: set up index page for http://upload.wikimedia.org/ . Also changed the MIME type for .html on amane to text/html.
  • 16:45 mark: Changed routing policy to send a bit more traffic to TWTC
  • 16:36 Tim: deleted adler binlogs 1-39
  • 16:21 brion: running centralauth pre-migration pass 1 testing (in a screen on zwinger)
  • 07:56 brion: running centralauth pre-migration pass 0 testing (in a screen on zwinger)
  • 07:20ish brion: webster replication broke from centralauth inserts confusing the limited replication. domas fiddling with settings
  • 07:02 Tim: srv89 not back up. Took it out of ES rotation, made srv88 the new cluster8 master.
  • 06:56 Tim: restarted srv89, wasn't responding to ssh
  • 04:55 brion: creating dummy centralauth db on commons servers, going to start back-end migration testing tonight
  • 01:40 brion: added wikisv-skilkom-l list

November 27

  • 19:39 jeluf: changed hardcoded "ariel" in nextJobDB.php into "db4" since ariel is down and job queues were filling up.

November 26

  • 08:22 Tim: changed some wiki logos
  • 00:30 mark: Removed sq1.pmtpa.wmnet - sq10.pmtpa.wmnet from internal DNS, as those servers have moved to external
  • 00:18 Kyle: sq3 up and ready for squid.

November 25

  • 23:55 Kyle: removed audit on srv82. Stopped apache for sanity check.
  • 23:50: mark, Kyle, JeLuF: reinstalled srv7 and srv8 with Ubuntu Edgy as a misc DB cluster for things like OTRS, bugzilla, etc...
  • 23:42 Kyle: brought up srv117, but killed apache for sanity check.
  • 16:00 - 22:30 Kyle, mark: Reinstalled sq1 - sq13 as Ubuntu Edgy squids
  • 17:53 Tim: same on srv120,srv71,srv56,srv59
  • 17:46 Tim: srv110 came up for some reason. Did sync-common, fixed time, recompiled FSS.
  • 16:19 Tim: removed XFF logs from March to July
  • 11:57 ariel died

November 23

  • 05:52 jeluf: added srv83 to external storage cluster 6, disabled srv82.

November 22

  • 23:33 brion: set wgAccountCreationThrottle to 1 on frwiki in response to proxy vandal attack
  • 21:55 brion: took srv82 out of ipvsadm manually
  • 21:50 brion: srv82 is breaking srwiki, doesn't respond to ssh. needs taking out of service
  • 18:15 jeluf: restarted squid on sq12 and sq13. They were down.

November 21

  • 17:00 domas: amane unhappy - restarted nfsd with more children (/etc/sysconfig/nfs created), restarted lighty and php env.
  • 13:03 Tim: did scap. Lots of servers started segfaulting about 15 minutes later. Disabled the new FSS stuff, that fixed it.
  • 06:40 brion: reopened access to stats.wikimedia.org now that the files are scrubbed
  • 04:24 Tim: deployed text squid configuration: redirected static.wikipedia.org from srv31 to albert.
  • 04:22 Tim: removed old keys for yaseo servers from zwinger:/etc/ssh/ssh_known_hosts. Hey, I don't suppose we could back these up and restore them next time we reinstall servers?

November 20

  • 19:06 mark: Anthony overloaded, sending en: thumbs back to amane
  • 17:40 mark: TWTC transit back up
  • 16:30 mark: Disabled HELO checking on albert, it was bouncing valid e-mail
  • 03:05 Kyle: srv83 - removed auditd. You can now log in.
  • 02:54 Kyle: Ram replaced in sq3, ready for service.

November 18

  • 11:30 mark: TWTC BGP session down for unknown reason, stuck in 'CONN' state

November 17

  • 19:50 mark: Installed TWTC transit
  • 16:08 brion: fixed
  • 16:05 brion: message serialized files maybe borked, missing some new data. :( trying to regen
  • 15:05 mark: Playing with Varnish on hawthorn
  • 11:00ish brion: fixed problem on arbcom-l where mail vanished into ether; for reference, problem was extra blank lines in the spam filter sending all mails into discard bitbucket
  • 07:14 Tim: Re-added srv74 to the ext store list.
  • 01:55 Tim: Most apaches have recovered, either by themselves or through my action, but srv34 is still in swapdeath.
  • 01:25 memory usage jump on some apaches, sending some into swap.

November 16

  • 11:40 brion: set $wgGenerateThumbnailOnParse back on for private wikis using img_auth.php, as img_auth doesn't automagically pass through not-yet-generated thumbnails
  • 11:34 mark: Removed proxy-only from all squid.conf sibling lines, as I believe it actually decreases performance and cacheability in various ways. We'll see what the actual effect on the site is.
  • 08:22 Tim: While investigating dumpHTML performance, I found that the NFS client in TCP mode was regularly pausing for 15 seconds, before disconnecting and reconnecting to the server. This was occurring for both /mnt/upload3 and /mnt/static. Switched srv122, srv123, srv124, srv125, srv42 to UDP mode in response, for all NFS shares.

November 15

  • 23:24 brion: goeje back up after reboot; took a couple hours to get pm to do this; slow response to email and phones were busy. possibly support overload due to their recent exciting network problem?
  • 11:57 brion: changed check-time script to use full path to ntpdate; some machines didn't have it in local path while scapping

November 14

  • 18:57 mark: Reinstalled hawthorn, iris, lily
  • 16:40 Tim: started HTML dump on albert
  • 16:16 Tim: srv143 and srv144 do not have the VIP on lo, presumably because of the now-fixed problem with eth2 and rc.local. Restarting, will attempt to bring into the pool.
  • 16:08 Tim: fixed sysctl.conf on srv141, restarted
  • 16:04 srv121 went down
  • 15:50 Tim: added srv121 and srv123 to the apache pool. Installed ganglia on srv121.
  • 14:46 Tim: installing mediawiki on albert for use as a static HTML dump controller
  • 11:50 mark: Disabling the old knams Squids; new servers seem to be running just fine
  • 09:34 Tim: going to run rsync --delete on the thumbnail servers, to fix outdated files which weren't purged, and cached error messages
  • 7:20 jeluf: restarted squid on sq13 (squid crashed around 2a.m.)
  • 7:15 jeluf: cleaned up disk space on adler
  • 06:25 Tim: fixed url encoding problem in HTCPpurger, set up synced copy in /usr/local/bin

November 13

  • 21:34 mark: Put knsq8 - knsq14 into production as image squids
  • 21:05 mark: Put knsq1 - knsq7 into production as text squids
  • 19:19 mark: knsq1, knsq4-knsq14 OS installed. knsq2 is inaccessible because of wrong BIOS settings (my fault), knsq15 seems broken, as it doesn't want to enter BIOS and just says "System halted!".
  • 16:12 mark: Adding knsq1-15 to MediaWiki's XFF list
  • 16:08 mark: knsq3 entered production as a text squid
  • 15:52 mark: Installed Ubuntu Edgy on knsq3.
  • 15:51 Tim: What is this?
+                               global $wgMaxShellMemory;
+                               $wgMaxShellMemory *= 3;
+

stuck in the middle of reallyRenderThumb()? I'm not really a fan of exponentially increasing memory limits. So if I use 3 djvu images on a page, then I get up to 4GB for all subsequent images? Cool!

Rest assured that if I was Brion, I would be swearing right now, instead of making sarcastic comments.

  • 05:36 Tim: deleted some old lighttpd error logs from amane, to free up root partition space
  • 03:39 Tim: amane full too, deleting April, May and June backups
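
The 15:51 entry above objects to multiplying a global memory limit in place inside the thumbnail renderer. A minimal sketch of why that compounds, written in Python rather than PHP: the starting limit of 102400 KB and both function names are assumptions for illustration only, since the log does not give the configured value.

```python
# Hypothetical starting limit in kilobytes; the real configured value
# is not stated in the log.
BASE_LIMIT_KB = 102400


def limits_when_mutating_global(image_count, base=BASE_LIMIT_KB, factor=3):
    """Limit in effect for each image when the shared limit is multiplied in place."""
    limit = base
    seen = []
    for _ in range(image_count):
        limit *= factor  # the raise persists for every later image
        seen.append(limit)
    return seen


def limits_when_scoped(image_count, base=BASE_LIMIT_KB, factor=3):
    """Limit in effect for each image when the raise applies to a local copy only."""
    return [base * factor for _ in range(image_count)]


if __name__ == "__main__":
    print("mutating global:", limits_when_mutating_global(3))
    print("scoped raise:   ", limits_when_scoped(3))
```

With the in-place multiplication the limit grows as base × 3ⁿ after n images, while a scoped raise stays at base × 3 for every image.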

November 12

  • 15:25 brion: benet full; migrating files. sigh

November 11

  • 13:36 brion: removed 80.242.195.68 from tor node list in mwblocker.log by request
  • 13:19 brion: enabling email notification on commons
  • 10:25 brion: upgraded leuksman.com to mysql 5.0.27
  • 06:20 Tim: created Server roles
  • 02:52 Tim: holbach was still replicating from samuel! Switched it to adler and took it out of rotation while it catches up.
  • 02:20 Tim: running schema updates on the old masters

November 10

  • 16:02 Tim: updated nagios configurator, made it draw MySQL server lists from db.php instead of elsewhere.
  • 15:28 brion: restarted mailman runner on goeje; stale lockfile was left from the downtime
  • 14:48 Tim: starting master switch
  • 14:21 Tim: set up sync from /home/wikipedia/upload-scripts to local hard drives for thumb-handler.php etc.
  • 14:07 Tim: fixed cache-control headers for thumb.php error messages. Symlinked bacon's thumb-handler.php to amane's.
  • 09:07 Tim: goeje back up. I'm not sure if it was my request to PM or to Kyle which got through. I haven't heard anything from either of them.
  • 08:13 goeje down
  • 06:01 Tim: srv53 down, removed from memcached pool.
  • 02:57 Tim: srv83 is down, removed from external storage rotation. Ports are open but nobody's home.

November 9

  • 22:45 brion: commenting crawl-delay out of robots.txt; hopefully this is obsolete and no longer needed
  • 16:30 Tim: had conversation with VoiceOfAll (VoABot operator). He has patched the bot but has asked that it remain blocked until he has a chance to update the code, later today. The patch he describes will probably fix the problem, but it doesn't sound like he has the bug completely characterised, so I'm not 100% confident. I'm happy for the bot to be unblocked, but we should also implement some kind of protection on the server side against this kind of thing.
  • 12:32 Tim: applying patch-rc_user_text-index.sql to slaves
  • 11:00 brion: clearing math rows with the 'extra - at end of html' bug, so they'll re-render on next page parse
  • 06:30 Tim: blocked VoABot, was causing lock contention, about 100 concurrent threads running on the master.

November 8

  • 17:18 Added knsq1-15 to wikimedia DNS. Reverse DNS needs delegation.
  • 17:18 mark: Resized knams subnet from /27 to /26... on the router and pascal only. Other servers still need to be done. Updated pascal's dhcpd.conf and pdns-recursor.conf
  • 16:25 brion: updated dump runner scripts to use getSlaveServer.php instead of hardcoding servers
  • 14:08 Tim: set up staggered search restart
  • 12:08 Tim: frwiki search index rebuild was going very slow, maybe because it is using adler which has no cache of frwiki. Trying removing the --server option from dumpBackup.
  • 11:15 brion: hopefully fixed the parsertests automated reporting
  • 01:07 Tim: moved srv38 back from the search cluster to the apache cluster. It's doing OTRS DB, which conflicts with the resource requirements of search. Moved srv40 from apache to search in its place.
  • 00:23 Tim: stopped using perlbal for "small" search cluster. Split traffic among the 4 servers by crc32 hash of DB name instead.
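
The 00:23 entry above describes splitting search traffic among the 4 servers by a crc32 hash of the DB name. A minimal Python sketch of that idea follows; the host names and the helper `pick_search_host` are hypothetical, and the production code may have reduced the hash to a server index differently.

```python
import zlib

# Hypothetical names standing in for the four "small" search servers.
SEARCH_HOSTS = ["search-a", "search-b", "search-c", "search-d"]


def pick_search_host(db_name, hosts=SEARCH_HOSTS):
    """Map a wiki DB name to one host, deterministically."""
    # Mask to an unsigned 32-bit value so the result is the same on any platform.
    checksum = zlib.crc32(db_name.encode("utf-8")) & 0xFFFFFFFF
    return hosts[checksum % len(hosts)]


if __name__ == "__main__":
    for db in ("frwiki", "dewiki", "jawiki"):
        print(db, "->", pick_search_host(db))
```

Because the mapping depends only on the DB name and the host list, every client picks the same server for a given wiki without a balancer in between. Changing the number of hosts remaps most wikis, which is the usual trade-off of plain modulo hashing.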

November 7

  • 14:15 mark: Deleted all the upload.* ACLs on the text squids, should save a few percent of CPU
  • 13:47 Tim: set up two parallel search updater threads on srv37: one for enwiki and one for the rest.
  • 13:17 mark: Upgraded the PowerDNS Recursor to 3.1.4-pre3 on pascal, mayflower, amaryllis (security fix)
  • 11:30 mark: Successfully upgraded khaldun to Ubuntu Edgy
  • ~08:00 - 10:10 Tim: upgraded to nagios 2.5 (from source). Managed to get it sorting in natural order, after a lengthy battle.
  • 07:50 Tim: made "sort by hostname" in ganglia use natural order
  • 06:55 Tim: Due to a change in fedora, some of our servers just have /etc/rc.local, some have /etc/rc.local as a symlink to /etc/rc.d/rc.local, and some have both /etc/rc.d/rc.local and /etc/rc.local as regular files. Standardised on having a symlink from /etc/rc.local to rc.d/rc.local, mainly to avoid the problem of "decoy" files. Synchronised rc.local from /home/config, to fix the eth2 problem.
  • 06:40 Tim: Fixed rc.local on srv136 (eth2 problem). Did restart test. Also did restart test on srv78. It hasn't come back up yet.
  • 06:24 Tim: srv78's problem appears to be firewallinit.sh. Removing firewallinit.sh invocation from all apaches using sed -i~ '/firewallinit/d' /etc/rc.local . The problem may continue to recur on the many apaches that are currently down.
  • 05:30-05:50 Tim: stepped clocks on srv6, srv39, anthony, alrazi (318s!), srv10. Samuel and adler have no routing to zwinger, srv78 has no routing to 10/8.
    • Samuel and adler actually had the wrong IP address cached for zwinger, nscd -i hosts fixed them.
  • 05:25 Tim: put a time check in apache-sanity-check. Warning only. Can be run independently from /h/w/b/check-time.
  • 04:52 Tim: In nagios, set up a router dependency for knams and yaseo. Hopefully this will make for less noisy flapping on IRC.
  • 01:45 mark: Redirected upload requests with referer wikipedia - download . org to http://upload.wikimedia.org/not-wikipedia.png
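
The 07:50 and ~08:00 entries above mention sorting hostnames in natural order in ganglia and nagios. A minimal Python sketch of what natural order means for names like these is below; the helper `natural_key` is hypothetical and is not the code used in either tool.

```python
import re


def natural_key(hostname):
    """Split a name into text and integer chunks so srv2 sorts before srv10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", hostname)]


if __name__ == "__main__":
    hosts = ["srv10", "srv2", "srv136", "sq1", "srv78", "sq13"]
    print("plain sort:  ", sorted(hosts))
    print("natural sort:", sorted(hosts, key=natural_key))
```

A plain string sort compares character by character, so srv10 lands before srv2; comparing the digit runs as integers restores the order an operator expects.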

November 6

  • 22:35 brion: resynced ntp config on srv63, srv74; were about 8 and 10 seconds off respectively
  • 20:30 jeluf, brion: srv3 needs a mem check. Apache is segfaulting at 30 times the rate of other servers. Powered off.
  • 19:00 jeluf: bw rebooted db4 since it was no longer pinging. Had to fsck /a after reboot, now recovering mysql.
  • 18:18ish brion: also killed runJobs.php on several apache boxen, they also were spewing connection loop errors
  • 18:14 brion: removed db4 from rotation; it's down and spewing a giant 11-gig dberror log
  • 18:00ish jeluf: rebooted srv119, srv3, srv142, srv32. Their apache always died after one to two minutes of service. Running fine since the reboot.

November 5

  • 22:30 brion: installed corrected fix for bugzilla:1109 which I think causes the intermittent 'application/octet-stream' errors for people. a recent addition of an output buffer in PHP via a live hack broke the old protection, which only peeled back one output buffer on 304 events, incorrectly assuming it would be the compression handler.
  • 17:26 Kyle: sq3 has MCE errors, RMA'ing RAM
  • 15:39 Tim: maurus ran out of disk space, cleaning up
  • 14:30 mark: Installed squid-2.6.5-1wm2 with a bugfix for squid bug #1818 on yf1005 - yf1007, but another bug showed up.
  • 13:20 Tim: sent frwiki search load to srv39
  • 12:05 Tim: installed normal (i.e. unicode NFC) extension on srv37
  • 03:00 mark: sq3 seems down

November 4

  • 23:23 mark: Brought sq1 and sq3 up as Edgy squids, for stability testing (RAID controller)
  • 22:55 mark: Upgraded all Ubuntu squids at pmtpa
  • 20:40 mark: Upgraded all yaseo squids
  • 19:29 brion: restarted nscd on mediawiki-installation group; 45 of 146 machines had nscd not running. load on ldap server went waaaay down after that :D
  • 19:18 brion: fiddling with logging for ldap on srv1. shut off from srv2 as no idea if it's set up right
  • 18:45 brion: started ldap on srv2 which is supposedly failover ldap. mark is also fiddling with srv1
  • 18:20 brion: restarted ldap server, lots of machines whining and confused
    • doesn't seem to have helped. lots of machines still complain about unknown user id or "you don't exist, go away"
  • 16:14 Tim: Started search index update on srv37, in an infinite loop.
  • 15:49 Tim: Search index update finished, syncing
  •  ? brion: fixed dump bug, restarted enwiki dump
  • 15:05 mark: Created squid-2.6.5-1wm1 deb, and included a fix for a crash bug we were experiencing. Installed it on ragweed and yf1004, will deploy on all other squids if nothing bad happens for a while.
  • 13:34 brion: rotated dberror.log, was too big for 32-bit boxen
    wow this sucks, we really should replace the logging infrastructure

November 3

  • 09:44 brion: upgraded leuksman.com to php 5.2.0 final release
  • 06:28 Tim: restarted mwsearchd on maurus, disabled squid
  • 05:42 Kyle: The APC is enabled and has anthony, bayle, isidore, and yongle on it.
  • 05:35 Tim: traced unusual disk activity on srv38 back to the DeleteAlbertMailerDaemon job, in OTRS's GenericAgent. Changed the job to delete bounce messages which have arrived in the last hour, rather than doing a search of all 370,000 tickets.

November 2

  • 16:59 brion: disallowed all mailing list archives from robots.txt now
  • 16:46 brion: got mailman-htdig working
  • 15:00 mark: Created temporary channel #wikimedia-tech on irc.wikimedia.org. See you there?
  • 14:23 mark: Deflecting some traffic from knams to pmtpa
  • 14:20ish brion: upgrading mailman for htdig search
  • 14:19 mark: Freenode is under DDoS
  • 14:00 jeluf: irc.freenode.org does not resolve any longer. The cname points to chat.freenode.net, which gets a "*** Can't find chat.freenode.net: No answer" reply on nslookup
  • 09:43 brion: applying ipblocks schema updates
  • 08:45 brion: rebuilt wikifr-l archives to suppress some messages due to a problem; unfortunately the numbering got thrown off by something much earlier in the archives, possibly the old 'from' bug. oh wells

November 1

  • 21:43 river: scap broke blocking since db changes weren't applied, reverted PHP files from r17355
  • 17:05 Tim: back to 3, small cluster couldn't handle it
  • 16:10 Tim: back to 2 partitions. If the small servers can serve requests in ~200ms by hitting the disk, we may as well let them. srv38 and 39 will be better utilised by enwiki, which needs more CPU power allocated to it. We just have to be careful that the disk I/O on the small servers doesn't become saturated.
  • 15:37 Tim: split search nodes into 3 partitions instead of 2.
  • 15:22 Tim: sending dewiki search requests back to the "big" pool
  • ~15:10 Tim: moved srv38 and srv39 to the search pool
  • 14:20 Tim: started search index rebuild for all wikis
  • 14:15 Tim: Inserted bfr (from /home/wikipedia/src/bfr) into the pipe in search index rebuilds. It seems to improve performance, by ensuring that MWSearchTool does not stall waiting for dumpBackup.php.
  • 13:40 Tim: took srv37 out of apache rotation for lucene stuff
  • 5:11 Kyle: srv144 has bad ram, will RMA.

October 31

  • 16:55 brion: redirected sep11 to sep11memories.org
  • 11:51 Tim: added "umask 002" to JeLuF's .bashrc
  • 09:12 Tim: noticed that srv61 and srv67 are down, memcached instances with them. Brought in the spares.
  • 04:45 Tim: set up srv145-149
  • 04:26 Tim: srv144 crashed
  • 03:57 Tim: setting up srv126,srv138,srv141,srv143,srv144,srv145
  • 03:45 Tim: installed ganglia on srv121-145
  • 03:36 Tim: set up apache on srv122

October 30

  • 20:29 Kyle: srv146 - srv149 are available.
  • 16:24 Tim: fixed ganglia
  • 15:37 brion & mark: trying to fix ganglia, still borked
  • 15:27 mark: Started Apache on zwinger
  • 04:24 Tim: added bart to the trusted XFF list

October 29

  • 16:24 Tim: locked sep11.wikipedia.org at Erik's request

October 28

  • 14:45 Tim: removed dkwiki from all.dblist, old alias for da

October 27

  • 14:36 brion: adding redirect & querycachetwo tables, not yet populated
  • 05:04 Kyle: configured ipmi on srv121?? Maybe? I'm not sure how to test it.
  • 04:56 Kyle: srv39 was off, I don't know why. I turned it on. Also a bunch of unreachable srv's were fixed. (Of the newest batch)
  • 04:27 Kyle: sq1, and sq3 have Ubuntu Edgy. But need a password.
  • 04:04 Kyle: Accidentally rebooted zwinger! Sorry!
  • 03:37 Kyle: Replaced power supply in sq11, it's back up.

October 26

  • 21:15 mark: Reinstalled yf1010 with Ubuntu Edgy, instead of Dapper. Install went ok, but needs a few more tweaks to the preseeding files to make it fully automatic again.
  • 19:00 brion: tightened down friedrich, nfs /home no longer mounted

October 25

  • 23:08 mark, kyle: Swapped sq11's mainboard, reinstalled it and brought it up as an upload squid
  • 16:47 Tim: installed FSS on new apaches, added to install-modules51
  • 16:40 Tim: running rebuildMessages.php
  • 15:25 Tim: Set system-wide default for ssh ConnectTimeout to 5 seconds, on zwinger
  • 14:30 Tim: finished user table schema changes
  • ~13:30 Tim: switched masters to db2 and samuel.
  • 12:52 mark: Fuzheado says PMTPA is blocked in China. Updated the GeoIP maps to make sure as many Chinese IPs as possible resolve to yaseo
  • 08:00 jeluf: Updated apaches 121-145, added to the pool, fixed startup scripts (use of eth2 instead of eth0/1). Still broken: srv122, srv126, srv136, srv138, srv141, srv145.
  • 04:00 Kyle: New apaches were down because of poor power distribution. It's fixed now and they are back up.
  • 03:19 Tim: starting user table schema changes

October 24

  • 12:34 mark: Deployed a newer PyBal on pascal, avicenna, alrazi and yf1018
  • 11:14 brion: wikibugs bot wasn't running; restarted it on goeje and added run-wikibugs to rc.local
  • 06:23 brion: restarted postfix on leuksman.com; svn mails were stalled
  • ~06:00 Tim: installed Dancer's dsh as ddsh on zwinger, changed scap and sync-file to use it. It shares perl dsh's node group files, via a symlink.

October 23

  • 19:00 jeluf: added srv121 and srv123-srv134 to the farm. srv122 and srv135 are unreachable. srv136-145 died earlier during a "scap". I've no idea why.
  • 16:13 mark: Users reporting image problems with IE in yaseo. Depooled dryas from the upload queue. What was it doing there and wtf wasn't it logged?
  • 08:20 Tim: made scap faster by turning off "lazy backups" and using an rsync daemon on suda instead of cp -prfu over NFS. Set up scap to recompile and install texvc automatically.
  • 07:18 Kyle: srv121-135 are available. srv143-145 fixed.

October 22

  • 14:08 mark: Deployed a newer PyBal on pascal

October 21

  • 18:00 jeluf, domas: installed apache&al on srv136-srv145. srv143 was already broken when we started, srv144 broke during the installation (had to reboot it, didn't come back)
  • 17:00ish brion: hack-bumped the $wgStyleVersion again
  • 17:55ish brion: tweaked mail servers on leuksman.com again
  • 16:50ish brion: did a svn up & scap; there may be some css/js issues with the changes to section edit links. germans have broken js
  • 16:15 brion: ldap is broken on srv144
  • 15:24 brion: updated leuksman.com to PHP 5.2.0RC6
  • 15:11 brion: disabling disused MWBlocker extension include; on new boxen we're no longer installing the PEAR xml-rpc, since we don't use it and the install kept breaking
  • 14:58 brion: removed ganglia port and interface options from mwsearch.conf, trying to see if these get through ganglia... manual from rabanus does go through using gmetric without the specifiers on the command line
  • 09:04 jeluf: created otrs-de-l, otrs-it-l
  • ~05:25 Tim: synced files on srv63, was out of date. Initialised srv103 as a memcached hot spare.

October 20

  • 13:30 Domas: enabled holbach, lomaria, ixia with higher loads.
  • 03:48 Kyle: srv136 - srv145 are available for service.
  • 00:36 Tim: noticed that srv68 is down, memcached instance included. Brought the hot spare on srv119 into rotation.

October 19

  • 10:31 Tim: updating fedora mirror
  • 00:09 Kyle: srv136 available. (More soon)
  • 00:09 Kyle: srv54, srv55, srv63, srv66 rebooted. Bad raid controllers.

October 18

  • 23:12 Kyle, Mark: csw1 uplinked to csw5.
  • 21:46 brion: upload.wm.o dead in pmtpa

October 17

  • 16:29 mark: Started Mailman on goeje
  • 16:00 mark: goeje back up after a PM reboot request.
  • 15:50 mark: Users reporting loss of session data. mctest.php reported srv55 down, which indeed doesn't reply to ping. Replaced its memcached slot by srv62.
  • 15:18 mark: Because goeje went down, srv1 couldn't resolve DNS, which brought the entire cluster into dismay (fun). Made srv1 forward to zwinger,goeje (in that order). Recursing DNS really needs to be fixed.
  • 15:00 mark: goeje went down
  • 13:45 mark: Converted sq14 and sq15 to upload squids
  • 07:01 jeluf: set up srv6 as thumb server, serving de/thumb, taking load from anthony, which is only serving en/thumb now
  • 06:39 brion: enabled AntiSpoof extension for active prevention as well as logging

October 16

  • 19:52 jeluf: restarted mwsearchd on coronelli
  • 18:40 jeluf: moved thumbs/en/ to anthony, which is now serving thumbs/en/ and /thumbs/de/. Set up another HTCPpurger in the second page of the screen session.
  • 14:38 mark: Increased swap size per COSS cache_dir from 5000 to 8000 on sq12 and sq13... After 4 days they had only a 4% i/o wait.
  • 14:30 mark: Disabled Squid cache digests, as I don't believe they work well in our very dynamic environment, and may actually decrease cache efficiency.
  • 12:20 mark: Squids were set to deny HTCP CLR requests from the pmtpa internal subnet, so purging didn't work in pmtpa. Fixed.
  • 03:54 brion: updated viewvc on leuksman.com to 1.0.4-dev
  • 02:09 Tim: installed FastStringSearch (fss)
  • 01:04 Tim: installed gmetricd on sq2-10
  • 00:43 Tim: ran updateArticleCount.php on the new wikis, to correct for a previous bug in the same script.
  • 00:14 mark: Reinstalled yf1000 - yf1004 with Ubuntu, set them up as text Squids. Taken yf1019 out of rotation.

October 15

  • 23:00ish brion: mysterious spike in apache cpu usage and segfaults, haven't figured out cause yet. reverting recent changes to mw to test
  • 21:30 mark: Reinstalled yf1005 - yf1009 with Ubuntu, set them up as upload squids. Set up LVS on yf1018, pointed upload.yaseo at it...
  • 18:54 mark: Changed MediaWiki's HTCP purge method from 'NONE' to 'GET' to make Squid 2.6 purge again
  • 18:40 mark: Built a new squid-2.6.4-2wm1 .deb with debug symbols and --enable-stacktrace, and installed it on sq15
  • 17:00 mark: Lots of Ubuntu Squids (with COSS) crashed around the same time. Restarted them.
  • 16:42 brion: added charset header on 404 page to fix utf-7 silliness
  • 16:15 mark: Fixed NTP on amaryllis. Y! has blocked UDP port 123, so SNAT to a high port...
  • 14:20 mark: Creating two separate Squid groups with distinct default origin servers and "special destinations": text for MediaWiki content from the Apaches, and upload for static content from Amane and the thumb servers. This allows us to tweak the two very different Squid groups much better. Each group has its own subdir under /h/w/conf/squid, along with a separate subdir with a backup of the old setup. Yaseo doesn't have its own upload group yet, but I hope to rectify that today.

October 14

  • 23:30 mark: Installed Ubuntu on clematis, it's back up as a Squid
  • 11:30 jeluf: migrated upload.wm.o/wikipedia/de/thumbs/ to anthony, migration of /wikipedia/en/thumbs/ still running.
  • 08:24 brion: [4] was somehow stuck in sq21's cache as a 301 to wikimediafoundation.org. UDP multicast packets to purge it could be seen when using ?action=purge, but had no effect. manually sending a PURGE over port 80 cleared it successfully
  • 07:35 brion: adjusted 'missing wiki' screen to send a 404 response instead of 200; should keep some transient errors out of caches more nicely
  • 07:29 brion: adding wikimania2007.wm.o to dns, preparing for wiki setup
  • 07:03 brion: recompiled utfnormal extension on benet against proper ICU headers *cough*, restarted dump thread 4
  • 06:48 brion: recompiled utfnormal extension on benet w/o -fPIC, restarted dump thread 4
  • 06:12 brion: started pmtpa data dumps
  • 05:00 Kyle: New ram in srv74, let's see how it does.
  • 04:48 brion: migrating some old dump data from benet to amane to make room for next dump run
  • 04:50ish brion: unmounted broken khaldun mount from benet

October 12

  • 18:00 mark, jeluf: added thumb server bacon. Serves upload.wikimedia.org/wikipedia/commons/thumb/[0-3]/*. Currently, the squid.conf is a live hack. The next deployment will break this again, unless squid.conf.php is fixed.
  • 17:05 mark: Set originserver on all parent cache_peers in squid.conf. This makes Squid treat parents as origin content servers instead of proxy caches, and therefore enables Connection: keepalive and non-proxy GET requests.
  • 15:10 mark: amane overloaded, tweaked its TCP settings a little more
  • 07:39 Tim: secure.wikimedia.org back up, courtesy of mod_proxy.
  • 07:00 jeluf: installed lighty on bacon, changed thumb handler to save images it got from the apaches to the FS. 0/* has been copied from bacon, 1/* currently running. Todo: HTCP listener to delete thumbs
  • 05:57 Tim: disabled wiki stuff on secure.wikimedia.org temporarily, bart was overloaded. Will try to find a permanent solution involving proxying.
  • 03:50 brion: started apache on leuksman.com, died again. :(
  • Set somaxconn = 1024 and tcp_max_syn_backlog = 4096 on the old image squids, and on amane.

October 11

  • 23:40 mark: Made sq12 and sq13 image squids
  • 22:30ish brion: a recently committed bug in ObjectCache caused the db to be used instead of memcached, grinding everything to a halt
  • 19:30 jeluf: copying amane's wikipedia/commons/thumb/* to bacon:/export/upload/wikipedia/commons/thumb using rsync on bacon, bwlimit 500

October 10

  • 23:00-* mark: Upgrading the new Squids sq12..sq30 to squid-2.6.4-1wm4 to enable COSS
  • 19:40 mark: Set connect-timeout=5 on Squid backend requests
  • 17:40 mark: Reduced amane's PHP processes from 64 to 32
  • 17:30 mark: Upgraded amane's lighttpd to 1.4.13.
  • 11:25 mark: Set up sq29 with COSS as well, though with different settings than sq30, to compare.
  • 11:00 mark: Started Squid on several of the new servers. Squid had disappeared...
  • 11:00 mark: Set up sq30 with COSS filesystems, using devices /dev/sda6, /dev/sdb, /dev/sdc, /dev/sdd.
  • mark: Set up an Ubuntu Dapper mirror on khaldun
  • 07:54 brion: took stats.wikimedia.org offline; contains private info, needs scrubbing

October 9

  • 21:25 mark: Set 'refresh-pattern ignore-reload' on upload squids
  • 21:03 brion: removed anthony from mediawiki-installation group
  • 20:35ish brion: disabled FancyCaptcha; now using SimpleCaptcha. seems to be lighter on amane's NFS for now
  • 20:15ish brion: restarted many pmtpa upload squids with high InActConn backed up in lvs
  • 18:00 mark,kyle: Reinstalled khaldun as dedicated install server / archive mirror
  • 18:00 kyle,jeluf: Rebooted holbach. After reboot, mysqld's error log shows duplicate key errors while replicating. Shut down mysqld.
  • 03:27 brion: disabled obsolete firewall rules on maurus; was preventing rsyncing of search index updates, stopping the ex-yaseo wikis from being searchable

October 8

  • 15:02 Tim: Doubled the memcached instance count. srv104-118 brought into service with srv119 spare.
  • 08:41 Tim: Stepped clocks on sq1-8, which were off by 8 hours. This was messing up ganglia. In the process of fixing NTP.
  • 03:45 Tim: zwinger's disks were very overloaded due to the PMTPA gmetad. The data size is only 120MB, but apparently it was syncing very often. I moved the rrds to a tmpfs with an hourly rsync to disk.
  • 02:38 Tim: holbach is down, took it out of rotation
  • 02:34 Tim: removing old static HTML dump backup on srv35
  • 02:12 Tim: Fixed disk space exhaustion on coronelli. MWDaemon.log was to blame.
  • ~02:00 Tim: installed gmetricd in various places. diskio_* metrics should now be available.

October 7

  • 22:00 jeluf: restarted db's on ixia and db1, with help of domas. Running 4.0.27 on db1
  • 19:30 jeluf: Shut down mysql on ixia, copying DB to db1
  • 17:30 jeluf: rebooted sq1, disabled squid. Mark depooled it from the LB
  • 15:30 (Squid on) sq1 is down and being odd again
  • 13:00 jeluf: rebooted sq1
  • 03:45 Tim: removed sq11 from LVS on avicenna manually, it was down again and pybal didn't remove it.
  • 03:35ish - timeouts connecting to rr.pmtpa
  • 03:20 Kyle: db1 is now up and ready to be setup.

October 6

  • 20:15 mark: Brought sq1 back up. The reason PyBal didn't depool it last night, not even during a restart, was that PyBal was in dry-run mode, in which it prints ipvsadm commands but never actually executes them. Apparently it has been inactive for weeks. Sorry!
  • 19:00 jeluf: unmounted khaldun:/usr/local/upload on all apaches, removed from fstab
  • 17:15 mark: Set up imap.wikimedia.org (which points to my private colocated server) as a temporary solution. Various @wikimedia.org aliases will be redirected.
  • 17:15 jeluf: restarted apache on bart. Nagios and OTRS were not responding
  • 01:33 brion: sq1 switchport reenabled; still hasn't fully shut down.
  • 01:20ish Tim: manually removed sq1 from lvs; pybal wasn't removing it for an unknown reason
  • 01:07 brion: rebooting sq1, still haven't figured out wtf is wrong
  • 00:55 brion: removed sq1 from pybal list while trying to kill its mad squid
  • 00:50 brion: restarting squid on sq1; insane load (30+), not responding
  • 00:44 brion: something wrong with upload.wikimedia.org; investigating. trouble connecting to pybal on alrazi; is it a problem with pybal or backends?

October 5

  • 21:32 brion: resyncing srv11 common files; all were missing!
  • 21:27 brion: wiped old copy of fundraising report scripts w/ redirect to new location
  • 19:30 mark: Set up ingress filtering on port e8/1 and e8/2 of csw5-pmtpa
  • Tim: Set up ganglia 3.0.3, more or less starting from scratch with the configuration. We now have a hierarchical arrangement of grids, with knams and pmtpa in the system at present; yaseo will perhaps follow later if we can get the ACLs set up.
  • 01:23 Tim: fixed replication on srv75. It's a read-only cluster so it's not critical. Had to skip some deleted binlogs, they were probably empty anyway. MAX(blob_id) looks fine.

October 4

  • 23:43 brion: starting search rebuilds for ex-yaseo wikis on maurus
  • 22:30 mark: Moved the console server to csw5-pmtpa and Wikia's network, so we have out of band access. Also moved the last bunch of machines off csw1-pmtpa.
  • 22:00 jeluf, kyle: hot-replaced amane's faulty drives, started rebuilding the RAID.
  • 21:26 jeluf: gzipped binlogs 1 and 2 on adler.
  • 17:40 jeluf: rebooted srv98,srv93,srv87,srv109 since their apaches locked up a few minutes after being restarted
  • ~13:15 Tim: albert was hanging, smtpd down. Mark's reboot -f attempts weren't working, so I did echo b > /proc/sysrq-trigger, which did the trick. It came up without the right VIPs; I fixed it temporarily, Mark will fix it permanently.
  • 09:20 Tim: postfix on albert had been broken since 23:56, restarted.

October 3

  • 22:30 mark: Installed Ubuntu on yf1005 (used it as a testing host)
  • 15:25 Tim: Deployed new external storage: srv87-89 as cluster8 and srv90-92 as cluster9
  • 12:14 mark: Deployed sq21..30 as text squids to see if brute power solves the TCP open problem.
  • 11:53 mark: zwinger is not letting me log in. Stalls after "entering interactive session."

October 2

  • 18:50ish brion: set up www.wikibooks.org portal
  • 15:42 brion: disabling writes to cluster6; it's overloaded
  • 15:15ish overload on ES
  • 14:40 Tim: srv54 went down, replaced its memcached instance with srv68
  • 12:40 mark: Made zwinger external only by disabling eth1 and changing the default gateway to 66.230.200.10.
  • 03:00-07:00 kyle, tim, jeluf, river: suda broke, zwinger broke. rebooted suda, moved zwinger's dns resolver to goeje (temporary only)

October 1

  • 13:00-20:00 mark, bw, river: moved uplinks over to csw5, set up BGP and began advertising our network. brief downtime due to router breaking.
  • 13:51 Tim: Fixed uploads on new wikipedias. I also fixed the absence of a spoofuser table, earlier today.

Archives