Server admin log/Archive 8

From Wikitech

September 30

  • 19:30 jeluf: Adding new wikipedias ([1])
  • 06:41 brion: yaseo wikis now living in pmtpa. yay!
  • various bits in yaseo migration plan
  • 03:12 brion: samuel back in rotation
  • 03:08 brion: fixed up samuel replication. need to clear out old binlogs and load yaseo again?
  • 03:01 brion: took samuel out of rotation; unexpectedly filled up with binlogs during import
  • 01:01 brion: holbach got forgotten; switched its master as well
  • 00:36-00:47 brion: switched masters from samuel to adler as part of yaseo migration plan

September 29

  • 18:13 brion: unlocking yaseo wikis to try fancier migration
  • 18:03 brion: locking yaseo wikis for yaseo migration plan
  • 09:25 Tim: upgraded 7z on srv31 to 4.42
  • 03:30 Kyle: srv74 has some ram error, I will need a mce log to figure out which stick. I will RMA.
  • 03:20 Tim: deleted text backups from amane for 2005 and Jan, Feb and March 2006. It now has 477GB free.
  • 03:00 Kyle: holbach up again. Raid card spits out a strange error. 3ware says it is cosmetic.

September 28

  • 23:37 Tim: srv74 was down too, took it out of external storage (cluster5) rotation. Re-added srv76, seems to be up and up to date.
  • 23:25 Tim: noticed holbach was down, took it out of rotation
  • 19:30 jeluf: srv34 looks better after a reboot.
  • 19:10 brion: had to kill -9 stuck apache procs on srv34
  • 16:35 jeluf: restarted apache on srv34. httpd was still running, but requests on port 80 were stuck in SYN_RECV.
  • 14:20 mark: Set /proc/sys/net/ipv4/tcp_max_syn_backlog to 4096 (instead of 1024) on sq12. Let's see if this makes a difference tomorrow...
  • 14:15 mark: Restarted all new Squids
  • 14:00 Something funky
  • 11:30 Tim: restarted static HTML dump

September 27

  • 22:45 mark: Backup run of amane on zwinger (in a root screen) stalled, tar process on dest server disappeared. If anyone wants to poke it go ahead, I'm sleepy...
  • 15:50 mark: Restarted all new squids because they were unhappy and dropping open requests
  • 5:14 jeluf: deleted binlogs 1 to 19 on srv86
  • 5:01 jeluf: switched rr.wikimedia.org back to knams (for European users)
  • 4:49 jeluf: switched upload.wikimedia.org back to knams (for European users)
  • 00:32 brion: upload.wikimedia.org no longer resolves correctly in europe
    • rumored to be fixed

September 26

  • 22:33 network issues at surfnet, removed knams from geodns
  • 15:20 mark: Added sq1-10.pmtpa.wmnet to alrazi's /etc/hosts file because internal DNS is overloaded
  • 14:50 mark: Started a backup run of amane's upload/wiki* on zwinger
  • 11:30 mark: Gave avicenna a public IP using a public gateway. Assigned it an LVS service IP 66.230.200.100 to use for text squids. Put sq12 - sq20 in production.
  • 10:40 mark: 66.230.200.101 was on sq11, which has crashed twice now and is apparently broken.
  • 05:40ish brion: fixed caching bug I introduced a couple weeks ago which caused extra hits to MonoBook's generated CSS and JS
  • 04:09 Tim: Assigned 66.230.200.101 (www01) to srv10. It was unreachable. Could it have been down since will went down, 7 days ago?
  • 00:32 Tim: Gave Erik mail aliases, steward access, subscriptions to internal-l and private-l and OTRS board member role

September 25

  • 23:56 brion: restarted dump threads on srv31. (had to reboot srv31 due to benet mount being broken and lots of hung procs)
  • 23:25 brion: restarted lighty on benet so download.wm.o works again
  • 21:40 mark: Some network moves. Made benet external-only and gave it a sane network config again.
  • 17:50 mark: sq11 up as Ubuntu Squid.
  • 16:00 domas: enabled srv84 ES with MyISAM'y blob store
  • 14:00 domas: whoever installed json.so, did break 32-bit apaches, haha, pthread_once() was probably called twice, muhaha, for now json disabled
  • 14:00 mark: Installed squid-2.6.4-1wm2 with an experimental performance patch by Adrian Chadd on sq13
  • 13:20 mark: Increased cache_mem from 1500 to 2048 on ragweed and sage - they have (unused) swap space now, and have 512 MB free, so let's use it.

September 24

  • 18:00 mark: sage up as a Ubuntu squid.

September 22

  • 07:00 jeluf: Extended Nagios monitoring: disk space on NFS and MySQL core servers, raid status on 3ware controllers, mailq of goeje and albert
  • 05:40 jeluf: disabled firewall on coronelli.

September 21

  • 15:20 brion: amane out of disk space, breaking uploads (0 bytes). deleting some old crap from disk

September 20

  • 16:30 brion: added clerks-l mailing list for enwiki arbcom clerks
  • 15:40 brion: fixed alternate master setting for enwiki; steward Makesysop wanted to talk to enwiki on ariel, which is now a locked slave
  • 15:30 brion: added spoofusers table for AntiSpoof extension, will build contents and start logging soon

September 19

  • 12:45 mark: Added 66.230.200.110 to rr.pmtpa.wikimedia.org DNS pool
  • 12:39 brion: added sq11-30 to $wgSquidServersNoPurge, there were reports of new squid ips being blocked
  • 11:50 mark: Brought sq12 and sq13 up as emergency Ubuntu Squids, with a preliminary Wikimedia .deb
  • 10:45 mark: will had broken network settings and is now inaccessible. Moved the squid service ip to srv6. I'll try to get 3 new squids up this afternoon to deal with the load...

September 18

  • 22:30 mark: sq11 and sq13 up with Ubuntu server installs
  • 19:30 mark: Replaced vlan 1 for vlan 100 which caused some downtime
  • 22:10 Kyle: Removed auditd on srv78 for login problems.
  • 22:00 Kyle: Reinstall of db4. It has a bad disk somewhere. Raid5 will find it.
  • 18:30 brion: enabled experimental revision text caching in memcached (set $wgRevisionCacheExpiry to 0 to disable)
  • 15:54 brion: fixed secure.wm.o IP alias on bart, was set to load only the old florida ip on boot
  • 10:55 mark: Reduced A records in DNS to just one service IP per squid

September 17

  • 12:44 mark: Updated udpmcast.py and made it send multicast packets with ttl 2 so they will pass a router between the two vlans
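The TTL change in the 12:44 entry can be illustrated with a minimal sketch. This is not the actual udpmcast.py code, which is not shown in this log; the function name is hypothetical. It only shows the standard socket option that controls how many router hops a multicast packet may cross.

```python
import socket


def make_multicast_sender(ttl=2):
    """Return a UDP socket whose outgoing multicast packets carry the given TTL.

    Hypothetical helper for illustration; not taken from udpmcast.py.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # The default multicast TTL is 1, which keeps packets on the local subnet.
    # Every router hop decrements the TTL, so a value of 2 lets packets
    # cross exactly one router, e.g. the one between two vlans.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock
```

A sender built this way would then call sendto() with the multicast group address and port in use; both are site-specific and not recorded here.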

September 15

  • 20:56 brion: enwiki db update failed. ariel crash?
  • 20:23 brion: applying page table update on pmtpa wikis
  • 18:40 mark: Disabled passive IGMP snooping and enabled PIM DM routing on csw5 to fix the purging problems.
  • 15:00ish brion: squid purging is broken apparently. Need help investigating the UDP multicast stuff
  • 09:15 brion: lists online
  • 08:19 brion: taking mailing lists down for a bit to back up and upgrade
  • 08:00 brion: updated leuksman.com to PHP 5.2.0RC4 and APC 3.0.12p2

September 14

  • 21:00 domas: started rebuilding srv84(myisam) from srv86 - disabled writes to cluster7 for a while.
  • 11:39 brion: srv84, srv116 back online and resynced. restored to nodegroup.
  • 11:25 brion: restarted lvsmon on dalembert with srv84 and srv116 commented out of 'apaches' dsh group
    • don't forget to turn them back on some day after they're fixed; they don't respond to ssh, waiting for reboot

September 13

  • 23:00 mark: Removed BGP announcement of Wikia's route on csw4-pmtpa, to disable incoming load balancing for them.
  • 22:30 mark: Rebooted csw5-pmtpa with new firmware. Moved avicenna and alrazi onto it, and reverted back to the Cisco - Foundry seems to insist on untagged default vlan 1
  • 19:00 brion: created sd.wikinews

September 12

  • 22:08 brion: continuing search index builds on maurus
  • 20:30 ævar: changed the logo on hrwikiquote and the wgSitename, didn't sync it (properly) though because all the servers hate me, boo hoo.
  • 19:06 brion: removed harris and borked goeje from internal dns; added harris.wm.o on external dns
  • 18:50 brion: restored port listen 80 on bart to prior config, with search.wm.o disabled
  • 18:00 brion: reclaiming harris for search.wikimedia.org
  • 15ish brion: moved search.wm.o to a separate ip, leaving bart free
  • 13:06 brion: temporarily disabled port 80 on bart.wm.o, testing very slow response on secure.wm.o/ticket.wm.o

September 11

  • 21:00 mark: avicenna broke, probably because of overload. Moved LVS services to alrazi and dalembert. avicenna is back after a reboot, on standby.
  • 19:30 jeluf: killed jobs runner loop on srv81, srv82 and srv83. Load on ES servers is too high
  • 18:24 brion: paused search index generator at frwiki; load on ES is too high
  • 11:50 brion: moved www.wikimedia.org country portals into svn [2]
  • 09:54 brion: installed new texvc on yaseo, built from srpms
  • 09:43 brion: new texvc seems to be missing at yaseo

September 10

September 9

  • 03:00 Kyle: rewired scs. A bunch of servers don't console, I'm going to try to fix a bunch. A few are also not on it yet. I'm not quite finished.

September 6

  • 13:28 brion: search index build restarted; weird mono thread-creation problem had hung it previously
  • ~11:30 Tim: installed report.py and cachemgr.cgi on amaryllis
  • 10:40 brion: restarted search build process on maurus, explicitly selecting db1/db3 as slaves this time
  • 09:38 brion: restarted enwiki dump
  • 09:14 Tim: brought yf1017 back into service as an apache (dual purpose with search)
  • 09:05 Tim: fixed gmetad writes on amaryllis
  • ~08:50 Tim: got gmetad, ganglia web working on amaryllis
  • 04:30 Kyle: sq1 back up. srv110, 62, and 117 up.
  • 01:00 river: srv62 broke, removed from memcached config

September 5

  • ~23:00 domas: innodb deadlock monitor deadlocked something on ipblocks table, ..
  • ~21:00 domas: restarted samuel and ariel with high TCP/IP backlogs and no name resolution, not sure if got rid of connection error problems completely
  • ~15:00 Tim: installing JSON PECL module
  • 10:30 brion: started pmtpa and yaseo dumps
  • 08:15 domas: lighty on mail.wikipedia.org

September 4

  • 22:45 Kyle, Mark: Put sq11 on the APC, which was empty and unconnected, but now on csw4-pmtpa:0/3 and SCS:9. sq11 is on SCS:14.
  • 22:00 - 00:30 Kyle, Mark: Massive network moves... including all DBs and search servers, so go wild, Brion...
  • 22:30 mark: Moved LVS on dalembert (Squids -> Apaches) to avicenna so dalembert could be moved.
  • 21:10 mark: Removed LVS services on avicenna which were still using the old IPs.
  • 21:00 mark: Deployed a new Squid.conf with all references to the old IP range replaced with new IPs.
  • 17:00 mark: Increased vandale's COSS cache_dir to 12 GB.
  • 12:49 brion: installing ploticus on yaseo boxen
  • somewhere... domas did something... which released a lot of email
  • 10:09 brion: some or all mail appears to be stuck? not getting bugmail for last few hours, self-mails seem not to get delivered

September 3

  • 20:25 domas: redirecting all domains mail to mail-eater, spammers used wiktionary.org... goeje... ergh...
  • 15:11 brion: setting up hourly search index synchronization in pmtpa
  • 08:58 brion: starting search-rebuild on maurus again; accidentally broke it last night

September 2

  • 10:35 river: restarted mailman on goeje again, was hung
  • 09:00 jeluf: Running runJobs.php on yf1008

September 1

  • 16:13 brion: recompiled mwsearch on maurus, added /etc/cluster, rewrote search-sync, started a full index rebuild (search-rebuild via search-rebuild-wiki) pulling from dbs.
    • auto syncs not set up yet, but will want to do those in the hourly restart, probably
    • running the builds in a screen session for now. will want to run it continuous later
  • 15:46 brion: added maurus back to mediawiki-installation group, set up a copy of php to do dump+index tests
  • 14:50 brion: set up rsync server on maurus so that search indexes can be updated more easily
  • 14:30ish brion: copied fedora 2 files back onto albert's yum mirror; they're gone from gatech mirror
  • 02:57 river: restarted mailman on goeje, was stalled

August 31

  • 15:10 brion: restarted sendmail on yaseo apaches. something was stuck somehow in them, causing the shell-outs to /usr/lib/sendmail to hang. since mail is sent *during* a db transaction in preference save, this caused some db locks also.
  • 14:39 brion: added proctitle on rest of yaseo apaches
  • 10:53 brion: yaseo problem appears to be stuck locks on user table, but i'm not sure why. trying to get the mighty domas on the case \o/
  • 10:29 brion: rotated bot-heuristic.log; over 2gb, broke 32-bit boxes
  • 09:50 brion: added proctitle on yf1009
  • 09:09 brion: yaseo seems rather sluggish

August 30

  • 13:36 Tim: started jobs-daemon on srv91-100
  • 12:50 mark: Testing Squid's COSS storage file system on vandale
  • 11:22 brion: fixing ^/wiki.phtml$ regexes in apache config to ^/wiki\.phtml$
  • 10:39 Tim: and on srv86-90
  • 10:30 Tim: started /h/w/b/jobs-daemon on srv81-85
  • 10:05 Tim: installed daemonize-1.4-5_wm on pmtpa apaches
  • 10:00 brion: added boardvote2006 to zedler's list of dbs not to replicate. may need to restart mysql to take effect. (replication is borked)
  • 06:00 Kyle: Finished moving sq1-10 to their new rackspace in prep for the new squids.
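The 11:22 regex fix above matters because an unescaped dot matches any character, not just a literal ".". A small sketch using Python's re module (Apache uses its own regex engine, but the dot behaves the same way for this case); the sample paths are made up for illustration.

```python
import re

# Pattern as it was before the fix: "." matches any single character.
unescaped = re.compile(r"^/wiki.phtml$")
# Pattern after the fix: "\." matches only a literal dot.
escaped = re.compile(r"^/wiki\.phtml$")


def matches(pattern, path):
    """Return True if the whole path matches the given compiled pattern."""
    return pattern.match(path) is not None


# Illustrative paths (not taken from real traffic).
print(matches(unescaped, "/wikiXphtml"))  # unintended match
print(matches(escaped, "/wikiXphtml"))    # correctly rejected
print(matches(escaped, "/wiki.phtml"))    # intended match
```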

August 29

  • 20:45 mark: We got some evidence that routing for yaseo has been fucked all day, which may explain some reports we've been getting. Rerouting all traffic away from yaseo to florida just to be sure.
  • 12:38 Tim: Fixed password and hostname configuration for dryas. It's probably been broken, getting virtually no requests, since we got it.
  • 10:25 brion: amane offline for a few minutes due to network funkage, back alive now
  • 00:00 Kyle, River, Mark: Moved a bunch of servers around, both physically and onto the Foundry (csw5-pmtpa). Moved csw4-pmtpa's L2 link from csw1-pmtpa to csw5-pmtpa and gave csw4-pmtpa's HSRP group 2 a higher preempted priority in order to balance traffic better while migrations are in progress.

August 28

  • 15:50 Tim: fixed yaseo search, used an IP address instead of hostname
  • 13:35 Tim: reinstalled srv50 and put it into rotation
  • 13:20 mark: Made yaseo squids temporarily diskless while yf1001's outage lasts.

August 24

  • 15:30 mark: Ran scap to disable sending of HTTP ETag (bug #7098)
  • 13:30 mark: Fixed the PowerDNS setup on Browne, and added a beta.wikiversity.org DNS record, closing bug #7094.
  • 11:00 mark: Reducing cache_dir sizes for yaseo from 20 GB to 6.

August 23

  • 23:00 mark: Upgraded yf1000, yf1001 and yf1002 back to Squid 2.6.
  • 22:53 mark: yaseo Squid problems were simply caused by the broken DNS resolvers: Squid needs a -HUP before it rereads /etc/resolv.conf. Also replaced lvsmon by PyBal for Squids on yf1018.
  • 11:53 Tim: added version switch to squid.conf.php
  • 11:37 Tim: removed yf1003, yf1004 and yf1019 from LVS rotation temporarily
  • ~11:25 Tim: squid was mostly broken on yaseo, apparently not working at all for forwards to the local apaches, and very slow for forwards to pmtpa. It was just timing out. Lvsmon was reporting all squids down, so it was in "emergency mode", leaving a few in rotation for debugging. Downgraded squid to 2.5 on yf1000, yf1001 and yf1002, this fixed the problem on those squids. Lvsmon is now leaving those three in rotation and flapping the rest. The configuration file is locally hacked for the downgrade, will fix that shortly.
  • 07:00 Kyle, Domas: db1-4 physically moved to redundant power circuits.
  • 06:50 Kyle: Moved the load balancer. Right now unracked till I get rails for it.

August 22

  • 20:30 knams back up, pointing EU traffic back at knams
  • 19:35 mark: Kennisnet doesn't know what's wrong yet, and when it will be up again. Redirecting knams traffic to pmtpa.
  • 18:45: Kennisnet went down
  • 16:00 - 17:30 mark: Deployed squid-2.6.STABLE3-1wm on all Squids.
  • 09:40 brion: migrating old dump data from benet to free space
  • 09:30 brion: secure.wm.o whining again, about 10.0.2.7[4-6]

August 21

  • 12:32 brion: someone appears to have fixed db permissions so secure.wm.o works again. thanks, mysterious person (possibly domas) who didn't log it! :)
  • 00:59 brion: let's not forget to fix the database permissions today so that secure.wm.o and manual db work from bastion hosts work again. Is it safe to update ourusers.php and re-source the SQL, or has this changed?

August 20

  • 15:28 brion: srv58 appears dead, but was listed in memcached cluster. this maybe broke some sessions. switched to double-loading srv59.
  • 11:25 brion: restarted postfix on leuksman.com, mail was down for some reason so svn commit notices weren't being mailed
  • 10:11 brion: noticed secure.wm.o is not happy
  • 7:53 river: set goeje's relayhost to mayflower, so it can send mail while reverse DNS isn't working

August 19

  • 22:30 mark: Replaced Tim's modified PyBal by the current code in SVN, which already supported multiple services in one PyBal instance and configuration file. Tim's modified code is in /usr/local/pybal.old, old config is in /etc/pybal/old/.
  • 22:10 mark: Submitted support request to Verio to change delegation / nameserver glue record IPs on wikimedia.org.
  • 21:48 mark: Added IP 66.230.200.16 to browne, as the new ns0.wikimedia.org IP. Removed bogus 66.230.200.207 and 66.230.200.208 IPs, and killed the old firewall which didn't let any of the new IPs through.
  • 12:27 Tim: Modified pybal to accept an optional configuration file name on the command line (pybal.py -f <filename>). Started a second instance on avicenna with the configuration file /etc/pybal/pybal-newip.conf, to load balance for 66.230.200.228. Modified source is in ~tstarling/pybal and avicenna:/usr/local/pybal.
  • 10:32 river: starting to migrate services to the new network

August 18

  • 21:30 jeluf: added 66.230.200.x aliases to all 207.142.131.x hosts
  • 19:59 brion: set dbs back to read-write
  • 19:55 brion: apparently cogent has returned our old ips until monday
  • 19:50 brion: route to old ips suddenly back for many people. waiting to hear more details
  • 18:50 brion: we've been assigned new ip space, people are trying to figure out how to attach stuff to it
  • 17:50 brion: kyle got in touch with charles, apparently cogent deleted or re-advertised our ip space. they're trying to figure out what happened and fix it
  • 17:30 brion: pmtpa more or less inaccessible. called powermedium to have them investigate. texted kyle to see if he's available to look

August 17

  • 22:59 brion: resynced php files on yf1017; bad copy had extra language files leftover, breaking th.wikipedia.org
  • 15:00 domas, river: fixed broken mail configuration which caused delays in mail delivery, now returning 55x for most clients for more serious reasons instead of 450s on DNS failures.

August 16

  • 19:07 brion: resyncing common on srv21, had old wiki list
  • 18:01 brion: upgraded leuksman.com to apache 2.2.3 and php 5.2.0rc1
  • 06:43 brion: restarted enwiki dump, it got eaten by dead mysql servers
  • 01:30 brion: restarted apache on bart in response to reports of very slowness. tim found threads stuck in futex state. boooo. evil!

August 15

  • 21:40 brion: updating interwiki table for wikiversity
  • 21:13 brion: refreshLinks on frwiktionary, see if it trims bugzilla:7023
  • 20:19 brion: added wikiversity-l
  • 16:04 brion: polish file dump available
  • 14:34 river: fixed upload dirs for wikiversity. added account on zedler for brendang (OpenSolaris developer) to work on dtrace scripts.
  • 06:55 brion: adding enwikiversity, betawikiversity

August 14

  • 19:40 jeluf: restarted mwsearchd on rabanus
  • 18:11 brion: fixed dump settings for comcom, removed stats
  • 13:50 mark: Set up iBGP between csw4-pmtpa and csw5-pmtpa according to BGP.

August 13

  • 19:30 jeluf: installed djvulibre in yaseo. Was already installed in pmtpa.

August 12

  • 23:24 brion: Enabling email notification for watchlists on meta. amgine reminded me about it
  • 21:28 brion: added stewards-l

August 11

  • 22:09 brion: disabled wm06reg to ensure rails is not exposed until fixed

August 10

  • 20:49 brion: running enwiki dump from db3, domas fixed it hopefully
  • 12:03 river: added new mail gateway, mail-eater.wikimedia.org on albert (via LVS on avicenna) for incoming mail, so goeje doesn't die when it has to reject large amounts of mail
  • 00:00 mark: (Temporarily?) enabled captcha for dewiki, by request of DaB and elian due to a vandal attack going on

August 9

  • 14:49 brion: copying captcha image store to yaseo; was enabled on jawiki but wasn't set up properly yet. creating accounts on ja should work again

August 7

  • 05:00 jeluf: bugzilla complained about broken data/versioncache. Removed empty /srv/org/wikimedia/bugzilla/data/versioncache, bugzilla fine again.

August 6

  • 20:20 jeluf: mailmanctl restart on goeje
  • 19:40 jeluf: added cronjob to automatically update pascal's recipient map, /home/wikipedia/bin/UpdateBackupMX. The job runs every 15 minutes.
  • 10:00 jeluf: removed all MAILER-DAEMON mails(about 30'000) from pascal's mailq
  • 9:30 jeluf: added relay recipient map to pascal's mail configuration. It's generated by /home/wikipedia/bin/CreateRelayRecipientMap on goeje in /etc/postfix/recipient_map. It needs to be copied to pascal and processed by postmap. Todo: Automate this process

August 5

  • 18:05 jeluf: restarting postfix on pascal
  • 14:35 brion: postqueue -f on goeje just in case some old bits got stuck in queue

August 4

  • 18:00 jeluf: added srv81-86 to node group ext-stores, added srv83 and srv86 to ext-store-masters
  • 17:45 jeluf: only 8GB free on srv76 => removed binlogs 100 to 149

August 3

  • 22:29 brion: db2 and db4 still have broken grants, missing admin user. this broke enwiki dump
  • 22:16 brion: started dumps on pmtpa and yaseo
  • 13:20 Tim: fixed Special:Version
  • 13:10 Tim: Fixed mounts on srv13. Ran away from wikimania to avoid getting lynched by angry tired devs
  • 10:02 brion: forgot actual scap was different. blah. removed extra files manually from yaseo
  • 09:38 brion: adjusting scap15-2 on yaseo to use --delete as well as --delete-after in the hopes it'll properly delete the now-removed language files
  • 07:34 jeluf: Removed LanguageTh.php from yaseo apaches. Everything seems to be fine now (i.e. no user complains about problems)
  • 06:38 jeluf: More wikis seem to work, some still broken. LanguageTh doesn't have $wgNamespaceNamesEn defined, so the + at its beginning fails. Looks like the codebase is in some undefined state.
  • 06:24 jeluf: removed definition of "class LanguageUtf8 extends Language {}" in Language.php. I hope it doesn't break anything...
    • note that that could break things. recommend putting it back eventually, but some language files should have the remaining LanguageUtf8 references removed
  • 06:10 jeluf: Users complain about empty pages in YASEO, error messages regarding redefinition of LanguageUtf8 class, scapping: Not better. Still get
Aug  3 06:18:11 wikif1010 httpd[11932]: PHP Fatal error:  Cannot redeclare class languageutf8
  in /usr/local/apache/common-local/php-1.5/languages/LanguageUtf8.php on line 38
in /var/log/messages

August 2

  • 22:30 domas: cool guys hacking at OLPC
  • 03:14 mark: Built a squid 2.6.STABLE2 RPM and installed it on clematis

August 1

  • 4:10 jeluf: switched enwiki to read only, ariel out of sessions. domas killed hanging DB queries, switched to read/write at 4:20

July 31

  • 18:40 JeLuF, mark: both goeje and pascal didn't respond on tcp port 25, both complaining about SYN floods. Restarted Postfix on both, which "fixed" the problem.
  • 00:22 brion: noticed someone returned db4 to service, but didn't log it. Was it properly recloned or is it still broken? There are reports of database locking, which would be caused by detection of lagging slaves. There may or may not be some laggy problems.

July 30

  • 22:48 brion: fixed resolv.conf on other search boxes, synced wikimania search db
  • 22:20 brion: rebuilding wikimania2006 search db. corrected /etc/resolv.conf on maurus

July 29

  • 23:15 brion: had to restart yaseo search server again
  • 22:27 brion: mounted /mnt/math on srv11,srv12,srv14,srv15,srv16,srv17,srv18
  • 04:30 brion: search daemon was down on yf1017. restarted it
  • 00:48 brion: added oversight-l

July 28

  • 04:43 brion: yaseo math dir seems to have vanished. created upload/math and symlinked it to /mnt/math to match the symlink at /home/wikipedia/common/math

July 27

  • 20:50 brion: trying to remount /mnt/upload3 on srv12-19, mostly were missing. amane mountd has some problem
  • 05:19 brion: srv14 and srv20 had time about 50 minutes off, ntp hadn't properly started on boot. srv14 had to have /etc/ntp/step-tickers adjusted (was zwinger, now 10.0.0.200). may be multiple edits with wrong timestamp

July 26

  • 21:41 brion: db4 still has sync problems, taking out of rotation
  • 20:00 Kyle: During the server move earlier, something I did is causing ganglia to mis-report the down'ing of a bunch of apaches.
  • 20:00 brion: set apache to start on boot on albert
  • 19:59 brion: fixed nfs mounts on srv19, srv20, got them back in service
  • 19:50 brion: srv19 and srv20 have nfs mounts broken. albert's up but not running apache
  • 08:30 Kyle: srv11-20 are physically moved. The netgear has a new uplink to csw5-pmtpa on Patch B.

July 25

  • 21:50 jeluf: Built djvulibre FC4 package, installed on remaining hosts. Added to bootstrap script
  • 21:23 jeluf: Built djvulibre package for FC3. Tried to install on all "mediawiki-installation"-servers. Install failed on FC4.

July 24

  • 21:40 brion: added custom php.ini on bart for secret project extension
  • 19:25 brion: enabled DPL on *wikibooks
  • 00:20 brion: fixed thumbnail generation on wikimania2006wiki
    • had to tweak thumb-handler.php on amane to special-case the site prefix. some other sites may also require this

July 23

  • 22:20 brion: disabled page creation for anons on fawiki[3]
  • 21:09 jeluf: rotated botquerylog, sent log to yurik
  • 10:00 Tim: fixed nagios.wikimedia.org
  • 08:14 Tim: brought db2 and db4 back into rotation
  • 08:10 Tim: can't log in to srv78 but it's still serving HTTP. Took it out of rotation.
  • ~07:00 Tim: copied data directory from db2 to db4

July 22

  • 09:30 Tim: reniced WikiCounts.pl so that the job queue gets priority
  • 09:15 Tim: restarted job threads on srv42
  • 05:45 brion: started yaseo dump; installed local dbzip2 on amaryllis

July 21

  • 23:34 brion: fixing ipb_create_account on old user blocks, was incorrectly set to 0
  • 16:40 Tim: live patch to profile only requests above a certain minimum request time
  • ~10:00 Tim: copied most lost revisions from db4 using fixSlaveDesync.php, fixed 96 broken page_latest fields with some handwritten queries starting with "select page_id,rev_page,page_latest from page,revision where page_latest=rev_id and rev_page<>page_id". That seems to have dealt with most complaints.
  • 09:52 brion: took db4 out of slave rotation again
  • 09:50ish brion: tim took back read-write to attempt resync in background
  • 09:13 brion: set db1 and db3 read_only at runtime as well, for good measure
  • 09:05 brion: added read_only to my.cnf-core-slave-13G and regenerated master copies of my.cnf
  • 08:49 brion: in grievous violation of proper replication etiquette, both db4 and db2 DID NOT HAVE read_only SET AND WERE THUS DANGEROUS. db4 has been corrupted. db2 appears to be a clean copy of ariel. Have manually set read_only true (runtime, did not check config)
  • 08:40 brion: taking read-only on enwiki. site appears to be fucked; most edits were going to a slave over last several hours? did someone leave the slaves misconfigured to accept writes? what the fuck?
  • 08:36 brion: odd edit lag reported in last hour, possibly due to ariel being commented out in db.php. fixing... hopefully
  • 08:11 brion: expanded filetypes for wikimania2006wiki
  • 07:52 brion: db.php briefly had a $wgReadOnly set on it for enwiki, for no apparent reason. Possibly accidentally re-saved by someone after/during/with some maintenance the other day?
  • 05:49 brion: added biruni back to apaches node group
  • 05:29 brion: running apache setup on biruni
  • 04:45 jeluf: Blocked UCD search bot's user agent at squid level
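The consistency check quoted in the ~10:00 entry above can be tried on toy data. The sketch below runs the same query against an in-memory SQLite database; the two-column tables are a cut-down assumption standing in for the real page and revision tables, and the rows are invented for illustration.

```python
import sqlite3

# Query as quoted in the log entry: find pages whose page_latest points
# at a revision that belongs to a different page.
CHECK_SQL = """
SELECT page_id, rev_page, page_latest
FROM page, revision
WHERE page_latest = rev_id AND rev_page <> page_id
"""


def build_toy_db():
    """Create a tiny stand-in schema (assumed, heavily simplified)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
        -- Page 1 is consistent; page 2 points at revision 21,
        -- which actually belongs to page 1.
        INSERT INTO page VALUES (1, 10), (2, 21);
        INSERT INTO revision VALUES (10, 1), (20, 2), (21, 1);
    """)
    return conn


def find_mismatched_pages(conn):
    """Return (page_id, rev_page, page_latest) rows for broken page_latest fields."""
    return conn.execute(CHECK_SQL).fetchall()
```

On the toy data only page 2 is reported, since its page_latest refers to a revision owned by page 1.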

July 20

  • 21:05 jeluf: added info-cs alias, pointing to otrs
  • 19:00 jeluf: added block for the search bot to squid.conf, deployed.
  • 18:34 brion: lucene heavily overloaded last couple hours by some stupid bot. added a quickie block for it

July 19

  • mark: sq2 seems down. Kyle, can you look into it?
  • 17:09 brion: migrating old dump files from benet to amane

July 18

  • 22:07 brion: poking bart again
  • 19:11 brion: poking bart's php config

July 17

  • 20:52 brion: starting next enwiki dump
  • 20:something db slowness from some bad group bys; domas fixing it

July 16

  • 08:24 brion: added libwmf to install-imagemagick, silly dependencies...
  • 08:01 brion: fixed /etc/hosts on srv78, someone had hardcoded wrong ip for mirror address
  • 07:54 brion: removed wfLogProfilingData() from ProfileStub.php; seems to be now in GlobalFunctions rather than the profiler, and conflicts
  • 07:47 brion: got yum mirror set up on albert, hopefully
  • 07:37 brion: trying to fix albert. did a yum update, then setup-general which broke the yum configuration and haven't been able to bring it back to life yet
  • ~05:00 Tim: db4 was down for an hour or so due to a previous configuration sync, changing the ibdata file size, combined with a segfault

July 14

  • 22:45 jeluf: srv6 also restarted after fixing resolv.conf
  • 22:25 brion: jeluf fixed dns on srv9 and restarted squid, seems happier. srv6 also slow
  • 22:10 brion: srv9 seems to be very slow, trying to get in to poke it
  • 21:14 brion: moving stats.wikimedia.org to zwinger
  • 20:52 brion: since albert's down, setting up another fedora mirror on zwinger
  • 20:31 brion: srv78 setup...
  • 20:13 brion: reinstalling apc on srv12, srv19, srv24, was broken (0-byte .so)
  • 17:20 mark: Restricted DNS queries to internal subnets on zwinger
  • at some point: albert broke again and nobody fucking logged it
  • 06:something brion: with albert down, internal external dns is down. tim's poking it
  • 06:38 Kyle: albert reinstalled. Now on port 25 of csw1-pmtpa
  • 05:50 Kyle: srv78 has a new os.
  • 05:44 brion: running bad image name fixes

July 13

  • 18:22 brion: running message rebuilds for hu*, language file was updated recently
  • 09:00 Tim: moved 10.0.5.7 to srv1 ready for reinstallation of albert
  • 08:33 Tim: put srv117 into rotation
  • 08:08 Tim: unmounted all albert mounts, removed them from /etc/fstab
  • 08:00 Kyle: csw5-pmtpa is now on the scs.
  • 07:45 Tim: moving fedora mirrors to srv81 ahead of reinstallation of albert

July 12

  • 21:16 brion: running post-check on bad titles to make sure they're all dead
  • 20:15 brion: running bad title cleanup; 99 wikis had at least some bad titles, at most a couple dozen
  • 19:22 brion: running non-invasive bad title checks on all pmtpa wikis
  • 08:57 Tim: installed apache/php/mw etc. on zwinger. Moved 207.142.131.234 back to zwinger.
  • ~07:10 Tim: all NFS shares on 10.0.0.4 unmounted except srv31, srv42 and benet. Removed /home/wikipedia/shared/math entry from all fstabs, it has been unmounted on all apaches for a few days with no trouble.
  • 06:52 Kyle: moreri is up with an ip of 10.0.0.32
  • 06:47 Kyle: zwinger has ip 10.0.0.34 and is on the scs.
  • 05:39 Kyle: srv110 rebooted after audit removed. It now allows logins.
  • 05:36 Kyle: srv68 and srv78 rebooted. Raid controller crashes.
  • 05:13 Kyle: db3 is running with the older kernel. Let's see how it does.

July 11

  • 15:58 Tim: Belatedly fixed annoying apache configuration warnings:
[Tue Jul 11 15:53:26 2006] [alert] httpd: Could not determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
[Tue Jul 11 15:53:26 2006] [warn] NameVirtualHost *:80 has no VirtualHosts
[Tue Jul 11 15:53:26 2006] [warn] NameVirtualHost *:80 has no VirtualHosts
  • 15:45 Tim: synchronised php.ini files
  • ~13:50 Tim: reinstalled PHP on srv2 and goeje, installed srv118 and srv119
  • 10:45 Tim: noticed srv78 was down, set up srv67 to take its place as a memcached server
  • 09:00-10:13 Tim: new recentchanges index
  • 08:43 Tim: new blocking code live
  • 07:14 Tim: disabled logwatch on suda, it had filled the root partition yet again. 7.7 GB in /var/cache/logwatch
  • 04:35 brion: knams inaccessible for a few minutes

July 9

  • 23:48 brion: disabled apc on friedrich; test drupal/civicrm doesn't seem to like the late binding problems
  • 22:41 brion: adding office2 temp cname on wm.o
  • 05:57 Kyle: Patches installed for the Foundry switch. Let's start discussing Foundry Crossover Procedure

July 8

  • 08:07 domas: set ldap.conf on suda to check peer certificates :-)
  • 08:05 domas: replaced srv1's expired cert with another one valid for 3 years, stored its public key in suda:/etc/openldap/cacerts/
  • 08:00 domas: set ldap.conf on suda not to check peer certificates.
  • 06:36 Kyle: Foundry switch is running. The first management module's console port is temporarily plugged in where asw3-pmtpa was. Its ethernet port is plugged into port 46 of asw3. (I couldn't get the console to come up.)

  • 06:24 brion: killed & restarted apache on srv13; lots of fatal errors

July 7

  • 22:50 brion: fixed scap, hopefully, so that it updates the SVN revision number properly
  • 06:51 Kyle: took down, moved, and put back up srv120 to make room for the new switch that hasn't arrived yet.

July 6

  • 22:29 brion: enabling DPL on de.wiktionary

July 5

  • 15:54 Tim: fixed full root partition on suda
  • 07:43 Tim: bringing db2 back into service with a warmup load
  • 05:35 Tim: set $wgGenerateThumbnailOnParse = false

July 4

  • 22:17 brion: db2 broke; asking for reboot
  • 04:37 Tim: set logo on test.wikipedia.org

July 3

  • 17:08 Tim: set $wgJobRunRate=0 everywhere, srv42 appears to be doing a sufficient job

July 2

  • 21:40 brion: finished installing mono & mwsearchtool on srv31, restarted dump there
  • 21:33 brion: started remaining dump threads on benet and srv31
  • 16:30 brion: resolved 'Wikipedia:' title conflicts on lawiki

July 1

  • 22:55 brion: experimentally activating DynamicPageList on en.wiktionary.org, it's been requested
  • 21:18 brion: added some otrs addresses (info-en-[coqrv])
  • 19:05 brion: zhwiki temporarily broken by a bad update to zh_cn lang file
  • 18:37 brion: running messages updates on all wikis
  • 07:30 Kyle: zwinger up with software raid 1 and FC4. (I sent an email with more detailed info to private-l)
