Server Admin Log/Archive 9

April 29

  • 19:45 jeluf: added db3 back to db.php
  • 19:20 jeluf: restarted mail services. Move of database failed.
  • 19:00 jeluf: shut down goeje's MTA for move of OTRS database.
  • 08:45 Kyle: isidore now with fc4 at 10.0.0.18.
  • 07:58 Kyle: anthony up.
  • 07:51 Kyle: sq3 rebooted. Every time it crashes, it's a kernel panic on what looks to be a sync.
  • 07:11 Kyle: srv54 up.
  • 06:59 Kyle: srv20, srv80 back up with new drives.
  • 04:18 Kyle: Console redirection and netboot on will.

April 28

  • 21:31 hashar: changed nds_nlwiki sitename (Wikipedie)
  • 20:45 hashar: srv31 not sane, opened bug #5750
  • 20:40 hashar: scapping live trunk@13908. Lots of "cp cannot open '*/.svn/empty-file for reading: permission denied'" :(
  • 13:10 Tim: finished squid upgrade
  • 13:10 jeronim: moved 207.142.131.246 squid VIP from will to srv10; starting OS reinstall on will
  • 12:20 Tim: starting upgrade of squid on sq1-9 and yaseo
  • 12:07 The new Squid RPM 2.5.STABLE13-4wm seems to be memleak-free! Deployed it in knams and on srv6-10 (not will). Also set cache_mem to 2048 MB (see the config sketch after this list). I'll be gone later, so if any severe problems occur, you might want to reinstall the old Squid RPM 2.5.STABLE12-somethingwm and revert the cache_mem setting.
  • 12:08 Tim: brought sq10 into service
  • 11:51 jeronim: changed mailman master password
  • 11:48 Tim: removed 10.0.0.30 from coronelli, is meant to be for bart
  • 10:00 jeluf: changed secure.wikimedia.org from an alias for goeje into a service IP, 207.142.131.219, which is now pointing to bart. Goeje is still serving mail.wikimedia.org
  • 7:30 jeluf: srv54 went down, using srv66 as replacement memcached
  • 03:57 brion: secure.wikimedia.org back up, load should stay lower now hopefully
  • 03:52 brion: goeje doesn't have apc installed, woops :) fixing
  • 03:45 brion: goeje is swamped with high load; possibly from a redirect from a chinese site to secure.wikimedia.org
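
A minimal squid.conf fragment matching the cache_mem change in the 12:07 entry above; the surrounding configuration is not shown in the log and is assumed here:

# squid.conf (fragment) - memory cache size as described above
cache_mem 2048 MB
# to revert, set the previous value back and reload the running squid:
#   squid -k reconfigure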

April 27

  • 22:30 jeluf: Starting to move mail services (mailing lists, MTA, OTRS, etc) to bart.
  • mark: I found a big memleak in Squid which is most likely the one that's been giving us problems. I am testing the bugfix on ragweed and will deploy it on all squids if no problems occur.
  • 10:30 jeluf: changed ticket.wm.o into a CNAME to secure.wm.o, restarted powerdns because 'update' was hanging
  • 10:08 jeluf: activated subpages for the project namespace on frwiki
  • 4:38 Kyle: srv66 back up.
  • 4:33 Kyle: sq10 up.
  • 4:21 Kyle: db3 Backup. fsck'd.
  • 4:09 Kyle: sq3 back up.

April 26

  • 20:36 hashar: from dberror.log: The table 'profiling' is full (10.0.0.2)
  • 16:26 mark: Built and deployed a new Squid RPM 2.5.STABLE13 on ragweed, including a fix that might have been for our big memleak problem. Will deploy on all squids if testing is successful
  • 15:53 Tim: restarted squid on yf1004, was swapping heavily
  • 15:27 mark: Setup ragweed as Squid with FC5.
  • 14:12 jeluf: fixed yf1002's fstab, had corrupted entry for swap partition. Changed / to noatime
  • 09:12 jeluf: removed sq3 from the upload.wm.o pool, was timing out all the time. No SSH login to sq3 possible.
  • 07:46 Tim: Set icp_query_timeout to 10ms. This seems to have fixed about half of the sibling hit problem. (A config sketch follows this list.)
  • 04:20 Tim: Fixed squid.conf warnings. Restarting squid on sq1 for testing.
  • ~03:00 Tim: set up bart for squid sibling hit test
  • 02:18 Tim: readded sq9 to the pmtpa.wmnet zone. sq9 is on 10.0.3.9 but not 207.142.131.227, that's bart. Will shortly be changing 207.142.131.227 from sq9.wikimedia.org to bart.wikimedia.org.
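
A hedged squid.conf sketch of the ICP timeout change in the 07:46 entry above. The directive takes milliseconds; the cache_peer line is a hypothetical sibling entry for illustration, not taken from the log:

# squid.conf (fragment) - fixed ICP query timeout instead of the dynamic default
icp_query_timeout 10
# hypothetical sibling peer; real peer names and ports are not recorded here
cache_peer sq2.pmtpa.wmnet sibling 3128 3130 proxy-only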

April 25

  • 22:43 brion: shutting down srv20; disk errors caused / to be remounted read-only
  • 02:11 Tim: re-running setup-apache on the servers brion complained about below. They have old versions of some things.

April 24

  • 22:05 brion: created chapcomwiki.blobs manually on srv73
  • 21:59 brion: created chapcomwiki.blobs manually on srv76
  • 21:55 brion: reports of save failures on chapcomwiki, probably missing blobs table
  • mark: Reinstalled ragweed with FC5
  • 16:10 jeronim: In goeje:/opt/otrs/Kernel/Config.pm, set $Self->{FQDN} = 'secure.wikimedia.org'; as it was incorrectly set to ticket.wikimedia.org (advice from Solensean, thanks).
  • 16:00 Hashar: r13845 should make sockets non-blocking when purging squids (see the diff). Untested unfortunately :(
  • 11:20 Tim: checkStorage.php completed, I'm now using it to fix the wikis corrupted by a bug in compressOld.php. Sample output at http://p.defau.lt/?z9EhTaOllcxImxBw0VacnQ, the rest is in /home/wikipedia/logs/checkStorage. Currently running on srv31, I might need to move to benet later to get higher dump filtering speeds
  • 03:37 brion: upgraded leuksman to php 5.1.3RC3 for testing
  • 03:27 brion: added dns for wm06reg
  • 02:43 brion: reenabled Austin's login so he can set up and maintain the wikimania registration
  • 02:15 brion: setting up ruby on rails for wikimania registration app on friedrich
  • 02:05 brion: mounted /home on friedrich

April 23

  • 22:14 brion: enabling debug log for botquery (/h/w/l/botquery.log)
  • 20:33 brion: srv51/55/61/67 segfault regularly. stopping apache on them for the moment.
    • These were out for memory replacement. Allegedly they are fixed, but why the crashes?
  • 20:30 jeluf: db3 is crashed, removed it from the pool
  • 20:15 jeluf: restarted apaches, some were showing APC issues
  • 19:45 brion: quick hack on special:boardvote to send enwiki visits to meta. enwiki no worky due to db changes.
  • 03:54 brion: restarting search boxes
  • 03:45 brion: syncing search database from finished dewiki

April 22

  • 12:00 jeluf: added thistle and db3 again to the pool
  • 10:19 brion: redoing lucene build of dewiki; it failed before, and now there's no index :P
  • 10:05 brion: restarted search servers; synced updated data
  • 09:51 brion: started dump in yaseo
  • 4:45 jeluf: stopped mysql on thistle and db3
  • 4:45 Kyle: A candle of Saint Jude is lit in the colo.

April 21

  • 23:41 brion: activated Makebot extension, for bureaucrats to assign bot status
  • 19:20 brion: loosened smtpd_helo_restrictions as danny keeps complaining about mail bouncing from poorly-configured mail servers
  • 05:50 jeluf: started db2 and db3 again.
  • 04:33 Kyle: srv56 back up after rma
  • 04:30 jeluf: taking db2 down to copy its DB to db3
  • 01:05 brion: changed scap and sync-common to use rsync instead of cp; cp was sometimes overwriting incorrect files when symlinks were replaced with regular files (for example the common favicon.ico). See the sketch after this list.
  • 00:41 brion: restarted replication on henbane, was stopped with 'impossible log position' error.
    • stuck on .445 position 900689993; restarted on .446 position 0.
      • (briefly accidentally started at .445 position 0. some duplicate key errors, shouldn't cause any harm.)
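
A minimal shell sketch of the failure mode behind the 01:05 scap/sync-common change: when the destination is a stale symlink, cp writes through the link and clobbers the shared target, while rsync replaces the link with a regular file. The paths are hypothetical:

# destination still holds a symlink from an older layout (hypothetical paths)
#   /apache/common-local/favicon.ico -> /apache/common/favicon.ico

# cp dereferences the existing symlink and overwrites the shared target:
cp favicon.ico /apache/common-local/favicon.ico

# rsync writes a temporary file and renames it over the destination,
# so the symlink is replaced by a regular file and the shared target is untouched:
rsync favicon.ico /apache/common-local/favicon.ico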

April 20

  • 18:30 jeronim: added info-sv otrs alias
  • 17:40 brion: created wikisource-l mailing list
  • 10:42 Tim: created missing mounts on harris
  • 09:18 brion: started dumps threads 3,4 both on benet
  • 08:37 brion: started dumps threads 1, 2
  • 08:31 brion: removed albert nfs mount from benet; obsolete
  • 08:25 brion: apache 2.2 on leuksman mysteriously hung somehow. had to kill -9 and restart.
  • 06:49 brion: starting the search index rebuild on maurus. again. try not to explode this time, data center
  • 06:39 brion: rebooting benet. added gateway on (disabled) eth1 config; also tried setting HWADDR to see if it sees that it has a problem more easily.
  • 05:46 Tim: moved cache epoch forward to 15:15 yesterday, which is just after Mark fixed the NFS mount problems
  • 05:45 brion: leuksman up, can now search for "rpm" on this wiki
  • 05:37 brion: taking leuksman mysql offline to change search parameters and back up
  • 05:21 brion: deleted 250megs of old log files, restarted pascal's apache. bugzilla back online.
  • 05:14 brion: pascal root partition full, broken bugzilla. clearing space; stopped apache while working
  • 04:30 jeluf: changed pppuser@zwinger to login shell '/bin/true', changed sqltunnel on zedler to use '-n -N' instead of 'while [ 1 ]; do sleep 10000; done' (see the sketch after this list)
  • 03:35 Tim: changed /etc/fstab everywhere to use 10.0.5.8 for /home. Suda has that IP, but nothing is actually using it yet.
  • 03:04 Tim: brought srv52 into apache rotation
  • 02:41 Tim: fixed harris, srv15, srv51
  • 02:15 brion: updated squid error pages with spanish fix
  • ~01:00 Tim: fixed ganglia
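
A sketch of the tunnel style the 04:30 sqltunnel entry above refers to: -N runs no remote command and -n redirects stdin from /dev/null, replacing the old sleep-forever loop. Hosts and ports are assumptions for illustration:

# old style: keep the connection alive with a do-nothing remote loop
ssh pppuser@zwinger 'while [ 1 ]; do sleep 10000; done' &

# new style: no remote command at all (hypothetical forwarded MySQL port)
ssh -n -N -L 3306:db.pmtpa.wmnet:3306 pppuser@zwinger &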

April 19

  • 23:52 Tim: ran "chkconfig ntpd on" on srv71-79, srv8 and srv9. Stepped time by about an hour on srv8. Started ntpd on srv8 and 9.
  • 22:08 hashar: http://zwinger.wikimedia.org/~hashar/ shows an experimental graph of number of jobs on enwiki. It's in my crontab every 5 minutes.
  • 21:16 brion: set wikitech wiki to require login to edit, to make sure we know who's editing this thing. :)
  • 20:51 brion: started job queue runner on srv31
  • 20:01 brion: restarted postfix on pascal. seemed to be trying to send mail to itself instead of on to pmtpa. [yay, bugmails coming to tampa now]
  • 19:35 jeluf: replaced memcached srv66 (down) by srv51 in mc-pmtpa.php
  • 19:21 jeluf: started gmetad on zwinger, added it to rc3.d
  • 18:18 brion: disabled wgAllowExternalImages on nlwiki. Why was it on?
  • 18:03 brion: started ntp on srv71-srv79, were not quite synced and no ntp running
  • 17:14 jeluf: reverted changes done on pascal's httpd.conf, so that bugzilla works again.
  • 17:05 jeluf: moved IP 246 from srv9 to will
  • 16:40 jeluf: added thistle to the mysql pool
  • 16:30 jeluf: started external storage server srv71, didn't come up at boot time (since LDAP was not available yet and mysql user is in LDAP only)
  • replication on zedler broken:
060419 16:25:43 [ERROR] Got fatal error 1236: 'Client requested master to start replication from impossible position' from master when reading data from binary log
  • 16:24 jeluf: changed pppuser account from /bin/cat to /bin/bash, since the former didn't work. needs fixing
  • 16:03 jeluf: mwdaemon crashed on all 3(!) nodes, restarted.
  • 16:03 mark: Broke Benet's routing (even more than it was). Benet is down, colo says it freezes at 'booting the kernel'. Kyle needs to look into it
  • 16:02 mark: Fixed SCS routing
  • Tim (various times up to 16:00): reset slave on db2, db4, lomaria, db1. Spot check on max(rc_id) on various databases showed no apparent problems, replication continued afterwards. Had some trouble with apache threads connecting to database servers while they were starting up, swamping them with load when they started serving. Thistle is currently out of rotation due to this. All wikis now r/w.
  • 15:58 jeluf: started mwdaemon on maurus, vincent, coronelli.
  • 14:58 mark: mail stuff on goeje up
  • 14:52 Tim, mark: enwiki up r/w, dewiki r/o, other DBs in progress.
  • 14:40 mark: Brought up all squids, chkconfig on
  • 14:30 mark: Brought up LVS on avicenna (pybal) and dalembert (lvsmon).
  • 14:09 mark: fundraising.wm.org up.
  • 13:10 Tim: All core mysql servers now have a /etc/init.d/mysql script, chkconfig on. All external storage servers have a /etc/init.d/mysqld script, due to the different RPM used on some of them. Also chkconfig on.
  • 12:34 Tim and mark: Started named on albert and did chkconfig on
  • 12:14 Tim and mark: Started named and ldap on srv1, did chkconfig on
  • 12:00 mark: Fixed DNS issues on zwinger (ip stolen by goeje) and pascal
  • 11:50 Tim: sent all traffic briefly to an error server on pascal, then, when that didn't work, to rr.knams, so that everyone can see the pretty ERR_CANNOT_FORWARD message instead of a connection refused.
  • 11:33 Kyle: Heading to colo, left message on brion's cell.
  • 10:00 PMTPA down.
  • 05:56 Tim: turned off transactions for external store connections
  • 05:20 jeluf: added sq3 to the upload.wikimedia.org pool of squids

April 18

  • 21:56 brion: started medium wikis dump job (pmtpa3) on srv31. had to mount /mnt/benet
  • 21:10 jeluf: started pybal.py on avicenna to update LVS weights.
  • 20:50 brion: fixed perms on srv60's common-local; fixed favicon.ico again
  • 10:38 brion: started the small wikis dump job (pmtpa4) on benet. will run others after making sure these went, then clearing space on benet
  • 10:02 brion: started dump job on yaseo. preparing partials on pmtpa

April 17

  • some time Domas: dumped db4 to db2 for enwiki db job
  • 21:11 brion: fixed grants for new toolserver repl user on ariel
  • 20:43 brion: added wikimedianz-l list
  • 10:17 brion: enabled special:nuke on pdcwiki
  • 05:09 brion: adding querycache_info tables...

April 16

  • 08:03 brion: fixed bad favicon on srv60

April 15

  • 10:05 brion: upgraded mailman to 2.1.8
  • 06:38 Tim: after restarting, db3 completed MySQL recovery successfully. Starting replication from ariel with fingers crossed.
  • 04:17 Tim: db3 crashed. I switched the enwiki master to ariel. Site back in r/w mode at 04:35.

April 14

  • 23:30 Tim: compressOld has finished on enwiki. Switched write destination to cluster4/cluster5.
  • 22:05 brion: tried unsuccessfully to get dab's toolserver login working on the global zone so he can use the ssh tunnel for db replication

April 13

  • 23:15 brion: more net troubles! pmtpa<->knams down. others ok
  • 22:28 brion: yaseo able to reach pmtpa again. all squid centers appear to work
  • 22:20 brion: net working for more people, but still not everywhere. (yaseo out; some in europe still reporting errors)
  • 21:07 brion: bw reports the issue as flapping at level3 dampening the routes. should be resolved soon...
  • 21:55 brion: net probs
  • 21:00 jeluf: enabled captchas for zhwiki upon request, to fight a vandalbot
  • 04:30 Kyle: hydra shutdown. Shipping soon.

April 12

  • 22:03 brion: adding new replication user for the toolserver thingies to use on adler, db3, and potentially other places.
  • 19:43 brion: srv71-srv79 don't appear to have working ntp, are ~30 seconds off. trying to fix again
  • 06:30 Tim, Domas: split off enwiki db cluster to (db3,db4,ariel)
  • 06:00 jeluf: created experimental rr-upload.wikimedia.org geozone

April 11

  • 15:43 Tim: started compressOld.php on enwiki from position 3710964
  • 15:37 Tim: started refreshLinks.php
  • 15:30 Tim: removed dalembert from mediawiki-installation
  • 15:00 Tim: did schema update for langlinks
  • Tim: db3 now has a copy of enwiki, up-to-date and replicating
  • 11:32 Tim: put lomaria back into service
  • 11:26 Tim: putting webster back into service with smaller data set
  • 08:00 jeronim: added /sbin/iptables -I INPUT 1 -p icmp -j ACCEPT to /etc/rc.local on benet (download.wikimedia.org) so that Path MTU discovery is possible for clients. Don't block all ICMP. See [1] for why.
  • 06:00 jeluf: added wikimania-cfp alias for OTRS
  • 05:15 Tim: stopped slave on lomaria for SQL dump of webster's databases to db3.

April 10

  • 22:59 brion: fixing network setup on coronelli, rebooting it. upgrading mono on search servers
  • 22:34 brion: fixing network setup on maurus, rebooting it
  • 22:12 brion: restarted wikibugs irc bot. should have auto-started on goeje boot, but likely failed due to services being out
  • 22:04 brion: started apache on friedrich for fundraising.wm.o, set up crontab for updating
  • 20:45 jeluf: started MWDaemon on vincent, but still getting the google fallback page when searching
  • 20:00 jeluf: rebooted fuchsia
  • 18:35 jeluf: Started squid on srv8, moved .203 and .205 from will to srv8
  • 18:24 Tim: fixed Database::getLag(), properly this time I hope. Live hack, not sure how to commit it from there exactly.
  • 17:48 Tim: put ixia back into service
  • 16:09 Tim: put webster back into service
  • 11:40 Tim: started two data directory copies, one from db2 to lomaria, and one from webster to ixia.
  • 08:31 Tim: restarted compressOld.php, currently at 3708378. Stopped shortly afterward to reduce catchup times.
  • 07:55 Tim: running fixTimestamps.php on all wikis
  • 07:00 jeluf: thistle back in rotation after dammit repaired it.
  • 05:35 Tim: started gmetad on zwinger
  • 04:40 jeluf: started apache2 on albert
  • 04:26 jeluf: ixia, thistle, lomaria, db1 have broken replication settings, webster has database page corruption. Taking db2 out of rotation to create copies from it.
  • 04:20 jeluf: mounted /home on all DB servers
  • 04:03 brion: ran mass-correction of bad-timestamped entries on enwiki (1529 revision records)
  • 03:05 brion: srv71-srv79 had wrong clock, apparently set to local time instead of UTC.
  • 01:45 brion: irc feeds online. had to rescue udprec from kate's old home dir
  • 01:38 brion: taking thistle and db1 out of rotation; broken replication.
  • 01:32 brion: turning read_only off on adler. It seems to get turned back on at every boot. (See the sketch after this list.)
  • 01:28 brion: things look mostly good; tried to take site read/write but someone has put adler into read-only? examining
  • 01:23 brion: got fs-squids on the right ip. seems to work now.
  • 01:20 brion: had to start lighty on amane
  • 01:18 brion: trying to get fileserver squids+lvs up. (avicenna as lvs master)
  • 01:10 brion: run-icpagent.sh didn't take previously; seems to have helped now
  • 01:04 brion: trying to add 10.0.5.5 on dalembert also. no idea if this is correct. 10.0.5.3 works internally, but squids still don't show anything. there's no explanation for this that is obvious to me.
  • 00:55 brion: added the lvs master ip on dalembert; http'ing to it internally seems to work, but still nothing from outside
  • 00:49 brion: trying starting LVS monitor thingy on dalembert. no clue if it's working
  • 00:45 brion: turning on apaches
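
A short sketch of what the 01:28/01:32 read_only entries above amount to on a MySQL server; whether adler's my.cnf actually carries the option is an assumption:

-- check and clear the flag at runtime
SHOW VARIABLES LIKE 'read_only';
SET GLOBAL read_only = 0;

-- a my.cnf line like this (assumed, not confirmed for adler) would re-enable it on every restart:
-- [mysqld]
-- read-only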

April 9

  • 23:45 brion: srv33, srv36 should now replicate properly.
  • 23:20 brion: looking at srv33, srv36 external storage; jens reports replication seems borked
  • 22:00 brion: added izwinger ip to suda; it wasn't automatic.
  • 21:52 brion: finally got into srv1 and albert. maybe working
  • 21:49 brion: ldap depends on dns; dns is still broken. we can't reach srv1 or albert.
  • 21:32 brion: still trying to get some core machines online (suda booting; albert ?? srv1 ??). kyle should be available in 30 minutes
  • 20:55 brion: bw is onsite and available to poke at machines. there was a power problem; some machines seem to still be booting
  • 20:42 brion: phoned kyle (message)
  • 20:38 brion: network mostly back up, still trying to get in
  • 19:20 brion: PowerMedium offline?
  • 15:20 Tim: shutting down mysql on lomaria for copy to ixia
  • 14:50 Tim: installed mysql on ixia
  • 8:45 jeluf: deleted binlogs 110-129 on srv34. Now 33GB of disk space are left, that's about 3 weeks.

April 8

  • 03:17 Kyle: ixia back up. Ready for mysql.
  • 00:30 brion: added presskontakt otrs alias

April 7

  • 19:40 brion: fixing favicons/logos for chapcom, spcom, internal
  • 19:00 jeluf: rebooted iris, mayflower
  • 04:44 Tim: updated article count on idwiki from 20777 to 21250 to correct for drift due to the subst bug. The recount was done by running select distinct pl_from from pagelinks,page where pl_from=page_id and page_namespace=0 and page_is_redirect=0; and observing the number of returned rows (the query is restated after this list).
  • 01:40 brion: upgraded svn to 1.3.0 on zwinger; should fix the group-writable problems with 'svn up'
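
The idwiki recount query from the 04:44 entry, restated for readability (same statement; the number of returned rows is the corrected article count, 21250):

-- count NS_MAIN, non-redirect pages that contain at least one internal link
SELECT DISTINCT pl_from
FROM pagelinks, page
WHERE pl_from = page_id
  AND page_namespace = 0
  AND page_is_redirect = 0;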

April 6

  • 21:26 mark: Reenabled knams in DNS
  • 00:38 brion: chapcom.wikimedia.org and wikimaniateam.wikimedia.org set up

April 5

  • 23:37 brion: added eventscom-l list
  • 22:41 brion: found someone had borked over the live svn checkout as root. reassigned the .svn files to brion/wikidev and made group writable. hopefully svn will cooperate...
  • 21:20 jeluf: removed binlogs 90 to 109 on srv34. Only 4GB of disk space were left.
  • 19:34 brion: yaseo magically reachable again. the network gods have been appeased!
  • 19:20 brion: yaseo unreachable via pmtpa squids. yaseo squid seems ok.
  • 5:45 jeluf: checked grub.conf on sq1...10 and fixed the default kernel setting. When doing a yum update, please make sure that the server still boots 2.6.11. Only that kernel has the drivers for the SATA controller. (A grub.conf sketch follows this list.)
  • 5:35 jeluf: Rebooted srv24 (cluster2). It had a kernel oops and the mysqld was not responding to any events - not even kill -9. Everything looks fine after the reboot.
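
A hedged sketch of the grub.conf check described in the 5:45 entry above; entry order and file names on sq1-10 are assumptions, the point is only that "default" must keep selecting the 2.6.11 entry that carries the SATA controller driver:

# /boot/grub/grub.conf (fragment, hypothetical entries)
default=0          # must keep pointing at the 2.6.11 entry below
timeout=5

title Fedora Core (2.6.11)                 # only this kernel has the SATA driver
        root (hd0,0)
        kernel /vmlinuz-2.6.11 ro root=/dev/sda2
        initrd /initrd-2.6.11.img

title Fedora Core (newer kernel from yum)  # do NOT let yum make this the default
        root (hd0,0)
        kernel /vmlinuz-2.6.15 ro root=/dev/sda2
        initrd /initrd-2.6.15.img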

April 4

  • 19:30 brion: after manually readding the lvs ip to sq1, upload.wm.o seems a bit better. THESE BREAK ON REBOOT. THEY NEED TO BE SET UP PROPERLY. (One possible persistent setup is sketched after this list.)
  • 19:20 brion: sq* messed up somewhat. sq1 missing net stuff; sq2-4(?) have broken ldap
  • 07:00 brion: got wikibugs back online; running on goeje alongside the mail server
  • 06:39 brion: fixed /mnt/math on goeje (bugzilla:5441); unmounted old upload shares no longer used.
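
The LVS service IPs in the 19:30 entry above are lost on reboot because they are added by hand; one possible way to make such an address persistent on a Fedora-era host is an interface alias file like the one below. This is an assumption about how it could be done, not a record of what was actually set up:

# /etc/sysconfig/network-scripts/ifcfg-eth0:0   (hypothetical file)
DEVICE=eth0:0
IPADDR=207.142.131.205      # substitute the actual service IP
NETMASK=255.255.255.255
ONBOOT=yes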

April 3

  • 21:50 jeluf: deleted srv34_log_bin.08*. There were only 2GB disk space left.
  • 21:05 mark: knams seems unreachable; redirected knams traffic to pmtpa.
  • 21:07 mark: Moved back
  • 21:16 mark: and back.

April 2

  • 19:22 brion: restarted enwiki dump. it crapped out yesterday, apparently with a database error of some kind ('could not connect: unknown error')
  • 08:30 brion: SVN-ified the live checkout in /h/w/c/php-1.5
  • 04:05 Kyle: db4 reinstalled FC4.
  • 04:00 Kyle: Bad drive in ixia found, RMA requested

April 1

  • 14:30 Tim: increased default max factor in compressOld to 5, this should reduce the number of talk pages that are compressed with two revisions per chunk. This means that if a talk page is 100KB, the compressed chunk can be up to 500KB.
  • ~14:00 Tim: easing thistle back into rotation after listing it in enwiki's section to prevent lag due to compressOld.php, see comments in db.php
  • 10:50 brion: briefly taking thistle out; it's lagging a lot
  • 08:00 brion: working on installing svn on leuksman server

March 31

  • 19:42 brion: tossed together maintenance/purgeList.php, takes page titles on stdin and runs squid purges on them (usage sketch after this list)
  • 19:10 brion: wiped fr.wikiquote.org
  • 18:35 brion: got mailman back up. ran into a mailman encoding bug rebuilding archives
  • 17:58 brion: shutting down mailman temporarily to edit archives
  • 15:15 Tim: fixed mysql cluster in ganglia
  • 14:35 Tim: running compressOld.php on enwiki
  • 14:20 Tim: shutting down lomaria again for copy to db2
  • Tim: Set up db1-db4, installed mysql. Shut down lomaria for a while to copy its data directory to db1. db4 needs its mysql upgraded when it comes back up (killed during testing)
  • 02:50 ixia down
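
A usage sketch for the new purgeList.php from the 19:42 entry (titles on stdin, one per line); the exact invocation and any wiki-selection argument are assumptions:

# hypothetical invocation: purge two page titles via stdin
printf 'Main Page\nHelp:Contents\n' | php maintenance/purgeList.php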

March 30

  • 18:40 brion: added info-da otrs alias
  • ~16:30 Tim: ran compressOld.php and moveToExternal.php on gawiki to test the various tweaks I made to both. Seems to have worked.
  • 14:40 Tim: running resolveStubs.php on all pmtpa wikis
  • 14:20 Tim: rebalanced database load
  • ~08:00 Tim: set up srv71-76 for external storage
  • 07:19 brion: tweaking the user-agent protections so 'PHP' check is case-sensitive; some false positive problems with '.php'
  • 05:45 brion: lomaria and thistle were behind on replication: slow newusers log display (WHAT IS THIS WHY DOES IT HAPPEN EVERY COUPLE OF DAYS) and some other maint script. killed threads, lomaria caught up; thistle still running maint rebuilds, took it temporarily out of rotation.
  • 04:13 brion: adding audio/midi for amane (bugzilla:5277)

March 29

  • 20:49 brion: set up cswikisource, mlwikisource, skwikisource. cs and sk imported pages from sourceswiki.
  • 19:15 brion: blocked access to frwikiquote dumps
  • 19:12 brion: locked frwikiquote per agreement
  • 18:16 brion: moved oversize x-moz; was breaking the 32-bit machines due to its >2gb-ness
  • 14:25 Tim: brought srv54 into apache service
  • 09:25 Tim: brought srv71-80 into apache service
  • 06:30 jeluf: activated upload directory creation in CommonSettings.php

March 28

  • 22:00 brion: hacked up refreshImageCount.php to force the ss_images columns to start replicating. the updater is still bad
  • 21:44 brion: ss_images is NULL on slaves; the updater script uses user variables which don't survive replication (illustrated after this list)
  • 21:30 jeluf: Rebooted iris
  • 19:45 brion: created missing upload dirs for bat_smg, closed_zh_tw, fiu_vro, frp, ksh, lij, map_bms, nds_nl, nrm, pap, pdc, rmy, roa_rup, tet, vls, xal, zh_min_nan, zh_yue
  • 09:06 brion: updated messages on he* wikis
  • 07:24 brion: someone enabled the experimental ajax search on dewiki and dewikibooks. I've turned it off, as I received a complaint and I agree it's a very unexpected and painful UI (takes over the screen with no warning, what the hell)
  • 05:20 Tim: running update.php on all wikis, to add the ss_images field
  • 02:20 Tim: Fixed various ganglia problems.
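
A minimal SQL illustration of the failure mode in the 21:44 entry, assuming the updater used a session user variable (hypothetical statements, not the actual script): the value is computed on the master but does not survive replication, so slaves apply the UPDATE with nothing in the variable and ss_images ends up NULL there; a statement that recomputes the value inline avoids the problem:

-- problematic pattern (hypothetical): value carried in a user variable
SELECT @images := COUNT(*) FROM image;
UPDATE site_stats SET ss_images = @images;   -- slaves may see @images unset => NULL

-- self-contained alternative (needs subquery support), everything in one statement:
UPDATE site_stats SET ss_images = (SELECT COUNT(*) FROM image);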

March 27

  • 18:08 brion: fixed upload dir for pmswiki; fixed permission on the wikipedia upload parent dir, which should allow the add script to add them automatically if it's doing that now. mounted upload3 on zwinger
  • 11:21 Tim: The upgrade had apparently stopped apache on most of the servers over the course of 2 hours, eventually causing extreme site slowness. Ran apache-start on all apaches.
  • 09:24 brion: finishing upgrades; some broke because yum is confused by the libxml manual upgrade. may want to check comparison vs fedora bits.
  • 01:49 brion: upgrading remaining PHP 5.1.1 boxen to 5.1.2; timezone bug showing +0100 instead of +0000 for UTC
  • 01:29 brion: added upload dir for test wiki

March 26

  • 22:52 brion: dropping lomaria temporarily from db.php as it's badly lagged atm -- running special page cache rebuilds or something
  • 22:00 JeLuF: SSH keys added to db1...db4, LDAP configured, NFS configured, NTP configured, timezone changed to UTC.
  • 21:30 JeLuF: Added Piedmontese wikipedia
  • 12:12 Kyle: db3 and db4 are now on vlan2 on csw1-pmtpa. Pingable.

March 25

  • 07:27 brion: fixing up backup processes; weren't properly setting server usage, might work around the adler oddities. (also fixing the display problem)
  • 07:00 jeluf: db3 and db4 (10.0.0.236 and .237) do not ping
  • 07:00 Tim: restarted squid on yf1001 and yf1003, heavy swapping
  • 04:32 Tim: Added namespaces to hewikibooks [2]
  • 02:35 Tim: reduced article size limit to 1MB on request from users in #wikipedia-en-vandalism

March 24

  • 23:00 jeluf: Added new Wikipedias for nds-nl, rmy, lij, bat-smg, map-bms, ksh, pdc, vls, nrm, frp, zh-yue, tet, xal, pap. See meta.
  • 16:40 jeluf: Restarted slave on zedler. Why doesn't it restart automatically?
  • 16:10 jeronim: unfirewalled all ICMP on benet to solve someone's problem with downloading from dumps.wm.org. /u/l/b/firewall-init.sh not altered because i don't know if that's the right script nowadays
  • 07:14 Kyle: db1-4 are ready for service at 10.0.0.234-237 with 408GB /a's
  • 5:40 jeluf: rebooting iris

March 23

  • 19:52 brion: another mystery case of 'Error: 1114 The table '<whatever>' is full' on adler. Various tables (job, text, pagelinks, etc). Plenty of disk space free, dump still running; unclear what's full. Adler's error log (060323 19:52:43) shows lots of "InnoDB: Warning: cannot find a free slot for an undo log. Do you have too many active transactions running concurrently?"
  • 19:49 jeluf: KNAMS back, switched back to old DNS map.
  • 19:05 jeluf: www.kennisnet.nl down, too, no SSH. DC outage assumed. Switched PowerDNS to point all of Europe to Florida.
  • 18:50 jeluf: KNAMS squids not responding. Load balancer?
  • 02:33 brion: starting enwiki backup again; last run got hit by a mysterious "Error: 1114 The table '#sql_a6a_0' is full (10.0.0.101)"

March 22

  • 07:30 domas: srv59, srv51 hit by /h/w/src/memcache/install-fc3, continuing...

March 21

  • 20:25 jeluf: srv59 is listed twice in the list of memcached servers. Replaced one of them by srv71.
  • 20:00 jeluf: Users complain about bad performance. No servers seem to be broken, but tugelas are behaving oddly. There are fast ones (0.05s for 100 requests) and slow ones (5s for 100 requests). Slow ones have bi values of 450, fast ones have bi values of 20. bo is 0. mctest at 20:17 UTC (a sketch of the test loop follows this list):
10.0.2.51:11000 set: 100   incr: 100   get: 100 time: 4.16831994057
10.0.2.55:11000 set: 100   incr: 100   get: 100 time: 0.0873651504517
10.0.2.53:11000 set: 100   incr: 100   get: 100 time: 0.0911560058594
10.0.2.54:11000 set: 100   incr: 100   get: 100 time: 3.38875198364
10.0.2.56:11000 set: 100   incr: 100   get: 100 time: 0.061262845993
10.0.2.70:11000 set: 100   incr: 100   get: 100 time: 3.37843799591
10.0.2.58:11000 set: 100   incr: 100   get: 100 time: 0.126893043518
10.0.2.59:11000 set: 100   incr: 100   get: 100 time: 6.54098010063
10.0.2.59:11000 set: 100   incr: 100   get: 100 time: 6.14648485184
10.0.2.62:11000 set: 100   incr: 100   get: 100 time: 4.1362080574
10.0.2.64:11000 set: 100   incr: 100   get: 100 time: 4.54642486572
10.0.2.65:11000 set: 100   incr: 100   get: 100 time: 0.0734169483185
10.0.2.66:11000 set: 100   incr: 100   get: 100 time: 3.67762804031
10.0.2.68:11000 set: 100   incr: 100   get: 100 time: 0.155061006546
10.0.2.69:11000 set: 100   incr: 100   get: 100 time: 5.22008705139
localhost set: 100   incr: 0   get: 0 time: 0.0392808914185
  • 9:00 jeluf: rebooted hawthorn, mayflower, sage, clematis
  • 7:00 Kyle: Racked 4 new database servers, pending names and ip's.
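
A rough reconstruction of what the mctest figures above measure: 100 sets, 100 increments and 100 gets against each memcached/tugela instance, timed per host. This is a sketch assuming the python-memcached client, not the actual mctest script:

# mctest-style probe (sketch): time 100 set/incr/get cycles per host
import time
import memcache   # python-memcached client (assumed)

HOSTS = ['10.0.2.51:11000', '10.0.2.55:11000']   # subset of the pool above

for host in HOSTS:
    mc = memcache.Client([host])
    counts = {'set': 0, 'incr': 0, 'get': 0}
    start = time.time()
    for i in range(100):
        key = 'mctest_%d' % i
        if mc.set(key, 1):
            counts['set'] += 1
        if mc.incr(key) is not None:
            counts['incr'] += 1
        if mc.get(key) is not None:
            counts['get'] += 1
    elapsed = time.time() - start
    print('%s set: %d   incr: %d   get: %d time: %s' % (
        host, counts['set'], counts['incr'], counts['get'], elapsed))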

March 20

  • 19:51 brion: dumps started up again in pmtpa
  • 19:30 jeluf: added symlink to init.d/nfs from rc3.d on benet
  • 19:13 brion: manually banged on benet, got it back online on the external IP. Somehow it's switched from using eth0 to using eth1, and config needs to be adjusted.
  • 18:54 brion: Someone, somewhere, somehow rebooted benet for some reason around midnight UTC two hours ago and there's a network problem, can't be reached from zwinger.
    • 16:44 PM rebooted benet
  • 15:30 jeluf: dumps.wikimedia.org down, connection refused when trying to ssh to the box, HTTP times out.

March 19

March 18

  • 08:40 jeluf: added srv36 to external storage cluster 3.

March 17

  • 21:55 brion: srv60's memcached/tugela/whatever is VERY slow, 120s response time. can't ssh in. temporarily replacing it with srv59 in the mc cluster

March 16

  • 23:14 brion: added redirects for quickipedia.(org|net) as requested
  • 21:45 jeluf: Set up mysql server on srv36, replicating data from srv34 (cluster3). No old data imported to srv36, yet.
  • 20:00 jeluf: Set up squid on srv8, moved one IP from srv6 to srv8
  • 19:00 jeluf: restarted srv7's squid, using /usr/sbin/squid instead of /usr/local/squid/bin/squid

March 15

  • 20:01 brion: adjusted checkers.php logging to use @ on all error_log() calls, so files that are forgotten on yaseo don't display warnings
  • 19:00 jeluf: moved IP .204 from srv7 to srv9 (now they have 3 IPs each)
  • 14:00 jeluf: restarted srv7's squid

March 14

  • 08:04 brion: fixed bad permissions on some servers which broke sync-dblist script (uses rsync to copy *.dblist out)
  • 07:44 brion: set up zh.wikinews.org
  • 07:00 brion: setting up spcom s3kr1t wiki

March 13

  • 22:55 brion: fixed (hopefully) the fallback for text loading. it was broken, badly, didn't notice before :P
  • 22:45 jeluf: fixed replication of srv33. It has a gap from 15:00-22:45. Added back to pool. If a revision does not exist, the master should be asked anyway.
  • 15:45 midom: srv32 manually resynced with srv34, srv33 still down
  • 14:45 jeluf: srv32 and srv33 have out-of-sync replicas, shut them down. srv34 overloaded, went read-only
  • 14:00 ævar: / on srv34 filled up, cleared out /tmp/mediawiki/, approx 70MB left

March 12

  • 23:00 mark: Moved back ns0.wikimedia.org's IP to zwinger to get DNS back up
  • 00:54 brion: renaming wikimaniawiki to wikimania2005wiki to future-proof and convenience things

March 11

  • 22:10 jeluf: set up NFS, NTP, timezone, ... on ixia, added it to the mysql pool
  • 07:30 jeluf: ixia doesn't start replication:
060311  2:13:04 Failed to open the relay log './lomaria-relay-bin.312' (relay_log_pos 36322078)
060311  2:13:04 Could not find target log during relay log initialization
060311  2:13:04 Failed to initialize the master info structure
The file is there, permissions are there, no idea what's wrong
  • 06:55 jeluf: restarted mysql on lomaria
  • 05:09 brion: fundraising display partially back online. waiting for dns to clear, and will start regularly updating again....
  • 01:12 brion: got friedrich switched; on 207.142.131.232. rebooting to test...

March 10

  • 23:00 brion: taking friedrich out of apache service to replace tingxi
  • 22:20 JeLuF: Taking lomaria down to copy its DB to ixia. Will take some hours.
  • 07:20 Solar: yongle back up, but no public interface, just private. (It only had one cat5; let me know if you want me to hook another up to csw1.)

March 9

  • 05:22 Solar: Corrected password on ixia

March 8

  • 23:25 brion: thistle caught up, back in service
  • 23:23 brion: taking thistle out of rotation temporarily; it's behind on master. reports of edits overwriting without conflict message may or may not be related
  • 19:35 jeluf: Changed IP of mail.wikimedia.org from .207 to .221. This allows us to move ns0 back to zwinger (needs to be done later, when the change is known on all DNS servers)
  • 08:00 jeluf: khaldun had two default gateways. Removed default gw 10.0.0.4, ping to goeje works, NFS works
  • 08:00 jeluf: Khaldun down, NFS times out. No user complaints yet - is khaldun still in use at all? Update: Zwinger can reach khaldun, but goeje can't. Routing?

March 7

  • 23:53 avar: Made Naconkantari sysop on kowiki due to massive WP is communism vandalism which none of the kowiki admins were awake to clean up.
  • 02:48 Tim: Added Ozemail proxies to the trusted XFF list

March 6

  • 11:50 jeluf: Changed config to use spamd instead of spamassassin
  • 11:30 domas: reduced postfix, apache concurrency on goeje
  • 11:30 jeluf, domas: goeje up, rebooted by PM.
  • 09:00 jeluf: goeje down, postfix and apache shutdowns didn't help
  • 08:00 jeluf: goeje overloaded, load avg 260, slow to no response. shut down postfix, shutting down apache
  • 07:45 jeluf: replication of srv33 in sync with master. Restarted srv33 with mysql port 3306 enabled.

March 5

  • 23:10 brion: added external.log for ExternalStoreDB load failures. we think mysterious text load failures might have been from srv33
  • 23:05 jeluf: started srv33 with mysqld port set to 3307
  • 22:50 jeluf: killed the wiki by starting lagged external storage srv33; killed it again.
  • 22:40 brion: jens put us back to read/write as the threads finished
  • 22:19 brion: adler broken. nobody bothering to update the admin log
    060228 20:08:48 InnoDB: Warning: cannot find a free slot for an undo log. Do you have too many active transactions running concurrently?
    Processlist showed several hundred attempts to invalidate one image page (Vynil_record.jpg). Perhaps from an automated job?
    • a template change

March 4

  • 21:35 brion: fixed problem (whitespace in language file), captchas back on except for sr.wikipedia, which is reasonably well-populated
  • 21:28 brion: disabling captchas on all sr projects; broken on sr for some reason
  • 12:05 brion: yaseo uploads resolved (bad symlink into /mnt/wikipedia/htdocs on yaseo docroot), math also fixed (rewrite condition crashed apache; changed it and now works)
  • 11:43 brion: noticed amaryllis's / partition is very small (10G) and full. nice.
  • 11:40 brion: yaseo uploads borked for some reason. tossed in a symlink on amaryllis so /mnt/upload works there, but not sure why many still don't work on http

March 3

  • 20:00 jeluf: set up new queues info-ch and info-als on OTRS.
  • 05:04 Tim: set up daily cron job on goeje, to backup its root directory to hypatia once per day, at 06:00.
  • 03:13 brion: started another enwiki dump, yaseo dump
  • 03:03 brion: installing setproctitle on srv31; php is whining

March 2

  • 23:09 brion: adding dns entries for wikimania200[56].wikimedia.org, will set up new wiki and redirects shortly
  • 14:00 Tim: started rsync of goeje's root directory to hypatia:/var/backup/ssl-server, for backup and maybe failover capability in the future.

March 1

  • 22:16 brion: turning on wgEmailAuthentication on public wikis. Somehow goeje got blacklisted by spamcop, allegedly for sending to blackhole addresses. There's a small possibility that active spamming was attempted through the wiki.
  • 05:30 Solar: srv55, srv57, srv61, srv67 have new ram, and are up, but out of sync
  • 04:20-04:25 Tim: srv54, a tugela server, was accidentally rebooted. This took the site down for about 5 minutes, probably due to unconfigurable fwrite() timeouts on persistent connections.
