Server admin log/Archive 4

From Wikitech

July 31

  • 11:50 brion: knams squid at 145.97.39.138 is not reachable, but still in dns rotation. THIS IS BAD
  • 01:50 brion: pascal is offline, reason unknown. bugzilla down, no NFS for knams cluster.

July 28

  • 01:06 kate: put a new skin on bugzilla

July 27

  • 18:50 brion: blocked irc4ever.net remote page loaders

July 26

  • 08:08 kate: upgraded mysql on vandale to 5.0.9

July 25

  • 19:05 brion: set $wgMetaNamespace to 'Vikipedi' on trwiki, refreshing links
  • 18:15 mark: Added two missing kennisnet squid IPs to the udpmcast startup script on larousse, and restarted it.
  • 17:29 brion: added wikimania-l mailing list
  • 17:25 mark: Pointed thailand at knams as a test - some people there say it is much faster than pmtpa. Will eventually be replaced by the yahoo cluster anyway...

July 24

  • 16:15 brion: set ndswiktionary to capitallinks off
  • 10:10 brion: updated sudoers file on srv0 so syncs work again

July 22

  • 22:50 brion: restarted search update daemon... still seems to be a memory leak and it hangs when it gets too large
  • 22:31 brion: moved wiki.mediawiki.org to www.mediawiki.org and redirected from mediawiki.org and wiki.mediawiki.org to it
  • 22:07 brion: srv0 clock was about 150 seconds in the future. kate did something to fix it. synchronized hardware clocks (hc) on all apaches to system time in the hope that reboots keep the right time. Fixed one revision reported as appearing in a weird inverted order.
  • 13:50 brion: took avicenna out of search group to do experiments on index

July 21

  • 23:30 Tim: added rollback group
  • 22:00 Tim: moved group settings from CommonSettings.php to InitialiseSettings.php

July 20

  • 23:45 brion: updated clocks on srv1, rabanus, etc all apaches... hopefully
  • 21:40 brion: set wgCapitalLinks off on afwiktionary
  • 19:20 mark: Removed legacy zone gdns.wikimedia.org and corresponding georecord rr.gdns.wikimedia.org from all nameservers. It's not being used anymore, and only confuses people.
  • 19:05 mark: Pointed france and switzerland back at lopar in geodns
  • 14:10 brion: created wikinews-hotline mailing list by request

July 19

  • 23:58 Tim: fixed Special:Uncategorizedcategories, now running updateSpecialPages.php on /h/w/c/smallwikis
  • 15:30 brion: reverting build copy of search index to the previous version to try working around some corruption from daemon crash (?)

July 17

  • 18:27 mark: An empty line in the geomap file caused problems and made the site go down for non EU users. Apparently geobackend currently doesn't handle empty lines in geomap files (a bug which I will fix), so don't use them.
  • 18:18 mark: Pointed all European countries at knams wrt geodns
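
The 18:27 entry above reports that a single empty line in the geomap file broke geodns for non-EU users. A minimal sketch of a parser that tolerates blank lines follows. The two-column "country target" layout, the `#` comment syntax, and the function name are assumptions for illustration only; the log does not describe the real geobackend file format.

```python
def parse_geomap(text):
    """Parse hypothetical 'COUNTRY TARGET' lines into a dict.

    Blank lines and '#' comments are skipped instead of being treated as
    malformed records. The file layout here is assumed, not geobackend's.
    """
    mapping = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Drop anything after a comment marker, then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # empty or whitespace-only line: ignore it
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected 'COUNTRY TARGET', got {raw!r}")
        country, target = parts
        mapping[country.lower()] = target
    return mapping


if __name__ == "__main__":
    sample = "nl knams\n\nfr lopar\n   \nus pmtpa\n"
    print(parse_geomap(sample))
```

Rejecting genuinely malformed lines with a line number, while silently skipping empty ones, keeps a stray newline from taking down every lookup.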

July 16

  • 17:07 kate: wrote a new statistics system and replaced webalizer with it
  • 07:30 brion: had to restart search daemons again due to breakage. whyyyyyyy they worked before *sob*
  • 00:15 hashar: overloaded suda for almost 5 minutes by running the unbugged updateSpecialPages script. Might be cause of Wantedpages.

July 14

  • 02:50 brion: separated mediawiki-installation and apache node groups. These must not point to the same file.
  • 02:00-3:15 erik: created Japanese Wikinews at http://ja.wikinews.org/

July 13

  • 20:59 brion: had to interrupt bgwiki backup due to memcached hang
  • 06:10 brion: restarted search servers; 'too many open files'
  • 01:30 brion: started backup on benet (slave stopped). updates in #wikimedia.15status

July 12

  • 23:35 brion: commented out lopar from geodns for now (moved them to knams)
  • 23:20 brion: there's intermittent packet loss to lopar...
  • 19:10 mark: Site was down due to crashed perlbal on holbach, restarted it
  • 12:03 kate: put lily back to squid pool
  • 08:10 jeronim: set yum on larousse (FC2) to use fedoralegacy.org
  • 08:00 mark: lily's hardware has been replaced.
  • 07:40 jeronim: set HostnameLookups Off on larousse's apache at hashar's request
  • 07:10 jeronim: added CNAME commons.wikipedia.org -> commons.wikimedia.org
  • 00:40 brion: restarted mysql on james's advice with config change. innodb_lock_monitor fails, however. have innodb_status_file=1 set now. had to do 'slave stop' on samuel, which is master. wtf

July 11

  • 23:40 brion: set innodb_lock_monitor on samuel on jameday's recommendation. will be active when mysqld restarted
  • 23:20 jeluf: restarted ServmonII. Died when it lost its irc connection earlier today.
  • 23:05 brion: removed the fateful link so editing that page works for now
  • 22:30 brion: disabled deletion of recentchanges records due to slowness there. hacked Title::touchArray to go row by row due to weird hangings trying to edit Template:POTD on enwiki. Not sure what's wrong, it consistently hangs at User:Mulad/portal. What could be locking it?
  • 18:30 brion: biased search load to maurus, as avicenna (with less memory) was being sluggish. added comment to output saying which server was hit
  • 15:10 mark: Removed authoritative zones that were no longer pointing at zwinger from zwinger's Bind configuration (interferes with resolving). Set up AXFR slaving of zones that are supposed to be served by the new PowerDNS servers, but which are still delegated to Zwinger/bomis/fuchsia.
  • 14:50 mark: Fixed reverse DNS for knams

July 10

  • 17:00 brion: shut down slave thread on ariel before it explodes
  • 05:40 hashar: check out our new portal: http://noc.wikimedia.org/
  • 01:07 kate: removed ariel from load balancing because it only has 700MB of disk space left.

July 9

  • 10:30 brion: fixed up steward mode in special:makesysop plugin to provide the full userrights options
  • some time in the morning kate: reverse dns for knams started working, although under *.rev.wikipedia.org.
  • 08:02 brion: reassigned 'developer's on meta to steward group

July 8

  • 5:20 brion: started mass lucene index builds using the updater daemon. once done, will sync current index files out. (progress in #wikimedia.15status)

July 7

  • 13:50 brion: added page update hook for the lucene update daemon, see wikitech-l post
  • 11:38 mark: Installed java (!) on pascal, to allow Kennisnet/ZX to upgrade the SP and BIOS on lily.
  • 11:34 brion: maurus had bogus hostname (maurus.wikimedia.org, doesn't resolve). fixed live and in /etc/sysconfig/network
  • 08:55 brion: upgraded PEAR::XML_RPC to 1.3.2 on mediawiki-installation group. Patching mono on avicenna and maurus for ximian bug 75467
  • 08:30 brion: noticed vincent seems to be hung
  • 07:00 Jamesday: changed holbach cache split from 200M/2800M to 200M/2500M because of excessive page faulting in vmstat, not yet restarted.

July 6

  • 14:40 Tim: named on albert exited for no apparent reason, causing site-wide slowdown. Logged on via the scs and started it.
  • 07:00 brion: all wikis reading from 1.5 code now. zh-min-nan.wikipedia.org has the UI broken -- code problem selecting wrong UI language [since fixed]
  • 06:30 brion: fixed up broken conversions on sdwiki, rowikibooks, fiu_vrowiki, cowikibooks, aawiki
  • 06:00 brion: upgraded meta to 1.5
  • 04:00 kate: upgraded all knams machines to current kernel to fix bad pmd problem

July 5

  • 10:43 kate: put back mint to squid pool
  • 9:15 mark: Added zh-tw.wikimedia.org CNAME record to the wikipedia.org zonefile, as it was missing (and is not in langlist, for not being a language)
  • 8:40 mark: Added an admin account on lily's SP, and set up temporary port forwarding on pascal to give ZX (sysadmin partner of Kennisnet) access to diagnose lily's hardware problems

July 4

  • Jason/mark: Many Wikimedia project domains have been changed to use the new PowerDNS DNS servers, so if you see any DNS related problems, it might have to do with that
  • 19:32 kate: set up squid log migration system
  • 08:10 brion: migrated forgotten changes to InitialiseSettings from 1.4 to 1.5 (jbowiki caps, fiu-vro logo, zhwiki externalstorage)
  • 03:08 kate: removed srv1 from apache pool again.

3 July

  • 21:35 jeronim: srv1, srv2 & LDAP alive again after manual reboot by colo staff. not sure if domas actually emailed about scs-ext problem.
  • 20:05 jeronim: and scs-ext.wm.org doesn't work anymore. dammit has emailed colo about this and srv1/srv2 problem
  • 20:00 midom: oopsie, srv1 also didn't come up after reboot, and apparently it was LDAP server... LDAP down.
  • 19:00 midom: resyncing holbach, updated misbehaving apache hosts (srv2,srv3,anthony,rose), srv2 didn't come up after reboot.
  • 06:10 brion: holbach crashed again, mysqld was restarting over and over. killed it for now.
  • 05:05 brion: fixed more wikimania registration files
  • 02:20 brion: fixed missing db config in wikimania attendees list

2 July

  • 21:55 brion: holbach died. restarted zhwiki conversion w/o it.
  • 19:30 brion: started asian large-wiki upgrades: jawiki, zhwiki
  • 16:00 midom: bacon joined perlbal service, restarted perlbal on holbach, site looks happier.
  • 09:00 brion: eswiki upgraded, doing ptwiki now. dammit took ariel out of rotation, ready for reloading
  • 07:40 kate: moved bugzilla to pascal
  • 06:51 brion: fixed db host for wikimania registration
  • 06:45 midom: samuel is our master.
    • mediawiki 1.4, mediawiki 1.5, bugzilla, and otrs should be configured properly for new master. is there anything else? [search server update needs changing anyway, working on this --brion]
  • 04:50 brion: ran refreshLinks on enwikinews
  • 04:30 brion: disabled sorbs checking for now
  • 02:40 Jamesday: changed bacon cache split from 800M/2000M to 200M/2600M, not yet restarted.
  • 02:30 Jamesday: changed holbach cache split from 1000M/2000M to 200M/2800M, not yet restarted.
  • 02:05 brion: running background refreshLinks.php on dewiki

1 July

  • 22:20 Jamesday: changed ariel my.cnf from MyISAM/InnoDB cache split of 1700M/3900M to 300M/5100M assuming minimal MyISAM use now. We've been this high before for InnoDB but there's a small chance that the new kernel on ariel might not like going above 4G on the next restart - reduce it to 3900M if that happens. Not restarting ariel now because one is planned anyway and it's not that urgent - should improve load handling ability though. Decreased binlog_cache_size from 1M to 128k (it's per session and doesn't really need to be 1M).
  • 08:20 brion: changed Revision legacy encoding conversion to use //IGNORE in iconv... this may need tweaking
  • 06:10 brion: dewiki done.
  • 05:56 brion: moved 1.5 skins dir from /w/skins-1.5 to /skins-1.5. Turns out squid configuration does cache-control rewriting on /w which makes them uncacheable. Bad squid!
  • 00:45 brion: switched 1.5 wikis to shared filesystem sessions. A hack in User::matchEditToken fatally broke save attempts by previously-logged-in users because it didn't bother to check that memcached sessions were in use; I've commented it out.
  • 00:30 brion: switched 1.4 wikis to shared filesystem sessions, perhaps this will relieve memcached session problems?

30 June

  • 23:00 brion: installed test fix for firefox intermittent download problem
  • 06:30 brion: set tidy's line wrapping off on 1.4 config as well (already on 1.5)
  • 01:50 brion: finished. running refreshLinks.php on en.wiktionary.org (in background)
  • 01:00 brion: running cleanupCaps.php on en.wiktionary.org to rename all article pages to lowercase

29 June

  • ??:?? brion: somebody moved en.wiktionary.org to wgCapitalLinks off, throwing it into total chaos. thanks!
  • 22:12 brion: removed some unused, added Mac OS X 10.4 to bugzilla operating systems list
  • 18:18 brion: set $wgCapitalLinks off on jbo.wikipedia.org

28 June

  • 22:20 brion: adding image table entries for 'missing' images (probably broken or half-canceled uploads from months back, mostly)
  • 15:50 kate: setup ganglia, ssh, yum on adler and samuel
  • 05:45 kate: set up and documented a better LDAP setup. removed srv1 from apache pool.
  • 01:03 brion: enwiki upgrade broke with its slave reads: page table was incomplete. rebuilding page table from ariel, ETA ~2hrs

27 June

  • 22:35 brion: turned off image metadata loading to speed things up -- will need to do that in a later script run
  • 20:25 brion: dropped & recreated empty links on enwiki to free innodb space (already converted)
  • 19:15 brion: disabled email authentication for now; will do mass checks later
  • 08:40 brion: enwiki upgrade is now pulling revision data from adler, writing to ariel.
  • 07:55 brion: somewhere in the midst of upgrading things. enwiki is going now; upgrade1_5.php is hacked up, please don't run any others until it's restored!
  • 03:35 brion: adler was broken and badly lagging because somebody removed its 'cur2' tables and replication died when we dropped them from the master. fixed, returning...
  • 02:15 brion: commons, wikinews, wikiquote, wiktionaries, and some misc are upgraded. Wikipedias and some others remain... Need to clear disk space on ariel

26 June

  • 04:50 brion: commonswiki being upgraded; ETA in 6-7am range
  • 04:10 brion: upgraded nostalgiawiki as test
  • 02:10 brion: setting things up preparing for 1.5 upgrade

24 June

  • 14:00 brion: adler back in rotation. probably needs reconfiguration for future...
  • 13:30 brion: took adler out of rotation; mysqld crashed OOM and is recovering

23 June

  • 21:45 mark: ...apparently because it was pointing at cache.wikimedia.org., which didn't exist in the new DNS zones... added.
  • 21:45 mark: wiktionary.zone contained an old record fr.wiktionary.org CNAME wikipedia.geo.blitzed.org which for some reason made things break only now. Removed.
  • 05:25 brion: changed sitename, metanamespace on la.wikiquote

22 June

  • 12:00 mark: Changed www-dumps.knams DNS to CNAME dumps in preparation for moving vandale to an internal vlan
  • 05:45 midom: did set global mysql timeout in php to 2 seconds.
  • 05:22 Tim: restored load to samuel, also experimentally changed some other loads
  • 04:28 Tim: realised that the site was down because of samuel and changed the load balancing ratios accordingly. At this time samuel is busy doing InnoDB recovery.
  • 04:22 Tim: finished moving binlogs
  • 04:18 samuel's mysqld exited for no apparent reason
  • ~04:10 Tim: started moving binlogs 232-240 to khaldun

21 June

  • 20:48 mark: Network problem has been worked around, switched geodns back.
  • 19:30 mark: Severe network / reachability problems for florida, but knams seems to be able to reach it. Pointed all of geodns at knams and lopar exclusively.
  • 00:01 kate: moved binlogs 228..231 from ariel to khaldun

19 June

  • 20:00 mark: New DNS setup is active, but DNS zone delegations still need to be dealt with. Please note that there are NO wildcards anymore, so you will need to update DNS zonefiles when creating wikis! Also, for the next week or so, update both the old DNS setup, and the new one when changing records.
    Problems will occur, DNS records may be missing, please tell me or update it yourself!
  • 19:00 mark: I broke the site (for the first time, yay! :) because of the mixed old and new DNS setup; the old zonefile was using rr.chtpa. while the new one expected rr.pmtpa.. Oops. Negative cache TTL of 1H means that some users will not be able to access the site for a while.
  • 18:50 mark: Activated new DNS setup on zwinger, which is partly used by the old Bind DNS setup
  • 17:30 mark: Added records ns0/1/2 to wikimedia.org to allow changing NS delegation for the new setup
  • 15:15 mark: Removed zwinger/gdns1 from the list of geodns nameservers in wikimedia.org, in order to build a new setup on zwinger
  • 00:50 brion: upgraded mono on vincent to 1.1.8 (rpm packages), running a mass lucene index update

18 June

  • 12:30 jeronim: added dsh node groups at florida: squids_lopar, squids_knams & squids_global
  • 11:30ish jeronim: added Disallow: /wiki/? to robots.txt because bots were indexing stuff like http://en.wikipedia.org/wiki/?title=Nl%253Aolijfboom&action=edit
  • 06:14-06:37 Tim: Started Folding@Home on mint, ragweed, hawthorn and mayflower
  • 06:14 Tim: Started Folding@Home on iris. I started it on sage and clematis a few days ago without logging it here.
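
The 11:30ish entry above adds `Disallow: /wiki/?` to robots.txt so that bots stop indexing query-string URLs under /wiki/. A minimal sketch of how that rule applies, assuming plain prefix matching of the rule against the URL's path plus query string (individual crawlers may match differently); the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Rule quoted in the log entry; everything else here is illustrative.
DISALLOW_PREFIXES = ["/wiki/?"]


def is_disallowed(url, prefixes=DISALLOW_PREFIXES):
    """Return True if the URL's path+query starts with a disallowed prefix."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(target.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    for url in (
        "http://en.wikipedia.org/wiki/?title=Nl%253Aolijfboom&action=edit",
        "http://en.wikipedia.org/wiki/Main_Page",
    ):
        print(url, "->", "blocked" if is_disallowed(url) else "allowed")
```

Under this assumption, ordinary article URLs such as /wiki/Main_Page stay crawlable, because the prefix only matches when a `?` immediately follows /wiki/.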

16 June

  • 23:50ish brion: uplink via level3 died for a few minutes, either was fixed or PowerMedium rerouted it and we're back.
  • 22:24 brion: webster is dead: ssh doesn't let in, scs doesn't respond on it. does ping. possible kernel panic.
  • 20:15 jeluf: webster's mysqld got a signal 11, recovered automatically.

15 June

  • 21:10 mark: Set up new requests/s stats at http://noc.wikimedia.org/reqstats/
  • 20:40 mark: Removed lily from the knams squid pool in wikimedia.org DNS, it's broken.
  • 20:30 mark: Added missing peer statements to knams squids

14 June

  • 08:45 jeluf: moved ariel_log_bin.21[345678] to khaldun

13 June

  • 20:55 brion: truncated searchindex table on a bunch of wikis, freed ~5gb of disk space on ariel
  • 20:30 brion: rebuilt search indexes for dawiktionary, svwiktionary due to bad encoding config
  • 16:00 midom: used new apache-restart-all-hard (really hard!) so that slow watchlists (which actually was segfaulting apaches due to bad bytecode in cache on nearly all servers) would become fast ;-) we really need blank page logging somewhere..
  • 09:30 midom: removed icpagents from some of apache hosts, was a major headache recently.. ;-)
  • 09:00 midom: commented out some defunct or non-apache hosts (uh oh, nearly 10 in total) in perlbal's nodelist.

12 June

  • 03:12 brion: started adding name_title unique index on remaining smaller wikis (<10k pages), 30 second wait between each
  • 01:25 brion: unlocked ja.wiktionary on Angela's request

11 June

  • 22:20 jeluf: moved binlogs 204-212 from ariel to khaldun
  • 12:02 brion: stopped index addition for now (left off at bgwiktionary), will run at non-peak hours
  • 11:54 brion: dupe checks done, adding index...
  • 11:29 brion: running cleanupDupes.php on all wikis not already protected with a unique namespace+title index, then adding the index. the largest wikipedias were already protected.

10 June

  • 10:00 brion: ran a salt fixup script to correct entries which had been erroneously re-saved with bad password due to memcached records floating around in the first couple days

9 June

  • 21:45 chaper: inserted the CD that was delivered with the new hosts into srv4. jeluf mounted to /media/cdrom. Apparently containing RAID controller software for many OSes incl linux.
  • 19:55 kate: lily hung and died again during fsck. moved its ip to ragweed and left it off for now.
  • 02:50ish brion: inserted live debugging hack in Article.php for deletion problem on en.wikipedia.org (bug 2195)

8 June

  • 19:25 jeluf: moved binlogs 200-204 from ariel to khaldun. New DB servers have arrived in data center.
  • 13:15 brion: ran namespaceDupe checker on skwiktionary, skwikiquote due to prob w/ namespace changes there
  • 00:35 brion: added wikiskan-l list for Scandinavians

7 June

  • 22:07 kate: setup dumps mirror at http://www-dumps.knams.wikimedia.org/
  • 19:00 jeluf: moved binlogs 198 and 199 from ariel to khaldun
  • 18:48 brion: reactivated search
  • 9:00-19:00 all: Moved to new Tampa data center
  • 10:00 brion: replaced lighttpd on fuchsia with apache because the errordocument stopped working for no reason
  • 08:00 or so; brion: added fuchsia to wikimedia.org dns, using an alias from dammit because of crappy verio interface. still not on wikipedia.org because we can't get in to it.
  • 07:00 or somewhat: horrible things begin

6 June

  • 13:40 kate: started copying dumps to vandale
  • 11:30 kate: made a small db change for wikimania registration to implement a change in the form. left a backup of the old one at zwinger:/root/wikimania.prekate.sql
  • 10:05 kate: set up logrotate on knams
  • 01:43 Tim: moved binlogs 194-197

5 June

  • 22:40 kate: reinstalled mint with better partition layout, added it to squid pool
  • 21:00 gwicke: fixed mysql error messages in this wiki after config tweak to index words from 3 chars. You should now be able to search for things like 'DNS'.
  • 14:55 mark: bound bind to 145.97.39.130 (pascal's main ip) only, adapted firewallinit to allow incoming DNS zone transfers
  • 14:19 kate: added lily to squid pool
  • 13:25 mark: Added ip 145.97.39.158 to pascal, adapted /sbin/ifup-local.
  • 10:02 kate: iris -> squid pool
  • 09:03 kate: clematis -> squid pool
  • 08:46 kate: sv,dk,no.wp -> knams
  • 08:09 kate: de.wp -> knams
  • 07:56 kate: put mayflower to knams squid pool. fixed typo in commonsettings breaking squid caching.

4 June

  • 18:27 kate: added hawthorn to squid pool
  • 18:10 kate: created rr.knams pool, put UK, NL, DE and LT on it.
  • 16:28 kate, jer, dammit: started squid on ragweed, put it in lopar pool for now
  • 15:30 jeluf: moved binlogs 190-193 to khaldun
  • 13:54 jeronim: built new squid for will as old one had file descriptor limit of 1024 instead of 8192 so it was running out of FDs. In /home/wikipedia/src/squid/squid-2.5.STABLE9.wp20050604.S9plus.no2GB[icpfix,nortt,htcpclr]

3 June

  • 23:30 brion: fixed salting on user_newpassword for accounts not touched since the change.
  • 20:40 mark: Wrote /sbin/ifup-local script on pascal, to handle post-ifup tasks. Currently adds 10.21.0.2/24 IP to eth1 for accessing the LOMs.
  • 20:00 mark: Set up permanent source routing on pascal for Kennisnet out of band access using /etc/sysconfig/network-scripts/route-eth1 and rc.local
  • 19:05 mark: Rebooted csw2-knams with newer crypto image, setup SSH, changed DNS resolver
  • 09:40 kate: created 400GB LV at /sqldata on vandale, ext3. installed mysql. copied ariel's my.cnf over (can someone look at what needs to be changed there?). did not populate any sql data yet.
  • 05:50 kate: REMOVED WILDCARD NS RECORD under *.wikimedia.org. this means you will need to add NS records for new wikis in that domain or they won't work.
  • 05:48 kate: set up recursing NS on pascal and mayflower; tested pdns slave for wikimedia.org on fuchsia, seems to work (but not authoritative yet).
  • 00:05 Tim: moving binlogs 186-189 from ariel to khaldun

2 June

  • 06:15 brion: clearing user records from memcached. two instances of can't-log-in reported might have been caused by stale cache records re-saving bogus unsalted passwords, but that's sheer speculation.
  • 06:00 JeLuF: fixed mail on dalembert and goeje to use smtp.pmtpa.wmnet as smarthost
  • 05:45 JeLuF: removed moreri and bart from "apaches" nodegroup

1 June

  • 19:10 JeLuF: moved binlogs 184 and 185 from ariel to khaldun
  • 15:04 Tim: fixed timezone on coronelli
  • 14:35 Tim: had a go at fixing ntpd on various servers. It was not installed on coronelli and not running on srv5, fixed both fairly easily. Synchronised configuration files on srv11-30, they're still reporting "synchronization failed" as ntpd starts up, although I was able to synchronise their clocks manually with ntpdate. "ntpdc -p" seems to indicate that they are working properly.
  • 5:10 jeluf: Added index, set site to read/write
  • 04:10 brion: updated user tables for password hash salting.
  • 3:00 jeluf: set farm to read only

31 May

  • 16:12 Tim: switched profiling from user time to real time
  • 13:45 brion: experimentally disabled MakeSpelFix in lucene search results to compare load / response time
  • 5:00 jeluf: CREATE INDEX id_title_ns ON cur (cur_id, cur_title, cur_namespace); on all wikiquote, wikinews, wiktionary, wikibooks, dewiki and all wikis with 10'000 to 100'000 articles. To be done tomorrow: enwiki, frwiki, jawiki, wikipedias with <10'000 articles

30 May

  • 18:02 kate: starting copying khaldun:/usr/etc/images/enwiki/enwiki_upload.tar to srv11:/usr/etc/backup/images/
  • 11:55 brion: enwiki image archives and thumbnails have by now been copied to khaldun. all should be right with the world.
  • 07:49 brion: increased bacon's share of load, but not quite up to previous levels
  • 05:20 jeluf: moved binlogs 175-179 from ariel to khaldun
  • 03:20 brion: took khaldun out of apaches group, added to images group. en.wikipedia.org images are moved to khaldun, thumbnails still copying.

29 May

  • 23:30 brion: working on moving en.wikipedia.org's uploads from albert to khaldun
  • 21:18 brion: reduced load on bacon to keep it from lagging
  • 11:15 brion: added bugzilla stats collection to cron.daily

28 May

  • 20:07 kate: started a full image dump on khaldun using modified backup scripts
  • 08:00-ongoing jeluf: Migrating enwiki to external storage
  • 07:30 jeluf: moved binlogs 170-174 from ariel to khaldun

27 May

  • 22:59 brion: lucene search on wikimedia-wide
  • 11:57 brion: servmonii seems to be offline; not on irc, and smlogmsg fails when doing syncs
  • 11:38 brion: installed simple experimental edit/move rate limiter with fairly conservative settings for now
  • 07:35 brion: changed default search namespaces from NS_TEMPLATE_TALK to NS_HELP (whoops!)

26 May

  • 06:30 jeluf: migrated dawiki
  • 05:30 jeluf: migrated concatZippedHistoryBlobs of eowiki,glwiki,bgwiki to external storage cluster srv28/29/30

25 May

  • 21:50 brion: vincent has been reinstalled with FC3. Running a full Lucene index build for all wikis now...
  • 20:00 dmonniaux: on bleuenn/chloe/ennael: disabled DNS through Wikimedia servers through PPP (didn't work, prevented squid from restarting); used Lost-Oasis servers instead (cf /etc/resolv.conf); inserted iptables -I INPUT -j ACCEPT so as to allow DNS etc. in (please remove once you know what you're doing)
  • 06:30 jeluf: moved binlogs 166 and 167 from ariel to khaldun
  • 04:17 Tim: noticed that webster had stopped replicating 4.5 hours ago. Offloaded it and ran "REPAIR TABLE bugs.bugs" to fix the problem.
  • 01:47 kate: albert's eth1 died for unknown reasons, site broke. configured eth0 as a trunk port to keep site operational.
  • 00:35 brion: running lucene updates on vincent; out of search rotation during build

24 May

  • 21:15 jeluf: moved binlogs 163-165 from ariel to khaldun
  • 15:16 Tim: started update-special-pages-loop, in a screen on zwinger. Using benet for DB.

23 May

  • 20:00 jeluf: added "-A" to /etc/sysconfig/ntpd, synched clocks
  • 15:30 jeluf: installing MySQL 4.0.24 to srv28-30, srv30 will be master, srv28 and 29 will be slaves
  • 08:53 brion: vincent is serving searches from the Mono-based server experimentally
  • 06:15 jeluf: moved binlogs 160-162 from ariel to khaldun
  • 02:35 brion: page moves back on
  • 02:14 brion: temporarily disabled page moves while cleaning up aftermath of a move vandal
  • 01:34 brion: running Lucene index updates and tests

22 May

  • 22:45 brion: fixed another hidden year 2025 entry on eswiki which screwed up recent changes caching
  • 20:15 jeluf: restarted slave. That was faster than I expected.
  • 20:00 jeluf: stopped slave on benet, doing some dumps.
  • 3:00 erik: ran /home/erik/commonsupdate.pl (logged to commonscategoryupdate*.txt) to change category sort keys "Special:Upload" and "Upload" to proper page titles (bug 73); this fixes paging on categories with more than 200 images. Bug 73 is now fixed, so this should not reoccur, but other wikis will have the same problem and can be quickly fixed with this script if necessary.

21 May

  • 23:10 jeluf: moved binlogs 153-159 from ariel to khaldun:/usr/backup/arielbinlog/
  • 3:00 erik: setting up sr.wikinews.org. Not announced yet until language files are fixed.

20 May

  • 19:00 Chad: put zwinger, holbach and webster on the scs. Took moreri, smellie and anthony off. Tim changed software labels.
  • 15:15 midom: killed suse firewall and kernel security stuff. it freaked out all sysadmins, shouldn't be allowed to live :)
  • 12:50 brion & many: all hell breaks loose with ldap oddness on albert and dns and... stuff
  • 4:30-5:40 Tim & onlookers: DNS failure on zwinger. Took us an hour to fix it instead of 2 minutes, and caused problems site-wide, because we're using non-redundant DNS instead of /etc/hosts. Logins were timing out because commands in the login scripts were waiting for a DNS response. Managed to get root on albert first, and set about modifying resolv.conf on all machines to use albert as well as zwinger. Eventually got root on zwinger, had to kill -9 named. Restarted it, everything is back to normal.

19 May

  • 22:00 jeluf: Upon GerardM's request, and due to ongoing vandalism on li.wikipedia.org, promoted user "cicero" to sysop on liwiki
  • 21:48 jeronim: removed isidore from squids dsh group (and condolences for the eurovision tragedy)
  • 21:30 midom: after surviving ddos aimed at my dsl and lithuania's failure in eurovision I finally moved some ariel binlogs to alrazi/khaldun (raid1 :)
  • 02:45 kate, brion: fixed ldap/firewall for external servers

18 May

  • 18:11 Tim: Categorised the servers by interface and vlan at Interfaces. Fixed routing tables on a few hosts that were non-standard for their category.
  • 16:35 Tim: removed isidore and vincent from dsh ALL node group, non-standard configuration. Also removed bart and moreri, permanently down
  • 14:54 jeluf: flushed firewall on bayle. Back in apache service.
  • 14:38 brion: readded vincent in search group

17 May

  • 19:45 JeLuF: restored lost history on dewiki
  • 10:29 Tim: Updated DNS to get closer to this, and hence reality.
  • 09:45 Tim: Fixing sources in gmetad.conf fixed it
  • 09:28 Tim: Moved ganglia configuration to /h/w/conf/gmond, symlinked config on new apaches, changed cluster name, restarted ganglia. It doesn't seem to have fixed the recording problem.

16 May

  • 16:50 Tim: Moving ariel binlogs 130-139 to avicenna. Don't ask me where 106-129 went.
    • as I already said: khaldun.
  • 05:50 brion: added a bunch of nazi spam subjects to wikipedia-l spam filters, hoping to reduce admin load

15 May

  • .... dammit is moving memcacheds around to work around browne problem ...
  • 12:05 brion: installed libtidy-devel and patched tidy PECL extension on srv11-srv30
  • 11:59 brion: browne is having some funky problems; can't talk to the srv machines, which is Bad for memcached work
  • 10:10 brion: installed updated LanguageEl.php; had to fix permissions on file

14 May

  • 23:40 brion: disabled catfeed extension for security review
  • 23:25 brion: lucene search now up for all en, de, eo, and ru sites. In theory.
  • 10:30 brion: running enwiki index update again
  • 08:00 brion: vincent back online; eth0 had not initialized properly

13 May

  • 21:45 brion: ran checksetup.pl on bugzilla to apply stealth database updates which broke login
  • 19:05 brion: upgraded bugzilla to 2.18.1
  • 18:37 brion: wikibugs back on irc
  • 18:13 brion: hacked Image.php to ignore metadata with negative fileExists, and updated wgCacheEpoch to force rerendering. broken images should be mostly fixed now
  • 17:57 brion: grants wiki fixed (wrong directory was synced in docroot)
  • 16:20 jeronim: bugzilla bot not running, problems with images ("Missing image" on wiki pages) not fixed
  • 12:10 -14:00 and beyond: jer/kate/midom/tim: power loss @ florida colo. most servers lost power; albert, ariel, bacon, suda, khaldun, webster, holbach, srv2, srv3, srv4, srv6, srv7 did not
  • 06:00 brion: running lucene builds for all remaining en, eo, ru, de wikis

12 May

  • 22:23 brion: hacked language name for 'no' to 'Norsk (bokmål)' per request.
  • 21:00 JeLuF: Test installation of mysql cluster on srv29 and srv30. Management server running on srv0. Installation done according to howto.
  • 8:54 Tim: offloaded ariel to correct for load caused by compressOld.php and the pending deletion script
  • 08:10 Tim: deleting articles on en marked "pending deletion", see w:User:Pending deletion script

11 May

  • 19:46 Tim: started compressOld.php, running on a screen on zwinger
  • 00:05 brion: corrected year on a fr.wikinews revision from 2025 to 2005. Assumed a very badly set clock yesterday morning -- does anybody know about this? I can find no trace of it now, though there were several complaints about affected articles, other examples of which now show correct years. Did someone correct them? Who, and when?
midom: System clock wasn't synced to hardware clock before new server crash - servers came up with bad timers. Fixed bad entries in ~15 wikis (wikipedias only), which is why frwn remained.

10 May

  • 23:59 brion: synched hardware clocks on all apaches to current system time. (some were hours off, a few were in 2003)
  • 21:33 brion: resynched clock on srv14 to zwinger with ntpdate; was about a minute off.
  • 14:00 midom: restarted all memcacheds
  • 13:15 midom: chain reaction of slow image server maxed out fds on memcached, which caused even more image server load. temporary workaround: remove some old apaches from service, so that memcached would function a bit better.
  • 12:20 midom: ldap server reached maxfiles. fixed in /etc/sysconfig/openldap && restarted
  • 12:00 midom: recovered broken new apaches
  • 07:55 brion: disabled curl extension loading in case it makes a difference if/when mysteriously killed machines are raised from dead
  • 07:31 midom: srv11-srv30 all died
CURL extension in effect
  • 07:30 brion: installed curl PHP module on apaches

9 May

  • 22:00 chaper, jeluf, midom: srv11-srv30 joined apache service.
  • 01:30 brion: removed three invalid image records from commons (from 1.3 era before some name validation fixes)
  • 01:00 brion: Somebody (gwicke?) checked out an entire phase3 source tree inside the 1.4 live installation directory. That's a very bad place for it -- it would get replicated to all servers if a full sync is run. I moved it to /tmp.

8 May

  • 22:09 Tim: discontinued freenode enwiki RC->IRC feed
  • 21:45 JeLuF: removed khaldun from dsh group mysqlslaves
  • 21:25 JeLuF: fixed replication on holbach, otrs.ticket_history was broken. Holbach back in service.
  • 15:00 JeLuF: fixed replication on bacon, otrs.ticket_history was broken. Bacon back in service.
  • 11:08 Tim: added CNAME for irc.wikimedia.org, still working its way through the caches. Opened up port 6667 on browne. Switched on RC->UDP for all wikis, the whole thing is now fully operational.
  • 11:00 brion: cleared image metadata cache entries for commonswiki due to latin1-wiki bug inserting bogus entries
  • 8:40 Tim: installed patched ircd-hybrid on browne

7 May

  • 11:00 brion: replaced wikimania's favicon with the WMF thang. running some lucene updates in background on vincent
  • 02:18 brion: started squid on isidore; had been down for some time. cause unknown.
  • (some time earlier) tim: made unspecified changes to squid configuration for another external squid

6 May

  • 03:00 brion: updated Latin language file; namespaces changed on those wikis.
  • 01:05 brion: suda caught up. back in rotation.
  • 00:44 brion: restarted replication on suda (bugzilla's votes table had some kind of index error)
  • 00:37 brion: took suda out of rotation; replication is broken

5 May

  • 21:39 brion: starting rc bots on browne. Configuration has changed, they are not using a proxy and must be run from a machine with an external route.
  • 16:38 Tim: dumping, dropping and reimporting bgwiktionary.brokenlinks seems to have worked, gradually reapplying load
  • 15:55 Tim: Trying standard recovery procedures
  • 15:08: Suda crashed due to corrupt InnoDB page

4 May

  • 22:15 brion: hacked in os interwiki defs for wikipedias (not other wikis, not sure if they're even set up)
  • 18:52 Tim: installed RC->UDP->IRC system. The UDP->IRC component is udprec piped into mxircecho.py running in a screen on zwinger. This removes the high system requirements previously needed for RC bots.
  • 10:30 Tim: Bots K-lined. Removed enwiki and dewiki to avoid further offence, and left them in a reconnect loop. If someone wants to approach Geert yet again, be my guest.
  • 10:20 Tim: moved RC bots to browne, which is mostly idle, has plenty of RAM, and has an external IP address, allowing it to connect to freenode without going through the apparently undocumented and non-working port forwarder on zwinger.
To find the documentation, enter "forwarder", "forwarding", or "irc" into the search box on the left, and click the Search button. In the notes on the relevant page, the code for the forwarder comes after "code: ".
  • 6:45 jeluf: started squid on will, was down.
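
A minimal sketch of the UDP-receiving half of the RC->UDP->IRC pipeline described in the 18:52 entry, for illustration only. It is not the actual udprec or mxircecho.py code; the function names, bind address and port are hypothetical. It receives datagrams and writes each as one line to standard output, which is the form a line-oriented IRC echo script could read from a pipe.

```python
import socket
import sys


def decode_datagram(data: bytes) -> str:
    """Turn one UDP payload into a single text line without trailing newlines."""
    return data.decode("utf-8", errors="replace").rstrip("\r\n")


def relay(host: str = "127.0.0.1", port: int = 9390, out=sys.stdout) -> None:
    """Receive datagrams forever and write each one as a line to `out`.

    The host and port defaults are placeholders, not values from the log.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            data, _addr = sock.recvfrom(65535)
            out.write(decode_datagram(data) + "\n")
            out.flush()  # keep the downstream reader fed line by line
```

Calling `relay()` and piping the process output into a second script mirrors the "udprec piped into mxircecho.py" arrangement. The wikis only need to send one datagram per change, which is cheap compared with running a full bot per channel.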

3 May

  • 22:25 kate: changed liwiki tz to Europe/Berlin
  • 4:40 jeluf: added webster to DB pool again.

2 May

  • 14:15 midom: after second consecutive webster crash, took it out from rotation, trying forced innodb recovery, planning resync:
050502 14:11:15InnoDB: Assertion failure in thread 1207892320 in file btr0cur.c line 3558
InnoDB: Failing assertion: copied_len == local_len + extern_len
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
InnoDB: about forcing recovery.
  • 14:00 midom: webster's mysql crashed with some assertions, did come up later and continued to serve requests after some load management
  • 11:00 brion: started squid on srv7, which had been down for unexplained reasons and its IP addresses had not been reassigned
  • 07:20 brion: rebuilt foundation-l list archives after removing some personal info by request

1 May

  • 00:05 brion: changed $wgUploadDirectory settings so they won't break in maintenance scripts. hopefully didn't get them wrong.

30 April

  • 23:30 brion: cleared image cache for all wikis. bogus entries probably added during links refresh; maintenance scripts have wrong $wgUploadDirectory
  • 23:00 brion: cleared image cache entries in memcached for commonswiki due to spurious entries marked as not existing.
  • 04:25 Tim: Setting up for perlbal throughput test on tingxi

29 April

  • 22:18 Tim: resumed refreshLinks.php
  • 15:57 Tim: stopped refreshLinks.php at the end of enwiki, before the delete queries
  • 15:28 Tim: Restarted avicenna, which caused the site to crash due to a large number of threads waiting for Lucene
<TimStarling> what is avicenna's role?
<dammit> was: search server
<dammit> dunno now
<TimStarling> avicenna is reporting 20% user CPU usage
<dammit> every host that runs lucene
<TimStarling> but nothing is showing up in top
<dammit> has broken top output
<dammit> and broken ps output
<TimStarling> nothing important shows up in netstat, I'll just reboot it
<TimStarling> ok?
<dammit> 'k
*site explodes*
  • 09:10 brion: took vincent out of lucene search rotation while it's building; changed default_socket_timeout in php.ini to 3 seconds from 60
  • 04:00 brion: started incremental index update for lucene search indexes
  • 03:38 Tim: resumed refreshLinks.php after having stopped it for a while during peak period

28 April

  • 05:12 Tim: Shutting down apache on srv1 to dedicate it to refreshLinks.php
  • 02:10 brion: set up logrotate on isidore to rotate squid log, in hourly cron
  • 01:40 brion: manually rotated squid log on isidore due to reaching 2gb, restarted squid.
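
The squid log entries here and on 23-25 April all involve a log file reaching 2 GB. The log does not state the cause; a common explanation, assumed here, is a signed 32-bit file offset, which tops out at 2^31 - 1 bytes. A small sketch of a size check that a rotation job could use (the function name and the headroom value are hypothetical):

```python
# Largest offset representable in a signed 32-bit file offset (assumed limit).
MAX_SIGNED_32 = 2**31 - 1  # 2147483647 bytes, just under 2 GiB


def needs_rotation(size_bytes: int, headroom: float = 0.5) -> bool:
    """Return True once the log has used more than `headroom` of the limit."""
    return size_bytes > MAX_SIGNED_32 * headroom
```

Fed with `os.path.getsize()` of the log file from an hourly job, a check like this rotates well before the limit. A daily rotation can be too slow for a busy cache, which matches the move from cron.daily to cron.hourly noted on 25 April.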

27 April

26 April

  • 06:00ish brion: copied updated lucene indexes to avicenna and maurus, put vincent back in search rotation
  • 05:40-05:55: Severe external network problems
  • 05:25 Tim: deleted obsolete binlogs, moved the remainder (77-87) from ariel to avicenna. 33 GB of disk space remaining on ariel.
  • (yesterday) jeronim: installed python 2.4.1 from source on alrazi, using make altinstall instead of make install, so that the current python 2.3 installation is not interfered with -- the 2.4.1 binary is at /usr/local/bin/python2.4

25 April

  • 23:30 jeronim: clocks were wrong on 5 machines; fixed 4 of them (installed ntpdate on vincent). isidore still needs to be done (dammit? :)
  • 07:55 brion: started a second active search daemon on maurus (vincent is still rebuilding indexes)
  • 05:00 jeluf: enabled LuceneSearch.
  • 01:20 brion: had to restart srv7 squid again. moved logrotate from cron.daily to cron.hourly, where it should have been before but wasn't

24 April

  • 21:30 jeluf: disabled LuceneSearch. All apache processes were in state LuceneSearch::newFromQuery
  • 11:15 jeluf: set wgCountCategorizedImagesAsUsed for commons.
  • 02:55 brion: manually rotate squid log on srv7 again when it reached 2gb and crashed. logrotate needs to be fixed...
  • 02:15 brion: installed GCC 4.0 final on vincent, avicenna for GCJ. Taking vincent out of search rotation for index rebuild.

23 April

  • 13:15 Tim: recaching special pages, with wget script running in a screen on zwinger, which requests recache pages from bayle, which sends the expensive queries to benet.
  • 02:25 brion: manually rotated logs and restarted squid on srv7. had been down for 2.5 hours, but nobody noticed the alarm from servmon.

22 April

  • 10:20 brion: as a temporary hack, bumped rc_namespace on metawiki from tinyint to int. somebody added a russian help namespace at 128/129 which is outside of the signed tinyint range, so pages were recorded with the wrong namespace.
  • 01:30 brion: removed 'wrap' option from tidy.conf to work around weird corruption problem (may be bug in tidy; investigating)
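
To illustrate the rc_namespace problem in the 10:20 entry: a signed tinyint holds -128 to 127, so namespace numbers 128 and 129 do not fit. The sketch below assumes the database clips out-of-range values to the nearest bound instead of rejecting them. The log does not say which value was actually stored, and the helper names are hypothetical.

```python
TINYINT_MIN, TINYINT_MAX = -128, 127  # range of a signed 8-bit integer


def fits_signed_tinyint(value: int) -> bool:
    """Check whether a namespace number can be stored without loss."""
    return TINYINT_MIN <= value <= TINYINT_MAX


def clamp_to_tinyint(value: int) -> int:
    """Model a column that clips out-of-range values to its bounds (assumption)."""
    return max(TINYINT_MIN, min(TINYINT_MAX, value))
```

Under that assumption both 128 and 129 collapse to the same stored value, so recent changes rows for the new namespace pair can no longer be told apart. Widening the column to int removes the limit.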

21 April

  • 18:00 midom: started backup run on benet

20 April

  • 11:25 brion: tidy extension installed on apaches, now active. To go back to external, set $wgTidyInternal = false; or remove extension=tidy.so from php.ini and restart apaches
  • 10:50 brion: added node groups fc3, freebsd, debian
  • 10:06 brion: removed isidore and vincent from fc2-i386 node group, as they're running FreeBSD and Debian
  • 10:00 brion: working on installing tidy extension for php...
  • 03:00 brion: re-enabled search

19 April

  • 16:50 Tim: Pope-related flash crowd, peaking at 2100/s. Apaches were hard hit by searches (about 50% of profile time) so I disabled them temporarily.
  • 16:00 Tim: we were getting reports of gzuncompress errors in memcached-client.php, on every page view on en. I put in an error suppression operator and instead logged all such errors to /home/wikipedia/logs/memcached_errors, to determine which server was the problem. It turned out to be not a server but a key, enwiki:messages to be precise. Deleting it and letting it reload fixed the problem.
  • 07:30 midom: sad notice, smellie down, memory or other hardware troubles, lots of segmentation faults and other signals before reboot, didn't come up after.
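
The diagnosis in the 16:00 entry (suppress the decompression error, log it, then see what the failures have in common) can be sketched as follows. This is an illustration in Python, not the memcached-client.php change itself; `safe_decompress` and the in-memory counter are hypothetical stand-ins for the error suppression and the log file.

```python
import zlib
from collections import Counter


def safe_decompress(key: str, blob: bytes, failures: Counter):
    """Decompress a cached value; on failure, record the key and return None."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        failures[key] += 1  # tally per key so a single bad entry stands out
        return None
```

Returning None lets the caller treat the entry as a cache miss. Tallying failures by key (or by server) shows whether one entry or one host is at fault; here it pointed to a single key, which was deleted and reloaded.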

17 April

  • 09:00 midom: fixed broken webster replication, caused by the bugs table in the bugs database
  • 06:45 brion: fixed symlinked php.ini on srv2, srv3
  • 00:00 midom: reformatted suda data area from xfs to ext2, brought into MySQL service for enwiki only

14 April

  • 03:20 brion: eowiki lucene search live! others building...
  • 02:45 brion: started lucene index builds for eowiki, ruwiki, dewiki
  • 02:15 brion: lucene search live for meta
  • 01:45 brion: restarted meta search build, as it was pulling from wrong db. whoops!

13 April

  • 23:51 brion: noticed some spam coming in on bugzilla. hacked rel="nofollow" into comment processing, removed the comment, and disabled the account used to post it.
  • 22:40 brion: starting lucene index builds for metawiki and some other wikipedias
  • 00:08 brion: removed Apache-Midnight-Job from avicenna crontab

12 April

  • 23:50 brion: vincent and avicenna are sharing LuceneSearch burden.
  • 20:00 brion: Chad fixed vincent, which is now running lucene. Isidore lucene stopped, it's going to be squid soon. Will take over an apache for additional search capacity.
  • 13:30 brion: lucene search turned on for en with slightly old index file, daemon running on isidore
  • 10:30 brion: gcj on isidore seems horked; index rebuild is much too slow (eta 18 hours) so stopped it. uploading an index from home, and building mono for further testing.
  • 10:00 midom: holbach restored.
  • 08:55 holbach seems to be deadish
  • 08:50 brion: started lucene index build on isidore
  • 05:50 brion: vincent doesn't seem to be coming up again, will need to be kicked.
  • 05:20 brion: upgrading vincent to 2.6 kernel hoping to resolve threading/memory issues w/ MWDaemon
  • 02:10 brion: rebooting srv6 due to zombie squid eating port 80

11 April

  • 23:05 kate: experimenting with making an en.wp image dump using trickle (cvs: /tools/trickle/)
  • 08:00 midom: broken replication (by Chinese scammer) on bacon, fixed by "use otrs; repair table article" - myisam tables are evil, aren't they?

10 April

  • ~23:00: kate: upgraded squid to STABLE9+patches (see squid builds) + restarted all squids.
  • mark: All squids are running with too few FDs (1024), and if no one replaces all daemons with the new one Kate just built, we may have a problem tomorrow during peak hours...
  • 19:15 midom: srv7 is now in squid service
  • 19:07 brion: MWDaemon's memory usage got high enough it started swapping. Hung connections ate up apaches and hung the site until it was restarted.
  • 5:30 brion: lucene search server active for en.wikipedia.org, running on vincent.

9 April

  • 15:45 midom: dropped thttpd (as it was using 32bit mmaps) on dumps in favor of lighttpd. It has superb performance, serves 3500 hits/s under ab and served 70MB/s from benet in small reqs... Strongly recommend using lighttpd for image uploads.
  • 10:15 brion: running lucene search indexer on vincent (pulling enwiki from benet).
  • 05:25 brion: added additional is rcbots to #is.wikipedia for tionary/books/quote

8 April

7 April

  • Mark, Tim: implemented Multicast HTCP purging on all FL apaches/squids. French Squids still need a binary replacement.

6 April

  • 21:44 mark: Put port gi0/26 on csw1-pmtpa into trunking mode: vlans 1-2 only, with vlan 2 being the native vlan, no LACP negotiation
  • 11:30 midom: benet put into dump operation
  • 10:55 brion: reinstalled PHP on zwinger and apaches, compiled with memory limit and mbstring options enabled. This was left out when upgrading to 4.3.11.
  • 2:40 brion: added NetCabo proxies to trusted proxy list (inconveniently shared by Jorge and a Nazi vandal on pt.wikipedia.org)

4 April

  • 15:30 jeluf: disabled logging of upload.wikimedia.org
  • 15:15 midom: yet another image server overload. rotated 30G upload.wikimedia logfile, could be fragmentation overhead.
  • 12:00 midom: moved log_bin.0[0123]? (40G worth of binlogs) from ariel to khaldun/avicenna backup/arielbinlog, reclaimed some master disk space.
    • Do we need those binlogs for anything?
      • Yes, we need binlogs back to the last full backup -- TS
  • 07:48 Tim: Started memcached on browne, it was in the list but not running. Fixed startup scripts. Noticed that browne can't contact albert on 10/8, modified yum.conf accordingly.

3 April

  • 18:25 midom: extended public IP address range (now: 12 addresses)
  • 17:50 midom: srv5 joined service as squid.

1 April

  • 22:30 midom: Enabled recentchanges-based watchlist hack. Servers go faaaast.
  • 23:15 brion: set default block expiry to 1h on dewiki by request of various admins

Archives