Server Admin Log/Archive 4

July 31

11:50 brion: knams squid at 145.97.39.138 is not reachable, but still in dns rotation. THIS IS BAD
01:50 brion: pascal is offline, reason unknown. bugzilla down, no NFS for knams cluster.

July 28

01:06 kate: put a new skin on bugzilla

July 27

18:50 brion: blocked irc4ever.net remote page loaders

July 26

08:08 kate: upgraded mysql on vandale to 5.0.9

July 25

19:05 brion: set $wgMetaNamespace to 'Vikipedi' on trwiki, refreshing links
18:15 mark: Added two missing kennisnet squid IPs to the udpmcast startup script on larousse, and restarted it.
17:29 brion: added wikimania-l mailing list
17:25 mark: Pointed thailand at knams as a test - some people there say it is much faster than pmtpa. Will eventually be replaced by the yahoo cluster anyway...

July 24

16:15 brion: set ndswiktionary to capitallinks off
10:10 brion: updated sudoers file on srv0 so syncs work again

July 22

22:50 brion: restarted search update daemon... still seems to be a memory leak and it hangs when it gets too large
22:31 brion: moved wiki.mediawiki.org to www.mediawiki.org and redirected from mediawiki.org and wiki.mediawiki.org to it
22:07 brion: srv0 clock was about 150 seconds in the future. kate did something to fix it. synchronized hardware clocks on all apaches from system time in the hope that reboots work. Fixed one revision reported to be in a weird inversion appearance.
13:50 brion: took avicenna out of search group to do experiments on index

July 21

23:30 Tim: added rollback group
22:00 Tim: moved group settings from CommonSettings.php to InitialiseSettings.php

July 20

23:45 brion: updated clocks on srv1, rabanus, etc all apaches... hopefully
21:40 brion: set wgCapitalLinks off on afwiktionary
19:20 mark: Removed legacy zone gdns.wikimedia.org and corresponding georecord rr.gdns.wikimedia.org from all nameservers. It's not being used anymore, and only confuses people.
19:05 mark: Pointed france and switzerland back at lopar in geodns
14:10 brion: created wikinews-hotline mailing list by request

July 19

23:58 Tim: fixed Special:Uncategorizedcategories, now running updateSpecialPages.php on /h/w/c/smallwikis
15:30 brion: reverting build copy of search index to the previous version to try working around some corruption from daemon crash (?)

July 17

18:27 mark: An empty line in the geomap file caused problems and made the site go down for non EU users. Apparently geobackend currently doesn't handle empty lines in geomap files (a bug which I will fix), so don't use them.
18:18 mark: Pointed all European countries at knams wrt geodns
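
The blank-line problem noted at 18:27 can be guarded against before a geomap file is deployed. A minimal Python sketch, assuming only that the file is line-oriented text; the function name is hypothetical and nothing here reflects the actual geobackend parser or the geomap syntax:

```python
def strip_blank_lines(text: str) -> str:
    """Return text with empty and whitespace-only lines removed."""
    # Keep only lines that contain something other than whitespace.
    kept = [line for line in text.splitlines() if line.strip()]
    # Re-join with a single trailing newline so the file ends cleanly.
    return "\n".join(kept) + "\n" if kept else ""
```

Running the file contents through this before syncing would drop the stray empty line while leaving every other line untouched.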

July 16

17:07 kate: wrote a new statistics system and replaced webalizer with it
07:30 brion: had to restart search daemons again due to breakage. whyyyyyyy they worked before *sob*
00:15 hashar: overloaded suda for almost 5 minutes by running the unbugged updateSpecialPages script. Might be cause of Wantedpages.

July 14

02:50 brion: separated mediawiki-installation and apache node groups. These must not point to the same file.
02:00-3:15 erik: created Japanese Wikinews at http://ja.wikinews.org/

July 13

20:59 brion: had to interrupt bgwiki backup due to memcached hang
06:10 brion: restarted search servers; 'too many open files'
01:30 brion: started backup on benet (slave stopped). updates in #wikimedia.15status

July 12

23:35 brion: commented out lopar from geodns for now (moved them to knams)
23:20 brion: there's intermittent packet loss to lopar...
19:10 mark: Site was down due to crashed perlbal on holbach, restarted it
12:03 kate: put lily back to squid pool
08:10 jeronim: set yum on larousse (FC2) to use fedoralegacy.org
08:00 mark: lily's hardware has been replaced.
07:40 jeronim: set HostnameLookups Off on larousse's apache at hashar's request
07:10 jeronim: added CNAME commons.wikipedia.org -> commons.wikimedia.org
00:40 brion: restarted mysql on james's advice with config change. innodb_lock_monitor fails, however. have innodb_status_file=1 set now. had to do 'slave stop' on samuel, which is master. wtf

July 11

23:40 brion: set innodb_lock_monitor on samuel on jameday's recommendation. will be active when mysqld restarted
23:20 jeluf: restarted ServmonII. Died when it lost its irc connection earlier today.
23:05 brion: removed the fateful link so editing that page works for now
22:30 brion: disabled deletion of recentchanges records due to slowness there. hacked Title::touchArray to go row by row due to weird hangings trying to edit Template:POTD on enwiki. Not sure what's wrong, it consistently hangs at User:Mulad/portal. What could be locking it?
18:30 brion: biased search load to maurus, as avicenna (with less memory) was being sluggish. added comment to output saying which server was hit
15:10 mark: Removed authoritative zones that were no longer pointing at zwinger from zwinger's Bind configuration (interferes with resolving). Set up AXFR slaving of zones that are supposed to be served by the new PowerDNS servers, but which are still delegated to Zwinger/bomis/fuchsia.
14:50 mark: Fixed reverse DNS for knams

July 10

17:00 brion: shut down slave thread on ariel before it explodes
05:40 hashar: check out our new portal: http://noc.wikimedia.org/
01:07 kate: removed ariel from load balancing because it only has 700MB of disk space left.

July 9

10:30 brion: fixed up steward mode in special:makesysop plugin to provide the full userrights options
some time in the morning kate: reverse dns for knams started working, although under *.rev.wikipedia.org.
08:02 brion: reassigned 'developer's on meta to steward group

July 8

5:20 brion: started mass lucene index builds using the updater daemon. once done, will sync current index files out. (progress in #wikimedia.15status)

July 7

13:50 brion: added page update hook for the lucene update daemon, see wikitech-l post
11:38 mark: Installed java (!) on pascal, to allow Kennisnet/ZX to upgrade the SP and BIOS on lily.
11:34 brion: maurus had bogus hostname (maurus.wikimedia.org, doesn't resolve). fixed live and in /etc/sysconfig/network
08:55 brion: upgraded PEAR::XML_RPC to 1.3.2 on mediawiki-installation group. Patching mono on avicenna and maurus for ximian bug 75467
08:30 brion: noticed vincent seems to be hung
07:00 Jamesday changed holbach cache split from 200M/2800M to 200/2500M because of excessive page faulting in vmstat, not yet restarted.

July 6

14:40 Tim: named on albert exited for no apparent reason, causing site-wide slowdown. Logged on via the scs and started it.
07:00 brion: all wikis reading from 1.5 code now. zh-min-nan.wikipedia.org has the UI broken -- code problem selecting wrong UI language [since fixed]
06:30 brion: fixed up broken conversions on sdwiki, rowikibooks, fiu_vrowiki, cowikibooks, aawiki
06:00 brion: upgraded meta to 1.5
04:00 kate: upgraded all knams machines to current kernel to fix bad pmd problem

July 5

10:43 kate: put back mint to squid pool
9:15 mark: Added zh-tw.wikimedia.org CNAME record to the wikipedia.org zonefile, as it was missing (and is not in langlist, for not being a language)
8:40 mark: Added an admin account on lily's SP, and set up temporary port forwarding on pascal to give ZX (sysadmin partner of Kennisnet) access to diagnose lily's hardware problems

July 4

Jason/mark: Many Wikimedia project domains have been changed to use the new PowerDNS DNS servers, so if you see any DNS-related problems, it might have to do with that
19:32 kate: set up squid log migration system
08:10 brion: migrated forgotten changes to InitialiseSettings from 1.4 to 1.5 (jbowiki caps, fiu-vro logo, zhwiki externalstorage)
03:08 kate: removed srv1 from apache pool again.

3 July

21:35 jeronim: srv1, srv2 & LDAP alive again after manual reboot by colo staff. not sure if domas actually emailed about scs-ext problem.
20:05 jeronim: and scs-ext.wm.org doesn't work anymore. dammit has emailed colo about ~~this and~~ srv1/srv2 problem
20:00 midom: oopsie, srv1 also didn't come up after reboot, and apparently it was LDAP server... LDAP down.
19:00 midom: resyncing holbach, updated misbehaving apache hosts (srv2,srv3,anthony,rose), srv2 didn't come up after reboot.
06:10 brion: holbach crashed again, mysqld was restarting over and over. killed it for now.
05:05 brion: fixed more wikimania registration files
02:20 brion: fixed missing db config in wikimania attendees list

2 July

21:55 brion: holbach died. restarted zhwiki conversion w/o it.
19:30 brion: started asian large-wiki upgrades: jawiki, zhwiki
16:00 midom: bacon joined perlbal service, restarted perlbal on holbach, site looks happier.
09:00 brion: eswiki upgraded, doing ptwiki now. dammit took ariel out of rotation, ready for reloading
07:40 kate: moved bugzilla to pascal
06:51 brion: fixed db host for wikimania registration
06:45 midom: samuel is our master.
- mediawiki 1.4, mediawiki 1.5, bugzilla, and otrs should be configured properly for new master. is there anything else? [search server update needs changing anyway, working on this --brion]
04:50 brion: ran refreshLinks on enwikinews
04:30 brion: disabled sorbs checking for now
02:40 Jamesday: changed bacon cache split from 800M/2000M to 200M/2600M, not yet restarted.
02:30 Jamesday: changed holbach cache split from 1000M/2000M to 200M/2800M, not yet restarted.
02:05 brion: running background refreshLinks.php on dewiki

1 July

22:20 Jamesday: changed ariel my.cnf from MyISAM/InnoDB cache split of 1700M/3900M to 300M/5100M assuming minimal MyISAM use now. We've been this high before for InnoDB but there's a small chance that the new kernel on Ariel might not like going above 4G on the next restart - reduce it to 3900 if that happens. Not restarting ariel now because one is planned anyway and it's not that urgent - should improve load handling ability though. Decreased binlog_cache_size from 1M to 128k (it's per session and doesn't really need to be 1M).
08:20 brion: changed Revision legacy encoding conversion to use //IGNORE in iconv... this may need tweaking
06:10 brion: dewiki done.
05:56 brion: moved 1.5 skins dir from /w/skins-1.5 to /skins-1.5. Turns out squid configuration does cache-control rewriting on /w which makes them uncacheable. Bad squid!
00:45 brion: switched 1.5 wikis to shared filesystem sessions. A hack in User::matchEditToken fatally broke save attempts by previously-logged-in users because it didn't bother to check that memcached sessions were in use; I've commented it out.
00:30 brion: switched 1.4 wikis to shared filesystem sessions, perhaps this will relieve memcached session problems?

30 June

23:00 brion: installed test fix for firefox intermittent download problem
06:30 brion: set tidy's line wrapping off on 1.4 config as well (already on 1.5)
01:50 brion: finished. running refreshLinks.php on en.wiktionary.org (in background)
01:00 brion: running cleanupCaps.php on en.wiktionary.org to rename all article pages to lowercase

29 June

??:?? brion: somebody moved en.wiktionary.org to wgCapitalLinks off, throwing it into total chaos. thanks!
22:12 brion: removed some unused, added Mac OS X 10.4 to bugzilla operating systems list
18:18 brion: set $wgCapitalLinks off on jbo.wikipedia.org

28 June

22:20 brion: adding image table entries for 'missing' images (probably broken or half-canceled uploads from months back, mostly)
15:50 kate: set up ganglia, ssh, yum on adler and samuel
05:45 kate: set up and documented a better LDAP setup. removed srv1 from apache pool.
01:03 brion: enwiki upgrade broke with its slave reads: page table was incomplete. rebuilding page table from ariel, ETA ~2hrs

27 June

22:35 brion: turned off image metadata loading to speed things up -- will need to do that in a later script run
20:25 brion: dropped & recreated empty links on enwiki to free innodb space (already converted)
19:15 brion: disabled email authentication for now; will do mass checks later
08:40 brion: enwiki upgrade is now pulling revision data from adler, writing to ariel.
07:55 brion: somewhere in the midst of upgrading things. enwiki is going now; upgrade1_5.php is hacked up, please don't run any others until it's restored!
03:35 brion: adler was broken and badly lagging because somebody removed its 'cur2' tables and replication died when we dropped them from the master. fixed, returning...
02:15 brion: commons, wikinews, wikiquote, wiktionaries, and some misc are upgraded. Wikipedias and some others remain... Need to clear disk space on ariel

26 June

04:50 brion: commonswiki being upgraded; ETA in 6-7am range
04:10 brion: upgraded nostalgiawiki as test
02:10 brion: setting things up preparing for 1.5 upgrade

24 June

14:00 brion: adler back in rotation. probably needs reconfiguration for future...
13:30 brion: took adler out of rotation; mysqld crashed OOM and is recovering

23 June

21:45 mark: ...apparently because it was pointing at cache.wikimedia.org., which didn't exist in the new DNS zones... added.
21:45 mark: wiktionary.zone contained an old record fr.wiktionary.org CNAME wikipedia.geo.blitzed.org which for some reason made things break only now. Removed.
05:25 brion: changed sitename, metanamespace on la.wikiquote

22 June

12:00 mark: Changed www-dumps.knams DNS to CNAME dumps in preparation for moving vandale to an internal vlan
05:45 midom: did set global mysql timeout in php to 2 seconds.
05:22 Tim: restored load to samuel, also experimentally changed some other loads
04:28 Tim: realised that the site was down because of samuel and changed the load balancing ratios accordingly. At this time samuel is busy doing InnoDB recovery.
04:22 Tim: finished moving binlogs
04:18 samuel's mysqld exited for no apparent reason
~04:10 Tim: started moving binlogs 232-240 to khaldun

21 June

20:48 mark: Network problem has been worked around, switched geodns back.
19:30 mark: Severe network / reachability problems for florida, but knams seems to be able to reach it. Pointed all of geodns at knams and lopar exclusively.
00:01 kate: moved binlogs 228..231 from ariel to khaldun

19 June

20:00 mark: New DNS setup is active, but DNS zone delegations still need to be dealt with. Please note that there are NO wildcards anymore, so you will need to update DNS zonefiles when creating wikis! Also, for the next week or so, update both the old DNS setup, and the new one when changing records.
Problems will occur, DNS records may be missing, please tell me or update it yourself!
19:00 mark: I broke the site (for the first time, yay! :) because of the mixed old and new DNS setup; the old zonefile was using rr.chtpa. while the new one expected rr.pmtpa.. Oops. Negative cache TTL of 1H means that some users will not be able to access the site for a while.
18:50 mark: Activated new DNS setup on zwinger, which is partly used by the old Bind DNS setup
17:30 mark: Added records ns0/1/2 to wikimedia.org to allow changing NS delegation for the new setup
15:15 mark: Removed zwinger/gdns1 from the list of geodns nameservers in wikimedia.org, in order to build a new setup on zwinger
00:50 brion: upgraded mono on vincent to 1.1.8 (rpm packages), running a mass lucene index update

18 June

12:30 jeronim: added dsh node groups at florida: squids_lopar, squids_knams & squids_global
11:30ish jeronim: added Disallow: /wiki/? to robots.txt because bots were indexing stuff like http://en.wikipedia.org/wiki/?title=Nl%253Aolijfboom&action=edit
06:14-06:37 Tim: Started Folding@Home on mint, ragweed, hawthorn and mayflower
06:14 Tim: Started Folding@Home on iris. I started it on sage and clematis a few days ago without logging it here.
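
The effect of the `Disallow: /wiki/?` rule added at 11:30ish can be illustrated with a small Python sketch. It assumes the simple prefix-matching reading of robots.txt, in which a rule applies when the URL's path plus query string begins with the rule text; individual crawlers may interpret rules differently. The function name and the second example URL are hypothetical:

```python
from urllib.parse import urlsplit


def is_disallowed(url: str, disallow_prefixes: list[str]) -> bool:
    """Simplified robots.txt check: prefix match on path plus query."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        # Re-attach the query so a rule such as "/wiki/?" can match it.
        target += "?" + parts.query
    # An empty Disallow value means "allow everything", so skip it.
    return any(target.startswith(p) for p in disallow_prefixes if p)


rules = ["/wiki/?"]
edit_url = "http://en.wikipedia.org/wiki/?title=Nl%253Aolijfboom&action=edit"
page_url = "http://en.wikipedia.org/wiki/Main_Page"  # illustrative only
print(is_disallowed(edit_url, rules), is_disallowed(page_url, rules))
```

Under this reading the rule blocks query-style URLs directly under /wiki/ while leaving ordinary article paths crawlable.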

16 June

23:50ish brion: uplink via level3 died for a few minutes, either was fixed or PowerMedium rerouted it and we're back.
22:24 brion: webster is dead: ssh doesn't let in, scs doesn't respond on it. does ping. possible kernel panic.
20:15 jeluf: webster's mysqld got a signal 11, recovered automatically.

15 June

21:10 mark: Set up new requests/s stats at http://noc.wikimedia.org/reqstats/
20:40 mark: Removed lily from the knams squid pool in wikimedia.org DNS, it's broken.
20:30 mark: Added missing peer statements to knams squids

14 June

08:45 jeluf: moved ariel_log_bin.21[345678] to khaldun

13 June

20:55 brion: truncated searchindex table on a bunch of wikis, freed ~5gb of disk space on ariel
20:30 brion: rebuilt search indexes for dawiktionary, svwiktionary due to bad encoding config
16:00 midom: used new apache-restart-all-hard (really hard!) so that slow watchlists (which were actually segfaulting apaches due to bad bytecode in cache on nearly all servers) would become fast ;-) we really need blank page logging somewhere..
09:30 midom: removed icpagents from some of apache hosts, was a major headache recently.. ;-)
09:00 midom: commented out some defunct or non-apache hosts (uh oh, nearly 10 in total) in perlbal's nodelist.

12 June

03:12 brion: started adding name_title unique index on remaining smaller wikis (<10k pages), 30 second wait between each
01:25 brion: unlocked ja.wiktionary on Angela's request

11 June

22:20 jeluf: moved binlogs 204-212 from ariel to khaldun
12:02 brion: stopped index addition for now (left off at bgwiktionary), will run at non-peak hours
11:54 brion: dupe checks done, adding index...
11:29 brion: running cleanupDupes.php on all wikis not already protected with a unique namespace+title index, then adding the index. the largest wikipedias were already protected.

10 June

10:00 brion: ran a salt fixup script to correct entries which had been erroneously re-saved with bad password due to memcached records floating around in the first couple days

9 June

21:45 chaper: inserted the CD that was delivered with the new hosts into srv4. jeluf mounted to /media/cdrom. Apparently containing RAID controller software for many OSes incl linux.
19:55 kate: lily hung and died again during fsck. moved its ip to ragweed and left it off for now.
02:50ish brion: inserted live debugging hack in Article.php for deletion problem on en.wikipedia.org (bug 2195)

8 June

19:25 jeluf: moved binlogs 200-204 from ariel to khaldun. New DB servers have arrived in data center.
13:15 brion: ran namespaceDupe checker on skwiktionary, skwikiquote due to prob w/ namespace changes there
00:35 brion: added wikiskan-l list for Scandinavians

7 June

22:07 kate: set up dumps mirror at http://www-dumps.knams.wikimedia.org/
19:00 jeluf: moved binlogs 198 and 199 from ariel to khaldun
18:48 brion: reactivated search
9:00-19:00 all: Moved to new Tampa data center
10:00 brion: replaced lighttpd on fuchsia with apache because the errordocument stopped working for no reason
08:00 or so; brion: added fuchsia to wikimedia.org dns, using an alias from dammit because of crappy verio interface. still not on wikipedia.org because we can't get in to it.
07:00 or somewhat: horrible things begin

6 June

13:40 kate: started copying dumps to vandale
11:30 kate: made a small db change for wikimania registration to implement a change in the form. left a backup of the old one at zwinger:/root/wikimania.prekate.sql
10:05 kate: set up logrotate on knams
01:43 Tim: moved binlogs 194-197

5 June

22:40 kate: reinstalled mint with better partition layout, added it to squid pool
21:00 gwicke: fixed mysql error messages in this wiki after config tweak to index words from 3 chars. You should now be able to search for things like 'DNS'.
14:55 mark: bound bind to 145.97.39.130 (pascal's main ip) only, adapted firewallinit to allow incoming DNS zone transfers
14:19 kate: added lily to squid pool
13:25 mark: Added ip 145.97.39.158 to pascal, adapted /sbin/ifup-local.
10:02 kate: iris -> squid pool
09:03 kate: clematis -> squid pool
08:46 kate: sv,dk,no.wp -> knams
08:09 kate: de.wp -> knams
07:56 kate: put mayflower to knams squid pool. fixed typo in commonsettings breaking squid caching.

4 June

18:27 kate: added hawthorn to squid pool
18:10 kate: created rr.knams pool, put UK, NL, DE and LT on it.
16:28 kate, jer, dammit: started squid on ragweed, put it in lopar pool for now
15:30 jeluf: moved binlogs 190-193 to khaldun
13:54 jeronim: built new squid for will as old one had file descriptor limit of 1024 instead of 8192 so it was running out of FDs. In /home/wikipedia/src/squid/squid-2.5.STABLE9.wp20050604.S9plus.no2GB[icpfix,nortt,htcpclr]

3 June

23:30 brion: fixed salting on user_newpassword for accounts not touched since the change.
20:40 mark: Wrote /sbin/ifup-local script on pascal, to handle post-ifup tasks. Currently adds 10.21.0.2/24 IP to eth1 for accessing the LOMs.
20:00 mark: Set up permanent source routing on pascal for Kennisnet out of band access using /etc/sysconfig/network-scripts/route-eth1 and rc.local
19:05 mark: Rebooted csw2-knams with newer crypto image, setup SSH, changed DNS resolver
09:40 kate: created 400GB LV at /sqldata on vandale, ext3. installed mysql. copied ariel's my.cnf over (can someone look at what needs to be changed there?). did not populate any sql data yet.
05:50 kate: REMOVED WILDCARD NS RECORD under *.wikimedia.org. this means you will need to add NS records for new wikis in that domain or they won't work.
05:48 kate: set up recursing NS on pascal and mayflower; tested pdns slave for wikimedia.org on fuchsia, seems to work (but not authoritative yet).
00:05 Tim: moving binlogs 186-189 from ariel to khaldun

2 June

06:15 brion: clearing user records from memcached. two instances of can't-log-in reported might have been caused by stale cache records re-saving bogus unsalted passwords, but that's sheer speculation.
06:00 JeLuF: fixed mail on dalembert and goeje to use smtp.pmtpa.wmnet as smarthost
05:45 JeLuF: removed moreri and bart from "apaches" nodegroup

1 June

19:10 JeLuF: moved binlogs 184 and 185 from ariel to khaldun
15:04 Tim: fixed timezone on coronelli
14:35 Tim: had a go at fixing ntpd on various servers. It was not installed on coronelli and not running on srv5, fixed both fairly easily. Synchronised configuration files on srv11-30, they're still reporting "synchronization failed" as ntpd starts up, although I was able to synchronise their clocks manually with ntpdate. "ntpdc -p" seems to indicate that they are working properly.
5:10 jeluf: Added index, set site to read/write
04:10 brion: updated user tables for password hash salting.
3:00 jeluf: set farm to read only

31 May

16:12 Tim: switched profiling from user time to real time
13:45 brion: experimentally disabled MakeSpelFix in lucene search results to compare load / response time
5:00 jeluf: CREATE INDEX id_title_ns ON cur (cur_id, cur_title, cur_namespace); on all wikiquote, wikinews, wiktionary, wikibooks, dewiki and all wikis with 10'000 to 100'000 articles. To be done tomorrow: enwiki, frwiki, jawiki, wikipedias with <10'000 articles

30 May

18:02 kate: starting copying khaldun:/usr/etc/images/enwiki/enwiki_upload.tar to srv11:/usr/etc/backup/images/
11:55 brion: enwiki image archives and thumbnails have by now been copied to khaldun. all should be right with the world.
07:49 brion: increased bacon's share of load, but not quite up to previous levels
05:20 jeluf: moved binlogs 175-179 from ariel to khaldun
03:20 brion: took khaldun out of apaches group, added to images group. en.wikipedia.org images are moved to khaldun, thumbnails still copying.

29 May

23:30 brion: working on moving en.wikipedia.org's uploads from albert to khaldun
21:18 brion: reduced load on bacon to keep it from lagging
11:15 brion: added bugzilla stats collection to cron.daily

28 May

20:07 kate: started a full image dump on khaldun using modified backup scripts
08:00-ongoing jeluf: Migrating enwiki to external storage
07:30 jeluf: moved binlogs 170-174 from ariel to khaldun

27 May

22:59 brion: lucene search on wikimedia-wide
11:57 brion: servmonii seems to be offline; not on irc, and smlogmsg fails when doing syncs
11:38 brion: installed simple experimental edit/move rate limiter with fairly conservative settings for now
07:35 brion: changed default search namespaces from NS_TEMPLATE_TALK to NS_HELP (whoops!)

26 May

06:30 jeluf: migrated dawiki
05:30 jeluf: migrated concatZippedHistoryBlobs of eowiki,glwiki,bgwiki to external storage cluster srv28/29/30

25 May

21:50 brion: vincent has been reinstalled with FC3. Running a full Lucene index build for all wikis now...
20:00 dmonniaux: on bleuenn/chloe/ennael: disabled DNS through Wikimedia servers through PPP (didn't work, prevented squid from restarting); used Lost-Oasis servers instead (cf /etc/resolv.conf); inserted iptables -I INPUT -j ACCEPT so as to allow DNS etc. in (please remove once you know what you're doing)
06:30 jeluf: moved binlogs 166 and 167 from ariel to khaldun
04:17 Tim: noticed that webster had stopped replicating 4.5 hours ago. Offloaded it and ran "REPAIR TABLE bugs.bugs" to fix the problem.
01:47 kate: albert's eth1 died for unknown reasons, site broke. configured eth0 as a trunk port to keep site operational.
00:35 brion: running lucene updates on vincent; out of search rotation during build

24 May

21:15 jeluf: moved binlogs 163-165 from ariel to khaldun
15:16 Tim: started update-special-pages-loop, in a screen on zwinger. Using benet for DB.

23 May

20:00 jeluf: added "-A" to /etc/sysconfig/ntpd, synched clocks
15:30 jeluf: installing MySQL 4.0.24 to srv28-30, srv30 will be master, srv28 and 29 will be slaves
08:53 brion: vincent is serving searches from the Mono-based server experimentally
06:15 jeluf: moved binlogs 160-162 from ariel to khaldun
02:35 brion: page moves back on
02:14 brion: temporarily disabled page moves while cleaning up aftermath of a move vandal
01:34 brion: running Lucene index updates and tests

22 May

22:45 brion: fixed another hidden year 2025 entry on eswiki which screwed up recent changes caching
20:15 jeluf: restarted slave. That was faster than I expected.
20:00 jeluf: stopped slave on benet, doing some dumps.
3:00 erik: ran /home/erik/commonsupdate.pl (logged to commonscategoryupdate*.txt) to change category sort keys "Special:Upload" and "Upload" to proper page titles (bug 73); this fixes paging on categories with more than 200 images. Bug 73 is now fixed, so this should not reoccur, but other wikis will have the same problem and can be quickly fixed with this script if necessary.

21 May

23:10 jeluf: moved binlogs 153-159 from ariel to khaldun:/usr/backup/arielbinlog/
3:00 erik: setting up sr.wikinews.org. Not announced yet until language files are fixed.

20 May

19:00 Chad: put zwinger, holbach and webster on the scs. Took moreri, smellie and anthony off. Tim changed software labels.
15:15 midom: killed suse firewall and kernel security stuff. it freaked out all sysadmins, shouldn't be allowed to live :)
12:50 brion & many: all hell breaks loose with ldap oddness on albert and dns and... stuff
4:30-5:40 Tim & onlookers: DNS failure on zwinger. Took us an hour to fix it instead of 2 minutes, and caused problems site-wide, because we're using non-redundant DNS instead of /etc/hosts. Logins were timing out because commands in the login scripts were waiting for a DNS response. Managed to get root on albert first, and set about modifying resolv.conf on all machines to use albert as well as zwinger. Eventually got root on zwinger, had to kill -9 named. Restarted it, everything is back to normal.
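
The 4:30-5:40 outage came down to hosts depending on a single resolver. A minimal Python sketch of a check for that condition, assuming the standard resolv.conf layout of `nameserver <address>` lines; the function names are hypothetical and the addresses in the usage example are documentation-range placeholders, not the real zwinger or albert addresses:

```python
def nameservers(resolv_conf: str) -> list[str]:
    """Return the addresses listed on 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def has_redundant_dns(resolv_conf: str) -> bool:
    """True when at least two distinct resolvers are configured."""
    return len(set(nameservers(resolv_conf))) >= 2


sample = "search example.org\nnameserver 192.0.2.1\nnameserver 192.0.2.2\n"
print(nameservers(sample), has_redundant_dns(sample))
```

A check like this could be run across all machines to flag any host still pointing at only one resolver.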

19 May

22:00 jeluf: Upon GerardM's request, and due to ongoing vandalism on li.wikipedia.org, promoted user "cicero" to sysop on liwiki
21:48 jeronim: removed isidore from squids dsh group (and condolences for the eurovision tragedy)
21:30 midom: after surviving ddos aimed at my dsl and lithuania's failure in eurovision I finally moved some ariel binlogs to alrazi/khaldun (raid1 :)
02:45 kate, brion: fixed ldap/firewall for external servers

18 May

18:11 Tim: Categorised the servers by interface and vlan at Interfaces. Fixed routing tables on a few hosts that were non-standard for their category.
16:35 Tim: removed isidore and vincent from dsh ALL node group, non-standard configuration. Also removed bart and moreri, permanently down
14:54 jeluf: flushed firewall on bayle. Back in apache service.
14:38 brion: readded vincent in search group

17 May

19:45 JeLuF: restored lost history on dewiki
10:29 Tim: Updated DNS to get closer to this, and hence reality.
09:45 Tim: Fixing sources in gmetad.conf fixed it
09:28 Tim: Moved ganglia configuration to /h/w/conf/gmond, symlinked config on new apaches, changed cluster name, restarted ganglia. It doesn't seem to have fixed the recording problem.

16 May

16:50 Tim: Moving ariel binlogs 130-139 to avicenna. Don't ask me where 106-129 went.
- as I already said: khaldun.
05:50 brion: added a bunch of nazi spam subjects to wikipedia-l spam filters, hoping to reduce admin load

15 May

.... dammit is moving memcacheds around to work around browne problem ...
12:05 brion: installed libtidy-devel and patched tidy PECL extension on srv11-srv30
11:59 brion: browne is having some funky problems; can't talk to the srv machines, which is Bad for memcached work
10:10 brion: installed updated LanguageEl.php; had to fix permissions on file

14 May

23:40 brion: disabled catfeed extension for security review
23:25 brion: lucene search now up for all en, de, eo, and ru sites. In theory.
10:30 brion: running enwiki index update again
08:00 brion: vincent back online; eth0 had not initialized properly

13 May

21:45 brion: ran checksetup.pl on bugzilla to apply stealth database updates which broke login
19:05 brion: upgraded bugzilla to 2.18.1
18:37 brion: wikibugs back on irc
18:13 brion: hacked Image.php to ignore metadata with negative fileExists, and updated wgCacheEpoch to force rerendering. broken images should be mostly fixed now
17:57 brion: grants wiki fixed (wrong directory was synced in docroot)
16:20 jeronim: bugzilla bot not running, problems with images ("Missing image" on wiki pages) not fixed
12:10 -14:00 and beyond: jer/kate/midom/tim: power loss @ florida colo. most servers lost power; albert, ariel, bacon, suda, khaldun, webster, holbach, srv2, srv3, srv4, srv6, srv7 did not
06:00 brion: running lucene builds for all remaining en, eo, ru, de wikis

12 May

22:23 brion: hacked language name for 'no' to 'Norsk (bokmål)' per request.
21:00 JeLuF: Test installation of mysql cluster on srv29 and srv30. Management server running on srv0. Installation done according to howto.
8:54 Tim: offloaded ariel to correct for load caused by compressOld.php and the pending deletion script
08:10 Tim: deleting articles on en marked "pending deletion", see w:User:Pending deletion script

11 May

19:46 Tim: started compressOld.php, running on a screen on zwinger
00:05 brion: corrected year on a fr.wikinews revision from 2025 to 2005. Assumed a very badly set clock yesterday morning -- does anybody know about this? I can find no trace of it now, though there were several complaints about affected articles, other examples of which now show correct years. Did someone correct them? Who, and when?

midom: System clock wasn't synced to hardware clock before new server crash - servers came up with bad timers. Fixed bad entries in ~15 wikis (wikipedias only), therefore frwn remained..

10 May

23:59 brion: synched hardware clocks on all apaches to current system time. (some were hours off, a few were in 2003)
21:33 brion: resynched clock on srv14 to zwinger with ntpdate; was about a minute off.
14:00 midom: restarted all memcacheds
13:15 midom: chain reaction of slow image server maxed out fds on memcached, which caused even more image server load. temporary workaround: remove some old apaches from service, so that memcached would function a bit better.
12:20 midom: ldap server reached maxfiles. fixed in /etc/sysconfig/openldap && restarted
12:00 midom: recovered broken new apaches
07:55 brion: disabled curl extension loading in case it makes a difference if/when mysteriously killed machines are raised from dead
07:31 midom: srv11-srv30 all died
07:30 brion: installed curl PHP module on apaches

9 May

22:00 chaper, jeluf, midom: srv11-srv30 joined apache service.
01:30 brion: removed three invalid image records from commons (from 1.3 era before some name validation fixes)
01:00 brion: Somebody (gwicke?) checked out an entire phase3 source tree inside the 1.4 live installation directory. That's a very bad place for it -- it would get replicated to all servers if a full sync is run. I moved it to /tmp.

8 May

22:09 Tim: discontinued freenode enwiki RC->IRC feed
21:45 JeLuF: removed khaldun from dsh group mysqlslaves
21:25 JeLuF: fixed replication on holbach, otrs.ticket_history was broken. Holbach back in service.
15:00 JeLuF: fixed replication on bacon, otrs.ticket_history was broken. Bacon back in service.
11:08 Tim: added CNAME for irc.wikimedia.org, still working its way through the caches. Opened up port 6667 on browne. Switched on RC->UDP for all wikis, the whole thing is now fully operational.
11:00 brion: cleared image metadata cache entries for commonswiki due to latin1-wiki bug inserting bogus entries
8:40 Tim: installed patched ircd-hybrid on browne

7 May

11:00 brion: replaced wikimania's favicon with the WMF thang. running some lucene updates in background on vincent
02:18 brion: started squid on isidore; had been down for some time. cause unknown.
(some time earlier) tim: made unspecified changes to squid configuration for another external squid

6 May

03:00 brion: updated Latin language file, changed namespaces on those wikis.
01:05 brion: suda caught up. back in rotation.
00:44 brion: restarted replication on suda (bugzilla's votes table had some kind of index error)
00:37 brion: took suda out of rotation; replication is broken

5 May

21:39 brion: starting rc bots on browne. Configuration has changed, they are not using a proxy and must be run from a machine with an external route.
16:38 Tim: dumping, dropping and reimporting bgwiktionary.brokenlinks seems to have worked, gradually reapplying load
15:55 Tim: Trying standard recovery procedures
15:08: Suda crashed due to corrupt InnoDB page

4 May

22:15 brion: hacked in os interwiki defs for wikipedias (not other wikis, not sure if they're even set up)
18:52 Tim: installed RC->UDP->IRC system. The UDP->IRC component is udprec piped into mxircecho.py running in a screen on zwinger. This removes the high system requirements previously needed for RC bots.
10:30 Tim: Bots K-lined. Removed enwiki and dewiki to avoid further offence, and left them in a reconnect loop. If someone wants to approach Geert yet again, be my guest.
10:20 Tim: moved RC bots to browne, which is mostly idle, has plenty of RAM, and has an external IP address, allowing it to connect to freenode without going through the apparently undocumented and non-working port forwarder on zwinger.

To find the documentation, enter "forwarder", "forwarding", or "irc" into the search box on the left, and click the Search button. In the notes on the relevant page, the code for the forwarder comes after "code: ".

6:45 jeluf: started squid on will, was down.
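
The 18:52 entry above describes a receiver for UDP datagrams piped into an IRC echo script. The sketch below shows only the receiving half of that idea: it reads one datagram and writes it as a line to an output stream. It is not the code of udprec or mxircecho.py, which the log does not show; `open_receiver` and `relay_one` are hypothetical names, and the one-message-per-datagram format is an assumption.

```python
import socket
import sys


def open_receiver(host="127.0.0.1", port=0):
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def relay_one(sock, out=sys.stdout):
    """Receive one datagram and write it to out as a single line.

    Assumes each datagram carries one text message (an assumption,
    not something the log states). Returns the decoded line.
    """
    data, _addr = sock.recvfrom(65535)
    line = data.decode("utf-8", errors="replace").rstrip("\r\n")
    out.write(line + "\n")
    out.flush()
    return line
```

Calling `relay_one` in a loop and piping the process's stdout into another program gives the same overall shape as the pipeline in the entry.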

3 May

22:25 kate: changed liwiki tz to Europe/Berlin
4:40 jeluf: added webster to DB pool again.

2 May

14:15 midom: after second consecutive webster crash, took it out from rotation, trying forced innodb recovery, planning resync:

050502 14:11:15InnoDB: Assertion failure in thread 1207892320 in file btr0cur.c line 3558
InnoDB: Failing assertion: copied_len == local_len + extern_len
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
InnoDB: about forcing recovery.

14:00 midom: webster's mysql crashed with some assertions, did come up later and continued to serve requests after some load management
11:00 brion: started squid on srv7, which had been down for unexplained reasons and its IP addresses had not been reassigned
07:20 brion: rebuilt foundation-l list archives after removing some personal info by request

1 May

00:05 brion: changed $wgUploadDirectory settings so they won't break in maintenance scripts. hopefully didn't get them wrong.

30 April

23:30 brion: cleared image cache for all wikis. bogus entries probably added during links refresh; maintenance scripts have wrong $wgUploadDirectory
23:00 brion: cleared image cache entries in memcached for commonswiki due to spurious entries marked as not existing.
04:25 Tim: Setting up for perlbal throughput test on tingxi

29 April

22:18 Tim: resumed refreshLinks.php
15:57 Tim: stopped refreshLinks.php at the end of enwiki, before the delete queries
15:28 Tim: Restarted avicenna, which caused the site to crash due to a large number of threads waiting for Lucene

<TimStarling> what is avicenna's role?
<dammit> was: search server
<dammit> dunno now
<TimStarling> avicenna is reporting 20% user CPU usage
<dammit> every host that runs lucene
<TimStarling> but nothing is showing up in top
<dammit> has broken top output
<dammit> and broken ps output
<TimStarling> nothing important shows up in netstat, I'll just reboot it
<TimStarling> ok?
<dammit> 'k
*site explodes*

09:10 brion: took vincent out of lucene search rotation while it's building; changed default_socket_timeout in php.ini to 3 seconds from 60
04:00 brion: started incremental index update for lucene search indexes
03:38 Tim: resumed refreshLinks.php after having stopped it for a while during peak period

28 April

05:12 Tim: Shutting down apache on srv1 to dedicate it to refreshLinks.php
02:10 brion: set up logrotate on isidore to rotate squid log, in hourly cron
01:40 brion: manually rotated squid log on isidore due to reaching 2gb, restarted squid.
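
The entries above deal with a squid log that had to be rotated on reaching 2 GB. The log does not say why that size was a problem; a signed 32-bit file offset limit is one common cause, and that is an assumption here. A minimal sketch of a size check that could run before the limit is hit (`needs_rotation` and its `headroom` parameter are hypothetical):

```python
import os

# 2 GiB, the size at which a signed 32-bit file offset overflows.
TWO_GIB = 2 ** 31


def needs_rotation(path, limit=TWO_GIB, headroom=0.25):
    """Return True once the file has used up (1 - headroom) of limit.

    With the defaults, rotation is requested at 1.5 GiB, leaving a
    quarter of the limit as a safety margin between checks.
    """
    return os.path.getsize(path) >= limit * (1 - headroom)
```

The margin matters because a check that runs hourly can only help if the log cannot grow from the threshold to the limit within one interval.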

27 April

07:30 brion: installed patched Tidy extension on apaches to fix binary-safe string bug.

26 April

06:00ish brion: copied updated lucene indexes to avicenna and maurus, put vincent back in search rotation
05:40-05:55: Severe external network problems
05:25 Tim: deleted obsolete binlogs, moved the remainder (77-87) from ariel to avicenna. 33 GB of disk space remaining on ariel.
(yesterday) jeronim: installed python 2.4.1 from source on alrazi, using make altinstall instead of make install, so that the current python 2.3 installation is not interfered with -- the 2.4.1 binary is at /usr/local/bin/python2.4

25 April

23:30 jeronim: clocks were wrong on 5 machines; fixed 4 of them (installed ntpdate on vincent). isidore still needs to be done (dammit? :)
07:55 brion: started a second active search daemon on maurus (vincent is still rebuilding indexes)
05:00 jeluf: enabled LuceneSearch.
01:20 brion: had to restart srv7 squid again. moved logrotate from cron.daily to cron.hourly, where it should have been before but wasn't

24 April

21:30 jeluf: disabled LuceneSearch. All apache processes were in state LuceneSearch::newFromQuery
11:15 jeluf: set wgCountCategorizedImagesAsUsed for commons.
02:55 brion: manually rotate squid log on srv7 again when it reached 2gb and crashed. logrotate needs to be fixed...
02:15 brion: installed GCC 4.0 final on vincent, avicenna for GCJ. Taking vincent out of search rotation for index rebuild.

23 April

13:15 Tim: recaching special pages, with wget script running in a screen on zwinger, which requests recache pages from bayle, which sends the expensive queries to benet.
02:25 brion: manually rotated logs and restarted squid on srv7. had been down for 2.5 hours, but nobody noticed the alarm from servmon.

22 April

10:20 brion: as a temporary hack, bumped rc_namespace on metawiki from tinyint to int. somebody added a russian help namespace at 128/129 which is outside of the signed tinyint range, so pages were recorded with the wrong namespace.
01:30 brion: removed 'wrap' option from tidy.conf to work around weird corruption problem (may be bug in tidy; investigating)
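
The 10:20 entry above turns on the range of a signed tinyint, which is -128 to 127. The sketch below checks whether a namespace number fits and shows what clipping to the nearest endpoint would store. Clipping is how MySQL handles out-of-range values when strict mode is off; whether that is exactly what happened on metawiki is an assumption, and the function names are hypothetical.

```python
TINYINT_MIN, TINYINT_MAX = -128, 127


def fits_signed_tinyint(value):
    """Return True if value can be stored in a signed 8-bit column."""
    return TINYINT_MIN <= value <= TINYINT_MAX


def clip_signed_tinyint(value):
    """Return value clipped to the signed tinyint range."""
    return max(TINYINT_MIN, min(TINYINT_MAX, value))
```

Under that assumption, namespaces 128 and 129 would both be stored as 127, so rows from two different namespaces end up indistinguishable.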

21 April

18:00 midom: started backup run on benet

20 April

11:25 brion: tidy extension installed on apaches, now active. To go back to external, set $wgTidyInternal = false; or remove extension=tidy.so from php.ini and restart apaches
10:50 brion: added node groups fc3, freebsd, debian
10:06 brion: removed isidore and vincent from fc2-i386 node group, as they're running FreeBSD and Debian
10:00 brion: working on installing tidy extension for php...
03:00 brion: re-enabled search

19 April

16:50 Tim: Pope-related flash crowd, peaking at 2100/s. Apaches were hard hit by searches (about 50% of profile time) so I disabled them temporarily.
16:00 Tim: we were getting reports of gzuncompress errors in memcached-client.php, on every page view on en. I put in an error suppression operator and instead logged all such errors to /home/wikipedia/logs/memcached_errors, to determine which server was the problem. It turned out to be not a server but a key, enwiki:messages to be precise. Deleting it and letting it reload fixed the problem.
07:30 midom: sad notice, smellie down, memory or other hardware troubles, lots of segmentation faults and other signals before reboot, didn't come up after.
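
The 16:00 entry above describes catching decompression errors and logging them to find the culprit, which turned out to be a single key. A minimal sketch of that diagnostic idea, run over an in-memory mapping instead of a live memcached: `find_corrupt_keys` is a hypothetical helper, not the change made to memcached-client.php.

```python
import zlib


def find_corrupt_keys(entries):
    """Return (key, error message) pairs for values that fail to decompress.

    entries maps cache keys to zlib-compressed byte strings. Errors are
    collected rather than raised, so one bad value does not hide others.
    """
    bad = []
    for key, blob in entries.items():
        try:
            zlib.decompress(blob)
        except zlib.error as exc:
            bad.append((key, str(exc)))
    return bad
```

Given one valid entry and a corrupted `enwiki:messages` value, only `enwiki:messages` is reported, which points at the key and not at a server.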

17 April

09:00 midom: fixed broken webster replication, caused by the bugs table in the bugs database
06:45 brion: fixed symlinked php.ini on srv2, srv3
00:00 midom: reformatted suda data area from xfs to ext2, brought into MySQL service for enwiki only

14 April

03:20 brion: eowiki lucene search live! others building...
02:45 brion: started lucene index builds for eowiki, ruwiki, dewiki
02:15 brion: lucene search live for meta
01:45 brion: restarted meta search build, as it was pulling from wrong db. whoops!

13 April

23:51 brion: noticed some spam coming in on bugzilla. hacked rel="nofollow" into comment processing, removed the comment, and disabled the account used to post it.
22:40 brion: starting lucene index builds for metawiki and some other wikipedias
00:08 brion: removed Apache-Midnight-Job from avicenna crontab

12 April

23:50 brion: vincent and avicenna are sharing LuceneSearch burden.
20:00 brion: Chad fixed vincent, which is now running lucene. Isidore lucene stopped, it's going to be squid soon. Will take over an apache for additional search capacity.
13:30 brion: lucene search turned on for en with slightly old index file, daemon running on isidore
10:30 brion: gcj on isidore seems horked; index rebuild is much too slow (eta 18 hours) so stopped it. uploading an index from home, and building mono for further testing.
10:00 midom: holbach restored.
08:55 holbach seems to be deadish
08:50 brion: started lucene index build on isidore
05:50 brion: vincent doesn't seem to be coming up again, will need to be kicked.
05:20 brion: upgrading vincent to 2.6 kernel hoping to resolve threading/memory issues w/ MWDaemon
02:10 brion: rebooting srv6 due to zombie squid eating port 80

11 April

23:05 kate: experimenting with making an en.wp image dump using trickle (cvs: /tools/trickle/)
08:00 midom: broken replication (by Chinese scammer) on bacon, fixed by "use otrs; repair table article" - myisam tables are evil, aren't they?

10 April

~23:00: kate: upgraded squid to STABLE9+patches (see squid builds) + restarted all squids.
mark: All squids are running with too few FDs (1024), and if no one replaces all daemons with the new one Kate just built, we may have a problem tomorrow during peak hours...
19:15 midom: srv7 is now in squid service
19:07 brion: MWDaemon's memory usage got high enough it started swapping. Hung connections ate up apaches and hung the site until it was restarted.
5:30 brion: lucene search server active for en.wikipedia.org, running on vincent.

9 April

15:45 midom: dropped thttpd (as it was using 32bit mmaps) on dumps in favor of lighttpd. It has superb performance, serves 3500 hits/s under ab and served 70MB/s from benet in small reqs... Extreme recommendations for using lighttpd for image uploads.
10:15 brion: running lucene search indexer on vincent (pulling enwiki from benet).
05:25 brion: added additional is rcbots to #is.wikipedia for tionary/books/quote

8 April

16:00 midom: redirected http://download.wikimedia.org/ to benet, misses tomeraider and uploads...
13:00 Tim: switched to Mark's squid binary on the French squids

7 April

Mark, Tim: implemented Multicast HTCP purging on all FL apaches/squids. French Squids still need a binary replacement.

6 April

21:44 mark: Put port gi0/26 on csw1-pmtpa into trunking mode: vlans 1-2 only, with vlan 2 being the native vlan, no LACP negotiation
11:30 midom: benet put into dump operation
10:55 brion: reinstalled PHP on zwinger and apaches, compiled with memory limit and mbstring options enabled. This was left out when upgrading to 4.3.11.
2:40 brion: added NetCabo proxies to trusted proxy list (inconveniently shared by Jorge and a Nazi vandal on pt.wikipedia.org)

4 April

15:30 jeluf: disabled logging of upload.wikimedia.org
15:15 midom: yet another image server overload. rotated 30G upload.wikimedia logfile, could be fragmentation overhead.
12:00 midom: moved log_bin.0[0123]? (40G worth of binlogs) from ariel to khaldun/avicenna backup/arielbinlog, reclaimed some master disk space.
- Do we need those binlogs for anything?
  - Yes, we need binlogs back to the last full backup -- TS
07:48 Tim: Started memcached on browne, it was in the list but not running. Fixed startup scripts. Noticed that browne can't contact albert on 10/8, modified yum.conf accordingly.

3 April

18:25 midom: extended public IP address range (now: 12 addresses)
17:50 midom: srv5 joined service as squid.

1 April

22:30 midom: Enabled recentchanges-based watchlist hack. Servers go faaaast.
23:15 brion: set default block expiry to 1h on dewiki by request of various admins

2000s

Archive 1: 2004 Jun - 2004 Sep
Archive 2: 2004 Oct - 2004 Nov
Archive 3: 2004 Dec - 2005 Mar
Archive 4: 2005 Apr - 2005 Jul
Archive 5: 2005 Aug - 2005 Oct, with revision history 2004-06-23 to 2005-11-25
Archive 6: 2005 Nov - 2006 Feb
Archive 7: 2006 Mar - 2006 Jun
Archive 8: 2006 Jul - 2006 Sep
Archive 9: 2006 Oct - 2007 Jan, with revision history 2005-11-25 to 2007-02-21
Archive 10: 2007 Feb - 2007 Jun
Archive 11: 2007 Jul - 2007 Dec
Archive 12: 2008 Jan - 2008 Jul
Archive 12a: 2008 Aug
Archive 12b: 2008 Sept
Archive 13: 2008 Oct - 2009 Jun
Archive 14: 2009 Jun - 2009 Dec