Server admin log/Archive 4
- 11:50 brion: knams squid at 126.96.36.199 is not reachable, but still in dns rotation. THIS IS BAD
- 01:50 brion: pascal is offline, reason unknown. bugzilla down, no NFS for knams cluster.
- 01:06 kate: put a new skin on bugzilla
- 18:50 brion: blocked irc4ever.net remote page loaders
- 08:08 kate: upgraded mysql on vandale to 5.0.9
- 19:05 brion: set $wgMetaNamespace to 'Vikipedi' on trwiki, refreshing links
- 18:15 mark: Added two missing kennisnet squid IPs to the udpmcast startup script on larousse, and restarted it.
- 17:29 brion: added wikimania-l mailing list
- 17:25 mark: Pointed thailand at knams as a test - some people there say it is much faster than pmtpa. Will eventually be replaced by the yahoo cluster anyway...
- 16:15 brion: set ndswiktionary to capitallinks off
- 10:10 brion: updated sudoers file on srv0 so syncs work again
- 22:50 brion: restarted search update daemon... still seems to be a memory leak and it hangs when it gets too large
- 22:31 brion: moved wiki.mediawiki.org to www.mediawiki.org and redirected from mediawiki.org and wiki.mediawiki.org to it
- 22:07 brion: srv0 clock was about 150 seconds in the future. kate did something to fix it. synchronized hardware clocks on all apaches to system time in the hope that the clocks survive a reboot. Fixed one revision reported to be showing up in a weird inverted order.
- 13:50 brion: took avicenna out of search group to do experiments on index
- 23:30 Tim: added rollback group
- 22:00 Tim: moved group settings from CommonSettings.php to InitialiseSettings.php
- 23:45 brion: updated clocks on srv1, rabanus, etc all apaches... hopefully
- 21:40 brion: set wgCapitalLinks off on afwiktionary
- 19:20 mark: Removed legacy zone gdns.wikimedia.org and corresponding georecord rr.gdns.wikimedia.org from all nameservers. It's not being used anymore, and only confuses people.
- 19:05 mark: Pointed france and switzerland back at lopar in geodns
- 14:10 brion: created wikinews-hotline mailing list by request
- 23:58 Tim: fixed Special:Uncategorizedcategories, now running updateSpecialPages.php on /h/w/c/smallwikis
- 15:30 brion: reverting build copy of search index to the previous version to try working around some corruption from daemon crash (?)
- 18:27 mark: An empty line in the geomap file caused problems and made the site go down for non EU users. Apparently geobackend currently doesn't handle empty lines in geomap files (a bug which I will fix), so don't use them.
- 18:18 mark: Pointed all European countries at knams wrt geodns
- 17:07 kate: wrote a new statistics system and replaced webalizer with it
- 07:30 brion: had to restart search daemons again due to breakage. whyyyyyyy they worked before *sob*
- 00:15 hashar: overloaded suda for almost 5 minutes by running the unbugged updateSpecialPages script. Might be cause of Wantedpages.
- 02:50 brion: separated mediawiki-installation and apache node groups. These must not point to the same file.
- 02:00-3:15 erik: created Japanese Wikinews at http://ja.wikinews.org/
- 20:59 brion: had to interrupt bgwiki backup due to memcached hang
- 06:10 brion: restarted search servers; 'too many open files'
- 01:30 brion: started backup on benet (slave stopped). updates in #wikimedia.15status
- 23:35 brion: commented out lopar from geodns for now (moved them to knams)
- 23:20 brion: there's intermittent packet loss to lopar...
- 19:10 mark: Site was down due to crashed perlbal on holbach, restarted it
- 12:03 kate: put lily back to squid pool
- 08:10 jeronim: set yum on larousse (FC2) to use fedoralegacy.org
- 08:00 mark: lily's hardware has been replaced.
- 07:40 jeronim: set HostnameLookups Off on larousse's apache at hashar's request
- 07:10 jeronim: added CNAME commons.wikipedia.org -> commons.wikimedia.org
- 00:40 brion: restarted mysql on james's advice with config change. innodb_lock_monitor fails, however. have innodb_status_file=1 set now. had to do 'slave stop' on samuel, which is master. wtf
- 23:40 brion: set innodb_lock_monitor on samuel on jameday's recommendation. will be active when mysqld restarted
- 23:20 jeluf: restarted ServmonII. Died when it lost its irc connection earlier today.
- 23:05 brion: removed the fateful link so editing that page works for now
- 22:30 brion: disabled deletion of recentchanges records due to slowness there. hacked Title::touchArray to go row by row due to weird hangings trying to edit Template:POTD on enwiki. Not sure what's wrong, it consistently hangs at User:Mulad/portal. What could be locking it?
- 18:30 brion: biased search load to maurus, as avicenna (with less memory) was being sluggish. added comment to output saying which server was hit
- 15:10 mark: Removed authoritative zones that were no longer pointing at zwinger from zwinger's Bind configuration (interferes with resolving). Set up AXFR slaving of zones that are supposed to be served by the new PowerDNS servers, but which are still delegated to Zwinger/bomis/fuchsia.
- 14:50 mark: Fixed reverse DNS for knams
- 17:00 brion: shut down slave thread on ariel before it explodes
- 05:40 hashar: check out our new portal: http://noc.wikimedia.org/
- 01:07 kate: removed ariel from load balancing because it only has 700MB of disk space left.
- 10:30 brion: fixed up steward mode in special:makesysop plugin to provide the full userrights options
- some time in the morning kate: reverse dns for knams started working, although under *.rev.wikipedia.org.
- 08:02 brion: reassigned 'developer's on meta to steward group
- 5:20 brion: started mass lucene index builds using the updater daemon. once done, will sync current index files out. (progress in #wikimedia.15status)
- 13:50 brion: added page update hook for the lucene update daemon, see wikitech-l post
- 11:38 mark: Installed java (!) on pascal, to allow Kennisnet/ZX to upgrade the SP and BIOS on lily.
- 11:34 brion: maurus had bogus hostname (maurus.wikimedia.org, doesn't resolve). fixed live and in /etc/sysconfig/network
- 08:55 brion: upgraded PEAR::XML_RPC to 1.3.2 on mediawiki-installation group. Patching mono on avicenna and maurus for ximian bug 75467
- 08:30 brion: noticed vincent seems to be hung
- 07:00 Jamesday: changed holbach cache split from 200M/2800M to 200M/2500M because of excessive page faulting in vmstat, not yet restarted.
- 14:40 Tim: named on albert exited for no apparent reason, causing site-wide slowdown. Logged on via the scs and started it.
- 07:00 brion: all wikis reading from 1.5 code now. zh-min-nan.wikipedia.org has the UI broken -- code problem selecting wrong UI language [since fixed]
- 06:30 brion: fixed up broken conversions on sdwiki, rowikibooks, fiu_vrowiki, cowikibooks, aawiki
- 06:00 brion: upgraded meta to 1.5
- 04:00 kate: upgraded all knams machines to current kernel to fix bad pmd problem
- 10:43 kate: put back mint to squid pool
- 9:15 mark: Added zh-tw.wikimedia.org CNAME record to the wikipedia.org zonefile, as it was missing (and is not in langlist, for not being a language)
- 8:40 mark: Added an admin account on lily's SP, and set up temporary port forwarding on pascal to give ZX (sysadmin partner of Kennisnet) access to diagnose lily's hardware problems
- Jason/mark: Many Wikimedia project domains have been changed to use the new PowerDNS DNS servers, so if you see any DNS related problems, it might be having to do with that
- 19:32 kate: set up squid log migration system
- 08:10 brion: migrated forgotten changes to InitialiseSettings from 1.4 to 1.5 (jbowiki caps, fiu-vro logo, zhwiki externalstorage)
- 03:08 kate: removed srv1 from apache pool again.
- 21:35 jeronim: srv1, srv2 & LDAP alive again after manual reboot by colo staff. not sure if domas actually emailed about scs-ext problem.
- 20:05 jeronim: and scs-ext.wm.org doesn't work anymore. dammit has emailed colo about this and srv1/srv2 problem
- 20:00 midom: oopsie, srv1 also didn't come up after reboot, and apparently it was LDAP server... LDAP down.
- 19:00 midom: resyncing holbach, updated misbehaving apache hosts (srv2,srv3,anthony,rose), srv2 didn't come up after reboot.
- 06:10 brion: holbach crashed again, mysqld was restarting over and over. killed it for now.
- 05:05 brion: fixed more wikimania registration files
- 02:20 brion: fixed missing db config in wikimania attendees list
- 21:55 brion: holbach died. restarted zhwiki conversion w/o it.
- 19:30 brion: started asian large-wiki upgrades: jawiki, zhwiki
- 16:00 midom: bacon joined perlbal service, restarted perlbal on holbach, site looks happier.
- 09:00 brion: eswiki upgraded, doing ptwiki now. dammit took ariel out of rotation, ready for reloading
- 07:40 kate: moved bugzilla to pascal
- 06:51 brion: fixed db host for wikimania registration
- 06:45 midom: samuel is our master.
- mediawiki 1.4, mediawiki 1.5, bugzilla, and otrs should be configured properly for new master. is there anything else? [search server update needs changing anyway, working on this --brion]
- 04:50 brion: ran refreshLinks on enwikinews
- 04:30 brion: disabled sorbs checking for now
- 02:40 Jamesday: changed bacon cache split from 800M/2000M to 200M/2600M, not yet restarted.
- 02:30 Jamesday: changed holbach cache split from 1000M/2000M to 200M/2800M, not yet restarted.
- 02:05 brion: running background refreshLinks.php on dewiki
- 22:20 Jamesday: changed ariel my.cnf from MyISAM/InnoDB cache split of 1700M/3900M to 300M/5100M assuming minimal MyISAM use now. We've been this high before for InnoDB but there's a small chance that the new kernel on Ariel might not like going above 4G on the next restart - reduce it to 3900 if that happens. Not restarting ariel now because one is planned anyway and it's not that urgent - should improve load handling ability though. Decreased binlog_cache_size from 1M to 128k (it's per session and doesn't really need to be 1M).
- 08:20 brion: changed Revision legacy encoding conversion to use //IGNORE in iconv... this may need tweaking
- 06:10 brion: dewiki done.
- 05:56 brion: moved 1.5 skins dir from /w/skins-1.5 to /skins-1.5. Turns out squid configuration does cache-control rewriting on /w which makes them uncacheable. Bad squid!
- 00:45 brion: switched 1.5 wikis to shared filesystem sessions. A hack in User::matchEditToken fatally broke save attempts by previously-logged-in users because it didn't bother to check that memcached sessions were in use; I've commented it out.
- 00:30 brion: switched 1.4 wikis to shared filesystem sessions, perhaps this will relieve memcached session problems?
- 23:00 brion: installed test fix for firefox intermittent download problem
- 06:30 brion: set tidy's line wrapping off on 1.4 config as well (already on 1.5)
- 01:50 brion: finished. running refreshLinks.php on en.wiktionary.org (in background)
- 01:00 brion: running cleanupCaps.php on en.wiktionary.org to rename all article pages to lowercase
- ??:?? brion: somebody moved en.wiktionary.org to wgCapitalLinks off, throwing it into total chaos. thanks!
- 22:12 brion: removed some unused, added Mac OS X 10.4 to bugzilla operating systems list
- 18:18 brion: set $wgCapitalLinks off on jbo.wikipedia.org
- 22:20 brion: adding image table entries for 'missing' images (probably broken or half-canceled uploads from months back, mostly)
- 15:50 kate: setup ganglia, ssh, yum on adler and samuel
- 05:45 kate: set up and documented a better LDAP setup. removed srv1 from apache pool.
- 01:03 brion: enwiki upgrade broke with its slave reads: page table was incomplete. rebuilding page table from ariel, ETA ~2hrs
- 22:35 brion: turned off image metadata loading to speed things up -- will need to do that in a later script run
- 20:25 brion: dropped & recreated empty links on enwiki to free innodb space (already converted)
- 19:15 brion: disabled email authentication for now; will do mass checks later
- 08:40 brion: enwiki upgrade is now pulling revision data from adler, writing to ariel.
- 07:55 brion: somewhere in the midst of upgrading things. enwiki is going now; upgrade1_5.php is hacked up, please don't run any others until it's restored!
- 03:35 brion: adler was broken and badly lagging because somebody removed its 'cur2' tables and replication died when we dropped them from the master. fixed, returning...
- 02:15 brion: commons, wikinews, wikiquote, wiktionaries, and some misc are upgraded. Wikipedias and some others remain... Need to clear disk space on ariel
- 04:50 brion: commonswiki being upgraded; ETA in 6-7am range
- 04:10 brion: upgraded nostalgiawiki as test
- 02:10 brion: setting things up preparing for 1.5 upgrade
- 14:00 brion: adler back in rotation. probably needs reconfiguration for future...
- 13:30 brion: took adler out of rotation; mysqld crashed OOM and is recovering
- 21:45 mark: ...apparently because it was pointing at cache.wikimedia.org., which didn't exist in the new DNS zones... added.
- 21:45 mark: wiktionary.zone contained an old record fr.wiktionary.org CNAME wikipedia.geo.blitzed.org which for some reason made things break only now. Removed.
- 05:25 brion: changed sitename, metanamespace on la.wikiquote
- 12:00 mark: Changed www-dumps.knams DNS to CNAME dumps in preparation for moving vandale to an internal vlan
- 05:45 midom: did set global mysql timeout in php to 2 seconds.
- 05:22 Tim: restored load to samuel, also experimentally changed some other loads
- 04:28 Tim: realised that the site was down because of samuel and changed the load balancing ratios accordingly. At this time samuel is busy doing InnoDB recovery.
- 04:22 Tim: finished moving binlogs
- 04:18 samuel's mysqld exited for no apparent reason
- ~04:10 Tim: started moving binlogs 232-240 to khaldun
- 20:48 mark: Network problem has been worked around, switched geodns back.
- 19:30 mark: Severe network / reachability problems for florida, but knams seems to be able to reach it. Pointed all of geodns at knams and lopar exclusively.
- 00:01 kate: moved binlogs 228..231 from ariel to khaldun
- 20:00 mark: New DNS setup is active, but DNS zone delegations still need to be dealt with. Please note that there are NO wildcards anymore, so you will need to update DNS zonefiles when creating wikis! Also, for the next week or so, update both the old DNS setup, and the new one when changing records.
Problems will occur, DNS records may be missing, please tell me or update it yourself!
- 19:00 mark: I broke the site (for the first time, yay! :) because of the mixed old and new DNS setup; the old zonefile was using rr.chtpa. while the new one expected rr.pmtpa.. Oops. Negative cache TTL of 1H means that some users will not be able to access the site for a while.
- 18:50 mark: Activated new DNS setup on zwinger, which is partly used by the old Bind DNS setup
- 17:30 mark: Added records ns0/1/2 to wikimedia.org to allow changing NS delegation for the new setup
- 15:15 mark: Removed zwinger/gdns1 from the list of geodns nameservers in wikimedia.org, in order to build a new setup on zwinger
- 00:50 brion: upgraded mono on vincent to 1.1.8 (rpm packages), running a mass lucene index update
- 12:30 jeronim: added dsh node groups at florida: squids_lopar, squids_knams & squids_global
- 11:30ish jeronim: added Disallow: /wiki/? to robots.txt because bots were indexing stuff like http://en.wikipedia.org/wiki/?title=Nl%253Aolijfboom&action=edit
- 06:14-06:37 Tim: Started Folding@Home on mint, ragweed, hawthorn and mayflower
- 06:14 Tim: Started Folding@Home on iris. I started it on sage and clematis a few days ago without logging it here.
- 23:50ish brion: uplink via level3 died for a few minutes, either was fixed or PowerMedium rerouted it and we're back.
- 22:24 brion: webster is dead: ssh doesn't let in, scs doesn't respond on it. does ping. possible kernel panic.
- 20:15 jeluf: webster's mysqld got a signal 11, recovered automatically.
- 21:10 mark: Set up new requests/s stats at http://noc.wikimedia.org/reqstats/
- 20:40 mark: Removed lily from the knams squid pool in wikimedia.org DNS, it's broken.
- 20:30 mark: Added missing peer statements to knams squids
- 08:45 jeluf: moved ariel_log_bin.21 to khaldun
- 20:55 brion: truncated searchindex table on a bunch of wikis, freed ~5gb of disk space on ariel
- 20:30 brion: rebuilt search indexes for dawiktionary, svwiktionary due to bad encoding config
- 16:00 midom: used new apache-restart-all-hard (really hard!) so that slow watchlists (which actually was segfaulting apaches due to bad bytecode in cache on nearly all servers) would become fast ;-) we really need blank page logging somewhere..
- 09:30 midom: removed icpagents from some of apache hosts, was a major headache recently.. ;-)
- 09:00 midom: commented out some defunct or non-apache hosts (uh oh, nearly 10 in total) in perlbal's nodelist.
- 03:12 brion: started adding name_title unique index on remaining smaller wikis (<10k pages), 30 second wait between each
- 01:25 brion: unlocked ja.wiktionary on Angela's request
- 22:20 jeluf: moved binlogs 204-212 from ariel to khaldun
- 12:02 brion: stopped index addition for now (left off at bgwiktionary), will run at non-peak hours
- 11:54 brion: dupe checks done, adding index...
- 11:29 brion: running cleanupDupes.php on all wikis not already protected with a unique namespace+title index, then adding the index. the largest wikipedias were already protected.
- 10:00 brion: ran a salt fixup script to correct entries which had been erroneously re-saved with bad password due to memcached records floating around in the first couple days
- 21:45 chaper: inserted the CD that was delivered with the new hosts into srv4. jeluf mounted to /media/cdrom. Apparently containing RAID controller software for many OSes incl linux.
- 19:55 kate: lily hung and died again during fsck. moved its ip to ragweed and left it off for now.
- 02:50ish brion: inserted live debugging hack in Article.php for deletion problem on en.wikipedia.org (bug 2195)
- 19:25 jeluf: moved binlogs 200-204 from ariel to khaldun. New DB servers have arrived in data center.
- 13:15 brion: ran namespaceDupe checker on skwiktionary, skwikiquote due to prob w/ namespace changes there
- 00:35 brion: added wikiskan-l list for Scandinavians
- 22:07 kate: setup dumps mirror at http://www-dumps.knams.wikimedia.org/
- 19:00 jeluf: moved binlogs 198 and 199 from ariel to khaldun
- 18:48 brion: reactivated search
- 9:00-19:00 all: Moved to new Tampa data center
- 10:00 brion: replaced lighttpd on fuchsia with apache because the errordocument stopped working for no reason
- 08:00 or so; brion: added fuchsia to wikimedia.org dns, using an alias from dammit because of crappy verio interface. still not on wikipedia.org because we can't get in to it.
- 07:00 or somewhat: horrible things begin
- 13:40 kate: started copying dumps to vandale
- 11:30 kate: made a small db change for wikimania registration to implement a change in the form. left a backup of the old one at zwinger:/root/wikimania.prekate.sql
- 10:05 kate: set up logrotate on knams
- 01:43 Tim: moved binlogs 194-197
- 22:40 kate: reinstalled mint with better partition layout, added it to squid pool
- 21:00 gwicke: fixed mysql error messages in this wiki after config tweak to index words from 3 chars. You should now be able to search for things like 'DNS'.
- 14:55 mark: bound bind to 188.8.131.52 (pascal's main ip) only, adapted firewallinit to allow incoming DNS zone transfers
- 14:19 kate: added lily to squid pool
- 13:25 mark: Added ip 184.108.40.206 to pascal, adapted /sbin/ifup-local.
- 10:02 kate: iris -> squid pool
- 09:03 kate: clematis -> squid pool
- 08:46 kate: sv,dk,no.wp -> knams
- 08:09 kate: de.wp -> knams
- 07:56 kate: put mayflower to knams squid pool. fixed typo in commonsettings breaking squid caching.
- 18:27 kate: added hawthorn to squid pool
- 18:10 kate: created rr.knams pool, put UK, NL, DE and LT on it.
- 16:28 kate, jer, dammit: started squid on ragweed, put it in lopar pool for now
- 15:30 jeluf: moved binlogs 190-193 to khaldun
- 13:54 jeronim: built new squid for will as old one had file descriptor limit of 1024 instead of 8192 so it was running out of FDs. In /home/wikipedia/src/squid/squid-2.5.STABLE9.wp20050604.S9plus.no2GB[icpfix,nortt,htcpclr]
- 23:30 brion: fixed salting on user_newpassword for accounts not touched since the change.
- 20:40 mark: Wrote /sbin/ifup-local script on pascal, to handle post-ifup tasks. Currently adds 10.21.0.2/24 IP to eth1 for accessing the LOMs.
- 20:00 mark: Set up permanent source routing on pascal for Kennisnet out of band access using /etc/sysconfig/network-scripts/route-eth1 and rc.local
- 19:05 mark: Rebooted csw2-knams with newer crypto image, setup SSH, changed DNS resolver
- 09:40 kate: created 400GB LV at /sqldata on vandale, ext3. installed mysql. copied ariel's my.cnf over (can someone look at what needs to be changed there?). did not populate any sql data yet.
- 05:50 kate: REMOVED WILDCARD NS RECORD under *.wikimedia.org. this means you will need to add NS records for new wikis in that domain or they won't work.
- 05:48 kate: set up recursing NS on pascal and mayflower; tested pdns slave for wikimedia.org on fuchsia, seems to work (but not authoritative yet).
- 00:05 Tim: moving binlogs 186-189 from ariel to khaldun
- 06:15 brion: clearing user records from memcached. two instances of can't-log-in reported might have been caused by stale cache records re-saving bogus unsalted passwords, but that's sheer speculation.
- 06:00 JeLuF: fixed mail on dalembert and goeje to use smtp.pmtpa.wmnet as smarthost
- 05:45 JeLuF: removed moreri and bart from "apaches" nodegroup
- 19:10 JeLuF: moved binlogs 184 and 185 from ariel to khaldun
- 15:04 Tim: fixed timezone on coronelli
- 14:35 Tim: had a go at fixing ntpd on various servers. It was not installed on coronelli and not running on srv5, fixed both fairly easily. Synchronised configuration files on srv11-30, they're still reporting "synchronization failed" as ntpd starts up, although I was able to synchronise their clocks manually with ntpdate. "ntpdc -p" seems to indicate that they are working properly.
- 5:10 jeluf: Added index, set site to read/write
- 04:10 brion: updated user tables for password hash salting.
- 3:00 jeluf: set farm to read only
- 16:12 Tim: switched profiling from user time to real time
- 13:45 brion: experimentally disabled MakeSpelFix in lucene search results to compare load / response time
- 5:00 jeluf: CREATE INDEX id_title_ns ON cur (cur_id, cur_title, cur_namespace); on all wikiquote, wikinews, wiktionary, wikibooks, dewiki and all wikis with 10'000 to 100'000 articles. To be done tomorrow: enwiki, frwiki, jawiki, wikipedias with <10'000 articles
- 18:02 kate: starting copying khaldun:/usr/etc/images/enwiki/enwiki_upload.tar to srv11:/usr/etc/backup/images/
- 11:55 brion: enwiki image archives and thumbnails have by now been copied to khaldun. all should be right with the world.
- 07:49 brion: increased bacon's share of load, but not quite up to previous levels
- 05:20 jeluf: moved binlogs 175-179 from ariel to khaldun
- 03:20 brion: took khaldun out of apaches group, added to images group. en.wikipedia.org images are moved to khaldun, thumbnails still copying.
- 23:30 brion: working on moving en.wikipedia.org's uploads from albert to khaldun
- 21:18 brion: reduced load on bacon to keep it from lagging
- 11:15 brion: added bugzilla stats collection to cron.daily
- 20:07 kate: started a full image dump on khaldun using modified backup scripts
- 08:00-ongoing jeluf: Migrating enwiki to external storage
- 07:30 jeluf: moved binlogs 170-174 from ariel to khaldun
- 22:59 brion: lucene search is on wikimedia-wide
- 11:57 brion: servmonii seems to be offline; not on irc, and smlogmsg fails when doing syncs
- 11:38 brion: installed simple experimental edit/move rate limiter with fairly conservative settings for now
- 07:35 brion: changed default search namespaces from NS_TEMPLATE_TALK to NS_HELP (whoops!)
- 06:30 jeluf: migrated dawiki
- 05:30 jeluf: migrated concatZippedHistoryBlobs of eowiki,glwiki,bgwiki to external storage cluster srv28/29/30
- 21:50 brion: vincent has been reinstalled with FC3. Running a full Lucene index build for all wikis now...
- 20:00 dmonniaux: on bleuenn/chloe/ennael: disabled DNS through Wikimedia servers through PPP (didn't work, prevented squid from restarting); used Lost-Oasis servers instead (cf /etc/resolv.conf); inserted iptables -I INPUT -j ACCEPT so as to allow DNS etc. in (please remove once you know what you're doing)
- 06:30 jeluf: moved binlogs 166 and 167 from ariel to khaldun
- 04:17 Tim: noticed that webster had stopped replicating 4.5 hours ago. Offloaded it and ran "REPAIR TABLE bugs.bugs" to fix the problem.
- 01:47 kate: albert's eth1 died for unknown reasons, site broke. configured eth0 as a trunk port to keep site operational.
- 00:35 brion: running lucene updates on vincent; out of search rotation during build
- 21:15 jeluf: moved binlogs 163-165 from ariel to khaldun
- 15:16 Tim: started update-special-pages-loop, in a screen on zwinger. Using benet for DB.
- 20:00 jeluf: added "-A" to /etc/sysconfig/ntpd, synched clocks
- 15:30 jeluf: installing MySQL 4.0.24 to srv28-30, srv30 will be master, srv28 and 29 will be slaves
- 08:53 brion: vincent is serving searches from the Mono-based server experimentally
- 06:15 jeluf: moved binlogs 160-162 from ariel to khaldun
- 02:35 brion: page moves back on
- 02:14 brion: temporarily disabled page moves while cleaning up aftermath of a move vandal
- 01:34 brion: running Lucene index updates and tests
- 22:45 brion: fixed another hidden year 2025 entry on eswiki which screwed up recent changes caching
- 20:15 jeluf: restarted slave. That was faster than I expected.
- 20:00 jeluf: stopped slave on benet, doing some dumps.
- 3:00 erik: ran /home/erik/commonsupdate.pl (logged to commonscategoryupdate*.txt) to change category sort keys "Special:Upload" and "Upload" to proper page titles (bug 73); this fixes paging on categories with more than 200 images. Bug 73 is now fixed, so this should not reoccur, but other wikis will have the same problem and can be quickly fixed with this script if necessary.
- 23:10 jeluf: moved binlogs 153-159 from ariel to khaldun:/usr/backup/arielbinlog/
- 3:00 erik: setting up sr.wikinews.org. Not announced yet until language files are fixed.
- 19:00 Chad: put zwinger, holbach and webster on the scs. Took moreri, smellie and anthony off. Tim changed software labels.
- 15:15 midom: killed suse firewall and kernel security stuff. it freaked out all sysadmins, shouldn't be allowed to live :)
- 12:50 brion & many: all hell breaks loose with ldap oddness on albert and dns and... stuff
- 4:30-5:40 Tim & onlookers: DNS failure on zwinger. Took us an hour to fix it instead of 2 minutes, and caused problems site-wide, because we're using non-redundant DNS instead of /etc/hosts. Logins were timing out because commands in the login scripts were waiting for a DNS response. Managed to get root on albert first, and set about modifying resolv.conf on all machines to use albert as well as zwinger. Eventually got root on zwinger, had to kill -9 named. Restarted it, everything is back to normal.
- 22:00 jeluf: Upon GerardM's request, and due to ongoing vandalism on li.wikipedia.org, promoted user "cicero" to sysop on liwiki
- 21:48 jeronim: removed isidore from squids dsh group (and condolences for the eurovision tragedy)
- 21:30 midom: after surviving ddos aimed at my dsl and lithuania's failure in eurovision I finally moved some ariel binlogs to alrazi/khaldun (raid1 :)
- 02:45 kate, brion: fixed ldap/firewall for external servers
- 18:11 Tim: Categorised the servers by interface and vlan at Interfaces. Fixed routing tables on a few hosts that were non-standard for their category.
- 16:35 Tim: removed isidore and vincent from dsh ALL node group, non-standard configuration. Also removed bart and moreri, permanently down
- 14:54 jeluf: flushed firewall on bayle. Back in apache service.
- 14:38 brion: readded vincent in search group
- 19:45 JeLuF: restored lost history on dewiki
- 10:29 Tim: Updated DNS to get closer to this, and hence reality.
- 09:45 Tim: Fixing sources in gmetad.conf fixed it
- 09:28 Tim: Moved ganglia configuration to /h/w/conf/gmond, symlinked config on new apaches, changed cluster name, restarted ganglia. It doesn't seem to have fixed the recording problem.
- 16:50 Tim: Moving ariel binlogs 130-139 to avicenna. Don't ask me where 106-129 went.
- as I already said: khaldun.
- 05:50 brion: added a bunch of nazi spam subjects to wikipedia-l spam filters, hoping to reduce admin load
- .... dammit is moving memcacheds around to work around browne problem ...
- 12:05 brion: installed libtidy-devel and patched tidy PECL extension on srv11-srv30
- 11:59 brion: browne is having some funky problems; can't talk to the srv machines, which is Bad for memcached work
- 10:10 brion: installed updated LanguageEl.php; had to fix permissions on file
- 23:40 brion: disabled catfeed extension for security review
- 23:25 brion: lucene search now up for all en, de, eo, and ru sites. In theory.
- 10:30 brion: running enwiki index update again
- 08:00 brion: vincent back online; eth0 had not initialized properly
- 21:45 brion: ran checksetup.pl on bugzilla to apply stealth database updates which broke login
- 19:05 brion: upgraded bugzilla to 2.18.1
- 18:37 brion: wikibugs back on irc
- 18:13 brion: hacked Image.php to ignore metadata with negative fileExists, and updated wgCacheEpoch to force rerendering. broken images should be mostly fixed now
- 17:57 brion: grants wiki fixed (wrong directory was synced in docroot)
- 16:20 jeronim: bugzilla bot not running, problems with images ("Missing image" on wiki pages) not fixed
- 12:10 -14:00 and beyond: jer/kate/midom/tim: power loss @ florida colo. most servers lost power; albert, ariel, bacon, suda, khaldun, webster, holbach, srv2, srv3, srv4, srv6, srv7 did not
- 06:00 brion: running lucene builds for all remaining en, eo, ru, de wikis
- 22:23 brion: hacked language name for 'no' to 'Norsk (bokmål)' per request.
- 21:00 JeLuF: Test installation of mysql cluster on srv29 and srv30. Management server running on srv0. Installation done according to howto.
- 8:54 Tim: offloaded ariel to correct for load caused by compressOld.php and the pending deletion script
- 08:10 Tim: deleting articles on en marked "pending deletion", see w:User:Pending deletion script
- 19:46 Tim: started compressOld.php, running on a screen on zwinger
- 00:05 brion: corrected year on a fr.wikinews revision from 2025 to 2005. Assumed a very badly set clock yesterday morning -- does anybody know about this? I can find no trace of it now, though there were several complaints about affected articles, other examples of which now show correct years. Did someone correct them? Who, and when?
- midom: System clock wasn't synced to hardware clock before new server crash - servers came up with bad timers. Fixed bad entries in ~15 wikis (wikipedias only), therefore frwn remained..
- 23:59 brion: synched hardware clocks on all apaches to current system time. (some were hours off, a few were in 2003)
- 21:33 brion: resynched clock on srv14 to zwinger with ntpdate; was about a minute off.
- 14:00 midom: restarted all memcacheds
- 13:15 midom: chain reaction of slow image server maxed out fds on memcached, which caused even more image server load. temporary workaround: remove some old apaches from service, so that memcached would function a bit better.
- 12:20 midom: ldap server reached maxfiles. fixed in /etc/sysconfig/openldap && restarted
- 12:00 midom: recovered broken new apaches
- 07:55 brion: disabled curl extension loading in case it makes a difference if/when mysteriously killed machines are raised from dead
- 07:31 midom: srv11-srv30 all died
- 07:30 brion: installed curl PHP module on apaches
- 22:00 chaper, jeluf, midom: srv11-srv30 joined apache service.
- 01:30 brion: removed three invalid image records from commons (from 1.3 era before some name validation fixes)
- 01:00 brion: Somebody (gwicke?) checked out an entire phase3 source tree inside the 1.4 live installation directory. That's a very bad place for it -- it would get replicated to all servers if a full sync is run. I moved it to /tmp.
- 22:09 Tim: discontinued freenode enwiki RC->IRC feed
- 21:45 JeLuF: removed khaldun from dsh group mysqlslaves
- 21:25 JeLuF: fixed replication on holbach, otrs.ticket_history was broken. Holbach back in service.
- 15:00 JeLuF: fixed replication on bacon, otrs.ticket_history was broken. Bacon back in service.
- 11:08 Tim: added CNAME for irc.wikimedia.org, still working its way through the caches. Opened up port 6667 on browne. Switched on RC->UDP for all wikis, the whole thing is now fully operational.
- 11:00 brion: cleared image metadata cache entries for commonswiki due to latin1-wiki bug inserting bogus entries
- 8:40 Tim: installed patched ircd-hybrid on browne
- 11:00 brion: replaced wikimania's favicon with the WMF thang. running some lucene updates in background on vincent
- 02:18 brion: started squid on isidore; had been down for some time. cause unknown.
- (some time earlier) tim: made unspecified changes to squid configuration for another external squid
- 03:00 brion: updated Latin language file; changed namespaces on those wikis.
- 01:05 brion: suda caught up. back in rotation.
- 00:44 brion: restarted replication on suda (bugzilla's votes table had some kind of index error)
- 00:37 brion: took suda out of rotation; replication is broken
- 21:39 brion: starting rc bots on browne. Configuration has changed, they are not using a proxy and must be run from a machine with an external route.
- 16:38 Tim: dumping, dropping and reimporting bgwiktionary.brokenlinks seems to have worked, gradually reapplying load
- 15:55 Tim: Trying standard recovery procedures
- 15:08: Suda crashed due to corrupt InnoDB page
- 22:15 brion: hacked in os interwiki defs for wikipedias (not other wikis, not sure if they're even set up)
- 18:52 Tim: installed RC->UDP->IRC system. The UDP->IRC component is udprec piped into mxircecho.py running in a screen on zwinger. This removes the high system requirements previously needed for RC bots.
- 10:30 Tim: Bots K-lined. Removed enwiki and dewiki to avoid further offence, and left them in a reconnect loop. If someone wants to approach Geert yet again, be my guest.
- 10:20 Tim: moved RC bots to browne, which is mostly idle, has plenty of RAM, and has an external IP address, allowing it to connect to freenode without going through the apparently undocumented and non-working port forwarder on zwinger.
- To find the documentation, enter "forwarder", "forwarding", or "irc" into the search box on the left, and click the Search button. In the notes on the relevant page, the code for the forwarder comes after "code: ".
- 6:45 jeluf: started squid on will, was down.
- 22:25 kate: changed liwiki tz to Europe/Berlin
- 4:40 jeluf: added webster to DB pool again.
- 14:15 midom: after second consecutive webster crash, took it out from rotation, trying forced innodb recovery, planning resync:
050502 14:11:15InnoDB: Assertion failure in thread 1207892320 in file btr0cur.c line 3558
InnoDB: Failing assertion: copied_len == local_len + extern_len
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
InnoDB: about forcing recovery.
- 14:00 midom: webster's mysql crashed with some assertions, did come up later and continued to serve requests after some load management
- 11:00 brion: started squid on srv7, which had been down for unexplained reasons and its IP addresses had not been reassigned
- 07:20 brion: rebuilt foundation-l list archives after removing some personal info by request
- 00:05 brion: changed $wgUploadDirectory settings so they won't break in maintenance scripts. hopefully didn't get them wrong.
- 23:30 brion: cleared image cache for all wikis. bogus entries probably added during links refresh; maintenance scripts have wrong $wgUploadDirectory
- 23:00 brion: cleared image cache entries in memcached for commonswiki due to spurious entries marked as not existing.
- 04:25 Tim: Setting up for perlbal throughput test on tingxi
- 22:18 Tim: resumed refreshLinks.php
- 15:57 Tim: stopped refreshLinks.php at the end of enwiki, before the delete queries
- 15:28 Tim: Restarted avicenna, which caused the site to crash due to a large number of threads waiting for Lucene
<TimStarling> what is avicenna's role?
<dammit> was: search server
<dammit> dunno now
<TimStarling> avicenna is reporting 20% user CPU usage
<dammit> every host that runs lucene
<TimStarling> but nothing is showing up in top
<dammit> has broken top output
<dammit> and broken ps output
<TimStarling> nothing important shows up in netstat, I'll just reboot it
<TimStarling> ok?
<dammit> 'k
*site explodes*
- 09:10 brion: took vincent out of lucene search rotation while it's building; changed default_socket_timeout in php.ini to 3 seconds from 60
- 04:00 brion: started incremental index update for lucene search indexes
- 03:38 Tim: resumed refreshLinks.php after having stopped it for a while during peak period
- 05:12 Tim: Shutting down apache on srv1 to dedicate it to refreshLinks.php
- 02:10 brion: set up logrotate on isidore to rotate squid log, in hourly cron
- 01:40 brion: manually rotated squid log on isidore due to reaching 2gb, restarted squid.
- 07:30 brion: installed patched Tidy extension on apaches to fix binary-safe string bug.
- 06:00ish brion: copied updated lucene indexes to avicenna and maurus, put vincent back in search rotation
- 05:40-05:55: Severe external network problems
- 05:25 Tim: deleted obsolete binlogs, moved the remainder (77-87) from ariel to avicenna. 33 GB of disk space remaining on ariel.
- (yesterday) jeronim: installed python 2.4.1 from source on alrazi, using make altinstall instead of make install, so that the current python 2.3 installation is not interfered with -- the 2.4.1 binary is at /usr/local/bin/python2.4
- 23:30 jeronim: clocks were wrong on 5 machines; fixed 4 of them (installed ntpdate on vincent). isidore still needs to be done (dammit? :)
- 07:55 brion: started a second active search daemon on maurus (vincent is still rebuilding indexes)
- 05:00 jeluf: enabled LuceneSearch.
- 01:20 brion: had to restart srv7 squid again. moved logrotate from cron.daily to cron.hourly, where it should have been before but wasn't
- 21:30 jeluf: disabled LuceneSearch. All apache processes were in state LuceneSearch::newFromQuery
- 11:15 jeluf: set wgCountCategorizedImagesAsUsed for commons.
- 02:55 brion: manually rotate squid log on srv7 again when it reached 2gb and crashed. logrotate needs to be fixed...
- 02:15 brion: installed GCC 4.0 final on vincent, avicenna for GCJ. Taking vincent out of search rotation for index rebuild.
- 13:15 Tim: recaching special pages, with wget script running in a screen on zwinger, which requests recache pages from bayle, which sends the expensive queries to benet.
- 02:25 brion: manually rotated logs and restarted squid on srv7. had been down for 2.5 hours, but nobody noticed the alarm from servmon.
- 10:20 brion: as a temporary hack, bumped rc_namespace on metawiki from tinyint to int. somebody added a russian help namespace at 128/129 which is outside of the signed tinyint range, so pages were recorded with the wrong namespace.
- 01:30 brion: removed 'wrap' option from tidy.conf to work around weird corruption problem (may be bug in tidy; investigating)
- 18:00 midom: started backup run on benet
- 11:25 brion: tidy extension installed on apaches, now active. To go back to external, set $wgTidyInternal = false; or remove extension=tidy.so from php.ini and restart apaches
- 10:50 brion: added node groups fc3, freebsd, debian
- 10:06 brion: removed isidore and vincent from fc2-i386 node group, as they're running FreeBSD and Debian
- 10:00 brion: working on installing tidy extension for php...
- 03:00 brion: re-enabled search
- 16:50 Tim: Pope-related flash crowd, peaking at 2100/s. Apaches were hard hit by searches (about 50% of profile time) so I disabled them temporarily.
- 16:00 Tim: we were getting reports of gzuncompress errors in memcached-client.php, on every page view on en. I put in an error suppression operator and instead logged all such errors to /home/wikipedia/logs/memcached_errors, to determine which server was the problem. It turned out to be not a server but a key, enwiki:messages to be precise. Deleting it and letting it reload fixed the problem.
- 07:30 midom: sad notice, smellie down, memory or other hardware troubles, lots of segmentation faults and other signals before reboot, didn't come up after.
- 09:00 midom: fixed broken webster replication, caused by table "bugs" in database "bugs"
- 06:45 brion: fixed symlinked php.ini on srv2, srv3
- 00:00 midom: reformatted suda data area from xfs to ext2, brought into MySQL service for enwiki only
- 03:20 brion: eowiki lucene search live! others building...
- 02:45 brion: started lucene index builds for eowiki, ruwiki, dewiki
- 02:15 brion: lucene search live for meta
- 01:45 brion: restarted meta search build, as it was pulling from wrong db. whoops!
- 23:51 brion: noticed some spam coming in on bugzilla. hacked rel="nofollow" into comment processing, removed the comment, and disabled the account used to post it.
- 22:40 brion: starting lucene index builds for metawiki and some other wikipedias
- 00:08 brion: removed Apache-Midnight-Job from avicenna crontab
- 23:50 brion: vincent and avicenna are sharing LuceneSearch burden.
- 20:00 brion: Chad fixed vincent, which is now running lucene. Isidore lucene stopped, it's going to be squid soon. Will take over an apache for additional search capacity.
- 13:30 brion: lucene search turned on for en with slightly old index file, daemon running on isidore
- 10:30 brion: gcj on isidore seems horked; index rebuild is much too slow (eta 18 hours) so stopped it. uploading an index from home, and building mono for further testing.
- 10:00 midom: holbach restored.
- 08:55 holbach seems to be deadish
- 08:50 brion: started lucene index build on isidore
- 05:50 brion: vincent doesn't seem to be coming up again, will need to be kicked.
- 05:20 brion: upgrading vincent to 2.6 kernel hoping to resolve threading/memory issues w/ MWDaemon
- 02:10 brion: rebooting srv6 due to zombie squid eating port 80
- 23:05 kate: experimenting with making an en.wp image dump using trickle (cvs: /tools/trickle/)
- 08:00 midom: broken replication (by Chinese scammer) on bacon, fixed by "use otrs; repair table article" - myisam tables are evil, aren't they?
- ~23:00: kate: upgraded squid to STABLE9+patches (see squid builds) + restarted all squids.
- mark: All squids are running with too few FDs (1024), and if no one replaces all daemons by the new one Kate just built, we may have a problem tomorrow during peak hours...
- 19:15 midom: srv7 is now in squid service
- 19:07 brion: MWDaemon's memory usage got high enough it started swapping. Hung connections ate up apaches and hung the site until it was restarted.
- 5:30 brion: lucene search server active for en.wikipedia.org, running on vincent.
- 15:45 midom: dropped thttpd (as it was using 32bit mmaps) on dumps in favor of lighttpd. It has superb performance, serves 3500 hits/s under ab and served 70MB/s from benet in small reqs... Extreme recommendations for using lighttpd for image uploads.
- 10:15 brion: running lucene search indexer on vincent (pulling enwiki from benet).
- 05:25 brion: added additional is rcbots to #is.wikipedia for tionary/books/quote
- 16:00 midom: redirected http://download.wikimedia.org/ to benet, misses tomeraider and uploads...
- 13:00 Tim: switched to Mark's squid binary on the French squids
- Mark, Tim: implemented Multicast HTCP purging on all FL apaches/squids. French Squids still need a binary replacement.
- 21:44 mark: Put port gi0/26 on csw1-pmtpa into trunking mode: vlans 1-2 only, with vlan 2 being the native vlan, no LACP negotiation
- 11:30 midom: benet put into dump operation
- 10:55 brion: reinstalled PHP on zwinger and apaches, compiled with memory limit and mbstring options enabled. This was left out when upgrading to 4.3.11.
- 2:40 brion: added NetCabo proxies to trusted proxy list (inconveniently shared by Jorge and a Nazi vandal on pt.wikipedia.org)
- 15:30 jeluf: disabled logging of upload.wikimedia.org
- 15:15 midom: yet another image server overload. rotated 30G upload.wikimedia logfile, could be fragmentation overhead.
- 12:00 midom: moved log_bin.0? (40G worth of binlogs) from ariel to khaldun/avicenna backup/arielbinlog, reclaimed some master disk space.
- Do we need those binlogs for anything?
- Yes, we need binlogs back to the last full backup -- TS
- 07:48 Tim: Started memcached on browne, it was in the list but not running. Fixed startup scripts. Noticed that browne can't contact albert on 10/8, modified yum.conf accordingly.
- 18:25 midom: extended public IP address range (now: 12 addresses)
- 17:50 midom: srv5 joined service as squid.
- 22:30 midom: Enabled recentchanges-based watchlist hack. Servers go faaaast.
- 23:15 brion: set default block expiry to 1h on dewiki by request of various admins
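
Note on the 10:20 rc_namespace entry above: a signed tinyint holds -128 to 127, so namespace numbers 128 and 129 cannot be stored as-is. The Python sketch below only illustrates that range check. The helper names are hypothetical, and the clipping shown is an assumption about how a non-strict MySQL server treats out-of-range values; the log entry says only that the wrong namespace was recorded, not which value ended up in the column.

```python
# Range of a signed 8-bit integer column (MySQL TINYINT without UNSIGNED).
TINYINT_MIN = -128
TINYINT_MAX = 127


def fits_signed_tinyint(value: int) -> bool:
    """Return True if value can be stored in a signed tinyint unchanged."""
    return TINYINT_MIN <= value <= TINYINT_MAX


def clip_to_signed_tinyint(value: int) -> int:
    """Clip value to the nearest end of the signed tinyint range.

    Assumption: this mimics a non-strict server clipping out-of-range
    inserts; the log does not confirm the exact stored value.
    """
    return max(TINYINT_MIN, min(TINYINT_MAX, value))


if __name__ == "__main__":
    # 128 and 129 are the namespace numbers mentioned in the log entry;
    # the others are arbitrary in-range examples.
    for ns in (0, 14, 127, 128, 129):
        print(ns, fits_signed_tinyint(ns), clip_to_signed_tinyint(ns))
```

Under that assumption, both 128 and 129 would collapse to the same stored value, which is consistent with widening the column to int as a temporary hack.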