Server Admin Log/Archive 5

From Wikitech
Revision as of 10:08, 20 September 2005 by imported>Brion (sync-apache)

September 20

  • 10:08 brion: Added sync-apache script to rsync the apache config files from zwinger to pmtpa apaches. Don't forget to use it after making changes and before restarting apaches!
  • 09:30 brion: moving apache configs a) into /h/w/conf/httpd subdir, and b) into local copies on each server which will be rsync'd
  • 08:05 brion: new apache configs on all
  • 07:23 brion: fixed up apache configs on *.wikimedia.org
  • 07:00 jeronim: added acpi=off panic=5 to adler's kernel params and rebooted, because apparently there are some ACPI problems, and so that it reboots on kernel panic instead of freezing
  • 06:53 brion: cleaning up apache config files; replacing ampescape rewrite usage with aliases to remove our patch dependency (tested on wikimediafoundation.org)
  • 06:40 jeronim: installed same kernel on adler as is on samuel and set it as default; also samuel's default kernel was changed to a newer one (by yum?) in Template:Filename, so changed it back to match the current kernel
  • 05:30 brion: put suda back in rotation; toned down its share of enwiki hits a bit
  • 05:02 brion: adler crashed again at some point
  • 02:36 brion: adler was rebooted by colo; running innodb recovery
  • 01:58 brion: adler is down, seems to have crashed (panic bits on scs output). taking out of rotation too
  • 01:45 brion: lots of delays trying to open suda from wiki; taking out of db rotation
  • 01:11 brion: halted backup; benet ran out of space. en_text_table.gz is much larger than expected (49gb), perhaps external storage has not been used correctly as expected? will remove file and continue.
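
The sync-apache workflow from the 09:30 and 10:08 entries (a master copy of the Apache configs on zwinger, rsync'd to a local copy on each apache) could look roughly like the sketch below. This is an illustration only, not the actual script: the host names, both directory paths, the rsync flags, and the function name are assumptions. Only the use of rsync comes from the log.

```python
import shlex

# Hypothetical values: the log only says the configs live in a conf/httpd
# subdirectory and are rsync'd from zwinger to a local copy on each apache.
SOURCE_DIR = "/example/conf/httpd/"   # assumed master copy location
DEST_DIR = "/example/local/httpd/"    # assumed local copy on each server


def build_sync_commands(hosts, source_dir=SOURCE_DIR, dest_dir=DEST_DIR):
    """Return one rsync command (as an argument list) per target host."""
    commands = []
    for host in hosts:
        commands.append([
            "rsync",
            "-a",        # archive mode: preserve permissions, times, etc.
            "--delete",  # remove files that no longer exist on the master
            source_dir,
            f"{host}:{dest_dir}",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    for cmd in build_sync_commands(["apache-a", "apache-b"]):
        print(shlex.join(cmd))
```

Building the commands separately from running them makes it easy to review what would be pushed before any apache is restarted.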

September 19

  • 22:10 ævar: uninstalled nogomatch on enwiki, who's going to sort through all that gibberish data? Not me!
  • 21:07 brion: rebooting new8 machines to make sure they're running current kernel
  • 21:02 brion: new8 group status: srv47 online but borked; 31-35 and 49 offline. others to be set up as apaches
  • 20:46 brion: running special pages update on frwiki by request... will update others on cronjob if there's not already one?
  • 19:40 mark: Replaced udpmcast.py by a properly daemonized version. Set it up at knams to forward to a multicast group instead of all unicast IPs forwarded by larousse...
  • 18:45 mark: Removed miss_access line from knams squids to solve the cache peer errors. Repeat at yaseo if it works...
  • 13:49 ævar: Installed the nogomatch extension experimentally on enwiki.
  • 08:00 Tim: Removed all NFS mounts from srv1's fstab. Set up a simple /home directory on its local hard drive.
  • 06:06 kate: reverted root prompt on zwinger so it's not invisible on a white background
  • 04:47 James: stop slave on bacon while dumper is running. Slave will restart when done.
  • 02:45 Tim: changed root prompt on zwinger. Started sync-to-seoul, with -u option this time so we don't accidentally overwrite stuff
  • 01:50 brion: seems to be mostly back up at this point. boot seemed to be aided by disabling named and letting it lookup from albert
  • 01:36 brion: zwinger boot still going on; nfs init is *very* slow doing the exportfs -r; seems to be slow dns lookups
  • 00:38 brion: jeronim did this: [root@zwinger srv38]# reboot - unfortunately it was not srv38, but zwinger.
  • 00:05 brion: mounted /home on srv1; couldn't login, caused sync-file failures
  • 00:05 brion: enabled Nuke extension on meta & mediawiki.org

September 18

  • 14:00 jeronim: rebooted zwinger by mistake and it needed a manual reset by colo staff to come back up. Site was offline for about an hour.
  • 04:34 brion: vandale kernel panic, frozen
  • 04:30 Solar: srv36-srv50 are racked, have IPs, and are ready for production
  • 03:10 Tim: moved compressOld.php to dalembert (where dumpHTML.php has been running), on complaints that it was causing problems on zwinger.

September 17

  • 22:17 brion: running unique-ip counter on fuchsia with saved logs (into uniqueip table on vandale)
  • 22:02 brion: disabled disused info-de-l list by request of list admins
  • 11:05 brion: ran initStats on all wikisources to initialise those not already set
  • 07:06 brion: canceled upload dump for commons backup due to size and slowness; too big to fit
  • 06:30 jeronim: on larousse, removed fedora netcat and installed from source into /usr/local
  • 04:30 Tim: used ntpdate -u pool.ntp.org to set the times on all the yaseo machines, some were a long way out. Then set all their timezones to UTC. This apparently caused ganglia to think yf1000 and yf1002 were down, fixed by restarting the local gmond.
  • 04:10 Tim: Started replication on henbane
  • 01:10 brion: enabled wikidiff on all wikis. (can be disabled selectively w/ wgUseExternalDiffEngine in InitialiseSettings)
  • Tim: Set up mysql on henbane, made a consistent dump of kowiki and commonswiki using bacon, copied dump to henbane ready to start replication

September 16

  • 22:20 Tim: started mysqld on srv26, it had been off for 12 hours or so. The compression script had been running all that time, srv26 caught up to the master without incident.
  • Colo (Solar):
    • supposedly bart is brought back up
    • borrowed HP switch connected to gi0/4 on the cisco
    • moreri was moved, and is trying to netboot (fails)
    • 10 of the 20 new servers have been racked and wired to the borrowed HP switch, but don't have IPs yet
  • 11:37 brion: updating sitenames on he, el, ru wikisource
  • 11:30 brion: started backup run
  • 03:17 brion: frwiki reimport done
  • 02:47 brion: frwiki reimport started
  • 02:35 brion: jawiki reimport done
  • 01:49 brion: started jawiki reimport
  • 01:33 brion: bacon catching up; suda is fine as it is partial mirror
  • 01:29 brion: took bacon, suda out of rotation for further investigation
  • 01:23 brion: nlwiki open for editing
  • 01:03 brion: reimporting nlwiki on samuel
  • 00:41 brion: nl/fr/ja dumps done (in /var/backup/private/recovery). going to try reimporting soon
  • 00:16 brion: running attachLatest on *wikisource

September 15

  • 23:14 brion: 3 dumps from adler done; doing extra backups from samuel too. setting adler to read-only
  • 22:37 dumping nlwiki, frwiki, jawiki databases from adler onto sql files on benet
  • 22:18 put load back on samuel for enwiki with adler disabled. fr, nl, ja wikipedias are locked while we work this out
  • 22:09 commented out adler from db.php; adler appears to be misconfigured and all kinds of breakage is going on. it's not read-only, and has some revisions that others don't have
  • 21:56 brion: took load off bacon (was 100 load on fr, nl, ja; nl and fr reporting weird editing problems possibly freak lag problems, and it was consistently lagging a few seconds at least)
  • 17:25 mark: Setup IPsec between bacon and vandale. Who wants to setup replication?
  • 16:50 mark: Altered geodns: pointed Malaysia at yaseo, and Israel, Turkey, Cyprus at knams
  • 13:04 Tim: Shutting down apache on dalembert temporarily so that I can use it for HTML dump testing and generation
  • 12:35 Tim: Restarted compressOld.php, it stopped when I shut down bacon to do the copy to adler.
  • 11:30 mark: Restarted some knams squids to increase FDs, changed /etc/rc.local startup script
  • 11:15 mark: Deployed squid on yf1003 and yf1004, and added them to the DNS pool
  • 11:10 mark: Recompiled squid on yaseo to increase filedescriptors to 8192 and restarted all squids with 4096
  • 07:37 brion: running importDumpFixPages.php on wikisources to fix bogus rev_page items
  • 02:30 kate: ariel's down
  • 02:29 brion: recompiling mono 1.1.9 on benet for xml bugfix
  • 00:15 brion: removed humboldt and hypatia from mediawiki-installation node group, neither has port 80 on:
    • humboldt prompts for password, not configured correctly?
    • hypatia shows host key changed; was reinstalled?
  • 00:10 brion: disabled MWSearchUpdater plugin as the daemon is broken; briefly broke the wiki due to bad include_path; need to fix config for MWBlockerHook to make sure the path is right even w/o the lucene include
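
The 11:10 entry distinguishes between the file descriptor ceiling compiled into squid (8192) and the limit the squids were actually started with (4096). A quick way to see what limit a process really gets is to read it back. The sketch below does that with Python's standard `resource` module (Unix only). The function name and the optional `soft_limit` parameter are hypothetical, and this is not part of the squid setup itself.

```python
import resource


def fd_limit_ok(required, soft_limit=None):
    """Return True if the soft open-file limit allows `required` descriptors.

    If `soft_limit` is not given, the current process's limit is read
    from the operating system.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured at all
    return soft_limit >= required


if __name__ == "__main__":
    print("enough descriptors for 4096:", fd_limit_ok(4096))
```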

September 14

  • 21:30 mark: Setup log rotation at yaseo to knams, routed japanese and chinese clients to yaseo squids.
  • 20:30 midom: adler online, bacon catching up
  • 20:15 mark: Deployed squid on yf1001, and routed Korean clients to the Florida squid cluster.
  • 18:15 mark: Deployed squid on yf1000.
  • 18:10 mark: Wrote a YASEO squid deploy script /home/wikipedia/deployment/yaseo-squid/prepare-host (yahoo cluster only, should I put it at florida?) after Tim's apache prepare-host script
  • 17:48 ævar: de-opped myself on ruwiki and stopped my revert bot, the russians hate me even more now.
  • 16:30 mark: Set up a squid on yf1001. Same setup as knams, except it's in /usr/local/squid as in florida. Adapted florida's squid and mediawiki configs accordingly.
  • 13:19 ævar: ran INSERT INTO user_groups VALUES (1165, "sysop"); on ruwiki to make myself temp. sysop to fix the MediaWiki: fsckup.
  • 11:15 brion: halted nlwiki partial temp backup as enough was run to test problem
    • (identified problem as [1])
  • 10:41 brion: running another nlwiki backup to get raw dumpBackup.php output for testing
  • 10:39 brion: halted old backup sequence (at nlwiki, with a mystery breakage in output that needs examining)
  • 10:33 brion: hacking dumpBackup.php to load php_utfnormal.so extension (not yet enabled sitewide)
  • 10:05 brion: running kowikisource and zhwikisource imports on formerly broken parts
  • 08:55 brion: updated messages on jawikisource
  • 08:30ish brion: updated messages on *wikisource
  • 01:30 jeronim: access to yaseo console server should be back hopefully within a few hours - eam is dealing with it

September 14

  • 13:32 Tim: Shut down mysql on bacon, started copying data directory to adler

September 13

  • 23:23 brion: set logo on dewikiquote to commons version
  • 23:ish brion: installing mono 1.1.9 with xml patch on benet to fix future dumps ([2])
  • 17:23 ævar: Logging Exif debug information to /home/wikipedia/logs/exif.log using wgDebugLogGroups.
  • 16:40 jeronim: yf1000 - yf1004 are all set up with reiserfs now. The only yaseo machine not working is yf1013 which is in an unknown state as the console server (konsoler04.krs.yahoo.com (10.11.1.186)) is unreachable.
  • 16:18 Tim: Started moving some text to cluster2, starting with frwiki.

September 12

  • 11:59 brion: killed search update daemon; going to replace this (again) with a more robust queuing system
  • 15:00 or so kate: upgraded perlbal to 1.37
  • 13:24 jeronim/kyle: lots of machines connected to SCS, port labels corrected. The APC has apparently vanished - Kyle couldn't find it.
  • 09:40 brion: installed ICU 3.4 on zwinger and mediawiki-installation from RPMs built from the ICU-provided spec file. Source and binary rpms in /home/wikipedia/src/icu
  • 09:34 brion: fixed misnamed krwikisource -> kowikisource db
  • 8:50 Tim: rebuilt interwiki tables
  • 02:15 brion: replaced old php.ini on zwinger with symlink to the common one. added /usr/local/lib/php back into the default include_path (for PEAR stuff sometimes used)
  • 01:04 brion: blocked leech enciclopedia.ipg.com.br

September 11

  • 22:05 brion: trying batch clears in parallel overloaded zwinger; canceled, running in serial again
  • 21:35 brion: running batch operation to remove bad cached messages
  • 21:00 brion: reconfigured blocker daemon to log to samuel. had to set up permission grant again on samuel
  • 18:19 Tim: finally managed to fix the message problem, except for some erroneous values stored in cache
  • ~18:00 ævar: To get interwiki links working on hrwikisource: sourced the output of maintenance/rebuildInterwiki.php and sourced maintenance/interwiki.sql on all wikis. Some interwiki prefixes appear to have been lost in the process, e.g. bugzilla: (only mediazilla: exists in interwiki.sql). Looks like we need better interwiki update scripts...
    • Don't run interwiki.sql, under any circumstances. Add new prefixes to m:Interwiki map. -- Tim 08:52, 12 Sep 2005 (UTC)
  • 16:05 Tim: switched master to samuel. Adler asks for root pw after reboot due to failed fsck.
  • 15:10 Adler crashed. Tim and JeLuF on the scene, wiki switched to read-only mode
  • 14:59 Tim: Non-default language message caching completely f****d up. Blank messages everywhere
  • 07:10 brion: now using blocker list
  • 07:00 brion: installed limited librsvg on apache cluster, svg back on
  • 15:40 Tim: Installed apache, php, turck and mediawiki on yf1005. Put all required commands in /home/wikipedia/deployment/yaseo-apache/prepare-host. Still needs database, memcached and mediawiki configuration.
  • 05:05 brion: restarted MWUpdateDaemon, hung again at 1gb used memory
  • 02:38 brion: disabled svg for further security work
  • 01:20 brion: reconfiguring wikisource to allow en.wikisource.org to work (hr ja kr sv zh en now imported)
  • 01:09 brion: installed librsvg 2.11.1 on the apaches; it's in /usr/local. (old librsvg versions seemed to muck up text pretty bad)

September 10

  • 22:49 brion: importing wikisource nl ro ru
  • 22:34 ævar: deinstalled the wgDebugLogFile on commonswiki, got enough debug output to see if anything was wrong.
  • --:-- jeronim: yaseo stuff:
    • reinstalled FC4 on yf1000, yf1001, yf1003, yf1004 with reiserfs
    • reinstalled FC4 on dryas & henbane with 10GB ext3 root partition and the bulk of the disk as jfs on /a
    • rsyncing /home, /tftpboot, /root, /var/www, /usr/local, and /etc from amaryllis to dryas in preparation for reinstalling amaryllis with reiserfs. It's a script, /root/amaryllis-rsync.sh, running in a screen on dryas.
  • 14:14 ævar: installed a wgDebugLogFile for commonswiki in /home/wikipedia/logs/commonswiki.log to monitor Exif debug output.
  • 13:26 ævar: ran maintenance/deleteImageMemcached.php on all wikis fixing bug 3410
  • 10:44 brion: cleaning out old mysql data from benet to free up space for current backups (40 days+ out of date, not too useful)
  • 10:00 brion: restored working frame-breakout code (pending cached wikibits.js)
  • 07:58 Tim: moved some ancient rubbish from /home/wikipedia/htdocs to /var/backup/home/wikipedia/htdocs
  • 07:10 brion: running data split for additional wikisource languages
  • 02:40 Tim: Changed names of Seoul machines
  • 02:15 brion: set edit rate limit for new accounts to same as ip rate limit
  • 01:40 brion: installed rsvg (librsvg2) on mediawiki-installation machines, enabled SVG uploads

September 9

  • 06:30 brion: restarted stalled de,en dumps

September 8

  • 19:18 brion: checker daemon running
  • 10:50 brion: setting up vandal checker daemon on larousse
  • 10:42 hashar: enabled subpages for portal (100) and portal discussion (101) on dewiki.
  • 7:45 hashar: added two namespaces for frwiki : 100=>Portail, 101=>Discussion_Portail .

September 7

  • 22:00 jeronim: fixed avar's login problem on servers in the mediawiki-installation group -
    • nscd -i passwd did not work
    • /etc/init.d/nscd restart ; /etc/init.d/sshd restart did solve the problem on each machine except for benet; for benet, problem was finally solved after doing the restarts twice more, then nscd -i passwd, then doing the 2 restarts with a pause in the middle
  • 21:30 jeronim: killed everyone's ssh sessions and sshd on zwinger (sorry)
  • 10:25 midom: After Tim put the memcached patch live, the site's sessions were switched from NFS to memcached.
  • 06:54 brion: killed stalled backup -- memcached send hang for the last day or so. It's continuing w/ dkwiki; will rerun stalled dewiki and enwiki

September 6

  • 19:55 brion: tgwiktionary to lowercase
  • 05:30 brion: set up experimental upload verification hook
  • 04:02 koko: removed firewall

September 5

  • 12:40 brion: set up to shut down search builder daemon every hour (at 47 minutes) to protect against memory leaks in builder; search-update-daemon wrapper script set to auto-restart 5 seconds after shutdown/crash of the daemon
  • 09:05 brion: rebuildMessages.php --update on all wikis to add various new messages
  • 06:09 brion: starting mass lucene updates of pages edited in august
  • 05:18 brion: lucene back-deletions done, reoptimizing build index
  • 01:10 brion: search updater up; running queued deletions
  • 00:45 brion: vincent back in active search rotation
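
The wrapper behaviour described in the 12:40 entry (bring the daemon back up 5 seconds after it shuts down or crashes) can be sketched roughly as below. This is a minimal sketch, not the real search-update-daemon wrapper: the `supervise` function, its parameters, and the `max_runs` cap are hypothetical, and the cap exists only so the loop can be exercised without running forever.

```python
import subprocess
import time


def supervise(command, restart_delay=5.0, max_runs=None,
              run=subprocess.call, sleep=time.sleep):
    """Run `command` repeatedly, pausing `restart_delay` seconds between runs.

    `max_runs` limits how many times the command is started (None means
    loop forever). `run` and `sleep` are injectable so the loop can be
    tried without starting real processes. Returns the list of exit codes.
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        runs += 1
        # Blocks until the daemon exits, whether cleanly or by crashing.
        exit_codes.append(run(command))
        if max_runs is not None and runs >= max_runs:
            break
        sleep(restart_delay)  # wait before bringing the daemon back up
    return exit_codes
```

With a wrapper like this in place, the hourly shutdown only has to stop the daemon; the restart follows on its own after the delay.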

September 4

  • 23:55 brion: splitting lucene config to lucene.php. putting coronelli on search, with optimized index
  • 19:30 jeronim: created helpdesk-l
  • 17:20 jeronim: fuchsia does not boot on the latest kernel (see below), but it does boot on the 2.6.11-1.33_FC3smp kernel, so switched it to boot that kernel by default
  • 16:27 mark: Because of cascading incidents in knams, we moved all traffic to florida and lopar via DNS.
  • 14:30 jeronim: fuchsia was dead or very close, so power-cycled it using the IPMI. It is broken:
Copyright (c) 1999-2004 LSI Logic Corporation
insmod: error inKernel panic - not syncing: Attempted to kill init!
serting '/lib/mpACPI: PCI Interrupt 0000:02:04.0[A] -> GSI 27 (level, low) -> IRQ 177
tmscsih.ko': -1 ptbase: Initiating ioc0 bringup
niknown symbol ioc0: 53C1030: Capabilities={Initiator}
 
 module
Call Trace:/sbin/udevstart <ffffffff80138164>{panic+196}e xited abnormaly!
Creating roo<ffffffff8034f811>{__down_read+49}t
                                                device
 dev: label /1 n      t found
Mountin<ffffffff80207ef1>{__up_read+33}g root filesyste m
mount: error <ffffffff8013ae53>{do_exit+99}2
                                             mounting ext2
       <mount: error 2ffffffff80207db1>{__up_write+49}mounting none
S witching to new <ffffffff8013ba8f>{do_group_exit+239}r
                                                        oot
 : mount failed:      22
umount /init<ffffffff8010eaa6>{system_call+126}r d/dev failed: 
  • 13:16 Tim: made /home/wikipedia/lib/install.sh ignore x86_64 machines, added a part to clean up rubbish left in /usr/lib, then ran it everywhere with dsh -a -f
  • 04:20 Tim: reinstalling PHP 4.4.0 with exif support. Using php-upgrade-440, which calls the new script /home/wikipedia/lib/install.sh to set up shared libraries in /usr/local/lib.

September 3

  • 18:40 jeronim: removed body of mailman archive messages here and here on yannf's request
  • 06:40 brion: relaunch updated backup script with some of the broken bits fixed.
  • 04:50 Tim: Finished benchmarking PHP 4.4.0, see GCC benchmarking. Now deploying the new binaries, from source tree /home/wikipedia/src/php/php-4.4.0-gcc4
  • sometime brion: added .log to text/plain on benet's lighty

September 2

  • 12:00 brion: ran backup test on aawiki using the new dump splitter and partial new backup script. (script is in ~brion/run-backup.sh if anyone wants to examine it)
  • 07:19 Tim: compiling GCC 4.0.1 on zwinger. It will be installed with a program suffix, so gcc is still the old compiler, and gcc-4.0.1 is the new one. Source directory is /home/wikipedia/src/gcc/gcc-4.0.1, build directory is /home/wikipedia/src/gcc/gcc-4.0.1-build.
  • 06:21 Tim: removing hypatia from perlbal nodelist for an hour or so, for some benchmarking

September 1

  • 07:45 brion: set sitename/meta namespace on mtwiki
  • 07:00 brion: running cleanupTitles.php to rename broken pages. Will be at Special:Prefixindex/Broken/ at each wiki.

August 30

  • 17:30 jeronim: made a robots.txt on larousse (noc/kohl) to disallow some dynamic pages and a few others
  • 16:40 jeronim: created wikimediapl-l

August 29

  • 21:30 brion: blocked wissens-schatz.de for remote loading
  • 17:30 jeluf: anonymized a name in the archive of wikide-l
  • 11:30 brion: running a batch job checking for invalid titles on various wikis (cleanupTitles). shouldn't interfere with anything, making no changes.

August 28

  • 22:15 brion: locking plwiktionary for capitalization change
  • 15:18 hashar: created wikimk-l mailing list.
  • 15:15 mark: Brought mayflower back up. Repaired the filesystems, and rebooted it. It was reporting lines like
Aug 28 04:22:34 mayflower kernel: swap_free: Bad swap file entry 7800007ffffff00f
  • 14:30 mark: Another Kennisnet V-20 went down, this time it was mayflower dying sometime this morning. Depooled it... As it's not critical and we still have SP access, I will have a look at it first.

August 27

  • 00:45 brion: turned on wegge's experimental watchlist bot thingy on dawiki

August 26

  • sometime: lots of data imported on wikisources

August 25

  • 16:02 jeronim: added fc-mirror.wikimedia.org DNS entry for fedora mirror
    • fc-mirror 1H IN CNAME albert
  • 15:40 hashar: created wikials-l mailing list. TODO: delete /h/w/htdocs/mail/.index.html.sw(o|p) (swap files by fire).
  • 19:00 mark: PowerDNS on pascal appeared corrupted. Most probably because of an overlapping zones problem in bindbackend (not bindbackend2). I integrated rev.wikimedia.org into the wikimedia.org zone to avoid that.
  • 16:09 hashar: blacklisted www . izynews . com on florida squids (using acl badbadip src 62.75.174.182/32). Need to be done on kennisnet and paris cluster too.
  • 11:00 brion: set up https on kohl. (old ssl key files backed up; wasn't using the established password, nobody knew what it might have been)
  • 07:05 brion: rebuilt interwiki tables; using correct interwikis for the new wikisources.
  • 06:51 brion: added sr.wikisource.org
  • 02:02 hashar: updated LanguagePt.php in HEAD from meta. Watch out when synchronising.
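
The `/32` in the ACL from the 16:09 entry (`acl badbadip src 62.75.174.182/32`) means the rule covers exactly one source address. The sketch below illustrates that with Python's standard `ipaddress` module. The helper name is hypothetical, and this only demonstrates the address arithmetic, not how Squid evaluates ACLs internally.

```python
import ipaddress


def matches_acl(client_ip, acl_network):
    """Return True if `client_ip` falls inside the network `acl_network`."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(acl_network)


if __name__ == "__main__":
    network = "62.75.174.182/32"
    # A /32 network contains a single address.
    print(ipaddress.ip_network(network).num_addresses)
    print(matches_acl("62.75.174.182", network))  # the blocked address
    print(matches_acl("62.75.174.183", network))  # its neighbour is unaffected
```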

August 24

  • 14:04 hashar: disabled lucene search. The daemon runs on maurus but times out / doesn't give any output.
  • 04:00 Jamesday: started nice bzip2 for slow query log and first 72 binary logs on adler to free 40GB of disk. Can archive them on another box later.
    • use avicenna for binlog archives -- Tim 05:53, 25 Aug 2005 (UTC)
  • 00:43 brion: trying out an older version of MWDaemon on vincent to see if memory leak is a new code problem

August 23

August 22

  • 22:12 brion: upped max post size to 75mb on squids; were problems posting large videos to commons (or something)
  • 21:50 brion: renamed presswiki to internalwiki

August 21

  • 22:53 brion: bugzilla up; removed ssl-ticket.wikimedia.org from pascal's apache conf.d dir
  • 22:48 brion: bugzilla.wikimedia.org appears to be offline.
  • 13:30 Tim: reduced lucene load on vincent to 1/4, maybe that will stop it from locking up (which it did again)
  • 13:00 Tim: restarted lucene on vincent, it was closing connections as soon as they were established
  • 06:27 brion: otrs now accessible again on https://ticket.wikimedia.org/ ; now with redirect for the index page! For reference: Apache is in /usr/local/otrs
  • 06:00 brion: trying to start otrs on ragweed. apache configuration appears to be borked.

August 20

  • 10:00 jeluf: finished OTRS transition to ragweed. SpamAssassin setup finished.
  • 09:53 Tim: Switched site to 1.6alpha
  • 08:16 Tim: Applying schema update for 1.6alpha, basically an ALTER TABLE watchlist
  • 01:00 Tim: ran update-special-pages

August 19

  • 23:30 brion: changed postfix 'myhostname' setting from zwinger.wikimedia.org to mail.wikimedia.org, should prevent the mail loop errors reported sending to the full addr
  • 23:00 brion: ran namespace conflict checks for updates on tawiki and gawiki
  • 21:40 brion: updated rebuildInterwiki

August 18

  • 23:30 jeluf: OTRS status: Installed apache/php/perl/postfix/mysql client on ragweed. Using pascal as DB server. Problems with sessions, sessions seem to be mixed up, sometimes I get logged in as presroi, sometimes as JeLuF :-/ Stopped apache for now. Postfix still accepting new tickets.
  • 22:30 mark: Changed DNS CNAME ticket.wikimedia.org to point to ragweed
  • 22:17 brion: disabled account creation throttle on press wiki; this is closed wiki and all accounts are created by an admin
  • 10:00 midom: suda is back again, with enwiki and commonswiki databases
  • 05:00 jeluf: copied OTRS tables to pascal, copied otrs binaries to pascal, configured pascal to serve https. Can access old tickets again. Currently can't send new tickets to otrs. DNS change needs to be done.
  • 00:55 brion: recreated wikimediasr-l list on zwinger

August 17

  • 19:27 brion: fixed bug in db.php that set all database load factors to NULL

August 16

  • 20:15 jeluf: renamed project namespace on cswikibooks to Wikiknihy.
  • 15:30 midom: resumed idle bacon's mysql replication, we might need to do external store migration soon, and bring back suda with smaller dataset.

August 15

  • 21:46 kate: always_bcc on zwinger was set to "quagga" and its mbox was full, so it generated lots of bounce messages. i removed the setting.
  • 12:30 mark: Mint seems to have at least a bad disk, possibly other problems. Sun will look at it. In the meantime, we can *try* to network boot it and recover data.
  • 10:30 jeronim: had a look at mint via the IPMI - tried to power cycle it but it wouldn't switch off. Mark will tell the kennisnet guys about it. There's a dump of the OTRS DB from before the transfer to mint in albert:/root. If mailman is to be put back to zwinger, chapter-l and the new Serbian list will need to be re-created (and maybe some other lists?).
  • 09:00 mark: Mint apparently is fucked, RAID and SP settings were reverted to factory defaults. Trying to do data recovery now. Possibly a power problem?

August 14

  • 19:51 brion: mail config on zwinger broken or funky or otherwise annoying; just leaving it off for now. moved dns for mail back to mint (which is still dead) sighhhh
  • 19:26 brion: moved mail.wikimedia.org back to zwinger due to extended outage on mint. With our limited support contract on knams we can't afford to have this critical service there.
  • 14:30 midom: srv27,srv26,srv25 joined external storage service, waiting for payload
  • 09:30 brion: mint is offline, no ping
  • 00:20 brion: stopped bacon to run backup dump
  • 01:00 jeluf: enabled spamassassin for OTRS on mint (~otrs/.procmailrc)

August 13

  • sometime kate: moved otrs to mint
  • 23:25 brion: added wikimediasr-l aliases to mailman on mint
  • sometime someone: Apparently mail.wikimedia.org has been moved to mint.
  • 10:42 jeronim: set ticket.wikimedia.org to CNAME mint.knams.wikimedia.org. (move of OTRS to mint is in progress)
  • 00:58 Tim: started update-special-pages
  • 00:19 Tim: it happened again so I disabled otrs's crontab. Original crontab is in /opt/otrs/crontab

August 12

  • 23:18-23:30 Tim: An OTRS process on albert (PostMaster.pl) developed a runaway memory leak, causing heavy swapping. This slowed down albert sufficiently to cause the entire apache cluster to lock up with high load. Killed the process at 23:30 and the site soon returned to normal.
  • 09:30 brion: took srv1 out of 'apaches' node group and shut off apache on it. DON'T RUN APACHE ON SRV1

August 11

  • 21:26 Tim: TICK TICK TICK, that's the sound of 58 servers with their clocks ticking in synchrony, maximum offset 80ms.
  • 20:30 Tim: Added the missing restrict line for 10.0.0.200 to ntp.conf on (almost) all machines
  • 19:30 Tim: Synchronised ntp.conf on hypatia, humboldt, rose, anthony, rabanus, diderot and srv1 with /home/config/others/etc/ntp.conf.vlan2 . This made them remotely queryable, for easier debugging in the future, and also switched their preferred server from zwinger to the cisco (in broadcastclient mode).
  • 18:35 Tim: Fixed tingxi's resolv.conf
  • 17:45 mark: Fixed inconsistent favicons on apaches. Older apaches had symlinks to a common (wikipedia) favicon, which got overwritten with the new wikinews favicon by brion. Removed the symlinks, and put the correct favicons in place.
  • 12:20 brion: set up pl.wikimedia.org and press.wikimedia.org (press is locked, and currently has no user accounts. a sysop/bureaucrat will need to be added for it to be used)
  • 07:28 brion: updated wikinews.org favicon

August 9

  • 23:20 mark: Rerouted Europe back to knams, because all sorts of weird problems were occurring. Fixed a typo (pmpta) in DNS. Some nameservers report TTL 0 for some of our DNS records - need to investigate that.
  • 22:20 mark: Moved Squid service IP 207.142.131.246 from overloaded srv10 to srv5. Cleared the ARP entry on the l3 switch.
  • 22:00 mark: Reroute everything from knams to pmtpa directly, because of routing problems
  • 13:35 mark: changed biruni's hostname from biruni.wikimedia.org to biruni
  • 13:30 mark: added avicenna and biruni to node_groups/apaches
  • 13:00 mark: Restarted apaches on avicenna, alrazi and biruni with -DSLOW, and changed startup scripts
  • 08:52 jeronim: blocked 61.48.105.65 spammer IP from all wikis using block-ip-all - so ipblocklist message will speak of "vandalism" instead of "spam"
  • 08:25 jeronim: created chapter-l for mailman on mint

August 8

  • 09:22 kate: enabled greylisting on mail.wm.org
  • 20:54 hashar: readded srv2 (with ip x.x.0.1 ) to the apache pool
  • 18:25 hashar: avicenna & biruni readded. Monitoring error log, #wikipedia and memory.
  • 17:43 brion: added /mnt/upload mounts on avicenna and biruni
  • 17:32 hashar: forgot sync-common on avicenna and biruni :/ I thought scap would do the job ... They are both missing the upload directory.
  • 15:45 brion: stopped apache on avicenna and biruni pending more information on reported errors
  • 15:36 hashar: TODO: biruni hostname seems wrong: /etc/sysconfig/network lists HOSTNAME=biruni.wikimedia.org whereas other servers just get HOSTNAME=zwinger or HOSTNAME=srv30 ...
  • 15:36 hashar: removed srv1 from mediawiki-installation dsh file (as apache is not meant to run on it).
  • 15:24 hashar: brought biruni back into mediawiki-installation pool
  • 15:12 hashar: brought avicenna back into mediawiki-installation pool
  • 14:30 hashar: started apache on srv11.
  • 06:30 kate: moved mailing lists to mint. let's see if it starts sucking less.

August 7

  • 20:50 brion: postfix hung zombified on zwinger, wouldn't restart automatically. had to remove master.pid and restart.
  • 16:25 brion: installed DynamicPageList on wikiquote per [3]
  • 15:50 brion: locked tlhwiki
  • 07:47 brion: added application/ogg as mime type for ogg files on albert
  • 00:59 brion: set localized logo for ptwiktionary

August 3

  • 14:15 mark: Switched over upload.wikimedia.org to lighttpd instead of apache on albert
  • 12:00 brion: added frankfurt city map to wikimania whitelist. whoops!

August 2

  • 15:45 mark: Bound albert's apache to a single IP, instead of INADDR_ANY
  • 09:40 brion: added wildcard subdomains for wiktionary.com redirection

August 1

  • 22:30 all: samuel's disk filled up. Switched master to adler. Re-syncing samuel from suda.
  • 14:50 mark: Put all kennisnet squids back into DNS, updated DNS on pascal

July 31

  • 11:50 brion: knams squid at 145.97.39.138 is not reachable, but still in dns rotation. THIS IS BAD
  • 01:50 brion: pascal is offline, reason unknown. bugzilla down, no NFS for knams cluster.

July 28

  • 01:06 kate: put a new skin on bugzilla

July 27

  • 18:50 brion: blocked irc4ever.net remote page loaders

July 26

  • 08:08 kate: upgraded mysql on vandale to 5.0.9

July 25

  • 19:05 brion: set $wgMetaNamespace to 'Vikipedi' on trwiki, refreshing links
  • 18:15 mark: Added two missing kennisnet squid IPs to the udpmcast startup script on larousse, and restarted it.
  • 17:29 brion: added wikimania-l mailing list
  • 17:25 mark: Pointed thailand at knams as a test - some people there say it is much faster than pmtpa. Will eventually be replaced by the yahoo cluster anyway...

July 24

  • 16:15 brion: set ndswiktionary to capitallinks off
  • 10:10 brion: updated sudoers file on srv0 so syncs work again

July 22

  • 22:50 brion: restarted search update daemon... still seems to be a memory leak and it hangs when it gets too large
  • 22:31 brion: moved wiki.mediawiki.org to www.mediawiki.org and redirected from mediawiki.org and wiki.mediawiki.org to it
  • 22:07 brion: srv0 clock was about 150 seconds in the future. kate did something to fix it. synchronized hardware clocks from system time (systohc) on all apaches in the hope that the fix survives a reboot. Fixed one revision reported to be appearing in a weird inverted order.
  • 13:50 brion: took avicenna out of search group to do experiments on index

July 21

  • 23:30 Tim: added rollback group
  • 22:00 Tim: moved group settings from CommonSettings.php to InitialiseSettings.php

July 20

  • 23:45 brion: updated clocks on srv1, rabanus, etc all apaches... hopefully
  • 21:40 brion: set wgCapitalLinks off on afwiktionary
  • 19:20 mark: Removed legacy zone gdns.wikimedia.org and corresponding georecord rr.gdns.wikimedia.org from all nameservers. It's not being used anymore, and only confuses people.
  • 19:05 mark: Pointed france and switzerland back at lopar in geodns
  • 14:10 brion: created wikinews-hotline mailing list by request

July 19

  • 23:58 Tim: fixed Special:Uncategorizedcategories, now running updateSpecialPages.php on /h/w/c/smallwikis
  • 15:30 brion: reverting build copy of search index to the previous version to try working around some corruption from daemon crash (?)

July 17

  • 18:27 mark: An empty line in the geomap file caused problems and made the site go down for non EU users. Apparently geobackend currently doesn't handle empty lines in geomap files (a bug which I will fix), so don't use them.
  • 18:18 mark: Pointed all European countries at knams wrt geodns
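
The 18:27 entry above notes that an empty line in the geomap file took the site down for non-EU users. A minimal pre-deployment check along these lines could flag such lines before the file is loaded. This is only a sketch: the function name and the sample file contents are hypothetical, and the real geomap format is not described in this log.

```python
def find_blank_lines(text: str) -> list[int]:
    """Return the 1-based numbers of empty or whitespace-only lines."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.strip()
    ]


if __name__ == "__main__":
    # Hypothetical sample content; the real geomap syntax may differ.
    sample = "nl knams\n\nfr lopar\n"
    blank = find_blank_lines(sample)
    if blank:
        print("blank lines found at:", blank)
```

Running such a check before pushing a geomap change would catch the problem regardless of whether the backend itself tolerates empty lines.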

July 16

  • 17:07 kate: wrote a new statistics system and replaced webalizer with it
  • 07:30 brion: had to restart search daemons again due to breakage. whyyyyyyy they worked before *sob*
  • 00:15 hashar: overloaded suda for almost 5 minutes by running the unbugged updateSpecialPages script. Might be cause of Wantedpages.

July 14

  • 02:50 brion: separated mediawiki-installation and apache node groups. These must not point to the same file.
  • 02:00-3:15 erik: created Japanese Wikinews at http://ja.wikinews.org/

July 13

  • 20:59 brion: had to interrupt bgwiki backup due to memcached hang
  • 06:10 brion: restarted search servers; 'too many open files'
  • 01:30 brion: started backup on benet (slave stopped). updates in #wikimedia.15status

July 12

  • 23:35 brion: commented out lopar from geodns for now (moved them to knams)
  • 23:20 brion: there's intermittent packet loss to lopar...
  • 19:10 mark: Site was down due to crashed perlbal on holbach, restarted it
  • 12:03 kate: put lily back to squid pool
  • 08:10 jeronim: set yum on larousse (FC2) to use fedoralegacy.org
  • 08:00 mark: lily's hardware has been replaced.
  • 07:40 jeronim: set HostnameLookups Off on larousse's apache at hashar's request
  • 07:10 jeronim: added CNAME commons.wikipedia.org -> commons.wikimedia.org
  • 00:40 brion: restarted mysql on james's advice with config change. innodb_lock_monitor fails, however. have innodb_status_file=1 set now. had to do 'slave stop' on samuel, which is master. wtf

July 11

  • 23:40 brion: set innodb_lock_monitor on samuel on jameday's recommendation. will be active when mysqld restarted
  • 23:20 jeluf: restarted ServmonII. Died when it lost its irc connection earlier today.
  • 23:05 brion: removed the fateful link so editing that page works for now
  • 22:30 brion: disabled deletion of recentchanges records due to slowness there. hacked Title::touchArray to go row by row due to weird hangings trying to edit Template:POTD on enwiki. Not sure what's wrong, it consistently hangs at User:Mulad/portal. What could be locking it?
  • 18:30 brion: biased search load to maurus, as avicenna (with less memory) was being sluggish. added comment to output saying which server was hit
  • 15:10 mark: Removed authoritative zones that were no longer pointing at zwinger from zwinger's Bind configuration (interferes with resolving). Set up AXFR slaving of zones that are supposed to be served by the new PowerDNS servers, but which are still delegated to Zwinger/bomis/fuchsia.
  • 14:50 mark: Fixed reverse DNS for knams

July 10

  • 17:00 brion: shut down slave thread on ariel before it explodes
  • 05:40 hashar: check out our new portal: http://noc.wikimedia.org/
  • 01:07 kate: removed ariel from load balancing because it only has 700MB of disk space left.

July 9

  • 10:30 brion: fixed up steward mode in special:makesysop plugin to provide the full userrights options
  • some time in the morning kate: reverse dns for knams started working, although under *.rev.wikipedia.org.
  • 08:02 brion: reassigned 'developer's on meta to steward group

July 8

  • 5:20 brion: started mass lucene index builds using the updater daemon. once done, will sync current index files out. (progress in #wikimedia.15status)

July 7

  • 13:50 brion: added page update hook for the lucene update daemon, see wikitech-l post
  • 11:38 mark: Installed java (!) on pascal, to allow Kennisnet/ZX to upgrade the SP and BIOS on lily.
  • 11:34 brion: maurus had bogus hostname (maurus.wikimedia.org, doesn't resolve). fixed live and in /etc/sysconfig/network
  • 08:55 brion: upgraded PEAR::XML_RPC to 1.3.2 on mediawiki-installation group. Patching mono on avicenna and maurus for ximian bug 75467
  • 08:30 brion: noticed vincent seems to be hung
  • 07:00 Jamesday changed holbach cache split from 200M/2800M to 200/2500M because of excessive page faulting in vmstat, not yet restarted.

July 6

  • 14:40 Tim: named on albert exited for no apparent reason, causing site-wide slowdown. Logged on via the scs and started it.
  • 07:00 brion: all wikis reading from 1.5 code now. zh-min-nan.wikipedia.org has the UI broken -- code problem selecting wrong UI language [since fixed]
  • 06:30 brion: fixed up broken conversions on sdwiki, rowikibooks, fiu_vrowiki, cowikibooks, aawiki
  • 06:00 brion: upgraded meta to 1.5
  • 04:00 kate: upgraded all knams machines to current kernel to fix bad pmd problem

July 5

  • 10:43 kate: put back mint to squid pool
  • 9:15 mark: Added zh-tw.wikimedia.org CNAME record to the wikipedia.org zonefile, as it was missing (and is not in langlist, for not being a language)
  • 8:40 mark: Added an admin account on lily's SP, and set up temporary port forwarding on pascal to give ZX (sysadmin partner of Kennisnet) access to diagnose lily's hardware problems

July 4

  • Jason/mark: Many Wikimedia project domains have been changed to use the new PowerDNS DNS servers, so if you see any DNS-related problems, they might be related to that
  • 19:32 kate: set up squid log migration system
  • 08:10 brion: migrated forgotten changes to InitialiseSettings from 1.4 to 1.5 (jbowiki caps, fiu-vro logo, zhwiki externalstorage)
  • 03:08 kate: removed srv1 from apache pool again.

July 3

  • 21:35 jeronim: srv1, srv2 & LDAP alive again after manual reboot by colo staff. not sure if domas actually emailed about scs-ext problem.
  • 20:05 jeronim: and scs-ext.wm.org doesn't work anymore. dammit has emailed colo about this and srv1/srv2 problem
  • 20:00 midom: oopsie, srv1 also didn't come up after reboot, and apparently it was LDAP server... LDAP down.
  • 19:00 midom: resyncing holbach, updated misbehaving apache hosts (srv2,srv3,anthony,rose), srv2 didn't come up after reboot.
  • 06:10 brion: holbach crashed again, mysqld was restarting over and over. killed it for now.
  • 05:05 brion: fixed more wikimania registration files
  • 02:20 brion: fixed missing db config in wikimania attendees list

July 2

  • 21:55 brion: holbach died. restarted zhwiki conversion w/o it.
  • 19:30 brion: started asian large-wiki upgrades: jawiki, zhwiki
  • 16:00 midom: bacon joined perlbal service, restarted perlbal on holbach, site looks happier.
  • 09:00 brion: eswiki upgraded, doing ptwiki now. dammit took ariel out of rotation, ready for reloading
  • 07:40 kate: moved bugzilla to pascal
  • 06:51 brion: fixed db host for wikimania registration
  • 06:45 midom: samuel is our master.
    • mediawiki 1.4, mediawiki 1.5, bugzilla, and otrs should be configured properly for new master. is there anything else? [search server update needs changing anyway, working on this --brion]
  • 04:50 brion: ran refreshLinks on enwikinews
  • 04:30 brion: disabled sorbs checking for now
  • 02:40 Jamesday: changed bacon cache split from 800M/2000M to 200M/2600M, not yet restarted.
  • 02:30 Jamesday: changed holbach cache split from 1000M/2000M to 200M/2800M, not yet restarted.
  • 02:05 brion: running background refreshLinks.php on dewiki

July 1

  • 22:20 Jamesday: changed ariel my.cnf from MyISAM/InnoDB cache split of 1700M/3900M to 300M/5100M assuming minimal MyISAM use now. We've been this high before for InnoDB but there's a small chance that the new kernel on Ariel might not like going above 4G on the next restart - reduce it to 3900 if that happens. Not restarting ariel now because one is planned anyway and it's not that urgent - should improve load handling ability though. Decreased binlog_cache_size from 1M to 128k (it's per session and doesn't really need to be 1M).
  • 08:20 brion: changed Revision legacy encoding conversion to use //IGNORE in iconv... this may need tweaking
  • 06:10 brion: dewiki done.
  • 05:56 brion: moved 1.5 skins dir from /w/skins-1.5 to /skins-1.5. Turns out squid configuration does cache-control rewriting on /w which makes them uncacheable. Bad squid!
  • 00:45 brion: switched 1.5 wikis to shared filesystem sessions. A hack in User::matchEditToken fatally broke save attempts by previously-logged-in users because it didn't bother to check that memcached sessions were in use; I've commented it out.
  • 00:30 brion: switched 1.4 wikis to shared filesystem sessions, perhaps this will relieve memcached session problems?
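
The 08:20 entry above switches the legacy encoding conversion to iconv's //IGNORE mode, which drops characters that cannot be converted instead of failing. The following is only a rough Python analogue of that behaviour, not the MediaWiki code: the function name and the default legacy charset are assumptions, and Python's errors="ignore" is a stand-in for iconv's //IGNORE rather than an exact equivalent.

```python
def decode_legacy(raw: bytes, legacy_encoding: str = "windows-1252") -> str:
    """Decode legacy-encoded bytes, silently dropping undecodable bytes.

    The default charset is an assumption for illustration; each wiki
    may have used a different legacy encoding.
    """
    return raw.decode(legacy_encoding, errors="ignore")


if __name__ == "__main__":
    # 0xE9 is e-acute in windows-1252, so this decodes cleanly.
    print(decode_legacy(b"caf\xe9"))
    # With an ASCII source charset, the same byte is invalid and is dropped.
    print(decode_legacy(b"caf\xe9", "ascii"))
```

As the entry says, this kind of lossy conversion may need tweaking: dropped bytes disappear without any warning, so a stricter mode is preferable wherever losing data is not acceptable.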

Archives