Server Admin Log/Archive 1

28 September

  • 23:55: brion: yongle's disk was full; deleted a few gigs of old squid logs from /var/backup/archiv
  • 09:42: jeronim: Re-started rsync of /home to albert, bandwidth-limited to 500kB/sec. Will sit on IRC and watch it. (Note that I did the rsync before because we must make backups, and a full backup of /home was long overdue. At some point we should move either mail or /home off zwinger.)
  • 06:15: brion: Managed to get the nasty rsync killed after some time with load average over 100 and complete inability to do anything on zwinger. Can we please not do this?
Sure, once we find a better way to make backups.
  • 00:50: jeronim: currently rsyncing zwinger:/home to /home2 on albert. This may make zwinger very, very slow, especially when you are trying to log in. rsyncd on zwinger and rsync on albert are already niced to lowest priority. If it gets too painful, get a root to run killall -STOP rsync on albert. (A sketch of the invocation is at the end of this section.)
  • jeronim: coronelli's squid died horribly when its access.log hit 2GB. The access log hit 2GB because the machine was power-cycled during the previous logrotate. Noted the 2GB access.log problem on the squids page.
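A minimal sketch of the bandwidth-limited, low-priority pull described in the 00:50 entry (the rsync module name and exact flags are assumptions, not taken from the log):

# on albert: pull zwinger:/home at lowest priority, capped at 500 kB/s
nice -n 19 rsync --archive --bwlimit=500 zwinger::home /home2/
# if zwinger becomes unusable, pause the transfer as root on albert ...
killall -STOP rsync
# ... and resume it later
killall -CONT rsync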

25 September

  • 21:27 kate: done, db is not changed.
  • 19:54 kate: running schema change tests on diderot. don't start slave!
  • 13:40 jeronim: set albert's default kernel to 2.6.8
  • 12:31 jeronim: good news: the cluster is once again called "Florida cluster" in ganglia instead of "unspecified"
    bad news: two weeks of ganglia data are not findable without altering the URL by hand
    This fsckup can be avoided in future by following the directions on the non-existent Add a server page.
  • 08:30 brion: added application/x-bzip2 to zwinger's mime.types to aid download.wikimedia.org (a sketch of the line is at the end of this section)
  • 07:11 Tim: upgraded PHP on the remaining apaches
  • 03:55 Tim: put diderot back into service as an apache, now with ICC-compiled PHP 4.3.8....
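The mime.types change in the 08:30 entry amounts to a single line; a sketch (the extension mapping is an assumption):

# /etc/mime.types on zwinger
application/x-bzip2                bz2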

24 September

  • jeronim: partial FC1 and FC2 mirror on albert:/var/mirror/fedora - http://mirror.wikipedia.org/fedora - not accessible from outside
  • 10:00 Jamesday: searchindex keys disabled for Albert. Will be disabled permanently for diderot/goeje and similar slow slaves, to help them keep up with their backup job.
  • 09:10 jeronim: re-added bacon to ALL group for dsh
  • 09:00 jeronim: Installed the 2.6.8 kernel on albert; 2.6.5 is still the default kernel. Once 2.6.8 has proven stable, change the default kernel in /etc/grub.conf to "0" (see the sketch at the end of this section).
  • 04:50 Jamesday: Bacon in DB load sharing, will be experimenting with search and load weights to assess its capacity.
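A sketch of the grub change described in the 09:00 entry, assuming the 2.6.8 entry is listed first in /etc/grub.conf:

# /etc/grub.conf on albert, once 2.6.8 has proven stable
default=0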

23 September

  • 21:37 kate: purge now reads the list of squids from /usr/local/dsh/node_groups/squids
  • 20:00 brion: enabled strict file extension whitelist for uploads.
  • 19:45 brion: set up http://upload.wikimedia.org/ to hold all the wikis' uploads and set the old paths to redirect there. This keeps further IE security holes from having access to the wikis' cookies.
  • 18:00 brion: changed the default/plain-text MIME type on the servers from text/plain to application/octet-stream to work around a huge security flaw in Internet Explorer and Safari (see the sketch at the end of this section).
  • 15:06 Tim: switching the live copy of MediaWiki to REL1_3A, a new branch created for the purpose of quickly implementing a schema-changing security feature. The feature removes the need to send out password hashes for the "remember password" feature, and so makes cookie stealing a lot less scary.
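If the 18:00 change was made at the Apache level (the log does not record the exact directive), the usual way to do it is:

# httpd.conf: serve unknown and extension-less content as a download rather than as text/plain
DefaultType application/octet-stream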

21 September

  • 20:00 shaihulud: testing htdig on albert, restarted slave on albert
  • 14:00 Jamesday: changed load sharing to 5 (Ariel) : 20 (Suda), from the previous 5:15. Ariel has more current slow threads than Suda in mytop.
  • 13:45 Jamesday: stopped slave sql on Albert so it will complete its backup faster. It's reached pl.
  • nn:nn Jamesday: colo reported no memtest errors for Bacon.
  • jeronim: Jason will post the non-broken apache server from California to Florida this week.
  • jeronim: updated 207.142.131 zone file, which Jason will take care of tomorrow
  • jeronim: memtest86 is running overnight on bacon. Jimmy said he would get the colo staff to check the results in the morning and then reboot it. Don't know how we are going to find out the results though.

20 September

  • 20:53 jeronim: set NETMASK to 255.255.255.192 in /etc/sysconfig/network-scripts/ifcfg-eth0 on all machines
  • 20:00 shaihulud/jwales : serial cable (which was connecting zwinger and ariel) now connects bacon and ariel
  • 19:48 jeronim/jwales: bart and bayle removed from the APC to make way for bacon, which is now on ports 2 and 3
  • 19:30 Jamesday: took Albert out of DB load balancing
  • 07:32 hashar: cvs updated test. FIXME: g+w skins/ directory :(
  • 07:07 hashar: switched test back to english.
  • 06:50 Jamesday: Albert in load balancing with low weight. Can take up to about 50-70qps if necessary.

19 September

  • 09:30 jeronim: rabanus is now the 4th squid - load distribution is 2:3:3:3 for browne:coronelli:maurus:rabanus
  • 07:06 jeronim: 5 more IPs (207.142.131.202 thru 207.142.131.206) added into squids' DNS round-robin, for a total of 11
  • 06:00ish jeronim: memcached reshuffled to new arrangement with 34 blocks of 180MB spread across 11 machines for a total of 6120MB - see memcached (a sketch of one instance follows this list)
  • 03:10 Jamesday: switched albert, diderot and goeje to their searchindex updating settings
  • 02:00 brion: merged two separate 'wgSiteSupport' sections in InitialiseSettings.php and changed the french link per Anthere's request. Please check that en and de are still set correctly.
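A sketch of one of the 180MB instances from the 06:00ish entry (port and user are assumptions; with 34 blocks on 11 machines, most hosts run several instances on different ports):

# one 180MB memcached block, daemonized
memcached -d -u nobody -m 180 -p 11211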

18 September

  • 21:00 jeluf: redirected (www.)wikipedia.ch to wikimedia.org/ch-portal. Old links to pages under wikipedia.ch/* are sent to de.wikipedia.org/* .
  • jeronim: Some machines in the upper half (207.142.131.224 - 207.142.131.255) of our block of 64 IP addresses had a netmask of 255.255.255.224. This was causing some traffic within the /26 subnet to go via the router, and appear on the MRTG accounting graphs from the colo, which is undesirable. Changed netmask on these machines to 255.255.255.192, which fixed the problem. Now all machines in the cluster have the same netmask. Command used was:
ifconfig eth0 netmask 255.255.255.192 ; route add default gw 207.142.131.225
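The command above only changes the running configuration; per the 20 September entry, the persistent setting lives in each machine's interface config file:

# /etc/sysconfig/network-scripts/ifcfg-eth0 (one line changed on each machine)
NETMASK=255.255.255.192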

17 September

  • 19:15 brion: tweaked wikibooks.org rewrite rules to fix an ampersand problem: there was double-escaping in the index.php/${ampescape:$1} rewrite (see the note after this list)
  • 15:48 : synced languages/LanguageFi.php (category namespace translation)
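${ampescape:$1} in the 19:15 entry refers to an Apache RewriteMap. The actual map definition and rule are not in the log; the general mechanism looks roughly like this (the int:escape internal map and the rule pattern are assumptions):

# httpd.conf: define a map and apply it when building the rewrite target
RewriteMap  ampescape  int:escape
RewriteRule ^/wiki/(.*)$  /w/index.php/${ampescape:$1}  [L]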

16 September

  • 07:39 tstarling: removed bacon from the dsh ALL node group

15 September

  • 17:20 shaihulud: stopped the copy from goeje to albert; it was only copying at 200KB/s. Copying from suda to albert instead.

14 September

  • 20:30 shaihulud: copying db from goeje to albert, for tests
  • 20:09 jeronim: suda: 2x36GB RAID (/dev/sda1) mounted at /a; symlink from /usr/local/mysql/data1 -> /a/data1

13 September

  • Jamesday: Suda in load sharing. Search is on at its restricted settings. Needs watching during the next load peak to see how Suda does.
  • 21:45 shaihulud: 213.88.238.204 blocked in firewall rules because of heavy use of wget
  • 04:45 Jamesday: diderot replication lag was growing and user CPU was very high, so ran: set global query_cache_size=0; set global table_cache=500; set global key_buffer_size=64*1024*1024; (previous values were 1048576, 4, and 1048576). These need to be saved to my.cnf (a sketch follows this list). That was not quite enough to keep up with replication, so raised key_buffer_size to 256M and it started catching up. Can be decreased when caught up.
  • 20:00 jeluf: CoLo rebooted browne, since it was no longer responding.
  • 21:30 jeluf: added wikispecies.org to apache config and created htdocs directory as specified in the howto. Added DB, added interwiki links, tested. Not yet done: interwiki links from *wiki -> specieswiki
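A sketch of the my.cnf lines corresponding to the 04:45 entry above (values from that entry, with key_buffer_size at the raised 256M; exact file layout is an assumption):

[mysqld]
query_cache_size = 0
table_cache      = 500
key_buffer_size  = 256M   # can be lowered again once replication has caught up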

12 September

  • 11:40 shaihulud: slave started on suda and diderot

11 September

  • 12:50 shaihulud: slave started on bacon, now copying db from diderot to suda
  • 08:30 shaihulud: copying db from diderot to bacon, please don't start mysql on diderot
  • 00:51 jeronim: suda: backed up /etc/ssh/ contents to /etc/ssh/old/ and copied in the backups from before the install

10 September

  • suda is upgraded to 64-bit Fedora Core 2, and bacon and albert, the new search server and file server, are installed
  • 17:40 jeronim: suda files (minus DB) backed up to friedrich:/var/backup/suda, before jwales installs a new 146GB drive
  • 03:15 tstarling: created board@... postfix alias. Also anthere, fdevouard, tshell and mdavis.

9 September

  • 04:09 kate: enable search
  • 02:30 kate: disabled search - wiki too slow

8 September

  • Tim: increased upload limit to 20 MB
  • 08:00 shaihulud: suda cloned, have fun :)

7 September

  • 22:28 brion: removed curly from wikis' squid purge list
  • 14:50 Tim: yongle ran out of hard drive space and started issuing PHP errors. Stopped apache. The culprit seems to be squid logs, deleted one day's worth and started moving the rest to friedrich

6 September

  • 16:00 shaihulud: I'll clone suda, please don't restart mysql on it if it's stopped
  • 12:00 Jamesday: Suda database needs to be cloned from one of the other slaves and replication started again. searchindex was damaged by the restart and repaired, then replication broke with error Could not parse relay log event entry and corrupt relay log on Suda. Other slaves are fine and proceeded beyond the point where Suda had the problem, so the error isn't from corrupt master log file on Ariel. Replication is currently running but with missed transactions.
  • 08:00 kate: changed recache-special to use magic
  • 01:00 brion: rebooted suda; sshd died

5 September

  • 04:43 Tim: moved minnan.wikipedia.org to zh-min-nan.wikipedia.org. Deleted the minnanwiktionary database, which had a few empty tables, and no cur table. It can be recreated using the script. Fixed Min Nan interlanguage links.

3 September

  • ??:?? Jamesday: changed Ariel's slow query threshold from 60 seconds to 240 seconds, live and in my.cnf. The 60-second threshold produced 143MB of slow query log in four days.
  • 01:47 Hashar: At the request of Angela: changed an incorrect symlink in /home/wikipedia/htdocs/download for tomeraider (it pointed to backups instead of backup). Makes http://download.wikimedia.org/tomeraider/ available again.

2 September

  • 22:10 Jamesday: still too much search load at 22:10; set to 23:00.
  • 21:40 Jamesday: looks as though someone changed search available start time to 21:00 a few minutes ago. Ariel couldn't handle it. Set it to 22:00.
  • 04:29 jeronim, Jamesday: read only for about 15 minutes to restart Ariel so changes in my.cnf could happen: longer lock wait timeout, slow query log turned on, etc.. Verified that anyone in group wikidev can now start MySQL on Ariel.
  • 02:51 jeronim: on ariel - /usr/local/mysql/bin/mysqld_safe added to sudoers so that group wikidev can start mysqld
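A sketch of the kind of sudoers entry described above (the exact spec is an assumption; edit with visudo):

# /etc/sudoers on ariel: let group wikidev start mysqld
%wikidev ALL = (root) NOPASSWD: /usr/local/mysql/bin/mysqld_safe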

1 September

  • 06:08 jeronim: changed ownership and permissions on /usr/local/mysql/data on ariel so that members of group wikidev can write to it. Original owner/perms were mysql:mysql/755:
[root@ariel mysql]# chown mysql:wikidev data
[root@ariel mysql]# chmod 775 data

31 August

  • 21:37 Jamesday: turned off full text search for two minutes to let Ariel catch up after two copies of the same slow search caused a backlog. Recovered in under a minute.

30 August

  • 18:20: brion: a recompiled texvc in php-new should now propagate correctly via scap.
  • 16:45: shaihulud: disabled text search from 7:00 to 20:00. Please leave it that way, or the wiki becomes very slow!
  • 2:00ish brion: coronelli had broken its /home mount and could not remount. Problem was that routing was sending packets to zwinger via the eth0:2 alias (.248) and zwinger refused permission to mount. Temporarily taking down the aliases to remount seems to have done the trick for now.

28 August

  • 22:00 jeluf: stopped squid on diderot. CPU at 100%, slow.
  • 18:45 shaihulud: restarted backup slave on diderot
  • 18:30 shaihulud: disabled full text search; it was making the wiki unavailable
  • 16:00 jeluf: changed mormo from being the second mailman server to being an MX relay.
  • 15:30 jeluf: added new mailing list wikiis-l on zwinger.
  • 14:00 shaihulud: moved the db backup from /home/wikipedia/backup on zwinger to /var/backupdump on harris, which is mounted on suda (for backups) and on zwinger for download.wikimedia.org (a sketch of the export/mount setup is at the end of this section)
  • 07:10 Jamesday: set fulltext search to be on all day Saturday. Still off 11:00-20:00 other days.
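A sketch of the export/mount arrangement behind the 14:00 entry (the export options and mount options are assumptions):

# on harris, /etc/exports: share the backup area with suda (writes) and zwinger (serves downloads)
/var/backupdump  suda(rw,sync)  zwinger(ro)
# on zwinger, /etc/fstab:
harris:/var/backupdump  /var/backupdump  nfs  ro,intr  0 0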

27 August

  • 20:22 Jamesday: Added 10% load balancing to Suda, set global read_buffer_size=8*1024*1024; set global key_buffer_size=512*1024*1024; both for search.
  • 16:00 shaihulud: set up another slave on goeje to replace diderot; stopped squid on it
  • 15:00 shaihulud: restarted mysql on suda
  • 07:00 brion: disabled the wiki add script in tim's crontab since it has caused a lot of trouble with infinite loops grinding down zwinger.
  • 01:00 Jamesday: set fulltext search off start time to 11:00 - no sign of trouble yesterday.

26 August

  • 22:45 Jamesday: took Suda out of DB load balancing while it's offline. Changed search blackout to 08:00 to 20:00 in the hope that the DB can handle the load after the optimize. May need to be turned off. We'll see.
  • 16:15 shaihulud: stopped mysql on suda, to copy db, as diderot is dead....
  • 01:30 Jamesday: Optimize table completed on Suda. Suda now getting about 10% of database load.

25 August

  • 23:30 Jamesday: Optimize table for all wikipedia and wiktionary searchindex tables completed on Ariel, starting on Suda.
  • 21:30 Jamesday: after some overloads, extended fulltext search unavailability from 08:00 to 20:00 to 07:00 to 22:00. Running optimize table for searchindex later in the hope that it will help.
  • 05:20 Jamesday: set global key_buffer_size=384*1024*1024; then 512M on Suda for search/searchindex. It's not in my.cnf so Suda restarts ready to take over as a master but the 128M in my.cnf is very inadequate for search service.

24 August

  • 17:00 shaihulud: added suda in the load-balancing db system

22 August

  • 15:15 shaihulud: set a 2nd db slave on diderot
  • 15:00 shaihulud: restarted mysql on suda
  • 08:00 shaihulud: stopped mysql on suda, time to copy the db to an apache. Please don't restart it
  • 00:30 brion: removed the redirect. if people want to hate each other, who am i to stop them. make your zh-cn and zh-tw and nn etc
  • 00:15 brion: redirected zh-tw.wikipedia.org and zh-cn.wikipedia.org to zh.wikipedia.org. A zh-tw wiki had somehow been erroneously created; backing up the few articles there.

21 August

  • 22:40 jeronim: syslog for the new P4s now logs to zwinger as well as locally
  • 21:20 brion: managed to kill zwinger's network trying to adjust the subnet masks. Rebooted it to recover.
  • 11:12 tstarling: Will has been taken away, taking 1.5 GB of memcached space with it. Removed its instances from CommonSettings.php, which will effectively cause a parser cache clear, remove load balancing inconsistencies due to failover, and reduce typical memcached access times
  • 11:00 shaihulud: added the new apaches in squid.conf, on browne only


20 August

  • 23:45 brion: changed php.ini & CommonSettings.php to force the return address/envelope sender to 'wiki at wikimedia.org', to prevent paranoid servers from rejecting mail because they can't verify the faux apache@[host].wikimedia.org addresses. This address is currently aliased to my account, but should probably go somewhere else. (A sketch of one way to do this in php.ini is at the end of this section.)
  • 23:00 jeluf: changed firewallinit.sh of zwinger and yongle to SUBNET=.../26 instead of /27 due to four new server addresses
  • 22:30 jeluf: added the four new hosts to /etc/hosts of zwinger and DNS in wikipedia.zone. Added to /etc/exports on zwinger and yongle and reexported FS.
  • 22:00 jimbo: Installed eight new 1U P4's
  • 20:00 brion: addwiki script seems to have gone mad during the temporary downtime due to server rearrangement, and filled up zwinger's root partition with prompts from dsh. Killed it and trimmed the log file.
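One common way to force the envelope sender at the php.ini level, as described in the 23:45 entry (whether this exact mechanism was used is an assumption):

; php.ini on the apaches: hand mail to sendmail with a fixed envelope sender
sendmail_path = /usr/sbin/sendmail -t -i -fwiki@wikimedia.org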

19 August

  • 17:45 Jamesday: saw slave stopped on Suda (linear growth in replication lag in Ganglia stats). Probably happened while I was copying cur to jamesday_cur to benchmark improved search options - had stopped on a wait timeout for cur. I'll use the replication catch-up to compare various controller settings for caching, so we can apply the lessons to Ariel's similar controller.

18 August

  • 21:30 brion: several big queries (some search, some maybe not) sluggified the database for half an hour or so. Went to read-only for a few minutes and killed off the remaining ones to bring it back up to speed.
  • 15:15 shaihulud: disabled full text search during peak hours (12-16)

17 August

  • 18:30 shaihulud: /home/wikipedia/lsi_raid_utilities/MegaMGR/megamgr.bin is the software to manage the RAID on suda.
  • 11:56 tstarling: created a new sudo script set-group-write. Sets the group-writeable bit on all files in /home/wikipedia/common/php-new which have the user-writeable bit set.

13 August

  • 14:25 jamesday: Suda set global key_buffer_size=384*1024*1024; to keep replication lag down. This isn't in my.cnf so that Suda restarts as innodb-configured master instead of slave - need to type it each time while Suda is slave.
  • 14:15 jeluf: restarted mysql on suda. Stopped squid so that resync is faster.
  • 5:45 jeluf: reduced refresh, retry, minimum for wikipedia.org zone.
  • 2:00 brion: switched on mime filters in mailman

12 August

  • 15:30 jeluf: stopped mysql on suda for binary copy of datafiles to offsite location
  • 6:40 brion: unified php.ini files in the usual conf dir. Tightened the config a bit, disabling unneeded register_globals and allow_url_fopen

11 August

  • 19:30 shaihulud: restarted squid on browne
  • 12:20 tstarling: running refreshLinks.php on all wikis, in a screen on zwinger. Will probably take a few days.

10 August

  • 23:10 brion: tweaked bugzilla to do basic wikilinks. patch at metawikipedia:bugzilla
  • 8:00 brion: initial bugzilla install at http://bugzilla.wikimedia.org/index.cgi (check your DNS!) Will want to do additional tweaking before we start using it as our main bug repo.
  • 6:00 brion: added bugzilla.wiki[pm]edia.org to DNS as alias to zwinger. Restarted zwinger's named. Does this change need to be manually replicated to the backup DNS?
No, and rndc reload suffices in place of a restart.

09 August

  • 4:40 brion: 'scap' now notifies #mediawiki that a sync is in-progress via wikibugs bot

06 August

  • 21:00 jeluf: added cron job /home/wikipedia/bin/Apache-Midnight-Job to all apache servers. Fixed 2GB logfile on suda.

05 August

  • 15:40 timstarling: upgraded libpng to the 1.2.5-8.wp version, which includes security fixes.
  • 05:24 jeluf: Restarted apache on yongle. white pages again. Reduced MaxRequestsPerChild to 10000

04 August

  • 10:40 shaihulud: problem back on yongle, killed squid ([Wed Aug 4 10:42:48 2004] [error] PHP Fatal error: main(): Failed opening required './LocalSettings.php' (include_path='.:/usr/local/apache/common/php') in /home/wikipedia/common/php-new/index.php on line 14)
  • 10:00 shaihulud: reinstalled apache on yongle from rabanus. Before that, yongle was returning white pages

03 August

  • 15:30 shaihulud: if I download the 9GB file from a 64-bit system like ariel, it works.
  • 15:00 shaihulud: webfs seems to truncate big files.....
  • 14:30 shaihulud: installed the webfs http server on zwinger, for files > 2GB: http://download.wikimedia.org:8080 . It runs from rc.local; please add some switches if you want logs or anything else.
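A sketch of a webfs invocation like the one described above (the port is from the entry; the document root and log switch are assumptions):

# started from rc.local on zwinger; serves large dump files over plain HTTP on port 8080
webfsd -p 8080 -r /home/wikipedia/htdocs/download -l /var/log/webfsd.log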

02 August

  • 20:00 brion: set up wikibugs IRC bot. Adjust its parsing/output in /usr/local/bin/wikibugs.pl on zwinger if necessary
  • 18:44 angela: added quote.wikipedia.org to ServerAlias section in /home/wikipedia/conf/redirects.conf so that the quote.wikipedia.org redirect works

01 August

  • 6:45 brion: changed configuration for arwiki to disable localized numeral translation. (Make sure this stays enabled for hi.)

31 July

  • 17:21 jeronim: added wikibooks.com to ServerAlias section in /home/wikipedia/conf/redirects.conf so that the wikibooks.com redirect works

30 July

  • 10:05 jeluf: turned disk cache of maurus to 20GB, hit ratio was less than 40%, cache should be empty by now.
  • 9:45 jeluf: turned disk cache of coronelli to 20GB, hit ratio was less than 40%, cache should be empty by now.
  • 7:45 jeluf: turned disk cache of browne to 10GB, hit ratio was less than 40%, cache should be empty by now. coronelli and maurus unchanged.
  • 01:00 shaihulud: set the squid disk cache to 2MB, to purge de: pages after the utf-8 conversion
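The cache-size changes above are made with squid's cache_dir directive; a sketch (the spool path and L1/L2 directory counts are assumptions):

# squid.conf: roughly 20 GB of disk cache; shrinking the Mbytes value (e.g. to 2) flushes the cache
cache_dir ufs /var/spool/squid 20000 16 256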

29 July

  • 23:44 Jamesday: turned off profiling at the request of Shai, to speed up loading of de.
  • 14:39 jeronim: added /etc/init.d/postfix to sudoers for group wikidev on zwinger

28 July

  • 21:00 jeluf: rebooted coronelli.
  • 20:15 jeluf: rebooted browne.
  • 20:00 gwicke: installed squids on the french machines, added notes about how to do this at Squids.
  • 10:24 gwicke: changed the hardcoded message on de to use MediaWiki:Sitenotice, hacked Setup.php to parse the message as wikitext

27 July

  • jeronim: temporary squid access:
    • at einstein.hd.free.fr; ssh ports 500[123]0 for chloe, bleuenn and ennael; 8080 for chloe's apache
    • root's public key from zwinger is copied to /root/.ssh/authorized_keys on the .fr squids
  • jeronim: various things installed on .fr squids:
    • ganglia, for the moment at http://einstein.hd.free.fr:8080/ganglia/
      • gmetad running on each machine, so that the stats are collected in 3 places
    • apache 1.3, rrdtool, mrtg, tcpdump, iftop, kernel-package, kernel-source-2.6.7, lm-sensors, minicom, rsync, lynx, vim, php4, mtr, traceroute, netcat, ganglia-monitor, gmetad, hddtemp, sysfsutils, and some others I forget
  • 22:20 jeluf: moved IP 235 back from maurus to browne
  • 21:12 jeluf: High load on suda while resyncing mysql replication. Stopped squid.
  • 08:05 Jamesday, Ariel: set global query_cache_size=64*1024*1024; . The query cache gets fragmented after prolonged use, so a larger size was needed to avoid lowmem prunes. set global key_buffer_size=450*1024*1024; . The MyISAM hit rate is sometimes below 99%, so raised it from 400M to 450M. Dropped InnoDB by another 100M in my.cnf to compensate, and down another 500M to 4200M to allow for larger temp/sort buffer limits. Will raise it if the InnoDB hit rate goes below 99.5%. Raised tmp_table_size from 256M to 512M to reduce the rate at which temporary tables are created on disk. Added new max_tmp_tables at 10 (default 32) to protect against over-allocation of RAM. Live and my.cnf.

26 July

  • 10:00 shaihulud: moved public IP 245 to maurus instead of browne

24 July

  • 23:00 jeluf: after the suda crash, the mysql replica was broken. The error in the logs was:
040724 21:34:45  Error in Log_event::read_log_event(): 'read error', data_len:
  501, event_type: 2
040724 21:34:45  Error reading relay log event: slave SQL thread aborted because
  of I/O error
040724 21:34:45  Slave: Could not parse relay log event entry. The possible
  reasons are: the master's binary log is corrupted (you can check this by running
  'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can
  check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug
  in the master's or slave's MySQL code. If you want to check the master's binary
  log or slave's relay log, you will be ab
040724 21:34:45  Error running query, slave SQL thread aborted. Fix the problem,
  and restart the slave SQL thread with "SLAVE START". We stopped at log
  'ariel-bin.086' position 728454974 
Checking the binlogs showed a corruption in suda-relay-bin.028. Did a SHOW SLAVE STATUS, wrote down Exec_master_log_pos and Relay_Master_Log_File, and issued a RESET SLAVE to drop the slave configuration. Used the CHANGE MASTER and START SLAVE commands from this page to re-establish replication (a sketch is at the end of this section).
  • 16:35 Jamesday: set global query_cache_size=24000000; on Ariel. Typically 14-18MB unused with 32MB so no need to allocate 32M. Live and my.cnf.
  • 16:25 Jamesday: set global wait_timeout=900 on Ariel. Suda and will waiting for permission change to my.cnf. See Lock wait timeout fix. Live and my.cnf.
  • 13:09 jeronim: noting backup-wikipedia script problems:
    • doesn't delete old split files before starting to make new ones -> people may download chunks from both current and older dumps (possible solution: give the chunks names which indicate the dump date)
    • split files are always generated (for the old table) whether needed or not
    • md5sums not calculated on the fly -> dumps read from disk and transferred over network unnecessarily
  • 13:00 jeluf: Refreshing caches of all (I hope) special pages using /home/wikipedia/bin/update-special-pages.
  • 01:45 Jamesday: On Ariel set global transaction_alloc_block_size=32768 and set global transaction_prealloc_size=32768 to match query_alloc_block_size and query_prealloc_size. live and my.cnf changed.
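A sketch of the re-pointing described in the 23:00 entry, wrapped in a mysql client call (the replication user and password are placeholders; the file/position shown are the values from the error output, while the values actually used came from SHOW SLAVE STATUS):

# on suda, after RESET SLAVE: re-point at the master and restart replication
mysql -e "CHANGE MASTER TO MASTER_HOST='ariel', MASTER_USER='repl', MASTER_PASSWORD='********',
          MASTER_LOG_FILE='ariel-bin.086', MASTER_LOG_POS=728454974; START SLAVE;"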

21 July

  • 14:21 jeronim: updated DNS for wikibooks.org to include all 6 squid IPs
  • 07:45 tstarling: moved Wikisource from sources.wikipedia.org to wikisource.org

20 July

  • 20:00 jeluf: moved articles in de.wikiquote from names starting with Wikiquote: to namespace 4

19 July

  • 10:00 Jamesday: copying the no-query-cache setting for Ariel's MySQL to my.cnf. Data since Tuesday 12:00 showed an immediate response time improvement on Tuesday, and apparently stayed better until miser mode was turned on. Will try a small cache eventually to see if that beats none.
  • 08:00 jeluf: set squid back to 10 and 20 GB storage. Performance with 2MB is not acceptable.

18 July

  • 20:30 gwicke: squid disk cache is set to only 2MB temporarily to clear the cache of old es pages; it needs to be upped to the commented-out values (20GB on maurus and coro, 10 on browne) again when the interface is fixed. Run squidhup afterwards. Also disabled the parser cache for es in InitialiseSettings; it needs to be cleared as well. Please write some docs on how.
  • 20:00 shaihulud: es.wikipedia converted to utf-8; the MediaWiki cache needs to be cleared, if somebody knows how?
  • 09:00 shaihulud: dropped the fulltext index on en, de, ja, and fr on will, to stop the lag.
  • 15:18 tstarling: converted the wikipedias to use the shared document root layout, like the wikiquotes. Obsolete directories moved to old_wiki_dirs. Declared wiki.conf obsolete, replaced by the far smaller remnant.conf. Set up rewrite rules for /stats, redirecting to the appropriate stats directory in wikimedia.org.

16 July

  • 23:00 gwicke: looking into formal grammar and parsers, exploring bisongen and swig

15 July

  • 10:00 shaihulud: Enabled Miser mode until we get a dedicated server for slow queries.
  • 00:30 brion: Lots of Wantedpages queries on en bogging down ariel for about 10 minutes. Killed the threads and it's fine now. Ganglia shows a long load bump on suda before it switched to ariel; please check the replication load setup as well as securing the special pages better

14 July

  • 19:00 shaihulud: added a crontab entry on zwinger to recache special pages at 8:00 and 21:00 (a sketch is at the end of this section)
  • 18:00 gwicke: added script in my crontab on zwinger that creates a nightly snapshot tar of the stable branch (REL1_3 currently), url is http://download.wikimedia.org/phase3_stable.tar.bz2. Changing sf.net page to link to this.
  • 15:30 gwicke: collected Apache hardware quotes from my local shop, cheapest (AthlonXP2.8, 1Gb ram, 80Gb hd) is 370€/444$. Similar prices at http://www.pricewatch.com, possibly others or local Tampa shops
  • 10:15 gwicke: colorized wprc
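A sketch of the crontab entry from the 19:00 item, assuming it calls the update-special-pages script mentioned in the 24 July entry:

# zwinger crontab: recache special pages twice a day
0 8,21 * * * /home/wikipedia/bin/update-special-pages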

13 July

  • 23:30 gwicke: wprc installed, finally working..
  • 23:00 gwicke: Made the miser mode time-dependent in CommonSettings, enabled automatically between 13:00-19:00 UTC. Did more GFS testing with multiple-hour bonnie++ runs. Some discussion about moving CommonSettings, and at some stage also images, off NFS. Possible alternatives: proxy rewrite to a central image server plus some scripting to get the images there, AFS as an interim solution, or waiting for the GFS failover code release (likely on Friday).
  • 10:00 shaihulud: enabled miser mode. Really need a second db slave so more apaches :)
  • 09:36 shaihulud: restarted apache and squid on suda after adding a missing file and copying php.ini from another apache. What is the problem with webshop?
  • 06:14 jeronim: killed squid on suda as there are apparently webshop problems (register globals?) due to misconfiguration of suda's apache.

12 July

  • 21:00 or so, gwicke: Added suda again as an apache
  • 18:16 jeluf: killed squid on suda as apache produces <p> where none should be
  • 15:00 gwicke: suda runs as apache now
  • 15:00 shaihulud: disabled suda in load-balancing, still a db slave

11 July

  • 14:30 shai, tim, james : ariel is the master db. will and suda are slaves
  • 08:00 shaihulud: ariel is back; disabled miser mode and enabled full text search

10 July

  • jeronim: IRC proxy and a simple TCP forwarder to freenode running on zwinger (access restricted in iptables) - more information: IRC forwarding
  • 06:00 TimStarling, jeluf: Updated postfix. chown'ed /home/mailman/aliases.mailman to root:nobody, so that mailman scripts are called as nobody, not as mailman. Before this, mailman complained:
Jul 10 06:11:12 zwinger postfix/local[26093]: CBEDB1AC0004: to=<info-de-l@wikipedia.org>,
relay=local, delay=1, status=bounced (Command died with status 2: 
"/home/mailman/mail/mailman post info-de-l". 
Command output: Group mismatch error.  Mailman expected the mail wrapper script to be
executed as group "nobody", but the system's mail server executed the mail script as
group "mailman".  Try tweaking the mail server to run the script as group "nobody",
or re-run configure,  providing the command line option `--with-mail-gid=mailman'. )
  • 04:30 shaihulud: enabled miser mode, disabled text search, disabled ariel in load-balancing. Time to rebuild db on ariel

9 July

  • 16:20: shaihulud : disabled miser mode
  • 06:20: tim: use maintenance/fix_message_cache.php to fix complaints about non-local messages. memcached is breaking regularly, so to avoid hitting the database and making the site very slow, the web servers have been set to use local messages when memcached fails.
  • 01:33: jeronim: re-enabled text search, as ariel has caught up. Miser mode still on.
  • 01:01: jeronim: set miser mode on too to try to move things along more quickly
  • 00:55: jeronim: disabled text search in an attempt to unload ariel to let it sync to suda (replication lag 15 minutes)

8 July

  • 15:00: shaihulud: removed will. Too slow for heavy special-page queries: Lonelypages, etc.
  • 14:30: shaihulud: added will to the load balancing; ariel replication was lagging a lot because of load
  • gwicke did some benchmarks of MySQL/MyISAM vs. BerkeleyDB/python api, results at meta:Database benchmark. Working on a concept for a wiki based on SubWiki (using Subversion/BerkeleyDB and GFS). Got GFS working now, needs testing.

4 July

  • jeronim: htdocs backup to vincent is now done from root's crontab at 2 a.m. each day, using zwinger's rsyncd, like this:
nice -n 19 rsync --stats --verbose --whole-file --archive --delete --delete-excluded \
  --exclude=upload/timeline/ --exclude=upload/thumb/ --exclude=upload/**/timeline/ \
  --exclude=upload/**/thumb/ --exclude=**/dns_cache.db zwinger::htdocs /var/backup/htdocs
    • not sure if --delete (which removes files on the destination if they have been removed from the source) is such a good idea, as if somebody accidentally deletes things from the source and doesn't notice, then they will soon be gone from the backup too.
    • OTOH, image tarballs are planned to be generated from the backup, and having too much extraneous junk in them is no good
      • could solve this by keeping a few weekly tarballs somewhere

3 July

  • 21:03 gwicke: enabled $wgPutIPinRC, grepping the logs isn't practical anymore with ~200MB of logs per hour. The query is something like
select rc_ip, rc_timestamp from recentchanges where rc_user_text like 'Gwicke' and rc_ip != '';
  • shaihulud: As it seems to put too much load on zwinger, using isidore as the mysql server now. Loading in progress
  • 07:20: jeronim: killed rsync (over NFS) backup of htdocs to vincent, in favour of continuing yesterday's rsync to yongle from zwinger's rsyncd. Will duplicate it to vincent when it finishes.
  • jeronim: (yesterday) made image tarballs for all wikis, from the backups on vincent, and put them in the relevant directories with the dumps.
  • jeronim: (yesterday) installed rsyncd on zwinger, just for htdocs at the moment. World-readable, but restricted to cluster's subnet by iptables and rsyncd.conf.

2 July

  • shaihulud: Creating a slave with 3 innodb files split across 3 apaches. Main mysql server is zwinger. Dump loading in progress.
  • 13:14: Jamesday: Suda connections=11,844,357 threads_created=239,870. Change since 30 June: connections=6,892,087 threads_created=61,327 (112:1, 20.5 new threads per minute over 2979 minutes. Estimated prior value at 6.5:1 ratio 353/minute).
  • 08:09: jeronim: Moved /home/wikipedia/htdocs/en/upload-old to /home/wikipedia/en.wikipedia-upload-old so that it won't be rsynced to vincent anymore. Does anybody need this directory or can it be deleted?
apache   wikidev     36864 Jun 17 00:23 upload
root     wikidev     36864 Jan 23 22:29 upload-old

30 June

  • 11:35: Jamesday: set global thread_cache_size=90 on Suda. Not in my.cnf. connections=4,952,270 threads_created=178,543 (28:1) at this time.
  • 07:00: brion: tracking down PHP errors. squid debug statements broke image thumbnails and various; removing. /home/wikipedia/sessions was broken on whatever was using them (bad permissions). fixed an Article::getContent interface error in the foundation page's extract.php which produced annoying warnings. still tracking down others.

29 June

  • 19:30: gwicke: updated test from cvs, just cvs up in /home/wikipedia/htdocs/test/w with zwinger's pass
  • 18:00: shaihulud: doing some nightly dump of database on will, stopping slave. Dumps are in /home2 on will
  • 16:34: shaihulud: miser mode disabled
  • 16:08: shaihulud: enabled miser mode, time to resync ariel.
  • 07:22 set global thread_cache_size=40 on Suda. Not in my.cnf. connections=842,104 threads_created=69,105 (12:1) at this time. Jamesday 08:13, 29 Jun 2004 (UTC)
  • 00:39 MySQL on Suda reported receiving signal 11. Brion restarted it and it recovered successfully.

28 June

  • 17:15 After discussion set global thread_cache_size=90 on Suda and Ariel. Within each hour at moderately busy times Suda routinely cycles within a 90-100 connection range, so there's no need to make it do more work creating new threads while something is slowing it down and causing the number to increase. Set only interactively, not in my.cnf, for observation over the next few days. Connections were 28,419,175 and threads_created 4,373,734 (6.5:1) an hour after setting. Jamesday 18:42, 28 Jun 2004 (UTC)

27 June

  • 21:00 - high load on zwinger, site frozen since NFS was too slow. Brion used the APC to power-cycle zwinger. syslog showed that the machine was short on memory before crashing. Activated swap to prevent future crashes.
  • 20:30 - jeluf : installed BigSister server to mormo. Doing some remote tests for maurus, browne, coronelli (port 80 connection), zwinger (DNS, smtp, ssh), gunther.bomis.com (DNS). Installed local agents to ariel and will. Agents are running as user bs (id 200), installation is in /home/bs/. Monitoring console is at http://mormo.org/bs/ . Other agents will be installed in the next days.
  • 9:20 - shaihulud : added an innodb file on /mnt/raid1 on Ariel
    • Doesn't seem to work; I'll have to copy the db from will to ariel.

26 June

  • 18:30 - shaihulud: moved data in /mnt/raid1 on ariel to zwinger on /mnt/olddrive (the old 80G drive)
  • 8:00 - jeluf: investigated the 100% CPU issue on Ariel. CPU load was in user context. Neither top nor ps displayed a process using more than 1% of CPU. Shifting load to suda, CPU went down to 0%, while the increase on suda was not visible. Finally restarted mysql on ariel; for a short time the CPU fingerprint looked normal (80% idle, 20% io-wait), then back to 100% user :-(
  • Tim is doing searchindex updates: http://mail.wikipedia.org/pipermail/wikitech-l/2004-June/010885.html
  • 7:30 - jeluf: on will, configured ntp and syslog remote logging to zwinger.
  • 03:01 - jeronim: spamblocked www1 dot com dot cn at request of Guanaco on irc. example diff: http://wikibooks.org/w/wiki.phtml?title=Wikibooks:Sandbox&diff=38922&oldid=38633

25 June

  • 20:30 - shaihulud: installed mytop on will
  • 20:00 - shaihulud: Tried to move the searchindex table on Suda to the first RAID5, to improve speed; same problem.
  • 11:05 - gwicke: Re-enabled misermode and disabled search on all wikis as suda was under heavy load from the jawiki search index rebuild. Many timeouts. Coda didn't like bonnie benchmarks. Compiling Arla now. OpenAFS seems to be slower than Arla and doesn't work on my machine so far.

24 June

  • 23:01 - gwicke: Images can now be protected. Added a request section to meta:Requests for logos for wikis that have the logo uploaded as Wiki.png; those can be changed (see wgLogo in InitialiseSettings.php, there's already one row with the /b/bc/Wiki.png bit)
  • 19:10 - jeluf: suwiktionary repair: stopped slaves on both ariel and will, so that Read_Master_Log_Pos was the same. mysqldump'ed suwiktionary on ariel. Started the slave on ariel. mysqladmin drop suwiktionary on will didn't succeed, so after talking to guys on #mysql, removed the categorylinks.frm file manually. The drop then succeeded. Created the db and populated it using the suwiktionary dump from ariel. Started the slave on will.
  • 10:00 - jeluf: suwiktionary replica on will is broken. needs fixing when replication is in sync.
  • 9:20 - hashar: gracefulled all apaches so they know about webshop. Notified TomK32.
  • 8:56 - hashar: edited /home/wikipedia/conf/webshop.conf and added a "ServerAlias webshop.wikipedia.org" by request of TomK32. Apache not reloaded.
  • 0:00 - jeluf: set up a new mysql replica on will. It is doing much better than zwinger, which was not fast enough.

23 June

  • zwinger mysql is down. Did someone kill it, or did it shut down by itself, or what?
  • jeluf killing many "failed replications" on ariel
  • 922 threads on mysql on suda; mostly UPDATE LOW_PRIORITY searchindex
  • TimStarling granted a non-root / wikidev shell account to Hashar. Acknowledged by brion at least.
  • jeronim placed squids in offline mode for a while in an attempt to unload the database so that many killed threads will stop taking up slots
  • 11:32 - still 587 Killed queries which won't go away
  • 11:48 - number holds steady at 587
  • Gwicke is researching network filesystem alternatives, current candidates: Coda, OpenAFS, Lustre. Coda already installed on Zwinger. Will start to summarize things at http://wikidev.net/Network_file_systems
  • 15:37 approx - the 587 Killed queries have gone, finally
  • 17:24 Added LoadBalancer in PageHistory.php
  • malicious-looking bot on de:, hitting revert/delete links on image pages but not actually logged-in, so it shouldn't be able to see the links... blocked in squid for the meantime (217.85.228.77)
    • relevant log bits on ariel in /tmp/squid2/
    • last responding router in traceroute: dd-ea1.DD.DE.net.DTAG.DE (62.154.87.58) (Dresden)
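A sketch of blocking a single client in squid.conf, as described in the bot entry above (the acl name and where the deny goes relative to other http_access lines are assumptions):

acl badbot src 217.85.228.77
http_access deny badbot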

---

  • 18:36 jeronim still trying to get PHP working on ariel, just to do some PHP CLI stuff.. some details at PHP on ariel. Help!
    • gave up installing from source, went back to the yum/rpm. There is junk lying around from the source installs, and there is no uninstall target.

yum --download-only install php-mysql
# the downloaded rpm is left in /var/cache/yum
rpm -i --nodeps php-mysql-4.3.6-5.x86_64.rpm

--($:~/incoming)-- php gmetric_repl_lag.php
PHP Warning:  Unknown(): Unable to load dynamic library '/usr/lib64/php4/mysql.so' - 
libmysqlclient.so.10: cannot open shared object file: No such file or directory in 
Unknown on line 0
PHP Fatal error:  Call to undefined function:  mysql_connect() in 
/home/jeronim/incoming/gmetric_repl_lag.php on line 15

So, no replication lag metric for ariel until this mess is sorted out. It's working on will though: http://download.wikimedia.org/ganglia/?m=repl_lag&r=hour&s=descending&c=Florida+cluster&h=&sh=1&hc=4

---

  • /etc/yum.conf on ariel altered to use some mirrors. Original is in yum.conf.original -- Jeronim 22:35, 23 Jun 2004 (UTC)

before 2004-06-23

  • will is back; tested with mprime, it rapidly reaches 60C and goes into throttled mode
  • several new ganglia metrics in the last few days:
    • mysql_qps - queries per sec (average over 14 second period)
    • mysql_in - incoming connections ("established" in netstat) to mysql port
    • mysql_out - outgoing to mysql
    • http_in - incoming to port 80 on apaches
    • squ_in - incoming to port 80 on squids
      • the *_in and *_out metrics use netstat/awk/grep and consume a fair amount of CPU time, at least on the squids, which tend to have close to 1000 established connections at once (a sketch of one such gmetric call follows this list)
  • KILL -9 for MYSQLD IS BAD!! :)
  • jeronim removed xfs from startup for servers with chkconfig --del xfs
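A sketch of the kind of gmetric call behind the *_in/*_out metrics mentioned above (the exact pipeline, port, and metric type are assumptions):

# count established connections to the MySQL port and push the number into ganglia
gmetric --name mysql_in --type uint32 \
        --value "$(netstat -tn | grep ':3306 ' | grep -c ESTABLISHED)"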

