- 17:15 shaihulud: stopped postfix
- 16:45 shaihulud: bayle started misbehaving and the whole wiki went down. Removed bayle from the memcached list to recover the wiki.
- 09:15 brion: es.wiktionary fixed and unlocked (compression corruption)
- 07:25 brion: locked es.wiktionary while looking into data corruption problem
- James having connection trouble, here's my work in progress in case it causes me to vanish as it did on Friday:
- Ariel has all missing searchindex tables generated but indexes still disabled. Working on building the new consolidated stop word list before I turn them on and rebuild indexes.
- Bacon is having NFS trouble and is out of service at the moment. If it isn't fixed, I plan to restart the machine, which is the one way I know that should get it back.
- ifconfig eth1 ibacon fixed NFS trouble, as far as I can see. /etc/sysconfig/network-scripts/ifcfg-eth1 looks fine, no idea why it doesn't show up after a reboot. -- JeLuF, 23:20
- 22:30 jeluf: restarted slave on bacon.
- 23:29 kate: blocked image referers from ranking-check.de on request of tomk32
- 08:50 brion: Added MIDI and OpenOffice.org document types to upload whitelist on popular demand and belief that they are not terribly dangerous.
- nn:nn Jamesday: the new search index update procedures for slaves are documented at Searchindex update. New tool to perform database commands on all wiki databases mentioned there.
- Tim: got hypatia to take over yongle's role with log handling. Hypatia has a bigger HD, but we will still need to clean it periodically (or write a script to do so) if we want to avoid filling it. Using NFS v2, not sure how that happened exactly.
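The periodic cleanup suggested in the entry above could be scripted roughly as follows. This is only a sketch: the function name `prune_old_logs`, the directory argument and the retention period are illustrative assumptions, not an existing script on the cluster.

```shell
#!/bin/sh
# prune_old_logs DIR DAYS
# Deletes regular files under DIR that were last modified more than
# DAYS days ago. Both arguments are supplied by the caller; nothing
# here is specific to any one machine.
prune_old_logs() {
    dir=$1
    days=$2
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    find "$dir" -type f -mtime +"$days" -exec rm -f {} +
}

# Example (placeholder path, arbitrary 30-day retention), e.g. from cron:
# prune_old_logs /path/to/logs 30
```

Running it from a daily cron job would keep the partition from filling without anyone having to remember to clean it by hand.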
- Erik: I've created a small script, /home/erik/extractdb.pl , which puts all the images from the Commons that are used in a database into a directory, for easy redistribution. I've also added example dumps for fr: and en: to download.wikimedia.org, stored in /var/backup/public/wikipedia/(lang) on albert. Note to others: Can we please make it clearer where the download index is? Right now it's in htdocs/download/dl2, which is very confusing.
- 15:40 hashar: spamlisted www(dot)lemai(dot)com. Requested by dori on irc (multi spam).
- 06:30 kate: upgraded zwinger to openssh 3.9p1, to fix user logout/utmp problem.
- 21:30 brion: cleared out /var/backup/archiv and /var/backup/incoming from yongle. Squid logs taking up huge amounts of space, and when they fill up the partition they cause script updates to fail and break the wiki. Can we please not automatically dump vast sums of useless logs onto a drive that needs to have free space? Thanks.
- 00:45 brion: Increased squid request_body_max_size from 8 MB to 20 MB (upload max size has been increased, large file uploads to commons were failing)
- 20:00 JeLuF: Installed spamassassin on albert, filters OTRS traffic.
- 06:50 Jamesday: all slaves are now set not to replicate searchindex updates. I'll be scripting load from master commands later today. This avoids the need to disable search on all slaves for many hours while the update is running - instead one at a time can be disabled for a relatively short time.
- 22:00 JeLuF: set up spamassassin on test1. changed aliases on zwinger to send noc and info-de mails to both albert and test1 to have a preview of the results spamassassin can provide.
- 23:06 kate: made tokiponawiki and tokiponawiktionary read-only
- 22:35 kate: added a 3rd RC bot for de: and moved them back to vincent.
- 17:49 Tim: Moved IRC bots to humboldt after I got vincent and hypatia firewalled from Freenode. Rewrote some stuff. PHP script now directs output to a file. 10 multi-channel bots (mxircecho.py) monitor a file each, 30 PHP scripts pipe output to the files.
- 21:48 kate: setup VPN on larousse
- 09:00 shaihulud: upgraded mysql to 4.0.22 on bacon
- 17:06 kate: put 'hashs' table online, which stores old_text's md5sum for quick comparison
- 02:43 kate: mail to all system addresses except root@ (webmaster, etc) now goes to OTRS 'noc' queue. if you don't have an account please talk to me or elian or jeronim who can create you one.
- 18:00 jeluf: stopped and started postfix. It had stopped delivering mail at ~ 12:00 UTC for unknown reasons.
- 11:40 tstarling: script now references /apache/common/php-new/CommonSettings.php instead of /home/wikipedia/common/php-new/CommonSettings.php. This means you have to use sync-file when you change it.
- 05:52 jeronim: altered /opt/otrs/Kernel/Config.pm so that the below problem should go away
- 05:20 jeronim: someone from AOL is being logged out from OTRS because of their rotating proxy setup:
Removed SessionID 182261488cd4d4f6eccfda48994610ac4.
RemoteIP of '182261488cd4d4f6eccfda48994610ac4' (xx.yy.zz.bb) is different with the request IP (xx.yy.zz.aa). Don't grant access!!!
- jeronim: robots.txt doesn't disallow things like Spezial:Randompage on de:
- jeronim: albert:/etc/sysconfig/SuSEfirewall2 has FW_ALLOW_FW_BROADCAST="int", but int is not one of the listed options
- 11:00 jeronim: albert: was getting lots of ip_conntrack: table full, dropping packet, so fixed it with echo "65536" > /proc/sys/net/ipv4/ip_conntrack_max (old value was 16384). Added this line to /etc/rc.d/boot.local so it's set at boot.
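A small sketch of how the conntrack table's fill level could be checked before (or after) raising the limit. The function name `conntrack_usage` is hypothetical, and the file holding the live connection count is left as a parameter because its location depends on the kernel version; only the `ip_conntrack_max` path comes from the entry above.

```shell
#!/bin/sh
# conntrack_usage COUNT_FILE MAX_FILE
# Prints the percentage of the connection-tracking table in use.
# Assumption: each file contains a single integer.
conntrack_usage() {
    count=$(cat "$1") || return 1
    max=$(cat "$2") || return 1
    [ "$max" -gt 0 ] || return 1
    echo $(( $count * 100 / $max ))
}

# Example: warn above 80% (the threshold is arbitrary).
# usage=$(conntrack_usage /path/to/count /proc/sys/net/ipv4/ip_conntrack_max)
# [ "$usage" -gt 80 ] && echo "conntrack table ${usage}% full"
```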
- --:-- jeronim: khaldun's OS is reinstalled with /usr mounted on a reiserfs partition, which has about 450GB free. / has about 6GB free. Apache etc set up and working. Besides the reiserfs partition, khaldun has the same setup as averroes. There's a bug in the FC2 installer which prevented a successful installation with reiserfs on the root partition, so I installed it with a large reiserfs partition mounted at /a and then transferred /usr to it later.
- 12:40 Jamesday: Bacon: with swap off mysqld had 98M swap and cache was 1197M. Now set swappiness to 0 and enable swap.
- 01:55 Tim: Set up a default read-only file in the upload directory, for all wikis except a short list of the largest ones. This is to allow stewards to lock small wikis which are inactive, squatted or created in error.
- 16:39 Hashar: disabled first letter capital for gu wiktionary.
- 13:00 Jamesday: On bacon set swappiness to 10 with echo 10 >/proc/sys/vm/swappiness (default 60, scale 0-100, higher more prone to swap active tasks), turned on swap and restarted MySQL. Swap still completely off on Suda.
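The swappiness change above can be wrapped in a small helper that rejects values outside the 0-100 scale mentioned in the entry. This is a sketch only; the function name `set_swappiness` and the optional file argument (used so the helper can be tried against a scratch file) are assumptions.

```shell
#!/bin/sh
# set_swappiness VALUE [FILE]
# Writes VALUE to the swappiness control file after checking that it is
# an integer between 0 and 100. FILE defaults to /proc/sys/vm/swappiness,
# which requires root to write.
set_swappiness() {
    value=$1
    file=${2:-/proc/sys/vm/swappiness}
    case $value in
        ''|*[!0-9]*) echo "swappiness must be an integer" >&2; return 1 ;;
    esac
    if [ "$value" -gt 100 ]; then
        echo "swappiness must be between 0 and 100" >&2
        return 1
    fi
    echo "$value" > "$file"
}

# Example, as root: set_swappiness 10
```

Like the plain `echo`, this does not survive a reboot, so the setting would also need to go into a boot script to be permanent.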
- 10:50 jeronim: fixed maurus and rabanus squids - problem was that you have to set ulimit -n 4096 in the shell before you compile it. Current build is in /home/wikipedia/src/squid/squid-2.5.STABLE4-20040219.
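To avoid repeating this mistake, a build wrapper could verify the file descriptor limit before compiling. The sketch below assumes a hypothetical helper `fd_limit_ok`; the commented build commands are a generic placeholder, not the exact procedure used for the squid build.

```shell
#!/bin/sh
# fd_limit_ok REQUIRED [CURRENT]
# Succeeds if the soft file-descriptor limit is at least REQUIRED.
# CURRENT defaults to the calling shell's `ulimit -n`.
fd_limit_ok() {
    required=$1
    current=${2:-$(ulimit -n)}
    [ "$current" = "unlimited" ] && return 0
    [ "$current" -ge "$required" ]
}

# Hypothetical build wrapper: raise the limit, verify it, then build.
# ulimit -n 4096
# fd_limit_ok 4096 && ./configure && make
```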
- 04:59 kate: added contact-nl and moderators-nl lists for walter
- 02:40 jeronim: added second instances of gmond and gmetad everywhere, running as processes gmond-all and gmetad-all. Ability to view all servers on one page is now restored; temporary URL is http://zwinger.wikimedia.org/ganglia-all/
- 00:00 jeronim: noted that rabanus and maurus' squids now have a file descriptor limit of 1024 instead of 4096 as they had before, which explains why they complain all the time
- Jamesday: turned off swap on Suda and bacon as part of tests to see if swap growth is controllable.
- 23:06 kate: reinstalled php 4.3.8 using gcc compiler, with a hack to print the hostname in errors. errors disappeared....
- 19:36 kate: rabanus is now using squid-2.5.STABLE4-20040219.wp[icpfix,nortt]. this is gwicke's squid with a hack to ignore neighbour response time when choosing a parent. this may help with the issues of new apaches getting less load.
- 15:13 hashar: disabled full text search. Looks like bacon mysql is dead.
- 8:41 hashar: updated mrtg graphs (- bart, bayle, coronelli + will).
- 6:15 jeluf: moved memcached's. 4*bayle, 5*isidore added, 2*yongle removed.
- 23:00 jeluf: copied /usr/local from tingxi to the 5 new servers. started apaches and squids and added them to the configuration of the external squids.
- 21:00 jeronim: renamed 4 of the 5 new servers because it was discovered that one user had removed all the naming suggestions and replaced them with his own, and those names were then used for the new servers...
- biruni -> hypatia
- smellie -> humboldt
- holbach -> kluge
- rustah -> averroes
- khaldun - not renamed
- 18:56 kate: created wikikn-l list
- 13:40 jeronim: set correct names on serial ports on SCS for the new machines, and edited /etc/rc.d/rc.serial to ensure that port speed of each of the main serial ports (i.e. all but the 2 serial ports which are used for connecting to the SCS) is set to 19200 baud on boot
- 12:30 jeronim: set up the 5 new servers: (Did I miss anything?)
- updated /etc/hosts on all machines
- added permissions for new servers in zwinger's and yongle's /etc/exports
- added hostnames in dsh node groups
- set up network interfaces, firewall, hostname, users and passwords, ssh keys, syslogging to izwinger, ntpd, NFS mount of home directory and yongle backup dir
- installed and configured apt, zsh, vim-enhanced, screen, ganglia-monitor-core-gmond, ImageMagick, tetex, tetex-latex, tetex-fonts, tetex-dvips, tidy, libtidy
- did yum update
- no apache, mysql or squid installed
- made notes on the process on Add a server
- 08:00 Jamesday: changed MySQL settings on bacon and Suda to give more RAM for search, less for InnoDB.
- 20:41 kate: dalembert gives php errors, stopped squid, can someone look into why?
- 19:06 kate: disabled lame server logging on zwinger named
- 12:07 tstarling: clicked the wrong combo box in yang's web interface and an onChange action shut down the uplink and killed my tunnel. Took over an hour to fix, during which time download.wikimedia.org was down. Other services were maintained since they were on the other switch. The object of the exercise was to shut off the APC from the network, thereby fixing the security problem with it. To log into the APC, enable port 23 on yang.
- 21:40 shaihulud: rabanus is down, moved its IPs to 3 other squids
- shaihulud: installed mysql 4.0.22 on albert. Slave stopped; a backup is running
- 12:23 kate: converted svwiktionary to lowercase
- 08:30 jeronim: at Angela's request, changed board (at) wikimedia.org to forward to the OTRS and no longer forward to the individual board members
- 20:00 jeronim: OTRS reinstalled on albert, and upgraded to 1.3.2
- 09:00 jeronim: installed yum and apt on larousse and pliny (both of which have Redhat 9 installed), and updated from fedora legacy repository using apt. Had problems with yum. Use apt for now.
- 08:00 jeronim: synced users, passwords, keys, ntpd config, etc, to that of the rest of the cluster, on test1, test2, larousse, pliny, will
- 07:30 Jamesday: old rebuild/defragmentation for en on Bacon took 3 hours 57 minutes for 6.7 million rows. Used 40 more GB for the table space (created by the temporary table copy) and left 80GB total free, compared to 4GB before the rebuild. Better to use a temporary MyISAM table, drop the current table and then alter the MyISAM table type. That will avoid growing the tablespace size. On the to do list for tomorrow...
- 07:07 jeronim: updated /home/config/others/etc/hosts and copied it to /etc/hosts on each machine. Inserted note to admins to edit the master file rather than the local file, and then copy it.
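The "edit the master, then copy it" procedure above could be scripted along these lines. This is a sketch under assumptions: the function name `push_hosts`, the use of `scp` as root, and the `RUN` switch are all illustrative, and the log does not say how the copy was actually done. By default the function only prints the commands it would run.

```shell
#!/bin/sh
# push_hosts MASTER HOST...
# Copies the master hosts file to /etc/hosts on each HOST.
# Dry run by default; set RUN=1 in the environment to really copy.
push_hosts() {
    master=$1
    shift
    for host in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            scp "$master" "root@$host:/etc/hosts" || echo "copy to $host failed" >&2
        else
            echo "scp $master root@$host:/etc/hosts"
        fi
    done
}

# Example dry run with the master file named in the entry above:
# push_hosts /home/config/others/etc/hosts bart bayle
```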
- 03:00 Jamesday: bacon not replicating while en wikipedia old is rebuilt to reclaim fragmented space after compression. Also, Ariel didn't start that way and periodically rebuilt tables have proved faster in slave service.
- 02:00 Jamesday: searchindex keys are now turned back on on Bacon and Suda.
- Jamesday disabled search on slaves to keep up with searchindex update. Then disabled searchindex table indexes when that wasn't enough.
- jeronim/jwales: colo work today:
- 2 4Us, test1 and test2, set up with minimal install of FC1.
- larousse and pliny are for Wikimedia use now. Users/passwords not fully set up yet.
- 8 machines plugged in to APC. Port labels in its web interface need updating but they are correct on the APC page.
- BIOS console redirection set up on some of the old batch of P4s
- 2nd ethernet port of SCS connected to public switch?
- Many machines connected to SCS's serial ports?
- albert appears to have eth0 connected to a switch again
- switch port assignment list is still only on paper, and jwales is fairly sure that this list is correct
- private switch not connected to a public switch as no free ports available (and no VLAN set up yet anyway)
- ??:?? kate: put albert's ext IP on zwinger to serve an error message for download site
- colo work today: console server working; albert, ariel, bacon, bart, isidore, moreri connected to it. bios serial emulation set up on several systems; use "connect <name>" to connect. albert reinstalled with SuSE 9.1 and reiserfs, using new kernel from jfbeam. albert is missing eth0's network cable, will be fixed tomorrow. several apaches back in service, bayle still down (failed memtest).
- 23:40 Jamesday: compression of en wikipedia old revisions resumed. Finished a couple of hours later.
- 17:27 kate: the IP 184.108.40.206 is now an automatically failed-over interface between bacon and albert (it always points to one of them). same with 2001:470:1f01:367::1/64. apache started on bacon in ipv6-proxy mode. will setup tunnel to use virtual ip and auto-failover v6 access.
- 14:25 jeronim: changed internal IP of console server to 10.0.0.220 (note it doesn't have an external IP yet)
- 09:00 Jamesday: Compress old revisions for enwiki stopped at 5467442 of 6.7 million. Will finish next off peak time. Ariel binary logs have been purged - used 20GB to get this far.
- 08:03 kate: moved vincent to NAT for external access. RC bots got k-lined despite being told it would be ok. asked freenode to fix it.
- 07:00: Jamesday: removed bart from memcached list in CommonSettings.php. It's down again, needs more testing.
- 02:54: jeronim: changed root password on console server to match that of the other machines
- jeronim: made a table of usage of IP addresses in the 220.127.116.11/26 space
- some of what happened at the colo today can be seen at Template:Todo
- jeronim: short notes on rsync usage, as client & daemon
- Jamesday: old compression for en wikipedia in progress off peak.
- Bart is in use as memcached, to replace isidore while it is in memtest. bayle, coronelli and moreri are also running memtest overnight. Albert is currently out of service for file system tests.
- 23:02 jeronim: added nagios user and group for hashar to test nagios with, and put hashar in nagios group:
groupadd -g 113 nagios
useradd -u 113 -g nagios -d /home/hashar/nagios/ -s /bin/false -c "nagios monitoring system" nagios
usermod -G apache,nagios hashar
- 21:57 jeronim: bart's apache runs, accepts connections, but uses no or almost no CPU. Stopped it and removed it from apache dsh node group. It could be used as squid, however squid needs reinstalling as it has been overwritten by apache's companion squid. Please don't overwrite it, but rather move it to a different directory.
- 11:55 kate: tried setproctitle on apache, seems to cause too much load, removed it...
- 06:15 kate: albert is now proxying IPv6 requests to *.wikipedia.org for a trial period. to disable it, remove the Proxy lines from the ipv6.wikimedia.org virtual host in apache config, and change the zone file on zwinger. mrtg graphs of traffic at  (should this URL go somewhere else?)
- 4:20 jeluf: added memcached instances on isidore.
- 21:34 brion: added yongle to dsh apaches group, removed bayle (RIP)
- 23:47 brion: fix hostname on wikiia-l list from ia.wikipedia.org to mail.wikimedia.org
- 23:38 kate: memory errors and apparent filesystem corruption on bayle. rebooted it, didn't come back up.
- 17:07 kate: connected cluster to the IPv6 Internet via a tunnel. machines with kernels supporting ipv6 will have addresses configured automatically. current router is albert; firewall is configured to disallow any incoming connections. IPv6 DNS under <name>.g.n6.intern.wikipedia.org (global) and <name>.ll.n6.intern.wikipedia.org (link local). this was a really stupid way to set it up, i will come up with something better
- 13:25 kate: all apaches except vincent use BGP for NAT internet connection now (failover between zwinger and albert, default zwinger). See BGP
- 08:36 kate: put BGP daemon on dalembert and albert. dalembert receives its default route via BGP from albert. this can be used for automatic NAT failover in future
- 19:50: jeronim: decided to reboot bayle after seeing this:
Oct 22 19:29:54 bayle kernel: VM: killing process httpd
- it didn't come back up after rebooting, so 4 more memcached instances are lost. Took bayle out of $wgMemCachedServers
- jeronim: noticed some swap-related messages in syslog on bayle, like the below. Disabled swap and commented it out in /etc/fstab, but the messages continued to come intermittently.
Oct 22 19:24:16 bayle kernel: swap_free: Unused swap file entry 00683050
Oct 22 19:24:16 bayle kernel: swap_free: Bad swap file entry 88683050
Oct 22 19:24:16 bayle kernel: swap_free: Unused swap file entry 00b2f710
- ~19:00 jeronim: modified /home/config/others/root/.bashrc and copied it to /root/.bashrc on all (living) machines. Made it check /etc/redhat-release and, if it indicates that the system is Redhat 9, export LD_ASSUME_KERNEL=2.2.5 so that rpm, yum and apt-get will work. Also added a personalisation for root's environment for myself, which only takes place if the ssh connection is from my IP.
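The release check described above might look something like the following fragment. It is a sketch rather than the actual contents of the shared .bashrc: the function name `needs_assume_kernel` is hypothetical, and the exact wording matched in the release file is an assumption about what Red Hat 9 writes there.

```shell
# needs_assume_kernel [RELEASE_FILE]
# Succeeds when the release file identifies the system as Red Hat Linux 9.
# The string matched below is an assumption about that file's contents.
needs_assume_kernel() {
    release_file=${1:-/etc/redhat-release}
    [ -r "$release_file" ] && grep -q 'Red Hat Linux release 9' "$release_file"
}

if needs_assume_kernel; then
    # Value taken from the log entry above; lets rpm, yum and apt-get run.
    export LD_ASSUME_KERNEL=2.2.5
fi
```

Keeping the test in a function means the same .bashrc can be copied unchanged to Fedora machines, where the export is simply skipped.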
- 18:03 jeronim: edited InitialiseSettings.php to allow subpages in ns0 on foundationwiki
- 16:38 hashar: updated sitename for it wiktionary, quote and books (requested by Gerard Meijssen).
- 27:42 kate: all nfs now on internal network, except /home on ariel and couple others (but changed in fstab, will take effect on reboot)
- 12:33 kate: changed all apaches and squids to communicate via internal IPs 10.0.0.0/16 on eth1, on the private switch. $wgSquidServers and squid.conf changed. seems to work.
- 08:00 shaihulud: copied db from ariel to bacon (site read only on suda during the copy)
- Disk full on Ariel. All slaves lost sync and will need to be re-cloned. May be best to switch to Suda as master since it has more free space to work with. Can use any slave for read only while cloning ariel to the other.
- 18:30 shaihulud: adding back yongle in squid.conf to be able to use it as apache
- 12:20 shaihulud: installed squid on will; it currently uses only .248 and .245
- ~04:20 kate: bart (again), moreri and coronelli are down (unknown reason). some more RAM is installed in other servers (User:Jamesday/RAM moves)
- 10:44 bart doesn't return after reboot.
- 9:28 brion: hacked $wgUploadPath on enwiki to bypass the /upload -> upload.wikimedia.org redirect as an extra boost. munged wikibugs logs file to get scap notify working again
- 9:10 shaihulud: as yongle, one of the memcached servers, seems to be down, removed it from CommonSettings.php
- ~22:00 jeronim: Noted that avicenna, diderot and goeje consistently had about 120 established HTTP connections at any time, compared to 1-5 for the other apaches. Graceful restarting, normal restarting, and force-killing and then starting the apaches did nothing to change this. Suspected tugela, so switched all tugelas to memcached, but still no change even after each type of restart. Noted that both before and after the switch to memcached, killing the squid on the apache would see the established HTTP connection count drop quickly. Tried rebooting avicenna, but apache anomaly still remained. Took all 3 machines out of service, and found that the wiki seemed faster. Left it for a while and then put them back into service for a short time, and again, wiki got slower. Checked all apaches for defunct processes, and found that these 3 machines were the only ones which had any, with about four defunct httpds each. Took them out of service again.
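The two checks used in this investigation (established HTTP connections per apache, and defunct httpd processes) can be expressed as small filters. This is a sketch: the function names are hypothetical, and the column positions assume the usual Linux `netstat -tn` and `ps -eo stat,comm` output layouts.

```shell
#!/bin/sh
# count_established PORT
# Reads `netstat -tn` style output on stdin and counts ESTABLISHED
# connections whose local address ends in :PORT.
count_established() {
    awk -v port="$1" '$6 == "ESTABLISHED" && $4 ~ (":" port "$") { n++ } END { print n + 0 }'
}

# count_defunct NAME
# Reads `ps -eo stat,comm` style output on stdin and counts zombie
# processes (state starting with Z) whose command name is NAME.
count_defunct() {
    awk -v name="$1" '$1 ~ /^Z/ && $2 == name { n++ } END { print n + 0 }'
}

# Typical use on one apache host:
# netstat -tn | count_established 80
# ps -eo stat,comm | count_defunct httpd
```

Run across the apache group, the two numbers would make hosts like avicenna, diderot and goeje stand out without manual inspection.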
- 20:17: kate: added /home/kate/public_html/cgi-bin as "Deny from all" on apaches (not zwinger)
- 19:15: jeronim: made root's mail on other machines forward to root (at) wikimedia.org
- 18:56: jeronim: changed postfix aliases on zwinger so that some humans get root's mail:
# Person who should get root's mail. This alias
# must exist.
# CHANGE THIS LINE to an account of a HUMAN
- 13:35: Hashar: cs.wiktionary is now case sensitive.
- jeronim: example of how to alter the mailman archives, for example to delete a message: Alter mailman archives
- jeronim: instructions on how to use email to get the colo to reboot a machine are at /home/wikipedia/doc/colo
- 6:19 brion: added hack to Title.php to keep %XX sequences out of page titles. Yongle seems to be offline.
- 17:38 Jamesday: Same error on Bacon. Fixed.
- 16:58 Jamesday: Bacon had a "could not parse relay log entry" error. Fixed.
- 08:13 Tim: Added ang to langlist
- 02:00 Jamesday: stopped profiling run.
- 01:50 Jamesday: Bacon had a damaged relay log file. Fixed by making it re-fetch using the procedure at MySQL server tools.
- 01:20: kate: entire site went down while squids copied logs to yongle then sent several thousand queries to the database simultaneously. can we stop this happening in future?
- Yes, I've altered the cronjobs so that the copy jobs are staggered. -- Jeronim 11:10, 16 Oct 2004 (UTC)
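One way to stagger such cron jobs is to derive each host's start minute from its position in a list. The sketch below is illustrative only: the function name `stagger_minute`, the host names, the 10-minute spacing, the hour and the script path are all placeholders, not the actual cron configuration.

```shell
#!/bin/sh
# stagger_minute INDEX STEP
# Prints the minute of the hour at which the INDEX-th host (counting
# from 0) should start, spacing hosts STEP minutes apart and wrapping
# around at 60.
stagger_minute() {
    echo $(( ($1 * $2) % 60 ))
}

# Print example crontab lines, one per host (all values are placeholders).
i=0
for host in squid-a squid-b squid-c; do
    printf '%s 1 * * * /path/to/copy-logs.sh  # on %s\n' "$(stagger_minute "$i" 10)" "$host"
    i=$(( $i + 1 ))
done
```

With the spacing chosen larger than the time one copy takes, no two squids would hit yongle and the database at the same moment.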
- jeronim: Tested apache on albert and found that it works properly with files > 2GB in size, when the hosted files are on the local hard drive. There seems to be a problem when the files are on a network (NFS) drive.
- 22:05 jeronim: pure-ftpd on zwinger, as currently configured, enables uploaded files to be read by anons and is thus suitable as a warez dump. Stopped it for now.
- 08:03 Tim: Installed pure-ftpd on zwinger, for ISO upload purposes. Recommended hostname is ftp.wikimedia.org. Anonymous access allowed, gives read access to the public backup directory and write access to an upload directory.
- ??:?? kate: added voip99.com to the spam filter
- 16:45 Jamesday: saved profiling data to enwiki.profiling_during_peak and truncated profiling.
- 13:27 JeLuF: raised relative memcached priority on yongle with nice -10; Jamesday saved profiling data to this point in table enwiki.profiling_before_priority_change.
- 10:15 Jamesday: database slaves set not to replicate enwiki.profiling.
- 09:45 Jamesday: profiling started at 30 frequency, to run all day to get memcached performance details. Drop to 100 if there's any sign of trouble.
- 22:20 jeluf: vincent can't connect to irc.freenode.net at 18.104.22.168:6667 at the moment. For some reason, python's resolver prefers that IP. Added different IP in /etc/hosts to make scripts work.
- 20:00 jeluf: Also switched yongle, isidore, moreri to user tugela.
- 15:00 jeluf: Switched bart and bayle to run tugela as user tugela.
- 11:43 jeronim: switched all memcached to tugela
- 11:30 jeronim: yongle upgraded from fc1 to fc2. All machines are now fc2 except zwinger.
- 07:54 jeronim: maurus upgraded from fc1 to fc2
- 07:00 jeronim: rabanus is upgraded from fc1 to fc2
- 02:46 jeronim: restarted backup dumps using bacon instead of albert
- jeronim: coronelli and vincent are upgraded to fc2, leaving zwinger as the sole rh9 machine
- 15:30 jeronim: upgraded bart rh9 -> fc1 -> fc2. Noted that bart does not boot with FC2's 2.6.8 kernel.
- 12:45 jeronim: started database backup dumps
- 12:00 midom: noticed premature apache child exits on all servers (checked whether shm_only was the cause, but it did not appear to be)
- 10:00 (approx) jeronim: upgraded bayle rh9 -> fc1 -> fc2
- 09:26 midom: increased parser cache time to 2 days
- 15:18 jeronim: isidore and moreri upgraded from rh9 to fc1 and then to fc2, using  and  as a rough guide
- 16:00 jeluf: Switched bayle to tugela
- 15:10 midom: Removed Turck's on-disk cache, now it is working in shm_only operation.
- 13:50 jeluf: Switched bart from memcached (4 * 280MB) to tugela (4 * 128MB plus disk)
- 09:50 jeluf: Started apaches on goeje and diderot, which currently have too small disks to be mysql slaves.
- 08:18 kate: rebooted ariel (actually jeronim did) to switch to an SMP kernel
- 04:38 jeronim: added /home/mailman and /home/wikipedia/htdocs to PRUNEPATHS in /etc/updatedb.conf on zwinger so that updatedb doesn't take so long
- 21:25 jeronim: also moved ariel-bin.2?
- 18:56 jeronim: moved ariel-bin.24? from ariel:/usr/local/mysql/data/ to alrazi:/var/archive/ariel-binlogs/
- 17:10 jeluf: restarted squid on rabanus, high load
- 17:30 jeluf: added accounts for akl and elian to albert
- kate: started documenting local scripts in /home/wikipedia/doc/man/. please document more things here.
- 01:57 jeronim: changed squid log file format to native format (commented out emulate_http_log on) - hopefully this won't break webalizer etc. remember to check.
- 23:39 jeronim:
- suda mysql data directory is duplicated to bacon
- rc bots refuse to start:
[root@vincent root]# grep allrc /etc/rc.local
su - apache -c /home/wikipedia/common/php-new/irc/allrc.sh
[root@vincent root]# su - apache -c /home/wikipedia/common/php-new/irc/allrc.sh
This account is currently not available.
- copy of mysql data from suda to diderot should finish by about 01:20 UTC
- browne is dead; jimbo has taken it away to test - one each of its virtual IPs given to maurus and rabanus
- virtual IP of 192.168.0.242 on eth0:1 on vincent, for accessing the switches for management
- switches are yin - 192.168.0.240 - and yang - 192.168.0.241 - password is in /h/w/doc/apcpass - added these names to /etc/hosts
- 23:33 kate: suda mysql data being copied to diderot, don't start it. why doesn't mysql work on bacon?
- ??:?? jeronim: yongle disk was full again. Deleted logs from before 20th Sept from /var/backup/archiv
- 07:45 Jamesday+jeronim changed memcached configuration to eliminate memcached swapping. Now 4760MB of memcached.