Server Admin Log/Archive 3

From Wikitech

30 March

  • 23:45 mark: Testing of HTCP/multicast purges finished successfully, tingxi put back in the production pool
  • 23:00 jeluf: benet stopped, copying /usr/local/mysql to suda using rsync.
  • 20:40 mark: Tingxi taken out of production for some HTCP/multicast testing
  • 20:05 mark: Rebooted the Cisco switch to run an alternative software image
  • 19:00ish jeronim/chad: reinstalled suda with Fedora Core 2 - to match ariel - base system in 10GB partition on /dev/sda; post-install setup done by dammit/jeluf/etc
  • 13:25 jeronim: ganglia fixed. Ganglia says how to fix it when it's broken - see section on cluster-wide ganglia restart
  • 8:45 brion: Started rinetd, restarted rc bots. Accidentally killed python scripts on zwinger; mailman, servmon and wikibugs restarted.
  • 8:30 brion: Enabled experimental category-union extension on WikiNews for production testing.
  • 2:11 brion: Still investigating problems since zwinger reboot. Ganglia stats are badly broken (not showing most machines), servmon appears to be either dead or hiding (documentation is out of date, referring to vincent which is down). rc bots still not connecting to freenode, unsure what they are trying to do; if they're trying to use a proxy on zwinger it's not working.

29 March

  • 23:39 brion: apparently somebody rebooted zwinger a couple hours ago and didn't note this in the log. rcbots seem to be running on hypatia but are not visible on freenode. restarted bots with no success in changing this situation.

28 March

  • 14:00 Tim: bacon couldn't keep up with replication, don't know why. Replication lag hit 10 minutes, I switched off its read load, it caught up quickly. Restored a reduced amount of read load.
  • 12:45 Tim: at Mark's instruction, restarted all squids to get rid of the client DBs from memory.
  • 11:40 Tim: restarted squid on bleuenn, it was out of memory. Reconfigured all the Paris squids with client_db off, like what Mark did for Florida yesterday.

27 March

  • 13:30 midom: after fixing fragmentation, cloned webster's MySQL to benet, though it does not survive much load. will be ok for further clones, or for more load on benet after some tuning.
  • 11:30 mark: Disabled client_db on FL squids, as it consumes ~ 500 MB on each squid with our amount of clients

26 March

  • 22:00 midom: Forgot to mention ;-) Did FS cleanup on bacon and cloned MySQL on it from webster.

25 March

  • 23:20 mark: Put tingxi back in production. Browne remains broken.
  • 21:00 mark: Taken tingxi and browne out of production pools to test Mediawiki and Squid HTCP patches for multicast HTCP purges
  • 20:13 hashar: rsync process on webster was slowing the site for the last 20 minutes or so. Kate killed it.
  • 17:00 Gwicke: restarted squid with lower cache_mem (60m) on ennael, was swapping a lot with wait > 60%. Started to install Squid3 on chloe, but it's too slow currently- configure takes >40 minutes with 0% idle cpu left..

24 March

  • 02:00 brion: updated search-redirect.php to take a site family parameter. wiktionary, wikinews, wikiquote portal templates updated to use it in their search forms

23 March

  • 11:30 brion: created wikimedia foundation planning wiki. (updated sql grants, docroot; there is a particular apache config for this hostname only)
  • 11:00 brion: removed the 'create wiki' button from missing.php since it's BROKEN. disabled requestwiki.php, and (hopefully) made the broken addwiki.php fail when whoever's running it next runs it
  • 08:45 brion: changed project ns on eswikinews to Wikinoticias, moved pages and rebuilt items
  • 07:30 brion: restarted rc bots with change to letter configuration (missing 'A' bug)

22 March

  • 20:00 midom: un-idle dalembert, as apache
  • 13:00 Tim: running some maintenance tasks on hypatia, shut down apache temporarily
  • 12:55 midom: unbroke MRTG
  • 08:20 midom: re-started memcached on rose, started memcached's on benet instead of bart. should cause some parser cache invalidations. Kate fixed enwiki edits by rotating post-headers logfile (2GB limit...)

19 March

  • 09:50 midom: benet is apache
  • erik: created

18 March

  • 22:00 midom: after on-site work, started bart and kluge apaches
  • 18:34 kate: configured apaches to purge fr squids via vpn, not internet
  • 11:20 midom: brought srv4 into apache production.
  • 00:25 brion: restarted postfix on mail server; mail was not being moved (postqueue -f had no effect)

17 March

  • 11:55 midom: I wasn't very successful. moreri crashed under load. needs hw recheck.
  • 11:54 brion: pointed bugzilla at ariel as master
  • 11:45 midom: brought moreri into service as apache. This was my first Apache setup,.. :)

16 March

  • 16:17 Tim: someone noticed the slaves were out of sync, in fact they had been that way for an hour with no-one noticing. Suda ran out of binlog space and the familiar hell has resulted. Taking temporary measures to keep the site operational, not starting a dump just yet.
  • 15:40 Tim: sent stop signal to the image backup job, the site was basically down and that fixed it.

14 March

  • 15:20 midom: moved one ip from srv10 to srv6, according to squid load matrix it was 3 times slower than other boxes.
  • 14:00 Jamesday: killed the en compression and kill -stop for the en wikipedia tar job so they aren't running during Monday peak time.

13 March

  • 14:50 midom: redistributed squid ips, by bringing into service browne and srv6, all squids have two ips, except srv6 with one.
  • 13:30 midom: setting noatime for / on srv8/srv9/srv10/browne
  • 12:00 midom: srv8 squid fd limit changed 4192->10240
  • 11:20 brion: moved docroot/ to docroot/foundation and ran sync. for some reason foundation was being used as the docroot for this vhost, but it didn't exist in the master or on goeje, bart, or bayle, so these three machines gave 404 errors when requests for that wiki hit them

11 March

  • 11:30 brion: added sr wikipedia to rc bot 10

10 March

  • 7:20 brion: moved updatedb on albert into cron neverland (it eats the souls of servers)

9 March

  • 12:37 kate: changed $wgDBserver in commonsettings to (suda). apparently the commandline mode uses this, including rc bots.
  • 07:50 Jamesday: backup dump started on albert. Will probably kill the site for a minute or two at various times during the day.

8 March

  • nn:nn Jamesday: Ariel cloned to khaldun. Bacon had very low disk write rate (megabyte/s) so clone to it was aborted.

7 March

  • 23:00 midom: upgraded ImageMagick to home-built, fixes format string bug. Started local package repo at /home/wikipedia/rpms/.
  • 21:05 jeluf: killed wprcs
  • 21:00 jeluf: moved IPs from will to other servers, as will was having 5 IPs, while others had only 1.
  • ~13:00-16:00 many people: power strip blew, servers mostly okay. srv6 & srv7 are down, but browne is fixed. jamesday: please verify db load balancing config.
  • 05:33 kate: moved browne's IPs to srv6 & srv7 as browne is somehow broken (see ganglia)

6 March

  • 02:34 kate: added srv6 as squid

5 March

  • 22:30 jeluf: installed missing software on rabanus,bart,bayle, added them again to apache pool
  • 22:00 jeluf: removed bart, bayle, rabanus from apache pool. Please remember to install all needed tools as listed on Apache#main apache farm and to add the apaches to the dsh node groups before adding them to the farm.
  • 14:28 kate: added bayle as apache; added memcaches on bart, bayle, rabanus, anthony
  • 11:30 kate: added bart as apache
  • 11:20 Jamesday: enwiki old compression resumed at "James".
  • 09:40 kate: installed proper php.ini + mmcache on rabanus. now working properly
  • 08:42 kate: moved to perlbal on holbach+webster rather than icpagent
  • 04:30 kate: fixed ntp on all hosts to use broadcast client (from csw1)
  • 04:17 kate: yum upgrade on srv7 & rebooted it to disable selinux & fix ntp
  • 04:02 kate: removed maurus from dsh ALL group; it's up but not accepting ssh connections which makes ssh hang
  • 04:01 kate: fixed auth/ldap on srv* hosts so i can actually log into them
  • 03:45 Tim: moved gmond multicast for the MySQL machines to the 10/8 VLAN, which is the only one which ariel is on. Created a cluster just for MySQL.
  • 03:38 kate: configured zwinger to sync time from the switch rather than external sources
  • 03:10 kate: stopped broken icpagent on alrazi, friedrich, harris
  • 3:00 Tim: added wikinews to ourusers.php
  • 2:07 Tim: moved document roots for and to the standard location as defined by my documentation at Add a special wiki. The special cases in the wiki detection code in CommonSettings.php have been removed.

4 March

  • 12:53 kate: installed tidy on rabanus
  • 11:00 Jamesday: enabled ariel as slave, handling enwiki, jawiki and frwiki. Crash recovery mode commented out, back to normal set. Search still off.
  • 09:33 kate: enabled external images on nlwiki (requested by waerth)
  • 09:00 hashar: reenabled Special:Statistics
  • 08:50 Jamesday: database dump loaded into ariel and it's now catching up on replication. Took about 9.5 hours to load with a 20 second sleep between each database to give time for InnoDB to merge its insert buffer (otherwise it would create a bigger tablespace than needed).
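
The 08:50 entry above describes loading the dump one database at a time, with a 20 second sleep between databases so InnoDB can merge its insert buffer. A minimal Python sketch of that loop follows. It is an illustration only: `load_one` and the database names are hypothetical stand-ins, and the original load was presumably driven by the mysql client rather than by code like this.

```python
import time
from typing import Callable, Iterable, List


def load_with_pauses(
    databases: Iterable[str],
    load_one: Callable[[str], None],
    pause_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Load each database in turn, pausing between loads.

    load_one is a hypothetical callback that imports a single database,
    for example by piping its dump into the mysql client.
    The pause is skipped after the last database.
    """
    names = list(databases)
    loaded: List[str] = []
    for index, name in enumerate(names):
        load_one(name)
        loaded.append(name)
        # Give InnoDB time to merge its insert buffer before the next load.
        if index < len(names) - 1:
            sleep(pause_seconds)
    return loaded
```

Passing `sleep` as a parameter keeps the sketch easy to dry-run without actually waiting.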

3 March

  • 15:50 Tim: On reports of erratic DB connection errors, added wildcard GRANT for 10.0.% to ourusers.php and piped it into the database.
  • 15:35 Tim: installed gmond on srv6, in squid cluster
  • 15:14 Tim: added DNS entries for holbach's external address, used by ganglia
  • 13:30 Tim: added reverse DNS zone for external addresses in zwinger:/etc/named.conf. This fixed the squid list in ganglia.
  • 13:00 Tim: copied almost-blank /etc/hosts to all machines
  • 8:25 brion: updated apt, which, tidy on srv1,srv2,srv3. Fiddled with default route on srv2.
  • 6:00 jamesday: webster out of service and slave stopped to produce database by database dump in albert: /var/backup/private/2005-03-02-webster running in screen.
    dump finished 09:37, slave catchup finished 10:03, back in service.
  • 5:20 brion: chown'd a bunch on srv10 to get squid running again

2 March

  • 23:00 JeLuF: Made srv1, srv2, srv3 server apaches. Added to squid and nodegroup.
  • 14:50 Jamesday: load of data to ariel started again.
  • 07:00 Jamesday: load of data into ariel from albert: /var/backup/private/2005-02-22-khaldun-after-catchup/all-tables.sql.gz started.
    load from backup completed 12:50.
    Binary log files created during the data load filled the disk so I'm assuming load was incomplete and restarting it.

28 February

  • 23:15 jeluf: fixed sendmail on apaches to use smtp.pmtpa.wmnet as smarthost for outgoing mail
  • 21:30 jeluf: finished setup of squid server srv7. assigned IP .202 to it. 6 squids in service in Florida, serving 11 IPs.
  • 19:00 kate,jeluf: moved squids from eth0:x style IP aliases to more modern ip addr style. takeip changed.

27 February

  • 23:45 jeluf: on srv1,srv2,srv3: installed apache, tex, ploticus, sudo, DB permissions. Todo: Sendmail (!!!), ganglia, icpagent.
  • 14:00 Tim: changed hostnames to short names. This way, they're consistent, and they can be looked up in DNS, even the ones without a * DNS entry. I haven't changed the config files yet, so machines will randomly revert back to the old hostnames
  • 12:54 brion: fixed albert NFS mount on rabanus & dalembert, reset parser cache epoch to clear out 'missing image' problems. note that will and browne seem to be having mysterious network issues; traceroutes from them report Icmp checksum errors, and one person has reported difficulty connecting (from one particular place) to the IPs that they carry
  • 10:06 Tim: switched to a pure-DNS configuration, cut /etc/hosts down to a single line on all machines.
  • 07:20 Tim: fixed routes and firewalls on bart and bayle
  • 04:30 Tim: adjusted icpagent delay time on alrazi, from 5.9ms to 8ms. Seems happier now.

26 February

  • 23:21 brion: hacked up zwinger's host file for srv7-10 to be accessible. set logfile_rotation 1 in the squid configuration to force those files to rotate. see if this helps...
  • 22:42 brion: srv7 through srv10 are inaccessible at their internal addresses. srv8 log overflowed; rotated manually, trying to fix rotation.
  • 17:30 Tim: got rabanus working as an apache, except for load balancing which I will start work on now
  • jer, kate: reinstalled ariel. has basic setup + mysql. and 10GB ext3 for root partition, and the rest of the disk is jfs, mounted at /a

25 February

  • 03:00 brion: set squids to rotate logs hourly instead of daily so squid doesn't explode regularly.

24 February

  • 18:30 gwicke: new machines (srv8-srv10) appear to perform very poorly, 100% cpu with single ip. srv7 didn't come up after reboot with non-smp rh stock kernel.
  • 17:50 gwicke,kate: installed squid on srv10, srv9, srv8 using the new Squids#New_squid_setup script. Main spool and logging on (reiserfs-formatted) disk, second spool on ext3 /var/spool/squid. Squid load balances the partitions. /home/wikipedia/conf/squid/Makefile hacked to support the second spool.
  • 8:11 brion: fixed %2F infinite redirection loop problem; %2f and /-containing urls are excluded from the ->/wiki/X redirection.
    will is currently serving all squids, but maurus is partially on the network and intermittently the switch sends packets to it instead (presumably).
  • 5:20 hashar : someone noticed maurus is dead. Not answering port 80 & ssh. Can't access scs to check it.
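
The 8:11 entry above says that URLs containing %2f or a literal / are excluded from the redirection to /wiki/X. A minimal Python sketch of such an exclusion test is shown below. The function name and the exact rule are assumptions made for illustration; the log does not show the real check.

```python
def excluded_from_redirect(raw_title: str) -> bool:
    """Return True if the still-encoded title should not be redirected.

    Hypothetical rule: any title with a literal slash or an encoded slash
    (%2F or %2f) is left alone, so the server never bounces between the
    encoded and decoded forms of the same URL.
    """
    return "/" in raw_title or "%2f" in raw_title.lower()
```

Under this rule a title such as `AC%2FDC` is served as requested, while `Main_Page` is still eligible for redirection.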

22-23 February

  • a world of pain

21 February

  • 22:14 power goes out, all servers die :)
  • 21:00 gwicke: made the apaches dsh group a symlink to mediawiki-installation

19 February

  • 22:04 hashar: ariel crashed and came back all by itself. Site running well again.
  • Load change, Current graphs
    10:46 gwicke: For some reason squids were running instead of icpagent on some apaches which messed up load balancing. DON'T DO THAT!
    • When making such a change like from squid to icpagent, CHECK THE STARTUP SCRIPTS! As long as you don't change the startup scripts, squid will of course be started upon boot. -- JeLuF
dsh -f -N apaches grep squid /etc/rc.local
ialrazi:        nice -n 19 /usr/local/squid/sbin/squid
ifriedrich:     nice -n 19 /usr/local/squid/sbin/squid
    • I rebooted both of these hosts earlier this week for OS upgrades, so, this is why the squid was restarted. kate.
      • Hmm, weird- am pretty sure all were converted at one stage (did a dsh/sed conversion and grepped for it). Squid was running on at least six machines though, most of which certainly didn't have squid in rc.local. Gwicke 11:56, 20 Feb 2005 (UTC)
  • 4:00 erik: ro.wikinews, pt.wikinews, pl.wikinews set up
  • 14:00+ erik: on request, updated LanguageRo.php and reinitialized ro.wikinews MW namespace. In the process, accidentally wiped cur_namespace=8 from rowiki. Restored from backup (February 9, no newer one available on Albert) and combined with refreshed LanguageRo.php reinitialization; data loss should be minimal. On the non-accident side, ro.wikibooks, ro.wikinews and ro.wikiquote should now correctly call their meta namespace according to the site name, rather than "Wikipedia".

18 February

  • all Jamesday: I'll be travelling from Friday through Monday. Old compression for en has been stopped until I can babysit it. One critical task daily: on khaldun do show slave status\G and note the Relay_Master_Log_File name. On Ariel type purge master logs to 'noted Relay_Master_Log_File';. Then check that Ariel disk free is above 10GB - between 9 and 10GB there are user-visible errors and risk of loss of data or slave sync if a disk full error from temporary file coincides with the need for more log or data file space. If necessary, raise the flush file extension on ariel by 1 per 256MB to get it safely above that, ideally over 12GB.
  • start Jamesday: Compress old revisions has some data on space reduction from the old article compression I've been doing for the last week or so.
  • earlier Jamesday: To assist with some en questions we get, khaldun has a table containing the results of this command, run once or twice a week: insert ignore into jamesrcusers select rc_user, rc_user_text, rc_ip, rc_timestamp from recentchanges;. rc_user_text+rc_ip is the primary key. I'll purge based on the timestamp periodically using whatever interval experience suggests is suitable.
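
The daily task in the "all Jamesday" entry above (note Relay_Master_Log_File on khaldun, purge master logs on Ariel up to that file, then check free disk space) can be sketched in Python as two small helpers. This is only a sketch: the function names and status labels are invented here, and the thresholds are the 10GB and 12GB figures from the entry.

```python
def purge_statement(relay_master_log_file: str) -> str:
    """Build the statement to run on the master for the noted log file.

    relay_master_log_file is the Relay_Master_Log_File value read from
    "show slave status\\G" on the slave.
    """
    if not relay_master_log_file or "'" in relay_master_log_file:
        raise ValueError("unexpected log file name")
    return f"PURGE MASTER LOGS TO '{relay_master_log_file}';"


def disk_free_status(free_gb: float) -> str:
    """Classify free disk space using the thresholds from the log entry.

    The labels are hypothetical: above 12GB is the stated ideal, above
    10GB is the stated minimum, and anything lower risks errors.
    """
    if free_gb > 12:
        return "comfortable"
    if free_gb > 10:
        return "ok"
    return "danger"
```

For example, `purge_statement("ariel-bin.009")` produces the statement for that log file, and `disk_free_status(9.5)` falls in the range where the entry warns of user-visible errors.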

16 February

  • last few days, kate: create .wmnet zone for internal dns; <host>.pmtpa.wmnet, <host>.lopar.wmnet, etc. created a vpn between paris and florida, although until the network's rearranged most hosts can't see it. changed apaches to use zwinger as mail gateway.
  • 04:45 Jamesday: noticed that wgDisableAsksql was commented out. Removed the commenting out. Remember that disabling this setting completely compromises the privacy of the email addresses and IPs of every user of every project and exposes a password field which is susceptible to attack.
    unless the grants for the asksql user were changed, it doesn't have access to the user table. last time i checked the user had been removed entirely, so asksql did not work even when "enabled".
    Thanks. Good to hear.

14 February

  • 09:15 kate: set up srv1, srv3, srv4, srv6, srv8 of the new servers. rest appear non-functional.

12 February

  • all day, kate, mark, jwales, chad, etc: installed 10 new servers, 3 broke. rearranged network, zwinger albert dalembert ariel suda bacon webster holbach now on csw1. finishing changes friday.
  • 23:50 Jamesday: old compression restarted, this time with: nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Burke | tee -a /home/wikipedia/logs/compressOld/20050108enwiki . Kill or restart as needed. See zilla for the new q option.
  • 07:30 Jamesday: old compression for en running niced on zwinger. Kill it if it seems to be causing a problem - won't do any harm.

10 February

  • 02:00 Jamesday: holbach back into service for de, bacon back to en.

9 February

  • 03:30 Jamesday: holbach out of service for the dump, bacon handling dewiki. If bacon is overloaded, turn off search in dewiki section of db load balancing or add lines to turn off search for some hosts. If suda is overloaded, reduce its share of en to 60, its normal level - if still overloaded, reduce to 40. If ariel is overloaded, remove # at start of some en search off lines.
  • 03:30 Jamesday: revised dump script running on albert. If you see site dead with high apache load and low apache CPU use, as root kill STOP the part of the dump causing it - will probably be either md5 or split. Based on de, which didn't cause unavailability, if it happens, it will last for up to 3-5 minutes before fixing itself. If site doesn't come back after the stop, apache-restart-all, possibly with kill of NFS on albert (it should restart automatically).
    • Were some outage issues in the MD5 or split areas - tee on the to do list to eliminate them by doing the work as the files are being built.

8 February

  • 15:50 brion: confirmed that link from albert to the archives on yongle over NFS is on a 100mbps link which gets saturated, killing image NFS output and hanging the wikis
  • 11:07-11:13 unknown: network input at 4-6 megabytes/s to albert, killing the site while it was happening, with the usual disk/NFS symptom of low apache CPU use and high apache load.
  • 00:20 brion: set up portal

7 February

  • 13:07 kate: put both benet's internal and external ip in squid servers list. it used to have a different config and used its external ip, which it shouldn't be doing any more, but i'll wait till the network's sorted out before verifying that.
  • 11:00 brion: Turned on Unicode normalization / illegal control character stripping for text fields (formerly off by mistake). This might increase save times on some wikis, particularly those using non-Latin writing systems. This might also cause some edits to spontaneously repair broken characters, which may surprise users.

6 February

  • 22:00 erik: set up

5 February

  • 04:11 brion: upgraded grants wiki to 1.4.

4 February

  • 11:31 kate: upgraded csw1-chtpa to IOS 12.2(25)SEA (latest image for this hardware); seems to have fixed SSH problems.

3 February

  • 22:00 jamesday: commented out image tarball section of /home/wikipedia/bin/backup-wiki since it DOSsed the Albert and the site. Had previously done the same to zwinger. Needs rework before it's safe to use again.
  • 19:30 mark,jwales: We hooked up the internal netgear switch to port 1 of the Cisco (which is assigned a separate vlan, so it won't loop). Also, we moved larousse's eth0 connection to port 6 on the Cisco, to be able to test/load it a bit.
  • 19:00 mark,jwales: We tried to move the uplink from the netgear to the new Cisco switch. It didn't work: the cisco didn't accept the SFP module (apparently it's a Netgear GBIC, not a Cisco...), and the Foundry BigIron on the other end crashed. Backed that out.
  • 16:00 gabriel,jeluf: WIKISLOW!! tar job running on albert killed NFS performance. Killed tar job. mysqldump still running, might be killed when causing trouble.
    midom: The biggest impact was on Parser::replaceInternalLinks-image hitting NFS. needs refactoring?

1 February

  • 23:55 Jamesday: en and all non-big wikis: "alter table cur drop index cur_namespace". Others to follow during off peak read only time. The cur_namespace index was being used too many times where namespace_title should be used, including sometimes for replaceLinkHolders.
    Now complete for all wikis.
  • 22:00 jeluf: basic OS setup of moreri,bart,bayle,browne. Coronelli died during the process.

30 January

  • 17:26 hashar: fixed and (incorrect entry in all.dblist (and updated wikinews.dblist).
  • 17:22 hashar: s/ariel/suda/ in dsh group mysqlslaves.

29 January

  • erik: Spanish, French and Swedish Wikinews created

28 January

  • 17:34 hashar: updated mrtg graphs to show benet server.
  • erik: Dutch Wikinews created

27 January

  • 01:00 gwicke: experimented with different load balancing settings, but didn't get a smooth state. Reverted the changes.

25 January

  • 07:00 Jamesday: khaldun apache stopped and InnoDB buffer size raised to 700 MB to give it some hope of staying current with the very high update rate.
  • 06:27 brion: added (Friulian)
  • 03:30 brion: upgraded Mailman from 2.1.5c2 to 2.1.5 final release

24 January

  • 22:58 gwicke: moved from rabanus to maurus
  • 22:55 gwicke: moved from rabanus to benet, rabanus was doing 100% cpu

23 January

  • 16:44 gwicke: wprc is restarted twice daily from cron at 4:00UTC and 15:00UTC from now on
  • 15:00 Mark:
    • Modified Cricket on Larousse to support aggregation of targets using a variety of functions
    • Modified Cricket to include hourly graphs. I had to change the poll interval from once every 5 mins to each minute. Unfortunately old data was lost due to the transition.
  • 14:05 Jamesday: changed ICPAgent delay on khaldun from 4.9 to 10 and will adjust more because it was reporting modulated clock mode. Idea is to give it low load unless the power is really needed.
  • 13:17 gwicke: Added /usr/local/bin/icpagent to rc.local on apaches. That's a script that has the host-specific timings and starts /home/wikipedia/bin/icpagent. ToDo: adapt the wikidev-sudo /h/w/bin/squid-stop script to do the same for icpagent

22 January

  • 19:24 gwicke: did some fine-tuning on apache weights using the new 0.1ms-adjustment feature in icpagent, cpu usage now pretty even.

21 January

  • 20:30 gwicke: tweaked the 'old' purge function to read only once each 200 purges, should be faster for >1 squids. Enabled it in Commonsettings, $wgMaxSquidPurgeTitles at 500, deferred updates don't seem to be deferred at all currently
  • 07:02 Tim: implemented pfsockopen()-based squid purging. Together with the lock time reduction implemented last night, saves have now been brought down to 675ms in low to moderate traffic. Increased $wgMaxSquidPurgeTitles to 5000 since that will now only take 5 seconds or so.

20 January

  • 23:29 kate: properly fixed 404.php by actually copying it into the live htdocs directory.
  • 21:00 Jamesday: suda taken out of service after innodb had problems opening some tables (ERROR: 1016 Can't open file: 'querycache.InnoDB'. (errno: 1)). May be because khaldun was on 4.0.22 and suda is 4.0.20 or some copying error - the tables seem OK on khaldun.
  • 19:30 gwicke: ICP agent running on all apaches, with the default delay of 5ms on most, and with the new default nice of 10. Tingxi and Dalembert look a bit spiky cpu-wise, all others seem to be very even now. ToDo: replace squid in rc.local with "/home/wikipedia/bin/icpagent -d -t 5"
  • 19:20 gwicke: enabled tidy again, Anthere had problems on meta with broken markup. Monitoring performance.
  • 19:00 Jamesday: test running /home/wikipedia/bin/apache-restart-loop on zwinger. Does a restart of one apache, graceful restart of all with 20 second delay, in an infinite loop. If loop works a cron version can be used - we'll see how it goes. With current apache count this gives each a graceful restart once every 6 minutes and normal restart about once every 90 minutes.
  • 14:30 Jamesday: khaldun syslogd reporting temperature above threshold and running in modulated clock mode.
  • 14:25 Jamesday: khaldun and suda catching up in replication. Suda will enter service when caught up.
  • 11:45 Jamesday: khaldun mysqld shut down for copy to suda. exec/Relay_Master_Log_File: ariel-bin.009 Exec_master_log_pos 194151813
  • 06:22 kate: split 'apaches' dsh group into 'apaches' and 'mediawiki-installation'
  • 02:00 brion: upgraded yongle's memcached to 1.1.12rc1 on brad's advice (was ancient 1.1.10). will do others shortly if nothing is broken horribly
  • 01:20 gwicke: testing icpagent with static delay of 2ms and different nice levels on different hosts. Disabled perlbal, tingxi and dalembert are running apaches. No delay in icpagent doesn't seem to work, 10ms appeared relatively slow. Guess the scheduler needs a time > 0 to engage, should tick every ms on 2.6. Not yet tested: weights on squid, connection limits on squid per peer.
  • 00:49 brion: returned tingxi and dalembert to the apaches group in dsh so that their files get updated and they're no longer COMMITTING EDITS TO THE WRONG MASTER SERVER. set suda's mysqld to read-only and SHUT IT DOWN ENTIRELY.

19 January

  • at some point brion: disabled tidy to see what effect it would have on cpu usage
  • 08:53 Tim: doubled number of memcached instances to 20, adding some to the 4 internal apaches: rose, smellie, anthony and biruni
  • 0:30 jeluf: disabled webshop upon request of the board.

18 January

  • switched master to ariel after disk holding logs on suda filled.
  • 21:30 midom: spotted strange apache behaviour (child crashes). actually it was just logs on new boxes not rotating and 2GB file size limit was hit. Kate fixed it.
  • some time, someone: something broke. Ariel now mysql master.

17 January

  • 18:23 hashar: apache-stop && apache-start on avicenna to kill a 156MB defunct convert process :(
  • 14:06 Tim: Fixed ganglia on the Paris squids and on the internal apaches. This mainly involved synchronising gmond versions -- all are now 2.5.6. Also, gmetad really needs to get its data directly from a gmond, not from a gmetad intermediary. Various artifacts due to stopping or restarting gmetad and gmond are visible.
  • some time kate: removed yongle and isidore from apache because it doesn't work with memcached. want to put perlbal on them instead.
  • 00:37 brion: upgraded Bugzilla to 2.18 stable

16 January

  • 18:30 Jamesday: emergency switch from Paris squids - all traffic was going via rate-limited links, including France. Instructions at Squids.
  • 12:40 gwicke: reconfigured french squids to use htcp sibling communication which is superior to icp because it sends more detailed questions. Seems to work fine (it's used in Florida for a year now), to verify do
tail -f /var/log/squid/access.log | grep SIBLING

on the french squids, you should see sibling hits from the other squids.

  • 12:00 midom: as I yesterday installed proctitle module on apaches, and Setup.php has included hooks for that, mediawiki process status can be simply verified with ps or top -c
  • 02:35 brion: reinstalling PHP 4.3.10 with mbstring module

15 January

  • 23:20 gwicke: Mystery non-bug with no-cache headers when doing test requests from the fr squids resolved. Solution: Forwarded-For was empty, so the ip strip function in Setup.php had nothing to chew on. Empty ip string matched all user's newtalk flags further down the road which caused User::newTalk to return true which disabled caching in SkinTemplate.php... See also MediaWiki caching for some background on how MW uses http headers.
  • 17:32 hashar: put back isidore in apache pool
  • 18:13 gwicke: made wprc more reliable, now uses vtun in persist mode and a cron script checks if it's running ok every minute.
  • 13:31 Jamesday: enwikibooks is read only while I recover the most recent 73 records. Ariel is out of service while I work out why I can neither create nor drop enwikibooks.old on it.
  • 14:00 Tim: Fixed horribly broken 404.php, which was producing infinite redirect loops in response to almost any 404 error
  • 06:00 (or so) kate: removed pen and moved the site to use perlbal instead. seems better.
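
The 23:20 entry above traces the no-cache headers to an empty Forwarded-For value: the empty IP string matched every user's newtalk flag. A minimal Python sketch of that failure mode and one possible guard follows. The row layout (registered users stored with an empty user_ip, anonymous users with user_id 0) is an assumption made for illustration, not something the log states.

```python
from typing import List, Optional, Tuple

# Hypothetical newtalk rows: (user_id, user_ip).
NewtalkRow = Tuple[int, str]


def has_new_talk_buggy(rows: List[NewtalkRow], client_ip: str) -> bool:
    """Match on IP only, so an empty client_ip matches rows with empty user_ip."""
    return any(ip == client_ip for _, ip in rows)


def has_new_talk_guarded(rows: List[NewtalkRow], client_ip: Optional[str]) -> bool:
    """Treat a missing or empty IP as having no anonymous identity at all."""
    if not client_ip:
        return False
    # Only anonymous rows (user_id 0 in this sketch) are matched by IP.
    return any(uid == 0 and ip == client_ip for uid, ip in rows)
```

With rows for one registered user and one anonymous IP, the buggy lookup reports new talk for an empty IP, which is the condition that would disable caching; the guarded lookup does not.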

14 January

  • 1200EST Baylink: innocence added the squid name to the error pages; benet seemed the trouble spot; Steps Were Taken.
  • 15:30 Jamesday: converted interwiki table to MyISAM - should make for faster truncate/update operations when it's updated.

13 January

  • 03:15 brion: took webster's mysql offline to make a copy for an experimental public replication server

12 January

  • 23:45 kate: installed pen on dalembert and moved load balancing to use it instead of squid
  • 20:00 mark: Installed Cricket on larousse.
  • 09:35 hashar: JeLuF fixed benet ganglia which was using the apache configuration instead of squid one. benet shows as down in ganglia apache view probably cause we have to move (and merge) rrds.

11 January

  • 20:08 kate: put powerdns on zwinger using mail.wm's IP. it's 2ndary server for zone. seems to be working fine.

10 January

  • 23:20 gwicke: All css should be cached again. User-specific css is moved to /User:Somebody/-?action=raw, Vary and maxage=0 params removed.
  • 07:27 hashar: installed libpng-dev on larousse. Nagios compiled with gd support.

9 January

  • 23:30 jeluf: made benet a squid.
  • 22:02 kate: ennael crashed again. removed it from DNS.
  • 21:22 kate: moved several wikis (fr de nl it lb ch wa) to geodns and fr squids for users within those regions. seems to be working, except for unexplained reboot on ennael.
  • 21:17 hashar: changed gu and hi wiktionaries sitenames & metanamespaces
  • 16:45 hashar: launched nrpe (nagios) on squids (!browne)
  • earlier kate: dshroot -a yum upgrade

8 January

  • 22:50 brion: restarted replication on bacon. somehow it was trying to pull suda_relay_bin.012 instead of suda_log_bin.012; about 12 hours behind, now catching up...
  • 14:48 hashar: launched the "Nagios Remote Plugin Executor" as a daemon on all apaches. Need to create an init.d script later :o)
  • 00:10 hashar: installing bunch of perl cpan modules on larousse. Installing nagios as well in /usr/local/nagios/

6 January

  • 09:53 gwicke: Configured French Squids, enabled purging. Log rotation is enabled, log transfer isn't. Mem settings are very conservative (32Mb) for now. All three seem to work fine, but more testing can't harm of course.

5 January 2005

  • 00:54 hashar: s/suda/ariel/ in "mysqlslaves" dsh group.

4 January 2005

  • 23:17 hashar: added "paris" dsh group. Servers are not in other groups.
  • 22:26 gwicke: Updated this wiki to 1.4. Happy new year.

3 January 2005

  • 21:30 shaihulud: stopped apache on yongle, it was killing the wiki... Can we do something about the memcached problems?
  • 20:22 shaihulud: after asking most devs (sorry for those I forgot or couldn't find), phe now has shell access on the cluster
  • 19:23 brion: fixed database config for bugzilla, shop
  • 16:45 JeLuF: Stopped postfix on albert: OTRS seems to use ariel as master. Need to fix it tonight.
  • 09:45 Jamesday: Suda now master.

29 December

  • 21:05 Jamesday: added and enabled $wgUseLuceneSearch = false in CommonSettings.php to re-enable search for all wikis which were set up to use Lucene - all had apparently had no search at all since yesterday.
  • 10:20 jeluf: Added hotfix in Title::legalChars() to disallow character %AD in titles and usernames on Latin1 wikis.

28 December

  • 16:20 brion: Reduced MaxClients to 24 to keep the 512MB apaches from overflowing memory too badly if a lot of threads use PHP's maximum amount of memory. Should vary this across the larger machines to allow for some breathing room.
  • 15:30 Some kind of GC/timeout problem with the Lucene search ate up apache threads. Search disabled for now.
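
A rough check on the 16:20 entry above, as a minimal sketch: it assumes the 20MB memory_limit noted on 19 December was still in effect (the log does not confirm this) and ignores Apache's own per-process overhead. The function name is hypothetical.

```python
def worst_case_php_memory_mb(max_clients: int, memory_limit_mb: int) -> int:
    """Upper bound on PHP memory if every Apache child hits memory_limit at once."""
    return max_clients * memory_limit_mb


# MaxClients 24 comes from the log entry; the 20MB memory_limit is an
# assumption carried over from the 19 December entry.
worst_case = worst_case_php_memory_mb(24, 20)
print(worst_case, "MB worst case on a 512MB machine")
```

Under those assumptions the worst case stays a little below the 512MB of physical memory, which would fit the stated goal of limiting overflow rather than eliminating it.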

27 December

  • 14:35 shaihulud: removed phe (renamed public_key). If somebody is against phe, please tell me
  • 14:25 shaihulud: added the french user phe, as wikidev, he helped me a lot and we need sysadmins :)
  • 13:15 shaihulud: webster is down, removed from load balancing
  • (various) kate: moved en*, fi*, de* and fr* to lucenesearch, now running on Rose. See the end of CommonSettings.php. Also fixed load balancing by defining $sqle* and $sqli* as well as $sql*.
  • 2:00 jeluf: Moved OTRS on albert to HTTPS

26 December

  • 00:05 kate: returned briefly to move enwiki to the LuceneSearch extension running on Kluge. The rest will follow shortly.

25 December

  • 01:12 brion: bacon caught up, put back on duty
  • 00:48 brion: restarted replication on bacon
  • 00:30 brion: bacon was 4822 seconds behind and weird history & diff inconsistencies were seen on at least en and ja wikis. took bacon out of 1.4 load balancing rotation for now.
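
For readers scanning lag figures like the one in the 00:30 entry, a small sketch that converts a lag in seconds into hours, minutes and seconds. The helper name is hypothetical and is not part of any tool mentioned in this log.

```python
def format_lag(seconds: int) -> str:
    """Render a replication lag given in seconds as 'Hh MMm SSs'."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# 4822 is the lag reported for bacon in the 00:30 entry.
print(format_lag(4822))  # prints "1h 20m 22s"
```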

23 December

  • 11:22 brion: installed ICU and php_normal on apaches & zwinger, but left it disabled. having a problem where all page views end up as the main page; can't reproduce it in isolation, but regularly reproduce it when turning it on for the whole farm.
  • brion: converted * to 1.4

22 December

  • 23:00 shaihulud: restarted mysql on suda, benet is now another db slave
  • 19:50 shaihulud: stopped mysql on suda, time to copy the db to benet
  • jeronim: set up an offsite backup machine with an 80GB drive at my house - will have backups of uploads and some other stuff - details on how to log in etc are in /home/wikipedia/doc/backup_locations
  • 8:00 brion: after much fuss, think i've got the wiktionary messages sorted out.

21 December

  • 21:30 JeLuF: installed TeX and ImageMagick, as listed on Apaches
  • 21:00 JeLuF: installed tidy on biruni, rose, smellie, anthony and benet. Are all the other tools installed?
  • 14:10 Tim: put biruni, rose, smellie, anthony and benet into service as apaches. Benet required a change to CommonSettings.php since it can't contact the mysql servers on the 10.* addresses.
  • 11:30 brion: Upgraded wikisource and wiktionary to 1.4. Gave squids a master configuration file in /h/w/conf/squid. Installed clamav on the apaches; master conf at /h/w/conf/clam
  • 05:30 Tim: added restrict lines for NTP servers in ntp.conf on the French squids. TICK TICK TICK TICK... ahhh that's the sound of 35 servers ticking in synchrony
  • 03:48 Tim: added 10.0.* to zwinger's allowed ntp client list in ntp.conf. This allows the 4 internal servers to synchronise.
  • 01:44 Tim: routes on benet were set up incorrectly, especially 10.0.* which was sent through eth1, which is not connected. Fixed this problem and set the default gateway to izwinger.
  • 01:25 Tim: synchronised /etc/profile across machines except albert, all now use /etc/profile.local for our customisations. Albert's /etc/profile was the only one that was different from the start, it used /etc/profile.local by default.

19 December

  • 23:45 brion: reinstalled PHP with 20MB memory_limit. (can raise limit in php.ini or from CommonSettings.php if necessary)
  • 21:20 brion: friedrich having weird errors, took out of rotation
  • 13:24 Tim: set up default gateways and proper hostnames on biruni, rose, smellie and anthony.
  • installing new servers, work in progress: when running the setup-new-fc2-servers, ntp does not sync, have to check

18 December

  • 23:40 brion: dalembert was showing errors on 1.4 wikis executing 'SHOW SLAVE STATUS' on suda. Executed 'FLUSH PRIVILEGES' on suda and it seems to have stopped for now. Added hack to 1.4 SQL error reporting to include the db server IP to make these easier to track down.
  • 16:45 shaihulud: copied /usr/local onto biruni, rose, smellie and anthony from another apache. All other things are still to do
  • 15:20 shaihulud: added benet, biruni, rose, smellie, anthony. benet is on public .210, the others are on private .0.25 -> .28
  • public root passkey is on french squids, name added in our /etc/hosts

17 December

  • 23:00 jwales: Installed new servers, IPs to .28. Allow root ssh login. No names yet.
  • 22:35 brion: upgraded all apaches to PHP 4.3.10 (minor security updates in .9 and .10)
  • 21:00 jwales: Installed a new server, IP .210
  • 07:00 brion: upgraded all Wikiquotes to 1.4 (1.4 upgrade)

15 December

  • 22:00 jeluf: Reworked spamassassin on albert, teaching bayesian filter, adjusting weights.
  • nn:nn Jamesday: For a two step switch to suda, read only, search off, remove the # from #$wgEmergencyMasterSwitch = true; in CommonSettings.php and sync it.
  • nn:nn Jamesday: holbach and webster are now handling en and ja instead of en and zh.

14 December

  • 23:30 jeluf: moved all image files that were on albert but not on zwinger AND weren't from today to /var/zwinger/htdocs/ Those were images that have been deleted between the last rsync and today's copy of files from zwinger to albert.
  • 12:20 shaihulud: set cronjob on khaldun to recache special pages on all wikis every night at 0:00
  • 10:11 brion: moved uploads onto albert. They're in /var/zwinger/htdocs/uploads [still an ugly symlink tree for now], accessible from zwinger+apaches as /mnt/wikipedia/htdocs/uploads. Zwinger is very happy to have the load off. There may still be some images that need to be totally re-synced; files missing on albert will be pulled transparently from zwinger in the meantime.
  • 06:20 brion: starting rsync fun, moving uploads to albert permanently

12 December

  • 22:00 jeluf: restarted squid on rabanus.
  • 05:20 kate: removed my ssh public key. mail me if something I did needs documenting. bye - it's been fun.

11 December

  • 12:00 jamesday: en and zh now sharing holbach and webster (new DBs), rest sharing suda and bacon.
  • 10:00 brion: 1.4 upgrade on meta, wikinews, and wikisource
  • Tim: w:Meta:spam blacklist can be edited to change site-wide spam regex blocks

10 December

  • 20:20 kate: holbach & webster set up as mysql slaves. holbach is in production, webster is still catching up
  • 18:37 kate: ariel was taking 200+ seconds to process updates and the site was mostly down. restarted mysqld and it fixed itself.
  • 14:13 kate: started mysqld on webster by accident. killed it.
  • 13:16 Tim: installed dsh on albert. Put a cron job on albert to copy node_groups from zwinger on a daily basis. This is for redundancy, so we can use dsh when zwinger is slow or down
  • 0:20 jeluf: fixed mrtg config in /home/wikipedia/htdocs/wikimedia/live/conf: Added the fourth squid server.

8 December

  • 17:45 brion: somebody changed the memcached configuration but didn't update php-1.4/CommonSettings.php. updated it to match 1.3.

7 December

  • 14:45 brion: switched commons and from NFS-based to locally-based docroots (symlink farms themselves, but still)
  • 08:59 Tim: installed the CVS client on ariel

6 December

  • 15:00: brion: switched some 'w' bits from links to NFS to links to local copy. not sure if it makes much of a difference
  • 04:00: brion: Upgraded commons to 1.4beta (1.4 upgrade)

3 December

  • 08:10: Tim: moved to
  • 06:40: jeluf: added dalembert to squid again.
  • 05:50: jeluf: removed dalembert from squid configs. Server was heavily loaded, trying to reduce load.
  • 04:50: brion: Noticed someone had enabled Special:Asksql; disabled it again. If there was a consensus to turn on this highly dangerous feature again, nobody told me.

1 December

  • 10:53: Jamesday: bacon and suda out of service to load missing stop words and switch to a new stop word list. Search is off for all wikis while this runs.