Server Admin Log/Archive 6

February 28

02:02 brion: raised memory limit in mw to 80mb; at 50mb rsvg routinely dies on x86_64; apparently from shared library mappings
00:12 brion: fixed some rsvg bugs; upgrading libxml2 for better support of files from illustrator

February 27

11:47 brion: updating $wgThumbnailEpoch so rendered SVGs and mis-sized PNGs and JPEGs should rerender gradually
11:30 brion: investigating segfaults on yaseo apaches
10:44 brion: upgrading librsvg to 2.14.0 (patched for dashes and security) on pmtpa apaches
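
Note on the 11:47 entry: a minimal sketch of why bumping a thumbnail epoch makes files rerender gradually rather than all at once. It assumes a cached thumbnail is treated as stale when its modification time is older than the epoch, and that staleness is only checked when the thumbnail is next requested. The names `THUMBNAIL_EPOCH` and `needs_rerender` are illustrative, and the epoch value is hypothetical (the log does not record what was set).

```python
from datetime import datetime, timezone

# Hypothetical cutoff; the actual value set on the cluster is not in the log.
THUMBNAIL_EPOCH = datetime(2006, 2, 27, 11, 47, tzinfo=timezone.utc)


def needs_rerender(thumb_mtime: datetime, epoch: datetime = THUMBNAIL_EPOCH) -> bool:
    """Return True when a cached thumbnail predates the epoch.

    The check runs per request, so old thumbnails are regenerated one by
    one as they are asked for, not in a single bulk pass.
    """
    return thumb_mtime < epoch


# A thumbnail rendered the day before the epoch is stale.
old_thumb = datetime(2006, 2, 26, 9, 0, tzinfo=timezone.utc)
# A thumbnail rendered after the epoch is kept.
new_thumb = datetime(2006, 2, 28, 9, 0, tzinfo=timezone.utc)
```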

February 26

22:20 jeluf: Mail service enabled again, OTRS moved to goeje, using srv38 as DB server.
21:15 jeluf: Shutting down mail on goeje while migrating OTRS from ragweed to goeje.
16:40 jeluf: copied missing icons to goeje
11:15 jeluf: uninstalled sendmail on goeje, kept postfix.
10:30 jeluf: installed spamassassin on goeje. Spam mail gets a ---SPAM--- label.
09:56 brion: mail.wikimedia.org seems alive and well on goeje
08:55 brion: moving mail from zwinger to goeje; in progress copying mailman data

February 25

13:00 Domas: cloned commonswiki, nlwiki, frwiki, svwiki, plwiki to webster

February 24

06:20 Tim: restarted lighttpd on amane
05:50 Tim: deployed MW job queue
03:39 brion: copied .forward files from zwinger old files to each home dir that had one
00:14 brion: benet out of space. finishing cleaning off old files.

February 23

21:40 brion: resyncing search indexes
12:00 mark: Holbach couldn't reach external machines (like zwinger) as it was missing a default route
00:30 Tim: started apache on goeje
00:19 brion: continuing wiktionary bulk renames
00:05 brion: took thistle out of rotation; it's wildly behind replication
00:05 mark: Set up udpmcast on goeje as larousse is dead.
00:03 justin: bullshit

February 22

22:55 brion: set all remaining wiktionaries to wgCapitalLinks off. starting bulk-rename operations
22:00 brion: set default subpages to include custom 100 and 101
06:57 Solar: srv8 back up.
03:30 Tim: Set up gmond on sq1-10 [1]
03:20 Tim: killed sq9. ifcfg-eth0 is probably wrong.
01:44 brion: the lvs on the sq* machines looks good, so we're changing upload.wikimedia.org to point at it
00:30 brion: investigating performance of amane vs sq9

February 21

23:10 brion: caught lots of segfaults on srv38. looks like we still have the 'occasionally goes into mode where server does nothing but segfault until you restart it' problem. awesome.
23:05 brion: noticed rpc.mountd taking 99% cpu on amane; did /etc/init.d/nfs restart to clear it
22:55 brion: sq9 taking over upload.wikimedia.org squid duty, alone for now.
- With luck this will keep loads of slow open connections off of the other squids. If we need to we can have a separate set of squids for uploads.
22:16 mark: Flipped gi0/14 on csw4-pmtpa over to vlan 1, as it has sq9
22:00 brion: trying to set up sq9 with an external ip for upload squid
21:31 brion: browne has stale zwinger NFS; leaving it for now as the IRC is still running ok

21:30 brion: doing survey of up/down machines and reusable ips
21:00 brion: ragweed down
04:20 Tim: fixed gmond in various places. Put most of the apaches into "deaf" mode, so they will only multicast their own statistics, not the rest of the cluster's. Only the apache-aggregators node group listens to the multicast stream and responds to XML requests.
02:19 Solar: set tingxi's vlan to default on csw4-pmtpa.

February 20

22:23 brion: rebalanced tampa squids; two ips from srv7 and srv9 to srv6
21:39 brion: got lots of segfaults on rabanus; restarted, seems ok
21:30 brion: slow access reported by some people, can't reproduce here. might be knams
21:13 brion: moving old backup files to amane to free space on benet
20:30 jeluf: changed settings of de.wiktionary to enable sysop RC patrolling, see bugzilla
20:00 jeluf: rebooted sage
19:40 brion: updated checkers.php
07:00-07:30 Tim: installed wikidiff2 1.0.0 everywhere
06:30 Tim: fixed yf1010 and did a reboot test.
04:01 brion: installed php5 on yaseo
00:20 brion: installing php5 on diderot, friedrich, humboldt, hypatia, kluge, rabanus, rose, srv2, srv3, srv4, srv0

February 19

23:48 brion: running yaseo dumps on amaryllis
23:41 brion: running enwiki dump on srv31 and other dumps on benet
22:42 brion: rearranged mounts on srv31 so it will survive if zwinger's ip is returned to it
19:37 brion: restarted search index builds; bad symlinks from older dumps still, uh, bad.
10:20 domas: enabled persistent connections for memcached
09:40 brion: having tingxi shut down until kyle can poke it
07:22 brion: rebooting tingxi; stale nfs
06:50 brion: explicitly disabled user dirs on the primary apaches

February 18

22:08 brion: started Lucene index rebuild on maurus
- Noticed that maurus can't be reached from zwinger. something bad in configs :(
06:15 Tim: Fixed and re-enabled wikidiff2

February 17

22:25 brion: disabling wikidiff2; wikidiff2 segfaults
11:00 brion: moved smlogmsg out of the way (smlogmsg-old), replaced with a shell script that just says servmon is down, so we won't have to wait for it to time out looking for larousse
10:55 brion: fixed wikiquote docroot
10:30 brion: syncing; www.wikiquote.org portal docroot hadn't gotten copied out
07:57 Solar: anthony is back up, but I didn't set dns
07:41 Solar: Reinstalled fc4 on srv50
05:20 Tim: started gmetad on zwinger
04:56 brion: testing on srv39 to track down segfaults. upgrading apc, etc
03:50 Tim: deployed wikidiff2 on numbered pmtpa hosts
(at some point) stopped a buttload of apaches which were massively segfaulting
03:03 brion: restarted apaches; srv11 had several old stuck convert processes and wasn't responding to new requests. 14, 16, 18, also had unusually low load avg.
01:48 brion: taking srv55, srv57, srv61, srv67 out of service due to RAM problems reported in mcelog. had to reshuffle memcached sets.
01:26 brion: srv61 flaky
01:21 brion: set suda's syslog to accept the messages from the cluster's php boxen

February 16

23:30 jeluf: Rebooted clematis, fuchsia, hawthorn, mayflower
20:23 brion: srv41 and srv43 for some reason aren't initializing the lvs magic ip or starting apache on boot, but the test for it passes and they can be run manually afterwards. don't know why :P
20:08 brion: testing reboot of srv41 to make sure it comes back up with proper LVS magic ip in place
20:00 brion: fixed NFS mounts and php config on rose, back in service.
19:50 brion: removed broken firewall rules from rose, nfs now ok. upgrading php...
19:20 brion: rose still borked, poking it with a stick
19:05 brion: noticed ganglia still broken.
13:15 Tim: noticed that suda (/home share) was full. Possibly because of the rsync from from-zwinger that I set going a while ago. Deleting some backups.
08:30 brion: added missing /upload redirects to some vhosts (commons, meta, sources, etc)
08:00 brion: moved /wikistats to http://stats.wikimedia.org/ (unused vhost on albert)
05:08 Tim: installed Term::Readline on goeje
04:42 Tim: moved LVS back to dalembert
04:00 Tim: Fixed dalembert's startup scripts
03:20 Tim: restarted dalembert
03:15 Tim: second attempt at moving the LVS director to srv52, worked this time
02:35 brion: lvs back up, for now
02:30 brion: lvs down; tim's rearranging things.
01:44 brion: enabled new config with everything moved from htdocs/ to common/docroot subdirs. Hopefully everything relevant copied in...

February 15

22:20 brion: dsh available on zwinger; moved some config files in for it
22:00 brion: mounted suda's /home on zwinger, so it can be used consistently for stuff until we reinstall it
21:11 brion: mailman back online, upgraded to 2.1.7, moved to /usr/local/mailman
19:55 brion: working on moving and upgrading mailman on zwinger
19:43 brion: switched .247 and .248 from srv7 to srv9. srv8 is still offline
14:15 brion: added redirects for wikimania200[5-9].org (where we have them)
13:57 brion: patched otrs for annoying session load errors
13:45 brion: disabled mod_perl for ticket, otrs seems to actually give back what you asked for now. tracking down a "Use of uninitialized value" in session handling that's flooding the error log.
13:30 brion: otrs is doing the 'random pages instead of what you asked for' thing again. restart fixes for about a minute or two.
13:03 brion: determined otrs problem was wild goose chase because debian sent out notifications of a three-month-old security update they forgot to include until now. ticket back online.
12:49 brion: took ticket.wikimedia.org offline for upgrade
12:35 brion: copy of files to suda finished at some point. zwinger files can be copied into place from /home/from-zwinger as desired.
12:30 Tim: attempted to reinstall srv50 with a 20GB root partition and a large /a partition. I monitored it on the serial console until it finished the PXE phase, but I saw nothing more from it after it went into linux.
02:27 brion: got suda relaying mail from the cluster to zwinger until somebody figures out this DNS crap
01:40 brion: reassigned the smtp.pmtpa.wmnet CNAME to zwinger's 207 address, which works internally. Trying to get it to propagate...
01:27 brion: tracking down broken apache email; they send to smtp.pmtpa.wmnet, which is zwinger's old internal ip (now suda) so mail doesn't get out.

February 14

22:40 jeluf: Restarted srv31, had broken /home mounts
22:30 jeluf: Restarted sage.
22:00 jeluf: Restarted apache on srv31 for static.wikimedia.org. Added it to rc.local
21:30 jeluf: restarted lily, iris.
21:35 brion: started copying zwinger's /home to suda, so we'll have current files on the new file server. Temporarily putting in /home/from-zwinger.
19:30 mark: Changed IP address of ns1.wikimedia.org to 211.115.107.190 (Amaryllis), as larousse is pretty dead.
15:25 mark: Set up DNS zones wikimania2006.org and wikimania2007.org by request of Delphine
15:10 mark: Replaced internal DNS zonefiles by more recent versions retrieved from zwinger
07:53 brion: benet download.wikimedia.org back online. old lighty relied on /home; reinstalled from srpm.
03:54 brion: maurus and vincent online for search. there's a lot of connection-limit hitting, though...
03:47 brion: coronelli online for search
03:40 brion: kicking search boxes to see if they come up
01:35 brion: got the redirect from old math urls in pmtpa working. yaseo wikis are exempted by explicit test in apache config (main.conf)
01:15 brion: switching math in pmtpa to load off of upload.wikimedia directly instead of via apache+nfs
01:05 brion: got math settings resynced on yaseo. phew!
01:00 brion: hacked sync-file to use amaryllis.yaseo instead of amaryllis so it works on albert
00:42 brion: moved tex files from /home/wikipedia/shared/math to /mnt/math, so it's not in a double nfs hell

February 13

23:30 kyle,mark,jeluf: zwinger back running. Removed all irc stuff from rc.local because it was blocking and system didn't boot properly.
21:36 brion: took 204, 205 to srv6, whatever had them wasn't responding. now getting response on all squids
21:34 brion: took srv8's ips (246, 247, 248) to srv7 so something could respond on them. also others may be borked
21:23 brion: rebooting srv8, was verrry slow to respond, lots of stuck processes, nfs broken
21:00 jeluf: reconfigured dalembert to have all important components locally: icpagent moved to /usr/local/bin, lvs node list moved to /usr/local/etc/apaches.
15:45 mark: Deployed a new resolv.conf using new internal DNS resolver service IPs
15:15 mark: Set up 2 new internal DNS resolvers on srv1 (master) and albert (slave)
11:43 zwinger down NFS debacle
07:30 brion: added some partial blacklists on unicode chars in usernames. this should be fixed up for all titles and the whitelist fixed.

February 12

10:30 JeLuF: Moved helpdesk-l to OTRS. Added alias in /etc/postfix/aliases and removed the one in .../mailman/aliases.

February 11

23:00 Tim: srv6 crashed, moved IPs to srv7, 8 and 10

February 10

07:10 kate: upgraded zwinger to nfs-utils 1.0.8-rc2 so the Solaris NFS client doesn't crash it
05:30 Solar: bart back up at 207.142.131.227 (With FC3, let me know if you wanted FC4)
04:23 Solar: larousse is gone, no warranty. I might scavenge a harddrive from a bomis server to replace its drive.
04:23 Solar: Taken anthony for RMA

February 9

17:25 brion: enabled emergency captcha and blocked some ip. robot or something.
06:30 Solar: hydra the new server is up at 10.0.0.201
04:54 Solar: ixia back up.
02:13 brion: rebuild and reenabled interwiki cache
01:55 brion: disabled interwiki cache; it doesn't seem to handle removal of the cache file, and there's no obvious way to clear the cache.
01:45 brion: interwiki map now protected; for some reason somebody left this unprotected even though it gets updated on an unattended basis, and somebody decided to add javascript: to it. nice. updated cache epoch to ensure things are cleared
00:12 brion: restarted apaches; odd 'bad title' and failed load errors reported on srv12, restart cleared it

February 8

19:20 brion: larousse dead, doesn't come up on boot.
19:00 brion: benet / briefly filled, but nothing seems to have gone awry with the dump. cleaned some space.
18:30 brion: had larousse rebooted since its root filesystem doesn't work, worth a shot. may not be coming back up
18:00 brion: larousse is down since yesterday, nobody logged it.

February 7

23:57 brion: running cleanupWatchlist on pmtpa
07:23 brion: namespaceDupes on ta wikis for bugzilla:4889
05:40 Tim: Recompiled ImageMagick from the source RPM, with --with-quantum-depth=8. Installed on all apaches.

February 6

23:00 jeluf: Added new "Urgent-en" queue to OTRS
22:30 jeluf: Restarted sage and iris. Changed /a to ext3 to reduce fsck time.
21:00 jeluf: Restarted lily, purged cache
19:00 jeluf: Restarted load balancer on pascal, rebooted mayflower
18:50 mark: Revived clematis
00:05 mark: Created mailinglist chaptercommittee-l by request of Delphine.

February 5

02:43 brion: nowikinews was duplicated in all.dblist; cleared

February 4

21:38 brion: started data dumps in pmtpa, now including progress/ETA for xml dumps

February 3

01:50 brion: started fill-in dumps on srv31 again; now using local temp dir for stub dumps in the hope it won't mysteriously fail
01:30 brion: started dumps on yaseo

February 2

19:09 brion: compiling php 5.1.2 on srv31
19:00 brion: mark rebooted pascal for reasons unknown
07:30 brion: started makeup dump runs on pmtpa databases which had dump failures. unsure of cause still...
03:02 brion: testing fixes to yahoo dump gen
01:00 brion: squids in yaseo are way into swap, slow. trying some restarts

February 1

23:48 brion: trimmed a message from wikimediafr-l logs for privacy by request
20:35 brion: srv10 back up and ips put in service
19:35 brion: srv10 down; squid errors
01:04 brion: adding cfp.wikimania.wikimedia.org redirect for those wikimaniacs

January 31

22:00 brion: hewiki, huwiki, iawiktionary dumps report failure in full-history dump. checking log for iawiktionary showed an XML error in the stub load partway through, but rerunning the command to a test dump was successful. cause unknown

January 30

23:45 brion: disabled blank passwords on wikis
23:00 mark: Upgraded pybal to a newer version on pascal
22:20 brion: started a refreshLinks for itwiki; some major category was broken by a bogus template
19:30 brion: installed APC for srv13-30. had to reduce apc shm size to 30 on i386 boxen. temporarily used a cvs checkout of apc, in /h/w/src/apc visibly
19:15 brion: trying to get APC installed on the machines recently upgraded to php 5.1
18:30 brion: disabled accesslog on amane's lighty

January 29

11:10 brion: fixed externallinks table on leuksman.com wikis :P
11:06 brion: enabled captcha on remaining non-wikipedias, so all small sites covered. large sites still off while the smaller ones collect live test data. (added captcha to new user form a couple hours ago)
07:15 Tim: started upgrading srv11-30 to PHP 5.1.2
06:33 Tim: fixed secure.wikimedia.org
~05:00 Tim: Upgraded srv12 to PHP 5.1.1. Working on srv11.

January 28

10:55 brion: enabled experimental captcha on small wikipedias (all except the top 20 most edited and yaseo) to get some more test data
05:52 brion: added VfD/AfD entries to robots.txt, bugzilla:4776

January 27

22:50 brion: running captcha generation test on amane
22:45 brion: amane's root partition filled with 41 gigs of lighty logs. :) cleared out, restarted lighty.
22:17 brion: got srv63 updated php modules. Note: it's using dba as built-in, not .so module. A warning on Apache start about missing the .so is normal until we get the rest updated this way.
21:48 brion: added '--with-cdb --with-gdbm=/usr' to install-php51 script
21:43 brion: trying to fix srv63. why do we have these things turn on apache on boot? it's incredibly stupid; they end up broken
21:06 Solar: srv63 back up
02:29 brion: started refreshLinks.php on yaseo, running on amaryllis
02:28 brion: ran update.php to update schema on yaseo wikis, which were forgotten
01:58 Tim: fixed spam blacklist and re-enabled it
01:36 Tim: started refreshLinks.php, running on srv31
00:59 Tim: Updated schema, enabled externallinks table

January 26

09:29 brion: disabled spam blacklist; more reports of all kinds of things triggering blacklist for no apparent reason
01:09 brion: got ImageMagick 6.2.6 installed everywhere. bleh.

January 25

15:38 ævar: Added a portal namespace & portal talk namespace to svwiki and ran php maintenance/namespaceDupes.php svwiki --fix to fix the one resulting conflict:

Checking namespace 100: "Portal"
... 1 conflicts detected:
... 209565 (0,"Portal:Musik") -> (100,"Musik") Portal:Musik
... resolving on page... ok.

09:59 brion: postfix was stuck; killed (zombies, kill -9 needed), restarting
01:18 brion: added FollowSymLinks and mime type for .7z on download-yaseo
01:00 brion: enabled indexes on download-yaseo

January 24

22:55 brion: restarted squid on srv8; it was serving lots of error pages to people for unknown reason, seems happier after
06:24 Tim: Updated /h/w/b/foreachwiki. Started running cleanup.php on all wikis.

January 23

19:55 brion: disabled digests option for all users on daily-article-l by request (list admins disabled digests)
09:13 brion: enabled APC (from HEAD) on leuksman.com
03:40 brion: dba module needs to be enabled on secure.wm.o

January 22

23:30 brion: syncing fedora-extras from a mirror in .jp; added to sync-fedora-mirror.sh script
23:10 brion: fedora-extras seems to be missing from fedora mirror in yaseo; fedora-extras.repo points to the local main fedora repo mirror which doesn't help
22:30 brion: restarted dump run in pmtpa; PHP utfnormal extension enabled to speed up non-Latin dumps
- prefetch was actually working ok once i got into the debug log to watch. slowness was from not loading utfnormal from dumpTextPass. now controlled by WIKIDEBUG env var at CommonSettings level
21:30 brion: ragweed down, no OTRS (mark rebooted it shortly after)
12:08 brion: aborted dumps on pmtpa and yaseo pending investigation
- setting WIKIDEBUG env var causes segfault in php on srv31. what the hell
11:12 brion: prefetch didn't work due to broken symlinks. restarting on pmtpa
10:00 brion: running dumps on srv31 in pmtpa
06:24 brion: running another test dump on yaseo; will go ahead and run one on pmtpa soon. setting up to use srv31 as the dump runner

January 21

21:25 mark: Half the knams servers were down, at which point PyBal decided not to depool any more servers. Consequence is that most traffic is attracted by the down server in LVS, and the site is more or less down. Fixed it by commenting out the down servers in /etc/pybal/squids. (PyBal will reload that file every minute)
19:15 brion: disabled interwiki cdb cache on yaseo wikis. domas forgot to install the required php module
18:10 domas: enabled interwiki cdb cache, cleaned logs on apaches
15:00 domas: installed dba extension (--with-cdb --with-gdbm=/usr) all around, will make use of it soon.
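
Note on the 21:25 entry: a rough sketch of the failure mode described there, assuming the load balancer refuses to depool a server once doing so would leave fewer than half of the configured servers in the pool. This is not PyBal's actual code; `can_depool`, `run_health_checks`, the 0.5 fraction, and the server names are all hypothetical.

```python
def can_depool(pooled: int, total: int, min_pooled_fraction: float = 0.5) -> bool:
    """Allow removing one more server only if enough of the pool would remain."""
    return (pooled - 1) >= total * min_pooled_fraction


def run_health_checks(servers, is_up, min_pooled_fraction=0.5):
    """Return the set of servers still pooled after one pass of health checks."""
    pooled = set(servers)
    for name in servers:
        if not is_up[name] and can_depool(len(pooled), len(servers), min_pooled_fraction):
            pooled.discard(name)
    return pooled


# Hypothetical pool of ten squids, six of which are down.
servers = [f"squid{i}" for i in range(1, 11)]
is_up = {name: i >= 6 for i, name in enumerate(servers)}

pooled = run_health_checks(servers, is_up)
# Five down servers are removed; the sixth stays pooled because removing it
# would drop the pool below half, so it keeps receiving traffic.
still_pooled_but_down = sorted(name for name in pooled if not is_up[name])
```

Under these assumptions the guard protects against a broken health check depooling everything, at the cost of sending traffic to a dead server when the outage is real, which matches the manual fix of commenting out the down servers.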

January 20

01:45 brion: syncookies all around.
01:33 brion: ah, the old tcp_syncookies. resolved.
01:11 brion: hella slow squids on .246/.247/.248
- srv8. lots of suppressed messages in syslog, on the order of 5k/second. BUT WHAT ARE THEY

January 19

21:25 mark: Resurrected hawthorn

January 18

08:18 brion: running another dump test on yaseo
01:58 brion: saw some breakage with LanguageZh_hk; its deps file was missing one dep (Zh_cn) which I've now added. at least it's consistent with theory so far :D
00:20 brion: added dependency-loading stubs for language and skin classes that need them. hopefully will help with http://pecl.php.net/bugs/bug.php?id=6503

January 17

20:00 brion: reenabled DoubleWiki extension on wikisource, it seems to work now
17:35 Tim: someone commented on our APC bug report that the problem seemed very similar to this bug, which was apparently fixed in CVS last October. There hasn't been a release since then. So I upgraded APC to CVS on the cluster and switched off the initEncoding hack. Fingers crossed.
11:20 brion: running an experimental job of the new backup script on yaseo. output and some control features need some more work, but it should at least pump out some files.
- tried to set up download-yaseo.wikimedia.org vhost on amaryllis rooted right at the public backups dir, but it's not working right for some reason. *shrug*
06:10 brion: set up a log of Mozilla and Google Accelerator prefetch requests in /h/w/logs/x-moz.
- Sasa^Stefanovic reported unexpected reverts happening when visiting user contribs pages, had Google Accelerator installed and turned it off on my request.
- Haven't yet found confirmation of such a bug w/ google accel, but I am seeing requests made from their proxies which include the fragment identifier which is odd. Emailed google about it.
01:20 brion: reverted $wgMimeDetectorCommand back to default. Setting to 'file -bi' broke SVGs on the site.
00:35 brion: installed hack for bugzilla:4635, safari breakage on pages with '.gz'. these pages are sent without gzip encoding to avoid triggering
00:29 brion: testing, seems that test.wikipedia.org does NOT use the local nfs version of CommonSettings.php
00:10ish brion: set mime detector to 'file -bi' on duesentrieb's advice
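
Note on the 00:35 entry: a minimal sketch of the kind of check the '.gz' hack implies, assuming the rule is "never gzip-encode a response whose URL path ends in .gz, otherwise gzip when the client advertises support". The function name `should_gzip` and the suffix test are assumptions, not the deployed code, and the sketch ignores `q=0` weights in `Accept-Encoding`.

```python
def should_gzip(request_path: str, accept_encoding: str) -> bool:
    """Decide whether to gzip-encode the response for this request."""
    # Assumed workaround: paths ending in ".gz" are sent uncompressed so the
    # browser does not mistake the encoding for the content itself.
    if request_path.endswith(".gz"):
        return False
    # Otherwise compress only if the client lists gzip as acceptable.
    encodings = [
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    ]
    return "gzip" in encodings
```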

January 16

16:15 mark: Deployed my new lvsmon like LVS script PyBal on Pascal, in /usr/local/pybal/
07:16 Tim: added a hack to Setup.php to automatically clear the APC cache if a language class is missing its parent.

January 15

21:51 brion: several instances of an odd error reported:
Jan 15 21:04:32 srv38.pmtpa.wmnet httpd: PHP Fatal error: Internal error: Failed to retrieve the reflection object in /usr/local/apache/common-local/php-1.5/includes/ProxyTools.php on line 133

This is some kind of PHP5 constructor error [2] which should never occur. The referenced line is a 'global' declaration in a top-level function. Did an apache restart to try clearing caches...
21:27 brion: enabled semi-protection on jawikinews (bugzilla:4608) and eswiki ([3])
20:24 brion: default logo on all wikiquotes was the locally-uploaded copy, but many don't have one. changed to the en.wiki copy as default
20:22 brion: moved en.wikiquote uploads to where they belong, working.
20:19 brion: found that en.wikiquote uploads aren't working properly; the dir is a symlink into /home which is no longer mounted on amane. need to rearrange the files...

January 14

16:18 brion: did extra sync; some old files stuck on servers or something (for instance SpecialUserlogin, with the broken password button)
15:15 brion: parser cache bug :( updated wgCacheEpoch and did a few manual squid purges of main pages
00:30 brion: trimmed obsolete funddrive dir from docroot/foundation (didn't work anymore, superseded by fundraising.wikimedia.org)

January 13

20:10 brion: broke some stuff for a couple minutes trying to make clone() work; now have a wfClone() for PHP4 compat until we finish killing PHP4
12:35 Tim: started refreshLinks.php with template redirect fix in place
08:15 brion: reverted experimental anti-bot hacks to login page; it was breaking 'mail new password'
07:30 jeluf: added info-es alias on zwinger, forwarding to OTRS
07:00 jeluf: hawthorn rebooted
06:30– Tim: second attempt at upgrading to PHP 5. Watching CPU stats closely this time.
06:17 Tim: added access_log to logrotate.conf, up to 10 GB will be stored
06:03 Tim: Amane's root partition was full due to 40 GB access_log. Deleted it and restarted lighttpd.
05:50 Tim: Put copy of skeleton /home on amane
05:15 Tim: unmounted /home on amane. Amane's network out shot up.
04:25 Tim: killed updatedb on amane and removed it from cron.daily
01:05 brion: briefly broke redirects when upgrading; forgot index.php had been reverted temporarily during yesterday's excitement.

January 12

20:00 jeluf: changed Dutch wikis to allow patrolling by users, not only by sysops.
19:30 jeluf: rebooted fuchsia, purged squid cache.
19:14 ævar: I discovered that viwiki has made an extension to the software in JavaScript. I did a quick security review of it and it doesn't appear to be evil(TM) in any way. It's basically an input method written in JavaScript (docs in Vietnamese), for example try going to their sandbox, select "Tự động" or "Telex" and type "aw" in the input box, it'll be converted to "ă". Still, more eyes on the source code: Monobook.js and Him.js (main program) couldn't hurt given some of the evil javascript we've been removing recently.
11:00- Tim: Upgrading PHP on srv31-70 to PHP 5.1.1
04:35 brion: installing xdebug on all apaches so it's available
04:20 brion: with xdebug extension was able to limit recursion within php and got a stack trace pointing to Image.php svg thumb rendering. bug in tim's recent changes was found to be the culprit. reverted Image.php while working
03:42 brion: monitoring very high apache load situation with logs of segfaults.
- Appears to be some recursion -> segfault in PHP (PHP crash backtrace); may be recursion in user-level function

January 11

22:30 jeluf: rebooted ragweed after crash
22:00 mark: Added srv71-160 to DNS
21:00 jeluf: installed srv71...78. 79 and 80 need to be rebooted, probably using the old kernel.

January 10

21:19 brion: put paypal donation form back onto fundraising.wikimedia.org/ongoing top/year/month-level pages
21:15 brion: added no.wikinews

January 9

22:39 hashar: fix project namespace for bn: and csb: languages
21:56 hashar: ocwiktionary is now case sensitive.
21:56 brion: switching php error logs from local files to syslog, which should go to zwinger and include the hostnames
21:48 hashar & nikerabbit: fixed ga: project namespaces (now use genitive)
19:09 brion: where the hell is the documentation on external storage servers? srv34/srv33/srv32 aren't even documented -- they're listed as apaches.
- They are apaches, all external storage servers are dual-purpose. A list of external storage servers can be found in db.php. -- Tim 03:40, 10 January 2006 (PST)
17:45 brion: noticed secure.wikimedia.org is broken:
- Error in numRows(): SELECT command denied to user: 'wikiuser@goeje.wikimedia.org' for table 'blobs'
17:00 brion: we have mysterious huge load on apaches, started about 5 hours ago. restarted all apaches to see...
11:10 Domas: parser cache set to 2 weeks
05:48 Tim: enabled direct external storage on enwiki
04:01 Tim: enabled direct external storage on meta, as a pilot.

January 8

17:37 brion: odd db access error about 40 minutes ago:
- Sun Jan 8 16:56:02 UTC 2006 srv68 RecentChange::markPatrolled 10.0.0.101 1146 Table 'nlwiki.recemtchanges' doesn't exist (10.0.0.101) UPDATE `recemtchanges` SET rc_patrolled = '1' WHERE rc_id = '2429501'
- the source for this file on this server looks ok; no memory errors in /var/log/mcelog
- very odd
12:16 hashar: un hardcoded languageNV ns_project
12:05 hashar: un hardcoded languageOC ns_project (bug 4526)
05:30 brion: adding CNAME download-yaseo.wikimedia.org to amaryllis; yaseo dumps will be there ...

January 7

18:50 brion: running test dump for yahoo's abstract thingy for enwiki on benet (from samuel)
14:37 hashar: recached special:disambiguation for all pmtpa databases.
14:30 hashar: WARNING ran cvs up. That raised a lot of conflict. Attempting to solve them.
01:30 brion: added user throttle to en in response to registration flood
- deleted some of the crud accounts
- extra live hacks pending captcha later

January 6

14:42 Tim: Restarted squid on ragweed, was refusing connections. Four knams squids are currently down.
13:00 jeluf: destroyed attachments of a posting on wikimediach-l (personal CV) upon Delphine's request

January 5

22:30 mark: Added email alias for Monica in wikimedia.org
21:35 brion: put a CACert-issued SSL cert on secure.wikimedia.org
21:25 brion: added tr.wikisource (bugzilla:4333) and is.wikisource (bugzilla:4471)
21:00 mark: sq1 seems broken:

scsi3 (0:0): rejecting I/O to dead device
sde : READ CAPACITY failed.
sde : status=0, message=00, host=0, driver=04
sde : sense not available.
scsi3 (0:0): rejecting I/O to dead device
sde: Write Protect is off
sde: Mode Sense: 00 00 00 00
sde: assuming drive cache: write through
 sde:<3>scsi3 (0:0): rejecting I/O to dead device
Buffer I/O error on device sde, logical block 0

mark: Fixed Fedora mirror on Albert, FC4 mirror is ok now
18:20 brion: unblocked faleg.org from squids leech list, swears to do good (and moved stuff to tools server)
17:05 brion: changed SSL cert for wikitech to one signed by CACert
16:40 brion: added redirection for *.wikipedia.info
05:47 Tim: Enabled plus signs in titles
05:35 ævar: "Gordon Lyon" => "Fyodor Vaskovich" at http://fundraising.wikimedia.org/2005q4/index.php/2006-01-04/detail/

January 4

21:30 jeluf: power cycled ragweed once again

January 3

04:34 brion: set amane's hardware clock to UTC, was on US mountain time
04:31 brion: load seems to have stabilized, things seem to be working
04:28 brion: colo rebooted amane. it seems to be working now, but getting lots of hangs and things on main web
- still waiting to fudge stuff
03:09 amane seems to be dead; hangs on http, ssh; pings though
- was able to root-ssh in, /home mount was hung. unmounted, remounted, restarted lighty; now dead again, but no longer accepting ssh and NFS server is also dead. site is down
02:33 Tim: Thousands of pdns_control processes started by crond were running on zwinger, stuck waiting for a hung pdns_server process. Fixed.

January 2

19:50 Domas: putting holbach into dewiki only operation

January 1

20:22 Tim: starting refreshLinks.php
19:10 Tim: putting templatelinks code live. Schema update finished a few hours ago.
16:30 ævar: Got into tingxi, it's mysql gone wild, ~50 load, trying to contact the mysql server which isn't working with all the load..
~15:55 ævar: tingxi is superloaded (or something) and isidore might be as well (might be using that for the SQL, not sure), as a result fundraising.wikimedia.org is down.

$ ssh tingxi "uptime"
16:03:18 up 124 days, 23:18, 0 users, load average: 78.62, 75.26, 72.74

..I haven't been able to open a normal shell...

12:00 Tim: gave Datrio steward access on jawiki
10:11 Solar: failed drive repaired on sq7, up on /d

December 31

06:40 brion: adding en2 to dns aliases

December 30

18:55 brion: added the one-time donation form to [4] by mav's request
18:30 brion: set up private board wiki
15:00 brion: banned another leech

December 29

16:00 mark: ~~Resized RRDs (of Cricket/ http://noc.wikimedia.org/stats.php ) to store more than 4 months of data...~~ Rolled back. As always with rrdtool, there were "issues"... Sigh.
15:55 brion: added https://bugzilla.wikimedia.org/ SSL alias

December 28

22:30 jeluf: replaced ssl certificate on https://tickets.wikimedia.org/
21:19 brion: running cleanupCaps on zhwiktionary; changed the caps setting a few hours ago (bugzilla:4351)
10:10 brion: killed a leech
- srv5 is alive, but not yet configured. host key has changed, local login keys not installed. this may be an annoyance during squid updates.
Jamesday: gzipped binary logs on Adler; still need to be moved to long term storage. Now has 57GB/18 days of space for binary logs.

December 26

11:49 brion: hacked RecentChange.php so the IRC output uses getInternalUrl(), so https urls don't go into the irc stream and confuse things
- correct solution may be to add another level, 'getPrimaryUrl' or something. or else declare 'internal' to mean 'external' ;)
08:30 brion: added goeje's new ip to ourusers for adler
07:22 brion: set 'reupload' permission off for regular users, on for autoconfirmed users (older accounts) in response to persistent upload vandalism

December 25

23:12 ævar: Installed a new special page extension, Special:Filepath, redirects user agents to the full path of a file like this.
22:57 ævar: Ran scap
14:27 mark: Partially set up sq1, but got frustrated by all the yum / Fedora crap.
13:11 brion: left goeje waiting for a reboot on kernel upgrade; shutdown is hanging on an nfs unmount which should eventually time out.
- apache 2.2 is installed, need to set up php and then start experimenting with stuff
- need to take goeje out of apache nodegroup, but *not* mediawiki_installation
12:13 brion: moved goeje to vlan 1, on 207.142.131.221
11:30 brion: I'm going to reassign webster's external IP to one of the old 512 mb apaches and use it as an experimental https server for secure logins to the wikis
04:33 Solar: sq7 is up at 10.0.3.7, but it is missing one drive at /d, RMA requested
04:15 Solar: srv5 is back up with a fresh FC3
04:00 Solar: srv71-80 are up at 10.0.2.71-80
02:26 Solar: srv57 and srv61 are back up
02:14 Solar: sq3 is up at 10.0.3.3
01:54 ævar: Installed a debuglog for Cite on enwiki to debug a whitespace generating problem I can't reproduce locally, even with tidy.
~00:45 ævar: Installed extensions/Cite/Cite.php site-wide, doesn't appear to be working on yaseo, as in <ref> & <references> just shows up as if the extension wasn't defined, even though it's required_once in CommonSettings.php on amaryllis, is it using some other system now?

...It's because I don't have permission to do anything at yaseo except on amaryllis...

$ ssh zwinger.wikimedia.org 
Last login: Sun Dec 25 00:46:21 2005 from adsl6-56.simnet.is
**** Documentation wiki at http://wikitech.leuksman.com/ ****
[0113][avar@zwinger:~]$ ssh amaryllis
Last login: Sun Dec 25 00:55:57 2005 from zwinger.wikimedia.org
Fedora Core linux kickstart-installed on Sun Sep 11 03:22:28 UTC 2005
[avar@amaryllis ~]$ dsh -f -N mediawiki-installation "hostname"
executing 'hostname'
avar@211.115.107.145's password:
avar@211.115.107.143's password:
avar@211.115.107.144's password:
avar@211.115.107.149's password:
avar@211.115.107.148's password:
avar@211.115.107.146's password:
avar@211.115.107.153's password:
avar@211.115.107.155's password:
avar@211.115.107.150's password:
avar@211.115.107.152's password:
avar@211.115.107.147's password:
avar@211.115.107.154's password:

December 24

09:49 brion: switched wikitech.leuksman.com to HTTPS
Out of interest, why? cause brion prefer yellow in URL bar.
Would like to move more of our infrastructure stuff to be behind encrypted connections so passwords won't be exposed on insecure wireless networks when we're at conferences (ccc, wikimania, etc). While it would only be annoying if someone gets into wikitech or bugzilla accounts, sysop accounts on the main wikis or access to the internal wikis might be even more dangerous. Starting small, moving up.
09:30 brion: upgraded leuksman.com to Apache 2.2.0 and PHP 5.1.2RC1.
- Had to set 'EnableSendfile Off' to fix zero-length responses for static files. Probably something funny with the virtual server's kernel or filesystem.
03:35 brion: turned on autoconfirm protection level on dewiki by elian's request
02:59 brion: set local logo for hiwiki

December 23

05:39 brion: removed leftover Amethyst.php from servers in pmtpa

December 22

07:00 brion: installed new protection interface. set newbies time to 4 days

December 21

10:00 mark: Raised bandwidth limit of csw1-pmtpa's port gi0/33 (Bomis/Wikicities) to 100 Mbit/s
09:40 brion: updated the squid error page
05:31 ævar: ran maintenance/updateSpecialPages.php --only=Unwatchedpages on all pmtpa wikis.
05:24 ævar: Enabled Special:Unwatchedpages for users with protect permission and modified the querycache to cache 5000 pages for that instead of the default 1000. Jimbo made me!

December 20

17:42 ævar: People were still reporting problems with $wgOut & sitenotice, cvs up'ed & ran scap
14:59 ævar: Brion recently changed the sitenotice to use $wgOut->parse() instead of $p = new Parser; $notice = $p->parse(...); Apparently $wgOut is not always an object at that point. Nikerabbit reported a fatal call on a non object on that line. Inserted a live hack that tells people to report to #wikimedia-tech if it isn't an object while we hunt down why it doesn't get initialized properly sometimes.
14:27 ævar: Turned on rcpatrol on fiwiki much to the enjoyment of domas

December 19

20:00 Domas: srv57 and srv61 down, used srv70 and srv55 as Tugela replacements.
15:00 mark: Resurrected mint and lily.

December 18

14:20 Tim: attempted to restart lily, it crashed 20 hours ago.
00:00 Domas: holbach resurrected and is working as db slave...

December 17

13:05 Solar: Holbach is available at 10.0.0.24
11:35 Solar: sq1-10 minus 3 and 7 ( hardware errors ) are up with 10.0.3.x ip's
06:40 brion: installed <fundraising/> extension (FixedImage) for the fundraising progress bar
01:55 brion: reinstalled php 5.1.1 on tingxi with gd enabled
01:50 brion: briefly locked new registrations on zh.wikipedia while adding a range block;
0:10 brion: rebuilt interwikis (bugzilla:1586)

December 16

04:45 brion: installed apache 2.2 and php 5.1.1 on tingxi for fundraising info server (with SSL)

December 15

23:35 ævar: Removed evil privacy invading javascript counting thing from http://wikimedia.org/nl-portal/ and http://wikimedia.org/be-portal/, the javascript pointed to a counter at http://e0.extreme-dm.com/
07:00 jeluf: Power cycled ragweed. Again.

December 14

20:30 hashar: added stylesheet for http://static.wikipedia.org/
12:50 mark: Built a new squid RPM (2.5.STABLE12-2wm) that sets a maximum resident memory size (default: 2 GB, specifiable in /etc/sysconfig/squid), and tested it on fuchsia
11:20 mark: Decreased the Squid timeout value of lvsmon on pascal to 10 seconds, and restarted iris which was thrashing heavily.

December 13

22:31 brion: benet ran out of disk space, looking at where it went
19:22 brion: review of dump status shows that srv30 broke during the dump circa 04:22 yesterday, crashing enwiki and eswiki. restarting those two dumps
01:40 Tim: Restarted python IRC client on browne, on reports that no more channels were being created

December 12

22:40 brion: reinstalled turck-mmcache on tingxi; had not been upgraded after PHP recompile and was whining about version mismatch
14:30 mark: Resurrected mint which apparently had crashed two days ago.
03:00-5:00 Tim: restarted some apaches with hung processes waiting for NFS

December 11

13:17 hashar: BUG zwinger:/tmp/mediawiki/ should probably be in /var/cache/mediawiki/confs/ and wikitech group writable.
- This is not the place to report bugs. Please use the IRC channel. -- Tim 20:55, 11 December 2005 (PST)
13:16 hashar: created namespaces for itwiki & itwikisource (#bug 4247).
09:33 brion: dumps running in pmtpa on benet/srv35/srv36; in yaseo on amaryllis

December 10

05:20 brion: leuksman.com mysql & apache went wacko, memory limits killing things... restarted mysqld and apache
~05:00 Tim: dsh -N mediawiki-installation -f chmod -R 777 /tmp/mediawiki . And changed MessageCache.php so that it will stay that way.
01:40 brion: segfaults on leuksman.com reappeared; got backtrace, posted additional details on similar-looking php bug 35140. I have disabled APC on this server to try to reproduce the bug without it.
00:10 brion: set up cywikisource and copied in some pages (bugzilla:4228)

December 9

14:40 mark: Shutdown Tunnel0 on csw2-knams as an attempt to solve weird routing problems
06:30 Solar: new squids are racked, but only sq1 and sq2 are up at 10.0.3.1-2
04:47 Tim: set up www.wikimedia.org as a portal editable via meta, like the others

December 8

17:43 brion: ns1 and ns2.wikimedia.org don't have updated DNS. what's wrong??
11:40 Solar: sq1 is connected to the SCS port 9.
11:29 Solar: asw3-pmtpa is racked and connected to the scs
11:00 Solar: connected equ1's eth1 interface to csw4-pmtpa's port 34
10:16 ævar: Turned allowemailchange on in bugzilla, users can now change their email
10:12 Solar: fixed srv66's grub.conf to boot to correct kernel
08:21 Domas: used srv70 as emergency tugela as srv66 down
07:42 brion: updating tingxi in forward/reverse DNS and adding 'fundraising' CNAME
06:30 brion: taking tingxi out of apache groups, giving it an external setup for fundraising utilities
03:54 kate: stopped lomaria to dump for import to zedler
02:10 brion: cleaning up after bogus CVS updates in common dir owned by hashar

December 7

13:30 mark: Rerouted traffic back to knams
12:00 mark: Rerouted knams traffic to pmtpa because of networking problems near knams
11:14 brion: recompiled apache/php/apc on leuksman.com, hoping to debug intermittent segfaults if they continue

December 6

19:00 jeluf: upgraded OTRS to 2.0.4
05:47 ævar: cvs up'ed includes/SpecialVersion.php, there was a conflict, I removed the following code (the top part) since I presume it's not an issue anymore and the offending site has been blocked:

<<<<<<< SpecialVersion.php
                $ip =  str_replace( '--', ' - - ', htmlspecialchars( wfGetIP() ) );
                #return "<!-- visited from $ip -->\n";
                # hacked to a hidden span since one nasty was stripping comments
                return "<span style='display:none'>visited from $ip</span>\n";
=======
                $ip =  str_replace( '--', '-', htmlspecialchars( wfGetIP() ) );
                return "<!-- visited from $ip -->\n";
>>>>>>> 1.32

04:23 ævar: De-installed Special:Cite on commons, meta, sources, species, foundation, nostalgia and mediawikiwiki. We really should have a $site variable that can be counted on (doesn't return wikipedia for non-wikipedia sites)
01:48 Tim: installed Folding@Home on the yaseo apaches
00:50-01:30 Tim: Started Folding@Home on knams squids

December 5

22:13 Domas: Did bring back srv9 (not sure if it is a good idea). Removed bayle/will from service. All squids are null-storage now.
20:35 hashar: apache-(restart|gracefull)-all(hard)? now use dologmsg instead of wikibugs
20:26 Hashar: http://www.mediawiki.org/FAQ now redirects to meta: page (rewrite rule for virtual host mediawiki.org).
18:00 Domas: noticed packetloss, talking to PM support
17:30 Domas: did put srv6 squid into i/o-less operation, as srv10 had same hitrate ;-)

December 4

20:30 jeluf: Several people report problems with Linker.php:504, thumbnail linking code. As a workaround, submitted and deployed Linker.php,rev-1.56
13:27 hashar: changed kawiki & kawiktionary namespaces (bugs 2103 & 3905)
10:10 brion: fixed tingxi's sudoers, fixed tingxi's /usr/local/apache/conf, synced its mediawiki, trying to start it. working? maybe
10:00 brion: stopped apache on tingxi, has damaged copy of mediawiki
04:05 brion: pascal root partition is full, needs cleanup
- deleted ~350 megs of old kernel modules from /lib/module, leaving those for 2.6.12-1.1381_FC3
03:02 Tim: srv55 has reported no more MCE errors, re-added to the apache pool
01:45 Tim: fixed Turck on humboldt
00:01 Domas, Tim: amane up and running, site back to normal

December 3

21:20 mark: zwinger's /home mount on amane is broken, all fs calls block
08:40 Tim: restarted squid on srv8, it had crashed
07:47 Tim: Fixed ntpd on coronelli, harris, larousse and adler.
07:14 Tim: Fixed ntpd on vincent and maurus. Stepped their clocks.
06:41 brion: fixed sync-file so that message is optional again, like its help message claims
06:10 Solar: Replaced what I believe to be the bad stick of ram for srv55. Its up.
03:01 brion: added info-pl mail alias to OTRS
02:09 brion: hopefully fixed the lucene restart problem; new mono installation in /usr/local wasn't in the PATH in crontab. hacked the init script to add it back

December 2

22:00 brion: watching search daemons more closely; i think they're not properly restarting on the hourly restart cronjob
- run logs in /var/log/mwdaemon-run.log
- also clocks are very bad on maurus and vincent, need to ntp them
21:19 hashar: sync-file now accepts comments after the file name.
21:23 brion: restarted lucene daemons; for some reason they had all died
09:17 brion: got lucene daemons back up and hopefully running.
- there was an extra restart script in my personal crontab on maurus which seemed to be messing things up there
- added a 'ulimit -n 8192' on the init script
05:45 brion: yum mirrors appear to be broken (missing repo files), trying to re-sync
03:44 Tim: srv43 didn't come back into the apache pool after restart, fixed
02:28 brion: restarted search daemons, stuck

December 1

23:53 brion: installed joe 3.3 on zwinger (in /usr/local/bin), handles utf-8 files properly
23:01 hashar: knams cluster was unreachable for roughly 2 minutes, probably a maintenance on kennisnet side.
22:39 hashar: created Portal namespaces on ptwiki #3385
22:35 hashar: renamed namespaces on huwikibooks. 2 conflicts. #3783
19:54 brion: moved old .conf files from /h/w/conf to /h/w/conf/httpd-old to reduce confusion
19:54 hashar: gracefulled all pmtpa apaches to fix bug #4131
19:49 hashar: fixed apache-sanity-check , calls to 'ip' missed '/sbin/'
18:00 mark: Setup failover LVS on avicenna and alrazi. Still needs lvsmon, and isn't active yet. Uses CARP for failover.
14:30 mark: Removed avicenna and alrazi from Apache duty, as I am going to use them as LVS load balancers.
06:30 brion: removed /tmp/mediawiki/* caches on srv36; the backup run had saved a bunch by root and apache screamed about being unable to write them
06:25 brion: restarted apache on yf1005; odd PHP error, possibly APC cache breakage.
- Fatal error: main(): Failed opening required '' (include_path='/usr/local/apache/common/php-1.5:/usr/local/apache/common/php-1.5/includes: /usr/local/apache/common/php-1.5/languages:/usr/local/apache/common/php-1.5/templates: /usr/local/apache/common/php-1.5/extensions/wikihiero:/usr/local/lib/php:/usr/share/pear') in ½Íÿ on line 14
05:47 Solar: Ariel's raid has "failed", but no real disk failures. It put the array back online and rebooted. We'll see how it does.
05:00 Solar: Moved srv35-43 to second cage. Racked new sq1-sq10.
04:30 brion: deleted 20051127 enwiki pages_full dumps, since srv36 was turned off before they finished

November 30

21:49 brion: fixed upload dirs for wikimediafoundation.org
05:43 Solar: Racked donated load balancer in core cage on csw1-pmtpa port 34
02:25 brion: removed a privacy-violation in a revision comment via database edit (enwiki rev_id 29652015)

November 29

Note: ZX will be rebooting and upgrading most knams machines tonight, to help fix our problems. They will be taking machines down one by one, so this shouldn't give downtime - in theory. If it does, check whether the LVS ip is bound to the machines when they come up.

23:30 mark: Apparently service ips often were not added because /etc/rc.d/rc.local wasn't run... because it did not have eXecute permissions on some machines. Fixed.
21:56 hashar: rebuildMessages.php finished.
21:45 brion: bugzilla:4115 setting up latex on latest srv batch, adding to setup-apache
21:30 brion: found and fixed upload files for meta
21:15 brion: investigating broken upload files on meta
20:49 hashar: fixed bug 4048 and running 'rebuildMessages.php --update' on all wikis.
16:35 mark: Squid wasn't running on srv6, started
16:15 mark: Ran yum upgrade on all knams machines
15:15 mark: Reversed the change as it didn't work anyway: Squid simply ignores failure on binding IPs.
14:00 mark: Adapted the Squid configurator / squid.conf.php to explicitly bind to the Squid's main IP address and the LVS IP, if applicable. Meant to ensure that Squid will not start if the LVS IP is not bound to the machine, so lvsmon can detect that.

November 28

23:34 Hashar: uploaded a picture of clusters, please post comment on image talk page so I can modify / update it.
21:50 Domas: restarted rogue failing (bytecode cache issues?) apaches: srv47, srv4, srv37, srv63, srv58, srv67, srv68, srv53, srv39
20:40 Domas: ragweed booted up, started squid, then started something else (for a minute or two), then ran rc.local with LVS IP adding... site down for several minutes
19:58 brion: ragweed is down (no ping), OTRS dead
14:35 ævar: Site crashed because of insufficient sanity checks, my bad.
14:30 Domas: srv62 tugela crashed, no core dump yet, if crashes persist will need some poking, either code, or srv62. mcelog empty.

tugela-fc3-x64[3634]: segfault at 00000000010e1000 rip 0000003a781716e0 rsp 0000007fbffff668 error 6

00:44 ævar: Changed the project name and metanamespace for iswikibooks to Wikiorðabók
00:00 Domas: oops, ran tugela on srv51-srv54,srv56-srv69 instead of memcached, will see how it performs/scales/...

November 27

23:21 hashar: thanks to palica : updated Server inventory bot to add a link to ganglia.
19:09 hashar: added two scripts to check database : 'mysql-list' & 'replication'
18:49 hashar: BUG rose got 4 memcached instances but they are not listed in mc-pmtpa.php
18:47 hashar: commented 10.0.2.43:10000 from mc-pmtpa.php
13:13 ævar: Installed Special:Cite on all the wikipedias
10:15 brion: blocked wikipedia-l, wikien-l, and helpdesk-l list archives in mail.wikipedia.org's robots.txt to discourage future complaints about embarrassing newbie posts becoming #1 google hits. Search patches for mailman archives should be integrated at some point...
08:55 JeLuF: added http://www.spy-sweeper-webroot.de/wiki/?/ to squid's leecher blocklist
07:38 Solar: smellie is ready for service. Turned off seLinux.
07:30 Solar: srv5 is out with a bad case of bad blocks
07:00 Solar: Crossed over to the new switch, csw4-pmtpa
03:09:56 ævar: Installed Special:Cite on enwiki as an experiment.
01:39 Tim: took srv55 out of service, likely dud RAM. MCE errors reported.
01:10 Tim: squid on will had crashed. Restarted.
01:05 Domas: fixed default route on tingxi

November 26

17:30 jeluf: changed password of wikipl-l admin account. Gave new PW to Datrio. Documented PW at the usual place.
14:36 Tim: put srv52-70 into apache service. I broke srv51 with a restart test.
12:00 Tim: wrote /h/w/b/apache-sanity-check, set up scripts such as apache-start to run it and refuse to start apache if the necessary LVS-friendly conditions are not met.
~11:00 Tim: broke site temporarily due to LVS-related misconfiguration
10:53 Tim: rose, tingxi and srv2 had apache running but no LVS VIP. This would explain the random hanging behaviour with ab -X apaches:80. Fixed temporarily, will look into a permanent solution.
08:45 Tim: LVS wasn't decommissioned properly on iris. LVS on pascal was forwarding packets to LVS on iris, and iris, with no lvsmon running, forwarded most of those packets to sage, which is down. Thus users were seeing connection timeouts. Fixed with ipvsadm -D -t rrvs.knams.wikimedia.org:80.
07:41 Tim: srv5 still not up. Moved its virtual IPs, one to srv6, one to srv8 and one to srv10.
07:15 Tim: did a fsck of srv5 then a system reboot
03:32 srv5's root partition spontaneously declared "read-only filesystem". Logs stopped moving. Mount reported that it was still rw, but it couldn't be written to.

mount uses the contents of /etc/mtab to display mounts. These are not updated when the file system is r/o. Use /proc/mounts instead.
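
The note above can be sketched in Python: read the kernel's own mount table rather than /etc/mtab and look for the "ro" option. This is a minimal illustration, not a script that was used on srv5; the function names and the SAMPLE text (including the device name) are made up for the example, and the field layout is the standard /proc/mounts one (device, mount point, type, options, dump, pass).

```python
def mount_options(proc_mounts_text, mount_point):
    """Return the option list reported for mount_point, or None if absent."""
    options = None
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep scanning: a later entry for the same mount point
            # shadows an earlier one (e.g. rootfs, then the real root).
            options = fields[3].split(",")
    return options


def is_read_only(proc_mounts_text, mount_point):
    """True if the kernel lists mount_point with the 'ro' option."""
    options = mount_options(proc_mounts_text, mount_point)
    return options is not None and "ro" in options


# Hypothetical sample in /proc/mounts format, used when the real file
# is not available (for instance on a non-Linux machine).
SAMPLE = """\
rootfs / rootfs rw 0 0
/dev/sda1 / ext3 ro,data=ordered 0 0
proc /proc proc rw 0 0
"""

if __name__ == "__main__":
    try:
        with open("/proc/mounts") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE
    print("/ read-only:", is_read_only(text, "/"))
```

Checking "/" this way would have shown the remount to read-only even though mount still printed rw from the stale mtab.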

05:50 Tim: introduced time and memory limit for rsvg and convert
01:45 Tim: started image backup using updated scripts in /h/w/b
00:14 ævar: changed the logo for iswiktionary.

November 25

21:45 Hashar: killed some rsvg processes on various apaches. Seems they tried to render a 120px thumb of /commons/7/70/Interstate_Highways.svg (possible DoS? :( ).
04:40 Tim: experimentally enabled keepalive on apache.
03:35 Tim: testing lvsmon failover by stopping squid on clematis
03:05 Jamesday: Adler had 11GB disk free. gzipped first 80 binlogs to raise it to 48GB or so. gzipped versions still need to be moved to wherever we're keeping them these days.

November 24

06:30 kate: setting up l3 failover.. see that page for details
02:55 brion: took coronelli out of search rotation while kyle moves it around

November 23

21:18 mark: Routing problems from 38.0.0.0/8 (cogent ip space) to florida. Altered the countries.nerd.dk file to reroute that prefix via knams.
20:44 mark: Reinstated the normal epoll RPM on mint, as epoll wasn't the problem
16:44 brion: fixed arrangement of upload directories for several sites (non-wikipedia :P)
00:35 kate: "ntp source vlan1" fixed NTP problem on csw1, but need to work out why traffic to 64.156.25.242 is being dropped
00:04 kate: upgraded csw4-pmtpa to 12.2(25)SED, enabled ssh and configured vlan 2 properly

November 22

22:33 brion: amane still seems to work. YAY \o/
21:49 brion: restarted apache on zwinger, wasn't loading
21:45 brion: increased php fastcgi workers on amane to absurd levels for thumbs to run
21:30 brion: mostly working now! had to set server.max-workers to 8 in lighty to get it running smoothly
19:28 brion: mounted /mnt/upload3 (amane) on zwinger, was missing mountpoint
19:22 brion: mounted /mnt/upload3 (amane) on srv2, was missing mountpoint
19:11 brion: restarted albert's http temporarily to cover the work period
19:02 brion: khaldun copy finally finished, rearranging bits on amane
15:21 brion: turned albert's http back off (hope you're done) so khaldun can finish its copy without the extra load
07:44 brion: started albert's http so kate can set things up requiring the local fedora yum mirror
05:08 kate: configured asw2-pmtpa. has the new srvs and the equ device on it (equ is 10.0.1.3)
00:55 brion: started copying commons files from bacon -> amane. disabled albert's apache
00:45 brion: started copying enwiki files from khaldun -> amane, non-wikipedia non-wiktionary files from albert -> amane
00:35 brion: started copying files bacon -> amane
00:20 brion: disabled uploads sitewide

November 21

23:10 brion: setting up to move uploads to amane, will disable all uploads and upload.wikimedia.org for a while to make this damn thing happen
21:15 brion: started lucene index rebuild on maurus
21:05 brion: restarted squid on will, was not responding (stuck) on port 80
20:49 brion: restarted apache on ragweed; https was down so otrs inaccessible
20:30 mark: Brought sage and mayflower back up.
20:00 mayflower went down.
20:00 mark: Moved LVS back to pascal to allow iris to be a squid again.
19:45 mark: Modified lvsmon on iris because it was always sending curl requests with Pragma: no-cache, and was therefore testing the whole chain to florida.
19:45 sage went down.
18:00 mark: Installed non-epoll RPM on mint to compare.
17:56:40-17:57:31 ævar: Invalid argument notices were being generated in this time period due to me syncing three files and them depending on each other, ok now.
17:30 mark: udpmcast wasn't running on pascal. No idea since when... started.
17:30 jeluf: Restarted ragweed. Came back after powercycling and fsck.
16:30 ragweed broken.
12:14 erik: Updated logo of nap.wikipedia.org and sync'd InitialiseSettings.php

November 20

23:30 mark: Upcoming maintenance of knams tomorrow (ZX will do some firmware upgrades, rebooting at least pascal and vandale). Moved LVS to iris because of that.
20:00 JeLuF: All wikipedia.org upload directories moved off of albert and to amane.
18:03 Hashar: fixed #4022 'Asia/Seoul' timezone for kowiki.
17:50 Hashar: switched some logos to /b/bc/Wiki.png
14:24 JeLuF: chown -R apache:apache amane:/export/upload/wikipedia.org/
14:19 Hashar: in amane:/export/upload/wikipedia.org/ some directories can't be written by apache (af de es & fr). dewiki upload page reports an error.
09:26 Tim: Fixed NTP broadcast, documented
03:21 Tim: Fixed perl upgrade on srv51-70 as per [5]

November 19

17:05 Tim: same on fuchsia
16:50 Tim: restarted squid on clematis, disabled swap.
16:05 Tim: upgraded otrs on ragweed to version 2.0.3, after Anthere complained about this bug: [6]. Minor upgrades within the 2.0.x series weren't documented (just an unanswered question on the ML), so I just untarred over the top of the old directory, with a backup in /opt/otrs-2.0.1. Treat any problem symptomatically, some chmodding might be required.
15:40 Tim: restarted squid on bayle

November 18

23:30 brion: installing ploticus 2.32 on mediawiki-installation, set to use gd & truetype fonts (bugzilla:3965)
- truetype fonts in common/fonts
07:00 jeluf: migration of dewiki's image and thumbnail directories done. archive and shared will be moved when albert has more headroom. Some 30 small to medium wikis moved. Currently running frwiki thumbnail migration.
00:27 brion: blocked another leech [7]

November 17

14:30 mark: ragweed was missing the LVS ip, fixed. Also readded iris as squid.
06:30 Tim: Added root key to srv51-70. The following machines didn't want to cooperate: 56, 64, 66, 67, 69
06:05 Tim: added srv51-70 to DNS, created a node group. Configured albert's BIND as a slave for the 10/8 reverse DNS zone.
05:46 Solar: srv2 is back up.
05:38 Solar: srv56 is up too.
05:26 Solar: srv51-srv70 are ready for Rock & Roll! (Except srv56 has some hardware issue)
04:34 Solar: holbach is rebuilt and ready
03:47 Tim: added tingxi and rose to the apaches node group. Left harris out, it sucks.
03:30 Tim: after moving some more hosts to the misc2 cluster, restarted gmond on the apache cluster to remove hosts which have been moved out
02:24 Tim: fixed amane's date, started ntpd
01:49 Tim: Created "Misc VLAN2" cluster on ganglia, for miscellaneous hosts which, due to being in the wrong VLAN, couldn't be in Miscellaneous.

November 16

8:25 brion: srv50 error_log flooded disk; removed and restarted apache
6:30 jeluf: moved es upload area to amane:/export/upload
5:30 jeluf: moved eo, ang, an upload areas to amane:/export/upload. Backups are still on albert in .../remove.
04:14 Tim: attempted to restart squid on will. It didn't work. I hacked /etc/init.d/squid to send errors to a file instead of /dev/null, and found it was giving error messages like "parseConfigFile: line 17 unrecognized: 'htcp_port 4827'". I started the squid copy in /usr/local/ instead.
01:20 brion: reenabled special:renameuser with the 'archive' bit disabled. it's possible that some undeleted pages will have incorrect rev_user_text data

November 15

23:00 jeluf: moved aa, ab, af, ak, als, am, ar, ast, zh image uploads to amane:/export/upload
20:32 hashar: updated http://wikimedia.org/stats/live/ with a message redirecting to the "new" system ( http://noc.wikimedia.org/stats.php ).
16:13 Tim: running batch imagemagick convert job on bacon, converting 1911 EB scans to PNG.
~12:30 Tim: Deployed diff cache and parser cache push features. Reduced cache expiry for RC feeds on en from 60 to 20 seconds. The performance impact of this should be monitored -- the diff cache should reduce it but it might not be enough.
03:46 Tim: Re-enabled tidy, trimmed error logs. The huge error logs did indeed have a few tidy errors towards the end, once every few minutes, interspersed with lots of "file not found" errors. Preceding this lack of activity was gigabytes of either:
[Mon Nov 7 04:33:33 2005] [error] PHP Parse error: parse error, unexpected $ in /usr/local/apache/common-local/php-1.5/checkers.php on line 101

OR

*** attempt to put segment in horiz list twice

Neither of which have anything to do with tidy. The other noticeable thing at the very end of the error logs was that apache was segfaulting regularly, but it was doing that just as much after tidy was disabled.

01:22 ævar: resolved bug 3968
00:50 brion: cleaned giant error_log files from srv44 and srv47, which had run out of space during sync
00:41 brion: adding some signature-nazi features, so new sigs with unbalanced html tags will not be inserted

November 14

22:30 mark: Many apaches have error_log's of 100G in size and more! Partly due to tidy, but how is logrotation supposed to be setup? See bug #3966
22:00 - 22:12 hashar: $wgUseTidy = false; it's filling error logs on all apaches and seems to stall. Restarted all apaches too. Wikipedians need to FIX their HTML.
14:00 mark: Rebooted srv10, and started Squid on it with no cachedirs (1 null cachedir). Assigned IP .214 to it.
08:28 Tim: restarted squid on srv6. Slow hit service times (~100ms), it wasn't swapping but it had very little spare memory for kernel cache and buffers.
03:05 Tim: bayle was swapping heavily, very slow service times for both hits and misses. Restarted squid, added it to the ganglia squid cluster.

November 13

22:50 jeluf: mounted amane:/export/math to all mediawiki-installation servers for storage of math images.
20:00 midom: srv10 squid hung, reiserfs issues?
16:57 brion: running data dumps on benet/srv35/srv36

November 12

19:49 ævar: tingxi had languages/LanguageCs.php (and probably something else) out of date, IIRC it has been down for some time, ran scap to bring it and others up to date.

November 11

00:16 brion: changed sitename on eswikinews (meta-namespace was already set)

November 10

14:28 ævar: changed the logo on trwiki
09:06 ævar: Changed the upload url of the wikis that had uploading disabled to point to the commons
09:09 brion: gave up trying to upgrade bugzilla due to bugzilla upgrade failure
08:40 brion: running yum update on pascal; got some glibc double-free bug during bugzilla update, and thought it was time to upgrade some damn packages
08:25 brion: shutting down bugzilla for upgrade to 2.20
07:18 brion: removed check_policy_service from /etc/postfix/main.cf on kate's advice, to see if it's more stable with that off
07:02 brion: restarting postfix on zwinger, mail stopped again
05:59 ævar: Removed harris from /usr/local/dsh/node_groups/mediawiki-installation, responded to ping, had port 22 open, but hung forever on ssh harris
04:27 Tim: set up ftp server on bacon, to accept uploads of scanned page images

November 9

14:18 Tim: fuchsia was swapping, regularly timing out on lvsmon health checks. Restarted squid.
11:09 brion: modified parser cache behavior to do cache with redirect targets. should increase hit rate; if troubles experienced, revert Article.php back to rev 1.396
10:13 brion: reenabled search text extracts for active sessions only
07:32 brion: updating live search indexes
00:54 brion: no mail in last eight hours... restarting postfix

November 8

23:30 jeluf: After intensive fsck, ragweed is back.
19:00 ragweed pings, but doesn't allow SSH login
13:10 holbach crashed
12:05 Tim: deployed local message cache, causing a 60% drop in network traffic on the apache cluster according to ganglia. We had noticed probable network saturation on the 100 Mbps switch asw1, this was the obvious solution. A content hash is stored in memcached and checked on each request. The local cache is stored in files, one file per wiki in /tmp/mediawiki/
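
A rough Python sketch of the scheme described in the 12:05 entry: a small content hash is fetched from the shared cache on each request, and the full message blob is only pulled over the network when the per-wiki local file is missing or stale. This is an illustration under assumptions, not the MessageCache.php code: the class name, key names, JSON file format, and SHA-256 hash are all invented here, and a plain dict stands in for memcached.

```python
import hashlib
import json
import os
import tempfile


class LocalMessageCache:
    """Per-wiki file cache validated against a hash kept in a shared store."""

    def __init__(self, shared, cache_dir):
        self.shared = shared        # dict standing in for memcached (assumption)
        self.cache_dir = cache_dir  # e.g. a directory like /tmp/mediawiki/
        self.full_fetches = 0       # counts expensive full-blob transfers

    def _path(self, wiki):
        # One file per wiki, as in the log entry.
        return os.path.join(self.cache_dir, wiki + ".json")

    def publish(self, wiki, messages):
        """Store the full messages and their content hash in the shared cache."""
        blob = json.dumps(messages, sort_keys=True)
        self.shared[wiki + ":messages"] = blob
        self.shared[wiki + ":hash"] = hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get(self, wiki):
        """Return messages, using the local file when its hash still matches."""
        wanted = self.shared[wiki + ":hash"]  # cheap lookup on every request
        try:
            with open(self._path(wiki), encoding="utf-8") as fh:
                cached = json.load(fh)
            if cached["hash"] == wanted:
                return cached["messages"]
        except (OSError, ValueError, KeyError, TypeError):
            pass  # missing or unreadable local file: fall through and refill
        self.full_fetches += 1
        messages = json.loads(self.shared[wiki + ":messages"])
        with open(self._path(wiki), "w", encoding="utf-8") as fh:
            json.dump({"hash": wanted, "messages": messages}, fh)
        return messages


if __name__ == "__main__":
    cache = LocalMessageCache({}, tempfile.mkdtemp())
    cache.publish("examplewiki", {"mainpage": "Main Page"})
    cache.get("examplewiki")
    cache.get("examplewiki")
    print("full fetches:", cache.full_fetches)
```

Only the hash crosses the network on a cache hit, which is the property that would account for the traffic drop reported above.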

November 7

20:51 kate: stopped replication on lomaria. please don't start it without asking me unless it's extremely important.
20:45 brion: trying to get tidy going again
20:30 brion: rebuilding search indexes on maurus.
20:00 brion: set search daemons to restart hourly. *sigh*
14:05 Tim: brought holbach back into service. Tweaked some load ratios.
13:55 Tim: started slave on lomaria. It was idle, the site was slow.
05:45 brion: switched lucene search to default to AND matches
02:50 brion: set up init script for MWDaemon (/etc/init.d/mwdaemon), added a daily cronjob to restart them

November 6

21:04 brion: several servers had disks filled from apache error_log; libart in rsvg apparently spewing out gigs of "*** attempt to put segment in horiz list twice"
20:10 brion: site unusually loaded; giving a kick to the apaches for luck
11:08 jeluf: srv22 was overheated. killed svg renderer (240 cpu minutes)
11:00 jeluf: added Category:Broken_servers for better keeping track of todos
10:40 jeluf: added portal namespace for nowiki upon Jhs' request
08:20 kate: copying from lomaria again... whee!
05:20 brion: added id.wikisource.org by request
04:59 Tim: started lvsmon-ksquid on pascal
04:39 kate: iris crashed... moved lvs to pascal.
02:40 Tim: Made MW check $cluster.dblist instead of all.dblist. This will generate appropriate error conditions for improper access to foreign databases via commandLine.inc, Special:Makesysop or squid misconfiguration.
01:40 Tim: installed memcached on srv41-50, moved instances from various other machines to there, including offloading browne completely. Restarted memcached on srv22, it had a dead instance.

November 5

22:02 kate: restarted replication on lomaria. set up replication on zedler.
11:00 brion: chgrp'd common files on humboldt
09:15 solar: installed new image filer, amane, into the rack.
04:55 kate: stopped replication on lomaria again to re-dump. don't start it please. (server is still running)
03:41 Tim: tried to restart dumpHTML on srv31, the machine crashed almost immediately
03:39 brion: starting dumps on yaseo on amaryllis/henbane
03:32 kate: copy finished, restarted replication on lomaria
03:00 brion: refresh-dblist now also creates pmtpa.dblist and yaseo.dblist, based on assignment overrides from clusters.dblist
00:45 brion: started pmtpa dumps on benet, srv35, srv36

November 4

21:45 jeluf: moved lighty on benet to /usr/local/lighttpd. Added startup to /etc/rc.local
21:00 jeluf: mounted benet:/var/backup to zwinger:/mnt/backup_benet
06:25 brion: restarted search servers; memory usage up to 650-1000mb range, and very slow response on vincent

November 3

11:03 kate: copying lomaria's db to zedler, don't start it
21:45 erik: fixed he.wikinews site name and meta namespace (hopefully), sync'd InitialiseSettings.php and ran update.php accordingly
20:44 brion: investigating connection errors (hacked wfLogDBerror to include hostname); seems to be on the new opteron boxen only
20:30 hashar: started apache on srv35.
20:22 hashar: started apache on avicenna.
20:10 mark: Will was running with only 1024 FDs. As it's the only non-RPM squid around (will is FC1) and I added bayle, I have taken it out, reassigned IPs to srv5 and srv7.
19:55 hashar: some apaches need a reboot. load is incorrectly high on them because of state=D processes (see bug #3869)
15:10 mark: Moved bayle (previously broken, inactive memcached) to the external vlan, made it a temporary squid. I cannot get it to mount izwinger:/home though. Any ideas?
05:30 Tim: copied ~tstarling/.ssh/known_hosts to /etc/ssh/ssh_known_hosts on all pmtpa machines
~5:00 Tim & kate: syslogd stopped working on zwinger, causing DNS to stop working. Kate restarted syslogd.
~5:00 created hewikinews using addwiki.php, sync-common-all
04:07 kate: made amaryllis ns3.wikimedia.org. needs magic stuff so it can be added as auth ns
01:58 Tim: restarted search daemon on vincent, the usual problem

November 2

mark: Apparently the restart squid cron job in the squid RPM is broken in a weird way: at some point in time /sbin/pidof /usr/sbin/squid will stop working. I will fix it and roll out a new RPM tomorrow. Sorry for the trouble!
23:20 JeLuF: Found 2 squids on srv8. Killed both, started a new one.
22:20 Tim: adapted lvsmon for knams squid service, started it on iris. See /usr/local/bin/lvsmon-ksquid . There's also a copy in ~tstarling/lvs on zwinger in case iris goes down.
21:30 mark: Installed the new squid RPM on clematis. Not using epoll didn't change memory leaking behaviour.
19:17 kate: LDAP on pascal was broken after reboot.

Nov  2 19:11:26 pascal slapd[29793]: bdb_db_init: Initializing BDB database
Nov  2 19:11:26 pascal slapd[29794]: bdb(dc=knams,dc=wikimedia,dc=org): Lock table is out of available locks
Nov  2 19:11:26 pascal slapd[29794]: bdb_db_open: db_open(/var/lib/ldap) failed: Cannot allocate memory (12)
Nov  2 19:11:26 pascal slapd[29794]: backend_startup: bi_db_open(0) failed! (12)

Did a db_recover and restarted slapd.

04:38 kate, kyle: csw4 is installed. nothing on it yet.
01:08 kate: pascal broke again, moved LVS to iris
00:10 kate: colo allocated us 84.40.25.224/27, wikicities will move into this network

November 1

23:39 brion: created car-fr-l list for french arbcom
22:25 brion: heavy packet loss between pmtpa and lopar; kate is moving dns off lopar for now
21:10 UTC erik: created ru.wikinews.org using addwiki.php
18:26 mark: Dropped 207.142.131.225 as gateway IP, as it doesn't seem to be in use anymore
18:15 mark: Made csw1-pmtpa act as a DHCP relay agent for rabanus, 10.0.0.15
04:20 kate: replaced mormo.org with pascal & amaryllis as backup MX, using postgrey + other anti-spam stuff
05:48 Solar: anthony, suda, isidore and bayle are back up.
05:10 Tim: Cleaned up the squid list in CommonSettings.php. The need to have variables for the IP addresses of each squid passed long ago; they were just clutter, doubling the length of the section. Added the external IP address of will, which was missing, causing edits to be wrongly attributed in the yaseo wikis.

2000s

Archive 1: 2004 Jun - 2004 Sep
Archive 2: 2004 Oct - 2004 Nov
Archive 3: 2004 Dec - 2005 Mar
Archive 4: 2005 Apr - 2005 Jul
Archive 5: 2005 Aug - 2005 Oct, with revision history 2004-06-23 to 2005-11-25
Archive 6: 2005 Nov - 2006 Feb
Archive 7: 2006 Mar - 2006 Jun
Archive 8: 2006 Jul - 2006 Sep
Archive 9: 2006 Oct - 2007 Jan, with revision history 2005-11-25 to 2007-02-21
Archive 10: 2007 Feb - 2007 Jun
Archive 11: 2007 Jul - 2007 Dec
Archive 12: 2008 Jan - 2008 Jul
Archive 12a: 2008 Aug
Archive 12b: 2008 Sep
Archive 13: 2008 Oct - 2009 Jun
Archive 14: 2009 Jun - 2009 Dec
