Server admin log/Archive 6

February 28

  • 02:02 brion: raised memory limit in mw to 80mb; at 50mb rsvg routinely dies on x86_64; apparently from shared library mappings
  • 00:12 brion: fixed some rsvg bugs; upgrading libxml2 for better support of files from illustrator

February 27

  • 11:47 brion: updating $wgThumbnailEpoch so rendered SVGs and mis-sized PNGs and JPEGs should rerender gradually
  • 11:30 brion: investigating segfaults on yaseo apaches
  • 10:44 brion: upgrading librsvg to 2.14.0 (patched for dashes and security) on pmtpa apaches

February 26

  • 22:20 jeluf: Mail service enabled again, OTRS moved to goeje, using srv38 as DB server.
  • 21:15 jeluf: Shutting down mail on goeje while migrating OTRS from ragweed to goeje.
  • 16:40 jeluf: copied missing icons to goeje
  • 11:15 jeluf: uninstalled sendmail on goeje, kept postfix.
  • 10:30 jeluf: installed spamassassin on goeje. Spam mail gets a ---SPAM--- label.
  • 09:56 brion: mail.wikimedia.org seems alive and well on goeje
  • 08:55 brion: moving mail from zwinger to goeje; in progress copying mailman data

February 25

  • 13:00 Domas: cloned commonswiki, nlwiki, frwiki, svwiki, plwiki to webster

February 24

  • 06:20 Tim: restarted lighttpd on amane
  • 05:50 Tim: deployed MW job queue
  • 03:39 brion: copied .forward files from zwinger old files to each home dir that had one
  • 00:14 brion: benet out of space. finishing cleaning off old files.

February 23

  • 21:40 brion: resyncing search indexes
  • 12:00 mark: Holbach couldn't reach external machines (like zwinger) as it was missing a default route
  • 00:30 Tim: started apache on goeje
  • 00:19 brion: continuing wiktionary bulk renames
  • 00:05 brion: took thistle out of rotation; it's wildly behind replication
  • 00:05 mark: Set up udpmcast on goeje as larousse is dead.
  • 00:03 justin: bullshit

February 22

  • 22:55 brion: set all remaining wiktionaries to wgCapitalLinks off. starting bulk-rename operations
  • 22:00 brion: set default subpages to include custom 100 and 101
  • 06:57 Solar: srv8 back up.
  • 03:30 Tim: Set up gmond on sq1-10 [1]
  • 03:20 Tim: killed sq9. ifcfg-eth0 is probably wrong.
  • 01:44 brion: the lvs on the sq* machines looks good, so we're changing upload.wikimedia.org to point at it
  • 00:30 brion: investigating performance of amane vs sq9

February 21

  • 23:10 brion: caught lots of segfaults on srv38. looks like we still have the 'occasionally goes into mode where server does nothing but segfault until you restart it' problem. awesome.
  • 23:05 brion: noticed rpc.mountd taking 99% cpu on amane; did /etc/init.d/nfs restart to clear it
  • 22:55 brion: sq9 taking over upload.wikimedia.org squid duty, alone for now.
    • With luck this will keep loads of slow open connections off of the other squids. If we need to we can have a separate set of squids for uploads.
  • 22:16 mark: Flipped gi0/14 on csw4-pmtpa over to vlan 1, as it has sq9
  • 22:00 brion: trying to set up sq9 with an external ip for upload squid
  • 21:31 brion: browne has stale zwinger NFS; leaving it for now as the IRC is still running ok
  • 21:30 brion: doing survey of up/down machines and reusable ips
  • 21:00 brion: ragweed down
  • 04:20 Tim: fixed gmond in various places. Put most of the apaches into "deaf" mode, so they will only multicast their own statistics, not the rest of the cluster's. Only the apache-aggregators node group listens to the multicast stream and responds to XML requests.
  • 02:19 Solar: set tingxi's vlan to default on csw4-pmtpa.

February 20

  • 22:23 brion: rebalanced tampa squids; two ips from srv7 and srv9 to srv6
  • 21:39 brion: got lots of segfaults on rabanus; restarted, seems ok
  • 21:30 brion: slow access reported by some people, can't reproduce here. might be knams
  • 21:13 brion: moving old backup files to amane to free space on benet
  • 20:30 jeluf: changed settings of de.wiktionary to enable sysop RC patrolling, see bugzilla
  • 20:00 jeluf: rebooted sage
  • 19:40 brion: updated checkers.php
  • 07:00-07:30 Tim: installed wikidiff2 1.0.0 everywhere
  • 06:30 Tim: fixed yf1010 and did a reboot test.
  • 04:01 brion: installed php5 on yaseo
  • 00:20 brion: installing php5 on diderot, friedrich, humboldt, hypatia, kluge, rabanus, rose, srv2, srv3, srv4, srv0

February 19

  • 23:48 brion: running yaseo dumps on amaryllis
  • 23:41 brion: running enwiki dump on srv31 and other dumps on benet
  • 22:42 brion: rearranged mounts on srv31 so it will survive if zwinger's ip is returned to it
  • 19:37 brion: restarted search index builds; bad symlinks from older dumps still, uh, bad.
  • 10:20 domas: enabled persistent connections for memcached
  • 09:40 brion: having tingxi shut down until kyle can poke it
  • 07:22 brion: rebooting tingxi; stale nfs
  • 06:50 brion: explicitly disabled user dirs on the primary apaches

February 18

  • 22:08 brion: started Lucene index rebuild on maurus
    • Noticed that maurus can't be reached from zwinger. something bad in configs :(
  • 06:15 Tim: Fixed and re-enabled wikidiff2

February 17

  • 22:25 brion: disabling wikidiff2; wikidiff2 segfaults
  • 11:00 brion: moved smlogmsg out of the way (smlogmsg-old), replaced with a shell script that just says servmon is down, so we won't have to wait for it to time out looking for larousse
  • 10:55 brion: fixed wikiquote docroot
  • 10:30 brion: syncing; www.wikiquote.org portal docroot hadn't gotten copied out
  • 07:57 Solar: anthony is back up, but I didn't set dns
  • 07:41 Solar: Reinstalled fc4 on srv50
  • 05:20 Tim: started gmetad on zwinger
  • 04:56 brion: testing on srv39 to track down segfaults. upgrading apc, etc
  • 03:50 Tim: deployed wikidiff2 on numbered pmtpa hosts
  • (at some point) stopped a buttload of apaches which were massively segfaulting
  • 03:03 brion: restarted apaches; srv11 had several old stuck convert processes and wasn't responding to new requests. 14, 16, 18, also had unusually low load avg.
  • 01:48 brion: taking srv55, srv57, srv61, srv67 out of service due to RAM problems reported in mcelog. had to reshuffle memcached sets.
  • 01:26 brion: srv61 flaky
  • 01:21 brion: set suda's syslog to accept the messages from the cluster's php boxen

February 16

  • 23:30 jeluf: Rebooted clematis, fuchsia, hawthorn, mayflower
  • 20:23 brion: srv41 and srv43 for some reason aren't initializing the lvs magic ip or starting apache on boot, but the test for it passes and they can be run manually afterwards. don't know why :P
  • 20:08 brion: testing reboot of srv41 to make sure it comes back up with proper LVS magic ip in place
  • 20:00 brion: fixed NFS mounts and php config on rose, back in service.
  • 19:50 brion: removed broken firewall rules from rose, nfs now ok. upgrading php...
  • 19:20 brion: rose still borked, poking it with a stick
  • 19:05 brion: noticed ganglia still broken.
  • 13:15 Tim: noticed that suda (/home share) was full. Possibly because of the rsync from from-zwinger that I set going a while ago. Deleting some backups.
  • 08:30 brion: added missing /upload redirects to some vhosts (commons, meta, sources, etc)
  • 08:00 brion: moved /wikistats to http://stats.wikimedia.org/ (unused vhost on albert)
  • 05:08 Tim: installed Term::Readline on goeje
  • 04:42 Tim: moved LVS back to dalembert
  • 04:00 Tim: Fixed dalembert's startup scripts
  • 03:20 Tim: restarted dalembert
  • 03:15 Tim: second attempt at moving the LVS director to srv52, worked this time
  • 02:35 brion: lvs back up, for now
  • 02:30 brion: lvs down; tim's rearranging things.
  • 01:44 brion: enabled new config with everything moved from htdocs/ to common/docroot subdirs. Hopefully everything relevant copied in...

February 15

  • 22:20 brion: dsh available on zwinger; moved some config files in for it
  • 22:00 brion: mounted suda's /home on zwinger, so it can be used consistently for stuff until we reinstall it
  • 21:11 brion: mailman back online, upgraded to 2.1.7, moved to /usr/local/mailman
  • 19:55 brion: working on moving and upgrading mailman on zwinger
  • 19:43 brion: switched .247 and .248 from srv7 to srv9. srv8 is still offline
  • 14:15 brion: added redirects for wikimania200[5-9].org (where we have them)
  • 13:57 brion: patched otrs for annoying session load errors
  • 13:45 brion: disabled mod_perl for ticket, otrs seems to actually give back what you asked for now. tracking down a "Use of uninitialized value" in session handling that's flooding the error log.
  • 13:30 brion: otrs is doing the 'random pages instead of what you asked for' thing again. restart fixes for about a minute or two.
  • 13:03 brion: determined otrs problem was wild goose chase because debian sent out notifications of a three-month-old security update they forgot to include until now. ticket back online.
  • 12:49 brion: took ticket.wikimedia.org offline for upgrade
  • 12:35 brion: copy of files to suda finished at some point. zwinger files can be copied into place from /home/from-zwinger as desired.
  • 12:30 Tim: attempted to reinstall srv50 with a 20GB root partition and a large /a partition. I monitored it on the serial console until it finished the PXE phase, but I saw nothing more from it after it went into linux.
  • 02:27 brion: got suda relaying mail from the cluster to zwinger until somebody figures out this DNS crap
  • 01:40 brion: reassigned the smtp.pmtpa.wmnet CNAME to zwinger's 207 address, which works internally. Trying to get it to propagate...
  • 01:27 brion: tracking down broken apache email; they send to smtp.pmtpa.wmnet, which is zwinger's old internal ip (now suda) so mail doesn't get out.

February 14

  • 22:40 jeluf: Restarted srv31, had broken /home mounts
  • 22:30 jeluf: Restarted sage.
  • 22:00 jeluf: Restarted apache on srv31 for static.wikimedia.org. Added it to rc.local
  • 21:30 jeluf: restarted lily, iris.
  • 21:35 brion: started copying zwinger's /home to suda, so we'll have current files on the new file server. Temporarily putting in /home/from-zwinger.
  • 19:30 mark: Changed IP address of ns1.wikimedia.org to 211.115.107.190 (Amaryllis), as larousse is pretty dead.
  • 15:25 mark: Set up DNS zones wikimania2006.org and wikimania2007.org by request of Delphine
  • 15:10 mark: Replaced internal DNS zonefiles by more recent versions retrieved from zwinger
  • 07:53 brion: benet download.wikimedia.org back online. old lighty relied on /home; reinstalled from srpm.
  • 03:54 brion: maurus and vincent online for search. there's a lot of connection-limit hitting, though...
  • 03:47 brion: coronelli online for search
  • 03:40 brion: kicking search boxes to see if they come up
  • 01:35 brion: got the redirect from old math urls in pmtpa working. yaseo wikis are exempted by explicit test in apache config (main.conf)
  • 01:15 brion: switching math in pmtpa to load off of upload.wikimedia directly instead of via apache+nfs
  • 01:05 brion: got math settings resynced on yaseo. phew!
  • 01:00 brion: hacked sync-file to use amaryllis.yaseo instead of amaryllis so it works on albert
  • 00:42 brion: moved tex files from /home/wikipedia/shared/math to /mnt/math, so it's not in a double nfs hell

February 13

  • 23:30 kyle,mark,jeluf: zwinger back running. Removed all irc stuff from rc.local because it was blocking and system didn't boot properly.
  • 21:36 brion: took 204, 205 to srv6, whatever had them wasn't responding. now getting response on all squids
  • 21:34 brion: took srv8's ips (246, 247, 248) to srv7 so something could respond on them. also others may be borked
  • 21:23 brion: rebooting srv8, was verrry slow to respond, lots of stuck processes, nfs broken
  • 21:00 jeluf: reconfigured dalembert to have all important components locally: icpagent moved to /usr/local/bin, lvs node list moved to /usr/local/etc/apaches.
  • 15:45 mark: Deployed a new resolv.conf using new internal DNS resolver service IPs
  • 15:15 mark: Set up 2 new internal DNS resolvers on srv1 (master) and albert (slave)
  • 11:43 zwinger down NFS debacle
  • 07:30 brion: added some partial blacklists on unicode chars in usernames. this should be fixed up for all titles and the whitelist fixed.

February 12

  • 10:30 JeLuF: Moved helpdesk-l to OTRS. Added alias in /etc/postfix/aliases and removed the one in .../mailman/aliases.

February 11

  • 23:00 Tim: srv6 crashed, moved IPs to srv7, 8 and 10

February 10

  • 07:10 kate: upgraded zwinger to nfs-utils 1.0.8-rc2 so the Solaris NFS client doesn't crash it
  • 05:30 Solar: bart back up at 207.142.131.227 (With FC3, let me know if you wanted FC4)
  • 04:23 Solar: larousse is gone, no warranty. I might scavenge a hard drive from a bomis server to replace its drive.
  • 04:23 Solar: Taken anthony for RMA

February 9

  • 17:25 brion: enabled emergency captcha and blocked some ip. robot or something.
  • 06:30 Solar: hydra the new server is up at 10.0.0.201
  • 04:54 Solar: ixia back up.
  • 02:13 brion: rebuilt and reenabled interwiki cache
  • 01:55 brion: disabled interwiki cache; it doesn't seem to handle removal of the cache file, and there's no obvious way to clear the cache.
  • 01:45 brion: interwiki map now protected; for some reason somebody left this unprotected even though it gets updated on an unattended basis, and somebody decided to add javascript: to it. nice. updated cache epoch to ensure things are cleared
  • 00:12 brion: restarted apaches; odd 'bad title' and failed load errors reported on srv12, restart cleared it

February 8

  • 19:20 brion: larousse dead, doesn't come up on boot.
  • 19:00 brion: benet / briefly filled, but nothing seems to have gone awry with the dump. cleaned some space.
  • 18:30 brion: had larousse rebooted since its root filesystem doesn't work, worth a shot. may not be coming back up
  • 18:00 brion: larousse is down since yesterday, nobody logged it.

February 7

  • 23:57 brion: running cleanupWatchlist on pmtpa
  • 07:23 brion: namespaceDupes on ta wikis for bugzilla:4889
  • 05:40 Tim: Recompiled ImageMagick from the source RPM, with --with-quantum-depth=8. Installed on all apaches.

February 6

  • 23:00 jeluf: Added new "Urgent-en" queue to OTRS
  • 22:30 jeluf: Restarted sage and iris. Changed /a to ext3 to reduce fsck time.
  • 21:00 jeluf: Restarted lily, purged cache
  • 19:00 jeluf: Restarted load balancer on pascal, rebooted mayflower
  • 18:50 mark: Revived clematis
  • 00:05 mark: Created mailinglist chaptercommittee-l by request of Delphine.

February 5

  • 02:43 brion: nowikinews was duplicated in all.dblist; cleared

February 4

  • 21:38 brion: started data dumps in pmtpa, now including progress/ETA for xml dumps

February 3

  • 01:50 brion: started fill-in dumps on srv31 again; now using local temp dir for stub dumps in the hope it won't mysteriously fail
  • 01:30 brion: started dumps on yaseo

February 2

  • 19:09 brion: compiling php 5.1.2 on srv31
  • 19:00 brion: mark rebooted pascal for reasons unknown
  • 07:30 brion: started makeup dump runs on pmtpa databases which had dump failures. unsure of cause still...
  • 03:02 brion: testing fixes to yahoo dump gen
  • 01:00 brion: squids in yaseo are way into swap, slow. trying some restarts

February 1

  • 23:48 brion: trimmed a message from wikimediafr-l logs for privacy by request
  • 20:35 brion: srv10 back up and ips put in service
  • 19:35 brion: srv10 down; squid errors
  • 01:04 brion: adding cfp.wikimania.wikimedia.org redirect for those wikimaniacs

January 31

  • 22:00 brion: hewiki, huwiki, iawiktionary dumps report failure in full-history dump. checking log for iawiktionary showed an XML error in the stub load partway through, but rerunning the command to a test dump was successful. cause unknown

January 30

  • 23:45 brion: disabled blank passwords on wikis
  • 23:00 mark: Upgraded pybal to a newer version on pascal
  • 22:20 brion: started a refreshLinks for itwiki; some major category was broken by a bogus template
  • 19:30 brion: installed APC for srv13-30. had to reduce apc shm size to 30 on i386 boxen. temporarily used a cvs checkout of apc, in /h/w/src/apc visibly
  • 19:15 brion: trying to get APC installed on the machines recently upgraded to php 5.1
  • 18:30 brion: disabled accesslog on amane's lighty

January 29

  • 11:10 brion: fixed externallinks table on leuksman.com wikis :P
  • 11:06 brion: enabled captcha on remaining non-wikipedias, so all small sites covered. large sites still off while the smaller ones collect live test data. (added captcha to new user form a couple hours ago)
  • 07:15 Tim: started upgrading srv11-30 to PHP 5.1.2
  • 06:33 Tim: fixed secure.wikimedia.org
  • ~05:00 Tim: Upgraded srv12 to PHP 5.1.1. Working on srv11.

January 28

  • 10:55 brion: enabled experimental captcha on small wikipedias (all except the top 20 most edited and yaseo) to get some more test data
  • 05:52 brion: added VfD/AfD entries to robots.txt, bugzilla:4776

January 27

  • 22:50 brion: running captcha generation test on amane
  • 22:45 brion: amane's root partition filled with 41 gigs of lighty logs. :) cleared out, restarted lighty.
  • 22:17 brion: got srv63 updated php modules. Note: it's using dba as built-in, not .so module. A warning on Apache start about missing the .so is normal until we get the rest updated this way.
  • 21:48 brion: added '--with-cdb --with-gdbm=/usr' to install-php51 script
  • 21:43 brion: trying to fix srv63. why do we have these things turn on apache on boot? it's incredibly stupid; they end up broken
  • 21:06 Solar: srv63 back up
  • 02:29 brion: started refreshLinks.php on yaseo, running on amaryllis
  • 02:28 brion: ran update.php to update schema on yaseo wikis, which were forgotten
  • 01:58 Tim: fixed spam blacklist and re-enabled it
  • 01:36 Tim: started refreshLinks.php, running on srv31
  • 00:59 Tim: Updated schema, enabled externallinks table

January 26

  • 09:29 brion: disabled spam blacklist; more reports of all kinds of things triggering blacklist for no apparent reason
  • 01:09 brion: got ImageMagick 6.2.6 installed everywhere. bleh.

January 25

  • 15:38 ævar: Added a portal namespace & portal talk namespace to svwiki and ran php maintenance/namespaceDupes.php svwiki --fix to fix the one resulting conflict:
Checking namespace 100: "Portal"
... 1 conflicts detected:
... 209565 (0,"Portal:Musik") -> (100,"Musik") Portal:Musik
... resolving on page... ok.
  • 09:59 brion: postfix was stuck; killed (zombies, kill -9 needed), restarting
  • 01:18 brion: added FollowSymLinks and mime type for .7z on download-yaseo
  • 01:00 brion: enabled indexes on download-yaseo

January 24

  • 22:55 brion: restarted squid on srv8; it was serving lots of error pages to people for unknown reason, seems happier after
  • 06:24 Tim: Updated /h/w/b/foreachwiki. Started running cleanup.php on all wikis.

January 23

  • 19:55 brion: disabled digests option for all users on daily-article-l by request (list admins disabled digests)
  • 09:13 brion: enabled APC (from HEAD) on leuksman.com
  • 03:40 brion: dba module needs to be enabled on secure.wm.o

January 22

  • 23:30 brion: syncing fedora-extras from a mirror in .jp; added to sync-fedora-mirror.sh script
  • 23:10 brion: fedora-extras seems to be missing from fedora mirror in yaseo; fedora-extras.repo points to the local main fedora repo mirror which doesn't help
  • 22:30 brion: restarted dump run in pmtpa; PHP utfnormal extension enabled to speed up non-Latin dumps
    • prefetch was actually working ok once i got into the debug log to watch. slowness was from not loading utfnormal from dumpTextPass. now controlled by WIKIDEBUG env var at CommonSettings level
  • 21:30 brion: ragweed down, no OTRS (mark rebooted it shortly after)
  • 12:08 brion: aborted dumps on pmtpa and yaseo pending investigation
    • setting WIKIDEBUG env var causes segfault in php on srv31. what the hell
  • 11:12 brion: prefetch didn't work due to broken symlinks. restarting on pmtpa
  • 10:00 brion: running dumps on srv31 in pmtpa
  • 06:24 brion: running another test dump on yaseo; will go ahead and run one on pmtpa soon. setting up to use srv31 as the dump runner

January 21

  • 21:25 mark: Half the knams servers were down, at which point PyBal decided not to depool any more servers. Consequence is that most traffic is attracted by the down servers in LVS, and the site is more or less down. Fixed it by commenting out the down servers in /etc/pybal/squids. (PyBal will reload that file every minute)
  • 19:15 brion: disabled interwiki cdb cache on yaseo wikis. domas forgot to install the required php module
  • 18:10 domas: enabled interwiki cdb cache, cleaned logs on apaches
  • 15:00 domas: installed dba extension (--with-cdb --with-gdbm=/usr) all around, will make use of it soon.
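
The 21:25 entry above describes a load balancer that stops depooling once too many servers are down. A minimal sketch of that kind of rule follows, in Python. This is not PyBal's actual code: the function name `pooled_servers`, the `min_pooled_fraction` parameter and its 0.5 default are all assumptions made for illustration.

```python
import math


def pooled_servers(health, min_pooled_fraction=0.5):
    """Return the names of servers that remain in the pool.

    health: dict mapping server name -> True (up) or False (down).
    min_pooled_fraction: hypothetical floor; at least this share of the
    configured servers stays pooled even if some of them are down.
    """
    total = len(health)
    if total == 0:
        return []
    min_pooled = math.ceil(total * min_pooled_fraction)

    # Healthy servers always stay pooled.
    pooled = [name for name, up in health.items() if up]

    # If too few are healthy, keep down servers pooled (in config order)
    # until the floor is reached. These still receive traffic.
    for name, up in health.items():
        if len(pooled) >= min_pooled:
            break
        if not up:
            pooled.append(name)
    return pooled
```

With four configured servers and three of them down, the sketch keeps one down server pooled, which matches the symptom in the log. Removing the down servers from the configured list (as was done by commenting them out) shrinks the total, so the floor drops and only healthy servers remain pooled.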

January 20

  • 01:45 brion: syncookies all around.
  • 01:33 brion: ah, the old tcp_syncookies. resolved.
  • 01:11 brion: hella slow squids on .246/.247/.248
    • srv8. lots of suppressed messages in syslog, on the order of 5k/second. BUT WHAT ARE THEY

January 19

January 18

  • 08:18 brion: running another dump test on yaseo
  • 01:58 brion: saw some breakage with LanguageZh_hk; its deps file was missing one dep (Zh_cn) which I've now added. at least it's consistent with theory so far :D
  • 00:20 brion: added dependency-loading stubs for language and skin classes that need them. hopefully will help with http://pecl.php.net/bugs/bug.php?id=6503

January 17

  • 20:00 brion: reenabled DoubleWiki extension on wikisource, it seems to work now
  • 17:35 Tim: someone commented on our APC bug report that the problem seemed very similar to this bug, which was apparently fixed in CVS last October. There hasn't been a release since then. So I upgraded APC to CVS on the cluster and switched off the initEncoding hack. Fingers crossed.
  • 11:20 brion: running an experimental job of the new backup script on yaseo. output and some control features need some more work, but it should at least pump out some files.
    • tried to set up download-yaseo.wikimedia.org vhost on amaryllis rooted right at the public backups dir, but it's not working right for some reason. *shrug*
  • 06:10 brion: set up a log of Mozilla and Google Accelerator prefetch requests in /h/w/logs/x-moz.
    • Sasa^Stefanovic reported unexpected reverts happening when visiting user contribs pages, had Google Accelerator installed and turned it off on my request.
    • Haven't yet found confirmation of such a bug w/ google accel, but I am seeing requests made from their proxies which include the fragment identifier which is odd. Emailed google about it.
  • 01:20 brion: reverted $wgMimeDetectorCommand back to default. Setting to 'file -bi' broke SVGs on the site.
  • 00:35 brion: installed hack for bugzilla:4635, safari breakage on pages with '.gz'. these pages are sent without gzip encoding to avoid triggering
  • 00:29 brion: testing, seems that test.wikipedia.org does NOT use the local nfs version of CommonSettings.php
  • 00:10ish brion: set mime detector to 'file -bi' on duesentrieb's advice
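
The 00:35 entry above mentions sending pages whose names contain '.gz' without gzip encoding. A rough sketch of such a check follows, in Python rather than the PHP actually used on the site. It is not the deployed hack: the function name `should_gzip` and the exact matching rule (a case-insensitive substring test on the request path) are assumptions.

```python
def should_gzip(request_path, accept_encoding):
    """Decide whether to gzip-compress a response body.

    request_path: URL path of the request, e.g. "/wiki/Foo.tar.gz".
    accept_encoding: raw value of the client's Accept-Encoding header.
    """
    # Strip any ";q=..." parameter and compare the bare coding names.
    # Quality values (including q=0) are ignored in this sketch.
    codings = [part.split(";")[0].strip().lower()
               for part in accept_encoding.split(",")]
    if "gzip" not in codings:
        return False

    # Assumed workaround rule: skip compression when '.gz' appears in
    # the path, so the client does not mistake the page for an archive.
    if ".gz" in request_path.lower():
        return False
    return True
```

For example, `should_gzip("/wiki/Foo.tar.gz", "gzip, deflate")` returns `False`, while the same header with `/wiki/Main_Page` returns `True`.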

January 16

  • 16:15 mark: Deployed PyBal, my new lvsmon-like LVS script, on Pascal, in /usr/local/pybal/
  • 07:16 Tim: added a hack to Setup.php to automatically clear the APC cache if a language class is missing its parent.

January 15

  • 21:51 brion: several instances of an odd error reported:
    Jan 15 21:04:32 srv38.pmtpa.wmnet httpd: PHP Fatal error: Internal error: Failed to retrieve the reflection object in /usr/local/apache/common-local/php-1.5/includes/ProxyTools.php on line 133
    This is some kind of PHP5 constructor error [2] which should never occur. The referenced line is a 'global' declaration in a top-level function. Did an apache restart to try clearing caches...
  • 21:27 brion: enabled semi-protection on jawikinews (bugzilla:4608) and eswiki ([3])
  • 20:24 brion: default logo on all wikiquotes was the locally-uploaded copy, but many don't have one. changed to the en.wiki copy as default
  • 20:22 brion: moved en.wikiquote uploads to where they belong, working.
  • 20:19 brion: found that en.wikiquote uploads aren't working properly; the dir is a symlink into /home which is no longer mounted on amane. need to rearrange the files...

January 14

  • 16:18 brion: did extra sync; some old files stuck on servers or something (for instance SpecialUserlogin, with the broken password button)
  • 15:15 brion: parser cache bug :( updated wgCacheEpoch and did a few manual squid purges of main pages
  • 00:30 brion: trimmed obsolete funddrive dir from docroot/foundation (didn't work anymore, superseded by fundraising.wikimedia.org)

January 13

  • 20:10 brion: broke some stuff for a couple minutes trying to make clone() work; now have a wfClone() for PHP4 compat until we finish killing PHP4
  • 12:35 Tim: started refreshLinks.php with template redirect fix in place
  • 08:15 brion: reverted experimental anti-bot hacks to login page; it was breaking 'mail new password'
  • 07:30 jeluf: added info-es alias on zwinger, forwarding to OTRS
  • 07:00 jeluf: hawthorn rebooted
  • 06:30– Tim: second attempt at upgrading to PHP 5. Watching CPU stats closely this time.
  • 06:17 Tim: added access_log to logrotate.conf, up to 10 GB will be stored
  • 06:03 Tim: Amane's root partition was full due to 40 GB access_log. Deleted it and restarted lighttpd.
  • 05:50 Tim: Put copy of skeleton /home on amane
  • 05:15 Tim: unmounted /home on amane. Amane's network out shot up.
  • 04:25 Tim: killed updatedb on amane and removed it from cron.daily
  • 01:05 brion: briefly broke redirects when upgrading; forgot index.php had been reverted temporarily during yesterday's excitement.

January 12

  • 20:00 jeluf: changed Dutch wikis to allow patrolling by users, not only by sysops.
  • 19:30 jeluf: rebooted fuchsia, purged squid cache.
  • 19:14 ævar: I discovered that viwiki has made an extension to the software in JavaScript. I did a quick security review of it and it doesn't appear to be evil(TM) in any way. It's basically an input method written in JavaScript (docs in Vietnamese); for example, try going to their sandbox, select "Tu9-. d/o>-.ng" or "Telex" and type "aw" in the input box, and it'll be converted to "ă". Still, more eyes on the source code, Monobook.js and Him.js (main program), couldn't hurt given some of the evil javascript we've been removing recently.
  • 11:00- Tim: Upgrading PHP on srv31-70 to PHP 5.1.1
  • 04:35 brion: installing xdebug on all apaches so it's available
  • 04:20 brion: with xdebug extension was able to limit recursion within php and got a stack trace pointing to Image.php svg thumb rendering. bug in tim's recent changes was found to be the culprit. reverted Image.php while working
  • 03:42 brion: monitoring very high apache load situation with lots of segfaults.
    • Appears to be some recursion -> segfault in PHP (PHP crash backtrace); may be recursion in user-level function

January 11

  • 22:30 jeluf: rebooted ragweed after crash
  • 22:00 mark: Added srv71-160 to DNS
  • 21:00 jeluf: installed srv71...78. 79 and 80 need to be rebooted, probably using the old kernel.

January 10

  • 21:19 brion: put paypal donation form back onto fundraising.wikimedia.org/ongoing top/year/month-level pages
  • 21:15 brion: added no.wikinews

January 9

  • 22:39 hashar: fix project namespace for bn: and csb: languages
  • 21:56 hashar: ocwiktionary is now case sensitive.
  • 21:56 brion: switching php error logs from local files to syslog, which should go to zwinger and include the hostnames
  • 21:48 hashar & nikerabbit: fixed ga: project namespaces (now use genitive)
  • 19:09 brion: where the hell is the documentation on external storage servers? srv34/srv33/srv32 aren't even documented -- they're listed as apaches.
    • They are apaches, all external storage servers are dual-purpose. A list of external storage servers can be found in db.php. -- Tim 03:40, 10 January 2006 (PST)
  • 17:45 brion: noticed secure.wikimedia.org is broken:
    • Error in numRows(): SELECT command denied to user: 'wikiuser@goeje.wikimedia.org' for table 'blobs'
  • 17:00 brion: we have mysterious huge load on apaches, started about 5 hours ago. restarted all apaches to see...
  • 11:10 Domas: parser cache set to 2 weeks
  • 05:48 Tim: enabled direct external storage on enwiki
  • 04:01 Tim: enabled direct external storage on meta, as a pilot.

January 8

  • 17:37 brion: odd db access error about 40 minutes ago:
    • Sun Jan 8 16:56:02 UTC 2006 srv68 RecentChange::markPatrolled 10.0.0.101 1146 Table 'nlwiki.recemtchanges' doesn't exist (10.0.0.101) UPDATE `recemtchanges` SET rc_patrolled = '1' WHERE rc_id = '2429501'
    • the source for this file on this server looks ok; no memory errors in /var/log/mcelog
    • very odd
  • 12:16 hashar: un-hardcoded languageNV ns_project
  • 12:05 hashar: un-hardcoded languageOC ns_project (bug 4526)
  • 05:30 brion: adding CNAME download-yaseo.wikimedia.org to amaryllis; yaseo dumps will be there ...

January 7

  • 18:50 brion: running test dump for yahoo's abstract thingy for enwiki on benet (from samuel)
  • 14:37 hashar: recached special:disambiguation for all pmtpa databases.
  • 14:30 hashar: WARNING ran cvs up. That raised a lot of conflicts. Attempting to solve them.
  • 01:30 brion: added user throttle to en in response to registration flood
    • deleted some of the crud accounts
    • extra live hacks pending captcha later

January 6

  • 14:42 Tim: Restarted squid on ragweed, was refusing connections. Four knams squids are currently down.
  • 13:00 jeluf: destroyed attachments of a posting on wikimediach-l (personal CV) upon Delphine's request

January 5

  • 22:30 mark: Added email alias for Monica in wikimedia.org
  • 21:35 brion: put a CACert-issued SSL cert on secure.wikimedia.org
  • 21:25 brion: added tr.wikisource (bugzilla:4333) and is.wikisource (bugzilla:4471)
  • 21:00 mark: sq1 seems broken:
scsi3 (0:0): rejecting I/O to dead device
sde : READ CAPACITY failed.
sde : status=0, message=00, host=0, driver=04
sde : sense not available.
scsi3 (0:0): rejecting I/O to dead device
sde: Write Protect is off
sde: Mode Sense: 00 00 00 00
sde: assuming drive cache: write through
 sde:<3>scsi3 (0:0): rejecting I/O to dead device
Buffer I/O error on device sde, logical block 0
  • mark: Fixed Fedora mirror on Albert, FC4 mirror is ok now
  • 18:20 brion: unblocked faleg.org from squids leech list, swears to do good (and moved stuff to tools server)
  • 17:05 brion: changed SSL cert for wikitech to one signed by CACert
  • 16:40 brion: added redirection for *.wikipedia.info
  • 05:47 Tim: Enabled plus signs in titles
  • 05:35 ævar: "Gordon Lyon" => "Fyodor Vaskovich" at http://fundraising.wikimedia.org/2005q4/index.php/2006-01-04/detail/

January 4

  • 21:30 jeluf: power cycled ragweed once again

January 3

  • 04:34 brion: set amane's hardware clock to UTC, was on US mountain time
  • 04:31 brion: load seems to have stabilized, things seem to be working
  • 04:28 brion: colo rebooted amane. it seems to be working now, but getting lots of hangs and things on main web
    • still waiting to fudge stuff
  • 03:09 amane seems to be dead; hangs on http, ssh; pings though
    • was able to root-ssh in, /home mount was hung. unmounted, remounted, restarted lighty; now dead again, but no longer accepting ssh and NFS server is also dead. site is down
  • 02:33 Tim: Thousands of pdns_control processes started by crond were running on zwinger, stuck waiting for a hung pdns_server process. Fixed.

January 2

  • 19:50 Domas: putting holbach into dewiki only operation

January 1

  • 20:22 Tim: starting refreshLinks.php
  • 19:10 Tim: putting templatelinks code live. Schema update finished a few hours ago.
  • 16:30 ævar: Got into tingxi, it's mysql gone wild, ~50 load, trying to contact the mysql server which isn't working with all the load..
  • ~15:55 ævar: tingxi is superloaded (or something) and isidore might be as well (might be using that for the SQL, not sure), as a result fundraising.wikimedia.org is down.

$ ssh tingxi "uptime"
16:03:18 up 124 days, 23:18, 0 users, load average: 78.62, 75.26, 72.74

..I haven't been able to open a normal shell...

  • 12:00 Tim: gave Datrio steward access on jawiki
  • 10:11 Solar: failed drive repaired on sq7, up on /d

December 31

  • 06:40 brion: adding en2 to dns aliases

December 30

  • 18:55 brion: added the one-time donation form to [4] by mav's request
  • 18:30 brion: set up private board wiki
  • 15:00 brion: banned another leech

December 29

December 28

  • 22:30 jeluf: replaced ssl certificate on https://tickets.wikimedia.org/
  • 21:19 brion: running cleanupCaps on zhwiktionary; changed the caps setting a few hours ago (bugzilla:4351)
  • 10:10 brion: killed a leech
    • srv5 is alive, but not yet configured. host key has changed, local login keys not installed. this may be an annoyance during squid updates.
  • Jamesday: gzipped binary logs on Adler; still need to be moved to long term storage. Now has 57GB/18 days of space for binary logs.

December 26

  • 11:49 brion: hacked RecentChange.php so the IRC output uses getInternalUrl(), so https urls don't go into the irc stream and confuse things
    • correct solution may be to add another level, 'getPrimaryUrl' or something. or else declare 'internal' to mean 'external' ;)
  • 08:30 brion: added goeje's new ip to ourusers for adler
  • 07:22 brion: set 'reupload' permission off for regular users, on for autoconfirmed users (older accounts) in response to persistent upload vandalism

December 25

  • 23:12 ævar: Installed a new special page extension, Special:Filepath, redirects user agents to the full path of a file like this.
  • 22:57 ævar: Ran scap
  • 14:27 mark: Partially set up sq1, but got frustrated by all the yum / Fedora crap.
  • 13:11 brion: left goeje waiting for a reboot on kernel upgrade; shutdown is hanging on an nfs unmount which should eventually time out.
    • apache 2.2 is installed, need to set up php and then start experimenting with stuff
    • need to take goeje out of apache nodegroup, but *not* mediawiki_installation
  • 12:13 brion: moved goeje to vlan 1, on 207.142.131.221
  • 11:30 brion: I'm going to reassign webster's external IP to one of the old 512 mb apaches and use it as an experimental https server for secure logins to the wikis
  • 04:33 Solar: sq7 is up at 10.0.3.7, but it is missing one drive at /d, RMA requested
  • 04:15 Solar: srv5 is back up with a fresh FC3
  • 04:00 Solar: srv71-80 are up at 10.0.2.71-80
  • 02:26 Solar: srv57 and srv61 are back up
  • 02:14 Solar: sq3 is up at 10.0.3.3
  • 01:54 ævar: Installed a debuglog for Cite on enwiki to debug a whitespace generating problem I can't reproduce locally, even with tidy.
  • ~00:45 ævar: Installed extensions/Cite/Cite.php site-wide, doesn't appear to be working on yaseo, as in <ref> & <references> just shows up as if the extension wasn't defined, even though it's required_once in CommonSettings.php on amaryllis, is it using some other system now?
...It's because I don't have permission to do anything at yaseo except on amaryllis...
$ ssh zwinger.wikimedia.org 
Last login: Sun Dec 25 00:46:21 2005 from adsl6-56.simnet.is
**** Documentation wiki at http://wikitech.leuksman.com/ ****
[0113][avar@zwinger:~]$ ssh amaryllis
Last login: Sun Dec 25 00:55:57 2005 from zwinger.wikimedia.org
Fedora Core linux kickstart-installed on Sun Sep 11 03:22:28 UTC 2005
[avar@amaryllis ~]$ dsh -f -N mediawiki-installation "hostname"
executing 'hostname'
avar@211.115.107.145's password:
avar@211.115.107.143's password:
avar@211.115.107.144's password:
avar@211.115.107.149's password:
avar@211.115.107.148's password:
avar@211.115.107.146's password:
avar@211.115.107.153's password:
avar@211.115.107.155's password:
avar@211.115.107.150's password:
avar@211.115.107.152's password:
avar@211.115.107.147's password:
avar@211.115.107.154's password:

December 24

  • 09:49 brion: switched wikitech.leuksman.com to HTTPS
    Out of interest, why? cause brion prefer yellow in URL bar.
    Would like to move more of our infrastructure stuff to be behind encrypted connections so passwords won't be exposed on insecure wireless networks when we're at conferences (ccc, wikimania, etc). While it would only be annoying if someone gets into wikitech or bugzilla accounts, sysop accounts on the main wikis or access to the internal wikis might be even more dangerous. Starting small, moving up.
  • 09:30 brion: upgraded leuksman.com to Apache 2.2.0 and PHP 5.1.2RC1.
    • Had to set 'EnableSendfile Off' to fix zero-length responses for static files. Probably something funny with the virtual server's kernel or filesystem.
  • 03:35 brion: turned on autoconfirm protection level on dewiki by elian's request
  • 02:59 brion: set local logo for hiwiki

December 23

  • 05:39 brion: removed leftover Amethyst.php from servers in pmtpa

December 22

  • 07:00 brion: installed new protection interface. set newbies time to 4 days

December 21

  • 10:00 mark: Raised bandwidth limit of csw1-pmtpa's port gi0/33 (Bomis/Wikicities) to 100 Mbit/s
  • 09:40 brion: updated the squid error page
  • 05:31 ævar: ran maintenance/updateSpecialPages.php --only=Unwatchedpages on all pmtpa wikis.
  • 05:24 ævar: Enabled Special:Unwatchedpages for users with protect permission and modified the querycache to cache 5000 pages for that instead of the default 1000. Jimbo made me!

December 20

  • 17:42 ævar: People were still reporting problems with $wgOut & sitenotice, cvs up'ed & ran scap
  • 14:59 ævar: Brion recently changed the sitenotice to use $wgOut->parse() instead of $p = new Parser; $notice = $p->parse(...); Apparently $wgOut is not always an object at that point. Nikerabbit reported a fatal call on a non object on that line. Inserted a live hack that tells people to report to #wikimedia-tech if it isn't an object while we hunt down why it doesn't get initialized properly sometimes.
  • 14:27 ævar: Turned on rcpatrol on fiwiki much to the enjoyment of domas

December 19

December 18

  • 14:20 Tim: attempted to restart lily, it crashed 20 hours ago.
  • 00:00 Domas: holbach resurrected and is working as db slave...

December 17

  • 13:05 Solar: Holbach is available at 10.0.0.24
  • 11:35 Solar: sq1-10 minus 3 and 7 ( hardware errors ) are up with 10.0.3.x ip's
  • 06:40 brion: installed <fundraising/> extension (FixedImage) for the fundraising progress bar
  • 01:55 brion: reinstalled php 5.1.1 on tingxi with gd enabled
  • 01:50 brion: briefly locked new registrations on zh.wikipedia while adding a range block;
  • 0:10 brion: rebuilt interwikis (bugzilla:1586)

December 16

  • 04:45 brion: installed apache 2.2 and php 5.1.1 on tingxi for fundraising info server (with SSL)

December 15

December 14

  • 20:30 hashar: added stylesheet for http://static.wikipedia.org/
  • 12:50 mark: Built a new squid RPM (2.5.STABLE12-2wm) that sets a maximum resident memory size (default: 2 GB, specifiable in /etc/sysconfig/squid), and tested it on fuchsia
  • 11:20 mark: Decreased the Squid timeout value of lvsmon on pascal to 10 seconds, and restarted iris which was thrashing heavily.

December 13

  • 22:31 brion: benet ran out of disk space, looking at where it went
  • 19:22 brion: review of dump status shows that srv30 broke during the dump circa 04:22 yesterday, crashing enwiki and eswiki. restarting those two dumps
  • 01:40 Tim: Restarted python IRC client on browne, on reports that no more channels were being created

December 12

  • 22:40 brion: reinstalled turck-mmcache on tingxi; had not been upgraded after PHP recompile and was whining about version mismatch
  • 14:30 mark: Resurrected mint which apparently had crashed two days ago.
  • 03:00-5:00 Tim: restarted some apaches with hung processes waiting for NFS

December 11

  • 13:17 hashar: BUG zwinger:/tmp/mediawiki/ should probably be in /var/cache/mediawiki/confs/ and wikitech group writable.
    • This is not the place to report bugs. Please use the IRC channel. -- Tim 20:55, 11 December 2005 (PST)
  • 13:16 hashar: created namespaces for itwiki & itwikisource (#bug 4247).
  • 09:33 brion: dumps running in pmtpa on benet/srv35/srv36; in yaseo on amaryllis

December 10

  • 05:20 brion: leuksman.com mysql & apache went wacko, memory limits killing things... restarted mysqld and apache
  • ~05:00 Tim: dsh -N mediawiki-installation -f chmod -R 777 /tmp/mediawiki . And changed MessageCache.php so that it will stay that way.
  • 01:40 brion: segfaults on leuksman.com reappeared; got backtrace, posted additional details on similar-looking php bug 35140. I have disabled APC on this server to try to reproduce the bug without it.
  • 00:10 brion: set up cywikisource and copied in some pages (bugzilla:4228)

December 9

  • 14:40 mark: Shutdown Tunnel0 on csw2-knams as an attempt to solve weird routing problems
  • 06:30 Solar: new squids are racked, but only sq1 and sq2 are up at 10.0.3.1-2
  • 04:47 Tim: set up www.wikimedia.org as a portal editable via meta, like the others

December 8

  • 17:43 brion: ns1 and ns2.wikimedia.org don't have updated DNS. what's wrong??
  • 11:40 Solar: sq1 is connected to the SCS port 9.
  • 11:29 Solar: asw3-pmtpa is racked and connected to the scs
  • 11:00 Solar: connected equ1's eth1 interface to csw4-pmtpa's port 34
  • 10:16 ævar: Turned allowemailchange on in bugzilla, users can now change their email
  • 10:12 Solar: fixed srv66's grub.conf to boot to correct kernel
  • 08:21 Domas: used srv70 as emergency tugela as srv66 down
  • 07:42 brion: updating tingxi in forward/reverse DNS and adding 'fundraising' CNAME
  • 06:30 brion: taking tingxi out of apache groups, giving it an external setup for fundraising utilities
  • 03:54 kate: stopped lomaria to dump for import to zedler
  • 02:10 brion: cleaning up after bogus CVS updates in common dir owned by hashar

December 7

  • 13:30 mark: Rerouted traffic back to knams
  • 12:00 mark: Rerouted knams traffic to pmtpa because of networking problems near knams
  • 11:14 brion: recompiled apache/php/apc on leuksman.com, hoping to debug intermittent segfaults if they continue

December 6

  • 19:00 jeluf: upgraded OTRS to 2.0.4
  • 05:47 ævar: cvs up'ed includes/SpecialVersion.php, there was a conflict, I removed the following code (the top part) since I presume it's not an issue anymore and the offending site has been blocked:
<<<<<<< SpecialVersion.php
                $ip =  str_replace( '--', ' - - ', htmlspecialchars( wfGetIP() ) );
                #return "<!-- visited from $ip -->\n";
                # hacked to a hidden span since one nasty was stripping comments
                return "<span style='display:none'>visited from $ip</span>\n";
=======
                $ip =  str_replace( '--', '-', htmlspecialchars( wfGetIP() ) );
                return "<!-- visited from $ip -->\n";
>>>>>>> 1.32
  • 04:23 ævar: De-installed Special:Cite on commons, meta, sources, species, foundation, nostalgia and mediawikiwiki. We really should have a $site variable that can be counted on (doesn't return wikipedia for non-wikipedia sites)
  • 01:48 Tim: installed Folding@Home on the yaseo apaches
  • 00:50-01:30 Tim: Started Folding@Home on knams squids

December 5

  • 22:13 Domas: Did bring back srv9 (not sure if it is a good idea). Removed bayle/will from service. All squids are null-storage now.
  • 20:35 hashar: apache-(restart|gracefull)-all(hard)? now use dologmsg instead of wikibugs
  • 20:26 Hashar: http://www.mediawiki.org/FAQ now redirects to meta: page (rewrite rule for virtual host mediawiki.org).
  • 18:00 Domas: noticed packetloss, talking to PM support
  • 17:30 Domas: did put srv6 squid into i/o-less operation, as srv10 had same hitrate ;-)

December 4

  • 20:30 jeluf: Several people report problems with Linker.php:504, thumbnail linking code. As a workaround, submitted and deployed Linker.php,rev-1.56
  • 13:27 hashar: changed kawiki & kawiktionary namespaces (bugs 2103 & 3905)
  • 10:10 brion: fixed tingxi's sudoers, fixed tingxi's /usr/local/apache/conf, synced its mediawiki, trying to start it. working? maybe
  • 10:00 brion: stopped apache on tingxi, has damaged copy of mediawiki
  • 04:05 brion: pascal root partition is full, needs cleanup
    • deleted ~350 megs of old kernel modules from /lib/module, leaving those for 2.6.12-1.1381_FC3
  • 03:02 Tim: srv55 has reported no more MCE errors, re-added to the apache pool
  • 01:45 Tim: fixed Turck on humboldt
  • 00:01 Domas, Tim: amane up and running, site back to normal

December 3

  • 21:20 mark: zwinger's /home mount on amane is broken, all fs calls block
  • 08:40 Tim: restarted squid on srv8, it had crashed
  • 07:47 Tim: Fixed ntpd on coronelli, harris, larousse and adler.
  • 07:14 Tim: Fixed ntpd on vincent and maurus. Stepped their clocks.
  • 06:41 brion: fixed sync-file so that message is optional again, like its help message claims
  • 06:10 Solar: Replaced what I believe to be the bad stick of ram for srv55. It's up.
  • 03:01 brion: added info-pl mail alias to OTRS
  • 02:09 brion: hopefully fixed the lucene restart problem; new mono installation in /usr/local wasn't in the PATH in crontab. hacked the init script to add it back

December 2

  • 22:00 brion: watching search daemons more closely; i think they're not properly restarting on the hourly restart cronjob
    • run logs in /var/log/mwdaemon-run.log
    • also clocks are very bad on maurus and vincent, need to ntp them
  • 21:19 hashar: sync-file now accepts comments after the file name.
  • 21:23 brion: restarted lucene daemons; for some reason they had all died
  • 09:17 brion: got lucene daemons back up and hopefully running.
    • there was an extra restart script in my personal crontab on maurus which seemed to be messing things up there
    • added a 'ulimit -n 8192' on the init script
  • 05:45 brion: yum mirrors appear to be broken (missing repo files), trying to re-sync
  • 03:44 Tim: srv43 didn't come back into the apache pool after restart, fixed
  • 02:28 brion: restarted search daemons, stuck

December 1

  • 23:53 brion: installed joe 3.3 on zwinger (in /usr/local/bin), handles utf-8 files properly
  • 23:01 hashar: knams cluster was unreachable for roughly 2 minutes, probably a maintenance on kennisnet side.
  • 22:39 hashar: created Portal namespaces on ptwiki #3385
  • 22:35 hashar: renamed namespaces on huwikibooks. 2 conflicts. #3783
  • 19:54 brion: moved old .conf files from /h/w/conf to /h/w/conf/httpd-old to reduce confusion
  • 19:54 hashar: gracefulled all pmtpa apaches to fix bug #4131
  • 19:49 hashar: fixed apache-sanity-check , calls to 'ip' missed '/sbin/'
  • 18:00 mark: Setup failover LVS on avicenna and alrazi. Still needs lvsmon, and isn't active yet. Uses CARP for failover.
  • 14:30 mark: Removed avicenna and alrazi from Apache duty, as I am going to use them as LVS load balancers.
  • 06:30 brion: removed /tmp/mediawiki/* caches on srv36; the backup run had saved a bunch by root and apache screamed about being unable to write them
  • 06:25 brion: restarted apache on yf1005; odd PHP error, possibly APC cache breakage.
    • Fatal error: main(): Failed opening required '' (include_path='/usr/local/apache/common/php-1.5:/usr/local/apache/common/php-1.5/includes: /usr/local/apache/common/php-1.5/languages:/usr/local/apache/common/php-1.5/templates: /usr/local/apache/common/php-1.5/extensions/wikihiero:/usr/local/lib/php:/usr/share/pear') in ½Íÿ on line 14
  • 05:47 Solar: Ariel's raid has "failed", but no real disk failures. It put the array back online and rebooted. We'll see how it does.
  • 05:00 Solar: Moved srv35-43 to second cage. Racked new sq1-sq10.
  • 04:30 brion: deleted 20051127 enwiki pages_full dumps, since srv36 was turned off before they finished

November 30

  • 21:49 brion: fixed upload dirs for wikimediafoundation.org
  • 05:43 Solar: Racked donated load balancer in core cage on csw1-pmtpa port 34
  • 02:25 brion: removed a privacy-violation in a revision comment via database edit (enwiki rev_id 29652015)

November 29

Note: ZX will be rebooting and upgrading most knams machines tonight, to help fix our problems. They will be taking machines down one by one, so this shouldn't give downtime - in theory. If it does, check whether the LVS ip is bound to the machines when they come up.

  • 23:30 mark: Apparently service ips often were not added because /etc/rc.d/rc.local wasn't run... because it did not have eXecute permissions on some machines. Fixed.
  • 21:56 hashar: rebuildMessages.php finished.
  • 21:45 brion: bugzilla:4115 setting up latex on latest srv batch, adding to setup-apache
  • 21:30 brion: found and fixed upload files for meta
  • 21:15 brion: investigating broken upload files on meta
  • 20:49 hashar: fixed bug 4048 and running 'rebuildMessages.php --update' on all wikis.
  • 16:35 mark: Squid wasn't running on srv6, started
  • 16:15 mark: Ran yum upgrade on all knams machines
  • 15:15 mark: Reversed the change as it didn't work anyway: Squid simply ignores failure on binding IPs.
  • 14:00 mark: Adapted the Squid configurator / squid.conf.php to explicitly bind to the Squid's main IP address and the LVS IP, if applicable. Meant to ensure that Squid will not start if the LVS IP is not bound to the machine, so lvsmon can detect that.
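
The 23:30 entry above traces the missing service IPs to an /etc/rc.d/rc.local that lacked execute permission. A minimal sketch of how that condition could be checked and corrected is below. The function names are hypothetical and this is not the procedure actually used on the knams machines; it only inspects and sets the permission bits of whatever path it is given.

```python
import os
import stat
import tempfile


def is_executable_script(path):
    """Return True if path is a regular file with the owner execute bit set."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    return stat.S_ISREG(st.st_mode) and bool(st.st_mode & stat.S_IXUSR)


def make_executable(path):
    """Add execute bits for user, group and other (the effect of chmod a+x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than the real rc.local.
    fd, demo = tempfile.mkstemp()
    os.close(fd)
    os.chmod(demo, 0o644)
    print(is_executable_script(demo))
    make_executable(demo)
    print(is_executable_script(demo))
    os.remove(demo)
```

Run across a node group, a check like this would list the machines where rc.local is silently skipped at boot.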

November 28

  • 23:34 Hashar: uploaded a picture of clusters, please post comment on image talk page so I can modify / update it.
  • 21:50 Domas: restarted rogue failing (bytecode cache issues?) apaches: srv47, srv4, srv37, srv63, srv58, srv67, srv68, srv53, srv39
  • 20:40 Domas: ragweed booted up, started squid, then started something else (for a minute or two), then ran rc.local with LVS IP adding... site down for several minutes
  • 19:58 brion: ragweed is down (no ping), OTRS dead
  • 14:35 ævar: Site crashed because of insufficient sanity checks, my bad.
  • 14:30 Domas: srv62 tugela crashed, no core dump yet, if crashes persist will need some poking, either code, or srv62. mcelog empty.
tugela-fc3-x64[3634]: segfault at 00000000010e1000 rip 0000003a781716e0 rsp 0000007fbffff668 error 6
  • 00:44 ævar: Changed the project name and metanamespace for iswikibooks to Wikiorðabók
  • 00:00 Domas: oops, ran tugela on srv51-srv54,srv56-srv69 instead of memcached, will see how it performs/scales/...

November 27

  • 23:21 hashar: thanks to palica : updated Server inventory bot to add a link to ganglia.
  • 19:09 hashar: added two scripts to check database : 'mysql-list' & 'replication'
  • 18:49 hashar: BUG rose got 4 memcached instances but they are not listed in mc-pmtpa.php
  • 18:47 hashar: commented 10.0.2.43:10000 from mc-pmtpa.php
  • 13:13 ævar: Installed Special:Cite on all the wikipedias
  • 10:15 brion: blocked wikipedia-l, wikien-l, and helpdesk-l list archives in mail.wikipedia.org's robots.txt to discourage future complaints about embarrassing newbie posts becoming #1 google hits. Search patches for mailman archives should be integrated at some point...
  • 08:55 JeLuF: added http://www.spy-sweeper-webroot.de/wiki/?/ to squid's leecher blocklist
  • 07:38 Solar: smellie is ready for service. Turned off seLinux.
  • 07:30 Solar: srv5 is out with a bad case of bad blocks
  • 07:00 Solar: Crossed over to the new switch, csw4-pmtpa
  • 03:09:56 ævar: Installed Special:Cite on enwiki as an experiment.
  • 01:39 Tim: took srv55 out of service, likely dud RAM. MCE errors reported.
  • 01:10 Tim: squid on will had crashed. Restarted.
  • 01:05 Domas: fixed default route on tingxi

November 26

  • 17:30 jeluf: changed password of wikipl-l admin account. Gave new PW to Datrio. Documented PW at the usual place.
  • 14:36 Tim: put srv52-70 into apache service. I broke srv51 with a restart test.
  • 12:00 Tim: wrote /h/w/b/apache-sanity-check, set up scripts such as apache-start to run it and refuse to start apache if the necessary LVS-friendly conditions are not met.
  • ~11:00 Tim: broke site temporarily due to LVS-related misconfiguration
  • 10:53 Tim: rose, tingxi and srv2 had apache running but no LVS VIP. This would explain the random hanging behaviour with ab -X apaches:80. Fixed temporarily, will look into a permanent solution.
  • 08:45 Tim: LVS wasn't decommissioned properly on iris. LVS on pascal was forwarding packets to LVS on iris, and iris, with no lvsmon running, forwarded most of those packets to sage, which is down. Thus users were seeing connection timeouts. Fixed with ipvsadm -D -t rrvs.knams.wikimedia.org:80.
  • 07:41 Tim: srv5 still not up. Moved its virtual IPs, one to srv6, one to srv8 and one to srv10.
  • 07:15 Tim: did a fsck of srv5 then a system reboot
  • 03:32 srv5's root partition spontaneously declared "read-only filesystem". Logs stopped moving. Mount reported that it was still rw, but it couldn't be written to.
mount uses the contents of /etc/mtab to display mounts. These are not updated when the file system is r/o. Use /proc/mounts instead.
  • 05:50 Tim: introduced time and memory limit for rsvg and convert
  • 01:45 Tim: started image backup using updated scripts in /h/w/b
  • 00:14 ævar: changed the logo for iswiktionary.
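
The note under the 03:32 entry above says that mount reads /etc/mtab, which can go stale when the root filesystem flips to read-only, and recommends /proc/mounts instead. A minimal sketch of that check follows, assuming the usual whitespace-separated /proc/mounts layout (device, mount point, type, options, ...). The function name and the sample data are illustrative, not taken from srv5.

```python
def readonly_mounts(proc_mounts_text):
    """Return the mount points whose option list contains 'ro'."""
    readonly = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            readonly.append(mount_point)
    return readonly


# Illustrative sample in /proc/mounts format (not real output from srv5).
SAMPLE = """\
/dev/sda1 / ext3 ro,data=ordered 0 0
proc /proc proc rw 0 0
/dev/sda2 /var ext3 rw,noatime 0 0
"""

if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on systems without /proc
    print(readonly_mounts(text))
```

Because the kernel generates /proc/mounts on each read, it reflects a remount to read-only even when nothing can be written to /etc/mtab.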

November 25

  • 21:45 Hashar: killed some rsvg processes on various apaches. Seems they tried to render a 120px thumb of /commons/7/70/Interstate_Highways.svg (possible DOS ? :( ).
  • 04:40 Tim: experimentally enabled keepalive on apache.
  • 03:35 Tim: testing lvsmon failover by stopping squid on clematis
  • 03:05 Jamesday: Adler had 11GB disk free. gzipped first 80 binlogs to raise it to 48GB or so. gzipped versions still need to be moved to wherever we're keeping them these days.

November 24

  • 06:30 kate: setting up l3 failover.. see that page for details
  • 02:55 brion: took coronelli out of search rotation while kyle moves it around

November 23

  • 21:18 mark: Routing problems from 38.0.0.0/8 (cogent ip space) to florida. Altered the countries.nerd.dk file to reroute that prefix via knams.
  • 20:44 mark: Reinstated the normal epoll RPM on mint, as epoll wasn't the problem
  • 16:44 brion: fixed arrangement of upload directories for several sites (non-wikipedia :P)
  • 00:35 kate: "ntp source vlan1" fixed NTP problem on csw1, but need to work out why traffic to 64.156.25.242 is being dropped
  • 00:04 kate: upgraded csw4-pmtpa to 12.2(25)SED, enabled ssh and configured vlan 2 properly

November 22

  • 22:33 brion: amane still seems to work. YAY \o/
  • 21:49 brion: restarted apache on zwinger, wasn't loading
  • 21:45 brion: increased php fastcgi workers on amane to absurd levels for thumbs to run
  • 21:30 brion: mostly working now! had to set server.max-workers to 8 in lighty to get it running smoothly
  • 19:28 brion: mounted /mnt/upload3 (amane) on zwinger, was missing mountpoint
  • 19:22 brion: mounted /mnt/upload3 (amane) on srv2, was missing mountpoint
  • 19:11 brion: restarted albert's http temporarily to cover the work period
  • 19:02 brion: khaldun copy finally finished, rearranging bits on amane
  • 15:21 brion: turned albert's http back off (hope you're done) so khaldun can finish its copy without the extra load
  • 07:44 brion: started albert's http so kate can set things up requiring the local fedora yum mirror
  • 05:08 kate: configured asw2-pmtpa. has the new srvs and the equ device on it (equ is 10.0.1.3)
  • 00:55 brion: started copying commons files from bacon -> amane. disabled albert's apache
  • 00:45 brion: started copying enwiki files from khaldun -> amane, non-wikipedia non-wiktionary files from albert -> amane
  • 00:35 brion: started copying files bacon -> amane
  • 00:20 brion: disabled uploads sitewide

November 21

  • 23:10 brion: setting up to move uploads to amane, will disable all uploads and upload.wikimedia.org for a while to make this damn thing happen
  • 21:15 brion: started lucene index rebuild on maurus
  • 21:05 brion: restarted squid on will, was not responding (stuck) on port 80
  • 20:49 brion: restarted apache on ragweed; https was down so otrs inaccessible
  • 20:30 mark: Brought sage and mayflower back up.
  • 20:00 mayflower went down.
  • 20:00 mark: Moved LVS back to pascal to allow iris to be a squid again.
  • 19:45 mark: Modified lvsmon on iris because it was always sending curl requests with Pragma: no-cache! And therefore testing the whole chain to florida.
  • 19:45: sage went down.
  • 18:00 mark: Installed non-epoll RPM on mint to compare.
  • 17:56:40-17:57:31 ævar: Invalid argument notices were being generated in this time period due to me syncing three files and them depending on each other, ok now.
  • 17:30 mark: udpmcast wasn't running on pascal. No idea since when... started.
  • 17:30 jeluf: Restarted ragweed. Came back after powercycling and fsck.
  • 16:30 ragweed broken.
  • 12:14 erik: Updated logo of nap.wikipedia.org and sync'd InitialiseSettings.php

November 20

  • 23:30 mark: Upcoming maintenance of knams tomorrow (ZX will do some firmware upgrades, rebooting at least pascal and vandale). Moved LVS to iris because of that.
  • 20:00 JeLuF: All wikipedia.org upload directories moved off of albert and to amane.
  • 18:03 Hashar: fixed #4022 'Asia/Seoul' timezone for kowiki.
  • 17:50 Hashar: switched some logos to /b/bc/Wiki.png
  • 14:24 JeLuF: chown -R apache:apache amane:/export/upload/wikipedia.org/
  • 14:19 Hashar: in amane:/export/upload/wikipedia.org/ some directories can't be written by apache (af de es & fr). dewiki upload page reports an error.
  • 09:26 Tim: Fixed NTP broadcast, documented
  • 03:21 Tim: Fixed perl upgrade on srv51-70 as per [5]

November 19

  • 17:05 Tim: same on fuchsia
  • 16:50 Tim: restarted squid on clematis, disabled swap.
  • 16:05 Tim: upgraded otrs on ragweed to version 2.0.3, after Anthere complained about this bug: [6]. Minor upgrades within the 2.0.x series weren't documented (just an unanswered question on the ML), so I just untarred over the top of the old directory, with a backup in /opt/otrs-2.0.1. Treat any problem symptomatically, some chmodding might be required.
  • 15:40 Tim: restarted squid on bayle

November 18

  • 23:30 brion: installing ploticus 2.32 on mediawiki-installation, set to use gd & truetype fonts (bugzilla:3965)
    • truetype fonts in common/fonts
  • 07:00 jeluf: migration of dewiki's image and thumbnail directories done. archive and shared will be moved when albert has more headroom. Some 30 small to medium wikis moved. Currently running frwiki thumbnail migration.
  • 00:27 brion: blocked another leech [7]

November 17

  • 14:30 mark: ragweed was missing the LVS ip, fixed. Also readded iris as squid.
  • 06:30 Tim: Added root key to srv51-70. The following machines didn't want to cooperate: 56, 64, 66, 67, 69
  • 06:05 Tim: added srv51-70 to DNS, created a node group. Configured albert's BIND as a slave for the 10/8 reverse DNS zone.
  • 05:46 Solar: srv2 is back up.
  • 05:38 Solar: srv56 is up too.
  • 05:26 Solar: srv51-srv70 are ready for Rock & Roll! (Except srv56 has some hardware issue)
  • 04:34 Solar: holbach is rebuilt and ready
  • 03:47 Tim: added tingxi and rose to the apaches node group. Left harris out, it sucks.
  • 03:30 Tim: after moving some more hosts to the misc2 cluster, restarted gmond on the apache cluster to remove hosts which have been moved out
  • 02:24 Tim: fixed amane's date, started ntpd
  • 01:49 Tim: Created "Misc VLAN2" cluster on ganglia, for miscellaneous hosts which, due to being in the wrong VLAN, couldn't be in Miscellaneous.

November 16

  • 8:25 brion: srv50 error_log flooded disk; removed and restarted apache
  • 6:30 jeluf: moved es upload area to amane:/export/upload
  • 5:30 jeluf: moved eo, ang, an upload areas to amane:/export/upload. Backups are still on albert in .../remove.
  • 04:14 Tim: attempted to restart squid on will. It didn't work. I hacked /etc/init.d/squid to send errors to a file instead of /dev/null, and found it was giving error messages like "parseConfigFile: line 17 unrecognized: 'htcp_port 4827'". I started the squid copy in /usr/local/ instead.
  • 01:20 brion: reenabled special:renameuser with the 'archive' bit disabled. it's possible that some undeleted pages will have incorrect rev_user_text data

November 15

  • 23:00 jeluf: moved aa, ab, af, ak, als, am, ar, ast, zh image uploads to amane:/export/upload
  • 20:32 hashar: updated http://wikimedia.org/stats/live/ with a message redirecting to the "new" system ( http://noc.wikimedia.org/stats.php ).
  • 16:13 Tim: running batch imagemagick convert job on bacon, converting 1911 EB scans to PNG.
  • ~12:30 Tim: Deployed diff cache and parser cache push features. Reduced cache expiry for RC feeds on en from 60 to 20 seconds. The performance impact of this should be monitored -- the diff cache should reduce it but it might not be enough.
  • 03:46 Tim: Re-enabled tidy, trimmed error logs. The huge error logs did indeed have a few tidy errors towards the end, once every few minutes, interspersed with lots of "file not found" errors. Preceding this lack of activity was gigabytes of either:
    [Mon Nov 7 04:33:33 2005] [error] PHP Parse error: parse error, unexpected $ in /usr/local/apache/common-local/php-1.5/checkers.php on line 101
    OR
    *** attempt to put segment in horiz list twice
    Neither of which have anything to do with tidy. The other noticeable thing at the very end of the error logs was that apache was segfaulting regularly, but it was doing that just as much after tidy was disabled.
  • 01:22 ævar: resolved bug 3968
  • 00:50 brion: cleaned giant error_log files from srv44 and srv47, which had run out of space during sync
  • 00:41 brion: adding some signature-nazi features, so new sigs with unbalanced html tags will not be inserted

November 14

  • 22:30 mark: Many apaches have error_log's of 100G in size and more! Partly due to tidy, but how is logrotation supposed to be setup? See bug #3966
  • 22:00 - 22:12 hashar: $wgUseTidy = false; it's filling error logs on all apaches and seems to stall. Restarted all apaches too. Wikipedians need to FIX their HTML.
  • 14:00 mark: Rebooted srv10, and started Squid on it with no cachedirs (1 null cachedir). Assigned IP .214 to it.
  • 08:28 Tim: restarted squid on srv6. Slow hit service times (~100ms), it wasn't swapping but it had very little spare memory for kernel cache and buffers.
  • 03:05 Tim: bayle was swapping heavily, very slow service times for both hits and misses. Restarted squid, added it to the ganglia squid cluster.
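
The 22:30 entry above reports error_log files of 100G and more. A rough sketch of a size check that could flag such files before a disk fills is below. The function name, the threshold and the demo directory layout are assumptions for illustration, not a description of how the apaches were actually monitored or how log rotation was configured.

```python
import os
import tempfile


def oversized_logs(directory, limit_bytes, name="error_log"):
    """Return (path, size) for files called `name` under `directory`
    that are larger than limit_bytes, biggest first."""
    found = []
    for root, _dirs, files in os.walk(directory):
        if name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                found.append((path, size))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Demo on a temporary tree: one small log and one over the limit.
    base = tempfile.mkdtemp()
    for sub, size in (("small", 10), ("big", 2000)):
        os.mkdir(os.path.join(base, sub))
        with open(os.path.join(base, sub, "error_log"), "wb") as f:
            f.write(b"x" * size)
    print(oversized_logs(base, 1000))
```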

November 13

  • 22:50 jeluf: mounted amane:/export/math to all mediawiki-installation servers for storage of math images.
  • 20:00 midom: srv10 squid hanged, reiserfs issues?
  • 16:57 brion: running data dumps on benet/srv35/srv36

November 12

  • 19:49 ævar: tingxi had languages/LanguageCs.php (and probably something else) out of date, IIRC it has been down for some time, ran scap to bring it and others up to date.

November 11

  • 00:16 brion: changed sitename on eswikinews (meta-namespace was already set)

November 10

  • 14:28 ævar: changed the logo on trwiki
  • 09:06 ævar: Changed the upload url of the wikis that had uploading disabled to point to the commons
  • 09:09 brion: gave up trying to upgrade bugzilla due to bugzilla upgrade failure
  • 08:40 brion: running yum update on pascal; got some glibc double-free bug during bugzilla update, and thought it was time to upgrade some damn packages
  • 08:25 brion: shutting down bugzilla for upgrade to 2.20
  • 07:18 brion: removed check_policy_service from /etc/postfix/main.cf on kate's advice, to see if it's more stable with that off
  • 07:02 brion: restarting postfix on zwinger, mail stopped again
  • 05:59 ævar: Removed harris from /usr/local/dsh/node_groups/mediawiki-installation, responded to ping, had port 22 open, but hung forever on ssh harris
  • 04:27 Tim: set up ftp server on bacon, to accept uploads of scanned page images

November 9

  • 14:18 Tim: fuchsia was swapping, regularly timing out on lvsmon health checks. Restarted squid.
  • 11:09 brion: modified parser cache behavior to do cache with redirect targets. should increase hit rate; if troubles experienced, revert Article.php back to rev 1.396
  • 10:13 brion: reenabled search text extracts for active sessions only
  • 07:32 brion: updating live search indexes
  • 00:54 brion: no mail in last eight hours... restarting postfix

November 8

  • 23:30 jeluf: After intensive fsck, ragweed is back.
  • 19:00 ragweed pings, but doesn't allow SSH login
  • 13:10 holbach crashed
  • 12:05 Tim: deployed local message cache, causing a 60% drop in network traffic on the apache cluster according to ganglia. We had noticed probable network saturation on the 100 Mbps switch asw1, this was the obvious solution. A content hash is stored in memcached and checked on each request. The local cache is stored in files, one file per wiki in /tmp/mediawiki/
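The local message cache described in the 12:05 entry can be pictured roughly as follows. This is only a minimal sketch, not the deployed MediaWiki code: a plain dict stands in for memcached, and the class name, key names, file naming and JSON file format are all assumptions made for illustration. The idea it shows is that each request fetches only the small content hash from the shared store, and the full message blob crosses the network only when that hash differs from the one saved in the local per-wiki file.

```python
import hashlib
import json
import os
import tempfile


class LocalMessageCache:
    """Per-wiki file cache validated against a hash held in a shared store.

    Hypothetical illustration: `shared` is a dict standing in for memcached,
    and the key names and file layout are assumptions, not the real ones.
    """

    def __init__(self, shared, cache_dir):
        self.shared = shared
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, wiki):
        # One file per wiki, as in the log entry.
        return os.path.join(self.cache_dir, wiki + ".json")

    def store(self, wiki, messages):
        """Publish messages: the full blob and its hash go to the shared store."""
        blob = json.dumps(messages, sort_keys=True)
        self.shared[wiki + ":messages"] = blob
        self.shared[wiki + ":hash"] = hashlib.md5(blob.encode("utf-8")).hexdigest()

    def load(self, wiki):
        """Return (messages, source); source is 'local' or 'shared'."""
        expected = self.shared.get(wiki + ":hash")  # small fetch on every request
        try:
            with open(self._path(wiki), encoding="utf-8") as f:
                cached = json.load(f)
            if cached["hash"] == expected:
                return cached["messages"], "local"
        except FileNotFoundError:
            pass
        # Local copy is missing or stale: fetch the full blob once and save it.
        messages = json.loads(self.shared[wiki + ":messages"])
        with open(self._path(wiki), "w", encoding="utf-8") as f:
            json.dump({"hash": expected, "messages": messages}, f)
        return messages, "shared"


if __name__ == "__main__":
    cache = LocalMessageCache({}, tempfile.mkdtemp())
    cache.store("enwiki", {"edit": "Edit"})
    print(cache.load("enwiki")[1])  # first request has to fetch the full blob
    print(cache.load("enwiki")[1])  # later requests are served from the file
```

Under these assumptions the steady-state network cost per request is one hash lookup rather than the whole message set, which is consistent with the traffic drop reported above.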

November 7

  • 20:51 kate: stopped replication on lomaria. please don't start it without asking me unless it's extremely important.
  • 20:45 brion: trying to get tidy going again
  • 20:30 brion: rebuilding search indexes on maurus.
  • 20:00 brion: set search daemons to restart hourly. *sigh*
  • 14:05 Tim: brought holbach back into service. Tweaked some load ratios.
  • 13:55 Tim: started slave on lomaria. It was idle, the site was slow.
  • 05:45 brion: switched lucene search to default to AND matches
  • 02:50 brion: set up init script for MWDaemon (/etc/init.d/mwdaemon), added a daily cronjob to restart them

November 6

  • 21:04 brion: several servers had disks filled from apache error_log; libart in rsvg apparently spewing out gigs of "*** attempt to put segment in horiz list twice"
  • 20:10 brion: site unusually loaded; giving a kick to the apaches for luck
  • 11:08 jeluf: srv22 was overheated. killed svg renderer (240 cpu minutes)
  • 11:00 jeluf: added Category:Broken_servers for better keeping track of todos
  • 10:40 jeluf: added portal namespace for nowiki upon Jhs' request
  • 08:20 kate: copying from lomaria again... whee!
  • 05:20 brion: added id.wikisource.org by request
  • 04:59 Tim: started lvsmon-ksquid on pascal
  • 04:39 kate: iris crashed... moved lvs to pascal.
  • 02:40 Tim: Made MW check $cluster.dblist instead of all.dblist. This will generate appropriate error conditions for improper access to foreign databases via commandLine.inc, Special:Makesysop or squid misconfiguration.
  • 01:40 Tim: installed memcached on srv41-50, moved instances from various other machines to there, including offloading browne completely. Restarted memcached on srv22, it had a dead instance.
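The per-cluster check from the 02:40 entry amounts to a membership test against the cluster's own list instead of the global one. The sketch below is not the actual MediaWiki change; it assumes a .dblist file holds one database name per line, and the function names, exception class and directory argument are hypothetical.

```python
import os
import tempfile


class ForeignDatabaseError(Exception):
    """Raised when a wiki database is not hosted on the local cluster."""


def read_dblist(path):
    # Assumed format: one database name per line, blank lines ignored.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def check_database(db_name, cluster, dblist_dir):
    """Return db_name if it is listed in <cluster>.dblist, otherwise raise."""
    allowed = read_dblist(os.path.join(dblist_dir, cluster + ".dblist"))
    if db_name not in allowed:
        raise ForeignDatabaseError(
            "%s is not in %s.dblist" % (db_name, cluster)
        )
    return db_name


if __name__ == "__main__":
    # Example lists; the wiki names here are illustrative only.
    d = tempfile.mkdtemp()
    with open(os.path.join(d, "pmtpa.dblist"), "w", encoding="utf-8") as f:
        f.write("enwiki\ndewiki\n")
    print(check_database("enwiki", "pmtpa", d))
    try:
        check_database("kowiki", "pmtpa", d)
    except ForeignDatabaseError as e:
        print("rejected:", e)
```

Failing early like this turns a request for a database hosted in another cluster into an explicit error rather than a confusing connection failure.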

November 5

  • 22:02 kate: restarted replication on lomaria. set up replication on zedler.
  • 11:00 brion: chgrp'd common files on humboldt
  • 09:15 solar: installed new image filer, amane, into the rack.
  • 04:55 kate: stopped replication lomaria again to re-dump. don't start it please. (server is still running)
  • 03:41 Tim: tried to restart dumpHTML on srv31, the machine crashed almost immediately
  • 03:39 brion: starting dumps on yaseo on amaryllis/henbane
  • 03:32 kate: copy finished, restarted replication on lomaria
  • 03:00 brion: refresh-dblist now also creates pmtpa.dblist and yaseo.dblist, based on assignment overrides from clusters.dblist
  • 00:45 brion: started pmtpa dumps on benet, srv35, srv36

November 4

  • 21:45 jeluf: moved lighttpd on benet to /usr/local/lighttpd. Added startup to /etc/rc.local
  • 21:00 jeluf: mounted benet:/var/backup to zwinger:/mnt/backup_benet
  • 06:25 brion: restarted search servers; memory usage up to 650-1000mb range, and very slow response on vincent

November 3

  • 11:03 kate: copying lomaria's db to zedler, don't start it
  • 21:45 erik: fixed he.wikinews site name and meta namespace (hopefully), sync'd InitialiseSettings.php and ran update.php accordingly
  • 20:44 brion: investigating connection errors (hacked wfLogDBerror to include hostname); seems to be on the new opteron boxen only
  • 20:30 hashar: started apache on srv35.
  • 20:22 hashar: started apache on avicenna.
  • 20:10 mark: Will was running with only 1024 FDs. As it's the only non-RPM squid around (will is FC1) and I added bayle, I have taken it out, reassigned IPs to srv5 and srv7.
  • 19:55 hashar: some apaches need a reboot. load is incorrectly high on them because of processes stuck in state D (see bug #3869)
  • 15:10 mark: Moved bayle (previously broken, inactive memcached) to the external vlan, made it a temporary squid. I cannot get it to mount izwinger:/home though. Any ideas?
  • 5:30 Tim: copied ~tstarling/.ssh/known_hosts to /etc/ssh/ssh_known_hosts on all pmtpa machines
  • ~5:00 Tim & kate: syslogd stopped working on zwinger, causing DNS to stop working. Kate restarted syslogd.
  • ~5:00 created hewikinews using addwiki.php, sync-common-all
  • 04:07 kate: made amaryllis ns3.wikimedia.org. needs magic stuff so it can be added as auth ns
  • 01:58 Tim: restarted search daemon on vincent, the usual problem

November 2

  • mark: Apparently the restart squid cron job in the squid RPM is broken in a weird way: at some point in time /sbin/pidof /usr/sbin/squid will stop working. I will fix it and roll out a new RPM tomorrow. Sorry for the trouble!
  • 23:20 JeLuF: Found 2 squids on srv8. Killed both, started a new one.
  • 22:20 Tim: adapted lvsmon for knams squid service, started it on iris. See /usr/local/bin/lvsmon-ksquid . There's also a copy in ~tstarling/lvs on zwinger in case iris goes down.
  • 21:30 mark: Installed the new squid RPM on clematis. Not using epoll didn't change memory leaking behaviour.
  • 19:17 kate: LDAP on pascal was broken after reboot.
Nov  2 19:11:26 pascal slapd[29793]: bdb_db_init: Initializing BDB database
Nov  2 19:11:26 pascal slapd[29794]: bdb(dc=knams,dc=wikimedia,dc=org): Lock table is out of available locks
Nov  2 19:11:26 pascal slapd[29794]: bdb_db_open: db_open(/var/lib/ldap) failed: Cannot allocate memory (12)
Nov  2 19:11:26 pascal slapd[29794]: backend_startup: bi_db_open(0) failed! (12)
Did a db_recover and restarted slapd.
  • 04:38 kate, kyle: csw4 is installed. nothing on it yet.
  • 01:08 kate: pascal broke again, moved LVS to iris
  • 00:10 kate: colo allocated us 84.40.25.224/27, wikicities will move into this network

November 1

  • 23:39 brion: created car-fr-l list for french arbcom
  • 22:25 brion: heavy packet loss between pmtpa and lopar; kate is moving dns off lopar for now
  • 21:10 UTC erik: created ru.wikinews.org using addwiki.php
  • 18:26 mark: Dropped 207.142.131.225 as gateway IP, as it doesn't seem to be in use anymore
  • 18:15 mark: Made csw1-pmtpa act as a DHCP relay agent for rabanus, 10.0.0.15
  • 04:20 kate: replaced mormo.org with pascal & amaryllis as backup MX, using postgrey + other anti-spam stuff
  • 05:48 Solar: anthony, suda, isidore and bayle are back up.
  • 05:10 Tim: Cleaned up the squid list in CommonSettings.php. The need to have variables for the IP addresses of each squid passed long ago, it was just clutter, doubling the length of the section. Added the external IP address of will, which was missing, causing edits to be wrongly attributed in the yaseo wikis.

Archives