Server admin log/Archive 13

From Wikitech

June 30

  • 23:58 Tim: killed gearmanWorker.php instances on hume running with the old master conf, re-ran them running as apache
  • 23:20 Andrew: 23:19 <@brion> !log s1 switched to db16, enwiki back online! (thanks tim!) tomasz rebooted wikitech linode which also borked. :P
  • 22:58 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'put non-enwiki back to read/write'
  • 22:57 brion: db14 (s1 master) is in some kind of borked state. slaves seem up to date; gonna try a master switch.
  • 22:49 logmsgbot: brion synchronized php-1.5/CommonSettings.php
  • 22:49 brion: putting site to read-only
  • 22:44 Andrew: Rolling back updates, first vector skin, then config changes. If that doesn't help, will roll back UsabilityInitiative changes
  • 22:43 logmsgbot: andrew synchronized php-1.5/skins/Vector.php 'Rolling back updates to r52581'
  • 22:29 Andrew: db14 seems to be overloaded
  • 22:27 Andrew: reports of (Cannot contact the database server: Unknown error (10.0.6.24))
  • 22:24 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Remove vector from wgSkipSkins on usabilitywiki'
  • 22:21 Andrew: Updated Vector and UsabilityInitiative, deployed UsabilityInitiative to usabilitywiki. Scapping to apply.
  • 21:06 tomaszf: restarting lighttpd on storage2 due to large i/o wait
  • 20:01 logmsgbot: andrew synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php
  • 18:44 tomaszf: dropping -grosley dns entries in favor of dev. and civicrm.
  • 16:07 Fred: restarted mwserve on pdf1 as it was not processing pdfs anymore.
  • 15:56 brion: activity on pdf1 seems to have croaked around 15:30. pdf daemon might need a restart?
  • 04:28 logmsgbot: tstarling synchronized php-1.5/extensions/timeline/Timeline.php
  • 04:27 logmsgbot: tstarling synchronized php-1.5/extensions/timeline/EasyTimeline.pl
  • 04:27 Tim: updated EasyTimeline to r52591
  • 03:34 Rob: seeing good results with the firewall plugin on the corporate blog, copying it over to the techblog and setting it to active.
  • 02:57 Rob: It was net gremlins!
  • 02:56 brion: appears to have been temporary networking issue (external?) affecting access to some clients
  • 02:48 brion: some sort of downtime reported for a few minutes. no apparent current problems persisting, all looks well
  • 02:11 hcatlin: Mobile site is having some encoding issues with UTF-8 characters, therefore the redirects have been disabled from common.js

June 29

  • 23:03 logmsgbot: andrew synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php 'Updated l10n'
  • 23:00 Fred: updated DNS for *.m.wikipedia.org to point to mobile1 instead of eiximenis
  • 22:05 logmsgbot: andrew synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.js 'Trevor told me to!'
  • 22:05 logmsgbot: andrew synchronized php-1.5/extensions/UsabilityInitiative/EditToolbar/EditToolbar.css 'Trevor told me to!'
  • 21:47 Fred: apache stuck on srv[210-214] - restarting.
  • 21:36 Andrew: scapping for vector updates
  • 21:35 Andrew: updated Vector skin to r52581
  • 21:18 Andrew: Lost connection to zwinger mid-scap. Running scap again to make sure everything's fine.
  • 21:10 Andrew: Scapping to deploy UsabilityInitiative CSS/JS changes, and license updates done by Fred.
  • 21:09 Andrew: updated WikimediaMessages to pick up the latest changes pushed to svn by siebrand.
  • 20:58 Andrew: not scapping yet, deploying license change stuff too, waiting for Fred to do the rest of the updates for that
  • 20:52 Andrew: scapping
  • 20:49 Andrew: Activating UsabilityInitiative extension and removing vector from $wgSkipSkins on testwiki
  • 20:38 RobH_A90: racked and installed mobile1
  • 20:38 RobH_A90: pulled nehlam test server
  • 19:17 Andrew: Fred svn up'd EditPage.php, Skin.php, MessagesEn.php and WikimediaMessages. I reversed the updates in EditPage.php and Skin.php and used svn merge -c to update EditPage.php and svn up -r to update Skin.php, cherry-picking only the correct updates (r52361)
  • 19:01 Andrew: updated UsabilityInitiative extension on test as prep for deployment on testwiki.

June 28

  • 19:39 RobH_away: ran authdns-update
  • 19:39 RobH_away: changing bayes mgmt IP (to make it sane and in line with the rest of the mgmt IP ranges) and adding the new IP to dns and reverse template files
  • 18:58 brion: bayes back up
  • 18:52 logmsgbot: midom synchronized php-1.5/db.php 'db28 live'
  • 18:49 brion: poking LOM on bayes; box is down, EZ requested reboot.
  • 08:54 domas: manually reset db28 position, based on innodb internal info, got some filesystem relaylog corruption
  • 08:52 logmsgbot: midom synchronized php-1.5/includes/GlobalFunctions.php 'removing messages profiling hook'
  • 06:43 Tim: rebooted db28 with /proc/sysrq-trigger
  • 06:29 Tim: depooled db28, locked up in kswapd, needs reboot

June 27

  • 07:42 domas: changed log expiration on db9 to 100 days

June 26

  • 20:33 Rob: blog.wikimedia.org back online, very restrictive settings and improved security (i hope)
  • 19:09 Rob: singer https services back online for survey, ocs, wm09schols
  • 18:20 Rob: Singer Restore: survey.wikimedia.org - UP, ocs.wikimania2009.wikimedia.org - UP
  • 17:11 Rob: pulled blog.wikimedia.org out of squid via changing its dns to point directly at singer. just to make it easier to secure and fix on the fly later today
  • 15:03 Rob: firing singer back up for reinstall and restoration from the haxor
  • 14:42 logmsgbot: midom synchronized php-1.5/includes/parser/ParserCache.php 'removing the pcache hack for now'
  • 11:44 domas: halted singer, don't start it without contacting me.
  • 07:27 logmsgbot: brion synchronized php-1.5/includes/parser/ParserCache.php 'Putting live hack for MJ article back. Total traffic spike down but article traffic still spiking.'
  • 05:53 Tim: rebooting srv182, swap death since ~00:10 UTC
  • 05:50 Tim: killed broken mysqld_safe instances on srv157, srv161, srv167, srv173, srv174, srv175, srv181, srv185, srv186
  • 05:17 logmsgbot: jeluf synchronized php-1.5/mc-pmtpa.php 'replace srv113 by srv142'
  • 05:14 logmsgbot: jeluf synchronized php-1.5/mc-pmtpa.php 'replace srv182 by srv92'

June 25

  • 23:36 logmsgbot: brion synchronized php-1.5/includes/parser/ParserCache.php 'live hack to extend caching of Michael Jackson'
  • 23:17 logmsgbot: brion synchronized php-1.5/includes/db/Database.php 'tweak the db err msg'
  • 22:46 logmsgbot: brion synchronized php-1.5/mc-pmtpa.php 'putting 156 mc back'
  • 22:44 logmsgbot: brion synchronized php-1.5/mc-pmtpa.php 'swapping in 145 in place of temp down 156'
  • 22:43 tomaszf: rebooting srv156 due to hard down
  • 22:41 logmsgbot: midom synchronized php-1.5/db.php '156 down'
  • 18:26 RobH_A90: srv92 back up with replaced fans
  • 18:16 RobH_A90: srv92 down due to bad fan, going to replace it with another dead server fan, yay for frankenstein servers
  • 18:09 RobH_A90: db17 has had a cold reset, service processors are now responsive, as well as system
  • 09:21 logmsgbot: tstarling synchronized php-1.5/languages/Language.php

June 24

  • 16:27 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 19364'
  • 15:56 Fred: modified /home/wikipedia/conf/nagios/sync to include new nagios server in the sync/restart process.
  • 15:55 Fred: cleaned up /home/wikipedia/conf/nagios/conf.php to remove unused server / conf parameters.
  • 15:54 Fred: Cleaned up dsh node_group and resynched Nagios.
  • 10:26 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'livehack upper limit, as mediawiki devs are lazy.. oh wait.'
  • 08:29 domas: compressed databases on db3-5 using lzip, freed up the space. :)
  • 08:16 logmsgbot: midom synchronized php-1.5/db.php 'db24 back to life'
  • 06:28 Tim: updated GeSHi to 1.0.8.4
  • 05:54 Tim: removed all SVN externals from the MW working copy. Updated extensions/SyntaxHighlight_GeSHi to r52346. Scapping.

June 23

  • 22:00 Fred: make && deploy of new squid configuration with added acl for spence.w.o
  • 21:07 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19364 Set project namespace of Portuguese Wikibooks'
  • 20:49 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Fixing issues with arwiki namespacing and formatting'
  • 19:33 Rob: checked up on the updateTitles running against enwiki on hume. On 11864380 / 23266362
  • 19:31 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19357 Create new namespace on arwiki'
  • 19:22 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
  • 19:14 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 19116 adding namespace aliases for itwikibooks'
  • 15:39 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19328 set for fywikibooks - typosssss'
  • 15:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19328 set for fywikibooks'
  • 15:24 logmsgbot: robh synchronized php-1.5/abusefilter.php 'Updated with Andrew'
  • 15:15 Rob: wrong reason listed in sync, oops, was for bug 19272
  • 15:15 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'extensions/AbuseFilter/abusefilter.tables.sql'
  • 15:14 logmsgbot: robh synchronized php-1.5/abusefilter.php '19274 Enable AbuseFilter on Lithuanian Wikipedia.'
  • 08:23 logmsgbot: midom synchronized php-1.5/db.php 'add db28 to s1'
  • 08:20 logmsgbot: midom synchronized php-1.5/db.php 'db19 going live'
  • 00:09 brion: obsoleted useless "web browser" custom field on bugzilla. Doesn't appear in search, hardcoded list would need to be maintained, generally not useful.

June 22

  • 23:27 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Changed wgRC2UDPPrefix for usability.wikimedia'
  • 20:02 tomaszf: adding stats.m.wikipedia.org for hcatlin
  • 20:01 tomaszf: pkill'd pdns on ns1 due to zombie and defunct procs
  • 19:31 JeLuF: removed the refresh_pattern from ortelius' squid config
  • 19:30 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Trying to enable Nuke on enwiki'
  • 19:12 JeLuF: firefishy switched OSM DNS to deliver tiles via ortelius
  • 17:45 JeLuF: added service IP for tiles.wikimedia.org
  • 16:51 logmsgbot: brion synchronized php-1.5/CommonSettings.php 'Setting CC by-sa 3 $wgRightsUrl/$wgRightsText ... see if anything explodes.'
  • 16:15 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php '19309 Creation of "autopatroller" usergroup on en.wikipedia'
  • 16:00 Fred: restarted apache on srv208, srv132 as well
  • 15:59 Fred: restarted apache on srv182

June 21

  • 21:28 logmsgbot: andrew synchronized php-1.5/lucene.php 'switch enwiki mwsuggest to lucene backend'
  • 20:57 logmsgbot: andrew synchronized php-1.5/lucene.php
  • 11:30 domas: copying db3->db28
  • 11:21 logmsgbot: midom synchronized php-1.5/db.php 'depooling db24, was not in replication, IT IS GODDAMN SNAPSHOT SLAVE'
  • 11:20 logmsgbot: midom synchronized php-1.5/db.php 'repooling db15, db24, depooling db3, db4'
  • 11:08 logmsgbot: midom synchronized php-1.5/db.php 'db5 off, will use as db19 copy source'

June 20

  • 19:06 domas: pdf1 mw-serve had segfaulting python processes, haha! kill + /etc/init.d/mwserve start seems to have helped.
  • 08:18 logmsgbot: midom synchronized php-1.5/db.php 'removing db3,db4,db5, will be rebuilt into national slaves'

June 19

  • 23:54 logmsgbot: ariel synchronized php-1.5/CommonSettings.php 'Test-push - No Change'
  • 23:53 logmsgbot: ariel synchronized php-1.5/CommonSettings.php 'Test-push - No Change'
  • 22:24 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php '18902 Enable collectionsaveascommunitypage and collectionsaveasuserpage on Default'
  • 17:13 Fred: exim configuration on Sanger updated. Details can be found on https://wikitech.wikimedia.org/view/Mail
  • 16:08 RobH_A90: shutting down eiximenis for ram upgrade (thus m.* will be down until it is back online)
  • 03:18 logmsgbot: tstarling synchronized php-1.5/includes/SkinTemplate.php 'fix for fatal error'

June 18

  • 20:16 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Live-merged r52141, which fixes broken page-moves on Wikimedia sites'
  • 20:11 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Less agressive testing code as I can't immediately reproduce the bug. Still loading extension messages as a live hack so we can see wtf is going on when a user reports the bug'
  • 20:01 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Testing code for unbreaking page moving'
  • 16:46 Fred: starting the upgrade process for all apache boxen.
  • 16:24 Fred: rebooting srv224 to test dist upgrade.
  • 16:16 mark: Ran rm -rf /var/tmp/texvc on all apaches
  • 16:13 mark: Upgraded wikimedia-task-appserver to 1.39 on srv76
  • 15:40 logmsgbot: mark synchronized all.dblist
  • 15:38 logmsgbot: mark synchronized all.dblist
  • 15:26 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'updating logo for ruwikimedia'
  • 15:07 Tim: running schema updates on ruwikimedia
  • 14:57 Rob: correction, did not remove, merely set to false
  • 14:57 Rob: removed some apaches from pybal config since they are not receiving updates
  • 14:46 logmsgbot: robh synchronized php-1.5/flaggedrevs.php
  • 14:05 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialUserrights.php 'r52116'
  • 08:00 logmsgbot: tstarling synchronized php-1.5/extensions/FlaggedRevs/FlaggedRevs.class.php
  • 07:45 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php 'srv92 down'
  • 07:24 Tim: scap to r52088
  • 05:26 JeLuF: reinstalled ortelius (formerly known as ptolemy)
  • 05:24 Tim: amane had wikimedia-task-appserver 1.33, was reporting it was "kept back". Ran apt-get dist-upgrade to fix.
  • 04:33 Tim: db15 lagged due to schema updates, depooling again
  • 04:33 logmsgbot: tstarling synchronized php-1.5/db.php
  • 03:46 logmsgbot: tstarling synchronized php-1.5/db.php

June 17

  • 21:27 JeLuF: ptolemy squid test setup running, needs some fine tuning (esp. statistics)
  • 19:48 JeLuF: added ptolemy in dhcpd configuration
  • 19:27 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Deployed r52047, fix for AbuseFilter parser fatals'
  • 19:20 logmsgbot: andrew synchronized php-1.5/includes/HTMLForm.php 'Deploying r52070, string/int inconsistency in XmlSelect value/default breaking imagesize option'
  • 16:43 Fred: rebooting srv101 to finish install.
  • 16:35 logmsgbot: andrew synchronized php-1.5/api.php 'Mistake in previous sync'
  • 16:23 logmsgbot: fvassard synchronized php-1.5/mc-pmtpa.php 'srv75 is down. Replacing with spare: srv97'
  • 16:22 logmsgbot: andrew synchronized php-1.5/api.php 'Fix API for secure.wikimedia.org with ugly live-hack'
  • 16:17 Fred: srv101 install from Monday not completed. Finishing it now. (yes this is a memcached node as well)
  • 16:02 logmsgbot: andrew synchronized php-1.5/api.php 'Debugging for bug 19263'
  • 13:54 logmsgbot: tstarling synchronized php-1.5/flaggedrevs.php 'removed bug 19207 workaround'
  • 13:40 logmsgbot: tstarling synchronized php-1.5/extensions/TrustedXFF/trusted-xff.cdb
  • 13:30 Tim: running schema changes on db15
  • 09:46 Tim: scap at r52034
  • 09:32 Tim: updating to r52031
  • 06:04 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'enabled djvutxt'

June 16

  • 20:56 logmsgbot: andrew synchronized php-1.5/includes/Preferences.php 'Fix bug 19237, broken Preferences page for some languages (e.g. cs)'
  • 18:37 Fred: irc.wikimedia.org is back in service. Channel list is growing. Everything seems to be working as expected.
  • 18:02 Fred: upgrading irc.wikimedia.org. Server will be offline for a couple of minutes.
  • 15:37 logmsgbot: tstarling synchronized php-1.5/flaggedrevs.php 'granted all bots the review right to work around bug 19207'
  • 15:36 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Blocking an email address which has been spamming ascii art to admins'
  • 14:54 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Fix for backwards-incompatibility in AbuseFilter list handling'
  • 10:10 logmsgbot: andrew synchronized php-1.5/extensions/CodeReview/CodeRevision.php 'Live-merging r51955 "CodeReview doesn't load messages when sending e-mail notifications of follow-up revs"'
  • 09:44 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialUserrights.php 'Live-merged r51952, fix for Special:GlobalGroupMembership'
  • 00:42 Tim: scap
  • 00:41 Tim: svn up to r51943

June 15

  • 23:05 logmsgbot: tstarling synchronized php-1.5/flaggedrevs.php 'autoreview bot edits'
  • 20:30 mark: Fixed puppet config of srv35
  • 14:47 logmsgbot: tstarling synchronized php-1.5/includes/ImageFunctions.php 'docs'
  • 14:45 logmsgbot: tstarling synchronized php-1.5/languages/Language.php 'removed hack'
  • 14:41 logmsgbot: tstarling synchronized php-1.5/includes/filerepo/FSRepo.php 'documented hack'
  • 13:58 Tim: updated core to r51904, will scap
  • 13:30 logmsgbot: tstarling synchronized php-1.5/serialized/MessagesMr.ser
  • 13:30 logmsgbot: tstarling synchronized php-1.5/languages/messages/MessagesMr.php
  • 12:59 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php
  • 12:54 logmsgbot: tstarling synchronized php-1.5/languages/Language.php
  • 12:35 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php
  • 12:05 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php 'debugging'
  • 12:03 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php 'debugging'
  • 11:54 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php 'debugging'
  • 11:44 logmsgbot: tstarling synchronized php-1.5/includes/MagicWord.php 'debugging'
  • 11:32 logmsgbot: tstarling synchronized php-1.5/includes/Skin.php 'r51882'
  • 11:29 Tim: created missing table code_bugs on mediawikiwiki
  • 11:27 logmsgbot: tstarling synchronized php-1.5/includes/ChangeTags.php 'fixed change_tag index name (2nd attempt)'
  • 11:25 logmsgbot: tstarling synchronized php-1.5/includes/ChangeTags.php 'fixed change_tag index name'
  • 10:08 Tim: merged r51871
  • 10:08 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialSearch.php
  • 09:58 Tim: rebooting db17 via mgmt
  • 09:57 logmsgbot: tstarling synchronized php-1.5/db.php 'depooling db17, went down due to scap'
  • 09:49 Tim: srv159 swapdeath, rebooting using management interface
  • 09:46 logmsgbot: tstarling synchronized php-1.5/includes/ImageFunctions.php 'disabled bad image list'
  • 09:37 logmsgbot: tstarling synchronized php-1.5/includes/filerepo/FSRepo.php 'disabled fileExistsBatch'
  • 09:32 Tim: s2 read/write
  • 09:25 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db13, seems to have fixed itself'
  • 09:19 Tim: done critical schema updates on db15
  • 09:12 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialRecentchanges.php
  • 08:56 logmsgbot: tstarling synchronized php-1.5/db.php 'making db8 the fake master on s2'
  • 08:44 logmsgbot: tstarling synchronized php-1.5/db.php
  • 08:25 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialRecentchanges.php
  • 08:21 logmsgbot: tstarling synchronized php-1.5/db.php
  • 08:20 logmsgbot: tstarling synchronized php-1.5/db.php
  • 08:19 logmsgbot: tstarling synchronized php-1.5/db.php
  • 08:19 Tim: switching s2 master from db15 to db13, db15 is missing schema updates
  • 08:09 Tim: scap
  • 07:13 Tim: restarted test.wikipedia.org
  • 07:08 Tim: svn update to r51863
  • 06:49 Tim: shut down test.wikipedia.org to avoid issues with the NFS copy
  • 06:04 Tim: preparing for scap to ~r51860. Made backup in lazy-backups/php-1.5-2009-06-15. Will remove merged changes to bring php-1.5 back to near r48811 plus hacks, to reduce conflicts on svn up.

June 14

  • 17:05 logmsgbot: andrew synchronized php-1.5/CommonSettings.php

June 13

  • 22:13 logmsgbot: andrew synchronized php-1.5/CommonSettings.php
  • 21:54 logmsgbot: kate synchronized php-1.5/db.php
  • 21:54 river: taking db24 out of rotation to dump s2 for TS
  • 05:07 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'added wmfLoadInitialiseSettings definition'
  • 05:06 logmsgbot: tstarling synchronized php-1.5/wgConf.php 'removed wmfLoadInitialiseSettings definition'

June 12

  • 04:13 logmsgbot: tstarling synchronized php-1.5/wgConf.php 'updated for scaptrap r51333'

June 11

  • 22:46 Fred: Kicking srv156 as it has gone unresponsive
  • 16:06 Rob: rolled backups and upgrades to all corporate blogs (newsblog, whygive, & techblog). All upgrades test successful with no visible issues.
  • 15:39 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18612'
  • 14:41 Rob: updated dns for russian chapter url
  • 14:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 14731'
  • 07:48 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '#19149 bgwiki autopatrol group'
  • 06:14 Tim: added srv76 to mediawiki-installation and ran sync-common, was rogue
  • 02:13 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db16, db13, db11'

June 10

  • 18:59 logmsgbot: midom synchronized php-1.5/db.php 'reduced load on snapshot nodes'
  • 12:02 Tim: master switches done, everything should be r/w. Doing schema changes now, toolserver needs to wait for these to be logged before it switches.
  • 12:01 logmsgbot: tstarling synchronized php-1.5/db.php
  • 11:59 logmsgbot: tstarling synchronized php-1.5/db.php
  • 11:55 logmsgbot: tstarling synchronized php-1.5/db.php
  • 11:53 logmsgbot: tstarling synchronized php-1.5/db.php
  • 11:52 logmsgbot: tstarling synchronized php-1.5/db.php
  • 11:51 logmsgbot: tstarling synchronized php-1.5/db.php
  • 11:50 Tim: doing master switches, s1 -> db14, s2 -> db15, s3 -> db18
  • 08:11 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'added 1.15 to ExtensionDistributor'
  • 07:58 logmsgbot: midom synchronized php-1.5/includes/Article.php
  • 07:53 logmsgbot: midom synchronized php-1.5/includes/Article.php
  • 07:50 logmsgbot: midom synchronized php-1.5/CommonSettings.php 'added one more log'
  • 07:05 logmsgbot: midom synchronized php-1.5/includes/GlobalFunctions.php 'adding messages profiling hook, the usual one'

June 9

  • 21:31 mark: Cleaned up stale route-maps on csw1-esams
  • 21:08 mark: Removed unnecessary static routes to esams VLANs on br1-knams
  • 21:01 mark: Migrated IPv4 traffic onto DF leg 1, altered static routes on br1-knams as well
  • 20:20 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '#19138 - arwiki group settings'
  • 20:19 mark: Set up IPv6 iBGP session between csw1-esams and br1-knams over DF leg 1
  • 20:18 mark: Set up new link between csw1-esams and br1-knams over our first dark fiber leg
  • 20:16 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '#19142 - fiwiki wgBlockAllowsUTEdit=true'
  • 17:14 logmsgbot: tstarling synchronized php-1.5/db.php 'moved db18 back to regular s3'
  • 15:50 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db1, db14, lomaria'
  • 14:54 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 18594'
  • 14:54 logmsgbot: robh synchronized php-1.5/flaggedrevs.php 'bug 18594'
  • 14:20 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18905 Enable recent changes patrol on Bulgarian Wikipedia'
  • 14:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '14276 Enable patrol function for non-sysops on Turkish Wikipedia'
  • 14:04 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18194 Enable NewUserMessage extension on Arabic Wikisource'
  • 13:53 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18590'
  • 13:48 logmsgbot: tstarling synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewEdit.php 'fix bug 19135'
  • 09:15 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled db1, db14, lomaria'
  • 09:02 Tim: installed ganglia on lomaria
  • 08:56 logmsgbot: tstarling synchronized php-1.5/db.php 'moving db18 into temporary fr/ja role to replace db1'
  • 08:52 Tim: installed ganglia on db17
  • 08:42 logmsgbot: tstarling synchronized php-1.5/db.php 'reassigned db24 back to s2'
  • 08:35 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled the remainder of the current batch'
  • 06:09 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled thistle'
  • 05:44 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled ixia for schema updates'
  • 04:51 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled db22, db26, db30, thistle, db25, db29 for schema change, warming db24 for commons role'

June 8

  • 21:44 Andrew: missing column af_global reported by AbuseFilterViewEdit.php, in /h/w/logs/dberror.log. Used ddsh to check checksums for that file on all servers, no differences from the version under /h/w/c, which had no mention of the offending column.
  • 16:48 logmsgbot: tstarling synchronized php-1.5/db.php
  • 16:47 logmsgbot: tstarling synchronized php-1.5/db.php
  • 16:19 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled the current batch of servers. db24 is still lagged.'
  • 16:07 logmsgbot: tstarling synchronized php-1.5/db.php
  • 10:48 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled db4, db7, db23, db24, db18, db21 for schema updates. db12 will do the enwiki query groups.'
  • 10:38 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db12, db3, db8, db15'
  • 10:11 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db5, db17'
  • 10:01 Tim: adding AbuseFilter tables to all wikis that don't have them
  • 09:59 logmsgbot: tstarling synchronized php-1.5/extensions/AbuseFilter/abusefilter.tables.sql
  • 08:54 logmsgbot: tstarling synchronized php-1.5/db.php 'removed db12 from query groups (was 1%)'
  • 07:27 Tim: depooled db12, db3, db8, db15, db17 for schema updates
  • 07:27 logmsgbot: tstarling synchronized php-1.5/db.php
  • 06:46 Tim: running DB updates on db5. All updates today are done by maintenance/update-2009-06-08.php running in a screen on zwinger
  • 06:16 logmsgbot: tstarling synchronized php-1.5/maintenance/archives/patch-log_user_text.sql
  • 06:12 logmsgbot: tstarling synchronized php-1.5/maintenance/archives/patch-log_user_text.sql
  • 06:01 Tim: depooled db5 for schema update. Makes a good guinea pig since it has the lowest disk free space.
  • 06:00 logmsgbot: tstarling synchronized php-1.5/db.php
  • 05:51 logmsgbot: tstarling synchronized php-1.5/db.php 'repooled db29, depooled on Feb 2 by Domas "for some testing"'
  • 05:43 logmsgbot: tstarling synchronized php-1.5/db.php
  • 05:38 Tim: fixed nagios configuration, had many errors preventing sync
  • 05:23 Tim: disk space critical on storage2. Deleted ~600 GB of files from 2008: all 2008 backups except those that come from wikis that are not in all.dblist
  • 05:04 Tim: stopped ES slave on srv171, disk critical. ms2/ms3 have been reasonably stable replacements.
  • 04:58 Tim: cleaned up binlogs on db13, was disk space critical

June 7

  • 23:02 Andrew: Usability prototype wiki was insanely slow because it ran out of memory and swapped, and then ran out of swap. Looks to have been one rogue PHP process, which I killed. Restarted apache (it had been killed by the kernel to free memory).
  • 13:26 JeLuF: PDF generation checked, seems to be working
  • 13:24 JeLuF: PATH setting was missing in the startup script of mw-serve on pdf1. Added it in line 17.
  • 13:16 JeLuF: pdf generation (Extension:Collection) is broken. Server restart didn't help

June 6

  • 12:41 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '18397 itwiktionary suppressredirect permission'
  • 12:21 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '19106 dawiki rollback permission'

June 5

  • 20:11 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '18341 plwiktionary namespace alias'
  • 20:05 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '18588 Fix pt namespace alias'
  • 20:01 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '18985 wgBlockAllowsUTEdit for ptwiki'
  • 19:51 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '(19041) NewUserMessage extension for rowiki'
  • 19:48 logmsgbot: jeluf synchronized php-1.5/CommonSettings.php 'removed obsolete idwiki account creation throttle'

June 4

  • 18:48 Rob: all active memcached servers now online
  • 18:47 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'decommissioned srv90 due to bad fans'
  • 18:46 Rob: fixing memcached
  • 18:43 Rob: pulled srv90 and srv67 for decommissioning
  • 18:43 mark: Rebooting iris which has a bad disk
  • 18:31 Rob: replaced disk in both sq26 and sq47
  • 18:01 Rob: shutdown sq47 for bad disk
  • 18:01 Rob: shutting down sq26 to replace bad disk
  • 17:01 Fred: temporarily re-enabled deletion for OTRS while the Junk queue is getting cleaned.
  • 13:09 mark: Moved back traffic to esams

June 3

  • 20:20 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '17050 Change upload rights at Russian Wikipedia'
  • 18:47 Rob: pushed update to planet for new inclusions (and removed some crap)
  • 18:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 18594'
  • 17:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18874 Add enwiki as import source at mediawikiwiki'
  • 17:42 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18591'
  • 16:53 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18588 Create a namespace aliases on yuewiki'
  • 16:28 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18956 Import sources on el.wikiversity, forgot a source'
  • 16:27 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18956 Import sources on el.wikiversity'
  • 16:11 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '17967 Alias of Wikibooks namespace in Chinese Wikibooks'
  • 15:51 Rob: running initStats.php against commonswiki per bug 17802
  • 15:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '16180 Upload and Transwiki settings for Japanese Wikiversity'
  • 14:48 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '13853 Setup new groups in no.wikibooks'
  • 14:29 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18590 Add autoeditor group and remove autopromote on Ukrainian Wiktionary'
  • 14:19 Rob: ran sync-common-all to update cluster, enabling flaggedrevs on eswikinews
  • 06:37 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php
  • 06:36 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'moved wgSkipSkins to InitialiseSettings.php and added vector'

June 2

  • 05:00 logmsgbot: tstarling synchronized php-1.5/extensions/Collection/Collection.templates.php 'deployed r51327'
  • 01:14 rainman_: ran salsa to update to latest search logging code

June 1

  • 23:37 tomaszf: cleaning up space on storage2. once dumps are being cycled free space will come back
  • 20:38 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18876 Create new namespace for Korean Wiktionary'
  • 20:21 logmsgbot: robh synchronized php-1.5/CommonSettings.php '16961 Activate watchcreations on Commons'
  • 20:09 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18369 Add an import source for de.wikisource'
  • 19:59 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
  • 19:59 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18437 Subpages not activated on enwiki ns 13'
  • 19:59 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18437 Subpages not activated on enwiki ns 13'
  • 19:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '13451 Set new user groups for bswiki'
  • 17:01 Rob: Fred pushed DNS to add bugzilla upgrade installation url
  • 15:58 logmsgbot: robh synchronized php-1.5/flaggedrevs.php
  • 15:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '19013 Request for Extension:Collection on hr Wikipedia'
  • 15:31 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18217 Additional namespace aliases for cuwiki'
  • 15:13 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'bug 17730'
  • 15:11 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18329 New namespace alias in ca.wiki'
  • 15:07 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '17694 Enable subpages on Lithuanian Wikipedia template namespace'
  • 15:03 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '6633 Enable transwiki import for the Hebrew projects'
  • 14:58 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '9907 Change Portal talk to Perbincangan Portal on mswiki'
  • 14:50 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18865 Please, disable local upload of files on es.wikibooks.'

May 31

  • 16:05 logmsgbot: andrew synchronized php-1.5/includes/IP.php 'Live-merging r51236-7, fixes for IP::isInRange, which was broken.'
  • 10:51 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Activating AbuseFilter on ukwiki'
  • 10:50 logmsgbot: andrew synchronized php-1.5/abusefilter.php 'Activating AbuseFilter on ukwiki'

May 30

  • 20:14 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Improve logging for hacked-in hook'
  • 17:26 mark: Manually fixed up srv51, srv52, srv55
  • 17:06 mark: There were two duplicate instances of pybal running on lvs3, killed both and restarted
  • 17:03 mark: Puppetised srv90-99
  • 16:53 mark: Puppetised srv80-89
  • 16:45 mark: Puppetised srv71-79
  • 16:23 mark: Puppetised srv61-srv70
  • 16:08 mark: Puppetised srv51-60
  • 16:04 mark: Installed srv57 and srv58 as application servers
  • 15:53 mark: Installed srv56 as application server
  • 15:42 mark: Installed srv42 as application server
  • 15:36 mark: Rebooted srv42
  • 15:30 mark: Fixed test.wikipedia.org, reinstalled wikimedia-nis-client
  • 14:51 mark: Puppetised srv48-srv50
  • 14:01 mark: Puppetised srv38-41
  • 13:55 mark: Installed puppetd on all application servers
  • 13:32 mark: Puppetised, dist-upgraded & rebooted srv37
  • 13:25 mark: dist-upgraded & rebooted srv32, srv36
  • 13:16 mark: Puppetised srv35 and srv36, dist-upgraded & rebooted srv35
  • 12:28 mark: Repooled srv32-srv33, srv121-123
  • 12:27 mark: Puppetised srv34
  • 12:09 mark: Installed srv120 as appserver

May 29

  • 23:27 tomaszf: restarting apache on hume for static.wikipedia.org to clean out old dead/lazy/useless workers
  • 21:29 Rob: bad file permissions on morebots caused it to not restart. This explains the huge gap in log entries. we were making them in IRC and no bot was logging it
  • 21:28 Rob: blah
  • 21:25 Rob: damned morebots
  • 00:10 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Overriding moving for users on usabilitywiki for sysop as well'

May 28

  • 23:16 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Overriding moving for users on usabilitywiki (again, but with the right attribute this time)'
  • 23:09 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Overriding moving for users on usabilitywiki'
  • 22:04 rainman_: enabled lucene-search 2.1 on all wikis, still needs more tweaking to use available resources more efficiently, but leaving that for tomorrow
  • 20:59 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Enabled PdfHandler on usability'
  • 20:58 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Enabled PdfHandler on usability'
  • 20:50 Fred: updated apache cluster with wikimedia-task-appserver 1.38
  • 20:22 logmsgbot: andrew synchronized php-1.5/lucene.php 'Swapped out search11, replaced with search12 and added $wgLuceneSearchVersion = 2.1;. Rainman made me do it!'
  • 19:53 logmsgbot: andrew synchronized php-1.5/lucene.php 'Reverted last change, rainman told me to do it'
  • 19:45 logmsgbot: andrew synchronized php-1.5/lucene.php 'Swapped out search11, replaced with search12 and added $wgLuceneSearchVersion = 2.1;. Rainman made me do it!'
  • 17:42 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Logging for previously added hook for blocking mail spam'
  • 17:20 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18815 Create Portal namespace on Swahili Wikipedia'
  • 17:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '16217 Official Korean name for Wikisource'
  • 16:54 logmsgbot: andrew synchronized php-1.5/CommonSettings.php
  • 15:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18835 Activation of subpages feature on frwiki ns=15'
  • 15:03 Rob: updated sync-common-all to mimic the logging of sync-file so I do not have to manually enter admin log entries for it anymore
  • 14:59 Rob: ran updateAutoPromote and updateLinks in flaggedrevs maintenance for iawiki
  • 14:58 Rob: ran sync-common-all to enable flaggedrevs on iawiki
  • 14:28 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18897 Change logo for Bulgarian Wikisource'
  • 14:23 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Hacked in a hook to stop some clown from email-spamming'
  • 10:09 Tim: deploying r51105 to stop people from emailing via tor
  • 10:08 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialEmailuser.php
  • 07:40 logmsgbot: midom synchronized php-1.5/db.php 'db26 going live'
  • 00:31 rainman_: deployed udp search query logging on searchidx1 and search1-12

May 27

  • 20:37 tomaszf: killing long running query. bugzilla is all well again
  • 20:33 tomaszf: cleaning out numerous locks on db9 causing huge slowdown for bugzilla3 db
  • 08:48 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'AbuseFilter on English Wikinews'

May 26

  • 22:51 river: deleted some old snapshots on ms1
  • 22:45 Fred: fixed dsh a tad, taking care of a few hosts whose keys have changed and hosts missing keys...
  • 22:28 Fred: removed stale amane NFS mount on all apache boxes.
  • 22:17 Andrew: added 'Change Tagging' component to bugzilla
  • 22:05 brion: adminned werdna on bugzilla to help w/ component config etc
  • 18:17 rainman-sr: all of search_4 wikis temporarily moved to search12 server, preparing search11 for taking its place with full set of lucene-search 2.1 features
  • 17:51 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18589 Enable AbuseFilter on Cantonese Wikipedia'
  • 17:51 logmsgbot: robh synchronized php-1.5/abusefilter.php 'updates on bug 18589 Enable AbuseFilter on Cantonese Wikipedia'
  • 17:47 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18938 Please activate Rollback on Simple.Wikibooks'
  • 17:27 logmsgbot: andrew synchronized php-1.5/lucene.php 'Rainman made me do it! (taking search11 out of rotation)'
  • 16:45 rainman-sr: running initial warmup and start of lsearchd on search12
  • 16:38 Rob: search12 memory upgraded and rebooted
  • 16:37 rainman-sr: running initial warmup and restart on search11
  • 16:32 Rob: search11 ram upgraded and rebooted.
  • 16:04 Rob: srv217 shutdown until I can poke it
  • 16:01 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'removed "xx" from $wgStyleVersion'
  • 15:36 Rob: decommissioning out of warranty and broken servers: srv53, srv67, srv85, srv88, srv90.
  • 15:22 Rob: db28 fan controller board replaced (not db26, typo) system is now online
  • 15:21 Rob: db26 fan controller board replaced, system is now online
  • 15:15 Rob: db26 memory replaced, shows all 32 GB now. Restarting
  • 15:13 logmsgbot: andrew synchronized php-1.5/includes/Export.php 'Removing useless free()s breaking partial dumps'
  • 14:20 Rob: shutdown mysql on db26 as it will have the memory replaced in approx. 25 minutes.
  • 13:58 logmsgbot: tstarling synchronized php-1.5/skins/common/block.js
  • 13:56 Tim: merged r50871
  • 13:55 logmsgbot: tstarling synchronized php-1.5/includes/Block.php
  • 08:14 Tim: srv78 had apache running but did not have /mnt/upload5 mounted. Upgraded its wikimedia-task-appserver package and mounted it.
  • 07:23 Tim: set cr_status=new for all revisions with a recent status change by Skizzerz

May 25

  • 21:35 mark: Modified scap scripts to work on /home-less apaches
  • 21:17 logmsgbot: midom synchronized php-1.5/db.php 'replaced db16 with db12 for auxiliary roles'
  • 18:51 mark: Restarted stuck apache on srv187
  • 18:49 mark: db16 got in trouble at 18:35, spiking to 2000 threads. Might be related to the nv_nic_irq bug. It recovered 5 mins afterwards
  • 17:33 mark: Increased COSS cache dirs from 10 GB to 15 GB on knsq16+ esams squids
  • 07:08 domas: brought williams up, was down, console not responding.

May 23

  • 01:07 tomaszf: restarted srv159 due to hard down

May 22

  • 18:03 mark: Increased big object store on all upload squids from 15 GB to 20 GB
  • 15:24 mark: Raised max in-memory object size to 100 kiB on all squids
  • 15:11 mark: Raised cache dir size of large object store on knsq18 from 15 to 20 GB
  • 14:58 mark: Increased cache dir size by 50% on knsq16, and upped max object size in memory from 75 to 100 kB followed by a backend squid restart
  • 14:40 mark: Increased cache dir size by 50% on knsq16, and upped max object size in memory from 75 to 100 kB
  • 10:41 mark: apt-get dist-upgrade on searchidx1
  • 10:29 mark: Rebooting searchidx1
  • 09:40 rainman-sr: zombie java process 1666 on searchidx1 locked in i/o and taking up lots of ram, cannot kill it, searchidx1 needs restart

May 21

  • 19:09 Fred: restarted squid process on sq43 as it was not responding properly.
  • 14:08 Tim: updated Switch master docs
  • 13:15 domas: db12 RAID set to no-battery write-behind mode: arcconf SETCACHE 1 LOGICALDRIVE 0 WB noprompt
  • 13:06 Tim: master switch apparently worked perfectly, in and out of read only mode in like 15 seconds
  • 12:54 logmsgbot: tstarling synchronized php-1.5/db.php
  • 12:54 logmsgbot: tstarling synchronized php-1.5/db.php
  • 12:53 logmsgbot: tstarling synchronized php-1.5/db.php 'setup'
  • 12:50 Tim: attempting master switch from db12 to db16 (s1/enwiki) using new switch script
  • 12:44 domas: db12 RAID controller needs battery replacement: http://p.defau.lt/?XsaDOS3KdZHeCO9VzypUnw
  • 04:46 Tim: added some more singtel proxies to the XFF list
  • 04:45 logmsgbot: tstarling synchronized php-1.5/extensions/TrustedXFF/trusted-xff.cdb
  • 04:45 logmsgbot: tstarling synchronized php-1.5/extensions/TrustedXFF/trusted-hosts.txt

May 20

  • 21:17 tomaszf: restarted wikitech db after InnoDB crash
  • 21:12 Fred: updated all image scaler boxes (srv[43-47,100]) to wikimedia-task-scaler (1.6)
  • 15:50 Fred: restarted apache on srv224 to bring the load back down from 10.
  • 02:43 Tim: patched in r50175 to stop Special:RevisionDelete timing out on pages with lots of inbound links
  • 02:40 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialRevisiondelete.php
  • 02:06 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialRevisiondelete.php

May 19

  • 23:30 mark: Created ldap/nis/puppet account for ariel
  • 15:45 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'fixing abusefilter settings for zh_yuewiki'
  • 13:36 Tim: restarted segfaulting apaches on srv212, srv222, srv184
  • 06:56 Tim: set cr_status='new' for 1500 recent revisions in code_rev

May 18

  • 23:20 logmsgbot: robh synchronized php-1.5/CommonSettings.php 'updating due to my split of abusefilter configuration into its own file'
  • 23:20 logmsgbot: robh synchronized php-1.5/abusefilter.php 'splitting configurations into smaller specific files to save my sanity'
  • 22:51 logmsgbot: robh synchronized php-1.5/CommonSettings.php '18589 Enable AbuseFilter on Cantonese Wikipedia'
  • 22:51 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18589 Enable AbuseFilter on Cantonese Wikipedia'
  • 22:23 logmsgbot: robh synchronized php-1.5/CommonSettings.php 'typos rock even more'
  • 22:23 logmsgbot: robh synchronized php-1.5/CommonSettings.php 'typos rock'
  • 22:21 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'abusefilter for nlwiki'
  • 22:21 logmsgbot: robh synchronized php-1.5/CommonSettings.php 'abusefilter for nlwiki'
  • 22:04 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'abusefilter nlwiki changes'
  • 22:00 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Adding AppleTouch Icon for Usability'
  • 21:49 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Adding more details for nlwiki abusefilter roles'
  • 21:28 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Finish setup for nlwiki and nlwikibooks to have AbuseFilter'
  • 21:06 logmsgbot: fvassard synchronized php-1.5/CommonSettings.php 'Enabling CommunityVoice extension for Usabilitywiki'
  • 21:06 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Enabling CommunityVoice extension for Usabilitywiki'
  • 18:54 Rob: replaced dead disks in db30 and db19
  • 18:14 logmsgbot: midom synchronized php-1.5/db.php 'rob is a polar bear'
  • 18:11 logmsgbot: midom synchronized php-1.5/db.php
  • 17:17 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'abusefilter on nlwiki'
  • 15:28 logmsgbot: midom synchronized php-1.5/mc-pmtpa.php
  • 14:46 Rob: sync-common-all after bug 18421 Update config of FlaggedRevs for en.wikibooks
  • 11:03 domas: showed pediapress people how to find large unlinked files :)

May 17

  • 20:14 domas: cleaned up IPC semaphores, restarted apache on srv55

May 15

  • 17:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'enabling abusefilter on nlwikibooks per bug 18615'
  • 16:47 Rob: moved usabilitywiki upload from linode server to cluster, all files are now accessible
  • 16:36 Fred: rebooting srv159 as it is hard down (or close to it)
  • 15:30 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18781 Change SITENAME and add namespace alias on Ukrainian Wikiquote'

May 14

  • 21:04 Rob: ran cleanupTitles.php against huwiki, mtwiki, & barwiki
  • 19:56 Rob: reinstalling srv217 just in case its issues are software related (but prolly are not)
  • 15:41 Rob: srv56 online
  • 15:24 Rob: reinstalling srv56
  • 15:15 Rob: db2 reinstalled
  • 15:07 Rob: had to restart the wikitech server again
  • 14:53 Rob: srv42 reinstall done, leaving setup for mark and puppetification
  • 14:51 Rob: srv42 reinstalled, installing packages for apache use
  • 14:31 Rob: taking down srv42 for reinstall
  • 14:27 Rob: replaced disk in adler, reinstalling
  • 11:07 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'enabled wgBlockAllowsUTEdit on jawiki'

May 13

  • 21:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17464 Add Portal namespace for arzwiki'
  • 21:00 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 16290 Creation of namespace Portal at bar.wikipedia.org'
  • 20:28 Rob: took down srv217 cuz its fubar, will troubleshoot in DC tomorrow
  • 19:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 11112 Install DynamicPageList on Incubator'
  • 19:23 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 16254 enable for simple.wikiquote.org'
  • 19:18 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 16523 Creating portal namespace for et.wikipedia.org'
  • 18:46 Rob: kicked srv217 back into service, with a note to test hardware later
  • 18:28 mark: Ran apt-get dist-upgrade on srv143 and rebooted it, to get it back into shape
  • 18:10 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Added Logo path definition for usabilitywiki'
  • 17:51 Rob: pulled some apaches that were not getting config updates out of the cluster, the bad redirects should resolve now
  • 17:45 Rob: pushed out the config for apache to the cluster, now checking to ensure any failed syncs are NOT pooled.
  • 17:41 logmsgbot: fvassard synchronized php-1.5/InitialiseSettings.php 'Bug 11488 Fix namespace names in the Hungarian localization'
  • 15:43 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18717 namespace addition for nowiktionary'
  • 15:18 Rob: ran rebuildrecentchanges.php against wikimania2010wiki
  • 15:17 Rob: exported wm2009 mediawiki namespace pages and imported them into wm2010 wiki per initial bugzilla request 18740
  • 15:04 Rob: deleted the excess and unwanted imported pages on wm2010, thanks to Casey for compiling the list \o/
  • 14:51 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'Gave the accountcreator group on enwiki tboverride rights, so that they can create otherwise disallowed account names.'
  • 14:47 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18643 Create Thesaurus namespace on Icelandic Wiktionary'
  • 11:42 logmsgbot: midom synchronized php-1.5/db.php 'db26 serving as dump source'
  • 11:40 logmsgbot: midom synchronized php-1.5/db.php
  • 10:51 domas: same disks on db30 went offline again

May 12

  • 20:46 mark: Rerouted traffic from pmtpa to esams via 6939 / 16150, by prepending 16265 3 times on csw1-esams
  • 20:31 mark: Moved esams text LVS to mint
  • 20:22 mark: Mark is making the network all awesome and flawless, for all my fans in #wikipedia-nl
  • 20:20 mark: Brought BGP session to AS13680 back up, only made it worse
  • 20:17 mark: Shutdown BGP session to 13680, we may be saturating that network and therefore experiencing packet loss
  • 20:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'taking out some of the wikimania2010wiki settings that I am not sure about, to see if it's causing an issue.'
  • 19:39 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'copying a bunch of settings from wikimania2009wiki to wikimania2010wiki'
  • 19:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
  • 19:34 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
  • 19:32 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Logo update for iowiktionary'
  • 19:30 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'temp update cuz i need to push an upload to a wiki that normally doesnt allow them.'
  • 19:28 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'updated logo for ukwikiquote'
  • 19:25 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'updated logo for ukwikibooks'
  • 19:23 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'logo update for arzwiki'
  • 19:10 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'mkwikisource logo'
  • 18:57 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'I think this should turn on abuse filter for usability wiki...'
  • 18:52 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18517 Enable the Collection extension for creating books on the Alemannic Wikipedia [alswiki]'
  • 18:37 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18430 Enable Collection extension on id.wikipedia.org'
  • 17:09 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
  • 17:08 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Rob needs to pay closer attention when he is frustrated =P'
  • 17:03 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'forgot to set the server for wikimania2010'
  • 16:58 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Turned on PDF collection on ptwikiversity'
  • 16:52 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'for wikimania2010 wiki'
  • 16:19 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php
  • 16:18 Rob: depooled puppet apaches so i can make site changes
  • 16:05 Rob: messing with configs due to puppet and such
  • 13:48 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18519 Please enable transwiki-imports to English Wiktionary from other Wiktionaries.'
  • 01:02 tomaszf: restarting mysqld on db9 to get rid of ram disk

May 11

  • 21:40 mark: Switched srv32, srv33 puppetmaster to sockpuppet
  • 21:13 mark: Puppetified srv121, srv122 and srv123 and installed them as app servers
  • 20:40 brion: making a note for the record that db7 (enwiki watchlist) is lagging sometimes. it's under extra load pulling a dump
  • 06:46 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialUndelete.php 'deploying r50470 to fix bug 18726 (double URL escaping)'
  • 04:12 Tim: starting recompressTracked for all wikis on hume
  • 03:57 Tim: stopped old ES slaves on srv172, srv173, srv184, srv185
  • 03:53 Tim: cleaned up relay logs on db25

May 10

  • 14:06 domas: configured pageview count aggregation on locke, shipment not done yet

May 9

  • 13:30 brion: rebooting usability.wikimedia.org, it got more thoroughly stuck while attempting to restart apache
  • 13:20 brion: poking usability.wikimedia.org ...
  • 04:15 Andrew: usability.wikimedia.org seems to be down

May 8

  • 20:32 logmsgbot: tfinc synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php
  • 18:59 logmsgbot: tfinc synchronized php-1.5/extensions/WikimediaMessages/WikimediaMessages.i18n.php
  • 08:54 Tim: amane's root partition filled up due to the cp running in a root screen, copying from NFS to an unmounted mount point /mnt/big-disk. Moving some stuff to the real mount point /mnt/scratch (in another screen)
  • 08:25 Tim: deploying r48837 and r48911 to fix bug 18171 (broken oldimage parameter)
  • 06:11 logmsgbot: midom synchronized php-1.5/db.php 'got to get coffee'
  • 05:56 logmsgbot: midom synchronized php-1.5/db.php
  • 05:55 logmsgbot: midom synchronized php-1.5/db.php
  • 00:08 brion: db12 no longer overloaded with 'too many connections'. very mysterious
  • 00:04 brion: db12
  • 00:04 brion: db errs on en. poking...

May 7

  • 21:36 domas: db30 disks are shown online, array degraded after 'arcconf rescan', not sure what that means
  • 21:33 domas: db19 disk error counts: http://p.defau.lt/?pBtD7HBx1O6IboeB9VgINg (one disk just failed a few times entirely, the other gets lots of aborts/medium errors, might be related)
  • 21:21 domas: db30.mgmt needs reset (facilitated by physical movements of power cord)
  • 21:03 domas: db30 has _second_ disk death
  • 21:00 domas: db28 FUBAR information: http://p.defau.lt/?EZH6Bg4GwYJDJ4hG3OIJYQ
  • 20:46 domas: db28 fb0.fm1.f1.speed is flapping between 0 and 21100. needs datacenter inspection and/or vendor service.
  • 19:50 domas: db19 has corrupted ibdata, depooling
  • 19:47 domas: bad disk on db19 actually made I/Os time out, thus corrupting relay logs, reset slave seems to have helped.
  • 19:36 logmsgbot: midom synchronized php-1.5/db.php 'db25 needs some load'
  • 18:46 domas: db19 drive failed, needs replacement (you hear, Rob?! :)
  • 17:35 domas: added retry=1 to ProxyPass for secure.wikimedia apaches backend
  • 16:56 domas: enabling mod_deflate (bottom of main.conf) on apaches
  • 16:50 domas: added new singtel subnet to trusted xff
  • 16:50 logmsgbot: midom synchronized php-1.5/extensions/TrustedXFF/trusted-xff.cdb
  • 07:45 domas: reset slave on db18
  • 02:40 david: added localhost IPs (v4 and v6) to relay_from_hosts in exim4.conf on grosley
  • 02:40 david: Truncated /var/log/exim4/paniclog on grosley, which had an old configuration syntax error notice in it
  • 00:00 brion: added an 'editor' group to wikitech so we don't have to make all users sysops to edit until we get round to culling the abuse accounts :)

May 6

  • 18:48 Fred: made a couple of changes to the Bayes processing scripts so that they support people moving the Bayes folders around. Wikitech updated.

May 5

  • 21:41 mark: Started Apache on srv32/srv33
  • 19:11 mark: Stopping Squid processes on yaseo servers
  • 13:58 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18618 enabling collections on a couple of wikis'
  • 00:34 tomaszf: upping srv225 to 8 active running dumps

May 4

  • 18:43 tomaszf: spawning 5 extra dumps from srv225 to see throughput of system
  • 18:43 tomaszf: depooling srv225 from apache node list and adding it as a dumps workers box
  • 10:01 Tim: set max_connections on ms2 to 2000 using SET GLOBAL, to match the value in /etc/mysql/my.cnf

May 3

May 2

  • 19:16 rainman___: search3 mysteriously rebooted around midnight, starting lsearchd on it now
  • 05:45 Andrew: reports that Special:Log is broken because it's hitting MAX_JOIN_SIZE
  • 00:16 tomaszf: forcing 644 on dumps using 7za on srv31 until ubuntu bug # 370618 is resolved

May 1

  • 19:43 mark: Unshut peering with AS 2529 on br1-knams
  • 19:11 tomaszf: added php normalize library to srv31. running a couple batch dumps to test functionality.
  • 19:00 tomasz: xml back up jobs went haywire last night due to missing libs. time to fix ..
  • 17:27 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'updated for usability wiki'
  • 16:49 Rob: upgraded limesurvey to newest version
  • 16:04 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Allow bureaucrats to set/unset inactive right on private wikis -- private request from cary'
  • 13:14 mark: Installed server sockpuppet
  • 12:59 logmsgbot: andrew synchronized php-1.5/includes/ChangeTags.php 'Live-merged r50104 -- escaping for classes applied to change tags'
  • 05:47 logmsgbot: tstarling synchronized php-1.5/includes/UserMailer.php 'merged r49682'
  • 05:36 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Re-activated tag filtering with the live patch to ChangeTags.php'
  • 05:32 logmsgbot: andrew synchronized php-1.5/includes/ChangeTags.php 'Index has been renamed since it was created on Wikimedia'
  • 05:13 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
  • 05:10 Tim: $wgUseTagFilter=true experimentally
  • 05:10 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
  • 04:54 logmsgbot: tstarling synchronized php-1.5/includes/ChangeTags.php
  • 04:53 logmsgbot: tstarling synchronized php-1.5/includes/ChangeTags.php
  • 04:50 Tim: merging r49068 and r49086

April 30

  • 17:10 Rob: srv143 back online
  • 17:07 Rob: all memcached back online
  • 17:07 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'swapped out srv142'
  • 17:06 Rob: srv143 locked up, restarting
  • 17:05 Rob: srv142 reinstalling
  • 16:52 Rob: srv31 setup and good to go back to tomasz
  • 16:48 Rob: srv31 reinstalled, installing wikimedia-task-appserver package but NOT pooling.
  • 16:39 Rob: srv81 back online
  • 16:25 Rob: upgrading srv31 to ubuntu
  • 16:10 Rob: reinstalling srv81
  • 16:08 Rob: srv130 back online
  • 15:57 domas: db30 has drive failure, needs replacement
  • 15:41 Rob: upgrading srv124 to ubuntu
  • 15:30 Rob: srv127 was readonly, restarted, fsck, back online
  • 15:25 Rob: upgrading srv137 to ubuntu
  • 13:29 river: upgraded ms4/ms6 to solaris 10 update 7
  • 02:34 Tim: reset slave on db3
  • 02:28 Tim: updated /root/.ssh/authorized_keys on all machines identified with a pingscan that allowed a login with nagios's key. Revoked access for nagios, jeronim and kyle.

April 29

  • 21:32 logmsgbot: brion synchronized php-1.5/includes/specials/SpecialExport.php 'merging r50054 fix for recursive depth export'
  • 21:23 Rob: ran namespaceDupes script against mtwiki once the new portal namespaces were created.
  • 21:22 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18498, adding portal and portal talk namespaces'
  • 21:13 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18498, adding metanamespace_talk for mtwiki'
  • 21:12 brion: set up system administrators global group with export depth override right so Trevor can test the batch export
  • 20:49 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18237 enable autopatrolling and improve patrolling user rights on itwiktionary'
  • 19:05 Rob: DHCP services stopped on zwinger and started on khaldun. Khaldun is now the dhcp server as well as the installation server.
  • 14:53 Rob: restarted wikitech and manually ran morebots upon reboot.
  • 04:07 Tim: doing some network scanning to make sure our host lists are up to date
  • 02:15 tomaszf: moving rsync test to ms1 per Tim.
  • 02:36 Tim: removed all remaining obsolete by_ssh* checks from the nagios configuration
  • 02:27 Tim: installed NRPE on amane and adjusted nagios configurator
  • 01:54 tomaszf: testing commons upload of top level storage directory on zwinger to offsite backup.
  • 01:38 Tim: fixed the mediawiki installation on amane: installed wikimedia-task-appserver, disabled apache, ran sync-common, added to ganglia

April 28

  • 18:02 Rob: futzing around with moving dhcp, taking srv209 as my guineapig.
  • 10:58 Tim: re-added srv31 to mediawiki-installation node group, backup task was rogue and generating "missing cluster" exceptions
  • 10:21 logmsgbot: tstarling synchronized php-1.5/includes/ExternalStoreDB.php
  • 10:19 Tim: re-added srv57 to mediawiki-installation, was rogue and causing "unknown cluster" errors
  • 07:59 logmsgbot: tstarling synchronized php-1.5/db.php 'set the new cluster22 to be the sole ES write destination'
  • 07:57 Tim: pdns on bayle is broken, stuck in futex, restarting
  • 07:52 logmsgbot: tstarling synchronized php-1.5/db.php
  • 07:49 logmsgbot: tstarling synchronized php-1.5/db.php 'introducing cluster22 (ms3/ms2)'
  • 07:43 Tim: adding tables called blobs_cluster22 to ms3, for new current text cluster
  • 07:30 Tim: fixed /etc/mysql/debian.cnf on ms3 so that logrotate flush logs can work
  • 02:09 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Rolling out tor changes'
  • 02:07 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Rolling out tor changes, and ipblock-exempt on all wikis'
  • 01:48 Andrew: Updating configuration to change tor settings.

April 27

  • 23:42 logmsgbot: tstarling synchronized php-1.5/db.php 'gave the current ES masters some read load'
  • 23:05 Tim: increased connection limit on temp-es* from 100 to 500
  • 18:31 Rob: srv138, srv139, & srv145 reinstalled and online.
  • 18:24 brion: stopped apache and umounted amane from srv184 (ES slave). load is way overloaded for some reason on this box
  • 18:24 Rob: removed amane from mounts on srv184
  • 18:01 Rob: srv145 reinstalling
  • 17:58 Rob: some quirky stuff going on from various memcached hosts being reinstalled and such. Issues seem to be resolved now.
  • 17:55 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removing reinstalling servers'
  • 17:54 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removing reinstalling servers'
  • 17:43 Rob: srv129 back online
  • 17:43 Rob: reinstalling srv138 and srv139
  • 17:24 Rob: srv126 up and online
  • 17:11 Rob: srv126 and srv129 being reinstalled.
  • 17:09 Rob: srv86 and srv87 up and online
  • 16:49 Rob: srv86 and srv87 upgrading to ubuntu
  • 16:42 Rob: srv107 online
  • 16:38 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'Removing srv120-srv123 for other testing'
  • 16:35 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removing srv156'
  • 16:22 Rob: srv120-srv123 reinstalled, NOT online. Base OS, nothing else, passed on to mark for his testing. (Puppet I assume.)
  • 15:48 Rob: srv120-123 going down for reinstallation
  • 15:45 Rob: srv108 and srv109 up and online
  • 15:06 Rob: srv108 and srv109 are in mid-install for ubuntu
  • 15:06 Rob: srv107 won't restart for some reason, adding to tasks to troubleshoot.
  • 15:04 Rob: srv105 and srv106 back up and online
  • 14:56 Rob: srv107-srv109 goin down
  • 14:54 Rob: srv104 back online
  • 14:48 Rob: srv102 and srv103 back up and online
  • 14:43 Rob: srv102-106 reinstalling.
  • 14:29 Rob: srv53 has a bad fan, shutting down until it's replaced.
  • 14:20 Rob: srv102-srv109 being upgraded to ubuntu.
  • 11:42 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Updated $wgSitename for ukwikimedia in accordance with IRC request from Michael Peel, a board member'
  • 02:20 Tim: srv53 down, took it out of memcached rotation. Updating the memcached spare list.
  • 02:20 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php
  • 02:12 Tim: fixed rc1 slaves, broken by expire_logs_days on ms3
  • 01:59 Tim: Shut down srv217 for maintenance. Similar timer interrupt issue observed as before: select() syscalls running indefinitely despite a short timeout specified.
  • 01:53 logmsgbot: tstarling synchronized php-1.5/db.php
  • 01:52 Tim: repooled ms3 rc1 instance
  • 01:49 Tim: reset slave on db21, was running out of disk space due to relay logs
  • 01:42 Tim: fixed nagios for srv99, still had its apache check command set to my CGI security vulnerability demonstration, permanently saved in retention.dat despite config changes
  • 01:17 Tim: enabled apport on srv99, to see if I can track down the nagios flapping
  • 00:52 Tim: restarted trackBlobs.php

April 25

  • 23:31 Tim-away: experimentally stopping replication on db3 to check disk load
  • 22:51 logmsgbot: tstarling synchronized php-1.5/db.php 'reduced load on db3'
  • 18:50 mark: Killed long-running SQL query TrackBlobs::trackRevisions query from hume causing db3 to lag heavily
  • 17:22 mark: Stopped Apaches on srv32/srv33 again, as syncs will fail in most cases
  • 16:36 mark: Started /home-less apache on srv33
  • 13:23 mark: Started /home-less apache on srv32
  • 11:03 mark: Kicked srv99 back into submission
  • 10:56 mark: Squid-blocked high-rate scraper which was overloading ES
  • 05:30 Tim-away: fixed conflict markers in extensions/CentralNotice/SpecialNoticeText.php and resynced.
  • 05:30 logmsgbot: tstarling synchronized php-1.5/extensions/CentralNotice/SpecialNoticeText.php

April 24

  • 22:23 rainman__: search back up on all wikis
  • 22:17 logmsgbot: root synchronized php-1.5/lucene.php 'Replacement for reinstalled srv58'
  • 22:15 logmsgbot: brion synchronized php-1.5/secure.php 'fix for thumbs on private ssl access (bug 18475 etc)'
  • 21:19 rainman_: srv58 dead, breaking search on all non-major wikis, transferring the service to search11/12....
  • 19:50 Rob: srv90-srv99 ganglia installed.
  • 19:50 Rob: srv97 online
  • 19:47 Rob: srv98 online
  • 19:46 Rob: srv96 online
  • 19:45 Rob: srv99 online
  • 19:42 Rob: srv95 online
  • 19:40 Rob: srv92, srv93, and srv94 back online
  • 19:39 Rob: srv91 back online
  • 19:24 Rob: srv90 online
  • 19:16 Rob: srv90-srv99 reinstalled, currently looping through package installation
  • 18:34 mark: Fixed ganglia by installing the appropriate config files on the (reinstalled) aggregation hosts
  • 18:27 Rob: installed ganglia on all servers reinstalled to ubuntu apache thus far today.
  • 18:27 Rob: srv89 back online
  • 18:17 Rob: srv90-srv99 will be down over the next 30 minutes for ubuntufication.
  • 18:16 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'some spares were actually down'
  • 18:14 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removed the 9x servers for reinstallation'
  • 18:02 Rob: srv84 ubuntufied and online
  • 17:58 Rob: srv83 ubuntufied and online
  • 17:54 Rob: srv82 ubuntufied and online
  • 17:50 Rob: srv81 reinstalled and online
  • 17:47 Rob: srv89 coming down for reinstall
  • 17:44 Rob: srv58 online
  • 17:38 Rob: srv57 online
  • 17:26 Rob: reinstalling srv58
  • 17:16 mark: Set up switchport for srv57 on asw-c4-pmtpa
  • 17:10 Rob: reinstalling srv57
  • 17:08 Rob: srv75 back online
  • 17:02 Rob: srv74 back online
  • 16:47 Rob: srv73 back online as apache
  • 16:45 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'removed srv75'
  • 16:43 Rob: srv71, srv72 back online as apache
  • 16:42 Rob: taking down srv75 for reinstall
  • 16:37 logmsgbot: fvassard synchronized php-1.5/mc-pmtpa.php 'swapping out srv72 for srv100 and srv73 for srv101 while srv[72,73] are being ubuntified'
  • 16:26 Rob: srv72, srv73, and srv74 down for reinstallation
  • 16:23 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out srv71 for srv70 and srv74 for srv92 while srv[71,74] are being ubuntified'
  • 16:05 Rob: srv34 back online reinstalled as ubuntu
  • 16:04 Rob: reinstalling srv71
  • 16:04 Fred: restarted apache on srv99
  • 15:21 Rob: srv34 coming down for reinstall
  • 15:13 Rob: amane reinstalled for tomasz
  • 14:59 Rob: amane reinstall started
  • 14:36 rainman-sr: search9,10 also up; everything should be normal again
  • 14:33 Rob: amane shutting down for RAID controller work
  • 14:27 rainman-sr: search5-8 back in search pool
  • 14:16 Rob: shutting down search9 & search10 for memory upgrade
  • 14:15 Rob: search7 & search8 memory upgraded, systems rebooted
  • 14:07 Rob: search5 and search6 back online.
  • 14:05 Rob: memory upgrade complete on search5 & search6, rebooted.
  • 14:02 rainman-sr: done with initial index warmup on search3,4, back in rotation
  • 13:59 Rob: search5, search6 shutdown for memory upgrade
  • 13:58 Rob: search4 memory upgraded and system back online
  • 13:55 Rob: search3 ram upgraded and system is back online
  • 13:50 Rob: search3 upgraded, rebooting.
  • 13:44 Rob: shutdown search3 & search4 for memory upgrades
  • 07:18 logmsgbot: tstarling synchronized php-1.5/db.php
  • 03:35 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php
  • 03:35 Andrew: Deployed AbuseFilter to fiwiki
  • 02:51 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php
  • 02:46 Tim: srv127 has corrupted root partition, needs reinstall or repair. Shut down with echo o > /proc/sysrq-trigger.
  • 02:36 logmsgbot: tstarling synchronized php-1.5/mc-pmtpa.php
  • 02:31 Tim: killed srv124 with /proc/sysrq-trigger. Was very slow on ssh and was giving odd 403 errors via HTTP.
  • 02:21 logmsgbot: tstarling synchronized php-1.5/README
  • 02:12 Andrew: Updated ruwiki abuse filter configuration per bugzilla request.
  • 02:12 logmsgbot: andrew synchronized php-1.5/CommonSettings.php
  • 02:10 Andrew: srv127: rsync: mkstemp "/apache/common/php-1.5/.CommonSettings.php.TRNqkG" failed: Read-only file system (30)
  • 01:15 logmsgbot: tstarling synchronized php-1.5/db.php
  • 01:14 Tim: depooled db3 so that it can finish doing the querycache update without making lots of people wait for a MASTER_POS_WAIT
  • 01:03 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
  • 01:03 Tim: blacklisted Wantedtemplates on enwiki, has been running for more than a day.
  • 00:54 Tim: restarting trackBlobs.php on hume for afwiki and enwiki

April 23

  • 19:05 brion: donate.wikipedia.org redirect borked, going to civicrm instead of public donation pages. server config needs updating
  • 16:54 brion: db3 was lagging a bit; 403s a few minutes ago. catching up nicely now
    • Note this is from Wantedtemplates recache job
  • 14:46 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Added namespaces to huwikisource per bug 18557'
  • 14:41 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialUpload.php
  • 14:39 Tim: merged r49775
  • 14:32 logmsgbot: tstarling synchronized php-1.5/includes/specials/SpecialUpload.php
  • 14:31 Tim: merged r49051
  • 14:13 Tim: fixed nagios labels for esams backup ext store, erroneously labelled as "toolserver"
  • 06:27 Tim: restarted all job runners, ES connection errors weren't killing them
  • 05:43 Tim: shutting down mysql on all fedora ES servers. Will update documentation and node lists to indicate that this is permanent.
  • 05:37 Tim: srv217 did not come up from a soft reboot, but power cycle worked. Before reboot, observed apache2 hanging indefinitely on nanosleep(), but couldn't reproduce a timer issue in other processes. An NFS mount was hanging on stat.
  • 05:13 Tim: rebooting srv217
  • 04:41 Tim: srv217 is hanging on various operations, investigating. Trying to shut down its apache.
  • 04:35 logmsgbot: tstarling synchronized php-1.5/db.php
  • 04:31 Tim: copy done, started cluster18 mysql instance on ms3 using srv104 snapshot, repooled it
  • 02:07 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
  • 01:57 Tim: relaxed wgAccountCreationThrottle on frwiki, presumably the 2006 vandal emergency is over. Disabled it on idwiki for workshop event.
  • 01:45 Tim: copying srv104's data from ms3 to ms2
  • 01:11 Tim: started mysql on srv104

April 22

  • 21:44 tomaszf: db9 is back up. excessive tmpfs file systems removed
  • 21:39 tomaszf: taking outage on db9 to remove tmpfs file systems
  • 11:34 JeLuF: initiated reboot of srv137. dmesg shows no usable information any more.
  • 11:30 JeLuF: srv137 has read-only filesystem. Stopped Apache.
  • 06:03 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php 'Live-merged r49730, typo causing failures in user hiding'
  • 06:02 Andrew: srv137 still seems read-only, srv137: rsync: mkstemp "/apache/common/php-1.5/includes/specials/.SpecialBlockip.php.1QkrKX" failed: Read-only file system (30)
  • 03:14 Tim: copying ES data from srv104 to ms3 using nc tarpipe
  • 03:10 logmsgbot: tstarling synchronized php-1.5/db.php 'depooling srv104 ES'
  • 03:03 Tim: corruption found on cluster18, the copy source server (srv106) is missing lots of rows. Switched back to srv105/104.
  • 03:02 logmsgbot: tstarling synchronized php-1.5/db.php
  • 02:50 logmsgbot: tstarling synchronized php-1.5/includes/Revision.php 'reverted profiling and logging hacks'
  • 02:40 Tim: depooled ms2 ex-fedora instances and shut them down, it can be a backup for now
  • 02:38 logmsgbot: tstarling synchronized php-1.5/db.php
  • 02:33 Tim: deployed the new ms2/ms3 ex-fedora ES configuration
  • 02:32 logmsgbot: tstarling synchronized php-1.5/db.php
  • 02:04 tomaszf: updated CentralNotice to skip over bad messages when generating js.
  • 02:01 Tim: set up ex-fedora mysql instances on both ms2 and ms3, controlled with /etc/init.d/mysql-ex-fedora
  • 01:04 Tim: changed the main mysql instance on ms3 (rc1) to bind to a single IP address instead of *

April 21

  • 19:41 mark: Added grosley.wikimedia.org to local_domains list on grosley's exim.conf, and added appropriate aliases in /etc/aliases
  • 16:35 Andrew: Re-ran rebuildTemplates.php, all seems well now
  • 16:30 logmsgbot: robh synchronized php-1.5/mc-pmtpa.php 'syncing for fred'
  • 16:30 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out srv88 for srv159 and srv90 for srv198'
  • 16:29 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php 'Switched srv88 for srv159, srv90 for srv198 to fix down memcache nodes'
  • 16:18 azafred: restarted memcached on srv96. Now responding.
  • 16:14 Rob: Fred needs to start logging in as Fred and not as root, bad fred (see it wasnt me this time, bwahahahahahaa)
  • 16:11 Andrew: Fred fixed up some memcached nodes, but no joy with rebuildTemplates
  • 16:10 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out down servers for active ones'
  • 16:09 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out down servers for active ones'
  • 16:01 Rob: srv137 read only, depooled in pybal for apache and rebooting.
  • 15:57 logmsgbot: root synchronized php-1.5/mc-pmtpa.php 'swapping out down servers for active ones'
  • 14:34 Andrew: rebuildTemplates.php appeared not to help, same problem as before (stopped after a few wikis). Possibly a dodgy memcache node.
  • 14:32 Andrew: ran rebuildTemplates.php metawiki due to reports of <messagename> appearing in place of the central notice.
  • 05:04 Andrew: Live-merged r49685, fix for unsuppression of usernames on unblock -- some usernames were left stuck suppressed if they were unblocked when the block suppressed their username
  • 05:03 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php
  • 05:03 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialIpblocklist.php
  • 01:34 azafred: Made some improvements to spam handling. Bayes is in play and can learn from everybody what is spam and what is ham. Documentation to follow.

April 20

  • 19:59 Rob: Powering down srv67, srv85, srv88, srv90 due to temp warnings and bad fans.
  • 19:36 Rob: updated mc-pmtpa.php to reflect the status of down or spare for the memcached servers. (lots more spares now)
  • 17:35 azafred: restarted apache on srv217
  • 17:34 azafred: srv125 reinstall completed.
  • 17:24 Rob: srv146 back online
  • 17:10 Rob: srv131 back up, updated and synced.
  • 16:52 azafred: srv118 reinstall completed.
  • 16:52 Rob: srv127 back online and synced.
  • 16:41 Rob: srv125 reinstalled, passing off to fred
  • 16:40 Rob: replaced dead disk in sq26
  • 16:31 Rob: shutting down sq26 to replace bad hdd
  • 16:27 Rob: reinstalling srv125
  • 16:13 azafred: finished re-install of srv63.
  • 16:11 Rob: reinstalled srv118, handed off to fred for completion
  • 16:01 Rob: restarted srv118 and reinstalled it
  • 15:57 Rob: restarted a locked up srv110 and synced it.
  • 15:49 Rob: srv81 locked up, fixed, synced and online
  • 15:29 Rob: replaced fan and drive in srv63, reinstalling
  • 14:36 Rob: memory replaced in srv203, back online.
  • 14:11 Rob: shutting down srv203 to swap out bad memory
  • 05:12 Tim: fixed memcached on srv75, stopped old ES slave on srv102, srv106, srv107, srv159, srv171

April 18

  • 14:05 Tim: unblocked 80legs, they promised to be nice
  • 13:56 logmsgbot: tstarling synchronized robots.txt
  • 05:26 azafred: rebooted db20 after / ran out of space and started causing all kinds of issues.

April 17

  • 22:49 brion: regenerated centralnotice output again... this time ok
  • 22:48 brion: srv93 and srv107 memcached nodes are running but broken. restarting them...
  • 22:43 brion: restarted srv82 memcache node. attempting to rebuild centralnotices...
  • 22:41 brion: bad memcached node srv82
  • 22:05 mark: Set up 3 new pywikipedia mailing lists, redirected svn commit output to one of them
  • 19:38 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18494 Logo for ln.wiki'
  • 17:22 Rob: removed wikimedia.se from our nameservers as they are using their own.
  • 16:48 azafred: updated spamassassin rules on lily to include the SARE rules and mirror the settings on McHenry.
  • 10:25 logmsgbot: tstarling synchronized robots.txt
  • 08:19 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php
  • 07:13 Tim: temporarily killed apache on overloaded ES masters
  • 07:11 logmsgbot: tstarling synchronized php-1.5/db.php 'zeroing read load on ES masters'
  • 06:04 Tim: brief site-wide outage while it rebooted, reason unknown. All good now. Resuming logrotate.
  • 05:55 Tim: db20 h/w reboot
  • 05:48 Tim: shutting down daemons on db20 for pre-emptive reboot. Serial console shows "BUG: soft lockup - CPU#4 stuck for 11s! [rsync:27854]" etc.
  • 05:10 Tim: on db20: killed logrotate -f half done due to alarming kswapd CPU (linked to deadlocked rsync processes). May need a reboot.
  • 05:00 Tim: fixed logrotate on db20, broken since March 10 due to broken status file, most likely due to non-ASCII filenames generated by demux.py. Patched demux.py. Removed everything.log.
  • 02:14 river: set up ms6.esams, copying /export/upload from ms1
  • 00:24 Tim: blocked lots of uci.edu IPs that were collectively doing 20 req/s of expensive API queries, overloading ES
  • 00:15 brion: techblog post on Phorm opt-out is linked from slashdot; load on singer seems fairly stable.

April 16

  • 23:06 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php
  • 22:48 azafred: bounced apache on srv217. All threads were DED - dead
  • 22:16 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php
  • 22:08 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php
  • 17:41 domas: fantastic. I start _looking_ at stuff and it fixes itself.
  • 17:35 logmsgbot: midom synchronized php-1.5/includes/Revision.php 'live profiling hook'
  • 17:28 domas: db20 has kswapd deadlock, needs reboot soonish
  • 17:18 logmsgbot: midom synchronized php-1.5/InitialiseSettings.php 'disabled stats'
  • 17:15 logmsgbot: midom synchronized php-1.5/InitialiseSettings.php 'enabling udp stats'
  • 16:18 azafred: bounced apache on srv217 (no pid file so previous restart did not include this one)
  • 15:57 brion: network borkage between Florida and Amsterdam. Visitors through AMS proxies can't reach sites.
  • 15:55 azafred: bounced apache on srv[73,86,88,93,108,114,139,141,154,181,194,204,213,99]
  • 15:52 Tim-away: started mysqld on srv98,srv122,srv124,srv142,srv106,srv107: done with them for now. srv102 still going.
  • 15:30 mark: Set up ms6 with SP management at ms6.ipmi.esams.wikimedia.org
  • 14:13 mark: Restoring traffic to Amsterdam cluster
  • 14:06 mark: Reloading csw1-esams
  • 13:55 mark: Reloading csw1-esams
  • 13:53 JeLuF: ms1 NFS issues again. Might be load related
  • 13:49 Tim: copying fedora ES data from ms3 to ms2
  • 13:44 JeLuF: ms1 is reachable, no errors logged, NFS daemons running fine. After some minutes, NFS clients were able to access the server again. Root cause unknown.
  • 13:38 JeLuF: ms1 issues. On NFS slaves: "ls: cannot access /mnt/upload5/: Input/output error"
  • 13:24 mark: DNS scenario knams-down for upcoming core switch reboot
  • 08:23 river: pdns on bayle crashed, bindbackend parser seems rather fragile
  • 03:01 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Deployed AbuseFilter to ptwiki'

April 15

  • 22:42 tomaszf: adding ramdisk to db9 to speed up create tmp tables
  • 22:34 mark: PowerDNS got confused by a commented DNS entry and broke zone wikimedia.org, fixed
  • 22:32 brion-codereview: DNS broken. mark's poking it
  • 22:24 mark: Temporarily removed AAAA record from mayflower in DNS
  • 22:14 brion-codereview: db9 tmpfs full, breaking anything using that db
  • 22:00 brion-codereview: ipv6 connectivity broken between isidore & mayflower, breaking codereview SVN updates
  • 20:59 brion: civicrm queries bogging down db9 affecting otrs performance. tom's looking into it
  • 18:24 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'for subpages on ukwikimedia'
  • 17:32 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17898 Wiktionary is a bad interwiki prefix on ukwiktionary and mlwiktionary'
  • 17:25 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'per bug 17773 Install Labeled Section Transclusion for dewikiversity'
  • 14:33 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17718 Disable CentralNotice on private/fishbowl wikis'
  • 14:29 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18434 Enable the rollback feature on Commons'
  • 14:19 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '18307 Add autopatrolled group to English Wikisource'
  • 14:12 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17717 Enable subpages on main namespace of UK chapter website'
  • 13:55 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18428 cswikisource settings updates'
  • 12:38 Tim: restarting copy to ms3
  • 12:25 Tim: rebooting ms3 with 2.6.28 kernel
  • 12:18 Tim: running xfs_check on ms3
  • 12:14 Tim: restarting ms2 with domas's 2.6.28 kernel
  • 12:06 logmsgbot: midom synchronized php-1.5/db.php 'removing db25 - apparently it was down for more than a day'
  • 11:58 domas: db25 went down, resetting
  • 11:08 Tim: ms3 went down, no response on serial console, rebooting
  • 11:05 logmsgbot: tstarling synchronized php-1.5/db.php
  • 08:32 Tim: copy in progress, rsync over ssh controlled via screen on tstarling@zwinger
  • 08:23 Tim: shutting down mysqld on srv98,srv122,srv124,srv142,srv102,srv106,srv107 for data directory copy to ms3

April 14

  • 23:48 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_body.php
  • 23:39 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_body.php
  • 23:37 logmsgbot: tfinc synchronized php-1.5/reporting-setup.php
  • 19:01 Rob: replaced dead drive in ms4
  • 18:41 Rob: srv78 back online
  • 18:37 Rob: srv78 was wonky and such, reinstalled to fix.
  • 18:21 Rob: srv90 reinstalled and redeployed
  • 18:21 Rob: memcached had stopped on srv89, restarted.
  • 18:16 Rob: all fans are good on srv86, bringing back online.
  • 18:13 Rob: srv86 has temp warnings, shutting down to check fans and such
  • 17:59 Rob: reinstalling srv90 from FC to ubuntu
  • 17:52 Rob: replaced bad fan in srv90
  • 17:40 Rob: pulling srv90, overheating warnings.
  • 17:38 Rob: srv85 overheating due to dead fans. server is old and out of warranty, decommissioned but kept on site for parting out.
  • 17:10 Rob: bringing back up sq1, no memory on hand for upgrading these (They are ddr pc3200, all the spare memory we have is ddr2 or ddr 2700)
  • 17:02 Rob: pulling sq1 for memory upgrade.
  • 16:55 Rob: replaced bad patch for search9, LOM functions properly.
  • 16:38 Rob: Upgraded memory in search1 and search2 to a total of 16GB each (previously 8).
  • 16:14 Rob: Had to restart wikitech due to OOM issues, again. Perhaps it is time to up the memory in the machine or tweak settings.
  • 05:05 Andrew: testwiki problem seems to be a squid problem, can get srv123, srv84 to serve the main page with no problems by sending a request through netcat. Trying to connect to rr just gets no response
  • 05:01 Andrew: testwiki seems fubar, timing out on all pageviews.
  • 03:47 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php
  • 03:45 Andrew: Installing AbuseFilter on alswiki

April 12

  • 15:22 logmsgbot: tstarling synchronized php-1.5/extensions/SecurePoll/SecurePoll.i18n.php
  • 12:25 Tim: updated CentralNotice templates manually to get the license vote in the header
  • 09:05 Tim: loading license update messages into SecurePoll jump side with sp-msgs-reduced.sql

April 11

  • 12:05 domas: restarting all ubuntu memcacheds, rolling 1.2.8-4 live
  • 08:31 domas: rebooted srv187 with all the new kernels and such

April 10

  • 08:51 domas: few memcacheds were hitting OOMs, I really have to upgrade them :)

April 9

  • 20:15 JeLuF: started "maintenance/importImages.php" upload of the second batch of Fotothek images to commons
  • 16:37 logmsgbot: tstarling synchronized php-1.5/extensions/SecurePoll/includes/Auth.php
  • 14:43 Tim: php_admin_flag engine on in the SecurePoll directory
  • 14:37 logmsgbot: tstarling synchronized php-1.5/extensions/SecurePoll/includes/VotePage.php
  • 13:55 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'enabling subpages on fiwikiversity'
  • 12:59 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 're-enabled SecurePoll'
  • 12:59 logmsgbot: tstarling synchronized php-1.5/extensions/SecurePoll/includes/Entity.php
  • 12:33 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'enabled SecurePoll'
  • 12:32 logmsgbot: tstarling synchronized php-1.5/CommonSettings.php 'enabled SecurePoll'
  • 09:04 mark: Restored Amsterdam traffic; problem resolved
  • 08:07 mark: Moving Amsterdam traffic to pmtpa while a power problem at esams is being investigated
  • 05:27 JeLuF: many hosts in esams not reachable any more. Switch outage?

April 8

  • 21:31 Tim: running voterList.php for all wikis on hume, to construct license update voter lists
  • 21:05 Rob: pushed blog.wikimedia.org dns back into the squid cluster
  • 12:13 Tim: installing SecurePoll on WM including IP.php r49117
  • 04:33 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php 'Comment for DOWN nodes that seem to be up'
  • 04:28 Andrew: Only two spare memcached nodes left. Checked all the nodes marked as down, and found that srv126, srv100, srv137, srv92, srv129 seem to be up (tried nc to port 11000, and got ERROR response). Not moving them into the SPARE section in case I'm not doing it right.
  • 04:19 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php 'Memcached on srv143 died, replaced with srv197 (slot 12)'
  • 00:34 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php
  • 00:33 Andrew: deploying AbuseFilter to zhwiki
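
The liveness check described in the 04:28 entry above (connecting to a memcached node with nc and treating an ERROR reply as a sign of life) could be sketched in Python roughly as follows. This is a minimal sketch, not the procedure actually used: the function names looks_alive and probe_memcached are hypothetical, port 11000 is taken from the log entry, and sending the "version" command rather than whatever was typed into nc is an assumption.

```python
import socket


def looks_alive(reply: bytes) -> bool:
    """Classify a reply from a memcached text-protocol port.

    A healthy memcached answers "version" with a "VERSION ..." line and
    answers an unknown command with "ERROR". Either reply shows that the
    daemon is accepting and parsing commands.
    """
    return reply.startswith(b"VERSION") or reply.startswith(b"ERROR")


def probe_memcached(host: str, port: int = 11000, timeout: float = 2.0) -> bool:
    """Return True if the node answers like a memcached daemon.

    Connection failures and timeouts are treated as "down".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"version\r\n")
            reply = sock.recv(1024)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
    return looks_alive(reply)
```

A node that passes this probe is only known to be answering on the port; whether it belongs in the SPARE section is still a judgment call, as the entry notes.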

April 7

  • 18:34 Rob: restarted memcached on srv116
  • 18:26 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18307 Autopatrolled permission on enwikisource.'
  • 16:36 Rob: removed the outdated fundraising blog from planet
  • 16:27 Rob: singer is back to normal. whygive is fubar, but no one cares ;] The rest of the services are online and functional.
  • 08:28 domas: reset srv217 - where did I hear that again. it was hanging on image NFS and had segfaulting apache too. 7 incidents with it in past two months - needs hardware diagnostics
  • 01:48 Rob: blog.wikimedia.org is now up. singer is kinda mostly fixed, i will finish it in the morning. all sites on it are up.
  • 00:16 Rob: pushed blog.wikimedia.org out of squid via dns

April 6

  • 21:49 Rob: singer is near returned to normal, however the primary corporate blog is doing funny things with caching and apache direction on the server.
  • 21:34 Rob: random insanity with apache on singer, affected corporate blogs, ocs, wm09scholarships, communicate portal, and ...well... that's enough. Still working on resolution.
  • 14:26 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 16178 Activate Collection Extension for generating PDF on the French Wikiversity'
  • 14:18 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 17861 logo update at german wikibooks'
  • 13:52 logmsgbot: kate synchronized php-1.5/extensions/CodeReview/CodeReview.php
  • 12:44 domas: reset power for srv187
  • 12:44 domas: restarted hanging apaches on srv90, srv97 and crashlooping ones srv189, srv206
  • 12:35 domas: restarted busylooping memcached on srv143, new bug!
  • 11:47 logmsgbot: kate synchronized php-1.5/extensions/CodeReview/codereview.css
  • 04:57 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php 'srv129 down, swapped it for srv182'
  • 04:54 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php 'Live-merging r49222 -- fix for hiding of logging data in recentchanges. Tests okay on testwiki.'

April 5

  • 19:45 logmsgbot: jeluf synchronized php-1.5/mc-pmtpa.php 'replace srv92 b srv152'
  • 15:07 domas: Tim rocks, cluster back to normal
  • 15:04 Tim: deployed r49212 to fix infinite template recursion issue
  • 11:24 domas: restarted some fedora apaches, were stuck in write() after the previous hiccups
  • 10:28 mark: Restarted memcached on srv151
  • 08:59 domas: I hate computers
  • 08:28 domas: hanging latex processes held :80, thus not allowing clean apache restarts on some nodes
  • 08:18 domas: restarted plenty of crashlooping apaches, investigating resource consumption problem
  • 05:53 Andrew: srv183 back up and memcached running, moved it from DOWN to SPARE. Threads on db12 back below 1000 again (normal range). Somehow I just resolved my first site problems by myself :O
  • 05:44 logmsgbot: andrew synchronized php-1.5/mc-pmtpa.php
  • 05:42 Andrew: srv183 went down, and it's running memcached. Replaced it with srv61, the first one in the SPARE section of mc-pmtpa.php, and moved it out. Hoping I did this right, and that it helps with db12 overload.
  • 05:30 Andrew: db12 (enwiki master) overloaded. Nothing I can do about it.
  • 01:52 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php 'Live-merge r49191, fixes a bug that gets in the way of suppression'
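
The spare swap described in the 05:42 entry above (replace a dead memcached node with the first host in the SPARE section, then move the dead host out) can be illustrated with a small Python sketch. This is an assumption-laden model, not the contents of mc-pmtpa.php: the function name swap_in_spare and the three plain lists are hypothetical, and keeping the replacement in the dead node's slot is assumed to matter because other entries in this log refer to slots by number.

```python
def swap_in_spare(active, spares, down, dead_host):
    """Replace dead_host in the active list with the first spare.

    Returns new (active, spares, down) lists and leaves the inputs
    untouched. The replacement takes over the dead host's position so
    that all other slots keep their index.
    """
    if dead_host not in active:
        raise ValueError(f"{dead_host} is not in the active list")
    if not spares:
        raise RuntimeError("no spare memcached nodes left")

    slot = active.index(dead_host)
    replacement = spares[0]

    new_active = active[:slot] + [replacement] + active[slot + 1:]
    new_spares = spares[1:]
    new_down = down + [dead_host]
    return new_active, new_spares, new_down


# Illustrative data only: srv183 and srv61 come from the log entry,
# the remaining host names are placeholders.
active, spares, down = swap_in_spare(
    ["srv60", "srv183", "srv62"], ["srv61", "srv70"], [], "srv183"
)
```

When the dead host comes back, the reverse step in the 05:53 entry is simply moving it from the down list to the end of the spare list.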

April 4

  • 21:01 azafred: bounced apache on srv82
  • 20:14 azafred: bounced apache on srv50
  • 18:16 azafred: bounced apache on srv72
  • 18:15 azafred: bounced apache on srv115
  • 18:01 azafred: bounced apache on srv121
  • 17:55 azafred: bounced apache on srv71
  • 14:32 azafred: bounced apache on srv147
  • 14:25 domas: restarted memcacheds on srv67, srv112 and srv143 - they seem to have hit a reference leak condition (that was probably resolved in memcached 1.2.7)
  • 13:44 brion_: deployed update to wikibugs
  • 13:03 Tim: restarting trackBlobs.php, probably died during db2 crash
  • 09:17 domas: reset-mysql-slave on db23, purged 1-100 logs on db13
  • 05:06 azafred: bounced apache on srv89
  • 04:46 azafred: bounced apache on srv124

April 3

  • 20:36 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Turned on collections on srwiki, srwikibooks, and srwikisource'
  • 09:16 mark: Power cycled sq34
  • 00:15 Andrew: installed renameuser, updated mediawiki to r48811 on usability

April 1

  • 23:08 Fred: restarted apache on srv99
  • 22:57 mark: Restored session to AS 30217 as well
  • 22:33 mark: Brought session to AS 13680 back up
  • 21:30 mark: Shut down BGP sessions to AS 13680 and 30217 for what appears to be problems to/within Level 3 Tampa
  • 16:10 JeLuF: Image import of about 5000 images done. 245000 left to do...
  • 15:53 Fred: rebooting srv217 since it is wedged.
  • 15:47 Fred: restarted apache on srv137
  • 13:47 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Activating AbuseFilter on tpiwiki and hewiki, bugs 18299, 18300'
  • 13:45 mark: Rebooting srv217
  • 13:45 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'AbuseFilter custom settings, hewiki'
  • 13:40 JeLuF: batch importing images from the Deutsche Fotothek, commons.wikimedia.org/wiki/Commons:Deutsche_Fotothek
  • 13:11 Andrew: Reports that Common.css/Common.js weren't working on hsbwiki. Manually purging http://hsb.wikipedia.org/w/index.php?title=-&action=raw&smaxage=0&gen=js&useskin=monobook on the command-line fixed the issue.
  • 08:42 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Similar fix for plwiki wgRemoveGroups -- added "bot"'
  • 08:41 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Fix for plwiki wgAddGroups, overriding with array( "abusefilter" ) stopped plwiki bureaucrats from adding other groups'
  • 04:06 river: test x
  • 04:05 Andrew: Works again
  • 04:04 Andrew: Testing re-enabling of identi.ca bridge for morebots

March 31

  • 21:19 aZaFred: restarted apache on srv99
  • 15:19 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewExamine.php 'Plug up the ability of users to run arbitrary filters against edits. Not strictly a security risk, but you could do some nasty things to slow down the servers with a filter (DoS vector).'
  • 15:18 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewTestBatch.php 'Plug up the ability of users to run arbitrary filters against edits. Not strictly a security risk, but you could do some nasty things to slow down the servers with a filter (DoS vector).'
  • 15:18 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.class.php 'Plug up the ability of users to run arbitrary filters against edits. Not strictly a security risk, but you could do some nasty things to slow down the servers with a filter (DoS vector).'
  • 15:17 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.i18n.php 'Plug up the ability of users to run arbitrary filters against edits. Not strictly a security risk, but you could do some nasty things to slow down the servers with a filter (DoS vector).'
  • 05:03 JeLuF: removed 30 GB of binlogs on db17

March 30

  • 23:50 domas: interrupted crash recovery for db2, will do crime scene investigation afterwards :)
  • 23:37 logmsgbot: midom synchronized php-1.5/CommonSettings.php 'enwiki rw with new master'
  • 23:36 logmsgbot: midom synchronized php-1.5/db.php 'welcome new enwiki master, db12 - db12-bin.016 79'
  • 23:27 domas: too many open transactions (?) on enwiki/db2 caused it to go OOM or so...
  • 23:25 logmsgbot: midom synchronized php-1.5/CommonSettings.php
  • 23:23 river: Out of Memory: Killed process 5374 (mysqld).
  • 23:21 Andrew: mysqld on db2 crashed.
  • 15:58 aZaFred: restarted Apache on srv126 and srv103
  • 08:22 domas: memcached on srv60 segfaulted..
  • 05:55 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialUserrights.php 'Live-merging r48993, fix for global group membership form (regression)'
  • 05:55 logmsgbot: andrew synchronized php-1.5/extensions/CentralAuth/SpecialGlobalGroupMembership.php 'Live-merging r48993, fix for global group membership form (regression)'

March 29

  • 14:20 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialWatchlist.php 'sync up watchlist fixes, r49002'
  • 11:20 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'temporary increasing internal RC limit to 5000'
  • 11:15 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'some more efficient joining'
  • 11:09 logmsgbot: midom synchronized php-1.5/includes/ChangesList.php 'livemerging up to 48990'
  • 11:09 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'livemerging up to 48990'
  • 10:35 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Bug 18223, let sysops edit abuse filters on dewiki'
  • 08:51 logmsgbot: midom synchronized php-1.5/CommonSettings.php 'move away the mainpage delete protection to getUserPermissionsErrorsExpensive'
  • 08:41 logmsgbot: midom synchronized php-1.5/includes/Title.php 'merging in c48983'
  • 07:37 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Turning on AbuseFilter on enwikiquote'

March 28

  • 00:56 Tim: fixed permissions on .ssh directories on singer. Converted jforrester's authorized_keys file from RFC to OpenSSH format.
  • 00:08 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialUserrights.php 'Syncing revert of apiuserrights r48909'
  • 00:07 logmsgbot: andrew synchronized php-1.5/includes/User.php 'Syncing revert of apiuserrights r48909'
  • 00:06 logmsgbot: andrew synchronized php-1.5/includes/AutoLoader.php 'Syncing revert of apiuserrights r48909'
  • 00:05 logmsgbot: andrew synchronized php-1.5/includes/api/ApiQueryUsers.php 'Syncing revert of apiuserrights r48909'
  • 00:05 logmsgbot: andrew synchronized php-1.5/includes/api/ApiQueryRecentChanges.php 'Syncing revert of apiuserrights r48909'
  • 00:04 logmsgbot: andrew synchronized php-1.5/includes/api/ApiMain.php 'Syncing revert of apiuserrights r48909'
  • 00:03 logmsgbot: andrew synchronized php-1.5/extensions/CentralAuth/SpecialGlobalGroupMembership.php 'Reverting apiuserrights (r48910)'

March 27

  • 23:29 aZaFred: rebooting srv217 since it is wedged.
  • 22:10 aZaFred: restarted apache on srv203
  • 21:39 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'disabling TIFF->JPEG thumbnailing, doesn't work at present with our setup'
  • 21:38 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php 'Enabling TIFF->JPEG thumbnailing experimentally'
  • 17:49 brion: enabled upload-by-url for all users on testwiki for wider testing, upped $wgMaxUploadSize to 500MB from default 100
  • 17:44 logmsgbot: brion synchronized php-1.5/includes/specials/SpecialUpload.php
  • 17:43 logmsgbot: brion synchronized php-1.5/includes/DefaultSettings.php
  • 17:43 brion: live-merging r48923 to make CURL timeout for upload-by-url configurable
  • 17:27 brion: enabled xml-rpc publishing on techblog so I can administer from WordPress iPhone app
  • 10:56 mark: Moved server nehalem to vlan 101
  • 02:28 Tim: db29 was full of relay logs, ran RESET SLAVE.
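
Relay logs filling a replica's disk recur in this log (db29 above, and db5 and db22 on other days). A minimal sketch of how one might measure relay-log usage before deciding whether a cleanup is needed. The helper name `relay_log_bytes` and the `*-relay-bin.*` file-name pattern (MySQL's default naming) are assumptions for illustration; the hosts above may have been configured differently.

```python
import glob
import os


def relay_log_bytes(datadir, pattern="*-relay-bin.*"):
    """Return the total size in bytes of relay-log files in datadir.

    The pattern is an assumption based on MySQL's default relay-log
    naming (host_name-relay-bin.NNNNNN); adjust it if relay_log is set.
    """
    total = 0
    for path in glob.glob(os.path.join(datadir, pattern)):
        # Skip directories or anything else that is not a regular file.
        if os.path.isfile(path):
            total += os.path.getsize(path)
    return total
```

Comparing the returned figure with the free space on the data partition gives a rough idea of whether relay logs are the cause of a full disk.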

March 26

  • 22:45 logmsgbot: brion synchronized php-1.5/includes/specials/SpecialBlockip.php 'to r48899 - fixes for hiding'
  • 22:42 Danny_B: wikibugs-l stopped sending mail to the wikibugs-irc mailbox due to excessive bounces. Re-enabling sending.
  • 22:05 logmsgbot: midom synchronized php-1.5/db.php 'db1 coming back as frwiki/jawiki slave'
  • 21:49 logmsgbot: brion synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php to r48899
  • 18:36 logmsgbot: brion synchronized php-1.5/languages/LanguageConverter.php
  • 18:36 brion: applying r48836 language converter fix live
  • 18:22 aZaFred: srv224 and srv225 have been kickstarted, deployed and put in rotation.
  • 15:56 Rob: srv224 and srv225 have temp power from A4 until new cables can be made.
  • 15:56 Rob: morebots was dead!
  • 12:45 domas: 10s lock wait timeout doesn't work for parallel data loads %)
  • 09:45 logmsgbot: midom synchronized php-1.5/db.php '*whip* db18, back to work, grunt.'
  • 09:16 logmsgbot: midom synchronized php-1.5/db.php 'letting db12 back into the pool'
  • 08:36 domas: doing firmware/kernel updates on db12
  • 08:36 logmsgbot: midom synchronized php-1.5/db.php
  • 08:32 domas: updating firmware on db18 ILOM: load -source http://208.80.152.185/~midom/ilom.pkg
  • 08:29 domas: rebooting db18 with 2.6.28.2
  • 08:27 domas: db18 problem was 2.6.24, not 'memory use', I guess
  • 03:29 logmsgbot: andrew synchronized php-1.5/extensions/CentralAuth/SpecialCentralAuth.php 'Fixes for strange display on Special:CentralAuth'
  • 03:28 logmsgbot: andrew synchronized php-1.5/extensions/CentralAuth/CentralAuth.i18n.php 'Fixes for strange display on Special:CentralAuth'
  • 02:43 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.class.php 'Deploying contains_any function, radix regex fixes for performance improvements'
  • 02:43 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Deploying contains_any function, radix regex fixes for performance improvements'
  • 02:42 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.i18n.php 'Deploying contains_any function, radix regex fixes for performance improvements'
  • 00:11 Tim: killed flaggedrevs update on zwinger, same reason as last time I did it. This time it actually took down ganglia for a while and made shell access very slow.

March 25

  • 23:03 aZaFred: kickstarted spence to run some test on.
  • 19:09 mark: Allocated port 0/1/13 on asw-a4-sdtpa for the server with the wrong name
  • 19:04 Rob: updated dns for nehalem test server
  • 15:59 logmsgbot: andrew synchronized php-1.5/extensions/FlaggedRevs/specialpages/OldReviewedPages_body.php 'Bug in OldReviewedPages'
  • 15:41 Rob: running the flaggedrevs updatelinks script across the flaggedrevs wiki in a screen session on zwinger
  • 15:30 Andrew: Running php svnImport.php MediaWiki 0 --wiki mediawikiwiki
  • 15:28 logmsgbot: andrew synchronized php-1.5/extensions/CodeReview/CodeRevision.php 'r48831'
  • 14:56 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialBlockip.php
  • 14:44 logmsgbot: andrew synchronized php-1.5/extensions/Collection/Collection.body.php
  • 14:43 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php
  • 14:11 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php
  • 14:08 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php
  • 14:01 logmsgbot: andrew synchronized php-1.5/includes/api/ApiQueryImageInfo.php
  • 14:01 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php 'Yet another fatal'
  • 13:54 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php 'Fatal errors'
  • 13:49 logmsgbot: andrew synchronized php-1.5/includes/api/ApiQueryCategories.php 'Fatal for invalid titles'
  • 13:49 logmsgbot: andrew synchronized php-1.5/includes/specials/SpecialUserrights.php 'Fatal for cross-wiki user rights'
  • 13:47 logmsgbot: andrew synchronized php-1.5/extensions/FlaggedRevs/FlaggedRevs.hooks.php 'Returning a value from a hook, causing exceptions.'
  • 13:41 logmsgbot: andrew synchronized php-1.5/extensions/ProofreadPage/ProofreadPage.php 'Fixing a fatal'
  • 13:37 brion: ProofreadPage is borked
  • 13:34 brion: scap complete!
  • 13:23 brion: starting general scap to r48811 -- yay!
  • 13:16 Andrew: Reverted some live hacks for AbuseFilter that were in there because of dependencies on core.
  • 12:48 brion: svn up'd test to r48811 ... last one?
  • 12:31 brion: svn up'ing test to r48810
  • 12:22 brion: disabling Configure extension on testwiki for now; we'll poke at it more later
  • 12:22 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php
  • 12:20 brion: applying DB schema tweaks for flaggedrevs, codereview
  • 12:16 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Trying to get rid of a warning'
  • 11:55 brion: scap time cometh! svn up'ing testwiki for shakedown...
  • 10:26 brion: svn up'ing CodeReview
  • 01:29 Tim: running trackBlobs.php again on hume
  • 01:23 Tim: db18 back up, replicating, 14285s lag
  • 01:18 logmsgbot: tstarling synchronized php-1.5/db.php
  • 01:14 Tim: rebooted ILOM on db18, was refusing to reboot the machine
  • 00:58 Tim: db18 was locked up in kswapd, attempting reboot
  • 00:30 Tim: installed NRPE etc. on ms2 and ms3
  • 00:15 logmsgbot: tstarling synchronized php-1.5/db.php
  • 00:11 logmsgbot: tstarling synchronized php-1.5/db.php
  • 00:09 logmsgbot: tstarling synchronized php-1.5/db.php

March 24

  • 23:57 Tim: changing master for rc1 to ms3. Omitting srv183, which will be removed from the group.
  • 23:46 azafred_: updated motd on srv32 to reflect its puppeteer status.
  • 23:35 domas: experimented with PG vs MySQL5.0 performance on db28 :)
  • 23:34 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Ack, added to foundationwiki, not fishbowl, reverted, and added to fishbowl.'
  • 23:07 Rob: lowered the memory limit on usability project server
  • 22:52 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Adding an inactive group to foundationwiki'
  • 22:43 brion: usability.wikimedia.org timing out
  • 22:23 azafred_: fixed spamassassin rules compilation on mchenry to speed up the process
  • 22:22 azafred_: Updated spamassassin rules to include more spam definitions on mchenry.
  • 21:36 brion: updating usability.wikimedia.org to current, installing AntiSpoof, AbuseFilter to help clean up vandalism problems
  • 18:59 azafred_: bounced apache on srv201
  • 18:58 azafred_: bounced apache on srv188
  • 18:30 Rob: pushing apache changes for wikizdroje.cz for cs.wikisource.org
  • 17:40 azafred_: restarted apache on srv190
  • 15:30 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php 'Bug 18023 Enable Collection extension on svwiki'
  • 15:22 logmsgbot: robh synchronized php-1.5/InitialiseSettings.php '//For bug 18061 Remove obsolete settings from cluster config files'
  • 10:45 logmsgbot: midom synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'RC cost became too high to support 5000 limit, decreased to 500'
  • 06:56 Tim: stopped mysql on srv171 and srv183 for copy to ms2 and ms3. Depooled.
  • 06:55 logmsgbot: tstarling synchronized php-1.5/db.php
  • 06:53 Tim: installed ganglia on ms2 and ms3, put them in the mysql cluster
  • 04:31 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'I broke plwiki'
  • 04:27 logmsgbot: andrew synchronized php-1.5/InitialiseSettings.php 'Bugs 18073, 18094, 18102, AbuseFilter for commons, plwiki, svwiki'
  • 04:22 logmsgbot: andrew synchronized php-1.5/CommonSettings.php 'Custom AbuseFilter settings for plwiki (bug 18073)'
  • 01:41 mark: Installed Ubuntu on ms2; ms3 and ms2 are now ready for ES usage

March 23

  • 23:32 logmsgbot: tstarling synchronized php-1.5/InitialiseSettings.php 'rate limit exemption for usability testing'
  • 22:56 domas: db1 will serve fr/ja soonish
  • 22:56 logmsgbot: midom synchronized php-1.5/db.php 'removing db1'
  • 18:58 Rob: upgraded spam plugin on wikimedia blog (shows old stuff in dashboard due to caching, it's updated now)
  • 18:23 aZaFred: Setup access on LDAP and NIS for fvassard.
  • 13:36 logmsgbot: midom synchronized php-1.5/db.php 'rename s2a into s2dewiki, add s2commons with single primary server ixia, enable ixia with commons-only dataset, *poof*'
  • 00:34 brion-weekend: customized style a bit on techblog.wikimedia.org

March 22

  • 23:14 Tim: killed long-running tidy instance on srv108
  • 22:10 Tim: restarted memcached on srv82
  • 12:15 domas: dumping commonswiki from db30 with server still pooled in and 4 dumper threads
  • 10:38 domas: hey, I found remote code execution vulnerability, it seems! :)
  • 10:37 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php '--message=Arbitrary execution vulnerability in AbuseFilter, exploitable only by admins'
  • 06:00 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled ixia, is full'
  • 05:54 Tim: db5 was running out of disk space due to excessive relay logs. Ran RESET SLAVE.
  • 03:02 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewExamine.php '18096 Special:AbuseFilter/examine doesn't list new account creation log entries'
  • 02:47 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.class.php 'Fixed bug in batch testing interface'
  • 02:41 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/AbuseFilter.parser.php 'Optimisation of rmdoubles, causes 20-fold performance improvement on large pages'
  • 02:23 logmsgbot: andrew synchronized php-1.5/extensions/AbuseFilter/Views/AbuseFilterViewDiff.php 'Prevent leaking of private filters through diff interface'

March 21

  • 17:30 logmsgbot: midom synchronized php-1.5/db.php 'someone forgot to enable db8.. putting back to pool'
  • 17:29 logmsgbot: midom synchronized php-1.5/db.php 'bringing back db26'
  • 10:46 logmsgbot: midom synchronized php-1.5/db.php 'taking out db26 for kernel experiments (2.6.28.8 with some different build options)'
  • 00:12 brion: upload-by-URL enabled for sysops on testwiki (using khaldun as internal proxy)
    • 00:07 logmsgbot: brion synchronized php-1.5/InitialiseSettings.php
    • 00:04 logmsgbot: brion synchronized php-1.5/extensions/MWSearch/MWSearch_body.php
    • 00:04 logmsgbot: brion synchronized php-1.5/includes/specials/SpecialUpload.php
    • 00:03 brion: live-merging r48648 to allow $wgHTTPProxy to work for uploads and not interfere with search

March 20

  • 23:58 Tim: repooled srv126, part of cluster12, appears to be up and working
  • 23:58 logmsgbot: tstarling synchronized php-1.5/db.php
  • 23:54 Tim: depooled srv125 from ES, has been down for 12 days. cluster12 is now down to 1 server
  • 23:54 logmsgbot: tstarling synchronized php-1.5/db.php
  • 23:52 Tim: install ganglia-metrics on db24
  • 23:13 Tim: set up logrotate on locke, using the same script that we use for MW debug logs
  • 22:56 domas: kickbanned security threat from #wikimedia-tech, was trying to install keylogger and steal our passwords
  • 22:54 Tim: removed old squid log stream going to iris. Set up a log stream going from all the squids to locke.
  • 22:37 Tim: depooled adler, it's down
  • 22:36 logmsgbot: tstarling synchronized php-1.5/db.php 'depooled adler'
  • 21:01 mark: removed vlan 5 on csw5-pmtpa that was accidentally created/left behind by Tim
  • 20:47 domas: installed snaprotate on db26, enabled snapshots with 2x8h schedule, updated Database snapshots
  • 20:44 logmsgbot: midom synchronized php-1.5/includes/api/ApiQueryRevisions.php 'livemerging r48642'
  • 20:43 logmsgbot: midom synchronized php-1.5/includes/filerepo/ArchivedFile.php 'livemerging r48644'
  • 20:27 domas: someone who wrote ArchivedFile::load, needs some pain and torture applied (query doesn't use index... ;-)
  • 20:25 logmsgbot: midom synchronized php-1.5/db.php 'reduced load for db26 from 200 to 100, as it has reduced amount of RAM and increased amount of other work'
  • 20:09 domas: added db26 to s1 pool
  • 20:08 logmsgbot: midom synchronized php-1.5/db.php
  • 18:47 mark: JeLuF> |log \o/
  • 18:46 domas: pooled db22 back in, as it caught up on replication!
  • 18:46 mark: raised karma of /h/w/b/reset-mysql-slave
  • 18:46 logmsgbot: midom synchronized php-1.5/db.php
  • 18:45 domas: ran /h/w/b/reset-mysql-slave on db26, after copy from db22!
  • 18:45 domas: wrote /h/w/b/reset-mysql-slave to reset mysql slaves!
  • 18:32 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '15880 Pseudo-Namespace on Korean Wikipedia'
  • 18:25 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '17699 Create Appendix namespace on Spanish Wiktionary'
  • 18:19 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '17987 Set $wgBlockAllowsUTEdit = true for zh.wikipedia'
  • 18:16 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '16446 tlwikibooks namespaces to be searched by default'
  • 18:14 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '16446 Add Pagluluto: namespace to tlwikibooks'
  • 18:12 domas: db26 bootstrap problem was inconsistent ibdata specification, probably first copy would've been enough :)
  • 17:55 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '17818 Create "Wikijunior" namespace on Polish Wikibooks (fix)'
  • 17:53 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php '17818 Create "Wikijunior" namespace on Polish Wikibooks'
  • 17:41 domas: I'm out of luck, copy from db22 to db26 has failed again
  • 17:32 JeLuF: changed sync-common-file to automatically log messages to Server admin log
  • 17:27 logmsgbot: jeluf synchronized php-1.5/InitialiseSettings.php "17739 nowiktionary logo"
  • 17:18 JeLuF: changed slwikibooks, slwikisource logos to $stdlogo
  • 17:17 domas: tried to poke Werdna about AbuseFilter regexp warnings, but he didn't listen (check remotelogtail @ db20 )
  • 17:16 domas: restarted srv170 - was throwing occasional segfaults
  • 17:15 domas: 'RESET SLAVE' on db22, to clean the relay log congestion
  • 17:14 JeLuF: changed mgwiki logo to $stdlogo
  • 17:13 domas: attempted to reboot adler using sysrq, failed at it. adler needs datacenter service
  • 17:12 domas: did set up Jens access rights on wikitech, because nobody did that before
  • 17:11 JeLuF: sahwiki and stqwiki logos changed to $stdlogo
  • 17:11 domas: tested block compression on 'pagelinks' table on lomaria's innodb/1.0.3
  • 17:11 domas: tested key packing on 'pagelinks' table on lomaria's innodb/1.0.3
  • 17:10 domas: erased test mysql 5.1 builds on db26
  • 17:10 domas: restarted data copy from db22 to db26
  • 17:10 domas: forgot to stop mysqld on db26 :)
  • 17:09 domas: started copying data from db22 to db26
  • 17:09 domas: took out db22 for copy to db26
  • 16:53 JeLuF: locked rn.wiktionary
  • 12:48 domas: truncated log/remote on db20, had 6G of adler kernel noise, firewalled out adler syslog stream
  • 06:10 Tim: started trackBlobs.php for all fedora clusters
  • 03:20 Tim: rebuilt udplog for hardy and installed it on locke
  • 02:55 Tim: installed ganglia and NRPE on locke
  • 02:48 Tim: renamed db6 to locke in csw5 port list, this wiki, dsh node lists
  • 01:28 Tim: reinstalled db6 as "locke", with Ubuntu 8.04, RAID 5
  • 00:22 Tim: removed db8 from groupLoadsBySection, was causing it to be included in lag reporting
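
The 17:32 entry above notes that sync-common-file now logs its own messages, which is where the `logmsgbot` lines throughout this log come from. A minimal sketch of how such a line could be composed, based only on the format visible in the entries; the function name `sync_log_line` is hypothetical and this is not the script's actual implementation.

```python
def sync_log_line(user, path, message=""):
    """Build a log line in the shape used by the logmsgbot entries.

    Hypothetical helper: mirrors "<user> synchronized <path> '<message>'",
    omitting the quoted part when no message is given.
    """
    line = f"{user} synchronized {path}"
    if message:
        line += f" '{message}'"
    return line
```

For example, `sync_log_line("midom", "php-1.5/db.php", "bringing back db26")` produces the same text as the March 21 entry, and calling it without a message yields the bare form seen in entries such as `tstarling synchronized php-1.5/db.php`.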

March 19

  • 23:04 Tim: prepared DNS for a rename of db6 to "locke", for use as a squid log server, with a new IP address on the public subnet
  • 22:47 river: Morebots 0 - 1 Moarbots
  • 22:46 Andrew: Removed identica code from morebots for now, it's giving annoying error messages.
  • 22:28 mark: Removed sq1 from upload squids node group
  • 22:11 Tim: moving updateLinks.php job from zwinger to hume
  • 21:58 Tim: restarted apache on srv62, srv82, srv151
  • 21:53 Tim: hume is down, OOM, rebooting. Was down since about 18:00.
  • 20:46 brion: poke poke
  • 20:42 Rob: ran extensions/FlaggedRevs/archives/patch-fpc_level.sql on existing flaggedrevs dbs (did it about 15 minutes ago)
  • 20:41 Rob: wikitech died, rebooted.
  • 20:06 Rob: running updatelinks php script on flaggedrevs dbs per aaron
  • 19:58 Rob: enabled pdf collection on meta
  • 19:38 Rob: updated initialisesettings for 16342 Enable flood flag configuration on English Wikibooks
  • 18:59 Rob: updated initialisesettings for bug 17986 Set $wgBlockAllowsUTEdit = true for zh.wikibooks
  • 17:18 Rob: updated Initialisesettings for 14079 Configure groups in nowiki for access to Special:Unwatchedpages
  • 17:12 Rob: added namespace aliases to zhwiki and ran dupe checking per 17885 Aliases of 'Wikipedia talk' namespace in Chinese Wikipedia
  • 17:00 Rob: 16426 Enable subpages in template namespace on MediaWiki.org is done.
  • 16:53 Rob: updated the logo in sul login for wikibooks
  • 15:24 Rob: updated CommonSettings.php per bug 17453
  • 15:16 Rob: ran namespacedupes script for zhwiki per bug 17701
  • 15:12 Rob: added new namespaces to nowikisource per bug 16232
  • 15:04 Rob: added 6 new namespaces to zhwikisource per bug 15722
  • 14:27 Rob: updated flaggedrevs onto iawiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=16485
  • 14:11 Rob: ran flaggedrevs autopromote script on iswiktionary
  • 14:08 Rob: updated for 16476 Enable FlaggedRevs Patrolling Configuration on is.wiktionary
  • 14:07 Rob: updated for 16427 Set $wgRestrictDisplayTitle to False on Chinese Wikipedia
  • 10:51 domas: disabled write barriers on lomaria's /a
  • 08:55 river: undepooled adler, depooled db8
  • 06:21 Andrew: synced r48573, bug in testing interface
  • 02:53 Andrew: Synced r48564 to sites to allow more self-policing of filter performance, displaying run time in ms on the filter page itself.
  • 02:44 Andrew: Updated AbuseFilter to r48564 on test to check filter profiling.
  • 01:27 Andrew: Giving abusefilter-revert right to enwiki admins
  • 00:37 Andrew: Synchronised AbuseFilter.parser.php for short-circuiting
  • 00:22 Andrew: Updating AbuseFilter on test to r48553

March 18

  • 20:44 brion: also req'ing doc updates for Upload filesystem snapshots
  • 20:43 brion: made a quick note about our database snapshots, needs more docs
  • 20:21 Tim: deploying AjaxResponse.php r48531
  • 20:20 brion: disabling image moving due to reports of breakage
  • 20:18 brion: synced r48525 for temp xss fix in abusefilter ajax
  • 19:45 Tim: re-enabled AbuseFilter with per-filter profiling
  • 19:42 brion: disabling AbuseFilter on en.wikipedia.org; performance problems on save. Needs proper per-filter profiling for further investigation.
  • 05:49 Andrew: Live-hacked in r48512 on AbuseFilter -- visual diffs on details page.
  • 04:14 Andrew: Synced r48509 in AbuseFilter -- cross-filter diffing allowing leaking of hidden filters.

March 17

  • 23:32 Andrew: Activated AbuseFilter on enwiki
  • 23:18 Andrew: Scapping to update AbuseFilter
  • 23:12 Andrew: Updating AbuseFilter to r48500 on testwiki.
  • 22:48 Tim: NRPE installed on srv100
  • 22:37 Tim: installed NRPE on db17, adler, thistle, lomaria, db30. Fixed NRPE on thistle.
  • 22:29 Tim: deleted cluster13 and cluster14 backups on storage2
  • 14:19 Rob: updated logo for pntwiki per bug 17960 Update Logo for pntwiki
  • 03:27 Tim: deployed OggPlayer.js r48477
  • 00:16 Andrew: Scapping to update AbuseFilter

March 16

  • 23:30 Andrew: Conflict in AbuseFilter resolved. AbuseFilter on testwiki only updated to r48466. Will roll out to other wikis in an hour or so.
  • 23:25 Andrew: Conflict in extensions/AbuseFilter/Views/AbuseFilterViewHistory.php -- DO NOT SCAP
  • 23:09 brion: enabling $wgAllowImageMoving sitewide. Default group permissions allow image moving for sysops only, so should be safe-ish.
  • 22:10 brion: setting up basic TIFF upload support on test & commons (bugzilla:17714) per req of image restoration folks. No thumbnailing yet.
  • 20:36 brion: set up aboostani on SVN
  • 18:43 Rob: updated akismet plugin on blog.wikimedia.org
  • 17:22 Rob: and brion called me a weenie cuz I do not do enough SVN work.
  • 17:21 Rob: updated en.planet with the new tech blog feed
  • 15:48 Rob: forced ssl login and admin panel for techblog, rest moves back to standard http
  • 15:45 Rob: setup https for techblog
  • 08:07 Andrew: Bug 17998 Allow autoconfirmed users to see filters and logs on ruwiki
  • 03:27 Andrew: Bug 17071 - Allow import rights to be added/removed by bureaucrats on mediawikiwiki

March 15

  • 11:21 domas: lomaria runs dewiki on 5.1.33/innodb1.0.3

March 14

  • 20:39 mark: db4 was being used for special page updates from hume and lagged, reduced its load from 150 to 50
  • 13:10 mark: Reduced cache_mem on backend squid sq28 to see if memory pressure is causing some issues
  • 07:03 Andrew: added /etc/init.d/morebots to wikitech, to auto-start morebots. also made it auto-restart on crash
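
The db4 change above (load 150 to 50) lowers the share of reads that server receives. A minimal sketch of the arithmetic, assuming reads are distributed in proportion to the configured load weights; the peer names and their weights of 100 are hypothetical, and the real balancer may also account for replication lag.

```python
def read_share(weights):
    """Return each server's expected fraction of reads.

    Assumes selection proportional to weight (an assumption for this
    sketch, not a description of the production load balancer).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one server needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical pool: db4 plus two peers at weight 100 each.
before = read_share({"db4": 150, "peer1": 100, "peer2": 100})
after = read_share({"db4": 50, "peer1": 100, "peer2": 100})
print(round(before["db4"], 2), round(after["db4"], 2))  # 0.43 0.2
```

Under these assumed weights, db4's share drops from roughly 43% to 20% of reads, leaving it more headroom for the special page updates.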

March 13

  • 23:11 Rob: migrated survey software from isidore to singer
  • 22:21 Rob: restarted apache on singer after enabling all the mod rewrite stuffs
  • 21:58 Rob: redeployed ./sync all for squid for whygive migration
  • 21:58 Rob: disabled blog and whygive apache virtual hosts on isidore
  • 21:58 Rob: migrated old whygive.wikimedia.org from isidore to singer.
  • 21:21 Rob: http://techblog.wikimedia.org is online (although quite sad and empty)
  • 21:14 Rob: setup tech blog on singer with database residing on db9
  • 21:06 Rob: updated dns for new techblog (not yet live)
  • 21:06 Rob: updated squids configuration for blog move
  • 21:00 Rob: moved blog.wikimedia.org from old server isidore to new server singer
  • 20:18 Tim: srv85 died, possible disk failure, no ssh or memcached, still has HTTP. Removed it from the memcached list, removed it from apache LVS.
  • 20:15 Rob: updated spamfree plugin on blog.wikimedia.org
  • 20:04 Rob: updated to newest version of wordpress on blog.wikimedia.org and whygive.wikimedia.org
  • 17:24 Rob: updated flaggedrevs.php per bug 16365
  • 17:23 Tim: removed password auth from nagios
  • 15:28 river: ms4 disk c5t5d0 failed

March 12

  • 18:08 Tim: restarted apache on srv75 and srv103

March 11

  • 20:40 Rob: updated Initialisesettings for 17893 en.wiktionary bureaucrats can't add sysop, crat and bot flags (they also could not remove them until now)
  • 00:37 Andrew: Live-hacking out r48087 (localisation changes for AbuseFilter with core dependencies) until a full code update.
  • 00:13 Andrew: Activating AbuseFilter on arwiki
  • 00:08 mark: Added httpdconf module to rsyncd on db20, and also restricted access to certain subnets
  • 00:02 Andrew: scapping to update AbuseFilter

March 10

  • 23:31 tomaszf: re-enabled central notice for all template generation
  • 23:30 Andrew: on testwiki, that is
  • 23:29 Andrew: updated AbuseFilter to r48294 for debugging
  • 22:58 Andrew: updated AbuseFilter to r48288 on test.
  • 22:32 Tim: restarted memcached on srv63
  • 22:24 brion: updating Collection extension to r48283; fixes
  • 18:45 logmsgbot: hi identi.ca folks! domas and mark just cleared #profiling data!
  • 17:27 brion: reenabling limited centralnotice update cronjob on hume, since the live notice ain't working
  • 00:52 brion: domas makes it all better. yay domas!
  • 00:50 brion: cluster21 master (srv161) in read-only; we're having write failure problems on multiple wikis
  • 00:48 brion: putting ES back into service on enwiki; srv160/cluster20 master has been fixed. slaves still running it, but this is safe for read/write since reads will fall back to master

March 9

  • 20:59 Rob: fixed bug 17893 en.wiktionary bureaucrats can't add sysop, crat and bot flags
  • 20:15ish-20:30ish Brion:
    • some (unspecified) blob tables in ES broken due to MAX_ROWS=1m. writes broken on enwiki. domas is rebuilding tables
    • disabling wgDefaultExternalStore on enwiki temporarily as hack measure. last text_id was 274347580 before
    • now actually disabling read-only too
  • 09:50 Andrew: Live-synced r48211 -- fixes bug 17877, which is a security issue because it allows accounts to be created which need a bureaucrat to block.
  • 09:15 domas: db24..

March 8

  • 13:40 domas: srv125 didn't come up after sysrq-b

March 7

  • 02:32 brion: updated wikibugs to r48113, fixes issue with resolved bugs
  • 02:16 brion: installing build-essential on mchenry so CPAN will work
  • 02:11 brion: CPAN sux
  • 02:04 brion: installing CPAN Email::MIME on mchenry for new wikibugs...
  • 01:59 brion: updated wikibugs to r48110, which actually uses an email parser (omg)

March 6

  • 18:26 brion: restarting dump threads on srv31, down since it was rebooted a few days ago
  • 06:37 Andrew: Live-hacking out the "save and share" functionality of Collections (UI and processing). It uses Article::doEdit and does absolutely no checking of permissions or filtering.
  • 02:25 brion: reverting live wikibugs to older version, we need some further tweaking on the mail decoding :D
  • 02:21 brion: updated wikibugs to r48080, which should handle unicode subjects even when the body is plain ascii
  • 02:07 brion: updated wikibugs to r48079
  • 02:05 brion: installing MIME::QuotedPrint perl module on mchenry to educate wikibugs about mail subject encodings
  • 01:25 brion: adding 'hideuser' right to oversight group, forgot that one in january
  • 00:58 brion: synced SpecialDeletedContributions for r47930 fix

March 5

  • 22:53 Andrew: Ran namespaceDupes.php --fix --wiki oswiki for bug 17776
  • 22:43 domas: edited /etc/udev/rules.d/70-persistent-net.rules on db28 to move back eth4 to eth0 :)
  • 21:43 mark: Added community to 1299 session to prepend outbound announcements to 2828 once
  • 21:42 Rob: updated DNS for bug 16955
  • 21:25 Rob: ran php namespaceDupes.php --fix and it didn't seem to bork crap. Also checked the script itself. Seems ok and I did not break the entire site, so yay?
  • 19:47 Rob: db28 mainboard swapped, booting up
  • 19:28 mark: Set up NIS on zwinger
  • 19:01 Rob: shutdown db26 to check its memory (lag, log, blah)
  • 18:43 brion: removing obsolete, unused 'developer' group (bugzilla:12569)
  • 16:49 Rob: updated InitialiseSettings for bug 17701 Alias of 'Wikipedia' namespace in Chinese Wikipedia
  • 16:39 Rob: updated InitialiseSettings for bug 17307 Remove restrictions for wm2009wiki
  • 15:07 domas: db21 needs 2.6.28, hit the 2.6.24 deadlock problem

March 4

  • 21:41 Rob: final errors in OCS configuration fixed. ocs.wikimania2009.wikimedia.org is now working properly
  • 20:36 Rob: added singer to nagios
  • 20:22 Rob: setup proper email sending for ocs software install
  • 14:54 Rob: pushed changes to InitialiseSettings for bug 13055 Make newpage patrolling available to autoconfirmed users on jawp
  • 14:43 Rob: pushed changes to InitialiseSettings for bug 16289 patrol function assigned to all users, not autoconfirmed users, on nlwiki
  • 14:16 mark: Brought knsq7 back up
  • 07:38 domas: srv217 was acting funny - lots of hanging processes which all decided to quit at strace. system locked up eventually, and hardware powercycle was done :)
  • 07:37 domas: db24 hanged again (triple checked, not when producing a snapshot), nothing in dmesg or SP log

March 3

  • 18:26 brion: setting our nameservers for transferred wikpedia.org domain
  • 17:09 domas: re-enabled db5 and db25, db25 is serving as snaprotate slave for s3.
  • 06:15 river: depooled db5 instead to dump from, left adler out of rotation
  • 06:13 river: adler mysqld crashed during dump, i/o error: sd 2:2:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
  • 06:02 river: restarted replication on db30
  • 05:06 river: repooled db11 since it's the master and depooled db1 instead
  • 05:04 river: depooled db11 to dump s3

March 2

  • 18:57 brion: updating Collection ext to r47946; includes coll-license_url message for per-wiki license URL override
  • 18:57 brion: morebots is dead :(
  • 16:15 Rob: Updated and synced InitialiseSettings for bug 15927 Enable Special:Nuke on pl.wikipedia
  • 03:56 river: removed db30 from rotation to dump commons

March 1

  • 17:30 river: repooled db7
  • 16:29 river: unpooled db7 to dump externallinks
  • 09:54 Tim: Mirror update done. Doing apt-get update/upgrade and reboot test on srv188.
  • 07:45 Andrew: Deploying AbuseFilter to ruwiki per bug 17729
  • 07:05 Tim: deleting files from khaldun:/srv/ubuntu/pool that are over 2 years old
  • 06:52 Tim: khaldun out of disk space as expected
  • 05:51 Tim: switched the ubuntu mirror to use the one at neu.edu, which is much closer (43ms RTT) and thus faster for small files than the osuosl.org one
  • 05:46 Tim: updated ApiQueryUsers.php to r47865
  • 05:06 Tim: disabled mirror cron job on khaldun temporarily, while the update running in my terminal completes
  • 05:00 Tim: khaldun had not been updating its ubuntu repositories since roughly September last year, due to the lack of a ~/.gnupg/trustedkeys.gpg file, for gpgv. Also the option syntax for debmirror had changed, breaking b/c, and feisty was removed from OSL. Fixed everything, removed feisty.
  • 04:26 Tim: testing ubuntu mirroring on khaldun

February 28

  • 03:19 river: put db7 back in rotation

February 27

  • 22:37 Rob: setup daily backups of web directories from singer to tridge
  • 22:25 Rob: setup https on singer for ocs software
  • 21:37 Rob: singer installed and up, running wikimania2009 OCS software only at present. Will migrate other services to it at a later date/time.
  • 21:03 river: stopped replication on db7 to dump s1
  • 19:24 Rob: installing and setting up new r300 singer for OCS install and eventual blog migration.
  • 12:21 river: mounted /mnt/upload5 on srv78
  • 05:55 Tim: thistle was toast due to ChangeTags screwing up the query plan for RecentChanges for unfiltered queries. It was doing a scan of the tag_summary table followed by a filesort of recentchanges. Needs FORCE INDEX(ts_rc_id). Disabled with live patch.

February 26

  • 23:06 mark: Attempted Ubuntu install on ms2, but Ubuntu doesn't fully support systems with > 16 disks. Awaiting their fix.
  • 23:06 mark: Downpreffed paths _2828_7473_ and _2828_4637_
  • 22:37 brion: installed swap-watchdog on pdf1
  • 22:25 brion: tom rebooting pdf1 by LOM
  • 22:20 brion: swapping spike on pdf1
  • 21:54 mark: Set up link aggregation over two GigE links on ms3
  • 21:49 Rob: replaced the favicon file for wikibooks and synced out per bug 17049 New favicon for Wikibooks projects
  • 21:36 Rob: setup further log rotation on pdf1
  • 21:31 brion: enabling Collection on sourceswiki ("old wikisource"), was forgotten the other day
  • 21:15 Rob: setup log rotation of mw-serve.log on pdf1
  • 21:02 mark: Upgraded iLOM/BIOS on ms2
  • 20:32 Rob: restarted apache on srv193, srv197, srv201, srv217, srv223
  • 20:28 Rob: restarted apache on srv62, srv78, srv100, srv133, srv170, srv173, srv175, srv190
  • 20:18 Rob: restarted apache on srv135 and srv163 due to segfaults
  • 20:10 brion: fixed up cache clear on pdf1. note we need to set up log rotation
  • 19:44 Rob: nope, didn't break it, it worked. Updated InitialiseSettings per bug 17295 Please allow sidebar links to be localisable for Wikimania 2009wiki
  • 19:44 Rob: pushing a change per bug 17295 that could very well crash wikimania2009 wiki, let's find out! =]
  • 19:41 brion: merging r47836 to restrict 'pdf version' link in restricted mode
  • 19:37 mark: Moved ms2 to internal vlan with new ips and dns entries
  • 19:28 brion: testing rollout of PDF generation extension on en.wikipedia.org; with collection portal limited to logged-in users only
  • 19:19 Rob: srv31 commented out of mediawiki-installation node group until its back online
  • 19:16 Rob: updated mw-lib on pdf1, restarted pdf generation service
  • 19:16 Rob: pdf generation offline for update
  • 18:17 brion: srv31 appears to be a bit borked again. sigh.
  • 18:05 Rob: Added wgRestrictDisplayTitle to InitialiseSettings to support bug 17307 Remove restrictions for wm2009wiki. Also removed the wgAllowDisplayTitle as it was an old testwiki flag that is no longer used.
  • 17:49 Rob: Updated InitialiseSettings to support bug 17361 Enable DynamicPageList for Wikimania2009wiki
  • 16:44 mark: Brought BGP session to 2828 back up
  • 14:59 Rob: updated InitialiseSettings for 17201 Enable the ability for crats on the Arbcom private wiki (English Wikipedia) to also remove crat and sysop
  • 14:47 Rob: fixed change in InitialiseSettings for bug 17388
  • 14:42 Rob: rolling back change because I changed the wrong entry.
  • 14:29 Rob: Implemented changes to InitialiseSettings for bug 17388 Allow bureaucrats to revoke sysop status on UK chapter website
  • 11:13 mark: Shut down BGP session to 2828, many reports of partial reachability, seemingly due to a broken link in a multipath
  • 06:12 Tim: running gearmanRefreshLinks.php on hume for all wikis with 20 worker threads
  • 05:24 Tim: on hume: deleted the unnecessarily aggressive default configuration from /etc/ufw to allow ufw to be enabled without immediately taking the server off the net or rate limiting all incoming connections to 3 per minute. Drop incoming connections to gearman.

February 25

  • 22:21 Andrew: Deploying AbuseFilter to nowiki.
  • 18:58 mark: Installed ms3 on 2 of the 4 bootable disk drives (sda and sdi), on an 80 GB RAID-1 root partition, leaving the rest of those 250 GB drives free, as well as the other 46 drives
  • 18:24 Rob: srv215 deployed into service
  • 18:16 Rob: installed and setting up srv215
  • 18:08 brion: merging r47809 live -- XSS fix for Collection extension
  • 18:01 brion: temporarily halting apache on srv78, which seems to have some borkage with nfs mounts and nis
  • 17:55 brion: srv78: ls: cannot access /mnt/upload5/wikipedia: No such file or directory
  • 17:53 Rob: rename auth2 to sockpuppet in dns, pushed changes
  • 16:55 Rob: fixed IP allocation for ms4 and ms3
  • 16:55 Rob: reset ms4 by accident due to IP misassignment on service processors for ms4 and ms3
  • 05:29 Tim: updated ApiQueryLogEvents.php to r47781
  • 04:21 Tim: killed long-running ApiQueryLogEvents queries on db4 (12000 seconds)
  • 01:11 Andrew: Deployed AbuseFilter on dewiki per request of DaB
  • 01:08 Andrew: adding AbuseFilter tables for dewiki
  • 00:03 brion: running mw-serve cache cleaning on pdf1; cron job was borked by missing log file

February 24

  • 22:46 Andrew: Deployed AbuseFilter on metawiki.
  • 22:43 Andrew: Adding abuse filter tables to metawiki.
  • 22:30 brion: syncing merge of r47769 to fix regression in page history
  • 22:22 brion: fix to disable slow tag search has had a side effect breaking page history. Should be tidied up in a few minutes.
  • 22:15 brion: merging r47767 live to disable $wgUseTagFilter
  • 20:37 Rob: updating dns for wikimedia.cz
  • 20:32 brion: turning Collection/PDF on for all wikisources
  • 19:47 mark: Shutting down Apache and unmounting /home on srv32, for puppet testing
  • 19:03 Rob: updated wikibooks logo stuff per bug # 17034
  • 17:13 brion: just noting that TeliaSonera has a planned maintenance which may affect KNAMS connectivity for a few minutes around 2009-02-25 05:00 UTC.
  • 16:51 brion: restarting some dump threads on srv31
  • 01:55 river: ms1 is now replicating to ms4 instead of ms2
  • 00:06 Rob: reverted cs.wikimedia.org and cz.wikimedia.org redirects for danny

February 23

  • 22:56 RobH_: srv136 reinstalled and redeployed as apache
  • 22:32 mark: Installed puppet (test install) and thereby automatically gmond as aggregator on srv33
  • 22:27 mark: Installed puppet (test install) and thereby gmond on srv32
  • 21:16 domas: adler has disk with media errors (ID:5, 6th disk in array): http://p.defau.lt/?3_7_6aIatj3DeNBw_jjtBg - needs cannibalized samuel, disk replacement, and ubuntu install on raid10
  • 19:04 Rob: srv136 back from repairs, reinstalling as apache server
  • 18:44 Rob: srv217 not running apache, synced and restarted
  • 18:29 Rob: srv33 reinstalled to ubuntu and deployed as apache server
  • 18:24 Rob: srv32 reinstalled to ubuntu and deployed as apache server
  • 17:55 Rob: reinstalling srv32 to ubuntu
  • 17:38 Rob: resynced and restarted apache on srv32, srv33, srv34
  • 17:32 Rob: srv31 powered back up
  • 17:25 Rob: found a breaker flip in the DC, affects srv31-srv34
  • 13:40 domas: oh, btw folks, kudos on perfect web2.0 engineering, now morebots complains when message is longer than 140 bytes, and we end up without our microblogging syndication
  • 13:39 domas: added "su -m 'www-data' -c 'find /opt/mwlib/var/cache/ -mindepth 3 -mtime +1 -delete'" to pdf1 crontab, does anyone actually look after this service?
  • 12:57 Tim: deployed r47704, now command line scripts don't access /home anymore
  • 11:37 Tim: switched archive directory over to /mnt/upload5, starting another rsync. Some files will be missing until the rsync is done
  • 10:07 Tim: moved all job runners from the previous ad hoc script to the new wikimedia-job-runner package
  • 06:25 Tim: moved the nagios plugins for fedora from /home/nagios to /h/w/common/nagios-fedora-plugins
  • 05:21 Tim: started udp2log on db20, MW UDP logs were dead
  • 05:19 Tim: killed errant jobs loop scripts still running on fedora servers
  • 04:36 Tim: fixed the log directory for /etc/cron.d/mw-central-notice, killed the process that was in a tight loop trying to write to a stale NFS file handle
  • 04:28 Tim: finished moving ExtensionDistributor working copy
  • 04:14 Tim: moving ExtensionDistributor working directory from /home to /mnt/upload5
  • 04:00 Tim: private/archive/wikipedia was in fact not migrated, but an initial rsync was done. I will do a second rsync now.
  • 03:42 Tim: rsync done, uploads re-enabled, b/c symlinks set up
  • 03:37 Tim: doing rsync
  • 03:31 Tim: temporarily disabled file uploads on all private wikis, for migration to ms1
  • 03:10 Tim: confirmed with maintenance/getRealUploadDir.php that all wikis except the private wikis have an upload directory which symlinks to /mnt/upload5. Changed $wgUploadDirectory in InitialiseSettings.php accordingly. Deleted some ancient commented-out code from CommonSettings.php.
  • 02:50 Tim: same for commons ForeignDBViaLBRepo directory, ScanSet directory, CentralNotice directory
  • 02:44 Tim: fixed CommonSettings.php location of deleted images, upload3 -> upload5, appears to have been moved already
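The 13:39 crontab entry above prunes the mwlib cache with `find ... -mindepth 3 -mtime +1 -delete`. A rough Python equivalent is sketched below, mostly to spell out what those two tests select. The function name `prune_cache` and its parameters are hypothetical, and unlike `find -delete` this sketch only removes regular files, never empty directories.

```python
import os
import time


def prune_cache(root, min_depth=3, min_age_days=2, now=None):
    """Delete regular files at least min_depth levels below root whose
    mtime is at least min_age_days full days old.

    find's "-mtime +1" drops the fractional part of the age in days and
    then requires it to be greater than 1, i.e. at least 2 full days.
    Returns the sorted list of deleted paths.
    """
    now = time.time() if now is None else now
    root = os.path.abspath(root)
    deleted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Depth of this directory relative to root (root itself is 0).
        dir_depth = 0 if rel == "." else rel.count(os.sep) + 1
        if dir_depth + 1 < min_depth:
            continue  # files here are too shallow, like -mindepth
        for name in filenames:
            path = os.path.join(dirpath, name)
            age_days = int((now - os.path.getmtime(path)) // 86400)
            if age_days >= min_age_days:
                os.remove(path)
                deleted.append(path)
    return sorted(deleted)
```

Calling `prune_cache("/opt/mwlib/var/cache/")` would approximate the cron command; running it as `www-data`, as the crontab line does with `su`, is left to the caller.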

February 21

  • 19:49 mark: Installed gmond on eiximenis
  • 19:02 domas: db26 lacks 8g of ram :)
  • 19:00 mark: Restarted stuck apache on srv217
  • 17:26 mark: Started apache on srv218-221
  • 17:24 mark: Restarted stuck apache on srv217
  • 17:07 mark: Squid/kernel upgrade complete
  • 16:46 mark: Increased max-connections per upload squid to ms1 to 100
  • 15:58 mark: Running automated upgrade/reboot of squid and kernel on sq43-47
  • 15:58 mark: Upgraded squid and kernel on sq41-42, sq48-50, and rebooted
  • 15:44 mark: Upgraded squid and kernel on sq36-40, and rebooted
  • 12:55 river: fixed reverse dns entries for ms3/ms4, which had got swapped somehow
  • 11:55 Tim: re-enabled ExtensionDistributor
  • 11:16 Tim: removed syslog.0 and messages.0 on srv170 and srv176, they had critical disk free on /
  • 03:25 Tim: started apache on the image scaling servers
  • 02:51 brion: ran sync-common on srv199 while i'm at it
  • 02:48 brion: zeroing out stupid giant syslog files on srv199
  • 02:46 brion: srv199 is out of disk space
  • 02:46 brion: copying hacked-up copies of InitialiseSettings/CommonSettings back to /home so the changes aren't lost this time
  • 02:22 mark: db20 back up, for reals
  • 02:19 mark: Rebooting db20 with upgraded RAID controller firmware
  • 02:13 domas: flashing BIOS helped
  • 02:13 mark: db20 up!
  • 02:03 brion: services on bart (secure, planet) are temporarily offline while server is poked at
  • 01:50 brion: seeing pages, yay
  • 01:49 brion: running apache2ctl start or apachectl start for various apaches
  • 01:47 domas: I FOUND HOW TO REVIVE APACHES
  • 01:46 brion: think i killed em, now trying to restart apache procs
  • 01:43 brion: poking to see if we can restart apaches...
  • 01:42 brion: syncing fixed InitialiseSettings/CommonSettings to apaches
  • 01:14 brion: and flyingparchment
  • 01:14 brion: domas and mark are attempting to restart the NFS server, but aren't mentioning any details in the public channel or log
  • 00:52 domas: http://p.defau.lt/?_M1iGbA0PCz2OOt2_KKPug
  • 00:52 mark: db20 in trouble
  • 00:39 mark: @brion you don't need to wake up
  • 00:36 domas: disabled 2006 fundraising cronjob on amane :-)

February 20

  • 23:31 Rob: upgraded squid and kernel on sq34-sq36
  • 23:12 Rob: upgraded kernel and squid on sq31-sq33, redeployed and online
  • 23:08 brion: updating CentralNotice for improved test script (plus i18n update)
  • 22:54 Rob: upgraded kernel + squid on sq28-sq30
  • 22:29 Rob: completed upgrades to sq25-sq27
  • 22:12 Rob: upgrading kernel and squid versions on sq25-sq27 (if i crash the site, i apologize in advance)
  • 22:08 Rob: upgraded kernel and squid on sq24
  • 21:59 river: added current patches to ms4, set zil_disable=1 and rebooted
  • 21:30 brion: srv31 seems to be down, so no dump activity
  • 21:08 brion: scapping to update FlaggedRevs to r47588 (fixing fatal err)
  • 21:01 Rob: updated kernel and squid on sq23
  • 20:58 Rob: updated kernel and squid on sq22
  • 20:36 Rob: updated kernel and squid on sq20 and sq21
  • 20:25 domas: some apaches in crashloop like this: http://p.defau.lt/?s9YhHD_0qHroVhauBdQb_g
  • 20:09 Rob: restarted apache on srv74
  • 20:03 Rob: upgraded kernel and squid on sq19
  • 19:50 Rob: upgraded kernel + squid on sq18
  • 19:34 Rob: upgraded kernel + squid on sq17
  • 19:19 brion: updating FlaggedRevs to r47574
  • 18:16 river: set zil_disable on ms1 to improve nfs write performance
  • 18:15 mark: Raised max-conns to 50
  • 18:03 mark: Cut down max conns even more (25) for pmtpa upload backend squids
  • 17:40 mark: Limited maximum connections to backend (ms1) to 50 per squid on upload squids, 1000 per squid on text
  • 16:17 domas: plenty of fedoras had futex deadlocks
  • 16:16 Rob: upgraded kernel and squid on sq14 and sq15
  • 15:49 Rob: updated squid and kernel on sq13, rebooted, back online
  • 15:26 Rob: upgraded squid and kernel on sq9-sq12 (not all at the same time)
  • 14:59 Rob: upgraded squid and kernel on sq5, sq6, sq7, sq8
  • 14:51 Rob: upgraded squid and kernel on sq2-sq4
  • 14:50 Tim: updated ContactPage extension, will deploy it on nlwiki shortly
  • 10:52 mark: Reduced cache_mem from 3000 to 2500 for pmtpa upload backend squids - no restart, will take effect with the 2.7 upgrade later today
  • 10:00 mark: Started backend squid on sq26, it was gone

February 19

  • 23:54 brion: updating AbuseFilter to r47523 :P
  • 23:51 brion: updating AbuseFilter to r47522
  • 23:40 brion: updating FlaggedRevs to r47522
  • 23:39 Andrew: Enabled Abuse Filter on MediaWiki.org
  • 23:17 mark: Stopped experimental varnish on sq1, please keep Squid off as well
  • 22:52 Andrew: Allowed bureaucrats to remove 'sysop' right on testwiki.
  • 22:42 brion: updating includes/api to r47522 to fix a couple regressions
  • 22:15 mark: Started an experimental varnish instance on sq1 port 80
  • 21:22 mark: Stopped Squids on sq1
  • 14:23 Tim: removing memcached from srv154,srv155,srv157,srv158,srv169,srv170
  • 14:18 Tim: started memcached on srv190-199
  • 14:06 mark: Added "vport=80" to the http_host directive on all backend squids, to force Squid to use the default HTTP port, 80
  • 10:53 domas: livemerged r47483 (backlinks cache read explicit order, :( )
  • 07:56 Tim: restarted job runners with 4 processes per server instead of 1. Db2 is now heavily loaded, apparently due to the SELECT queries involved in the large numbers of unnecessary refreshLinks2 jobs that were queued before r47478 went live. But they should be done in a few hours at this rate.
  • 05:00 Brion: enabling Collection on fr, pl, nl, pt, es, simple Wikipedias
  • 02:12 Tim: deploying r47478

February 18

  • 22:41 Andrew: morebots back up, now logs to identi.ca with the name wikimediatech
  • 22:38 tomaszf: installed srv208 with Ubuntu 8.10.1 and installed app server software.
  • 22:12 domas: Andrew killed morebots. let's see how he fixes it... :)
  • 21:59 Rob: PDF creation moved to pdf1
  • 21:58 Rob: changed pdf generation from erzurumi to pdf1, testing.
  • 19:21 Rob: srv255 changed to pdf1 and moved, drac setup along with dns resolution
  • 19:19 brion: scapping
  • 19:18 brion: svn up'ing test to r47457
  • 18:37 Rob: reinstalling srv209 due to dhcp misconfiguration making it think it was srv208
  • 15:13 mark: Restarted all upload frontend squids to get rid of the memleaking
  • 14:20 mark: Blocked all non-GET/HEAD HTTP methods in requests to upload frontend squids
  • 12:46 Tim: put r47447 live for temporary proposed fix of bug 17552
  • 08:38 Tim: svn up r47434 to fix Special:BrokenRedirects
  • 08:04 Tim: cleaned up binlogs on db2
  • 06:33 brion: note there's a live hack in api categorymembers query which may be breaking lookups
  • 05:54 Tim: set up bugzilla attachment_base, pointing to the new domain http://bug-attachment.wikimedia.org/, and set allow_attachment_display=on
  • 05:51 brion: disabling $wgTorTagChanges in CommonSettings after the ext gets loaded (needs fix for testwiki)
  • 05:46 brion: syncing reverted expr.php w/o bc stuff
  • 05:25 brion: syncing extensions/FlaggedRevs/specialpages/OldReviewedPages_body.php fix
  • 05:24 brion: syncing fix to Expr.php for bcpow() error
  • 05:16 brion: syncing fix to extensions/ParserFunctions/Expr.php
  • 04:59 brion: starting scap process...
  • 04:52 brion: svn up'ing test to r47418
  • 04:45 brion: svn up'd test to 47417
  • 04:30 brion: removing editor, reviewer from add/remove for all users in test. that was an old test not needed anymore :D
  • 03:42 brion: rc tags tables created sitewide; should be safe to scap and check for final problems if we're brave
  • 03:35 brion: applying patch-change-tags to all wikis
  • 02:57 brion: ran patch-change_tag.sql on testwiki
  • 02:52 brion: full svn up'ing for test wiki
  • 02:06 brion: worked around breakage with pager base class incompat with latest codereview :P
  • 01:52 brion: svn up'ing CodeReview to aid in completing code review ;)

February 17

  • 23:58 Rob: srv217-srv223 installed and online as apache servers. Updated dsh groups and nagios, as well as pybal
  • 23:24 Rob: installed OS on srv217-srv223, moving on to package installation.
  • 21:12 Rob: reinstalling srv209, which thought it was srv208. silly server. srv208 has not been installed, gave to tomasz to check against setup checklist.
  • 21:05 Rob: actually, srv209 installed as 208, bad dhcp entry. Fixing
  • 21:04 Rob: pulling srv208 and srv209 for quick reboots, their drac ips are wrong.
  • 21:04 Rob: racked srv217-223 (also racked srv224/225 but no power yet)
  • 18:30 brion: starting a batch run of update-special-pages-small just to ensure it actually works
  • 18:25 brion: fixed hardcoded /usr/local path for PHP and use of obsolete /etc/cluster in update-special-pages and update-special-pages-small; removing misleading log files (bugzilla:17534)
  • 03:19 Tim: removed live hack updating MW_DIFF_VERSION, changed on December 30 and the cache expiry is a week. Should not cause a significant amount of load.
  • 03:01 Tim: removed live hacks from extension/Cite, updated to r47350.
  • 01:49 Tim: deleting all enotif jobs from the job queue, there is still a huge backlog

February 16

  • 16:46 mark: Did emergency rollback of squid 2.7.6 to squid 2.6.21 because of incompatible HTTP Host: header
  • 16:21 Rob: stopped upgrades, sq36 completed before stop
  • 16:17 Rob: performing upgrades to sq35-sq38 (not depooling in pybal, letting pybal handle that automatically)
  • 16:16 Rob: performed dist-upgrade on sq31-34
  • 15:35 Rob: depooled sq31-sq34 for upgrade
  • 08:12 Tim: patched in r47309, Article.php tweak
  • 05:00 Tim: made runJobs.php log to UDP instead of via stdout and NFS
  • 04:53 Tim: fixed incorrect host keys in /etc/ssh/ssh_known_hosts for srv38, srv39 and srv77
  • 04:13 Tim: removing all refreshLinks2 jobs from the job queue, duplicate removal is broken so to clear the backlog it's better to just run maintenance/refreshLinks.php

February 15

  • 21:59 mark: Experimentally blocked non GET/HEAD HTTP methods on sq3 frontend squid
  • 16:15 mark: Upgraded PyBal on lvs2 - others will follow
  • 13:11 domas: db23 has multiple MCEs for same dimm logged: http://p.defau.lt/?IarKD4gbFhe5RmaV0RB_Xg
  • 12:38 domas: in wikistats, placed older than 10 days files into ./archive/yyyy/mm/ - maybe will make flack crash less :))
  • 11:56 mark: Doing Squid memleak searching on sq1 with valgrind, pooled with weight 1 in LVS
  • 03:09 Andrew: CentralNotice still not working properly, and when we tried to set it to testwiki-only, it never came up. Left it on testwiki only for the time being, until somebody who knows CentralNotice can take a look at it.
  • 02:21 Tim: fixed permissions on the rest of the logs in /home/wikipedia/logs/norotate (fixes centralnotice)
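The 12:38 entry above files wikistats files older than 10 days under ./archive/yyyy/mm/. A minimal sketch of that kind of sweep follows. The log does not say how yyyy/mm was chosen, so taking it from each file's modification time (UTC) is an assumption, and the name `archive_old_files` is hypothetical.

```python
import os
import shutil
import time


def archive_old_files(directory, max_age_days=10, now=None):
    """Move regular files in directory that are older than max_age_days
    into directory/archive/YYYY/MM/, keyed on each file's mtime (UTC).

    Returns the list of destination paths, in name order.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    moved = []
    for name in sorted(os.listdir(directory)):
        src = os.path.join(directory, name)
        if not os.path.isfile(src):
            continue  # skips ./archive itself and any other subdirectory
        mtime = os.path.getmtime(src)
        if mtime >= cutoff:
            continue  # still recent enough to stay in place
        stamp = time.gmtime(mtime)
        dest_dir = os.path.join(
            directory, "archive", f"{stamp.tm_year:04d}", f"{stamp.tm_mon:02d}"
        )
        os.makedirs(dest_dir, exist_ok=True)
        dest = os.path.join(dest_dir, name)
        shutil.move(src, dest)
        moved.append(dest)
    return moved
```

Keeping the top-level directory small is the point: anything that lists or scans it then touches only the last ten days of files.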

February 14

  • 19:19 Az1568_: re-enabled CentralNotice on testwiki to try and find the problem (we've had this before, but fixed it somehow...possibly with a regen? See November 16th log.)
  • 18:34 domas: filed a bug at https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/329489 - could use some Canonical escalation too
  • 18:26 domas: same affected srv47 - this is related to switching locking to fcntl() - this drives apparmor crazy
  • 17:47 domas: srv178 kernel memleaked few gigs. blame: apparmor
  • 14:34 domas: srv215 very much dead, doesn't show vitality signs even after serveractionhardreset
  • 14:28 domas: correction, srv208.mgmt is pointing to uninstalled box
  • 14:27 domas: DRAC serial on all new boxes is ttyS1 which is not in securetty
  • 14:24 domas: srv209.mgmt is actually srv208's SP, and srv208.mgmt is pointing to dead box
  • 14:15 domas: srv209,215 down?
  • 13:43 domas: installing php5-apc-3.0.19-1wm2 (no more futexes) on all ubuntu appservers.
  • 10:02 Andrew: Reports that CentralNotice broke on all wikis, displaying just the message name in angle brackets, even though the message existed on meta. I have no idea what caused it and I couldn't find anybody who knows anything about it, so I disabled the notice itself on Special:CentralNotice on meta. Somebody who knows what they're doing should probably look into it later.

February 13

  • 22:10 mark: esams squid upgrade complete
  • 21:05 RobH: deployed srv207-srv216 in apaches cluster
  • 20:34 RobH: added new servers to nagios and restarted it
  • 20:15 RobH: setup all node groups, ganglia, apache, so on for srv199-srv206 and added into rotation
  • 19:38 mark: Upgrading esams squids to 2.7.6
  • 18:36 mark: Upgraded squid on sq1 to 2.7.6 and rebooted the box
  • 18:03 mark: Memory leak issues on the upload frontend squids, which started in November
  • 18:01 RobH: sq13 back online, seems there is a memory leak, go mark for finding =]
  • 17:54 RobH: lomaria install done for domas
  • 17:49 RobH: rebooting sq13 due to it failing out in ganglia, OOM error evident.
  • 17:48 RobH: reinstalling lomaria per domas request
  • 17:37 RobH: sq8 was unresponsive to console, locked up, rebooted, cleaned cache, and bringing back online
  • 17:34 RobH: srv38 and srv39 back in rotation
  • 17:23 RobH: srv38 and srv39 reinstalled, installing packages now
  • 16:57 RobH: reinstalling srv38/srv39
  • 16:57 RobH: srv80 reinstalled as ubuntu apache and back in rotation
  • 16:31 RobH: srv79 back in rotation
  • 16:21 RobH: srv79 reinstalled, installing packages and ganglia
  • 16:12 RobH: reinstalling srv79
  • 16:00 RobH: ganglia installed on srv77, back in rotation
  • 15:55 RobH: srv77 redeployed as ubuntu apache server
  • 15:48 RobH: reinstalling srv77 to ubuntu

February 12

  • 23:59 brion: adding 'helppage' to ui-content messages on commons per bugzilla:5925
  • 23:01 RobH: racked and setup drac for srv208-srv216
  • 21:20 mark: Killed blocked apache processes on srv180, and restarted apache
  • 21:19 mark: Killed blocked apache processes on srv172, and restarted apache
  • 21:07 brion: fixed ownership on log files for updateSpecialPages cronjob, which likely is what broke it
  • 20:28 mark: Upgraded experimental squid 2.7.5 on knsq1 to squid 2.7.6
  • 20:00 brion: fixed typo which broke access to revision deletion log for oversighters. tx to aaron for the spot :D
  • 19:45 mark: Replaced "2 cpu apaches" group aggregator srv32 by srv35
  • 18:55 RobH: racked, wired, and remote management setup for srv199-srv207
  • 09:51 domas: added srv190-srv198 to apaches dsh group, as they seem to be alive and kicking
  • 09:48 domas: changed weights for srv190-srv198 80->100 (to account for 1.85->2.5 ghz cpu step )
  • 00:29 brion: running updateRestrictions on wikis to clean up remaining funky restrictions entries per bugzilla:16846
  • 00:22 Tim: restarted apache on srv172

February 11

  • 23:23 mark: Pooled srv190-198
  • 23:23 Tim: re-enabling search suggestions
  • 23:19 mark: Installed Ganglia on srv190-198
  • 23:17 mark: Installed MediaWiki application server packages on srv190-198
  • 23:02 mark: Added srv190-198 to mediawiki_installation node_group (not any others)
  • 22:55 mark: Ran dist-upgrade && reboot on srv190-198
  • 22:46 mark: OS installed on srv190-198
  • 22:19 RobH: racked and setup drac on srv195-srv198
  • 22:11 RobH: racked and setup drac on srv192, srv193, srv194
  • 22:00 RobH: racked and setup drac on srv190, srv191
  • 21:24 brion: putting ixia back in rotation, it's caught up
  • 20:05 brion: depooling ixia while it catches up
  • 20:05 brion: ixia lagged 8810 secs
  • 20:00 brion: ixia replication is broken -- causing contribs lag on itwiki
  • 19:19 RobH: setup msw-a5-sdtpa like 30 minutes ago, opps ;]
  • 19:00 mark: Added srv190-225 to DNS & DHCP
  • 18:55 mark: set up RANCID for asw-a4-sdtpa and asw-a5-sdtpa
  • 18:54 brion: disabled srv38,39,77,79,80 in lvs3 pybal config to ensure they don't go back into service accidentally until fixed up
  • 18:37 brion: stopping apache on those bad machines for the moment
  • 18:35 brion: srv38, 39, 77, 79, and 80 appear to have been prematurely put into apaches pool, running old version of PHP. need to be halted and upgraded
  • 17:26 domas: restarted apache on srv154 after the deadlock in apc
  • 16:04 Tim: disabled checkers.php hack, using mwsuggest.js hack instead
  • 15:52 Tim: emergency optimisation: disabled search suggest via checkers.php
  • 15:41 domas: srv159 restarted as proper apache, not -DSCALER
  • 09:02 domas: moved morebots to ~morebots@wikitech.wikimedia, startup line in rc.local :)
  • 07:05 Tim: running maintenance/fixBug17442.php
  • 06:56 Tim: restarted job runners
  • 04:31 Tim: upgraded bugzilla to 3.0.8 with cvs up, and copied in the docs directory from the 3.0.8 tarball
  • 03:31 Tim: gave myself an account on isidore, cleaned up some crap in /srv/org/wikimedia to /srv/org/wikimedia/backup
  • 02:58 Tim: apt-get upgrade on isidore

February 10

  • 23:47 mark: Moved upload esams LVS from mint to hawthorn
  • 23:41 mark: Installed a specially compiled LVS Feisty kernel on hawthorn (running Hardy) & rebooted
  • 22:33 RobH: updated mwlib on erzurumi per brion
  • 22:25 RobH: some resets and such on searchidx1 to get ssh working. system is very sluggish.
  • 19:28 brion: wikitech server crashed; CPU pegged and OOM. rob rebooted it, yay
  • 02:46 Tim: running maintenance/fixBug17300.php to create missing redirect table entries
  • 01:18 Tim: reverted PP caching patch
  • 01:14 Tim: re-enabled search suggestions

February 9

  • 23:13 domas: grunt session finished
  • 23:10 domas: brought up srv80 from hibernation and made it work.
  • 22:53 domas: added srv61 too
  • 22:23 domas: added srv144 and srv147 to duty, added ganglia stuff too
  • 22:01 domas: started appserver work on srv77,srv79
  • 21:54 domas: started srv35,38,49 as appservers, restarted deadlocked srv49 processes
  • 16:14 mark: Moved upload LVS back from hawthorn to mint - even an optimized 2.6.24 kernel is not fast enough to serve upload LVS
  • 16:03 Tim: disabled search suggest as an emergency optimisation measure
  • 16:02 mark: Rebooted hawthorn with an LVS optimized kernel, moved upload LVS back to it
  • 15:53 mark: Moved upload esams LVS back to mint
  • 15:37 mark: Moved upload.esams LVS from mint to hawthorn
  • 15:28 mark: Reinstalled server hawthorn with Hardy 8.04
  • 13:55 domas: fixed ganglia group for srv159 (it is scaler, not appserv)
  • 13:51 domas: brought srv182 up
  • 13:32 domas: repooled srv104 and srv105, after few months of vacation
  • 13:20 domas: killed few orphaned tidy processes that were very very busy since Feb1
  • 13:13 domas: heeheee, extorted this: [15:11] <rainman-sr> so, srv77,79,80, rose, coronelli and maurus could be converted to apaches
  • 12:36 Tim: trying apc.localcache=1 on srv176
  • 04:27 Tim: patching in r46936
  • 03:48 Tim: attempting to reproduce APC lock contention on srv188

February 8

  • 22:43 brion: may or may not have fixed that -- log file was unwritable. hard to test the command since 'su' bitches about apache not being loginabble on hume :P
  • 22:39 brion: investigating why centralnotice update is still broken. getting fatal php errors wtf?
  • 20:17 domas: we were hitting APC lock contention after some CPU peak. Dear Ops Team, please upgrade to APC with localcache support. :)))))

February 7

  • 22:49 domas: db17 came up, but it crashed with different symptoms than other boxes, and it was running 2.6.28.1 kernel. might be previous hardware problems resurfacing
  • 22:47 brion: chmod'ing centralnotice JS output on ms1 so batch processes running as 'apache' user can actually update them. hadn't been getting updated since february 5, leading to complaints when the swedes updated a translation on the steward banner
  • 21:23 domas: db17 down

February 6

  • 12:33 brion: stopped that process since it was taking a while and just saved it as an hourly cronjob. :) log to /opt/mwlib/var/log/cache-cleaning
  • 12:28 brion: running mw-serve cache cleanup for files older than 24h

February 5

  • 18:19 brion: put ulimit back with -v 1024000 that's better :D
  • 18:18 brion: removed the ulimit; was unable to reach server with it in place
  • 18:15 brion: hacked mw-serve to ulimit -v 102400 on erzurumi, see if this helps with the leaks for now
  • 16:56 domas: rebooted erzurumi, placed swap-watchdog ( http://p.defau.lt/?mELQFcwRSvYRYdiIR9pvKQ ) into rc.local
  • 16:03 mark: Added Qatar (634) to the list of esams countries
  • 01:27 Tim: migrated arzwiki upload directory from amane to ms1
  • 01:00 Tim: fixed arzwiki upload directory permissions
  • 00:56 Tim: moved most cron jobs from admin user cron tabs to /etc/cron.d on hume
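The 18:15 to 18:19 entries above differ by a single digit, and the reason it matters is the unit: bash's `ulimit -v` counts in blocks of 1024 bytes, so 102400 caps virtual memory at 100 MiB while 1024000 allows 1000 MiB. The sketch below shows the conversion and one way to apply such a cap to a child process from Python. The helper names are hypothetical, the `resource` module is Unix-only, and how mw-serve itself was wrapped is not recorded in this log.

```python
import resource
import subprocess

KIB = 1024  # `ulimit -v` takes its argument in 1024-byte units


def ulimit_v_to_mib(limit_kib):
    """Convert a `ulimit -v` argument to MiB."""
    return limit_kib * KIB / (1024 * 1024)


def run_with_vm_limit(argv, limit_kib):
    """Run argv with its address space capped, like `ulimit -v limit_kib`.

    RLIMIT_AS is expressed in bytes, so the KiB value is scaled first.
    """
    limit_bytes = limit_kib * KIB

    def apply_limit():
        # Runs in the child just before exec; the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit)
```

`ulimit_v_to_mib(102400)` gives 100.0 and `ulimit_v_to_mib(1024000)` gives 1000.0, consistent with the server becoming unreachable under the first value and usable under the second.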

February 4

  • 22:33 tomaszf: Adding cron for torblock under tfinc@hume
  • 22:20 tomaszf: ran loadExitNodes() to update tor block list
  • 18:36 brion: running TorBlock/loadExitNodes.php
  • 17:25 brion: stripped BOM from en.planet config.ini; re-running.
  • 17:24 brion_: attempting to run planet update for en.planet manually..... there's a config error
  • 16:30 domas: stealing db27 for moar tests

February 3

  • 13:05 mark: Remote-hands replaced some cables, fuchsia is back up but idling
  • 06:57 Tim: doing some schema changes on the otrs database. Some fields should be blobs and are text instead, perhaps due to a previous 4.0 -> 5.0 MySQL upgrade
  • 01:48 Tim: added blob_tracking table to ukwikimedia
  • 01:42 Tim: repooled db3 and db4
  • 00:34 mark: Moved traffic back
  • 00:28 mark: Shutdown switchport of fuchsia in order to prevent it from interfering with mint (which took up text LVS as well as upload)
  • 00:20 mark: Moved European traffic to pmtpa - text LVS unreachable

February 2

  • 23:54 domas: took out db29 for some testing
  • 22:07 mark: Modified Exim configuration on williams to not discard but deliver spam-recognized messages to OTRS with an X-OTRS-Queue: Junk header, as well as SpamAssassin headers
  • 21:35 brion: reverting change to Cite_body.php
  • 21:28 brion: caching for cite refs is known to cause problems with links randomly replacing with other links; likely strip marker problem. andrew is investigating
  • 19:31 domas: merged in Andrew's Cite cache to live site
  • 16:47 brion-sick: syncing update to Collection to do more efficient sidebar lookups
  • 16:18 brion-sick: large spike in text backend service times
  • 16:15 brion-sick: secure.wikimedia.org is returning 503 Service Temporarily Unavailable
  • 08:11 Tim: removing ancient static HTML dump from srv31
  • 08:05 Tim: removed cluster13 and cluster14 from db.php, will watch exception.log for attempted connections
  • 08:02 Tim: removed srv130 from LVS and the apaches node group, not accessible by ssh but still serving pages
  • 07:56 Tim: find /home/wikipedia/logs -size 0 -delete
  • 07:43 Tim: re-added db22 to s1 rotation, no explanation for its removal in server admin log
  • 06:39 Tim: dropped the otrs_test database
  • 06:38 Tim: moved the OTRS database from otrs_real back to otrs. Updated exim4 config on mchenry
  • 04:23 Tim: db10's relay log was corrupted, did a flush slave/change master
  • 01:10 Tim: started mysqld on db23, doing recovery
  • 00:59 Tim: rebooted db23
  • 00:56 Tim: db23 down, depooled
  • 00:05 Tim: adjusted innodb configuration on db10, restarted, starting replication
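The 07:56 cleanup above (`find /home/wikipedia/logs -size 0 -delete`) removes empty files under the log directory. A rough Python equivalent is sketched below. The function name `delete_empty_files` is made up for illustration and was not part of the actual cleanup, and unlike `find` it deliberately touches regular files only.

```python
import os
from pathlib import Path


def delete_empty_files(root):
    """Delete zero-byte regular files under root; return the removed paths."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks and anything that is not a plain file.
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_size == 0:
                path.unlink()
                removed.append(str(path))
    return sorted(removed)
```

Returning the removed paths gives something to paste into the log afterwards, which the bare `find ... -delete` does not.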

February 1

  • 23:40 Tim: OTRS recovery script done
  • 22:13 brion: updating rowikibooks logo bugzilla:17273 (note the log bot is down again)
  • 21:25 Tim: running script to copy deleted OTRS data from db10
  • 20:40 mark: Lily was overloaded due to the long downtime of mchenry, stalling all mailing list deliveries
  • 20:39 mark: Granted SELECT access to mchenry and williams for database otrs_real - they've been giving temp rejects for hours
  • 11:24 Tim: mysqld on db10 crashes when it tries to run the current replicated query. Probably needs a resync. Set --skip-slave-start
  • 10:05 Tim: updated OTRS DB name on mchenry
  • 09:53 Tim: reading in SQL backup
  • 09:33 Tim: moving the otrs database to otrs_real to allow easier binlog import
  • 03:52 Tim: done 1 and 2
  • 03:10 Tim: recovery plan is as follows: 1. re-enable r/w web access, 2. compile a list of deleted IDs from the binlogs (confirmed that this is possible), 3. read in the pre-upgrade backup to a separate DB and execute binlogs to the appropriate point, 4. copy affected IDs from the backup to the live DB
  • 02:52 Tim: patched GenericAgent.pm to prevent ticket deletion
  • 02:27 Tim: it seems some admin inserted a GenericAgent job called "temp1" at 09:46 with the effect of deleting all tickets older than 30 days. The binlogs show a duplicate "Valid" key, with one row setting it to 0 and the next setting it to 1, so it's possible the user set valid=0 in the UI but due to a bug in OTRS, the job was considered valid. The job appears to have been run first at 09:46, probably from the web, then regularly at 10 minute intervals, most likely due to the cron job on bart which was not deactivated. I've now removed the relevant crontab and revoked bart's OTRS permissions.
  • 01:11 Tim: put an explanatory note on the OTRS login screen and deleted all sessions to send users there
  • 00:38 Tim: revoked write access from the otrs mysql user, to prevent any further damage. Making a copy of the binlogs. The plan is to do forensics first and then recovery second.
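Step 4 of the 03:10 recovery plan (copy affected IDs from the restored backup into the live DB) might look roughly like the sketch below. It uses in-memory SQLite purely as a stand-in for the MySQL databases. The `ticket` table, its columns, `restore_deleted` and `make_db` are hypothetical names, not the actual OTRS schema or the script that was run.

```python
import sqlite3


def restore_deleted(backup, live, deleted_ids):
    """Copy rows for deleted_ids from backup to live; return how many were restored."""
    restored = 0
    for ticket_id in deleted_ids:
        row = backup.execute(
            "SELECT id, title FROM ticket WHERE id = ?", (ticket_id,)
        ).fetchone()
        if row is None:
            continue  # not in the backup, e.g. created after it was taken
        exists = live.execute(
            "SELECT 1 FROM ticket WHERE id = ?", (ticket_id,)
        ).fetchone()
        if exists:
            continue  # still present in the live DB, leave it untouched
        live.execute("INSERT INTO ticket (id, title) VALUES (?, ?)", row)
        restored += 1
    live.commit()
    return restored


def make_db(rows):
    """Build a throwaway in-memory database holding the given (id, title) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ticket (id INTEGER PRIMARY KEY, title TEXT)")
    db.executemany("INSERT INTO ticket VALUES (?, ?)", rows)
    return db
```

Skipping IDs that still exist in the live DB keeps the copy safe to re-run if it is interrupted partway through.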

January 31

  • 18:17 mark: Following reports of OTRS rapidly deleting old tickets/emails every ~ 10 minutes, I disabled (set to invalid) all GenericAgent jobs pending investigation
  • 15:43 mark: Set local_from_check = false in exim.conf on williams, to prevent Sender headers from being added (annoying for Outlook users)
  • 07:11 Tim: converting OTRS database to proper UTF-8 (instead of UTF-8 in latin1 fields) using ~/fix-schema.php
  • 01:30 brion: updating eswikibooks logo bugzilla:17078
  • 00:55 brion: setting mswikibooks logo bugzilla:17263
  • 00:53 brion: copied wikimedia favicon to blog.wikimedia.org bugzilla:17171
  • 00:51 domas: lomaria needs reinstall, db24 and db30 are live in s2 duty

January 30

  • 17:54 domas: *giggle*, booted up lomaria with SMP kernel
  • 17:43 domas: lomaria kernel detects just one CPU (out of four)
  • 17:26 domas: converted lomaria into dewiki-only server
  • 14:20 Tim: Done with OTRS for now. Some bugs remain, particularly the missing ticket list in AgentTicketCustomer. I'll probably have to downgrade to 2.3.x tomorrow.
  • 12:51 mark: Installed ganglia on williams
  • 11:50 mark: Letting OTRS mail through to williams on mchenry
  • 10:50 Tim: running upgrade of OTRS DB
  • 10:44 mark: Removed all OTRS test copies in the queue of williams
  • 10:42 mark: Deferring all OTRS mail on the queue of mchenry
  • 10:30 mark: Put in a quick hack to forward misrouted OTRS mails from williams to bart
  • 08:52 Tim: sent upgrade warning email to all OTRS agents
  • 06:56 Tim: RCT should be finished now, no more connections are expected on cluster13 or 14. Current connection counts: 123943575, 295618929.
  • 02:36 Tim: set up SSL on williams and switched ticket.wikimedia.org DNS to point to there
  • 02:21 brion: set up new SSL cert for ticket.wikimedia.org; tim's poking at installing it
  • 02:19 brion: updated password on tridge *cough*
  • 01:43 brion: syncing update to Drafts with IE 7 fix (r46571 and style ver update)
  • 00:16 brion: live-merging r46570 -- fixes to DB access in revisiondelete

January 29

  • 22:55 mark: Did s/knams/esams/ on the selective AAAA answer config of ns0/ns1/ns2.wikimedia.org
  • 22:47 mark: While messages are held in the queue on williams, use "mailq" to view the queue, and "exim -M <messageid>" to let an individual message through for testing
  • 22:44 mark: SpamAssassin training from the OTRS Junk queue not yet setup
  • 22:43 mark: Note: Exim on williams queries for mail addresses from the live OTRS database, not the test database
  • 22:42 mark: Completed OTRS mail setup on williams. wikitech documentation updated in OTRS and Mail. OTRS mail is still copied to williams, and then held on the queue.
  • 22:00 mark: Added db10 as secondary DB to query for Exim on mchenry
  • 21:59 mark: Granted SELECT privileges on otrs.system_address to exim@williams on db9/db10
  • 21:58 brion: enabling revision & log suppression for oversighters
  • 21:12 brion: live-merging r46429 change to Special:Contributions -- stub marking fix
  • 21:01 mark: Copying OTRS mail to williams, where it's automatically held in the queue without extra processing; useful for testing
  • 21:00 mark: Installed SpamAssassin on williams for OTRS, copied training data from bart
  • 20:14 recompressTracked.php finished
  • 19:18 brion: aborted old enwiki dump so a fresh one can start, since that old history will never finish on the old system
  • 19:17 brion: updated data dump scripts
  • 17:57 brion: disabled 'mark patrolled' link for views without specific rcid param; but now it's back when we actually ask for it so actual rc/new pages patrol works again http://rafb.net/p/puGHC095.html
  • 17:54 brion: poking at patrol link live hack
  • 17:40 brion: erzurumi is rebooted and serving out PDFs again. need to implement some resource limits...
  • 17:35 brion: rebooting erzurumi via drac
  • 17:32 brion: i hate the drac shell
  • 17:24 brion: erzurumi appears to have been victim to a massive memory leak. seeing if we can reboot it
  • 17:17 brion: poking at mw-serve on erzurumi; not responding
  • 16:15 domas: livehacked out 'patrol' link on article views %)
  • 04:02 Tim: added DNS entry for OTRS test
  • 03:19 tomaszf: installed grosley
  • 01:31 Tim: fixed srv76 and the wikimedia-task-appserver package
  • 01:31 brion-busy: syncing r46513 -- fix for categoryfinder, update to fix for Collection
  • 01:14 brion-busy: updating Collection ext -- compat issue with changed category
  • 00:56 brion-busy: stopped apache on srv76 for the moment
  • 00:55 brion-busy: srv76 doesn't have upload5 mounted
  • 00:41 brion: live-hacking out a broken check in getDupeWarning() which broke uploading if you had a duplicate file
  • 00:34 mark: DOM readouts on br1-knams:
br1-knams#sh optic 1
 Port Temperature    Tx Power       Rx Power    Tx Bias Current Monitor
+----+-----------+--------------+--------------+---------------+-------+
  1/1   24.0078 C    000.7776 dBm                  84.360 mA    Disabled
  1/2   N/A            N/A            N/A            N/A            
  1/3   37.0000 C   -003.4582 dBm  -003.8111 dBm   58.470 mA    Disabled
  1/4   32.0234 C    000.4669 dBm                  71.928 mA    Disabled
  • 00:22 Tim: synced nagios config

January 28

  • 23:40 mark: s/knams/esams/ in DNS geobackend files
  • 23:25 mark: Deployed fix in /lib/lsb/init-functions on sanger, mchenry, williams and lily which caused (amongst others) Exim reloads (-HUP) to be turned into a kill -TERM (Debian bug #434756)
  • 23:15 mark: Set up basic mail system for OTRS on williams. Still incomplete and needs fine tuning and testing, spam checking is not yet implemented amongst other things.
  • 22:30 mark: Restarted Exim on sanger, disappeared mysteriously
  • 21:50 mark: Raised Dovecot max login process count from 128 to 1024
  • 21:04 brion: merging reupload fixes: r46479, r46483, r46487
  • 20:49 mark: Base OS install finished on williams.wikimedia.org
  • 20:02 brion: merging r46472 (FlaggedRevs autopromote fix), r46464-46476 (feed RTL style fix, re-upload disabled field fix)
  • 18:05 RobH: setup mail relay for wikimedia.cz for Danny and Co  ;]
  • 08:43 domas: s3 replication switched from db1-bin.325:437169827 to db11-bin.026:79
  • 08:35 domas: s2 rep switched from ixia-bin.150:119337662 to db13-bin.004:79
  • 06:15 Tim: creating backup of db10 on storage2
  • 04:29 brion: svn up'ing and scapping to r46424 consistently
  • 04:22 brion: updating FlaggedRevs to r46422
  • 04:17 brion: merging r46419, r46421 -- search display fixlets
  • 03:51 brion: attempting scap again; tweaking DataCenter.ui.php since the scap syntax checks are whinging about the abstract static method o_O
  • 03:40 brion: scapping to r46413
  • 01:35 brion: svn up'ing to r46413 on test...

January 27

  • 19:28 brion: syncing updates to Collection
  • 19:04 brion: scapping update to AbuseFilter for test. updated its schema...
  • 18:44 brion: db16 lagged 2188s
  • 18:44 brion: restarting slave thread on db16. it got stopped with a lock wait timeout on a page_touched update (wtf?!)
  • 18:43 brion: slave stopped on db16
  • 17:41 mark: knsq1 Up and serving requests with squid 2.7.5
  • 17:25 mark: Trying squid 2.7.5 on knsq1 - might be unstable in the mean time
  • 17:22 mark: Reduced cache_mem on backend esams text squids from 3000 to 2500
  • 16:23 RobH: srv76 had a failed hdd, replaced, reinstalled, and bringing back into rotation
  • 16:18 RobH: srv146 was powered down (heat issue?), powered back up, synced and now in rotation.
  • 16:09 RobH: srv139 didnt have apache running, synced and started
  • 16:01 RobH: srv129 didnt have apache running, synced and started
  • 15:59 RobH: sq11 back online, cleaned
  • 15:40 RobH: srv126 back online. possible bad disk, if it crashes again, the disk needs replacement. (it went read only before, which seems to sometimes happen even when the disks are not bad.)
  • 15:25 RobH: srv76 wont boot up, reinstalling.
  • 15:12 RobH: srv130 coming back online, updated fstab, synced, putting it back in rotation.
  • 15:05 RobH: moved ts-array4 to its dedicated ports, now its kate's problem ;]
  • 14:49 Tim: restarted recompressTracked.php
  • 14:33 Tim: henbane's disk has been full for 8 days due to donate-campaign.log, starting cleanup
  • 14:18 Tim: killed recompressTracked.php
  • 14:08 domas: removed unnecessary ms1 stat from CommonSettings.php. Recovery observed. ( diff )
  • 13:44 mark: CARP weight redistribution caused large load spike in upload backend request, causing ms1 overload, probably causing issues on apaches via NFS, etc etc...
  • 13:29 mark: Lowered CARP weight from 10 to 5 for sq1-10.wikimedia.org, from 15 to 10 for sq11-15
  • 08:20 Tim: depooled db3 and db4 to improve recompressTracked speed
  • 07:09 Tim: There was a bug in recompressTracked.php which caused the last batch of orphans for any given wiki to be skipped. Re-running recompressTracked.php to repair it.
  • 05:55 Tim: killed all job runners, changed the job-runners group to srv151-180, started job runners on those servers
  • 05:50 Tim: migrated job runner scripts to ubuntu and started job runners on srv110-119
  • 05:29 Tim: started job runner on srv89
  • 02:13 brion: updating extensions/AbuseFilter/Views/AbuseFilterViewList.php (mysql 4 compat issue)
  • 02:04 brion: installed release versions of mwlib on erzurumi and restarted. these should have updated localizations
  • 01:48 brion: turning AbuseFilter on on test.... having some mysql 4.0 compat issues. poking
  • 01:47 brion: srv31 seems very sad; slow/borked login?
  • 01:39 brion: scapping to update AbuseFilter to current
  • 01:27 brion: prepping testing of AbuseFilter on test.wikipedia
  • 00:46 brion: enabling Collection also for de.wikisource per frank's req passed on from community
  • 00:36 brion: adding NS_HELP to $wgCollectionArticleNamespaces
  • 00:12 brion: Collection extension being enabled on dewiki

January 26

  • 22:39 RobH: UK Chapter wiki setup per https://bugzilla.wikimedia.org/show_bug.cgi?id=16996
  • 22:18 RobH: pushed apache changes for uk chapter wiki
  • 22:13 RobH: updated dns for uk chapter wiki
  • 19:29 brion: going to update Collection to current trunk in prep for further activation today
  • 17:01 RobH: added support for the phone server to dns

January 25

  • 12:18 mark: Announcing routes to AS16265 again
  • 10:17 domas: our deadlocks are described in X4240 manuals. the fix is either disabling MSI or setting 'options forcedeth max_interrupt_work=15' in modprobe.conf. product notes
  • 09:31 domas: db17 live, with 2.6.28.1 kernel

January 24

January 23

  • 18:04 brion: putting load back on db3, it's up to date
  • 17:49 brion: taking some load off db3 until it catches up
  • 17:46 brion: also killed a WantedTemplatesPage::recache query which had been running for a day. that ain't sustainable. :P
  • 17:44 brion: domas restarted morebots a few minutes ago :D
  • 17:43 brion: syncing update to ApiQueryBacklinks.php with the USE INDEX that was added for this problem
  • 17:41 brion: killing some stray backlinks queries
  • 17:38 brion: ~1-hour lag on db3
  • morebots is broken/down? unable to edit

January 22

  • 00:10 brion: whitelisting .ott (OpenDocument templates) for private-wiki uploads

January 21

  • 20:25 RobH: some tinkering on http redirects, rollback
  • 17:51 RobH: setup https for wikitech
  • 17:23 RobH: setup wikitech to stream weekly backups to tridge
  • 10:29 domas: db28 powered down because of temperature reading over threshold (45C???)

January 20

  • 21:45 RobH: killed some run away processes on db9 that were killing bugzilla
  • 21:44 brion: stuck long queries on bz again. got rob poking em
  • 20:31 brion: putting $wgEnotifUseJobQ back for now. change postdates some of the spikes i'm seeing, but it'll be easier to not have to consider it
  • 20:19 mark: Upgraded kernel to 2.6.24-22 on sq22
  • 19:57 brion: disabling $wgEnotifUseJobQ since the lag is ungodly
  • 17:58 JeLuF: db2 overloaded, error messages about unreachable DB server have been reported. Nearly all connections on DB2 are in status "Sleep"
  • 17:21 JeLuF: srv154 is reachable again, current load average is 25, no obvious CPU consuming processes visible
  • 17:10 JeLuF: srv154 went down. Replaced its memcached by srv144's memcached
  • 03:02 brion: syncing InitialiseSettings -- reenabling CentralNotice which we'd taken temporarily out during the upload breakage
  • 01:50 Tim: exim4 on lily died while I examined reports of breakage, restarted it

January 19

  • 21:28 mark: Distribution upgrade on lily complete
  • 21:27 mark: Letting mail through again on lily
  • 21:01 JeLuF: Bugzilla didn't work. Some long-running (>3h) requests were locking some tables. Killed all long running jobs.
  • 20:05 mark: Put mail delivery on hold on lily
  • 20:03 mark: Upgrading lily (Mailing list server) to Ubuntu 8.04 Hardy
  • 14:04 mark: Set a static ARP entry for 85.17.163.246 on csw1-esams to see if it helps with the inbound packet loss effects

January 18

  • 20:25 mark: Cut outbound announcements to AS16265 to counter the inbound packet loss on that link
  • 17:50 river: started copying ms1:/export/upload to ms4
  • 00:21 Tim: restarted apache on srv158,srv177,srv106,srv66,srv109,srv140,srv86,srv90,srv133,srv172
  • 00:19 Tim: cleaned up binlogs on db1

January 17

  • 12:43 mark: Shut down transit link to 16265 due to intermittent packet loss

January 16

  • 23:25 brion: activating Drafts extension on testwiki
  • 21:18 brion: updating english/default wikibooks logo bugzilla:17034
  • 19:50 brion: uncommented srv101 from apache nodelist
  • 19:41 mark: Fixed authentication on srv101, and mounted /mnt/upload5
  • 19:25 brion: srv101 is commented out of 'apaches' node group so didn't show up on my earlier sweep
  • 19:23 brion: poking around, srv101 at least is missing upload5 mount still

January 15

  • 21:16 brion: seems magically better now
  • 20:48 brion: ok webserver7 started
  • 20:43 brion: per mark's recommendation, retrying webserver7 now that we've reduced hit rate and are past peak...
  • 20:28 brion: bumping styles back to apaches
  • 20:25 brion: restarted w/ some old server config bits commented out
  • 20:24 brion: tom recompiled lighty w/ the solaris bug patch. may or may not be workin' better, but still not throwing a lot of reqs through. checking config...
  • 19:48 brion: trying webserver7 again to see if it's still doing the funk and if we can measure something useful
  • 19:47 brion: we're gonna poke around http://redmine.lighttpd.net/issues/show/673 but we're really not sure what the original problem was to begin with yet
  • 19:39 brion: turning lighty back on, gonna poke it some more
  • 19:31 brion: stopping lighty again. not sure what the hell is going on, but it seems not to respond to most requests
  • 19:27 brion: image scalers are still doing wayyy under what they're supposed to, but they are churning some stuff out. not overloaded that i can see...
  • 19:20 brion: seems to spawn its php-cgi's ok
  • 19:19 brion: trying to stop lighty to poke at fastcgi again
  • 19:15 brion: looks like ms1+lighty is successfully serving images, but failing to hit the scaling backends. possible fastcgi buggage
  • 19:12 brion: started lighty on ms1 a bit ago. not really sure if it's configured right
  • 19:00 brion: stopping it again. confirmed load spike still going on
  • 18:58 brion: restarting webserver on ms1, see what happens
  • 18:56 brion: apache load seems to have dropped back to normal
  • 18:48 brion: switching stylepath back to upload (should be cached), seeing if that affects apache load
  • 18:40 brion: switching $wgStylePath to apaches for the moment
  • 18:39 brion: load dropping on ms1; ping time stabilizing also
  • 18:38 RobH: sq14, sq15, sq16 back up and serving requests
  • 18:38 brion: trying stopping/starting webserver on ms1
  • 18:27 brion: nfs upload5 is not happy :(
  • 18:27 brion: some sort of issues w/ media fileserver, we think, perhaps pressure due to some upload squid cache clearing?
  • 18:23 RobH: sq14-sq16 offline, rebooting and cleaning cache
  • 18:16 RobH: sq2, sq4, and sq10 were unresponsive and down. Restarted, cleaned cache, and brought back online.
  • 04:32 Tim: increased squid max post size from 75MB to 110MB so that people can actually upload 100MB files as advertised in the media

January 14

January 13

  • 23:32 Tim: fixed NRPE on db29
  • 22:56 Tim: cleaned up binlogs on db1 and ixia
  • 22:54 brion: poking WP alias on frwiki bugzilla:16887
  • 21:11 RobH: setup ganglia on erzurumi
  • 20:42 brion: setting all pdf generators to use the new server
  • 20:40 brion: testing pdf gen on erzurumi on testwiki
  • 20:35 RobH: setup erzurumi for dev testing
  • 20:35 RobH: some random updates on server roles to clean it up
  • 19:37 mark: Restored normal situation, with 14907 -> 43821 traffic downpreffed to HGTN to avoid peering network congestion
  • 18:40 mark: Retracted outbound announcement to all AMS-IX peers, 16265 and 13030 to force inbound via 1299
  • 18:25 mark: Undid any routing changes as they were not having the desired effect
  • 18:14 mark: Prepended 43821 twice on outgoing announcements to 16265 to make pmtpa-esams path via nycx less attractive
  • 11:38 Tim: reducing innodb_buffer_pool_size on db19, db21, db22, db29
  • 09:15 Tim: restarting mysqld on db23 again
  • 09:09 Tim: restarting mysqld on db18 again
  • 07:08 Tim: removed db23 from rotation, since I'm bringing it up soon and it will be lagged
  • 07:02 Tim: shutting down mysqld on db18 for further mem usage tweak
  • 06:53 Tim: fixed broken /etc/fstab on db23 via serial console
  • 06:42 Tim: restarting db23
  • 00:08 Tim: repooling db18, has caught up

January 12

  • 21:50 brion: testing a scap after touching MessagesWuu.php to see if that clears borked serialized bits
  • 21:22 RobH: erzurumi installed
  • 21:00 tomaszf: moved erzurumi to vlan 101 on asw-a4-sdtpa
  • 17:55 brion: temporarily stopped apache on srv78, srv118
  • 17:54 brion: srv78 doesn't have upload5 mounted
  • 17:54 brion: srv118 doesn't have upload5 mounted
  • 17:46 RobH: fixed some settings for flaggedrevs in https://bugzilla.wikimedia.org/show_bug.cgi?id=14648
  • 17:31 RobH: per brion commented out db18 in db.php cuz its making other crap lag too much (bugzilla:16993)
  • 17:26 RobH: updated flaggedrevs.php for https://bugzilla.wikimedia.org/show_bug.cgi?id=16365
  • 17:23 RobH: updated apache config on yongle for wap => mobile forwarding oversight per https://bugzilla.wikimedia.org/show_bug.cgi?id=16692
  • 17:05 brion: db18 is backlogged 191k seconds. depooling it; complaints of hella lag
  • 15:32 Tim: restarted mysqld on db18 with reduced memory usage, repooled
  • 14:12 Tim: rebooting db18
  • 13:20 Tim: depooled db18 (is down)

January 10

  • 16:08 domas: rotated 300g sampled-1000.log ;-)
  • 07:09 river: applied current OS patches to ms2 and rebooted
  • 01:21 Tim: restarted apache on srv95,srv114,srv37,srv49
  • 01:19 Tim: cleaned up disk space on db1. Still looks suspiciously like the master...
  • 00:33 brion: redirecting old bylaws.pdf to wiki page bylaws on wikimediafoundation.org (foundation.conf update)
  • 00:13 brion: reconfigured exim on wikitech to hopefully actually send mail out. whether it reaches anything, we'll see
  • 00:12 tomaszf: turned off fundraising banners
  • 00:08 brion: installed a mail server on wikitech server, hopefully

January 9

January 8

  • 22:08 brion: putting db12 back in service, caught up
  • 21:42 RobH: changed the ip address for the management interfaces on sq31-sq50
  • 21:30 RobH: updated dns with the squids and srv management info for pmtpa
  • 21:16 brion: taking load off db12 while it updates
  • 21:15 brion: killing stuck query threads on db12 (lagged 13k seconds)
  • 20:23 RobH: updated dns removing a large number of decommissioned servers from records.
  • 20:08 RobH: pushed updates to dns for management ip allocations, changed management ips of search8-search12
  • 19:42 RobH: changed the management ip addresses of db5-db10 to fit into current ip scheme
  • 18:20 RobH: updated dns for the management name resolution of db11-db30
  • 18:11 RobH: ms5 has lom access enabled and is ready for testing. (Only one ethernet connection in lieu of the typical 3 on the thumper/thors)
  • 15:50 RobH: srv118 reinstalled
  • 15:46 RobH: srv136 is borked. Even after reinstall, it will run for a few minutes, then lock hard. Going to RMA it.
  • 15:38 RobH: reinstalled srv136 and srv118 cuz they were pissing me off (a valid reinstallation reason if there ever was one.)
  • 15:08 RobH: and srv118 back down, thing is borked.
  • 15:06 RobH: srv118 back online and serving requests.
  • 15:01 RobH: pushed db13 back into cluster, same with db14, from yesterdays work
  • 14:26 RobH: srv101 back online and in lvs
  • 14:15 RobH: reinstalled srv101, installing wikimedia-task-app packages now
  • 06:37 JeLuF: rebooted db18. Mysqld was stuck but couldn't be killed.
  • 04:08 Tim: migrated all locked wikis from $wgReadOnly(File) to permissions-based locking, so that stewards can edit the alternate project links, and so that various MediaWiki components don't break on page view
  • 03:57 river: set up ms3/ms4 with solaris 10 update 6

January 7

  • 22:50 RobH: db13 and db14 are replicating but not in the cluster (not sure if they are caught up)
  • 22:35 RobH: updated power strip information for ps1-a1-sdtpa and balanced load
  • 22:35 RobH: reseated mrj cable for csw1-sdtpa_1/13
  • 21:36 RobH: started up db13 and db14
  • 21:19 RobH: updating firmware on db13-db14
  • 21:14 RobH: shutdown db13 and db14 to fix lom lockup issue.
  • 20:52 RobH: depooled db13 and db14 in db.php to reboot them and fix the SP lockup issue.
  • 20:49 RobH: updating firmware on db16.
  • 20:43 RobH: started mysql back up on db15
  • 20:42 RobH: cold reset of db16 to resolve lom issue. will update firmware upon boot.
  • 20:39 RobH: swapped hostnames on ms3 and ms4, updated racktables and dns to reflect change
  • 20:24 brion: disabled wikidiff2 on wikitech since it's not installed, and this apparently is nicely broken
  • 20:21 RobH: db15 now responsive to lom and ready to be re-integrated into the cluster
  • 20:12 RobH: db15 cold reset fixes the LOM non-responsive issue. Upgrading its firmware to prevent future issues.
  • 20:06 brion: removed stray whitespace from wikitech config file which was breaking rss feeds
  • 19:22 mark: Possibility that esams LVS was overloaded, split over 2 boxes (fuchsia & mint)
  • 19:19 RobH: ms3 and ms4 are accessible via LOM and ready for setup/deployment
  • 19:05 RobH: updated dns for ms3-ms5, updated dns for management for all media servers.
  • 19:03 brion: touching MessagesZh.php and re-trying scap; may not have properly updated
  • 17:40 brion-plague: scapping -- merged r45507 zh specialpage alias fix to live. also r45499 (revert of Cite error thingy) seems to already have been merged
  • 13:58 Tim: ran updateAutoPromote.php on all flaggedRevs wikis
  • 13:41 Tim: scap
  • 13:21 Tim: repooled db3 and db4
  • 12:47 Tim: recompressTracked.php complete. Recompressed 628 GB of data to 30GB, a 21x reduction over per-revision compression.
  • 04:36 brion-codereview: svn up'ing testwiki to r45489
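The 21x figure in the 12:47 entry follows from the two sizes quoted there. A quick check, with `reduction_factor` as an illustrative helper name:

```python
def reduction_factor(before_gb, after_gb):
    """Return how many times smaller the data is after recompression."""
    return before_gb / after_gb


# Sizes as quoted in the log entry: 628 GB before, 30 GB after.
factor = reduction_factor(628, 30)
print(f"{factor:.1f}x")
```

The unrounded value is a little under 21, so the logged "21x" is the nearest whole number.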

January 6

  • 16:01 mark: Changed 'knams' into 'esams' in DNS, kept a lot of old names in place
  • 15:26 Tim: cleaned up binlogs on db1
  • 13:09 mark: Did some Traffic Engineering on the Amsterdam network
  • 11:58 Tim: installed NRPE on new ES servers
  • 11:47 domas: added db29 to s3 duty
  • 11:32 Tim: locked clusters 18 and 19, updated nagios
  • 11:27 Tim: fixed lack of schema on srv161
  • 11:21 Tim: retired cluster18 from the write list, added cluster20 and cluster21
  • 11:15 Tim: cleaned up binlogs on srv105
  • 00:04 tomaszf: built out eiximenis with ubuntu-8.04 for mobile server

January 5

  • 20:47 brion: re-updating SpecialSearch.php and MWSearch.php for better fix of the XSS
  • 20:40 brion: updating SpecialSearch.php for XSS issue
  • 20:00 RobH: wikitech is moved to new host. Still needs HTTPS setup. Redirects from old host are in place.
  • 13:17 domas: setting up db24-db26 LVMs per http://p.defau.lt/?eAOimTjd9r_QvSDiIhHjng
  • 12:56 mark: Brought down BGP transit session to AS 1145 / Kennisnet
  • 12:29 domas: db16 had our special deadlock, didn't come up after reboot, SP not responding, needs datacenter activity
  • 12:07 domas: upgraded BIOS firmware on db29,db30 and accidentally on db19 (damn .29 ip :)
  • 11:47 domas: added 208.80.152.185 to noc.wikimedia.org vhost ServerAlias
  • 10:33 mark: Brought BGP session to AS 16265 back up
  • 00:04 Tim: cleaned up binlogs on ixia and db1

January 4

  • 17:08 mark: Restored traffic to esams
  • 16:38 mark: Moved route sourcing from br1-knams to csw1-esams
  • 15:55 mark: Moving esams traffic to pmtpa (scenario knams-down)

January 3

  • 23:57 mark: Restored AAAA record on upload.wikimedia.org
  • 12:04 domas: db17, db18 had OS/firmware updates, rebooted
  • 10:50 domas: db19 RAID complaining about temperature, check-raid/kswapd/mysqld deadlock. upgrading RAID firmware, rebooting, etc
  • 01:23 Tim: removed db3 and db4 from rotation again, to allow recompressTracked to go faster
  • 00:36 Tim: depooled db19, is down
  • 00:32 Tim: restarting recompressTracked with an extra wfWaitForSlaves()
  • 00:08 Tim: repooled db3 and db4

January 2

  • 22:35 Tim: depooled db3 and db4 temporarily
  • 21:56 Tim: killed recompressTracked for now, not waiting for slaves properly. db3 and db4 lagged.
  • 20:54 mark: Set db4 s1 load to 0, 4368s lagged
  • 00:42 Tim: restarting recompressTracked.php on hume

January 1

  • 20:34 brion: live-merging file delete fatal error fix from r45278
  • 19:47 brion: bumped meter image to 7
  • 01:59 brion: scapping!
  • 01:39 brion: svn up'ing test.wiki to r45274
  • 00:55 brion: svn up'ing on test.wikipedia

December 31

  • 18:40 brion: fixed old whygive.wikimedia.org blog by copying de-conflicted WordPress source files out of the active blog where we fixed it after the 2.7 upgrade

December 30

  • 23:02 RobH: is leaving on a jet plane, weeeeeeeee.. in 8 hours.
  • 23:01 RobH: all knams squids are now online.
  • 22:49 RobH: knsq23-26 back in rotation, 3 more to go.
  • 22:33 RobH: enabled knsq16-knsq22 in lvs, almost time to go back to hotel and die.
  • 22:22 brion: attempting to purge affected pages on dawiktionary, dawiki
  • 22:21 brion: taking dawiki, dawiktionary out of read-only because the rest of the fixes won't work until it's disabled :P
  • 22:14 brion: poking diff version in live DifferenceEngine.php to eliminate bogus cache entries for dawiki/dawiktionary
  • 22:11 RobH: stopping and clearing the cache on knsq16-knsq30.
  • 22:06 brion: trying it again, but this time with the right variable names
  • 22:02 brion: attempting to clear revision text loading cache entries for dawiktionary, dawiki
  • 21:47 brion: live-merging r45206 so bugzilla:16841 corrupted entries will be loaded properly on dawiki/dawiktionary. need to clear revision, diff, parser caches...
  • 21:15 brion: locking dawiki, dawiktionary ($wgReadOnly) pending encoding fix
  • 20:07 brion: killed recompressTracked.php processes on hume pending investigation of encoding breakage
  • 20:02 brion: commenting ariel out of pmtpa also
  • 19:58 brion: trying to clear no-longer-in-dns hosts from ALL node group
  • 19:57 brion: PLEASE SAY WHAT SERVER YOU'RE RUNNING BATCH PROCESSES ON IF THEY'RE NOT ON ZWINGER. thanks
  • 19:56 RobH: power disconnection for primary routing rack in esams. power restored, and totally was not robh's fault regardless of what lies mark may say to the contrary.
  • 19:54 brion: encoding issues reported with some old edits on dawiki. wondering if this is recompression-related?
  • 18:46 brion: added PMTPA nameserver back in mayflower's resolv.conf so DNS actually works on it until things are fixed
  • 17:42 brion: internal DNS for knams seems to be down (at least on mayflower), this is breaking at least SVN update notifications
  • 17:14 brion: updating logo for pmswiki bugzilla:16587
  • 13:29 Tim: starting recompressTracked.php on all wikis
  • 11:22 mark: Shutting down knsq16-30
  • 10:59 mark: In case of overload problems, please move traffic to pmtpa (scenario knams-down)
  • 10:54 mark: Depooled knsq16-30
  • 10:47 mark: Set DNS timeout on fuchsia (LVS) to 1s, PyBal timeout to 8s
  • 10:21 mark: Unracking pascal, mint, lily
  • 09:57 Tim: testing recompressTracked on huwiki
  • 09:38 mark: ts-array3/A --> yarrow/0
  • 09:23 TimStarling: testing recompressTracked on testwiki
  • 09:20 mark: hemlock/eth1 <--> clematis/eth1
  • 09:17 mark: ts-array2 -> zedler scsi B, ts-array1/0 -> zedler scsi A
  • 08:47 Tim: running FlaggedRevs/maintenance/clearCachedText.php on all FlaggedRevs wikis

December 29

  • 11:24 mark: Shutting down and unracking mayflower (subversion)
  • 11:21 mark: Temporarily disabled AAAA record upload.wikimedia.org for ipv6 participants
  • 11:19 mark: Unracked fuchsia
  • 11:16 mark: In case of overload problems, move traffic to pmtpa!
  • 11:11 mark: Moving all LVS to mint
  • 09:56 mark: Depooled knsq8-15
  • 09:56 mark: Unracked knsq1-7
  • 09:43 mark: Repooled knsq23-30, depooled knsq1-7
  • 09:23 mark: Depooled knsq23-30
  • 08:47 Tim: deleted some binlogs on srv108.
  • 04:50-05:32 Tim: set up external storage on the remaining 9 servers in srv151-186: srv160, srv161, srv162, srv172, srv173, srv174, srv184, srv185, srv186
  • 03:41 Tim: running orphanStats.php on all wikis
  • 03:26 Tim: restarted apache on srv33, srv146, srv169, srv172
  • 03:00 Tim: cleaned up binlogs on srv105

December 28

  • 21:33 brion: tweaked namespace robot policies for hewiki bugzilla:16247
  • 20:52 brion: tweaking it correctly this time
  • 20:50 brion: tweaking centralnotice loader path for secure.wm.o
  • 20:20ish brion: copied a couple image files for Bugzilla skin to local dir, since Firefox 3.1b whinges about loading images via http: from an https: page
  • 18:21 brion: we've been getting reports of difficulties reaching PMTPA via Level3
  • 18:03 brion: updating thwiki logo bugzilla:16008
  • 17:54 mark: csw1-esams racked and configured; link established with br1-knams
  • 12:14 mark: Moving equipment to EvoSwitch
  • 11:55 mark: Moved udpmcast from pascal to lily
  • 11:48 mark: sage stays at knams, to be racked into J-13 later
  • 11:44 mark: Unracking ragweed
  • 11:38 mark: Unracking hawthorn
  • 11:37 mark: Unracking sage
  • 11:37 mark: Unracked csw1-knams
  • 11:25 mark: Directed traffic back to knams
  • 10:52 mark: knams network should be back up
  • 09:05 mark: Moving knams traffic to pmtpa

December 27

  • 21:50 brion: removed stale sitemaps dirs for several private wikis

December 26

  • 00:50 Tim: started mysqld on db19, repooled
  • 00:44 Tim: got connection on db19 and assumed it was still broken, initiated shutdown
  • 00:44 domas: db19 had jfs/kswapd/etc deadlock, came up after reboot
  • 00:34 Tim: noticed db19 was down, depooled it.

December 25

  • 23:59 domas: restarted db19 with sysrq without telling anyone
  • 19:37 brion: adjusted subpage namespaces for arbcom_enwiki
  • 19:11 brion: disabled magic_quotes_gpc on yongle -- mobile.wikimedia.org gateway doesn't compensate for quoted input. :P
  • 19:09 brion: merry christmas!
  • 01:09 brion: re-running SVN metadata import for CodeReview to fix comment encoding (bugzilla:16640)

December 24

  • 21:55 brion: merging r45005 (restoring default font for Safari textarea)

December 23

  • 23:35 brion: svn up'd to r44990 (serialization updates broken by Setup.php change)
  • 23:28 brion: starting scap!
  • 23:24 brion: svn up'ing to r44989, prep for scap!
  • 22:41 brion: think i tweaked scap script to update skin files on upload.wikimedia.org ...hopefully :)
  • 22:09 brion-codereview: svn up'ing test.wikipedia.org to r44982 -- DO NOT SCAP UNTIL TESTED!
  • 02:38 Tim: cleaned up binlogs on db1, db2. Removed cluster19 from the write list, it's almost full.
  • 02:28 brion: clearing out bogus page_restrictions entries (bugzilla:16629)

December 22

  • 22:56 brion: updated timezone for huwikinews (bugzilla:14343)

December 21

  • 03:05 Tim: depooled db4 temporarily to speed up a long running trackBlobs query

December 20

  • 01:08 brion: starting a cleanupImages run on all wikis
  • 00:57 brion: set UI lang for mainpage on meta bugzilla:16701

December 19

  • 23:52 brion: removing MessageCache::get profiling hack, all done
  • 22:16 brion: adding profiling hack for MessageCache::get
  • 13:48 mark: Found knsq12 turned off, brought it back up
  • 12:17 mark: Unracking knsq15 to make room for the new router
  • 08:53 Tim: changed crontab on hume to run rebuildTemplates.php every 30 minutes instead of every 10 minutes, since it's taking about 30 minutes to finish each run
  • 07:42 Tim: started trackBlobs.php running on hume, for all wikis

December 18

  • 23:16 brion: updating MessagesLij.php, MessagesMt.php -- namespace breakage
  • 21:53 brion: bugzilla:16597 spam regex update
  • 21:01 RobH: added wikitech subdomain for future setup/migration of wikitech mediawiki
  • 20:33 RobH: added commons to meta imports allowed per https://bugzilla.wikimedia.org/show_bug.cgi?id=16665
  • 14:50 RobH: pushed dns change to correct spence.mgmt.pmtpa.wmnet.
  • 03:09 TimStarling: killed long-running query on db9, 5762 seconds, plain select query probably with a read lock held by the thread, all read queries were waiting for the lock
  • 02:27 TimStarling: deleted binlogs on srv105 and srv108
  • 01:16 brion: briefly experimented with changing wgLogo on testwiki via Configure and it didn't explode. yay! setting it back to default and just letting it be. only stewards can edit config, and only wgLogo is configable atm.
  • 01:12 brion: testing Configure on testwiki only
  • 01:10 brion: created test Configure ext tables in 'wikiconfig' db
  • 00:49 brion: scapping for update of Configure extension prior to small-scale test deployment
  • 00:48 Danny_B: wikibugs-l stopped sending mails to wikibugs-irc mailbox due to excessive bounces. re-enabling sending
  • 00:28 RobH: fixed part of the revert for lucene that i missed.
  • 00:24 RobH: reverted lucene.php changes from rainman's testing.
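
The 03:09 entry above describes finding one plain select that had run for 5762 seconds while other reads queued behind it. A minimal sketch of that kind of triage, assuming the rows of SHOW PROCESSLIST have already been fetched into dicts keyed by the standard column names; the function name and the one-hour threshold are hypothetical, not what was actually used on db9:

```python
def long_running(rows, threshold_seconds=3600):
    """Return active queries at or above the threshold, longest first.

    rows: list of dicts using SHOW PROCESSLIST column names
    (Id, Command, Time, Info). The threshold is an assumed default.
    """
    hits = [
        r for r in rows
        if r["Command"] == "Query" and r["Time"] >= threshold_seconds
    ]
    return sorted(hits, key=lambda r: r["Time"], reverse=True)


# Illustrative rows only; not real db9 output.
rows = [
    {"Id": 1, "Command": "Query", "Time": 5762, "Info": "SELECT ..."},
    {"Id": 2, "Command": "Sleep", "Time": 9000, "Info": None},
    {"Id": 3, "Command": "Query", "Time": 12, "Info": "SELECT 1"},
]
for r in long_running(rows):
    print(r["Id"], r["Time"])
```

Idle connections (Command "Sleep") are skipped on purpose, since a long Time there does not mean a query is holding a lock.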

December 17

  • 23:18 RobH: more lucene changes
  • 22:36 brion: applied fix for Android browser on mobile gateway (also did the pl language setup recently)
  • 22:05 RobH: more lucene.php changes
  • 21:12 RobH: additions to lucene.php per rainman
  • 20:39 mark: Corrected LVS service IPs on search2, search10-12
  • 20:03 brion: hacked mw-serve init script on yongle into shape. will commit it in a bit and update docs
  • 19:38 brion: pdf server seems to have eaten all temp space on yongle. clearing...
  • 19:26 mark: Set up search2, search8-12
  • 18:57 RobH: pushing dns changes for new misc. servers management resolution
  • 18:30 RobH: updated lucene.php with rainman to do things that I really do not get but he knows about.
  • 16:28 RobH: new servers auth1, nfs2, streber and williams are racked, IPs allocated, DRAC working. No DHCP entries or OS installed yet.
  • 16:08 mark: restarted lighttpd on zwinger
  • 15:59 RobH: added williams to dns records, updated dns
  • 15:50 TimStarling: removed some binlogs on ixia
  • 01:17 brion: scapping a couple more fixes to r44698
  • 00:36 brion-codereview: srv126 is borked -- read-only filesystem
  • 00:23 brion-codereview: scapping to 44696
  • 00:15 brion-codereview: svn up'ing on test...

December 16

  • 23:09 brion-codereview: disabling FixedImage extension -- was used for old 2006 and 2007 fundraisers; images no longer exist and are not applicable to current fundraisers
  • 20:34 RobH: ariel is dead, will decommission later.
  • 20:29 RobH: ariel is fubar, rebooting and investigating.
  • 20:25 RobH: restarted services on sq13
  • 20:21 RobH: took down sq13 to clean its cache
  • 20:09 RobH: replaced bad /c0/p0 in amane
  • 19:45 RobH: setup drac access for nfs1, brewster, auth2, dobson, eiximenis, erzurumi, fenari, grosley, loudon, singer, & spence. The other 3 misc. servers will be setup later. OS not installed, just remote access setup and IP space allocated. (Not setup in DHCP yet.)
  • 18:47 brion: applying temporary resource limit lift on enwiki for an IP for workshop in SF
  • 17:40 RobH: updated dns for misc. servers project.
  • 01:08 brion: deploying r44643 update to CodeReview subversion proxy (swapped encoding protocol to avoid bugs in json_decode with some diffs)
  • 00:04 brion: running cleanupTitles.php in bg on all wikis...

December 15

  • 23:20 brion: going to test fixes for FiveUpgrade.inc to back cleanupTitles.php, cleanupImages.php etc
  • 22:21 RobH: changed settings on metawiki to allow banned users to edit their talk pages per https://bugzilla.wikimedia.org/show_bug.cgi?id=16621
  • 21:25 brion: reenabling handheld skin setting, was turned off during overload emergencies on 11-17
  • 21:13 brion: rsyncd appears to be running on srv56. does anything else need to be done for index updates?
  • 20:10 brion: yongle hanging again, restarting apache
  • 18:58 RobH: started rsync daemon on srv56 per rainman
  • 18:35 RobH: setup new planet per https://bugzilla.wikimedia.org/show_bug.cgi?id=16511.
  • 01:39 brion-weekend: applying API deletion log fix from r44541 (bugzilla:16626)
  • 00:09 rainman-sr: rsyncd is not running on srv56, updates for wikis served by old indexer halted since Oct7. Run rsync --daemon on srv56

December 14

  • 02:04 Platonides: Connections timing out

December 13

  • 02:04 brion: applied patch-rfb_ratings.sql to flaggedrevs wikis
  • 01:46 brion: did some debugging on RatingHistory graph generation with Aaron and got it working yay!

December 12

  • 22:47 brion: patched Bugzilla so we can exclude CC-only mails from wikibugs-l (bugzilla:15585)
  • 21:52 brion: scapping to r44509
  • 19:19 brion: put all the themes and plugins and patches back on wordpress for blog.wm.o. whee
  • 19:15 brion: restarted apache on isidore while fiddling with php error logging settings and blog started magically working again. sigh. going back to tweak its config back to normal
  • 18:04 brion: we managed to fix the svn update conflict on blog.wm.o (to wordpress 2.7) but it's still showing main page as blank
  • 17:42 mark: Telia connection / BGP session was up for 20 hours; problem seems resolved. Removed route filters
  • 00:29 brion: bumping to r44485 for more NS fixes for ms, ast
  • 00:12 brion: scapping bump to r44484, fixing a few issues w/ hu
  • 00:06 brion: updated wikibugs irc script to r44483, fixes issues w/ users w/o real name setting

December 11

  • 23:19 brion: shutting down srv118; bad config. missing upload5 mount, seems to have bogus authentication (local su to root fails with "Authentication service cannot retrieve authentication info")
  • 23:10 brion: restarted apache on 134, it's scary/corrupt
  • 22:55 brion: manually syncing updated skin files to upload.wm.o ...
  • 22:53 brion: scapping to r44474
  • 21:31 brion: don't sync yet; RC regression in r44033 being worked on
  • 19:41 brion-codereview: removed conflicting live profiling hack from AutoLoader.php. Put this stuff in SVN, huh guys?
  • 19:39 brion-codereview: applying flaggedrevs schema updates
  • 19:38 brion-codereview: starting svn up for testwiki
  • 13:41 mark: configured asw-a4-sdtpa and asw-a5-sdtpa, but no link
  • 10:41 mark: bart out of disk space, removed some old cruft (mailman)

December 10

  • 23:50 RobH: pulled srv76 due to two dead fans (yay for da bot)
  • 23:35 RobH: srv78 reinstalled and in apache pool
  • 22:57 RobH: srv78 kernel panic, old FC install, pulled for reinstall
  • 22:49 RobH: sq1, sq3, sq6 cache cleaned and back online serving requests.
  • 22:35 RobH: sq1, sq3, sq6 all unresponsive to console, flashing leds on kvm. rebooted.
  • 20:40 RobH: srv118 installation completed.
  • 20:00 RobH: reinstalled srv118 after replacing dead parts. installing packages now.
  • 19:48 RobH: started rebuild of storage1 /c1/p0 into array
  • 19:47 RobH: replaced disk /c1/p0 in storage1. /c1/p13 is now bad as well, placing rma for it.
  • 19:14 RobH: db13-db16 responsive to ssh.
  • 19:13 RobH: db15 rebooted.
  • 18:05 RobH: temp probes installed in a3-sdtpa

December 9

  • 18:46 RobH: fixed group names in add/remove groups per https://bugzilla.wikimedia.org/show_bug.cgi?id=16248
  • 18:42 RobH: updated some settings for no.wikimedia.org and pushed to cluster.
  • 15:23 RobH: backed up blog frontend/database and upgraded to 2.6.5 successfully
  • 14:21 RobH: updated InitialiseSettings for nowikimedia wiki
  • 06:47 Tim: srv146 did not have /mnt/upload5 mounted. Fixed.
  • 02:03 brion: dropped loading of obsolete RenderHash ext (bug 16114)

December 8

  • 23:30 RobH: updated enwiktionary group settings per https://bugzilla.wikimedia.org/show_bug.cgi?id=16248
  • 23:24 brion: updating Oversight for bug 16065
  • 22:44 RobH: no.wikimedia.org is now functioning per https://bugzilla.wikimedia.org/show_bug.cgi?id=15383
  • 22:35 RobH: made changes to InitialiseSettings.php for cswikisource per https://bugzilla.wikimedia.org/show_bug.cgi?id=16277
  • 21:37 RobH: authdns-update for no.wikimedia.org
  • 21:20 RobH: running sync-common-all for wikimedia norge (found the php error)
  • 21:01 RobH: its all back up now.
  • 20:59 RobH: I stupidly crashed the site with a php typo, rolling back my changes since i was ignorant and did not php -l  ;_;
  • 20:58 RobH: setup wikimedia norge wiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=15383
  • 19:23 brion: updating OggHandler for fix for bug 15920 (chopped oggs)
  • 15:57 mark: Set up mirroring of traffic of e7/2 to e7/14 for testing the fiber patch loop/optics
  • 13:16 Tim: added some IWF proxies to the trusted XFF list. These proxies are probably about 30% of the IWF traffic, the other 70% comes from proxies that pass through the XFF header without adding the client address.
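
The 13:16 entry above distinguishes proxies that add the client address to the XFF header from those that pass it through unchanged. A minimal sketch of how a trusted XFF list is typically applied, assuming a right-to-left walk of the header; the function name is hypothetical, the addresses are documentation placeholders, and this is not the MediaWiki implementation:

```python
import ipaddress


def client_ip(remote_addr, xff_header, trusted_proxies):
    """Guess the client address from the connection peer and X-Forwarded-For.

    Walks the hop chain from the nearest hop outward and returns the first
    address that is not a trusted proxy. Raises ValueError on malformed
    addresses.
    """
    trusted = [ipaddress.ip_network(p) for p in trusted_proxies]

    def is_trusted(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in trusted)

    # Header entries are oldest first; the TCP peer is the newest hop.
    chain = [h.strip() for h in xff_header.split(",") if h.strip()]
    chain.append(remote_addr)

    for addr in reversed(chain):
        if not is_trusted(addr):
            return addr
    # Every hop was trusted: fall back to the oldest address we know of.
    return chain[0]


trusted = ["192.0.2.0/24"]  # placeholder range standing in for the proxy list
print(client_ip("192.0.2.10", "203.0.113.7", trusted))
print(client_ip("192.0.2.10", "", trusted))
```

The second call shows the case the entry complains about: when a proxy does not add the client address, the proxy's own address is all that can be recovered.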

December 5

  • 22:42 domas: srv47 is running scaler usr.sbin.apache2 aa profile in learning mode
  • 22:33 RobH: sq50 reinstalled and back in rotation
  • 22:25 RobH: finished setup on srv146, back in apache pool
  • 21:32 RobH: setting up packages on srv146
  • 21:32 RobH: reinstalling sq50
  • 21:27 brion: pointing SiteMatrix at local copy, not NFS master, of langlist file
  • 19:19 RobH: added sq48, and sq49 back into pool. sq50 pending reinstallation.
  • 18:58 mark: depooled broken squids sq1 and sq3
  • 18:26 RobH: depooled sq48-sq50 for relocation
  • 18:17 RobH: added sq44-sq47 back into pybal, relocation complete.
  • 17:45 brion: sync-common-all to add w/test-headers.php
  • 17:28 RobH: shutting down sq44-sq47 for relocation.
  • 17:27 RobH: sq41 - sq43 back online.
  • 17:17 RobH: sq40 oddness, but its back up now
  • 16:44 RobH: accidentally pulled power for sq38, oops!
  • 15:36 RobH: removed sq41 - sq43 from pybal to relocate from pmtpa to sdtpa
  • 15:34 domas: srv178 running usr.sbin.apache2 aa profile in complain mode
  • 15:34 RobH: removed sq40 from pybal to relocate from pmtpa to sdtpa

December 4

  • 22:50 domas: job runners are no longer blue on ganglia CPU graphs :(((((((
  • 22:45 domas: fc4 maintenance, reniced job runners to 20 (10 behind apaches), installed apc3.0.19 (APC3.0.13 seems to have hit severe lock contention/busylooping at overloads)
  • 22:04 RobH: re-enabled sq38 in pybal. all is well
  • 22:02 RobH: fired sq37-sq39 back up
  • 21:58 RobH: shutdown sq37-sq39, cuz I need to balance the power distribution a bit better.
  • 21:40 RobH: sq38 is trying to break my spirit, so i reinstalled it to show it who is boss (me!)
  • 21:02 RobH: setup asw-a4-sdtpa and asw-a5-sdtpa on scs-a1-sdtpa
  • 20:52 mark: Increased TCP buffers on srv88 (a Fedora), matching the Ubuntus - Fedora Apaches appear to get stuck/deadlocked on writes to Squids
  • 19:39 RobH: pulled sq38 back out, as it is giving me issues. need to fix the msw-a3-sdtpa before i can fix sq38.
  • 19:35 RobH: added sq38, sq39 back into pybal
  • 19:25 RobH: added sq36, sq37 back into pybal
  • 18:14 RobH: I need to stop forgetting about lunch and stop working through it, oh well.
  • 18:13 RobH: depooled sq36-sq39 for move from pmtpa to sdtpa.
  • 18:12 RobH: some tinkering with lvs4 and idleconnection timer was fixed by mark.
  • 17:46 RobH: racked sq21-sq35 in sdtpa-a3. added back to pybal.
  • 16:31 RobH: depooled sq31-sq35 from lvs4 to move from pmtpa to sdtpa
  • 15:15 RobH: reinstalled storage1 to ubuntu 8.04, left data partition intact and untouched.

December 3

  • 23:46 JeLuF: performing importImage.php imports to commons for Duesentrieb
  • 19:13 RobH: tested i/o on db17, issue where it pauses disk access is gone.
  • 19:02 mark: Shutdown TeliaSonera (AS1299) BGP session, the link is flaky resulting in unidirectional traffic only for most of the day
  • 19:02 RobH: replaced hardware in db17, reinstalled.
  • 18:58 mark: Prepared search10, search11 and search12 as search servers
  • 17:26 brion: investigating ploticus config breakage bugzilla:16085
  • 17:18 brion: ploticus seems to be missing from most new apaches
  • 17:12 RobH_DC: search10, search11, search12 racked and installed.
  • 14:29 RobH_DC: srv136 was unresponsive, rebooted, synced, back in rotation.

December 2

  • 23:57 Tim: added CNAME poke.wikimedia.org for SMS notification project
  • 23:33 brion: scapping to update ContributionReporting ext
  • 23:11 Tim: db7 wasn't deleting its relay logs for some reason, since August 21. Disk critical. Did a reset slave.
  • 20:03 brion: rebuilt public_reporting with fixed encoding
  • 19:53 brion: fudged charsets in triggers for donation db update, let's see if that helps
  • 12:11 Tim: started squid (backend instance) on sq40, stopped for 13 days for no apparent reason
  • 12:08 Tim: restarted apache on srv161, srv122, srv137, attempted on srv123 but it is waiting for dead NFS mount
  • 11:48: srv183 made a miraculous recovery
  • 11:44 Tim: took srv183 out of memcached rotation
  • 11:10-11:35: a spike in backend requests (as seen in lvs3 network) caused the application cluster to overload. Due to the extra threads, srv183 went into swap and died.
  • 10:50 Tim: purged binlogs on ixia and db1 (both critical)

December 1

  • 23:49 brion: sync-common-all'ing to add a wikispecies little icon for sul shared session login, since people keep asking for it :)
  • 20:31 RobH: synced and restarted apache on srv89
  • 19:33 RobH: manually setup apache-check for pybal on srv138, synced, enabled.
  • 19:29 RobH: manually setup the apache_check stuff for srv126 and pybal.
  • 17:19 RobH: synced and restarted apache on srv176 & srv176
  • 17:18 RobH: did the sync and restart thing for apache on srv162
  • 17:16 RobH: synced and restarted apache on srv145
  • 17:13 RobH: synced and restarted apache on srv121 and srv125
  • 17:00 RobH: apache wasnt working on srv102 and srv106, restarted them after syncing
  • 15:10 mark: Restarted stuck pdns_server on bayle, lots of stale selective_answer.py processes
  • 14:44 domas: restored Roma article on itwiki, had orphaned revision entries after deleting it, manually inserted page entry
  • 14:40 mark: Setup Telia transit at knams, but all inbound routes filtered
  • 14:35 RobH: removed images from plwiki flaggedrevs per request from Leinad

November 30

  • 12:14 mark: restarted flapping apache on srv119, looks like memory corruption going on

November 28

  • 18:58 brion-holiday: updating User-Agent blacklists to block 'WebCapture' download tool but not the Library of Congress's www.loc.gov/webcapture/ spider
  • 18:17 yksinaisyyteni: fixed broken upload/deletion/timeline on jawiki
  • 07:11 JeLuF: managed to umount /mnt
  • 07:10 JeLuF: killed hanging cron entries on db22. updatedb.mlocate. Might be related to broken mount db16:/a -> /mnt
  • 07:05 JeLuF: killed lots of jobs running on db22, "SELECT /* ApiQueryBacklinks::run XX.XXX.XXX.X */ page_id,page_title,page_namespace,page_is_redirect" which were in status "copying to tmp table"

November 27

  • 13:10 mark: hungover, headache, lack of voice

November 26

  • 17:00 RobH: fixed flaggedrevs to work on ruwikiquote, due to my own mistake in earlier implementation, per https://bugzilla.wikimedia.org/show_bug.cgi?id=14863
  • 02:38 brion: updated Math.php to r43966 which both fixes 0-byte math PNGs and generates correct URLs *cough*
  • 02:36 brion: broke math temporarily woops
  • 02:29 brion: bumped Math.php to r43965 to hopefully clear out those 0-byte math images (bugzilla:16440)
  • 02:01 brion: updating CentralNotice to r43962 to fix sitenames again :P
  • 01:57 brion: poking centralNotice to r43961 for evil hacks to bump limits temporarily :D
  • 01:31 brion: updating CentralNotice to r43959

November 25

  • 19:25 brion: syncing update to CentralNotice
  • 18:28 RobH: root password changed across all servers. if you didnt get a copy and you should have one, talk to another tech team member.
  • 17:58 RobH: added bayes to allowed nfs connections to storage2, setup fstab for nfs mounts on bayes, revoked shell access for ezachte on storage2 (not needed for what he wanted)
  • 15:49 RobH: updated some points for huwiki flaggedrevs and removed an outdated user group per https://bugzilla.wikimedia.org/show_bug.cgi?id=15568
  • 15:38 RobH: gave erik zachte login rights to storage2
  • 15:16 RobH: updated dns for survey software
  • 01:35 brion: updating ContributionReporting ext
  • 01:06 brion: forcing a manual run of centralnotice batch update on hume
  • 01:04 brion: restarting memcached on srv64
  • 01:02 brion: memcache bad on srv64
  • 01:01 brion: notice texts borked on at least wikimedia, wiktionary

November 24

  • 22:45 brion: updated ContributionReporting for some silly bugs
  • 22:20 RobH: portal and portal_talk namespaces added to dvwiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=16403
  • 22:04 RobH: added two new namespaces to dewikinews per https://bugzilla.wikimedia.org/show_bug.cgi?id=16263
  • 21:29 RobH: removed a group and granted further permission customization for huwiki per https://bugzilla.wikimedia.org/show_bug.cgi?id=15568
  • 21:09 RobH: pushed a bad flaggedrevs.php that rendered blank pages for all wikis with flaggedrevs enabled. fixed it, it's working properly now, oops ;]
  • 21:06 RobH: appended page and dossier namespaces into the frwikinews flagged revisions per https://bugzilla.wikimedia.org/show_bug.cgi?id=15346
  • 20:36 RobH: enabled flaggedrevs on ukwiktionary per https://bugzilla.wikimedia.org/show_bug.cgi?id=15335, and ran sync-common-all
  • 20:27 RobH: ran sync-common-all
  • 20:27 RobH: enabled flaggedrevs on dewiktionary
  • 20:07 mark: moved upload knams LVS to mint
  • 20:05 brion: mark is on the case -- LVS overload
  • 19:58 brion: seem to be getting heavy packet loss on some routes to knams
  • 19:47 RobH: changed nameservers for wikimedia.li to WMCH administered name servers.
  • 19:30 RobH: re-enabled arzwiki, cannot find the bugzilla entry.
  • 15:43 RobH: search2 reinstalled and ready for search setup and deployment

November 22

  • 18:28 yksinaisyyteni: srv108 (cluster19) disk full, removing old logs
  • 00:37 brion: bumped php.ini post/file upload limit to 100mb, we'll see how well uploads to that size actually work  :)

November 21

  • 23:11 brion: dropping 'Wikipedia: a non-profit project' banner from rotation, as it's apparently not a winner
  • 22:56 brion: updated logo for cr.wikipedia (bugzilla:16417)
  • 18:34 brion: running updateAutoPromote on new flaggedrevs wikis (bugzilla:16415)

November 20

  • 01:00 brion: updating ContributionHistory
  • 00:34 brion: moving $wgStyleSheetPath back to upload.wikimedia.org

November 19

  • 22:47 brion: updating Tomas skin to r43752 for toc fix
  • 22:41 brion: scapping for ContributionReporting update to 43750 (localization bugs)
  • 22:40 brion: ran namespaceDupes --prefix=D on enwiki and dewiki -- some 'D:blah' pages conflicted with iw prefix 'd' for wiktionary
  • 15:53 brion: updated centralnotice templates with user-targeted lightweight collapsed notice (wish it was for everybody)
  • 01:38 brion: updating CentralNotice to r43697 for anon/user collapsed variants
  • 00:35 yksinaisyyteni: unmounted storage1:/export/upload on all hosts
  • 00:32 yksinaisyyteni: rebooted srv{114,184,166} to fix stuck nfs mount

November 18

  • 23:52 brion: enabling new search UI on testwiki
  • 21:35 brion: switching css/js back to text temporarily to reduce load on upload squids
  • 21:27 brion: request -- squid conf deploy script should do a config file dry-run before actually deploying
  • 21:26 brion: there's load on ms1...
  • 21:25 brion: started more... most... all? squids in squids_upload
  • 21:24 brion: restarted squid manually on 46
  • 21:17 brion: uploads still borked, we're investigating the squid config problem
  • 21:16 brion: rebuilding squid conf, was a little funky
  • 21:12 brion: updating squid config to send centralnotice to ms1 instead of storage1
  • 20:41 RobH: db24 reinstalled, awaiting domas to do the magic db stuff
  • 20:38 RobH: replaced disk /c0/p7 in amane and started rebuild
  • 20:34 RobH: replaced controller in search2, search2 requires reinstall
  • 20:34 RobH: replaced controller in db24, db24 reinstalling.
  • 20:03 mark: installed gmond on db9 and db10
  • 19:59 brion: scapping to update Collection for regression fix
  • 01:51 mark: Moved text LVS to temporary LVS host lvs4, with an optimized kernel
  • 01:48 brion: setting $wgStyleSheetPath to point at upload.wikimedia.org/skins for non-SSL hosts
  • 01:30 brion: disabling handheld stylesheet; one less thing to load, should have little impact
  • 01:15 brion: another crappy slow squid this time in pmtpa
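
The 21:27 request above asks for the squid conf deploy script to dry-run the config before pushing it. A minimal sketch of that gate, assuming `squid -k parse -f <file>` exits nonzero on a bad config; the function names and the deploy callback are hypothetical and do not reflect the actual deploy script:

```python
import subprocess


def squid_parse_ok(conf_path):
    """Ask squid to parse a config file without applying it.

    Assumption: a nonzero exit status means the config was rejected.
    Returns (ok, stderr_text).
    """
    result = subprocess.run(
        ["squid", "-k", "parse", "-f", conf_path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr


def deploy_if_valid(conf_path, deploy, validate=squid_parse_ok):
    """Run the dry-run check first and call deploy() only if it passes."""
    ok, detail = validate(conf_path)
    if not ok:
        return False, detail
    deploy(conf_path)
    return True, ""
```

Passing the validator in as a parameter keeps the gate testable on a machine without squid installed, and lets the same wrapper guard other generated configs.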

November 17

November 16

  • 17:24 brion: notices are becoming unborked with new regen. should be done and recached within 10 minutes
  • 17:17 brion: srv120 memcached now functional according to test: 10.0.2.120:11000 set: 100 incr: 100 get: 100 time: 0.0809991359711
  • 17:16 brion: restarting memcached on srv120
  • 17:14 brion: srv120's memcached seems broken: 10.0.2.120:11000 set: 100 incr: 0 get: 0 time: 0.0769970417023
  • 17:05 brion: investigating centralnotice borkage on non-wikipedia sites
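
The 17:14 and 17:17 entries above quote a memcached test that reports set, incr and get counts plus a time. The test script itself is not shown in this log, so the following is only a guess at what such a probe does: write a key, increment it, read it back, and count successes out of 100 rounds. The client classes are in-memory stand-ins with a hypothetical interface, not a real memcached library:

```python
import time


class DictClient:
    """In-memory stand-in for a memcached client (hypothetical interface)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value
        return True

    def incr(self, key, delta=1):
        if key not in self._data:
            return None
        self._data[key] = int(self._data[key]) + delta
        return self._data[key]

    def get(self, key):
        return self._data.get(key)


class BrokenClient(DictClient):
    """Accepts writes but fails increments and reads."""

    def incr(self, key, delta=1):
        return None

    def get(self, key):
        return None


def probe(client, rounds=100):
    """Count successful set/incr/get operations over a number of rounds."""
    counts = {"set": 0, "incr": 0, "get": 0}
    start = time.time()
    for i in range(rounds):
        key = "probe:%d" % i
        if client.set(key, i):
            counts["set"] += 1
        if client.incr(key) == i + 1:
            counts["incr"] += 1
        if client.get(key) == i + 1:
            counts["get"] += 1
    counts["time"] = time.time() - start
    return counts
```

A server that acknowledges writes but cannot serve them back would show the same shape as the srv120 failure line: full marks for set, zero for incr and get.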

November 15

  • 01:03 brion: scapping to r43514 -- regression in CodeReview :)
  • 00:49 brion: enabled UDP->IRC logging for CentralAuth user creations, now that it works instead of crashing PHP
  • 00:45 brion: set up ariel on isidore for blog maint
  • 00:24 brion: starting scap from r42593 to r43512
  • 00:02 brion: preparing for general svn up && scap

November 14

  • 23:24 RobH: updated flaggedrevs: $wgFlaggedRevValues to 4 from 2 for enwikibooks, synced files out to cluster.
  • 23:11 RobH: FlaggedRevs deployed on enwikibooks.
  • 23:00 RobH: removed the crap for specific seroul servers in sync-common-all
  • 22:43 brion: tweaked flaggedrevs.php to have cleaner default behavior
  • 20:27 RobH: setup the backend stuff for arz wiki but not enabled yet.
  • 19:59 brion: yongle is back up! yay
  • 19:48 RobH: fixed authdns-update script, was not rsyncing over the langlist file
  • 19:47 brion: swapping codereview-proxy to isidore since yongle's still down
  • 18:01 brion: requesting reboot on yongle from PM support
  • 17:14 domas: yongle is hanging, apple dictionary searches stalled
  • 16:12 RobH: upgraded installation of blog.wikimedia.org and whygive.wikimedia.org to newest stable versions.
  • 15:14 RobH: limesurvey.wikimedia.org online on isidore, initial users created and deployed.
  • 02:03 brion: pascal down again
  • 00:00 brion: syncing to update InputBox extension (note: renamed from inputbox)

November 13

  • 23:41 brion: scapping to update CodeReview
  • 20:26 brion: scapping updates to Collection and ContributionReporting exts
  • 17:33 brion: set up TrevorParscal with access to reporting database so he can grab updates to test with
  • 17:03 river: upgraded ms1 to solaris 10 update 6 + rebooted
  • 09:57 Tim: db10 sync worked just fine this time, it's now replicating all DBs
  • 08:27 Tim: db10 slave start potentially botched, going to re-read the dump and try again
  • 06:43 Tim: loading data into mysqld on db10
  • 06:35 Tim: copy finished, restored r/w on bugzilla
  • 05:43 Tim: copying data from db9 to db10 using: mysqldump -h db9 --master-data --single-transaction --all-databases | gzip --fast > db9-master-data-2008-11-13.sql.gz
  • 05:34 Tim: switching bugzilla into read-only mode for copy to db10. Queries will be denied by user permissions for all tables except logincookies.
  • 05:02 Tim: converting all tables in bugzilla to InnoDB except longdescs
  • 04:53 Tim: converting the MyISAM tables in otrs to InnoDB (the large ones are done already)
  • 04:49 Tim: converted donateblog and newsblog to innodb
  • 03:34 Tim: converted racktables DB to InnoDB
  • 01:59 atglenn: changed wireless network password
  • 01:43 Tim: doing lockless backup of db9 to db10. This will give us a fallback in case disaster strikes during the considerably more complex replication synchronised dump which will follow.
  • 00:45 brion: poked it again
  • 00:29 brion: updating for ContributionReporting

November 12

  • 23:38 brion: XHTML fixes for Collection made the broken 'Random book' link on en.wikibooks.org work again (it very inefficiently loads a giant page of links via JS, and needs it to be clean XML to parse it)
  • 23:16 brion: updated mw-serve
  • 22:48 brion: scapping for Collection ext updates
  • 20:10 brion: updated wgNoticeProject to wikimedia for incubator
  • 18:46 brion: added "uploader" group so we can bump known-good people into being able to upload without waiting for the autoconfirm heuristic
  • 03:14 river: didn't reboot ms1 as its lom is unreachable
  • 01:20 Tim: an error in the cron job on hume caused the r43398 bug to persist until this time, delivering incorrect language text in some site notices.
  • 01:08 Tim: Fixed those 50 servers with a couple of sed commands. Many of them were attempting to send data to larousse and zwinger. Tested srv125.
  • 00:56 Tim: srv125 was spewing PHP fatal errors without reporting them to the syslog on db20. Restarted it. A quick check (ddsh -cM -g apaches -- 'grep -q @syslog /etc/syslog.conf || echo help') suggests that there are 50 apache servers in the same situation.
  • 00:27 Tim: updated ExtensionDistributor configuration to account for amane -> ms1 storage move. (bug 16308)
  • 00:13 Tim: some language issues caused by r43398, reverted at 23:50 and resynced in fixed form at 00:12.
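
The 00:56 entry above checks each apache for an `@syslog` forward with `grep -q @syslog /etc/syslog.conf`. A minimal sketch of the same test on the text of a config file, with one deliberate difference from the grep: commented-out lines are ignored. The function name is hypothetical:

```python
def forwards_to(conf_text, host="syslog"):
    """Return True if any active line forwards messages to @host.

    Unlike a plain grep, text after a '#' is ignored, so a commented-out
    forward does not count.
    """
    target = "@" + host
    for raw in conf_text.splitlines():
        active = raw.split("#", 1)[0]
        if target in active:
            return True
    return False


# Illustrative config snippets, not copied from any real server.
print(forwards_to("*.* @syslog\n"))
print(forwards_to("# *.* @syslog\n*.* /var/log/messages\n"))
```

Like the grep, this is a substring match, so `@syslog2` would also count as a hit.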

November 11

  • 23:47 Tim: restored FlaggedRevs stats job as per Batch jobs, removal was not documented.
  • 23:35 Tim: r43398 worked just fine, memory usage dropped from ~4GB to 90MB. Adding rebuildTemplates.php to my crontab on hume, removing it permanently from Brion's on zwinger.
  • 23:28 Tim: updated CentralNotice templates on hume (which has enough memory to do it, unlike zwinger)
  • 22:11 Tim: deleted some binlogs on db1. Remaining disk space is still only 48 GB with negligible InnoDB free space.
  • 16:20 RobH: search2 still down, drives will not detect reliably. Ticket with sun reopened.
  • 15:56 RobH: replaced backplane on search2, reinstalling.
  • 15:13 RobH: srv137 back online. apache and memcached back up.
  • 14:49 RobH: srv100 back online.
  • 10:44 river: removed centralnotice php from brion's crontab as it was breaking zwinger
    • Core dump suggests the memory usage may be dominated by the localisation cache. wfMsgExt() loads the localisation for the requested language, and all languages are requested. -- Tim 12:07, 11 November 2008 (UTC)
  • 01:19 brion: swapped Commons to use $wgNoticeProject 'wikimedia' rather than having separate 'Commons needs you' notices
  • 00:57 brion: swapped in fundraiser to all projects

November 10

  • 19:18 mark: Shutdown AMS-IX route server 1 session as it's been flapping for hours

November 9

  • 16:11 river: removed nfsfind cronjob on ms1

November 7

  • 22:52 brion_: tossing 2008_meter_2b notice into partial rotation on enwiki -- has reduced collapsed version
  • 22:49 brion_: adding "_collapsed" to banner source tracking for collapsed view
  • 22:27 brion: scapping updates to ContributionReporting and CentralNotice
  • 01:43 Tim: experimentally reading the civicrm database into db10 with --master-data=1
  • 01:19 brion: db9 temporarily (hopefully) messed up. tim's fiddling with it to put it back
  • 01:05 Tim: my.cnf on db10 had an error in it, replicate-wild-do-tables instead of replicate-wild-do-table. Fixed it. The OTRS snapshot is now hopelessly out of date anyway, so I might wipe the data directory and start again. The idea is to set it up to replicate civicrm first. It's 100% InnoDB so should be easy to copy.
  • 00:09 river: upgraded ms2 to solaris 10 update 6
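
The 01:05 entry above traces a replication problem to a misspelled option, replicate-wild-do-tables instead of replicate-wild-do-table. A minimal sketch of a check that would flag that kind of typo, assuming a simple `name = value` file layout; the function name is hypothetical and the option list is partial, so unlisted but valid `replicate-*` options would also be reported:

```python
import difflib

# Partial list of MySQL replication filter options; extend as needed.
KNOWN_REPLICATION_OPTIONS = {
    "replicate-do-db",
    "replicate-ignore-db",
    "replicate-do-table",
    "replicate-ignore-table",
    "replicate-wild-do-table",
    "replicate-wild-ignore-table",
    "replicate-rewrite-db",
    "replicate-same-server-id",
}


def check_replication_options(cnf_text):
    """Return (line_no, name, suggestion) for unknown replicate-* options."""
    problems = []
    for line_no, raw in enumerate(cnf_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blanks, comments and [section] headers.
        if not line or line.startswith(("#", ";", "[")):
            continue
        name = line.split("=", 1)[0].strip().replace("_", "-")
        if not name.startswith("replicate-"):
            continue
        if name not in KNOWN_REPLICATION_OPTIONS:
            close = difflib.get_close_matches(
                name, sorted(KNOWN_REPLICATION_OPTIONS), n=1
            )
            problems.append((line_no, name, close[0] if close else None))
    return problems


sample = "[mysqld]\nreplicate-wild-do-tables = civicrm.%\n"
print(check_replication_options(sample))
```
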

November 6

  • 21:03 Tim: switched GIFs to use Bitmap_ClientOnly (client-side scaling)
  • 17:23 brion: restarting apache on srv47, seems mysteriously stuck
  • 17:15 brion: setting $wgMaxAnimatedGifArea to 1 to prevent animated thumbnailing of GIFs for now, see if that helps
  • 17:10 brion: river complaining of image scaler issues -- load spikes, depooling?
  • 02:35 mark: disabled BGP, now using lvs2 only
  • 02:25 mark: restarting lvs2 with new kernel
  • 01:52 due to switch issues, load balancing to lvs2/lvs4 stopped working. Mark restarted the BGP session which fixed it temporarily.
  • 01:42 Tim: restarting squids
  • 01:42 mark: Setup lvs4 as temp LVS support for lvs2, balancing the load
  • 01:07 brion: updated ContributionReporting to add paging links to ContributionHistory (might be a little funky w/ caching, we'll work it out :)
  • 00:45 Tim: progressively clearing /a on the remaining image scalers
  • 00:37 Tim: wiping /a on srv44
  • ~00:30 lvs2 went into overload and started losing packets. Upload squid slowly went down over the next half hour.
  • 00:00 brion: scapping for update to ContributionReporting

November 5

  • 23:38 brion: set yongle to restart apache every hour since it still seems to bork up and get stuck sometimes
  • 22:01 RobH: srv100 rebooted, was down.
  • 18:28 mark: tech team is procrastinating
  • 18:16 atglenn: added dhelps to office@wikimedia.org alias, redirected office@wikipedia.org to him also
  • 18:14 brion: disabling centralnotice on private wikis, we don't need to be told to donate to ourselves ;)
  • 18:03 brion: poking sitenotices off wikibooks, on *.wikipedia
  • 18:03 brion: set up ariel on mchenry for mail admin
  • 05:38 brion_: opera users may rejoice ;)
  • 05:38 brion_: tweaked storage1 lighttpd config so centralnotice.js is served with utf-8 charset
  • 05:17 brion_: for reference -- load spikes are page rendering on enwiki and dewiki mostly :)
  • 05:16 brion_: bumping enwiki notice to 100%
  • 05:06 Tim: killed various mysqld_safe processes which were using 100% CPU on ES servers
  • 04:50 brion_: fixed morebots -- bots now allowed to edit again at wikitech
  • 04:50 brion_: enabling enwiki notice at about 10% sampling
  • 03:27 brion_: squids are... i think.... looking better :D
  • ... brion: cleaned up movepage attack, restricted editing here for convenience
  • 02:47 brion_: seems happier after restart of front-end squids
  • 02:43 brion_: tim's doing hard restarts of more squids, we're kinda offline briefly
  • 02:34 brion_: disabling centralnotices on remaining sites just for good measure while we debug
  • 02:29 brion_: current status: the squids which borked are still kind of borked, but perhaps slightly better. mark is examining squid memory reports
  • 02:14 brion: tim's attempting to restart borked squids
  • 02:01 brion: disabling enwiki centralnotice while investigating hits dropoff

November 4

  • 21:36 Tim: added nagios monitoring of HTTP on image backends
  • 21:14 Tim: installed NRPE stuff on db19
  • 19:37 Tim: killed the broken NFS mount on db21:/mnt with umount -l. The processes that are waiting for it will probably hang until system restart
  • 18:33 brion_: enabling ja-wikipedia notice for testing :D
  • 18:32 Tim: installed nagios stuff on db21,db22,db23
  • 18:27 Tim: srv104 done, cluster18 re-added to the write list
  • 18:15 Tim: installed NRPE on srv159,srv171,srv183
  • 17:25 domas: bounced db16 after jfs deadlock
  • 17:24 brion: settin' centralnotice on wikibooks to test, should show up in a few minutes
  • 16:00 Tim: fixing max_rows on srv104
  • 15:41 Tim: switching cluster18 master from srv104 to srv105
  • 01:33 Tim: fixing max_rows on srv105 and srv106
  • 01:28 Tim: removed cluster17 from the write list, is full.

November 3

  • 23:28 Tim: installed xdiff and gmp on hume. Used a source install of libxdiff since it's not packaged, and pecl install for the pecl module. Used the stock libgmp, a source install from the debian sources for the PHP GMP module.
  • 22:05 brion: enabled extra file upload types for foundationwiki, since it's restricted-write-access
  • 21:42 Tim: initialising srv159/171/183 as cluster20.
  • 21:24 Tim: srv159 needs to be an ext store, and so will be moved from the disk-intensive image scaler role back to an ordinary apache.
  • 20:46 brion: Special:ContributionTracking form submission intermediary live on foundationwiki
  • 20:33 brion: scapping for ContributionTracking extension
  • 19:59 brion: enabled mp3 and aiff uploads for private wikis so jay can upload some radio PSAs for fundraiser
  • 19:46 brion: poking $wgSquidMaxage from 31 days to 1 hour on wikimediafoundation.org, since templates and funkypage URLs may do funky things and not get purged (extra parameters)
  • 19:32 brion: note there's no notice up yet ;)
  • 19:31 brion: enabling centralnotice loader on all wikis
  • 11:00 domas: mount -o remount,nobarrier /a on db15, observed 20x more performance. I am an idiot. :)
  • 02:36 brion-away: got a test centralnotice notice running on test.wikipedia.org. rock on
  • 02:18 brion: set up every-10-minute cronjob on zwinger to regen the centralnotice template JS files
  • 02:10 brion: centralnotice .js file loader up on test and meta for poking at
  • 01:12 mark: level 3 blackholing of traffic disappeared, brought BGP sessions back up
  • 00:59 mark: shutdown BGP session to AS 30217, for blackholing of traffic behind it (L3?)
  • 00:58 brion: network problems at pmtpa
  • 00:44 brion: for fun, did some load-time optimization on wikitech. trimmed out unneeded user/site .js, consolidated several .js files, and enabled mod_deflate for .css/.js. ssl setup time still sucks, and it's still a 1.7GHz Celeron. :)

November 2

  • 23:43 brion: added bot flag to domas's log bot so it doesn't get hit by the URL captcha
  • 23:29 domas: db19 jfs deadlocked: http://p.defau.lt/?hC8C7MTk9BdTKBEHFgcsqA
  • 23:28 brion: scapping for CentralNotice tweak update
  • 23:11 brion: setting up ContactFormFundraiser on wikimediafoundation.org for fundraiser templates
  • 22:52 brion: scapping for ContactPageFundraiser setup
  • 22:41 brion: poked spamregex update
  • 22:14 brion: added 403 block in checkers.php for 'speichern' GET parameter -- bug in a common dewiki user script allowing CSRF-type vandalism
  • 17:13 Tim: Unmounted /tmp, cleaned up /tmp. Deleting /a/tmp on all image scalers.
  • 16:48 Tim: set ImageMagick temporary directory to /a/magick-tmp. Will unbind the /tmp -> /a/tmp mount.
  • 15:06 river: added missing /mnt/upload5 mount on several apaches: srv37 srv61 srv76 srv69 srv63 srv118 srv132 srv135 srv133 srv138 srv136
  • 14:49 domas: few missing .frm files on db18 were causing trouble, resynced them from db19, resumed replication
  • 13:02 river: copying en from storage1 to ms1
  • 10:49 domas: replaced XFS with JFS on db18, installed ganglia on db17-db30
  • 10:36 river: completed move of commons, now being served from ms1 (except archive/)

November 1

  • 22:48 brion: fixed ContributionReporting to force a utf8 connection, now loads names in right charset
  • 22:20 brion: fixed $wgNoticeInfrastructure setting; defaults must have changed at some point
  • 22:15 domas: installed wikimedia-mysql4 on db21-23, established s1,s2,s3 replication. we now have full database copy in sdtpa \o/
  • 20:53 brion: deploying CentralNotice editing system on meta, woo
  • 20:27 brion: scapping to update reporting and centralnotice bits internally
  • 19:38 brion: rescapping to make sure 159 is unbroken
  • 19:27 brion: svn up'ing on wikitech just for domas
  • 19:25 brion: srv159 is out of space
    • We need to clean out the damn temp files somehow, eh?
  • 19:20 brion: scapping to update ContributionReporting ext
  • 12:56 mark: uppreffed traffic from knams to pmtpa via 6908/2828, as existing peering path had slight packet loss
  • 11:25 Tim: enabled subpages in the main namespace by default for all Wikisource wikis. This appears to be a de facto standard and is used by all wikisources with an entry in wgNamespacesWithSubpages.
  • 07:55 Tim: disabled ParserDiffTest, obsolete
  • 07:06 mark: XO circuit back up:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now up
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <w005.z207088246.xo.cnc.net>, session is now up

October 31

  • 23:11 brion: set up some logs for fundraising banner campaign clicks for later mining
  • 17:44 brion: adding support for Tomas skin on wikimediafoundation.org for new fundraiser templates
  • 14:24 mark: XO circuit went down:
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 207.88.246.5 <207.88.246.5>, session is now down because <Port State Down>
[vl101-ve5.csw5-pmtpa.wikimedia.org] BGP peer 2610:18:10a::1 <2610:18:10a::1>, session is now down because <Port State Down>

October 30

  • 23:11 Tim: fixed disk space on srv159, db1, srv103
  • 19:03 brion: updated triggers for donation reporting database a few minutes ago
  • 18:14 RobH: moved ms1 from pmtpa:a4 to sdtpa:a1, it's back online.
  • 17:46 RobH: db26 OS installed and online
  • 17:28 brion: added a spam filter rule for private-l messages :)
  • 04:54 river: testing sun web server on ms1
  • 03:56 brion: updating squid conf to send upload /centralnotice to storage1 for testing
  • 03:53 brion: tweaked lighttpd config on storage1 for centralnotice static file testing, since amane's configuration is too crappy to support regexes needed to set headers on a directory
  • 02:59 brion: poking experimental expires options on amane for static centralnotice tests
  • 02:44 brion: brion broke lighttpd.conf briefly

October 29

  • 22:39 brion: enabling $wgCodeReviewENotif experimentally
  • 18:35 brion: disabled bitmap fonts in fontconfig on image scalers, seems to help with the "mad helvetica" problem
  • 18:02 RobH: db28 & db29 OS installed and online.
  • 17:59 brion: fixed some upload directory perms on foundationwiki
  • 17:12 RobH: db27 OS installed and online.
  • 16:54 RobH: db21 OS installed and online.
  • 16:38 RobH: db22, db23, db25, db30 were installed yesterday, forgot to admin log it, sorry ;/
  • 14:44 _mary_kate_: copying wikipedia/commons/thumb/4 from storage1 to ms1

October 28

  • 20:02 domas: re-enabled db16
  • 18:03 mark: Removed blackholes.securitysage.com from lily's spamassassin configuration
  • 17:52 domas: db16 fubar'ed by queries that built 100GB temporary tables, leading to jfs hangs, leading to unhappy kernel.
  • 15:23 RobH: updated dsh node group ALL, added backup of frontend data for bugzilla and blogs from isidore to tridge.
  • 12:33 rainman-sr: experimentally turning on "did you mean.." on search8,9 for enwiki
  • 10:44 mark: Reverted yesterday's search changes

October 27

  • 23:24 mark: Switched to lucenesearch 2.1 for all wikis
  • 23:06 mark: pooled search8 as the only search server in search pool 3
  • 22:25 mark: rainman-sr is making me do more ugly things to lucene.php
  • 22:22 mark: Pointed search for "all other wikis" hardcoded to search7 in lucene.php
  • 22:14 mark: Added zhwiki and plwiki to lucene search 2.1 pool 2

October 26

  • 15:43 mark: Set up OpenGear serial console server scs-a1-sdtpa
  • 13:37 mark: Set up iBGP between csw1-sdtpa and csw5-pmtpa (IPv4/IPv6)
  • 13:36 mark: Prepared csw1-sdtpa for production deployment (general configuration)
  • 09:56 domas: updated db18 firmware to 2.1.1 (September 2008)
  • 04:31 Tim: fixed the "service_ips" hostgroup in nagios
  • 03:03 Tim: hardware reboot of db18
  • 02:47 Tim: mysqld on db18 apparently hit a kernel bug. It was reported as a zombie but was still using 200% CPU in top. kswapd was simultaneously using 100% CPU. Did not respond to SIGKILL. The non-zombie parent, mysqld_safe, also did not respond to SIGKILL (wchan=flush_cpu_workqueue). Attempted a reboot with shutdown -r.
  • 02:47 brion: tweaked MaxClientsPerChild on yongle to see if that helps with the mysterious hangs i sometimes see where requests seem to get backed up; it's disrupting the CodeReview proxy as well as mobile & Mac Dictionary search

October 25

  • 20:46 brion: scapped to r42573
  • 08:17 Tim: svn up to 42536 for API overload fix. Re-enabling disabled query modules.
  • 05:55 Tim: svn up/scap to 42531 (for properly tested Interwiki.php fix).
  • 05:09 Tim: DB overload on many enwiki slave servers. Long running queries attributed to ApiQueryAllpages, ApiQueryBacklinks, ApiQueryCategoryMembers and ApiQueryLogEvents. Disabled those modules and killed related running threads.
  • 05:01 Tim: Interwiki links were broken due to a totally broken and untested getInterwikiCached() function. Live patch deployed at this time.
  • 04:33 Tim: Fixed svn conflicts in two files. Scap to r42524.
  • 04:20 Tim: disabled Drafts extension on test.wikipedia.org. Trevor, please contact me for code review.
  • 04:11 Tim: synced php-1.5 to srv35 and ran "make -B" in the serialized directory. Seems to have fixed test. Will scap.
  • 01:01 ariel: preemptively up mail quota to 7GB from 1GB for cbass, dmenard
  • 00:59 brion: testwiki is borked until we figure out how to get it to load updated message files. tried disabling $wgLocalMessageCache and $wgCheckSerialized to no effect
  • 00:51 brion: temporarily blocking scap during testing :) ... running serialized language file updates for test, broken by need to get magic word updates
  • 00:44 brion: preparing a svn up...
  • 00:37 ariel: up msecoquian's mail quota from 1GB to 6.9GB

October 24

  • 23:12 brion: set up ariel (the person) on sanger to do mail administration -- quota fixes etc
  • 16:24 TimStarling: reloaded ourusers.sql on all core and ext. mysql servers, adding a nagios user
  • 15:39 mark: slacking
  • 15:36 TimStarling: added special nagios user to ES instances on clematis
  • 14:00 domas: re-enabled db5, added db18 to s3
  • 10:45 domas: taking out db5 for copy to db18
  • 10:44 domas: fixed ntpd on bart, was pointing to multicast address that doesn't work
  • 09:57 Tim: removed decommissioned servers from monitoring: dryas, alrazi, diderot, friedrich, samuel
  • 07:50 Tim: added monitoring for toolserver ES clusters 17-19
  • 07:40 Tim: regenerated trusted XFF list with extra SAIX proxies
  • 05:00 Tim: fixed nagios check script handling of MySQL connection errors
  • 01:37 brion: setting $wgLicenseURL for Collection to point at GFDL English text
  • 01:01 brion: enabling Drafts on testwiki, but it seems to not be saving there... works on my local test, not sure what the issue is
  • 01:03 brion: disabling logentry, still borken?

October 23

  • 22:33 brion: trying re-enabling logentry ext on wikitech, now with cache disable to avoid edittoken for now
  • 21:34 brion: updating ipblocks table definition
  • 21:25 brion: re-ran svnImport to update path listings for CodeReview
  • 20:11 mark: Set up search7 - search9
  • 17:05 mark: Pooled search4 as a s1 search server to help with dead search2
  • 16:33 brion: updated mw-serve
  • 15:38 Tim: On the image scalers, temporarily mounted /a/tmp as /tmp with --bind to stop the disk full problem while we figure out some better solution
  • 15:24 Tim: removed temporary files on image scalers again
  • 14:54 RobH: Replaced dead disk in amane, rebuilding array.
  • 11:04 Tim: Added disk space monitoring for image scalers. Also added apache monitoring which was also missing.
  • 10:53 Tim: freed up disk space on image scalers, magick-* temporary files were filling their root partitions
  • 10:50 Tim: re-added cluster19 to the default write list. Not sure who took it out or why.
  • 10:32 Tim: freed up some space on srv103 (was down to 500MB)
  • 10:29 Tim: fixed monitoring for MegaRAID SAS
  • 07:10 Tim: Set up monitoring of RAID status for all Ubuntu DB servers using the wikimedia-raid-utils package that I just wrote. It doesn't do anything on the MegaRAID servers yet, but the Adaptec ones should work.
  • 05:05 Tim: running CodeReview svnImport.php

October 22

  • 18:26 brion: enabling ODT output for collection
  • 18:17 brion: updating collection and codereview extensions
  • 18:13 Brion: updated mw-serve code and configured to send error emails per jojo's request
  • 17:15 Brion: Changed bugzilla's mail delivery from local sendmail (SSMTP) to direct SMTP, per Mark's recommendation

October 21

  • 19:29 RobH: Bayes upgraded from 2GB to 10GB.
  • 13:49 Tim: Did a demonstration hack of nagios from CSRF to arbitrary shell. Disabled cmd.cgi.
  • 04:13 Tim: Brought srv43-47 up as image scalers with mem limit 6 x 200MB = 1200MB (2GB physical)
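
The memory-limit arithmetic in the entry above (6 x 200MB = 1200MB on a 2GB box) can be sketched as follows. This is only an illustration: the function names are hypothetical, and 2GB is taken as 2048MB.

```python
def worst_case_mb(concurrency: int, per_process_limit_mb: int) -> int:
    """Worst-case memory use if every worker hits its per-process limit."""
    return concurrency * per_process_limit_mb


def fits_in_ram(concurrency: int, per_process_limit_mb: int, physical_mb: int) -> bool:
    """True if the worst case stays within physical memory (no swapping needed)."""
    return worst_case_mb(concurrency, per_process_limit_mb) <= physical_mb


# Figures from the log entry: 6 workers, 200MB each, 2GB physical (assumed 2048MB).
print(worst_case_mb(6, 200))
print(fits_in_ram(6, 200, 2048))
```

The same check applies to the later srv159 settings (8 x 200MB, 15 x 200MB); the log does not state that server's physical memory, so it is not filled in here.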

October 20

  • 18:11 RobH: srv118 rebooted, back online.
  • 17:25 RobH: srv79 was in kernel panic, rebooted.
  • 05:10 Tim: increased concurrency on srv159 to 15, for mem limit 15 x 200MB = 3000MB
  • 02:40 Tim: installed NRPE on khaldun and db20
  • 02:20 Tim: moved disk space checks on the ext stores from the "apaches" service group to the relevant ext store service group
  • 01:53 Tim: installed NRPE on the new ext stores
  • 01:45 Tim: Updated /etc/ssh/ssh_known_hosts on bart (copied from zwinger).
  • 00:30-01:30 Tim: Listed down servers on DC tasks. Removed broken servers from memcached rotation. Restarted apache on srv99, srv109, srv123. Purged master binlogs on srv102.

October 18

  • 21:45 RobH's mighty index finger brought amane and the site back up.
  • 21:00 river: Ran 'nc -l -p 623' command, amane's kernel panic'ed. Rob was called.
  • 20:55 mark, river: diagnosed the NFS communication problems to be caused by NIC hardware packet interception of port 623 packets... amane wasn't receiving NFS replies from ms1.
  • 19:40 mark: Upload got unhappy, ms1 NFS mount on amane was unreachable and stalling things
  • 13:40 Tim: down again, single process allocating all memory
  • 07:35 Tim: took it down again, while recording /proc/vmstat and /proc/stat
  • 06:27 Tim: restarted srv160
  • 05:45 Tim: took srv160 into the purple for a much more convincing overload, and different oprofile results
  • 03:40 Tim: used oprofile to determine what part of the kernel is responsible for the system CPU spike. Looks like a spinlock in dnotify.
  • 03:12 Tim: simulated a memory-intensive request rate spike to srv160. Large system CPU response spike, but it didn't go down completely. Will try a bigger one.

October 17

  • 21:10 brion: enabled Commons foreign image repo on Wikitech
  • 18:45 brion: created Wikimedia-Boston list for SJ
  • 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
  • 16:45 brion: deleted some junk comments from bugzilla
  • 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right
  • 14:22 RobH: srv131 back up.
  • 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
  • 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
  • 01:59 Tim: removing cups on all servers where it is running
  • 00:00 RobH: restarted srv43-47

October 16

  • 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
  • 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
  • 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
  • 18:40 RobH: rebooted scs-ext, oops!
  • 18:26 RobH: srv61 reinstalled and redeployed.
  • 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
  • 17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger
  • 17:02 brion: thumbnails on commons are insanely slow and/or broken
  • 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
  • 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
  • 13:38 Tim: disabled directory index on amane. Was generating massive amounts of NFS traffic by generating a directory index for some timeline directories.
  • 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
  • 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
  • 11:25 Tim: synced upload scripts (including to ms1)
  • 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
  • 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
  • 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
  • 08:26 Tim: Restarted ntpd on search7, was broken
  • 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.
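
The netmask problem described in the 06:42 and 17:34 entries can be illustrated with Python's standard ipaddress module. This is a rough sketch: the network base 208.80.152.0 comes from the 17:34 entry, but the two client addresses are hypothetical examples, not actual squid IPs.

```python
import ipaddress


def allowed(client: str, network: str) -> bool:
    """True if the client address falls inside the restricted network."""
    return ipaddress.ip_address(client) in ipaddress.ip_network(network)


# Hypothetical client addresses, chosen to fall in different parts of the range.
clients = ["208.80.152.100", "208.80.153.10"]

for prefix in (26, 25, 22):
    network = f"208.80.152.0/{prefix}"
    for client in clients:
        print(network, client, allowed(client, network))
```

A /26 covers only the first 64 addresses and a /25 the first 128, while a /22 covers 208.80.152.0 through 208.80.155.255, which is why widening the restriction lets more hosts sync.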

October 15

  • 19:49 brion: added '<span onmouseover="_tipon' to spam regex; some kind of weird edit submissions coming with this stuff like [1]
  • 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki.
  • 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
  • 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
  • 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
  • 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
  • 02:30 Tim: moved image scalers into their own ganglia cluster
  • 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
  • 02:13 tomasz: upgraded srv9 to ubuntu 8.04
  • 02:00 tomasz: upgraded srv9 to ubuntu 7.10

October 14

  • 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
  • 19:11 mark: Pooled old scaling servers srv38, srv39
  • 18:50 brion: at least four of new image scalers are down -- can't reach by SSH. thumbnailing is borked
  • 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty. ok so far
  • 16:20 brion: image scaling is broken in some way, investigating
  • 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
  • 00:10 Tim: oops, forgot to add VIPs, switched back.
  • 00:05 Tim: switched image scaling LVS to srv43-47

October 13

  • 23:45 Tim: prepping srv43-47 as image scaling servers
  • 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
  • 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
  • 21:30 RobH: search8 and search9 are online, awaiting configuration.
  • 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
    • rev:42058 will force GIFs over 1 megapixel to render a single frame instead of animations as a quick hackaround...
  • 20:48 domas: thistle serving as s2a server
  • 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
  • 19:53 RobH: search7 back online, awaiting addition to the search cluster.
  • 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
  • 17:00 RobH: srv21-srv29 decommissioned and unracked.
  • 12:05 domas: put lomaria back in rotation
  • 11:50 domas: Enabled write-behind caching on db15. Restarted.
  • 10:40 domas: restarted replication on db15 and lomaria
  • 10:27 domas: loading dewiki data from SQL dump into thistle
  • 09:09 Tim: restarted logmsgbot
  • 08:27 Tim: folded s2b back into s2
  • 08:06 Tim: db13 in rotation
  • 08:02 domas: copying from db15 to lomaria
  • 07:38 Tim: started replication on db13
  • 04:51 Tim: copying
  • 03:27 Tim: Preparing for copy from db15 to db13
  • 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.
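
The GIF hackaround noted under the 21:15 entry (render a single frame once a GIF exceeds 1 megapixel) amounts to an area threshold check. A minimal sketch, not the actual MediaWiki code: the function name is hypothetical, "1 megapixel" is read as 1,000,000 pixels, and area is assumed to mean width times height of one frame.

```python
# Assumption: "1 megapixel" is interpreted as 1,000,000 pixels.
MAX_ANIMATED_AREA = 1_000_000


def should_animate(width: int, height: int, frame_count: int,
                   max_area: int = MAX_ANIMATED_AREA) -> bool:
    """Return True if the thumbnail should keep its animation.

    Single-frame images are never animated; multi-frame images are
    animated only when width * height does not exceed max_area.
    """
    if frame_count <= 1:
        return False
    return width * height <= max_area


print(should_animate(800, 600, 10))    # small animated GIF
print(should_animate(2000, 1000, 10))  # over the threshold, single frame only
```

Lowering the threshold to a tiny value has the effect of switching animated thumbnailing off entirely, since almost every image then exceeds it.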

October 12

  • 23:58 brion: updated CodeReview to add a commit so loadbalancer saves our master position. playing with serverstatus extension on yongle to find out wtf it keeps getting stuck
  • 22:05 brion: db15 sucks hard. putting categories back to db13
  • 22:01 brion: db15 got all laggy with the load. taking back out of general rotation, leaving it on categories/recentchangeslinked
  • 21:58 brion: db15 seems all happy. swapping it in in place of db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
  • 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2. But db8 seems to be handling the load now so I'm going to bed.
  • 19:20 Tim: depooled db15.
  • 19:09 Tim: split off some wikis into s2b and put db8 on it. To reduce I/O and hopefully stop the lag.
  • 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13.
  • 18:38 Tim: offloading commons RCL queries to db13
  • 18:36 Tim: dewiki r/w with ixia (master) only
  • 18:33 Tim: offloading commons category queries to db13
  • 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
  • 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
  • 17:46 Tim: db8->db15 copy finished, deploying
  • 17:33 Tim: installed NRPE on thistle.
  • 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
  • 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
  • 16:07 Tim: repooled ixia/db8 r/o
  • 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
  • 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
  • 14:37 Tim: deployed r41995 as a live patch to replace buggy temp hack.
  • 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
  • 14:11 Tim: copy now in progress as planned.
  • 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
  • 13:40 Tim: ixia (s2 master) disk full. Classic scenario, binlogs stopped first, writing continued for 10 minutes before replag was reported.
  • 13:00 jeluf: moved wikipedia/m* image directories to ms1
  • 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
  • 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
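
The 13:40 entry describes a master filling its disk with binlogs before anyone noticed. A free-space check along these lines is one way to catch that earlier. This is a sketch under stated assumptions: the 10% threshold and the function name are hypothetical, and it is not the nagios check used on the cluster.

```python
import shutil


def low_on_space(free_bytes: int, total_bytes: int,
                 min_free_fraction: float = 0.10) -> bool:
    """True if free space has dropped below the given fraction of the disk."""
    return free_bytes < total_bytes * min_free_fraction


# Check the root filesystem of the local machine; a real check would
# point at the MySQL data or binlog partition instead.
usage = shutil.disk_usage("/")
print(low_on_space(usage.free, usage.total))
```

Keeping the comparison in a pure function makes the threshold easy to test separately from the filesystem call.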

October 11

  • 7:00 jeluf: moved wikipedia/s* image directories to ms1

October 10

  • 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
  • 19:20 RobH: Bayes online.
  • 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
  • 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
  • 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
  • 12:17 domas: killed long running threads
  • ~12:04: s2 down due to slave server overload

October 9

  • 22:52 brion: enabled Collection on de.wikibooks so they can try it out
  • 20:00 jeluf: moved wikipedia/i* images to ms1
  • 17:05 RobH: thistle raid died due to a failed hdd; replaced hdd, reinstalled as raid10.
  • 12:00 domas: switched s3 master to db1, did erase bunch of db.php stuff by accident (don't know how :). restored from db.php~ :-)
  • 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...

October 8

  • 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i suspect svn-proxy but can't easily prove it still. using separate svn command (in theory) but it's not showing me stuck processes.
  •  ??:?? rob fixed srv37, then later srv133 into mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
    • Speculation: possible that rumored ongoing image disappearances have been caused by the image-destruction bug still being in place on srv133 for the last month.
  • 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
  • 18:59 brion: aaron complaining of srv37 not properly updated (doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...

October 7

  • 21:47 brion: started two dump threads (srv31)
  • 21:16 RobH: installed and configured gmond on all knams squids.
  • 21:00 jeluf: moved wikipedia/g* to ms1
  • 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
  • 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
  • 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on new servers. Test run till tomorrow, if anything goes wrong, reroute search_pool_1 to old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning config files.
  • 15:54 RobH: srv101 crashed again, running tests.
  • 15:45 RobH: srv146 was powered down for no reason. Powered back up.
  • 15:42 RobH: srv138 locked up, rebooted, back online.
  • 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
  • 15:31 RobH: srv101 back up and synced.
  • 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
  • 15:21 RobH: updated lucene.php and synced.
  • 15:04 RobH: updated memcached to remove srv110 and add in spare srv137.
  • 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman

October 6

  • 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
  • 14:29 domas: amane lighty was closing connections immediately, worked properly after restart. upgraded to 1.4.20 on the way.
  • 14:36 RobH: setup ganglia on all pmtpa squids.
  • 13:50 mark: The slow page loading on the frontend squids appears to be limited to english main page only, for unknown reasons. Set another article as pybal check URL to prevent pooling/depooling oscillation by PyBal for now.
  • 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down

October 5

  • 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
  • 20:35 jeluf: wikipedia/b* moved, too
  • 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1
  • 14:30 jeluf: Moving all wikipedia/a* image directories to ms1

October 4

  • 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia about being LVS master, due to reboot this afternoon.
  • 14:30 mark: Several servers in J-16 were shutting down, or going down around this time. Reason unknown, possibly auto shutdown because of high temperature, possibly they were turned off by someone locally.
  • 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
  • 00:26 mark: Depooled srv61
  • 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!)
    • updating packages on srv37. srv61 seems to have internal auth breakage
    • updated packages on srv61 too. su still borked, may need LDAP fix or something?

October 3

  • 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
  • 20:01 brion: running updateRestrictions on all wikis (done)
  • 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
  • 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
  • 17:13 RobH: srv130 back online.
  • 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
  • 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure

October 1

  • 20:02 RobH: srv63 back online.
  • 19:35 RobH: srv61 and srv133 back online.
  • 18:22 RobH: storage3 online and handed off to brion.
  • 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
  • 17:32 RobH: srv61 faulty fan replaced, back online.
  • 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
  • 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
  • 23:46 mark: Reinstalled knsq24
  • 22:55 mark: Reenabled switchports of knsq16 - knsq30
  • 20:45 jeluf: fixed resolv.conf on srv131
  • 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5, started lighttpd on ms1
  • 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
  • 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.

Archives