Server admin log/Archive 17

From Wikitech

This page archives server admin logs from November 1st 2010 till December 31st 2010.

December 31

  • 21:55 tomaszf: kaldari aborted alter tables on db9 due to high i/o wait
  • 20:37 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/WikiEditor/WikiEditor.hooks.php 'r79366'
  • 20:01 Ryan_Lane: freeing up space on db17 by clearing out an old error log
  • 17:27 apergos: copying snapshot1:/data (which appears to be static html dumps of en and pl wiki) to tridge:/data/snapshot1, rsync running from tridge
  • 17:13 logmsgbot: catrope synchronized php-1.5/includes/specials/SpecialLinkSearch.php 'r79354'

December 30

  • 23:36 tomaszf: pushed r79304 to payments cluster .. updating tshirt images
  • 22:40 tomaszf: pushed r79297 to payments cluster .. adding wordmarks
  • 20:36 Ryan_Lane: powercycling mobile2 - it's dead
  • 19:44 apergos: add xml data fs to nfs in puppet, include that in snapshot hosts stanzas
  • 19:43 apergos: remove andrew from snapshot stanzas in puppet, he should already be covered by home-no-service
  • 19:33 awjr: pushed r79287 to payments cluster, replacing bad payflowpro_gateway.i18n.php with the good
  • 19:05 awjr: pushed r79282 of DonationInterface to payments cluster, updating credit card validation on cc form, picking up i18n changes and new css for premiums
  • 18:49 logmsgbot: catrope synchronized docroot/mediawiki/xml/export-0.4.xsd 'Update XSD file for XML dumps'
  • 15:59 RobH: all blogs updated successfully, all plugins updated and reactivated
  • 15:48 RobH: created backups of both blogs, pushing updates to new versions for security reasons
  • 11:27 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 11:25 logmsgbot: catrope synchronized php-1.5/skins/common/shared.css 'r79244'
  • 06:40 Tim: fixed my VOIP account

December 29

  • 21:01 richcole: drive15 from IBM db42
  • 19:22 Ryan_Lane: granted sudo privileges for demon on formey to create/modify/delete svn (LDAP) users
  • 02:03 tomaszf: updating payments cluster to r79148

December 28

  • 21:54 Ryan_Lane: destroying 1TB logical volume on storage3
  • 21:54 Ryan_Lane: umounting 1TB logical volume from /archive on storage3 and mounting 4TB xfs volume (data was moved prior to this)
  • 19:09 Ryan_Lane: moving archive data from /archive to /archive1 on storage3, will replace mount afterwards
  • 19:08 Ryan_Lane: creating new 4TB logical volume "archive1" on storage3, and putting an xfs filesystem on it
  • 16:34 mark: Defined special disk space check (with SMS notification) for mysql core databases
  • 15:52 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'adding db11, db18 and db39'
  • 14:30 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'enabling db37'
  • 13:16 domas: resumed replication on db11, db18, db37 and db39
  • 00:19 Ryan_Lane: restarting apache on srv217 - puppet updated its apache2.conf file, but didn't restart it
  • 00:12 Ryan_Lane: powercycling amssq53 - it's dead
  • 00:09 Ryan_Lane: powercycling sq76 - it's dead
  • 00:08 Ryan_Lane: knsq14 isn't coming back up - may be broken
  • 00:08 Ryan_Lane: powercycling knsq24 - it's dead
  • 00:02 Ryan_Lane: powercycling knsq14 - it's dead

December 27

  • 23:54 Ryan_Lane: starting apache on srv217, running puppet to ensure code is synched
  • 23:52 Ryan_Lane: restarting apache on srv231
  • 23:50 logmsgbot: laner synchronized php-1.5/wmf-config/mc.php 'Fixing IP address for srv226'
  • 23:48 logmsgbot: laner synchronized php-1.5/wmf-config/mc.php 'Fixing IP address for srv227'
  • 23:39 Ryan_Lane: powercycling srv217 - it's dead
  • 20:38 Ryan_Lane: adding puppet schema to opendj on nova-controller.tesla
  • 19:51 Ryan_Lane: specifically chgrp'd all files to svnadm and added write permissions for the group
  • 19:51 Ryan_Lane: opened up file permissions on /srv/org/wikimedia/svn on formey so that ^demon can edit files
  • 19:48 Ryan_Lane: disabled selenium service on windows7-1, and launched selenium services manually, while logged in as the selenium user
  • 18:04 Ryan_Lane: installing puppet, puppet-el, puppetmaster, puppetmaster-passenger, and vim-puppet on nova-controller.tesla
  • 17:56 Ryan_Lane: adding new 1TB "archive" logical volume on storage3, formatting as ext3, and mounting at /archive
  • 17:49 Ryan_Lane: purging some bin logs on db9 to free up space
  • 16:27 Ryan_Lane: installing graphviz on formey via puppet for bug 26404
  • 09:21 apergos: truncated log-all and log-index in /a/search/log on searchidx1 to get some space back

December 26

  • 18:57 Ryan_Lane: powercycling mobile1, as it is dead
  • 12:49 apergos: restarted apache on srv227, it was seemingly the source of the DOM related errors, many msgs like "PHP Fatal error: Cannot access property started with '\0' in /usr/local/apache/common-local/wmf-deployment/includes/parser/Preprocessor_DOM.php on line 1393" in /var/log/messages
  • 02:29 Ryan_Lane: running /home/wikipedia/common/wmf-deployment/maintenance/purgeStaleMemcachedFlaggedRevs.php on s3/s7 flaggedrevs enabled wikis
  • 00:08 logmsgbot: ariel synchronized wmf-deployment/wmf-config/db.php 'edits enabled for s3/s7, people should purge if they see weird things, flagged revs may still cause problems, hewiki still has issues'

December 25

  • 23:39 apergos: ran Platonides' script "purgeStaleMemcachedText.php" on s3/s7 projects, see /home/wikipedia/common/wmf-deployment/maintenance/purgeStaleMemcachedText.php on fenari (not added to the local branch yet)
  • 21:14 logmsgbot: ariel synchronized wmf-deployment/wmf-config/db.php 'read-only for s3/s7 while we sort out edit sync and cache issues'
  • 20:17 mark: New master bin-log db17-bin.003, position 406
  • 20:16 mark: Switched master on s3, s7 from db27 to db17, just one slave db25
  • 20:16 logmsgbot: mark synchronized php-1.5/wmf-config/db.php 'Going read-write on s3, s7'
  • 20:13 logmsgbot: mark synchronized php-1.5/wmf-config/db.php 'Switch to new master db17, slave db25 for s3/s7'
  • 20:03 logmsgbot: mark synchronized php-1.5/wmf-config/db.php
  • 20:02 logmsgbot: mark synchronized php-1.5/wmf-config/db.php 'Remove db39 from rotation'
  • 19:58 mark: Switching master to db17
  • 19:58 logmsgbot: mark synchronized php-1.5/wmf-config/db.php
  • 19:43 Ryan_Lane: purging old bin logs on db27
  • 14:32 apergos: turned off Indexes option on wikitech in apache config files, just like on the cluster
  • 12:17 apergos: powercycling searchidx1, can't get in through management console, nonresponsive to ssh, google says the weird ilom messages are a firmware issue, so... reset.
  • 11:11 rainman-sr: searchidx1 has been down for about 5 days now, can someone take a look at it? Cannot ssh to it. Also search group disappeared from ganglia.
  • 00:22 mark: Running apt-get dist-upgrade on mobile3
  • 00:20 mark: Rebalanced LVS weights of mobile servers on lvs4 from 20/50/30 to 30/50/50
  • 00:06 mark: powercycled mobile3
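
As a rough illustration of the 00:20 rebalancing entry: assuming the LVS scheduler hands out new connections in proportion to each real server's weight (the log does not say which scheduler lvs4 uses, nor which weight belongs to which mobile host), the expected traffic shares before and after can be sketched as follows. The function name is hypothetical.

```python
def traffic_shares(weights):
    """Return each server's expected fraction of traffic, assuming
    traffic is split in proportion to weight (an assumption, not
    something the log states)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


before = traffic_shares([20, 50, 30])
after = traffic_shares([30, 50, 50])

# Print the shares as percentages, in the same positional order as the log.
for label, shares in (("before", before), ("after", after)):
    print(label, [f"{s:.1%}" for s in shares])
```

Under that assumption the server weighted 50 in both configurations goes from half of the traffic to a bit under 40%, while the other two each pick up a larger share.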

December 24

  • 18:18 apergos: changed cname for dumps.wikimedia.org to point to dataset2 (so all related aliases will also point there)
  • 04:47 Ryan_Lane: adding 512MB ram to nova-controller.tesla; shutting down VM to do so

December 23

  • 18:23 apergos: added export of /data for download host to class in puppet
  • 17:57 Ryan_Lane: ignore that last message
  • 17:57 Cobi: TODO: Give Cobi root on servers
  • 17:54 Ryan_Lane: added vivek and aditya to nova-controller.tesla, nova-compute1, nova-compute2, and added them to sudoers
  • 17:46 apergos: added misc::download class for dataset2 in puppet (handles lighttpd for download.wm)
  • 14:02 hcatlin: deploying code update to m.
  • 10:42 logmsgbot: tfinc synchronized php-1.5/wmf-config/reporting-setup.php 'Flipping back stats page to db10 since we've cleared up our space issues'
  • 01:39 apergos: dataset2 racked and installed (thanks rich, robh, mark). now copying data back off tridge, rsync in screen session from tridge as root.
  • 00:40 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_body.php
  • 00:40 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionTracking/ContributionTracking.i18n.php
  • 00:23 tomaszf_: updating IPN listener on grosley to r83

December 22

  • 23:58 tomaszf: updating civicrm thank_you module to pick up email fix
  • 23:35 nelson__: snow
  • 23:28 tomaszf: updating thank_you and queue2civicrm module on grosley
  • 23:28 Ryan_Lane: started profile-collector on spence
  • 23:21 logmsgbot: tfinc synchronized php-1.5/extensions/LandingCheck/SpecialLandingCheck.php
  • 22:05 logmsgbot: demon synchronizing Wikimedia installation... Revision: 78853: Deploy various FR/PC changes, r78852
  • 21:30 RobH: updated dns with dataset2 info
  • 21:24 tomaszf: fixing perms on grosley:/srv/org.wikimedia.civicrm/sites/all/modules to point to www-data instead of awjrichards
  • 16:30 Ryan_Lane: apache-graceful-all was me ;)
  • 16:23 Ryan_Lane: reloading apache on srv*, incrementally
  • 15:48 logmsgbot: catrope synchronized php-1.5/extensions/ContributionReporting/FundraiserStatistics_body.php 'r78799'
  • 15:48 logmsgbot: catrope synchronized php-1.5/extensions/ContributionReporting/ContributionReporting.php 'r78799'
  • 11:22 Tim: added Roan to security@wikimedia.org alias list

December 21

  • 21:42 RobH: db41 online and slated for owa project
  • 21:17 tomaszf: dropping old faulkner db on db9
  • 01:02 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 26357 - Flood flag for foundationwiki'
  • 01:01 tomaszf: moving over hume banner impression logs : 2010-12-08 -> 2010-12-13 to storage3 : /archive/udplogs
  • 00:49 tomaszf: moving over hume banner impression old logs : 2010-12-04 -> 2010-12-07 to storage3 : /archive/udplogs
  • 00:49 Ryan_Lane: changing expiration times of text/javascript and application/x-javascript from A2592000 to A345600 in apache config as a short-term fix for bannerlist loading
  • 00:24 tomaszf_: starting copy from hume -> storage3 to free up space
  • 00:20 logmsgbot: catrope synchronized php-1.5/languages/Names.php 'r78543'
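
For reference on the 00:49 apache entry: in mod_expires shorthand, a value such as A2592000 means "access time plus 2592000 seconds". A minimal sketch converting the two values from that entry to days (the helper name is hypothetical):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def expires_to_days(value):
    """Convert a mod_expires shorthand value like 'A2592000' to days.

    The leading letter is the base time ('A' = access, 'M' = modification);
    the rest is an offset in seconds.
    """
    base, seconds = value[0], value[1:]
    if base not in ("A", "M") or not seconds.isdigit():
        raise ValueError(f"not a mod_expires shorthand value: {value!r}")
    return int(seconds) / SECONDS_PER_DAY


for value in ("A2592000", "A345600"):
    print(value, "=", expires_to_days(value), "days")
```

In other words, the change shortens the expiry for JavaScript responses from 30 days to 4 days.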

December 20

  • 21:28 RobH: bayes is back online, updated
  • 18:43 Ryan_Lane: configuring pdns on nova-controller.tesla to use strict mode for LDAP
  • 13:16 logmsgbot: tstarling synchronized wmf-deployment/cache/trusted-xff.cdb
  • 13:14 Tim: updating TrustedXFF
  • 12:14 apergos: one hour later, 6.62GB has been copied of the XML dumps, out of 1T. Let's see how long it takes to get the whole thing over.
  • 11:13 apergos: started copy to google storage of the last complete xml dumps for each project (note that "complete" may include "completed with failures"), from dataset1, in root screen session
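
A back-of-the-envelope estimate for the 12:14 entry, assuming the copy rate stays constant at the observed 6.62 GB per hour and reading "1T" as 1024 GB (both are assumptions; the function name is hypothetical):

```python
def hours_remaining(total_gb, copied_gb, elapsed_hours):
    """Estimate hours left for a copy, assuming a constant transfer rate."""
    if copied_gb <= 0 or elapsed_hours <= 0:
        raise ValueError("need a positive amount copied and elapsed time")
    rate_gb_per_hour = copied_gb / elapsed_hours
    return (total_gb - copied_gb) / rate_gb_per_hour


remaining = hours_remaining(total_gb=1024, copied_gb=6.62, elapsed_hours=1)
print(f"about {remaining:.0f} hours, or {remaining / 24:.1f} days, remaining")
```

At that rate the full copy would take on the order of six days.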

December 19

  • 18:58 mark: Powercycled amssq42, 48, 57, 58
  • 18:22 mark: Fixed puppet on ekrem
  • 16:20 mark: Fixed ganglia on spence again
  • 14:49 mark: Fixed Ganglia on Spence
  • 14:21 mark: Fixed puppet payments3 exported resource override issue
  • 14:02 mark: Rebooting spence
  • 13:44 mark: Starting dist upgrade from hardy to lucid on spence
  • 13:43 mark: Stopping nagios on spence
  • 13:42 mark: Removed mysql-server from spence
  • 13:29 mark: Disabled Merlin Nagios module
  • 13:11 mark: Rebooting spence
  • 13:02 mark: Running apt-get dist-upgrade on spence
  • 12:57 mark: Power cycled spence

December 18

  • 23:11 RobH: srv187-srv189 relocated and powering up
  • 22:27 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv181-srv186 in'
  • 22:26 mark: Updated smokeping with new rack layouts and contents
  • 22:21 RobH: srv181-srv186 relocated and coming back online
  • 22:21 RobH: srv187-srv189 coming down for relocation
  • 22:00 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv153 srv154 in'
  • 21:58 RobH: srv153, srv154 powering back up
  • 21:53 RobH: shutting down srv153, srv154 power redistribution
  • 21:51 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv154-srv153 out'
  • 21:28 RobH: srv181-srv186 going offline for relocation
  • 21:27 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv181-srv186 pulled when applicable'
  • 21:23 logmsgbot: hashar synchronized php-1.5/wmf-config/abusefilter.php 'resyncing abusefilter.php for the offline apaches'
  • 21:20 logmsgbot: robh synchronized php-1.5/wmf-config/db.php
  • 21:15 RobH: srv157 accidental reboot, oops =P
  • 21:13 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'commented out srv156, its coming down for power rebalancing'
  • 21:11 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv178-srv180 back in'
  • 21:05 RobH: srv175-srv180 relocated to new rack and coming online
  • 20:53 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv177 back in'
  • 20:42 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv175 srv176 back in'
  • 20:06 logmsgbot: hashar synchronized php-1.5/wmf-config/abusefilter.php 'bug 26364: abuse filter privacy issue on hiwiki'
  • 20:02 RobH: srv175-srv180 coming down for relocation
  • 20:02 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'putting ms2 back into ES rotation, removing srv175-srv180'
  • 19:39 logmsgbot: hashar synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 26364: abuse filter privacy issue on hiwiki'
  • 19:28 RobH: hume back online in new rack
  • 18:38 mark: DRBD failover to nfs2 successful, nfs2 is now primary
  • 18:38 RobH: pdf2 relocated, powering up
  • 18:32 mark: Starting DRBD failover of nfs1 to nfs2
  • 18:20 RobH: nfs2 reracked, powered up, online (replication check delayed until later)
  • 18:19 RobH: pdf2 coming down for relocation, will be back online shortly
  • 17:51 RobH: shutting down nfs2 for relocation
  • 17:42 RobH: tridge moved, powering back up
  • 17:00 RobH: streber & williams moved, powering up
  • 16:43 RobH: williams and streber coming down for relocation (otrs and rt will be offline during this transition)
  • 16:33 RobH: mchenry & sanger moved, powering up
  • 16:09 RobH: shutting down mchenry & sanger to relocate them
  • 16:03 mark: Started slave on ms1
  • 15:59 mark: Renamed asw-b2-pmtpa to asw-d1-sdtpa in DNS
  • 15:30 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'removing ms5 for relocation, will add back again shortly'
  • 15:28 RobH: ms1 is going to power down and be relocated
  • 15:26 mark: Failed over traffic back from amslvs4 to amslvs2
  • 15:23 RobH: ms5 moved and powering up
  • 15:20 mark: Running apt-get dist-upgrade && reboot on amslvs2
  • 15:11 mark: Failed over upload.esams traffic from amslvs2 to amslvs4
  • 15:04 mark: Running apt-get dist-upgrade && reboot on amslvs4
  • 14:51 mark: Running apt-get dist-upgrade && reboot on amslvs3
  • 14:48 mark: Failed over traffic back from amslvs3 to amslvs1
  • 14:46 RobH: ms5 shutting down for relocation
  • 14:32 mark: Running apt-get dist-upgrade && reboot on amslvs1
  • 14:27 mark: Made puppet install kernel 2.6.36 on LVS balancers
  • 14:22 mark: ES clusters 7, 20, 21 copies finished
  • 14:22 mark: ES cluster 8 copy finished
  • 14:07 mark: Imported package "linux-image-2.6.36-1-server" from the kernel PPA into the Wikimedia APT repository for lucid-wikimedia, section universe

December 17

  • 23:38 RobH: srv169 is not racked, has a dead hard disk, and had to use its rail for another server until we drill the stuck rail out of old rack
  • 23:38 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'pushing srv170-174 back into file'
  • 23:36 RobH: srv170-srv174 racked and powered, mysql running, putting back into db.php, puppet currently running to bring apache online
  • 23:25 apergos: added default values for tick and freq to /etc/default/adjtimex on dataset1 manually (ubuntu install of package is broken and leaves broken conf file, known bug, etc.)
  • 23:12 mark: ES cluster 6 copy done
  • 23:00 RobH: shutting down and moving srv169-srv174
  • 22:59 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv169-srv174 out for relocation'
  • 22:46 mark: Started copy of ES cluster 21 data from srv161 to tridge (screen on tridge)
  • 22:43 mark: Started copy of ES cluster 20 data from srv160 to tridge (screen on tridge)
  • 22:41 mark: ES cluster 10 copy done
  • 22:40 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv157 back in es rotation'
  • 22:39 mark: Started copy of ES cluster 8 data from srv156 to tridge (screen on tridge)
  • 22:38 mark: ES cluster 5 copy done
  • 22:31 mark: Stopped apache & puppet on srv170 to speed up last bit of the copy
  • 22:26 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'removing srv157 for relocation'
  • 22:25 RobH: shutting down srv157 for relocation
  • 22:25 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'putting srv163-srv168 back into rotation'
  • 22:22 RobH: srv163-168 relocated, puppet will run automatically to repool apache, mysql is already running, pushing them back into ES cluster
  • 22:18 mark: ES cluster 9 copy done
  • 22:16 apergos: trying new approach to clock drift on dataset1: installed adjtimex, set adjtimex --tick 10853 (after doing adjtimex --compare), set hwclock -s and restarted ntpd. hwclock seems to be much more accurate than system clock. Note: changing clock source from tsc to hpet made no difference, so changed that back.
  • 21:59 mark: Stopped puppet and apache on srv157 to speed up copy
  • 21:33 RobH: shutting down srv163-srv168 for relocation
  • 21:32 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv163-srv168 depooled for relocation'
  • 21:28 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'putting srv158-srv162 back into service'
  • 21:27 RobH: srv158-srv162 back in service, pushing back into db.php
  • 21:26 mark: ES cluster 4 copy finished
  • 21:26 mark: ES cluster 3 copy finished
  • 20:35 RobH: shutting down srv158-srv162, will relocate srv157 after the ES data is backed up to tridge
  • 20:32 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'pulling srv158-srv162 from ES rotation for relocation'
  • 20:28 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'srv182 is online pushing it back into es service'
  • 20:28 mark: Started copy of ES cluster 10 data from srv170 to tridge (screen on tridge)
  • 20:23 mark: Started copy of ES cluster 9 data from srv157 to tridge (screen on tridge)
  • 20:16 RobH: working on srv169
  • 20:16 mark: Started copy of ES cluster 7 data from srv155 to tridge (screen on tridge)
  • 20:05 mark: Powercycled sq69
  • 19:45 mark: Started copy of ES cluster 6 data from srv154 to tridge (screen on tridge)
  • 19:37 mark: Started copy of ES cluster 5 data from srv153 to tridge (screen on tridge)
  • 19:37 mark: Started copy of ES cluster 4 data from srv152 to tridge (screen on tridge)
  • 19:28 mark: Started copy of ES cluster 3 data from srv151 to tridge (screen on tridge)
  • 19:18 RobH: srv151-srv156 back in service
  • 19:17 logmsgbot: robh synchronized php-1.5/wmf-config/db.php 'put srv156 back into service'
  • 18:39 logmsgbot: awjrichards synchronized php-1.5/wmf-config/reporting-setup.php 'Updating maximum amount for contribution reporting'
  • 18:31 mark: Setup all new PDUs
  • 18:17 RobH: shutting down srv151-srv156 for relocation
  • 17:36 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'moved srv254-srv257 into spares group'
  • 17:33 RobH: srv254-srv257 back in rotation
  • 16:50 mark: Set up PDUs ps1-d2-sdtpa and ps1-d3-sdtpa
  • 16:39 mark: Updated Torrus power monitoring with the new PDUs
  • 16:38 RobH: srv254-srv257 going offline for relocation
  • 16:38 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv254-srv257 replaced in active rotation for relocation'
  • 16:03 mark: Ignored replication of db bugzilla3 on storage3, repaired bugzilla3 tables, and restarted replication
  • 15:55 mark: Setup DNS for new PDUs
  • 15:16 mark: Updating dns for asw-b1-pmtpa and asw-b4-pmtpa name change
  • 04:08 RobH: srv247-srv253 back in rotation and relocated into d2 from b2 (will update racktables shortly)
  • 03:26 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv240-srv246 in srv247-srv253 out'
  • 03:21 RobH: srv247-srv253 removing from memcached rotation and shutting down for rack relocation
  • 03:20 RobH: srv240-srv246 online, pushing back into memcached rotation
  • 02:30 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv233-srv239 in, srv240-srv246 out for relocation'
  • 02:27 RobH: removing srv240-srv246 from mc rotation and shutting down for relocation
  • 02:27 RobH: srv233-srv239 successfully relocated, pushing them into mc rotation
  • 01:18 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv233-srv239 moved out of memcached rotation'
  • 01:14 RobH: srv233-srv239 being removed from memcached pool and shutdown for rack relocation. srv228-srv232 being put back into memcached
  • 01:06 RobH: srv232 moved and powering up
  • 00:46 Ryan_Lane: rsyncing data from storage3 to db10
  • 00:43 RobH: srv226-srv231 powering up in their new home in d2-sdtpa
  • 00:31 Ryan_Lane: stopping mysql on storage3
  • 00:31 Ryan_Lane: update puppet config for storage3 to update ganglia and nagios
  • 00:25 logmsgbot: tfinc synchronized php-1.5/wmf-config/reporting-setup.php
  • 00:15 Ryan_Lane: copying snapshot data to /a/sqldata (excluding faulkner db)
  • 00:11 Ryan_Lane: deleting databases on db10
  • 00:10 Ryan_Lane: stopping mysql on db10 and restoring data from snapshot
  • 00:06 Ryan_Lane: starting mysql on db10
  • 00:05 Ryan_Lane: deleting faulkner db on db10

December 16

  • 23:48 Ryan_Lane: starting mysql on storage3
  • 23:42 Ryan_Lane: stopping mysql on storage3 to replace the faulkner db
  • 23:33 Ryan_Lane: starting mysql on storage3
  • 23:07 RobH: srv226-srv232 coming down for relocation
  • 23:06 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'moved servers around for relocation'
  • 22:38 RobH: srv281 reinstalling
  • 22:24 RobH: srv281 coming down for repair (isn't in memcached active pool)
  • 22:11 Ryan_Lane: rebooting storage3
  • 22:10 Ryan_Lane: restarting apparmor on storage3 to reload profiles
  • 21:09 RobH: all power strips in row c and d sdtpa have been wired for serial
  • 20:18 Ryan_Lane: stopping mysql on db10 and copying data to storage3
  • 20:11 Ryan_Lane: installing lvm on storage3
  • 20:11 tomaszf: killing long running civicrm queries on db9
  • 19:59 Ryan_Lane: creating LVM configuration using sda
  • 19:48 Ryan_Lane: reformatting /dev/sda1 as xfs on storage3
  • 19:48 Ryan_Lane: unmounting /data on storage3 and mounting as /a
  • 19:44 CodeBlock: installing puppetmaster and puppet on nova-controller
  • 19:35 Ryan_Lane: installing mysql-client mysql-server mysql-common on storage3

December 15

  • 22:26 Ryan_Lane: restarted slave on db10
  • 22:25 Ryan_Lane: killed mysqlcheck process on db10, as it was eating all available cpu
  • 22:09 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 26222 - Set $wgLogo to local logo on rowiki'
  • 21:55 Ryan_Lane: restarted slave on db10
  • 21:16 Ryan_Lane: restarting mysqld on db10
  • 21:16 Ryan_Lane: enabling long query log on db10
  • 20:43 Ryan_Lane: stopped and started slave on db10
  • 19:39 RobH: pushing dns change to switch dumps to dataset1 from singer (download and all its various incarnations are all pointing at dumps)
  • 19:16 RobH: amaranth accidental reboot due to power removal
  • 18:50 RobH: msw-b1-sdtpa power cycled due to power cable being moved within the rack
  • 18:41 RobH: storage3 powering down to have its power distro fixed in rack
  • 18:30 RobH: bringing down db41/db42, the power in rack B1-sdtpa is not balanced properly.
  • 17:52 logmsgbot: demon synchronized php-1.5/extensions/CodeReview/backend/CodeRevision.php 'r78443'
  • 17:45 logmsgbot: demon synchronized php-1.5/extensions/CodeReview/backend/CodeRevision.php 'Debugging pywikipedia failure'
  • 09:48 domas: upgraded stuff on secure to fix ssl warnings
  • 02:37 phuzion: Wrote configs for powerdns with ldap backend and restarted powerdns on nova-controller.tesla.
  • 02:34 Ryan_Lane: restarting opendj on nova-controller.tesla
  • 02:34 Ryan_Lane: adding the dnsdomain2 schema to opendj on nova-controller.tesla
  • 02:13 phuzion: Installed pdns-server and pdns-backend-ldap on nova-controller.tesla
  • 01:15 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionReporting/FundraiserStatistics.js 'updating to r78421 so fundraising stats page defaults to 2010 fundraiser'
  • 01:15 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionReporting/FundraiserStatistics_body.php 'updating to r78421 so fundraising stats page defaults to 2010 fundraiser'
  • 00:09 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionReporting/FundraiserStatistics.js 'Reverting to r64689 due to broken JS'

December 14

  • 23:59 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionReporting/FundraiserStatistics.js 'updating to r78414 so fundraising stats page defaults to 2010 fundraiser'
  • 22:36 Ryan_Lane: installing python-crypto on nova-controller.tesla
  • 20:09 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Raise $wgAPIMaxResultSize to 12MB temporarily because there are 10MB revs on enwiki even though $wgMaxArticleSize is 2MB'
  • 12:26 Tim: spent the 5 minutes required to get MediaWiki tarball downloads working again from download.wikimedia.org

December 13

  • 23:47 apergos: dataset1 is up and ssh accessible, with the data partition mounted read-only; rsync running as root in a screen session on tridge, copying /data/xmldatadumps from dataset1 to /data/XMLdumps on tridge.
  • 21:37 Ryan_Lane: restarting pdns on ns1
  • 21:36 Ryan_Lane: changing IP address for payments.wikimedia.org in DNS
  • 20:54 Ryan_Lane: restarting pybal on lvs4
  • 20:33 Ryan_Lane: installing wikimedia-lvs-realserver on payments3 and payments4
  • 20:31 Ryan_Lane: adding payments configuration to pybal on lvs4
  • 20:11 Ryan_Lane: moving payments4 into public subnet and restarting network services
  • 20:10 Ryan_Lane: moving payments4 into squid vlan
  • 20:09 mark: Failed over text.esams and bits.esams traffic to amslvs3 using an increased BGP metric for amslvs1 routes on csw1-esams
  • 20:03 mark: Configured amslvs3 with option conn_tab_size=20 for module ip_vs
  • 20:01 mark: Rebooting amslvs3 again to test
  • 19:53 mark: Rebooting amslvs3
  • 19:51 mark: Running apt-get upgrade on amslvs3
  • 19:50 mark: Added "kernel-ppa" PPA to amslvs3, installing linux 2.6.36
  • 18:34 Ryan_Lane: renaming payments3 and payments4 in puppet
  • 18:27 Ryan_Lane: restarting network and services on payments3
  • 18:00 Ryan_Lane: moving payments3 and payments4 to a public subnet
  • 17:59 Ryan_Lane: disabling payments3 and payments4 from haproxy config on loudon
  • 12:42 mark: Revived down servers

December 12

  • 22:01 logmsgbot: aaron synchronized php-1.5/wmf-config/flaggedrevs.php 'removed dead settings'
  • 18:47 apergos: bayes unreachable by mgmt or ping, needs power cycled somehow
  • 18:47 apergos: bringing snapshot1 back up (not intended, but it was sitting there in the bios; I was actually looking for the bayes mgmt console, as its current ip for that is nonresponsive... seems like snapshot1 got the old one though).
  • 15:55 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Belated style version bump for last sync'
  • 15:38 logmsgbot: catrope synchronized wmf-deployment/extensions/UploadWizard/resources/combined.min.js 'r78249'
  • 14:26 logmsgbot: aaron synchronized php-1.5/wmf-config/flaggedrevs.php 'Reorganized autopromote settings to be more sane'
  • 05:52 Ryan_Lane: installing php5-uuid on nova-controller.tesla

December 10

  • 23:29 Ryan_Lane: adding generic::geoip to owa*
  • 20:24 RobH: updating dns with emery entry
  • 20:20 RobH: snapshot1 powering back up, emery being pulled for temp storage use
  • 18:54 RobH: snapshot1 going offline to add capacity
  • 13:40 mark: Deployed new version exim 4.69-2ubuntu0.1wm2 with a str_format vuln fixed from exim 4.70 (lily, mchenry, sanger, grosley, williams)
  • 13:40 mark: Ran apt-get upgrade on sockpuppet
  • 13:40 mark: Ran apt-get upgrade on lily
  • 05:57 Ryan_Lane: cleaning /tmp on pdf1-3
  • 05:45 Ryan_Lane: rebooting pdf3
  • 05:38 Ryan_Lane: running mw-serve-ctl in cleanup mode on pdf3
  • 02:10 Ryan_Lane: installing ldap scripts into /usr/local/sbin on nova-controller.tesla
  • 02:03 Ryan_Lane: installing ldap-utils on nova-controller.tesla
  • 01:06 Ryan_Lane: installing php5-curl on nova-controller.tesla
  • 00:27 Ryan_Lane: installing php5-memcache on nova-controller.tesla
  • 00:27 Ryan_Lane: installing php-apc and memcached on nova-controller.tesla
  • 00:18 mark: Deployed slightly modified exim4-daemon-heavy 4.69-2ubuntu0.1wm1 package on mchenry, sanger, lily, grosley

December 9

  • 21:51 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_body.php 'Updating ContributionTracking to have alternate item names for recurring and one-time paypal donations; Also now using i18n messages for item-name text'
  • 21:50 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking.i18n.php 'Updating i18n file to allow for translatable item-name for paypal payments'
  • 21:10 logmsgbot: awjrichards synchronized php-1.5/wmf-config/CommonSettings.php 'Re-updating common settings to use new configuration style and to play nicely with new contribution-tracking-setup.php'
  • 21:09 logmsgbot: awjrichards synchronized php-1.5/wmf-config/contribution-tracking-setup.php 'Adding separate contribution-tracking-setup.php configuration file for contribution tracking so inclusion of ContributionTracking.php does not override values set in reporting-setup.php'
  • 21:02 logmsgbot: awjrichards synchronized php-1.5/wmf-config/CommonSettings.php 'Fixing config revert due to misspelled extension path'
  • 21:01 logmsgbot: awjrichards synchronized php-1.5/wmf-config/CommonSettings.php 'Rolling back config changes due to config problems - looking for database in wrong place'
  • 20:56 logmsgbot: awjrichards synchronized php-1.5/wmf-config/CommonSettings.php 'Updating ContributionTracking configuration to use style and also updating config vars for ContributionTracking to use new recurring payments IPN listener'
  • 20:55 logmsgbot: awjrichards synchronized php-1.5/wmf-config/InitialiseSettings.php 'Updating ContributionTracking configuration to use style and also updating config vars for ContributionTracking to use new recurring payments IPN listener'
  • 20:48 RobH: updating dns for storage3
  • 20:47 logmsgbot: aaron synchronized php-1.5/wmf-config/flaggedrevs.php 'removed unused config var'
  • 20:37 richcole: rebooting dataset1 to run scripts
  • 20:36 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_body.php 'Updating ContributionTracking extension to be able to handle recurring donations via PayPal'
  • 20:36 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking.php 'Updating ContributionTracking extension to be able to handle recurring donations via PayPal'
  • 20:36 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_OWA_ref.sql 'Updating ContributionTracking extension to be able to handle recurring donations via PayPal'
  • 20:35 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/patch-owa.sql 'Updating ContributionTracking extension to be able to handle recurring donations via PayPal'
  • 20:35 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking.sql 'Updating ContributionTracking extension to be able to handle recurring donations via PayPal'
  • 20:35 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking.pg.sql 'Updating ContributionTracking extension to be able to handle recurring donations via PayPal'
  • 20:34 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking.i18n.php 'Updating ContributionTracking extension to be able to handle recurring donations via PayPal'
  • 20:34 logmsgbot: awjrichards synchronized php-1.5/extensions/ContributionTracking/ContributionTracking.alias.php 'Updating ContributionTracking extension to be able to handle recurring donations via PayPal'
  • 20:17 logmsgbot: aaron synchronized php-1.5/wmf-config/flaggedrevs.php 'cleaned up tag config to use newer format'
  • 19:48 logmsgbot: aaron synchronized php-1.5/wmf-config/flaggedrevs.php 'removed unused config var'
  • 19:33 Ryan_Lane: powercycling srv281
  • 19:31 Ryan_Lane: powercycling srv266
  • 19:28 Ryan_Lane: powercycling sq64
  • 19:25 Ryan_Lane: powercycling sq49
  • 04:02 Tim: reset list admin password for cu-ombuds-l
  • 01:17 RobH: storage3 online to pull off old data
  • 01:05 RobH: dns updated for storage3 mgmt

December 8

  • 23:31 Ryan_Lane: Adding additional directory and alias for owa resources via puppet on owa*
  • 22:05 richcole: moving storage 3 from PMtpa to SDtpa row B rack 1
  • 22:04 hcatlin: deploying code update to m.wiki
  • 22:04 Ryan_Lane: changing the * cert on owa* to the more current one
  • 21:37 Ryan_Lane: adding a virtualhost for https://owa.wikimedia.org on owa*
  • 19:48 Ryan_Lane: adding *.wikimedia.org certificate and key to owa* nodes
  • 19:18 Ryan_Lane: changing DNS ttl for payments.wikimedia.org
  • 07:50 logmsgbot: aaron synchronized php-1.5/extensions/FlaggedRevs/FlaggedRevision.php 'deployed r78053'
  • 06:01 logmsgbot: tstarling synchronized php-1.5/includes/DjVuImage.php 'r78047'

December 7

  • 22:03 Ryan_Lane: upgrading imagemagick on pdf1, pdf2, and pdf3
  • 21:37 Ryan_Lane: restarting puppet on srv187 and srv177
  • 21:35 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Resync because Arthur's sync didn't get picked up on all servers'
  • 21:23 logmsgbot: awjrichards synchronized php-1.5/wmf-config/InitialiseSettings.php 'Disabling VariablePage extension on all wikis'
  • 19:48 Ryan_Lane: moving awjrichards from restricted to mortals in admins.pp
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/WikimediaMessages/WikimediaLicenseTexts.i18n.php 'r77997'
  • 01:08 logmsgbot: tstarling synchronized wmf-deployment/cache/trusted-xff.cdb

December 6

  • 22:40 logmsgbot: aaron synchronized php-1.5/wmf-config/flaggedrevs.php 'removed redundant config var'
  • 20:57 mark: Restarted pybal on amslvs3; moved traffic back to amslvs1
  • 20:48 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/CodeReview.i18n.php 'r77904'
  • 20:21 logmsgbot: catrope synchronized php-1.5/extensions/ExtensionDistributor/ExtensionDistributor_body.php 'r77901'
  • 20:18 mark: Restarting varnish on knsq1-5
  • 20:17 RobH: rebooting sq68
  • 20:17 RobH: sq68 unresponsive to console and ssh
  • 20:11 mark: Moved bits.esams traffic from amslvs1 to amslvs3 (on csw1-esams)
  • 19:13 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php
  • 19:10 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv266 down'
  • 14:12 domas: master switch on s2 to New position: db13-bin.000001:106
  • 14:12 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 14:11 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 12:13 mark: Shutdown db4's switchport (csw5-pmtpa:1/8)
  • 11:29 mark: Powercycled srv217

December 5

  • 20:28 logmsgbot: demon synchronized php-1.5/extensions/CodeReview/ui/CodeRepoStatsView.php 'r77827'
  • 16:07 Ryan_Lane: modified /etc/apache2/envvars on mobile3 to look like mobile2
  • 15:54 Ryan_Lane: restarting apache on mobile3
  • 15:00 hcatlin: Deploying mobile update to fix change in Palm Pre useragent
  • 10:53 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeRevisionListView.php 'r77792'
  • 00:57 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 00:39 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'awwwww crap'

December 4

  • 23:43 logmsgbot: aaron synchronized php-1.5/extensions/FlaggedRevs/FRDependencyUpdate.php 'deployed r77749'
  • 22:43 logmsgbot: aaron synchronized php-1.5/includes/Article.php 'Deployed r77747'
  • 22:22 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'taking out db11 and db17'
  • 19:54 logmsgbot: aaron synchronized php-1.5/extensions/FlaggedRevs/FlaggedRevs.hooks.php 'Deploy 77740'
  • 19:53 logmsgbot: aaron synchronized php-1.5/extensions/FlaggedRevs/FlaggedRevs.php 'Deploy 77740'
  • 19:04 logmsgbot: demon synchronized php-1.5/extensions/CodeReview/ui/CodeRevisionTagView.php 'r77736'
  • 18:55 logmsgbot: demon synchronized php-1.5/includes/api/ApiQueryRevisions.php 'r77735'
  • 01:02 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/backend/CodeRevision.php 'r77708'
  • 00:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/backend/CodeRevision.php 'r77705'
  • 00:48 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Fix settings for r77701'
  • 00:48 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/backend/CodeRevision.php 'r77701'
  • 00:47 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/CodeReview.php 'r77701'
  • 00:40 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/backend/CodeRevision.php 'r77699'
  • 00:21 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Send all code comments to mediawiki-codereview@lists.wikimedia.org'

December 3

  • 22:38 Ryan_Lane: powercycling sq70, since it is very dead
  • 20:20 Ryan_Lane: fixing deadlock problem with torrus
  • 18:27 mark: Shutdown csw5-pmtpa:7/9 (to asw-b4-pmtpa) - so rack B4 is now offline
  • 18:26 mark: Shutdown csw5-pmtpa:7/3 (to asw-b3-pmtpa)
  • 14:09 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'sending most of s3 load to db39'
  • 14:08 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'adding db39, reduced s3 host, 5.1'
  • 14:03 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'send most of s7 load to db37'
  • 13:56 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'initial s7 server and wiki sets defined, db37 in as s7-only slave'
  • 10:31 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 18307 - Autopatrolled group for frwikisource'
  • 02:53 Ryan_Lane: rebooting nova-compute1.tesla
  • 00:41 logmsgbot: tfinc synchronized php-1.5/extensions/LandingCheck/SpecialLandingCheck.php
  • 00:31 logmsgbot: andrew synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 00:30 Andrew: Granted 'disableaccount' right to users on arbcom_enwiki, according to email conversations
  • 00:28 logmsgbot: andrew synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 00:24 logmsgbot: andrew synchronizing Wikimedia installation... Revision: 77622:
  • 00:23 Andrew: Running scap.
  • 00:17 Andrew: About to deploy updates for Special:DisableAccount

December 2

  • 23:11 Ryan_Lane: installing user-mode-linux on nova-compute1/2
  • 23:10 Ryan_Lane: setting --libvirt_type=qemu in the nova.conf on nova-*.tesla
  • 22:25 Ryan_Lane: adding icmp to the default security list's exceptions on nova-controller.tesla using "euca-authorize default -P icmp -t -1:-1 -s 0.0.0.0/0"
  • 22:22 Ryan_Lane: adding ssh to the default security list's exceptions using "euca-authorize default -P tcp -p 22 -s 0.0.0.0/0"
  • 22:07 Ryan_Lane: adding a lucid image to the lucid-64-1 bucket using uec-publish-tarball
  • 22:02 Ryan_Lane: installing unzip on nova-controller.tesla
  • 21:59 Ryan_Lane: installing cloud-utils on nova-controller.tesla
  • 21:29 Ryan_Lane: giving eth2 a static IP assignment on nova-controller.tesla; no bridging is necessary on the controller, only compute nodes
  • 21:21 Ryan_Lane: installing python-mysqldb on nova-compute1 and 2
  • 21:17 Ryan_Lane: removed template, keys, and ca path flags from nova-*.tesla, and added state_path and dhcp_bridge flags
  • 21:12 Ryan_Lane: updating nova packages on nova-controller.tesla
  • 20:53 Ryan_Lane: made owa1-3 real servers. depooling owa3
  • 20:44 mark: Setup owa.wikimedia.org LVS service on lvs4
  • 17:56 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 26159 groups for hiwiki'
  • 17:53 Ryan_Lane: restarting rabbit on nova-controller.tesla
  • 17:53 CodeBlock: installed mysql-client on nova-compute1 and 2
  • 17:52 Ryan_Lane: restarting nova services on nova-controller.tesla
  • 17:44 CodeBlock: installed nova-common nova-doc python-nova nova-compute nova-volume on nova-compute1, as well.
  • 17:43 mark: Shutdown port csw5-pmtpa:2/47 (to msw-a1-pmtpa) - so rack A1 is now fully offline
  • 17:41 CodeBlock: installed nova-common nova-doc python-nova nova-compute nova-volume on nova-compute2
  • 17:40 mark: Shutdown ports csw5-pmtpa:7/23 (to msw-b1-pmtpa) and 7/7 (to asw-b1-pmtpa) - so rack B1 is now fully offline
  • 17:24 Ryan_Lane: restarting nova services on nova-controller.tesla after configuring them to use the ldap driver
  • 13:25 hcatlin: deploying fixes to mobile cluster
  • 12:01 domas: please don't run any ddl on s3 (no new wiki creations either) until further notice, consider your data lost otherwise
  • 02:51 richcole: rebuilding raid array DB15
  • 02:38 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'disabling db15'
  • 02:15 richcole: swapping drive 15 on DB 19
  • 02:11 Ryan_Lane: stopping nova services on nova-*.tesla
  • 02:11 Ryan_Lane: adding mount for /var/lib/nova on nova-*.tesla
  • 02:06 Ryan_Lane: added another nic to all nova-* systems, on a new virtual nic called "nova vm network". To be used for VMs with no access to the outside world.
  • 02:05 CodeBlock: added phuzion to sudoers on nova-*.tesla
  • 01:51 RobH: db15 down for reinstallation, do not take db13 or db24 offline
  • 01:43 Tim: added a user account for hashar in the wikidev group, with the same UID and home directory as in 2006-2008
  • 01:13 richcole: reboot DB15
  • 01:09 RobH: db15 is not recovering from lockup, lost a lot of drives in array, rich is investigating
  • 01:02 RobH: db15 unresponsive to console, rebooting
  • 01:01 RobH: db15 drive was swapped, now is down, investigating

December 1

  • 23:16 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Updating fundraiser counter source for 2010'
  • 22:50 Ryan_Lane: rebooted mobile3
  • 21:17 mark: Removed jjones account for shell access
  • 21:06 Ryan_Lane: installing python-libvirt on nova-controller.tesla
  • 20:58 Ryan_Lane: restarting nova services on nova-controller.tesla
  • 20:55 Ryan_Lane: installing rabbitmq-server on nova-controller.tesla
  • 20:19 Ryan_Lane: installing ldap-utils on nova-controller.tesla
  • 19:54 Ryan_Lane: installing php5-cli on nova-controller.tesla
  • 19:54 Ryan_Lane: installing php5, php5-ldap, and php5-mysql on nova-controller.tesla
  • 19:54 Ryan_Lane: installing apache2 on nova-controller.tesla
  • 19:54 Ryan_Lane: installing opendj on nova-controller.tesla
  • 19:47 CodeBlock: also installed mysql-server on nova-controller.tesla
  • 19:46 CodeBlock: added nova's PPA to nova-controller.tesla, and installed nova-common nova-doc python-nova nova-api nova-network nova-objectstore nova-scheduler python-greenlet python-mysqldb
  • 19:45 Ryan_Lane: added hosts entries on nova-*.tesla for backend IP addresses
  • 19:20 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'Bug 26159 - Enable some user groups on hi wiki'
  • 19:07 RobH: updated flaggedrevs and pushed for bug 24622
  • 19:07 logmsgbot: robh ran sync-common-all
  • 17:57 Ryan_Lane: installing vmware tools on nova-*.tesla
  • 11:03 logmsgbot: tstarling synchronized wmf-deployment/cache/trusted-xff.cdb
  • 10:04 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 03:51 CodeBlock: installed htop on nova-controller
  • 01:37 Ryan_Lane: powered off, but did not delete ds1.tesla, storage1.tesla, and storage2.tesla. Re-used their resources for nova-*.tesla though.
  • 01:36 Ryan_Lane: gave shell and sudo access to codeblock on nova-*.tesla systems
  • 01:35 Ryan_Lane: added vms nova-controller.tesla, nova-compute1.tesla, and nova-compute2.tesla

November 30

  • 22:05 Ryan_Lane: correction that's ds1.tesla to nova-controller.tesla
  • 22:03 Ryan_Lane: changing ds1 to nova-controller in dns
  • 21:23 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25650 - Please add rollback group on fa.wikinews'
  • 21:14 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '16315 - Change favicon for Wiktionaries'
  • 21:03 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '16315 - Change favicon for Wiktionaries'
  • 20:51 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '26171 - Change logo for Esperanto Wikibooks'
  • 16:54 Ryan_Lane: destroying ds1.tesla, storage1.tesla, and storage2.tesla to reclaim memory and an IP address for OpenStack testing with volunteer ops/dev community members
  • 16:16 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Temporarily allow sysops to -flood anyone on zhwiki, because $wgGroupsRemoveFromSelf seems to be broken'
  • 16:06 Ryan_Lane: removing amanda-client from tridge, it isn't needed after all
  • 15:54 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 26117 - Fix wrong syntax in zhwiki entries in $wgGroupsAddToSelf and $wgGroupsRemoveFromSelf'
  • 15:51 Ryan_Lane: adding amanda-client to tridge
  • 15:29 ^demon: reinstated pre-commit hook for php lint test
  • 15:20 ^demon: disabled pre-commit hook for php -l while I debug some more, hitting issues with svn del's
  • 14:59 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 26151 - Change project namespace on mrwiktionary'
  • 14:57 ^demon: bug 26172, added php lint check to public svn pre-commit
  • 13:18 mark: moved noc monitoring out of LVS (I want to sleep at night)
  • 13:05 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'Set $wgAbuseFilterBlockDuration to 1 month on hiwiki'
  • 12:55 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'bug 24394 - Allow block and unblock as abuse filter actions on hiwiki'
  • 12:24 mark: Enabled Nagios SMS notifications for critical services to all paid Ops staff and Danese
  • 12:08 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 12:07 logmsgbot: catrope synchronized wmf-deployment/extensions/UploadWizard/resources/combined.min.js 'r77473'
  • 11:29 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Enable UploadWizard on commonswiki'
  • 11:28 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 11:22 logmsgbot: catrope synchronizing Wikimedia installation... Revision: 77471:
  • 11:22 RoanKattouw: Running scap to deploy UploadWizard changes cluster-wide. Only enabled on test right now, will go to Commons soon
  • 11:07 logmsgbot: catrope synchronized wmf-deployment/extensions/UploadWizard/resources/combined.min.js
  • 10:51 logmsgbot: catrope synchronized php-1.5/extensions/UploadWizard/UploadWizard.i18n.php
  • 10:44 logmsgbot: catrope synchronized wmf-deployment/extensions/UploadWizard/resources/combined.min.js
  • 10:41 logmsgbot: catrope synchronized wmf-deployment/extensions/UploadWizard/resources/combined.min.js
  • 10:31 logmsgbot: catrope synchronized wmf-deployment/extensions/UploadWizard/resources/combined.min.js
  • 10:22 logmsgbot: catrope synchronized wmf-deployment/extensions/UploadWizard/resources/combined.min.js
  • 05:14 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '26117 - Enable flood flag on zhwiki'
  • 00:32 Ryan_Lane: done checking nagios monitoring of LDAP on nfs2
  • 00:32 Ryan_Lane: checking nagios monitoring of LDAP on nfs2

November 29

  • 22:49 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionTracking/ContributionTracking_body.php
  • 22:47 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionTracking/ContributionTracking.i18n.php
  • 22:22 Ryan_Lane: changed check for noc.wikimedia.org to use http, not https
  • 21:46 Ryan_Lane: finished testing manifestcheck. The pre-commit hook for puppet's svn is more complete now. It will pull all manifests and check site.pp and all includes.
  • 21:34 Ryan_Lane: testing new pre-commit manifestcheck script for puppet svn. errors may be reported for ldap.pp
  • 19:21 Ryan_Lane: even better... testing with a file not included in site.pp
  • 19:21 Ryan_Lane: testing svn pre-commit hook for puppet. ldap.pp may show some errors...
  • 18:16 Ryan_Lane: added http monitoring for noc.wikimedia.org
  • 17:58 Ryan_Lane: added ldap and ldaps monitoring to nfs1/2
  • 17:45 Ryan_Lane: adding backup::client class to nfs1/2 for ldap backups
  • 13:55 domas: changed db10 lag threshold to alert at 10 minutes \o/

November 28

  • 19:20 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 19:18 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25693 - Change logo on tawikisource'
  • 19:11 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '26074 - please enable the Collection extension for the albanian wikipedia'
  • 19:10 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '26096 - setup logo for br.wikimedia.org'
  • 19:08 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25780 - Create "autopatroller" User Group and granting admins(sysop) to add/remove this user group in ca.wikipedia'
  • 18:55 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '26117 - Enable flood flag on zhwiki'
  • 18:46 logmsgbot: jeluf ran sync-common-all 'closing cowikiquote'
  • 18:38 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '26060 - Change logo for Esperanto Wikinews'
  • 18:36 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '26153 - Please restrict anonymous users from creating new pages at es.wikibooks'
  • 18:33 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '26155 - Please enable the "autopatrolled" group on it.wikipedia'
  • 18:30 logmsgbot: jeluf synchronized php-1.5/wmf-config/flaggedrevs.php '26154 - Activate Flagged Revs For NS_CATEGORY'
  • 11:51 logmsgbot: midom synchronized php-1.5/wmf-config/db.php

November 27

  • 22:49 logmsgbot: catrope synchronized php-1.5/wmf-config/db.php 'Put db18 back into rotation, it's caught up'
  • 22:41 logmsgbot: catrope synchronized php-1.5/wmf-config/db.php 'Set db18 load to 0 while it's lagged'
  • 20:57 logmsgbot: midom synchronized php-1.5/extensions/OAI/OAIRepo_body.php 'removing audit'
  • 12:54 logmsgbot: midom synchronized php-1.5/wmf-config/db.php
  • 11:25 logmsgbot: midom synchronized php-1.5/wmf-config/db.php 'db25 maintenance'

November 26

  • 14:52 logmsgbot: catrope synchronized php-1.5/languages/messages/MessagesBh.php 'r77331'
  • 13:58 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 13:57 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r77329'
  • 13:31 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 25923 - Namespace aliases for brwikisource'
  • 11:52 mark: Fixed packages on fenari
  • 03:42 logmsgbot: tstarling synchronized php-1.5/wmf-config/CommonSettings.php 'disabling proxy for RSS extension, to allow requests to blog.wikimedia.org'

November 25

  • 21:28 RoanKattouw: And Ariel restarted Apache too
  • 21:27 RoanKattouw: Apache on fenari was down, caused by wikimedia-task-appserver being half-installed. Ariel ran apt-get install on it manually, which seems to have fixed it
  • 20:07 logmsgbot: aaron synchronized php-1.5/extensions/FlaggedRevs/api/ApiReview.php 'Deploy r77298'
  • 13:35 logmsgbot: catrope synchronized php-1.5/languages/messages/MessagesExt.php 'r77287'
  • 13:30 logmsgbot: catrope synchronized php-1.5/includes/api/ApiQueryUsers.php 'r77285'
  • 12:42 logmsgbot: tstarling synchronized php-1.5/wmf-config/CommonSettings.php 'In RSS extension, use url-downloader.wikimedia.org'
  • 12:40 logmsgbot: tstarling synchronized php-1.5/extensions/RSS/RSSParser.php
  • 12:40 logmsgbot: tstarling synchronized php-1.5/extensions/RSS/RSS.php
  • 12:26 logmsgbot: tstarling synchronized php-1.5/wmf-config/CommonSettings.php 'enabling RSS extension on wikimediafoundation.org'
  • 12:26 logmsgbot: tstarling synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 12:22 logmsgbot: tstarling synchronizing Wikimedia installation... Revision: 77242:
  • 12:21 Tim: pushing out RSS extension to update extension messages, not enabled yet
  • 10:46 richcole: dataset1 being taken down for service
  • 01:30 logmsgbot: aaron synchronized php-1.5/extensions/FlaggedRevs/maintenance/updateStats.php 'deployed r77270'
  • 01:30 logmsgbot: aaron synchronized php-1.5/extensions/FlaggedRevs/api/ApiReview.php 'deployed r77270'
  • 00:35 richcole: dataset1 going down for part replacement
  • 00:28 tomasz: turning off replication for faulkner db on db10. setting db10 to rw per nimish's talk with domas.

November 24

  • 23:13 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/FundraiserStatistics_body.php
  • 23:13 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionReporting.i18n.php
  • 23:05 logmsgbot: catrope synchronized php-1.5/includes/media/Bitmap.php
  • 23:02 logmsgbot: catrope synchronized php-1.5/includes/media/Bitmap.php 'r77263'
  • 22:59 logmsgbot: catrope synchronized php-1.5/includes/media/Bitmap.php 'Back out r77262, is broken'
  • 22:55 logmsgbot: catrope synchronized php-1.5/includes/media/Bitmap.php 'r77262'
  • 22:50 JeLuF: added srv230 in /etc/dsh/group/mediawiki-installation
  • 22:29 logmsgbot: catrope synchronized php-1.5/thumb.php
  • 22:19 atglenn: ms4 thumb_handler.php synced to latest from svn, temp dir thumb handling (see rev 77257)
  • 22:18 logmsgbot: catrope synchronized php-1.5/thumb.php 'Live hack for temporary thumbs'
  • 21:01 tomasz: moving faulkner db from faulkner to unreplicated fundraiser_history on db10
  • 15:53 Ryan_Lane: powercycling mobile2
  • 15:49 Ryan_Lane: restarting apache on mobile3
  • 15:46 RoanKattouw: mobile3 is broken, serving 500s for http://m.wikipedia.org . mobile1 is fine; this causes LVS HTTP status to flap in Nagios
  • 03:57 logmsgbot: aaron synchronizing Wikimedia installation... Revision: 77216:
  • 03:50 logmsgbot: aaron synchronizing Wikimedia installation... Revision: 77216:
  • 03:15 logmsgbot: demon synchronized php-1.5/extensions/FlaggedRevs/api/ApiReview.php 'r77216'
  • 03:15 logmsgbot: demon synchronized php-1.5/extensions/FlaggedRevs/forms/RevisionReviewForm.php 'r77216'
  • 02:55 logmsgbot: demon synchronized php-1.5/extensions/FlaggedRevs/FlaggedArticleView.php 'r77213'
  • 02:43 Ryan_Lane: restarted memcached on owa1/2 to read new config file
  • 00:19 logmsgbot: demon synchronizing Wikimedia installation... Revision: 77209:
  • 00:17 ^demon: syncing FlaggedRevs changes to all wikis

November 23

  • 23:11 logmsgbot: demon synchronized php-1.5/includes/specials/SpecialRecentchangeslinked.php 'r76485'
  • 23:10 logmsgbot: demon synchronized php-1.5/includes/specials/SpecialRecentchanges.php 'r76485'
  • 23:09 logmsgbot: demon synchronized php-1.5/includes/diff/DifferenceInterface.php 'r76444'
  • 21:58 ^demon: restarted wmfsvndumper with new dump location
  • 21:44 ^demon: running svndump test in screen on formey
  • 21:43 Ryan_Lane: installed screen on formey
  • 21:42 Ryan_Lane: added demon to svnadm
  • 21:41 Ryan_Lane: changed ownership of svn config files from adm to svnadm
  • 21:41 Ryan_Lane: added an svnadm group to ldap
  • 15:07 mark: Finished testing, amslvs1 is primary again
  • 14:40 mark: Shutdown BGP session to amslvs1 on csw1-esams for testing, traffic failing over to amslvs3
  • 13:50 mark: Removed redistribution of BGP LVS routes into OSPF; rely on iBGP exclusively for LVS routes
  • 12:06 logmsgbot: catrope synchronized php-1.5/includes/api/ApiDelete.php 'r77146'
  • 02:27 RobH: had to restart morebots, it had left all channels. also had to change admin pass for listserv

November 22

  • 22:09 mark: Restarted varnish on knsq4
  • 22:05 mark: Restarting varnish on knsq1
  • 21:44 Ryan_Lane: err - restarting varnish on knsq2,4,5
  • 21:44 Ryan_Lane: restarting varnish on knsq2,3,4
  • 21:43 Ryan_Lane: restarting knsq2,3,4
  • 21:29 mark: Restarted varnish on knsq2,4,5
  • 21:24 mark: Restarting Varnish on knsq1
  • 21:09 mark: Set static routes 91.198.174.232/31 to 91.198.174.247 (csw1-esams) on br1-knams
  • 21:03 mark: Started pybal on amslvs1 again
  • 20:53 mark: Killed PyBal on amslvs1
  • 20:39 mark: Lowered cache_mem and cache_dir sizes on amssq31 for testing
  • 20:28 Ryan_Lane: depooling knsq24
  • 20:25 mark: Restarted pybal on amslvs1 with depool-threshold = 1 (temporarily)
  • 17:27 Ryan_Lane: restarting nagios
  • 17:27 Ryan_Lane: adding warning notifications via IRC to nagios
  • 16:40 mark: Suppressed announcements to AS16265 on csw1-esams
  • 16:30 mark: Suppressed BGP announcements to AS13030 on br1-knams

November 21

  • 08:29 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25929 - Import sources for Spanish Wikipedia'
  • 05:49 tomasz: purging binary logs on db9 to mysql-bin.001600
  • 05:43 tomasz: syncing db9 mysql-bin.001562 - mysql-bin.001599 to tridge

November 20

  • 21:12 JeLuF: added download.wikipedia.org as ServerAlias for download.wikimedia.org
  • 20:14 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25982 - Enable subpages in the main namespace for ten.wikipedia.org'
  • 20:11 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25923 - Namespaces on br.wikisource'
  • 20:09 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25929 - Import sources for Spanish Wikipedia'
  • 20:06 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25996 - Import sources for pfl.wikipedia'
  • 13:00 JeLuF: added more error patterns to puppetmon

November 19

  • 23:04 Ryan_Lane: changed https check on payments to 1 retry in nagios, via puppet
  • 20:59 atglenn: restarted torrus. guess why :-P
  • 17:52 mark: Readded notification of Service[nagios] when changing nagios types in puppet
  • 13:59 richcole: set raid 10 up on DB41
  • 08:36 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'bug 25850 - Hide Take me Back link on all wikis'
  • 07:32 JeLuF: started puppet on sq42 sq41 sq47 sq45 sq46 sq44 sq50 sq52 sq53 sq51 sq48 sq55 sq54 sq43 sq56 sq58 sq64 sq62 sq61 sq65 sq66 sq60 sq63 sq72 sq75 sq71 sq73 sq74 sq76 sq78 sq77 sq81 sq80 sq79 sq83 sq84 sq82 sq85 sq86
  • 05:21 apergos: on sq85 I was seeing complaints from cron about restart of puppet: unknown option -w. removed that from /etc/default/puppet and restart, but that fails: Could not parse for environment production: Could not find file /agent.pp
  • 01:00 atglenn: added monitoring mechanism in root's crontab on sq85 (don't need it everywhere) that will sms me when ms4 is acting up. I'd do it in puppet if someone told me how they would want it to be added there.
  • 00:22 tomasz: turning db9 watchdog back on. setting at 5minutes

November 18

  • 21:57 richcole: DB42 shutdown for service
  • 21:40 JeLuF: CORRECTION: started puppet manually on sq59 sq61 sq73 sq60 sq62 sq65 sq77 sq63 sq64 sq72 sq75 sq74 sq76 sq71 sq78 sq66, startup script is broken.
  • 21:40 JeLuF: started squid manually on sq59 sq61 sq73 sq60 sq62 sq65 sq77 sq63 sq64 sq72 sq75 sq74 sq76 sq71 sq78 sq66, startup script is broken.
  • 21:26 mark: Fixed puppet on formey
  • 21:24 mark: Fixed puppet on linne
  • 19:57 JeLuF: blocked UDP from srv124 on nfs1 aka syslog
  • 17:00 JeLuF: restarted puppet on srv215, srv235, srv244, srv257, srv262, srv288
  • 16:33 JeLuF: fixed puppet on srv185 and srv200
  • 15:00 logmsgbot: aaron synchronized php-1.5/wmf-config/flaggedrevs.php 'Set FR_INCLUDES_CURRENT on mediawikiwiki'

November 17

  • 20:19 JeLuF: syslog is being spammed with one week old messages from srv124
  • 20:19 RobH: owa1/2/3 online with base OS install and puppet updates
  • 17:59 RobH: updated dns for new database servers
  • 17:15 richcole: owa1 going down for repair
  • 15:52 Ryan_Lane: moved the nagios purge stuff out of puppet, and into nagios's init script. Pulled the nagios init script into puppet
  • 10:03 tomasz_: adding single field index on converted amount under public_reporting within civicrm db on db9
  • 10:03 tomasz_: adding single field indexes to utm_source, utm_medium, and utm_campaign under contribution_tracking table within drupal db on db9
  • 03:44 atglenn: restarted apache on ekrem, many processes hung in "graceful close" state for a long period of time
  • 03:06 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/SpecialBannerController.php
  • 03:04 Tim: in puppet, disabled nagios::purge since it breaks puppet entirely on fenari. Removed Aaron's obsolete ssh public key by adding an ensure=>absent to puppet.
  • 01:34 Tim: on ekrem: ran logrotate -f, since log rotation previously failed due to disk full
  • 01:28 Tim: on ekrem: root partition full, deleted old apache access logs
  • 00:20 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php

November 16

  • 19:05 logmsgbot: catrope synchronizing Wikimedia installation... Revision: 76812:
  • 19:03 RoanKattouw: Running scap to deploy UploadWizard backend changes (core only)
  • 17:04 Ryan_Lane: adding run stages to puppet config; adding apt-get update to first stage, and nagios resource purging to last stage
  • 17:01 logmsgbot: catrope synchronized php-1.5/maintenance/nextJobDB.php 'Fix memcached usage for nextJobDB.php, broken since Sep 09. Should speed up job queue processing'
  • 16:38 RobH: updated dns
  • 15:50 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Add deletedtext, deletedhistory rights to eliminator group on hiwiki'
  • 15:04 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 25374 - Eliminator group for hiwiki'
  • 14:36 JeLuF: 25871 - fixed logo for pflwiki
  • 14:05 RobH: temp fixed nagios

November 15

  • 22:31 Ryan_Lane: repooling sq70
  • 21:43 Ryan_Lane: pushing change to varnish to send cache-control header for geoip lookup
  • 21:43 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Adding categories to $wmgArticleAssessmentCategory'
  • 21:37 logmsgbot: catrope synchronized php-1.5/extensions/ArticleAssessmentPilot/ArticleAssessmentPilot.hooks.php 'r76709'
  • 21:17 Ryan_Lane: depooling sq70
  • 21:17 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25569 - Create the Gagauz Wikipedia (wp/gag)'
  • 21:01 mark: Lowered CARP weight of esams text amssq* squids from 20 to 10, equal to the older knsq* squids
  • 20:33 Ryan_Lane: setting authdns-scenario normal
  • 20:05 RobH: current slowdowns reported for folks hitting AMS squids. Moving traffic to US datacenter should fix major slowdowns on !Wikipedia & !Wikimedia
  • 20:04 Ryan_Lane: setting authdns-scenario esams-down
  • 19:56 RobH: fixed nagios again
  • 19:51 RobH: updating dns for new owa processing nodes
  • 18:54 RobH: srv298 now online in api pool
  • 18:21 Ryan_Lane: fixing puppet manually on sq34, sq36, sq37, sq39, sq40, and knsq13
  • 18:18 RobH: gilman to secure gateway project stalled, needs network checks done
  • 18:07 Ryan_Lane: puppetizing /etc/default/puppet, since some hosts had START=no, instead of START=yes
  • 17:46 RobH: gilman needed hard reset, ilom responsive now (thx rich!)
  • 17:35 Ryan_Lane: restarting puppet again on all nodes using -M flag for ddsh to see system names (checking for errors)
  • 17:23 Ryan_Lane: restarting puppet on all nodes
  • 17:12 RobH: sq57 disk replaced, reinstalled, back in service
  • 17:09 mark: Restarted apache on sockpuppet with concurrency 4 instead of 3
  • 17:04 RobH: puppet is now failing to work properly on sq57, why did we upgrade puppet again?
  • 16:59 RobH: sq57 reinstalled and doing post installation configuration
  • 16:40 Ryan_Lane: upping configtimeout setting in puppet to 8 minutes, globally
  • 16:33 Ryan_Lane: trying to add puppet.conf to puppet again
  • 16:24 Ryan_Lane: undoing puppet.conf changes
  • 16:20 RobH: sq57 coming down for reinstallation
  • 16:19 RobH: db13 back online, restarted mysql, but its currently commented out of db.php
  • 16:12 RobH: not sure why db13 is borked, but its down, poking at it
  • 16:09 Ryan_Lane: added puppet.conf to puppet. pushing change out
  • 16:00 RobH: torrus is up again
  • 15:59 richcole: swapped sq57 sdb bad drive
  • 15:56 RobH: torrus is down, again, restarting and cleaning up its services
  • 15:52 RobH: manually purged spence nagios, started manually, working until puppet borks it again
  • 15:10 RobH: nagios is down, investigating

November 14

  • 23:28 mark: Fixed Nagios
  • 20:00 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25918 - Namespaces on vec.wikisource.org'
  • 14:57 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25904 - Create the Swedish Wikiversity (wv/sv)'
  • 14:56 logmsgbot: jeluf ran sync-common-all '25904 - Create the Swedish Wikiversity (wv/sv)'
  • 14:28 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25918 - Namespaces on vec.wikisource.org'
  • 08:26 domas: ran purge-nagios-resources.py manually to bring up nagios
  • 07:14 domas: reduced passenger pool size to 4 on sockpuppet
  • 04:42 Ryan_Lane: moving /etc/nagios/puppet_services.cfg to .bak and rerunning puppet
  • 03:05 Ryan_Lane: modified nagios puppet manifest to purge decommissioned servers from the services configuration
  • 01:48 logmsgbot: jeluf synchronized php-1.5/cache/interwiki.cdb 'Updating interwiki cache'
  • 01:33 Ryan_Lane: temporarily upped configtimeout in /etc/puppet/puppet.conf to 8 minutes on spence so that puppet would run

November 13

  • 22:03 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php
  • 20:42 mark: Ran dist-upgrade on sq68
  • 20:37 mark: powercycled sq68
  • 19:29 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25871 - Create the Palatinate German Wikipedia (wp/pfl)'
  • 19:29 logmsgbot: jeluf ran sync-common-all '25871 - Create the Palatinate German Wikipedia (wp/pfl)'
  • 18:39 mark: Fixed puppet on db16
  • 18:24 mark: Installed script reporting the last Puppet run in MOTD (Karmic and higher only)
  • 18:03 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25774 - Create Wikinews in Esperanto'
  • 18:02 logmsgbot: jeluf ran sync-common-all '25774 - Create Wikinews in Esperanto'
  • 17:43 logmsgbot: jeluf ran sync-common-all '25773 - Create Wikibooks in Limburgish'
  • 17:37 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25743 - Create the Breton Wikisource (ws/br)'
  • 17:27 logmsgbot: jeluf ran sync-common-all '25743 - Create the Breton Wikisource (ws/br)'
  • 17:10 mark: Installed cron job that removes puppetdlock files over a day old; these prevent puppet from doing runs forever otherwise
  • 17:01 apergos: removed "-n" from mw-tor-list on hume, otherwise it (I guess) terminates early, at any rate it produces an empty tor node list. If this turns out to be too big a burden on hume's resources we can look at some other approach
  • 16:49 mark: Upgrading puppet agent from 0.25 to 2.6 across the cluster
  • 16:24 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25696 - Create vec.wikisource.org'
  • 16:04 logmsgbot: jeluf ran sync-common-all 'added gag.wikipedia and vec.wikisource'
  • 15:48 mark: Replaced the Wikimedia APT repository by a new per-distribution-version repository managed by 'reprepro' on brewster; the old repository is available as http://apt.wikimedia.org/wikimedia-old/
  • 14:29 logmsgbot: jeluf synchronized php-1.5/cache/interwiki.cdb 'Updating interwiki cache'
  • 10:51 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25714 - Adding sources wikis for [[Special'
  • 09:30 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25156 - Requesting an alias for project namespace on Persian Wikipedia'
  • 08:58 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25859 - Enable Collection on gl.wikipedia'
  • 08:45 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25893 - Wikimania Logo for WikimaniaTeam Wiki'

November 12

  • 22:56 Ryan_Lane: added file_mover user to hume
  • 21:22 mark: Fixed torrus
  • 21:11 mark: Setup amanda backups of brewster:/srv/{wikimedia,autoinstall,tftpboot}
  • 18:31 logmsgbot: jeluf ran sync-common-all '25737 - Closure of Nauruan Wikibooks'
  • 18:06 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionReporting.php
  • 17:46 Ryan_Lane: adding demon to shell accounts as mortal
  • 17:35 Ryan_Lane: rebooting ersch
  • 17:34 Ryan_Lane: rebooting alsted
  • 15:58 mark: Shutdown srv126 for decommissioning
  • 15:27 Ryan_Lane: deleting svnuser.pp manifest, and any references to it, since we are now using ldap for svn users instead.
  • 14:11 RobH: srv230 shows memory error in SEL. it reboots and sees all memory. opening a ticket to ensure its not showing the memory error on its LCD
  • 12:22 mark: Rebooting sockpuppet
  • 12:20 mark: Running apt-get dist-upgrade on sockpuppet
  • 12:05 mark: Converted puppetmaster install on sockpuppet from mongrel based to passenger based
  • 11:19 mark: Upgraded puppet and puppetmaster on sockpuppet to 2.6.1
  • 07:09 JeLuF: restarted crashed backend squids on sq41 and sq42
  • 01:32 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/FundraiserStatistics.css
  • 00:44 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/SpecialBannerAllocation.php
  • 00:27 logmsgbot: tfinc synchronized php-1.5/extensions/LandingCheck/SpecialLandingCheck.php
  • 00:27 logmsgbot: tfinc synchronized php-1.5/extensions/LandingCheck/LandingCheck.php
  • 00:27 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Changing landing check to be wg var'

November 11

  • 23:59 logmsgbot: tfinc synchronized php-1.5/wmf-config/InitialiseSettings.php 'Adding country codes to landing check'
  • 22:09 RobH: torrus wasnt recording items, restarted
  • 18:14 apergos: restarted varnish on storage1, seemed it might have gone out to lunch
  • 17:41 mark: Powercycled sq84
  • 17:01 mark: powercycled sq80
  • 16:52 mark: Powercycled amssq50
  • 16:42 mark: Powercycled sq68
  • 16:37 mark: Powercycled sq59
  • 16:34 mark: Powercycled sq57
  • 15:38 mark: Removed ex-fedora data on ms2, after backing it up to tridge
  • 10:33 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 10:32 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r76511'
  • 09:58 Ryan_Lane: restarted apache on fenari
  • 04:02 logmsgbot: tfinc synchronized php-1.5/extensions/ContributionReporting/ContributionReporting.php 'Updating for 2010'
  • 03:50 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'removing test since its in the extension config'
  • 03:46 logmsgbot: tfinc synchronizing Wikimedia installation... Revision: 76474
  • 02:30 atglenn: so another restart of torrus. seriously...
  • 00:43 domas: what Rob meant was that they went away by themselves, as it was upstream provider issue.
  • 00:39 RobH: !wikipedia and !wikimedia network issues resolved, all projects should be fine now
  • 00:35 domas: #network #failwhale #lol
  • 00:31 RobH: looking into the current slowdown/inaccessibility issues for folks on !Wikipedia and !Wikimedia
  • 00:25 domas: flapping network in pmtpa

November 10

  • 22:32 rfaulk: installed "scipy" python package on grosley.wikimedia.org with apt-get - statistical analysis in python
  • 22:23 atglenn: restarted torrus, it had deadlocked again. is it my imagination or is this happening really often lately?
  • 21:26 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 21:25 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r76474'
  • 21:21 logmsgbot: nimishg synchronized php-1.5/extensions/ContributionReporting/ContributionReporting.i18n.php 'r76472'
  • 21:16 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 21:16 RoanKattouw: Removed srv124 from mediawiki-installation node group as it's slated to be decommissioned
  • 21:14 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r76471'
  • 20:44 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 20:44 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/Vector/Vector.combined.min.js 'r76469'
  • 20:32 logmsgbot: catrope synchronized php-1.5/wmf-config/CommonSettings.php 'Bump style version appendix'
  • 20:31 logmsgbot: catrope synchronized php-1.5/extensions/UsabilityInitiative/js/plugins.combined.min.js 'r76467'
  • 19:40 RobH: singer config restarted, will host download.w.o & dumps.w.o as well as a number of other things that refer to those two entries in dns
  • 19:39 RobH: changed dns for dumps.wikimedia.org to go to singer instead of dataset1 during its downtime
  • 18:59 atglenn: someone was polite and didn't name me in the above comment :-P I commented out the script that ships logs to both dammit.lt and dataset1 instead of looking at the script itself
  • 18:58 domas: unbroke pagecounts shipment (someone broke it and said "yes you can blame me, it was my f*ckup, people should know that")
  • 14:59 apergos: rebooting dataset1 so we can get web service going over there (can't be restarted in the usual way after kernel panic)
  • 10:38 RoanKattouw: Published MW 1.16 tarball on noc.wm.o because download.wm.o is still down http://noc.wikimedia.org/mediawiki-1.16.0.tar.gz
  • 06:47 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Turning cc gateway back on with the sidebar'
  • 05:20 apergos: stopped rsync of pagecount stats from locke to dataset1 for now til disk/fs issue is resolved
  • 04:59 apergos: shot all dump processes on dataset1; note a kernel panic in logs from within __destroy_inode, going to reboot and leave rsync of pagecounts and dumps off
  • 00:41 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Taking outage on cc cluster'

November 9

  • 22:42 RobH: running puppet on spence to remove all the old apaches that are no longer in any kind of service
  • 22:39 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '24539 - Transwiki import source for ml.wikisource.org'
  • 22:36 RobH: didnt log my change that I ran about 25 minutes ago to change test.w.o from srv124 to srv193 in squid settings and deployed
  • 22:32 logmsgbot: jeluf ran sync-common-all
  • 22:27 logmsgbot: jeluf synchronized closed.dblist
  • 21:31 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'removed srv193 from potential memcached pool as it will shortly become the new test.w.o server'
  • 20:53 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'Set tenwiki logo to local Wiki.png'
  • 17:26 logmsgbot: robh synchronized php-1.5/wmf-config/abusefilter.php 'bugzilla#24394'
  • 17:17 logmsgbot: robh synchronized php-1.5/wmf-config/abusefilter.php
  • 17:15 RobH: that actually ran 15 minutes ago and was stuck on a broken server at the end, all other hosts had synced
  • 17:15 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv230 unresponsive to ssh, needs to reboot, swapped it out for working spare'
  • 15:51 RobH: removed srv* under srv151 from pybal, left entry for srv124 as its test.w.o, even though its set to false
  • 15:46 RobH: srv229 puppet was hanging, manually ran apt-get update and reran puppet, now its happy
  • 15:22 RobH: srv229 rebooted, wouldnt let me ssh in, coming back up now with puppet run
  • 15:21 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'srv281 is not behaving, swapped it out'
  • 15:17 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'replacing a server that i am working on'
  • 15:01 mark: Reduced CARP weight of new amssq squids from 30 to 20, until they get SSDs
  • 12:16 mark: Started copy of ms2:/a/ex-fedora data to tridge
  • 12:15 mark: Included class base for server tridge in Puppet
  • 12:09 mark: Cleaned up temporary files on image scalers
  • 06:33 apergos: cleared out some old bin log files on db27 to get back some space
  • 05:20 apergos: reboot of ms4 successful, we need to monitor performance over the next several days. If people are still seeing 503's for thumbs that's a problem
  • 05:09 apergos: er, ms4!!
  • 05:09 apergos: rebooting ms5 into alternate boot environ, now with new improved patches (:-P), let's see if it works
  • 01:00 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Bumping style version for meta udp2log fix'

November 8

  • 23:50 logmsgbot: robh synchronized php-1.5/wmf-config/InitialiseSettings.php 'updating tenwiki logo'
  • 23:18 logmsgbot: tstarling ran sync-common-all
  • 22:32 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25801 - New logo for et.wikimedia.org'
  • 22:27 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25742 Please set the Buryat Wikipedia logo'
  • 22:25 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25742 Please set the Buryat Wikipedia logo'
  • 22:20 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25836 - Commons'
  • 22:13 logmsgbot: jeluf synchronized php-1.5/wmf-config/InitialiseSettings.php '25779 - Create new namespace "Institution"/"Museum" for Commons'
  • 21:54 JeLuF: Fixed typo: Replaced the second "srv290" in node group "apaches" by srv296.
  • 21:10 logmsgbot: jeluf synchronizing Wikimedia installation... Revision: 76208
  • 21:10 JeLuF: srv154 didn't receive any updates in the last few days, it was missing in the mediawiki-installation nodegroup
  • 20:19 mark: Moved uplink of asw-b-sdtpa from temporary GigE link csw1-sdtpa:3/48 to 2x 10G (LACP) links csw1-sdtpa:16/2 and 16/3, shutdown 3/48
  • 19:44 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/SpecialLandingCheck.php 'r76156'
  • 19:44 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/LandingCheck.php 'r76156'
  • 19:22 RobH: updated dns, seems it borked, but pdns is running on nescio so it should clear up
  • 17:47 logmsgbot: robh synchronized php-1.5/wmf-config/abusefilter.php 'bug'
  • 16:38 RobH: esams squid flapping on text squids is due to disk i/o use, they will be replaced with SSD soon
  • 06:04 apergos: brought ms4 back up in primary boot environment, testing concluded for tonight, results being sent to Oracle
  • 05:50 apergos: doing reboot of ms4 into alternate boot environ for testing
  • 02:34 apergos: restarted torrus, it was out to lunch again

November 7

  • 17:39 RoanKattouw: Starting makeArbcomList.php in a screen on fenari. Tim says this'll take about a day
  • 12:43 logmsgbot: catrope synchronized php-1.5/includes/Profiler.php 'r76243'
  • 12:04 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'More perm changes for frwiktionary'
  • 11:59 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'More perm changes for frwiktionary'
  • 11:56 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'More perm changes for frwiktionary'
  • 11:35 logmsgbot: catrope synchronized php-1.5/includes/Profiler.php
  • 11:33 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 25711 - enable AbuseFilter on frwiktionary'
  • 11:33 logmsgbot: catrope synchronized php-1.5/wmf-config/abusefilter.php 'bug 25711 - enable AbuseFilter on frwiktionary'
  • 11:17 logmsgbot: catrope synchronized php-1.5/includes/Profiler.php 'Temp hack to debug fatals in Profiler.php'
  • 10:48 logmsgbot: catrope ran sync-common-all
  • 10:47 RoanKattouw: Enabling FlaggedRevs on sqwiki per bug 25822 and disabling new page patrolling
  • 10:16 RoanKattouw: srv230 SSH is broken from fenari, Nagios disagrees. Commenting out srv230 from /etc/dsh/group/mediawiki-installation . After fixing srv230, uncomment it and resync the box
  • 10:13 logmsgbot: catrope synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 25674 - Enable $wgBlockAllowsUIEdit on frwiktionary'

November 6

  • 18:59 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeRevisionView.php 'r76209'
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeReleaseNotes.php 'r76208'
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeAuthorListView.php 'r76208'
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/ui/CodeRevisionListView.php 'r76208'
  • 18:56 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/api/ApiCodeUpdate.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/api/ApiCodeRevisions.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/api/ApiCodeDiff.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/api/ApiCodeComments.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/svnImport.php 'r76208'
  • 18:55 logmsgbot: catrope synchronized php-1.5/extensions/CodeReview/CodeReview.i18n.php 'r76208'
  • 18:55 RoanKattouw: Syncing CodeReview update

November 5

  • 23:38 logmsgbot: tfinc synchronized php-1.5/wmf-config/InitialiseSettings.php 'Turning central notice back on for everyone'
  • 23:09 logmsgbot: tfinc synchronizing Wikimedia installation... Revision: 76127
  • 23:09 logmsgbot: tfinc synchronized php-1.5/wmf-config/InitialiseSettings.php 'Turning off cn on all but testing wikis before scap'
  • 23:08 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Turning off cn on all but testing wikis before scap'
  • 23:03 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Setting wgCentralDBname to meta'
  • 20:34 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Picking up url fix so that udp2log doesnt double count on meta'
  • 20:33 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/SpecialBannerController.php 'Picking up url fix so that udp2log doesnt double count on meta'
  • 20:15 RobH: There are no longer any memcached servers in the decommissioned server range. If there are any issues from the changes, the original mc.php is named mc.php.old and will be removed in 72 hours if there are no mishaps
  • 20:14 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'of course one dies AS i sync it'
  • 20:12 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'on secondary review, missed two old servers, removed and updated'
  • 20:08 RobH: tested new memcached config, all servers working
  • 20:08 logmsgbot: robh synchronized php-1.5/wmf-config/mc.php 'removed the older servers below srv150 and replaced with tested good new memcached servers'
  • 19:55 logmsgbot: catrope synchronized images/wikimedia-button.png 'Let's try that with an actual PNG file rather than HTML'
  • 19:50 logmsgbot: catrope synchronized images/wikimedia-button.png 'New Powered by Wikimedia button'
  • 19:11 logmsgbot: catrope synchronized php-1.5/skins/common/images/poweredby_mediawiki_88x31.png 'r76126'
  • 19:00 rfaulk: Added "httpagentparser" Python package on grosley.wikimedia.org from publicly available distutils distribution - this package assists in parsing user-agent header strings found in the 2010/11 fundraiser squid logs
  • 18:58 rfaulk: Added "setuptools" Python package on grosley.wikimedia.org with apt-get for Wikimedia 2010/11 fundraiser work - This package enables installation of python packages distributed with Python distutils
  • 18:22 RobH: srv284 is having some booting issues, seems to be harddisk related, but since drac output is slightly garbled, unable to confirm. new rt# 376
  • 17:54 RobH: srv284 unresponsive to console, rebooting and fixing it to bring it back into service
  • 17:53 RobH: srv266 back online and in service
  • 17:52 mark: Shutdown browne and srv2 for decommissioning - thereby removing the last traces of Fedora from the cluster. Goodbye!
  • 17:43 RobH: srv266 unresponsive to remote console, rebooting and updating
  • 17:42 RobH: srv206 fixed, pushed back into lvs
  • 17:25 RobH: working on srv206, disregard any errors it throws
  • 16:40 RobH: issue with the new api servers is fixed and they are now back in service
  • 16:04 RobH: some new api servers are not working right, depooled until they are fixed
  • 15:58 mark: Removed ibis IPs from Squid ACLs; invalid requests issue has been resolved
  • 15:57 mark: Fixed NFS mounts on apaches that had them missing since the wikimedia-task-appserver upgrade
  • 15:26 RobH: working on sq57, disregard flapping
  • 15:24 RobH: new api apaches srv290-srv301 are online, except srv298 which needs drac correction before installation
  • 15:22 RobH: dropping old entry for tenwiki in apache config and resyncing/restarting apaches to eliminate error message
  • 15:18 RobH: pushing srv291-srv301 into lvs
  • 15:11 RobH: doing puppet runs on srv292-srv301 before pushing them into service
  • 14:57 mark: Hacked out the 'remotemount' lines in /var/lib/dpkg/info/wikimedia-task-appserver.postrm files to prevent apaches from being without NFS mounts during/between puppet runs and package upgrades
  • 14:23 mark: Deploying new package wikimedia-task-appserver 1.46 across the cluster, which removes configuration files (now handled by Puppet)
  • 11:59 logmsgbot: catrope synchronized php-1.5/includes/api/ApiLogin.php 'Revert r76078'
  • 11:49 logmsgbot: catrope synchronized php-1.5/includes/api/ApiLogin.php 'r76078'
  • 05:57 apergos: failure booting into be3 on ms4, had to back out. so, no progress, we are back to where we were before the reboots.
  • 05:40 apergos: cleared up luactivate error, shutdown ms4 again, trying to boot into alt boot environment
  • 05:16 apergos: used shutdown on ms4, be3 showed as "active on reboot" but it booted into be0 (old boot environment) nonetheless. *grumble*
  • 05:06 apergos: rebooted ms4 into alt boot environment with current patches applied
  • 00:18 RobH: new api servers are not copying down the data correctly and not reflecting config changes in puppet, so they fail, srv290+ not online yet

November 4

  • 23:06 RobH: running puppet across the new api servers srv290-srv301 then will push them in service later when i figure out why they are not doing what I want ;P
  • 20:13 RobH: sq51 hates me
  • 20:11 RobH: new api servers srv290-301 are installed and showing in ganglia, having issues getting the first couple to pool into lvs before i push the rest into service
  • 20:09 RobH: fixed sq51
  • 19:29 RoanKattouw: Strike that, have backed out changes
  • 19:06 RoanKattouw: Until Mark's made sure they're good, that is
  • 19:06 RoanKattouw: Changing some files in wmf-deployment/includes/media . DO NOT RUN SCAP or otherwise deploy these changes!
  • 18:35 RobH: added dns entries for payments
  • 17:59 RobH: doing puppet runs and final setup for srv290-srv301
  • 16:56 rfaulk: Added numpy Python package to grosley.wikimedia.org with apt-get ... For use in the 2010/11 fundraiser to facilitate stats gathering by providing scientific computing functionality in Python
  • 16:43 rfaulk: Added MySQLdb Python package on grosley.wikimedia.org with apt-get ... This package will be used to access fundraising databases to facilitate the gathering and synthesis of relevant statistics for the 2010/11 Wikimedia fundraiser
  • 16:23 mark: Set storage1 (varnish) as upload backend on sq41-50, instead of ms4
  • 16:14 RobH: sq59 is being bitchy and wont clean the cache, possible hdd issue? will investigate later
  • 15:42 RobH: sq35 back in rotation
  • 15:34 mark: Added storage1 (varnish->ms4) as an HTTP backend to sq45's squid config
  • 15:34 RobH: commenting out sq35, trying to make it work again in pybal
  • 15:16 RobH: poking at sq59
  • 15:06 RobH: sq35 back online, pushed into lvs, partially up - may need to wait up to 5 for idleconnect timer
  • 14:46 RobH: pushed dns updates for new payments boxes and correcting owadb1/2 to db31/32
  • 14:28 RobH: sq35 set to false in pybal until i determine whats wrong with it
  • 14:09 mark: Reduced CARP weight of sq41-50 from 10 to 5
  • 13:37 RobH: sq35 may flag, disregard
  • 13:30 RoanKattouw: Removed uploadwizard test wiki on prototype, gonna set it up on the Commons prototype instead
  • 04:17 atglenn: ganglia 3.1 now running on ms4 and ms5
  • 01:44 RobH: srv217 back in cluster
  • 00:36 RobH: torrus back online
  • 00:29 RobH: fixing torrus deadlock, no touchy
  • 00:18 tomaszf: upped open file descriptors on loudon to 4096 for squid
  • 00:17 RobH: kicking srv217 for reinstall

November 3

  • 21:22 RobH: updated puppet to properly remove memcached from memcached::false entries and removed the host memcached check for servers no longer running memcached, hup'd nagios to take the change
  • 21:21 atglenn: rebooting ms5 after OS update. note that we were unable to get some of the more recent patches, they are probably from after the sun->oracle transition
  • 21:02 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/LandingCheck.i18n.php 'r75890'
  • 21:02 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/LandingCheck.alias.php 'r75890'
  • 21:01 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/SpecialLandingCheck.php 'r75890'
  • 21:01 logmsgbot: nimishg synchronized php-1.5/extensions/LandingCheck/LandingCheck.php 'r75890'
  • 20:31 atglenn: removed about 1.5T of stuff off of /export on ms4 (old backups, solaris isos, etc)
  • 19:41 logmsgbot: catrope synchronized php-1.5/README 'Dummy sync so I can document what the errors look like'
  • 19:32 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Backing out config change for stats fix'
  • 19:31 RobH: srv281 still down, setting to false in pybal just so it doesnt keep trying to use it
  • 18:31 RobH: reinstalling srv281, tired of lookin at it in red
  • 17:18 mark: Upgraded storage1 to Lucid
  • 16:42 mark: Removing 2010-03 snapshots on ms4
  • 16:01 mark: Fixed sshd on ms4
  • 15:46 mark: Removing 2010-02 snapshots on ms4
  • 15:45 mark: Disabled gmetric cron jobs on ms4
  • 15:43 mark: Disabled daily snapshot generation on ms4
  • 15:27 mark: Restarted gmond on ms4
  • 15:24 mark: Upgraded puppet on ms4
  • 15:13 mark: Powercycled knsq2
  • 14:52 mark: Removing daily snapshots for 2010-10 on ms4
  • 14:24 mark: Restored /etc/sudoers file on DB machines butchered by old versions of wikimedia-raid-utils
  • 05:34 logmsgbot: tstarling synchronized php-1.5/includes/Math.php 'r75909'
  • 04:52 apergos: oh btw, I notice that when / on the squids fills, we don't see it in ganglia, it must report an aggregate or something. it would sure be nice to get notified.
  • 04:18 apergos: lather rinse repeat for sq47, I hope that's all of 'em
  • 03:46 apergos: repeated on sq45...
  • 03:13 apergos: same old story on sq46... restarted syslog, reloaded squid, got back some space on /
  • 02:41 apergos: er... and deleted the log file :-P
  • 02:38 apergos: moved ginormous cache.log out of the way on sq48 and reloaded squid over there since it wasn't done earlier
  • 02:32 apergos: cleaned up / on sq41, restarted syslog, reloaded squid
  • 00:59 logmsgbot: nimishg synchronized php-1.5/wmf-config/InitialiseSettings.php
  • 00:53 logmsgbot: nimishg synchronizing Wikimedia installation... Revision: 75891
  • 00:33 apergos1: also 44 and 43
  • 00:30 apergos1: cleaning up space on other / full squids: sq42

November 2

  • 23:22 apergos: same story on sq50, cleared out some space, tried upping that to 300 but started seeing TCP connection to 208.80.152.156 (208.80.152.156:80) failed in the logs so backed off to 200
  • 23:13 apergos: trying adjusting max-conn on sq49 for conns to ms4... tried 200, it maxed out. trying 300 now...
  • 23:08 apergos: hupped squid on sq49, restarted syslog, / was full from "Failed to select source" errors, cleared out some space
  • 23:08 logmsgbot: tfinc synchronized php-1.5/wmf-config/CommonSettings.php 'Updating sidebar links'
  • 22:40 apergos: added in the amssq47 through amssq62 to /etc/squid/cachemgr.conf on fenari
  • 19:48 RobH: torrus back online
  • 19:44 RobH: following procedure on wikitech to fix torrus
  • 16:46 RobH: sq42 & sq44 behaving normally now, cleaning cache on sq48 and killing squid for restart as it is flapping and at high load, due to earlier nfs issue
  • 16:38 RobH: restarting and cleaning backend squid on sq44 and sq42 which were complaining in lvs
  • 16:35 RobH: sq43 was flapping since the nfs mount on ms4 was borked. restarted it
  • 16:07 apergos: NFSD_SERVERS=2048 in /etc/default on ms4
  • 16:06 apergos: note that the variable rpcmod:cotsmaxdupreqs has been changed to 2048 in /etc/system, and
  • 15:54 apergos: hard reset on ms4, reboot was not getting the job done
  • 15:47 apergos: rebooting ms4, nfsd hung and couldn't be restarted or killed.
  • 14:04 RobH: restarted pdns on linne due to crash from authdns update
  • 14:02 RobH: updated dns with new mgmt entries for payments, owasrvs, and owadbs
  • 03:45 domas: added srv193 back to apaches pool on lvs

November 1

  • 23:55 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/SpecialBannerController.php 'Picking up fixes for Bug #25564'
  • 23:54 logmsgbot: tfinc synchronized php-1.5/extensions/CentralNotice/CentralNotice.php 'Picking up fixes for Bug #25564'
  • 20:42 domas: ms4 mildly loaded (disks go to >100i/s each) throwing nfs timeouts, I bumped up NFSD_SERVERS to 2048
  • 19:05 Ryan_Lane: powercycling srv207
  • 16:18 RoanKattouw: Something weird's going on with srv207: Nagios says its SSH is up but it times out on SSH from fenari
  • 16:15 logmsgbot: catrope synchronized php-1.5/includes/api/ApiBase.php 'r75798'

Archives