Server admin log/Archive 10

From Wikitech

June 30

  • 13:30 Tim, rainman: installing LS2 for enwiki. Migrating servers starting at srv22.

June 29

  • ~evening river: removed db7 from rotation to dump enwiki for yarrow
  • 23:14 mark: Repooled knams
  • (all day) mark: network maintenance at knams, installed the new RX-8 (csw1-knams)
  • 20:14 brion: running REPAIR TABLES on bugzilla tables on srv7
  • 17:30 Tim: fixed pdns-recursor on bayle again
  • 10:00 domas: bayle pdns_recursor gone missing. restarted.
  • 07:28 mark: DNS scenario knams-down for network maintenance later today
  • 00:58 Tim: fixed disk full condition on srv7 yet again
  • 00:03ish robchurch: Users are complaining about OTRS database errors too, something wrong with srv7/8?

June 28

  • 22:30ish robchurch: BugZilla broke :(
  • 18:30 Tim: fixed secure.wikimedia.org for wikiversity, was making the boardvote session transfer not work
  • 09:30 mark: Downpreffed AS path _30217_13680_ as people could not access us

June 27

  • 21:00ish brion: did a scap to current (r23488 ish), made a few quick fixes. schema changes are upcoming but not needed yet for live code, allegedly
  • 19:05 brion: fixed the dump generation issue where all sql dumps failed due to a changed pwd :P

June 26

  • 18:55 brion: replaced secure.wikimedia.org's CA-Cert cert with a rapidssl one that won't make everyone's browsers whine. We may still want a nicer wildcard / unlimited hosts cert in the future.
  • 16:09 Rob: srv80 back online after HDD swap and re-installation. Needs scripts run.
  • 15:10-15:25 Another load spike, cause unknown.
  • 14:45 River identifies a client doing a high-rate multithreaded load of action=render pages. Potential candidate for load spike source, blocked by Tim.
  • 14:24 Rob: srv118 back online after bios reset. Please sync and bring in to service.
  • 14:14 Rob: srv87 back powered on after powersupply replacement. Please sync and bring into service.
  • 13:35 Load spike on apache cluster, drop in hit ratio, overflow of apache CPU capacity. Comes and goes.
  • 08:20 domas: observed adler going down

June 25

  • 18:50 mark: Upgraded pdns on bayle and yf1019 to pdns-server-2.9.21-1wm2 to fix the tab bug
  • 15:22 Tim: reassigning srv56 to search indexing
  • 14:23 Rainman, Tim: testing lucene-search-2 on srv153 and srv142

June 24

  • 18:55 mark: Upgraded pdns-server on bayle to 2.9.21-1wm1
  • 13:30 - 15:30 mark: Upgraded yf1019 to Feisty, and upgraded pdns-server to version 2.9.21-1wm1

June 21

  • 11:05 mark: yf1015 was still sitting on the same IP as the newly installed yf1000, which was of course causing problems. Fixed.

June 20

  • 20:24 mark: Installed yf1000 as text squid and pushed into production
  • 17:00 jeluf: Added a second data file /usr/local/mysql/data2/ibdata2 to all three nodes of cluster 12 (srv124, 125, 126)

June 19

  • 21:25 brion: doing a sweep for images with backslash (\) in name, found while investigating an OAI-related problem triggered by a page update to [1]. Can't repro upload of such a filename now, believe it was an old bug.
  • 21:07 brion: took live a change to SquidUpdate.php which should skip HTTP purges instead of doing both HTTP and HTCP (old code didn't exit the HTTP function when subcalling to the HTCP function)
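
The 21:07 entry describes a missing early exit: the HTTP purge function handed off to the HTCP function but then carried on and sent HTTP purges too. The sketch below shows that pattern in Python rather than the PHP of SquidUpdate.php. The function name `purge_urls`, its parameters, and the `sent` list are hypothetical illustrations, not the actual MediaWiki code.

```python
def purge_urls(urls, use_htcp, sent):
    """Record which transport each URL is purged through.

    `sent` is a plain list used here in place of real network calls.
    """
    if use_htcp:
        for url in urls:
            sent.append(("HTCP", url))
        # Returning here is the fix: without it, execution falls through
        # and the HTTP purges below are issued as well.
        return
    for url in urls:
        sent.append(("HTTP", url))
```

With the `return` in place, each URL is purged through exactly one transport.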

June 18

  • 14:21 Rob: srv80 shutdown pending HDD replacement. RMA placed.
  • 13:30 Rob: srv87 unplugged. Powersupply is bad, RMA placed.
  • 13:00 Rob: srv118 unplugged. Will not boot. RMA placed.

June 18

  • 21:05 brion: updated lighty config on amane & restarted, tightening up some access control
  • 21:00 jeluf: added a new datafile on /dev/sdb1 to mysql on srv121, srv122, srv123 (cluster11)

June 17

  • 19:49 mark: Backported Dovecot 1.0.0 to Feisty, put it in the Wikimedia Universe repository, and installed it on sanger
  • 06:51 Tim: deleting some binlogs on srv123 and srv126
  • 00:22 river: set sitenotice for election stuff

June 16

  • 20:05 mark: Put traffic load back on knams

[knams sick, scenario knams-down]

June 15

  • 16:10 Tim: Set up a mysql server instance on srv77 for Lucene Search 2. Rainman chose the root password, it is documented in ~root/.my.cnf
  • 14:25 jeluf: copy completed, adler and webster back in production. Enabled DB cluster "s3a" for frwiki and jawiki
  • 04:45 jeluf: stopped replication on adler, took it out of the pool. Copying frwiki, jawiki and commonswiki to webster

June 14

  • 21:32 mark: Brought sq41 back up
  • 21:33 river: removed srv117 from mediawiki_installation/srv117 as it's not accepting logins
  • 20:14 mark: yaseo squids reinstalled, deployed and ready. Putting traffic load back on yaseo
  • 17:35 mark: Added the new yaseo Squid IPs to the MW CommonSettings.php, and removed some old entries
  • 16:40 brion: switched SVN CIA-bot config to use mail again (via lily). xml-rpc was hanging, which sucks for committers
  • 16:something brion: recovered most of the remaining commons images, doing another sweep for a few more
  • 16:29 Rob: srv117 HDD replaced, mirrored, and FC reinstalled. Needs setup scripts run.
  • 15:44 Rob: sq41 bad 4th disk replaced. Not turned on, as it needs to be sync'd once booted.
  • 11:15 mark: Reinstalled yf1001 - yf1009 with Feisty, will deploy them as squids later today. yf1000 has no network link.
  • 05:40 jeluf: Copying DB from lomaria to thistle.

June 12

  • 20:11 Rob: webster hdd replaced and reinstalled.
  • 20:06 Rob: thistle hdd replaced and reinstalled.
  • 19:14 mark: Brought up amaryllis at the new location
  • 19:13 Rob: srv80 shutdown, pending HDD replacement.
  • 19:08 Rob: srv121 turned back on and online.
  • 19:02 Rob: srv66 shutdown pending re-installation.
  • 18:40 Rob: srv54 shutdown. Single HDD bad in system.
  • 18:14 Rob: srv149 turned back on and online.
  • 18:11 Rob: srv146 turned back on and online.
  • 18:00 brion: applying patch-ipb_emailban.sql in prep for mw update
  • 16:19 Rob: srv137 rebooted from crash, back online.
  • 16:14 Rob: srv136 rebooted from crash, back online.
  • 16:07 Rob: srv134 readonly filesystem. Rebooted and ran FSCK. Corrected a number of inode errors. Server is back online.
  • 15:39 Tim: ddsh -cM -g apaches -- cp /home/config/others/etc/rc.local.apache /etc/rc.local
  • 15:36 Rob: Shutdown sq41, HDD 03 is defective/clicking.
  • 15:21 Rob: webster is in OS install on partitioning. Mark needs to correct something before it is back online. Replaced HDD in array.
  • 15:14 Tim: set up henbane on new network, deployed squid conf change for logging
  • 14:40 brion: suppressed error output from planet cronjob per mark's request
  • 14:23 Rob: Rebooted webster. Its raid is optimal and it now allows remote access. Documented issue on its history page.
    • Found the faulty drive, server currently in rebuild status.
  • 13:00 mark: sq26 ran into some sort of livelock, not doing any syscalls but using 100% cpu. gdb backtrace shows aio... leaving it attached for now.
  • 10:09 mark: Brought up yf1019, yaseo DNS server. Changed ns1's ip - delegation needs to be changed as well
  • 10:09 yaseo hosts seem to be connected using eth1!
  • 10:00 mark: Updated DNS entries of yaseo.wikimedia.org to the new situation

June 11

  • 20:59 brion: removed chroot on benet's lighty, was breaking symlinked dirs in the downloads area for things being migrated to new boxen
  • 17:44 Tim: changing password for wikiuser and wikiadmin after a leak on IRC
    • srv87, thistle and webster will need to have their passwords changed before they can be put back into rotation
  • 13:24 mark: sq12 suddenly rebooted for unknown reasons
  • 08:00 mark: Uppreffed routes via AS30217 to knams and yaseo
  • 07:55 mark: Depooled yaseo, servers are going to be moved today

June 10

  • 23:48 Tim: removed binlogs 90-99 from srv123
  • 18:25 Tim: fixed incorrect modification of $dbHostsByName in db.php. Commenting out a server from $dbHostsByName does not mark it down. Use $sectionLoads to do that.
  • 7:45 jeluf: srv117 has a defective disk, apache is hanging. I shut the server down.
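
The 18:25 entry distinguishes a name-to-address map from a load map: only the load map decides whether a server receives traffic. A minimal sketch of that distinction follows, in Python rather than the PHP of db.php. The dictionaries, addresses, weights, and the `pooled_servers` helper are all hypothetical, and it assumes that depooling means dropping the server's entry from the load map; the real db.php semantics are not shown in this log.

```python
# Hypothetical stand-ins for the two maps named in the log entry.
db_hosts_by_name = {"db6": "10.0.0.6", "db9": "10.0.0.9", "db10": "10.0.0.10"}
section_loads = {"s1": {"db6": 100, "db9": 100, "db10": 100}}


def pooled_servers(section, loads):
    """Return the servers that can still receive traffic for a section."""
    return sorted(name for name, load in loads[section].items() if load > 0)


# Removing a name-to-address entry leaves the server in the load map...
del db_hosts_by_name["db9"]
still_pooled = pooled_servers("s1", section_loads)

# ...while removing it from the load map takes it out of rotation.
del section_loads["s1"]["db9"]
after_depool = pooled_servers("s1", section_loads)
```

In this sketch `still_pooled` still contains db9 and `after_depool` does not, which is the mistake the entry warns against.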

June 9

  • 14:30 jeluf: started jobs-loop.sh on srv100 srv102 srv103 srv104 srv105 srv106 srv106 srv108 srv109 srv111 srv112 srv113 srv114 srv115 srv116 srv119. Apparently there wasn't any jobs-loop running since the power outage

June 8

  • 10:25 mark: Level3 is being flaky, downpreffed routes to knams and yaseo so TWTC routes are selected

June 7

  • 18:35 brion: setting up some nfs on storage2 for temporary dump work
  • 17:52 Tim: moved enwiki contribs back to db6, reduced general load
  • 14:51 Tim: Fixed monitorurl, deployed squid conf

June 6

  • 19:43 mark: Backed out most of Tim's monitorurl changes as it caused corrupt config files with empty monitorurl options, and therefore, downtime. Please test/check configs thoroughly before reapplying.

June 5

  • 20:00 jeluf: replaced rc.local on all apaches, so that they won't start apache on reboot. Use apache-start to start apache after a reboot.
  • 17:50 jeluf: copied rc.local file to srv68, started apache
  • 13:39 Tim: restarted backend squid on sq26. It was giving errors when attempting to forward to srv77 (ls2.wikimedia.org). cachemgr.cgi said that the server was down, but last failed connect "04/Jun/2007:13:23:56 +0000", over a day ago. Restarting fixed it.

June 4

  • 15:18 Tim: restored memcached to normal operation, cleared 7 out of 26 slots
  • 14:47 Tim: stopped slave on various old external stores
  • 14:40 Tim: fixed replication on cluster10 and put it back into the write pool
  • 13:00 brion: clusters 4 and 5 came back a bit ago
  • 12:52ish brion: took off read-only
  • 12:50 brion: everything looks ok except es clusters 4 and 5 still out
  • 12:45 brion: restarted lvsmon on dalembert. for some reason it wasn't taking down servers out of rotation anymore
  • 12:28 brion: mysql didn't autostart on cluster10, restarting it... 9 looks fine
  • 12:22 brion: rob got in to colo, some es boxen back up
  • 12:08 brion: reassigned 7 mc boxen to up boxen
  • 11:42 brion: marking read-only for the moment
  • 11:29 : some kind of power problem in DC -- ES cluster9 and cluster10 down entirely, also a number of memcaches, so site slow even for pages not using those storage clusters

June 3

  • 16:03 Tim: Samuel's slave thread was stopped for a while due to temporarily buggy MediaWiki code. Fixed it. Re-enabled slave status monitoring in nagios, set it up to enforce a policy of no slave running on old external stores.
  • 15:23 Tim: took webster out of rotation
  • 12:04 Tim: svn up/scap
  • 07:00 jeluf: webster's mysqld is down. Reports "read error" in its mysql error log. dmesg reports SCSI errors.

June 2

  • 17:00 mark: Disconnected clematis because it's idle and I needed a switchport for the console server. Will reconnect once the RX-8 is at knams.

June 1

  • 21:00 mark: Set up switches asw-c3-pmtpa and asw-c4-pmtpa
  • 20:29 brion: chown'd all the mailing list files so the web server can edit configs and html files properly per bugzilla:10091
  • 20:00 jeluf: removed binlogs 80-99 from srv95
  • 19:10 mark: Set control = submission/retain_sender on SMTP mail submissions from allowed relay_from_hosts on all mail servers

May 31

  • 22:31 brion: switched DNS for svn.wikimedia.org to mayflower now that all the bits seem to be working. documentation will go up once i've formatted it a bit
  • 22:31 brion: _really_ locked them accounts now. david made one extra commit on old :( [2]
  • 21:23 brion: locking svn+ssh accounts on old svn server to test migration on new box

May 30

  • 19:52 mark: Installed Feisty on mayflower for use as Subversion server

May 27

  • 23:37 river: created hakwiki

May 26

  • 9:30 jeluf: removed binlogs 60-79 on srv123

May 25

  • 10:40 jeluf: replaced memcached srv118 by a spare one - nagios was complaining for 4d 9h already...

May 24

  • 12:40 Rob: Replaced DRAC in srv155 and programmed it. It is now accessible via DRAC.
  • 12:37 Rob: sq39's power cord was loose, it was powered down, plugged back in. Also knocked out sq47 checking plugs, plugged back in.

May 22

  • 19:46 mark: db5 installed with Feisty, and waiting for setup.
  • 17:46 Rob: Replaced disk 1 in db5.
  • 17:10 Rob: Rebooted srv145, filesystem no longer read-only.
  • 16:03 mark: Set up sq39 as Squid, pushed into production
  • 15:05 Tim: increased move rate limit to 8 per minute
  • 14:27 Rob: Added rails to srv0 and restarted it back into the installer.
    • What purpose is srv0 supposed to be serving? It will not allow login, please see Datacentre_tasks
      • It was doing test.wikipedia.org until it crashed, srv3 took over. -- Tim
  • 14:06 Tim: Created ls2.wikimedia.org, forwards to srv77 for lucene-search-2 test
  • 13:40 Rob: sq39 fixed and OS loaded. Needs to be configured.

May 21

  • 23:29 river: started refreshlinks on eswiki for [3] scratch that, fixed the bad rows manually
  • earlier river: created rswikimedia

May 20

  • 23:36 river: srv126 out of space, removed old binlogs
  • 22:20 hashar: gzipped / bzipped some logs on suda:/h/w/logs to save disk space. You might want to establish a logrotate script there.
  • 21:55 mark: Upgraded bayle to Ubuntu 7.04 Feisty to fix the pdns recursor issues.

May 19

  • 22:42 river: closed ndswikiquote
  • 16:04 jeluf: enabled api.php's watchlist module again since it's now using the watchlist DB.
  • 08:50 jeluf: started the pdns recursor on bayle
  • 00:41 river: closed krwiki

May 18

  • 21:25 river: created itwikiversity
  • 18:39 brion: restored bug URL shortcut on new bugzilla
  • 18:30 brion: srv8 replication is down, not sure why... io thread no starty
    • Maybe because it stopped weeks ago and srv7 is missing almost 20 binlog files, probably deleted to free disk space on may 15?
      • It was before May 15, the server barely had room for 3 binlogs let alone 20.
  • 14:11 brion: bugzilla migration done, but srv7 seems confused about dns and thinks isidore is bart. flush hosts had no effect
  • 13:45 brion: shutting down bugzilla to try migrating its db to srv7
  • 00:13 Tim: fixed fstab on srv101. Had two swap lines with binary junk in the label, which was causing the rest of the file (including NFS mounts) to be ignored. Removed them. Also did mount -a.

May 17

  • 23:50 Tim: srv101 was not in mediawiki-installation and so missed out on the squid list update. Fixed.
  • 22:06 mark: ...and of course I forgot about Mediawiki's Squid list again.
  • 21:47 mark: Deployed all new squids sq31 - sq50, except sq39 which is broken
  • 21:00 mark: Made Brion's life bearable again by shutting down srv145's switchport on csw1-pmtpa
  • 20:45 brion: srv145 is readonly fs, need to kill it
  • 20:00ish brion: updated WebRequest and such to make Serbian variant urls work again, new URL code broke em for a couple days
  • 15:57 Rob: reset db5 per Jeluf
  • 13:47 Rob: re-racked srv151, srv152, & srv153 in Rack C3.
  • 03:24 brion: updated install-modules51 script for wikidiff 1.0.1 package
  • 03:22 brion: srv101 appears to have been left broken as well; bogus sudo data and ssh key changed. been reinstalled and not quite updated properly? reupped

May 16

  • 23:12 mark: Moved udpmcast off bayle onto browne as it was affecting pdns recursor
  • 22:40 brion: bugzilla mail is more or less working we think, but mail is backed up a bit atm. changing dns for bugzilla to point at isidore
    • bugzilla SSL is not set up yet (was experimental before)
  • 21:54 brion: bugzilla upgrade on isidore so far so good, except mail setup is broken on that server
  • 20:58 brion: giving bugzilla upgrade another go, woo
  • ~19:40 Rob, Mark: Network moves of db5 - db10, mchenry & sanger onto the new MRJ-21 line card in C1
  • 19:31 brion: fixed sudoers on srv3
  • 19:29 brion: updated wikidiff2 packages and svn up'd for new diff layout style fixes
    • srv0 down
    • srv3 sudo is broken
  • 06:30 jeluf: nagios is reporting timeouts for nearly all of its ssh-based tests. Restarted pdns-recursor on bayle.

May 15

  • 19:12 mark: Moved udpmcast off goeje onto bayle.
  • 19:05 brion: bugzilla 3.0 upgrade failed on pascal with weird low-level glibc errors. restored 2.18, and will finalize the upgrade after a new machine is set up to host it
  • 19:50 jeluf: fixed OTRS's webform, so that the queues that became subqueues during the last reorg are now properly addressed, using masterqueue::subqueue notation
  • 11:37 Tim: enabled SyntaxHighlight_GeSHi on all wikis
  • 10:30 Tim: introduced a limit of 2 page moves per minute for non-sysop, non-bot users. In response to vandalism on Wikiquote.
  • 07:18 JeLuF: short network interruption (2 minutes) between KNAMS and PMTPA. Reason unknown.
  • 05:15 Tim: srv7 (OTRS) is out of disk space again. I guess that's what you get for putting a 12GB ibdata1 file on a 20GB partition and waiting a few months.
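
The 10:30 entry sets a limit of 2 page moves per minute for users who are neither sysops nor bots. One way to express such a rule is a sliding-window counter, sketched below. This is a generic illustration and not MediaWiki's rate limiter; the class name `MoveLimiter`, its parameters, and the group names passed in are assumptions made for the example.

```python
from collections import defaultdict, deque


class MoveLimiter:
    """Sliding-window limiter: at most max_moves per window seconds per user."""

    def __init__(self, max_moves=2, window=60.0, exempt=("sysop", "bot")):
        self.max_moves = max_moves
        self.window = window
        self.exempt = set(exempt)
        self.history = defaultdict(deque)  # user -> timestamps of recent moves

    def allow(self, user, groups, now):
        # Members of an exempt group are never limited.
        if self.exempt & set(groups):
            return True
        recent = self.history[user]
        # Forget moves that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_moves:
            return False
        recent.append(now)
        return True
```

Raising `max_moves` to 8 corresponds to the later change logged on May 22.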

May 14

  • 14:44 brion: power-cycling srv129, it's stuck in the mud with r/o filesystem and unresponsive login

May 13

  • 02:30 domas: fixed memcached (srv69 -> srv67) pool, people were getting session losses.

May 12

  • 03:30 Tim: Fixed MIME type for .wbmp on *.wap.wikipedia.org. Now the logo is displayed correctly in the Motorola Browser 2.2.1.

May 11

  • 23:24 Tim: Started apache on srv3, for test.wikipedia.org serving. There was a problem with segfaults on that server, I may have fixed them already but if not, test.wikipedia.org will be a good way to debug them.
  • 22:44 river: deployed NIS servers on srv1, srv2
  • 19:28 brion: test.wikipedia.org is down w/ squid errors. working on re-adding be-x-old links [readded to Names.php for compat]
  • 09:17 river: started an email job runner on srv91

May 10

  • 20:09 mark: Increased max simultaneous SMTP connections on both mail relays

May 9

  • 21:50 Tim: srv0 went down (test.wikipedia.org server). SSH and HTTP are both hanging after the connection completes. Redirected traffic to srv3 which has no running HTTP server, in case there was some problem with the incoming traffic or the wiki which is crashing servers.
  • 21:40 Tim: gave rainman an OAI account for incremental search update testing
  • 19:24 brion: leuksman.com http down briefly, restarted

May 7

  • all day brion: running password security checkers on enwiki, commonswiki. added some logging, doing some various things per internal discussion
  • ~10:00 river: amane:/export ran out of space, removed some old upload dumps

May 6

  • 16:07 river: created kabwiki
  • 13:42 Tim: reverting changes to the logging table on db6, db9 and db10. log_id needs to be added on a single server only and then copied to the other servers, so that it is stable.

May 5

  • 14:37 hashar: saved ourselves a bit of bandwidth by producing blank pages (php mistake). I reverted my change and everything came back. Outage ~ 2 minutes.

May 4

  • 21:15 brion: running a bad-title check on enwiki in the background to clean up bad titles from move page bug etc
  • 21:10 brion: was a brief problem with all logged actions failing due to use of non-present log_id field. this needs to be applied to the live sites before r21784 is re-applied
  • 14:14 Tim: enabled djvu rendering everywhere
  • 14:08 Tim: removing stale s2 databases from samuel
  • 13:05 Tim: pulling lomaria out of rotation for copy to thistle
  • 12:42 Tim: installed ganglia on adler and db10
  • 11:48 Tim: db5 has fields missing in frwiktionary and advisorywiki. Replication has been stopped for a couple of weeks. Fixing.

May 02

  • 19:26 mark: Set up (mailing lists) backups of lily to mchenry
  • 18:36 Tim: index change on lomaria
  • 18:15 Tim: switching s2 master from lomaria to db8 (db8-bin.001, 79).
  • 17:06 Tim: index change on db10, samuel, holbach
  • 15:18 Tim: index change on adler, db8
  • 12:21 Tim: index change on ixia
  • 12:06 Tim: index change on db9, db5
  • 12:03 Tim: Created schema changes page to track schema changes
  • 11:50 Tim: started replication on thistle
  • 11:34 Tim: running patch-backlinkindexes.sql on db7 (enwiki). Backup is also running on that server, it should be fine to continue on a lagged slave.
  • 7:21 river: removed broken servers srv131, srv101 from mediawiki_installation

April 30

  • 17:58 mark: Created wiki-mail.wikimedia.org DNS record / service IP, changed the mail.wikimedia.org A record to it. mchenry is now mail.wikimedia.org.
  • 15:48 mark: Made mchenry primary MX for all domains (except lists), taking albert out of the loop.
  • 15:45 Tim: deleting binlogs from srv95
  • 14:47 mark: Made lily secondary MX for all domains instead of pascal.
  • 12:46 mark: Set albert to forward mail to mchenry instead of goeje.
  • 11:31 mark: Added mchenry as secondary MX for lists.wikimedia.org.

April 29

  • 17:51 mark: Set bart's Postfix to use mchenry as smart host.
  • 17:19 mark: Moved pmtpa's secondary resolver .18 from khaldun to mchenry (mail relay)

April 27

  • 11:53 Tim: removing srv77, srv79 and srv80 from the apache pool, for lucene-search-2 test. Removed srv77 from the memcached pool.
  • 11:49 Tim: installed memcached on srv60
  • 11:30 Tim: attempted to resurrect srv78. It had bad firewall rules, I fixed them and ran setup-apache. But halfway through compiling PHP, it apparently crashed, dropping off the network, no ping.

April 26

  • 21:50 brion: scapping update to disallow password == username

April 24

  • 20:21 brion: fixed crond on leuksman.com
  • 19:30 mark: Increased COSS and aufs cache dirs on pmtpa upload cache squids to 15 GB
  • 16:45 mark: Loaded Ubuntu on storage1, learned rob something. Ready for configuration.
  • 14:35ish brion: updated live code, we found a couple fun problems: checkuser failures again (reverted bogus functions), and some problems with SVG rendering which tim is in process of fixing
  • 14:18 Rob: OS loaded on srv101 which has a new mainboard. Needs to have scripts run and put in rotation.
  • 14:17 Rob: srv66 rebooted and brought back online.

April 23

  • 10:50 Tim: updated interwiki links

April 22

April 20

  • 20:10 brion: ran updates on advisorywiki, had been made w/ old schema

April 19

  • 20:09 mark: Upgraded all knams upload squids to squid-2.6.12-1wm6 and set up aufs for big media files
  • 18:36 mark: Reinstalled knsq8 with Feisty
  • 14:16 Rob: thistle passes memtest86. Ran 6 passes with no problems. Rebooted and put back online. Needs to be placed back in rotation.
  • 13:21 Rob: Rebooted srv137 as it was unresponsive.
  • 13:21 brion: big scap!
  • 12:50 mark: Upgraded all pmtpa upload squids to squid-2.6.12-1wm6 and set up aufs for big media files
  • 11:21 mark: Upgraded sq1 to squid-2.6.12-1wm6 and set up aufs for big media files
  • 09:40 mark: Prepared installation environment for Ubuntu Feisty Fawn. Did two test installs on mint. The APT repository shares its packages with Edgy for now. Ubuntu Feisty installs are now possible.

April 18

  • 21:40 brion: got through SVN changes review; things should be ready for svn up at any time, will do this in the morning if tim doesn't beat me to the punch. conflicts or oddities are possible, of course!
  • ~20:00 Tim: Installed the following packages from FC4 on all apache servers: bitstream-vera-fonts fonts-bengali fonts-chinese fonts-gujarati fonts-hindi fonts-japanese fonts-korean fonts-punjabi fonts-tamil
  • 19:08 Tim: Installed DejaVu fonts version 2.16 on all apache servers. Added to setup-apache.
  • 15:55 brion: public advisory.wikimedia.org is up and open
  • 15:53 brion: updated addwiki.php with checkuser tables, it borked halfway through creating advisorywiki
  • 15:44 brion: fixed free space on rabanus, resynced common
  • 14:30 mark: knams Squids running out of local port space. Increased range to 1024 - 61000 for this boot
  • 14:20 mark: lily ran out of memory, looks like too many python processes (Mailman). Rebooted it and set ulimit -p 150 in the lighttpd init script.

April 17

  • 22:10 brion: s1 switch done! starting db schema updates on db2...
  • 21:44ish brion: s2 switch done! starting db schema updates on db8...
  • 21:40 brion: rabanus disk full
  • 21:27 brion: in midst of switch on s2; lomaria was not configured correctly per spec, holding while enabling binlog......
  • 20:38 brion: preparing master switches on s1 and s2 to finalize db updates
  • mark: Set cache_dir /dev/sda6 on all non-yaseo upload cache squids to read-only and increased the other 3 cache dirs from 10 GB to 15 GB
  • 19:47 Rob: Rebooted thistle in to a memory test per domas. Will leave it running the test until tomorrow afternoon or Thursday morning and check results.
  • 19:46 Rob: rebooted srv91 per domas's request.
    • Is back online as of 19:48 and will now accept SSH.
  • 17:54 brion: switched oaiaudit group from thistle to db1; since yesterday's switch oai feeds were broken

April 16

  • 22:20 Tim: removed langcomwiki from all.dblist on request from Pathoschild.
  • 20:03 mark: Doubled read_ahead_gap to 64KB on cache upload Squids, to buffer more data off amane
  • 19:34 mark: Enabled socket debugging (5, 1) on frontend Squids
  • 15:10 Tim: ran sync-fedora-mirror.sh with --delete enabled, to remove obsolete files.
  • 14:30 Tim: fixed /home/config/others/etc/yum.conf, was missing extras.
  • 14:14 brion: noticed texvc compile failure on srv62, texvc install failure on srv117
  • 14:05 brion: bad sudoers on srv150, 128, 58 (fixed)
  • 06:00 domas: apparently thistle has hit our usual master problem :) switched master to db1.

April 15

  • 19:21 mark: Set cache_dir /dev/sda6 on knsq9 to read-only and increased the other 3 cache dirs from 10 GB to 15 GB
  • 12:32 - 15:30 mark: Wrote a patch for Squid that adds a min-size option to cache_dirs, so aufs cache dirs on upload squids only take large files that the COSS dirs will not accept, instead of anything based on a round robin / load balancing scheme. Built a squid-2.6.12-1wm5 deb and installed it on knsq8, not in the repository yet.
  • 12:32 mark: Converted COSS swap dir /dev/sda6 into an AUFS dir (Reiserfs) on knsq8 for extra large files that don't fit in COSS dirs, as an experiment
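
The 12:32 - 15:30 entry describes a min-size option so that aufs cache dirs accept only objects too large for the COSS dirs, rather than taking a share of everything. The sketch below illustrates size-based eligibility in Python. It is not Squid's store selection code, and the `CacheDir` class, the directory names, and the byte thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheDir:
    name: str
    min_size: int = 0               # smallest object accepted, in bytes
    max_size: Optional[int] = None  # largest object accepted; None means no cap

    def accepts(self, size):
        if size < self.min_size:
            return False
        return self.max_size is None or size <= self.max_size


def eligible_dirs(dirs, size):
    """Names of the cache dirs that may store an object of this size."""
    return [d.name for d in dirs if d.accepts(size)]


# Hypothetical layout: three COSS dirs capped at 1 MB, one aufs dir above that.
dirs = [
    CacheDir("coss-a", max_size=1_000_000),
    CacheDir("coss-b", max_size=1_000_000),
    CacheDir("coss-c", max_size=1_000_000),
    CacheDir("aufs-big", min_size=1_000_001),
]
```

Without a lower bound on the aufs dir, small objects would be eligible for all four dirs and some would land on the slower one.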

April 13

  • 23:40 Tim: set up *.wap.wikipedia.org, served by anthony.

April 12

  • 22:00 Rob: racked and powered on mchenry and sanger, future mail servers. DRAC access is working.
  • 19:00-ish Rob: Powered down srv66: it is having disk issues, either a bad drive or controller.
  • 16:03 Hashar: csw2.knams came down for roughly 3 minutes.

April 11

  • 17:20 Tim: fixed dewiki search

April 10

  • 16:55 Rob: Turned up srv121, returned from SM with new mainboard. OS still nominal, and online. Needs to be synced and brought in to rotation.
  • 16:44 Rob: Consoled srv137 and saw no I/O errors pointing towards failed HDD. Rebooted and it is now online. Noted this in its history so if it occurs again, HDDs will be tested.
  • 16:43 Rob: Rebuilt array on db3 and reloaded the OS.
    • Requires DEV attention. Server is online but not in rotation. Needs setup scripts run.
  • 12:41 Rob: Rebooted srv66 as it was showing down in Nagios, as well as unresponsive to ssh and ping.
  • 12:40 Rob: Rebooted srv58 per Datacentre tasks

April 9

  • 19:19 brion: started batch regeneration of cu_changes tables; previous batch gen run broke on a lot of things
  • 16:54 mark: db2 had trouble, no one else around so I depooled it. Without realizing that it's the master. ;) Fortunately ariel was set to read-only so I didn't do too much damage, and db2 recovered...
  • 13:44 brion: putting db4 back in rotation, updates finished

April 7

  • 20:35 Tim: upgrading apache to 1.3.37
  • 11:05 mark: Taking db10 (Squid) out of production for reinstall as a DB
  • 07:00 domas: db3 I/O crashed, needs repairs
  • 03:40 brion: starting db updates on db4, db3 done

April 6

  • 20:24 brion: running manual rebuild of frwiki search
  • 19:51 brion: search rebuild for srwiki was stuck FUTEX; killed it
  • 14:11 brion: starting db updates on db3... db4 is last slave left after
  • 13:13 domas: db9 on enwiki duty
  • 11:38 mark: No incoming traffic via TWTC for a couple of hours. Removed outgoing AS prepend to diagnose.
  • 08:18 domas: enabled write behind & adaptive read ahead on db6 :-/
  • 04:38 Tim: upgraded Tidy on all apaches to a CVS snapshot designated 2007-04-02. They apparently don't do releases anymore.
  • 04:23 Tim: repooled ariel
  • 00:03 Tim: reduced buffer pool size on ariel from 6500MB to 6000MB.

April 5

  • 23:30 Tim: removed ariel from rotation, mysqld crashed and is restarting
  • 23:24 Tim: put db6 back in rotation

April 4

  • 21:01 brion: killed pagelinks subdump on enwiki dump on srv31; was hung mysteriously
  • 14:15 brion: fixed wikitech wiki broken for a few hours by a rename of the SyntaxHighlight extension

April 3

  • 20:39 Tim: removed db6 from enwiki contributions group
  • 18:47 brion: running db updates on db6 [3 and 4 remain left]

April 2

  • 19:00 jeluf: removed binlogs 1-10 on thistle.
  • 14:00ish brion: fiddling with dumps to use db7 to reduce load on db6
  • 13:30 brion: started db updates on ariel (ixia done)
  • 04:00 brion: started db updates on ixia (lomaria done)

April 1

  • 21:35 mark: Enabled monitorurl and -timeout to have Squid monitor its cache peers whether they are up or down. In the past weeks we've seen several Squids not redetect uptime of parent caches, in some cases ending up with no available peers at all.
  • 18:20 mark: Enabled caching on the frontend Squids. Even with the tiny 10 MB cache they have now, they achieve a 65% hit rate and thus all cache the (same) ~1000 hottest objects in memory without having to forward them to CARPed cache squids.
  • 12:40 brion: started db updates on lomaria (holbach done)
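
The 18:20 entry relies on each URL being forwarded to one particular cache squid, so that every frontend ends up holding the same hot objects. The sketch below shows that idea with rendezvous (highest-random-weight) hashing, which is a simplification and not the actual CARP hash or its load factors. The function name `pick_backend` and the backend names are hypothetical.

```python
import hashlib


def pick_backend(url, backends):
    """Pick the backend with the highest hash score for this URL."""

    def score(backend):
        digest = hashlib.sha1(f"{backend}|{url}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


backends = ["cache-a", "cache-b", "cache-c", "cache-d"]
chosen = pick_backend("/wiki/Main_Page", backends)
```

A useful property of this scheme is that removing a backend only remaps the URLs that were assigned to it; all other URLs keep their backend.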

March 31

  • 05:12 brion: locked tlh.wiktionary.org per previous announcement (then unlocked it, let someone else fight, don't care)
  • 02:25 brion: db updates starting on samuel (db5 done)

March 30

  • 18:27 brion: db updates starting on db5 (adler done)
  • 17:45 brion: batch-fixed links for the recent dump indexes
  • 15:50 brion: db updates starting on adler
  • 15:45 brion: ipmi-booted srv137, stuck again
  • 14:17 brion: manual clearing on benet, still something is overflowing
  • ... brion: db updates on webster

March 29

  • 23:31 Tim: freed up a measly amount of space on amane by deleting backups. Was down to 60GB.
  • 18:53 brion: running db updates on db7, will go on to others
  • 18:10 jeluf: changed config on goeje, albert, pascal so that mail directed to wiktionary.org is accepted.
  • 17:50 brion: updated lagtop to display current cluster assignments
  • 17:01 Rob: Reloaded OS on srv131 after replacing bad HDD.
    • Needs to have setup scripts run and brought into the rotation.
  • 16:54 Rob: Reloaded OS on srv128
    • Needs to have setup scripts run and brought into the rotation.
  • 15:44 Rob: Rebooted srv50 per datacenter tasks page.
  • 15:30 Rob: Rebooted srv122 as it was locked up and down on nagios, Machine Check Exception error.
  • 15:12 Rob: Re-racked harris with rails.
    • System was powered off, what purpose is this server currently being used towards?
  • 02:15 Tim: Configured squid to send all requests for test.wikipedia.org to srv0

March 28

  • 21:14 brion: fixed images for be-x-old (moved from be)
  • 21:05 brion: adding small fields for March 2007 schema changelets on all dbs
  • 18:46 brion: otrs hanging; srv7 db seems to have issues. space?
    • ok after i cleared out a couple gigs of binlogs. is this machine being replicated? do we have a backup plan for it?
  • 17:59 brion: fixed broken links in (new) dumps, ones from last couple days still broken
  • 17:00 mark: Set up wm07schols.wikimedia.org on friedrich by request of Austin, using the old wikimania htdocs.

March 27

  • 21:24 Tim: removed closed-zh-tw.wikipedia.org from all.dblist
  • 21:15 mark: Increased disk caches of text squids
  • 16:04 brion: fixed be-x-old.wikipedia.org (moved ES around and tweaked config). srv131 appears to be acting up again w/ read-only filesystem
  • 13:00 - 14:15 mark: knsq1 was giving out socket errors EADDRINUSE. It turned out to be pooled with a higher load in LVS, and quietly running out of port space (60k connections!). After correcting that, the cache squid was running at 100% cpu usage, with no syscalls at all. opreport:
    samples  %        symbol name
    9275     65.0923  squidaio_poll_queues
    3417     23.9806  squidaio_sync
    134       0.9404  headersEnd

March 26

  • 23:27 brion: updated refresh-dblist to write a full copy in pmtpa.dblist
  • 23:20 brion: /h/w/common/refresh-dblist has apparently been changed to not update pmtpa.dblist; as a result there were a few minutes of fun when the bad dblist from below was the live one. have restored a good copy of the list
  • 23:00ish brion: managed to accidentally eat the all.dblist and friends with an old 'refresh_dblists' script in /h/w/bin ... have moved it to disabled/subdir and rebuilt the lists, removing invalid entries
  • 22:50 mark: All squids upgraded to 2.6.12-1wm4, and ran apt-get upgrade.
  • 22:00-22:40ish brion: setting up be-x-old.wikipedia.org and importing new bewiki from incubator
  • 21:34 Tim: started new static HTML dump
  • 20:30 brion: srv31 back up after reboot; running more dump threads on srv31
  • 19:00 brion: starting new dumps on benet with new script o doom

March 25

  • 20:17 mark: Ran apt-get update && apt-get upgrade on knsq*, upgrading kernel, libc6 and squid to newer versions. Fixed earlier Squid issues.
  • 16:12 domas: redirected enwiki/Recentchangeslinked to ariel
  • 14:00 domas: Edge case in JobQueue (fixed in r20672, r20674) caused db3 to melt (load >20.0, cpu use 100% in user). at ~16:00 db3 going live and patches being deployed on site.
  • 13:49 mark: Squid on knsq1 has been segfaulting, started in a gdb/screen.

March 24

  • 18:10 Tim: killed frozen job runners on srv91-100.
  • 14:57 mark: Built a squid-2.6.12-1wm1 package and installed it on knsq1 for testing. Also ran apt-get upgrade. Rebooting knsq1 for the kernel upgrade
  • 12:40 mark: Changed overwrite-percent COSS option from 50% to 30% on knams upload cache squids. Cache size utilization always seems to be < 60%, and those machines have spare I/O bandwidth. Rather than increasing the cache dir partitions with the tedious restarts we might as well try to utilize the existing space better.
  • 12:29 mark: Disabled siblings for upload cache squids as well, HTCP hit rates had dropped to < 15%

March 23

  • 23:00 domas: db9 has kernel/fs level deadlocks, mysql doesn't like that, controller? RMA? drivers? Rob? :)
  • 23:00 domas: db7 had replication failure, due to commonswiki profiling/hitcounter touches (some purge script?)
  • 20:35 brion: srv39-41 had wrong sudoers files after recent reconfig; caused apache-graceful-all to hang prompting for passwords. manually recopied file; setup-apache script looks like it should have copied the files originally...
  • 19:20 brion: srv141 power-cycled and back up.
  • 19:16 brion: srv141 hanging, gonna poke it
  • 12:19 Rob: Rebooted srv81, it appears to be back online.
  • 12:12 Rob: Shutdown sq2, disk 4 replaced and brought back online.

March 22

  • 19:05 JeLuF: bootstrapped srv150 as apache, added to the farm
  • 17:42 Tim: moving srv21 and srv25 from search pool 1 to search pool 2, returning srv39 to the apache pool.
  • 02:12 Tim: srv81 is mostly down, removed from ext store rotation
  • 00:06 Tim: moved srv40-41 back to apache service

March 21

  • 22:30- Tim: moving srv21-30 to lucene service
  • 21:25 Tim: moved cluster1 and 2 to a new merged external storage cluster on srv96-98.
  • 20:25 Tim: srv31 down. Unmounted /var/backup/public/mnt/srv31 on benet.
  • 19:21 brion: lighty on benet was mysteriously dead for a bit. very slow to kill. killed it, restarted. srv31 seems stuck
  • 14:40 brion: syncing apache configurations and restarting apaches. at least one machine had old remnant.conf and possibly other bad configs, breaking office wiki intermittently

March 20

  • 23:00 domas: user attempted to delete sandbox on enwiki... :)

March 18

  • 22:18 brion: srv81 ssh broken and http connections hang; removed from LVS manually. srv131 has read-only filesystem but has no load

March 16

  • 15:45 Tim: reassigning srv21-24 from apache to search

March 15

  • 15:52 mark: Disabled all spanning tree on asw2 and asw3
  • 15:27 Rob: Moved srv152 back to its normal rack PDU. Booted it online.
  • 14:41 Rob: New storage1 racked and installed with FC4.
    • I am not sure if that is the OS we want in the end, since storage2 runs Ubuntu, I just wanted to test the install and make sure it went at a reasonable speed, and it did.
  • 14:21 Rob: Placed replacement drives in servers adler and ariel. Did not add to array.
  • 14:20 brion: set up blank zh.planet.wikimedia.org
  • 06:39 river: added a prefix-list (pm-in) on incoming BGP routes to deny 10.0.0.0/8 routes
  • 06:18 jeluf: after bw sealed a leaking route announcement (10.0.0.0/30), ariel and suda are reachable again. Restored watchlist config.
  • 06:10 jeluf: ariel died, removed the enwikiWatchlistServers section from the config

March 14

  • 19:53 brion: planet.wikimedia.org content moved to en.planet.wikimedia.org
    • redirects for feed, link for page
  • 19:41 brion: adding some language subdomains for planet
  • 19:06 Tim: Pybal was flapping the search backends due to timeouts. Disabled ProxyFetch for now, it can just use IdleConnection.
    • The dual processor servers are only 50% utilised (if that) due to excessive thread synchronization
  • 18:14 Tim: Switched search to use LVS instead of perlbal. Pybal on diderot.
  • 13:37 brion: restarted leuksman.com apache, some mystery prob

March 13

  • 23:38 mark: sq2 broke with a disk failure, killed squids
  • 09:57 mark: Removed siblings for text cache squids, hit rates were < 2%. Upload squids have hit rates up to ~25%, so keeping it enabled there for now.
  • 01:30 Tim: moved commons thumbnails back to bacon. Amane was toast, average service time blowing out to 4 seconds, and nobody noticed. It's OK now that it's off peak, but obviously bacon is better for this task.

March 12

  • 16:17 Tim: moved commons thumbnail serving back to amane

March 11

  • 22:43 mark: Set negative_ttl to 5 minutes for upload squids, so it can cache 404s and other errors. Negative cache entries for 404s are purged as usual; hopefully the other kinds of errors don't cause problems.
  • 18:45 brion: migrating some old files from benet -> amane; out of space again.
  • 17:28 brion: db7 replication is broken; it's trying to replicate updates from commonswiki from its master, db2, however there is no commonswiki db on db7. I'm not sure how these are supposed to be set up, so...
  • 11:40 Tim: enabled UsernameBlacklist everywhere
  • 00:50 Tim: Finished phasing in CARP configuration. Moved CARP config to /home/wikipedia/conf/squid.

March 10

  • 21:35 Tim: starting to phase in the CARP-based squid configuration. Starting with knsq1.
  • 7:50 jeluf: stopped db5, copying its DB to adler

March 9

  • 6:30 jeluf: bootstrapped adler, running mkfs.jfs -c on sda3, will need some time to finish
  • 00:03 Rob: Adler is re-installed sans 1 disk. Replacement disk is being taken care of.
    • Please bring Adler online as the main database server, as Ariel and Samuel need to be powered down and re-racked with rails that are on hand for them.

March 8

  • 21:20ish -- they came back online. a circuit breaker flipped, was reset
  • 21:05ish -- several servers in srv6x range dead, probably circuit breaker
  • 18:11 brion: adding planet.wikimedia.org to dns; setting up...
  • 14:35 Tim: phasing in remerged squid configuration

March 7

  • 19:45 brion: private SVN set up on net instead of just my laptop
  • 18:00 Tim: upgrading apaches to APC 3.0.13
  • 17:45 Tim: crashed srv43, kernel GPF due to oprofile
  • 17:03 brion: upgraded subversion server to 1.4.3

March 6

  • 19:41 Tim: set an idle timeout of 30 minutes in bfr for the search index rebuild. This will hopefully stop it hanging without killing any valid jobs.
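An idle timeout differs from a wall-clock limit in that it only fires when a job stops making progress, which is why it should not kill long but valid jobs. A minimal Python sketch of that idea follows. The internals of bfr are not described in this log, so the `IdleWatchdog` class and its method names are hypothetical; only the 30-minute value comes from the entry above.

```python
import time

IDLE_TIMEOUT_SECONDS = 30 * 60  # 30 minutes, the value from the log entry


class IdleWatchdog:
    """Tracks the time since the last sign of activity (hypothetical helper)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be exercised without waiting
        self.last_activity = clock()

    def touch(self):
        # Call whenever the job produces output or otherwise makes progress.
        self.last_activity = self.clock()

    def expired(self):
        # True only if nothing has happened for longer than the timeout,
        # regardless of how long the job has been running in total.
        return self.clock() - self.last_activity > self.timeout
```

A supervising loop would call `touch()` on every chunk of output and kill the job once `expired()` returns true.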

March 5

  • 20:30 brion: rebooting srv130 via ipmi; filesystem was read-only

March 3

  • 17:28 Tim: deploying new APC to all apaches
  • 04:51 Tim: Came up with a workaround for the segfault on exit bug, experimentally installed patched APC on srv43 and srv88. The problem is a glibc bug involving linking to libraries with DF_1_NODELETE, such as librt and libpthread. APC was not actually using librt, so I removed it from the link.

March 2

  • 04:34 Tim: We were still getting a large number of file size exceeded signals in the syslog. I eventually patched the bulk of the live code to check the file size before attempting an error log write. That seems to have fixed it. Will commit eventually.
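The guard described above can be sketched in Python as follows. The actual patch was to the live PHP code and is not shown in this log, so the function name `guarded_log_write` and the 2 GiB default limit are assumptions made for illustration.

```python
import os

MAX_LOG_BYTES = 2**31 - 1  # assumed limit: the classic 2 GiB large-file boundary


def guarded_log_write(path, message, max_bytes=MAX_LOG_BYTES):
    """Append message to path unless that would push the file past max_bytes.

    Returns True if the line was written, False if it was skipped.
    """
    data = (message.rstrip("\n") + "\n").encode("utf-8")
    try:
        current = os.path.getsize(path)
    except FileNotFoundError:
        current = 0  # file does not exist yet; the append will create it
    if current + len(data) > max_bytes:
        return False  # skip the write rather than hit the file size limit
    with open(path, "ab") as handle:
        handle.write(data)
    return True
```

The check and the write are not atomic, so concurrent writers could still overshoot slightly; the sketch only shows the size test that precedes each write.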

March 1

  • 15:43 Tim: rotated dberror.log, rapidly filling with database selector related errors.
  • 15:37 Tim: brought srv129 into rotation
  • 15:28 Tim: brought srv83 into rotation
  • 15:25 Tim: brought srv86 into rotation
  • 15:07 Tim: started jobs-daemon on srv98 and srv99

February 28

  • 16:25 Tim: tweaked setup-apache, installed full apache environment on srv152 (test Xeon)

February 26

  • 18:15 Tim: upgraded squid to 2.6.9-1wm4
  • 17:20 jeluf: srv142 not responding to ssh, removed from the load balancing. Added back srv147.
  • 15:40 Tim: did master split. New masters are thistle and db8, as described here.
  • 15:16 mark: "Fixed" PyBal with a dirty temporary hack. Will fix cleanly tonight.
  • 14:46 mark: Brought up iris as new load balancer, statically as PyBal thinks all squids are down due to some PartialDownloadError. Moved traffic back to knams as pmtpa was overloaded
  • 14:00 mark: pascal behaved odd, power cycling didn't help. Switched DNS to scenario KNAMS-down.

February 25

  • 23:39 Tim: adapted cluster provision in refresh-dblist to make cluster lists for the DB masters
  • Tim: db9 keeps freezing during query execution. Recompiling mysql didn't help. It can't be put into rotation like this.
  • 13:00-late Tim: setting up new master split.
  • 11:00 jeluf: set up replication on db8, started replication on db5, added db5 back into the LB pool
  • 00:54 Tim: reopened access to frwikiquote backup dumps

February 24

  • 22:28 Tim: stopping slave on db5 for mysqldump to db8
  • 22:20 Tim: running mkfs.jfs on adler, with bad block check option
  • 21:41 Rob: Shutdown, added rails, and re-racked Adler
  • 20:36 Rob: Rebooted adler and set the correct root password.

February 23

  • 12:02 mark: Reinstalled isidore with Edgy, is ready to go.
  • 02:23 Tim: db5 and thistle back in rotation
  • 02:00 Tim: copy complete, db5 and thistle catching up
  • 00:25 Tim: copying data directory from db5 to thistle
  • 00:20 Tim: wiped /a on thistle, was corrupt

February 22

  • 23:39 Tim: took thistle out of rotation. Can't mount /a, syslog reports bus error.
  • 22:46 mark: isidore got reinstalled wrongly, will fix it in a few days
  • 22:29 brion: setting up rejuvenated yongle to replace the old isidore stuff
  • 21:54 Rob: yongle reinstalled with ubuntu.
  • 21:36 Rob: Isidore back online, just reseated powercord and it boots fine.
  • 20:07 Rob: Adler back online, needs /a filesystem created and synced. Using only onboard disks.
  • 18:30 tim: switched master for non-enwiki to samuel
  • 18:15 jeluf, tim: adler down. Tim prepares a master switch, jeluf tries to reboot adler.
  • 10:33 brion: fiddling with dumps
  • 00:58 Tim: starting upgrade to squid 2.6.9-1wm3 everywhere
  • 10:27 domas: increased idle transaction timeout to 600s, due to apparently existing slow requests inside cluster.

February 21

  • 19:37 brion: taking over yongle to replace dead isidore
  • 19:12 mark: Fixed an issue on the mailing lists server where list wikifi-admin was not recognized due to the -admin / -owner / etc. suffixes.
  • 15:43 Tim: piloting squid 2.6.9-1wm3

February 20

  • 23:44 brion: leuksman.com apache down for a while, 'couldn't grab the accept mutex' o_O
  • 08:02 brion: confirmed presence of bad cached entries showing anon page views to logged-in users (due to buggy Vary: Accept-Encoding header being present for a few hours, overriding the Vary: Accept-Encoding, Cookie)
  • 06:27 mark: Uplink upgrade succeeded, restoring knams. Also putting more countries back on knams, will gradually add more during the day.
  • 05:12 mark: DNS scenario knams-down in preparation of the uplink upgrade.
  • 02:40 brion: srv137 up, re-synced its files. a happy ending powered by ipmi
  • 02:37 brion: power-cycling srv137... ipmi actually works on it \o/
  • 02:35 brion: srv137 complaining of read-only filesystem; slow or broken logins
  • 02:25 brion: Special:Export was broken under the new scheme; the buffer reset code didn't correctly handle the new handler. Should be fixed now in r19998
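The caching problem in the 08:02 entry comes down to which request headers a cache folds into its lookup key. The following is a simplified Python sketch of that mechanism, not Squid's actual implementation; the function `cache_key` and the sample header values are made up for illustration.

```python
def cache_key(url, request_headers, vary):
    """Build a lookup key from the URL plus the request headers named in Vary."""
    names = [name.strip().lower() for name in vary.split(",") if name.strip()]
    headers = {k.lower(): v for k, v in request_headers.items()}
    parts = [url] + [f"{name}={headers.get(name, '')}" for name in sorted(names)]
    return "|".join(parts)


anon = {"Accept-Encoding": "gzip"}
logged_in = {"Accept-Encoding": "gzip", "Cookie": "session=abc123"}
url = "/wiki/Main_Page"

# Buggy header: Cookie is ignored, so both users share one cache entry.
assert cache_key(url, anon, "Accept-Encoding") == cache_key(url, logged_in, "Accept-Encoding")

# Intended header: the cookie is part of the key, so the entries stay separate.
good = "Accept-Encoding, Cookie"
assert cache_key(url, anon, good) != cache_key(url, logged_in, good)
```

With the cookie missing from the key, a page rendered for an anonymous visitor is a valid cache hit for a logged-in user, which matches the symptom reported above.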

February 19

  •  ??? tim: enabled Content-Length header with the new compression buffer handler, rumored to help squid maintain persistent HTTP connections
  • 20:16 brion: reenabling image captcha, now using subdirs for hopefully better performance
  • 16:44 Rob: Completed PSU and MB swap for storage1, re-racked & started OS load.
  • 14:50 Rob: Pulled srv86 to work on report that it does not have a working monitor output.
    • Works Fine, re-racked.

February 18

  • 13:00 mark: Starting rsync of amane to storage2, without thumbs.

February 17

  • 23:55 jeluf: Started mysql on srv122
  • 23:00 many: Avicenna (text load balancer) failed. Load balancing was moved to alrazi. avicenna's ethernet is "link up" after a reboot, but doesn't ping. Site up after 25 minutes.
  • 22:10 Rob: Rebooted srv122
    • and it came back up right away. If it dies again, please click on its name and log it.
  • 22:00 Rob: Replaced PDUs in storage1, started netboot to reload OS.

February 16

  • 20:39 brion: srv117 sudoers file was broken; copied in standard one, sync-common now works.
  • 20:30 brion: srv110 and srv117 php files out of date; updated srv110 ok, 117 unhappy
  • 16:20 brion: isidore hung for some hours; requesting reboot
    • back up after a while
  • 05:15 jeluf: took thistle out of the rotation, copying its mysql DB to db8
  • 03:00 brion: set up http://wiktionarydev.leuksman.com/ for hippietrail to test in-development extensions for Wiktionary

February 15

  • 22:10 Rob: replaced ram in srv26 upgraded to 4 GB and brought server online.

February 14

  • 22:14 brion: added daily-image-l list
  • 07:48 brion: restarted search-rebuild-loop1 on srv37, was stuck at aawiki for a long time so not updating non-en wikis. bugzilla:8979
  • 06:30 jeluf: restarted squid on sq27 and sq30, removed binlogs 93-99 from adler.

February 13

  • 21:59 brion: benet out of space, moving files again. probably some broken stuff. sigh.

February 12

  • 22:40 brion: added login=PASS on cache_peer definitions for squid origin servers; apparently if you don't do this squid disobeys the HTTP spec and eats Authorization headers so you can't do any HTTP auth. nice!
  • 20:19 Tim: hit a bug, no_cache apparently destroys the public StoreEntry, giving up for now
  • 16:45 Tim: trying again
  • 14:15 Tim: CARP configuration didn't work on the pmtpa upload cluster, we had a CPU overload on several servers. Reverted back to the old conf.
  • 12:11 Tim: phasing in CARP-based squid config

February 11

  • 23:00 jeluf: mysql copy from db1 to thistle done. restarted replication. 40'000s replication lag
  • 15:42 mark: Reinstalled storage2, mounted a 3 TB JFS filesystem /dev/vg00/static, with ~ 600 GB free in the VG, starting rsync from Darwin. storage1 didn't come up after a reboot, which is unacceptable.
  • 11:55 jeluf: Took db1 out of rotation, generating a mysqldump from it, copying it to thistle.

February 10

  • 23:50ish brion: renewed fundraising.wikimedia.org SSL cert for a year
  • 23:19 Rob: srv101 will not detect NIC. Booting off LIVE CD has same results.
  • 22:23 Rob: thistle is online with FC5-64.
  • 22:21 Rob: Rebooted srv101, srv110, srv134.
  • 21:22 Rob: Rebooted srv122 per task list.
  • 21:15 Rob: srv126 down for memtest. Failed memtest.
  • 20:46 Rob: Created a single raid 1 on will
  • 20:45 Rob: thistle down for hardware troubleshooting.
  • 14:55 mark: yaseo hosts under SSH DoS, firewalled source ip
  • 8:00 jeluf: Started copying the DB from db1 to db8 and back.

February 9

  • 21:28 Rob: thistle: Running memtests until tomorrow evening.
  • 20:27 Rob: srv117 back online after repairs with FC4 and working SSH, please complete installation.
  • 20:47 Rob: Reset srv26 per Jeluf request.
  • 20:10 Rob: thistle is offline. The cpu/mainboard is having issues and requires replacement.
  • 20:09 mark: sq11 was replaced by SM, installed and up as an upload squid
  • 19:30 Rob: srv152 is installed and online with FC5 64. Please complete setup.
  • 19:30 mark: albert seems to report disk trouble / bad sectors / SMART failures on its console
  • 03:28 brion: updated leuksman.com to PHP 5.2.1

February 8

  • 16:48 brion: resynced and restarted apaches on srv76 and srv86
  • 16:30 Rob: replaced DRAC card in srv152; new DRAC online
  • 16:25 Rob: Plugged in and booted srv76 & srv86 per Jeluf.
  • 15:40 Rob: Moved srv148-srv153 down one port on switch.
  • 14:09 mark: Reinstalled ragweed with Ubuntu Edgy, for preparation as a backup server.

February 7

  • 23:38 brion: restarted apache on leuksman; was down
  • 23:38 brion: enabled poem extension sitewide (was on just some wikis)
  • 21:23 brion: texvc issues resolved now i think (/var/tmp/texvc build dir was owned by root, presumably from initial setup, and wouldn't resync properly as user)
  • 21:03 brion: srv71 clock hopefully resolved now
  • srv71 has clock offset over 6 seconds
  • srv61, srv53, srv54, srv68, srv120, maurus cannot compile texvc during sync, probably missing dev packages.
  • 19:19 brion: setting up il.wikimedia

February 6

  • 15:30 Tim: started httpd on amaryllis, reconfigured to allow serving of private log data from henbane
  • 6:00 jeluf: set up iSCSI on ragweed, connecting to the Infortrend array.

February 5

  • 17:00 brion: httpd now on and set to start on boot on isidore (had died when machine rebooted last week)
  • 14:50 mark: Please do not start Squid on yf1010, it has crashed and finally produced a useful coredump.

February 4

  • 19:30 mark: Auto-installed Ubuntu Edgy on srv153

February 3

  • 23:59 Rob: Installed FC5 on srv151
  • 23:50 Rob: Installed FC5 on srv153
  • 22:29 Rob: Enabled DRAC access for srv153
  • 22:23 Rob: Installed FC5 on srv152
  • 22:23 Rob: Enabled DRAC access for srv151
  • 22:23 Rob: Brought Coyotepoint Extreme online via serial console port 10 for mark.
  • 21:02 mark: Installed Ubuntu and squid on sq3, pooled it
  • 20:00 Rob: Rebooted srv22, it had shutdown due to a heat issue.
  • 20:00 Rob: Hooked up sq3 for mark (power, network, serial)
  • 01:00 mark: Put all external connections (transit/peering) in VLAN 4. Apparently the Cisco switches are sending VTP multicast junk on the native vlan 1 despite explicitly being told not to. The Foundry forwards these to VLAN 1 ports, which happened to contain the route-only ports.

February 2

  • 22:50 brion: shut down srv22, was overheating and whining a lot
  • 22:02 brion: taking srv76 and srv86 out of rotation (old ES masters, clusters 5 and 7 not actively being written to)
  • 21:58 brion: srv76 and srv86 down? dberror log filling up with connection loop failures

February 1

  • 22:35 Tim: upgraded FSS on srv144, was using the old buggy version from October.
  • 20:00ish jeluf: set up a new 8-disk RAID0 on db8. Installed Ubuntu.
  • 18:56 mark: srv147 IPMI does not seem to work (or I am doing it wrong), depooled it on dalembert
  • 18:15ish brion: srv147 not logging in properly, serving 404 errors; mark trying to get in and kill it
  • 07:00 Tim: now using customised error messages on the text squids.
  • 6:45 RobH: db10 was locked up on the console, rebooted. Seems to be online with no issues but still has an error on Nagios. Server should be responsive to DRAC.
  • 3:00 - 6:35 RobH: Installed cables for srv151-153. Troubleshooting on sq11.

Archives