Server Admin Log/Archive 10

June 30

13:30 Tim, rainman: installing LS2 for enwiki. Migrating servers starting at srv22.

June 29

~evening river: removed db7 from rotation to dump enwiki for yarrow
23:14 mark: Repooled knams
(all day) mark: network maintenance at knams, installed the new RX-8 (csw1-knams)
20:14 brion: running REPAIR TABLES on bugzilla tables on srv7
17:30 Tim: fixed pdns-recursor on bayle again
10:00 domas: bayle pdns_recursor gone missing. restarted.
07:28 mark: DNS scenario knams-down for network maintenance later today
00:58 Tim: fixed disk full condition on srv7 yet again
00:03ish robchurch: Users are complaining about OTRS database errors too, something wrong with srv7/8?

June 28

22:30ish robchurch: BugZilla broke :(
18:30 Tim: fixed secure.wikimedia.org for wikiversity, was making the boardvote session transfer not work
09:30 mark: Downpreffed AS path _30217_13680_ as people could not access us

June 27

21:00ish brion: did a scap to current (r23488 ish), made a few quick fixes. schema changes are upcoming but not needed yet for live code, allegedly
19:05 brion: fixed the dump generation issue where all sql dumps failed due to a changed pwd :P

June 26

18:55 brion: replaced secure.wikimedia.org's CA-Cert cert with a rapidssl one that won't make everyone's browsers whine. We may still want a nicer wildcard / unlimited hosts cert in the future.
16:09 Rob: srv80 back online after HDD swap and re-installation. Needs scripts run.
15:10-15:25 Another load spike, cause unknown.
14:45 River identifies a client doing a high-rate multithreaded load of action=render pages. Potential candidate for load spike source, blocked by Tim.
14:24 Rob: srv118 back online after bios reset. Please sync and bring in to service.
14:14 Rob: srv87 back powered on after powersupply replacement. Please sync and bring into service.
13:35 Load spike on apache cluster, drop in hit ratio, overflow of apache CPU capacity. Comes and goes.
08:20 domas: observed adler going down

June 25

18:50 mark: Upgraded pdns on bayle and yf1019 to pdns-server-2.9.21-1wm2 to fix the tab bug
15:22 Tim: reassigning srv56 to search indexing
14:23 Rainman, Tim: testing lucene-search-2 on srv153 and srv142

June 24

18:55 mark: Upgraded pdns-server on bayle to 2.9.21-1wm1
13:30 - 15:30 mark: Upgraded yf1019 to Feisty, and upgraded pdns-server to version 2.9.21-1wm1

June 21

11:05 mark: yf1015 was still sitting on the same IP as the newly installed yf1000, which was of course causing problems. Fixed.

June 20

20:24 mark: Installed yf1000 as text squid and pushed into production
17:00 jeluf: Added a second data file /usr/local/mysql/data2/ibdata2 to all three nodes of cluster 12 (srv124, 125, 126)

June 19

21:25 brion: doing a sweep for images with backslash (\) in name, found investigating an OAI-related problem triggered by a page update to [1]. Can't repro upload of such a filename now, believe it was an old bug.
21:07 brion: took live a change to SquidUpdate.php which should skip HTTP purges instead of doing both HTTP and HTCP (old code didn't exit the HTTP function when subcalling to the HTCP function)
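Note on the 21:07 entry: the fall-through it describes can be illustrated with a minimal sketch. This is not the SquidUpdate.php code; the function names and the `sent` list are hypothetical, and the sketch only assumes what the entry states, namely that the HTTP path kept running after delegating to the HTCP path.

```python
def htcp_purge(urls, sent):
    """Hypothetical stand-in for the HTCP purge path."""
    for url in urls:
        sent.append(("HTCP", url))


def purge_old(urls, use_htcp, sent):
    """Old behaviour as described in the log: no early exit after HTCP."""
    if use_htcp:
        htcp_purge(urls, sent)
        # Missing return: execution falls through to the HTTP loop below.
    for url in urls:
        sent.append(("HTTP", url))


def purge_fixed(urls, use_htcp, sent):
    """Fixed behaviour: delegate to HTCP, then skip the HTTP purges."""
    if use_htcp:
        htcp_purge(urls, sent)
        return
    for url in urls:
        sent.append(("HTTP", url))


old_log, new_log = [], []
purge_old(["/wiki/Foo"], True, old_log)
purge_fixed(["/wiki/Foo"], True, new_log)
```

With HTCP enabled, `old_log` ends up with one HTCP and one HTTP purge per URL, while `new_log` contains only the HTCP purge.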

June 18

14:21 Rob: srv80 shutdown pending HDD replacement. RMA placed.
13:30 Rob: srv87 unplugged. Powersupply is bad, RMA placed.
13:00 Rob: srv118 unplugged. Will not boot. RMA placed.

June 18

21:05 brion: updated lighty config on amane & restarted, tightening up some access control
21:00 jeluf: added a new datafile on /dev/sdb1 to mysql on srv121, srv122, srv123 (cluster11)

June 17

19:49 mark: Backported Dovecot 1.0.0 to Feisty, put it in the Wikimedia Universe repository, and installed it on sanger
06:51 Tim: deleting some binlogs on srv123 and srv126
00:22 river: set sitenotice for election stuff

June 16

20:05 mark: Put traffic load back on knams

[knams sick, scenario knams-down]

June 15

16:10 Tim: Set up a mysql server instance on srv77 for Lucene Search 2. Rainman chose the root password, it is documented in ~root/.my.cnf
14:25 jeluf: copy completed, adler and webster back in production. Enabled DB cluster "s3a" for frwiki and jawiki
04:45 jeluf: stopped replication on adler, took it out of the pool. Copying frwiki, jawiki and commonswiki to webster

June 14

21:32 mark: Brought sq41 back up
21:33 river: removed srv117 from mediawiki_installation/srv117 as it's not accepting logins
20:14 mark: yaseo squids reinstalled, deployed and ready. Putting traffic load back on yaseo
17:35 mark: Added the new yaseo Squid IPs to the MW CommonSettings.php, and removed some old entries
16:40 brion: switched SVN CIA-bot config to use mail again (via lily). xml-rpc was hanging, which sucks for committers
16:something brion: recovered most of the remaining commons images, doing another sweep for a few more
16:29 Rob: srv117 HDD replaced, mirrored, and FC reinstalled. Needs setup scripts run.
15:44 Rob: sq41 bad 4th disk replaced. Not turned on, as it needs to be sync'd once booted.
11:15 mark: Reinstalled yf1001 - yf1009 with Feisty, will deploy them as squids later today. yf1000 has no network link.
05:40 jeluf: Copying DB from lomaria to thistle.

June 12

20:11 Rob: webster hdd replaced and reinstalled.
20:06 Rob: thistle hdd replaced and reinstalled.
19:14 mark: Brought up amaryllis at the new location
19:13 Rob: srv80 shutdown, pending HDD replacement.
19:08 Rob: srv121 turned back on and online.
19:02 Rob: srv66 shutdown pending re-installation.
18:40 Rob: srv54 shutdown. Single HDD bad in system.
18:14 Rob: srv149 turned back on and online.
18:11 Rob: srv146 turned back on and online.
18:00 brion: applying patch-ipb_emailban.sql in prep for mw update
16:19 Rob: srv137 rebooted from crash, back online.
16:14 Rob: srv136 rebooted from crash, back online.
16:07 Rob: srv134 readonly filesystem. Rebooted and ran FSCK. Corrected a number of inode errors. Server is back online.
15:39 Tim: ddsh -cM -g apaches -- cp /home/config/others/etc/rc.local.apache /etc/rc.local
15:36 Rob: Shutdown sq41, HDD 03 is defective/clicking.
15:21 Rob: webster is in OS install on partitioning. Mark needs to correct something before it is back online. Replaced HDD in array.
15:14 Tim: set up henbane on new network, deployed squid conf change for logging
14:40 brion: suppressed error output from planet cronjob per mark's request
14:23 Rob: Rebooted webster. Its raid is optimal and it now allows remote access. Documented issue on its history page.
- Found the faulty drive, server currently in rebuild status.
13:00 mark: sq26 ran into some sort of livelock, not doing any syscalls but using 100% cpu. gdb backtrace shows aio... leaving it attached for now.
10:09 mark: Brought up yf1019, yaseo DNS server. Changed ns1's ip - delegation needs to be changed as well
10:09 yaseo hosts seem to be connected using eth1!
10:00 mark: Updated DNS entries of yaseo.wikimedia.org to the new situation

June 11

20:59 brion: removed chroot on benet's lighty, was breaking symlinked dirs in the downloads area for things being migrated to new boxen
17:44 Tim: changing password for wikiuser and wikiadmin after a leak on IRC
- srv87, thistle and webster will need to have their passwords changed before they can be put back into rotation
13:24 mark: sq12 suddenly rebooted for unknown reasons
08:00 mark: Uppreffed routes via AS30217 to knams and yaseo
07:55 mark: Depooled yaseo, servers are going to be moved today

June 10

23:48 Tim: removed binlogs 90-99 from srv123
18:25 Tim: fixed incorrect modification of $dbHostsByName in db.php. Commenting out a server from $dbHostsByName does not mark it down. Use $sectionLoads to do that.
7:45 jeluf: srv117 has a defective disk, apache is hanging. I shut the server down.

June 9

14:30 jeluf: started jobs-loop.sh on srv100 srv102 srv103 srv104 srv105 srv106 srv106 srv108 srv109 srv111 srv112 srv113 srv114 srv115 srv116 srv119. Apparently there wasn't any jobs-loop running since the power outage

June 8

10:25 mark: Level3 is being flaky, downpreffed routes to knams and yaseo so TWTC routes are selected

June 7

18:35 brion: setting up some nfs on storage2 for temporary dump work
17:52 Tim: moved enwiki contribs back to db6, reduced general load
14:51 Tim: Fixed monitorurl, deployed squid conf

June 6

19:43 mark: Backed out most of Tim's monitorurl changes as it caused corrupt config files with empty monitorurl options, and therefore, downtime. Please test/check configs thoroughly before reapplying.

June 5

20:00 jeluf: replaced rc.local on all apaches, so that they won't start apache on reboot. Use apache-start to start apache after a reboot.
17:50 jeluf: copied rc.local file to srv68, started apache
13:39 Tim: restarted backend squid on sq26. It was giving errors when attempting to forward to srv77 (ls2.wikimedia.org). cachemgr.cgi said that the server was down, but last failed connect "04/Jun/2007:13:23:56 +0000", over a day ago. Restarting fixed it.

June 4

15:18 Tim: restored memcached to normal operation, cleared 7 out of 26 slots
14:47 Tim: stopped slave on various old external stores
14:40 Tim: fixed replication on cluster10 and put it back into the write pool
13:00 brion: clusters 4 and 5 came back a bit ago
12:52ish brion: took off read-only
12:50 brion: everything looks ok except es clusters 4 and 5 still out
12:45 brion: restarted lvsmon on dalembert. for some reason it wasn't taking down servers out of rotation anymore
12:28 brion: mysql didn't autostart on cluster10, restarting it... 9 looks fine
12:22 brion: rob got in to colo, some es boxen back up
12:08 brion: reassigned 7 mc boxen to up boxen
11:42 brion: marking read-only for the moment
11:29 : some kind of power problem in DC -- ES cluster9 and cluster10 down entirely, also a number of memcaches, so site slow even for pages not using those storage clusters

June 3

16:03 Tim: Samuel's slave thread was stopped for a while due to temporarily buggy MediaWiki code. Fixed it. Re-enabled slave status monitoring in nagios, set it up to enforce a policy of no slave running on old external stores.
15:23 Tim: took webster out of rotation
12:04 Tim: svn up/scap
07:00 jeluf: webster's mysqld is down. Reports "read error" in its mysql error log. dmesg reports SCSI errors.

June 2

17:00 mark: Disconnected clematis because it's idle and I needed a switchport for the console server. Will reconnect once the RX-8 is at knams.

June 1

21:00 mark: Set up switches asw-c3-pmtpa and asw-c4-pmtpa
20:29 brion: chown'd all the mailing list files so the web server can edit configs and html files properly per bugzilla:10091
20:00 jeluf: removed binlogs 80-99 from srv95
19:10 mark: Set control = submission/retain_sender on SMTP mail submissions from allowed relay_from_hosts on all mail servers

May 31

22:31 brion: switched DNS for svn.wikimedia.org to mayflower now that all the bits seem to be working. documentation will go up once i've formatted it a bit
22:31 brion: _really_ locked them accounts now. david made one extra commit on old :( [2]
21:23 brion: locking svn+ssh accounts on old svn server to test migration on new box

May 30

19:52 mark: Installed Feisty on mayflower for use as Subversion server

May 27

23:37 river: created hakwiki

May 26

9:30 jeluf: removed binlogs 60-79 on srv123

May 25

10:40 jeluf: replaced memcached srv118 by a spare one - nagios was complaining for 4d 9h already...

May 24

12:40 Rob: Replaced DRAC in srv155 and programmed it. It is now accessible via DRAC.
12:37 Rob: sq39's power cord was loose, it was powered down, plugged back in. Also knocked out sq47 checking plugs, plugged back in.

May 22

19:46 mark: db5 installed with Feisty, and waiting for setup.
17:46 Rob: Replaced disk 1 in db5.
17:10 Rob: Rebooted srv145, filesystem no longer read-only.
16:03 mark: Set up sq39 as Squid, pushed into production
15:05 Tim: increased move rate limit to 8 per minute
14:27 Rob: Added rails to srv0 and restarted it back in to the installer.
- What purpose is srv0 supposed to be serving? It will not allow login, please see Datacentre_tasks
  - It was doing test.wikipedia.org until it crashed, srv3 took over. -- Tim
14:06 Tim: Created ls2.wikimedia.org, forwards to srv77 for lucene-search-2 test
13:40 Rob: sq39 fixed and OS loaded. Needs to be configured.

May 21

23:29 river: ~~started refreshlinks on eswiki for [3]~~ scratch that, fixed the bad rows manually
earlier river: created rswikimedia

May 20

23:36 river: srv126 out of space, removed old binlogs
22:20 hashar: gzipped / bzipped some logs on suda:/h/w/logs to save disk space. You might want to establish a logrotate script there.
21:55 mark: Upgraded bayle to Ubuntu 7.04 Feisty to fix the pdns recursor issues.

May 19

22:42 river: closed ndswikiquote
16:04 jeluf: enabled api.php's watchlist module again since it's now using the watchlist DB.
08:50 jeluf: started the pdns recursor on bayle
00:41 river: closed krwiki

May 18

21:25 river: created itwikiversity
18:39 brion: restored bug URL shortcut on new bugzilla
18:30 brion: srv8 replication is down, not sure why... io thread no starty
- Maybe because it stopped weeks ago and srv7 is missing almost 20 binlog files, probably deleted to free disk space on may 15?
  - It was before May 15, the server barely had room for 3 binlogs let alone 20.
14:11 brion: bugzilla migration done, but srv7 seems confused about dns and thinks isidore is bart. flush hosts had no effect
13:45 brion: shutting down bugzilla to try migrating its db to srv7
00:13 Tim: fixed fstab on srv101. Had two swap lines with binary junk in the label, which was causing the rest of the file (including NFS mounts) to be ignored. Removed them. Also did mount -a.
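Note on the 00:13 entry: a check along the following lines could flag fstab lines that contain binary junk before they cause the rest of the file to be ignored. This is a hedged sketch, not a tool that was used on srv101; the function name and the sample content are made up for illustration.

```python
def suspicious_fstab_lines(data: bytes):
    """Return 1-based numbers of lines holding bytes outside printable ASCII.

    Tabs are allowed because fstab fields are commonly tab-separated.
    """
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        if any((b < 0x20 and b != 0x09) or b > 0x7E for b in line):
            bad.append(lineno)
    return bad


# Hypothetical sample: line 2 has junk in the swap label.
SAMPLE = (
    b"/dev/sda1 / ext3 defaults 1 1\n"
    b"LABEL=\x00\xff\x13 swap swap defaults 0 0\n"
    b"host:/home /home nfs defaults 0 0\n"
)

flagged = suspicious_fstab_lines(SAMPLE)
```

On a real host the input would come from something like `open("/etc/fstab", "rb").read()`; reading in binary mode matters, since decoding junk bytes as text may raise an error first.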

May 17

23:50 Tim: srv101 was not in mediawiki-installation and so missed out on the squid list update. Fixed.
22:06 mark: ...and of course I forgot about Mediawiki's Squid list again.
21:47 mark: Deployed all new squids sq31 - sq50, except sq39 which is broken
21:00 mark: Made Brion's life bearable again by shutting down srv145's switchport on csw1-pmtpa
20:45 brion: srv145 is readonly fs, need to kill it
20:00ish brion: updated WebRequest and such to make Serbian variant urls work again, new URL code broke em for a couple days
15:57 Rob: reset db5 per Jeluf
13:47 Rob: re-racked srv151, srv152, & srv153 in Rack C3.
03:24 brion: updated install-modules51 script for wikidiff 1.0.1 package
03:22 brion: srv101 appears to have been left broken as well; bogus sudo data and ssh key changed. been reinstalled and not quite updated properly? reupped

May 16

23:12 mark: Moved udpmcast off bayle onto browne as it was affecting pdns recursor
22:40 brion: bugzilla mail is more or less working we think, but mail is backed up a bit atm. changing dns for bugzilla to point at isidore
- bugzilla SSL is not set up yet (was experimental before)
21:54 brion: bugzilla upgrade on isidore so far so good, except mail setup is broken on that server
20:58 brion: giving bugzilla upgrade another go, woo
~19:40 Rob, Mark: Network moves of db5 - db10, mchenry & sanger onto the new MRJ-21 line card in C1
19:31 brion: fixed sudoers on srv3
19:29 brion: updated wikidiff2 packages and svn up'd for new diff layout style fixes
- srv0 down
- srv3 sudo is broken
06:30 jeluf: nagios is reporting timeouts for nearly all of its ssh-based tests. Restarted pdns-recursor on bayle.

May 15

19:12 mark: Moved udpmcast off goeje onto bayle.
19:05 brion: bugzilla 3.0 upgrade failed on pascal with weird low-level glibc errors. restored 2.18, and will finalize the upgrade after a new machine is set up to host it
19:50 jeluf: fixed OTRS's webform, so that the queues that became subqueues during the last reorg are now properly addressed, using masterqueue::subqueue notation
11:37 Tim: enabled SyntaxHighlight_GeSHi on all wikis
10:30 Tim: introduced a limit of 2 page moves per minute for non-sysop, non-bot users. In response to vandalism on Wikiquote.
07:18 JeLuF: short network interruption (2 minutes) between KNAMS and PMTPA. Reason unknown.
05:15 Tim: srv7 (OTRS) is out of disk space again. I guess that's what you get for putting a 12GB ibdata1 file on a 20GB partition and waiting a few months.
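Note on the 10:30 entry: a per-user limit of 2 page moves per minute for non-sysop, non-bot users can be sketched as a sliding-window counter. This is only an illustration of the policy and makes no claim about how MediaWiki implements it; the class name, the `exempt` flag and the in-memory storage are assumptions.

```python
from collections import defaultdict, deque


class MoveRateLimiter:
    """Sliding-window limiter: at most max_moves per window seconds per user."""

    def __init__(self, max_moves=2, window=60.0):
        self.max_moves = max_moves
        self.window = window
        self.history = defaultdict(deque)  # user -> timestamps of allowed moves

    def allow(self, user, now, exempt=False):
        # Sysops and bots are exempt in the policy described in the log.
        if exempt:
            return True
        recent = self.history[user]
        # Drop moves that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_moves:
            return False
        recent.append(now)
        return True


limiter = MoveRateLimiter()
results = [limiter.allow("vandal", t) for t in (0, 10, 20, 61)]
```

Here the third move (at 20 s) is refused and the fourth (at 61 s) is allowed again because the first has left the window. The May 22 change to 8 moves per minute would correspond to `MoveRateLimiter(max_moves=8)` in this sketch.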

May 14

14:44 brion: power-cycling srv129, it's stuck in the mud with r/o filesystem and unresponsive login

May 13

02:30 domas: fixed memcached (srv69 -> srv67) pool, people were getting session losses.

May 12

03:30 Tim: Fixed MIME type for .wbmp on *.wap.wikipedia.org. Now the logo is displayed correctly in the Motorola Browser 2.2.1.

May 11

23:24 Tim: Started apache on srv3, for test.wikipedia.org serving. There was a problem with segfaults on that server, I may have fixed them already but if not, test.wikipedia.org will be a good way to debug them.
22:44 river: deployed NIS servers on srv1, srv2
19:28 brion: test.wikipedia.org is down w/ squid errors. working on re-adding be-x-old links [readded to Names.php for compat]
09:17 river: started an email job runner on srv91

May 10

20:09 mark: Increased max simultaneous SMTP connections on both mail relays

May 9

21:50 Tim: srv0 went down (test.wikipedia.org server). SSH and HTTP are both hanging after the connection completes. Redirected traffic to srv3 which has no running HTTP server, in case there was some problem with the incoming traffic or the wiki which is crashing servers.
21:40 Tim: gave rainman an OAI account for incremental search update testing
19:24 brion: leuksman.com http down briefly, restarted

May 7

all day brion: running password security checkers on enwiki, commonswiki. added some logging, doing some various things per internal discussion
~10:00 river: amane:/export ran out of space, removed some old upload dumps

May 6

16:07 river: created kabwiki
13:42 Tim: reverting changes to the logging table on db6, db9 and db10. log_id needs to be added on a single server only and then copied to the other servers, so that it is stable.

May 5

14:37 hashar: saved ourselves a bit of bandwidth by producing blank pages (php mistake). I reverted my change and everything came back. Outage ~ 2 minutes.

May 4

21:15 brion: running a bad-title check on enwiki in the background to clean up bad titles from move page bug etc
21:10 brion: was a brief problem with all logged actions failing due to use of non-present log_id field. this needs to be applied to the live sites before r21784 is re-applied
14:14 Tim: enabled djvu rendering everywhere
14:08 Tim: removing stale s2 databases from samuel
13:05 Tim: pulling lomaria out of rotation for copy to thistle
12:42 Tim: installed ganglia on adler and db10
11:48 Tim: db5 has fields missing in frwiktionary and advisorywiki. Replication has been stopped for a couple of weeks. Fixing.

May 02

19:26 mark: Set up (mailing lists) backups of lily to mchenry
18:36 Tim: index change on lomaria
18:15 Tim: switching s2 master from lomaria to db8 (db8-bin.001, 79).
17:06 Tim: index change on db10, samuel, holbach
15:18 Tim: index change on adler, db8
12:21 Tim: index change on ixia
12:06 Tim: index change on db9, db5
12:03 Tim: Created schema changes page to track schema changes
11:50 Tim: started replication on thistle
11:34 Tim: running patch-backlinkindexes.sql on db7 (enwiki). Backup is also running on that server, it should be fine to continue on a lagged slave.
7:21 river: removed broken servers srv131, srv101 from mediawiki_installation

April 30

17:58 mark: Created wiki-mail.wikimedia.org DNS record / service IP, changed the mail.wikimedia.org A record to it. mchenry is now mail.wikimedia.org.
15:48 mark: Made mchenry primary MX for all domains (except lists), taking albert out of the loop.
15:45 Tim: deleting binlogs from srv95
14:47 mark: Made lily secondary MX for all domains instead of pascal.
12:46 mark: Set albert to forward mail to mchenry instead of goeje.
11:31 mark: Added mchenry as secondary MX for lists.wikimedia.org.

April 29

17:51 mark: Set bart's Postfix to use mchenry as smart host.
17:19 mark: Moved pmtpa's secondary resolver .18 from khaldun to mchenry (mail relay)

April 27

11:53 Tim: removing srv77, srv79 and srv80 from the apache pool, for lucene-search-2 test. Removed srv77 from the memcached pool.
11:49 Tim: installed memcached on srv60
11:30 Tim: attempted to resurrect srv78. It had bad firewall rules, I fixed them and ran setup-apache. But halfway through compiling PHP, it apparently crashed, dropping off the network, no ping.

April 26

21:50 brion: scapping update to disallow password == username

April 24

20:21 brion: fixed crond on leuksman.com
19:30 mark: Increased COSS and aufs cache dirs on pmtpa upload cache squids to 15 GB
16:45 mark: Loaded Ubuntu on storage1, learned rob something. Ready for configuration.
14:35ish brion: updated live code, we found a couple fun problems: checkuser failures again (reverted bogus functions), and some problems with SVG rendering which tim is in process of fixing
14:18 Rob: OS loaded on srv101 which has a new mainboard. Needs to have scripts run and put in rotation.
14:17 Rob: srv66 rebooted and brought back online.

April 23

10:50 Tim: updated interwiki links

April 22

18:00 mark: OS-installed mchenry and sanger

April 20

20:10 brion: ran updates on advisorywiki, had been made w/ old schema

April 19

20:09 mark: Upgraded all knams upload squids to squid-2.6.12-1wm6 and set up aufs for big media files
18:36 mark: Reinstalled knsq8 with Feisty
14:16 Rob: thistle passes memtest86. Ran 6 passes with no problems. Rebooted and put back online. Needs to be placed back in rotation.
13:21 Rob: Rebooted srv137 as it was unresponsive.
13:21 brion: big scap!
12:50 mark: Upgraded all pmtpa upload squids to squid-2.6.12-1wm6 and set up aufs for big media files
11:21 mark: Upgraded sq1 to squid-2.6.12-1wm6 and set up aufs for big media files
09:40 mark: Prepared installation environment for Ubuntu Feisty Fawn. Did two test installs on mint. The APT repository shares its packages with Edgy for now. Ubuntu Feisty installs are now possible.

April 18

21:40 brion: got through SVN changes review; things should be ready for svn up at any time, will do this in the morning if tim doesn't beat me to the punch. conflicts or oddities are possible, of course!
~20:00 Tim: Installed the following packages from FC4 on all apache servers: bitstream-vera-fonts fonts-bengali fonts-chinese fonts-gujarati fonts-hindi fonts-japanese fonts-korean fonts-punjabi fonts-tamil
19:08 Tim: Installed DejaVu fonts version 2.16 on all apache servers. Added to setup-apache.
15:55 brion: public advisory.wikimedia.org is up and open
15:53 brion: updated addwiki.php with checkuser tables, it borked halfway through creating advisorywiki
15:44 brion: fixed free space on rabanus, resynced common
14:30 mark: knams Squids running out of local port space. Increased range to 1024 - 61000 for this boot
14:20 mark: lily ran out of memory, looks like too many python processes (Mailman). Rebooted it and set ulimit -p 150 in the lighttpd init script.
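Note on the 14:30 entry: the effect of widening the local port range is simple arithmetic. The sketch below assumes the previous range was 32768 - 61000, a common Linux default at the time; the log does not state the old value, so treat that figure as an assumption.

```python
def usable_ports(low, high):
    """Number of local ports in an inclusive range."""
    return high - low + 1


OLD_RANGE = (32768, 61000)  # assumption: not stated in the log
NEW_RANGE = (1024, 61000)   # from the 14:30 entry

old_ports = usable_ports(*OLD_RANGE)
new_ports = usable_ports(*NEW_RANGE)
```

Under that assumption the change roughly doubles the number of local ports available for outgoing connections to a single destination address and port.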

April 17

22:10 brion: s1 switch done! starting db schema updates on db2...
21:44ish brion: s2 switch done! starting db schema updates on db8...
21:40 brion: rabanus disk full
21:27 brion: in midst of switch on s2; lomaria was not configured correctly per spec, holding while enabling binlog......
20:38 brion: preparing master switches on s1 and s2 to finalize db updates
mark: Set cache_dir /dev/sda6 on all non-yaseo upload cache squids to read-only and increased the other 3 cache dirs from 10 GB to 15 GB
19:47 Rob: Rebooted thistle in to a memory test per domas. Will leave it running the test until tomorrow afternoon or Thursday morning and check results.
19:46 Rob: rebooted srv91 per domas's request.
- Is back online as of 19:48 and will now accept SSH.
17:54 brion: switched oaiaudit group from thistle to db1; since yesterday's switch oai feeds were broken

April 16

22:20 Tim: removed langcomwiki from all.dblist on request from Pathoschild.
20:03 mark: Doubled read_ahead_gap to 64KB on cache upload Squids, to buffer more data off amane
19:34 mark: Enabled socket debugging (5, 1) on frontend Squids
15:10 Tim: ran sync-fedora-mirror.sh with --delete enabled, to remove obsolete files.
14:30 Tim: fixed /home/config/others/etc/yum.conf, was missing extras.
14:14 brion: noticed texvc compile failure on srv62, texvc install failure on srv117
14:05 brion: bad sudoers on srv150, 128, 58 (fixed)
06:00 domas: apparently thistle has hit our usual master problem :) switched master to db1.

April 15

19:21 mark: Set cache_dir /dev/sda6 on knsq9 to read-only and increased the other 3 cache dirs from 10 GB to 15 GB
12:32 - 15:30 mark: Wrote a patch for Squid that adds a min-size option to cache_dirs, so aufs cache dirs on upload squids only take large files that the COSS dirs will not accept, instead of anything based on a round robin / load balancing scheme. Built a squid-2.6.12-1wm5 deb and installed it on knsq8, not in the repository yet.
12:32 mark: Converted COSS swap dir /dev/sda6 into an AUFS dir (Reiserfs) on knsq8 for extra large files that don't fit in COSS dirs, as an experiment
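Note on the min-size patch above: the intended selection rule can be sketched as a size filter applied before any load balancing between cache dirs. This is not the Squid patch itself; the class, the directory names and the 512 KB threshold are hypothetical and only illustrate the idea that aufs dirs take the objects too large for the COSS dirs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheDir:
    name: str
    min_size: int = 0               # smallest object accepted, in bytes
    max_size: Optional[int] = None  # largest object accepted; None means no limit

    def accepts(self, size: int) -> bool:
        if size < self.min_size:
            return False
        return self.max_size is None or size <= self.max_size


def eligible_dirs(dirs: List[CacheDir], size: int) -> List[str]:
    """Names of the cache dirs allowed to store an object of the given size."""
    return [d.name for d in dirs if d.accepts(size)]


LIMIT = 512 * 1024  # hypothetical COSS object size limit
DIRS = [
    CacheDir("coss-1", max_size=LIMIT),
    CacheDir("coss-2", max_size=LIMIT),
    CacheDir("aufs-1", min_size=LIMIT + 1),
]

small = eligible_dirs(DIRS, 100 * 1024)
large = eligible_dirs(DIRS, 5 * 1024 * 1024)
```

Without a lower bound, the aufs dir would also be a candidate for small objects, which is the round robin behaviour the patch was written to avoid.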

April 13

23:40 Tim: set up *.wap.wikipedia.org, served by anthony.

April 12

22:00 Rob: racked and powered on mchenry and sanger, future mail servers. DRAC access is working.
19:00-ish Rob: Powered down srv66: it is having disk issues, either a bad drive or controller.
16:03 Hashar: csw2.knams came down for roughly 3 minutes.

April 11

17:20 Tim: fixed dewiki search

April 10

16:55 Rob: Turned up srv121, returned from SM with new mainboard. OS still nominal, and online. Needs to be synced and brought in to rotation.
16:44 Rob: Consoled srv137 and saw no I/O errors pointing towards failed HDD. Rebooted and it is now online. Noted this in its history so if it occurs again, HDD's will be tested.
16:43 Rob: Rebuilt array on db3 and reloaded the OS.
- Requires DEV attention. Server is online but not in rotation. Needs setup scripts run.
12:41 Rob: Rebooted srv66 as it was showing down in Nagios, as well as unresponsive to ssh and ping.
12:40 Rob: Rebooted srv58 per Datacentre tasks

April 9

19:19 brion: started batch regeneration of cu_changes tables; previous batch gen run broke on a lot of things
16:54 mark: db2 had trouble, no one else around so I depooled it. Without realizing that it's the master. ;) Fortunately ariel was set to read-only so I didn't do too much damage, and db2 recovered...
13:44 brion: putting db4 back in rotation, updates finished

April 7

20:35 Tim: upgrading apache to 1.3.37
11:05 mark: Taking db10 (Squid) out of production for reinstall as a DB
07:00 domas: db3 I/O crashed, needs repairs
03:40 brion: starting db updates on db4, db3 done

April 6

20:24 brion: running manual rebuild of frwiki search
19:51 brion: search rebuild for srwiki was stuck FUTEX; killed it
14:11 brion: starting db updates on db3... db4 is last slave left after
13:13 domas: db9 on enwiki duty
11:38 mark: No incoming traffic via TWTC for a couple of hours. Removed outgoing AS prepend to diagnose.
08:18 domas: enabled write behind & adaptive read ahead on db6 :-/
04:38 Tim: upgraded Tidy on all apaches to a CVS snapshot designated 2007-04-02. They apparently don't do releases anymore.
04:23 Tim: repooled ariel
00:03 Tim: reduced buffer pool size on ariel from 6500MB to 6000MB.

April 5

23:30 Tim: removed ariel from rotation, mysqld crashed and is restarting
23:24 Tim: put db6 back in rotation

April 4

21:01 brion: killed pagelinks subdump on enwiki dump on srv31; was hung mysteriously
14:15 brion: fixed wikitech wiki broken for a few hours by a rename of the SyntaxHighlight extension

April 3

20:39 Tim: removed db6 from enwiki contributions group
18:47 brion: running db updates on db6 [3 and 4 remain left]

April 2

19:00 jeluf: removed binlogs 1-10 on thistle.
14:00ish brion: fiddling with dumps to use db7 to reduce load on db6
13:30 brion: started db updates on ariel (ixia done)
04:00 brion: started db updates on ixia (lomaria done)

April 1

21:35 mark: Enabled monitorurl and -timeout to have Squid monitor whether its cache peers are up or down. In the past weeks we've seen several Squids not redetect uptime of parent caches, in some cases ending up with no available peers at all.
18:20 mark: Enabled caching on the frontend Squids. Even with the tiny 10 MB cache they have now, they achieve a 65% hit rate and thus all cache the (same) ~1000 hottest objects in memory without having to forward them to CARPed cache squids.
12:40 brion: started db updates on lomaria (holbach done)
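Note on the 18:20 entry: the frontends forward cache misses to the CARPed cache squids, so each URL consistently lands on one backend. The sketch below shows the general idea with a highest-score hash; it uses MD5 purely for illustration and is not the actual CARP hash function or Squid's implementation. The peer names are hypothetical.

```python
import hashlib


def score(peer, url):
    """Combine peer name and URL into a deterministic numeric score."""
    digest = hashlib.md5((peer + "|" + url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_peer(url, peers):
    """Choose the peer with the highest score for this URL."""
    return max(peers, key=lambda peer: score(peer, url))


PEERS = ["cache-a", "cache-b", "cache-c"]  # hypothetical backend names
URL = "http://en.wikipedia.org/wiki/Main_Page"
chosen = pick_peer(URL, PEERS)
```

A property of this scheme is that removing a peer only remaps the URLs that peer was serving; every other URL keeps its backend, so the remaining caches stay warm.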

March 31

05:12 brion: locked tlh.wiktionary.org per previous announcement (then unlocked it, let someone else fight, don't care)
02:25 brion: db updates starting on samuel (db5 done)

March 30

18:27 brion: db updates starting on db5 (adler done)
17:45 brion: batch-fixed links for the recent dump indexes
15:50 brion: db updates starting on adler
15:45 brion: ipmi-booted srv137, stuck again
14:17 brion: manual clearing on benet, still something is overflowing
... brion: db updates on webster

March 29

23:31 Tim: freed up a measly amount of space on amane by deleting backups. Was down to 60GB.
18:53 brion: running db updates on db7, will go on to others
18:10 jeluf: changed config on goeje, albert, pascal so that mail directed to wiktionary.org is accepted.
17:50 brion: updated lagtop to display current cluster assignments
17:01 Rob: Reloaded OS on srv131 after replacing bad HDD.
- Needs to have setup scripts run and brought into the rotation.
16:54 Rob: Reloaded OS on srv128
- Needs to have setup scripts run and brought into the rotation.
15:44 Rob: Rebooted srv50 per datacenter tasks page.
15:30 Rob: Rebooted srv122 as it was locked up and down on nagios, Machine Check Exception error.
15:12 Rob: Re-racked harris with rails.
- System was powered off, what purpose is this server currently being used towards?
02:15 Tim: Configured squid to send all requests for test.wikipedia.org to srv0

March 28

21:14 brion: fixed images for be-x-old (moved from be)
21:05 brion: adding small fields for March 2007 schema changelets on all dbs
18:46 brion: otrs hanging; srv7 db seems to have issues. space?
- ok after i cleared out a couple gigs of binlogs. is this machine being replicated? do we have a backup plan for it?
17:59 brion: fixed broken links in (new) dumps, ones from last couple days still broken
17:00 mark: Set up wm07schols.wikimedia.org on friedrich by request of Austin, using the old wikimania htdocs.

March 27

21:24 Tim: removed closed-zh-tw.wikipedia.org from all.dblist
21:15 mark: Increased disk caches of text squids
16:04 brion: fixed be-x-old.wikipedia.org (moved ES around and tweaked config). srv131 appears to be acting up again w/ read-only filesystem
13:00 - 14:15 mark: knsq1 was giving out socket errors EADDRINUSE. It turned out to be pooled with a higher load in LVS, and quietly running out of port space (60k connections!). After correcting that, the cache squid was running at 100% cpu usage, with no syscalls at all. opreport:

samples  %        symbol name
9275     65.0923  squidaio_poll_queues
3417     23.9806  squidaio_sync
134       0.9404  headersEnd

March 26

23:27 brion: updated refresh-dblist to write a full copy in pmtpa.dblist
23:20 brion: /h/w/common/refresh-dblist has apparently been changed to not update pmtpa.dblist; as a result there were a few minutes of fun when the bad dblist from below was the live one. have restored a good copy of the list
23:00ish brion: managed to accidentally eat the all.dblist and friends with an old 'refresh_dblists' script in /h/w/bin ... have moved it to disabled/subdir and rebuilt the lists, removing invalid entries
22:50 mark: All squids upgraded to 2.6.12-1wm4, and ran apt-get upgrade.
22:00-22:40ish brion: setting up be-x-old.wikipedia.org and importing new bewiki from incubator
21:34 Tim: started new static HTML dump
20:30 brion: srv31 back up after reboot; running more dump threads on srv31
19:00 brion: starting new dumps on benet with new script o doom

March 25

20:17 mark: Ran apt-get update && apt-get upgrade on knsq*, upgrading kernel, libc6 and squid to newer versions. Fixed earlier Squid issues.
16:12 domas: redirected enwiki/Recentchangeslinked to ariel
14:00 domas: Edge case in JobQueue (fixed in r20672, r20674) caused db3 to melt (load >20.0, cpu use 100% in user). at ~16:00 db3 going live and patches being deployed on site.
13:49 mark: Squid on knsq1 has been segfaulting, started in a gdb/screen.

March 24

18:10 Tim: killed frozen job runners on srv91-100.
14:57 mark: Built a squid-2.6.12-1wm1 package and installed it on knsq1 for testing. Also ran apt-get upgrade. Rebooting knsq1 for the kernel upgrade
12:40 mark: Changed overwrite-percent COSS option from 50% to 30% on knams upload cache squids. Cache size utilization always seems to be < 60%, and those machines have spare I/O bandwidth. Rather than increasing the cache dir partitions with the tedious restarts we might as well try to utilize the existing space better.
12:29 mark: Disabled siblings for upload cache squids as well, HTCP hit rates had dropped to < 15%

March 23

23:00 domas: db9 has kernel/fs level deadlocks, mysql doesn't like that, controller? RMA? drivers? Rob? :)
23:00 domas: db7 had replication failure, due to commonswiki profiling/hitcounter touches (some purge script?)
20:35 brion: srv39-41 had wrong sudoers files after recent reconfig; caused apache-graceful-all to hang prompting for passwords. manually recopied file; setup-apache script looks like it should have copied the files originally...
19:20 brion: srv141 power-cycled and back up.
19:16 brion: srv141 hanging, gonna poke it
12:19 Rob: Rebooted srv81, it appears to be back online.
12:12 Rob: Shutdown sq2, disk 4 replaced and brought back online.

March 22

19:05 JeLuF: bootstrapped srv150 as apache, added to the farm
17:42 Tim: moving srv21 and srv25 from search pool 1 to search pool 2, returning srv39 to the apache pool.
02:12 Tim: srv81 is mostly down, removed from ext store rotation
00:06 Tim: moved srv40-41 back to apache service

March 21

22:30- Tim: moving srv21-30 to lucene service
21:25 Tim: moved cluster1 and 2 to a new merged external storage cluster on srv96-98.
20:25 Tim: srv31 down. Unmounted /var/backup/public/mnt/srv31 on benet.
19:21 brion: lighty on benet was mysteriously dead for a bit. very slow to kill. killed it, restarted. srv31 seems stuck.
14:40 brion: syncing apache configurations and restarting apaches. at least one machine had old remnant.conf and possibly other bad configs, breaking office wiki intermittently

March 20

23:00 domas: user attempted to delete sandbox on enwiki... :)

March 18

22:18 brion: srv81 ssh broken and http connections hang; removed from LVS manually. srv131 has read-only filesystem but has no load

March 16

15:45 Tim: reassigning srv21-24 from apache to search

March 15

15:52 mark: Disabled all spanning tree on asw2 and asw3
15:27 Rob: Moved srv152 back to its normal rack PDU. Booted it online.
14:41 Rob: New storage1 racked and installed with FC4.
- I am not sure if that is the OS we want in the end, since storage2 runs Ubuntu, I just wanted to test the install and make sure it went at a reasonable speed, and it did.
14:21 Rob: Placed replacement drives in servers adler and ariel. Did not add to array.
14:20 brion: set up blank zh.planet.wikimedia.org
06:39 river: added a prefix-list (pm-in) on incoming BGP routes to deny 10.0.0.0/8 routes
06:18 jeluf: after bw sealed a leaking route announcement (10.0.0.0/30), ariel and suda are reachable again. Restored watchlist config.
06:10 jeluf: ariel died, removed the enwikiWatchlistServers section from the config
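
The 06:39 and 06:18 entries above describe filtering incoming routes so that a leaked 10.0.0.0/30 announcement is no longer accepted. A minimal Python sketch of that check, assuming the filter is meant to reject 10.0.0.0/8 and every more-specific prefix inside it (the log does not show the actual prefix-list syntax; accept_route and the sample routes are hypothetical):

```python
import ipaddress

# Assumption: deny 10.0.0.0/8 itself and any more-specific prefix within it.
DENIED = ipaddress.ip_network("10.0.0.0/8")

def accept_route(prefix: str) -> bool:
    """Return False for routes that fall inside the denied range."""
    net = ipaddress.ip_network(prefix)
    # subnet_of() raises TypeError across IP versions, so compare versions first.
    return not (net.version == DENIED.version and net.subnet_of(DENIED))

# Sample announcements; 192.0.2.0/24 is a documentation prefix used for illustration.
routes = ["10.0.0.0/30", "192.0.2.0/24", "10.0.0.0/8"]
accepted = [r for r in routes if accept_route(r)]
print(accepted)
```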

March 14

19:53 brion: planet.wikimedia.org content moved to en.planet.wikimedia.org
- redirects for feed, link for page
19:41 brion: adding some language subdomains for planet
19:06 Tim: Pybal was flapping the search backends due to timeouts. Disabled ProxyFetch for now, it can just use IdleConnection.
- The dual processor servers are only 50% utilised (if that) due to excessive thread synchronization
18:14 Tim: Switched search to use LVS instead of perlbal. Pybal on diderot.
13:37 brion: restarted leuksman.com apache, some mystery prob

March 13

23:38 mark: sq2 broke with a disk failure, killed squids
09:57 mark: Removed siblings for text cache squids, hit rates were < 2%. Upload squids have hit rates up to ~25%, so keeping it enabled there for now.
01:30 Tim: moved commons thumbnails back to bacon. Amane was toast, average service time blowing out to 4 seconds, and nobody noticed. It's OK now that it's off peak, but obviously bacon is better for this task.

March 12

16:17 Tim: moved commons thumbnail serving back to amane

March 11

22:43 mark: Set negative_ttl to 5 minutes for upload squids, so they can cache 404s and other errors. Negative cache entries for 404s are purged as usual; hopefully the other kinds of errors don't cause problems.
18:45 brion: migrating some old files from benet -> amane; out of space again.
17:28 brion: db7 replication is broken; it's trying to replicate updates from commonswiki from its master, db2, however there is no commonswiki db on db7. I'm not sure how these are supposed to be set up, so...
11:40 Tim: enabled UsernameBlacklist everywhere
00:50 Tim: Finished phasing in CARP configuration. Moved CARP config to /home/wikipedia/conf/squid.

March 10

21:35 Tim: starting to phase in the CARP-based squid configuration. Starting with knsq1.
7:50 jeluf: stopped db5, copying its DB to adler

March 9

6:30 jeluf: bootstrapped adler, running mkfs.jfs -c on sda3, will need some time to finish
00:03 Rob: Adler is re-installed sans 1 disk. Replacement disk is being taken care of.
- Please bring Adler online as the main database server, as Ariel and Samuel need to be powered down and re-racked with rails that are on hand for them.

March 8

21:20ish -- they came back online. a circuit breaker flipped, was reset
21:05ish -- several servers in srv6x range dead, probably circuit breaker
18:11 brion: adding planet.wikimedia.org to dns; setting up...
14:35 Tim: phasing in remerged squid configuration

March 7

19:45 brion: private SVN set up on net instead of just my laptop
18:00 Tim: upgrading apaches to APC 3.0.13
17:45 Tim: crashed srv43, kernel GPF due to oprofile
17:03 brion: upgraded subversion server to 1.4.3

March 6

19:41 Tim: set an idle timeout of 30 minutes in bfr for the search index rebuild. This will hopefully stop it hanging without killing any valid jobs.

March 5

20:30 brion: rebooting srv130 via ipmi; filesystem was read-only

March 3

17:28 Tim: deploying new APC to all apaches
04:51 Tim: Came up with a workaround for the segfault on exit bug, experimentally installed patched APC on srv43 and srv88. The problem is a glibc bug involving linking to libraries with DF_1_NODELETE, such as librt and libpthread. APC was not actually using librt, so I removed it from the link.

March 2

04:34 Tim: We were still getting a large number of file size exceeded signals in the syslog. I eventually patched the bulk of the live code to check the file size before attempting an error log write. That seems to have fixed it. Will commit eventually.
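
The idea in the entry above (check the file size before attempting an error log write) can be sketched roughly as follows. This is an illustration in Python, not the patch that was applied to the live code; the limit, the function name safe_log_write and the file name are all assumptions.

```python
import os
import tempfile

# Assumed limit; the log entry does not state the actual threshold.
MAX_LOG_BYTES = 2**31 - 1

def safe_log_write(path, message, max_bytes=MAX_LOG_BYTES):
    """Append message to path unless the file would grow past max_bytes.

    Returns True if the line was written, False if it was skipped.
    """
    data = (message.rstrip("\n") + "\n").encode("utf-8")
    try:
        current = os.path.getsize(path)
    except FileNotFoundError:
        current = 0  # file does not exist yet
    if current + len(data) > max_bytes:
        return False  # skip the write instead of hitting the size limit
    with open(path, "ab") as fh:
        fh.write(data)
    return True

# Small demonstration with a tiny limit and a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "error.log")
    first = safe_log_write(log_path, "connection failed", max_bytes=32)
    second = safe_log_write(log_path, "x" * 64, max_bytes=32)
    size_after = os.path.getsize(log_path)
```

Skipping the write loses the log line, but it avoids the process receiving a file-size-exceeded signal in the middle of a request.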

March 1

15:43 Tim: rotated dberror.log, rapidly filling with database selector related errors.
15:37 Tim: brought srv129 into rotation
15:28 Tim: brought srv83 into rotation
15:25 Tim: brought srv86 into rotation
15:07 Tim: started jobs-daemon on srv98 and srv99

February 28

16:25 Tim: tweaked setup-apache, installed full apache environment on srv152 (test Xeon)

February 26

18:15 Tim: upgraded squid to 2.6.9-1wm4
17:20 jeluf: srv142 not responding to ssh, removed from the load balancing. Added back srv147.
15:40 Tim: did master split. New masters are thistle and db8, as described here.
15:16 mark: "Fixed" PyBal with a dirty temporary hack. Will fix cleanly tonight.
14:46 mark: Brought up iris as new load balancer, statically as PyBal thinks all squids are down due to some PartialDownloadError. Moved traffic back to knams as pmtpa was overloaded
14:00 mark: pascal behaved odd, power cycling didn't help. Switched DNS to scenario KNAMS-down.

February 25

23:39 Tim: adapted cluster provision in refresh-dblist to make cluster lists for the DB masters
Tim: db9 keeps freezing during query execution. Recompiling mysql didn't help. It can't be put into rotation like this.
13:00-late Tim: setting up new master split.
11:00 jeluf: set up replication on db8, started replication on db5, added db5 back into the LB pool
00:54 Tim: reopened access to frwikiquote backup dumps

February 24

22:28 Tim: stopping slave on db5 for mysqldump to db8
22:20 Tim: running mkfs.jfs on adler, with bad block check option
21:41 Rob: Shutdown, added rails, and re-racked Adler
20:36 Rob: Rebooted adler and set the correct root password.

February 23

12:02 mark: Reinstalled isidore with Edgy, is ready to go.
02:23 Tim: db5 and thistle back in rotation
02:00 Tim: copy complete, db5 and thistle catching up
00:25 Tim: copying data directory from db5 to thistle
00:20 Tim: wiped /a on thistle, was corrupt

February 22

23:39 Tim: took thistle out of rotation. Can't mount /a, syslog reports bus error.
22:46 mark: isidore got reinstalled wrongly, will fix it in a few days
22:29 brion: setting up rejuvenated yongle to replace the old isidore stuff
21:54 Rob: yongle reinstalled with ubuntu.
21:36 Rob: Isidore back online, just reseated powercord and it boots fine.
20:07 Rob: Adler back online, needs /a filesystem created and synced. Using only onboard disks.
18:30 tim: switched master for non-enwiki to samuel
18:15 jeluf, tim: adler down. Tim prepares a master switch, jeluf tries to reboot adler.
10:33 brion: fiddling with dumps
00:58 Tim: starting upgrade to squid 2.6.9-1wm3 everywhere
10:27 domas: increased idle transaction timeout to 600s, due to apparently existing slow requests inside cluster.

February 21

19:37 brion: taking over yongle to replace dead isidore
19:12 mark: Fixed an issue on the mailing lists server where list wikifi-admin was not recognized due to the -admin / -owner / etc. suffixes.
15:43 Tim: piloting squid 2.6.9-1wm3

February 20

23:44 brion: leuksman.com apache down for a while, 'couldn't grab the accept mutex' o_O
08:02 brion: confirmed presence of bad cached entries showing anon page views to logged-in users (due to buggy Vary: Accept-Encoding header being present for a few hours, overriding the Vary: Accept-Encoding, Cookie)
06:27 mark: Uplink upgrade succeeded, restoring knams. Also putting more countries back on knams, will gradually add more during the day.
05:12 mark: DNS scenario knams-down in preparation of the uplink upgrade.
02:40 brion: srv137 up, re-synced its files. a happy ending powered by ipmi
02:37 brion: power-cycling srv137... ipmi actually works on it \o/
02:35 brion: srv137 complaining of read-only filesystem; slow or broken logins
02:25 brion: Special:Export was broken under the new scheme; the buffer reset code didn't correctly handle the new handler. Should be fixed now in r19998
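
The 08:02 entry above describes cached anonymous page views being served to logged-in users because the Vary header listed only Accept-Encoding. A simplified Python model of why that happens, assuming a cache that keys entries on the URL plus the request headers named in Vary (this is not Squid's actual implementation; cache_key, the URL and the cookie value are hypothetical):

```python
def cache_key(url, vary_header, request_headers):
    """Build a cache key from the URL plus the request headers named in Vary."""
    names = [h.strip().lower() for h in vary_header.split(",") if h.strip()]
    return (url,) + tuple((n, request_headers.get(n, "")) for n in names)

# Two requests for the same page: one anonymous, one with a session cookie.
anon = {"accept-encoding": "gzip"}
logged_in = {"accept-encoding": "gzip", "cookie": "session=abc"}

url = "/wiki/Main_Page"
buggy = "Accept-Encoding"
correct = "Accept-Encoding, Cookie"

# With the buggy header both requests map to the same cache entry.
collide_buggy = cache_key(url, buggy, anon) == cache_key(url, buggy, logged_in)
# With Cookie included in Vary they map to different entries.
collide_correct = cache_key(url, correct, anon) == cache_key(url, correct, logged_in)
print(collide_buggy, collide_correct)
```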

February 19

??? tim: enabled Content-Length header with the new compression buffer handler, rumored to help squid maintain persistent HTTP connections
20:16 brion: reenabling image captcha, now using subdirs for hopefully better performance
16:44 Rob: Completed PSU and MB swap for storage1, re-racked & started OS load.
14:50 Rob: Pulled srv86 to work on report that it does not have a working monitor output.
- Works Fine, re-racked.

February 18

13:00 mark: Starting rsync of amane to storage2, without thumbs.

February 17

23:55 jeluf: Started mysql on srv122
23:00 many: Avicenna (text load balancer) failed. Load balancing was moved to alrazi. avicenna's ethernet is "link up" after a reboot, but doesn't ping. Site up after 25 minutes.
22:10 Rob: Rebooted srv122
- and it came back up right away. If it dies again, please click on its name and log it.
22:00 Rob: Replaced PDUs in storage1, started netboot to reload OS.

February 16

20:39 brion: srv117 sudoers file was broken; copied in standard one, sync-common now works.
20:30 brion: srv110 and srv117 php files out of date; updated srv110 ok, 117 unhappy
16:20 brion: isidore hung for some hours; requesting reboot
- back up after a while
05:15 jeluf: took thistle out of the rotation, copying its mysql DB to db8
03:00 brion: set up http://wiktionarydev.leuksman.com/ for hippietrail to test in-development extensions for Wiktionary

February 15

22:10 Rob: replaced ram in srv26 upgraded to 4 GB and brought server online.

February 14

22:14 brion: added daily-image-l list
07:48 brion: restarted search-rebuild-loop1 on srv37, was stuck at aawiki for a long time so not updating non-en wikis. bugzilla:8979
06:30 jeluf: restarted squid on sq27 and sq30, removed binlogs 93-99 from adler.

February 13

21:59 brion: benet out of space, moving files again. probably some broken stuff. sigh.

February 12

22:40 brion: added login=PASS on cache_peer definitions for squid origin servers; apparently if you don't do this squid disobeys the HTTP spec and eats Authorization headers so you can't do any HTTP auth. nice!
20:19 Tim: hit a bug, no_cache apparently destroys the public StoreEntry, giving up for now
16:45 Tim: trying again
14:15 Tim: CARP configuration didn't work on the pmtpa upload cluster, we had a CPU overload on several servers. Reverted back to the old conf.
12:11 Tim: phasing in CARP-based squid config

February 11

23:00 jeluf: mysql copy from db1 to thistle done. restarted replication. 40'000s replication lag
15:42 mark: Reinstalled storage2, mounted a 3 TB JFS filesystem /dev/vg00/static, with ~ 600 GB free in the VG, starting rsync from Darwin. storage1 didn't come up after a reboot, which is unacceptable.
11:55 jeluf: Took db1 out of rotation, generating a mysqldump from it, copying it to thistle.

February 10

23:50ish brion: renewed fundraising.wikimedia.org SSL cert for a year
23:19 Rob: srv101 will not detect NIC. Booting off LIVE CD has same results.
22:23 Rob: thistle is online with FC5-64.
22:21 Rob: Rebooted srv101, srv110, srv134.
21:22 Rob: Rebooted srv122 per task list.
21:15 Rob: srv126 down for memtest. Failed memtest.
20:46 Rob: Created a single raid 1 on will
20:45 Rob: thistle down for hardware troubleshooting.
14:55 mark: yaseo hosts under SSH DoS, firewalled source ip
8:00 jeluf: Started copying the DB from db1 to db8 and back.

February 9

21:28 Rob: thistle: Running memtests until tomorrow evening.
20:27 Rob: srv117 back online after repairs with FC4 and working SSH, please complete installation.
20:47 Rob: Reset srv26 per Jeluf request.
20:10 Rob: thistle is offline. The cpu/mainboard is having issues and require replacement.
20:09 mark: sq11 was replaced by SM, installed and up as an upload squid
19:30 Rob: srv152 is installed and online with FC5 64. Please complete setup.
19:30 mark: albert seems to report disk trouble / bad sectors / SMART failures on its console
03:28 brion: updated leuksman.com to PHP 5.2.1

February 8

16:48 brion: resynced and restarted apaches on srv76 and srv86
16:30 Rob: replaced DRAC card in srv152, new DRAC online
16:25 Rob: Plugged in and booted srv76 & srv86 per Jeluf.
15:40 Rob: Moved srv148-srv153 down one port on switch.
14:09 mark: Reinstalled ragweed with Ubuntu Edgy, for preparation as a backup server.

February 7

23:38 brion: restarted apache on leuksman; was down
23:38 brion: enabled poem extension sitewide (was on just some wikis)
21:23 brion: texvc issues resolved now i think (/var/tmp/texvc build dir was owned by root, presumably from initial setup, and wouldn't resync properly as user)
21:03 brion: srv71 clock hopefully resolved now
srv71 has clock offset over 6 seconds
srv61, srv53, srv54, srv68, srv120, maurus cannot compile texvc during sync, probably missing dev packages.
19:19 brion: setting up il.wikimedia

February 6

15:30 Tim: started httpd on amaryllis, reconfigured to allow serving of private log data from henbane
6:00 jeluf: set up iSCSI on ragweed, connecting to the Infortrend array.

February 5

17:00 brion: httpd now on and set to start on boot on isidore (had died when machine rebooted last week)
14:50 mark: Please do not start Squid on yf1010, it has crashed and finally produced a useful coredump.

February 4

19:30 mark: Auto-installed Ubuntu Edgy on srv153

February 3

23:59 Rob: Installed FC5 on srv151
23:50 Rob: Installed FC5 on srv153
22:29 Rob: Enabled DRAC access for srv153
22:23 Rob: Installed FC5 on srv152
22:23 Rob: Enabled DRAC access for srv151
22:23 Rob: Brought Coyotepoint Extreme online via serial console port 10 for mark.
21:02 mark: Installed Ubuntu and squid on sq3, pooled it
20:00 Rob: Rebooted srv22, it had shutdown due to a heat issue.
20:00 Rob: Hooked up sq3 for mark (power, network, serial)
01:00 mark: Put all external connections (transit/peering) in VLAN 4. Apparently the Cisco switches are sending VTP multicast junk on the native vlan 1 despite explicitly being told not to. The Foundry forwards these to VLAN 1 ports, which happened to contain the route-only ports.

February 2

22:50 brion: shut down srv22, was overheating and whining a lot
22:02 brion: taking srv76 and srv86 out of rotation (old ES masters, clusters 5 and 7 not actively being written to)
21:58 brion: srv76 and srv86 down? dberror log filling up with connection loop failures

February 1

22:35 Tim: upgraded FSS on srv144, was using the old buggy version from October.
20:00ish jeluf: set up a new 8-disk RAID0 on db8. Installed Ubuntu.
18:56 mark: srv147 IPMI does not seem to work (or I am doing it wrong), depooled it on dalembert
18:15ish brion: srv147 not logging in properly, serving 404 errors; mark trying to get in and kill it
07:00 Tim: now using customised error messages on the text squids.
6:45 RobH: db10 was locked up on the console, rebooted. Seems to be online with no issues but still has an error on Nagios. Server should be responsive to DRAC.
3:00 - 6:35 RobH: Installed cables for srv151-153. Troubleshooting on sq11.

2000s

Archive 1: 2004 Jun - 2004 Sep
Archive 2: 2004 Oct - 2004 Nov
Archive 3: 2004 Dec - 2005 Mar
Archive 4: 2005 Apr - 2005 Jul
Archive 5: 2005 Aug - 2005 Oct, with revision history 2004-06-23 to 2005-11-25
Archive 6: 2005 Nov - 2006 Feb
Archive 7: 2006 Mar - 2006 Jun
Archive 8: 2006 Jul - 2006 Sep
Archive 9: 2006 Oct - 2007 Jan, with revision history 2005-11-25 to 2007-02-21
Archive 10: 2007 Feb - 2007 Jun
Archive 11: 2007 Jul - 2007 Dec
Archive 12: 2008 Jan - 2008 Jul
Archive 12a: 2008 Aug
Archive 12b: 2008 Sept
Archive 13: 2008 Oct - 2009 Jun
Archive 14: 2009 Jun - 2009 Dec
