Incident documentation/20140619-parsercache/IRC log

From Wikitech
Jump to: navigation, search

IRC conversion on #wikimedia-operations during the 2014-06-19 pc1002 outage. All times are UTC.

Trimming and annotations made by Antoine "hashar" Musso.

[19:00:19] <logmsgbot>	 !log reedy Finished scap: scap 1.24wmf10 take 2... (duration: 22m 59s)
[19:00:24] <morebots>	 Logged the message, Master
[19:03:05] <grrrit-wm>	 (PS1) Reedy: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769 
[19:03:51] <grrrit-wm>	 (CR) Reedy: [C: 2] Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769 (owner: Reedy)
[19:04:28] <grrrit-wm>	 (Merged) jenkins-bot: Wikipedias to 1.24wmf9 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140769 (owner: Reedy)
[19:06:27] <logmsgbot>	 !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf9
[19:09:15] <grrrit-wm>	 (PS1) Reedy: group0 to 1.24wmf10 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140770 
# And here PC1002 is reported dieing:
[19:09:51] <icinga-wm>	 PROBLEM - check configured eth on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:01] <icinga-wm>	 PROBLEM - Disk space on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:01] <icinga-wm>	 PROBLEM - check if dhclient is running on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:11] <icinga-wm>	 PROBLEM - MySQL disk space on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:11] <icinga-wm>	 PROBLEM - mysqld processes on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:11] <icinga-wm>	 PROBLEM - DPKG on pc1002 is CRITICAL: Timeout while attempting connection  
[19:10:11] <icinga-wm>	 PROBLEM - RAID on pc1002 is CRITICAL: Timeout while attempting connection  

# First users report (icinga still beat them by a few seconds yeah!)
[19:10:26] <Excirial>	 Fyi: Just received several 503's on Enwiki.
[19:10:36] <twkozlowski>	 same on Commons, peeps are reporting
[19:11:13] <jorm>	 all wikis.
[19:11:31] <Reedy>	 Sweet

[19:11:41] <icinga-wm>	 PROBLEM - Host pc1002 is DOWN: PING CRITICAL - Packet loss = 100%  
[19:11:48] <Reedy>	 And that'll be why
# Root cause analysis performed by Reedy in less than 30 seconds.

# Then apaches dies...
[19:11:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[19:11:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[19:11:51] <icinga-wm>	 PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
....

# Debugging time!
[19:12:27] <RobH>	 checking pc1002
[19:12:56] <RobH>	 seems its borked completely and unresponsive to ssh
[19:12:59] <RobH>	 checking mgmt
[19:13:28] <mutante>	 oom death
[19:13:32] <mutante>	 powercycle
[19:13:35] <RobH>	 im rebooting it
# Indeed, not much we can do since the box is dead.

[19:13:36] <ori>	 Reedy: i'd revert
[19:13:38] <ori>	 before it boots up
# If there is something we do properly it is reverting. Proposed less than 2
# minutes after the cause is acked.
[19:14:20] <ori>	 Reedy: it's going to boot with a cold, empty memcached
[19:14:26] <ori>	 if it OOMd there's a good chance it's related to the deploy
[19:14:37] <Reedy>	 isn't it mysql?
[19:14:52] <ori>	 better start populating the cache using known-good mw code

# And reverting...
[19:15:16] <logmsgbot>	 !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias back to 1.24wmf8

[19:15:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time  
[19:15:31] <icinga-wm>	 RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.073 second response time  
[19:15:48] <RobH>	 !loh powercycled pc1002
[19:16:00] <ori>	 coming back up
# It came back because of the revert.  At that point mysql was not running on pc1002.
[19:16:36] <RobH>	 it will be a few minutes

[19:22:10] <Reedy>	 Good to see https://wikitech.wikimedia.org/wiki/Parser_cache is up to date
# points at bug 1. We have no documentation at all for the parser cache.

[19:22:32] <Reedy>	 146 Fatal error: Call to a member function real_escape_string() on a non-object in /usr/local/apache/common-local/php-1.24wmf9/includes/db/DatabaseMysqli.php on line 289

[19:22:41] <icinga-wm>	 RECOVERY - Host pc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms  
[19:22:41] <mutante>	 pc1002 login: 
# pc1002 finally back up


[19:29:46] <RobH>	 its back up
[19:29:51] <RobH>	 well, its responsive to ssh now
[19:29:54] <RobH>	 (im on it)
[19:30:07] <_joe_>	 RobH: is mysql up?
[19:31:04] <RobH>	 nope, wont fire
[19:31:12] <RobH>	 _joe_: if you know about the server, hop on ;]
[19:31:48] <_joe_>	 I don't even know wich one it is
[19:31:52] <ori>	 pc1002
[19:31:56] <_joe_>	 I can try to troubleshoot mysql not starting
[19:31:58] <mutante>	 !log started mysql on pc1002
[19:31:59] <springle>	 wtf
[19:32:01] <icinga-wm>	 RECOVERY - mysqld processes on pc1002 is OK: PROCS OK: 1 process with command name mysqld  
[19:32:03] <springle>	 what happened
[19:32:04] <morebots>	 Logged the message, Master

# Summary for springle (DBA)
[19:32:28] <ori>	 springle: wmf10 to all wikis -> pcs1002 unresponsive -> apache busy workers -> 503s

[19:32:29] <mutante>	 RobH: /etc/init.d/mysql start
[19:32:31] <RobH>	 cuz normal upstart didnt do it
# That should be fixed.

[19:32:43] <ori>	 springle: pcs1002 hard reboot, mysql on pcs1002 did not come up
[19:32:48] <mutante>	 it first said something about the pid file existing
[19:32:49] <ori>	 springle: that's where we're at
[19:32:50] <mutante>	 but it's not set
[19:33:15] <chasemp>	 ori springle then mutante did something not mysql is up there :)
[19:33:19] <mutante>	 springle: 
[19:33:21] <mutante>	 root@pc1002:~# /etc/init.d/mysql status * MySQL is not running, but PID file exists
[19:33:27] <mutante>	 root@pc1002:~# /etc/init.d/mysql status * MySQL is not running

[19:33:33] <bd808>	 ori, springle: looks like it may have started with many "DBError: DB connection error: Too many connections (10.64.16.157)" events
[19:33:35] <springle>	 mysql is recovering
[19:33:35] <mutante>	 root@pc1002:~# /etc/init.d/mysql status * MySQL running (6031)

[19:34:36] <Reedy>	 springle: I told you we needed to replace it with MongoDB
[19:35:23] <springle>	 Reedy: :)


[19:35:44] <ori>	 so,  DB connection error: Too many connections
[19:36:18] <ori>	 on rollout of new release
[19:36:19] <ori>	 is consistent with a run on the parser cache DBs due to massive cache invalidation 
[19:36:31] <Reedy>	 First time we've seen this...
[19:36:35] <Reedy>	 Or at least, in a long time anyway
[19:36:37] <_joe_>	 looking at the logs... mysql server became unresponsive/slow, connection piled up, this blocked workers in apache 
[19:36:56] <_joe_>	 probably this has happened before any failover mechanism kicked in
[19:37:06] <robla>	 Reedy: yeah....we've had at least one parser cache outage that almost went unnoticed, so this is a weird new development
[19:37:08] <bd808>	 sql-bagostuff events in logstash -- https://logstash.wikimedia.org/#dashboard/temp/S9cjGaUCRriwaPpk9xNMCw
....