Incidents/2019-09-23 s3 primary db master crash

From Wikitech

document status: final

Summary

The s3 primary database master had a RAID backup battery (BBU) failure which caused the host to crash completely. It had to be power cycled from the idrac.

Impact

All the s3 wikis (https://raw.githubusercontent.com/wikimedia/operations-mediawiki-config/master/dblists/s3.dblist) went read-only as the master wasn't available for writes. Reads were not affected, as all the replicas remained available.

Detection

  • The problem was clear when we saw that db1075 reported HOST DOWN - however, that only sends an IRC alert, not a page. Masters should probably page for HOST DOWN (a sketch of that routing rule follows this list).
  • Alerts were sent to IRC and pages.
  • Users reported issues on #wikimedia-operations
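
The first bullet points at an alerting gap: HOST DOWN on a database master only produced an IRC notification. The snippet below only illustrates the proposed routing rule (masters page on HOST DOWN, other hosts keep IRC-only alerts); the host list and function are hypothetical, and the real change would live in the Icinga/puppet alerting configuration.

  # Illustration only: HOST DOWN on a database master should page, not just
  # notify IRC. Names are hypothetical; the real rule belongs in the
  # Icinga/puppet alerting configuration.
  DB_MASTERS = {"db1075"}  # e.g. the s3 master at the time of this incident

  def notification_channels(host, alert):
      channels = ["irc"]
      if alert == "HOST DOWN" and host in DB_MASTERS:
          channels.append("page")  # masters page on HOST DOWN
      return channels

  assert notification_channels("db1075", "HOST DOWN") == ["irc", "page"]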

Timeline

All times in UTC.

  • 18:40 OUTAGE BEGINS
Sep 22 18:40:38 db1115 mysqld[1630]: OpenTable: (2003) Can't connect to MySQL server on 'db1075.eqiad.wmnet' (110 "Connection timed out")
  • IRC logs from #wikimedia-operations:
18:42:45 <icinga-wm> PROBLEM - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100%
18:47:03 <AntiComposite> I'm getting a warning on otrs-wiki about a high replag database lock
18:47:40 <AntiComposite> It's also slower than usual
18:48:52 <+icinga-wm> PROBLEM - MariaDB Slave IO: s3 on db2105 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
18:49:22 <+icinga-wm> PROBLEM - MariaDB Slave IO: s3 on dbstore1004 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
18:49:58 <+icinga-wm> PROBLEM - MariaDB Slave IO: s3 on db1095 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
<More alerts arriving - not pasting them all here>
18:55:51 <marostegui> I'm connecting
  • 18:56 First SMS page sent
  • 19:02 Host rebooted from the idrac after confirming it was a BBU issue and the host was not responsive
  • 19:03 MySQL started and began InnoDB recovery
  • 19:04 Manual puppet run executed (but it failed and the failure went unnoticed)
  • 19:06 MySQL finishes recovery
  • 19:07 Manually set read_only=OFF on db1075
  • 19:08 Lag still being reported
  • 19:16 Second manual puppet run succeeded and pt-heartbeat started (see the sketch after this timeline)
  • 19:16 OUTAGE ENDS
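
The recovery steps above (confirming MySQL is answering queries again, re-enabling writes, and checking that pt-heartbeat is updating) can be sketched roughly as below. This is a minimal illustration using the pymysql library with placeholder credentials and the standard pt-heartbeat table layout; it is not the exact tooling used during the incident.

  # Minimal sketch of the manual recovery checks, assuming pymysql and the
  # standard pt-heartbeat schema (heartbeat.heartbeat with a ts column).
  # Credentials are placeholders, not the actual production setup.
  import pymysql

  MASTER = "db1075.eqiad.wmnet"  # the s3 master that crashed

  conn = pymysql.connect(host=MASTER, user="admin", password="...",
                         connect_timeout=5)
  with conn.cursor() as cur:
      # 1. Confirm InnoDB recovery finished and the server answers queries.
      cur.execute("SELECT 1")

      # 2. Re-enable writes (the master came back with read_only=ON).
      cur.execute("SET GLOBAL read_only = OFF")

      # 3. Check that pt-heartbeat is writing again; a stale timestamp here is
      #    what kept replicas reporting lag after the master was back up.
      cur.execute("SELECT MAX(ts) FROM heartbeat.heartbeat")
      print("latest heartbeat:", cur.fetchone()[0])
  conn.close()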

Conclusions

The master lost its BBU and that resulted in a complete host crash, something that has been seen before with HP hosts: https://phabricator.wikimedia.org/T231638 https://phabricator.wikimedia.org/T225391

The master being unavailable means that writes cannot happen: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s3&var-role=master&from=1569174586143&to=1569181181947

s3 master queries, reflecting the impact of the crash

This host is part of a batch of 6 servers, and 3 of them have already had BBU issues (https://phabricator.wikimedia.org/T233569), so we need to evaluate what to do with them next. The immediate next step is to replace the current master by promoting another host that is not part of that batch: https://phabricator.wikimedia.org/T230783

What went well?

  • Alerts worked fine
  • Rebooting the host from the idrac was successful
  • MySQL came back clean

What went poorly?

  • A race condition between pt-heartbeat being started via puppet and MySQL not yet being fully up made the first puppet run fail (which went unnoticed). This resulted in lag being reported while everything was actually up, extending the outage by about 8 minutes until the second manual puppet run was done (see the sketch after this list).
  • A BBU failure shouldn't result in a complete host crash (but we have seen that before with HP hosts)
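
One way to avoid that race is to wait until MySQL is actually answering queries before starting pt-heartbeat. The sketch below is only an illustration of that idea, with placeholder credentials; in production pt-heartbeat is managed via puppet, so the real fix would live there.

  # Illustrative guard against the pt-heartbeat race: wait until MySQL answers
  # a trivial query before starting the heartbeat writer. Credentials and the
  # exact pt-heartbeat invocation are placeholders.
  import subprocess
  import time

  import pymysql

  def wait_for_mysql(host, timeout=300):
      """Return once the server accepts connections and answers SELECT 1."""
      deadline = time.time() + timeout
      while time.time() < deadline:
          try:
              conn = pymysql.connect(host=host, user="admin", password="...",
                                     connect_timeout=5)
              with conn.cursor() as cur:
                  cur.execute("SELECT 1")
              conn.close()
              return
          except pymysql.err.OperationalError:
              time.sleep(5)  # still starting up / in InnoDB recovery, retry
      raise TimeoutError(f"{host} not ready after {timeout}s")

  wait_for_mysql("db1075.eqiad.wmnet")
  subprocess.run(["pt-heartbeat", "--update", "--daemonize",
                  "--database", "heartbeat"], check=True)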

Where did we get lucky?

  • The master was able to come back after the hardware issue. We had to restart it via the idrac, but it came back clean; otherwise, we'd have needed to do a full master failover manually to promote a replica to master (see the sketch after this list).
  • Volunteers and staff noticed the failure even before the alerts caught it
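
Had db1075 not come back, promoting a replica would, in broad strokes, have looked like the sketch below: make sure the candidate has applied everything it received, stop replication on it, and open it for writes. This is a heavily simplified illustration with a hypothetical candidate host; the actual switchover procedure also repoints the remaining replicas and updates the MediaWiki configuration before pooling the new master.

  # Heavily simplified master-promotion sketch; the candidate hostname is
  # hypothetical and the real procedure involves more steps (repointing the
  # other replicas, updating mediawiki-config, moving pt-heartbeat, etc.).
  import pymysql

  candidate = pymysql.connect(host="db1123.eqiad.wmnet",  # hypothetical host
                              user="admin", password="...", connect_timeout=5)
  with candidate.cursor() as cur:
      # Ensure the candidate has applied all the replication events it has
      # received (e.g. check Seconds_Behind_Master / relay log position).
      cur.execute("SHOW SLAVE STATUS")

      # Stop replication, forget the old master, and open the host for writes.
      cur.execute("STOP SLAVE")
      cur.execute("RESET SLAVE ALL")
      cur.execute("SET GLOBAL read_only = OFF")
  candidate.close()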

How many people were involved in the remediation?

  • 1 DBA, 2 SREs, 1 WMCS, 1 Dev, 1 Volunteer

Links to relevant documentation

Actionables