Incidents/2020-05-28 commons read-only

document status: in-review

Summary

The s4 primary database master (db1138) had a hardware memory (DIMM) failure; the mysqld process crashed and came back as read-only for 8 minutes.

Impact: commonswiki did not accept writes for 8 minutes. Reads remained unaffected.

Timeline

All times in UTC.

  • 01:33 First signs of issues with the DIMM are logged in the host's iDRAC
  • 20:21 mysqld process crashes. OUTAGE BEGINS
  • 20:24 First page arrives: PROBLEM - MariaDB read only s4 #page on db1138 is CRITICAL: CRIT: read_only: True, expected False:
  • 20:24 Several SREs and a DBA start investigating; the problem is quickly identified as a memory DIMM failure that crashed the mysqld process
  • 20:28 <@marostegui> !log Decrease innodb poolsize on s4 master and restart mysql
  • 20:29 MySQL comes back and read_only is manually set to OFF
  • 20:29 OUTAGE ENDS

Conclusions

This was a hard-to-avoid crash: a hardware failure on a memory DIMM.

Masters start as read-only by default, to avoid letting more writes go through after a crash until we are fully sure the data and the host are OK and the host is still able to take the master role.
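
To make the recovery concrete, here is a minimal sketch of the manual post-restart steps, assuming a pymysql client and placeholder host/credentials rather than the actual tooling used: the buffer pool was lowered beforehand (per the timeline), presumably so mysqld could run within the memory remaining after the DIMM failure, and read_only is only switched off once a human has verified the host.

    # Minimal sketch of the manual post-restart steps; assumes pymysql and
    # placeholder connection details, not the actual WMF tooling.
    import pymysql

    conn = pymysql.connect(host="db1138.example", user="root",
                           password="...", autocommit=True)
    with conn.cursor() as cur:
        # mysqld was restarted with a smaller InnoDB buffer pool (lowered
        # beforehand, per the timeline) so it fits in the remaining memory.
        cur.execute("SELECT @@global.innodb_buffer_pool_size, @@global.read_only")
        buffer_pool, read_only = cur.fetchone()
        print(f"buffer pool: {buffer_pool} bytes, read_only: {read_only}")

        # Only once data and host are confirmed healthy is read_only switched
        # off, ending the write outage.
        if read_only:
            cur.execute("SET GLOBAL read_only = OFF")
    conn.close()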

We did see traces of the issue in the iDRAC's error log; if we could alert on those, we might have been able to perform a master failover before the host crashed. If this crash had happened on a slave, the impact would not have been as big, as slaves are read-only by default and MW would have depooled the host automatically.
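
As a rough sketch of what such alerting could look like, assuming ipmitool (or an equivalent iDRAC/IPMI client) is available on the host; the keyword matching below is purely illustrative and this is not the actual alerting pipeline:

    # Hypothetical check that scans the management controller's System Event
    # Log for memory/DIMM errors; not the actual WMF alerting pipeline.
    import subprocess
    import sys

    KEYWORDS = ("memory", "dimm", "ecc", "uncorrectable")

    def memory_related_events():
        # "ipmitool sel list" prints the System Event Log, one entry per line.
        out = subprocess.run(["ipmitool", "sel", "list"],
                             capture_output=True, text=True, check=True).stdout
        return [line for line in out.splitlines()
                if any(k in line.lower() for k in KEYWORDS)]

    events = memory_related_events()
    if events:
        print("CRITICAL: memory-related hardware events found:")
        print("\n".join(events))
        sys.exit(2)  # Icinga/Nagios CRITICAL, i.e. page
    print("OK: no memory-related hardware events")
    sys.exit(0)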

What went well?

  • The read_only alert that pages when a master comes back as read-only worked well (this was an actionable from a similar previous issue); a simplified sketch of such a check follows this list
  • Lots of people made themselves available to help troubleshoot, from both the EU (already late UTC evening) and the US (across different time zones)
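
The sketch below shows the idea behind that check: compare the server's actual read_only flag against what is expected for its role and page when they differ. It assumes pymysql and placeholder connection details; it is not the real Icinga plugin.

    # Simplified read_only check: exit CRITICAL when the actual value differs
    # from what the host's role expects. Hypothetical host/credentials.
    import sys
    import pymysql

    EXPECTED_READ_ONLY = False  # a pooled master is expected to be writable

    conn = pymysql.connect(host="db1138.example", user="monitor", password="...")
    with conn.cursor() as cur:
        cur.execute("SELECT @@global.read_only")
        actual = bool(cur.fetchone()[0])
    conn.close()

    if actual != EXPECTED_READ_ONLY:
        print(f"CRITICAL: read_only: {actual}, expected {EXPECTED_READ_ONLY}")
        sys.exit(2)
    print(f"OK: read_only: {actual}")
    sys.exit(0)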

What went poorly?

  • We had signs of this memory failing in the iDRAC's hardware log around 19 hours before the actual crash. If we had alerted on those, we could have scheduled an emergency failover and prevented this incident.

Where did we get lucky?

  • The host and the data did not come back corrupted; otherwise we would have needed to perform a master failover right then

How many people were involved in the remediation?

  • 1 DBA, although many SREs made themselves available in case they were needed.

Links to relevant documentation

Actionables