document status: in-review
On Sunday the 5th at 11:22 UTC, the primary hard drive of cr3-eqsin (one of the two Singapore POP routers) failed. This caused the router to reboot from its second disk, which contains only a factory-default configuration. Everything failed over cleanly to the redundant router.
Impact: We lost at most ~15,000 requests/s over a 7-minute window (see screenshot and graph).
All times in UTC.
- 11:22 PROBLEM - Host cr3-eqsin is DOWN: PING CRITICAL - Packet loss = 100% (paging) OUTAGE BEGINS
- 11:25 SREs report connectivity issues to eqsin (too brief to trigger alerting)
- 11:27 Routing has finished converging, no more reports of connectivity issues OUTAGE ENDS
- 11:35 DNS patch ready to depool eqsin (just in case, unused) - https://gerrit.wikimedia.org/r/c/operations/dns/+/609571/
- ~07:40 Router is brought back up on its backup disk REDUNDANCY RESTORED
- Was automated monitoring first to detect it? Yes
- Did the appropriate alert(s) fire? Yes
- PROBLEM - Host cr3-eqsin is DOWN: PING CRITICAL - Packet loss = 100% (paging alert)
- Was the alert volume manageable? Yes, only relevant alerts fired
- Did they point to the problem with as much accuracy as possible? Yes: the router went down, and only the router-down paging alert fired
- This outage showed that our hardware redundancy and failover are solid
- Juniper recently introduced a new feature, `vmhost snapshot`, that would have prevented the lack of redundancy (but not the crash itself)
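As a sketch, on vmhost-based routing engines a snapshot copies the primary disk onto the backup disk, so a primary-disk failure boots a recent configuration rather than factory defaults. Exact syntax and availability vary by platform and Junos release; verify against Juniper's documentation before relying on it:

```
{/* Hypothetical operator session; prompt and output are illustrative */}
user@cr3-eqsin> request vmhost snapshot
```

Running this after each confirmed-good configuration change would keep the backup disk close to the active state.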
What went well?
- Everything failed over as expected to the redundant router
What went poorly?
Where did we get lucky?
How many people were involved in the remediation?
- 2 SREs investigated the issue; multiple additional SREs responded to the page