Incidents/2024-09-28 cr2-eqsin down

document status: draft

Summary

Incident metadata (see Incident Scorecard)

  • Incident ID: 2024-09-28 cr2-eqsin down
  • Start: 20:30
  • End: 22:13
  • Task: T375961
  • People paged: 31
  • Responder count: 3
  • Coordinators: Sukhbir Singh
  • Affected metrics/SLOs:
  • Impact: Connectivity to the sites for all users from eqsin (Asia region).


A core router (cr2-eqsin) in eqsin went down due to a hardware failure.

Recovery was slowed by an additional unexpected event: on reboot, the router loaded a very outdated configuration.

Timeline

20:24 core router cr2-eqsin in [[eqsin]] goes down
20:25-20:26 monitoring starts sending alerts about the device being down, OSPF status, failed RIPE Atlas probes, and others
20:27-20:29 2 SREs start responding / getting to laptops
20:30 eqsin was depooled
20:45 hardware failure of disk 0 identified
21:08 incident is officially opened, Sukhbir becomes IC
21:12 eqsin repooled to test recovery
21:15 BGP is down to all the LVS hosts; pybal is restarted in an attempt to mitigate it
21:16 recovery failed, eqsin depooled again
21:25 discovery that the router rebooted with an old config
21:?? previous config is restored
22:13 eqsin repooled, recovery successful, incident marked as resolved

Graphs

SAL

Detection

20:24:51 <+icinga-wm> PROBLEM - Host cr2-eqsin.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
20:24:52 <+icinga-wm> PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
20:24:52 <+icinga-wm> PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
20:25:12 <+jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 -
                	https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
                	https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
20:25:53 <+icinga-wm> PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 68, down: 3, dormant: 0, excluded:
                	0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
20:26:37 <+icinga-wm> PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP
                	https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
20:26:49 <+icinga-wm> PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP
                	https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
20:27:05 <+icinga-wm> PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 439 probes of 783 (alerts on 35) -
                	https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
                	https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
...
20:29:53 <+icinga-wm> RECOVERY - Host cr2-eqsin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 222.76 ms

Conclusions

  • Coverage on a weekend in late hours can be spotty.

What went well?

  • Given the hour and the circumstances, access for users in Asia was restored in a reasonable amount of time by depooling the site.

What went poorly?

  • The router unexpectedly loaded an outdated rescue config which increased downtime.

Where did we get lucky?

  • Netops reacted even though it was very late and after hours for them.


Actionables

We need to prevent routers from loading outdated rescue configs on reboot, one way or another.

One option considered is to refresh the rescue config (`request system configuration rescue save`) on every Homer commit, so the saved copy always matches the most recently committed configuration.

Another option is to write a cookbook that refreshes it regularly across all routers.

It is unfortunately not possible to configure the routers to simply never load the rescue config.
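As a rough illustration of the first option, here is a minimal sketch using the junos-eznc (PyEZ) library; the hostname list, credential handling, and the hook point into Homer are assumptions for illustration, not the actual integration:

<syntaxhighlight lang="python">
# Minimal sketch, not the actual Homer/cookbook integration: refresh the
# Junos rescue configuration on a router so that a reboot cannot fall back
# to a stale config. Hostnames and credentials are illustrative assumptions.
from jnpr.junos import Device  # junos-eznc (PyEZ)


def refresh_rescue_config(hostname: str, user: str) -> None:
    """Run `request system configuration rescue save` on one router."""
    with Device(host=hostname, user=user) as dev:
        # Overwrites the rescue config with the currently committed one,
        # equivalent to running the command from the Junos CLI.
        dev.cli("request system configuration rescue save", warning=False)


if __name__ == "__main__":
    # Hypothetical router list; in practice this would come from Netbox or
    # the Homer configuration rather than being hard-coded.
    for router in ("cr2-eqsin.wikimedia.org",):
        refresh_rescue_config(router, user="rescue-refresh")
</syntaxhighlight>

Whether this runs as a post-commit step in Homer or as a scheduled cookbook, the key property is the same: the rescue config is rewritten shortly after every committed change, so a reboot can never load a configuration that is months out of date.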

Scorecard

Incident Engagement ScoreCard (answers are yes/no)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? no
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? no
  • Were pages routed to the correct sub-team(s)? no
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. no

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? yes
  • Was a public wikimediastatus.net entry created? yes
  • Is there a phabricator task for the incident? yes
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 9


Follow-up tickets

phab:T376005 - Juniper: regularly run `request system configuration rescue save`

phab:T375961 - cr2-eqsin disk failure Sept 2024

phab:T378038 - create a place (whiteboard) where SRE advertises current site status / things for awareness