Incidents/2024-09-28 cr2-eqsin down
document status: draft
Summary
Incident ID | 2024-09-28 cr2-eqsin down | Start | 20:30
---|---|---|---
Task | T375961 | End | 22:13
People paged | 31 | Responder count | 3
Coordinators | Sukhbir Singh | Affected metrics/SLOs |
Impact | Loss of connectivity to the sites for users served from eqsin (Asia region) until the site was depooled. | |
A core router (cr2-eqsin) in eqsin went down due to a hardware (disk) failure.
Recovery was slowed by an additional unexpected event: on reboot, the router loaded a very outdated configuration.
Timeline
- 20:24 core router cr2-eqsin in [[eqsin]] goes down
- 20:25-20:26 monitoring starts sending alerts about the device being down, OSPF status, failed RIPE Atlas probes and others
- 20:27-20:29 2 SREs start responding / getting to their laptops
- 20:30 eqsin is depooled
- 20:45 hardware failure of disk 0 identified
- 21:08 incident is officially opened, Sukhbir becomes IC
- 21:12 eqsin repooled to test recovery
- 21:15 BGP is down to all the LVS hosts; pybal is restarted to try to mitigate it
- 21:16 recovery fails, eqsin is depooled again
- 21:25 discovery that the router rebooted with an old config
- 21:?? previous config is restored
- 22:13 eqsin repooled, recovery successful, incident marked as resolved
Graphs
SAL
Detection
20:24:51 <+icinga-wm> PROBLEM - Host cr2-eqsin.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
20:24:52 <+icinga-wm> PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
20:24:52 <+icinga-wm> PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
20:25:12 <+jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
20:25:53 <+icinga-wm> PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 68, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
20:26:37 <+icinga-wm> PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
20:26:49 <+icinga-wm> PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
20:27:05 <+icinga-wm> PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 439 probes of 783 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
...
20:29:53 <+icinga-wm> RECOVERY - Host cr2-eqsin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 222.76 ms
Conclusions
- Coverage during late hours on a weekend can be spotty.
What went well?
- Given the circumstances, access for Asia users was restored in a reasonable time frame by depooling the site.
What went poorly?
- The router unexpectedly loaded an outdated rescue config, which increased the downtime.
Where did we get lucky?
- Netops responded even though it was very late and outside their working hours.
Links to relevant documentation
- …
Actionables
We need to prevent routers from loading an outdated backup config on reboot. It is unfortunately not possible to configure the routers to simply never load the backup (rescue) config, so it has to be kept up to date instead. One option considered is saving the current config as the rescue config on every Homer commit; another is to write a cookbook for it (see the sketch after the list below).
- Juniper: regularly run `request system configuration rescue save`
- cr2-eqsin disk failure Sept 2024 (replace router)
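
As a rough illustration of what such automation could look like, the sketch below uses netmiko to run the rescue-save command against a list of core routers. The hostnames, credentials, and the choice of netmiko are assumptions for illustration only; in practice this would more likely be implemented as a Spicerack cookbook or as a step in the Homer commit workflow.

```python
# Minimal sketch only: hostnames and credentials below are placeholders,
# and netmiko is assumed to be available; this is not the actual tooling.
from netmiko import ConnectHandler

ROUTERS = ["cr2-eqsin.example.net", "cr3-eqsin.example.net"]  # hypothetical FQDNs


def save_rescue_config(host: str, username: str, password: str) -> str:
    """Refresh the Junos rescue config so a reboot cannot bring back a stale one."""
    conn = ConnectHandler(
        device_type="juniper_junos",
        host=host,
        username=username,
        password=password,
    )
    try:
        # Overwrites the stored rescue configuration with the currently committed one.
        return conn.send_command("request system configuration rescue save")
    finally:
        conn.disconnect()


if __name__ == "__main__":
    for router in ROUTERS:
        print(router, save_rescue_config(router, "netops-user", "CHANGE-ME"))
```

Whichever mechanism is chosen, the key property is that the rescue configuration is refreshed automatically after changes rather than relying on someone remembering to run the command by hand.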
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | no | |
| | Were the people who responded prepared enough to respond effectively? | yes | |
| | Were fewer than five people paged? | no | |
| | Were pages routed to the correct sub-team(s)? | no | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes | |
| | Was a public wikimediastatus.net entry created? | yes | |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | no | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | yes | |
| | Were the engineering tools that were to be used during the incident, available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all “yes” answers above) | 9 | |
Follow-up tickets
- phab:T376005 - Juniper: regularly run `request system configuration rescue save`
- phab:T375961 - cr2-eqsin disk failure Sept 2024
- phab:T378038 - create a place (whiteboard) where SRE advertises current site status / things for awareness