Incidents/2024-09-28 cr2-eqsin down

document status: draft

Summary

Incident metadata (see Incident Scorecard)

  • Incident ID: 2024-09-28 cr2-eqsin down
  • Start: 20:30
  • End: 22:13
  • Task: T375961
  • People paged: 31
  • Responder count: 3
  • Coordinators: Sukhbir Singh
  • Affected metrics/SLOs:
  • Impact: Connectivity to the sites for all users from eqsin (Asia region).


A core router (cr2-eqsin) in eqsin went down due to a hardware failure.

Recovery was slowed by an additional unexpected event: on reboot, the router loaded a very outdated configuration.

Timeline

20:24 core router cr2-eqsin in [[eqsin]] goes down
20:25-20:26 monitoring starts sending alerts about the device being down, OSPF status, failed RIPE Atlas probes, and others
20:27-20:29 2 SREs start responding / getting to laptops
20:30 eqsin was depooled
20:45 hardware failure of disk 0 identified
21:08 incident is officially opened, Sukhbir becomes IC
21:12 eqsin repooled to test recovery
21:15 BGP is down to all the LVS hosts; pybal is restarted in an attempt to mitigate it
21:16 recovery failed, eqsin depooled again
21:25 discovery that the router rebooted with an old config
21:?? previous config is restored
22:13 eqsin repooled, recovery successful, incident marked as resolved

Graphs

SAL

Detection

20:24:51 <+icinga-wm> PROBLEM - Host cr2-eqsin.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
20:24:52 <+icinga-wm> PROBLEM - Host cr2-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
20:24:52 <+icinga-wm> PROBLEM - Host cr2-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
20:25:12 <+jinxer-wm> FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on kubestage2001:9100 -
                	https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
                	https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
20:25:53 <+icinga-wm> PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 68, down: 3, dormant: 0, excluded:
                	0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
20:26:37 <+icinga-wm> PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP
                	https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
20:26:49 <+icinga-wm> PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP
                	https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
20:27:05 <+icinga-wm> PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 439 probes of 783 (alerts on 35) -
                	https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
                	https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
...
20:29:53 <+icinga-wm> RECOVERY - Host cr2-eqsin.mgmt is UP: PING OK - Packet loss = 0%, RTA = 222.76 ms

Conclusions

  • Coverage on a weekend in late hours can be spotty.

What went well?

  • Given the hour and the circumstances, access for users in Asia was restored in a reasonable amount of time by depooling the site.

What went poorly?

  • The router unexpectedly loaded an outdated rescue config which increased downtime.

Where did we get lucky?

  • Netops reacted even though it was very late and after hours for them.


Actionables

We need to prevent routers from loading outdated rescue configs on reboot, one way or another.

One option considered is to refresh the rescue config (`request system configuration rescue save`) on every Homer commit, so the saved copy always matches the most recently committed configuration.

Another option is to write a cookbook that refreshes it regularly across all routers.

It is unfortunately not possible to configure the routers to simply never load the rescue config.
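As a rough illustration of the first option, here is a minimal sketch using the junos-eznc (PyEZ) library; the hostname list, credential handling, and the hook point into Homer are assumptions for illustration, not the actual integration:

<syntaxhighlight lang="python">
# Minimal sketch, not the actual Homer/cookbook integration: refresh the
# Junos rescue configuration on a router so that a reboot cannot fall back
# to a stale config. Hostnames and credentials are illustrative assumptions.
from jnpr.junos import Device  # junos-eznc (PyEZ)


def refresh_rescue_config(hostname: str, user: str) -> None:
    """Run `request system configuration rescue save` on one router."""
    with Device(host=hostname, user=user) as dev:
        # Overwrites the rescue config with the currently committed one,
        # equivalent to running the command from the Junos CLI.
        dev.cli("request system configuration rescue save", warning=False)


if __name__ == "__main__":
    # Hypothetical router list; in practice this would come from Netbox or
    # the Homer configuration rather than being hard-coded.
    for router in ("cr2-eqsin.wikimedia.org",):
        refresh_rescue_config(router, user="rescue-refresh")
</syntaxhighlight>

Whether this runs as a post-commit step in Homer or as a scheduled cookbook, the key property is the same: the rescue config is rewritten shortly after every committed change, so a reboot can never load a configuration that is months out of date.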

Scorecard

Incident Engagement ScoreCard (answers are yes/no)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? no
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? no
  • Were pages routed to the correct sub-team(s)? no
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. no

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? yes
  • Was a public wikimediastatus.net entry created? yes
  • Is there a phabricator task for the incident? yes
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 9


Follow-up tickets

phab:T376005 - Juniper: regularly run `request system configuration rescue save`

phab:T375961 - cr2-eqsin disk failure Sept 2024

phab:T378038 - create a place (whiteboard) where SRE advertises current site status / things for awareness