Incidents/2019-08-23 network codfw

document status: final

Summary

A provider outage on our primary transport link between eqiad and codfw caused the link to flap constantly (repeatedly going down and coming back up).

This flapping caused routing re-convergence churn and packet loss between the two sites.

At the application level, this translated to an elevated rate of 5xx responses from Varnish in ulsfo, eqsin, and codfw from 21:20 to 21:55 UTC. Varnish reported "No backend" for many of the requests. In Icinga, host checks were flapping with "TTL exceeded" and service checks were flapping with "No route to host."

Impact

Slightly more than 52,000 5xx responses were served to users.

https://grafana.wikimedia.org/d/000000479/frontend-traffic?panelId=2&fullscreen&from=1566594998796&to=1566597283616&var-status_type=5

Detection

Monitoring caught and reported the issue via SmokePing and Icinga.

Timeline

All times in UTC.

2019-08-23 21:20 OUTAGE BEGINS
21:25 Investigation begins
21:33 Zayo (the link's provider) reports issue with service (email unnoticed)
Lots of errors and recoveries - flapping
21:41 Arzhel starts investigating
21:46 Brandon called
21:47 Decided to depool codfw (ended up not needing it)
21:48 Arzhel promotes backup link to primary
21:55 OUTAGE ENDS
2019-08-25 01:37 Link stops flapping

Conclusions

What went well?

The problem was quickly worked around once the root cause (the network link) was identified.

What went poorly?

Due to the frequency of the flapping, the Icinga checks for link status, OSPF, and BFD didn't trigger, which led SREs to suspect an application-layer issue.
The work-around (failing over to the backup link) is not documented and requires Netops to carry it out.
Nothing paged, even though the incident had user-facing impact.
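One plausible mechanism for the missed alerts, offered as an assumption rather than something this report confirms, is that a check which only alerts after several consecutive failures never fires when the link recovers between polls. The sketch below contrasts such a check with a simple count of state transitions. The function names, thresholds, and sample data are hypothetical and do not reflect the actual Icinga configuration.

```python
def hard_down_alert(samples, max_check_attempts=3):
    """Return True if `samples` contains `max_check_attempts` consecutive
    failed checks. Each sample is True (check OK) or False (check failed)."""
    consecutive = 0
    for ok in samples:
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= max_check_attempts:
            return True
    return False


def flap_alert(samples, max_transitions=4):
    """Return True if the state changed more than `max_transitions` times."""
    transitions = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return transitions > max_transitions


# Hypothetical poll results: the link never stays down for three polls in a row.
flapping = [True, False, False, True, False, True,
            True, False, False, True, False, True]
# Hypothetical plain outage: the link goes down and stays down.
solid_outage = [True, True, False, False, False, False]

print(hard_down_alert(flapping), flap_alert(flapping))
print(hard_down_alert(solid_outage), flap_alert(solid_outage))
```

With this sample data the consecutive-failure check stays silent on the flapping link while the transition count flags it, and the reverse holds for the plain outage.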

Where did we get lucky?

Giuseppe and Filippo responded outside of their office hours.

How many people were involved in the remediation?

5 SREs. No incident coordinator was appointed.

Links to relevant documentation

Actionables

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.

These two will help mitigate the consequences of an excessively flapping link:
- Configure interface damping on primary links - T196432
- ospf link-protection - T167306
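As a rough illustration of why damping helps with a flapping link, the sketch below models the general penalty-and-decay scheme that flap damping mechanisms use: each flap adds a penalty, the penalty decays exponentially with a half-life, and the interface is held down while the penalty is above a suppress threshold until it decays to a reuse threshold. All names and numeric values here are hypothetical and are not taken from T196432 or from any router configuration.

```python
def simulate_damping(flap_times, end_time, penalty_per_flap=1000.0,
                     suppress=2000.0, reuse=1000.0, half_life=5.0, step=1.0):
    """Simulate penalty-based damping over discrete time steps.

    Returns a list of (time, penalty, suppressed) tuples, one per step.
    All thresholds are illustrative values, not real router defaults.
    """
    flaps = sorted(flap_times)
    decay = 0.5 ** (step / half_life)  # per-step exponential decay factor
    penalty, suppressed = 0.0, False
    history = []
    i, t = 0, 0.0
    while t <= end_time:
        # Add a fixed penalty for every flap that has occurred by time t.
        while i < len(flaps) and flaps[i] <= t:
            penalty += penalty_per_flap
            i += 1
        # Hysteresis: suppress at the high threshold, release at the low one.
        if penalty >= suppress:
            suppressed = True
        elif penalty <= reuse:
            suppressed = False
        history.append((t, penalty, suppressed))
        penalty *= decay
        t += step
    return history


# Three flaps in quick succession, then the link stays stable.
history = simulate_damping(flap_times=[0, 1, 2], end_time=12)
for t, penalty, suppressed in history:
    print(f"t={t:4.1f} penalty={penalty:7.1f} suppressed={suppressed}")
```

In this model a single flap never reaches the suppress threshold, so ordinary one-off link events are unaffected, while a burst of flaps keeps the interface held down until the link has been quiet for a while.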
This one will make it easier (down the road) for someone outside Netops to fail over a link if the need arises:
- Configuration management for network operations - T228388
This one is about improving monitoring and alerting by replacing SmokePing with something Prometheus-based:
- Investigate/setup prometheus blackbox_exporter - T169860