Incidents/2020-03-25 codfw-network

document status: in-review

Summary

Internal connectivity to codfw hosts and services was unexpectedly lost for 5 minutes. This caused user-visible failed queries for users whose traffic hits our eqsin and ulsfo edges and who were using active/active services (Swift, Maps, Restbase API, ...).

The cause was maintenance that required a linecard restart on cr1-codfw, which exposed a flaw in codfw's network design.

Loss of some external connectivity to codfw was expected, and the site was CDN-depooled before the maintenance began. What was not anticipated was that cr1-codfw would hold VRRP mastership for the duration of the linecard restart. It therefore kept acting as the default gateway for all hosts in the cluster while effectively being a black hole for routing anywhere outside the cluster, because the linecard being rebooted carries both all of the cross-cluster links and the router-to-router interconnect.
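The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the election rule is a simplified stand-in for VRRP (highest priority among routers still reachable on the access side), and the priorities, linecard numbers, and the name "peer-router" are hypothetical, not taken from the real codfw configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    kind: str      # "access", "uplink" or "interconnect"
    linecard: int  # hypothetical slot number


@dataclass
class Router:
    name: str
    vrrp_priority: int  # hypothetical value
    links: list
    failed_linecards: set = field(default_factory=set)

    def has(self, kind: str) -> bool:
        """True if at least one link of this kind sits on a working linecard."""
        return any(
            link.kind == kind and link.linecard not in self.failed_linecards
            for link in self.links
        )


def elect_master(routers):
    # Simplified election: any router still reachable on the access side
    # can win, and the highest priority does. Upstream health is ignored.
    candidates = [r for r in routers if r.has("access")]
    return max(candidates, key=lambda r: r.vrrp_priority)


def master_can_forward(routers) -> bool:
    master = elect_master(routers)
    if master.has("uplink"):
        return True
    # Otherwise traffic must cross the interconnect to a peer with uplinks.
    # The peer end of each interconnect is not modelled link by link.
    return master.has("interconnect") and any(
        peer.has("interconnect") and peer.has("uplink")
        for peer in routers
        if peer is not master
    )


def build(cr1_interconnect_linecards):
    """Build both routers with (hypothetical) linecard 5 of cr1 restarting."""
    cr1_links = [Link("access", 0), Link("uplink", 5)]
    cr1_links += [Link("interconnect", lc) for lc in cr1_interconnect_linecards]
    cr1 = Router("cr1-codfw", 110, cr1_links, failed_linecards={5})
    peer = Router("peer-router", 100,
                  [Link("access", 0), Link("uplink", 5), Link("interconnect", 5)])
    return [cr1, peer]


as_built = build([5])    # interconnect only on the restarting linecard
diverse = build([5, 0])  # interconnect also on a second linecard
print(elect_master(as_built).name, master_can_forward(as_built))
print(elect_master(diverse).name, master_can_forward(diverse))
```

In the first scenario cr1-codfw keeps mastership but has no working path out, which matches the black hole described above. In the second, the extra interconnect on a different linecard gives the master a path through its peer.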

There was a second OSPF flap/convergence event around 12:22, but it doesn't seem to have had any impact.

Impact

~28k queries were lost among those terminated in ulsfo and eqsin against active/active services: https://logstash.wikimedia.org/goto/bcab629e395fc8a71ef9ac5d525c1ec7

Although this was <1% of global HTTP traffic at the time, upload-lb requests in ulsfo and eqsin were heavily affected -- that is, users of Wikimedia Commons images or of map tiles whose traffic terminates in those datacenters. On upload-lb, ~10% of requests in ulsfo failed during the interval; in eqsin, about 1.5%.

This also created Kafka mirrormaker delays, the impact of which is TODO

Detection

Automated: Icinga pages for service IPs in codfw, in addition to alerts for socket timeouts against many hosts (especially appservers).

Since none of the codfw appservers could be reached, there *would* have been lots of alert spam in #wikimedia-operations (one per appserver) -- except that icinga-wm was disconnected from IRC for Excess Flood.

Timeline

Conclusions

What went well?

Automated monitoring detected the incident very well

What went poorly?

A lot of Icinga spam due to many host-level service checks
Root-causing the issue took a while and required deep network expertise

Where did we get lucky?

Linecard took only 5 minutes to finish rebooting
Exposed a design flaw in the codfw network during scheduled maintenance, rather than unexpectedly -- an actual hardware failure in the same spot would be ugly
codfw wasn't the primary DC

How many people were involved in the remediation?

1 SRE during incident; 2 SRE + 1 SRE director investigating afterwards

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs)? If that documentation does not exist, there should be an action item to create it.

Actionables

Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.

NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.

Add linecard diversity to the router-to-router interconnect in codfw. phab:T248506
Consider plumbing a backup router cross-connect via a new VLAN on the access switches.
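
As a rough illustration of the first actionable, the sketch below checks whether a set of interconnect interfaces spans more than one linecard. It assumes Juniper-style interface names of the form type-fpc/pic/port, where the first number identifies the linecard slot. The function names and the example interface names are hypothetical and do not reflect the actual cr1-codfw configuration.

```python
import re

# Matches names such as "xe-5/0/0" or "et-4/1/2.0" (optional logical unit).
_IFACE = re.compile(r"[a-z]+-(\d+)/\d+/\d+(?:\.\d+)?")


def linecard_of(interface: str) -> int:
    """Return the linecard (FPC) slot encoded in the interface name."""
    match = _IFACE.fullmatch(interface)
    if match is None:
        # Aggregated or virtual interfaces (e.g. "ae0") carry no slot number.
        raise ValueError(f"cannot determine linecard for {interface!r}")
    return int(match.group(1))


def is_linecard_diverse(interfaces) -> bool:
    """True if the interfaces are spread over at least two linecards."""
    return len({linecard_of(name) for name in interfaces}) >= 2


# Hypothetical interconnect member lists, before and after adding diversity.
print(is_linecard_diverse(["xe-5/0/0", "xe-5/1/0"]))
print(is_linecard_diverse(["xe-5/0/0", "xe-4/0/0"]))
```

A check along these lines could be run against the member links of the router-to-router interconnect to confirm that a single linecard restart cannot take all of them down at once.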