Incidents/2018-08-08 Network
Summary
Topology changes made to improve the redundancy and stability of the switch stack asw2-a-eqiad caused it to drop ~1/3 of the packets transiting through its members for about 1h. This packet loss caused internal services to time out and retry. The exact user-facing impact is TBD, but it included at least an increase in 5xx errors.
Timeline (UTC)
17:14 - First topology change made
17:43 - Last topology change made (T201145#4489225)
17:47 - First Icinga alerts, some high API latencies, puppetfails, etc. IRC spam is bad, but no major pages or signs of broader user-facing issues yet.
18:07 - Replaced fpc1-fpc3 link for T201095 (done while unaware of the alerts)
18:10 - Started investigating asw2-a-eqiad
18:30 - Disabled fpc1-fpc3 link
18:33 - Minor user-facing disturbances begin showing up as a low-but-unusual rate of 503s
18:42 - 503 rate begins climbing significantly, peaking at ~5% of all cache_text requests. This is probably roughly all of the cache misses and passes (e.g. logged-in traffic), with only cache hits still being served. (Grafana)
18:47 - Disabled fpc2-fpc4 link
18:47 - First Icinga recoveries
18:50 - The 503 burst that began at 18:42 returns to the normal near-zero rate.
19:18 - eqiad front edge depooled in DNS, to stabilize and reduce risk during follow-on investigation and fixups (the change takes 10 minutes to come into effect as DNS TTLs expire)
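The intervals implied by the timeline can be checked with a short script. This is a minimal sketch: the event labels and the `minutes_between` helper are hypothetical names chosen for illustration, the timestamps are copied from the entries above, and treating "first alerts to first recoveries" as the impact window is an assumption that matches the "about 1h" figure in the summary.

```python
from datetime import datetime

# UTC timestamps copied from the timeline; the labels are illustrative shorthand.
EVENTS = {
    "last_topology_change": "17:43",
    "first_alerts": "17:47",
    "investigation_start": "18:10",
    "burst_start": "18:42",
    "first_recoveries": "18:47",
    "burst_end": "18:50",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two labelled timeline events."""
    fmt = "%H:%M"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    return int(delta.total_seconds() // 60)


# Delay between the last change and the first alerts.
alert_delay = minutes_between("last_topology_change", "first_alerts")
# Time from first alerts until the switch stack was being investigated.
time_to_investigate = minutes_between("first_alerts", "investigation_start")
# Assumed impact window: first alerts to first recoveries.
impact_window = minutes_between("first_alerts", "first_recoveries")
# Duration of the significant 503 burst.
burst_duration = minutes_between("burst_start", "burst_end")

print(alert_delay, time_to_investigate, impact_window, burst_duration)
```

Computed this way, the alerts started within minutes of the last topology change, while the investigation of asw2-a-eqiad began noticeably later, which is consistent with the conclusion about logging work in SAL.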
Conclusions
- Virtual Chassis are black boxes, which makes it more difficult to investigate issues
- The topology changes involved moving cables, which makes a rollback more difficult
- Our current topologies are unsupported. This outage revealed that any change, even one toward a more supported configuration, can have bad consequences.
- Logging the work in SAL as it was being done could have reduced the response time
- This event triggered a driver issue on the new cp1* servers, which left their links up on the switch side but down on the server side
Actionables
- Status: Unresolved - Fix asw2-a-eqiad topology phab:T201145
- Status: TODO - Repool eqiad front edge traffic once eqiad is stable