Incidents/2020-05-01 vc-link-failure

From Wikitech

document status: final

Summary

The virtual chassis link between asw2-d1-eqiad and asw2-d8-eqiad failed in two steps.

First, on Friday May 1st, the link caused packet loss for hosts on D1 without any other signs of failure.

This packet loss caused connectivity issues between MediaWiki appservers (and, to a lesser extent, API servers) and memcached servers, resulting in a significant increase in MediaWiki exceptions served to users.

This was worked around for the weekend by depooling the D1 servers. At that point the cause of the packet loss was unknown.
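The depooling itself was done with conftool. As a minimal sketch, assuming the confctl command-line front-end and reusing the same host-name selector that appears in the repool entry in the timeline below, a depool of the affected appservers would look roughly like this:

  # Illustrative only: depool the affected row D appservers so traffic
  # stops flowing to hosts behind the lossy virtual chassis link.
  sudo confctl select 'name=mw13(49|5[0-9]|6[0-2])\.eqiad\.wmnet' set/pooled=no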

The next day, Saturday, hosts in D8 started seeing the same issues as D1. This time the switches were logging errors about the D1-D8 link. Disabling the link resolved the issue.

Impact: This had little to no effect on traffic (Varnish_HTTP_Total), error rates (ATS availability), or latencies (Navtiming requests) for anonymous users. For logged-in users, however, an increase in error rates (~1% of requests returned errors, with a short spike to ~7.5%; Appserver errors) and an increase in tail latency (roughly +100%-150%; Appserver p95) was observed.

Timeline

All times in UTC.

Friday 1st:

Saturday 2nd:

  • (Overnight) Wall of flapping "PROBLEM - PHP7 rendering on mwXXXX is CRITICAL: CRITICAL - Socket timeout after 10 seconds" alerts OUTAGE RESURFACES
  • 06:42 Giuseppe and Luca start investigating
  • 06:52 Arzhel starts investigating
  • 07:08 <XioNoX> asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 0 member 1 OUTAGE ENDS
  • 07:49 <oblivian@cumin1001> conftool action : set/pooled=yes; selector: name=mw13(49|5[0-9]|6[0-2])\.eqiad\.wmnet
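For context, the 07:08 entry is the workaround itself: it removes port 0 on PIC slot 1 of switch member 1 from the virtual chassis, taking the faulty D1-D8 VC link out of service, and the 07:49 conftool action repools the appservers depooled the previous day. A hedged sketch of that pair of steps, with the inverse Junos command shown as the eventual way to re-add the link once the hardware is known good (the confctl form is an assumption about how the logged conftool action was issued):

  # on asw2-d-eqiad (Junos operational mode): disable the faulty VC link
  request virtual-chassis vc-port delete pic-slot 1 port 0 member 1

  # later, once the link/optics are repaired, the port can be re-added with:
  request virtual-chassis vc-port set pic-slot 1 port 0 member 1

  # from a cumin host: repool the previously depooled appservers
  sudo confctl select 'name=mw13(49|5[0-9]|6[0-2])\.eqiad\.wmnet' set/pooled=yes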

Detection

  • Did the appropriate alert(s) fire? Yes
  • Was the alert volume manageable? Yes
  • Did they point to the problem with as much accuracy as possible? No

The root cause didn't generate any logs at first, and when it did, those logs didn't trigger alerts.

Conclusions

  • Packet loss through a Virtual Chassis Fabric is difficult to pinpoint (see the example commands after this list)
  • Higher-layer monitoring worked as expected
  • Based on past history, this failure scenario has a low probability of occurring, and it is now documented
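As a rough illustration of the first conclusion, these are the kinds of standard Junos operational commands one would use to check VC link health on the affected switch stack; this is a sketch, not a record of the exact commands run during the incident:

  # on asw2-d-eqiad: check that every member sees its VC ports as Up
  show virtual-chassis vc-port

  # confirm overall virtual chassis membership and mastership
  show virtual-chassis status

  # search the switch logs for virtual-chassis related messages
  # (pattern is illustrative; the relevant errors only appeared on Saturday)
  show log messages | match "vcp|VCCP"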

What went well?

  • We had enough capacity to depool the impacted MediaWiki hosts
  • Once the failure generated logs, the root cause and fix were quick to identify and apply
  • SREs quickly identified D1, then D8, as the common factor

What went poorly?

  • The first VC link failure didn't generate any switch-side errors
  • The issue started happening on a Friday and re-appeared on a Saturday
  • The issue would not have been noticed if SREs hadn't looked at alerts over the weekend

Where did we get lucky?

  • SREs looked at alerts over the weekend

How many people were involved in the remediation?

  • 4 SREs

Links to relevant documentation

Actionables