Incidents/2019-04-16 varnish

Summary

For approximately an hour, the traffic layer served bursts of HTTP 503 errors, reaching up to ~50k/minute for several minutes at a time. It is unclear why this happened, or whether the misbehavior originated at the traffic layer or at the appserver layer.

Impact

Approximately 553,000 HTTP 503 errors were served across all sites. https://logstash.wikimedia.org/goto/accfe83bffa587f460110942361af4a1

Detection

Automated monitoring (Icinga alerts on traffic availability) plus multiple staff/user reports in #wikimedia-operations.

Timeline

All times in UTC.

  • 18:24: HTTP 503 error rate begins to rise OUTAGE BEGINS
  • 18:26: first alert from Icinga
    <+icinga-wm>	PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
  • 18:52: jynus phones bblack
  • 18:55: cdanis depools cp1085 after looking at the Varnish mailbox lag console and https://phabricator.wikimedia.org/T145661 (see the mailbox-lag sketch after this timeline)
  • 19:04: bblack performs varnish-backend-restart on cp1085
  • 19:07: bblack performs varnish-backend-restart on cp1083
  • 19:08: 503s taper off to 0 OUTAGE ENDS
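
The mitigation above (depool, then restart the backend varnishd on the affected cp hosts) targeted Varnish's expiry "mailbox lag" (T145661): the backend hands expired objects to its expiry thread faster than the thread can drain them, and a growing backlog has historically correlated with 503 bursts. As a rough illustration only, not the tooling used during this incident, the lag can be derived from two varnishstat counters; the sketch below is Python and assumes varnishstat is on PATH, that the default varnishd instance is the backend cache, and Varnish 4/5-style counter names.

    #!/usr/bin/env python3
    """Minimal sketch: estimate Varnish expiry mailbox lag (cf. T145661).

    Assumptions not taken from the incident report: varnishstat is on
    PATH, the default varnishd instance is the backend cache, and the
    counters MAIN.exp_mailed / MAIN.exp_received exist (Varnish 4/5).
    """
    import json
    import subprocess


    def mailbox_lag(instance=None):
        """Return exp_mailed - exp_received: objects handed to the
        expiry thread that it has not yet processed."""
        cmd = ["varnishstat", "-j"]
        if instance:
            cmd += ["-n", instance]  # select a non-default varnishd instance
        stats = json.loads(subprocess.check_output(cmd))
        # Newer Varnish releases nest counters under a "counters" key.
        counters = stats.get("counters", stats)
        return (counters["MAIN.exp_mailed"]["value"]
                - counters["MAIN.exp_received"]["value"])


    if __name__ == "__main__":
        print(f"expiry mailbox lag: {mailbox_lag()}")

A lag that keeps climbing over minutes, rather than briefly spiking, is the pattern that has previously been mitigated by restarting the backend varnish, as was done here at 19:04 and 19:07.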

Possibly-relevant other graphs (not reproduced here):

It is unclear which of the above are symptoms and which are causes.

Conclusions

What went well?

  • Automated monitoring detected the incident within about two minutes of the error rate starting to rise.

What went poorly?

  • We were unable to determine the root cause of the incident.

Where did we get lucky?

  • Whatever was causing the issue stopped happening.
  • The outage was not more widespread.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs)? If that documentation does not exist, there should be an action item to create it.

Actionables

  • Continue working on the migration to ATS (Apache Traffic Server). Similar incidents have happened before, and continuing to investigate these Varnish failures is not a good use of time.