Incidents/2019-04-16 varnish

Summary

For approximately an hour, the traffic layer served bursts of HTTP 503 errors, reaching up to ~50k/minute for several minutes at a time. It is unclear why this happened, or whether the misbehavior originated at the traffic layer or at the appserver layer.

Impact

Approximately 553,000 HTTP 503 errors were served across all sites. https://logstash.wikimedia.org/goto/accfe83bffa587f460110942361af4a1

Detection

Automated monitoring (Icinga alerts on traffic availability) plus multiple staff/user reports in #wikimedia-operations.

Timeline

All times in UTC.

  • 18:24: HTTP 503 error rate begins to rise OUTAGE BEGINS
  • 18:26: first alert from Icinga
    <+icinga-wm>	PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
  • 18:52: jynus phones bblack
  • 18:55: cdanis depools cp1085 after looking at the Varnish mailbox lag console and https://phabricator.wikimedia.org/T145661 (see the mailbox-lag sketch after this timeline)
  • 19:04: bblack performs varnish-backend-restart on cp1085
  • 19:07: bblack performs varnish-backend-restart on cp1083
  • 19:08: 503s taper off to 0 OUTAGE ENDS
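
The mitigation above (depool, then restart the backend varnishd on the affected cp hosts) targeted Varnish's expiry "mailbox lag" (T145661): the backend hands expired objects to its expiry thread faster than the thread can drain them, and a growing backlog has historically correlated with 503 bursts. As a rough illustration only, not the tooling used during this incident, the lag can be derived from two varnishstat counters; the sketch below is Python and assumes varnishstat is on PATH, that the default varnishd instance is the backend cache, and Varnish 4/5-style counter names.

    #!/usr/bin/env python3
    """Minimal sketch: estimate Varnish expiry mailbox lag (cf. T145661).

    Assumptions not taken from the incident report: varnishstat is on
    PATH, the default varnishd instance is the backend cache, and the
    counters MAIN.exp_mailed / MAIN.exp_received exist (Varnish 4/5).
    """
    import json
    import subprocess


    def mailbox_lag(instance=None):
        """Return exp_mailed - exp_received: objects handed to the
        expiry thread that it has not yet processed."""
        cmd = ["varnishstat", "-j"]
        if instance:
            cmd += ["-n", instance]  # select a non-default varnishd instance
        stats = json.loads(subprocess.check_output(cmd))
        # Newer Varnish releases nest counters under a "counters" key.
        counters = stats.get("counters", stats)
        return (counters["MAIN.exp_mailed"]["value"]
                - counters["MAIN.exp_received"]["value"])


    if __name__ == "__main__":
        print(f"expiry mailbox lag: {mailbox_lag()}")

A lag that keeps climbing over minutes, rather than briefly spiking, is the pattern that has previously been mitigated by restarting the backend varnish, as was done here at 19:04 and 19:07.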

Possibly-relevant other graphs (not reproduced here):

It is unclear which of the above are symptoms and which are causes.

Conclusions

What went well?

  • Automated monitoring detected the incident within about two minutes of the error rate starting to rise.

What went poorly?

  • We were unable to determine the root cause of the incident.

Where did we get lucky?

  • Whatever was causing the issue stopped happening.
  • The outage was not more widespread.

Links to relevant documentation

Where is the documentation that someone responding to this alert should have (runbook, plus supporting docs)? If that documentation does not exist, there should be an action item to create it.

Actionables

  • Continue working on the migration to ATS (Apache Traffic Server). Similar incidents have happened before, and continuing to investigate these Varnish failures is not a good use of time.