Incidents/2019-04-23 varnish

Summary

A recurrence of the Varnish 'mailbox lag' problem seen in several earlier incidents.

Impact

Approximately 82k queries were lost (HTTP 503 was served instead).

Detection

Automated monitoring -- Icinga alerts on traffic availability.

Timeline

All times in UTC.

  • 19:54 Varnish mailbox lag begins climbing on cp1083. OUTAGE BEGINS
  • 19:56 First Icinga alert for HTTP availability: "PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo"
  • 19:57 Varnish mailbox lag recovers on cp1083 but begins climbing on cp1085.
  • 20:02 Varnish mailbox lag recovers on cp1085. OUTAGE ENDS
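
As a rough cross-check of the impact figure, the outage window from the timeline can be combined with the approximate count of lost queries to estimate an average failure rate. The sketch below assumes the 19:54 and 20:02 timestamps bound the outage exactly (they are only minute-resolution) and treats 82k as an exact count; the function names are illustrative and not part of any Wikimedia tooling.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline; minute resolution only.
OUTAGE_START = "19:54"
OUTAGE_END = "20:02"
# Approximate figure from the Impact section, treated as exact here.
FAILED_QUERIES = 82_000


def outage_seconds(start: str, end: str) -> int:
    """Return the outage length in seconds for two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def average_failure_rate(failed: int, seconds: int) -> float:
    """Return the mean number of failed queries per second."""
    return failed / seconds


duration = outage_seconds(OUTAGE_START, OUTAGE_END)
rate = average_failure_rate(FAILED_QUERIES, duration)
print(f"Outage lasted {duration // 60} minutes; "
      f"roughly {rate:.0f} failed queries per second on average")
```

This is only an average over the whole window. The lag moved from cp1083 to cp1085 partway through, so the actual rate of 503s likely varied across the outage.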

Graphs: Mailbox lag, HTTP availability

Conclusions

See Incident_documentation/20190416-varnish#Conclusions

Actionables

See Incident_documentation/20190416-varnish#Actionables