Incident documentation/20180312-Cache-text

This is a draft, edit heavily please.

Summary

From 05:08 to 08:32 UTC on Monday, 12 March 2018, the number of user-facing 503 errors increased because backend connections piled up on several CDN nodes (cp3011, cp3033, and cp3042) in the esams data center. The root cause of this incident is unknown, but it is likely related to known scalability issues in the Varnish file storage backend. The issue was mitigated by restarting the Varnish backend instance on the affected nodes.

Timeline

All times are UTC.

Conclusions

TBW

Actionables

  • Continue ongoing related investigations in phab:T181315
  • Set up paging alerts for backend connections piling up (TBD)
  • Move backend restarts from weekly to twice weekly (done in gerrit:419090)
  • Long term: Move to ATS as caching solution for cache backends (phab:T96853)
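
The paging alert on piling backend connections is still TBD. As a starting point for discussion, the check could look roughly like the sketch below. It is not an existing alert: the function name, the threshold, and the sample window are all hypothetical, and the connection counts are assumed to come from whatever metrics source ends up being used.

```python
from typing import Sequence

# Hypothetical values; real ones would need tuning against normal
# traffic levels on the cache nodes.
MAX_BACKEND_CONNS = 1000   # assumed per-node connection threshold
SUSTAINED_SAMPLES = 3      # assumed number of consecutive bad samples


def should_page(samples: Sequence[int],
                threshold: int = MAX_BACKEND_CONNS,
                sustained: int = SUSTAINED_SAMPLES) -> bool:
    """Return True if the last `sustained` samples all exceed `threshold`.

    `samples` is a time-ordered series of backend connection counts for
    one node, oldest first.
    """
    if sustained <= 0 or len(samples) < sustained:
        return False
    return all(count > threshold for count in samples[-sustained:])
```

Requiring several consecutive samples over the threshold, rather than a single one, is meant to avoid paging on short spikes while still catching a sustained pile-up like the one seen on cp3011, cp3033, and cp3042.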