Incidents/20180312-Cache-text
This is a draft, edit heavily please.
Summary
From 05:08 to 08:32 UTC on Monday, 12 March 2018 (about 3 hours 24 minutes), the number of user-facing 503 errors increased because backend connections piled up on several CDN nodes (cp3031, cp3033 and cp3042) in the esams datacenter. The root cause of this incident is unknown, but it is likely related to known scalability issues in the Varnish file storage backend. The issue was mitigated by restarting the Varnish backend instance on the affected nodes.
Timeline
All times are UTC.
- 05:08 Incident begins. Related graphs: Varnish HTTP errors on esams, varnish backend connections (cp3031, cp3033, cp3042)
- 08:06 Restart cp3042 varnish backend instance.
- 08:32 Restart cp3031 and cp3033 varnish backend instances.
- 08:32 Incident mitigated.
Conclusions
TBW
Actionables
- Continue ongoing related investigations in phab:T181315
- Set up paging alerting on backend connections piling up (TBD)
- Move backend restarts from weekly to bi-weekly (done in gerrit:419090)
- Long term: Move to ATS as caching solution for cache backends (phab:T96853)
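As a rough illustration of the paging alert proposed in the actionables, the check below flags a node when its backend connection count stays above a threshold for several consecutive samples. This is only a sketch: the function name, the threshold and the sample window are hypothetical and do not reflect the actual monitoring configuration, which is still TBD.

```python
from typing import Sequence


def connections_piling_up(
    samples: Sequence[int],
    threshold: int,
    min_consecutive: int,
) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed `threshold`.

    `samples` is a time-ordered series of backend connection counts for one
    node (oldest first). All parameters are hypothetical tuning knobs.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        # Not enough data yet to decide; do not page.
        return False
    recent = samples[-min_consecutive:]
    return all(count > threshold for count in recent)


if __name__ == "__main__":
    # Made-up sample values, purely for demonstration.
    history = [120, 150, 4200, 5100, 6300]
    print(connections_piling_up(history, threshold=1000, min_consecutive=3))
```

Requiring several consecutive samples above the threshold, rather than a single spike, is one way to keep a paging alert from firing on short bursts; suitable values would have to come from the backend connection graphs.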