This is a draft, edit heavily please.
From 05:08 to 08:32 UTC on Monday, 12 March 2018, the number of user-facing 503 errors increased because backend connections piled up on several CDN nodes (cp3031, cp3033 and cp3042) in the esams datacenter. The root cause of this incident is unknown, but it is likely related to known scalability issues in the Varnish file storage backend. The issue was mitigated by restarting the Varnish backend instance on the affected nodes.
All times are UTC.
- 05:08 Incident begins. Related graphs: Varnish HTTP errors on esams, varnish backend connections (cp3031, cp3033, cp3042)
- 08:06 Restart cp3042 varnish backend instance.
- 08:32 Restart cp3031 and cp3033 varnish backend instances.
- 08:32 Incident mitigated.
- Continue ongoing related investigations in phab:T181315
- Set up paging alerting on backend connections piling up (TBD)
- Move backend restarts from weekly to twice weekly (done in gerrit:419090)
- Long term: Move to ATS as caching solution for cache backends (phab:T96853)
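
As a rough illustration of the paging alert listed above as TBD, the check could flag any host whose backend connection count stays above a threshold for several consecutive samples. This is only a sketch under stated assumptions: the function name, the threshold, the sample counts and the data layout are hypothetical and do not reflect the actual monitoring setup or the values observed during the incident.

```python
from typing import Dict, List


def hosts_piling_up(
    samples: Dict[str, List[int]],
    threshold: int,
    min_consecutive: int,
) -> List[str]:
    """Return hosts whose last `min_consecutive` samples all exceed `threshold`.

    `samples` maps a host name to its backend connection counts, oldest first.
    Requiring several consecutive high samples avoids paging on a short spike.
    """
    flagged = []
    for host, counts in samples.items():
        recent = counts[-min_consecutive:]
        # Only alert when there is enough history and every recent sample is high.
        if len(recent) == min_consecutive and all(c > threshold for c in recent):
            flagged.append(host)
    return sorted(flagged)


# Illustrative numbers only, not measurements from the incident.
example = {
    "cp3031": [120, 900, 1500, 2100],  # sustained growth: should alert
    "cp3033": [110, 130, 3000, 140],   # single spike: should not alert
    "cp3042": [100, 105, 98, 110],     # healthy
}
print(hosts_piling_up(example, threshold=800, min_consecutive=3))
```

The threshold and window length would need tuning against historical graphs so that the alert fires well before users see 503s, but not on ordinary traffic peaks.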