Incidents/2025-03-29 Upload cache unavailability
document status: draft
Summary
Incident ID | 2025-03-29 Upload cache unavailability | Start | 2025-03-29 15:30
---|---|---|---
Task | T390385 | End | 2025-03-30 09:34
People paged | 35 (batphone) | Responder count | 6 + 3
Coordinators | Jelto | Affected metrics/SLOs | Elevated error rate and latency for upload
Impact | Users accessing media from upload.wikimedia.org (images, files, videos) received error messages or experienced higher latency for short periods of time (less than 30 minutes for each incident)
…
- Swift was overloaded in eqiad because of excessive downloads of .tiff files (see the log-aggregation sketch after this list).
- Users accessing media from upload.wikimedia.org saw increased latency or error messages.
- Traffic was throttled/blocked using existing tooling.
- More details in the linked task and gdoc.
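The overload pattern described above (one client class repeatedly fetching large .tiff originals) is the kind of signal that can be surfaced by aggregating frontend access logs by User-Agent and file extension. The following is a minimal sketch only: it assumes combined-log-style lines where the request line and User-Agent are double-quoted fields, which is an assumption about the log format, not the actual cache_upload pipeline.

```python
#!/usr/bin/env python3
"""Sketch: count requests per (User-Agent, file extension) to spot an
abusive download pattern such as the .tiff surge in this incident.

Assumes combined-log-style lines where the request line, referer, and
User-Agent are the first three double-quoted fields; adjust to the real
log format before using.
"""
import re
import sys
from collections import Counter
from urllib.parse import urlparse

# Pull out all double-quoted fields: request line, referer, user-agent, ...
QUOTED = re.compile(r'"([^"]*)"')

def main(path: str, top: int = 20) -> None:
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            fields = QUOTED.findall(line)
            if len(fields) < 3:
                continue
            request, user_agent = fields[0], fields[2]
            parts = request.split()
            if len(parts) < 2:
                continue
            url_path = urlparse(parts[1]).path
            ext = url_path.rsplit(".", 1)[-1].lower() if "." in url_path else ""
            counts[(user_agent[:60], ext)] += 1
    for (ua, ext), n in counts.most_common(top):
        print(f"{n:>10}  .{ext or '-':<6}  {ua}")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "access.log")
```

Run against a sampled access log, a dominant (User-Agent, tiff) bucket at the top of the output would correspond to the traffic that was throttled/blocked here.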
Timeline
All times in UTC.
15:30: FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
15:30: FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
15:37: Status page update “Partial media unavailability”
15:42: FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page
15:47: Swift frontend in eqiad is restarted
16:03: Throttling rule is applied (/action/cache-upload/T390385)
16:10: Incident opened. Jelto becomes IC.
16:16: Throttling rule is updated
User impact stopped.
16:22: Status page update “A fix has been implemented and we are monitoring the results.”
16:45: Status page update “This incident has been resolved.”
Sunday 2025-03-30
02:00: Recurrence (handled by swfrench-wmf, urandom, krinkle; self-resolved). Throttling rule updated to match a slight change in the User-Agent (UA).
09:17: Paged again.
09:19: Throttling rule set to ban; Swift recovers without further intervention.
09:26: Throttling rule set to rate-limit instead of ban.
09:34: Paged again; rule re-set to ban (see the rate-limit vs. ban sketch below).
User impact stopped.
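The last three timeline entries show a rate-limit being tried and then abandoned in favour of a ban: a rate-limit still admits a trickle of matching requests, and in this case that trickle was apparently enough to re-trigger paging. The sketch below is purely conceptual, a token-bucket limiter in Python to illustrate the difference between the two modes; it is not the actual edge tooling or its rule semantics.

```python
import time

class TokenBucket:
    """Conceptual token-bucket rate limiter: requests are allowed while
    tokens remain, then rejected until the bucket refills over time."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # rate-limit: some matching requests still get through
        return False

def banned(_request) -> bool:
    return False          # ban: no matching request gets through at all
```

Under a rate-limit, an aggressive client still gets roughly rate_per_s requests per second through; if each of those is expensive for the Swift backends, the limit has to be very low, or an outright ban, to relieve the overload.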
Detection
This incident was detected via paging alerts for cache_upload and swift (a sketch of querying these alerts programmatically follows the list):
FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page
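For reference, alerts like the three above can also be listed programmatically through the Alertmanager v2 API. This is a minimal sketch under assumptions: the base URL below is a placeholder (alerts.wikimedia.org is a dashboard; the API endpoint and any authentication requirements may differ), and only the alert names are taken from this incident.

```python
"""Sketch: list currently firing alerts for the alert names seen in this
incident, via the Alertmanager v2 API (GET /api/v2/alerts with a filter)."""
import json
import urllib.parse
import urllib.request

ALERTMANAGER = "https://alertmanager.example.org"  # assumption, not the real host
ALERT_NAMES = ["VarnishUnavailable", "HaproxyUnavailable", "ATSBackendErrorsHigh"]

def firing_alerts(name: str) -> list:
    # Alertmanager v2 accepts label matchers in the "filter" query parameter.
    query = urllib.parse.urlencode(
        {"filter": f'alertname="{name}"', "active": "true"}
    )
    with urllib.request.urlopen(f"{ALERTMANAGER}/api/v2/alerts?{query}") as resp:
        return json.load(resp)

if __name__ == "__main__":
    for name in ALERT_NAMES:
        for alert in firing_alerts(name):
            labels = alert.get("labels", {})
            print(name, labels.get("instance", "-"), alert.get("startsAt", "-"))
```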
Conclusions
This was a relatively common traffic incident; we have seen similar issues in the past. More details are in the linked task.
Links to relevant documentation
Actionables
Scorecard
 | Question | Answer (yes/no) | Notes
---|---|---|---
People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes |
 | Were the people who responded prepared enough to respond effectively? | yes |
 | Were fewer than five people paged? | no | batphone
 | Were pages routed to the correct sub-team(s)? | no |
 | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | IC started 40 minutes after the first page
 | Was a public wikimediastatus.net entry created? | yes |
 | Is there a phabricator task for the incident? | yes | T390385
 | Are the documented action items assigned? | no |
 | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | no |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | ?
 | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |
 | Did existing monitoring notify the initial responders? | yes |
 | Were the engineering tools that were to be used during the incident available and in service? | yes |
 | Were the steps taken to mitigate guided by an existing runbook? | yes | Service restarts#Swift ?
Total score (count of all “yes” answers above) | 9 |