Incidents/2025-03-29 Upload cache unavailability

document status: draft

Summary

Incident metadata (see Incident Scorecard)

Incident ID: 2025-03-29 Upload cache unavailability
Start: 2025-03-29 15:30
End: 2025-03-30 09:34
Task: T390385
People paged: 35 (batphone)
Responder count: 6 + 3
Coordinators: Jelto
Affected metrics/SLOs: Elevated error rate and latency for upload
Impact: Users accessing media from upload.wikimedia.org (images, files, videos) received error messages or experienced higher latency for short periods of time (less than 30 minutes for each incident).

  • Swift was overloaded in eqiad by excessive downloads of .tiff files.
  • Users accessing media from upload.wikimedia.org saw increased latency or error messages.
  • The offending traffic was throttled/blocked using existing tooling (a conceptual sketch follows this list).
  • More details are in the linked task and Google Doc.
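
The throttling/blocking referred to above was done with the existing edge request-filtering tooling (the /action/cache-upload/T390385 rule in the timeline). The sketch below is a conceptual stand-in only, not the production rule or tooling: it shows the throttle-versus-ban decision in plain Python, with an invented User-Agent pattern, rate limit, and window.

    # Conceptual sketch of an edge throttle/ban rule for the offending traffic
    # (.tiff requests from a specific client signature). The UA pattern, limits,
    # and window are invented; the real rule lives in the request-filtering tooling.
    import re
    import time
    from collections import defaultdict

    SUSPECT_UA = re.compile(r"ExampleFetcher/\d+")        # placeholder client signature
    SUSPECT_PATH = re.compile(r"\.tiff?$", re.IGNORECASE)

    RATE_LIMIT = 10        # requests allowed per client per window (illustrative)
    WINDOW_SECONDS = 10
    BAN = False            # True mirrors the "set to ban" escalation in the timeline

    _hits: dict[str, list[float]] = defaultdict(list)

    def decide(client_ip: str, user_agent: str, path: str) -> int:
        """Return an HTTP status: 200 pass, 429 throttled, 403 banned."""
        if not (SUSPECT_UA.search(user_agent) and SUSPECT_PATH.search(path)):
            return 200
        if BAN:
            return 403
        now = time.time()
        recent = [t for t in _hits[client_ip] if now - t < WINDOW_SECONDS]
        if len(recent) >= RATE_LIMIT:
            _hits[client_ip] = recent
            return 429
        recent.append(now)
        _hits[client_ip] = recent
        return 200

Flipping BAN to True corresponds to the Sunday escalation, when rate-limiting alone did not keep Swift healthy and the rule was re-set to an outright ban.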

Timeline

All times in UTC.

15:30: FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable

15:30: FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable

15:37: Status page update “Partial media unavailability”

15:42: FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page

15:47: Swift frontend in eqiad is restarted

16:03: Throttling rule is applied (/action/cache-upload/T390385)

16:10: Incident opened. Jelto becomes IC.

16:16: Throttling rule is updated

User impact stopped.

16:22: Status page update “A fix has been implemented and we are monitoring the results.”

16:45: Status page update “This incident has been resolved.”

Sunday 2025-03-30

02:00: Recurrence (handled by swfrench-wmf, urandom, krinkle; self-resolved). Throttling rule updated to match a slight change in UA (see the sketch after this timeline).

09:17: Paged again

09:19: Throttling rule set to ban; Swift recovers without further intervention

09:26: Throttling rule set to rate-limit instead of ban

09:34: Paged again; rule re-set to ban

User impact stopped.
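
The 02:00 rule update was needed because, per the timeline, the client's User-Agent changed slightly and the existing match no longer applied. A tiny illustration of that failure mode, assuming made-up User-Agent strings and patterns (not the real ones):

    # Hypothetical example: an exact-version User-Agent pattern stops matching after
    # a version bump, while a slightly looser pattern keeps matching. The UA strings
    # and patterns are invented for illustration only.
    import re

    strict = re.compile(r"ExampleFetcher/1\.0")        # matches only the original UA
    loose = re.compile(r"ExampleFetcher/\d+(\.\d+)*")  # tolerates version changes

    old_ua = "ExampleFetcher/1.0"
    new_ua = "ExampleFetcher/1.1"

    assert strict.search(old_ua) and not strict.search(new_ua)
    assert loose.search(old_ua) and loose.search(new_ua)
    print("loosened pattern still matches after the UA change")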

Detection

This incident was detected via paging for cache_upload and swift; a sketch for checking the underlying availability numbers by hand follows the alert list:

FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable

FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable

FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page
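
Both frontend pages fire on reduced HTTP availability at the edge, i.e. the share of failed responses crossing a threshold. As a rough way to check the underlying ratio by hand, something like the following could query a Prometheus-compatible API; the endpoint, the metric labels, and the expression are assumptions for illustration, not the exact expressions behind these alerts.

    # Rough, assumption-laden sketch: query a Prometheus-compatible endpoint for the
    # share of 5xx responses served by the cache_upload frontends over the last 5 minutes.
    # The endpoint URL, the cluster label, and the expression are placeholders.
    import requests

    PROMETHEUS = "http://prometheus.example.org"  # placeholder, not a production endpoint
    QUERY = (
        'sum(rate(haproxy_frontend_http_responses_total{code="5xx",cluster="cache_upload"}[5m]))'
        ' / sum(rate(haproxy_frontend_http_responses_total{cluster="cache_upload"}[5m]))'
    )

    def error_ratio() -> float:
        resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0

    if __name__ == "__main__":
        print(f"cache_upload 5xx ratio over the last 5m: {error_ratio():.2%}")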

Conclusions

This was a fairly common type of traffic incident; we have seen similar issues in the past. More details are in the linked task.

Actionables

Scorecard

Incident Engagement ScoreCard (answers are yes/no, with notes where given)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? Yes
  • Were the people who responded prepared enough to respond effectively? Yes
  • Were fewer than five people paged? No (batphone)
  • Were pages routed to the correct sub-team(s)? No
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. No

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? No (IC started 40 minutes after the first page)
  • Was a public wikimediastatus.net entry created? Yes
  • Is there a phabricator task for the incident? Yes (T390385)
  • Are the documented action items assigned? No
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? No

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. Yes (?)
  • Were the people responding able to communicate effectively during the incident with the existing tooling? Yes
  • Did existing monitoring notify the initial responders? Yes
  • Were the engineering tools that were to be used during the incident available and in service? Yes
  • Were the steps taken to mitigate guided by an existing runbook? Yes (Service restarts#Swift ?)

Total score (count of all “yes” answers above): 9