Incidents/2024-07-21 s4 and x1 write overload
document status: final
Summary
| Incident ID | 2024-07-21 s4 and x1 write overload | Start | 2024-07-21 20:59 |
|---|---|---|---|
| Task | T370304 | End | 2024-07-21 21:09 |
| People paged | Unknown (VictorOps history does not go back that far) | Responder count | 2 (Amir1, bvibber) |
| Coordinators | N/A | Affected metrics/SLOs | |
| Impact | Wiki unavailability | | |
Database servers became unavailable with errors like Wikimedia\Rdbms\DBUnexpectedError: "Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds."
This happened because s4 became overloaded and brought x1 down with it, which in turn took down the services that depend on x1.
A previous incarnation of this incident occurred on 2024-07-13.
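The "circuit breaking" named in the error is MediaWiki's protection against an overloaded database section saturating the application servers. The Python sketch below is only a simplified illustration of that general idea, not the actual Rdbms implementation; the class name, threshold, and cooldown values are assumptions for illustration.

```python
# Illustrative sketch only (not MediaWiki's Rdbms code): a minimal per-section
# circuit breaker that refuses new database writes once the recent write rate
# crosses a threshold, shedding load instead of exhausting shared resources.
import time
from collections import deque

class SectionCircuitBreaker:
    def __init__(self, max_writes_per_sec=400, cooldown_sec=5):
        self.max_writes_per_sec = max_writes_per_sec
        self.cooldown_sec = cooldown_sec
        self.recent_writes = deque()   # timestamps of writes in the last second
        self.tripped_until = 0.0       # breaker stays open until this time

    def _current_rate(self, now):
        # Drop timestamps older than one second and count what remains.
        while self.recent_writes and now - self.recent_writes[0] > 1.0:
            self.recent_writes.popleft()
        return len(self.recent_writes)

    def allow_write(self):
        now = time.monotonic()
        if now < self.tripped_until:
            return False               # breaker open: shed the write
        if self._current_rate(now) >= self.max_writes_per_sec:
            self.tripped_until = now + self.cooldown_sec
            return False               # threshold crossed: trip the breaker
        self.recent_writes.append(now)
        return True

# One breaker instance per section; callers surface the refusal as an error.
breaker = SectionCircuitBreaker()
if not breaker.allow_write():
    raise RuntimeError("section overloaded; try again in a few seconds")
```

In this sketch each section gets its own breaker, so load shedding on one section is decided independently of the others.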
Timeline
All times in UTC.
2024-07-21
- 20:57 Write queries begin to exceed 400 wr/s
- 20:59 5XX errors begin being served (OUTAGE BEGINS)
- 21:01 Metrics stop being collected because the servers are overwhelmed
- 21:03 Metrics begin collecting again
- 21:09 5XX errors return to nominal rates (OUTAGE ENDS)
- 22:14 Gerrit change 1055629 merged to reduce write load
- 22:44 Gerrit change 1055629 deployed
Site health overview
Detection
Automated alerting fired at 21:00 and 21:01:
<jinxer-wm> FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0.06649% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
<jinxer-wm> FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
At 21:04, Amir became active on the channel (<Amir1> we just had another one)
Actionables
- Switch over the s4 master (db1238 -> db1160); see the sketch below for the general shape of such a switchover.
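Independent of the specific tooling, a primary switchover generally follows the same few steps: stop writes on the old primary, let the candidate replica catch up, promote it, and repoint the remaining replicas and the application configuration. The sketch below only illustrates that sequence; hostnames, credentials, and the lag check are placeholder assumptions, and it is not the switchover tooling used in Wikimedia production.

```python
# Illustrative sketch only: a generic, simplified primary switchover.
import time
import pymysql
import pymysql.cursors

OLD_PRIMARY = "db1238.example.org"   # hypothetical FQDNs for illustration
NEW_PRIMARY = "db1160.example.org"

def connect(host):
    return pymysql.connect(host=host, user="repl_admin", password="***",
                           autocommit=True,
                           cursorclass=pymysql.cursors.DictCursor)

def switchover():
    old, new = connect(OLD_PRIMARY), connect(NEW_PRIMARY)
    # 1. Stop accepting new writes on the old primary.
    with old.cursor() as c:
        c.execute("SET GLOBAL read_only = 1")
    with new.cursor() as c:
        # 2. Wait until the candidate has replayed everything it received.
        for _ in range(60):
            c.execute("SHOW SLAVE STATUS")
            status = c.fetchone()
            if not status or status.get("Seconds_Behind_Master") == 0:
                break
            time.sleep(1)
        # 3. Promote the candidate: stop replication, open it for writes.
        c.execute("STOP SLAVE")
        c.execute("SET GLOBAL read_only = 0")
    # 4. Repoint the remaining replicas and the MediaWiki section config at
    #    the new primary (omitted here; real tooling automates this part).
```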
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | no | |
| | Were fewer than five people paged? | no | |
| | Were pages routed to the correct sub-team(s)? | no | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | |
| | Was a public wikimediastatus.net entry created? | no | |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | no | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | yes | |
| | Were the engineering tools that were to be used during the incident available and in service? | no | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all “yes” answers above) | 5 | |