Incidents/2024-07-21 s4 and x1 write overload
document status: final
Summary
| Incident ID | 2024-07-21 s4 and x1 write overload | Start | 2024-07-21 20:59 |
|---|---|---|---|
| Task | T370304 | End | 2024-07-21 21:09 |
| People paged | Unknown (VictorOps history does not go back that far) | Responder count | 2 (Amir1, bvibber) |
| Coordinators | N/A | Affected metrics/SLOs | |
| Impact | Wiki unavailability | | |
Database servers became unavailable with errors like Wikimedia\Rdbms\DBUnexpectedError: "Database servers in extension1 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds."
This happened because s4 became overloaded and brought x1 down with it, which in turn took down the services that depend on x1.
A previous incarnation of this incident occurred on 2024-07-13.
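The "circuit breaking" named in the error is MediaWiki's protection against an overloaded database section saturating the application servers. The Python sketch below is only a simplified illustration of that general idea, not the actual Rdbms implementation; the class name, threshold, and cooldown values are assumptions for illustration.

```python
# Illustrative sketch only (not MediaWiki's Rdbms code): a minimal per-section
# circuit breaker that refuses new database writes once the recent write rate
# crosses a threshold, shedding load instead of exhausting shared resources.
import time
from collections import deque

class SectionCircuitBreaker:
    def __init__(self, max_writes_per_sec=400, cooldown_sec=5):
        self.max_writes_per_sec = max_writes_per_sec
        self.cooldown_sec = cooldown_sec
        self.recent_writes = deque()   # timestamps of writes in the last second
        self.tripped_until = 0.0       # breaker stays open until this time

    def _current_rate(self, now):
        # Drop timestamps older than one second and count what remains.
        while self.recent_writes and now - self.recent_writes[0] > 1.0:
            self.recent_writes.popleft()
        return len(self.recent_writes)

    def allow_write(self):
        now = time.monotonic()
        if now < self.tripped_until:
            return False               # breaker open: shed the write
        if self._current_rate(now) >= self.max_writes_per_sec:
            self.tripped_until = now + self.cooldown_sec
            return False               # threshold crossed: trip the breaker
        self.recent_writes.append(now)
        return True

# One breaker instance per section; callers surface the refusal as an error.
breaker = SectionCircuitBreaker()
if not breaker.allow_write():
    raise RuntimeError("section overloaded; try again in a few seconds")
```

In this sketch each section gets its own breaker, so load shedding on one section is decided independently of the others.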
Timeline
All times in UTC.
2024-07-21
- 20:57 Write queries begin to exceed 400 wr/s
- 20:59 5XX errors begin being served (OUTAGE BEGINS)
- 21:01 Metrics stop being collected because the servers are overwhelmed
- 21:03 Metrics begin collecting again
- 21:09 5XX errors return to nominal rates (OUTAGE ENDS)
- 22:14 Gerrit change 1055629 merged to reduce write load
- 22:44 Gerrit change 1055629 deployed
Site health overview
Detection
Automated alerting fired at 21:00 and 21:01:
<jinxer-wm> FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0.06649% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
<jinxer-wm> FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
At 21:04, Amir became active on the channel (<Amir1> we just had another one)
Actionables
- Switch over the s4 master (db1238 -> db1160); see the sketch below for the general shape of such a switchover.
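Independent of the specific tooling, a primary switchover generally follows the same few steps: stop writes on the old primary, let the candidate replica catch up, promote it, and repoint the remaining replicas and the application configuration. The sketch below only illustrates that sequence; hostnames, credentials, and the lag check are placeholder assumptions, and it is not the switchover tooling used in Wikimedia production.

```python
# Illustrative sketch only: a generic, simplified primary switchover.
import time
import pymysql
import pymysql.cursors

OLD_PRIMARY = "db1238.example.org"   # hypothetical FQDNs for illustration
NEW_PRIMARY = "db1160.example.org"

def connect(host):
    return pymysql.connect(host=host, user="repl_admin", password="***",
                           autocommit=True,
                           cursorclass=pymysql.cursors.DictCursor)

def switchover():
    old, new = connect(OLD_PRIMARY), connect(NEW_PRIMARY)
    # 1. Stop accepting new writes on the old primary.
    with old.cursor() as c:
        c.execute("SET GLOBAL read_only = 1")
    with new.cursor() as c:
        # 2. Wait until the candidate has replayed everything it received.
        for _ in range(60):
            c.execute("SHOW SLAVE STATUS")
            status = c.fetchone()
            if not status or status.get("Seconds_Behind_Master") == 0:
                break
            time.sleep(1)
        # 3. Promote the candidate: stop replication, open it for writes.
        c.execute("STOP SLAVE")
        c.execute("SET GLOBAL read_only = 0")
    # 4. Repoint the remaining replicas and the MediaWiki section config at
    #    the new primary (omitted here; real tooling automates this part).
```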
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | no | |
| | Were fewer than five people paged? | no | |
| | Were pages routed to the correct sub-team(s)? | no | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | |
| | Was a public wikimediastatus.net entry created? | no | |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | no | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | yes | |
| | Were the engineering tools that were to be used during the incident available and in service? | no | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all “yes” answers above) | 5 | |