Incidents/2024-04-17 mw-on-k8s eqiad outage

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2024-04-17 mw-on-k8s eqiad outage
Task: T362766
Start: 2024-04-17 09:10:00
End: 2024-04-17 09:50:00
People paged:
Responder count: 5
Coordinators: 1
Affected metrics/SLOs:
Impact: According to Traffic's graphs, from HAProxy's point of view, non-5xx requests (text + upload) dropped from 138K req/s to 120K-130K req/s for about 20 minutes.

mcrouter daemonset on mw-on-k8s: The mediawiki pod has 9 containers. We were working on reducing this number to 7 by introducing the mw-mcrouter service. In practice, our end goal was that each mw-on-k8s pod would use a standalone mcrouter pod running on the same node, instead of its own mcrouter container. From mediawiki's point of view, mcrouter's location would be mcrouter-main.mw-mcrouter.svc.cluster.local:4442 instead of 127.0.0.1:11213. The same change had been deployed on codfw the day before, but codfw receives less traffic.
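From MediaWiki's point of view, the only functional difference is whether the mcrouter endpoint is a loopback IP literal (no DNS involved) or a cluster DNS name that has to be resolved through CoreDNS when connections are set up. The following sketch is purely illustrative Python, not the actual mediawiki-config or Helm change; the two endpoint strings are the ones quoted above.

import ipaddress

# Illustrative only: the real change lives in mediawiki-config / helm values.
# Endpoint strings are taken from the incident description above.
OLD_ENDPOINT = ("127.0.0.1", 11213)                                   # in-pod mcrouter container
NEW_ENDPOINT = ("mcrouter-main.mw-mcrouter.svc.cluster.local", 4442)  # mw-mcrouter daemonset service

def needs_dns(host: str) -> bool:
    """Return True if connecting to `host` requires a DNS lookup first."""
    try:
        ipaddress.ip_address(host)  # IP literal: the resolver is never consulted
        return False
    except ValueError:
        return True                 # hostname: resolved via /etc/resolv.conf, i.e. CoreDNS in-cluster

for host, port in (OLD_ENDPOINT, NEW_ENDPOINT):
    print(f"{host}:{port} -> DNS lookup needed: {needs_dns(host)}")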

This change increased the number of DNS requests towards CoreDNS from an average of 40k req/s to 110k req/s, overwhelming the CoreDNS pods.
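The amplification is larger than one extra lookup per connection because the configured name was not dot-terminated: Kubernetes pods typically run with ndots:5 and several search domains in /etc/resolv.conf, so a relative name is tried against each search domain (each candidate normally queried for both A and AAAA records) before the absolute name finally resolves, whereas a trailing dot marks the name as absolute and skips the search list. The sketch below models that glibc-style expansion; the exact ndots value and search domains of wikikube pods are assumptions made for illustration.

# Assumed, typical resolver settings for a pod in e.g. the mw-web namespace;
# the real wikikube values may differ.
SEARCH_DOMAINS = ["mw-web.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5

def candidate_names(name: str) -> list[str]:
    """Query names a glibc-style stub resolver tries, in order, until one resolves."""
    if name.endswith("."):                    # absolute name: the search list is skipped
        return [name]
    searched = [f"{name}.{domain}." for domain in SEARCH_DOMAINS]
    if name.count(".") >= NDOTS:              # enough dots: try the name as-is first
        return [name + "."] + searched
    return searched + [name + "."]            # otherwise search domains come first

for name in ("mcrouter-main.mw-mcrouter.svc.cluster.local",    # as deployed on April 17th
             "mcrouter-main.mw-mcrouter.svc.cluster.local."):  # after the trailing-dot fix
    tries = candidate_names(name)
    # Each candidate is normally queried twice (A + AAAA), so the unterminated form
    # can cost ~8 DNS queries where the dot-terminated form costs ~2.
    print(f"{name}\n  -> {len(tries)} candidate name(s): {tries}")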

Status at ~09:20 UTC:

  • scap was blocked waiting for the mw-on-k8s deployment to finish
  • during the deployment, the mediawiki pods never became ready, and after a while scap attempted to roll back
  • the CoreDNS pods (3) were overwhelmed and OOM-killed over and over again, being left in a CrashLoopBackOff state

Actions:

  • depooled mediawiki reads from eqiad (via discovery)
  • increased memory limits and replicas for CoreDNS on the wikikube clusters
  • terminated the mcrouter FQDN configured on the MediaWiki side with a trailing dot - mcrouter-main.mw-mcrouter.svc.cluster.local.:4442
  • reverted eqiad to use the in-pod mcrouter container
  • pooled eqiad back

Commits:

Graphs:

CoreDNS eqiad, April 16th vs April 17th

HAProxy status codes (1xx + 2xx + 3xx + 4xx)

Timeline

SAL entry: https://sal.toolforge.org/log/yk9Q644BGiVuUzOdxNwu

All times in UTC.

  • 09:08 effie runs scap sync-world to deploy "mediawiki deployments: use mcrouter daemonset for both DCs" (T346690)
  • 09:10 antoine observes a higher rate of mediawiki-related events arriving in logstash
  • 09:18 OUTAGE BEGINS
  • 09:18 ALERT: (PHPFPMTooBusy) firing: (3) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 30.2% idle
  • 09:26 ALERT: (MediaWikiLatencyExceeded) firing: (4) p75 latency high: eqiad mw-api-ext (k8s) 6.79s
  • 09:44 claime depools eqiad in mw-web-ro, mw-api-ext-ro, mw-api-int-ro
  • 09:44 OUTAGE ENDS
  • 10:00 akosiaris manually bumps coredns pods to 6 (eqiad+codfw)
  • 10:16 effie merges and deploys the relevant code changes and reverts: https://gerrit.wikimedia.org/r/1020768 and https://gerrit.wikimedia.org/r/1020774
  • 10:53 effie pools back eqiad for mw-web-ro, mw-api-ext-ro, mw-api-int-ro

Detection

Antoine noticed an elevated number of events coming from the mediawiki channel in Logstash. A few minutes later we got our first alert that we were running out of available PHP-FPM workers.

Conclusions

What went well?

  • Everyone in ServiceOps was around
  • Janis quickly figured out that we were missing the final dot in the FQDN

What went poorly?

  • Although we had deployed the very same change on codfw the day before, we didn't properly analyse its impact.

Where did we get lucky?

  • No luck.

Links to relevant documentation

Actionables

TBA

Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.

Scorecard

Incident Engagement ScoreCard
Question | Answer (yes/no) | Notes

People
  • Were the people responding to this incident sufficiently different than the previous five incidents?
  • Were the people who responded prepared enough to respond effectively?
  • Were fewer than five people paged?
  • Were pages routed to the correct sub-team(s)?
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours.

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident?
  • Was a public wikimediastatus.net entry created?
  • Is there a phabricator task for the incident?
  • Are the documented action items assigned?
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented.
  • Were the people responding able to communicate effectively during the incident with the existing tooling?
  • Did existing monitoring notify the initial responders?
  • Were the engineering tools that were to be used during the incident available and in service?
  • Were the steps taken to mitigate guided by an existing runbook?

Total score (count of all "yes" answers above)