Incidents/2022-11-03 conf disk space

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2022-11-03 conf disk space
Task: T322360
Start: 2022-11-03 17:06:00
End: 2022-11-03 18:09:00
People paged: 0
Responder count: 10
Coordinators: denisse
Affected metrics/SLOs: No impact to the etcd SLO. Metrics: conf* filesystem usage and etcd req/s
Impact: No user impact. The confd service failed for ~33 minutes.

A bug introduced into the MediaWiki codebase caused an increase in connections to the conf* (etcd/confd) hosts from the systems responsible for Dumps, which in turn led to a high volume of log events and ultimately filled up the filesystem.
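
For illustration only, a minimal Python sketch of the access pattern described above; the real code is MediaWiki PHP talking to dbctl/etcd, and the helper names used here (load_db_config, export_row) are hypothetical:

    def load_db_config():
        """Stand-in for a configuration lookup that goes over the network to etcd/dbctl."""
        return {}

    def export_row(row, config):
        """Stand-in for writing one row of the dump output."""
        pass

    def export_rows_buggy(rows):
        # Anti-pattern behind this incident: the configuration check runs for
        # every row of the query result, so an N-row dump becomes N etcd requests.
        for row in rows:
            config = load_db_config()
            export_row(row, config)

    def export_rows_fixed(rows):
        # Load (or cache) the configuration once per batch instead.
        config = load_db_config()
        for row in rows:
            export_row(row, config)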

Timeline

All times in UTC.

  • 2022-09-06: A bug is introduced into the MediaWiki core codebase in 5b0b54599bfd, causing the configuration to be checked for every row of a database query result in WikiExport.php, but the feature is not yet enabled.
  • 2022-10-24: The feature is enabled: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/848201
  • 2022-11-03 08:09 A systemd timer starts the dump process on snapshot10[10,13,12,11], which starts accessing dbctl/etcd (on the conf1* hosts) once per row of a database query result.
  • 17:06 OUTAGE BEGINS conf1008 icinga alert: <icinga-wm> PROBLEM - Disk space on conf1008 is CRITICAL: DISK CRITICAL - free space: / 2744 MB (3% inode=98%): /tmp 2744 MB (3% inode=98%): /var/tmp 2744 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
  • 17:10 Incident opened; elukey reports that the conf1008 root partition is almost full
  • 17:13 Disk space is freed with apt-get clean
  • 17:37 Some nodes reach 100% disk usage
  • 17:37 nginx logs are truncated
  • 17:39 etcd_access.log.1 is truncated on the three conf100* nodes
  • 17:39 OUTAGE ENDS: Disk space is under control
  • 17:46 DB maintenance is stopped
  • 17:48 denisse becomes IC
  • 17:50 All pooling/depooling of databases is stopped
  • 17:52 The origin of the issue is identified as excessive connections from the snapshot10[10,13,12,11] hosts
  • 17:58 The snapshot hosts stop hammering etcd after the dumps are paused
  • 18:15 The code change fixing the issue is merged: https://sal.toolforge.org/log/4iLgPoQBa_6PSCT93YhE
Graphs: conf1008 disk utilization; etcd req/s

Detection

The issue was detected via its final symptom, the nearly full root filesystem on conf1008, through an Icinga alert: <icinga-wm> PROBLEM - Disk space on conf1008 is CRITICAL: DISK CRITICAL - free space: / 2744 MB (3% inode=98%): /tmp 2744 MB (3% inode=98%): /var/tmp 2744 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
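
For context, a minimal Python sketch of the kind of free-space threshold logic behind such an alert; the thresholds and output format here are assumptions, as the real check is Icinga's disk check with thresholds managed elsewhere:

    import shutil

    WARN_PCT = 10   # assumed warning threshold (% free)
    CRIT_PCT = 5    # assumed critical threshold (% free)

    def check_disk(path="/"):
        usage = shutil.disk_usage(path)            # total, used, free in bytes
        free_pct = usage.free / usage.total * 100
        if free_pct <= CRIT_PCT:
            return "CRITICAL", free_pct
        if free_pct <= WARN_PCT:
            return "WARNING", free_pct
        return "OK", free_pct

    if __name__ == "__main__":
        status, pct = check_disk("/")
        print(f"DISK {status} - {pct:.0f}% free on /")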

Conclusions

What went well?

  • confd/etcd are designed not to be a single point of failure (SPOF), which prevented wider impact

What went poorly?

  • We could have reacted to the earlier disk space WARNING alerts instead of waiting for the CRITICALs
  • There were several other metrics clearly indicating that "something is off" (see the linked graphs)

Where did we get lucky?

  • People were around to react to the disk space critical alert

Links to relevant documentation

  • Task that introduced the source of this issue: MW scripts should reload the database config; task T298485

Actionables

  • conf* hosts ran out of disk space due to log spam; task T322360
  • Monitor for high load on etcd/conf* hosts, to catch software reloading its configuration too often before it causes an incident; task T322400
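
A minimal Python sketch of the kind of request-rate check the second actionable asks for; the counter source and the threshold are assumptions, and in practice this would more likely be an alerting rule on the existing etcd request metrics:

    import time

    REQ_RATE_THRESHOLD = 500.0   # assumed requests/second considered "too high"

    def sample_request_counter():
        """Stand-in for reading a cumulative request counter from the etcd metrics."""
        return 0

    def request_rate_is_too_high(interval_s=60):
        # Compute the average request rate over the sampling interval and
        # compare it against the (assumed) threshold.
        first = sample_request_counter()
        time.sleep(interval_s)
        second = sample_request_counter()
        rate = (second - first) / interval_s
        return rate > REQ_RATE_THRESHOLD, rate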

Scorecard

Incident Engagement ScoreCard
(Each question is answered yes/no; notes follow in parentheses.)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? Yes (although some had responded to previous incidents as well)
  • Were the people who responded prepared enough to respond effectively? Yes
  • Were fewer than five people paged? Yes (no page)
  • Were pages routed to the correct sub-team(s)? No (no page)
  • Were pages routed to online (business hours) engineers? (Answer “no” if engineers were paged after business hours.) No (no page)

Process
  • Was the incident status section actively updated during the incident? No (IC came in late)
  • Was the public status page updated? No
  • Is there a phabricator task for the incident? Yes
  • Are the documented action items assigned? Yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? No (from the memory of review ritual participants, we had that exact same issue before)

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? (Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.) Yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? Yes
  • Did existing monitoring notify the initial responders? Yes
  • Were the engineering tools that were to be used during the incident available and in service? Yes
  • Were the steps taken to mitigate guided by an existing runbook? No

Total score (count of all “yes” answers above): 9