No impact to the etcd SLO. Metrics: conf* filesystem usage and etcd req/s
Impact
No user impact. The confd service failed for ~33 minutes.
A bug introduced into the MediaWiki codebase caused an increase in connections to the conf* (confd/etcd) hosts from the systems responsible for Dumps, which in turn led to a high volume of log events and ultimately a filled-up filesystem.
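The per-row pattern behind this incident can be sketched as follows. This is a hypothetical illustration, not the actual MediaWiki code: the names (ConfigClient, is_feature_enabled, the export functions) are stand-ins.

```python
class ConfigClient:
    """Stand-in for a dbctl/etcd-backed configuration source."""

    def __init__(self):
        self.requests = 0  # counts round-trips to the conf* hosts

    def is_feature_enabled(self):
        self.requests += 1  # each call is one request to etcd
        return False        # the feature was not yet enabled


def export_buggy(rows, config):
    """Bug pattern: the config is consulted once per database row."""
    out = []
    for row in rows:
        if config.is_feature_enabled():  # one etcd request per row
            row = row.upper()
        out.append(row)
    return out


def export_fixed(rows, config):
    """Fix pattern: check the flag once, outside the loop."""
    enabled = config.is_feature_enabled()  # a single etcd request
    return [r.upper() if enabled else r for r in rows]
```

With a dump touching millions of rows per host, the buggy form turns one configuration check into millions of etcd requests, which matches the log spam and disk growth seen on the conf1* hosts.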
2022-09-06: A bug is introduced in the MediaWiki core codebase in commit 5b0b54599bfd, causing configuration to be checked for every row of a database query result in WikiExport.php; the affected feature is not yet enabled.
2022-11-03 08:09 A systemd timer starts the dump process on snapshot10[10,11,12,13], which begins accessing dbctl/etcd (on the conf1* hosts) once per row of a database query result.
17:06 OUTAGE BEGINS: conf1008 Icinga alert: <icinga-wm> PROBLEM - Disk space on conf1008 is CRITICAL: DISK CRITICAL - free space: / 2744 MB (3% inode=98%): /tmp 2744 MB (3% inode=98%): /var/tmp 2744 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
17:10 Incident opened; elukey reports that the conf1008 root partition is almost full
17:13 Disk space is freed with apt-get clean
17:37 Some nodes reach 100% disk usage
17:37 nginx logs are truncated
17:39 etcd_access.log.1 is truncated on the three conf100* nodes
17:39 OUTAGE ENDS: Disk space is under control
17:46 DB maintenance is stopped
17:48 denisse becomes IC
17:50 All pooling/depooling of databases is stopped
17:52 The origin of the issue is identified as excessive connections from snapshot10[10,11,12,13]
17:58 The snapshot hosts stop hammering etcd after dumps are paused
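The log truncations at 17:37 and 17:39 freed space immediately because truncating a file in place, unlike deleting it, does not depend on the writing process closing its file descriptor. A minimal sketch of the technique (illustrative, not the actual commands run during the incident):

```python
def truncate_log(path: str) -> None:
    """Truncate a hot log file in place to reclaim disk space now.

    Deleting an open log does not free space: the kernel keeps the
    blocks allocated until the last file descriptor is closed. Opening
    the file in "w" mode truncates it to 0 bytes while the writer's
    descriptor stays valid and can keep appending.
    """
    with open(path, "w"):
        pass
```

The shell equivalent is redirecting nothing into the file (e.g. `: > /var/log/nginx/access.log`), which is why truncation rather than removal was the right call while nginx and etcd were still writing.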
The last symptom of this issue was detected by an Icinga alert: <icinga-wm> PROBLEM - Disk space on conf1008 is CRITICAL: DISK CRITICAL - free space: / 2744 MB (3% inode=98%): /tmp 2744 MB (3% inode=98%): /var/tmp 2744 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
Conclusions
What went well?
confd/etcd is designed not to be a SPOF, which prevented the incident from escalating further
What went poorly?
We could have reacted to the disk space WARNING alerts instead of waiting for the CRITICALs
There were several other metrics clearly indicating that "something is off" (see linked graphs)
Where did we get lucky?
People were around to react to the disk space critical alert
Links to relevant documentation
Task that introduced the source of this issue: MW scripts should reload the database config; task T298485
Actionables
conf* hosts ran out of disk space due to log spam; task T322360
Monitor for high load on etcd/conf* hosts so that software reloading its config too often is caught before it causes an incident; task T322400