Incidents/2025-11-05 WMCS toolsdb primary out of space
document status: final
Summary
| Incident ID | 2025-11-05 WMCS toolsdb primary out of space | Start | 2025-11-04 22:40:00 |
|---|---|---|---|
| Task | | End | 2025-11-05 09:04:00 |
| People paged | | Responder count | |
| Coordinators | | Affected metrics/SLOs | |
| Impact | Tools were not able to access toolsdb | | |
…
Timeline
All times in UTC. The timeline starts on 2025-11-04 and continues into 2025-11-05.
22:40 tools-db-4 runs out of space
00:58 One of the tasks about this issue is opened: https://phabricator.wikimedia.org/T409244
07:19 Filippo starts investigating
07:50 Filippo pages David
08:00 Incident opened. Filippo becomes IC.
08:00 David starts taking action to resize the tools-db-4 Cinder volume
08:01 David attempts to stop mariadb
08:36 mariadb is taking a long time to stop; the decision is made to force-stop it
08:36 Unmounting and extending the volume begins on tools-db-4
08:42 mariadb is started again on the extended volume, which now reports 4.9T size, 3.7T used, 968G available (80% used) on /srv/labsdb
09:04 status is ok
Detection
Users reported that tools were failing to connect to toolsdb.
Actionables
- T409287 Fail over to a replica where ibdata1 is still small, and nuke tools-db-4, where ibdata1 grew too much and will never shrink (https://www.percona.com/blog/why-is-the-ibdata1-file-continuously-growing-in-mysql/)
- Predictive alerts on space running out in XXX days/hours https://phabricator.wikimedia.org/T409404
- Page on low disk space https://phabricator.wikimedia.org/T409404
- Investigate why the alert that checks if it can write to toolsdb (TBD exact name) did not fire
- <dhinus> ok the toolsdb read/write check was removed in T313030
- <dhinus> because we thought alertmanager alerts were enough
- <dhinus> but they only check the read/write status, they don't try writing
As part of https://phabricator.wikimedia.org/T357977 (monitor our own sample tools)
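The gap described in the discussion above is that checking the read/write status is not the same as attempting a write: a primary with a full disk can still report itself as writable. A minimal sketch of a probe that actually writes is shown below. It is not the check that was removed in T313030 or the one planned in T357977. The function name `can_write` and the `heartbeat` table are hypothetical, and the example uses Python's built-in `sqlite3` only as a stand-in so that it runs without a MariaDB server. It assumes any DB-API connection could be passed in instead.

```python
import sqlite3


def can_write(conn) -> bool:
    """Return True only if a real write succeeds and is committed.

    `conn` is any DB-API 2.0 connection. The `heartbeat` table name is a
    hypothetical choice for this sketch, not an existing toolsdb table.
    """
    try:
        cur = conn.cursor()
        cur.execute(
            "CREATE TABLE IF NOT EXISTS heartbeat "
            "(id INTEGER PRIMARY KEY, ts TIMESTAMP)"
        )
        # Overwrite a single row so the probe never grows the table.
        cur.execute(
            "REPLACE INTO heartbeat (id, ts) VALUES (1, CURRENT_TIMESTAMP)"
        )
        conn.commit()
        return True
    except Exception:
        # Disk full, read-only mode, or a lost connection all mean the
        # same thing to users of the database: writes do not work.
        return False


if __name__ == "__main__":
    # Stand-in database; a real probe would connect to the primary instead.
    print(can_write(sqlite3.connect(":memory:")))
```

Treating every exception as a failed probe is a deliberate simplification: for alerting purposes the cause matters less than the fact that a tool could not write.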
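For the predictive alert actionable above, one common approach is to fit a straight line to recent disk usage samples and extrapolate to the volume capacity, similar in spirit to PromQL's `predict_linear`. The sketch below illustrates the arithmetic only. The function name `hours_until_full` and the sample values are made up for illustration, and nothing here reflects how T409404 was or will be implemented.

```python
def hours_until_full(samples, capacity_bytes):
    """Estimate hours until the disk is full from (hour, used_bytes) samples.

    Uses an ordinary least-squares line through the samples. Returns None
    when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None

    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp

    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage: no predicted fill time

    intercept = mean_u - slope * mean_t
    t_full = (capacity_bytes - intercept) / slope
    latest = max(t for t, _ in samples)
    # Clamp at zero if the fitted line says the disk is already full.
    return max(0.0, t_full - latest)


if __name__ == "__main__":
    # Hypothetical readings: usage grows by 10 units per hour.
    readings = [(0, 100), (1, 110), (2, 120)]
    print(hours_until_full(readings, capacity_bytes=150))
```

An alert could then fire when the estimate drops below a chosen threshold, which is the "XXX days/hours" value the actionable leaves open.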