Incidents/2025-11-05 WMCS toolsdb primary out of space

From Wikitech

document status: final

Summary

Incident metadata (see Incident Scorecard)

Incident ID: 2025-11-05 WMCS toolsdb primary out of space
Start: 2025-11-04 22:40:00
End: 2025-11-05 09:04:00
Task:
People paged:
Responder count:
Coordinators:
Affected metrics/SLOs:
Impact: Tools were not able to access toolsdb

Timeline

All times in UTC.

22:40 (2025-11-04) tools-db-4 runs out of disk space

00:58 (2025-11-05) One of the tasks about this issue is opened: https://phabricator.wikimedia.org/T409244

07:19 Filippo starts investigating

07:50 Filippo pages David

08:00 Incident opened. Filippo becomes IC.

08:00 David starts taking action to resize the tools-db-4 Cinder volume

08:01 David attempts to stop mariadb

08:36 mariadb is taking a long time to stop; the decision is made to force-stop it

08:36 umount and extend of the volume begins on tools-db-4

08:42 mariadb is started again on the extended volume, which now reports 4.9T total, 3.7T used, 968G available (80% used) on /srv/labsdb

09:04 status is ok

Detection

Users reported that tools were failing to connect to toolsdb.

Actionables

  • T409287 Fail over to a replica where ibdata1 is still small, then nuke tools-db-4, where ibdata1 grew too large and will never shrink (see https://www.percona.com/blog/why-is-the-ibdata1-file-continuously-growing-in-mysql/)
  • Predictive alerts on space running out in XXX days/hours https://phabricator.wikimedia.org/T409404
  • Page on low disk space https://phabricator.wikimedia.org/T409404
  • Investigate why the alert that checks if it can write to toolsdb (TBD exact name) did not fire
    • <dhinus> ok the toolsdb read/write check was removed in T313030
    • <dhinus> because we thought alertmanager alerts were enough
    • <dhinus> but they only check the read/write status, they don't try writing
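
The predictive alert listed above can be illustrated with a simple linear extrapolation of free space. The following is only a minimal sketch, not the alert that was or will be implemented: the function names, the 72-hour threshold, and the sample data are all hypothetical. Prometheus offers a comparable built-in extrapolation, `predict_linear`; whether the final alert uses it is an assumption, not something stated in this report.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

# One observation: (time of sample, free bytes on the volume).
Sample = Tuple[datetime, float]


def hours_until_full(samples: Sequence[Sample]) -> Optional[float]:
    """Fit a least-squares line through the samples and extrapolate.

    Returns the estimated hours, counted from the last sample, until
    free space reaches zero. Returns None if there are too few samples
    or if free space is not shrinking.
    """
    if len(samples) < 2:
        return None
    t0 = samples[0][0]
    xs = [(t - t0).total_seconds() / 3600.0 for t, _ in samples]
    ys = [free for _, free in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None  # all samples share one timestamp
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope >= 0:
        return None  # free space is flat or growing
    intercept = mean_y - slope * mean_x
    zero_at = -intercept / slope  # hours after t0 at which the line hits 0
    return max(0.0, zero_at - xs[-1])


def should_alert(samples: Sequence[Sample], threshold_hours: float = 72.0) -> bool:
    """True if the volume is projected to fill within threshold_hours."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= threshold_hours


# Made-up data: free space drops by 10 units per hour, starting at 100.
start = datetime(2025, 1, 1)
example = [(start + timedelta(hours=h), 100.0 - 10.0 * h) for h in range(3)]
```

A linear fit is the simplest choice and reacts poorly to sudden bursts of growth, so a real alert would likely pair it with a static low-space threshold, as the paging actionable above suggests.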

As part of https://phabricator.wikimedia.org/T357977 (monitor our own sample tools)
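
The chat excerpt above points out that checking the read/write status is not the same as attempting a write. A check that performs a real write could look roughly like the sketch below. It is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for a MariaDB connection, and the `can_write` function and `heartbeat` table are hypothetical names, not the removed toolsdb check.

```python
import sqlite3
import time


def can_write(conn: sqlite3.Connection, table: str = "heartbeat") -> bool:
    """Return True only if an actual write to the database succeeds.

    Unlike reading a read-only flag, this touches storage, so it also
    fails when the server accepts connections but cannot persist data.
    """
    try:
        cur = conn.cursor()
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(id INTEGER PRIMARY KEY, ts REAL)"
        )
        # Overwrite a single row so the table never grows.
        cur.execute(
            f"REPLACE INTO {table} (id, ts) VALUES (1, ?)",
            (time.time(),),
        )
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False
```

Against a real MariaDB server the same idea would need a MariaDB client library and its own parameter placeholder style; those details are left out here because the report does not specify them.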