
Portal:Toolforge/Admin/Runbooks/ToolsDBAlmostFull

From Wikitech

This alert fires when the free disk space on a ToolsDB host is getting close to zero. It starts at the "warning" level and escalates to "page" if the free space drops below 5%.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually arrives as an alert in Alertmanager.

Debugging

Finding what is taking up space

fnegri@tools-db-6:~$ sudo du -hs /srv/labsdb/* |sort -hr |head -10
2.1T	/srv/labsdb/data
905G	/srv/labsdb/binlogs
20K	/srv/labsdb/tmp
16K	/srv/labsdb/lost+found
fnegri@tools-db-6:~$ sudo du -hs /srv/labsdb/data/* |sort -hr |head -10
281G	/srv/labsdb/data/s53220__quickstatements_p
250G	/srv/labsdb/data/s51434__mixnmatch_p
183G	/srv/labsdb/data/s53685__editgroups
142G	/srv/labsdb/data/ibdata1
95G	/srv/labsdb/data/s51698__yetkin
83G	/srv/labsdb/data/s51412__data
70G	/srv/labsdb/data/s51114__enwp10
58G	/srv/labsdb/data/s53952__freebase_p
57G	/srv/labsdb/data/s51499__wikiminiatlas
56G	/srv/labsdb/data/s51156__petscan
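
To see how close the filesystem is to the 5% paging threshold, the free percentage can be read from df. The following is a minimal sketch, not the alert's own calculation: the function names free_percent and check_free_space are hypothetical, /srv/labsdb is assumed to be the data mount point (based on the du output above), and df rounds its Capacity column, so the result may differ slightly from the value the alert uses.

```shell
# Hypothetical helper: print the free space of the filesystem holding PATH,
# as an integer percentage, derived from the POSIX "df -P" Capacity column.
free_percent() {
    df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print 100 - $5 }'
}

# Hypothetical helper: compare the free percentage against a threshold
# (default 5, the paging level mentioned in this runbook).
check_free_space() {
    local path="$1" threshold="${2:-5}" free
    free=$(free_percent "$path")
    if [ "$free" -lt "$threshold" ]; then
        echo "LOW: ${free}% free on ${path} (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: ${free}% free on ${path}"
}

# Example (path is an assumption):
# check_free_space /srv/labsdb 5
```

The function returns a non-zero status when free space is below the threshold, so it can be chained with other commands in an ad hoc check.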

Common issues

ibdata1 file growing

Long-running uncommitted transactions can cause the file /srv/labsdb/data/ibdata1 to grow very quickly. You can check for active transactions with SHOW ENGINE INNODB STATUS\G from a MariaDB console (run sudo mariadb on the ToolsDB host).

Look for a line such as ---TRANSACTION (0x7f739e977e80), ACTIVE 455393 sec, where the number of active seconds is very large (in this example, more than five days).
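
Scanning the full status output by eye can be tedious, so the check can be sketched as a small filter. This is an illustration only: the function name long_transactions and the default threshold of 3600 seconds are assumptions, and it relies on the transaction lines having the "---TRANSACTION ..., ACTIVE N sec" shape shown above.

```shell
# Hypothetical helper: read SHOW ENGINE INNODB STATUS output on stdin and
# print only the transaction lines active for at least MIN_SECONDS.
long_transactions() {
    local min_seconds="${1:-3600}"
    awk -v min="$min_seconds" '
        /^---TRANSACTION .*ACTIVE [0-9]+ sec/ {
            # The number of seconds is the field right after "ACTIVE".
            for (i = 1; i < NF; i++)
                if ($i == "ACTIVE" && $(i + 1) + 0 >= min) { print; break }
        }'
}

# Example on the ToolsDB host: list transactions open for more than a day.
# sudo mariadb -e 'SHOW ENGINE INNODB STATUS\G' | long_transactions 86400
```

Transactions that have not started do not carry an ACTIVE field, so they are skipped by the pattern.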

See phab:T409716 for more details.

Data growth in one of the user databases

If disk space is low because user data is growing, we can increase the disk size. The data volume is a Cinder volume that can be easily resized; see Extending a volume.

Support contacts

The main discussion channel for this alert is #wikimedia-cloud-admin on IRC.

If the situation is not clear or you need additional help, you can also contact the Data Persistence team (#wikimedia-data-persistence on IRC).

Old incidents

Add any incident tasks here!