Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState


This alert fires when the primary ToolsDB instance is down, or is up but in read-only mode.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

Debugging

Checking the systemd unit status

SSH to the instance and check the systemd status of mariadb.service:

$ ssh tools-db-1.tools.eqiad1.wikimedia.cloud
fnegri@tools-db-1:~$ sudo systemctl status mariadb.service

If SSH does not work (for example because of phab:T349681) you can use virsh console: find the "instance name" and "host" in Horizon, then SSH to the cloudvirt host, and run virsh console {instance name}.

Common issues

Add new issues here when you encounter them!

MariaDB process killed by OOM killer

In this case, the mariadb logs usually contain messages like the following:

$ ssh tools-db-1.tools.eqiad1.wikimedia.cloud
fnegri@tools-db-1:~$ sudo journalctl -u mariadb |grep -i kill
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'.

Sometimes the mariadb logs only include "Main process exited", without any mention of the OOM killer. You can still verify whether the process was killed by the OOM killer by looking at dmesg:

fnegri@tools-db-1:~$ sudo dmesg -T |grep Killed
[Tue Oct 24 09:19:23 2023] Out of memory: Killed process 2437 (mysqld) total-vm:64835688kB, anon-rss:64103256kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:126460kB oom_score_adj:-600
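
The two checks above can be combined into one pass. The following is a minimal sketch, not an established tool: the function name oom_killed is hypothetical, and the patterns are taken only from the sample log lines on this page (plus the assumption that newer MariaDB versions may name the process mariadbd instead of mysqld).

```shell
# Hypothetical helper: reads log text on stdin and succeeds (exit 0) if it
# contains one of the OOM-kill messages shown on this page.
# Assumption: the process may appear as "mysqld" or "mariadbd".
oom_killed() {
    grep -Eiq 'oom-kill|killed by the OOM killer|Out of memory: Killed process [0-9]+ \((mysqld|mariadbd)\)'
}

# Example usage on the instance (requires sudo):
#   { sudo journalctl -u mariadb; sudo dmesg -T; } | oom_killed && echo "OOM kill found"
```

A plain "Main process exited, code=killed" line does not match on its own, so a positive result means one of the sources explicitly mentions the OOM killer.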

Check whether systemd restarted mariadb.service automatically with sudo systemctl status mariadb. If it did not, run sudo systemctl start mariadb.
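
As a sketch of the check-then-start step, the snippet below starts the unit only when systemd does not already report it as active. The function name ensure_mariadb_running is hypothetical; systemctl is-active --quiet is used here as one possible way to test the unit state without parsing the status output.

```shell
# Hypothetical helper: start mariadb.service only if it is not already active.
ensure_mariadb_running() {
    if systemctl is-active --quiet mariadb.service; then
        echo "mariadb.service is already active"
    else
        echo "mariadb.service is not active, starting it"
        sudo systemctl start mariadb.service
    fi
}

# Example usage on the instance:
#   ensure_mariadb_running
```

Even when the unit is active again, the server still needs to be switched back to read-write afterwards.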

Finally, set the server to read-write, as it is configured to start in read-only mode for extra safety:

$ sudo mariadb
MariaDB [(none)]> SET GLOBAL read_only=OFF;
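
The same change can be made non-interactively and confirmed in one command. This is a sketch under the assumption that sudo mariadb connects as an administrative user, as in the interactive session above; the function name set_toolsdb_writable is hypothetical.

```shell
# Hypothetical helper: switch the server to read-write, then print the
# current value of the read_only variable so the change can be confirmed.
set_toolsdb_writable() {
    sudo mariadb -e "SET GLOBAL read_only=OFF; SHOW GLOBAL VARIABLES LIKE 'read_only';"
}

# Example usage on the instance:
#   set_toolsdb_writable
```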

Related information

Support contacts

The main discussion channel for this alert is #wikimedia-cloud-admin on IRC.

If the situation is not clear or you need additional help, you can also contact the Data Persistence team (#wikimedia-data-persistence on IRC).

Old incidents

Add any incident tasks here!