Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState
This alert fires when the primary ToolsDB instance is down, or is up but in read-only mode.
Error / Incident
This usually shows up as an alert in Alertmanager.
Debugging
General cluster overview
You can run the following cookbook to get a cluster overview:
dcaro@urcuchillay$ wmcs-cookbooks wmcs.toolforge.toolsdb.get_cluster_status --cluster tools
Checking the systemd unit status
SSH to the instance and check the systemd status of mariadb.service:
$ ssh tools-db-1.tools.eqiad1.wikimedia.cloud
fnegri@tools-db-1:~$ sudo systemctl status mariadb.service
If SSH does not work (for example because of phab:T349681), you can use virsh console: find the "instance name" and "host" in Horizon, then SSH to the cloudvirt host and run virsh console {instance name}.
Common issues
Add new issues here when you encounter them!
MariaDB process killed by OOM killer
If this is the case, you usually see a log message like the following one in the mariadb logs:
$ ssh tools-db-1.tools.eqiad1.wikimedia.cloud
fnegri@tools-db-1:~$ sudo journalctl -u mariadb |grep -i kill
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'.
Sometimes the mariadb logs only include "Main process exited", without any mention of OOM. In that case you can verify whether the process was killed by the OOM killer by looking at dmesg:
fnegri@tools-db-1:~$ sudo dmesg -T |grep Killed
[Tue Oct 24 09:19:23 2023] Out of memory: Killed process 2437 (mysqld) total-vm:64835688kB, anon-rss:64103256kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:126460kB oom_score_adj:-600
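The dmesg check above can be narrowed to MariaDB processes with a small filter. This is a minimal sketch, not part of the existing tooling: the function name oom_kills is hypothetical, and it assumes the kernel message has the same shape as the example line above and that the process is called either mysqld or mariadbd.

```shell
# Hypothetical helper: reads kernel log lines on stdin and prints one line
# per OOM kill of a MariaDB process, in the form "pid=<PID> proc=<name>".
# Assumes the "Out of memory: Killed process N (name)" message format.
oom_kills() {
    sed -n -E 's/.*Out of memory: Killed process ([0-9]+) \((mysqld|mariadbd)\).*/pid=\1 proc=\2/p'
}

# Usage on the instance (needs sudo to read the kernel log):
#   sudo dmesg -T | oom_kills
```

Empty output means no matching OOM kill was found in the kernel log still held in the ring buffer; older events may have rotated out.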
Check with systemctl status mariadb whether systemd restarted mariadb.service automatically. If it did not, run systemctl start mariadb.
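The check-then-start step can also be written as a small function. This is only a sketch: the name ensure_mariadb_running is hypothetical, and it assumes that starting the unit requires sudo on the instance.

```shell
# Hypothetical helper: start mariadb.service only if systemd reports it
# as not active. "systemctl is-active --quiet" exits 0 when the unit is active.
ensure_mariadb_running() {
    if systemctl is-active --quiet mariadb.service; then
        echo "mariadb.service is already active"
    else
        echo "mariadb.service is not active, starting it"
        sudo systemctl start mariadb.service
    fi
}

# Usage on the instance:
#   ensure_mariadb_running
```

If the unit keeps failing right after start, go back to the journal and dmesg output before retrying.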
Finally, set the server to read-write, as it is configured to start in read-only mode for extra safety:
$ sudo mariadb
MariaDB [(none)]> SET GLOBAL read_only=OFF;
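To confirm the change took effect, the current value of read_only can be queried non-interactively. The sketch below is an illustration under assumptions: the function name toolsdb_write_state is hypothetical, and it assumes that sudo mariadb connects as an administrative user (as in the session above) and that SELECT @@global.read_only prints 0 when the server is writable.

```shell
# Hypothetical helper: prints "read-write" or "read-only" based on the
# server's global read_only variable.
toolsdb_write_state() {
    # -N drops the column header, -B gives plain batch output,
    # -e runs the statement and exits.
    value=$(sudo mariadb -N -B -e 'SELECT @@global.read_only') || return 2
    if [ "$value" = "0" ]; then
        echo "read-write"
    else
        echo "read-only"
        return 1
    fi
}

# Usage on the instance:
#   toolsdb_write_state
```

An exit status of 2 means the client could not run the query at all, which usually points back to the service not being up.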
Related information
Support contacts
The main discussion channel for this alert is #wikimedia-cloud-admin on IRC.
If the situation is not clear or you need additional help, you can also contact the Data Persistence team (#wikimedia-data-persistence on IRC).
Old incidents
Add any incident tasks here!