Portal:Toolforge/Admin/ToolsDB
ToolsDB is a multi-tenant MariaDB database service for use by Toolforge tools and maintainers. This page is primarily for WMCS admins for database and server operations on the ToolsDB database cluster that is currently operating in the tools Cloud VPS project. For general information and background, see Portal:Data Services#ToolsDB and Help:Toolforge/Database#User_databases.
Architecture
ToolsDB is a MariaDB server that runs on a two-host primary-replica configuration.
Both hosts are Cloud VPS virtual machines, where we install our customized MariaDB deb packages using the Puppet profile profile::wmcs::services::toolsdb_primary. We reuse many of the Puppet classes created by the data persistence team for production MariaDB databases, adding some customization that we need for ToolsDB.
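To check which custom MariaDB package a given host is running (a quick sketch; the packages are named wmf-mariadb plus the version number, as seen in the upgrade steps below):
$ dpkg -l 'wmf-mariadb*'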
Finding the right hostnames
At the time this wiki page was last updated, the primary host was tools-db-P.tools.eqiad1.wikimedia.cloud and the replica host was tools-db-R.tools.eqiad1.wikimedia.cloud, but these hostnames will change and there is no guarantee that the primary is sorted alphabetically before the replica.
To find the current hostnames, navigate to the tools project in Horizon and look for hosts with name tools-db-*. There should be only one primary and one replica at any given time, unless some maintenance operation is in progress. To find which of the two hosts is the primary, SSH to both, open a MariaDB shell (sudo mariadb) and type SHOW SLAVE STATUS\G.
In the primary host you will see something like:
MariaDB [(none)]> SHOW SLAVE STATUS\G
Empty set (0.000 sec)
In the replica host you will see something like:
MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
Checking the replication status
SSH to the replica host and from a MariaDB shell (sudo mariadb), type SHOW SLAVE STATUS\G:
MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
[...]
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
[...]
Last_Error:
[...]
Seconds_Behind_Master: 0
[...]
1 row in set (0.000 sec)
If both Slave_IO_Running and Slave_SQL_Running are Yes and Seconds_Behind_Master is 0, the replication is working correctly.
If both Slave_IO_Running and Slave_SQL_Running are Yes but Seconds_Behind_Master is greater than zero, the replication is working but the replica is out of sync. Most of the time this is nothing to worry about, and the replica will catch up automatically. If the replication lag grows too much, it can trigger the alert ToolsToolsDBReplicationLagIsTooHigh.
If at least one of Slave_IO_Running and Slave_SQL_Running is not Yes, the replication has stopped. The field Last_Error will usually provide the reason for the failure, but you can also look at the full MariaDB logs with sudo journalctl -u mariadb.
There are some tricks to fix broken replication, but often the best option is to create a new replica from scratch, following the instructions in the section "Creating a new replica host".
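For a quick, non-interactive check of the fields discussed above, a one-liner like the following (a sketch) can be run on the replica:
tools-db-R:~$ sudo mariadb -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'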
Changing a Replica to Become the Primary
Official docs: https://mariadb.com/kb/en/changing-a-replica-to-become-the-primary/
Silence alerts in alertmanager
In https://alerts.wikimedia.org/ create a silence matching all ToolsDB alerts:
- @cluster=wmcloud.org
- team=wmcs
- alertname=.*ToolsDB.* (regex)
- Duration: 1 hour should be enough
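If you prefer the command line, an equivalent silence can be created with amtool (a sketch; assumes amtool is configured to point at the right Alertmanager instance):
$ amtool silence add 'alertname=~".*ToolsDB.*"' 'team=wmcs' --duration=1h --comment='ToolsDB maintenance'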
Set the primary to read-only
If the primary server is running, you can flush all data to disk and set the server as read-only by executing the following command at the MariaDB prompt:
FLUSH TABLES WITH READ LOCK;
Stop the primary
Take note of the current position of the primary:
MariaDB (primary)> SELECT @@global.gtid_binlog_pos;
And wait until you have the same position on the replica:
MariaDB (replica)> SHOW SLAVE STATUS\G
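One way to wait for the replica to catch up (a sketch) is to poll its applied GTID position until it matches the value recorded on the primary:
tools-db-R:~$ watch -n5 "sudo mariadb -e 'SELECT @@global.gtid_slave_pos'"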
When the replica is up to date, you can take the primary down. This should be done on the same connection where you executed FLUSH TABLES WITH READ LOCK in the previous step:
MariaDB (primary)> SHUTDOWN;
Promote the replica to primary
Stop replication on the replica and reset its replication settings:
STOP SLAVE;
RESET SLAVE ALL;
Save a copy of the values shown by the following commands, as you might need them later:
SHOW MASTER STATUS;
SELECT @@global.gtid_binlog_pos;
To promote this host to primary, you need to make the host writable:
SET GLOBAL read_only=OFF;
To check that the host is writable, verify that the following value is 0:
SELECT @@read_only;
Change the DNS record to point to the new primary
The DNS record for tools.db.svc.wikimedia.cloud is managed with OpenTofu. To change it, make a Merge Request in the tofu-infra repo, merge it and apply it.
SSH to the Toolforge bastion and verify that the hostname resolves to the new IP:
$ ssh login.toolforge.org
tools-bastion-13:~$ dig +short tools.db.svc.wikimedia.cloud
Change the floating IP to point to the new primary
The IP 185.15.56.15 is a floating IP in the tools project in Cloud VPS. It is used by the maintain_dbusers script and needs to point to the new primary. You can change it from Network → Floating IPs in Horizon.
Note: using a floating IP also for tools.db.svc.wikimedia.cloud is tracked in phab:T381272.
Update the Hiera key for pt-heartbeat
Replication lag is measured using timestamps that pt-heartbeat periodically writes to the heartbeat.heartbeat table.
Modify the Hiera key profile::wmcs::services::toolsdb::primary_server: tools-db-1.tools.eqiad1.wikimedia.cloud in https://horizon.wikimedia.org/project/prefixpuppet/ and point it to the new primary. This will enable the pt-heartbeat-wikimedia script, which tracks the replication lag between the ToolsDB primary and the ToolsDB replica.
After this change, run-puppet-agent in the new primary should enable the wmf-pt-heartbeat systemd unit. Check it's running with:
tools-db-B:~$ sudo systemctl status pt-heartbeat-wikimedia
Finally, check that the heartbeat table is getting updated:
SELECT * FROM heartbeat.heartbeat;
You should see two rows with two different values for server_id: one for tools-db-A (the old primary) and one for tools-db-B (the new primary). Only the row for tools-db-B will keep being updated, so you can drop the old row with the tools-db-A id:
DELETE FROM heartbeat.heartbeat WHERE server_id = {OLD_SERVER_ID};
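If you are unsure which server_id belongs to the new primary (and therefore which row to keep), you can compare against the local value (a quick sketch):
SELECT @@server_id;  -- run on the new primary to see its own server_id
DELETE FROM heartbeat.heartbeat WHERE server_id != @@server_id;  -- keep only the local row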
Note: the pt-heartbeat-wikimedia script runs as the root user in MariaDB, and it can update the heartbeat table even on a server where read_only=1. This can sometimes be confusing; we should make sure that pt-heartbeat respects the read_only setting, tracked in phab:T381690.
(Optional) Change the old primary to become the new replica
If the old primary does not have any issues, we can configure it to become the new replica and start replicating from the new primary. This way we will again have a working primary-replica setup, with the original roles inverted.
In the old primary, use the following commands to start replicating from the new primary:
STOP SLAVE;
RESET MASTER;
RESET SLAVE;
CHANGE MASTER TO MASTER_HOST='tools-db-X.tools.eqiad1.wikimedia.cloud', MASTER_USER='repl', MASTER_PASSWORD='xxxxxx', MASTER_USE_GTID=slave_pos;
START SLAVE;
Note: the password for the repl user is stored in the secret repl-password in Pwstore.
Remember to also update the DNS record tools-readonly.db.svc.wikimedia.cloud to point to the new replica host.
If for any reason the old primary cannot become the new replica (e.g. errors, data corruption, running an old version), we can instead create a new replica from scratch by following the instructions in the paragraph "Creating a new replica host". In this case, remember to delete the old primary host and any data volume or snapshot associated with it.
Check that no alerts are firing
Finally, check there are no ToolsDB alerts firing in https://alerts.wikimedia.org/ and remove any silence you created during this procedure.
Creating a new replica host
Note: replace tools-db-A with the host you want to copy data from (it can be the primary or another replica), and tools-db-B with the new replica host.
Create the new replica host
cloudcumin1001:~$ sudo cookbook wmcs.vps.create_instance_with_prefix --project tools --prefix tools-db --security-group toolsdb --task-id Txxxxxx --image debian-12.0-bookworm
Note: the cookbook will automatically set the hostname to tools-db-B, where B = last_prefix_member_id + 1.
Note: the cookbook will automatically set the flavor to be the same as the latest existing instance. Double-check that it's using a "maxiops" flavor.
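To double-check the flavor from the CLI (a sketch; assumes a configured OpenStack client with access to the tools project):
$ openstack server show tools-db-B -c name -c flavor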
Refresh the Puppet certificates for the new host
cloudcumin1001:~$ sudo cookbook wmcs.vps.refresh_puppet_certs --fqdn tools-db-B.tools.eqiad1.wikimedia.cloud --task-id Txxxxxx
After running the command above, make sure that Puppet runs without errors in the new host:
tools-db-B:~$ sudo run-puppet-agent
Create a new empty Cinder volume
In Horizon, create a new volume with size 4000, type high-iops, and name tools-db-B-data: this will be the data volume for the new instance.
Attach it to the VM by clicking "manage attachments" and mount it using the wmcs-prepare-cinder-volume script:
tools-db-B:~$ sudo wmcs-prepare-cinder-volume --device sdb --mountpoint /srv/labsdb
Note: it's important to use wmcs-prepare-cinder-volume instead of simply using mkfs and mount, because the script will also add the volume UUID to /etc/fstab.
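To confirm that the volume is mounted and the fstab entry was added (a quick sketch):
tools-db-B:~$ findmnt /srv/labsdb
tools-db-B:~$ grep /srv/labsdb /etc/fstab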
Create a Cinder snapshot of the source data volume
SSH to the source host tools-db-A and from a mariadb prompt make sure data is flushed to disk (official docs):
tools-db-A:~$ sudo mariadb
BACKUP STAGE START;
BACKUP STAGE BLOCK_COMMIT;
SELECT @@gtid_current_pos; /* Record this value somewhere, you will need it later! */
/* Keep this SQL session open... */
Go to Horizon and create a snapshot of the tools-db-A-data volume, and name it tools-db-A-data-snap.
Then signal to MariaDB that the backup is done; this will release the lock and allow all the paused queries to complete:
/* ...Using the same SQL session from above */
BACKUP STAGE END;
Create a temporary volume for data transfer
In Horizon, create a new volume with name tools-db-B-data-temp, based on the snapshot that you created in the previous step.
Attach it to tools-db-B by clicking "manage attachments", then mount it:
tools-db-B:~$ sudo mkdir -p /mnt/tools-db-data-temp
tools-db-B:~$ sudo mount /dev/sdc /mnt/tools-db-data-temp
Note: if you get an error, try using /dev/sdc1 instead of /dev/sdc.
Note: this temporary volume is needed so that we can eventually delete the snapshot: Cinder does not let you delete a snapshot unless you delete all the volumes derived from the snapshot.
Copy all the contents from the snapshot to the new data volume
tools-db-B:~$ sudo apt install rsync
tools-db-B:~$ sudo rsync -a --info=progress2 /mnt/tools-db-data-temp/data/ /srv/labsdb/data
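As a quick sanity check after the copy (a sketch), compare the size of the source and destination directories:
tools-db-B:~$ sudo du -sh /mnt/tools-db-data-temp/data /srv/labsdb/data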
Start the replication
Ensure the correct ownership on the files that you have copied:
tools-db-B:~$ sudo chown -R mysql:mysql /srv/labsdb/*
Start MariaDB, then configure the new instance to replicate from tools-db-A:
tools-db-B:~$ sudo systemctl start mariadb
tools-db-B:~$ sudo mariadb
MariaDB [(none)]> RESET SLAVE;
MariaDB [(none)]> SET GLOBAL gtid_slave_pos = "{the value you recorded when taking the snapshot}";
MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='tools-db-A.tools.eqiad1.wikimedia.cloud', MASTER_USER='repl', MASTER_PASSWORD='{password for the repl user}', MASTER_USE_GTID=slave_pos;
MariaDB [(none)]> START SLAVE;
Note: the password for the repl user is stored in the secret repl-password in Pwstore.
Verify that replication is working correctly
Follow the instructions at Checking_the_replication_status.
If the Grafana dashboard is not showing the new host in the "server" dropdown menu, check that prometheus-mysqld-exporter is running with systemctl status prometheus-mysqld-exporter.
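You can also verify that the exporter is serving metrics locally (a sketch; assumes the default mysqld exporter port 9104):
tools-db-B:~$ curl -s localhost:9104/metrics | head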
Initially, the new replica will be lagging a few hours behind the source (because the snapshot was taken a few hours earlier). This will trigger the alert ToolsToolsDBReplicationLagIsTooHigh: go to https://alerts.wikimedia.org/ and add a silence for that alert.
Unmount, unattach and delete the temporary volume
After you are satisfied the new replica is working correctly, remove the temporary volume and snapshot, as they're not needed any more.
Unmount the volume:
tools-db-B:~$ sudo umount /mnt/tools-db-data-temp
In Horizon, detach and delete the volume tools-db-B-data-temp.
Now open the snapshot list and delete the snapshot tools-db-A-data-snap that you created a few steps above.
(Optional) Update tools-readonly DNS
If you want the new replica to become the canonical ToolsDB replica used by Quarry and other users running read-only queries against ToolsDB, update the DNS record tools-readonly.db.svc.wikimedia.cloud to point to the new host.
Upgrading to a newer MariaDB version
Minor version upgrade (e.g. 10.6.19 to 10.6.20)
For a minor version upgrade, it's usually safe to upgrade the APT package in place on the existing servers. Upgrade the replica server first, then the primary.
# systemctl stop mariadb
# apt remove wmf-mariadbXXX
# apt install wmf-mariadbYYY
# systemctl start mariadb
# mysql_upgrade
// Only in replica
# mysql -e "SHOW SLAVE STATUS\G"
// Only in primary
# mysql -e "SET GLOBAL read_only=OFF;"
Major version upgrade (e.g. 10.6 to 10.11)
For major version upgrades, it's better to create a brand new host, install the new version there, and set it to replicate from the current primary. Once you are satisfied the new host is working correctly, you can promote the new host to become the new primary.
Create new hosts
Assuming tools-db-A is the current primary and tools-db-B is the current replica, we are going to create two new hosts: tools-db-C (which will become the new primary) and tools-db-D (which will become the new replica).
- Follow the procedure at Creating a new replica host and create a new host tools-db-C based on a snapshot from tools-db-B, then set it to replicate from tools-db-B
- Follow the procedure at Creating a new replica host and create a new host tools-db-D based on the same snapshot from tools-db-B, then set it to replicate from tools-db-C
This is the replication diagram you should have at this point:
tools-db-A [Primary] → tools-db-B → tools-db-C → tools-db-D
Upgrade MariaDB on both new hosts
On both tools-db-C and tools-db-D, install the new MariaDB version, then check that replication is working correctly.
# systemctl stop mariadb
# apt remove wmf-mariadbXXX
# apt install wmf-mariadbYYY
# systemctl start mariadb
# mysql_upgrade
# mysql -e "START SLAVE;"
# mysql -e "SHOW SLAVE STATUS\G"
Update the IRC topic to inform users
Update the topic in the IRC channel #wikimedia-cloud. If you have the necessary permissions you can use !status ToolsDB is getting upgraded.
Promote the new host to primary
Follow the procedure at #Changing_a_Replica_to_Become_the_Primary and promote tools-db-C to become the new primary.
This is the replication diagram you should have at this point:
tools-db-A [read-only] → tools-db-B → tools-db-C [Primary] → tools-db-D
Update DNS records
The following DNS records must be manually updated:
- tools.db.svc.wikimedia.cloud must be pointed to tools-db-C (you should have done this in the previous step)
- tools-readonly.db.svc.wikimedia.cloud must be pointed to tools-db-D
Set the IRC topic back to "ok"
Update the topic in the IRC channel #wikimedia-cloud. If you have the necessary permissions you can use !status ok.
Delete the old hosts
If everything is working correctly, you can delete the two old hosts tools-db-A and tools-db-B, leaving you with this final state:
tools-db-C [Primary] → tools-db-D
Note: delete the hosts with the wmcs.vps.remove_instance cookbook, which will also remove the corresponding certificates from the Puppetserver.
Re-importing data after replication failures
Old procedure
When the database crashes, bad things can happen. One thing that has happened before is that replication was interrupted in a way that prevented it from just picking up where it left off (see task T253738). When that happens, replication filters for the problem tables need to be added until replication is able to continue, either by using the puppetized config file and restarting mariadb, or by stopping the slave process and setting the global variable Replicate_Wild_Ignore_Table to include more table patterns. The only operation to do on the primary (master) server is mysqldump, which can be done from Toolforge bastions and tool accounts, just to be on the safe side!
For example, if we were adding "s52861__fnbot.review_filter" to the existing four database filters, on the replica (clouddb1002 usually):
MariaDB [(none)]> stop slave;
MariaDB [(none)]> set global Replicate_Wild_Ignore_Table='s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%,s52861\_\_fnbot.review\_filter';
MariaDB [(none)]> start slave;
MariaDB [(none)]> show slave status \G # To check if replication has errored out again
Any added filters should be committed to the puppet template with appropriate comments for the record.
To get the table working again, you need to work with the user to ensure that no writes are coming to the table the whole time while you resync things. Then, export the table with something like (in this case dumping the maj_articles_recents table from the totoazero tool):
tools.totoazero@tools-sgebastion-08:~$ mysqldump --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.eqiad.wmflabs s51245__totoazero maj_articles_recents | gzip > T257273-maj_articles_recents.sql.gz
As you can see, this can be done straight from a tool account, as long as you move the file to where you can download it and send it to the replica server. You can also do this on the active ToolsDB server (usually clouddb1001) as root, if you like, but you will still need to transfer the resulting file to the replica. Once you have scp'd (or similar) the gzipped dump onto the replica server (usually clouddb1002), run the following general procedure on that replica server:
me@clouddb1002 $ sudo -i mysql
MariaDB [(none)]> show slave status \G # Make sure that Seconds_Behind_Master is a numerical value not terribly far from zero, hopefully zero.
MariaDB [(none)]> stop slave;
MariaDB [(none)]> set global Replicate_Wild_Ignore_Table='s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%'; # Remove the table filter that you added earlier from this list. Only remove the one you are fixing now.
MariaDB [(none)]> drop table <offending_db>.<offending_table>;
Drop out of the mysql shell and restore the table you are working on:
me@clouddb1002 $ gunzip tableback.sql.gz # In the example above it would be 'gunzip T257273-maj_articles_recents.sql.gz'
me@clouddb1002 $ sudo -i mysql <user's database> < tableback.sql # In the example above it would be 'sudo -i mysql s51245__totoazero < T257273-maj_articles_recents.sql'
At this point, as long as the user didn't write to that table, you should be in good shape to start replication:
me@clouddb1002 $ sudo -i mysql
MariaDB [(none)]> start slave;
MariaDB [(none)]> show slave status \G # Did it blow up? You may have a problem to troubleshoot. Is Seconds_Behind_Master moving to zero when you run this again? If yes, and it reaches zero, then you are done!