Portal:Data Services/Admin/Toolsdb

From Wikitech
Jump to navigation Jump to search

ToolsDB is a multi-tenant mariadb database service for use by Toolforge tools and maintainers. This page is primarily for WMCS admins for database and server operations on the ToolsDB database cluster that is currently operating in the tools Cloud VPS project. For general information and background, see Portal:Data Services#ToolsDB and Help:Toolforge/Database#User_databases.


ToolsDB is a MariaDB server that runs on a two-host primary-replica configuration.

Both hosts are Cloud VPS virtual machine, where we install our customized MariaDB deb packages using the Puppet profile profile::wmcs::services::toolsdb_primary. We reuse a lot of Puppet classes created by the data persistence team for production MariaDB databases, and add some customization that we need for ToolsDB.

Finding the right hostnames

At the time this wiki page was last updated, the primary host was tools-db-1.tools.eqiad1.wikimedia.cloud and the replica host was tools-db-2.tools.eqiad1.wikimedia.cloud, but these hostnames will change and there is no guarantee that the primary is sorted alphabetically before the replica.

To find the current hostnames, navigate to the tools project in Horizon and look for hosts with name tools-db-*. There should be only one primary and one replica at any given time, unless some maintenance operation is in progress. To find which of the two hosts is the primary, SSH to both, open a MariaDB shell (sudo mariadb) and type SHOW SLAVE STATUS\G.

In the primary host you will see something like:

Empty set (0.000 sec)

In the replica host you will see something like:

*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud

Checking the replication status

SSH to the replica host and from a MariaDB shell (sudo mariadb), type SHOW SLAVE STATUS\G:

*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud


              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes




         Seconds_Behind_Master: 0


1 row in set (0.000 sec)

If both Slave_IO_Running and Slave_SQL_Running are Yes and Seconds_Behind_Master is 0, the replication is working correctly.

If both Slave_IO_Running and Slave_SQL_Running are Yes but Seconds_Behind_Master is greater than zero, it means the replication is working but the replica is out of sync. Most of the times this is not something to worry about, and the replica will catch up automatically. If the replication lag grows too much it can trigger the alert ToolsToolsDBReplicationLagIsTooHigh.

If at least one of Slave_IO_Running and Slave_SQL_Running are not Yes, the replication has stopped. The field Last_Error will usually provide the reason for the failure, but you can also look at the full mariadb logs with sudo journalctl -u mariadb.

There are some tricks to fix a broken replication, but many times the best thing is to create a new replica from scratch, following the instructions in the section "Creating a new replica host".

Failing over to the replica host

Official docs: https://mariadb.com/kb/en/changing-a-replica-to-become-the-primary/

Set the primary to read-only

You might want to notify users, as all write queries will start failing.

If the primary server is running, you can set the server to read-only without restarting the service, by executing the following command on the MySQL prompt:


NOTE: this command will wait for all active transactions to complete. If it's taking more than a few seconds to complete, it's a good idea to manually kill the pending transactions, as there is probably one or more slow queries that might take a very long time to complete. In a new MariaDB shell, you can identify the long transactions using SHOW PROCESSLIST, and kill them with KILL $(ID). When no pending transactions remain, "FLUSH TABLES WITH READ LOCK" will complete in the other window.

Change the DNS entry to point users to the replica

See Portal:Data Services/Admin/Wiki_Replica_DNS for the procedure; the record is in the same location as the configuration files for the replicas.

Promote the replica to primary

Stop replication on the replica in the mysql shell. This is done by running STOP SLAVE;

Save a copy of all the values shown by the following "STATUS" commands, as you might need them later:


Reset all replication on secondary to start clean, from the MySQL prompt:


To promote this host to primary, you need to make the host writable:

SET GLOBAL read_only=OFF;

NOTE: our config includes read-only=ON in my.cnf, which means that every time that the mariadb service is restarted, it is set to read_only automatically. This is for extra safety, so that an explicit command set global read_only=OFF is required to set a server to "writable".

To can check if the host is writable:

SELECT @@read_only;

It should be 0.

Change the old primary to become the new replica

Unless the old primary has some unrecoverable issue, we can configure it to become the new replica and start replicating from the new primary. In this way we will have again a working primary-replica setup, where the original roles have been inverted.

In the old primary, use the following commands to start replicating from the new primary:

CHANGE MASTER TO MASTER_HOST='tools-db-X.tools.eqiad1.wikimedia.cloud', MASTER_USER='repl', MASTER_PASSWORD='xxxxxx', MASTER_USE_GTID=slave_pos;

If for any reason the old primary cannot become the new replica (e.g. errors, data corruption, etc.), we can instead create a new replica from scratch by following the instructions in the paragraph "Creating a new replica host". In this case, remember to delete the old primary host and any data volume or snapshot associated with it.

Creating a new replica host

Assuming tools-db-1 is the current primary, and tools-db-2 is the new replica:

  • Create a new host tools-db-2:
    • fnegri@cloudcumin1001:~$ sudo cookbook wmcs.vps.create_instance_with_prefix --project tools --prefix tools-db --security-group toolsdb --task-id Txxxxxx --image debian-11.0-bullseye
  • Refresh the Puppet certificates and make sure that Puppet runs without errors in the new host:
    • fnegri@cloudcumin1001:~$ sudo cookbook wmcs.vps.refresh_puppet_certs --project tools --fqdn tools-db-3.tools.eqiad1.wikimedia.cloud --task-id Txxxxxx
  • Create a Cinder snapshot of the tools-db-1-data volume using the Horizon web interface
  • Create a new Cinder volume tools-db-2-data-temp, based on that snapshot
    • This temporary volume is needed so that we can eventually delete the Cinder snapshot (Cinder does not let you delete a snapshot unless you delete all the volumes derived from the snapshot)
  • Create a new empty Cinder volume tools-db-2-data (this will be the data volume for the new instance)
  • Attach both volumes to the tools-db-2 instance, and mount them:
    • tools-db-2-data -> /dev/sdb -> /srv/labsdb
    • tools-db-2-data-temp -> /dev/sdc -> /mnt/tools-db-data-temp
    • Note: in phab:T344298 we noticed some issues with /etc/fstab containing a wrong value, which caused the volume not to be mounted correctly after a restart. We need to investigate better how /etc/fstab was populated (maybe by the prepare_volume script?)
  • Copy all the contents from /mount/tools-db-data-temp to /srv/labsdb
  • Unmount, unattach and delete the temporary volume tools-db-2-data-temp
    • Now you can delete the snapshot in the snapshot list for the original volume tools-db-1-data
  • Delete all files in /srv/labsdb/binlogs and /srv/labsdb/tmp and keep only the files in /srv/labsdb/data
  • Set the right file ownership with chown -R mysql:mysql /srv/labsdb/*
  • Start MariaDB with systemctl start mariadb
  • Configure the new instance to replicate from tools-db-1:
$ sudo mariadb

MariaDB [(none)]> RESET SLAVE;

MariaDB [(none)]> CHANGE MASTER TO  MASTER_HOST='tools-db-1.tools.eqiad1.wikimedia.cloud', MASTER_USER='repl',  MASTER_PASSWORD='xxxxxx', MASTER_USE_GTID=slave_pos;

MariaDB [(none)]> START SLAVE;

To verify that replication is working correctly, follow the instructions at Checking_the_replication_status.

Upgrading to a newer MariaDB version

To upgrade to a newer MariaDB version, you can follow this procedure (assuming host-1 is the current primary and host-2 is the current replica):

  • Create a new replica host-3 running the new version, and set it to replicate from the existing replica host-2:
    • host-1 [Primary] --> host-2 --> host-3
  • If everything works fine, create another new replica host-4 running the new version, and set it to replicate from host-3:
    • host-1 [Primary] --> host-2 --> host-3 --> host-4
  • If everything works fine, promote host-3 to become the new primary:
    • host-1 [stopped] --> host-2 --> host-3 [Primary] --> host-4
  • Check that replication from host-3 to host-4 is working correctly, then delete the two old hosts host-1 and host-2:
    • host-3 [Primary] --> host-4

Re-importing data after replication failures

When the database crashes, bad things can happen. One thing that has happened before is that replication was interrupted in a way that prevented it from just picking up where it left off (see task T253738). When that happens replication filters for the problem tables need to be added until replication is able to continue using either the puppetized config file and restarts of mariadb or by stopping the slave process and setting the global variable Replicate_Wild_Ignore_Table to include more table patterns. The only operation to do on the primary or master server is mysqldump, which can be done from Toolforge bastions and tool accounts just to be on the safe side!

eg. If we were adding "s52861__fnbot.review_filter" to the existing four database filters. On the replica (clouddb1002 usually):

MariaDB [(none)]> stop slave;
MariaDB [(none)]> set global Replicate_Wild_Ignore_Table='s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%,s52861\_\_fnbot.review\_filter';
MariaDB [(none)]> start slave;
MariaDB [(none)]> show slave status \G   # To check if replication has errored out again

Any added filters should be committed to the puppet template with appropriate comments for the record.

To get the table working again, you need to work with the user to ensure that no writes are coming to the table the whole time while you resync things.

Then, export the table with something like (in this case dumping the maj_articles_recents table from the totoazero tool):

tools.totoazero@tools-sgebastion-08:~$ mysqldump --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.eqiad.wmflabs s51245__totoazero maj_articles_recents | gzip > T257273-maj_articles_recents.sql.gz

As you can see, this can be done straight from a tool account as long as you move the file to where you can download it and send it to the replica server. You can also do this on the active ToolsDB server (usually clouddb1001) as root, if you like, but you will still need to transfer the resulting file to the replica.

Then you need to ship that resulting gzipped dump to the ToolsDB replica. Once you have scp'd or whatever the file onto the replica server (usually clouddb1002), on that replica server run the following general procedure:

me@clouddb1002 $ sudo -i mysql
MariaDB [(none)]> show slave status \G    # Make sure that Seconds_Behind_Master is a numerical value not terribly far from zero--hopefully zero.
MariaDB [(none)]> stop slave;
MariaDB [(none)]> set global Replicate_Wild_Ignore_Table='s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%';     # Remove the table filter that you added earlier from this list. Only remove the one you are fixing now.
MariaDB [(none)]> drop table <offending_db>.<offending table>;

Drop out of the mysql shell and restore the table you are working on:

me@clouddb1002 $ gunzip tableback.sql.gz   # In the example above it would be 'gunzip T257273-maj_articles_recents.sql.gz'
me@clouddb1002 $ sudo -i mysql <user's database> < tableback.sql  # In the example above it would be 'sudo -i mysql s51245__totoazero < T257273-maj_articles_recents.sql'

At this point, as long as the user didn't write to that table, you should be in good shape to start replication!

me@clouddb1002 $ sudo -i mysql
MariaDB [(none)]> start slave;
MariaDB [(none)]> show slave status \G    # Did it blow up? You may have a problem to troubleshoot. Is Seconds_Behind_Master moving to zero when you run this again? If yes, and it reaches zero, then you are done!