Portal:Toolforge/Admin/Runbooks/ToolsDBReplication

From Wikitech

This runbook covers different types of problems with ToolsDB replication from the primary instance to the replica instance.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager. We currently have three alerts for different replication problems:

  • ToolsDBReplicationMissing (replication is not running as expected)
  • ToolsDBReplicationError (replication is erroring)
  • ToolsDBReplicationLagIsTooHigh (replication lag is too high)

The alert labels will also indicate which instance is affected.

Debugging

Checking the replication status

The alert is usually triggered on the replica instance (currently tools-db-2). You can SSH to that instance and check the replication status:

dcaro@vulcanus$ ssh tools-db-2.tools.eqiad1.wikimedia.cloud
dcaro@tools-db-2:~$ sudo mariadb
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 49
Server version: 10.4.29-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW REPLICA STATUS\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
...
               Master_Log_File: log.014140
           Read_Master_Log_Pos: 4489664
                Relay_Log_File: tools-db-2-relay-bin.000002
                 Relay_Log_Pos: 725
         Relay_Master_Log_File: log.014015
              Slave_IO_Running: Yes  <-- if this is no, the slave is not running
             Slave_SQL_Running: Yes  <-- same for this
...
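If you just want the key fields without an interactive session, a short pipeline works too (a sketch; the field names match the MariaDB 10.4 output shown above):

```shell
# One-shot replication health check on the replica (run with sudo).
# Prints only the fields relevant to these alerts.
sudo mariadb -e 'SHOW REPLICA STATUS\G' \
  | grep -E 'Slave_(IO|SQL)_Running:|Seconds_Behind_Master:|Last_(IO|SQL)_Error:'
```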

If the replication is running

If both Slave_IO_Running and Slave_SQL_Running are "Yes", the replication is working, but the replication lag has increased beyond the alert threshold. The first thing to check is this Grafana chart, which shows the evolution of the replication lag over the last few hours (and days/weeks if you change the time window).

Most of the time the replication lag will eventually decrease by itself until it gets back to zero. It's still worth identifying what caused the issue, so please open a Phabricator task like phab:T343819.
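To confirm the lag is actually trending down, you can also watch Seconds_Behind_Master directly on the replica (a sketch; the 60-second interval is an arbitrary choice):

```shell
# Re-check the replication lag every 60 seconds; the value should trend
# towards 0 while the replica catches up.
watch -n 60 \
  "sudo mariadb -e 'SHOW REPLICA STATUS\G' | grep Seconds_Behind_Master"
```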

Below you will find some of the issues that triggered this alert in the past.

If the replication is NOT running

If one of Slave_IO_Running or Slave_SQL_Running is not "Yes", the replication has stopped. Last_SQL_Error or Last_IO_Error will show you the error that caused the replication to stop.

This is a more critical situation that will NOT resolve by itself.

Sometimes restarting the replication is enough to fix the issue, so you should try this first:

ssh tools-db-X.tools.eqiad1.wikimedia.cloud

tools-db-X:~$ sudo mariadb

MariaDB [(none)]> START REPLICA;

If the same error occurs after restarting the replication, you can try analyzing the query that caused the replication to fail and see if it is safe to skip it, or if it can be applied manually. The offending query is often shown in Last_SQL_Error, or you can check the MariaDB logs with journalctl -u mariadb.
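For example, to pull recent replication-related errors out of the service log (the two-hour window is an arbitrary example; widen it if the failure is older):

```shell
# Show MariaDB log lines from the last two hours that mention the
# replication or an error.
sudo journalctl -u mariadb --since '2 hours ago' --no-pager \
  | grep -iE 'slave|replica|error'
```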

In the worst case scenario, you can recreate a new replica from scratch, using a snapshot of the volume holding the primary data. Documentation on how to do it is available at Portal:Data Services/Admin/Toolsdb#Creating a new replica host.

Common issues

Add new issues here when you encounter them!

Transactions way slower on the secondary than the primary

One or more transactions that were applied (somewhat) quickly on the primary take much longer (hours or days) to replicate on the replica instance. This has happened many times in the past, and we haven't found a way to completely avoid it. If this happens again, please file a new Phabricator task as a subtask of phab:T357624.

Finding details about the slow transaction

To find out which transaction is causing the issue, you can use SHOW FULL PROCESSLIST;. You will usually see one of the replication threads stuck on a single query:

|     19 | system user     |           | s51698__yetkin | Slave_worker |    16 | Delete_rows_log_event::find_row(-1) on table `visited_pages_agg` | DELETE FROM visited_pages_agg WHERE vpa_year = '2025' AND vpa_month = '6' AND vpa_day = '28' |    0.000 |
Please note that the TIME column is not indicative of how long that transaction has been running. In row-based replication, a transaction can result in many "row events", and the TIME counter will reset after each event.

Using SHOW FULL PROCESSLIST is usually enough, but you can also check the binary log on the primary host to find more details about the stuck transaction. In the result of SHOW REPLICA STATUS on the replica (see the section above), look at the values of Relay_Master_Log_File and Exec_Master_Log_Pos. Then SSH to the primary and cd into /srv/labsdb/binlogs. There you can use mysqlbinlog to see the corresponding transaction:

dcaro@urcuchillay$ ssh tools-db-1.tools.eqiad1.wikimedia.cloud
dcaro@tools-db-1:~$ sudo -i
root@tools-db-1:~# cd /srv/labsdb/binlogs
root@tools-db-1:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose --start-position $EXEC_MASTER_LOG_POS $RELAY_MASTER_LOG_FILE |less
There is now a cookbook to run the above commands in one line:

cloudcumin1001:~$ sudo cookbook wmcs.toolforge.toolsdb.get_current_replica_transaction

Finally, if the stuck transaction is using an InnoDB table, you can also check the output of SHOW ENGINE INNODB STATUS\G for more details about the stuck transaction.
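The full InnoDB monitor output is long; a sed range can cut it down to just the transactions part (a sketch, assuming the standard TRANSACTIONS and FILE I/O section headers of the InnoDB monitor output):

```shell
# Print only the TRANSACTIONS section of the InnoDB monitor output,
# stopping at the FILE I/O header that follows it.
sudo mariadb -e 'SHOW ENGINE INNODB STATUS\G' \
  | sed -n '/^TRANSACTIONS$/,/^FILE I\/O$/p'
```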

Running a query manually to work around row-based replication

We use row-based replication, which is generally the safest option but has one downside: if a single query on the primary modifies many rows (e.g. DELETE FROM big_table;), the query is translated into many row events in the binlog, and those events can take a long time to replicate.
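You can see this fan-out directly in the binlog on the primary by decoding a file and counting its row events (a sketch; the file name is taken from the example SHOW REPLICA STATUS output above):

```shell
# Count the Delete_rows events in one binlog file on the primary; a single
# "DELETE FROM big_table" statement can account for a very large number.
cd /srv/labsdb/binlogs
sudo mysqlbinlog --base64-output=decode-rows --verbose log.014015 \
  | grep -c 'Delete_rows'
```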

If the stuck transaction is of this type, you can stop the replication, run the query manually on the replica, then resume the replication. This usually completes much faster (seconds, sometimes minutes for very big tables). After running the query, tell the replica to skip one transaction from the binary log (sql_slave_skip_counter = 1) and resume replication as normal:

STOP REPLICA;
SET sql_log_bin = 0;
/* run the stuck query manually, then: */
SET GLOBAL sql_slave_skip_counter = 1;
SET sql_log_bin = 1;
START REPLICA;
STOP REPLICA often gets stuck and does not actually stop the replication. If this happens, try manually killing the active replication thread with KILL {id}, where {id} can be found with SHOW PROCESSLIST. If that is not enough, try killing all Slave_worker threads. If replication still doesn't stop, kill the mysqld process with kill -9.
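One way to kill all Slave_worker threads at once is to generate the KILL statements from the process list (a sketch; review the generated statements before feeding them back into mariadb):

```shell
# Build a KILL statement for every Slave_worker thread. The -e output is
# tab-separated, so the Command column is field 5.
sudo mariadb -e 'SHOW PROCESSLIST' \
  | awk -F'\t' '$5 == "Slave_worker" { print "KILL " $1 ";" }'
# After reviewing the output, pipe it back in to execute it:
#   sudo mariadb -e 'SHOW PROCESSLIST' \
#     | awk -F'\t' '$5 == "Slave_worker" { print "KILL " $1 ";" }' \
#     | sudo mariadb
```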

Slow replication and missing primary key

Replication getting stuck while processing a transaction is often caused by a missing primary key. If the same database or tool causes this issue on multiple occasions, it's probably worth contacting the tool owners and adding the missing primary key, to prevent this from happening again.
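To check whether a suspect database has tables without a primary key, you can query information_schema on the primary (a sketch; the database name is the example tool database from the process list output above):

```shell
# List base tables in the given database that have no PRIMARY KEY
# constraint; an anti-join against table_constraints finds them.
sudo mariadb -e "
  SELECT t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints c
    ON  c.table_schema = t.table_schema
    AND c.table_name   = t.table_name
    AND c.constraint_type = 'PRIMARY KEY'
  WHERE t.table_type = 'BASE TABLE'
    AND t.table_schema = 's51698__yetkin'
    AND c.constraint_name IS NULL;"
```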

Replication stops because of "invalid event"

This happened a couple of times and the following error was shown in the MariaDB logs:

Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug

It's not clear what the root cause was, but restarting the replication with START REPLICA; was enough to fix the issue.

If this happens again, please add a comment to phab:T351457.

Support contacts

The main discussion channel for this alert is #wikimedia-cloud-admin on IRC.

There is no user impact, as users always connect to the primary (as of 2023), so there's no need to notify anyone outside of the Toolforge admins.

If the situation is not clear or you need additional help, you can also contact the Data Persistence team (#wikimedia-data-persistence on IRC).

Old incidents

Add any incident tasks here!