Portal:Toolforge/Admin/Runbooks/ToolsToolsDBReplicationLagIsTooHigh

From Wikitech

Overview

This alert fires when the secondary ToolsDB host is not able to catch up with the primary one.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

The alert will tell you which instances are involved (the primary and the secondary).

Debugging

Checking the replication status

The alert is usually triggered on the replica instance (currently tools-db-2). You can SSH to that instance and check the replication status:

dcaro@vulcanus$ ssh tools-db-2.tools.eqiad1.wikimedia.cloud
dcaro@tools-db-2:~$ sudo mariadb
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 49
Server version: 10.4.29-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
...
               Master_Log_File: log.014140
           Read_Master_Log_Pos: 4489664
                Relay_Log_File: tools-db-2-relay-bin.000002
                 Relay_Log_Pos: 725
         Relay_Master_Log_File: log.014015
               Slave_IO_Running: Yes  <-- if this is No, the replication I/O thread has stopped
              Slave_SQL_Running: Yes  <-- same for the SQL thread
...

If the replication is running

If both Slave_IO_Running and Slave_SQL_Running are "Yes", replication is working, but the replication lag has increased beyond the alert threshold. The first thing to check is this Grafana chart, which shows the evolution of the replication lag over the last few hours (and days/weeks if you change the time window).
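You can also watch the lag from the replica itself rather than Grafana. As a sketch: re-run SHOW SLAVE STATUS in the mariadb shell and watch the Seconds_Behind_Master field.

```sql
-- Run in the mariadb shell on the replica (tools-db-2).
-- Seconds_Behind_Master reports the current replication lag in
-- seconds; re-run this periodically to confirm it is shrinking.
SHOW SLAVE STATUS\G
```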

Most of the time the replication lag will eventually decrease by itself until it gets back to zero. It's still worth identifying what caused the issue, so please open a Phabricator task like phab:T343819.

Below are some of the issues that have triggered this alert in the past.

If the replication is NOT running

If either Slave_IO_Running or Slave_SQL_Running is not "Yes", replication has stopped. Last_SQL_Error will show the error that caused replication to stop. The separate critical alert ToolsDBReplication should also have fired.

This is a more critical situation that will NOT resolve by itself. You can try analyzing the query that caused the replication to fail and see if it is safe to skip, or if it can be applied manually.
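If you conclude the failing statement is safe to skip (for example, a duplicate-key error for a row that already exists on the replica), one common approach, sketched here as a standard MariaDB technique rather than a procedure specific to this runbook, is to skip a single event and restart replication:

```sql
-- Run on the replica. Only do this after confirming the statement
-- reported in Last_SQL_Error is safe to skip.
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;  -- skip exactly one event
START SLAVE;
-- Then check that both threads are running again:
SHOW SLAVE STATUS\G
```

Skipping events silently diverges the replica from the primary, so record what was skipped in the incident task.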

In the worst case scenario, you can recreate a new replica using a snapshot of the volume holding the primary data. Documentation on how to perform this is not available yet, though you might find some useful information in phab:T329521.

Common issues

Add new issues here when you encounter them!

Transactions way slower on the secondary than the primary

One or more transactions that were applied (somewhat) quickly on the primary take much longer (hours or days) to replicate on the replica instance (see for example phab:T341891). According to this Stack Overflow thread, this behaviour can be caused by a missing index on the id column.

If the same database or tool causes this issue on multiple occasions, it's probably worth contacting the tool owners and trying to add the missing index, to prevent it from happening again. To find out which database is causing the issue, you can check the binary log. In the output of SHOW SLAVE STATUS on the replica (see the section above), look at the values of Relay_Master_Log_File and Exec_Master_Log_Pos. Then SSH to the primary and cd into /srv/labsdb/binlogs. There you can use mysqlbinlog to see the corresponding transaction:

dcaro@urcuchillay$ ssh tools-db-1.tools.eqiad1.wikimedia.cloud
dcaro@tools-db-1:~$ sudo -i
root@tools-db-1:~# cd /srv/labsdb/binlogs
root@tools-db-1:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose --start-position $EXEC_MASTER_LOG_POS $RELAY_MASTER_LOG_FILE |less
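If the binlog points at a table that is repeatedly slow to replicate because it lacks an index on the id column, the eventual fix (coordinated with the tool owners) would look something like the following; the database, table, and index names here are hypothetical placeholders:

```sql
-- Hypothetical example: s12345__mytool and mytable are placeholders.
-- An index on id lets the replica locate rows directly instead of
-- scanning the whole table for every replicated row event.
ALTER TABLE s12345__mytool.mytable ADD INDEX idx_id (id);
```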

Related information

Support contacts

The main discussion channel for this alert is #wikimedia-cloud-admin on IRC.

There is no user impact as users are always connecting to the primary (as of 2023), so there's no need to notify anyone outside of the Toolforge admins.

If the situation is not clear or you need additional help, you can also contact the Data Persistence team (#wikimedia-data-persistence on IRC).

Old incidents

Add any incident tasks here!