MariaDB/troubleshooting


This guide is a work in progress. See also MariaDB/monitoring and MariaDB/Backups#Recovering_a_Snapshot

Depooling a slave

From cumin1001 or cumin2001:

dbctl instance dbXXXX depool
dbctl config commit -m "Depool dbXXXX"

More example commands at: https://wikitech.wikimedia.org/wiki/Dbctl#Usage

Create a task with the DBA tag so DBAs can follow up, check what happened, and apply a proper fix.

  • Monitor that MySQL connections to that host drop as running queries finish. To do so, connect to the host, run SHOW PROCESSLIST; and check that there are no wikiuser or wikiadmin connections. You can also monitor connections with regular Linux tools such as netstat or ss on port 3306 (or whichever port that MySQL instance uses). Monitoring tools keep checking the host regularly, but they use separate users.

Example:

MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                 | Command | 
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
|       2 | event_scheduler | localhost         | NULL               | Daemon  | 
# replication users, ignore
| 3192579 | system user     |                   | NULL               | Connect | 
| 3192580 | system user     |                   | NULL               | Connect | 
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep   | 
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep   | 
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep   | 
# this is your own connection
| 6311084 | root            | localhost         | NULL               | Query   | 
+---------+-----------------+-------------------+--------------------+---------+-

(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)

  • Except on the dump slave while dumps are being created, or during specific maintenance or long-running tasks, connections should go away within seconds or a few minutes. In an emergency, killing the process (KILL <#ID>) is the way to go. Selects are OK to kill; killing writes and alters can create worse issues because the rollback process kicks in, so be sure of what you kill. Sadly, idle connections sometimes stay connected for a long time.
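The processlist check described above can be sketched in Python. This is a minimal illustration, not an existing tool: the function name and the set of ignored users are assumptions taken from the example output, and the rows are sample data in the shape a DB-API dictionary cursor would return.

```python
# Users that belong to internal processes, replication, monitoring, or the
# operator (taken from the example above). Everything else is treated as
# application traffic. This set is an assumption; adjust it to the host.
IGNORED_USERS = {"event_scheduler", "system user", "watchdog", "root"}


def application_connections(processlist, ignored_users=IGNORED_USERS):
    """Return the SHOW PROCESSLIST rows that still belong to application
    users (for example wikiuser or wikiadmin)."""
    return [row for row in processlist if row["User"] not in ignored_users]


# Sample rows; on a real host these would come from SHOW PROCESSLIST.
sample = [
    {"Id": 2, "User": "event_scheduler", "Command": "Daemon"},
    {"Id": 3192579, "User": "system user", "Command": "Connect"},
    {"Id": 6284249, "User": "watchdog", "Command": "Sleep"},
    {"Id": 6311084, "User": "root", "Command": "Query"},
    {"Id": 6311090, "User": "wikiuser", "Command": "Query"},
]

remaining = application_connections(sample)
print([row["Id"] for row in remaining])  # ids still held by application users
```

An empty result means the host is no longer serving application traffic and is safe for maintenance.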

Master comes back in read only

A master coming back in READ ONLY mode is expected after a crash and it is done to prevent accidental corruption (or even more corruption on a crash).

Unless you know what you are doing, do not set it back to writable: if unsure, call a DBA

Impact

Reads will remain unaffected, but no writes will go through: the wikis on that master will be in read-only mode.

To find out which wikis are those:

ssh to the host: mysql -e "show databases"

What to do

If unsure, call a DBA

If you know what you are doing:

  • Check the reason for the crash:
journalctl -xe -umariadb -f
dmesg
/var/log/messages
  • Check the state of the data:
    • Check errors above
    • Check if tables are marked as corrupted or you see InnoDB crashes
    • Select from the main tables and see if you can get results: revision, text, user, watchlist, actor, comment
    • If storage is the cause, you most likely want to failover to a different host: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Emergency_failover
    • If a memory dimm is the cause, you most likely want to:
      • Disable puppet
      • Reduce innodb buffer pool size on my.cnf
      • Restart mysql
      • Check data from a few tables
      • Check that all slaves are in sync
      • set global read_only=OFF;
      • Create a task to follow up
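The "select from the main tables" check can be sketched as a small helper that builds one cheap query per table. This is only an illustration: the helper name is hypothetical, "enwiki" is an example database name, and the table list follows the one above (using MediaWiki's user table).

```python
# Main tables to probe, following the list above.
MAIN_TABLES = ("revision", "text", "user", "watchlist", "actor", "comment")


def sanity_queries(database, tables=MAIN_TABLES):
    """Build one cheap SELECT per table. Each should return a row quickly
    if the table is readable and not marked as corrupted."""
    return [f"SELECT * FROM `{database}`.`{table}` LIMIT 1" for table in tables]


# "enwiki" is an example; use a database reported by "show databases".
for query in sanity_queries("enwiki"):
    print(query)
```

Running each statement by hand and reading the errors is still the safer option after a crash; the helper only saves typing.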

If this happens during maintenance and it pages, the downtime may have expired or may never have been set. In that case, contact whoever is doing the maintenance. If it is you, remember to run:

set global read_only=OFF;

Depooling a master (a.k.a. promoting a new slave to master)

See: Switch master (most of it still relevant).

Planned switchover

There is a script, switchover.py https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/switchover.py , to be run from a Wikimedia mysql root client (cumin1001 or cumin2001 at the moment), which will automate the most complex steps. However, due to mediawiki dependencies, we still need at the moment to perform some extra steps:

  • Set mediawiki in read only for that master (if possible) or migrate the service away. Normally that means uncommenting a line in db-eqiad.php or db-codfw.php:
'readOnlyBySection' => [
        's1'      => 'English Wikipedia in read only because reasons.',

... pointing the parsercache to another host:

$wmgParserCacheDBs = [
     '10.64.0.12'   => '10.64.32.72',   # pc1004, A3 2.4TB 256GB, temporarily failed over to pc1005 
     '10.64.32.72'  => '10.64.32.72',  # pc1005, C7 2.4TB 256GB

... or depooling it:

$wgDefaultExternalStore = [
       'DB://cluster24',
       # 'DB://cluster25',
];

Once that is deployed, execute switchover.py, with the original master and the target one as parameters:

./switchover.py db1052 db1067

This is an example of a successful output:

Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
 6313 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es3 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /tmp/mysql.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  4.23hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 6313 at es1014.eqiad.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 6313' -----                                                                                               
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  4.46hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 6313'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.010378122329711914 seconds
Servers sync at master: es1014-bin.002508:184384418 slave: es1017-bin.002491:41215873
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section es3 at es1017.eqiad.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                                          
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.29hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
12107 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es3 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.22hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es1017.eqiad.wmnet running with PID 12107
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

This will move the replicas below the other host, and perform the replication changes to migrate the service, while maintaining data consistency. For a faster migration, you could execute the switchover in 2 steps:

./switchover.py --only-slave-move db1052 db1067

which only changes the topology and can be done almost fully hot (it may create temporary replication lag on one server at a time), and when ready:

./switchover.py --skip-slave-move db1052 db1067

Which sets things in read only and does the actual master switchover.
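The two-step variant can be written down as an ordered pair of invocations. The sketch below only builds the command lines from the flags shown above; the helper name is hypothetical and nothing is executed.

```python
import shlex


def two_step_switchover(old_master, new_master, script="./switchover.py"):
    """Return the two switchover.py invocations, in the order they should run."""
    return [
        # Step 1: only move the replicas under the new master (nearly hot).
        [script, "--only-slave-move", old_master, new_master],
        # Step 2: set read only and perform the actual master switch.
        [script, "--skip-slave-move", old_master, new_master],
    ]


for argv in two_step_switchover("db1052", "db1067"):
    print(shlex.join(argv))
```

Keeping the steps as a list makes it harder to run them out of order or with the two hosts swapped.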

Finally, set the service back in read-write/update master configuration by deploying mediawiki (dbctl, as of writing this, is not yet part of the switchover script).

A checklist of things to do or check after a successful switchover (some of them are already done by the script):

  • You can perform an edit on the section you just switched over
  • No further errors on logstash (there will be some that are unavoidable due to the read only period)
  • Semi-sync is enabled on new master and disabled on old master
  • Make sure tendril and zarcillo (dbtree, dbmonitor) have the correct master in their database (in the future this should happen automatically by switchover.py)
  • Update dns example: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/439533/ (these dns aliases are not used)
  • Patch prometheus, dblists example: https://gerrit.wikimedia.org/r/#/c/operations/software/+/439534/ (this should happen automatically in the future, based on zarcillo)
  • Enable GTID on all the replicas, make sure the master is not replicating from anywhere
  • Create a decommissioning ticket for the OLD host, if necessary
  • Ensure all replicas and masters have the right events on the ops database (events_coredb_slave.sql, events_coredb_master.sql) (in the future this should happen automatically by switchover.py)
  • Update/resolve phabricator ticket about failover

A full list of manual steps can be found at: MariaDB#Production_section_failover_checklist

Emergency failover

If the master is not available, or replication is broken, the case is more complex. The reason is that slaves will have executed different numbers of transactions and will be in close, but different, states. E.g. slave1 has executed transaction A, while slave2 has executed transactions A, B and C. In addition, if we do not have access to the master's binary log (or it was not properly synchronized to disk after a crash), we will have to recover from a slave. In theory, with semi-sync replication, no transaction will be lost and at least one slave will have each change, but all other slaves will be at different coordinates (and binary log positions are local to each master).

Scenario 1 -master is recoverable: just wait until the master restarts, it will avoid headaches and be faster and less disruptive than trying to failover it.

Scenario 2 -master is not recoverable, but its binary log is (and all slaves have a less or equal amount of data):

  1. For each slave: send the master's binary log events starting from that slave's last Exec_Master_Log_Pos, so all slaves reach the same starting state
  2. Follow regular failover steps as mentioned in the scheduled maintenance

Scenario 3 -neither master is recoverable nor its binary logs (or a master binary log is behind a slave binary log): We need to put all servers in the same state, using the most up-to-date slave, then perform the regular failover process. This is the most complicated part without using GTIDs:

  1. Identify the most up to date slave by comparing Exec_master_log_pos
  2. By comparing binary logs, find, for each of the other slaves, the coordinate in the most up-to-date slave's binlog that corresponds to the last transaction that slave executed. This is the tricky part. pt-heartbeat should be able to help find this.
  3. Execute the pending transactions on each slave
  4. Follow the regular steps for regular scheduled maintenance

Again, these steps can be automated.
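Step 1 of scenario 3 (finding the most up-to-date slave) could look like the sketch below. It assumes each slave's SHOW SLAVE STATUS row is available as a dictionary with the standard Relay_Master_Log_File and Exec_Master_Log_Pos fields; the function names and sample values are illustrative only.

```python
def binlog_coordinate(status):
    """Turn a SHOW SLAVE STATUS row into a sortable (file number, position) pair."""
    # Binary log names end in a numeric sequence, e.g. 'es1014-bin.002508'.
    sequence = int(status["Relay_Master_Log_File"].rsplit(".", 1)[1])
    return (sequence, int(status["Exec_Master_Log_Pos"]))


def most_up_to_date(slaves):
    """Return the name of the slave that executed the furthest master coordinate."""
    return max(slaves, key=lambda name: binlog_coordinate(slaves[name]))


# Sample data: slave3 has a large position, but in an older binlog file.
slaves = {
    "slave1": {"Relay_Master_Log_File": "es1014-bin.002508", "Exec_Master_Log_Pos": 184384000},
    "slave2": {"Relay_Master_Log_File": "es1014-bin.002508", "Exec_Master_Log_Pos": 184384418},
    "slave3": {"Relay_Master_Log_File": "es1014-bin.002507", "Exec_Master_Log_Pos": 999999999},
}
print(most_up_to_date(slaves))  # prints "slave2"
```

The file number has to be compared before the position, because positions restart with every new binary log file.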

Replication lag

See also MySQL#Replication lag for additional tips.

Caused by hardware

This is what a half-failing disk looks like in monitoring (small lag until it becomes critical).

One common cause of lag that is easy to check and repair is hardware issues.

Disks about to fail

Databases have a lot (and I mean a lot) of IO pressure, and while it is not insane, it means that 3-year-old drives are very prone to failure.

As an operator, you are already familiar with the way drives fail (not very reliably, to be honest). All important databases have a hardware RAID, which means 1 disk can fail at a time, usually with very little impact. When that happens, the icinga alert "1 failed LD(s) (Degraded)" should tell you it is time to replace at least one disk. Usually there are spares onsite or the servers are under warranty, which means you can create a ticket for ops-eqiad or ops-codfw and let Chris or Papaul know that they should take the disk out and insert a new one; the hw RAID should then reconstruct itself automatically.

To check the RAID status, execute:

 megacli -AdpAllInfo -aALL

And check the section "Devices present"

To identify the particular disk

 megacli -PDList -aALL

Check in particular the Firmware State (on or off), the S.M.A.R.T. alerts, and the number of medium errors (a few, like a dozen, should not affect performance much, but hundreds of errors in a short timespan are an issue).
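As a rough aid for reading long -PDList output, the sketch below picks out the drives whose error count is high. The field labels ("Enclosure Device ID", "Slot Number", "Media Error Count", "Firmware state") are assumed from typical megacli output and may differ between versions; the threshold of 100 is an assumption standing in for "hundreds of errors".

```python
def disks_with_media_errors(pdlist_output, threshold=100):
    """Parse 'megacli -PDList -aALL' text and return the drives whose
    media error count is at or above the threshold."""
    disks = []
    current = {}
    for line in pdlist_output.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            # A new physical drive section starts here.
            current = {"enclosure": value}
            disks.append(current)
        elif key == "Slot Number":
            current["slot"] = value
        elif key == "Media Error Count":
            current["media_errors"] = int(value)
        elif key == "Firmware state":
            current["state"] = value
    return [d for d in disks if d.get("media_errors", 0) >= threshold]


# Sample text in the assumed format, not output from a real host.
sample_pdlist = """\
Enclosure Device ID: 32
Slot Number: 0
Media Error Count: 3
Firmware state: Online, Spun Up

Enclosure Device ID: 32
Slot Number: 1
Media Error Count: 412
Firmware state: Online, Spun Up
"""

for disk in disks_with_media_errors(sample_pdlist):
    print(f"[{disk['enclosure']}:{disk['slot']}] {disk['media_errors']} media errors")
```

The enclosure:slot pair printed here is in the same form that the -PhysDrv argument expects.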

Sadly, disks fail in very creative ways. While our RAID controllers are reliable enough to 1) continue despite medium errors and 2) disable the disk when it fails completely, a disk in an "almost failing" state can cause lag issues. If that is the case, executing:

 megacli -PDOffline -PhysDrv \[#:#\] -aALL

where #:# is enclosure:slot, will take the particular physical drive offline so that it can be replaced later.

Bad or defective BBU

If all the disks look good, the RAID controller may have gone into WriteThrough mode because of a failed BBU or because it is in a learning cycle (which shouldn't happen, because learning cycles are disabled in our environment). If the Cache Policy is set to WriteThrough, it will dramatically affect performance. To check the Current Policy (that is, the active one):

megacli -LDInfo -LAll -aAll | grep "Cache Policy:"

If it is not in WriteBack mode, it means (most likely) that the BBU has failed for some reason and the default is to switch back to WriteThrough as it is safer. You can check the BBU status with:

megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' -e '^Charger Status' -e '^Remaining Capacity' -e 'Charging'

If you are in an emergency, you can always force WriteBack, but this can lead to data loss if there is a power failure, so use it carefully:

megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll
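The Cache Policy check from this section can be scripted along the following lines. The sketch assumes the -LDInfo output contains lines starting with "Current Cache Policy:" whose first comma-separated value is WriteBack or WriteThrough; the function names and the sample text are illustrative.

```python
def current_cache_policies(ldinfo_output):
    """Return the active write policy of each logical drive found in
    'megacli -LDInfo -LAll -aAll' output."""
    policies = []
    for line in ldinfo_output.splitlines():
        label, _, value = line.partition(":")
        # Skip "Default Cache Policy"; only the current one is in effect.
        if label.strip() == "Current Cache Policy":
            policies.append(value.split(",")[0].strip())
    return policies


def all_writeback(ldinfo_output):
    """True only if at least one logical drive was found and all are in WriteBack."""
    policies = current_cache_policies(ldinfo_output)
    return bool(policies) and all(policy == "WriteBack" for policy in policies)


# Sample text in the assumed format, not output from a real host.
sample_ldinfo = """\
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""

print(current_cache_policies(sample_ldinfo))
print(all_writeback(sample_ldinfo))
```

A default of WriteBack with a current policy of WriteThrough, as in the sample, is the pattern that points at the BBU.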

Data loss

Normal reprovisioning

The best (and fastest) way to repair a slave (or a master) is to use the regular provisioning workflow - to copy the latest snapshot from dbprovXXXX hosts. Depending on the section it should take minutes to 1h.

If only a partial recovery is needed (single dropped table), logical backups (on the same dbprov* hosts) may be faster and more flexible. A full logical recovery can take from 12 hours to 1 day.

More info at MariaDB/Backups.

Long term backups

If for some reason the short term backups/provisioning files are not enough, you can get those from bacula. Recover from bacula to dbprovXXXX then use the same method as above.

More info at Bacula.

Cloning

If for some reason no backups are available, but replicas are, we can clone a running mariadb server with xtrabackup or the files of a stopped one into another host (see transfer.py utility for both file and xtrabackup transfers).

Binlogs

Binlogs are not a backup method, but they are files containing the transactions from the last month, on every master and replica. They are helpful for point in time recovery if replication is not working, allowing a backup to be rolled forward to an arbitrary point in time (e.g. just before a DROP was sent).

Point in time recovery at the moment is fully manual but its automation is a work in progress.
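As an illustration of what the manual procedure usually involves, the sketch below builds a mysqlbinlog command that replays events up to a chosen moment. It uses the standard --start-position and --stop-datetime options; the helper name, binlog file names, position, and timestamp are hypothetical, and this is not the automation mentioned above.

```python
import shlex


def pitr_replay_command(binlog_files, start_position, stop_datetime):
    """Build a mysqlbinlog command that replays events from start_position
    (which applies to the first file) up to, but not including, the first
    event at or after stop_datetime."""
    return [
        "mysqlbinlog",
        f"--start-position={start_position}",
        f"--stop-datetime={stop_datetime}",
        *binlog_files,
    ]


# Hypothetical values: the position is where the restored backup ends, and
# the timestamp is just before the unwanted statement was sent.
command = pitr_replay_command(
    ["db1052-bin.002507", "db1052-bin.002508"],
    start_position=184384418,
    stop_datetime="2019-06-01 11:59:59",
)
print(shlex.join(command))
```

The decoded events would then be piped into a mysql client on the host being recovered, after checking by hand that the stop point really precedes the unwanted statement.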

Data inconsistency between nodes "drift" / replication broken

The compare.py utility allows you to manually check the differences between 2 hosts. Right now it is run manually, but it is scheduled to run constantly, comparing hosts for inconsistencies.
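To make the idea of a host-to-host comparison concrete, here is a minimal sketch that compares the rows of one table as fetched from two hosts. It is not how compare.py is implemented; the function name and the sample rows are purely illustrative.

```python
def differing_keys(rows_a, rows_b):
    """Return the primary keys whose rows differ between two hosts.

    rows_a and rows_b map primary key -> row tuple for the same table and
    key range, as fetched from each host.
    """
    all_keys = set(rows_a) | set(rows_b)
    # A key missing on one side yields None there, so it is reported too.
    return sorted(key for key in all_keys if rows_a.get(key) != rows_b.get(key))


host_a = {1: ("Foo", 10), 2: ("Bar", 20), 3: ("Baz", 30)}
host_b = {1: ("Foo", 10), 2: ("Bar", 21)}
print(differing_keys(host_a, host_b))  # prints [2, 3]
```

On real tables the rows would be fetched in bounded primary-key ranges rather than all at once, so that the comparison does not load the hosts.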

Aside from a manual check, the other most common way to find inconsistencies is for replication to break as a change is applied to a master that cannot be applied to the replica.

Steps to perform:

  • If an inconsistency happens on one replica, it is likely that the host's data got corrupted; depool it and research why it happened. If the issue was limited to that replica, wipe its data and recover from the provisioning hosts.
  • If it happens on all replicas, check whether there is master corruption or another operational error. An operational error such as "a table exists on the master and not on the others" can be corrected manually (e.g. by creating the table manually). Otherwise, fail over the master so it can be depooled, and continue with a different host as the new master.
  • In all cases, it should be clear which host or set of hosts has the right data; the "bad host(s)" should be taken out of production, their data deleted, and the hosts reprovisioned from backup.

NEVER use sql_slave_skip_counter! Not only will you skip full transactions (even though maybe only a single row is problematic, creating more drift issues), you will also make hosts have different GTID counters. If you have to manually change something on only 1 host, apply the DML with set sql_log_bin=0 so it doesn't go to the binlog/GTID counter/replication.
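The sql_log_bin advice can be made explicit as an ordered list of statements for a single session. This is a sketch only: the helper name is hypothetical, and the DELETE statement and its table are made-up placeholders.

```python
def local_only_statements(dml):
    """Wrap a DML statement so that it is applied on this host only and is
    not written to the binary log (and so is not replicated)."""
    return [
        "SET SESSION sql_log_bin = 0",  # stop logging for this session only
        dml,
        "SET SESSION sql_log_bin = 1",  # restore logging before doing anything else
    ]


# Placeholder DML; replace with the actual single-host correction.
for statement in local_only_statements("DELETE FROM example_table WHERE id = 42"):
    print(statement + ";")
```

All three statements have to run in the same connection, since the setting is per session.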

Split brain

There is an additional case: a "split brain" has happened and correct data was written to both master and replicas.

This is rare given that all replicas are set in read only to prevent this issue. It is also difficult to handle: ideally, data should be merged into a single unified version.

If the data affected is derived (non-canonical), e.g. *links tables, you could temporarily choose a single set of servers, go back to read/write, and then try to merge the differences in the background.

If canonical data is affected (page, revision, user), consider setting the application/section in read only until the data is reconciled, so that no new versions are added that could make merging the 2 data versions more complicated.

Depooling a Labs dbproxy

The first step is to depool it from the Wiki Replicas DNS. Once that is done, if you are depooling dbproxy1010, all the traffic will go to dbproxy1011, which only has one server active. The other one is a backup host, as can be seen in the hiera file that lives in our puppet repo:

cat hieradata/hosts/dbproxy1011.yaml
profile::mariadb::proxy::master::primary_name: 'labsdb1009'
profile::mariadb::proxy::master::primary_addr: '10.64.4.14:3306'
profile::mariadb::proxy::master::secondary_name: 'labsdb1010'
profile::mariadb::proxy::master::secondary_addr: '10.64.37.23:3306'

That means when dbproxy1010 is depooled, all its traffic will go to labsdb1009. So it is advised to change haproxy configuration temporarily to make labsdb1010 also active (round robin dns). To do so:

ssh dbproxy1011
puppet agent --disable "Changing haproxy temporarily"
vim /etc/haproxy/conf.d/db-master.cfg

Replace the line:

server labsdb1010 10.64.37.23:3306 check backup

With:

server labsdb1010 10.64.37.23:3306 check inter 3s fall 20 rise 99999999

Reload HAProxy:

systemctl reload haproxy
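The manual edit of db-master.cfg can also be expressed as a small text transformation, which is handy for previewing the change before touching the file. This is a sketch under assumptions: the function name is hypothetical, and the labsdb1009 line in the sample is a guess at what the active server's line looks like.

```python
def activate_backup_server(config_text, server_name,
                           options="check inter 3s fall 20 rise 99999999"):
    """Rewrite the 'server <name> <addr> ... backup' line so that the server
    becomes active, using the options shown in the manual step above."""
    new_lines = []
    for line in config_text.splitlines():
        fields = line.split()
        if (len(fields) >= 3 and fields[0] == "server"
                and fields[1] == server_name and "backup" in fields):
            # Keep the original indentation and address, replace the options.
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}server {server_name} {fields[2]} {options}"
        new_lines.append(line)
    return "\n".join(new_lines)


# Sample config fragment; the first line is assumed, the second is from above.
sample_cfg = (
    "    server labsdb1009 10.64.4.14:3306 check inter 3s fall 20 rise 99999999\n"
    "    server labsdb1010 10.64.37.23:3306 check backup"
)
print(activate_backup_server(sample_cfg, "labsdb1010"))
```

Because puppet is disabled while this change is in place, remember to revert the file and re-enable puppet once dbproxy1010 is pooled again.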