MariaDB/Backups/Example recovery

Data recovery plan

1) Recover up to the drop on db1095: db1109-bin.003832:611995336

transfer.py --type=decompress dbprov1001.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s8.2020-04-05--19-00-02.tar.gz db1095.eqiad.wmnet:/srv/sqldata.s8 # [DONE]

Setup new instance: https://gerrit.wikimedia.org/r/c/operations/puppet/+/587010 [DONE]
Start server [DONE]
setup replication

CHANGE MASTER TO MASTER_HOST='db1109.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD=, MASTER_SSL=1; -- requires right password [DONE]

GTID stuff on file, check replication works.

SET GLOBAL gtid_slave_pos = ; -- requires a good gtid pos [DONE]

Disable gtid out of precaution

CHANGE MASTER TO MASTER_USE_GTID=no; -- [DONE]

Start replication until (before) delete

START SLAVE UNTIL MASTER_LOG_FILE = 'db1109-bin.003832', MASTER_LOG_POS = 611995336; -- [DONE]

Disable consitency options:

SET GLOBAL innodb_flush_log_at_trx_commit=0; -- [DONE]

Add to zarcillo so it can be monitored on prometheus [DONE]

2) Stop replication fully

STOP SLAVE; [DONE]

3) Rename the table to _old

set sql_log_bin=0; RENAME TABLE wb_items_per_site TO wb_items_per_site_recovered; [DONE]

4) Dump the table in 10 chunks: dbprov1001# dbprov1001# backup-mariadb s8_backup --type=dump --host=db1095.eqiad.wmnet --port=3318 --backup-dir /srv/backups/dumps/ongoing --regex='wikidatawiki[.]wb\_items\_per\_site' --rows=1000000 --user=dump --retention=15 --password= # requires a working --password [DONE]

The right dump is /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21

5) Stop replication on:

codfw
labs
Analytics

6) Recover on eqiad production hosts without replication:

myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host={} --port={} --user=root --password= # requires working --password
myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host=db1087.eqiad.wmnet --user=root --password= --enable-binlog # on db1087

(list of servers gotten from zarcillo with the section bash oneliner)

db1116:3318 (backup source, delayed) DONE
db2079 (codfw master, delayed) DONE
db1095:3318 (NOT TO BE DONE, delayed, instance to be removed)
db1124:3318 (NOT TO BE DONE, will be done on its master)
db1099:3318 DONE
db1101:3318 DONE
db1126 DONE
db1104 DONE
db1092 DONE
db1087 (WITH REPLICATION) DONE
db1111 DONE
dbstore1005:3318 (analytics) DONE
db1109 (master) DONE

7) drop? table and rename old to new:

- Set the master on RO for some time

db1109> set global read_only=1; RENAME TABLE wb_items_per_site TO wb_items_per_site_old; RENAME TABLE wb_items_per_site_recovered TO wb_items_per_site; set global read_only=0;

Main issue: metadata locking on the slaves
- Depooling hosts (a bunch of them at the time)
- Repooling them after rename
- Depooling the ones that get metadata locked again