MariaDB/Backups/Example recovery
Appearance
Data recovery plan
1) Recover up to the drop on db1095: db1109-bin.003832:611995336
transfer.py --type=decompress dbprov1001.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s8.2020-04-05--19-00-02.tar.gz db1095.eqiad.wmnet:/srv/sqldata.s8 # [DONE]
- Setup new instance: https://gerrit.wikimedia.org/r/c/operations/puppet/+/587010 [DONE]
- Start server [DONE]
- setup replication
CHANGE MASTER TO MASTER_HOST='db1109.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD=, MASTER_SSL=1; -- requires right password [DONE]
GTID stuff on file, check replication works.
SET GLOBAL gtid_slave_pos = ; -- requires a good gtid pos [DONE]
- Disable gtid out of precaution
CHANGE MASTER TO MASTER_USE_GTID=no; -- [DONE]
- Start replication until (before) delete
START SLAVE UNTIL MASTER_LOG_FILE = 'db1109-bin.003832', MASTER_LOG_POS = 611995336; -- [DONE]
- Disable consitency options:
SET GLOBAL innodb_flush_log_at_trx_commit=0; -- [DONE]
- Add to zarcillo so it can be monitored on prometheus [DONE]
2) Stop replication fully
STOP SLAVE; [DONE]
3) Rename the table to _old
set sql_log_bin=0; RENAME TABLE wb_items_per_site TO wb_items_per_site_recovered; [DONE]
4) Dump the table in 10 chunks: dbprov1001# dbprov1001# backup-mariadb s8_backup --type=dump --host=db1095.eqiad.wmnet --port=3318 --backup-dir /srv/backups/dumps/ongoing --regex='wikidatawiki[.]wb\_items\_per\_site' --rows=1000000 --user=dump --retention=15 --password= # requires a working --password [DONE]
The right dump is /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21
5) Stop replication on:
- codfw
- labs
- Analytics
6) Recover on eqiad production hosts without replication:
myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host={} --port={} --user=root --password= # requires working --password myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host=db1087.eqiad.wmnet --user=root --password= --enable-binlog # on db1087
(list of servers gotten from zarcillo with the section bash oneliner)
- db1116:3318 (backup source, delayed) DONE
- db2079 (codfw master, delayed) DONE
- db1095:3318 (NOT TO BE DONE, delayed, instance to be removed)
- db1124:3318 (NOT TO BE DONE, will be done on its master)
- db1099:3318 DONE
- db1101:3318 DONE
- db1126 DONE
- db1104 DONE
- db1092 DONE
- db1087 (WITH REPLICATION) DONE
- db1111 DONE
- dbstore1005:3318 (analytics) DONE
- db1109 (master) DONE
7) drop? table and rename old to new:
- Set the master on RO for some time
db1109> set global read_only=1; RENAME TABLE wb_items_per_site TO wb_items_per_site_old; RENAME TABLE wb_items_per_site_recovered TO wb_items_per_site; set global read_only=0;
- Main issue: metadata locking on the slaves
- Depooling hosts (a bunch of them at the time)
- Repooling them after rename
- Depooling the ones that get metadata locked again