MariaDB/Backups/Example recovery

From Wikitech

Data recovery plan

1) Recover up to the drop on db1095: db1109-bin.003832:611995336

transfer.py --type=decompress dbprov1001.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s8.2020-04-05--19-00-02.tar.gz db1095.eqiad.wmnet:/srv/sqldata.s8 # [DONE]
CHANGE MASTER TO MASTER_HOST='db1109.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD=, MASTER_SSL=1; -- requires right password [DONE]

GTID stuff on file, check replication works.

SET GLOBAL gtid_slave_pos = ; -- requires a good gtid pos [DONE]
  • Disable gtid out of precaution
CHANGE MASTER TO MASTER_USE_GTID=no; -- [DONE]
  • Start replication until (before) delete
START SLAVE UNTIL MASTER_LOG_FILE = 'db1109-bin.003832', MASTER_LOG_POS = 611995336; -- [DONE]
  • Disable consitency options:
SET GLOBAL innodb_flush_log_at_trx_commit=0; -- [DONE]
  • Add to zarcillo so it can be monitored on prometheus [DONE]

2) Stop replication fully

STOP SLAVE; [DONE]

3) Rename the table to _old

set sql_log_bin=0; RENAME TABLE wb_items_per_site TO wb_items_per_site_recovered; [DONE]

4) Dump the table in 10 chunks: dbprov1001# dbprov1001# backup-mariadb s8_backup --type=dump --host=db1095.eqiad.wmnet --port=3318 --backup-dir /srv/backups/dumps/ongoing --regex='wikidatawiki[.]wb\_items\_per\_site' --rows=1000000 --user=dump --retention=15 --password= # requires a working --password [DONE]

The right dump is /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21

5) Stop replication on:

  • codfw
  • labs
  • Analytics

6) Recover on eqiad production hosts without replication:

myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host={} --port={} --user=root --password= # requires working --password
myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host=db1087.eqiad.wmnet --user=root --password= --enable-binlog # on db1087

(list of servers gotten from zarcillo with the section bash oneliner)

  1. db1116:3318 (backup source, delayed) DONE
  2. db2079 (codfw master, delayed) DONE
  3. db1095:3318 (NOT TO BE DONE, delayed, instance to be removed)
  4. db1124:3318 (NOT TO BE DONE, will be done on its master)
  5. db1099:3318 DONE
  6. db1101:3318 DONE
  7. db1126 DONE
  8. db1104 DONE
  9. db1092 DONE
  10. db1087 (WITH REPLICATION) DONE
  11. db1111 DONE
  12. dbstore1005:3318 (analytics) DONE
  13. db1109 (master) DONE

7) drop? table and rename old to new:

- Set the master on RO for some time
db1109> set global read_only=1; RENAME TABLE wb_items_per_site TO wb_items_per_site_old; RENAME TABLE wb_items_per_site_recovered TO wb_items_per_site; set global read_only=0;
  • Main issue: metadata locking on the slaves
    • Depooling hosts (a bunch of them at the time)
    • Repooling them after rename
    • Depooling the ones that get metadata locked again