Backup and recovery methods
The following types of backups are part of the design:
- Daily compressed binary backups for disaster recovery (today, yesterday, 1 week old)
- Weekly logical backups for long term recovery (3 month retention)
- Binlog backups (for point in time recovery/incremental-ish backups) [not yet implemented]
- Incremental backups for append-only external storage servers [not yet implemented]
- The many production database servers that are used by Mediawiki and many other analytics, cloud-support and other misc services. These are normally called
db*(database) servers, in addition to the larger
es*(external storage) servers
- There are other mysql-focused servers, like
labsdb*hosts, but those are not backed up- in the first case, because they are essentially disk cache (data can be lost without losing user data), in the second because they don't contain but a filtered replica of production data
- es servers are not fully backed up on each cycle for 2 main reasons: they are append-only hosts, so incremental backups would be enough, and they are larger, so special treatment has to be done to prevent high redundancy and waste of resources.
- For a summary of the functionality of the different data servers, see: MariaDB#Sections_and_shards
- There are other mysql-focused servers, like
- There are specific dedicated backup replicas, whose only function is to mirror the production datasets and provide a source of backups with enough performance and reliability, without affecting the production hosts due to the load and locking on generating backups
- Bacula service, storing backups (only the logical and incremental) long-term, reading from the provisioning servers and sending it to the bacula storage nodes
- Provisioning servers: storing backups short term for quick recovery, provisioning and the post-processing needs (e.g. rotation, xtrabackup --prepare, compression, consolidation if many files are created). They, themselves, orchestrate the logical (dumps) backups.
- Testing databases: at the moment, only db1114 (one host per datacenter is procured). They are non-production hosts that regularly recover logical backups and snapshots and set up replication to verify backups are working properly [not yet implemented]
- Cluster management servers (aka cumin hosts, at the moment
cumin2001): provide orchestration, specifically for snapshots, as they require remote execution -beyond the mysql protocol- to execute root commands on the source servers and sending files to the provisioning servers
- Zarcillo, database currently at db1115 (tendril host) stores metadata information about the generated backups at generation time, and informs about the current ongoing status as well as the a priori success or failure of backups. It also stores the size and name of each file generated, for further data and backup analysis and trending. An icinga check is also setup there which alerts if the latest fresh backup is older than the configured amount of time.
Software and deployment architecture
WMFBackup is the main class controlling the generation of backups. It is an extensible backup class that at the time can use 3 class-methods:
- NullBackup: Does nothing
- MariaBackup: Uses
mariabackuputility, fork of XtraBackup just recompiled to support MariaDB specific InnoDB format. MariaBackup/XtraBackup allow for (in theory) lower time to recovery, as it will be as fast as putting the files into the dir and starting mysql. It is, at the time, the chosen method to generate what we call generically "snapshots" (binary or raw backups).
- MyDumper: Uses
mydumperto generate a fast, highly parallel, compressed logical dump. It is, at the time, the chosen method to generate what we call "dumps" (logical backups).
The backup process has 4 main functions:
- Generate the backup files (e.g run
- Post-process it, with a number of compusory and optional tasks: check the backup seems complete, prepare it (in the case of mariabackup), consolidate (tar) per database, compress, and rotate to its final location.
- Generate state and metadata about the backup and the file it generates (by the
backup_mariadb.py utility and its libraries are deployed on the provisioning host. However, because snapshotting requires root remote execution and we can only do local mysql snapshots.
To generate remote snapshots, the Transfer.py script is being used for the first part of the backup, installed on the cluster management server. Every day, a cron job runs
daily_snapshot.py script, and send the snapshots to the provisioning hosts using
transfer.py. Then, it runs locally on the provisioning host
backup_mariadb.py in order to post-process the generated files and gather the metadata statistics.
transfer.py is a generic utility, installed on the cluster management (orchestration) host (e.g.
cumin1001) to transfer files over the network, but has a switch
--type=xtrabackup, that allows also to transmit in a consistent way the mysql files of a live mysql server. Please note that the post-process of
backup_mariadb.py is still needed after
--prepare has to be run before recovering the files to a server.
The backups are stored in the following location:
/srv /backups /dumps /ongoing # ongoing logical backups /latest # latest completed logical backup /archive # recent, but not latest, logical backups (in case the latest has issues), regularly purged /snaphosts /ongoing # ongoing mariabackup files /latest # latest completed and prepared mariabackup tarballs /archive # recent, but not latest, mariabackup tarballs (in case the latest has issues), regularly purged
Finally, bacula regularly backups the configured path on the chosen active datacenter for long term backups (right now, eqiad) copying only the latest logical backups (
Puppet code is distributed into the following profiles:
mariadb::backup::bacula[provisioning host] to move logical dumps to long term storage (bacula)
mariadb::backup::check[currently on the metadata database, could really be run from almost anywhere] icinga checks that backups are generated correctly and fresh
mariadb::backup::mydumper[provisioning host] automation of logical backups, and backup and recovery software in general
mariadb::backup::snapshot[provisioning host] snapshotting environment
mariadb::backup::transfer[orchestration hosts] automation of snapshots
Incrementals and binlogs are not productionized/fully automated yet (although one can find binlogs and manual es backups in several locations).
A summary of the flow of control and data can be seen at the diagram: Database_backups_overview.svg
Mydumper is used for creating logical backups. Not only it is faster to create and recover than mysqldump, it also allows table or even lower level granularity- very interesting for when a single table needs to be restored.
On the host with the profile
mariadb::backup::mydumper (provisioning hosts), a configuration file exists on
/etc/mysql/backups.yaml that configures what and where and how to backup. Weekly, the script
backup_mariadb.py reads that config file and backups the core and misc sections into
The backup host -at the moment dbstore1001 and es2001- only keeps the last or latest backups so they can be sent to bacula, not long term. The backups are generated asynchronously to the bacula director handling because with the old system, it used to block all other backup and recovery processes for a long time. Now bacula only has to retrieve the latest directory and get the latest successful backups.
Each backup is a single directory with the following name:
timestamp is not in ISO format as ':' is a special symbol by some languages, like the command prompt. It is in the following format instead:
Each directory contains the same structure than a regular mydumper process generates, except that, if so configured, it can tar all objects of a single database to avoid having hundreds of thousands of files. This currently happens for x1 and s3, which have many thousand of objects- that gets reduced to only ~900, one per database.
To understand the status of the backups, mydumper creates a
metadata file that contains the start of the backup process (which the dump should be consistent to, the exact binlog and GTID coordinates for that consistency point, and the time the backup finished. If the backup fails, normally the database dir will not copied to latest and a log file on the ongoing directory with the name of the section will show some kind of error. If for some reason mydumper was successful but the overall process cannot, one can retry the backup with the
--only-postprocess option to rotate, consolidate, compress and/or generate statistics.
Adding a new dump
EVENT, LOCK TABLES, SELECT, SHOW VIEW, TRIGGERgrants to the given backup user for the database backup
- Add also
FILE, RELOAD, REPLICATION CLIENT, SUPERto the same user on *.*
- Make sure, if exists, that the regex filtering backups or not the appropiate objects
- Add or otherwise make sure that the host is backed in eqiad and/or codfw on
puppet://modules/profile/templates/mariadb/backups-*.cnf.erb. The right location will try to balance the load and disk space used of the several hosts used for backup
- Make sure your backups are done as scheduled, by looking at zarcillo db on db1115, tables backups and backup_files.
Recovering a logical backup
To recover, there is a script called recover_dump.py which automates the decompression (if a .tar.gz), untar (if databases consolidated) and runs myloader. If for some reason that wouldn't work, myloader can be called directly (assuming the directory is not compressed, tarred or that has been handled beforehand), or even load individual objects directly (mydumper creates a .sql with the structure and a .sql data file per table).
This is a step by step guide of how to recover a backup.
For this example we will recover: x1 on db1120 from the backup host es2001
Pre-requisites before recovering the backup
- db1120 must have a MySQL server up and running (if it is a new host it can be set up with: /opt/wmf-mariadb101/scripts/mysql_install_db
- db1120 must have privileges to let the source host (e.g. dbprov1001) connect, create the databases and tables (and other objects) and import the rows
- This can be tested with dbprov1001$ mysql -hdb1120.eqiad.wmnet -uroot -pREDACTED
Recovering the data
It is recommended to run the following command in a screen session:
dbprov1001:~# recover_dump.py --host db1120.eqiad.wmnet --user root --password REDACTED --port 3306 x1 Attempting to recover "dump.x1.2018-07-24--23-30-38" Running myloader...
Alternatively, if a specific backup has to be recovered, different from the one on latest with the given section, it also accepts absolute paths:
dbprov1001:~# recover_dump.py --host db1120.eqiad.wmnet --user root --password REDACTED --port 3306 /srv/backups/dumps/archive/dump.x1.2018-07-24--23-30-38
Remember to provide an absolute path to the dir or compressed .tar.gz if using the second format. it doesn't have to be inside /srv/backups/, it can be anywhere where your current user can read.
This will start recovering all the data on db1120 for the whole x1 section. We can check if this is actually working by checking if the data directory on db1120 is growing
db1120:~# du -s /srv/sqldata/
This can take several hours and once finished, es2001 will return to the prompt. And the data is ready on db1120 to be analyzed and processed if needed.
Enabling replication on the recovered server
If the server needs to be pooled in production, firstly we have to enable replication so it can catch up. First of all we should create the heartbeat table on our server.
CREATE DATABASE heartbeat; USE heartbeat; CREATE TABLE `heartbeat` ( `ts` varbinary(26) NOT NULL, `server_id` int(10) unsigned NOT NULL, `file` varbinary(255) DEFAULT NULL, `position` bigint(20) unsigned DEFAULT NULL, `relay_master_log_file` varbinary(255) DEFAULT NULL, `exec_master_log_pos` bigint(20) unsigned DEFAULT NULL, `shard` varbinary(10) DEFAULT NULL, `datacenter` binary(5) DEFAULT NULL, PRIMARY KEY (`server_id`) ) ENGINE=InnoDB DEFAULT CHARSET=binary;
And we should now insert a couple of rows with the server_id of the replication chain masters (eqiad and codfw just to be sure). To do so we can check: https://tendril.wikimedia.org/tree and check our section and masters. For x1 our masters are: db1069 and db2034 Let's find out the server_id
root@neodymium:~# mysql.py -hdb2034 -e "SELECT @@hostname; SELECT @@server_id" -BN db2034 180355159 root@neodymium:~# mysql.py -hdb1069 -e "SELECT @@hostname; SELECT @@server_id" -BN db1069 171966572
Now let's insert those two rows on db1120
USE heartbeat; SET SESSION sql_log_bin=0; INSERT INTO heartbeat (server_id) VALUES (171966572); INSERT INTO heartbeat (server_id) VALUES (180355159);
To do so, we have to gather the coordinates from the backup which are in a file called metadata. In our case the backup is at dump.x1.2018-07-24--23-30-38
es2001:~# cat /srv/backups/latest/dump.x1.2018-07-24--23-30-38/metadata Started dump at: 2018-07-24 23:30:38 SHOW SLAVE STATUS: Connection name: Host: db2034.codfw.wmnet Log: db2034-bin.000196 Pos: 978121166 GTID:0-171970580-683331037,1-171970580-1,171966572-171966572-191034075,171970580-171970580-596994206,171974681-171974681-198565537,180355159-180355159-13448767,180363268-180363268-40608909 Finished dump at: 2018-07-25 00:32:34
What we need to enable replication:
Once we have those we can execute the following command on db1120 (the password is at repl-password file on the pw repo)
CHANGE MASTER to MASTER_HOST='db2034.codfw.wmnet', MASTER_USER='repl', MASTER_PASSWORD='REDACTED' ,MASTER_PORT=3306, MASTER_LOG_FILE='db2034-bin.000196', MASTER_LOG_POS=978121166, MASTER_SSL=1; START SLAVE; SHOW SLAVE STATUS\G
We should see our Seconds_behind_master decreasing (sometimes it can take a while to see it decreasing)
If we have recovered and eqiad host from a codfw source, once the server has caught up, we need to move it under eqiad master In our case, db1120 is replicating from db2034 which is codfw masters, so we need to move them to be at the same level, that is: db1120 must be a sibling of db2034 and not a child. To do so we can use 'repl.pl' script on 'neodymium':
./marostegui/git/software/dbtools/repl.pl --switch-child-to-sibling --parent=db2034.codfw.wmnet:3306 --child=db1120.eqiad.wmnet:3306
Snapshoting and disaster recovery
While logical backups have a lot of advantages:
- Small disk footprint (specially if compressed). High compression ratio and no space wasted on indexes or fragmented data.
- Fast to generate: If enough data is on memory, reads can be very fast, specially if done with enough parallelism
- Very low granularity on recovery: Because we use one (or several) separate files per table, we can recover single databases, single tables or even individual rows if properly filtered. We can even separately recovered structure and data.
- Not prone to corruption: A physical copy of file would maintain and make very difficult to detect certain kinds of corruption. Because a logical dump requires reading all rows, once exported, corruption cannot happen unless the exports are themselves corrupted at a later time.
- Software independency and portability: Because in the end we are just generating text files, the format has great portability. It can be used on different MySQL/MariaDB versions, vendors or even on different database software. Also, they can converted if needed as they are human-readable text/manageable with 3rd party software.
Because many of the above, this format is ideal for long-term preservation.
However, logical dumps have some important weaknesses:
- They are slow to recover, as they have to be reimported row by row, as well as indexes have to be regenerated again.
- Taking a dump can create a lot of performance impact both due to the amount of logical reads need, and the state in which they leave the buffer pool afterwards
Because of this, logical backups tend to increase a lot the Time to Recovery in case of a full disaster. That is where snapshots (how we call generically raw or binary backups) come into play: While larger in size, they tend to be faster to recover as they use the native database format- only requiring to shutdown the original server and copy back the files- making it as fast as a regular file copy can be sent over the network. Snapshots are the bases of our "fast" disaster recovery method.
Snapshots can be generated in several ways:
- Cold backup (shutdown the server and copy the files)
- lvm snapshots + recovery
- MySQL Enterprise backup/Xtrabackup
We chose, after some research (Task T206204), to use the latest option, specifically with MariaDB's fork of the free software Percona XtraBackup, MariaBackup, which is mostly identical in usage and functionality to XtraBackup, but is compiled with MariaDB and thus supports its internal format better (Xtrabackup started having issues with MariaDB since 10.0).
MariaBackup is one of the supported methods in the WMFBackup Class, which means it can be taken using
backup_mariadb.py command line utility. The main difference between snapshot taking and logical backups is that snapshots requires raw access to the underlying mariadb files as a privileged user in addition to access to mysql itself as a priviliged account. Thus, at the moment, backup_mariadb.py only is able to create localhost backups by itself (more on remote backups later).
When used, files are copied to
/srv/backups/snapshots/ongoing, in what would appear as a complete datadir copy. However, at this point the backup process is not complete- as documented on the XtraBackup documentation page, the backup needs to be prepared before being used. The preparation step of the backup takes care of that before continuing with the rest of post-processing steps. At the moment preparation happens always after backup, but there are reasons to postpone this (e.g. to generate incremental or differential backups, or to export individual tables). This would be possible in the future, but not currently supported.
Special mention requires that one normally would want to compress the final set of files for several reasons: The most obvious is space saving- snapshots are as large as the original datadir. The second reason would be that while normally mydumper can be used to recover individual parts of the database, normally snapshots are used for full recoveries and provisioning only, thus them being pre-compressed speeds up the later full recovery.
On Remote Snapshotting
With the above implementation, local snapshotting would be possible; however a backups is only a backup if it is fully offline. Several options and designs were considered for this- in particular, the possibility of preparing backups locally to the host before sending them away on some way. This was discarded for the following reasons: 1) read access was needed to the datadir, making it difficult to do privilege separation; 2) it is not out of the question that a sever may not have enough space to temporarily store a copy of its database (specially under emergency, which is what snapshotting was intended for) and 3) if preparation was compulsory from the start, it would make incremental/differential backups impossible in the future.
Based on that, database hosts would only run
xtrabackup (mariabackup) --backup and then use its streaming capabilities to send its data away, to be prepared on a second step. For that purpose, existing tool
transfer.py, used in the past to perform cold backup transfer as well as general file transfer between hosts over the network was modified to allow xtrabackup as the source of data transference. Once transferred to the provisioning host(s), to
/srv/backups/snapshots/ongoing, the utility
backup_mariadb.py is run, although with the
--only-postprocess option so it finds and treats its, now, local files, but does not attempt to generate a new snapshot.
So, in summary, while logical backups and snapshots share most of the workflow, because of the particularities of creating a remote snapshot, this requires an extra initial step where first files are transferred and taken out of the source hosts (unlike logical backups, which are able to perform that just by using the mysql query protocol).
Orchestration and Scheduling
Because the remote copy mentioned on the previous section, remote execution is needed, something that is not available at the time (and probably never, for security reasons) from the provisioning hosts. This is why the transfer itself has to be initiated from the cluster management hosts, at the moment
cumin2001. Probably not surprisingly, cumin is being used for remote execution.
At the moment, daily snapshotting of all Mediawiki Metadata sections happens daily. This is initiated with a cron job, that reads the
/etc/mysql/backups.cnf yaml configuration files on the cluster management server and controls which sections to backup, with which options, as well as to which server those backups are sent. The cron job, rather than directly calling transfer.py and mariadb_backup.py, uses
daily_snapshots.py simple script, which does the above steps for simplicity. Because transfer cannot happen over more than one port at a time per destination server, each backup is done on a separate port, starting consecutively on the port
Adding a new database to Snapshot
Snapshotting doesn't require any special privileges other than the regular ones for a production mysql, as it is done (and must be done) as root.
To add additional hosts to snapshot, just edit the
puppet://modules/profile/templates/mariadb/backups-cuminXXXX.cnf.erb templates so they are run on the provided hosts, in the given order.
Recovering a Snapshot
At the moment, there is no specific utility to fully automate snapshots. There is however, a transfer type (
--type=decompress) that can automate the initial transfer and decompression from an existing snapshot: transfer.py.
For now, the rest of the setup steps have to be done manually:
- [from a cumin host] transfer.py --type=snapshot dbprov1001.eqiad.wmnet:/srv/backups/snapshots/snapshot.s1.2020-09-11--23-45-01.tar.gz db1051.eqiad.wmnet:/srv/sqldata
- Remember the transfer will fail if existing data is on that dir- it should be deleted or moved away first
- Chown the datadir recursively to be owner by the mysql user
- Just to be safe:
systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
- Start mysql
- Setup replication based on the GTID coordinates (remember GTID tracks already executed transactions, while binlogs tracks offsets or "gaps between transactions", do not confuse both methods). GTID is normally stored on the
CHANGE MASTER TO MASTER_GTID=XXXXXX; START SLAVE;
Binary log backups and point in time recovery
NOT YET IMPLEMENTED
Incremental backups and external storage disaster recovery
NOT YET IMPLEMENTED
Monitoring and metadata gathering
On backup generation, an entry is added to the `backups` (and several on the `backup_files`) metadata tables (at the moment, on db1115/zarcillo db). It contains the up to date status (ongoing, finished, failed) state of the backup genration and, when finished, some information about it and its generated files (date, size, etc.)
The zarcillo database (temporary code name for what will be the tendril replacement) now has 3 tables:
backup_objects(not in use)
backups contain an id and properties about backups (status -ongoing, finished (correctly), failed and deleted- dir name, source, section, start_time, end_time, total_size, etc:
CREATE TABLE `backups` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `name` varchar(100) CHARACTER SET latin1 DEFAULT NULL, `status` enum('ongoing','finished','failed', 'deleted') COLLATE utf8mb4_unicode_ci NOT NULL, `source` varchar(100) CHARACTER SET latin1 DEFAULT NULL, `host` varchar(300) CHARACTER SET latin1 DEFAULT NULL, `type` enum('dump','snapshot', 'cold') COLLATE utf8mb4_unicode_ci DEFAULT NULL, `section` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL, `start_date` timestamp NOT NULL DEFAULT '1970-01-01 00:00:01', `end_date` timestamp NULL DEFAULT NULL, `total_size` bigint(20) unsigned DEFAULT NULL, PRIMARY KEY (`id`), KEY `last_backup` (`type`,`section`,`status`,`start_date`) ) ENGINE=InnoDB AUTO_INCREMENT=174 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
backup_files contain a backup id and a list of files for that backup and its properties (date, size, name)
CREATE TABLE `backup_files` ( `backup_id` int(10) unsigned NOT NULL, `file_path` varchar(300) CHARACTER SET latin1 NOT NULL DEFAULT , `file_name` varchar(300) CHARACTER SET latin1 NOT NULL, `size` bigint(20) unsigned DEFAULT NULL, `file_date` timestamp NULL DEFAULT NULL, `backup_object_id` bigint(20) unsigned DEFAULT NULL, PRIMARY KEY (`backup_id`,`file_name`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
backup_objects will link the backup files to specific objects (tables, databases, triggers, etc.) for further checking (at the moment, this is not filled in). This will be useful when we maintain an inventory of database objects for all servers so we can make sure no objects are left uncopied/have appropriate size, etc.
CREATE TABLE `backup_objects` ( `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `backup_id` int(10) unsigned NOT NULL, `db` varchar(100) CHARACTER SET latin1 NOT NULL, `name` varchar(100) CHARACTER SET latin1 DEFAULT NULL, `size` bigint(20) DEFAULT NULL, PRIMARY KEY (`id`), UNIQUE KEY `backup_id_db_name` (`backup_id`,`db`,`name`) ) ENGINE=InnoDB AUTO_INCREMENT=6352 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Metadata is collected during generation, by the same script that generates the dumps,
dump_section.py. It connects to the metadata database (db1115:zarcillo at the moment) and logs the information. If the logging fails, the backup should continue, but it cannot log a successful backup without the backups being successful, too.
The metadata database ("statistics" db) is controlled on the backup parameters, on the
/etc/mysql/backup.yaml, the same place that controls the backups to be done.
Checks are done at icinga, and are controlled by a hiera setting:
This is used at the moment to alert if backups fail for any reason (icinga checks on db1115 called
check_mariadb_backups.py) and it checks backups are generated correctly weekly and have a reasonable size. Further checks could be added later, but thorough backup testing should be done through proper recovery validation (full recovery to test hardware, starting replication and maybe some checksums/smoke tests).
NOT YET IMPLEMENTED
Backups quick cheatsheet
This is WIP
Copy data from a backup source to a host
- Stop replication
- Gather the coordinates
- From cumin1001: Run transfer.py with type xtrabackup and using:
SOURCE_HOST:SOCKET DESTINATION_HOST:DATA_DIR(if the transfer will not happen within the same datacenter please remove
./transfer.py --type=xtrabackup --no-checksum --no-encrypt db1140.eqiad.wmnet:/run/mysqld/mysqld.x1.sock db1127.eqiad.wmnet:/srv/sqldata
- Once the transfer is done, and if you want to prepare the backup ssh to the host:
/opt/wmf-mariadb101/bin/mariabackup --prepare --innodb-buffer-pool-size=300GB --target-dir=/srv/sqldata
Provision a precompressed and prepared snapshot
- Once the snapshot has been generated go to cumin1001
transfer.py --no-checksum --no-encrypt --type=decompress dbprov2002.codfw.wmnet:/srv/backups/snapshots/latest/xxxx.tar.gz DESTINATION.FQDN:/srv/sqldata
- Once the data has been copied over successfully, ssh to the host.
cat /srv/sqldata/xtrabackup_slave_info | grep GLOBAL | mysql
- Now from the mysql prompt of that host (or from cumin) configure the replication thread:
CHANGE MASTER TO MASTER_HOST='FQDN', MASTER_USER='<user>', MASTER_PASSWORD='<pass>', MASTER_SSL=1, master_use_gtid = slave_pos; start slave;