Portal:Cloud VPS/Admin/Backy2
We are using Backy2 to back up Ceph volumes: VM images, Cinder volumes, and Glance images. Backy2 is designed for fast, incremental backups of rbd volumes and uses deduplication to minimize disk space.
Backups are stored on 'cloudbackup*' servers.
Backup agents are run by systemd timers; those same timers should also clean up out-of-date backups. The intended lifespan of a given backup is set when that backup is created.
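To see which backup and cleanup timers exist on a host (assuming they live on the cloudbackup hosts; the exact unit names vary, so this is just a convenient filter):
root@cloudbackup1003:~# systemctl list-timers --all | grep -i backup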
The specific backup process for a given image is handled by the Python library rbd2backy2.py (a raw-command sketch follows below). The steps are:
- make a new snapshot
- collect a diff between today's snapshot and yesterday's snapshot
- delete yesterday's snapshot
- back up today's snapshot, passing the diff to Backy2 as a hint so it knows which blocks are unchanged and can be skipped
Note that the above process means that essentially every image should have exactly one snapshot hanging around at all times. This can at times complicate deletion of images, as rbd refuses to delete images that still have snapshots.
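A raw-command sketch of that flow, using upstream rbd and backy2 syntax rather than the actual rbd2backy2.py code (the pool, snapshot names, and placeholders in angle brackets are illustrative only):
rbd --pool eqiad1-compute snap create <image>@<today>
rbd --pool eqiad1-compute diff --whole-object --from-snap <yesterday> <image>@<today> --format=json > /tmp/<image>.diff
rbd --pool eqiad1-compute snap rm <image>@<yesterday>
backy2 backup -s <today> -r /tmp/<image>.diff -f <previous_version_uid> rbd://eqiad1-compute/<image>@<today> <image>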
Cleanup
Backy2 does not purge unused blocks by default. In theory our automatic backup jobs also instruct Backy2 to purge, but in practice cloudbackup hosts sometimes fill up with Backy2 cruft. This can almost always be resolved by running 'backy2 cleanup'. Depending on how much space needs freeing, this command may take hours to complete, so it's best started in a screen session.
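For example, to kick off the cleanup in a detached screen session and check on it later (the session name is arbitrary):
root@cloudbackup1003:~# screen -dmS backy2-cleanup backy2 cleanup
root@cloudbackup1003:~# screen -r backy2-cleanup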
Restoring
Backy2 can restore volumes straight into the Ceph pool; getting restored images back into view of OpenStack is more complicated. In general, to find what is backed up where, run 'backy2 ls' on a cloudbackup host:
root@cloudcontrol1005:~# backy2 ls 06cf27ba-bed2-48c7-af2b-2abdfa65463c
INFO: [backy2.logging] $ /usr/bin/backy2 ls 06cf27ba-bed2-48c7-af2b-2abdfa65463c
+---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
| date | name | snapshot_name | size | size_bytes | uid | valid | protected | tags | expire |
+---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
| 2020-10-20 16:00:03 | 06cf27ba-bed2-48c7-af2b-2abdfa65463c | 2020-10-20T16:00:02 | 4864 | 20401094656 | 508686ba-12ed-11eb-a7f5-4cd98fc4a649 | 1 | 0 | b_daily,b_monthly,b_weekly | 2020-10-27 00:00:00 |
+---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
INFO: [backy2.logging] Backy complete.
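With a version uid from the listing, the volume itself can be restored directly into Ceph. A sketch, using the upstream 'backy2 restore' syntax and a target name that does not already exist in the pool (wiring the restored image back into OpenStack is a separate, manual step):
root@cloudbackup1003:~# backy2 restore -s <version_uid> rbd://eqiad1-compute/<restored_image_name>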
Handy commands
List every backup on a backup server:
root@cloudbackup1003:~# backy2 ls
List every backup for a given VM:
root@cloudbackup1003:~# backy2 ls <instance_id>_disk
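If you only know the VM's name, the instance ID can be looked up with the OpenStack CLI (assuming admin credentials are loaded on the host you run it from):
root@cloudcontrol1005:~# openstack server show <vm_name> -f value -c id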
List all rbd snapshots for a given VM:
root@cloudcontrol1005:~# rbd --pool eqiad1-compute snap ls <instance_id>_disk
Note: The backup job should leave at most one rbd snapshot for any given VM. If a VM has several, something unexpected is happening and we are probably leaking Ceph storage space rapidly; the loop below will find such cases.
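A quick sketch for spotting such leaks: count the snapshots of every image in the pool and report anything with more than one (this loops over the whole pool, so it can take a while):
for img in $(rbd --pool eqiad1-compute ls); do
  n=$(rbd --pool eqiad1-compute snap ls "$img" | tail -n +2 | wc -l)
  [ "$n" -gt 1 ] && echo "$img: $n snapshots"
done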
Purge all rbd snapshots for a given VM:
root@cloudcontrol1005:~# rbd --pool eqiad1-compute snap purge <instance_id>_disk
Note: purging snaps is not especially disruptive, but it forces the next backup to read the entire volume rather than only the changed blocks, which will slow that backup down quite a bit.
Delete orphaned backy2 file blocks:
root@cloudcontrol1005:~# backy2 cleanup
Future concerns
- Until we have large-scale Ceph adoption, any time and space estimates are approximate. If it turns out to take more than 24 hours to back up the whole cluster, or cloudvirt1024 doesn't have space for all the backups, there are a few options:
- Reduce the number of daily backups we keep. Even if we go as low as two days' worth, these backups will still be valuable.
- Split backup jobs between more hosts.
- Exclude more projects or VM types from backup.
- To support incremental backups, every rbd image is accompanied at all times by yesterday's snapshot. Depending on how snapshots are stored, that may turn out to consume a massive amount of precious Ceph storage space. If this turns out to be an issue we may need to abandon incremental backups, or use some convoluted process like restoring yesterday's backup into an image, doing a diff, and then removing yesterday's backup. It should become clearer what tradeoffs to make as usage increases.
- Running backup jobs on cloudvirt1024 may interfere with performance of VMs hosted there. Ideally we'd have some way to allocate some cores for backups and other cores for virtualization.
- Some users of Backy2 complain that taking snapshots on an active Ceph cluster causes performance lags during snapshotting. We need to keep an eye out for such problems.
- The restoration process documented here is too cumbersome for mass restoration. A restore feature should probably be added to rbd2backy2.py.