Portal:Cloud VPS/Admin/Instance backups
As of 2020-09-01, most VMs hosted on Ceph have their main disk backed up, with backups retained for a few days (typically about a week). This is a last-resort system put in place to guard against a total Ceph cluster collapse; any large-scale restore will be extremely labor-intensive.
We do not provide self-serve snapshots, backup, or restore services for openstack VMs.
Architecture
We use Backy2 to back up VM images; backups run on cloudbackup1003 and cloudbackup1004.
The backup agent (wmcs-backup instances) is run daily by a systemd timer. Another systemd timer runs a cleanup script (wmcs-purge-backups) which deletes expired backups. The intended lifespan of a given backup is set when that backup is created.
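To check when the next run is scheduled, or how the last one went, the standard systemd tools work on the backup hosts. The exact unit names are not listed here, so the grep pattern and the unit placeholder below are assumptions; adjust them to whatever list-timers actually shows:
root@cloudbackup1003:~# systemctl list-timers | grep -i backup
root@cloudbackup1003:~# journalctl -u <backup service unit from above> --since yesterday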
The specific backup process for a given VM is handled by the Python library rbd2backy2.py. The steps for each VM (sketched in commands after this list) are:
- make a new snapshot
- collect a diff between today's snapshot and yesterday's snapshot
- delete yesterday's snapshot
- back up today's snapshot, using the diff as a hint for Backy2 so it knows what to ignore
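In rbd/Backy2 terms, the nightly cycle looks roughly like the sketch below. This is only an illustration of the logic in rbd2backy2.py, not a copy of it: the variable names are invented and the exact arguments the library passes may differ.
# Illustrative sketch only -- the real implementation lives in rbd2backy2.py.
POOL=eqiad1-compute
IMAGE="<vm_uuid>_disk"
TODAY=$(date +%Y-%m-%dT%H:%M:%S)            # matches the snapshot_name format seen in 'backy2 ls' below
YESTERDAY="<yesterday's snapshot name>"

# 1. make a new snapshot
rbd snap create "$POOL/$IMAGE@$TODAY"

# 2. diff today's snapshot against yesterday's
rbd diff --whole-object "$POOL/$IMAGE@$TODAY" --from-snap "$YESTERDAY" --format=json > /tmp/hints.json

# 3. delete yesterday's snapshot
rbd snap rm "$POOL/$IMAGE@$YESTERDAY"

# 4. back up today's snapshot, passing the diff as a hint so Backy2 only reads the changed blocks
backy2 backup -s "$TODAY" -r /tmp/hints.json "rbd://$POOL/$IMAGE@$TODAY" "$IMAGE"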
What is backed up
Each backup is of a complete VM disk image. Backups do not include any openstack metadata (e.g. base image, flavor, etc.) so a given restore is likely to work only within the same openstack install context where it was captured.
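Because of this, it can be worth recording an instance's openstack metadata separately while the instance still exists. For example (standard openstack CLI, run from a cloudcontrol node; the project and UUID are placeholders):
root@cloudcontrol1003:~# OS_PROJECT_ID=<project> openstack server show <vm_uuid>
The output includes the flavor, image, addresses, and security groups, which is most of what a restore cannot recover on its own.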
Our backup agent has a simple regexp-based filter (in /etc/wmcs_backup_instances.yaml) that excludes some VMs from backup. Prime candidates for exclusion are:
- Cattle: VMs that can be trivially reproduced from scratch from puppet with no data loss. Kubernetes worker nodes are the most obvious example of this.
- Hogs: VMs that, by special request, have ENORMOUS disk drives for temporary processing work. Typically, if a user requests a quota exception for a VM like this, they should be warned that it will not be eligible for backup, and their project should be added to the exclusion list when the project is created. Example: encodingXX.video.eqiad.wmflabs
- Mayflies: Some internal-use projects are created to run a single test or experiment and then destroyed. Obvious examples of this are VMs in the admin-monitoring project or the sre-sandbox project.
We do not have the capacity to back up everything!
Restoring
Backy2 can restore volumes straight into the Ceph pool. To exercise the restore process (or to roll back to a previous backup):
- Stop the VM.
- Delete (or move) the existing Ceph image. This prevents a name conflict when restoring.
- If you get a complaint about the volume still having "watchers", check whether the shutdown of the VM really completed.
- If the command complains about snapshots, use the commands below to remove the snapshots of that volume.
root@cloudcontrol1005:~# rbd rm eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
Removing image: 100% complete...done.
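If rbd rm refuses because of the issues above, these standard rbd commands should help (shown as a sketch rather than captured output): rbd status lists any remaining watchers so you can confirm the VM really shut down, and rbd snap purge deletes every snapshot of the volume.
root@cloudcontrol1005:~# rbd status eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
root@cloudcontrol1005:~# rbd snap purge eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk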
- Find the backup you want to restore.
- Note the UID of the desired backup and restore it.
- Start the restored VM.
root@cloudcontrol1003:~# openstack server stop ee8bd285-73ab-4981-a1f1-498b79b50e2a
root@cloudbackup1003:~# backy2 ls ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
INFO: [backy2.logging] $ /usr/bin/backy2 ls ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
+---------------------+-------------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
| date | name | snapshot_name | size | size_bytes | uid | valid | protected | tags | expire |
+---------------------+-------------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
| 2020-08-19 01:37:12 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-19T01:37:11 | 5120 | 21474836480 | 8136f1c6-e1bc-11ea-94a2-b02628295df0 | 1 | 0 | b_daily,b_monthly,b_weekly | 2020-08-22 00:00:00 |
| 2020-08-19 02:00:51 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-19T02:00:50 | 5120 | 21474836480 | cedd7db6-e1bf-11ea-b5bb-b02628295df0 | 1 | 0 | | 2020-08-22 00:00:00 |
| 2020-08-20 02:00:49 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T02:00:48 | 5120 | 21474836480 | f8395878-e288-11ea-b5a0-b02628295df0 | 1 | 0 | b_daily | 2020-08-27 00:00:00 |
| 2020-08-20 18:43:04 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T18:43:03 | 5120 | 21474836480 | fb87dc0c-e314-11ea-a855-b02628295df0 | 1 | 0 | | 2020-08-27 00:00:00 |
| 2020-08-20 18:43:31 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T18:43:30 | 5120 | 21474836480 | 0b9046ac-e315-11ea-83c9-b02628295df0 | 1 | 0 | | 2020-08-27 00:00:00 |
| 2020-08-20 18:45:19 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T18:45:18 | 5120 | 21474836480 | 4ba4d38e-e315-11ea-9d18-b02628295df0 | 1 | 0 | | 2020-08-27 00:00:00 |
| 2020-08-20 18:51:33 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T18:51:31 | 5120 | 21474836480 | 2ac24c4a-e316-11ea-939c-b02628295df0 | 1 | 0 | | 2020-08-27 00:00:00 |
+---------------------+-------------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
INFO: [backy2.logging] Backy complete.
root@cloudbackup1003:~# backy2 restore 2ac24c4a-e316-11ea-939c-b02628295df0 rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
INFO: [backy2.logging] $ /usr/bin/backy2 restore 2ac24c4a-e316-11ea-939c-b02628295df0 rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk Read Queue [ ] Write Queue [ ] (0.0% 0.0MB/sØ ETA 2m56s)
INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk Read Queue [==========] Write Queue [==========] (23.0% 1121.1MB/sØ ETA 3s)
INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk Read Queue [==========] Write Queue [==========] (28.2% 1092.6MB/sØ ETA 5s)
INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk Read Queue [==========] Write Queue [==========] (34.1% 1087.6MB/sØ ETA 6s)
<etc>
root@cloudcontrol1003:~# openstack server start ee8bd285-73ab-4981-a1f1-498b79b50e2a
Restoring after 'openstack server delete'
We have rescued at least one VM from the void after an accidental deletion. The process involves creating a new 'host' VM with the same name (so that DNS, Neutron, etc. are hooked up properly) and then overlaying the disk image of the new host with the restored backup.
This may be possible, with the following caveats:
- Backups are only preserved for 7 days, so if the deletion is noticed weeks or months later it is probably too late.
- The restored VM will lose much of its openstack state: it will have a new IP address, forget its security groups, and most likely need its puppet config replaced in Horizon.
- If the VM predated the move from .eqiad.wmflabs to .eqiad1.wikimedia.cloud, the new VM will only be present under the new domain, eqiad1.wikimedia.cloud.
Here are the steps for rescue:
- Locate the VM in the nova database
- Locate the flavor in the nova api database
- Create the new host VM
- Proceed with the #Restoring steps from above, restoring the backup onto the new VM's disk (sketched after the commands below)
- Confirm puppet runs on the restored VM
- Add security groups, floating IPs, etc. as needed in Horizon
# mysql -u root nova_eqiad1
[nova_eqiad1]> SELECT hostname, id, image_ref, instance_type_id FROM instances WHERE hostname LIKE "<hostname>";
# mysql -u root nova_api_eqiad1
[nova_api_eqiad1]> SELECT name, id FROM flavors WHERE id='<instance_type_id from above>';
# OS_PROJECT_ID=<project> openstack server create --nic net-id=7425e328-560c-4f00-8e99-706f3fb90bb4 --flavor <flavor_id_from_above> --image <image_ref_from_above> <hostname>
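Step 4 then follows the #Restoring procedure, with one twist: the backup is listed under the deleted VM's UUID, but the restore target is the disk image of the newly created VM, which has a different UUID. A sketch with placeholder UUIDs:
root@cloudcontrol1003:~# openstack server stop <new_vm_uuid>
root@cloudcontrol1005:~# rbd rm eqiad1-compute/<new_vm_uuid>_disk
root@cloudbackup1003:~# backy2 ls <old_vm_uuid>_disk
root@cloudbackup1003:~# backy2 restore <backup_uid_from_ls> rbd://eqiad1-compute/<new_vm_uuid>_disk
root@cloudcontrol1003:~# openstack server start <new_vm_uuid>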
Restoring a lost Glance image
Glance images are backed up on the cloudcontrol nodes: each image is backed up on every node. Restoring is similar to the process for instances, but Glance accesses a snapshot rather than the primary file, so there's an extra step. In this example we are restoring an image with id '06cf27ba-bed2-48c7-af2b-2abdfa65463c'.
- Find the backup you want to restore.
- Note the UID of the desired backup and restore it.
- Create a snapshot named 'snap' for Glance to access.
root@cloudcontrol1005:~# backy2 ls 06cf27ba-bed2-48c7-af2b-2abdfa65463c
INFO: [backy2.logging] $ /usr/bin/backy2 ls 06cf27ba-bed2-48c7-af2b-2abdfa65463c
+---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
| date | name | snapshot_name | size | size_bytes | uid | valid | protected | tags | expire |
+---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
| 2020-10-20 16:00:03 | 06cf27ba-bed2-48c7-af2b-2abdfa65463c | 2020-10-20T16:00:02 | 4864 | 20401094656 | 508686ba-12ed-11eb-a7f5-4cd98fc4a649 | 1 | 0 | b_daily,b_monthly,b_weekly | 2020-10-27 00:00:00 |
+---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
INFO: [backy2.logging] Backy complete.
root@cloudcontrol1005:~# backy2 restore 508686ba-12ed-11eb-a7f5-4cd98fc4a649 rbd://eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c
INFO: [backy2.logging] $ /usr/bin/backy2 restore 508686ba-12ed-11eb-a7f5-4cd98fc4a649 rbd://eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c
INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c Read Queue [ ] Write Queue [ ] (0.0% 0.0MB/sØ ETA 2m57s)
INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c Read Queue [==========] Write Queue [==========] (9.4% 244.1MB/sØ ETA 11s)
<etc>
root@cloudcontrol1005:~# rbd snap create eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c@snap
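Optionally, a couple of read-only checks can confirm the result (a hedged sketch, not required by the restore itself): the first should list the 'snap' snapshot just created, and the second confirms Glance still knows about the image.
root@cloudcontrol1005:~# rbd snap ls eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c
root@cloudcontrol1005:~# openstack image show 06cf27ba-bed2-48c7-af2b-2abdfa65463c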