Portal:Cloud VPS/Admin/Instance backups

As of 2020-09-01, most VMs hosted on Ceph have their main disk backed up for a few days. This is a last-resort system put in place to guard against a total Ceph cluster collapse; any large-scale restore will be extremely labor-intensive.

We do not provide self-serve snapshots, backup, or restore services for openstack VMs.

Architecture

We are using Backy2 to back up Ceph volumes. Backy2 is designed for fast, incremental backups of rbd volumes and uses deduplication to minimize disk space.

Backups are currently stored on cloudvirts. Cloudvirt hosts with numbers lower than 1031 were built out to support local VM storage; since those VMs now live on Ceph, their SSDs are being reused for non-redundant backup storage. To determine which VMs and projects are backed up on which cloudvirts, consult the file "modules/profile/templates/wmcs/backy2/wmcs_backup_instances.yaml.erb" in the puppet repo.

The backup agent (wmcs-backup-instances) is run daily by a systemd timer. Another systemd timer runs a cleanup script (wmcs-purge-backups) which deletes expired backups. The intended lifespan of a given backup is set when that backup is created.
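
The timers and their logs can be inspected with the standard systemd tools on whichever host runs the backups (cloudbackup1003 here, matching the examples below). The unit names are assumed to match the script names; adjust as needed:

root@cloudbackup1003:~# systemctl list-timers | grep -i backup
root@cloudbackup1003:~# journalctl -u wmcs-backup-instances
root@cloudbackup1003:~# journalctl -u wmcs-purge-backups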

The specific backup process for a given VM is handled by the python library rbd2backy2.py. The steps for each VM (sketched as manual commands after this list) are:

  1. make a new snapshot
  2. collect a diff between today's snapshot and yesterday's snapshot
  3. delete yesterday's snapshot
  4. back up today's snapshot, using the diff as a hint for Backy2 so it knows what to ignore
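
For orientation, the manual equivalent of that loop looks roughly like the following. This is only a sketch: the snapshot names, diff file path, and exact flags are assumptions, and the actual logic (including error handling) lives in rbd2backy2.py:

root@cloudbackup1003:~# rbd snap create eqiad1-compute/<instance_id>_disk@<today>
root@cloudbackup1003:~# rbd diff --whole-object --from-snap <yesterday> --format=json eqiad1-compute/<instance_id>_disk@<today> > /tmp/<instance_id>.diff
root@cloudbackup1003:~# rbd snap rm eqiad1-compute/<instance_id>_disk@<yesterday>
root@cloudbackup1003:~# backy2 backup -s <today> -r /tmp/<instance_id>.diff rbd://eqiad1-compute/<instance_id>_disk@<today> <instance_id>_disk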

What is backed up

Each backup is of a complete VM disk image. Backups do not include any openstack metadata (base image, flavor, etc.), so a given restore is likely to work only within the same openstack installation where it was captured.

Our backup agent has a simple regexp-based filter (in /etc/wmcs_backup_instances.yaml) that excludes some VMs from backup; a hypothetical example of that file appears below. Prime candidates for exclusion are:

  • Cattle: VMs that can be trivially reproduced from scratch from puppet with no data loss. Kubernetes worker nodes are the most obvious example of this.
  • Hogs: VMs that, by special request, have ENORMOUS disk drives for temporary processing work. Typically, if a user requests a quota exception for a VM like this, they should be warned that the VM will not be eligible for backup, and their project should be added to the exclusion list when the project is created. Example: encodingXX.video.eqiad.wmflabs
  • Mayflies: Some internal-use projects are created to run a single test or experiment and then destroyed. Obvious examples of this are VMs in the admin-monitoring project or the sre-sandbox project.

We do not have the capacity to back up everything!
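
For illustration only, the filter file is YAML containing regular expressions that match VM names per project. The keys and patterns below are hypothetical; the authoritative schema is the puppet template mentioned above:

root@cloudbackup1003:~# cat /etc/wmcs_backup_instances.yaml
# hypothetical excerpt -- check the real file / puppet template for the actual schema
exclude_servers:
  toolsbeta:
    - "toolsbeta-test-k8s-worker-.*"
  admin-monitoring:
    - ".*"
  sre-sandbox:
    - ".*"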

Restoring

Backy2 can restore volumes straight into the Ceph pool. To exercise the restore process (or to roll back to a previous backup):

  1. Stop the VM:
    root@cloudcontrol1003:~# openstack server stop ee8bd285-73ab-4981-a1f1-498b79b50e2a
    
  2. Delete (or move) the existing Ceph image; this prevents a filename conflict when restoring. (An example of moving the image aside with rbd mv appears after this list.)
    • If you get a complaint about the volume still having "watchers", check whether the shutdown of the VM really completed.
    • If the command complains about snapshots, use the commands in #Handy commands below to remove the snapshots of that volume.
    root@cloudcontrol1005:~# rbd rm eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
    Removing image: 100% complete...done.
    
  3. Find the backup you want to restore:
    root@cloudbackup1003:~# backy2 ls ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
        INFO: [backy2.logging] $ /usr/bin/backy2 ls ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
    +---------------------+-------------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
    |         date        | name                                      | snapshot_name       | size |  size_bytes |                 uid                  | valid | protected | tags                       |        expire       |
    +---------------------+-------------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
    | 2020-08-19 01:37:12 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-19T01:37:11 | 5120 | 21474836480 | 8136f1c6-e1bc-11ea-94a2-b02628295df0 |   1   |     0     | b_daily,b_monthly,b_weekly | 2020-08-22 00:00:00 |
    | 2020-08-19 02:00:51 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-19T02:00:50 | 5120 | 21474836480 | cedd7db6-e1bf-11ea-b5bb-b02628295df0 |   1   |     0     |                            | 2020-08-22 00:00:00 |
    | 2020-08-20 02:00:49 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T02:00:48 | 5120 | 21474836480 | f8395878-e288-11ea-b5a0-b02628295df0 |   1   |     0     | b_daily                    | 2020-08-27 00:00:00 |
    | 2020-08-20 18:43:04 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T18:43:03 | 5120 | 21474836480 | fb87dc0c-e314-11ea-a855-b02628295df0 |   1   |     0     |                            | 2020-08-27 00:00:00 |
    | 2020-08-20 18:43:31 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T18:43:30 | 5120 | 21474836480 | 0b9046ac-e315-11ea-83c9-b02628295df0 |   1   |     0     |                            | 2020-08-27 00:00:00 |
    | 2020-08-20 18:45:19 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T18:45:18 | 5120 | 21474836480 | 4ba4d38e-e315-11ea-9d18-b02628295df0 |   1   |     0     |                            | 2020-08-27 00:00:00 |
    | 2020-08-20 18:51:33 | ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk | 2020-08-20T18:51:31 | 5120 | 21474836480 | 2ac24c4a-e316-11ea-939c-b02628295df0 |   1   |     0     |                            | 2020-08-27 00:00:00 |
    +---------------------+-------------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
        INFO: [backy2.logging] Backy complete.
    
  4. Note the UID of the desired backup and restore it:
    root@cloudbackup1003:~# backy2 restore 2ac24c4a-e316-11ea-939c-b02628295df0 rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
        INFO: [backy2.logging] $ /usr/bin/backy2 restore 2ac24c4a-e316-11ea-939c-b02628295df0 rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk
        INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk Read Queue [          ] Write Queue [          ] (0.0% 0.0MB/sØ ETA 2m56s) 
        INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk Read Queue [==========] Write Queue [==========] (23.0% 1121.1MB/sØ ETA 3s) 
        INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk Read Queue [==========] Write Queue [==========] (28.2% 1092.6MB/sØ ETA 5s) 
        INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk Read Queue [==========] Write Queue [==========] (34.1% 1087.6MB/sØ ETA 6s) 
    <etc>
    
  5. Start the restored VM:
    root@cloudcontrol1003:~# openstack server start ee8bd285-73ab-4981-a1f1-498b79b50e2a
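
If you would rather keep the old image around than delete it in step 2, rbd can rename it out of the way instead; the '.bak' suffix below is arbitrary:

root@cloudcontrol1005:~# rbd mv eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk eqiad1-compute/ee8bd285-73ab-4981-a1f1-498b79b50e2a_disk.bak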
    

Restoring after 'openstack server delete'

We have rescued at least one VM from the void after an accidental deletion. The process involves creating a new 'host' VM with the same name (so that DNS, Neutron, etc. are hooked up properly) and then overlaying the disk image of the new host with the restored backup.

This may be possible, with the following caveats:

  • Backups are only preserved for 7 days, so if the deletion is noticed weeks or months later it is probably too late.
  • The restored VM will lose much of its openstack state: it will have a new IP address, forget its security groups, and most likely need its puppet config replaced in Horizon.
  • If the VM predated the move from .eqiad.wmflabs to .eqiad1.wikimedia.cloud, the new VM will only be present under the new domain, eqiad1.wikimedia.cloud.

Here are the steps for rescue:

  1. Locate the VM in the nova database:
    # mysql -u root nova_eqiad1
    [nova_eqiad1]> SELECT hostname, id, image_ref, instance_type_id FROM instances WHERE hostname LIKE "<hostname>";
    
  2. Locate the flavor in the nova api database:
    # mysql -u root nova_api_eqiad1
    [nova_api_eqiad1]> SELECT name, ID FROM flavors WHERE id='<instance_type_id from above>';
    
  3. Create the new host VM:
    # OS_PROJECT_ID=<project> openstack server create --nic net-id=7425e328-560c-4f00-8e99-706f3fb90bb4 --flavor <flavor_id_from_above> --image <image_ref_from_above> <hostname>
    
  4. Proceed with the #Restoring steps from above
  5. Confirm that puppet runs on the restored VM
  6. Add security groups, floating IPs, etc. as needed in Horizon
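
The net-id passed to 'openstack server create' above is hard-coded; if in doubt about which network UUID to use, the available networks can be listed with the standard client:

root@cloudcontrol1003:~# openstack network list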

Restoring a lost Glance image

Glance images are backed up on cloudcontrol nodes: each image is backed up on every node. Restoring is similar to the process for instances, but Glance accesses a snapshot rather than the primary rbd image, so there is an extra step. In this example we are restoring an image with id '06cf27ba-bed2-48c7-af2b-2abdfa65463c'.

  1. Find the backup you want to restore:
    root@cloudcontrol1005:~# backy2 ls 06cf27ba-bed2-48c7-af2b-2abdfa65463c
        INFO: [backy2.logging] $ /usr/bin/backy2 ls 06cf27ba-bed2-48c7-af2b-2abdfa65463c
    +---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
    |         date        | name                                 | snapshot_name       | size |  size_bytes |                 uid                  | valid | protected | tags                       |        expire       |
    +---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
    | 2020-10-20 16:00:03 | 06cf27ba-bed2-48c7-af2b-2abdfa65463c | 2020-10-20T16:00:02 | 4864 | 20401094656 | 508686ba-12ed-11eb-a7f5-4cd98fc4a649 |   1   |     0     | b_daily,b_monthly,b_weekly | 2020-10-27 00:00:00 |
    +---------------------+--------------------------------------+---------------------+------+-------------+--------------------------------------+-------+-----------+----------------------------+---------------------+
        INFO: [backy2.logging] Backy complete.
    
  2. Note the UID of the desired backup and restore it:
    root@cloudcontrol1005:~# backy2 restore 508686ba-12ed-11eb-a7f5-4cd98fc4a649 rbd://eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c
        INFO: [backy2.logging] $ /usr/bin/backy2 restore 508686ba-12ed-11eb-a7f5-4cd98fc4a649 rbd://eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c
        INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c Read Queue [          ] Write Queue [          ] (0.0% 0.0MB/sØ ETA 2m57s) 
        INFO: [backy2.logging] Restore phase 1/2 (sparse) to rbd://eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c Read Queue [==========] Write Queue [==========] (9.4% 244.1MB/sØ ETA 11s) 
    <etc>
    
  3. Create a snapshot named 'snap' for Glance to access:
    root@cloudcontrol1005:~# rbd snap create eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c@snap
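
Once the image and its 'snap' snapshot are back in place, it is worth confirming that everything is visible again; a quick check, reusing the example image id:

root@cloudcontrol1005:~# rbd snap ls eqiad1-glance-images/06cf27ba-bed2-48c7-af2b-2abdfa65463c
root@cloudcontrol1005:~# openstack image show 06cf27ba-bed2-48c7-af2b-2abdfa65463c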
    


Handy commands

List every backup on a backup server:

root@cloudbackup1003:~# backy2 ls

List every backup for a given VM:

root@cloudbackup1003:~# backy2 ls <instance_id>_disk

List all rbd snapshots for a given VM:

root@cloudcontrol1005:~# rbd --pool eqiad1-compute snap ls <instance_id>_disk

Note: The backup job should leave at most one rbd snapshot for any given VM. If there are several, something has gone wrong and we are probably leaking Ceph storage space rapidly.

Purge all rbd snapshots for a given VM:

root@cloudcontrol1005:~# rbd --pool eqiad1-compute snap purge <instance_id>_disk

Note: Purging snapshots is not especially disruptive, but it will force the next backup to read the entire volume rather than only the changed blocks, which will slow that backup down considerably.

Future concerns

  • Until we have more experience with large-scale Ceph use, any time and space estimates are approximate. If it turns out to take more than 24 hours to back up the whole cluster, or cloudvirt1024 doesn't have space for all the backups, there are a few options:
    • Reduce the number of daily backups we retain. Even if we go as low as 2 days, these backups will still be valuable.
    • Split backup jobs across more hosts.
    • Exclude more projects or VM types from backup.
  • To support incremental backups, every rbd image is accompanied at all times by yesterday's snapshot. Depending on how snapshots are stored, that may turn out to consume a massive amount of precious Ceph storage space. If this turns out to be an issue we may need to abandon incremental backups, or use some convoluted process like restoring yesterday's backup into an image, doing a diff, and then removing yesterday's backup. It should become clearer what tradeoffs to make as usage increases.
  • Running backup jobs on cloudvirt1024 may interfere with performance of VMs hosted there. Ideally we'd have some way to allocate some cores for backups and other cores for virtualization.
  • Some users of Backy2 report that taking snapshots on an active Ceph cluster causes performance lags during snapshotting. We need to keep an eye out for such problems.
  • The restoration process documented here is too cumbersome for mass restoration. Probably a restore feature should be added to rbd2backy2.py.