Portal:Cloud VPS/Admin/Runbooks/Check for snapshots leaked by cinder backup agent
Error / Incident
Usually an email/alertmanager/icinga alert with the subject ** PROBLEM alert - <hostname>/Check for snapshots leaked by cinder backup agent test is CRITICAL **
This happens when something goes wrong with the periodic cinder backups. Common causes:
- There's a backup that times out.
- Cinder-volume service is down.
Quick check
Verify leaked snapshots:
user@cloudcontrol1005:~ $ sudo wmcs-openstack volume snapshot list
| ID | Name | Description | Status | Size |
| d4aad7fb-97ed-4fa5-a06b-ae7f4b76feab | wmde-templates-alpha-nfs-2022-02-23T10:34:32.423757 | None | available | 10 |
| 4406f4ce-ca22-4f57-a8e5-8dff8cf32270 | wikilink-nfs-2022-02-23T10:34:01.855598 | None | available | 10 |
| e5c9d3ef-3d8a-40f5-90f0-900f1e87297a | wikidumpparse-nfs-2022-02-23T10:32:36.696177 | None | available | 260 |
| 9d9aba32-9795-4d60-9d00-1005f5a19483 | proxy-03-backup-2022-02-23T10:32:08.152936 | None | available | 10 |
| a4acc0c9-2a56-4bb4-bace-644a838a4922 | proxy-04-backup-2022-02-23T10:32:02.187232 | None | available | 10 |
| 26ce6bea-6174-4960-9951-3ac8786cef96 | dumps-nfs-2022-02-23T10:31:14.228836 | None | available | 80 |
| b33fde43-703d-4fea-a27b-90a77b6fc049 | twl-nfs-2022-02-23T09:30:51.449991 | None | available | 100 |
| 77e4b1dd-7115-44d9-8dc5-d10999fb1003 | testlabs-nfs-2022-02-23T09:30:42.998448 | None | available | 40 |
| 0b02c50c-53f2-478e-8e2f-dc110b9972fb | quarry-nfs-2022-02-23T09:28:07.622987 | None | available | 400 |
| 4716e085-6ebd-4da9-974d-0b891fab6d92 | proxy-04-backup-2022-02-23T09:27:52.369365 | None | available | 10 |
| 2b347ed5-0dca-4495-8be7-8cd24efdea59 | huggle-nfs-2022-02-23T09:27:33.000022 | None | available | 40 |
| 405b056c-530f-479c-9e2c-630248ae5c20 | dumps-nfs-2022-02-23T09:27:23.461385 | None | available | 80 |
| 7f7676a4-c7b0-4dc2-8146-d76764afd6a8 | cvn-nfs-2022-02-23T09:27:14.921842 | None | available | 8 |
| f4d18036-f2f9-4c3b-8dd8-39cff9081925 | scratch-2022-02-23T09:25:37.183037 | None | available | 3072 |
| e6bf9c4c-a262-40e3-8beb-9c19545924e9 | utrs-nfs-2022-02-21T17:28:35.599328 | None | deleting | 10 |
| 3d215281-4e22-40ce-852b-9555b7727f35 | quarry-nfs-2022-02-21T16:35:24.291820 | None | available | 400 |
This list should be empty, because the backup_cinder_volumes service cleans up its snapshots after running the backup.
If the list is not empty, something is not working as expected.
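When the list is long, a per-status count is easier to read than the full table. A minimal sketch, assuming the `-f value -c ID -c Status` output format used further down in this runbook; the function name `summarize_snapshot_status` is made up for illustration:

```shell
# Hypothetical helper: reads "<snapshot_id> <status>" lines on stdin and
# prints one "<status> <count>" line per status, sorted by status name.
summarize_snapshot_status() {
    awk 'NF >= 2 { count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Intended use on a cloudcontrol node (not executed by this block):
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | summarize_snapshot_status
```

Empty output means there are no leftover snapshots at all.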
Check the service status:
user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service
Check the service logs:
user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service
Check cinder logs:
user@cloudcontrol1005:~$ sudo journalctl -u cinder-volume.service
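The journals can be long, so it may help to limit the time window and keep only lines that look like failures. A rough sketch: the helper name `filter_failures` and the search pattern are assumptions, not a complete list of what cinder logs on failure.

```shell
# Hypothetical helper: keeps only log lines that look like failures.
# The pattern is a guess at common failure wording, adjust it as needed.
# "|| true" keeps the exit status at 0 when nothing matches.
filter_failures() {
    grep -Ei 'error|traceback|timed out|timeout' || true
}

# Intended use on a cloudcontrol node (not executed by this block);
# --since and --no-pager are standard journalctl options:
#   sudo journalctl -u backup_cinder_volumes.service --since "24 hours ago" --no-pager | filter_failures
#   sudo journalctl -u cinder-volume.service --since "24 hours ago" --no-pager | filter_failures
```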
Common remediation operations
Verify if cinder API is up and running and start it if not
Most of the time the root of the problem is the cinder API being down. To verify that it's up and running, run the following on each cloudcontrol node:
user@cloudcontrol1005:~$ sudo wmcs-openstack volume service list
| Binary | Host | Zone | Status | State | Updated At |
| cinder-scheduler | cloudcontrol1004 | nova | enabled | up | 2022-06-06T14:52:24.000000 |
| cinder-scheduler | cloudcontrol1003 | nova | enabled | up | 2022-06-06T14:52:28.000000 |
| cinder-volume | cloudcontrol1004@rbd | nova | enabled | up | 2022-06-06T14:52:29.000000 |
| cinder-volume | cloudcontrol1005@rbd | nova | enabled | up | 2022-06-06T14:52:23.000000 |
| cinder-volume | cloudcontrol1003@rbd | nova | enabled | up | 2022-06-06T14:52:27.000000 |
| cinder-scheduler | cloudcontrol1005 | nova | enabled | up | 2022-06-06T14:52:28.000000 |
| cinder-backup | cloudbackup2002 | nova | enabled | up | 2022-06-06T14:52:22.000000 |
user@cloudcontrol1005:~$ sudo systemctl status 'cinder*' -l
There should be 3 services up and running: cinder-api, cinder-volume and cinder-scheduler.
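To avoid reading the service table by eye, the listing can be filtered for anything that is not "up". A minimal sketch, assuming the `Binary`, `Host` and `State` columns shown in the sample output above; the function name `report_down_services` is hypothetical. Note that the sample listing does not include cinder-api, so that one still needs the systemctl check.

```shell
# Hypothetical helper: reads "<binary> <host> <state>" lines on stdin and
# prints every service whose state is not "up".
# Exit status is 1 if at least one such service was found, 0 otherwise.
report_down_services() {
    awk 'NF >= 3 && $3 != "up" { print $1, $2, $3; bad = 1 } END { exit bad + 0 }'
}

# Intended use on a cloudcontrol node (not executed by this block):
#   sudo wmcs-openstack volume service list -f value -c Binary -c Host -c State | report_down_services
```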
Examine leftover snapshots
user@cloudcontrol1005:~$ sudo wmcs-openstack volume snapshot show b56c4fea-5c77-4e35-bc6b-6ace1e1dd996
| Field | Value |
| created_at | 2022-06-06T10:30:02.000000 |
| description | None |
| id | b56c4fea-5c77-4e35-bc6b-6ace1e1dd996 |
| name | scratch-2022-06-06T10:30:02.003496 |
| os-extended-snapshot-attributes:progress | 100% |
| os-extended-snapshot-attributes:project_id | admin |
| properties | |
| size | 3072 |
| status | available |
| updated_at | 2022-06-06T14:06:57.000000 |
| volume_id | d1478efd-9fa6-4293-8389-e72459b794c0 |
user@cloudcontrol1005:~$ sudo wmcs-openstack volume show d1478efd-9fa6-4293-8389-e72459b794c0
| Field | Value |
| attachments | [{'id': 'd1478efd-9fa6-4293-8389-e72459b794c0', 'attachment_id': '957e9c36-04c7-4234-998f-7bab32174d93', |
| | 'volume_id': 'd1478efd-9fa6-4293-8389-e72459b794c0', 'server_id': '2fd8eb82-33ec-4060-91c6-cc0a90de8994', |
| | 'host_name': 'cloudvirt1046', 'device': '/dev/sdb', 'attached_at': '2022-05-13T04:31:46.000000'}] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2022-01-14T22:28:57.000000 |
| description | None |
| encrypted | False |
| id | d1478efd-9fa6-4293-8389-e72459b794c0 |
| migration_status | None |
| multiattach | False |
| name | scratch |
| os-vol-host-attr:host | cloudcontrol1004@rbd#RBD |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | cloudinfra-nfs |
| properties | |
| replication_status | None |
| size | 3072 |
| snapshot_id | None |
| source_volid | None |
| status | in-use |
| type | standard |
| updated_at | 2022-05-13T04:33:39.000000 |
| user_id | novaadmin |
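Instead of inspecting snapshots one by one, it can be useful to see which volumes accumulate the most leftovers. In the listings above the snapshot names look like the volume name followed by a timestamp (for example scratch-2022-06-06T10:30:02.003496 for the volume scratch); the sketch below relies on that observed pattern, which is an assumption rather than a documented guarantee. The function name `count_snapshots_per_volume` is hypothetical.

```shell
# Hypothetical helper: reads snapshot names on stdin, strips the trailing
# "-<YYYY-MM-DD>T<time>" suffix, and prints "<count> <volume_name>" lines,
# highest count first. Assumes volume names contain no whitespace.
count_snapshots_per_volume() {
    sed -E 's/-[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+$//' \
        | sort | uniq -c \
        | awk '{ print $1, $2 }' \
        | sort -k1,1nr -k2,2
}

# Intended use on a cloudcontrol node (not executed by this block):
#   sudo wmcs-openstack volume snapshot list -f value -c Name | count_snapshots_per_volume
```

A volume that shows up several times is a good candidate for the `volume show` check above.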
Cleanup of corrupted backups and old volume snapshots
The backup_cinder_volumes service uses the admin project to store temporary volume snapshots before backing them up.
If you are sure they are not in use, you can just clean them up. Before doing so, check whether there are any backups that are not in the available state:
user@cloudcontrol1005:~ $ sudo wmcs-openstack volume backup list | grep -v available
If there are any, you can delete them with:
user@cloudcontrol1005:~ $ for backup_id in $(sudo wmcs-openstack volume backup list -f value -c ID -c Status | grep -v available | awk '{print $1}'); do sudo wmcs-openstack volume backup delete --force "$backup_id"; done
Then you can proceed to remove the volume snapshots that are not being used (status available):
user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete $i ; done
If you want a more aggressive approach, you can force the operation with:
user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete $i --force ; done
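Before running either loop for real, it may be worth previewing exactly what would be deleted. A minimal dry-run sketch of the same loop; the function name `delete_available_snapshots` and the `DRY_RUN` variable are hypothetical, not part of any existing tooling:

```shell
# Hypothetical helper: reads "<snapshot_id> <status>" lines on stdin and acts
# only on snapshots whose status is exactly "available".
# With DRY_RUN=1 (the default) it prints the commands instead of running them.
delete_available_snapshots() {
    awk '$2 == "available" { print $1 }' | while read -r snap_id; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sudo wmcs-openstack volume snapshot delete $snap_id"
        else
            # </dev/null stops sudo from consuming the rest of the ID list.
            sudo wmcs-openstack volume snapshot delete "$snap_id" </dev/null
        fi
    done
}

# Intended use on a cloudcontrol node (not executed by this block):
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | delete_available_snapshots
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | DRY_RUN=0 delete_available_snapshots
```

Matching the status column exactly, rather than grepping the whole line, avoids picking up a snapshot just because the word "available" appears somewhere else in the output.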
Some snapshots may get stuck forever in the 'deleting' step. This seems to happen when snapshot creation fails such that a snapshot has an entry in the Cinder database but no actual snap in ceph. These can often be deleted by first resetting their state to 'error':
user@cloudcontrol1005:~ $ openstack volume snapshot set --state error <snap_id>
user@cloudcontrol1005:~ $ openstack volume snapshot delete --force <snap_id>
Of course this doesn't solve the root of the problem, just the symptom.
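When several snapshots are stuck, the two commands above can be generated per snapshot and reviewed before running them. A sketch only: the function name `print_stuck_snapshot_fixes` is hypothetical, and a snapshot that is legitimately being deleted right now will also show up as deleting, so check the output before pasting it.

```shell
# Hypothetical helper: reads "<snapshot_id> <status>" lines on stdin and, for
# each snapshot in the "deleting" state, prints the reset and force-delete
# commands from this runbook. It only prints; nothing is executed.
print_stuck_snapshot_fixes() {
    awk '$2 == "deleting" {
        print "openstack volume snapshot set --state error " $1
        print "openstack volume snapshot delete --force " $1
    }'
}

# Intended use on a cloudcontrol node (not executed by this block):
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | print_stuck_snapshot_fixes
```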
Sometimes snapshots cannot be deleted because Ceph still holds a reference to them in the trash. The blunt way to handle that is to purge the trash:
rbd trash purge --pool eqiad1-cinder
Or by investigating directly:
root@cloudcontrol1005:~# rbd --pool eqiad1-cinder snap unprotect volume-8d687b46-03b8-4308-9b71-13704a664290@snapshot-92206021-5a25-4a66-9707-6c8fefa761f8
2022-11-14T22:08:21.386+0000 7fcbdeffd700 -1 librbd::SnapshotUnprotectRequest: cannot unprotect: at least 1 child(ren) [38fb1c4569b367] in pool 'eqiad1-cinder'
2022-11-14T22:08:21.386+0000 7fcbdeffd700 -1 librbd::SnapshotUnprotectRequest: encountered error: (16) Device or resource busy
2022-11-14T22:08:21.386+0000 7fcbdeffd700 -1 librbd::SnapshotUnprotectRequest: 0x5615fc59ca10 should_complete_error: ret_val=-16
rbd: unprotecting snap failed: (16) Device or resource busy
2022-11-14T22:08:21.390+0000 7fcbdeffd700 -1 librbd::SnapshotUnprotectRequest: 0x5615fc59ca10 should_complete_error: ret_val=-16
root@cloudcontrol1005:~# rbd children -a eqiad1-cinder/volume-8d687b46-03b8-4308-9b71-13704a664290@snapshot-92206021-5a25-4a66-9707-6c8fefa761f8
eqiad1-cinder/volume-c303be41-0fb1-41d7-8e37-3c0318559b2a (trash 38fb1c4569b367)
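The ID in parentheses is the trash entry that keeps the snapshot busy. A small sketch for pulling those IDs out of the `rbd children -a` output, assuming the "(trash <id>)" format shown above; the function name `trash_child_ids` is hypothetical:

```shell
# Hypothetical helper: reads `rbd children -a` output on stdin and prints the
# image ID of every child that lives in the trash, one per line.
# Children that are not in the trash are ignored.
trash_child_ids() {
    sed -nE 's/.*\(trash ([0-9a-f]+)\).*/\1/p'
}

# Intended use on a cloudcontrol node (not executed by this block):
#   rbd children -a eqiad1-cinder/volume-<volume_id>@snapshot-<snapshot_id> | trash_child_ids
# The IDs can then be compared against the trash listing:
#   rbd trash ls --pool eqiad1-cinder
```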
See also
There is no service page yet, so for now there's just the proposal:
Old occurrences
- Phabricator T302382 - icinga alert: Check for snapshots leaked by cinder backup agent -- example of a real-life alert
- phab:T302720