Portal:Cloud VPS/Admin/Runbooks/Check for snapshots leaked by cinder backup agent

From Wikitech
The procedures in this runbook require admin permissions to complete.

Error / Incident

Usually an email/alertmanager/icinga alert with the subject ** PROBLEM alert - <hostname>/Check for snapshots leaked by cinder backup agent test is CRITICAL **

This happens when something goes wrong with the periodic cinder backups. Common causes:

  • A backup times out.
  • The cinder-volume service is down.

Debugging

Quick check

Verify leaked snapshots:

user@cloudcontrol1005:~ $ sudo wmcs-openstack volume snapshot list
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| ID                                   | Name                                                | Description | Status    | Size |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| d4aad7fb-97ed-4fa5-a06b-ae7f4b76feab | wmde-templates-alpha-nfs-2022-02-23T10:34:32.423757 | None        | available |   10 |
| 4406f4ce-ca22-4f57-a8e5-8dff8cf32270 | wikilink-nfs-2022-02-23T10:34:01.855598             | None        | available |   10 |
| e5c9d3ef-3d8a-40f5-90f0-900f1e87297a | wikidumpparse-nfs-2022-02-23T10:32:36.696177        | None        | available |  260 |
| 9d9aba32-9795-4d60-9d00-1005f5a19483 | proxy-03-backup-2022-02-23T10:32:08.152936          | None        | available |   10 |
| a4acc0c9-2a56-4bb4-bace-644a838a4922 | proxy-04-backup-2022-02-23T10:32:02.187232          | None        | available |   10 |
| 26ce6bea-6174-4960-9951-3ac8786cef96 | dumps-nfs-2022-02-23T10:31:14.228836                | None        | available |   80 |
| b33fde43-703d-4fea-a27b-90a77b6fc049 | twl-nfs-2022-02-23T09:30:51.449991                  | None        | available |  100 |
| 77e4b1dd-7115-44d9-8dc5-d10999fb1003 | testlabs-nfs-2022-02-23T09:30:42.998448             | None        | available |   40 |
| 0b02c50c-53f2-478e-8e2f-dc110b9972fb | quarry-nfs-2022-02-23T09:28:07.622987               | None        | available |  400 |
| 4716e085-6ebd-4da9-974d-0b891fab6d92 | proxy-04-backup-2022-02-23T09:27:52.369365          | None        | available |   10 |
| 2b347ed5-0dca-4495-8be7-8cd24efdea59 | huggle-nfs-2022-02-23T09:27:33.000022               | None        | available |   40 |
| 405b056c-530f-479c-9e2c-630248ae5c20 | dumps-nfs-2022-02-23T09:27:23.461385                | None        | available |   80 |
| 7f7676a4-c7b0-4dc2-8146-d76764afd6a8 | cvn-nfs-2022-02-23T09:27:14.921842                  | None        | available |    8 |
| f4d18036-f2f9-4c3b-8dd8-39cff9081925 | scratch-2022-02-23T09:25:37.183037                  | None        | available | 3072 |
| e6bf9c4c-a262-40e3-8beb-9c19545924e9 | utrs-nfs-2022-02-21T17:28:35.599328                 | None        | deleting  |   10 |
| 3d215281-4e22-40ce-852b-9555b7727f35 | quarry-nfs-2022-02-21T16:35:24.291820               | None        | available |  400 |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+

This list should be empty, because the backup_cinder_volumes service cleans up its snapshots after running each backup. A non-empty list indicates that something is not working as expected.
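When the list is long, a per-status count can be easier to read than the full table. The following is a minimal sketch, assuming the "ID Status" layout produced by the -f value -c ID -c Status options used later in this runbook; summarize_snapshot_statuses is a hypothetical helper name, not part of any installed tool.

```shell
# Count snapshots per status.
# Expects "ID Status" lines on stdin, for example from:
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status
# summarize_snapshot_statuses is a hypothetical helper for this sketch.
summarize_snapshot_statuses() {
    awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage (run on a cloudcontrol node):
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status \
#       | summarize_snapshot_statuses
```

No output means there are no snapshots at all, which is the expected healthy state.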

Check the service status:

user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service

Check the service logs:

user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service

Check cinder logs:

user@cloudcontrol1005:~$ sudo journalctl -u cinder-volume.service

Common remediation operations

Verify if cinder API is up and running and start it if not

Most of the time the root of the problem is the cinder API being down. To verify that it is up and running, run the following on each cloudcontrol node:

user@cloudcontrol1005:~$ sudo wmcs-openstack volume service list
+------------------+----------------------+------+---------+-------+----------------------------+
| Binary           | Host                 | Zone | Status  | State | Updated At                 |
+------------------+----------------------+------+---------+-------+----------------------------+
| cinder-scheduler | cloudcontrol1004     | nova | enabled | up    | 2022-06-06T14:52:24.000000 |
| cinder-scheduler | cloudcontrol1003     | nova | enabled | up    | 2022-06-06T14:52:28.000000 |
| cinder-volume    | cloudcontrol1004@rbd | nova | enabled | up    | 2022-06-06T14:52:29.000000 |
| cinder-volume    | cloudcontrol1005@rbd | nova | enabled | up    | 2022-06-06T14:52:23.000000 |
| cinder-volume    | cloudcontrol1003@rbd | nova | enabled | up    | 2022-06-06T14:52:27.000000 |
| cinder-scheduler | cloudcontrol1005     | nova | enabled | up    | 2022-06-06T14:52:28.000000 |
| cinder-backup    | cloudbackup2002      | nova | enabled | up    | 2022-06-06T14:52:22.000000 |
+------------------+----------------------+------+---------+-------+----------------------------+
user@cloudcontrol1005:~$ sudo systemctl status 'cinder*' -l

There should be three services up and running: cinder-api, cinder-volume and cinder-scheduler.
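To spot a service that is down without reading the whole table, the listing can be filtered on its State column. This is a minimal sketch, assuming the Binary, Host and State columns shown in the example output above; list_down_services is a hypothetical helper name. Note that cinder-api does not appear in the example listing, so it still needs the systemctl check.

```shell
# Print every cinder service whose state is not "up".
# Expects "Binary Host State" lines on stdin, for example from:
#   sudo wmcs-openstack volume service list -f value -c Binary -c Host -c State
# list_down_services is a hypothetical helper for this sketch.
list_down_services() {
    awk '$3 != "up" { print $1, $2, $3 }'
}

# Usage (run on a cloudcontrol node):
#   sudo wmcs-openstack volume service list -f value -c Binary -c Host -c State \
#       | list_down_services
```

An empty result means every service in the listing reports itself as up.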

Examine leftover snapshots

user@cloudcontrol1005:~$ sudo wmcs-openstack volume snapshot show b56c4fea-5c77-4e35-bc6b-6ace1e1dd996
+--------------------------------------------+--------------------------------------+
| Field                                      | Value                                |
+--------------------------------------------+--------------------------------------+
| created_at                                 | 2022-06-06T10:30:02.000000           |
| description                                | None                                 |
| id                                         | b56c4fea-5c77-4e35-bc6b-6ace1e1dd996 |
| name                                       | scratch-2022-06-06T10:30:02.003496   |
| os-extended-snapshot-attributes:progress   | 100%                                 |
| os-extended-snapshot-attributes:project_id | admin                                |
| properties                                 |                                      |
| size                                       | 3072                                 |
| status                                     | available                            |
| updated_at                                 | 2022-06-06T14:06:57.000000           |
| volume_id                                  | d1478efd-9fa6-4293-8389-e72459b794c0 |
+--------------------------------------------+--------------------------------------+
user@cloudcontrol1005:~$ sudo wmcs-openstack volume show d1478efd-9fa6-4293-8389-e72459b794c0
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
| Field                          | Value                                                                                                     |
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
| attachments                    | [{'id': 'd1478efd-9fa6-4293-8389-e72459b794c0', 'attachment_id': '957e9c36-04c7-4234-998f-7bab32174d93',  |
|                                | 'volume_id': 'd1478efd-9fa6-4293-8389-e72459b794c0', 'server_id': '2fd8eb82-33ec-4060-91c6-cc0a90de8994', |
|                                | 'host_name': 'cloudvirt1046', 'device': '/dev/sdb', 'attached_at': '2022-05-13T04:31:46.000000'}]         |
| availability_zone              | nova                                                                                                      |
| bootable                       | false                                                                                                     |
| consistencygroup_id            | None                                                                                                      |
| created_at                     | 2022-01-14T22:28:57.000000                                                                                |
| description                    | None                                                                                                      |
| encrypted                      | False                                                                                                     |
| id                             | d1478efd-9fa6-4293-8389-e72459b794c0                                                                      |
| migration_status               | None                                                                                                      |
| multiattach                    | False                                                                                                     |
| name                           | scratch                                                                                                   |
| os-vol-host-attr:host          | cloudcontrol1004@rbd#RBD                                                                                  |
| os-vol-mig-status-attr:migstat | None                                                                                                      |
| os-vol-mig-status-attr:name_id | None                                                                                                      |
| os-vol-tenant-attr:tenant_id   | cloudinfra-nfs                                                                                            |
| properties                     |                                                                                                           |
| replication_status             | None                                                                                                      |
| size                           | 3072                                                                                                      |
| snapshot_id                    | None                                                                                                      |
| source_volid                   | None                                                                                                      |
| status                         | in-use                                                                                                    |
| type                           | standard                                                                                                  |
| updated_at                     | 2022-05-13T04:33:39.000000                                                                                |
| user_id                        | novaadmin                                                                                                 |
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
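The two lookups above can be chained so that a snapshot ID leads straight to its source volume. This is a rough sketch: show_snapshot_source and the OPENSTACK variable are hypothetical names introduced here, and the fields requested from the volume are only an example.

```shell
# OPENSTACK is an assumption of this sketch: the command used to reach the API.
OPENSTACK="${OPENSTACK:-sudo wmcs-openstack}"

# Print the name and status of the volume a snapshot was taken from.
# show_snapshot_source is a hypothetical helper for this sketch.
show_snapshot_source() {
    snapshot_id="$1"
    # Look up the volume_id field of the snapshot.
    volume_id="$($OPENSTACK volume snapshot show "$snapshot_id" -f value -c volume_id)" || return 1
    # Show only the requested fields of that volume.
    $OPENSTACK volume show "$volume_id" -f value -c name -c status
}

# Usage (run on a cloudcontrol node):
#   show_snapshot_source b56c4fea-5c77-4e35-bc6b-6ace1e1dd996
```

A volume in status in-use, as in the example above, is attached to a server, which helps when judging whether a leftover snapshot belongs to a volume that is still active.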


Cleanup of corrupted backups and old volume snapshots

The backup_cinder_volumes service uses the admin project to store temporary volume snapshots before backing them up.

If you are sure they are not in use, you can simply clean them up. Before doing so, check whether there are any backups that are not in the available state:

user@cloudcontrol1005:~ $ sudo wmcs-openstack volume backup list | grep -v available

If there are any, you can delete them with:

user@cloudcontrol1005:~ $ for backup_id in $(sudo wmcs-openstack volume backup list -f value -c ID -c Status | grep -v available | awk '{print $1}'); do sudo wmcs-openstack volume backup delete --force "$backup_id"; done

Then you can proceed to remove the volume snapshots that are not being used (status available):

user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete "$i" ; done

If you want a more aggressive approach, you can force the operation with:

user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete --force "$i" ; done
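Before running either loop for real, it can help to preview exactly which snapshots would be removed. The sketch below selects IDs by an exact match on the status column rather than a grep over the whole line; ids_with_status is a hypothetical helper name, and the dry-run loop in the comments only prints the delete commands.

```shell
# Print the IDs whose status is exactly the one given as the first argument.
# Expects "ID Status" lines on stdin, for example from:
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status
# ids_with_status is a hypothetical helper for this sketch.
ids_with_status() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Dry run (run on a cloudcontrol node): print the commands, do not execute them.
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status \
#       | ids_with_status available \
#       | while read -r id; do
#             echo sudo wmcs-openstack volume snapshot delete "$id"
#         done
```

Dropping the echo turns the dry run into the actual deletion.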

Some snapshots may get stuck forever in the 'deleting' state. This seems to happen when snapshot creation fails in such a way that the snapshot has an entry in the Cinder database but no actual snap in ceph. These can often be deleted by first resetting their state to 'error':

user@cloudcontrol1005:~ $ openstack volume snapshot set --state error <snap_id>
user@cloudcontrol1005:~ $ openstack volume snapshot delete --force <snap_id>
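When several snapshots are stuck, the two commands above can be generated for each of them and reviewed before anything is run. This is a minimal sketch, assuming the same "ID Status" input layout as the loops above; print_stuck_snapshot_fixes is a hypothetical helper name, and it only prints commands.

```shell
# For every snapshot in the "deleting" state, print the reset and delete
# commands from this runbook. Nothing is executed.
# Expects "ID Status" lines on stdin, for example from:
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status
# print_stuck_snapshot_fixes is a hypothetical helper for this sketch.
print_stuck_snapshot_fixes() {
    awk '$2 == "deleting" {
        print "openstack volume snapshot set --state error " $1
        print "openstack volume snapshot delete --force " $1
    }'
}
```

A snapshot that entered 'deleting' only moments ago may still be deleted normally, so it is worth checking how long each one has been in that state before running the printed commands.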

Of course this doesn't solve the root of the problem, just the symptom.

Sometimes snapshots cannot be deleted because ceph still has a reference to them in the trash. That can be handled bluntly by purging the trash:

root@cloudcontrol1005:~# rbd trash purge --pool eqiad1-cinder

Or you can investigate directly to find out what still references the snapshot:

root@cloudcontrol1005:~# rbd --pool eqiad1-cinder snap unprotect volume-8d687b46-03b8-4308-9b71-13704a664290@snapshot-92206021-5a25-4a66-9707-6c8fefa761f8
2022-11-14T22:08:21.386+0000 7fcbdeffd700 -1 librbd::SnapshotUnprotectRequest: cannot unprotect: at least 1 child(ren) [38fb1c4569b367] in pool 'eqiad1-cinder'
2022-11-14T22:08:21.386+0000 7fcbdeffd700 -1 librbd::SnapshotUnprotectRequest: encountered error: (16) Device or resource busy
2022-11-14T22:08:21.386+0000 7fcbdeffd700 -1 librbd::SnapshotUnprotectRequest: 0x5615fc59ca10 should_complete_error: ret_val=-16
rbd: unprotecting snap failed: (16) Device or resource busy
2022-11-14T22:08:21.390+0000 7fcbdeffd700 -1 librbd::SnapshotUnprotectRequest: 0x5615fc59ca10 should_complete_error: ret_val=-16
root@cloudcontrol1005:~# rbd children -a eqiad1-cinder/volume-8d687b46-03b8-4308-9b71-13704a664290@snapshot-92206021-5a25-4a66-9707-6c8fefa761f8
eqiad1-cinder/volume-c303be41-0fb1-41d7-8e37-3c0318559b2a (trash 38fb1c4569b367)
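In the output above, the child that blocks the unprotect is an image sitting in the trash, identified by the ID in parentheses. The sketch below pulls those IDs out of the rbd children output so they can be compared with the trash listing; extract_trash_ids is a hypothetical helper name, and the input format is assumed to match the example line above.

```shell
# Print the trash image ID from each "rbd children -a" line of the form:
#   <pool>/<image> (trash <id>)
# Lines without a "(trash ...)" marker are ignored.
# extract_trash_ids is a hypothetical helper for this sketch.
extract_trash_ids() {
    sed -n 's/.*(trash \([0-9a-f][0-9a-f]*\)).*/\1/p'
}

# Usage (as root on a cloudcontrol node), with the snapshot spec elided here:
#   rbd children -a eqiad1-cinder/<volume>@<snapshot> | extract_trash_ids
# The IDs can then be compared with the output of:
#   rbd trash ls --pool eqiad1-cinder
```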

See also

There is no service page yet, so for now there's just the proposal:

Old occurrences