Portal:Cloud VPS/Admin/Runbooks/Check unit status of backup cinder volumes


Overview

The procedures in this runbook require admin permissions to complete.

This runbook covers backup_cinder_volumes.timer, the systemd timer that triggers the cinder volume backups.
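
To see when the timer last fired and when the next run is scheduled, the standard systemd timer listing can be used:

user@cloudcontrol1005:~$ sudo systemctl list-timers backup_cinder_volumes.timer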

Error / Incident

The systemd timer failed.

Debugging

First of all, ssh to the machine (e.g. cloudcontrol1005) and check the timer status; you might get some useful logs from there:

$ ssh cloudcontrol1005.wikimedia.org
user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.timer
 ● backup_cinder_volumes.timer - Periodic execution of backup_cinder_volumes.service
     Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Tue 2022-02-22 02:50:52 UTC; 1 weeks 1 days ago
    Trigger: Wed 2022-03-02 10:30:00 UTC; 1h 19min left
   Triggers: ● backup_cinder_volumes.service

You can't see it above, but in the terminal the dot next to the Triggers line is colored red, which means that the triggered service failed.
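
If the colors are hard to read, you can ask systemd directly which units failed (standard systemctl queries, nothing specific to this timer):

user@cloudcontrol1005:~$ systemctl is-failed backup_cinder_volumes.service
user@cloudcontrol1005:~$ sudo systemctl --failed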

Check the service status:

user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service

Check the service logs:

user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service

Check cinder logs:

user@cloudcontrol1005:~$ sudo journalctl -u cinder-volume.service
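
If the journal output is long, it can help to jump straight to the end or to restrict it to a recent window (the time range below is just an example):

user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service -e
user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service --since "1 day ago"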

Check all cinder services, including their logs:

user@cloudcontrol1005:~$ sudo systemctl status 'cinder*' -l

There should be 3 services up and running: cinder-api, cinder-volume and cinder-scheduler.
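
A compact way to verify that is to query the three units directly (unit names taken from the list above; adjust if they differ on the host):

user@cloudcontrol1005:~$ sudo systemctl is-active cinder-api.service cinder-volume.service cinder-scheduler.service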

Common issues

Timeout when doing a backup

Some backups (currently maps) are really big and time out before finishing. As of 2022-03-02 this is a common cause of this type of failure, and the main source of leaked snapshots.
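
To check whether timed-out backups have left snapshots behind, listing all snapshots as admin and looking for stale entries is a reasonable first step (a sketch; --all-projects requires a reasonably recent openstack client):

root@cloudcontrol1005:~# openstack volume snapshot list --all-projects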

Check the current speed of download in the backup machine

Currently the machine that actually performs the backups is cloudbackup2002, so check whether its network was saturated or whether there was any other resource contention on it.

Go to the host's Grafana board; if the network throughput is sustained at around 70-100 MB/s, that is the current maximum speed, and the only alternative is to increase the timeout.
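
If the Grafana board is not at hand, a rough reading can be taken directly on the backup host; this is only a sketch, and eno1 is a placeholder for the actual interface name on cloudbackup2002:

user@cloudbackup2002:~$ iface=eno1
user@cloudbackup2002:~$ rx1=$(cat /sys/class/net/$iface/statistics/rx_bytes); sleep 10; rx2=$(cat /sys/class/net/$iface/statistics/rx_bytes)
user@cloudbackup2002:~$ echo "$(( (rx2 - rx1) / 10 / 1024 / 1024 )) MiB/s received, averaged over 10s"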

Increasing the timeout of the individual backups

The timeout is currently hardcoded in the Python scripts triggered by the systemd service (there are two scripts: one that backs up all the volumes and is the entry point for the systemd service, and one that backs up a single volume and is called by the former).

You can find the code at wmcs-cinder-backup-manager.py.
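
The deployed path of the script may change over time, so instead of guessing it you can ask systemd which script the service runs and then grep that file for the timeout:

user@cloudcontrol1005:~$ systemctl cat backup_cinder_volumes.service | grep ExecStart
user@cloudcontrol1005:~$ sudo grep -n -i timeout <path-from-ExecStart>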

Snapshot 'available' but cannot be deleted

An error in cinder means that sometimes deletion of a snapshot will fail silently without updating the snapshot state. Often this appears in the cinder logs like this:

Delete snapshot failed, due to snapshot busy.: cinder.exception.SnapshotIsBusy: deleting snapshot snapshot-fbf0bf47-115c-40c0-981e-3f4fdb7c0d2b that has dependent volumes

That probably means that a backup based on the snapshot is either still running or in a failed state. Start with:

root@cloudcontrol1005:~# openstack volume backup list | grep -iv available
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| ID                                   | Name                                                | Description | Status    | Size |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| 3cc49970-411d-4a51-a56e-50f2b3b248b5 | wmde-templates-alpha-nfs-2023-01-30T22:30:58.576418 | None        | error     |   10 |
| bec04b09-a52c-4b50-b92f-070f2e02d8f1 | scratch-2023-01-27T10:30:02.865886                  | None        | error     | 3072 |
| 5d5b7532-187d-4656-b40d-358078c7c93e | maps-2023-01-20T22:17:54.220263                     | None        | error     | 8192 |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+

Those failed backups can be deleted with something like

root@cloudcontrol1005:~# openstack volume backup delete --force  5d5b7532-187d-4656-b40d-358078c7c93e bec04b09-a52c-4b50-b92f-070f2e02d8f1 3cc49970-411d-4a51-a56e-50f2b3b248b5
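
To watch them move from 'error' to 'deleting' and then disappear, you can poll the list (the interval is arbitrary):

root@cloudcontrol1005:~# watch -n 30 "openstack volume backup list | grep -iv available"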

It will take some time, but those backups should eventually change to 'deleting' state and then vanish. Once they are gone, empty the trash:

root@cloudcontrol1005:~# rbd trash purge --pool eqiad1-cinder
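
Optionally, confirm the trash is actually empty before moving on:

root@cloudcontrol1005:~# rbd trash list --pool eqiad1-cinder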

Now retry the snapshot deletion.
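
For example, to retry deleting the snapshot from the error message above (the ID is the UUID part of the 'snapshot-...' name in the log line):

root@cloudcontrol1005:~# openstack volume snapshot delete fbf0bf47-115c-40c0-981e-3f4fdb7c0d2b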

Snapshot stuck in 'deleting' state

If you get an error like:

cinderclient.exceptions.OverLimit: SnapshotLimitExceeded: Maximum number of snapshots allowed (16) exceeded (HTTP 413) (Request-ID: req-7a6d86a5-79e3-447f-8125-1e969ef504a7)

It might be that snapshots are getting stuck in 'deleting' status (due to some underlying issue; look into that too). To check, run:

root@cloudcontrol1005:~# cinder snapshot-list --volume-id 7b037262-7214-4cef-a876-a55e26bc43be
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
| ID                                   | Volume ID                            | Status    | Name                                         | Size | User ID   |
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
| 784b0a3d-d93f-47fa-97ac-fbbe19b8174e | 7b037262-7214-4cef-a876-a55e26bc43be | available | wikidumpparse-nfs-2022-04-13T20:00:14.507152 | 260  | novaadmin |
| 93ba6b09-879f-441b-b9d4-4767c8e53b41 | 7b037262-7214-4cef-a876-a55e26bc43be | deleting  | wikidumpparse-nfs-2022-05-11T10:32:42.692626 | 260  | novaadmin |
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+

That shows all the snapshots for the volume. As you can see, there's one in 'deleting' state, and it has been there for a while (at the time of writing, 2022-05-24). Check that there are no rbd snapshots with that ID on Ceph:

root@cloudcontrol1005:~# rbd list -l  --pool eqiad1-cinder | grep 7b037262-7214-4cef-a876-a55e26bc43be

If there are none, you can delete the snapshot by resetting its state to error:

root@cloudcontrol1005:~# cinder snapshot-reset-state --state error 93ba6b09-879f-441b-b9d4-4767c8e53b41
root@cloudcontrol1005:~# cinder snapshot-delete 93ba6b09-879f-441b-b9d4-4767c8e53b41

A scripted loop that does this:

root@cloudcontrol1005:~# for stuck_snapshot in $(openstack volume snapshot list | grep deleting | awk '{print $2}'); do
    echo "Deleting $stuck_snapshot"
    if rbd list -l eqiad1-cinder | grep -q "$stuck_snapshot"; then
        echo "... There's some rbd leftovers, check manually"
    else
        cinder snapshot-reset-state --state error "$stuck_snapshot" && cinder snapshot-delete "$stuck_snapshot"
        echo ".... removed"
    fi
done

You should still try to find the underlying issue, but if there was some temporary instability in the system, cleaning up the snapshots might be enough.

Related information

Old occurrences

Support contacts

Communication and support

Support and administration of the WMCS resources are provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support
Stay aware of critical changes and plans
Track work tasks and report bugs

Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself

Read stories and WMCS blog posts

Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)