Portal:Cloud VPS/Admin/Runbooks/Check unit status of backup cinder volumes
This runbook is for the systemd timer that triggers the cinder volume backups.
Error / Incident
The systemd timer failed.
Debugging
First of all, ssh to the machine (e.g. cloudcontrol1005) and check the timer status; you might get some useful logs from there:
$ ssh cloudcontrol1005.wikimedia.org
user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.timer
● backup_cinder_volumes.timer - Periodic execution of backup_cinder_volumes.service
     Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Tue 2022-02-22 02:50:52 UTC; 1 weeks 1 days ago
    Trigger: Wed 2022-03-02 10:30:00 UTC; 1h 19min left
   Triggers: ● backup_cinder_volumes.service
You can't see it above, but the dot (●) next to the Triggers line is colored red, which means that the service failed.
Check the service status:
user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service
Check the service logs:
user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service
Check cinder logs:
user@cloudcontrol1005:~$ sudo journalctl -u cinder-volume.service
Check all services, including their logs:
user@cloudcontrol1005:~$ sudo systemctl status 'cinder*' -l
There should be three services up and running: cinder-api, cinder-volume and cinder-scheduler.
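If you just want a quick yes/no check of those three units, systemctl can query them all at once (it prints one state per line):
user@cloudcontrol1005:~$ sudo systemctl is-active cinder-api.service cinder-volume.service cinder-scheduler.service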
Common issues
Timeout when doing a backup
Some backups (currently maps) are really big and time out before finishing. As of 2022-03-02 this is a common cause of this type of failure, and the main source of leaked snapshots.
Check the current download speed on the backup machine
Currently the machine that actually performs the backups is cloudbackup2002, so check whether its network was saturated or whether there was any other contingency on it.
Go to the host's Grafana dashboard; if the network throughput sits steadily around 70-100 MB/s, that is the current maximum speed, and the only alternative is to increase the timeout.
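If you prefer a quick check from the backup host itself instead of Grafana, a minimal sketch like this samples the kernel's receive counters over ten seconds (the interface name eno1 is an assumption, adjust it to the host's primary NIC):
user@cloudbackup2002:~$ iface=eno1  # assumption: replace with the host's primary NIC (see: ip -br link)
user@cloudbackup2002:~$ rx1=$(cat /sys/class/net/$iface/statistics/rx_bytes); sleep 10; rx2=$(cat /sys/class/net/$iface/statistics/rx_bytes)
user@cloudbackup2002:~$ echo "$(( (rx2 - rx1) / 10 / 1024 / 1024 )) MB/s average inbound"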
Increasing the timeout of the individual backups
The timeout is currently hardcoded in the Python script triggered by the systemd service. There are two scripts: one that backs up all volumes (the entry point for the systemd service), and one that backs up a single volume, which is used by the former.
You can find the code at wmcs-cinder-backup-manager.py.
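A quick way to see where that timeout lives is to grep the script on a cloudcontrol host; the install path below is an assumption, adjust it to wherever the script is deployed:
root@cloudcontrol1005:~# grep -n -i timeout /usr/local/sbin/wmcs-cinder-backup-manager.py  # path is an assumption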
Snapshot 'available' but cannot be deleted
A cinder bug means that deleting a snapshot will sometimes fail silently without updating the snapshot state. This often shows up in the cinder logs like this:
Delete snapshot failed, due to snapshot busy.: cinder.exception.SnapshotIsBusy: deleting snapshot snapshot-fbf0bf47-115c-40c0-981e-3f4fdb7c0d2b that has dependent volumes
That probably means that a backup based on the snapshot is either still running or in a failed state. Start with:
root@cloudcontrol1005:~# openstack volume backup list | grep -iv available
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| ID                                   | Name                                                | Description | Status    | Size |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| 3cc49970-411d-4a51-a56e-50f2b3b248b5 | wmde-templates-alpha-nfs-2023-01-30T22:30:58.576418 | None        | error     | 10   |
| bec04b09-a52c-4b50-b92f-070f2e02d8f1 | scratch-2023-01-27T10:30:02.865886                  | None        | error     | 3072 |
| 5d5b7532-187d-4656-b40d-358078c7c93e | maps-2023-01-20T22:17:54.220263                     | None        | error     | 8192 |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
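If you want more detail on one of them before deleting it, you can show the backup record (the ID here is taken from the example listing above):
root@cloudcontrol1005:~# openstack volume backup show 5d5b7532-187d-4656-b40d-358078c7c93e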
Those failed backups can be deleted with something like:
root@cloudcontrol1005:~# openstack volume backup delete --force 5d5b7532-187d-4656-b40d-358078c7c93e bec04b09-a52c-4b50-b92f-070f2e02d8f1 3cc49970-411d-4a51-a56e-50f2b3b248b5
It will take some time, but those backups should change to 'deleting' state and eventually vanish. Once they are gone, empty the trash:
root@cloudcontrol1005:~# rbd trash purge --pool eqiad1-cinder
Now re-try the snapshot deletion.
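For example, reusing the snapshot ID from the cinder log line above (without the snapshot- prefix), the retry would look something like:
root@cloudcontrol1005:~# openstack volume snapshot delete fbf0bf47-115c-40c0-981e-3f4fdb7c0d2b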
Snapshot stuck in 'deleting' state
If you get an error like:
cinderclient.exceptions.OverLimit: SnapshotLimitExceeded: Maximum number of snapshots allowed (16) exceeded (HTTP 413) (Request-ID: req-7a6d86a5-79e3-447f-8125-1e969ef504a7)
It might be that snapshots are getting stuck in 'deleting' status (due to some underlying issue; look into that too). To check, run:
root@cloudcontrol1005:~# cinder snapshot-list --volume-id 7b037262-7214-4cef-a876-a55e26bc43be
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
| ID                                   | Volume ID                            | Status    | Name                                         | Size | User ID   |
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
| 784b0a3d-d93f-47fa-97ac-fbbe19b8174e | 7b037262-7214-4cef-a876-a55e26bc43be | available | wikidumpparse-nfs-2022-04-13T20:00:14.507152 | 260  | novaadmin |
| 93ba6b09-879f-441b-b9d4-4767c8e53b41 | 7b037262-7214-4cef-a876-a55e26bc43be | deleting  | wikidumpparse-nfs-2022-05-11T10:32:42.692626 | 260  | novaadmin |
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
That shows all the snapshots for the volume. As you can see, there's one in 'deleting' state, and it has been there for a while (at the time of writing, 2022-05-24). Check that there are no rbd snapshots with that ID on Ceph:
root@cloudcontrol1005:~# rbd list -l --pool eqiad1-cinder | grep 7b037262-7214-4cef-a876-a55e26bc43be
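You can also list the snapshots of the volume's RBD image directly; the volume-<uuid> image name assumes the default Cinder RBD naming convention, so treat it as an assumption for this deployment:
root@cloudcontrol1005:~# rbd snap ls eqiad1-cinder/volume-7b037262-7214-4cef-a876-a55e26bc43be  # image name is an assumption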
If there are none, you can delete the snapshot by first setting its state to error:
root@cloudcontrol1005:~# cinder snapshot-reset-state --state error 93ba6b09-879f-441b-b9d4-4767c8e53b41
root@cloudcontrol1005:~# cinder snapshot-delete 93ba6b09-879f-441b-b9d4-4767c8e53b41
The same thing as a scripted loop:
root@cloudcontrol1005:~# for stuck_snapshot in $(openstack volume snapshot list | grep deleting | awk '{print $2}'); do
    echo "Deleting $stuck_snapshot"
    if rbd list -l eqiad1-cinder | grep -q "$stuck_snapshot"; then
        echo "... There's some rbd leftovers, check manually"
    else
        cinder snapshot-reset-state --state error "$stuck_snapshot" && cinder snapshot-delete "$stuck_snapshot"
        echo ".... removed"
    fi
done
You should still try to find the underlying issue, but if there was just some temporary instability in the system, cleaning up the snapshots might be enough.
Related information
Old occurrences
Support contacts
Communication and support
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:
- Chat in real time in the IRC channel #wikimedia-cloud or the bridged Telegram group
- Discuss via email after you have subscribed to the cloud@ mailing list
- Subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
- Read the News wiki page
- Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
- Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)