Portal:Cloud VPS/Admin/Runbooks/Check unit status of backup vms
Error / Incident
The backup_vms unit is failing, which means that the backups for virtual machines are not working as expected.
Debugging
To gather logs, ssh to the host having the issue and check the backup_vms.service unit status:
root@cloudvirt1021:~# systemctl status backup_vms
● backup_vms.service - Backup vms assigned to this host
     Loaded: loaded (/lib/systemd/system/backup_vms.service; static)
     Active: failed (Result: exit-code) since Mon 2022-03-21 20:01:53 UTC; 17h ago
TriggeredBy: ● backup_vms.timer
    Process: 4099425 ExecStart=/usr/local/sbin/wmcs-backup instances backup-assigned-vms (code=exited, status=1/FAILURE)
   Main PID: 4099425 (code=exited, status=1/FAILURE)

Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     backup_entries = get_backups()
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 600, in get_backups
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     backup = subprocess.check_output([BACKY, "-ms", "ls"])
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     raise CalledProcessError(retcode, process.args,
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', '-ms', 'ls']' returned non-zero exit status 100.
Mar 21 20:01:53 cloudvirt1021 systemd[1]: backup_vms.service: Main process exited, code=exited, status=1/FAILURE
Mar 21 20:01:53 cloudvirt1021 systemd[1]: backup_vms.service: Failed with result 'exit-code'.
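In the example above the traceback already names the command that actually failed (/usr/bin/backy2 -ms ls), so you can re-run it by hand to see backy2's own error output directly:

root@cloudvirt1021:~# /usr/bin/backy2 -ms ls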
You can also check the logs for a longer view:
root@cloudvirt1021:~# journalctl -u backup_vms
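The journal keeps logs from previous runs too; standard journalctl options help narrow things down, for example:

root@cloudvirt1021:~# journalctl -u backup_vms --since "2 days ago"
root@cloudvirt1021:~# journalctl -u backup_vms -n 100 --no-pager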
Common issues
Trying to back up a VM that is not there anymore
Currently the backups gather the list of VMs to back up at the start, so it might happen that one of them gets removed while the script is backing up other VMs. In that case the next run will refresh the VM list and fix the issue on its own.
There's currently a task to do something more elegant: task T276892.
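To confirm that this is what happened, one option (a sketch, not part of the original runbook) is to check whether the disk image named in the failing log line still exists in the Ceph pool; the pool and image names below are placeholders taken from the example logs further down:

root@cloudvirt1021:~# rbd -p eqiad1-compute ls | grep fff879b9-8300-4a11-9cf6-3d424be9ffa3

If it returns nothing, the VM (and its disk) is indeed gone and the next run should recover by itself.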
There's a new VM/project that should not be backed up
The projects/VMs lists control which ones do or do not get backed up; just add or remove the relevant entry there.
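The page doesn't say where those lists live, so the snippet below is purely hypothetical: it assumes a YAML config read by wmcs-backup, with file name and keys invented for illustration. Check the actual configuration source (e.g. puppet) for the real file and schema.

# HYPOTHETICAL example, file name and keys are invented for illustration
excluded_projects:
  - some-project                            # no VM in this project gets backed up
excluded_vms:
  - fff879b9-8300-4a11-9cf6-3d424be9ffa3    # single VM opted out of backups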
Disk full
If this happens, there's a simple workaround that might work: a full cleanup. First, make sure that backy will not run during the cleanup:
root@cloudvirt1021:~# systemctl stop backup_vms.timer
root@cloudvirt1021:~# systemctl status backup_vms.service  # wait for this to finish if it's running, probably not as it broke
Then run:
root@cloudvirt1021:~# backy2 cleanup --full
    INFO: [backy2.logging] $ /usr/bin/backy2 cleanup --full
    INFO: [backy2.logging] Cleanup: Removed 500709 blobs
    INFO: [backy2.logging] Backy complete.
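To check how much space the cleanup freed (assuming the backups live under /srv/backy, the same path that holds backy.sqlite elsewhere in this page):

root@cloudvirt1021:~# df -h /srv/backy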
Start the timer again:
root@cloudvirt1021:~# systemctl start backup_vms.timer
Corrupted database
This one usually presents as the following errors in the logs:
Mar 19 20:03:45 cloudvirt1021 wmcs-backup[2952831]: INFO:[2022-03-19 20:03:45,731] Creating full backup of pool:eqiad1-compute, image_name:fff879b9-8300-4a11-9cf6-3d424be9ffa3_disk, snapshot_name:2022-03-19T20:03:45_cloudvirt1021
Mar 19 20:04:10 cloudvirt1021 wmcs-backup[2954847]: ERROR: [backy2.logging] Couldn't parse datetime string: 'b0094916-a7bf-11ec-95b0-dd68c0a50dd7'
Mar 19 20:04:10 cloudvirt1021 wmcs-backup[2954847]: ERROR: [backy2.logging] Backy failed.
You can verify by checking the db integrity (any output other than ok is bad):
root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> pragma integrity_check;
*** in database main ***
On tree page 71625 cell 219: Rowid 861598 out of order
On tree page 4 cell 0: 2nd reference to page 71847
row 304958 missing from index ix_blocks_checksum
...
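The same check can be run non-interactively; on a healthy database it prints just ok:

root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite "pragma integrity_check;"
ok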
And to fix it, dump the DB and reload it. First, make sure backy is not running, as doing this mid-DB-update might break things:
root@cloudvirt1021:~# systemctl stop backup_vms.timer
root@cloudvirt1021:~# systemctl status backup_vms.service  # wait for this to finish if it's running, probably not as it broke
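It's also a good idea (not part of the original procedure, just a safety net) to keep a copy of the corrupted file before rewriting anything:

root@cloudvirt1021:~# cp -a /srv/backy/backy.sqlite /root/backy.sqlite.corrupt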
Then dump the DB:
root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite
sqlite> .output backup.db
sqlite> .dump
sqlite> .quit

# the backy.fixed.sqlite file did not exist yet, sqlite3 creates it
root@cloudvirt1021:~# sqlite3 backy.fixed.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .read backup.db
sqlite> .quit

# restore the DB
root@cloudvirt1021:~# cp backy.fixed.sqlite /srv/backy/backy.sqlite
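After the restore, re-run the integrity check against the restored file; it should now print ok:

root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite "pragma integrity_check;"
ok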
Finally, as some entries might have been lost, you can do a full cleanup of the backups. This might take some time:
root@cloudvirt1021:~# backy2 cleanup --full
    INFO: [backy2.logging] $ /usr/bin/backy2 cleanup --full
    INFO: [backy2.logging] Cleanup: Removed 500709 blobs
    INFO: [backy2.logging] Backy complete.
Start the timer again:
root@cloudvirt1021:~# systemctl start backup_vms.timer
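You can double-check that the timer is scheduled again, and optionally trigger a run immediately instead of waiting for the next scheduled one:

root@cloudvirt1021:~# systemctl list-timers backup_vms.timer
root@cloudvirt1021:~# systemctl start backup_vms.service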
Related information
Old incidents
- task T304408 cloudvirt1021 - CRITICAL: Status of the systemd unit backup_vms
- task T303870 cloudvirt1022/Check unit status of backup_vms is CRITICAL
Contacts
Communication and support
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia Movement volunteers. Please reach out with questions and join the conversation:
- Chat in real time in the IRC channel #wikimedia-cloud, the bridged Telegram group, or the bridged Mattermost channel
- Discuss via email after subscribing to the cloud@ mailing list