Wikimedia Cloud Services team/EnhancementProposals/Decision record T304058 doing VM backups
Origin task: phab:T304058
Date of the decision: 2022-03-30
People in the decision meeting (alphabetical order):
Option 1 was chosen without consensus, with the action item of ensuring that there is an OKR next quarter to find a proper backup solution.
The Q4 meeting is tomorrow (2022-03-31), and keeping the current solution running as is (bugs included) takes less effort than moving to any other interim solution.
Since we migrated the VMs to Ceph storage, we have been doing some backups of the VM disks in case something went wrong (it was a new technology).
Since then, Ceph has proved stable and reliable; it keeps 3 copies of the data distributed across 3 different hosts.
There have not been many occasions (maybe a couple) on which we have used these backups, and the current setup needs a bit of work (<5 days) to get to a stable state.
Currently these backups run on a few of the cloudvirts that were left with lots of spare space after the move to Ceph, but we have a couple of new bare-metal machines that were ordered to be dedicated to these (and potentially other) backups.
Constraints and risks
- Doing nothing will eventually lead to the backups misbehaving (using extra space, filling up the disk, leaking Ceph snapshots, ...), so that's the less preferred option.
- Not doing any backups (without an alternative) gives us no way of restoring our users' VMs or our own in cases of:
  - User mistake (deleting files, etc.)
  - Ceph issues (disk corruption, cluster mishap, etc.)
  - Disk corruption at the VM level (migration issue, OS issue, etc.)
- No new investment needed
- No new hardware needed
- The current scripts have issues: they might fill up the disk (roughly once every 6 months) and then require manual attention (run one command, plus some broken backups)
- The current scripts don't clean up leaked Ceph snapshots (no estimate, as it has not yet become an issue)
- There's some overhead on a few of the cloudvirts (not noticeable so far)
- We are not testing the backups, so there's a chance they might not work in the future (so far when we tried recovering they worked).
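The leaked-snapshot item in the list above could be given a rough estimate with a small check. This is only a minimal sketch, not part of the current scripts: it assumes that the snapshot names present in Ceph and the snapshot names still referenced by a backup can both be obtained as plain lists. The function name `find_leaked_snapshots` and the example snapshot names are hypothetical.

```python
from typing import Iterable, List


def find_leaked_snapshots(
    ceph_snapshots: Iterable[str],
    referenced_snapshots: Iterable[str],
) -> List[str]:
    """Return snapshots that exist in Ceph but that no backup references."""
    referenced = set(referenced_snapshots)
    # Anything present in Ceph and unknown to the backups is considered leaked.
    return sorted(snap for snap in set(ceph_snapshots) if snap not in referenced)


if __name__ == "__main__":
    # Hypothetical snapshot names, for illustration only.
    in_ceph = ["vm-a@2022-03-01", "vm-a@2022-03-15", "vm-b@2022-03-15"]
    in_use = ["vm-a@2022-03-15", "vm-b@2022-03-15"]
    print(find_leaked_snapshots(in_ceph, in_use))
```

Keeping the comparison as a pure function means it can be tried on exported lists without touching the cluster.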
Improve the current scripts and move to the backup bare metals.
- We still have VM backups off-ceph in case of disaster, for us and most of our users.
- We have to make an initial investment (<5 days estimation)
- Maintenance: we have to make an ongoing investment (maintain/test the backups, <1h/month estimation)
- We are using the hardware (cloudbackups) for this and not other things (nothing else planned afaik, so just future potential)
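Part of the ongoing maintenance listed above could be a periodic check that the backup disk is not filling up. The following is a minimal sketch under stated assumptions: the helper names `used_fraction` and `needs_attention` and the 85% threshold are hypothetical and not taken from the existing tooling; only the standard library call `shutil.disk_usage` is used.

```python
import shutil


def used_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_attention(fraction_used: float, threshold: float = 0.85) -> bool:
    """Flag the disk once usage reaches the (assumed) threshold."""
    return fraction_used >= threshold


if __name__ == "__main__":
    # "/" is a placeholder; the real backup mount point is not given in this record.
    fraction = used_fraction("/")
    print(f"{fraction:.0%} used, needs attention: {needs_attention(fraction)}")
```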
Stop doing VM backups
- No maintenance effort needed
- Free hardware (cloudbackups)
- Free space in ceph
- We have to make an initial investment (<5 days estimation) to remove what we have
- There's no safety net in case of disaster on ceph, user error (deleting files), OS error (filesystem corruption) or otherwise, for us or our users. That means that we might not be able to restore systems like toolforge in case of a disaster.
- An alternative would require much more effort and time, though it might be more future-proof (we might still need to do some backups of things like databases, etcd, redis and similar data storage for critical components).