Wikimedia Cloud Services team/EnhancementProposals/Decision record T304058 doing VM backups

Date of the decision: 2022-03-30

People in the decision meeting (alphabetical order):

Decision taken

Option 1 was chosen without consensus, with the action point of ensuring that there is an OKR for the next quarter to find a proper backup solution.

Rationale

The Q4 meeting is tomorrow (2022-03-31), and keeping the current solution running as is (bugs included) takes less effort than moving to any other interim solution.

Meeting notes

https://etherpad.wikimedia.org/p/WMCS-T304058

Problem

Since we migrated the VMs to Ceph storage, we have been doing some backups of the VM disks in case something went wrong (Ceph being a new technology for us).

Since then, Ceph has proved stable and reliable; it keeps 3 copies of the data distributed across 3 different hosts.

There have not been many occasions (maybe a couple) on which we have used these backups, and the current setup needs a bit of work (<5 days) to get to a stable state.

Currently these backups are running on a few of the cloudvirts that got lots of spare space after moving to Ceph, but we have a couple of new bare metal machines that were ordered to be dedicated to these (and potentially other) backups.

Constraints and risks

Doing nothing will end up with the backups misbehaving (using extra space, filling up the disk, leaking Ceph snapshots, ...), so that's the less preferred option.
Not doing any backups (without an alternative) gives us no way of restoring our users' and our own VMs in cases of:
- User mistake (deleting files, etc.)
- Ceph issues (disk corruption, cluster mishap, etc.)
- Disk corruption at the VM level (migration issue, OS issue, etc.)

Decision record

This page

Options

Option 1

Do nothing

Pros:

No new investment needed
No new hardware needed

Cons:

Maintenance:
- the current scripts have issues: they might fill up the disk (about once every 6 months), which requires manual attention (running one command) and leaves some broken backups
- the current scripts don't clean up leaked Ceph snapshots (no estimate, as it has not yet become an issue)
There's some overhead on a few of the cloudvirts (not noticeable so far)
We are not testing the backups, so there's a chance they might not work in the future (so far, they have worked whenever we tried recovering from them).
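
The disk-filling issue listed above could be caught before it needs manual cleanup by running a periodic free-space check. The following is only a minimal sketch under stated assumptions, not the current backup scripts: the path `/srv/backups`, the 85% threshold and all function names are hypothetical.

```python
import shutil

# Hypothetical values: neither the path nor the threshold comes from the
# current backup scripts.
BACKUP_PATH = "/srv/backups"
WARN_FRACTION = 0.85


def usage_fraction(total: int, used: int) -> float:
    """Return the used fraction of a filesystem (0.0 if total is not positive)."""
    if total <= 0:
        return 0.0
    return used / total


def needs_attention(total: int, used: int, threshold: float = WARN_FRACTION) -> bool:
    """Return True when usage is at or above the warning threshold."""
    return usage_fraction(total, used) >= threshold


def check_path(path: str = BACKUP_PATH) -> bool:
    """Check a real mount point using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return needs_attention(usage.total, usage.used)
```

Keeping the threshold logic separate from the `shutil.disk_usage` call makes it possible to test the decision without touching a real disk; `check_path` could then be called from a cron job or a monitoring check.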

Option 2

Improve the current scripts and move to the backup bare metals.

Pros:

We still have VM backups off-Ceph in case of disaster, for us and most of our users.

Cons:

We have to make an initial investment (estimated at <5 days)
Maintenance: we have to make an ongoing investment (maintaining/testing the backups, estimated at <1h/month)
We are using the hardware (cloudbackups) for this and not for other things (nothing else planned afaik, so just future potential)

Option 3

Stop doing VM backups

Pros:

No maintenance effort needed
Free hardware (cloudbackups)
Free space in Ceph

Cons:

We have to make an initial investment (estimated at <5 days) to remove what we have
There's no safety net in case of a disaster on Ceph, user error (deleting files), OS error (filesystem corruption) or otherwise, for us or our users. That means that we might not be able to restore systems like Toolforge in case of a disaster.
- An alternative would require way more effort and time, though might be more future proof (we might still need to do some backups of things like databases, etcd, redis and similar data storage for critical components).