Wikimedia Cloud Services team/EnhancementProposals/Decision record T325223 Upgrade cadence for Ceph
Origin task: phab:T325223
Date of the decision: 2023-03-06
No decision meeting needed, people commenting in the task:
Decision taken
Option 5. Upgrade every 6 months, alternating major ((N+1).2.*) with minor (N.*.*) upgrades.
Problem
Currently we have no policy on Ceph upgrades, and that makes it hard to find the time for them. Current Ceph releases have a lifespan of a bit more than 2 years, and there's a new release every year (see https://docs.ceph.com/en/latest/releases/index.html).
We currently get our "unofficial" packages from https://mirror.croit.io/ as there were some issues with the upstream ones (https://tracker.ceph.com/issues/53411) -- **Lately upstream download.ceph.com seems to have caught up on building the packages, so we should use those**
So two things have to be decided here: how frequent the upgrades should be, and what version to upgrade to each time.
Note that I'm considering only N.2.* versions, as the others are only for development or test clusters. From the docs:
```
x.0.z - development versions
x.1.z - release candidates (for test clusters, brave users)
x.2.z - stable/bugfix releases (for users)
```
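As a minimal sketch of that versioning scheme, the second component of an `x.y.z` version string can be mapped to its release type. The function names `classify` and `is_stable` are hypothetical helpers for illustration, not part of any Ceph tooling:

```python
def classify(version: str) -> str:
    """Map an 'x.y.z' Ceph version string to its release type.

    Follows the scheme quoted from the docs: x.0.z is development,
    x.1.z is a release candidate, x.2.z is stable/bugfix.
    """
    _major, minor, _patch = (int(part) for part in version.split("."))
    kinds = {0: "development", 1: "release candidate", 2: "stable"}
    return kinds.get(minor, "unknown")


def is_stable(version: str) -> bool:
    """Return True only for x.2.z versions, the ones considered here."""
    return classify(version) == "stable"
```

For example, `is_stable("17.2.5")` is true, while `is_stable("18.0.0")` and `is_stable("18.1.2")` are not.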
Constraints and risks
- We risk running an unsupported version of Ceph, not getting any new bugfixes or security patches.
- Debian packages are **very** delayed with respect to upstream, so we might consider using other sources for them
Options
Option 1
Do nothing
Pros:
- No changes to the current workflow
Cons:
- This means upgrades will be done "whenever we find some time", which is usually when a security patch or blocking bug comes around.
- Potential EOL (end of life) versions
- 3rd party repository
Option 2
Frequency: once a year
Version to upgrade to: (N-1).2.*
For example, if we have 16.2.15, and there is a new 18.2.0, we upgrade to 17.2.*, otherwise we upgrade to the latest 16.2.*
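The selection rule in this example can be sketched as a small function. This is only an illustration under stated assumptions: `pick_target` and its `lag` parameter are hypothetical names, the list of available releases is supplied by hand, and only x.2.z (stable) versions are considered. With `lag=1` it follows the (N-1).2.* rule; with `lag=0` it follows the N.2.* rule used by the other options.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse an 'x.y.z' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def pick_target(current: str, available: Iterable[str], lag: int = 1) -> str:
    """Pick the version to upgrade to (hypothetical helper).

    lag=1 targets one major behind the newest stable major, (N-1).2.*;
    lag=0 targets the newest stable major, N.2.*.
    The target major is never lower than the one currently running.
    """
    cur = parse(current)
    # Keep only stable/bugfix releases (x.2.z).
    stable = [v for v in (parse(s) for s in available) if v[1] == 2]
    newest_major = max((v[0] for v in stable), default=cur[0])
    target_major = max(cur[0], newest_major - lag)
    # Latest release of the target major that is not older than what we run.
    candidates = [v for v in stable if v[0] == target_major and v >= cur]
    best = max(candidates, default=cur)
    return ".".join(str(n) for n in best)
```

With an illustrative release list, `pick_target("16.2.15", ["16.2.15", "17.2.0", "17.2.7", "18.2.0"])` picks `17.2.7`, while `pick_target("16.2.10", ["16.2.10", "16.2.15", "17.2.0"])` stays on the 16 series and picks `16.2.15`.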
Pros:
- We get a very stable, world-tested version of Ceph
Cons:
- We might run an EOL version for some months
- We don't get fixes for a whole year
- We have to allocate time for it once a year (happy path: 1 week of work; challenging path: 1 month of work)
Option 3
Frequency: once a year
Version to upgrade to: N.2.*
For example, if we have 16.2.15, and there is a new 17.2.0, we upgrade to 17.2.*, otherwise we upgrade to the latest 16.2.*
Pros:
- We get a very stable, world-tested version of Ceph
- We don't get periods running an EOL version
Cons:
- We don't get fixes for a whole year
- We have to allocate time for it once a year (happy path: 1 week of work; challenging path: 1 month of work)
Option 4
Frequency: every 6 months
Version to upgrade to: (N-1).2.*
For example, if we have 16.2.15, and there is a new 18.2.0, we upgrade to 17.2.*, otherwise we upgrade to the latest 16.2.*
Pros:
- We get a very stable, world-tested version of Ceph
Cons:
- We might run an EOL version for some months
- We have to allocate time for it twice a year (happy path: 1 week of work; challenging path: 1 month of work)
Option 5
Frequency: every 6 months
Version to upgrade to: N.2.*
For example, if we have 16.2.15, and there is a new 17.2.0, we upgrade to 17.2.*, otherwise we upgrade to the latest 16.2.*
Pros:
- We get a very stable, world-tested version of Ceph
- We don't get periods running an EOL version
Cons:
- We have to allocate time for it twice a year (happy path: 1 week of work; challenging path: 1 month of work)