Wikimedia Cloud Services team/EnhancementProposals/Decision record T325223 Upgrade cadence for Ceph

From Wikitech

Origin task: phab:T325223

Date of the decision: 2023-03-06

No decision meeting needed, people commenting in the task:

Decision taken

Option 5. Upgrade every 6 months, alternating major upgrades to (N+1).2.* with minor upgrades within N.2.*.
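As a rough illustration of this cadence, the sketch below lists upgrade slots six months apart that alternate between major and minor upgrades. The function name `upgrade_schedule`, the start date, rounding each slot to the first of the month, and the choice to begin with a major upgrade are all assumptions made for the example, not part of the decision.

```python
from datetime import date
from typing import List, Tuple


def upgrade_schedule(start: date, slots: int, first: str = "major") -> List[Tuple[date, str]]:
    """Return (slot date, upgrade kind) pairs, one slot every 6 months (hypothetical helper)."""
    kinds = ["major", "minor"] if first == "major" else ["minor", "major"]
    schedule = []
    for i in range(slots):
        # Count months from the start and pin each slot to the 1st of the month,
        # which avoids invalid dates such as the 31st of a 30-day month.
        months = (start.month - 1) + 6 * i
        slot = date(start.year + months // 12, months % 12 + 1, 1)
        schedule.append((slot, kinds[i % 2]))
    return schedule


# Illustrative start date only; the real calendar is up to the team.
for slot, kind in upgrade_schedule(date(2023, 3, 6), 4):
    print(slot.isoformat(), kind)
```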

Problem

Currently we have no policy on Ceph upgrades, which makes it hard to find the time for them. Current Ceph releases have a lifespan of a bit more than 2 years, and there is a new release every year (see https://docs.ceph.com/en/latest/releases/index.html).

We currently get our "unofficial" packages from https://mirror.croit.io/, as there were some issues with the upstream ones (https://tracker.ceph.com/issues/53411). **Lately upstream download.ceph.com seems to have caught up on building the packages, so we should use those.**

So two things have to be decided here: how frequent the upgrades should be, and which version to upgrade to each time.

Note that I'm considering only N.2.* versions, as the others are only for development or test clusters. From the docs:

```
   x.0.z - development versions
   x.1.z - release candidates (for test clusters, brave users)
   x.2.z - stable/bugfix releases (for users)
```
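The numbering scheme quoted above can be checked mechanically. This is a minimal sketch, assuming versions are plain `x.y.z` strings; the helper name `release_kind` is hypothetical and not part of any Ceph tooling.

```python
def release_kind(version: str) -> str:
    """Classify a Ceph version string by its second component (hypothetical helper)."""
    major, minor, patch = (int(part) for part in version.split("."))
    kinds = {
        0: "development",
        1: "release candidate",
        2: "stable",
    }
    # Anything outside the documented scheme is reported as unknown.
    return kinds.get(minor, "unknown")


print(release_kind("17.2.5"))  # stable
print(release_kind("18.1.0"))  # release candidate
```

Only versions for which this returns "stable" are candidates in the options below.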

Constraints and risks

  • We risk running an unsupported version of Ceph, not getting any new bugfixes or security patches.
  • Debian packages are **very** delayed with respect to upstream, so we might consider using other sources for them.

Options

Option 1

Do nothing

Pros:

  • No changes to the current workflow

Cons:

  • This means upgrades will be done "whenever we find some time", which is usually when a security patch or blocking bug comes around.
  • Potentially running EOL (end of life) versions
  • Continued reliance on a 3rd party repository


Option 2

Frequency: once a year
Version to upgrade to: (N-1).2.*

For example, if we are running 16.2.15 and 18.2.0 has been released, we upgrade to 17.2.*; otherwise we upgrade to the latest 16.2.*.
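The selection rule in this example can be sketched as a small function. This is only an illustration under stated assumptions: the names `parse` and `pick_target`, the `lag` parameter, and the fallback to the current release line when the target major has no stable release are all hypothetical, and the list of available versions is illustrative input rather than something fetched from a repository.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Split 'x.y.z' into integers so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def pick_target(current: str, available: Iterable[str], lag: int = 1) -> str:
    """Return the version to upgrade to (hypothetical helper).

    lag=1 targets (N-1).2.*, where N is the newest stable major available.
    """
    cur = parse(current)
    # Only x.2.z releases are stable; keep the running version as a fallback.
    stable = [v for v in map(parse, available) if v[1] == 2] + [cur]
    newest_major = max(v[0] for v in stable)
    # Never target a major older than the one already running.
    target_major = max(cur[0], newest_major - lag)
    candidates = [v for v in stable if v[0] == target_major]
    if not candidates:
        # Assumption: with no stable release for the target major, stay on the current line.
        candidates = [v for v in stable if v[0] == cur[0]]
    return ".".join(str(n) for n in max(candidates))


print(pick_target("16.2.15", ["16.2.15", "17.2.7", "18.2.0"]))  # 17.2.7
print(pick_target("16.2.15", ["16.2.15", "17.2.0"]))            # 16.2.15
```

With `lag=0` the same function follows the N.2.* rule instead, so the second call would return 17.2.0.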

Pros:

  • We get a very stable, real-world-tested version of Ceph

Cons:

  • We might spend some months running an EOL version
  • We don't get fixes for a whole year
  • We have to allocate time for it once a year (happy path: 1 week of work; challenging path: 1 month of work)


Option 3

Frequency: once a year
Version to upgrade to: N.2.*

For example, if we are running 16.2.15 and 17.2.0 has been released, we upgrade to 17.2.*; otherwise we upgrade to the latest 16.2.*.

Pros:

  • We get a very stable, real-world-tested version of Ceph
  • We don't have periods running an EOL version

Cons:

  • We don't get fixes for a whole year
  • We have to allocate time for it once a year (happy path: 1 week of work; challenging path: 1 month of work)

Option 4

Frequency: every 6 months
Version to upgrade to: (N-1).2.*

For example, if we are running 16.2.15 and 18.2.0 has been released, we upgrade to 17.2.*; otherwise we upgrade to the latest 16.2.*.

Pros:

  • We get a very stable, real-world-tested version of Ceph

Cons:

  • We might spend some months running an EOL version
  • We have to allocate time for it twice a year (happy path: 1 week of work; challenging path: 1 month of work)


Option 5

Frequency: every 6 months
Version to upgrade to: N.2.*

For example, if we are running 16.2.15 and 17.2.0 has been released, we upgrade to 17.2.*; otherwise we upgrade to the latest 16.2.*.

Pros:

  • We get a very stable, real-world-tested version of Ceph
  • We don't have periods running an EOL version

Cons:

  • We have to allocate time for it twice a year (happy path: 1 week of work; challenging path: 1 month of work)