Operating system upgrade policy
This document proposes a policy for Linux distribution updates for the Wikimedia production cluster and related infrastructure. This process is currently not clearly defined, and streamlining it would reduce technical debt and allow the Wikimedia Foundation to benefit from technical innovations more quickly.
There are several reasons why we need to upgrade our software stack:
- To ensure the security of the site and our users’ data, we need to keep the infrastructure free of known security vulnerabilities. In contrast to scheduled maintenance updates, these updates cannot be planned ahead and happen based on when security issues are found and disclosed. We apply two types of security updates: the majority are provided by the Linux distributions; the rest are updates from specific vendors for software not present in distributions, and internally prepared updates (usually for software we run modified from the upstream version or which is packaged internally).
- For our software deployments we also want to benefit from ongoing development trends and provide new features for our users. As an example, a newer release of the OpenSSL crypto library could enhance the support for recent versions of the Transport Layer Security (TLS) standard or support the latest advancements in cryptographic ciphers.
- Using current releases is also relevant for our hardware support. Eventually our hardware vendors stop shipping the server models we use. In case of a hardware refresh we can be faced with updated server components (e.g. network cards) which are only supported in a more recent version of the operating system.
- For some services, our deployments are among the largest installations of a given piece of software, so our experience and feedback from operating it at scale are valuable to the upstream maintainers and their users. For that feedback to be meaningful, it's important that our stack doesn't lag too far behind current releases. This is of course always a balance: software from a Linux distribution, which stabilises towards a stable release, always comes with an unavoidable delay compared to running the most recent version of a software component.
Supporting older distribution releases comes at a significant, albeit not very visible, internal cost resulting in technical debt. The more releases of a Linux distribution we need to support, the more effort is spent on making changes compatible with all supported distributions. Much of that effort is independent of the number of systems still running a given release, because all core changes still need to be adapted for the few remaining servers. Two examples: services could not simply ship a systemd unit while some systems were not yet using systemd, and internally maintained software components, like our configuration management system (Puppet), need to be built and maintained for several distribution releases at once. Supporting fewer distributions frees up engineering resources for improving our infrastructure elsewhere.
When this policy was published (March 2019) we were supporting four different Linux distributions (Ubuntu 14.04, Debian 8, Debian 9 and work in progress for Debian 10).
That level of technical debt not only applies to our maintained packages, but also extends to our git repository which stores the Puppet configuration data for our systems. Some of these configuration settings also affect the Puppet code, so we need to retain backwards compatibility here.
Debian release cadence
We use Debian as the operating system to run the Wikimedia production servers and the services comprising Wikimedia Cloud Services. In contrast to other distributions, Debian doesn’t set fixed release dates, but rather postpones a release until it’s considered “ready”. This may sound hard to plan for, but over the last decade the project has mostly followed a two-year cadence with a variance of a few months:
- Debian 4.0 ("etch"): Apr 2007
- Debian 5.0 ("lenny"): Feb 2009
- Debian 6.0 ("squeeze"): Feb 2011
- Debian 7.0 ("wheezy"): May 2013
- Debian 8.0 ("jessie"): Apr 2015
- Debian 9.0 ("stretch"): Jun 2017
- Debian 10.0 ("buster"): Jul 2019
As such, for the rest of the document it’s assumed that a new release happens every two years.
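The two-year assumption can be checked against the release list above. The following is a minimal sketch, not part of the policy itself: it uses only the month and year given in the list (the day of the month is set to 1 as a simplification), and `months_between` is a hypothetical helper introduced for illustration.

```python
from datetime import date

# Release months as listed above; the day is fixed to 1 because the
# list only gives month and year (a simplification for this sketch).
RELEASES = [
    ("etch", date(2007, 4, 1)),
    ("lenny", date(2009, 2, 1)),
    ("squeeze", date(2011, 2, 1)),
    ("wheezy", date(2013, 5, 1)),
    ("jessie", date(2015, 4, 1)),
    ("stretch", date(2017, 6, 1)),
    ("buster", date(2019, 7, 1)),
]


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later` (hypothetical helper)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


# Gap in months between each release and its successor.
gaps = [
    months_between(RELEASES[i][1], RELEASES[i + 1][1])
    for i in range(len(RELEASES) - 1)
]
mean_gap = sum(gaps) / len(gaps)

print(gaps)      # [22, 24, 27, 23, 26, 25]
print(mean_gap)  # 24.5
```

Every gap falls within three months of 24, which is consistent with treating the cadence as two years for planning purposes.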
Historically the infrastructure of the Wikimedia Foundation ran several releases of Ubuntu but Wikimedia support for Ubuntu is now deprecated. The remaining Ubuntu hosts are to be migrated by April 2019. This document covers our new setup with only Debian installations.
Support stages for Debian releases
After a release has happened, Debian follows a pattern of support levels similar to other Linux distributions (e.g. Red Hat Enterprise Linux). For the first years of support a mix of functional and security fixes are backported, while at later stages only security fixes are shipped.
Once released, support for Debian happens in two stages, with an optional third:
- For the lifetime of a stable release plus one year after the release of the subsequent one (so effectively three years), there's security support provided by Debian itself. In addition to security updates there are also point releases every few months which collect bug fixes and ship minor security fixes which are not important enough for a regular security update. These point releases can also provide support for new hardware drivers.
- After the three-year support period, the remaining time frame until five years after the initial release date (so around two years) is covered by security updates. This support is provided by the Debian LTS project, where paid contributors provide security updates. Compared to the standard security support in the first three years this usually covers fewer packages (but the omitted packages don't matter that much for our server setups). There are no bugfix updates for LTS; support is limited to security fixes (with a few critical exceptions like time zone updates). The support in LTS is also inherently a little degraded compared to the standard support: some packages cannot be backported after more than X years (for example, Oracle withholds vulnerability information for their products, so MySQL 5.5 cannot be supported any longer in Debian 8 LTS as Oracle stopped supporting 5.5). In addition, the backports suite is no longer supported during LTS. This suite provides updated packages originally not included in a stable release, and our software setups sometimes rely on components from it.
- There's even a third stage with extended LTS support (extending the lifetime even longer than five years). It doesn't cover the complete archive, but only selected packages for some companies paying for the support.
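The stages above can be expressed as simple date arithmetic. The sketch below is an approximation built only from the rules described in this section (standard support until one year after the next release, LTS until five years after the initial release); actual end dates are set by the Debian project and may differ. The function and stage names are hypothetical, and when the next release date is unknown the two-year cadence assumed earlier in this document is used.

```python
from datetime import date
from typing import Optional


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years (assumes `d` is not 29 February)."""
    return d.replace(year=d.year + years)


def debian_support_stage(
    release: date, next_release: Optional[date], today: date
) -> str:
    """Approximate Debian support stage for a release on a given day."""
    if next_release is None:
        # Assumption from this document: a new release every two years.
        next_release = add_years(release, 2)

    standard_end = add_years(next_release, 1)  # one year after the successor
    lts_end = add_years(release, 5)            # five years after release

    if today < release:
        return "not-released"
    if today < standard_end:
        return "standard"
    if today < lts_end:
        return "lts"
    return "extended-lts-or-unsupported"


# Debian 8 (April 2015) with Debian 9 released in June 2017,
# evaluated for March 2019:
print(debian_support_stage(date(2015, 4, 1), date(2017, 6, 1), date(2019, 3, 1)))
```

For the example call, standard support ends in June 2018 and the LTS window runs to April 2020 under these rules, so the function reports `lts`.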
Timeline of previous distribution deprecations
Historically, older distributions have been phased out very close to the support termination of the respective distribution releases (or even after the target date in one case):
- For Ubuntu 10.04 Lucid, one system was not migrated in time
- For Ubuntu 12.04 Precise, the last system was migrated three weeks before the end-of-life date
- For Ubuntu 14.04 Trusty, the removal of the last systems will only happen shortly prior to the end-of-life date (per current planning/estimation)
The proposal is to limit the use of a Debian release to four years, in other words to two Debian releases at a time.
- For the first three years after it becomes available, a distribution release can be deployed freely, but once a new release is available it’s strongly advised to adopt the newer release early on.
- Once that three year period passes, the migration of the remaining installed base is centrally coordinated. This provides stakeholders with a full year to migrate existing hosts and services:
  - For the servers in the Wikimedia production cluster, this work is coordinated by the SRE Infrastructure Foundations team. The actual migration would be owned by the respective service owners within the SRE teams.
  - For servers managed by other stakeholders (most notably the Wikimedia Cloud Services team for Wikimedia Cloud VPS and Toolforge, and anyone building container/Docker images based on the Wikimedia package repository) the migration is to be organized by their respective teams.
- After four years, support for the old distribution is ended within the Wikimedia infrastructure and removed from Puppet trees, package repositories and related configuration settings.
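The proposed lifecycle can be sketched as a small function that maps a release date and a given day to one of the three phases described above. This is an illustration under stated assumptions, not tooling that exists in the Wikimedia infrastructure: the function name, the stage labels and the use of the first day of the release month are all hypothetical.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years (assumes `d` is not 29 February)."""
    return d.replace(year=d.year + years)


def policy_stage(release: date, today: date) -> str:
    """Phase of a Debian release under the proposed four-year policy."""
    migration_start = add_years(release, 3)  # coordinated migration begins
    removal = add_years(release, 4)          # support is removed

    if today < release:
        return "not-available"
    if today < migration_start:
        return "deployable"
    if today < removal:
        return "coordinated-migration"
    return "removed"


# Debian 9 "stretch" was released in June 2017 (day set to 1 here).
stretch = date(2017, 6, 1)
print(add_years(stretch, 4))                    # 2021-06-01
print(policy_stage(stretch, date(2019, 3, 1)))  # deployable
print(policy_stage(stretch, date(2020, 9, 1)))  # coordinated-migration
```

Applied to a release from June 2017, the four-year limit lands in June 2021, with the coordinated migration year starting in June 2020.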
The proposal is to enable this policy retroactively for Stretch, meaning it could be used until June 2021. The following chart displays the stages for future releases:
For the phase-out of Debian Jessie a date will be coordinated within the SRE teams (at this point less than 200 Jessie systems are running in production).
- There will be an overlap of a few months prior to the release of a new stable release where the next distribution is internally prepared/tested for our infrastructure, but this can be ignored for the purpose of this policy.