Puppet/Pontoon/Rationale

Problem statement

We, as SRE at the Foundation, work together to keep the production infrastructure up and running. A significant chunk of our work involves making changes to our public Puppet repository. Routine changes vary widely in how intrusive they are to the infrastructure. Consider the following:

fine-tuning parameters for production services
deploying new services
rolling out operating system upgrades

Generally speaking, every change introduces risks that we have learned to accept. We have also deployed suitable mitigations for those risks, such as:

code reviews
the Puppet compiler
Puppet realms other than 'production'

In an ideal world we would be able to minimize risks on every change before going to production. Being able to test changes within a testbed stack (i.e. a "virtual production") greatly reduces risks and enables experimentation in a safe way.

Today

Setting up such stacks is possible today, but certainly not in a "disposable" fashion.

The word disposable in this context means that the stacks should be easy to set up and tear down, and are isolated from one another (i.e. self-contained as much as possible). For all intents and purposes the stacks resemble production, but receive less (possibly zero) user traffic. Each stack also carries stack-specific data (e.g. private data) that needs to be initialized.

SRE teams today set up WMCS instances to test changes, and roles are assigned via Horizon. Hiera data comes from two sources, Horizon and Puppet, and lookup within hieradata does not work the same way as in production. The result is duplicated variables and, often, banging Hiera data and variables together until Puppet runs successfully.

This works, but it requires duplicating variables, and the resulting patch can't be applied to production as-is. In a perfect world, role assignment would be done exactly as in production and Hiera variables would be looked up the same way: a common default, with overrides only for what changes (e.g. domain names, hostnames).

Pontoon

Pontoon (in keeping with the k8s nautical theme: a recreational floating device) explores the idea of disposable stacks that are as similar to production as possible. The key idea and goal is that the Puppet code base should not depend on hardcoded production-specific values (most notably hostnames in Hiera).

Pontoon features include:

Role assignment happens by mapping a role to the list of hostnames that need that role.

The role mapping is used by Pontoon to drive its Puppet external node classifier (ENC). The ENC also supplies extra variables generated from the mapping.

Hostnames listed in the mapping will have their Puppet certificates automatically signed on the first Puppet run.
Load balancing and service discovery compatibility with production. See also the Services page for more information.
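
As an illustration of the role mapping and ENC described above, the following Python sketch classifies a node from a role-to-hostnames mapping. It is not Pontoon's actual code: the hostnames, the role:: class prefix and the hosts_for_role_* parameter names are assumptions made for the example. The output follows the general shape Puppet expects from an ENC (a YAML document with classes and parameters).

```python
# Hypothetical role -> hostnames mapping; all names are illustrative.
ROLE_MAP = {
    "puppetserver": ["puppet-01.stack.example"],
    "webserver": ["web-01.stack.example", "web-02.stack.example"],
}


def role_for(hostname, role_map):
    """Return the role that lists hostname, or None if it is unlisted."""
    for role, hosts in role_map.items():
        if hostname in hosts:
            return role
    return None


def is_known(hostname, role_map):
    """True if hostname appears in the mapping (e.g. for an autosign policy)."""
    return role_for(hostname, role_map) is not None


def classify(hostname, role_map):
    """Build the ENC answer for one node as a plain dict."""
    role = role_for(hostname, role_map)
    if role is None:
        return None  # an ENC script would typically exit non-zero here
    # Extra variables generated from the mapping, joined into strings.
    params = {
        f"hosts_for_role_{name}": ",".join(hosts)
        for name, hosts in sorted(role_map.items())
    }
    return {"classes": [f"role::{role}"], "parameters": params}


def to_yaml(answer):
    """Render the answer as the small YAML document an ENC prints."""
    lines = ["---", "classes:"]
    lines += [f"  - '{cls}'" for cls in answer["classes"]]
    lines.append("parameters:")
    lines += [f"  {key}: '{value}'" for key, value in answer["parameters"].items()]
    return "\n".join(lines) + "\n"


print(to_yaml(classify("web-01.stack.example", ROLE_MAP)), end="")
```

In a real deployment the hostname would come from the command line, since Puppet invokes the ENC with the node name as its argument. The is_known helper shows how the same mapping could back a certificate autosigning decision.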

The explicit role-to-hostnames mapping enables meta-programming Puppet, which in turn enables replacing lists of hostnames in Hiera (e.g. for firewall rules) with variables containing "all hosts for role foobar" at catalog compile time.
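
A minimal sketch of that substitution, assuming a hypothetical mapping and a made-up rule format: instead of listing hostnames by hand, the firewall rules are derived from "all hosts for role loadbalancer".

```python
# Hypothetical role -> hostnames mapping; all names are illustrative.
ROLE_MAP = {
    "loadbalancer": ["lb-01.stack.example"],
    "webserver": ["web-02.stack.example", "web-01.stack.example"],
}


def hosts_for_role(role, role_map):
    """Return the sorted hostnames for role; KeyError if the role is unknown."""
    return sorted(role_map[role])


def allow_rules(port, role, role_map):
    """Expand "all hosts for role" into one allow rule per host."""
    return [
        f"allow tcp from {host} to port {port}"
        for host in hosts_for_role(role, role_map)
    ]


for rule in allow_rules(80, "loadbalancer", ROLE_MAP):
    print(rule)
```

Raising KeyError for an unknown role is a design choice in this sketch: a typo in a role name fails loudly instead of silently producing an empty rule set.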

As of April 2020 the implementation consists of the following:

A standalone Puppet server with the production Hiera hierarchy. Two additional lookup files are provided: one at the top of the hierarchy to override production defaults, and one at the bottom to supply Pontoon-specific values, possibly auto-generated. The latter file (auto.yaml) is used by Pontoon to work around some Hiera limitations (namely that variables from the ENC are strings, whereas in some cases lists are needed).

The standalone puppet server is driven by Pontoon's ENC.

Realm is 'labs', to keep subsystems like authentication working as expected.

Stack-specific data (e.g. the "root of trust", the Puppet CA, etc) must be initialized manually.
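
The lookup order described in the first item above can be sketched as a first-match search through layered dictionaries. This is a simplified model, not Hiera itself: the keys, values and layer names (other than auto.yaml) are invented for the example.

```python
# Layers in lookup order, highest priority first. Keys and values are
# hypothetical; only the ordering mirrors the description above.
PONTOON_OVERRIDES = {"domain": "stack.example"}
PRODUCTION = {"domain": "prod.example", "ntp_pool_size": 4}
AUTO = {
    # List-valued data generated from the role mapping.
    "hosts_for_role_webserver": ["web-01.stack.example", "web-02.stack.example"],
}

HIERARCHY = [
    ("pontoon overrides", PONTOON_OVERRIDES),
    ("production", PRODUCTION),
    ("auto.yaml", AUTO),
]


def lookup(key, hierarchy):
    """Return (value, layer name) from the first layer that defines key."""
    for name, data in hierarchy:
        if key in data:
            return data[key], name
    raise KeyError(key)


for key in ("domain", "ntp_pool_size", "hosts_for_role_webserver"):
    value, layer = lookup(key, HIERARCHY)
    print(f"{key} = {value!r} (from {layer})")
```

Only the keys that differ from production need to appear in the top layer. Everything else falls through to the production defaults, and generated values are picked up from the bottom layer when nothing above defines them.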

Benefits

Why go through all this trouble of isolated stacks similar to production? There are several benefits to having a bespoke testbed:

Lower overhead and faster iteration cycles for new ideas and services.

Increased confidence that a patch will work as expected once applied in production. For example, distribution upgrades can be tested in isolation.

External reusability of the Puppet code base is improved as a bonus/side effect: we're factoring out assumptions about production and third parties can recreate similar environments. Similarly, contributing to the Puppet code base itself is made easier if recreating an isolated "mini-production" is possible without jumping through too many hoops.