User:Filippo Giunchedi/Pontoon

Problem statement

We, as SRE at the Foundation, work together to keep the production infrastructure up and running. A significant chunk of our work involves making changes to our public Puppet repository. Routine changes vary widely in how intrusive they are to the infrastructure; consider the following:

  • fine-tuning parameters for production services
  • deploying new services
  • rolling out operating system upgrades

Generally speaking, every change introduces risks that we have learned to accept. We have also deployed suitable mitigations for those risks, such as:

  • code reviews
  • the Puppet compiler
  • Puppet realms other than 'production'.

In an ideal world we would be able to minimize the risks of every change before it goes to production. Being able to test changes within a testbed stack (i.e. a "virtual production") greatly reduces risk and enables experimentation in a safe way.

Today

Setting up such stacks is possible today, but certainly not in a "disposable" fashion.

The word disposable in this context means that the stacks should be easy to set up and tear down, and isolated from one another (i.e. as self-contained as possible). For all intents and purposes the stacks resemble production, but receive less (possibly zero) user traffic. Each stack also carries stack-specific data that must be initialized (e.g. private data).

SRE teams today set up WMCS instances to test changes, and roles are assigned via Horizon. Hiera data comes from different sources (Horizon and Puppet), and lookups within hieradata don't behave the same way as in production. The result is duplication of multiple variables and, often, banging Hiera data and variables together until Puppet runs successfully.

This works, but it requires duplicating variables, and the resulting patch can't be applied to production as-is. In a perfect world role assignment would be done exactly the same way as in production and Hiera variables would be looked up the same way: a common default, overriding only what changes (e.g. domain names, hostnames, etc.).

Pontoon

Pontoon (in the k8s nautical theme: a recreational floating device) explores the idea of disposable stacks that are as similar to production as possible. The key idea and goal is that the Puppet code base should not depend on hardcoded production-specific values.

Features include:

  • Role assignment happens by mapping a role to a list of hostnames that need that role.
  • The role mapping is used by Pontoon to drive its Puppet external node classifier (ENC). The ENC also supplies extra variables generated from the mapping.
  • Hostnames listed in the mapping will have their Puppet certificates automatically signed on the first Puppet run.

The explicit role-to-hostnames mapping enables meta-programming Puppet, which in turn enables replacing lists of hostnames in Hiera (e.g. firewall rules) with variables containing "all hosts for role foobar", resolved at catalog compile time.
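
To make this concrete, a stack file is a plain YAML mapping from roles to hostnames. The sketch below is based on the examples later in this page (see the Howto section); the first hostname is a placeholder, and the actual template lives at modules/pontoon/files/template.yml.

 # Stack file (sketch): map each role to the hosts that should run it.
 # Pontoon's ENC classifies nodes from this mapping and also derives
 # "all hosts for role X" style variables from it.
 puppetmaster::pontoon:
   - pontoon-master-01.example.eqiad.wmflabs   # placeholder hostname
 graphite::production:
   - graphite-01.graphite.eqiad.wmflabs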

In practice

As of April 2020 the implementation consists of the following:

  • A standalone puppetmaster with the production Hiera hierarchy. Two additional lookup files are also provided: one at the top of the hierarchy to override production defaults, and one at the bottom to supply Pontoon-specific, possibly auto-generated, values. The latter file (auto.yaml) is used by Pontoon to work around some Hiera limitations (namely that variables coming from the ENC are strings, whereas in some cases we need lists). See the sketch after this list.
  • The standalone puppetmaster is driven by Pontoon's ENC.
  • Realm is 'labs', to keep subsystems like authentication working as expected.
  • Scenario-specific data (e.g. the "root of trust", the Puppet CA, etc) must be initialized manually.
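
For illustration, the resulting lookup order can be sketched as a Hiera hierarchy with the two extra files wrapped around the production one. This is a simplified sketch only: apart from auto.yaml the file names are placeholders, and the real production hierarchy contains many more layers.

 # hiera.yaml (simplified sketch, not the actual configuration)
 version: 5
 hierarchy:
   - name: "Pontoon overrides"              # top: override production defaults
     path: "pontoon/overrides.yaml"         # placeholder name
   # ... the unchanged production hierarchy goes here ...
   - name: "Pontoon auto-generated values"  # bottom: derived from the stack mapping
     path: "pontoon/auto.yaml"              # works around ENC variables being strings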

Benefits

Why go through all the trouble of building isolated stacks similar to production? There are several benefits to having a bespoke testbed:

  • Lower overhead to experiment with new ideas and services.
  • Increased confidence that a patch will work as expected once applied in production, for example by being able to test distribution upgrades in isolation.
  • External reusability of the Puppet code base is improved as a bonus/side effect: we're factoring out assumptions about production and third parties can easily recreate similar environments. Similarly, contributing to the Puppet code base itself is made easier if recreating an isolated "mini-production" is possible without jumping through too many hoops.

Demo

See the video at File:Pontoon demo graphite buster.ogv

  • In the demo video I'm testing the migration of the graphite::production role to Buster. To do so, I'm adding a freshly provisioned Buster WMCS instance to a self-hosted Pontoon puppet master.
  • At the beginning the instance uses the standard cloudinfra puppet master.
  • Next, I'm confirming which role I want (graphite::production in this case) and proceeding to change the 'observability' stack: adding the new role and then assigning the newly provisioned host to it.
  • I'm then committing the change and pushing it to the Pontoon puppet master as if it were the 'production' branch.
  • Next, I'm taking over (i.e. enrolling) the graphite host. The enroll script needs to know which stack to use (to locate the puppetmaster) and the hostname to enroll. The script then logs in, makes the necessary adjustments (namely changing the puppet master and deleting the host's certificates) and kicks off a puppet run.
  • Note that auto-signing is disabled, yet the puppet master issues the certificate because the host is present in the stack file.
  • The puppet run then proceeds as expected and the graphite role is applied; some failures are to be expected, for example custom Debian packages not yet available in buster-wikimedia.
  • Next I verify that another puppet agent run is possible.

Howto

This section outlines how to try out Pontoon yourself. The idea is to replicate production's model of one-host-per-role, in other words have a Pontoon master and several agent instances. Development happens locally on your workstation via a checkout of puppet.git and changes are pushed to the Pontoon master.

Master setup

  1. Create a Cloud VPS security group to allow access to port 8140 (for puppet clients talking to the puppetmaster).
  2. Create a Cloud VPS instance. This will be referred to as MASTER in the following sections.
    1. Use the Debian Buster source image
    2. An m1.small should be sufficient
    3. Add it to the puppetmaster security group you created
  3. Wait until the VPS is up and you can ssh into it.
  4. Add role::puppetmaster::pontoon to the VPS's horizon puppet configuration (click on your VPS, then on the puppet configuration tab, and enter it into the "Puppet Classes" section).
  5. Run sudo run-puppet-agent twice on the VPS
    1. Expect to see a number of failures here (e.g. puppet-master and apache2 failing to start).
  6. Restart the puppet master: sudo systemctl restart apache2
  7. Configure the master to be able to act as a remote for your user's git push commands. See instructions at Help:Standalone_puppetmaster#Push_using_a_single_branch

Local repository setup

  1. Make sure you are in a local clone of operations/puppet.git
  2. Add the master as a remote: git remote add -f pontoon ssh://MASTER/~/puppet.git
  3. Create a new branch off production: git checkout -b NAME origin/production

New stack creation

  1. Copy the stack template to your own stack (referenced as STACK): cp -v modules/pontoon/files/template.yml modules/pontoon/files/STACK.yml
  2. Add the FQDN of the MASTER to the puppetmaster::pontoon list in the stack file above
  3. Set pontoon::stack: STACK in hieradata/pontoon.yaml (see the example after this list)
  4. Commit the result: git commit -m "Pontoon master and stack" modules/pontoon/files hieradata/pontoon.yaml
  5. Push to the MASTER: git push -f pontoon HEAD:production
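
After steps 1 to 3 the two files you just committed should contain something like the following, using MASTER_FQDN (the FQDN of your MASTER instance) and STACK as placeholders; the rest of the template's contents are left untouched.

 # modules/pontoon/files/STACK.yml
 puppetmaster::pontoon:
   - MASTER_FQDN

 # hieradata/pontoon.yaml
 pontoon::stack: STACK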

Switch to the new master

  1. Navigate to the MASTER's Horizon "Puppet configuration" tab
  2. Add these 2 lines to the Hiera data section:
    puppetmaster: MASTER_FQDN
    pontoon::stack: STACK
    
  3. On MASTER, sudo run-puppet-agent once
  4. Clean up SSL certificates:
    sudo find /var/lib/puppet/ssl/ -type f -exec rm -v {} \;
    sudo rm -v /var/lib/puppet/server/ssl/ca/signed/$(hostname -f).pem
    sudo cp -v /var/lib/puppet/{server/,}ssl/private_keys/$(hostname -f).pem
    sudo cp -v /var/lib/puppet/{server/,}ssl/certs/$(hostname -f).pem
    
  5. On MASTER, sudo run-puppet-agent
  6. The master is now enrolled!

Add a new host

This section outlines how to add a new (non-master) host to an existing Pontoon master.

  1. Add the host's FQDN under its role in your stack file modules/pontoon/files/STACK.yml, e.g.
 graphite::production:
   - graphite-01.graphite.eqiad.wmflabs
  2. Commit the result and push to the Pontoon master:
 git commit -m "pontoon: add host" modules/pontoon/files/STACK.yml
 git push -f pontoon HEAD:production
  3. Provision a new instance in Horizon with the hostname you added above. Make sure the instance is created in the correct Horizon project (graphite in the example above)
  4. Wait for the host to be accessible via ssh (i.e. the first puppet run has completed; progress can be checked via "view log" from Horizon)
  5. Enroll the new host. The script will take care of deleting the current Puppet SSL keypair and flipping the host to the Pontoon master. Run this on your development machine:
 modules/pontoon/files/enroll --config modules/pontoon/files/STACK.yml graphite-01.graphite.eqiad.wmflabs
  6. Puppet agent failures are likely; tweak Puppet/Hiera locally as needed and push to the master as above. Pontoon-specific Hiera variables must live in hieradata/pontoon.yaml (see the example below).
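
As an example of such a tweak, a Pontoon-specific override in hieradata/pontoon.yaml could look like the following; the overridden variable name and its value are purely illustrative and not taken from the real code base:

 # hieradata/pontoon.yaml: stack selection plus Pontoon-specific overrides
 pontoon::stack: STACK
 # Hypothetical override of a production default for this stack:
 profile::graphite::hostname: graphite-01.graphite.eqiad.wmflabs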

TODO

  • Multi-team collaboration