User:Filippo Giunchedi/Pontoon

Problem statement

We, as SRE at the Foundation, work together to keep the production infrastructure up and running. A significant chunk of our work involves making changes to our public Puppet repository. Routine changes vary widely in how intrusive they are to the infrastructure; consider the following:

  • fine-tuning parameters for production services
  • deploying new services
  • rolling out operating system upgrades

Generally speaking, every change introduces risks that we have learned to accept. We have also deployed suitable mitigations for those risks, such as:

  • code reviews
  • the Puppet compiler
  • Puppet realms other than 'production'.

In an ideal world we would be able to minimize the risks of every change before it goes to production. Being able to test changes within a testbed stack (i.e. a "virtual production") greatly reduces risk and enables experimentation in a safe way.

Today

Setting up such stacks is possible today, but certainly not in a "disposable" fashion.

The word disposable in this context means that the stacks should be easy to set up and tear down, and isolated from one another (i.e. as self-contained as possible). For all intents and purposes the stacks resemble production, but receive less (possibly zero) user traffic. Each stack also carries stack-specific data that must be initialized (e.g. private data).

SRE teams today set up WMCS instances to test changes, and roles are assigned via Horizon. Hiera data comes from different sources (Horizon and Puppet), and lookups within hieradata don't behave the same way as in production. The result is duplication of multiple variables and, often, banging Hiera data and variables together until Puppet runs successfully.

This works, but it requires duplicating variables, and the resulting patch can't be applied to production as-is. In a perfect world role assignment would be done exactly the same way as in production and Hiera variables would be looked up the same way: a common default, overriding only what changes (e.g. domain names, hostnames, etc.).

Pontoon

Pontoon (in the k8s nautical theme: a recreational floating device) explores the idea of disposable stacks that are as similar to production as possible. The key idea and goal is that the Puppet code base should not depend on hardcoded production-specific values.

Features include:

  • Role assignment happens by mapping a role to a list of hostnames that need that role.
  • The role mapping is used by Pontoon to drive its Puppet external node classifier (ENC). The ENC also supplies extra variables generated from the mapping.
  • Hostnames listed in the mapping will have their Puppet certificates automatically signed on the first Puppet run.

The explicit role-to-hostnames mapping enables meta-programming Puppet, which in turn enables replacing lists of hostnames in Hiera (e.g. firewall rules) with variables containing "all hosts for role foobar", resolved at catalog compile time.
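
To make this concrete, a stack file is a plain YAML mapping from roles to hostnames. The sketch below is based on the examples later in this page (see the Howto section); the first hostname is a placeholder, and the actual template lives at modules/pontoon/files/template.yml.

 # Stack file (sketch): map each role to the hosts that should run it.
 # Pontoon's ENC classifies nodes from this mapping and also derives
 # "all hosts for role X" style variables from it.
 puppetmaster::pontoon:
   - pontoon-master-01.example.eqiad.wmflabs   # placeholder hostname
 graphite::production:
   - graphite-01.graphite.eqiad.wmflabs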

In practice

As of April 2020 the implementation consists of the following:

  • A standalone puppetmaster with the production Hiera hierarchy. Two additional lookup files are also provided: one at the top of the hierarchy to override production defaults, and one at the bottom to supply Pontoon-specific, possibly auto-generated, values. The latter file (auto.yaml) is used by Pontoon to work around some Hiera limitations (namely that variables coming from the ENC are strings, whereas in some cases we need lists). See the sketch after this list.
  • The standalone puppetmaster is driven by Pontoon's ENC.
  • Realm is 'labs', to keep subsystems like authentication working as expected.
  • Scenario-specific data (e.g. the "root of trust", the Puppet CA, etc) must be initialized manually.
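
For illustration, the resulting lookup order can be sketched as a Hiera hierarchy with the two extra files wrapped around the production one. This is a simplified sketch only: apart from auto.yaml the file names are placeholders, and the real production hierarchy contains many more layers.

 # hiera.yaml (simplified sketch, not the actual configuration)
 version: 5
 hierarchy:
   - name: "Pontoon overrides"              # top: override production defaults
     path: "pontoon/overrides.yaml"         # placeholder name
   # ... the unchanged production hierarchy goes here ...
   - name: "Pontoon auto-generated values"  # bottom: derived from the stack mapping
     path: "pontoon/auto.yaml"              # works around ENC variables being strings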

Benefits

Why go through all the trouble of building isolated stacks similar to production? There are several benefits to having a bespoke testbed:

  • Lower overhead to experiment with new ideas and services.
  • Increased confidence that a patch will work as expected once applied in production, for example by being able to test distribution upgrades in isolation.
  • External reusability of the Puppet code base is improved as a bonus/side effect: we're factoring out assumptions about production and third parties can easily recreate similar environments. Similarly, contributing to the Puppet code base itself is made easier if recreating an isolated "mini-production" is possible without jumping through too many hoops.

Demo

See the video at File:Pontoon demo graphite buster.ogv

  • In the demo video I'm testing the migration of the graphite::production role to Buster. To do so, I'm adding a freshly provisioned Buster WMCS instance to a self-hosted Pontoon puppet master.
  • At the beginning the instance uses the standard cloudinfra puppet master.
  • Next, I'm confirming which role I want (graphite::production in this case) and proceeding to change the 'observability' stack: adding the new role and then assigning the newly provisioned host to it.
  • I'm then committing the change and pushing it to the Pontoon puppet master as if it were the 'production' branch.
  • Next, I'm taking over (i.e. enrolling) the graphite host. The enroll script needs to know which stack to use (to locate the puppetmaster) and the hostname to enroll. The script then logs in, makes the necessary adjustments (namely changing the puppet master and deleting the host's certificates) and kicks off a puppet run.
  • Note that auto-signing is disabled, yet the puppet master issues the certificate because the host is present in the stack file.
  • The puppet run then proceeds as expected and the graphite role is applied; some failures are to be expected, for example custom Debian packages not yet available in buster-wikimedia.
  • Next I verify that another puppet agent run is possible.

Howto

This section outlines how to try out Pontoon yourself. The idea is to replicate production's model of one-host-per-role, in other words have a Pontoon master and several agent instances. Development happens locally on your workstation via a checkout of puppet.git and changes are pushed to the Pontoon master.

Master setup

  1. Create a Cloud VPS security group to allow access to port 8140 (for puppet clients talking to the puppetmaster).
  2. Create a Cloud VPS instance. This will be referred to as MASTER in the following sections.
    1. Use the Debian Buster source image
    2. An m1.small should be sufficient
    3. Add it to the puppetmaster security group you created
  3. Wait until the VPS is up and you can ssh into it.
  4. Add role::puppetmaster::pontoon to the VPS's horizon puppet configuration (click on your VPS, then on the puppet configuration tab, and enter it into the "Puppet Classes" section).
  5. Run sudo run-puppet-agent twice on the VPS
    1. Expect to see a number of failures here (e.g. puppet-master and apache2 failing to start).
  6. Restart the puppet master: sudo systemctl restart apache2
  7. Configure the master to be able to act as a remote for your user's git push commands. See instructions at Help:Standalone_puppetmaster#Push_using_a_single_branch

Local repository setup

  1. Make sure you are in a local clone of operations/puppet.git
  2. Add the master as a remote: git remote add -f pontoon ssh://MASTER/~/puppet.git
  3. Create a new branch off production: git checkout -b NAME origin/production

New stack creation

  1. Copy the stack template to your own stack (referenced as STACK): cp -v modules/pontoon/files/template.yml modules/pontoon/files/STACK.yml
  2. Add the FQDN of the MASTER to the puppetmaster::pontoon list in the stack file above
  3. Set pontoon::stack: STACK in hieradata/pontoon.yaml (see the example after this list)
  4. Commit the result: git commit -m "Pontoon master and stack" modules/pontoon/files hieradata/pontoon.yaml
  5. Push to the MASTER: git push -f pontoon HEAD:production
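
After steps 1 to 3 the two files you just committed should contain something like the following, using MASTER_FQDN (the FQDN of your MASTER instance) and STACK as placeholders; the rest of the template's contents are left untouched.

 # modules/pontoon/files/STACK.yml
 puppetmaster::pontoon:
   - MASTER_FQDN

 # hieradata/pontoon.yaml
 pontoon::stack: STACK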

Switch to the new master

  1. Navigate to the MASTER's Horizon "Puppet configuration" tab
  2. Add these 2 lines to the Hiera data section:
    puppetmaster: MASTER_FQDN
    pontoon::stack: STACK
    
  3. On MASTER, sudo run-puppet-agent once
  4. Clean up SSL certificates:
    sudo find /var/lib/puppet/ssl/ -type f -exec rm -v {} \;
    sudo rm -v /var/lib/puppet/server/ssl/ca/signed/$(hostname -f).pem
    sudo cp -v /var/lib/puppet/{server/,}ssl/private_keys/$(hostname -f).pem
    sudo cp -v /var/lib/puppet/{server/,}ssl/certs/$(hostname -f).pem
    
  5. On MASTER, sudo run-puppet-agent
  6. The master is now enrolled!

Add a new host

This section outlines how to add a new (non-master) host to an existing Pontoon master.

  1. Add the host's FQDN under its role in your stack file modules/pontoon/files/STACK.yml, e.g.
 graphite::production:
   - graphite-01.graphite.eqiad.wmflabs
  2. Commit the result and push to the Pontoon master:
 git commit -m "pontoon: add host" modules/pontoon/files/STACK.yml
 git push -f pontoon HEAD:production
  3. Provision a new instance in Horizon with the hostname you added above. Make sure the instance is created in the correct Horizon project (graphite in the example above)
  4. Wait for the host to be accessible via ssh (i.e. the first puppet run has completed; progress can be checked via "view log" from Horizon)
  5. Enroll the new host. The script will take care of deleting the current Puppet SSL keypair and flipping the host to the Pontoon master. Run this on your development machine:
 modules/pontoon/files/enroll --config modules/pontoon/files/STACK.yml graphite-01.graphite.eqiad.wmflabs
  6. Puppet agent failures are likely; tweak Puppet/Hiera locally as needed and push to the master as above. Pontoon-specific Hiera variables must live in hieradata/pontoon.yaml (see the example below).
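
As an example of such a tweak, a Pontoon-specific override in hieradata/pontoon.yaml could look like the following; the overridden variable name and its value are purely illustrative and not taken from the real code base:

 # hieradata/pontoon.yaml: stack selection plus Pontoon-specific overrides
 pontoon::stack: STACK
 # Hypothetical override of a production default for this stack:
 profile::graphite::hostname: graphite-01.graphite.eqiad.wmflabs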

TODO

  • Multi-team collaboration