SRE/business case/Disposable Development Environment

1. Executive Summary

In order to maintain a healthy software development cycle for the WMF infrastructure we need to have a robust development environment which is easy to set up, configure, and adapt, and which is ideally disposable so that we are not consuming too many resources.

2. Business Problem

Many new and veteran staff members complain about the state of our testing and development environments; arguably we have no official testing or development environments at all. This makes it difficult for staff to gain confidence that any specific change will deploy without issue. It also makes it difficult for engineers to quickly and easily test new ideas and innovations. A disposable development environment would allow engineers to easily spin up development environments and test changes, confident that in the worst case they can simply delete the environment and start again. Furthermore, having a good disposable development environment would open the pathway for innovations such as incident training environments, whilst also making it easier for us to create more complex CI environments which can test the entire build/image process.

3. Problem Analysis

Testing and development continue to be a problem for new and old staff members alike. We have a number of solutions for testing at the Foundation; however, they all have their own subtle nuances which mean they are not simple for new users to set up, and in many cases, even once set up and configured, they require constant effort, both at the general puppet level and the individual contributor level, to ensure that all shared resources, hiera configuration, and secrets are available to the testing environment. Furthermore, the multitude of options, each solving and introducing its own subtle issues, means that solution engineering, training, and application are disjointed among the team, which (I suspect) means new people will get different answers depending on who their on-boarding buddy is or who they ask.

As to the actual problems, I think the following categories cover some of the issues shared across all the extant solutions:

  • Lack of a shared services infrastructure: this includes things like the PKI service. Many puppet classes require the PKI service so they can request a certificate; the lack of a PKI service can cause these classes to fail, leaving the testing engineer to either locally hack the puppet code to factor out the problematic parts or build a dummy instance of this shared service.
  • Differences in environment: any solution we create is likely to have some differences between environments, such as different resolver IP addresses or some other piece of metadata, all of which in an ideal world would be controlled via hiera. However, some of the current testing environments that run in the WMCS cloud environment, in addition to having metadata changes, also run a completely different set of puppet base classes that often conflict with the production base environment. This leads to exclusions or work-arounds which need constant effort to ensure manifests continue to work with the most recent changes. Many of the WMCS project environments also lack a puppetdb, introducing even more differences.
  • Different Puppet execution methods: some of the testing options work by using puppet apply instead of puppet agent. This difference in execution means that some resources are evaluated differently; for instance, erb templates are rendered on the host where the catalog is compiled (which can be the user's laptop) rather than on the puppetmaster. Using puppet apply also means there is no puppetdb, leading to differences and sometimes failures in catalog compilation (see the sketch after this list).
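
To make the last point concrete, below is a minimal sketch (the class and file names are hypothetical and not taken from the production repo) of two common patterns whose behaviour depends on where the catalog is compiled and whether a puppetdb is available:

  # Exported resources are stored in and collected from puppetdb. Under
  # puppet apply (no puppetdb) the collector below finds nothing, so the
  # generated configuration is silently incomplete rather than failing loudly.
  class example::cluster_hosts {
    @@host { $facts['networking']['fqdn']:
      ip => $facts['networking']['ip'],
    }
    Host <<| |>>
  }

  # Templates are rendered wherever the catalog is compiled: on the
  # puppetmaster for puppet agent, but on the operator's laptop for
  # puppet apply, so any compile-time state can differ between the two.
  class example::resolver_config {
    file { '/etc/example/resolv.conf':
      content => template('example/resolv.conf.erb'),
    }
  }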

4. Current Solutions

deployment-prep aka beta

The Beta cluster is an environment primarily used by MediaWiki developers to test changes. The cluster tries to recreate the production MediaWiki installation along with the many micro services required for operation. It's similar to a traditional development staging environment in that it is a single environment used by many developers, meaning that at any one time there is little visibility (at least from an SRE PoV) as to what state the cluster and its individual micro services are in. This environment also maintains its own set of patches to either test new functionality or work around the aforementioned issues with trying to replicate the production environment in WMCS. I'm unaware of who, if anyone, owns this project.

Individual or team WMCS projects

Many teams and individuals have created and maintain their own WMCS environments to replicate specific aspects of the production environment. Along with the aforementioned general issues with WMCS, the piece of infrastructure you are trying to replicate changes the complexity and effort of creating and maintaining such environments. This generally comes down to how many shared services and other pieces of infrastructure the specific service needs to interact with. As an example, the sso project has very few external dependencies and only required the addition of a database service to replace the shared db service in production. However, projects like observability, which need to interact with almost every piece of infrastructure in the environment, would, with this methodology, have to spend a large amount of time creating and maintaining services which they are not interested in and may be unfamiliar with, just to test a change in their specific work area.

Pontoon

Pontoon is a framework and set of tools designed to work around the issues described in the Individual or team WMCS projects section. It's the author's opinion that pontoon is the best solution we currently have for creating a disposable environment, and any future work should use this project as a starting point to avoid having to relearn the same lessons again. However, pontoon is still built on top of the WMCS base puppet policy, which leads to some creative workarounds and subtle differences between production and the testing environment. Furthermore, pontoon requires a lot of resources, as every environment creates its own set of shared resources. It would be better if it could leverage a shared set of services in the first instance, only creating them locally if they are actually needed to test changes specific to the shared environment.

Bolt

Using Bolt to test changes is a relatively new addition to our portfolio. It works by compiling a catalog locally and then applying it to a specific server using puppet apply --noop. The main issue with this is that it makes use of puppet apply instead of puppet agent; see the points above on different execution environments. Bolt also uses puppet 7 while the production environment is still only on puppet 5; this is an issue which should go away with time, and it has the benefit of allowing us to test puppet 7 compatibility. It also runs from a user's laptop, which introduces other issues such as DNS resolution for *.wmnet not working. Ultimately, I think bolt is a useful tool to have in our belt; however, I believe the difference in execution environments will prevent this from being the primary testing tool.
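
For illustration, a rough sketch of this kind of Bolt workflow could look like the following (the plan and role names are hypothetical); the apply block is compiled on the machine running Bolt, which is exactly the execution-environment difference described above:

  # Hypothetical Bolt plan: compile the catalog locally and apply it to the
  # targets in no-op mode.
  plan example::noop_check (
    TargetSpec $targets,
  ) {
    # Ensure the targets have a puppet agent available to apply the shipped
    # catalog.
    apply_prep($targets)

    # The contents of this block are compiled where Bolt runs (e.g. the
    # operator's laptop), not on a puppetmaster.
    return apply($targets, '_noop' => true) {
      include role::example
    }
  }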

Beaker

Puppet/Beaker is an even newer addition to our tool set. In its current form it is also designed to run from a user's laptop and works by using puppet apply instead of puppet agent. As such, it suffers from very similar issues to bolt.

5. Available Options

The following options are not mutually exclusive; in fact, the author believes they would complement each other.

5.1 Option 1 - Supported shared services infrastructure

5.1.1 Description

A piece of infrastructure maintained and supported with all the shared services required to run a specific role in isolation. These shared services could be configured in a way that makes them wide open to anything, but they would only provide dummy information, enough to allow the puppet policy to compile in a sane manner, e.g. for PKI we could issue a generic certificate from a bogus domain, with a short expiry, that is not trusted anywhere.
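
As a rough sketch of how a role might consume such a shared service (the profile, defined type, and hiera key below are hypothetical rather than existing production code), the PKI endpoint would simply be a hiera-controlled value, so the same code works against the production PKI or against the wide-open test instance that only hands out short-lived, untrusted certificates:

  # Hypothetical profile: which PKI endpoint to talk to is purely a hiera
  # decision, so pointing a disposable environment at the shared dummy PKI
  # requires no code changes.
  class profile::example::tls (
    Stdlib::Host $pki_host = lookup('profile::example::tls::pki_host'),
  ) {
    # Hypothetical defined type standing in for whatever requests a
    # certificate from the PKI service during the puppet run.
    example::cert { $facts['networking']['fqdn']:
      pki_host    => $pki_host,
      common_name => $facts['networking']['fqdn'],
    }
  }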

5.1.2 Benefits, Goals and Measurement Criteria

I feel the benefit of this is that it would solve many of the issues we currently have. In many cases users are only interested in testing a specific role; by providing a set of shared services, users can concentrate on the role they are interested in instead of having to fight with a service or role they are unfamiliar with. This should also reduce the resources used by each individual environment. Furthermore, if any disposable infrastructure defaults to making use of the set of shared services, then there are fewer VMs or containers to spin up at any one time, increasing the velocity of the create, test, and destroy cycle.

5.1.3 Costs and Funding Plan

5.1.4 Risks

One of the big risks here is that the infrastructure would have a fairly open access control list, making it susceptible to abuse. There would also be an increased maintenance burden for the owners of the shared services.

5.1.5 Issues

What do we consider to be shared infrastructure? e.g. PKI, acme chief, LVS, cache etc. From my PoV, shared services should consist of the minimum set of services needed for puppet to cleanly complete on all roles in isolation. For this we will likely need the PKI service, as roles actively reach out to it to request certificates; however, we wouldn't need an LVS service, as AFAIK nothing actively polls LVS during puppet compilation or application.
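
To illustrate the distinction (again with hypothetical names): a role only needs a shared service to exist if something in the catalog contacts that service while puppet compiles or applies it; merely rendering configuration that mentions a service does not:

  class example::shared_service_usage (
    Stdlib::Host        $pki_host = lookup('example::pki_host'),
    Stdlib::IP::Address $lvs_ip   = lookup('example::lvs_ip'),
  ) {
    # Needs the shared service: the (hypothetical) helper contacts the PKI
    # endpoint during the puppet run, so an unreachable PKI makes the run fail.
    exec { 'request-certificate':
      command => "/usr/local/bin/request-cert --pki https://${pki_host}/issue",
      creates => '/etc/ssl/localcerts/service.pem',
    }

    # Does not need a shared service: this only writes out configuration that
    # mentions the LVS address; nothing polls LVS while puppet runs.
    file { '/etc/example/lvs.yaml':
      content => "lvs_service_ip: ${lvs_ip}\n",
    }
  }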

5.1.6 Assumptions

5.2 Option 2 - Cloud agnostic pontoon

5.2.1 Description

As mentioned earlier, the author believes that pontoon is the best solution we currently have; however, it was designed to work around some issues only present in WMCS. It would be nice to have tooling similar to pontoon that can work with any vanilla cloud or containerized environment. I think it would also be useful to make pontoon work with the shared infrastructure option proposed above so users could minimize the number of resources required to test a change.

5.2.2 Benefits, Goals and Measurement Criteria

By making pontoon work with generic clouds and removing WMCS-specific hacks, we enable it to be used in many more places. Users, and specifically volunteers, would be able to spin up environments in their favorite free-tier cloud environment, local k8s environment, or beaker.

5.2.3 Costs and Funding Plan

5.2.4 Risks

5.2.5 Issues

Currently pontoon has specific workarounds to make sure it is compatible with WMCS, but the workarounds make it less like production. This proposal suggests that we remove those workarounds to both keep pontoon simple and to ensure any resulting development environment is as close to production as possible. However, by design this means that pontoon would become incompatible with the WMCS environments unless WMCS is able to provide unmanaged vanilla images.

5.2.6 Assumptions

5.3 Option 3 - Docker compose/helm chart repository for specific roles

5.3.1 Description

Building on top of pontoon, it would be great if we could also have a set of docker-compose or helm chart repositories describing specific roles and environments, so that users could easily take e.g. mediawiki, mediawiki + cache, or mediawiki + cache + LVS configurations and spin up an environment which provides a workable representation of production.

5.3.2 Benefits, Goals and Measurement Criteria

This would allow for very customizable and highly disposable environments, easing development of complex changes and also providing environments for new volunteers or contributors to explore, test, and become familiar with our infrastructure. It would further lay the groundwork for more ambitious endeavors such as incident training environments.

5.3.3 Costs and Funding Plan

5.3.4 Risks

I think the biggest risk here is the unknown unknowns and untangling some of the ingrained and hard-coded dependencies. I feel that in theory many environments would be easy to adapt to this new plan, but inevitably some will be difficult or even impossible. It's hard to estimate where a given role will fall between these two extremes without first trying to build some of these environments.

5.3.5 Issues

5.3.6 Assumptions

6. Recommended Option

I feel that all the options above should be considered, as they complement each other. However, I think that Option 1 could already benefit deployment-prep, individual cloud projects, and pontoon, and it could also make things like running PCC locally simple. With those benefits in mind, I think it should take priority. Option 3 largely depends on Option 2, so Option 2 would be the second priority.

7. Implementation Approach

7.1 Dependencies

7.2 Rough estimates