Streamlined Service Delivery Design/research
There is an effort underway to introduce continuous delivery to WMF, which started early in 2017. The exact shape and scope of this effort is still under discussion. User:Lars Wirzenius was hired in October 2018 to help with this.
This document outlines Lars's understanding and thinking so far. It's meant to aid discussion of this topic within the Release Engineering team and other interested parties, to show where Lars is in the dark, to show where Lars has understood things correctly, and to reveal things that need discussion. If you disagree with something here, please contact Lars and set him straight.
Continuous concepts
Continuous integration
w:Continuous integration (CI) is the practice of merging changes to the main line of development frequently, preferably many times a day. The idea is to keep everyone working on roughly the same version of a code base, so that no one's work diverges from the others' for too long or by too much, to reduce problems and unnecessary work due to integration issues.
CI is typically coupled with automated test suites (at least unit tests, possibly integration tests, performance tests, or other tests), possibly in a way that changes don't get merged until tests pass, though it is common for tests to be run after a merge. CI also typically involves an automated system that runs the tests every time the main line of development changes, whenever a change is proposed for integration into the main line, periodically, or all of the above.
Continuous delivery
w:Continuous delivery is the practice of releasing software frequently. Continuous delivery typically uses CI to ensure the main line of development is in a releasable state at all times. The actual release process is typically fully automated, but making a release is a manual decision. The release process typically results in a number of artifacts, which can be used to install, upgrade, or deploy the software.
Continuous deployment
w:Continuous deployment is the practice of deploying all changes to the main line of development to production. In contrast to continuous delivery, what gets deployed might not be a release, and what gets released might not get deployed.
CD vs CD
Continuous delivery and continuous deployment are both commonly abbreviated as CD. This document doesn't abbreviate either, for clarity.
Pipelines, gates
Continuous integration, delivery, and deployment are often all modelled using "pipelines", where a change enters a pipeline, goes through a number of steps, and produces a result at the end. A step might be, for example, "get the source code", "build the source code", "run unit tests", or "install on production server". At various steps in a pipeline there might be "gates", where progress stops until and unless an entity indicates that it's OK to move forward. A gate might be "manual testing did not find problems".
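To make the pipeline-and-gates model concrete, here is a minimal sketch in Python. The step and gate names are hypothetical illustrations of the examples above, not a description of any existing WMF tooling.

```python
# Minimal sketch of a pipeline with a gate; step names are hypothetical.
from typing import Callable, List

Step = Callable[[], bool]  # a step or gate: returns True if it's OK to move forward


def run_pipeline(steps: List[Step]) -> bool:
    """Run each step in order; stop at the first step or gate that says no."""
    for step in steps:
        if not step():
            print(f"pipeline stopped at: {step.__name__}")
            return False
    return True


def get_source_code() -> bool:
    return True  # stand-in for fetching the source


def build_source_code() -> bool:
    return True  # stand-in for compiling/building


def run_unit_tests() -> bool:
    return True  # stand-in for running the test suite


def manual_testing_gate() -> bool:
    # A gate: progress stops here until a human says it's OK to continue.
    return input("Did manual testing find problems? [y/N] ").strip().lower() != "y"


def install_on_production_server() -> bool:
    return True  # stand-in for the actual deployment step


run_pipeline([get_source_code, build_source_code, run_unit_tests,
              manual_testing_gate, install_on_production_server])
```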
Current status
WMF provides a large number of web sites: there are around a thousand wikis, implemented with MediaWiki, plus other sites. MediaWiki sites often use backing services, which are often microservices.
- Example of microservice: Mathoid renders custom markup in wikitext to MathML and images.
- Databases and load balancers are other types of backing services.
Running a MediaWiki site counts as a service for this document, even if MediaWiki doesn't fit into the model of a microservice with an HTTP API. Deploying MediaWiki should ideally be similar to deploying backing services.
Changes to MediaWiki sites are deployed in two ways: SWAT and the Train. SWAT is mainly for configuration changes, small bug-fixes for user-visible problems, and other low-risk changes. The Train is for other changes, most importantly changes to MediaWiki core and extensions.
SWAT happens a couple of times a day (European afternoon, North American afternoon), and the Train runs Tuesday through Thursday most weeks. The Train deploys to group 0 ("minor sites") on Tuesday, group 1 ("medium sites") on Wednesday, and group 2 ("big sites") on Thursday, in the hope of catching errors on smaller sites before they affect large numbers of people.
The goal is to move services individually to a continuous delivery model. One microservice has been moved (Mathoid). Three more are due to be moved by the end of 2018 (Graphoid, Zotero, Blubberoid). This means they will be deployed and updated using the new Delivery Pipeline for WMF in the future.
- Goals
- FIXME: Link to Deployment Pipeline
Currently, deployment is triggered manually, aided by some scripting, and happens daily (SWAT) or weekly (the Train). Many of the mechanical steps of deployment have been scripted (scap), but the process still requires manual work. Part of the manual work is checking that the service still works after an upgrade. Part of it is change review, and in some cases small fixes to the changes. There also seems to be a lot of ad hoc communication between developers and the Release Engineering team. Deployment is labor-intensive, and as such too error-prone.
Scap uses canary servers when deploying: a small subset of the servers are deployed to first, and then logs and other things are checked for problems. If everything goes well, the rest of the servers are deployed to as well.
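As an illustration only (this is not scap's actual code), the canary logic amounts to something like the following sketch; the helper functions are stand-ins for real deployment and log-checking tooling.

```python
# Illustrative sketch of canary deployment; the helpers are placeholders.
from typing import List


def deploy_to(servers: List[str], version: str) -> None:
    print(f"deploying {version} to {servers}")  # stand-in for the real deployment


def error_rate(servers: List[str]) -> float:
    return 0.0  # stand-in for checking logs and other metrics on these servers


def roll_back(servers: List[str]) -> None:
    print(f"rolling back {servers}")  # stand-in for reverting to the previous version


def canary_deploy(all_servers: List[str], version: str,
                  canary_count: int = 2, max_error_rate: float = 0.01) -> bool:
    """Deploy to a small subset first; continue only if no problems show up."""
    canaries, rest = all_servers[:canary_count], all_servers[canary_count:]
    deploy_to(canaries, version)
    if error_rate(canaries) > max_error_rate:
        roll_back(canaries)   # problems found: stop before touching the rest
        return False
    deploy_to(rest, version)  # canaries look healthy: deploy everywhere else
    return True
```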
There is some automated testing of running services: service-checker consumes an OpenAPI specification listing the HTTP endpoints to check, along with their expected responses. In addition, the logstash log collection service is queried to see if error rates change after deployment. (A rough sketch of this kind of endpoint check follows the list below.)
- service-checker
- OpenAPI example
- Question: at what point are these tests run?
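Below is a rough sketch of what such an endpoint check amounts to. It is not service-checker's actual code; the paths, expected responses, and base URL are hypothetical.

```python
# Rough sketch of an OpenAPI-driven endpoint check, in the spirit of
# service-checker; the spec stand-in and endpoints below are hypothetical.
import urllib.request

# Hand-written stand-in for (part of) an OpenAPI spec: paths to check and
# what their responses are expected to contain.
SPEC = {
    "/_info": {"status": 200, "contains": "name"},
    "/healthz": {"status": 200, "contains": "ok"},
}


def check_service(base_url: str) -> bool:
    """Return True if every endpoint responds as the spec expects."""
    ok = True
    for path, expected in SPEC.items():
        try:
            with urllib.request.urlopen(base_url + path) as response:
                body = response.read().decode("utf-8", errors="replace")
                if response.status != expected["status"] or expected["contains"] not in body:
                    ok = False
        except OSError:  # connection errors and HTTP error statuses
            ok = False
    return ok
```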
There is also some monitoring of services, which SRE sets up and maintains. Monitoring alerts relevant parties when it notices something breaking. It should be noted that monitoring is not the same as testing: a test suite tells you what aspect of a service doesn't work ("front page doesn't say Wikimedia"), whereas monitoring tells you when some measured aspect of a service doesn't fall into an expected range ("too many 500 status codes in the log file"). Both are needed. Test suites are especially useful when making changes to the service code ("do the things that the test suite tests still work?"); monitoring is needed when something changes without a deployment (a server catches fire).
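The distinction can be made concrete with a small, hypothetical sketch: a test asserts a specific behaviour of the service, while a monitoring check asserts that a measured value stays in an expected range.

```python
# Hypothetical sketch contrasting a test with a monitoring check.
import urllib.request


def test_front_page_mentions_wikimedia(url: str = "https://www.wikipedia.org/") -> None:
    """A test: a specific aspect of the service must behave as expected."""
    body = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    assert "Wikimedia" in body, "front page doesn't say Wikimedia"


def too_many_500s(count_500s_last_minute: int, threshold: int = 100) -> bool:
    """A monitoring check: a measured value must fall into an expected range."""
    return count_500s_last_minute > threshold
```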
Known problems with current status
All services (except the one already moved to the delivery pipeline) run on bare metal or in virtual machines. This limits how well WMF can react to fluctuations in traffic, and thus increases hosting costs, because hardware resources have to be over-allocated (each service needs all the resources required for its peak traffic). A more container-based system could save on hosting costs by not allocating hardware for each service's peak load, and instead sharing hardware between services according to load.
- Question: not sure how big a problem this is.
Software development for WMF needs is slow, because deployment slows it down. At the same time, deployment is risky because the deployment process doesn't keep pace with development; it can't keep pace because it's socially and cognitively too expensive.
Software development can be modelled as loops within loops. Development goes faster, and the software-developing entity is more productive, when loops are iterated faster (or at least that's been Lars's experience over the decades). Removing friction and obstacles and automating steps within a loop helps. The innermost loop is the "edit, build, test" cycle. Deployment is in the "build, deploy, test" loop. Making deployment easier and faster will help the overall software development productivity of Wikimedia and its community.
The WMF sites stay up thanks to a lot of manual effort spent on review. This is aided by some automated testing, deployment tooling, monitoring, and many users (both WMF staff and in the community) who eagerly report problems. There seem to be no major issues with quality and level of service, but it seems to require a lot more human effort than might be necessary. The friction coefficient of the deployment loop is high.
While WMF doesn't have a profit motive, it seems nevertheless that making processes smoother and more automated would be beneficial for WMF and the Wikimedia community in general, by freeing people to do more amazing things and spend less effort on mundane, repetitive deployment work. Better tools are force multipliers for brains.
Setting up entirely new services is a medium-big project. This seems like it should be less of an effort.
Desired status
All services run in containers, hosted by Kubernetes.
This is still under discussion and possibly controversial, so Lars proposes two phases:
Phase 1: Continuous delivery: All the mechanical steps of deployment are automated using scap (or other tooling). Deployment is triggered when a deployer or release engineer runs the deployment script. There's still communication with developers, and keeping an eye on things, as part of the process. Changes that reach the beginning of the delivery pipeline will have been reviewed and approved already. The Train will be replaced by a more SWAT-like approach of changes getting deployed every workday.
Phase 2: Continuous deployment: Deployments are fully automated, except for change review. The deployment pipeline has one or more gates at which manual review is done. The deployment pipeline is triggered when a developer requests a change to be reviewed and merged.
There are automated tests for all services. The tests have sufficient coverage and quality that, if they pass, the Release Engineering team has confidence that the sites and services work for our users.
A possible cultural change: when anything breaks that should have been caught by automated tests, the tests are changed to catch it in the future.
SRE team
SRE has responsibility for keeping Kubernetes running, as well as any other infrastructure (DNS, databases, etc). SRE also sets requirements for what runs in production: security, version traceability, testability, monitoring, and more.
Question: Does SRE want to review the code running in production, or its configuration? Also, changes to that. Probably not, but check with them.
SRE handles databases, and their configuration seems to be at least partially in the MediaWiki configuration and code. They will probably want to review any changes to that.
Release engineering team
The Release Engineering team has responsibility for providing and maintaining tooling to do deployments automatically, running automation to make deployments happen frequently, and reviewing changes to production (a sanity check, if nothing else).
Service developers (community and WMF)
The developers of the services have responsibility for writing and maintaining the service code, documenting service configuration, and writing and maintaining automated tests for the running service.
To facilitate this, developers get quick feedback from automated tests if anything seems hinky, so that they can fix it before the Release Engineering team gets involved. Quick here means within minutes. This is achieved by CI (Jenkins).
Possible cultural change: If any reviewer finds anything to fix, even if only extra whitespace, the change is made by the developer. (But silly, simple things like whitespace can be tested for automatically, and such tests are run before a human reviewer ever looks at the change.)
Possible cultural change: If there is a problem, the sites can be rolled back to a known-working version easily. When this happens, the automated tests get improved so they'll catch that problem in the future. The responsibility for improving the tests lies with both the developers and the Release Engineering team. Canary servers are still used, to look for problems that only happen under production conditions.
The smaller and safer a change is, the less effort it is to get it deployed to production. For example, a change to translations should be possible to do within an hour after having been approved by a reviewer. (That's a goal, not a requirement or a promise.)
Overview of planned solution
All services which can be run in a container are run in a container. Persistent data, such as databases, will stay on bare metal. Containers will communicate with them over the network. We will start by moving microservices into containers, and move MediaWiki last, probably not before late 2019.
All changes are built and tested by CI (Jenkins), which also builds container images. Those images get deployed into test instances, and those instances are tested.
This happens for changes to the master branch, as well as for developer branches. Phabricator and Gerrit are used to track changes, as before. Each change is automatically built and tested, and once tests pass, submitted by the developer for review by a human. If the reviewer accepts, the change is merged into the master branch. Changes are handled in a way that notices when they work individually, but break together.
All configuration also gets deployed from git, and merged into the master branch using the same process.
No human ever changes the master branch directly. (If possible, the git server will be configured to prevent that.)
Suggestion for a continuous delivery process
Lars suggests the following continuous delivery process. There is a textual description, followed by a UML sequence diagram.
Roles and systems
The process involves three roles:
- developer
- makes the actual change
- reviewer
- reviews the change and accepts it (who this is depends on the change, and may include QA or testers)
- deployer
- pushes change to production (might be someone in the release team)
The process also involves a number of computer systems:
- Phabricator
- ticketing system (for this)
- Gerrit
- code review system
- Zuul
- gating system
- Jenkins
- runs automated jobs, e.g., to build Docker images
- Docker image store
- stores images built by Jenkins
- staging, production
- various instances of Wikimedia sites and services
An instance here means the full set of virtual machines, containers, and software running in them, configured in the same way as production. However, non-production instances may have fewer resources (less CPU, less RAM, less disk space, less everything) than production does. Also, the data in the instance may be different.
Process description
(This has been simplified a lot from previous versions. Important changes: no more ticket per change; no more Jenkins running system tests on the Docker image)
Note that this is a description only of the delivery pipeline, not other deployments. It also only applies to things running in containers in Kubernetes.
The process is roughly as follows:
The pipeline starts with someone (a developer) pushing a change to Gerrit. They've already tested it locally.
Gerrit generates a change event, which Zuul listens to, and Zuul triggers Jenkins to run a job that builds the software, and runs any tests in the build tree. Jenkins also builds and publishes various variants of the Docker images for the software to the image store.
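Stripped of the Jenkins and Blubber specifics, the build-and-publish step amounts to roughly the following; the registry, image name, and tag below are placeholders.

```python
# Roughly what the image build-and-publish step does; in reality Jenkins
# drives this, and the names below are placeholders.
import subprocess


def build_and_publish(registry: str, image: str, tag: str) -> str:
    """Build a Docker image from the current source tree and push it to the registry."""
    full_name = f"{registry}/{image}:{tag}"
    subprocess.run(["docker", "build", "--tag", full_name, "."], check=True)
    subprocess.run(["docker", "push", full_name], check=True)
    return full_name


# Example with placeholder values:
# build_and_publish("docker-registry.example.org", "mathoid", "2018-11-01-001")
```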
If all of that goes OK, Jenkins informs Gerrit of this, and records in the ticket that CI is happy with the change. This triggers a notification to reviewers that a review is needed.
The reviewer looks at the change, and may download the image from the image store. The reviewer decides if they're happy, or if they would like something changed before they're happy. If they're happy, they record that in Gerrit. This triggers Gerrit to merge the change to the master branch, and notify the deployers.
The deployers manually trigger a Jenkins job to deploy the changes (or current master branch) to a staging Kubernetes cluster. The deployment is fully automated, and uses the exact same Docker images that will later be deployed to production.
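As a sketch of what such a deployment job might do, assuming one Helm chart per service (the chart path, release name, namespace, and value key are placeholders): the same function, with a different environment argument, would later be used for production, keeping staging and production deployments identical.

```python
# Hypothetical sketch of the deployment job; chart path, release name,
# namespace, and value key are placeholders, not the real pipeline's names.
import subprocess


def deploy(environment: str, image_tag: str) -> None:
    """Deploy the given image tag to the named environment (e.g. staging or production)."""
    subprocess.run(
        [
            "helm", "upgrade", "--install",
            f"mathoid-{environment}",                    # release name (placeholder)
            "charts/mathoid",                            # chart path (placeholder)
            "--namespace", environment,
            "--set", f"main_app.image.tag={image_tag}",  # which image to run (placeholder key)
            "--wait",                                    # block until the rollout finishes
        ],
        check=True,
    )
```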
The deployers review the staging cluster and decide if the change can be deployed to production. If so, they record this in Gerrit as a vote.
Then the deployers trigger a Jenkins job to deploy to production. It is identical to the deployment to staging. This is recorded in Gerrit.
The deployer closes the ticket for the change.
If something fails at any point, the process resets to the beginning. If something fails after deployment to staging or production, the instances are reverted to the state they were in before the change.
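For services running in Kubernetes, reverting can be as simple as rolling a deployment back to its previous revision, as in this sketch (the deployment name is a placeholder):

```python
# Sketch of reverting a service after a bad deployment; the deployment
# name and namespace are placeholders.
import subprocess


def revert(environment: str, deployment: str = "mathoid") -> None:
    """Roll the named Kubernetes deployment back to its previous revision."""
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}",
         "--namespace", environment],
        check=True,
    )
```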
Process as a UML sequence diagram
Discussion
The process, as outlined above, allows for only one change to be in flight at a time. This is mostly for simplicity, but it seems to Lars that everything after change review should be linear, to avoid confusion stemming from trying to do too many things at a time.
However, it would be catastrophic to development speed to only work on one change at a time. Thus, it seems sensible to allow many changes to be in flight at a time. However, it seems like a recipe for much confusion to review or test multiple changes at a time. Thus, ideally, each change would result in a separate "testing" instance, where each instance has the minimal set of services running.
Given that not everything will be running in Kubernetes containers, at least during a transition period, an "instance" should probably be a Kubernetes cluster, plus a set of virtual machines and other systems to run the full set of software that powers Wikimedia's sites (MediaWiki instances, databases, load balancers, services running in VMs, services running in Kubernetes, etc). This is not cheap in terms of resources, which is a concern.
Production needs to be configured to handle all real traffic. Staging needs to be identical, except it doesn't need all the resources to handle real traffic. Staging should have a copy of real data from production, refreshed from production at suitable intervals. Testing can be a small instance, with minimal resources.
The idea is that production is what users actually see and interact with, and we never break it. Staging is as similar to production as we need it to be to be confident that if we deploy a change to staging, and things work there, deploying the same change to production will also work. Testing is where we can deploy any changes, confident that if we break it, it doesn't matter. We need to be able to rebuild the testing instance without it being a big thing; the rebuild should ideally be fully automated. A rebuild may be necessary to undo a bad change. Testing is where we experiment, and sometimes experiments will fail badly: for example, changes to database schemas, or upgrades of PHP may break everything irreversibly. It needs to be safe to deploy changes to testing without having to be too careful.
If we have the resources, having many testing instances would be ideal. We could then have a separate testing instance for each change. When a change is merged into the master branch, all the testing instances for other changes should be rebuilt, with their changes rebased on top of the new master branch.
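A sketch of what rebuilding a per-change testing instance might involve (the repository path, image name, and deployment helper are placeholders; the real mechanism is still to be designed):

```python
# Hypothetical sketch of rebuilding one testing instance after master changes;
# paths, image names, and the deployment helper are placeholders.
import subprocess


def rebuild_testing_instance(change_branch: str, workdir: str) -> None:
    """Rebase the change on the new master, rebuild its image, and redeploy it."""
    subprocess.run(["git", "-C", workdir, "fetch", "origin"], check=True)
    subprocess.run(["git", "-C", workdir, "checkout", change_branch], check=True)
    subprocess.run(["git", "-C", workdir, "rebase", "origin/master"], check=True)
    image = f"docker-registry.example.org/service:{change_branch}"
    subprocess.run(["docker", "build", "--tag", image, workdir], check=True)
    subprocess.run(["docker", "push", image], check=True)
    # deploy_to_testing(change_branch, image)  # hypothetical deployment helper
```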