Streamlined Service Delivery Design/current-ci

From Wikitech

This page attempts to describe the Wikimedia Foundation Continuous Integration system, as of early 2019. Originally written by Lars Wirzenius from the Release Engineering team.

The WMF runs a CI service for the Wikipedia community to automatically test any changes to the software running the various Wikipedias and supporting infrastructure. This page documents my understanding of how the system works, so that others can see where I misunderstand and correct me. Later on, it may also serve as documentation for others.

Overview

CI components

The CI system has several main components:

  • Gerrit is the Git server, and implements the workflow to review code.
  • Zuul schedules builds and merges approved changes into the target branch.
  • Jenkins holds the definitions of the jobs that build and test software. It attaches to workers (known as slaves) running on WMCS.

Additionally, Phabricator is the ticketing system, but that does not directly affect CI.

The various components work together to enable community and WMF developers to improve the software and services running the various Wikipedia variants and other sites.

Gerrit - git and code review

Gerrit provides the Git server, where the canonical source code of most projects is stored. A developer starts a change by pushing commits to refs/for/<target branch>[/topic], where <target branch> is the branch the patch is for and /topic is an optional field that tags related changes, making them easier to find. An example usage:

   git push origin HEAD:refs/for/master

This causes Gerrit to create a change. A later push for the same triplet of (repository, branch, Change-Id field in commit message) does not generate a new change; instead, the commit is attached to the existing change as a new patch set. The change thus tracks successive versions of the patch.
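The triplet lookup can be sketched as a toy in-memory model. This is not Gerrit's implementation: the `ChangeStore` class, its `push` method, and the shortened Change-Id and commit values are hypothetical names and data used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    number: int
    patch_sets: list = field(default_factory=list)  # commit ids, oldest first


class ChangeStore:
    """Toy model: a change is keyed by (repository, branch, Change-Id)."""

    def __init__(self, first_number=1):
        self._changes = {}
        self._next_number = first_number

    def push(self, repository, branch, change_id, commit_sha):
        key = (repository, branch, change_id)
        change = self._changes.get(key)
        if change is None:
            # First push for this triplet: create a new change.
            change = Change(number=self._next_number)
            self._next_number += 1
            self._changes[key] = change
        # Every push adds one patch set to the change it resolves to.
        change.patch_sets.append(commit_sha)
        return change


store = ChangeStore()
# Illustrative (made-up) Change-Id and commit ids.
first = store.push("test/gerrit-ping", "master", "Iaaaa", "commit-1")
second = store.push("test/gerrit-ping", "master", "Iaaaa", "commit-2")
other = store.push("test/gerrit-ping", "stable", "Iaaaa", "commit-1")
```

Here `first` and `second` are the same change holding two patch sets, while the push to a different branch yields a separate change.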

The creation of a patch set (as well as comments, reference updates, abandoned changes, etc.) can be streamed as JSON events (gerrit stream-events, which requires a specific permission). For example, the creation of https://gerrit.wikimedia.org/r/#/c/test/gerrit-ping/+/482886 generated several events, such as this one for the creation of the patchset:

{
    "uploader": {"name":"Hashar","email":"hashar@free.fr","username":"hashar"},
    "patchSet":{
        "number":1,"revision":"c7cedd024b444219d4fc63a5210534dbec2771bb",
        "parents":["b997259c77abcf239829d658b3501514fa909db0"],
        "ref":"refs/changes/86/482886/1",
        "uploader":{"name":"Hashar","email":"hashar@free.fr","username":"hashar"},
        "createdOn":1546979984,
        "author":{"name":"Hashar","email":"hashar@free.fr","username":"hashar"},
        "kind":"REWORK","sizeInsertions":0,"sizeDeletions":0
    },
    "change":{
        "project":"test/gerrit-ping",
        "branch":"master",
        "id":"I087c761076d9e34ac97412f3a281c464fdec753a",
        "number":482886,
        "subject":"Demo change for documentation",
        "owner":{"name":"Hashar","email":"hashar@free.fr","username":"hashar"},
        "url":"https://gerrit.wikimedia.org/r/482886",
        "commitMessage":"Demo change for documentation\n\nChange-Id: I087c761076d9e34ac97412f3a281c464fdec753a\n",
        "createdOn":1546979984,
        "status":"NEW"
    },
    "project":"test/gerrit-ping",
    "refName":"refs/heads/master",
    "changeKey":{"id":"I087c761076d9e34ac97412f3a281c464fdec753a"},
    "type":"patchset-created",
    "eventCreatedOn":1546979984
}

Those events are consumed by a workflow system: Zuul.
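A consumer of the event stream mostly needs a handful of fields from each event. The sketch below parses one JSON event and extracts them. The function name `summarize_event` and the choice of fields are assumptions made for illustration; the field names themselves are taken from the event shown above.

```python
import json


def summarize_event(line):
    """Parse one JSON event and return the fields a CI system would need."""
    event = json.loads(line)
    if event.get("type") != "patchset-created":
        return None  # this sketch only handles new patch sets
    change = event["change"]
    patch_set = event["patchSet"]
    return {
        "project": change["project"],
        "branch": change["branch"],
        "change": change["number"],
        "patchset": patch_set["number"],
        "ref": patch_set["ref"],
        "revision": patch_set["revision"],
    }


# A trimmed copy of the event shown above.
sample = json.dumps({
    "type": "patchset-created",
    "change": {
        "project": "test/gerrit-ping",
        "branch": "master",
        "number": 482886,
    },
    "patchSet": {
        "number": 1,
        "ref": "refs/changes/86/482886/1",
        "revision": "c7cedd024b444219d4fc63a5210534dbec2771bb",
    },
})

summary = summarize_event(sample)
print(summary["project"], summary["ref"])
```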

Gerrit also keeps track of code review by humans, by recording -2/-1/0/+1/+2 votes.

Zuul - gating and process management

Zuul is split into a "scheduler" process and "merger" workers. Gerrit emits an event for every change in state, such as a new change being uploaded, a new patch set, or a code review vote. The Zuul scheduler listens for these events and is configured to perform a suitable action for each event, or to ignore specific events. The configuration lives in the `integration/config.git` repository.
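The idea of mapping each event to an action, or ignoring it, can be sketched in a few lines. This is only an analogy: the real configuration is declarative and lives in integration/config.git, not in Python. The pipeline names `test` and `gate-and-submit`, the function name `choose_action`, and the layout of the `approvals` list are assumptions made for this sketch.

```python
def choose_action(event):
    """Map a Gerrit event to a (hypothetical) pipeline name, or None to ignore it."""
    kind = event.get("type")
    if kind == "patchset-created":
        return "test"
    if kind == "comment-added":
        # Assumed layout: a list of votes, each with a label type and a value.
        for approval in event.get("approvals", []):
            is_code_review = approval.get("type") == "Code-Review"
            if is_code_review and str(approval.get("value")) == "2":
                return "gate-and-submit"
    return None  # every other event is ignored


new_patch = choose_action({"type": "patchset-created"})
approved = choose_action({
    "type": "comment-added",
    "approvals": [{"type": "Code-Review", "value": "2"}],
})
plain_comment = choose_action({"type": "comment-added"})
```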

When a change is uploaded, the Gerrit event causes Zuul to trigger jobs that check whether the change has any chance of being acceptable. A change that causes the project to fail to build, or to fail its own automated tests, will not be accepted. Once all jobs have completed, Zuul determines the build set result (either a success or a failure) and reports back to Gerrit with a comment. The Gerrit label Verified is reserved for the CI system: Zuul votes Verified +2 when a build set succeeds and Verified -1 otherwise. The Gerrit workflow is configured to require Verified +2 for a change to be submittable (and thus potentially merged).
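The voting rule can be written down as a small function. This is a sketch of the rule as described here, not Zuul's code. The function name `verified_vote`, the `SUCCESS`/`FAILURE` result strings, and the example job names are assumptions; rejecting an empty build set is a design choice of this sketch, made so that "no jobs ran" is never reported as a success.

```python
def verified_vote(job_results):
    """Return the Verified vote for a build set: +2 only if every job succeeded."""
    if not job_results:
        raise ValueError("a build set needs at least one job result")
    all_passed = all(result == "SUCCESS" for result in job_results.values())
    return 2 if all_passed else -1


# Hypothetical job names, for illustration only.
passing = verified_vote({"lint": "SUCCESS", "unit-tests": "SUCCESS"})
failing = verified_vote({"lint": "SUCCESS", "unit-tests": "FAILURE"})
```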

Note that code review votes are distinct from verification votes. Verification votes happen automatically by CI; code review votes require humans.

Zuul tells the "downstream" parts of CI to do things (build things, run tests), and merges changes once they have been approved. Approval is a Code-Review +2 vote, which is configured to trigger a build set; if that build set succeeds, Zuul submits the change in Gerrit.

The scheduler asks Gearman to actually do things. The scheduler itself does not do things directly.

The Zuul scheduler first runs the merger:merge Gearman function, which is executed by the Zuul merger workers. A merger clones or refreshes the repository from Gerrit and tentatively merges the patchset against the tip of the target branch. The resulting merge commit has a reference attached, which is reported back as the function result. The Zuul scheduler then tells Gearman to run the CI jobs and passes them build parameters containing the Zuul merger git repository URL, the project name, the commit SHA-1, the merge commit reference, etc. Those parameters are prefixed with ZUUL (e.g. ZUUL_PROJECT=mediawiki/core ZUUL_URL=[[1]] ). The SHA-1 never changes, unlike (say) git branches and tags. If the tentative merge fails, the Zuul scheduler fails the action and never triggers any job.
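On the job side, these parameters are enough to locate the tentative merge. The sketch below builds a git fetch command from them. It is an illustration under assumptions: the parameter name ZUUL_REF for the merge commit reference, the helper name `fetch_command`, and the placeholder URL and reference values are not taken from this page.

```python
def fetch_command(env):
    """Build the git command a job could use to fetch the tentative merge."""
    required = ("ZUUL_URL", "ZUUL_PROJECT", "ZUUL_REF")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing build parameters: " + ", ".join(missing))
    # The merger URL plus the project name gives the repository to fetch from.
    repository = env["ZUUL_URL"].rstrip("/") + "/" + env["ZUUL_PROJECT"]
    return ["git", "fetch", repository, env["ZUUL_REF"]]


# Placeholder values; a real job would read them from os.environ.
params = {
    "ZUUL_URL": "https://merger.example.org/p/",
    "ZUUL_PROJECT": "mediawiki/core",
    "ZUUL_REF": "refs/zuul/master/Zexample",
}
command = fetch_command(params)
print(" ".join(command))
```

Fetching by reference and then checking out the reported commit means a job tests exactly the merge result that the merger produced, even if the branch moves in the meantime.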

Once the Zuul scheduler has received results for all jobs, the configuration determines a series of actions to carry out (known as reporters). For example, a Code-Review +2 event is handled in a way that causes Zuul to report the job results to Gerrit and ask Gerrit to submit the change. Gerrit then attempts to merge the change. For a new patchset, Zuul is configured to report the results of the CI jobs as well, with a vote on the Verified label (-1 for a failure, +2 for a success).

Gearman

Gearman is a lightweight framework to distribute units of work across several workers. It was originally written by Brad Fitzpatrick, the same author who wrote memcached for LiveJournal (which helped Wikipedia back in the mid-2000s).

Workers register the units of work they can perform with the Gearman server; each unit is known as a function. A client can then connect to the server, trigger those functions, and wait for the results to be reported.

Jenkins exposes all of its jobs and slaves to Gearman as functions. Zuul merely asks the Gearman server to trigger such and such functions; Jenkins, being the Gearman worker, handles the function request and schedules a build on one of the slaves.
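The register-then-trigger pattern can be illustrated with a toy, synchronous stand-in for a job server. It does not use the Gearman protocol or any Gearman library, and it simplifies heavily: a real server queues jobs and workers fetch them over the network. The class `ToyJobServer`, the function name `build:example-job`, and the worker names are hypothetical.

```python
class ToyJobServer:
    """In-memory stand-in: workers register functions, clients trigger them."""

    def __init__(self):
        self._functions = {}  # function name -> list of worker callables

    def register(self, name, handler):
        self._functions.setdefault(name, []).append(handler)

    def submit(self, name, payload):
        handlers = self._functions.get(name)
        if not handlers:
            raise LookupError("no worker registered for " + name)
        # Rotate the list so successive requests are spread across workers.
        handler = handlers.pop(0)
        handlers.append(handler)
        return handler(payload)


server = ToyJobServer()
# Two workers offer the same function.
server.register("build:example-job", lambda p: ("worker-a", p["ZUUL_PROJECT"]))
server.register("build:example-job", lambda p: ("worker-b", p["ZUUL_PROJECT"]))

# The client only names the function; it does not choose the worker.
first = server.submit("build:example-job", {"ZUUL_PROJECT": "mediawiki/core"})
second = server.submit("build:example-job", {"ZUUL_PROJECT": "mediawiki/core"})
```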

The Gearman server we use is embedded in Zuul. It is based on https://pypi.org/project/gear/, a pure Python asynchronous implementation written by the same author as Zuul.

Jenkins

Jenkins actually runs tasks, which are implemented as snippets of shell code, Docker containers, and/or Groovy scripts. The jobs are specified in the integration/config.git repository using the Jenkins Job Builder tool (JJB). JJB is a YAML-based DSL for defining jobs. It supports templating, variable replacement, and generation of jobs from a matrix of parameters, and it creates or updates the jobs on the Jenkins master using the Jenkins REST API. Having the jobs defined in plain text files lets us keep them in version control and review modifications via Gerrit.
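Generating jobs from a matrix of parameters amounts to expanding one template over every combination of values. The sketch below shows that idea in plain Python. It is an analogy only, not JJB's implementation or its YAML syntax; the function `expand_jobs`, the template string, and the axis values are made up for illustration.

```python
from itertools import product


def expand_jobs(name_template, axes):
    """Yield one job definition per combination of axis values."""
    names = sorted(axes)  # fixed order, so the output is deterministic
    for values in product(*(axes[name] for name in names)):
        params = dict(zip(names, values))
        yield {"name": name_template.format(**params), "params": params}


jobs = list(expand_jobs(
    "{project}-{php}-test",
    {"project": ["alpha", "beta"], "php": ["php70", "php72"]},
))
for job in jobs:
    print(job["name"])
```

Two axes with two values each produce four jobs from a single template, which is what keeps large job sets manageable in text form.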

Our Jenkins has its web interface open to logged-in, suitably authorized users. We are expected not to change jobs via the web interface. Instead, all job changes must go via the integration/config.git repository, and through a code review process via Gerrit.

A Jenkins job can be instructed to capture artifacts, which are then transferred from the slave that ran the job up to the Jenkins master instance. They can then be exposed on the build result page of the web interface.

As of January 2019, we do not use Jenkins to build the CI Docker images. They are built on the Jenkins master using docker-pkg, and their definitions are held in integration/config.git under the ./dockerfiles directory. The Streamlined Service Delivery system does build Docker images and would upload them to the Wikimedia Docker registry.

Second git server?

Not sure if the Gerrit git server is used for the tentative merges, or if they go to a different git server. Help?

Antoine: gerrit2001.codfw.wmnet is a hot spare for Gerrit. None of the Gerrit instances do any tentative merges. Those are done by the zuul-merger processes running on contint1001 and contint2001; each exposes a merger:merge function to the Gearman server, which the Zuul scheduler uses early on when processing a patchset creation event.

The integration/config git repository

This git repository contains most of the configuration for CI: the Zuul scheduler configuration (what to do for each event, how to do it, and what to do next); the actual Jenkins jobs; etc. All changes to this repository should be reviewed and OK'd by CI admins.

The CI admins are release engineering team members as well as members from other teams with deep interest in CI (notably: legoktm, krinkle, addshore).

In theory, releng team members should not OK their own changes. In practice, integration/config is used in a way similar to operations/puppet: to provide an audit trail of configuration changes. For complicated or impactful changes, though, reviews are sought.

Links