GitLab/Gitlab Runner/Platform Evaluation

Future Gitlab Runner setup (T286958)

This section contains the requirements and plan for a future GitLab Runner setup. The goal is to find a solution that matches our needs for the GitLab Runner infrastructure. GitLab Runner supports various platforms, such as Kubernetes, Docker, OpenShift or plain Linux VMs. Furthermore, a wide range of compute environments can be leveraged, such as WMCS, Ganeti, bare metal hosts or public clouds. This section therefore compares the different options and collects their advantages and disadvantages. Privacy considerations are covered in a separate section below.

GitLab Runner platform

GitLab Runner can be installed:

in Linux (using official packages)
in a container
in Kubernetes (as a Helm chart or agent)
in OpenShift
in various other environments

The following table compares the most important GitLab Runner platforms available (see Install GitLab Runner):


Platform	Advantages	Disadvantages	Additional considerations
Linux	Easy to set up; low maintenance	Low elasticity, difficult to scale; no separation of jobs^[1]	Separation between jobs is low, which could lead to security and privacy issues
Container	Easy to set up; low maintenance; separation of jobs by containers; similar to current solution	Difficult to scale; auto scaling needs `docker-machine`
Kubernetes	High elasticity/auto scaling; separation of jobs by containers	Additional Kubernetes cluster needed (for security); additional cluster needs maintenance; more difficult to set up	Could be used to strengthen Kubernetes knowledge; auto scaling needs an elastic compute platform; maybe a general-purpose non-production cluster can be built?

Compute Environments

The following table compares the four main computing options for the GitLab Runner setup: WMCS, Ganeti, Bare Metal or Public Cloud.


Environment	Advantages	Disadvantages	Additional considerations
WMCS	Somewhat elastic; Kubernetes auto scaling can leverage OpenStack(?)	Only in Eqiad; not a fully trusted environment	Elasticity is bound to appropriate quotas; Kubernetes on OpenStack is new and different from existing Kubernetes solutions
Ganeti	Trusted environment	Medium elasticity; no ephemeral VMs; maintenance overhead; additional security measures needed
Bare metal	Trusted environment; similar environment to existing Kubernetes setups	Low elasticity; machines have to be ordered and racked; maintenance overhead; additional security measures needed	Could old/decommissioned machines be used as runners?
Public Cloud (e.g. GCP)	High elasticity; low maintenance; easy Kubernetes setup (e.g. GKE)	Untrusted environment (see privacy section); dependency on cloud provider	Discussion about public cloud usage is needed; evaluation of privacy considerations is needed (see below)

Elastic demand

Typically, the demand for computing resources for building code and running tests is not constant. CI usage peaks around African, European and US time zones and on workdays (see Grafana dashboard and dashboard). The ideal solution would therefore adapt to this usage and scale computing resources up and down. This would maximize the utilization of resources and cover usage peaks. However, this elasticity comes at a cost: in general, dynamic provisioning of Runners is more complex than static provisioning. Currently, internal compute environments (such as Ganeti or bare metal) have limited elasticity, while WMCS is somewhat elastic. So if high elasticity is needed, we have to consider using external providers like GKE, which opens the discussion about privacy (see the privacy section below) and about being independent from external parties.

We assume that elasticity won't have a major impact on costs in our current environment. More importantly, elasticity could help to serve usage peaks and keep the total pipeline latency low, thus increasing developer productivity. A similar effect could be achieved by simply over-provisioning the runner infrastructure.

So even if an elastic Runner setup would be the better technical solution, we have to ask whether we really need high elasticity now.
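The trade-off between over-provisioning and elasticity can be made concrete with a small calculation. The following is a minimal sketch, not a measurement: the hourly demand figures are hypothetical placeholders (real numbers would have to come from the Grafana dashboards), and the "elastic" case assumes idealized, instantaneous scaling.

```python
from typing import Sequence


def utilization(demand: Sequence[int], capacity: Sequence[int]) -> float:
    """Fraction of provisioned runner slots that actually ran jobs."""
    used = sum(min(d, c) for d, c in zip(demand, capacity))
    provisioned = sum(capacity)
    return used / provisioned if provisioned else 0.0


def unmet_demand(demand: Sequence[int], capacity: Sequence[int]) -> int:
    """Number of job slots requested but not available (jobs that must queue)."""
    return sum(max(d - c, 0) for d, c in zip(demand, capacity))


# Hypothetical concurrent-job demand per time slot (NOT real data).
hourly_demand = [2, 2, 4, 10, 12, 8, 4, 2]

# Static fleet sized for the peak (over-provisioning).
static_capacity = [max(hourly_demand)] * len(hourly_demand)
# Idealized elastic fleet that always matches demand exactly.
elastic_capacity = list(hourly_demand)

static_util = utilization(hourly_demand, static_capacity)
elastic_util = utilization(hourly_demand, elastic_capacity)
# A smaller static fleet saves resources but makes jobs queue at peak times.
queued_with_small_fleet = unmet_demand(hourly_demand, [8] * len(hourly_demand))
```

With these placeholder numbers the peak-sized static fleet is idle more than half of the time, while a smaller static fleet leaves jobs queued during the peak. Whether that idle capacity matters depends on the cost assumptions discussed above.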

Privacy and trust considerations

Privacy is one of the core principles of WMF. So if public clouds are used, we have to make sure this usage aligns with our privacy policy and doesn't cause any security risks.

We have to think about what data is transmitted to public clouds during builds and tests. Do we include secrets, passwords or private user data when running a job? Do we need a special policy for CI variables and secrets? Do we consider this data leaked/compromised when transmitted to public cloud machines even when encrypted/restricted machines are used? We also have to think about how to secure the artifacts and test results of jobs running in public clouds. How do we implement trust? How do we check if artifacts (images, compiled code) or test results weren't compromised?

The safest and easiest approach would be to implement two different Runner environments, one for untrusted builds and one for trusted builds. This solution was proposed by ServiceOps^[2].

In GitLab terms this would mean hosting Shared Runners for all untrusted projects and builds. These Shared Runners could be hosted in WMCS or a public cloud and, if possible, not inside the production network due to security considerations. Furthermore, Specific Runners could be installed in a trusted environment and assigned to specific projects. It is also possible to use these Specific Runners only for specific branches and tags, see Protected Runners.

Monitoring of performance and usage

GitLab Runner supports exporting Prometheus metrics. These metrics and some Grafana dashboards should give insights into performance and usage. See Monitoring GitLab Runner documentation.

However, the GitLab Runner exporter does not support authorization or HTTPS. So depending on where the Runners are hosted, an HTTPS proxy with authorization is required.

^[3]

We would like to collect job metrics as soon as possible to also benchmark Runners on different environments (WMCS vs. Ganeti vs. Public Cloud).
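As an illustration of how metrics behind such a proxy could be read for a quick comparison between environments, here is a minimal sketch. It assumes the proxy uses HTTP Basic authentication; the URL, the credentials and the metric name `example_jobs_running` are all hypothetical placeholders, and the parser ignores optional timestamps in the Prometheus text format. In practice a Prometheus server would do the scraping.

```python
import base64
import urllib.request


def build_metrics_request(url: str, user: str, password: str) -> urllib.request.Request:
    """Prepare a GET request carrying HTTP Basic credentials for the proxy."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines of the form 'name{labels} value'.

    Simplification: comment lines are skipped and trailing timestamps
    are not handled.
    """
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical exporter output (metric name and values are placeholders).
SAMPLE = """\
# HELP example_jobs_running Jobs currently running (hypothetical metric).
# TYPE example_jobs_running gauge
example_jobs_running{environment="wmcs"} 3
example_jobs_running{environment="ganeti"} 1
"""

# Hypothetical proxy URL and credentials; no network call is made here.
request = build_metrics_request(
    "https://runner-metrics.example.org/metrics", "ci", "secret"
)
samples = parse_metrics(SAMPLE)
```

Passing `request` to `urllib.request.urlopen` would perform the actual scrape. The parsed samples can then be grouped by the environment label to benchmark Runners against each other.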

Proposed future architecture

The following section describes the proposed architecture for GitLab Runner developed by ServiceOps. The architecture is open for discussion with other stakeholders.

The general architecture builds on a non-GitLab-specific architecture proposed by ServiceOps some time ago (see https://people.wikimedia.org/~oblivian/ci/ci-threat.pdf). The diagram on the right is the translation to a GitLab-focused architecture. The proposed setup consists of one production GitLab instance (which needs some additional configuration, see questions below), two different types of Shared GitLab Runners in untrusted environments and a set of Protected/Specific GitLab Runners in a trusted environment.

Shared GitLab Runners

Shared GitLab Runners (see Shared Runners for the implementation) are general purpose CI workers. They can be used in every project but can also be disabled for certain projects or groups. In the proposed architecture Shared Runners execute untrusted code from volunteers, developers and SREs, so these Runners are also considered untrusted.^[4]

The purpose of two different types of Shared Runners is to separate fully untrusted jobs from semi-trusted jobs. Fully untrusted jobs can come from private projects or forks from contributors. Shared Runners of this kind should be ephemeral by leveraging cloud resources like GCP. The other type of Shared Runner runs semi-trusted jobs, which come from projects of a certain GitLab group. These jobs don't require access to production credentials or infrastructure.

Proposed Runner configuration: Shared Runners

Proposed environment: In the beginning WMCS, long term GCP. Ephemeral Shared Runners on GCP.

Proposed platform: In the beginning Linux + Docker Executor (and use existing puppet code), long term Google Kubernetes Engine

Open topics:

Can we have dedicated artifact stores (and) for Shared Runners?
Should we start building a GKE Runner setup in parallel to the WMCS setup?
What priority do Ephemeral Shared Runners have?
What restrictions are needed for Ephemeral Shared Runners? (CI minutes, max runtime, forbid certain jobs)

Specific GitLab Runners

To build and deploy code to production environments, a trusted set of Runners is needed (see Trusted Runner for the implementation). Access to these Runners should be restricted and gated. For that purpose GitLab offers more specific CI workers, namely Group Runners and Specific Runners. Group Runners can be assigned on a per-group level, Specific Runners on a per-project level. Both types of Runners can be configured to run only jobs with certain tags (like 'production'). Special CI jobs (like building production code) need to define these tags as well.

Furthermore, it is possible to secure CI credentials and variables of these Runners by protecting the Runners. Protected Runners are only allowed to run jobs from protected branches. We also want to create multiple classes of Specific Runners depending on the access and secrets needed, to reduce the risk of secrets being leaked or access being abused. So for building Debian packages we might want to have a different set of Specific Runners than for building MediaWiki or Docker images.
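The interplay of tags and protection described above can be sketched as a small matching rule. This is a simplified model for illustration only, not GitLab's actual scheduling code: the class names, the `run_untagged` flag as modelled here, and the example Runners are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    name: str
    tags: frozenset = frozenset()
    protected: bool = False      # only accept jobs from protected refs
    run_untagged: bool = True    # accept jobs that define no tags


@dataclass(frozen=True)
class Job:
    name: str
    tags: frozenset = frozenset()
    ref_protected: bool = False  # job runs on a protected branch or tag


def can_run(runner: Runner, job: Job) -> bool:
    """Return True if the runner is allowed to pick up the job."""
    if runner.protected and not job.ref_protected:
        return False
    if not job.tags:
        return runner.run_untagged
    # The runner must carry every tag the job asks for.
    return job.tags <= runner.tags


# Hypothetical Runners following the proposed split.
shared = Runner("shared-wmcs")
trusted = Runner(
    "trusted-ganeti",
    tags=frozenset({"production"}),
    protected=True,
    run_untagged=False,
)

unit_tests = Job("unit-tests")
deploy = Job("deploy", tags=frozenset({"production"}), ref_protected=True)
deploy_from_fork = Job("deploy", tags=frozenset({"production"}), ref_protected=False)
```

In this model the untagged test job can only land on the Shared Runner, the tagged job on a protected branch can only land on the trusted Runner, and the same tagged job from an unprotected ref is picked up by neither.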

Proposed Runner configuration: Specific Runners

Proposed environment: In WMF datacenter/Ganeti and later bare metal

Proposed platform: Linux + Docker Executor (and use existing puppet code)

Instance-wide Shared Cloud Runners

In T297426#7742386 the need to run CI jobs for untrusted code changes was identified. So a set of instance-wide Shared Runners is needed (see Cloud Runners for the implementation). These Shared Runners execute unreviewed (meaning untrusted) code for private projects and forks. These Runners should be available for every project by default. Furthermore, the Runners should be ephemeral so that a compromise cannot persist and affect other jobs. To make sure resources are shared equally, these Runners should be subject to a quota.

With the above requirements it is clear that these Runners cannot live in the production environment or WMCS (at least not in the same WMCS project gitlab-runners). A public cloud such as AWS or GCP could offer features like high elasticity, ephemeral machines, quotas and proper separation from production. These Runners do not perform critical CI/CD jobs, so a dependency on a public cloud provider does not affect the ability to deploy production code. A termination of such a public cloud offering would disrupt private CI jobs, mostly from the community and volunteers, for a certain time.

To achieve ephemeral, separated and resource-limited Runners, the Docker or Kubernetes executor has to be used. The Kubernetes executor offers more flexibility regarding resource quotas and adding and removing ephemeral Runner hosts. So the goal should be to leverage a managed Kubernetes platform and use the Kubernetes executor.
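The quota requirement for these Runners could be expressed as a per-project budget of CI minutes combined with a maximum job runtime. The following is a minimal sketch of that bookkeeping under assumed limits. The class, its parameters and the numbers are hypothetical, and it does not describe how GitLab or a Kubernetes platform enforces quotas.

```python
from collections import defaultdict


class CiMinutesQuota:
    """Track CI minutes per project and derive a timeout for the next job."""

    def __init__(self, limit_minutes: float, max_job_minutes: float) -> None:
        self.limit_minutes = limit_minutes
        self.max_job_minutes = max_job_minutes
        self._used: dict[str, float] = defaultdict(float)

    def remaining(self, project: str) -> float:
        """Minutes the project may still consume in the current period."""
        return max(self.limit_minutes - self._used[project], 0.0)

    def job_timeout(self, project: str) -> float:
        """Timeout for the next job; 0 means the job must not start."""
        return min(self.max_job_minutes, self.remaining(project))

    def record(self, project: str, minutes: float) -> None:
        """Account for a finished job."""
        self._used[project] += minutes


# Hypothetical limits: 100 minutes per period, at most 60 minutes per job.
quota = CiMinutesQuota(limit_minutes=100, max_job_minutes=60)
first_timeout = quota.job_timeout("fork/example")
quota.record("fork/example", 70)
second_timeout = quota.job_timeout("fork/example")
```

A project that has used 70 of its 100 minutes would get its next job capped at the remaining 30 minutes, while other projects are unaffected.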

Proposed Runner configuration: Instance-wide Shared Cloud Runners

Proposed environment: public cloud offering

Proposed platform: managed Kubernetes cluster + Kubernetes Executor

Open question:

How do we make sure Instance-wide Shared Runners won't steal jobs from Shared Runners in WMCS?

Bring-your-own Runner

It should also be possible to register your own GitLab Runner for projects which live outside of the officially supported groups and which may have additional requirements.