Wikimedia Cloud Services team/EnhancementProposals/Decision record T302863 toolforge byoc

Origin task: phab:T302863

Date of the decision: 2022-03-29

People in the decision meeting:

  • David Caro
  • Andrew Bogott
  • Komla Sapaty
  • Bryan Davis
  • Vivian Rook
  • Nicholas Skaggs
  • Arturo Borrero

Decision taken

Option 3 was chosen.

Rationale

We don't want to enable BYOC because we don't think it would benefit the Toolforge service in the long run, and we prefer to avoid making users migrate twice.

For the purpose of easing GridEngine to Kubernetes migrations, we will wait for buildpacks to be ready. We will regard buildpacks as a requirement for deprecating or removing GridEngine from Toolforge.

Problem

As of this writing, one of the main reasons Toolforge tool developers keep using GridEngine instead of Kubernetes is that our current k8s setup doesn't support mixing runtime environments. For example, a tool that uses both Java and Python can only run on the grid. In Kubernetes we provide a concrete list of container images with fixed runtime environments (for example Python, Node.js, PHP or Java).

In the past, it was decided that a buildpack-based approach was the right solution to this problem. However, that project is technically challenging and complex, and requires a non-trivial amount of engineering work. As a result, the project is not ready yet and is not expected to be available until at least **TODO: when?**.

There is, however, another potential approach to unblock this situation in the short term: enable Bring Your Own Container (BYOC) while the buildpacks project is being completed. This means allowing Toolforge developers to create Kubernetes workloads using container images they have built themselves.

Some clarifications

Let's assume we have 3 categories of users in Toolforge:

1. non-engineer, basic users: they follow a tutorial to deploy a basic tool. They want easy abstractions and shortcuts to be able to perform complex tasks in a simple fashion. These users know only one programming language at most, and they don't know anything about containers, docker or kubernetes.

2. intermediate users: anyone between the previous category and the next.

3. engineer-level, advanced users: these users know more than one programming language, know some software engineering practices, and can follow an online tutorial to create a Docker container. They know (or could easily understand) what's inside Toolforge, and the basics of how Kubernetes works.

The BYOC feature is targeted at users in category 3. These are the users with the most complex tools in Toolforge, potentially running on the grid, that cannot move to Kubernetes because (for example) they mix multiple runtimes.

Users in this category have traditionally shown interest in BYOC for Toolforge.

Constraints and risks

The fact that we disallow BYOC is mostly documented in a single place, this wikitech page, which reads:

We restrict only running images from the Tools Docker registry, which is available publicly (and inside tools) at docker-registry.tools.wmflabs.org. This is for the following purposes:

1. Making it easy to enforce our Open Source Code only guideline
2. Make it easy to do security updates when necessary (just rebuild all the containers & redeploy)
3. Faster deploys, since this is in the same network (vs dockerhub, which is retrieved over the internet)
4. Access control is provided totally by us, less dependent on dockerhub
5. Provide required LDAP configuration, so tools running inside the container are properly integrated in the Toolforge environment

This is enforced with a K8S Admission Controller, called RegistryEnforcer. It enforces that all containers come from docker-registry.tools.wmflabs.org, including the Pause container. 
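
To make the restriction concrete, here is a minimal sketch of the kind of check such an admission controller performs. It is not the actual RegistryEnforcer code: the function names, the dict-based Pod manifest and the example image names are assumptions made for illustration only.

```python
# Illustrative sketch only, not the real RegistryEnforcer implementation.
ALLOWED_REGISTRY = "docker-registry.tools.wmflabs.org"


def image_is_allowed(image, registry=ALLOWED_REGISTRY):
    """Return True if the image reference is hosted on the given registry."""
    # Require the trailing slash so that a look-alike host such as
    # "docker-registry.tools.wmflabs.org.example.com" is not accepted.
    return image.startswith(registry + "/")


def rejected_images(pod):
    """List every image in a Pod manifest (as a dict) that is not allowed."""
    spec = pod.get("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    images = [container.get("image", "") for container in containers]
    return [image for image in images if not image_is_allowed(image)]


# Hypothetical Pod manifest with one allowed and one external image.
pod = {
    "spec": {
        "containers": [
            {"name": "web", "image": ALLOWED_REGISTRY + "/example-web:latest"},
            {"name": "cache", "image": "docker.io/library/redis:6"},
        ]
    }
}
print(rejected_images(pod))
```

A real admission webhook would wrap this decision in an AdmissionReview response and reject the Pod when the returned list is non-empty.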

Any decision taken in this topic should consider those five points.

In particular, one could argue that:

  1. We don't have any active scanning of software inside containers. Claiming that our users comply with the open-source-code-only policy because we control the base container image is a bit naive.
  2. We should review and discuss our current security maintenance practices for Toolforge. This is largely independent of any BYOC/buildpacks debate.
  3. Deployment speed is a good point, but mostly relevant for tools that redeploy constantly. If we detected this was a problem, we could open our existing Docker registry so tool users can cache their images there.
  4. It is not clear what access control means in this point, or what specific needs we have.
  5. The LDAP configuration is important, so if we enabled any form of BYOC we should provide clear instructions for our users to build their container images using a base layer of our own. Otherwise their tools may not work as expected.

Decision record

This page.

Options

Option 1

Enable BYOC. This enables a new workflow/use case in Toolforge.

The simplest implementation of this option consists of:

  • disabling our custom Kubernetes registry admission controller
  • creating some docs for our users on how to effectively benefit from the new feature
  • communicating with our users

What to do with BYOC if and when buildpacks are ready to go is left for a future decision process. In particular, enabling BYOC **does not** prevent the buildpack project from being completed/implemented.

Pros:

  • Less dependency on the grid.
  • Less dependency on NFS (users could just deploy their code in the container).
  • Easy to implement.

Cons:

  • Enabling a new feature may mean supporting it forever. If so, see option 2.

...

Option 2

Enable BYOC on a ***temporary basis***. This enables a new workflow/use case in Toolforge, but only until the buildpacks project is completed, with the sole purpose of helping people migrate their tools away from GridEngine into Kubernetes.

The simplest implementation of this option consists of:

  • disabling our custom Kubernetes registry admission controller
  • creating some docs for our users on how to effectively benefit from the new feature
  • clearly communicating with our users, with a focus on the temporary nature of the new feature

Pros:

  • Less dependency on the grid.
  • Easy to implement.

Cons:

  • Given its temporary nature, users may choose not to adopt the solution.

...

Option 3

Leave BYOC disabled (discard this request). Hope that the buildpack project completes soon.

Pros:

  • No changes are currently needed.
  • No extra maintenance needed now or in the future.
  • No extra resources needed now or in the future.
  • No need to do a double migration (from Grid Engine to BYOC, and then from BYOC to Toolforge build service).


Cons:

  • Users will have to migrate directly to the Toolforge build service instead, which means:
    • The deprecation of Grid Engine is blocked until the Toolforge build service is up and running.
    • The Toolforge build service will have to implement all the needed use cases before we can move all users to it.

Option 4

Enable BYOC only for a few selected users that request it. This is similar to some special Cloud VPS features that are enabled only for projects that can demonstrate the need.

Implementation would be as follows:

  • modify the registry admission controller and introduce support for reading a configmap with an allow list
  • if a tool namespace is present in the allow list, then allow arbitrary container registries
  • the WMCS team will review requests and update the configmap accordingly
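
The allow-list logic in the steps above could look roughly like the following sketch. It is only an illustration under stated assumptions: the configmap key `allowed-namespaces`, its newline-separated format, the function names and the example namespaces are all hypothetical, and this is not a proposal for the actual admission controller code.

```python
# Illustrative sketch only; the configmap layout and names are assumptions.
ALLOWED_REGISTRY = "docker-registry.tools.wmflabs.org"


def parse_allow_list(configmap_data, key="allowed-namespaces"):
    """Parse a newline-separated list of namespaces from configmap data."""
    raw = configmap_data.get(key, "")
    return {line.strip() for line in raw.splitlines() if line.strip()}


def admit(namespace, images, allow_list):
    """Decide whether a workload may be created in the given namespace."""
    if namespace in allow_list:
        # Allow-listed tool namespaces may pull from arbitrary registries.
        return True
    # Everyone else stays restricted to the Tools Docker registry.
    return all(image.startswith(ALLOWED_REGISTRY + "/") for image in images)


# Hypothetical configmap contents maintained by the WMCS team.
configmap_data = {"allowed-namespaces": "tool-example-a\ntool-example-b\n"}
allow_list = parse_allow_list(configmap_data)

print(admit("tool-example-a", ["docker.io/library/redis:6"], allow_list))
print(admit("tool-other", ["docker.io/library/redis:6"], allow_list))
```

Keeping the default branch identical to the current registry check means that tools outside the allow list would see no change in behaviour.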

Pros:

  • The impact of the feature is limited, since it is not enabled for arbitrary users.

Cons:

  • Means the WMCS team has to gatekeep this feature.