Wikimedia Cloud Services team/EnhancementProposals/Toolforge Kubernetes component workflow improvements

From Wikitech

This document proposes some improvements on how we build and deploy various components to the Toolforge Kubernetes clusters.

Problem statement / Current process

Flowchart of the current process.

See the diagram on the right for how the current process works. It has several problems:

  • Updates to the docker container image may get applied by surprise (for example when rebooting a worker node)
  • There's no easy way to roll back
  • There's no easy way to know which version is running right now, or monitor that the latest version is indeed live
  • It's too easy to make a mistake and not notice it

Proposed new workflow

Flowchart of the proposed new process.

This proposal would include two major changes to this workflow.

CI builds of new container images and Helm charts

The first major proposed change involves the container build process, which is currently done manually. Under this proposal, a CI system (likely either GitLab CI or the new Toolforge Build Service) would build the container images and Helm charts automatically. The images and the Helm charts should be tagged with the hash of the Git commit that is being built.
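As a rough sketch of such a CI build step, the commit hash can be shortened and used as the image tag. The registry URL and image name below are hypothetical, the hash is hard-coded where CI would use `git rev-parse HEAD`, and the build/push commands are echoed so the sketch runs standalone:

```shell
#!/bin/sh
# Sketch of a CI build step that tags the image with the Git commit
# hash being built. Registry URL and image name are hypothetical.
set -eu

# CI would obtain this with `git rev-parse HEAD`; hard-coded here.
COMMIT_HASH=0123456789abcdef0123456789abcdef01234567
SHORT_HASH=$(printf '%s' "$COMMIT_HASH" | cut -c1-12)

IMAGE="harbor.example.org/toolforge/jobs-api:${SHORT_HASH}"

# Echoed so the sketch runs without a Docker daemon.
echo "would run: docker build -t ${IMAGE} ."
echo "would run: docker push ${IMAGE}"
```

Tagging by commit hash (rather than `latest`) is what makes the later "pin a specific version in Git" step possible.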

The new workflow requires setting up a Helm chart repository. Harbor includes one and is already being set up in Toolforge to support the buildpack / build service project, so we should likely use that.

Note: While dogfooding the buildpack based build service would be nice, we need to be very careful to not introduce any chicken-and-egg problems that could prevent us from shipping fixes to broken code.

Controlling live container image and chart versions in Git

The second change involves moving the Helmfile configuration files from the individual component repositories to a central repository (let's call that the deployment control repository). (TBD: one big helmfile or multiple small ones?) In the new repository, the helmfiles pin the Helm chart version and the image version to a specific Git commit hash; deploying a change then means updating that hash with a manual Git commit.
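A deployment in this model could look roughly like the sketch below: edit the pinned version in the control repository's helmfile and commit. The file layout, release name, and version scheme are hypothetical; a stand-in helmfile is generated so the sketch runs standalone:

```shell
#!/bin/sh
# Sketch of a deployment from the control repository: pin a new chart
# version by editing the helmfile. Names and layout are hypothetical.
set -eu

NEW_HASH=abc123def456

# Stand-in for the control repository's helmfile.
cat > helmfile.yaml <<'EOF'
releases:
  - name: jobs-api
    chart: toolforge/jobs-api
    version: 0.1.0+000000000000
EOF

# Deploying = updating the pinned hash; an operator or CI would then
# `git commit` the change and run `helmfile apply`.
sed -i "s/version: 0\.1\.0+.*/version: 0.1.0+${NEW_HASH}/" helmfile.yaml
grep 'version:' helmfile.yaml
```

Because the deployed version is just a line in a Git-tracked file, rolling back is reverting the commit, and `git log` answers "what is live right now".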

TBD: local development deployments? TBD: manual steps such as tls certs?

Once this system works well, a next improvement in automation would be to automatically deploy the changes to the deployment control repository. That is however explicitly out of scope for this proposal.

Deployment plan

Due to the complexity of the final product, the deployment should be done in stages so that we can realize at least partial benefits earlier.

Stage 0: Helm migration

  • Finish implementing the decision made in task T303931 to convert our components to use a standard Helm and deploy.sh based workflow

Stage A: Start using Git commit hashes for image versions

  • update the wmcs.toolforge.k8s.component.build cookbook to create Git commit hash based image tags
  • this still needs a separate commit or two to the same repository for each deployment, but gives us much better confidence in what we're rolling out and an easier way to revert problematic changes
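How the cookbook could derive such a tag can be sketched as follows. The example uses a throwaway Git repository so it is self-contained; the image name is hypothetical:

```shell
#!/bin/sh
# Sketch of deriving a commit-hash-based image tag, as the
# wmcs.toolforge.k8s.component.build cookbook could do. A throwaway
# repository stands in for the component repository being built.
set -eu

repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=ci@example.org -c user.name=ci \
    commit -q --allow-empty -m 'initial commit'

# Abbreviated commit hash of HEAD becomes the image tag.
TAG=$(git -C "$repo" rev-parse --short=12 HEAD)
IMAGE="docker-registry.tools.wmflabs.org/jobs-api:${TAG}"
echo "$IMAGE"
```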

Stage B: GitLab migration for the affected repositories

  • as a prerequisite for the CI-based automation

Stage C: Proper chart repository

  • CI to automatically upload new chart versions to a proper chart repository (like Harbor)
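The upload step could look roughly like the sketch below, assuming Harbor's OCI registry support and Helm 3.8+. The registry URL, project, and chart name are hypothetical, and the commands are echoed so the sketch runs standalone:

```shell
#!/bin/sh
# Sketch of a CI chart-upload step to an OCI-capable chart repository
# such as Harbor. Registry URL, project and chart name are hypothetical.
set -eu

SHORT_HASH=abc123def456
CHART_VERSION="0.1.0+${SHORT_HASH}"

# Echoed so the sketch runs without helm installed.
echo "would run: helm package ./jobs-api --version ${CHART_VERSION}"
echo "would run: helm push jobs-api-${CHART_VERSION}.tgz oci://harbor.example.org/toolforge"
```

Embedding the commit hash as SemVer build metadata keeps the chart version aligned with the image tag from the same commit.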

Stage D: Introduce the deployment control repository

Stage E: Automated container builds

  • build container images on push

Further improvements

Once this basic system is live, there are several smaller and larger improvements that could be made to it.

Monitoring

  • Adding alerting to ensure that the deployment control repository matches what's deployed on the clusters
  • Adding alerting to ensure that the components are up to date, that is, that all merged changes in the component repositories have been deployed
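The first alert amounts to a drift check, which can be sketched as below. Both values are hard-coded for illustration; a real check would parse the control repository's helmfile and query the cluster (e.g. the running Deployment's image tag via kubectl's jsonpath output):

```shell
#!/bin/sh
# Sketch of a drift check: compare the version pinned in the deployment
# control repository with what is actually running. Values hard-coded.
set -eu

PINNED=abc123def456      # would come from the control repo's helmfile
DEPLOYED=abc123def456    # would come from the cluster, e.g. kubectl

if [ "$PINNED" = "$DEPLOYED" ]; then
    STATUS=ok
else
    STATUS=drift
fi
echo "$STATUS"
```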

Automation

The section on the new deployment control repository touches on this briefly, but automated deployment is explicitly out of scope for this proposal.

Disk space clean-up

Once we have reliable information on which container versions are live, it would be nice to automatically prune unused versions from the container registry (after a reasonable grace period to allow rollbacks) to save disk space.