Wikimedia Cloud Services team/EnhancementProposals/Toolforge Kubernetes component workflow improvements
This document proposes improvements to how we build and deploy various components to the Toolforge Kubernetes clusters.
Problem statement / Current process
See the diagram on the right for how the current process works. It has several problems:
- Updates to the docker container image may get applied by surprise (for example when rebooting a worker node)
- There's no easy way to roll back
- There's no easy way to know which version is running right now, or monitor that the latest version is indeed live
- It's too easy to make a mistake and not notice it
Proposed new workflow
This proposal would include two major changes to this workflow.
CI builds of new container images and Helm charts
The first major proposed change involves the container build process, which is currently done manually. Under this proposal, a CI system (likely either GitLab or the new Toolforge Build Service) would build the container images and Helm charts automatically. The images and the Helm charts should be tagged with the hash of the Git commit that is being built.
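As a sketch of what such a CI build step might do (the registry URL, component name, and hash below are illustrative assumptions, not real Toolforge values):

```shell
# Hypothetical CI build step: derive the image tag from the Git commit hash.
GIT_HASH="abc1234"   # in a real CI job: GIT_HASH="$(git rev-parse --short HEAD)"
COMPONENT="jobs-api"
IMAGE_TAG="registry.example.org/toolforge/${COMPONENT}:${GIT_HASH}"
# The CI job would then build and push the image, roughly:
#   docker build -t "$IMAGE_TAG" .
#   docker push "$IMAGE_TAG"
echo "$IMAGE_TAG"
```

Because the tag is derived from the commit hash, the same commit always produces the same tag, which makes it easy to tell exactly which code a running image contains.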
The new workflow requires setting up a Helm chart repository. Harbor includes one, and a Harbor instance is already being set up in Toolforge to support the buildpack / build service project, so we should likely use that.
Note: While dogfooding the buildpack-based build service would be nice, we need to be very careful not to introduce any chicken-and-egg problems that could prevent us from shipping fixes to broken code.
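For illustration, publishing a commit-pinned chart to an OCI-capable registry such as Harbor could look roughly like this (the registry URL and the versioning scheme are assumptions):

```shell
# Hypothetical chart publish step: pin the chart version to the commit hash.
GIT_HASH="abc1234"
CHART_VERSION="0.1.0-${GIT_HASH}"
# With Helm 3.8+ the chart could be packaged and pushed as an OCI artifact:
#   helm package ./chart --version "$CHART_VERSION"
#   helm push "chart-${CHART_VERSION}.tgz" oci://harbor.example.org/toolforge-charts
echo "$CHART_VERSION"
```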
Controlling live container image and chart versions in Git
The second change involves moving the Helmfile configuration files from the individual component repositories to a central repository (let's call that the deployment control repository). (TBD: one big helmfile or multiple small ones?) In the new repository, the helmfiles pin the Helm chart version and the image version to a specific Git commit hash, which is then updated with a manual Git commit when deploying changes.
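A pinned entry in the deployment control repository might look something like the following Helmfile fragment (the component name, registry URL, and hash are all illustrative):

```yaml
# helmfile.yaml in the deployment control repository (illustrative names only)
releases:
  - name: jobs-api
    namespace: jobs-api
    chart: oci://harbor.example.org/toolforge-charts/jobs-api
    version: 0.1.0-abc1234      # chart version pinned to a Git commit hash
    values:
      - image:
          tag: abc1234          # image tag pinned to the same commit
```

Deploying a new version would then be a Git commit that bumps these two pins, giving a reviewable and revertable deployment history.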
TBD: local development deployments? TBD: manual steps such as TLS certs?
Once this system works well, a next improvement in automation would be to automatically deploy changes committed to the deployment control repository. That is, however, explicitly out of scope for this proposal.
Due to the complexity of the final product, the rollout should be done in stages so we can more easily achieve at least partial benefits.
Stage 0: Helm migration
- Finish implementing the decision made in task T303931 to convert our components to use a standard Helm and Helmfile based deployment workflow
Stage A: Start using Git commit hashes for image versions
- update the wmcs.toolforge.k8s.component.build cookbook to create Git commit hash based image tags
- deploying still needs a separate commit or two to the same repository, but this gives us much better confidence in what we're rolling out and a better way to revert problematic changes
Stage B: GitLab migration for the affected repositories
- as a pre-requisite for the CI-based automation
Stage C: Proper chart repository
- CI to automatically upload new chart versions to a proper chart repository (like Harbor)
Stage D: Introduce the deployment control repository
Stage E: Automated container builds
- build container images on push
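As a sketch, a build-on-push job in GitLab CI might look roughly like this (the registry URL and branch name are assumptions; `CI_PROJECT_NAME` and `CI_COMMIT_SHORT_SHA` are standard GitLab CI predefined variables):

```yaml
# Illustrative .gitlab-ci.yml fragment; not a final pipeline definition.
build-image:
  stage: build
  script:
    - docker build -t "registry.example.org/toolforge/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA}" .
    - docker push "registry.example.org/toolforge/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA}"
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
```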
Once this basic system is live, there are several smaller and larger improvements that could be made to it.
- Adding alerting to ensure that the deployment control repository matches what's deployed on the clusters
- Adding alerting to ensure that the components are up to date, that is, that all merges to the component repositories have been deployed
The section on the new control repository touches on this a bit, but it's explicitly out of scope for this proposal.
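The first alerting check above could, in essence, compare two tags; the values below are stand-ins, since in practice they would be parsed from the control repository's helmfile and read from the cluster:

```shell
# Hypothetical drift check between the pinned and the live image tag.
PINNED_TAG="abc1234"   # e.g. parsed from the deployment control repository
LIVE_TAG="abc1234"     # e.g. read with: kubectl get deploy ... -o jsonpath=...
if [ "$PINNED_TAG" = "$LIVE_TAG" ]; then
  echo "in sync"
else
  echo "DRIFT: pinned=$PINNED_TAG live=$LIVE_TAG"
fi
```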
Disk space clean-up
Once we have reliable information on which container versions are live, it'd be nice to automatically prune unused versions from the container registry (after a reasonable grace period to allow rollbacks) to save disk space.
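Selecting prune candidates could be as simple as keeping the live tag and flagging the rest; the tag values here are made up, and a real implementation would query the registry API and also honor the grace period:

```shell
# Sketch of prune-candidate selection (illustrative tags only).
LIVE_TAG="abc1234"
ALL_TAGS="abc1234 def5678 0123abc"   # in reality: listed via the registry API
CANDIDATES=""
for tag in $ALL_TAGS; do
  # Never delete the tag that is currently deployed.
  if [ "$tag" != "$LIVE_TAG" ]; then
    CANDIDATES="$CANDIDATES $tag"
  fi
done
echo "prune candidates:$CANDIDATES"
```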