Portal:Toolforge/Admin/Kubernetes
Kubernetes (often abbreviated k8s) is an open-source system for automating the deployment and management of applications running in containers. Kubernetes was selected in 2015 by the Cloud Services team as the replacement for Grid Engine in the Toolforge project.[1] Usage of k8s in Tools began in mid-2016,[2] and the current cluster design dates back to early 2020.[3]
Subpages
- 2020 Kubernetes cluster rebuild plan notes
- Certificates
- Docker-registry
- Etcd (deprecated)
- Labels
- Networking and ingress
- New cluster
- Pod tracing
- RBAC and Pod security
- RBAC and Pod security/PSP migration
- Upgrading Kubernetes
- Upgrading Kubernetes/1.21 to 1.22 notes
- Upgrading Kubernetes/1.22 to 1.23 notes
- Upgrading Kubernetes/1.24 to 1.25 notes
- Upgrading Kubernetes/1.25 to 1.26 notes
- Upgrading Kubernetes/1.26 to 1.27 notes
- foxtrot-ldap
- lima-kilo
About this document
This document describes the Kubernetes cluster used in Toolforge and its direct support services (e.g. etcd). It does not cover specifics about services running in the cluster (e.g. the Jobs framework and build service), nor does it cover Toolforge services that are fully unrelated to the Kubernetes cluster (e.g. Redis).
The four main sections of this document correspond to the four categories of documentation in The Grand Unified Theory of Documentation system, in a structure inspired by how the Tor Project Admins do it.
Tutorial
Access kubectl
kubectl is the official Kubernetes command line interface tool. Assuming you are listed as a maintainer of the admin tool (or the toolsbeta equivalent) you will automatically have superuser credentials provisioned in your NFS home directory.
To use the CLI tool, log in to a bastion host on the project where the cluster you want to interact with is located. If you want to just experiment, you should use the toolsbeta cluster for that. Most read-only commands can be used out of the box, for example to list pods in the tool-fourohfour namespace used by the 404 handler:
$ kubectl get pod -n tool-fourohfour
NAME READY STATUS RESTARTS AGE
fourohfour-7766466794-gtpgk 1/1 Running 0 7d20h
fourohfour-7766466794-qctt8 1/1 Running 0 6d18h
However, all write actions and some read-only actions (e.g. interacting with nodes or secrets) will give you a permission error:
$ kubectl delete pod -n tool-fourohfour fourohfour-7766466794-gtpgk
Error from server (Forbidden): pods "fourohfour-7766466794-gtpgk" is forbidden: User "taavi" cannot delete resource "pods" in API group "" in the namespace "tool-fourohfour"
If you're sure you want to continue, you need to use kubectl sudo:
$ kubectl sudo delete pod -n tool-fourohfour fourohfour-7766466794-gtpgk
pod "fourohfour-7766466794-gtpgk" deleted
kubectl sudo, as the name implies, really has full access to the entire cluster. You should only use it when you need to do something that your normal account does not have access to.
Manage pods
Pods are the basic unit of compute in Kubernetes. A pod consists of one or more OS-level containers that share a network namespace.
List pods
Pods can be listed with the kubectl get pod command. Log in to a toolsbeta bastion, become fourohfour and run:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fourohfour-bd4ffc5ff-479sj 1/1 Running 0 43s
fourohfour-bd4ffc5ff-4lhcf 1/1 Running 0 35s
The -o (--output) flag can be used to customize the output. For example, -o wide will display more information:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
fourohfour-bd4ffc5ff-479sj 1/1 Running 0 91s 192.168.120.158 toolsbeta-test-k8s-worker-nfs-1 <none> <none>
fourohfour-bd4ffc5ff-4lhcf 1/1 Running 0 83s 192.168.145.16 toolsbeta-test-k8s-worker-nfs-2 <none> <none>
Or -o json will display the data in JSON:
$ kubectl get pods -o json | head -n5
{
"apiVersion": "v1",
"items": [
{
"apiVersion": "v1",
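When only a few fields are needed, kubectl's JSONPath output can be an alternative to reading the full JSON. A minimal sketch, where pod_names is a hypothetical helper name and not something provided on the bastions:

```shell
# Print only the pod names, one per line, using kubectl's JSONPath output.
# Extra arguments (for example a namespace flag) are passed through as-is.
# pod_names is a hypothetical helper name used for illustration.
pod_names() {
    kubectl get pods "$@" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Usage, from a tool account:
#   pod_names
```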
So far we have only been accessing data in the namespace we are in. To access data in any namespace, we need to switch back to our user account. Now we can use the -n (--namespace) flag to specify which namespace to access.
$ kubectl get pod -n tool-fourohfour
NAME READY STATUS RESTARTS AGE
fourohfour-bd4ffc5ff-479sj 1/1 Running 0 4m27s
fourohfour-bd4ffc5ff-4lhcf 1/1 Running 0 4m19s
$ kubectl get pod -n tool-admin -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
admin-cb6d84bd8-pshh7 1/1 Running 0 5d21h 192.168.25.21 toolsbeta-test-k8s-worker-nfs-4 <none> <none>
Or we can use -A (--all-namespaces) to list data in the entire cluster:
$ kubectl get pod -A | head -n5
NAMESPACE NAME READY STATUS RESTARTS AGE
api-gateway api-gateway-nginx-6ddddd6f64-mbnlg 1/1 Running 0 12d
api-gateway api-gateway-nginx-6ddddd6f64-tdl6c 1/1 Running 0 8d
builds-admission builds-admission-7897cf7759-jtxb5 1/1 Running 0 28h
builds-admission builds-admission-7897cf7759-nvmzt 1/1 Running 0 26h
View logs for pod
To view the combined standard output and standard error for a pod, use kubectl logs:
$ kubectl get pod -n maintain-kubeusers
NAME READY STATUS RESTARTS AGE
maintain-kubeusers-55b649885c-px8c6 1/1 Running 0 87m
$ kubectl logs -n maintain-kubeusers maintain-kubeusers-55b649885c-px8c6 | wc -l
176
Some useful flags for this command are:
- --tail NUMBER to only show the specified number of most recent lines
- --follow to do, well, exactly what it says
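The two flags can be combined. A small sketch, assuming you want the most recent lines first and then a live stream (tail_pod_logs is a hypothetical helper name):

```shell
# Show the last N lines (default 20) of a pod's logs, then keep streaming
# new lines until interrupted with Ctrl-C.
# tail_pod_logs is a hypothetical helper name used for illustration.
tail_pod_logs() {
    namespace=$1
    pod=$2
    lines=${3:-20}
    kubectl logs -n "$namespace" "$pod" --tail="$lines" --follow
}

# Usage:
#   tail_pod_logs maintain-kubeusers maintain-kubeusers-55b649885c-px8c6 50
```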
Restart a pod
Manage workers
The main Toolforge cluster consists of a bit over 50 "normal" NFS-enabled workers, and some special workers used for specific purposes. These workers can be added and removed using cookbooks. Adding and removing a node are both fairly straightforward, but because replacing the entire cluster takes a long time, we prefer to update existing nodes in place during most routine maintenance (e.g. Kubernetes upgrades or node reboots). It is however totally fine to replace nodes in Toolsbeta if you want to try the process.
Add a worker
These cookbooks can be run from the cloudcumin hosts (recommended) or from your laptop if you have them set up locally. Use of screen or tmux is recommended.
To create a normal worker_nfs in toolsbeta, use:
$ sudo cookbook wmcs.toolforge.add_k8s_node --cluster-name toolsbeta --role worker_nfs
Remove a worker
Removing a worker is equally straightforward. To remove the oldest worker_nfs node in toolsbeta, use:
$ sudo cookbook wmcs.toolforge.remove_k8s_node --cluster-name toolsbeta --role worker_nfs
If you have a specific node that you want to remove, pass that as a parameter:
$ sudo cookbook wmcs.toolforge.remove_k8s_node --cluster-name toolsbeta --role worker_nfs --hostname-to-remove toolsbeta-test-k8s-worker-nfs-1
Drain and undrain a node
Sometimes a node is misbehaving or needs maintenance, and all of its workloads need to be drained. This is most easily done with the cookbook:
$ sudo cookbook wmcs.toolforge.k8s.worker.drain --cluster-name toolsbeta --hostname-to-drain toolsbeta-test-k8s-worker-nfs-1
To "uncordon" (allow new pods to be scheduled to it again) the node, run the following on a bastion in the relevant project:
$ kubectl sudo uncordon toolsbeta-test-k8s-worker-nfs-1
node/toolsbeta-test-k8s-worker-nfs-1 uncordoned
You can also just "cordon" a node which will prevent new workloads from being scheduled but won't drain existing ones:
$ kubectl sudo cordon toolsbeta-test-k8s-worker-nfs-1
node/toolsbeta-test-k8s-worker-nfs-1 cordoned
That is also reversed with the uncordon command.
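To check which nodes are currently cordoned, the STATUS column of the node list can be filtered, since kubectl reports cordoned nodes as SchedulingDisabled. A minimal sketch (cordoned_nodes is a hypothetical helper name; kubectl sudo is used because, as noted in the tutorial, interacting with nodes is not allowed with the normal credentials):

```shell
# Print the names of nodes that are currently cordoned. kubectl shows
# such nodes with "SchedulingDisabled" in the STATUS (second) column.
# cordoned_nodes is a hypothetical helper name used for illustration.
cordoned_nodes() {
    kubectl sudo get nodes --no-headers \
        | awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage, on a bastion in the relevant project:
#   cordoned_nodes
```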
How-to
Cluster management
Build a new cluster
We have not built a new cluster since the 2020 cluster redesign. The documentation written during the 2020 redesign is at Portal:Toolforge/Admin/Kubernetes/Deploying, although it is likely somewhat outdated.
Upgrade Kubernetes
Kubernetes upstream releases new versions about three times a year.[4] We cannot skip any upgrades and thus must upgrade sequentially. This process is documented at Portal:Toolforge/Admin/Kubernetes/Upgrading Kubernetes.
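As an illustration of what upgrading sequentially means in practice, the following sketch lists every minor version that has to be visited between two releases. It assumes the major version stays the same, and upgrade_path is a hypothetical helper name, not part of any cookbook:

```shell
# Print, in order, every minor release that must be installed when going
# from one Kubernetes minor version to another (same major version).
# upgrade_path is a hypothetical helper name used for illustration.
upgrade_path() {
    from=$1
    to=$2
    major=${from%%.*}   # "1.24" -> "1"
    minor=${from#*.}    # "1.24" -> "24"
    target=${to#*.}     # "1.27" -> "27"
    while [ "$minor" -lt "$target" ]; do
        minor=$((minor + 1))
        echo "$major.$minor"
    done
}

# Usage:
#   upgrade_path 1.24 1.27   # prints 1.25, 1.26 and 1.27 on separate lines
```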
Upgrade Calico
Upgrade ingress-nginx
Upgrade worker operating system
We have upgraded the cluster OS once, from Buster to Bookworm, and at the same time changed the container runtime from Docker to containerd.[5] There is no set process or specific automation for this, but the approach taken last time was:
- Pick which Debian release you're going to upgrade to
- Ensure the container runtime version in that release is supported by Kubernetes, Calico and cadvisor
- Import kubeadm packages for the new Debian release
- Add a new worker in toolsbeta
- Test carefully that it works
- Do this for all types to test out all configuration combinations (with/without NFS, with/without extra volume)
- Remove matching number of old workers
- Replace a control node in toolsbeta
- Add a few new nodes in tools
- Wait a few days
- Replace all tools workers
- In parallel, replace remaining toolsbeta workers
- Replace tools controls
Roll reboot cluster
The wmcs.toolforge.k8s.reboot cookbook can be used to reboot the entire cluster, for example to apply kernel or container runtime updates, or in case the NFS server is having issues. Start by reading the --help output for the cookbook. For example, in the NFS issue case in toolsbeta, you could run:
$ sudo cookbook wmcs.toolforge.k8s.reboot --cluster-name toolsbeta --all-workers
etcd
Add etcd node
Remove etcd nodes
Upgrade etcd
We run etcd from the Debian packages, so an etcd upgrade is automatically a Debian upgrade and vice versa.
We have not upgraded etcd yet since the 2020 cluster redesign. This section should be filled when we do that for the first time.
Component system
In the Toolforge Kubernetes component workflow improvements enhancement proposal we introduced a standard "components" system for various components that run in the Kubernetes cluster. The system is documented in more detail in the toolforge-deploy.git README file.
Deploy new version
This process is described in more detail in the toolforge-deploy.git README file. But, in summary, to deploy a change to a toolforge-deploy managed component:
- Get an MR on toolforge-deploy.git with the version bump. For chart/image updates in components we develop, the MR is created automatically; in other cases you need to create it manually.
- Run the deployment cookbook for toolsbeta:
$ ssh cloudcumin1001.eqiad.wmnet
cloudcumin1001:~$ COMPONENT=builds-api # same as the directory name
cloudcumin1001:~$ sudo cookbook wmcs.toolforge.component.deploy --cluster-name toolsbeta --component $COMPONENT --run-tests
- Deploy on tools:
cloudcumin1001:~$ COMPONENT=builds-api # same as the directory name
cloudcumin1001:~$ sudo cookbook wmcs.toolforge.component.deploy --cluster-name tools --component $COMPONENT --run-tests
- Merge the MR (if everything went well)
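The two cookbook runs above could be chained so that the tools deployment only happens if the toolsbeta run (including its tests) succeeded. This is only a sketch built from the commands shown above; deploy_component is a hypothetical wrapper name, and in practice you may prefer to inspect toolsbeta by hand before continuing:

```shell
# Deploy a component to toolsbeta first and, only if that succeeds,
# to tools. Stops at the first failing cookbook run.
# deploy_component is a hypothetical wrapper name used for illustration.
deploy_component() {
    component=$1   # same as the directory name in toolforge-deploy.git
    for cluster in toolsbeta tools; do
        sudo cookbook wmcs.toolforge.component.deploy \
            --cluster-name "$cluster" \
            --component "$component" \
            --run-tests || return 1
    done
}

# Usage, on a cloudcumin host:
#   deploy_component builds-api
```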
Rollback a change
To roll back a change, revert the toolforge-deploy.git commit and then follow the deployment steps as usual.
Manage (tool) users
Modify quotas
Tool quotas are managed by maintain-kubeusers and configured in the values file in toolforge-deploy.git.[6] To update quotas for a specific tool:
- Send a patch to the values file changing the quotas. The format should be relatively self-explanatory, and the defaults and supported keys are listed in the default values file. Always change the version when making any kind of change or it will not be applied.
- Merge the patch to main and deploy it like any other component change.
Regenerate .kube/config
In case something goes wrong with the credentials for a certain tool user, you can delete the maintain-kubeusers configmap which will cause maintain-kubeusers to re-generate the credentials for that user. On a bastion in the relevant project, run:
$ kubectl sudo delete cm -n tool-$TOOL maintain-kubeusers
Please have a look at the logs for maintain-kubeusers and file a bug so the issue can be fixed.
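One way to look at those logs is to filter recent output for the affected tool. The sketch below rests on assumptions not confirmed by this page: that the daemon runs as a Deployment named maintain-kubeusers in the maintain-kubeusers namespace, and that its log lines mention the tool name. kubeusers_logs_for is a hypothetical helper name.

```shell
# Search recent maintain-kubeusers logs for lines mentioning a tool.
# Assumptions: a Deployment called maintain-kubeusers exists in the
# maintain-kubeusers namespace and its log lines include the tool name.
# kubeusers_logs_for is a hypothetical helper name used for illustration.
kubeusers_logs_for() {
    tool=$1
    kubectl logs -n maintain-kubeusers deployment/maintain-kubeusers --tail=500 \
        | grep -- "$tool"
}

# Usage:
#   kubeusers_logs_for fourohfour
```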
Enable observer access
Requests for observer access must be approved by the Toolforge admins in a Phabricator task. Once approved, they can be implemented on a control plane node with:
$ sudo -i wmcs-enable-cluster-monitor <tool-name>
Manage user workloads
Find newly added workloads
The Kubernetes capacity alert runbook documents how to find where a sudden increase in workload has come from.
Pod tracing
Because all tools running on a single worker share that worker's IP address, you occasionally need to figure out which tool on a given worker is misbehaving. That process is documented on Portal:Toolforge/Admin/Kubernetes/Pod tracing.
Update prebuilt images
This has been moved to the Jobs framework documentation.
Reference
Admission controllers
Repository | Related to functionality | Description |
---|---|---|
builds-admission | Build Service | Validate build service user-created pipelines |
envvars-admission | Envvars Service | Inject configured envvars to pods |
ingress-admission | Webservice | Validate created ingress objects use the domain allowed for that tool |
registry-admission | Jobs framework | Validate new pods use images in the Toolforge docker registry or Harbor |
volume-admission | Jobs framework | Inject NFS mounts to pods that are configured to have them |
Authentication, authorization, certificates and RBAC
cert-manager
External certificates
maintain-kubeusers
maintain-kubeusers is responsible for creating Kubernetes credentials and a namespace (tool-[tool name]) for each tool, and removing access for disabled tools. It is also in charge of maintaining quotas and other resources for each tool (like kyverno security policies, etc). In addition, it creates admin credentials for all maintainers of the admin tool.
The service is written as a long-running daemon, and it talks to LDAP directly for tool data. It exports Prometheus metrics, but those are not used for any alerts or dashboards at this moment.
Observer access
Some tools (e.g. k8s-status) need more access to the Kubernetes API than the default credentials provide. For these tools, an "observer" role has been created that grants read-only access to non-sensitive data about the cluster and the workloads that run on it.[7] The role is deployed from a file managed by Puppet (although phab:T328539 proposes moving it to maintain-kubeusers), and role bindings are created manually using a script.
Using observer status in a job with serviceAccountName: ${tool}-obs is not supported by the Jobs framework or webservice. The k8s-status tool uses a custom script for managing a web service with such access included.
Requests for such access should be approved by the Toolforge admins before access is granted.
Backups
The main thing worth backing up is the contents of the etcd cluster. It is not currently backed up.
Bastion nodes
The Toolforge bastion nodes have kubectl installed. As the bastion nodes have NFS mounts, and maintain-kubeusers provisions certificates to NFS, everything will then work out of the box.
Kubernetes design
The upstream Kubernetes documentation is both more detailed and more up-to-date than this page. Here is, however, a quick overview of the major Kubernetes components.
Control plane
etcd
Kubernetes stores all state in etcd; all other components are stateless. The etcd cluster is accessed directly only by the API server. Direct access to this etcd cluster is equivalent to root on the entire k8s cluster, so it is firewalled off to be reachable only by the control plane nodes and the etcd nodes themselves. It uses client certificate verification for authentication (with Puppet as the CA), and secrets are encrypted at rest in our etcd setup.
We currently use a 3 node cluster, hosted on VMs separate from the main control plane. They're all smallish Debian Buster instances configured largely by the same etcd puppet code we use in production. The main interesting thing about them is that they're localdisk instances as etcd is rather sensitive to iowait.
API server
The API server is the heart of the Kubernetes control plane. All communication between all components, whether they are internal system components or external user components, must go through the API server. It is purely a data access layer, containing no logic related to any of the actual end-functionality Kubernetes offers. It offers the following functionality:
- Authentication & Authorization
- Validation
- Read / Write access to all the API endpoints
- Watch functionality for endpoints, which notifies clients when state changes for a particular resource
When you are interacting with the Kubernetes API, this is the server that is serving your requests.
The API server runs as a static pod on the control plane nodes. It listens on port 6443/tcp, and all access from outside the Kubernetes cluster should go via HAProxy. Requests are authenticated with either tokens (mostly for internal usage) or client certificates signed via the certificates API.
controller-manager and scheduler
The controller-manager and scheduler contain most of the actual logic. The scheduler is responsible for assigning pods to nodes, and the controller-manager handles most other actions, for example launching CronJobs at scheduled times or ensuring ReplicaSets have the correct number of Pods running. The general idea is one of a 'reconciliation loop': poll/watch the API server for desired state and current state, then perform actions to make them match.
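The reconciliation idea can be illustrated with a toy example. The sketch below is a single pass of such a loop for a replica count; it only shows the compare-then-act pattern and is not how the real controller-manager is implemented (reconcile is a hypothetical function name):

```shell
# One pass of a toy reconciliation step: compare the desired and current
# replica counts and print the action a controller would take.
# reconcile is a hypothetical function name used for illustration.
reconcile() {
    desired=$1
    current=$2
    if [ "$current" -lt "$desired" ]; then
        echo "start $((desired - current)) pod(s)"
    elif [ "$current" -gt "$desired" ]; then
        echo "stop $((current - desired)) pod(s)"
    else
        echo "nothing to do"
    fi
}

# Usage:
#   reconcile 2 1   # prints: start 1 pod(s)
#   reconcile 2 2   # prints: nothing to do
```

A real controller runs this kind of comparison continuously, triggered by watch events from the API server rather than by explicit calls.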
Worker
The primary service running on each node is the Kubelet, which is an interface between the Kubernetes API and the container runtime (containerd in our case). Kubelet is responsible for ensuring the pods running on the node match with what the API server wants to run on that node, and reports back metrics to the API. It also proxies logs requests when necessary. Pod health checks are also done by the Kubelet.
In addition, there are two networking-related services running on each node:
- kube-proxy manages iptables NAT rules for Service addresses.
- The container network interface (or CNI, Calico in our cluster) manages the rest of the cluster networking. In practice this means an overlay network where each pod is assigned a cluster-internal IP address which can be routed across the entire cluster.
Labels
A reference of various used Kubernetes labels and their meanings is available on Portal:Toolforge/Admin/Kubernetes/Labels.
Monitoring and metrics
Alert runbooks
- Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity
- Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady
Kubernetes metrics stack
The Kubernetes cluster runs multiple pieces of software responsible for cluster monitoring:
- metrics-server (per-container and node metrics for Kubernetes internal use)
- cadvisor (per-container metrics for Prometheus)
- kube-state-metrics (cluster-level metrics for Prometheus)
These are all deployed via the wmcs-k8s-metrics component using the standard component deployment model.
Prometheus integration
Toolforge Prometheus servers scrape cadvisor, kube-state-metrics and Prometheus exporter endpoints in the apps that have them. For this, the Prometheus servers have an external API certificate provisioned via Puppet that needs to be renewed yearly. The scrape targets are defined in the profile::toolforge::prometheus Puppet module.
Alerts are managed via the Alerts GitLab repo and sent via metricsinfra infrastructure.
Networking
Calico
We use Calico as the Container Network Interface (CNI) for our cluster. In practice Calico is responsible for allocating small private subnets in 192.168.0.0/16 for each node, and then routing those subnets to provide full connectivity across all nodes.
We deploy Calico following their self-managed on-premises model. We do not use their operator deployment - instead, we take their manifest deployment and build a Helm chart from it. Instructions for upgrading Calico in this setup are in the upgrade Calico section.
DNS
In the cluster we use the default CoreDNS DNS plugin. It resolves cluster-internal names (e.g. services) internally and forwards the remaining queries to the main Cloud VPS recursor service. CoreDNS configuration is managed by Kubeadm and generally works well enough, although we should consider increasing the number of replicas.
Ingress
We use kubernetes/ingress-nginx to route HTTP requests to specific tools inside the Kubernetes cluster. Ingress objects are created by webservice (soon jobs-api), and the ingress admission controller restricts each tool to [toolname].toolforge.org.
HAProxy (external service access)
NFS and LDAP
The worker nodes are Puppetized which means they have the standard Cloud VPS SSSD setup for using LDAP data for user accounts.
In addition, most (as of February 2024) worker nodes have the shared storage NFS volumes mounted, and these nodes have the kubernetes.wmcloud.org/nfs-mounted=true and toolforge.org/nfs-mounted=true labels so that tools can run NFS-requiring workloads on them. The volume-admission-controller admission controller mounts all volumes to pods with the toolforge: tool label.
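The node labels mentioned above can be used as a selector to see which workers currently have NFS mounted. A minimal sketch (nfs_workers is a hypothetical helper name; kubectl sudo is used because interacting with nodes is not allowed with the normal credentials):

```shell
# List worker nodes carrying the NFS label described above.
# Extra arguments (for example "-o wide") are passed through to kubectl.
# nfs_workers is a hypothetical helper name used for illustration.
nfs_workers() {
    kubectl sudo get nodes -l kubernetes.wmcloud.org/nfs-mounted=true "$@"
}

# Usage, on a bastion in the relevant project:
#   nfs_workers -o wide
```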
There are plans to introduce non-NFS workers to the pool once the Bookworm OS upgrades have finished. These would be used by tools with build service images, buildservice builds and infrastructure components with no need for NFS connectivity. Given the reliability issues with NFS, new features should be designed in a way that they at least do not make it harder to move away from NFS.
Pod isolation
We use kyverno to enforce a set of security and isolation constraints to all tool account Pod workloads running in the cluster.
Examples of things we ensure:
- pods have a limited set of privileges and cannot escalate
- pods cannot "hijack" files from other tools by jumping into another tool's NFS data directory
- pods don't run as root
With kyverno, we not only validate that pods are correct, but we also mutate (modify) them to inject specific values that we want set, like the uid/gid of each tool account.
Each tool account has a Kyverno policy resource created by maintain-kubeusers.
Privileged workloads, like custom components we deploy, or internal kube-system components, are not subject to any Kyverno policy enforcement as of this writing.
Testing and local deployments
We have a testing deployment in the toolsbeta Cloud VPS project. It is almost identical to the tools cluster except it is much smaller.
The lima-kilo project can be used to run parts of a Toolforge Kubernetes cluster on a local machine.
User workloads
Jobs framework
The recommended way for someone to run a workload on Toolforge is to use the Jobs framework (admin docs). The framework will create deployment, cronjob and job objects in tool namespaces.
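To see which of these objects exist for a given tool, several resource types can be requested in a single kubectl call. A minimal sketch, using the tool-[tool name] namespace convention described earlier (tool_workloads is a hypothetical helper name):

```shell
# List the deployment, cronjob and job objects in a tool's namespace.
# The namespace follows the tool-<name> convention used by
# maintain-kubeusers. tool_workloads is a hypothetical helper name.
tool_workloads() {
    kubectl get deployment,cronjob,job -n "tool-$1"
}

# Usage, as your own user on a bastion:
#   tool_workloads fourohfour
```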
Raw Kubernetes API users
Before the Jobs framework was introduced, many users used the Kubernetes API directly to run their tools. This is now deprecated, but tools are still using it because it works.
Build service builds
The Build service (admin docs) runs builds in the image-build namespace. All of this is managed via the build service API; users do not have direct access to that namespace. These builds run without NFS access.
Worker types
There are a few different types of workers.
Cookbook name | Name prefix | Description |
---|---|---|
worker | worker | Normal workers. As of February 2024, these do not have NFS access. |
worker_nfs | worker-nfs | Normal workers with NFS. |
control | control | Special purpose nodes for the Kubernetes control plane. |
ingress | ingress | Special purpose workers exclusively for the web ingress and the API gateway. |
Addition and removal of all of these types is fully automated via the cookbooks.
Discussion
Bring-your-own-image
We only allow running images from the Toolforge Docker registry (for "pre-built" images) and from the Toolforge Harbor server. This is for the following purposes:
- Make it easy to enforce our Open Source Code only guideline
- Make it easy to do security updates when necessary (just rebuild all the containers & redeploy)
- Faster deploys, since this is in the same network (vs Docker Hub, which is retrieved over the internet)
- Access control is provided totally by us, less dependent on Docker Hub
- Provide required LDAP configuration, so tools running inside the container are properly integrated in the Toolforge environment
This is enforced with an admission controller.
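Conceptually, the check amounts to matching the image reference against an allow-list of registry prefixes. The sketch below illustrates that idea only; it is not the code of the actual admission controller, and the two registry hostnames are assumptions that do not appear on this page (image_allowed is a hypothetical function name):

```shell
# Return success if an image reference starts with an allowed registry.
# The hostnames below are assumptions used for illustration; check the
# registry-admission configuration for the real values.
# image_allowed is a hypothetical function name.
image_allowed() {
    case "$1" in
        docker-registry.tools.wmflabs.org/*|tools-harbor.wmcloud.org/*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Usage:
#   image_allowed docker.io/library/nginx || echo "rejected"
```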
The decision to follow this approach was last discussed and re-evaluated at Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T302863_toolforge_byoc.
GitOps tools
Puppet
We use Puppet to provision the Kubernetes nodes, and also related non-K8s managed infrastructure such as etcd and HAProxy. However, configuration for what's inside the cluster should not be managed by Puppet for several reasons:
- We already have a deployment management system for what's inside the cluster (toolforge-deploy.git), and we should not introduce two systems for the same purpose
- Puppet cannot be used to provision a local environment as is
- puppet.git merges require global root, which not all Toolforge admins have
Single cluster reliance
References
1. [Labs-announce] [Tools] Kubernetes picked to provide alternative to GridEngine
2. [Labs-announce] Kubernetes Webservice Backend Available for PHP webservices
3. News/2020 Kubernetes cluster migration
4. Kubernetes releases
5. Toolforge k8s: Migrate workers to Containerd and Bookworm
6. Track and apply Toolforge quota changes via a Git repository
7. Create a "novaobserver" equivalent for Toolforge Kubernetes cluster inspection