Portal:Toolforge/Admin/Kubernetes

From Wikitech

Kubernetes (often abbreviated k8s) is an open-source system for automating the deployment and management of applications running in containers. Kubernetes was selected in 2015 by the Cloud Services team as the replacement for Grid Engine in the Toolforge project.[1] Usage of k8s in Tools began in mid-2016,[2] and the current cluster design dates back to early 2020.[3]

For help on using Kubernetes in Toolforge, see the Kubernetes help documentation.

About this document

This document describes the Kubernetes cluster used in Toolforge and its direct support services (e.g. etcd). It does not cover specifics about services running in the cluster (e.g. the Jobs framework and build service), nor does it cover Toolforge services that are fully unrelated to the Kubernetes cluster (e.g. Redis).

The four main sections of this document correspond to the four categories of documentation in The Grand Unified Theory of Documentation system, in a structure inspired by how the Tor Project admins organize theirs.

Tutorial

Access kubectl

kubectl is the official Kubernetes command line interface tool. If you are listed as a maintainer of the admin tool (or the toolsbeta equivalent), superuser credentials are automatically provisioned in your NFS home directory.

To use the CLI tool, log in to a bastion host in the project where the cluster you want to interact with is located. If you just want to experiment, use the toolsbeta cluster. Most read-only commands work out of the box, for example to list pods in the tool-fourohfour namespace used by the 404 handler:

$ kubectl get pod -n tool-fourohfour
NAME                          READY   STATUS    RESTARTS   AGE
fourohfour-7766466794-gtpgk   1/1     Running   0          7d20h
fourohfour-7766466794-qctt8   1/1     Running   0          6d18h

However, all write actions and some read-only actions (e.g. interacting with nodes or secrets) will give you a permission error:

$ kubectl delete pod -n tool-fourohfour fourohfour-7766466794-gtpgk
Error from server (Forbidden): pods "fourohfour-7766466794-gtpgk" is forbidden: User "taavi" cannot delete resource "pods" in API group "" in the namespace "tool-fourohfour"

If you're sure you want to continue, you need to use kubectl sudo:

$ kubectl sudo delete pod -n tool-fourohfour fourohfour-7766466794-gtpgk
pod "fourohfour-7766466794-gtpgk" deleted

kubectl sudo, as the name implies, really has full access to the entire cluster. You should only use it when you need to do something that your normal account does not have access to.
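
Before reaching for kubectl sudo, it can help to check whether your normal credentials already allow the action. The following is a minimal sketch, assuming the upstream kubectl auth can-i subcommand behaves as usual on the bastion; the helper name needs_sudo is hypothetical and not part of the Toolforge tooling:

```shell
# Hypothetical helper: succeeds (exit 0) when your normal credentials are
# NOT allowed to perform the action, i.e. when kubectl sudo would be needed.
# Usage: needs_sudo VERB RESOURCE NAMESPACE
needs_sudo() {
    # "kubectl auth can-i" exits 0 for "yes" and non-zero for "no";
    # --quiet suppresses the yes/no output. A failure to reach the API
    # server also yields non-zero, so treat the result as a hint only.
    if kubectl auth can-i "$1" "$2" -n "$3" --quiet; then
        return 1
    fi
    return 0
}

# Example (not run here):
#   needs_sudo delete pods tool-fourohfour && echo "this needs kubectl sudo"
```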

Manage pods

Pods are the basic unit of compute in Kubernetes. A pod consists of one or more OS-level containers that share a network namespace.

List pods

Pods can be listed with the kubectl get pod command. Log in to a toolsbeta bastion, become the fourohfour tool and run:

$ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
fourohfour-bd4ffc5ff-479sj   1/1     Running   0          43s
fourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          35s

The -o (--output) flag can be used to customize the output. For example, -o wide will display more information:

$ kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP                NODE                              NOMINATED NODE   READINESS GATES
fourohfour-bd4ffc5ff-479sj   1/1     Running   0          91s   192.168.120.158   toolsbeta-test-k8s-worker-nfs-1   <none>           <none>
fourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          83s   192.168.145.16    toolsbeta-test-k8s-worker-nfs-2   <none>           <none>
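
The -o flag also accepts a jsonpath template, which is convenient in scripts when only a few fields are needed. A small sketch under that assumption; the function name pod_names is hypothetical, while the jsonpath syntax is the one documented upstream for kubectl:

```shell
# Hypothetical helper: print the name of every pod in a namespace,
# one per line.
# Usage: pod_names NAMESPACE
pod_names() {
    # {range .items[*]} iterates over the pods in the returned list;
    # {"\n"} emits a newline after each name.
    kubectl get pods -n "$1" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Example (not run here):
#   pod_names tool-fourohfour
```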

Alternatively, -o json displays the data as JSON:

$ kubectl get pods -o json | head -n5
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "v1",

So far we have only accessed data in the tool's own namespace. To access data in other namespaces, switch back to your user account and use the -n (--namespace) flag to specify which namespace to access:

$ kubectl get pod -n tool-fourohfour
NAME                         READY   STATUS    RESTARTS   AGE
fourohfour-bd4ffc5ff-479sj   1/1     Running   0          4m27s
fourohfour-bd4ffc5ff-4lhcf   1/1     Running   0          4m19s
$ kubectl get pod -n tool-admin -o wide
NAME                    READY   STATUS    RESTARTS   AGE     IP              NODE                              NOMINATED NODE   READINESS GATES
admin-cb6d84bd8-pshh7   1/1     Running   0          5d21h   192.168.25.21   toolsbeta-test-k8s-worker-nfs-4   <none>           <none>

Or we can use -A (--all-namespaces) to list data in the entire cluster:

$ kubectl get pod -A | head -n5
NAMESPACE            NAME                                                    READY   STATUS             RESTARTS         AGE
api-gateway          api-gateway-nginx-6ddddd6f64-mbnlg                      1/1     Running            0                12d
api-gateway          api-gateway-nginx-6ddddd6f64-tdl6c                      1/1     Running            0                8d
builds-admission     builds-admission-7897cf7759-jtxb5                       1/1     Running            0                28h
builds-admission     builds-admission-7897cf7759-nvmzt                       1/1     Running            0                26h

View logs for a pod

To view the combined standard output and standard error for a pod, use kubectl logs:

$ kubectl get pod -n maintain-kubeusers
NAME                                  READY   STATUS    RESTARTS   AGE
maintain-kubeusers-55b649885c-px8c6   1/1     Running   0          87m
$ kubectl logs -n maintain-kubeusers maintain-kubeusers-55b649885c-px8c6 | wc -l
176

Some useful flags for this command are:

  • --tail NUMBER to only show the specified number of most recent lines
  • --follow to keep streaming new log output as it is written
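
The two flags combine naturally. As a sketch, the hypothetical wrapper below (follow_logs is an illustrative name, not an existing tool) shows the last few lines of a pod's log and then keeps streaming:

```shell
# Hypothetical helper: show the last LINES log lines of a pod (default 20),
# then keep streaming new output until interrupted with Ctrl-C.
# Usage: follow_logs NAMESPACE POD [LINES]
follow_logs() {
    kubectl logs -n "$1" "$2" --tail "${3:-20}" --follow
}

# Example (not run here), using the pod name from the listing above:
#   follow_logs maintain-kubeusers maintain-kubeusers-55b649885c-px8c6 50
```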

Restart a pod

Manage workers

The main Toolforge cluster consists of a bit over 50 "normal" NFS-enabled workers, and some special workers used for specific purposes. These workers can be added and removed using cookbooks. Adding and removing a node are both fairly straightforward, but because replacing the entire cluster takes a long time, we prefer to update existing nodes in place during most routine maintenance (e.g. Kubernetes upgrades or node reboots). It is, however, totally fine to replace nodes in Toolsbeta if you want to try the process.

Add a worker

These cookbooks can be run from the cloudcumin hosts (recommended) or from your laptop if you have them set up locally. Use of screen or tmux is recommended.

To create a normal worker_nfs in toolsbeta, use:

$ sudo cookbook wmcs.toolforge.add_k8s_node --cluster-name toolsbeta --role worker_nfs

Remove a worker

Removing a worker is equally straightforward. To remove the oldest worker_nfs node in toolsbeta, use:

$ sudo cookbook wmcs.toolforge.remove_k8s_node --cluster-name toolsbeta --role worker_nfs

If you have a specific node that you want to remove, pass that as a parameter:

$ sudo cookbook wmcs.toolforge.remove_k8s_node --cluster-name toolsbeta --role worker_nfs --hostname-to-remove toolsbeta-test-k8s-worker-nfs-1

Drain and undrain a node

Sometimes a node is misbehaving or needs maintenance, and has to be drained of all workloads. The easiest way to do this is with the cookbook:

$ sudo cookbook wmcs.toolforge.k8s.worker.drain --cluster-name toolsbeta --hostname-to-drain toolsbeta-test-k8s-worker-nfs-1

To "uncordon" (allow new pods to be scheduled to it again) the node, run the following on a bastion in the relevant project:

$ kubectl sudo uncordon toolsbeta-test-k8s-worker-nfs-1
node/toolsbeta-test-k8s-worker-nfs-1 uncordoned

You can also just "cordon" a node, which prevents new workloads from being scheduled on it but does not drain existing ones:

$ kubectl sudo cordon toolsbeta-test-k8s-worker-nfs-1
node/toolsbeta-test-k8s-worker-nfs-1 cordoned

That is also reversed with the uncordon command.
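
To confirm whether a node is currently cordoned, one option is to read the standard .spec.unschedulable field of the Node object. This is a sketch only: the helper name node_state is hypothetical, and it assumes kubectl sudo passes the get subcommand through as in the examples above:

```shell
# Hypothetical helper: report whether a node is currently cordoned.
# Usage: node_state NODE   (prints "cordoned" or "schedulable")
node_state() {
    # .spec.unschedulable is "true" on a cordoned node and unset otherwise,
    # in which case the jsonpath output is empty.
    if [ "$(kubectl sudo get node "$1" -o jsonpath='{.spec.unschedulable}')" = "true" ]; then
        echo cordoned
    else
        echo schedulable
    fi
}

# Example (not run here):
#   node_state toolsbeta-test-k8s-worker-nfs-1
```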

How-to

Cluster management

Build a new cluster

We have not built a new cluster since the 2020 cluster redesign. The documentation written during the 2020 redesign is at Portal:Toolforge/Admin/Kubernetes/Deploying, although it is likely somewhat outdated.

Upgrade Kubernetes

Kubernetes upstream releases new versions about three times a year.[4] We cannot skip any upgrades and thus must upgrade sequentially. This process is documented at Portal:Toolforge/Admin/Kubernetes/Upgrading Kubernetes.

Upgrade Calico

Upgrade ingress-nginx

Upgrade worker operating system

We have upgraded the cluster OS once, from Buster to Bookworm, and at the same time changed the container runtime from Docker to containerd.[5] There is no set process or specific automation for this, but the approach taken last time was:

  1. Pick which Debian release you're going to upgrade to
  2. Ensure the container runtime version in that release is supported by Kubernetes, Calico and cadvisor
  3. Import kubeadm packages for the new Debian release
  4. Add a new worker in toolsbeta
    1. Test carefully that it works
    2. Do this for all types to test out all configuration combinations (with/without NFS, with/without extra volume)
    3. Remove matching number of old workers
  5. Replace a control node in toolsbeta
  6. Add a few new nodes in tools
  7. Wait a few days
  8. Replace all tools workers
    1. In parallel, replace remaining toolsbeta workers
  9. Replace tools controls

Roll reboot cluster

The wmcs.toolforge.k8s.reboot cookbook can be used to reboot the entire cluster, for example to apply kernel or container runtime updates, or in case the NFS server is having issues. Start by reading the --help output for the cookbook. For example, in the NFS issue case in toolsbeta, you could run:

$ sudo cookbook wmcs.toolforge.k8s.reboot --cluster-name toolsbeta --all-workers

etcd

Add etcd node

Remove etcd nodes

Upgrade etcd

We run etcd from the Debian packages, so an etcd upgrade is automatically a Debian upgrade and vice versa.

We have not upgraded etcd yet since the 2020 cluster redesign. This section should be filled when we do that for the first time.

Component system

In the Toolforge Kubernetes component workflow improvements enhancement proposal we introduced a standard "components" system for various components that run in the Kubernetes cluster. The system is documented in more detail in the toolforge-deploy.git README file.

Deploy new version

This process is described in more detail in the toolforge-deploy.git README file. But, in summary, to deploy a change to a toolforge-deploy managed component:

  1. Merge the toolforge-deploy.git MR. For chart/image updates in components we develop, the MR is created automatically; in other cases you need to create it manually.
  2. Run the deployment cookbook for toolsbeta:
    $ sudo cookbook wmcs.toolforge.k8s.component.deploy --cluster-name toolsbeta --component $COMPONENT_DIRECTORY_NAME
    
  3. Test that the change works in Toolsbeta.
  4. If you have a separate MR for tools, merge it now.
  5. Re-run the cookbook for the tools cluster if applicable.

Roll back a change

To roll back a change, revert the toolforge-deploy.git commit and then follow the deployment steps as usual.

Manage (tool) users

Modify quotas

Tool quotas are managed by maintain-kubeusers and configured in the values file in toolforge-deploy.git.[6] To update quotas for a specific tool:

  1. Send a patch to the values file changing the quotas. The format should be relatively self-explanatory, and the defaults and supported keys are listed in the default values file. Always change the version when making any kind of change or it will not be applied.
  2. Merge the patch to main and deploy it like any other component change.

Regenerate .kube/config

In case something goes wrong with the credentials for a certain tool user, you can delete the maintain-kubeusers configmap which will cause maintain-kubeusers to re-generate the credentials for that user. On a bastion in the relevant project, run:

$ kubectl sudo delete cm -n tool-$TOOL maintain-kubeusers

Please have a look at the logs for maintain-kubeusers and file a bug so the issue can be fixed.

Enable observer access

Requests for observer access must be approved by the Toolforge admins in a Phabricator task. Once approved, they can be implemented on a control plane node with:

$ sudo -i wmcs-enable-cluster-monitor <tool-name>

Manage user workloads

Find newly added workloads

The Kubernetes capacity alert runbook documents how to find where a sudden increase in workload has come from.

Pod tracing

Given that all tools running on a single worker share that worker's IP address, occasionally you need to figure out which tool on a given worker is misbehaving. That process is documented on Portal:Toolforge/Admin/Kubernetes/Pod tracing.

Update prebuilt images

This has been moved to the Jobs framework documentation.

Reference

Admission controllers

Custom admission controllers in the Toolforge cluster
Repository          Related to functionality   Description
builds-admission    Build Service              Validate build service user-created pipelines
envvars-admission   Envvars Service            Inject configured envvars to pods
ingress-admission   Webservice                 Validate created ingress objects use the domain allowed for that tool
registry-admission  Jobs framework             Validate new pods use images in the Toolforge docker registry or Harbor
volume-admission    Jobs framework             Inject NFS mounts to pods that are configured to have them

Authentication, authorization, certificates and RBAC

cert-manager

External certificates

maintain-kubeusers

maintain-kubeusers is responsible for creating Kubernetes credentials and a namespace (tool-[tool name]) for each tool, and removing access for disabled tools. It is also in charge of maintaining quotas and PodSecurityPolicies for each tool. In addition, it creates admin credentials for all maintainers of the admin tool.

The service is written as a long-running daemon, and it talks to LDAP directly for tool data. It exports Prometheus metrics, but those are not used for any alerts or dashboards at this moment.

Observer access

Some tools (e.g. k8s-status) need more access to the Kubernetes API than the default credentials grant. For these tools, an "observer" role has been created that grants read-only access to non-sensitive data about the cluster and the workloads that run on it.[7] The role is defined in a file deployed by Puppet (although phab:T328539 proposes moving it to maintain-kubeusers), and role bindings are created manually using a script.

Using observer status in a job with serviceAccountName: ${tool}-obs is not supported by the Jobs framework or webservice. The k8s-status tool uses a custom script for managing a web service with such access included.

Requests for such access should be approved by the Toolforge admins before access is granted.

Backups

The main thing worth backing up is the contents of the etcd cluster. It is not currently backed up.

Bastion nodes

The Toolforge bastion nodes have kubectl installed. As the bastion nodes have NFS mounts, and maintain-kubeusers provisions certificates to NFS, everything will then work out of the box.

Kubernetes design

The upstream Kubernetes documentation is both more detailed and more up to date than this page. Here is, however, a quick overview of the major Kubernetes components.

Control plane

etcd

Kubernetes stores all state in etcd; all other components are stateless. The etcd cluster is accessed directly only by the API server. Direct access to this etcd cluster is equivalent to root on the entire k8s cluster, so it is protected in three ways: it is firewalled off so that only the control plane nodes and the etcd nodes themselves can reach it, it uses client certificate verification for authentication (with Puppet as the CA), and secrets are encrypted at rest.

We currently use a 3 node cluster, hosted on VMs separate from the main control plane. They're all smallish Debian Buster instances configured largely by the same etcd puppet code we use in production. The main interesting thing about them is that they're localdisk instances as etcd is rather sensitive to iowait.

API server

The API server is the heart of the Kubernetes control plane. All communication between all components, whether they are internal system components or external user components, must go through the API server. It is purely a data access layer, containing no logic related to any of the actual end-functionality Kubernetes offers. It offers the following functionality:

  • Authentication & Authorization
  • Validation
  • Read / Write access to all the API endpoints
  • Watch functionality for endpoints, which notifies clients when state changes for a particular resource

When you are interacting with the Kubernetes API, this is the server that is serving your requests.

The API server runs as a static pod on the control plane nodes. It listens on port 6443/tcp, and all access from outside the Kubernetes cluster should go via HAProxy. Requests are authenticated with either tokens (mostly for internal usage) or client certificates signed via the certificates API.
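
As a quick illustration of talking to the API server directly, upstream Kubernetes exposes a /readyz health endpoint that kubectl can query with get --raw. This sketch assumes the endpoint is reachable with your credentials in this cluster; the helper name apiserver_ready is hypothetical:

```shell
# Hypothetical helper: succeeds when the API server reports itself ready.
# "kubectl get --raw PATH" sends a plain GET for PATH to the API server
# configured in your kubeconfig and prints the response body.
apiserver_ready() {
    # A ready API server answers /readyz with the body "ok".
    [ "$(kubectl get --raw /readyz)" = "ok" ]
}

# Example (not run here):
#   apiserver_ready && echo "API server is ready"
```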

controller-manager and scheduler

The controller-manager and scheduler contain most of the actual logic. The scheduler is responsible for assigning pods to nodes and the controller-manager is for most other actions, for example launching CronJobs at scheduled times or ensuring ReplicaSets have the correct number of Pods running. The general idea is one of a 'reconciliation loop' - poll/watch the API server for desired state and current state, then perform actions to make them match.
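
The reconciliation idea can be illustrated with a toy example. The sketch below is not how the real controller-manager is implemented; it only shows one step of comparing a desired replica count with the current one and choosing an action. The function name reconcile is made up for this illustration:

```shell
# Toy reconciliation step: compare the desired and current replica counts
# and print the action a controller would take to make them match.
# Usage: reconcile DESIRED CURRENT
reconcile() {
    desired=$1
    current=$2
    if [ "$current" -lt "$desired" ]; then
        echo "create $((desired - current))"
    elif [ "$current" -gt "$desired" ]; then
        echo "delete $((current - desired))"
    else
        echo "nothing to do"
    fi
}

# Example: a ReplicaSet wants 2 pods but only 1 is running.
#   reconcile 2 1    prints "create 1"
```

A real controller runs this kind of comparison continuously, re-reading both states from the API server each time, which is what makes the system converge after failures.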

Worker

The primary service running on each node is the Kubelet, which is an interface between the Kubernetes API and the container runtime (containerd in our case). Kubelet is responsible for ensuring the pods running on the node match what the API server wants to run on that node, and reports back metrics to the API. It also proxies log requests when necessary. Pod health checks are also done by the Kubelet.

In addition, there are two networking-related services running on each node:

  • kube-proxy manages iptables NAT rules for Service addresses.
  • The container network interface (or CNI, Calico in our cluster) manages the rest of the cluster networking. In practice this means an overlay network where each pod is assigned a cluster-internal IP address which can be routed across the entire cluster.

Labels

A reference of various used Kubernetes labels and their meanings is available on Portal:Toolforge/Admin/Kubernetes/Labels.

Monitoring and metrics

Alert runbooks

Kubernetes metrics stack

The Kubernetes cluster runs multiple pieces of software responsible for cluster monitoring:

These are all deployed via the wmcs-k8s-metrics component using the standard component deployment model.

Prometheus integration

The Toolforge Prometheus servers scrape cadvisor, kube-state-metrics and Prometheus exporter endpoints in the apps that have them. For this, the Prometheus servers have an external API certificate provisioned via Puppet that needs to be renewed yearly. The scrape targets are defined in the profile::toolforge::prometheus Puppet module.

Alerts are managed via the Alerts GitLab repo and sent via metricsinfra infrastructure.

Networking

Calico

We use Calico as the Container Network Interface (CNI) for our cluster. In practice Calico is responsible for allocating small private subnets in 192.168.0.0/16 to each node, and then routing those subnets to provide full connectivity across all nodes.

We deploy Calico following their self-managed on-premises model. We do not use their operator deployment - instead, we take their manifest deployment and build a Helm chart from it. Instructions for upgrading Calico in this setup are in the upgrade Calico section.

DNS

In the cluster we use the default CoreDNS DNS plugin. It resolves cluster-internal names (e.g. services) internally and forwards the remaining queries to the main Cloud VPS recursor service. CoreDNS configuration is managed by Kubeadm and generally works well enough, although we should consider increasing the number of replicas.

Ingress

We use kubernetes/ingress-nginx to route HTTP requests to specific tools inside the Kubernetes cluster. Ingress objects are created by webservice (soon jobs-api), and the ingress admission controller restricts each tool to [toolname].toolforge.org.

HAProxy (external service access)

NFS and LDAP

The worker nodes are Puppetized which means they have the standard Cloud VPS SSSD setup for using LDAP data for user accounts.

In addition, most (as of February 2024) worker nodes have the shared storage NFS volumes mounted, and these nodes carry the kubernetes.wmcloud.org/nfs-mounted=true and toolforge.org/nfs-mounted=true labels so that tools can run NFS-requiring workloads on them. The volume-admission-controller admission controller mounts all volumes to pods with the toolforge: tool label.
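
The node labels mentioned above can be used as a label selector to list the NFS-enabled workers. A minimal sketch, assuming node access requires kubectl sudo as described in the tutorial; the helper name nfs_nodes is hypothetical:

```shell
# Hypothetical helper: list the nodes that carry the NFS label.
# -l selects objects by label; -o name prints them as "node/<name>".
nfs_nodes() {
    kubectl sudo get nodes -l kubernetes.wmcloud.org/nfs-mounted=true -o name
}

# Example (not run here):
#   nfs_nodes | wc -l    # count the NFS-enabled nodes
```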

There are plans to introduce non-NFS workers to the pool once the Bookworm OS upgrades have finished. These would be used by tools with build service images, buildservice builds and infrastructure components with no need for NFS connectivity. Given the reliability issues with NFS, new features should be designed in a way that they at least do not make it harder to move away from NFS.

Pod isolation (PodSecurityPolicy)

Diagram of the PSP design as it was when the current cluster was being designed.

We use PodSecurityPolicy (PSP) for ensuring user workloads in pods can't escalate privileges to access data for other tools. PodSecurityPolicies are a default-deny mechanism; a pod can only run if it has access to a PSP that allows that specific configuration.

Tool user PSPs are provisioned by maintain-kubeusers.

Kubernetes internals in the kube-system namespace and most admin-managed cluster components use the privileged PSP which is managed in the Puppet repository.

PodSecurityPolicy is deprecated and will be removed in Kubernetes 1.25. We have not yet decided how it will be replaced.

Testing and local deployments

We have a testing deployment in the toolsbeta Cloud VPS project. It is almost identical to the tools cluster except it is much smaller.

The lima-kilo project can be used to run parts of a Toolforge Kubernetes cluster on a local machine.

User workloads

Jobs framework

The recommended way for someone to run a workload on Toolforge is to use the Jobs framework (admin docs). The framework will create deployment, cronjob and job objects in tool namespaces.

Raw Kubernetes API users

Before the Jobs framework was introduced, many users used the Kubernetes API directly to run their tools. This is now deprecated, but tools are still using it because it works.

Build service builds

The Build service (admin docs) runs builds in the image-build namespace. All of this is managed via the build service API; users do not have direct access to that namespace. These builds run without NFS access.

Worker types

There are a few different types of workers.

Cookbook name  Name prefix  Description
worker         worker       Normal workers. As of February 2024, these do not have NFS access.
worker_nfs     worker-nfs   Normal workers with NFS.
control        control      Special purpose nodes for the Kubernetes control plane.
ingress        ingress      Special purpose workers exclusively for the web ingress and the API gateway.

Addition and removal of all of these types is fully automated via the cookbooks.

Discussion

Bring-your-own-image

We only allow running images from the Toolforge Docker registry (for "pre-built" images) and from the Toolforge Harbor server. This is for the following reasons:

  1. It makes it easy to enforce our Open Source Code only guideline
  2. It makes it easy to do security updates when necessary (just rebuild all the containers & redeploy)
  3. Deploys are faster, since the registry is in the same network (vs Docker Hub, which is retrieved over the internet)
  4. Access control is provided entirely by us, with less dependence on Docker Hub
  5. It provides the required LDAP configuration, so tools running inside the container are properly integrated in the Toolforge environment

This is enforced with an admission controller.

The decision to follow this approach was last discussed and re-evaluated at Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T302863_toolforge_byoc.

GitOps tools

PodSecurityPolicy replacement

Puppet

We use Puppet to provision the Kubernetes nodes, and also related non-K8s managed infrastructure such as etcd and HAProxy. However, configuration for what's inside the cluster should not be managed by Puppet for several reasons:

  • We already have a deployment management system for what's inside the cluster (toolforge-deploy.git), and we should not introduce two systems for the same purpose
  • Puppet cannot be used to provision a local environment as is
  • puppet.git merges require global root, which not all Toolforge admins have

Single cluster reliance

References