Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity
The procedures in this runbook require admin permissions to complete.
The ToolforgeKubernetesCapacity alert fires when the Toolforge Kubernetes cluster is close to running out of a resource (CPU or memory). This usually has one of two causes:
- Organic growth in usage, in which case the fix is to provision more capacity
- A misbehaving workload that consumes more resources than it should
Debugging
Dashboard
The dashboard panel gives an overview of the cluster capacity. The main thing to check is that "CPU requests" stays below "CPU allocatable".
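To cross-check the "CPU allocatable" figure from the command line, the per-node values can be summed. The following is a minimal sketch, not part of the established procedure: `sum_cpu_millicores` is a hypothetical helper name, and it assumes `kubectl` reports allocatable CPU either as cores (`4`) or as millicores (`3900m`).

```shell
# Hypothetical helper: read one CPU quantity per line on stdin
# (cores such as "4", or millicores such as "3900m")
# and print the total in millicores.
sum_cpu_millicores() {
    awk '
        $1 ~ /m$/ { sub(/m$/, "", $1); total += $1; next }  # already in millicores
        NF        { total += $1 * 1000 }                    # whole or fractional cores
        END       { printf "%d\n", total }
    '
}

# Example use against the cluster (run on a bastion):
#   kubectl get nodes --no-headers \
#       -o custom-columns=CPU:.status.allocatable.cpu | sum_cpu_millicores
```

This only covers the allocatable side of the comparison. The requests side is easier to read from the dashboard or from the "Allocated resources" section of `kubectl describe nodes`.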
Locating new workloads
On a bastion, run the following as your own user. The output is sorted by creation time, so the newest workloads appear last:
$ kubectl get cronjob -A --sort-by=.metadata.creationTimestamp
$ kubectl get deployment -A --sort-by=.metadata.creationTimestamp
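Because `--sort-by` lists the oldest objects first, the most recent workloads end up at the bottom of a potentially long listing. As a sketch, a small wrapper can trim the output to the last few entries. `newest_workloads` is a hypothetical name, not an existing Toolforge tool, and the column layout is an illustrative choice.

```shell
# Hypothetical helper: print the N most recently created objects of a
# given kind across all namespaces (default: 10).
# Relies on kubectl sorting creation timestamps oldest first.
newest_workloads() {
    kind="$1"
    count="${2:-10}"
    kubectl get "$kind" -A --no-headers \
        --sort-by=.metadata.creationTimestamp \
        -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CREATED:.metadata.creationTimestamp \
        | tail -n "$count"
}

# Example use (run on a bastion):
#   newest_workloads cronjob 20
#   newest_workloads deployment
```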
Fixing
Provisioning new nodes
On a cloudcumin:
$ sudo cookbook wmcs.toolforge.add_k8s_node --cluster-name tools --role worker
If you need to provision more than one node at once, it is safe to start a new cookbook run after the first one has created the VM.
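Once the cookbook runs finish, it can be worth confirming that the new workers have joined the cluster and are schedulable. The sketch below is an assumption about a reasonable follow-up check rather than a documented step. `not_ready_nodes` is a hypothetical helper that filters the default `kubectl get nodes --no-headers` output, whose second column is the node status.

```shell
# Hypothetical helper: read "kubectl get nodes --no-headers" output on
# stdin and print the name and status of every node whose status is not
# exactly "Ready" (this also catches cordoned nodes, which show up as
# "Ready,SchedulingDisabled").
not_ready_nodes() {
    awk '$2 != "Ready" { print $1, $2 }'
}

# Example use (run on a bastion):
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node reports plain `Ready`.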
Related information
Support contacts
Old incidents
- The cluster ran out of capacity on 2023-12-01 when cron jobs could not execute fast enough during a cluster reboot after several tools had migrated from the Grid Engine to K8s.