The ToolforgeKubernetesCapacity alert fires when the Toolforge Kubernetes cluster is close to running out of a resource (CPU or memory). In general, this is caused by either:
- Natural increased usage (in which case the fix is to simply provision more capacity)
- Something misbehaving and taking up more resources than it should
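This panel gives an overview of the cluster capacity. The main thing to check is that "CPU requests" stays below "CPU allocatable": the scheduler places pods based on their requests, so once requests reach the allocatable amount, new pods that request CPU stay Pending.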
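If the panel is unavailable, the allocatable side of that comparison can be approximated from the command line. The following is a minimal sketch, not an official tool: the function names `list_node_allocatable_cpu` and `sum_allocatable_cpu` are hypothetical, and it assumes your account is allowed to list nodes.

```shell
# Hypothetical helper: print "<node> <allocatable cpu>" for every node.
# Assumes the current user has permission to list nodes.
list_node_allocatable_cpu() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu'
}

# Hypothetical helper: sum the second column of stdin in millicores.
# Accepts Kubernetes CPU quantities such as "8", "0.5" or "7800m".
sum_allocatable_cpu() {
  awk '
    {
      q = $2
      if (q ~ /m$/) { sub(/m$/, "", q); total += q }  # already millicores
      else          { total += q * 1000 }             # whole or fractional cores
    }
    END { printf "%.0f\n", total }
  '
}

# Usage: list_node_allocatable_cpu | sum_allocatable_cpu
```

The result is the cluster-wide allocatable CPU in millicores, which is the figure the summed pod requests need to stay under.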
Locating new workloads
On a bastion, run as your own user (results are sorted by creation time, so the newest objects are listed last):
$ kubectl get cronjob -A --sort-by=.metadata.creationTimestamp
$ kubectl get deployment -A --sort-by=.metadata.creationTimestamp
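When the newest workloads do not explain the alert, it can help to see which namespaces request the most CPU overall. The sketch below is one possible way to do that, under the same assumptions as the commands above (cluster-wide read access via `-A`); the function names `list_pod_cpu_requests` and `sum_cpu_requests` are hypothetical.

```shell
# Hypothetical helper: print "<namespace> <cpu requests>" for each running pod.
# With several containers, kubectl joins the values with commas ("100m,250m");
# pods without CPU requests show "<none>".
list_pod_cpu_requests() {
  kubectl get pods -A --no-headers --field-selector=status.phase=Running \
    -o custom-columns='NS:.metadata.namespace,CPU:.spec.containers[*].resources.requests.cpu'
}

# Hypothetical helper: sum CPU requests per namespace in millicores,
# largest first. Reads "<namespace> <cpu>[,<cpu>...]" lines from stdin.
sum_cpu_requests() {
  awk '
    {
      n = split($2, parts, ",")
      for (i = 1; i <= n; i++) {
        q = parts[i]
        if (q == "" || q == "<none>") continue        # no request set
        if (q ~ /m$/) { sub(/m$/, "", q); total[$1] += q }
        else          { total[$1] += q * 1000 }       # cores to millicores
      }
    }
    END { for (ns in total) printf "%s %.0f\n", ns, total[ns] }
  ' | sort -k2,2nr
}

# Usage: list_pod_cpu_requests | sum_cpu_requests | head -n 20
```

Only CPU is covered here; memory quantities use different suffixes (Ki, Mi, Gi) and would need their own parsing.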
Provisioning new nodes
On a cloudcumin:
$ sudo cookbook wmcs.toolforge.add_k8s_node --cluster-name tools --role worker
If you need to provision more than one node at once, it is safe to start a new cookbook run after the first one has created the VM.
- The cluster ran out of capacity on 2023-12-01 when cron jobs could not execute fast enough during a cluster reboot after several tools had migrated from the Grid Engine to K8s.