Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity

The procedures in this runbook require admin permissions to complete.

The ToolforgeKubernetesCapacity alert fires when the Toolforge Kubernetes cluster is close to running out of a resource (CPU or memory). This is usually caused by one of the following:

  • Natural growth in usage (in which case the fix is to provision more capacity)
  • Something misbehaving and taking up more resources than it should

Debugging

Dashboard

This panel gives an overview of the cluster capacity. The main thing to check is that "CPU requests" stays below "CPU allocatable".
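If the dashboard is unavailable, the same comparison can be approximated from the command line. The following is a minimal sketch, not the dashboard's own query: the function names are made up for illustration, it assumes your account can read nodes and pods in all namespaces, it only understands CPU quantities written as plain cores or with an "m" suffix, and it ignores init containers, so the numbers may differ slightly from what the scheduler accounts for.

```shell
# Sum Kubernetes CPU quantities read from stdin (one per line), in millicores.
# Handles "500m", "2" and "0.5" style values; empty lines count as 0.
sum_millicores() {
  awk '{
    q = $1
    if (q ~ /m$/) { sub(/m$/, "", q); total += q }
    else          { total += q * 1000 }
  } END { printf "%d\n", total + 0.5 }'
}

# Total allocatable CPU across all nodes, in millicores.
cpu_allocatable() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.cpu}{"\n"}{end}' |
    sum_millicores
}

# Total CPU requested by containers of pods that have not finished, in millicores.
cpu_requested() {
  kubectl get pods -A \
    --field-selector='status.phase!=Succeeded,status.phase!=Failed' \
    -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.resources.requests.cpu}{"\n"}{end}{end}' |
    sum_millicores
}
```

After sourcing these functions, `echo "$(cpu_requested)m requested of $(cpu_allocatable)m allocatable"` prints both totals side by side.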

Locating new workloads

On a bastion, run the following as your own user. The output is sorted by creation time, so the newest workloads appear last:

$ kubectl get cronjob -A --sort-by=.metadata.creationTimestamp
$ kubectl get deployment -A --sort-by=.metadata.creationTimestamp
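When the listings are long, it can help to see which namespaces own the most objects. The helper below is a hypothetical sketch (the name `count_by_namespace` is not an existing tool): it reads a listing produced with `-A --no-headers`, where the first column is the namespace, and prints a count per namespace, largest first.

```shell
# Count lines per namespace (first column of stdin) and print "COUNT NAMESPACE",
# sorted by count descending, then by namespace name.
count_by_namespace() {
  awk 'NF { n[$1]++ } END { for (ns in n) print n[ns], ns }' |
    sort -k1,1nr -k2,2
}
```

For example, `kubectl get pods -A --no-headers | count_by_namespace | head` would show the ten namespaces with the most pods. A high count is only a hint of where to look, not proof that a tool is misbehaving.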

Fixing

Provisioning new nodes

On a cloudcumin:

$ sudo cookbook wmcs.toolforge.add_k8s_node --cluster-name tools --role worker

If you need to provision more than one node at once, it is safe to start a new cookbook run after the first one has created the VM.
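Once a cookbook run finishes, you may want to confirm that the new worker has joined the cluster. The sketch below uses standard kubectl commands wrapped in two hypothetical helper functions. It assumes your account can read nodes, the node name must be taken from the cookbook output, and the 10 minute default timeout is an arbitrary choice.

```shell
# List nodes oldest first, so freshly provisioned workers appear at the bottom.
list_nodes_by_age() {
  kubectl get nodes --sort-by=.metadata.creationTimestamp
}

# Block until the named node reports Ready, or give up after the timeout.
# Usage: wait_for_node NODE_NAME [TIMEOUT]
wait_for_node() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-10m}"
}
```

The check is read-only, so it is safe to run while another cookbook run is still in progress.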

Related information

Support contacts

Old incidents

  • The cluster ran out of capacity on 2023-12-01 when cron jobs could not execute fast enough during a cluster reboot after several tools had migrated from the Grid Engine to K8s.