Portal:Toolforge/Admin/Runbooks/Kyverno
This page contains runbooks for dealing with Kyverno problems.
Kyverno is in the hot path for user workload scheduling in Toolforge. Every pod created by a tool account is evaluated by Kyverno through an admission webhook, which validates and, where needed, mutates the resource being created.
It is therefore imperative that Kyverno is up and running at all times.
Kyverno policies are created by maintain-kubeusers for every tool account.
What happens if Kyverno is down, no Kyverno policies are present, or policies are not READY
Kyverno is currently configured in fail-closed mode, meaning that if it is down, policies won't be evaluated and the admission webhook will reject the creation of new user workloads.
Per our configuration, a Kyverno policy must evaluate correctly for a tool account workload to be admitted into the cluster.
Therefore, if any of the following is true:
- Kyverno is down
- No Kyverno policy exists for a given tool account namespace
- A Kyverno policy exists in the tool account namespace, but it is not in READY state
the result is the same: no new tool workloads (pods) will be allowed to run in the cluster.
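To confirm the fail-closed behavior on a live cluster, one option is to read the failurePolicy field of the Kyverno webhooks. This is a minimal sketch, assuming the webhook configuration names used later on this page. The KUBECTL variable and the function name are illustrative helpers, not existing tooling. A value of Fail means requests are rejected when the webhook cannot be reached; depending on the Kyverno version and policy settings, more than one value may be listed.

```shell
# Illustrative helper (not existing tooling): KUBECTL can be overridden for dry runs.
KUBECTL="${KUBECTL:-sudo -i kubectl}"

# Print the failurePolicy of every webhook in both Kyverno resource webhook configs.
kyverno_failure_policy() {
  $KUBECTL get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
    -o jsonpath='{.webhooks[*].failurePolicy}'
  echo
  $KUBECTL get mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg \
    -o jsonpath='{.webhooks[*].failurePolicy}'
  echo
}
```

Run kyverno_failure_policy on a Toolforge k8s control node after pasting the function into a shell.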
How to fix it
If Kyverno is down, try any of the following:
- recreate the pods manually from Toolforge k8s control nodes. TODO: put the actual command here.
- redeploy it from scratch, using the toolforge-deploy repository. TODO: put the actual command here.
- verify resources (RAM, CPU) of the cluster and key components. Kyverno can be very resource intensive, for both itself and other Kubernetes components (apiserver, controller-manager, etcd, etc). TODO: put here some commands.
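Until the TODO items above are filled in with the agreed commands, the following is a hedged sketch of one way to recreate the Kyverno pods and look at resource usage. It assumes the deployment names shown in the Debugging section of this page, and that the metrics API is available for kubectl top. The KUBECTL variable and function names are hypothetical helpers. It does not cover redeploying from the toolforge-deploy repository.

```shell
# Illustrative helper (not existing tooling): KUBECTL can be overridden for dry runs.
KUBECTL="${KUBECTL:-sudo -i kubectl}"

# Trigger a rolling restart of each Kyverno deployment, which recreates its pods.
restart_kyverno() {
  for deploy in kyverno-admission-controller kyverno-background-controller \
                kyverno-cleanup-controller kyverno-reports-controller; do
    $KUBECTL -n kyverno rollout restart "deploy/$deploy"
  done
}

# Show CPU/RAM usage for nodes and Kyverno pods (needs the metrics API).
kyverno_resource_usage() {
  $KUBECTL top nodes
  $KUBECTL -n kyverno top pods
}
```

A rolling restart keeps some replicas serving while others are replaced, which matters here because the webhook is fail-closed.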
If policies are not present, try any of the following:
- restart maintain-kubeusers. TODO: put the actual command here.
- redeploy maintain-kubeusers. TODO: put the actual command here.
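As a possible starting point for the restart step, here is a minimal sketch. It assumes the deployment is named maintain-kubeusers in the maintain-kubeusers namespace, as used in the log commands later on this page. The KUBECTL variable, the function name, and the two-minute timeout are illustrative choices.

```shell
# Illustrative helper (not existing tooling): KUBECTL can be overridden for dry runs.
KUBECTL="${KUBECTL:-sudo -i kubectl}"

# Restart maintain-kubeusers and wait for the new pod to become ready.
restart_maintain_kubeusers() {
  $KUBECTL -n maintain-kubeusers rollout restart deploy/maintain-kubeusers
  # Give up waiting after two minutes (arbitrary value for this sketch).
  $KUBECTL -n maintain-kubeusers rollout status deploy/maintain-kubeusers --timeout=120s
}
```

After the restart, the policy count can be checked again to see whether policies are being recreated.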
If policies are present, but not in READY state:
- verify that Kyverno pods are running correctly
- verify that Kyverno pods have enough resources allocated to them
- verify that the Kubernetes control plane is healthy, resource-wise, etc.
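The checks above can be complemented by looking at why a specific policy is not READY and at the apiserver's own readiness report. This is a sketch under assumptions: the policy name toolforge-kyverno-pod-policy is taken from the listing later on this page, the status.conditions field is where Kyverno is expected to record readiness, and the KUBECTL variable and function names are hypothetical helpers.

```shell
# Illustrative helper (not existing tooling): KUBECTL can be overridden for dry runs.
KUBECTL="${KUBECTL:-sudo -i kubectl}"

# Print the status conditions of the pod policy in one tool namespace.
# Usage: policy_conditions tool-aaabot
policy_conditions() {
  $KUBECTL -n "$1" get policy toolforge-kyverno-pod-policy \
    -o jsonpath='{.status.conditions}'
  echo
}

# Ask the apiserver for its detailed readiness checks.
control_plane_ready() {
  $KUBECTL get --raw='/readyz?verbose'
}
```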
How to remove Kyverno from the hot path (don't do this unless extreme emergency)
In case of extreme emergency, we can remove Kyverno from the hot path, thus allowing every tool user workload to be admitted into the cluster without policy verification.
This is an extreme risk security-wise, and you should never do this unless there is a major outage happening.
Run this on a Toolforge k8s control node to disable Kyverno admission configuration:
sudo -i kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg
sudo -i kubectl delete mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg
Run this to stop the main Kyverno daemon from running:
sudo -i kubectl scale deploy kyverno-admission-controller -n kyverno --replicas 0
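This page does not document how to undo the emergency procedure, so the following is only a hedged sketch. It saves copies of the webhook configurations before they are deleted and later scales the admission controller back up. The replica count of 7 matches the deployment listing in the Debugging section and may need adjusting. Whether Kyverno re-registers its webhooks on startup is an assumption to verify, which is why the sketch lists the webhook configurations at the end. The KUBECTL variable, function names, and file names are hypothetical.

```shell
# Illustrative helper (not existing tooling): KUBECTL can be overridden for dry runs.
KUBECTL="${KUBECTL:-sudo -i kubectl}"

# Save copies of both webhook configurations into a directory (default: current one).
# Run this BEFORE deleting them.
backup_kyverno_webhooks() {
  dir="${1:-.}"
  $KUBECTL get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
    -o yaml > "$dir/kyverno-validating-webhook-backup.yaml"
  $KUBECTL get mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg \
    -o yaml > "$dir/kyverno-mutating-webhook-backup.yaml"
}

# After the emergency: scale the admission controller back up, then list the
# webhook configurations so you can check that the Kyverno ones exist again.
restore_kyverno_admission() {
  $KUBECTL scale deploy kyverno-admission-controller -n kyverno --replicas 7
  $KUBECTL get validatingwebhookconfiguration,mutatingwebhookconfiguration
}
```

If the Kyverno webhook configurations do not reappear, redeploying Kyverno from the toolforge-deploy repository is the safer path.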
Error / Incident
Information about some specific alerts we have:
Toolforge Kyverno unknown state
This means Prometheus was unable to fetch one of the main metrics of Kyverno. It can mean Kyverno is down.
See the sections above for what happens when Kyverno is down and how to fix it.
Toolforge Kyverno low policy resources
This means the number of policy resources loaded into the cluster is unexpectedly low. It may indicate a misconfiguration or error in maintain-kubeusers, or that Kyverno is struggling to reconcile policies into READY status.
See the sections above for what happens when policies are missing or not READY, and how to fix it.
Toolforge Kyverno no policy resources
This means no policy resources are loaded into the cluster. It may indicate a misconfiguration or error in maintain-kubeusers, or that Kyverno is not running at all.
See the sections above for what happens when Kyverno is down or policies are missing, and how to fix it.
Debugging
Some debugging information.
How to see state of Kyverno pods
Run this:
user@tools-k8s-control-7:~$ sudo -i kubectl -n kyverno get pods
NAME                                                       READY   STATUS      RESTARTS   AGE
kyverno-admission-controller-5b9779d5c6-2zsrg              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-59f2p              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-6jk78              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-7fbcd              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-fptrv              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-ljdkl              1/1     Running     0          17d
kyverno-admission-controller-5b9779d5c6-sg5vg              1/1     Running     0          17d
kyverno-background-controller-5d6bc965bd-bjk6d             1/1     Running     0          17d
kyverno-background-controller-5d6bc965bd-nnnj4             1/1     Running     0          17d
kyverno-cleanup-admission-reports-28679960-2k4dp           0/1     Completed   0          6m10s
kyverno-cleanup-cluster-admission-reports-28679960-kkfh5   0/1     Completed   0          5m56s
kyverno-cleanup-controller-9bccdf4d6-5sgwk                 1/1     Running     0          18d
kyverno-cleanup-controller-9bccdf4d6-zh8dp                 1/1     Running     0          17d
kyverno-reports-controller-8849f9684-48xcz                 1/1     Running     0          17d
kyverno-reports-controller-8849f9684-trrr6                 1/1     Running     0          17d
user@tools-k8s-control-7:~$ sudo -i kubectl -n kyverno get deploy
NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
kyverno-admission-controller     7/7     7            7           32d
kyverno-background-controller    2/2     2            2           32d
kyverno-cleanup-controller       2/2     2            2           32d
kyverno-reports-controller       2/2     2            2           18d
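When the listing is long or you are in a hurry, a small filter can point out deployments that are not fully ready. This is a sketch that parses the plain-text output shown above (READY in the second column, formatted as ready/desired). The function name is hypothetical.

```shell
# Read `kubectl get deploy` output on stdin and print every deployment whose
# ready replica count differs from the desired count (READY column "x/y").
degraded_deployments() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1, $2 }'
}
```

Usage on a control node: sudo -i kubectl -n kyverno get deploy | degraded_deployments. No output means every deployment has all its replicas ready.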
How to see the state of maintain-kubeusers
This is how you can get information about pods for maintain-kubeusers:
user@tools-k8s-control-7:~$ sudo -i kubectl -n maintain-kubeusers get pods
NAME                                 READY   STATUS    RESTARTS         AGE
maintain-kubeusers-dc9d6978b-nthbw   1/1     Running   20 (3m44s ago)   2d
To see logs:
user@tools-k8s-control-7:~$ sudo -i kubectl -n maintain-kubeusers logs deploy/maintain-kubeusers --timestamps=true
In case you want to check the logs for a previous container restart:
user@tools-k8s-control-7:~$ sudo -i kubectl -n maintain-kubeusers logs deploy/maintain-kubeusers --timestamps=true --previous
See also Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown
How to see the policy resources
To check the policies, run this:
user@tools-k8s-control-7:~$ sudo -i kubectl get policy -A
NAMESPACE                  NAME                           BACKGROUND   VALIDATE ACTION   READY   AGE   MESSAGE
tool-a-list-bulding-tool   toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
tool-aaabot                toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
tool-aalertbot             toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
tool-abbe98tools           toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
tool-abbreviso             toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
tool-abcgames              toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
tool-abdumubot             toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
tool-abibot                toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
tool-abigor                toolforge-kyverno-pod-policy   true         Enforce           True    29d   Ready
[..]
To see how many of them are in READY status, run this:
user@tools-k8s-control-7:~$ sudo -i kubectl get policy -A | grep Ready | wc -l
3318
To see how many of them are not READY, run this (the --no-headers flag keeps the header line out of the count):
user@tools-k8s-control-7:~$ sudo -i kubectl get policy -A --no-headers | grep -v Ready | wc -l
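Matching on the word Ready anywhere in the line can miscount if the MESSAGE column of a failing policy also contains that word. A hedged alternative is to count on the READY column itself. This sketch assumes the column layout shown above, where READY is the fifth whitespace-separated field of each data row; adjust the field number if your Kyverno version prints extra columns. The function name is hypothetical.

```shell
# Read `kubectl get policy -A --no-headers` output on stdin and print
# "<ready> <not ready>" counts, based on the READY column (5th field).
count_policies() {
  awk '{ if ($5 == "True") ready++; else not_ready++ }
       END { print ready + 0, not_ready + 0 }'
}
```

Usage on a control node: sudo -i kubectl get policy -A --no-headers | count_policies. Rows where the READY column is empty or False are counted as not ready.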
Common issues
Add new issues here when you encounter them!
Related information
See upstream docs:
Old incidents
Old incidents related to Kyverno: