Kubernetes/Troubleshooting


Kubernetes has many moving parts and a wide range of configuration options, so misconfigurations happen. This page helps you troubleshoot some error cases.


Most deployment errors are caught by helmfile. If a new deployment does not become ready, helmfile rolls back automatically after a timeout (currently 300 seconds). If your deployment takes a long time and then fails on the timeout, it helps to open a second SSH session and troubleshoot the service components while the deployment is still running. Make sure you are familiar with kubectl before you start.


The following troubleshooting flowchart should give a rough guideline on how to find errors in Kubernetes service deployments. The flowchart is intended for the production Kubernetes platform, not Toolforge.

Production Kubernetes troubleshooting flowchart

Troubleshooting a deployment

If your deployment fails, helmfile will roll back on its own, so you may need to launch the deployment again to troubleshoot it in real time: helmfile -e $my_environment $my_service


From another terminal on the deployment server, look at the namespace status:

kube-env $my_service $my_environment
kubectl get events

A frequent cause is a container crashing on start. This is indicated by BackOff, CrashLoopBackOff, or Failed statuses, and by successive SuccessfulCreate and SuccessfulDelete messages for the same object.
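
The reasons listed above can be picked out of the event stream with a small filter. This is a minimal sketch, not part of the deployment tooling: the function name crash_events is hypothetical, and the pattern simply matches those reasons anywhere in an event line.

```shell
# Hypothetical helper: keep only event lines that mention one of the
# crash-related reasons. Reads `kubectl get events` output on stdin.
# "BackOff" also matches "CrashLoopBackOff".
crash_events() {
  grep -E 'BackOff|Failed|SuccessfulCreate|SuccessfulDelete'
}

# Example (after kube-env, while the deployment is in progress):
#   kubectl get events --sort-by=.lastTimestamp | crash_events
```

Sorting by .lastTimestamp puts the newest events last, which makes repeated create/delete cycles for the same object easier to spot.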

To continue troubleshooting, the deployment has to be in progress: once helmfile rolls back, the failing pods are gone.

Troubleshooting a pod

kube-env $my_service $my_environment

Get the pod id from the events, or from:

# list the pods and pick the most recent one (lowest AGE)
kubectl get pods

Get a container's logs

# Without a container name this returns an error on a multi-container pod,
# but the error message lists the container names inside the pod
kubectl logs $pod_id
kubectl logs $pod_id $container_name
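
Instead of reading the container names off an error message, they can be queried from the pod spec. The sketch below assumes a hypothetical helper name (logs_all_containers) and an arbitrary limit of 50 lines per container; -o jsonpath, -c, --tail and --previous are standard kubectl options.

```shell
# Hypothetical helper: print the last log lines of every container in a pod.
# Init containers are not included.
logs_all_containers() {
  pod_id=$1
  # Container names declared in the pod spec, space separated.
  names=$(kubectl get pod "$pod_id" -o jsonpath='{.spec.containers[*].name}')
  for container_name in $names; do
    echo "=== $container_name ==="
    # Add --previous to read the logs of the last crashed instance instead.
    kubectl logs "$pod_id" -c "$container_name" --tail=50
  done
}

# Example:
#   logs_all_containers $pod_id
```

For a container stuck in a crash loop, --previous is often the more useful variant, because the current instance may not have logged anything yet.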

Get a pod's status, exit code, and other events

kubectl get pods -o wide
kubectl describe pods $pod_id
kubectl get events --field-selector involvedObject.name=$pod_id
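
The exit code shown by kubectl describe pods (in a container's State or Last State) follows the usual Unix convention: a value above 128 means the process was killed by signal number (code - 128). A minimal sketch of that arithmetic, with a hypothetical function name:

```shell
# Hypothetical helper: translate a container exit code into a short hint.
describe_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 encode the signal that killed the process.
    echo "killed by signal $((code - 128))"
  else
    echo "exited with error code $code"
  fi
}

# Examples:
#   describe_exit_code 137   # signal 9 (SIGKILL), often an out-of-memory kill
#   describe_exit_code 143   # signal 15 (SIGTERM)
```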

Exec into a pod and run commands

Not available to mere mortals. Requires global root.
kube-env $my_service $my_environment
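# note the pod's IP and the node it runs on
kubectl get pods -o wide
ssh $node_ip
# list the pod's containers and pick the /pause container
sudo docker ps | grep $pod_name
# read the PID of the pause container's process
sudo docker top $pause_container_id
# enter the pod's network namespace (-n) using that PID
sudo nsenter -t $pause_container_pid -n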

Get a pod's IP and run a sanity check

# Get IP
kubectl get pods -o wide
# Get port
kubectl describe pods $pod_name

curl http://$pod_ip:$pod_port
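
The lookup and the request can be combined into one check. This is only a sketch under stated assumptions: the function names are hypothetical, it assumes plain HTTP and an IPv4 pod address, and the 5-second timeout is arbitrary.

```shell
# Build the URL for a pod endpoint; the path defaults to "/".
pod_url() {
  printf 'http://%s:%s%s\n' "$1" "$2" "${3:-/}"
}

# Print only the HTTP status code returned by the pod.
check_pod() {
  curl -sS --max-time 5 -o /dev/null -w '%{http_code}\n' "$(pod_url "$@")"
}

# Example:
#   pod_ip=$(kubectl get pod $pod_name -o jsonpath='{.status.podIP}')
#   check_pod "$pod_ip" $pod_port
```

Printing only the status code keeps the output short; drop -o /dev/null to see the response body as well.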

Additional Resources