
Kubernetes/Troubleshooting


Kubernetes has many moving parts and offers a wide range of configuration options, so misconfiguration can happen. This page should help you troubleshoot some common error cases.


Most of the time, errors in deployments are caught by helmfile. If a new deployment is unable to become ready, helmfile will roll back automatically after a timeout (currently 300 seconds). If your deployment takes a long time and fails after the timeout, it is helpful to start another SSH session and troubleshoot the service components from there. Make sure you are familiar with the usage of kubectl.
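
For example, while a deployment is running you can watch it from a second terminal on the deployment server. A minimal sketch, assuming the Deployment object is named after the service (adjust as needed):

kube-env $my_service $my_environment
# watch pods come up (or fail) in the service's namespace
kubectl get pods -w
# check whether the rollout is progressing (deployment name assumed)
kubectl rollout status deployment/$my_service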


The following troubleshooting flowchart should give a rough guideline on how to find errors in Kubernetes service deployments. The flowchart is intended for the production Kubernetes platform, not Toolforge.

Production Kubernetes troubleshooting flowchart

Troubleshooting a deployment

If your deployment fails, helmfile will roll back on its own, so launching the deployment again may be necessary for real-time troubleshooting (run from the service's directory under helmfile.d): helmfile -e $my_environment apply

From another terminal on the deployment server, check for Kubernetes events in the namespace of the deployment:

kube-env $my_service $my_environment
kubectl get events

A frequent cause is a container crashing on start. This is indicated by BackOff, CrashLoopBackOff, or Failed statuses, and by successive SuccessfulCreate and SuccessfulDelete messages for the same object.
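
To spot these quickly in a busy namespace, the events can be sorted by time and filtered for the relevant reasons; a small sketch (the grep pattern is just an example):

kubectl get events --sort-by=.lastTimestamp | grep -E 'BackOff|Failed'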

Another cause to consider is insufficient quota, indicated by FailedCreate events reporting exceeded quota with details on the resources at fault. If more resources are needed, namespace-level limits are controlled in helmfile.d/admin_ng/values (e.g., main.yaml for wikikube) in operations/deployment-charts. See Kubernetes/Resource_requests_and_limits for background on the resource model.
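
Before raising the limits in deployment-charts, it can help to check the namespace's current quota usage directly; a minimal sketch:

# show requests/limits quota usage for the namespace
kubectl describe resourcequota
# machine-readable view
kubectl get resourcequota -o yaml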

Much of the above also applies to Kubernetes deployments that are unhealthy at steady-state (in the absence of a helmfile deployment), indicated by unavailable pods due to container crashes or missing quota. Troubleshooting the underlying container issues (next section) or adjusting quotas as needed are good courses of action, respectively.
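
To spot a deployment that is unhealthy at steady state, compare the ready/available counts against the desired replicas and look for pods that are not Running; a sketch:

kubectl get deployments            # READY and AVAILABLE should match the desired replica count
kubectl get pods | grep -v Running # Pending, CrashLoopBackOff, Error etc. need a closer look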

Note: Since helmfile rolls back a failed release, the failing pods only exist while the deployment is in progress; to continue troubleshooting a failed helmfile deployment, re-run it and investigate before the rollback kicks in.

Troubleshooting a pod

kube-env $my_service $my_environment

Get the pod id from the events, or from

# get the most recent pod id
kubectl get pods

Get a container's logs

# Will return an error, but give you the container names inside the pod
kubectl logs $pod_id
kubectl logs $pod_id $container_name
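
If the container has already crashed and been restarted, the current logs may be empty; the logs of the previous instance are usually more useful. A sketch:

# logs of the previous (crashed) instance of the container
kubectl logs $pod_id $container_name --previous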

Get a pod's status, exit code, and other events

kubectl get pods -o wide
kubectl describe pods $pod_id
kubectl get events --field-selector involvedObject.name=$pod_id
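
A crashed container's exit code can also be pulled out directly with a jsonpath query instead of reading through the describe output; a sketch (empty output means no container has terminated yet):

kubectl get pod $pod_id -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'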

Exec into a pod and run commands

Not available to mere mortals. Requires global root.
kube-env $my_service $my_environment
kubectl get pods -o wide
# get the ip and the node
ssh $node_ip
sudo docker ps | grep $pod_name
# take the /pause container
sudo docker top $pause_container_id
# take the PID
sudo nsenter -t $pause_container_pid -n
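
Once inside the pod's network namespace you can inspect its sockets and talk to the service locally; a minimal sketch (the port is an example):

# which ports are listening, and which process owns them
ss -tlnp
# talk to the service on its local port
curl -v http://localhost:$pod_port/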

Strace a pod main container process

Not available to mere mortals. Requires global root.
kube-env $my_service $my_environment
kubectl get pods -o wide
# get the ip and the node
ssh $node_ip
sudo docker ps | grep ${container_name}_${pod_name}
sudo docker top $container_id
# take the PID
sudo strace -p ${container_process_pid}
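
strace on a busy process produces a lot of output; following forks and filtering by syscall family usually helps. A sketch (the filter is just an example):

# follow child threads/processes and only show network-related syscalls
sudo strace -f -e trace=network -p ${container_process_pid}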

Get a pod's IP and run a sanity check

# Get IP
kubectl get pods -o wide
# Get port
kubectl describe pods $pod_name

curl http://$pod_ip:$pod_port
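
The IP and container port can also be extracted with jsonpath so the check can be scripted; a sketch, assuming a single container with a single declared port:

pod_ip=$(kubectl get pod $pod_name -o jsonpath='{.status.podIP}')
pod_port=$(kubectl get pod $pod_name -o jsonpath='{.spec.containers[0].ports[0].containerPort}')
curl -v "http://${pod_ip}:${pod_port}/"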

Additional Resources