Kubernetes/Troubleshooting
Kubernetes has many moving parts and offers a wide range of configuration options, so misconfiguration can happen. This page should help with troubleshooting some common error cases.
Most of the time, errors in deployments are caught by helmfile. If a new deployment is unable to become ready, helmfile will roll back automatically after a timeout (currently 300 seconds). If your deployment takes a long time and fails after a timeout, it is helpful to start a second SSH session and troubleshoot the service components from there. Make sure you are familiar with the usage of kubectl.
The following troubleshooting flowchart should give a rough guideline on how to find errors in Kubernetes service deployments. The flowchart is intended for the production Kubernetes platform, not Toolforge.
Troubleshooting a deployment
If your deployment fails, helmfile will roll back on its own, so launching the deployment again may be necessary for real-time troubleshooting: helmfile -e $my_environment apply (run from the service's directory under helmfile.d/services in operations/deployment-charts)
From another terminal on the deployment server, check for Kubernetes events in the namespace of the deployment:
kube-env $my_service $my_environment
kubectl get events
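If the namespace has many events, sorting them by timestamp makes the recent ones easier to spot, for example:
# list events, most recent last
kubectl get events --sort-by=.lastTimestamp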
A frequent cause is a container crashing on start. This is indicated by BackOff, CrashLoopBackOff or Failed statuses and successive SuccessfulCreate and SuccessfulDelete messages for the same object.
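If you only want to see those events, kubectl can filter on the event reason, for example:
# show only events with reason BackOff (adjust the reason as needed)
kubectl get events --field-selector reason=BackOff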
Another cause to consider is insufficient quota, indicated by FailedCreate events reporting exceeded quota with details on the resources at fault. If more resources are needed, namespace-level limits are controlled in helmfile.d/admin_ng/values (e.g., main.yaml for wikikube) in operations/deployment-charts. See Kubernetes/Resource_requests_and_limits for background on the resource model.
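Before raising the limits, it can help to compare the namespace's current usage against its quota, for example:
# show the namespace's resource quota and current usage
kubectl describe resourcequota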
Much of the above also applies to Kubernetes deployments that are unhealthy at steady-state (in the absence of a helmfile deployment), indicated by unavailable pods due to container crashes or missing quota. Troubleshooting the underlying container issues (next section) or adjusting quotas as needed are good courses of action, respectively.
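For a deployment that is unhealthy at steady-state, the deployment status usually shows which replicas are unavailable and why ($my_deployment below is a placeholder for the deployment name):
kube-env $my_service $my_environment
# list deployments with their ready/available replica counts
kubectl get deployments
# show rollout status, conditions and related events for one deployment
kubectl describe deployment $my_deployment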
Note: To continue troubleshooting a failed helmfile deployment, the deployment has to be in progress (re-run it as described above if helmfile has already rolled it back).
Troubleshooting a pod
kube-env $my_service $my_environment
Get the pod id from the events, or from
# get the most recent pod id
kubectl get pods
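If there are many pods in the namespace, sorting by creation time puts the most recent pod last, for example:
# most recently created pods are listed last
kubectl get pods --sort-by=.metadata.creationTimestamp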
Get a container's logs
# For a multi-container pod this returns an error, but it lists the container names inside the pod
kubectl logs $pod_id
kubectl logs $pod_id $container_name
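If the container keeps restarting, the logs of the previous (crashed) instance are often more useful than the current ones, for example:
# logs of the previously terminated instance of a container
kubectl logs $pod_id -c $container_name --previous
# logs of all containers in the pod at once
kubectl logs $pod_id --all-containers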
Get a pod's status, exit code, and other events
kubectl get pods -o wide
kubectl describe pods $pod_id
kubectl get events --field-selector involvedObject.name=$pod_id
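The last termination exit code can also be extracted directly if the describe output is too long, for example with jsonpath:
# last termination exit code of each container in the pod
kubectl get pod $pod_id -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'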
Exec into a pod and run commands
kube-env $my_service $my_environment
kubectl get pods -o wide
# get the ip and the node
ssh $node_ip
# For docker:
sudo docker ps | grep $pod_name
# use the container id of the pod's /pause container as $container_id
sudo docker top $container_id
# For containerd:
sudo crictl ps | grep $pod_name
# first column contains the container id
sudo crictl inspect --output go-template --template '{{.info.pid}}' $container_id
# take the PID
sudo nsenter -t $container_pid -n
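Once inside the pod's network namespace, the usual networking tools on the node can be used (assuming they are installed there); $pod_port is a placeholder for the port the service listens on:
# list listening sockets inside the pod's network namespace
ss -tlnp
# talk to the service as if from inside the pod
curl -v http://localhost:$pod_port/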
Strace a pod's main container process
kube-env $my_service $my_environment
kubectl get pods -o wide
# get the ip and the node
ssh $node_ip
sudo docker ps | grep ${container_name}_${pod_name}
sudo docker top $container_id
# take the PID
sudo strace -p ${container_process_pid}
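On containerd hosts, find the PID with crictl as shown in the previous section. strace filtering can also keep the output manageable, for example:
# follow child processes and only trace network-related system calls
sudo strace -f -p ${container_process_pid} -e trace=network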
Get a pod's IP and run a sanity check
# Get IP
kubectl get pods -o wide
# Get port
kubectl describe pods $pod_name
curl http://$pod_ip:$pod_port
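The pod IP can also be fetched directly, which is convenient in scripts, for example with jsonpath:
# print just the pod's IP address
kubectl get pod $pod_name -o jsonpath='{.status.podIP}'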