Kubernetes has many moving parts and a wide range of configuration options, so misconfiguration is easy. This page helps you troubleshoot some common error cases.
Most deployment errors are caught by helmfile. If a new deployment does not become ready, helmfile rolls back automatically after a timeout (currently 300 seconds). If your deployment takes a long time and fails at the timeout, start a second SSH session and troubleshoot the service components while the deployment is still running. Make sure you are familiar with kubectl before you start.
The following troubleshooting flowchart should give a rough guideline on how to find errors in Kubernetes service deployments. The flowchart is intended for the production Kubernetes platform, not Toolforge.
Troubleshooting a deployment
If your deployment fails, helmfile rolls back on its own, so you may need to launch the deployment again to troubleshoot it in real time:
helmfile -e $my_environment $my_service
From another terminal on the deployment server, look at the namespace status:
kube-env $my_service $my_environment
kubectl get events
A frequent cause is a container crashing on start. This is indicated by Failed statuses and successive SuccessfulDelete messages for the same object.
To continue troubleshooting, the deployment has to be in progress.
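The Failed and SuccessfulDelete pattern described above can be pulled out of a long event list with a small filter. This is a minimal sketch, not part of the deployment tooling: the function name crash_events is made up for illustration, and it assumes the default column layout of kubectl get events (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE), which may differ between kubectl versions.

```shell
# Sketch: keep only events whose REASON column hints at a crashing container.
# Assumption: the default `kubectl get events` columns are
# LAST SEEN, TYPE, REASON, OBJECT, MESSAGE, so REASON is the third field.
crash_events() {
    # Reads event lines on stdin and prints those whose third
    # whitespace-separated field is Failed or SuccessfulDelete.
    awk '$3 == "Failed" || $3 == "SuccessfulDelete"'
}

# Usage, after kube-env has been run for the service:
#   kubectl get events --no-headers | crash_events
```

Repeated lines for the same object in the filtered output suggest that pods are being created, failing, and being deleted in a loop.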
Troubleshooting a pod
kube-env $my_service $my_environment
Get the pod id from the events, or from:
# get the most recent pod id
kubectl get pods
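When the namespace holds many pods, picking the most recent one by eye is error-prone. The sketch below sorts pods by creation time and prints the name of the newest one. The function name newest_pod is hypothetical, and the sketch assumes the pod name is the first column of the default kubectl get pods output.

```shell
# Sketch: print the name of the most recently created pod
# in the namespace selected by kube-env.
newest_pod() {
    # --sort-by orders pods oldest first, so the newest pod is the last line;
    # the pod name is the first column of the default table output.
    kubectl get pods --no-headers --sort-by=.metadata.creationTimestamp \
        | tail -n 1 \
        | awk '{print $1}'
}

# Usage:
#   pod_id="$(newest_pod)"
```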
Get a container's logs
# Will return an error, but give you the container names inside the pod
kubectl logs $pod_id
kubectl logs $pod_id $container_name
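Instead of provoking an error to learn the container names, they can be read from the pod spec and looped over. This is a sketch under assumptions: the function names pod_containers and all_container_logs are invented here, and the 50-line tail is an arbitrary choice.

```shell
# Sketch: list the containers of a pod and print recent logs for each one.
pod_containers() {
    # $1: pod id. Prints the container names separated by spaces.
    kubectl get pod "$1" -o jsonpath='{.spec.containers[*].name}'
}

all_container_logs() {
    # $1: pod id. Prints the last 50 log lines of every container in the pod.
    for container in $(pod_containers "$1"); do
        echo "=== $container ==="
        kubectl logs "$1" -c "$container" --tail=50
    done
}

# Usage:
#   all_container_logs "$pod_id"
```

For a container that has already restarted, adding --previous to kubectl logs shows the output of the crashed instance rather than the current one.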
Get a pod's status, exit code, and other events
kubectl get pods -o wide
kubectl describe pods $pod_id
kubectl get events --field-selector involvedObject.name=$pod_id
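The exit code is buried in the kubectl describe output, so it can help to extract it directly. The sketch below prints the last exit code of each container and gives a rough reading of it. Both function names are hypothetical. The interpretation relies on the common convention that an exit code above 128 means the process was terminated by signal (code - 128), for example 137 for SIGKILL; an application can also choose such a code itself, so treat the result as a hint.

```shell
# Sketch: show the last termination exit code of each container in a pod.
last_exit_codes() {
    # $1: pod id. Prints one "name<TAB>exit code" line per container;
    # the exit code is empty if the container has not terminated before.
    kubectl get pod "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Sketch: rough interpretation of an exit code (convention, not a guarantee).
explain_exit_code() {
    code="$1"
    if [ "$code" -eq 0 ]; then
        echo "exited cleanly"
    elif [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
    else
        echo "application error (exit code $code)"
    fi
}

# Usage:
#   last_exit_codes "$pod_id"
#   explain_exit_code 137
```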
Exec into a pod and run commands
kube-env $my_service $my_environment
kubectl get pods -o wide # get the ip and the node
ssh $node_ip
sudo docker ps | grep $pod_name # take the /pause container
sudo docker top $pause_container_id # take the PID
sudo nsenter -t $pause_container_pid -n
Get a pod's IP and run a sanity check
# Get IP
kubectl get pods -o wide
# Get port
kubectl describe pods $pod_name
curl http://$pod_ip:$pod_port
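The IP lookup and the request can be combined into one step. This is a minimal sketch: the function name check_pod_http is made up, the port still has to be taken from kubectl describe, the 5-second timeout is an arbitrary choice, and the /healthz path in the usage line is only a placeholder for whatever endpoint your service exposes.

```shell
# Sketch: look up a pod's IP and send it a single HTTP request.
check_pod_http() {
    # $1: pod name, $2: port, $3: optional URL path (defaults to /)
    pod_ip="$(kubectl get pod "$1" -o jsonpath='{.status.podIP}')"
    # --fail makes curl return non-zero on HTTP errors (4xx/5xx),
    # --max-time stops the request from hanging on an unreachable pod.
    curl --silent --show-error --fail --max-time 5 "http://${pod_ip}:$2${3:-/}"
}

# Usage (hypothetical port and path):
#   check_pod_http "$pod_name" 8080 /healthz
```

Run it from a host that can reach the pod network; a non-zero exit status means the pod did not answer or returned an HTTP error.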