Portal:Toolforge/Admin/Runbooks/TektonDown
Overview
This runbook applies when the tekton-pipelines-controller pod in the tekton-pipelines namespace of the tools or toolsbeta k8s cluster is down or cannot be reached.
The procedures in this runbook require admin permissions to complete.
Error / Incident
This usually comes in the form of an alert in alertmanager.
The alert tells you which project (tools, toolsbeta, ...) it is failing for.
Debugging
The first step is usually to ssh to a k8s control node of tools or toolsbeta, depending on the project the alert is from (e.g. toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud). From there you can:
- Check that the pods are running, as shown below. If they are not running, try redeploying tekton by following the instructions in https://github.com/toolforge/buildservice/blob/main/README.md:
toolsbeta-test-k8s-control-4:/# sudo -i
root@toolsbeta-test-k8s-control-4:/# kubectl get pods -n tekton-pipelines
NAME READY STATUS RESTARTS AGE
tekton-pipelines-controller-5c78ddd49b-dj4hz 1/1 Running 0 34d
tekton-pipelines-webhook-5d899cc8c-zwf7p 1/1 Running 0 34d
- You can also check the logs of the controller deployment with:
kubectl logs deploy/tekton-pipelines-controller -n tekton-pipelines
- It might also make sense to check whether there have been any recent code changes or re-deployment attempts. Again, a good place to start is the recent commits in https://github.com/toolforge/buildservice
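The pod check above can be scripted when there are many pods to scan. The following is a minimal sketch, not part of the buildservice tooling: the helper name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) with the header suppressed through the --no-headers flag.

```shell
# Hypothetical helper (not part of any Toolforge tooling).
# Reads `kubectl get pods --no-headers` output on stdin and prints the name
# of every pod that is not in the Running state or is not fully ready.
unhealthy_pods() {
    awk '{
        # READY looks like "1/1": ready containers / total containers.
        split($2, ready, "/")
        if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}

# Usage on a control node, as root (assumed invocation):
#   kubectl get pods -n tekton-pipelines --no-headers | unhealthy_pods
```

An empty result suggests all pods in the namespace are up. Any name printed is a candidate for a closer look with kubectl logs or kubectl describe pod.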
Common issues
Add new issues here when you encounter them!
Issue 1
...
Related information
Old incidents
Add any incident tasks here!