Portal:Toolforge/Admin/Runbooks/TektonDown
This alert fires when the tekton-pipelines-controller pod in the tekton-pipelines namespace of the tools or toolsbeta k8s cluster is down or can't be reached.
Error / Incident
This usually comes in the form of an alert in alertmanager.
The alert tells you which project (tools, toolsbeta, ...) it is failing for.
Debugging
The most likely first step is to ssh to one of the k8s control nodes of tools or toolsbeta, depending on the project the alert is from (e.g. toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud). From there you can:
- check that the pods are running:
toolsbeta-test-k8s-control-4:/# sudo -i
root@toolsbeta-test-k8s-control-4:/# kubectl get pods -n tekton-pipelines
NAME                                           READY   STATUS    RESTARTS   AGE
tekton-pipelines-controller-5c78ddd49b-dj4hz   1/1     Running   0          34d
tekton-pipelines-webhook-5d899cc8c-zwf7p       1/1     Running   0          34d
- You can also check the logs of the controller deployment with:
kubectl logs deploy/tekton-pipelines-controller -n tekton-pipelines
- It might also make sense to check whether there have been any recent code changes or re-deployment attempts. A good place to start is the recent commits in the builds-builder repo or the toolforge-deploy gitlab repo.
- If the pods or the deployment don't exist, you can try redeploying tekton by following the instructions in the toolforge-deploy repo (it will do nothing if there's nothing to do).
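The checks above can be strung together into one quick triage pass. This is only a sketch: the function names `tekton_ready_state` and `tekton_triage` are made up for illustration, and it assumes you are root on a k8s control node with a working `kubectl`.

```shell
#!/usr/bin/env bash
# Hypothetical triage helpers, not part of any Toolforge tooling.
NAMESPACE="tekton-pipelines"
DEPLOYMENT="tekton-pipelines-controller"

# Classify a "ready/desired" string such as "1/1", i.e. the READY column
# printed by `kubectl get deploy` or `kubectl get pods`.
tekton_ready_state() {
    local ready="${1%%/*}" desired="${1##*/}"
    if [ "$desired" = "0" ]; then
        echo "scaled-down"
    elif [ "$ready" = "$desired" ]; then
        echo "ok"
    else
        echo "degraded"
    fi
}

# Run the checks in order, stopping early if the deployment is missing.
tekton_triage() {
    kubectl get deploy "$DEPLOYMENT" -n "$NAMESPACE" || return 1
    kubectl get pods -n "$NAMESPACE"
    # Waits up to 30s for the rollout to be complete, then reports.
    kubectl rollout status "deploy/$DEPLOYMENT" -n "$NAMESPACE" --timeout=30s
    # The most recent events often explain crash loops or scheduling failures.
    kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -n 20
    kubectl logs "deploy/$DEPLOYMENT" -n "$NAMESPACE" --tail=50
}
```

Source the snippet and run `tekton_triage`. If the first command fails because the deployment does not exist, that points at the redeploy step rather than at a crashing pod.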
Doing a manual curl for the stats
You can try curling the pods directly for the statistics. The prometheus configuration gives you the cert, key and URL to use:
root@tools-prometheus-6:~# grep 'job_name.*tekton' -A 40 /srv/prometheus/tools/prometheus.yml
- job_name: tekton-pipelines-controller
  scheme: https
  tls_config:
    insecure_skip_verify: true
    cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
    key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
  kubernetes_sd_configs:
  - api_server: https://k8s.tools.eqiad1.wikimedia.cloud:6443
    role: pod
    tls_config:
      insecure_skip_verify: true
      cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
      key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
    namespaces:
      names:
      - tekton-pipelines
  relabel_configs:
  ...
  - source_labels:
    - __meta_kubernetes_pod_name
    regex: "(tekton-pipelines-controller-[a-zA-Z0-9]+-[a-zA-Z0-9]+)"
    target_label: __metrics_path__
    replacement: "/api/v1/namespaces/tekton-pipelines/pods/${1}:9090/proxy/metrics"
Then you can curl the pods directly by name, like:
root@tools-prometheus-6:~# curl \
--insecure \
--cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt \
--key /etc/ssl/private/toolforge-k8s-prometheus.key \
'https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/tekton-pipelines/pods/tekton-pipelines-controller-6f6bd874d9-kz9g2:9090/proxy/metrics'
....
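Since the pod name changes on every redeploy, it can help to build the URL from the name instead of editing it by hand. The sketch below mirrors the `replacement` pattern from the prometheus config. The function names are hypothetical, and it assumes that `jq` is installed on the prometheus host and that the prometheus client cert is allowed to list pods in the namespace (it should be, since prometheus uses it for pod discovery, but this has not been verified here).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; values are copied from the prometheus config above.
API_SERVER="https://k8s.tools.eqiad1.wikimedia.cloud:6443"
NAMESPACE="tekton-pipelines"
CERT="/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
KEY="/etc/ssl/private/toolforge-k8s-prometheus.key"

# Build the metrics proxy URL for a pod name ($1), same shape as the
# relabel_configs replacement.
tekton_metrics_url() {
    printf '%s/api/v1/namespaces/%s/pods/%s:9090/proxy/metrics\n' \
        "$API_SERVER" "$NAMESPACE" "$1"
}

# List the controller pod names through the k8s API (assumes jq).
tekton_controller_pods() {
    curl --silent --show-error --insecure --cert "$CERT" --key "$KEY" \
        "$API_SERVER/api/v1/namespaces/$NAMESPACE/pods" \
        | jq -r '.items[].metadata.name
                 | select(startswith("tekton-pipelines-controller-"))'
}

# Print the first lines of the metrics of every controller pod found.
tekton_fetch_metrics() {
    local pod
    for pod in $(tekton_controller_pods); do
        echo "== $pod"
        curl --silent --show-error --insecure --cert "$CERT" --key "$KEY" \
            "$(tekton_metrics_url "$pod")" | head -n 20
    done
}
```

Run `tekton_fetch_metrics` as root on the prometheus host (the key file is private). If `tekton_controller_pods` already fails with a TLS or authorization error, the problem is more likely the cert than tekton itself.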
Common issues
Add new issues here when you encounter them!
Prometheus k8s cert expired
If tekton seems to be up, check whether the certificates that prometheus uses to connect to k8s have expired:
root@tools-prometheus-6:/srv/prometheus/tools# grep cert_file /srv/prometheus/tools/prometheus.yml
cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
...
root@tools-prometheus-6:/srv/prometheus/tools# openssl x509 -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt -text
Certificate:
...
Validity
Not Before: Jun 2 11:55:07 2022 GMT
Not After : Jun 2 11:55:07 2023 GMT <-- this one should be later than today
To refresh the certificate and fix the issue, follow Portal:Toolforge/Admin/Kubernetes/Certificates#Operations.
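Instead of reading the dates by eye, `openssl x509 -checkend` can do the comparison for you. The following is a minimal sketch: the function name `cert_expiry_status` and the 7-day warning window are arbitrary choices for illustration, not an existing check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a certificate is expired,
# close to expiring, or still valid.
CERT="/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"

# $1: path to a PEM certificate
# $2: warning window in seconds (default 604800, i.e. 7 days)
cert_expiry_status() {
    local cert="$1" warn="${2:-604800}"
    if [ ! -r "$cert" ]; then
        echo "unreadable"
        return 2
    fi
    # -checkend N exits non-zero if the cert expires within N seconds.
    if ! openssl x509 -in "$cert" -noout -checkend 0 >/dev/null 2>&1; then
        echo "expired"
    elif ! openssl x509 -in "$cert" -noout -checkend "$warn" >/dev/null 2>&1; then
        echo "expiring-soon"
    else
        echo "valid"
    fi
}
```

For example, `cert_expiry_status "$CERT"` on the prometheus host. To print just the expiry date, `openssl x509 -in "$CERT" -noout -enddate` shows the same "Not After" value without the rest of the certificate dump.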
Related information
Old incidents
Add any incident tasks here!
- phab:T338025 - [T338025] [tools] Prometheus k8s cert expired