Portal:Toolforge/Admin/Runbooks/TektonDown

This alert fires when the tekton-pipelines-controller pod in the tekton-pipelines namespace of the tools/toolsbeta k8s cluster is down or can't be reached.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

The alert tells you which project (tools, toolsbeta, ...) it is failing for.

Debugging

The first step is usually to ssh to one of the k8s control nodes of the project the alert is from, tools or toolsbeta (e.g. toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud). From there you can list the pods in the tekton-pipelines namespace:

toolsbeta-test-k8s-control-4:/# sudo -i
root@toolsbeta-test-k8s-control-4:/# kubectl get pods -n tekton-pipelines
NAME                                           READY   STATUS    RESTARTS   AGE
tekton-pipelines-controller-5c78ddd49b-dj4hz   1/1     Running   0          34d
tekton-pipelines-webhook-5d899cc8c-zwf7p       1/1     Running   0          34d
  • You can also check the log of the pod's deployment with kubectl logs deploy/tekton-pipelines-controller -n tekton-pipelines.
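
The pod listing above can also be filtered so that only pods needing attention are shown. This is a minimal sketch, not part of the official tooling: the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) as shown above.

```shell
#!/bin/sh
# unhealthy_pods (hypothetical helper): reads `kubectl get pods` output on
# stdin and prints every pod that is not Running or not fully ready.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
    awk 'NR > 1 {
        # READY looks like "1/1": ready containers / total containers
        split($2, ready, "/")
        if ($3 != "Running" || ready[1] != ready[2]) {
            print $1, $3, $2
        }
    }'
}

# Usage on a control node, as root:
#   kubectl get pods -n tekton-pipelines | unhealthy_pods
```

Empty output means every pod in the namespace is Running with all of its containers ready. Anything printed is a candidate for a closer look with kubectl logs.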

Common issues

Add new issues here when you encounter them!

Prometheus k8s cert expired

If tekton seems up, the problem may be that prometheus can't reach k8s. On the prometheus host, check whether the certificate that prometheus uses to connect to k8s has expired:

root@tools-prometheus-6:/srv/prometheus/tools# grep cert_file prometheus.yml
    cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
    ...
    
root@tools-prometheus-6:/srv/prometheus/tools# openssl x509 -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt -text
Certificate:                                                    
...                                 
        Validity                                                
            Not Before: Jun  2 11:55:07 2022 GMT                
            Not After : Jun  2 11:55:07 2023 GMT   <-- this one should be later than today

To refresh the certificate and fix the issue, follow Portal:Toolforge/Admin/Kubernetes/Certificates#Operations.
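
The expiry check above can also be done without reading the full certificate dump. This is a minimal sketch: the function name cert_valid_for and the 14-day threshold are illustrative assumptions. The certificate path is the one from the grep output above. It relies on the -enddate and -checkend options of openssl x509.

```shell
#!/bin/sh
# cert_valid_for CERT_FILE DAYS (hypothetical helper)
# Returns 0 if the certificate is still valid DAYS days from now,
# and 1 if it has expired or will expire within that window.
cert_valid_for() {
    cert_file=$1
    days=$2
    # -checkend takes seconds; openssl exits 1 if the cert expires in that time
    openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" >/dev/null
}

# Usage on the prometheus host (14 days is an arbitrary example threshold):
#   cert=/etc/ssl/localcerts/toolforge-k8s-prometheus.crt
#   openssl x509 -in "$cert" -noout -enddate    # prints only the notAfter date
#   cert_valid_for "$cert" 14 || echo "certificate expires within 14 days"
```

Passing 0 as the number of days checks only whether the certificate has already expired.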

Related information

Old incidents

Add any incident tasks here!

  • phab:T338025 - [T338025] [tools] Prometheus k8s cert expired