Portal:Toolforge/Admin/Runbooks/PrometheusK8sCertExpirySoon

This alert fires when the certificate that Prometheus uses to fetch data from k8s is about to expire.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

The alert tells you which project (tools, toolsbeta, ...) and which instance (VM) holds the certificate that is about to expire. All instances should have similar expiry times, but they may have drifted out of sync.

Debugging

We don't yet have a way to automatically refresh the certs Prometheus uses to authenticate against k8s, so they need to be renewed manually. Once they expire, Prometheus can no longer get metrics from k8s, so every k8s-related metric will be missing.

You can manually check whether the certificates that Prometheus uses to connect to k8s have expired or are about to:

root@tools-prometheus-6:/srv/prometheus/tools# grep cert_file /srv/prometheus/tools/prometheus.yml
    cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
    ...
    
root@tools-prometheus-6:/srv/prometheus/tools# openssl x509 -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt -text
Certificate:                                                    
...                                 
        Validity                                                
            Not Before: Jun  2 11:55:07 2022 GMT                
            Not After : Jun  2 11:55:07 2023 GMT   <-- this one should be later than today

If the certificate has expired or is about to, you can follow this guide to renew it.
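
The manual check above can also be scripted. The following is a minimal sketch, not part of the Toolforge tooling: the function name `cert_expires_within` and the 30-day default window are assumptions made for illustration, and the window is not necessarily the threshold the alert uses. It relies only on the `-enddate` and `-checkend` options of `openssl x509`.

```shell
# Hypothetical helper (not part of the Toolforge tooling): reports whether a
# certificate expires within the given number of days.
# Exit status: 0 = expires within the window (or has already expired),
#              1 = still valid beyond the window,
#              2 = certificate file missing, unreadable or not parseable.
cert_expires_within() {
    local cert_file="$1"
    local days="${2:-30}"   # assumed default, not the alert threshold

    if [ ! -r "$cert_file" ]; then
        echo "cannot read certificate: $cert_file" >&2
        return 2
    fi

    # Print the expiry date, in the form "notAfter=<date>".
    if ! openssl x509 -in "$cert_file" -noout -enddate; then
        echo "cannot parse certificate: $cert_file" >&2
        return 2
    fi

    # -checkend exits 0 when the certificate is still valid after that many
    # seconds, and non-zero when it will have expired by then.
    if openssl x509 -in "$cert_file" -noout \
            -checkend "$(( days * 86400 ))" >/dev/null; then
        return 1
    fi
    return 0
}

# Example, run as root on the Prometheus instance:
#   cert_expires_within /etc/ssl/localcerts/toolforge-k8s-prometheus.crt 30 \
#       && echo "certificate needs renewal soon"
```

The certificate path in the example is the one shown by the `grep cert_file` output above; if `prometheus.yml` lists other `cert_file` entries, the same function can be called once per path.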

Old incidents

Add any incident tasks here!