Portal:Toolforge/Admin/Runbooks/PrometheusK8sCertExpirySoon
This happens when the certificate that prometheus uses to fetch data from k8s is about to expire.
The procedures in this runbook require admin permissions to complete.
Error / Incident
This usually comes in the form of an alert in alertmanager.
The alert tells you which project (tools, toolsbeta, ...) and which instance (VM) holds the certificate that is about to expire. All instances should have similar expiry times, though they may have drifted out of sync.
Debugging
We don't yet have a way to automatically refresh the certificates that prometheus uses to authenticate against k8s, so they need to be renewed manually. Once they expire, prometheus can no longer fetch metrics from k8s, so every k8s-related metric will simply be missing.
You can check manually if the certificates that prometheus uses to connect to k8s have expired:
root@tools-prometheus-6:/srv/prometheus/tools# grep cert_file /srv/prometheus/tools/prometheus.yml
cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
...
root@tools-prometheus-6:/srv/prometheus/tools# openssl x509 -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt -text
Certificate:
...
Validity
Not Before: Jun 2 11:55:07 2022 GMT
Not After : Jun 2 11:55:07 2023 GMT <-- this one should be later than today
If the certificate has expired or is about to expire, you can follow this guide.
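Instead of reading the dates in the full `-text` output, the same check can be scripted. The following is a minimal sketch, not part of the official procedure: the function name `cert_expires_within` is hypothetical, and it relies only on the standard `openssl x509 -checkend` option, which exits 0 when the certificate is still valid the given number of seconds from now.

```shell
# Hypothetical helper: report whether a certificate expires within a time window.
# Returns 0 if the certificate has expired or expires within the window,
#         1 if it stays valid for the whole window,
#         2 if the file cannot be read or parsed as a certificate.
cert_expires_within() {
    cert_path="$1"
    window_seconds="$2"

    if [ ! -r "$cert_path" ]; then
        echo "cannot read $cert_path" >&2
        return 2
    fi
    # Make sure the file really is an X.509 certificate before checking dates.
    if ! openssl x509 -in "$cert_path" -noout >/dev/null 2>&1; then
        echo "cannot parse $cert_path as a certificate" >&2
        return 2
    fi
    # -checkend exits 0 when the certificate is still valid after the window.
    if openssl x509 -in "$cert_path" -noout -checkend "$window_seconds" >/dev/null; then
        return 1
    fi
    return 0
}

# Example call (the path comes from the prometheus.yml output above; the
# 30-day window is an arbitrary choice, not the threshold used by the alert):
# cert_expires_within /etc/ssl/localcerts/toolforge-k8s-prometheus.crt 2592000 && echo "renew soon"
```

To print only the expiry date, `openssl x509 -in <cert> -noout -enddate` can be used with the same certificate path.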
Related information
- Karma UI (use project=tools or project=toolsbeta for filtering)
- Tools prometheus
- Toolsbeta prometheus
- Alerts repository
- Toolforge admin docs
Old incidents
Add any incident tasks here!