Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown
This happens when there are no ready envvars-admission pods in the envvars-admission namespace of the tools/toolsbeta k8s cluster, or when there is no information about them (no metrics).
Error / Incident
This usually comes in the form of an alert in alertmanager.
The alert tells you which project (tools, toolsbeta, ...) it is failing for.
Debugging
The most likely first step is to ssh to one of the k8s-control servers of tools or toolsbeta, depending on which project the alert is from (e.g. toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud). From there you can:
- Check that the pods are running:
dcaro@tools-k8s-control-9:~# sudo -i
root@tools-k8s-control-9:~# kubectl get pods -n envvars-admission
NAME READY STATUS RESTARTS AGE
envvars-admission-78d68c8648-4fplx 1/1 Running 0 6d21h
envvars-admission-78d68c8648-v9n6m 1/1 Running 0 28h
- You can also check the logs of the deployment's pods with:
kubectl logs -n envvars-admission deploy/envvars-admission
- You can force a restart of all the pods with a rollout restart:
kubectl rollout restart -n envvars-admission deployment/envvars-admission
- It might also make sense to check whether there have been any recent code changes or re-deployment attempts. A good place to start is the recent commits in the envvars-admission gitlab repo or the toolforge-deploy gitlab repo.
- If the pods or the deployment do not exist, you can try redeploying envvars-admission by following the instructions in the toolforge repo (it will do nothing if there's nothing to do).
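The pod check from the list above can be scripted as a rough sketch. The function names `count_ready_pods` and `check_envvars_admission` are hypothetical helpers, not part of any Toolforge tooling; they only parse the default table output of `kubectl get pods` as shown in the example above.

```shell
# Count pods whose READY column is fully satisfied (n/n, n > 0) and whose
# STATUS is Running. Reads the table printed by `kubectl get pods` on stdin.
count_ready_pods() {
    awk 'NR > 1 {
        # $2 is the READY column, e.g. "1/1"; $3 is the STATUS column.
        split($2, ready, "/")
        if (ready[1] == ready[2] && ready[2] > 0 && $3 == "Running") count++
    }
    END { print count + 0 }'
}

# Hypothetical wrapper: returns non-zero when no pod is ready, which is the
# condition this runbook is about. Run it as root on a k8s-control server.
check_envvars_admission() {
    ready_count="$(kubectl get pods -n envvars-admission | count_ready_pods)"
    if [ "$ready_count" -eq 0 ]; then
        echo "no ready envvars-admission pods"
        return 1
    fi
    echo "$ready_count ready envvars-admission pod(s)"
}
```

If `check_envvars_admission` reports no ready pods, the logs and rollout restart commands above are the natural next steps.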
Checking if stats are coming in
You can go directly to prometheus and check whether any stats are coming in for that pod. The actual link the alert uses is in the alert UI, under the time, named wmcloud.org.
Common issues
Add new issues here when you encounter them!
Prometheus k8s cert expired
If envvars-admission seems up, you can check if the certificates that prometheus uses to connect to k8s have expired:
root@tools-prometheus-6:/srv/prometheus/tools# grep cert_file /srv/prometheus/tools/prometheus.yml
cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
...
root@tools-prometheus-6:/srv/prometheus/tools# openssl x509 -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt -text
Certificate:
...
Validity
Not Before: Jun 2 11:55:07 2022 GMT
Not After : Jun 2 11:55:07 2023 GMT <-- this one should be later than today
To refresh and fix the issue follow Portal:Toolforge/Admin/Kubernetes/Certificates#Operations.
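Instead of reading the Validity block by eye, the expiry check above can be sketched with openssl's `-enddate` and `-checkend` options. The helper names `cert_end_date` and `cert_valid_for` are hypothetical; the certificate path is the one reported by the grep above and may differ on your host.

```shell
# Print only the expiry line of a certificate, e.g. "notAfter=...".
cert_end_date() {
    openssl x509 -in "$1" -noout -enddate
}

# Exit 0 if the certificate is still valid N seconds from now (default 0,
# meaning "has not expired yet"); exit non-zero otherwise.
cert_valid_for() {
    cert_path="$1"
    seconds="${2:-0}"
    openssl x509 -in "$cert_path" -noout -checkend "$seconds" >/dev/null
}
```

For example, `cert_valid_for /etc/ssl/localcerts/toolforge-k8s-prometheus.crt 604800 || echo "expired or expiring within a week"` would flag a certificate that needs refreshing soon.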
Related information
- Envvars API docs
- Karma UI (use project=tools or project=toolsbeta for filtering)
- Tools prometheus
- Toolsbeta prometheus
- Alerts repository
- Toolforge admin docs
Old incidents
Add any incident tasks here!