Portal:Toolforge/Admin/Runbooks/JobsApiDown

This alert fires when the jobs-api pods in the jobs-api namespace of the tools or toolsbeta k8s cluster are down or can't be reached.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

The alert tells you which project (tools, toolsbeta, ...) it is failing for.

Debugging

The first step is usually to ssh to one of the k8s-control servers of tools or toolsbeta, depending on the project the alert is from (e.g. toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud). From there you can:

Check that the pods are running:

dcaro@tools-k8s-control-9:~# sudo -i
root@tools-k8s-control-9:~# kubectl get pods -n jobs-api
NAME                       READY   STATUS    RESTARTS      AGE
jobs-api-5cf5644bb-b9zjf   2/2     Running   6 (25h ago)   25h
jobs-api-5cf5644bb-s7rvt   2/2     Running   0             25h
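
If the listing is long, or you want to script the check, the output above can be filtered for pods that are not fully ready. The following is a minimal sketch, not part of any Toolforge tooling: not_ready_pods is a hypothetical helper name, and it assumes the default kubectl get pods column layout (NAME, READY, STATUS) shown above.

```shell
# Hypothetical helper (not part of Toolforge tooling): reads `kubectl get pods`
# output on stdin and prints the names of pods that are not fully ready.
# Assumes the default column layout: NAME READY STATUS ...
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY looks like "2/2"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Usage, as root on a k8s-control node:
#   kubectl get pods -n jobs-api | not_ready_pods
```

An empty result means every listed pod reports all of its containers as ready and is in the Running state.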

You can also check the logs of the deployment's pods with kubectl logs -n jobs-api deploy/jobs-api.

You can force a restart of all the pods with a rollout restart: kubectl rollout restart -n jobs-api deployment/jobs-api

It might also make sense to check whether there have been any recent code changes or re-deployment attempts. A good place to start is the recent commits in the jobs-api gitlab repo or the toolforge-deploy gitlab repo.

If the pods or the deployment don't exist, you can try redeploying jobs-api by following the instructions in the toolforge repo (it will do nothing if there's nothing to do).

Doing a manual curl for the stats

You can try curling the pods directly for the statistics. The prometheus configuration gives you the cert, key and url:

root@tools-prometheus-6:~# grep 'job_name.*jobs-api' -A 40 /srv/prometheus/tools/prometheus.yml
- job_name: jobs-api
  scheme: https
  tls_config:
    insecure_skip_verify: true
    cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
    key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
  kubernetes_sd_configs:
  - api_server: https://k8s.tools.eqiad1.wikimedia.cloud:6443
    role: pod
    tls_config:
      insecure_skip_verify: true
      cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
      key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
    namespaces:
      names:
      - jobs-api
  relabel_configs:
...
  - source_labels:
    - __meta_kubernetes_pod_name
    regex: "(jobs-api-[a-zA-Z0-9]+-[a-zA-Z0-9]+)"
    target_label: __metrics_path__
    replacement: "/api/v1/namespaces/jobs-api/pods/${1}:9000/proxy/metrics"
...

Then you can curl the pods directly by name, for example:

root@tools-prometheus-6:~# curl \
  --insecure \
  --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt  \
  --key /etc/ssl/private/toolforge-k8s-prometheus.key \
  'https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/jobs-api/pods/jobs-api-5cf5644bb-b9zjf:9000/proxy/metrics'
....
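
To tell a healthy metrics response from an error body at a glance, the curl output can be piped through a small filter. This is a sketch under the assumption that the endpoint returns the standard Prometheus text format, where sample lines start with a metric name and HELP/TYPE lines start with #; count_samples is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: counts lines on stdin that look like Prometheus
# text-format samples, i.e. lines starting with a metric-name character.
# Comment lines (# HELP / # TYPE), blank lines and JSON error bodies are ignored.
count_samples() {
  awk '/^[a-zA-Z_:]/ { n++ } END { print n + 0 }'
}

# Usage: append `| count_samples` to the curl command above
# (adding --silent to curl keeps the progress meter out of the terminal).
```

A count of 0 suggests the pod answered with something other than metrics, for example an error from the k8s API proxy.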

Common issues

Add new issues here when you encounter them!

Prometheus k8s cert expired

If jobs-api seems to be up, check whether the certificates that prometheus uses to connect to k8s have expired (there should have been another alert for that, though). See Portal:Toolforge/Admin/Runbooks/PrometheusK8sCertExpirySoon.
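
One quick way to check the expiry by hand is with openssl. The snippet below is a sketch only: cert_valid_for is a hypothetical helper name, and the certificate path in the usage comment is the one from the prometheus configuration shown earlier, which may differ on your host.

```shell
# Hypothetical helper: succeeds if the certificate in $1 is still valid
# for at least $2 more seconds (default 0, i.e. "not expired right now").
cert_valid_for() {
  openssl x509 -noout -checkend "${2:-0}" -in "$1" >/dev/null
}

# Usage, as root on the prometheus host:
#   cert_valid_for /etc/ssl/localcerts/toolforge-k8s-prometheus.crt || echo "certificate expired"
#   openssl x509 -noout -enddate -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt
```

The second usage line prints the notAfter date, which is handy when reporting the issue.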

Related information

Old incidents

Add any incident tasks here!