Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails

From Wikitech

This happens when the jobs-emailer does not send emails for an extended period of time.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager.

The alert tells you which project (tools, toolsbeta, ...) it is failing for.

It also includes the value of the stat. A value of -1 means that prometheus was unable to find the metric, so the problem might be in prometheus rather than in the jobs-emailer.
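
The triage rule above can be sketched as a small shell helper. This is only an illustration: the function name classify_stat and the strings it prints are hypothetical, and the only fact taken from this runbook is that -1 means prometheus could not find the metric.

```shell
# Hypothetical helper: decide where to look first based on the stat value
# shown in the alert. Only the meaning of -1 comes from this runbook.
classify_stat() {
  case "$1" in
    -1|-1.0)
      # prometheus could not find the metric at all
      echo "check-prometheus"
      ;;
    *)
      # the metric exists, so look at the jobs-emailer itself
      echo "check-jobs-emailer"
      ;;
  esac
}
```

For example, classify_stat -1 prints check-prometheus, while any other value prints check-jobs-emailer.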

Debugging

The first step is usually to ssh to one of the k8s-control servers of the project the alert is from, tools or toolsbeta (e.g. toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud). From there you can:

  • Check that the pods are running:
dcaro@tools-bastion-13:~$ kubectl-sudo get pods -n jobs-emailer
NAME                            READY   STATUS    RESTARTS   AGE
jobs-emailer-5946fb7cd5-6nhrm   1/1     Running   0          43m
  • Check the logs of the pod's deployment with: kubectl logs -n jobs-emailer deploy/jobs-emailer
  • Force a restart of all the pods with a rollout restart: kubectl rollout restart -n jobs-emailer deployment/jobs-emailer
  • If the pods or the deployment do not exist, you can try redeploying the jobs-emailer by following the instructions in the toolforge repo (it will do nothing if there's nothing to do).
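
The checks above could be wrapped in a couple of shell functions, as a minimal sketch. The function names and the KUBECTL variable are assumptions made for illustration. KUBECTL defaults to kubectl-sudo, as in the transcript above, and can be overridden if plain kubectl is what works on your host. The --tail=50 limit and the final rollout status call are additions that are not part of the original steps.

```shell
# Assumption: kubectl-sudo is the wrapper to use; override KUBECTL otherwise.
KUBECTL="${KUBECTL:-kubectl-sudo}"
NAMESPACE="jobs-emailer"

# Show the pod status and the most recent log lines of the deployment.
check_jobs_emailer() {
  $KUBECTL get pods -n "$NAMESPACE"
  $KUBECTL logs -n "$NAMESPACE" deploy/jobs-emailer --tail=50
}

# Restart all the pods and wait until the rollout has finished.
restart_jobs_emailer() {
  $KUBECTL rollout restart -n "$NAMESPACE" deployment/jobs-emailer
  $KUBECTL rollout status -n "$NAMESPACE" deployment/jobs-emailer
}
```

Run check_jobs_emailer first, and only run restart_jobs_emailer if the pods look stuck or the logs show errors.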

Doing a manual curl for the stats

You can try curling the pods directly for the statistics. The prometheus configuration gives you the cert, key and url:

root@tools-prometheus-7:~# grep 'job_name.*jobs-emailer' -A 40 /srv/prometheus/tools/prometheus.yml
- job_name: jobs-emailer
  scheme: https
  tls_config:
    insecure_skip_verify: true
    cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
    key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
  kubernetes_sd_configs:
  - api_server: https://k8s.tools.eqiad1.wikimedia.cloud:6443
    role: pod
    tls_config:
      insecure_skip_verify: true
      cert_file: "/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
      key_file: "/etc/ssl/private/toolforge-k8s-prometheus.key"
    namespaces:
      names:
      - jobs-emailer
...

Then you can curl the pods directly by name, like:

root@tools-prometheus-6:~# curl \
  --insecure \
  --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt  \
  --key /etc/ssl/private/toolforge-k8s-prometheus.key \
  'https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/jobs-emailer/pods/jobs-emailer-5946fb7cd5-6nhrm/proxy/metrics'
....
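
To avoid retyping the long URL for each pod, the curl above could be wrapped as in the following sketch. The function names metrics_url and fetch_metrics are hypothetical. The API server, cert and key paths are copied from the prometheus configuration shown above for the tools project, and would need adjusting for toolsbeta.

```shell
# Values taken from the prometheus config above (tools project).
API_SERVER="https://k8s.tools.eqiad1.wikimedia.cloud:6443"
CERT="/etc/ssl/localcerts/toolforge-k8s-prometheus.crt"
KEY="/etc/ssl/private/toolforge-k8s-prometheus.key"

# Print the API proxy URL for the metrics endpoint of the given pod.
metrics_url() {
  printf '%s/api/v1/namespaces/jobs-emailer/pods/%s/proxy/metrics\n' \
    "$API_SERVER" "$1"
}

# Fetch the metrics of the given pod using the prometheus client cert.
fetch_metrics() {
  curl --silent --show-error --insecure \
    --cert "$CERT" --key "$KEY" \
    "$(metrics_url "$1")"
}
```

For example, fetch_metrics jobs-emailer-5946fb7cd5-6nhrm issues the same request as the manual curl above, with the pod name taken from the kubectl get pods output.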

Common issues

Add new issues here when you encounter them!

Add new issues like this

With some info here.

Old incidents

Add any incident tasks here!