Portal:Toolforge/Admin/Jobs Service
This page contains information about the architecture and components of the Toolforge Jobs Service.
Components
- Jobs cli (source code): main entrypoint for users, runs on the Toolforge bastions.
- Jobs API (source code): runs inside the k8s cluster (part of Toolforge API). Offers the API that in turn interacts with the native k8s API objects: CronJob, Job, Deployment, etc.
- Jobs Emailer (source code): sends emails to users about their job events on demand.
Alerts
List of alerts: https://prometheus.svc.toolforge.org/tools/alerts?search=jobs
Runbooks: Category:JobsApiRunbooks
- Dashboard from the cloud UI: https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive&q=project%3D~^%28tools%7Ctoolsbeta%29
- Dashboard from the prod UI: https://alerts.wikimedia.org/?q=team%3Dwmcs&q=project%3D~%28tools%7Ctoolsbeta%29
Dashboards
See https://grafana-rw.wmcloud.org/dashboards/f/xnyXxnt4z/wmcs-toolforge-infra?tag=jobs
Main phabricator board
https://phabricator.wikimedia.org/project/board/539/
Implementation notes
Auth
cli
To ensure that Toolforge users only manage their own jobs, jobs-cli uses kubernetes client certificates for authentication. These x509 certificates are automatically managed by maintain-kubeusers, and live in each user's home directory:
toolsbeta.test@toolsbeta-sgebastion-04:~$ egrep client-certificate\|client-key .kube/config
client-certificate: /data/project/test/.toolskube/client.crt
client-key: /data/project/test/.toolskube/client.key
toolsbeta.test@toolsbeta-sgebastion-04:~$ head -1 /data/project/test/.toolskube/client.crt
-----BEGIN CERTIFICATE-----
toolsbeta.test@toolsbeta-sgebastion-04:~$ head -1 /data/project/test/.toolskube/client.key
-----BEGIN RSA PRIVATE KEY-----
The jobs-api component needs to know the client certificate CommonName. With this information, jobs-api can impersonate the user by reading the same x509 certificates from the user's home directory and using them to interact with the kubernetes API.
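For illustration, the CommonName can be read straight off a client certificate with openssl. The sketch below generates a throwaway self-signed certificate (the CN value and paths are invented for the example; real Toolforge certificates are issued by maintain-kubeusers and live under the tool's home as shown above) and extracts its CN:

```shell
# Generate a throwaway key + self-signed cert with a tool-like CN
# (hypothetical CN; real certs come from maintain-kubeusers).
tmpdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "$tmpdir/client.key" -out "$tmpdir/client.crt" \
  -days 1 -subj "/O=toolforge/CN=toolsbeta.test" 2>/dev/null

# Extract the CommonName, as a service would need to do to map
# a certificate back to a tool account.
cn=$(openssl x509 -in "$tmpdir/client.crt" -noout -subject -nameopt multiline \
  | awk -F' = ' '/commonName/ {print $2}')
echo "$cn"
```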
- connection cli <-> api-gateway: a user contacts the API Gateway using the k8s client TLS certificates from their home directory. This can happen from a Toolforge bastion, or from a job already running inside kubernetes. The connection can be made either using jobs-cli or by contacting api-gateway programmatically through other methods.
- connection jobs-api <-> k8s: jobs-api can now load the k8s client TLS certificate from the user's home directory and impersonate the user to contact the k8s API. For this to be possible, the jobs-api component needs read permissions for every user's home directory, much like maintain-kubeusers has.
This setup is possible because the x509 certificates are maintained by the maintain-kubeusers component, and because jobs-api runs inside the kubernetes cluster itself and can therefore be configured with enough permissions to read each user's home directory.
API
The jobs-api sits behind the API Gateway, which performs the authentication and provides some headers for the jobs-api to use.
Administrative tasks
Starting the services
Jobs API
This lives in kubernetes, behind the API gateway. To start it, you can try redeploying it: follow Portal:Toolforge/Admin/Kubernetes/Components#Deploy (the component is jobs-api).
You can monitor if it's coming up with the usual k8s commands:
dcaro@tools-bastion-13:~$ kubectl-sudo get all -n jobs-api
NAME READY STATUS RESTARTS AGE
pod/jobs-api-b88d5d498-4qjzm 2/2 Running 0 10d
pod/jobs-api-b88d5d498-ldr6q 2/2 Running 0 10d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/jobs-api ClusterIP 10.109.185.207 <none> 8443/TCP 2y185d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/jobs-api 2/2 2 2 2y185d
NAME DESIRED CURRENT READY AGE
...
replicaset.apps/jobs-api-b88d5d498 2 2 2 10d
Jobs Emailer
Deployed the same way as the Jobs API, but the component is jobs-emailer. It lives in its own namespace:
dcaro@tools-bastion-13:~$ kubectl-sudo get all -n jobs-emailer
NAME READY STATUS RESTARTS AGE
pod/jobs-emailer-57f7b5bdf6-8c5jr 1/1 Running 1 (2d19h ago) 5d18h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/jobs-emailer 1/1 1 1 726d
NAME DESIRED CURRENT READY AGE
...
replicaset.apps/jobs-emailer-d8769dd8f 0 0 0 25d
Stopping the services
Jobs API
Being a k8s deployment, the quickest way is to remove the deployment itself (starting it again will require a redeploy).
root@toolsbeta-test-k8s-control-4:~# kubectl get deployment -n jobs-api jobs-api -o yaml > backup.yaml # in case you want to restore later with kubectl apply -f backup.yaml
root@toolsbeta-test-k8s-control-4:~# kubectl delete deployment -n jobs-api jobs-api
For a full removal (CAREFUL! Only if you know what you are doing) you can use helm:
root@toolsbeta-test-k8s-control-4:~# helm uninstall -n jobs-api jobs-api
Jobs Emailer
Same as the Jobs API, but the deployment is named jobs-emailer
:
root@toolsbeta-test-k8s-control-4:~# kubectl get deployment -n jobs-emailer jobs-emailer -o yaml > backup.yaml # in case you want to restore later with kubectl apply -f backup.yaml
root@toolsbeta-test-k8s-control-4:~# kubectl delete deployment -n jobs-emailer jobs-emailer
For a full removal (CAREFUL! Only if you know what you are doing) you can use helm:
root@toolsbeta-test-k8s-control-4:~# helm uninstall -n jobs-emailer jobs-emailer
Checking all components are alive
You can check the dashboard: https://grafana-rw.wmcloud.org/dashboards/f/xnyXxnt4z/wmcs-toolforge-infra?tag=jobs
Jobs API
To see logs, try something like:
user@toolsbeta-test-k8s-control-4:~$ sudo -i kubectl logs deployment/jobs-api -n jobs-api nginx
[..]
192.168.17.192 - - [15/Feb/2022:12:57:54 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.81.64 - - [15/Feb/2022:12:59:50 +0000] "GET /api/v1/list/ HTTP/1.1" 200 3 "-" "python-requests/2.21.0"
192.168.17.192 - - [15/Feb/2022:13:00:34 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.81.64 - - [15/Feb/2022:13:01:01 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.17.192 - - [15/Feb/2022:13:01:02 +0000] "POST /api/v1/run/ HTTP/1.1" 409 52 "-" "python-requests/2.21.0"
user@toolsbeta-test-k8s-control-4:~$ sudo -i kubectl logs deployment/jobs-api -n jobs-api webservice
[..]
*** Operational MODE: single process ***
mounting api:app on /
Adding available container: {'shortname': 'tf-bullseye-std', 'image': 'docker-registry.tools.wmflabs.org/toolforge-bullseye-standalone:latest'}
Adding available container: {'shortname': 'tf-buster-std-DEPRECATED', 'image': 'docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest'}
Adding available container: {'shortname': 'tf-golang', 'image': 'docker-registry.tools.wmflabs.org/toolforge-golang-sssd-base:latest'}
Adding available container: {'shortname': 'tf-golang111', 'image': 'docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest'}
Adding available container: {'shortname': 'tf-jdk17', 'image': 'docker-registry.tools.wmflabs.org/toolforge-jdk17-sssd-base:latest'}
[..]
To verify the API endpoint is up try something like:
user@toolsbeta-test-k8s-control-4:~$ curl https://api.svc.toolsbeta.eqiad1.wikimedia.cloud:30003/jobs/v1/healthz -k | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 56 100 56 0 0 2290 0 --:--:-- --:--:-- --:--:-- 2434
{
"health": {
"message": "OK",
"status": "OK"
},
"messages": {}
}
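When scripting a check against this endpoint, the response shape shown above can be parsed without extra tooling. A minimal sketch, operating on a captured response literal rather than a live cluster (in a real check the JSON would come from the curl call):

```shell
# Captured example response from the healthz endpoint (see above);
# in a real check this would be the output of the curl call.
response='{"health": {"message": "OK", "status": "OK"}, "messages": {}}'

# Extract the health.status field with python3's json module
# (avoids depending on jq being installed on the host).
status=$(printf '%s' "$response" | python3 -c \
  'import json,sys; print(json.load(sys.stdin)["health"]["status"])')

if [ "$status" = "OK" ]; then
  echo "jobs-api healthy"
else
  echo "jobs-api unhealthy: $status" >&2
fi
```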
See how many jobs of a given type are defined:
user@tools-k8s-control-1:~$ sudo -i kubectl get jobs -A -l app.kubernetes.io/managed-by=toolforge-jobs-framework -l app.kubernetes.io/component=jobs
No resources found <-- this is somewhat normal, jobs may be short-lived
user@tools-k8s-control-1:~$ sudo -i kubectl get cronjob -A -l app.kubernetes.io/managed-by=toolforge-jobs-framework -l app.kubernetes.io/component=cronjobs
NAMESPACE NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
tool-admin updatetools 19,39,59 * * * * False 0 3m57s 31d
tool-botriconferme botriconferme-full 0,10 22,23 * * * False 0 16h 27h
tool-botriconferme botriconferme-purge-log 0 0 1 * * False 0 <none> 27h
tool-botriconferme botriconferme-quick */15 * * * * False 0 2m57s 27h
tool-cdnjs update-index 17 4 * * * False 1 12d 34d
[..]
user@tools-k8s-control-1:~$ sudo -i kubectl get deploy -A -l app.kubernetes.io/managed-by=toolforge-jobs-framework -l app.kubernetes.io/component=deployments
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
tool-cluebot3 cluebot3 1/1 1 1 10d
tool-fixsuggesterbot fix-suggester-bot-consume 1/1 1 1 198d
tool-fixsuggesterbot fix-suggester-bot-subscribe 1/1 1 1 198d
tool-majavah-bot t1-enwiki 1/1 1 1 18d
tool-mjolnir mjolnir 1/1 1 1 186d
tool-mjolnir uatu 1/1 1 1 183d
[..]
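The three selectors above reflect how job types map onto k8s object kinds: one-off runs are backed by Job, scheduled runs by CronJob, and continuous runs by Deployment. A toy sketch of that mapping (the function name and type labels are illustrative, not taken from the jobs-api source):

```shell
# Map a Toolforge job type to the k8s kind backing it
# (illustrative; mirrors the three label selectors used above).
kind_for_job_type() {
  case "$1" in
    scheduled)  echo CronJob ;;     # has a cron schedule
    continuous) echo Deployment ;;  # kept running, restarted on exit
    one-off)    echo Job ;;         # runs to completion once
    *)          echo unknown; return 1 ;;
  esac
}

kind_for_job_type scheduled   # CronJob
kind_for_job_type continuous  # Deployment
kind_for_job_type one-off     # Job
```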
Jobs Emailer
Service logs:
user@tools-k8s-control-1:~$ sudo -i kubectl -n jobs-emailer logs deployment/jobs-emailer
Live configuration can be seen with:
user@tools-k8s-control-1:~$ sudo -i kubectl -n jobs-emailer get cm jobs-emailer-configmap -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
[..]
data:
debug: "yes"
email_from_addr: noreply@toolforge.org
email_to_domain: tools.wmflabs.org
email_to_prefix: tools
send_emails_for_real: "yes"
smtp_server_fqdn: mail.tools.wmflabs.org
smtp_server_port: "25"
task_compose_emails_loop_sleep: "400"
task_read_configmap_sleep: "10"
task_send_emails_loop_sleep: "10"
task_send_emails_max: "10"
task_watch_pods_timeout: "60"
Values can be edited with kubectl -n jobs-emailer edit cm jobs-emailer-configmap. Editing a value triggers a live reconfiguration (no need to restart anything), but note that the changes will be overwritten on the next deployment.
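To illustrate how some of these settings fit together: the email_to_prefix and email_to_domain values suggest recipient addresses of the form prefix.toolname@domain. The composition below is an assumption for illustration (the tool name is invented); check the jobs-emailer source for the authoritative logic:

```shell
# ConfigMap values from above
email_to_prefix="tools"
email_to_domain="tools.wmflabs.org"

# Hypothetical tool name, for the example only
toolname="test"

# Assumed address composition: <prefix>.<toolname>@<domain>
recipient="${email_to_prefix}.${toolname}@${email_to_domain}"
echo "$recipient"
```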
Prebuilt image management
Images are built on the tools-docker-imagebuilder-01 instance, which is set up with appropriate credentials (and a hole in the proxy for the docker registry) to allow pushing. Note that you need to be root to build / push docker containers. Use sudo -i for this, since docker looks for credentials in the user's home directory, and they are only present in root's home directory.
Building Toolforge specific images
These are present in the git repository operations/docker-images/toollabs-images
. There is a base image called docker-registry.tools.wmflabs.org/toolforge-buster-sssd
that inherits from the wikimedia-buster base image but adds the toolforge debian repository + ldap SSSD support. All Toolforge related images should be named docker-registry.tools.wmflabs.org/toolforge-$SOMETHING
. The structure should be fairly self explanatory. There is a clone of it in /srv/images/toolforge
on the docker builder host.
You can rebuild any particular image by running the build.py script in that repository. Given the path inside the repository where a Docker image lives, it rebuilds all images that your image inherits from and all images that inherit from it. This ensures that any change to a Dockerfile is fully built and reflected immediately, rather than surfacing unexpectedly when something unrelated is pushed later on. We rely on Docker's build cache to keep this from being too slow. The script then pushes all the rebuilt images to the docker registry.
Example of rebuilding the python2 images:
$ ssh tools-docker-imagebuilder-01.tools.eqiad1.wikimedia.cloud
$ screen
$ sudo su
$ cd /srv/images/toolforge
$ git fetch
$ git log --stat HEAD..@{upstream}
$ git rebase @{upstream}
$ ./build.py --push python2-sssd/base
By default, the script builds the testing tag of any image, with the toolforge prefix; webservice pulls the latest tag, so testing images will not be picked up by it. If the image you are working on is ready to be automatically applied to all newly-launched containers, add the --tag latest argument to your build.py command:
$ ./build.py --tag latest --push python2-sssd/base
You will probably want to clean up intermediate layers after building new containers:
$ docker ps --no-trunc -aqf "status=exited" | xargs docker rm
$ docker images --no-trunc | grep '<none>' | awk '{ print $3 }' | xargs -r docker rmi
All of the web
images install our locally managed toollabs-webservice
package. When it is updated to fix bugs or add new features the Docker images need to be rebuilt. This is typically a good time to ensure that all apt managed packages are updated as well by rebuilding all of the images from scratch:
$ ssh tools-docker-imagebuilder-01.tools.eqiad1.wikimedia.cloud
$ screen
$ sudo su
$ cd /srv/images/toolforge
$ git fetch
$ git log --stat HEAD..@{upstream}
$ git reset --hard origin/master
$ ./rebuild_all.sh
See Portal:Toolforge/Admin/Kubernetes/Docker-registry for more info on the docker registry setup.
Managing images available for tools
Available images are managed in image-config. Here is how to add a new image:
- Add the new image name in the image-config repository
- Deploy this change to toolsbeta:
cookbook wmcs.toolforge.k8s.component.deploy --git-url https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/ --cluster-name toolsbeta
- Deploy this change to tools:
cookbook wmcs.toolforge.k8s.component.deploy --git-url https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/ --cluster-name tools
- Recreate the jobs-api pods in the Toolsbeta cluster, to make them read the new ConfigMap
- SSH to the bastion:
ssh toolsbeta-sgebastion-05.toolsbeta.eqiad1.wikimedia.cloud
- Find the pod ids:
kubectl get pod -n jobs-api
- Delete the pods, K8s will replace them with new ones:
kubectl sudo delete pod -n jobs-api {pod-name}
- Do the same in the Tools cluster (same instructions, but use login.toolforge.org as the SSH bastion)
- From a bastion, check you can run the new image with
webservice {image-name} shell
- From a bastion, check the new image is listed when running
toolforge-jobs images
- Update the Toolforge/Kubernetes wiki page to include the new image
See also
- Help:Toolforge/API: API docs and general Toolforge API information.
- API Gateway: entry point for toolforge API users.
- Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_jobs: where this was initially designed.
- Help:Toolforge/Jobs_Service: end user documentation
- Phabricator T286135: Toolforge jobs framework: email maintainers on job failure: original feature request for the emailer component.
Some upstream kubernetes documentation pointers:
- https://kubernetes.io/docs/concepts/workloads/controllers/job/
- https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/
- https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
- https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
- https://kubernetes.io/docs/tasks/job/
- https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/
- https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs