Jump to content

Portal:Toolforge/Admin/Jobs Service

From Wikitech

This page contains information about the architecture and components of the Toolforge Jobs Service.

Overview

The main component is an API to ease end user interaction with Toolforge jobs in the kubernetes cluster. The API abstracts away most of the k8s gory details for configuring, removing, managing and reading status on jobs. The abstraction approach is similar to what is being done with Toolforge webservices (we have the webservice command there), but with an approach that consist on having most of the business logic in an API service.

By splitting the software into several components, and introducing an stable API, we aim to reduce maintenance burden by not needing to rebuild all Toolforge docker containers every time we change some internal mechanism (which is the case of the tools-webservice package).

The service consists on 3 components:

  • jobs-api (code): runs inside the k8s cluster. Offers the API that in turn interacts with the k8s API native objects: CronJob, Job, Deployment, etc.
  • jobs-cli (code): command line interface to interact with the jobs API service. Typically used by end users in Toolforge bastions.
  • jobs-emailer (code): a daemon that runs inside k8s, listens to pod events, and emails the users that requested it about their jobs activity.

The API is part of the Toolforge API, and thus usable from within toolforge, from the bastion and from within any other job.

Auth

cli

To ensure that Toolforge users only manage their own jobs, jobs-cli uses kubernetes certificates for client authentication. These x509 certificates are automatically managed by maintain-kubeusers, and live in each user home directory:

toolsbeta.test@toolsbeta-sgebastion-04:~$ egrep client-certificate\|client-key .kube/config
    client-certificate: /data/project/test/.toolskube/client.crt
    client-key: /data/project/test/.toolskube/client.key
toolsbeta.test@toolsbeta-sgebastion-04:~$ head -1 /data/project/test/.toolskube/client.crt
-----BEGIN CERTIFICATE-----
toolsbeta.test@toolsbeta-sgebastion-04:~$ head -1 /data/project/test/.toolskube/client.key
-----BEGIN RSA PRIVATE KEY-----

The jobs-api component needs to know the client certificate CommonName. With this information, jobs-api can supplant the user by reading again the x509 certificates from the user home, and use them to interact with the kubernetes API.

  • connection cli<->api gateway: an user contacts the API Gateway using k8s client TLS certs from its home directory. This can happen from a Toolforge bastion, or from a Job already running inside kubernetes. The connection can be made either using jobs-cli or directly contacting api-gateway programmatically by other methods.
  • connection jobs-api <-> k8s: jobs-api can now load the k8s client TLS certificate from the user home, and supplant the user to contact the k8s API. For this to be possible, the jobs-api component needs permissions for every user home directory, pretty much like maintain-kubeusers has.

This setup is possible because the x509 certificates are maintained by the maintain-kubeusers component, and because jobs-api runs inside the kubernetes cluster itself and therefore can be configured with enough permissions to read each users home.

API

The jobs-api sits behind the API Gateway, that does the authentication and provides some headers for the jobs-api to use.

Endpoints

See Help:Toolforge/API.

Deployment and maintenance

Information on how to deploy and maintain the service.

jobs-api

deployment

See Portal:Toolforge/Admin/Kubernetes/Custom_components.

maintenance

To see logs, try something like:

user@toolsbeta-test-k8s-control-4:~$ sudo -i kubectl logs deployment/jobs-api -n jobs-api nginx
[..]
192.168.17.192 - - [15/Feb/2022:12:57:54 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.81.64 - - [15/Feb/2022:12:59:50 +0000] "GET /api/v1/list/ HTTP/1.1" 200 3 "-" "python-requests/2.21.0"
192.168.17.192 - - [15/Feb/2022:13:00:34 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.81.64 - - [15/Feb/2022:13:01:01 +0000] "GET /api/v1/containers/ HTTP/1.1" 200 2655 "-" "python-requests/2.21.0"
192.168.17.192 - - [15/Feb/2022:13:01:02 +0000] "POST /api/v1/run/ HTTP/1.1" 409 52 "-" "python-requests/2.21.0"
user@toolsbeta-test-k8s-control-4:~$ sudo -i kubectl logs deployment/jobs-api -n jobs-api webservice
[..]
*** Operational MODE: single process ***
mounting api:app on /
Adding available container: {'shortname': 'tf-bullseye-std', 'image': 'docker-registry.tools.wmflabs.org/toolforge-bullseye-standalone:latest'}
Adding available container: {'shortname': 'tf-buster-std-DEPRECATED', 'image': 'docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest'}
Adding available container: {'shortname': 'tf-golang', 'image': 'docker-registry.tools.wmflabs.org/toolforge-golang-sssd-base:latest'}
Adding available container: {'shortname': 'tf-golang111', 'image': 'docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest'}
Adding available container: {'shortname': 'tf-jdk17', 'image': 'docker-registry.tools.wmflabs.org/toolforge-jdk17-sssd-base:latest'}
[..]

To verify the API endpoint is up try something like:

user@toolsbeta-test-k8s-control-4:~$ curl https://api.svc.toolsbeta.eqiad1.wikimedia.cloud:30003/jobs/v1/healthz -k | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    56  100    56    0     0   2290      0 --:--:-- --:--:-- --:--:--  2434
{
  "health": {
    "message": "OK",
    "status": "OK"
  },
  "messages": {}
}

See how many jobs of a given type are defined:

user@tools-k8s-control-1:~$ sudo -i kubectl get jobs -A -l app.kubernetes.io/managed-by=toolforge-jobs-framework -l app.kubernetes.io/component=jobs
No resources found      <-- this is somewhat normal, jobs may be short-lived
user@tools-k8s-control-1:~$ sudo -i kubectl get cronjob -A -l app.kubernetes.io/managed-by=toolforge-jobs-framework -l app.kubernetes.io/component=cronjobs
NAMESPACE                NAME                              SCHEDULE           SUSPEND   ACTIVE   LAST SCHEDULE   AGE
tool-admin               updatetools                       19,39,59 * * * *   False     0        3m57s           31d
tool-botriconferme       botriconferme-full                0,10 22,23 * * *   False     0        16h             27h
tool-botriconferme       botriconferme-purge-log           0 0 1 * *          False     0        <none>          27h
tool-botriconferme       botriconferme-quick               */15 * * * *       False     0        2m57s           27h
tool-cdnjs               update-index                      17 4 * * *         False     1        12d             34d
[..]
user@tools-k8s-control-1:~$ sudo -i kubectl get deploy -A -l app.kubernetes.io/managed-by=toolforge-jobs-framework -l app.kubernetes.io/component=deployments
NAMESPACE              NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
tool-cluebot3          cluebot3                      1/1     1            1           10d
tool-fixsuggesterbot   fix-suggester-bot-consume     1/1     1            1           198d
tool-fixsuggesterbot   fix-suggester-bot-subscribe   1/1     1            1           198d
tool-majavah-bot       t1-enwiki                     1/1     1            1           18d
tool-mjolnir           mjolnir                       1/1     1            1           186d
tool-mjolnir           uatu                          1/1     1            1           183d
[..]

jobs-cli

deployment

A simple debian package installed on the bastions. See Portal:Toolforge/Admin/Packaging.

jobs-emailer

deployment

See Portal:Toolforge/Admin/Kubernetes/Custom_components.

maintenance

Service logs:

user@tools-k8s-control-1:~$ sudo -i kubectl -n jobs-emailer logs deployment/jobs-emailer

Live configuration can be seen with:

user@tools-k8s-control-1:~$ sudo -i kubectl -n jobs-emailer get cm jobs-emailer-configmap -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  [..]
data:
  debug: "yes"
  email_from_addr: noreply@toolforge.org
  email_to_domain: tools.wmflabs.org
  email_to_prefix: tools
  send_emails_for_real: "yes"
  smtp_server_fqdn: mail.tools.wmflabs.org
  smtp_server_port: "25"
  task_compose_emails_loop_sleep: "400"
  task_read_configmap_sleep: "10"
  task_send_emails_loop_sleep: "10"
  task_send_emails_max: "10"
  task_watch_pods_timeout: "60"

Values can be edited with kubectl -n jobs-emailer edit cm jobs-emailer-configmap. Editing some value will trigger a live reconfiguration (no need to restart anything).

Prebuilt image management

Images are built on the tools-docker-imagebuilder-01 instance, which is setup with appropriate credentials (and a hole in the proxy for the docker registry) to allow pushing. Note that you need to be root to build / push docker containers. Suggest using sudo -i for it - since docker looks for credentials in the user's home directory, and it is only present in root's home directory.

Building Toolforge specific images

These are present in the git repository operations/docker-images/toollabs-images. There is a base image called docker-registry.tools.wmflabs.org/toolforge-buster-sssd that inherits from the wikimedia-buster base image but adds the toolforge debian repository + ldap SSSD support. All Toolforge related images should be named docker-registry.tools.wmflabs.org/toolforge-$SOMETHING. The structure should be fairly self explanatory. There is a clone of it in /srv/images/toolforge on the docker builder host.

You can rebuild any particular image by running the build.py script in that repository. If you give it the path inside the repository where a Docker image lives, it'll rebuild all containers that your image lives from and all the containers that inherit from your container. This ensures that any changes in the Dockerfiles are completely built and reflected immediately, rather than waiting in surprise when something unrelated is pushed later on. We rely on Docker's build cache mechanisms to make sure this doesn't slow down incredibly. It then pushes them all to the docker registry.

Example of rebuilding the python2 images:

$ ssh tools-docker-imagebuilder-01.tools.eqiad1.wikimedia.cloud
$ screen
$ sudo su
$ cd /srv/images/toolforge
$ git fetch
$ git log --stat HEAD..@{upstream}
$ git rebase @{upstream}
$ ./build.py --push python2-sssd/base

By default, the script will build the testing tag of any image, which will not be pulled by webservice and it will build with the prefix of toolforge. Webservice pulls the latest tag. If the image you are working on is ready to be automatically applied to all newly-launched containers, you should add the --tag latest argument to your build.py command:

$ ./build.py --tag latest --push python2-sssd/base

You will probably want to clean up intermediate layers after building new containers:

$ docker ps --no-trunc -aqf "status=exited" | xargs docker rm
$ docker images --no-trunc | grep '<none>' | awk '{ print $3 }' | xargs -r docker rmi

All of the web images install our locally managed toollabs-webservice package. When it is updated to fix bugs or add new features the Docker images need to be rebuilt. This is typically a good time to ensure that all apt managed packages are updated as well by rebuilding all of the images from scratch:

$ ssh tools-docker-imagebuilder-01.tools.eqiad1.wikimedia.cloud
$ screen
$ sudo su
$ cd /srv/images/toolforge
$ git fetch
$ git log --stat HEAD..@{upstream}
$ git reset --hard origin/master
$ ./rebuild_all.sh

See Portal:Toolforge/Admin/Kubernetes/Docker-registry for more info on the docker registry setup.

Managing images available for tools

Available images are managed in image-config. Here is how to add a new image:

  • Add the new image name in the image-config repository
    • Deploy this change to toolsbeta: cookbook wmcs.toolforge.k8s.component.deploy --git-url https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/ --cluster-name toolsbeta
    • Deploy this change to tools: cookbook wmcs.toolforge.k8s.component.deploy --git-url https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/ --cluster-name tools
    • Recreate the jobs-api pods in the Toolsbeta cluster, to make them read the new ConfigMap
      • SSH to the bastion: ssh toolsbeta-sgebastion-05.toolsbeta.eqiad1.wikimedia.cloud
      • Find the pod ids: kubectl get pod -n jobs-api
      • Delete the pods, K8s will replace them with new ones: kubectl sudo delete pod -n jobs-api {pod-name}
    • Do the same in the Tools cluster (same instructions, but use login.toolforge.org as the SSH bastion)
  • From a bastion, check you can run the new image with webservice {image-name} shell
  • From a bastion, check the new image is listed when running toolforge-jobs images
  • Update the Toolforge/Kubernetes wiki page to include the new image

API docs

See https://api-docs.toolforge.org/docs


See also

Some upstream kubernetes documentation pointers: