Help:Toolforge/Jobs framework

From Wikitech

This page contains information on the Toolforge jobs framework.

Every non-trivial task performed in Toolforge (like executing a script or running a bot) should be dispatched to a job scheduling backend (in this case, Kubernetes), which ensures that the job is run in a suitable place with sufficient resources.

The basic principle of running jobs is fairly straightforward:

  • You create a job from a submission server (usually a bastion host such as
  • Kubernetes finds a suitable execution node to run the job on, and starts it there once resources are available
  • As it runs, your job will send output and errors to files until the job completes or is aborted.

Jobs can be executed once (one-off), run continuously, or scheduled to run at regular intervals (cron jobs).

Creating jobs

Information about job creation using the toolforge jobs run command.

Creating one-off jobs

One-off jobs (or normal jobs) are workloads that are scheduled by Toolforge Kubernetes and run until finished. They run once and are expected to finish at some point.

Select a runtime and a command in your tool's home directory, then use toolforge jobs run to create the job. Example, using the job name myjob:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye

The --command option supports passing arguments to the executable; wrap the whole command in quotes. Example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command "./ --witharguments" --image bullseye

You can instruct the command line to wait and not return until the job has finished by using the --wait option. Example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye --wait

Creating scheduled jobs (cron jobs)

To schedule a recurring job (also known as a cron job), use the --schedule WHEN option when creating it:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run mycronjob --command ./ --image bullseye --schedule "@daily"

The schedule argument uses cron syntax (see also cron on Wikipedia).

Please use the @hourly, @daily, @weekly, @monthly and @yearly macros if possible. They allow the cluster load to be spread evenly throughout the day, which makes maintaining the cluster much easier.
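If none of the macros fits your use case, a custom schedule uses standard five-field cron syntax. A few illustrative values (schedule times on Toolforge Kubernetes are typically interpreted as UTC):

```text
# field order: minute hour day-of-month month day-of-week
0 3 * * *       # every day at 03:00
*/10 * * * *    # every 10 minutes
30 2 * * 1      # every Monday at 02:30
```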

Creating continuous jobs

Continuous jobs are programs that are never meant to end. If they end (for example, because of an error) the Toolforge Kubernetes system will restart them.

To create a continuous job, use the --continuous option:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myalwaysrunningjob --command ./ --image bullseye --continuous

About the executable

In all job types (normal, continuous, cronjob) the --command parameter should meet the following conditions:

  • it should refer to an executable file.
  • mind the path: the command's working directory is the tool's home directory, so a bare command name will likely fail (it is resolved via $PATH), and --command ./ is likely what you mean.
  • arguments are optional, but if present, wrap the whole command in quotes, example: --command "./ --arg1 x --arg2 y".

Failing to meet any of these conditions will lead to errors either before launching the job, or shortly after the job is processed by the backend.
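To illustrate the conditions above, here is a minimal sketch of a job script. The name `` is hypothetical (not from the official docs); anything it writes to stdout and stderr ends up in the job's log files:

```shell
#!/bin/bash
# -- a hypothetical minimal job script. Place it in the tool's
# home directory and mark it executable with: chmod +x
set -euo pipefail

msg="job started at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "$msg"                            # stdout, captured in myjob.out
echo "progress notes go to stderr" >&2 # stderr, captured in myjob.err
```

It would then be submitted with --command ./ (a relative path, since the job starts in the tool home).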

About the job name

The job name is a unique string identifier. The string should meet these criteria:

  • between 1 and 100 characters long.
  • any combination of numbers, lower-case letters and the . (dot) and - (dash) characters.
  • no spaces, no underscores, no special symbols.

Failing to meet any of these conditions will lead to errors either before launching the job, or shortly after the job is processed by the backend.
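The naming rules above can be expressed as a regular expression; this is a local sanity check you could run before submitting (a sketch, not part of the toolforge CLI):

```shell
# check a candidate job name against the documented rules:
# 1 to 100 characters, using only lower-case letters, digits, '.' and '-'
is_valid_job_name() {
    printf '%s' "$1" | grep -Eq '^[a-z0-9.-]{1,100}$'
}

is_valid_job_name "my-daily.job" && echo "valid"
is_valid_job_name "My_Job" || echo "rejected (upper-case and underscore)"
```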

Choosing the execution runtime

In Toolforge Kubernetes we offer a pre-defined set of container images that you can use as the execution runtime for your job.

To view which execution runtimes are available, run the toolforge jobs images command.


tools.mytool@tools-sgebastion-11:~$ toolforge jobs images
Short name    Container image URL
------------  ----------------------------------------------------------------------

In addition, several deprecated images remain available for older tools that rely on them, but they should not be used for new use cases.

Introducing additional flexibility for execution runtimes is currently part of the WMCS team roadmap.

NOTE: if your tool uses python, you may want to use a virtualenv, see Help:Toolforge/Python#Kubernetes_python_jobs.

Retry policy

You can specify the retry policy for failed jobs.

The default policy is to not restart failed jobs, but you can choose to have them retried up to five times before the scheduling engine gives up.

Use the --retry N option. Example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye --retry 2

Note that the retry policy will be ignored for continuous jobs, given they are always restarted in case of failure.

Loading jobs from a YAML file

You can define a list of jobs in a YAML file and load them all at once using the toolforge jobs load command, example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs load jobs.yaml

NOTE: loading jobs from a file will delete and recreate any existing job of the same name whose definition has changed.

You can use the --job <name> option to load only one job as defined in the YAML file. Example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs load jobs.yaml --job "everyminute"

Example YAML file:

# a cronjob
- name: hourly
  command: ./ -v
  image: bullseye
  no-filelog: true
  schedule: "@hourly"
  emails: onfailure
# a continuous job
- image: python3.11
  name: endlessjob
  command: python3 --endless
  continuous: true
  emails: all
# wait for this normal job before loading the next
- name: myjob
  image: bullseye
  command: ./ --argument1
  wait: true
  emails: onfinish
# another normal job after the previous one finished running
- name: anotherjob
  image: bullseye
  command: ./ --argument1
  emails: none
# this job sets custom stdout/stderr log files
- name: normal-job-with-custom-logs
  image: bullseye
  command: ./ --argument1
  filelog-stdout: logs/stdout.log
  filelog-stderr: logs/stderr.log
# this job sets a custom retry policy
- name: normal-job-with-custom-retry-policy
  image: bullseye
  command: ./ --argument1
  retry: 2

Listing your existing jobs

You can get information about the jobs created for your tool using toolforge jobs list, example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs list
Job name:       Job type:          Status:
--------------  -----------------  ----------------------------------------
myscheduledjob  schedule: @hourly  Last schedule time: 2021-06-30T10:26:00Z
alwaysrunning   continuous         Running
myjob           normal             Completed

Listing even more information at once is possible using --output long:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs list --output long
Job name:       Command:               Job type:          Image:    File log:  Output log:  Error log:  Emails:   Resources:  Retry:  Status:
--------------  ---------------------  -----------------  --------  ---------  -----------  ----------  --------  ----------  ------  ---------
myscheduledjob  ./        schedule: @hourly  bullseye  no         /dev/null    /dev/null   none      default     no      Running
alwaysrunning   ./  continuous         bullseye  yes        test2.out    test2.err   none      default     no      Running
myjob           ./ --debug  normal             bullseye  yes        logs/mylog   logs/mylog  onfinish  default     2       Completed

NOTE: normal jobs will be deleted from this listing shortly after being completed (even if they finish with some error).

Deleting your jobs

You can delete your jobs in two ways:

  • manually delete each job, identified by name, using the toolforge jobs delete command.
  • delete all defined jobs at once, using the toolforge jobs flush command.

Showing information about your job

You can get information about a defined job using the toolforge jobs show command, example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs show myscheduledjob
| Job name:  | myscheduledjob                                                  |
| Command:   | ./ myargument                                      |
| Job type:  | schedule: * * * * *                                             |
| Image:     | bullseye                                                        |
| File log:  | yes                                                             |
| Emails:    | none                                                            |
| Resources: | mem: 10Mi, cpu: 100                                             |
| Status:    | Last schedule time: 2021-06-30T10:26:00Z                        |
| Hints:     | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase. State |
|            | 'waiting' for reason 'ContainerCreating'.                       |

This should include information about the job status and some hints (in case of failure, etc).

Restarting your jobs

You can restart cronjobs or continuous jobs.

Use toolforge jobs restart <jobname>. Example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs restart myjob

You can use this functionality to reset the internal state of stuck or failed jobs. The internal behavior is similar to removing the job and defining it again.

Trying to restart a non-existent job will do nothing.

Job logs

Jobs log stdout/stderr to files in your tool home directory.

For a job myjob, you will find:

  • a myjob.out file, containing stdout generated by your job.
  • a myjob.err file, containing stderr generated by your job.


tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye
tools.mytool@tools-sgebastion-11:~$ ls myjob*
myjob.out myjob.err

Subsequent same-name job runs will append to the same files.
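Because runs append rather than overwrite, you may occasionally want to empty a log file by hand. Truncating in place (rather than deleting) is the safer option, since a still-running job keeps writing to the same open file. A sketch, using a temporary file to stand in for myjob.out:

```shell
# simulate an existing log file with content from an earlier run
logfile="$(mktemp)"
echo "output from an earlier run" > "$logfile"

: > "$logfile"                    # truncate in place, without deleting the file
[ -s "$logfile" ] || echo "log file is now empty"
```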

Log generation can be disabled with the --no-filelog parameter when creating a new job, for example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye --no-filelog

Custom log files

You can control where you store your logs. This allows for things like:

  • using a custom directory
  • merging stdout/stderr logs together into a single file
  • ignoring one of the two log streams

To do that, make use of the following options when running a new job:

  • (for stdout) -o path/to/file.log or --filelog-stdout path/to/file.log
  • (for stderr) -e path/to/file.log or --filelog-stderr path/to/file.log

Example, running a job that merges both log streams into a single log file:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye --filelog-stdout myjob.log --filelog-stderr myjob.log

Example, running a job that uses the default `jobname`.out but ignores stderr:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye --filelog-stderr /dev/null

Example, running a job that logs both streams separately in a custom directory:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye --filelog-stdout mylogs/myjob.out.log --filelog-stderr mylogs/myjob.err.log

Custom directories must be created by hand before the job runs. Selecting an invalid directory will likely result in the job failing with exit code 2.
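For example, the custom directory can be created from the bastion before submitting the job ("mylogs" is a hypothetical name, relative to the tool's home directory):

```shell
# create the custom log directory ahead of the job run
logdir="${HOME}/mylogs"
mkdir -p "$logdir"
[ -d "$logdir" ] && echo "log directory ready: $logdir"
```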

Pruning log files

Users are responsible for keeping their log files from growing too large.

The mariadb image includes the logrotate program, which can be run via the Toolforge jobs framework to control log file sizes.

If you have a continuous job, you will want to use copytruncate mode for log rotation. To set it up, create a configuration file logrotate-myjob.conf similar to this:

tools.mytool@tools-sgebastion-11:~$ nano logrotate-myjob.conf
    /data/project/mytool/myjob.out /data/project/mytool/myjob.err {
        daily
        dateext
        copytruncate
        rotate 6
    }

This configuration rotates your log files daily in copytruncate mode and keeps 6 days of old logs in addition to the log for the current day. The dateext option renames rotated log files by appending the rotation date to their filenames, allowing for better organization and differentiation of log files based on the date of rotation.

Then you can start automatic log rotation with:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run logrotate-myjob --command "logrotate -v ./logrotate-myjob.conf --state ./logrotate-myjob.state" --image mariadb --schedule "@daily"

Providing more modern approaches and facilities for log management, metrics, etc. is on the current roadmap of the WMCS team. See Phabricator T127367 for an example.

Job quotas

Each tool account has a limited quota available. The same quota is used for jobs and other things potentially running on Kubernetes, like webservices.

To check your quota, run:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs quota
Running jobs                                    Used  Limit
--------------------------------------------  ------  -------
Total running jobs at once (Kubernetes pods)       0  10
Running one-off and cron jobs                      0  15
CPU                                                0  2
Memory                                             0  8Gi

Per-job limits    Limit
----------------  -------
CPU               1
Memory            4Gi

Job definitions                             Used    Limit
----------------------------------------  ------  -------
Cron jobs                                      0       50
Continuous jobs (including web services)       0        3

As of this writing, new jobs get 512Mi memory and 1/2 CPU by default.

You can run jobs with additional CPU and memory using the --mem MEM and --cpu CPU parameters, example:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command "./" --image bullseye --mem 1Gi --cpu 2

Requesting more memory or CPU than the tool's quota allows will fail.
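For reference, --mem and --cpu accept Kubernetes-style resource quantities. A few illustrative values:

```text
--mem 512Mi     # 512 mebibytes
--mem 1Gi       # 1 gibibyte
--cpu 1         # one full CPU core
--cpu 500m      # half a core (500 millicores)
```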

Quota increases

It is possible to request a quota increase if you can demonstrate your tool's need for more resources than the default namespace quota allows. Instructions and a template link for creating a quota request can be found at Toolforge (Quota requests) in Phabricator.

Please read all the instructions there before submitting your request.

Note for Toolforge admins: there are docs on how to do quota upgrades.

Job email notifications

You can opt to receive email notifications about your job activity by using the --emails EMAILS option when creating a job.

The available choices are:

  • none, don't get any email notification. The default behavior.
  • onfailure, receive email notifications in case of a failure event.
  • onfinish, receive email notifications in case of the job finishing (both successfully and on failure).
  • all, receive all possible notifications.


tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./ --image bullseye --emails onfinish

The email will be sent to the tool's email alias, which by default redirects to all tool maintainers associated with that particular tool account.

Complete example session

Here is a complete example of a work session with the Toolforge jobs framework.

Help command

List all available jobs-framework commands using the toolforge jobs -h command:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs -h
usage: toolforge jobs [-h] [--debug] [--cfg CFG]

Toolforge Jobs Framework, command line interface

positional arguments:
                        possible operations (pass -h to know usage of each)
    containers          Kept for compatibility reasons, use `images` instead.
    images              list information on available container image types
                        for Toolforge jobs
    run                 run a new job of your own in Toolforge
    show                show details of a job of your own in Toolforge
    list                list all running jobs of your own in Toolforge
    delete              delete a running job of your own in Toolforge
    flush               delete all running jobs of your own in Toolforge
    load                flush all jobs and load a YAML file with job
                        definitions and run them
    restart             restarts a running job

optional arguments:
  -h, --help            show this help message and exit
  --debug               activate debug mode
  --cfg CFG             YAML config for the CLI. Defaults to '/etc/toolforge-
                        jobs-framework-cli.cfg'. Only useful for Toolforge

List all available run command arguments using the toolforge jobs run -h command:

tools.mytool@tools-sgebastion-11:~$ toolforge jobs run -h
usage: toolforge jobs run [-h] --command COMMAND --image IMAGE [--no-filelog]
                          [-o FILELOG_STDOUT] [-e FILELOG_STDERR]
                          [--retry {0,1,2,3,4,5}] [--mem MEM] [--cpu CPU]
                          [--emails {none,all,onfinish,onfailure}]
                          [--schedule SCHEDULE | --continuous | --wait]

positional arguments:
  name                  new job name

optional arguments:
  -h, --help            show this help message and exit
  --command COMMAND     full path of command to run in this job
  --image IMAGE         image shortname (check them with `images`)
  --no-filelog          don't store job stdout in `jobname`.out and stderr in
                        `jobname`.err files in the user home directory
  -o FILELOG_STDOUT, --filelog-stdout FILELOG_STDOUT
                        location to store stdout logs for this job
  -e FILELOG_STDERR, --filelog-stderr FILELOG_STDERR
                        location to store stderr logs for this job
  --retry {0,1,2,3,4,5}
                        specify the retry policy of failed jobs.
  --mem MEM             specify additional memory limit required for this job
  --cpu CPU             specify additional CPU limit required for this job
  --emails {none,all,onfinish,onfailure}
                        specify if the system should email notifications about
                        this job. Defaults to 'none'.
  --schedule SCHEDULE   run a job with a cron-like schedule (example '1 * * * *')
  --continuous          run a continuous job
  --wait                run a job and wait for completion. Timeout is 300 seconds.

Grid Engine migration

Main article: News/Toolforge Grid Engine deprecation

This section contains specific documentation for Grid Engine users that are trying to migrate their jobs to Kubernetes.

In particular, here is a list of common command equivalences between Grid Engine (legacy, with jsub and friends) and Kubernetes (with the new toolforge jobs).

  • Basic job submission:
      Grid Engine: $ jsub ./
      Kubernetes:  $ toolforge jobs run myjob --command ./ --image bullseye
  • Allocating additional memory:
      Grid Engine: $ jsub -mem 1000m php i_like_more_ram.php
      Kubernetes:  $ toolforge jobs run myjob --command "php i_like_more_ram.php" --image php8.2 --mem 1Gi --cpu 2
  • Waiting until the job is completed:
      Grid Engine: $ jsub -sync y program [args...]
      Kubernetes:  $ toolforge jobs run myjob --command "python3 ./" --image python3.11 --wait
  • Viewing information about all jobs:
      Grid Engine: $ qstat
      Kubernetes:  $ toolforge jobs list
  • Deleting a job:
      Grid Engine: $ qdel job_number/job_name or $ qstop job_name
      Kubernetes:  $ toolforge jobs delete myjob
  • Deleting all jobs:
      Grid Engine: (no equivalent)
      Kubernetes:  $ toolforge jobs flush

NOTE: the old grid jlocal command has no equivalent in this jobs framework. It is not really needed: just schedule cron jobs as documented here, even if they are small scripts.

Useful links

The following tools have been built by the Toolforge admin team to help others see job status:

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support
Stay aware of critical changes and plans
Track work tasks and report bugs

Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself

Read stories and WMCS blog posts

Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)
