Help:Toolforge/Jobs framework
This page contains information on the Toolforge jobs framework.
Every non-trivial task performed in Toolforge (like executing a script or running a bot) should be dispatched to a job scheduling backend (in this case, Kubernetes), which ensures that the job is run in a suitable place with sufficient resources.
The basic principle of running jobs is fairly straightforward:
- You create a job from a submission server (usually login.toolforge.org)
- Kubernetes finds a suitable execution node to run the job on, and starts it there once resources are available
- As it runs, your job will send output and errors to files until the job completes or is aborted.
Jobs can be executed just once (either synchronously or asynchronously), on a recurring schedule, or continuously.
Creating jobs
Information about job creation using the `toolforge-jobs run` command.
Creating one-off jobs
One-off jobs (or normal jobs) are workloads that will be scheduled by Toolforge Kubernetes and run until finished. They will run once, and are expected to finish at some point.
Select a runtime and a command in your tool home directory, then use `toolforge-jobs run` to create the job. Example using the job name `myjob`:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye
The `--command` option supports arguments; use quotes around the whole command. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command "./mycommand.sh --witharguments" --image bullseye
You can instruct the command line to wait and not return until the job finishes by using the `--wait` option. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye --wait
Creating scheduled jobs (cron jobs)
To schedule a recurring job (also known as a cron job), use the `--schedule WHEN` option when creating it:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run mycronjob --command ./daily.sh --image bullseye --schedule "17 13 * * *"
The schedule argument uses cron syntax (see also cron on Wikipedia).
If you need to run a daily or hourly job, please avoid scheduling it exactly at midnight (00:00) or at the top of the hour (at :00 minutes) unless your job explicitly requires it. Instead, pick a random time so that system load is balanced evenly throughout the day.
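For example, a minimal sketch of a daily job scheduled at an arbitrary off-peak time (04:23) instead of midnight; the daily.sh script name is just an illustration:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run mydailyjob --command ./daily.sh --image bullseye --schedule "23 4 * * *"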
Creating continuous jobs
Continuous jobs are programs that are never meant to end. If they end (for example, because of an error) the Toolforge Kubernetes system will restart them.
To create a continuous job, use the `--continuous` option:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myalwaysrunningjob --command ./myendlesscommand.sh --image bullseye --continuous
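What the endless command does is entirely up to your tool; a minimal hypothetical sketch of myendlesscommand.sh could be a simple polling loop:
#!/bin/bash
# hypothetical sketch: poll for pending work, then sleep before the next iteration
while true; do
    ./process-pending-items.sh   # hypothetical worker script in the tool home directory
    sleep 60
done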
About the executable
In all job types (normal, continuous, cronjob) the `--command` parameter should meet the following conditions:
- it should refer to an executable file.
- mind the path: the command's working directory is the tool home directory, so `--command mycommand.sh` will likely fail (it would be looked up in $PATH), and `--command ./mycommand.sh` is likely what you mean.
- arguments are optional, but if present you should use quotes, for example: `--command "./mycommand.sh --arg1 x --arg2 y"`.
Failing to meet any of these conditions will lead to errors either before launching the job, or shortly after the job is processed by the backend.
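As an illustration, a minimal sketch of what an executable command file can look like (the file name and its contents are hypothetical); remember to make it executable:
tools.mytool@tools-sgebastion-11:~$ cat mycommand.sh
#!/bin/bash
# hypothetical example: record a timestamp, then do the actual work
date --iso-8601=seconds
echo "doing the actual work here"
tools.mytool@tools-sgebastion-11:~$ chmod +x mycommand.sh
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye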
About the job name
The job name is a unique string identifier. The string should meet these criteria:
- between 1 and 100 characters long.
- any combination of numbers, lower-case letters, and the `.` (dot) and `-` (dash) characters.
- no spaces, no underscores, no special symbols.
Failing to meet any of these conditions will lead to errors either before launching the job, or shortly after the job is processed by the backend.
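For illustration, with hypothetical names:
tools.mytool@tools-sgebastion-11:~$ # valid: lower-case letters, digits, dots and dashes
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run daily-dump.cleanup-v2 --command ./cleanup.sh --image bullseye
tools.mytool@tools-sgebastion-11:~$ # invalid: a name like "Daily_Dump cleanup" (upper-case, underscore, space) would be rejected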
Choosing the execution runtime
In Toolforge Kubernetes we offer a pre-defined set of container images that you can use as the execution runtime for your job.
To view which execution runtimes are available, run the `toolforge-jobs images` command.
Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs images
Short name Container image URL
------------ ----------------------------------------------------------------------
bullseye docker-registry.tools.wmflabs.org/toolforge-bullseye-sssd:latest
golang1.11 docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest
jdk17 docker-registry.tools.wmflabs.org/toolforge-jdk17-sssd-base:latest
mariadb docker-registry.tools.wmflabs.org/toolforge-mariadb-sssd-base:latest
mono6.8 docker-registry.tools.wmflabs.org/toolforge-mono68-sssd-base:latest
node16 docker-registry.tools.wmflabs.org/toolforge-node16-sssd-base:latest
perl5.32 docker-registry.tools.wmflabs.org/toolforge-perl532-sssd-base:latest
php7.4 docker-registry.tools.wmflabs.org/toolforge-php74-sssd-base:latest
python3.9 docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest
ruby2.1 docker-registry.tools.wmflabs.org/toolforge-ruby21-sssd-base:latest
ruby2.7 docker-registry.tools.wmflabs.org/toolforge-ruby27-sssd-base:latest
tcl8.6 docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest
In addition, there are several deprecated images that remain available for older tools that rely on them, but they should not be used for new use cases.
Introducing additional flexibility for execution runtimes is currently part of the WMCS team roadmap.
NOTE: if your tool uses python, you may want to use a virtualenv, see Help:Toolforge/Python#Kubernetes_python_jobs.
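As a hedged illustration of that approach (the virtualenv path and script name are hypothetical, and the virtualenv must have been created beforehand with a matching Python version; see the linked page for the full procedure):
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run mypyjob --command "./pyvenv/bin/python3 ./mybot.py" --image python3.9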
Retry policy
You can specify the retry policy for failed jobs.
The default policy is to not restart failed jobs. However, you can choose for them to be retried up to five times before the scheduling engine gives up.
Use the `--retry N` option. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./myjob.sh --image bullseye --retry 2
Note that the retry policy will be ignored for continuous jobs, given they are always restarted in case of failure.
Loading jobs from a YAML file
You can define a list of jobs in a YAML file and load them all at once using the `toolforge-jobs load` command. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs load jobs.yaml
NOTE: loading jobs from a file will flush (delete and recreate) jobs with the same name if their definition has changed.
You can use the `--job <name>` option to load only one job as defined in the YAML file. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs load jobs.yaml --job "everyminute"
Example YAML file:
# https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
---
# a cronjob
- name: everyminute
  command: ./myothercommand.sh -v
  image: bullseye
  no-filelog: true
  schedule: "* * * * *"
  emails: onfailure
# a continuous job
- image: python3.9
  name: endlessjob
  command: python3 dumps-daemon.py --endless
  continuous: true
  emails: all
# wait for this normal job before loading the next
- name: myjob
  image: bullseye
  command: ./mycommand.sh --argument1
  wait: true
  emails: onfinish
# another normal job after the previous one finished running
- name: anotherjob
  image: bullseye
  command: ./mycommand.sh --argument1
  emails: none
# this job sets custom stdout/stderr log files
- name: normal-job-with-custom-logs
  image: bullseye
  command: ./mycommand.sh --argument1
  filelog-stdout: logs/stdout.log
  filelog-stderr: logs/stderr.log
# this job sets a custom retry policy
- name: normal-job-with-custom-retry-policy
  image: bullseye
  command: ./mycommand.sh --argument1
  retry: 2
Listing your existing jobs
You can get information about the jobs created for your tool using `toolforge-jobs list`. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs list
Job name: Job type: Status:
-------------- -------------------- ---------------------------
myscheduledjob schedule: * * * * * Last schedule time: 2021-06-30T10:26:00Z
alwaysrunning continuous Running
myjob normal Completed
Listing even more information at once is possible using `--output long`:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs list --output long
Job name: Command: Job type: Image: File log: Output log: Error log: Emails: Resources: Retry: Status:
-------------- ----------------------- ------------------- -------- ----------- ------------- ------------ --------- ------------ -------- ---------
myscheduledjob ./read-dumps.sh schedule: * * * * * bullseye no /dev/null /dev/null none default no Running
alwaysrunning ./myendlesscommand.sh continuous bullseye yes test2.out test2.err none default no Running
myjob ./mycommand.sh --debug normal bullseye yes logs/mylog logs/mylog onfinish default 2 Completed
NOTE: normal jobs will be deleted from this listing shortly after being completed (even if they finish with some error).
Deleting your jobs
You can delete your jobs in two ways:
- manually delete each job, identified by name, using the `toolforge-jobs delete` command.
- delete all defined jobs at once, using the `toolforge-jobs flush` command.
Showing information about your job
You can get information about a defined job using the `toolforge-jobs show` command. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs show myscheduledjob
+------------+-----------------------------------------------------------------+
| Job name: | myscheduledjob |
+------------+-----------------------------------------------------------------+
| Command: | ./read-dumps.sh myargument |
+------------+-----------------------------------------------------------------+
| Job type: | schedule: * * * * * |
+------------+-----------------------------------------------------------------+
| Image: | bullseye |
+------------+-----------------------------------------------------------------+
| File log: | yes |
+------------+-----------------------------------------------------------------+
| Emails: | none |
+------------+-----------------------------------------------------------------+
| Resources: | mem: 10Mi, cpu: 100 |
+------------+-----------------------------------------------------------------+
| Status: | Last schedule time: 2021-06-30T10:26:00Z |
+------------+-----------------------------------------------------------------+
| Hints: | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase. State |
| | 'waiting' for reason 'ContainerCreating'. |
+------------+-----------------------------------------------------------------+
This should include information about the job status and some hints (in case of failure, etc).
Restarting your jobs
You can restart cronjobs or continuous jobs.
Use `toolforge-jobs restart <jobname>`. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs restart myjob
You can use this functionality to reset the internal state of stuck or failed jobs. The internal behavior is similar to removing the job and defining it again.
Trying to restart a non-existent job will do nothing.
Job logs
Jobs log stdout/stderr to files in your tool home directory.
For a job `myjob`, you will find:
- a `myjob.out` file, containing the stdout generated by your job.
- a `myjob.err` file, containing the stderr generated by your job.
Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye
tools.mytool@tools-sgebastion-11:~$ ls myjob*
myjob.out myjob.err
Subsequent same-name job runs will append to the same files.
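To follow a job's output while it runs, you can simply tail these files from the bastion, for example:
tools.mytool@tools-sgebastion-11:~$ tail -f myjob.out myjob.err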
NOTE: as of this writing there is no automatic way to prune log files, so tool users must make sure these files do not grow too large.
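One possible workaround, sketched here under the assumption that discarding old log content is acceptable for your tool, is a small scheduled job that periodically truncates the log files:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run prune-myjob-logs --command "truncate -s 0 myjob.out myjob.err" --image bullseye --schedule "35 2 * * 0" --no-filelog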
Log generation can be disabled with the `--no-filelog` parameter when creating a new job, for example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye --no-filelog
Custom log files
You can control where you store your logs. This allows for things like:
- using a custom directory
- merging stdout/stderr logs together into a single file
- ignoring one of the two log streams
To do that, make use of the following options when running a new job:
- (for stdout) `-o path/to/file.log` or `--filelog-stdout path/to/file.log`
- (for stderr) `-e path/to/file.log` or `--filelog-stderr path/to/file.log`
Example, running a job that merges both log streams into a single log file:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye --filelog-stdout myjob.log --filelog-stderr myjob.log
Example, running a job that uses the default `jobname`.out but ignores stderr:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye --filelog-stderr /dev/null
Example, running a job that logs both streams separately in a custom directory:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye --filelog-stdout mylogs/myjob.out.log --filelog-stderr mylogs/myjob.err.log
Custom directories must be created by hand before running the job. Selecting an invalid directory will likely result in the job failing with exit code 2.
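For the example above, the (hypothetical) mylogs directory needs to exist before the job is created:
tools.mytool@tools-sgebastion-11:~$ mkdir -p mylogs
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye --filelog-stdout mylogs/myjob.out.log --filelog-stderr mylogs/myjob.err.log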
Please note that providing more modern approaches and facilities for log management, metrics, etc. is on the current roadmap for the WMCS team. See Phabricator T127367 for an example.
Job quotas
Each tool account has a limited quota available. The same quota is used for jobs and other things potentially running on Kubernetes, like webservices.
To check your quota, run:
tools.mytool@tools-sgebastion-11:~$ kubectl describe resourcequotas
Name: tool-mytool
Namespace: tool-mytool
Resource Used Hard
-------- ---- ----
configmaps 2 10
count/cronjobs.batch 0 50 <--
count/deployments.apps 0 3 <--
count/jobs.batch 0 15 <--
limits.cpu 0 2
limits.memory 0 8Gi
persistentvolumeclaims 0 3
pods 0 10
replicationcontrollers 0 1
requests.cpu 0 2
requests.memory 0 6Gi
secrets 1 10
services 0 1
services.nodeports 0 0
The quota entries marked with the `<--` symbol indicate:
- maximum number of cronjobs
- maximum number of continuous jobs
- maximum number of jobs
As of this writing, new jobs get 512Mi memory and 1/2 CPU by default.
You can run jobs with additional CPU and memory using the `--mem MEM` and `--cpu CPU` parameters. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command "./heavycommand.sh" --image bullseye --mem 1Gi --cpu 2
Requesting more memory or CPU will fail if the tool quota is exceeded.
Quota increases
It is possible to request a quota increase if you can demonstrate your tool's need for more resources than the default namespace quota allows. Instructions and a template link for creating a quota request can be found at Toolforge (Quota requests) in Phabricator.
Please read all the instructions there before submitting your request.
Note for Toolforge admins: there are docs on how to do quota upgrades.
Job email notifications
You can choose to receive email notifications about your job activity by using the `--emails EMAILS` option when creating a job.
The available choices are:
- `none`: don't get any email notification. This is the default behavior.
- `onfailure`: receive email notifications in case of a failure event.
- `onfinish`: receive email notifications when the job finishes (both successfully and on failure).
- `all`: receive all possible notifications.
Example:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye --emails onfinish
The email will be sent to tools.mytool@toolforge.org, which is an email alias that by default redirects to all tool maintainers associated with that particular tool account.
Complete example session
Here is a complete example of a work session with the Toolforge jobs framework.
Example shell session:
$ ssh dev.toolforge.org
$ become $mytool
$ toolforge-jobs images
Short name Container image URL
------------ ----------------------------------------------------------------------
bullseye docker-registry.tools.wmflabs.org/toolforge-bullseye-sssd:latest
golang1.11 docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest
jdk17 docker-registry.tools.wmflabs.org/toolforge-jdk17-sssd-base:latest
mono6.8 docker-registry.tools.wmflabs.org/toolforge-mono68-sssd-base:latest
node16 docker-registry.tools.wmflabs.org/toolforge-node16-sssd-base:latest
perl5.32 docker-registry.tools.wmflabs.org/toolforge-perl532-sssd-base:latest
php7.4 docker-registry.tools.wmflabs.org/toolforge-php74-sssd-base:latest
python3.9 docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest
ruby2.1 docker-registry.tools.wmflabs.org/toolforge-ruby21-sssd-base:latest
ruby2.7 docker-registry.tools.wmflabs.org/toolforge-ruby27-sssd-base:latest
tcl8.6 docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest
$ # running a normal job:
$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye
$ # running a normal job and waiting for it to complete:
$ toolforge-jobs run myotherjob --command ./myothercommand.sh --image bullseye --wait
$ # running a continuous job:
$ toolforge-jobs run myalwaysrunningjob --command ./myendlesscommand.sh --image bullseye --continuous
$ # running a scheduled job:
$ toolforge-jobs run myscheduledjob --command ./everyminute.sh --image bullseye --schedule "1 * * * *"
$ toolforge-jobs list
Job name: Command: Job type: Image: Status:
-------------- ----------------------- ------------------- -------- ---------------------------
myscheduledjob ./everyminute.sh schedule: 1 * * * * bullseye Last schedule time: 2021-06-30T10:26:00Z
alwaysrunning ./myendlesscommand.sh continuous bullseye Running
myjob ./mycommand.sh normal bullseye Completed
$ toolforge-jobs show myscheduledjob
+------------+-----------------------------------------------------------------+
| Job name: | myscheduledjob |
+------------+-----------------------------------------------------------------+
| Command: | ./read-dumps.sh |
+------------+-----------------------------------------------------------------+
| Job type: | schedule: * * * * * |
+------------+-----------------------------------------------------------------+
| Image: | bullseye |
+------------+-----------------------------------------------------------------+
| Status: | Last schedule time: 2021-06-30T10:26:00Z |
+------------+-----------------------------------------------------------------+
| Hints: | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase. State |
| | 'waiting' for reason 'ContainerCreating'. |
+------------+-----------------------------------------------------------------+
$ toolforge-jobs delete myscheduledjob
$ toolforge-jobs flush
$ toolforge-jobs list
[.. nothing ..]
Help command
List all available jobs-framework commands using the `toolforge-jobs -h` command:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs -h
usage: toolforge-jobs [-h] [--debug] [--cfg CFG]
{containers,images,run,show,list,delete,flush,load,restart}
...
Toolforge Jobs Framework, command line interface
positional arguments:
{containers,images,run,show,list,delete,flush,load,restart}
possible operations (pass -h to know usage of each)
containers Kept for compatibility reasons, use `images` instead.
images list information on available container image types
for Toolforge jobs
run run a new job of your own in Toolforge
show show details of a job of your own in Toolforge
list list all running jobs of your own in Toolforge
delete delete a running job of your own in Toolforge
flush delete all running jobs of your own in Toolforge
load flush all jobs and load a YAML file with job
definitions and run them
restart restarts a running job
optional arguments:
-h, --help show this help message and exit
--debug activate debug mode
--cfg CFG YAML config for the CLI. Defaults to '/etc/toolforge-
jobs-framework-cli.cfg'. Only useful for Toolforge
admins.
List all available run command arguments using the `toolforge-jobs run -h` command:
tools.mytool@tools-sgebastion-11:~$ toolforge-jobs run -h
usage: toolforge-jobs run [-h] --command COMMAND --image IMAGE [--no-filelog]
[-o FILELOG_STDOUT] [-e FILELOG_STDERR]
[--retry {0,1,2,3,4,5}] [--mem MEM] [--cpu CPU]
[--emails {none,all,onfinish,onfailure}]
[--schedule SCHEDULE | --continuous | --wait]
name
positional arguments:
name new job name
optional arguments:
-h, --help show this help message and exit
--command COMMAND full path of command to run in this job
--image IMAGE image shortname (check them with `images`)
--no-filelog don't store job stdout in `jobname`.out and stderr in
`jobname`.err files in the user home directory
-o FILELOG_STDOUT, --filelog-stdout FILELOG_STDOUT
location to store stdout logs for this job
-e FILELOG_STDERR, --filelog-stderr FILELOG_STDERR
location to store stderr logs for this job
--retry {0,1,2,3,4,5}
specify the retry policy of failed jobs.
--mem MEM specify additional memory limit required for this job
--cpu CPU specify additional CPU limit required for this job
--emails {none,all,onfinish,onfailure}
specify if the system should email notifications about
this job. Defaults to 'none'.
--schedule SCHEDULE run a job with a cron-like schedule (example '1 * * *
*')
--continuous run a continuous job
--wait run a job and wait for completition. Timeout is 300
seconds.
Grid Engine migration
- Main article: News/Toolforge Grid Engine deprecation
This section contains specific documentation for Grid Engine users that are trying to migrate their jobs to Kubernetes.
In particular, here is a list of common command equivalences between Grid Engine (legacy, with `jsub` and friends) and Kubernetes (with the new `toolforge-jobs`).
| Task | Grid Engine | Kubernetes |
|---|---|---|
| Basic job submission | `$ jsub ./mycommand.sh` | `$ toolforge-jobs run myjob --command ./mycommand.sh --image bullseye` |
| Allocating additional memory | `$ jsub -mem 1000m php i_like_more_ram.php` | `$ toolforge-jobs run myjob --command "php i_like_more_ram.php" --image php7.4 --mem 1Gi --cpu 2` |
| Waiting until the job is completed | `$ jsub -sync y program [args...]` | `$ toolforge-jobs run myjob --command "python3 ./myScript.py" --image python3.9 --wait` |
| Viewing information about all jobs | `$ qstat` | `$ toolforge-jobs list` |
| Deleting a job | `$ qdel job_number/job_name` or `$ qstop job_name` | `$ toolforge-jobs delete myjob` |
| Deleting all jobs | - | `$ toolforge-jobs flush` |
NOTE: the old grid `jlocal` command has no equivalent in this jobs framework, and it is not really needed: just schedule cron jobs as documented here, even if they are small scripts.
Useful links
The following tools have been built by the Toolforge admin team to help others see job status:
- k8s-status.toolforge.org — status board of Kubernetes nodes and tools (webservices, jobs) they are currently running.
Communication and support
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia Movement volunteers. Please reach out with questions and join the conversation:
- Chat in real time in the IRC channel #wikimedia-cloud, the bridged Telegram group, or the bridged Mattermost channel
- Discuss via email after you subscribe to the cloud@ mailing list
See also
- Help:Toolforge/Web
- Help:Toolforge/Kubernetes
- News/Toolforge Stretch deprecation
- News/2020 Kubernetes cluster migration
- Alternate procedure for managing jobs in Toolforge Kubernetes, using the raw Kubernetes API; only recommended for advanced users.
- Portal:Toolforge/Admin/Kubernetes/Jobs framework - Engineering documentation about this system.
External links
- Source code of the toolforge-jobs command
- Wikimedia Techblog: Toolforge Jobs Framework, by Arturo Borrero González, Site Reliability Engineer, Wikimedia Cloud Services Team, March 18, 2022