Mw-cron jobs
This page documents the new Kubernetes way of running scheduled MediaWiki Maintenance scripts. The old system on the Maintenance servers is still available as a fallback for now, but those servers will be going away.
If you discover issues with a migrated maintenance script, please report them via the relevant subtask of task T341555.

mw-cron is a MediaWiki On Kubernetes deployment in WikiKube, for automatically executing MediaWiki maintenance scripts on a regular schedule.
SRE is targeting March 2025 to complete the migration from the maintenance servers to the mw-cron deployment.
Logs
Monitoring
Alerting
By default, we are setting up two alerts:
- A warning for team=serviceops that will show up in ServiceOps' AlertManager dashboard
- Either a Phabricator task on the owning team's PHID, or a Slack message in their channel, depending on preference

Please note that in the case of a Phabricator alert, a task with the generic name MediaWikiCronJobFailed will be opened and, as long as that task remains open, future alerts firing for that team will update its description with which particular CronJobs are failing.
Probe
For now, the probe fires when a Job has failed, no matter the reason (eviction, non-zero exit code, etc.). It will not auto-resolve on a subsequent successful run; manual deletion of the failed Job is required to resolve it.
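Since the probe does not auto-resolve, clearing the alert means deleting the failed Job by hand from the deployment server. A minimal sketch, assuming a hypothetical failed job named example-something-29050030 with Eqiad as the primary datacenter:

deploy1003:~$ kube-env mw-cron eqiad
deploy1003:~$ kubectl delete job example-something-29050030

Only delete the failed Job once you have extracted the logs you need (see Troubleshooting below), as deleting it also removes its pods and their logs.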
Troubleshooting
If your job has failed, you can either look at Logstash, or diagnose from the command line on the deployment server.
Logs from Kubernetes
If one of your maintenance scripts recently failed, a Phabricator task will have been opened for your team (example). The task description provides a more specific kubectl get jobs command invocation that lists only the failed jobs for the maintenance script that prompted the task.
The example below assumes Eqiad is the primary datacenter.
1. Enter the kubectl scope for the mw-cron cluster in the primary datacenter.
deploy1003:~$ kube-env mw-cron eqiad
2. List recently failed jobs, or list all jobs
deploy1003:~$ kubectl get jobs --field-selector status.successful=0
deploy1003:~$ kubectl get jobs
NAMESPACE NAME COMPLETIONS DURATION AGE
mw-cron example-something-29050030 0/1 9s 28m
3. Access the logs from the pod (insert the job name after "jobs/", and select the mediawiki container)
deploy1003:~$ kubectl logs jobs/example-something-29050030 mediawiki-main-app
Doing stuff for things...
...found this.
...did that.
Done!
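If the pod produced no useful application logs (for example, when it was evicted), kubectl describe usually shows the underlying reason in the Job's conditions and events. A sketch, reusing the hypothetical job name from the example above:

deploy1003:~$ kubectl describe job example-something-29050030
deploy1003:~$ kubectl get pods -l job-name=example-something-29050030

The job-name label is set automatically by Kubernetes on every pod a Job creates, so the second command lists the pods belonging to that run.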
Manually running a CronJob
cgoubert@deploy1003:~$ kube-env mw-cron eqiad
cgoubert@deploy1003:~$ KUBECONFIG=/etc/kubernetes/mw-cron-deploy-eqiad.config
cgoubert@deploy1003:~$ kubectl create job mediawiki-main-serviceops-version-$(date +"%Y%m%d%H%M") --from=cronjobs/mediawiki-main-serviceops-version
job.batch/mediawiki-main-serviceops-version-202504081112 created
cgoubert@deploy1003:$ kubectl get jobs -l 'team=sre-serviceops, cronjob=serviceops-version'
NAME COMPLETIONS DURATION AGE
mediawiki-main-serviceops-version-202504081112 1/1 49s 81s
[...]
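To follow the manual run while it executes, you can stream the logs of the freshly created Job. A sketch, assuming the job name printed by kubectl create job above, and the mediawiki-main-app container name used earlier on this page:

cgoubert@deploy1003:~$ kubectl logs -f jobs/mediawiki-main-serviceops-version-202504081112 mediawiki-main-app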
Job Migration
General procedure
Code changes
The jobs are still defined in puppet, using the profile::mediawiki::periodic_job resource. Additional parameters are necessary to migrate a job to mw-cron and remove it from the maintenance servers.
- If the periodic jobs are defined in a subprofile of profile::mediawiki::maintenance, change the class definition to include the $helmfile_defaults_dir parameter:

class profile::mediawiki::maintenance::subprofile (
    Stdlib::Unixpath $helmfile_defaults_dir = lookup('profile::kubernetes::deployment_server::global_config::general_dir', {default_value => '/etc/helmfile-defaults'}),
) {
- Include your subprofile in profile::kubernetes::deployment_server::mediawiki::periodic_jobs
- Add the following additional parameters:

cron_schedule         => '*/10 * * * *', # The interval must be converted from systemd-calendar intervals to crontab syntax. Keep the interval parameter as well if your job is used on beta
kubernetes            => true, # Create the CronJob resource in mw-cron, and remove the systemd timer from the maintenance server
team                  => 'job-owner-team', # For easier monitoring, log dashboards, and alerting
script_label          => 'scriptName-wikiName', # A label for monitoring, logging, and alerting, preferably the script name and its target
description           => 'A longer-form description of the periodic job',
helmfile_defaults_dir => $helmfile_defaults_dir, # Pass down the directory where the jobs will be defined
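Putting the pieces together, a migrated job definition could look roughly like this. This is a sketch, not a real job: the resource title, command, and label values are hypothetical, and only the parameters documented on this page are shown:

profile::mediawiki::periodic_job { 'example-cleanup':
    command               => '/usr/local/bin/mwscript cleanupSomething.php --wiki=examplewiki', # hypothetical script and wiki
    interval              => '*-*-* *:00/10:00', # kept for beta; systemd-calendar syntax for every 10 minutes
    cron_schedule         => '*/10 * * * *',     # the same interval in crontab syntax
    kubernetes            => true,
    team                  => 'job-owner-team',
    script_label          => 'cleanupSomething-examplewiki',
    description           => 'Cleans up something on examplewiki every 10 minutes',
    helmfile_defaults_dir => $helmfile_defaults_dir,
}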
Example
We recommend the following procedure:
- [Done in a batch, not necessary anymore] Create a first change adding the subprofile to the profile::kubernetes::deployment_server::mediawiki::periodic_jobs profile, like 1117234. This will have all the no-op changes.
- Migrate the jobs in follow-up patches like 1117862
Deployment
- Disable puppet on the maintenance server if the job can't be interrupted easily
- Merge the puppet change
- Run puppet on the deployment server; this will create the job definition, but it won't be deployed to mw-cron yet
- Stop the job on the maintenance server if it is running (use systemctl status mediawiki_job_<jobname>.service to check, systemctl stop mediawiki_job_<jobname>.service to stop). If the script is currently running, check with the responsible team if it's ok to stop it.
- Deploy the mw-cron change with helmfile on the deployment server. The job will start on its next scheduled trigger.
- Enable puppet on the maintenance server and run it. This will delete the systemd timer for the maintenance job.
- If needed, use Periodic_jobs#Manually_running_a_CronJob to trigger a manual run.
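The steps above can be sketched as a command sequence. The maintenance hostname, job name, and helmfile path are assumptions; adjust them to the actual maintenance host, job, and primary datacenter:

# On the maintenance server: stop puppet and the running job
maintenance$ sudo disable-puppet 'migrating mediawiki_job_<jobname> to mw-cron'
maintenance$ sudo systemctl stop mediawiki_job_<jobname>.service

# On the deployment server, after merging the puppet change and running puppet there
deploy1003:~$ cd /srv/deployment-charts/helmfile.d/services/mw-cron
deploy1003:~$ helmfile -e eqiad apply

# Back on the maintenance server: re-enable puppet (same message) and run it to remove the systemd timer
maintenance$ sudo enable-puppet 'migrating mediawiki_job_<jobname> to mw-cron'
maintenance$ sudo run-puppet-agent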