Mw-cron jobs

This page documents the new Kubernetes way of running scheduled MediaWiki Maintenance scripts. The old system on the Maintenance servers is still available as a fallback for now, but those servers will be going away.

If you discover issues with a migrated maintenance script, please report them on the relevant subtask of task T341555.

mw-cron is a MediaWiki On Kubernetes deployment in WikiKube, for automatically executing MediaWiki maintenance scripts on a regular schedule.

SRE is targeting March 2025 to complete the migration from the Maintenance server to the mw-cron deployment.

Logs

Logstash Dashboard

Monitoring

Alerting

By default, two alerts are set up for each job:

  1. A warning for team=serviceops that shows up in ServiceOps' AlertManager dashboard
  2. Either a Phabricator task on the owning team's PHID, or a Slack message on their channel, depending on the team's preference

Please note that in the case of a Phabricator alert, it will open a task with the generic name MediaWikiCronJobFailed and, as long as the task is open, future alerts firing for that team will update the description with which particular CronJobs are failing.

Probe

For now, the probe fires whenever a Job has failed, whatever the reason (eviction, non-zero exit code, etc.). It does not auto-resolve after a subsequent successful run: the failed Job must be deleted manually for the alert to resolve.
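The manual deletion can be sketched as a small shell helper. This is only an illustration: `delete_failed_job` is a hypothetical function name, and it assumes that `kube-env mw-cron <datacenter>` has already been run and that your credentials are allowed to delete Jobs in that namespace. It checks the Job's `Failed` condition first so that a running or successful Job is not removed by mistake.

```shell
# Hypothetical helper: delete one Job, but only if Kubernetes marks it Failed.
# Assumes the kubectl context already points at mw-cron.
delete_failed_job() {
    local job="$1" failed
    # Prints "True" when the Job carries a Failed condition, nothing otherwise.
    failed=$(kubectl get job "$job" \
        -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}') || return 1
    if [ "$failed" != "True" ]; then
        echo "refusing to delete $job: it is not marked as Failed" >&2
        return 1
    fi
    kubectl delete job "$job"
}
```

For example, `delete_failed_job example-something-29050030` would remove that Job once its logs have been inspected.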

Troubleshooting

If your job has failed, you can either look at Logstash or diagnose the failure from the command line on the deployment server.

Logs from Kubernetes

If one of your maintenance scripts recently failed, a Phabricator task will have been opened for your team (example). The task description provides a more specific kubectl get jobs command invocation that lists only failed jobs specific to the maintenance script that prompted the task.

The example below assumes Eqiad is the primary datacenter.

1. Enter the kubectl scope for the mw-cron cluster in the primary datacenter.

deploy1003:~$ kube-env mw-cron eqiad

2. List recently failed jobs, or list all jobs.

deploy1003:~$ kubectl get jobs --field-selector status.successful=0
deploy1003:~$ kubectl get jobs

NAMESPACE   NAME                              COMPLETIONS   DURATION   AGE
mw-cron     example-something-29050030        0/1           9s         28m

3. Access the logs from the pod (insert the job name after "jobs/", and select the mediawiki container).

deploy1003:~$ kubectl logs jobs/example-something-29050030 mediawiki-main-app

Doing stuff for things...
...found this.
...did that.
Done!
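Steps 2 and 3 can be combined into one loop. The sketch below is not an official tool: `failed_job_logs` is a hypothetical name, and it assumes the kubectl scope from step 1 is already set. Note that the `status.successful=0` selector matches every Job without a successful completion, so a Job that is still running may be listed as well.

```shell
# Hypothetical helper: print the logs of every Job with no successful completion.
# The container name defaults to the one used in step 3.
failed_job_logs() {
    local container="${1:-mediawiki-main-app}"
    local job
    kubectl get jobs --field-selector status.successful=0 -o name |
    while read -r job; do
        # "-o name" prints "job.batch/<name>"; keep only <name>.
        echo "=== ${job##*/} ==="
        kubectl logs "jobs/${job##*/}" -c "$container"
    done
}
```

Calling `failed_job_logs` with no argument reads the `mediawiki-main-app` container; pass another container name as the first argument if needed.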

Manually running a CronJob

cgoubert@deploy1003:~$ kube-env mw-cron eqiad
cgoubert@deploy1003:~$ KUBECONFIG=/etc/kubernetes/mw-cron-deploy-eqiad.config
cgoubert@deploy1003:~$ kubectl create job mediawiki-main-serviceops-version-$(date +"%Y%m%d%H%M") --from=cronjobs/mediawiki-main-serviceops-version
job.batch/mediawiki-main-serviceops-version-202504081112 created
cgoubert@deploy1003:~$ kubectl get jobs -l 'team=sre-serviceops, cronjob=serviceops-version'
NAME                                             COMPLETIONS   DURATION   AGE
mediawiki-main-serviceops-version-202504081112   1/1           49s        81s
[...]
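The session above can be wrapped in a helper that builds the timestamped Job name. This is a sketch under assumptions: `manual_job_name` and `run_cronjob_now` are hypothetical names, and the kubectl scope and KUBECONFIG are expected to be set as shown above. The length check reflects the 63-character limit that the Kubernetes documentation gives for Job names; the timestamp suffix adds 13 characters to the CronJob name.

```shell
# Hypothetical helper: build "<cronjob>-<YYYYmmddHHMM>", as in the session above.
# An explicit timestamp can be passed as the second argument.
manual_job_name() {
    local cronjob="$1"
    local stamp="${2:-$(date +%Y%m%d%H%M)}"
    local name="${cronjob}-${stamp}"
    # Kubernetes documents a 63-character limit on Job names.
    if [ "${#name}" -gt 63 ]; then
        echo "job name too long (${#name} > 63): $name" >&2
        return 1
    fi
    echo "$name"
}

# Hypothetical helper: create a one-off Job from an existing CronJob.
run_cronjob_now() {
    local cronjob="$1" name
    name=$(manual_job_name "$cronjob") || return 1
    kubectl create job "$name" --from="cronjobs/$cronjob"
}
```

For example, `run_cronjob_now mediawiki-main-serviceops-version` issues the same `kubectl create job` command as the session above.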

Job Migration

General procedure

Code changes

The jobs are still defined in Puppet, using the profile::mediawiki::periodic_job resource.

Additional parameters are necessary to migrate a job to mw-cron and remove it from the maintenance servers.

  • If the periodic jobs are defined in a subprofile of profile::mediawiki::maintenance, change the class definition to include the $helmfile_defaults_dir parameter
class profile::mediawiki::maintenance::subprofile(
    Stdlib::Unixpath $helmfile_defaults_dir = lookup('profile::kubernetes::deployment_server::global_config::general_dir', {default_value => '/etc/helmfile-defaults'}),
) {
  • Include your subprofile in profile::kubernetes::deployment_server::mediawiki::periodic_jobs
  • Add the following additional parameters:
      cron_schedule         => '*/10 * * * *', # The interval must be converted from systemd-calendar intervals to crontab syntax. Keep the interval parameter as well if your job is used on beta
      kubernetes            => true, # Create the CronJob resource in mw-cron, and remove the systemd-timer from the maintenance server
      team                  => 'job-owner-team', # For easier monitoring, log dashboard, and alerting
      script_label          => 'scriptName-wikiName', # A label for monitoring, logging, and alerting, preferably the script name and its target
      description           => 'A longer form description of the periodic job',
      helmfile_defaults_dir => $helmfile_defaults_dir, # Pass down the directory where the jobs will be defined
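To illustrate the interval conversion required for cron_schedule, here is a minimal sketch that maps a few systemd calendar forms to crontab syntax. `systemd_shorthand_to_cron` is a hypothetical helper, not part of the Puppet code, and it deliberately covers only the named shorthands and the simple `*:0/N` minute step; any other expression still has to be converted by hand.

```shell
# Hypothetical helper: convert a handful of systemd calendar forms to crontab.
systemd_shorthand_to_cron() {
    local step
    case "$1" in
        minutely)        echo '* * * * *' ;;
        hourly)          echo '0 * * * *' ;;
        daily)           echo '0 0 * * *' ;;
        # systemd "weekly" fires on Monday at 00:00, hence day-of-week 1.
        weekly)          echo '0 0 * * 1' ;;
        monthly)         echo '0 0 1 * *' ;;
        yearly|annually) echo '0 0 1 1 *' ;;
        '*:0/'*|'*:00/'*)
            # "*:0/N" means every N minutes, starting at minute 0.
            step=${1##*/}
            case "$step" in
                ''|0|*[!0-9]*)
                    echo "no automatic conversion for '$1'" >&2
                    return 1 ;;
                *) echo "*/$step * * * *" ;;
            esac ;;
        *)
            echo "no automatic conversion for '$1'" >&2
            return 1 ;;
    esac
}
```

With this sketch, `systemd_shorthand_to_cron '*:0/10'` prints the `*/10 * * * *` schedule used in the parameter list above.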

Example

We recommend the following procedure:

  • [Done in a batch, not necessary anymore] Create a first change adding the subprofile to the profile::kubernetes::deployment_server::mediawiki::periodic_jobs profile, like 1117234. This will have all the no-op changes.
  • Migrate the jobs in follow-up patches like 1117862

Deployment

  1. Disable Puppet on the maintenance server if the job can't be interrupted easily
  2. Merge the Puppet change
  3. Run Puppet on the deployment server. This creates the job definition, but does not deploy it to mw-cron yet
  4. Stop the job on the maintenance server if it is running (use systemctl status mediawiki_job_<jobname>.service to check, systemctl stop mediawiki_job_<jobname>.service to stop). If the script is currently running, check with the responsible team if it's ok to stop it.
  5. Deploy the mw-cron change with helmfile on the deployment server. The job will start on its next scheduled trigger.
  6. Enable Puppet on the maintenance server and run it. This will delete the systemd timer for the maintenance job.
  7. If needed, follow the "Manually running a CronJob" section above to trigger a manual run.
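The check in step 4 can be scripted on the maintenance server. The following is a sketch only: `maintenance_job_unit` and `job_is_running` are hypothetical helper names, and the unit naming follows the mediawiki_job_<jobname>.service pattern given above. Both the "active" and "activating" states are treated as running, on the assumption that the unit may be a oneshot service, which systemd reports as "activating" while its command executes.

```shell
# Hypothetical helper: build the systemd unit name for a maintenance job.
maintenance_job_unit() {
    echo "mediawiki_job_$1.service"
}

# Hypothetical helper: succeed if the job's unit is currently executing.
job_is_running() {
    # "systemctl is-active" prints the unit state (active, inactive, ...).
    case "$(systemctl is-active "$(maintenance_job_unit "$1")")" in
        active|activating) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `if job_is_running <jobname>; then echo "still running"; fi` reports a job that should be discussed with the responsible team before it is stopped.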