
Mw-cron jobs

From Wikitech

This page documents the new Kubernetes way of running scheduled MediaWiki Maintenance scripts. The old system on the Maintenance servers is still available as a fallback for now, but those servers will be going away.

If you discover issues with a migrated maintenance script, please report them on the relevant subtask of task T341555.

mw-cron is a MediaWiki On Kubernetes deployment in WikiKube, for automatically executing MediaWiki maintenance scripts on a regular schedule.

SRE is targeting March 2025 to complete the migration from the maintenance servers to the mw-cron deployment.

Logs

Logstash Dashboard

Monitoring

Alerting

By default, we are setting up two alerts:

  1. A warning for team=serviceops that will show up in ServiceOps' AlertManager dashboard
  2. Either a Phabricator task on the owning team's PHID, or a Slack message in their channel, depending on preference

Please note that in the case of a Phabricator alert, a task titled MediaWiki periodic job $cronjob_name failed will be opened and, as long as the task is open, future alerts firing for that team and job will update its description.

Probe

For now, the probe fires whenever a Job has failed, regardless of the reason (eviction, non-zero exit code, etc.).

It does not auto-resolve on a subsequent successful run; the failed Job must be deleted manually to resolve the alert.
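As an illustrative sketch of resolving the alert from the deployment server (the job name below is a placeholder, not a real job):

```shell
# Enter the mw-cron kubectl scope in the primary datacentre.
kube-env mw-cron eqiad

# List failed Jobs, then delete the one backing the alert.
# "example-something-29050030" is a placeholder name.
kubectl get jobs --field-selector status.successful=0
kubectl delete job example-something-29050030
```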

Creating a new periodic job

Given that deployment-prep still uses systemd timers for periodic jobs, some parameters are duplicated due to incompatibilities between crontab and systemd timer syntax.

In puppet, create your periodic_job resource:

profile::mediawiki::periodic_job { 'myteam-periodicjob':
    command                 => '/usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/large.dblist extensions/blah/maintenance/periodic_job.php',
    interval                => '*-*-* *:11:00', # systemd-timer syntax, for deployment-prep
    cron_schedule           => '11 * * * *', # crontab syntax, for mw-cron on kubernetes
    kubernetes              => true, # In production, create the resource on kubernetes, not the maintenance server
    team                    => 'myteam', # AlertManager team tag, should lead to a proper alert route in modules/alertmanager/templates/alertmanager.yml.erb
    script_label            => 'periodic_job.php-large', # Your script name, potentially augmented with the wiki or section
    description             => 'Run blah-ish periodic job on large wikis',
    concurrency_policy      => 'Forbid', # Can be omitted if your job may be restarted when it overruns; the default is Replace.
    startingdeadlineseconds => 1800, # Should be omitted if you do not specify concurrency_policy => 'Forbid'
    ttlsecondsafterfinished => 106400, # Default 106400: how long finished jobs are kept after completion. Should be about 2*interval to keep some history.
    helmfile_defaults_dir   => $helmfile_defaults_dir, # Mandatory
}

Then, after a proper review and PCC check, you can move on to Deploying periodic jobs.

What's the deal with concurrency_policy and startingdeadlineseconds?

Jobs that need to complete and must not be restarted if they run longer than the time between schedules should set concurrency_policy: Forbid and a suitable startingDeadlineSeconds (around half the interval, but it can be fine-tuned).

Jobs that can handle, or need, a restart should stick to the default settings (concurrency_policy: Replace). They will be killed and restarted on schedule.

More information in task T394423: Investigate startingDeadlineSeconds setting for Kubernetes CronJobs.
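To check how these parameters were rendered, you can inspect the resulting CronJob spec; the CronJob name below is a placeholder, and the jsonpath fields are the standard Kubernetes CronJob spec fields:

```shell
# Enter the mw-cron kubectl scope, then read the concurrency settings
# straight from the deployed CronJob object.
kube-env mw-cron eqiad
kubectl get cronjob myteam-periodicjob \
    -o jsonpath='{.spec.concurrencyPolicy} {.spec.startingDeadlineSeconds}{"\n"}'
```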

Deploying periodic jobs

  1. Merge your puppet change
  2. Run puppet on the currently active deployment server
    deploy1003:~$ sudo run-puppet-agent
    
  3. Deploy the changes to mw-cron using helmfile on the currently active datacentre
    deploy1003:~$ cd /srv/deployment-charts/helmfile.d/services/mw-cron
    deploy1003:~$ helmfile -e eqiad -i apply --context 5
    
  4. The job will start on its next schedule, but you can use #Manually running a CronJob to execute it ahead of schedule.
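Before applying (step 3), it can be useful to preview what helmfile will change. A sketch of the apply step with a diff first:

```shell
cd /srv/deployment-charts/helmfile.d/services/mw-cron
# Preview the rendered changes without touching the cluster.
helmfile -e eqiad diff --context 5
# Then apply interactively, as in step 3 above.
helmfile -e eqiad -i apply --context 5
```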

Troubleshooting

If your job has failed, you can either look at logstash, or diagnose from the command line on the deployment server.

Logs from kubernetes

If one of your maintenance scripts recently failed, a Phabricator task will have been opened for your team (example). The task description provides a more specific kubectl get jobs command invocation that lists only failed jobs specific to the maintenance script that prompted the task.

The example below assumes eqiad is the primary datacentre.

1. Enter the kubectl scope for the mw-cron cluster in the primary datacentre.

deploy1003:~$ kube-env mw-cron eqiad

2. List recently failed jobs, or list all jobs:

deploy1003:~$ kubectl get jobs --field-selector status.successful=0
deploy1003:~$ kubectl get jobs

NAMESPACE   NAME                              COMPLETIONS   DURATION   AGE
mw-cron     example-something-29050030        0/1           9s         28m

3. Access the logs from the pod (insert the job name after "jobs/", and select the mediawiki container):

deploy1003:~$ kubectl logs jobs/example-something-29050030 mediawiki-main-app

Doing stuff for things...
...found this.
...did that.
Done!
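If the Job failed before producing useful logs (for example, the pod was evicted), describing the Job and listing its pods usually reveals the reason; the job name below is a placeholder:

```shell
# Events at the bottom of the output show why the Job failed.
kubectl describe job example-something-29050030

# Pods created by a Job carry the standard job-name label.
kubectl get pods -l job-name=example-something-29050030
```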

Manually running a CronJob

deploy1003:~$ kube-env mw-cron eqiad
deploy1003:~$ export KUBECONFIG=/etc/kubernetes/mw-cron-deploy-eqiad.config
deploy1003:~$ kubectl create job mediawiki-main-serviceops-version-$(date +"%Y%m%d%H%M") --from=cronjobs/mediawiki-main-serviceops-version
job.batch/mediawiki-main-serviceops-version-202504081112 created
deploy1003:~$ kubectl get jobs -l 'team=sre-serviceops, cronjob=serviceops-version'
NAME                                             COMPLETIONS   DURATION   AGE
mediawiki-main-serviceops-version-202504081112   1/1           49s        81s
[...]

It's a good idea to delete the job once your manual run is done and you no longer need information from it:

deploy1003:~$ kubectl delete job mediawiki-main-serviceops-version-202504081112

Job Migration

General procedure

Code changes

The jobs are still defined in puppet, using the profile::mediawiki::periodic_job resource.

Additional parameters are necessary to migrate a job to mw-cron and remove it from the maintenance servers.

  • If the periodic jobs are defined in a subprofile of profile::mediawiki::maintenance, change the class definition to include the $helmfile_defaults_dir parameter
class profile::mediawiki::maintenance::subprofile(
    Stdlib::Unixpath $helmfile_defaults_dir = lookup('profile::kubernetes::deployment_server::global_config::general_dir', {default_value => '/etc/helmfile-defaults'}),
) {
  • Include your subprofile in profile::kubernetes::deployment_server::mediawiki::periodic_jobs
  • Add the following additional parameters:
      cron_schedule         => '*/10 * * * *', # The interval must be converted from systemd-calendar intervals to crontab syntax. Keep the interval parameter as well if your job is used on beta
      kubernetes            => true, # Create the CronJob resource in mw-cron, and remove the systemd-timer from the maintenance server
      team                  => 'job-owner-team', # For easier monitoring, log dashboard, and alerting
      script_label          => 'scriptName-wikiName', # A label for monitoring, logging, and alerting, preferably the script name and its target
      description           => 'A longer form description of the periodic job',
      helmfile_defaults_dir => $helmfile_defaults_dir, # Pass down the directory where the jobs will be defined

Example

We recommend the following procedure:

  • [Done in a batch, not necessary anymore] Create a first change adding the subprofile to the profile::kubernetes::deployment_server::mediawiki::periodic_jobs profile, like 1117234. This will have all the no-op changes.
  • Migrate the jobs in follow-up patches like 1117862

Deployment

  1. Disable puppet on the maintenance server if the job can't be interrupted easily
  2. Merge the puppet change
  3. Run puppet on the deployment server, this will create the job definition, but it won't be deployed to mw-cron yet
  4. Stop the job on the maintenance server if it is running (use systemctl status mediawiki_job_<jobname>.service to check, systemctl stop mediawiki_job_<jobname>.service to stop). If the script is currently running, check with the responsible team if it's ok to stop it.
  5. Deploy the mw-cron change with helmfile on the deployment server. The job will start on its next scheduled trigger.
  6. Enable puppet on the maintenance server and run it. This will delete the systemd timer for the maintenance job.
  7. If needed, use Manually_running_a_CronJob to trigger a manual run.
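A sketch of the maintenance-server side of steps 1, 4 and 6, assuming the standard WMF puppet helper scripts and a job named jobname (replace with your actual job name):

```shell
# Step 1: keep puppet from recreating or altering the timer mid-migration.
sudo disable-puppet "migrating mediawiki_job_jobname to mw-cron"

# Step 4: check whether the job is currently running, then stop it.
systemctl status mediawiki_job_jobname.service
sudo systemctl stop mediawiki_job_jobname.service

# Step 6: re-enable puppet (same reason string) and run it to remove the timer.
sudo enable-puppet "migrating mediawiki_job_jobname to mw-cron"
sudo run-puppet-agent
```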