This page describes operational actions involving the MediaWiki JobQueue infrastructure at WMF, such as runbooks for responding to JobQueue-related alerts.

Alerts

JobQueueLowTrafficConsumerWidespreadHighLatency

This alert and runbook are brand new. If anything is unclear, please seek assistance and flag in task T378609.

This alert detects when the majority of active job types processed by changeprop-jobqueue’s shared "low-traffic" consumer have been experiencing unusually high processing delays for a sustained period.

Investigation

The most likely trigger is a large influx of jobs of one specific type, which starves the other job types on the shared consumer of throughput (i.e., a job that is no longer “low traffic”).

Start by opening the JobQueue Low Traffic Jobs dashboard link that accompanied the alert. Look for a large increase in processing rate for a single job’s associated rule (the trigger) that correlates with increased processing delay across many other rules. You may also see a large increase in backlog for one or more topics, including that associated with the trigger job.

If there's no obvious "high traffic" trigger, another possibility is a modest influx of an unusually slow-processing job that is starving other job types on the shared consumer (concurrency is limited). In that case, look instead for an outlier rule (job) with an unusually high processing duration.

If at this point there is no clear trigger job, STOP. Something more subtle is likely happening, and attempting to follow the response guidance below may be more disruptive than helpful.

Response

Responding to this scenario is a three-step process, in which we move processing of the trigger job out of the shared low-traffic consumer (mitigation) and onto its own dedicated consumer / processing rule (resolution).

Mitigation

First, prepare a patch to add the trigger job to the high_traffic_jobs_config in helmfile.d/services/changeprop-jobqueue/values.yaml in the deployment-charts repo with enabled: false.

For example, if TriggerJob is the name of the job you identified (associated with rule low-traffic-jobs-mediawiki-job-TriggerJob and topic $DC.mediawiki.job.TriggerJob):

    high_traffic_jobs_config:
      [ ... ]
+     TriggerJob:
+       enabled: false
+       concurrency: 10  # Arbitrary initial concurrency.

Merge this change and deploy it (see Kubernetes/Deployments).

Once this is deployed, processing delay should start to drop for the other jobs (rules) on the low-traffic consumer. If it does, the incident is mitigated but not yet resolved, since processing of the trigger job is now paused.

Resolution

FIXME: This procedure ignores retry topics.

Resolution involves two steps.

First, prepare the new dedicated consumer to pick up from where the low-traffic consumer left off.

In practice, this step is likely only required in the alerting site (which should in general be the current primary DC). However, it's best to perform this same procedure in both core DC sites in sequence.

Start by fetching the current offset of the trigger job's topic on the cpjobqueue-low_traffic_jobs consumer. From any kafka-main broker host in the site you're operating on:

kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-low_traffic_jobs --describe
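The describe output covers every topic the group consumes, so picking one row out by eye is error-prone. The following is a minimal sketch of a filter for that output, not part of the official procedure. The function name extract_offset is hypothetical. It assumes the output has a header row containing TOPIC, PARTITION and CURRENT-OFFSET, and it locates the columns by name because their order may differ between Kafka versions. It fails unless the topic has exactly one row, for partition 0.

```shell
# Hypothetical helper: read `kafka-consumer-groups --describe` output on stdin
# and print CURRENT-OFFSET for the given topic.
# Assumption: a header row names the TOPIC, PARTITION and CURRENT-OFFSET columns.
extract_offset() {
  awk -v topic="$1" '
    # Until the header row is found, look for the column positions by name.
    !have_header {
      t = p = o = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") t = i
        else if ($i == "PARTITION") p = i
        else if ($i == "CURRENT-OFFSET") o = i
      }
      if (t && p && o) have_header = 1
      next
    }
    # Remember every data row that belongs to the requested topic.
    $t == topic { rows++; part = $p; offset = $o }
    END {
      # Refuse to print anything unless there is exactly one row, for partition 0.
      if (rows != 1 || part != 0) {
        print "expected exactly one row (partition 0) for " topic ", found " (rows + 0) > "/dev/stderr"
        exit 1
      }
      print offset
    }
  '
}
```

With $DC and $TRIGGER set as in the reset command later in this section, it could be used as:

OFFSET=$(kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-low_traffic_jobs --describe | extract_offset "$DC.mediawiki.job.$TRIGGER")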

For the topic associated with a low-traffic job type, we should only ever see a single partition (0). Take note of its CURRENT-OFFSET, which we will use to initialize the offset of the new dedicated consumer group. We can test this out by running the following:

kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-$TRIGGER --topic $DC.mediawiki.job.$TRIGGER --reset-offsets --to-offset $OFFSET --dry-run

where $TRIGGER is the trigger job you identified (i.e., TriggerJob in our example above), $OFFSET is the CURRENT-OFFSET on the old consumer group, and $DC is the site you're currently operating on.

Then, if there are no issues reported (e.g., we got the topic name wrong), run the same command again with --execute instead of --dry-run. You can then confirm the offset was initialized with:

kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-$TRIGGER --describe
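Because the dry run and the real run must differ only in their final flag, it can help to generate both from the same inputs rather than retyping them. The sketch below is one way to do that. The function name reset_offset_cmd and its argument order are hypothetical. It only prints the reset command shown above and does not run it.

```shell
# Hypothetical helper: print (do not run) the offset reset command.
# Arguments: $1 = site ($DC), $2 = trigger job, $3 = offset, $4 = --dry-run or --execute
reset_offset_cmd() {
  # Accept only the two modes used in this procedure.
  case "$4" in
    --dry-run|--execute) ;;
    *) echo "mode must be --dry-run or --execute" >&2; return 1 ;;
  esac
  # Reject an empty or non-numeric offset, e.g. from a bad copy and paste.
  case "$3" in
    ''|*[!0-9]*) echo "offset must be a non-negative integer" >&2; return 1 ;;
  esac
  printf 'kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-%s --topic %s.mediawiki.job.%s --reset-offsets --to-offset %s %s\n' \
    "$2" "$1" "$2" "$3" "$4"
}
```

For example, reset_offset_cmd "$DC" "$TRIGGER" "$OFFSET" --dry-run prints the dry-run command for review. Calling it again with --execute prints the matching real command.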

Second, prepare another patch that sets enabled: true for the dedicated consumer / rule you added to high_traffic_jobs_config earlier, then merge and deploy it. Continuing the example from above:

      TriggerJob:
-       enabled: false
+       enabled: true
        concurrency: 10

At this point, you should see processing begin on the new dedicated consumer / rule, which you can confirm via the JobQueue Job dashboard by selecting the trigger job.