MediaWiki JobQueue/Operations
This page describes operational actions involving the MediaWiki JobQueue infrastructure at WMF, such as runbooks for responding to JobQueue-related alerts.
Alerts
JobQueueLowTrafficConsumerWidespreadHighLatency
This alert detects when the majority of active job types processed by changeprop-jobqueue’s shared "low-traffic" consumer have been experiencing unusually high processing delays for a sustained period.
Investigation
The most likely trigger is a large influx of jobs of one specific type, which starves the other job types on the shared consumer of throughput (i.e., a job that is no longer “low traffic”).
Start by opening the JobQueue Low Traffic Jobs dashboard link that accompanied the alert. Look for a large increase in processing rate for a single job’s associated rule (the trigger) that correlates with increased processing delay across many other rules. You may also see a large increase in backlog for one or more topics, including that associated with the trigger job.
If there's no obvious "high traffic" trigger, another possibility is a modest influx of an unusually slow-processing job that is starving other job types on the shared consumer (concurrency is limited). In that case, look instead for an outlier rule (job) with an unusually high processing duration.
Response
Responding to this scenario is a three-step process, in which we move processing of the trigger job out of the shared low-traffic consumer (mitigation) and onto its own dedicated consumer / processing rule (resolution).
Mitigation
First, prepare a patch that adds the trigger job, with enabled: false, to the high_traffic_jobs_config in helmfile.d/services/changeprop-jobqueue/values.yaml in the deployment-charts repo.
For example, if TriggerJob is the name of the job you identified (associated with rule low-traffic-jobs-mediawiki-job-TriggerJob and topic $DC.mediawiki.job.TriggerJob):
 high_traffic_jobs_config:
   [ ... ]
+  TriggerJob:
+    enabled: false
+    concurrency: 10 # Arbitrary initial concurrency.
Merge this change and deploy it (Kubernetes/Deployments).
Once this is deployed, processing delay time should start to drop for other jobs (rules) on the low-traffic consumer. If this is the case, the incident is now mitigated, but not yet resolved, since processing is now paused for the trigger job.
Resolution
Resolution involves two steps.
First, prepare the new dedicated consumer to pick up from where the low-traffic consumer left off.
In practice, this step is likely only required in the alerting site (which should in general be the current primary DC). However, it's best to perform this same procedure in both core DC sites in sequence.
Start by fetching the current offset of the trigger job's topic on the cpjobqueue-low_traffic_jobs consumer. From any kafka-main broker host in the site you're operating on:
kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-low_traffic_jobs --describe
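Because this consumer group covers many topics, the output can be long. As a minimal sketch (not part of the established procedure), the row for a single topic could be picked out with awk. The function name offset_for_topic is hypothetical, and the script assumes the output has a header row containing TOPIC, PARTITION, and CURRENT-OFFSET columns; it looks up their positions from the header because these may differ between Kafka versions.

```shell
# Hypothetical helper: print "partition current-offset" for one topic, reading
# `kafka-consumer-groups --describe` output on stdin. Column positions are
# looked up from the header row rather than hard-coded.
offset_for_topic() {
  awk -v topic="$1" '
    # Until the header row has been seen, scan each line for column names.
    !t {
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") t = i
        if ($i == "PARTITION") p = i
        if ($i == "CURRENT-OFFSET") o = i
      }
      next
    }
    # Data rows: print partition and offset for the requested topic only.
    $t == topic { print $p, $o }
  '
}

# Usage, with TriggerJob standing in for the job you identified:
#   kafka-consumer-groups --bootstrap-server localhost:9092 \
#     --group cpjobqueue-low_traffic_jobs --describe \
#     | offset_for_topic "$DC.mediawiki.job.TriggerJob"
```

If more than one line is printed, the topic has more than one partition and the single-offset procedure described here does not apply as written.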
For the topic associated with a low-traffic job type, we should only ever see a single partition (0). Take note of its CURRENT-OFFSET, which we will use to initialize the offset of the new dedicated consumer group. We can test this out by running the following:
kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-$TRIGGER --topic $DC.mediawiki.job.$TRIGGER --reset-offsets --to-offset $OFFSET --dry-run
where $TRIGGER is the trigger job you identified (i.e., TriggerJob in our example above), $OFFSET is the CURRENT-OFFSET on the old consumer group, and $DC is the site you're currently operating on.
Then, if there are no issues reported (e.g., we got the topic name wrong), run that again with --execute instead of --dry-run. You can then confirm the offset was initialized with:
kafka-consumer-groups --bootstrap-server localhost:9092 --group cpjobqueue-$TRIGGER --describe
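Since the reset command is assembled from values copied by hand, one option is to generate it with a small wrapper that rejects a malformed offset before anything is run. The following is only a sketch under that assumption: reset_offset_cmd is a hypothetical helper, not an existing tool, and it prints the command for review rather than executing it.

```shell
# Hypothetical helper: print the reset command for review; it never runs it.
# Arguments: site, trigger job name, offset, and mode (--dry-run or --execute).
reset_offset_cmd() {
  dc=$1; trigger=$2; offset=$3; mode=${4:-"--dry-run"}
  # Reject anything that is not a plain non-negative integer, such as the "-"
  # shown when a group has no committed offset.
  case $offset in
    ''|*[!0-9]*) echo "offset must be a non-negative integer: '$offset'" >&2; return 1 ;;
  esac
  # Only the two modes used in this runbook are accepted.
  case $mode in
    --dry-run|--execute) ;;
    *) echo "mode must be --dry-run or --execute" >&2; return 1 ;;
  esac
  printf '%s %s %s\n' \
    "kafka-consumer-groups --bootstrap-server localhost:9092" \
    "--group cpjobqueue-$trigger --topic $dc.mediawiki.job.$trigger" \
    "--reset-offsets --to-offset $offset $mode"
}

# Usage: reset_offset_cmd "$DC" "$TRIGGER" "$OFFSET"            # dry run
#        reset_offset_cmd "$DC" "$TRIGGER" "$OFFSET" --execute
```

Defaulting to --dry-run keeps the safer mode as the one you get when the fourth argument is omitted.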
Second, prepare a second patch to set enabled: true in the high_traffic_jobs_config on the dedicated consumer / rule you configured previously, merge, and deploy. Continuing the example from above:
   TriggerJob:
-    enabled: false
+    enabled: true
     concurrency: 10
At this point, you should see processing begin on the new dedicated consumer / rule, which you can confirm via the JobQueue Job dashboard by selecting the trigger job.