Wikidata Query Service/Runbooks/ElevatedMaxLagWDQS
Begin
STEP: Find the host or hosts with lag over 10m using the Wikidata > Wikidata Query Service dashboard. If you don't see any hosts with lag > 10m, try selecting a different Cluster name value (wdqs-main vs wdqs-scholarly) from the dashboard.
DECISION POINT:
- If one or two hosts are lagged, go to 1.2
- If an entire DC is lagged, go to 1.3
1.2 One or two hosts lagged
STEP: Depool host
bking@wdqs2002:~$ sudo depool
STEP: Restart blazegraph
bking@wdqs2002:~$ sudo systemctl restart wdqs-blazegraph.service
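If you want a quick sanity check that the restart completed cleanly before watching the dashboard, standard systemd tooling is enough (nothing WDQS-specific, just a suggestion):
bking@wdqs2002:~$ systemctl status wdqs-blazegraph.service
bking@wdqs2002:~$ sudo journalctl -u wdqs-blazegraph.service --since "10 minutes ago"
The first should report active (running) with a fresh start time; the second should be free of startup errors.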
STEP: Monitor lag/repool
You can monitor the host's (or hosts') lag from the dashboard panel linked above, or wait for the host-scoped RdfStreamingUpdaterHighConsumerUpdateLag alert(s) to clear.
In my experience, the backlog clears pretty quickly once Blazegraph has been restarted. See this example, where we cleared 24h of backlog in about an hour.
Once the lag drops below 10 minutes, you can repool the host:
bking@wdqs2002:~$ sudo pool
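If you'd like to double-check the pooled state from the host itself, here is a minimal sketch (assuming confctl is available on the host, as it is wherever the pool/depool wrappers exist; otherwise run the equivalent from a cumin host):
bking@wdqs2002:~$ sudo confctl select "name=$(hostname -f)" get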
1.3 Entire DC lagged
This typically means the rdf-streaming-updater has failed in that datacenter. You can therefore stop the user-facing impact by depooling the affected datacenter.
STEP: Create a Phab task for the work. It can be a placeholder; we just need a descriptive title and something to link back to. For example: Investigate rdf-streaming-updater failure in eqiad
STEP: Depool the affected DC using confctl:
bking@cumin2002:~$ sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=${DC}' set/pooled=false
Be sure to !log this to the SAL in #wikimedia-operations and reference the Phab task.
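${DC} here is a placeholder for the affected datacenter (eqiad or codfw); substitute it before running. As an illustrative example for eqiad (not a transcript), including a follow-up read to confirm the change took effect:
bking@cumin2002:~$ sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=eqiad' set/pooled=false
bking@cumin2002:~$ sudo confctl --object-type discovery select 'dnsdisc=wdqs' get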
STEP: Check the health of the RDF Streaming Updater in the DC
Check the RDF Streaming Updater dashboard page (be sure to select the correct DC!). Rather than a specific metric, you're looking for general signs that the service is alive: metrics are being reported, work is being done, etc.
Likewise, you can check the application's state from the deploy server
kube_env rdf-streaming-updater-deploy ${DC}
kubectl get po | grep taskmanager
If the above command shows no task manager pods, the application is broken.
Optionally, you can gather more logs with:
kubectl logs -f flink-app-${RELEASE}-(TAB COMPLETE) flink-main-container
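If there are no taskmanager pods and the logs alone don't explain why, recent namespace events and a pod description are usually the fastest pointers (plain kubectl, nothing chart-specific):
kubectl get events --sort-by=.lastTimestamp | tail -n 20
kubectl describe pod flink-app-${RELEASE}-(TAB COMPLETE)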
DECISION POINT:
- If the service appears to be alive AND/OR there are no obvious errors in the logs, put as much info as you can into the Phab task and escalate to an SME. If they're not available, ping them in Slack/IRC so they can take a look once they get back to work. END
- If the service appears to be dead from the dashboard, go to 1.4
1.4 Restore rdf-streaming-updater from checkpoint
STEP: Find the latest checkpoint information in the Flink App Dashboard's Disaster Recovery panel
STEP: Using the information in the dashboard, create a deployment-charts patch with a new initialSavepointPath value. Example CR. Pseudo-template string for the path (using Prometheus variables):
s3://${namespace}-${site}/${release}/checkpoints/${job_id}/chk-${checkpoint_id}
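For illustration only (the real values must come from the Disaster Recovery panel), a filled-in path for the wikidata release in eqiad, reusing the job id and checkpoint number that appear in the sample logs further down, would look roughly like:
s3://rdf-streaming-updater-eqiad/wikidata/checkpoints/a859e5ab8a27f072561979eeb5ee4853/chk-3508640
The namespace prefix (rdf-streaming-updater) is an assumption here; use whatever the panel shows.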
STEP: Destroy the helmfile release
Once your patch is merged, log in to the deployment server and begin the process.
kube_env rdf-streaming-updater-deploy ${DC}
Note: We use the -deploy user, as it has helmfile destroy permissions.
cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
helmfile -e ${DC} -i destroy --selector name=${RELEASE}
Where ${RELEASE} is one of wikidata or commons. Confirm that no pods remain after destroying.
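A quick way to confirm nothing is left for that release, reusing the pod-name prefix shown elsewhere in this runbook:
kubectl get po | grep flink-app-${RELEASE}
This should return nothing once the destroy has finished.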
STEP: Apply the helmfile release
cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
helmfile -e ${DC} -i apply --selector name=${RELEASE}
After applying, monitor the logs via
kubectl logs -f flink-app-${RELEASE}-(TAB COMPLETE) flink-main-container
It takes a few minutes, but during the bootstrap process things will change state from INITIALISING to RUNNING. Eventually, you will see a sequence of log lines such as
{"@timestamp":"2025-06-25T21:45:02.104Z","log.level": "INFO","message":"Triggering checkpoint 3508640 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1750887902096 for job a859e5ab8a27f072561979eeb5ee4853.", "ecs.version": "1.2.0","process.thread.name":"Checkpoint Timer","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
{"@timestamp":"2025-06-25T21:45:10.372Z","log.level": "INFO","message":"Completed checkpoint 3508640 for job a859e5ab8a27f072561979eeb5ee4853 (1534345549 bytes, checkpointDuration=7220 ms, finalizationTime=1056 ms).", "ecs.version": "1.2.0","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
{"@timestamp":"2025-06-25T21:45:10.372Z","log.level": "INFO","message":"Marking checkpoint 3508640 as completed for source Source: KafkaSource:mediawiki.page_change.v1.", "ecs.version": "1.2.0","process.thread.name":"SourceCoordinator-Source: KafkaSource:mediawiki.page_change.v1","log.logger":"org.apache.flink.runtime.source.coordinator.SourceCoordinator"}
The final log line marking the checkpoint as complete serves as your confirmation that the service is healthy again.
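If you'd rather not watch the full log stream, filtering for the checkpoint messages shown above works too (same kubectl command as before, just piped through grep):
kubectl logs -f flink-app-${RELEASE}-(TAB COMPLETE) flink-main-container | grep -i "completed checkpoint"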
STEP: Go to step 1.2 and follow the "Monitor lag/repool" step. When the alerts have cleared, you're ready to repool the datacenter.
STEP: Repool the datacenter
bking@cumin2002:~$ sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=${DC}' set/pooled=true
Quick links
Additional Information
In Conclusion
This space intentionally left blank