
Wikidata Query Service/Runbooks/ElevatedMaxLagWDQS

From Wikitech


Begin

STEP: Find the host or hosts with lag over 10 minutes using the Wikidata > Wikidata Query Service dashboard. If you don't see any hosts with lag > 10m, try selecting a different Cluster name value (wdqs-main vs wdqs-scholarly) from the dashboard.

DECISION POINT:

  • If one or two hosts are lagged, go to 1.2
  • If an entire DC is lagged, go to 1.3

1.2 One or two hosts lagged

STEP: Depool host

bking@wdqs2002:~$ sudo depool

STEP: Restart blazegraph

bking@wdqs2002:~$ sudo systemctl restart wdqs-blazegraph.service

STEP: Monitor lag/repool

You can monitor the lag on the affected host(s) from the dashboard panel linked above, or wait for the host-scoped RdfStreamingUpdaterHighConsumerUpdateLag alert(s) to clear.

In my experience, the backlog clears quickly once Blazegraph has been restarted. See this example, where we cleared 24 hours of backlog in about an hour.
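The repool criterion above can be sketched as a trivial threshold check; the lag value below is a hypothetical example, in practice read it from the dashboard or the alert.

```shell
# Hypothetical lag value in seconds; in practice, read it from the dashboard.
lag_seconds=540
threshold_seconds=$((10 * 60))   # repool threshold: 10 minutes

if [ "$lag_seconds" -lt "$threshold_seconds" ]; then
  echo "lag below 10m: OK to repool"
else
  echo "still lagged: keep depooled"
fi
```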

Once the lag drops below 10 minutes, you can repool the host:

bking@wdqs2002:~$ sudo pool

1.3 Entire DC lagged

This typically means the rdf-streaming-updater has failed in that datacenter. You can end user-facing impact by depooling the affected datacenter.

STEP: Create a Phab task for the work. It can be a placeholder; we just need a descriptive title and something to link back to. For example: "Investigate rdf-streaming-updater failure in eqiad"

STEP: Depool the affected DC using confctl:

bking@cumin2002:~$ sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=${DC}' set/pooled=false

Be sure to !log this to the SAL in #wikimedia-operations and reference the Phab task.
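For example, the SAL entry might look like the following (the task ID is a placeholder; use the task you created above):

```
!log depooled wdqs in ${DC} while investigating rdf-streaming-updater failure - TXXXXXX
```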

STEP: Check the health of the RDF Streaming Updater in the DC

Check the RDF Streaming Updater dashboard page (be sure to select the correct DC!). Rather than a specific metric, you're looking for general indicators that the service is alive: metrics are being reported, work is being done, etc.

You can also check the application's state from the deploy server:

kube_env rdf-streaming-updater-deploy ${DC}
kubectl get po | grep taskmanager

If the above command shows no task manager pods, the application is broken.
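That check can be sketched as a simple pod count; the canned `kubectl get po` output below (with hypothetical pod names and no taskmanager pods) stands in for the real command's output.

```shell
# Canned `kubectl get po` output with no taskmanager pods (names hypothetical).
pods="flink-app-wikidata-5f9cdd   1/1   Running   0   12m"

count=$(printf '%s\n' "$pods" | grep -c taskmanager || true)
if [ "$count" -eq 0 ]; then
  echo "no taskmanager pods found: application is broken"
fi
```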

Optionally, you can gather more logs with

kubectl logs -f flink-app-${RELEASE}-(TAB COMPLETE) flink-main-container



DECISION POINT:

  • If the service appears to be alive and/or there are no obvious errors in the logs, put as much information as possible into the Phab task and escalate to an SME. If they're not available, ping them in Slack/IRC so they can take a look once they're back at work. END
  • If the service appears to be dead from the dashboard, go to 1.4


1.4 Restore rdf-streaming-updater from checkpoint

STEP: Find the latest checkpoint data info from the Flink App Dashboard's Disaster Recovery panel

STEP: Using the information in the dashboard, create a deployment-charts patch with a new initialSavepointPath value. Example CR. Pseudo-template string for the path (using Prometheus variables):

s3://${namespace}-${site}/${release}/checkpoints/${job_id}/chk-${checkpoint_id}
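As a worked sketch, filling in the template with hypothetical values (the job ID and checkpoint ID below are taken from the example log output later in this runbook; read the real values from the Disaster Recovery panel):

```shell
# All values below are hypothetical examples; read the real ones from the
# Disaster Recovery panel before writing the patch.
namespace="rdf-streaming-updater"
site="eqiad"
release="wikidata"
job_id="a859e5ab8a27f072561979eeb5ee4853"   # Flink job ID from the dashboard
checkpoint_id="3508640"                     # latest completed checkpoint

initialSavepointPath="s3://${namespace}-${site}/${release}/checkpoints/${job_id}/chk-${checkpoint_id}"
echo "${initialSavepointPath}"
```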

STEP: Destroy the helmfile release

Once your patch is merged, login to the deployment server and begin the process.

kube_env rdf-streaming-updater-deploy ${DC}

Note: We use the -deploy user, as it has helmfile destroy permissions.

cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
helmfile -e ${DC} -i destroy --selector name=${RELEASE}

Where ${RELEASE} is one of wikidata or commons. Confirm that no pods remain after the destroy.

STEP: Apply the helmfile release

cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
helmfile -e ${DC} -i apply --selector name=${RELEASE}

Once applied, monitor the logs via

kubectl logs -f flink-app-${RELEASE}-(TAB COMPLETE) flink-main-container

It takes a few minutes, but during the bootstrap process, tasks will change state from INITIALIZING to RUNNING. Eventually, you will see a sequence of log lines such as:

{"@timestamp":"2025-06-25T21:45:02.104Z","log.level": "INFO","message":"Triggering checkpoint 3508640 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1750887902096 for job a859e5ab8a27f072561979eeb5ee4853.", "ecs.version": "1.2.0","process.thread.name":"Checkpoint Timer","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
{"@timestamp":"2025-06-25T21:45:10.372Z","log.level": "INFO","message":"Completed checkpoint 3508640 for job a859e5ab8a27f072561979eeb5ee4853 (1534345549 bytes, checkpointDuration=7220 ms, finalizationTime=1056 ms).", "ecs.version": "1.2.0","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
{"@timestamp":"2025-06-25T21:45:10.372Z","log.level": "INFO","message":"Marking checkpoint 3508640 as completed for source Source: KafkaSource:mediawiki.page_change.v1.", "ecs.version": "1.2.0","process.thread.name":"SourceCoordinator-Source: KafkaSource:mediawiki.page_change.v1","log.logger":"org.apache.flink.runtime.source.coordinator.SourceCoordinator"}

The final log line marking the checkpoint as complete serves as your confirmation that the service is healthy again.
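If you've captured the kubectl logs to a shell variable or file, that confirmation can be sketched as a simple grep for the completion message; the sample line below is trimmed from the example output above and stands in for real log output.

```shell
# Sample Flink log line (trimmed from the example above) standing in for
# real `kubectl logs` output.
logs='{"log.level": "INFO","message":"Completed checkpoint 3508640 for job a859e5ab8a27f072561979eeb5ee4853 (1534345549 bytes)."}'

if printf '%s\n' "$logs" | grep -q 'Completed checkpoint'; then
  echo "checkpoint completed: service looks healthy"
fi
```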


STEP: Go to section 1.2 and follow the "Monitor lag/repool" step. When the alerts have cleared, you're ready to repool the datacenter.

STEP: Repool the datacenter

bking@cumin2002:~$ sudo confctl --object-type discovery select 'dnsdisc=wdqs,name=${DC}' set/pooled=true

