Incidents/2023-05-05 wdqs not updating in codfw

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID	2023-05-05 wdqs not updating in codfw	Start	2023-05-04T10:00
Task	T336134	End	2023-05-10T10:30
People paged	0	Responder count	3
Coordinators	None	Affected metrics/SLOs	WDQS update lag
Impact	End users accessing WDQS from the CODFW region received stale results.

…

The rdf-streaming-updater application in CODFW became unstable and stopped sending updates, resulting in stale data for users connecting through CODFW.

Timeline

2023-05-04T10:00: the streaming updater flink job stopped to function in codfw for both WDQS and WCQS
- user impact starts: stale results are seen when using WDQS from a region that hits CODFW
- reason is likely https://issues.apache.org/jira/browse/FLINK-22597
2023-05-05T16:22: the problem is reported by Bovlb via https://www.wikidata.org/wiki/Wikidata:Report_a_technical_problem/WDQS_and_Search
2023-05-05T19:00: the flink jobmanager container is manually restarted and the jobs resume but the WDQS one is very unstable (k8s is heavily throttling cpu usage and taskmanager mem usage grows quickly)
- (assumptions) because the job was backfilling 1day of data it required more resources than usual, though this is not the first time that a backfill happens (e.g. k8s cluster upgrades went well)
- (assumptions) because the job was resource constrained rocksdb resource compaction did not happen in a timely manner
2023-05-05T21:00: the job fails again
2023-05-06T10:00: the job resumes (unknown reasons)
2023-05-06T19:00: the job fails again
- Seeing jvm OutOfMemoryError
- The checkpoint it tries to recover from is abnormally large (6G instead of 1.5G usually), assumption is that rocksdb compaction did not occur properly
2023-05-07T17:27: this ticket is created as UBN
2023-05-08T16:00: wdqs in CODFW is depooled
- user impact ends
2023-05-09T14:00: increasing taskmanager memory from 1.9G to 2.5G did not help
2023-05-09T14:00: starting the job from yarn using across 12 containers with 5G did help
- the job recovered and started to produce reasonable checkpoint sizes
2023-05-10T00:00: lag is back to normal on all wdqs servers
2023-05-10T10:30: the job is resumed from k8s@codfw

Detection

Prometheus alerts for the WCQS cluster fired starting at 2023-05-04T1030 . Alerts were dispatched via email, with subject RdfStreamingUpdaterFlinkJobUnstable .

WDQS cluster alerts started a bit later, at 2023-05-05T1908.

In addition to the above subject, WDQS alerts also included the subject RdfStreamingUpdaterHighConsumerUpdateLag.

The alerts correctly identified the problem and linked to the appropriate documentation.

Conclusions

What went well?

The community recognized and alerted us to the issue.

What went poorly?

The alert was not treated with the appropriate urgency.
Remediation steps (temporarily shifting the streaming updater from Kubernetes to Yarn, which has higher resource availability) were taken by a single person and may not be repeatable/documented.

Where did we get lucky?

User impact was limited, as the issue was confined to CODFW. The issue itself only resulted in stale results, as opposed to a complete lack of service.

Links to relevant documentation

Wikidata Query Service/Streaming Updater

Actionables

Scorecard

Incident Engagement ScoreCard
	Question	Answer (yes/no)	Notes
People	Were the people responding to this incident sufficiently different than the previous five incidents?	no
	Were the people who responded prepared enough to respond effectively	yes
	Were fewer than five people paged?	yes
	Were pages routed to the correct sub-team(s)?	no
	Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.	no
Process	Was the "Incident status" section atop the Google Doc kept up-to-date during the incident?	no
	Was a public wikimediastatus.net entry created?	no
	Is there a phabricator task for the incident?	yes	T336134
	Are the documented action items assigned?	yes
	Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?	yes
Tooling	To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.	yes
	Were the people responding able to communicate effectively during the incident with the existing tooling?	yes
	Did existing monitoring notify the initial responders?	yes
	Were the engineering tools that were to be used during the incident, available and in service?	no
	Were the steps taken to mitigate guided by an existing runbook?	no
Total score (count of all “yes” answers above)		8