Incidents/2023-05-05 wdqs not updating in codfw
document status: draft
Summary
Incident ID | 2023-05-05 wdqs not updating in codfw | Start | 2023-05-04T10:00
---|---|---|---
Task | T336134 | End | 2023-05-10T10:30
People paged | 0 | Responder count | 3
Coordinators | None | Affected metrics/SLOs | WDQS update lag
Impact | End users accessing WDQS from the CODFW region received stale results. | |
…
The rdf-streaming-updater application in CODFW became unstable and stopped sending updates, resulting in stale data for users connecting through CODFW.
Timeline
2023-05-04T10:00
: the streaming updater flink job stopped functioning in codfw for both WDQS and WCQS - user impact starts: stale results are seen when using WDQS from a region that hits CODFW
- the reason is likely https://issues.apache.org/jira/browse/FLINK-22597
2023-05-05T16:22
: the problem is reported by Bovlb via https://www.wikidata.org/wiki/Wikidata:Report_a_technical_problem/WDQS_and_Search
2023-05-05T19:00
: the flink jobmanager container is manually restarted and the jobs resume, but the WDQS one is very unstable (k8s is heavily throttling cpu usage and taskmanager mem usage grows quickly)
- (assumption) because the job was backfilling 1 day of data it required more resources than usual, though this is not the first time a backfill has happened (e.g. k8s cluster upgrades went well)
- (assumption) because the job was resource constrained, rocksdb compaction did not happen in a timely manner
2023-05-05T21:00
: the job fails again
2023-05-06T10:00
: the job resumes (unknown reasons)
2023-05-06T19:00
: the job fails again
- seeing jvm OutOfMemoryError
- the checkpoint it tries to recover from is abnormally large (6G instead of the usual 1.5G); the assumption is that rocksdb compaction did not occur properly
2023-05-07T17:27
: this ticket is created as UBN
2023-05-08T16:00
: wdqs in CODFW is depooled - user impact ends
2023-05-09T14:00
: increasing taskmanager memory from 1.9G to 2.5G did not help
2023-05-09T14:00
: starting the job from yarn across 12 containers with 5G of memory did help
- the job recovered and started to produce reasonable checkpoint sizes
2023-05-10T00:00
: lag is back to normal on all wdqs servers (see the query sketch after this timeline)
2023-05-10T10:30
: the job is resumed from k8s@codfw
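Recovery in the timeline above is tracked by watching WDQS update lag in Prometheus. The sketch below is an illustration only, not the tooling used during the incident: it queries the standard Prometheus HTTP API for a lag metric and reports which servers are still behind. The Prometheus URL, the metric name wdqs_streaming_updater_consumer_lag_seconds, and the threshold are assumptions, not the actual production names or values.

```python
# Minimal sketch (hypothetical endpoint and metric name): check whether WDQS
# update lag is back under a threshold by querying the Prometheus HTTP API.
import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"  # hypothetical endpoint
QUERY = "max by (instance) (wdqs_streaming_updater_consumer_lag_seconds)"  # hypothetical metric
LAG_THRESHOLD_SECONDS = 600  # treat anything above 10 minutes as "lagged" (assumed threshold)


def lagged_instances(prometheus_url: str = PROMETHEUS_URL) -> list[str]:
    """Return the instances whose update lag currently exceeds the threshold."""
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return [
        r["metric"]["instance"]
        for r in results
        if float(r["value"][1]) > LAG_THRESHOLD_SECONDS
    ]


if __name__ == "__main__":
    behind = lagged_instances()
    if behind:
        print("Still lagging:", ", ".join(behind))
    else:
        print("Lag is back to normal on all wdqs servers.")
```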
Detection
Prometheus alerts for the WCQS cluster fired starting at 2023-05-04T10:30. Alerts were dispatched via email with the subject RdfStreamingUpdaterFlinkJobUnstable.
WDQS cluster alerts started a bit later, at 2023-05-05T19:08.
In addition to the above subject, WDQS alerts also included the subject RdfStreamingUpdaterHighConsumerUpdateLag.
The alerts correctly identified the problem and linked to the appropriate documentation.
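For reference, alerts like the two named above can also be listed programmatically. The sketch below is a minimal, hedged example against the Alertmanager v2 API; the Alertmanager URL is an assumption and not the production endpoint.

```python
# Minimal sketch (hypothetical Alertmanager URL): list currently firing alerts
# matching the two alert names seen during this incident.
import requests

ALERTMANAGER_URL = "http://alertmanager.example.org:9093"  # hypothetical endpoint
ALERT_NAMES = (
    "RdfStreamingUpdaterFlinkJobUnstable",
    "RdfStreamingUpdaterHighConsumerUpdateLag",
)


def firing_alerts() -> list[dict]:
    """Return active, unsilenced alerts whose alertname matches ALERT_NAMES."""
    resp = requests.get(
        f"{ALERTMANAGER_URL}/api/v2/alerts",
        params={"active": "true", "silenced": "false", "inhibited": "false"},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        alert
        for alert in resp.json()
        if alert["labels"].get("alertname") in ALERT_NAMES
    ]


if __name__ == "__main__":
    for alert in firing_alerts():
        print(alert["labels"]["alertname"], alert["startsAt"])
```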
Conclusions
What went well?
- The community recognized and alerted us to the issue.
What went poorly?
- The alert was not treated with the appropriate urgency.
- Remediation steps (temporarily shifting the streaming updater from Kubernetes to Yarn, which has higher resource availability) were taken by a single person and may not be repeatable/documented.
Where did we get lucky?
User impact was limited, as the issue was confined to CODFW. The issue itself only resulted in stale results, as opposed to a complete lack of service.
Links to relevant documentation
Wikidata Query Service/Streaming Updater
Actionables
- Update WDQS Runbook following update lag incident
- Review alerting around Wikidata Query Service update pipeline
- WDQS: Document procedure for switching between Kubernetes and Yarn Streaming Updater
Scorecard
 | Question | Answer (yes/no) | Notes
---|---|---|---
People | Were the people responding to this incident sufficiently different than the previous five incidents? | no |
 | Were the people who responded prepared enough to respond effectively? | yes |
 | Were fewer than five people paged? | yes |
 | Were pages routed to the correct sub-team(s)? | no |
 | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no |
 | Was a public wikimediastatus.net entry created? | no |
 | Is there a phabricator task for the incident? | yes | T336134
 | Are the documented action items assigned? | yes |
 | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes |
 | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |
 | Did existing monitoring notify the initial responders? | yes |
 | Were the engineering tools that were to be used during the incident available and in service? | no |
 | Were the steps taken to mitigate guided by an existing runbook? | no |
Total score (count of all “yes” answers above) | 8 | |