Incidents/2023-05-05 wdqs not updating in codfw

From Wikitech

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2023-05-05 wdqs not updating in codfw
Start: 2023-05-04T10:00
End: 2023-05-10T10:30
Task: T336134
People paged: 0
Responder count: 3
Coordinators: None
Affected metrics/SLOs: WDQS update lag
Impact: End users accessing WDQS from the CODFW region received stale results.

The rdf-streaming-updater application in CODFW became unstable and stopped sending updates, resulting in stale data for users connecting through CODFW.

Timeline

  • 2023-05-04T10:00: the streaming updater flink job stopped functioning in codfw for both WDQS and WCQS
    • user impact starts: stale results are seen when using WDQS from a region that hits CODFW
    • reason is likely https://issues.apache.org/jira/browse/FLINK-22597
  • 2023-05-05T16:22: the problem is reported by Bovlb via https://www.wikidata.org/wiki/Wikidata:Report_a_technical_problem/WDQS_and_Search
  • 2023-05-05T19:00: the flink jobmanager container is manually restarted and the jobs resume but the WDQS one is very unstable (k8s is heavily throttling cpu usage and taskmanager mem usage grows quickly)
    • (assumption) because the job was backfilling 1 day of data, it required more resources than usual, though this is not the first time a backfill has happened (e.g. k8s cluster upgrades went well)
    • (assumption) because the job was resource constrained, rocksdb compaction did not happen in a timely manner
  • 2023-05-05T21:00: the job fails again
  • 2023-05-06T10:00: the job resumes (unknown reasons)
  • 2023-05-06T19:00: the job fails again
    • Seeing jvm OutOfMemoryError
    • The checkpoint it tries to recover from is abnormally large (6G instead of the usual 1.5G); the assumption is that rocksdb compaction did not occur properly
  • 2023-05-07T17:27: the Phabricator task (T336134) is created as UBN (Unbreak Now)
  • 2023-05-08T16:00: wdqs in CODFW is depooled
    • user impact ends
  • 2023-05-09T14:00: increasing taskmanager memory from 1.9G to 2.5G did not help
  • 2023-05-09T14:00: starting the job from yarn across 12 containers with 5G did help
    • the job recovered and started to produce reasonable checkpoint sizes
  • 2023-05-10T00:00: lag is back to normal on all wdqs servers
  • 2023-05-10T10:30: the job is resumed from k8s@codfw
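
The abnormal checkpoint size noted in the timeline (6G instead of the usual 1.5G) is the kind of signal that could be checked automatically. The following is a minimal sketch of such a check, not something that existed during the incident: the function name `checkpoint_is_abnormal` and the 2x threshold are hypothetical choices made for illustration, and only the 6G and 1.5G figures come from the timeline above.

```python
# Hypothetical helper; not part of Flink or the rdf-streaming-updater.
GIB = 1024 ** 3


def checkpoint_is_abnormal(size_bytes: int, baseline_bytes: int,
                           max_ratio: float = 2.0) -> bool:
    """Return True when a checkpoint is more than max_ratio times the baseline.

    max_ratio=2.0 is an assumed threshold, not a value from the incident.
    """
    if baseline_bytes <= 0:
        raise ValueError("baseline_bytes must be positive")
    return size_bytes > max_ratio * baseline_bytes


usual_size = int(1.5 * GIB)   # usual checkpoint size reported in the timeline
observed_size = 6 * GIB       # checkpoint size seen on 2023-05-06

# 6G is four times the usual size, so this is flagged.
print(checkpoint_is_abnormal(observed_size, usual_size))
```

A ratio against a known baseline is used here rather than a fixed byte limit so that the check would keep working if the normal checkpoint size changed over time.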

Detection

Prometheus alerts for the WCQS cluster fired starting at 2023-05-04T10:30. Alerts were dispatched via email, with subject RdfStreamingUpdaterFlinkJobUnstable.

WDQS cluster alerts started a bit later, at 2023-05-05T19:08.

In addition to the above subject, WDQS alerts also included the subject RdfStreamingUpdaterHighConsumerUpdateLag.

The alerts correctly identified the problem and linked to the appropriate documentation.
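
To make the gap between detection and response concrete, the delays can be worked out from the timestamps in this report. The sketch below is illustrative only: the event names are hypothetical labels, and all timestamps are assumed to be in the same timezone, since the report does not state one.

```python
from datetime import datetime

# Timestamps copied from the Timeline and Detection sections.
# The keys are hypothetical labels chosen for this sketch.
EVENTS = {
    "job_stopped": "2023-05-04T10:00",
    "first_wcqs_alert": "2023-05-04T10:30",
    "community_report": "2023-05-05T16:22",
    "jobmanager_restart": "2023-05-05T19:00",
    "codfw_depooled": "2023-05-08T16:00",
}


def parse(timestamp: str) -> datetime:
    """Parse a minute-resolution timestamp such as 2023-05-04T10:00."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M")


def elapsed_hours(start: str, end: str) -> float:
    """Hours elapsed between two named events."""
    delta = parse(EVENTS[end]) - parse(EVENTS[start])
    return delta.total_seconds() / 3600


for start, end in [
    ("job_stopped", "first_wcqs_alert"),
    ("first_wcqs_alert", "jobmanager_restart"),
    ("job_stopped", "codfw_depooled"),
]:
    print(f"{start} -> {end}: {elapsed_hours(start, end):.1f} h")
```

Under these assumptions, the first alert came 0.5 hours after the job stopped, the first manual restart followed 32.5 hours after that alert, and user impact lasted 102 hours until CODFW was depooled.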

Conclusions

What went well?

  • The community recognized and alerted us to the issue.

What went poorly?

  • The alert was not treated with the appropriate urgency.
  • Remediation steps (temporarily shifting the streaming updater from Kubernetes to Yarn, which has higher resource availability) were taken by a single person and may not be repeatable/documented.

Where did we get lucky?

  • User impact was limited, as the issue was confined to CODFW.
  • The issue only resulted in stale results, as opposed to a complete lack of service.

Links to relevant documentation

Wikidata Query Service/Streaming Updater

Actionables

Scorecard

Incident Engagement ScoreCard
Question Answer (yes/no) Notes
People Were the people responding to this incident sufficiently different than the previous five incidents? no
Were the people who responded prepared enough to respond effectively? yes
Were fewer than five people paged? yes
Were pages routed to the correct sub-team(s)? no
Were pages routed to online (business hours) engineers?  Answer “no” if engineers were paged after business hours. no
Process Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no
Was a public wikimediastatus.net entry created? no
Is there a phabricator task for the incident? yes T336134
Are the documented action items assigned? yes
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes
Tooling To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
Were the people responding able to communicate effectively during the incident with the existing tooling? yes
Did existing monitoring notify the initial responders? yes
Were the engineering tools that were to be used during the incident, available and in service? no
Were the steps taken to mitigate guided by an existing runbook? no
Total score (count of all “yes” answers above) 8