Incidents/2022-02-22 wdqs updater codfw
document status: in-review
|Incident ID||2022-02-22 wdqs updater codfw||Start||2022-02-22 17:47:00|
|People paged||0||Responder count||3|
|Coordinators||Ryan Kemper||Affected metrics/SLOs||updateQueryServiceLag (Grafana)|
|Impact||For about two hours, WDQS updates failed to be processed. As a consequence, bots and tools were unable to edit Wikidata during this time.|
WDQS updaters stopped processing updates in Codfw due to a failure with Flink in Codfw.
The API maxlag feature, is configured on Wikidata.org to incorporate WDQS lag. The updateQueryServiceLag service exists to transfer this datapoint from Prometheus to MW. Because bots generally opt-in to be friendly and enable the "maxlag" parameter, and because the metric was configured to consider both Eqiad and Codfw, their edits were rejected for two hours.
17:30 Search dev deploys a version upgrade (0.3.103) of the flink application to codfw for wdqs
17:31 The flink application is unable to restore from the savepoint
17:51 Search dev does not find any solution to unblock the situation and asks for a depool of wdqs@codfw (users no longer see stale results when hitting wdqs@codfw)
17:52 (traffic switched to eqiad)
<gehel> !log depooling WDQS codfw (internal + public) - issues with deployment of new updater version on codfw
19:00 wikidata maxlag alert is triggered eventhough codfw is depooled (known limitation: phab:T238751)
19:20 wdqs@codfw is removed from the wikidata maxlag calculation (bots can resume editing)
19:20 Search dev rolls WDQS codfw flink state back to a previously saved checkpoint , restoring the processing of updates in WDQS. Within a few minutes lag catches up and the user impact resolves.
<ryankemper> !log T302330 `ryankemper@cumin1001:~$ sudo -E cumin '*mwmaint*' 'run-puppet-agent'` (getting https://gerrit.wikimedia.org/r/c/operations/puppet/+/764875 out)
(RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
20:00 WCQS version 0.3.104 is deployed, which includes a fix for WCQS failures https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/764864. (Note: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/764864 addressed some WCQS failures but was not the primary cause of the WDQS failures)
14:00 investigation of the root cause shows that flink can no longer start properly in k8s, the app was restarted in yarn
18:00 the flink app is still unable to run from k8s@codfw
10:00 Search devs link the root cause to a poor implementation of the swift client protocol and decides to switch to a S3 client, the app will remain running in YARN while we move away from this swift client.
10:00 The flink app is able to start on k8s@codfw thanks to the switch to the S3 client protocol
- https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=8&orgId=1&var-cluster_name=wdqs&from=1645548333076&to=1645559701497 Graph of affected lag
- https://phabricator.wikimedia.org/T238751 (pre-existing ticket) would have prevented the period in which Wikidata edits could not get through despite the affected hosts having already been depooled
TODO: Add the #Sustainability (Incident Followup) and the #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tag to these tasks.
|People||Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no)||1|
|Were the people who responded prepared enough to respond effectively (score 1 for yes, 0 for no)||1|
|Were more than 5 people paged? (score 0 for yes, 1 for no)||0||unclear if this paged, please update if known|
|Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no)||0||unclear if this paged, please update if known|
|Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours)||0||unclear if this paged, please update if known|
|Process||Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no)||1|
|Was the public status page updated? (score 1 for yes, 0 for no)||0|
|Is there a phabricator task for the incident? (score 1 for yes, 0 for no)||0|
|Are the documented action items assigned? (score 1 for yes, 0 for no)||1|
|Is this a repeat of an earlier incident (score 0 for yes, 1 for no)||0|
|Tooling||Was there, before the incident occurred, open tasks that would prevent this incident / make mitigation easier if implemented? (score 0 for yes, 1 for no)||0|
|Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 or no)||1|
|Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no)||1|
|Were all engineering tools required available and in service? (score 1 for yes, 0 for no)||1|
|Was there a runbook for all known issues present? (score 1 for yes, 0 for no)||1|