Incidents/2022-12-02 wdqs outage

Summary

Incident metadata (see Incident Scorecard)

Incident ID: 2022-12-02 wdqs outage
Start: 2022-12-02 15:14
End: 2022-12-02 18:18
Task: T323620
People paged:
Responder count:
Coordinators: brett, jhathaway
Affected metrics/SLOs: wdqs
Impact: For at least 15 minutes, users of the Wikidata Query Service experienced a lack of service and/or extremely slow responses.

An AWS bot overloaded WDQS, causing CPU usage spikes and increased lag. A requestctl rule was put in place to block the specific offending user agent.
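For illustration only, the sketch below shows one way an offending bot user agent can be spotted: tally sampled requests per user agent and look for a single client dominating traffic. This is a hypothetical aid, not the tooling used during the incident; the file name and JSON field names are assumptions.

  # Hypothetical sketch: count sampled requests per user agent to find a bot
  # dominating WDQS traffic. File name and field names are assumptions.
  import json
  from collections import Counter

  def top_user_agents(log_path, limit=10):
      """Return the most frequent user agents in a JSON-lines request log sample."""
      counts = Counter()
      with open(log_path, encoding="utf-8") as fh:
          for line in fh:
              try:
                  record = json.loads(line)
              except json.JSONDecodeError:
                  continue  # skip malformed sample lines
              counts[record.get("user_agent", "<missing>")] += 1
      return counts.most_common(limit)

  if __name__ == "__main__":
      for agent, hits in top_user_agents("wdqs_requests.sample.jsonl"):
          print(f"{hits:8d}  {agent}")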

Timeline

[15:14] Grafana dashboard shows increase in error rates/resource usage

[15:21] First VictorOps page is issued

[15:32] Responders decide that no IC/incident is necessary, given the quick acknowledgement and apparent resolution of the degraded service

[15:36] Second VictorOps page is issued

[15:38] Incident opened. Brett Cornwall becomes IC.

[15:50] _joe_ creates a requestctl rule to block all AWS bots for the duration of the incident (volans verifies)

[16:00] Responders acknowledge that load averages have not dropped

[16:20] The requestctl rule is acknowledged to be ineffective; requests had not been blocked

[16:26] WDQS is discovered to be served by Varnish’s “misc” cluster, where requestctl rules do not take effect. Other methods of blocking are sought

[16:49] bblack applies a change to include requestctl rules in the “misc” cluster

[16:50] jhathaway takes over IC from brett, who needed to step away

[17:52] AWS block removed and replaced with a rule blocking the offending user agent

[18:03] UA rule temporarily disabled to test its efficacy, i.e. to see whether the load climbs again; high load is observed immediately after the rule is disabled

[18:11] UA rule re-enabled

[18:18] Incident closed as resolved by jhathaway

Metrics

Detection

The issue was detected by monitoring (PyBal alerts).

Example alert verbiage: PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet,

The appropriate alerts fired and contained enough actionable information for humans to quickly remediate the problem.

Conclusions

What went well?

What went poorly?

We're still investigating, but we believe one bad query caused this outage. A single user or query should not be able to take out an entire datacenter.
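As a hedged illustration of the safeguard implied here (and not a description of any mechanism WDQS currently uses), the sketch below applies a per-client token bucket so that no single user agent can consume an unbounded share of requests; all names and limits are assumptions.

  # Illustrative sketch only: a per-client token bucket of the kind that keeps a
  # single user agent from monopolizing a service. Rates and names are assumptions.
  import time
  from dataclasses import dataclass, field

  @dataclass
  class TokenBucket:
      rate: float       # tokens refilled per second
      capacity: float   # maximum burst size
      tokens: float = 0.0
      updated: float = field(default_factory=time.monotonic)

      def allow(self) -> bool:
          """Spend one token if available; refill based on elapsed time."""
          now = time.monotonic()
          self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
          self.updated = now
          if self.tokens >= 1.0:
              self.tokens -= 1.0
              return True
          return False

  buckets = {}

  def should_serve(user_agent, rate=5.0, burst=10.0):
      """Return False once a client has exhausted its request budget."""
      bucket = buckets.setdefault(user_agent, TokenBucket(rate=rate, capacity=burst, tokens=burst))
      return bucket.allow()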

Where did we get lucky?

Links to relevant documentation

Wikidata Query Service/Runbook

Actionables

  • Email to Wikidata Users list for awareness (DONE)

Scorecard

Incident Engagement Scorecard (answers are yes/no, with notes where applicable)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes
  • Were pages routed to the correct sub-team(s)? yes
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. yes

Process
  • Was the incident status section actively updated during the incident? yes
  • Was the public status page updated? no
  • Is there a phabricator task for the incident? yes (T323620)
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? no

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. no (marked as “no” because there is an acknowledged need to make changes to WDQS)
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident, available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? yes

Total score (count of all “yes” answers above): 11