Incidents/2022-12-02 wdqs outage

Summary

Incident metadata (see Incident Scorecard)

Incident ID: 2022-12-02 wdqs outage
Start: 2022-12-02 15:14
End: 2022-12-02 18:18
Task: T323620
People paged:
Responder count:
Coordinators: brett, jhathaway
Affected metrics/SLOs: wdqs
Impact: For at least 15 minutes, users of the Wikidata Query Service experienced a lack of service and/or extremely slow responses.

An AWS bot overloaded WDQS, causing CPU usage spikes and increased lag. A requestctl rule was put in place to block the specific offending user agent.
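For illustration only, the sketch below shows one way an offending bot user agent can be spotted: tally sampled requests per user agent and look for a single client dominating traffic. This is a hypothetical aid, not the tooling used during the incident; the file name and JSON field names are assumptions.

  # Hypothetical sketch: count sampled requests per user agent to find a bot
  # dominating WDQS traffic. File name and field names are assumptions.
  import json
  from collections import Counter

  def top_user_agents(log_path, limit=10):
      """Return the most frequent user agents in a JSON-lines request log sample."""
      counts = Counter()
      with open(log_path, encoding="utf-8") as fh:
          for line in fh:
              try:
                  record = json.loads(line)
              except json.JSONDecodeError:
                  continue  # skip malformed sample lines
              counts[record.get("user_agent", "<missing>")] += 1
      return counts.most_common(limit)

  if __name__ == "__main__":
      for agent, hits in top_user_agents("wdqs_requests.sample.jsonl"):
          print(f"{hits:8d}  {agent}")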

Timeline

[15:14] Grafana dashboard shows increase in error rates/resource usage

[15:21] First VictorOps page is issued

[15:32] Responders decide that no IC/incident is necessary, given the quick acknowledgement and apparent resolution of the degraded service

[15:36] Second VictorOps page is issued

[15:38] Incident opened. Brett Cornwall becomes IC.

[15:50] _joe_ creates a requestctl rule to block all AWS bots for the duration of the incident (volans verifies)

[16:00] Responders acknowledge that load averages have not dropped

[16:20] The requestctl rule is acknowledged to be ineffective; requests had not been blocked

[16:26] WDQS is discovered to be served by Varnish’s “misc” cluster, where requestctl rules do not take effect. Other methods of blocking are sought

[16:49] bblack applies a change to include requestctl rules in the “misc” cluster

[16:50] jhathaway takes over IC from brett, who needed to step away

[17:52] AWS block removed and replaced with a rule blocking the offending user agent

[18:03] UA rule temporarily disabled to test its efficacy, i.e. to see whether the load climbs again; high load is observed immediately after the rule is disabled

[18:11] UA rule re-enabled

[18:18] Incident closed as resolved by jhathaway

Metrics

Detection

The issue was detected by monitoring (PyBal alerts).

Example alert verbiage: PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet,

The appropriate alerts fired and contained enough actionable information for humans to quickly remediate the problem.

Conclusions

What went well?

What went poorly?

We're still investigating, but we believe one bad query caused this outage. A single user or query should not be able to take out an entire datacenter.
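As a hedged illustration of the safeguard implied here (and not a description of any mechanism WDQS currently uses), the sketch below applies a per-client token bucket so that no single user agent can consume an unbounded share of requests; all names and limits are assumptions.

  # Illustrative sketch only: a per-client token bucket of the kind that keeps a
  # single user agent from monopolizing a service. Rates and names are assumptions.
  import time
  from dataclasses import dataclass, field

  @dataclass
  class TokenBucket:
      rate: float       # tokens refilled per second
      capacity: float   # maximum burst size
      tokens: float = 0.0
      updated: float = field(default_factory=time.monotonic)

      def allow(self) -> bool:
          """Spend one token if available; refill based on elapsed time."""
          now = time.monotonic()
          self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
          self.updated = now
          if self.tokens >= 1.0:
              self.tokens -= 1.0
              return True
          return False

  buckets = {}

  def should_serve(user_agent, rate=5.0, burst=10.0):
      """Return False once a client has exhausted its request budget."""
      bucket = buckets.setdefault(user_agent, TokenBucket(rate=rate, capacity=burst, tokens=burst))
      return bucket.allow()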

Where did we get lucky?

Links to relevant documentation

Wikidata Query Service/Runbook

Actionables

  • Email to Wikidata Users list for awareness (DONE)

Scorecard

Incident Engagement Scorecard (answers are yes/no, with notes where applicable)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes
  • Were pages routed to the correct sub-team(s)? yes
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. yes

Process
  • Was the incident status section actively updated during the incident? yes
  • Was the public status page updated? no
  • Is there a phabricator task for the incident? yes (T323620)
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? no

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. no (marked as “no” because there is an acknowledged need to make changes to WDQS)
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident, available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? yes

Total score (count of all “yes” answers above): 11