Incidents/2022-12-02 wdqs outage
|2022-12-02 wdqs outage
|For at least 15 minutes, users of the Wikidata Query Service experienced a loss of service and/or extremely slow responses
An AWS bot overloaded WDQS, causing CPU usage spikes and increased lag. A requestctl rule was put in place to block the specific offending user agent.
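The actual rule and the offending user-agent string are not reproduced in this report. As a rough illustration only, a user-agent block of this kind amounts to a Varnish VCL filter on the cache frontends along these lines (the pattern below is a placeholder, not the real user agent):

sub vcl_recv {
    # Placeholder pattern; the real rule matched the specific bot's User-Agent.
    if (req.http.User-Agent ~ "ExampleOffendingBot") {
        # Reject the request at the edge, before it reaches the WDQS backends.
        return (synth(403, "Blocked due to abusive traffic"));
    }
}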
[15:14] Grafana dashboard shows an increase in error rates and resource usage
[15:21] First VictorOps page is issued
[15:32] Responders decide that no IC or formal incident is necessary, given the quick acknowledgement and apparent resolution of the subpar service
[15:36] Second VictorOps page is issued
[15:38] Incident opened. Brett Cornwall becomes IC.
[15:50] _joe_ creates a requestctl rule to block all AWS bots for the duration of the incident (volans verifies)
[16:00] Responders note that load averages have not dropped
[16:20] Responders acknowledge that the requestctl rule is not working; requests had not been blocked
[16:26] WDQS is discovered to be served by Varnish’s “misc” cluster, where requestctl rules are not applied, so the rule has no effect. Other methods of blocking are sought
[16:49] bblack applies a change to include requestctl rules in the “misc” cluster (see the illustrative sketch after the timeline)
[16:50] jhathaway takes over as IC from brett, who needed to step away
[17:52] AWS block removed and replaced with a rule blocking the offending user agent
[18:03] UA rule disabled to test its efficacy, i.e. to see whether the load climbs again. High load is observed immediately after the rule is disabled.
[18:11] UA rule re-enabled
[18:18] Incident closed as resolved by jhathaway
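To illustrate the 16:26–16:49 entries above: requestctl rules are materialized as filters in the cache frontends' Varnish configuration, and the “misc” cluster's configuration did not include them, which is why the earlier rule had no effect on WDQS traffic. The sketch below shows the rough shape of such a fix; the fragment and subroutine names are hypothetical, and the actual change applied by bblack is not reproduced here.

# Hypothetical names; the fragment/subroutine in the actual change differ.
include "requestctl-filters.inc.vcl";

sub vcl_recv {
    # With the fragment included, the requestctl-generated filters also run for
    # misc-cluster requests, so the AWS and user-agent blocks take effect.
    call requestctl_recv;
}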
[Grafana panels: CPU load (15m); Queries per second]
The issue was detected by monitoring (pybal alerts)
Example alert verbiage:
PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet,
The appropriate alerts fired, and contained enough actionable information for humans to quickly remediate the problem.
What went well?
What went poorly?
We're still investigating, but we believe one bad query caused this outage. A single user or query should not be able to take out an entire datacenter.
Where did we get lucky?
Links to relevant documentation
- Email to Wikidata Users list for awareness (DONE)
|Were the people responding to this incident sufficiently different than the previous five incidents?
|Were the people who responded prepared enough to respond effectively?
|Were fewer than five people paged?
|Were pages routed to the correct sub-team(s)?
|Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.
|Was the incident status section actively updated during the incident?
|Was the public status page updated?
|Is there a phabricator task for the incident?
|Are the documented action items assigned?
|Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?
|To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.
|Marking as “no”, as there is an acknowledged need to make changes to WDQS
|Were the people responding able to communicate effectively during the incident with the existing tooling?
|Did existing monitoring notify the initial responders?
|Were the engineering tools that were to be used during the incident, available and in service?
|Were the steps taken to mitigate guided by an existing runbook?
|Total score (count of all “yes” answers above)