Incidents/2022-03-27 wdqs outage
document status: in-review
Summary
| Incident ID | 2022-03-27 wdqs outage | Start | 2022-03-27 13:51 |
|---|---|---|---|
| Task | | End | 2022-03-27 14:24 |
| People paged | sre | Responder count | 1 |
| Coordinators | 0 | Affected metrics/SLOs | per-host latency |
| Impact | For 30 minutes, all WDQS queries failed due to an internal deadlock. | | |

WDQS in codfw entered a state of deadlock that persisted until service restarts were performed. Note that WDQS eqiad was depooled at the time of the incident, reducing total capacity.
Simple Timeline
[2022-03-26]
wdqs2002 gets caught in deadlock a day before the incident
[2022-03-27 12:53:00]
wdqs2007 deadlocks. Within the next hour, with wdqs2002 and wdqs2007 already deadlocked, the remaining hosts rapidly fall offline over a span of a few minutes.
[2022-03-27 14:00:40]
All codfw wdqs hosts are deadlocked; no queries can complete
[2022-03-27 14:05:03]
First page emitted (see following line)
[2022-03-27 14:05:03] <icinga-wm> PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[2022-03-27 14:16:59]
elukey responds
[2022-03-27 14:18:17]
elukey performs rolling restart of wdqs codfw
[2022-03-27 14:22:54]
full service health restored
Detailed Timeline
Leading up to the incident
The day before the actual incident, wdqs2002 dropped out of service due to its Blazegraph process locking up. While not ideal, this does happen somewhat regularly, usually without incident. It is relevant here, however, because it reduced available capacity when the "actual incident" began the following day.
[2022-03-26 01:09:13] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-26 13:54:55] <icinga-wm> PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
Note that wdqs2002's deadlocked state is visible on this Grafana dashboard, as evidenced by the null metrics reported during the period of the deadlock.
The actual incident
[2022-03-27 12:10:18] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 12:12:24] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.201 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 12:58:20] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 13:01:44] <icinga-wm> PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[2022-03-27 13:03:32] <icinga-wm> PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[2022-03-27 13:04:00] <icinga-wm> RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[2022-03-27 13:05:48] <icinga-wm> RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[2022-03-27 13:55:36] <icinga-wm> PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled
[2022-03-27 13:57:26] <icinga-wm> PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled
[2022-03-27 14:00:04] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 14:00:34] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 14:03:38] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 14:05:03] <icinga-wm> PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
Elukey responds
[2022-03-27 14:16:59] <elukey> hello folks, anybody checking wdqs?
[2022-03-27 14:17:11] <elukey> Cc: gehel, dcausse, ryankemper
[2022-03-27 14:18:17] <elukey> !log restart blazegraph on wdqs2003
[2022-03-27 14:20:22] <icinga-wm> PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.809 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[2022-03-27 14:20:43] <icinga-wm> RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.237 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[2022-03-27 14:20:49] <elukey> !log roll restart of wqds-blazegraph-public codfw
[2022-03-27 14:22:08] <icinga-wm> PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 6.823 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
Rolling restart complete, service health restored
[2022-03-27 14:22:54] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.381 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 14:22:56] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.456 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 14:23:48] <icinga-wm> RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[2022-03-27 14:24:06] <icinga-wm> RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[2022-03-27 14:24:50] <elukey> I am checking metrics in https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-3h&to=now and they look reasonably ok
[2022-03-27 14:25:31] <elukey> the thread count is a little weird
[2022-03-27 14:25:46] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.444 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 14:25:50] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.238 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 14:26:18] <icinga-wm> RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[2022-03-27 14:26:36] <icinga-wm> RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[2022-03-27 14:26:38] <icinga-wm> RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[2022-03-27 14:26:38] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.335 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[2022-03-27 14:26:42] <icinga-wm> RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
Actionables
- The long-term fix for service stability is to re-architect and replace Blazegraph. The Search team will discuss this and arrange follow-up actions.
- In the meantime, https://phabricator.wikimedia.org/T293862 might help to improve the reliability of Blazegraph.
- Since the service is fairly fragile but recovers quickly after a restart, simple auto-remediation such as scheduled service restarts might be appropriate.
- Investigate if earlier alerts should page https://phabricator.wikimedia.org/T303134
- As discussed here, the command-line utility jstack can detect deadlocks and is installed on all wdqs hosts. Perhaps we can use it to monitor for these deadlocks (see the sketch after this list).
- Update https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Blazegraph_deadlock with the exact verbiage from the alerts and examples of what Grafana looks like during these outages.
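To make the jstack idea above more concrete, here is a minimal monitoring-check sketch. It is not an existing check: the process lookup via pgrep, the Icinga/Nagios-style exit codes, and the assumption that these Blazegraph lockups manifest as Java-level deadlocks that jstack reports are all illustrative assumptions that would need to be validated during implementation.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a jstack-based deadlock probe for Blazegraph."""

import subprocess
import sys

# Icinga/Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def blazegraph_pid() -> str:
    """Return the pid of the oldest process whose command line matches 'blazegraph'.

    The match pattern is an assumption about how the service appears in the
    process table on wdqs hosts.
    """
    return subprocess.check_output(
        ["pgrep", "-f", "-o", "blazegraph"], text=True
    ).strip()


def main() -> int:
    try:
        pid = blazegraph_pid()
        dump = subprocess.check_output(["jstack", pid], text=True, timeout=30)
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"UNKNOWN: could not take a thread dump: {exc}")
        return UNKNOWN

    # HotSpot's jstack appends "Found one Java-level deadlock:" (or
    # "Found N Java-level deadlocks:") to the dump when it detects one.
    if "Java-level deadlock" in dump:
        print(f"CRITICAL: jstack reports a deadlock in Blazegraph (pid {pid})")
        return CRITICAL

    print(f"OK: no Java-level deadlock reported for Blazegraph (pid {pid})")
    return OK


if __name__ == "__main__":
    sys.exit(main())
```

In practice such a check would likely need to run as the same user as the Blazegraph JVM (jstack attaches most reliably to a JVM owned by the invoking user), and it would only catch hangs that the JVM itself classifies as deadlocks; the restart-based remediation above remains the fallback for anything it misses.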
Scorecard
| | Question | Score | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no) | 1 | |
| | Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no) | 1 | |
| | Were more than 5 people paged? (score 0 for yes, 1 for no) | 0 | |
| | Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no) | 0 | |
| | Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours) | 0 | Incident occurred on a weekend |
| Process | Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no) | 0 | |
| | Was the public status page updated? (score 1 for yes, 0 for no) | N/A | |
| | Is there a phabricator task for the incident? (score 1 for yes, 0 for no) | 0 | |
| | Are the documented action items assigned? (score 1 for yes, 0 for no) | 1 | |
| | Is this a repeat of an earlier incident? (score 0 for yes, 1 for no) | 0 | |
| Tooling | Were there, before the incident occurred, open tasks that would have prevented this incident or made mitigation easier if implemented? (score 0 for yes, 1 for no) | ? | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no) | 1 | |
| | Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no) | 1 | |
| | Were all engineering tools required available and in service? (score 1 for yes, 0 for no) | 1 | |
| | Was there a runbook for all known issues present? (score 1 for yes, 0 for no) | 0 | |
| Total score | | 6 | |