Incidents/2022-03-27 wdqs outage

document status: in-review

Summary

Incident metadata (see Incident Scorecard)

Incident ID: 2022-03-27 wdqs outage
Start: 2022-03-27 13:51
End: 2022-03-27 14:24
Task: none
People paged: sre
Responder count: 1
Coordinators: 0
Affected metrics/SLOs: per-host latency
Impact: For 30 minutes, all WDQS queries failed due to an internal deadlock.

WDQS in codfw entered a deadlocked state that persisted until the service was restarted on each host. Note that WDQS eqiad was depooled at the time of the incident, reducing total capacity.
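The per-host monitoring check that appears throughout the timeline below exercises Blazegraph with a trivial SPARQL query; on a deadlocked host the connection is accepted but never answered, which surfaces as the "Socket timeout after 10 seconds" seen in the Icinga alerts. A minimal sketch of that style of probe, with the query string taken from the alert text and the host list, port, and timeout as illustrative assumptions:

```python
import requests

# Trivial SPARQL query used by the monitoring check (taken from the
# 2022-03-27 12:58 alert text); a healthy Blazegraph answers instantly.
TEST_QUERY = "SELECT * WHERE {wikibase:Dump schema:dateModified ?y} LIMIT 1"

# Illustrative host list: the wdqs codfw backends named in the alerts.
HOSTS = [f"wdqs200{n}.codfw.wmnet" for n in (1, 2, 3, 4, 7)]

def probe(host: str, timeout: float = 10.0) -> bool:
    """Return True if the host answers the test query before the timeout.

    A deadlocked Blazegraph accepts the connection but never responds,
    so the request surfaces as a socket timeout: exactly the
    'Socket timeout after 10 seconds' in the Icinga alerts.
    """
    url = f"http://{host}/bigdata/namespace/wdq/sparql"
    try:
        resp = requests.get(url, params={"query": TEST_QUERY}, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    for host in HOSTS:
        print(host, "OK" if probe(host) else "CRITICAL")
```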

Simple Timeline

[2022-03-26] wdqs2002 gets caught in a deadlock a day before the incident

[2022-03-27 12:53:00] wdqs2007 deadlocks. About an hour later, with wdqs2002 and wdqs2007 already down, the remaining codfw hosts fall offline within a span of a few minutes.

[2022-03-27 14:00:40] All codfw wdqs hosts are deadlocked; no queries can complete

[2022-03-27 14:05:03] First page emitted (see following line)

[2022-03-27 14:05:03] <icinga-wm> PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

[2022-03-27 14:16:59] elukey responds

[2022-03-27 14:18:17] elukey begins restarting Blazegraph across the wdqs codfw hosts, starting with wdqs2003

[2022-03-27 14:22:54] full service health restored

Detailed Timeline

Leading up to the incident

The day before the actual incident, wdqs2002 dropped out of service because its Blazegraph process locked up. While not ideal, this happens somewhat regularly, usually without incident. It is relevant here because it reduced the capacity available when the actual incident began the following day.

[2022-03-26 01:09:13] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-26 13:54:55] <icinga-wm> PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service

Note that wdqs2002's deadlock is visible on this Grafana dashboard as the null metrics reported for the duration of the deadlock.
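To spot such a gap programmatically rather than by eyeballing the dashboard, one can ask Prometheus for the metric over the suspect window and flag the timestamps that came back empty. A rough sketch, assuming a reachable Prometheus range-query endpoint; the URL and metric name below are hypothetical stand-ins, not taken from the incident:

```python
import requests

# Hypothetical values: the Prometheus URL and metric selector are
# assumptions for illustration; substitute the dashboard's real ones.
PROM = "http://prometheus.example.org/api/v1/query_range"
METRIC = 'blazegraph_running_queries{instance="wdqs2002:9193"}'

def missing_samples(start: int, end: int, step: int = 60) -> list[int]:
    """Return expected sample timestamps with no data, i.e. the
    'null metrics' gap a deadlocked exporter leaves behind."""
    resp = requests.get(
        PROM, params={"query": METRIC, "start": start, "end": end, "step": step}
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # Collect the timestamps Prometheus actually has samples for.
    have = {int(ts) for series in results for ts, _ in series["values"]}
    return [t for t in range(start, end + 1, step) if t not in have]
```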

The actual incident

[2022-03-27 12:10:18] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 12:12:24] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.201 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 12:58:20] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.171 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 13:01:44] <icinga-wm> PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal

[2022-03-27 13:03:32] <icinga-wm> PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal

[2022-03-27 13:04:00] <icinga-wm> RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal

[2022-03-27 13:05:48] <icinga-wm> RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal


[2022-03-27 13:55:36] <icinga-wm> PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled

[2022-03-27 13:57:26] <icinga-wm> PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled

[2022-03-27 14:00:04] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 14:00:34] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 14:03:38] <icinga-wm> PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 14:05:03] <icinga-wm> PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems


Elukey responds

[2022-03-27 14:16:59] <elukey> hello folks, anybody checking wdqs?

[2022-03-27 14:17:11] <elukey> Cc: gehel, dcausse, ryankemper

[2022-03-27 14:18:17] <elukey> !log restart blazegraph on wdqs2003

[2022-03-27 14:20:22] <icinga-wm> PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 7.809 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service

[2022-03-27 14:20:43] <icinga-wm> RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.237 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

[2022-03-27 14:20:49] <elukey> !log roll restart of wqds-blazegraph-public codfw

[2022-03-27 14:22:08] <icinga-wm> PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 6.823 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
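The 502s just above are expected during a roll restart: each host briefly serves errors while its Blazegraph comes back up. Conceptually, the procedure depools a host, restarts the service, waits for it to pass the health probe, and repools it before moving on. A hand-rolled sketch of that loop; the host list, ssh invocation, depool/pool wrappers, and service unit name are assumptions for illustration, and in practice this is done with the standard restart tooling rather than a loop like this:

```python
import subprocess
import time

import requests

HOSTS = [f"wdqs200{n}.codfw.wmnet" for n in (1, 2, 3, 4, 7)]
PROBE = "SELECT * WHERE {wikibase:Dump schema:dateModified ?y} LIMIT 1"

def healthy(host: str) -> bool:
    """True once the host answers the trivial SPARQL probe again."""
    try:
        r = requests.get(f"http://{host}/bigdata/namespace/wdq/sparql",
                         params={"query": PROBE}, timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

for host in HOSTS:
    # Depool so LVS stops routing live queries to the host being restarted.
    subprocess.run(["ssh", host, "sudo", "depool"], check=True)
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart",
                    "wdqs-blazegraph"], check=True)
    # Expect 502s from this host while it restarts; wait for the probe to
    # pass before putting it back into rotation.
    while not healthy(host):
        time.sleep(10)
    subprocess.run(["ssh", host, "sudo", "pool"], check=True)
```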

Rolling restart complete, service health restored

[2022-03-27 14:22:54] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.381 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 14:22:56] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2004 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.456 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook


[2022-03-27 14:23:48] <icinga-wm> RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service

[2022-03-27 14:24:06] <icinga-wm> RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service

[2022-03-27 14:24:50] <elukey> I am checking metrics in https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-3h&to=now and they look reasonably ok

[2022-03-27 14:25:31] <elukey> the thread count is a little weird

[2022-03-27 14:25:46] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.444 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 14:25:50] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.238 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 14:26:18] <icinga-wm> RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal

[2022-03-27 14:26:36] <icinga-wm> RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

[2022-03-27 14:26:38] <icinga-wm> RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service

[2022-03-27 14:26:38] <icinga-wm> RECOVERY - WDQS SPARQL on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.335 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook

[2022-03-27 14:26:42] <icinga-wm> RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal

Actionables

Scorecard

Incident Engagement™ ScoreCard

People
Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no): 1
Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no): 1
Were more than 5 people paged? (score 0 for yes, 1 for no): 0
Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no): 0
Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours): 0 (incident occurred on a weekend)

Process
Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no): 0
Was the public status page updated? (score 1 for yes, 0 for no): N/A
Is there a Phabricator task for the incident? (score 1 for yes, 0 for no): 0
Are the documented action items assigned? (score 1 for yes, 0 for no): 1
Is this a repeat of an earlier incident? (score 0 for yes, 1 for no): 0

Tooling
Were there, before the incident occurred, open tasks that would have prevented this incident or made mitigation easier if implemented? (score 0 for yes, 1 for no): ?
Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no): 1
Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no): 1
Were all engineering tools required available and in service? (score 1 for yes, 0 for no): 1
Was there a runbook for all known issues present? (score 1 for yes, 0 for no): 0

Total score: 6