Talk:Incidents/2022-03-27 wdqs outage

Rough notes around the incident from the Search team.

Bking: I see Icinga alerts matching "PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook" in the #wikimedia-operations IRC room, but I can't find any emails for them. Action: Ensure that Search team SREs get emails for these failures.

As discussed here, the command-line utility jstack can detect deadlocks, and is installed on all wdqs hosts. Perhaps we can use it to monitor for these deadlocks.
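A minimal sketch of what such a jstack-based check could look like, in Python. It assumes the Blazegraph JVM's PID is already known and that the script runs as the same user as that JVM (jstack generally needs this). The marker string is what HotSpot's jstack prints when it finds a deadlock. The function names and the timeout are hypothetical, not taken from any existing check.

```python
import subprocess

# Substring that HotSpot's jstack prints when it detects a deadlock
# ("Found one Java-level deadlock:").
DEADLOCK_MARKER = "Java-level deadlock"


def has_deadlock(jstack_output: str) -> bool:
    """Return True if the given jstack thread dump reports a deadlock."""
    return DEADLOCK_MARKER in jstack_output


def check_pid(pid: int, timeout: float = 30.0) -> bool:
    """Run jstack against a JVM and report whether it found a deadlock.

    The timeout (a hypothetical default) guards against a JVM that is
    too wedged to answer; subprocess.TimeoutExpired is raised in that case.
    """
    result = subprocess.run(
        ["jstack", str(pid)],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return has_deadlock(result.stdout)
```

An Icinga-style wrapper could call check_pid and exit with status 2 (CRITICAL) when it returns True. Note that jstack only reports lock cycles the JVM itself can detect, so it may not catch every kind of hang.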

We should also update https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Blazegraph_deadlock with the exact verbiage from the alerts and examples of what Grafana looks like during these outages.

Update the alert verbiage itself to say "restart blazegraph service on X"

Potential things to alert on

Thread count plateau, see 2002 and 2007 https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1648382400000&to=1648396800000

Sustained load avg 15 leads
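As a rough illustration of the thread count plateau idea above, here is a sketch in Python that flags a series whose most recent samples are both high and flat. The function name, the window size and the thresholds are all hypothetical. Real values would have to be read off the Grafana dashboard for the affected hosts.

```python
from typing import Sequence


def is_plateau(
    samples: Sequence[float],
    min_level: float,
    window: int,
    tolerance: float = 0.0,
) -> bool:
    """Return True if the last `window` samples form a high, flat plateau.

    A plateau here means every recent sample is at least `min_level`
    and the spread (max - min) of those samples is within `tolerance`.
    """
    if window <= 0 or len(samples) < window:
        return False  # not enough data to decide
    recent = samples[-window:]
    return min(recent) >= min_level and (max(recent) - min(recent)) <= tolerance
```

For example, with hypothetical numbers, is_plateau([100, 150, 200, 200, 200, 200], min_level=180, window=4) is True, while a series that is still moving around within the window is not flagged.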

Performance improvements

We have NUMA enabled on these nodes. Is that a good idea?

Do we have a perf testing environment? There may be other tunables we should look into.