Incidents/2019-06-24 wdqs

Summary

The WDQS public endpoint in eqiad was overloaded from 11:50 UTC until 13:15 UTC, which led to HTTP 5xx responses being served to users. Updates were disabled to mitigate the issue.

Impact

The outage (or at least reduced service availability) lasted ~1.5 hours, during which ~7K HTTP 5xx responses were served.

Detection

The Icinga LVS HTTP check on wdqs.svc.eqiad.wmnet alerted.

Timeline

All times in UTC.

11:58: increased rate of HTTP 5xx on WDQS public endpoint
12:04: first Icinga alert for "PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds"
12:36: ban of GuzzleHttp user agent (this seemed to be the UA generating the most read traffic at the time)
12:49: restarting blazegraph on wdqs1004 (JVM thread out of control)
13:00: shutting down wdqs-updater on wdqs-public / eqiad
13:02: last Icinga recovery for "RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.341 second response time"

Conclusions

What went well?

Internal WDQS clusters were not affected, which suggests that segregating use cases works.

What went poorly?

While we do have throttling in place to keep read load under control, it was not sufficient to prevent the issue.

Where did we get lucky?

Not sure there was much luck here.

Links to relevant documentation

Wikidata_query_service/Runbook#Overload_due_to_high_edit_rate

Actionables

Better throttle generic user agents 517555 (should be deployed later today)
Rate limit updates task T226413
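
As an illustration of what rate limiting updates could look like, here is a minimal token-bucket sketch. It is not the WDQS updater's actual mechanism and does not describe what T226413 implements; the class name TokenBucket, its parameters, and the injectable clock are all hypothetical and chosen only for this example. The idea is that each update consumes a token, tokens refill at a fixed rate, and updates are deferred when the bucket is empty.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical limiter: at most `rate` operations per second on average,
    with bursts of up to `capacity` operations."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if available, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill proportionally to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would check bucket.try_acquire() before applying each update and back off (for example, sleep briefly and retry) when it returns False. The rate and capacity values would need to be tuned against the load the servers can actually sustain.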