Incidents/2018-04-23 wdqs
(Redirected from Incident documentation/20180423-wdqs)
Summary
Wikidata Query Service experienced slow down and timeouts for ~30 minutes
Timeline
- 7:30 UTC: wdqs.svc.eqiad.wmnet start experiencing slowdowns (95 %-ile is reaching the 1 minute timeout)
- 7:39 UTC: icinga alert: LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
- 7:41 UTC: icinga recovery: LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.029 second response time
- 8:08 UTC: restart of wdqs1003, situation back to normal
Conclusions
During the slowdown, we see multiple clients being aggressively throttled. This seems to be a consequence of the slowdown, not a cause. While we do have clients who don't backoff when receiving HTTP 429, they seem to be blocked correctly.
GC logs show mostly no GC activity after 7:28 UTC, which indicates that for once, the issue is somewhere else.
Actionables
- investigate the cause of the slowdown of wdqs1003 phab:T192759
- create a dedicated WDQS cluster for internal traffic phab:T178492 (mostly done, still need to move clients to use this new cluster)