Incidents/2017-11-30 wdqs

From Wikitech

Summary

Around 14:55 UTC, wdqs1004 was caught in a GC death spiral and froze. It recovered after a restart of blazegraph.

Timeline

  • 14:55 UTC: slowdown in updates can be observed for wdqs1004
  • 15:15 UTC: icinga alert: LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
  • 15:15 UTC: icinga recovery
  • 15:19 UTC: restart of blazegraph on wdqs1004

Conclusions

  • Looking at GC logs, I can see heap allocation peaking at 17 GB/s. This looks related to the traffic received. Much more investigation will be needed to get to the bottom of this.
  • Looking at throttled requests during that period, I can see that most requests are coming from user agent "MediaWiki/1.31.0-wmf.10". This is a surprise to me.
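For anyone repeating the GC log analysis, here is a minimal sketch of how an allocation rate can be derived from consecutive GC events. It is not the tooling used during the incident. The GcEvent record and its field names are hypothetical, the sample numbers are made up, and parsing the actual GC log format is left out.

```python
from typing import List, NamedTuple


class GcEvent(NamedTuple):
    """One collection as parsed from a GC log (hypothetical field names)."""
    time_s: float     # seconds since JVM start
    heap_before: int  # bytes in use just before the collection
    heap_after: int   # bytes in use just after the collection


def allocation_rates(events: List[GcEvent]) -> List[float]:
    """Return bytes/second allocated between consecutive collections.

    Everything allocated between two collections shows up as the growth
    from the previous event's heap_after to the current event's heap_before.
    """
    rates = []
    for prev, cur in zip(events, events[1:]):
        elapsed = cur.time_s - prev.time_s
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        allocated = max(cur.heap_before - prev.heap_after, 0)
        rates.append(allocated / elapsed)
    return rates


GIB = 1024 ** 3

# Made-up sample data, for illustration only.
sample = [
    GcEvent(100.0, 12 * GIB, 4 * GIB),
    GcEvent(101.0, 14 * GIB, 6 * GIB),
    GcEvent(101.5, 14 * GIB, 13 * GIB),
]

peak = max(allocation_rates(sample))
print(f"peak allocation rate: {peak / GIB:.1f} GiB/s")
```

A sustained rate close to what the collector can reclaim, combined with a heap_after that keeps rising, is the usual signature of the death spiral described in the summary.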

Actionables

  • modify the local icinga checks to use the same check as LVS, which does a real query and not just a call to a dummy page phab:T181989
  • new wdqs cluster, dedicated to synchronous and trusted traffic phab:T178492 (this is a goal of search backend for next quarter)
  • investigate memory allocation on blazegraph phab:T181988
  • investigate and document clients of wdqs; a tracking page has been created.
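To illustrate the first actionable, here is a rough sketch of a check that runs a real SPARQL query instead of fetching a dummy page. It is not the actual LVS or icinga check. The ASK query, the function names and the 10 second default timeout are assumptions chosen for illustration, and the endpoint URL must be supplied by the caller. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import json
import urllib.parse
import urllib.request

# Hypothetical probe query: cheap, but it has to touch the triple store.
PROBE_QUERY = "ASK { ?s ?p ?o }"


def build_check_url(base_url, query=PROBE_QUERY):
    """Append the SPARQL query to the endpoint URL as a 'query' parameter."""
    return base_url + "?" + urllib.parse.urlencode({"query": query})


def _http_fetch(url, timeout):
    """Perform the HTTP GET and return (status, body text)."""
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def evaluate(status, body):
    """Map an HTTP response to a Nagios-style (exit_code, message) pair."""
    if status != 200:
        return 2, f"CRITICAL - HTTP {status}"
    try:
        answer = json.loads(body).get("boolean")
    except (ValueError, AttributeError):
        return 2, "CRITICAL - response is not a SPARQL JSON result"
    if answer is True:
        return 0, "OK - query returned a result"
    return 2, "CRITICAL - query returned no data"


def check(base_url, timeout=10.0, fetch=_http_fetch):
    """Run the probe query; 'fetch' is injectable so the logic can be tested offline."""
    try:
        status, body = fetch(build_check_url(base_url), timeout)
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return 2, f"CRITICAL - {exc}"
    return evaluate(status, body)
```

Calling check() with the SPARQL endpoint URL of a host returns the exit code and message to hand back to the monitoring system. Because a frozen blazegraph cannot answer the query within the timeout, a check of this shape would fail in the situation where a static page might still be served.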