Incidents/2019-06-13 wdqs

Summary

On June 13, from ~15:10 UTC to ~15:50 UTC, the public WDQS endpoint in eqiad was overloaded by a bot to the point where it was not serving user queries. There is no reason to think that the bot was malicious. As a mitigation, the python-requests user agent has been temporarily banned from accessing WDQS, consistent with our user agent policy.

Impact

The WDQS public endpoint in eqiad was unavailable from ~15:25 to ~15:45 UTC.

The python-requests user agent is still banned; we are waiting to implement a gentler solution before removing the ban.

The internal WDQS endpoint was not impacted.

Detection

The problem was detected by the Icinga LVS probe.

Timeline

All times in UTC.

  • 15:10: load starts to increase on the public wdqs eqiad cluster
  • 15:31: Icinga LVS alert for wdqs.svc.eqiad.wmnet

Conclusions

  • identifying and throttling bots is a hard problem
  • we need to take more drastic action to protect the stability of the service (aggressively throttle generic user agents)
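As an illustration of what "aggressively throttle generic user agents" could look like, here is a minimal token-bucket sketch in Python. It is not the throttling logic WDQS actually runs: the prefix list, the limits, and the names `is_generic`, `TokenBucket` and `Throttle` are all assumptions made up for this example.

```python
import time

# Hypothetical list of library-default user agent prefixes (an assumption,
# not taken from the WDQS configuration).
GENERIC_UA_PREFIXES = ("python-requests", "curl", "java", "go-http-client")


def is_generic(user_agent):
    """Return True when the user agent looks like an unmodified library default."""
    return user_agent.lower().startswith(GENERIC_UA_PREFIXES)


class TokenBucket:
    """Classic token bucket: `capacity` is the burst size, `refill_rate` is tokens per second."""

    def __init__(self, capacity, refill_rate, now):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.updated = now

    def allow(self, now):
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Throttle:
    """Keeps one bucket per (user agent, client IP), with tighter limits for generic agents."""

    def __init__(self, generic_limit=(2, 1.0), named_limit=(30, 10.0)):
        # Each limit is (capacity, refill_rate); the values are illustrative only.
        self.generic_limit = generic_limit
        self.named_limit = named_limit
        self.buckets = {}

    def allow(self, user_agent, client_ip, now=None):
        now = time.monotonic() if now is None else now
        key = (user_agent, client_ip)
        if key not in self.buckets:
            limit = self.generic_limit if is_generic(user_agent) else self.named_limit
            self.buckets[key] = TokenBucket(limit[0], limit[1], now)
        return self.buckets[key].allow(now)
```

With these illustrative limits, a client that identifies itself only as python-requests gets a burst of two requests and then one request per second, while a client with a descriptive user agent keeps a much larger budget. Unlike an outright ban, this leaves well-behaved scripts working at a reduced rate.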

What went well?

  • problem was detected automatically in a timely manner
  • good collaboration and clear communication between

What went poorly?

  • while we do have logic to throttle abusive bots, this throttling was not sufficient to protect the service
  • we are still banning python-requests as a user agent, which affects a number of bots

Where did we get lucky?

  • This happened during the SRE offsite, when most SREs were in the same time zone. Luckily, this wasn't when all of them were sleeping!

Actionables