Incident documentation/20140622-es1006

From Wikitech

Summary

A 50-second burst of 5xx responses coincided with a spike of "Too many connections" errors for es1006 in dberror.log.

Timeline

First error: Sun Jun 22 8:01:37 UTC 2014
Last error: Sun Jun 22 8:02:27 UTC 2014
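
The length of the error window follows directly from the two timestamps above. A minimal sketch of that arithmetic, using only the values listed in this timeline (the variable names are illustrative):

```python
from datetime import datetime, timezone

# Timestamps copied from the timeline above (UTC).
first_error = datetime(2014, 6, 22, 8, 1, 37, tzinfo=timezone.utc)
last_error = datetime(2014, 6, 22, 8, 2, 27, tzinfo=timezone.utc)

# Length of the error window in seconds.
window_seconds = (last_error - first_error).total_seconds()
print(window_seconds)
```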

Mean CPU utilization on es1006 since June 12 is up around 90% compared to the previous ten-day period:

http://graphite.wikimedia.org/render/?target=servers.es1006.cpu.total.user.value&from=00%3A00_20140601&until=23%3A59_20140622&width=600&height=300
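
The comparison behind that graph could be reproduced numerically along the following lines. This is a sketch under stated assumptions, not the method used during the incident: it assumes the Graphite host accepts `format=json` (a standard option of the Graphite render API), and the helper names and the exact date ranges in the usage note are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GRAPHITE_RENDER = "http://graphite.wikimedia.org/render/"
TARGET = "servers.es1006.cpu.total.user.value"  # metric from the graph URL above


def render_url(start, until):
    """Build a render URL that asks for JSON instead of a PNG graph."""
    query = urlencode(
        {"target": TARGET, "from": start, "until": until, "format": "json"}
    )
    return f"{GRAPHITE_RENDER}?{query}"


def mean_of_series(series):
    """Average the non-null values in Graphite JSON output.

    Graphite returns a list of {"target": ..., "datapoints": [[value, ts], ...]}
    where value may be null for missing samples.
    """
    values = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    return sum(values) / len(values) if values else None


def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


def fetch_mean(start, until):
    """Fetch one period and return its mean (needs access to the Graphite host)."""
    with urlopen(render_url(start, until)) as resp:
        return mean_of_series(json.load(resp))
```

With network access, calling `fetch_mean("00:00_20140602", "23:59_20140611")` and `fetch_mean("00:00_20140612", "23:59_20140622")` and passing the two results to `relative_change` would give the percentage difference between the two periods. Those date ranges are an assumed reading of "since the 12th" and "the previous ten-day period".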

In fact, this is a general load jump across external storage (ES) that has been causing similar glitches for some days. A corresponding jump, also starting on the 12th, appears on the S5 slaves (dewiki, wikidatawiki). None of the other shards show the pattern.

During IRC discussion, a probable spike in Wikidata traffic was identified, mostly from Wikibase\Lib\Store\WikiPageEntityLookup::selectRevisionRow, which would also hit ES. Aude and Hoo investigated and found a latent Wikidata caching bug.

Conclusions

Traffic increased on ES and S5. The probable cause was a latent Wikidata caching bug.

Actionables

  • Status: Done. An additional S5 slave has been deployed.
  • Status: Done. DB traffic sampling has been deployed to S5.
  • Status: Done. Aude and Hoo deployed https://gerrit.wikimedia.org/r/#/c/141997/