Incident documentation/20150901-Elasticsearch


Summary

The Elasticsearch service (on elastic*.eqiad.wmnet nodes) backing the search functionality went red for a few minutes. We didn't lose any real data, but we failed to serve some searches for 10 minutes.

Timeline

  • 05:28: dcausse pauses writes before the firewall rules are applied to the master (elastic1001)
  • 05:32: chasemp applies the rules
  • 05:32: master is starting to lose track of its nodes
  • 05:33: cluster is red
  • 05:33: chasemp reverts the rules
  • 05:34: cluster is starting to recover
  • 05:39: cluster is back to yellow
  • 05:48: there is a 10-minute spike of "Pool errors"; dcausse and chasemp test some queries on enwiki, and they all work
  • 07:58: cluster is back to green
  • 08:00: dcausse unfreezes the indices
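
The timeline above tracks the cluster through red, yellow and green before writes were resumed. The report does not say how the state was observed, so the following is only a minimal sketch of one way to wait for green before unfreezing, using Elasticsearch's `_cluster/health` endpoint. The URL, the helper names and the "resume only on green" policy are assumptions made for illustration, not details taken from this incident.

```python
import json
import time
import urllib.request

# Assumed endpoint: host and port are placeholders, not from the incident report.
HEALTH_URL = "http://localhost:9200/_cluster/health"


def parse_status(body):
    """Return the 'status' field (green, yellow or red) of a _cluster/health response."""
    return json.loads(body)["status"]


def is_safe_to_resume_writes(status):
    """Hypothetical policy: resume writes only once the cluster is green."""
    return status == "green"


def fetch_status(url=HEALTH_URL, timeout=5.0):
    """Query the cluster health endpoint and return its status string."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_status(resp.read().decode("utf-8"))


def wait_for_green(fetch=fetch_status, interval=10.0, max_checks=60, sleep=time.sleep):
    """Poll until the cluster is green; return False if max_checks is exhausted."""
    for _ in range(max_checks):
        if is_safe_to_resume_writes(fetch()):
            return True
        sleep(interval)
    return False
```

Calling `wait_for_green()` before unfreezing would block until the cluster reports green or the check budget runs out. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live cluster.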

Conclusions

Actionables