Incidents/20150901-Elasticsearch

Summary

The Elasticsearch service (on the elastic*.eqiad.wmnet nodes) that backs the search functionality went red for a few minutes. We did not lose any real data, but we failed to serve some searches for 10 minutes.

Timeline

05:28: dcausse pauses writes before the firewall rules are applied to the master (elastic1001)
05:32: chasemp applies the rules
05:32: the master starts to lose track of its nodes
05:33: cluster is red
05:33: chasemp reverts the rules
05:34: cluster starts to recover
05:39: cluster is back to yellow
05:48: there is a 10-minute spike of "Pool errors"; dcausse and chasemp test some queries on enwiki, and they all work
07:58: cluster is back to green
08:00: dcausse unfreezes the indices
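
The red, yellow and green states in the timeline are the status values that Elasticsearch reports from its cluster health endpoint (GET /_cluster/health). As a minimal sketch of how such a status could be read and interpreted, and not the tooling actually used during this incident, the Python below assumes a node reachable over HTTP; the function names and the base URL in the usage note are hypothetical.

```python
import json
from urllib.request import urlopen

# Meaning of the three health colours reported by Elasticsearch.
STATUS_MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primary shards are allocated, some replicas are not",
    "red": "at least one primary shard is not allocated",
}


def describe_health(health):
    """Turn a parsed _cluster/health response into a one-line summary."""
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    return f"{status}: {STATUS_MEANING[status]} ({unassigned} unassigned shards)"


def fetch_health(base_url):
    """Fetch cluster health from a node; base_url is an assumption, e.g. a local node."""
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)
```

For example, describe_health(fetch_health("http://localhost:9200")) would summarize the state of a locally reachable node. A red status means some searches can fail, which matches the window described in the summary; yellow means searches are served but replicas are still being rebuilt.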

Conclusions

https://phabricator.wikimedia.org/T104962#1594537

Actionables

https://phabricator.wikimedia.org/T104962#1594537