Incidents/2017-09-20 Logstash

Summary

Logstash stopped processing logs while cirrus elasticsearch cluster (eqiad) was down for maintenance.

Timeline

Sept 20, 15:48 to 16:44: logs collected drop to almost zero (see graph)
Sept 20, 16:42: rolling restart of the logstash collectors (this happened at roughly the same time as elasticsearch recovery, so it might have been useless)

Conclusions

The only identified link between logstash and elasticsearch cluster is the logging of API features. Logs are collected by logstash and forwarded not to the logstash elasticsearch cluster, but to the cirrus elasticsearch cluster, presumably for consumption by https://en.wikipedia.org/wiki/Special:ApiFeatureUsage. While the cirrus cluster was down, we saw timeouts in the logstash logs. The hypothesis is that this blocked enough threads that the logstash ingester threadpool was saturated and basically stopped processing anything.

It is not clear that lgostash provides way to implement a circuit breaker in case elasticsearch is down. The resurrect_delay option might be useful.

Actionables

More investigation should be done task T176335