Incidents/2017-09-20 Logstash
Summary
Logstash stopped processing logs while the cirrus elasticsearch cluster (eqiad) was down for maintenance.
Timeline
- Sept 20, 15:48 to 16:44: logs collected drop to almost zero (see graph)
- Sept 20, 16:42: rolling restart of the logstash collectors (this happened at roughly the same time as the elasticsearch recovery, so it may have had no effect)
Conclusions
The only identified link between logstash and the cirrus elasticsearch cluster is the logging of API feature usage. These logs are collected by logstash and forwarded not to the logstash elasticsearch cluster, but to the cirrus elasticsearch cluster, presumably for consumption by https://en.wikipedia.org/wiki/Special:ApiFeatureUsage. While the cirrus cluster was down, we saw timeouts in the logstash logs. The hypothesis is that these timeouts blocked enough threads to saturate the logstash ingester thread pool, which then stopped processing almost everything.
It is not clear whether logstash provides a way to implement a circuit breaker in case elasticsearch is down. The resurrect_delay option might be useful.
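For readers unfamiliar with the pattern, the following is a minimal sketch of a circuit breaker in Python. It is an illustration only, not a logstash feature or the configuration used here: the class name, the `max_failures` and `reset_delay` parameters, and the wrapped call are all hypothetical. After a number of consecutive failures the breaker fails fast, so worker threads are not held waiting on timeouts, and after a delay it lets one trial call through to check whether the backend has recovered.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker guarding calls to an unreliable backend."""

    def __init__(self, max_failures=3, reset_delay=5.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_delay = reset_delay    # seconds to wait before a trial call
        self.clock = clock                # injectable clock, eases testing
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_delay:
                # Fail fast instead of blocking a worker on a timeout.
                raise CircuitOpenError("backend marked down, call skipped")
            # Delay elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open the breaker (or re-open it after a failed trial call).
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch a caller would wrap each send to the secondary output in `breaker.call(...)` and treat `CircuitOpenError` as "drop or queue this event", so that an outage of one destination does not stall the whole pipeline. Whether logstash can be made to behave this way is exactly the open question above.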
Actionables
- More investigation should be done (task T176335)