Incidents/2019-08-20 logstash

From Wikitech

document status: in-review

Summary

For about 30 minutes, Logstash was not getting any messages from the MediaWiki servers.

Impact

During the Logstash outage, we had only partial visibility for operational monitoring. Developers were also unable to use WikimediaDebug or to deploy new code for MediaWiki and most other services.

While this impacted scheduling and developer productivity, it did not directly affect end-users of any public services. Also, the logs were eventually recovered into Logstash after it was restarted (the Logstash-Kafka consumer picks up where it left off).
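The recovery noted above (the consumer picking up where it left off) can be pictured with a toy model. This is only a sketch: it assumes the read position is stored alongside the log rather than inside the consumer process, and that messages written during the outage are retained. It does not use any Kafka or Logstash API; the `Partition` and `Consumer` classes and the `"logstash"` group name are hypothetical and exist only for illustration.

```python
# Toy model of an append-only log with a committed read offset per consumer group.
# Illustration only: the names here are hypothetical, not Kafka or Logstash APIs.

class Partition:
    def __init__(self):
        self.messages = []    # append-only list of log messages
        self.committed = {}   # group name -> offset of the next message to read

    def append(self, message):
        self.messages.append(message)


class Consumer:
    def __init__(self, partition, group):
        self.partition = partition
        self.group = group

    def poll(self):
        # Resume from the offset last committed for this group (0 if none).
        start = self.partition.committed.get(self.group, 0)
        batch = self.partition.messages[start:]
        # Commit the new position so a restarted consumer can continue from it.
        self.partition.committed[self.group] = start + len(batch)
        return batch


partition = Partition()
partition.append("log 1")

consumer = Consumer(partition, "logstash")
first = consumer.poll()

# The consumer is down; producers keep writing to the log in the meantime.
partition.append("log 2")
partition.append("log 3")

# A fresh consumer in the same group resumes at the committed offset.
restarted = Consumer(partition, "logstash")
recovered = restarted.poll()
```

Because the offset lives with the log in this model, the messages written while the consumer was down are delivered once it comes back, which matches the behaviour described above.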

Detection

  • Icinga alerts.

Timeline

All times in UTC.

Conclusions

What went well?

  • Detected early and quickly fixed by restarting Logstash.

How many people were involved in the remediation?

  • 1 SRE.

Actionables