Incidents/2019-02-08 logstash-mediawiki

Summary

A single database host going down caused an exceptional influx of log entries from MediaWiki, which in turn overloaded Logstash ingestion.

Timeline

12:18 db1114 DOWN alert fires, see also bug T214720
12:30 DBAs are engaged
12:42 UDP packet loss alerts fire for logstash
12:44 db1114 is depooled
13:09 UDP packet loss for logstash recovers

Conclusions

MediaWiki alone was able to overload Logstash, resulting in UDP packet loss. Applications using UDP as their log transport lost logs, while applications using the new logging pipeline (i.e. writing to Kafka, MediaWiki included) experienced a slowdown in log processing while Logstash instances caught up on the backlog.

Note that the length of this incident was a contributing factor in the UDP loss: a shorter recurrence (20 min) happened on Feb 11th due to the repool of db1118, but it resulted in no UDP loss on the Logstash side, as instances were able to catch up. A deeper understanding of Logstash performance characteristics is needed as well.

Links to relevant documentation

DB depooling: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting

Actionables

A MediaWiki dependency being down (single database host in this case) should not cause log spam/overload bug T215611
The logging pipeline needs additional spam/rate-limit protection bug T215900
Better understanding of Logstash performance bug T215904
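
As an illustration of the kind of rate-limit protection mentioned in the actionables, below is a minimal sketch of a per-message-key token bucket in Python. It is not how MediaWiki, Kafka or Logstash implement rate limiting; the class names (TokenBucket, LogRateLimiter), the parameters (rate, burst) and the choice of key are all hypothetical and serve only to show the idea: each distinct log key may emit a short burst, after which further entries are dropped and counted until tokens refill.

```python
import time


class TokenBucket:
    """Allow roughly `rate` events per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.clock = clock           # injectable clock, useful for testing
        self.tokens = float(burst)   # start with a full bucket
        self.last = clock()

    def allow(self):
        # Refill according to elapsed time, capped at the burst size.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class LogRateLimiter:
    """Hypothetical per-key limiter: one token bucket per log message key."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}   # key -> TokenBucket
        self.dropped = {}   # key -> number of suppressed entries

    def should_emit(self, key):
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[key] = bucket
        if bucket.allow():
            return True
        # Count what was suppressed so it can be reported in a summary line.
        self.dropped[key] = self.dropped.get(key, 0) + 1
        return False
```

Keying the limiter on something like the error type (an assumption here) means a single failing dependency can only flood its own bucket, while unrelated log entries are still emitted. The dropped counter keeps the size of the suppressed spam visible.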