Incidents/20160125-EventLogging
Summary
The GC logs of some Kafka nodes grew so large that they filled their disks. A couple of hours later, EventLogging's processors lost their connection with Kafka and stopped consuming events (EL's MySQL consumers were already disabled at the time of the incident for database maintenance). Operations fixed the issue by removing the GC logs and setting up rotation for them (https://phabricator.wikimedia.org/T124644). Analytics restarted EventLogging and backfilled the missing data.
Timeline
2016-01-24 00:09:18 UTC
PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=96%)
This was the first occurrence of the alert in the wikimedia-operations channel; more followed.
2016-01-25 07:22 UTC
The Ops team started seeing the following alerts in the wikimedia-operations channel:
PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=96%)
They quickly identified the problem, a full root partition:
df -h: ... /dev/md0 28G 26G 960M 97% / ...
The cause was a huge (20+ GB) Kafka GC log:
ls -lh /var/log/kafka/kafkaServer-gc.log -rw-r--r-- 1 kafka kafka 23G Jan 25 08:32 /var/log/kafka/kafkaServer-gc.log
The same issue was present, to varying degrees, on all the Kafka nodes.
2016-01-25 09:30 UTC
The Ops team submitted a Gerrit change to add a better rotation policy for the Kafka brokers' GC logs (https://gerrit.wikimedia.org/r/#/c/266203/).
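For context only, GC log rotation on a HotSpot/OpenJDK JVM (as used by the Kafka brokers) can be enabled with startup options along the following lines; the file count and size here are illustrative assumptions, not necessarily the values in the Gerrit change above:
-Xloggc:/var/log/kafka/kafkaServer-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M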
2016-01-25 09:56 UTC
The Ops team restarted the first Kafka broker (kafka1012).
2016-01-25 10:13 UTC
The Analytics team received the following alert in the wikimedia-analytics channel:
PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: 60.00% of data above the critical threshold [30.0]
The Kafka broker restarts caused an expected cluster metadata change that was not handled correctly by the pykafka clients. EventLogging's logs were stuck repeating Connection refused messages from the Kafka brokers.
2016-01-25 11:00 UTC
Analytics restarted EventLogging and the processors went back to normal. The root cause of the Kafka failure was that the GC logs were not set up for rotation. The root cause of the EventLogging processors getting stuck is still unknown.
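Since the pykafka clients did not recover on their own after the broker metadata change, the immediate fix was a manual restart. As a rough sketch only (not the actual EventLogging code; the broker addresses, topic and consumer group below are placeholders), a consumer-side watchdog that rebuilds the pykafka client whenever consumption stalls could look like this:

from pykafka import KafkaClient

BROKERS = "kafka1012:9092,kafka1020:9092"   # placeholder broker list
TOPIC = b"eventlogging-client-side"         # placeholder topic name

def process(raw_event):
    # Placeholder for the real per-event validation/processing logic.
    print(raw_event)

def consume_with_watchdog():
    while True:
        # (Re)create the client so fresh cluster metadata is fetched after broker restarts.
        client = KafkaClient(hosts=BROKERS)
        consumer = client.topics[TOPIC].get_simple_consumer(
            consumer_group=b"eventlogging-processor",  # placeholder group name
            consumer_timeout_ms=60000,                 # stop iterating after 60s without messages
        )
        for message in consumer:
            process(message.value)
        # Iteration ends once consumer_timeout_ms elapses with no data:
        # tear the consumer down and reconnect instead of staying stuck.
        consumer.stop()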
Actionables
- Status: Pending. Investigate the root cause of the Kafka disconnects.
- Status: Pending. Investigate whether a stricter set of alarms would have helped (triggering, for example, at 70-80% disk utilization rather than at 96%).
- Status: Pending. Add Kafka and Hadoop alarms to the analytics channel for better visibility.
- Status: Done. Backfill the missing data.