Incidents/20160125-EventLogging
Summary
The GC logs of some Kafka nodes grew so large that they filled their disks. A couple of hours later, EventLogging's processors lost their connection with Kafka and stopped consuming events (EL's MySQL consumers were already disabled at the time of the incident for database maintenance). Operations fixed the issue by removing the GC logs and setting up rotation for them (https://phabricator.wikimedia.org/T124644). Analytics restarted EventLogging and backfilled the missing data.
Timeline
2016-01-24 00:09:18 UTC
PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=96%)
This was the first occurrence of the alert in the wikimedia-operations channel; more followed.
2016-01-25 07:22 UTC
The Ops team started seeing the following alerts in the wikimedia-operations channel:
PROBLEM - Disk space on kafka1020 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=96%)
They quickly identified the problem, a full root partition:
df -h: ... /dev/md0 28G 26G 960M 97% / ...
The cause was a huge (20+ GB) Kafka GC log:
ls -lh /var/log/kafka/kafkaServer-gc.log -rw-r--r-- 1 kafka kafka 23G Jan 25 08:32 /var/log/kafka/kafkaServer-gc.log
The same issue was present, to varying degrees, on all the Kafka nodes.
2016-01-25 09:30 UTC
The Ops team submitted a Gerrit change to add a better rotation policy for the Kafka brokers' GC logs (https://gerrit.wikimedia.org/r/#/c/266203/).
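For context only, GC log rotation on a HotSpot/OpenJDK JVM (as used by the Kafka brokers) can be enabled with startup options along the following lines; the file count and size here are illustrative assumptions, not necessarily the values in the Gerrit change above:
-Xloggc:/var/log/kafka/kafkaServer-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M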
2016-01-25 09:56 UTC
The Ops team restarted the first Kafka broker (kafka1012).
2016-01-25 10:13 UTC
The Analytics team received the following alert in the wikimedia-analytics channel:
PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: 60.00% of data above the critical threshold [30.0]
The Kafka broker restarts caused an expected cluster metadata change that was not handled correctly by the pykafka clients. EventLogging's logs were stuck repeating Connection refused messages from the Kafka brokers.
2016-01-25 11:00 UTC
Analytics restarted EventLogging and the processors went back to normal. The root cause of the Kafka failure was that the GC logs were not set up for rotation. The root cause of the EventLogging processors getting stuck is still unknown.
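Since the pykafka clients did not recover on their own after the broker metadata change, the immediate fix was a manual restart. As a rough sketch only (not the actual EventLogging code; the broker addresses, topic and consumer group below are placeholders), a consumer-side watchdog that rebuilds the pykafka client whenever consumption stalls could look like this:

from pykafka import KafkaClient

BROKERS = "kafka1012:9092,kafka1020:9092"   # placeholder broker list
TOPIC = b"eventlogging-client-side"         # placeholder topic name

def process(raw_event):
    # Placeholder for the real per-event validation/processing logic.
    print(raw_event)

def consume_with_watchdog():
    while True:
        # (Re)create the client so fresh cluster metadata is fetched after broker restarts.
        client = KafkaClient(hosts=BROKERS)
        consumer = client.topics[TOPIC].get_simple_consumer(
            consumer_group=b"eventlogging-processor",  # placeholder group name
            consumer_timeout_ms=60000,                 # stop iterating after 60s without messages
        )
        for message in consumer:
            process(message.value)
        # Iteration ends once consumer_timeout_ms elapses with no data:
        # tear the consumer down and reconnect instead of staying stuck.
        consumer.stop()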
Actionables
- Status: Pending. Investigate the root cause of the Kafka disconnects.
- Status: Pending. Investigate whether a stricter set of alarms would have helped (triggering, for example, at 70-80% disk utilization rather than at 96%).
- Status: Pending. Add Kafka and Hadoop alarms to the analytics channel for better visibility.
- Status: Done. Backfill the missing data.