Incidents/20151127-EventLogging
Timeline
On the 27th of November
1:30 am UTC SQL insertion rate drops to zero. Kafka topics that are fed from outside continue to receive events; event-login-valid-mixed is receiving events, but fewer than it should.
At the same time we see these errors in the eventlogging_processor log:
2015-11-27 01:31:09,663 (MainThread) Could not receive response to request [0000026b0000000000a0 ... 6b69223a2022656e77696b69227d] from server <KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: Kafka @ kafka1013.eqiad.wmnet:9092 went away
2015-11-27 01:31:09,664 (MainThread) Could not receive response to request [0000038a00000000009d ... 6b69223a2022657377696b69227d] from server <KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: Kafka @ kafka1013.eqiad.wmnet:9092 went away
Kafka had an outage in which only one of the brokers seems to have been working.
06:50 am UTC EventLogging gets rebooted and a spike in consumption can be seen on Grafana
07:05 am UTC consumption catches up
Conclusions
EventLogging consumers get stuck when they have connection problems talking to Kafka. The system requires a reboot to recover, even after Kafka has been brought back.
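A common way to avoid this failure mode is to re-establish the connection with exponential backoff instead of leaving the consumer stuck. The following is only a minimal sketch of that general technique, not EventLogging's actual code and not necessarily the fix that was applied. The names `run_with_reconnect`, `connect` and `handle` are hypothetical, and it assumes that connection failures surface as Python's built-in `ConnectionError`.

```python
import time


def run_with_reconnect(connect, handle, max_attempts=5,
                       base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Consume events, reconnecting with exponential backoff on failure.

    connect: hypothetical callable that returns an iterable of events and
             raises ConnectionError when the broker is unreachable.
    handle:  hypothetical callable invoked once per event.
    """
    failures = 0
    while True:
        try:
            stream = connect()
            for event in stream:
                handle(event)
                failures = 0  # progress was made, so reset the backoff
            return  # the stream ended cleanly
        except ConnectionError:
            failures += 1
            if failures >= max_attempts:
                raise  # give up so that a supervisor can alert or restart
            # Wait 1s, 2s, 4s, ... capped at max_delay, before reconnecting
            delay = min(max_delay, base_delay * 2 ** (failures - 1))
            sleep(delay)
```

With a loop of this shape, a consumer would resume on its own once the brokers return, rather than needing a manual reboot. Re-raising after `max_attempts` consecutive failures keeps a prolonged outage visible to monitoring.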
Actionables
- Status: Done Investigate whether backfilling is needed
- Status: Done Backfill missing data
- Status: Done Make EventLogging more resilient to Kafka outages: [1]