Incident documentation/20151127-EventLogging

Timeline

On 27 November 2015:

01:30 UTC: The SQL insertion rate drops to zero. Topics in Kafka that are fed from outside continue to receive events; event-login-valid-mixed is still receiving events, but fewer than it should.

Eventlogging-outage-2015-11-27 2.png

At the same time, we see these errors in the eventlogging_processor log:

2015-11-27 01:31:09,663 (MainThread) Could not receive response to request [0000026b0000000000a0 ... 6b69223a2022656e77696b69227d] from server <KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: Kafka @ kafka1013.eqiad.wmnet:9092 went away
2015-11-27 01:31:09,664 (MainThread) Could not receive response to request [0000038a00000000009d ... 6b69223a2022657377696b69227d] from server <KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: Kafka @ kafka1013.eqiad.wmnet:9092 went away
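
To see which brokers were affected, errors like the two above can be tallied per broker host. The following is a minimal sketch, not part of EventLogging: the function name `count_went_away` and the regular expression are assumptions, inferred only from the two sample lines quoted here.

```python
import re
from collections import Counter

# Pattern inferred from the two sample log lines only (an assumption);
# other log formats will simply not match.
WENT_AWAY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"Kafka @ (?P<host>[\w.\-]+):(?P<port>\d+) went away\s*$"
)


def count_went_away(lines):
    """Return a Counter of 'went away' errors keyed by broker host."""
    counts = Counter()
    for line in lines:
        match = WENT_AWAY.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# The two log lines from this report, copied verbatim (including the elision).
sample = [
    "2015-11-27 01:31:09,663 (MainThread) Could not receive response to request "
    "[0000026b0000000000a0 ... 6b69223a2022656e77696b69227d] from server "
    "<KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: "
    "Kafka @ kafka1013.eqiad.wmnet:9092 went away",
    "2015-11-27 01:31:09,664 (MainThread) Could not receive response to request "
    "[0000038a00000000009d ... 6b69223a2022657377696b69227d] from server "
    "<KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: "
    "Kafka @ kafka1013.eqiad.wmnet:9092 went away",
]

for host, n in count_went_away(sample).items():
    print(host, n)
```

In practice the function would be fed the lines of the processor log file rather than the inline sample.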

Kafka had an outage during which only one of the brokers appears to have been working:

Kafka-outage-2015-11-27 1.png


06:50 UTC: EventLogging is rebooted, and a spike in consumption can be seen in Grafana.

Eventlogging-outage-2015-11-27 1.png

07:05 UTC: Consumption catches up.

Conclusions

EventLogging consumers get stuck when they have connection problems talking to Kafka. The system requires a reboot to recover once Kafka has been brought back.
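
One general way to avoid the stuck state described above is to have the consumer re-establish its connection with exponential backoff instead of waiting for a manual reboot. The sketch below illustrates that idea only; it is not the fix that was applied to EventLogging. The names `consume_with_reconnect`, `connect` and `handle` are hypothetical, and `connect` is assumed to return an iterable of messages that resumes from the last consumed position and raises `ConnectionError` when the broker goes away.

```python
import time


def consume_with_reconnect(connect, handle, max_retries=5, base_delay=1.0,
                           sleep=time.sleep):
    """Consume messages, reconnecting with exponential backoff.

    connect -- hypothetical callable returning an iterable of messages;
               assumed to resume where the previous connection stopped.
    handle  -- callable invoked once per message.
    Raises ConnectionError after more than max_retries consecutive failures.
    """
    failures = 0
    while True:
        try:
            for message in connect():
                handle(message)
                failures = 0  # any progress resets the backoff
            return  # the stream ended cleanly
        except ConnectionError:
            failures += 1
            if failures > max_retries:
                raise  # give up and let a supervisor or alert take over
            # Wait 1x, 2x, 4x, ... base_delay before reconnecting.
            sleep(base_delay * 2 ** (failures - 1))
```

The `sleep` parameter is injectable so that the backoff can be exercised in tests without real waiting. Re-raising after `max_retries` keeps a long outage visible to monitoring rather than hiding it behind endless retries.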

Actionables

  • Investigate whether backfilling is needed (Status: Done)
  • Backfill missing data (Status: Done)
  • Make EventLogging more resilient to Kafka outages: [1] (Status: Done)