Incidents/2017-08-29 EventStreams

From Wikitech

Summary

The Services team needed a larger maximum message size on the main Kafka brokers so that they could produce large JobQueue events. https://gerrit.wikimedia.org/r/#/c/372179/ was merged and services were restarted around 2017-08-28T19:00. The change was not applied to the analytics Kafka brokers, nor to the MirrorMaker producer and consumer settings. As a result, the MirrorMaker processes died when they attempted to mirror a message from a main Kafka cluster that exceeded the old size limit. With no data being mirrored from the main Kafka clusters to the analytics Kafka cluster, EventStreams stopped sending events to connected clients: it uses topics that originate in the main Kafka clusters, but consumes them from the analytics Kafka cluster.

Otto was notified by an IRC user on 2017-08-29, at around 14:44 EST, that EventStreams was sending no messages. He looked in the MirrorMaker logs, found an error about a message exceeding the maximum allowed size, and realized that this setting needs to be kept in sync across all clusters that mirror messages between them, and also set explicitly for clients older than version 0.10 (all of our current clients).
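The need to keep the size limit in sync across brokers and MirrorMaker can be illustrated with a small check. This is only a sketch under stated assumptions: the key names are standard Kafka settings (message.max.bytes and replica.fetch.max.bytes on brokers, fetch.message.max.bytes on the old consumer, max.request.size on the producer), but which of them were actually changed in this incident is an assumption, and the component names, the values, and the find_undersized helper are all hypothetical.

```python
# Hypothetical target size; not the value used in the actual patch.
REQUIRED_BYTES = 4 * 1024 * 1024


def find_undersized(settings, required_bytes):
    """Return sorted (component, key) pairs whose limit is below required_bytes."""
    return sorted(
        (component, key)
        for component, conf in settings.items()
        for key, value in conf.items()
        if value < required_bytes
    )


# Illustrative values only: the main brokers were raised, everything
# downstream of them was left at a smaller limit.
settings = {
    "main broker": {
        "message.max.bytes": 4194304,
        "replica.fetch.max.bytes": 4194304,
    },
    "analytics broker": {
        "message.max.bytes": 1000012,
        "replica.fetch.max.bytes": 1048576,
    },
    "mirrormaker consumer": {"fetch.message.max.bytes": 1048576},
    "mirrormaker producer": {"max.request.size": 1048576},
}

for component, key in find_undersized(settings, REQUIRED_BYTES):
    print(f"{component}: {key} is below {REQUIRED_BYTES}")
```

With these made-up values the check flags the analytics broker and both MirrorMaker clients, which is the kind of mismatch described above.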

Timeline

  • 2017-08-28T15:03 (gerrit time?) https://gerrit.wikimedia.org/r/#/c/372179/ is merged.
  • 2017-08-28T19:24 UTC Otto begins restarting the main-eqiad Kafka cluster and eventbus producers.
  • Petr confirms that JobQueue can now produce larger messages to main Kafka clusters.
  • 2017-08-29T14:44 EST, an IRC user notifies Otto that EventStreams is broken.
  • Otto begins merging puppet patches to synchronize settings between Kafka clusters and clients.
  • 2017-08-29T19:34 UTC, Otto applies puppet and restarts the analytics Kafka cluster and all MirrorMakers.
  • Mirrored events begin flowing again into analytics Kafka, and to connected EventStreams clients.

Conclusions

There were ZERO alerts about a drop of messages in these analytics Kafka topics, and for EventStreams clients!
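As a rough illustration of the kind of check that was missing, the sketch below flags a mirror as stalled when the source topic received messages during a time window but the mirrored topic received none. The function name, its parameters, and the idea of comparing per-window message counts are assumptions made for illustration; they do not describe any existing Wikimedia alert.

```python
def mirror_stalled(source_count, mirrored_count, min_source=1):
    """Return True when the source saw traffic in the window but the mirror saw none.

    source_count   -- messages produced to the topic on the source cluster
    mirrored_count -- messages that arrived in the same topic on the mirror
    min_source     -- minimum source traffic needed before alerting, so that
                      a genuinely idle topic does not raise an alert
    """
    return source_count >= min_source and mirrored_count == 0


# Hypothetical per-window counts.
print(mirror_stalled(120, 0))    # source active, mirror silent
print(mirror_stalled(0, 0))      # topic idle on both sides
print(mirror_stalled(120, 118))  # mirror lagging slightly but flowing
```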

Actionables