Talk:Incidents/2019-04-02 0401KafkaJumbo

[Luca] A couple of notes after the first pass of reading:

I don't think that Kafka stopped serving connections, since in my opinion we'd have had a much bigger impact. Some brokers were still up and running (while the others were OOMing), but of course they were not able to sustain all the traffic.
We need to define a clear SLO (service level objective) with the SRE team for the Kafka Jumbo cluster. In this case, the incident report says that we were lucky to find somebody from the SRE team in PST working on it, and it was clearly an emergency since Analytics traffic was dropped. We (as Analytics) should have a clear definition of what level of service the Jumbo cluster should get, and have support from SRE accordingly. It is true that the Analytics team can count on two SREs in US/EU timezones, but as this incident report shows, it can happen that two are not enough :)
As a follow-up to the item above, should a page be fired to SRE/Analytics if an event like this happens again?
Should we raise the heap size of the Kafka brokers a bit (currently 2G) to account for events like these? It would of course take a couple of gigabytes away from the page cache.
