Incidents/20160128-MediaWiki-API
Summary
A Kafka broker was rebooted as part of a standard upgrade. The MediaWiki API cluster failed as a result, overwhelmed by piled-up connection attempts to that broker. As part of the incident response, the broker was depooled by removing it from mediawiki-config, and service was restored 25 minutes after it first started failing.
Timeline
- 12:53 Luca follows the safe broker restart process and stops Kafka on kafka1012 to reboot the host for a kernel upgrade
- 13:36 multiple mw* hosts report HHVM Rendering "Socket timeout after 10 seconds" failures, multiple opsens respond
- 13:49 Giuseppe notices many HHVM threads stuck attempting to connect, then finds a large number of connections to kafka1012 in the SYN_SENT state
- 13:53 Faidon pushes a patch to remove kafka1012 from mediawiki-config and syncs it to the appserver fleet
- 13:58 File is synced across the fleet, traffic recovers
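The 13:49 diagnosis (many connections to one broker stuck in SYN_SENT) can be sketched as a small script. This is an illustrative sketch only, not the tooling used during the incident: it assumes Linux `netstat -tan` style output, the function name `count_syn_sent` is hypothetical, and the sample addresses are made-up documentation addresses rather than real hosts.

```python
from collections import Counter


def count_syn_sent(netstat_output: str) -> Counter:
    """Count connections in SYN_SENT state, grouped by remote address:port.

    Expects lines shaped like Linux `netstat -tan` output:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # A connection row has at least six columns; the state is the sixth.
        if len(fields) >= 6 and fields[5] == "SYN_SENT":
            counts[fields[4]] += 1
    return counts


# Hypothetical sample, not captured from the incident.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      1 192.0.2.10:51000        192.0.2.50:9092         SYN_SENT
tcp        0      1 192.0.2.10:51001        192.0.2.50:9092         SYN_SENT
tcp        0      0 192.0.2.10:51002        192.0.2.51:9092         ESTABLISHED
"""

if __name__ == "__main__":
    # The remote endpoint with the most half-open connection attempts comes first.
    print(count_syn_sent(SAMPLE).most_common(1))  # [('192.0.2.50:9092', 2)]
```

A single remote endpoint dominating this count is the pattern described above: callers keep sending SYNs to a host that is no longer answering.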
Actionables
- MediaWiki monolog doesn't handle Kafka failures gracefully (bug T125084)
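One general way to handle an unreachable log backend gracefully is to fail fast: use a short connect timeout and stop retrying for a cooldown period after a failure, so request threads do not pile up waiting on a dead host. The sketch below illustrates that idea only. It is not the fix tracked in T125084 and not MediaWiki or monolog code; it uses a plain TCP socket as a stand-in for a Kafka producer, and the class name `FailFastSender` and its parameters and default values are hypothetical.

```python
import socket
import time


class FailFastSender:
    """Best-effort sender that drops messages while the backend is unreachable."""

    def __init__(self, host, port, connect_timeout=0.2, cooldown=30.0,
                 connect=socket.create_connection, clock=time.monotonic):
        self.host = host
        self.port = port
        self.connect_timeout = connect_timeout  # seconds to wait for the TCP connect
        self.cooldown = cooldown                # seconds to skip the backend after a failure
        self._connect = connect                 # injectable for testing (assumption)
        self._clock = clock                     # injectable for testing (assumption)
        self._skip_until = float("-inf")

    def send(self, payload: bytes) -> bool:
        """Return True if the payload was sent, False if it was dropped."""
        now = self._clock()
        if now < self._skip_until:
            # Recent failure: drop immediately without touching the network.
            return False
        try:
            with self._connect((self.host, self.port),
                               timeout=self.connect_timeout) as conn:
                conn.sendall(payload)
            return True
        except OSError:
            # Covers timeouts and refused connections; back off for a while.
            self._skip_until = now + self.cooldown
            return False
```

With this shape, a rebooting broker costs each caller at most one short connect timeout per cooldown window, and logging degrades to dropped messages instead of blocking the request that triggered it. Whether dropping log lines is acceptable is a design choice that depends on the logging pipeline.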