Incident documentation/20160128-MediaWiki-API

Summary

A Kafka broker was rebooted as part of a standard upgrade. The MediaWiki API cluster failed as a result, because an overwhelming number of connection attempts to that broker piled up. As part of the incident response the broker was depooled from mediawiki-config, and service was restored 25 minutes after it first started failing.

Timeline

  • 12:53 Luca follows the safe broker restart process and stops Kafka on kafka1012 so the host can be rebooted for a kernel upgrade
  • 13:36 multiple mw* hosts report HHVM Rendering "Socket timeout after 10 seconds" failures; multiple opsens respond
  • 13:49 Giuseppe notices many HHVM threads stuck attempting to connect, and then a large number of connections to kafka1012 in the SYN_SENT state
  • 13:53 Faidon pushes a patch to remove kafka1012 from mediawiki-config and syncs it to the appserver fleet
  • 13:58 The file is synced across the fleet and traffic recovers
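
The 13:49 observation (many connections to one broker stuck in SYN_SENT) is the kind of signal that can be pulled out of `netstat -tn` style output. The following is a minimal sketch, not the procedure used during the incident; the function name, the sample addresses, and the assumed six-column layout (proto, recv-q, send-q, local, remote, state) are illustrative assumptions.

```python
from collections import Counter


def count_syn_sent(netstat_output: str) -> Counter:
    """Count TCP connections in SYN_SENT state, grouped by remote host."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local addr, remote addr, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] == "SYN_SENT":
            # Strip the port so that all attempts to one host are grouped.
            remote_host = fields[4].rsplit(":", 1)[0]
            counts[remote_host] += 1
    return counts


# Made-up sample data; the addresses are placeholders, not real hosts.
SAMPLE = """\
tcp        0      1 10.0.0.5:40112          10.0.1.12:9092          SYN_SENT
tcp        0      1 10.0.0.5:40113          10.0.1.12:9092          SYN_SENT
tcp        0      0 10.0.0.5:40114          10.0.1.13:9092          ESTABLISHED
"""

if __name__ == "__main__":
    for host, n in count_syn_sent(SAMPLE).most_common():
        print(host, n)
```

A single remote host dominating the SYN_SENT count suggests that the peer is silently unreachable (for example, powered off for a reboot) rather than actively refusing connections.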

Conclusions

Actionables

  • Status: In progress. MediaWiki monolog doesn't handle Kafka failures gracefully (bug T125084)
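
One general way for a logging client to degrade gracefully when its backend is unreachable is to combine a short connect timeout with a cooldown period during which the backend is skipped entirely. The sketch below only illustrates that idea; it is not the fix tracked in T125084 and does not use MediaWiki, monolog, or Kafka client code. The class name `FailFastSender` and the default values for `connect_timeout` and `cooldown` are hypothetical.

```python
import socket
import time


class FailFastSender:
    """Send log payloads over TCP without blocking the caller for long."""

    def __init__(self, host, port, connect_timeout=0.05, cooldown=30.0,
                 clock=time.monotonic):
        self.host = host
        self.port = port
        self.connect_timeout = connect_timeout  # seconds to wait for connect
        self.cooldown = cooldown                # seconds to skip after a failure
        self.clock = clock                      # injectable for testing
        self.skip_until = 0.0

    def send(self, payload: bytes) -> bool:
        """Return True if the payload was sent, False if it was dropped."""
        if self.clock() < self.skip_until:
            # Backend recently failed: drop the message instead of retrying.
            return False
        try:
            with socket.create_connection(
                (self.host, self.port), timeout=self.connect_timeout
            ) as conn:
                conn.sendall(payload)
            return True
        except OSError:
            # Covers timeouts and refused connections; start the cooldown.
            self.skip_until = self.clock() + self.cooldown
            return False
```

The trade-off is explicit: log messages are dropped while the backend is down, but request-serving threads wait at most one short timeout per cooldown window instead of every request stalling on a connection attempt.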