Incident documentation/20150709-poolcounter


Summary

There was an 8-minute outage of the API at eqiad, starting at 2015-07-09 17:26:35 and ending at 2015-07-09 17:34:15, caused by scheduled maintenance and an unforeseen dependency. helium was powered down for https://phabricator.wikimedia.org/T84770. However, helium is also a poolcounter machine. Unfortunately, MediaWiki has a 0.5-second timeout before it falls back to the next poolcounter server in line, which is too high.
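To illustrate why a 0.5-second fallback timeout hurts, here is a minimal sketch in Python. It is not MediaWiki's poolcounter client. The function names, the request rate, and the 1 ms connect cost are hypothetical and chosen only for illustration. The only figure taken from this report is the 0.5-second timeout.

```python
from typing import Optional, Sequence


def fallback_latency(servers_up: Sequence[bool],
                     timeout_sec: float,
                     connect_cost_sec: float = 0.001) -> Optional[float]:
    """Time until the first reachable server answers, trying servers in order.

    Each unreachable server costs one full timeout before the client moves on.
    Returns None if no server is reachable. (Hypothetical model.)
    """
    waited = 0.0
    for up in servers_up:
        if up:
            return waited + connect_cost_sec
        waited += timeout_sec
    return None


def workers_stuck_waiting(requests_per_sec: float, wait_sec: float) -> float:
    """Little's law: average workers tied up = arrival rate * time spent waiting."""
    return requests_per_sec * wait_sec


# First server down, second up: every request pays the full timeout first.
delay = fallback_latency([False, True], timeout_sec=0.5)
print(f"per-request delay: {delay:.3f} s")

# Assumed rate of 1000 requests/s needing a poolcounter lock (hypothetical).
print(f"workers stuck waiting: {workers_stuck_waiting(1000, 0.5):.0f}")
```

Under these assumed numbers, hundreds of workers sit idle in the timeout at any moment, which is consistent with the growing HHVM queues reported in the timeline. A shorter timeout shrinks that figure proportionally.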

Timeline

17:22 cmjohnson1: shutting down helium for a few minutes to move within the same row

17:26:35 icinga complains about api.svc.eqiad.wmnet. Before that, it had already complained about HHVM queue sizes on various mw hosts. mutante noticed that helium is a poolcounter host.

17:31 ori merges https://gerrit.wikimedia.org/r/#/c/223838/, making mw1154 a poolcounter server and effectively bypassing helium and potassium. Recoveries start coming in.

17:34 icinga declares api.svc.eqiad.wmnet OK.

Actionables