Incidents/20150709-poolcounter

Summary

There was an 8-minute outage of the API at eqiad, from 2015-07-09 17:26:35 to 2015-07-09 17:34:15, caused by scheduled maintenance and an unforeseen dependency. helium was powered down for https://phabricator.wikimedia.org/T84770. However, helium is also a poolcounter server. MediaWiki waits for a 0.5 second timeout before falling back to the next poolcounter server in line, and that timeout is too high.
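To illustrate why the fallback timeout matters, here is a minimal sketch of an ordered fallback across lock servers. It is not MediaWiki's actual client code. The function names, the server list, and the port number are hypothetical; only the 0.5 second timeout comes from this report. The point is that every request that tries an unreachable server first pays the full timeout before reaching the next one.

```python
import socket
from typing import Callable, Sequence, Tuple

Address = Tuple[str, int]


def connect_with_fallback(
    servers: Sequence[Address],
    timeout: float,
    connect: Callable[..., object] = socket.create_connection,
):
    """Try each server in order.

    Returns (connection, failed_attempts). Each failed attempt can block
    the caller for up to `timeout` seconds before the next server is tried.
    """
    failed = 0
    for address in servers:
        try:
            return connect(address, timeout=timeout), failed
        except OSError:  # includes socket.timeout and connection refused
            failed += 1
    raise ConnectionError("no poolcounter server reachable")


def worst_case_delay(failed_attempts: int, timeout: float) -> float:
    """Upper bound on the time a single request spends waiting on dead servers."""
    return failed_attempts * timeout


if __name__ == "__main__":
    # Hypothetical server list; the port is a placeholder for illustration.
    servers = [("helium", 7531), ("potassium", 7531)]

    def fake_connect(address, timeout):
        # Simulate helium being powered down.
        if address[0] == "helium":
            raise socket.timeout("timed out")
        return address

    conn, failed = connect_with_fallback(servers, 0.5, connect=fake_connect)
    print(conn, failed, worst_case_delay(failed, 0.5))
```

With one server down and a 0.5 second timeout, each affected request can hold its worker for up to half a second before it does any real work, which is consistent with the HHVM queue size alerts seen in the timeline.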

Timeline

17:22 cmjohnson1: shutting down helium for a few minutes to move within the same row

17:26:35 icinga complains about api.svc.eqiad.wmnet. Before that it had already complained about HHVM queue sizes on various mw hosts. mutante notices that helium is a poolcounter host

17:31 ori merges https://gerrit.wikimedia.org/r/#/c/223838/ making mw1154 a poolcounter server, effectively bypassing helium and potassium. Recoveries start coming in

17:34 icinga declares api.svc.eqiad.wmnet OK.

Actionables

Stop the failure of a single poolcounter server from being a SPOF for the service, the API, and the site. Done - T105378
Revert https://gerrit.wikimedia.org/r/#/c/223838 after helium is deemed fine again (merge https://gerrit.wikimedia.org/r/#/c/223847/ to revert). Done - T105379
Remove poolcounter from mw1154 for housecleaning (the box is going to be reimaged anyway, hopefully soon). Done - T105380