Incidents/20150709-poolcounter
Summary
There was an 8-minute outage of the API at eqiad, from 2015-07-09 17:26:35 to 2015-07-09 17:34:15, caused by scheduled maintenance and an unforeseen dependency. helium was powered down for https://phabricator.wikimedia.org/T84770. However, helium is also a poolcounter server. MediaWiki waits for a 0.5-second timeout before falling back to the next poolcounter server in line, which is too high.
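To illustrate why a 0.5-second fallback timeout is costly, here is a minimal Python sketch (not MediaWiki's actual PHP client code). It assumes servers are tried in a fixed order with helium first, that every request consults poolcounter, and that an unreachable server costs the full timeout. The function names, the worker count and the service time are hypothetical, chosen only to show the arithmetic.

```python
from typing import Callable, Optional, Sequence, Tuple


def pick_server(
    servers: Sequence[str],
    is_reachable: Callable[[str], bool],
    connect_timeout: float = 0.5,
) -> Tuple[Optional[str], float]:
    """Try servers in order; return (server used, seconds lost to timeouts)."""
    waited = 0.0
    for server in servers:
        if is_reachable(server):
            return server, waited
        # An unreachable server costs the full timeout before we fall back.
        waited += connect_timeout
    return None, waited


def max_requests_per_second(workers: int, service_time: float, penalty: float) -> float:
    """Upper bound on throughput when each request holds a worker
    for service_time + penalty seconds."""
    return workers / (service_time + penalty)


if __name__ == "__main__":
    servers = ["helium", "potassium"]  # assumed order, helium first
    down = {"helium"}                  # helium is powered off

    server, penalty = pick_server(servers, lambda s: s not in down)
    print(server, penalty)

    # Hypothetical capacity figures, not measurements from the incident.
    print(max_requests_per_second(workers=50, service_time=0.1, penalty=0.0))
    print(max_requests_per_second(workers=50, service_time=0.1, penalty=penalty))
```

Under these assumed numbers, a 0.5-second penalty on a 0.1-second request cuts the throughput ceiling to one sixth, which is consistent with request queues growing on the application servers while the first poolcounter server is unreachable.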
Timeline
17:22 cmjohnson1: shutting down helium for a few minutes to move within the same row
17:26:35 icinga complains about api.svc.eqiad.wmnet. Before that it had already complained about HHVM queue sizes on various mw hosts. mutante notices that helium is a poolcounter host
17:31 ori merges https://gerrit.wikimedia.org/r/#/c/223838/ making mw1154 a poolcounter server, effectively bypassing helium and potassium. Recoveries start coming in
17:34 icinga declares api.svc.eqiad.wmnet OK.
Actionables
- Stop the failure of a single poolcounter server from being a SPOF for the service, the API, and the site. Done - T105378
- Revert https://gerrit.wikimedia.org/r/#/c/223838 once helium is deemed fine again (merge https://gerrit.wikimedia.org/r/#/c/223847/ to revert). Done - T105379
- Remove poolcounter from mw1154 as housekeeping (the box is going to be reimaged anyway, hopefully soon). Done - T105380