Incident documentation/20150709-poolcounter

From Wikitech

There was an 8-minute outage of the API at eqiad, starting at 2015-07-09 17:26:35 and ending at 2015-07-09 17:34:15, caused by scheduled maintenance and an unforeseen dependency. helium was powered down to be moved within the same row; however, it is also a poolcounter machine. Unfortunately, MediaWiki has a 0.5-second timeout when falling back to the next poolcounter server in line, which is too high: requests that wait out the timeout on an unreachable server tie up workers and let queues build.
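
The effect of the fallback timeout can be approximated with a back-of-the-envelope model. This is only a sketch, not MediaWiki's or poolcounter's actual code: the function name, the worker count and the service time are hypothetical, and it assumes every request waits out the full 0.5 s on the unreachable server before the next one is tried.

```python
def max_throughput(workers: int, service_time_s: float, fallback_wait_s: float = 0.0) -> float:
    """Upper bound on requests/second for a fixed pool of workers.

    Each worker handles one request at a time. A request occupies it for
    service_time_s plus any time spent waiting out a timeout on a dead server.
    """
    return workers / (service_time_s + fallback_wait_s)


# Hypothetical figures, chosen only for illustration.
WORKERS = 60
SERVICE_TIME_S = 0.1

healthy = max_throughput(WORKERS, SERVICE_TIME_S)
degraded = max_throughput(WORKERS, SERVICE_TIME_S, fallback_wait_s=0.5)
print(f"healthy: {healthy:.0f} req/s, degraded: {degraded:.0f} req/s")
```

Under these assumptions a 0.5 s wait on top of a 0.1 s request cuts capacity to one sixth, whatever the worker count, which is consistent with the HHVM queue alerts seen before the API check failed.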


17:22 cmjohnson1: shutting down helium for a few minutes to move within the same row

17:26:35 icinga complains about api.svc.eqiad.wmnet. Before that, it had already complained about HHVM queue sizes on various mw hosts. mutante noticed that helium is a poolcounter host.

17:31 ori merges a change making mw1154 a poolcounter server, effectively bypassing helium and potassium. Recoveries start coming in.

17:34 icinga declares api.svc.eqiad.wmnet OK.