User:CDanis/Use more heatmaps

From Wikitech
An illustration of a real incident on Wikimedia's API appserver cluster: load increased, and a small group of older servers saturated their CPUs, which greatly increased tail latency for users. Asking our load balancer to send fewer queries to the older servers restored user happiness.
Plots from Wikimedia's appserver pool. The average CPU utilization across nodes is low, but that's not the whole story: the heatmap shows there are actually two distinct groups of nodes, one with low utilization and one with medium-to-high utilization (plus a couple of debug servers with near-zero utilization).
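The effect described in the caption above can be reproduced with a small sketch. Everything here is an assumption made for illustration: the node counts, the utilization ranges, the 10% bucket width, and the helper name `heatmap_column` are all hypothetical, and the data is synthetic rather than taken from Wikimedia's monitoring. The sketch computes one column of a heatmap (a count of nodes per utilization bucket at a single instant) and compares it with the fleet-wide average.

```python
import random
from statistics import mean


def heatmap_column(samples, bucket_edges):
    """Count how many samples fall into each bucket [edge_i, edge_{i+1})."""
    counts = [0] * (len(bucket_edges) - 1)
    last = len(counts) - 1
    for value in samples:
        for i in range(len(counts)):
            lo, hi = bucket_edges[i], bucket_edges[i + 1]
            # The last bucket also includes its upper edge.
            if lo <= value < hi or (i == last and value == hi):
                counts[i] += 1
                break
    return counts


# Synthetic per-node CPU utilization (percent) at one instant.
# These numbers are made up for illustration; they are NOT real Wikimedia data.
rng = random.Random(0)
newer_nodes = [rng.uniform(15, 25) for _ in range(40)]  # lightly loaded group
older_nodes = [rng.uniform(70, 90) for _ in range(8)]   # heavily loaded group
debug_nodes = [rng.uniform(0, 2) for _ in range(2)]     # nearly idle
cpu_percent = newer_nodes + older_nodes + debug_nodes

edges = list(range(0, 101, 10))  # 0-10%, 10-20%, ..., 90-100%
average = mean(cpu_percent)
column = heatmap_column(cpu_percent, edges)

# The single average looks unremarkable...
print(f"average CPU: {average:.1f}%")
# ...while the per-bucket counts show separate groups of nodes.
for lo, hi, count in zip(edges, edges[1:], column):
    print(f"{lo:3d}-{hi:3d}%: {'#' * count}")
```

In this sketch the average lands in a range that no group of nodes actually occupies, while the bucket counts show the separate clusters directly. A heatmap is this column repeated for every time step, with each count mapped to a color.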