No relevant SLOs exist. The only metric that comes to mind as relevant is deployment failures. A possible future SLO is Kubernetes API availability.
Impact
No impact for wiki users. 4 MediaWiki deployments failed. A global scap lock was put into place as a precautionary measure, to spare deployers a bad experience.
Triggered by a failed scap deployment of MediaWiki, the team dug into the underlying causes, unearthing network and CPU saturation on the WikiKube Kubernetes cluster API servers. Aside from the issue being witnessed by 2 deployers, no other discernible impact was observed. As an inadvertent result of the investigation, scap deployments became faster.
There were multiple interrelated causes that contributed to the incident and made debugging slower:
For the past few months, during any MediaWiki deploy, we had been saturating the Kubernetes control plane CPU in both eqiad and codfw. [Graphs: slowly increasing CPU saturation for the codfw and eqiad API servers.] The saturation also manifested as slowly increasing latencies for blackbox probes.
More and more worker servers were being added to the WikiKube clusters
1 of the kubemaster VMs was on a 1G Ganeti node, the other one on a 10G ganeti node. After the first incident, the second kubemaster (the one on 1G ganeti node) took over part of the load. Network wise, it was unable to sustain the level of traffic and ended up having ~6k TCP retransmits per second.
At the end of the week before the incident, we had enabled OpenTelemetry collector in eqiad, following a successful deployment in codfw
We were in the process of adding more Kubernetes API servers in both datacenters. [Graph: slowly increasing blackbox probe latency.]
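The scale of the network problem above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the extra bandwidth that ~6k TCP retransmits per second represented on the 1G kubemaster; the full-MSS segment size is an assumption, since actual segment sizes were not recorded.

```python
# Rough estimate of bandwidth consumed by TCP retransmits on the 1G node.
# Assumption (not from the incident data): each retransmitted segment is a
# full 1460-byte MSS payload plus ~78 bytes of TCP/IP/Ethernet framing.
RETRANSMITS_PER_SEC = 6_000
SEGMENT_BYTES = 1460 + 78

retransmit_bps = RETRANSMITS_PER_SEC * SEGMENT_BYTES * 8
print(f"retransmitted traffic ≈ {retransmit_bps / 1e6:.0f} Mbit/s")

# Share of a 1 Gbit/s NIC burned on retransmissions alone, on top of the
# first-attempt traffic that was already saturating the link.
share = retransmit_bps / 1e9
print(f"≈ {share:.1%} of a 1G link")
```

Under these assumptions, retransmissions alone consumed roughly 7% of the 1G link; the retransmits are a symptom rather than a cause, but they were a strong signal that the NIC was the bottleneck.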
Timeline
All times in UTC. Incident investigation spans 3 days.
12:23 <TimStarling> ^ trying to deploy a single-file change but k8s is very slow
12:23 <TimStarling> 12:23:11 Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'apply']' returned non-zero exit status 1.
12:35:09 <TimStarling> Started sync-prod-k8s [retry]
First page: [12:49:12] <+jinxer-wm> FIRING: ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown INCIDENT BEGINS
[12:52:08] <hnowlan> effie and I are looking at this, unclear what's wrong. pybal is failing to connect to the masters
[12:52:12] A recovery arrives, but it is not yet clear why. INCIDENT ENDS
Another page: [13:23:12] <+jinxer-wm> FIRING: ProbeDown: Service kubemaster1002:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown INCIDENT BEGINS
13:27:52: Incident opened. Kamila becomes IC
13:46:28 <cdanis> kamila_: scap lock is now in place per effie
13:50 <akosiaris> ok first thing's first, apparently firewall for etcd isn't open on wikikube-ctrl1001
13:54 akosiaris manually opened firewall port, cluster-health happy now – but probably a red herring
14:30 <cdanis> so that looks like traffic to *other* k8s api servers and also from etcd that is being retransmitted
14:30 <cdanis> so kinda everything
14:56 <akosiaris> !log migrate kubemaster1002 to ganeti1037 [10G NIC, was 1G]
<akosiaris> Tue May 28 15:00:28 2024 * memory transfer progress: 113.26 %
15:02:20 <cdanis> kubemaster1001 just saturated
15:02:23 <cdanis> briefly
15:03:04 <akosiaris> all I did was to restart kube-apiserver on kubemaster1002 fwiw
15:03:20 <cdanis> ok so just startup then probably
15:03:20 <akosiaris> which adds credence to your theory that we were maxing out on both
15:03:53 <akosiaris> VM migration done, I 'll start up the test shortly
15:12 akosiaris rolling-restarts kube-apiserver; NIC no longer saturated
15:24 rzl becomes IC
15:58 cdanis identifies that the kubemaster IPs are hardcoded in the otelcol values file, as of the end of last week, so it can only talk to the legacy hosts (10.64.0.117, 10.64.32.116)
16:08 Amir runs a scap backport, which saturates kubemaster1001 CPUs again but not for long enough that they either fail probes or reach scap’s timeout
17:00 Decision to leave things as-is over the EU nighttime, since they’re not ideal but workable for deployments. rzl stands down as IC INCIDENT ENDS
[Graphs showing behavior during the tests.]
Following day 2024-05-29.
From 09:27 until 11:26, Alex runs a battery of tests against the kube-API servers, using a simulation of deployments. The full writeup is in T366094#9842257. The results mostly match what Chris wrote the previous day, with one difference: the OpenTelemetry collector was identified as a contributing factor, increasing the traffic behind the observed packet loss and NIC saturation by more than 50%. The tests didn't cause any issues aside from a temporary increase in latency that subsided quickly.
2 new controller nodes added => stable
OTel collector still disabled
Mitigation put in place for later: the logs & metrics pipelines were disabled, in the hope of reducing bandwidth usage to ⅓ of the original (the full ⅓ reduction did not pan out, but the later reduction was still substantial)
[Graph: reduction in otelcol traffic due to configuration changes; eqiad (without otelcol running at this point) is used as a "control" while the config is changed on codfw.]
Following day 2024-05-30
02:02. Chris corroborates that the OpenTelemetry collector substantially increases the network traffic to the WikiKube API servers, and searches for fixes.
19:53. Two fixes by Chris to the OpenTelemetry collector's configuration lowered the increase to ~10-15% (comparing codfw to eqiad).
All 3 new dedicated nodes are added; the old nodes remain in place for the extra network capacity; everything is stable.
Followup filed to upgrade to 10G NICs on the bare metal hosts, which should allow the Ganeti VMs to be removed.
Detection
The issue was detected via a failed deployment, which requires a human witness to be present. There were alerts of the following nature:
These alerts had been firing intermittently for some time. However:
Due to the very slowly increasing nature of the underlying cause (latency due to CPU saturation), it wasn't possible to correlate them with deployments. Furthermore, when they did accompany an incident, corrective action was already being taken (moving the Kubernetes masters to dedicated machines).
The CPU saturation events were short enough not to show up in the "Host Overview" Grafana dashboard. This effect was exacerbated by the fact that the CPU panels in that dashboard use a resolution of 1/3, which averages short spikes away.
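To illustrate why coarse dashboard resolution hides short saturation events, here is a small Python sketch with made-up numbers (not taken from the actual dashboard): a 1-second CPU series with a 30-second spike to 100% is averaged into 5-minute buckets, the kind of coarsening a low panel resolution applies.

```python
# Illustrative only: 1 hour of per-second CPU% samples at a 20% baseline,
# with one 30-second spike to 100%, downsampled by averaging.
baseline, spike_len, bucket = 20.0, 30, 300

series = [baseline] * 3600
for i in range(1800, 1800 + spike_len):  # 30 s at 100% CPU
    series[i] = 100.0

# Average into 5-minute (300-sample) buckets, as a coarse panel would.
buckets = [series[i:i + bucket] for i in range(0, len(series), bucket)]
averaged = [sum(b) / len(b) for b in buckets]

print(f"true peak: {max(series):.0f}%")                 # 100%
print(f"peak after 5-min averaging: {max(averaged):.0f}%")  # 28%
```

A full saturation event shows up as a barely noticeable 28% blip after averaging, which is consistent with the spikes going unnoticed on the dashboard.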
Conclusions
Growing pains for the WikiKube cluster during the mw-on-k8s project were expected. There had already been Kubernetes API related incidents, and the team, after an investigation, concluded that adding more etcd capacity, in the form of 3 new dedicated stacked API servers, would alleviate those problems. However:
It appears there were also short CPU saturation events that were never directly detected and eventually caused a few failed deployments.
The default configuration of the OpenTelemetry collector with k8sattributesprocessor enabled increased the network traffic egressing the API servers by more than 50%. Cumulative traffic egressing from all API servers reached 272MB/s (~2.1Gbps), enough to heavily saturate any 1Gbps link.
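The unit conversion behind that figure, as a quick Python check (decimal megabytes assumed):

```python
# 272 MB/s cumulative egress from all API servers, converted to Gbit/s.
egress_bytes_per_sec = 272e6

egress_gbps = egress_bytes_per_sec * 8 / 1e9
print(f"cumulative egress ≈ {egress_gbps:.2f} Gbit/s")  # ≈ 2.18 Gbit/s

# Any single 1 Gbit/s NIC taking even half of this traffic saturates.
print("exceeds a 1G link:", egress_gbps > 1.0)
```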
There are 2 different load-balancing methods for traffic to the API servers: one for workloads outside the cluster itself (e.g. kubelet, scap, kube-proxy, rsyslog) and one for workloads inside the cluster, e.g. the OpenTelemetry collector, Calico, and eventrouter (we call these cluster components). The former go via LVS, the latter via Kubernetes' probabilistic DNAT. As a result, the 2 methods might end up sending traffic to different sets of nodes, complicating things.
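A toy Python model of the two paths (the host names and pool membership below are hypothetical, not the real WikiKube configuration) shows how the two client populations can end up loading different node sets:

```python
import random

random.seed(0)  # reproducible illustration

# Hypothetical API server inventory; names are made up.
api_servers = ["ctrl1001", "ctrl1002", "ctrl1003",
               "kubemaster1001", "kubemaster1002"]

# Path 1: external clients (kubelet, scap, ...) go via LVS, which only
# balances across the subset of servers pooled behind the service IP.
lvs_pool = ["ctrl1001", "ctrl1002", "ctrl1003"]

# Path 2: cluster components hit the in-cluster Service IP; kube-proxy
# style DNAT picks an endpoint roughly uniformly at random among all
# ready endpoints.
external_targets = {random.choice(lvs_pool) for _ in range(1000)}
internal_targets = {random.choice(api_servers) for _ in range(1000)}

# The two mechanisms can load different sets of nodes entirely:
print("reached only from inside the cluster:",
      sorted(internal_targets - external_targets))
```

Because the LVS pool and the in-cluster endpoint set are configured independently, draining a node from one path does not necessarily drain it from the other, which is what complicates reasoning about where traffic actually lands.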
What went well?
The incident did not impact anyone aside from 2 deployers. End users saw minimal issues, the biggest being a perceived slowness during a few tests.
We had alerts and good communication between members. An IC took over quickly and handed over to a different IC when needed.
What went poorly?
The short CPU saturation events were not noticed earlier.
The deployment of the OpenTelemetry collector in codfw coincided with the addition of new Kubernetes API servers, effectively consuming the extra capacity added there and making codfw a poor testbed for the subsequent enablement in eqiad.
Where did we get lucky?
The incident happened during EU morning hours, allowing a large percentage of the Service Operations team, which owns the WikiKube clusters, to be present and help figure out the situation.
Getting rid of the CPU saturation during MediaWiki deploys doubled k8s deployment speed: we went from ~7 minutes to ~3.5 minutes for a full scap "sync" to k8s.
Links to relevant documentation
…
Actionables
A decision was made to add 10G NICs to all new WikiKube API servers (T366204 and T366205).