For approximately 30 minutes, 1% of traffic received errors (120 requests/s).
A change to the global k8s defaults was merged that caused the next MediaWiki on Kubernetes deployment to pick up the wrong certificate for TLS termination.
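Diagnosing this kind of failure usually starts with checking which certificate the endpoint actually serves after the deployment. The following is a minimal sketch of such a check; the hostname is a placeholder, not a value from this incident.

```python
import socket
import ssl

# Placeholder endpoint; substitute the real service host and TLS port.
HOST, PORT = "mw-api-ext.example.internal", 4447

ctx = ssl.create_default_context()
ctx.check_hostname = False   # we want to see the certificate even if it is the wrong one
ctx.verify_mode = ssl.CERT_NONE

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        der = tls.getpeercert(binary_form=True)

# Pipe the PEM through `openssl x509 -noout -subject -issuer` to compare the
# issuer/subject with what the chart is expected to serve after the deployment.
print(ssl.DER_cert_to_PEM_cert(der))
```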
Timeline
All times in UTC.
09:37 A change to the global k8s defaults was merged, switching supported charts to cert-manager certificates (this is what caused the issue later on)
(ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)
(ProbeDown) firing: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4)
There were also secondary failures (a sketch of the failing probe checks follows the timeline):
[12:02] <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL
[12:51] <jinxer-wm> (KubernetesAPILatency) resolved: (4) High Kubernetes API latency
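The ProbeDown alerts above correspond to blackbox probes that could no longer complete a verified TLS connection to the service ports. A rough reproduction of such a probe, useful when triaging from a shell, is sketched below; the URL and CA bundle path are assumptions, not the actual probe configuration.

```python
import ssl
import urllib.error
import urllib.request

# Placeholders: substitute the real service address, port, path and trusted CA bundle.
URL = "https://mw-api-ext.example.internal:4447/"
CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"

ctx = ssl.create_default_context(cafile=CA_BUNDLE)

try:
    with urllib.request.urlopen(URL, context=ctx, timeout=5) as resp:
        print("probe ok, status", resp.status)
except urllib.error.URLError as exc:
    # With the wrong certificate served, the reason is typically an ssl.SSLCertVerificationError.
    print("probe failed:", exc.reason)
```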
Conclusions
What went well?
The root of the problem could be identified quickly because the person who made the breaking change was one of the responders.
What went poorly?
The root cause should have been visible in CI (unfortunately not in the CI of the repo where the change was made; hieradata vs. deployment-charts).
Where did we get lucky?
Only 1% of traffic is currently routed to k8s, so the impact was limited.
Links to relevant documentation
…
Actionables
Because of the low traffic (and therefore the relatively small number of errors), it took some time to pin this to k8s. Does that need actionables, or will it resolve itself as k8s serves the majority of requests?
Review whether some deployment procedures/testing should be strengthened (e.g. detecting surprising changes on the next deployment, canary deployments for k8s, etc.); a canary-gate sketch follows this list.
Some metrics become unavailable or unhealthy during deployment. Could something be done about that (either for the metrics themselves or to mitigate the deployment impact)?
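To make the canary idea above concrete, a rollout could be gated on a handful of smoke probes against a canary release before the full deployment proceeds. This is only an illustrative sketch: the endpoint names are hypothetical and the check is far simpler than a real canary analysis.

```python
import ssl
import sys
import urllib.request

# Hypothetical canary endpoints for the new revision; substitute the real ones.
CANARY_URLS = [
    "https://mw-api-ext-canary.example.internal:4447/",
    "https://mw-api-int-canary.example.internal:4446/",
]

def probe(url: str) -> bool:
    """Return True if the endpoint answers over verified TLS with a non-error status."""
    try:
        with urllib.request.urlopen(url, context=ssl.create_default_context(), timeout=5) as resp:
            return resp.status < 400
    except Exception as exc:
        print(f"{url}: {exc}", file=sys.stderr)
        return False

if not all(probe(u) for u in CANARY_URLS):
    sys.exit("canary probes failed; aborting the full rollout")
print("canary probes passed; proceeding with the full rollout")
```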