Incidents/2024-10-09 calico-codfw

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
  • Incident ID: 2024-10-09 calico-codfw
  • Start: 2024-10-09 05:01:36
  • End: 2024-10-09 06:10:00
  • Task: T376795
  • People paged: 2
  • Responder count: 6
  • Coordinators: Mark Bergsma
  • Affected metrics/SLOs: No relevant SLOs exist, but MediaWiki latencies and error rates flapped during the period
  • Impact: For approximately one hour after 05:00 UTC, some uncached requests (such as those from logged-in users, edits, and API calls) saw elevated latencies and/or errors due to an overload on MediaWiki nodes at the primary datacenter.

An operator used the new mwscript-k8s utility to run a batch maintenance of video transcodes via Timed Media Handler's requeueTranscodes.php. This worked fine for quite a few hours, but it was causing a silent, linearly growing regression: mwscript-k8s deploys an entire Helm release for every script invocation. This is by design, but it had an unforeseen consequence: close to 2,000 Calico NetworkPolicies were created (along with a large number of other resources, which caused problems of their own). Every Calico component in WikiKube had to react to the steadily growing number of policies, which eventually pushed some of those components past their resource limits, leading to throttling, then failures, and finally a slowly cascading outage that was taking hardware nodes out of rotation one by one.
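
As a rough illustration of the scale involved, the per-invocation objects can be counted directly in the mw-script namespace. The first command below is the one responders ran during the incident (it returned 1791 at the time); the second is a sketch that assumes the standard upstream Calico CRD resource name for namespaced network policies and the same namespace.

  # Count the Job objects left behind by mwscript-k8s invocations
  # (run by responders during the incident; returned 1791 at the time).
  kubectl -n mw-script get jobs --no-headers | wc -l

  # Count the per-invocation Calico NetworkPolicies; the resource name below
  # is the upstream Calico CRD and is assumed to match our clusters.
  kubectl -n mw-script get networkpolicies.crd.projectcalico.org --no-headers | wc -l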

Timeline

All times in UTC.

  • 05:01:36 START OF USER IMPACT <jinxer-wm> FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
  • 05:03:32 <jinxer-wm> FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
  • 05:04:15 <jinxer-wm> FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
  • 05:14:10  swfrench-wmf | alright, something isn't quite right in k8s in codfw, which I suspect may be related to the steadily increasing numbers of KubernetesCalicoDown alerts, which I'm starting to look at now
  • 05:14:21  swfrench-wmf | if other folks are around to lend a hand, a that would be greatly appreciated
  • 05:22:59  swfrench-wmf | kubectl -n kube-system get po -l k8s-app=calico-kube-controllers
  • 2024-10-09 05:22:59 swfrench-wmf NAME                                   READY   STATUS         RESTARTS     AGE
  • 2024-10-09 05:23:00 swfrench-wmf calico-kube-controllers-57c75d4867-cgzpc   0/1 CrashLoopBackOff   21 (4m19s ago)   21d
  • 2024-10-09 05:24:58 swfrench-wmf 2024-10-09 05:17:48.105 [ERROR][1] client.go 272: Error getting cluster information config ClusterInformation="default" error=Get "https://10.192.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
  • 2024-10-09 05:24:58 swfrench-wmf 2024-10-09 05:17:48.105 [FATAL][1] main.go 124: Failed to initialize Calico datastore error=Get "https://10.192.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
  • 2024-10-09 05:34:33 swfrench-wmf summarizing what I've seen so far:
  • 2024-10-09 05:34:33 swfrench-wmf 1. typha appears to be fine, or at least available
  • 2024-10-09 05:34:33 swfrench-wmf 2. k8s API calls made by the calico controller appear to be timing out
  • 2024-10-09 05:34:33 swfrench-wmf 3. we appear to be bleeding calico node pods as a result of #2?
  • 2024-10-09 05:34:57 _joe_   So etc issues?
  • 2024-10-09 05:35:05 _joe_   etcd
  • 2024-10-09 05:35:48 _joe_   scott please phone Alex and Janis
  • 2024-10-09 05:36:26 swfrench-wmf _joe_: ack, will do
  • 2024-10-09 05:45:28 akosiaris   I see liveness probes of calico-node pods failing
  • 2024-10-09 05:46:54 akosiaris   looking at per host stats, there is some increase in CPU usage but I don't yet see any CPU pegged boxes
  • 2024-10-09 05:47:50 akosiaris   so the CPU usage increase is probably due to the restart of pods
  • 2024-10-09 05:47:54 akosiaris   do we have an incident doc?
  • 2024-10-09 05:47:56 jelto   I just wrote Jayme but not sure if he is awake already
  • 2024-10-09 05:48:15 swfrench-wmf akosiaris: no, I but I can start one right now
  • 2024-10-09 05:48:47 akosiaris   looking into calico pods logs, nothing eye catching yet
  • 05:50 Incident opened. Mark becomes IC.
  • 2024-10-09 05:50:54 swfrench-wmf just dropped the doc link in _security
  • 2024-10-09 05:53:42 jelto   The calico/typha pods are hitting resource limits, https://grafana.wikimedia.org/d/2AfU0X_Mz/jayme-calico-resources?orgId=1&from=now-3h&to=now
  • 2024-10-09 05:53:42 jelto   Can we increase the limits maybe?
  • 2024-10-09 05:54:34 effie   it would proly get us as far
  • 2024-10-09 05:54:40 akosiaris   jelto: go ahead
  • 2024-10-09 05:55:12 effie   I will take a look what happened between 3-4 UTC that may have triggered this
  • 2024-10-09 05:56:39 swfrench-wmf there's clearly a sizable upward trend in get / list ops on the k8s API starting some time between 3 and 4 UTC, yeah - hard to tell if that's a result of things failing and restarting, though (https://grafana.wikimedia.org/goto/ZZZQY_kHg?orgId=1)
  • 2024-10-09 05:57:00 question_mark   effie: I think it started earlier, at least 8 hours ago?
  • 2024-10-09 05:57:42 akosiaris   On the calico-node level, there is nothing weird in logs. The weirdest thing I found was
  • 2024-10-09 05:57:50 akosiaris   bird: I/O loop cycle took 7808 ms for 6 events
  • 2024-10-09 05:59:25 effie   question_mark: so far betwwn 3-4 seems like a good starting point
  • 2024-10-09 05:59:32 jelto   resource bump like that? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1078791
  • 2024-10-09 05:59:53 akosiaris   jelto: +1ed, let's try it out to see if it stops the bleeding
  • 2024-10-09 06:00:00 akosiaris   want to deploy it? or should I?
  • 2024-10-09 06:00:22 question_mark   effie: I'm guessing the KubernetesAPILatency alerts earlier were then unrelated?
  • 2024-10-09 06:00:25 jelto   I can deploy when I got a +2 from jenkins
  • 2024-10-09 06:00:34 akosiaris   jelto: +2 yourself and proceed
  • 2024-10-09 06:00:37 jelto   ok
  • 2024-10-09 06:00:38 akosiaris   forget about CI right now
  • 2024-10-09 06:02:09 jelto   submitted, I'll apply once it's on deploy2002 and the diff looks as expected
  • 2024-10-09 06:02:22 akosiaris   jelto: the 1.5Gi thing, might backfire https://github.com/kubernetes/kubectl/issues/1426
  • 2024-10-09 06:02:28 effie   question_mark: looking
  • 2024-10-09 06:02:38 akosiaris   if it does, just manually switch to G on deploy2002 and proceed
  • 2024-10-09 06:04:01 akosiaris   first time I see some warnings in calico-node logs: 2024-10-09 06:03:00.473 [WARNING][144778] felix/health.go 211: Reporter is not ready. name="int_dataplane"
  • 2024-10-09 06:04:29 akosiaris   we are 25 nodes with fully functional calico-node pods
  • 2024-10-09 06:04:44 _joe_   akosiaris: I think it's a positive feeback
  • 2024-10-09 06:04:49 akosiaris   unless the patch that jelto has works, we might want to switchback to eqiad
  • 2024-10-09 06:05:11 _joe_   we should move traffic for everything but mw-rw to eqiad right now and maybe destroy all deployments but the mw ones
  • 2024-10-09 06:05:17 _joe_   so that the system can try to recover
  • 2024-10-09 06:05:26 jayme   o/
  • 2024-10-09 06:05:30 _joe_   akosiaris: only doubt about switching back is - do DBAs need to do anything?
  • 2024-10-09 06:05:39 jelto   diff looks ok, running helmfile -e codfw --selector name=calico apply -i --context 5
  • 2024-10-09 06:05:55 swfrench-wmf I can get started moving -ro services to eqiad
  • 2024-10-09 06:06:01 swfrench-wmf we're still provisioned for that
  • 2024-10-09 06:06:23 akosiaris   _joe_: hmmm, the circular replication thing?
  • 2024-10-09 06:06:31 _joe_   yes
  • 2024-10-09 06:06:37 _joe_   so we need to call someone in that team too
  • 2024-10-09 06:06:52 akosiaris   swfrench-wmf: yeah, that will allow us to serve at least some uses
  • 2024-10-09 06:06:55 _joe_   anyways, I'm still soaked in sweat and panting, don't trust a word I say rn :P
  • 2024-10-09 06:07:22 akosiaris   212 calico-nodes ready
  • 2024-10-09 06:07:24 akosiaris   what?
  • 2024-10-09 06:07:27 akosiaris   are we recovering?
  • 2024-10-09 06:07:28 _joe_   lol
  • 2024-10-09 06:07:30 jelto   apply still running,
  • 2024-10-09 06:07:32 _joe_   I guess that was jelto
  • 2024-10-09 06:07:36 swfrench-wmf lol
  • 2024-10-09 06:07:39 jelto   yes helmfile still doing it work
  • 2024-10-09 06:07:39 swfrench-wmf holding
  • 2024-10-09 06:07:48 jelto   but a lot of fresh new healthy pods
  • 2024-10-09 06:09:06 jelto   jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
  • 2024-10-09 06:09:22 _joe_   I still see a lot of app-level issues
  • 2024-10-09 06:09:26 question_mark   shall I call someone in d/p for the question of whether we can switch to eqiad right now?
  • 2024-10-09 06:09:30 _joe_   but calico has recovered?
  • 2024-10-09 06:09:42 akosiaris   I think we are going to bet getting out of the woods soon
  • 2024-10-09 06:10:00 Calico has recovered END OF USER IMPACT
  • 2024-10-09 06:12:56 jayme   other things talking to the apiserver seem to behave strange as well ... like helm-state-metrics in crashloop
  • 2024-10-09 06:13:24 swfrench-wmf precipitous drop in k8s API calls and etcd ops ... yeah, that has the smell of resource saturation -> feedback loop
  • 2024-10-09 06:13:50 jayme   and holes in metrics, like https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=kube-system&var-pod=calico-kube-controllers-57c75d4867-cgzpc&var-container=All
  • 2024-10-09 06:14:01 _joe_   swfrench-wmf: yeah as I guessed, it was probably calico down -> reallocate pods -> more strain on calico
  • 2024-10-09 06:14:46 jelto   I linked the previous incident "2024-04-03 calico/typha down" in the new doc as well which was quite similar i think
  • 2024-10-09 06:17:13 akosiaris   funnily enough, it takes a while to realize that the effects on MediaWiki are kinda hidden in https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&from=now-1h&to=now
  • 2024-10-09 06:17:30 akosiaris   the best sign is 200s dropping by 25%, from 4k to 3k for ~5 minutes
  • 2024-10-09 06:18:20 akosiaris   latencies increased too, but for a short period of time
  • 2024-10-09 06:18:36 akosiaris   this is weird btw, cause a death of calico-node doesn't reallocate pods
  • 2024-10-09 06:20:09 effie   so far the only thing I have found standing out was mw-script jobs starting
  • 2024-10-09 06:20:32 _joe_   effie: I doubt that could be the cause
  • 2024-10-09 06:20:37 effie   but unsure yet if it is related
  • 2024-10-09 06:20:39 _joe_   unless it was like 900 jobs at the same time
  • 2024-10-09 06:21:12 _joe_   then maybe we need to add some poolcounter support to mwscript-k8s to limit the number of jobs that can be started :)
  • 2024-10-09 06:23:40 jayme   there is this sharp increase in rx traffic on nodes running calico typha
  • 2024-10-09 06:24:01 jayme   ~7MB/s
  • 2024-10-09 06:24:10 jayme   starting around 4:00
  • 2024-10-09 06:24:59 jayme   which matches the increased GET rate...okay
  • 2024-10-09 06:27:27 jelto   I'm not sure if the resource increase was what really resolved the issue. It was probably more related to the redeploy of all calico pods at the same time. But that's just a hypothesis :)
  • 2024-10-09 06:28:11 swfrench-wmf jayme: looking at etcd objects in codfw, I see configmaps(?!) roughly double from 6k to 12k between 2 and 6?
  • 2024-10-09 06:28:16 jayme   jelto: did you see calico-node pods being killed?
  • 2024-10-09 06:28:53 jayme   swfrench-wmf: yeah...I was looking at the same thing
  • 2024-10-09 06:29:07 jayme   might be jobs effie mentioned?
  • 2024-10-09 06:29:26 jelto   just crashlooping, like `calico-node-s89nl                      0/1 CrashLoopBackOff   20 (3m46s ago)   133d` and a lot Running but not ready
  • 2024-10-09 06:29:39 akosiaris   there is nothing in SAL fwiw around that timeframe
  • 2024-10-09 06:29:53 swfrench-wmf ah, yeah ... perfectly correlates with jobs.batch objects ... and their associated network policies!
  • 2024-10-09 06:30:23 akosiaris   ?
  • 2024-10-09 06:30:32  *  akosiaris trying to put 2+2 together
  • 2024-10-09 06:30:36 swfrench-wmf https://grafana.wikimedia.org/goto/4N6Qs_zHR?orgId=1
  • 2024-10-09 06:30:55 akosiaris   kubectl -n mw-script get jobs |wc -l
  • 2024-10-09 06:30:55 akosiaris   1791
  • 2024-10-09 06:30:59 akosiaris   hoooooooolyyyy
  • 2024-10-09 06:31:03 jayme   there we go
  • 2024-10-09 06:31:04 swfrench-wmf heh, precisely
  • 2024-10-09 06:31:27 _joe_   yeah
  • 2024-10-09 06:31:28 swfrench-wmf so, we have a pretty wild amount of objects now
  • 2024-10-09 06:31:35 _joe_   ok so
  • 2024-10-09 06:31:40 jayme   the increased apiserver GET rate is mostly for node and endpoint object - that is probably just inflicted by calicos
  • 2024-10-09 06:31:44 akosiaris   durations less than 8h for all
  • 2024-10-09 06:31:55 _joe_   we might need to make mw networkpolicy names not depend on the deployment?
  • 2024-10-09 06:32:11 akosiaris   for maint scripts, yeah
  • 2024-10-09 06:32:15 akosiaris   it doesn't make sense
  • 2024-10-09 06:32:18 _joe_   so that for that many jobs we still have just one networkpolicy?
  • 2024-10-09 06:32:31 akosiaris   creating a netpol for every script invocation isn't worth it
  • 2024-10-09 06:42:14 RhinosF1    There was a thread on wikitech-l of a maint script being started on every wiki question_mark
  • 2024-10-09 06:42:27 RhinosF1    That was 00:37 utc + 1
  • 2024-10-09 06:47:28 akosiaris   usr/bin/python3 /usr/local/bin/mwscript-k8s -f --comment T363966 -- extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=mrwiki --throttle --video --key=144p.mjpeg.mov --missing
  • 2024-10-09 06:47:28 stashbot    T363966: Videos still unplayable on Safari in iOS 11 and 12 - https://phabricator.wikimedia.org/T363966
  • 2024-10-09 06:47:29 akosiaris   there we go
  • 2024-10-09 06:56:44 Incident closed
  • 2024-10-09 07:51:00 Jynus fixed the dates in the doc; they were in the UTC+1 timezone before that (unimportant for the incident, but FYI for coordination)
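
For reference, the mitigation attempted above (the calico/typha resource-limit bump from Gerrit change 1078791, which also redeployed all calico pods) follows the usual admin helmfile workflow on the deployment host. A minimal sketch, assuming the helmfile.d/admin directory seen in the apply output; the actual limit values live in the Gerrit change and are not repeated here:

  # On the deployment host (deploy2002 at the time), from the admin helmfile
  # directory (inferred from the "helmfile.d/admin" apply output above; the
  # exact path may differ).

  # Review what the change would do to the codfw calico release.
  helmfile -e codfw --selector name=calico diff

  # Apply it interactively with 5 lines of diff context (the command run
  # during the incident).
  helmfile -e codfw --selector name=calico apply -i --context 5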

Detection

The issue was detected by automated monitoring and responded to by the on-call SRE.

05:01:36 <jinxer-wm> FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page -

05:03:32 <jinxer-wm> FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod

05:04:15 <jinxer-wm> FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally
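
Beyond the IRC notifications above, the firing alerts can also be inspected directly. A minimal sketch of querying the Alertmanager v2 API for one of the alert names; the host and port are assumptions, and https://alerts.wikimedia.org remains the usual web entry point:

  # Hypothetical Alertmanager host/port: list currently-firing
  # KubernetesCalicoDown alerts and print the affected instances.
  curl -s 'http://alertmanager.example.wmnet:9093/api/v2/alerts?filter=alertname%3D"KubernetesCalicoDown"' \
    | jq -r '.[].labels | "\(.alertname) \(.instance)"'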

Conclusions

We almost had a full outage

What went well?

  • Automated monitoring detected the outage
  • The outage was root-caused, thanks to the many pairs of fresh eyes
  • The incident was handed off smoothly from Americas to EMEA SREs

What went poorly?

  • Impact to end users
  • We were unsure whether we could switch over to eqiad to react to the emergency

Where did we get lucky?

  • The issue was noticed just before the point where Americas SRE shifts end and before EMEA SRE shifts take over. We have a long-standing, known lack of coverage during that period.
  • We were lucky the transcode requeue batch was not submitted on a Friday US evening.

Actionables

Scorecard

Incident Engagement ScoreCard
Question | Answer (yes/no) | Notes

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes
  • Were pages routed to the correct sub-team(s)? yes
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. yes

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? yes
  • Was a public wikimediastatus.net entry created? yes
  • Is there a phabricator task for the incident? yes
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? no

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. unsure
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 11