Incidents/2024-10-09 calico-codfw
document status: in-review
Summary
Incident ID | 2024-10-09 calico-codfw | Start | 2024-10-09 05:01:36 |
---|---|---|---|
Task | T376795 | End | 2024-10-09 06:10:00 |
People paged | 2 | Responder count | 6 |
Coordinators | Mark Bergsma | Affected metrics/SLOs | No relevant SLOs exist, but MediaWiki latencies and error rates flapped during the period |
Impact | For approximately one hour after 05:00 UTC, some uncached requests, such as those from logged-in users, edits, and API calls, saw high latencies and/or errors due to an overload on MediaWiki nodes in the primary datacenter. |
An operator used the new mwscript-k8s utility to run a batch of video transcode maintenance via TimedMediaHandler's requeueTranscodes.php. This worked fine for quite a few hours, but it was silently causing a linearly growing regression: mwscript-k8s deploys an entire Helm release on every script invocation. This is by design, but it had an unforeseen consequence: close to 2k Calico NetworkPolicies were created (along with a large number of other resources, which would create their own set of problems). All Calico components in WikiKube had to react to the growing number of policies, some of them eventually hit their resource limits, and that led to throttling, then to failures, and finally to a slowly cascading outage that was taking hardware nodes out of rotation one by one.
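The scale of the accumulation is easy to see once you know where to look: every mwscript-k8s invocation leaves behind its own Job, NetworkPolicies, ConfigMaps and other per-release objects in the mw-script namespace. Below is a minimal sketch of the kind of checks responders eventually ran (see 06:30 in the timeline); the Calico CRD resource name is an assumption and may differ per cluster:

```
# Count leftover per-invocation Jobs in the mw-script namespace.
# At 06:30, `kubectl -n mw-script get jobs | wc -l` returned 1791 (header included).
kubectl -n mw-script get jobs --no-headers | wc -l

# Each Helm release also ships its own network policies; both the Kubernetes
# objects and the Calico CRD flavour (resource name assumed) pile up.
kubectl -n mw-script get networkpolicies --no-headers | wc -l
kubectl -n mw-script get networkpolicies.crd.projectcalico.org --no-headers | wc -l

# ConfigMaps roughly doubled cluster-wide (6k -> 12k) over the same window;
# a per-namespace count makes the trend visible without a dashboard.
kubectl -n mw-script get configmaps --no-headers | wc -l
```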
Timeline
All times in UTC.
- 05:01:36 START OF USER IMPACT <jinxer-wm> FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
- 05:03:32 <jinxer-wm> FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
- 05:04:15 <jinxer-wm> FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
- 05:14:10 swfrench-wmf | alright, something isn't quite right in k8s in codfw, which I suspect may be related to the steadily increasing numbers of KubernetesCalicoDown alerts, which I'm starting to look at now
- 05:14:21 swfrench-wmf | if other folks are around to lend a hand, a that would be greatly appreciated
- 05:22:59 swfrench-wmf | kubectl -n kube-system get po -l k8s-app=calico-kube-controllers
- 2024-10-09 05:22:59 swfrench-wmf NAME READY STATUS RESTARTS AGE
- 2024-10-09 05:23:00 swfrench-wmf calico-kube-controllers-57c75d4867-cgzpc 0/1 CrashLoopBackOff 21 (4m19s ago) 21d
- 2024-10-09 05:24:58 swfrench-wmf 2024-10-09 05:17:48.105 [ERROR][1] client.go 272: Error getting cluster information config ClusterInformation="default" error=Get "https://10.192.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
- 2024-10-09 05:24:58 swfrench-wmf 2024-10-09 05:17:48.105 [FATAL][1] main.go 124: Failed to initialize Calico datastore error=Get "https://10.192.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
- 2024-10-09 05:34:33 swfrench-wmf summarizing what I've seen so far:
- 2024-10-09 05:34:33 swfrench-wmf 1. typha appears to be fine, or at least available
- 2024-10-09 05:34:33 swfrench-wmf 2. k8s API calls made by the calico controller appear to be timing out
- 2024-10-09 05:34:33 swfrench-wmf 3. we appear to be bleeding calico node pods as a result of #2?
- 2024-10-09 05:34:57 _joe_ So etc issues?
- 2024-10-09 05:35:05 _joe_ etcd
- 2024-10-09 05:35:48 _joe_ scott please phone Alex and Janis
- 2024-10-09 05:36:26 swfrench-wmf _joe_: ack, will do
- 2024-10-09 05:45:28 akosiaris I see liveness probes of calico-node pods failing
- 2024-10-09 05:46:54 akosiaris looking at per host stats, there is some increase in CPU usage but I don't yet see any CPU pegged boxes
- 2024-10-09 05:47:50 akosiaris so the CPU usage increase is probably due to the restart of pods
- 2024-10-09 05:47:54 akosiaris do we have an incident doc?
- 2024-10-09 05:47:56 jelto I just wrote Jayme but not sure if he is awake already
- 2024-10-09 05:48:15 swfrench-wmf akosiaris: no, I but I can start one right now
- 2024-10-09 05:48:47 akosiaris looking into calico pods logs, nothing eye catching yet
- 05:50 Incident opened. Mark becomes IC.
- 2024-10-09 05:50:54 swfrench-wmf just dropped the doc link in _security
- 2024-10-09 05:53:42 jelto The calico/typha pods are hitting resource limits, https://grafana.wikimedia.org/d/2AfU0X_Mz/jayme-calico-resources?orgId=1&from=now-3h&to=now
- 2024-10-09 05:53:42 jelto Can we increase the limits maybe?
- 2024-10-09 05:54:34 effie it would proly get us as far
- 2024-10-09 05:54:40 akosiaris jelto: go ahead
- 2024-10-09 05:55:12 effie I will take a look what happened between 3-4 UTC that may have triggered this
- 2024-10-09 05:56:39 swfrench-wmf there's clearly a sizable upward trend in get / list ops on the k8s API starting some time between 3 and 4 UTC, yeah - hard to tell if that's a result of things failing and restarting, though (https://grafana.wikimedia.org/goto/ZZZQY_kHg?orgId=1)
- 2024-10-09 05:57:00 question_mark effie: I think it started earlier, at least 8 hours ago?
- 2024-10-09 05:57:42 akosiaris On the calico-node level, there is nothing weird in logs. The weirdest thing I found was
- 2024-10-09 05:57:50 akosiaris bird: I/O loop cycle took 7808 ms for 6 events
- 2024-10-09 05:59:25 effie question_mark: so far betwwn 3-4 seems like a good starting point
- 2024-10-09 05:59:32 jelto resource bump like that? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1078791
- 2024-10-09 05:59:53 akosiaris jelto: +1ed, let's try it out to see if it stops the bleeding
- 2024-10-09 06:00:00 akosiaris want to deploy it? or should I?
- 2024-10-09 06:00:22 question_mark effie: I'm guessing the KubernetesAPILatency alerts earlier were then unrelated?
- 2024-10-09 06:00:25 jelto I can deploy when I got a +2 from jenkins
- 2024-10-09 06:00:34 akosiaris jelto: +2 yourself and proceed
- 2024-10-09 06:00:37 jelto ok
- 2024-10-09 06:00:38 akosiaris forget about CI right now
- 2024-10-09 06:02:09 jelto submitted, I'll apply once it's on deploy2002 and the diff looks as expected
- 2024-10-09 06:02:22 akosiaris jelto: the 1.5Gi thing, might backfire https://github.com/kubernetes/kubectl/issues/1426
- 2024-10-09 06:02:28 effie question_mark: looking
- 2024-10-09 06:02:38 akosiaris if it does, just manually switch to G on deploy2002 and proceed
- 2024-10-09 06:04:01 akosiaris first time I see some warnings in calico-node logs: 2024-10-09 06:03:00.473 [WARNING][144778] felix/health.go 211: Reporter is not ready. name="int_dataplane"
- 2024-10-09 06:04:29 akosiaris we are 25 nodes with fully functional calico-node pods
- 2024-10-09 06:04:44 _joe_ akosiaris: I think it's a positive feeback
- 2024-10-09 06:04:49 akosiaris unless the patch that jelto has works, we might want to switchback to eqiad
- 2024-10-09 06:05:11 _joe_ we should move traffic for everything but mw-rw to eqiad right now and maybe destroy all deployments but the mw ones
- 2024-10-09 06:05:17 _joe_ so that the system can try to recover
- 2024-10-09 06:05:26 jayme o/
- 2024-10-09 06:05:30 _joe_ akosiaris: only doubt about switching back is - do DBAs need to do anything?
- 2024-10-09 06:05:39 jelto diff looks ok, running helmfile -e codfw --selector name=calico apply -i --context 5
- 2024-10-09 06:05:55 swfrench-wmf I can get started moving -ro services to eqiad
- 2024-10-09 06:06:01 swfrench-wmf we're still provisioned for that
- 2024-10-09 06:06:23 akosiaris _joe_: hmmm, the circular replication thing?
- 2024-10-09 06:06:31 _joe_ yes
- 2024-10-09 06:06:37 _joe_ so we need to call someone in that team too
- 2024-10-09 06:06:52 akosiaris swfrench-wmf: yeah, that will allow us to serve at least some uses
- 2024-10-09 06:06:55 _joe_ anyways, I'm still soaked in sweat and panting, don't trust a word I say rn :P
- 2024-10-09 06:07:22 akosiaris 212 calico-nodes ready
- 2024-10-09 06:07:24 akosiaris what?
- 2024-10-09 06:07:27 akosiaris are we recovering?
- 2024-10-09 06:07:28 _joe_ lol
- 2024-10-09 06:07:30 jelto apply still running,
- 2024-10-09 06:07:32 _joe_ I guess that was jelto
- 2024-10-09 06:07:36 swfrench-wmf lol
- 2024-10-09 06:07:39 jelto yes helmfile still doing it work
- 2024-10-09 06:07:39 swfrench-wmf holding
- 2024-10-09 06:07:48 jelto but a lot of fresh new healthy pods
- 2024-10-09 06:09:06 jelto jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
- 2024-10-09 06:09:22 _joe_ I still see a lot of app-level issues
- 2024-10-09 06:09:26 question_mark shall I call someone in d/p for the question of whether we can switch to eqiad right now?
- 2024-10-09 06:09:30 _joe_ but calico has recovered?
- 2024-10-09 06:09:42 akosiaris I think we are going to bet getting out of the woods soon
- 2024-10-09 06:10:00 Calico has recovered END OF USER IMPACT
- 2024-10-09 06:12:56 jayme other things talking to the apiserver seem to behave strange as well ... like helm-state-metrics in crashloop
- 2024-10-09 06:13:24 swfrench-wmf precipitous drop in k8s API calls and etcd ops ... yeah, that has the smell of resource saturation -> feedback loop
- 2024-10-09 06:13:50 jayme and holes in metrics, like https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=kube-system&var-pod=calico-kube-controllers-57c75d4867-cgzpc&var-container=All
- 2024-10-09 06:14:01 _joe_ swfrench-wmf: yeah as I guessed, it was probably calico down -> reallocate pods -> more strain on calico
- 2024-10-09 06:14:46 jelto I linked the previous incident "2024-04-03 calico/typha down" in the new doc as well which was quite similar i think
- 2024-10-09 06:17:13 akosiaris funnily enough, it takes a while to realize that the effects on MediaWiki are kinda hidden in https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&from=now-1h&to=now
- 2024-10-09 06:17:30 akosiaris the best sign is 200s dropping by 25%, from 4k to 3k for ~5 minutes
- 2024-10-09 06:18:20 akosiaris latencies increased too, but for a short period of time
- 2024-10-09 06:18:36 akosiaris this is weird btw, cause a death of calico-node doesn't reallocate pods
- 2024-10-09 06:20:09 effie so far the only thing I have found standing out was mw-script jobs starting
- 2024-10-09 06:20:32 _joe_ effie: I doubt that could be the cause
- 2024-10-09 06:20:37 effie but unsure yet if it is related
- 2024-10-09 06:20:39 _joe_ unless it was like 900 jobs at the same time
- 2024-10-09 06:21:12 _joe_ then maybe we need to add some poolcounter support to mwscript-k8s to limit the number of jobs that can be started :)
- 2024-10-09 06:23:40 jayme there is this sharp increase in rx traffic on nodes running calico typha
- 2024-10-09 06:24:01 jayme ~7MB/s
- 2024-10-09 06:24:10 jayme starting around 4:00
- 2024-10-09 06:24:59 jayme which matches the increased GET rate...okay
- 2024-10-09 06:27:27 jelto I'm not sure if the resource increase was what really resolved the issue. It was probably more related to the redeploy of all calico pods at the same time. But that's just a hypothesis :)
- 2024-10-09 06:28:11 swfrench-wmf jayme: looking at etcd objects in codfw, I see configmaps(?!) roughly double from 6k to 12k between 2 and 6?
- 2024-10-09 06:28:16 jayme jelto: did you see calico-node pods being killed?
- 2024-10-09 06:28:53 jayme swfrench-wmf: yeah...I was looking at the same thing
- 2024-10-09 06:29:07 jayme might be jobs effie mentioned?
- 2024-10-09 06:29:26 jelto just crashlooping, like `calico-node-s89nl 0/1 CrashLoopBackOff 20 (3m46s ago) 133d` and a lot Running but not ready
- 2024-10-09 06:29:39 akosiaris there is nothing in SAL fwiw around that timeframe
- 2024-10-09 06:29:53 swfrench-wmf ah, yeah ... perfectly correlates with jobs.batch objects ... and their associated network policies!
- 2024-10-09 06:30:23 akosiaris ?
- 2024-10-09 06:30:32 * akosiaris trying to put 2+2 together
- 2024-10-09 06:30:36 swfrench-wmf https://grafana.wikimedia.org/goto/4N6Qs_zHR?orgId=1
- 2024-10-09 06:30:55 akosiaris kubectl -n mw-script get jobs |wc -l
- 2024-10-09 06:30:55 akosiaris 1791
- 2024-10-09 06:30:59 akosiaris hoooooooolyyyy
- 2024-10-09 06:31:03 jayme there we go
- 2024-10-09 06:31:04 swfrench-wmf heh, precisely
- 2024-10-09 06:31:27 _joe_ yeah
- 2024-10-09 06:31:28 swfrench-wmf so, we have a pretty wild amount of objects now
- 2024-10-09 06:31:35 _joe_ ok so
- 2024-10-09 06:31:40 jayme the increased apiserver GET rate is mostly for node and endpoint object - that is probably just inflicted by calicos
- 2024-10-09 06:31:44 akosiaris durations less than 8h for all
- 2024-10-09 06:31:55 _joe_ we might need to make mw networkpolicy names not depend on the deployment?
- 2024-10-09 06:32:11 akosiaris for maint scripts, yeah
- 2024-10-09 06:32:15 akosiaris it doesn't make sense
- 2024-10-09 06:32:18 _joe_ so that for that many jobs we still have just one networkpolicy?
- 2024-10-09 06:32:31 akosiaris creating a netpol for every script invocation isn't worth it
- 2024-10-09 06:42:14 RhinosF1 There was a thread on wikitech-l of a maint script being started on every wiki question_mark
- 2024-10-09 06:42:27 RhinosF1 That was 00:37 utc + 1
- 2024-10-09 06:47:28 akosiaris usr/bin/python3 /usr/local/bin/mwscript-k8s -f --comment T363966 -- extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=mrwiki --throttle --video --key=144p.mjpeg.mov --missing
- 2024-10-09 06:47:28 stashbot T363966: Videos still unplayable on Safari in iOS 11 and 12 - https://phabricator.wikimedia.org/T363966
- 2024-10-09 06:47:29 akosiaris there we go
- 2024-10-09 06:56:44 Incident closed
- 2024-10-09 07:51:00 Jynus fixes dates on the doc, were on UTC+1 TZ before that (unimportant for the incident, but FYI for coordination)
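For future responders, the immediate mitigation visible in the timeline between 05:53 and 06:09 was twofold: bump the calico/typha resource limits in deployment-charts (gerrit change 1078791) and re-apply the calico admin release in codfw, which also redeployed all calico pods at once (as noted at 06:27, the redeploy may have mattered as much as the limits themselves). A rough sketch of that sequence follows, assuming it is run from the admin helmfile directory on the deployment host and that the usual upstream Calico label selectors apply; this is not a verified runbook:

```
# Inspect the Calico control plane and list per-node agents that are not
# fully ready (label selectors follow upstream conventions; verify locally).
kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers
kubectl -n kube-system get pods -l k8s-app=calico-node --no-headers | grep -v ' 1/1 '

# After merging the resource bump in deployment-charts, apply the calico
# release for codfw (the command run at 06:05; -i prompts before applying,
# --context 5 adds context lines to the diff).
helmfile -e codfw --selector name=calico apply -i --context 5
```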
Detection
The issue was detected by automated monitoring and responded to by the oncall SRE.
05:01:36 <jinxer-wm> FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page
05:03:32 <jinxer-wm> FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod
05:04:15 <jinxer-wm> FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally
Conclusions
We came close to a full outage of the primary datacenter (codfw).
What went well?
- Automated monitoring detected the outage
- The outage was root-caused, thanks to the many pairs of fresh eyes
- The incident was handed off cleanly from Americas to EMEA SREs
What went poorly?
- End users were impacted
- We were unsure whether we could switch over to eqiad to react to the emergency
Where did we get lucky?
- The issue was noticed just before the point where the Americas SRE shift ends and before the EMEA SRE shift takes over; we have a long-standing, known coverage gap during that period.
- We were lucky the transcode requeue jobs were not submitted on a Friday US evening.
Links to relevant documentation
Actionables
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | yes | |
| | Were fewer than five people paged? | yes | |
| | Were pages routed to the correct sub-team(s)? | yes | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | yes | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes | |
| | Was a public wikimediastatus.net entry created? | yes | |
| | Is there a phabricator task for the incident? | yes | |
| | Are the documented action items assigned? | no | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | no | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | - | unsure |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | yes | |
| | Were the engineering tools that were to be used during the incident available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all “yes” answers above) | 11 | |