Incidents/2024-10-09 calico-codfw

document status: in-review

Summary

Incident metadata (see Incident Scorecard)
  • Incident ID: 2024-10-09 calico-codfw
  • Start: 2024-10-09 05:01:36
  • End: 2024-10-09 06:10:00
  • Task: T376795
  • People paged: 2
  • Responder count: 6
  • Coordinators: Mark Bergsma
  • Affected metrics/SLOs: No relevant SLOs exist, but MediaWiki latencies and error rates flapped during the period
  • Impact: For approximately one hour after 05:00 UTC, some uncached requests (such as those from logged-in users, edits, and API calls) saw elevated latencies and/or errors due to an overload on MediaWiki nodes at the primary datacenter.

An operator used the new mwscript-k8s utility to run a batch maintenance of video transcodes via Timed Media Handler's requeueTranscodes.php. This worked fine for quite a few hours, but it was causing a silent, linearly growing regression: mwscript-k8s deploys an entire Helm release for every script invocation. This is by design, but it had an unforeseen consequence: close to 2,000 Calico NetworkPolicies were created (along with a large number of other resources, which caused problems of their own). Every Calico component in WikiKube had to react to the steadily growing number of policies, which eventually pushed some of those components past their resource limits, leading to throttling, then failures, and finally a slowly cascading outage that was taking hardware nodes out of rotation one by one.
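
As a rough illustration of the scale involved, the per-invocation objects can be counted directly in the mw-script namespace. The first command below is the one responders ran during the incident (it returned 1791 at the time); the second is a sketch that assumes the standard upstream Calico CRD resource name for namespaced network policies and the same namespace.

  # Count the Job objects left behind by mwscript-k8s invocations
  # (run by responders during the incident; returned 1791 at the time).
  kubectl -n mw-script get jobs --no-headers | wc -l

  # Count the per-invocation Calico NetworkPolicies; the resource name below
  # is the upstream Calico CRD and is assumed to match our clusters.
  kubectl -n mw-script get networkpolicies.crd.projectcalico.org --no-headers | wc -l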

Timeline

All times in UTC.

  • 05:01:36 START OF USER IMPACT <jinxer-wm> FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh
  • 05:03:32 <jinxer-wm> FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
  • 05:04:15 <jinxer-wm> FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
  • 05:14:10  swfrench-wmf | alright, something isn't quite right in k8s in codfw, which I suspect may be related to the steadily increasing numbers of KubernetesCalicoDown alerts, which I'm starting to look at now
  • 05:14:21  swfrench-wmf | if other folks are around to lend a hand, a that would be greatly appreciated
  • 05:22:59  swfrench-wmf | kubectl -n kube-system get po -l k8s-app=calico-kube-controllers
  • 2024-10-09 05:22:59 swfrench-wmf NAME                                   READY   STATUS         RESTARTS     AGE
  • 2024-10-09 05:23:00 swfrench-wmf calico-kube-controllers-57c75d4867-cgzpc   0/1 CrashLoopBackOff   21 (4m19s ago)   21d
  • 2024-10-09 05:24:58 swfrench-wmf 2024-10-09 05:17:48.105 [ERROR][1] client.go 272: Error getting cluster information config ClusterInformation="default" error=Get "https://10.192.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
  • 2024-10-09 05:24:58 swfrench-wmf 2024-10-09 05:17:48.105 [FATAL][1] main.go 124: Failed to initialize Calico datastore error=Get "https://10.192.72.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
  • 2024-10-09 05:34:33 swfrench-wmf summarizing what I've seen so far:
  • 2024-10-09 05:34:33 swfrench-wmf 1. typha appears to be fine, or at least available
  • 2024-10-09 05:34:33 swfrench-wmf 2. k8s API calls made by the calico controller appear to be timing out
  • 2024-10-09 05:34:33 swfrench-wmf 3. we appear to be bleeding calico node pods as a result of #2?
  • 2024-10-09 05:34:57 _joe_   So etc issues?
  • 2024-10-09 05:35:05 _joe_   etcd
  • 2024-10-09 05:35:48 _joe_   scott please phone Alex and Janis
  • 2024-10-09 05:36:26 swfrench-wmf _joe_: ack, will do
  • 2024-10-09 05:45:28 akosiaris   I see liveness probes of calico-node pods failing
  • 2024-10-09 05:46:54 akosiaris   looking at per host stats, there is some increase in CPU usage but I don't yet see any CPU pegged boxes
  • 2024-10-09 05:47:50 akosiaris   so the CPU usage increase is probably due to the restart of pods
  • 2024-10-09 05:47:54 akosiaris   do we have an incident doc?
  • 2024-10-09 05:47:56 jelto   I just wrote Jayme but not sure if he is awake already
  • 2024-10-09 05:48:15 swfrench-wmf akosiaris: no, I but I can start one right now
  • 2024-10-09 05:48:47 akosiaris   looking into calico pods logs, nothing eye catching yet
  • 05:50 Incident opened. Mark becomes IC.
  • 2024-10-09 05:50:54 swfrench-wmf just dropped the doc link in _security
  • 2024-10-09 05:53:42 jelto   The calico/typha pods are hitting resource limits, https://grafana.wikimedia.org/d/2AfU0X_Mz/jayme-calico-resources?orgId=1&from=now-3h&to=now
  • 2024-10-09 05:53:42 jelto   Can we increase the limits maybe?
  • 2024-10-09 05:54:34 effie   it would proly get us as far
  • 2024-10-09 05:54:40 akosiaris   jelto: go ahead
  • 2024-10-09 05:55:12 effie   I will take a look what happened between 3-4 UTC that may have triggered this
  • 2024-10-09 05:56:39 swfrench-wmf there's clearly a sizable upward trend in get / list ops on the k8s API starting some time between 3 and 4 UTC, yeah - hard to tell if that's a result of things failing and restarting, though (https://grafana.wikimedia.org/goto/ZZZQY_kHg?orgId=1)
  • 2024-10-09 05:57:00 question_mark   effie: I think it started earlier, at least 8 hours ago?
  • 2024-10-09 05:57:42 akosiaris   On the calico-node level, there is nothing weird in logs. The weirdest thing I found was
  • 2024-10-09 05:57:50 akosiaris   bird: I/O loop cycle took 7808 ms for 6 events
  • 2024-10-09 05:59:25 effie   question_mark: so far betwwn 3-4 seems like a good starting point
  • 2024-10-09 05:59:32 jelto   resource bump like that? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1078791
  • 2024-10-09 05:59:53 akosiaris   jelto: +1ed, let's try it out to see if it stops the bleeding
  • 2024-10-09 06:00:00 akosiaris   want to deploy it? or should I?
  • 2024-10-09 06:00:22 question_mark   effie: I'm guessing the KubernetesAPILatency alerts earlier were then unrelated?
  • 2024-10-09 06:00:25 jelto   I can deploy when I got a +2 from jenkins
  • 2024-10-09 06:00:34 akosiaris   jelto: +2 yourself and proceed
  • 2024-10-09 06:00:37 jelto   ok
  • 2024-10-09 06:00:38 akosiaris   forget about CI right now
  • 2024-10-09 06:02:09 jelto   submitted, I'll apply once it's on deploy2002 and the diff looks as expected
  • 2024-10-09 06:02:22 akosiaris   jelto: the 1.5Gi thing, might backfire https://github.com/kubernetes/kubectl/issues/1426
  • 2024-10-09 06:02:28 effie   question_mark: looking
  • 2024-10-09 06:02:38 akosiaris   if it does, just manually switch to G on deploy2002 and proceed
  • 2024-10-09 06:04:01 akosiaris   first time I see some warnings in calico-node logs: 2024-10-09 06:03:00.473 [WARNING][144778] felix/health.go 211: Reporter is not ready. name="int_dataplane"
  • 2024-10-09 06:04:29 akosiaris   we are 25 nodes with fully functional calico-node pods
  • 2024-10-09 06:04:44 _joe_   akosiaris: I think it's a positive feeback
  • 2024-10-09 06:04:49 akosiaris   unless the patch that jelto has works, we might want to switchback to eqiad
  • 2024-10-09 06:05:11 _joe_   we should move traffic for everything but mw-rw to eqiad right now and maybe destroy all deployments but the mw ones
  • 2024-10-09 06:05:17 _joe_   so that the system can try to recover
  • 2024-10-09 06:05:26 jayme   o/
  • 2024-10-09 06:05:30 _joe_   akosiaris: only doubt about switching back is - do DBAs need to do anything?
  • 2024-10-09 06:05:39 jelto   diff looks ok, running helmfile -e codfw --selector name=calico apply -i --context 5
  • 2024-10-09 06:05:55 swfrench-wmf I can get started moving -ro services to eqiad
  • 2024-10-09 06:06:01 swfrench-wmf we're still provisioned for that
  • 2024-10-09 06:06:23 akosiaris   _joe_: hmmm, the circular replication thing?
  • 2024-10-09 06:06:31 _joe_   yes
  • 2024-10-09 06:06:37 _joe_   so we need to call someone in that team too
  • 2024-10-09 06:06:52 akosiaris   swfrench-wmf: yeah, that will allow us to serve at least some uses
  • 2024-10-09 06:06:55 _joe_   anyways, I'm still soaked in sweat and panting, don't trust a word I say rn :P
  • 2024-10-09 06:07:22 akosiaris   212 calico-nodes ready
  • 2024-10-09 06:07:24 akosiaris   what?
  • 2024-10-09 06:07:27 akosiaris   are we recovering?
  • 2024-10-09 06:07:28 _joe_   lol
  • 2024-10-09 06:07:30 jelto   apply still running,
  • 2024-10-09 06:07:32 _joe_   I guess that was jelto
  • 2024-10-09 06:07:36 swfrench-wmf lol
  • 2024-10-09 06:07:39 jelto   yes helmfile still doing it work
  • 2024-10-09 06:07:39 swfrench-wmf holding
  • 2024-10-09 06:07:48 jelto   but a lot of fresh new healthy pods
  • 2024-10-09 06:09:06 jelto   jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
  • 2024-10-09 06:09:22 _joe_   I still see a lot of app-level issues
  • 2024-10-09 06:09:26 question_mark   shall I call someone in d/p for the question of whether we can switch to eqiad right now?
  • 2024-10-09 06:09:30 _joe_   but calico has recovered?
  • 2024-10-09 06:09:42 akosiaris   I think we are going to bet getting out of the woods soon
  • 2024-10-09 06:10:00 Calico has recovered END OF USER IMPACT
  • 2024-10-09 06:12:56 jayme   other things talking to the apiserver seem to behave strange as well ... like helm-state-metrics in crashloop
  • 2024-10-09 06:13:24 swfrench-wmf precipitous drop in k8s API calls and etcd ops ... yeah, that has the smell of resource saturation -> feedback loop
  • 2024-10-09 06:13:50 jayme   and holes in metrics, like https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s&var-namespace=kube-system&var-pod=calico-kube-controllers-57c75d4867-cgzpc&var-container=All
  • 2024-10-09 06:14:01 _joe_   swfrench-wmf: yeah as I guessed, it was probably calico down -> reallocate pods -> more strain on calico
  • 2024-10-09 06:14:46 jelto   I linked the previous incident "2024-04-03 calico/typha down" in the new doc as well which was quite similar i think
  • 2024-10-09 06:17:13 akosiaris   funnily enough, it takes a while to realize that the effects on MediaWiki are kinda hidden in https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&from=now-1h&to=now
  • 2024-10-09 06:17:30 akosiaris   the best sign is 200s dropping by 25%, from 4k to 3k for ~5 minutes
  • 2024-10-09 06:18:20 akosiaris   latencies increased too, but for a short period of time
  • 2024-10-09 06:18:36 akosiaris   this is weird btw, cause a death of calico-node doesn't reallocate pods
  • 2024-10-09 06:20:09 effie   so far the only thing I have found standing out was mw-script jobs starting
  • 2024-10-09 06:20:32 _joe_   effie: I doubt that could be the cause
  • 2024-10-09 06:20:37 effie   but unsure yet if it is related
  • 2024-10-09 06:20:39 _joe_   unless it was like 900 jobs at the same time
  • 2024-10-09 06:21:12 _joe_   then maybe we need to add some poolcounter support to mwscript-k8s to limit the number of jobs that can be started :)
  • 2024-10-09 06:23:40 jayme   there is this sharp increase in rx traffic on nodes running calico typha
  • 2024-10-09 06:24:01 jayme   ~7MB/s
  • 2024-10-09 06:24:10 jayme   starting around 4:00
  • 2024-10-09 06:24:59 jayme   which matches the increased GET rate...okay
  • 2024-10-09 06:27:27 jelto   I'm not sure if the resource increase was what really resolved the issue. It was probably more related to the redeploy of all calico pods at the same time. But that's just a hypothesis :)
  • 2024-10-09 06:28:11 swfrench-wmf jayme: looking at etcd objects in codfw, I see configmaps(?!) roughly double from 6k to 12k between 2 and 6?
  • 2024-10-09 06:28:16 jayme   jelto: did you see calico-node pods being killed?
  • 2024-10-09 06:28:53 jayme   swfrench-wmf: yeah...I was looking at the same thing
  • 2024-10-09 06:29:07 jayme   might be jobs effie mentioned?
  • 2024-10-09 06:29:26 jelto   just crashlooping, like `calico-node-s89nl                      0/1 CrashLoopBackOff   20 (3m46s ago)   133d` and a lot Running but not ready
  • 2024-10-09 06:29:39 akosiaris   there is nothing in SAL fwiw around that timeframe
  • 2024-10-09 06:29:53 swfrench-wmf ah, yeah ... perfectly correlates with jobs.batch objects ... and their associated network policies!
  • 2024-10-09 06:30:23 akosiaris   ?
  • 2024-10-09 06:30:32  *  akosiaris trying to put 2+2 together
  • 2024-10-09 06:30:36 swfrench-wmf https://grafana.wikimedia.org/goto/4N6Qs_zHR?orgId=1
  • 2024-10-09 06:30:55 akosiaris   kubectl -n mw-script get jobs |wc -l
  • 2024-10-09 06:30:55 akosiaris   1791
  • 2024-10-09 06:30:59 akosiaris   hoooooooolyyyy
  • 2024-10-09 06:31:03 jayme   there we go
  • 2024-10-09 06:31:04 swfrench-wmf heh, precisely
  • 2024-10-09 06:31:27 _joe_   yeah
  • 2024-10-09 06:31:28 swfrench-wmf so, we have a pretty wild amount of objects now
  • 2024-10-09 06:31:35 _joe_   ok so
  • 2024-10-09 06:31:40 jayme   the increased apiserver GET rate is mostly for node and endpoint object - that is probably just inflicted by calicos
  • 2024-10-09 06:31:44 akosiaris   durations less than 8h for all
  • 2024-10-09 06:31:55 _joe_   we might need to make mw networkpolicy names not depend on the deployment?
  • 2024-10-09 06:32:11 akosiaris   for maint scripts, yeah
  • 2024-10-09 06:32:15 akosiaris   it doesn't make sense
  • 2024-10-09 06:32:18 _joe_   so that for that many jobs we still have just one networkpolicy?
  • 2024-10-09 06:32:31 akosiaris   creating a netpol for every script invocation isn't worth it
  • 2024-10-09 06:42:14 RhinosF1    There was a thread on wikitech-l of a maint script being started on every wiki question_mark
  • 2024-10-09 06:42:27 RhinosF1    That was 00:37 utc + 1
  • 2024-10-09 06:47:28 akosiaris   usr/bin/python3 /usr/local/bin/mwscript-k8s -f --comment T363966 -- extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --wiki=mrwiki --throttle --video --key=144p.mjpeg.mov --missing
  • 2024-10-09 06:47:28 stashbot    T363966: Videos still unplayable on Safari in iOS 11 and 12 - https://phabricator.wikimedia.org/T363966
  • 2024-10-09 06:47:29 akosiaris   there we go
  • 2024-10-09 06:56:44 Incident closed
  • 2024-10-09 07:51:00 Jynus fixed the dates in the doc; they were in the UTC+1 timezone before that (unimportant for the incident, but FYI for coordination)
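
For reference, the mitigation attempted above (the calico/typha resource-limit bump from Gerrit change 1078791, which also redeployed all calico pods) follows the usual admin helmfile workflow on the deployment host. A minimal sketch, assuming the helmfile.d/admin directory seen in the apply output; the actual limit values live in the Gerrit change and are not repeated here:

  # On the deployment host (deploy2002 at the time), from the admin helmfile
  # directory (inferred from the "helmfile.d/admin" apply output above; the
  # exact path may differ).

  # Review what the change would do to the codfw calico release.
  helmfile -e codfw --selector name=calico diff

  # Apply it interactively with 5 lines of diff context (the command run
  # during the incident).
  helmfile -e codfw --selector name=calico apply -i --context 5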

Detection

The issue was detected by automated monitoring and responded to by the on-call SRE.

05:01:36 <jinxer-wm> FIRING: GatewayBackendErrorsHigh: rest-gateway: elevated 5xx errors from page-analytics_cluster in codfw #page -

05:03:32 <jinxer-wm> FIRING: [16x] KubernetesCalicoDown: kubernetes2037.codfw.wmnet is not running calico-node Pod

05:04:15 <jinxer-wm> FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally
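
Beyond the IRC notifications above, the firing alerts can also be inspected directly. A minimal sketch of querying the Alertmanager v2 API for one of the alert names; the host and port are assumptions, and https://alerts.wikimedia.org remains the usual web entry point:

  # Hypothetical Alertmanager host/port: list currently-firing
  # KubernetesCalicoDown alerts and print the affected instances.
  curl -s 'http://alertmanager.example.wmnet:9093/api/v2/alerts?filter=alertname%3D"KubernetesCalicoDown"' \
    | jq -r '.[].labels | "\(.alertname) \(.instance)"'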

Conclusions

We almost had a full outage

What went well?

  • Automated monitoring detected the outage
  • The outage was root-caused, thanks to the many pairs of fresh eyes
  • The incident was handed off smoothly from Americas to EMEA SREs

What went poorly?

  • Impact to end users
  • We were unsure whether we could switch over to eqiad to react to the emergency

Where did we get lucky?

  • The issue was noticed just before the point where Americas SRE shifts end and before EMEA SRE shifts take over. We have a long-standing, known lack of coverage during that period.
  • We were lucky the transcode requeue batch was not submitted on a Friday US evening.

Actionables

Scorecard

Incident Engagement ScoreCard
Question | Answer (yes/no) | Notes

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes
  • Were pages routed to the correct sub-team(s)? yes
  • Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. yes

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? yes
  • Was a public wikimediastatus.net entry created? yes
  • Is there a phabricator task for the incident? yes
  • Are the documented action items assigned? no
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? no

Tooling
  • To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. unsure
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? yes
  • Were the engineering tools that were to be used during the incident available and in service? yes
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all "yes" answers above): 11