Incidents/2024-11-26 WMCS network problems
document status: draft
Summary
| Incident ID | 2024-11-26 WMCS network problems | Start | 2024-11-26 02:21:00 |
|---|---|---|---|
| Task | T380882 | End | 2024-11-26 11:08:00 |
| People paged | 0 | Responder count | 4 (Andrew, David, Arturo, Slavina) |
| Coordinators | Slavina Stefanova | Affected metrics/SLOs | No relevant SLOs exist |
Impact: there were two main ways things were impacted:

- (Suspected root cause) A restart of the OpenStack virtual network service created a network interruption for all VMs, causing Toolforge NFS to fail and Toolforge Kubernetes to fail internal DNS resolution.
- On the NFS side:
  - Users were unable to log in to login.toolforge.org
- On the DNS side:
  - Users were seeing many systemic errors and specific DNS resolution errors in their workloads
  - The k8s cluster components started failing due to DNS resolution problems, causing many different errors around the cluster
  - This included the MediaWiki train deployment being blocked by the incident (the exact dependency is not yet clear)
Timeline
All times in UTC.
- 01:59 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1092425 is deployed, containing a change to the OpenStack OVS configuration. At most 30 minutes later, Puppet triggers a restart of the OVS daemons that implement the virtual network.
- 02:21 OUTAGE STARTS: first reports of Toolforge jobs failing (gitlab-account-approval job failure email to Bryan Davis)
- 02:26 in #wikimedia-cloud-feed
<wmcs-alerts> FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
- 02:30 in #wikimedia-cloud-feed
<wmcs-alerts> FIRING: MaintainKubeusersDown: maintain-kubeusers is down
- 02:31 in #wikimedia-cloud-feed
<wmcs-alerts> RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
- 03:35 in #wikimedia-cloud-feed
<wmcs-alerts> FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools
- 03:58 in #wikimedia-cloud
< anomie> Grr. login-buster.toolforge.org seems broken, and I still don't have a good way to restart my bot without it.
- 04:08 in #wikimedia-cloud
< anomie> Seems other stuff is broken too. "ERROR: TjfCliError: The jobs service seems to be down – please retry in a few minutes."
- 04:14 in #wikimedia-cloud
< anomie> Seems like networking is borked. "ERROR: TjfCliError: Unknown error (HTTPSConnectionPool(host='k8s.tools.eqiad1.wikimedia.cloud', port=6443): Max retries exceeded with url: /apis/batch/v1/namespaces/tool-anomiebot/jobs?labelSelector=toolforge%3Dtool%2Capp.kubernetes.io%2Fmanaged-by%3Dtoolforge-jobs-framework%2Capp.kubernetes.io%2Fcreated-by%3Danomiebot%2Capp.kubernetes.io%2Fcomponent%3Djobs%2Capp.kubernetes.io%2Fname%3Danomiebot-4 (Caused by
- 04:14 in #wikimedia-cloud
< anomie> NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f370f1ffc10>: Failed to resolve 'k8s.tools.eqiad1.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")))" and other errors.
- 04:53 in #wikimedia-cloud
<andrewbogott> anomie: I'm looking, not sure about the dns thing because I was thinking this was an nfs issue
- 04:56 in #wikimedia-cloud
<andrewbogott> !log admin 'systemctl restart nfs-server' on tools-nfs-2.tools.eqiad1.wikimedia.cloud
- 04:58 in #wikimedia-cloud-feed
<wmcs-alerts> FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
- 05:03 in #wikimedia-cloud-feed
<wmcs-alerts> RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
- 05:19 Andrew considers initial NFS outage as resolved (phab:T380827)
- 05:45 in #wikimedia-cloud-feed
<wmcs-alerts> RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools
- 05:47 in #wikimedia-cloud-feed
<wmcs-alerts> FIRING: InstanceDown: Project tools instance tools-sgebastion-10 is down
- 05:52 in #wikimedia-cloud-feed
<wmcs-alerts> RESOLVED: InstanceDown: Project tools instance tools-sgebastion-10 is down
- 06:15 in #wikimedia-cloud-feed
<wmcs-alerts> RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down
- 06:15 in #wikimedia-cloud-feed
<wmcs-alerts> FIRING: MaintainKubeusersDown: maintain-kubeusers is down
- 06:45 in #wikimedia-cloud-feed
<logmsgbot_cloud> !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers (T380827)
- 07:15 Jobs API failing after worker node reboots
- 07:19 Observed pods crashing with DNS resolution errors
- 07:24 Created T380832 for jobs-api crashes
- 07:46 David joins incident response
- 08:05 Identified issues at the api-gateway level
- 09:00 Arturo joins incident response
- 09:22 INCIDENT DECLARED - Slavina becomes IC
- 09:26 Started investigating possible relation to nameserver puppet issue
- 09:28 Arturo scanning pdns recursor logs
- 09:31 Discovered DNS failures are intermittent within the same pod
- 09:34 Increased CPU request for CoreDNS pods
- 09:37 CoreDNS CPU usage spiked, increased allocation to 500m
- 09:42 Verified direct queries to external DNS servers work without issues
- 09:43 Scaled CoreDNS replicas to 8 (see the command sketch after this timeline)
- 09:47 Analysis shows cluster has ~4000 services+pods
- 09:49 Memory usage analysis: current 170M vs expected <100M based on cluster size
- 09:55 Discovered calico-node pods may be recreating iptables ruleset
- 09:58 Found concerning calico-node warnings about NAT table rule inconsistencies
- 10:03 Connectivity issues between coredns pods observed
- 10:04 Confirmed calico is flushing/recreating the ruleset
- 10:25 After control plane reboot, direct DNS lookups working to all nodes
- 10:29 Control-7 not yet rebooted, still showing ruleset recreation
- 10:31 Determined ruleset refreshes mainly from kube-proxy
- 10:47 Continuous calico warnings observed
- 10:48 Running functional tests in a loop to verify stability
- 10:53 Confirmed all DNS errors in crashed pods predate control plane reboot
- 11:04 Confirmed ruleset recreation loop is by kube-proxy, not calico
- 11:08 INCIDENT RESOLVED, OUTAGE ENDS
- 13:10 main root-cause theory presented: https://phabricator.wikimedia.org/T380827#10357462
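The CoreDNS mitigations in the 09:34–09:43 window (raising the CPU request, then adding replicas) can be expressed roughly as the commands below. This is a hedged sketch, not a record of the exact commands run: the deployment name and namespace (coredns in kube-system) and the test image are assumptions about a typical cluster layout.

```bash
# Assumed names: deployment "coredns" in namespace "kube-system"; adjust for the real cluster.

# Raise the CPU request to 500m (the value mentioned at 09:37).
kubectl -n kube-system set resources deployment coredns --requests=cpu=500m

# Scale out to 8 replicas (the step at 09:43).
kubectl -n kube-system scale deployment coredns --replicas=8

# Spot-check in-cluster resolution from a throwaway pod.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup k8s.tools.eqiad1.wikimedia.cloud
```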
Detection
- The initial detection and reporting that something was wrong came from community members on IRC (users unable to log in to the bastions due to NFS).
- Further Toolforge problems were detected while Andrew and Slavina were investigating the cause of the jobs-api crashing after the reboot of several k8s-nfs-worker nodes. The jobs-api logs revealed DNS resolution failures, which were also causing user workloads to fail in many different ways.
- In any case, the root cause of the problem was never clearly detected at any point during the outage.
Exploration
Three main effects to explore:
Toolforge k8s DNS issues
Many different errors were happening around the cluster:
- Pods crashing due to DNS resolution timing out
- Errors due to the k8s API server failing to resolve webhooks (e.g. Kyverno)
- Errors due to the k8s API server failing to contact etcd (DNS resolution)
- Errors during image builds due to being unable to clone from GitHub (DNS error)
- These were caused by CoreDNS requests failing intermittently (a reproduction sketch follows this list)
- This in turn was caused by some of the CoreDNS pods being unable to communicate with some of the cluster pods (network partition)
- It is not clear what caused the network partition between k8s nodes
- Restarting the control and worker nodes restored the network (so there was some bad state)
- The timing matches the OpenStack network agent restart
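One way to verify this kind of intermittent failure, similar in spirit to the functional-test loop run during the incident, is sketched below. This is not the actual test suite; the image and the hostnames queried are illustrative assumptions.

```bash
# Resolve an in-cluster name and an external name repeatedly from short-lived pods.
# Intermittent CoreDNS or pod-network failures show up as sporadic failed iterations.
for i in $(seq 1 50); do
  kubectl run "dns-probe-$i" --rm -i --restart=Never --image=busybox:1.36 -- \
    sh -c 'nslookup kubernetes.default.svc.cluster.local && nslookup wikipedia.org' \
    || echo "iteration $i FAILED"
  sleep 2
done
```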
OpenStack recursor DNS issues
Toolforge NFS issues
Users were unable to log in to login.toolforge.org as SSH sessions were hanging (a quick check for this is sketched after the list below).
- This was caused by the NFS clients on the bastions failing to connect to the NFS service
- (unconfirmed) This was caused by the NFS service having had a network outage
- This was caused by the OpenStack Neutron agents restarting
- This was caused by a Puppet change to the Neutron agent configuration restarting the agent
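For the bastion symptom, a minimal check that distinguishes a hung NFS mount from a broken SSH service is sketched below; the mount point path is a placeholder assumption, not the real Toolforge path.

```bash
# With a hard-mounted NFS share, I/O on the mount blocks while the server is unreachable.
# A bounded stat makes the hang visible without wedging the shell.
MOUNTPOINT=/mnt/nfs/tools-home   # placeholder; substitute the actual Toolforge mount point
if timeout 5 stat "$MOUNTPOINT" > /dev/null 2>&1; then
  echo "NFS mount responds"
else
  echo "NFS mount appears hung or unreachable"
fi

# The mount options (hard/soft, timeo, retrans) control how clients behave during an outage:
grep nfs /proc/mounts
```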
Conclusions
The primary theory of the problem is:
- A Puppet change to the OpenStack neutron-openvswitch-agent configuration was merged, which caused Puppet to restart the Neutron agents.
- This resulted in virtual machines temporarily losing connectivity with each other.
This resulted in a number of effects:
- (unconfirmed) an increase in DNS recursive request failures
- the Toolforge NFS server being briefly unavailable
  - because of how we configure Toolforge NFS clients, they will regard the NFS server as down and refuse further operations, to prevent data corruption
- (unconfirmed) the Toolforge Kubernetes pod network also had connectivity problems, and some pods (most notably CoreDNS) could not work properly
If that is the case, we were unaware of how impactful a restart of the neutron-openvswitch-agent and OVS could be.
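A minimal way to check this theory on the affected hosts is to compare when the agents actually restarted with the Puppet run at roughly 02:00 UTC. This is a sketch under assumptions: the unit names (puppet, neutron-openvswitch-agent, ovs-vswitchd, ovsdb-server) may differ in the actual deployment.

```bash
# When did the relevant services last (re)enter the active state?
for unit in neutron-openvswitch-agent ovs-vswitchd ovsdb-server; do
  systemctl show -p ActiveEnterTimestamp "$unit"
done

# Cross-check the journal around the Puppet run that deployed the OVS config change.
journalctl -u puppet -u neutron-openvswitch-agent \
  --since "2024-11-26 01:50" --until "2024-11-26 02:40"
```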
What went well?
- WMCS engineers were able to track this incident in pretty much a "follow the sun" fashion with regard to timezone coverage.
- Enough WMCS engineers were available to work on the incident.
What went poorly?
- The root cause was never on the radar until very late in the debugging process, when the incident had already been mitigated.
- There were no paging alerts.
- There were challenges in operating the jobs-api, related to admin documentation and lack of experience with established workflows.
- There was no monitoring in place to check the specific bits that were failing (network cross checks or the like)
Where did we get lucky?
- A reboot of the Toolforge Kubernetes control plane nodes got Toolforge back into a reliable state, even though we do not yet know the root cause of the problem.
Links to relevant documentation
None. See action items.
Actionables
- https://phabricator.wikimedia.org/T380886 openstack: increase virtual network observability
- https://phabricator.wikimedia.org/T380892 toolforge: introduce additional observability for calico
- https://phabricator.wikimedia.org/T380959 toolforge: create docs on how to operate the cluster and core components
- https://phabricator.wikimedia.org/T380972 openstack: prevent puppet from restarting neutron-openvswitch-agent
- https://phabricator.wikimedia.org/T380980 openstack: introduce additional DNS monitoring and alerting
Scorecard
|  | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | no | the team has reduced headcount |
|  | Were the people who responded prepared enough to respond effectively? | yes | for the most part yes, barring the action item about k8s docs |
|  | Were fewer than five people paged? | yes | there were no paging alerts |
|  | Were pages routed to the correct sub-team(s)? | no | there were no paging alerts |
|  | Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. | no | there were no paging alerts |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes |  |
|  | Was a public wikimediastatus.net entry created? | no | N/A |
|  | Is there a phabricator task for the incident? | yes |  |
|  | Are the documented action items assigned? | yes |  |
|  | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes |  |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes |  |
|  | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |  |
|  | Did existing monitoring notify the initial responders? | no | at least not the root cause |
|  | Were the engineering tools that were to be used during the incident available and in service? | yes |  |
|  | Were the steps taken to mitigate guided by an existing runbook? | no |  |
| Total score (count of all "yes" answers above) |  | 9 |  |