
Incidents/2024-11-26 WMCS network problems


document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2024-11-26 WMCS network problems
Start: 2024-11-26 02:21:00
End: 2024-11-26 11:08:00
Task: T380882
People paged: 0
Responder count: 4 (Andrew, David, Arturo, Slavina)
Coordinators: Slavina Stefanova
Affected metrics/SLOs: No relevant SLOs exist
Impact: There were two main ways things were impacted:
  • Users were unable to log in to the bastions
  • All Toolforge users were affected by intermittent DNS resolution failures, causing some user workloads to crash, possibly including CI jobs. The outage also affected core k8s components as well as internal toolforge k8s components, leading to widespread malfunction of the cluster as a whole.
  • (Suspected root cause) A restart of the OpenStack virtual network service created a network interruption for all VMs, causing Toolforge NFS to fail and Toolforge k8s to fail internal DNS resolution.
  • On the NFS side:
    • Users were unable to log into login.toolforge.org
  • On the DNS side:
    • Users were seeing many systemic errors and specific DNS resolution errors in their workloads
    • The k8s cluster components started failing due to DNS resolution problems, causing many different errors around the cluster
    • This included the MediaWiki train deployment getting blocked by the incident (the exact dependency is not yet clear)

Timeline

All times in UTC.

  • 01:59 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1092425 is deployed, which contains a change to the openstack OVS configuration. At most 30 minutes later, puppet will trigger a restart of the OVS daemons that implement the virtual network.
  • 02:21 OUTAGE STARTS first reports of toolforge jobs failing (gitlab-account-approval job failure email to Bryan Davis)
  • 02:26 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
  • 02:30 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: MaintainKubeusersDown: maintain-kubeusers is down
  • 02:31 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
  • 03:35 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools
  • 03:58 in #wikimedia-cloud < anomie> Grr. login-buster.toolforge.org seems broken, and I still don't have a good way to restart my bot without it.
  • 04:08 in #wikimedia-cloud < anomie> Seems other stuff is broken too. "ERROR: TjfCliError: The jobs service seems to be down – please retry in a few minutes."
  • 04:14 in #wikimedia-cloud < anomie> Seems like networking is borked. "ERROR: TjfCliError: Unknown error (HTTPSConnectionPool(host='k8s.tools.eqiad1.wikimedia.cloud', port=6443): Max retries exceeded with url: /apis/batch/v1/namespaces/tool-anomiebot/jobs?labelSelector=toolforge%3Dtool%2Capp.kubernetes.io%2Fmanaged-by%3Dtoolforge-jobs-framework%2Capp.kubernetes.io%2Fcreated-by%3Danomiebot%2Capp.kubernetes.io%2Fcomponent%3Djobs%2Capp.kubernetes.io%2Fname%3Danomiebot-4 (Caused by
  • 04:14 in #wikimedia-cloud < anomie> NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f370f1ffc10>: Failed to resolve 'k8s.tools.eqiad1.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")))" and other errors.
  • 04:53 in #wikimedia-cloud <andrewbogott> anomie: I'm looking, not sure about the dns thing because I was thinking this was an nfs issue
  • 04:56 in #wikimedia-cloud <andrewbogott> !log admin 'systemctl restart nfs-server' on tools-nfs-2.tools.eqiad1.wikimedia.cloud
  • 04:58 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
  • 05:03 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
  • 05:19 Andrew considers initial NFS outage as resolved (phab:T380827)
  • 05:45 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools
  • 05:47 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: InstanceDown: Project tools instance tools-sgebastion-10 is down
  • 05:52 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: InstanceDown: Project tools instance tools-sgebastion-10 is down
  • 06:15 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down
  • 06:15 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: MaintainKubeusersDown: maintain-kubeusers is down
  • 06:45 in #wikimedia-cloud-feed <logmsgbot_cloud> !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers (T380827)
  • 07:15 Jobs API failing after worker node reboots
  • 07:19 Observed pods crashing with DNS resolution errors
  • 07:24 Created T380832 for jobs-api crashes
  • 07:46 David joins incident response
  • 08:05 Identified issues at the api-gateway level
  • 09:00 Arturo joins incident response
  • 09:22 INCIDENT DECLARED - Slavina becomes IC
  • 09:26 Started investigating possible relation to nameserver puppet issue
  • 09:28 Arturo scanning pdns recursor logs
  • 09:31 Discovered DNS failures are intermittent within the same pod
  • 09:34 Increased CPU request for CoreDNS pods
  • 09:37 CoreDNS CPU usage spiked, increased allocation to 500m
  • 09:42 Verified direct queries to external DNS servers work without issues
  • 09:43 Scaled CoreDNS replicas to 8 (see the command sketch after this timeline)
  • 09:47 Analysis shows cluster has ~4000 services+pods
  • 09:49 Memory usage analysis: current 170M vs expected <100M based on cluster size
  • 09:55 Discovered calico-node pods may be recreating iptables ruleset
  • 09:58 Found concerning calico-node warnings about NAT table rule inconsistencies
  • 10:03 Connectivity issues between coredns pods observed
  • 10:04 Confirmed calico is flushing/recreating the ruleset
  • 10:25 After control plane reboot, direct DNS lookups working to all nodes
  • 10:29 Control-7 not yet rebooted, still showing ruleset recreation
  • 10:31 Determined ruleset refreshes mainly from kube-proxy
  • 10:47 Continuous calico warnings observed
  • 10:48 Running functional tests in a loop to verify stability
  • 10:53 Confirmed all DNS errors in crashed pods predate control plane reboot
  • 11:04 Confirmed ruleset recreation loop is by kube-proxy, not calico
  • 11:08 INCIDENT RESOLVED, OUTAGE ENDS
  • 13:10 root cause main theory presented https://phabricator.wikimedia.org/T380827#10357462
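
For reference, the CoreDNS mitigation steps in the 09:34 to 09:43 entries roughly correspond to the kubectl commands below. This is only a sketch: the namespace (kube-system), deployment name (coredns) and label (k8s-app=kube-dns) are assumptions about the Toolforge cluster layout, not values taken from the incident.

  # Assumed names; the CoreDNS deployment may live under a different
  # namespace, name or label in the Toolforge cluster.
  kubectl -n kube-system set resources deployment coredns --requests=cpu=500m
  kubectl -n kube-system scale deployment coredns --replicas=8
  # Verify the rollout and watch CPU usage afterwards (needs metrics-server).
  kubectl -n kube-system rollout status deployment coredns
  kubectl -n kube-system top pods -l k8s-app=kube-dns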

Detection

  • The initial detection and reporting of something being wrong came from community members on IRC (users unable to log in to the bastions due to NFS).
  • Further Toolforge problems were detected when Andrew and Slavina were investigating the cause of the jobs-api crashing after the reboot of several k8s-nfs-worker nodes. The jobs-api logs revealed DNS resolution failures, which were also causing user workloads to fail in many different ways.
  • In any case, there was no clear detection of the root cause of the problem at any point during the outage.

Exploration

Three main effects to explore:

Toolforge k8s dns issues

Many different errors were happening around the cluster:

  • Pods crashing due to DNS resolution timing out
  • Errors due to the k8s API server failing to resolve webhooks (e.g. Kyverno)
  • Errors due to the k8s API server failing to contact etcd (DNS resolution)
  • Errors during image builds due to being unable to clone from GitHub (DNS error)
  • This was caused by CoreDNS requests failing intermittently
  • This was in turn caused by some of the CoreDNS pods being unable to communicate with some of the cluster pods (a network partition); see the probe sketch after this list
  • It is not clear what caused the network partition between k8s nodes
    • Restarting the control and worker nodes restored the network (so there was some bad state)
    • The timing matches the openstack network agent restart
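
The intermittent nature of the failures can be narrowed down by probing cluster DNS repeatedly and by querying individual CoreDNS pods directly. A minimal sketch of that kind of probe is below; the image, label and service names are assumptions, not what was actually run during the incident.

  # Run a short-lived pod and repeat a lookup; intermittent failures show up
  # as occasional "FAIL" lines among successful answers.
  kubectl run dns-probe --rm -it --restart=Never --image=busybox:1.36 -- \
    sh -c 'for i in $(seq 1 20); do nslookup kubernetes.default.svc.cluster.local || echo "FAIL $i"; sleep 1; done'
  # List the CoreDNS pod IPs (label assumed), then query each one directly from
  # a probe pod; a pod that times out while the others answer points at a
  # network partition rather than at CoreDNS itself.
  kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide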

Openstack recursor DNS issues

Toolforge NFS issues

Users were unable to log in to login.toolforge.org as SSH sessions were hanging.

  • This was caused by the NFS clients on the bastions failing to connect to the NFS service (a quick way to confirm that state is sketched after this list)
  • (unconfirmed) This was caused by the NFS service having had a network outage
  • This was caused by the openstack neutron agents restarting
  • This was caused by a puppet change to the neutron agent configuration restarting the agent
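
A quick, generic way to confirm the hung-NFS-mount state on a bastion is sketched below; these are standard Linux commands, not the exact steps used during the incident.

  # List NFS mounts and their options (hard/soft, timeo, retrans).
  findmnt --types nfs,nfs4
  nfsstat -m
  # A stalled hard mount normally leaves "server ... not responding" messages
  # in the kernel log.
  dmesg -T | grep -i 'not responding'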

Conclusions

The primary theory of the problem is:

  • a puppet change to the openstack neutron-openvswitch-agent config was merged, which caused puppet to restart the neutron agents
  • this resulted in virtual machines temporarily losing connectivity to each other

This resulted in a number of effects:

  • (unconfirmed) an increase in recursive DNS request failures
  • The Toolforge NFS server being briefly unavailable
    • because of how we configure Toolforge NFS clients, they will regard the NFS server as down and refuse further operations, to prevent data corruption (see the illustrative mount entry after this list)
  • (unconfirmed) Toolforge kubernetes pod network also had problems connecting, and some pods (most notably coredns) could not properly work
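
To illustrate the client behaviour described above: with a hard NFS mount the client retries forever, so processes block on I/O instead of receiving errors (and possibly corrupting data) while the server is unreachable. A purely hypothetical fstab-style entry (server, path and options invented; the real settings live in puppet) would look like:

  # Hypothetical example only; 'hard' makes clients block rather than error out.
  nfs-server.example:/srv/tools  /data/project  nfs4  hard,noatime  0  0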

If that is the case, we were unaware of how impactful a restart of the neutron-openvswitch-agent and OVS could be.

What went well?

  • WMCS engineers were able to track this incident in pretty much a "follow the sun" fashion, with good timezone coverage.
  • Enough WMCS engineers were available to work on the incident.

What went poorly?

  • The root cause was not on the radar until very late in the debugging process, when the incident was already mitigated.
  • There were no paging alerts.
  • There were challenges in operating the jobs-api, related to admin documentation and lack of experience with established workflows.
  • There was no monitoring in place to check the specific bits that were failing (network cross checks or the like); a minimal sketch of such a check follows this list.
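
As a rough illustration of the kind of cross check that was missing, a periodic job on each node could probe its peers and log or alert on failures. Everything here (node names, alerting hook) is a placeholder, not a concrete proposal.

  # Placeholder peer list; in practice this would come from inventory.
  for peer in tools-k8s-control-7 tools-k8s-worker-42; do
      if ! ping -c1 -W2 "$peer" >/dev/null 2>&1; then
          echo "unreachable: $peer" | logger -t net-crosscheck
      fi
  done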

Where did we get lucky?

  • A reboot of Toolforge kubernetes control plane nodes got Toolforge back into a reliable state, even if we don't know the root cause of the problem yet.

Links to relevant documentation

None. See action items.

Actionables

Scorecard

Incident Engagement ScoreCard
Question | Answer (yes/no) | Notes
People: Were the people responding to this incident sufficiently different than the previous five incidents? | no | the team has reduced headcount
Were the people who responded prepared enough to respond effectively? | yes | for the most part yes, barring action item about k8s docs
Were fewer than five people paged? | yes | there were no paging alerts
Were pages routed to the correct sub-team(s)? | no | there were no paging alerts
Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | there were no paging alerts
Process: Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes |
Was a public wikimediastatus.net entry created? | no | N/A
Is there a phabricator task for the incident? | yes |
Are the documented action items assigned? | yes |
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes |
Tooling: To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes |
Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |
Did existing monitoring notify the initial responders? | no | at least not the root cause
Were the engineering tools that were to be used during the incident available and in service? | yes |
Were the steps taken to mitigate guided by an existing runbook? | no |
Total score (count of all “yes” answers above) | 9 |