
Incidents/2024-11-26 WMCS network problems


document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2024-11-26 WMCS network problems
Start: 2024-11-26 02:21:00
End: 2024-11-26 11:08:00
Task: T380882
People paged: 0
Responder count: 4 (Andrew, David, Arturo, Slavina)
Coordinators: Slavina Stefanova
Affected metrics/SLOs: No relevant SLOs exist
Impact: There were two main ways things were impacted:
  • Users were unable to log in to the bastions
  • All Toolforge users were affected by intermittent DNS resolution failures, causing some user workloads to crash, possibly including CI jobs. The outage also affected core k8s components as well as internal toolforge k8s components, leading to widespread malfunction of the cluster as a whole.
  • (Suspected root cause) A restart of the OpenStack virtual network service created a network interruption for all VMs, causing Toolforge NFS to fail and Toolforge k8s to fail internal DNS resolution.
  • On the NFS side:
    • Users were unable to log into login.toolforge.org
  • On the DNS side:
    • Users were seeing many systemic errors and specific DNS resolution errors in their workloads
    • The k8s cluster components started failing due to DNS resolution problems, causing many different errors around the cluster
    • This included the MediaWiki train deployment getting blocked by the incident (the exact dependency is not yet clear)

Timeline

All times in UTC.

  • 01:59 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1092425 is deployed, which contains a change to the openstack OVS configuration. At most 30 minutes later, puppet will trigger a restart of the OVS daemons that implement the virtual network.
  • 02:21 OUTAGE STARTS first reports of toolforge jobs failing (gitlab-account-approval job failure email to Bryan Davis)
  • 02:26 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
  • 02:30 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: MaintainKubeusersDown: maintain-kubeusers is down
  • 02:31 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
  • 03:35 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools
  • 03:58 in #wikimedia-cloud < anomie> Grr. login-buster.toolforge.org seems broken, and I still don't have a good way to restart my bot without it.
  • 04:08 in #wikimedia-cloud < anomie> Seems other stuff is broken too. "ERROR: TjfCliError: The jobs service seems to be down – please retry in a few minutes."
  • 04:14 in #wikimedia-cloud < anomie> Seems like networking is borked. "ERROR: TjfCliError: Unknown error (HTTPSConnectionPool(host='k8s.tools.eqiad1.wikimedia.cloud', port=6443): Max retries exceeded with url: /apis/batch/v1/namespaces/tool-anomiebot/jobs?labelSelector=toolforge%3Dtool%2Capp.kubernetes.io%2Fmanaged-by%3Dtoolforge-jobs-framework%2Capp.kubernetes.io%2Fcreated-by%3Danomiebot%2Capp.kubernetes.io%2Fcomponent%3Djobs%2Capp.kubernetes.io%2Fname%3Danomiebot-4 (Caused by
  • 04:14 in #wikimedia-cloud < anomie> NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f370f1ffc10>: Failed to resolve 'k8s.tools.eqiad1.wikimedia.cloud' ([Errno -3] Temporary failure in name resolution)")))" and other errors.
  • 04:53 in #wikimedia-cloud <andrewbogott> anomie: I'm looking, not sure about the dns thing because I was thinking this was an nfs issue
  • 04:56 in #wikimedia-cloud <andrewbogott> !log admin 'systemctl restart nfs-server' on tools-nfs-2.tools.eqiad1.wikimedia.cloud
  • 04:58 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
  • 05:03 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4)
  • 05:19 Andrew considers initial NFS outage as resolved (phab:T380827)
  • 05:45 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools
  • 05:47 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: InstanceDown: Project tools instance tools-sgebastion-10 is down
  • 05:52 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: InstanceDown: Project tools instance tools-sgebastion-10 is down
  • 06:15 in #wikimedia-cloud-feed <wmcs-alerts> RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down
  • 06:15 in #wikimedia-cloud-feed <wmcs-alerts> FIRING: MaintainKubeusersDown: maintain-kubeusers is down
  • 06:45 in #wikimedia-cloud-feed <logmsgbot_cloud> !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers (T380827)
  • 07:15 Jobs API failing after worker node reboots
  • 07:19 Observed pods crashing with DNS resolution errors
  • 07:24 Created T380832 for jobs-api crashes
  • 07:46 David joins incident response
  • 08:05 Identified issues at the api-gateway level
  • 09:00 Arturo joins incident response
  • 09:22 INCIDENT DECLARED - Slavina becomes IC
  • 09:26 Started investigating possible relation to nameserver puppet issue
  • 09:28 Arturo scanning pdns recursor logs
  • 09:31 Discovered DNS failures are intermittent within the same pod
  • 09:34 Increased CPU request for CoreDNS pods
  • 09:37 CoreDNS CPU usage spiked, increased allocation to 500m
  • 09:42 Verified direct queries to external DNS servers work without issues
  • 09:43 Scaled CoreDNS replicas to 8 (see the command sketch after this timeline)
  • 09:47 Analysis shows cluster has ~4000 services+pods
  • 09:49 Memory usage analysis: current 170M vs expected <100M based on cluster size
  • 09:55 Discovered calico-node pods may be recreating iptables ruleset
  • 09:58 Found concerning calico-node warnings about NAT table rule inconsistencies
  • 10:03 Connectivity issues between coredns pods observed
  • 10:04 Confirmed calico is flushing/recreating the ruleset
  • 10:25 After control plane reboot, direct DNS lookups working to all nodes
  • 10:29 Control-7 not yet rebooted, still showing ruleset recreation
  • 10:31 Determined ruleset refreshes mainly from kube-proxy
  • 10:47 Continuous calico warnings observed
  • 10:48 Running functional tests in a loop to verify stability
  • 10:53 Confirmed all DNS errors in crashed pods predate control plane reboot
  • 11:04 Confirmed ruleset recreation loop is by kube-proxy, not calico
  • 11:08 INCIDENT RESOLVED, OUTAGE ENDS
  • 13:10 root cause main theory presented https://phabricator.wikimedia.org/T380827#10357462
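
For reference, the CoreDNS mitigation steps in the 09:34 to 09:43 entries roughly correspond to the kubectl commands below. This is only a sketch: the namespace (kube-system), deployment name (coredns) and label (k8s-app=kube-dns) are assumptions about the Toolforge cluster layout, not values taken from the incident.

  # Assumed names; the CoreDNS deployment may live under a different
  # namespace, name or label in the Toolforge cluster.
  kubectl -n kube-system set resources deployment coredns --requests=cpu=500m
  kubectl -n kube-system scale deployment coredns --replicas=8
  # Verify the rollout and watch CPU usage afterwards (needs metrics-server).
  kubectl -n kube-system rollout status deployment coredns
  kubectl -n kube-system top pods -l k8s-app=kube-dns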

Detection

  • The initial detection and reporting of something being wrong came from community members on IRC (users unable to log in to the bastions due to NFS).
  • Further Toolforge problems were detected when Andrew and Slavina were investigating the cause of the jobs-api crashing after the reboot of several k8s-nfs-worker nodes. The jobs-api logs revealed DNS resolution failures, which were also causing user workloads to fail in many different ways.
  • In any case, there was no clear detection of the root cause of the problem at any point during the outage.

Exploration

Three main effects to explore:

Toolforge k8s dns issues

Many different errors were happening around the cluster:

  • Pods crashing due to DNS resolution timing out
  • Errors due to the k8s API server failing to resolve webhooks (e.g. Kyverno)
  • Errors due to the k8s API server failing to contact etcd (DNS resolution)
  • Errors during image builds due to being unable to clone from GitHub (DNS error)
  • This was caused by CoreDNS requests failing intermittently
  • This was in turn caused by some of the CoreDNS pods being unable to communicate with some of the cluster pods (a network partition); see the probe sketch after this list
  • It is not clear what caused the network partition between k8s nodes
    • Restarting the control and worker nodes restored the network (so there was some bad state)
    • The timing matches the openstack network agent restart
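
The intermittent nature of the failures can be narrowed down by probing cluster DNS repeatedly and by querying individual CoreDNS pods directly. A minimal sketch of that kind of probe is below; the image, label and service names are assumptions, not what was actually run during the incident.

  # Run a short-lived pod and repeat a lookup; intermittent failures show up
  # as occasional "FAIL" lines among successful answers.
  kubectl run dns-probe --rm -it --restart=Never --image=busybox:1.36 -- \
    sh -c 'for i in $(seq 1 20); do nslookup kubernetes.default.svc.cluster.local || echo "FAIL $i"; sleep 1; done'
  # List the CoreDNS pod IPs (label assumed), then query each one directly from
  # a probe pod; a pod that times out while the others answer points at a
  # network partition rather than at CoreDNS itself.
  kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide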

Openstack recursor DNS issues

Toolforge NFS issues

Users were unable to log in to login.toolforge.org as SSH sessions were hanging.

  • This was caused by the NFS clients on the bastions failing to connect to the NFS service (a quick way to confirm that state is sketched after this list)
  • (unconfirmed) This was caused by the NFS service having had a network outage
  • This was caused by the openstack neutron agents restarting
  • This was caused by a puppet change to the neutron agent configuration restarting the agent
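
A quick, generic way to confirm the hung-NFS-mount state on a bastion is sketched below; these are standard Linux commands, not the exact steps used during the incident.

  # List NFS mounts and their options (hard/soft, timeo, retrans).
  findmnt --types nfs,nfs4
  nfsstat -m
  # A stalled hard mount normally leaves "server ... not responding" messages
  # in the kernel log.
  dmesg -T | grep -i 'not responding'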

Conclusions

The primary theory of the problem is:

  • a puppet change to the openstack neutron-openvswitch-agent config was merged, which caused puppet to restart the neutron agents
  • this resulted in virtual machines temporarily losing connectivity to each other

This resulted in a number of effects:

  • (unconfirmed) an increase in recursive DNS request failures
  • The Toolforge NFS server being briefly unavailable
    • because of how we configure Toolforge NFS clients, they will regard the NFS server as down and refuse further operations, to prevent data corruption (see the illustrative mount entry after this list)
  • (unconfirmed) Toolforge kubernetes pod network also had problems connecting, and some pods (most notably coredns) could not properly work
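
To illustrate the client behaviour described above: with a hard NFS mount the client retries forever, so processes block on I/O instead of receiving errors (and possibly corrupting data) while the server is unreachable. A purely hypothetical fstab-style entry (server, path and options invented; the real settings live in puppet) would look like:

  # Hypothetical example only; 'hard' makes clients block rather than error out.
  nfs-server.example:/srv/tools  /data/project  nfs4  hard,noatime  0  0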

If that is the case, we were unaware of how impactful a restart of the neutron-openvswitch-agent and OVS could be.

What went well?

  • WMCS engineers were able to track this incident in pretty much a "follow the sun" fashion, with good timezone coverage.
  • Enough WMCS engineers were available to work on the incident.

What went poorly?

  • The root cause was not on the radar until very late in the debugging process, when the incident was already mitigated.
  • There were no paging alerts.
  • There were challenges in operating the jobs-api, related to admin documentation and lack of experience with established workflows.
  • There was no monitoring in place to check the specific bits that were failing (network cross checks or the like); a minimal sketch of such a check follows this list.
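
As a rough illustration of the kind of cross check that was missing, a periodic job on each node could probe its peers and log or alert on failures. Everything here (node names, alerting hook) is a placeholder, not a concrete proposal.

  # Placeholder peer list; in practice this would come from inventory.
  for peer in tools-k8s-control-7 tools-k8s-worker-42; do
      if ! ping -c1 -W2 "$peer" >/dev/null 2>&1; then
          echo "unreachable: $peer" | logger -t net-crosscheck
      fi
  done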

Where did we get lucky?

  • A reboot of Toolforge kubernetes control plane nodes got Toolforge back into a reliable state, even if we don't know the root cause of the problem yet.

Links to relevant documentation

None. See action items.

Actionables

Scorecard

Incident Engagement ScoreCard
Question | Answer (yes/no) | Notes
People: Were the people responding to this incident sufficiently different than the previous five incidents? | no | the team has reduced headcount
Were the people who responded prepared enough to respond effectively? | yes | for the most part yes, barring action item about k8s docs
Were fewer than five people paged? | yes | there were no paging alerts
Were pages routed to the correct sub-team(s)? | no | there were no paging alerts
Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | there were no paging alerts
Process: Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes |
Was a public wikimediastatus.net entry created? | no | N/A
Is there a phabricator task for the incident? | yes |
Are the documented action items assigned? | yes |
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes |
Tooling: To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes |
Were the people responding able to communicate effectively during the incident with the existing tooling? | yes |
Did existing monitoring notify the initial responders? | no | at least not the root cause
Were the engineering tools that were to be used during the incident available and in service? | yes |
Were the steps taken to mitigate guided by an existing runbook? | no |
Total score (count of all “yes” answers above) | 9 |