Incidents/2023-09-29 CloudVPS vms losing network connectivity


document status: final

Summary

Incident metadata (see Incident Scorecard)

  • Incident ID: 2023-09-29 CloudVPS vms losing network connectivity
  • Start: 11:00 UTC 2023-09-28 (the day before this report)
  • End: 13:00 UTC 2023-09-29
  • Task: T347665
  • People paged: 1
  • Responder count: 5
  • Coordinators: arturo, dcaro, taavi
  • Affected metrics/SLOs: No relevant SLOs exist
  • Impact: Any CloudVPS hosted external service was down (including Toolforge, PAWS, Quarry, and others). Some of the VMs became unreachable through SSH.

During a package cleanup, change 961005 was merged to remove some packages. This caused bullseye VMs in the cloud realm to remove isc-dhcp-client, and once the IP leases of those VMs started to expire, they began losing network connectivity.

This eventually included the proxies CloudVPS uses to serve external traffic, making any hosted project lose that traffic too.

This also took down the metricsinfra VMs that are in charge of monitoring and alerting for CloudVPS hosted projects, so no alerts were sent from that system.

From there, recovery included roll-rebooting all the Toolforge VMs that depend on the NFS servers, as the NFS service itself was affected and clients got stuck (a common procedure, but a slow one).
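
The failure mode described above can be spotted before a VM actually drops off the network: while its DHCP lease is still valid, the host remains reachable but no longer has a DHCP client installed. Below is a minimal triage sketch of that check, assuming SSH access and a placeholder host list (the real fleet would be enumerated through OpenStack); it is an illustration only, not the tooling used during the incident.

#!/usr/bin/env python3
"""Triage sketch: find VMs that no longer have a DHCP client installed.

Hypothetical illustration; the host list and SSH options are placeholders.
"""
import subprocess

HOSTS = [
    "example-vm-1.example-project.eqiad1.wikimedia.cloud",  # placeholder
]

def dhclient_missing(host: str) -> bool:
    """Return True if /sbin/dhclient is not present (or the host is unreachable)."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=5", host, "test", "-x", "/sbin/dhclient"],
        capture_output=True,
    )
    # ssh exits 255 when the host is already unreachable; treat that as broken too.
    return result.returncode != 0

if __name__ == "__main__":
    for host in HOSTS:
        if dhclient_missing(host):
            print(f"{host}: dhclient missing, will lose its address "
                  "when the current DHCP lease expires")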

Timeline

tools k8s nodes reboot: https://sal.toolforge.org/tools?d=2023-09-29

Alert logs (team=wmcs, there might be non-wmcs ones): https://logstash.wikimedia.org/goto/906ec4838ab338cc70e1484010ab7df2

IRC archives:

All times in UTC.

28 September 2023:

29 September 2023:

  • 04:14:00 - 07:06 - some users report connectivity issues on IRC, but no admins notice - first user impact
  • 06:24 - user reports connectivity issues https://phabricator.wikimedia.org/T347661
  • 06:56 - outage task created by user https://phabricator.wikimedia.org/T347665
  • 07:06 - admin starts looking into the issues as they notice alertmanager down (https://www.irccloud.com/pastebin/aA1NNmt1/), another admin joins
  • 07:11:25 - find that project-proxy is not responding
  • 07:16:13 - find out that dhclient is not installed on the VM (which otherwise looks ok)
  • 07:18:06 - the apt logs show that a Puppet patch had removed the package
  • 07:23:59 - revert sent, restore started (since the affected VMs cannot run Puppet, they have to be fixed "manually"); two parallel efforts: one admin writes a script to automate the fix (see the sketch after this timeline), the other starts manually fixing the core/critical VMs
  • 07:31 - first email sent to cloud-announce about the outage
  • 08:10 - a third admin joins and helps manually fix the other critical VMs
  • 08:21:23 - metricsinfra alerts restored (manually)
  • 08:35 - the fix script is started, running across the fleet in parallel with the manual fixes
  • 09:16:58 - script finishes a first round through the whole fleet
  • 09:32 - rebooting tools-nfs-2 since the network setup on nfs servers needs a reboot + puppet run (task T347681)
  • 09:37 - start rebooting k8s worker nodes to release stuck nfs file handles
  • 09:38:10 - admin paged: checker.tools.wmflabs.org/toolschecker: NFS read/writeable on labs instances
  • 09:42 - grid reboot cookbook is failing as the nodes are stuck and it does not try to force-reboot through openstack
  • 10:02 - rebooting all other NFS instances
  • 10:08:42 - all grid bastions and workers rebooted
  • 11:46 - OUTAGE ENDS - "Voila" message on IRC, all services running
  • 13:03 - email to cloud-announce declaring the outage over
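
For reference, the remediation between 07:23 and 09:16 combined a hand-written parallel script with manual fixes of the critical VMs. The sketch below shows the general shape such a pass could take, reinstalling isc-dhcp-client and renewing the lease over SSH while the VM is still reachable; the host list, interface name and exact commands are assumptions, not the script that was actually run.

#!/usr/bin/env python3
"""Sketch of a parallel remediation pass: reinstall isc-dhcp-client and renew
the DHCP lease on each still-reachable VM.

Illustrative only; hosts, interface name and commands are assumptions.
"""
import concurrent.futures
import subprocess

HOSTS = [
    "example-vm-1.example-project.eqiad1.wikimedia.cloud",  # placeholder
]

# The interface name (ens3) is an assumption for the example.
FIX_CMD = (
    "sudo apt-get install -y isc-dhcp-client && "
    "sudo dhclient -v ens3"
)

def fix(host: str) -> tuple[str, int]:
    """Run the remediation command on one host and return its exit code."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", host, FIX_CMD],
        capture_output=True,
        text=True,
    )
    return host, result.returncode

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        for host, code in pool.map(fix, HOSTS):
            status = "fixed" if code == 0 else f"failed (rc={code})"
            print(f"{host}: {status}")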

Detection

The issue was first detected by users; admins did not notice anything was wrong until the first admin started their work day.

The only page arrived much later, once recovery had already started.

Note that the outage took down one of the monitoring and alerting systems, though no alert from it would have paged us anyway (https://phabricator.wikimedia.org/T323510).

Actionables

  • task T347694 - investigate why we did not get any pages, and fix/add them
  • task T288053 - add meta-monitoring for metricsinfra
  • task T347683 - create a cookbook to run commands through the virsh console (see the sketch after this list)
  • task T347681 - improve the current NFS setup so that bringing a server back online does not require a reboot plus a Puppet run (an unattended Puppet run might take 30 minutes)
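
Task T347683 above asks for a cookbook that can run commands through the virsh console when a VM has no network. A minimal sketch of the idea follows, assuming pexpect is available, the hypervisor is reachable over SSH, and the VM's serial console offers a passwordless root login; the prompts, login flow and host/domain names are all assumptions, not the eventual cookbook.

#!/usr/bin/env python3
"""Sketch: run a command on a VM over its serial console via `virsh console`.

Hypothetical illustration of the idea behind T347683; the login flow, prompts
and host/domain names are assumptions.
"""
import pexpect

def run_via_console(hypervisor: str, domain: str, command: str) -> str:
    """Attach to the VM's serial console through the hypervisor and run a command."""
    child = pexpect.spawn(
        f"ssh -t {hypervisor} virsh console {domain}",
        timeout=120,
        encoding="utf-8",
    )
    child.expect("Escape character")
    child.sendline("")                 # wake up the console getty
    child.expect("login:")
    child.sendline("root")             # assumes a passwordless console login
    child.expect("[#$] ")
    child.sendline(command)
    child.expect("[#$] ")              # wait for the command to finish
    output = child.before
    child.sendcontrol("]")             # detach from the console
    child.close()
    return output

if __name__ == "__main__":
    # Placeholder hypervisor and libvirt domain names.
    print(run_via_console(
        "hypervisor.example.eqiad1.wikimedia.cloud",
        "i-0000example",
        "apt-get install -y isc-dhcp-client && dhclient -v",
    ))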

Scorecard

Incident Engagement ScoreCard (question, answer yes/no, notes)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? yes
  • Were the people who responded prepared enough to respond effectively? yes
  • Were fewer than five people paged? yes - Alerting was broken.
  • Were pages routed to the correct sub-team(s)? no
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. yes

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no - None created
  • Was a public wikimediastatus.net entry created? no
  • Is there a phabricator task for the incident? yes - task T347665
  • Are the documented action items assigned? yes
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
  • Were the people responding able to communicate effectively during the incident with the existing tooling? yes
  • Did existing monitoring notify the initial responders? no - task T288053 will pursue meta-monitoring
  • Were the engineering tools that were to be used during the incident available and in service? no
  • Were the steps taken to mitigate guided by an existing runbook? no

Total score (count of all “yes” answers above): 9