Incident documentation/2021-11-02 Cloud VPS networking

From Wikitech

document status: in-review

Summary

After a kernel upgrade on several Cloud VPS network components (the cloudnet and cloudgw servers; see T291813), we first noticed problems with Toolforge NFS in Kubernetes. LDAP connections were later found to be affected as well. The problem eventually turned out to affect all ingress traffic to the network edge for cloud VMs, except VMs with floating IPs, which were unaffected. The issue was resolved by rolling back the kernel upgrade.

Impact: For about 1 hour and 40 minutes, Toolforge services and VMs in Cloud VPS may have experienced connectivity issues.

Actionables

  • Improve automated testing and monitoring of cloud networking, T294955
  • Set up static route for cr-codfw, T295288
  • Avoid keepalived flaps when rebooting servers, T294956