Catalyst/Incidents/2026-02-10

Time to restore: 1h

Timeline

An alert email was received at 10:36 UTC indicating that the k3s VM was down. The team noticed the email roughly 30 minutes later.

The VM was unresponsive and not accepting SSH connections. A soft reboot was attempted and succeeded. A large number of pods were running on the k3s node, well above the limit of 110 pods per node recommended for correct operation:

$ kubectl get pods -A -o=custom-columns=NODE:.spec.nodeName | sort | uniq -c | sort -n
      1 NODE
      2 k3s-envdb
     74 k3s-worker02
     84 k3s-worker01
    131 k3s
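
110 pods per node is the default capacity a kubelet advertises; the value the node actually enforces can be read from the node object (node name k3s taken from the output above):

$ kubectl get node k3s -o jsonpath='{.status.capacity.pods}'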

The k3s service logs were flooded with errors in the period leading up to the alert. The team also observed a large number of recently created pods from CI:

$ kubectl get pods -n cat-env | grep mw-ext-wl-ci | wc -l
46
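
For reference, on a standard k3s install the service logs can be inspected through systemd's journal; a sketch for the window before the alert, assuming k3s runs as the k3s systemd unit (times taken from the timeline above):

$ journalctl --utc -u k3s --since "2026-02-10 10:00" --until "2026-02-10 10:36" -p warning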

Deleting the associated environments brought the pod count back below the recommended limit. At around 11:36 UTC the cluster appeared stable and the incident was considered resolved.
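
The exact cleanup command used is not recorded here; a plausible pod-level equivalent is to bulk-delete everything matching the CI prefix observed above (a sketch only: pods owned by a controller would be recreated, so the environments themselves must be torn down for the deletion to stick):

$ kubectl get pods -n cat-env --no-headers -o custom-columns=NAME:.metadata.name \
    | grep '^mw-ext-wl-ci' \
    | xargs -r kubectl delete pod -n cat-env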

Symptoms

404 pages when attempting to load Patchdemo or any wiki environment

Overview

  • Wikis unavailable
  • Catalyst API unavailable
  • Patchdemo unavailable
  • SSH connections not accepted by instance
  • Rebooted instance
  • Deleted all recent CI wikis

TODOs

As already mentioned in Catalyst/Incidents/2026-01-29, we probably want to be more aggressive about deleting CI environments.
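
A minimal sketch of what that could look like, assuming CI environments are identifiable by the mw-ext-wl-ci name prefix and that a 24-hour cutoff is acceptable (both are assumptions; GNU date is assumed, and the ISO 8601 timestamps compare correctly as strings):

$ kubectl get pods -n cat-env \
    -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' \
    | awk -v cutoff="$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
        '$1 ~ /^mw-ext-wl-ci/ && $2 < cutoff {print $1}' \
    | xargs -r kubectl delete pod -n cat-env

Run periodically (e.g. from a cron job), this would cap how long leftover CI environments can accumulate. The same caveat as above applies: pods recreated by a controller will come back, so in practice the Catalyst environment teardown should be invoked rather than raw pod deletion.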