Catalyst/Incidents/2026-02-10
Time to restore: 1h
Timeline
Alert email received at 10:36 UTC signaling the k3s VM was down. The team noticed the email roughly 30m later.
The VM was unresponsive and not accepting SSH connections, so the team attempted a soft reboot, which succeeded. A large number of pods were running on the k3s node, well above the recommended limit of 110 pods per node for correct operation:
$ kubectl get pods -A -o=custom-columns=NODE:.spec.nodeName | sort | uniq -c | sort -n
      1 NODE
      2 k3s-envdb
     74 k3s-worker02
     84 k3s-worker01
    131 k3s
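For reference, the pod capacity the kubelet enforces is exposed on the node object, so the limit on the affected node can be confirmed directly (a suggested check, not one run during the incident):
$ kubectl get node k3s -o jsonpath='{.status.capacity.pods}'
This prints the node's max-pods value, which defaults to 110.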
The k3s service logs were flooded with errors of various kinds in the lead-up to the alert. The team also observed a large number of recently created CI pods:
$ kubectl get pods -n cat-env | grep mw-ext-wl-ci | wc -l
46
Deleting the associated environments brought the pod count back below the recommended limit. By around 11:36 UTC the cluster appeared stable and the incident was declared resolved.
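For the record, a minimal sketch of what the bulk deletion could look like from kubectl, assuming each CI environment's resources share the mw-ext-wl-ci name prefix in the cat-env namespace and are backed by Deployments (the actual cleanup may have gone through Catalyst's own tooling):
$ kubectl get deployments -n cat-env -o name | grep mw-ext-wl-ci | xargs kubectl delete -n cat-env
Deleting the owning Deployments rather than the pods prevents the controllers from immediately recreating them.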
Symptoms
404 pages when attempting to load Patchdemo or any wiki environment
Overview
- Wikis unavailable
- Catalyst API unavailable
- Patchdemo unavailable
- SSH connections not accepted by instance
- Rebooted instance
- Deleted all recent CI wikis
TODOs
As already mentioned in Catalyst/Incidents/2026-01-29, we probably want to be more aggressive about deleting CI environments.
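A minimal sketch of what more aggressive cleanup could look like: a periodic job that deletes CI environments past a cutoff age. The mw-ext-wl-ci prefix, the cat-env namespace, the 24-hour cutoff, and the use of Deployment creation time as a proxy for environment age are all assumptions, not confirmed Catalyst conventions.
# Hypothetical cleanup sketch: delete CI deployments older than 24 hours.
# Assumes GNU date and the mw-ext-wl-ci naming convention described above.
cutoff=$(date -u -d '24 hours ago' +%s)
kubectl get deployments -n cat-env \
  -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' \
  | grep mw-ext-wl-ci \
  | while read -r name ts; do
      # Delete any deployment whose creation time predates the cutoff.
      if [ "$(date -u -d "$ts" +%s)" -lt "$cutoff" ]; then
        kubectl delete deployment -n cat-env "$name"
      fi
    done
Run from cron (or a Kubernetes CronJob), this would cap how long orphaned CI environments can accumulate before they are reclaimed.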