Catalyst/Incidents/2026-01-29
Time to restore: 45 min
Timeline
Detailed timeline of when incident started, what we did when, and when service was restored
At 10:03 PST an email alert reported that the instance was down. The team noticed and responded to a Slack message about the incident 30 minutes later.
The logs on the instance indicated that it had run out of memory. The team restarted the instance and k3s. Restarting k3s initially failed with errors about incorrect passwords, which appears to happen whenever the instance is restarted. Deleting the password files fixed the problem.
At 10:48 PST the instance was deemed healthy.
Symptoms
What users see
The Wikimedia Cloud Services error page.
Overview
What we saw and what we did
We noticed errors in Abstract Wiki/Wikifunctions pods generated by CI. There were over 200 of these pods, which probably caused the instance to run out of memory. We deleted the offending demos.
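To spot this kind of pile-up earlier, pods can be counted per namespace. The following is a minimal sketch, not the tooling used during the incident. It assumes the input is the parsed JSON from `kubectl get pods --all-namespaces -o json`; the function name `pods_per_namespace` is hypothetical.

```python
from collections import Counter


def pods_per_namespace(pod_list):
    """Count pods per namespace in a parsed Kubernetes PodList.

    `pod_list` is assumed to be the dict obtained by loading the JSON
    output of `kubectl get pods --all-namespaces -o json`.
    """
    return Counter(
        item["metadata"]["namespace"] for item in pod_list.get("items", [])
    )
```

Calling `.most_common(5)` on the result lists the five namespaces with the most pods, which is one way to see whether CI demos dominate the cluster.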
TODOS
Followup actions
Figure out how to limit the number of CI demos that can exist. When a demo fails on GitLab, it is kept around for 3 days for debugging purposes unless it is manually deleted. If a failing job is retried many times, the accumulated demos can overwhelm the cluster.
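One possible shape for such a limit is sketched below. It combines the existing 3-day retention with a hard cap on the number of demos and deletes the oldest ones first. This is only an illustration under stated assumptions: the `Demo` record, the `demos_to_delete` function, and the default cap of 20 are all hypothetical, and the step that actually removes a demo from the cluster is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Demo:
    """Hypothetical record for one CI demo deployment."""
    name: str
    created: datetime


def demos_to_delete(demos, now, max_demos=20, retention=timedelta(days=3)):
    """Return the names of demos to delete, oldest first.

    A demo is selected if it is older than `retention`, or if it is among
    the oldest demos that push the remaining count above `max_demos`.
    The cap of 20 is an assumed placeholder, not a measured limit.
    """
    expired = [d for d in demos if now - d.created > retention]
    kept = sorted(
        (d for d in demos if now - d.created <= retention),
        key=lambda d: d.created,
    )
    # Trim the oldest unexpired demos until at most `max_demos` remain.
    overflow = kept[: max(0, len(kept) - max_demos)]
    return [d.name for d in sorted(expired + overflow, key=lambda d: d.created)]
```

Deleting the oldest demos first keeps the most recent failures available for debugging, at the cost of losing older ones before their 3 days are up when retries are frequent.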