Catalyst/Incidents/2026-01-29

From Wikitech

Time to restore: 45 min

Timeline

Detailed timeline of when incident started, what we did when, and when service was restored

At 10:03 PST an email alert reported that the instance was down. The team noticed a Slack message about the incident and responded 30 minutes later.

The logs on the instance indicated that it had run out of memory. The team restarted the instance and k3s. Restarting k3s initially failed with errors about incorrect passwords. This appears to happen whenever the instance is restarted, and deleting the password files fixes the problem.

At 10:48 PST the instance was deemed healthy.

Symptoms

What users see

The Wikimedia Cloud Services error page.

Overview

What we saw and what we did

We noticed errors in Abstract Wiki/Wikifunctions pods generated by CI. There were over 200 of these pods, which probably caused the instance to run out of memory. We deleted the offending demos.
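A pod count like the one above can be obtained by tallying the JSON output of `kubectl get pods --all-namespaces -o json`. The following is a minimal sketch, not the tooling the team used during the incident. The function name `summarize_pods` is hypothetical, and the sketch relies only on the standard `items[].metadata.namespace` and `items[].status.phase` fields of a Kubernetes pod list.

```python
import json
from collections import Counter


def summarize_pods(pod_list_json: str) -> tuple[Counter, Counter]:
    """Tally pods by namespace and by phase from a Kubernetes pod list in JSON.

    Hypothetical helper for illustration only; the input is assumed to be
    the text printed by `kubectl get pods --all-namespaces -o json`.
    """
    items = json.loads(pod_list_json).get("items", [])
    by_namespace = Counter(item["metadata"]["namespace"] for item in items)
    # status.phase is normally Pending, Running, Succeeded, Failed or Unknown;
    # fall back to "Unknown" if the field is missing.
    by_phase = Counter(
        item.get("status", {}).get("phase", "Unknown") for item in items
    )
    return by_namespace, by_phase
```

Sorting the namespace tally with `by_namespace.most_common()` would show which demos account for most of the pods, which is one way to spot a runaway set of CI retries.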


TODOS

Followup actions

Figure out how to limit the number of CI demos that can exist. When a demo fails on GitLab, it is kept around for 3 days for debugging purposes unless it is manually deleted. If a job is retried many times in succession, the accumulated demos could overwhelm the cluster.
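One possible shape for such a limit is a cleanup policy that keeps the 3-day retention but also caps the total number of demos, deleting the oldest ones first once the cap is exceeded. The sketch below only illustrates that selection logic under stated assumptions: the `Demo` record, the `demos_to_delete` function, and the default cap of 20 are all hypothetical, and no existing Catalyst API is implied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Demo:
    """Hypothetical record for a CI demo environment."""
    name: str
    created_at: datetime


def demos_to_delete(
    demos: list[Demo],
    now: datetime,
    max_demos: int = 20,  # assumed cap, not a value from the incident
    max_age: timedelta = timedelta(days=3),  # retention period from the report
) -> list[str]:
    """Return the names of demos that should be deleted, sorted by name.

    A demo is selected if it is older than max_age, or if it is among the
    oldest demos once more than max_demos unexpired demos exist.
    """
    expired = [d for d in demos if now - d.created_at > max_age]
    fresh = [d for d in demos if now - d.created_at <= max_age]
    # Newest first, so everything past the cap is the oldest surplus.
    fresh.sort(key=lambda d: d.created_at, reverse=True)
    over_cap = fresh[max_demos:]
    return sorted(d.name for d in expired + over_cap)
```

Run periodically, or at the start of each CI job, a policy like this would bound the number of demos regardless of how many retries occur, at the cost of losing the oldest failed demos before the 3-day debugging window ends.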