Catalyst/Incidents/2026-01-27
Appearance
Time to restore: 3:56
Timeline
- 2026-01-28T02:49:00+00:00: InstanceDown alert for k3s; the instance starts to be flaky
- 2026-01-28T04:06:00+00:00: <jeena> Our instance has been crashing over the last 30 min, maybe due to high cpu
- 2026-01-28T04:34:00+00:00: PROBLEM identified: OOM Killer wheel-warring with K8s due to bot traffic
- PLAN: reboot, and merge patchdemo!235 to mitigate bots
- 2026-01-28T04:44:00+00:00:
load average: 443.72, 420.71, 405.30
- 2026-01-28T04:48:00+00:00: REBOOT. <tyler> well k3s is failing to start
/var/lib/rancher/k3s/server/cred/passwd newer than datastore and could cause a cluster outage. Remove the file(s) from disk and restart to be recreated from datastore.
- ACTION:
rm /var/lib/rancher/k3s/server/cred/passwd && systemctl restart k3s
- 2026-01-28T04:51:00+00:00: <tyler> lots of
Jan 28 04:51:02 k3s k3s[2731]: time="2026-01-28T04:51:02Z" level=info msg="Waiting for API server to become available"
[in the logs]
- <jeena> shall I merge now? I was able to get pods
- 2026-01-28T04:51:00+00:00: ACTION merge patchdemo!235
- Deployment bad:
Error: Kubernetes cluster unreachable: an error on the server ("unknown") has prevented the request from succeeding
- 2026-01-28T05:07:00+00:00: We suspect we need to upgrade the GitLab agent and try:
helm upgrade --install api-deploy-agent gitlab/gitlab-agent --namespace gitlab-agent-api-deploy-agent
- 2026-01-28T05:17:00+00:00:
Error: looks like "https://helm.mariadb.com/mariadb-operator" is not a valid chart repository or cannot be reached: write /home/jhuneidi/.cache/helm/repository/mariadb-operator-index.yaml: no space left on device
- 2026-01-28T05:20:00+00:00: <tyler> sure enough: 12GB of syslog
- ACTION:
truncate -s 0 /var/log/syslog
- k3s is filling syslog with
Waiting for API server to become available
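The 12GB syslog was found by hand; a quick way to surface the biggest offenders on a full disk is a `du`/`sort` pipeline (the `LOGDIR` variable is an illustrative convenience, not from the incident):

```shell
# Show the five largest entries under LOGDIR (default /var/log), biggest last
LOGDIR="${LOGDIR:-/var/log}"
du -ah "$LOGDIR" 2>/dev/null | sort -h | tail -n 5
```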
- 2026-01-28T05:22:37+00:00: We realize that we can see the pods, but kubernetes is unhealthy
kubectl get componentstatus
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS      MESSAGE                                                                                       ERROR
scheduler            Unhealthy   Get "https://127.0.0.1:10259/healthz": dial tcp 127.0.0.1:10259: connect: connection refused
etcd-0               Healthy     ok
controller-manager   Unhealthy   Get "https://127.0.0.1:10257/healthz": dial tcp 127.0.0.1:10257: connect: connection refused
- 2026-01-28T06:23:00+00:00: Run with debug to try to find startup error
- Stop k3s
- Truncate syslog (since k3s is filling syslog)
- rm /var/lib/rancher/k3s/server/cred/passwd
- systemctl cat k3s, copy the start command
- run k3s with the --debug flag
time="2026-01-28T06:34:21Z" level=info msg="Cluster-Http-Server 2026/01/28 06:34:21 http: TLS handshake error from 127.0.0.1:56198: remote error: tls: bad certificate"
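The debug steps above as a sketch. The `DRY_RUN` guard is a convenience added here, not part of the incident, and the k3s binary path and bare `server` subcommand are assumptions; in practice the full command line should be copied from `systemctl cat k3s`:

```shell
#!/bin/sh
# Reproduce the manual debug startup from the timeline.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run systemctl stop k3s                             # stop the failing unit
run truncate -s 0 /var/log/syslog                  # reclaim disk from log spam
run rm -f /var/lib/rancher/k3s/server/cred/passwd  # clear the stale cred file
run /usr/local/bin/k3s server --debug              # foreground start, verbose logs
```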
- 2026-01-28T06:43:00+00:00:
mv /var/lib/rancher/k3s/server/tls{,.$(date -I)} && systemctl start k3s
- End of outage
- 2026-01-28T06:43:00+00:00: Deployed bot mitigations, confirmed working
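Checking cert expiry directly would have shortened the hunt. A sketch, assuming the TLS directory layout of a default k3s install (the `serving-kube-apiserver.crt` file name is that assumption, not something from the incident logs):

```shell
# Print the notAfter date of a certificate; with no argument, check the
# k3s API server serving cert at its default location (an assumed path).
cert_enddate() {
  openssl x509 -noout -enddate \
    -in "${1:-/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt}"
}
```

Running `cert_enddate` prints a `notAfter=...` line, which makes an expired cert obvious without wading through syslog.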
Symptoms
- patchdemo unavailable
- wikis unavailable
- OOM killer in horizon logs
- System load extraordinarily high (100+)
Overview
- We restarted the k3s node and added some mitigations for bots
- We got an alert about k3s being down
- On investigation, we found that massive bot traffic was causing high load
- Possibly compounded by test environments running
- We opted to restart the machine and then merge bot mitigations
- Restarting revealed that our tls certs needed rotation
- This was buried in the 12GB+ of log noise generated by a flailing k3s
TODOS
- Move bot mitigations to the ingress
- Create a script for restarting k3s that checks:
- /var/lib/rancher/k3s/server/cred/passwd
- tls certs
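A minimal sketch of such a pre-restart check, as shell functions. The default paths, the sqlite datastore location (`db/state.db`), and the 30-day cert threshold are all assumptions to be verified against the actual install:

```shell
# Pre-restart sanity checks for k3s. K3S_DIR assumes a default install.
K3S_DIR="${K3S_DIR:-/var/lib/rancher/k3s/server}"

# Fail if cred/passwd is newer than the datastore, the exact condition
# that blocked k3s startup during this incident.
check_passwd() {
  if [ -f "$K3S_DIR/cred/passwd" ] && [ -f "$K3S_DIR/db/state.db" ] && \
     [ "$K3S_DIR/cred/passwd" -nt "$K3S_DIR/db/state.db" ]; then
    echo "WARN: $K3S_DIR/cred/passwd is newer than the datastore" >&2
    return 1
  fi
}

# Fail if any server cert expires within 30 days (2592000 s, arbitrary cutoff).
check_certs() {
  for cert in "$K3S_DIR"/tls/*.crt; do
    [ -f "$cert" ] || continue
    openssl x509 -noout -checkend 2592000 -in "$cert" >/dev/null || {
      echo "WARN: $cert expires within 30 days" >&2
      return 1
    }
  done
}
```

Intended use: `check_passwd && check_certs && systemctl restart k3s`, so the restart only proceeds when both checks pass.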