Catalyst/Incidents/2026-01-27

From Wikitech

Time to restore: 3h 56m

Timeline

  • 2026-01-28T02:49:00+00:00: InstanceDown alert for k3s; the instance starts to be flaky
  • 2026-01-28T04:06:00+00:00: <jeena> Our instance has been crashing over the last 30 min, maybe due to high cpu
  • 2026-01-28T04:34:00+00:00: PROBLEM identified: OOM Killer wheel-warring with K8s due to bot traffic
  • 2026-01-28T04:44:00+00:00: load average: 443.72, 420.71, 405.30
  • 2026-01-28T04:48:00+00:00: REBOOT <tyler> well k3s is failing to start
    • /var/lib/rancher/k3s/server/cred/passwd newer than datastore and could cause a cluster outage. Remove the file(s) from disk and restart to be recreated from datastore.
    • ACTION: rm /var/lib/rancher/k3s/server/cred/passwd && systemctl restart k3s
  • 2026-01-28T04:51:00+00:00: <tyler> lots of Jan 28 04:51:02 k3s k3s[2731]: time="2026-01-28T04:51:02Z" level=info msg="Waiting for API server to become available" [in the logs]
    • <jeena> shall I merge now? I was able to get pods
  • 2026-01-28T04:51:00+00:00: ACTION merge patchdemo!235
    • Deployment bad: Error: Kubernetes cluster unreachable: an error on the server ("unknown") has prevented the request from succeeding
  • 2026-01-28T05:07:00+00:00: We suspect we need to upgrade the GitLab agent and try: helm upgrade --install api-deploy-agent gitlab/gitlab-agent --namespace gitlab-agent-api-deploy-agent
  • 2026-01-28T05:17:00+00:00: Error: looks like "https://helm.mariadb.com/mariadb-operator" is not a valid chart repository or cannot be reached: write /home/jhuneidi/.cache/helm/repository/mariadb-operator-index.yaml: no space left on device
  • 2026-01-28T05:20:00+00:00: <tyler> sure enough: 12GB of syslog
    • k3s is filling syslog with Waiting for API server to become available
    • ACTION: truncate -s 0 /var/log/syslog
  • 2026-01-28T05:22:37+00:00: We realize that we can see the pods, but Kubernetes is unhealthy
kubectl get componentstatus      
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS      MESSAGE                                                                                        ERROR
scheduler            Unhealthy   Get "https://127.0.0.1:10259/healthz": dial tcp 127.0.0.1:10259: connect: connection refused   
etcd-0               Healthy     ok                                                                                             
controller-manager   Unhealthy   Get "https://127.0.0.1:10257/healthz": dial tcp 127.0.0.1:10257: connect: connection refused
  • 2026-01-28T06:23:00+00:00: Run with debug to try to find the startup error
    • Stop k3s
    • Truncate syslog (since k3s is filling syslog)
    • rm /var/lib/rancher/k3s/server/cred/passwd
    • systemctl cat k3s, copy the start command
    • run k3s with the --debug flag
    • time="2026-01-28T06:34:21Z" level=info msg="Cluster-Http-Server 2026/01/28 06:34:21 http: TLS handshake error from 127.0.0.1:56198: remote error: tls: bad certificate"
  • 2026-01-28T06:43:00+00:00: ACTION: mv /var/lib/rancher/k3s/server/tls{,.$(date -I)} && systemctl start k3s. End of outage.
  • 2026-01-28T06:43:00+00:00: Deployed bot mitigations, confirmed working
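
The expired-certificate root cause above was buried under 12GB of log noise; checking the cert directly would have surfaced it immediately. A minimal sketch (the `cert_status` helper is hypothetical; the k3s cert path in the usage note is assumed from the default layout seen in this incident):

```shell
# Hypothetical helper: report whether a certificate file has expired.
cert_status() {
    # openssl -checkend 0 succeeds only if the cert is still valid right now
    if openssl x509 -checkend 0 -noout -in "$1" >/dev/null 2>&1; then
        echo "valid"
    else
        echo "expired-or-unreadable"
    fi
}

# Usage (assumed default k3s serving-cert path):
# cert_status /var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt
```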

Symptoms

  • patchdemo unavailable
  • wikis unavailable
  • OOM killer in horizon logs
  • System load extraordinarily high (100+)

Overview

  • We restarted the k3s node and added some mitigations for bots
  • We got an alert about k3s being down
  • On investigation, we found that massive bot traffic was causing high load
    • Possibly compounded by test environments running
  • We opted to restart the machine and then merge bot mitigations
  • Restarting revealed that our TLS certs needed rotation
    • this was buried in the 12GB+ of log noise generated by a flailing k3s

TODOs

  • Move bot mitigations to the ingress
  • Create a script for restarting k3s that checks:
    • /var/lib/rancher/k3s/server/cred/passwd
    • tls certs
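
The restart checks above could start as something like this sketch (the function name and the serving-cert filename are assumptions; the passwd and tls paths come from the timeline):

```shell
#!/usr/bin/env bash
# Hypothetical pre-restart checks for k3s, based on the failure modes
# seen in this incident. Run before restarting the k3s service.
set -u

prepare_k3s_restart() {
    local server_dir="${1:-/var/lib/rancher/k3s/server}"

    # Check 1: a cred/passwd file newer than the datastore blocks startup;
    # k3s recreates it from the datastore, so removing it is safe.
    if [ -f "$server_dir/cred/passwd" ]; then
        echo "removing stale $server_dir/cred/passwd"
        rm -f "$server_dir/cred/passwd"
    fi

    # Check 2: if the serving cert has expired, move the tls dir aside so
    # k3s regenerates certificates on startup.
    local cert="$server_dir/tls/serving-kube-apiserver.crt"
    if [ -f "$cert" ] && ! openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
        echo "rotating expired tls dir to tls.$(date -I)"
        mv "$server_dir/tls" "$server_dir/tls.$(date -I)"
    fi
}

# Usage: prepare_k3s_restart && systemctl restart k3s
```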