Catalyst/Incidents/2025-01-29
Time to restore: 5:32 (5 hours, 32 minutes)
Timeline
- 2025-01-29T17:36:03.719212+00:00: OOM killer starts thrashing and continues thrashing until the soft reboot at 18:38 (per dmesg)
- 2025-01-29T17:42:00+00:00: esanders: "https://patchdemo.wmcloud.org/ is down for me, is that known?"
- Noting that load in Grafana is ~400(?!)
- 2025-01-29T18:38:01+00:00 soft reboot
- 2025-01-29T18:43:00+00:00: successful reboot + SSH; load immediately spikes as k3s spins up (5m avg load > 20)
- 2025-01-29T20:51:17+00:00: Fix Catalyst Environments
- ...: Resize k3s VM 16 GB RAM -> 32 GB
- ...: Fix Catalyst Environments (again)
- 2025-01-29T23:08:00+00:00: All clear
Symptoms
- website doesn't load
- cannot SSH into machine
Overview
- Machine ran out of memory due to too many environments
- Nothing malicious, normal use
- OOM Killer caused full system lock-up
- Reboot got us back online
- Knock-on problems with catalyst environments added complications
Logs from Horizon
[331315.051863] Memory cgroup out of memory: Killed process 5521 (mysqld) total-vm:4204604kB, anon-rss:380944kB, file-rss:6152kB, shmem-rss:0kB, UID:1001 pgtables:1204kB oom_score_adj:985
[687059.151280] Memory cgroup out of memory: Killed process 878386 (mysqld) total-vm:3942160kB, anon-rss:381940kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1184kB oom_score_adj:985
[687059.156277] Memory cgroup out of memory: Killed process 878386 (mysqld) total-vm:3942160kB, anon-rss:382004kB, file-rss:2844kB, shmem-rss:0kB, UID:1001 pgtables:1184kB oom_score_adj:985
[687059.159804] Memory cgroup out of memory: Killed process 2225273 (runc:[2:INIT]) total-vm:1236700kB, anon-rss:3424kB, file-rss:5808kB, shmem-rss:0kB, UID:1001 pgtables:108kB oom_score_adj:985
[1051464.115192] Memory cgroup out of memory: Killed process 2225342 (mysqld) total-vm:4731292kB, anon-rss:381436kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1252kB oom_score_adj:985
[1051464.118194] Memory cgroup out of memory: Killed process 3460173 (runc:[2:INIT]) total-vm:1236536kB, anon-rss:3176kB, file-rss:5816kB, shmem-rss:0kB, UID:1001 pgtables:108kB oom_score_adj:985
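Kill events like the ones above can be reduced to process name and resident memory with a short filter (a sketch matching the message format above; not something we ran during the incident):

```shell
# Extract "<process> <anon-rss>" pairs from OOM-killer lines in dmesg output.
oom_summary() {
  grep 'out of memory: Killed process' |
    sed -E 's/.*Killed process [0-9]+ \(([^)]+)\).*anon-rss:([0-9]+kB).*/\1 \2/'
}
# usage: dmesg | oom_summary
```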
After everything came back, new problems:
- CPU running hot for certain php processes
- Checked /proc/<pid>/environ to find the helm deployment for an out-of-control php process
- Traced to jobrunner on the wiki-5648f3da62-146-mediawiki-b9b775bb8-l8fjv pod
- Thrashing with errors (can't find jobs table in database)
- All Catalyst wikis also reporting sql errors
- Turns out, the mysql instances lost their tables
- Tried to redeploy
- initContainer won't run again if the pod is deleted
- initContainer runs install.sh/post-install.sh, which is how the database is set up, so we can't re-run it using kubectl; we'll have to do a helm uninstall/reinstall
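The /proc/<pid>/environ trick above can be scripted; environ is NUL-separated, so it needs a tr pass before grepping (a sketch; which environment variables identify the helm deployment is an assumption):

```shell
# Print a process's environment one variable per line
# (/proc/<pid>/environ is NUL-separated, not newline-separated).
pid_env() {
  tr '\0' '\n' < "/proc/$1/environ"
}
# usage: pid_env <pid> | grep -i -e helm -e hostname
```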
Noticed inotify.max_user_instances did not persist: https://phabricator.wikimedia.org/T383280
root@k3s:~# sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 128
root@k3s:~# sysctl -w fs.inotify.max_user_instances=1024
fs.inotify.max_user_instances = 1024
root@k3s:~# sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 1024
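A common way to make a sysctl value survive reboots (a sketch of one option; not necessarily the fix chosen in T383280) is a drop-in file under /etc/sysctl.d/, applied with `sysctl --system`:

```
# /etc/sysctl.d/99-inotify.conf  (hypothetical filename)
fs.inotify.max_user_instances = 1024
```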
20:51:17: Bringing catalyst envs back via helm uninstall/reinstall:
- Get the list of catalyst releases:
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -n cat-env | awk '/wiki-/ { print $1 }' > list-envs.txt
- Get a copy of the helm chart: git clone ci-charts
- Copy all values from the running deploys (that have no sql database):
while read i; do helm get values -n cat-env "$i" -o yaml > values/"$i".yaml; done < list-envs.txt
- Uninstall all helm releases:
while read i; do helm uninstall -n cat-env "$i"; done < list-envs.txt
- Reinstall all helm releases:
while read i; do helm install -n cat-env "$i" ci-charts/mediawiki -f values/"$i".yaml; done < list-envs.txt
- Troubleshoot any failing releases: set debug.initContainer=true in values/<helm-release-name>.yaml
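The uninstall/reinstall loops above can be condensed into one script (a sketch only; the published scripts linked below are the real versions). The DRY_RUN flag is hypothetical, added here so the helm commands can be previewed before running:

```shell
#!/usr/bin/env bash
# Reinstall every Catalyst helm release listed in list-envs.txt.
# Assumes values/<release>.yaml files were saved beforehand and a
# ci-charts checkout exists. Set DRY_RUN=1 to print commands instead.
set -u
NS=cat-env

run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

reinstall_all() {
  while read -r rel; do
    run helm uninstall -n "$NS" "$rel"
    run helm install -n "$NS" "$rel" ci-charts/mediawiki -f "values/$rel.yaml"
  done < list-envs.txt
}

# reinstall_all   # call this to run against the cluster
```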
These commands were written as a collection of scripts here: https://gitlab.wikimedia.org/kindrobot/reinstall-bad-catalyst-envs
Upped memory on the k3s VM 16 GB -> 32 GB. The DB problem happened again(!) following the reboot after the VM resize :((
TODOs
Done: Figure out inotify persistence
Done: Figure out the database PV persistence problem
Done: Publish the scripts for redeploying all catalyst patchdemos that we wrote during this incident
Done: Add an API call to reinstall a wiki
Done: Add a button in the UI?
Done: Add CPU and memory limits for wikis
Done: Increase the memory on the k3s box
Done: Sleep if the init container fails, for debugging
Done: The admin password for wikis is incorrect; fix that
Done: Fix: if you leave the page, the patchdemo DB doesn't update
Done: Turn off the checkbox to use catalyst for patch demo (until some of the above is better)
Not done: See if we can make Grafana alerts notify us
23:08: all clear