Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses
Error / Incident
At least one Toolforge worker node has many processes in the 'D' (uninterruptible sleep) state, which points to an IO/NFS issue.
Debugging
If you can, ssh to the node and run htop/top/ps to see the running processes and their state.
You can try checking the journal or dmesg logs for nfs-related entries, for example:
root@tools-k8s-worker-nfs-56:~# journalctl --grep tools-nfs
Mar 04 18:29:19 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Mar 07 00:00:28 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
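If the journal contains many such entries, it can help to reduce them to a short timeline of when the NFS server went away and came back. The following is a minimal sketch, assuming the default journalctl output format shown above; the function name nfs_events is hypothetical and not part of any Toolforge tooling.

```shell
# Hypothetical helper: summarize kernel NFS messages read from stdin.
# Assumes the default "short" journal format: "Mon DD HH:MM:SS host kernel: ...".
nfs_events() {
    awk '
        /nfs: server .* not responding/ { print "DOWN", $1, $2, $3 }
        /nfs: server .* OK$/            { print "UP", $1, $2, $3 }
    '
}

# Example usage on the affected worker:
#   journalctl --grep tools-nfs | nfs_events
```

A DOWN line without a later UP line suggests the server is still unreachable from that worker.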
You can check whether new processes are still getting stuck by trying to ls the users' or tools' home directories from the affected worker. The command might hang, so start it in the background:
root@tools-k8s-worker-nfs-42:~# ls -l /data/project/ &
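The background listing can also be wrapped so that it reports whether it finished within a few seconds. This is only a sketch under some assumptions: the helper name probe_path and the default wait of 5 seconds are made up for illustration, and the state is read from /proc, so it only works on Linux.

```shell
# Hypothetical helper: probe a path without blocking the current shell
# if the filesystem hangs. Usage: probe_path <directory> [seconds]
probe_path() {
    probe_dir="$1"
    wait_secs="${2:-5}"

    # Run the listing in the background so a hang cannot block us.
    ls -l "$probe_dir" >/dev/null 2>&1 &
    probe_pid=$!

    sleep "$wait_secs"

    # Read the process state from /proc; empty output means the process is gone.
    probe_state="$(awk '/^State:/ { print $2 }' "/proc/$probe_pid/status" 2>/dev/null)"

    case "$probe_state" in
        ""|Z) echo "finished" ;;
        *)    echo "stuck pid=$probe_pid state=$probe_state" ;;
    esac
}

# Example usage on the affected worker:
#   probe_path /data/project/ 10
```

A result such as "stuck pid=... state=D" indicates that new processes touching NFS are still hanging. Note that "finished" only means the listing exited; it does not check whether ls succeeded.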
And you can check which processes are getting stuck by searching for processes in 'D' state:
root@tools-k8s-worker-nfs-42:~# ps aux | awk '$8 ~ /^D/' # STAT is the 8th column of the ps aux output
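To see at a glance which commands account for the stuck processes, the ps output can be aggregated per command name. A minimal sketch, assuming procps ps on the worker; the function name count_d_by_command is hypothetical.

```shell
# Hypothetical helper: count processes in 'D' state per command name.
# Reads "STATE COMMAND" pairs from stdin, one process per line.
# Only the first word of the command name is used for grouping.
count_d_by_command() {
    awk '$1 == "D" { count[$2]++ }
         END { for (c in count) print count[c], c }' | sort -rn
}

# Example usage on the affected worker (the "=" suppresses the header line):
#   ps -e -o state= -o comm= | count_d_by_command
```

If most of the entries belong to a single tool, that can be a hint that only its pods need attention.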
Common issues
NFS server went away
If there was an NFS hiccup, you'll see entries like:
root@tools-k8s-worker-nfs-56:~# journalctl --grep tools-nfs Mar 04 18:29:19 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying Mar 07 00:00:28 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
If you are unable to log in or the node is having a lot of trouble, you can just restart that node with the cookbook, for example:
dcaro@urcuchillay$ cookbook wmcs.toolforge.k8s.reboot --cluster-name tools --hostname-list tools-k8s-worker-nfs-21
NFS stuck processes
If new processes work fine (e.g. ls -l of the projects directory returns normally), you can try restarting only the pods that are stuck; the new ones should come up without problems.
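One possible way to prepare the pod restarts is to list the pods scheduled on the affected node and print the matching delete commands for review before running any of them. This is a sketch, not the established procedure: the helper name print_pod_restarts is hypothetical, the node name is a placeholder, and it assumes the pods are managed by a controller (for example a Deployment) that recreates them after deletion.

```shell
# Hypothetical helper: turn "namespace pod-name" pairs read from stdin into
# kubectl delete commands. The commands are printed, not executed.
print_pod_restarts() {
    awk 'NF == 2 { printf "kubectl delete pod --namespace %s %s\n", $1, $2 }'
}

# Example usage (run where kubectl has admin access to the cluster):
#   kubectl get pods --all-namespaces \
#       --field-selector spec.nodeName=tools-k8s-worker-nfs-42 \
#       -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
#       --no-headers | print_pod_restarts
```

Printing instead of executing leaves room to keep only the pods that actually have stuck processes before deleting anything.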
Related information
You can find a graph of the number of 'D' processes per toolforge VM here:
https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview
Old incidents
Add here any new tasks for incidents you might encounter.
- https://phabricator.wikimedia.org/T362690 - new processes were also getting stuck in D state and a host reboot was needed (if it happens again, please resume the debugging)