Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses

The procedures in this runbook require admin permissions to complete.

Error / Incident

At least one Toolforge worker node has many processes in the 'D' (uninterruptible sleep) state, which points to an IO/NFS issue.

Debugging

If you can, SSH to the node and run htop, top, or ps to see the running processes and their state.

You can check the journal or dmesg output for NFS-related entries, for example:

root@tools-k8s-worker-nfs-56:~# journalctl --grep tools-nfs
Mar 04 18:29:19 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Mar 07 00:00:28 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
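When the journal contains many such entries, condensing them into a short timeline can make outage windows easier to spot. The following is a minimal sketch, not part of any existing tooling: `nfs_events` is a hypothetical helper that reads journal lines from standard input and assumes the default short output format shown above.

```shell
# Hypothetical helper: condense kernel NFS client messages into
# "<month> <day> <time> DOWN|UP <server>" lines.
# Assumes the default (short) journalctl output format on stdin.
nfs_events() {
    awk '
    / nfs: server / {
        # The server name is the field right after the word "server".
        srv = ""
        for (i = 1; i < NF; i++) if ($i == "server") srv = $(i + 1)
        state = ""
        if ($0 ~ /not responding/) state = "DOWN"
        else if ($0 ~ / OK$/)      state = "UP"
        if (state != "") print $1, $2, $3, state, srv
    }'
}
```

It could be used as `journalctl --grep tools-nfs | nfs_events`. A DOWN line without a later UP line for the same server suggests the mount is still unresponsive.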

To check whether new processes are still getting stuck, try listing the users' or tools' home directories from the affected worker. The command might hang, so start it in the background:

root@tools-k8s-worker-nfs-42:~# ls -l /data/project/ &
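If you prefer a yes/no answer over watching the background job by hand, the same check can be wrapped with a deadline. This is only a sketch under assumptions: `probe_with_deadline` is a hypothetical helper name, and the 10-second deadline in the usage line is an arbitrary choice.

```shell
# Hypothetical helper: run a command in the background and report whether it
# finished within <deadline> seconds. The probe is never waited on with a
# blocking wait, so a probe stuck in D state cannot hang the calling shell.
probe_with_deadline() {
    deadline="$1"
    shift
    # Detach the probe's output so callers capturing stdout are not blocked.
    "$@" >/dev/null 2>&1 &
    probe_pid=$!
    elapsed=0
    # kill -0 only tests whether the process still exists.
    while kill -0 "$probe_pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            echo "still running after ${deadline}s (pid ${probe_pid})"
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "finished"
    return 0
}
```

For example, `probe_with_deadline 10 ls -l /data/project/` prints `finished` if the listing returns in time. If it reports that the probe is still running, the printed pid can be checked for 'D' state with ps.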

And you can check which processes are getting stuck by searching for processes in 'D' state:

root@tools-k8s-worker-nfs-42:~# ps -eo pid,stat,wchan:32,cmd | awk 'NR == 1 || $2 ~ /^D/'  # header line plus processes whose state starts with D
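To see what the stuck processes have in common, it can help to group them by the kernel function they are waiting in and by command name. The sketch below is an illustration only: `d_state_summary` is a hypothetical helper that reads `ps -eo stat,wchan:32,comm` output (including its header line) from standard input.

```shell
# Hypothetical helper: count D-state processes per (wait channel, command).
# Expects `ps -eo stat,wchan:32,comm` output, header included, on stdin.
# Only the first word of the command name is used for grouping.
d_state_summary() {
    awk 'NR > 1 && $1 ~ /^D/ { count[$2 " " $3]++ }
         END { for (k in count) print count[k], k }' | sort -rn
}
```

It could be run as `ps -eo stat,wchan:32,comm | d_state_summary`. Many entries sharing an NFS- or RPC-related wait channel would be consistent with the NFS issues described in this runbook.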

Common issues

NFS server went away

If there was an NFS hiccup, you'll see entries like:

root@tools-k8s-worker-nfs-56:~# journalctl --grep tools-nfs
Mar 04 18:29:19 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Mar 07 00:00:28 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK

If you are unable to log in, or the node is having a lot of trouble, you can restart that node with the cookbook, for example:

dcaro@urcuchillay$ cookbook wmcs.toolforge.k8s.reboot --cluster-name tools --hostname-list tools-k8s-worker-nfs-21

NFS stuck processes

If new processes work fine (e.g. ls -l of the projects directory returns promptly), you can try restarting only the pods that are stuck; the new ones should come up without problems.
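One way to find which pods own the stuck processes is to read each process's cgroup path, which on Kubernetes nodes normally embeds the pod UID. The following is a sketch under assumptions: `pod_uids_from_cgroup` is a hypothetical helper, and it assumes the cgroup path contains `pod<uid>`, with underscores instead of dashes when the systemd cgroup driver is in use. Verify the layout on the node before relying on it.

```shell
# Hypothetical helper: extract unique pod UIDs from /proc/<pid>/cgroup lines
# read on stdin. Assumes paths containing "pod<uid>", where the UID may use
# "_" instead of "-" (systemd cgroup driver); output always uses "-".
pod_uids_from_cgroup() {
    sed -n 's/.*pod\([0-9a-f]\{8\}[-_][0-9a-f]\{4\}[-_][0-9a-f]\{4\}[-_][0-9a-f]\{4\}[-_][0-9a-f]\{12\}\).*/\1/p' |
        tr '_' '-' | sort -u
}

# Usage on the affected worker (not run here):
#   for pid in $(ps -eo pid,stat | awk 'NR > 1 && $2 ~ /^D/ { print $1 }'); do
#       cat "/proc/$pid/cgroup"
#   done | pod_uids_from_cgroup
```

Each UID can then be matched against the UID column of `kubectl get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,UID:.metadata.uid`, and the matching pod restarted, for example by deleting it so that its controller recreates it.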

Related information

You can find a graph of the number of 'D' processes per Toolforge VM here:

https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview

Old incidents

Add any new tasks for incidents you encounter here.

https://phabricator.wikimedia.org/T362690 - new processes were also getting stuck in D state and the host needed a reboot (if it happens again, please resume the debugging)

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support

Chat in real time in the IRC channel #wikimedia-cloud or the bridged Telegram group
Discuss via email after you have subscribed to the cloud@ mailing list

Stay aware of critical changes and plans

Subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
Read the News wiki page

Track work tasks and report bugs

Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself

Read stories and WMCS blog posts

Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)