This happens when one or more queues are in an unhealthy state.
Error / Incident
There you can see which queue is affected and which error states it is in.
You can get the list of all failed jobs, queues and nodes with the cookbook:
dcaro@vulcanus$ cookbook wmcs.toolforge.grid.get_cluster_status --project tools --only-failed ...
If the queues are in UNKNOWN status, even if they are also in ERROR, it most likely means that the exec service on the node died; see below for how to start it up again.
If the queues are also in DISABLED status, the node was taken out of the pool.
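The triage rules above can be sketched as a small shell helper. This is only an illustration: `suggest_next_steps` is a hypothetical name, and it assumes you pass it the status words reported for a queue as a plain string; the exact output format of the cookbook is not assumed.

```shell
# Hypothetical helper: map the status words of a queue to the next step
# suggested in this runbook.
# $1: status words reported for the queue, e.g. "UNKNOWN ERROR"
suggest_next_steps() {
    # UNKNOWN takes precedence over ERROR, as described above.
    case "$1" in
        *UNKNOWN*) echo "exec service probably died: restart sge_execd on the node" ;;
        *ERROR*)   echo "queue in error state: inspect it with qstat -explain E" ;;
    esac
    # DISABLED is reported in addition to the other states.
    case "$1" in
        *DISABLED*) echo "node was taken out of the pool" ;;
    esac
}
```

For example, `suggest_next_steps "UNKNOWN ERROR"` prints only the hint about the exec service.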
If you ssh to the node with the problem, you can get a description of the errors with:
root@tools-sgeexec-0916:~# qstat -explain E -xml | grep -e name -e state -e message
This will give you something like:
... <message>queue continuous marked QERROR as result of job 1143597's failure at host tools-sgeexec-0916.tools.eqiad.wmflabs</message> ...
Debugging a failed job
In this case we can see that the queue continuous was marked as QERROR because job 1143597 failed. We can check why by grepping the logs under /data/project/.system_sge/gridengine/spool/qmaster/:
root@tools-sgeexec-0916:~# grep 1143597 /data/project/.system_sge/gridengine/spool/qmaster/messages*
/data/project/.system_sge/gridengine/spool/qmaster/messages.1:02/26/2022 18:03:50|worker|tools-sgegrid-master|W|job 1143597.1 failed on host tools-sgeexec-0916.tools.eqiad.wmflabs general assumedly before job because: can't create directory active_jobs/1143597.1: No space left on device
One-liner to get the logs of all the failed jobs:
root@tools-sgeexec-0916:~# for job in $(qstat -explain E -xml | grep -e name -e state -e message | grep -o 'job [^ ]*' | grep -o '[0-9]*'); do echo; echo "#################### job $job"; grep $job /data/project/.system_sge/gridengine/spool/qmaster/messages*; done
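If the one-liner is hard to read, the job id extraction can be split into a small function. This is a sketch, not a tested replacement: `extract_failed_job_ids` is a hypothetical name, and it assumes the messages follow the "job <id>'s failure" wording shown in the sample output above.

```shell
# Read `qstat -explain E -xml` output on stdin and print each failed job id once.
# Assumes the "job <id>'s failure" wording from the sample message above.
extract_failed_job_ids() {
    grep -o "job [0-9][0-9]*'s failure" | grep -o '[0-9][0-9]*' | sort -u
}

# Possible usage on an exec node (same qstat call and log path as above):
#   qstat -explain E -xml | extract_failed_job_ids | while read -r job; do
#       echo "#################### job $job"
#       grep "job $job" /data/project/.system_sge/gridengine/spool/qmaster/messages*
#   done
```

Matching on the full "job <id>'s failure" phrase avoids picking up other numbers in the message, such as the digits in the host name.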
If that does not work, you can ssh to the exec node and look into its own syslog:
root@tools-sgeexec-0916:~# grep sge_shepherd /var/log/syslog*
When the exec node's disk gets full, the fix is to restart the exec node and then clear the queues' error state.
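To confirm that a full disk is the cause (as in the "No space left on device" message above), you can check the filesystem usage on the node. The sketch below wraps `df -P`; the function names and the 95% default threshold are illustrative assumptions, not part of the existing tooling.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds the given path, based on POSIX `df -P` output.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report whether the filesystem is nearly full.
# $1: path to check; $2: threshold in percent (95 is an arbitrary example).
check_disk_full() {
    path="$1"
    threshold="${2:-95}"
    used="$(disk_usage_percent "$path")"
    if [ "$used" -ge "$threshold" ]; then
        echo "WARNING: $path is ${used}% full"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}
```

For example, `check_disk_full /` checks the root filesystem; which filesystem actually holds the job directories on a given node is something to verify on the node itself.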
Exec service died
To restart the exec service on the node, you have to stop and start it as separate steps (this is because we use an old init script):
root@tools-sgeexec-0913:~# systemctl stop sge_execd.service
root@tools-sgeexec-0913:~# systemctl start sge_execd.service
root@tools-sgeexec-0913:~# systemctl status sge_execd.service
If the service does not show as running, you can try debugging the issue by looking at the logs here:
root@tools-sgeexec-0913:~# tail /var/spool/gridengine/execd/tools-sgeexec-0913/messages
Sometimes it crashes when there's a job directory that's inconsistent with what it expects; in that case you can remove the directory and try again.
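The stop/start/verify sequence can be wrapped in a small function, sketched below. `restart_execd` and the `SYSTEMCTL` override are hypothetical conveniences; the override exists only so the sequence can be dry-run with `SYSTEMCTL=echo`.

```shell
# SYSTEMCTL can be overridden (e.g. SYSTEMCTL=echo) for a dry run.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Stop and start the exec service as two separate steps, then verify it is active.
restart_execd() {
    service="sge_execd.service"
    "$SYSTEMCTL" stop "$service"
    "$SYSTEMCTL" start "$service"
    if "$SYSTEMCTL" is-active --quiet "$service"; then
        echo "$service is running"
    else
        echo "$service is not running, check the execd messages log" >&2
        return 1
    fi
}
```

Run as root on the exec node, `restart_execd` returns non-zero if the service is not active after the start, which is the cue to look at the execd messages log.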
Epilog failed on webgrid nodes
The grid supports a mechanism called prolog/epilog, which is a way to run some code before/after the job itself is run. In the case of the webgrid, the epilog runs portgrabber and portreleaser (see Portal:Toolforge/Admin/Dynamicproxy#Grid_web_services) to hook the webservice to the front proxy.
If a web job fails with something related to epilog or prolog, then it is likely that it failed to allocate/release a port. This shouldn't be a big deal unless there is a pattern.
This can be confirmed by checking the logs on the active front proxy server:
user@tools-proxy-06:~$ grep -i failed /var/log/proxylistener
[..]
2022-03-10 09:30:13,448 Identd auth failed, sent 33264,8282 got back 33264,8282:ERROR:UNKNOWN-ERROR
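To judge whether the failures form a pattern, it can help to count them per day. The sketch below assumes every log line starts with the date, as in the sample line above; `count_failures_per_day` is a hypothetical helper name.

```shell
# Read proxylistener log lines on stdin and print "<date> <count>" for every
# day that has at least one line containing "failed" (case-insensitive).
# Assumes the date is the first whitespace-separated field of each line.
count_failures_per_day() {
    awk 'tolower($0) ~ /failed/ { count[$1]++ }
         END { for (day in count) print day, count[day] }' | sort
}

# Possible usage on the active front proxy:
#   count_failures_per_day < /var/log/proxylistener
```

A single failure on one day is probably noise; a steady or growing count across several days suggests a real problem with port allocation.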
If the error is fixed/known, you can clear up the queues error state to let new jobs get scheduled.