Obsolete:Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem
Overview
This happens when one or more grid queues are in an unhealthy state.
Error / Incident
This usually comes in the form of an alert in alertmanager; an example is phab:T302702.
The alert tells you which queue is affected and which error states it is in.
Debugging
You can get the list of all failed jobs, queues and nodes with the cookbook:
dcaro@vulcanus$ cookbook wmcs.toolforge.grid.get_cluster_status --project tools --only-failed
...
You will also see an extended summary of the errors, with a bit more information:
###### Failed queues extended info
- !!python/object:cookbooks.wmcs.libs.grid.GridQueueInfo
  ...
  messages:
  - queue continuous marked QERROR as result of job 1143597's failure at host tools-sgeexec-0916.tools.eqiad1.wikimedia.cloud
  name: continuous@tools-sgeexec-0916.tools.eqiad1.wikimedia.cloud
  ...
  states: !GridQueueStatesSet
  - !GridQueueState 'ERROR'
  ...
If the queues have ALARM1 and UNKNOWN status, even if there's also ERROR, that most likely means that the exec service on the node died; see "Exec service died" below for how to start it up again.
If the queues also have DISABLED, then the node was taken out of the pool.
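The state combinations above can be turned into a quick triage helper. This is only a sketch: the function name diagnose_queue_states is hypothetical (it is not part of the cookbook), and it assumes you pass the state names exactly as the cookbook prints them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a list of queue states to the hints in this runbook.
# Usage: diagnose_queue_states ALARM1 UNKNOWN ERROR
diagnose_queue_states() {
    # Pad with spaces so each state can be matched as a whole word.
    local states=" $* "
    local hint=""
    if [[ "$states" == *" ALARM1 "* && "$states" == *" UNKNOWN "* ]]; then
        # ALARM1 + UNKNOWN wins even when ERROR is also present.
        hint="exec service on the node probably died"
    elif [[ "$states" == *" ERROR "* ]]; then
        hint="a job failure marked the queue as errored; check the qmaster logs"
    fi
    if [[ "$states" == *" DISABLED "* ]]; then
        # DISABLED is reported in addition to any other hint.
        hint="${hint:+$hint; }node was taken out of the pool"
    fi
    echo "${hint:-no known failure pattern}"
}
```

For example, `diagnose_queue_states ALARM1 UNKNOWN ERROR` points at the exec service, while `diagnose_queue_states ERROR` points at the job logs.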
Debugging a failed job
In this case we can see that the queue continuous was marked as QERROR because the job 1143597 failed. We can check why by grepping the logs under /data/project/.system_sge/gridengine/spool/qmaster/messages*:
root@tools-sgeexec-0916:~# grep 1143597 /data/project/.system_sge/gridengine/spool/qmaster/messages*
/data/project/.system_sge/gridengine/spool/qmaster/messages.1:02/26/2022 18:03:50|worker|tools-sgegrid-master|W|job 1143597.1 failed on host tools-sgeexec-0916.tools.eqiad.wmflabs general assumedly before job because: can't create directory active_jobs/1143597.1: No space left on device
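In a line like the one above, the actual cause is whatever follows "because:". When many lines match, a small filter can reduce them to job id and reason. The sketch below assumes every relevant line has the same "job ID failed on host ... because: REASON" shape as the example; the function name failure_reasons is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read qmaster message lines on stdin and print
# "<job id>: <reason>" for each "job ... failed on host ... because:" line.
# Lines that do not match this shape are silently dropped.
failure_reasons() {
    sed -n 's/.*|job \([0-9][0-9.]*\) failed on host .* because: \(.*\)$/\1: \2/p'
}
```

It can be fed directly from grep, for example `grep 1143597 /data/project/.system_sge/gridengine/spool/qmaster/messages* | failure_reasons`.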
One-liner to get the logs of all the failed jobs:
root@tools-sgeexec-0916:~# for job in $(qstat -explain E -xml | grep -e name -e state -e message | grep -o 'job [^ ]*' | grep -o '[0123456789]*'); do echo; echo "#################### job $job"; grep $job /data/project/.system_sge/gridengine/spool/qmaster/messages*; done
If that does not work, you can ssh to the exec node and look into its own syslog:
root@tools-sgeexec-0916:~# grep sge_shepherd /var/log/syslog*
Common issues
Disk full
When the disk of an exec node fills up, fix it by restarting the exec node and then clearing the error state of its queues.
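Before restarting, it can help to confirm that a disk is actually full. The following is a minimal sketch that filters the output of `df -P` (the POSIX output format); the function name full_filesystems and the default threshold of 95 percent are assumptions, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above the given threshold (percent).
# Usage: df -P | full_filesystems 95
full_filesystems() {
    local threshold="${1:-95}"
    # Column 5 is "Capacity" (e.g. 100%), column 6 is the mount point.
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= t + 0) print $6, $5
    }'
}
```

Run it on the exec node as `df -P | full_filesystems`; no output means no filesystem reached the threshold.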
Exec service died
To restart the exec service on the node, you have to stop it and start it as separate steps (this is because we use an old init script):
root@tools-sgeexec-0913:~# systemctl stop sge_execd.service
root@tools-sgeexec-0913:~# systemctl start sge_execd.service
root@tools-sgeexec-0913:~# systemctl status sge_execd.service
If the status does not show the service as running, you can debug the issue by looking at the logs here:
root@tools-sgeexec-0913:~# tail /var/spool/gridengine/execd/tools-sgeexec-0913/messages
Sometimes the service crashes because a job directory is inconsistent with what it expects. In that case you can remove that directory and try again.
Epilog failed on webgrid nodes
Example: phabricator T304816 - Toolforge grid queue problem: epilog failed
The grid supports a mechanism called prolog/epilog, which is a way to run some code before/after the job itself is run. In the case of the webgrid, the epilog runs portgrabber and portreleaser (see Portal:Toolforge/Admin/Dynamicproxy#Grid_web_services) to hook the webservice to the front proxy.
If a web job fails with something related to epilog or prolog, then it is likely that it failed to allocate/release a port. This shouldn't be a big deal unless there is a pattern.
This can be confirmed by checking the logs on the active front proxy server:
user@tools-proxy-06:~$ grep -i failed /var/log/proxylistener
[..]
2022-03-10 09:30:13,448 Identd auth failed, sent 33264,8282 got back 33264,8282:ERROR:UNKNOWN-ERROR
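Since isolated failures are not a concern but a pattern is, counting failures per day is a quick way to tell the two apart. This sketch assumes every log line starts with a date in the YYYY-MM-DD form shown above; the function name failures_per_day is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read proxylistener log lines on stdin and print
# "<date> <count>" for each day that has lines containing "failed"
# (case-insensitive), sorted by date.
failures_per_day() {
    awk 'tolower($0) ~ /failed/ { count[$1]++ }
         END { for (day in count) print day, count[day] }' | sort
}
```

On the proxy this would be run as `failures_per_day < /var/log/proxylistener`; a sudden jump in the daily count suggests a systematic problem rather than a one-off.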
Cleaning up
If the error is fixed/known, you can clear the queues' error state (see Portal:Toolforge/Admin/Grid#Clearing_error_state) to let new jobs get scheduled.