Obsolete:Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem

This page contains historical information. It may be outdated or unreliable.

2024

Overview

This happens when one or more queues are in an unhealthy state.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager; for an example, see phab:T302702.

The alert tells you which queue is affected and which error states it is in.

Debugging

You can get the list of all failed jobs, queues and nodes with the cookbook:

03:01 PM <operations-cookbooks-python3> ~/Work/wikimedia/operations-cookbooks  (wmcs|…1)
dcaro@vulcanus$ cookbook  wmcs.toolforge.grid.get_cluster_status --project tools --only-failed
...

You will also see an extended summary of the errors, with a bit more information:

###### Failed queues extended info
- !!python/object:cookbooks.wmcs.libs.grid.GridQueueInfo
...
 messages:
 - queue continuous marked QERROR as result of job 1143597's failure at host
   tools-sgeexec-0916.tools.eqiad1.wikimedia.cloud
 name: continuous@tools-sgeexec-0916.tools.eqiad1.wikimedia.cloud
...
 states: !GridQueueStatesSet
 - !GridQueueState 'ERROR'
...

If the queues are in the ALARM1 and UNKNOWN states, even if ERROR is also present, the exec service on the node has most likely died; see "Exec service died" below for how to start it up again.

If the queues are also DISABLED, it means that the node was taken out of the pool.
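The state rules above can be sketched as a small shell function. This is only an illustration: `diagnose_queue_states` is a hypothetical helper, not part of the cookbook, and it assumes that any state name starting with ALARM counts as the alarm state.

```shell
# Hypothetical helper (not part of the cookbook): map the state names
# reported for a queue to the likely cause described in this runbook.
# Usage: diagnose_queue_states STATE [STATE...]
diagnose_queue_states() {
    has_alarm=0 has_unknown=0 has_disabled=0 has_error=0
    for state in "$@"; do
        case "$state" in
            ALARM*)   has_alarm=1 ;;    # assumption: ALARM1 etc. all mean alarm
            UNKNOWN)  has_unknown=1 ;;
            DISABLED) has_disabled=1 ;;
            ERROR)    has_error=1 ;;
        esac
    done
    # DISABLED is checked first because it can appear alongside the others.
    if [ "$has_disabled" -eq 1 ]; then
        echo "node was taken out of the pool"
    elif [ "$has_alarm" -eq 1 ] && [ "$has_unknown" -eq 1 ]; then
        echo "exec service likely died"
    elif [ "$has_error" -eq 1 ]; then
        echo "queue set to error by a failed job"
    else
        echo "no pattern from this runbook"
    fi
}
```

For example, `diagnose_queue_states ALARM1 UNKNOWN ERROR` prints `exec service likely died`.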

Debugging a failed job

In this case we can see that the queue continuous was marked as QERROR because the job 1143597 failed. We can check why by grepping the logs under /data/project/.system_sge/gridengine/spool/qmaster/messages*:

root@tools-sgeexec-0916:~# grep 1143597 /data/project/.system_sge/gridengine/spool/qmaster/messages*
/data/project/.system_sge/gridengine/spool/qmaster/messages.1:02/26/2022 18:03:50|worker|tools-sgegrid-master|W|job 1143597.1 failed on host tools-sgeexec-0916.tools.eqiad.wmflabs general assumedly before job because: can't create directory active_jobs/1143597.1: No space left on device
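When the grep returns many lines, it can help to keep only the failure reason. The sketch below assumes that the relevant lines contain the marker `because: ` as in the sample above; `failure_reason` is a hypothetical name, not an existing tool.

```shell
# Hypothetical helper: read qmaster "messages" lines on stdin and print
# only the text after the last "because: " on each line.
# Lines without that marker are skipped.
failure_reason() {
    sed -n 's/^.*because: //p'
}
```

It can be fed from the same grep, for example `grep 1143597 /data/project/.system_sge/gridengine/spool/qmaster/messages* | failure_reason`.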

One-liner to get the logs of all the failed jobs:

root@tools-sgeexec-0916:~# for job in $(qstat -explain E -xml | grep -e name -e state -e message | grep -o 'job [^ ]*' | grep -o '[0123456789]*'); do echo; echo "#################### job $job"; grep $job /data/project/.system_sge/gridengine/spool/qmaster/messages*; done
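The job ID extraction step of the one-liner can also be written as a small function that removes duplicates, so each job is only grepped once. This is a sketch under the assumption that the messages mention jobs as `job <number>`, as in the QERROR message shown earlier; `extract_job_ids` is a hypothetical name.

```shell
# Hypothetical helper: read text on stdin (for example the output of
# "qstat -explain E -xml") and print each job ID mentioned as
# "job <number>" once, sorted.
extract_job_ids() {
    grep -o 'job [0-9][0-9]*' | awk '{ print $2 }' | sort -u
}
```

It would replace the inner pipeline of the loop, for example `for job in $(qstat -explain E -xml | extract_job_ids); do ...; done`.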

If that does not work, you can ssh to the exec node and look into its own syslog:

root@tools-sgeexec-0916:~# grep sge_shepherd /var/log/syslog*

Common issues

Disk full

When the disk of the exec node fills up, the fix is to restart the exec node and then clear up the error state of the queues.
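To confirm that a full disk is the cause before restarting, you can check the usage of the affected filesystem. The sketch below is a generic check built on `df -P`; the function names and the default threshold of 95 are assumptions, and which path to check depends on where the node keeps its job directories.

```shell
# Hypothetical helper: print the usage percentage (without the % sign)
# of the filesystem that holds the given path.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed if usage is at or above the threshold.
# The default threshold of 95 is an arbitrary choice for illustration.
disk_nearly_full() {
    [ "$(disk_usage_percent "$1")" -ge "${2:-95}" ]
}
```

For example, `disk_nearly_full / && echo "disk is nearly full"` checks the root filesystem.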

Exec service died

To restart the exec service on the node, you have to stop it and start it as separate steps (this is because we use an old init script):

root@tools-sgeexec-0913:~# systemctl stop sge_execd.service
root@tools-sgeexec-0913:~# systemctl start sge_execd.service
root@tools-sgeexec-0913:~# systemctl status sge_execd.service

If the status shows that the service did not start, you can try debugging the issue by looking at the logs here:

root@tools-sgeexec-0913:~# tail /var/spool/gridengine/execd/tools-sgeexec-0913/messages

Sometimes it crashes when there's a job directory that is inconsistent with what it expects; in that case you can remove that directory and try again.

Epilog failed on webgrid nodes

Example: phabricator T304816 - Toolforge grid queue problem: epilog failed

The grid supports a mechanism called prolog/epilog, which is a way to run some code before/after the job itself is run. In the case of the webgrid, the epilog runs portgrabber and portreleaser (see Portal:Toolforge/Admin/Dynamicproxy#Grid_web_services) to hook the webservice to the front proxy.

If a web job fails with something related to epilog or prolog, then it is likely that it failed to allocate/release a port. This shouldn't be a big deal unless there is a pattern.

This can be confirmed by checking the logs on the active front proxy server:

user@tools-proxy-06:~$ grep -i failed /var/log/proxylistener
[..]
2022-03-10 09:30:13,448 Identd auth failed, sent 33264,8282 got back 33264,8282:ERROR:UNKNOWN-ERROR
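To judge whether there is a pattern, it can help to count the failures per day. The sketch below assumes that every log line starts with a date in the `YYYY-MM-DD` form, as in the sample line above; `failures_per_day` is a hypothetical name.

```shell
# Hypothetical helper: read proxylistener log lines on stdin and print
# "<date> <count>" for every day with lines containing "failed"
# (case-insensitive), sorted by date.
failures_per_day() {
    awk 'tolower($0) ~ /failed/ { count[$1]++ }
         END { for (day in count) print day, count[day] }' | sort
}
```

For example, `failures_per_day < /var/log/proxylistener` prints one line per day; a sudden jump in the count suggests a pattern worth investigating.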

Cleaning up

If the error is fixed/known, you can clear up the queues' error state (see Portal:Toolforge/Admin/Grid#Clearing_error_state) to let new jobs get scheduled.

Related information

Old incidents

phab:T302702