Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem

Overview

This alert fires when one or more grid queues are in an unhealthy state.

The procedures in this runbook require admin permissions to complete.

Error / Incident

This usually comes in the form of an alert in alertmanager; for an example, see phab:T302702.

The alert tells you which queue is affected and which error states it is in.

Debugging

You can get the list of all failed jobs, queues and nodes with the cookbook:

03:01 PM <operations-cookbooks-python3> ~/Work/wikimedia/operations-cookbooks  (wmcs|…1)
dcaro@vulcanus$ cookbook -c ~/.config/spicerack/cookbook.yaml  wmcs.toolforge.grid.get_cluster_status --project tools --only-failed
...

If the queues are in ALARM1 and UNKNOWN status, even if they also show ERROR, the exec service on the node has most likely died; see "Exec service died" below for how to start it again.

If the queues are also DISABLED, the node was taken out of the pool.
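
The two rules above can be written down as a small triage helper. This is only a sketch: the function names are hypothetical, and the state names are assumed to match what the alert or the cookbook prints.

```shell
# Hypothetical helper: return success if STATE appears in the remaining arguments.
has_state() {
    wanted=$1
    shift
    for s in "$@"; do
        [ "$s" = "$wanted" ] && return 0
    done
    return 1
}

# Hypothetical helper: map a queue's states to the likely cause given in this runbook.
diagnose_queue_states() {
    if has_state DISABLED "$@"; then
        echo "node was taken out of the pool"
    elif has_state ALARM1 "$@" && has_state UNKNOWN "$@"; then
        echo "exec service on the node probably died"
    else
        echo "inspect the node with qstat -explain E"
    fi
}
```

For example, `diagnose_queue_states ALARM1 UNKNOWN ERROR` points at the exec service, while a lone `ERROR` falls through to the manual inspection described next.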

If you ssh to the node with the problem, you can get a description of the errors with:

root@tools-sgeexec-0916:~# qstat -explain E -xml | grep -e name -e state -e message

This will give you something like:

 ...
      <message>queue continuous marked QERROR as result of job 1143597's failure at host tools-sgeexec-0916.tools.eqiad.wmflabs</message>
 ...

Debugging a failed job

In this case we can see that the queue continuous was marked as QERROR because the job 1143597 failed. We can check why by grepping the logs under /data/project/.system_sge/gridengine/spool/qmaster/messages*:

root@tools-sgeexec-0916:~# grep 1143597 /data/project/.system_sge/gridengine/spool/qmaster/messages*
/data/project/.system_sge/gridengine/spool/qmaster/messages.1:02/26/2022 18:03:50|worker|tools-sgegrid-master|W|job 1143597.1 failed on host tools-sgeexec-0916.tools.eqiad.wmflabs general assumedly before job because: can't create directory active_jobs/1143597.1: No space left on device
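
When many jobs failed, the qmaster lines get long. A minimal sketch of a filter that keeps only the job, the host and the reason is shown here; `summarize_job_failures` is a hypothetical name, and the pattern assumes the line format shown in the example above.

```shell
# Hypothetical filter: read qmaster "messages" lines on stdin and print
# "job | host | reason" for every "job ... failed on host ..." entry.
# Lines that do not match the assumed format are silently skipped.
summarize_job_failures() {
    sed -n 's/.*|job \([0-9][0-9.]*\) failed on host \([^ ]*\) .*because: \(.*\)$/\1 | \2 | \3/p'
}
```

It can be appended to the grep above, for example `grep 1143597 /data/project/.system_sge/gridengine/spool/qmaster/messages* | summarize_job_failures`.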


One-liner to get the logs for all the failed jobs:

root@tools-sgeexec-0916:~# for job in $(qstat -explain E -xml | grep -e name -e state -e message | grep -o 'job [^ ]*' | grep -o '[0123456789]*'); do echo; echo "#################### job $job"; grep $job /data/project/.system_sge/gridengine/spool/qmaster/messages*; done

If that does not work, you can ssh to the exec node and look into its own syslog:

root@tools-sgeexec-0916:~# grep sge_shepherd /var/log/syslog*

Common issues

Disk full

When the disk of an exec node fills up, fix it by restarting the exec node and then clearing the error state of its queues.
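
Before restarting, it can help to confirm which filesystem is actually full. The following is a sketch only: `full_filesystems` is a hypothetical helper and the 95% default is an arbitrary assumption, not a Toolforge threshold.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above the given percentage (default 95).
full_filesystems() {
    threshold=${1:-95}
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5            # "Capacity" column, e.g. "100%"
            sub(/%/, "", use)
            if (use + 0 >= limit) print $6, $5
        }'
}
```

On the node this would be run as `df -P | full_filesystems 95`. For the second step, the Grid Engine `qmod` man page documents `-cq` for clearing the error state of queues; which queue selector to pass on Toolforge is not covered here.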

Exec service died

To restart the exec service on the node, you'll have to stop it and start it as separate steps (this is because we use an old init script):

root@tools-sgeexec-0913:~# systemctl stop sge_execd.service
root@tools-sgeexec-0913:~# systemctl start sge_execd.service
root@tools-sgeexec-0913:~# systemctl status sge_execd.service

If the status does not show the service as running, you can try debugging the issue by looking at the logs here:

root@tools-sgeexec-0913:~# tail /var/spool/gridengine/execd/tools-sgeexec-0913/messages

Sometimes it crashes when there's a job directory that is inconsistent with what it expects. In that case you can remove the directory and try again.

Epilog failed on webgrid nodes

Example: phabricator T304816 - Toolforge grid queue problem: epilog failed

The grid supports a mechanism called prolog/epilog, which is a way to run some code before (prolog) and after (epilog) the job itself is run. In the case of the webgrid, the epilog runs portgrabber and portreleaser (see Portal:Toolforge/Admin/Dynamicproxy#Grid_web_services) to hook the webservice to the front proxy.

If a web job fails with something related to epilog or prolog, then it is likely that it failed to allocate/release a port. This shouldn't be a big deal unless there is a pattern.

This can be confirmed by checking the logs on the active front proxy server:

user@tools-proxy-06:~$ grep -i failed /var/log/proxylistener
[..]
2022-03-10 09:30:13,448 Identd auth failed, sent 33264,8282 got back 33264,8282:ERROR:UNKNOWN-ERROR
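
To judge whether the failures form a pattern, the matching lines can be counted per day. This is a rough sketch: `failures_per_day` is a hypothetical helper, and it assumes each log line starts with a `YYYY-MM-DD` date as in the sample above.

```shell
# Hypothetical helper: read proxylistener log lines on stdin and print
# "date count" for every day that has at least one line containing "failed".
failures_per_day() {
    awk '
        tolower($0) ~ /failed/ && $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
            count[$1]++
        }
        END {
            for (day in count) print day, count[day]
        }' | sort
}
```

Usage would be `failures_per_day < /var/log/proxylistener`; a sudden jump on one day is a hint that the problem is not a one-off.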

Related information

Old incidents
