Portal:Toolforge/Admin/New grid engine exec host
Checklists/procedures for standing up a new Grid Engine exec node.
Note: this checklist is probably not completely correct or up to date. Please update the guide if you encounter any issues.
Debian Stretch/Son of Grid Engine grid
New grid names have meaning to the grid-configurator script as well as to clush and the puppet prefixes.
- Host types:
  - sgeexec
  - sgeweblight (old sgewebgrid-lighttpd)
  - sgewebgen (old sgewebgrid-generic)
  - custom (not completely implemented in the configurator script yet, we'll see)
- Hosts are all Debian Stretch for now (-09xx)
- Hosts are numbered incrementally
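The host types and incremental numbering combine into instance names of the form tools-${host_type}-09-x, as given in the Host setup section. A minimal sketch of that rule follows; the function name instance_name is hypothetical and not part of any Toolforge tooling.

```shell
#!/bin/sh
# Hypothetical helper (not part of Toolforge tooling): build an instance
# name of the form tools-${host_type}-09-x for one of the known host types.
instance_name() {
    host_type="$1"
    index="$2"
    case "$host_type" in
        sgeexec|sgeweblight|sgewebgen|custom) ;;
        *)
            echo "unknown host type: $host_type" >&2
            return 1
            ;;
    esac
    # 09 stands for Debian 9 (Stretch); index is the incremental number.
    printf 'tools-%s-09-%s\n' "$host_type" "$index"
}
```

For example, instance_name sgeexec 3 prints tools-sgeexec-09-3, and an unknown host type returns a non-zero status.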
Host setup
- Create a new host using Horizon:
  - Instance name: tools-${host_type}-09-x (previously named tools-${host_type}-09xx)
    - 09 is for Debian v9 (aka Stretch)
    - x is incremental
  - Instance type: m1.large
  - Image type: stretch
  - Security groups:
    - sgeexec: default, execnode
    - sgeweblight (previously named sgewebgrid-lighttpd): default, execnode, webserver
    - sgewebgen (previously named sgewebgrid-generic): default, execnode, webserver
    - custom: default, execnode
- Configure host (not needed, as this is handled by the prefix in Horizon; the roles are listed for reference):
  - sgeexec: role::wmcs::toolforge::grid::compute::general
  - sgewebgrid-lighttpd: role::wmcs::toolforge::grid::web::lighttpd
  - sgewebgrid-generic: role::wmcs::toolforge::grid::web::generic
  - custom: ??
- Follow the instructions for getting a puppet client set up
- Run the following until there are no failures:
  sudo apt-get update && sudo puppet agent -t
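The setup steps above can be sketched as two small shell helpers: one maps a host type to its security groups, the other repeats a command until it finishes without failures. Both function names are hypothetical, and the attempt limit is an assumption, since the checklist only says to repeat until there are no failures. Because puppet agent -t enables detailed exit codes, the sketch treats exit status 2 (changes applied, no failures) as success alongside 0.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Toolforge tooling).

# security_groups HOST_TYPE -> comma-separated security groups from the checklist
security_groups() {
    case "$1" in
        sgeexec|custom)        echo 'default,execnode' ;;
        sgeweblight|sgewebgen) echo 'default,execnode,webserver' ;;
        *) echo "unknown host type: $1" >&2; return 1 ;;
    esac
}

# run_until_clean MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or 2, or MAX_ATTEMPTS is exhausted.
# MAX_ATTEMPTS is an assumption added to keep the loop bounded.
run_until_clean() {
    max="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        rc=0
        "$@" || rc=$?
        # With detailed exit codes: 0 = no changes, 2 = changes without failures.
        if [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]; then
            return 0
        fi
        echo "attempt $attempt exited with status $rc" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Example, to be run on the new host:
#   sudo apt-get update && run_until_clean 5 sudo puppet agent -t
```

If the command still fails after the last attempt, run_until_clean returns 1 so the remaining failure can be inspected by hand.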
Grid configuration
The new grid is primarily configured via the /usr/local/bin/grid-configurator script from the sonofgridengine module in puppet. It uses the state that puppet runs leave behind in NFS, plus some information from OpenStack, to work out which hosts to add and remove (as well as to manage queues, checkpoints and host groups). It does not yet handle complexes, which are the only grid objects still configured by hand in the new grid. The script is idempotent, so you can simply run it for all domains and expect it to do the right thing, for the most part.
Once you've run puppet a few times in the previous step, things should be ready to go for the script.
- Log into the current master or shadow master of the new grid's cluster (hint: it'll be named something like "tools-sgegrid-master")
- Run the configurator (if you are a manager, you don't need the sudo):
  sudo /usr/local/bin/grid-configurator --all-domains
- For some hosts, you may need to run it more than once because of the order in which servers and queues are added. After a couple of runs, you will likely need to stop and then start the gridengine-exec service on the new nodes: when the package is installed, the hosts aren't exec nodes yet, so the service fails. Check its state with:
  sudo systemctl status gridengine-exec.service
On an admin host:
- For the changes to the host_aliases file to take effect, restart the gridengine-master service, then check that it is running:
  sudo systemctl status gridengine-master.service
- qhost -q -h $fqdn should show the new queues without a trailing 'au', indicating the host is up and running.
  - If any queues say 'd' in the status column, try running qmod -e "*@${fqdn}", but you shouldn't need to.
  - If the status is 'au', check whether the gridengine-exec service is running on the new node, or just try again, because you might have been too quick.
- qhost -j -h $fqdn hopefully already shows jobs being submitted on the host.
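The state checks above can be summarized as a small lookup from a queue's state column to the follow-up the checklist suggests. This is only a sketch: the function name queue_state_hint and its output tokens are hypothetical, it covers just the two states mentioned here, and it expects the state string to be read off the qhost -q output by hand rather than parsed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Toolforge tooling).
# queue_state_hint STATE
# STATE is the state column shown by "qhost -q" for one queue
# (an empty string when no state flags are shown).
queue_state_hint() {
    case "$1" in
        '')   echo 'ok' ;;                 # no flags: host is up and running
        *au*) echo 'check-exec-service' ;; # check gridengine-exec, or retry later
        *d*)  echo 'enable-queue' ;;       # try: qmod -e "*@${fqdn}"
        *)    echo 'other' ;;              # states not covered by this checklist
    esac
}
```

For example, queue_state_hint au prints check-exec-service and queue_state_hint d prints enable-queue.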