Portal:Toolforge/Admin/New grid engine exec host

Checklists/procedures for standing up a new Grid Engine exec node.

Debian Stretch/Son of Grid Engine grid

New grid names have meaning to the grid-configurator script as well as to clush and the puppet prefixes.

  • Host types:
    • sgeexec
    • sgeweblight (old sgewebgrid-lighttpd)
    • sgewebgen (old sgewebgrid-generic)
    • custom (not completely implemented in the configurator script yet, we'll see)
  • Hosts are all Debian Stretch for now (-09xx)
  • Hosts are numbered incrementally

Host setup

  1. Create a new host using Horizon:
    • Instance name: tools-${host_type}-09-x (previously named tools-${host_type}-09xx)
      • 09 is for Debian v9 (aka Stretch)
      • x is incremental
    • Instance type: m1.large
    • Image type: stretch
    • Security groups:
      • sgeexec: default, execnode
      • sgeweblight (previously named sgewebgrid-lighttpd): default, execnode, webserver
      • sgewebgen (previously named sgewebgrid-generic): default, execnode, webserver
      • custom: default, execnode
  2. Configure host (not needed as this is handled by the prefix in Horizon--notes for reference):
    • sgeexec: role::wmcs::toolforge::grid::compute::general
    • sgewebgrid-lighttpd: role::wmcs::toolforge::grid::web::lighttpd
    • sgewebgrid-generic: role::wmcs::toolforge::grid::web::generic
    • custom: ??
  3. Follow the instructions for getting a puppet client set up
  4. Run sudo apt-get update && sudo puppet agent -t until there are no failures
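
As a rough illustration of steps 1 and 4, the shell sketch below builds an instance name from the naming pattern and repeats the apt/puppet run until it finishes without failures. The function names instance_name and puppet_until_clean and the default of five attempts are made up for this example. It also assumes that puppet agent -t exits with 0 (no changes) or 2 (changes applied) when nothing failed, so treat it as a starting point rather than a supported tool.

```shell
# Build an instance name following the tools-${host_type}-09-x pattern.
# Usage: instance_name <host_type> <index>
# (The function body runs in a subshell, so its variables stay local.)
instance_name() (
    case "$1" in
        sgeexec|sgeweblight|sgewebgen|custom) ;;
        *) echo "unknown host type: $1" >&2; exit 1 ;;
    esac
    printf 'tools-%s-09-%s\n' "$1" "$2"
)

# Repeat "apt-get update" plus a puppet run until one pass has no failures.
# Usage: puppet_until_clean [max_attempts]   (hypothetical default: 5)
# Assumption: with -t, puppet exits 0 or 2 when the run had no failures.
puppet_until_clean() (
    max_attempts="${1:-5}"
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if sudo apt-get update; then
            rc=0
            sudo puppet agent -t || rc=$?
            case "$rc" in
                0|2) exit 0 ;;
            esac
        fi
        echo "attempt ${attempt}/${max_attempts} had failures" >&2
        attempt=$((attempt + 1))
    done
    exit 1
)
```

For example, instance_name sgeexec 3 prints tools-sgeexec-09-3, and puppet_until_clean 5 on the new host gives up with a non-zero status after five failing passes.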

Grid configuration

The new grid is primarily configured via the /usr/local/bin/grid-configurator script from the sonofgridengine module in puppet. It uses the data that puppet runs leave behind on NFS, plus a little information from OpenStack, to work out which hosts to add and remove; it also manages queues, checkpoints, and host groups. It does not yet handle complexes (the only grid objects still configured by hand in the new grid). The script is idempotent, so you can simply run it for all domains and expect it to do the right thing, for the most part.

Once you've run puppet a few times in the last step, things should be ready to go for the script.

  1. Log into the current master or shadow master of the new grid's cluster (hint: it'll be named something like "tools-sgegrid-master")
  2. If you are a manager, you don't need the sudo in the next command:
    sudo /usr/local/bin/grid-configurator --all-domains
  3. For some hosts, you may need to run the script more than once because of the order in which servers and queues are added. After a couple of runs, you will likely need to stop and then start the gridengine-exec service on the new nodes. When the package is installed, the hosts aren't exec nodes yet, so the service fails.
    sudo systemctl restart gridengine-exec.service
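
The service check in step 3 can be sketched as a small helper that restarts a unit only when systemd does not report it as active. The name ensure_unit_active is hypothetical; the sketch relies only on systemctl is-active --quiet and systemctl restart, and is meant to be run on the new node itself.

```shell
# Restart a systemd unit only if it is not currently active.
# Usage: ensure_unit_active <unit>
# (The function body runs in a subshell, so its variables stay local.)
ensure_unit_active() (
    unit="$1"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is already active"
        exit 0
    fi
    echo "$unit is not active; restarting"
    sudo systemctl restart "$unit"
    # Re-check so the caller gets a non-zero status if the unit still fails.
    systemctl is-active --quiet "$unit"
)
```

On a new exec node this would be called as ensure_unit_active gridengine-exec.service once grid-configurator has run on the master.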

On an admin host,

  1. In order for the changes to the host_aliases file to take effect, restart the gridengine-master service
    sudo systemctl restart gridengine-master.service
  2. qhost -q -h $fqdn should show the new queues without trailing 'au', indicating the host is up and running
    1. If any queues say 'd' for the status column, try running qmod -e "*@${fqdn}", but you shouldn't need to.
    2. If the status is 'au', check if the gridengine-exec service is running on the new node, or just try again because you might have been too quick.
  3. qhost -j -h $fqdn hopefully already shows jobs being submitted on the host
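
The queue-state check in step 2 can be scripted along these lines. This is only a sketch: flag_bad_queues is a made-up name, and it assumes that queue lines in the qhost -q output are indented under the host line and carry the state letters in a fourth column (queue name, type, slot counts, state). Verify that against the real output before relying on it.

```shell
# Read `qhost -q` output on stdin and print "<queue> <state>" for every
# queue whose state column contains a, u, or d.
# Exits non-zero if at least one such queue was found.
# Assumption: queue lines are indented and the state is the 4th field.
flag_bad_queues() {
    awk '
        /^[ \t]+[^ \t]/ && NF >= 4 && $4 ~ /[aud]/ {
            print $1, $4
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

A possible invocation is qhost -q -h "$fqdn" | flag_bad_queues, which prints nothing and exits 0 when every queue on the host looks healthy.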

See also