Portal:Toolforge/Admin/New grid engine exec host

Checklists/procedures for standing up a new Grid Engine exec node.

The original version of this page assumed we'd only ever run one grid per Toolforge project. Now there are two, due to the deprecation of Ubuntu Trusty. Because keeping these grids segregated is extremely important (they are not compatible, and attempted communication between them causes segfaults of the master), the documentation is split between the "new grid" and "old grid" sections below.

Debian Stretch/Son of Grid Engine grid

Hostnames in the new grid have meaning to the grid-configurator script as well as to clush and the puppet prefixes.

  • Host types:
    • sgeexec
    • sgewebgrid-lighttpd
    • sgewebgrid-generic
    • custom (not completely implemented in the configurator script yet)
  • Hosts are all Debian Stretch for now (-09xx)
  • Hosts are numbered incrementally
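
For example, the first Stretch general exec node would be named tools-sgeexec-0901. Because the names follow a fixed pattern, clush can target them with a node set; a minimal sketch (the range and command are illustrative):

    clush -w tools-sgeexec-09[01-02] 'uptime'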

Host setup

  1. Create a new host using Horizon (a CLI alternative is sketched after this list):
    • Instance name: tools-<host type>-NNxx
      • stretch: NN=09
      • xx is incremental
    • Instance type: m1.large
    • Image type: stretch
    • Security groups:
      • sgeexec: default, execnode
      • sgewebgrid-lighttpd: default, execnode, webserver
      • sgewebgrid-generic: default, execnode, webserver
      • custom: default, execnode
  2. Configure host (not needed, as this is handled by the prefix in Horizon; notes for reference):
    • sgeexec: role::wmcs::toolforge::grid::compute::general
    • sgewebgrid-lighttpd: role::wmcs::toolforge::grid::web::lighttpd
    • sgewebgrid-generic: role::wmcs::toolforge::grid::web::generic
    • custom: ??
  3. Follow the instructions for getting a puppet client set up.
  4. Run sudo apt-get update && sudo puppet agent -t until there are no failures.
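
If you prefer the command line to Horizon, instance creation can also be scripted with the OpenStack CLI. A minimal sketch for a general exec node; the image name and instance number are assumptions, and the project's default network is assumed to apply:

    openstack server create \
        --image debian-9-stretch \
        --flavor m1.large \
        --security-group default \
        --security-group execnode \
        tools-sgeexec-0901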

Grid configuration

The new grid is primarily configured via the /usr/local/bin/grid-configurator script from the sonofgridengine module in puppet. To figure out which hosts to add and remove (as well as for queue management, checkpoints and host groups), it uses the various bits that puppet runs have been leaving behind on NFS all this time, plus a little information from OpenStack. It does not yet handle complexes, which are the only grid objects configured by hand in the new grid. The script is idempotent, so you can simply run it for all domains and expect it to do the right thing, for the most part.

Once you've run puppet a few times in the last step, the host should be ready for the script.

  1. Log into the current master or shadow master of the new grid's cluster (hint: it'll be named something like "tools-sgegrid-master")
  2. If you are a manager, you don't need the sudo in the next command:
    sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml|awk '{gsub(/"/,"",$2);print $2}')
  3. For some hosts, you may need to run the script more than once due to the order of adding servers vs. queues. After a couple of runs, you will likely need to stop and then start the gridengine-exec service on the new nodes: when the package installs, the hosts aren't registered as exec nodes yet, so the service will have failed.
    sudo service gridengine-exec restart
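
If several nodes were pooled at once, the restart can be fanned out with clush rather than logging into each node. A sketch, assuming the node set matches your new hosts and that clush can sudo on the targets:

    clush -w tools-sgeexec-09[01-04] 'sudo service gridengine-exec restart'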

On an admin host,

  1. qhost -q -h <hostname> should show the new queues without trailing 'au', indicating the host is up and running
    1. If any queues say 'd' for the status column, try running qmod -e "*@<hostname>", but you shouldn't need to.
    2. If the status is 'au', check if the gridengine-exec service is running on the new node, or just try again because you might have been too quick.
  2. qhost -j -h <hostname> hopefully already shows jobs being submitted on the host
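
Put together, a verification pass for a single new node might look like this (the hostname is an example):

    qhost -q -h tools-sgeexec-0901    # queue states; no trailing 'au' expected
    qhost -j -h tools-sgeexec-0901    # jobs scheduled on the host
    qmod -e '*@tools-sgeexec-0901'    # only if a queue shows state 'd'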

Ubuntu Trusty/Open Grid Engine grid

  • Host types:
    • exec
    • webgrid-lighttpd
    • webgrid-generic 
    • custom (cyberbot, catscan, ...)
  • Hosts are all Trusty (-14xx).
  • Hosts are numbered incrementally.
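
For example (hypothetical numbers):

    tools-exec-1401
    tools-webgrid-lighttpd-1402
    tools-webgrid-generic-1403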

Host setup

  1. Create a new host
    • Instance name: tools-<host type>-NNxx
      • precise: NN=12, trusty: NN=14
      • xx is incremental
    • Instance type: m1.large
    • Image type: precise or trusty
    • Security groups:
      • exec: default, execnode
      • webgrid-lighttpd: default, execnode, webserver
      • webgrid-generic: default, execnode, webserver
      • custom: default, execnode
  2. Configure host (not needed as this is handled by the prefix in Horizon--notes for reference):
    • all hosts: role::toollabs::compute
    • exec: role::toollabs::node::compute::general
    • webgrid-lighttpd: toollabs::node::web::lighttpd
    • webgrid-generic: toollabs::node::web::generic
    • custom: ??
  3. Run sudo apt-get update && sudo puppet agent -tv until there are no failures.
    1. For precise instances, you need to reboot them after the first puppet run, then run puppet again. This fixes an NFS permissions issue, turns on the swap partition properly, and yields the correct vmem value for the gridengine configuration.
  4. Kill mpt-statusd (a sketch follows this list).
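
A sketch for the last step; the init script name comes from the stock Ubuntu mpt-status package and is an assumption here:

    sudo service mpt-statusd stop    # stop the init-managed daemon
    sudo pkill -f mpt-statusd        # mop up any leftover processes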

Grid configuration

When pooling precise instances, remember to check that swap is enabled (sudo swapon -s on the new host) and that the exec host config file lists 30G as the value for vmem (on a large host).
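
A quick check on the new host, reusing the per-host exec host file referenced in step 1 below (a sketch; the complex may be spelled h_vmem rather than vmem):

    sudo swapon -s                                            # swap should be listed
    grep vmem /var/lib/gridengine/etc/exechosts/$(hostname)   # expect 30G on a large host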


On an admin host (e.g. tools-login), run the following commands:

  1. add the host as exec host: qconf -Ae /var/lib/gridengine/etc/exechosts/<hostname>
  2. webgrid, custom: add the host as submit host: qconf -as <hostname>
  3. Add the host to a queue / hostgroup, to tell gridengine what to use it as
    • exec: add the host (FQDN including project name) to hostgroup @general: qconf -mhgrp \@general
    • webgrid-lighttpd: add the host to hostgroup @webgrid: qconf -mhgrp \@webgrid
    • webgrid-generic: add the host to queue webgrid-generic: qconf -mq webgrid-generic
    • custom: add the host to the custom queue: qconf -mq <queue name>
  4. qmod -e "*@<hostname>" should now tell you the new host's queues are enabled
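
Put together, pooling a hypothetical new node tools-webgrid-lighttpd-1442 would look like:

    qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1442
    qconf -as tools-webgrid-lighttpd-1442
    qconf -mhgrp \@webgrid    # add the host's FQDN to the hostlist in the editor
    qmod -e '*@tools-webgrid-lighttpd-1442'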

On the new host,

  1. start gridengine-exec with sudo service gridengine-exec start

On an admin host,

  1. qhost -q -h <hostname> should show the new queues without trailing 'au', indicating the host is up and running
  2. qhost -j -h <hostname> hopefully already shows jobs being submitted on the host

