Portal:Toolforge/Admin/New grid engine exec host

Checklists/procedures for standing up a new Grid Engine exec node.

The original version of this page assumed that each Toolforge project would only ever run one grid. There are now two, because of the deprecation of Ubuntu Trusty. Keeping these grids segregated is extremely important: they are not compatible, and attempted communication between them causes the master to segfault. The documentation is therefore split into "old grid" and "new grid" entries below.

Debian Stretch/Son of Grid Engine grid

Host names on the new grid have meaning to the grid-configurator script, to clush, and to the puppet prefixes.

  • Host types:
    • sgeexec
    • sgewebgrid-lighttpd
    • sgewebgrid-generic
    • custom (not completely implemented in the configurator script yet, we'll see)
  • Hosts are all Debian Stretch for now (-09xx)
  • Hosts are numbered incrementally
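Because the names carry meaning, it can help to check a name before creating an instance. The following is a minimal sketch, not part of the existing tooling: parse_grid_hostname is a hypothetical helper, and it assumes the tools-<host type>-09xx pattern described under Host setup. The custom type is left out because its handling is not settled.

```shell
#!/bin/sh
# Hypothetical helper: split a new-grid instance name into host type and
# incremental number. Assumes the tools-<host type>-09xx naming pattern.

parse_grid_hostname() {
    pgh_name=$1
    case "$pgh_name" in
        tools-sgeexec-09[0-9][0-9] | \
        tools-sgewebgrid-lighttpd-09[0-9][0-9] | \
        tools-sgewebgrid-generic-09[0-9][0-9])
            pgh_type=${pgh_name#tools-}   # drop the leading "tools-"
            pgh_type=${pgh_type%-*}       # drop the trailing "-09xx"
            pgh_seq=${pgh_name##*-09}     # keep only the incremental xx part
            printf '%s %s\n' "$pgh_type" "$pgh_seq"
            ;;
        *)
            printf 'not a new-grid host name: %s\n' "$pgh_name" >&2
            return 1
            ;;
    esac
}
```

Calling parse_grid_hostname with a well-formed name prints the host type and the incremental number separated by a space; any other name produces an error on stderr and a non-zero return status.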

Host setup

  1. Create a new host using Horizon:
    • Instance name: tools-<host type>-NNxx
      • stretch: NN=09
      • xx is incremental
    • Instance type: m1.large
    • Image type: stretch
    • Security groups:
      • sgeexec: default, execnode
      • sgewebgrid-lighttpd: default, execnode, webserver
      • sgewebgrid-generic: default, execnode, webserver
      • custom: default, execnode
  2. Configure host (not needed as this is handled by the prefix in Horizon--notes for reference):
    • sgeexec: role::wmcs::toolforge::grid::compute::general
    • sgewebgrid-lighttpd: role::wmcs::toolforge::grid::web::lighttpd
    • sgewebgrid-generic: role::wmcs::toolforge::grid::web::generic
    • custom: ??
  3. Follow the instructions for getting a puppet client set up.
  4. Run sudo apt-get update && sudo puppet agent -t repeatedly until there are no failures.
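The repeated puppet runs in the last step could be scripted roughly as follows. This is a sketch under assumptions: run_until_clean is a hypothetical helper, and the limit of 5 attempts in the usage comment is arbitrary. It relies on puppet agent -t enabling detailed exit codes, where 0 (no changes) and 2 (changes applied) are clean runs and 1, 4 and 6 indicate a failed run.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until it exits 0 or 2, giving up
# after a maximum number of attempts.

run_until_clean() {
    ruc_max=$1
    shift
    ruc_attempt=1
    while [ "$ruc_attempt" -le "$ruc_max" ]; do
        ruc_rc=0
        "$@" || ruc_rc=$?
        case "$ruc_rc" in
            0|2)
                echo "clean run on attempt $ruc_attempt (exit $ruc_rc)"
                return 0
                ;;
            *)
                echo "attempt $ruc_attempt failed (exit $ruc_rc)" >&2
                ;;
        esac
        ruc_attempt=$((ruc_attempt + 1))
    done
    return 1
}

# Usage on the new host (the attempt limit is an arbitrary choice):
#   sudo apt-get update && run_until_clean 5 sudo puppet agent -t
```

If the helper gives up, read the puppet output from the last attempt rather than raising the limit; persistent failures usually need a manual fix.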

Grid configuration

The new grid is primarily configured via the /usr/local/bin/grid-configurator script from the sonofgridengine module in puppet. It uses the data that puppet runs leave behind in NFS, plus a little information from OpenStack, to work out which hosts to add and remove (it also manages queues, checkpoints and host groups). It does not yet handle complexes, which are the only grid objects configured by hand in the new grid. The script is idempotent, so you can simply run it for all domains and expect it to do the right thing, for the most part.

Once you've run puppet a few times in the last step, things should be ready to go for the script.

  1. Log into the current master or shadow master of the new grid's cluster (hint: it will be named something like "tools-sgegrid-master").
  2. If you are a manager, you don't need the sudo in the next command:
    sudo /usr/local/bin/grid-configurator --all-domains
  3. For some hosts, you may need to run the script more than once because of the order in which servers and queues are added. After a couple of runs, you will likely need to restart the gridengine-exec service on the new nodes: when the package is installed, the hosts are not yet exec nodes, so the service fails to start.
    sudo service gridengine-exec restart

On an admin host,

  1. qhost -q -h <hostname> should show the new queues without a trailing 'au' in the status column, indicating that the host is up and running.
    1. If any queue shows 'd' in the status column, try running qmod -e "*@<hostname>", though this should not be necessary.
    2. If the status is 'au', check whether the gridengine-exec service is running on the new node, or simply check again, because you might have been too quick.
  2. qhost -j -h <hostname> will hopefully already show jobs running on the host.
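The status check above can be automated with a small filter over the qhost output. This is only a sketch: flag_problem_queues is a hypothetical helper, and it assumes that queue lines in qhost -q output are indented and consist of queue name, queue type, a reserved/used/total slot triple and an optional state field. Verify that against the real output before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: read `qhost -q -h <hostname>` output on stdin and
# print every queue whose state contains 'au' or 'd'.
# Returns 1 if any such queue is found, 0 otherwise.

flag_problem_queues() {
    awk '
        # Queue lines are indented; the third field is resv/used/total slots.
        /^[ \t]+[^ \t]/ && NF >= 3 && $3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
            state = (NF >= 4) ? $4 : ""
            if (state ~ /au/ || state ~ /d/) {
                printf "%s %s\n", $1, state
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage on an admin host:
#   qhost -q -h <hostname> | flag_problem_queues
```

An empty result with a zero return status means no queue on the host is in the 'au' or 'd' state; otherwise each flagged queue is printed with its state so the steps above can be followed.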

See also