Portal:Toolforge/Admin/new exec host

New exec node checklist

Initial notes

  • Host types:
    • exec
    • webgrid-lighttpd
    • webgrid-generic 
    • custom (cyberbot, catscan, ...)
  • Hosts typically exist in Precise (-12xx) and Trusty (-14xx) variants.
  • Hosts are numbered incrementally.

Host setup

  1. Create a new host
    • Instance name: tools-<host type>-NNxx
      • precise: NN=12, trusty: NN=14
      • xx is incremental
    • Instance type: m1.large
    • Image type: precise or trusty
    • Security groups:
      • exec: default, execnode
      • webgrid-lighttpd: default, execnode, webserver
      • webgrid-generic: default, execnode, webserver
      • custom: default, execnode
  2. Configure host:
    • all hosts: role::toollabs::compute
    • exec: role::toollabs::node::compute::general
    • webgrid-lighttpd: toollabs::node::web::lighttpd
    • webgrid-generic: toollabs::node::web::generic
    • custom: ??
  3. Run sudo apt-get update && sudo puppet agent -tv repeatedly until there are no failures.
    1. Precise instances need a reboot after the first puppet run, followed by another puppet run. This fixes an NFS permissions issue, turns on the swap partition properly, and makes the gridengine configuration report the correct vmem value.
  4. Kill mpt-statusd.
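
The naming convention from step 1 can be sketched as a small shell helper. This is only an illustration: instance_name is a hypothetical function, not part of any Toolforge tooling, and zero-padding the incremental part to two digits is an assumption based on the -12xx / -14xx pattern.

```shell
# Build an instance name following the tools-<host type>-NNxx convention.
# instance_name is a hypothetical helper used for illustration only.
instance_name() {
    host_type=$1   # exec, webgrid-lighttpd, webgrid-generic, or a custom name
    release=$2     # precise or trusty
    index=$3       # incremental number, given without a leading zero

    case "$release" in
        precise) nn=12 ;;
        trusty)  nn=14 ;;
        *) echo "unknown release: $release" >&2; return 1 ;;
    esac

    # Assumption: xx is zero-padded to two digits.
    printf 'tools-%s-%s%02d\n' "$host_type" "$nn" "$index"
}

instance_name exec trusty 5                 # prints tools-exec-1405
instance_name webgrid-lighttpd precise 12   # prints tools-webgrid-lighttpd-1212
```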

Grid configuration

When pooling precise instances, check that swap is enabled (run 'sudo swapon -s' on the new host) and that the exec host config file gives 30G as the vmem value (on a large host).
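
Both checks can be scripted roughly as follows. This is a sketch under assumptions: swap_enabled and show_vmem are hypothetical helpers, the config file path is the one used in the qconf -Ae step, and the exact layout of the vmem line in that file is not specified here, so the helper only prints matching lines for manual inspection.

```shell
# swap_enabled reads `swapon -s` output on stdin and succeeds only if at
# least one swap device is listed below the header line. Empty output or a
# header-only listing both count as "no swap".
swap_enabled() {
    awk 'NR > 1 && NF > 0 { found = 1 } END { exit !found }'
}

# show_vmem prints every line mentioning vmem in an exec host config file,
# so the 30G value can be checked by eye.
show_vmem() {
    grep 'vmem' "$1"
}

# Intended use (not executed here):
#   sudo swapon -s | swap_enabled && echo "swap is on"
#   show_vmem /var/lib/gridengine/etc/exechosts/<hostname>
```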

On an admin host (e.g. tools-login), run the following commands:

  1. add the host as exec host: qconf -Ae /var/lib/gridengine/etc/exechosts/<hostname>
  2. webgrid, custom: add the host as submit host: qconf -as <hostname>
  3. Add the host to a queue / hostgroup, to tell gridengine what to use it as
    • exec: add the host (FQDN including project name) to hostgroup @general: qconf -mhgrp \@general
    • webgrid-lighttpd: add the host to hostgroup @webgrid: qconf -mhgrp \@webgrid
    • webgrid-generic: add the host to queue webgrid-generic: qconf -mq webgrid-generic
    • custom: add the host to the custom queue: qconf -mq <queue name>
  4. Run qmod -e "*@<hostname>"; it should report that the new host's queues are enabled
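
As a memory aid, the per-type command sequence above can be printed by a small helper. This is a dry-run sketch only: pool_commands is a hypothetical function that echoes the commands rather than running them, since qconf -mhgrp and qconf -mq open an editor and need a person to add the host. The exec branch uses @general, matching the command shown for that step, and the hostname in the example call is a placeholder.

```shell
# pool_commands prints (does not run) the admin-host commands for a host type.
pool_commands() {
    host_type=$1
    host=$2        # FQDN of the new host, including the project name

    echo "qconf -Ae /var/lib/gridengine/etc/exechosts/$host"
    case "$host_type" in
        exec)
            echo "qconf -mhgrp @general" ;;
        webgrid-lighttpd)
            echo "qconf -as $host"
            echo "qconf -mhgrp @webgrid" ;;
        webgrid-generic)
            echo "qconf -as $host"
            echo "qconf -mq webgrid-generic" ;;
        *)
            # Custom hosts: the queue name is site-specific and left as a placeholder.
            echo "qconf -as $host"
            echo "qconf -mq <queue name>" ;;
    esac
    echo "qmod -e '*@$host'"
}

pool_commands exec tools-exec-1405   # placeholder hostname
```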

On the new host,

  1. start gridengine-exec with sudo service gridengine-exec start

On an admin host,

  1. qhost -q -h <hostname> should list the new queues with no trailing 'au' state; its absence indicates the host is up and running
  2. qhost -j -h <hostname> should, with luck, already show jobs running on the host
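
The 'au' check can be automated along these lines. This is a sketch, not a tested tool: queues_unreachable is a hypothetical helper, and it assumes that qhost -q prints each queue on an indented line with the state as the fourth whitespace-separated field (queue name, queue type, slot counts, state). Verify that layout against real output before relying on it.

```shell
# queues_unreachable reads `qhost -q -h <hostname>` output on stdin and
# succeeds if any queue line carries a state containing "au".
# Assumption: queue lines are indented and the state is the 4th field;
# a queue with no state flags has only three fields.
queues_unreachable() {
    awk '/^[ \t]/ && NF >= 4 && $4 ~ /au/ { bad = 1 } END { exit !bad }'
}

# Intended use on an admin host (not executed here):
#   qhost -q -h <hostname> | queues_unreachable && echo "host still in au state"
```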
