Obsolete:Portal:Toolforge/Admin/Grid
Failover
This section explains how failover works in Grid Engine. For concrete operational steps, see the Administrative tasks section below.
GridEngine Master
The gridengine scheduler/dispatcher runs on tools-sgegrid-master and manages dispatching jobs to execution nodes and reporting. The active master writes its name to /var/lib/gridengine/default/common/act_qmaster, where all end-user tools pick it up. tools-sgegrid-master normally serves in this role, but tools-sgegrid-shadow can also be manually started as the master, if and only if there are currently no active masters, with service gridengine-master start on the shadow master.
For Grid Engine 8 (stretch / Son of Grid Engine), the service is not marked as running in puppet, and systemd may stop trying to restart it if it keeps failing for a while. In those situations it requires a manual restart.
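A minimal sketch of checking on and manually restarting the master in that case, assuming the same systemd unit name used in the Administrative tasks section below:
user@tools-sgegrid-master:~$ sudo systemctl status gridengine-master.service
user@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service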
Redundancy
Every 30s, the master touches the file /var/spool/gridengine/qmaster/heartbeat. On tools-sgegrid-shadow there is a shadow master that watches this file for staleness and will fire up a new master on itself if the file has been stale for too long (currently 10m; 2m in the stretch grid). This only works if the running master crashed or was killed uncleanly (including the server hosting it crashing), because a clean shutdown creates a lockfile forbidding shadows from starting a master (as would be expected for willfully stopped masters). Depending on how the master died, the lock file may also be left in place for other reasons. Delete the lock file at /var/spool/gridengine/qmaster/lock if the takeover is desired.
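A quick sketch for inspecting heartbeat staleness and clearing the lock file when a takeover is desired; the paths come from above, and stat is just one way to read the file's timestamp:
stat -c '%y' /var/spool/gridengine/qmaster/heartbeat
sudo rm /var/spool/gridengine/qmaster/lock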
If it does take over, it changes /data/project/.system_sge/gridengine/default/common/act_qmaster to point to itself, redirecting all userland tools. This move is unidirectional; once the master is ready to take over again, the gridengine-master on tools-sgegrid-shadow needs to be shut down manually (note: on the stretch grid this doesn't seem to be true -- it failed back smoothly in testing -- but manually failing back is still smart), and the one on tools-sgegrid-master started (this is necessary to prevent flapping, or split brain, if the master only failed temporarily). This is simply done with service gridengine-master {stop/start}.
Because of the heartbeat file and act_qmaster mechanisms, after a failover the gridengine-master service will not start while act_qmaster points to the shadow master. To restore the "normal" state, you must manually stop the gridengine-shadow service on tools-sgegrid-shadow, then start the gridengine-master service on tools-sgegrid-master, and then start gridengine-shadow on tools-sgegrid-shadow again. The services are largely kept under manual systemctl control because of this sort of dance.
Administrative tasks
Failover / Failback
See which node is the current primary:
user@tools-sgegrid-master:~$ cat /var/lib/gridengine/default/common/act_qmaster
tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
user@tools-sgegrid-shadow:~$ cat /var/lib/gridengine/default/common/act_qmaster
tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
If moving shadow -> master, then:
user@tools-sgegrid-shadow:~$ sudo systemctl stop gridengine-shadow.service
user@tools-sgegrid-master:~$ sudo systemctl start gridengine-master.service
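Per the Redundancy section above, once the master is running again the shadow daemon on tools-sgegrid-shadow should also be started again so automatic takeover remains available:
user@tools-sgegrid-shadow:~$ sudo systemctl start gridengine-shadow.service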
SGE resources
Son of Grid Engine doesn't appear to be actively developed, but it is somewhat updated from the last open source release (8.0.0) of Univa Grid Engine, which remains an active commercial product that is no longer open source.
Documentation for Son of Grid Engine is mostly archives of the Sun/Oracle documents. This can be found at the University of Liverpool website.
PDF manuals for the older grid engine can be found using [1]. Most of the information in these still applies to Son of Grid Engine (version 8.1.9).
Nearly all installation guides for any version of Grid Engine are incorrect because they assume some process of untarring executables on NFS, like one would on a classic Solaris installation. In our environment, the execs on NFS are purely a set of symlinks to the exec files on local disk that are installed via deb packages.
With that in mind, see this page for most of the current how-tos: https://arc.liv.ac.uk/SGE/howto/howto.html
Dashboard
In addition to the CLI commands below, an overview can be viewed at https://sge-status.toolforge.org/
List of handy commands
Most commands take -xml as a parameter to enable XML output, which is useful when lines get cut off. These are unchanged between grid versions.
Note that qmod and qconf commands will only work on masters and shadow masters (tools-sgegrid-master and tools-sgegrid-shadow) in the grid because bastions are not admin hosts.
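For example, applying the -xml flag to the full queue listing command shown below:
qstat -f -xml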
Queries
- list queues on given host:
qhost -q -h $hostname
- list jobs on given host:
qhost -j -h $hostname
- list all queues:
qstat -f
- qmaster log file:
tail -f /data/project/.system_sge/gridengine/spool/qmaster/messages
Configuration
The global and scheduler configs are managed by puppet. See the files under modules/profile/files/toolforge/grid-global-config and modules/profile/files/toolforge/grid-scheduler-config.
See also: http://gridscheduler.sourceforge.net/howto/commontasks.html
- modify host group config:
qconf -mhgrp \@general
- print host group config:
qconf -shgrp \@general
- modify queue config:
qconf -mq queuename
- print queue config:
qconf -sq continuous
- enable a queue:
qmod -e 'queue@node_name'
- disable a queue:
qmod -d 'queue@node_name'
- add host as exec host:
qconf -ae node_name
- print exec host config:
qconf -se node_name
- remove host as exec host: ??
- add host as submit host:
qconf -as node_name
- remove host as submit host: ??
- add host as admin host:
qconf -ah node_name
- remove host as admin host: ??
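The removal entries above are marked '??' on this page. On a standard Grid Engine installation the corresponding qconf options are believed to be the following; treat them as an assumption (not verified against this grid) and check qconf -help before use:
qconf -de node_name    # delete execution host; assumed counterpart of -ae
qconf -ds node_name    # delete submit host; assumed counterpart of -as
qconf -dh node_name    # delete admin host; assumed counterpart of -ah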
Accounting
- retrieve information on finished job:
qacct -j [jobid or jobname]
- there are a few scripts in /home/valhallasw/accountingtools: (need to be puppetized)
- vanaf.py makes a copy of recent entries in the accounting file
- accounting.py contains python code to read in the accounting file
- Usage:
valhallasw@tools-bastion-03:~/accountingtools$ php time.php "-1 hour"
1471465675
valhallasw@tools-bastion-03:~/accountingtools$ python vanaf.py 1471465675 mylog
Seeking to timestamp 1471465675 ... done!
valhallasw@tools-bastion-03:~/accountingtools$ grep mylog -e '6727696' | python merlijn_stdin.py
25 1970-01-01 00:00:00 1970-01-01 00:00:00 tools-webgrid-lighttpd-1206.eqiad1.wikimedia.cloud tools.ptwikis lighttpd-precise-ptwikis 6727696
0 2016-08-17 21:01:42 2016-08-17 21:01:46 tools-webgrid-lighttpd-1207.eqiad1.wikimedia.cloud tools.ptwikis lighttpd-precise-ptwikis 6727696
Traceback (most recent call last):
  File "merlijn_stdin.py", line 4, in <module>
    line = raw_input()
EOFError: EOF when reading a line
- Ignore the EOFError; the relevant lines are above that. Error codes (first entry) are typically 0 (finished successfully), 19 ('before writing exit_status' = crashed?), 25 (rescheduled) or 100 ('assumedly after job' = lost job?). I'm not entirely sure about the codes when the job stops because of an error.
Orphan processes
Hunt for orphan processes (parent process id == 1) that have leaked from grid jobs:
$ clush -w @exec-stretch -w @webgrid-generic-stretch -w @webgrid-lighttpd-stretch -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|Debian-exim|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|systemd|www-data|sgeadmin)"|grep -v systemd|grep -v perl|grep -E " 1 "'
The exclusion for perl processes is because there are 2-3 tools built with perl that make orphans via the "normal" forking process.
Kill orphan processes:
$ clush -w @exec-stretch -w @webgrid-generic-stretch -w @webgrid-lighttpd-stretch -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|Debian-exim|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|systemd|www-data|sgeadmin)"|grep -v systemd|grep -v perl|grep -E " 1 "|awk "{print \$3}"|xargs sudo kill -9'
Creating a new node
If you absolutely must create a new grid node, use one of the wmcs.toolforge.scale_grid_* cookbooks.
For outdated manual docs, see Special:Permalink/1929192.
Clearing error state
Sometimes, due to various hiccups (like LDAP or DNS malfunction), grid jobs can move to an Error state from which they will not come out without explicit user action. Error states can also be created by repeated job failures caused by user error on otherwise healthy nodes. This includes an 'A' state from heavy job load. Nodes in this state are unschedulable, but unless the condition persists it is not necessary to try to alleviate it. A persistent 'A' state could, however, mean a node is broken. Lastly, the 'au' state generally means the host isn't reachable; this is also often attributable to load. If this error persists, check the host's job queue and ensure gridengine is still running on the host.
To view any potential error states and messages for each node:
qstat -explain E -xml | grep -e name -e state -e message
Once you have ascertained the cause of the Error state and fixed it, you can clear all queue error states using the cookbook:
dcaro@vulcanus$ cookbook wmcs.toolforge.grid.cleanup_queue_errors -h
usage: cookbooks.wmcs.toolforge.grid.cleanup_queue_errors [-h] [--project PROJECT] [--task-id TASK_ID] [--no-dologmsg]
                                                          [--master-hostname MASTER_HOSTNAME]

WMCS Toolforge - grid - cleanup queue errors

Usage example: cookbook wmcs.toolforge.grid.cleanup_queue_errors --project toolsbeta --master-hostname toolsbeta-sgegrid-master

options:
  -h, --help            show this help message and exit
  --project PROJECT     Relevant Cloud VPS openstack project (for operations, dologmsg, etc). If this cookbook is for
                        hardware, this only affects dologmsg calls. Default is 'toolsbeta'.
  --task-id TASK_ID     Id of the task related to this operation (ex. T123456).
  --no-dologmsg         To disable dologmsg calls (no SAL messages on IRC).
  --master-hostname MASTER_HOSTNAME
                        The hostname of the grid master node. Default is '<project>-sgegrid-master'
Or manually with:
user@tools-sgegrid-master:~$ sudo qmod -c '*'
You also need to clear all the queues that have gone into error state. Failing to do so prevents jobs from being scheduled on those queues. You can clear all error states on queues with:
qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//' | xargs qmod -cq
If a single job is stuck in the dr state, meaning it is stuck deleting but never goes away, run the following:
user@tools-sgegrid-master:~$ sudo qdel -f 9999850
root forced the deletion of job 9999850
Draining a node of jobs
In real life, you just do this with the exec-manage script: run sudo exec-manage depool $fqdn on the grid master or shadow master (e.g. tools-sgegrid-master.tools.eqiad1.wikimedia.cloud). What follows are the detailed steps that the script handles for you.
- Disable the queues on the node with qmod -d '*@$node_name'
- Reschedule continuous jobs running on the node (see below)
- Wait for non-restartable jobs to drain (if you want to be nice!) or qdel them
- Once whatever needed to be done is done, re-enable the node with qmod -e '*@$node_name'
There is no simple way to delete or reschedule jobs on a single host, but the following snippet provides a useful list of job IDs on the command line:
$(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])
which makes reasonable arguments for qdel or qmod -rj.
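For example, a sketch combining that snippet with the commands above (assuming $NODE_NAME holds the node's hostname):
qmod -rj $(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])   # reschedule the jobs
qdel $(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])       # or delete them outright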
Decommission a node
- Drain the node (see above!). Give the non-restartable jobs some time to finish (maybe even a day if you are feeling generous?).
- Use the wmcs.toolforge.remove_grid_node cookbook
Troubleshooting
Removing a node fails with 'Host object "[...]" is still referenced in cluster queue "[...]".'
Check that the node has no stuck 'Deleting' jobs; if there are, just qdel -f them.