Obsolete:Portal:Toolforge/Admin/Grid
Failover
This section explains how failover works in Grid Engine. For concrete operational steps, see the Administrative tasks section below.
GridEngine Master
The gridengine scheduler/dispatcher runs on tools-sgegrid-master and manages dispatching jobs to execution nodes and reporting. The active master writes its name to /var/lib/gridengine/default/common/act_qmaster, where all end-user tools pick it up. tools-sgegrid-master normally serves in this role, but tools-sgegrid-shadow can also be manually started as the master, if and only if there are currently no active masters, with service gridengine-master start on the shadow master.
For Grid Engine 8 (stretch / Son of Grid Engine), the service is not marked as running in puppet, and systemd may stop trying to restart it if it keeps failing for a while. In those situations it requires a manual restart.
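A minimal sketch of checking on and manually restarting the master in that case, assuming the same systemd unit name used in the Administrative tasks section below:
user@tools-sgegrid-master:~$ sudo systemctl status gridengine-master.service
user@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service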
Redundancy
Every 30s, the master touches the file /var/spool/gridengine/qmaster/heartbeat. On tools-sgegrid-shadow there is a shadow master that watches this file for staleness and will fire up a new master on itself if the file has been stale for too long (currently 10m; 2m in the stretch grid). This only works if the running master crashed or was killed uncleanly (including the server hosting it crashing), because a clean shutdown creates a lockfile forbidding shadows from starting a master (as would be expected for willfully stopped masters). Depending on how the master died, the lock file may also be left in place for other reasons. Delete the lock file at /var/spool/gridengine/qmaster/lock if the takeover is desired.
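A quick sketch for inspecting heartbeat staleness and clearing the lock file when a takeover is desired; the paths come from above, and stat is just one way to read the file's timestamp:
stat -c '%y' /var/spool/gridengine/qmaster/heartbeat
sudo rm /var/spool/gridengine/qmaster/lock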
If it does take over, it changes /data/project/.system_sge/gridengine/default/common/act_qmaster to point to itself, redirecting all userland tools. This move is unidirectional; once the master is ready to take over again, the gridengine-master on tools-sgegrid-shadow needs to be shut down manually (note: on the stretch grid this doesn't seem to be true -- it failed back smoothly in testing -- but manually failing back is still smart), and the one on tools-sgegrid-master started (this is necessary to prevent flapping, or split brain, if the master only failed temporarily). This is simply done with service gridengine-master {stop/start}.
Because of the heartbeat file and act_qmaster mechanisms, after a failover the gridengine-master service will not start while act_qmaster points to the shadow master. To restore the "normal" state, you must manually stop the gridengine-shadow service on tools-sgegrid-shadow, then start the gridengine-master service on tools-sgegrid-master, and then start gridengine-shadow on tools-sgegrid-shadow again. The services are largely kept under manual systemctl control because of this sort of dance.
Administrative tasks
Failover / Failback
See which node is the current primary:
user@tools-sgegrid-master:~$ cat /var/lib/gridengine/default/common/act_qmaster
tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
user@tools-sgegrid-shadow:~$ cat /var/lib/gridengine/default/common/act_qmaster
tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
If moving shadow -> master, then:
user@tools-sgegrid-shadow:~$ sudo systemctl stop gridengine-shadow.service
user@tools-sgegrid-master:~$ sudo systemctl start gridengine-master.service
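Per the Redundancy section above, once the master is running again the shadow daemon on tools-sgegrid-shadow should also be started again so automatic takeover remains available:
user@tools-sgegrid-shadow:~$ sudo systemctl start gridengine-shadow.service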
SGE resources
Son of Grid Engine doesn't appear to be actively developed, but it is somewhat updated from the last open source release (8.0.0) of Univa Grid Engine, which remains an active commercial product that is no longer open source.
Documentation for Son of Grid Engine is mostly archives of the Sun/Oracle documents. This can be found at the University of Liverpool website.
PDF manuals for the older grid engine can be found using [1]. Most of the information in these still applies to Son of Grid Engine (version 8.1.9).
Nearly all installation guides for any version of Grid Engine are incorrect because they assume some process of untarring executables on NFS, like one would on a classic Solaris installation. In our environment, the execs on NFS are purely a set of symlinks to the exec files on local disk that are installed via deb packages.
With that in mind, see this page for most of the current how-tos: https://arc.liv.ac.uk/SGE/howto/howto.html
Dashboard
In addition to the CLI commands below, an overview can be viewed at https://sge-status.toolforge.org/
List of handy commands
Most commands take -xml as a parameter to enable XML output, which is useful when lines get cut off. These are unchanged between grid versions.
Note that qmod and qconf commands will only work on masters and shadow masters (tools-sgegrid-master and tools-sgegrid-shadow) in the grid because bastions are not admin hosts.
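For example, applying the -xml flag to the full queue listing command shown below:
qstat -f -xml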
Queries
- list queues on given host:
qhost -q -h $hostname
- list jobs on given host:
qhost -j -h $hostname
- list all queues:
qstat -f
- qmaster log file:
tail -f /data/project/.system_sge/gridengine/spool/qmaster/messages
Configuration
The global and scheduler configs are managed by puppet. See the files under modules/profile/files/toolforge/grid-global-config and modules/profile/files/toolforge/grid-scheduler-config.
See also: http://gridscheduler.sourceforge.net/howto/commontasks.html
- modify host group config:
qconf -mhgrp \@general
- print host group config:
qconf -shgrp \@general
- modify queue config:
qconf -mq queuename
- print queue config:
qconf -sq continuous
- enable a queue:
qmod -e 'queue@node_name'
- disable a queue:
qmod -d 'queue@node_name'
- add host as exec host:
qconf -ae node_name
- print exec host config:
qconf -se node_name
- remove host as exec host: ??
- add host as submit host:
qconf -as node_name
- remove host as submit host: ??
- add host as admin host:
qconf -ah node_name
- remove host as admin host: ??
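The removal entries above are marked '??' on this page. On a standard Grid Engine installation the corresponding qconf options are believed to be the following; treat them as an assumption (not verified against this grid) and check qconf -help before use:
qconf -de node_name    # delete execution host; assumed counterpart of -ae
qconf -ds node_name    # delete submit host; assumed counterpart of -as
qconf -dh node_name    # delete admin host; assumed counterpart of -ah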
Accounting
- retrieve information on finished job:
qacct -j [jobid or jobname]
- there are a few scripts in /home/valhallasw/accountingtools: (need to be puppetized)
- vanaf.py makes a copy of recent entries in the accounting file
- accounting.py contains python code to read in the accounting file
- Usage:
valhallasw@tools-bastion-03:~/accountingtools$ php time.php "-1 hour"
1471465675
valhallasw@tools-bastion-03:~/accountingtools$ python vanaf.py 1471465675 mylog
Seeking to timestamp 1471465675 ... done!
valhallasw@tools-bastion-03:~/accountingtools$ grep mylog -e '6727696' | python merlijn_stdin.py
25 1970-01-01 00:00:00 1970-01-01 00:00:00 tools-webgrid-lighttpd-1206.eqiad1.wikimedia.cloud tools.ptwikis lighttpd-precise-ptwikis 6727696
0 2016-08-17 21:01:42 2016-08-17 21:01:46 tools-webgrid-lighttpd-1207.eqiad1.wikimedia.cloud tools.ptwikis lighttpd-precise-ptwikis 6727696
Traceback (most recent call last):
  File "merlijn_stdin.py", line 4, in <module>
    line = raw_input()
EOFError: EOF when reading a line
- Ignore the EOFError; the relevant lines are above that. Error codes (first entry) are typically 0 (finished successfully), 19 ('before writing exit_status' = crashed?), 25 (rescheduled) or 100 ('assumedly after job' = lost job?). I'm not entirely sure about the codes when the job stops because of an error.
Orphan processes
Hunt for orphan processes (parent process id == 1) that have leaked from grid jobs:
$ clush -w @exec-stretch -w @webgrid-generic-stretch -w @webgrid-lighttpd-stretch -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|Debian-exim|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|systemd|www-data|sgeadmin)"|grep -v systemd|grep -v perl|grep -E " 1 "'
The exclusion for perl processes is because there are 2-3 tools built with perl that make orphans via the "normal" forking process.
Kill orphan processes:
$ clush -w @exec-stretch -w @webgrid-generic-stretch -w @webgrid-lighttpd-stretch -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|Debian-exim|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|systemd|www-data|sgeadmin)"|grep -v systemd|grep -v perl|grep -E " 1 "|awk "{print \$3}"|xargs sudo kill -9'
Creating a new node
If you absolutely must create a new grid node, use one of the wmcs.toolforge.scale_grid_* cookbooks.
For outdated manual docs, see Special:Permalink/1929192.
Clearing error state
Sometimes, due to various hiccups (like LDAP or DNS malfunction), grid jobs can move to an Error state from which they will not come out without explicit user action. Error states can also be created by repeated job failures caused by user error on otherwise healthy nodes. This includes an 'A' state from heavy job load. Nodes in this state are unschedulable, but unless the condition persists it is not necessary to try to alleviate it. A persistent 'A' state could, however, mean a node is broken. Lastly, the 'au' state generally means the host isn't reachable; this is also often attributable to load. If this error persists, check the host's job queue and ensure gridengine is still running on the host.
To view any potential error states and messages for each node:
qstat -explain E -xml | grep -e name -e state -e message
Once you have ascertained the cause of the Error state and fixed it, you can clear all queue error states using the cookbook:
dcaro@vulcanus$ cookbook wmcs.toolforge.grid.cleanup_queue_errors -h
usage: cookbooks.wmcs.toolforge.grid.cleanup_queue_errors [-h] [--project PROJECT] [--task-id TASK_ID] [--no-dologmsg]
                                                          [--master-hostname MASTER_HOSTNAME]

WMCS Toolforge - grid - cleanup queue errors

Usage example: cookbook wmcs.toolforge.grid.cleanup_queue_errors --project toolsbeta --master-hostname toolsbeta-sgegrid-master

options:
  -h, --help            show this help message and exit
  --project PROJECT     Relevant Cloud VPS openstack project (for operations, dologmsg, etc). If this cookbook is for
                        hardware, this only affects dologmsg calls. Default is 'toolsbeta'.
  --task-id TASK_ID     Id of the task related to this operation (ex. T123456).
  --no-dologmsg         To disable dologmsg calls (no SAL messages on IRC).
  --master-hostname MASTER_HOSTNAME
                        The hostname of the grid master node. Default is '<project>-sgegrid-master'
Or manually with:
user@tools-sgegrid-master:~$ sudo qmod -c '*'
You also need to clear all the queues that have gone into error state. Failing to do so prevents jobs from being scheduled on those queues. You can clear all error states on queues with:
qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//' | xargs qmod -cq
If a single job is stuck in the dr state, meaning it is stuck deleting but never goes away, run the following:
user@tools-sgegrid-master:~$ sudo qdel -f 9999850
root forced the deletion of job 9999850
Draining a node of jobs
In real life, you just do this with the exec-manage script: run sudo exec-manage depool $fqdn on the grid master or shadow master (e.g. tools-sgegrid-master.tools.eqiad1.wikimedia.cloud). What follows are the detailed steps that the script handles for you.
- Disable the queues on the node with qmod -d '*@$node_name'
- Reschedule continuous jobs running on the node (see below)
- Wait for non-restartable jobs to drain (if you want to be nice!) or qdel them
- Once whatever needed to be done is done, re-enable the node with qmod -e '*@$node_name'
There is no simple way to delete or reschedule jobs on a single host, but the following snippet provides a useful list of job IDs on the command line:
$(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])
which makes reasonable arguments for qdel or qmod -rj.
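For example, a sketch combining that snippet with the commands above (assuming $NODE_NAME holds the node's hostname):
qmod -rj $(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])   # reschedule the jobs
qdel $(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])       # or delete them outright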
Decommission a node
- Drain the node (see above!). Give the non-restartable jobs some time to finish (maybe even a day if you are feeling generous?).
- Use the wmcs.toolforge.remove_grid_node cookbook
Troubleshooting
Removing a node fails with 'Host object "[...]" is still referenced in cluster queue "[...]".'
Check that the node has no stuck 'Deleting' jobs; if there are, just qdel -f them.