NFS is served from labstore1001, with labstore1002 as a cold failover server (the latter must be kept powered off because the shelves do not allow concurrent access from both servers; it is safe to power it on and start it if and only if labstore1001 is down).
labstore1001's storage is mounted under /srv, but the NFS export tree is /exp, a maintained farm of bind mounts pointing to the appropriate directories in the real store. /usr/local/sbin/manage-nfs-volumes-daemon manages the exports (analogous to how volumes were managed with gluster) according to what appears in LDAP, invoking /usr/local/sbin/sync-exports to manage the /exp mount farm. The NFS exports end up in /etc/exports.d/*.exports; mounts from the instances are done with puppet by role::labs::instance.
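As an illustration of that layout (the project name, paths, client subnet, and export options below are hypothetical placeholders; the real entries are generated from LDAP by the daemons above), a single project's share is wired up roughly like this:

```
# Bind mount feeding the /exp export farm (fstab syntax, illustrative only):
/srv/project/myproject   /exp/project/myproject   none   bind   0 0

# Matching generated export, e.g. /etc/exports.d/myproject.exports
# (subnet and options are placeholders):
/exp/project/myproject   10.0.0.0/8(rw,sync,no_subtree_check)
```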
Note that for NFSv4 to work properly, user and group names must exist on both the server and the clients; this is achieved by making labstore1001 an LDAP client for the passwd and group databases, which allows it to see the Labs users and service groups.
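Because NFSv4 idmapping works on names rather than raw uids, a quick sanity check on any host is simply whether a given name resolves through nsswitch (files, or LDAP on labstore1001 and the clients). A minimal sketch, with placeholder names:

```python
# Check that a user name and group name resolve on this host; for Labs
# accounts, resolution goes through the LDAP passwd and group databases.
import grp
import pwd

def names_resolve(username: str, groupname: str) -> bool:
    """True if both names are known to nsswitch on this host."""
    try:
        pwd.getpwnam(username)
        grp.getgrnam(groupname)
        return True
    except KeyError:
        return False

print(names_resolve("root", "root"))  # True on virtually any Unix host
```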
There is a secondary NFS server, labstore1003, which has a simple static export file for the Labs copy of the dumps.
There is also a presentation on Commons with slides detailing the architecture of the Labs storage.
The database replicas are slaved from the sanitarium to hosts in the labsupport network, including labsdb1003. Each of those runs a MariaDB 10 instance on the default port, to which every (sanitized) production database is replicated. In addition, there is a MariaDB instance running on labsdb1005 that contains no replicas but is available for general use by Labs users; labsdb1005 is also the PostgreSQL slave for OSM (the corresponding master runs on another host in the same group).
Credentials for the Labs end users are generated on labstore1001, where a daemon, /usr/local/sbin/replica-addusers, watches LDAP for new users and creates database credentials for them. The new password is stored in /var/cache/dbusers and configured on all databases with SELECT access to '%_p'.* and full control over 'username_%'.*. As users' homes get created on the various projects, the credentials are copied from the cache to a replica.my.cnf file in each user's home.
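As a hedged illustration (the user name and password below are placeholders; the real file is generated automatically), the per-user credentials file looks something like:

```
# ~/replica.my.cnf -- generated, do not edit
[client]
user = u1234
password = not-a-real-password
```

A user would then connect with something along the lines of `mysql --defaults-file=$HOME/replica.my.cnf -h labsdb1003` (host name illustrative).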
Tools web services are run as dynamic, per-tool web servers which are dispatched to by the proxy, tools-webproxy. The nginx proxy matches request URLs against keys stored in a local redis database, mapping them to a grid node and port on the actual Labs instances.
On every instance from which a web server can be run (tools-webgrid-*), there is a daemon, /usr/local/sbin/portgranter, which tracks free ports, allocates them to new web servers, stores the resulting mapping in /data/project/.system/dynamic, and informs the webproxy's redis of the new mapping. (The file in /data/project used to be consumed by the older apache proxy, but it remains a useful resource for finding the current mappings from the command line and could be used to reconstruct the redis database if needed.)
When a webservice is started, it is done through /usr/local/bin/portgrabber. That script contacts the portgranter for a port and starts the web server proper on that port, keeping the file descriptor to portgranter open across its exec(). This is how portgranter knows when the web server dies: as the process ends, the file descriptor is closed and portgranter removes the mapping.
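The death-detection trick described above can be sketched in a few lines of Python (an illustrative model, not the actual portgranter code): the granter holds one end of a connection, and when the client process dies, the kernel closes the other end and the granter reads EOF.

```python
# Sketch of EOF-based death detection: the "granter" keeps one end of a
# socket pair; the "web server" inherits the other end and keeps it open
# until it dies, at which point the granter sees EOF and can drop the
# port mapping.
import os
import select
import socket

def granter_detects_death() -> bool:
    granter_end, client_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: plays the web server started by portgrabber. It keeps
        # client_end open (as it would across exec()) and then dies;
        # process death implicitly closes its end of the socket.
        granter_end.close()
        os._exit(0)
    # Parent: plays portgranter. Close our copy of the client's end so
    # that EOF is delivered when the child's copy is gone.
    client_end.close()
    readable, _, _ = select.select([granter_end], [], [], 5.0)
    # recv() returning b'' means EOF: the client process is gone and its
    # mapping would now be removed from redis.
    data = granter_end.recv(1) if readable else None
    os.waitpid(pid, 0)
    granter_end.close()
    return data == b""

print(granter_detects_death())  # prints True once EOF is observed
```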
By default, tools users have lighttpd set as their web server, but they can also configure their environment to start a different server.
Resource management in tools is handled by gridengine (aka OGE), the open-source fork of the Sun Grid Engine. Its operating principles are relatively simple even if the implementation is a little baroque (documentation).
The grid consists of any number of nodes managed by a gridmaster (tools-grid-master). Jobs are submitted to a queue, which specifies the list of nodes available to it. The master then checks resource availability and allocates the job to a suitable node (generally, the one with the lightest current load). In addition to the gridmaster, there is a failover master (tools-grid-shadow) that takes over automatically should the primary fail.
In tools, there are two "standard" queues, task and continuous, both of which include all of the normal nodes (tools-exec-nn); the only difference between the two is that jobs on the continuous queue are automatically restarted, if possible, when they fail. End users are not allowed to place jobs on the continuous queue, but tool accounts are.
In addition to those queues, there are a number of specialized ones for more specific purposes: webgrid-* for webservices, and several queues dedicated to tools with specific requirements.
qconf -sql from a bastion will show the list of queues, and
qconf -sq queuename will show a queue's configuration.
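The allocation principle described above (a queue names its eligible nodes; the master places a job on the least-loaded node that still has a free slot, or leaves it queued) can be modeled with a toy sketch; all node names and numbers here are made up:

```python
# Toy model (not SGE code) of how the master allocates a job: filter the
# queue's nodes to those with a free slot, then pick the lightest-loaded.
def allocate(job, queue_nodes, load, running, slots):
    """Return the chosen node name, or None to leave the job queued ('qw')."""
    candidates = [n for n in queue_nodes if running.get(n, 0) < slots.get(n, 0)]
    if not candidates:
        return None  # no free resources: the job waits in 'qw' state
    return min(candidates, key=lambda n: load.get(n, 0.0))

nodes = ["tools-exec-01", "tools-exec-02"]
load = {"tools-exec-01": 0.9, "tools-exec-02": 0.2}
slots = {"tools-exec-01": 4, "tools-exec-02": 4}
print(allocate("job1", nodes, load, {}, slots))  # prints tools-exec-02
```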
Jobs can be submitted from any of the bastion hosts, or from the tools-webgrid-* nodes (to allow web services to schedule jobs). Administrative commands can be issued from the bastion hosts, the master, and the shadow master.
Administrative access to gridengine is restricted to 'manager' users, the current list of which can be enumerated with
qconf -sm; in a pinch, root is a manager and can be used to intervene.
If a node fails or crashes in such a way that the master loses contact with it for over 5 minutes, any job that was running there (except those on the task queue) will be requeued for allocation on a different node. Given that there are not many spare resources at this time, this is likely to result in many jobs queued and waiting for resources (qstat -u '*' -s p will show the list; a large number of jobs in 'qw' state is symptomatic of being out of available resources).