Toolforge proxies and the webservice toolchain are somewhat complex. This is admin documentation for managing them. For more context around the tools-webservice package and command, also read Webservice
At the Grid Engine exec nodes
When a new web service is launched via the
webservice command on a bastion, the resulting webgrid job runs the
webservice-runner script. The script is part of the tools-webservice Debian package: https://gerrit.wikimedia.org/r/admin/projects/operations/software/tools-webservice
The webservice-runner script contacts the active tools-proxy node on port 8282 to register the random port (in the 1024-65535 range) that the web service listens on. The active proxy (set via the active_proxy_host hiera variable on the old grid and profile::toolforge::active_proxy_host on the new grid) records the mapping in its local Redis.
(TODO: determine if the portgrabber.py script in the toolforge profile actually does anything considering the above -- it may be used by some other services?)
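As a rough illustration, the registration step boils down to opening a TCP connection to the active proxy on port 8282 and announcing which tool is now listening where. The hostname, tool name, and in particular the message format below are assumptions for illustration only; the real client code is portgrabber.py in the tools-webservice package.
<syntaxhighlight lang="python">
# Illustration only: webservice-runner's proxy registration, reduced to its
# essentials. The message format here is an assumption, not the real wire
# protocol; see portgrabber.py in tools-webservice for the actual client.
import socket

ACTIVE_PROXY = "tools-proxy-example.example.org"  # placeholder active proxy host
TOOL = "mytool"                                    # hypothetical tool name
PORT = 40123                                       # random 1024-65535 port the web server listens on

with socket.create_connection((ACTIVE_PROXY, 8282)) as sock:
    # Tell the proxylistener which tool is now reachable where, so it can
    # write the tool -> host:port mapping into the proxy's local Redis.
    sock.sendall(f"{TOOL} {socket.gethostname()}:{PORT}\n".encode())
</syntaxhighlight>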
At the K8s worker nodes
The webservice command on the bastions can also simply launch pods on Kubernetes behind a Deployment in the tool's own namespace. Nothing special happens on the worker nodes themselves, since tools-webservice communicates directly with the Kubernetes API. The proxy-side part of this is handled by the
kube2proxy service running on the active proxy instance.
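For a sense of what that API interaction amounts to, here is a minimal sketch using the official Kubernetes Python client. The tool name, namespace naming scheme, image, and port are placeholders, not the values tools-webservice actually uses.
<syntaxhighlight lang="python">
# Illustration only: roughly what "webservice ... start" amounts to on the
# Kubernetes backend -- a Deployment created in the tool's namespace via the
# API. Tool name, namespace scheme, image, and port are placeholders.
from kubernetes import client, config

config.load_kube_config()          # the tool's kube config on a bastion

tool = "mytool"                    # hypothetical tool name
namespace = f"tool-{tool}"         # assumed namespace naming scheme

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name=tool, namespace=namespace),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"name": tool}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"name": tool}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(
                    name="webservice",
                    image="example-registry/toolforge-web:latest",  # placeholder image
                    ports=[client.V1ContainerPort(container_port=8000)],
                ),
            ]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace=namespace, body=deployment)
</syntaxhighlight>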
At the cron host (new grid) or active services host (old grid)
The tools/manifest/webservicemonitor.py script runs after being installed by the tools-manifest package: https://gerrit.wikimedia.org/r/admin/projects/operations/software/tools-manifest
Webservicemonitor ensures that web services are running in the grid engine environment using a combination of submit host commands (such as
qstat) and the proxy's list endpoint at <active_proxy_host>:8081/list. It will only function properly if it can read that endpoint correctly.
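A quick way to verify that prerequisite from the host running webservicemonitor is something like the snippet below; the hostname is a placeholder, and the exact response format is not documented here.
<syntaxhighlight lang="python">
# Sanity check for the endpoint webservicemonitor depends on. Placeholder
# hostname; inspect the body to see what the proxy actually returns.
import requests

resp = requests.get("http://tools-proxy-example:8081/list", timeout=10)
resp.raise_for_status()
print(resp.text)
</syntaxhighlight>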
The script works as a kind of reconciliation loop by comparing the list of registered tool->host:port mappings gathered from the active proxy server, the list of jobs running on the grid from
qstat, and service.manifest files gathered from the tool $HOME directories. The manifest data is the primary driver for this reconciliation. Each manifest is checked to determine if it:
- contains a "web: ..." declaration
- contains a "backend: gridengine" declaration
- contains a "distribution: ..." declaration matching the distribution of the instance running webservicemonitor
If any of these checks fails, the manifest is skipped (an illustration of such a manifest and the checks is sketched below).
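Purely for illustration, a service.manifest carrying the fields checked above might look roughly like this; the layout and values are assumptions, only the three keys come from the checks listed.
<syntaxhighlight lang="python">
# Illustration of the manifest checks described above. The manifest layout is
# an assumption; only the keys "web", "backend" and "distribution" are taken
# from the documented checks.
import yaml

manifest = yaml.safe_load("""
web: lighttpd
backend: gridengine
distribution: trusty
""")

host_distribution = "trusty"  # distribution of the host running webservicemonitor

eligible = (
    "web" in manifest
    and manifest.get("backend") == "gridengine"
    and manifest.get("distribution") == host_distribution
)
# Manifests failing any of these checks are skipped.
</syntaxhighlight>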
The qstat output is searched for a running job matching the "web: ..." type and the tool name. If a matching job is found, its state is checked to determine whether it should be treated as 'running' (state contains r, s, w, or h).
If the job is not running, or the tool is not in the list of known proxy backends, the job is eligible for submission to the job grid. Before actually submitting the job, the manifest is checked to see whether "too many" restart attempts have happened since the last time the job was seen running. (FIXME: pretty sure this is actually broken in the code; the tracking data is not persisted.)
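Putting those pieces together, the reconciliation pass looks roughly like the sketch below. parse_qstat and too_many_restarts are hypothetical helpers, the proxy hostname is a placeholder, the /list response shape is assumed, and the restart command is illustrative; the authoritative logic is tools/manifest/webservicemonitor.py.
<syntaxhighlight lang="python">
# Condensed sketch of webservicemonitor's reconciliation loop. Helper
# functions, the proxy hostname, the /list response shape and the restart
# command are all assumptions for illustration.
import subprocess
import requests

PROXY_LIST_URL = "http://tools-proxy-example:8081/list"  # placeholder active proxy host

def reconcile(manifests):
    """manifests: {tool_name: parsed service.manifest dict} for eligible tools."""
    proxied_tools = set(requests.get(PROXY_LIST_URL, timeout=10).json())
    qstat_xml = subprocess.run(["qstat", "-u", "*", "-xml"],
                               capture_output=True, text=True).stdout
    jobs = parse_qstat(qstat_xml)  # hypothetical helper: {tool: job state string}

    for tool, manifest in manifests.items():
        state = jobs.get(tool, "")
        running = any(flag in state for flag in "rswh")
        if running and tool in proxied_tools:
            continue  # healthy: job is up and registered with the proxy
        if too_many_restarts(manifest):  # hypothetical helper; see the FIXME above
            continue
        # Illustrative restart; the real script resubmits the grid job itself.
        subprocess.run(["sudo", "-i", "-u", f"tools.{tool}", "webservice", "restart"])
</syntaxhighlight>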
On the tools-proxy servers
These use a number of tools to feed information about web services into Redis and to remove it when it no longer applies. The dynamicproxy puppet class then reads from Redis (via Lua) to provide the aforementioned list endpoint as well as the actual proxying of the web services themselves, which are exposed as paths under the project's domain name.
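Conceptually, and ignoring the actual Lua/nginx plumbing, each request is handled roughly like the Python sketch below; the Redis key layout used here is an assumption, not the real schema.
<syntaxhighlight lang="python">
# Python rendering of what the Lua code in dynamicproxy does per request:
# take the first path component as the tool name, look its backend up in the
# local Redis, and proxy there. The "prefix:<tool>" key layout is an
# assumption, not the real schema.
import redis

r = redis.Redis(host="localhost")

def backend_for(request_path: str):
    tool = request_path.lstrip("/").split("/", 1)[0]  # "/mytool/foo" -> "mytool"
    routes = r.hgetall(f"prefix:{tool}")              # hypothetical key layout
    if not routes:
        return None                                   # unknown tool: serve the 404 page
    # Value would be something like "http://exec-host:40123" for a grid tool.
    return next(iter(routes.values())).decode()
</syntaxhighlight>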
Grid web services
Grid services are added and removed using the proxylistener python script. This service runs on port 8282 and is used to register and deregister service locations with the proxy (by talking to the local redis service directly). The logs are found at
In addition to this, the webgrid exec nodes all have the portgrabber.py library and the portreleaser script. These are still used at the end of jobs, especially ones killed by grid engine, via an epilog configured for the web queues. The epilog runs /usr/local/bin/portreleaser, which calls the proxylistener on port 8282 to deregister the port. The /usr/local/bin/portgrabber script does not appear to be run directly anymore.
Kubernetes web services
A script called kube2proxy, which lives in puppet at modules/toollabs/files/kube2proxy.py, runs as a service that watches the Kubernetes API in a custom event loop. It depends on the web request buffering capabilities of python3-requests >= 2.7.
The script also talks directly to the local Redis instance, thus updating the proxy. It only needs to be able to communicate with the Kubernetes API.
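A heavily stripped-down sketch of that loop follows, assuming a watch on Services, a hypothetical namespace naming scheme, and the same assumed Redis key layout as above; authentication, label filtering, and error handling are omitted. The real code is kube2proxy.py in puppet.
<syntaxhighlight lang="python">
# Stripped-down sketch of kube2proxy: watch the Kubernetes API for tool
# Services and mirror them into the proxy's local Redis. API URL, namespace
# naming, and the Redis key layout are assumptions; authentication is omitted.
import json
import redis
import requests

red = redis.Redis(host="localhost")
API = "https://k8s-master.example.org:6443"  # placeholder API server URL

def watch_services():
    # stream=True relies on the request buffering behaviour that makes
    # python3-requests >= 2.7 a requirement.
    resp = requests.get(f"{API}/api/v1/services",
                        params={"watch": "true"}, stream=True, verify=False)
    for line in resp.iter_lines():
        if not line:
            continue
        event = json.loads(line)
        svc = event["object"]
        tool = svc["metadata"]["namespace"].replace("tool-", "", 1)  # assumed naming
        port = svc["spec"]["ports"][0]["port"]
        backend = f"http://{svc['spec'].get('clusterIP')}:{port}"
        if event["type"] in ("ADDED", "MODIFIED"):
            red.hset(f"prefix:{tool}", ".*", backend)  # key layout is a guess
        elif event["type"] == "DELETED":
            red.delete(f"prefix:{tool}")
</syntaxhighlight>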
To manipulate the proxy, the nginx lua modules are used. The details of this are all in the dynamicproxy module in puppet: modules/dynamicproxy/manifests/init.pp
TODO: document dynamicproxy a bit better
- If you add new proxy hosts, you will need to update the hiera variable toollabs::proxy::proxies and then restart ferm on the proxy hosts or changes won't take effect at the firewall level.