Portal:Toolforge/Admin/Webservice

Toolforge proxies and the webservice toolchain are complex. This is admin documentation for managing them. For more context around the tools-webservice package and command, also read Help:Toolforge/Web.

TL;DR

  • Nginx running on tools-proxy-* instances terminates TLS connections for tools.wmflabs.org
  • The nginx config includes a lua module named urlproxy.lua which implements lookup logic to find a backend host:port to handle the request
  • The data that urlproxy.lua uses for this lookup is stored in Redis
  • The data in Redis comes from the webservice system:
    • For a job grid hosted webservice:
      • webservice-runner runs on the job grid as the entry point of the submitted job
      • webservice-runner contacts a proxylistener.py service running on the active tools-proxy-* instance via TCP on port 8282
      • proxylistener.py updates Redis with the host:port sent by webservice-runner
    • For a Kubernetes hosted webservice:
      • kube2proxy.py runs on the active tools-proxy-* instance
      • kube2proxy.py connects to the Kubernetes API and watches for Service objects in the cluster
      • kube2proxy.py updates Redis with the host:port of each Service

Interesting files:

  • modules/dynamicproxy/templates/urlproxy.conf
  • modules/dynamicproxy/files/urlproxy.lua
  • modules/toollabs/files/proxylistener.py
  • modules/toollabs/files/kube2proxy.py
  • modules/dynamicproxy/files/invisible-unicorn.py (only used for domainproxy?)

At the Grid Engine exec nodes

Each webgrid job submission runs the webservice-runner script when a new web service is launched via the webservice command on a bastion. The script is part of the tools-webservice Debian package: https://gerrit.wikimedia.org/r/admin/projects/operations/software/tools-webservice

The webservice-runner script contacts the active tools-proxy node on port 8282 to register the web server's randomly assigned port (in the range 1024-65535) in the proxy's local redis. The active proxy is set via the active_proxy_host hiera variable on the old grid and the profile::toolforge::active_proxy_host hiera variable on the new grid.
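
For illustration, here is a minimal client-side sketch of that registration step. The "register <tool> <host> <port>" line format and the acknowledgement are assumptions made for the example; the real wire protocol is defined by webservice-runner and proxylistener.py.

  # Hypothetical sketch only: the message format below is an assumption,
  # not the actual protocol spoken by webservice-runner and proxylistener.py.
  import socket

  def register_with_proxy(proxy_host, tool, backend_host, backend_port):
      """Tell the active proxy which host:port now serves this tool."""
      with socket.create_connection((proxy_host, 8282), timeout=10) as sock:
          msg = "register {} {} {}\n".format(tool, backend_host, backend_port)
          sock.sendall(msg.encode("utf-8"))
          # proxylistener is assumed to reply with a short acknowledgement
          return sock.recv(1024).decode("utf-8").strip()

  # e.g. register_with_proxy("tools-proxy-06", "mytool", "exec-host.example", 43123)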

(TODO: determine if the portgrabber.py script in the toolforge profile actually does anything considering the above -- it may for some other services?)

At the K8s worker nodes

The webservice command on the bastions can also launch pods on Kubernetes behind a Deployment in a namespace dedicated to the tool. Nothing special happens on the worker nodes themselves, since tools-webservice talks directly to the Kubernetes API; the proxy integration is handled by the kube2proxy service running on the active proxy instance.
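
As a rough sketch of what that API interaction amounts to, the following uses the kubernetes Python client to create a Deployment and a Service for a tool. The namespace scheme, names, labels and image are assumptions for the example, not the exact objects tools-webservice creates.

  # Illustrative only: a minimal Deployment + Service for a tool's web server.
  from kubernetes import client, config

  def launch_webservice(tool, image, port=8000):
      config.load_kube_config()            # assumes a kubeconfig for the tool is available
      ns = "tool-{}".format(tool)          # namespace naming is an assumption
      labels = {"tool": tool}

      container = client.V1Container(
          name="webservice", image=image,
          ports=[client.V1ContainerPort(container_port=port)])
      template = client.V1PodTemplateSpec(
          metadata=client.V1ObjectMeta(labels=labels),
          spec=client.V1PodSpec(containers=[container]))
      deployment = client.V1Deployment(
          metadata=client.V1ObjectMeta(name=tool, labels=labels),
          spec=client.V1DeploymentSpec(
              replicas=1,
              selector=client.V1LabelSelector(match_labels=labels),
              template=template))
      client.AppsV1Api().create_namespaced_deployment(namespace=ns, body=deployment)

      service = client.V1Service(
          metadata=client.V1ObjectMeta(name=tool, labels=labels),
          spec=client.V1ServiceSpec(
              selector=labels,
              ports=[client.V1ServicePort(port=port, target_port=port)]))
      client.CoreV1Api().create_namespaced_service(namespace=ns, body=service)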

At the cron host (new grid) or active services host (old grid)

The tools/manifest/webservicemonitor.py script runs here; it is installed by the tools-manifest package: https://gerrit.wikimedia.org/r/admin/projects/operations/software/tools-manifest

Webservicemonitor ensures that services are running in the gridengine environment using a combination of submit host commands (such as qstat) and the proxy's list endpoint at <active_proxy_host>:8081/list. It will only function properly if it can read that endpoint correctly.

The script works as a kind of reconciliation loop by comparing the list of registered tool->host:port mappings gathered from the active proxy server, the list of jobs running on the grid from qstat, and service.manifest files gathered from the tool $HOME directories. The manifest data is the primary driver for this reconciliation. Each manifest is checked to determine if it:

  • contains a "web: ..." declaration
  • contains a "backend: gridengine" declaration
  • contains a "distribution: ..." declaration matching the distribution of the instance running webservicemonitor

If any of the checks fail, the manifest is skipped.
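
A sketch of that filtering step, assuming service.manifest is YAML and that the keys match the declarations quoted above (the key names are inferred from the prose, not read from the tools-manifest source):

  import yaml

  def manifest_is_eligible(manifest_path, local_distribution):
      """Return True if this manifest should be handled by this webservicemonitor."""
      with open(manifest_path) as f:
          manifest = yaml.safe_load(f) or {}
      return (
          "web" in manifest
          and manifest.get("backend") == "gridengine"
          and manifest.get("distribution") == local_distribution
      )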

The qstat output is searched for a running job matching the "web: ..." type and the tool name. If a matching job is found, its state is checked to determine whether it should be treated as 'running' (the state contains r, s, w, or h).

If the job is found to not be running, or the tool is not in the list of known proxy backends, the job is eligible for submission to the job grid. Before actually submitting the job, the manifest is checked to see if "too many" restart attempts have happened since the last time the job was seen to be running. (FIXME: pretty sure this is actually broken in the code. The tracking data is not persisted.)
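
Put together, the resubmission decision boils down to something like the following sketch (the qstat parsing and the fetch of the /list endpoint are omitted; both inputs are assumed to have been gathered already):

  def should_resubmit(tool, qstat_state, proxy_backends):
      """qstat_state: the state column for the tool's web job, or None if absent.
      proxy_backends: dict of tool -> host:port parsed from <active_proxy_host>:8081/list."""
      running = qstat_state is not None and any(c in qstat_state for c in "rswh")
      registered = tool in proxy_backends
      # eligible when the job is not running or the proxy has no backend for the tool
      return not (running and registered)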

On the tools-proxy servers

These use a number of tools to feed information about web services into redis and to remove it when it no longer applies. The dynamicproxy class in puppet then reads from redis (via lua) to provide the aforementioned list endpoint as well as the actual proxying of the web services themselves, which are exposed as paths under the shared domain name.
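
To make the data flow concrete, the following sketch (in Python with redis-py, purely as an illustration of what urlproxy.lua does in lua) resolves the first path component of a request to a registered backend. The "prefix:<tool>" hash layout is an assumption made for the example; check urlproxy.lua for the real key scheme.

  import redis

  r = redis.Redis(host="localhost", port=6379, decode_responses=True)

  def resolve_backend(request_path):
      """Map /<tool>/... to the backend registered for <tool>, or None."""
      tool = request_path.lstrip("/").split("/", 1)[0]
      routes = r.hgetall("prefix:" + tool)   # e.g. {".": "http://10.68.x.y:43123"}
      if not routes:
          return None                        # nginx would then serve an error page
      return routes.get(".") or next(iter(routes.values()))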

Grid web services

Grid services are added and removed using the proxylistener python script. This service runs on port 8282 and is used to register and deregister service locations with the proxy (by talking to the local redis service directly). The logs are found at /var/log/proxylistener.
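
Conceptually, the service boils down to something like the sketch below: accept a line on TCP 8282 and update the local redis. The message format and redis key layout are illustrative assumptions; the authoritative logic is in proxylistener.py.

  import socketserver
  import redis

  r = redis.Redis(host="localhost", port=6379)

  class Handler(socketserver.StreamRequestHandler):
      def handle(self):
          # assumed messages: "register <tool> <host> <port>" or "unregister <tool>"
          parts = self.rfile.readline().decode("utf-8").split()
          if len(parts) == 4 and parts[0] == "register":
              tool, host, port = parts[1], parts[2], parts[3]
              r.hset("prefix:" + tool, ".", "http://{}:{}".format(host, port))
          elif len(parts) == 2 and parts[0] == "unregister":
              r.delete("prefix:" + parts[1])
          self.wfile.write(b"ok\n")

  if __name__ == "__main__":
      socketserver.TCPServer(("0.0.0.0", 8282), Handler).serve_forever()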

In addition to this, the webgrid exec nodes all have the portgrabber.py library and the portreleaser script. These are still used at the end of jobs, especially ones killed by grid engine, via an epilog configured for the web queues. The epilog runs /usr/local/bin/portreleaser, which calls the proxylistener service on port 8282 to deregister the tool. The /usr/local/bin/portgrabber script does not appear to run directly anymore.

Kubernetes web services

A script called kube2proxy, which lives in puppet at modules/toollabs/files/kube2proxy.py, runs as a service that watches the Kubernetes API in a custom event loop. It depends on the request buffering capabilities of python3-requests >= 2.7.

The script also talks directly to the local redis instance to update the proxy; the only remote service it needs to reach is the Kubernetes API.
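
The sketch below shows the same pattern, but using the kubernetes Python client rather than the raw requests-based loop the script itself implements; the namespace-to-tool mapping and redis key layout are assumptions made for the example.

  import redis
  from kubernetes import client, config, watch

  config.load_kube_config()
  core = client.CoreV1Api()
  r = redis.Redis(host="localhost", port=6379)

  # Mirror Service objects into the proxy's redis as they come and go.
  w = watch.Watch()
  for event in w.stream(core.list_service_for_all_namespaces):
      svc = event["object"]
      tool = svc.metadata.namespace.replace("tool-", "")   # assumed namespace scheme
      backend = "http://{}:{}".format(svc.spec.cluster_ip, svc.spec.ports[0].port)
      if event["type"] in ("ADDED", "MODIFIED"):
          r.hset("prefix:" + tool, ".", backend)
      elif event["type"] == "DELETED":
          r.delete("prefix:" + tool)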

Lua scripts

To manipulate the proxy, the nginx lua modules are used. The details of this are all in the dynamicproxy module in puppet: modules/dynamicproxy/manifests/init.pp

TODO: document dynamicproxy a bit better

Gotchas

  • If you add new proxy hosts, you will need to update the hiera variable toollabs::proxy::proxies and then restart ferm on the proxy hosts, or the changes won't take effect at the firewall level.

...