This page is currently a draft.
More information and discussion about changes to this draft on the talk page.
This page contains information on the webservicemonitor functionality in Toolforge.
The webservicemonitor component is a daemon that scans Toolforge's tool manifests looking for grid-based webservices, checks that they are alive, and restarts them if required.
Since the Stretch version of Toolforge, this component is meant to run on cronrunner nodes. Previously it ran on services nodes.
The source code (Python 3) is currently deployed as a Debian package named tools-manifest, and can be found at https://gerrit.wikimedia.org/r/#/admin/projects/operations/software/tools-manifest.
All the setup is done using Puppet, in the profile::toolforge::grid::webservicemonitor profile: modules/profile/manifests/toolforge/grid/webservicemonitor.pp
How it works
A daemon, collector-runner, reads all Toolforge manifests from the NFS share. For a tool to be monitored, its manifest should indicate that the tool is meant to run on the grid, using a web node:
tools.wdcat@tools-bastion-03:~$ cat service.manifest
Then, the daemon checks whether a job for this tool is running. If not, it restarts the job and produces a log entry in the tool log. To be able to check the tool status and restart it, the daemon interacts with the grid, so the server running the daemon should be a grid submit host.
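The check-and-restart cycle described above can be sketched roughly as follows. This is an illustration only, not the tools-manifest code: the manifest keys ("backend", "web"), the function names, and the restart and log callbacks are all hypothetical. The real daemon reads manifests from NFS and talks to the grid, whereas this sketch takes plain dictionaries and injected callables so that it can be run anywhere.

```python
from typing import Callable, Dict, Iterable, List

# A manifest is modelled as a flat mapping. The key names used below
# ("backend", "web") are assumptions made for this sketch.
Manifest = Dict[str, str]


def wants_grid_webservice(manifest: Manifest) -> bool:
    """Return True if the manifest asks for a grid-based webservice."""
    return manifest.get("backend") == "gridengine" and "web" in manifest


def tools_to_restart(
    manifests: Dict[str, Manifest], running: Iterable[str]
) -> List[str]:
    """List tools that want a grid webservice but have no running job."""
    running_set = set(running)
    return sorted(
        tool
        for tool, manifest in manifests.items()
        if wants_grid_webservice(manifest) and tool not in running_set
    )


def monitor_once(
    manifests: Dict[str, Manifest],
    running: Iterable[str],
    restart: Callable[[str], None],
    log: Callable[[str, str], None],
) -> List[str]:
    """Run one monitoring pass: restart missing webservices and log each one.

    `restart` and `log` are hypothetical callbacks standing in for the grid
    submission and the write to the tool log.
    """
    restarted = tools_to_restart(manifests, running)
    for tool in restarted:
        restart(tool)
        log(tool, "No running webservice job found, restarting")
    return restarted


if __name__ == "__main__":
    example_manifests = {
        "wdcat": {"backend": "gridengine", "web": "lighttpd"},
        "othertool": {"backend": "kubernetes", "web": "php"},
    }
    monitor_once(
        example_manifests,
        running=[],
        restart=lambda tool: print(f"restarting {tool}"),
        log=lambda tool, message: print(f"[{tool}] {message}"),
    )
```

Passing the grid interaction in as callbacks keeps the decision logic (which tools need a restart) separate from the side effects, which is also why the real daemon needs to run on a grid submit host: only the side-effecting part depends on grid access.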