This page is currently a draft.
Toolschecker is a Flask application that runs various active checks on Toolforge and Cloud VPS infrastructure in response to HTTP requests. Each check is exposed as a separate URL on the checker.tools.wmflabs.org host. These URLs are monitored by Icinga for alerting purposes (see "checker.tools.wmflabs.org"). Some URLs are also monitored externally by Catchpoint.
This list is defined in the
toollabs::checker_hosts key in https://wikitech.wikimedia.org/wiki/Hiera:Tools and is used in configuring the ferm rules for Toolforge's flannel and Kubernetes etcd clusters.
Several tools are involved in the checks:
Expects the mtime of /data/project/toolschecker/crontest.txt to be updated every 5 minutes by a grid job executed by the toolschecker tool.
- ssh login.tools.wmaflabs.org
- become toolschecker
- crontab -l
*/5 * * * * /usr/bin/jsub -N toolschecker.crontest -once -quiet touch /data/project/toolschecker/crontest.txt
There is a small script in
/data/project/toolschecker/bin/long-running.sh that runs as a job that runs forever. If it stops running, this checker will go critical. To prevent that there is a cron job definition of:
*/5 * * * * jlocal /data/project/toolschecker/bigbrother.sh test-long-running-stretch /data/project/toolschecker/bin/long-running.sh
The bigbrother.sh script checks for the job and restarts it if not found.