Portal:Toolforge/Admin/Toolschecker
This page is currently a draft.
Toolschecker is a Flask application that runs various active checks on Toolforge and Cloud VPS infrastructure in response to HTTP requests. Each check is exposed as a separate URL on the checker.tools.wmflabs.org host. These URLs are monitored by Icinga for alerting purposes (see "checker.tools.wmflabs.org").
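The idea of one check per URL can be illustrated with a minimal sketch. It is not the Toolschecker source: it uses a plain dictionary instead of Flask routing so that it runs without dependencies, and the names `CHECKS`, `check`, and `run_check`, as well as the 200/503 status mapping, are assumptions made for illustration. The `/self` path is one of the checks this page lists.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping URL paths to check functions.
CHECKS: Dict[str, Callable[[], bool]] = {}


def check(path: str):
    """Register the decorated function as the check served at `path`."""
    def decorator(func: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[path] = func
        return func
    return decorator


def run_check(path: str) -> Tuple[int, str]:
    """Run the check registered at `path` and return (status, body)."""
    func = CHECKS.get(path)
    if func is None:
        return 404, "unknown check"
    try:
        ok = func()
    except Exception:
        # Assumption: a check that raises counts as a failed check.
        ok = False
    return (200, "OK") if ok else (503, "NOT OK")


@check("/self")
def self_check() -> bool:
    # Trivial check: passes whenever the application can respond at all.
    return True
```

A monitoring system would then only need to request each path and alert on any non-200 response.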
Contents
- 1 Servers
- 2 Tools
- 3 Checks
- 3.1 /cron
- 3.2 /db/toolsdb
- 3.3 /db/wikilabelsrw
- 3.4 /dns/private
- 3.5 /etcd/flannel
- 3.6 /etcd/k8s
- 3.7 /grid/continuous/stretch
- 3.8 /grid/start/stretch
- 3.9 /k8s/nodes/ready
- 3.10 /ldap
- 3.11 /nfs/dumps
- 3.12 /nfs/home
- 3.13 /nfs/secondary_cluster_showmount
- 3.14 /redis
- 3.15 /self
- 3.16 /webservice/gridengine
- 3.17 /webservice/kubernetes
Servers
- tools-checker-03.tools.eqiad.wmflabs
This list is defined in the toollabs::checker_hosts key in https://wikitech.wikimedia.org/wiki/Hiera:Tools and is used to configure the ferm rules for Toolforge's flannel and Kubernetes etcd clusters.
Tools
Several tools are involved in the checks:
- toolschecker
- Crontab for /cron check
- toolschecker-ge-ws
- Webservice for /webservice/gridengine check
- toolschecker-k8s-ws
- Webservice for /webservice/kubernetes check
Checks
/cron
Expects the mtime of /data/project/toolschecker/crontest.txt to be updated every 5 minutes by a grid job executed by the toolschecker tool.
Troubleshooting:
- ssh login.tools.wmflabs.org
- become toolschecker
- crontab -l
*/5 * * * * /usr/bin/jsub -N toolschecker.crontest -once -quiet touch /data/project/toolschecker/crontest.txt
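The freshness test behind the /cron check can be sketched as follows. This is an illustration rather than the actual Toolschecker code: the function name `is_fresh` and the 300-second default are assumptions, and the real check may allow extra slack beyond the 5-minute cron interval.

```python
import os
import time
from typing import Optional


def is_fresh(path: str, max_age_seconds: float = 300,
             now: Optional[float] = None) -> bool:
    """Return True if `path` exists and was modified within `max_age_seconds`."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as a failed check.
        return False
    return (now - mtime) <= max_age_seconds
```

When troubleshooting by hand, calling `is_fresh("/data/project/toolschecker/crontest.txt")` on a Toolforge host would show whether the cron job has touched the file recently.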
/db/toolsdb
/db/wikilabelsrw
/dns/private
/etcd/flannel
/etcd/k8s
/grid/continuous/stretch
The small script /data/project/toolschecker/bin/long-running.sh runs as a job that never exits. If the job stops running, this check goes critical. To guard against that, the following cron job is defined:
*/5 * * * * jlocal /data/project/toolschecker/bigbrother.sh test-long-running-stretch /data/project/toolschecker/bin/long-running.sh
The bigbrother.sh script checks for the job and restarts it if not found.
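The check-and-restart pattern described above can be sketched in a few lines. The contents of bigbrother.sh are not shown on this page, so this is only an illustration of the pattern: `ensure_running` is a hypothetical name, and the `is_running` and `start` callables stand in for whatever grid commands the real script uses.

```python
from typing import Callable


def ensure_running(job_name: str,
                   is_running: Callable[[str], bool],
                   start: Callable[[str], None]) -> bool:
    """Start `job_name` if it is not running.

    Returns True if a restart was triggered, False if the job was already up.
    """
    if is_running(job_name):
        return False
    start(job_name)
    return True
```

Running this every 5 minutes from cron means that a stopped job is restarted within roughly one interval.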