Portal:Toolforge/Admin/Toolschecker


Toolschecker is a Flask application that runs various active checks on Toolforge and Cloud VPS infrastructure in response to HTTP requests. Each check is exposed as a separate URL on the checker.tools.wmflabs.org host. These URLs are monitored by Icinga for alerting purposes (see "checker.tools.wmflabs.org").
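As a rough illustration of the one-URL-per-check pattern described above, the following sketch uses only the Python standard library rather than Flask. The `check` decorator, the `run_check` dispatcher, the `/example/ok` path, and the 200/503 status convention are all assumptions made for illustration; none of them are taken from the toolschecker source.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping URL paths to check functions.
CHECKS: Dict[str, Callable[[], bool]] = {}


def check(path: str) -> Callable[[Callable[[], bool]], Callable[[], bool]]:
    """Register a function as the check served at `path`."""
    def decorator(func: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[path] = func
        return func
    return decorator


@check("/example/ok")  # hypothetical path, not a real toolschecker URL
def example_ok() -> bool:
    # Placeholder check that always succeeds.
    return True


def run_check(path: str) -> Tuple[int, str]:
    """Return an (HTTP status, body) pair for the check at `path`."""
    func = CHECKS.get(path)
    if func is None:
        return 404, "unknown check"
    try:
        ok = func()
    except Exception as exc:
        # A check that raises is reported as failing (assumed convention).
        return 503, f"check raised {type(exc).__name__}"
    return (200, "OK") if ok else (503, "NOT OK")
```

A monitoring system such as Icinga would then only need to request each path and alert on a non-200 response.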

Servers

  • tools-checker-03.tools.eqiad.wmflabs

This list is defined in the toollabs::checker_hosts key in https://wikitech.wikimedia.org/wiki/Hiera:Tools and is used to configure the ferm rules for Toolforge's flannel and Kubernetes etcd clusters.

Tools

Several tools are involved in the checks:

  • toolschecker: crontab for the /cron check
  • toolschecker-ge-ws: webservice for the /webservice/gridengine check
  • toolschecker-k8s-ws: webservice for the /webservice/kubernetes check

Checks

/cron

Expects the mtime of /data/project/toolschecker/crontest.txt to be updated every 5 minutes by a grid job executed by the toolschecker tool.

Troubleshooting:

  • ssh login.tools.wmflabs.org
  • become toolschecker
  • crontab -l
*/5 * * * * /usr/bin/jsub -N toolschecker.crontest -once -quiet touch /data/project/toolschecker/crontest.txt
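The freshness test behind a check like /cron can be sketched as follows. This is a minimal illustration, not the toolschecker implementation: the `is_fresh` helper and the one-minute slack added to the 5-minute schedule are assumptions; only the file path comes from the description above.

```python
import os
import time
from typing import Optional

# Path taken from the /cron description above.
CRONTEST_PATH = "/data/project/toolschecker/crontest.txt"

# Assumed threshold: the 5-minute cron interval plus 1 minute of slack.
MAX_AGE_SECONDS = 5 * 60 + 60


def is_fresh(path: str,
             max_age: float = MAX_AGE_SECONDS,
             now: Optional[float] = None) -> bool:
    """Return True if `path` exists and was modified within `max_age` seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file counts as stale.
        return False
    return (now - mtime) <= max_age
```

Calling `is_fresh(CRONTEST_PATH)` on a Toolforge host would then report whether the cron job has touched the file recently enough.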

/db/toolsdb

/db/wikilabelsrw

/dns/private

/etcd/flannel

/etcd/k8s

/grid/continuous/stretch

A small script, /data/project/toolschecker/bin/long-running.sh, runs as a grid job that never exits. If the job stops running, this check goes critical. To guard against that, a cron job is defined as:

*/5 * * * * jlocal /data/project/toolschecker/bigbrother.sh test-long-running-stretch /data/project/toolschecker/bin/long-running.sh

The bigbrother.sh script checks for the job and restarts it if not found.
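The check-and-restart logic can be sketched in a few lines. This is not the content of bigbrother.sh; `ensure_running`, `list_jobs`, and `start_job` are hypothetical names, and the two callables stand in for whatever grid commands the real script uses to list and submit jobs.

```python
from typing import Callable, Iterable


def ensure_running(job_name: str,
                   list_jobs: Callable[[], Iterable[str]],
                   start_job: Callable[[str], None]) -> bool:
    """Start `job_name` if it is not listed; return True if a restart was issued."""
    if job_name in set(list_jobs()):
        # Job is already present, nothing to do.
        return False
    start_job(job_name)
    return True
```

Run from cron every 5 minutes, a function like this would bound the time a stopped job stays down to roughly one cron interval.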

/grid/start/stretch

/k8s/nodes/ready

/ldap

/nfs/dumps

/nfs/home

/nfs/secondary_cluster_showmount

/redis

/self

/webservice/gridengine

/webservice/kubernetes