Toolschecker is a Flask application that runs various active checks on Toolforge and Cloud VPS infrastructure in response to HTTP requests. Each check is exposed as a separate URL on the checker.tools.wmflabs.org host. These URLs are monitored by Icinga for alerting purposes (see "checker.tools.wmflabs.org").
The list of hosts running the checker is defined in the toollabs::checker_hosts key in https://wikitech.wikimedia.org/wiki/Hiera:Tools and is used when configuring the ferm rules for Toolforge's flannel and Kubernetes etcd clusters.
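The Hiera key is a plain YAML list of hostnames. A hypothetical sketch (the hostname shown is illustrative, not the current checker host; the real list lives in Hiera:Tools):

```yaml
# Hypothetical example only; see https://wikitech.wikimedia.org/wiki/Hiera:Tools
# for the actual value.
toollabs::checker_hosts:
  - tools-checker-03.tools.eqiad1.wikimedia.cloud
```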
Several tools are involved in the checks:
This check expects the mtime of /data/project/toolschecker/crontest.txt to be updated every 5 minutes by a grid job run by the toolschecker tool. To inspect the cron entry that does this:
- ssh login.toolforge.org
- become toolschecker
- crontab -l
*/5 * * * * /usr/bin/jsub -N toolschecker.crontest -once -quiet -j y -o /data/project/toolschecker/logs/crontest.log touch /data/project/toolschecker/crontest.txt
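The freshness test itself can be sketched as a small shell function. This is an illustrative sketch only; the 10-minute threshold and the check logic are assumptions, not toolschecker's actual implementation:

```shell
#!/bin/sh
# Illustrative freshness check: succeed only if the file's mtime is
# within the given threshold. Assumes GNU stat (Linux).
check_fresh() {
  file="$1"
  max_age_seconds="$2"
  now=$(date +%s)
  mtime=$(stat -c %Y "$file" 2>/dev/null) || return 1  # missing file => fail
  [ $((now - mtime)) -le "$max_age_seconds" ]
}

# The real check would target the crontest file, e.g.:
# check_fresh /data/project/toolschecker/crontest.txt 600
```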
- You can check the log at /data/project/toolschecker/logs/crontest.log (the -o target in the cron entry above).
- Note that a log line like [Sun Apr 11 22:45:07 2021] there is a job named 'toolschecker.crontest' already active is normal; it is caused by latency in the overall system and is not indicative of a major issue.
- Run qstat to see what jobs are currently on the grid. An erroring job is quite likely to still be visible there.
- If you have a job still showing in qstat, qstat -j $jobid will give you grid-specific error messages (such as an LDAP issue).
- For jobs that are no longer in the system, if you know the job ID you can sometimes learn about them with qacct -j $jobid. This takes longer because it reads the accounting files instead of the live system state. The accounting file *is* rotated, so older jobs will not be in there forever.
- If an errored job is hanging around, it will block the next execution (unique job names are required per user and queue), so run qdel $jobid if you suspect that is happening.
- It is possible the problem is with the grid's cron host. It is currently tools-sgecron-01.tools.eqiad1.wikimedia.cloud, which is a single point of failure and is where cron jobs actually run.
- If there is an overall grid problem, most of our documentation for that is in Portal:Toolforge/Admin, and Brooke is a good escalation point if needed.
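The diagnostic steps above can be condensed into a small helper. This is an illustrative sketch, not an existing script; it assumes the gridengine client tools (qstat, qacct, qdel) are on the PATH:

```shell
#!/bin/sh
# Illustrative triage helper for a stuck grid job such as toolschecker.crontest.
triage_job() {
  job_name="$1"
  if qstat -j "$job_name" >/dev/null 2>&1; then
    # Job still known to the grid: show grid-specific errors (e.g. LDAP),
    # then clear it so unique-name blocking does not stall the next run.
    qstat -j "$job_name"
    qdel "$job_name"
  else
    # Job already gone: consult the accounting files (slower; rotated).
    qacct -j "$job_name"
  fi
}

# e.g. triage_job toolschecker.crontest
```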
There is a small script, /data/project/toolschecker/bin/long-running.sh, that runs as a grid job which is expected to run forever. If it stops running, this checker will go critical. To prevent that there is a cron job definition of:
*/5 * * * * jlocal /data/project/toolschecker/bigbrother.sh test-long-running-stretch /data/project/toolschecker/bin/long-running.sh
The bigbrother.sh script checks for the job and restarts it if not found.
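A minimal sketch of that restart logic, assuming bigbrother.sh looks the job up with qstat and resubmits it with jstart (the actual script may differ in both respects):

```shell
#!/bin/sh
# Illustrative restart logic: if the named job is not known to the grid,
# resubmit it. Real bigbrother.sh may use different lookup/submit commands.
ensure_running() {
  job_name="$1"
  command_path="$2"
  if ! qstat -j "$job_name" >/dev/null 2>&1; then
    jstart -N "$job_name" "$command_path"
  fi
}

# e.g. ensure_running test-long-running-stretch /data/project/toolschecker/bin/long-running.sh
```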