This page is currently a draft.
Toolschecker is a Flask application that runs various active checks on Toolforge and Cloud VPS infrastructure in response to HTTP requests. Each check is exposed as a separate URL on the checker.tools.wmflabs.org host. These URLs are monitored by Icinga for alerting purposes (see "checker.tools.wmflabs.org").
The list of checker hosts is defined in the
profile::toolforge::checker_hosts Hiera key and is used to configure the ferm rules for Toolforge's Kubernetes etcd clusters. The server also needs to be manually configured as a grid submit host.
Several tools are involved in the checks:
- Crontab for /cron check
- Webservice for /webservice/gridengine check
- Webservice for /webservice/kubernetes check
What's the issue?
The mtime of
/data/project/toolschecker/crontest.txt was not updated in the last 5 minutes.
- The grid engine job that updates the timestamp of that file crashed, and subsequent jobs do not start while that job remains in the error state ('E' in the qstat output).
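The freshness condition described above can be sketched in Python as follows. This is only an illustration, not the actual Toolschecker code: the function name is_fresh and the constant MAX_AGE_SECONDS are hypothetical, and the 5-minute threshold is taken from the alert text.

```python
import os
import time

# Assumption: the threshold mirrors the "last 5 minutes" in the alert text;
# the real check may use a different value.
MAX_AGE_SECONDS = 5 * 60


def is_fresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if `path` exists and was modified within `max_age` seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as a failed check.
        return False
    return (now - mtime) <= max_age
```

Calling is_fresh("/data/project/toolschecker/crontest.txt") on a host where that path is mounted would reproduce the condition by hand.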
Check the jobs queue:
mylaptop> ssh login.toolforge.org
myuser@toolforge> become toolschecker
toolschecker@toolforge> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1138412 0.69895 test-long- tools.toolsc Rr    03/25/2021 17:56:31 continuous@tools-sgeexec-0911.     1
   2815 0.39405 lighttpd-t tools.toolsc r     03/25/2021 18:49:01 webgrid-lighttpd@tools-sgewebg     1
3772080 0.25000 toolscheck tools.toolsc Eqw   06/10/2021 08:15:07
The output shows a job in the error state ('E' in the state column). To see the details of that job and the cause of the error:
toolschecker@toolforge> qstat -j 3772080
==============================================================
job_number:                 3772080
...
stderr_path_list:           NONE:NONE:/data/project/toolschecker/logs/crontest.log
...
error reason    1:          can't get password entry for user "tools.toolschecker". Either user does not exist or error with NIS/LDAP etc.
...
Job is in error state
This is a known issue caused by LDAP flakiness. In this case you can simply delete the job and let the next cron run try again. If it happens more than once, it might be a different issue.
toolschecker@toolforge> qdel 3772080
tools.toolschecker has deleted job 3772080
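When the queue is long, spotting the failed job by eye can be tedious. The following is a minimal sketch, assuming the plain qstat column layout shown above (job ID in the first column, state in the fifth); the helper name jobs_in_error is hypothetical and not part of any grid engine tooling.

```python
# Sample taken from the qstat output shown on this page.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1138412 0.69895 test-long- tools.toolsc Rr    03/25/2021 17:56:31 continuous@tools-sgeexec-0911.     1
   2815 0.39405 lighttpd-t tools.toolsc r     03/25/2021 18:49:01 webgrid-lighttpd@tools-sgewebg     1
3772080 0.25000 toolscheck tools.toolsc Eqw   06/10/2021 08:15:07
"""


def jobs_in_error(qstat_output):
    """Return (job_id, state) pairs for jobs whose state contains 'E'."""
    failed = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; skip header and separator lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        if "E" in state:
            failed.append((job_id, state))
    return failed


print(jobs_in_error(SAMPLE))  # [('3772080', 'Eqw')]
```

The returned job IDs are the ones to inspect with qstat -j and, if the cause is the LDAP error above, to remove with qdel.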
Some more info
- You can see the crontab line with:
toolschecker@toolforge> crontab -l
*/5 * * * * /usr/bin/jsub -N toolschecker.crontest -once -quiet -j y -o /data/project/toolschecker/logs/crontest.log touch /data/project/toolschecker/crontest.txt
- There's also a log at
- Note that in the log, the line
[Sun Apr 11 22:45:07 2021] there is a job named 'toolschecker.crontest' already active
is normal: it is caused by latency in the whole system and does not indicate a major issue.
- If you know the ID of a job that is no longer in the system, you can sometimes still get information about it with
qacct -j $jobid. This takes longer because it reads the accounting files rather than the current state of the system. The accounting file *is* rotated, so older jobs will not stay in there forever.
- The problem may be with the grid's cron host, which is currently
tools-sgecron-01.tools.eqiad1.wikimedia.cloud. This host is a single point of failure and is where the cron jobs actually live.
- If there is an overall grid problem, most of the relevant documentation is in Portal:Toolforge/Admin, and Brooke is a good escalation point if needed.
There is a small script at
/data/project/toolschecker/bin/long-running.sh that is submitted as a job and runs forever. If it stops running, this checker will go critical. To guard against that, the following cron job is defined:
*/5 * * * * jlocal /data/project/toolschecker/bigbrother.sh test-long-running-stretch /data/project/toolschecker/bin/long-running.sh
The bigbrother.sh script checks for the job and restarts it if not found.
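The contents of bigbrother.sh are not shown on this page. As a rough illustration of the check-and-restart pattern it is described as following, here is a hypothetical Python sketch; ensure_running and its parameters are invented for this example and do not reflect the actual script.

```python
def ensure_running(job_name, running_jobs, start_job):
    """Start `job_name` via `start_job` unless it is already in `running_jobs`.

    `running_jobs` is any collection of active job names, and `start_job`
    is a callable that submits the job. Both are assumptions made for
    illustration; the real script talks to the grid directly.
    Returns True if the job had to be (re)started.
    """
    if job_name in running_jobs:
        return False
    start_job(job_name)
    return True
```

Running such a check every 5 minutes from cron means the long-running job is resubmitted at most about 5 minutes after it disappears.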