
Toolschecker is a Flask application that runs various active checks on Toolforge and Cloud VPS infrastructure in response to HTTP requests. Each check is exposed as a separate URL on the checker.tools.wmflabs.org host. These URLs are monitored by Icinga for alerting purposes (see "checker.tools.wmflabs.org").

Each endpoint is served by a separate uwsgi service behind an nginx instance running on the VM (currently tools-checker-04.tools.eqiad1.wikimedia.cloud). The setup is configured by Puppet.
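When debugging an alert it can help to probe a check URL by hand. The following is a minimal sketch, not how Icinga is actually configured: it assumes that a passing check answers with HTTP 200 and that anything else should be treated as a failure. The function names status_to_state and probe_check are hypothetical, and the URL (scheme included) is left as a parameter.

```shell
# status_to_state CODE: map an HTTP status code to an Icinga-style state.
# Assumption (not confirmed by this page): only 200 means the check passed.
status_to_state() {
    case "$1" in
        200) echo "OK" ;;
        *)   echo "CRITICAL" ;;
    esac
}

# probe_check URL: request one check URL and print the resulting state.
# curl reports 000 as the status code when the connection itself fails,
# which falls through to CRITICAL above.
probe_check() {
    code=$(curl --silent --output /dev/null --max-time 60 \
        --write-out '%{http_code}' "$1")
    status_to_state "$code"
}
```

For example, probe_check could be called with the full URL of the /cron check on the checker host.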


  • tools-checker-04.tools.eqiad1.wikimedia.cloud

This list is defined in the profile::toolforge::checker_hosts Hiera key and is used to configure the ferm rules for Toolforge's Kubernetes etcd clusters. The server also needs to be manually configured as a grid submit host.


Several tools are involved in the checks:

Crontab for /cron check
Webservice for /webservice/gridengine check
Webservice for /webservice/kubernetes check



What's the issue?

The mtime of /data/project/toolschecker/crontest.txt was not updated in the last 5 minutes.
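The freshness condition described above can be reproduced from a shell to confirm what the check sees. This is a minimal sketch, not the checker's own code (Toolschecker itself is a Flask application); the function name is_stale is hypothetical, and `stat -c %Y` assumes GNU coreutils.

```shell
# is_stale FILE MAX_AGE_SECONDS
# Succeeds (exit 0) if FILE is missing or its mtime is older than
# MAX_AGE_SECONDS; fails (exit 1) if the file was updated recently enough.
is_stale() {
    file=$1
    max_age=$2
    [ -e "$file" ] || return 0          # a missing file counts as stale
    now=$(date +%s)                     # current time, seconds since epoch
    mtime=$(stat -c %Y "$file")         # GNU stat: mtime, seconds since epoch
    [ $((now - mtime)) -gt "$max_age" ]
}
```

For example, `is_stale /data/project/toolschecker/crontest.txt 300 && echo stale` prints "stale" when the file has not been touched in the last 5 minutes.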


  • The grid engine job that updates the timestamp of that file crashed, and subsequent jobs do not start because the job is stuck in the Error state ('E' in qstat).


Check the jobs queue:

mylaptop>               ssh login.toolforge.org
myuser@toolforge>       become toolschecker
toolschecker@toolforge> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
1138412 0.69895 test-long- tools.toolsc Rr    03/25/2021 17:56:31 continuous@tools-sgeexec-0911.     1
   2815 0.39405 lighttpd-t tools.toolsc r     03/25/2021 18:49:01 webgrid-lighttpd@tools-sgewebg     1
3772080 0.25000 toolscheck tools.toolsc Eqw   06/10/2021 08:15:07 

The listing shows a job in error state ('E'). To check the details of the job and the cause of the error:

toolschecker@toolforge> qstat -j 3772080
job_number:                 3772080
stderr_path_list:           NONE:NONE:/data/project/toolschecker/logs/crontest.log
error reason          1:      can't get password entry for user "tools.toolschecker". Either user does not exist or error with NIS/LDAP etc. 
                           Job is in error state

This is a known issue caused by LDAP flakiness. In this case you can just delete the job and let the next run try again (if it happens more than once, it might be a different issue):

toolschecker@toolforge> qdel 3772080
 tools.toolschecker has deleted job 3772080
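Spotting the errored job by eye, as done above, can also be sketched as a small filter over the qstat listing. This assumes the column layout shown in the sample output (job ID in column 1, state in column 5); the function name jobs_in_error is hypothetical. It only prints IDs and deliberately does not call qdel, so the error reason can still be inspected with qstat -j first.

```shell
# jobs_in_error: read qstat output on stdin and print the ID of every job
# whose state column contains 'E' (for example 'Eqw').
# Header and separator lines are skipped because their first field is not
# a number.
jobs_in_error() {
    awk '$1 ~ /^[0-9]+$/ && $5 ~ /E/ { print $1 }'
}
```

Usage, as the toolschecker tool account: `qstat | jobs_in_error`.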

Some more info

  • You can see the crontab line with:
toolschecker@toolforge> crontab -l
   */5 * * * * /usr/bin/jsub -N toolschecker.crontest -once -quiet -j y -o /data/project/toolschecker/logs/crontest.log touch /data/project/toolschecker/crontest.txt
  • There's also a log at /data/project/toolschecker/logs/crontest.log.
    • Note that a log line such as "[Sun Apr 11 22:45:07 2021] there is a job named 'toolschecker.crontest' already active" is normal. It is caused by latency in the whole system and does not indicate a major issue.
  • If a job is no longer in the system but you know its ID, you can sometimes still get information about it with qacct -j $jobid. This takes longer, because it reads the accounting files rather than the current state of the system. The accounting file *is* rotated, so older jobs will not be in there forever.
  • It is possible the problem is with the grid's cron host, currently tools-sgecron-01.tools.eqiad1.wikimedia.cloud. It is a single point of failure and is where the cron jobs actually live.
  • If there is an overall grid problem, most of our documentation for that is in Portal:Toolforge/Admin, and Brooke is a good escalation point if needed.


Portal:Data Services/Admin/Toolsdb




There is a small script at /data/project/toolschecker/bin/long-running.sh that runs as a job and never exits. If it stops running, this checker will go critical. To prevent that, there is a cron job defined as:

*/5 * * * * jlocal /data/project/toolschecker/bigbrother.sh test-long-running-stretch /data/project/toolschecker/bin/long-running.sh

The bigbrother.sh script checks for the job and restarts it if not found.
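The check-and-restart behaviour described above could look roughly like the sketch below. This is not the contents of the real bigbrother.sh, which are not shown on this page. The function names job_listed and ensure_job are hypothetical, the match relies on qstat truncating job names to 10 characters as in the sample listing earlier, and the jsub call uses only the -N option seen in the crontab line; the real script may pass further options.

```shell
# job_listed NAME: read qstat output on stdin and succeed if a job whose
# name column equals the first 10 characters of NAME is listed.
job_listed() {
    short=$(printf '%.10s' "$1")
    awk -v short="$short" '
        $1 ~ /^[0-9]+$/ && $3 == short { found = 1 }
        END { exit !found }
    '
}

# ensure_job NAME COMMAND...: submit COMMAND under NAME unless a job with
# that name is already listed by qstat.
ensure_job() {
    name=$1
    shift
    if qstat | job_listed "$name"; then
        echo "$name is already running"
    else
        echo "$name not found, resubmitting"
        jsub -N "$name" "$@"
    fi
}
```

A call matching the cron entry would be `ensure_job test-long-running-stretch /data/project/toolschecker/bin/long-running.sh`. Matching on a truncated name can give false positives when two job names share their first 10 characters, and a job stuck in error state still counts as listed.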