Help:Toolforge/Monitoring

From Wikitech

Toolschecker

Toolschecker is an application that runs a set of active checks against Toolforge. Icinga polls the status of each check over HTTP and uses it for alerting (see "checker.tools.wmflabs.org").

It's based on Flask and runs as a uWSGI application behind nginx; the two communicate locally over UNIX socket files.
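
As a rough illustration of what a single check endpoint looks like, the sketch below uses a plain WSGI callable from the Python standard library rather than Flask, so it runs without extra dependencies. The /self path and the OK body mirror the strace output in the Internals section. The names check_self, CHECKS, application, and call, and the 503 status for a failing check, are assumptions made for this example and are not taken from toolschecker.py.

```python
from wsgiref.util import setup_testing_defaults


def check_self():
    """Hypothetical check: always passes, like a liveness probe."""
    return True


# Hypothetical routing table; the real checks live in toolschecker.py.
CHECKS = {"/self": check_self}


def application(environ, start_response):
    """Plain WSGI callable that a server such as uWSGI could load."""
    check = CHECKS.get(environ.get("PATH_INFO", ""))
    if check is None:
        status, body = "404 Not Found", b"unknown check"
    elif check():
        status, body = "200 OK", b"OK"
    else:
        # Assumption: a failing check is reported with a non-200 status.
        status, body = "503 Service Unavailable", b"FAIL"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def call(path):
    """Invoke the app in-process and return (status, body) for a quick test."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(application(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call("/self"))
```

Calling call("/self") exercises the handler without a web server, which is handy when reasoning about what a monitoring probe would receive.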

Servers

  • tools-checker-01.tools.eqiad.wmflabs
  • tools-checker-02.tools.eqiad.wmflabs

These hosts are also defined in the toollabs::checker_hosts key in https://wikitech.wikimedia.org/wiki/Hiera:Tools

Files & Directories

Files in Puppet repository:

  • puppet/modules/toollabs/manifests/check.pp - defines the Puppet resource that creates the upstart service for each check (/etc/init/toolschecker_$check.conf)
  • puppet/modules/toollabs/manifests/checker.pp - deploys checks and necessary packages/files
  • puppet/modules/toollabs/templates/toolschecker.upstart.erb - upstart service configuration template for each check
  • puppet/modules/toollabs/templates/toolschecker.nginx.erb - nginx configuration template for each check
  • puppet/modules/toollabs/files/toolscheckerctl - iterates over the list of checks to start, stop, or report the status of each one
  • puppet/modules/toollabs/files/toolschecker.py - actual checks live here
  • puppet/modules/toollabs/files/toolschecker_generic_service.py - test Python tool
  • puppet/modules/toollabs/files/toolschecker_lighttpd_service.php - test PHP tool
  • puppet/modules/icinga/manifests/monitor/toollabs.pp - Icinga configuration

Other relevant files:

  • /etc/init/toolschecker_${check}.conf - upstart service configuration
  • /run/toolschecker/toolschecker_${check}.sock - UNIX socket files

Sample Output

tools-checker-01:~$ sudo toolscheckerctl status
toolschecker_labs_private start/running, process 24689
toolschecker_toolsdb start/running, process 24702
toolschecker_dumps start/running, process 24715
toolschecker_cron start/running, process 24729
toolschecker_webservice_kubernetes start/running, process 24744
toolschecker_continuous_job_trusty start/running, process 24757
toolschecker_labsdb_labsdb1005 start/running, process 24781
toolschecker_grid_start_precise stop/waiting
toolschecker_service_start start/running, process 24811
toolschecker_nfs_secondary_cluster_showmount start/running, process 24825
toolschecker_nfs_home start/running, process 24839
toolschecker_nfs_showmount stop/waiting
toolschecker_kubernetes_nodes_ready start/running, process 24865
toolschecker_self start/running, process 24879
toolschecker_continuous_job_precise start/running, process 24890
toolschecker_flannel_etcd start/running, process 24912
toolschecker_labsdb_labsdb1001 stop/waiting
toolschecker_labsdb_labsdb1003rw stop/waiting
toolschecker_kubernetes_etcd start/running, process 24950
toolschecker_grid_start_trusty start/running, process 24969
toolschecker_puppet_catalog start/running, process 24983
toolschecker_labsdb_labsdb1003 stop/waiting
toolschecker_redis start/running, process 25011
toolschecker_ldap start/running, process 25024
toolschecker_labsdb_labsdb1001rw stop/waiting
toolschecker_labsdb_labsdb1004rw start/running, process 25051
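
When the list is long, the stopped checks are easy to miss. The following sketch parses status output in the format shown above and lists the jobs that are not in the start/running state. The function names parse_status and stopped_checks are made up for this example, and SAMPLE is a shortened copy of the output above. It assumes every relevant line starts with a toolschecker_ job name followed by the upstart state.

```python
def parse_status(output):
    """Map each upstart job name to True (running) or False (not running)."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].startswith("toolschecker_"):
            continue  # skip blank lines and anything that is not a check job
        name, state = parts[0], parts[1].rstrip(",")
        states[name] = state == "start/running"
    return states


def stopped_checks(output):
    """Return the sorted names of all jobs that are not running."""
    return sorted(
        name for name, running in parse_status(output).items() if not running
    )


# Shortened sample in the same format as the toolscheckerctl output.
SAMPLE = """\
toolschecker_toolsdb start/running, process 24702
toolschecker_nfs_showmount stop/waiting
toolschecker_self start/running, process 24879
toolschecker_labsdb_labsdb1001 stop/waiting
"""

if __name__ == "__main__":
    for name in stopped_checks(SAMPLE):
        print(name)
```

On a checker host, the text passed to stopped_checks would be the captured output of sudo toolscheckerctl status. A job in stop/waiting is not necessarily a fault, since some checks may have been stopped deliberately.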

Internals

The strace output below shows nginx passing a request to the toolschecker_self check over the uwsgi protocol, and the check answering with OK:

tools-checker-01:~$ sudo strace -ff -p $pid_of_check_process
...
epoll_wait(4, {{EPOLLIN, {u32=3, u64=3}}}, 1, -1) = 1
accept4(3, {sa_family=AF_LOCAL, NULL}, [2], SOCK_NONBLOCK) = 6
read(6, "\0\213\1\0\f\0QUERY_STRING\0\0\16\0REQUEST_METHOD\3\0GET\f\0CONTENT_TYPE\0\0\16\0CONTENT_LENGTH\0\0\v\0REQUEST_URI\5\0/self\t\0PATH_INFO\5\0/self\r\0DOCUMENT_ROOT\25\0/usr/share/nginx/html\17\0SERVER_PROTOCOL\10\0HTTP/1.1\f\0UWSGI_SCHEME\4\0http\v\0REMOTE_ADDR\r\000208.80.154.84\v\0REMOTE_PORT\5\00048406\v\0SERVER_PORT\2\00080\v\0SERVER_NAME\0\0\17\0HTTP_USER_AGENT(\0check_http/v2.2 (monitoring-plugins 2.2)\17\0HTTP_CONNECTION\5\0close\t\0HTTP_HOST\31\0checker.tools.wmflabs.org", 4100) = 399
write(6, "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 2\r\n\r\n", 78) = 78
write(6, "OK", 2)                       = 2
close(6)                                = 0
writev(2, [{"[pid: 2430|app: 0|req: 67/67] 208.80.154.84 () {32 vars in 395 bytes} [Wed Jan  9 10:35:18 2019] GET /self => generated 2 bytes in 5 msecs (HTTP/1.1 200) 2 headers in 78 bytes (1 switches on core 0)\n", 199}], 1) = 199
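
The read() buffer above is a uwsgi request packet. It has a 4-byte header (modifier1, a 16-bit little-endian payload size, modifier2) followed by key/value strings, each prefixed with a 16-bit little-endian length. strace prints the bytes with octal escapes, so the header `\0\213\1\0` gives a payload size of 0x018b = 395. That matches "32 vars in 395 bytes" in the log line (16 keys plus 16 values) and the 399 bytes read in total. The sketch below decodes such a packet. The function names parse_uwsgi_packet and build_uwsgi_packet are made up for this example, and the builder exists only to produce test input.

```python
import struct


def parse_uwsgi_packet(data):
    """Return (modifier1, modifier2, vars) for one uwsgi request packet."""
    # 4-byte header: modifier1 (uint8), datasize (uint16 LE), modifier2 (uint8).
    modifier1, datasize, modifier2 = struct.unpack_from("<BHB", data, 0)
    payload = data[4:4 + datasize]
    if len(payload) != datasize:
        raise ValueError("truncated packet")
    variables = {}
    pos = 0
    while pos < len(payload):
        fields = []
        for _ in range(2):  # first the key, then the value
            (size,) = struct.unpack_from("<H", payload, pos)
            pos += 2
            fields.append(payload[pos:pos + size].decode("latin-1"))
            pos += size
        variables[fields[0]] = fields[1]
    return modifier1, modifier2, variables


def build_uwsgi_packet(variables):
    """Encode a dict of strings as a uwsgi packet (test helper only)."""
    body = b""
    for key, value in variables.items():
        for item in (key, value):
            raw = item.encode("latin-1")
            body += struct.pack("<H", len(raw)) + raw
    return struct.pack("<BHB", 0, len(body), 0) + body


if __name__ == "__main__":
    packet = build_uwsgi_packet({"REQUEST_METHOD": "GET", "PATH_INFO": "/self"})
    print(parse_uwsgi_packet(packet))
```

Empty values such as QUERY_STRING in the trace are encoded as a zero length with no data bytes, which the loop handles without a special case.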