User:Yuvipanda/Icinga for tools

From Wikitech

Monitoring for tools running on toollabs should alert the tool authors when something goes wrong.

Problem

Currently, when a tool goes down, on-wiki users notice and complain at one of the on-wiki venues (VPT, etc.). Then someone reads that, knows or finds out who the author is, and pokes them to fix it. This is cumbersome, unreliable, and slow. Debug information from the original reporter may also be lost along the way.

Requirements

Set up an instance of Icinga just for toollabs.

  1. This will be separate from the current cluster Icinga, to prevent each group from spamming the other.
  2. It will be configurable by tool authors in a simple way (plain text files in the home directory of the tool, perhaps?). Tool authors should not need to learn Icinga configuration to get the basics working.
  3. A set of common tests for checking the status of tools, covering perhaps 90% of use cases. Ones that come to mind right now are:
    1. Status of a job on SGE
    2. Pinging a URL and checking that it returns 200
    3. Some form of 'heartbeat'-based monitoring: checking the mtime of log files or similar
  4. Some way of contacting the appropriate tool authors. We already have .forward in tools, so this can probably be reused.
  5. Some way for tool authors to acknowledge that they're 'on it', to prevent repeated alerts for the same tool from spamming them. This could be as simple as running a command after logging in.
  6. Some way of not spamming every tool author when toollabs itself goes down. Spamming everyone when NFS gets stuck or similar is a bad idea, and there are already alerts for such infrastructure problems that ping the Labs Admins / Ops folks. Only messages that are *actionable* by the tool authors themselves should be sent to them.
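As a sketch of requirement 2, each tool's home directory could carry a small plain-text file describing its checks, which the monitoring instance would parse. The file format (key = value lines), the file name, and all keys shown here are assumptions for illustration, not an agreed-on format:

```python
# Hypothetical per-tool monitoring config: 'key = value' lines in a
# plain text file in the tool's home. Format is an assumption.

def parse_monitoring_config(text):
    """Parse 'key = value' lines, ignoring blanks and # comments."""
    checks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        checks.append((key.strip(), value.strip()))
    return checks

# Illustrative config; keys correspond to the common checks above.
sample = """
# Checks for this tool (format is illustrative only)
http = https://tools.wmflabs.org/mytool/
heartbeat = logs/access.log 600
sge_job = mytool-webservice
"""

print(parse_monitoring_config(sample))
```

The tools-icinga instance could periodically scan tool homes for such a file and generate the corresponding Icinga service definitions from it.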

Solution

I'm rather new to Icinga (I have just a very basic understanding) and to monitoring in general, so I expect a lot of these details to evolve over time. But a basic solution would be to create a tools-icinga instance on toollabs and then build (or find?) plugins for the monitoring. Differentiating between 'tool labs is down' and 'your tool is down' is going to be an important factor as well. This section will be fleshed out more after I read more documentation and talk to people who actually know Icinga.
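Whatever plugins end up being built or found, Icinga runs checks through the standard Nagios plugin interface: a command that prints a one-line status and exits 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). A minimal sketch of the "ping a URL, check for 200" check from the requirements, assuming Python 3, with the URL passed on the command line:

```python
#!/usr/bin/env python3
# Sketch of an Icinga/Nagios-style check plugin: fetch a URL and
# report OK only if it returns HTTP 200. Exit codes follow the
# standard plugin convention: 0 = OK, 2 = CRITICAL.
import sys
import urllib.request


def check_url(url, timeout=10):
    """Return (exit_code, status_line) for the given URL."""
    try:
        resp = urllib.request.urlopen(url, timeout=timeout)
        code = resp.getcode()
    except Exception as exc:  # DNS failure, timeout, HTTP 4xx/5xx, ...
        return 2, "CRITICAL - %s: %s" % (url, exc)
    if code == 200:
        return 0, "OK - %s returned 200" % url
    return 2, "CRITICAL - %s returned %s" % (url, code)


if __name__ == "__main__" and len(sys.argv) > 1:
    status, message = check_url(sys.argv[1])
    print(message)
    sys.exit(status)
```

The same exit-code convention would apply to the SGE-job and heartbeat checks, so all of them can be dropped into Icinga as ordinary check commands.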


Infrastructure Checks

Checks for:

  1. NFS responsiveness
  2. Apache responsiveness
  3. SGE responsiveness
  4. Responsiveness / load for each exec node
  5. Redis responsiveness
  6. Our uwsgi responsiveness (when it lands)
  7. tools-login and tools-dev availability

These checks shouldn't trigger any alerts for tool authors, but their results are available on status pages, so that when a tool is down its author can see why.
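As an illustration of what such an infrastructure check could look like, here is a sketch of the NFS responsiveness check: time a stat() on a path assumed to live on the NFS mount. The default path and the warning/critical thresholds are assumptions, not settled values:

```python
# Sketch of an NFS responsiveness check: a stat() on an NFS path
# that takes too long (or fails) indicates a stuck or slow mount.
# Path and thresholds below are illustrative assumptions.
import os
import time


def check_nfs(path="/data/project", warn=1.0, crit=5.0):
    """Return (exit_code, status_line): 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    start = time.monotonic()
    try:
        os.stat(path)
    except OSError as exc:
        return 2, "CRITICAL - stat(%s) failed: %s" % (path, exc)
    elapsed = time.monotonic() - start
    if elapsed > crit:
        return 2, "CRITICAL - stat(%s) took %.2fs" % (path, elapsed)
    if elapsed > warn:
        return 1, "WARNING - stat(%s) took %.2fs" % (path, elapsed)
    return 0, "OK - stat(%s) took %.2fs" % (path, elapsed)
```

The SGE, Redis, and Apache checks would follow the same shape: a cheap operation against the service, timed and compared against thresholds.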