Check graphite

From Wikitech

check_graphite is a Nagios/Icinga plugin script that generates alerts based on metric values in Graphite. It queries the /render endpoint of the Graphite server and fetches the data in JSON format. Our code is an (almost complete) rewrite of the check_graphite plugin from Disqus.
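As a rough illustration of the query described above, the following Python sketch builds a /render URL and reads values out of a JSON response. It is not the plugin's actual code: the helper names and the host graphite.example.org are hypothetical, and the response is hand-written in the shape Graphite returns (a list of series, each with [value, timestamp] datapoints).

```python
import json
from urllib.parse import urlencode


def build_render_url(base_url, metric, from_spec="-1hours"):
    """Build a Graphite /render URL that returns JSON for one target."""
    query = urlencode({"target": metric, "from": from_spec, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def extract_values(render_json):
    """Return the non-null values of the first series in a /render response.

    Graphite's JSON format is a list of series, each carrying
    "datapoints" as [value, timestamp] pairs; gaps are reported as null.
    """
    series = json.loads(render_json)
    if not series:
        return []
    return [value for value, _timestamp in series[0]["datapoints"] if value is not None]


# Hypothetical host name, used only to show the shape of the URL.
url = build_render_url("https://graphite.example.org", "reqstats.5xx")

# A hand-written response in the shape Graphite returns.
sample = (
    '[{"target": "reqstats.5xx", "datapoints": '
    "[[120.0, 1400000000], [null, 1400000060], [310.0, 1400000120]]}]"
)
values = extract_values(sample)
```

Skipping null datapoints is a choice made for this sketch; a real check may prefer to treat long gaps as an UNKNOWN state instead.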

Puppet Usage

We have two types of checks that can be performed on graphite-collected metrics:

  • check_graphite_threshold for checking thresholds
  • check_graphite_anomaly which performs some form of anomaly detection on the metric.

Both are just wrappers around our monitor_service define. See their respective documentation for up-to-date usage.

Define monitor_graphite_threshold

Simple threshold checking is supported: it checks whether a given percentage (by default, 1%) of the data points in the interval of interest exceeds a threshold.

So, for instance, if you want to ensure that fewer than 5% of the data points collected in the last hour for the number of 5xx responses are above 500, you can do as follows:

  # Alert if the metric exceeds an absolute threshold in 5% of
  # the data points.
  monitor_graphite_threshold { 'reqstats-5xx':
      description          => 'Number of 5xx responses',
      metric               => 'reqstats.5xx',
      warning              => 250,
      critical             => 500,
      from                 => '1hours',
      percentage           => 5,
  }
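The percentage logic can be sketched in Python as follows. This is an illustration rather than the plugin's implementation: the function name is hypothetical, and the handling of gaps and of exact-boundary comparisons (strictly above the threshold, share greater than or equal to the percentage) are assumptions of the sketch.

```python
def threshold_status(values, warning, critical, percentage=1.0):
    """Classify a series by the share of data points above each threshold.

    Assumptions of this sketch: None values (gaps) are skipped, a point
    counts only when strictly above the threshold, and the status fires
    when the share of such points reaches `percentage`.
    """
    points = [v for v in values if v is not None]
    if not points:
        return "UNKNOWN"

    def share_above(limit):
        # Percentage of usable points that lie above `limit`.
        return 100.0 * sum(1 for v in points if v > limit) / len(points)

    if share_above(critical) >= percentage:
        return "CRITICAL"
    if share_above(warning) >= percentage:
        return "WARNING"
    return "OK"


# 60 one-minute samples: 4 of them (about 6.7%) are above 500.
samples = [100] * 50 + [300] * 6 + [600] * 4
status = threshold_status(samples, warning=250, critical=500, percentage=5)
```

With the parameters from the Puppet example, the four points above 500 make up more than 5% of the hour, so this sketch reports a critical state.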

Define monitor_graphite_anomaly

Very simple predictive checking is also supported: it checks whether more than N points in a given range of datapoints are outside of the Holt-Winters confidence bands, as calculated by graphite (see http://bit.ly/graphiteHoltWinters), at a 3 delta confidence level (99.7%), which should be good in most cases. The obvious advantage of this method is that we don't need to pre-define thresholds at all.

This kind of monitoring always requires at least a week of data in graphite, which is needed to get decent predictions, so it is quite computationally expensive. You can define the interval of datapoints in which you wish to check for anomalies via the check_window parameter.

Let's see how we could try to detect an anomaly in the same metric as before: we will raise a warning if 5, or a critical alert if 10, of the last 200 measured datapoints are anomalous.

  # Alert if an anomaly is found in the number of 5xx responses
  monitor_graphite_anomaly { 'reqstats-5xx-anomaly':
      description          => 'Anomaly in number of 5xx responses',
      metric               => 'reqstats.5xx',
      warning              => 5,
      critical             => 10,
      check_window         => 200,
  }
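The counting step behind such a check can be sketched in Python as below. The confidence bands are taken as given inputs (in practice they would come from graphite), the function names are hypothetical, and treating "N or more anomalies" as the trigger is an assumption of this sketch.

```python
def count_anomalies(values, lower, upper, check_window):
    """Count points in the last `check_window` samples outside the bands.

    `values`, `lower` and `upper` are assumed to be aligned lists of equal
    length; samples where any of the three is None are skipped.
    """
    if check_window <= 0:
        return 0
    recent = list(zip(values, lower, upper))[-check_window:]
    return sum(
        1
        for value, low, high in recent
        if None not in (value, low, high) and (value < low or value > high)
    )


def anomaly_status(anomalies, warning, critical):
    """Map an anomaly count onto a Nagios-style status string."""
    if anomalies >= critical:
        return "CRITICAL"
    if anomalies >= warning:
        return "WARNING"
    return "OK"


# Toy series: constant bands of 80..120, with six spikes in the last 200 points.
values = [100] * 300
for index in (150, 180, 210, 240, 270, 299):
    values[index] = 500
lower = [80] * 300
upper = [120] * 300

anomalies = count_anomalies(values, lower, upper, check_window=200)
status = anomaly_status(anomalies, warning=5, critical=10)
```

In this toy data six of the last 200 points fall outside the bands, which is enough for a warning but not for a critical alert under the thresholds used in the Puppet example.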

A few more checks are in the works; this page will be updated when they are available.