Check graphite
check_graphite is a Nagios/Icinga plugin script that generates alerts based on metric values in Graphite. It queries the /render endpoint of the Graphite server and fetches the data in JSON format. Our code is an (almost complete) rewrite of the check_graphite plugin from Disqus.
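As a rough illustration of the /render query described above, here is a minimal Python sketch. It is not the plugin's actual code: the helper names and the host `graphite.example.org` are hypothetical, and the response is a canned string so the example runs offline. The `target`, `from` and `format=json` parameters and the `[value, timestamp]` datapoint layout are those of the Graphite render API.

```python
import json
from urllib.parse import urlencode


def build_render_url(base_url, target, from_spec="-1hours"):
    """Build a Graphite /render URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": from_spec, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def parse_datapoints(body):
    """Map each series name to its list of values (None marks a gap).

    Graphite returns a JSON list of series; each series has a "target"
    name and "datapoints" given as [value, timestamp] pairs.
    """
    return {
        series["target"]: [value for value, _timestamp in series["datapoints"]]
        for series in json.loads(body)
    }


# Hypothetical host and a canned response, so the sketch runs offline.
url = build_render_url("https://graphite.example.org", "reqstats.5xx")
sample = (
    '[{"target": "reqstats.5xx", "datapoints": '
    "[[12.0, 1400000000], [null, 1400000060], [30.5, 1400000120]]}]"
)
values = parse_datapoints(sample)
```

In a real check the response body would come from an HTTP request to `url` rather than from a canned string.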
Puppet Usage
We have two types of checks that can be performed on graphite-collected metrics:
- check_graphite_threshold for checking thresholds
- check_graphite_anomaly which performs some form of anomaly detection on the metric.
Both are just wrappers around our monitor_service define. See their respective documentation for up-to-date usage docs.
Define monitor_graphite_threshold
Simple threshold checking is supported: the check tests whether a given percentage (by default, 1%) of the data points in the interval of interest exceed a threshold.
So, for instance, if you want to ensure that fewer than 5% of the data points in the last hour for the number of 5xx responses are above 500, you can do as follows:
# Alert if the metric exceeds an absolute threshold 5% of
# the time.
monitor_graphite_threshold { 'reqstats-5xx':
  description => 'Number of 5xx responses',
  metric      => 'reqstats.5xx',
  warning     => 250,
  critical    => 500,
  from        => '1hours',
  percentage  => 5,
}
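The threshold logic described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's implementation: the function names are hypothetical, null datapoints are assumed to be ignored, a point is assumed to "exceed" a threshold when it is strictly greater, and the check is assumed to fire when the share of offending points reaches the configured percentage.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios exit codes


def percent_over(values, threshold):
    """Percentage of non-null datapoints strictly above the threshold."""
    valid = [v for v in values if v is not None]
    if not valid:
        return None  # nothing to measure
    return 100.0 * sum(1 for v in valid if v > threshold) / len(valid)


def check_threshold(values, warning, critical, percentage=1.0):
    """Return a Nagios status for a list of datapoint values.

    Assumption: the alert fires when at least `percentage` percent of
    the datapoints are above the warning (or critical) threshold.
    """
    critical_pct = percent_over(values, critical)
    if critical_pct is None:
        return UNKNOWN  # no usable data in the interval
    if critical_pct >= percentage:
        return CRITICAL
    if percent_over(values, warning) >= percentage:
        return WARNING
    return OK
```

With the parameters from the Puppet example (warning 250, critical 500, percentage 5), an hour of data in which 4% of the points are above 500 and 10% are above 250 would yield a warning rather than a critical alert.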
Define monitor_graphite_anomaly
Very simple predictive checking is also supported: the check tests whether more than N points in a given range of datapoints are outside the Holt-Winters confidence bands, as calculated by Graphite (see http://bit.ly/graphiteHoltWinters), at a 3 delta confidence level (99.7%), which should be good in most cases. The obvious advantage of this method is that we don't need to predefine thresholds at all.
This kind of monitoring always requires Graphite to process at least a week of data, which is needed for decent predictions, so it is fairly computationally expensive. You can define the interval of datapoints to check for anomalies via the check_window parameter.
Let's see how we could detect an anomaly in the same metric as before: we will raise a warning if 5 of the last 200 measured datapoints are anomalous, and a critical alert if 10 are.
# Alert if an anomaly is found in the number of 5xx responses
monitor_graphite_anomaly { 'reqstats-5xx-anomaly':
  description  => 'Anomaly in number of 5xx responses',
  metric       => 'reqstats.5xx',
  warning      => 5,
  critical     => 10,
  check_window => 200,
}
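The counting step behind the anomaly check can be sketched in Python as below. This is a simplified illustration, not the plugin's code: the function names are hypothetical, and it assumes the metric and its lower and upper bands (for example as returned by Graphite's holtWintersConfidenceBands function) have already been fetched as three equally long lists of values. It also assumes that reaching the warning or critical count is enough to alert, and that check_window is a positive integer.

```python
def count_anomalies(values, lower, upper, check_window):
    """Count recent datapoints that fall outside the confidence bands.

    Only the last `check_window` points are inspected; points where the
    value or either band is null are skipped.
    """
    recent = list(zip(values, lower, upper))[-check_window:]
    return sum(
        1
        for value, low, high in recent
        if None not in (value, low, high) and not low <= value <= high
    )


def check_anomaly(values, lower, upper, warning, critical, check_window):
    """Return a Nagios status (0 OK, 1 WARNING, 2 CRITICAL).

    Assumption: the alert fires when the anomaly count reaches the
    configured warning or critical number.
    """
    anomalies = count_anomalies(values, lower, upper, check_window)
    if anomalies >= critical:
        return 2
    if anomalies >= warning:
        return 1
    return 0
```

With the Puppet example above, the equivalent call would use warning=5, critical=10 and check_window=200.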
A few more checks are in the works; this page will be updated when they are available.