Monitoring

Monitoring is a multi-faceted topic: it covers testing and metric collection to audit the availability and performance of networks, servers, and clustered applications, as well as processing the collected data for fault detection, notification, graphing, capacity planning, and analytics.

Stakeholders

  • Operations: to discover and diagnose problems or attacks/compromises, and for capacity planning
  • Developers: for debugging, for discovery of problems with features/systems
  • Other departments: to track business metrics
  • Users: to check status of outages/components

Goals

  • Availability monitoring
  • Performance monitoring
  • Business metric analytics
  • Security incident detection
  • Security auditing
  • Easy to add metrics
  • Easy for non-ops to make alerts and graphs
  • Detect aberrant behavior without defining static thresholds for each metric: Holt-Winters forecasting or similar (a minimal sketch follows this list)
  • Clustering strategy for capacity, HA, and multi-datacenter support
  • Consistent and consolidated metrics collection and storage: one agent, one storage engine
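
The aberrant-behavior goal deserves a concrete illustration. Below is a minimal sketch of additive Holt-Winters (triple exponential smoothing), the technique behind Graphite's holtWintersConfidenceBands; the smoothing parameters, the deviation multiplier k, and the demo data are assumptions for illustration, not tuned values:

    def holt_winters_anomalies(series, season, alpha=0.5, beta=0.05, gamma=0.1, k=3.0):
        """Additive Holt-Winters: forecast each point as level + trend +
        seasonal component, and flag points falling outside k smoothed
        absolute deviations. Needs at least two full seasons to warm up."""
        level = sum(series[:season]) / float(season)
        trend = (sum(series[season:2 * season]) - sum(series[:season])) / float(season ** 2)
        seasonal = [x - level for x in series[:season]]
        deviation = 0.0
        anomalies = []
        for i, x in enumerate(series):
            s = seasonal[i % season]
            forecast = level + trend + s
            if i >= 2 * season and abs(x - forecast) > k * deviation:
                anomalies.append((i, x, forecast))
            prev_level = level
            level = alpha * (x - s) + (1 - alpha) * (level + trend)
            trend = beta * (level - prev_level) + (1 - beta) * trend
            seasonal[i % season] = gamma * (x - level) + (1 - gamma) * s
            deviation = gamma * abs(x - forecast) + (1 - gamma) * deviation
        return anomalies

    # Hourly points with a clean daily cycle, plus one injected spike.
    data = [10.0 + (i % 24) for i in range(24 * 7)]
    data[100] = 80.0
    print(holt_winters_anomalies(data, season=24))
    # Flags the spike at i=100 (and possibly a point or two after it
    # while the model re-converges); no static threshold was defined.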

Design

Art, Science

Monitoring is both art and science. The art of monitoring involves making value judgements. Qualified users should be able to perform these tasks without making changes to Puppet:

  • define computed metrics
  • define dependencies
  • define graphs and dashboards
  • define alert conditions

The science of monitoring should be handled by software managed by Puppet, and agnostic to the metrics themselves (a sketch of this split follows the list):

  • collect the data
  • transport the data
  • store the data
  • generate events based on the data
  • generate alerts based on conditions
  • generate notifications based on dependencies
  • display the data: provide UI, draw graphs
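
A hypothetical illustration of this split (the rules and metric names are invented for the example): the "art" is a data-only rule set that a qualified user edits without touching Puppet, while the "science" is the generic, metric-agnostic evaluator that Puppet deploys and manages:

    import operator

    # Art: user-maintained value judgements, expressed purely as data.
    ALERT_RULES = [
        {"metric": "frontend.requests.error_rate", "op": ">", "threshold": 0.05},
        {"metric": "db.replication.lag_seconds", "op": ">", "threshold": 300},
    ]

    # Science: a generic engine that knows nothing about the metrics.
    OPS = {">": operator.gt, "<": operator.lt}

    def evaluate(rules, latest_values):
        """Yield (metric, value, threshold) for every breached rule."""
        for rule in rules:
            value = latest_values.get(rule["metric"])
            if value is not None and OPS[rule["op"]](value, rule["threshold"]):
                yield rule["metric"], value, rule["threshold"]

    for breach in evaluate(ALERT_RULES, {"frontend.requests.error_rate": 0.09}):
        print("ALERT: %s=%s exceeds %s" % breach)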

Components

Contemporary design of a large-scale monitoring infrastructure divides the task among several subcomponents:

  • Agent / sensor / collector: A daemon or service normally run on each node to be monitored, periodically collecting metrics from the kernel and application-specific plugins and forwarding them to a broker or directly to a storage engine. These should replace Gmond, NSCA, NRPE.
    • Collectd (C) and Diamond (Python) are both popular packages which deliver similar sets of metrics.
  • Aggregator / broker: Routes all locally collected metrics to a queue, event processor, or storage engine. May batch or summarize the data.
    • Statsd: Summarizes data (sum, avg) to minimize network load, which may or may not be acceptable. Can cache metrics in case of network failure. (Wire formats are sketched after this list.)
    • Collectd: Its native networking supports multicast, encryption, and proxying. Accepts statsd-format metrics via an input plugin. Caching is optional with the AMQP output.
    • RabbitMQ: Collectd, Diamond, and Statsd can all output via AMQP to RabbitMQ, decoupling metric submission from metric processing.
    • Kafka: Already in use for Analytics, offers log replay. Perhaps this service should be used for all metrics. Would require writing plugins for Statsd or Collectd.
  • Storage engine: RRD files, the popular metric storage of the last decade, are showing their age. Two challengers have appeared:
    • Carbon/Whisper: Graphite's time series database service and storage format
    • OpenTSDB: An open source implementation similar to Google's Borgmon storage engine; it runs atop HBase.
  • State engine / event processor: In addition to narrowly focused event processing packages, this functionality together with a poller forms the core of many monolithic monitoring packages. Events can be processed as they are received by the storage engine, or by asynchronously examining stored metrics:
    • Icinga: Can actively poll services, indirectly poll via check_carbon, or operate asynchronously via NSCA, event broker plugins, etc. Clustering design contains a SPOF called the Central Monitoring Server.
    • Icinga + Mod-Gearman: Gearman implements a custom agent and broker with the goal of accelerating active checks. Unclear if polling is delegated to workers. Retains central server SPOF.
    • Riemann: A purpose-built event processor. Able to listen for events sent over Carbon's plaintext protocol (precisely how does this work?), as well as record events to Graphite. Documentation on scaling is not encouraging.
    • Sensu: A custom agent plus "monitoring router", which uses RabbitMQ and Redis to avoid SPOFs. Running more than one server is supported and recommended.
    • Shinken: A modular rewrite of Nagios in Python which appears to have no SPOFs.
  • Notifier: Icinga, Shinken, and Sensu all have built-in notification systems based on email gateways. The leading alternative is PagerDuty, a paid service offering an API and an SLA.
  • Visualizer: If all monitoring metrics are stored in Carbon, adding dashboards is easy. Finding one which presents a structured way to drill down from cluster to host to metric in a manner similar to Ganglia has been a bit more challenging.
    • Grafana: Aims to present data from Carbon much as Kibana presents Logstash data. Dashboards may be defined interactively or via Puppet file templates, providing both customized and per-cluster views.
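
To make the plumbing concrete, here is a sketch of the two wire formats in play (host names are placeholders and ports are the conventional defaults). A client fires counter increments at Statsd over UDP in its "name:value|type" line format; at flush time Statsd sums them and forwards one aggregated line to Carbon over the plaintext TCP protocol, "<metric path> <value> <unix timestamp>\n", which is also the line format the Riemann note above refers to:

    import socket
    import time

    # Client side: statsd line format over UDP (conventional port 8125).
    # "name:value|type" where c = counter, ms = timer, g = gauge.
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for _ in range(3):
        udp.sendto(b"frontend.hits:1|c", ("statsd.example.org", 8125))

    # What Statsd does at flush time, in miniature: the three increments
    # are summed and emitted as ONE line to Carbon's plaintext listener
    # (conventional TCP port 2003), trading resolution for network load.
    line = "stats.frontend.hits 3 %d\n" % int(time.time())
    carbon = socket.create_connection(("graphite.example.org", 2003))
    carbon.sendall(line.encode("ascii"))
    carbon.close()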

Targets

Targets may be described in terms of layers, and sources within those layers.

  • Network:
    • Routers
    • Switches
    • Firewalls
    • PDUs
    • Netapps
  • Server:
    • Linux kernel (see the sketch below)
    • System daemons (ntpd, puppet, sshd, ...)
    • Applications (Apache, Kafka, ...)
  • Cluster service:
    • Mediawiki
    • Memcached
    • External Storage

A separate article has been created to identify all targets: Monitoring sources
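
To show what an agent plugin at the server layer actually does, this sketch reads a few kernel metrics straight from /proc and prints them as Carbon-style lines (the servers.* namespace is invented for illustration):

    import socket
    import time

    def kernel_metrics():
        """Read a handful of kernel counters from /proc (Linux only)."""
        with open("/proc/loadavg") as f:
            load1, load5, load15 = f.read().split()[:3]
        yield "load.1min", float(load1)
        yield "load.5min", float(load5)
        yield "load.15min", float(load15)
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemFree:"):
                    yield "mem.free_kb", int(line.split()[1])

    hostname = socket.gethostname().replace(".", "_")
    now = int(time.time())
    for name, value in kernel_metrics():
        print("servers.%s.%s %s %d" % (hostname, name, value, now))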

Implementation

Available tools

A separate article has been created to list the dozens of relevant tools discussed or in use: Monitoring package survey

Current generation

  • Network:
    • Routers and PDUs are monitored via SNMP by LibreNMS and Torrus
    • Routers syslog to LibreNMS
    • Network latency is measured by Prometheus Blackbox probes
    • Router config changes are watched by RANCID
    • External reachability monitored by Nimsoft Cloud Monitor
  • Servers:
    • Physical info stored in Netbox, entered manually
    • Icinga for availability monitoring via active checks as well as NRPE and NSCA on select hosts
    • Syslog goes to Logstash in pmtpa; no alerting
  • Cluster services:
    • Database dashboard: dbtree
    • Performance dashboards: Grafana, Graphite, Gdash
    • Webrequest analytics infrastructure: Kraken (limited deployment)
    • Which apps deliver stats to Carbon? Which apps use jmxtrans? Where is sqstat used? Do they all communicate via Statsd?
    • Which apps deliver logs to Logstash?

Next generation

Desired

  • Anomaly detection
  • Store data in Carbon/Whisper rather than RRD wherever possible
  • Send all logs to Logstash
  • Consolidate all logs, generate alerts
  • Clients may submit metrics without pre-configuring the monitoring server to accept them
  • IDS/IPS, send stats to Carbon, generate alerts
  • Eliminate Ganglia and NRPE (made redundant by Graphite et al.)
  • Eliminate Torrus (appears redundant with LibreNMS)
  • Use a broker to decouple metric submission from processing and storage
  • Agents or the broker cache and resubmit data in case of network outage (see the sketch after this list)
  • Is it worth the effort to dump historical data from RRDs to import into Carbon?
  • Virtualization-aware monitoring to eliminate separate Icinga instance for Labs?
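
A minimal sketch of the cache-and-resubmit item, assuming Carbon's plaintext protocol as the transport (the host, port, and buffer size are illustrative): undelivered metrics sit in a bounded queue, oldest dropped first, and are replayed once the storage engine is reachable again:

    import socket
    import time
    from collections import deque

    class BufferingCarbonSender:
        """Queue metric lines while the network is down; replay on recovery."""

        def __init__(self, host="graphite.example.org", port=2003, maxlen=100000):
            self.addr = (host, port)
            self.buffer = deque(maxlen=maxlen)  # bounded: oldest lines drop first

        def send(self, path, value, timestamp=None):
            ts = int(time.time() if timestamp is None else timestamp)
            self.buffer.append("%s %s %d\n" % (path, value, ts))
            self.flush()

        def flush(self):
            try:
                sock = socket.create_connection(self.addr, timeout=5)
            except OSError:
                return  # network down: keep everything queued for the next try
            try:
                while self.buffer:
                    sock.sendall(self.buffer[0].encode("ascii"))
                    self.buffer.popleft()  # dequeue only after a successful write
            except OSError:
                pass  # connection died mid-flush; remaining lines stay queued
            finally:
                sock.close()

    sender = BufferingCarbonSender()
    sender.send("servers.db1001.load.1min", 0.42)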

Recommended

  • Anomaly detection: Skyline, Oculus
  • Performance dashboard: Grafana
  • IDS/IPS: Fail2ban
  • Logging: all logs delivered to Logstash, use Graphite output to send stats to Carbon
  • State engine: Icinga + Mod-Gearman, Sensu, or Shinken
  • Stats collecting agent: Collectd, or Statsd + Diamond
  • Network monitoring: Dbeacon
