Monitoring package survey

An exhaustive list of monitoring tools and services evaluated or used at WMF

Open Source

Alerta

Overview: Nagios-like alerting system backed by MongoDB, released by The Guardian
URL:
- https://github.com/guardian/alerta/

Cabot

Overview: Self-hosted, easily-deployable monitoring and alerts service - like a lightweight PagerDuty
URL:
- http://cabotapp.com/
- https://github.com/arachnys/cabot

Centreon

Overview: Fork of Nagios with commercial support, can also use the Icinga "engine"
URL: http://www.centreon.com
Pro:
Con:
- Some features only in paid "enterprise" version
- Nagios architecture: check scripts determine warning and critical state

Check_graphite

Overview: Nagios check script to generate alerts based on data in Graphite. Multiple implementations exist.
URL: https://github.com/disqus/nagios-plugins/
Pro:
Con:
Status: Currently in use

Check_mk

Overview: A client agent runs checks/plugins asynchronously and listens on a TCP port. On connection it immediately dumps all stats and closes the connection. A single Icinga active check connects to each client and returns the data as passive check results.
URL:
- http://mathias-kettner.com/check_mk.html
- http://en.wikipedia.org/wiki/Check_MK
Pro:
- Very fast / scales well
- Integrates with Graphite
- Can use Nagios plugins
- Can completely replace Nagios/Icinga using the optional "check_mk micro core"
Con:
- Generates its own Icinga config based on discovery, so service monitors are not defined for services which are down at the time of the discovery scan
- Can be tricky to integrate with Puppet-based config templates for Icinga
- Some features only in paid "enterprise" version
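The agent behaviour described in the Check_mk overview (connect, receive everything, connection closes) can be sketched in Python as below. This is an illustration only: port 6556 is assumed as the agent's usual default, the `<<<section>>>` header layout is assumed from typical agent output, and both function names are hypothetical.

```python
import socket


def fetch_agent_output(host, port=6556, timeout=10.0):
    """Connect to an agent and read until it closes the connection.

    Port 6556 is an assumed default; adjust it for the local setup.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the agent has closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_agent_output(text):
    """Split agent output into {section_name: [lines]}.

    Assumes each section starts with a header line of the form <<<name>>>.
    Lines before the first header are ignored.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3]
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections
```

Because the whole dump arrives in one connection, the server side pays one TCP round trip per host rather than one per check, which is the main reason this design scales better than per-check polling.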

Collectd

Overview: collectd is a daemon which collects system performance statistics periodically and provides mechanisms to store the values in a variety of ways
URL:
Pro:
- Written in C
- Network traffic can be signed or encrypted
- Clients push data to a server or multicast group
- Default resolution is 10 seconds
- Can store data in Graphite (Carbon), RRD, Redis, MongoDB, several others
- Statsd plugin implements the StatsD network protocol to allow clients to report events. These events are aggregated by collectd and dispatched regularly.
- Can execute nagios check scripts
- Contains glue allowing Nagios to check stats harvested by collectd
Con:
- As of February 2014, the website says collectd has run on hundreds of nodes, but acknowledges that nobody has reported a deployment of 1,000 nodes.
- There is no "write plugin" to publish to Ganglia, preventing collectd from being a drop-in replacement for gmond

Cucumber-nagios

Overview: Allows you to write high-level behavioural tests of web applications, and plug the results into Nagios.
URL: https://github.com/auxesis/cucumber-nagios
Pro:
Con:

Cyanite

Overview: Cassandra backend for Graphite (Carbon)
URL: https://github.com/pyr/cyanite

Dashing

Overview: Dashing is a Sinatra based framework that lets you build beautiful dashboards.
URL: http://shopify.github.io/dashing/
Pro:
Con:

Dbeacon

Overview: dbeacon is a multicast beacon: its main purpose is to monitor other beacons' reachability and collect statistics such as loss, delay and jitter between them.
URL: https://packages.debian.org/sid/dbeacon
Pro:
Con:

Diamond

Overview: Diamond is a Python daemon that collects system metrics and publishes them to Graphite, OpenTSDB, and others.
URL: https://github.com/BrightcoveOS/Diamond
Pro:
- Popular
- Written in Python
Con:

Fail2ban

Overview: Anti-brute force monitor which reacts to repeated bad auth attempts by banning the source IP temporarily
URL: http://www.fail2ban.org/wiki/index.php/Main_Page
Pro:
Con:

Firefly

Overview: Graph display/dashboard from Yelp
URL: http://engineeringblog.yelp.com/2012/08/firefly-illuminate-your-websites-performance.html
Pro:
- Supports Ganglia and Graphite
Con:

Ganglia

Overview: Cluster-oriented performance metric agent, collector, & grapher
URL:
Pro:
- Newer versions can use Graphite instead of RRD for storage and graphing
Con:
- Unpleasant to write plugins for (requires a lot of boilerplate)
Status: Retired, not in use anymore.

Ganglios

Overview: Ganglios is a collection of tools that allow Nagios to trigger alerts based on data it pulls from Ganglia.
URL:
- https://bitbucket.org/maplebed/ganglios/
- https://wikitech.wikimedia.org/wiki/Ganglios
Pro:
Con:

Grafana

Overview: Graphite dashboard inspired by Kibana
URL:
- http://grafana.org
- https://github.com/torkelo/grafana
Pro:
Con:

Graphios

Overview: Sends Nagios performance data to Graphite (Carbon)
URL: https://github.com/shawn-sterling/graphios
Pro:
Con:
See also: Metricinga

Graphite

Overview: Graphite is the overall project name for a monitoring system consisting of:
- Carbon, a Twisted daemon which listens for time-series data
- Whisper, a database library for storing time-series data (similar to RRD)
- Graphite, a Django webapp that renders graphs on-demand using Cairo
- Does not include an agent daemon; however, stats may be submitted to Carbon from Collectd, Diamond, Ganglia, Jmxtrans, Statsd, and others
- Does not include alerting; however, this is available via tools which read from Graphite, such as Rearview or Seyren
URL:
Pro:
- Separates the collection of metrics from the definition of graphs
- Can accept metrics at any update frequency, including sporadic events
- Can use AMQP
Con:
Status: Currently in use: http://graphite.wikimedia.org/
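As an illustration of how little an agent needs in order to submit stats to Carbon, the sketch below formats and sends one data point using Carbon's plaintext line protocol (`<metric path> <value> <timestamp>`). Port 2003 is assumed as the usual plaintext listener port, and the host, metric name and function names are hypothetical.

```python
import socket
import time


def format_plaintext(path, value, timestamp=None):
    """Build one line of Carbon's plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_plaintext(lines, host="127.0.0.1", port=2003, timeout=5.0):
    """Send already formatted lines to a Carbon plaintext listener.

    Host and port are assumptions; 2003 is the commonly used default.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

A data point can be sent at any time, for example `send_plaintext([format_plaintext("deploys.mediawiki", 1)])` for a sporadic event, which is what allows Graphite to accept metrics at any update frequency.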

Graphsky

Overview: Graphite dashboard similar to the Ganglia UI, using data from Collectd
URL:
- https://github.com/hyves-org/graphsky
- Demo: http://graphsky.skyler.cc/
Pro:
- Ganglia-like design gives overview + drill-down ability
- Simple dashboard and graph definition in JSON
Con:
- Doesn't display any numbers in dashboards or graph legends
- Lacking navigation elements in the UI
- Documentation error: the recommended prefix is "collectd.", but a prefix such as "collectd.production.bits." is preferable because it encodes the environment and cluster in the metric name.
Status: Good potential but needs development.

Groundwork

Overview: Unified systems monitoring and network management: Nagios(R), Nmap, RRDtool, etc. - integrated in one system administration tool
URL: http://www.groundworkopensource.com
Pro:
Con:
- Nagios architecture: check scripts determine warning and critical state

Hyperic

Overview:
- "Hyperic is application monitoring and performance management for virtual, physical, and cloud infrastructures. Auto-discover resources of 75+ technologies, including vSphere, and collect availability, performance, utilization, and throughput metrics."
- Now owned by VMware
URL: http://www.hyperic.com
Pro:
Con:

Icinga

Overview: Fork of Nagios
URL:
Pro:
Con:
- Nagios architecture: check scripts determine warning and critical state
Status: Currently in use: https://icinga.wikimedia.org , http://icinga.wmflabs.org/

IDOUtils (NDOUtils)

Overview: "The IDOUtils (Icinga Data Output Utils) addon is designed to store all configuration and event (status, historical) data from Icinga into a relational database. Storing information from Icinga in an RDBMS will allow for quicker retrieval and processing of that data." An Event Broker plugin.
URL:
- http://docs.icinga.org/latest/en/ch12.html
- http://nagios.larsmichelsen.com/ndoutils-nagios-data-out/
Pro:
Con:
- Unclear how/if this achieves the goal of increased scalability
- This seems to be only for "output" -- simply recording stats rather than using the database as an R/W data store

Jmxtrans

Overview: jmxtrans is effectively the missing connector between speaking to a JVM via JMX on one end and whatever logging / monitoring / graphing package that you can dream up on the other end.
URL: http://www.jmxtrans.org/
Pro:
- Can log to Carbon/Graphite
Con:
Status: Currently in use for Hadoop

KairosDB

Overview: Rewrite of OpenTSDB, can use either HBase or Cassandra
URL: https://github.com/proofpoint/kairosdb
Pro:
Con:

LibreNMS

Overview: GPL fork of Observium
URL:
- https://wikitech.wikimedia.org/wiki/LibreNMS
Pro:
Con:
Status: Currently in use: https://librenms.wikimedia.org/

Logstash

Overview: A tool that can be used to collect, process and forward events and log messages. When used with Elasticsearch and Kibana it provides a dashboard for searching and analyzing logs.
URL:
- http://logstash.net/
- https://wikitech.wikimedia.org/wiki/Logstash
Pro:
Con:
Status: Currently in use: https://logstash.wikimedia.org/

Logster

Overview: Logster is a utility for reading log files and generating metrics in Graphite or Ganglia
URL: https://github.com/etsy/logster
Pro:
Con:

Merlin

Overview: Merlin (Module for Effortless Redundancy and Loadbalancing In Nagios) is a Nagios Event Broker plugin. It was started to make distributed Nagios installations easy to set up, allowing Nagios processes to exchange information directly as an alternative to the standard Nagios approach of using NSCA.
URL:
- https://kb.op5.com/display/MERLIN/Distributed+%28Merlin%29+Home
- https://github.com/ageric/merlin
Pro:
Con:

Metricinga

Overview: Parses performance data files from Nagios/Icinga and sends the results to Graphite via the Carbon pickle port.
URL: https://github.com/jgoldschrafe/metricinga
Pro:
Con:
See also: Graphios
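For readers unfamiliar with the Carbon pickle port, the following sketch shows how a batch of data points is framed for it: a pickled list of `(path, (timestamp, value))` tuples, prefixed with a 4-byte big-endian length header. This is not Metricinga's own code; the function name and sample metrics are hypothetical, and port 2004 (mentioned in the comment) is only the commonly used default.

```python
import pickle
import struct


def build_pickle_message(metrics):
    """Frame a batch of metrics for a Carbon pickle receiver.

    `metrics` is a list of (path, (timestamp, value)) tuples.
    The message is a 4-byte unsigned big-endian length followed by the
    pickled list. It would typically be written to a TCP connection to
    the pickle listener (commonly port 2004, an assumption here).
    """
    payload = pickle.dumps(metrics, protocol=2)
    header = struct.pack("!L", len(payload))
    return header + payload
```

Batching many points into one message is the main advantage over the plaintext protocol when forwarding large performance data files.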

Mod-gearman

Overview: Extends Nagios/Naemon/Icinga to run scalable and distributed setups. Worker nodes can be placed all over your network while keeping the simplicity of a central configuration. Uses the Nagios Event Broker (NEB) API. Intercepts checks on the Nagios server, placing them on a queue. Client/worker nodes execute tasks from the queue and place the results in another queue: http://labs.consol.de/nagios/mod-gearman/#_how_does_it_work
URL:
- http://mod-gearman.org/
- https://wiki.icinga.org/display/howtos/Icinga+with+mod_gearman+on+RHEL+and+Debian
Pro:
Con:

Monit

Overview: Process supervisor with optional GUI and email alerting
URL:
- http://mmonit.com/monit/
- http://en.wikipedia.org/wiki/Monit
Pro:
Con:

Munin

Overview:
URL:
- http://munin-monitoring.org/
- http://en.wikipedia.org/wiki/Munin_(network_monitoring_application)
Pro:
Con:
- Default resolution: 5 minutes
- Doesn't scale well enough for our needs

Nagios

Overview: The de facto standard tool for availability monitoring and alerting
URL: http://www.nagios.org
Pro:
Con:
- Nagios architecture: check scripts determine warning and critical state
- Dissatisfaction with the project has led to multiple forks and rewrites: Centreon, Icinga, Naemon, OpsView, Shinken
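The "Nagios architecture" point that recurs in this survey means that each check script compares a measurement against its own thresholds and reports the resulting state through its exit code. A minimal sketch follows, assuming the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function names, the `load1` label and the thresholds are hypothetical.

```python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(label, value, warn, crit):
    """Return (exit_code, status_line) for a metric where higher is worse."""
    if value is None:
        return UNKNOWN, f"UNKNOWN - no data for {label}"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    # Text after "|" is performance data in the form label=value;warn;crit
    line = f"{STATE_NAMES[code]} - {label} is {value} | {label}={value};{warn};{crit}"
    return code, line


def main():
    """Entry point for a real plugin: print one status line, exit with the state."""
    code, line = evaluate("load1", 3.2, warn=4.0, crit=8.0)
    print(line)
    sys.exit(code)
```

The consequence noted in the Con lists is that thresholds live in each script invocation rather than in a central rule engine operating on stored metrics, which is what tools such as check_graphite, Rearview and Seyren try to address.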

NetDB

Overview: NetDB keeps track of devices on your network and the status of your switch ports over time.
URL: http://netdbtracking.sourceforge.net

NRPE

Overview: Nagios Remote Plugin Executor - NRPE is an addon that allows you to execute plugins on remote Linux/Unix hosts. This is useful if you need to monitor local resources/attributes like disk usage, CPU load, memory usage, etc. on a remote host.
URL: http://exchange.nagios.org/directory/Addons/Monitoring-Agents/NRPE--2D-Nagios-Remote-Plugin-Executor/details
Pro:
- Allows Nagios/Icinga server to trigger execution of check scripts on client nodes
Con:
- Opens a TCP connection for each check script
- Clients run a listener daemon, requiring complementary firewall rules on hosts which have public IPs
- Check scripts are run synchronously (active checks), potentially causing high latency responses which impact Nagios/Icinga server performance
- Code review by Tim deemed NRPE unacceptable for Fundraising cluster
Status: Currently in use

NSCA

Overview: Nagios Service Check Acceptor - NSCA is an addon that allows you to send passive check results from remote Linux/Unix hosts to the Nagios daemon running on the monitoring server.
URL: http://exchange.nagios.org/directory/Addons/Passive-Checks/NSCA--2D-Nagios-Service-Check-Acceptor/details
Pro:
- Submits passive checks
Con:
- Requires external mechanism on clients to trigger execution, such as cron
Status: Currently in use on Fundraising cluster
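A passive check result is just a small record produced on the client and pushed to the server. The sketch below formats one in the tab-separated layout that the `send_nsca` client reads on standard input (host, service description, return code, plugin output). The function name and sample values are hypothetical, and the default tab delimiter is assumed.

```python
def format_service_result(host, service, return_code, output):
    """Build one passive service check line: host<TAB>service<TAB>code<TAB>output."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    fields = [host, service, str(return_code), output]
    for field in fields:
        # Tabs or newlines inside a field would corrupt the record layout.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields) + "\n"
```

In a setup like the one described here, a cron job on the client would run the actual check, format the result this way and pipe it to `send_nsca`.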

Observium

Overview:
URL:
Pro:
Con:
Status: We switched from this to LibreNMS

Oculus

Overview: Anomaly correlation: given an identified anomalous metric, searches for similar metrics to help determine scope and root cause.
URL: https://github.com/etsy/oculus
Pro:
Con:

OpenNMS

Overview: OpenNMS is a free and open-source enterprise grade network monitoring and network management platform written in Java
URL:
- http://www.opennms.org
- http://en.wikipedia.org/wiki/OpenNMS
Pro:
Con:

OpenTSDB

Overview: OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. Unlike RRD or Whisper, it never deletes or downsamples data.
URL: http://opentsdb.net/
Pro:
- Super-scalable, said to be similar to Google's proprietary Borgmon
Con:

OpsView

Overview: GPL-licensed, commercially sponsored monitoring suite based on Nagios
URL:
- http://www.opsview.org
- http://en.wikipedia.org/wiki/Opsview
Pro:
Con:
- Nagios architecture

Pandora FMS

Overview: Complete monitoring system
URL: http://pandorafms.org
Pro:
Con:
- Some features only in paid "enterprise" version
- https://github.com/monitoringsucks/tool-repos/tree/master/pandora-fms

Prometheus

Overview:
URL:
- https://prometheus.io/
- Prometheus at WMF
Pro:
- multi-dimensional data model
- powerful query language
- pull model for metric collection
Con:
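To make "multi-dimensional data model" and "pull model" concrete: each sample is a metric name plus a set of key/value labels, and a monitored process exposes its current samples as text over HTTP for the Prometheus server to scrape. The sketch below renders one sample in the text exposition format; the function names and the example metric are hypothetical.

```python
def _escape_label_value(value):
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample as 'name{label="value",...} value'.

    Labels are sorted by key so the output is deterministic.
    """
    if labels:
        pairs = ",".join(
            f'{key}="{_escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"
```

Each distinct combination of label values is its own time series, so one metric name can be sliced by method, status code, host and so on at query time.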

RANCID

Overview:
URL:
- https://wikitech.wikimedia.org/wiki/RANCID
Pro:
Con:
Status: Currently in use

Rearview

Overview: Allows users to create monitors that both visualize and alert on data as it streams from Graphite.
URL: https://github.com/livingsocial/rearview/
Pro:
- Could replace Icinga for alerting
Con:
- Crontab compatible time specification means minimum 1 minute sampling frequency

Riemann

Overview: Riemann is an event stream processor.
URL: http://riemann.io/
Pro:
- Could replace Icinga for alerting
Con:

Sensu

Overview: Open source monitoring framework that uses RabbitMQ and Redis
URL:
- http://sensuapp.org/
- http://failshell.io/sensu/high-availability-sensu/
Pro:
- Can be HA / no SPOFs
- Can send data to Carbon and OpenTSDB
Con:

Servermon

Overview: "Servermon is a Django project with the aim of facilitating server monitoring and management through Puppet."
URL: https://github.com/servermon/servermon (not http://sourceforge.net/projects/servermon/)
Pro:
Con:
Status: Previously in use on Sockpuppet, did not survive the transition to Palladium. There is desire to use it again in the future. See: sockpuppet:/srv/servermon/

Seyren

Overview: An alerting dashboard for Graphite
URL: https://github.com/scobal/seyren
Pro:
- Could replace Icinga for alerting
Con:

Shinken

Overview: Rewrite of Nagios in Python. "Shinken's architecture aims to offer easier load balancing and high availability. The administrator manages a single configuration, the system automatically "cuts" it into parts and dispatches it to worker nodes."
URL:
- http://www.shinken-monitoring.org/
- http://en.wikipedia.org/wiki/Shinken_%28software%29
Pro:
Con:

Skyline

Overview: Skyline is a real-time anomaly detection system, built to enable passive monitoring of hundreds of thousands of metrics, without the need to configure a model/thresholds for each one
URL: https://github.com/etsy/skyline
Pro:
Con:

Smokeping

Overview: Network latency grapher
URL: http://oss.oetiker.ch/smokeping/
Pro:
- Provides a view of network latency and packet loss not available from other tools
Con:
- Doesn't support RRDCached
Status: Deprecated (see also smokeping.wikimedia.org)

Statsd

Overview: A network daemon that listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
URL:
- Reference implementation, Node.js: https://github.com/etsy/statsd/
- Python-Twisted implementation used at WMF: https://github.com/sidnei/txstatsd
- DeviantArt fork of pystatsd: https://github.com/deviantART/pystatsd/
Pro:
Con:
Status: Python-txstatsd is currently in use on Tungsten
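The wire format clients use to report counters and timers is a single short text line per event, sent as a UDP datagram. A minimal sketch, assuming the common `name:value|type` line layout and the usual default port 8125; the function names and metric names are hypothetical.

```python
import socket


def format_counter(name, value=1, sample_rate=None):
    """Counter line, e.g. 'logins:1|c' or 'logins:1|c|@0.1' when sampled."""
    line = f"{name}:{value}|c"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def format_timer(name, milliseconds):
    """Timer line, e.g. 'render_time:320|ms'."""
    return f"{name}:{milliseconds}|ms"


def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget: send one metric line as a single UDP datagram.

    Host and port are assumptions; 8125 is the commonly used default.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Because UDP needs no connection or acknowledgement, instrumented application code is not blocked if the daemon is slow or down, at the cost of silently losing some events.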

Tessera

Overview: A dashboard front end for Graphite, similar in purpose to Grafana but a separate project.
URL:
- https://github.com/urbanairship/tessera

Torrus

Overview: SNMP grapher. "Torrus is an alternative software platform to MRTG, Cricket and Cacti."
URL:
- http://www.torrus.org/
- https://wikitech.wikimedia.org/wiki/Torrus
Pro:
Con:
Status: Currently in use: http://torrus.wikimedia.org

Umpire

Overview: Lets you test Graphite metrics via HTTP: "Umpire provides a normalized HTTP endpoint that responds with 200 / non-200 according to the metric check parameters specified in the requested URL."
URL: https://github.com/heroku/umpire
Pro:
Con:

Zenoss

Overview:
URL:
- http://www.zenoss.com
- http://en.wikipedia.org/wiki/Zenoss
Pro:
Con:

Zabbix

Overview:
URL: http://www.zabbix.com
Pro:
Con:

Developed for WMF

Dbtree

Overview: Database replication and performance dashboard developed at WMF
URL:
- https://wikitech.wikimedia.org/wiki/Dbtree
Pro:
Con:
Status: Currently in use: https://dbtree.wikimedia.org/

Ishmael

Overview: A visual tool that shows MySQL statistics and lets you analyze MySQL query logs.
URL:
- https://wikitech.wikimedia.org/wiki/Ishmael
Pro:
Con:
Status: Currently in use: https://ishmael.wikimedia.org/

Labsnagiosbuilder

Overview: Python script that fetches Labs instances from LDAP and builds Nagios configs for them. Uses Puppet classes to determine the hostgroups and services to monitor.
URL: https://github.com/DamianZaremba/labsnagiosbuilder
Status: Currently in use: http://icinga.wmflabs.org/

Sqstat

Overview: Short WMF Perl script that feeds Squid/Varnish stats into Ganglia/Graphite
URL:
Pro:
Con:
Status: Currently in use

Tendril

Overview: A tool for analytics and performance tuning of the MariaDB servers
URL:
- https://wikitech.wikimedia.org/wiki/Tendril
Pro:
Con:
Status: Currently in use: https://tendril.wikimedia.org/

Services

Boundary

Datadog

Overview: Service - Fully buzzword compliant integrated monitoring service.
URL:
- https://www.datadoghq.com/
- http://en.wikipedia.org/wiki/Datadog
Pro: https://www.datadoghq.com/product/
Con:

New Relic

Overview: Service - Popular choice for app-layer metrics, also offers system level monitoring
URL:
Pro:
Con:

Nimsoft Cloud Monitor (formerly Watchmouse)

Overview: Service - External reachability measurements (HTTP probes)
URL: http://cloudmonitor.nimsoft.com/en/
Pro:
Con:
Status: Currently in use

PagerDuty

Overview: Service - "PagerDuty is the command center for IT, providing on-call schedule management, alerting and incident tracking. When your systems are down, we wake you up."
URL: http://www.pagerduty.com/
Pro:
- Integrates with Nagios, Zenoss, Zabbix, Splunk, etc.
Con:

Pingdom

Overview: Service -
URL:
Pro:
Con:

RIPE Atlas

Main article: RIPE Atlas

See also