Monitoring package survey
An exhaustive list of monitoring tools and services evaluated or used at WMF
Open Source
Alerta
- Overview: Nagios-like, powered by MongoDB, released by The Guardian
- URL:
Cabot
- Overview: Self-hosted, easily-deployable monitoring and alerts service - like a lightweight PagerDuty
- URL:
Centreon
- Overview: Fork of Nagios with commercial support, can also use the Icinga "engine"
- URL: http://www.centreon.com
- Pro:
- Con:
- Some features only in paid "enterprise" version
- Nagios architecture: check scripts determine warning and critical state
Check_graphite
- Overview: Nagios check script to generate alerts based on data in Graphite. Multiple implementations exist.
- URL: https://github.com/disqus/nagios-plugins/
- Pro:
- Con:
- Status: Currently in use
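As a rough sketch of what such a check does, the snippet below applies the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) to a list of datapoints. The function names, the use of an average, and the thresholds are hypothetical and are not taken from any of the existing implementations; fetching the datapoints from Graphite is left out.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(datapoints, warn, crit):
    """Return (exit_code, message) for a list of values; None means a gap."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints returned"
    average = sum(values) / len(values)
    if average >= crit:
        return CRITICAL, f"CRITICAL: average {average:.2f} >= {crit}"
    if average >= warn:
        return WARNING, f"WARNING: average {average:.2f} >= {warn}"
    return OK, f"OK: average {average:.2f}"


def main(datapoints, warn, crit):
    """Print the status line and exit with the code Nagios/Icinga expects."""
    code, message = evaluate(datapoints, warn, crit)
    print(message)
    sys.exit(code)
```

Nagios or Icinga reads only the exit code and the first line of output, so the check script itself decides what counts as warning or critical.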
Check_mk
- Overview: The client agent runs checks/plugins asynchronously and listens on a TCP port; when contacted, it immediately dumps all stats and closes the connection. A single Icinga active check connects to each client and returns the data as passive check results.
- URL:
- Pro:
- Very fast / scales well
- Integrates with Graphite
- Can use Nagios plugins
- Can completely replace Nagios/Icinga using the optional "check_mk micro core"
- Con:
- Generates its own Icinga config based on discovery, so service monitors are not defined for services which are down at the time of the discovery scan
- Can be tricky to integrate with Puppet-based config templates for Icinga
- Some features only in paid "enterprise" version
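To illustrate the "dump everything on connect" model, here is a minimal sketch that splits an agent dump into sections. It assumes the agent marks each section with a header line of the form `<<<name>>>` (optionally `<<<name:options>>>`); the function name and the sample data are hypothetical, and reading the dump from the TCP socket is omitted.

```python
import re

# Section header such as "<<<df>>>" or "<<<df:sep(9)>>>".
SECTION_RE = re.compile(r"^<<<([^:>]+)(?::[^>]*)?>>>$")


def parse_agent_output(text):
    """Group the lines of an agent dump by section name."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


# Hypothetical sample dump, for illustration only.
SAMPLE = "<<<check_mk>>>\nVersion: 1.2.4\n<<<uptime>>>\n12345.67 54321.00\n"
```

Because one connection returns every section at once, the server pays for a single TCP round trip per host rather than one per service.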
Collectd
- Overview: collectd is a daemon which collects system performance statistics periodically and provides mechanisms to store the values in a variety of ways
- URL:
- Pro:
- Written in C
- Network traffic can be signed or encrypted
- Clients push data to a server or multicast group
- Default resolution is 10 seconds
- Can store data in Graphite (Carbon), RRD, Redis, MongoDB, several others
- Statsd plugin implements the StatsD network protocol to allow clients to report events. These events are aggregated by collectd and dispatched regularly.
- Can execute Nagios check scripts
- Contains glue allowing Nagios to check stats harvested by collectd
- Con:
- As of February 2014, the website says it has been run on hundreds of nodes but concedes that nobody has reported a deployment of 1000.
- There is no "write plugin" to publish to Ganglia, preventing collectd from being a drop-in replacement for gmond
Cucumber-nagios
- Overview: Allows you to write high-level behavioural tests of a web application and plug the results into Nagios.
- URL: https://github.com/auxesis/cucumber-nagios
- Pro:
- Con:
Cyanite
- Overview: Cassandra backend for Graphite's Carbon
- URL: https://github.com/pyr/cyanite
Dashing
- Overview: Dashing is a Sinatra based framework that lets you build beautiful dashboards.
- URL: http://shopify.github.io/dashing/
- Pro:
- Con:
Dbeacon
- Overview: dbeacon is a multicast beacon: its main purpose is to monitor other beacons' reachability and collect statistics such as loss, delay and jitter between them.
- URL: https://packages.debian.org/sid/dbeacon
- Pro:
- Con:
Diamond
- Overview: Diamond is a Python daemon that collects system metrics and publishes them to Graphite, OpenTSDB, and others.
- URL: https://github.com/BrightcoveOS/Diamond
- Pro:
- Popular
- Written in Python
- Con:
Fail2ban
- Overview: Anti-brute force monitor which reacts to repeated bad auth attempts by banning the source IP temporarily
- URL: http://www.fail2ban.org/wiki/index.php/Main_Page
- Pro:
- Con:
Firefly
- Overview: Graph display/dashboard from Yelp
- URL: http://engineeringblog.yelp.com/2012/08/firefly-illuminate-your-websites-performance.html
- Pro: Supports Ganglia and Graphite
- Con:
Ganglia
- Overview: Cluster-oriented performance metric agent, collector, & grapher
- URL:
- Pro:
- Newer versions can use Graphite instead of RRD for storage and graphing
- Con:
- Unpleasant to write plugins for (requires a lot of boilerplate)
- Status: Retired, not in use anymore.
Ganglios
- Overview: ganglios is a collection of tools that allow Nagios to trigger alerts based on data it pulls from Ganglia.
- URL: https://bitbucket.org/maplebed/ganglios/
- URL: https://wikitech.wikimedia.org/wiki/Ganglios
- Pro:
- Con:
Grafana
- Overview: Graphite dashboard inspired by Kibana
- URL:
- Pro:
- Con:
Graphios
- Overview: Sends Nagios performance data to Graphite (Carbon)
- URL: https://github.com/shawn-sterling/graphios
- Pro:
- Con:
- See also: Metricinga
Graphite
- Overview: Graphite is the overall project name for a monitoring system consisting of:
- Carbon, a Twisted daemon which listens for time-series data
- Whisper, a database library for storing time-series data (similar to RRD)
- Graphite, a Django webapp that renders graphs on-demand using Cairo
- Does not include an agent daemon; however, stats may be submitted to Carbon from Collectd, Diamond, Ganglia, Jmxtrans, Statsd, and others
- Does not include alerting; however, this is available via tools which read from Graphite, such as Rearview or Seyren
- URL:
- Pro:
- Separates the collection of metrics from the definition of graphs
- Can accept metrics at any update frequency, including sporadic events
- Can use AMQP
- Con:
- Status: Currently in use: http://graphite.wikimedia.org/
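Since Graphite ships no agent, any client can submit stats itself. The sketch below assumes Carbon's plaintext protocol (one `path value timestamp` line per datapoint) on its conventional port 2003; the function names, the host default, and the metric name in the usage note are hypothetical.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one datapoint as a Carbon plaintext-protocol line."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Open a TCP connection to Carbon and submit a single datapoint."""
    line = carbon_line(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

For example, `send_metric("servers.web1.load", 0.5)` would record one sample. Nothing about the graph is declared at submission time, which is what keeps collection separate from graph definition.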
Graphsky
- Overview: Graphite dashboard similar to the Ganglia UI, using data from Collectd
- URL:
- Pro:
- Ganglia-like design gives overview + drill-down ability
- Simple dashboard and graph definition in JSON
- Con:
- Doesn't display any numbers in dashboards or graph legends
- Lacking navigation elements in the UI
- Documentation error: the docs recommend the prefix "collectd." but in practice you want something like "collectd.production.bits." to encode the environment and cluster in the metric name.
- Status: Good potential but needs development.
Groundwork
- Overview: Unified systems monitoring and network management: Nagios(R), Nmap, RRDtool, etc. - integrated in one system administration tool
- URL: http://www.groundworkopensource.com
- Pro:
- Con:
- Nagios architecture: check scripts determine warning and critical state
Hyperic
- Overview:
- "Hyperic is application monitoring and performance management for virtual, physical, and cloud infrastructures. Auto-discover resources of 75+ technologies, including vSphere, and collect availability, performance, utilization, and throughput metrics."
- Now owned by VMWare
- URL: http://www.hyperic.com
- Pro:
- Con:
Icinga
- Overview: Fork of Nagios
- URL:
- Pro:
- Con:
- Nagios architecture: check scripts determine warning and critical state
- Status: Currently in use: https://icinga.wikimedia.org , http://icinga.wmflabs.org/
IDOUtils (NDOUtils)
- Overview: "The IDOUtils (Icinga Data Output Utils) addon is designed to store all configuration and event (status, historical) data from Icinga into a relational database. Storing information from Icinga in an RDBMS will allow for quicker retrieval and processing of that data." An Event Broker plugin.
- URL:
- Pro:
- Con:
- Unclear how/if this achieves the goal of increased scalability
- This seems to be only for "output" -- simply recording stats rather than using the database as an R/W data store
Jmxtrans
- Overview: jmxtrans is effectively the missing connector between speaking to a JVM via JMX on one end and whatever logging / monitoring / graphing package that you can dream up on the other end.
- URL: http://www.jmxtrans.org/
- Pro:
- Can log to Carbon/Graphite
- Con:
- Status: Currently in use for Hadoop
KairosDB
- Overview: Rewrite of OpenTSDB, can use either HBase or Cassandra
- URL: https://github.com/proofpoint/kairosdb
- Pro:
- Con:
LibreNMS
- Overview: GPL fork of Observium
- URL:
- Pro:
- Con:
- Status: Currently in use: https://librenms.wikimedia.org/
Logstash
- Overview: A tool that can be used to collect, process and forward events and log messages. When used with Elasticsearch and Kibana it provides a dashboard for searching and analyzing logs.
- URL:
- Pro:
- Con:
- Status: Currently in use: https://logstash.wikimedia.org/
Logster
- Overview: Logster is a utility for reading log files and generating metrics in Graphite or Ganglia
- URL: https://github.com/etsy/logster
- Pro:
- Con:
Merlin
- Overview: Merlin (Module for Effortless Redundancy and Loadbalancing In Nagios) is a Nagios Event Broker plugin. It was started as an easy way to set up distributed Nagios installations, allowing Nagios processes to exchange information directly as an alternative to the standard Nagios approach of using NSCA.
- URL:
- Pro:
- Con:
Metricinga
- Overview: Parses performance data files from Nagios/Icinga and sends the results to Graphite via the Carbon pickle port.
- URL: https://github.com/jgoldschrafe/metricinga
- Pro:
- Con:
- See also: Graphios
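A minimal sketch of the parsing step, assuming the usual Nagios performance data layout of space-separated `label=value[UOM];warn;crit;min;max` items. The function name and the Graphite prefix are hypothetical, quoted labels containing spaces are not handled, and this is not Metricinga's actual code.

```python
import re

# label=value followed by an optional unit; thresholds after ";" are ignored.
PERF_RE = re.compile(r"([^\s=]+)=(-?[\d.]+)([^\s;]*)")


def perfdata_to_metrics(prefix, perfdata):
    """Map a perfdata string to {graphite_path: float_value}."""
    metrics = {}
    for match in PERF_RE.finditer(perfdata):
        label, value, _unit = match.groups()
        metrics[f"{prefix}.{label}"] = float(value)
    return metrics
```

Each resulting path/value pair can then be sent to Carbon with a timestamp taken from the check result.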
Mod-gearman
- Overview: Extends Nagios/Naemon/Icinga to run scalable and distributed setups. Worker nodes can be placed all over your network while keeping the simplicity of a central configuration. Uses the Nagios Event Broker (NEB) API. Intercepts checks on the Nagios server, placing them on a queue. Client/worker nodes execute tasks from the queue and place the results in another queue: http://labs.consol.de/nagios/mod-gearman/#_how_does_it_work
- URL:
- Pro:
- Con:
Monit
- Overview: Process supervisor with optional GUI and email alerting
- URL:
- Pro:
- Con:
Munin
- Overview:
- URL:
- Pro:
- Con:
- Default resolution: 5 minutes
- Doesn't scale well enough for our needs
Nagios
- Overview: The de facto standard tool for availability monitoring and alerting
- URL: http://www.nagios.org
- Pro:
- Con:
- Nagios architecture: check scripts determine warning and critical state
- Dissatisfaction with the project has led to multiple forks and rewrites: Centreon, Icinga, Naemon, OpsView, Shinken
NetDB
- Overview: NetDB keeps track of devices on your network and the status of your switch ports over time.
- URL: http://netdbtracking.sourceforge.net
NRPE
- Overview: Nagios Remote Plugin Executor - NRPE is an addon that allows you to execute plugins on remote Linux/Unix hosts. This is useful if you need to monitor local resources/attributes like disk usage, CPU load, memory usage, etc. on a remote host.
- URL: http://exchange.nagios.org/directory/Addons/Monitoring-Agents/NRPE--2D-Nagios-Remote-Plugin-Executor/details
- Pro:
- Allows Nagios/Icinga server to trigger execution of check scripts on client nodes
- Con:
- Opens a TCP connection for each check script
- Clients run a listener daemon, requiring complementary firewall rules on hosts which have public IPs
- Check scripts are run synchronously (active checks), potentially causing high latency responses which impact Nagios/Icinga server performance
- Code review by Tim deemed NRPE unacceptable for the Fundraising cluster
- Status: Currently in use
NSCA
- Overview: Nagios Service Check Acceptor - NSCA is an addon that allows you to send passive check results from remote Linux/Unix hosts to the Nagios daemon running on the monitoring server.
- URL: http://exchange.nagios.org/directory/Addons/Passive-Checks/NSCA--2D-Nagios-Service-Check-Acceptor/details
- Pro:
- Submits passive checks
- Con:
- Requires external mechanism on clients to trigger execution, such as cron
- Status: Currently in use on Fundraising cluster
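For illustration, a passive result handed to the `send_nsca` client is a single tab-delimited line: host, service description, return code, plugin output. The helper below is a hypothetical sketch of what a cron-driven wrapper might build before piping it to `send_nsca`; the host and service names in the test data are made up.

```python
def nsca_line(host, service, return_code, output):
    """Format one passive service check result for send_nsca's stdin."""
    # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
    if return_code not in (0, 1, 2, 3):
        raise ValueError(f"invalid Nagios return code: {return_code}")
    return f"{host}\t{service}\t{return_code}\t{output}\n"
```

The monitoring server never contacts the client in this model, so no listener daemon or inbound firewall rule is needed on the monitored host.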
Observium
- Overview:
- URL:
- Pro:
- Con:
- Status: We switched from this to LibreNMS
Oculus
- Overview: Anomaly correlation: given an identified anomalous metric, searches for similar metrics to help determine scope and root cause.
- URL: https://github.com/etsy/oculus
- Pro:
- Con:
OpenNMS
- Overview: OpenNMS is a free and open-source enterprise grade network monitoring and network management platform written in Java
- URL:
- Pro:
- Con:
OpenTSDB
- Overview: OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. Unlike RRD or Whisper, it never deletes or downsamples data.
- URL: http://opentsdb.net/
- Pro:
- Super-scalable, said to be similar to Google's proprietary Borgmon
- Con:
OpsView
- Overview: GPL-licensed, commercially sponsored monitoring suite based on Nagios
- URL:
- Pro:
- Con:
- Nagios architecture
Pandora FMS
- Overview: Complete monitoring system
- URL: http://pandorafms.org
- Pro:
- Con:
- Some features only in paid "enterprise" version
- See also: https://github.com/monitoringsucks/tool-repos/tree/master/pandora-fms
Prometheus
- Overview:
- URL:
- https://prometheus.io/
- Prometheus at WMF
- Pro:
- multi-dimensional data model
- powerful query language
- pull model for metric collection
- Con:
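To make the "multi-dimensional data model" concrete: in Prometheus's text exposition format each sample is a metric name plus a set of key/value labels. The sketch below renders one such line; the function names and the example metric are hypothetical, and a real exporter would normally use a client library and serve these lines over HTTP for the server to pull.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render one sample line, with labels in sorted order."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"
```

Queries can then select or aggregate across any label, rather than relying on a position in a dotted metric path.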
RANCID
- Overview:
- URL:
- Pro:
- Con:
- Status: Currently in use
Rearview
- Overview: Allows users to create monitors that both visualize and alert on data as it streams from Graphite.
- URL: https://github.com/livingsocial/rearview/
- Pro:
- Could replace Icinga for alerting
- Con:
- Crontab-compatible time specification means a minimum sampling interval of 1 minute
Riemann
- Overview: Riemann is an event stream processor.
- URL: http://riemann.io/
- Pro:
- Could replace Icinga for alerting
- Con:
Sensu
- Overview: Open source monitoring framework; uses RabbitMQ and Redis
- URL:
- Pro:
- Can be HA / no SPOFs
- Can send data to Carbon and OpenTSDB
- Con:
Servermon
- Overview: "Servermon is a Django project with the aim of facilitating server monitoring and management through Puppet."
- URL: https://github.com/servermon/servermon (not http://sourceforge.net/projects/servermon/)
- Pro:
- Con:
- Status: Previously in use on Sockpuppet, did not survive the transition to Palladium. There is desire to use it again in the future. See: sockpuppet:/srv/servermon/
Seyren
- Overview: An alerting dashboard for Graphite
- URL: https://github.com/scobal/seyren
- Pro:
- Could replace Icinga for alerting
- Con:
Shinken
- Overview: Rewrite of Nagios in Python. "Shinken's architecture aims to offer easier load balancing and high availability. The administrator manages a single configuration, the system automatically "cuts" it into parts and dispatches it to worker nodes."
- URL:
- Pro:
- Con:
Skyline
- Overview: Skyline is a real-time anomaly detection system, built to enable passive monitoring of hundreds of thousands of metrics, without the need to configure a model/thresholds for each one
- URL: https://github.com/etsy/skyline
- Pro:
- Con:
Smokeping
- Overview: Network latency grapher
- URL: http://oss.oetiker.ch/smokeping/
- Pro:
- Provides a view of network latency and packet loss not available from other tools
- Con:
- Doesn't support RRDCached
- Status: Deprecated (see also smokeping.wikimedia.org)
Statsd
- Overview: A network daemon that listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
- URL:
- Reference implementation, Node.js: https://github.com/etsy/statsd/
- Python-Twisted implementation used at WMF: https://github.com/sidnei/txstatsd
- DeviantArt fork of pystatsd: https://github.com/deviantART/pystatsd/
- Pro:
- Con:
- Status: Python-txstatsd is currently in use on Tungsten
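A minimal client sketch, assuming the common StatsD wire format of `name:value|type` (for example `c` for counters, `ms` for timers, `g` for gauges) sent as a UDP datagram to the conventional port 8125. The function names and the default host are hypothetical.

```python
import socket


def statsd_packet(name, value, metric_type):
    """Build one StatsD datagram, e.g. b"page.views:1|c"."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_packet(packet, host="localhost", port=8125):
    """Fire-and-forget: UDP gives no delivery guarantee and never blocks the app."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    finally:
        sock.close()
```

For example, `send_packet(statsd_packet("page.views", 1, "c"))` increments a counter. The daemon aggregates whatever arrives during each flush interval and forwards the totals to its backend.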
Tessera
- Overview: A dashboard front-end for Graphite, similar to but distinct from Grafana.
- URL:
Torrus
- Overview: SNMP grapher. "Torrus is an alternative software platform to MRTG, Cricket and Cacti."
- URL:
- Pro:
- Con:
- Status: Currently in use: http://torrus.wikimedia.org
Umpire
- Overview: Lets you test Graphite metrics via HTTP: "Umpire provides a normalized HTTP endpoint that responds with 200 / non-200 according to the metric check parameters specified in the requested URL."
- URL: https://github.com/heroku/umpire
- Pro:
- Con:
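The idea of turning a metric check into an HTTP status can be sketched as follows. This is not Umpire's code: the function name, the use of an average, and the specific non-200 codes (404 for no data, 500 for out of range) are assumptions made for illustration.

```python
def check_status(values, minimum=None, maximum=None):
    """Return an HTTP status code for a metric compared against optional bounds."""
    if not values:
        return 404  # assumed: no data for the requested metric
    average = sum(values) / len(values)
    if minimum is not None and average < minimum:
        return 500  # assumed: below the allowed range
    if maximum is not None and average > maximum:
        return 500  # assumed: above the allowed range
    return 200
```

Any external HTTP prober, such as a reachability service, can then alert on a Graphite metric without understanding Graphite.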
Zenoss
- Overview:
- URL:
- Pro:
- Con:
Zabbix
- Overview:
- URL: http://www.zabbix.com
- Pro:
- Con:
Developed for WMF
Dbtree
- Overview: Database replication and performance dashboard developed at WMF
- URL:
- Pro:
- Con:
- Status: Currently in use: https://dbtree.wikimedia.org/
Ishmael
- Overview: A visual tool that shows MySQL statistics and lets you analyze MySQL query logs.
- URL:
- Pro:
- Con:
- Status: Currently in use: https://ishmael.wikimedia.org/
Labsnagiosbuilder
- Overview: Python script that fetches labs instances from LDAP and builds Nagios configs for them. Uses Puppet classes to determine hostgroups and services to monitor.
- URL: https://github.com/DamianZaremba/labsnagiosbuilder
- Status: Currently in use: http://icinga.wmflabs.org/
Sqstat
- Overview: Short WMF Perl script that feeds Squid/Varnish stats into Ganglia/Graphite
- URL:
- Pro:
- Con:
- Status: Currently in use
Tendril
- Overview: A tool for analytics and performance tuning of the MariaDB servers
- URL:
- Pro:
- Con:
- Status: Currently in use: https://tendril.wikimedia.org/
Services
Boundary
Datadog
- Overview: Service - Fully buzzword compliant integrated monitoring service.
- URL: https://www.datadoghq.com/product/
- Pro:
- Con:
New Relic
- Overview: Service - Popular choice for app-layer metrics, also offers system level monitoring
- URL:
- Pro:
- Con:
Nimsoft Cloud Monitor (formerly Watchmouse)
- Overview: Service - External reachability measurements (HTTP probes)
- URL: http://cloudmonitor.nimsoft.com/en/
- Pro:
- Con:
- Status: Currently in use
PagerDuty
- Overview: Service - "PagerDuty is the command center for IT, providing on-call schedule management, alerting and incident tracking. When your systems are down, we wake you up."
- URL: http://www.pagerduty.com/
- Pro:
- Integrates with Nagios, Zenoss, Zabbix, Splunk, etc.
- Con:
Pingdom
- Overview: Service -
- URL:
- Pro:
- Con:
RIPE Atlas
Main article: RIPE Atlas