Monitoring package survey

From Wikitech

An exhaustive list of monitoring tools and services evaluated or used at WMF

Open Source

Alerta

Cabot

Centreon

  • Overview: Fork of Nagios with commercial support, can also use the Icinga "engine"
  • URL: http://www.centreon.com
  • Pro:
  • Con:
    • Some features only in paid "enterprise" version
    • Nagios architecture: check scripts determine warning and critical state

Check_graphite

Check_mk

  • Overview: The client agent runs checks/plugins asynchronously and listens on a TCP port; when contacted, it immediately dumps all stats and closes the connection. A single Icinga active check connects to each client and feeds the returned data back in as passive check results.
  • URL:
  • Pro:
    • Very fast / scales well
    • Integrates with Graphite
    • Can use Nagios plugins
    • Can completely replace Nagios/Icinga using the optional "check_mk micro core"
  • Con:
    • Generates its own Icinga config based on discovery, so service monitors are not defined for services which are down at the time of the discovery scan
    • Integrating the generated config with Puppet-based config templates for Icinga can be tricky
    • Some features only in paid "enterprise" version
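
As a rough illustration of the agent model described above, the sketch below reads everything an agent sends before it closes the connection and splits the text into sections. It assumes the agent's conventional `<<<section>>>` header lines and the usual default port 6556; the function names are hypothetical and this is not Check_mk's own code.

```python
import socket
from collections import defaultdict


def parse_agent_output(text):
    """Split agent output into {section_name: [lines]} (assumed format)."""
    sections = defaultdict(list)
    current = None
    for line in text.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3]
            sections[current]  # register the section even if it stays empty
        elif current is not None:
            sections[current].append(line)
    return dict(sections)


def fetch_agent_output(host, port=6556, timeout=5.0):
    """Connect, read until the agent closes the connection, return the text."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # agent closed the connection: output is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Because the agent sends everything in one connection, the server pays one TCP round trip per host rather than one per check, which is where the scaling advantage comes from.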

Collectd

  • Overview: collectd is a daemon which collects system performance statistics periodically and provides mechanisms to store the values in a variety of ways
  • URL:
  • Pro:
    • Written in C
    • Network traffic can be signed or encrypted
    • Clients push data to a server or multicast group
    • Default resolution is 10 seconds
    • Can store data in Graphite (Carbon), RRD, Redis, MongoDB, several others
    • Statsd plugin implements the StatsD network protocol to allow clients to report events. These events are aggregated by collectd and dispatched regularly.
    • Can execute nagios check scripts
    • Contains glue allowing Nagios to check stats harvested by collectd
  • Con:
    • As of February 2014, the website says collectd has been run on hundreds of nodes but acknowledges that nobody has reported a deployment of 1000.
    • There is no "write plugin" to publish to Ganglia, preventing collectd from being a drop-in replacement for gmond
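
To make the StatsD item above concrete, here is a minimal sketch of what a client reporting an event looks like. It assumes the common StatsD line format `name:value|type` over UDP and port 8125 as the listener; the metric names and helper functions are hypothetical.

```python
import socket

# Metric types assumed here: counter, timer (milliseconds), gauge.
VALID_TYPES = ("c", "ms", "g")


def statsd_line(name, value, metric_type="c"):
    """Format one StatsD datagram payload, e.g. 'web.requests:1|c'."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; no reply is expected from the listener."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `send_statsd(statsd_line("web.requests", 1))` would report one event; the receiving side aggregates such events and dispatches totals at its regular interval.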

Cucumber-nagios

Cyanite

Dashing

Dbeacon

  • Overview: dbeacon is a multicast beacon: its main purpose is to monitor other beacons' reachability and collect statistics such as loss, delay and jitter between them.
  • URL: https://packages.debian.org/sid/dbeacon
  • Pro:
  • Con:

Diamond

Fail2ban

Firefly

Ganglia

Ganglios

Grafana

Graphios

Graphite

Graphsky

  • Overview: Graphite dashboard similar to the Ganglia UI, using data from Collectd
  • URL:
  • Pro:
    • Ganglia-like design gives overview + drill-down ability
    • Simple dashboard and graph definition in JSON
  • Con:
    • Doesn't display any numbers in dashboards or graph legends
    • Lacking navigation elements in the UI
    • Documentation error: it recommends the prefix "collectd.", but in practice you want something like "collectd.production.bits." so that the environment and cluster are encoded in the metric name.
  • Status: Good potential but needs development.
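
The prefix convention mentioned above can be sketched as follows. This is only an illustration of encoding environment and cluster into a Graphite metric path; the helper names, the host name, and the choice to replace dots in host names with underscores are assumptions.

```python
def metric_prefix(environment, cluster, root="collectd"):
    """Build a prefix such as 'collectd.production.bits.'."""
    parts = [root, environment, cluster]
    for part in parts:
        # A dot inside a component would create an unintended tree level.
        if not part or "." in part:
            raise ValueError(f"invalid path component: {part!r}")
    return ".".join(parts) + "."


def graphite_line(prefix, host, metric, value, timestamp):
    """Format one line of Graphite's plaintext protocol: path value timestamp."""
    safe_host = host.replace(".", "_")  # keep the host as a single path node
    return f"{prefix}{safe_host}.{metric} {value} {timestamp}\n"
```

With such a prefix, a dashboard can select a whole cluster with a wildcard over the host node, which a bare "collectd." prefix does not allow.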

Groundwork

  • Overview: Unified systems monitoring and network management: Nagios(R), Nmap, RRDtool, etc. - integrated in one system administration tool
  • URL: http://www.groundworkopensource.com
  • Pro:
  • Con:
    • Nagios architecture: check scripts determine warning and critical state

Hyperic

  • Overview:
    • "Hyperic is application monitoring and performance management for virtual, physical, and cloud infrastructures. Auto-discover resources of 75+ technologies, including vSphere, and collect availability, performance, utilization, and throughput metrics."
    • Now owned by VMware
  • URL: http://www.hyperic.com
  • Pro:
  • Con:

Icinga

IDOUtils (NDOUtils)

  • Overview: "The IDOUtils (Icinga Data Output Utils) addon is designed to store all configuration and event (status, historical) data from Icinga into a relational database. Storing information from Icinga in an RDBMS will allow for quicker retrieval and processing of that data." An Event Broker plugin.
  • URL:
  • Pro:
  • Con:
    • Unclear how/if this achieves the goal of increased scalability
    • This seems to be only for "output" -- simply recording stats rather than using the database as an R/W data store

Jmxtrans

  • Overview: jmxtrans is effectively the missing connector between speaking to a JVM via JMX on one end and whatever logging / monitoring / graphing package that you can dream up on the other end.
  • URL: http://www.jmxtrans.org/
  • Pro:
    • Can log to Carbon/Graphite
  • Con:
  • Status: Currently in use for Hadoop

KairosDB

LibreNMS

Logstash

Logster

Merlin

Metricinga

Mod-gearman

Monit

Munin

Nagios

  • Overview: The de facto standard tool for availability monitoring and alerting
  • URL: http://www.nagios.org
  • Pro:
  • Con:
    • Nagios architecture: check scripts determine warning and critical state
    • Dissatisfaction with the project has led to multiple forks and rewrites: Centreon, Icinga, Naemon, OpsView, Shinken
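
The architectural point above, that each check script decides the warning or critical state itself, can be sketched as below. It assumes the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the argument layout and function names are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a state; thresholds live in the script itself."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def main(argv):
    """Hypothetical usage: check_value VALUE WARN CRIT. Returns the exit code."""
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        # Covers both non-numeric arguments and too few arguments.
        print("UNKNOWN - usage: check_value VALUE WARN CRIT")
        return UNKNOWN
    state = evaluate(value, warn, crit)
    print(f"{LABELS[state]} - value is {value}")
    return state
```

A real plugin would end with `sys.exit(main(sys.argv))`. The server only sees the exit code and the one-line message, so the raw measurement and the threshold logic stay locked inside the script.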

NetDB

NRPE

  • Overview: Nagios Remote Plugin Executor - NRPE is an addon that allows you to execute plugins on remote Linux/Unix hosts. This is useful if you need to monitor local resources/attributes like disk usage, CPU load, memory usage, etc. on a remote host.
  • URL: http://exchange.nagios.org/directory/Addons/Monitoring-Agents/NRPE--2D-Nagios-Remote-Plugin-Executor/details
  • Pro:
    • Allows Nagios/Icinga server to trigger execution of check scripts on client nodes
  • Con:
    • Opens a TCP connection for each check script
    • Clients run a listener daemon, requiring complementary firewall rules on hosts which have public IPs
    • Check scripts are run synchronously (active checks), so slow checks produce high-latency responses that can degrade Nagios/Icinga server performance
    • Code review by Tim deemed NRPE unacceptable for Fundraising cluster
  • Status: Currently in use

NSCA

Observium

  • Overview:
  • URL:
  • Pro:
  • Con:
  • Status: We switched from this to LibreNMS

Oculus

  • Overview: Anomaly correlation: given an identified anomalous metric, searches for similar metrics to help determine scope and root cause.
  • URL: https://github.com/etsy/oculus
  • Pro:
  • Con:

OpenNMS

OpenTSDB

  • Overview: OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. Unlike RRD or Whisper, it never deletes or downsamples data.
  • URL: http://opentsdb.net/
  • Pro:
    • Super-scalable, said to be similar to Google's proprietary Borgmon
  • Con:

OpsView

Pandora FMS

Prometheus

RANCID

Rearview

  • Overview: Allows users to create monitors that both visualize and alert on data as it streams from Graphite.
  • URL: https://github.com/livingsocial/rearview/
  • Pro:
    • Could replace Icinga for alerting
  • Con:
    • Crontab-compatible time specification means monitors can run at most once per minute

Riemann

  • Overview: Riemann is an event stream processor.
  • URL: http://riemann.io/
  • Pro:
    • Could replace Icinga for alerting
  • Con:

Sensu

Servermon

Seyren

Shinken

Skyline

  • Overview: Skyline is a real-time anomaly detection system, built to enable passive monitoring of hundreds of thousands of metrics, without the need to configure a model/thresholds for each one
  • URL: https://github.com/etsy/skyline
  • Pro:
  • Con:

Smokeping

Statsd

Tessera

Torrus

Umpire

  • Overview: Lets you test Graphite metrics via HTTP: "Umpire provides a normalized HTTP endpoint that responds with 200 / non-200 according to the metric check parameters specified in the requested URL."
  • URL: https://github.com/heroku/umpire
  • Pro:
  • Con:
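
The 200 / non-200 idea quoted above can be sketched as a small function that turns a threshold check into an HTTP status code. This is not Umpire's implementation: averaging the datapoints, using 500 as the failing status, and using 404 for missing data are all assumptions made for the example.

```python
def check_status(values, minimum=None, maximum=None):
    """Return an HTTP-style status code for a list of metric datapoints."""
    if not values:
        return 404  # assumed: no data for the requested metric
    average = sum(values) / len(values)
    if minimum is not None and average < minimum:
        return 500  # assumed non-200 code for a failed check
    if maximum is not None and average > maximum:
        return 500
    return 200
```

Expressing the result as a status code means any HTTP availability checker can alert on Graphite data without knowing anything about metrics.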

Zenoss

Zabbix

Developed for WMF

Dbtree

Ishmael

Labsnagiosbuilder

Sqstat

  • Overview: Short WMF Perl script that feeds Squid/Varnish stats into Ganglia/Graphite
  • URL:
  • Pro:
  • Con:
  • Status: Currently in use

Tendril

Services

Boundary

Datadog

New Relic

  • Overview: Service - Popular choice for app-layer metrics, also offers system level monitoring
  • URL:
  • Pro:
  • Con:

Nimsoft Cloud Monitor (formerly Watchmouse)

PagerDuty

  • Overview: Service - "PagerDuty is the command center for IT, providing on-call schedule management, alerting and incident tracking. When your systems are down, we wake you up."
  • URL: http://www.pagerduty.com/
  • Pro:
    • Integrates with Nagios, Zenoss, Zabbix, Splunk, etc.
  • Con:

Pingdom

  • Overview: Service -
  • URL:
  • Pro:
  • Con:

RIPE Atlas

Main article: RIPE Atlas

See also