
graphite is used to store, display and aggregate both system-level and application-level metrics. Currently, most metrics arrive over UDP at statsd for aggregation/calculation and are then written to graphite via carbon-relay. Graphite in turn writes those metrics to individual whisper files on disk. Whisper takes care of handling the files on disk and downsamples data for long-term retention.
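For reference, the statsd wire format mentioned above is a tiny text protocol over UDP; a minimal sketch of a client sending one counter increment (host, port and metric name here are hypothetical, not taken from production):

```python
import socket

# statsd line protocol: "<metric>:<value>|<type>", e.g. "|c" for a counter.
# Destination is hypothetical; in this setup traffic goes to statsd on 8125/udp.
def send_counter(metric, value=1, host="127.0.0.1", port=8125):
    payload = "%s:%d|c" % (metric, value)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))
    sock.close()
    return payload

send_counter("frontend.requests")
```

Because it is UDP, the sender gets no acknowledgement, which is exactly why an overwhelmed receiver can drop metrics silently (see the performance problem below).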

problems

At the moment all of the above (statsd and load balancer + graphite) runs on graphite1001, which obviously makes it a SPOF. As graphite becomes increasingly important we need to scale it along several dimensions:

  • availability:
    • reduce SPOFs (ideally to 0)
  • performance:
    • at the moment all statsd traffic is received by a single machine over UDP; this also means packets/metrics can be silently dropped if the machine is overwhelmed

solutions

The problem can be attacked from different angles; the solutions below are ordered easiest first (not necessarily simplest, see also Simple is not Easy and Simple made Easy)

performance: revisit aggregation periods

Depending on the purpose, we might want to revisit the aggregations currently in use. As of this writing (November 2017), the default retentions are (source)

retentions = 1m:7d,5m:14d,15m:30d,1h:1y,1d:5y

for all metrics, see also bug T96662
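Since whisper preallocates every archive, these retentions fix the per-metric file size up front. A rough back-of-the-envelope, assuming whisper's documented 12 bytes per datapoint and ignoring the small per-file header:

```python
# Datapoints implied by "1m:7d,5m:14d,15m:30d,1h:1y,1d:5y".
# Whisper stores each point as 12 bytes (4-byte timestamp + 8-byte value).
MINUTE, HOUR, DAY, YEAR = 60, 3600, 86400, 365 * 86400

archives = [
    (MINUTE, 7 * DAY),
    (5 * MINUTE, 14 * DAY),
    (15 * MINUTE, 30 * DAY),
    (HOUR, YEAR),
    (DAY, 5 * YEAR),
]

points = sum(retention // step for step, retention in archives)
size_kib = points * 12 / 1024.0
print(points, round(size_kib))  # ~27.6k points, ~323 KiB per metric
```

So shortening the high-resolution archives directly shrinks every whisper file and the write amplification that goes with it.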

availability: statsd+graphite active/standby behind LVS

While graphite can still run on a single machine, an easy move is to put it (graphite+statsd) behind LVS with a standby machine that periodically pulls whisper files from the active machine. Failover is done by declaring the standby host active (e.g. by changing the pybal config)

pros:

  • familiar and simple architecture

cons:

  • still limited to a single machine

availability: statsd+graphite clustered setup behind LVS

We can use graphite's clustering to have 2+ machines, fronted by LVS, each running statsd+graphite and relaying metrics onto a set of machines (possibly the same set) with consistent hashing and replication. This is essentially the last scenario outlined here: http://bitprophet.org/blog/2013/03/07/graphite/. How to shard/rebalance machines is an open question; tools like carbonate hint at a possible solution.

pros:

  • no new components
  • potentially long-term solution

cons:

  • untested configuration
  • not clear how failed/degraded machines can be brought in sync again
  • not clear how to add new machines without too much disruption
  • potentially painful operations-wise
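The consistent hashing the relays rely on can be sketched roughly like this (a simplified toy ring, not carbon's exact hash function; host names are hypothetical):

```python
import hashlib
from bisect import bisect

# Toy consistent-hash ring: each host gets several virtual points on the
# ring; a metric is routed to the first host clockwise from its own hash.
def _hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring(object):
    def __init__(self, hosts, vnodes=64):
        self.ring = sorted(
            (_hash("%s:%d" % (h, i)), h) for h in hosts for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    def host_for(self, metric):
        idx = bisect(self.keys, _hash(metric)) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["graphite1001", "graphite1002"])
# Deterministic: the same metric always lands on the same host, and
# adding a host only remaps a fraction of the keyspace.
assert ring.host_for("frontend.requests") == ring.host_for("frontend.requests")
```

This also illustrates the open question above: when a host joins or leaves, the metrics that change owners have to be physically moved (which is what carbonate helps with).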

performance+availability: switch graphite backend

Whisper by its nature writes one file per metric; in practice this means at least one disk seek per metric received, which doesn't play well with spinning disks and, to a lesser extent, with SSDs.

There are solutions that switch the graphite backend away from whisper while keeping a compatible HTTP interface.

performance+availability: influxdb backend

One of the alternative graphite backends is influxdb; a possible solution that integrates influxdb and graphite is outlined here: http://dieter.plaetinck.be/influxdb-as-graphite-backend-part2.html
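InfluxDB ships a graphite input plugin that speaks the carbon line protocol directly, so the existing relays could keep writing to port 2003 unchanged. A hedged sketch of the relevant config stanza (exact keys vary by InfluxDB version):

```
# Hypothetical influxdb.conf fragment: accept carbon line protocol on 2003/tcp.
[[graphite]]
  enabled = true
  bind-address = ":2003"
  database = "graphite"
```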

pros:

  • scalable
  • can be used for more than graphite data
  • powerful query language
  • more and more tools are supporting influxdb
  • compression support in some backends

cons:

performance+availability: cassandra backend (cyanite)

Another option is cassandra for storage, via cyanite, using graphite-cyanite for the read path.
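graphite-cyanite plugs into graphite-api as a storage finder, so dashboards keep talking to the same HTTP API. A sketch of the graphite-api configuration this would need (the URL and exact finder path are assumptions based on the graphite-cyanite README):

```
# Hypothetical /etc/graphite-api.yaml fragment
finders:
  - cyanite.CyaniteFinder
cyanite:
  urls:
    - http://localhost:8080
```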

pros:

  • scalable
  • tunable data compression & native TTL handling
  • append-only LSMs, sequential IO on write & compaction
  • cassandra already lined up to be used in production
  • cyanite daemon itself is stateless

cons:

performance+availability: cassandra backend (newts)

newts is a timeseries database backed by cassandra. During the Lyon hackathon, Eric Evans added support for ingesting line-protocol graphite metrics and Filippo Giunchedi added a graphite-api backend for it, graphite-newts.

pros:

  • scalable
  • we have experience with cassandra already
  • newts expertise inside the WMF

cons:

  • more development might be needed

performance+availability: opentsdb backend

Yet another option would be to use opentsdb and plug it on top of graphite-api so it is backwards compatible.

opentsdb doesn't do rollups or expire old data points by itself, so the idea would be to use carbon-aggregator to aggregate the incoming points and write them to separate metrics (e.g. for incoming metric foo.bar the 10m datapoints would be written to _10m.foo.bar), then retrieve the appropriate aggregated series when fetching data via graphite-api
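The renaming scheme above is easy to sketch: aggregated points go under a resolution prefix, and the read path picks the series whose resolution matches the requested range. A minimal sketch, assuming averaging as the rollup method and the `_10m.` prefix from the example (both are illustrative choices, not a spec):

```python
from collections import defaultdict

# Sketch of the proposed rollup naming: a 10-minute average of incoming
# metric "foo.bar" is stored under the separate name "_10m.foo.bar".
def rollup(points, metric, window=600, prefix="_10m"):
    """points: iterable of (timestamp, value); returns the prefixed
    metric name and the averaged datapoints, one per window."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window].append(value)
    name = "%s.%s" % (prefix, metric)
    return name, sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())

name, series = rollup([(0, 1.0), (300, 3.0), (600, 5.0)], "foo.bar")
# The first two points share a 10m bucket and are averaged.
```

In the real setup this aggregation would happen in carbon-aggregator on the write path, not in the query path.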

pros:

  • scalable
  • hdfs/hbase based, we can reuse some of the existing infrastructure/knowledge
  • tunable data compression

cons:

  • untested solution

performance+availability: switch away from graphite

Another option is to switch away from graphite entirely and move to a different model of multidimensional metrics (e.g. what influxdb, prometheus and opentsdb do: a datapoint can have several key=value pairs associated with it) and more powerful aggregation
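The difference in models is easiest to see side by side: graphite encodes dimensions positionally in the dotted name, while the multidimensional systems attach explicit key=value labels (the metric and label names below are made up for illustration):

```
# graphite: dimensions are positions in a dotted path
servers.mw1001.cpu.user

# multidimensional (influxdb/prometheus/opentsdb style): explicit labels
cpu_user{host="mw1001", cluster="appserver"}
```

Labels make aggregation across any dimension natural (e.g. sum over all hosts in a cluster), where graphite relies on wildcard positions in the path.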

pros:

  • more powerful model
  • long-term commitment/solution

cons:

  • most tools and dashboards expect a graphite-compatible API, though grafana for example can work with opentsdb natively
  • graphite's on-retrieval aggregation/summarization functions are useful and would be missing; influxdb has some similar functionality

clustering

Sooner or later we will exceed a machine's capacity in one (or more) dimensions; when that happens we can turn to graphite clustering. We would cluster for performance reasons, not redundancy, since we would already be mirroring metrics to codfw/eqiad. In that case we can incrementally scale each cluster by adding machines to a consistent hash ring and instruct carbon-c-relay to send metrics to the appropriate machine based on the ring. Of course, adding/removing machines from a ring means the ring has to be rebalanced; for that we can use carbonate (https://github.com/jssjr/carbonate) to move metrics onto the right machines.
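A hedged sketch of what the carbon-c-relay side of such a ring could look like (host names are hypothetical; see the carbon-c-relay documentation for the exact syntax):

```
# Hypothetical carbon-c-relay config: a consistent-hash cluster with
# replication 2; each metric is written to two members of the ring.
cluster graphite
    carbon_ch replication 2
        graphite1001:2003
        graphite1002:2003
        graphite1003:2003
    ;

match *
    send to graphite
    stop;
```

Adding a host to the `cluster` block changes the ring, which is exactly the point where carbonate would be needed to move existing whisper files to their new owners.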

current setup as of 2014/11/14

The current setup is a single db-class machine (tungsten) with 12 spinning disks, 64 GB of RAM and 2x 8-core Xeons, for ~1.5 TB of usable space.

Incoming traffic sources:

8125/udp statsd protocol

  • most of the inbound traffic comes from here, among others:
    • machine-level stats (via diamond)
    • jobrunner, eventlogging, services, etc

2003/tcp carbon line protocol

  • mwprof-to-carbon running on localhost, which in turn receives metrics from mwprof

2003/udp carbon line protocol

  • reqstats running on analytics1026

2004/tcp carbon pickle protocol

  • txstatsd outputs carbon pickle protocol
    • currently from localhost and from swift machines

current setup as of 2015/10/02

The current setup is a machine with 4x SSD in RAID 10, 64 GB of RAM, 2x 8-core E5-2450 and ~1 TB of usable space.

Incoming traffic sources:

8125/udp statsd protocol

  • most of the inbound traffic comes from here, among others:
    • machine-level stats (via diamond)
    • jobrunner, eventlogging, services, etc
    • handled by statsdlb, which does consistent hashing on the metric name and passes metrics on to statsite instances

2003/tcp carbon line protocol

  • cassandra metrics, from cassandra-metrics-collector
  • swift metrics, generated by statsite running locally on swift machines

2003/udp carbon line protocol

  • reqstats running on analytics1026

2004/tcp carbon pickle protocol

  • nothing