
graphite is used to store, display and aggregate both system-level and application-level metrics. Currently, most metrics arrive over UDP at statsd for aggregation/calculation and are then written to graphite via carbon-relay. Graphite in turn writes those metrics to individual whisper files on disk. Whisper takes care of handling the files on disk and downsamples data for long-term retention.
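For reference, the statsd wire format mentioned above is a tiny text protocol over UDP; a minimal sketch of a client sending one counter increment (host, port and metric name here are hypothetical, not taken from production):

```python
import socket

# statsd line protocol: "<metric>:<value>|<type>", e.g. "|c" for a counter.
# Destination is hypothetical; in this setup traffic goes to statsd on 8125/udp.
def send_counter(metric, value=1, host="127.0.0.1", port=8125):
    payload = "%s:%d|c" % (metric, value)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))
    sock.close()
    return payload

send_counter("frontend.requests")
```

Because it is UDP, the sender gets no acknowledgement, which is exactly why an overwhelmed receiver can drop metrics silently (see the performance problem below).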

problems

At the moment all of the above (statsd and load balancer + graphite) runs on graphite1001, which obviously makes it a SPOF. As graphite becomes increasingly important we need to scale it along several dimensions:

  • availability:
    • reduce SPOFs (ideally to 0)
  • performance:
    • at the moment all statsd traffic is received by a single machine over UDP; this also means packets/metrics can be silently dropped if the machine is overwhelmed

solutions

The problem can be attacked from different angles; the solutions below are ordered easiest first (not necessarily simplest, see also Simple is not Easy and Simple made Easy)

performance: revisit aggregation periods

Depending on the purpose, we might want to revisit the aggregations currently in use. As of this writing (November 2017), the default retentions are (source)

retentions = 1m:7d,5m:14d,15m:30d,1h:1y,1d:5y

for all metrics, see also bug T96662
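Since whisper preallocates every archive, these retentions fix the per-metric file size up front. A rough back-of-the-envelope, assuming whisper's documented 12 bytes per datapoint and ignoring the small per-file header:

```python
# Datapoints implied by "1m:7d,5m:14d,15m:30d,1h:1y,1d:5y".
# Whisper stores each point as 12 bytes (4-byte timestamp + 8-byte value).
MINUTE, HOUR, DAY, YEAR = 60, 3600, 86400, 365 * 86400

archives = [
    (MINUTE, 7 * DAY),
    (5 * MINUTE, 14 * DAY),
    (15 * MINUTE, 30 * DAY),
    (HOUR, YEAR),
    (DAY, 5 * YEAR),
]

points = sum(retention // step for step, retention in archives)
size_kib = points * 12 / 1024.0
print(points, round(size_kib))  # ~27.6k points, ~323 KiB per metric
```

So shortening the high-resolution archives directly shrinks every whisper file and the write amplification that goes with it.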

availability: statsd+graphite active/standby behind LVS

While graphite can still run on a single machine, an easy move is to put it (graphite+statsd) behind LVS with a standby machine that periodically pulls whisper files from the active machine. Failover is done by declaring the standby host active (e.g. by changing the pybal config)

pros:

  • familiar and simple architecture

cons:

  • still limited to a single machine

availability: statsd+graphite clustered setup behind LVS

We can use graphite's clustering to have 2+ machines, fronted by LVS, each running statsd+graphite and relaying metrics onto a set of machines (possibly the same set) with consistent hashing and replication. This is essentially the last scenario outlined here: http://bitprophet.org/blog/2013/03/07/graphite/. How to shard/rebalance machines is an open question; tools like carbonate hint at a possible solution.

pros:

  • no new components
  • potentially long-term solution

cons:

  • untested configuration
  • not clear how failed/degraded machines can be brought in sync again
  • not clear how to add new machines without too much disruption
  • potentially painful operations-wise
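The consistent hashing the relays rely on can be sketched roughly like this (a simplified toy ring, not carbon's exact hash function; host names are hypothetical):

```python
import hashlib
from bisect import bisect

# Toy consistent-hash ring: each host gets several virtual points on the
# ring; a metric is routed to the first host clockwise from its own hash.
def _hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring(object):
    def __init__(self, hosts, vnodes=64):
        self.ring = sorted(
            (_hash("%s:%d" % (h, i)), h) for h in hosts for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    def host_for(self, metric):
        idx = bisect(self.keys, _hash(metric)) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["graphite1001", "graphite1002"])
# Deterministic: the same metric always lands on the same host, and
# adding a host only remaps a fraction of the keyspace.
assert ring.host_for("frontend.requests") == ring.host_for("frontend.requests")
```

This also illustrates the open question above: when a host joins or leaves, the metrics that change owners have to be physically moved (which is what carbonate helps with).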

performance+availability: switch graphite backend

Whisper by its nature writes one file per metric; in practice this means at least one disk seek per metric received, which doesn't play well with spinning disks and, to a lesser extent, with SSDs.

There are solutions that switch the graphite backend away from whisper while keeping a compatible HTTP interface.

performance+availability: influxdb backend

One of the alternative graphite backends is influxdb; a possible solution that integrates influxdb and graphite is outlined here: http://dieter.plaetinck.be/influxdb-as-graphite-backend-part2.html
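InfluxDB ships a graphite input plugin that speaks the carbon line protocol directly, so the existing relays could keep writing to port 2003 unchanged. A hedged sketch of the relevant config stanza (exact keys vary by InfluxDB version):

```
# Hypothetical influxdb.conf fragment: accept carbon line protocol on 2003/tcp.
[[graphite]]
  enabled = true
  bind-address = ":2003"
  database = "graphite"
```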

pros:

  • scalable
  • can be used for more than graphite data
  • powerful query language
  • more and more tools are supporting influxdb
  • compression support in some backends

cons:

performance+availability: cassandra backend (cyanite)

Another option is cassandra for storage, via cyanite, using graphite-cyanite for the read path.
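graphite-cyanite plugs into graphite-api as a storage finder, so dashboards keep talking to the same HTTP API. A sketch of the graphite-api configuration this would need (the URL and exact finder path are assumptions based on the graphite-cyanite README):

```
# Hypothetical /etc/graphite-api.yaml fragment
finders:
  - cyanite.CyaniteFinder
cyanite:
  urls:
    - http://localhost:8080
```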

pros:

  • scalable
  • tunable data compression & native TTL handling
  • append-only LSMs, sequential IO on write & compaction
  • cassandra already lined up to be used in production
  • cyanite daemon itself is stateless

cons:

performance+availability: cassandra backend (newts)

newts is a timeseries database backed by cassandra. During the Lyon hackathon, Eric Evans added support for ingesting line-protocol graphite metrics and Filippo Giunchedi added a graphite-api backend for it, graphite-newts.

pros:

  • scalable
  • we have experience with cassandra already
  • newts expertise inside the WMF

cons:

  • more development might be needed

performance+availability: opentsdb backend

Yet another option would be to use opentsdb and plug it on top of graphite-api so it is backwards compatible.

opentsdb doesn't do rollups or expire old data points by itself, so the idea would be to use carbon-aggregator to aggregate the incoming points and write them to separate metrics (e.g. for incoming metric foo.bar the 10m datapoints would be written to _10m.foo.bar), then retrieve the appropriate aggregated series when fetching data via graphite-api
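The renaming scheme above is easy to sketch: aggregated points go under a resolution prefix, and the read path picks the series whose resolution matches the requested range. A minimal sketch, assuming averaging as the rollup method and the `_10m.` prefix from the example (both are illustrative choices, not a spec):

```python
from collections import defaultdict

# Sketch of the proposed rollup naming: a 10-minute average of incoming
# metric "foo.bar" is stored under the separate name "_10m.foo.bar".
def rollup(points, metric, window=600, prefix="_10m"):
    """points: iterable of (timestamp, value); returns the prefixed
    metric name and the averaged datapoints, one per window."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window].append(value)
    name = "%s.%s" % (prefix, metric)
    return name, sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())

name, series = rollup([(0, 1.0), (300, 3.0), (600, 5.0)], "foo.bar")
# The first two points share a 10m bucket and are averaged.
```

In the real setup this aggregation would happen in carbon-aggregator on the write path, not in the query path.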

pros:

  • scalable
  • hdfs/hbase based, we can reuse some of the existing infrastructure/knowledge
  • tunable data compression

cons:

  • untested solution

performance+availability: switch away from graphite

Another option is to switch away from graphite entirely and move to a different model of multidimensional metrics (e.g. what influxdb, prometheus and opentsdb do: a datapoint can have several key=value pairs associated with it) and more powerful aggregation
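The difference in models is easiest to see side by side: graphite encodes dimensions positionally in the dotted name, while the multidimensional systems attach explicit key=value labels (the metric and label names below are made up for illustration):

```
# graphite: dimensions are positions in a dotted path
servers.mw1001.cpu.user

# multidimensional (influxdb/prometheus/opentsdb style): explicit labels
cpu_user{host="mw1001", cluster="appserver"}
```

Labels make aggregation across any dimension natural (e.g. sum over all hosts in a cluster), where graphite relies on wildcard positions in the path.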

pros:

  • more powerful model
  • long-term commitment/solution

cons:

  • most tools and dashboards expect a graphite-compatible API, though grafana for example can work with opentsdb natively
  • graphite's on-retrieval aggregation/summarization functions are useful and would be missing; influxdb has some similar functionality

clustering

Sooner or later we will exceed a machine's capacity in one (or more) dimensions; when that happens we can turn to graphite clustering. We would cluster for performance reasons, not redundancy, since we would already be mirroring metrics to codfw/eqiad. In that case we can incrementally scale each cluster by adding machines to a consistent hash ring and instruct carbon-c-relay to send metrics to the appropriate machine based on the ring. Of course, adding/removing machines from a ring means the ring has to be rebalanced; for that we can use carbonate (https://github.com/jssjr/carbonate) to move metrics onto the right machines.
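A hedged sketch of what the carbon-c-relay side of such a ring could look like (host names are hypothetical; see the carbon-c-relay documentation for the exact syntax):

```
# Hypothetical carbon-c-relay config: a consistent-hash cluster with
# replication 2; each metric is written to two members of the ring.
cluster graphite
    carbon_ch replication 2
        graphite1001:2003
        graphite1002:2003
        graphite1003:2003
    ;

match *
    send to graphite
    stop;
```

Adding a host to the `cluster` block changes the ring, which is exactly the point where carbonate would be needed to move existing whisper files to their new owners.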

current setup as of 2014/11/14

The current setup is a single db-class machine (tungsten) with 12 spinning disks, 64 GB of RAM and 2x 8-core Xeons, for ~1.5 TB of usable space.

Incoming traffic sources:

8125/udp statsd protocol

  • most of the inbound traffic comes from here, among others:
    • machine-level stats (via diamond)
    • jobrunner, eventlogging, services, etc

2003/tcp carbon line protocol

  • mwprof-to-carbon running on localhost, which in turn receives metrics from mwprof

2003/udp carbon line protocol

  • reqstats running on analytics1026

2004/tcp carbon pickle protocol

  • txstatsd outputs carbon pickle protocol
    • currently from localhost and from swift machines

current setup as of 2015/10/02

The current setup is a machine with 4x SSD in RAID 10, 64 GB of RAM, 2x 8-core E5-2450 and ~1 TB of usable space.

Incoming traffic sources:

8125/udp statsd protocol

  • most of the inbound traffic comes from here, among others:
    • machine-level stats (via diamond)
    • jobrunner, eventlogging, services, etc
    • handled by statsdlb, which does consistent hashing on the metric name and passes metrics on to statsite instances

2003/tcp carbon line protocol

  • cassandra metrics, from cassandra-metrics-collector
  • swift metrics, generated by statsite running locally on swift machines

2003/udp carbon line protocol

  • reqstats running on analytics1026

2004/tcp carbon pickle protocol

  • nothing