Graphite


Graphite is a real-time time series data store and graph renderer. The system is similar to RRDtool, but much more scalable and with faster access, letting it handle huge numbers of metrics while remaining fast.

A big advantage is that metric identifiers do not need to be predefined on the server side, which saves a lot of configuration overhead; metric names are handled by the client. Because of this, submission of data is not publicly exposed but deferred to other deployed applications. See #Data sources for more about that.

Front-ends

Wikimedia deploys various web applications that provide convenient ways to access the data and generate graphs.

  • grafana.wikimedia.org, a frontend for flexibly querying metrics and creating new graphs. Unlike other front-ends, this queries the raw data and renders interactive graphs client-side.
  • graphite.wikimedia.org (restricted), the default graphite-web frontend. Provides a visual interface to all raw metrics, discovering functions to transform data, and an API with PNG and JSON output formats.

Service

The graphite receiver is hosted on graphite1001. (Previously on tungsten.)

Always use graphite-in.eqiad.wmnet as the inbound receiver endpoint for graphite/carbon protocol traffic (i.e. port 2003). This decouples the inbound receiver service from the actual hosting machine, allowing safer maintenance operations as well as easier HA/load balancing. For statsd pushing via UDP on port 8125, see the guidelines at statsd.

For the beta cluster, the receiver is labmon1001, with graphs available at https://graphite-labs.wikimedia.org. If you are adding custom metrics for your labs project, do not name them under the same top-level component as your labs project name (or that of another project); if you do, they will be auto-archived as instances are created or deleted.

Data sources

Graphite is one of the primary aggregators for metrics at Wikimedia. It provides a powerful API to query, transform and aggregate the data.

Data is rarely recorded with Graphite directly. Most commonly, data goes through statsd.

statsd

The statsd server acts as an intermediary between Graphite and other applications.
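
For example, a single counter increment can be pushed to statsd as one UDP datagram in the statsd line protocol (name:value|type). A minimal sketch, assuming statsd.eqiad.wmnet is the current statsd endpoint (see the statsd page for the authoritative endpoint) and using a hypothetical metric name:

 $ echo "my_app.requests:1|c" | nc -u -w1 statsd.eqiad.wmnet 8125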

EventLogging

To aggregate data from EventLogging events sent by client-side JavaScript, we usually create a Python script that subscribes to the relevant topics on the EventLogging ZMQ stream and reacts by sending packets to statsd. Such a script is then deployed on hafnium through the role::webperf role in puppet.

For example, puppet:///webperf/navtiming.py.

See Webperf for more information about how this works. See EventLogging for how to create new schemas and start sending events from your application.

statsv

statsv is an HTTP beacon endpoint (/beacon/statsv) for sending data to statsd.

Data flow: HTTP → Kafka → statsd.

It's a lightweight way of sending data from clients. This is useful when you only require one or more values to be aggregated, without needing the overhead of an EventLogging schema or storing each entry in a database.

See statsv.py and kafka::statsv.

You can hit this endpoint directly with an HTTP request.
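
For example, with curl. This is a sketch: the metric name is hypothetical, the beacon is assumed to be served on the production wiki domains, and the query-string format (value suffixed with the statsd type, e.g. ms for a timing, c for a counter) should be checked against statsv.py:

 $ curl 'https://meta.wikimedia.org/beacon/statsv?MediaWiki.foo_bar=1234ms'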

Within MediaWiki, you should use the abstraction layer provided by the WikimediaEvents extension. Use the "timing" and "counter" topic namespaces of mw.track (source), e.g. mw.track( 'timing.foo', 1234.56 ) or mw.track( 'counter.bar', 5 ). Note that this abstraction layer does not add the $wgStatsdMetricPrefix (see below), so if you want to track events with the same prefix from MediaWiki and statsv, you will have to add it to the metric name yourself.

Code here: https://gerrit.wikimedia.org/r/#/projects/analytics/statsv,dashboards/default

MediaWiki

Use MediaWiki's MediaWikiServices::getInstance()->getStatsdDataFactory() interface. This buffers data within the process and sends it to statsd at the end of the request.

Note that properties from MediaWiki automatically get the MediaWiki. prefix added to the metric name (configurable with $wgStatsdMetricPrefix).

TCP

To record data to Graphite directly, the client sends a simple message over TCP port 2003 containing three space-separated entries:

  1. Metric name.
  2. Numeric value.
  3. Unix timestamp.

Example:

$ echo "my.metric 1911 $(date +%s)" | nc -q0 graphite-in.eqiad.wmnet 2003

The my.metric name does not need to be preconfigured in Graphite; it will happily be recorded as-is, and any missing hierarchy is created automatically.

Everything stored in Graphite has a path with components delimited by dots. In a path such as "foo.bar.baz", each part between the dots is called a path component: "foo" is a path component, as is "bar", and so on. When coming up with metric names, adhere to the following guidelines:

  • Each path component should have a clear and well-defined purpose.
  • Volatile path components (for example hostnames) should be kept as deep in the hierarchy as possible, e.g. prefer frontend.requests.server1001 over server1001.frontend.requests.

Terminology

  • Metric (also known as Bucket). Each metric has a name and a bucket with one or more values over time.
  • Flush interval. At a configured interval, the statsd server will aggregate all buckets and send the representative values for each property to Graphite. At Wikimedia the interval is currently one minute.
  • Aggregation. Each minute, statsd takes each bucket and summarises all values with a single value to represent that minute. It also creates the derivative properties at this point (e.g. lower, upper, p95, etc.). At later stages, once in graphite, more aggregation happens. For example, data older than 7 days is represented in intervals of 5 minutes, and after 30 days the interval is 15 minutes. [1]
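
As a worked example with hypothetical values: if a counter bucket receives the increments 1, 4 and 2 within one flush interval, statsd flushes sum = 7, count = 3, mean ≈ 2.33, lower = 1, upper = 4 and rate = 7 / 60 ≈ 0.12 (per second) to Graphite.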

Troubleshooting

carbon-cache too many creates

This alert signals that too many Whisper files (and therefore disk space) are being consumed on disk. It can be benign, e.g. when a new cassandra instance is bootstrapped there is a flood of new metrics being created. To check which files are being created:

 sudo tail -F /var/log/upstart/carbon_cache-*.log /var/log/carbon/carbon-cache@*/creates.log | grep 'creating database file'

Applying carbon storage_aggregation changes

The current settings are only used when creating new Whisper data sets. Existing ones will generally not be affected. To apply, say, newer xFilesFactor configuration to an existing property, use the following steps.

While there is no script to read the current would-be settings and apply them, there is a script to manually apply specific settings.

  1. Get new xFilesFactor settings for the relevant metric from puppet:/role/graphite/base.pp. For example: "0.01" or "0".
  2. Get new retention settings from puppet:/role/graphite/base.pp. For example: "1m:7d,5m:14d,15m:30d,1h:1y,1d:5y".
  3. Check current settings:
    $ whisper-info mw/js/deprecate/tipsy_live/sum.wsp
    xFilesFactor: 0.0
  4. $ sudo -su _graphite (or whichever user is the owner of the wsp file)
  5. Run whisper-resize and set xFilesFactor, then the path, and then the retention values as distinct space-separated command-line arguments:
    $ whisper-resize --xFilesFactor=0 mw/js/deprecate/tipsy_live/sum.wsp 1m:7d 5m:14d 15m:30d 1h:1y 1d:5y
  6. Remember to run it on both graphite1001 and graphite2001 for Wikimedia.
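
To verify that the resize took effect, whisper-info can be run again on the same file and the xFilesFactor and retention archives compared against the intended settings:

 $ whisper-info mw/js/deprecate/tipsy_live/sum.wsp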

Identifying heavy and/or expensive queries

Graphite might suffer from heavy CPU or memory load when queries requesting a lot of data are run. It is possible to identify those after the fact by asking uwsgi for big (as in bytes) or long (as in milliseconds) queries, by awk-ing the request logs. See for example bug T116767.

 # filter for >5MB responses
 | awk '$26 > 5000000 {print }'  | less 
 # filter for > 9s responses
 | awk '$29 > 9000 {print }'
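
A full invocation might look like the sketch below; the uwsgi request log path here is hypothetical, so check the actual log location on the graphite host first:

 # hypothetical log path; the field numbers match the filters above
 $ awk '$26 > 5000000 {print}' /var/log/uwsgi/graphite-web.log | less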

Extended properties

This is the missing manual about aggregation by statsd and Graphite at Wikimedia. It describes the primary metric types we use: counters and timers.

For other metric types, see Statsd Metric Types.

Counters

A simple counter that resets each flush interval. Aggregation layers will combine values using sum().

A single push can increment the counter by any positive number. Incrementing by 1 is most common in application code; however, aggregation may happen at any layer, even within applications, so StatsD may see your increments as higher than 1.

Recommended properties:

  • rate: The total per second. This is initially computed by dividing the minute's sum by 60.
    This is aggregated by Graphite using avg(). It remains accurate as "average rate per second".
    • Tip: Use scale() to plot a counter for other intervals. For example, to draw the rate per minute, use metric.rate and scale(60), as shown in the sketch after this list.
  • sum: The total per variable aggregation window. This is not reliably per-minute.
    This is aggregated by Graphite using sum(). Viewing older data shows higher values than recent data.
    Recent data (less than 7 days old) retains a sum of each minute, for older data there is only a sum() over larger windows (e.g. per 15min, per hour or per day, see retentions in graphite config.)
    • Tip: Only use sum to produce an accurate total. For example, in conjunction with integral().
    • Tip: Use rate instead if you want a rate per known interval. For example, rate per second, or rate per minute.
  • lower: The lowest single increment command per variable aggregation window.
    This is aggregated by Graphite using min().
    Viewing older data shows lower values than recent data, which has a data point per minute.
  • upper: The highest single increment per variable aggregation window.
    This is aggregated by Graphite using max().
    Viewing older data shows higher values than recent data, which has a data point per minute.
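
As mentioned in the rate tip above, the per-minute rate can be drawn by scaling the rate property; a sketch using the render API, with my.metric standing in for a real metric name:

 $ curl 'https://graphite.wikimedia.org/render?target=scale(my.metric.rate,60)&from=-24h&format=png' > rate.png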

Discouraged properties:

  • count: Number of StatsD commands received per variable aggregation window. If your application only increments by 1, and there are no buffers or aggregation layers between that code and StatsD, then this is usually equal to sum. Otherwise they will differ: for a sequence received as "+0, +4, +1", count records 3 commands, whereas sum records 5.
    This is aggregated by Graphite using sum(). Viewing older data shows higher values than recent data, see sum for why.
  • mean: The average of all increments per variable aggregation window.
    This is aggregated by Graphite using avg(). Data older than 7 days is not statistically meaningful (average of averages).

Timers

Track the duration of a particular event.

Recommended properties:

  • median: The middle timing value per variable aggregation window. See also Comparison of mean and median on Wikipedia.
    This is aggregated by Graphite using avg(). Data older than 7 days is not statistically meaningful (average of average of medians).
  • sample_rate: The total number of values per second. This behaves identically to Counter.rate.
    This is aggregated by Graphite using avg(). It remains accurate as "average rate per second".
    • Tip: Use scale() to plot a rate for other intervals. For example, to draw the rate per minute, use metric.sample_rate and scale(60).
  • p75, p95, p99, etc.: See also Percentile on Wikipedia.
    This is aggregated by Graphite using avg(). Data older than 7 days is not statistically meaningful (average of average of percentile).
  • sum: The total sum of timing durations in this interval.
    Use this to compute the (globally) total amount of time spent in your metric.
    For example "250ms, 300ms, 550ms" produces 1100ms, not 3.
    This is aggregated by Graphite using sum(). Viewing older data shows higher values than recent data.
  • lower: The lowest value in an interval.
    This is aggregated by Graphite using min().
  • upper: The highest value in an interval.
    This is aggregated by Graphite using max().

Discouraged properties:

  • count: Number of StatsD commands received per variable aggregation window. This is not reliably per minute.
    This is aggregated by Graphite using sum(). Viewing older data shows higher values than recent data.
    • Tip: Use sample_rate instead if you want a rate per known interval. For example, rate per second, or rate per minute.
  • rate: Do not use Timer.rate! For the equivalent of Counter.rate, see Timer.sample_rate! The rate here is the sum of the timer for the reporting interval (in milliseconds) divided by the reporting interval (in seconds). In other words, it is the total time of that measurement, normalized to the second.
    To draw a counter from a timing metric, use sample_rate instead.
  • mean: The average of all values in this interval.
    This is aggregated by Graphite using avg(). Data older than 7 days is not statistically meaningful (average of averages).

Functions

Here is a short list of common functions you should know about.

Moving average

Adding a moving average to your metric can turn an oscillating line, which obscures any trend, into a line that more accurately reflects how values are changing over time. Expand your query to at least 24 hours and start with a window of 5 data points, increasing as needed up to 100; higher values generally make the data too influenced by old data. To produce an average per day or week, use summarize() instead.
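
A sketch of such a query via the render API, applying movingAverage() over a 5-point window; my.metric.median is a stand-in for a real metric name:

 $ curl 'https://graphite.wikimedia.org/render?target=movingAverage(my.metric.median,5)&from=-24h&format=png'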

Time shift

Events influenced by user input (e.g. how long it takes to parse an article), or events that happen on the user's device, often have a daily and weekly pattern to them. Looking at the last 12 hours of data (even with a moving average) might not tell you much as it will always be going up or down depending on the time of day.

A time shift gives you context about how this metric behaved in the past and helps decide whether it is higher or lower than usual. Typically we add a time shift to show the metric at the same time yesterday and last week.
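
For example, timeShift() can overlay the same metric as it was one day and one week earlier; a sketch with my.metric as a stand-in:

 $ curl 'https://graphite.wikimedia.org/render?target=my.metric&target=timeShift(my.metric,"1d")&target=timeShift(my.metric,"7d")&format=png'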

See the Navigation Timing dashboard for an example.

Summarize

For graphs showing the history of a metric over the course of several weeks or months, it can be helpful to summarise data points into a larger interval to hide normal variation. For example, it is much easier to see a regression from ~10 to ~20 on a straight line than on a line that continuously wiggles between 1 and 30. Even after aggregation into a median and application of a moving average, data can still exhibit wide variation over longer periods of time.

summarize() helps you plot bold, spaced-out data points onto a graph, for example one value per hour, day, or week. To emphasise changes in the metric more prominently, the "staircase" line mode can be used.
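
A sketch of a daily summary via the render API, averaging each one-day bucket; my.metric.median is a stand-in:

 $ curl 'https://graphite.wikimedia.org/render?target=summarize(my.metric.median,"1d","avg")&from=-30d&format=png'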

See the "History" panel on the Navigation Timing dashboard for an example.

Operations

Deleting metrics

There is no (as of Nov 2016) formal/periodic clean-up of old or unwanted metrics. To get metrics deleted, please file a Phabricator task under #Graphite.

For people with access to graphite: to delete metrics it is sufficient to find and remove the relevant files/directories under /var/lib/carbon/whisper on graphite1001 and graphite2001. (Exception: cassandra. metrics live on graphite1003 and graphite2002.)

rsync metrics

Graphite machines run an rsync server to make metrics accessible to other graphite machines in case manual sync is needed (e.g. after machine failure). To rsync in parallel, first sync only top-level directories and then the contents in parallel:

 install -d -o _graphite -g _graphite /var/lib/carbon/whisper
 su -s /bin/bash _graphite
 cd /var/lib/carbon/whisper
 rsync -vd SOURCE::carbon/whisper/ .
 /usr/bin/time parallel -j5 -i rsync -a SOURCE::carbon/whisper/{}/ {}/ -- *

Merge and sync metrics

The rsync method above provides a "point in time" sync. Another, albeit slower, possibility is to sync and merge metrics from one machine onto another. Thanks to carbonate this can be done in the background, where metrics are rsync'd from a host and then merged onto existing ones. With this method the metric files are locked during the merge, so it can be used to fill "holes" even while the destination machine is actively being written to. The carbonate package is needed (available internally on stretch).

 (cd /srv/whisper ; find . -type f | carbon-path -r) | time parallel --files --jobs 10 --pipe --block 20k -- carbon-sync -s SOURCE -d /srv/whisper --source-storage-dir :carbon/whisper -l

List slow queries

Graphite's web application logs how long each query took; you can list all queries taking over a certain amount of time (in seconds) with:

 awk -Ftook '$2 > 10 { print }' /var/log/graphite-web/rendering.log

Note that a slow query isn't necessarily the cause of a slowdown; it might be a consequence.

List metrics being created

 tail -F /var/log/carbon/carbon-cache@*/creates.log | grep 'matched schema'

FAQ

How do I render a counter metric as a running total?

Start from the sum property, which stores the total per interval (e.g. minute or hour), then apply integral() to produce a running total. (Original discussion at T108480)
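
A sketch of such a query via the render API, with my.metric as a stand-in for a real metric name:

 $ curl 'https://graphite.wikimedia.org/render?target=integral(my.metric.sum)&from=-7d&format=json'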

What queries are being asked to graphite?

The graphite web application graphite-web does a fair amount of logging, specifically to /var/log/graphite-web/metricaccess.log and /var/log/graphite-web/rendering.log. All requested queries are logged, together with how much time was spent serving them.

References

  1. Graphite configuration, Wikimedia operations puppet