Graphite
Graphite is a real-time time series data store and graph renderer.
Front-ends
Grafana
Grafana.wikimedia.org is the primary frontend to Graphite used at Wikimedia Foundation for providing convenient ways to access the data and generate graphs, charts, and tables. It allows for flexible querying of metrics and easily creating new graphs and dashboards. Note that Grafana performs its Graphite data queries and graph rendering client-side.
graphite.wikimedia.org
This is the built-in graphite-web frontend. It provides a complete listing of metric names for discovery, server-side rendering of PNG graphs, and a JSON API for Grafana. While the graph images and JSON API are public, the discovery interface is restricted to WMF staff and people with an NDA (using your Wikimedia Developer account, or LDAP, for authentication).
Service operation
The Graphite receiver for production is hosted on:
- graphite1004 (Eqiad)
- graphite2003 (Codfw) - hot standby
The graphite-web frontend is served through text-lb (Varnish).
Former hosts: graphite1001, tungsten.
TCP
Always use graphite-in.eqiad.wmnet as the inbound receiver endpoint for graphite/carbon protocol traffic (i.e. port 2003). This decouples the inbound receiver service from the actual hosting machine, allowing safer maintenance operations as well as easier HA/load balancing. For statsd pushing via UDP on port 8125, see the guidelines at Statsd.
Metric creation
Graphite does not have a procedure for creating or registering metrics. Instead, Graphite clients can submit data for any metric by name, and the metric is created automatically in the storage backend.
Everything stored in Graphite has a path with components delimited by dots. In a path such as foo.bar.baz, each segment between dots is called a path component. So "foo" is a path component, as is "bar", and so on. When coming up with metric names, adhere to the following guidelines:
- Each path component should have a clear and well-defined purpose.
- Volatile path components should be kept as deep into the hierarchy as possible.
The automatic creation of metrics avoids overhead of configuration, but also means metric name fields should not be publicly exposed to user input. This is instead the responsibility of Graphite clients. For a list of Graphite clients in use at Wikimedia Foundation, see #Data sources.
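One way a client could honour both the naming guidelines and the warning about user input is to build paths from a fixed list of components and sanitize anything variable. The following is a minimal sketch only: the allowed character set, the helper names and the example path are assumptions made for illustration, not rules enforced by Graphite or by any Wikimedia client.

```python
import re

# Assumption for this sketch: keep only letters, digits, "_" and "-".
# Everything else, including dots, is replaced so that a single value
# can never introduce extra path components.
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_component(value: str) -> str:
    """Turn an arbitrary string into exactly one path component."""
    cleaned = _UNSAFE.sub("_", value)
    return cleaned or "_"


def metric_path(*components: str) -> str:
    """Join sanitized components with dots.

    Callers should pass stable components first and volatile ones last,
    so that volatile values end up deep in the hierarchy.
    """
    return ".".join(sanitize_component(c) for c in components)
```

For example, `metric_path("frontend", "navtiming", "en.wikipedia")` keeps the hypothetical volatile wiki name as one trailing component instead of letting its dot create a new level.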
Terminology
- Metric (also known as Bucket). Each metric has a name and a bucket with one or more values over time.
- Flush interval. At a configured interval, the Statsd server will aggregate all buckets and send the representative values for each property to Graphite. At Wikimedia the interval is currently one minute.
- Aggregation. Each minute, Statsd takes each bucket and reduces its values to a single value to represent that minute. It also creates the derivative properties at this point (e.g. "median", "rate", "sum", "p95", etc.).
- Retention and Resolution. In Graphite, each metric has multiple databases with differing retentions and resolutions. For example, data kept for 7 days has a resolution of 1 data point per minute. Data kept for 30 days has a resolution of 1 data point per 15 minutes.[1]
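The relationship between retention and resolution is plain arithmetic: the number of points an archive holds is its duration divided by its resolution. The sketch below works this out for the two archives used as an example above. The "resolution:duration" string format follows the retention settings quoted elsewhere on this page; the function names and the supported unit letters are assumptions of this sketch.

```python
from typing import List, Tuple

# Unit letters handled by this sketch (an assumption, not a full parser).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}


def to_seconds(spec: str) -> int:
    """Convert a value such as '15m' or '7d' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def archive_sizes(retentions: str) -> List[Tuple[int, int]]:
    """Return (seconds per point, number of points) for each archive."""
    sizes = []
    for archive in retentions.split(","):
        resolution, duration = archive.strip().split(":")
        step = to_seconds(resolution)
        sizes.append((step, to_seconds(duration) // step))
    return sizes


# 7 days at one point per minute, 30 days at one point per 15 minutes.
example = archive_sizes("1m:7d,15m:30d")
```

Here `example` is `[(60, 10080), (900, 2880)]`: the coarser archive covers more time with fewer points, which is why older data is always an aggregate of several original values.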
Data sources
Graphite is one of the primary aggregators for metrics at Wikimedia Foundation. It provides a powerful API to query, transform, and aggregate the data.
Data is usually not submitted to Graphite directly. Instead, it should go through one of the below clients. The most commonly used client is Statsd.
statsd
The Statsd server acts as an intermediary between Graphite and other applications.
Data submitted via Statsd uses the raw metric name from Statsd as-is inside Graphite. No prefix or namespace is added by default. However, Statsd does add #Extended properties to your metrics.
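As a rough illustration of what a Statsd client puts on the wire, the sketch below formats and sends a single datagram using the common StatsD line format `name:value|type` (`c` for counters, `ms` for timers). The host and port are the ones named on this page; the metric names are hypothetical, and real applications would normally use an existing client library rather than this hand-rolled helper.

```python
import socket


def statsd_line(name: str, value: int, kind: str) -> bytes:
    """Format one StatsD datagram, e.g. b'example.requests:1|c'."""
    return f"{name}:{value}|{kind}".encode("ascii")


def send_statsd(line: bytes,
                host: str = "statsd.eqiad.wmnet",
                port: int = 8125) -> None:
    """Fire-and-forget: UDP gives no confirmation of delivery."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line, (host, port))


# Hypothetical metric names, for illustration only.
counter = statsd_line("example.requests", 1, "c")
timer = statsd_line("example.load_time", 250, "ms")
```

Calling `send_statsd(counter)` from a host that can reach the aggregator would submit the increment. Because the name is used as-is, the resulting Graphite path would start with `example.`.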
EventLogging
To aggregate data from EventLogging events sent by client-side JavaScript, we usually create a Python daemon that subscribes to the relevant EventLogging topics in Kafka and reacts by sending packets to Statsd. See EventLogging for how to create new schemas.
statsv
MediaWiki
Use MediaWiki's MediaWikiServices::getInstance()->getStatsdDataFactory() interface. This buffers data within the process, and sends it to Statsd at the end.
Note that properties from MediaWiki automatically get the MediaWiki. prefix added to the metric name (configured by $wgStatsdMetricPrefix).
TCP
To record data to Graphite directly, the client must send a message over TCP to port 2003 that contains three space-separated entries:
- Metric name.
- Integer value.
- Unix timestamp.
Example:
$ echo "example.metric 1911 $(date +%s)" | nc -q0 graphite-in.eqiad.wmnet 2003
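The same message can be sent from application code. The following is a minimal sketch of the three-field line described above, using only the Python standard library. The function names and the timeout value are assumptions of this example; the endpoint and port are the ones given on this page.

```python
import socket
import time
from typing import Optional


def carbon_line(name: str, value: int,
                timestamp: Optional[int] = None) -> bytes:
    """Build one 'name value timestamp' line, newline-terminated."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {timestamp}\n".encode("ascii")


def send_carbon(line: bytes,
                host: str = "graphite-in.eqiad.wmnet",
                port: int = 2003,
                timeout: float = 5.0) -> None:
    """Open a TCP connection, send the line, and close the connection."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line)
```

For example, `send_carbon(carbon_line("example.metric", 1911))` is the equivalent of the `nc` command above.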
Extended properties
This is the missing manual about aggregation by Statsd and Graphite, at Wikimedia. This describes the primary metric types we use: counters and timers. For other metric types, see Statsd Documentation: Metric Types.
Counters
A simple counter that resets after each interval. Aggregation layers will combine values using sum().
A single push can increment the counter by any positive number. Incrementing by 1 is most common in application code; however, aggregation may happen within your application or at other layers in between the application and Graphite. As such, Statsd may see your increments as higher than 1.
Recommended properties:
- rate: The total per second. This is initially computed by dividing the minute's sum by 60. This is aggregated by Graphite using avg(). It remains accurate as "average rate per second".
  - Tip: Use scale() to plot a counter for other intervals. For example, to draw the rate per minute, use metric.rate and scale(60).
- sum: The total per variable aggregation window. This is not always per-minute. This is aggregated by Graphite using sum(). Viewing older data shows higher values than recent data. When querying recent data only (< 7 days), points show a sum per minute. When including older data, there is only a sum() over larger windows of time (e.g. 15 minutes, 1 hour, or more).
  - Tip: Only use sum when intending to plot an accurate total. For example, in conjunction with integral().
  - Tip: To plot the rate per minute, or per second, use rate instead!
- lower: The lowest single increment command per variable aggregation window. This is aggregated by Graphite using min(). Viewing older data may show lower values than recent data due to aggregation.
- upper: The highest single increment per variable aggregation window. This is aggregated by Graphite using max(). Viewing older data may show higher values than recent data due to aggregation.
Discouraged properties:
- count: Number of StatsD commands received per variable aggregation window. If there is no aggregation in your application or elsewhere between the application and StatsD, and if all increments are by 1, then this is usually equal to sum. Otherwise, this will differ. For example, after a sequence of "+1, +4, +2" the count is recorded as 3 commands, where sum would record 7. This is aggregated by Graphite using sum(). Viewing older data shows higher values than recent data; see sum for why.
- mean: The average of all increments per variable aggregation window. This is aggregated by Graphite using avg(). Data older than 7 days is not statistically meaningful (average of averages).
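The counter properties can be worked through with the "+1, +4, +2" sequence mentioned above. The sketch below is a simplified model, not Statsd's actual implementation: it derives the properties for one minute, then rolls fifteen identical minutes into a single 15-minute point the way the text describes (sum() for sum, avg() for rate). The function and variable names are invented for this example.

```python
from statistics import mean
from typing import Dict, List


def flush_counter(increments: List[int], interval: int = 60) -> Dict[str, float]:
    """Derive counter properties for one flush interval.

    Assumes at least one increment was received in the interval.
    """
    total = sum(increments)
    return {
        "sum": total,
        "count": len(increments),
        "rate": total / interval,
        "lower": min(increments),
        "upper": max(increments),
        "mean": total / len(increments),
    }


minute = flush_counter([1, 4, 2])   # sum 7, count 3, rate 7/60 per second

# Roll fifteen identical one-minute points into one 15-minute point.
minutes = [minute] * 15
rolled_sum = sum(m["sum"] for m in minutes)      # grows with window size
rolled_rate = mean(m["rate"] for m in minutes)   # still a per-second rate
```

The rolled-up sum is 105 rather than 7, which is the "older data shows higher values" effect, while the rolled-up rate stays at 7/60 per second.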
Timers
Track the duration of a particular event. Recommended properties:
- median: The middle timing value per variable aggregation window. See also Comparison of mean and median on Wikipedia. This is aggregated by Graphite using avg(). Data older than 7 days is not statistically meaningful (average of average of medians).
- sample_rate: The total number of values per second. This behaves identically to Counter.rate. This is aggregated by Graphite using avg(). It remains accurate as "average rate per second".
  - Tip: Use scale() to plot a counter for other intervals. For example, to draw the rate per minute, use metric.sample_rate and scale(60).
- p75, p95, p99, etc.: See also Percentile on Wikipedia. This is aggregated by Graphite using avg(). Data older than 7 days is not statistically meaningful (average of average of percentile).
- sum: The total sum of timing durations in this interval. Use this to compute the (globally) total amount of time spent in your metric. For example, "250ms, 300ms, 550ms" produces 1100ms, not 3. This is aggregated by Graphite using sum(). Viewing older data shows higher values than recent data.
- lower: The lowest value in an interval. This is aggregated by Graphite using min().
- upper: The highest value in an interval. This is aggregated by Graphite using max().
Discouraged properties:
- count: Number of StatsD commands received per variable aggregation window. This is not reliably per minute. This is aggregated by Graphite using sum(). Viewing older data shows higher values than recent data.
  - Tip: Use sample_rate instead if you want a rate per known interval. For example, rate per second, or rate per minute.
- rate: Do not use Timer.rate! For the equivalent of Counter.rate, see Timer.sample_rate! The rate here is the sum of the timer for the reporting interval (in milliseconds) divided by the reporting interval (in seconds). In other words, it is the total time of that measurement, normalized to the second. It's weird and confusing. To draw a counter from a timing metric, use sample_rate instead.
- mean: The average of all values in this interval. This is aggregated by Graphite using avg(). Data older than 7 days is not statistically meaningful (average of averages).
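To see why an average of medians stops being statistically meaningful, consider two one-minute windows with very different sample counts. The numbers below are made up for illustration; the point is only that averaging per-window medians ignores how many samples each window had.

```python
from statistics import mean, median

# Hypothetical timings in milliseconds.
minute_a = [90, 100, 110, 100, 95, 105, 100]   # busy minute, 7 samples
minute_b = [2000]                              # quiet minute, 1 slow sample

# Median over all eight raw samples.
true_median = median(minute_a + minute_b)

# What avg() over the two stored medians would give.
avg_of_medians = mean([median(minute_a), median(minute_b)])
```

The true median is 100 ms, while the average of the two stored medians is 1050 ms, because the single slow sample is given the same weight as the seven fast ones.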
Functions
Here is a short list of common functions you should know about.
summarize
For graphs showing the history of a metric over the course of several weeks or months it can be helpful to summarise data points to a higher interval to help hide normal variation. For example, it's much easier to see a regression from ~ 10 to ~ 20 on a straight line than a line that continuously wiggles between 1 and 30. Even after aggregation into a median and application of moving average, data can still exhibit a wide variation over longer periods of time.
summarize() helps you plot very bold and spaced out data points onto a graph. For example, one value per hour, day, or week. To emphasise changes in the metric more prominently, the "staircase" line mode can be used in addition to this.
See the "History" section on the Navigation Timing dashboard for an example of the summarize() function.
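Outside Grafana, the same function can be tried against the public JSON API by wrapping a metric in summarize() in the target parameter of a render request. The sketch below only builds such a URL. The metric name is hypothetical, and the chosen interval, aggregation function and time range are arbitrary defaults picked for this example.

```python
from urllib.parse import urlencode


def summarize_url(metric: str,
                  interval: str = "1d",
                  func: str = "avg",
                  since: str = "-30d",
                  base: str = "https://graphite.wikimedia.org") -> str:
    """Build a render URL returning one summarized point per interval."""
    target = f'summarize({metric},"{interval}","{func}")'
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{base}/render?{query}"


# Hypothetical metric name, for illustration only.
url = summarize_url("example.metric.median")
```

Fetching the resulting URL would return one averaged value per day for the last 30 days, instead of the underlying finer-grained points.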
aliasByNode
For longer metric paths, this function can help shorten the labels in the legend. Its benefit over assigning labels manually with alias() is that the label is derived automatically from the metric path, so it does not become incorrect when the metric path changes. It can also derive short names for a series containing multiple metrics, without having to name each metric separately.
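The effect of aliasByNode() can be approximated locally to see which label a given set of node indexes produces. This is a simplified model for illustration: it handles plain dotted paths only, whereas the real function also deals with paths wrapped in other function calls. The metric path is hypothetical.

```python
def alias_by_node(path: str, *nodes: int) -> str:
    """Pick path components by zero-based index and join them with dots."""
    parts = path.split(".")
    return ".".join(parts[n] for n in nodes)


# Hypothetical metric path, for illustration only.
label = alias_by_node("MediaWiki.edit.success.rate", 1, 2)
```

Here `label` is `edit.success`. Applied to a wildcard query, each matching series gets its own label from the same indexes, so no per-series naming is needed.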
Architecture
As of October 2018 the main components of the Graphite stack are:
- carbon frontend relay
- carbon local relay
- global statsd aggregator
- HTTP API via graphite-web
statsd ingestion
Statsd UDP traffic is ingested through the statsd.eqiad.wmnet DNS name on port 8125. This is all statsd traffic that needs global aggregation, and statsd is by far the most widespread protocol for pushing metrics at WMF. Statsd metrics sent to the global aggregator produce values aggregated across all hosts that sent said metrics. For example, MediaWiki statsd metrics are aggregated globally, i.e. the metric name doesn't contain the name of the host sending the metric.
For statsd metrics where global aggregation is not desired or needed, the recommended approach is to run a local statsd aggregator (in our case statsite) and point the application to localhost:8125 instead. Swift metrics are one example; any application that generates metrics including the hostname is a candidate for a local statsd aggregator. The results are the same as if the metrics were aggregated globally, while putting less stress on the global aggregator.
Note that if an application sends a mixture of hostname-specific metrics and global metrics then it should use statsd.eqiad.wmnet.
After aggregation is done, either local or global, the resulting metrics are flushed to Graphite using the Carbon protocol over TCP.
carbon ingestion
The Graphite protocol is called Carbon. It is TCP-based and line-oriented: each metric has a name, a value and a timestamp. Unlike statsd, it has no metric data types. The entry point for carbon traffic is the graphite-in.eqiad.wmnet DNS name on port 2003. This traffic hits the carbon frontend relay component, implemented by the carbon-c-relay software. Once the frontend has accepted the metrics, they are mirrored to all datacenters where Graphite is present (eqiad and codfw as of October 2018) and received by the carbon local relay (also implemented by carbon-c-relay) for storage on disk.
HTTP read-only API
Once stored on disk, the metrics are served by graphite-web as a uwsgi application via HTTP. The web application is what powers the Grafana backend for Graphite and graphite.wikimedia.org. It queries for metrics from all Graphite hosts local to its datacenter and serves the resulting data.
FAQ
How do I render a counter metric as running total?
Start with the sum property, which stores the total per interval (e.g. minute or hour). Then apply integral() to produce a running total. (Original discussion at T108480)
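Conceptually, integral() is a cumulative sum over the points of the series. The sketch below models that on a plain list of per-interval sums. Passing gaps through as None while carrying the total forward is a choice made for this example, and the input values are made up.

```python
from typing import List, Optional


def running_total(points: List[Optional[float]]) -> List[Optional[float]]:
    """Cumulative sum; missing points stay None and add nothing."""
    total = 0
    result: List[Optional[float]] = []
    for value in points:
        if value is None:
            result.append(None)
        else:
            total += value
            result.append(total)
    return result


# Hypothetical per-interval sums with one gap.
totals = running_total([7, None, 3, 5])
```

Here `totals` is `[7, None, 10, 15]`. Because the sum property is itself aggregated with sum(), the final total stays the same whether the input points are per-minute or per-hour.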
Why is the data so different when zooming out or moving a week back?
This is most likely a side effect of a graph using .count or .sum where .rate should be used instead. See Extended properties for how to resolve this.
Operations manual
Failover
When needing to failover traffic from one graphite host to another it is important to think about such traffic in terms of read and write traffic.
Read traffic: all reads happen via HTTP and graphite is fronted by edge cache, thus a change like 731435 is sufficient to switch read traffic. Once that's done make sure to flip monitoring checks too with a change like 731434.
Write traffic: this traffic is trickier, as it involves changing multiple entry points. Additional complication is brought by the fact that some clients will ignore DNS records changes and thus will need to be restarted (more on that below).
- Non-mediawiki clients will go through DNS, thus a change like 904774 is needed to flip records
- For Mediawiki a configuration deployment is required, see 731918
One successful switchover (Dec 2021) has been performed as part of Bullseye migration in T247963 and the full details can be found in the task.
In general it is a good idea to audit the whole codebase for the host ip/address you are failing over from, e.g. via https://codesearch.wmcloud.org/search/ or github. This is to ensure no more "entry points" have crept up since the last failover.
Finally, there will be some minor straggler producers that don't ever refresh DNS. To audit those make sure to run tcpdump 'port 8125 or port 2003' on the graphite host you are failing over from. Note the source host and port. On said host a lsof -i :PORT will reveal what process is using the port, a restart of the daemon is what's needed in most cases.
Monitor graphite queries
The graphite web application graphite-web does a fair amount of logging, specifically inside /var/log/graphite-web/metricaccess.log and /var/log/graphite-web/rendering.log. All requested queries are logged together with how much time was spent serving them.
Deleting metrics
To get metrics deleted, please file a Phabricator task under #Graphite. Please indicate the use case, i.e. whether this is a one-off or whether periodic deletion of stale metrics is desired instead.
For people with access to graphite: to delete metrics, first use find (as user _graphite) to locate the correct files/directories under /var/lib/carbon/whisper on graphite1004.eqiad.wmnet and graphite2003.codfw.wmnet. Then add -delete when the list of metrics matches what you expect. If you are unsure which hosts are active graphite hosts, check puppet's manifests/site.pp for hosts with the graphite::production role. It is recommended to run find/delete on the standby host first (usually the host in codfw) and then on the active host.
For example,
sudo -u _graphite find /srv/carbon/whisper/Mediawiki/CodeMirror -name "*.wsp" -delete
rsync metrics
Graphite machines run an rsync server to make metrics accessible to other graphite machines in case manual sync is needed (e.g. after machine failure). To rsync in parallel, first sync only top-level directories and then the contents in parallel:
apt install time
install -d -o _graphite -g _graphite /srv/carbon/whisper
su -s /bin/bash _graphite
src_host=fqdn_to_sync_from
cd /srv/carbon/whisper
rsync -vd ${src_host}::carbon/whisper/ .
/usr/bin/time parallel -j5 -i rsync -a ${src_host}::carbon/whisper/{}/ {}/ -- *
Merge and sync metrics
The rsync method above provides a "point in time" sync. Another possibility, albeit slower, is to sync and merge metrics from one machine onto another. Thanks to carbonate this can be done in the background: metrics are rsync'd from a host and then merged onto existing ones. With this method the metric files are locked during the merge, so it can be used to fill "holes" even when the destination machine is actively being written to. The carbonate package is needed (available internally) together with time and GNU parallel.
apt install time parallel carbonate
su -s /bin/bash _graphite
src_host=fqdn_to_sync_from
export CARBONATE_CONFIG=/etc/carbon/carbonate.conf
# gather a list of metrics to transfer (only files, in "carbon path" format)
rsync --list-only --recursive ${src_host}::carbon/whisper | grep -v ^d | cut -b47- | carbon-path -r -d whisper | grep -F . > ${src_host}_metrics
# sync in parallel, job output will be in /tmp/*.par
cat ${src_host}_metrics | time parallel --files --jobs $(nproc --ignore 6) --pipe --max-replace-args 1000 -- carbon-sync --source-node ${src_host} --storage-dir /srv/carbon/whisper --source-storage-dir :carbon/whisper --lock
List slow queries
Graphite's web application logs how long a given query took. You can list all queries taking over a certain amount of time (in seconds) with:
awk -Ftook '$2 > 10 { print }' /var/log/graphite-web/rendering.log
Note that a slow query isn't necessarily the cause of a slowdown; it might be a consequence.
List metrics being created
tail -F /var/log/carbon/carbon-cache@*/creates.log | grep 'matched schema'
Operations troubleshooting
carbon-cache too many creates
This alert is used to signal whenever too many files (and therefore disk space) are being created on disk. It can be benign, e.g. when a new cassandra instance gets bootstrapped there is a flood of new metrics being created. To check which files are being created:
sudo tail -F /var/log/upstart/carbon_cache-*.log /var/log/carbon/carbon-cache@*/creates.log | grep 'creating database file'
It is also useful to tally which metrics have been created according to "top level" (i.e. the leftmost component)
sudo grep 'aggregation schema' /var/log/carbon/carbon-cache@*/creates.log* | awk '{print $6}' | cut -d. -f1 | sort | uniq -c | sort -rn | less
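When the metric names have already been extracted from the log (for instance with the awk step above), the same tally can be done in a script. The following is a small sketch that assumes its input is an iterable of dotted metric names; it does not parse creates.log itself, and the example names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_top_level(metric_names: Iterable[str]) -> List[Tuple[str, int]]:
    """Count metrics per leftmost path component, most frequent first."""
    counts = Counter(name.split(".", 1)[0] for name in metric_names if name)
    return counts.most_common()


# Hypothetical metric names, for illustration only.
tally = tally_top_level(["MediaWiki.a.b", "MediaWiki.c", "cassandra.x.y"])
```

Here `tally` is `[("MediaWiki", 2), ("cassandra", 1)]`, which mirrors the `sort | uniq -c | sort -rn` part of the pipeline.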
Applying carbon storage_aggregation changes
The current settings are only used when creating new Whisper data sets. Existing ones will generally not be affected. To apply, say, newer xFilesFactor configuration to an existing property, use the following steps.
While there is no script to read the current would-be settings and apply them, there is a script to manually apply specific settings.
- Get new xFilesFactor settings for the relevant metric from puppet:/role/graphite/base.pp. For example: "0.01" or "0".
- Get new retention settings from puppet:/role/graphite/base.pp. For example: "1m:7d,5m:14d,15m:30d,1h:1y,1d:5y".
- Check current settings:
  $ whisper-info mw/js/deprecate/tipsy_live/sum.wsp
  xFilesFactor: 0.0
- Switch to the _graphite user (or whichever user is the owner of the wsp file):
  $ sudo -su _graphite
- Run whisper-resize and set xFilesFactor, then the path, and then the retention values as distinct space-separated command-line arguments:
  $ whisper-resize --xFilesFactor=0 mw/js/deprecate/tipsy_live/sum.wsp 1m:7d 5m:14d 15m:30d 1h:1y 1d:5y
- Remember to run it on both Eqiad and Codfw primary Graphite hosts (as of July 2018: graphite1004.eqiad.wmnet and graphite2003.codfw.wmnet).
Identifying heavy and/or expensive queries
Graphite might suffer from heavy CPU or memory load if queries requesting a lot of data are run. It is possible to identify those after the fact by asking uwsgi for big (as in bytes) or long (as in milliseconds) queries by awk-ing the request logs (see also bug T116767), e.g.:
# filter for >5MB responses
tail -F /var/log/uwsgi-graphite-web.log | awk '$26 > 5000000 {print }' | less
# filter for > 9s responses
tail -F /var/log/uwsgi-graphite-web.log | awk '$29 > 9000 {print }'
uwsgi can also dump a bunch of stats and info about pending requests (note that this outputs to stderr):
sudo uwsgi --connect-and-read /run/uwsgi/graphite-web-stats.sock |& less
Further reading
- Graphite/Scaling
- Explaining auto maxDataPoints and consolidateBy, by Addshore, 10 September 2018.
- 25 Graphite, Grafana and statsd gotchas, by Dieter Plaetinck, 15 March 2016.
- 10 Things I Learned Deploying Graphite, by Kevin McCarthy, 18 July 2013.
- Accurate Counting with Graphite and Statsd, by Geordie Henderson, 6 March 2013.
- Understanding StatsD and Graphite, by Pål-Kristian Hamre, 24 July 2012.
External links
- https://graphite.wikimedia.org
- Grafana dashboard about graphite.wikimedia.org
- Graphite API documentation (upstream)
- Graphite on GitHub