Grafana/Best practices

This page describes best practices for Grafana dashboards at Wikimedia. This page was previously maintained as part of Performance Team guides and SRE Observability guidelines.

High-level approach

The USE Method

This method focuses on Utilization, Saturation, and Errors (USE). It is most effective for quickly diagnosing system performance issues. To quote Brendan Gregg's guide to USE:

For every resource, check utilization, saturation, and errors.

The Host overview dashboard is an example of this method applied to server-level metrics about a single host. Resources (CPU, network, etc.) are placed in rows. The left column shows the resource's utilization, while the right column shows saturation or errors, as applicable.

Recommendations:

  • Y-axis should be zero-based.
  • For most graphs, use a line without fill (Fill opacity: 0), unless the graph is stacked.
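As a rough sketch, the row and column layout described above could be generated like this. The resource names and panel titles are hypothetical, and the field names (gridPos, fill, yaxes) are assumed to follow the JSON model of Grafana's legacy graph panel. Compare against a dashboard exported from your own Grafana before relying on them.

```python
import json

# Hypothetical resources, used only for illustration.
RESOURCES = ["CPU", "Network", "Disk"]

PANEL_HEIGHT = 8   # grid units per row (assumed value)
HALF_WIDTH = 12    # Grafana's dashboard grid is 24 units wide


def use_panels(resources):
    """One row per resource: utilization on the left, saturation/errors on the right."""
    panels = []
    for row, resource in enumerate(resources):
        for col, aspect in enumerate(["utilization", "saturation / errors"]):
            panels.append({
                "type": "graph",
                "title": f"{resource} {aspect}",
                "gridPos": {"x": col * HALF_WIDTH, "y": row * PANEL_HEIGHT,
                            "w": HALF_WIDTH, "h": PANEL_HEIGHT},
                "lines": True,
                "fill": 0,  # line without fill
                "yaxes": [
                    {"format": "short", "min": 0, "show": True},  # zero-based Y-axis
                    {"format": "short", "show": False},
                ],
            })
    return panels


panels = use_panels(RESOURCES)
print(json.dumps(panels[0], indent=2))
```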

Four golden signals (4GS)

This method is described in detail in Google's SRE book and focuses on the system's user-impacting metrics. Specifically, it can be used as a basis for alerting and for diagnosing ongoing problems.

This method can be seen applied to Swift, Sessionstore, and other dashboards in the "Service" Grafana folder.

Dashboard layout

Legend

Start your dashboard with a legend. For good examples, refer to the ResourceLoader, Backend Pageview Time, and MediaWiki Static dashboards.

  • Create a "Text" panel, and leave it at the very top of the dashboard without a row. Set the panel title to "Legend".
  • Describe the subject of the dashboard in one sentence (e.g. What does the service do for end-users? What interaction does it instrument?)
  • Summarise in a sentence or two the flow of the data from the instrumentation source to the Grafana screen, mentioning any meaningful transformations it goes through along the way (e.g. Statsd counter incremented during cache misses in the backend, aggregated via mtail, pool size is measured every few minutes).
  • Link to high-level docs on Wikitech about the service, and/or link to the Phabricator tracking task about launching the instrumentation/campaign.
  • Consider naming or linking the source repo or source file of the instrumentation, especially if the metrics are not built into the program being measured (e.g. a dedicated background process that measures something).

Dashboard settings

General settings

  • Editable: Yes.
  • Preferred timezone: UTC.
  • Preferred range: Last N days for most dashboards. Last N hours for alert dashboards.
  • Auto refresh: Provide options for 5min and 15min. If on by default, use 5min as the default interval. Avoid smaller intervals, because they put unnecessary load on the metrics database. If you need to be notified, consider using an alert instead.
  • Graph tooltip: Enable the shared crosshair.
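For dashboards that are managed as JSON, the settings above might look roughly like the following. This is a sketch under assumptions: the field names are taken to match Grafana's dashboard JSON model, and the title and the 7-day range are illustrative values, not recommendations from this page.

```python
import json

dashboard_settings = {
    "title": "Example dashboard",              # hypothetical title
    "editable": True,
    "timezone": "utc",
    "time": {"from": "now-7d", "to": "now"},   # "Last N days"; N=7 is an assumed value
    "refresh": "5m",                           # auto refresh on by default, at 5min
    "timepicker": {"refresh_intervals": ["5m", "15m"]},
    "graphTooltip": 1,                         # 1 = shared crosshair
}

print(json.dumps(dashboard_settings, indent=2))
```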

Annotations

Manual annotations

You can create annotations within Grafana for any moment or range of time. These can then be associated with one or more tags. On each dashboard you can decide which tags to query for shared annotations. For example, most Performance-team dashboards query "mediawiki", "performance", and "operations". This means that an annotation created by anyone, from any dashboard, with one of these tags is shown in the panels on that dashboard.

  • Edit the default "Annotations & Alerts" annotation.
  • Leave the default settings (Enabled: Yes, Hidden: Yes, Color: Blue / Cyan).
  • Filter by: Tags.
  • Match: "any".
  • Tags: (insert one or more globally shared tags).
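In dashboard JSON, the tag-based annotation configured above might look roughly as follows. The field names (builtIn, matchAny, iconColor) are an assumption based on dashboards exported from older Grafana versions, and the tags are the Performance-team examples mentioned earlier.

```python
def shared_annotation(tags):
    """Sketch of the built-in annotation entry, filtered by any of the given tags."""
    if not tags:
        raise ValueError("at least one shared tag is required")
    return {
        "builtIn": 1,
        "name": "Annotations & Alerts",
        "datasource": "-- Grafana --",
        "enable": True,                        # Enabled: Yes
        "hide": True,                          # Hidden: Yes
        "iconColor": "rgba(0, 211, 255, 1)",   # default blue/cyan
        "type": "tags",                        # Filter by: Tags
        "matchAny": True,                      # Match: "any"
        "tags": list(tags),
    }


annotation = shared_annotation(["mediawiki", "performance", "operations"])
```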

MediaWiki deployments

If the service or instrumentation may be affected by MediaWiki deployments, enable one or both of the following annotations:

All MediaWiki deployments:

  • Name: MW deploy. Data source: graphite.
  • Enabled: No. Hidden: No. Color: Orange.
  • Query: exclude(aliasByNode(deploy.*.count,-2),"all")

Only full branch promotions that are part of the Train:

  • Name: Train deploy. Data source: graphite.
  • Enabled: Yes (this is the default state for the dashboard). Hidden: No (this means the control is shown and you can enable it ad-hoc when you need it).
  • Color: Orange.
  • Query: exclude(aliasByNode(deploy.sync-wikiversions.count,-2),"all")
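The two deployment annotations can be sketched as entries for the dashboard's annotation list. The helper name is hypothetical, and storing the Graphite query under a "target" key and the color as the string "orange" are assumptions about the JSON format. The queries and the enabled/hidden states are copied from the settings above.

```python
def deploy_annotation(name, query, enabled):
    """Sketch of a Graphite-backed annotation entry (field names are assumptions)."""
    return {
        "name": name,
        "datasource": "graphite",
        "enable": enabled,
        "hide": False,          # keep the toggle visible on the dashboard
        "iconColor": "orange",  # assumed color value
        "target": query,
    }


mw_deploy = deploy_annotation(
    "MW deploy", 'exclude(aliasByNode(deploy.*.count,-2),"all")', enabled=False)
train_deploy = deploy_annotation(
    "Train deploy", 'exclude(aliasByNode(deploy.sync-wikiversions.count,-2),"all")', enabled=True)

annotations = {"list": [mw_deploy, train_deploy]}
```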

Graph panels

Keep your graph focused

When creating a graph, keep in mind what question you want the graph to answer. If possible, focus on a single metric only.

Ideally, plot no more than 4 lines in a single graph. More than 3 metrics may indicate that you are trying to answer too many questions at once. The graph may then fail to answer any of them accurately, for example because the axes have to span a wide range of values, or because it is difficult to tell which of the many colors and lines belongs to which label.

One case where you do want many metrics in one graph is when you want to understand the relationship between quantities and their distribution. See "Graph with many metrics" below.

Draw mode

When plotting metrics that represent a quantity per interval, use a bar chart (e.g. rate counter, CPU usage percentage, bytes gauge for memory or disk).

For timing metrics, use a line chart.

Graph recommended settings

Metrics:

  • Remember to use .rate when querying Statsd counters from Graphite. Never use count or sum. (Why: Graphite#Extended properties.)
  • Preferred scale for counters is per second, and otherwise per minute.
  • For timing metrics, prefer plotting the max (Statsd: upper). Otherwise, consider p99 or p75. Avoid lower percentiles, medians, or mean averages. (Why: Measuring load times.)
  • Prefer minimal or no aggregations in queries. If aggregation is applied, be sure to clearly indicate this in the legend. You can use the alias function to describe how the value is produced. For example, frontend.navtiming2.responseStart.mobile.p75 | movingAverage (24h) | alias("responseStart.mobile.p75 | movingAverage (24h)"). Notice how the movingAverage is specified both as actual query function and as text for the alias function.
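To keep the query function and the alias text in sync, the target string can be generated instead of typed twice. The helper below is a hypothetical sketch, not an existing tool. It emits Graphite's function-call syntax, which is equivalent to the piped notation used in the example above.

```python
def described_target(metric, function, window):
    """Wrap a metric in one aggregation and alias it with a matching description."""
    short_name = ".".join(metric.split(".")[-3:])  # drop the common prefix for the legend
    query = f'{function}({metric}, "{window}")'
    label = f"{short_name} | {function} ({window})"
    return f'alias({query}, "{label}")'


target = described_target(
    "frontend.navtiming2.responseStart.mobile.p75", "movingAverage", "24h")
print(target)
# alias(movingAverage(frontend.navtiming2.responseStart.mobile.p75, "24h"), "responseStart.mobile.p75 | movingAverage (24h)")
```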

Axes:

  • Always include a Left Y-axis on graph panels.
  • Unit: Set this correctly for timing metrics and percentages. For counters, we typically use the "short" notation.
  • Label: Use this to document the scale of counting metrics (e.g. "rate per minute"). The label is usually left blank for timing metrics.
  • Min/Max: Usually left to auto. For percentage graphs that can't exceed 100%, do set a max of 100% to avoid the automatic margin expansion to 120%.

Display:

  • Draw Mode: Bars or Lines.
  • Line width: 1. Line fill: 1.
  • Tooltip: All series. If the graph contains more than a dozen metrics, use Single instead.
  • Null value: null. (Setting this to Continuous or Zero almost always causes issues, eventually.)
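The axis and display settings above could be captured in a small helper that builds a panel definition. This is a sketch only: the function name and the ratio metric are hypothetical, and the keys (linewidth, fill, nullPointMode, tooltip.shared, yaxes) are assumed to match the legacy graph panel JSON.

```python
def graph_panel(title, target, unit="ms", y_label=None, y_max=None, bars=False):
    """Build a graph panel dict with the recommended axes and display settings."""
    return {
        "type": "graph",
        "title": title,
        "targets": [{"refId": "A", "target": target}],
        "bars": bars,
        "lines": not bars,             # Draw mode: Bars or Lines
        "linewidth": 1,
        "fill": 1,
        "nullPointMode": "null",       # not "connected" or "null as zero"
        "tooltip": {"shared": True},   # Tooltip: All series
        "yaxes": [
            # Left Y-axis is always shown; min/max stay on auto unless given.
            {"show": True, "format": unit, "label": y_label, "min": None, "max": y_max},
            {"show": False, "format": "short"},
        ],
    }


timing = graph_panel("responseStart (mobile, p75)",
                     "frontend.navtiming2.responseStart.mobile.p75")
# Hypothetical percentage metric: cap the axis at 100 to avoid auto-expansion.
ratio = graph_panel("Example ratio", "example.service.ratio", unit="percent", y_max=100)
```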

Graph with many metrics

When plotting more than a dozen metrics with the intent to understand their distribution, create a stacked bar chart (not a line graph), as follows:

  • Display: Set Drawing mode to Bars, and enable Stacking mode. Ensure the hover value is stacked "individually".
  • Legend: Hide the legend (it's too crowded). Alternatively, show it as a scrollable table to the right.
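As a sketch, the display overrides for such a panel might be expressed as below. The function name is hypothetical and the keys (stack, tooltip.value_type, legend.alignAsTable, legend.rightSide) are assumed to match the legacy graph panel JSON.

```python
def many_metrics_display(legend_as_table=False):
    """Display overrides for a stacked bar chart with more than a dozen metrics."""
    if legend_as_table:
        legend = {"show": True, "alignAsTable": True, "rightSide": True}
    else:
        legend = {"show": False}  # too crowded to be useful
    return {
        "bars": True,
        "lines": False,
        "stack": True,
        # Single-series tooltip, showing each series' own value rather than the running total.
        "tooltip": {"shared": False, "value_type": "individual"},
        "legend": legend,
    }


panel = {"type": "graph", "title": "Example distribution"}  # hypothetical panel
panel.update(many_metrics_display())
```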

Alert rules

  • Evaluate every: 15 min.
  • Query condition: Range for last 15min or 1h, until now-5min.
  • If no data or all nulls: Alerting. (This helps detect when the underlying service may be down or broken. We used to ignore this due to a bug in Graphite, but as of January 2019 we're trying it again.)
  • If error or timeout: Keep Last State. (Graphite often times out; when using Prometheus consider Alerting on errors.)
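For reference, these rules might look roughly like this in the "alert" section of a panel's JSON. This is a sketch assuming Grafana's legacy (panel-based) alerting format. The alert name, reducer, and threshold are hypothetical placeholders.

```python
alert = {
    "name": "Example alert",          # hypothetical name
    "frequency": "15m",               # Evaluate every: 15 min
    "conditions": [{
        "type": "query",
        "query": {"params": ["A", "15m", "now-5m"]},    # query A, last 15min until now-5min
        "reducer": {"type": "avg", "params": []},       # assumed reducer
        "evaluator": {"type": "gt", "params": [1000]},  # hypothetical threshold
        "operator": {"type": "and"},
    }],
    "noDataState": "alerting",            # If no data or all nulls: Alerting
    "executionErrorState": "keep_state",  # If error or timeout: Keep Last State
}
```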

See also