grafana.wikimedia.org

From Wikitech

grafana.wikimedia.org is a frontend for creating queries and storing dashboards using data from Graphite and other datasources.

Service

Currently hosted on and served from krypton.

Viewing

Most viewing features can be discovered naturally, but here are a few you may not immediately realise exist:

  • Dynamic time range: you can zoom in and focus on any portion of a plot by selecting and dragging within the graph.
  • Metrics: you can click on a metric name in the graph legend to isolate a single metric, or ctrl/cmd click to exclude a metric.
  • Annotations: these can be toggled (for deployments, for example) by clicking the lightning icon on the top left.

Save dashboards in puppet

For critical dashboards it is important to have revision control. To this end, dashboards can be saved in puppet, which makes them effectively read-only in Grafana.

Note: the dashboard URL suffix will change from /db/<dashboard> to /file/<dashboard>.json
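As a small illustration of the suffix change described above, the sketch below maps one form to the other. The function name and the example dashboard name are hypothetical, and the mapping only reflects the /db/ to /file/ rule stated here, not any Grafana API.

```python
def puppet_dashboard_suffix(suffix: str) -> str:
    """Map a /db/<dashboard> suffix to its /file/<dashboard>.json form."""
    prefix = "/db/"
    if not suffix.startswith(prefix):
        raise ValueError(f"not a /db/ dashboard suffix: {suffix!r}")
    # Keep the dashboard name, swap the prefix and append the .json extension.
    return f"/file/{suffix[len(prefix):]}.json"


print(puppet_dashboard_suffix("/db/my-dashboard"))
```

Bookmarks and links that point at the old /db/ suffix would need to be updated by hand after the dashboard moves to puppet.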

Import a new dashboard

  1. Clone operations/puppet repository, see https://phabricator.wikimedia.org/diffusion/OPUP/repository/production/
  2. cd puppet/modules/grafana/files/dashboards
  3. Import a dashboard with ../grafana-dashboard DASHBOARD_URL (requires python and requests). The filename will be the same as the dashboard's name.
    1. The dashboard will get tagged with source:puppet.git and readonly if it doesn't carry the tags already.
  4. Add a grafana::dashboard resource to role::grafana (e.g. https://gerrit.wikimedia.org/r/#/c/268085) and commit the change in git
  5. Send the change for code review and schedule it for the next puppet SWAT
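The tagging behaviour mentioned in step 3 can be pictured with the following sketch. It is not the actual grafana-dashboard script; the function name is hypothetical, and it assumes only that the dashboard JSON carries a top-level "tags" list.

```python
import json

# Tags named in the import steps above.
REQUIRED_TAGS = ("source:puppet.git", "readonly")


def ensure_tags(dashboard: dict) -> dict:
    """Return a copy of the dashboard with the required tags appended if missing."""
    tagged = dict(dashboard)
    tags = list(tagged.get("tags") or [])
    for tag in REQUIRED_TAGS:
        if tag not in tags:
            tags.append(tag)
    tagged["tags"] = tags
    return tagged


# Hypothetical dashboard that already carries one of the two tags.
example = {"title": "My dashboard", "tags": ["readonly"]}
print(json.dumps(ensure_tags(example), sort_keys=True))
```

Existing tags are left in place and in order, so importing the same dashboard twice does not duplicate them.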

Update an existing dashboard

  1. Save the readonly dashboard under another name (e.g. NAME-DASHBOARD) and make the desired changes
  2. Import it into puppet, in the same dashboards directory as above, with ../grafana-dashboard NEW_DASHBOARD_URL
  3. The new dashboard will be saved under a different filename, so rename it to the desired name and commit
  4. Send the change for code review and schedule it for the next puppet SWAT
  5. (Optional) Delete the modified dashboard in grafana

Features

Shared crosshair

To enable the shared crosshair (which draws the current target cursor in all graphs on the page), go to "Configure dashboard" (top right menu). Then tick the "Shared Crosshair" setting in the Features section.

Shared tooltip

By default each data point of each metric has its own tooltip, which is only shown when hovering over the exact point. Consider enabling the "All series" tooltip. This ensures the tooltip is always shown while the cursor is inside the graph. All points on the vertical axis are shown in a single tooltip at the same time. Horizontally, the closest data point is shown in the tooltip.

  1. Click on the graph title and select Edit.
  2. In the Display Styles section, enable tooltip "All series".
  3. From the top navigation, go back to the dashboard.

Show deployments

Add MediaWiki deployment events as annotations to your dashboard:

  1. Enable the "Annotations" feature from the "Configure dashboard" panel (top right menu).
  2. From the new settings menu on the top left, choose Annotations.
  3. Add a new annotation.
    • Name: Show deployments
    • Data source: graphite
    • Graphite target expression: exclude(aliasByNode(deploy.*.count,-2),"all")
  4. Click Update, and close the settings menu.
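To make the target expression above easier to read, here is a rough Python model of what aliasByNode and exclude do to the series names. The metric names are hypothetical examples, and the two helpers only imitate the naming and filtering behaviour of the Graphite functions, not their implementation.

```python
import re


def alias_by_node(metric: str, node: int) -> str:
    """Name a series after one dot-separated node of its metric path."""
    return metric.split(".")[node]


def exclude(names, pattern):
    """Drop every series whose name matches the regular expression."""
    return [name for name in names if not re.search(pattern, name)]


# Hypothetical metrics matched by deploy.*.count
metrics = ["deploy.all.count", "deploy.scap.count", "deploy.sync-file.count"]
aliases = [alias_by_node(m, -2) for m in metrics]
print(exclude(aliases, "all"))
```

In this model the aggregate "all" series is filtered out, so each remaining series corresponds to one kind of deployment event.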

Input variables

See Grafana Templated dashboards.

Time correction

From "Configure dashboard" (top right menu) one can change the default ("browser time") to use "UTC" instead.

Negative axis

If you're plotting metrics with the intention to show some of them as negative, apply "Transform: negative-Y" from "Display > Series overrides".[1] This will visually flip the values in the graph (as negative), whilst preserving the positive values for legends and crosshairs. This is preferred over modifications like scale(-1), which affects other displays of the metric as well and can cause confusion.

Example: Server board - Network traffic (plots upload and download bandwidth)

Alerts (with notifications via Icinga)

Alerts can be set up through Grafana on each panel. For example, this panel has an alert set when the Varnish cache hit ratio for ResourceLoader requests drops below a certain percentage.

For most alerts that query data from Graphite, it makes sense to use "Keep Last State" for error conditions and missing data. It is not unusual for Graphite to fail to respond to a request intermittently, and data for one of the minutes can be missing in certain race conditions.

In order to receive email notifications about Grafana alerts, you need to connect an Icinga contact group to a given dashboard by making some changes in Puppet configuration. All alerts from a given dashboard will be sent to the same "contact_group". The "Notifications" tab in the Grafana interface is not used (background at T153167).

Example of lines to add to a file in puppet.git:/modules/icinga/manifests/monitor/

class icinga::monitor::example {
    monitoring::grafana_alert { 'db/resourceloader-alerts':
        contact_group => 'team-performance',
    }
}

If not specified, contact_group defaults to "admin", which is IRC only. The full list is available in puppet.git:modules/nagios_common/files/contactgroups.cfg

Annotations based on Prometheus data

Grafana Annotations allow marking specific points on the graph with a vertical line and an associated description. The information about when a given event has occurred can be extracted with a Prometheus query.

To add a Prometheus-based Annotation:

  1. Choose "Annotations" from the settings button (gear icon)
  2. Click on "New"
  3. Choose a name for the new annotation and a Prometheus data source
  4. Insert a Prometheus query returning 1 when the annotation should be displayed. For a metric tracking uptime in seconds, for instance, you can add an annotation that shows when the service is started by using the resets() function: resets(service_uptime{site=~"$site"}[5m]) > bool 0
  5. Add the label that will be displayed when moving the cursor over the annotation (triangle at the bottom of the vertical line). To do that, fill in the "Field formats" section of the form and specify some constant text under "Title" and a comma-separated list of "Tags", which must be Prometheus labels returned by the query (e.g. instance, job).
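The query in step 4 can be read as "did the uptime value drop at least once in the last 5 minutes". The sketch below models that logic in Python on hypothetical sample values; the helper names are illustrative, and this is a simplified reading of resets() and the bool modifier rather than Prometheus code.

```python
def resets(samples):
    """Count how often a sample is lower than the one before it."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def greater_than_bool(value, threshold):
    """Mimic '> bool': return 1 or 0 instead of filtering the series."""
    return 1 if value > threshold else 0


# Hypothetical uptime samples (seconds); the service restarts once.
restarted = [280, 290, 300, 5, 15, 25]
steady = [10, 20, 30]

print(resets(restarted), greater_than_bool(resets(restarted), 0))
print(resets(steady), greater_than_bool(resets(steady), 0))
```

An uptime metric only decreases when the service restarts, which is why counting decreases is enough to mark the start of the service.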

Common pitfalls

Alerts fire but the threshold was not reached

Alert fired with value different from the graph's metric.

Reported upstream at https://github.com/grafana/grafana/issues/12134.

It is common for a newly configured alert to fire within days for a value that cannot later be found in the graph. The likely cause is that the alert query has a "to" time of "now". This is a problem especially with data queried from Graphite, where the data for the current minute may be null or otherwise incomplete. Resolve this by always giving the alert query a "to" time at least 2 minutes in the past. For example, from 1h, to -2m.
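A minimal sketch of the effect, using made-up per-minute values and a hypothetical "minimum below threshold" alert condition. It is not how Grafana evaluates alerts internally; it only shows why excluding the most recent minutes changes the outcome.

```python
# (minutes ago, value): hypothetical per-minute counts. The two most recent
# minutes are still being written and are therefore incomplete.
points = [(5, 1010), (4, 990), (3, 1005), (2, 995), (1, 640), (0, 120)]


def select(points, from_minutes_ago, to_minutes_ago):
    """Keep the values whose age lies inside the query window."""
    return [v for age, v in points if to_minutes_ago <= age <= from_minutes_ago]


def fires(values, threshold):
    """Hypothetical alert condition: minimum value below the threshold."""
    return min(values) < threshold


print(fires(select(points, 60, 0), 500))  # query window ends at "now"
print(fires(select(points, 60, 2), 500))  # query window ends 2 minutes ago
```

With the window ending at "now" the incomplete minute trips the alert; with the window ending 2 minutes ago the same data stays above the threshold.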

Alerts for values derived from multiple metrics fire unexpectedly

When writing an alert for a value that is derived from multiple metrics (e.g. "cache_miss.rate" and "cache_hit.rate"), be sure to have the alert query until now-5m instead of until now, because the last few data points may not be complete, especially if they come from different servers. When evaluating math in Graphite in a way that involves a single metric, null remains null. But when multiple metrics are involved, null is treated as zero. This can cause percentage values derived from two or more metrics to temporarily become a nonsensical value that can trigger your alert. Frustratingly, there is a good chance that by the time you look at the alert dashboard, the value will be complete, and no amount of zooming into the time frame where the alert occurred will reveal the bad value.

For more info, see also Graphite gotcha: Null in math (Grafana blog).
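The null-as-zero behaviour described above can be sketched as follows. The helper names and the sample values are hypothetical, and the code only models the behaviour as stated in this section rather than reproducing Graphite's functions.

```python
def sum_across_metrics(*values):
    """Sum one data point across several metrics, treating null as zero."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


def as_percent(part, total):
    """Percentage of part in total; null in, null out."""
    if part is None or total is None or total == 0:
        return None
    return 100.0 * part / total


def miss_ratio(hit, miss):
    """Cache miss percentage derived from two separate metrics."""
    return as_percent(miss, sum_across_metrics(hit, miss))


print(miss_ratio(900, 100))   # both metrics reported for this minute
print(miss_ratio(None, 100))  # the hit metric has not arrived yet
```

In the second call the missing hit count is silently treated as zero, so the miss ratio briefly reads as 100 percent even though nothing is wrong.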

Month duration

In Grafana, the unit for months is "M", for example "7M" means "last 7 months". In Graphite the unit for months is mon[2] which also accepts month, but these do not work in Grafana interfaces, which will either break or interpret the leading lowercase "m" as minute. (grafana/grafana#1864)
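If month durations are copied from Graphite into Grafana, a small conversion like the one below avoids the "m" ambiguity. The function name is hypothetical and the pattern only covers simple durations such as "7mon" or "7month".

```python
import re


def graphite_to_grafana_months(duration: str) -> str:
    """Rewrite a Graphite month duration (7mon, 7month) to Grafana's 7M."""
    # Only touch durations whose unit starts with "mon"; minutes ("30m"),
    # hours and so on are returned unchanged.
    return re.sub(r"^(\d+)\s*mon(?:ths?)?$", r"\1M", duration.strip())


print(graphite_to_grafana_months("7mon"))
print(graphite_to_grafana_months("30m"))
```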

Recover after making the dashboard not editable

Delete the dashboard via the API, then restore it without the toggle.
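A rough sketch of the two API calls, under several assumptions: the base URL and API key are placeholders, the dashboard JSON was exported beforehand, the "toggle" corresponds to the "editable" field in that JSON, and the Grafana version in use exposes the slug-based /api/dashboards/db endpoints (newer versions address dashboards by uid instead). The code only builds the requests and does not send them.

```python
import json
import urllib.request

GRAFANA = "https://grafana.example.org"  # placeholder base URL (assumption)
API_KEY = "REPLACE_ME"                   # placeholder API key (assumption)


def build_request(method, path, body=None):
    """Build (but do not send) an authenticated JSON request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(GRAFANA + path, data=data, method=method, headers=headers)


def recovery_requests(slug, dashboard):
    """Requests that delete the dashboard, then re-create it as editable."""
    # Clear the id so the dashboard is created anew, and flip the toggle.
    restored = dict(dashboard, id=None, editable=True)
    return [
        build_request("DELETE", f"/api/dashboards/db/{slug}"),
        build_request("POST", "/api/dashboards/db", {"dashboard": restored, "overwrite": False}),
    ]


for req in recovery_requests("my-dashboard", {"title": "My dashboard", "editable": False}):
    print(req.get_method(), req.full_url)
```

Each request could then be sent with urllib.request.urlopen once the placeholders are replaced; keep a copy of the exported JSON until the restored dashboard has been checked.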

Known issues

Alerts with asPercent() not working

movingAverage template

When using a template inside movingAverage, the default mode wrongly expands the variable (it adds quotes around it, instead of leaving it as a number). These have to be removed manually by editing the metric directly (click on the pencil). Whenever the metric is changed, it has to be fixed again.

Color inspector broken

Each time you change a color through a color picker, you must click on empty space anywhere outside the color picker. Otherwise the value will not be saved. (The little square will reflect your chosen color, but once applied, it will be lost.) If you try to click "Invert", "Save", "Update" or one of the other color squares directly, the change will be lost.

Beta cluster

To use Grafana in the beta cluster, use https://grafana-labs.wikimedia.org for viewing / https://grafana-labs-admin.wikimedia.org/ for editing. If you copy a dashboard from production, you need to change the data source to Labs Graphite and replace the top-level MediaWiki with BetaMediaWiki. (Or use Prod Graphite to test a dashboard with production data.)

External links

  • Negative y-transform, What's new in Grafana v2.1
  • Render API, Graphite documentation