grafana.wikimedia.org

From Wikitech

grafana.wikimedia.org is a frontend for creating queries and storing dashboards using data from Graphite and other datasources.

Service

Currently hosted on and served from krypton.

Viewing

Most viewing features can be discovered naturally, but here are a few you may not immediately realise exist:

  • Dynamic time range: zoom in and focus on any portion of a plot by selecting and dragging within the graph.
  • Metrics: click a metric name in the graph legend to isolate that metric, or ctrl/cmd-click to exclude it.
  • Annotations, such as for deployments, can be toggled by clicking the lightning icon on the top left.

Template

  1. Go to https://grafana-admin.wikimedia.org/dashboard/db/template-dashboard (or import from grafana.wikimedia.org/template-dashboard.json in case it was accidentally overwritten or deleted)
  2. Click on the gear icon on the top left and use "Save As..." to create a copy under a different name.
  3. Make your own graphs!

This template incorporates features we usually want but that would otherwise have to be configured manually, such as:

  • The "Show deployments" annotation.
  • Auto-refresh every minute.
  • Timepicker that includes "6 months" and "1 year" entries.
  • UTC timezone.
  • Shared crosshair, shared tooltip.

Save dashboards in puppet

For critical dashboards it is important to have revision control. To this end, dashboards can be saved in puppet, which makes them effectively read-only in Grafana.

NB: the dashboard URL suffix will change from /db/<dashboard> to /file/<dashboard>.json

Import a new dashboard

  1. Clone the operations/puppet repository, see https://phabricator.wikimedia.org/diffusion/OPUP/repository/production/
  2. cd puppet/modules/grafana/files/dashboards
  3. Import a dashboard with ../grafana-dashboard DASHBOARD_URL (requires python and requests). The filename will be the same as the dashboard's name.
    1. The dashboard will get tagged with source:puppet.git and readonly if it doesn't carry those tags already.
  4. Add a grafana::dashboard resource to role::grafana (e.g. https://gerrit.wikimedia.org/r/#/c/268085) and commit in git.
  5. Send the change for code review and schedule it for the next puppet SWAT.
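
The tagging behaviour described in step 3 can be illustrated with a short Python sketch. This is not the actual grafana-dashboard script: the function name ensure_tags is hypothetical, and the sketch assumes the dashboard JSON keeps its tags in a top-level "tags" list.

```python
import json

# Tags that imported dashboards are expected to carry.
REQUIRED_TAGS = ("source:puppet.git", "readonly")


def ensure_tags(dashboard, required=REQUIRED_TAGS):
    """Return a copy of the dashboard dict with the required tags present.

    Assumption: tags live in a top-level "tags" list of the dashboard JSON.
    Existing tags are kept, and missing ones are appended once.
    """
    tagged = dict(dashboard)
    tags = list(tagged.get("tags") or [])
    for tag in required:
        if tag not in tags:
            tags.append(tag)
    tagged["tags"] = tags
    return tagged


if __name__ == "__main__":
    # Made-up dashboard that already carries one of the two tags.
    example = {"title": "Example dashboard", "tags": ["readonly"]}
    print(json.dumps(ensure_tags(example), indent=2, sort_keys=True))
```

The function leaves its input untouched, so the original export can still be compared with the tagged copy before committing.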

Update an existing dashboard

  1. Save the read-only dashboard under another name (e.g. NAME-DASHBOARD) and make the desired changes.
  2. Import it into puppet at modules/grafana/files/dashboards with ../grafana-dashboard NEW_DASHBOARD_URL, as above.
  3. The imported file will be saved under the new name, so rename it to the desired name and commit.
  4. Send the change for code review and schedule it for the next puppet SWAT.
  5. (Optional) Delete the modified dashboard in Grafana.

Features

Shared crosshair

To enable the shared crosshair (which draws the current target cursor in all graphs on the page), go to "Configure dashboard" (top right menu). Then tick the "Shared Crosshair" setting in the Features section.

Shared tooltip

By default each data point of each metric has its own tooltip, shown only when hovering over the exact point. Consider enabling the "All series" tooltip. This ensures the tooltip is always shown when the cursor is inside the graph. All points on the vertical axis are shown in a single tooltip at the same time. Horizontally, the closest data point is shown in the tooltip.

  1. Click on the graph title and select Edit.
  2. In the Display Styles section, enable tooltip "All series".
  3. From the top navigation, go back to the dashboard.

Show deployments

Add MediaWiki deployment events as annotations to your dashboard:

  1. Enable the "Annotations" feature from the "Configure dashboard" panel (top right menu).
  2. From the new settings menu on the top left, choose Annotations.
  3. Add a new annotation.
    • Name: Show deployments
    • Data source: graphite
    • Graphite target expression: exclude(aliasByNode(deploy.*.count,-2),"all")
  4. Click Update, and close the settings menu.
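
To check what the annotation expression returns before adding it, the same target can be requested from Graphite's render API. The sketch below only builds the URL. The host graphite.example.org is a placeholder, and the from and format values are chosen for illustration.

```python
from urllib.parse import urlencode

# Target expression from the annotation settings above.
DEPLOY_TARGET = 'exclude(aliasByNode(deploy.*.count,-2),"all")'


def render_url(base_url, target, time_from="-24h", fmt="json"):
    """Build a Graphite render API URL for the given target expression."""
    query = urlencode({"target": target, "from": time_from, "format": fmt})
    return f"{base_url.rstrip('/')}/render?{query}"


if __name__ == "__main__":
    # graphite.example.org is a placeholder, not the real Graphite host.
    print(render_url("https://graphite.example.org", DEPLOY_TARGET))
```

Opening the printed URL in a browser, with the real Graphite host substituted, should show the raw series that the annotation is drawn from.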

Input variables

See Grafana Templated dashboards.

Time correction

From "Configure dashboard" (top right menu) one can change the default ("browser time") to use "UTC" instead.

Negative axis

If you're plotting metrics that contain negative values, you can use the "Transform: negative-Y" series override to flip the values in the graph, whilst preserving the positive values for min/max legends. You can use this instead of modifications like scale(-1). For example, plotting download traffic in a graph about upload and download bandwidth.[1] Example: Server board - Network traffic.
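
The difference between the two approaches can be shown with a toy model in Python. It is not Grafana's or Graphite's code: the function names and sample values are made up, and the model only mirrors the behaviour described above (flipped points, legend computed from the original values).

```python
def legend_stats(values):
    """Min/max as a graph legend would report them."""
    return {"min": min(values), "max": max(values)}


def with_scale_minus_one(values):
    """scale(-1): the series itself is negated, so the legend is too."""
    negated = [-v for v in values]
    return {"plotted": negated, "legend": legend_stats(negated)}


def with_negative_y(values):
    """negative-Y override: only the plotted points are flipped."""
    return {"plotted": [-v for v in values], "legend": legend_stats(values)}


if __name__ == "__main__":
    download = [120, 300, 80]  # made-up sample values
    print(with_scale_minus_one(download)["legend"])
    print(with_negative_y(download)["legend"])
```

Both variants draw the same points below the axis. Only the override keeps the legend's min and max positive.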

Alerts (with notifications via Icinga)

Alerts can be set up through the grafana admin interface on each panel. For example, this panel has an alert set for any point at which the 5xx errors per minute rise above 5. For many alerts, it makes sense to set "Keep last state" for error conditions and missing data. When an alert triggers or recovers, this is marked with red and green vertical lines in the corresponding graph.

In order to receive email notifications about grafana alerts, you need to connect an icinga contact group to a given dashboard by making some changes in puppet. Note that all alerts from a dashboard will be sent to the same "contact_group", and the "Notifications" tab in the grafana interface is not functional (see T153167 for background).

E.g. for ORES, the following lines are included in /modules/icinga/manifests/monitor/ores.pp:

class icinga::monitor::ores {
    monitoring::grafana_alert { 'db/ores':
        contact_group   => 'team-ores',
    }
}

Common pitfalls

Alerts for values derived from multiple metrics fire unexpectedly

When writing an alert for a value that is derived from multiple metrics (e.g. "cache_miss.rate" and "cache_hit.rate"), set the alert query to end at now-5m instead of now, because the last few data points may not be complete, especially if they come from different servers. When evaluating math in Graphite in a way that involves a single metric, null remains null. But when multiple metrics are involved, null is treated as zero. This can cause percentage values derived from two or more metrics to temporarily take a nonsensical value that triggers your alert. Frustratingly, there is a good chance that by the time you look at the alert dashboard the data will be complete, and no amount of zooming into the time frame where the alert occurred will reveal the bad value.

For more info, see also Graphite gotcha: Null in math (Grafana blog).
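
The effect can be reproduced with a small Python model. This is a simplified illustration rather than Graphite's implementation: the function name, the sample values and the alert threshold are made up, and None stands for a data point that has not been reported yet.

```python
def hit_percentage(hits, misses):
    """Per-point hit percentage across two series.

    A missing value (None) in one series is treated as zero when it is
    combined with the other series, as described above.
    """
    result = []
    for h, m in zip(hits, misses):
        if h is None and m is None:
            result.append(None)
            continue
        h0 = h or 0
        m0 = m or 0
        total = h0 + m0
        result.append(None if total == 0 else 100.0 * h0 / total)
    return result


if __name__ == "__main__":
    hits = [90, 92, 91, None]  # newest point not reported yet
    misses = [10, 8, 9, 4]     # newest point already reported
    series = hit_percentage(hits, misses)
    print(series)       # the newest value collapses to 0.0
    print(series[:-1])  # ignoring the newest point avoids the bad value
```

An alert on "hit percentage below 50" would fire on the last point even though nothing is wrong. Ending the query a few minutes in the past has the same effect as dropping that point.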

Month duration

Configuring a relative time duration as "month" or "mon" doesn't work as expected. While in Graphite the unit for months is mon[2], in Grafana it is "M", e.g. use "7M" for the last 7 months. (grafana/grafana#1864)
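
When durations are copied over from Graphite queries, a small helper can do the rewrite. This is a sketch only: the function name months_to_grafana is hypothetical, and the set of accepted spellings ("mon", "month", "months") is an assumption.

```python
import re


def months_to_grafana(duration):
    """Rewrite a month duration such as '7mon' as Grafana's '7M'."""
    match = re.fullmatch(r"(\d+)\s*(mon|months?)", duration.strip())
    if match is None:
        raise ValueError(f"not a month duration: {duration!r}")
    return f"{match.group(1)}M"


if __name__ == "__main__":
    print(months_to_grafana("7mon"))
```

Anything that is not a month duration raises ValueError, so a lowercase "7m" is never silently turned into "7M".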

Recover after making the dashboard not editable

Delete the dashboard via the API, then restore it without the toggle.
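
A minimal Python sketch of the "restore" half is shown below. It makes several assumptions: the toggle is taken to be the top-level "editable" field of the dashboard JSON, the payload shape ({"dashboard": ..., "overwrite": ...}) follows the Grafana HTTP API for saving dashboards, and the function name is hypothetical. The dashboard JSON has to be fetched before the dashboard is deleted. Check the API documentation for the installed Grafana version before relying on any of this.

```python
import copy


def make_editable_payload(dashboard):
    """Build a save payload from a dashboard's JSON with editing re-enabled.

    Assumptions: the toggle is the top-level "editable" field, and a null
    "id" makes Grafana create the dashboard anew instead of updating one.
    """
    restored = copy.deepcopy(dashboard)
    restored["editable"] = True
    restored["id"] = None
    return {"dashboard": restored, "overwrite": False}


if __name__ == "__main__":
    # Made-up dashboard JSON, as fetched before the dashboard was deleted.
    locked = {"id": 42, "title": "Example", "editable": False, "rows": []}
    print(make_editable_payload(locked))
```

The returned dict is what would be sent as the JSON body of the save request once the locked dashboard has been deleted.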

Known issues

Alerts with asPercent() not working

movingAverage template

When using a template inside movingAverage, the default mode wrongly expands the variable (it adds quotes around it, instead of leaving it as a number). These have to be removed manually by editing the metric directly (click on the pencil). Whenever the metric is changed, it has to be fixed again.

Color inspector broken

Each time you change a color through a color picker, you must afterwards click on empty space anywhere outside the color picker. Otherwise the value will not be saved. (The little square will reflect your chosen color, but the change is lost once applied.) If you click "Invert", "Save", "Update" or one of the other color squares directly, the change will be lost.

External links

  • Negative y-transform, What's new in Grafana v2.1
  • Render API, Graphite documentation