Grafana

grafana.wikimedia.org is a frontend for creating queries and storing dashboards using data from Prometheus and other datasources.

Service

Currently hosted on and served from grafana1002.

Editing dashboards

To edit dashboards, you need to be a member of the cn=nda or cn=wmf LDAP groups. https://grafana.wikimedia.org is read-only, to edit dashboards (or change administrative settings) you need to access the separate vhost https://grafana-rw.wikimedia.org; hitting the "login" link at the bottom of the left sidebar will also redirect you as needed. The Grafana web interface is integrated into our web SSO identity provider based on Apereo CAS.

Private dashboards

A folder for private dashboards (or the same name) is also available. Dashboards created in (or moved to) this folder will require logging into Grafana to be able to view. Please use this feature sparingly, and default to public dashboards unless absolutely needed (cfr e.g. bug T267930 for one such case)

Viewing

Most viewing features can be discovered naturally, but here's a few features you may not immediately realise exist:

Dynamic time range, you can zoom in and focus on any portion of a plot by selecting and dragging within the graph.
Metrics, you can click on metric names in the graph legend to isolate a single metric, or ctrl/cmd click to exclude a metric.
Annotations, such as for deployments, can be toggled by clicking the lightning icon on the top left.

Dashboards as code

For critical dashboards it is important to have revision control, to this end it is possible to save dashboards using Grizzly and have them effectively read-only in Grafana.

note that Grafana provisioned dashboards via puppet has been superseded by Grizzly, grafana::dashboard is deprecated targeted for removal

Grizzly

Grizzly is a utility for managing various observability resources with Jsonnet. We are currently piloting this to manage our grafana dashboards as code.

Use-cases

In the context of grafana and grizzly, there are multiple use cases. Each has a slightly different workflow

Static, hand-crafted dashboards
- These are traditional dashboards created and edited within the UI

Templated, programmatically generated dashboards
- Today we are using grizzly to render and deploy jsonnet for our SLO dashboard template, which provisions a dashboard for each SLO substituting panel data and queries within each, see slo_dashboards.jsonnet and slo_definitions.libsonnet in the https://gerrit.wikimedia.org/r/admin/repos/operations/grafana-grizzly repository.
- A potential future case is service dashboards for k8s

Workflows

Creating or updating a dashboard using Grizzly

Preparing the Patch

First, clone the operations/grafana-grizzly git repository https://gerrit.wikimedia.org/r/admin/repos/operations/grafana-grizzly

git clone "https://gerrit.wikimedia.org/r/operations/grafana-grizzly" && (cd "grafana-grizzly" && mkdir -p .git/hooks && curl -Lo `git rev-parse --git-dir`/hooks/commit-msg https://gerrit.wikimedia.org/r/tools/hooks/commit-msg; chmod +x `git rev-parse --git-dir`/hooks/commit-msg)

Upload a patch with your changes. See the Varnish SLO Dashboard change as an example.
Use 'grr preview' and 'grr diff' (steps below) to see what will be changed/added to affected dashboards.

Patch review, feel free to tag anyone in sre-observability for a review.

Updating a dashboard that already exists in Grafana using grizzly

Find the dashboard json object in the grizzly repository
- An easy way to do this is git grep for the dashboards UID. A dashboards UID can be found in the dashboard URL, and in the json object as well. For example, dashboard with link https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status has UID 'O_OXJyTVk'.
- ```
:~/git/wikimedia/grafana-grizzly$ git grep O_OXJyTVk
static/home_w_wiki_status.json:  "uid": "O_OXJyTVk",
```
- If there is no hit for the UID in the grizzly repository yet, the dashboard will first need to be onboarded. The process is largely the same as an update, however when onboarding a new json file will need to be created and imported by the main jsonnet file. Please see the onboarding section for details.
Create a temporary working copy of the dashboard you wish to change in the Drafts folder
- While logged-in and viewing the dashboard, navigate to dashboard settings via the gear at the top of the screen, and select "Save as" (top right) and enter the desired details for your temporary copy.
Make the desired changes in the UI and save the temporary working copy
- This step is fairly self-explanatory, in other words use the temporary working copy to make your changes and get the dashboard into the state that you'd like to deploy as the live version.
Export the JSON Model of your temporary working copy
- While logged in to Grafana open dashboard settings using the gear icon at the top of the screen, select JSON Model. Capture the full JSON object.
Prepare a patch to apply the JSON from the draft dashboard to the JSON in the Grizzly repository.
- Ensure that the UID, title, folderName and tags fields match the values in the original dashboard. This is to make sure our patch changes the live dashboard (as opposed to accidentally onboarding the draft dashboard copy into grizzly). Please see the "Preparing the Patch" section above for additional details.
Once you've uploaded your patch, you are ready to move on to the code review section

Onboarding a dashboard with Grizzly

The onboarding process is largely the same as the update process, with the additional step of initial creation of the json file representing the dashboard and importing it in the main jsonnet file, typically static_dashboards.jsonnet

Create a file to the effect of static/my_dashboard.json and populate it with the JSON model obtained in the above section

Add an entry to the main jsonnet file that is called by grizzly, for static dashboards this is static_dashboards.jsonnet

  # static_dashboards.jsonnet
  
  grafanaDashboards+: {
    ... existing dashboards here
    'my_dashboard.json': (import 'static/my_dashboard.json'),
  },

Grizzly supports importing dashboards from grafana, however a few steps may be necessary to adapt the json. Typically this involves removing the id and version fields, and ensuring that other fields like title, uid and folderName match the desired value.

There are a couple ways to go about this.

You can fetch the json through the grafana UI, or attempt to pull it using grizzly itself. you will need the uid of the dashboard. To fetch, use the 'get' subcommand i.e. `grr get Dashboard.UID`. To find the UID

Code Review

Finding reviewers

Feel free to tag any members of the observability team for patch reviews, in addition please try to tag at least one stakeholder of the dashboard(s) which would be affected by the proposed changed. For instance if the dashboard being adjusted relates to varnish, try to find an SRE on the traffic team to add as a reviewer.

Previewing the Change ('grr preview')

Grizzly has a preview feature which enables rendering the change as snapshots that can be previewed without making actual changes to the live dashboards. This will likely be automated in the future, for now it is a manual process from the grafana host.

To create a 'grr preview', first push or fetch your changes to a working copy in your home directory on grafana1002

Then run `grr preview <template>`, for example:

herron@grafana1002:~/git/grafana-grizzly$ grr preview slo_dashboards.jsonnet

herron@grafana1002:~/git/grafana-grizzly$ grr preview static_dashboards.jsonnet

Assuming the template is parsable, this will output links to view (and delete) snapshots for each dashboard affected by the change.

Follow the view link to see a preview of your changes, and repeat the process as needed.

Diffing the change ('grr diff')

Grizzly also has a diff feature which will print the differences in JSON between the live dashboard(s) and rendered output of the proposed patchset.

herron@grafana1002:~/git/grafana-grizzly$ grr diff slo_dashboards.jsonnet

herron@grafana1002:~/git/grafana-grizzly$ grr diff static_dashboards.jsonnet

Likewise, this depends on the json(net) being valid, grr diff/preview will provide errors if the patchset isn't parsable and needs improvement.

Deploying the change

After the patch has been reviewed and merged (currently requiring a manual V+2), the working repository on the grafana hosts (/srv/grafana-grizzly) will be updated on the next puppet run.

grafana1002:~$ sudo run-puppet-agent
grafana1002:~$ cd /srv/grafana-grizzly
grafana1002:/srv/grafana-grizzly$ grr diff slo_dashboards.jsonnet

#  Manually review the diff, make sure it looks good to you
#
#  Note: grizzly will output "Dashboard/dashboard_name not present in Dashboard" if the Dashboard does not yet exist in grafana, and will not show a diff. In this case you can use 'grr preview' to generate a snapshot of the dashboard in Grafana for review.

grafana1002:/srv/grafana-grizzly$ grr preview slo_dashboards.jsonnet

#  When ready to deploy:

grafana1002:/srv/grafana-grizzly$ grr apply slo_dashboards.jsonnet

Usage Examples

grafana1002:/srv/grafana-grizzly$ grr list slo_dashboards.jsonnet
API VERSION                     KIND         UID
grizzly.grafana.com/v1alpha1    Dashboard    slo-etcd
grizzly.grafana.com/v1alpha1    Dashboard    slo-logstash

grafana1002:/srv/grafana-grizly# grr diff slo_dashboards.jsonnet
Dashboard/slo-etcd no differences
Dashboard/slo-logstash no differences

Style Guide

Conventions

Grizzly dashboards should be tagged as ‘Grizzly’

Grizzly managed dashboards should be placed into a directory within grafana that contains dashboards managed only by grizzly.

Dashboards of similar types should be grouped into a single jsonnet file, which can include additional dashboard json files as required.

Notes

The grr command itself is configured via environment variables containing attributes like grafana server url, api key, etc. A wrapper has been deployed as /usr/local/bin/grr to supply these values from a file readable by group ops /etc/grafana/grizzly.env

Features

Shared crosshair

To enable the shared crosshair (which draws the current target cursor in all graphs on the page), go to "Configure dashboard" (top right menu). Then tick the "Shared Crosshair" setting in the Features section.

By default each data point of each metric has its own tooltip, only shown when hovering the exact point. Consider enabling the "All series" tooltip. This will ensure the tooltip is always shown when inside the graph. All points on the vertical axis are shown in a single tooltip at the same same. Horizontally the closest data point will be shown in the tooltip.

Click on the graph title and select Edit.
In the Display Styles section, enable tooltip "All series".
From the top navigation, go back to the dashboard.

Show deployments

Add MediaWiki deployment events as annotations to your dashboard:

Enable the "Annotations" feature from the "Configure dashboard" panel (top right menu).
From new settings menu on the top left, choose Annotations.
Add a new annotation.
- Name: MW deploy
- Data source: Public Logs (Loki)
- Query: {channel="scap"} |~ "(?i)finished|synchronized"
Click Update, and close the settings menu.

Input variables

See Grafana Templated dashboards.

Time correction

From "Configure dashboard" (top right menu) one can change the default ("browser time") to use "UTC" instead.

Negative axis

If you're plotting metrics with the intention to show some of them as negative, apply "Transform: negative-Y" from "Display > Series overides".^[1] This will visually flip the values in the graph (as negative), whilst preserving the positive values for legends and crosshairs. This is preferred over modifications like scale(-1), which will affects other displays of the metric as well and can cause confusion

Example: Server board - Network traffic (plots upload and download bandwidth)

Alerts (with notifications via Alertmanager)

Main article: Alertmanager#Grafana alerts

Originally set up via T152473 and T153167 by the Performance team.

Annotations based on Prometheus data

Grafana Annotations allow marking specific points on the graph with a vertical line and an associated description. The information about when a given event has occurred can be extracted with a Prometheus query.

To add a Prometheus-based Annotation:

Choose "Annotations" from the settings button (gear icon)
Click on "New"
Choose a name for the new annotation and a Prometheus data source
Insert a Prometheus query returning 1 when the annotation should be displayed. For example, in case of a metric tracking uptime in seconds, you can add an annotation to show when the service is started by using the resets() function. For example: resets(service_uptime{site=~"$site"}[5m]) > bool 0
Add a the label that will be displayed when moving the cursor over the annotation (triangle at the bottom of the vertical line). To do that, fill the "Field formats" section of the form and specify some constant text under "Title" and a comma separated list of "Tags" which must be Prometheus labels returned by the query (eg: instance, job).

Search/audit metrics usage across dashboards

It is possible to audit metrics usage across all grafana dashbaords (e.g. useful during metrics rename), check out search-grafana-dashboards.js in the operations/software.git repository for a command line utility.

Alternatively you can use https://github.com/panodata/grafana-wtf (supports caching the dashboards locally, searching in template variables, ...) with a read-only API key.

Best practices

If you are creating or updating a dashboard, see also Performance/Runbook/Grafana best practices for a list of best practices.

Common pitfalls

Alerts fire but the threshold was not reached

Alert fired with value different from the graph's metric.

Reported upstream at https://github.com/grafana/grafana/issues/12134.

It is common for a newly configured alert to fire within days for a value that, later, cannot be found in the graph. The reason for this is likely due to the alert query having a "to" time of "now". This a problem especially with data queried from Graphite where the data for the current minute may be null or otherwise incomplete. Resolve this by always giving the alert query a "to" time of at least 1 minute in the past. For example, from 1h, to now-1m.

Alerts for values derived from multiple metrics fire unexpectedly

When writing an alert for a value that is derived from multiple metrics (e.g. "cache_miss.rate" and "cache_hit.rate"), be sure to have the alert query until now-5m instead of until now because the last few data points may not be complete. Especially if they come from different servers. When evaluating math in Graphite in a way that involves a single metric, null remains null. But when involving multiple metrics, null is treated as zero. This can cause percentage values derived from two or more metrics to temporarily become a nonsensical value that can trigger your alert. Frustratingly, there is a good chance that by the time you look at the alert dashboard, the value will be complete, and no amount of zooming into the time frame where the alert occurred will reveal the bad value.

For more info, see also Graphite gotcha: Null in math (Grafana blog).

Recover after making the dashboard not editable

Delete the dashboard via the API, then restore it without the toggle.

Known issues

Alerts with asPercent() not working

Upstream: https://github.com/grafana/grafana/issues/9917

movingAverage template

Upstream https://github.com/grafana/grafana/issues/2078

When using a template inside movingAverage, the default mode wrongly expands the variable (it adds quotes around it, instead of leaving it as a number). These have to be removed manually by editing the metric directly (click on the pencil). Whenever the metric is changed, it has to be fixed again.

Color inspector broken

Upstream https://github.com/grafana/grafana/issues/2953

After each time you change a color through a color picker you must click on empty space anywhere outside the color picker. Otherwise the value will not be saved. (The little square will reflect your chosen color, but once applied, it will be lost). If you try to click "Invert", "Save", "Update" or one of the other color squares directly, the change will be lost.

DatasourceError notification spam

You might be getting DatasourceError notifications from your alerts, for example when Graphite or Thanos/Prometheus are temporarily unavailable. Since these notifications are not actionable for alerts recipients, you should disable this type of notification for your alert(s). To do so, navigate to your alert rule configuration page, then under "3 Alert evaluation behavior" section set "Alert state if execution error or timeout" to "OK" and then "save" at the top of the page.

Wikimedia Cloud Services

The Cloud Services admins maintain a Grafana installation at https://grafana.wmcloud.org. This instance queries data from the Prometheus instance located in the metricsinfra project (which monitors all VMs), as well as some in-project Prometheus instances.

Historically this instance was located on https://grafana-labs.wikimedia.org.

Pipeline

The Deployment Pipeline is well supported in Grafana. All services that are deployed in it benefit from ready to make dashboards that have basic functionality and structure already set. Most of the dashboards follow the RED/4 golden signals approach, by providing Traffic (aka Rate), Errors, Latency(aka Duration) and Saturation rows and panels in a dashboard named as the service. The hierarchy for the pipeline is under the Service folder.

While we have started experimenting with Grafana Grizzly to maintain this hierarchy, for now the process of instantiating and maintaining a new dashboard is manual. It consists of copying the Template Dashboard from the Service folder, changing the service variable to the name of the service (specifically the k8s namespace) and saving.

Usage for product analytics purposes

Grafana/Graphite is very useful for incident monitoring, but is less suitable for systematically analyzing data for product decisions. For example:

It does not allow easy comparison of data along dimensions like browser family or project domain, like Superset and Turnilo do. (This can be a limitation for technical investigations too, see e.g. phab:T166414 or [1].)
The underlying data in Graphite can’t be queried easily like we can do for EventLogging data. This makes it more difficult to vet and debug an instrumentation, and to answer more involved data questions (that go beyond time series data).
Also, Graphite compresses data after some time, making it hard to use it for investigating/comparing historical data.

That said, every EventLogging schema has an associated Grafana board (always linked from the schema talk page - example) which is valuable for monitoring its overall event rate.

Operations

Retiring a Dashboard

Dashboards that have aged, become unused, orphaned, are broken or are otherwise are no longer serving their original purpose will be tagged for cleanup using a non-destructive grace period. Panels within the dashboard will be replaced by a single text panel outlining that the dashboard was marked for cleanup, and instructions included for self service restore/revert to its previous state.

To do this, edit the dashboard

Add tags "to-be-deleted" "statsd-viking-funeral" (and any other useful tags to taste)
Remove all its panels and create a new text panel explaining the current status to the effect of:

# This dashboard has been tagged for cleanup.

If you believe this was done in error, please restore the previous version of this dashboard to return it to its former state **(gear icon > versions > restore)** and **remove tag** to-be-deleted

Version upgrade

This section details how to roll out a Grafana version upgrade.

Download the latest Debian package from https://grafana.com/grafana/download
Copy the package and install it on the host acting as backend for https://grafana-next.wikimedia.org. The mapping is in puppet at hieradata/common/profile/trafficserver/backend.yaml.
Verify basic functionality (login, view, edit)

Update the APT repository on the host serving apt.wikimedia.org as follows:

root@apt1002:~# reprepro --noskipold --restrict grafana checkupdate bookworm-wikimedia
root@apt1002:~# reprepro --noskipold --restrict grafana update bookworm-wikimedia

Backup the database prior to upgrade on the main Grafana host.

cp /var/lib/grafana/grafana.db /var/lib/grafana/grafana.db-$(date -I)

Upgrade the package
```
apt -q update
apt install grafana
```
Roll out the upgrade to all Grafana hosts. The full list of grafana hosts is obtained with cumin C:grafana

cumin C:grafana 'apt -q update' && cumin C:grafana 'apt -y install grafana'

Recovering a deleted dashboard

If a dashboard is deleted, point-in-time recovery is possible by extracting the dashboard JSON from a recent backup of the /var/lib/grafana/grafana.db sqlite database

Follow the Bacula#Restore (aka Panic mode) instructions to restore grafana.db from a time before the deletion happened, this will restore the past database contents to /var/tmp/bacula-restores/var/lib/grafana/grafana.db
Use sqlite3 to extract the desired dashboard.
- Identify the dashboard UID. Given a dashboard url of https://grafana-rw.wikimedia.org/d/-K8NgsUnz/foo-dashboard the UID would be -K8NgsUnz
- Connect to the database and extract the json
```
cd /var/tmp/bacula-restore/var/lib/grafana
sqlite3
.open grafana.db
select data from dashboard where uid='-K8NgsUnz';
```
- Copy this raw json into a working file, e.g. recover.json
- optionally pretty print it e.g. cat recover.json | jq
Import raw json into Grafana dashboard
- UI Method:
  - Create new dashboard, click the gear, click JSON model
  - Paste in the raw JSON obtained in step 2
  - Save dashboard
    - Note if you encounter "could not save dashboard" errors, try removing the outermost "id" field, and adjusting the "title" field to avoid collisions.

Failing over from the active to the passive host

Failing over to a new instance consists of several steps, executed from the active cumin host.

Stopping services on the passive host

First, stop the Grafana and Loki services on the passive host:

sudo cumin 'passive_host' 'systemctl stop grafana-server'
sudo cumin 'passive_host' 'systemctl stop grafana-loki'

Pulling data from the active host

Synchronize data from the active host to the passive host:

sudo cumin 'passive_host' 'sudo systemctl start rsync-var-lib-grafana'
sudo cumin 'passive_host' 'sudo systemctl start rsync-loki-data'

Starting services

Once the data is synchronized, start the services on the passive host:

sudo cumin 'passive_host' 'systemctl start grafana-server'
sudo cumin 'passive_host' 'systemctl start grafana-loki'

Making the necessary Puppet changes for the failover

Adjust the Puppet configuration to reflect the change in active and passive hosts:

Define the active and passive hosts in Grafana's Hieradata: hieradata/role/common/grafana.yaml
Update the destination for the grafana-next* domains in the sites hosting the Grafana instances:
- For eqiad: hieradata/role/eqiad/grafana.yaml
- For codfw: hieradata/role/codfw/grafana.yaml
Ensure the ATS rules route to the correct hosts in the Trafficserver hieradata: hieradata/common/profile/trafficserver/backend.yaml
Set the 'passive_host' as the new Loki host for the OpenSearch Collector's hieradata: hieradata/role/eqiad/logging/opensearch/collector.yaml

Running Puppet on the Grafana and Traffic hosts

Apply the changes by running Puppet on the Grafana and Traffic hosts:

sudo cumin 'A:grafana' 'run-puppet-agent'
sudo cumin 'A:cp' 'run-puppet-agent'

Ensuring services are working as expected

Verify that the Grafana and Loki services are active:

sudo cumin 'A:grafana' 'systemctl is-active grafana-server'
sudo cumin 'A:grafana' 'systemctl is-active grafana-loki'

Additionally, access Grafana via a web browser to confirm its functionality.

Notes

↑ Negative y-transform, What's new in Grafana v2.1

External links

[1] Negative y-transform, What's new in Grafana v2.1

[1]

Service

Editing dashboards

Private dashboards

Viewing

Dashboards as code

Grizzly

Use-cases

Workflows

Creating or updating a dashboard using Grizzly

Preparing the Patch

Updating a dashboard that already exists in Grafana using grizzly

Onboarding a dashboard with Grizzly

Code Review

Finding reviewers

Previewing the Change ('grr preview')

Diffing the change ('grr diff')

Deploying the change

Usage Examples

Style Guide

Conventions

Notes

Features

Shared crosshair

Shared tooltip

Show deployments

Input variables

Time correction

Negative axis

Alerts (with notifications via Alertmanager)

Annotations based on Prometheus data

Search/audit metrics usage across dashboards

Best practices

Common pitfalls

Alerts fire but the threshold was not reached

Alerts for values derived from multiple metrics fire unexpectedly

Recover after making the dashboard not editable

Known issues

Alerts with asPercent() not working

movingAverage template

Color inspector broken

DatasourceError notification spam

Wikimedia Cloud Services

Pipeline

Usage for product analytics purposes

Operations

Retiring a Dashboard

Version upgrade

Recovering a deleted dashboard

Failing over from the active to the passive host

Stopping services on the passive host

Pulling data from the active host

Starting services

Making the necessary Puppet changes for the failover

Running Puppet on the Grafana and Traffic hosts

Ensuring services are working as expected

Notes

External links