Performance/Alerts


Alerts in Grafana

We've been working on finding web performance regressions for a couple of years, and we are slowly getting more confident in our metrics and finding regressions more easily. Previously we found regressions by looking at the graphs in Graphite/Grafana, but now we use the built-in alerts in Grafana.

History

When we started out we only used RUM to find regressions. Back then (and now) we use https://github.com/wikimedia/mediawiki-extensions-NavigationTiming to collect the data. We collect metrics from a small portion of users and pass them on to our servers, from where they end up in https://graphiteapp.org/. We collect Navigation Timing, a couple of User Timings and first paint for browsers that support it.

Finding regressions

The way we found regressions was to look closely at the graphs in Graphite/Grafana. Yep, watching them really closely. The best way for us is to compare current metrics with the metrics we had one week back in time. The traffic and usage pattern for Wikipedia is almost the same when comparing 7 days apart. Comparing 24 hours back in time can also work, depending on when you look (weekend traffic is different).
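For example, in Grafana with a Graphite data source you can plot a metric together with the same metric shifted one week back using timeShift. This is just a sketch; the metric path below is a hypothetical placeholder, not one of our real metric names:

 example.navtiming.firstPaint.p75
 timeShift(example.navtiming.firstPaint.p75, "7d")

Graphing both series in the same panel makes a week-over-week regression stand out as a gap between the two lines.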

Did we find any regressions? Yes we did. This is what one looked like for us:

First paint change found on Graphite GUI

Looks good, right? We could actually see that we had a regression on first paint. What is kind of cool is that the human eye is pretty good at spotting differences between two lines.

But we moved on to use alerts in Grafana to automate how we find them.

Current setup

At the moment we alert on WebPageTest, WebPageReplay, Navigation Timing metrics and Save timings.

Alerts and history

We have set up alerts both for RUM and synthetic testing. I've spent a lot of time tuning and setting up web performance alerts, and the best way so far has been to create one set of alert queries that compare the metric as a percentage change. Talking about a change in percentage is easier for people to understand than a raw change in numbers. And then we have one history graph to the right. It looks like this:

Speed Index regression found using Grafana

To the left we have the changes in percentage. These are the numbers we add alerts on. In this case we first create a query that takes the moving average seven days back (this is the baseline number we will compare with), and then we take the moving average of the latest 24 hours. We have a big span here of 24 hours, meaning we don't find regressions immediately, but it helps us keep the metrics stable.
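As a hedged illustration with made-up numbers: if the baseline (the moving average from seven days back) for a metric is 1000 ms and the moving average of the latest 24 hours is 1100 ms, the left panel shows roughly

 (1100 / 1000 - 1) * 100 = +10%

and an alert threshold at, say, +5% would fire.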

To the right is the history graph. We have a graph to the right because it is nice to see the real metrics (not in percentage); it makes it easier to know whether the regression is real or not. The history graph is pretty straightforward. You list the metrics you want and you configure how far back in time you want to graph them. We used to do 30 days (that is really good for seeing trends) but it was too long to see anything when an actual regression was happening. Now we use 7 days.

Navigation Timing

We alert on our RUM metrics: first paint, TTFB and loadEventEnd. We set the alerts on the p75 and p95 of the metrics we collect and alert on a 5-30% change depending on the metric. Some metrics are really unstable and some are better. You can see our RUM alerts at https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts

WebPageTest

At the moment we test three URLs on desktop in our synthetic testing. We also alert on three URLs for mobile. If a regression is larger than 100 ms on all three URLs, an alert is fired. We test three URLs to make sure the change is across the board and not specific to one URL.
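One way this can be expressed in a Grafana alert rule (classic conditions) is three ANDed conditions, one per URL. This is only a sketch, and it assumes queries A, B and C each return the millisecond difference between the latest 24 hours and the 7-day baseline for one of the three URLs:

 WHEN avg () OF query (A, 24h, now) IS ABOVE 100
 AND  avg () OF query (B, 24h, now) IS ABOVE 100
 AND  avg () OF query (C, 24h, now) IS ABOVE 100

Because the conditions are ANDed, a change on a single URL is not enough to fire the alert.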

WebPageReplay

At the moment we test three URLs on 13 different wikis, both on desktop and mobile. If a regression is larger than 20-40 ms on all three URLs, an alert is fired. We test three URLs to make sure the change is across the board and not specific to one URL.

Search

We have a couple of alerts for our synthetic testing of search: coming from Google (searching for Barack Obama) and hitting our Barack Obama page (desktop and emulated mobile), and searching for Barack Obama on our own search page (desktop and emulated mobile). https://grafana.wikimedia.org/d/IpmyFNzMz/search-alerts

Save Timings

// TODO

Known problems

There are a couple of problems we have seen so far.

Self healing alerts

We go back X days (usually 7 days). That means that after 7 days the alert is self-healing: we will then be comparing against the metric values that set off the alert, so the alert clears even if the regression has not actually been fixed.

Known non working queries

We have had problems with nested queries that work in the beginning but then stop working (using Graphite's built-in percentage queries). To avoid that we now build alert queries like this:

Create one query that goes back X days and make it hidden. Then make another query that divides by the first one and set the offset to -1. It looks like this:

Setup alert queries for WebPageTest example.
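Since the screenshot does not show the actual query text, here is a rough sketch of what the two queries could look like against a Graphite data source. The metric path is a hypothetical placeholder, and #A is Grafana's reference to the first (hidden) query:

 Query A (hidden), the baseline: the metric shifted 7 days back, smoothed over 24 hours
   movingAverage(timeShift(example.webpagereplay.enwiki.desktop.SpeedIndex, "7d"), "24h")

 Query B (the one the alert uses): the latest 24-hour moving average divided by the baseline, with the offset set to -1
   offset(divideSeries(movingAverage(example.webpagereplay.enwiki.desktop.SpeedIndex, "24h"), #A), -1)

With the offset of -1 the result is a fraction (0.05 means a 5% increase); you can multiply it by 100 with the scale() function, or set the panel unit to percent (0.0-1.0), if you want to read it as a percentage.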

Creating an alert

In 2021 we moved to using AlertManager with Grafana. To set up an alert for AlertManager you need to make sure you add AlertManager in the "send to" field of the alert.

Choose AlertManager when creating an alert.

You also need to add tags to your alerts. To make sure the alerts reach the performance team you need to add two tags: team and severity. The team tag needs to have the value perf and the severity tag should have the value critical or warning (depending on the severity of the alert).


We also use tags to add extra info about the alerts. The following tags are used at the moment but feel free to add more:

  • metric - the value shows which metric fired the alert. By tagging the metric, it's easier to find other alerts that fired at the same time
  • tool - the tool that fired the alert. Values can be rum (real user measurements), webpagetest, webpagereplay, sitespeed.io.
  • dashboard - the link to the current alert dashboard. At the moment Grafana/AlertManager aren't best friends, so you need to give the link to the dashboard to be able to find the actual dashboard directly from the alert. This is important, please add the link.
  • drilldown - link to another dashboard that holds more info about the metric that fired. For example a dashboard that shows all metrics collected for a URL that failed, or RUM metrics that give more insight into what could be wrong.

In the future we also want to include a link to the current runbook for each alert.


The tags will look something like this when you are done:

Adding tags to the alert.
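As a plain-text illustration of a finished tag set, the values below for metric, tool and drilldown are just examples (the dashboard link points at our Navigation Timing alert dashboard mentioned above):

 team: perf
 severity: warning
 metric: firstPaint
 tool: rum
 dashboard: https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts
 drilldown: <link to a dashboard with more detail for the metric that fired>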

When you create your alert you need to make sure that it's evaluated often, so that AlertManager understands that it is the same alert that keeps firing. At the moment we evaluate every 3 minutes. If you don't evaluate often, AlertManager will treat each firing as a new alert and you will get multiple emails for the same alert.