Alerts in Grafana
We've been working on finding web performance regressions for a couple of years. We are slowly getting more confident in our metrics and finding regressions more easily. Previously we found regressions by looking at the graphs in Graphite/Grafana, but now we use the built-in alerts in Grafana.
When we started out we only used RUM (real user monitoring) to find regressions. Then, as now, we used https://github.com/wikimedia/mediawiki-extensions-NavigationTiming to collect the data. We collect metrics from a small portion of our users and send them to our servers, where they later end up in Graphite (https://graphiteapp.org/). We collect Navigation Timing, a couple of User Timings and first paint for browsers that support it.
The way we found regressions was to look closely at the graphs in Graphite/Grafana. Yep, watching them really closely. The best way for us is to compare current metrics with the metrics we had one week earlier. The traffic and usage pattern for Wikipedia is almost the same from one week to the next. Comparing with 24 hours earlier can also work, depending on when you look (weekend traffic is different).
Did we find any regressions? Yes, we did. This is what one looked like for us:
Looks good, right? We could actually see that we had a regression on first paint. What is kind of cool is that the human eye is pretty good at spotting differences between two lines.
But we have since moved on to alerts in Grafana to automate how we find them.
Alerts and history
We have set up alerts for both RUM and synthetic testing. I've spent a lot of time tuning and setting up web performance alerts, and the best way so far has been to create one set of alert queries that express the change in each metric as a percentage. A change in percentage is easier for people to understand than the raw change in numbers. Next to those, on the right, we have one history graph. It looks like this:
To the left we have the changes in percentage. These are the numbers we alert on. In this case we first create a query that takes the moving average seven days back (this is the baseline we compare with), and then we take the moving average of the latest 24 hours. The 24-hour span is big, meaning we don't find regressions immediately, but it helps keep the metrics stable.
To the right is the history graph. We have it there because it is nice to see the real metrics (not in percentage); that makes it easier to judge whether a regression is real or not. The history graph is pretty straightforward: you list the metrics you want and configure how far back in time you want to graph them. We used to show 30 days (which is really good for seeing trends), but that was too long a span to see anything when an actual regression was happening. Now we use 7 days.
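As a rough illustration of that comparison, here is a minimal Python sketch. It assumes hourly data points, a 24-hour averaging window for both the baseline and the current value, and made-up first paint numbers; the helper names are hypothetical and this is not the actual Grafana query.

```python
from statistics import mean

HOURS_PER_DAY = 24
LOOKBACK_HOURS = 7 * HOURS_PER_DAY  # assumption: baseline is exactly 7 days back


def window_mean(series, end, size=HOURS_PER_DAY):
    """Average of the `size` points that end just before index `end`."""
    return mean(series[end - size:end])


def percent_change(baseline, current):
    """Change from baseline to current, expressed as a percentage."""
    return (current - baseline) / baseline * 100.0


# Made-up hourly first paint values in ms: seven stable days, then one slower day.
first_paint = [800.0] * LOOKBACK_HOURS + [880.0] * HOURS_PER_DAY

now = len(first_paint)
baseline = window_mean(first_paint, now - LOOKBACK_HOURS)  # same 24 hours, a week ago
current = window_mean(first_paint, now)                    # latest 24 hours
change = percent_change(baseline, current)

print(f"{change:.1f}%")
```

With these invented numbers the latest 24 hours average 880 ms against a baseline of 800 ms, so the sketch reports a 10% increase. The wide window is what smooths out short spikes, at the cost of reacting slowly.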
We alert on our RUM metrics: first paint, TTFB and loadEventEnd. We set the alerts on the p75 and p95 of the metrics we collect and alert on a 5-30% change, depending on the metric. Some metrics are really unstable and some are more stable. You can see our RUM alerts at https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts
At the moment we test three URLs on desktop in our synthetic testing. We also alert on three URLs for mobile. If a regression is larger than 10% on all three URLs, an alert is fired. We test three URLs to make sure the change is across the board and not specific to one URL. 10% is quite high but that gives us confidence that it is a real regression.
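The "all three URLs" rule can be sketched as follows. This is only an illustration in Python: the page names and percentage values are invented, and the function name is hypothetical rather than anything Grafana provides.

```python
THRESHOLD_PERCENT = 10.0  # the 10% limit described above


def should_alert(changes_by_url, threshold=THRESHOLD_PERCENT):
    """Fire only when every tested URL regressed by more than the threshold."""
    if not changes_by_url:
        return False  # nothing measured, nothing to alert on
    return all(change > threshold for change in changes_by_url.values())


# Invented percentage changes for three hypothetical test pages.
across_the_board = {"page-a": 12.5, "page-b": 14.0, "page-c": 11.2}
single_page_only = {"page-a": 25.0, "page-b": 1.3, "page-c": -0.4}

print(should_alert(across_the_board))  # every page is above 10%
print(should_alert(single_page_only))  # only one page regressed
```

Requiring every URL to cross the limit trades sensitivity for confidence: a regression that hits a single page will not fire, which matches the goal of only alerting on changes that are across the board.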
There are a couple of problems we have seen so far.
Self healing alerts
We compare with the metric X days back (usually 7). That means that after 7 days the alert heals itself: from then on we are comparing with the values that set off the alert, so the change drops back towards zero even if the regression is still there.
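A small simulation makes the self-healing effect concrete. It is a sketch with invented daily values and an assumed 10% threshold: the metric gets worse on day 10 and is never fixed, yet the alert condition only holds for seven days.

```python
LOOKBACK_DAYS = 7
THRESHOLD_PERCENT = 10.0  # assumed alert limit for this illustration

# Invented daily values in ms: stable for 10 days, then a permanent regression.
daily = [800.0] * 10 + [900.0] * 10

alerting_days = []
for day in range(LOOKBACK_DAYS, len(daily)):
    baseline = daily[day - LOOKBACK_DAYS]  # the value one week earlier
    change = (daily[day] - baseline) / baseline * 100.0
    if change > THRESHOLD_PERCENT:
        alerting_days.append(day)

print(alerting_days)  # [10, 11, 12, 13, 14, 15, 16]
```

From day 17 on, the baseline is itself a regressed value, so the computed change is 0% and the alert clears although the metric is still at 900 ms.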
Known non-working queries
We have had problems with nested queries that worked in the beginning but then stopped working (using Graphite's built-in percentage queries). To avoid that we now do alert queries like this:
Create one query that goes back X days and make it hidden. Then make another query that is divided by the first one, and set the offset to -1. It looks like this:
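The arithmetic behind that pair of queries can be sketched in Python. This is an assumption-laden illustration, not the actual query: in Graphite the steps would presumably map to functions such as timeShift, divideSeries and offset, and the helpers below are hypothetical stand-ins that work on plain lists with made-up values.

```python
def divide_series(current, baseline):
    """Divide point by point; a missing or zero baseline point yields None."""
    return [
        c / b if c is not None and b else None
        for c, b in zip(current, baseline)
    ]


def offset(series, amount):
    """Add a constant to every point, keeping missing points as None."""
    return [None if value is None else value + amount for value in series]


# Made-up values: the hidden query (X days back) and the current query.
baseline = [800.0, 800.0, 0.0]
current = [880.0, 900.0, 800.0]

ratio = divide_series(current, baseline)  # about 1.1, exactly 1.125, None
change = offset(ratio, -1)                # fractional change; 0.125 means +12.5%

print(change)
```

Dividing by the baseline and subtracting 1 turns the ratio into a fractional change, so an unchanged metric sits at 0 and an alert limit can be written directly as a fraction such as 0.1 for 10%.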