Performance/Guides/WebPageReplay alert

From Wikitech
Jump to navigation Jump to search

We use Grafana alerts for WebPageReplay tests to get notified of potential frontend performance regressions, based on the metrics we collect from the Browsertime and WebPageReplay infrastructure.

We run 5 runs per URL against WebPageReplay and record a video in 30 fps and take the median metrics. Since we run against a local proxy (WebPageReplay) we set the latency to 100ms to slowdown the rendering.

We have alerts for FirstVisualChange that will fire if there's a 20-40 ms regression all the three pages tested for that wiki. We run tests using Firefox and Chrome for desktop and emulated mobile Chrome for mobile.

At the moment we test three pages on each of these wikis and on their corresponding mobile wiki:

The main test repo hold the URLs for all pages tested, except the enwiki URLs.

Mobile test URLs and enwiki mobile URLs.

You find all the alerts under Performance Team Alerts in Grafana.


WebPageReplay alert fired

Our WebPageReplay tests measures the front end performance of Wikipedia (using a WebPageReplay proxy). If an alert fires it can be caused by:

  • A front end performance regression of Wikipedia
  • A regression in the browser that is used for the test
  • Instability on the server that runs the tests

Front end performance regression

  1. Go to the WebPageReplay alert Grafana dashboard to see/verify the alert.
  2. Go to the individual page dashboard and use the zoom in on the regression. Try to find the time of the regression (+- 2 hours or something like that). Check all tested URLs and see if they all have the regression.
  3. Verify the regression on Browsertime/ tests that runs direct against Wikipedia and check if you can see anything in the RUM data (that normally lags since we switch browser versions fast, and for users it takes time).
    1. If you can't find anything in the other tools, check if its a browser regression or a test server regression.
  4. Check Server Admin Log to see if there's been a change that correlate to the regression.

If you can verify that it is a regression, create a Phabricator task in and include everything you know. Please take screenshots of the dashboards and include links. If you could identify the code change that caused the change, please include the team/person in the issue.

Browser performance regression

  1. Go the the dashboard for WebPageReplay tests
  2. Make sure the domain, page and browser matches the alert that fired (=you are looking at the right data).
  3. Zoom in using the time dropdown, use the last 24 hours or two days, make sure the regression happened within that time window
  4. Click on Show each tests and wait a couple of seconds until you see the green vertical lines appearing on the graphs.
  5. Hover the mouse on the green lines before the regression and after the regression. Hovering will show a screenshot of the test and what versions of and browser that was used when the test was executed. It will look something like this: 20.3.0 - 95.0.4638.54 The first part is the version and the second part is the browser.
  6. Verify that it is the exact same browser version before the regression and after the regression
  7. If the browser version differ, verify the regression on all tested URLs and check if you can see the same thing on the tests running without WebPageReplay.

If we can see that the browser caught the regression we can rollback the version running WebPageReplay (look at the changelog to see what version that includes what browser version) to 100% verify the regression. If the regression is verified, you should create an upstream bug for the browser.

Test server performance regression

If the regression is on emulated mobile, make sure the dashboard type is emulatedMobile and Test type is webpagereplay in the dashboard. The default links are for desktop.

  1. Check the standard deviation of the CPU benchmark it should be something like 1 ms.
  2. Look at the min/median/max values of the CPU benchmark.
  3. If the standard variation is high contact the performance team that need to deploy the tests on a another server.