This is the runbook for WebPageReplay alerts.
- Issue tracker (Phabricator): WebPageReplay
- Documentation: WebPageReplay
WebPageReplay alert fired
Our WebPageReplay tests measure the front end performance of Wikipedia (using a WebPageReplay proxy). If an alert fires, it can be caused by:
- A front end performance regression of Wikipedia
- A regression in the browser that is used for the test
- Instability on the server that runs the tests
Front end performance regression
- Go to the WebPageReplay alert Grafana dashboard to see/verify the alert.
- Go to the individual page dashboard and zoom in on the regression. Try to find the time of the regression (to within roughly ±2 hours). Check all tested URLs and see if they all have the regression.
- Verify the regression on the Browsertime/sitespeed.io tests that run directly against Wikipedia, and check whether you can see anything in the RUM data (RUM normally lags, because we switch browser versions quickly whereas real users take time to update).
- If you can't find anything in the other tools, check if it's a browser regression or a test server regression.
- Check the Server Admin Log to see if there has been a change that correlates with the regression.
If you can verify that it is a regression, create a Phabricator task and include everything you know. Please take screenshots of the dashboards and include links. If you can identify the code change that caused the regression, please include the responsible team/person in the task.
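The ±2 hour search around the regression can be sketched as a small helper. This is only an illustration: `search_window` and `changes_in_window` are hypothetical names, and the entries are assumed to be `(timestamp, message)` pairs copied by hand from the Server Admin Log, not the output of any real API.

```python
from datetime import datetime, timedelta, timezone


def search_window(regression_time, margin_hours=2):
    """Return the (start, end) window around the estimated regression time."""
    margin = timedelta(hours=margin_hours)
    return regression_time - margin, regression_time + margin


def changes_in_window(entries, regression_time, margin_hours=2):
    """Keep only the (timestamp, message) entries that fall inside the window."""
    start, end = search_window(regression_time, margin_hours)
    return [(ts, msg) for ts, msg in entries if start <= ts <= end]


# Placeholder entries for illustration only; real ones come from the log.
regression = datetime(2021, 11, 2, 14, 0, tzinfo=timezone.utc)
entries = [
    (datetime(2021, 11, 2, 9, 30, tzinfo=timezone.utc), "example entry A"),
    (datetime(2021, 11, 2, 13, 10, tzinfo=timezone.utc), "example entry B"),
    (datetime(2021, 11, 2, 15, 45, tzinfo=timezone.utc), "example entry C"),
]
candidates = changes_in_window(entries, regression)
```

Any entries left in `candidates` are worth linking in the Phabricator task. Widen `margin_hours` if the regression time is uncertain.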
Browser performance regression
- Go to the dashboard for WebPageReplay tests.
- Make sure the domain, page and browser match the alert that fired (that is, you are looking at the right data).
- Zoom in using the time dropdown. Use the last 24 hours or two days, and make sure the regression happened within that time window.
- Click on Show each tests and wait a couple of seconds until you see the green vertical lines appearing on the graphs.
- Hover over the green lines before the regression and after the regression. Hovering will show a screenshot of the test and which versions of sitespeed.io and the browser were used when the test was executed. It will look something like this: 20.3.0 - 95.0.4638.54. The first part is the sitespeed.io version and the second part is the browser version.
- Verify whether it is the exact same browser version before the regression and after the regression.
- If the browser version differs, verify the regression on all tested URLs and check if you can see the same thing on the tests running without WebPageReplay.
If we can see that the browser caused the regression, we can roll back the version running WebPageReplay (look at the changelog to see which sitespeed.io version includes which browser version) to fully verify the regression. If the regression is verified, you should create an upstream bug for the browser.
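The before/after version comparison can also be done mechanically on the labels read from the hover tooltip. The sketch below assumes the label always has the form `<sitespeed.io version> - <browser version>`, as in the example above. `RunVersions`, `parse_version_label` and `browser_changed` are hypothetical helper names, not part of sitespeed.io.

```python
from typing import NamedTuple


class RunVersions(NamedTuple):
    sitespeed: str
    browser: str


def parse_version_label(label: str) -> RunVersions:
    """Split a label such as '20.3.0 - 95.0.4638.54' into its two parts."""
    sitespeed, sep, browser = label.partition(" - ")
    if not sep or not sitespeed.strip() or not browser.strip():
        raise ValueError(f"unexpected version label: {label!r}")
    return RunVersions(sitespeed.strip(), browser.strip())


def browser_changed(before_label: str, after_label: str) -> bool:
    """Return True if the browser version differs between the two runs."""
    before = parse_version_label(before_label)
    after = parse_version_label(after_label)
    return before.browser != after.browser
```

For example, `browser_changed("20.3.0 - 95.0.4638.54", "20.3.0 - 95.0.4638.54")` is `False`, so the browser can be ruled out as the cause in that case.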
Test server performance regression
If the regression is on emulated mobile, make sure the dashboard type is emulatedMobile and Test type is webpagereplay in the dashboard. The default links are for desktop.
- Check the standard deviation of the CPU benchmark; it should be around 1 ms.
- Look at the min/median/max values of the CPU benchmark.
- If the standard deviation is high, contact the performance team, who will need to deploy the tests on another server.
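If the raw CPU benchmark samples are available outside the dashboard, the same checks can be sketched with Python's `statistics` module. The function name `summarize_cpu_benchmark` and the `stdev_limit_ms` default of 2.0 are assumptions made for illustration. The runbook only says the standard deviation should be around 1 ms and does not define what counts as "high".

```python
import statistics
from typing import Sequence


def summarize_cpu_benchmark(samples_ms: Sequence[float],
                            stdev_limit_ms: float = 2.0) -> dict:
    """Summarize CPU benchmark samples (milliseconds) for a stability check.

    stdev_limit_ms is an assumed cut-off, not a documented threshold.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute a standard deviation")
    stdev = statistics.stdev(samples_ms)  # sample standard deviation
    return {
        "min": min(samples_ms),
        "median": statistics.median(samples_ms),
        "max": max(samples_ms),
        "stdev": stdev,
        "unstable": stdev > stdev_limit_ms,
    }


# Made-up sample values, for illustration only.
summary = summarize_cpu_benchmark([70, 71, 70, 72, 71])
```

A large gap between `min` and `max` together with `unstable` being `True` points at the test server rather than at Wikipedia or the browser.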