We have different tools to find performance regressions, and we have automated alerts that fire if they suspect a regression. Alerts are fired from https://alerts.wikimedia.org/ and the performance team is notified by email and IRC.
When an alert is fired, we need to find out the cause of the regression. There are two different types of performance alerts: synthetic testing and real user measurements. Best case scenario both types of tools fire and then
You got a performance alert, what's the next step?
You want to understand what's causing the regression: Is it a performance regression or is it something going on with the monitoring? Are multiple tools alerting? Follow the runbooks for the tool that fires the alert:
The first thing I do is to try and find out if the regression is across the board (for all URLs, all browsers, all synthetic tools, both synthetic and RUM metrics). If you know that, you are on the way to finding the root cause of the problem.
Synthetic testing alerts typically reference WebPageTest or WebPageReplay. For example:
Notification Type: PROBLEM
Service: https://grafana.wikimedia.org/dashboard/db/webpagereplay-mobile-alerts grafana alert
Host: einsteinium
Address: 220.127.116.11
State: CRITICAL
Date/Time: Tue Sept 11 22:14:46 UTC 2018
Notes URLs:
Additional Info: CRITICAL: https://grafana.wikimedia.org/dashboard/db/webpagereplay-mobile-alerts is alerting: Rendering Mobile enwiki CPU alert.

Notification Type: PROBLEM
Service: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts grafana alert
Host: einsteinium
Address: 18.104.22.168
State: CRITICAL
Date/Time: Thu Sept 13 04:12:19 UTC 2018
Notes URLs: https://phabricator.wikimedia.org/T203485
Additional Info: CRITICAL: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts is alerting: Start Render Chrome Desktop [ALERT] alert.
We run two different synthetic testing tools to find regressions: WebPageTest includes network/server time, Browsertime/WebPageReplay focuses exclusively on front end performance. We run WebPageTest for English Wikipedia (desktop and mobile) and Browsertime/WebPageReplay for English, beta, group 0 and group 1 (desktop and mobile).
You can read more about the WebPageReplay alerts to understand what we test.
If the alert comes from WebPageTest, you can start by checking the generic WebPageTest dashboard: https://grafana.wikimedia.org/d/000000210/webpagetest and then go down and check the metrics for the individual URL: https://grafana.wikimedia.org/d/000000057/webpagetest-drilldown
If the alert is coming from WebPageReplay/Browsertime, you should start with the drilldown dashboard, where you can see the metrics for each URL.
Where to start
A good starting point is to follow the runbooks for each alert. One of the key things is to find out at what point in time the regression was introduced. If you can find that, then you can compare screenshots and HAR files (which describe what the browser downloads and when) from before and after the regression.
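If you have exported a metric as a series of timestamped values, a small script can suggest where to start looking. The following is a minimal sketch, not part of our tooling: `find_step_change` is a hypothetical helper, and it assumes the regression looks like a single step in the data. It tries every split point and keeps the one where the values before and after are each closest to their own mean.

```python
from statistics import mean


def _sse(values):
    """Sum of squared deviations from the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def find_step_change(samples, min_window=3):
    """Estimate where a single step change happens in a metric series.

    samples: list of (timestamp, value) tuples, sorted by timestamp.
    min_window: minimum number of samples required on each side of the split.
    Returns (timestamp_of_first_sample_after_change, mean_after - mean_before),
    or None if there are too few samples.
    """
    best = None
    for i in range(min_window, len(samples) - min_window + 1):
        before = [v for _, v in samples[:i]]
        after = [v for _, v in samples[i:]]
        # A good split leaves little variation inside each segment.
        cost = _sse(before) + _sse(after)
        if best is None or cost < best[0]:
            best = (cost, samples[i][0], mean(after) - mean(before))
    if best is None:
        return None
    return best[1], best[2]
```

Treat the result as a hint for which runs to compare, not as proof. Noisy metrics or gradual regressions will not fit the single-step assumption.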
To find specific runs in WebPageTest, you need to use the search page. It will show a lot of runs so make sure you pick the right ones!
A couple of things to know: Make sure you choose Show tests from all users and Do not limit the number of results (warning, WILL be slow). That way you are sure you will see all the tests. Also change the View to include enough days to go back to when the regression happened.
You can also use the filter fields (for example, URLs containing) to try to limit the results.
It's important that you get the runs from before and after the regression within the same search result, because you use the small checkbox to the left of each result to pick runs. It usually takes a lot of work just to find the right runs, so have patience. When you've picked two runs, click the (small) Compare button.
When you click "compare", you will see a comparison of the waterfall chart (using the HAR) and screenshots and videos for the selected runs.
Some things to look for:
- Are there assets that are being downloaded after the regression, that were not being downloaded before it?
- Are there specific assets that are downloading slowly?
- Has anything visible changed on the page? (For example, we frequently have alerts fire when fundraising campaigns start, and we sometimes see alerts when an edit is made to a page that changes it significantly.)
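If you download the HAR files for the two runs, the first two checks in the list above can be roughly automated. This is a minimal sketch, assuming standard HAR 1.2 files where each entry has a `request.url` and a total `time` in milliseconds. The helper names and the `slow_factor` threshold are hypothetical and chosen only for illustration.

```python
import json


def load_har(path):
    """Read a HAR file from disk and return it as a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def entries_by_url(har):
    """Map each request URL to its total time in ms (last entry wins on duplicates)."""
    return {e["request"]["url"]: e["time"] for e in har["log"]["entries"]}


def compare_hars(before, after, slow_factor=1.5):
    """Compare two parsed HARs.

    Returns (new_assets, slower_assets):
    - new_assets: URLs requested only in the `after` run.
    - slower_assets: URLs present in both runs whose time grew by more than
      `slow_factor` (an arbitrary threshold, tune it to your data).
    """
    b = entries_by_url(before)
    a = entries_by_url(after)
    new_assets = sorted(set(a) - set(b))
    slower_assets = sorted(
        url for url in set(a) & set(b) if a[url] > b[url] * slow_factor
    )
    return new_assets, slower_assets
```

A typical call would be `compare_hars(load_har("before.har"), load_har("after.har"))`, where the file names are placeholders. Single runs are noisy, so a slow asset in one run is worth a second look rather than a conclusion.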
To find specific runs, you need to go to the drilldown dashboard. Then choose which wiki and page and click "Show each tests" and wait.
In a couple of seconds you will see vertical lines on the dashboard. Each vertical line represents a time when the test ran. Hover over the line and you will see a layer with a screenshot from when the page was tested.
Click on the result link and you will then get to the actual result for that run.
Tips and tricks
Check the screenshots. Look out for campaigns and try to correlate them to when they got activated.
Check if there has been a release of the tool: for WebPageReplay, make sure you click Show synthetic environment changes, and for WebPageTest, Show WebPageTest changes. If the performance team updates the tool (new version of the tool, new version of the browser), there will be an annotation for that. New browser versions have introduced regressions in the past. WARNING: We still autoupdate WebPageTest, so we can miss an annotation for a browser upgrade or a change in the tool.
Check if there is a release that correlates to the change by choosing Show sync-wikiversions and check the server admin log.
Do you see any changes in the Navigation Timing metrics? It's always good to try to verify the change in both our ways of collecting metrics.
If the tests are run by Chrome we collect the internal trace log (both on WebPageTest and Browsertime/WebPageReplay) that you can use to dig deeper into what happens. For WebPageTest, you find the log (to download) using the Trace link. For Browsertime/WebPageReplay, the log for each run is in the result directory. Download the files, unpack them and drag and drop them into Developer Tools/Performance in Chrome.
Real user measurement
The real user measurements are metrics that we collect from real users, using browser APIs. Historically these metrics have been more technical than those collected by synthetic testing, as we can't get visual measures from the user's browser.
Alerts that derive from Real User Measurement data will typically reference Navigation Timing in the alert. For example:
Notification Type: PROBLEM
Service: https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts grafana alert
Host: einsteinium
Address: 22.214.171.124
State: CRITICAL
Date/Time: Fri Aug 31 05:02:38 UTC 2018
Notes URLs:
Additional Info: CRITICAL: https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts is alerting: Load event overall median.
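Since synthetic alerts reference WebPageTest or WebPageReplay and RUM alerts reference Navigation Timing, the dashboard name in the notification is enough to tell which runbook applies. The following is a minimal sketch of that triage step. `classify_alert` is a hypothetical helper, and it assumes the dashboard URLs keep the shape shown in the example notifications on this page.

```python
import re

# Matches the dashboard slug in URLs shaped like the example notifications.
DASHBOARD_RE = re.compile(r"https://grafana\.wikimedia\.org/dashboard/db/([\w-]+)")


def classify_alert(text):
    """Return (dashboard_slug, kind) for an alert notification.

    kind is "synthetic", "rum", or "unknown". The slug is None when no
    dashboard URL is found in the text.
    """
    match = DASHBOARD_RE.search(text)
    if match is None:
        return None, "unknown"
    dashboard = match.group(1)
    if dashboard.startswith(("webpagetest", "webpagereplay")):
        return dashboard, "synthetic"
    if dashboard.startswith("navigation-timing"):
        return dashboard, "rum"
    return dashboard, "unknown"
```

For the notification above, this returns `("navigation-timing-alerts", "rum")`.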
The real user measurement metrics are collected from all browsers that support the Navigation Timing API. We also collect additional metrics, like first paint (when something is first displayed on the screen) or the effective connection type, when the browser supports those additional APIs. We sample the data and use 1 out of 1000 requests by default. This can be overridden for specific geographies, pages, etc., where the sampling rate might be different.
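To make the sampling concrete, here is a minimal sketch of 1-in-N sampling with per-geography overrides. It is an illustration only and not the actual collection code: the function names, the `overrides` mapping and its keys are all hypothetical.

```python
import random

DEFAULT_SAMPLING_FACTOR = 1000  # 1 out of 1000 requests by default


def sampling_factor_for(geo, overrides, default=DEFAULT_SAMPLING_FACTOR):
    """Return the sampling factor for a geography, falling back to the default."""
    return overrides.get(geo, default)


def in_sample(factor=DEFAULT_SAMPLING_FACTOR, rng=random):
    """Return True for roughly 1 out of `factor` calls."""
    return rng.randrange(factor) == 0
```

With a factor of 1000, a page that gets 1,000,000 views reports about 1,000 measurements. A lower factor for a specific geography means more of its requests are measured, so keep sampling overrides in mind when you compare counts between slices.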
The main Navigation Timing dashboards are a good way to start, with the alert dashboard and the generic one.
Where to start
Start by following the runbook for RUM alerts.
Tips and tricks
If you cannot find what caused the regression, you can try the Navigation Timing by browser dashboard. Check the report rate: has it changed? It could be that we did a release and accidentally changed how we collect the metrics, or that a new browser version rolled out that affects the metrics. You can see how many metrics we collect for specific browser versions.
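If you have the number of reports per browser version for a period before and a period after the regression, a quick comparison shows which versions changed the most. This is a minimal sketch with a hypothetical helper and an arbitrary threshold. The input dictionaries are assumed to be something you build yourself from the dashboard data.

```python
def report_rate_changes(before, after, threshold=0.5):
    """Find browser versions whose report count changed noticeably.

    before, after: dicts mapping a browser version label to a report count
    for two periods of equal length.
    threshold: minimum relative change to report (0.5 means 50%).
    Returns a dict of version -> relative change, or None for versions that
    had no reports before and have some after (newly rolled out).
    """
    changes = {}
    for version in sorted(set(before) | set(after)):
        old = before.get(version, 0)
        new = after.get(version, 0)
        if old == 0:
            if new > 0:
                changes[version] = None  # new version, no baseline to compare
            continue
        ratio = (new - old) / old
        if abs(ratio) >= threshold:
            changes[version] = ratio
    return changes
```

A version that appears with `None` right when the regression started is a good candidate for a browser rollout. A large drop across all versions points more towards a change in how we collect the metrics.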
Do you see any change in the synthetic metrics? Use both kinds of tools to try to nail down the regression. The synthetic tools can more easily show you what has changed (by checking the HAR files from before and after the change).
It's possible that further drilling down is required and you may need to slice the data by features other than platform, browser or geography. For this, it's best to use Hive and query the raw RUM data recorded under the NavigationTiming Eventlogging schema. Remember to narrow down your Hive queries to the timespan around the regression, as the NavigationTiming table is huge (we record around 14 records per second on average).
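As a rough illustration of why the timespan matters, the sketch below estimates how many records a query window covers at the average rate of 14 records per second. The helper names and the default window size are hypothetical. The rate is the approximate average quoted above, so the results are order-of-magnitude estimates only.

```python
from datetime import datetime, timedelta

RECORDS_PER_SECOND = 14  # approximate average quoted in the text


def window_around(regression_time, hours=6):
    """Return (start, end) covering `hours` before and after the regression."""
    delta = timedelta(hours=hours)
    return regression_time - delta, regression_time + delta


def estimated_rows(start, end, rate=RECORDS_PER_SECOND):
    """Estimate how many records fall between two datetimes."""
    return int((end - start).total_seconds() * rate)
```

At this rate a single day is about 1.2 million records (14 × 86,400 = 1,209,600), so a query over a few hours around the regression scans far less data than one over a month.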