Performance/Guides/Regressions
We have different tools to find performance regressions, and automated alerts will fire if they suspect there is a regression. Alerts are fired from https://alerts.wikimedia.org/ and the QTE team is notified by email for synthetic alerts. For the real user alerts, some members of the old performance team keep an eye on them.
When an alert is fired, we need to find out the cause of the regression. There are two different types of performance alerts: synthetic testing and real user measurements.
You got a performance alert, what's the next step?
You want to understand what's causing the regression: Is it a performance regression or is it something going on with the monitoring? Are multiple tools alerting? Follow the runbook for the tool that fired the alert.
The first thing to do is to find out whether the regression is across the board (all URLs, all browsers, all synthetic tools, both synthetic and RUM metrics). If you know that, you are well on the way to finding the root cause of the problem.
Synthetic testing
Background
We run two different kinds of synthetic tests to find regressions: direct tests, which include network/server time, and WebPageReplay tests, which focus exclusively on front-end performance. We run direct tests for English Wikipedia (desktop and mobile) and WebPageReplay for English Wikipedia, group 0 and group 1 (desktop and mobile).
You can read more about how the WebPageReplay alerts work.
Where to start
If the alert comes from direct tests, start by checking the direct tests alerts dashboard and then the generic page drilldown dashboard; make sure you choose firstView tests in the Test type dropdown. You can also search for individual tests at https://wikiperformance.wmcloud.org/search/.
For WebPageReplay alerts you have the English Wikipedia dashboard, the group 1 dashboard and the group 0 dashboard. You can also check individual URLs at the generic page drilldown dashboard; make sure you choose webpagereplay tests in the Test type dropdown.
A good starting point is to check the graphs and try to understand at what point in time the regression was introduced. If you can find that, you can compare screenshots and HAR files (which describe what the browser downloads and when) from before and after the regression; a small sketch for comparing HAR files follows the walkthrough below.
To find specific runs, you need to go to the drill down dashboard. Then choose which wiki and page, click "Show each tests" and wait.
After a couple of seconds you will see vertical lines on the dashboard. Each vertical line represents a time when the test ran. Hover over a line and you will see a layer with a screenshot from when the page was tested.
Click on the result link and you will then get to the actual result for that run.
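If you want to compare two HAR files programmatically rather than by eye, a minimal sketch like the one below can list new and grown requests between two runs. The file names before.har and after.har are placeholders for HAR files downloaded from the result pages of a run before and a run after the regression.

```python
import json

def load_entries(path):
    """Load a HAR file and return a dict mapping request URL -> response body size."""
    with open(path, encoding="utf-8") as f:
        har = json.load(f)
    return {
        entry["request"]["url"]: entry["response"]["content"].get("size", 0)
        for entry in har["log"]["entries"]
    }

# Placeholder file names: use HARs downloaded from the result pages.
before = load_entries("before.har")
after = load_entries("after.har")

print("Requests only in the new run:")
for url in sorted(set(after) - set(before)):
    print(f"  {url} ({after[url]} bytes)")

print("Requests that grew:")
for url in sorted(set(after) & set(before)):
    if after[url] > before[url]:
        print(f"  {url}: {before[url]} -> {after[url]} bytes")
```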
Tips and tricks
Check the screenshots. Look out for campaigns and try to correlate them with when they were activated.
Check if there has been any release of the tool (if you are using WebPageReplay, make sure you click Synthetic setup). If QTE updates the tool (a new version of the tool, a new version of the browser) there will be an annotation for that. It has happened that new browser versions have introduced a regression.
Check if there is a release that correlates with the change by choosing Show sync-wikiversions, and check the server admin log.
Do you see any changes in the Navigation Timing metrics? It's always good to try to verify the change in both of our ways of collecting metrics.
If the tests are run with Chrome we collect the internal trace log (for both direct and WebPageReplay tests) that you can use to dig deeper into what happens. For Browsertime/WebPageReplay, the log for each run is in the result directory. Download the files, unpack them and drag and drop them into Developer Tools/Performance in Chrome.
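If you prefer to script the unpacking step, here is a minimal sketch. It assumes the trace logs are stored as gzipped JSON files and the trace-*.json.gz file name pattern is an assumption, so check what the result directory actually contains.

```python
import glob
import gzip
import shutil

# Assumed file name pattern; adjust to whatever the result directory contains.
for packed in glob.glob("trace-*.json.gz"):
    unpacked = packed[:-3]  # strip the .gz suffix
    with gzip.open(packed, "rb") as src, open(unpacked, "wb") as dst:
        shutil.copyfileobj(src, dst)
    print(f"Unpacked {packed} -> {unpacked}, ready for Developer Tools/Performance")
```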
Real user measurement
The real user measurements are metrics that we collect from real users, using browser APIs. Historically these metrics have been more technical than those collected by synthetic testing, as we can't get visual measures from the user's browser.
Alerts that derive from Real User Measurement data will typically reference Navigation Timing in the alert. For example:
Notification Type: PROBLEM
Service: https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts grafana alert
Host: einsteinium
Address: 208.80.155.119
State: CRITICAL
Date/Time: Fri Aug 31 05:02:38 UTC 2018
Notes URLs:
Additional Info:
CRITICAL: https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts is alerting: Load event overall median.
Background
We collect real user measurement data from all browsers that support the Navigation Timing API. We also collect additional metrics, like first paint (when something is first displayed on the screen) or the effective connection type, when the browser supports those additional APIs. We sample the data and use 1 out of 1000 requests by default. This can be overridden for specific geographies, pages, etc. where the sampling rate might be different.
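To illustrate what 1 out of 1000 sampling means in practice, here is a minimal sketch. It is not the actual EventLogging implementation (which runs as JavaScript in the user's browser), just an illustration of the default sampling rate.

```python
import random

SAMPLE_FACTOR = 1000  # default: report 1 out of every 1000 page views

def in_sample(factor=SAMPLE_FACTOR):
    """Return True for roughly one page view out of `factor`."""
    return random.randrange(factor) == 0

# Roughly 0.1% of page views end up sending Navigation Timing data.
views = 100_000
reported = sum(in_sample() for _ in range(views))
print(f"{reported} of {views} simulated page views would report metrics")
```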
Useful dashboards
We have a couple of dashboards that are useful for navigation timing data:
- Compare some metrics with 1-2 weeks ago
- Look at specific metrics
- Breakdown by continent
- Breakdown by country
The alerts come from the alert dashboard.
Where to start
Start by following the runbook for RUM alerts.
Tips and tricks
If you cannot find what caused the regression you can try the Navigation Timing by browser dashboard. Check the report rate: has it changed? It could be that we did a release and accidentally changed how we collect the metrics, or that a new browser version rolled out that affects the metrics. You can see how many metrics we collect for specific browser versions.
Do you see any change in the synthetic metrics? Use both tools to try to nail down the regression. The synthetic tools can more easily show you what has changed (by checking HAR files from before and after the change).
It's possible that further drilling down is required and you may need to slice the data by features other than platform, browser or geography. For this, it's best to use Hive and query the raw RUM data recorded under the NavigationTiming EventLogging schema. Remember to narrow down your Hive queries to the timespan around the regression, as the NavigationTiming table is huge (we record around 14 records per second on average).
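As a starting point for such a query, here is a sketch. The table name (event.navigationtiming), the partition columns and the field names are assumptions based on the usual EventLogging layout in Hive, so verify them against the actual NavigationTiming schema before running the query (for example in beeline or Hue).

```python
# A sketch only: the table, partition columns and fields below are assumptions
# based on the usual EventLogging layout in Hive; verify them against the
# actual NavigationTiming schema before running the query.
HIVE_QUERY = """
SELECT
  event.mediaWikiVersion,
  percentile_approx(event.loadEventEnd, 0.5) AS median_load_event_end,
  COUNT(*) AS sample_size
FROM event.navigationtiming
WHERE year = 2018 AND month = 8 AND day BETWEEN 29 AND 31  -- narrow to the regression window
GROUP BY event.mediaWikiVersion
ORDER BY sample_size DESC
LIMIT 50
"""

print(HIVE_QUERY)  # paste into beeline/Hue, or run through your preferred Hive client
```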