Performance/Synthetic testing
Synthetic testing (or lab testing) is the practice of continuously monitoring frontend performance from a known and controlled environment.
This helps us in the following ways:
- Keeping a close eye on how well Wikipedia is performing by monitoring it for performance regressions, both on desktop computers and mobile phones. This lets us proactively identify and address issues before they affect Google Web Vitals and other crucial user experience metrics, and before they reach our users.
- Collecting more in-depth information than our real user monitoring, so that developers can more easily understand changes in Chrome Web Vitals and other performance metrics.
- The synthetic tools are tuned to mirror the experience of Wikipedia users at the 75th and 95th percentiles of performance. This lets developers readily simulate and understand the challenges faced by users with suboptimal performance, making it easier to identify and address issues that affect those users.
- Our tools enable teams to evaluate solutions before they are deployed to production. Testing for potential performance impacts ahead of a release provides valuable insight into the effects of proposed changes, supports informed decision-making, and reduces the risk of performance-related issues reaching our users.
- Our lab testing servers act as a "wayback machine" by capturing and storing copies of the HTML, CSS, and JavaScript served by the Wikipedia servers. This lets us track changes over time and serves as a reference point for understanding regressions: it answers questions such as "What exact HTML/JS/CSS was served when we encountered that particular problem?". This historical data is invaluable for troubleshooting and identifying the root causes of issues.
- Our lab testing system actively monitors browsers for performance regressions on Wikipedia. As new versions of Chrome and Firefox are released monthly, our testing helps identify degradations in browser performance when accessing Wikipedia, so we can promptly file upstream tasks notifying the browser vendors of the regression. This collaborative approach keeps the browsing experience smooth for Wikipedia users and maintains a strong partnership with browser vendors for ongoing performance improvements.
Responsibilities
The Quality and Test Engineering team (QTE) is responsible for the synthetic test infrastructure. That means the team installs, updates, and monitors the servers that run the tests. The team is also responsible for the software that runs the tests, and makes sure we run our tests on the latest software and the latest browser versions.
The team is also responsible for the Git repository that holds the test configuration and some generic tests that monitor the performance of Wikipedia.
The team is not responsible for creating web performance tests for Wikipedia; it enables other teams to create their own. If you need help with your tests or have questions, tag the Quality And Test Engineering Team in Phabricator or ask directly in the #talk-to-qte Slack channel.
Our Web Performance testing principles
- We test in the open. We use open source tools, and we share our configuration and results in the open. We work to make our metrics and graphs understandable for people outside of the performance community as well.
- We test on multiple browsers. Our users use many different browsers, so we should test on at least two. That helps us understand whether a regression is related to Wikipedia or to a specific browser: if the regression happens in two browsers, we can be pretty sure the problem is on our side; if it happens in one browser, we need to check whether it is on our side or a browser issue (see the sketch after this list).
- We use the same browser versions as our users! Browsers upgrade automatically and are released once a month, so we need to make sure we test on the same versions as our users. Through the years we've seen multiple regressions that only happen in specific browser versions.
- We use dedicated machines/devices for performance testing. Running performance tests on your own machine is hard: other things running on the machine will give you unstable metrics.
- We run the tests as our users with the worst performance. We aim to test at the 75th and 95th percentiles of our users, so that we are sure to find regressions that affect those users. That's why we run tests on slow devices and, when possible, over a slow internet connection.
- We do not use Lighthouse for performance testing. There's so much propaganda from the Chrome team that it's easy to think it's the number one performance tool. It's not. The main issues are that it's Chrome only, that it uses an engine that guesstimates the metrics under different conditions (it's not actually measuring), and that it uses the screenshots from the DevTools trace to create video/visual metrics (which is not the same as what's actually painted on the screen).
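To make the multi-browser principle concrete, here is a minimal sketch of running the same page in two browsers with Browsertime. The URL and iteration count are illustrative, not our production configuration:

    # Run the same test in Chrome and in Firefox. If both regress, the
    # problem is likely on our side; if only one does, suspect the browser.
    browsertime -b chrome -n 5 https://en.wikipedia.org/wiki/Barack_Obama
    browsertime -b firefox -n 5 https://en.wikipedia.org/wiki/Barack_Obama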
Strategy
Synthetic performance testing works by operating a dedicated server or phone somewhere in the world, and controlling it to automatically load a web page and collect performance metrics. Together with our real-user monitoring strategy (which collects statistics from real page views), this is how we monitor the web performance of Wikipedia as a whole.
In order to ensure we can detect changes in performance in a timely fashion (to avoid the "slow-boiling frog" problem), it is important that we produce stable and repeatable measurements. This requires that our tests run in a quiet and consistent environment:
- The browser is launched on a dedicated bare metal server or cloud server. As of 2023, we run our synthetic tests at Hetzner. See also Performance/Synthetic testing/Bare metal (T311980). We are also looking for Android phone hosting providers so we can run tests on real mobile phones (again).
- The browser needs stable connectivity, with the same connectivity each time. We use the Linux tc utility for this (see the sketch after this list).
- The browser version needs to be under our control. We keep track of browser releases, and regularly try out and plan upgrades.
- The Wikipedia page under test needs to be fairly stable. Use of randomised A/B tests and banner campaigns interferes with change detection.
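As a rough sketch of how tc can keep connectivity constant, the commands below add fixed latency and a bandwidth cap to a network interface. The interface name (eth0) and the numbers are assumptions for illustration, not our production values:

    # Add 100 ms of fixed latency (assumed interface name: eth0).
    tc qdisc add dev eth0 root handle 1:0 netem delay 100ms
    # Cap bandwidth on top of the latency with a token bucket filter.
    tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 1600kbit buffer 1600 limit 3000
    # Remove the shaping again after the test run.
    tc qdisc del dev eth0 root

In practice, Browsertime's connectivity options can apply and remove this kind of shaping around each test run.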
Current setup
We have the following setup:
- 4 bare metal servers located at Hetzner in Europe. These servers run on a pinned CPU speed and run tests using WebPageReplay.
- 1 bare metal server located at Hetzner in Europe. It uses a pinned CPU speed and runs tests directly against Wikipedia.
- 1 bare metal server that hosts all the result files (videos/traces/screenshots).
- 1 bare metal server that runs tests directly against Wikipedia. Together with two VPS cloud servers, it runs the on-demand testing setup that lets developers more easily run performance tests on dedicated hardware. This setup will also be used for Android performance testing.
The monitoring setup works like this: the tests and the configuration live in Git. The repository is cloned on the test servers and the tests run on those servers. The output of the tests (video/screenshots/traces/HTML) is stored on the storage server, and the metrics from the tests are sent to Graphite. Our Grafana instance reads the data from the Graphite instance, and alerts for these metrics are set up in Grafana.
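As a simplified sketch of that flow, a sitespeed.io run can ship its metrics directly to a Graphite instance. The hostname and namespace below are placeholders, not our real instance:

    # Run a test and send the metrics to Graphite.
    # graphite.example.org and the namespace are placeholder values.
    sitespeed.io --graphite.host graphite.example.org \
      --graphite.namespace sitespeed_io.synthetic \
      https://en.wikipedia.org/wiki/Barack_Obama

Grafana then reads those series from Graphite, which is where the dashboards and alerts live.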
How it works
We have two types of synthetic tests to measure the frontend performance of Wikipedia:
- For user journeys (visiting multiple pages, or performing actions like login and search), we use sitespeed.io's Browsertime to measure the full performance journey, from a web browser to the server and back.
- For single pages, we use WebPageReplay together with Browsertime to focus solely on the front-end page load performance. We start from an empty browser cache each time. WebPageReplay is a proxy that records HTTP responses and replays them locally with fixed latency. This removes interference from external variables that would otherwise let constantly changing network and server-side conditions influence the metrics.
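For a rough idea of what a replay test looks like, sitespeed.io publishes a Docker setup where record-and-replay can be switched on with environment variables. The image tag, latency value, and URL here are illustrative, not our exact production configuration:

    # Record the page once, then replay it locally with fixed latency
    # while measuring; NET_ADMIN is needed for the traffic shaping.
    docker run --cap-add=NET_ADMIN --rm -v "$(pwd):/sitespeed.io" \
      -e REPLAY=true -e LATENCY=100 \
      sitespeedio/sitespeed.io:latest \
      -n 5 https://en.wikipedia.org/wiki/Barack_Obama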
If you want to run performance tests you can follow these instructions.
Results
You can analyze the results of all synthetic tests in the Grafana: Page drilldown dashboard.
You can choose different types of testing with the device dropdown:
- desktop - tests where we test desktop Wikipedia
- emulatedMobile - tests where we test mobile Wikipedia using a desktop browser, emulating a mobile phone
- android - tests where we test mobile Wikipedia using real mobile phones. We are looking for a new mobile phone provider.
The next dropdown, Test type, chooses which tests to see: first view tests (with a cold browser cache), warm cache views, WebPageReplay tests, or different user journeys.
We also alert on those metrics: for WebPageReplay tests on desktop/emulated mobile, and for first view (cold cache) tests on desktop.
Browsertime and WebPageReplay
We've been using sitespeed.io's Browsertime and WebPageReplay since 2017. Browsertime is the engine of sitespeed.io that controls the browser and collects measurements. Browsertime is also in use at Mozilla to measure the performance of Firefox itself. WebPageReplay is known for being used at Google to monitor the performance of the Chrome browser.
You can read about our setup at Performance/WebPageReplay. We collect metrics and store them on a dedicated Graphite instance.
The journeys and pages that we currently test are configured in Git at https://gerrit.wikimedia.org/g/performance/synthetic-monitoring-tests. We also use Browsertime/WebPageReplay to collect metrics from an Android device. Those tests are configured in https://gerrit.wikimedia.org/g/performance/mobile-synthetic-monitoring-tests
CruX
- Main article: Performance/Metrics#CruX
When to use what tool/test?
If you test the mobile version of Wikipedia, you should run emulated mobile tests (see the sketch below). At the moment we are missing real Android devices for doing the testing. What's good about running Android tests is that you know for sure the performance on that specific Android device, and we can say things like "the first visual change of the Barack Obama page on English Wikipedia regressed by ten percent on a Moto G5 phone".
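As a hypothetical example, an emulated mobile run can be started with sitespeed.io's --mobile flag, which emulates a phone in desktop Chrome (the URL is just an example):

    # Test the mobile site in desktop Chrome with mobile emulation.
    sitespeed.io --mobile -b chrome https://en.m.wikipedia.org/wiki/Barack_Obama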
If you want to find small frontend regressions, testing with WebPageReplay should be your choice. However, at the moment we only run one type of test (first view, cold cache) with WebPageReplay.
If you want to test user journeys, test them directly against the Wikipedia servers using Browsertime. If you are not sure which tests to use, please reach out to the QTE team and we will help you!
Further reading
- "Synthetic monitoring" article on wikipedia.org.