Performance/Synthetic testing

From Wikitech

Synthetic testing (or lab testing) is the practice of continuously monitoring frontend performance from a known and controlled environment.

This provides us with valuable assistance in the following ways:

  • Keeps a close eye on how Wikipedia performs by monitoring for performance regressions, both on desktop computers and on dedicated mobile phones. This allows us to identify and address issues before they affect Google Web Vitals and other crucial user experience metrics, so that problems are caught before they reach users or show up in Google's data.
  • Collects more in-depth information than our real user monitoring, so that developers can more easily understand changes in Chrome Web Vitals and other performance metrics.
  • The synthetic tools are tuned to mirror the experience of Wikipedia users at the 75th or 95th percentile of performance, that is, users whose experience is slower than most. This lets developers reproduce and understand the problems these users face, making it easier to identify and address them.
  • Our tools enable teams to evaluate solutions prior to deployment in the production environment. This empowers them to test for potential performance impacts, providing valuable insights into the effects of proposed changes before they are released to our users. This proactive approach allows for effective decision-making and mitigates the risk of performance-related issues reaching our user base.
  • Our lab testing servers act as a "wayback machine" by capturing and storing copies of the HTML, CSS, and JavaScript that are served by the Wikipedia servers. This allows us to track changes over time and serves as a reference point for understanding regressions. It helps us answer questions such as "What was the exact HTML/JS/CSS served during the time when we encountered that particular problem?" This historical data is invaluable for troubleshooting and identifying the root causes of issues for precise issue resolution.
  • Our lab testing system actively monitors browsers for performance regressions on Wikipedia. As new versions of Chrome and Firefox are released monthly, our testing helps identify any potential degradation in browser performance when accessing Wikipedia. This allows us to promptly create upstream tasks for browser vendors, notifying them of the regression so that they can address and fix the issue. This collaborative approach ensures that Wikipedia users have a smooth browsing experience, and helps maintain a strong partnership with browser vendors for ongoing performance improvements.

Strategy

Synthetic performance testing works by operating a dedicated server or phone somewhere in the world and controlling it to automatically load a web page and collect performance metrics. Together with our real-user monitoring strategy (which collects statistics on real pageviews), this is how we monitor the web performance of Wikipedia as a whole.

In order to ensure we can detect changes in performance in a timely fashion (to avoid the "slow-boiling frog" problem), it is important that we produce stable and repeatable measurements. This requires that our tests run in a quiet and consistent environment:

  • The browser is launched on a dedicated bare metal server or mobile phone (not shared with other customers). As of 2022, we run our synthetic tests at Hetzner. See also Performance/Synthetic testing/Bare metal. (T311980)
  • The browser needs stable connectivity, with the same connectivity each time. We use the Linux tc utility for this.
  • The browser version needs to be under our control. We keep track of browser releases, and regularly try out and plan upgrades.
  • The Wikipedia page under test needs to be fairly stable. Use of randomised A/B tests and banner campaigns interferes with change detection.
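
As an illustration of the connectivity point above, the following is a minimal sketch of how fixed network conditions could be expressed as tc commands using the netem queueing discipline. The interface name and the delay and rate values are assumptions made for this example, not the settings used on our test servers, and the helper name netem_commands is hypothetical. The function only builds the command lines; running them requires root.

```python
from typing import List


def netem_commands(interface: str, delay_ms: int, rate_kbit: int) -> List[List[str]]:
    """Build tc commands that apply a fixed delay and bandwidth cap.

    The interface name and the numbers passed in are illustrative
    assumptions, not the values used on the production test servers.
    """
    if delay_ms < 0 or rate_kbit <= 0:
        raise ValueError("delay must be >= 0 and rate must be > 0")
    # Remove any existing root qdisc so repeated runs start from the same state.
    clear = ["tc", "qdisc", "del", "dev", interface, "root"]
    # netem adds a constant delay and caps the rate of outgoing packets.
    shape = [
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "delay", f"{delay_ms}ms", "rate", f"{rate_kbit}kbit",
    ]
    return [clear, shape]


if __name__ == "__main__":
    for argv in netem_commands("eth0", delay_ms=100, rate_kbit=1600):
        print(" ".join(argv))
```

Applying the same two commands before every run is what keeps connectivity identical between runs, rather than depending on whatever the network happens to deliver that day.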

How it works

We have two types of synthetic tests to measure the frontend performance of Wikipedia:

  1. For user journeys (visiting multiple pages, or performing actions like login and search), we use sitespeed.io's Browsertime to measure the full performance journey, from a web browser to the server and back.
  2. For single pages, we use WebPageReplay together with Browsertime to focus solely on the front-end page load performance. We start from an empty browser cache each time. WebPageReplay is a proxy that records HTTP responses and replays them locally with fixed latency. This removes interference from external variables: without it, constant changes in network and server-side conditions would influence the metrics.
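
The record-and-replay idea can be sketched with a toy in-memory model. This is not how WebPageReplay is implemented; the class name ReplayStore and its methods are hypothetical and exist only to show the two phases: store each response once, then serve it back with the same delay every time.

```python
import time
from typing import Dict


class ReplayStore:
    """Toy record-and-replay store: one recorded body per URL, fixed replay latency."""

    def __init__(self, latency_s: float) -> None:
        self.latency_s = latency_s
        self._recorded: Dict[str, bytes] = {}

    def record(self, url: str, body: bytes) -> None:
        # Record phase: keep the response exactly as the live server sent it.
        self._recorded[url] = body

    def replay(self, url: str) -> bytes:
        # Replay phase: never touch the network; unknown URLs are an error.
        if url not in self._recorded:
            raise KeyError(f"no recorded response for {url}")
        time.sleep(self.latency_s)  # the same delay on every request
        return self._recorded[url]
```

Because the replayed bytes and the delay never change between runs, any change in the measured metrics has to come from the browser or the page's own frontend code, not from the network or the servers.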

Results

You can analyze the results of all synthetic tests in the Grafana: Page drilldown dashboard.

You can choose different types of testing with the device dropdown:

  • desktop - tests where we test desktop Wikipedia
  • emulatedMobile - tests where we test mobile Wikipedia using a desktop browser, emulating a mobile phone
  • android - tests mobile Wikipedia using real mobile phones

The next dropdown, Test type, chooses which test to show. It can be first view tests (with a cold browser cache), warm cache views, WebPageReplay tests, or different user journeys.

We also alert on those metrics, using the WebPageReplay tests for desktop, emulated mobile, and Android, and the first view (cold cache) tests on desktop.

Browsertime and WebPageReplay

We've been using sitespeed.io's Browsertime and WebPageReplay since 2017. Browsertime is the engine of sitespeed.io that controls the browser and collects measurements. Browsertime is also in use at Mozilla to measure the performance of Firefox itself. WebPageReplay is known for being used at Google to monitor the performance of the Chrome browser.

You can read about our setup at Performance/WebPageReplay. We collect metrics and store them on a dedicated Graphite instance.
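
For readers unfamiliar with Graphite, the sketch below formats a single sample in Graphite's plaintext protocol, where each line is a metric path, a value, and a Unix timestamp. The metric path in the usage note is invented for illustration and is not one of our real keys. In our setup the tooling sends the metrics itself, so this only shows the shape of the data.

```python
def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one sample as '<path> <value> <timestamp>' plus a newline."""
    if not path or " " in path:
        raise ValueError("metric path must be non-empty and contain no spaces")
    return f"{path} {value} {timestamp}\n"
```

For example, graphite_line("example.desktop.firstVisualChange", 1234, 1700000000) produces one line that a Graphite plaintext listener can ingest.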

The journeys and pages that we currently test are configured in Git at https://gerrit.wikimedia.org/g/performance/synthetic-monitoring-tests. We also use Browsertime/WebPageReplay to collect metrics from an Android device. Those tests are configured in https://gerrit.wikimedia.org/g/performance/mobile-synthetic-monitoring-tests

CrUX

Google uses performance metrics collected within the Chrome browser from websites people visit (from users who "opt-in" by syncing their Google account with Chrome). These are used by Google Search to know how a website performs in reality, and this data is available publicly as part of their Chrome User Experience Report. In order to keep track of how Wikipedia is doing from Google's point of view, we collect this data once a day from the Google API and store it in Graphite. You can explore these metrics on our Chrome User Experience dashboard in Grafana.

The daily crawl runs on the gpsi.webperf.eqiad1.wikimedia.cloud server, where we run a couple of tests and record whether we are rated slow, moderate, or fast. The data is collected using the sitespeed.io CrUX plugin.
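
To give a feel for the data involved, here is a minimal sketch of building a query for the public Chrome UX Report API and reducing one metric's histogram to fast, moderate, and slow shares. It assumes the API's queryRecord request and response shape (three histogram bins per metric, each with a density). The helper names are hypothetical, and this is not the code of the sitespeed.io plugin.

```python
import json
from typing import Any, Dict, List

# Endpoint of the public CrUX API; requests are POSTed with an API key.
CRUX_ENDPOINT = "https://chromeuxreport.googleapis.com/v1/records:queryRecord"


def build_query(origin: str, form_factor: str) -> bytes:
    """JSON body for a queryRecord request, e.g. form_factor 'PHONE' or 'DESKTOP'."""
    return json.dumps({"origin": origin, "formFactor": form_factor}).encode("utf-8")


def summarize(metric: Dict[str, Any]) -> Dict[str, float]:
    """Map the three histogram bins of one metric to fast/moderate/slow shares."""
    bins: List[Dict[str, Any]] = metric["histogram"]
    if len(bins) != 3:
        raise ValueError("expected three histogram bins")
    labels = ("fast", "moderate", "slow")
    # Bins are ordered from best to worst; a missing density is treated as 0.
    return {label: float(b.get("density", 0.0)) for label, b in zip(labels, bins)}
```

The three shares returned by summarize are the kind of values that can be stored as separate time series, so that a shift from fast towards slow shows up on a dashboard.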

When to use what tool/test?

If you test the mobile version of Wikipedia, you should run tests on Android as well as our emulated mobile tests. The benefit of running Android tests is that you know for sure what the performance is on that specific Android device, so we can say things like "the first visual change of the Barack Obama page on English Wikipedia regressed by ten percent on a Moto G 5 phone".
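
A statement such as "regressed by ten percent" can be derived by comparing the median of a set of runs before a change with the median after it. The sketch below shows that arithmetic; using the median is an assumption made for this example, and the function name is hypothetical.

```python
from statistics import median
from typing import Sequence


def percent_change(before_ms: Sequence[float], after_ms: Sequence[float]) -> float:
    """Relative change of the median in percent; a positive value means slower."""
    baseline = median(before_ms)
    if baseline <= 0:
        raise ValueError("baseline median must be positive")
    return (median(after_ms) - baseline) * 100.0 / baseline
```

With made-up first visual change timings of 980, 1000, and 1020 ms before and 1090, 1100, and 1120 ms after, the medians are 1000 ms and 1100 ms, which is a ten percent regression.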

If you want to find small frontend regressions, use WebPageReplay. However, at the moment we only run single-page (first view, cold cache) tests with WebPageReplay.

If you want to test user journeys, test them directly against the Wikipedia servers using Browsertime. If you are not sure which tests to use, please reach out to the performance team and we will help you!

Further reading