WebPageReplay

Background

To get more stable metrics in our synthetic testing, we have tried mahimahi, mitmproxy and WebPageReplay for recording and replaying Wikipedia. For mahimahi we used a version patched by Gilles, based on Benedikt Wolters' HTTP/2 version at https://github.com/worenga/mahimahi-h2o. For mitmproxy and WebPageReplay we used the default versions. The work was done in T176361.

We have put mahimahi on ice because getting HTTP/2 to work with it is too much of a hack at the moment, whereas WebPageReplay works out of the box with HTTP/2. mitmproxy worked fine but offered no clear benefit over WebPageReplay.

Replaying vs non replaying

Let us compare how the metrics look in WebPageTest versus WebPageReplay (Chrome).

WebPageReplay setup

The current version that collects the data for https://grafana.wikimedia.org/d/IvAfnmLMk/page-drilldow runs in a Docker container with this setup:

We run all tests on a couple of bare metal servers. We have tried running the same code on AWS, WMCS and Google Cloud, and in all those cases the stability of the metrics over time was at least 2 to 4 times worse than on bare metal. Over time, the metrics are most stable on bare metal.

Servers

We run tests from four servers at the moment:

hetzner1 - Run Chrome tests for desktop against enwiki.
hetzner2 - Run Chrome tests for desktop against group 0 and group 1.
hetzner3 - Run Firefox tests for desktop against group 0, group 1 and enwiki.
hetzner4 - Run emulated mobile tests for group 0, group 1 and enwiki.

Upgrade to a new version

Check out WebPageReplay/Runbook#Update to new version.

Alerts

We also run alerts on the metrics we collect from WebPageReplay. Check out Performance/Guides/WebPageReplay alert.

Maintenance

Add a new URL to test

All configuration files live in our synthetic monitoring tests repo. Clone the repo and go into the tests folder:

git clone ssh://USERNAME@gerrit.wikimedia.org:29418/performance/synthetic-monitoring-tests.git
cd synthetic-monitoring-tests/tests

All test files live in that directory. The WebPageReplay tests are split across four directories:

desktopReplayChromeEnwiki - run Chrome tests for desktop for the English Wikipedia.
desktopReplayChromeGroups - run Chrome tests for desktop against group 0 and group 1. With these tests we aim to find regressions before they land on the English Wikipedia.
desktopReplayFirefox - run Firefox tests for desktop for group 0, group 1 and the English Wikipedia.
emulatedMobileReplay - the emulated mobile tests for group 0, group 1 and English Wikipedia. These emulated tests can go away when we run tests on real devices.
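
Before editing anything, it can help to confirm that the checkout contains the four directories listed above. The following is a minimal sketch, not part of the repo: the function name list_replay_dirs is hypothetical, and it only assumes the directory names given in this section.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): report which of the four
# WebPageReplay test directories named on this page exist under a tests folder.
# Usage: list_replay_dirs [path-to-tests-folder]   (defaults to ".")
list_replay_dirs() {
  base=${1:-.}
  for dir in desktopReplayChromeEnwiki desktopReplayChromeGroups \
             desktopReplayFirefox emulatedMobileReplay; do
    if [ -d "$base/$dir" ]; then
      echo "found:   $dir"
    else
      echo "missing: $dir"
    fi
  done
}
```

Run from inside synthetic-monitoring-tests/tests, calling list_replay_dirs with no argument prints one line per directory.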

Debug missing metrics

If metrics stop arriving in Grafana, there are two likely causes: either something is wrong with Graphite or something is broken on the WebPageReplay server.

Let us focus on the WebPageReplay server. Log into the server and check whether any test is running. Do that by running docker ps

If everything is ok it should look something like:

CONTAINER ID        IMAGE                             COMMAND                  CREATED              STATUS              PORTS               NAMES
548ccf495779        sitespeedio/sitespeed.io:15.4.0   "/start.sh --graphit…"   About a minute ago   Up About a minute                       sitespeedio

We start (and stop) the container for every new test, so the container should have been created at most a couple of minutes ago. If the CREATED column shows a couple of hours (or days) ago, something is wrong: the container is stuck, probably because something happened with the browser. You can fix the test by killing the container:

docker kill sitespeedio

The root cause (whatever made the container get stuck) is still there, but restarting the test usually works. After you have killed the container, wait a minute and check that a new container is running by using docker ps again.

CONTAINER ID        IMAGE                             COMMAND                  CREATED             STATUS              PORTS               NAMES
f31eadffbcf0        sitespeedio/sitespeed.io:15.4.0   "/start.sh --graphit…"   4 seconds ago       Up 3 seconds                            sitespeedio

Check the job again during the coming hour to make sure it doesn't get stuck again.
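
The manual check described above can be sketched as a small shell helper. This is only an illustration under stated assumptions, not a script that exists on the servers: the function names is_stuck and check_sitespeedio are hypothetical, and the age test is coarse, treating any container whose age is reported in hours or longer as stuck. It relies on the RunningFor placeholder of docker ps, which prints the same human-readable age as the CREATED column.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when a docker ps age string such as
# "3 hours ago" indicates a container that is far older than a few minutes.
is_stuck() {
  case "$1" in
    *hour*|*day*|*week*|*month*|*year*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: look up the age of the container named sitespeedio
# (the name shown in the docker ps output on this page) and kill it if it
# looks stuck. Requires docker on the host.
check_sitespeedio() {
  age=$(docker ps --filter name=sitespeedio --format '{{.RunningFor}}')
  if [ -z "$age" ]; then
    echo "no sitespeedio container is running right now"
  elif is_stuck "$age"; then
    echo "container looks stuck (created $age), killing it"
    docker kill sitespeedio
  else
    echo "container looks healthy (created $age)"
  fi
}
```

Calling check_sitespeedio on a server performs one check. Because a new container is started for every test, an empty result between two tests is not necessarily a problem; run the check again after a minute before drawing conclusions.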