To get more stable metrics in our synthetic testing, we have been trying out mahimahi, mitmproxy and WebPageReplay to record and replay Wikipedia. For mahimahi we used a patched version, fixed by Gilles, of Benedikt Wolters' HTTP/2 version: https://github.com/worenga/mahimahi-h2o. For mitmproxy and WebPageReplay we use the default versions. The work was done in T176361.
We have put mahimahi on ice because getting HTTP/2 to work is too much of a hack at the moment, whereas WebPageReplay works out of the box with HTTP/2. mitmproxy worked fine but offered no clear benefit over WebPageReplay.
Replaying vs non replaying
Let us compare how the metrics look with WebPageTest versus WebPageReplay (Chrome).
The current run that collects the data for https://grafana.wikimedia.org/d/IvAfnmLMk/page-drilldow is a Docker container with this setup:
We run all tests on one bare metal server. We have tried running the same code on AWS, WMCS and Google Cloud, and in all those cases metric stability over time was at least 2 to 4 times worse than on bare metal. Over time, the metrics are most stable on bare metal.
On desktop we can use 30 frames per second for the video, and we get a metric stability span of 33 ms for first visual change. That is one frame of accuracy, since at 30 fps one frame represents 33.33 ms. Speed Index's stability span is a little wider but still OK (less than 50 points, though it depends on the content).
For emulated mobile we can use 30 frames per second. We have seen that 60 fps would also work, but at some point we will hit the limits of the browser and OS. We run both desktop and mobile with 100 ms simulated latency during the replays.
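The frame arithmetic above can be checked with a small shell sketch. It only illustrates the calculation: frame_ms is a hypothetical helper name and is not part of sitespeed.io or WebPageReplay.

```shell
# Duration of one video frame in milliseconds for a given frame rate.
# frame_ms is a hypothetical helper used only for this illustration.
frame_ms() {
  awk -v fps="$1" 'BEGIN { printf "%.2f\n", 1000 / fps }'
}

frame_ms 30   # prints 33.33
frame_ms 60   # prints 16.67
```

At 60 fps a single frame covers roughly half the time, so the best possible accuracy for first visual change would be about 17 ms instead of 33 ms, provided the browser and OS can keep up.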
We run tests from one server at the moment:
- hetzner1 - runs all tests with WebPageReplay for desktop and emulated mobile.
Upgrade to a new version
We also run alerts on the metrics we collect from WebPageReplay. Check out Performance/Guides/WebPageReplay alert.
Add a new URL to test
All configuration files exist in our synthetic monitoring tests repo. Clone the repo and go into the tests folder:
git clone ssh://USERNAME@gerrit.wikimedia.org:29418/performance/synthetic-monitoring-tests.git
cd synthetic-monitoring-tests/tests
All test files live in that directory. The WebPageReplay tests exist in four directories:
- desktopReplay - the bulk of the test URLs that get tested for desktop. All these tests run on one machine.
- desktopReplayInstant - the desktop tests for the English Wikipedia. We try to keep this list as short as possible and run the tests as often as possible to get fast feedback.
- emulatedMobileReplay - the bulk of the test URLs that get tested for emulated mobile. All these tests run on one machine.
- emulatedMobileInstantReplay - the emulated mobile tests for the English Wikipedia. We try to keep this list as short as possible and run the tests as often as possible to get fast feedback.
Debug missing metrics
If metrics stop arriving in Grafana, there are two possible causes: either something is wrong with Graphite, or something is broken on the WebPageReplay server.
Let us focus on the WebPageReplay server. Log into the server and check whether any tests are running. Do that by running
If everything is ok it should look something like:
CONTAINER ID   IMAGE                             COMMAND                  CREATED              STATUS              PORTS   NAMES
548ccf495779   sitespeedio/sitespeed.io:15.4.0   "/start.sh --graphit…"   About a minute ago   Up About a minute           sitespeedio
We start (and stop) the container for every new test, so the container should have been created at most a couple of minutes ago. If it was created a couple of hours (or days) ago, something is wrong: the container is stuck, probably because something happened with the browser. You can fix the test by killing the container:
docker kill sitespeedio
The root cause (whatever made the container get stuck) is still there, but restarting the test usually works. After you have killed the container, wait a minute and check again that a new container is running by using
CONTAINER ID   IMAGE                             COMMAND                  CREATED         STATUS         PORTS   NAMES
f31eadffbcf0   sitespeedio/sitespeed.io:15.4.0   "/start.sh --graphit…"   4 seconds ago   Up 3 seconds           sitespeedio
Check the job over the coming hour to make sure it doesn't get stuck again.
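The rule of thumb above (a container created minutes ago is fine, one created hours or days ago is stuck) can be sketched as a small shell helper. This is an illustration under assumptions, not part of the actual setup: container_state is a hypothetical name, and the sketch assumes the CREATED text has the form shown in the listings above (for example "4 seconds ago" or "3 hours ago").

```shell
# Classify a container by the human-readable CREATED text from the listing.
# container_state is a hypothetical helper; the "hours or more means stuck"
# threshold is taken from the guidance above.
container_state() {
  case "$1" in
    *hour*|*day*|*week*|*month*|*year*) echo "stuck" ;;
    *) echo "ok" ;;
  esac
}

container_state "About a minute ago"   # prints ok
container_state "3 hours ago"          # prints stuck
```

Assuming the listings above come from docker ps, the CREATED value for the test container could probably be fetched with something like docker ps --filter name=sitespeedio --format '{{.RunningFor}}' and passed to the helper. If it reports "stuck", kill the container as described above.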