Performance/WebPageReplay

Background

On the path to more stable metrics in our synthetic testing, we have been trying out mahimahi, mitmproxy and WebPageReplay to record and replay Wikipedia. For mahimahi we used a version patched by Gilles on top of Benedikt Wolters' HTTP/2 fork, https://github.com/worenga/mahimahi-h2o. With mitmproxy and WebPageReplay we use the default versions. The work has been done in T176361.

We have put mahimahi on ice because it is too much of a hack to get HTTP/2 working at the moment, while WebPageReplay supports HTTP/2 out of the box. mitmproxy worked fine but offered no clear benefit over WebPageReplay.

Replaying vs non-replaying

Let us compare what the metrics look like for WebPageTest versus WebPageReplay (Chrome):

  • Compare emulated mobile First Visual Change on Obama
  • Compare emulated mobile Speed Index on Obama
  • First Visual Change on desktop using WebPageTest vs WebPageReplay
  • Compare Speed Index on desktop using WebPageTest vs WebPageReplay

WebPageReplay setup

The current setup that collects the data for https://grafana.wikimedia.org/d/000000059/webpagereplay-drilldown runs in a Docker container:

[Figure: WebPageReplay setup]

Running on AWS (instance type c5.xlarge) we get stable metrics. We have tried running the same code on WMCS, on bare metal and on Google Cloud, and in all those cases the metric stability over time was at least 2 to 4 times worse than on AWS. This difference remains unexplained and probably lies somewhere in AWS's secret sauce (custom hypervisor, custom kernel).

On desktop we can use 30 frames per second for the video, which gives a metric stability span of 33 ms for First Visual Change. That is one frame of accuracy, since at 30 fps one frame represents 33.33 ms. Speed Index's stability span is a little wider but still ok (less than 50 points, though it depends on the content).

For emulated mobile we also use 30 frames per second; we have seen that 60 fps would work too, but at some point we hit the limits of the browser and OS. We run both the desktop and mobile tests with 100 ms of simulated latency during the replays.
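For reference, a single replay run with the sitespeed.io container looks roughly like this. This is a minimal sketch based on sitespeed.io's documented WebPageReplay support; the version tag, iteration count and URL are just examples:

docker run --cap-add=NET_ADMIN --rm -v "$(pwd)":/sitespeed.io \
  -e REPLAY=true -e LATENCY=100 \
  sitespeedio/sitespeed.io:15.4.0 \
  --video --speedIndex -n 5 \
  https://en.wikipedia.org/wiki/Barack_Obama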

Servers

We run tests from three servers at the moment:

  • wpr-mobile.wmftest.org - runs emulated mobile tests with WebPageReplay, plus user journeys
  • wpr-desktop.wmftest.org - runs desktop tests with WebPageReplay for Chrome
  • wpr-enwiki.wmftest.org - runs enwiki tests with WebPageReplay for Chrome, on desktop and emulated mobile

Access

Access the servers with the pem file:

# Running emulated mobile tests
ssh -i "sitespeedio.pem" ubuntu@wpr-mobile.wmftest.org 
# Running desktop tests
ssh -i "sitespeedio.pem" ubuntu@wpr-desktop.wmftest.org
# Running enwiki tests
ssh -i "sitespeedio.pem" ubuntu@ec2-3-94-53-255.compute-1.amazonaws.com

Set up a new server

Here are the details of our current setup. We currently run desktop and emulated mobile Chrome tests on a c5.xlarge VM on AWS using Ubuntu 18.

Install

Install it manually:

  1. Install Docker and grant your user the privileges needed to run Docker.
  2. Create a config directory where we place the secrets (AWS keys etc.): mkdir /config
  3. Clone the repo with the tests (in your home dir): git clone https://github.com/wikimedia/performance-synthetic-monitoring-tests.git
  4. Take a copy of the /config/secret.json file that exists on one of the currently running servers and add it to /config/
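A minimal sketch of those steps as shell commands, assuming a fresh Ubuntu host with an ubuntu user (the Docker convenience script and the scp source host are illustrative, not prescribed):

# install Docker and let the ubuntu user run it
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker ubuntu

# directory for the secrets
sudo mkdir /config

# clone the tests into the home dir
cd ~ && git clone https://github.com/wikimedia/performance-synthetic-monitoring-tests.git

# copy secret.json from one of the currently running servers (example host)
scp -i sitespeedio.pem ubuntu@wpr-desktop.wmftest.org:/config/secret.json /tmp/secret.json
sudo mv /tmp/secret.json /config/secret.json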

Also make sure the script starts on server restart. When you start the script you choose which tests to run by pointing to one or more test directories, which means the start command looks different on different machines.

Run crontab -e and add:

@reboot rm /home/ubuntu/performance-synthetic-monitoring-tests/sitespeed.run;/home/ubuntu/performance-synthetic-monitoring-tests/loop.sh THE_TEST_DIR

That will remove the run file and restart everything if the server reboots.

The last step is to create a welcome message that is shown when you log in to the server. Run sudo nano /etc/profile.d/greeting.sh and add something like:

echo "This server runs tests testing Desktop Wikipedia using WebPageReplay"
echo "Start: nohup /home/ubuntu/performance-synthetic-monitoring-tests/loop.sh TEST_DIR &"
echo "Stop: rm /home/ubuntu/performance-synthetic-monitoring-tests/sitespeed.run && tail -f /tmp/sitespeed.io.log"

Make sure TEST_DIR and the message match what you run on your server.

Set up AWS monitoring

When you create a new instance, you also need to set up monitoring for it in AWS. Create an alarm on outgoing network traffic (NetworkOut) that fires when it is <= 0 bytes for 3 out of 3 data points within 1 hour. Assign the alert to the email group Performance-alerts.
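The alarm can be created in the console, or sketched with the AWS CLI as below; the instance id and SNS topic ARN are hypothetical placeholders:

# 3 evaluation periods of 20 minutes each = 1 hour
aws cloudwatch put-metric-alarm \
  --alarm-name wpr-desktop-no-network-out \
  --namespace AWS/EC2 \
  --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Sum \
  --period 1200 \
  --evaluation-periods 3 \
  --datapoints-to-alarm 3 \
  --threshold 0 \
  --comparison-operator LessThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:Performance-alerts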

Start and restart

You start the script by giving it the folder with the tests to run. If we run all the desktop tests on the same machine, we do that with:

Start the script: nohup /home/ubuntu/performance-synthetic-monitoring-tests/loop.sh desktopReplay &

Restart: first remove /home/ubuntu/performance-synthetic-monitoring-tests/sitespeed.run, then tail the log and wait for the script to exit. Then start as usual.
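Concretely, a restart looks like this (using desktopReplay as the example test directory):

rm /home/ubuntu/performance-synthetic-monitoring-tests/sitespeed.run
# follow the log and wait until the current run finishes and the script exits
tail -f /tmp/sitespeed.io.log
# then start as usual
nohup /home/ubuntu/performance-synthetic-monitoring-tests/loop.sh desktopReplay &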

Log

You can find the log file at /tmp/sitespeed.io.log. It contains all the log entries from sitespeed.io.
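To follow the log in real time, or to search it for problems (the grep pattern is just an example):

tail -f /tmp/sitespeed.io.log
grep -i error /tmp/sitespeed.io.log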

Upgrade to a new version

Check out Performance/Runbook/WebPageReplay#Update to new version.

Alerts

We also run alerts on the metrics we collect from WebPageReplay. See Performance/WebPageReplay/Alerts.

Maintenance

Add a new URL to test

All configuration files exist in our synthetic monitoring tests repo. Clone the repo and go into the tests folder:

git clone ssh://USERNAME@gerrit.wikimedia.org:29418/performance/synthetic-monitoring-tests.git
cd synthetic-monitoring-tests/tests

All test files live in that directory. WebPageReplay tests exist in four directories:

  • desktopReplay - the bulk of the test URLs that get tested on desktop. All these tests run on one machine.
  • desktopReplayInstant - the desktop tests for the English Wikipedia; we try to keep this list as short as possible and run it as often as possible to get fast feedback.
  • emulatedMobileReplay - the bulk of the test URLs that get tested on emulated mobile. All these tests run on one machine.
  • emulatedMobileInstantReplay - the emulated mobile tests for the English Wikipedia; we try to keep this list as short as possible and run it as often as possible to get fast feedback.

The directory structure looks like this; each wiki has its own file containing the URLs that are tested for that wiki.

.
├── desktop
│   ├── alexaTop10.txt
│   ├── coronaVirusSecondView.js
│   ├── desktop.txt
│   ├── elizabethSecondView.js
│   ├── facebookSecondView.js
│   ├── latestTechBlogPost.js
│   ├── latestTechBlogPostSecondView.js
│   ├── loginDesktop.js
│   ├── mainpageSecondView.js
│   ├── searchGoogleObama.js
│   ├── searchHeaderObama.js
│   ├── searchLegacyObama.js
│   ├── searchPageObama.js
│   ├── searchPortalObama.js
│   └── shared
│       └── searchScriptFactory.js
├── desktopReplay
│   ├── arwiki.wpr
│   ├── awiki.wpr
│   ├── beta.wpr
│   ├── dewiki.wpr
│   ├── eswiki.wpr
│   ├── frwiki.wpr
│   ├── group0.wpr
│   ├── group1.wpr
│   ├── nlwiki.wpr
│   ├── ruwiki.wpr
│   ├── svwiki.wpr
│   └── zhwiki.wpr
├── desktopReplayInstant
│   └── enwiki.wpr
├── emulatedMobile
│   ├── alexaMobileTop10.txt
│   ├── elizabethSecondView.js
│   ├── emulatedMobile.txt
│   ├── facebookSecondView.js
│   ├── latestTechBlogPost.js
│   ├── latestTechBlogPostSecondView.js
│   ├── loginEmulatedMobile.js
│   ├── searchGoogleObama.js
│   └── searchPageObama.js
├── emulatedMobileInstantReplay
│   └── enwiki.wpr
├── emulatedMobileReplay
│   ├── arwiki.wpr
│   ├── beta.wpr
│   ├── dewiki.wpr
│   ├── eswiki.wpr
│   ├── group0.wpr
│   ├── group1.wpr
│   ├── jawiki.wpr
│   ├── nlwiki.wpr
│   ├── ruwiki.wpr
│   ├── rwiki.wpr
│   ├── svwiki.wpr
│   └── zhwiki.wpr
├── webpagetestDesktop
│   ├── webpagetest.beta.wpt
│   ├── webpagetest.enwiki.wpt
│   ├── webpagetest.ruwiki.wpt
│   └── webpagetest.wikidata.wpt
└── webpagetestEmulatedMobile
    └── webpagetestEmulatedMobile.wpt

Let's have a look at svwiki.wpr; it contains three URLs:

https://sv.wikipedia.org/wiki/Facebook 
https://sv.wikipedia.org/wiki/Stockholm 
https://sv.wikipedia.org/wiki/Astrid_Lindgren

If you want to add a new URL to be tested on the Swedish Wikipedia, you open svwiki.wpr, add the URL on a new line, and commit the result. Once the commit has passed Gerrit review, the URL will be picked up automatically by the test agent on the next iteration.
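For example, adding a (hypothetical) article to the Swedish Wikipedia desktop tests; git review assumes you have git-review set up for Gerrit:

cd synthetic-monitoring-tests/tests/desktopReplay
# add the new URL on its own line
echo "https://sv.wikipedia.org/wiki/Selma_Lagerl%C3%B6f" >> svwiki.wpr
git add svwiki.wpr
git commit -m "Add Selma Lagerlöf to the svwiki desktop tests"
git review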

Debug missing metrics

If metrics stop arriving in Grafana, the reason can be one of two things: either something is wrong with Graphite, or something is broken on the WebPageReplay server.

Let us focus on the WebPageReplay server. Log in to the server and check if any tests are running, by running docker ps

If everything is ok it should look something like:

CONTAINER ID        IMAGE                             COMMAND                  CREATED              STATUS              PORTS               NAMES
548ccf495779        sitespeedio/sitespeed.io:15.4.0   "/start.sh --graphit…"   About a minute ago   Up About a minute                       sitespeedio

We start (and stop) the container for every new test, so the running container should have been created at most a couple of minutes ago. If it was created a couple of hours (or days) ago, something is wrong: the container is stuck, probably because something happened with the browser. You can fix the test by killing the container:

docker kill sitespeedio

The root cause (whatever made the container get stuck) is still there, but restarting the test usually works. After you have killed the container, wait a minute and check with docker ps that a new container is running.

CONTAINER ID        IMAGE                             COMMAND                  CREATED             STATUS              PORTS               NAMES
f31eadffbcf0        sitespeedio/sitespeed.io:15.4.0   "/start.sh --graphit…"   4 seconds ago       Up 3 seconds                            sitespeedio

Check the job over the coming hour to make sure it doesn't get stuck again.
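If you only want the container age at a glance, a standard Docker format string works (not specific to this setup):

docker ps --format 'table {{.Names}}\t{{.CreatedAt}}\t{{.Status}}'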

See also