Performance/Synthetic testing/Bare metal

Synthetic tests on a physical server

We run our synthetic tests on a physical server ("bare metal") to get metrics that are as stable as possible. You can read about how we evaluated running on a physical server in T203060.

The physical server gives us the following advantages:

  1. We don't get the noisy neighbour effect, so we get stable metrics over time.
  2. We can adjust the CPU frequency on the machine to match the speed of our users.

At the moment we run our desktop tests with a CPU frequency that matches the 90th percentile of our users in India. For emulated mobile we match the 90th/95th percentile. By matching those users, we know that we can pick up regressions that will be visible to them.

Synthetic tests

We run tests using WebPageReplay on our physical server. WebPageReplay is a replay proxy that tries to remove the noise of the internet. We have one server running those tests.

Setup a bare metal server

This document explains how the bare metal server at Hetzner is set up. Through the Hetzner setup, you can choose the base OS. We use Ubuntu 22.04 to be able to run Chrome tests directly without using Docker in the future.

Start by updating the machine:

sudo apt-get update 
sudo apt-get upgrade

Install Docker

Install Docker (we use Docker for the WebPageReplay tests):

sudo apt install apt-transport-https curl gnupg-agent ca-certificates software-properties-common -y
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt install docker-ce docker-ce-cli containerd.io -y

Setup the user

Create a new user that will run the tests and make sure that user can use sudo (needed for using Linux traffic control, tc):

adduser sitespeedio 
usermod -aG sudo sitespeedio
su - sitespeedio
echo "sitespeedio ALL=(ALL:ALL) NOPASSWD:ALL" | sudo tee "/etc/sudoers.d/sitespeedio"

Make sure the new user can run Docker without using sudo:

sudo usermod -aG docker ${USER} 
su - ${USER}

Run test without Docker

If you want to run tests without using Docker you need to install the dependencies directly.

Install dependencies for sitespeed.io

Install dependencies to be able to run sitespeed.io (use NodeJS LTS version):

curl -sL https://deb.nodesource.com/setup_18.x -o nodesource_setup.sh
sudo bash nodesource_setup.sh
sudo apt install -y nodejs

To be able to record a video we need FFmpeg, a couple of Python libraries and xvfb. Net-tools is needed to use Linux traffic control:

sudo apt-get update -y && sudo apt-get install -y ffmpeg
python3 -m pip install pyssim opencv-python numpy
sudo apt-get install -y xvfb
sudo apt-get install -y net-tools

Install sitespeed.io

Make sure you can install using npm without using sudo. Check out Sindre Sorhus's guide and then install sitespeed.io:

npm install sitespeed.io --location=global

Install browsers

If you want to run without Docker you need to install the browsers manually. Install Chrome:

wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
sudo apt update
sudo apt install -y google-chrome-stable

To install Firefox without using Snap on Ubuntu 22.04 follow this guide.
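
Before running tests without Docker, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch, not part of the original setup; the binary names in the list (for example Xvfb, ifconfig, google-chrome and sitespeed.io) are assumptions about what the packages install and may need adjusting.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed binary names for the packages installed above; adjust as needed.
missing="$(missing_tools node npm ffmpeg python3 Xvfb ifconfig tc google-chrome sitespeed.io)"
if [ -n "$missing" ]; then
  echo "Missing tools:"
  echo "$missing"
else
  echo "All tools found"
fi
```

If the script lists a tool as missing, re-run the matching install step before starting the tests.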

Setup unattended-upgrades

At the moment we run unattended upgrades to make sure we keep the machine up to date.

sudo apt-get install unattended-upgrades -y
sudo dpkg-reconfigure -plow unattended-upgrades

Pin the CPU frequency and use performance governor

We want the machine to use the same CPU frequency all the time since that will make our metrics more stable. Make sure you are root and install cpufrequtils

sudo apt-get install cpufrequtils

Then you can check the current setup with the cpufreq-info command. It shows which governor is used for each CPU and the minimum and maximum limits. The box we use at the moment has 8 CPUs with a minimum of 800 MHz and a maximum of 4 GHz. We set it up to use the performance governor and run at 1 GHz. You need to configure that for each and every CPU.

cpufreq-set -d 1.00Ghz -u 1.00Ghz -g performance -c 0
cpufreq-set -d 1.00Ghz -u 1.00Ghz -g performance -c 1
cpufreq-set -d 1.00Ghz -u 1.00Ghz -g performance -c 2
cpufreq-set -d 1.00Ghz -u 1.00Ghz -g performance -c 3
cpufreq-set -d 1.00Ghz -u 1.00Ghz -g performance -c 4
cpufreq-set -d 1.00Ghz -u 1.00Ghz -g performance -c 5
cpufreq-set -d 1.00Ghz -u 1.00Ghz -g performance -c 6
cpufreq-set -d 1.00Ghz -u 1.00Ghz -g performance -c 7

You can verify that it worked by running cpufreq-info again.
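
The same check can be scripted by reading the kernel's cpufreq files in sysfs. This is a sketch under assumptions: it expects the standard scaling_governor, scaling_min_freq and scaling_max_freq files (values in kHz), the function name check_pinning is made up for this example, and depending on the CPU driver the reported frequencies may be rounded and not match exactly.

```shell
#!/bin/sh
# Print every CPU under sysfs root $1 whose governor is not $2 or whose
# min/max frequency (in kHz) is not $3. Returns non-zero if any CPU deviates.
# CPUs without a readable cpufreq directory are skipped.
check_pinning() {
  root="$1"; want_gov="$2"; want_khz="$3"; bad=0
  for dir in "$root"/cpu[0-9]*/cpufreq; do
    [ -r "$dir/scaling_governor" ] || continue
    gov="$(cat "$dir/scaling_governor")"
    min="$(cat "$dir/scaling_min_freq")"
    max="$(cat "$dir/scaling_max_freq")"
    if [ "$gov" != "$want_gov" ] || [ "$min" != "$want_khz" ] || [ "$max" != "$want_khz" ]; then
      echo "$dir: governor=$gov min=$min max=$max"
      bad=1
    fi
  done
  return "$bad"
}

# 1 GHz is 1000000 kHz in sysfs units.
check_pinning /sys/devices/system/cpu performance 1000000 \
  || echo "One or more CPUs are not pinned as expected"
```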

The CPU benchmark is then at 200 ms. That almost matches the 90th percentile of our desktop users in India. For the emulated mobile tests we slow down the CPU some more using Chrome's built-in CPU throttler so that the CPU benchmark is at 420 ms, which matches roughly the 90th/95th percentile of mobile users in India.

For some of our test servers we run them at max speed, so instead of setting them to 1.00Ghz we use whatever maximum they have. The reason is that running the WebPageReplay proxy on the server seems to benefit from running on a faster machine.

You should also put the cpufreq-set commands in a script (/root/cpu.sh in the crontab entry that follows) and make sure it runs automatically when the machine is rebooted (the script needs to run as the root user):

crontab -e
@reboot /root/cpu.sh
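
The contents of /root/cpu.sh are not listed on this page. A minimal sketch, assuming the script only needs to repeat the cpufreq-set commands for every CPU, could look like this. The FREQ variable, the pin_commands helper and the print-only fallback are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical /root/cpu.sh: pin every CPU to one frequency with the
# performance governor. FREQ is an assumption; override it on servers
# that should run at their maximum speed.
FREQ="${FREQ:-1.00Ghz}"

# Number of CPUs; falls back to 1 if nproc is unavailable.
CPUS="$(nproc 2>/dev/null || echo 1)"

# Print one cpufreq-set command per CPU index 0..(count-1).
pin_commands() {
  count="$1"; freq="$2"; cpu=0
  while [ "$cpu" -lt "$count" ]; do
    echo "cpufreq-set -d $freq -u $freq -g performance -c $cpu"
    cpu=$((cpu + 1))
  done
}

# Apply the commands only as root with cpufrequtils installed;
# otherwise just print them so they can be reviewed.
if [ "$(id -u)" -eq 0 ] && command -v cpufreq-set >/dev/null 2>&1; then
  pin_commands "$CPUS" "$FREQ" | sh
else
  pin_commands "$CPUS" "$FREQ"
fi
```

Running the script as a regular user prints the commands without changing anything, which is a cheap way to review them before adding the crontab entry.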

Set DNS

The machine comes set up to use a local DNS server. You can change that by following these instructions and use 1.1.1.1 instead (that has given us more stable metrics).

Open Graphite firewall for the machine

You need to open the firewall on AWS for the new machine. You do that using the AWS GUI, but first you need to know the public IP of the machine:

curl http://ipinfo.io/ip

Change hostname

Hetzner sets a hostname with the OS version. You should change that. Check the current hostname in the console: hostname

Then edit the hostname and hosts files, changing the hostname to the new name. We have named the bare metal servers: hetzner-1, hetzner-2, hetzner-3.

Change your current hostname:

sudo nano /etc/hostname

sudo nano /etc/hosts

And then reboot the server: sudo reboot
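
The manual edits can also be scripted. The sketch below is an alternative to editing the files by hand, not the documented procedure: rename_in_hosts is a hypothetical helper that replaces the old name in a hosts file, and the commented example assumes hostnamectl (part of systemd, which writes the static hostname to /etc/hostname) is available.

```shell
#!/bin/sh
# Replace every whole-field occurrence of hostname $2 with $3 in hosts file $1.
# Lines that are changed are rewritten with single spaces between fields.
rename_in_hosts() {
  file="$1"; old="$2"; new="$3"
  tmp="$(mktemp)"
  awk -v old="$old" -v new="$new" \
    '{ for (i = 1; i <= NF; i++) if ($i == old) $i = new; print }' \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (run as root on the server; hetzner-1 is one of the names used above):
#   old="$(hostname)"
#   hostnamectl set-hostname hetzner-1
#   rename_in_hosts /etc/hosts "$old" hetzner-1
#   reboot
```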

Install collectd

Collectd collects server information and sends the metrics to our Graphite instance, and you can see the metrics in the dashboard.

First install collectd: sudo apt-get install collectd collectd-utils

Then configure what data will be sent to Graphite. You do that by editing the configuration: sudo nano /etc/collectd/collectd.conf

In the "Load plugin" section make sure to add the following code (and remove all other plugins):

LoadPlugin cpu
LoadPlugin cpufreq
LoadPlugin cpusleep
LoadPlugin disk
LoadPlugin memory
LoadPlugin processes
LoadPlugin swap
LoadPlugin write_graphite

And then in the section for plugin configuration, add the following (just make sure to change the graphite hostname to the QTE graphite instance):

<Plugin write_graphite>
        <Node "graphite">
                Host "GRAPHITE-HOST"
                Port "2003"
                Protocol "tcp"
                LogSendErrors true
                Prefix "collectd.baremetal."
                StoreRates true
                AlwaysAppendDS false
                EscapeCharacter "_"
        </Node>
</Plugin>

Then restart collectd: sudo service collectd restart
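
If no metrics show up in the dashboard, one way to tell a firewall problem from a collectd problem is to send a single test metric by hand. Graphite's plaintext protocol takes one line per metric in the form "path value timestamp" on port 2003. The sketch below only formats such a line; the metric name is made up, and the commented send command assumes netcat is installed and that GRAPHITE-HOST is replaced with the real host.

```shell
#!/bin/sh
# Format one line of Graphite's plaintext protocol: "<path> <value> <timestamp>".
graphite_line() {
  printf '%s %s %s\n' "$1" "$2" "$3"
}

# Print a sample line; the metric name is a made-up example.
graphite_line "collectd.baremetal.firewalltest" 1 "$(date +%s)"

# To actually send it (assumes netcat is installed and GRAPHITE-HOST is replaced):
#   graphite_line "collectd.baremetal.firewalltest" 1 "$(date +%s)" | nc -w 5 GRAPHITE-HOST 2003
```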

Add Slack error reporter

There's a script that reports all errors in the log to a Slack channel.

Create the script on the server (slack.sh) and make sure to change the start of the text to the type of testing you do on the server, in this example the test is hetzner-1-baremetal:

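#!/bin/bash
# Usage: ./slack.sh LOG_FILE SLACK_WEBHOOK_URL PATTERN
# Follows LOG_FILE and posts every new line that matches PATTERN to Slack.

tail -n0 -F "$1" | while IFS= read -r LINE; do
  # grep prints the matching line to stdout, so it is also kept in the script's output
  if echo "$LINE" | grep -e "$3"; then
    # Double quotes are replaced with single quotes so they do not break the JSON payload
    curl -X POST --silent --data-urlencode \
      "payload={\"text\": \"hetzner-1-baremetal $(echo "$LINE" | sed "s/\"/'/g")\"}" "$2"
  fi
done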

Make the bash script runnable: chmod +x slack.sh

Start the script and make sure to change the SECRET_TOKEN part to the token for the Slack channel (you can find the correct token on the other servers):

nohup nice ./slack.sh "/tmp/sitespeed.io.log" "https://hooks.slack.com/services/SECRET_TOKEN" "ERROR:" > /tmp/s.out 2> /tmp/s.err < /dev/null &

Now you will get all error logs reported to the Slack channel.
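
To see what will be posted without sending anything to Slack, the payload can be built locally. This sketch mirrors how slack.sh prefixes the line and replaces double quotes; slack_payload is a hypothetical helper and the log line is invented. Note that other characters that are special in JSON, such as backslashes, are not escaped, so log lines containing them may still produce an invalid payload.

```shell
#!/bin/sh
# Build the JSON payload the way slack.sh does: prefix the log line with the
# server name and turn double quotes into single quotes.
slack_payload() {
  prefix="$1"; line="$2"
  printf '{"text": "%s %s"}\n' "$prefix" "$(printf '%s' "$line" | sed "s/\"/'/g")"
}

# Example with an invented log line.
slack_payload "hetzner-1-baremetal" 'ERROR: Could not load "https://example.org"'
```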

Enable the firewall

Enable the firewall by accepting only incoming SSH traffic.

sudo ufw default allow outgoing
sudo ufw default deny incoming
sudo ufw allow ssh
sudo ufw enable

Add configuration and baseline directories

The server needs to have a configuration directory with a secret.json file that holds our secrets. Create the directory and make sure the sitespeedio user owns the directory.

mkdir /config
chown sitespeedio:sitespeedio /config

Then copy the secret.json file from one of the other servers. If the server runs baseline tests, it also needs to have the baseline directory (where baseline data is stored).

mkdir /baseline
chown sitespeedio:sitespeedio /baseline