Performance/Runbook/SyntheticToolAlert

From Wikitech

This is the runbook for the synthetic tool alerts, which fire when one of the tools is down or when we are missing data from that tool.

Meta

WebPageReplay missing data

  1. Check if the WebPageReplay server is running by logging in to the machine.
  2. Check the log file located at /tmp/sitespeed.io.log. Can you see any errors in the log?
  3. If the log file seems stuck (nothing has been written to it for an hour or so), check whether the container is stuck by listing the container status with docker ps. If the created date is more than one hour ago, something is wrong, because a new container is started roughly every 10 minutes. Kill the container with docker kill <container name>, wait a while, check with docker ps that a new container is up and running, and then verify that everything looks OK in the log.
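
The "log seems stuck" check in step 3 can be sketched as a small script. This is only an illustration, not part of the deployed tooling: the one-hour limit comes from the text above, while the function names log_age_seconds and log_is_stale are made up for this example.

```python
import os
import time

LOG_PATH = "/tmp/sitespeed.io.log"  # log location given in this runbook
MAX_AGE_SECONDS = 60 * 60  # "an hour or so"; adjust if the schedule changes


def log_age_seconds(path, now=None):
    """Return the number of seconds since the file was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def log_is_stale(path, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Return True if nothing has been written to the log within the limit."""
    return log_age_seconds(path, now) > max_age_seconds


if __name__ == "__main__":
    if not os.path.exists(LOG_PATH):
        print(f"{LOG_PATH} does not exist")
    elif log_is_stale(LOG_PATH):
        print(f"{LOG_PATH} looks stuck; check the container with docker ps")
    else:
        print(f"{LOG_PATH} was updated recently")
```

A stale log only suggests a stuck container; confirm with docker ps before killing anything.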

WebPageReplay CPU benchmark alert

The CPU benchmark measures how stable the metrics/CPU is on the machine that runs the tests. If it is unstable for some time you need to deploy the tests on a new server.

WebPageReplay TTFB alert

The TTFB (time to first byte) should be really stable when you use WebPageReplay, since the tests run on the same server as the browser. If you see high variation in TTFB, something is really wrong on the server and you should try to deploy the tests on another server.
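
One way to put a number on "high variation" is the coefficient of variation (standard deviation divided by the mean) over recent TTFB samples. The sketch below is illustrative only: the 10% threshold and the function names are assumptions for this example, not the values used by the actual alert.

```python
from statistics import mean, pstdev

# Hypothetical threshold for illustration; the real alert rule may differ.
MAX_CV = 0.10


def coefficient_of_variation(samples):
    """Return the population standard deviation divided by the mean."""
    average = mean(samples)
    if average == 0:
        raise ValueError("mean TTFB is zero, cannot compute variation")
    return pstdev(samples) / average


def ttfb_is_unstable(samples, max_cv=MAX_CV):
    """Return True if the TTFB samples vary more than the allowed ratio."""
    return coefficient_of_variation(samples) > max_cv
```

For example, samples of 100, 101, 99 and 100 ms give a ratio well below 1%, while 100, 300, 90 and 250 ms give a ratio close to 50%.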

CRUX missing data

The CrUX (Chrome User Experience Report) data is collected once a day by a job that runs from the crontab.

  1. Log in to the server that collects the data from the Chrome User Experience Report API: ssh gpsi.webperf.eqiad1.wikimedia.cloud
  2. Check the log located at /tmp/sitespeed.io.log for errors.
  3. If you find out what is wrong and fix it, you can manually run the four tests that collect the data (look in the crontab to see how to do that). It doesn't matter when you run the tests, as long as you run them once per day; otherwise we will miss data.

sitespeed.io missing data

  • Check if the sitespeed.io server is running by logging in to the machine.
  • Check the log file located at /tmp/sitespeed.io.log. Can you see any errors in the log?
  • If the log file seems stuck (nothing has been written to it for an hour or so), check whether the container is stuck by listing the container status with docker ps. If the created date is more than one hour ago, something is wrong, because a new container is started roughly every 10 minutes. Kill the container with docker kill <container name>, wait a while, check with docker ps that a new container is up and running, and then verify that everything looks OK in the log.
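
Finding containers that are more than one hour old can be sketched as below. It assumes docker ps is called with --format "{{.Names}}<TAB>{{.CreatedAt}}" and that CreatedAt looks like "2024-05-01 12:00:00 +0000 UTC"; verify that on the server before relying on it. The helper names are made up for this example.

```python
import subprocess
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=1)  # containers older than this are suspicious


def parse_created_at(value):
    """Parse a CreatedAt value such as '2024-05-01 12:00:00 +0000 UTC'."""
    # Assumed format: date, time, numeric offset, zone name (ignored).
    date_part, time_part, offset = value.split()[:3]
    return datetime.strptime(
        f"{date_part} {time_part} {offset}", "%Y-%m-%d %H:%M:%S %z"
    )


def find_old_containers(ps_output, now, max_age=MAX_AGE):
    """Return the names of containers created more than max_age before now."""
    old = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, created_at = line.split("\t", 1)
        if now - parse_created_at(created_at) > max_age:
            old.append(name)
    return old


def docker_ps_output():
    """Run docker ps and return one 'name<TAB>created' line per container."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.CreatedAt}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On the server, find_old_containers(docker_ps_output(), datetime.now(timezone.utc)) would list the candidates for docker kill <container name>. The sketch only reports names and never kills a container itself.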