Performance/Graphite (synthetic instance)
Graphite for synthetic testing
We have our own instance of Graphite running outside of our environment to make it easy to add as many metrics as needed. You can see those metrics in our Grafana instance under the namespace sitespeed_io.
The instance also has metrics from Pixel under the namespace pixel.
The instance is set up to keep performance monitoring metrics for 60 days. That means we have a two-month window to act on regressions.
Access
You need the PEM key file to be able to access the server:
ssh -i ~/.ssh/your_id root@performance-testing-graphite.wmftest.org
Then switch to the user that runs Graphite: sudo su - graphite
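If you access the server often, an SSH config entry saves retyping the key and host. A minimal sketch, reusing the PEM key and hostname from the command above (the perf-graphite alias is just an example):
# ~/.ssh/config
Host perf-graphite
    HostName performance-testing-graphite.wmftest.org
    User root
    IdentityFile ~/.ssh/your_id
After that, ssh perf-graphite is enough.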
Start/stop
You use the Docker Compose file to start and stop Graphite. It lives at /home/graphite/settings/docker-compose.yml
Start the instance:
docker-compose up
Stop the instance:
docker-compose down
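Running docker-compose up in the foreground ties Graphite to your terminal. In practice you usually start it detached and then check that the container came up; a minimal sketch using standard Docker Compose commands:
cd /home/graphite/settings
# start in the background
docker-compose up -d
# verify that the graphite container is running
docker-compose ps
# follow the logs if something looks off
docker-compose logs -f graphite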
Setup
The instance runs on Hetzner. Everything runs as the user graphite. The firewall (which makes sure only blessed servers can add metrics) is set up using /home/graphite/firewall.sh. If you need to add a new server to the setup, add the server's IP to the list in that file, run the script clear-firewall.sh to clear everything, and then run firewall.sh.
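The firewall.sh and clear-firewall.sh scripts on the server are the source of truth; the sketch below only illustrates the allow-list idea, assuming an iptables-based setup (the IP address is hypothetical):
#!/bin/bash
# Hypothetical allow list: only blessed test servers may reach the Carbon port (2003).
# The real firewall.sh on the server is authoritative; with Docker-published ports
# the rules may need to go into the DOCKER-USER chain instead of INPUT.
iptables -A INPUT -p tcp -s 192.0.2.10 --dport 2003 -j ACCEPT
iptables -A INPUT -p tcp --dport 2003 -j DROP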
We run the official dockerized version of Graphite using a docker-compose file. To set up Graphite the way we want it, we need five volumes/mappings:
- whisper is where we store all the metrics
- graphite.db is the database where Graphite's annotations are stored
- storage-schemas.conf configures how long we want to store the metrics
- storage-aggregation.conf configures how we want to aggregate metrics
- carbon.conf is the Carbon/Whisper setup; we have our own version because the default one allows only a very moderate number of new metrics to be created per minute (see the sketch after this list)
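Our carbon.conf lives on the server together with the other configuration files; the sketch below only illustrates the kind of settings the last bullet refers to, with made-up values rather than the ones we actually run with:
[cache]
# the stock carbon.conf caps how many new Whisper files may be created per
# minute, which throttles a fresh batch of sitespeed.io metrics
MAX_CREATES_PER_MINUTE = 1000
# cap disk writes so a burst of new data points cannot saturate I/O
MAX_UPDATES_PER_SECOND = 500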
Configurations
All configuration files live on the server in /home/graphite/settings/.
Docker compose
Our Docker Compose file (docker-compose.yml) is simple. It pins the Graphite version, exposes the ports we use, restarts the container automatically if something fails and maps all the volumes we need.
version: "3"
services:
graphite:
image: graphiteapp/graphite-statsd:1.1.5-12
ports:
- "2003:2003"
- "8080:80"
restart: always
volumes:
- /data/whisper:/opt/graphite/storage/whisper
- /data/graphite.db:/opt/graphite/storage/graphite.db
- /home/ubuntu/graphite/storage-schemas.conf:/opt/graphite/conf/storage-schemas.conf
- /home/ubuntu/graphite/storage-aggregation.conf:/opt/graphite/conf/storage-aggregation.conf
- /home/ubuntu/graphite/carbon.conf:/opt/graphite/conf/carbon.conf
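Port 2003 is Carbon's plaintext receiver and host port 8080 maps to the Graphite web UI. From a host that the firewall allows, you can verify that metrics are accepted by sending one by hand over the plaintext protocol (the metric name below is just an example):
# Carbon plaintext protocol: <metric path> <value> <unix timestamp>
echo "sitespeed_io.test.manual 1 $(date +%s)" | nc -w 1 performance-testing-graphite.wmftest.org 2003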
Storage aggregation
storage-aggregation.conf
# Aggregation methods for whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds
#
# [name]
# pattern = <regex>
# xFilesFactor = <float between 0 and 1>
# aggregationMethod = <average|sum|last|max|min>
#
# name: Arbitrary unique name for the rule
# pattern: Regex pattern to match against the metric name
# xFilesFactor: Ratio of valid data points required for aggregation to the next retention to occur
# aggregationMethod: function to apply to data points for aggregation
#
[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max
[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
[default_average]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average
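To check which aggregation rule a metric actually matched, you can inspect its Whisper file inside the container with the whisper-info tool that ships with Graphite. The metric path below is hypothetical; if the script is not on the PATH inside the container it typically lives under /opt/graphite/bin:
# prints aggregationMethod, xFilesFactor and retentions for one metric
docker-compose exec graphite whisper-info.py \
  /opt/graphite/storage/whisper/sitespeed_io/desktop/firstView/example/median.wsp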
Storage schemas
storage-schemas.conf
# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings
[carbon]
pattern = ^carbon\.
retentions = 60:1d
[collectd]
pattern = ^collectd.*
retentions = 10s:1h,1m:1d,10m:40d
[crux]
pattern = ^sitespeed_io\.crux\.
retentions = 1d:2y
[pixel]
pattern = ^pixel.*
retentions = 1h:60d
[alexa]
pattern = ^sitespeed_io\.desktop\.firstViewAlexa\.
retentions = 1h:30d
[sitespeed_run]
pattern = ^sitespeed_io\.(.*)\.(.*)\.run\.
retentions = 15s:8d
[sitespeed-firstview-desktop]
pattern = ^sitespeed_io\.desktop\.firstView\.
retentions = 1h:400d
[sitespeed-desktop-user-journey-login]
pattern = ^sitespeed_io\.desktop\.userJourneyLogin\.
retentions = 1h:400d
[sitespeed-android]
pattern = ^sitespeed_io\.android\.
retentions = 1h:400d
[webpagereplay-desktop]
pattern = ^sitespeed_io\.desktop\.webpagereplay\.
retentions = 1h:90d
[alexa-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.firstViewAlexa\.
retentions = 1h:30d
[webpagereplay-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.webpagereplay\.
retentions = 1h:90d
[sitespeed-firstview-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.firstView\.
retentions = 1h:400d
[sitespeed-emulated-mobile-user-journey]
pattern = ^sitespeed_io\.emulatedMobile\.userJourneyLogin\.
retentions = 1h:400d
[sitespeed]
pattern = ^sitespeed_io\.
retentions = 1h:33d
[cath_them_all]
pattern = .*
retentions = 1h:60d
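Note that changing a retention in storage-schemas.conf only affects Whisper files created after the change; existing metric files keep their old retention until they are resized. A minimal sketch of resizing one existing metric with the whisper-resize tool that ships with Graphite (the metric path is hypothetical):
# rewrite an existing Whisper file so it matches the new retention
docker-compose exec graphite whisper-resize.py \
  /opt/graphite/storage/whisper/sitespeed_io/desktop/firstView/example/median.wsp 1h:400d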
Storing annotations
Annotations for the tests are stored in SQLite3. If the SQLite database gets too large, adding a new entry takes time and adding annotations can break. The annotations store links to the actual test (so that from Grafana you can go to the test result), links to screenshots and some metadata.
There is a script set up in the crontab (list the crontab with crontab -l) that removes old annotations. It looks like this:
0 0 * * 0 sqlite3 /data/graphite.db < /home/graphite/DeleteOldEvents.sql && sqlite3 /data/graphite.db 'VACUUM;'
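If you suspect the annotation database has grown too large, you can check its size and contents by hand before the weekly job runs. The events_event table name below assumes Graphite-web's default events schema:
# how big is the annotations database?
ls -lh /data/graphite.db
# count the stored annotations
sqlite3 /data/graphite.db 'SELECT COUNT(*) FROM events_event;'
# reclaim space after rows have been deleted
sqlite3 /data/graphite.db 'VACUUM;'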