Performance/Graphite (synthetic instance)
Graphite for synthetic testing
We have our own instance of Graphite running outside of our environment to make it easy to add as many metrics as needed. You can see those metrics in our Grafana instance under the namespace sitespeed_io.
The instance also has metrics from Pixel under the namespace pixel.
The instance is set up to keep performance monitoring metrics for 60 days. That means we have a two-month window to act on regressions.
Access
You need the PEM key file to be able to access the server:
ssh -i ~/.ssh/your_id root@performance-testing-graphite.wmftest.org
Then switch to the user that runs Graphite: sudo su - graphite
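If you access the server often, an SSH config entry saves retyping the key and host. A minimal sketch, reusing the PEM key and hostname from the command above (the perf-graphite alias is just an example):
# ~/.ssh/config
Host perf-graphite
    HostName performance-testing-graphite.wmftest.org
    User root
    IdentityFile ~/.ssh/your_id
After that, ssh perf-graphite is enough.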
Start/stop
You use the Docker Compose file to start and stop Graphite. It lives at /home/graphite/settings/docker-compose.yml
Start the instance:
docker-compose up
Stop the instance:
docker-compose down
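Running docker-compose up in the foreground ties Graphite to your terminal. In practice you usually start it detached and then check that the container came up; a minimal sketch using standard Docker Compose commands:
cd /home/graphite/settings
# start in the background
docker-compose up -d
# verify that the graphite container is running
docker-compose ps
# follow the logs if something looks off
docker-compose logs -f graphite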
Setup
The instance runs on Hetzner. Everything runs as the user graphite. The firewall (which makes sure only blessed servers can add metrics) is set up using /home/graphite/firewall.sh. If you need to add a new server to the setup, add the server's IP to the list in that file, run the script clear-firewall.sh to clear everything, and then run firewall.sh.
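The firewall.sh and clear-firewall.sh scripts on the server are the source of truth; the sketch below only illustrates the allow-list idea, assuming an iptables-based setup (the IP address is hypothetical):
#!/bin/bash
# Hypothetical allow list: only blessed test servers may reach the Carbon port (2003).
# The real firewall.sh on the server is authoritative; with Docker-published ports
# the rules may need to go into the DOCKER-USER chain instead of INPUT.
iptables -A INPUT -p tcp -s 192.0.2.10 --dport 2003 -j ACCEPT
iptables -A INPUT -p tcp --dport 2003 -j DROP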
We run the official dockerized version of Graphite using a docker-compose file. To set up Graphite the way we want it, we need five volumes/mappings:
- whisper is where we store all the metrics
- graphite.db is the database where Graphite's annotations are stored
- storage-schemas.conf configures how long we want to store the metrics
- storage-aggregation.conf configures how we want to aggregate metrics
- carbon.conf is the Carbon/Whisper setup; we have our own version because the default one allows only a very moderate number of new metrics to be created per minute (see the sketch after this list)
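Our carbon.conf lives on the server together with the other configuration files; the sketch below only illustrates the kind of settings the last bullet refers to, with made-up values rather than the ones we actually run with:
[cache]
# the stock carbon.conf caps how many new Whisper files may be created per
# minute, which throttles a fresh batch of sitespeed.io metrics
MAX_CREATES_PER_MINUTE = 1000
# cap disk writes so a burst of new data points cannot saturate I/O
MAX_UPDATES_PER_SECOND = 500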
Configurations
All configuration files live on the server in /home/graphite/settings/.
Docker compose
Our Docker Compose file (docker-compose.yml) is simple. It pins the Graphite version, exposes the ports we use, restarts the container automatically if something fails and maps all the volumes we need.
version: "3"
services:
graphite:
image: graphiteapp/graphite-statsd:1.1.5-12
ports:
- "2003:2003"
- "8080:80"
restart: always
volumes:
- /data/whisper:/opt/graphite/storage/whisper
- /data/graphite.db:/opt/graphite/storage/graphite.db
- /home/ubuntu/graphite/storage-schemas.conf:/opt/graphite/conf/storage-schemas.conf
- /home/ubuntu/graphite/storage-aggregation.conf:/opt/graphite/conf/storage-aggregation.conf
- /home/ubuntu/graphite/carbon.conf:/opt/graphite/conf/carbon.conf
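Port 2003 is Carbon's plaintext receiver and host port 8080 maps to the Graphite web UI. From a host that the firewall allows, you can verify that metrics are accepted by sending one by hand over the plaintext protocol (the metric name below is just an example):
# Carbon plaintext protocol: <metric path> <value> <unix timestamp>
echo "sitespeed_io.test.manual 1 $(date +%s)" | nc -w 1 performance-testing-graphite.wmftest.org 2003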
Storage aggregation
storage-aggregation.conf
# Aggregation methods for whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds
#
# [name]
# pattern = <regex>
# xFilesFactor = <float between 0 and 1>
# aggregationMethod = <average|sum|last|max|min>
#
# name: Arbitrary unique name for the rule
# pattern: Regex pattern to match against the metric name
# xFilesFactor: Ratio of valid data points required for aggregation to the next retention to occur
# aggregationMethod: function to apply to data points for aggregation
#
[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max
[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
[default_average]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average
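To check which aggregation rule a metric actually matched, you can inspect its Whisper file inside the container with the whisper-info tool that ships with Graphite. The metric path below is hypothetical; if the script is not on the PATH inside the container it typically lives under /opt/graphite/bin:
# prints aggregationMethod, xFilesFactor and retentions for one metric
docker-compose exec graphite whisper-info.py \
  /opt/graphite/storage/whisper/sitespeed_io/desktop/firstView/example/median.wsp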
Storage schemas
storage-schemas.conf
# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings
[carbon]
pattern = ^carbon\.
retentions = 60:1d
[collectd]
pattern = ^collectd.*
retentions = 10s:1h,1m:1d,10m:40d
[crux]
pattern = ^sitespeed_io\.crux\.
retentions = 1d:2y
[pixel]
pattern = ^pixel.*
retentions = 1h:60d
[alexa]
pattern = ^sitespeed_io\.desktop\.firstViewAlexa\.
retentions = 1h:30d
[sitespeed_run]
pattern = ^sitespeed_io\.(.*)\.(.*)\.run\.
retentions = 15s:8d
[sitespeed-firstview-desktop]
pattern = ^sitespeed_io\.desktop\.firstView\.
retentions = 1h:400d
[sitespeed-desktop-user-journey-login]
pattern = ^sitespeed_io\.desktop\.userJourneyLogin\.
retentions = 1h:400d
[sitespeed-android]
pattern = ^sitespeed_io\.android\.
retentions = 1h:400d
[webpagereplay-desktop]
pattern = ^sitespeed_io\.desktop\.webpagereplay\.
retentions = 1h:90d
[alexa-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.firstViewAlexa\.
retentions = 1h:30d
[webpagereplay-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.webpagereplay\.
retentions = 1h:90d
[sitespeed-firstview-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.firstView\.
retentions = 1h:400d
[sitespeed-emulated-mobile-user-journey]
pattern = ^sitespeed_io\.emulatedMobile\.userJourneyLogin\.
retentions = 1h:400d
[sitespeed]
pattern = ^sitespeed_io\.
retentions = 1h:33d
[cath_them_all]
pattern = .*
retentions = 1h:60d
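Note that changing a retention in storage-schemas.conf only affects Whisper files created after the change; existing metric files keep their old retention until they are resized. A minimal sketch of resizing one existing metric with the whisper-resize tool that ships with Graphite (the metric path is hypothetical):
# rewrite an existing Whisper file so it matches the new retention
docker-compose exec graphite whisper-resize.py \
  /opt/graphite/storage/whisper/sitespeed_io/desktop/firstView/example/median.wsp 1h:400d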
Storing annotations
Annotations for the tests are stored in SQLite3. If the SQLite database gets too large, adding a new entry takes time and adding annotations can break. The annotations store links to the actual test (so that from Grafana you can go to the test result), links to screenshots and some metadata.
There is a script set up in the crontab (list the crontab with crontab -l) that removes old annotations. It looks like this:
0 0 * * 0 sqlite3 /data/graphite.db < /home/graphite/DeleteOldEvents.sql && sqlite3 /data/graphite.db 'VACUUM;'
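If you suspect the annotation database has grown too large, you can check its size and contents by hand before the weekly job runs. The events_event table name below assumes Graphite-web's default events schema:
# how big is the annotations database?
ls -lh /data/graphite.db
# count the stored annotations
sqlite3 /data/graphite.db 'SELECT COUNT(*) FROM events_event;'
# reclaim space after rows have been deleted
sqlite3 /data/graphite.db 'VACUUM;'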