Performance/Graphite (synthetic instance)

From Wikitech

Graphite for synthetic testing

We run our own instance of Graphite outside of our environment so that we can add as many metrics as we need. You can see the metrics in our Grafana instance under the namespace sitespeed_io.

The instance is set up to keep metrics for 60 days. That gives us a two-month window to act on regressions and leaves room to add many more tests and metrics if we want.

Access

You need the pem file to access the server:

ssh -i "graphite.pem" ubuntu@wpt-graphite.wmftest.org

Start/stop

Use the Docker Compose file to start and stop Graphite. The file is located at /home/ubuntu/graphite/docker-compose.yml

Start the instance:

docker-compose up

Stop the instance:

docker-compose down

Setup

The instance runs on AWS on an m4.xlarge instance with an extra volume. We use AWS because the agents that collect the data also run on AWS, which lets us use security groups to make sure that only our own instances can post data to it.

The size of the instance was chosen because a large company that also runs Graphite uses the same setup for its synthetic testing. We can change it in the future.

When you set up a new instance, make sure it stores the data on a disk that does not belong to that instance. We have an extra 200 GB volume attached to the instance, set up by following the official AWS documentation. Mount it and make sure it is mounted automatically after a reboot. The extra disk lives under /data/.

We run the official dockerized version of Graphite using a docker-compose file. To set up Graphite the way we want it, we need five volume mappings:

  • whisper is where we store all the metrics
  • graphite.db is the database where Graphite's annotations are stored
  • storage-schemas.conf configures how long we keep the metrics
  • storage-aggregation.conf configures how we aggregate metrics
  • carbon.conf is the carbon/whisper configuration. We use our own version because the default one allows only a small number of new metrics to be created per minute (the MAX_CREATES_PER_MINUTE setting).

Configurations

All configuration files live on the server in /home/ubuntu/graphite/.

Docker compose

Our Docker Compose file (docker-compose.yml) is simple. It specifies which Graphite version to run and which ports to use, restarts the container automatically if something fails, and maps all the volumes we need.

version: "3"
services:
    graphite:
        image: graphiteapp/graphite-statsd:1.1.5-12
        ports:
            - "2003:2003"
            - "8080:80"
        restart: always
        volumes:
            - /data/whisper:/opt/graphite/storage/whisper
            - /data/graphite.db:/opt/graphite/storage/graphite.db
            - /home/ubuntu/graphite/storage-schemas.conf:/opt/graphite/conf/storage-schemas.conf
            - /home/ubuntu/graphite/storage-aggregation.conf:/opt/graphite/conf/storage-aggregation.conf
            - /home/ubuntu/graphite/carbon.conf:/opt/graphite/conf/carbon.conf
    memcached:
        image: memcached:1.5.16
        ports:
            - "11211:11211"
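The compose file exposes the Graphite web application on port 8080, so hosts that the security group allows can query it over HTTP. The following is a minimal sketch of how a query URL for Graphite's render API could be built. The helper name `render_url`, the use of plain `http`, and the example target are assumptions for illustration and are not taken from our setup.

```python
from urllib.parse import urlencode


def render_url(target, host="wpt-graphite.wmftest.org", port=8080, period="-1h"):
    """Build a Graphite render API URL that returns JSON for one target.

    The scheme (http) and the default period are assumptions for this sketch.
    """
    query = urlencode({"target": target, "from": period, "format": "json"})
    return f"http://{host}:{port}/render?{query}"


if __name__ == "__main__":
    # Hypothetical metric path under the sitespeed_io namespace.
    print(render_url("sitespeed_io.desktop.firstView.example"))
```

The request only succeeds from a host that the security group lets through on port 8080.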

Storage aggregation

storage-aggregation.conf

# Aggregation methods for whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds
#
#  [name]
#  pattern = <regex>
#  xFilesFactor = <float between 0 and 1>
#  aggregationMethod = <average|sum|last|max|min>
#
#  name: Arbitrary unique name for the rule
#  pattern: Regex pattern to match against the metric name
#  xFilesFactor: Ratio of valid data points required for aggregation to the next retention to occur
#  aggregationMethod: function to apply to data points for aggregation
#
[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

[default_average]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average
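To make the "first match wins" rule concrete, here is a small Python sketch that mirrors the four rules above and reports which one applies to a metric name. It is only an illustration of the matching order, not carbon's own code; the function name `aggregation_for` and the example metric names are hypothetical.

```python
import re

# The rules from storage-aggregation.conf above, in file order:
# (rule name, regex pattern, xFilesFactor, aggregationMethod)
AGGREGATION_RULES = [
    ("min", r"\.min$", 0.1, "min"),
    ("max", r"\.max$", 0.1, "max"),
    ("sum", r"\.count$", 0.0, "sum"),
    ("default_average", r".*", 0.0, "average"),
]


def aggregation_for(metric):
    """Return (rule name, xFilesFactor, method) of the first matching rule."""
    for name, pattern, x_files_factor, method in AGGREGATION_RULES:
        # Patterns are searched anywhere in the name unless they are anchored.
        if re.search(pattern, metric):
            return name, x_files_factor, method
    return None


if __name__ == "__main__":
    for metric in ("example.timing.min", "example.requests.count", "example.timing.median"):
        print(metric, "->", aggregation_for(metric))
```

Because the catch-all `.*` rule comes last, any metric that does not end in `.min`, `.max` or `.count` is averaged.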

Storage schemas

storage-schemas.conf

# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
#  [name]
#  pattern = regex
#  retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...

# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings
[carbon]
pattern = ^carbon\.
retentions = 60:1d

[crux]
pattern = ^sitespeed_io\.crux\.
retentions = 1d:2y

[alexa]
pattern = ^sitespeed_io\.desktop\.firstViewAlexa\.
retentions = 1h:30d

[sitespeed-firstview-desktop]
pattern = ^sitespeed_io\.desktop\.firstView\.
retentions = 1h:400d

[sitespeed-desktop-user-journey-login]
pattern = ^sitespeed_io\.desktop\.userJourneyLogin\.
retentions = 1h:400d

[webpagereplay-desktop]
pattern = ^sitespeed_io\.desktop\.webpagereplay\.
retentions = 1h:90d

[alexa-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.firstViewAlexa\.
retentions = 1h:30d

[webpagereplay-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.webpagereplay\.
retentions = 1h:90d

[sitespeed-firstview-emulated-mobile]
pattern = ^sitespeed_io\.emulatedMobile\.firstView\.
retentions = 1h:400d

[sitespeed-emulated-mobile-user-journey]
pattern = ^sitespeed_io\.emulatedMobile\.userJourneyLogin\.
retentions = 1h:400d

[sitespeed-wpt-desktop]
pattern = ^sitespeed_io\.webpagetest\.firstView\.pageSummary\.en_wikipedia_org\.
retentions = 1h:400d

[sitespeed-wpt-emulated-mobile]
pattern = ^sitespeed_io\.webpagetestEmulatedMobile\.firstView\.pageSummary\.en_m_wikipedia_org\.
retentions = 1h:40d

[sitespeed]
pattern = ^sitespeed_io\.
retentions = 1h:33d

[cath_them_all]
pattern = .*
retentions = 1h:60d
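The order of the entries matters: the specific sitespeed_io patterns have to come before the broader [sitespeed] and catch-all rules, otherwise they would never be reached. The sketch below shows how a metric name resolves to a retention and how many data points that retention holds. It uses only a subset of the schemas above, kept in the same relative order, and the helper names (`to_seconds`, `parse_retention`, `schema_for`) are hypothetical.

```python
import re

# Seconds per unit as used in Graphite retention strings.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

# A subset of storage-schemas.conf above, in file order: (name, pattern, retentions)
SCHEMAS = [
    ("carbon", r"^carbon\.", "60:1d"),
    ("crux", r"^sitespeed_io\.crux\.", "1d:2y"),
    ("alexa", r"^sitespeed_io\.desktop\.firstViewAlexa\.", "1h:30d"),
    ("sitespeed-firstview-desktop", r"^sitespeed_io\.desktop\.firstView\.", "1h:400d"),
    ("sitespeed", r"^sitespeed_io\.", "1h:33d"),
    ("cath_them_all", r".*", "1h:60d"),
]


def to_seconds(token):
    """Convert '60', '1h', '400d', ... to seconds (a bare number means seconds)."""
    match = re.fullmatch(r"(\d+)([smhdwy]?)", token)
    if match is None:
        raise ValueError(f"cannot parse {token!r}")
    number, unit = match.groups()
    return int(number) * UNIT_SECONDS.get(unit, 1)


def parse_retention(spec):
    """Return (seconds per point, number of points) for 'timePerPoint:timeToStore'."""
    precision, duration = spec.split(":")
    seconds_per_point = to_seconds(precision)
    if duration.isdigit():
        points = int(duration)  # a bare number is a point count, not a duration
    else:
        points = to_seconds(duration) // seconds_per_point
    return seconds_per_point, points


def schema_for(metric):
    """Return (schema name, retentions) of the first schema whose pattern matches."""
    for name, pattern, retentions in SCHEMAS:
        if re.search(pattern, metric):
            return name, retentions
    return None


if __name__ == "__main__":
    name, retentions = schema_for("sitespeed_io.desktop.firstView.example")
    print(name, retentions, parse_retention(retentions))
```

For example, `1h:400d` means one data point per hour kept for 400 days, which is 9600 points per metric.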

Security groups

The instance has its own security group that makes sure we only get data from our blessed instances. The inbound rules look like this:

Custom TCP 8080 - Access from the proxy (install1002) that is used by grafana.wikimedia.org

Custom TCP 8080 - Access from the agents security group that runs sitespeed.io/browsertime/webpagetest (for annotations)

Custom TCP 2003 - Access from the agents security group that runs sitespeed.io/browsertime/webpagetest (for metrics)

SSH TCP 22 - Access only with the .pem file.
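Port 2003 is carbon's plaintext listener, where each metric arrives as one line of the form "path value timestamp". As a rough sketch of what an agent does when it posts a metric, the following Python formats such a line and sends it over TCP. The function names and the example metric path are hypothetical, and the send only works from a host inside the agents security group.

```python
import socket
import time


def format_metric_line(path, value, timestamp=None):
    """Format one metric in the plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="wpt-graphite.wmftest.org", port=2003,
                timestamp=None, timeout=5.0):
    """Open a TCP connection to carbon and send a single metric line."""
    line = format_metric_line(path, value, timestamp)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path; only print the line here instead of sending it.
    print(format_metric_line("sitespeed_io.test.example", 42, 1700000000), end="")
```

In practice sitespeed.io sends the metrics itself, so a helper like this is mainly useful for checking that a new agent can reach the instance.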