Search/ElasticSearch/LoadTesting

From Wikitech

Load testing of the elasticsearch clusters is done by recording real production GET requests during peak hours and playing them back against the hot-spare cluster at accelerated rates. Elasticsearch performance depends heavily on how much of its data is available in the kernel disk cache, and how much has to be pulled from disk in real time. Replaying both the variety of query types and the variety of real terms provided by users ensures we are testing a realistic load.


Overview

goreplay is used to record production traffic on the currently live servers for at least one hour of peak usage. The recorded data files are copied from the live cluster to the spare cluster, and then replayed at normal and accelerated rates. The most recent load test was run with gor_0.16.1_x86, and the commands below are based on that version.

Recording traffic

On all servers currently serving production traffic, run the following command. It needs to run for at least 60 minutes to capture enough traffic that it can be played back at 2x speed while still leaving a test long enough to have confidence in the results.

sudo ./goreplay --input-raw :9200 --http-allow-method GET --output-file requests.log --exit-after 60m
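Since the command has to run on every server taking production traffic, it can help to generate the per-host invocations up front. This is a hypothetical sketch: the host names are placeholders, and in practice the command may be launched by hand or via a fleet tool. The loop only prints the commands for review.

```shell
# Hypothetical dry run: print the record command for each production host.
# The host list below is a placeholder; substitute the servers actually
# taking traffic at recording time.
hosts="elastic1001 elastic1002 elastic1003"
for host in $hosts; do
  cmd="ssh ${host}.eqiad.wmnet \"sudo ./goreplay --input-raw :9200 --http-allow-method GET --output-file requests.log --exit-after 60m\""
  echo "$cmd"   # printed for review only; run by hand once verified
done
```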

Transferring recorded files

Gor will create files requests_0.log, requests_1.log, etc. In previous runs these files were transferred between elasticsearch servers using netcat. Since then the firewalls have been tightened up and this is no longer an option. Someone in SRE with agent forwarding capabilities can likely use scp to accomplish the same task.
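One way the scp approach could look is a third-party copy (scp -3) routed through a bastion host. This is a sketch under assumptions, not a tested procedure: the bastion and host names below are placeholders, and the loop only prints the commands rather than executing them.

```shell
# Sketch: third-party copy of each rotated file from a live host to a
# spare host via a bastion. All host names here are placeholders.
bastion="bast.example.wmnet"
for i in 0 1 2; do
  copy="scp -3 -o ProxyJump=${bastion} elastic1001.eqiad.wmnet:requests_${i}.log elastic2001.codfw.wmnet:requests_${i}.log"
  echo "$copy"   # printed for review; run each line by hand to transfer
done
```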

Replaying recorded files

When replaying recorded files there are two important parameters to tune: the number of concurrent requests, and the speed of replay. Without limiting the number of concurrent requests it is possible to bring down the entire cluster. CirrusSearch limits concurrency using the PoolCounter; goreplay limits it per instance with --output-http-workers.

./gor --input-file './requests.gor|150%' --input-file-repeat 3 --output-http http://search.svc.codfw.wmnet:9200 --stats --output-http-stats --output-http-workers 30

Evaluating Performance

Both server load and latencies should be evaluated via the Grafana dashboards.

TODO

Scripts

Some scripts were hacked together during the last load test (task T221121). They are available on GitHub.

Caveats

In previous runs both clusters had the same number of servers, which meant we could do a 1-to-1 transfer of request logs between servers for replay. When the spare cluster has fewer servers than the cluster recorded from, playback must be sped up to account for the reduced server count. We assume that LVS distributes requests evenly across the servers, so a simple ratio of server counts determines the necessary speedup.
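The ratio works out as in this illustrative calculation; the server counts are made up for the example and should be replaced with the real fleet sizes.

```shell
# Illustrative arithmetic, assuming LVS spreads load evenly: if traffic was
# recorded on 36 servers and replayed against 24, each file must be replayed
# at 36/24 = 150% speed. Counts below are example values only.
recorded=36
spare=24
speed=$(( recorded * 100 / spare ))
echo "replay at ${speed}%"   # pass to goreplay as --input-file 'requests.gor|150%'
```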