Data Platform/Systems/AQS/Scaling/LoadTesting
Cassandra Info
See: https://etherpad.wikimedia.org/p/analytics-aqs-cassandra
TL;DR
Our new compression scheme greatly reduces the number of SSTables read on every request, and that, together with SSDs, has a strong impact on the throughput the cluster can sustain. We are able to operate at ~400 transactions per second (TPS) per machine. Our per-host throughput doesn't change whether we have 3-4 months of data loaded or 10-12.
We did not investigate where the bottleneck is, so our woes might have moved to the http layer.
While tests are running, CPU doesn't go beyond 20%. The machine is not taking production traffic, so CPU usage is due only to compaction, Cassandra itself, and the load testing.
Tests setup
We use siege to load test from inside the firewall against restbase deployed on aqs100[4,5,6].
Siege doesn't seem to be able to load a file with more than 100,000 urls, so each test uses a file of that size.
- Source files have about 11 million urls, so we break them into smaller files with split.
- cassandra is writing debug logs while these tests run, so it is likely we could run with a more efficient configuration.
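The splitting step above can be sketched as follows. This is a minimal illustration with stand-in data and hypothetical filenames (the real files live in the load-testing directories on the aqs hosts); it just shows split breaking a large URL list into 100,000-line chunks that siege can load one at a time.

```shell
# Stand-in for the real ~11M-line URL dump (filename and URLs are hypothetical).
seq 1 250000 | sed 's#^#http://localhost:7232/analytics.wikimedia.org/v1/url-#' > part-00000

# Break it into 100,000-line chunks siege can load:
# produces part-00000_00, part-00000_01, part-00000_02.
split -l 100000 -d part-00000 part-00000_

# Sanity-check the chunk sizes.
wc -l part-00000_*
```

Each chunk can then be fed to a separate siege run, e.g. `siege -c 200 -r 1000 --file=part-00000_00 --log`.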
Graphs
CPU load in aqs1004: https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=bytes_out&c=Analytics+Query+Service+eqiad&h=aqs1004.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=
SSTables read per query, new cluster:
https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-cf-sstables-per-read?panelId=5&fullscreen&from=1471377813796&to=1471464213797&var-node=aqs1004-a&var-node=aqs1004-b&var-node=aqs1005-a&var-node=aqs1005-b&var-node=aqs1006-a&var-node=aqs1006-b&var-keyspace=local_group_default_T_pageviews_per_article_flat&var-columnfamily=data&var-quantiles=99percentile [Broken Link T211982]
Druid Info
We added Druid as a back-end for AQS, to serve metric data to Wikistats. We need to siege the new end-points to figure out how Druid's throughput differs from Cassandra's, first running with the same configuration as the Cassandra load tests and then re-evaluating. The URLs used here represent every query that Wikistats would issue to AQS if users loaded every single project's main dashboard page; that is 9 widgets for each of about 800 projects. The first run, using siege's normal regression mode, resulted in about 75 transactions per second:
milimetric@aqs1004:~$ siege -c 100 -r 100 -f siege-druid-from-aqs.txt
** SIEGE 3.0.8
** Preparing 100 concurrent users for battle.
The server is now under siege...
<<snip a few timeout alerts>>
done.
Transactions: 9990 hits
Availability: 99.90 %
Elapsed time: 132.46 secs
Data transferred: 13.68 MB
Response time: 0.51 secs
Transaction rate: 75.42 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 38.81
Successful transactions: 9770
Failed transactions: 10
Longest transaction: 27.67
Shortest transaction: 0.00
And this is what the servers looked like during this test:
It seems like Druid is able to cache about half of the responses we need to serve all dashboards. When I re-run the same test within 30 minutes, but with -i, for random "internet" mode, it runs almost twice as fast and the cache miss rate in prometheus is about half what it was during the first run:
milimetric@aqs1004:~$ siege -i -c 100 -r 100 -f siege-druid-from-aqs.txt
** SIEGE 3.0.8
** Preparing 100 concurrent users for battle.
The server is now under siege.. done.
Transactions: 10000 hits
Availability: 100.00 %
Elapsed time: 75.26 secs
Data transferred: 13.42 MB
Response time: 0.11 secs
Transaction rate: 132.87 trans/sec
Throughput: 0.18 MB/sec
Concurrency: 14.96
Successful transactions: 9768
Failed transactions: 0
Longest transaction: 8.50
Shortest transaction: 0.00
For reference, here are the URLs used to track performance:
- prometheus: https://grafana.wikimedia.org/dashboard/db/prometheus-druid?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All. [Broken Link T211982]
- druid boxes: https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&var-server=druid1004&var-network=eth0&from=1513093865729&to=1513098022000&refresh=1m [Broken Link T211982]
CPU only reaches about 25% during the peak of the siege, which is pretty good.
Re-running this test (the random internet one) the next day results in a similar figure, 139.20 transactions per second. The cached objects metric in prometheus shows that Druid never let go of yesterday's cache, but added about another 10% of cached objects during this latest test. Running the siege with normal (non-random) access to URLs results in 144 transactions per second, so now I'm curious what happens if I request a different time period. I changed all the time ranges to start at 2014-12 (instead of 2015-12) and end at 2017-12, and ran in internet mode; this resulted in 102 transactions per second. That most likely means the two cached years were served from cache and the rest came from storage. The cached objects/misses/hits for this last test seem to confirm this: they all increased. Then, for the last test, I wanted to also look at daily resolution, so I loaded these two URLs for all projects and ran the same kind of siege:
milimetric@aqs1004:~$ tail siege-druid-from-aqs.txt -n2
https://wikimedia.org/api/rest_v1/metrics/editors/aggregate/zu.wiktionary.org/all-editor-types/all-page-types/25..99-edits/daily/2017091309/2017121309
https://wikimedia.org/api/rest_v1/metrics/bytes-difference/net/aggregate/zu.wiktionary.org/all-editor-types/non-content/daily/2017091309/2017121300
milimetric@aqs1004:~$ siege -i -c 100 -r 100 -f siege-druid-from-aqs.txt
** SIEGE 3.0.8
** Preparing 100 concurrent users for battle.
The server is now under siege.. done.
Transactions: 9998 hits
Availability: 99.98 %
Elapsed time: 78.81 secs
Data transferred: 4.77 MB
Response time: 0.15 secs
Transaction rate: 126.86 trans/sec
Throughput: 0.06 MB/sec
Concurrency: 19.30
Successful transactions: 9998
Failed transactions: 2
Longest transaction: 18.41
Shortest transaction: 0.00
Druid should not have had any of these things in memory, but maybe it cached daily while serving monthly data. The only other test that I can think of running is a longer extended siege. But my conclusion so far is that Druid can serve at least 50 requests per second without trouble from a cold cache. And if it's pushed further than that, it doesn't work very hard, it just takes a little longer to answer.
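For reference, the time-range change described above (shifting the start of each monthly range from 2015-12 back to 2014-12) is a one-line rewrite of the URL file. A sketch with a hypothetical single-entry file and hypothetical timestamps:

```shell
# Hypothetical one-line URL file; the real file has one URL per
# widget per project.
cat > siege-druid-from-aqs.txt <<'EOF'
https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia.org/all-editor-types/all-page-types/monthly/2015120100/2017120100
EOF

# Shift the start of each monthly range from 2015-12 to 2014-12 so the
# extra year falls outside Druid's warmed cache.
sed -i 's#/monthly/2015120100/#/monthly/2014120100/#' siege-druid-from-aqs.txt

cat siege-druid-from-aqs.txt
```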
Realization: Druid caches segments of time. So that's why warming up the caches with monthly queries allowed it to easily answer daily queries. In that case, one more test is needed to see if requesting the full history for the datasource will break our caches such that they can't answer the monthly queries. The results indicate that basically we have enough memory to cache the full history. The URLs queried were the same as in the first test (all the queries required to serve all the dashboards for all the projects), but I injected about 3% queries that request the full history, all throughout the file. Whether we run over these randomly or in order, Druid actually works faster. This is probably because the full history cache allows answering the other questions even faster:
Transactions: 10000 hits
Availability: 100.00 %
Elapsed time: 69.66 secs
Data transferred: 13.34 MB
Response time: 0.03 secs
Transaction rate: 143.55 trans/sec
Throughput: 0.19 MB/sec
Concurrency: 4.78
Successful transactions: 9504
Failed transactions: 0
Longest transaction: 3.06
Shortest transaction: 0.00
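The ~3% full-history injection used for this run could be sketched as below. The endpoint, timestamps, and filenames are hypothetical stand-ins; the idea is just to interleave a full-history query after roughly every 33rd dashboard URL.

```shell
# Hypothetical dashboard-query file (100 lines).
seq 1 100 | sed 's#^#http://localhost:7232/analytics.wikimedia.org/v1/q-#' > dashboard-urls.txt

# After roughly every 33rd URL (~3%), also emit a query spanning the
# datasource's full history (hypothetical endpoint and date range).
awk '{
  print
  if (NR % 33 == 0)
    print "http://localhost:7232/analytics.wikimedia.org/v1/edits/aggregate/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2017120100"
}' dashboard-urls.txt > urls-with-full-history.txt

wc -l urls-with-full-history.txt
```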
For reference, this is an example hive query that generates the URLs used for siege above:
set hive.mapred.mode=nonstrict;

select regexp_replace(template, ',,hostname,,', hostname) url
from (
    select distinct hostname
    from wmf_raw.mediawiki_project_namespace_map
) projects
cross join (
    select 'http://localhost:7232/analytics.wikimedia.org/v1/editors/aggregate/,,hostname,,/all-editor-types/all-page-types/all-activity-levels/monthly/2015121200/2017121200' template
    union all
    select 'http://localhost:7232/analytics.wikimedia.org/v1/edited-pages/aggregate/,,hostname,,/all-editor-types/all-page-types/all-activity-levels/monthly/2015121200/2017121200' template
    union all
    select 'http://localhost:7232/analytics.wikimedia.org/v1/edits/aggregate/,,hostname,,/all-editor-types/all-page-types/monthly/2015121200/2017121200' template
    union all
    select 'http://localhost:7232/analytics.wikimedia.org/v1/registered-users/new/,,hostname,,/monthly/2015121200/2017121200' template
    union all
    select 'http://localhost:7232/analytics.wikimedia.org/v1/pageviews/aggregate/,,hostname,,/all-access/user/monthly/2015121200/2017121200' template
    union all
    select 'http://localhost:7232/analytics.wikimedia.org/v1/unique-devices/,,hostname,,/all-sites/monthly/2015121200/2017121200' template
    union all
    select 'http://localhost:7232/analytics.wikimedia.org/v1/pageviews/top/,,hostname,,/all-access/2015/10/all-days' template
    union all
    select 'http://localhost:7232/analytics.wikimedia.org/v1/bytes-difference/absolute/aggregate/,,hostname,,/all-editor-types/all-page-types/monthly/2015121200/2017121200' template
    union all
    select 'http://localhost:7232/analytics.wikimedia.org/v1/bytes-difference/net/aggregate/,,hostname,,/all-editor-types/all-page-types/monthly/2015121200/2017121200' template
) templates
;
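The cross join in the Hive query above simply pairs every hostname with every URL template. The same substitution can be approximated in the shell; this sketch uses stand-in hostnames, one of the nine templates, and hypothetical filenames.

```shell
# Two stand-in hostnames and one of the nine URL templates.
printf 'en.wikipedia.org\nzu.wiktionary.org\n' > hostnames.txt
printf '%s\n' 'http://localhost:7232/analytics.wikimedia.org/v1/edits/aggregate/,,hostname,,/all-editor-types/all-page-types/monthly/2015121200/2017121200' > templates.txt

# Cross join: for every hostname, emit every template with the
# ,,hostname,, placeholder replaced.
while read -r host; do
  sed "s#,,hostname,,#${host}#" templates.txt
done < hostnames.txt > siege-urls.txt

cat siege-urls.txt
```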
Load Tests
Data and tests are at: nuria@aqs1004:~/load-testing
TL;DR
- From what we can see, compaction doesn't affect throughput, and we are capped at about 450 transactions per second per host. This number hasn't changed with the amount of data loaded thus far; that is, the throughput cap is the same with 4 and 7 months of data.
- SSTables per read is about 3; on our old cluster this number is 15. This means the compaction strategy we are using is more efficient for our type of reads.
- CPU doesn't even get to 20% while testing at a throughput of > 300 TPS for minutes at a time.
- We have debug logging on; if logging is in any way a bottleneck, our throughput might be even better.
Tests 1. 2016-08-16. Three months loaded. Compaction has finished
Results
- We test with siege on aqs1004 once compaction has finished, before starting the next load of data.
- The maximum throughput one box is able to sustain is about 350 transactions per second (TPS). Not much of a blip on CPU; it doesn't go above 20%.
Details
Tests 2016-08-16, file with 1e6 urls
A hundred concurrent users doing 100 repetitions each, i.e. 10,000 transactions (-c100 -r100):
** SIEGE 3.0.8
** Preparing 100 concurrent users for battle.
The server is now under siege.. done.
Transactions: 10000 hits
Availability: 100.00 %
Elapsed time: 62.82 secs
Data transferred: 2.95 MB
Response time: 0.01 secs
Transaction rate: 159.18 trans/sec
Throughput: 0.05 MB/sec
Concurrency: 2.30
Successful transactions: 9135
Failed transactions: 0
Longest transaction: 0.09
Shortest transaction: 0.00

Longer test (-c100 -r1000):
** SIEGE 3.0.8
** Preparing 100 concurrent users for battle.
The server is now under siege.. done.
Transactions: 100000 hits
Availability: 100.00 %
Elapsed time: 548.67 secs
Data transferred: 28.91 MB
Response time: 0.01 secs
Transaction rate: 182.26 trans/sec
Throughput: 0.05 MB/sec
Concurrency: 2.55
Successful transactions: 92589
Failed transactions: 0
Longest transaction: 0.22
Shortest transaction: 0.00

Bumping up concurrency. CPU is still not even at 20%.
nuria@aqs1004:~/load-testing$ siege -c 200 -r 1000 --file=part-00000_1_1e6.txt --log
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 551.68 secs
Data transferred: 57.82 MB
Response time: 0.01 secs
Transaction rate: 362.53 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.04
Successful transactions: 185178
Failed transactions: 0
Longest transaction: 0.22
Shortest transaction: 0.00

siege -c 400 -r 1000 --file=part-00000_1_1e6.txt --log
Not working.

siege -c 300 -r 1000 --file=part-00000_1_1e6.txt --log
Not working either. Looks like 300 TPS is our threshold.
Tests 2. 2016-08-18. Four months loaded. Compaction is happening while testing
Results
Compaction doesn't seem to affect throughput. Average response time is slower, twice as slow actually (it is unclear whether siege's "response time" is a percentile or an average).
Number of SSTables read is still around 3, and CPU stays low (<20%) while tests run at a throughput of about 300 TPS.
Details
200 concurrent users doing 1000 repetitions:
siege -c 200 -r 1000 --file=../part-00000_1_1e6.txt --log
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 199870 hits
Availability: 99.94 %
Elapsed time: 563.90 secs
Data transferred: 57.81 MB
Response time: 0.02 secs
Transaction rate: 354.44 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 7.01
Successful transactions: 185055
Failed transactions: 130
Longest transaction: 5.23
Shortest transaction: 0.00

This test took about the same time to complete as when compaction was not happening; average response time is higher, 0.02 versus 0.01.

2nd time, same set of urls:
nuria@aqs1004:~/load-testing/2016-08-18$ siege -c 200 -r 1000 --file=../part-00000_1_1e6.txt --log
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 559.88 secs
Data transferred: 57.82 MB
Response time: 0.02 secs
Transaction rate: 357.22 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.84
Successful transactions: 185178
Failed transactions: 0
Longest transaction: 0.33
Shortest transaction: 0.00

Average response time is the same, and so is TPS.

3rd time, different set of urls. Results are about identical to the 1st run; the slowest transaction is slower, but the averages are about the same.
nuria@aqs1004:~/load-testing/2016-08-18$ siege -c 200 -r 1000 --file=../part-00000_2_1e6.txt --log
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 199999 hits
Availability: 100.00 %
Elapsed time: 560.76 secs
Data transferred: 58.84 MB
Response time: 0.02 secs
Transaction rate: 356.66 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 7.32
Successful transactions: 186271
Failed transactions: 1
Longest transaction: 5.01
Shortest transaction: 0.00

Same values for response time and throughput.
Tests 3. 2016-08-26. Seven months loaded. Compaction is happening while testing
Results
Compaction doesn't seem to affect throughput (consistent with our first test on 2016-08-16). We can bump TPS up to 450 per second.
Reusing the same set of urls does not let us increase our throughput; it just seems the slowest transaction is not as slow. This might be a red herring if those numbers are affected by outliers.
CPU usage doesn't get to 20% while tests are going on (keep in mind that compaction is happening). SSTables per read is at 3.
Details
siege -c 200 -r 1000 --file=../part-00000_1_1e6.txt --log
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 566.03 secs
Data transferred: 57.82 MB
Response time: 0.02 secs
Transaction rate: 353.34 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.41
Successful transactions: 185178
Failed transactions: 0
Longest transaction: 0.27
Shortest transaction: 0.00

2nd run, same set of urls. Results are about the same, less of a blip on CPU:
nuria@aqs1004:~/load-testing/2016-08-24$ siege -c 200 -r 1000 --file=../part-00000_1_1e6.txt --log
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 558.60 secs
Data transferred: 57.82 MB
Response time: 0.01 secs
Transaction rate: 358.04 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.09
Successful transactions: 185178
Failed transactions: 0
Longest transaction: 0.26
Shortest transaction: 0.00

3rd run, different set of urls. Pretty much the same results as round 1:
siege -c 200 -r 1000 --file=../part-00000_2_1e6.txt --log
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 561.96 secs
Data transferred: 58.84 MB
Response time: 0.01 secs
Transaction rate: 355.90 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 5.23
Successful transactions: 186272
Failed transactions: 0
Longest transaction: 0.21
Shortest transaction: 0.00

4th run, trying to bump up concurrency:
siege -c 300 -r 1000 --file=../part-00000_2_1e6.txt --log
The test breaks; transactions per second are too high.

5th run, bumping up concurrency a bit less; transactions go up to about 450 per second:
nuria@aqs1004:~/load-testing/2016-08-24$ siege -c 250 -r 1000 --file=../part-00000_2_1e6.txt --log
** SIEGE 3.0.8
** Preparing 250 concurrent users for battle.
The server is now under siege.. done.
Transactions: 250000 hits
Availability: 100.00 %
Elapsed time: 557.41 secs
Data transferred: 73.59 MB
Response time: 0.01 secs
Transaction rate: 448.50 trans/sec
Throughput: 0.13 MB/sec
Concurrency: 5.98
Successful transactions: 232779
Failed transactions: 0
Longest transaction: 0.21
Shortest transaction: 0.00

6th run, using a different, fresh set of urls:
nuria@aqs1004:~/load-testing/2016-08-24$ siege -c 250 -r 1000 --file=../part-00000_3_1e6.txt --log
** SIEGE 3.0.8
** Preparing 250 concurrent users for battle.
The server is now under siege.. done.
Transactions: 250000 hits
Availability: 100.00 %
Elapsed time: 560.55 secs
Data transferred: 73.35 MB
Response time: 0.02 secs
Transaction rate: 445.99 trans/sec
Throughput: 0.13 MB/sec
Concurrency: 7.04
Successful transactions: 233363
Failed transactions: 0
Longest transaction: 0.24
Shortest transaction: 0.00
Tests 2016-09-17. All data loaded, deployed fix for nulls to zero. No compaction happening
Results
Throughput limit is unchanged regardless of data size. It is likely that we are now running into a limit in our http layer rather than the storage layer.
CPU usage doesn't get to 20% while tests are going on. SSTables per read is at 3.
Details
Test 1: 1e6 distinct urls, 200 users, 1000 repetitions, which means urls are hit more than once.
siege -c 200 -r 1000 --file=../part-00000
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 556.29 secs
Data transferred: 58.90 MB
Response time: 0.02 secs
Transaction rate: 359.52 trans/sec
Throughput: 0.11 MB/sec
Concurrency: 6.78
Successful transactions: 186310
Failed transactions: 0
Longest transaction: 0.50
Shortest transaction: 0.00

Test 2: same setup but with 200,000 distinct urls. Results are about identical.
nuria@aqs1004:~/load-testing/2016-09-17$ siege -c 200 -r 1000 --file=./urls_2e6.txt.2
** SIEGE 3.0.8
** Preparing 200 concurrent users for battle.
The server is now under siege.. done.
Transactions: 200000 hits
Availability: 100.00 %
Elapsed time: 561.82 secs
Data transferred: 58.04 MB
Response time: 0.02 secs
Transaction rate: 355.99 trans/sec
Throughput: 0.10 MB/sec
Concurrency: 6.71
Successful transactions: 185881
Failed transactions: 0
Longest transaction: 0.26
Shortest transaction: 0.00

Test 3: bumping up transactions per second.
siege -c 300 -r 1000 --file=./urls_2e6.txt.2
This results in more than 500 transactions per second, and things break.

Test 4: reduced concurrent users to 250; we actually get above 400 TPS.
siege -c 250 -r 1000 --file=./urls_2e6.txt.2
** SIEGE 3.0.8
** Preparing 250 concurrent users for battle.
The server is now under siege.. done.
Transactions: 250000 hits
Availability: 100.00 %
Elapsed time: 562.85 secs
Data transferred: 72.71 MB
Response time: 0.02 secs
Transaction rate: 444.17 trans/sec
Throughput: 0.13 MB/sec
Concurrency: 7.84
Successful transactions: 232861
Failed transactions: 0
Longest transaction: 0.29
Shortest transaction: 0.00

Test 5: same users and repetitions as Test 4, but with urls that expand to a larger time range. Results are similar to the prior test.
nuria@aqs1004:~/load-testing/2016-09-17$ siege -c 250 -r 1000 --file=./urls_2e6.txt.2.late_end_date
** SIEGE 3.0.8
** Preparing 250 concurrent users for battle.
The server is now under siege.. done.
Transactions: 250000 hits
Availability: 100.00 %
Elapsed time: 556.65 secs
Data transferred: 83.56 MB
Response time: 0.02 secs
Transaction rate: 449.12 trans/sec
Throughput: 0.15 MB/sec
Concurrency: 8.72
Successful transactions: 232609
Failed transactions: 0
Longest transaction: 0.29
Shortest transaction: 0.00
Calibration. Performance Tests on Old Cluster
We have used siege to run tests on the old cluster in order to get a baseline of siege results compared to what we see in the live system. We know from experience that our old cluster couldn't sustain more than 30 requests per second, due to hardware (lack of SSDs) but also to software (suboptimal cassandra compaction).
We tested the old cluster with data up to 2016-09-23, but without production traffic, so the setup is identical to the performance tests we ran on the newer cluster.
TL;DR
Latencies are much higher at much lower concurrency levels: response times are approximately 2 seconds at fewer than 60 transactions per second. This means that latencies (as measured by our tests) are 100 times higher on the old cluster at similar (and lower) concurrency levels.
The cluster cannot even get to 100 transactions per second; in theory the load tests tell us it could support almost 60, with CPU reaching 20%. In reality we know our threshold is around 30, so we can safely conclude that siege gives us the right order of magnitude, while the actual threshold has to be found experimentally.
Details
nuria@aqs1001:~/load-testing$ siege -c 120 -r 1000 --file=urls_2e6.txt.2.late_end_date
** SIEGE 3.0.8
** Preparing 120 concurrent users for battle.
The server is now under siege.. done.
Transactions: 119593 hits
Availability: 99.66 %
Elapsed time: 2278.04 secs
Data transferred: 40.07 MB
Response time: 1.59 secs
Transaction rate: 52.50 trans/sec <-
Throughput: 0.02 MB/sec
Concurrency: 83.61
Successful transactions: 111310
Failed transactions: 407
Longest transaction: 5.48
Shortest transaction: 0.00

nuria@aqs1001:~/load-testing$ siege -c 150 -r 1000 --file=urls_2e6.txt.2.late_end_date
** SIEGE 3.0.8
** Preparing 150 concurrent users for battle.
The server is now under siege.. done.
siege aborted due to excessive socket failure; you can change the failure threshold in $HOME/.siegerc
Transactions: 42589 hits
Availability: 97.63 %
Elapsed time: 723.67 secs
Data transferred: 14.13 MB
Response time: 2.03 secs
Transaction rate: 58.85 trans/sec <-
Throughput: 0.02 MB/sec
Concurrency: 119.26
Successful transactions: 39333
Failed transactions: 1035
Longest transaction: 5.89
Shortest transaction: 0.00
Trying to bump up concurrency beyond this level runs into errors.
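When comparing runs like the ones above, pulling the headline numbers out of a saved siege report saves scrolling through full outputs. A small sketch; the log filename and the report values here are hypothetical stand-ins for a real saved report:

```shell
# A trimmed-down siege report saved to a file (hypothetical example values).
cat > siege-run.log <<'EOF'
Transactions:              119593 hits
Availability:               99.66 %
Elapsed time:             2278.04 secs
Response time:               1.59 secs
Transaction rate:           52.50 trans/sec
EOF

# Print just the availability and transaction rate for quick comparison.
awk -F': *' '/^(Availability|Transaction rate)/ { print $1 " = " $2 }' siege-run.log
```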