Data Engineering/Systems/AQS/Scaling/2020/Cluster Expansion

Goals

Expand the current AQS Restbase cluster with more hosts to grow available storage space. The main reasons for growth are, first, to make sure the currently served data can continue to be handled and, second, to extend the service to serve historical pagecounts.

Current cluster

aqs100[456789].eqiad.wmnet
Two Cassandra instances per node.
Cassandra 2.2.6

https://phabricator.wikimedia.org/T193759

Storage considerations

Current State (2020-03-06)

The 6 current hosts handle about 19Tb of data, representing about 55% of the available space. Most (99.7%) of the storage is used by three tables (see this graph):

Table	Storage taken (Tb)	Storage taken (%)	Description
`local_group_default_T_pageviews_per_article_flat.data`	11.89	60.4%	4 years and 8 months of pageviews per-article daily
`local_group_default_T_mediarequest_per_file.data`	6.72	34.1%	5 years and 2 months of mediarequest per-file daily
`local_group_default_T_top_pageviews.data`	0.103	5.2%	4 years and 8 months of pageviews top-articles daily and monthly

Based on those numbers, the approximate growth rate of our current datasets is an additional 4Tb per year.

Adding pagecounts

We would like to add the historical per-article daily pagecounts dataset to the API (see related task above). Since this data is similar to the existing pageviews per-article daily dataset, we use the latter as the basis for the capacity planning of the former.

Pagecounts data is available from 2008 onward and stops in 2015-06, when pageviews start. This represents 7 years and 6 months of data. To account for growth over time (the number of page views was lower in 2008), we computed the number of rows (distinct wiki, page_title and day) to be loaded into Cassandra for every January, both for pagecounts (2008 to 2015) and pageviews (2016 to 2020). We used the linear growth between consecutive Januaries as the basis for estimating the rows to be loaded over a given year: Rows-Jan-Y * 12 + (Rows-Jan-Y+1 - Rows-Jan-Y). Based on the storage currently taken by pageviews, we then computed the expected storage taken by pagecounts. For incomplete years we used the same method, multiplying by the number of months instead of 12, and we did not apply any variation to 2020, as it only represents 2 months.

With the method described, every row stored for pageviews per-article daily weighs on average 96 bytes, leading to an additional storage of ~14.7Tb (123% of currently stored pageviews) to add the historical pagecounts per-article daily.

Without going into similar detail, we can assume that top-pagecounts for the same period should take a similar proportion of the storage used by top-pageviews, therefore approximately 130Gb.
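As a rough illustration of what these figures imply for the cluster, the sketch below projects storage usage over time. It is only an approximation under stated assumptions: the total capacity is inferred from the "19Tb is about 55%" figure, growth of the current datasets is taken as a linear 4Tb per year, and the historical pagecounts are treated as a one-off addition that does not grow. The function names are hypothetical and not part of any AQS tooling.

```python
# Figures quoted in the text above (all approximate).
CURRENT_USED_TB = 19.0        # data currently handled by the 6 hosts
CURRENT_USED_FRACTION = 0.55  # share of available space in use
YEARLY_GROWTH_TB = 4.0        # growth rate of the current datasets
PAGECOUNTS_TB = 14.7          # historical pagecounts per-article daily
TOP_PAGECOUNTS_TB = 0.13      # historical top-pagecounts (~130Gb)


def implied_capacity_tb():
    """Total capacity implied by the usage figures (an inference, not a measured value)."""
    return CURRENT_USED_TB / CURRENT_USED_FRACTION


def projected_usage_tb(years, include_pagecounts=True):
    """Projected storage after `years`, assuming linear growth of the current datasets."""
    usage = CURRENT_USED_TB + YEARLY_GROWTH_TB * years
    if include_pagecounts:
        # Historical data is loaded once and assumed not to grow afterwards.
        usage += PAGECOUNTS_TB + TOP_PAGECOUNTS_TB
    return usage


capacity = implied_capacity_tb()
for years in (0, 1, 2):
    for with_pagecounts in (False, True):
        usage = projected_usage_tb(years, with_pagecounts)
        label = "with pagecounts" if with_pagecounts else "current datasets only"
        print(f"+{years}y ({label}): {usage:.1f}Tb, {usage / capacity:.0%} of implied capacity")
```

The projection ignores the free space Cassandra needs for compaction, so the usable headroom is smaller than the raw percentages suggest.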

Pageviews

date	months	rows	rows for the period	storage for the period (Tb)
2020-01	2	2472079076	4944158152	0.47
2019-01	12	2433335897	29238773943	2.81
2018-01	12	2115357738	25702271015	2.47
2017-01	12	2154755423	25817667391	2.48
2016-01	12	2214377022	26512902665	2.50
2015-06	6	1858931075	11509032397	1.10

Pagecounts

date	months	rows	rows for the period	storage for the period (Tb)
2015-01	6	2354563064	14127378384	1.36
2014-01	12	2285656148	27496780692	2.64
2013-01	12	2227853445	26792044043	2.57
2012-01	12	1980345650	24011655595	2.31
2011-01	12	1445012922	17875487792	1.72
2010-01	12	1475652637	17677191929	1.70
2009-01	12	1269606634	15441325611	1.48
2008-01	12	733014827	9332769731	0.90
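The estimation method described earlier can be sketched in a few lines of Python. This is a minimal reconstruction from the formula and the Pagecounts table, not the script actually used: the function names are hypothetical, a Tb is assumed to mean 10^12 bytes, and the last period (2015-01) gets no growth term, which is what the table values imply.

```python
BYTES_PER_ROW = 96  # average row weight quoted in the text (an approximation)

# (first month of the period, months covered, rows counted in that first month),
# copied from the Pagecounts table and listed in chronological order.
PAGECOUNTS = [
    ("2008-01", 12, 733014827),
    ("2009-01", 12, 1269606634),
    ("2010-01", 12, 1475652637),
    ("2011-01", 12, 1445012922),
    ("2012-01", 12, 1980345650),
    ("2013-01", 12, 2227853445),
    ("2014-01", 12, 2285656148),
    ("2015-01", 6, 2354563064),
]


def rows_for_period(months, rows_first, rows_next=None):
    """Rows-Jan-Y * months + (Rows-Jan-Y+1 - Rows-Jan-Y); no growth term if rows_next is None."""
    growth = 0 if rows_next is None else rows_next - rows_first
    return rows_first * months + growth


def estimate(periods, bytes_per_row=BYTES_PER_ROW):
    """Return a list of (period, rows for the period, storage in Tb)."""
    result = []
    for i, (label, months, rows_first) in enumerate(periods):
        # The growth term uses the first-month count of the following period, when there is one.
        rows_next = periods[i + 1][2] if i + 1 < len(periods) else None
        rows = rows_for_period(months, rows_first, rows_next)
        result.append((label, rows, rows * bytes_per_row / 1e12))
    return result


estimates = estimate(PAGECOUNTS)
total_tb = sum(storage for _, _, storage in estimates)

for label, rows, storage in estimates:
    print(f"{label}\t{rows}\t{storage:.2f}")
print(f"total\t{total_tb:.1f}Tb")
```

The same functions apply to the Pageviews table by passing its rows in chronological order, with 2020-01 as the last period.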

Remarks

Note on sizing

The numbers presented above are approximations. Strictly speaking, it makes little sense to express Cassandra storage in bytes per row, as compaction and compression are at play. However, given the similarity of the datasets we worked with, the approach feels reasonably correct.

Ops thoughts

We probably want to take advantage of the expansion to move to Cassandra 3.x.
Moving the ~20Tb total of data needs to be carefully planned and thought through.