Data Engineering/Systems/AQS/Scaling/2020/Cluster Expansion

Goals

Expand the current AQS Restbase cluster with more hosts to increase available storage space. The main reasons for growth are, first, to make sure the currently served data can continue to be handled, and second, to extend the service to serve historical pagecounts.

Current cluster

  • aqs100[456789].eqiad.wmnet
  • Two Cassandra instances per node.
  • Cassandra 2.2.6

Related tasks

https://phabricator.wikimedia.org/T173720

https://phabricator.wikimedia.org/T193759

Storage considerations

Current State (2020-03-06)

The 6 current hosts handle about 19 TB of data, representing about 55% of the available space. Most (99.7%) of the storage is used by three tables (see this graph):

table                                                 | storage taken (TB) | storage taken (%) | description
local_group_default_T_pageviews_per_article_flat.data | 11.89              | 60.4%             | 4 years and 8 months of per-article daily pageviews
local_group_default_T_mediarequest_per_file.data      | 6.72               | 34.1%             | 5 years and 2 months of per-file daily mediarequests
local_group_default_T_top_pageviews.data              | 0.103              | 5.2%              | 4 years and 8 months of daily and monthly top-article pageviews

Based on those numbers, the approximate growth rate of our current datasets is an additional 4 TB per year.
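The page does not say how the 4 TB figure was derived. As a rough sanity check, the sketch below assumes each table grew linearly over its lifetime and sums the per-table averages; the linear-growth assumption and the names `TABLES` and `yearly_growth_tb` are illustrative, not part of the original analysis.

```python
# Storage (TB) and time span (years) of each table, copied from the table above.
TABLES = {
    "pageviews_per_article_flat": (11.89, 4 + 8 / 12),
    "mediarequest_per_file": (6.72, 5 + 2 / 12),
    "top_pageviews": (0.103, 4 + 8 / 12),
}


def yearly_growth_tb(tables):
    """Sum of per-table average growth, assuming storage grew linearly."""
    return sum(size_tb / years for size_tb, years in tables.values())


growth = yearly_growth_tb(TABLES)
print(f"approximate growth: {growth:.1f} TB/year")  # approximate growth: 3.9 TB/year
```

The result is of the same order as the 4 TB per year quoted above, which is all this check is meant to show.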

Adding pagecounts

We would like to add the historical per-article daily pagecounts dataset to the API (see the related tasks above). Since this data is similar to the existing per-article daily pageviews, we use the latter as a basis for the capacity planning of the former.

Pagecounts data is available from 2008 onward and stops at 2015-06, when pageviews starts. This represents 7 years and 6 months of data. To account for growth over time (the number of pageviews was lower in 2008), we computed the number of rows (distinct wiki, page_title and day) to be loaded in Cassandra for every January, both for pagecounts (2008 to 2015) and pageviews (2016 to 2020). We took the average linear growth between years as the basis for the rows to be loaded for each year: Rows-Jan-Y * 12 + (Rows-Jan-(Y+1) - Rows-Jan-Y). From the storage taken by pageviews, we then computed the expected storage taken by pagecounts. We used the same method for incomplete years, multiplying by the number of months instead of 12, and did not apply any variation to 2020, as it only represents 2 months.
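The formula can be sketched in a few lines of Python. The January row counts are taken from the Pageviews table further down; the function name `rows_for_period` and its parameters are illustrative, not names used by the original analysis.

```python
# January row counts copied from the Pageviews table of this page.
JAN_ROWS = {
    2019: 2_433_335_897,
    2020: 2_472_079_076,
}


def rows_for_period(rows_jan, rows_next_jan=None, months=12):
    """Rows-Jan-Y * months + (Rows-Jan-(Y+1) - Rows-Jan-Y).

    When rows_next_jan is None, no growth term is added (as done for 2020).
    """
    growth = 0 if rows_next_jan is None else rows_next_jan - rows_jan
    return rows_jan * months + growth


rows_2019 = rows_for_period(JAN_ROWS[2019], JAN_ROWS[2020])
rows_2020 = rows_for_period(JAN_ROWS[2020], months=2)
print(rows_2019)  # 29238773943, the 2019-01 "rows for the period" value
print(rows_2020)  # 4944158152, the 2020-01 "rows for the period" value
```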

With the method described, every row stored for per-article daily pageviews weighs on average 96 bytes, leading to ~14.7 TB of additional storage (123% of currently stored pageviews) to add the historical per-article daily pagecounts.
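As a check on these figures, the sketch below recomputes the bytes-per-row average and the pagecounts estimate from the "rows for the period" columns of the two tables further down. It assumes decimal terabytes (10^12 bytes), which is what makes the numbers line up; the variable names are illustrative.

```python
TB = 10**12  # assumption: decimal terabytes

# "rows for the period" values copied from the Pageviews and Pagecounts tables.
PAGEVIEW_ROWS = [
    4944158152, 29238773943, 25702271015,
    25817667391, 26512902665, 11509032397,
]
PAGECOUNT_ROWS = [
    14127378384, 27496780692, 26792044043, 24011655595,
    17875487792, 17677191929, 15441325611, 9332769731,
]
PAGEVIEWS_STORAGE_TB = 11.89  # current size of the per-article pageviews table

# Average on-disk weight of one pageviews row.
bytes_per_row = PAGEVIEWS_STORAGE_TB * TB / sum(PAGEVIEW_ROWS)

# Apply the same per-row weight to the pagecounts rows.
pagecounts_tb = sum(PAGECOUNT_ROWS) * bytes_per_row / TB

print(f"{bytes_per_row:.0f} bytes per row")
print(f"{pagecounts_tb:.1f} TB for pagecounts")
print(f"{pagecounts_tb / PAGEVIEWS_STORAGE_TB:.0%} of current pageviews storage")
```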

Without going into similar details, we can assume that top-pagecounts for the same period should weigh a similar ratio of top-pageviews, therefore approximately 130 GB.
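The arithmetic behind this estimate is a single scaling step, sketched here under the assumption that the pagecounts-to-pageviews storage ratio carries over unchanged to the top tables; the variable names are illustrative.

```python
PAGEVIEWS_TB = 11.89      # current per-article pageviews storage
PAGECOUNTS_TB = 14.7      # estimated per-article pagecounts storage
TOP_PAGEVIEWS_TB = 0.103  # current top-pageviews storage, from the table above

# Scale top-pageviews by the same ratio as the per-article datasets.
ratio = PAGECOUNTS_TB / PAGEVIEWS_TB
top_pagecounts_gb = TOP_PAGEVIEWS_TB * ratio * 1000

print(f"ratio: {ratio:.2f}, top-pagecounts: about {top_pagecounts_gb:.0f} GB")
```

This gives a value slightly under the rounded 130 GB figure.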

Pageviews

date    | months | rows (that month) | rows for the period | storage for the period (TB)
2020-01 | 2      | 2472079076        | 4944158152          | 0.47
2019-01 | 12     | 2433335897        | 29238773943         | 2.81
2018-01 | 12     | 2115357738        | 25702271015         | 2.47
2017-01 | 12     | 2154755423        | 25817667391         | 2.48
2016-01 | 12     | 2214377022        | 26512902665         | 2.50
2015-06 | 6      | 1858931075        | 11509032397         | 1.10

Pagecounts

date    | months | rows (that month) | rows for the period | storage for the period (TB)
2015-01 | 6      | 2354563064        | 14127378384         | 1.36
2014-01 | 12     | 2285656148        | 27496780692         | 2.64
2013-01 | 12     | 2227853445        | 26792044043         | 2.57
2012-01 | 12     | 1980345650        | 24011655595         | 2.31
2011-01 | 12     | 1445012922        | 17875487792         | 1.72
2010-01 | 12     | 1475652637        | 17677191929         | 1.70
2009-01 | 12     | 1269606634        | 15441325611         | 1.48
2008-01 | 12     | 733014827         | 9332769731          | 0.90


Remarks

Note on sizing

The numbers presented above are approximations. Strictly speaking, expressing Cassandra storage in bytes per row is not meaningful, since compaction and compression both affect the on-disk size. However, given the similarity of the datasets we worked with, the approach seems reasonably accurate.

Ops thoughts

  • We probably want to take advantage of the expansion to move to Cassandra 3.x.
  • Moving the ~20 TB total of data needs to be carefully planned and thought through.