Analytics/Projects/Data Lake/SQL Engine on Cloud/Appendix

From Wikitech
Jump to navigation Jump to search

Limitations of other public access systems like Analytics API and Quarry

In the past three years an effort has been made to facilitate accessing  analytics data for Wikimedia Projects through web-APIs with the Analytics Query Service. For instance pageview statistics (including per-article daily pageviews) for every wiki project since July 2015 or edits statistics for every projects can be accessed using simple internet requests. The new system has also proven useful as it currently serve a few million calls per day (mostly on pageviews per article).

One of the main limitation of our current API system is that it only allow users to request the data in predefined ways. For example it doesn’t let you query edits data for specific patterns in comments, or for users changing groups. To get this data, you’d need to query the Mediawiki database replicas in the WMF-Cloud infrastructure, using Quarry for instance.  While this is already incredible to be able to request this data, querying MySQL also has limitations. The two main ones mentioned here are the difficulty to query multiple wiki-projects at once (every wiki has its own database, so you need one query per wiki), and most importantly the fact that a whole class of queries are too computationally intensive for MySQL to provide answers in a timely fashion (quarry queries are killed after half-an-hour of computation).

More details on distributed-computation state-of-the-art techniques

In the past fifteen years, a lot has happened in the world of computation at scale. What was previously available for scientists only via supercomputers is now commonly available to anyone with some computer skills through “the cloud”. And while the core of what is done has not changed (split the problem in order to distribute the computation across CPUs and be good at managing IOs and RAM), the available tools and standards helping to achieve performance have really changed the game of what is nowadays called “big data”.

For the specific field of analytical-queries (to differentiate from random-read queries and general-computation engines), two main aspects drive performance: the parallelization of  computation and the use of optimized files formats.

Parallel computation

Splitting data and parallelizing computation across CPU cores or cluster nodes is nothing new, it has been done since more than fifty years on supercomputers for very hard simulation problem (weather or stock-market forecasting to name only two). Nonetheless the box of tools facilitating querying high volumes of data has grown fast in the past fifteen years, and solve different problems:

  • General data processing (MapReduce, Spark)
  • Random read/write (Cassandra, HBase, Kudu)
  • SQL-Querying (ClickHouse, Drill, Hive, Impala, Presto, Spark-SQL)
  • OLAP Querying (Druid)

In the “Technological components” section we will concentrate on the SQL-Querying tools listed above.

Columnar file formats

By opposition to the classical row-oriented data storage, where data is stored row after row in files, a columnar file format is a change of layout, and data is stored by columns (or portions of columns) instead. For instance:

Columnar Storage.png

This shift in representation allows for various improvements when dealing with analytics-oriented tasks (mostly read, filter and count), noticeably

  • Don’t read unnecessary columns
  • Precompute columns statistics/index and prevent having to recompute them
  • Reduce storage by indexing rows and not repeat column values
  • Push down filtering-predicates from queries to reading stage, therefore read even less

This representation however comes with a cost at insertion/update, and is suboptimal when the work involves reading entire rows.

More details on analytics query engines and file formats

This section briefly describes state-of-the-art techniques in analytical-query domain and the various technological components available. A discussion is finally provided, explaining the choice we made.

Query engines

We separate them as being part of the hadoop-ecosystem or not, as the former means having to manage Hadoop in addition to the query engine.

  • Using hadoop-ecosystem (by order of creation)
    • Apache Hive - Originally built at Facebook then strongly pushed by Hortonworks, this is the first SQL-engine over hadoop. It is embedded with hadoop and originally used Hadoop-MapReduce as computation engine (it can now use more recent engines as Tez or Spark). It uses a modified version of SQL called HQL (Hive Query Language) which is very similar to SQL but still needs some learning. Also, by being older than the other systems, its approach to distributed querying is not naturally up-to-date, and it is under constant evolution to keep on changes. On the other hand, being older means it is also the most mature system of the list.
    • Apache Impala - Originally built by Cloudera, this system was one the “better SQL-Engines for hadoop” effort made around 2012. It has its own computation daemons (multiple workers, single state-store and catalog-service), and uses HDFS or HBase as storage layer. It also has its own SQL dialect, subset of SQL and HQL.
    • Presto - Originally built by Facebook, it is also part of the “better SQL-Engines for hadoop” effort made around 2012. Similarly to impala, it has its own computation daemons (coordinators and workers) and accept a lot of storage backends using connectors (the first one was for the Hive Metastore). It uses ANSI-Compatible SQL statements. It used by many big web-players (facebook, netflix, uber, airbnb, Amazon-Athena ...)
    • Apache Drill - Apache project from the start but supported by MapR, it is the last one of the better SQL-Engines for hadoop” effort made around 2012. Again, it has its own daemons (workers) and provides many datasource integrations. In comparison to Impala and presto, Drill put the accent on facilitating querying multi-source data and doesn’t almost need table metadata definition.
    • Apache Spark - SQL - This  system is an extension of the Spark general engine (new kid on the block of distributed computation engines) to working SQL queries. It can be fully embedded in Hadoop or have its own daemons, and uses an SQL dialect very similar to HQL. It can be integrated with many datasources but is originally built to read from HDFS.
  • Self-contained
    • ClickHouse - Built by Yandex. Written in C++ with low-level optimizations, tests we have done with this tool are impressive in term of performance. It however is difficult to install and has a lot more operational constraints than the others, since not integrated with hadoop (sharding of data needs to be done manually for instance). It uses its own SQL dialect.

File formats for hadoop

There are two open-source columnar file formats used in the hadoop ecosystem: Apache Parquet and Apache ORC. They are very similar in what they provide and how they function (they find their roots in the Google Dremel article), but still have some differences. The current standard in the WMF-Analytics team is to use Parquet, out of the fact that when we started using those file formats, parquet had really a lot more traction and features. It seems that now ORC is prefered when used in conjunction to Hive and Presto, while parquet is still the prefered choice for Spark. Reasons for which one is to be favoured to the other ar mainly related to reader-optimizations being implemented in the various engines.

Manual tests over pageview-data have shown a very small difference in data-compression (ORC folder is smaller than parquet one by 2% - same compression codec and same number of files). However there is a definite difference in query speed when using Hive or Spark: ORC is faster in Hive while parquet is faster with Spark (we use Hive version 1.1 and Spark version 2.3.1).