Analytics/Systems/Presto

From Wikitech

Presto is a distributed SQL query engine for big data, open sourced by Facebook (like Hive). As of September 2018, we are in the process of deploying an experimental setup on 3 machines to give labs users querying capabilities over the mediawiki-history dataset.

Why Presto? Value proposition

We have been looking for a query engine that would facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750 GB, 3 billion rows), and the expected queries are analytics-style queries (group by, count, sum) rather than random reads of single rows. Our requirements are a SQL-compliant query interface and useful SQL analytics features such as window functions. These two requirements are "functionally" satisfied by Hive, but Hive has significant performance issues: launching a job carries a significant time overhead, and relying on MapReduce for computation makes the ratio of job duration to data size very poor for small-ish data.
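
To make the query shape above concrete, here is a minimal sketch of an analytics-style query that combines a group by, a count and a window function. It is an illustration only, not the actual setup: Python's built-in sqlite3 module serves as a local stand-in for a SQL engine (SQLite 3.25 or later is needed for window functions), and the table edits with its columns wiki, month and user_name is hypothetical, not the real mediawiki-history schema.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for the mediawiki-history dataset.
# Table and column names are assumptions made for this sketch only.
ROWS = [
    ("enwiki", "2018-07", "alice"),
    ("enwiki", "2018-07", "bob"),
    ("enwiki", "2018-08", "alice"),
    ("frwiki", "2018-07", "carol"),
    ("frwiki", "2018-08", "carol"),
    ("frwiki", "2018-08", "dave"),
    ("frwiki", "2018-08", "dave"),
]

QUERY = """
SELECT
    wiki,
    month,
    edits,
    -- Window function: running total of monthly edits within each wiki.
    SUM(edits) OVER (PARTITION BY wiki ORDER BY month) AS cumulative_edits
FROM (
    -- Aggregation: one row per wiki and month.
    SELECT wiki, month, COUNT(*) AS edits
    FROM edits
    GROUP BY wiki, month
) AS monthly
ORDER BY wiki, month
"""


def monthly_edit_counts(rows):
    """Load rows into an in-memory table and run the analytics query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE edits (wiki TEXT, month TEXT, user_name TEXT)")
        conn.executemany("INSERT INTO edits VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in monthly_edit_counts(ROWS):
        print(row)
```

Every query of this kind scans and aggregates many rows instead of fetching a single one, which is why job start-up overhead and columnar storage matter more here than random-read latency.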

We had several alternatives for this use-case: Hive, Druid, Clickhouse, and Presto.

Presto has been chosen as the technology that best fits our needs. It was developed by Facebook to solve Hive's speed issues.

Reasons why we chose Presto:

  • It covers all our SQL needs, with the advantage of being ANSI SQL compliant, unlike the other systems, which use dialects.
  • It is much faster than Hive for small and medium-sized data. It is a bit slower than Clickhouse, and a bit slower than Druid for the queries Druid can process (Druid is not actually a general SQL engine[1]).
  • It reads from HDFS and other big-data storage systems, making it easy to load, reload and maintain datasets (unlike Clickhouse and Druid).
  • It takes advantage of the Hadoop-standard columnar data format (Parquet).
  • It is the preferred tool of many other big players for exploratory querying of analytics-oriented data. It has an active ecosystem.

Usage

The Analytics team plans to use Presto in the upcoming release of the Data Lake to the general public and to users of the cloud platform: https://wikitech.wikimedia.org/wiki/Analytics/Projects/Public_Data_Lake

  1. As of today (September 2018) there are two drawbacks to using Druid as a general SQL query engine: there is a significant portion of SQL that Druid would not be able to parse, and a broad range of queries (nested group-by, for instance) would fail at the computation stage.
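
For readers unfamiliar with the term, a nested group-by is an outer GROUP BY applied to the result of an inner GROUP BY. The sketch below shows the shape of such a query. It again uses Python's sqlite3 module as a local stand-in with a hypothetical edits table, so it illustrates the query only and says nothing about how Druid or Presto execute it.

```python
import sqlite3

# Hypothetical sample data: one entry per edit, identified only by its author.
EDIT_AUTHORS = ["alice", "alice", "bob", "carol", "carol", "dave", "dave"]

NESTED_QUERY = """
SELECT edits_per_user, COUNT(*) AS users
FROM (
    -- Inner group-by: number of edits made by each user.
    SELECT user_name, COUNT(*) AS edits_per_user
    FROM edits
    GROUP BY user_name
) AS per_user
-- Outer group-by: number of users sharing each edit count.
GROUP BY edits_per_user
ORDER BY edits_per_user
"""


def users_per_edit_count(authors):
    """Return (edits_per_user, users) pairs for the given list of edit authors."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE edits (user_name TEXT)")
        conn.executemany("INSERT INTO edits VALUES (?)", [(a,) for a in authors])
        return conn.execute(NESTED_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(users_per_edit_count(EDIT_AUTHORS))  # [(1, 1), (2, 3)]
```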