Data Platform/Systems/Presto

Presto is an SQL engine which you can use to query data in the Data Lake. We currently run Presto in read-only mode.

As of September 2024, we are running Presto 0.288.1, so the most accurate documentation is at prestodb.io/docs/0.288/.1.

Use

The Presto command-line interface is available on all the analytics clients. Here's how to use it:

$ kinit

$ presto --catalog analytics_hive

Presto can also be easily accessed through Python using Wmfdata:

#!/usr/bin/env python3
import wmfdata as wmf

# Returns a Pandas dataframe
wmf.presto.run("SHOW TABLES FROM event")

SQL quirks

Strings in double quotes (e.g. "string") are treated as identifiers (e.g. the name of a table or field). Strings in single quotes (e.g. 'string') are treated as string literals.
Unlike with Hive or Spark, we run presto in read-only mode so you can only run SELECT queries.

Presto on Superset

Presto is available in Superset via the SQL Lab panel:

go to https://superset.wikimedia.org/sqllab
select database presto_analytics_hive
select schema event for Eventlogging (just an example, or any available one)
select any table to get a preview of the content, attributes, etc..

Please note: you will be able to see only the data that your user is allowed to, according to the POSIX permissions in puppet.

Background

We have been looking for a query-engine that would facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750Gb, 3 billion rows) and the expected queries would be analytics-style queries (group by, count, sum rather than random-read of single rows). Our requirements are to provide a SQL compliant query interface, with interesting SQL-analytics features (window functions). This two requirements are "functionally" satisfied by Hive but Hive has significant issues when it comes to performance, there is a significant time-overhead for launching jobs and relying on MapReduce for computation makes the ratio of job-duration to data-size very bad for small-ish data.

We had several alternatives for this use-case: Hive, Druid, Clickhouse, and Presto.

Presto has been chosen as the best technology fitting our needs. It was developed by Facebook to solve Hive issues with speed.

Reasons why we choose Presto:

It matches all the SQL needs with the advantage of being SQL-ANSI compliant, by opposition to all other systems that use dialects
It is really faster than Hive for small/medium size data. A bit less fast than Clickhouse and Druid for the queries Druid can process (Druid is actually not a general SQL-engine^[1]).
It reads from HDFS and other big-data storage systems, making it easy to load/reload/maintain datasets (unlike Clickhouse and Druid).
It takes advantage of Parquet, the standard Hadoop columnar data format
It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way. It has a live ecosystem.

Administration

Please check Data_Platform/Systems/Presto/Administration.

↑ As of today (September 2018) there two drawbacks on using Druid as a general SQL query engine: there is a significant scope of SQL that Druid would not be able to parse, and a broad range of queries (nested group-by for instance) would fail at computation-stage.

[1] As of today (September 2018) there two drawbacks on using Druid as a general SQL query engine: there is a significant scope of SQL that Druid would not be able to parse, and a broad range of queries (nested group-by for instance) would fail at computation-stage.

[1]