Data Platform/Systems/Presto
Presto is an SQL engine which you can use to query data in the Data Lake. We currently run Presto in read-only mode.
As of September 2024, we are running Presto 0.288.1, so the most accurate documentation is at prestodb.io/docs/0.288/.1.
Use
The Presto command-line interface is available on all the analytics clients. Here's how to use it:
$ kinit
$ presto --catalog analytics_hive
Presto can also be easily accessed through Python using Wmfdata:
#!/usr/bin/env python3
import wmfdata as wmf
# Returns a Pandas dataframe
wmf.presto.run("SHOW TABLES FROM event")
SQL quirks
- Strings in double quotes (e.g.
"string"
) are treated as identifiers (e.g. the name of a table or field). Strings in single quotes (e.g.'string'
) are treated as string literals. - Unlike with Hive or Spark, we run presto in read-only mode so you can only run
SELECT
queries.
Presto on Superset
Presto is available in Superset via the SQL Lab panel:
- go to https://superset.wikimedia.org/superset/sqllab
- select database
presto_analytics_hive
- select schema
event
for Eventlogging (just an example, or any available one) - select any table to get a preview of the content, attributes, etc..
Please note: you will be able to see only the data that your user is allowed to, according to the POSIX permissions in puppet.
Background
We have been looking for a query-engine that would facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750Gb, 3 billion rows) and the expected queries would be analytics-style queries (group by, count, sum rather than random-read of single rows). Our requirements are to provide a SQL compliant query interface, with interesting SQL-analytics features (window functions). This two requirements are "functionally" satisfied by Hive but Hive has significant issues when it comes to performance, there is a significant time-overhead for launching jobs and relying on MapReduce for computation makes the ratio of job-duration to data-size very bad for small-ish data.
We had several alternatives for this use-case: Hive, Druid, Clickhouse, and Presto.
Presto has been chosen as the best technology fitting our needs. It was developed by Facebook to solve Hive issues with speed.
Reasons why we choose Presto:
- It matches all the SQL needs with the advantage of being SQL-ANSI compliant, by opposition to all other systems that use dialects
- It is really faster than Hive for small/medium size data. A bit less fast than Clickhouse and Druid for the queries Druid can process (Druid is actually not a general SQL-engine[1]).
- It reads from HDFS and other big-data storage systems, making it easy to load/reload/maintain datasets (unlike Clickhouse and Druid).
- It takes advantage of Parquet, the standard Hadoop columnar data format
- It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way. It has a live ecosystem.
Administration
Please check Data_Platform/Systems/Presto/Administration.
- ↑ As of today (September 2018) there two drawbacks on using Druid as a general SQL query engine: there is a significant scope of SQL that Druid would not be able to parse, and a broad range of queries (nested group-by for instance) would fail at computation-stage.