Analytics/Systems/Presto


Presto is an SQL engine which you can use to query data in the Data Lake. We currently run Presto in read-only mode.

As of October 2023, we are running Presto 0.283, so the most accurate documentation is at prestodb.io/docs/0.283/.

Use

The Presto command-line interface is available on all the analytics clients. Here's how to use it:

$ kinit

$ presto --catalog analytics_hive
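
If you just want to run a single statement non-interactively, the CLI also accepts it via its --execute option (a minimal sketch; the statement shown is illustrative):

$ presto --catalog analytics_hive --execute 'SHOW SCHEMAS'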

Presto can also be accessed easily from Python using wmfdata-python:

#!/usr/bin/env python3
import wmfdata as wmf

# Returns a Pandas dataframe
wmf.presto.run("SHOW TABLES FROM event")
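
Once you know which tables exist, run() accepts any SELECT statement and returns the result as a Pandas dataframe. A minimal sketch of an analytics-style query (the table and column names here are hypothetical placeholders, not guaranteed to exist):

#!/usr/bin/env python3
import wmfdata as wmf

# Count events per wiki in a hypothetical event table.
# event.some_event_table and the wiki column are placeholders;
# use SHOW TABLES FROM event to find real names.
df = wmf.presto.run("""
    SELECT wiki, COUNT(*) AS events
    FROM event.some_event_table
    GROUP BY wiki
    ORDER BY events DESC
    LIMIT 10
""")
print(df)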

SQL quirks

  • Strings in double quotes (e.g. "string") are treated as identifiers (e.g. the name of a table or field). Strings in single quotes (e.g. 'string') are treated as string literals (see the sketch after this list).
  • Unlike with Hive or Spark, we run Presto in read-only mode, so you can only run SELECT queries.
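
A minimal sketch of the quoting rule in practice, run via wmfdata (the table name is a hypothetical placeholder):

import wmfdata as wmf

# "wiki" in double quotes is an identifier (a column name);
# 'enwiki' in single quotes is a string literal being compared against.
# event.some_event_table is a hypothetical placeholder.
wmf.presto.run("""
    SELECT "wiki", COUNT(*) AS n
    FROM event.some_event_table
    WHERE "wiki" = 'enwiki'
    GROUP BY "wiki"
""")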

Presto on Superset

Presto is available in Superset via the SQL Lab panel:

  • go to https://superset.wikimedia.org/superset/sqllab
  • select the database presto_analytics_hive
  • select a schema, e.g. event for EventLogging data (any available schema works)
  • select any table to get a preview of its content, attributes, etc.

Please note: you will only be able to see the data that your user is allowed to access, according to the POSIX permissions configured in Puppet.

Background

We were looking for a query engine that would facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750 GB, 3 billion rows), and the expected queries are analytics-style (group by, count, sum rather than random reads of single rows). Our requirements were a SQL-compliant query interface with useful SQL-analytics features, such as window functions (see the sketch below). These two requirements are functionally satisfied by Hive, but Hive has significant performance issues: there is a significant time overhead for launching jobs, and relying on MapReduce for computation makes the ratio of job duration to data size very poor for small-ish data.
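
As an illustration of the window-function requirement, a hedged sketch run via wmfdata (the table and column names below are hypothetical placeholders):

import wmfdata as wmf

# Rank pages within each wiki by revision count using a window function.
# some_schema.some_mediawiki_history_table and its columns are placeholders.
wmf.presto.run("""
    SELECT
        wiki_db,
        page_title,
        RANK() OVER (PARTITION BY wiki_db ORDER BY revision_count DESC) AS page_rank
    FROM some_schema.some_mediawiki_history_table
    LIMIT 20
""")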

We considered several alternatives for this use case: Hive, Druid, ClickHouse, and Presto.

Presto was chosen as the technology best fitting our needs. It was developed by Facebook to address Hive's speed issues.

Reasons why we chose Presto:

  • It matches all our SQL needs, with the advantage of being ANSI SQL compliant, as opposed to the other systems, which use dialects.
  • It is significantly faster than Hive for small/medium-size data, though somewhat slower than ClickHouse and Druid for the queries Druid can process (Druid is not a general SQL engine[1]).
  • It reads from HDFS and other big-data storage systems, making it easy to load/reload/maintain datasets (unlike ClickHouse and Druid).
  • It takes advantage of Parquet, the standard Hadoop columnar data format.
  • It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way, and it has a lively ecosystem.

Administration

Please check Analytics/Systems/Presto/Administration.

Notes

  1. As of this writing (September 2018), there are two drawbacks to using Druid as a general SQL query engine: there is a significant scope of SQL that Druid cannot parse, and a broad range of queries (nested group-bys, for instance) fail at the computation stage.