Presto is a distributed SQL query engine that you can use to query data in the Data Lake.

Note that there are two competing distributions of Presto (PrestoDB and PrestoSQL); as of October 2020, we are running PrestoDB 0.226, so the most accurate documentation is the PrestoDB documentation for that release (prestodb.io).

Usage on analytics cluster

The Presto CLI is deployed on all the analytics clients. Here's how to use it:

$ kinit

$ presto --catalog analytics_hive
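
Once connected, you can run standard ANSI SQL at the prompt. For example (navigationtiming is just an illustrative table name; use SHOW TABLES to see what your user can actually access):

presto> SHOW SCHEMAS;
presto> SHOW TABLES FROM event;
presto> SELECT * FROM event.navigationtiming LIMIT 10;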

Presto can also be used easily from a Python environment on the stat/notebook hosts. Here's an example:

#!/usr/bin/env python3
# IMPORTANT - before first use, in a notebook terminal:
# pip3 install presto-python-client['kerberos']
import os

import prestodb

# Connect to the analytics Presto coordinator over TLS, authenticating with
# the Kerberos ticket obtained via kinit. The principal below assumes the
# standard WIKIMEDIA realm used on the analytics hosts.
conn = prestodb.dbapi.connect(
    host='an-coord1001.eqiad.wmnet',
    port=8281,
    http_scheme='https',
    catalog='analytics_hive',
    user=os.environ['USER'],
    auth=prestodb.auth.KerberosAuthentication(
        config='/etc/krb5.conf',
        service_name='presto',
        principal=os.environ['USER'] + '@WIKIMEDIA',
        ca_bundle='/etc/ssl/certs/Puppet_Internal_CA.pem',
    ),
)

cursor = conn.cursor()
cursor.execute('SHOW TABLES FROM event')
for row in cursor.fetchall():
    print(row)
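
From here, normal Python DB-API usage applies. For example, continuing the snippet above, query results can be loaded into a pandas DataFrame (a sketch; it assumes pandas is installed in your environment and uses an illustrative table name):

import pandas as pd

# Fetch first, so that cursor.description is populated with column metadata.
cursor.execute('SELECT * FROM event.navigationtiming LIMIT 10')
rows = cursor.fetchall()
df = pd.DataFrame(rows, columns=[col[0] for col in cursor.description])
print(df)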

Presto on Superset

Presto is available in Superset via the SQL Lab panel:

  • open Superset and go to the SQL Lab panel
  • select the database presto_analytics_hive
  • select a schema, e.g. event for EventLogging data (any schema available to you works)
  • select any table to get a preview of its content, attributes, etc.

Please note: you will only be able to see the data that your user is allowed to access, according to the POSIX permissions configured in Puppet.
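
Once the database and schema are selected, queries are typed directly into the SQL Lab editor. For example (the table name is illustrative, and the WHERE clause assumes the usual year/month partitioning of EventLogging tables in Hive):

SELECT COUNT(*) AS events
FROM event.navigationtiming
WHERE year = 2020 AND month = 10;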

Why Presto

We had been looking for a query engine to make querying the mediawiki-history dataset easier. The dataset is somewhat "big" but not one of our largest (~750 GB, 3 billion rows), and the expected queries are analytics-style queries (GROUP BY, COUNT, SUM rather than random reads of single rows). Our requirements were a SQL-compliant query interface with useful SQL-analytics features (window functions). These two requirements are functionally satisfied by Hive, but Hive has significant performance issues: there is a large time overhead for launching jobs, and relying on MapReduce for computation makes the ratio of job duration to data size very poor for small-ish data.
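
As an illustration, here is the kind of analytics-style query we need to support, using a window function over mediawiki_history (the column names follow the Hive schema, but treat the query as a sketch rather than a recipe):

SELECT wiki_db,
       event_user_text,
       event_timestamp,
       ROW_NUMBER() OVER (PARTITION BY wiki_db, event_user_text
                          ORDER BY event_timestamp) AS edit_rank
FROM wmf.mediawiki_history
WHERE event_entity = 'revision'
LIMIT 10;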

We had several alternatives for this use case: Hive, Druid, ClickHouse, and Presto.

Presto was chosen as the technology best fitting our needs. It was developed by Facebook precisely to address Hive's speed problems.

Reasons why we chose Presto:

  • It matches all of our SQL needs, with the advantage of being ANSI-SQL compliant, as opposed to the other systems, which use dialects.
  • It is much faster than Hive for small/medium-sized data, though somewhat slower than ClickHouse and Druid for the queries Druid can process (Druid is not a general SQL engine[1]).
  • It reads from HDFS and other big-data storage systems, making it easy to load/reload/maintain datasets (unlike ClickHouse and Druid).
  • It takes advantage of Parquet, the standard Hadoop columnar data format.
  • It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way, and it has a lively ecosystem.


Administration

Please check Analytics/Systems/Presto/Administration.


  1. As of September 2018, there are two drawbacks to using Druid as a general SQL query engine: there is a significant subset of SQL that Druid cannot parse, and a broad range of queries (nested GROUP BYs, for instance) fail at the computation stage.
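
For instance, a nested GROUP BY of the following shape runs on Presto but falls into the class of queries Druid cannot compute (again a sketch against mediawiki_history, computing the average number of edits per user, per wiki):

SELECT wiki_db, AVG(edits_per_user) AS avg_edits_per_user
FROM (
    SELECT wiki_db, event_user_text, COUNT(*) AS edits_per_user
    FROM wmf.mediawiki_history
    WHERE event_entity = 'revision'
    GROUP BY wiki_db, event_user_text
) per_user
GROUP BY wiki_db;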