Analytics/Data Lake

From Wikitech

The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake).

Data available

Currently, you need production data access to use some of this data. Much of it is also publicly available.
  • Traffic data: Webrequest, pageviews, and unique devices
  • Edits data: Historical data about revisions, pages, and users (e.g. MediaWiki History)
  • Content data: Wikitext (latest and historical) and wikidata-entities
  • Events data: EventLogging, EventBus, and event streams data (raw, refined, sanitized)

Some of these datasets (such as webrequest) are only available in Hive, while others (such as pageviews) are also available as data cubes (usually in a more aggregated form).


The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.

You can access these engines through several different routes:

All three engines also have command-line programs which you can use on one of the analytics clients. This is probably the least convenient way, but if you want to use it, consult the engine's documentation page.

Differences between the SQL engines

For the most part, Presto, Hive, and Spark work the same way, but they have some differences in SQL syntax and processing power.

Syntax differences

  • Spark and Hive use STRING as the keyword for string data, while Presto uses VARCHAR.
    • One consequence is a different method for transforming integer year/month/day fields to a date string.
    • Spark and Hive: CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) (casting to STRING is not actually required)
    • Presto: CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) (casting to VARCHAR is required)
  • In Spark and Hive, you use the SIZE function to get the length of an array, while in Presto you use CARDINALITY.
  • In Spark and Hive, double quoted text (like "foo") is interpreted as a string, while in Presto it is interpreted as a column name. It's easiest to use single quoted text (like 'foo') for strings, since all three engines interpret it the same way.
  • Spark and Hive have a CONCAT_WS ("concatenate with separator") function, but Presto does not.
  • Spark supports both FLOAT and REAL as keywords for the 32-bit floating-point number data type, while Presto supports only REAL.
  • Presto has no FIRST or LAST functions.
  • If you need to use a keyword like DATE as a column name, you use backticks (`date`) in Spark and Hive, but double quotes ("date") in Presto.
  • To convert an ISO 8601 timestamp string (e.g. "2021-11-01T01:23:02Z") to an SQL timestamp:
    • Spark: TO_TIMESTAMP(dt)
    • Presto: FROM_ISO8601_TIMESTAMP(dt)
    • Hive: FROM_UNIXTIME(UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'"))
  • Escaping special characters in string literals works differently in Spark and Presto. See this notebook for more details.
  • See also: Presto's guide to migrating from Hive
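
As a rough illustration of several of the differences listed above, here is the same query written once for Presto and once for Spark. It is a sketch rather than a tested recipe: the inline subquery and its column names (year, month, day, tags, dt) are made up for the example so that the queries do not depend on any real table.

```sql
-- Presto version.
-- Integers must be cast to VARCHAR before CONCAT/LPAD, array length is
-- CARDINALITY, and ISO 8601 strings are parsed with FROM_ISO8601_TIMESTAMP.
SELECT
  CONCAT(
    CAST(year AS VARCHAR), '-',
    LPAD(CAST(month AS VARCHAR), 2, '0'), '-',
    LPAD(CAST(day AS VARCHAR), 2, '0')
  ) AS date_string,                 -- '2021-11-01'
  CARDINALITY(tags) AS tag_count,   -- 2
  FROM_ISO8601_TIMESTAMP(dt) AS ts
FROM (
  -- Hypothetical sample row; Presto builds arrays with ARRAY[...].
  SELECT 2021 AS year, 11 AS month, 1 AS day,
         ARRAY['a', 'b'] AS tags,
         '2021-11-01T01:23:02Z' AS dt
) AS example_row;

-- Spark version of the same query.
-- No casts are needed, array length is SIZE, and TO_TIMESTAMP parses the string.
SELECT
  CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date_string,
  SIZE(tags) AS tag_count,
  TO_TIMESTAMP(dt) AS ts
FROM (
  -- Hypothetical sample row; Spark builds arrays with ARRAY(...).
  SELECT 2021 AS year, 11 AS month, 1 AS day,
         ARRAY('a', 'b') AS tags,
         '2021-11-01T01:23:02Z' AS dt
) AS example_row;
```

Both versions use single-quoted string literals throughout, since that is the one quoting style all three engines interpret the same way.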

Integer division in Presto

If you divide two integers, Hive and Spark return a floating-point number (e.g. 1 / 3 returns 0.333333). Presto, however, returns an integer truncated towards zero (e.g. 1 / 3 returns 0). To work around this, cast one of the operands with CAST(x AS DOUBLE). DOUBLE is a 64-bit floating-point number, while REAL is a 32-bit floating-point number.

There are some quirks to be aware of with this behavior:

  SELECT
    2/5 AS "none",
    CAST(2 AS DOUBLE)/5 AS "numerator",
    2/CAST(5 AS DOUBLE) AS "denominator",
    CAST(2/5 AS DOUBLE) AS "outer",
    2/5 * CAST(100 AS DOUBLE) AS "percentage (a)",
    CAST(2/5 AS DOUBLE) * 100 AS "percentage (b)",
    CAST(2 AS DOUBLE) / 5 * 100 AS "percentage (c)",
    1.0 * 2 / 5 * 100 AS "percentage (d)"

These produce:

  • none: 0 (because 2/5 is integer division: the result is truncated towards 0 so that the output data type stays integer, same as the inputs)
  • numerator, denominator: 0.4
  • outer: 0 (because 2/5 is evaluated as integer division, giving 0, BEFORE the explicit cast to double)
  • percentage
    • (a): 0 (same as "none": 2/5 is truncated to 0 before it is multiplied by the double-typed 100)
    • (b): 0 (same as "outer")
    • (c): 40
    • (d): 40

So let's say your query uses SUM(IF(event.action = 'click', 1, 0)) / COUNT(1) to calculate a clickthrough rate. In Presto, the result will be truncated to 0 (or 1, if every event is a click) unless you:

  • explicitly cast either the denominator or the numerator to double, or
  • implicitly cast by multiplying by 1.0 first (in example (d) above, the order of operations means 1.0 * 2 is evaluated first and becomes 2.0, which is then divided by 5)
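
To make the clickthrough-rate case concrete, here is a small Presto sketch that runs against inline sample data instead of a real table. The events relation and its flat action column are hypothetical stand-ins for the event.action field mentioned above.

```sql
-- Presto: clickthrough rate over four hypothetical events, one of them a click.
SELECT
  -- Integer numerator divided by integer denominator: 1 / 4 truncates to 0.
  SUM(IF(action = 'click', 1, 0)) / COUNT(1) AS ctr_integer,
  -- Casting the numerator to DOUBLE makes the division floating-point: 0.25.
  CAST(SUM(IF(action = 'click', 1, 0)) AS DOUBLE) / COUNT(1) AS ctr_double
FROM (
  VALUES ('click'), ('impression'), ('impression'), ('impression')
) AS events (action);
```

Casting the denominator instead, for example CAST(COUNT(1) AS DOUBLE), should work equally well; which operand you cast is a matter of taste.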

Table and file formats

Data Lake tables can be created using either the Hive table format or the Iceberg table format. Iceberg is the successor to the Hive format and is highly recommended for new tables. As of Feb 2024, the existing tables in the wmf database are being slowly migrated to Iceberg (task T333013).

Both table formats can store data using a variety of underlying file formats; we normally use Parquet with both Hive and Iceberg.
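
As a minimal sketch of what the two table formats look like at creation time, the following Spark SQL statements create one table of each kind, both backed by Parquet files. The database and table names (my_db, page_hits_iceberg, page_hits_hive) and the columns are hypothetical, and the Iceberg statement assumes your Spark session is configured with an Iceberg-enabled catalog; check the local conventions before creating real tables.

```sql
-- Iceberg-format table (Spark SQL). Parquet is Iceberg's default file format;
-- the table property below just makes that choice explicit.
CREATE TABLE my_db.page_hits_iceberg (
  wiki STRING,
  hits BIGINT
)
USING iceberg
TBLPROPERTIES ('write.format.default' = 'parquet');

-- Hive-format table storing its data as Parquet files.
CREATE TABLE my_db.page_hits_hive (
  wiki STRING,
  hits BIGINT
)
STORED AS PARQUET;
```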

Technical architecture

Data Lake datasets which are available in Hive are stored in the Hadoop Distributed File System (HDFS). The Hive metastore is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
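
One way to see the metastore at work is to ask an engine for a table's metadata. The sketch below uses Hive or Spark SQL and assumes the webrequest dataset is exposed as a table named webrequest in the wmf database; that exact table name is an assumption, so substitute any table you have access to.

```sql
-- List the tables the metastore knows about in the wmf database.
SHOW TABLES IN wmf;

-- Show the metadata recorded for one table, including its columns,
-- its storage format, and the HDFS location of its data files.
DESCRIBE FORMATTED wmf.webrequest;
```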

Some Data Lake datasets are available in Druid, which is separate from Hive and HDFS, and allows quick exploration and dashboarding of those datasets in Turnilo and Superset.

The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.
