Data Platform/Data Lake
The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake).
Data available
- Traffic data
- Webrequest, pageviews, and unique devices
- Edits data
- Historical data about revisions, pages, and users (e.g. MediaWiki History)
- Content data
- Wikitext (latest & historical) and wikidata-entities
- Events data
- EventLogging, EventBus and event streams data (raw, refined, sanitized)
- Commons Impact Metrics
- Contributions to Wikimedia Commons focused on GLAMs
Some of these datasets (such as webrequests) are only available in Hive, while others (such as pageviews) are also available as data cubes (usually in more aggregated capacity).
Access
The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.
You can access these engines through several different routes:
- Superset has a graphical SQL editor where you can run Presto queries
- Custom code on one of the analytics clients (the easiest way to do this is to use our Jupyter service)
- for Python, use the Wmfdata-Python package
- for R, use the wmfdata-r package
- All three engines also support ad-hoc querying via command-line programs which you can use on one of the analytics clients. Consult the engine's documentation page for more info.
Syntax differences between the SQL engines
For the most part, Presto, Hive, and Spark work the same way, but they have some differences in SQL syntax.
use case | Spark | Presto | Hive |
---|---|---|---|
keyword for the string data type | STRING
|
VARCHAR
|
STRING
|
string literal | 'foo' , "foo"
|
'foo'
|
'foo' , "foo"
|
keyword for 32-bit float data type | FLOAT , REAL
|
REAL
|
FLOAT
|
keyword for 64-bit float data type | DOUBLE
| ||
select a column named with a reserved word (e.g. DATE )
|
`date`
|
"date"
|
`date`
|
get the length of an array | SIZE(a)
|
CARDINALITY(a)
|
SIZE(a)
|
concatenate strings with a separator | CONCAT_WS
|
not available | CONCAT_WS
|
count rows which match a condition | COUNT_IF(x = y)
|
COUNT_IF(x = y)
|
SUM(CAST(x = y AS INT))
|
transform integer year /month /day fields to a date string
|
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))
|
CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0'))
|
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))
|
convert an ISO 8601 timestamp string (e.g. "2021-11-01T01:23:02Z" ) to an SQL timestamp
|
TO_TIMESTAMP(dt)
|
FROM_ISO8601_TIMESTAMP(dt)
|
FROM_UNIXTIME(UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'"))
|
divide integers, returning a float if necessary | x / y
|
CAST(x AS DOUBLE) / y
|
x / y
|
select the first or last rows in a group | FIRST , LAST
|
not available | not available |
- It's useful to get in the habit of using singled quoted text (
'foo'
) for strings, since all three engines interpret it the same way. Double quoted text ("foo"
) is interpreted as a string in Spark and Hive, but as a column name in Presto. - Escaping special characters in string literals works differently in Spark and Presto. See this notebook for more details.
- See also: Presto's guide to migrating from Hive
Integer division in Presto
If you divide integers, Hive and Spark will return a floating-point number if necessary (e.g. 1 / 3
returns 0.333333
). However, Presto will return only an integer (e.g. 1 / 3
returns 0
). Use CAST(x AS DOUBLE)
to work around this. DOUBLE
is a 64-bit floating point number, while REAL
is a 32-bit floating point number.
There are some quirks to be aware of with this behavior:
SELECT
2/5 AS "none",
CAST(2 AS DOUBLE)/5 AS "numerator",
2/CAST(5 AS DOUBLE) AS "denominator",
CAST(2/5 AS DOUBLE) AS "outer",
2/5 * CAST(100 AS DOUBLE) AS "percentage (a)",
CAST(2/5 AS DOUBLE) * 100 AS "percentage (b)",
CAST(2 AS DOUBLE) / 5 * 100 AS "percentage (c)",
1.0 * 2 / 5 AS "percentage (d)"
These produce:
- none: 0 (because 2/5 is rounded towards 0 to keep the output data type integer, same as input)
- numerator, denominator: 0.4
- outer: 0 (because 2/5 is implicitly cast to integer BEFORE being explicitly cast as double)
- percentage
- (a): 0 (same as "none" – 2/5 is cast to int and rounded towards 0 before it reaches the double-typed 100)
- (b): 0 (same as outer)
- (c): 40
- (d): 40
So let's say your query has SUM(IF(event.action = 'click', 1, 0)) / COUNT(1)
to calculate clickthrough rate. It'll be 0 unless you:
- explicitly cast either the denominator or the numerator to double, or
- implicitly cast by multiplying by 1.0 (for example above it follows order of operations:
1.0 * 2
becomes2.0
then that gets divided by 5)
Table and file formats
Data Lake tables can be created using either Hive format or Iceberg format. Iceberg is the successor to Hive, and highly recommended for new tables. As of Feb 2024, the existing tables in the wmf
database are being slowly migrated to Iceberg (task T333013).
Both table formats can store data using a variety of underlying file formats; we normally use Parquet with both Hive and Iceberg.
Technical architecture
Data Lake datasets which are available in Hive are stored in the Hadoop Distributed File System (HDFS). The Hive metastore is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
Some Data Lake datasets are available in Druid, which is separate from Hive and HDFS, and allows quick exploration and dashboarding of those datasets in Turnilo and Superset.
The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.
All Subpages of Data Platform/Data Lake
- Content
- Content/Mediawiki content history v1
- Content/Mediawiki wikitext current
- Content/Mediawiki wikitext history
- Content/Wikidata entity
- Content/Wikidata item page link
- Data Issues
- Data Issues/2021-02-09 Unique Devices By Family Overcount
- Data Issues/2021-06-04 Traffic Data Loss
- Data Issues/2023-01-08 Webrequest Data Loss
- Data Issues/2023-11 eventgate-analytics-external Data Loss
- Data Issues/2024-10-10 Webrequest Data Loss - Clobbered Hadoop Temporary Dir
- Edits
- Edits/Edit hourly
- Edits/Geoeditors
- Edits/Geoeditors/Public
- Edits/MediaWiki history
- Edits/MediaWiki history/Revision identity reverts
- Edits/MediaWiki history dumps
- Edits/MediaWiki history dumps/FAQ
- Edits/MediaWiki history dumps/Python spark examples
- Edits/MediaWiki history dumps/Scala spark examples
- Edits/Mediawiki history dumps/Python Dask examples
- Edits/Mediawiki history dumps/Python Pandas examples
- Edits/Mediawiki history reduced
- Edits/Mediawiki page history
- Edits/Mediawiki project namespace map
- Edits/Mediawiki user history
- Edits/Metrics
- Edits/Public
- Edits/Structured data/Commons entity
- Events
- Project History
- Public Data Lake
- Traffic
- Traffic/Banner activity
- Traffic/BotDetection
- Traffic/Browser general
- Traffic/Caching
- Traffic/Interlanguage
- Traffic/Mediacounts
- Traffic/Pagecounts-ez
- Traffic/Pageview actor
- Traffic/Pageview hourly
- Traffic/Pageview hourly/Fingerprinting Over Time
- Traffic/Pageview hourly/Identity reconstruction analysis
- Traffic/Pageview hourly/K Anonymity Threshold Analysis
- Traffic/Pageview hourly/Sanitization
- Traffic/Pageview hourly/Sanitization algorithm proposal
- Traffic/Pageviews
- Traffic/Pageviews/Bots
- Traffic/Pageviews/Bots Research
- Traffic/Pageviews/Redirects
- Traffic/Projectview hourly
- Traffic/ReaderCounts
- Traffic/SessionLength
- Traffic/Unique Devices
- Traffic/Unique Devices/Automated traffic correction
- Traffic/Unique Devices/Last access solution
- Traffic/Unique Devices/Last access solution/Validation
- Traffic/UserRetention
- Traffic/Virtualpageview hourly
- Traffic/Webrequest
- Traffic/Webrequest/RawIPUsage
- Traffic/Webrequest/Tagging
- Traffic/mediawiki api request
- Traffic/mobile apps session metrics
- Traffic/mobile apps uniques
- Traffic/referrer daily
- Traffic/referrer daily/Dashboard