Analytics/Data Lake
The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake).
Data available
Note: Currently, you need production data access to use some of this data. A lot of it is available publicly at dumps.wikimedia.org.
- Traffic data
  - Webrequest, pageviews, and unique devices
- Edits data
  - Historical data about revisions, pages, and users (e.g. MediaWiki History)
- Content data
  - Wikitext (latest & historical) and Wikidata entities
- Events data
  - EventLogging, EventBus, and event stream data (raw, refined, sanitized)
Some of these datasets (such as webrequests) are only available in Hive, while others (such as pageviews) are also available as data cubes (usually in a more aggregated form).
Access
The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.
You can access these engines through several different routes:
- Superset has a graphical SQL editor where you can run Presto queries
- Hue has a graphical SQL editor where you can run Hive queries
- Custom code on one of the analytics clients (the easiest way to do this is to use our Jupyter service)
  - for Python, use the wmfdata-python package
  - for R, use the wmfdata-r package
All three engines also have command-line programs which you can use on one of the analytics clients. This is probably the least convenient way, but if you want to use it, consult the engine's documentation page.
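For example, here is a minimal sketch of running a query from a Jupyter notebook using wmfdata-python. The `wmf.pageview_hourly` table and the `wmfdata.presto.run` call are assumptions based on the packages and datasets mentioned above; the call itself only works on an analytics client with production data access, so it is left commented out:

```python
# Sketch: querying the Data Lake with wmfdata-python from Jupyter.
# Assumes an analytics client with wmfdata installed and access to
# the wmf.pageview_hourly table (both are assumptions, not verified).
query = """
SELECT year, month, day, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2024 AND month = 1 AND day = 1
GROUP BY year, month, day
"""

# On an analytics client, this would return a pandas DataFrame:
# import wmfdata
# df = wmfdata.presto.run(query)
```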
Differences between the SQL engines
For the most part, Presto, Hive, and Spark work the same way, but they have some differences in SQL syntax and processing power.
Syntax differences
- Spark and Hive use `STRING` as the keyword for string data, while Presto uses `VARCHAR`.
  - One consequence is a different method for transforming integer `year`/`month`/`day` fields to a date string.
    - Spark and Hive: `CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))` (casting to `STRING` is not actually required)
    - Presto: `CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0'))` (casting to `VARCHAR` is required)
- In Spark and Hive, you use the `SIZE` function to get the length of an array, while in Presto you use `CARDINALITY`.
- In Spark and Hive, double-quoted text (like `"foo"`) is interpreted as a string, while in Presto it is interpreted as a column name. It's easiest to use single-quoted text (like `'foo'`) for strings, since all three engines interpret it the same way.
- Spark and Hive have a `CONCAT_WS` ("concatenate with separator") function, but Presto does not.
- Spark supports both `FLOAT` and `REAL` as keywords for the 32-bit floating-point number data type, while Presto supports only `REAL`.
- Presto has no `FIRST` and `LAST` functions.
- If you need to use a keyword like `DATE` as a column name, you use backticks (`` `date` ``) in Spark and Hive, but double quotes (`"date"`) in Presto.
- To convert an ISO 8601 timestamp string (e.g. `"2021-11-01T01:23:02Z"`) to an SQL timestamp:
  - Spark: `TO_TIMESTAMP(dt)`
  - Presto: `FROM_ISO8601_TIMESTAMP(dt)`
  - Hive: `FROM_UNIXTIME(UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'"))`
- If you divide integers, Hive and Spark will return a floating-point number if necessary (e.g. `1 / 3` returns `0.333333`), while Presto will return only an integer (e.g. `1 / 3` returns `0`). Use `CAST(x AS REAL)` on an operand to work around this.
- Escaping special characters in string literals works differently in Spark and Presto. See this notebook for more details.
- See also: Presto's guide to migrating from Hive
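As a quick illustration of two of the differences above (a Python analogue, not engine code): the zero-padding in the date-string recipes and the integer-division behavior map onto Python like this:

```python
def partition_date(year: int, month: int, day: int) -> str:
    # Python analogue of the SQL pattern above:
    # CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))
    return f"{year}-{month:02d}-{day:02d}"

print(partition_date(2021, 11, 1))  # 2021-11-01

# Integer division: Hive and Spark behave like Python's /,
# while Presto truncates like Python's //.
print(1 / 3)   # Hive/Spark-style result: 0.333...
print(1 // 3)  # Presto-style result: 0
```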
Technical architecture
Data Lake datasets which are available in Hive are stored in the Hadoop Distributed File System (HDFS), usually in the Parquet file format. The Hive metastore is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it. Some Data Lake datasets are available in Druid, which is separate from Hive and HDFS, and allows quick exploration and dashboarding of those datasets in Turnilo and Superset.
The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.
All Subpages of Analytics/Data Lake
- Analytics/Data Lake/
- Content
- Content/Mediawiki wikitext current
- Content/Mediawiki wikitext history
- Content/Wikidata entity
- Content/Wikidata item page link
- Data Issues
- Data Issues/2021-02-09 Unique Devices By Family Overcount
- Data Issues/2021-06-04 Traffic Data Loss
- Data Issues/2023-01-08 Webrequest Data Loss
- Edits
- Edits/Edit hourly
- Edits/Geoeditors
- Edits/Geoeditors/Public
- Edits/MediaWiki history
- Edits/MediaWiki history/Revision identity reverts
- Edits/Mediawiki history dumps
- Edits/Mediawiki history dumps/FAQ
- Edits/Mediawiki history dumps/Python spark examples
- Edits/Mediawiki history dumps/Scala spark examples
- Edits/Mediawiki history reduced
- Edits/Mediawiki page history
- Edits/Mediawiki project namespace map
- Edits/Mediawiki user history
- Edits/Metrics
- Edits/Public
- Edits/Structured data/Commons entity
- Events
- Traffic
- Traffic/Banner activity
- Traffic/BotDetection
- Traffic/Browser general
- Traffic/Caching
- Traffic/Interlanguage
- Traffic/Mediacounts
- Traffic/Pagecounts-ez
- Traffic/Pageview actor
- Traffic/Pageview hourly
- Traffic/Pageview hourly/Fingerprinting Over Time
- Traffic/Pageview hourly/Identity reconstruction analysis
- Traffic/Pageview hourly/K Anonymity Threshold Analysis
- Traffic/Pageview hourly/Sanitization
- Traffic/Pageview hourly/Sanitization algorithm proposal
- Traffic/Pageviews
- Traffic/Pageviews/Bots
- Traffic/Pageviews/Bots Research
- Traffic/Pageviews/Redirects
- Traffic/Projectview hourly
- Traffic/ReaderCounts
- Traffic/SessionLength
- Traffic/Unique Devices
- Traffic/Unique Devices/Automated traffic correction
- Traffic/Unique Devices/Last access solution
- Traffic/Unique Devices/Last access solution/Validation
- Traffic/UserRetention
- Traffic/Virtualpageview hourly
- Traffic/Webrequest
- Traffic/Webrequest/RawIPUsage
- Traffic/Webrequest/Tagging
- Traffic/mediawiki api request
- Traffic/mobile apps session metrics
- Traffic/mobile apps uniques
- Traffic/referrer daily
- Traffic/referrer daily/Dashboard