The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake).
Technically, data in the Data Lake is stored in HDFS (the Hadoop Distributed File System), usually in the Parquet file format. The Hive metastore is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
Data in the Data Lake can be accessed directly through the hdfs command-line tool.
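For example, a minimal sketch of browsing the Data Lake's files from an Analytics client. The /wmf/data path used here is the cluster's conventional data root and is an assumption in this sketch; use -ls to explore the actual layout. The script checks for the hdfs client first so it degrades gracefully off-cluster.

```shell
if command -v hdfs >/dev/null 2>&1; then
    # Browse the top level of the Data Lake's directory tree
    hdfs dfs -ls /wmf/data
    # Check how much space one subtree occupies (human-readable sizes)
    hdfs dfs -du -h /wmf/data/wmf
else
    # Not on an Analytics client: the Hadoop client tools are absent
    echo "hdfs client not found: run this on an Analytics client"
fi
```

Note that hdfs exposes the raw Parquet files; for day-to-day analysis the SQL engines described below are usually more convenient.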
As of September 2020, you have a choice of three engines that can run SQL queries against the Data Lake: Presto, Spark SQL, and Hive. If you're not sure which to choose, Hive is a good place to start. All three engines can be used from the Analytics clients.
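Because all three engines read table definitions from the same Hive metastore, a standard SQL query will typically run unchanged on any of them. A minimal sketch follows; the wmf.pageview_hourly table and its project, view_count, and year/month/day/hour partition columns are assumptions not stated above, so check the metastore for the actual schema before running it.

```sql
-- Illustrative query: top projects by pageviews for a single hour.
-- Filtering on the partition columns (year/month/day/hour) keeps the
-- engine from scanning the whole dataset.
SELECT project,
       SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2020 AND month = 9 AND day = 1 AND hour = 0
GROUP BY project
ORDER BY views DESC
LIMIT 10;
```

The same text can be submitted through the Presto CLI, a Spark SQL session, or the Hive CLI; only the client tooling differs.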
Datasets available in the Data Lake include:
- Traffic data -- webrequest, pageviews, unique devices, and more
- Edits data -- historical data about revisions, pages, and users
- Content data -- wikitext and Wikidata entities
- Events data -- EventLogging, EventBus, and EventStreams data (raw, refined, and sanitized)
- ORES scores -- machine learning predictions (available as events as of 2020-02-27)
The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.