Data Platform/Data Lake
The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake).
Data available
- Traffic data
- Webrequest, pageviews, and unique devices
- Edits data
- Historical data about revisions, pages, and users (e.g. MediaWiki History)
- Content data
- Wikitext (latest & historical) and wikidata-entities
- Events data
- EventLogging, EventBus and event streams data (raw, refined, sanitized)
- Commons Impact Metrics
- Contributions to Wikimedia Commons focused on GLAMs
Some of these datasets (such as webrequest) are only available in the Data Lake, while others (such as pageviews) are also available in more aggregated form as Druid data cubes.
SQL
The simplest way to access data in the Data Lake is to run SQL queries using either Presto or Spark (Hive is also available, but inferior and deprecated).
You can access these engines through several different routes:
- Superset has a graphical SQL editor where you can run Presto queries
- Custom code on one of the stat hosts (the easiest way to do this is to use our Jupyter service)
- for Python, use the Wmfdata-Python package
- for R, use the wmfdata-r package
- Both also support ad-hoc querying via command-line programs which you can use on one of the stat hosts. Consult the engine's documentation page for more info.
ETL
A common Data Lake task is to process data from existing tables and then persist it into a new table, a pattern which is called "extract, transform, load". Here are some recommended ways to do it.
Manually using Spark
For one-off ETLs, you can use Spark to manually transform data and write the results back to the Data Lake.
If you're using Spark SQL, this would be CREATE TABLE ... USING ICEBERG AS SELECT ....[1] (Or INSERT INTO ... SELECT ... if the table already exists).
Manually using Pandas
If you find it much easier to implement your processing in Pandas or you already have Pandas code, you can pull data into Pandas using Wmfdata's Spark or Presto modules, process it as you like, and then write it back using Wmfdata's spark.upload_pandas.
However, this is inefficient compared to doing it entirely in Spark, because it requires downloading all the data to a stat host, loading it into memory, and then uploading it back to the Data Lake. In particular, if your source data is large (say, 10 GB or more), you begin to run the risk of exhausting the available memory on the stat host.
Scheduled using Dbt
If you want to run a periodic ETL job (for example, updating a derived table every day with new data), the easiest way is using Dbt. You simply write your query in SQL and define some metadata (like how often you want the job run) in YAML files.
Note that, as of April 2026, Dbt has just been added to the Data Platform, so you may encounter some rough edges. But you'll help smooth them out for everyone who comes after!
Scheduled using Airflow
Setting up a job in Airflow is significantly more complicated than in Dbt, but it also gives you much greater capability. For example, you are not limited to SQL; you can do your data processing using custom Python code (including using custom dependencies).
Table and file formats
Data Lake tables can be created using either Hive format or Iceberg format. Iceberg is the successor to Hive, and highly recommended for new tables. As of Feb 2024, the existing tables in the wmf database are being slowly migrated to Iceberg (task T333013).
Both table formats can store data using a variety of underlying file formats; we normally use Parquet with both Hive and Iceberg.
Specifying the table format
When writing tables with Spark, you specify the table format with the USING clause in a CREATE TABLE statement or using the format option in one of the APIs. The proper names of each format are hive and iceberg. Note that you can pass other values for the format (the data source, in Spark's terminology), but these are the two recommended ones.
Hive is the default table format. With either table format, Parquet is the default file format, with Snappy compression for Hive tables and Gzip compression for Iceberg tables.
Checking the table format
You can check the format of an existing table using DESCRIBE TABLE EXTENDED. The table format is shown in the Provider field.[2]
Production databases
As Data Platform/Systems/Iceberg#Changes to database names mentions, many Hive tables would be split across functionally distinct databases in migration to Iceberg. Other databases have also been created since the original proposal.
The location of the individual tables in each database may differ from the location of the database. For example: the location of wmf_traffic.session_length is /wmf/data/wmf_traffic/session_length in HDFS.
| Database | Functional use | Location | Sample of tables | Table formats |
|---|---|---|---|---|
| wmf_raw
(DataHub) |
Raw data (e.g. unrefined webrequests, sqooped MediaWiki tables) | wmf_raw.db | mediawiki_*, mediawiki_private_*, webrequest, cirrussearchrequestset, wikibase_*, wikilambda_* | Hive |
| wmf
(DataHub) |
Production datasets, including refined webrequests | wmf.db | webrequest, mediawiki_history, pageview_hourly, pageview_actor, virtualpageview_hourly, geoeditors_monthly | Hive |
| wmf_content
(DataHub) |
Data and metrics related to Content (e.g. wikitext) | wmf_content.db | mediawiki_content_history_v1 | Iceberg |
| wmf_contributors
(DataHub) |
Data and metrics related to Contributors (e.g. editors, moderators) | wmf_contributors.db | editor_month, commons_edits | Iceberg |
| wmf_data_ops | Meta data about other tables, such as data quality | wmf_data_ops.db | data_quality_metrics, data_quality_alerts | Iceberg |
| wmf_experiments | For experiments conducted with Experimentation Lab | wmf_experiments.db | experiment_results_v1, experiment_configs_v1 | Iceberg |
| wmf_product
(DataHub) |
Product health metrics | wmf_product.db | trust_safety_admin_action_monthly, automoderator_potential_vandalism_reverted, cx_suggestions_menu_interactions_daily | Hive & Iceberg |
| wmf_readership
(DataHub) |
Data and metrics related to Readership | wmf_readership.db | unique_devices_per_domain_monthly,
unique_devices_per_project_family_daily |
Iceberg |
| wmf_traffic
(DataHub) |
Metrics related to traffic to Wikimedia properties | wmf_traffic.db | referrer_daily, browser_general, session_length | Iceberg |
Unless stated otherwise, each database is located inside hdfs://analytics-hadoop/user/hive/warehouse/
Technical architecture
Data Lake datasets are stored in the Hadoop Distributed File System (HDFS). The Hive metastore is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
Some Data Lake datasets are available in Druid, which is separate from from the Data Lake and HDFS, and allows quick exploration and dashboarding of those datasets in Turnilo and Superset.
The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.
All Subpages of Data Platform/Data Lake
- Content
- Content/Mediawiki content current v1
- Content/Mediawiki content history v1
- Content/Wikidata entity
- Content/Wikidata item page link
- Data Issues
- Data Issues/2021-02-09 Unique Devices By Family Overcount
- Data Issues/2021-06-04 Traffic Data Loss
- Data Issues/2023-01-08 Webrequest Data Loss
- Data Issues/2023-11 eventgate-analytics-external Data Loss
- Data Issues/2024-09-20 Unique Devices by Family Inflated Due to Miscategorized Traffic
- Data Issues/2024-10-10 Webrequest Data Loss - Clobbered Hadoop Temporary Dir
- Data Issues/2025-04-11 Mediawiki History duplicate revisions and excess reverts
- Data Issues/2025-06-03 May 2025 spike in bot traffic
- Data Issues/2025-06-30 Haproxykafka silently stopped sending request data to Kafka
- Edits
- Edits/Edit hourly
- Edits/Geoeditors
- Edits/Geoeditors/Public
- Edits/MediaWiki history
- Edits/MediaWiki history/Revision identity reverts
- Edits/MediaWiki history dumps
- Edits/MediaWiki history dumps/FAQ
- Edits/MediaWiki history dumps/Python spark examples
- Edits/MediaWiki history dumps/Scala spark examples
- Edits/Mediawiki history dumps/Python Dask examples
- Edits/Mediawiki history dumps/Python Pandas examples
- Edits/Mediawiki history reduced
- Edits/Mediawiki page history
- Edits/Mediawiki project namespace map
- Edits/Mediawiki user history
- Edits/Metrics
- Edits/Public
- Edits/Structured data/Commons entity
- Events
- Project History
- Public Data Lake
- SQL
- Traffic
- Traffic/Banner activity
- Traffic/BotDetection
- Traffic/Browser general
- Traffic/Caching
- Traffic/Interlanguage
- Traffic/Mediacounts
- Traffic/Pagecounts-ez
- Traffic/Pageview actor
- Traffic/Pageview hourly
- Traffic/Pageview hourly/Fingerprinting Over Time
- Traffic/Pageview hourly/Identity reconstruction analysis
- Traffic/Pageview hourly/K Anonymity Threshold Analysis
- Traffic/Pageview hourly/Sanitization
- Traffic/Pageview hourly/Sanitization algorithm proposal
- Traffic/Pageviews
- Traffic/Pageviews/Bots
- Traffic/Pageviews/Bots Research
- Traffic/Pageviews/Redirects
- Traffic/Projectview hourly
- Traffic/ReaderCounts
- Traffic/SessionLength
- Traffic/Unique Devices
- Traffic/Unique Devices/Automated traffic correction
- Traffic/Unique Devices/Last access solution
- Traffic/Unique Devices/Last access solution/Validation
- Traffic/UserRetention
- Traffic/Virtualpageview hourly
- Traffic/Webrequest
- Traffic/Webrequest/April 2025 Varnish to HAProxy Migration
- Traffic/Webrequest/RawIPUsage
- Traffic/Webrequest/Tagging
- Traffic/mediawiki api request
- Traffic/mobile apps session metrics
- Traffic/mobile apps uniques
- Traffic/referrer daily
- Traffic/referrer daily/Dashboard
- ↑ The
USING ICEBERGclause creates the table using the new and recommended Iceberg table format. If you omit it, the table will use the default Hive format. - ↑ Weirdly, the
Providerfield is unreliable if you get it fromSHOW TABLE EXTENDED. In that case, Iceberg tables showhive.