Data Platform/Transform data
This page describes the process and internal tools for creating datasets and reports based on private/internal data sources. For info about publicly-accessible resources and data, see meta:Research:Data.
Before you start
This page assumes you have already identified datasets relevant for your analysis, and that you can access and query that data using internal analysis tools.
Before you create a new table or dataset, check the existing data sources and datasets in DataHub to see if the data you need is already there. If not, is there a similar table that could be updated to meet your needs?
To define a new instrument, generate new product metrics, or run experiments, use the Metrics Platform documentation.
Plan the data lifecycle
Get approval for new data collection
If you intend to collect a new type of data or design a new instrument for experimentation or product analysis, follow these policies and procedures to submit your data collection plan for review and approval:
- Data Collection Guidelines (draft, currently internal only, succeeds Instrumentation DACI; will eventually be posted to Foundation wiki)
- WMF staff should use the Legal, Safety and Security Service Center (L3SC) to submit a request to have data collection plans reviewed and approved.
- Guide on measurement plans and instrumentation specifications (draft) and the Instrumentation process and spec template (Google sheet)
Model and document your data
Follow the process defined in the Data modeling guidelines to define your schema, connect with data stewards and technical stewards, and determine who will build the dataset.
If you're defining a new instrument to collect data, follow the Metrics Platform workflow guides.
Follow the documentation guidelines for the type of data you're producing or collecting:
- Instrument documentation
- Data catalog documentation guide
- TODO: more comprehensive dataset documentation guidelines and requirements phab:T349103
To find existing dataset documentation, see Discover data.
Build your table or dataset
Use Airflow to run jobs and schedule batch workflows that generate new Data Lake tables, metrics, or other transformations based on internal data sources. A minimal DAG sketch follows the references below.
- Developer guide: Create Airflow DAGs and queries
- Tutorial: Python job repository
- References:
  - Spark
  - Hive queries and troubleshooting (support for Hive querying is being phased out)
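The sketch below shows the general shape of a scheduled batch job using generic Airflow 2.x APIs. Every name in it (the DAG id, the job path, the arguments) is a placeholder; WMF's airflow-dags repository defines its own project layout and custom operators, so follow the developer guide above for the actual conventions.

```python
# A minimal sketch of a daily batch DAG, assuming generic Airflow 2.x APIs.
# All names here (dag_id, job path, arguments) are placeholders; WMF's
# airflow-dags repository has its own conventions and custom operators.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="example_daily_aggregate",          # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",                # run once per day
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    SparkSubmitOperator(
        task_id="build_daily_aggregate",
        application="hdfs:///path/to/aggregate_job.py",  # placeholder Spark job
        application_args=["--date", "{{ ds }}"],         # Airflow templates in the run date
    )
```

Each run submits the Spark job for one day's partition; Airflow fills in the {{ ds }} template with the run date.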
To produce and consume instrumentation data from WMF products, use the Metrics Platform. It provides standard product metrics schemas and client libraries for data collection using Event Platform.
If your data collection plans are approved, get started instrumenting your event data collection:
- See the Event instrumentation tutorial and the Metrics Platform workflow guides for how to write and test your instrumentation code locally.
Advanced topics for data engineers
Data pipelines and stream processing
Event Platform schemas
Schemas define the structure of event data. They enable the Event Platform to validate data, and ensure that consumers can rely upon and integrate with it.
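As an illustration, the hypothetical event below shows the kind of structure a schema constrains. The $schema and meta.stream fields follow Event Platform conventions; the schema URI, stream name, and action field are invented for this example and would need to be defined in the schema repositories and stream configuration.

```python
# A hypothetical event for illustration only. $schema and meta.stream follow
# Event Platform conventions; the schema URI, stream name, and "action" field
# are invented and would have to exist in the schema repos and stream config.
example_event = {
    "$schema": "/analytics/example/button_click/1.0.0",  # placeholder schema URI
    "meta": {"stream": "example.button_click"},          # placeholder stream name
    "dt": "2024-01-01T00:00:00Z",                        # event timestamp (ISO 8601)
    "action": "click",                                    # schema-specific field
}
```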
Table formats and storage
You can store data in private namespaces in Hive or Iceberg, but product data should be in Iceberg (for exceptions, contact the team).
Hive is a data storage framework that enables you to use SQL to work with various file formats stored in HDFS. The "Hive metastore" is a centralized repository for metadata about these data files stored in the Data Lake, and all three SQL query engines WMF uses (Presto, Spark SQL, and Hive) rely on it.
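Because the engines share the metastore, a table registered by one engine can be read by another. The sketch below assumes a PySpark session on an analytics client and uses a hypothetical table name; it is not tied to any specific dataset.

```python
# A minimal sketch, assuming a PySpark session on an analytics client.
# The table name wmf.example_table and the snapshot column are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-metastore-query")
    .enableHiveSupport()   # resolve tables through the shared Hive metastore
    .getOrCreate()
)

# Spark resolves the table via the metastore, whether the underlying
# storage format is Hive or Iceberg.
df = spark.sql("""
    SELECT some_dimension, COUNT(*) AS n
    FROM wmf.example_table
    WHERE snapshot = '2024-01'
    GROUP BY some_dimension
""")
df.show()
```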
Advanced topics for data engineers
- Cassandra
Share data and dashboards
Before you publish any data
- Data Publication guidelines
- How to use data publication guidelines to evaluate risk and make publication decisions: GDrive, YouTube
Policies:
- WMF Privacy Policy
- Country and Territory Protection List, accessible via the canonical_data.countries table in the Data Lake (source docs); a query sketch follows this list
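For example, before publishing per-country breakdowns you can check your results against canonical_data.countries. The sketch below assumes a PySpark session; the column names (iso_code, name, is_protected) are assumptions for illustration, so check the table's documentation for its actual schema.

```python
# A minimal sketch, assuming a PySpark session on an analytics client.
# Column names (iso_code, name, is_protected) are assumptions for illustration;
# check the canonical_data.countries documentation for the real schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

protected_countries = spark.sql("""
    SELECT iso_code, name
    FROM canonical_data.countries
    WHERE is_protected
""")
protected_countries.show()
```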
Procedures:
Share queries and visualizations
- Publishing Jupyter notebooks on GitHub or GitLab
- Example Quarto publication: https://kcvelaga.quarto.pub/cx-deletion-rate-variables-2024/
Turnilo is a web interface that provides self-service access to data stored in Druid. In Turnilo, users who don't have full access to WMF private data can explore aggregate metrics without writing queries. However, Turnilo has some technical limitations that make it less accurate and precise than Superset.
- To access Turnilo, you need a Developer account and wmf or nda LDAP access.
- Druid data tables in Superset/Turnilo
- Turnilo documentation
Go to Turnilo: turnilo.wikimedia.org
Superset is a web interface for data visualization and exploration. Like Turnilo, it provides access to Druid tables, but it also has access to data in Hive (and elsewhere) via Presto, and it offers more advanced slicing-and-dicing options.
Tools and platforms for publishing data externally
analytics.wikimedia.org is a static site that serves WMF analytics dashboards and data downloads.
- Site documentation
- Web publication: Process for publishing ad-hoc, low-risk datasets, notebooks, or other research products on the site
Dashiki is a dashboarding tool that lets users declare dashboards by using configuration pages on a wiki.
- Dashiki dashboard tutorial
- Example dashboards:
- Pageviews (public)
- Browser statistics (public)
Manage published data
Maintenance and monitoring
TODO: are there dashboards where people can check the status of canonical data pipeline generation runs on which their datasets depend?
Retention and deletion
- Data Retention Guidelines
- Event data retention: data retention practices for events, and privacy best practices for creating or modifying event schemas
- Event Sanitization: processes used with Event Platform data to retain event data in Hive beyond the standard 90-day retention period.
- Dataset archiving and deletion