User:Triciaburmeister/Sandbox/Data platform/Publish data

This page is currently a draft.
More information and discussion about changes to this draft on the talk page.

FIXME: when publishing, rename this to "Transform and publish data"

This page describes internal tools for creating datasets and reports based on private/internal data sources. For info about publicly-accessible resources and data, see meta:Research:Data.

Before you start

This page assumes you have already identified datasets relevant for your analysis, and that you can access and query that data using internal analysis tools.

Before you create a new table or dataset, check the existing data sources and datasets in DataHub to see if the data you need is already there. If not, is there a similar table that could be updated to meet your needs?

Plan data lifecyle

FIXME: Update this section when data lifecycle documentation is more complete

WIP dataset creation process: Google doc
Data lifecycle management process

Get approval for new data collection

If you intend to collect a new type of data or design a new instrument for experimentation or product analysis, follow these data collection policies and procedures to submit and get approval for your data collection activity:

Data Collection Guidelines (draft, currently internal only, succeeds Instrumentation DACI; will eventually be posted to Foundation wiki)
WMF staff should use the Legal, Safety and Security Service Center (L3SC) to submit a request to have data collection plans reviewed and approved.
(draft) guide on measurement plans and instrumentation specifications and Instrumentation process and spec template(Google sheet)

Model and document your data

Data modeling

Follow the process defined in the Data modeling guidelines to define your schema, connect with data stewards and technical stewards, and determine who will build the dataset.

If you're defining a new instrument to collect data, follow the Metrics Platform "Create your first instrument" tutorial.

Data documentation

Follow the documentation guidelines for the type of data you're producing or collecting:

Instrument documentation
Data catalog documentation guide
TODO: more comprehensive dataset documentation guidelines and requirements phab:T349103

Build your table or dataset

Batch transforms

Use Airflow to run jobs and schedule batch workflows that generate new Data Lake tables, metrics, or other transformations based on internal data sources.

Developer guide: Create Airflow DAGS and queries
Tutorial: Python job repository
References:
- Spark
- Hive queries and troubleshooting(support for Hive querying is being phased out)

Event data, instrumentation and experiments

To produce and consume instrumentation data from WMF products, use the Metrics Platform. It provides standard product metrics schemas and client libraries for data collection using Event Platform.

If your data collection plans are approved, get started instrumenting your event data collection:

See the Event instrumentation tutorial and "Create First Metrics Platform Instrument" for how to write and test your instrumentation code locally.

Advanced topics for data engineers

Data pipelines and stream processing

Flink (stream processing)
MediaWiki Event Enrichment
Data Lake pipelines
TODO: what else should be linked here?

Event Platform schemas

Schemas define the structure of event data. They enable the Event Platform to validate data, and ensure that consumers can rely upon and integrate with it.

Stream configuration and deployment

Event Platform: Stream configuration guide
Metrics Platform: Creating a stream configuration
Stream deployment
Event utilities: code libraries interacting with stream config, schemas and producing events to Kafka

Table formats and storage

Iceberg

Iceberg is the successor to Hive. Both Hive and Iceberg table formats can store data using a variety of underlying file formats; WMF normally uses Parquet.

Hive

Hive is a data storage framework that enables you to use SQL to work with various file formats stored in HDFS. The "Hive metastore" is a centralized repository for metadata about these data files stored in the Data Lake, and all three SQL query engines WMF uses (Presto, Spark SQL, and Hive) rely on it.

Druid

Some Data Lake datasets are available in Druid, which is separate from Hive and HDFS, and allows quick exploration and dashboarding of those datasets in Turnilo and Superset.

Advanced topics for data engineers

Cassandra:

The AQS Cassandra cluster stores Analytics Query Service (AQS) datasets and generated datasets, along with Image suggestions data.
TODO: Other references?

Share data and dashboards

Before you publish any data

Learn how to apply the Data Publication guidelines

Data Publication guidelines
How to use data publication guidelines to evaluate risk and make publication decisions: GDrive, YouTube

Follow policies and procedures

Policies:

WMF Privacy Policy
Country and Territory Protection List (accessible via:

canonical_data.countries in the Data Lake)(source docs)

Procedures:

Share queries and visualizations

GitHub, GitLab, and Jupyter notebooks

Publishing Jupyter notebooks on GitHub or GitLab
Example Quarto publication: https://kcvelaga.quarto.pub/cx-deletion-rate-variables-2024/

Turnilo

Turnilo is a web interface that provides self-service access to data stored in Druid. In Turnilo, users who don't have full access to WMF private data can explore aggregate metrics without writing queries. However, Turnilo has some technical limitations that make it less accurate and precise than Superset.

To access Turnilo, you need a Developer account and wmf or nda LDAP access.
Druid data tables in Superset/Turnilo
Turnilo documentation

Go to Turnilo: turnilo.wikimedia.org

Superset

Superset is a web interface for data visualization and exploration. Like Turnilo, it provides access to Druid tables, but it also has access to data in Hive (and elsewhere) via Presto, and it offers more advanced slicing-and-dicing options.

Go to Superset

Tools and platforms for publishing data externally

analytics.wikimedia.org

analytics.wikimedia.org is a static site that serves WMF analytics dashboards and data downloads.

Site documentation
Web publication: Process for publishing ad-hoc, low-risk datasets, notebooks, or other research products on the site

Dashiki

Dashiki is a dashboarding tool that lets users declare dashboards by using configuration pages on a wiki.

Dashiki dashboard tutorial
Example dashboards:
- Pageviews (public)
- Browser statistics (public)

Manage published data

Maintainance and monitoring

TODO: are there dashboards where people can check the status of canonical data pipeline generation runs on which their datasets depend?

Retention and deletion

Data Retention Guidelines
Event data retention: data retention practices for events, and privacy best practices for creating or modifying event schemas
Event Sanitization: processes used with Event Platform data for retaining event data in Hive beyond the standard 90 day retention period.