Data Platform/Transform data

This page describes the process and internal tools for creating datasets and reports based on private/internal data sources. For info about publicly-accessible resources and data, see meta:Research:Data.

Before you start

This page assumes you have already identified datasets relevant for your analysis, and that you can access and query that data using internal analysis tools.

Before you create a new table or dataset, check the existing data sources and datasets in DataHub to see if the data you need is already there. If not, is there a similar table that could be updated to meet your needs?

Plan data lifecycle

FIXME: Update this section when data lifecycle documentation is more complete

Get approval for new data collection

If you intend to collect a new type of data or design a new instrument for experimentation or product analysis, follow the data collection policies and procedures to submit and get approval for your data collection activity.

Model and document your data

Data modeling

Follow the process defined in the Data modeling guidelines to define your schema, connect with data stewards and technical stewards, and determine who will build the dataset.

If you're defining a new instrument to collect data, follow the Metrics Platform workflow guides.

Data documentation

Follow the documentation guidelines for the type of data you're producing or collecting.

To find existing dataset documentation, see Discover data.

Build your table or dataset

Batch transforms

Use Airflow to run jobs and schedule batch workflows that generate new Data Lake tables, metrics, or other transformations based on internal data sources.
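
As a rough illustration only (not WMF's actual DAG conventions), a batch transform is typically an Airflow DAG that runs a Spark job once per time period. In the sketch below, the DAG name, job path, and Spark connection are placeholders:

```python
# Minimal, illustrative Airflow DAG: run a daily Spark job that writes a Data Lake table.
# The dag_id, application path, and connection id are placeholders, not WMF conventions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="example_daily_metrics",           # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                        # one run per day of source data
    catchup=True,                             # backfill any missed days
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    compute_metrics = SparkSubmitOperator(
        task_id="compute_metrics",
        application="hdfs:///path/to/compute_metrics.py",  # placeholder Spark job
        application_args=["--day", "{{ ds }}"],            # pass the run date to the job
        conn_id="spark_default",
    )
```

Check the Airflow documentation on this wiki for the actual repositories, operators, and scheduling patterns to use before writing a production DAG.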

Event data, instrumentation and experiments

To produce and consume instrumentation data from WMF products, use the Metrics Platform. It provides standard product metrics schemas and client libraries for data collection using Event Platform.

If your data collection plans are approved, get started instrumenting your event data collection.
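
For orientation only, here is a rough sketch of what a single event submission can look like at the HTTP level. The schema URI, stream name, and intake URL are made-up placeholders, and in practice the Metrics Platform client libraries construct and send events for you:

```python
# Illustrative only: submit one event to an EventGate-style intake endpoint.
# All names and URLs below are placeholders.
import datetime
import requests

event = {
    "$schema": "/analytics/example/1.0.0",        # hypothetical schema URI
    "meta": {"stream": "example.button_click"},   # hypothetical stream name
    "dt": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "action": "click",
}

resp = requests.post(
    "https://intake.example.org/v1/events",  # placeholder intake URL
    json=[event],                            # the intake service accepts a list of events
    timeout=5,
)
resp.raise_for_status()
```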

Advanced topics for data engineers
Data pipelines and stream processing
Event Platform schemas

Schemas define the structure of event data. They enable the Event Platform to validate data, and ensure that consumers can rely upon and integrate with it.
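
As a simplified sketch of what that validation means in practice (the schema fragment below is made up, not a real Event Platform schema), each event is checked against the JSONSchema declared for its stream:

```python
# Sketch: validate an event against a minimal, made-up JSONSchema fragment.
import jsonschema

schema = {
    "type": "object",
    "required": ["$schema", "meta", "action"],
    "properties": {
        "$schema": {"type": "string"},
        "meta": {
            "type": "object",
            "required": ["stream"],
            "properties": {"stream": {"type": "string"}},
        },
        "action": {"type": "string"},
    },
}

event = {
    "$schema": "/analytics/example/1.0.0",
    "meta": {"stream": "example.button_click"},
    "action": "click",
}

jsonschema.validate(event, schema)  # raises ValidationError if the event doesn't conform
```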

Table formats and storage

You can store data in private namespaces in Hive or Iceberg, but product data should be in Iceberg (for exceptions, contact the team).

Iceberg is the successor to Hive. Both Hive and Iceberg table formats can store data using a variety of underlying file formats; WMF normally uses Parquet.

Hive is a data storage framework that enables you to use SQL to work with various file formats stored in HDFS. The "Hive metastore" is a centralized repository for metadata about these data files stored in the Data Lake, and all three SQL query engines WMF uses (Presto, Spark SQL, and Hive) rely on it.
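
As an illustrative sketch (table names are placeholders, and it assumes a Spark session already configured with Hive support and an Iceberg catalog), a typical transformation reads existing Data Lake tables with Spark SQL and writes the result out as an Iceberg table:

```python
# Minimal PySpark sketch: read from an existing Data Lake table and write the result
# to a new Iceberg table. Database and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example_transform")
    .enableHiveSupport()   # use the Hive metastore to resolve table names
    .getOrCreate()
)

daily_counts = spark.sql("""
    SELECT wiki_db, COUNT(*) AS edit_count
    FROM some_database.some_edits_table      -- placeholder source table
    WHERE snapshot = '2024-01'
    GROUP BY wiki_db
""")

# "using('iceberg')" writes the new table in the Iceberg format.
daily_counts.writeTo("my_database.daily_edit_counts").using("iceberg").createOrReplace()
```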

Some Data Lake datasets are available in Druid, which is separate from Hive and HDFS, and allows quick exploration and dashboarding of those datasets in Turnilo and Superset.

Advanced topics for data engineers

Cassandra:

Share data and dashboards

Before you publish any data

Learn how to apply the Data Publication guidelines
Follow policies and procedures

Share queries and visualizations

Turnilo

Turnilo is a web interface that provides self-service access to data stored in Druid. In Turnilo, users who don't have full access to WMF private data can explore aggregate metrics without writing queries. However, Turnilo has some technical limitations that make it less accurate and precise than Superset.

Go to Turnilo: turnilo.wikimedia.org

Superset

Superset is a web interface for data visualization and exploration. Like Turnilo, it provides access to Druid tables, but it also has access to data in Hive (and elsewhere) via Presto, and it offers more advanced slicing-and-dicing options.

Go to Superset

Tools and platforms for publishing data externally

analytics.wikimedia.org

analytics.wikimedia.org is a static site that serves WMF analytics dashboards and data downloads.

Dashiki is a dashboarding tool that lets users declare dashboards by using configuration pages on a wiki.

Manage published data

Maintenance and monitoring

TODO: are there dashboards where people can check the status of canonical data pipeline generation runs on which their datasets depend?

Retention and deletion