User:Triciaburmeister/Sandbox/Data platform/Collect data

From Wikitech

This page describes systems for collecting and generating private Wikimedia data. For public data, see meta:Research:Data.

Policies and procedures

Data collection (work in progress)
Data retention and sanitization

Collect event data and run experiments

Metrics Platform is a suite of services, standard libraries, and APIs for producing and consuming instrumentation data of all kinds from Wikimedia Foundation products. It mainly consists of standard product metrics schemas and client library implementations. It is built on top of the Event Platform.

FIXME: The Metrics Platform documentation overlaps with the Event Platform documentation linked below. These should be integrated, so this page doesn't have to link to multiple pages with similar content for each of the major user tasks.

This page links to the core user docs for how to collect event data and code instruments; for full documentation, see the Event Platform main page or the Metrics Platform main page.

Create schemas

Schemas define the structure of event data. They enable the Event Platform to validate incoming events and ensure that consumers can rely on a stable structure to integrate with.
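As a hedged illustration only: Event Platform schemas are JSONSchema documents, conventionally written as YAML. The schema title and field names below are hypothetical, not an actual schema from the schema repositories.

```yaml
# Hypothetical fragment of an Event Platform schema (JSONSchema as YAML).
# The title and the button_id field are made-up examples.
title: analytics/button_click
description: Fired when a user clicks a button.
$id: /analytics/button_click/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
properties:
  $schema:
    type: string
    description: URI of the schema this event conforms to.
  dt:
    type: string
    format: date-time
    description: UTC event timestamp, ISO-8601 format.
  button_id:
    type: string
    description: Identifier of the button that was clicked (example field).
required:
  - $schema
  - dt
```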

Write instrumentation code

Get your data collection plans approved, then instrument your event data collection:

Once Legal has approved your planned data collection activity, and your schema and instrumentation code have been reviewed and merged, start producing and collecting events by configuring and deploying an event stream:
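As a sketch of what deploying a stream involves: event streams are declared in MediaWiki configuration under wgEventStreams. The stream name, schema title, and event service below are hypothetical examples, not a real configuration entry.

```php
// Hypothetical wgEventStreams fragment (mediawiki-config style).
// Stream and schema names are made up for illustration.
'wgEventStreams' => [
    'default' => [
        'analytics.button_click' => [
            'schema_title' => 'analytics/button_click',
            'destination_event_service' => 'eventgate-analytics-external',
        ],
    ],
],
```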

View and query event data

Events are ingested into the event and event_sanitized databases in the Data Lake.

See the Instrumentation tutorial for how to consume events directly from Kafka or through the internal EventStreams instance.
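As a minimal sketch of querying ingested events: a stream's Hive table name in the event database is derived from the stream name, with characters such as `.` and `-` replaced by underscores. The stream name below is just an example, and the commented-out wmfdata call is an assumption about the tooling available on Wikimedia analytics clients.

```python
# Sketch: locating and querying a stream's Hive table in the Data Lake.
# Assumes the convention that stream names map to table names with
# non-alphanumeric characters replaced by underscores.

def stream_to_table(stream: str, database: str = "event") -> str:
    """Map an event stream name to its Hive table in the Data Lake."""
    table = stream.replace(".", "_").replace("-", "_")
    return f"{database}.{table}"

table = stream_to_table("mediawiki.page-create")
print(table)  # event.mediawiki_page_create

# Event tables are partitioned by time, so restrict queries to a partition:
query = f"""
SELECT dt, meta.domain
FROM {table}
WHERE year = 2024 AND month = 1 AND day = 15
LIMIT 10
"""

# On an analytics client, you might run this with wmfdata-python, e.g.:
#   import wmfdata
#   df = wmfdata.spark.run(query)
```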

Other sources of user interaction data

Matomo is a small-scale web analytics platform, mostly used for Wikimedia microsites (roughly 10,000 requests per day or less).

Go to Matomo

Search engine performance data is available to site owners. This data isn't included in the Data Lake and requires a separate access request.