User:Triciaburmeister/Sandbox/Data platform/Collect data
This page's contents have been moved to the mainspace at Data_Platform. See project history in phab:T350911.
This page describes systems for collecting and generating private Wikimedia data. For public data, see meta:Research:Data.
Policies and procedures
- Data Collection Guidelines (draft; currently internal only; succeeds the Instrumentation DACI and will eventually be posted to the Foundation wiki)
- WMF staff should submit a request through the Legal, Safety and Security Service Center (L3SC) to have data collection plans reviewed and approved.
- Guide to measurement plans and instrumentation specifications (draft), and the Instrumentation process and spec template (Google Sheet)
- Data Retention Guidelines
- Event data retention: data retention practices for events, and privacy best practices for creating or modifying event schemas
- Event Sanitization: the process for retaining Event Platform event data in Hive beyond the standard 90-day retention period.
Collect event data and run experiments
Metrics Platform is a suite of services, standard libraries, and APIs for producing and consuming instrumentation data of all kinds from Wikimedia Foundation products. It mainly consists of standard product metrics schemas and client library implementations. It is built on top of the Event Platform.
This page links to the core user docs for how to collect event data and code instruments; for full documentation, see the Event Platform main page or the Metrics Platform main page.
Schemas define the structure of event data. They enable the Event Platform to validate data, and ensure that consumers can rely upon and integrate with it.
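To illustrate what schema validation buys you, here is a minimal sketch in Python using the jsonschema library. The schema fragment, schema URI, and stream name are hypothetical stand-ins, far simpler than a real Event Platform schema:

```python
from jsonschema import validate
from jsonschema.exceptions import ValidationError

# Hypothetical, heavily simplified schema fragment. Real Event Platform
# schemas are versioned JSONSchema files kept in the schema repositories.
schema = {
    "type": "object",
    "properties": {
        "$schema": {"type": "string"},
        "action": {"type": "string"},
        "meta": {
            "type": "object",
            "properties": {"stream": {"type": "string"}},
            "required": ["stream"],
        },
    },
    "required": ["$schema", "meta"],
}

event = {
    "$schema": "/analytics/example/1.0.0",  # illustrative schema URI
    "action": "click",
    "meta": {"stream": "example.stream"},   # illustrative stream name
}

try:
    validate(instance=event, schema=schema)
    print("event conforms to the schema")
except ValidationError as err:
    print("event rejected:", err.message)
```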
Get your data collection plans approved, then instrument your event data collection:
- Follow Data collection policies and procedures to submit and get approval for your data collection activity.
- See the Event instrumentation tutorial and "Create First Metrics Platform Instrument" for how to write and test your instrumentation code locally (a test-event sketch follows this list).
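As a rough sketch of local testing (not the tutorials' exact setup), an event can be produced by POSTing JSON to an EventGate intake endpoint. The URL, port, stream name, and schema URI below are placeholders for whatever your local dev instance uses:

```python
from datetime import datetime, timezone

import requests

# Placeholder endpoint for a locally running EventGate dev instance;
# adjust the URL, stream, and schema URI to match your own setup.
INTAKE_URL = "http://localhost:8192/v1/events"

event = {
    "$schema": "/analytics/example/1.0.0",   # hypothetical schema URI
    "meta": {"stream": "example.stream"},    # hypothetical stream name
    "dt": datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z"),
    "action": "click",
}

# EventGate accepts a JSON array of events per request.
resp = requests.post(INTAKE_URL, json=[event])
print(resp.status_code, resp.text)
```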
Once Legal has approved your planned data collection activity and your schema and instrumentation code have been reviewed and merged, start producing and collecting events by configuring and deploying an event stream:
- Event Platform: Stream configuration guide
- Metrics Platform: Creating a stream configuration
- Stream deployment
- Validate events
- Event utilities: code libraries for interacting with stream configs and schemas, and for producing events to Kafka (see the config-lookup sketch after this list)
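To inspect a deployed stream's settings, the EventStreamConfig API on Meta-Wiki can be queried directly. A minimal sketch, assuming the streamconfigs action and streams parameter behave as documented for that extension:

```python
import requests

# Look up the configuration of one stream via the MediaWiki action API.
# 'mediawiki.page_change.v1' is a real public stream used as an example;
# substitute the stream you deployed.
resp = requests.get(
    "https://meta.wikimedia.org/w/api.php",
    params={
        "action": "streamconfigs",
        "format": "json",
        "streams": "mediawiki.page_change.v1",
    },
    headers={"User-Agent": "stream-config-lookup-example"},
)
print(resp.json())
```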
Events are ingested into the event and event_sanitized databases in the Data Lake.
- The Hive table name is a normalized version of the stream name.
- After the data becomes available, you can access it with standard query tools and build dashboards on top of it (see the query sketch after this list).
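For example, from an analytics client you might query a stream's Hive table with the wmfdata-python library. This is a sketch: the table and field names are hypothetical, and it assumes the usual year/month/day partitioning of event tables:

```python
import wmfdata as wmf

# Count events per action for one day from a hypothetical stream's table.
# Event tables are partitioned, so always constrain year/month/day.
df = wmf.spark.run("""
    SELECT action, COUNT(*) AS events
    FROM event.example_stream            -- hypothetical table name
    WHERE year = 2024 AND month = 5 AND day = 1
    GROUP BY action
    ORDER BY events DESC
""")
print(df)
```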
See the Instrumentation tutorial for how to consume events directly from Kafka or through the internal EventStreams instance.
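As an illustration of the EventStreams route, the sketch below reads server-sent events from the public endpoint with plain requests; the internal instance works the same way with a different host. A real consumer would use an SSE client library that handles reconnection and multi-line events:

```python
import json

import requests

# Public EventStreams endpoint; the internal instance differs only in host.
URL = "https://stream.wikimedia.org/v2/stream/recentchange"

count = 0
with requests.get(URL, stream=True, timeout=60,
                  headers={"User-Agent": "eventstreams-example"}) as resp:
    for line in resp.iter_lines():
        # SSE frames each JSON payload in a line starting with "data: ".
        if line.startswith(b"data: "):
            event = json.loads(line[len(b"data: "):])
            print(event["meta"]["dt"], event["meta"]["stream"])
            count += 1
            if count >= 5:   # stop after a few events for the demo
                break
```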
Other sources of user interaction data
Matomo is a small-scale web analytics platform, mostly used for Wikimedia microsites (roughly 10,000 requests per day or less).
Search engine performance data is available for site owners. This data isn't included in the Data Lake and requires a separate access request.