User:Triciaburmeister/Sandbox/Data platform/Collect data
This page's contents have been moved to the mainspace at Data_Platform. See project history in phab:T350911.
This page describes systems for collecting and generating private Wikimedia data. For public data, see meta:Research:Data.
Policies and procedures
- Data Collection Guidelines (draft; currently internal only; succeeds the Instrumentation DACI and will eventually be posted to the Foundation wiki)
- WMF staff should submit a request through the Legal, Safety and Security Service Center (L3SC) to have data collection plans reviewed and approved.
- Guide to measurement plans and instrumentation specifications (draft), and the Instrumentation process and spec template (Google Sheet)
- Data Retention Guidelines
- Event data retention: data retention practices for events, and privacy best practices for creating or modifying event schemas
- Event Sanitization: the process for retaining Event Platform event data in Hive beyond the standard 90-day retention period.
Collect event data and run experiments
Metrics Platform is a suite of services, standard libraries, and APIs for producing and consuming instrumentation data of all kinds from Wikimedia Foundation products. It mainly consists of standard product metrics schemas and client library implementations. It is built on top of the Event Platform.
This page links to the core user docs for how to collect event data and code instruments; for full documentation, see the Event Platform main page or the Metrics Platform main page.
Schemas define the structure of event data. They enable the Event Platform to validate data, and ensure that consumers can rely upon and integrate with it.
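To illustrate what schema validation buys you, here is a minimal sketch in Python using the jsonschema library. The schema fragment, schema URI, and stream name are hypothetical stand-ins, far simpler than a real Event Platform schema:

```python
from jsonschema import validate
from jsonschema.exceptions import ValidationError

# Hypothetical, heavily simplified schema fragment. Real Event Platform
# schemas are versioned JSONSchema files kept in the schema repositories.
schema = {
    "type": "object",
    "properties": {
        "$schema": {"type": "string"},
        "action": {"type": "string"},
        "meta": {
            "type": "object",
            "properties": {"stream": {"type": "string"}},
            "required": ["stream"],
        },
    },
    "required": ["$schema", "meta"],
}

event = {
    "$schema": "/analytics/example/1.0.0",  # illustrative schema URI
    "action": "click",
    "meta": {"stream": "example.stream"},   # illustrative stream name
}

try:
    validate(instance=event, schema=schema)
    print("event conforms to the schema")
except ValidationError as err:
    print("event rejected:", err.message)
```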
Get your data collection plans approved, then instrument your event data collection:
- Follow Data collection policies and procedures to submit and get approval for your data collection activity.
- See the Event instrumentation tutorial and "Create First Metrics Platform Instrument" for how to write and test your instrumentation code locally (a test-event sketch follows this list).
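As a rough sketch of local testing (not the tutorials' exact setup), an event can be produced by POSTing JSON to an EventGate intake endpoint. The URL, port, stream name, and schema URI below are placeholders for whatever your local dev instance uses:

```python
from datetime import datetime, timezone

import requests

# Placeholder endpoint for a locally running EventGate dev instance;
# adjust the URL, stream, and schema URI to match your own setup.
INTAKE_URL = "http://localhost:8192/v1/events"

event = {
    "$schema": "/analytics/example/1.0.0",   # hypothetical schema URI
    "meta": {"stream": "example.stream"},    # hypothetical stream name
    "dt": datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z"),
    "action": "click",
}

# EventGate accepts a JSON array of events per request.
resp = requests.post(INTAKE_URL, json=[event])
print(resp.status_code, resp.text)
```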
Once Legal has approved your planned data collection activity and your schema and instrumentation code have been reviewed and merged, start producing and collecting events by configuring and deploying an event stream:
- Event Platform: Stream configuration guide
- Metrics Platform: Creating a stream configuration
- Stream deployment
- Validate events
- Event utilities: code libraries for interacting with stream configs and schemas, and for producing events to Kafka (see the config-lookup sketch after this list)
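To inspect a deployed stream's settings, the EventStreamConfig API on Meta-Wiki can be queried directly. A minimal sketch, assuming the streamconfigs action and streams parameter behave as documented for that extension:

```python
import requests

# Look up the configuration of one stream via the MediaWiki action API.
# 'mediawiki.page_change.v1' is a real public stream used as an example;
# substitute the stream you deployed.
resp = requests.get(
    "https://meta.wikimedia.org/w/api.php",
    params={
        "action": "streamconfigs",
        "format": "json",
        "streams": "mediawiki.page_change.v1",
    },
    headers={"User-Agent": "stream-config-lookup-example"},
)
print(resp.json())
```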
Events are ingested into the event and event_sanitized databases in the Data Lake.
- The Hive table name is a normalized version of the stream name.
- After the data becomes available, you can access it with standard query tools and build dashboards on top of it (see the query sketch after this list).
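For example, from an analytics client you might query a stream's Hive table with the wmfdata-python library. This is a sketch: the table and field names are hypothetical, and it assumes the usual year/month/day partitioning of event tables:

```python
import wmfdata as wmf

# Count events per action for one day from a hypothetical stream's table.
# Event tables are partitioned, so always constrain year/month/day.
df = wmf.spark.run("""
    SELECT action, COUNT(*) AS events
    FROM event.example_stream            -- hypothetical table name
    WHERE year = 2024 AND month = 5 AND day = 1
    GROUP BY action
    ORDER BY events DESC
""")
print(df)
```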
See the Instrumentation tutorial for how to consume events directly from Kafka or through the internal EventStreams instance.
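As an illustration of the EventStreams route, the sketch below reads server-sent events from the public endpoint with plain requests; the internal instance works the same way with a different host. A real consumer would use an SSE client library that handles reconnection and multi-line events:

```python
import json

import requests

# Public EventStreams endpoint; the internal instance differs only in host.
URL = "https://stream.wikimedia.org/v2/stream/recentchange"

count = 0
with requests.get(URL, stream=True, timeout=60,
                  headers={"User-Agent": "eventstreams-example"}) as resp:
    for line in resp.iter_lines():
        # SSE frames each JSON payload in a line starting with "data: ".
        if line.startswith(b"data: "):
            event = json.loads(line[len(b"data: "):])
            print(event["meta"]["dt"], event["meta"]["stream"])
            count += 1
            if count >= 5:   # stop after a few events for the demo
                break
```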
Other sources of user interaction data
Matomo is a small-scale web analytics platform, mostly used for Wikimedia microsites (roughly 10,000 requests per day or less).
Search engine performance data is available for site owners. This data isn't included in the Data Lake and requires a separate access request.