Shared Data Platform

Wikimedia Shared Data Platform for Knowledge as a Service

Our communities build knowledge. We have data about how they do that, and we build knowledge out of that data. The Wikimedia Foundation has committed to delivering that collective knowledge to the world, as a service, in less than ten years. To that end, technology teams at WMF should collaboratively build a self-service data platform. Product teams should manage their data as a product for other engineers on a shared data platform. If we align these efforts, we can succeed together.

Disclaimer: this document is not a project proposal

This document is meant as a discussion starter. It highlights recurring problems faced by WMF engineering teams, and suggests a path forward. It is primarily meant to build alignment around a future idealized data platform at WMF. We hope that it will help us shift our engineering focus from only building interface products to treating the data created by an interface as a product in its own right.

DOCUMENT STATUS: As of 2022-04, this document is still evolving as feedback is collected. Feedback notes are in this document.

(original Google Doc draft)

Problem Statement

Ultimately, knowledge gets to the world via data that is stored on our hardware. We build interface products that make this data available to real people. Using data to build new products is difficult, unless the data is already in the shape that the product needs. Why is this so hard for us?

The authors believe it is because our data is not cared for as its own thing. Historically, we have prioritized user and application interface products. Real people use interfaces to create and consume data. This data is stored in a way that is only useful to the specific interface it originates from.

When an engineer or product team wants to build something new using existing data, there is no way for them to get that data in the format they need, in real time.

Both production and analytics products need data in order to be useful. For some products, all that is needed is access to data owned by another product, via its API. For many others, the required shape of data is highly specific. This has led to the creation of many custom and one-off data pipelines. The duplicated effort and inefficiencies involved in building one-off data pipelines mean slow project velocities and, more often, simply discourage people from building new products at all because of the difficulty involved. (Examples of some of these projects are listed at the end of this document.)

This can be rephrased as three distinct problems:

  • The domain data a service creates is not designed for sharing, making it difficult to use for future use cases.
  • Domain data is not discoverable or sourceable. There is no standard way for engineers to automate discovery and loading of data they might need.
  • When engineers need data for their service, they have to re-solve all of the engineering problems associated with data transport, transformation, storage, and serving.

How do we solve these problems so that we can move faster when building new products for our users?

Overview of current (2022) WMF data infrastructure

MediaWiki data (in MariaDB)

Most Wikimedia domain data comes from MediaWiki. This data is stored primarily in MariaDB databases[1][2] and OpenStack Swift clusters[3]. Only MediaWiki has direct access to these datastores[4]. If you want to build a service separate from MediaWiki with fresh data, your only option is to get the data you need from the MediaWiki API.
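
For example, a service that needs fresh page data today would poll the MediaWiki Action API. The sketch below is a minimal illustration of that; the wiki host, page title, and chosen parameters are only examples.

  import requests

  # Fetch the five most recent revisions of a page from the public MediaWiki Action API.
  # The wiki host and page title are illustrative; any Wikimedia wiki exposes this API.
  response = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={
          "action": "query",
          "prop": "revisions",
          "titles": "Data",
          "rvlimit": 5,
          "rvprop": "ids|timestamp|user|comment",
          "format": "json",
          "formatversion": 2,
      },
      headers={"User-Agent": "shared-data-platform-example/0.1 (example@example.org)"},
  )
  response.raise_for_status()
  for revision in response.json()["query"]["pages"][0]["revisions"]:
      print(revision["revid"], revision["timestamp"], revision["user"])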

MediaWiki data is exported monthly(ish) as public dumps and for ingestion into the Analytics Data Lake.

Event data (in Kafka)

MediaWiki and other services produce data (both application state and instrumentation data) as events via Event Platform. For the most part, this data does not represent the complete domain data of a service. At best, the event stream data at WMF can be used for notification of changes within a domain[5]. Because the logic that produces these events resides outside of MediaWiki core's control, mistakes[6][7] are often made in what data is produced and how.
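
As an illustration of what this event data looks like, the sketch below tails the public mediawiki.recentchange stream exposed by EventStreams (internal consumers would typically read the same streams directly from Kafka). The fields printed are only a small subset of each event.

  import json
  import requests

  # Tail the public mediawiki.recentchange stream (Server-Sent Events over HTTP).
  url = "https://stream.wikimedia.org/v2/stream/recentchange"
  headers = {"User-Agent": "shared-data-platform-example/0.1 (example@example.org)"}
  with requests.get(url, stream=True, headers=headers) as response:
      for line in response.iter_lines():
          # SSE data lines are prefixed with "data: " and carry one JSON event each.
          if line.startswith(b"data: "):
              event = json.loads(line[len(b"data: "):])
              print(event["meta"]["stream"], event.get("wiki"), event.get("title"))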

Other services and datastores

(MariaDB, PostgreSQL, Redis, Cassandra, ML feature stores, files, Swift, etc.)

Many other services exist, each using its own datastores. Some services do emit state change event data, but none comprehensively enough for their domain data to be useful outside of those services.

Analytics Data Lake (Hadoop)

Our Hadoop-based Analytics Data Lake ingests as much data as it can from all of the above, primarily for analytics purposes. Production services are beginning to rely on the Data Lake as the transformation layer for the Generated Datasets project.

What’s missing?

  • There is no primary long term generic storage of domain data. Some services do save some of their state change event history in their local datastores (e.g. MediaWiki and the revision table), but they do not make this data externally sourceable by others.
  • There is no primary data transformation layer. Data can be transformed in the Analytics Hadoop cluster, but only at a batch frequency and with availability only suitable for analytics use cases.
  • There is no common ingestion framework. Even if developers could load and transform the domain data of other services, they would still need to ingest that data into their own datastores for use by their own services.
  • There is no data discovery or governance tooling. Even if developers had tools that supported all of the above, finding and managing data stored in the platform would be difficult, especially as the number of datasets increases over time.

A way forward: Data as a Product via a Shared Data Platform

To solve these problems, products should be designed with a data-first perspective, leveraging a data platform that allows engineers to build products that use any domain’s data.

Data as a Product

‘Data as a Product’ is an idea described in Data Mesh architectures. Quoting from the original Data Mesh article:

domain data teams must apply product thinking [...] to the datasets that they provide; considering their data assets as their products and the rest of the organization's data scientists, ML and data engineers as their customers.

New products should produce their domain’s data (even if the product itself doesn’t directly use its externally produced data) so that others can consume it.

Domain Data

Domain data is a logical grouping of data related to a product or area of business. The term ‘domain’ here is the same as the domain in Domain Driven Design. Domain data has an explicit owner: the product where the data originates.

At WMF, our domain boundaries might not currently be clear, but there is still data that needs to be shared between domains. To do this well, we should apply product thinking that assigns specific domain ownership to that data.

Common Data Infrastructure

To enable products to use each other’s domain data, that data needs to live somewhere it can be consumed. Instead of asking each product to implement this capability, we should work towards building common data infrastructure and tooling that supports it.

Quoting again from Data Mesh ideas:

[...] the only way that teams can autonomously own their data products is to have access to a high-level abstraction of infrastructure that removes complexity and friction of provisioning and managing the lifecycle of data products - Self-serve data infrastructure as a platform to enable domain autonomy.


Note: Data Meshes are complicated, and many of the ideas are more organizational than technical. This document is not advocating for WMF to entirely embrace Data Mesh concepts. However, some of the core Data Mesh ideas are pertinent, namely ‘data as a product’ and ‘common data infrastructure’. Implementing these concepts at WMF would help us in our efforts to make it easier for products to get the data they need.

What about analytics?

The only real difference between production and analytics datasets is the degree to which people care about them, or more formally, the SLOs that are defined for them. Domain data encompasses both important application state data and instrumentation of that domain. If data is useful for anything outside of the originating domain, it should be shared. Analytics data is no different.

Seen in this light, analytics systems like Hadoop are just specialized systems tuned for exploration and gathering insights from data, rather than for serving products to end users.

A Shared Data Platform is a foundational infrastructure serving both analytics and production use cases. It is a production multi-datacenter data source and transformation layer. Analytics systems can consume and produce data to the Shared Data Platform just like any other service or product.

It is possible that a modern (non-Hadoop-based) infrastructure for analytics would look technically very similar to a Shared Data Platform. Building new analytics infrastructure first may be a path towards a production-level Shared Data Platform.

Data as a Service

The proposed Shared Data Platform directly supports WMF’s Data as a Service OKR. The KRs for FY 2021-2022 are specifically about analytics data. However, as noted above, the difference between analytics and production data is slim. To truly support the goal of Data as a Service, a foundational data platform that allows ‘self service’ of all domain data is needed.

Existing projects that move us forward

(As of 2022.)

  • Data Value Stream scopes out many components that are important in a Shared Data Platform. This work moves us in the right direction. Hopefully it can be used as a base for supporting the full externalized data lifecycle as described below.
  • The Wikidata Query Service Updater is an example of good event driven application architecture. It makes use of the systems available at the time it was built (Kafka and Analytics Data Lake (Hadoop)) to source data. Ideally, this WDQS Updater would source and publish its data to a multi-DC Shared Data Platform.
  • The MediaWiki Event Carried State Transfer project aims to make recent MediaWiki state externally sourceable as events in Kafka. Externalizing its full state and history requires something like a Shared Data Platform.

Shared Data Platform - Externalized Data Lifecycle

Engineers building user-facing products ultimately need to serve data. Sometimes those products need data belonging to other domains, where it may not be suitably shaped for, or accessible to, the new product that needs it. To get that data into the right shape and storage location for serving in the new product, it usually needs to go through several identifiable phases of a data lifecycle.

This lifecycle is about data that is shared externally to a product, in the Shared Data Platform. This document does not prescribe any particular architecture internal to a single service.

Produce

Products should produce their domain data to the Shared Data Platform. This will usually be done by producing state change and/or instrumentation events. This is possible today via the Event Platform.

However, not all data will be produced as events, so we need to ensure that the Shared Data Platform can accept data in other ways too, e.g. aggregate summaries, state snapshots, data dumps, etc.
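
Below is a minimal sketch of producing a state change event, assuming an eventgate-style intake service. The intake URL, schema URI, and stream name are hypothetical and would need to be registered with Event Platform first.

  import datetime
  import uuid

  import requests

  # Hypothetical eventgate-style intake endpoint; real deployments use internal eventgate services.
  INTAKE_URL = "https://eventgate.example.wmnet/v1/events"

  event = {
      "$schema": "/example/widget/state_change/1.0.0",  # hypothetical schema URI
      "meta": {
          "stream": "example.widget.state_change",       # hypothetical stream name
          "dt": datetime.datetime.now(datetime.timezone.utc).isoformat(),
          "id": str(uuid.uuid4()),
      },
      "widget_id": 1234,
      "state": "published",
  }

  # eventgate accepts a batch (JSON array) of events per request.
  response = requests.post(INTAKE_URL, json=[event])
  response.raise_for_status()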

Discover

There will be a lot of data available in the Shared Data Platform. Engineers will need a way to find the data they are looking for. Additionally, that data will need to be governed by privacy and retention policies. A data catalog technology will index our data and provide APIs to allow engineers to find the locations of the data they need in the data platform. Data governance will allow us to maintain that data and understand and apply retention policies, etc. All domain data should be discoverable, and its lineage should be well tracked.
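
A purely illustrative sketch of what catalog-backed discovery could look like follows; the in-memory catalog, its entry fields, and the dataset names are hypothetical, since the actual catalog technology is still to be chosen.

  from dataclasses import dataclass

  @dataclass
  class DatasetEntry:
      name: str            # dataset name as registered in the catalog
      owner: str           # owning domain/team
      location: str        # where the data can be sourced from
      retention_days: int  # governance: how long the data may be retained

  # A toy in-memory catalog standing in for a real data catalog service.
  catalog = {
      "mediawiki.page_change": DatasetEntry(
          name="mediawiki.page_change",
          owner="MediaWiki",
          location="kafka://kafka.example.wmnet/mediawiki.page_change",  # hypothetical
          retention_days=90,
      ),
  }

  entry = catalog["mediawiki.page_change"]
  print(f"{entry.name} is owned by {entry.owner} and sourceable from {entry.location}")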

Source (verb)

All domain data should be readable. Services should be able to discover the storage locations of the data they need, and have standardized tooling for sourcing (AKA loading) that data.

This includes loading both historical data at rest and consuming live real time event data.

Note: We use ‘source’ as a verb in the same way that the concept Event Sourcing does, indicating that the source of another domain’s data as seen by the consuming domain is the Shared Data Platform.
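
A minimal sketch of sourcing another domain's data is shown below, assuming a Parquet snapshot for the history at rest and a Kafka topic for the live changes; the file path, topic name, and broker address are hypothetical.

  import json

  import pandas as pd
  from kafka import KafkaConsumer  # kafka-python

  # 1. Load the historical data at rest (hypothetical snapshot location and schema).
  snapshot = pd.read_parquet("/srv/shared-data-platform/mediawiki/page_state/snapshot.parquet")
  state = {row.page_id: row._asdict() for row in snapshot.itertuples(index=False)}

  # 2. Tail the live change stream and keep the local copy of the domain's state current.
  consumer = KafkaConsumer(
      "mediawiki.page_change",                         # hypothetical topic name
      bootstrap_servers=["kafka.example.wmnet:9092"],  # hypothetical broker
      auto_offset_reset="earliest",
      value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
  )
  for message in consumer:
      change = message.value
      state[change["page_id"]] = change                # apply each change to the local state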

Transform

Once sourcing of data is possible, it will need to be transformed (aggregation, sanitization, restructuring, reordering, joining, etc.) into the shape needed by a service. Ideally, the same or similar technology could be used to transform both historical and real time data. Any domain data created by the new service can be produced back to the Shared Data Platform so that other products can potentially use it too.

Some historical derived data will need to be occasionally or periodically recomputed to ensure consistency. This is known as a lambda architecture, and we should make sure our data platform has the ability to do this.
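
A minimal lambda-architecture sketch, assuming edit events with a "wiki" field: the same transformation (edits per wiki) is recomputed in batch over the full history for consistency and applied incrementally over live events for freshness. The loader and stream functions are hypothetical stand-ins for reading from the Shared Data Platform.

  from collections import Counter

  def edits_per_wiki(events):
      # The shared transformation: count edit events per wiki.
      counts = Counter()
      for event in events:
          counts[event["wiki"]] += 1
      return counts

  def load_historical_events():
      # Hypothetical batch read of the full event history at rest.
      return [{"wiki": "enwiki"}, {"wiki": "dewiki"}, {"wiki": "enwiki"}]

  def tail_live_events():
      # Hypothetical live stream consumer.
      yield {"wiki": "enwiki"}

  # Batch layer: periodically recompute the derived dataset from scratch for consistency.
  batch_counts = edits_per_wiki(load_historical_events())

  # Speed layer: keep the derived dataset fresh between batch recomputations.
  live_counts = Counter(batch_counts)
  for event in tail_live_events():
      live_counts[event["wiki"]] += 1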

Ingest

Transformed data is then ingested into systems optimized to deliver query results within certain parameters. Users of these systems have expectations about the freshness, consistency, and reliability of the data that constrain the data pipeline leading up to this point. These expectations should be expressed as SLOs for the data.

We don’t expect any single storage and/or serving technology to be used for all serving use cases, but we may be able to standardize around a small menu of storage technologies that can serve most of them. These should be offered ‘as a platform’ for use by service owners.
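
A minimal sketch of the ingest step follows, using a key-value store purely as an example of a serving datastore (the actual storage technology would be chosen per use case). A freshness timestamp is written alongside the data so that a staleness SLO can be monitored; the hostname and key layout are hypothetical.

  import json
  import time

  import redis  # redis-py

  # Hypothetical serving datastore for the transformed records.
  store = redis.Redis(host="serving-store.example.wmnet", port=6379)

  transformed_records = [
      {"wiki": "enwiki", "edit_count": 1234},
      {"wiki": "dewiki", "edit_count": 567},
  ]

  for record in transformed_records:
      store.set(f"edit_count:{record['wiki']}", json.dumps(record))

  # Record when ingestion last completed, so a freshness SLO can be checked against it.
  store.set("edit_count:last_ingested_at", time.time())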

Serving is outside the scope of Shared Data Platform infrastructure. However, served data is ultimately sourced from the Shared Data Platform, so we should think carefully about how WMF might provide storage for serving that integrates well with the Shared Data Platform.

Conclusion

A real self-service data infrastructure will only be possible if Wikimedia product and engineering teams align on the fact that data is a product and treat it as such. However, product teams can only afford to do this if the underlying data platform infrastructure lets them do so in a self-service way. We should aim to evolve our data infrastructure into a Shared Data Platform to make this possible.

Possible Shared Data Platform Diagram

Further reading

Examples of Wikimedia one-off data pipelines

  • Image Recommendations: Platform Engineering needs to collect events about when users accept image recommendations and correlate them with a list of possible recommendations in order to decide which recommendations are available to be shown to users. See also Lambda Architecture for Generated Datasets.
  • Structured Data Topics: Developers need a way to trigger/run a topic algorithm based on page updates in order to generate relevant section topics for users based on page content changes.
  • Similar Users: AHT, in cooperation with Research, wishes to provide a feature for the CheckUsers community group to compare users and determine if they might be the same user, to help with evaluating negative behaviours. See also Lambda Architecture for Generated Datasets.
  • Search index generation: The Search team joins event data with MediaWiki data in Hadoop with Spark, uploads the result to Swift object storage, and consumes it to update Elasticsearch indexes.
  • Add a link: The Link Recommendation Service recommends phrases of text in an article to link to other articles on a wiki. Users can then accept or reject these recommendations.
  • MediaWiki History Incremental Updates: The Data Engineering team bulk loads monthly snapshots of MediaWiki data from dedicated MySQL replicas, transforms this data using Spark into MediaWiki History, stores it in Hive and Cassandra, and serves it via AQS. Data Engineering would like to keep this dataset up to date within a few hours using MediaWiki state change events, as well as use events to compute history.
  • WDQS Streaming Updater: The Search team consumes Wikidata change MediaWiki events with Flink, queries MediaWiki APIs, builds a stream of RDF diffs, and updates their Blazegraph database for the Wikidata Query Service.
  • Knowledge Store PoV: The Architecture Team’s Knowledge Store PoV consumes events, looks up content from the MediaWiki API, transforms it, stores structured versions of that content in an object store, and serves it via GraphQL.
  • MediaWiki REST API Historical Data Endpoint: Platform Engineering wants to consume MediaWiki events to compute edit statistics that can be served from an API endpoint to build iOS features. (See also this data platform design document.)
  • Fundraising Campaign real time analytics: Fundraising Tech consumes EventLogging events, transforms them into MySQL update statements, and updates MySQL databases to build fundraising statistics dashboards.
  • Cloud Data Services: The Cloud Services team consumes MediaWiki MySQL data and transforms it for tool maintainers, sanitizing it for public consumption. (Many tool maintainers have to implement OLAP-type use cases on data shapes that don’t support that.)
  • Wikimedia Enterprise: Wikimedia Enterprise (Okapi) consumes events externally, looks up content from the MediaWiki API, transforms it, stores structured versions of that content in AWS, and serves APIs on top of that data there.
  • Wiki content dumps: Wiki content snapshot XML dumps are generated monthly and made available for download.
  • Change Propagation / RESTBase: The Platform Engineering team uses Change Propagation (a simple home grown stream processor) to consume MediaWiki change events and cause RESTBase to store re-rendered HTML in Cassandra and serve it.
  • Frontend cache purging: SRE consumes MediaWiki resource-purge events and transforms them into HTTP PURGE requests to clear frontend HTTP caches.
  • Image Competition: Research wants to export Commons images from HDFS to a public file server.
  • MW DerivedPageDataUpdater and friends: A ‘collection’ of various derived data generators running in-process within MediaWiki or deferring to the job queue.
  • Other MW jobs: Some jobs are pure RPC calls, but many jobs basically fit this topic, driving derived data generation (CirrusSearch jobs, for example).
  • ML Feature store: Machine Learning models need features to be trained and served. These features are often derived from existing datasets, and may have different requirements for latency and throughput (mostly training vs. serving).
  • WikiWho: WikiWho is a service (API + datasets) for mining changes and attribution information from wiki pages.
  • AI Models for Knowledge Integrity: A new set of Machine Learning models that would help to improve knowledge integrity and fight disinformation and vandalism in Wikimedia projects.
  • Many more…
  1. MySQL at Wikipedia
  2. MariaDB MediaWiki External Storage (page content)
  3. MediaWiki Media Storage
  4. This is only strictly true for production services; there are ways to get snapshot copies directly from the datastores.
  5. The MediaWiki Event Carried State Transfer project aims to improve this for MediaWiki data.
  6. https://phabricator.wikimedia.org/T349845
  7. https://phabricator.wikimedia.org/T350299