MediaWiki Externalized Data Problem
This page rephrases some of what is stated in Shared Data Platform as a more focused problem statement and set of use cases, avoiding the broader theory (Data as a Product) and platform (Common Data Infrastructure) language used there.
This is a work in progress problem statement.
If you are reading this, please feel free to make edits, or contact the Data Platform Engineering team.
Using MediaWiki data outside of MediaWiki
Summary
If we want to make Wikipedia a multi-generational project, novel products and features will need MediaWiki data outside of our MediaWiki website engine.
It is difficult to create new products that use MediaWiki data in ways MediaWiki was not designed for.
This problem has never been holistically considered or addressed together as an organization.
Until we do, we will either:
- Not build novel things using MediaWiki data
- Repeatedly expend resources building brittle data pipelines
Context
Ultimately, knowledge gets to the world via data that is stored on our hardware. We build interface products, primarily in MediaWiki, that make this data available to real people. MediaWiki is good at its primary tasks: humans editing and reading an encyclopedia. It is not well suited to other tasks, like querying a knowledge graph, tracking user contributions over time, or generating recommendations.
It is difficult for product teams to implement new products that use MediaWiki data to do things MediaWiki was not intended to do. This constraint limits new product ideas, because it is often not even feasible to test them without the data being available.
- MediaWiki’s data is not designed for reuse outside of MediaWiki.
- MediaWiki HTTP APIs are not sufficient for some products.
For many products, the required shape of the data is highly specific. This leads to brittle, bespoke data pipelines. The duplicated effort and inefficiency of building one-off pipelines slow project velocity, and more often simply discourage people from building new products at all.
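To make "bespoke pipeline" concrete, here is a minimal sketch of the kind of one-off transform such a pipeline performs: reshaping a recentchange-style event (the field names follow the schema emitted by the public Wikimedia EventStreams service) into the narrow record a hypothetical product feature might need. The function name and output shape are illustrative assumptions, not an existing API; each new product tends to re-invent a slightly different version of this step.

```python
# A one-off transform from a MediaWiki recentchange-style event into a
# product-specific record. Input field names follow the recentchange
# schema; the output shape is a hypothetical example.

def to_feature_record(event: dict) -> dict:
    """Keep only the fields one specific product needs."""
    return {
        "wiki": event["wiki"],
        "title": event["title"],
        "editor": event["user"],
        "is_bot": event.get("bot", False),
        # How much the page grew or shrank in this edit, in bytes.
        "bytes_delta": event["length"]["new"] - event["length"]["old"],
    }

# Abbreviated recentchange-like payload for illustration:
sample_event = {
    "wiki": "enwiki",
    "title": "Example_article",
    "user": "SomeEditor",
    "bot": False,
    "length": {"old": 1200, "new": 1350},
}

record = to_feature_record(sample_event)
print(record)
```

Every product that needs a different shape (embeddings, reputation scores, search documents) writes its own variant of this glue, along with its own ingestion, backfill, and error handling, which is where the duplicated effort accumulates.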
This problem has never been holistically considered or addressed together as an organization. Until we do:
- New products that require externalized MediaWiki data may be abandoned because of the difficulty of building them.
- Engineers using MediaWiki data outside of MediaWiki will continue to spend resources solving these problems as best they can for their individual projects, incurring long-term technical debt and wasting time and resources that could be better spent building product features.
- Valuable knowledge will be locked inside MediaWiki and not accessible in a historical context.
- Important data is effectively lost, as it is either not stored or stored in a manner that makes access prohibitively expensive.
Use cases requiring externalized MediaWiki data
Use cases listed here either:
- actively use externalized MediaWiki data
- could be improved if they used externalized MediaWiki data
- have not been built because they needed externalized MediaWiki data
Product Features
- Search indexes
- Wikidata Query Service
- Wikimedia Enterprise
- Machine Learning: LiftWing model serving, feature store, etc.
- Recommendation API
- Structured Data Across Wikimedia
- Structured Tasks: Image Suggestions, Link Recommendations / Add Link, CopyEdit
- Similar Users / SimilarEditors
- Page Content Service (this is being moved into MediaWiki?)
- WikiWho & WikiCredit
- User Account Reputation Score
- AQS
Analytics, Research, Machine Learning, etc.
- Dumps
- MediaWiki History
- User:Isaac_(WMF)/Content_tagging/Data_gaps#Major_data_gaps
- Machine Learning: model training, Hugging Face Wikimedia Datasets
- Article Embeddings
- Cloud VPS Wiki Replicas and other public Data Services
- Wikistats
- MW REST API Historical Data Endpoint Needs
- Special:ContentTranslationStats
- WMDE Analytics
- Automated analysis of experiments using MediaWiki data
See also
- Shared Data Platform discusses this problem more theoretically (Data as a Product, not just MediaWiki, etc.).
- T291120 - MediaWiki Event Carried State Transfer - Problem Statement describes this problem in more detail, albeit with event-driven architecture in mind as a solution.
- Dumps 2.0 will hopefully help for some use cases. However, it is designed as an independent pipeline (meant to replace the aging Dumps 1.0 system). It is not a holistic project that addresses this core problem.