Event Platform/Stream Processing/Use cases

From Wikitech

Here we collect a list of projects that would benefit from a stream processing platform. This page categorizes these use cases by the following requirements.

Enrichment? - Is the pipeline a 1:1 event transformation? Many use cases are, and they share very similar deployment requirements, making them good candidates for the platform library and tooling abstractions. If not, the pipeline likely requires multiple input streams and/or performs some kind of aggregation over them.
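A 1:1 enrichment can be sketched as a pure function: one event in, exactly one event out. The field names below are illustrative, not part of any real Event Platform schema.

```python
def enrich(event: dict) -> dict:
    """Hypothetical 1:1 enrichment: one input event, one output event."""
    enriched = dict(event)  # copy rather than mutate the input event
    # Derive a new field from existing ones (illustrative only).
    enriched["page_title_length"] = len(event.get("page_title", ""))
    return enriched

# One event in, exactly one event out:
out = enrich({"page_id": 42, "page_title": "Stream processing"})
```

Because each output depends only on its single input event, pipelines of this shape need no state and can be deployed with near-identical tooling.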

Streaming State? - If implemented with stream processing, does the stream processor need to maintain state? If so, the complexity of implementing and operating the pipeline increases. We'd still like it to be easy to use Event Platform streams without a lot of boilerplate, but abstracting away the pipeline logic will be difficult. Developers will have to understand more about the streaming framework and how to operate it.
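The extra complexity comes from state that outlives any single event. A toy sketch (not real platform code): counting edits per page requires keyed state, and in a real framework like Flink that state must also be partitioned, checkpointed, and restored on failure.

```python
from collections import defaultdict

class EditCounter:
    """Toy stateful stream processor: counts edits per page.

    In a production framework the per-key state below would need to be
    partitioned across workers and checkpointed for fault tolerance --
    the source of the added operational complexity."""

    def __init__(self):
        # Keyed state: page_id -> running edit count.
        self.counts = defaultdict(int)

    def process(self, event: dict) -> dict:
        self.counts[event["page_id"]] += 1
        return {"page_id": event["page_id"],
                "edit_count": self.counts[event["page_id"]]}
```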

Needs Backfill? - Once deployed, does the output data need to be backfilled? E.g. perhaps your job wants recommendations available for all MW pages, not just those that have been recently edited. If so, then ideally we can run the same pipeline logic on an input dataset in batch mode.
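The "same pipeline logic in batch mode" ideal can be sketched by factoring the per-event transform out of the runner, so it applies equally to an unbounded stream or a bounded historical dataset. All names here are hypothetical.

```python
from typing import Iterable, Iterator

def transform(event: dict) -> dict:
    """The shared per-event pipeline logic (illustrative)."""
    return {**event, "processed": True}

def run(events: Iterable[dict]) -> Iterator[dict]:
    """Runs the same logic over a bounded batch (backfill)
    or an unbounded stream, since both are just iterables of events."""
    for event in events:
        yield transform(event)

# Backfill: apply the identical logic to a historical dataset.
backfilled = list(run([{"page_id": 1}, {"page_id": 2}]))
```

Keeping the transform free of framework-specific calls is what makes this dual streaming/batch deployment possible.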

Ingestion to serving datastore? - Will the output of the pipeline need to be written to a datastore or database for serving to real users? We'd like to make streaming ingestion automated, similar to what Data Engineering has done for the event Hive database. Ideally, all a user would need to do is add some configuration to have their output stream automatically ingested into their datastore.
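Configuration-driven ingestion might look something like the following sketch. The stream name, sink fields, and helper function are all invented for illustration; no such platform API exists yet.

```python
# Hypothetical, declarative ingestion config: the user supplies only this.
INGESTION_CONFIG = {
    "stream": "mediawiki.page_links_change",  # invented stream name
    "sink": {"type": "cassandra", "keyspace": "links", "table": "changes"},
}

def build_sink(config: dict) -> str:
    """Stand-in for platform tooling that would turn the declarative
    config into a running streaming sink (here it just describes it)."""
    sink = config["sink"]
    return (f"ingesting {config['stream']} into "
            f"{sink['type']}:{sink['keyspace']}.{sink['table']}")
```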

Use Cases

Some of these use cases already exist in some form, and others are nascent product ideas.

Original source of table[1]

Project Description Enrichment? Streaming State? Needs Backfill? Ingestion to serving datastore? Other Requirements
MediaWiki Event Carried State Transfer This is a 'supporting' use case that enables many of the following use cases. This aims to 'comprehensively' model state changes in MediaWiki and emit them as events in streams.
Copy Edit When a page is created or edited the text of the page needs to be analyzed for any spelling or grammar issues so that a feature can serve this dataset to editors.
Image Recommendations Platform engineering needs to collect events about when users accept image recommendations and correlate them with a list of possible recommendations in order to decide what recommendations are available to be shown to users. See also Lambda Architecture for Generated Datasets.
Structured Data Topics Developers need a way to trigger/run a topic algorithm based on page updates in order to generate relevant section topics for users based on page content changes. ?
Similar Users AHT in cooperation with Research wish to provide a feature for the CheckUsers community group to compare users to determine if they might be the same user to help with evaluating negative behaviours. See also Lambda Architecture for Generated Datasets. ? ? Does this need streaming? Only if comparing the users needs recent activity.
Search index generation The Search team joins event data with MediaWiki data in Hadoop with Spark, uploads to Swift object storage, and consumes it to update ElasticSearch indexes.
Add a link The Link Recommendation Service recommends phrases of text in an article to link to other articles on a wiki. Users can then accept or reject these recommendations.
MediaWiki History Incremental Updates The Data Engineering team bulk loads monthly snapshots of MediaWiki data from dedicated MySQL replicas, transforms this data using Spark into MediaWiki History, and stores it in Hive and Cassandra and serves it via AQS.

Data Engineering would like to keep this dataset up to date within a few hours using MediaWiki state change events, as well as use events to compute history.

This job itself probably does not need streaming, but it will need up-to-date inputs that are created via streaming apps (e.g. MW content in streams).
WDQS Streaming Updater The Search team consumes Wikidata change MediaWiki events with Flink and queries MediaWiki APIs, builds a stream of RDF diffs and updates their Blazegraph database for the Wikidata Query Service.
Knowledge Store PoV The Architecture Team’s Knowledge Store PoV consumes events, looks up content from the MediaWiki API, transforms it, and stores structured versions of that content in an object store and serves it via GraphQL. ?
Fundraising Campaign real time analytics Fundraising Tech consumes EventLogging events transforms them into MySQL update statements and updates MySQL databases to build fundraising statistics dashboards. ? ? ?
Cloud Data Services The Cloud Services team consumes MediaWiki MySQL data and transforms it for tool maintainers, sanitizing it for public consumption. (Many tool maintainers have to implement OLAP-type use cases on data shapes that don’t support that.) This would likely be CDC, filtering and transforming the events with stream processing before they are written back to the Cloud storage.
Wikimedia Enterprise Wikimedia Enterprise (Okapi) consumes events externally, looks up content from the MediaWiki API, transforms it and stores structured versions of that content in AWS, and serves APIs on top of that data there.
Wiki content dumps Wiki content snapshot XML dumps are generated monthly and made available for download. This job itself probably does not need streaming, but it will need up-to-date inputs that are created via streaming apps (e.g. MW content in streams).
Change Propagation / RESTBase The Platform Engineering team uses Change Propagation (a simple home grown stream processor) to consume MediaWiki change events and causes RESTBase to store re-rendered HTML in Cassandra and serve it.
Frontend cache purging SRE consumes MediaWiki resource-purge events and transforms them into HTTP PURGE requests to clear frontend HTTP caches.
MW DerivedPageDataUpdater and friends A 'collection' of various derived data generators running in-process within MW or deferring to the job queue.
Other MW JobQueue jobs Some jobs are pure RPC calls, but many jobs basically fit this pattern, driving derived data generation. CirrusSearch jobs, for example.
WikiWho WikiWho is a service (API + Datasets) for mining changes and attribution information from wiki pages.
AI Models for Knowledge Integrity A new set of Machine Learning models that would help to improve knowledge integrity, fight disinformation and vandalism in Wikimedia projects ?
Revision Scoring For wikis where these machine learning models are supported, edits and revisions are automatically scored using article content and metadata. The service currently makes API calls back to MediaWiki, leading to a fragile dependency cycle and high latency. ?
[1] https://wikitech.wikimedia.org/wiki/Shared_Data_Platform#Examples_of_Wikimedia_one-off_data_pipelines