MediaWiki Event Enrichment/SLO/Mediawiki Page Content Change Enrichment
This page is currently a draft.
More information and discussion about changes to this draft on the talk page.
A real-time data processing application that consumes the mediawiki.page_change.v1 topic, performs a lookup join (HTTP) with the Action API to retrieve raw page content, and produces an enriched event into the mediawiki.page_content_change.v1 stream.
The Event Platform value stream is responsible for this service.
Mediawiki Page Content Change Enrichment runs on k8s (wikikube) in the codfw and eqiad data centers.
Mediawiki Page Content Change Enrichment requires the Flink Kubernetes Operator to be deployed on the host k8s cluster.
The Mediawiki Page Content Change Enrichment application consumes from and produces to the Kafka main clusters, and also produces to the Kafka jumbo-eqiad cluster. The application issues HTTP requests to the Mediawiki Action API.
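To illustrate the HTTP lookup mentioned above, here is a minimal sketch of how a request for the raw content of a revision could be built against the Action API. The function name `build_content_request_url`, the example endpoint, and the choice of parameters are assumptions made for illustration; the application's actual request may differ.

```python
from urllib.parse import urlencode


def build_content_request_url(api_endpoint: str, rev_id: int) -> str:
    """Build an Action API URL that asks for the content of one revision.

    Hypothetical helper: the enrichment application may use a different
    endpoint, different parameters, or a client library instead.
    """
    params = {
        "action": "query",      # standard query module
        "prop": "revisions",    # ask for revision data
        "revids": rev_id,       # the revision to look up
        "rvprop": "content",    # include the revision content
        "rvslots": "main",      # content of the main slot
        "format": "json",
        "formatversion": 2,
    }
    return f"{api_endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Example endpoint and revision ID are placeholders.
    print(build_content_request_url("https://en.wikipedia.org/w/api.php", 12345))
```

Keeping URL construction separate from the HTTP call makes the lookup easy to test without network access.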
The service clients are consumers of the mediawiki.page_content_change.v1 stream. Clients will only interact with the service via that stream.
Service Level Indicators (SLIs)
- Enriched events percentage (availability): the percentage of mediawiki.page_change.v1 events consumed that resulted in an enriched event being produced into mediawiki.page_content_change.v1. This is the share of events that have been successfully enriched.
- Excessive topic lag: the percentage of time that the Kafka topic lag is above a threshold (TBD). This SLI gives an indication of service latency that would cause page_content_change messages to lag behind.
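The two SLIs above can be sketched as simple ratios. This is only an illustration: the function names, the choice of evenly spaced lag samples, and the example threshold are assumptions, since the real lag threshold is still TBD and the actual metrics are computed from the timeseries emitted by the application.

```python
from typing import Sequence


def enriched_events_percentage(consumed: int, enriched: int) -> float:
    """Availability SLI: share of consumed page_change events that were enriched."""
    if consumed == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 100.0
    return 100.0 * enriched / consumed


def excessive_lag_percentage(lag_samples: Sequence[float], threshold: float) -> float:
    """Lag SLI: share of evenly spaced lag samples that are above the threshold."""
    if not lag_samples:
        # Assumption: no samples means no observed excessive lag.
        return 0.0
    above = sum(1 for lag in lag_samples if lag > threshold)
    return 100.0 * above / len(lag_samples)


if __name__ == "__main__":
    print(enriched_events_percentage(consumed=1000, enriched=990))
    # The threshold of 500 is a placeholder, not the (TBD) production value.
    print(excessive_lag_percentage([10, 20, 600, 30], threshold=500))
```

Note that the lag SLI measures time spent above the threshold, so lower values are better, while for the availability SLI higher values are better.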
Mediawiki Page Content Change Enrichment emits timeseries metrics (counter, gauges) for all SLIs. They are available in Grafana.
Mediawiki Page Content Change Enrichment depends on Kafka and the Action API. Operational errors are expected to be correlated to the performance of either system.
Mediawiki Page Content Change Enrichment emits errors (exceptions, invalid records, HTTP timeouts after retries) into a Kafka error topic.
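The retry-then-report behaviour described here can be sketched as follows. This is a simplified illustration and not the application's code: `enrich_with_retries`, the injected `fetch_content` callable, the `content` field, and the layout of the error record are all hypothetical.

```python
from typing import Callable, Optional, Tuple


def enrich_with_retries(
    event: dict,
    fetch_content: Callable[[dict], str],
    max_attempts: int = 3,
) -> Tuple[Optional[dict], Optional[dict]]:
    """Return (enriched_event, None) on success, or (None, error_record) on failure.

    The caller would produce the first element to the enriched stream and
    the second element to the error topic.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            content = fetch_content(event)
            # Hypothetical field name for the retrieved page content.
            return {**event, "content": content}, None
        except Exception as exc:  # broad on purpose: any failure is retried
            last_error = exc

    # All attempts failed: build a record for the error topic (hypothetical layout).
    error_record = {
        "failed_event": event,
        "error_type": type(last_error).__name__,
        "message": str(last_error),
    }
    return None, error_record
```

Returning the error as data, rather than raising, lets the pipeline keep processing later events while failures are collected separately for inspection or reconciliation.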
As of 2023-05, no support SLA is provided. File a bug at https://phabricator.wikimedia.org/project/view/1474/ and the Event Platform team will follow up within 24 hours (on work days). In case of an outage, deleting and re-applying the deployment is considered within SLO targets.
This may change once we 'release' the mediawiki.page_content_change.v1 stream, hopefully in early FY 2023-2024.
The service is deployed with deployment-charts. See MediaWiki_Event_Enrichment#mw-page-content-change-enrich
Service Level Objectives
A realistic target for availability would be that 80% of processed messages are enriched, with no particular upper bound on latency.
A realistic target for excessive Kafka topic lag would be that, 80% of the time, the maximum lag is within the desired threshold (TBD).
A realistic target for availability would be that 99% of processed messages are enriched, with no particular upper bound on latency.
A realistic target for excessive Kafka topic lag would be that, 99% of the time, the maximum lag is within the desired threshold (TBD).
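As a rough illustration of how the targets above relate to the SLIs, the sketch below compares measured values against a target percentage. The function names are hypothetical. The lag SLI is defined as time above the threshold while the target is phrased as time within it, so the lag check inverts the SLI before comparing.

```python
def availability_target_met(enriched_pct: float, target_pct: float) -> bool:
    """True if the share of enriched events reaches the target."""
    return enriched_pct >= target_pct


def lag_target_met(excessive_lag_pct: float, target_pct: float) -> bool:
    """True if the share of time spent within the lag threshold reaches the target.

    The SLI reports time above the threshold, so the time within the
    threshold is its complement.
    """
    within_threshold_pct = 100.0 - excessive_lag_pct
    return within_threshold_pct >= target_pct


if __name__ == "__main__":
    # Example values only; they are not measurements from the service.
    print(availability_target_met(enriched_pct=99.2, target_pct=99.0))
    print(lag_target_met(excessive_lag_pct=25.0, target_pct=80.0))
```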
Erroneous responses from the Mediawiki Action API, or changes to databases that cannot be captured by EventBus hooks, will impact the availability of enriched events. MediaWiki Event Enrichment#mediawiki.page content change semantics describes failure scenarios for the enrichment application. While we expect retry-on-error logic to address the majority of API-related issues, some of them might require clients to reconcile the stream.
There are known, sporadic cases in which database mutations will not result in an event being published to Kafka (e.g. a maintenance SQL script that UPDATEs a database). The enrichment application will not be able to handle those cases.
Exploratory analysis of the (backfilled) data we have collected so far (June 2023) suggests that significantly less than 1% of events are impacted by failures that will require reconciliation.