MediaWiki Event Enrichment/SLO/Mediawiki Page Content Change Enrichment


Service

A real-time data processing application that consumes the mediawiki.page_change.v1 topic, performs a lookup join (over HTTP) against the Action API to retrieve raw page content, and produces an enriched event into the mediawiki.page_content_change.v1 topic.
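The lookup join described above can be sketched in a few lines of Python. This is only an illustration, not the application's actual code: the event field names (`revision`, `rev_id`, `content_body`), the `fetch_content` callable standing in for the Action API request, and the retry count are all assumptions made for the example.

```python
from typing import Callable, Optional


def enrich(
    page_change: dict,
    fetch_content: Callable[[int], Optional[str]],
    max_retries: int = 3,
) -> Optional[dict]:
    """Return an enriched copy of the event, or None if enrichment failed.

    `fetch_content` is a hypothetical stand-in for the HTTP lookup: it takes
    a revision id and returns the raw page content (or None if unavailable).
    """
    rev_id = page_change["revision"]["rev_id"]  # hypothetical field names
    for _ in range(max_retries):
        try:
            content = fetch_content(rev_id)
        except TimeoutError:
            continue  # retry on timeout
        if content is not None:
            enriched = dict(page_change)  # shallow copy, input is not mutated
            enriched["content_body"] = content  # hypothetical output field
            return enriched
    # In the service, a failure like this would surface as an error event
    # rather than an enriched event.
    return None
```

An event for which `enrich` returns None corresponds to a consumed page_change event that did not result in an enriched event.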

Teams

The Event Platform value stream is responsible for this service.

Architectural

Environmental dependencies

Mediawiki Page Content Change Enrichment runs on k8s (wikikube) in the codfw and eqiad data centers.

Mediawiki Page Content Change Enrichment requires the Flink Kubernetes Operator to be deployed on the host k8s cluster.

Service dependencies

The Mediawiki Page Content Change Enrichment application consumes from and produces to the Kafka main clusters, and also produces to the Kafka jumbo-eqiad cluster. The application issues HTTP requests to the Mediawiki Action API.

Client-facing

Clients

The service clients are consumers of the mediawiki.page_content_change.v1 stream. Clients will only interact with the service via that stream.

Service Level Indicators (SLIs)

  • Enriched events percentage (availability): the percentage of consumed mediawiki.page_change.v1 events that resulted in an enriched event being produced into mediawiki.page_content_change.v1. This is the share of events that have been successfully enriched.
  • Excessive topic lag: the percentage of time during which Kafka topic lag is above a threshold (TBD). This SLI reflects service latency, which would cause page_content_change messages to lag behind page_change.
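As a rough illustration, the two SLIs could be computed from raw counts and lag samples as follows. This is a sketch under stated assumptions: the function names are hypothetical, the lag samples are assumed to be taken at regular intervals (so the share of samples approximates the share of time), and the threshold is passed in as a parameter because the actual value is still TBD.

```python
from typing import Sequence


def enriched_events_percentage(consumed: int, produced: int) -> float:
    """Percentage of consumed page_change events that yielded an enriched event."""
    if consumed == 0:
        return 100.0  # nothing to enrich, so nothing failed
    return 100.0 * produced / consumed


def excessive_lag_percentage(lag_samples: Sequence[int], threshold: int) -> float:
    """Percentage of (evenly spaced) lag samples that exceed the threshold."""
    if not lag_samples:
        return 0.0
    above = sum(1 for lag in lag_samples if lag > threshold)
    return 100.0 * above / len(lag_samples)
```

For example, 198 enriched events out of 200 consumed gives an availability of 99%, and one sample above the threshold out of four gives an excessive lag percentage of 25%.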

Operational

Monitoring

Mediawiki Page Content Change Enrichment emits time series metrics (counters, gauges) for all SLIs. They are available in Grafana.

Troubleshooting

Mediawiki Page Content Change Enrichment depends on Kafka and the Action API. Operational errors are expected to correlate with the performance of either system.

Mediawiki Page Content Change Enrichment emits errors (exceptions, invalid records, HTTP timeouts after retries) into a Kafka error topic: <DC>.mediawki_page_content_change_enrichment_error.

As of 2023-05, no support SLA is provided. File a bug at https://phabricator.wikimedia.org/project/view/1474/ and the Event Platform team will follow up within 24 hours (on work days). In case of an outage, deleting and re-applying the deployment is considered within SLO targets.

This may change once we 'release' the mediawiki.page_content_change.v1 stream, hopefully in early FY 2023-2024.

Deployment

The service is deployed with deployment-charts. See MediaWiki_Event_Enrichment#mw-page-content-change-enrich

Service Level Objectives

Realistic targets

A realistic target for availability would be that 80% of processed messages are enriched, with no particular upper bound on latency.

A realistic target for excessive Kafka topic lag would be that, 80% of the time, the max lag is within the desired threshold (TBD).

Ideal targets

An ideal target for availability would be that 99% of processed messages are enriched, with no particular upper bound on latency.

An ideal target for excessive Kafka topic lag would be that, 99% of the time, the max lag is within the desired threshold (TBD).
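A measured SLI value can be compared against the two tiers of targets (80% realistic, 99% ideal) with a small helper. The sketch below is illustrative only; the function and constant names are hypothetical. Note that the lag SLI is defined as the percentage of time above the threshold, while the targets are stated as time within the threshold, so the value is inverted first.

```python
REALISTIC_TARGET_PCT = 80.0  # from "Realistic targets"
IDEAL_TARGET_PCT = 99.0      # from "Ideal targets"


def classify(measured_pct: float) -> str:
    """Classify a 'higher is better' percentage against both target tiers."""
    if measured_pct >= IDEAL_TARGET_PCT:
        return "meets ideal target"
    if measured_pct >= REALISTIC_TARGET_PCT:
        return "meets realistic target"
    return "below realistic target"


def time_within_lag_threshold(excessive_lag_pct: float) -> float:
    """Convert 'percentage of time above threshold' to 'percentage within'."""
    return 100.0 - excessive_lag_pct
```

With these definitions, an availability of 85% meets the realistic target only, and an excessive lag percentage of 25% (75% of the time within the threshold) falls below it.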

Reconciliation

Erroneous responses from the Mediawiki Action API, or database changes that can't be captured by EventBus hooks, will impact the availability of enriched events. MediaWiki Event Enrichment#mediawiki.page content change semantics describes failure scenarios for the enrichment application. While we expect retry-on-error logic to address the majority of API-related issues, some of them might require clients to reconcile the stream.

There are known, sporadic cases in which database mutations will not result in an event being published to Kafka (e.g. a maintenance SQL script that UPDATEs a database). The enrichment application will not be able to handle those cases.

Exploratory analysis of the (backfilled) data we have collected so far (June 2023) suggests that significantly less than 1% of events are impacted by failures that will require reconciliation.
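One way a client might estimate how many events need reconciliation is to compare the identifiers seen in the source stream with those seen in the enriched stream over the same window. The sketch below is a hypothetical illustration, not a documented procedure: it assumes each event carries a unique identifier shared by both streams, and it cannot detect database mutations that never produced a page_change event in the first place.

```python
from typing import Iterable, List


def events_needing_reconciliation(
    page_change_ids: Iterable[str],
    enriched_ids: Iterable[str],
) -> List[str]:
    """Return ids present in the source stream but absent from the enriched one."""
    return sorted(set(page_change_ids) - set(enriched_ids))


def impacted_percentage(total_events: int, missing_events: int) -> float:
    """Percentage of source events that have no enriched counterpart."""
    if total_events == 0:
        return 0.0
    return 100.0 * missing_events / total_events
```

The resulting list is the set of events a client would have to re-fetch or otherwise repair on its own side.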