Search/Update Pipeline

Context

The Search Update Pipeline's role is to transfer edits from MediaWiki to the relevant Elasticsearch indices, enriching those edits with data from various peripheral sources.

The current Update Pipeline is being replaced with a new implementation based on Flink. This page describes the new pipeline.

Links

Decision records

List of major decisions around this project (might need to be moved to a sub page if the list grows too large).

Remove dependency on WDQS Streaming Updater (2023-10-16)

task T326409

We had a process dependency on the WDQS Updater: we wanted to use it as a test application to validate the changes to the deployment strategy with a known good application.

We have tested all the use cases that make sense, so this dependency isn't needed anymore. WDQS Updater will still need to be migrated, but this can be done at a later time.

Duplicate files

task T341227

Our Mediawiki instances can serve files and images from the local wiki, or from Commons. When the same file name exists both in the local wiki and on Commons, the local version takes precedence.

In the context of Search, we would ideally implement the same logic so that search results are coherent. This is non-trivial, and the current implementation already has problems: in cases where duplicates are not correctly identified, clicking the result leads to the local file, but the text of the result might come from Commons, which can be confusing.

We collected some data and found ~300 duplicates per day, versus ~30,000,000 full text searches per day (300 / 30,000,000 ≈ 0.001%).

Decision

We will not implement deduplication of File results; in a very small number of cases (~0.001% of searches) there will be duplicates in the search results.

Event transport

task T341625

Inputs

  • Data sizing and traffic (TBD: link to spreadsheet)
  • Multi data center resiliency
  • Resiliency to database replication lag

Decision

System diagram of the Search Update Pipeline Flink application in multiple data centres

Teams involved in the decision: Search Platform, Data Platform SRE, Service Ops, Data Engineering

  • We take the lean event approach: bulk data is read by the indexer (at the end of the pipeline), which increases reads from the MediaWiki API for each Elasticsearch cluster
  • We use kafka-main to store intermediate events (lean events)
  • Each Flink aggregator consumes only DC-local events
  • Each Flink indexer (one instance per DC-local Elasticsearch cluster) consumes both DC-local and replicated update events (see the sketch below this list)
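
As a rough illustration of the lean event approach and the consumption pattern above, the following is a minimal sketch of what an indexer job could look like, not the production implementation: the event shape, broker address, topic names and consumer group are assumptions chosen for illustration.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LeanEventConsumerSketch {

    /** A lean update event: only identifiers, no page content. The indexer re-fetches the
     *  full document from the MediaWiki API just before writing it to Elasticsearch. */
    public record LeanUpdate(String wikiId, long pageId, long revisionId) {}

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // DC-local updates produced by the aggregator (broker and topic names are placeholders).
        KafkaSource<String> dcLocal = updateSource("kafka-main-local:9092",
                "eqiad.cirrussearch.update_pipeline.update");
        // Updates replicated from the other data centre; only the indexer consumes these.
        KafkaSource<String> replicated = updateSource("kafka-main-local:9092",
                "codfw.cirrussearch.update_pipeline.update");

        DataStream<String> updates = env
                .fromSource(dcLocal, WatermarkStrategy.noWatermarks(), "dc-local-updates")
                .union(env.fromSource(replicated, WatermarkStrategy.noWatermarks(), "replicated-updates"));

        // A real indexer would parse the lean event, fetch the full document from the
        // MediaWiki API, and write it to the DC-local Elasticsearch cluster.
        updates.print();
        env.execute("lean-event-consumer-sketch");
    }

    private static KafkaSource<String> updateSource(String brokers, String topic) {
        return KafkaSource.<String>builder()
                .setBootstrapServers(brokers)
                .setTopics(topic)
                .setGroupId("search-update-indexer")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }
}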

Backing storage for Flink state

Flink has different kinds of state that need to be persisted elsewhere so that it can recover after a failover: state that comes from buffered messages/events (checkpoints), and state that keeps track of the progress of processing a stream of events (watermarks).

Checkpoints - S3

task T342620

Watermarks - Zookeeper

Flink is stateful, and that state has to be stored so it can be restored after recovery (failover, maintenance, …). When running on k8s, two options are supported off the shelf: ConfigMaps (CM) and ZooKeeper (ZK). ConfigMaps have to be backed up and restored manually when a cluster is recreated, which would force Service Ops to take special steps specifically for our application, so we'll use ZooKeeper instead.
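
As a rough sketch of what these decisions translate to, the snippet below sets the corresponding Flink options in code, assuming the S3 checkpoint storage chosen above. In a real deployment these keys would live in the Flink configuration (flink-conf.yaml or the deployment spec) rather than in job code; the bucket names, ZooKeeper quorum and checkpoint interval are placeholders.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Checkpoints (buffered events / operator state) are written to S3-compatible object storage.
        conf.setString("state.checkpoints.dir", "s3://search-update-pipeline/checkpoints");

        // JobManager high-availability metadata (leader information, pointers to the latest
        // checkpoints) is tracked in ZooKeeper, so nothing has to be backed up from ConfigMaps
        // when the Kubernetes cluster is recreated.
        conf.setString("high-availability", "zookeeper");
        conf.setString("high-availability.zookeeper.quorum", "zookeeper1001:2181,zookeeper1002:2181");
        conf.setString("high-availability.storageDir", "s3://search-update-pipeline/flink-ha");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(30_000L); // checkpoint every 30 seconds (placeholder interval)
    }
}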

https://docs.google.com/document/d/17tY05WoaT_BloTzaIncR939k3hvhcVQ-E-8DBjo284E/edit#heading=h.wzoas015w2af

task T331283

Handling of files across local wiki and commons

MediaWiki has a fallback mechanism where files can be hosted locally or on Commons. In the context of Search, this brings ambiguity as to where the file is stored, which description is shown, and whether it matches what the user will find when following a link on the Search Results Page. The implementation in the current Search Update Pipeline has a number of limitations which don't entirely resolve those ambiguities. We expect that only a small number of results will exhibit problematic behaviour if we disable this feature entirely.

task T341227

Decision

  • Measure the number of problematic cases
  • Implement the feature on the new pipeline only if that number is high (TBD: decide on a threshold)

Template

Based on decision record template

# [short title of solved problem and solution]

* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->

Technical Story: [description | ticket/issue URL] <!-- optional -->

## Context and Problem Statement

[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

## Decision Drivers <!-- optional -->

* [driver 1, e.g., a force, facing concern, …]
* [driver 2, e.g., a force, facing concern, …]
* … <!-- numbers of drivers can vary -->

## Considered Options

* [option 1]
* [option 2]
* [option 3]
* … <!-- numbers of options can vary -->

## Decision Outcome

Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].

### Positive Consequences <!-- optional -->

* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* … <!-- numbers of consequences can vary -->

### Negative Consequences <!-- optional -->

* [e.g., compromising quality attribute, follow-up decisions required, …]
* … <!-- numbers of consequences can vary -->

## Pros and Cons of the Options <!-- optional -->

### [option 1]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 2]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

### [option 3]

[example | description | pointer to more information | …] <!-- optional -->

* Good, because [argument a]
* Good, because [argument b]
* Bad, because [argument c]
* … <!-- numbers of pros and cons can vary -->

## Links <!-- optional -->

* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->