How to interact with On Wiki Search at a technical level

Overview

Interacting with Search (CirrusSearch and Elasticsearch) comes with a set of best practices and constraints. This document tries to be explicit about most of them, so that teams requiring new features can be better prepared and have a better understanding of the Search context. This document is intended for engineers working on projects that build on top of Search. It only covers the technical level, and does not document how to interact with the Search Platform team.

Pushing data to Elasticsearch

Introduction

Summary

Two different update mechanisms: on page edits and background batch processing
A field should be updated by a single mechanism
Updates are expensive
Initial data ingestion can take weeks if not months

Details

Data is ingested into Elasticsearch via two different mechanisms: on page edits or via background batch processing. A given field should be updated either via “on page edits” or via “background batch processing”, but never by both.

Elasticsearch always creates a new document on update, even when only a subset of fields changes: the unchanged fields are copied from the current document version. This is somewhat costly, so we have mechanisms in place to reduce the number of updates.

Updates on edits only reach pages that are actually edited. When a new field is added, it therefore needs to be populated separately for all existing data. This can take weeks if not months.
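The two rules above (one mechanism per field, and untouched fields copied from the current document version) can be illustrated with a minimal sketch. The field names, the `FIELD_OWNER` table and the `apply_partial_update` helper are hypothetical and exist only for this example; they are not part of CirrusSearch.

```python
from enum import Enum


class Mechanism(Enum):
    PAGE_EDIT = "on page edits"
    BATCH = "background batch processing"


# Hypothetical ownership table: every field has exactly one owning mechanism.
FIELD_OWNER = {
    "title": Mechanism.PAGE_EDIT,
    "text": Mechanism.PAGE_EDIT,
    "popularity_score": Mechanism.BATCH,
}


def apply_partial_update(current_doc, changes, mechanism):
    """Build the next document version from a partial update.

    Rejects the update if any changed field is unknown or is owned by the
    other mechanism. Fields that are not in `changes` are copied from
    `current_doc`, mirroring the fact that an update always produces a
    complete new document.
    """
    for field in changes:
        if field not in FIELD_OWNER:
            raise ValueError(f"unknown field: {field!r}")
        if FIELD_OWNER[field] is not mechanism:
            raise ValueError(
                f"{field!r} is owned by {FIELD_OWNER[field].value!r}, "
                f"not {mechanism.value!r}"
            )
    new_doc = dict(current_doc)  # copy every existing field
    new_doc.update(changes)      # then overwrite only the changed ones
    return new_doc
```

A check like this makes an accidental double ownership fail loudly instead of letting two pipelines silently overwrite each other.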

On page edits

Summary

Updates restricted to data available in page context
Computationally expensive updates are prohibited
Asynchronous process, updates usually visible in minutes but can be delayed by hours

Details

After a page edit, Elasticsearch indices are updated via JobQueues. These post-edit updates are restricted to data directly available in the page context. Performance is critical for post-edit updates: any update that is computationally expensive has to be handled by the background batch processing pipeline instead. At a minimum, this means that all the data used at this point needs to be available in MySQL after the LinksUpdate job. The definition of “computationally expensive” is somewhat vague; the Search Platform team is the final arbiter.

Post-edit updates are asynchronous. In most cases changes are visible after a few minutes, but delays of a few hours are part of normal operations. Any higher-level feature expecting synchronous or quasi-synchronous updates is doomed to fail eventually.
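A feature that depends on an edit becoming searchable should tolerate this delay rather than assume it away. The sketch below is one possible shape for that, assuming a caller-supplied `is_visible` check (hypothetical; for example, a search request for the edited page). The function name and its parameters are illustrative only.

```python
import time


def wait_until_indexed(is_visible, timeout_s, interval_s=30.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `is_visible()` until it returns True or `timeout_s` elapses.

    Returns True if the change became visible in time, False otherwise.
    `clock` and `sleep` are injectable so the loop can be tested without
    actually waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_visible():
            return True
        if clock() >= deadline:
            return False  # caller must fall back; the update may still arrive
        sleep(interval_s)
```

A `False` result does not mean the update was lost, only that it has not arrived yet, so the caller needs a fallback path that does not rely on Search.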

Background batch processing

Summary

Used for either expensive updates or for data not available in the page context
Triggered hourly
Batch updates are delayed by at least a few hours
Consumes data in batch from Kafka topics

Details

Any update that either requires expensive computation or data not available in the page context is done via background batch processing. Currently batches are triggered hourly, but this could change in the future. Even with hourly updates, writes are delayed for a few hours due to the underlying infrastructure.

Updates are consumed from events in Kafka topics. It is the responsibility of the data producer (the team wanting this new data) to produce appropriate events, either reusing an existing event schema (preferred) or by creating a new schema in consultation with the Search Platform team.

[TBD] link to documentation on interface to publish new data to index.
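Until that documentation exists, the following sketch shows the general idea of event-driven batch updates. Everything in it is an assumption made for illustration: the event fields (`wiki`, `page_id`, `field`, `value`, `dt`) do not correspond to an existing schema, and no Kafka client is involved. It also shows why batching helps with the cost of updates: several events for the same page and field within one batch collapse into a single write.

```python
import json
from datetime import datetime, timezone


def make_event(wiki, page_id, field, value, when=None):
    """Build a JSON-serializable update event (hypothetical shape)."""
    when = when or datetime.now(timezone.utc)
    return {
        "wiki": wiki,
        "page_id": page_id,
        "field": field,
        "value": value,
        "dt": when.isoformat(),
    }


def collapse_batch(events):
    """Keep only the most recent event per (wiki, page_id, field).

    One batch then results in at most one write per page and field,
    however many events were produced during the batch window.
    """
    latest = {}
    for event in events:
        key = (event["wiki"], event["page_id"], event["field"])
        seen = latest.get(key)
        if seen is None or (datetime.fromisoformat(event["dt"])
                            > datetime.fromisoformat(seen["dt"])):
            latest[key] = event
    return list(latest.values())


def serialize(event):
    """Encode an event as the bytes a producer would send."""
    return json.dumps(event, sort_keys=True).encode("utf-8")
```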

Weighted tags

Summary

Minimal documentation: Wikimedia Search Platform/Decision Records/Recommendation Flags in Search

Details

Weighted tags are a standardized solution to support a class of use cases on top of Search. They allow adding an arbitrary set of weighted tags to articles and querying on them. Publishing new tags is easy and requires only minimal involvement from the Search Platform team. Queries can filter on multiple tags, but query authors should be aware that filtering on too many tags at the same time makes the query more complex and more expensive.

See Search/WeightedTags for more in-depth documentation.
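As a conceptual illustration only, the sketch below models tags as a mapping from tag name to weight and filters documents on several tags at once. It runs in memory and does not reflect how CirrusSearch stores or queries tags; the tag names and the `MAX_TAGS_PER_QUERY` cap are hypothetical.

```python
# Hypothetical cap, standing in for "do not filter on too many tags at once".
MAX_TAGS_PER_QUERY = 5


def filter_by_tags(docs, required_tags, min_weight=0.0):
    """Return ids of documents carrying every required tag.

    `docs` maps a document id to a {tag_name: weight} mapping. A document
    matches only if each required tag is present with weight >= min_weight.
    """
    if len(required_tags) > MAX_TAGS_PER_QUERY:
        raise ValueError(
            f"too many tags in one query: {len(required_tags)} "
            f"(limit {MAX_TAGS_PER_QUERY})"
        )
    matches = []
    for doc_id, tags in docs.items():
        if all(tag in tags and tags[tag] >= min_weight
               for tag in required_tags):
            matches.append(doc_id)
    return matches
```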

New keywords

Summary

Query language can be extended via new keywords

Details

To allow querying a specific field, or any other enhancement of the query language, new keywords need to be implemented. Interfaces are published by the CirrusSearch code and can be implemented by any extension. Those interfaces are still being refactored. Coordination with the Search Platform team is needed when implementing a new keyword.

[TBD] link to documentation
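To make the idea of a keyword concrete, here is a conceptual sketch of how a query string can be split into keyword filters and free text. It is not the CirrusSearch interface (which is still being refactored); the `hastag` keyword, the handler signature and the tuple it returns are all hypothetical.

```python
def parse_query(query, handlers):
    """Split a query into (filters, free_text).

    A token of the form `keyword:value` is passed to the handler registered
    for `keyword`. Everything else, including tokens with an unregistered
    prefix such as URLs, is kept as free text.
    """
    filters = []
    free_text = []
    for token in query.split():
        keyword, sep, value = token.partition(":")
        if sep and value and keyword in handlers:
            filters.append(handlers[keyword](value))
        else:
            free_text.append(token)
    return filters, " ".join(free_text)


# Hypothetical keyword: restricts results to documents with a given tag.
HANDLERS = {
    "hastag": lambda value: ("tags", value),
}
```

Each new keyword amounts to one more entry in the handler table, which is why a keyword can live in any extension while the parsing itself stays in one place.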

Schema constraints

Summary

Fields should be reused when possible
Schema changes are long and painful

Details

Fields have a non-trivial cost, so we try to create somewhat generic fields that can be reused instead of creating a new field for each feature. For example, the AddLink and Image Recommendation features both need to flag articles in need of either a link or an image. We prefer a single “needs_recommendation” field that takes a bitset of possible recommendations (link, image) over separate “needs_link” and “needs_image” fields.
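The bitset approach can be sketched as follows. The bit assignments and helper names are assumptions for illustration; the source only states that a single field holds a set of possible recommendations.

```python
from enum import IntFlag


class Recommendation(IntFlag):
    # Hypothetical bit assignments: one bit per kind of recommendation.
    LINK = 1
    IMAGE = 2


def set_flag(value, flag):
    """Return `value` with `flag` switched on."""
    return int(value) | int(flag)


def clear_flag(value, flag):
    """Return `value` with `flag` switched off, other bits untouched."""
    return int(value) & ~int(flag)


def needs(value, flag):
    """True if `flag` is set in `value`."""
    return bool(int(value) & int(flag))
```

Supporting a third kind of recommendation would then only require a new bit (for example a hypothetical `CAPTION = 4`), not a new field and the schema change that comes with it.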

Schema changes can require a full reindex and careful management of the transition path. This can take multiple weeks if not months.

Sizing / scaling

Summary

Commons and Wikidata are especially large indices

Details

By nature, Commons and Wikidata are very large indices, with very large numbers of documents. Small changes in what is indexed can have huge impacts across that number of documents.
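A back-of-the-envelope calculation shows how quickly a small per-document change adds up. The figures in the usage note are placeholders chosen for easy arithmetic, not measurements of any real index, and `copies` is a hypothetical stand-in for the number of stored copies of each document.

```python
def estimate_extra_bytes(doc_count, extra_bytes_per_doc, copies=1):
    """Estimate the storage added by indexing a little more per document.

    `copies` is the number of stored copies of each document
    (primary plus replicas).
    """
    return doc_count * extra_bytes_per_doc * copies
```

With placeholder values, `estimate_extra_bytes(100_000_000, 50, copies=3)` gives 15,000,000,000 bytes: 50 extra bytes per document becomes roughly 15 GB across the index.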

Limitations

Summary

Search is all about heuristics
Elasticsearch is not entirely reliable
Relevance cannot be compared across queries

Details

Some limitations are not obvious to people used to interacting with relational databases. Full-text search is all about heuristics. A search result is never an absolutely correct result; it’s a best estimate of what probably makes sense. Search is a probabilistic system.

Elasticsearch is not entirely reliable: a small number of writes are lost. The system is designed for eventual consistency, but there will be inconsistencies along the way, and use cases built on top of it need to account for that limitation. In most cases, a single edit will not significantly change the ranking of an article, so a lost write has very limited impact, and the changes will be picked up by the next update. We also have a mechanism in place that re-indexes all articles over time (currently a full pass takes ~8 weeks). Editors sometimes expect their changes or new articles to be immediately visible in Search; this is not supported and is unlikely to be supported in the future.

Search results are typically ordered by relevance. Relevance scores are valid only in the context of a single query and cannot be used to compare results across multiple queries. Consequently, results aggregated from multiple queries cannot be ordered in a meaningful way by their relevance scores.
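When results from several queries do need to be merged, one common option is to combine them by rank position instead of by score. The sketch below uses reciprocal rank fusion, a general technique that is not something this document or CirrusSearch prescribes. The constant `k=60` is a conventional default, and the document ids are assumed to be strings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of document ids using positions only.

    Each document receives 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Raw relevance scores are never compared
    across queries. Ties are broken by document id to keep the output
    deterministic.
    """
    fused = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
```

For example, merging `["a", "b", "c"]` with `["b", "d"]` ranks `b` first because it appears in both lists, followed by `a`, `d` and `c`.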