SLO/Search

Status: draft

Organizational

Service

The various search components provide full-text search of wiki articles, article title autocompletion, and MediaSearch.

Teams

Search Platform, a member of the Data Platforms working group, owns these services. The operational side is specifically owned by Data Platforms SRE.

Architectural

Environmental dependencies

CirrusSearch is a MediaWiki extension that handles communication between MediaWiki and the Search Elasticsearch clusters.

Our search Elasticsearch clusters are ultimately responsible for serving the search queries forwarded by MediaWiki.

Service dependencies

MediaWiki depends on CirrusSearch being operational.

CirrusSearch depends on the underlying Elasticsearch clusters being healthy and responsive.

Client-facing

Clients

Users submit queries on a given wiki. That wiki's MediaWiki processes generate Elasticsearch queries and, having received the results, display them through the UI.

Request Classes

There are various query types: full text, autocomplete, more_like.

There are also MediaSearch requests and search preview requests.

Service Level Indicators (SLIs)

All Search SLIs are latency SLIs, defined as the percentage of all requests that complete within X ms.
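As a minimal sketch of how such a latency SLI could be computed from raw request timings: the function names and the sample latencies below are hypothetical and are not taken from our metrics pipeline. With the nearest-rank percentile used here, "p95 latency < X ms" is equivalent to "at least 95% of requests complete in under X ms", which is how the percentage-based definition relates to the p95 thresholds listed in this section.

```python
import math


def fraction_within(latencies_ms, threshold_ms):
    """Fraction of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest observed value such that
    at least pct percent of the samples are at or below it."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies (ms) for ten full text requests.
sample = [120, 180, 250, 300, 450, 700, 900, 1500, 2500, 5000]

print(fraction_within(sample, 4000))  # share of requests under 4000 ms
print(percentile(sample, 95))         # p95 latency of the sample
```

In this made-up sample only 9 of 10 requests finish in under 4000 ms, so the p95 lands on the slow outlier and the sample would miss a 4000 ms p95 threshold.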

Full text p95 latency: <4000 ms

This represents the p95 latency of a "normal" search query made via Special:Search (for example, on enwiki).

Autocomplete p95 latency: <600 ms

This represents the p95 latency of an autocomplete query, also known as a CompSuggest query. This query type is used by MediaWiki and the Wikipedia UI to surface articles whose titles fuzzy match the text of the CompSuggest query.

Example of Wikipedia UI surfacing articles whose title prefixes fuzzy match the query text

Search preview p95 latency: <10000 ms

TODO

MediaSearch p95 latency: <10000 ms

This represents the p95 latency of MediaSearch queries.

Operational

Monitoring

We collect various Graphite-based metrics that are plumbed through MediaWiki. These are the major metrics used as SLIs for this service.

Eventually we'll transition to Prometheus-based metrics once MediaWiki has been cut over from Graphite to Prometheus.

Troubleshooting

Some issues, like the backend Elasticsearch clusters being temporarily overloaded, are quite simple to diagnose. Other issues can require more extensive troubleshooting.

Deployment

The CirrusSearch MediaWiki extension handles all the logic of communication between MediaWiki and Elasticsearch.

Additionally, we have 2 "cirrus clusters", in eqiad and codfw respectively, each of which is composed of 3 different Elasticsearch clusters. The multiple Elasticsearch clusters are for performance reasons: having all of our indices/shards in a single cluster results in poor performance, because Elasticsearch performance is inversely related to the total number of shards once some threshold has been passed.

Service Level Objectives

Realistic targets

Realistic targets (perhaps overly lenient; we can tighten the non-MediaSearch SLOs in the future when we have more data):

Full text p95 latency: <4000 ms 95% of the time

Autocomplete p95 latency: <600 ms 95% of the time

Search preview p95 latency: <10000 ms 95% of the time

MediaSearch p95 latency: <10000 ms 95% of the time
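The targets above can be read as "the p95 latency, measured per time window, stays under the threshold in at least 95% of windows". The following is a rough sketch of that check, assuming fixed-size measurement windows. The window size, the dictionary keys, the function names and the sample data are all hypothetical; only the thresholds and the 95% target come from the list above.

```python
# p95 thresholds in ms, copied from the realistic targets above.
P95_THRESHOLD_MS = {
    "full_text": 4000,
    "autocomplete": 600,
    "search_preview": 10000,
    "mediasearch": 10000,
}

REALISTIC_TARGET = 0.95  # share of windows that must be under the threshold


def compliance(p95_per_window_ms, threshold_ms):
    """Fraction of windows whose p95 latency is below threshold_ms."""
    if not p95_per_window_ms:
        raise ValueError("no windows observed")
    good = sum(1 for p95 in p95_per_window_ms if p95 < threshold_ms)
    return good / len(p95_per_window_ms)


def meets_slo(request_class, p95_per_window_ms, target=REALISTIC_TARGET):
    """True if the request class met its target over the given windows."""
    threshold_ms = P95_THRESHOLD_MS[request_class]
    return compliance(p95_per_window_ms, threshold_ms) >= target


# Made-up data: 20 windows of autocomplete p95 values, one of them slow.
autocomplete_windows = [310] * 19 + [850]

print(compliance(autocomplete_windows, P95_THRESHOLD_MS["autocomplete"]))
print(meets_slo("autocomplete", autocomplete_windows))
```

In this made-up example 19 of 20 windows are under 600 ms, so autocomplete just meets the 95% target; a second slow window would put it out of SLO.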

Ideal targets

The ideal targets for the SLOs are as follows:

Full text p95 latency: <4000 ms 99.9% of the time

Autocomplete p95 latency: <600 ms 99.9% of the time

Search preview p95 latency: <10000 ms 99.9% of the time

MediaSearch p95 latency: <10000 ms 95% of the time
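To give a feel for how much stricter 99.9% is than 95%, the following sketch converts a target into the amount of time a request class may spend over its p95 threshold. The 30-day reporting window and the function name are assumptions made for illustration; this page does not define a reporting window.

```python
from fractions import Fraction


def allowed_bad_minutes(target, window_days=30):
    """Minutes per window that may be spent out of SLO.

    target is passed as a string such as "0.95" so that the
    arithmetic stays exact instead of picking up float rounding.
    """
    total_minutes = window_days * 24 * 60
    return (1 - Fraction(target)) * total_minutes


for label, target in [("realistic", "0.95"), ("ideal", "0.999")]:
    minutes = allowed_bad_minutes(target)
    print(f"{label}: {float(minutes):.1f} minutes per 30 days")
```

Under that assumption, a 95% target allows 2160 minutes (36 hours) over threshold per 30 days, while a 99.9% target allows 43.2 minutes.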

Reconciliation

Currently we're using the realistic targets because our metrics have only been added in the last several months. Once we have more data, we should revisit these targets and decide whether to tighten our SLOs accordingly.