Search/CirrusSearch

CirrusSearch

CirrusSearch is a MediaWiki extension that provides search support backed by Elasticsearch. If you want to extend the data that is available to CirrusSearch, have a look at Search/TechnicalInteractionsWithSearch.

Configuration

The canonical location of the configuration documentation is the docs/settings.txt file in the extension source. It also lists the defaults, but the source of truth for defaults is extension.json. A pool counter configuration example lives in the README in the extension source.

WMF configuration overrides live in the ext-CirrusSearch.php and CirrusSearch-{common|production|labs}.php files in the mediawiki-config git repo.
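
To see which settings exist and which defaults ship with the extension, the extension.json config block and the WMF overrides can be inspected directly. A minimal sketch, assuming local checkouts of the extension and of mediawiki-config (the wmf-config/ path is an assumption):

jq '.config | with_entries(select(.key | startswith("CirrusSearch")))' extensions/CirrusSearch/extension.json
grep -n 'wgCirrusSearch' wmf-config/ext-CirrusSearch.php wmf-config/CirrusSearch-common.php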

Local Build

A Dockerized environment with CirrusSearch, Elasticsearch, and related services can be sourced from our integration test runner.

Logging

Via Logstash

Logs from CirrusSearch can be found on the general MediaWiki Logstash dashboard (requires NDA-level access). You can filter with channel:CirrusSearch AND "backend error". This isn't as specific as we'd like, but it should be enough to get you started.
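
Additional filters can be combined with the channel to narrow things down further; for example (field names other than channel are assumptions based on the usual MediaWiki Logstash fields):

channel:CirrusSearch AND level:ERROR
channel:CirrusSearch AND wiki:enwiki AND "backend error"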

Via Logging Hosts

The logs generated by CirrusSearch are located on mwlog1001.eqiad.wmnet under /a/mw-log/:

  • CirrusSearch.log: the main log. Around 300-500 lines generated per second.
  • CirrusSearchRequests.log: contains all requests (queries and updates) sent by CirrusSearch to Elasticsearch. Generates between 1,500 and 3,000+ lines per second.
  • CirrusSearchSlowRequests.log: contains all slow requests (the threshold is currently set to 10s but can be changed with $wgCirrusSearchSlowSearch). A few lines per day.
  • CirrusSearchChangeFailed.log: contains all failed updates. A few lines per day, except during a cluster outage.

Useful commands:

See all errors in real time (useful when doing maintenance on the Elasticsearch cluster)

tail -f /a/mw-log/CirrusSearch.log | grep -v DEBUG

WARNING: you can rapidly get flooded if the pool counter is full.
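
If pool counter rejections are flooding the output, they can be filtered out as well. A hedged sketch, since the exact message text is an assumption (adjust the grep pattern to what you actually see):

tail -f /a/mw-log/CirrusSearch.log | grep -v DEBUG | grep -iv 'pool counter'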

Measure the throughput between CirrusSearch and Elasticsearch (requests/sec) in real time

tail -f /a/mw-log/CirrusSearchRequests.log | pv -l -i 5 > /dev/null

NOTE: treat this as an estimate; it's not certain that all requests are logged here. For instance, requests sent to the frozen_index don't appear to be logged. Add roughly 150-300 qps on top, a figure guessed by counting the "Allowed write" lines in CirrusSearch.log.
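
The "Allowed write" rate mentioned in the note above can be measured with the same pv trick:

tail -f /a/mw-log/CirrusSearch.log | grep "Allowed write" | pv -l -i 5 > /dev/null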

Measure the number of prefix queries per second for enwiki in real time

tail -f /a/mw-log/CirrusSearchRequests.log | grep enwiki_content | grep " prefix search " | pv -l -i 5 > /dev/null
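
A couple of further variations in the same spirit (untested sketches rather than established commands):

Measure all requests per second for a single wiki (any query type) in real time

tail -f /a/mw-log/CirrusSearchRequests.log | grep enwiki_content | pv -l -i 5 > /dev/null

Sample the current write rate of each cirrus log (counts are lines written over 10 seconds, so divide by 10 for lines/sec)

for f in CirrusSearch CirrusSearchRequests CirrusSearchSlowRequests CirrusSearchChangeFailed; do
  printf '%s: ' "$f"; timeout 10 tail -n0 -f /a/mw-log/$f.log | wc -l
done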

CirrusSearch Indexing

Diagram?

CirrusSearch updates the Elasticsearch index by building and upserting almost the entire document on every edit. The revision id of the edit is used as the Elasticsearch version number, so that out-of-order writes from the job queue have no effect on index correctness (see the illustration after the list below). There are a few sources of writes to the production search clusters; CirrusSearch is the majority, but writes also come from:

  • Cirrus Streaming Updater, a Flink application that is due to replace the writes performed by the job queue
  • mjolnir-bulk-daemon, run on search-loader instances, which pushes updates generated by the team's Airflow instance into the search clusters; these are primarily the weighted_tags field
  • Logstash, run on apifeatureusage instances, which writes to its own indices in the search clusters
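
As an illustration of the versioning scheme described above: Elasticsearch external versioning rejects any write whose version is not greater than the stored one, so a late-arriving job carrying an older revision id changes nothing. A hedged sketch with a hypothetical host, document id, and revision ids (CirrusSearch issues its writes through its Elastica client, not curl):

curl -XPUT 'http://localhost:9200/enwiki_content/_doc/12345?version=1002&version_type=external' \
  -H 'Content-Type: application/json' -d '{"title": "Example"}'
# The same document sent with an older revision id is refused with 409 Conflict:
curl -XPUT 'http://localhost:9200/enwiki_content/_doc/12345?version=1001&version_type=external' \
  -H 'Content-Type: application/json' -d '{"title": "Stale"}'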

You can run some scripts from mwmaint1002.eqiad.wmnet, but you need to use a deployment server for backfills.

Adding new wikis

All wikis have Cirrus enabled as the search engine. To add a new Cirrus wiki:

  1. Estimate the number of shards required (one, the default, is fine for new wikis).
  2. Create the search index
  3. Populate the search index

Create the index

mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki $wiki --cluster=all

That'll create the search index on all necessary clusters with all the proper configuration.
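
To confirm the indices were actually created, the cluster can be queried directly. A minimal sketch, assuming the eqiad endpoint is search.svc.eqiad.wmnet:9243 (repeat against the other clusters as needed):

curl -s "https://search.svc.eqiad.wmnet:9243/_cat/indices/${wiki}*?v"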

Populate the search index

mkdir -p ~/log
# $wiki should be set to the database name of the new wiki
clusters='eqiad codfw cloudelastic'

for cluster in $clusters; do
  mwscript extensions/CirrusSearch/maintenance/ForceSearchIndex.php --wiki $wiki --cluster $cluster --skipLinks --indexOnSkip --queue | tee ~/log/$wiki.$cluster.parse.log
  mwscript extensions/CirrusSearch/maintenance/ForceSearchIndex.php --wiki $wiki --cluster $cluster --skipParse --queue | tee ~/log/$wiki.$cluster.links.log
done

If the last line of output from the --skipLinks run doesn't end with "jobs left on the queue", wait a few minutes before launching the second run: the job queue doesn't update its counts quickly, so the first run may queue everything before the counts catch up and still see the queue as empty. If the wiki is a private wiki, remove cloudelastic from the set of clusters. There is no harm if it's included, but it will (or at least should) throw exceptions and complain.
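
Once the backfill finishes, a quick sanity check is to compare the indexed document count against the wiki's page count. A rough sketch, with the same assumed endpoint as above (the numbers won't match exactly, since content and general pages live in separate indices):

curl -s "https://search.svc.eqiad.wmnet:9243/${wiki}_content/_count"
mwscript maintenance/showSiteStats.php --wiki $wiki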