Search/WeightedTags

Definition

CirrusSearch provides a way to store structured data in the indices powering full-text search on the wikis. This feature is useful in the following circumstances:

  • The data is not owned or controlled by MediaWiki but can be attached or attributed to a page
  • The data is too expensive to compute synchronously during the MediaWiki update process
  • The data is structured: when searching, the user or process knows exactly what to search for (codes or IDs, not natural language)
  • The data is relatively stable: only a small portion of the wiki pages should need to be reindexed hourly (please ask when in doubt)
  • The data can be lost: CirrusSearch is not a primary datastore, so the data must be retrievable from somewhere else
  • Real-time updates are not a strong requirement: hourly updates with a two-hour lag are the most frequent update rate available as of now

Adding new data

The CirrusSearch data-pipeline (Discovery/Analytics) running in the analytics cluster can be used to process data and push it to the search indices. The high-level picture of the process is:

  • A process produces data to an EventPlatform stream
  • The CirrusSearch data-pipeline running in the analytics cluster will:
    • consume these streams hourly
    • join the updates from different streams related to the same document together
    • push this data back to the production elasticsearch indices serving search on the wikis
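
The join step above can be pictured with a small sketch. This is not the pipeline's actual code: the record layout (`wiki`, `page_id`, `tags`) and the function name `join_updates` are assumptions made for illustration only.

```python
from collections import defaultdict


def join_updates(*streams):
    """Merge updates from several streams into one update per (wiki, page_id).

    Hypothetical record layout: {"wiki": str, "page_id": int, "tags": {family: [tags]}}.
    """
    merged = defaultdict(dict)
    for stream in streams:
        for update in stream:
            key = (update["wiki"], update["page_id"])
            # Each stream is assumed to own its tag families, so a plain
            # dict update is enough to combine them for the same document.
            merged[key].update(update["tags"])
    return dict(merged)


topic_updates = [
    {"wiki": "cswiki", "page_id": 42,
     "tags": {"classification.ores.articletopic": ["STEM.STEM*|926"]}},
]
recommendation_updates = [
    {"wiki": "cswiki", "page_id": 42, "tags": {"recommendation.link": ["exists|1"]}},
    {"wiki": "cswiki", "page_id": 7, "tags": {"recommendation.link": ["exists|1"]}},
]

joined = join_updates(topic_updates, recommendation_updates)
for key, tags in sorted(joined.items()):
    print(key, tags)
```

Joining per document means a page touched by two streams in the same hour results in a single update to the search index rather than two.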

Producing the data using the Event Platform

CirrusSearch requires at least the following information to update the search index:

  • the wiki database name
  • the page id (and the revision id if possible)
  • the namespace of the page
  • the payload (the data to store)

The Event Platform provides all the necessary tools to design and produce such events.
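
As a rough illustration of the minimum information listed above, the sketch below assembles such an event as a plain dictionary. The field names (`database`, `page_id`, `page_namespace`, `rev_id`, `payload`) and the helper `build_update_event` are assumptions for this example; the real field names are defined by the Event Platform schema chosen for the stream.

```python
import json

# Hypothetical field names; the authoritative ones come from the event schema.
REQUIRED_FIELDS = ("database", "page_id", "page_namespace", "payload")


def build_update_event(database, page_id, page_namespace, payload, rev_id=None):
    """Assemble the minimum information needed to update a page in the index."""
    event = {
        "database": database,              # wiki database name, e.g. "cswiki"
        "page_id": page_id,
        "page_namespace": page_namespace,
        "payload": payload,                # the data to store
    }
    if rev_id is not None:
        # The revision id is optional but helps pick the latest event per page.
        event["rev_id"] = rev_id
    missing = [name for name in REQUIRED_FIELDS if event[name] is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return event


event = build_update_event("cswiki", 42, 0, {"recommendation_type": "link"}, rev_id=1001)
print(json.dumps(event, indent=2))
```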

Example 1: using MediaWiki and EventBus (recommendation create)

This data makes it possible to search for pages for which an algorithm running offline has detected a recommendation to make an edit via a structured task. A new schema was created to support the required data. The data is currently populated by MediaWiki using EventBus, which creates and produces events conforming to the mediawiki/revision/recommendation-create schema. Creating an event is straightforward, and sending it is simpler still:

public function addLinkRecommendation( RevisionRecord $revision ) {
	/** @var EventBusFactory $eventBusFactory */
	$eventBusFactory = MediaWikiServices::getInstance()->getService( 'EventBus.EventBusFactory' );
	$eventBus = $eventBusFactory->getInstanceForStream( 'mediawiki.revision-recommendation-create' );
	$eventFactory = $eventBus->getFactory();
	$event = $eventFactory->createRecommendationCreateEvent( 'mediawiki.revision-recommendation-create', 'link', $revision );
	$result = $eventBus->send( [ $event ] );
	if ( $result !== true ) {
		// error handling
	}
}

The example above shows how to add a link recommendation to the page behind $revision. Note that this schema and the \MediaWiki\Extension\EventBus\EventFactory::createRecommendationCreateEvent() method can also be used for other kinds of recommendations (e.g. image). Please be sure to coordinate with the Growth team engineers who created this schema.

Example 2: using Changeprop with ORES article/draft topic

This data is produced to the mediawiki.revision-score stream and conforms to the mediawiki/revision/score schema. A custom Changeprop processor ships the data.

Example 3: feeding the elasticsearch index directly from MediaWiki (for testing)

For testing, it can be handy to update the elasticsearch index directly and bypass the data pipeline completely. CirrusSearch provides the CirrusSearch::updateWeightedTags() method to do so:

$engine = MediaWikiServices::getInstance()->getSearchEngineFactory()->create();
Assert::precondition( $engine instanceof CirrusSearch, "CirrusSearch must be the default search engine" );
/** @var CirrusSearch $engine */
$pageToUpdate = Title::newFromText( 'Target Page' )->toPageIdentity();
// Schedules an asynchronous update to the search index and will populate the weighted_tags field with
// "my-custom-tag-family/tag-value-1" and "my-custom-tag-family/tag-value-2" with respective term frequencies 2 and 30
// for the page "Target Page"
$engine->updateWeightedTags( $pageToUpdate, 'my-custom-tag-family', [ 'tag-value-1', 'tag-value-2' ], [ 2, 30 ] );
This function must only be used for testing purposes.
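
To make the comment in the snippet above concrete, the following sketch shows how a tag family, its tag values and their weights would combine into weighted_tags entries, assuming the tag_prefix/tag|score shape described under "Shape of the data in elasticsearch". The helper `format_weighted_tags` is hypothetical and is not part of CirrusSearch.

```python
def format_weighted_tags(family, values, weights=None):
    """Build weighted_tags entries of the form "family/value|weight".

    Illustrative helper only; CirrusSearch does this work internally.
    """
    if weights is None:
        # The score is optional, so entries may omit the "|weight" suffix.
        return [f"{family}/{value}" for value in values]
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return [f"{family}/{value}|{weight}" for value, weight in zip(values, weights)]


tags = format_weighted_tags(
    "my-custom-tag-family", ["tag-value-1", "tag-value-2"], [2, 30]
)
print(tags)
```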

CirrusSearch analytics pipeline

When a new stream is added, the data-pipeline must be adapted. The Search Platform team is expected to do the integration work, but here is a quick overview of what needs to happen for the data to be injected into the production elasticsearch indices.

In the Discovery/Analytics repository add a new airflow dag whose purpose is to populate a table fit for the process that transfers the data to elasticsearch. The dag will likely include:

  1. a sensor to wait for the data to be available in the event.stream_name hive table
  2. a process to transform the event data (and possibly select the latest event per page thanks to the revision id or the event timestamp)

For the recommendation create pipeline this is what the dag looks like: mediawiki_revision_recommendation_create.py
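
The "latest event per page" selection mentioned in step 2 can be sketched in plain Python. The real dag operates on hive tables; the field names used here (`database`, `page_id`, `rev_id`, `dt`) are assumptions for illustration, and `dt` is assumed to be a UTC ISO 8601 string so that string comparison matches chronological order.

```python
def latest_event_per_page(events):
    """Keep only the most recent event for each (database, page_id).

    Events are ordered by revision id first, then by event timestamp.
    """
    latest = {}
    for event in events:
        key = (event["database"], event["page_id"])
        current = latest.get(key)
        if current is None or (
            (event["rev_id"], event["dt"]) > (current["rev_id"], current["dt"])
        ):
            latest[key] = event
    return list(latest.values())


events = [
    {"database": "cswiki", "page_id": 42, "rev_id": 1001, "dt": "2021-05-01T10:00:00Z"},
    {"database": "cswiki", "page_id": 42, "rev_id": 1005, "dt": "2021-05-01T10:20:00Z"},
    {"database": "cswiki", "page_id": 7, "rev_id": 900, "dt": "2021-05-01T10:05:00Z"},
]

selected = latest_event_per_page(events)
for event in selected:
    print(event["page_id"], event["rev_id"])
```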

Adapting the transfer_to_es process is then necessary:

  1. Adapt the dag transfer_to_es.py to wait for the data populated in the previous step
  2. Adapt the spark command convert_to_esbulk.py and add a new Table entry with a MultiListField field in the CONFIG array.

Resetting the data from MediaWiki

In some scenarios, tags might have to be deleted or reset after a user action. In the recommendation use case, when a user refuses a recommendation or makes an edit after a recommendation has been presented to them, the state of the tag for this page must be reset to avoid suggesting the same page again.

CirrusSearch provides a function that can be called from your process to do this:

$engine = MediaWikiServices::getInstance()->getSearchEngineFactory()->create();
Assert::precondition( $engine instanceof CirrusSearch, "CirrusSearch must be the default search engine" );
/** @var CirrusSearch $engine */
$pageToUpdate = Title::newFromText( 'Target Page' )->toPageIdentity();
// Schedules an asynchronous update to reset all tags under "my-custom-tag-family"
$engine->resetWeightedTags( $pageToUpdate, 'my-custom-tag-family' );

This will reset all tags under the my-custom-tag-family for the page Target Page by sending an asynchronous update request (near real time) to the search index.

Querying the data

Shape of the data in elasticsearch

The data lies within an index document as an array of strings where each entry represents a tag. In the elasticsearch source document, each tag has the shape tag_prefix/tag|score:

  • tag_prefix is the family or category of the tag
  • tag is the identifying value of the tag; beware that no text analysis is performed on this data, so it is case sensitive
  • score is an optional score as an integer (1 to 1000) that is encoded as the term frequency of the indexed token

Here is an example taken from the Czech Wikipedia:

{
  "weighted_tags": [
    "classification.ores.articletopic/STEM.Libraries & Information|699",
    "classification.ores.articletopic/STEM.STEM*|926",
    "classification.ores.articletopic/Culture.Media.Software|566",
    "recommendation.link/exists|1"
  ]
}

Which can be broken up as:

  • Family classification.ores.articletopic
    • tag STEM.Libraries & Information with a score of 699
    • tag STEM.STEM*, score 926
    • tag Culture.Media.Software, score 566
  • Family recommendation.link
    • tag exists, score of 1
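
The breakdown above can be reproduced with a small parser. This is a sketch that assumes the family never contains "/" and that any "|" in an entry introduces the score; `WeightedTag` and `parse_weighted_tag` are names chosen for this example.

```python
from typing import NamedTuple, Optional


class WeightedTag(NamedTuple):
    family: str
    tag: str
    score: Optional[int]


def parse_weighted_tag(raw):
    """Split a "tag_prefix/tag|score" entry into its three parts."""
    body, separator, score_text = raw.rpartition("|")
    if separator:
        score = int(score_text)
    else:
        # No "|" present: the optional score was omitted.
        body, score = raw, None
    family, slash, tag = body.partition("/")
    if not slash:
        raise ValueError(f"not a weighted tag: {raw!r}")
    return WeightedTag(family, tag, score)


weighted_tags = [
    "classification.ores.articletopic/STEM.Libraries & Information|699",
    "classification.ores.articletopic/STEM.STEM*|926",
    "classification.ores.articletopic/Culture.Media.Software|566",
    "recommendation.link/exists|1",
]

parsed = [parse_weighted_tag(entry) for entry in weighted_tags]
for item in parsed:
    print(item)
```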

Querying the tags

Tags must be searched with an elasticsearch match query on the weighted_tags field using the full tag structure tag-family/tag-value, minus the |score suffix, which is only read at index time:

{
  "match": {
    "weighted_tags": {
      "query": "recommendation.link/exists"
    }
  }
}

This query finds all pages matching the tag. The score of the match query is equal to 0.0001 (the score given at index time is multiplied by 0.0001 to obtain a number between 0 and 1). Because the provided score is encoded as the term frequency, the term_freq query can be used to perform interesting filtering:

{
    "term_freq": {
        "field": "weighted_tags",
        "term": "classification.ores.articletopic/STEM.STEM*",
        "gte": 900
    }
}

This query finds pages for which the STEM.STEM* topic has a score greater than or equal to 900.
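
When these queries are generated by a script, small builder functions help keep the tag structure consistent. The sketch below only produces the JSON bodies shown above and combines them in a standard elasticsearch bool filter; the function names are assumptions for this example, and it does not send anything to a cluster.

```python
import json


def tag_match_query(family, tag):
    """Match query on the full "family/tag" token, without the |score suffix."""
    return {"match": {"weighted_tags": {"query": f"{family}/{tag}"}}}


def tag_min_score_query(family, tag, min_score):
    """term_freq query keeping pages whose tag score is >= min_score."""
    if not 1 <= min_score <= 1000:
        raise ValueError("scores are integers between 1 and 1000")
    return {
        "term_freq": {
            "field": "weighted_tags",
            "term": f"{family}/{tag}",
            "gte": min_score,
        }
    }


# Pages that have a link recommendation and a strong STEM.STEM* topic score.
search_body = {
    "query": {
        "bool": {
            "filter": [
                tag_match_query("recommendation.link", "exists"),
                tag_min_score_query("classification.ores.articletopic", "STEM.STEM*", 900),
            ]
        }
    }
}
print(json.dumps(search_body, indent=2))
```

Since tags are not analyzed, the family and tag passed to these helpers must match the indexed values exactly, including case.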

Within CirrusSearch, a filtering keyword can be added to allow users and bots to filter pages that match a particular tag; for instance, see HasRecommendationFeature.php, the code behind the hasrecommendation: search keyword. This is useful for combining tag filtering with other criteria indexed by CirrusSearch (e.g. categories, templates, text...).

If you own a custom fulltext query builder (e.g. MediaSearch, WikibaseCirrusSearch) the weighted_tags field can be used too.

Known tag families

| family | owner [1] | known users [2] | Event stream | hive table | usage in search |
|---|---|---|---|---|---|
| classification.ores.articletopic | ML | Growth | mediawiki.revision-score | N/A | keyword articletopic: |
| classification.ores.drafttopic | ML | N/A | mediawiki.revision-score | N/A | keyword drafttopic: |
| recommendation.link | Growth | Growth | mediawiki.revision-recommendation-create | N/A | keyword hasrecommendation: |
| recommendation.image | SDAW | Growth | N/A | analytics_platform_eng.image_suggestions_search_index_delta | keyword hasrecommendation: |
| image.linked.from.wikidata.p18 | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta | keyword custommatch:depicts_or_linked_from= and when searching the File namespace on any wiki with the WikibaseMediaInfo extension enabled (at the moment, that is Commons) |
| image.linked.from.wikidata.p373 | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta | |
| image.linked.from.wikipedia.lead_image | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta | |

[1] Team responsible for producing the data
[2] Known teams relying on this data in the search index for their product