Search/WeightedTags
Definition
CirrusSearch provides a way to store structured data in the indices powering full-text search on the wikis. This feature is useful in the following circumstances:
- You need to store and search for data that is not owned or controlled by MediaWiki but can be attached or attributed to a page.
- The data is too expensive to compute synchronously during the MediaWiki update process.
- The data is structured: when searching, the user or process knows exactly what to search for (codes, IDs, not natural language).
- The data is relatively stable: only a small portion of the wiki pages should need to be reindexed each hour (please ask when in doubt).
- The data can be lost: CirrusSearch is not a primary datastore, so the data must be retrievable from somewhere else.
- Real-time updates are not a strong requirement: hourly updates with a two-hour lag are the most frequent update rate available as of now.
Adding new data
The CirrusSearch data-pipeline (Discovery/Analytics) running in the analytics cluster can be used to process data and push it to the search indices. At a high level, the process is:
- A process produces data to an EventPlatform stream
- The CirrusSearch data-pipeline running in the analytics cluster will:
  - consume these streams hourly
  - join the updates from different streams related to the same document together
  - push this data back to the production Elasticsearch indices serving search on the wikis
Producing the data using the Event Platform
CirrusSearch requires at least the following information to update the search index:
- the wiki database name
- the page id (and the revision id if possible)
- the namespace of the page
- the payload (the data to store)
The Event Platform provides all the necessary tools to design and produce such events.
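The required fields listed above, together with the per-document join performed by the pipeline, can be sketched as follows. This is only an illustration, not the pipeline's actual code: the class name, the field names and the payload shape are hypothetical, and real events must conform to a schema designed with the Event Platform tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TagUpdateEvent:
    """Minimum information for a search index update (hypothetical field names)."""
    wiki: str                     # wiki database name, e.g. "cswiki"
    page_id: int
    namespace: int
    payload: dict                 # the data to store
    rev_id: Optional[int] = None  # optional, provide it when available


def join_by_document(events):
    """Merge the payloads of all events that target the same (wiki, page_id).

    If two events carry the same payload key, the later event wins.
    """
    joined = defaultdict(dict)
    for event in events:
        joined[(event.wiki, event.page_id)].update(event.payload)
    return dict(joined)


events = [
    TagUpdateEvent("cswiki", 42, 0, {"recommendation.link": ["exists"]}, rev_id=1001),
    TagUpdateEvent("cswiki", 42, 0, {"classification.ores.articletopic": ["STEM.STEM*"]}),
    TagUpdateEvent("cswiki", 7, 0, {"recommendation.link": ["exists"]}),
]
joined = join_by_document(events)
```

Here the two events for page 42 end up in a single update, while page 7 gets its own.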
Example 1: using MediaWiki and EventBus (recommendation create)
This data makes it possible to search for pages for which an algorithm running offline has detected a recommendation to make an edit via a structured task.
A new schema has been created to support the required data. The data is currently populated by MediaWiki using EventBus, which creates and produces events conforming to the mediawiki/revision/recommendation-create schema.
Creating an event is pretty straightforward and sending it is even simpler:
public function addLinkRecommendation( RevisionRecord $revision ) {
    /** @var EventBusFactory $eventBusFactory */
    $eventBusFactory = MediaWikiServices::getInstance()->getService( 'EventBus.EventBusFactory' );
    // Get the EventBus instance configured for the target stream
    $eventBus = $eventBusFactory->getInstanceForStream( 'mediawiki.revision-recommendation-create' );
    $eventFactory = $eventBus->getFactory();
    // Build a "link" recommendation event for the given revision
    $event = $eventFactory->createRecommendationCreateEvent( 'mediawiki.revision-recommendation-create', 'link', $revision );
    $result = $eventBus->send( [ $event ] );
    if ( $result !== true ) {
        // error handling
    }
}
The example above shows how to add a link recommendation to the page behind $revision. Note that this schema and the \MediaWiki\Extension\EventBus\EventFactory::createRecommendationCreateEvent() method can be used for other kinds of recommendations too (e.g. image); please be sure to coordinate with the Growth team engineers who created this schema.
Example 2: using Changeprop with ORES article/draft topic
This data is produced to the mediawiki.revision-score stream and conforms to the mediawiki/revision/score schema. It uses a custom Changeprop processor to ship the data.
Example 3: feeding the elasticsearch index directly from MediaWiki (for testing)
For testing, it can be handy to update Elasticsearch directly and bypass the data pipeline entirely. CirrusSearch provides the CirrusSearch::updateWeightedTags() method to do so:
$engine = MediaWikiServices::getInstance()->getSearchEngineFactory()->create();
Assert::precondition( $engine instanceof CirrusSearch, "CirrusSearch must be the default search engine" );
/** @var CirrusSearch $engine */
$pageToUpdate = Title::newFromText( 'Target Page' )->toPageIdentity();
// Schedules an asynchronous update to the search index and will populate the weighted_tags field with
// "my-custom-tag-family/tag-value-1" and "my-custom-tag-family/tag-value-2" with respective term frequencies 2 and 30
// for the page "Target Page"
$engine->updateWeightedTags( $pageToUpdate, 'my-custom-tag-family', [ 'tag-value-1', 'tag-value-2' ], [ 2, 30 ] );
CirrusSearch Update Pipeline
Since phab:T366253, there is a new, consolidated stream for adding and removing weighted tags: mediawiki.cirrussearch.page_weighted_tags_change.rc0.
A single event can set and/or clear weighted tags. Any tag listed under set will be merged with the existing ones. Any prefix listed under clear will clear all tags under that prefix.
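The set and clear rules can be modelled with a small toy function. This is one literal reading of the two sentences above, not the pipeline's implementation: the function name, the dictionary representation of tags and the choice to apply clears before sets within one event are all assumptions, and the real update logic may differ in detail.

```python
def apply_tag_change(existing, set_tags=None, clear_prefixes=None):
    """Return the tags of a page after one change event (toy model).

    existing / set_tags: dict mapping "prefix/tag" to an integer score.
    clear_prefixes: iterable of tag prefixes whose tags are all removed.
    Assumption: clears are applied before sets within a single event.
    """
    clear_prefixes = tuple(clear_prefixes or ())
    # Drop every tag whose prefix (the part before the first "/") is cleared.
    result = {
        tag: score
        for tag, score in existing.items()
        if tag.split("/", 1)[0] not in clear_prefixes
    }
    # Merge the tags listed under "set" into what is left.
    result.update(set_tags or {})
    return result


existing = {
    "classification.ores.articletopic/STEM.STEM*": 926,
    "recommendation.link/exists": 1,
}
after = apply_tag_change(
    existing,
    set_tags={"classification.ores.articletopic/Culture.Media.Software": 566},
    clear_prefixes=["recommendation.link"],
)
```

In this model the recommendation.link tag disappears, the existing article topic is kept, and the new topic is added next to it.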
Resetting the data from MediaWiki
In some scenarios, tags might have to be deleted or reset after a user action is taken. In the recommendation use case, when a user rejects a recommendation presented to them, or makes the suggested edit, the state of the tag for this page must be reset to avoid suggesting the same page again.
CirrusSearch provides a function that can be called from your process to do this:
$engine = MediaWikiServices::getInstance()->getSearchEngineFactory()->create();
Assert::precondition( $engine instanceof CirrusSearch, "CirrusSearch must be the default search engine" );
/** @var CirrusSearch $engine */
$pageToUpdate = Title::newFromText( 'Target Page' )->toPageIdentity();
// Schedules an asynchronous update to reset all tags under "my-custom-tag-family"
$engine->resetWeightedTags( $pageToUpdate, 'my-custom-tag-family' );
This will reset all tags under the my-custom-tag-family prefix for the page Target Page by sending an asynchronous update request (near real time) to the search index.
Querying the data
Shape of the data in elasticsearch
The data lies within an index document as an array of strings where each entry represents a tag. In the Elasticsearch source document, each tag has the shape tag_prefix/tag|score:
- tag_prefix is the family or category of the tag
- tag is the identifying value of the tag; beware that no text analysis is performed on this data, so it is case sensitive
- score is an optional integer score (1 to 1000) that is encoded as the term frequency of the indexed token
Here is an example taken from the Czech Wikipedia:
{
  "weighted_tags": [
    "classification.ores.articletopic/STEM.Libraries & Information|699",
    "classification.ores.articletopic/STEM.STEM*|926",
    "classification.ores.articletopic/Culture.Media.Software|566",
    "recommendation.link/exists|1"
  ]
}
This can be broken up as:
- Family classification.ores.articletopic
  - tag STEM.Libraries & Information, score 699
  - tag STEM.STEM*, score 926
  - tag Culture.Media.Software, score 566
- Family recommendation.link
  - tag exists, score 1
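The tag_prefix/tag|score shape can be taken apart with a few lines of code. The sketch below is illustrative only: the WeightedTag type and the parse_weighted_tag function are hypothetical names, and it assumes the prefix never contains "/" and that a trailing "|" always introduces an integer score.

```python
from typing import NamedTuple, Optional


class WeightedTag(NamedTuple):
    prefix: str
    tag: str
    score: Optional[int]


def parse_weighted_tag(raw: str) -> WeightedTag:
    """Split "tag_prefix/tag|score" into its parts; the score may be absent."""
    prefix, slash, rest = raw.partition("/")
    if not slash:
        raise ValueError(f"missing '/' in weighted tag: {raw!r}")
    tag, bar, score = rest.rpartition("|")
    if not bar:
        # No "|score" suffix: the whole remainder is the tag value.
        return WeightedTag(prefix, rest, None)
    return WeightedTag(prefix, tag, int(score))


source_tags = [
    "classification.ores.articletopic/STEM.Libraries & Information|699",
    "classification.ores.articletopic/STEM.STEM*|926",
    "classification.ores.articletopic/Culture.Media.Software|566",
    "recommendation.link/exists|1",
]
parsed = [parse_weighted_tag(t) for t in source_tags]
```

Applied to the Czech Wikipedia example, this yields the same families, tags and scores as the breakdown above.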
Querying the tags
Tags must be searched with an Elasticsearch match query on the weighted_tags field using the full tag structure tag-family/tag-value, minus the |score suffix, which is only read at index time:
{
  "match": {
    "weighted_tags": {
      "query": "recommendation.link/exists"
    }
  }
}
This finds all pages matching the tag. The score of the match query is equal to 0.0001 (the score given at index time is multiplied by 0.0001 to have a number between 0 and 1). Because the provided score is encoded as the term frequency, the term_freq query can be used to filter on it:
{
  "term_freq": {
    "field": "weighted_tags",
    "term": "classification.ores.articletopic/STEM.STEM*",
    "gte": 900
  }
}
This finds pages for which the STEM.STEM* topic has a score greater than or equal to 900.
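Both query bodies can be generated from a tag string. The helpers below are a minimal sketch with hypothetical function names; they only build the JSON shown above and do not talk to Elasticsearch. The 1 to 1000 bound check reuses the score range stated earlier.

```python
import json


def strip_score(tag: str) -> str:
    """Drop a trailing "|score" suffix, which is only read at index time."""
    head, bar, tail = tag.rpartition("|")
    return head if bar and tail.isdigit() else tag


def match_tag_query(tag: str) -> dict:
    """Build a match query that finds all pages carrying the tag."""
    return {"match": {"weighted_tags": {"query": strip_score(tag)}}}


def min_score_query(tag: str, min_score: int) -> dict:
    """Build a term_freq query keeping pages whose tag score is >= min_score."""
    if not 1 <= min_score <= 1000:
        raise ValueError("scores are integers between 1 and 1000")
    return {
        "term_freq": {
            "field": "weighted_tags",
            "term": strip_score(tag),
            "gte": min_score,
        }
    }


query = min_score_query("classification.ores.articletopic/STEM.STEM*|926", 900)
print(json.dumps(query, indent=2))
```

Stripping the score in one place means a string copied from a source document can be reused directly as a query term.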
Within CirrusSearch, a filtering keyword can be added to allow users and bots to filter pages that match a particular tag; for instance, see HasRecommendationFeature.php, the code behind the hasrecommendation: search keyword. This is useful for combining tag filtering with other criteria indexed by CirrusSearch (e.g. categories, templates, text).
If you own a custom fulltext query builder (e.g. MediaSearch, WikibaseCirrusSearch), the weighted_tags field can be used too.
Known tag families
family | owner[1] | known users[2] | Event stream | hive table | usage in search |
---|---|---|---|---|---|
classification.ores.articletopic | ML | Growth | mediawiki.revision-score | N/A | keyword articletopic: |
classification.ores.drafttopic | ML | N/A | mediawiki.revision-score | N/A | keyword drafttopic: |
recommendation.link | Growth | Growth | mediawiki.revision-recommendation-create | N/A | keyword hasrecommendation: |
recommendation.image | SDAW | Growth | N/A | analytics_platform_eng.image_suggestions_search_index_delta | keyword hasrecommendation: |
image.linked.from.wikidata.p18 | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta | keyword custommatch:depicts_or_linked_from= and when searching the File namespace on any wiki with the WikibaseMediaInfo extension enabled (currently only Commons) |
image.linked.from.wikidata.p373 | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta | |
image.linked.from.wikipedia.lead_image | SDAW | SDAW | N/A | analytics_platform_eng.image_suggestions_search_index_delta | |