Search/articletopic
CirrusSearch supports searching Wikipedia main-namespace articles by article topic (provided by the language-agnostic link-based article topic model) using the articletopic: keyword. This requires loading LiftWing predictions into Elasticsearch. The rough data flow for this is:
- wiki edit -> changeprop -> LiftWing -> EventBus -> Cirrus Streaming Updater -> Elasticsearch
Updates are near real-time, typically arriving within 10 minutes of an edit. The delay usually won't matter much, since most edits won't change topic predictions drastically, but it does mean that new articles only become searchable by topic about 10 minutes after creation.
articletopic: uses its own set of keywords, as the model-provided topic labels are not search-expression-friendly; these can be found in ArticleTopicFeature.php.
Details
- Whenever a page is edited, the EventBus extension (code) creates a mediawiki.page-change event through EventBus.
- changeprop configuration listens for those events and invokes LiftWing.
- The LiftWing containers receive the requests from changeprop, run the prediction, and send an event to mediawiki.revision-score-articletopic.
- The Cirrus Streaming Updater subscribes to the event stream in Kafka. It merges edit-related updates, such as the article topic predictions, into the initial edit event and updates the appropriate search indices. Any newly provided topic prediction overwrites the previous prediction. No custom processing is applied: topic predictions are stored for the page specified in the event.
- In Elasticsearch the data is stored under the Search/WeightedTags field with the classification.ores.articletopic tag family as a pretend word vector (e.g. if an article has a prediction of Linguistics -> 0.98, Literature -> 0.63, Elasticsearch will represent that with a field which has the word "Linguistics" 980 times and the word "Literature" 630 times). The articletopic: search keyword will do a text similarity search between this field and the topics provided. (field definition, search code)
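The pretend word vector described in the last step can be illustrated with a minimal Python sketch. This is not the CirrusSearch implementation: the function names are hypothetical, and the scale factor of 1000 is an assumption inferred from the example above (0.98 -> 980 repetitions).

```python
from collections import Counter

# Assumption: inferred from the example in the text (0.98 -> 980 repetitions).
SCALE = 1000


def encode_predictions(predictions, scale=SCALE):
    """Convert {topic: probability} into {topic: repetition count}.

    Hypothetical helper for illustration only.
    """
    return Counter(
        {topic: round(score * scale) for topic, score in predictions.items()}
    )


def as_pseudo_text(term_counts):
    """Expand the counts into the repeated-word text a term-frequency
    based similarity search would effectively see."""
    return " ".join(term_counts.elements())


counts = encode_predictions({"Linguistics": 0.98, "Literature": 0.63})
print(counts["Linguistics"], counts["Literature"])
```

Because a higher prediction score becomes a higher term frequency, a text similarity search for a topic naturally ranks articles with a stronger prediction for that topic higher.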
Management
When the model changes significantly, the data has to be bulk-uploaded into Elasticsearch, since normally it is only updated for articles that are changed. This has not been done since the migration from ORES to LiftWing and the change to streaming updates. Some method of bulk-importing predictions from monthly snapshots will be required.
In MediaWiki site configuration, the $wgCirrusSearchWMFExtraFeatures['weighted_tags']['build'] flag enables building the Elasticsearch index for the field, and $wgCirrusSearchWMFExtraFeatures['weighted_tags']['use'] enables the articletopic search keyword.
See also
- phab:T240517 which was the epic for setting this up
- mw:ORES/Articletopic for the topic taxonomy (this is used for all wikis)
- mw:Help:CirrusSearch#Articletopic for user docs
- Search/WeightedTags
- https://tools.wmflabs.org/ores-support-checklist/ for which wikis have a native model