CirrusSearch supports searching Wikipedia main-namespace articles by article topic (provided by the topic mapping functionality in ORES) using the articletopic: keyword. This requires loading ORES predictions into Elasticsearch. The rough data flow for this is:
- wiki edit -> changeprop -> ORES -> EventBus -> HDFS -> Airflow / Spark (-> cross-wiki propagation) -> Elasticsearch
Most of this happens in real time, but the Airflow step runs only once a week, on Sunday. This usually does not matter much, since most edits do not change topic predictions drastically, but new articles (or, on wikis which do not have their own ORES model, articles newly linked to Wikidata) only become searchable by topic the following week.
articletopic: uses its own set of keywords as the ORES labels are not search-expression-friendly; these can be found in ArticleTopicFeature.php.
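As a rough illustration of how the keyword is used from a client, the sketch below builds a MediaWiki search API URL containing an articletopic: expression. The topic keywords `linguistics` and `literature`, the use of `|` to combine topics, and the helper name `articletopic_search_url` are assumptions made for the example; check ArticleTopicFeature.php and the user docs for the actual keyword list and syntax.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia's public API endpoint is used for the example.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def articletopic_search_url(topics, extra_terms="", limit=10):
    """Build a search API URL that filters by one or more topic keywords.

    `topics` are search-friendly keywords (not raw ORES labels).
    Joining them with "|" is assumed here to mean "any of these topics".
    """
    keyword = "articletopic:" + "|".join(topics)
    query = f"{keyword} {extra_terms}".strip()
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(articletopic_search_url(["linguistics", "literature"], "grammar"))
```

The function only constructs the URL; fetching it is left to whatever HTTP client the caller prefers.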
- Whenever a page is edited, the EventBus extension (code) triggers changeprop which sends data about the new revision to the ORES precache API (code, configuration), which calculates predictions for a predefined set of models (configuration); for English and some other Wikipedias, this includes the
- changeprop then sends these predictions to the mediawiki.revision-score EventGate stream, which stores them in the event.mediawiki_revision_score table in Hadoop.
- The Airflow / Spark job management platform used for managing wiki search data has a job that scoops up this data, deduplicates by page ID, discards predictions which are below the thresholds provided by ORES (these change dynamically to keep the precision of the predictions at a constant level), propagates (via Wikidata interwiki links) English Wikipedia predictions to pages on wikis which do not have their own articletopic models (this is determined automatically via the ORES API), and loads it into the discovery.ores_articletopic Hadoop table. (code, table schema) Another Airflow job then transfers this data into Elasticsearch (code).
- In Elasticsearch the data is stored under the Search/WeightedTags field with the classification.ores.articletopic tag family as a pretend word vector (e.g. if an article has a prediction of Linguistics -> 0.98, Literature -> 0.63, Elasticsearch will represent that with a field which has the word "Linguistics" 980 times and the word "Literature" 630 times). The articletopic: search keyword will do a text similarity search between this field and the topics provided. (field definition, search code)
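The batch steps above (deduplicate by page ID, drop predictions below the ORES thresholds, turn the remaining scores into term counts) can be sketched in plain Python. This is a minimal illustration, not the actual Spark job: the `Score` type, the function names, the threshold values, the "Films" label and the choice to keep the highest revision ID when deduplicating are all assumptions. The scale of 1000 follows the 0.98 -> 980 example in the text. Cross-wiki propagation is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """One hypothetical row from the revision-score data."""
    page_id: int
    rev_id: int
    probabilities: dict  # topic label -> probability in [0, 1]


def latest_per_page(scores):
    """Deduplicate by page ID (assumption: keep the highest revision ID)."""
    latest = {}
    for score in scores:
        current = latest.get(score.page_id)
        if current is None or score.rev_id > current.rev_id:
            latest[score.page_id] = score
    return latest


def apply_thresholds(probabilities, thresholds):
    """Keep only topics whose probability reaches that topic's threshold."""
    return {
        topic: p
        for topic, p in probabilities.items()
        if topic in thresholds and p >= thresholds[topic]
    }


def to_term_frequencies(probabilities, scale=1000):
    """Encode probabilities as integer term counts (0.98 -> 980)."""
    return {topic: round(p * scale) for topic, p in probabilities.items()}


def build_weighted_tags(scores, thresholds):
    """Run all three steps and return page ID -> {topic: term count}."""
    return {
        page_id: to_term_frequencies(
            apply_thresholds(score.probabilities, thresholds)
        )
        for page_id, score in latest_per_page(scores).items()
    }


if __name__ == "__main__":
    scores = [
        Score(1, 10, {"Linguistics": 0.50, "Literature": 0.20}),
        Score(1, 12, {"Linguistics": 0.98, "Literature": 0.63, "Films": 0.05}),
        Score(2, 7, {"Films": 0.90}),
    ]
    # Made-up thresholds; the real ones come from ORES and change over time.
    thresholds = {"Linguistics": 0.6, "Literature": 0.5, "Films": 0.7}
    print(build_weighted_tags(scores, thresholds))
```

Storing the score as a term count lets the ordinary text-similarity scoring in Elasticsearch rank articles by prediction strength without a dedicated numeric field per topic.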
When a native articletopic model is added to a new wiki, or the model changes significantly, ORES data has to be bulk-uploaded into Elasticsearch (since normally it's only updated for articles which are changed) with the ores_bulk_ingest.py Spark job.
In MediaWiki site configuration, the $wgCirrusSearchWMFExtraFeatures['weighted_tags']['build'] flag enables building the Elasticsearch index for the field, and $wgCirrusSearchWMFExtraFeatures['weighted_tags']['use'] enables the articletopic search keyword.
- phab:T240517 which was the epic for setting this up
- mw:ORES/Articletopic for the topic taxonomy (this is used for all wikis)
- mw:Help:CirrusSearch#Articletopic for user docs
- https://tools.wmflabs.org/ores-support-checklist/ for which wikis have a native model