Search/articletopic

CirrusSearch supports searching Wikipedia main-namespace articles by article topic (provided by the topic mapping functionality in ORES) using the articletopic: keyword. This requires loading ORES predictions into Elasticsearch. The rough data flow is:

wiki edit -> changeprop -> ORES -> EventBus -> HDFS -> Airflow / Spark (-> cross-wiki propagation) -> Elasticsearch

Most of this is real-time, but the Airflow step runs only once a week, on Sunday. This usually doesn't matter much, since most edits don't change topic predictions drastically, but new articles (or, on wikis which do not have their own ORES model, articles newly linked to Wikidata) only become searchable the following week.
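To estimate when a new article is picked up by the weekly load, the schedule can be sketched as a small date calculation. This is a minimal sketch: the function name is hypothetical, and it assumes (conservatively) that an edit made on a Sunday misses that day's run and waits for the next one.

```python
from datetime import date, timedelta

SUNDAY = 6  # date.weekday(): Monday is 0, Sunday is 6


def next_weekly_load(edit_day: date) -> date:
    """Return the first Sunday strictly after edit_day.

    Assumption: an edit made on a Sunday is not included in that
    day's run, so it waits a full week.
    """
    days_ahead = (SUNDAY - edit_day.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7
    return edit_day + timedelta(days=days_ahead)


# Hypothetical example: an article created on a Wednesday.
print(next_weekly_load(date(2024, 1, 3)))
```

The actual delay also depends on how long the Airflow jobs take to run and load into Elasticsearch, which this sketch ignores.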

articletopic: uses its own set of keywords, as the ORES labels are not search-expression-friendly; the mapping can be found in ArticleTopicFeature.php.
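As an illustration, a search using the keyword can be sent through the standard MediaWiki search API (action=query, list=search). The sketch below only builds the request URL. The helper name is hypothetical, the keywords linguistics and literature are assumed examples (check ArticleTopicFeature.php for the real list), and joining several topics with | to match any of them is likewise an assumption.

```python
from urllib.parse import urlencode


def build_articletopic_search_url(
    topics, extra_terms="", api_url="https://en.wikipedia.org/w/api.php"
):
    """Build a MediaWiki search API URL for an articletopic: query.

    topics: search keywords as defined in ArticleTopicFeature.php
            (the values used below are assumed examples).
    extra_terms: optional ordinary search terms appended to the query.
    """
    if not topics:
        raise ValueError("at least one topic keyword is required")
    expression = "articletopic:" + "|".join(topics)
    if extra_terms:
        expression += " " + extra_terms
    params = {
        "action": "query",
        "list": "search",
        "srsearch": expression,
        "srnamespace": "0",  # main namespace only
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


print(build_articletopic_search_url(["linguistics", "literature"], "grammar"))
```

The same expression can be typed directly into the wiki's search box; the API form is shown only because it is easy to script.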

Details

  • Whenever a page is edited, the EventBus extension (code) triggers changeprop, which sends data about the new revision to the ORES precache API (code, configuration). ORES then calculates predictions for a predefined set of models (configuration); for English and some other Wikipedias, this includes the articletopic model.
  • changeprop then sends these predictions to the mediawiki.revision-score EventGate stream, which stores them in the event.mediawiki_revision_score table in Hadoop.
  • The Airflow / Spark job management platform used for managing wiki search data has a job that collects this data and processes it in several steps. It deduplicates by page ID and discards predictions which are below the thresholds provided by ORES (these change dynamically to keep the precision of the predictions at a constant level). It then propagates English Wikipedia predictions, via Wikidata interwiki links, to pages on wikis which do not have their own articletopic models (this is determined automatically via the ORES API). Finally, it loads the result into the discovery.ores_articletopic Hadoop table. (code, table schema) Another Airflow job then transfers this data into Elasticsearch (code).
  • In Elasticsearch, the data is stored in the weighted_tags field (see Search/WeightedTags) under the classification.ores.articletopic tag family, as a pretend word vector. For example, if an article has a prediction of Linguistics -> 0.98, Literature -> 0.63, Elasticsearch will represent that with a field which has the word "Linguistics" 980 times and the word "Literature" 630 times. The articletopic: search keyword then does a text similarity search between this field and the topics provided. (field definition, search code)
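The batch steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the actual Spark job: all function names and record fields (wiki, page_id, rev_id, scores) are hypothetical, keeping the newest revision when deduplicating is an assumption, thresholds are treated as a single per-topic table, and the family/topic|weight string layout is assumed for illustration. xxwiki and yywiki are made-up wiki names.

```python
TAG_FAMILY = "classification.ores.articletopic"


def latest_per_page(events):
    """Deduplicate by (wiki, page_id), keeping the highest rev_id (assumed rule)."""
    latest = {}
    for event in events:
        key = (event["wiki"], event["page_id"])
        if key not in latest or event["rev_id"] > latest[key]["rev_id"]:
            latest[key] = event
    return list(latest.values())


def apply_thresholds(scores, thresholds):
    """Drop topics that have no known threshold or score below it."""
    return {
        topic: score
        for topic, score in scores.items()
        if topic in thresholds and score >= thresholds[topic]
    }


def propagate(rows, sitelinks, wikis_with_model, source_wiki="enwiki"):
    """Copy source-wiki predictions to linked pages on wikis without a model.

    sitelinks maps a source-wiki page_id to {wiki: page_id} for the same
    Wikidata item (hypothetical structure).
    """
    out = list(rows)
    for row in rows:
        if row["wiki"] != source_wiki:
            continue
        for wiki, page_id in sitelinks.get(row["page_id"], {}).items():
            if wiki == source_wiki or wiki in wikis_with_model:
                continue
            out.append({"wiki": wiki, "page_id": page_id, "scores": dict(row["scores"])})
    return out


def build_rows(events, thresholds, sitelinks, wikis_with_model):
    """Run dedupe -> threshold filter -> cross-wiki propagation."""
    rows = []
    for event in latest_per_page(events):
        kept = apply_thresholds(event["scores"], thresholds)
        if kept:
            rows.append({"wiki": event["wiki"], "page_id": event["page_id"], "scores": kept})
    return propagate(rows, sitelinks, wikis_with_model)


def to_weighted_tags(scores):
    """Encode scores as integer weights (score * 1000), e.g. 0.98 -> 980."""
    return [
        f"{TAG_FAMILY}/{topic}|{round(score * 1000)}"
        for topic, score in sorted(scores.items())
    ]


# Made-up sample data for illustration only.
events = [
    {"wiki": "enwiki", "page_id": 1, "rev_id": 10, "scores": {"Linguistics": 0.5}},
    {"wiki": "enwiki", "page_id": 1, "rev_id": 11,
     "scores": {"Linguistics": 0.98, "Literature": 0.63, "Music": 0.1}},
]
thresholds = {"Linguistics": 0.9, "Literature": 0.6, "Music": 0.5}
sitelinks = {1: {"xxwiki": 7, "yywiki": 9}}
rows = build_rows(events, thresholds, sitelinks, wikis_with_model={"enwiki", "yywiki"})
for row in rows:
    print(row["wiki"], row["page_id"], to_weighted_tags(row["scores"]))
```

Multiplying by 1000 and rounding mirrors the "980 times / 630 times" representation described above: the integer weight plays the role of a term frequency, so an ordinary text similarity query ranks pages with stronger predictions higher.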

Management

When a native articletopic model is added to a new wiki, or the model changes significantly, ORES data has to be bulk-uploaded into Elasticsearch with the ores_bulk_ingest.py Spark job, since normally it is only updated for articles that have changed.

In MediaWiki site configuration, the $wgCirrusSearchWMFExtraFeatures['weighted_tags']['build'] flag enables building the Elasticsearch index for the field, and $wgCirrusSearchWMFExtraFeatures['weighted_tags']['use'] enables the articletopic search keyword.

See also