Search/MLR Pipeline

This page is currently a draft. More information and discussion about changes to this draft can be found on the talk page.

The MLR (machine learned ranking) pipeline is a set of offline tools and processes used to train models for ranking search results with CirrusSearch on WMF wikis.

Overview

At a high level, the pipeline joins search queries with the pages users subsequently visited to generate relevance labels for a machine learning algorithm. The algorithm then produces a model, realized as a JSON file in a format understood by the LTR plugin, that can be uploaded to the production Elasticsearch cluster. The diagram below is dated: Oozie and the operator shown there have since been replaced by an Apache Airflow instance that builds models weekly.

Data preparation

Generation of query clicks

This step joins the CirrusSearch request logs with webrequest data and is managed by Airflow. The code is available in the wikimedia/discovery/analytics project, labeled query_clicks. The resulting data is available in two tables: discovery.query_clicks_hourly and discovery.query_clicks_daily.
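
The exact join logic lives in the query_clicks job of wikimedia/discovery/analytics. Purely as an illustration of the idea, the join pairs a search request with later page views of the pages it returned, for the same user, within some time window. Table names, column names and join keys below are placeholders, not the real schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative only: the real tables, columns and join keys are defined in
# the query_clicks job of wikimedia/discovery/analytics.
query_clicks = spark.sql("""
    SELECT
        c.query,
        c.hit_page_ids,
        w.page_id AS clicked_page_id,
        w.ts      AS click_ts
    FROM cirrus_requests c
    JOIN webrequest w
      ON c.identity = w.identity                       -- same user (assumption)
     AND array_contains(c.hit_page_ids, w.page_id)     -- clicked page was in the results
     AND w.ts BETWEEN c.ts AND c.ts + INTERVAL 1 HOUR  -- click shortly after the search
""")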

Training data

The training data is assembled by the mjolnir data pipeline through its CLI script data_pipeline.py.

Grouping queries

To maximize the number of labels per query, similar queries need to be grouped together. The technique uses two passes:

  • group queries using a Lucene stemmer
  • collect the top 5 results for each raw query and apply a naive clustering algorithm to split apart large groups where the stemmer was too aggressive (illustrated below)

See the code for more details.
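
As a purely conceptual sketch of the two passes (the real implementation is in mjolnir; stem and top5 below stand in for the Lucene stemmer call and the top-5 result lookup):

from collections import defaultdict

def group_queries(queries, stem, top5):
    """Conceptual two-pass grouping.

    queries: iterable of raw query strings
    stem(query): returns a normalized form (e.g. via a Lucene stemmer)
    top5(query): returns the top 5 result page ids for the raw query
    """
    # Pass 1: group queries that share a stemmed form.
    by_stem = defaultdict(list)
    for q in queries:
        by_stem[stem(q)].append(q)

    groups = []
    for members in by_stem.values():
        # Pass 2: naive single-link clustering on top-5 result overlap,
        # splitting groups where the stemmer was too aggressive.
        clusters = []
        for q in members:
            hits = set(top5(q))
            for cluster in clusters:
                if hits & cluster['hits']:
                    cluster['queries'].append(q)
                    cluster['hits'] |= hits
                    break
            else:
                clusters.append({'queries': [q], 'hits': hits})
        groups.extend(c['queries'] for c in clusters)
    return groups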

Sampling

The resulting data may be too large for the training pipeline to process, so the input data needs to be sampled. Sampling is not trivial: it must take the popularity of each query into account so the training data is not biased towards popular queries. The technique employed is to bucketize queries by popularity percentile using Spark's approxQuantile, then sample each bucket independently.

See the code for more details.
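
As a rough illustration of the approach (column names and sampling rates are made up; the real implementation is in mjolnir):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
clicks = spark.read.table('discovery.query_clicks_daily')

# Popularity = number of distinct sessions per query (column names assumed).
popularity = (clicks
    .groupBy('query')
    .agg(F.countDistinct('session_id').alias('n_sessions')))

# Bucketize queries by popularity percentile using approxQuantile.
quantiles = popularity.approxQuantile('n_sessions', [0.25, 0.5, 0.75], 0.01)
bucket = F.udf(lambda n: sum(1 for q in quantiles if n > q), IntegerType())
bucketed = popularity.withColumn('bucket', bucket('n_sessions'))

# Sample each bucket at its own (made-up) rate so that popular queries do
# not dominate the training data.
fractions = {0: 0.8, 1: 0.5, 2: 0.2, 3: 0.05}
sampled = bucketed.stat.sampleBy('bucket', fractions, seed=42)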

Labels generation

Labels are generated with the clickmodels library; the implementation used is the DbnModel described in the research paper "A Dynamic Bayesian Network Click Model for Web Search Ranking".

See the code for more details.
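
Conceptually, the trained DBN gives, for each query group and result, an attractiveness probability (the chance the result is clicked when examined) and a satisfaction probability (the chance a click ends the search); their product is used as the relevance label. A purely illustrative example with made-up numbers:

# Made-up DBN parameters for one query group; the real values come from
# training clickmodels' DbnModel on the sampled click sessions.
dbn_params = {
    'page_123': {'attractiveness': 0.60, 'satisfaction': 0.80},
    'page_456': {'attractiveness': 0.40, 'satisfaction': 0.10},
}

labels = {
    page: round(p['attractiveness'] * p['satisfaction'], 2)
    for page, p in dbn_params.items()
}
print(labels)  # {'page_123': 0.48, 'page_456': 0.04}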

Feature vectors

The next step in preparing the training data is to collect feature values for every (feature, hit) pair of every query. There are several ways to collect feature vectors with mjolnir: one can use the logging endpoint of the LTR plugin, or send individual queries directly to Elasticsearch.
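
Hedged, and with index, featureset and parameter names as placeholders, a feature-logging request against the LTR plugin looks roughly like this (the real requests are built and batched by mjolnir):

import requests

ES = 'http://localhost:9200'  # placeholder endpoint

# Per the LTR plugin docs, an sltr query placed in a bool filter scores each
# hit against a featureset without affecting ranking, and the ltr_log
# extension attaches the resulting feature values to every hit.
request = {
    'query': {
        'bool': {
            'filter': [
                {'terms': {'_id': ['12345', '67890']}},   # the hits to log features for
                {'sltr': {
                    '_name': 'logged_featureset',
                    'featureset': 'example_featureset',   # placeholder name
                    'params': {'query_string': 'some user query'},
                }},
            ]
        }
    },
    'ext': {
        'ltr_log': {
            'log_specs': {'name': 'log_entry', 'named_query': 'logged_featureset'}
        }
    },
}

hits = requests.get(ES + '/enwiki_content/_search', json=request).json()['hits']['hits']
for hit in hits:
    print(hit['_id'], hit['fields']['_ltrlog'])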

The most convenient way to train a new model with new features is to prepare a featureset with the LTR plugin. Mjolnir will then be able to collect feature vectors directly from the plugin.
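
As a sketch of what such a featureset looks like, here is a single mustache-templated feature uploaded to the plugin; field, parameter and featureset names are illustrative, not the ones used in production:

import requests

ES = 'http://localhost:9200'  # placeholder; in practice a relforge test cluster

featureset = {
    'featureset': {
        'features': [
            {
                'name': 'title_match',            # illustrative feature
                'params': ['query_string'],
                'template_language': 'mustache',
                'template': {'match': {'title': '{{query_string}}'}},
            }
        ]
    }
}

requests.put(ES + '/_ltr')  # initialize the LTR feature store if it does not exist
requests.post(ES + '/_ltr/_featureset/example_featureset', json=featureset)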

Due to firewall constraints, mjolnir running on the analytics cluster cannot directly access the relforge100x machines where test indices are usually created. Kafka is used to transfer queries from mjolnir to the relforge Elasticsearch cluster, and results follow the same path in the other direction. This is managed by the mjolnir-msearch-daemon.

See the code for more details.

Notes on the Kafka workflow

TODO: describe the daemon and the 3 topics used.

Feature selection

Initially around 250 features are collected for each labeled data point. This is trimmed to 50 features, for runtime performance reasons, using minimum redundancy maximum relevance (mRMR) feature selection via spark-infotheoretic-feature-selection.
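
The actual selection runs as a Spark job via spark-infotheoretic-feature-selection. Purely as an in-memory illustration of the mRMR criterion (greedily pick the feature with the highest mutual information with the label minus its average mutual information with the already selected features):

import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, k=50):
    """Greedy mRMR over a feature matrix X and discrete relevance labels y."""
    # Discretize continuous features so mutual information is well defined.
    Xd = np.column_stack([np.digitize(col, np.histogram_bin_edges(col, bins=10))
                          for col in X.T])
    n = Xd.shape[1]
    # Relevance: mutual information between each feature and the label.
    relevance = np.array([mutual_info_score(Xd[:, j], y) for j in range(n)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, n):
        candidates = [j for j in range(n) if j not in selected]
        # Score = relevance minus mean redundancy with selected features.
        scores = [relevance[j] - np.mean([mutual_info_score(Xd[:, j], Xd[:, s])
                                          for s in selected])
                  for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected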

Once this process is done, the training data, containing queries, labels and feature vectors, is available in HDFS.

Training

The training process is based on XGBoost and hyperopt. Models are tuned with either 3- or 5-fold cross-validation over 100-150 hyperopt rounds. The final model is then trained on all the data using the best parameters found during hyperparameter tuning.
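
This is not the actual mjolnir training code, but a minimal sketch of the idea, using synthetic data, a single validation fold instead of full cross-validation, and made-up parameter ranges:

import numpy as np
import xgboost as xgb
from hyperopt import fmin, tpe, hp, STATUS_OK

# Synthetic stand-in data: 1000 rows, 20 features, graded labels 0-10,
# grouped into "queries" of 10 results each.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 11, size=1000)

split = 800
dtrain = xgb.DMatrix(X[:split], label=y[:split])
dtrain.set_group([10] * 80)
dvalid = xgb.DMatrix(X[split:], label=y[split:])
dvalid.set_group([10] * 20)

space = {
    'eta': hp.loguniform('eta', np.log(0.01), np.log(0.3)),
    'max_depth': hp.quniform('max_depth', 3, 8, 1),
    'min_child_weight': hp.quniform('min_child_weight', 1, 10, 1),
    'subsample': hp.uniform('subsample', 0.7, 1.0),
}

def objective(params):
    params = dict(params,
                  max_depth=int(params['max_depth']),
                  min_child_weight=int(params['min_child_weight']),
                  objective='rank:ndcg',
                  eval_metric='ndcg@10')
    evals_result = {}
    xgb.train(params, dtrain, num_boost_round=100,
              evals=[(dvalid, 'valid')], evals_result=evals_result,
              verbose_eval=False)
    # hyperopt minimizes, so return the negative of the validation NDCG.
    return {'loss': -evals_result['valid']['ndcg@10'][-1], 'status': STATUS_OK}

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)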

Evaluate feature importance

TODO: notes on how to build feature importance with PAWS.

Deploy a model to production

Once a model has been trained, mjolnir will have created a JSON file that can be uploaded directly to the production clusters. A simple MediaWiki configuration patch can then switch production traffic to the new model by changing the wmgCirrusSearchMLRModel setting in wmf-config/InitialiseSettings.php.
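
In practice the upload and configuration change are handled by the search platform's deployment tooling. Purely as an illustration of the LTR plugin API that the JSON file targets, a trained XGBoost model is attached to its featureset like this (endpoint, file and names are placeholders):

import requests

ES = 'http://localhost:9200'  # placeholder for the target cluster

# The trained model as produced by the pipeline: an XGBoost dump in JSON.
with open('model.json') as f:
    definition = f.read()

model = {
    'model': {
        'name': 'example_mlr_model',  # illustrative model name
        'model': {
            'type': 'model/xgboost+json',
            'definition': definition,
        },
    }
}

# Models are attached to the featureset they were trained against.
requests.post(ES + '/_ltr/_featureset/example_featureset/_createmodel', json=model)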