Search/2013


Overview of MediaWiki search

  • The Apache Lucene project provides search capabilities in MediaWiki. The Lucene daemon, a Java program, runs identically configured on a cluster of 25 machines at our data center in Ashburn, Virginia (a.k.a. eqiad), with a similar cluster at the data center in Tampa, Florida (a.k.a. pmtpa) serving as a hot standby for failover.
  • One machine on each cluster is dedicated to index generation and the rest to servicing search queries.
  • Each search server listens on (configurable) port 8123 for search queries. The indexer listens on port 8321 for a small set of commands it supports (such as snapshot generation, status queries, etc.)
  • Clustering uses LVS (Linux Virtual Server); further details about that tool are at: http://www.linuxvirtualserver.org/whatis.html
  • Each server is configured automatically using Puppet; the Puppet code can be cloned from (replace xyz with your user name):
   ssh://xyz@gerrit.wikimedia.org:29418/operations/puppet.git

The config files are under templates/lucene; similarly, LVS clustering is configured via Puppet using files under templates/lvs.

  • The status of the various servers can be seen at: http://ganglia.wikimedia.org/latest/. From the Choose source dropdown, select Search eqiad. Click on the Physical View button at top right to see details like the amount of RAM, number of cores, etc.
  • The MediaWiki extension MWSearch (PHP code) receives search queries and routes them to a search server.
  • The file operations/mediawiki-config/wmf-config/lucene.php defines a number of globals to configure search; these include the port number, LVS cluster IP addresses, timeout, cache-expiry, etc. The main search file extensions/MWSearch/MWSearch.php is also require'd here.
  • The single server named searchidx1001 is the indexer: indexes are generated there but it does not service search requests. Every morning it runs the IncrementalUpdater to retrieve page updates from the OAI MediaWiki extension and puts them in a new index. However, these new indexes are distributed to the search servers only if a snapshot is created; this happens when the command curl http://localhost:8321/snapshot is run as a periodic cron job. This request is handled by a separate thread running HTTPIndexDaemon.
  • The searchers periodically query the indexer (via UpdateThread) for new snapshots of the indexes they host and pull them over with rsync. This step generates a lot of network traffic and can cause RMI errors in the log files of some servers.
  • Indexes are sharded across the different servers using namespaces; in some cases, if a single namespace index is still very large, it is split further into parts which may reside on different machines.
  • The code supports many index types, some of which (titles, for example) are currently not enabled, presumably due to hardware resource constraints; they are disabled by mapping them in the config file to a non-existent host whose name is given by the Search.nullHost property (currently search1000x).


Troubleshooting

  • The log files on the indexer and searchers are an invaluable resource for diagnosing issues. The log level is currently set to INFO but can be lowered to TRACE by the ops folk to dump more detail for a day or two.
  • Often, for search issues, the first question to answer is: which searcher hosts the index in question? The global configuration file (currently lsearch-global-2.1.conf) defines the distribution of indexes, shards and parts over searchers, but that file has an odd format that is difficult to decipher. It is parsed by the singleton class GlobalConfiguration and the results are stored in the singleton object in a variety of data structures, especially in a large map of over 8000 IndexId objects -- one per index, shard or part. We now have debug code that dumps all these data structures to a file (currently /var/tmp/global.txt) in human-readable form upon initialization. As of this writing, this debug code has been merged but not yet deployed. Once deployed, this file should make it easy to determine the internal properties of any index, such as its location, the set of searchers that might search it, etc.
  • Another, somewhat fragile, step in initializing the singleton GlobalConfiguration object is an attempt to parse the InitialiseSettings.php file using regular expressions; it would be better to recast this step so that a PHP script generates an intermediate config file in some easily parsed format (JSON or YAML, for example), which is then read in Java with a proper parser for that format (a minimal sketch of the Java side of such an approach appears after this list).
  • The mapping of unsupported indexes to a non-existent server discussed above used to generate a large number of exceptions and associated stack traces in the log files; many of these messages have now been suppressed, but some machines (notably search1015 and other hosts in pool 4) continue to see these errors because the last patch, though merged, has not yet been deployed. Once deployed, the log files on those machines should be a lot cleaner, making it easier to identify messages of consequence.
  • Every search server supports a /status query; the index server supports /getStatus. These queries currently report assorted errors that remain to be diagnosed and fixed (for example, the Japanese spelling indexes seem to be missing).
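
A minimal sketch of the Java side of such a JSON handoff appears below. It assumes a hypothetical export file (/var/tmp/initialise-settings.json, produced by a PHP script that does not exist yet) and uses the Jackson library purely for illustration; none of this is in the current codebase.

    // Illustrative only: read per-wiki settings exported to JSON by a (hypothetical) PHP
    // script, instead of parsing InitialiseSettings.php with regular expressions.
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.File;
    import java.io.IOException;
    import java.util.Iterator;

    public class SettingsImport {
        public static void main(String[] args) throws IOException {
            // File path and key name are assumptions for the sketch, not the real layout.
            JsonNode root = new ObjectMapper().readTree(new File("/var/tmp/initialise-settings.json"));
            JsonNode langs = root.path("wgLanguageCode");
            for (Iterator<String> it = langs.fieldNames(); it.hasNext(); ) {
                String dbname = it.next();
                System.out.println(dbname + " -> " + langs.get(dbname).asText());
            }
        }
    }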


Notes on Lucene

There is a great deal of documentation available on the net; specific notes that are helpful to understand our implementation appear below.

  • The current version of Lucene from Apache is 4.1.0 but we are still using 2.9.4; you can retrieve the version programmatically with:
     LucenePackage.get().getImplementationVersion();

The file formats seem to have their own version track; there is no API call to retrieve this version.

  • A Lucene index is not a single file but rather a collection of binary files in a directory.
  • A Lucene index is a collection of segments; segments start as separate files but can get merged in a process known as optimization.
  • There are two tuning parameters that can be specified in the global configuration file (see the sketch after this list):
    • maxBufferedDocs: Specifies the max number of docs held in memory; when this limit is reached, the docs are written out to a segment file. Default is 10.
    • mergeFactor: Specifies the base for logarithmic merging of segment files, e.g. when 10 segment files, each with 10 documents, have been written, they get merged into 1 file with 100 documents; this continues until 10 files, each with 100 documents, have been created at which point they get merged into 1 file with 1000 documents. Default is 10.
  • An index may or may not be optimized. Optimization can be expensive since it involves consolidating multiple segment files into one, removing deleted documents, etc. so it is typically only done in situations where the index is expected to be largely static.
  • In addition to a normal index, we create additional indexes:
    • highlight -- Highlights search terms in snippet
    • spell -- Used to detect spelling errors and show corrected form with Did you mean?.
    • links -- URLs in page
    • related -- ??
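
To make the two tuning parameters (maxBufferedDocs and mergeFactor) and the optimization step described above more concrete, here is a minimal, self-contained sketch against the Lucene 2.9 API we use; it is not code from lucene-search-2, and the index path and analyzer choice are arbitrary:

    // Illustrative only: how maxBufferedDocs, mergeFactor and optimize() are applied
    // to a plain Lucene 2.9 IndexWriter.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import java.io.File;

    public class TuningDemo {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/var/tmp/demo-index")), // an index is a directory of segment files
                    new StandardAnalyzer(Version.LUCENE_29),
                    true,                                              // create a new index
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setMaxBufferedDocs(10); // flush buffered docs to a new segment every 10 documents
            writer.setMergeFactor(10);     // merge 10 same-sized segments into one larger segment
            for (int i = 0; i < 1000; i++) {
                Document doc = new Document();
                doc.add(new Field("title", "page " + i, Field.Store.YES, Field.Index.ANALYZED));
                writer.addDocument(doc);
            }
            writer.optimize();             // expensive: consolidate all segments into one
            writer.close();
        }
    }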


Socket parameters

We've periodically had issues with some socket timeout parameters. Timeouts are set in various places:

  • The lsearchd shell script sets sun.rmi.transport.tcp.handshakeTimeout to 10000 ms on the command line.
  • Java classes CustomSocketFactory, Configuration, and GlobalConfiguration
  • The configuration files lsearch.conf and lsearch-global-2.1.conf.

CustomSocketFactory retrieves rmiReadTimeout from Configuration (with a default of 7200 s, i.e. 2 hours) and uses that to set the socket's soTimeout.
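
The general mechanism looks like the sketch below: an RMI socket factory that applies a read timeout to every socket it creates. This is an illustration of the technique only, not the actual CustomSocketFactory code; the 7200-second figure is the default quoted above.

    // Illustrative only: give every RMI client socket an SO_TIMEOUT, analogous to what
    // CustomSocketFactory does with the rmiReadTimeout setting from Configuration.
    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.rmi.server.RMISocketFactory;

    public class TimeoutSocketFactory extends RMISocketFactory {
        private final int readTimeoutMillis;

        public TimeoutSocketFactory(int readTimeoutSeconds) {
            this.readTimeoutMillis = readTimeoutSeconds * 1000;
        }

        @Override
        public Socket createSocket(String host, int port) throws IOException {
            Socket s = new Socket(host, port);
            s.setSoTimeout(readTimeoutMillis); // a blocked read fails after this long
            return s;
        }

        @Override
        public ServerSocket createServerSocket(int port) throws IOException {
            return new ServerSocket(port);
        }

        public static void main(String[] args) throws IOException {
            // 7200 s (2 h) mirrors the rmiReadTimeout default mentioned above.
            RMISocketFactory.setSocketFactory(new TimeoutSocketFactory(7200));
        }
    }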

Some of these timeout parameters are documented at http://docs.oracle.com/javase/6/docs/technotes/guides/rmi/sunrmiproperties.html (older version: http://docs.oracle.com/javase/1.4.2/docs/guide/rmi/sunrmiproperties.html).


Search details (PHP)

Some important classes and the files defining them:

   Class                   File
   SpecialSearch           core/includes/specials/SpecialSearch.php
   SearchEngine            core/includes/Search.php
   LuceneSearch            extensions/MWSearch/MWSearch_body.php
   LuceneSearchResult      extensions/MWSearch/MWSearch_body.php
   LuceneSearchSet         extensions/MWSearch/MWSearch_body.php
   ApiQuerySearch          core/includes/api/ApiQuerySearch.php
   ApiQueryGeneratorBase   core/includes/api/ApiQueryBase.php
   ApiQueryBase            core/includes/api/ApiQueryBase.php
   ApiBase                 core/includes/api/ApiBase.php
   ContextSource           core/includes/context/ContextSource.php
   IContextSource          core/includes/context/IContextSource.php
   Http                    core/includes/HttpFunctions.php
   MWHttpRequest           core/includes/HttpFunctions.php
   CurlHttpRequest         core/includes/HttpFunctions.php

For normal search, SearchEngine (along with the derived class LuceneSearch) is the top-level class dealing with search; the static function SearchEngine::create() instantiates a new LuceneSearch object and returns it. This is used in SpecialSearch::getSearchEngine().

For the web API, ApiQuerySearch seems to be the main class handling search requests. Its inheritance hierarchy looks like this: ApiQuerySearch → ApiQueryGeneratorBase → ApiQueryBase → ApiBase → ContextSource/IContextSource. ApiQuerySearch::run() starts query processing.

  • In both cases, LuceneSearch::searchText() is invoked, which simply returns the result of invoking LuceneSearchSet::newFromQuery().
  • That routine does the following:
    • Creates the search URL like this: $searchUrl = "http://$host:$wgLucenePort/$method/$wgDBname/$enctext?", to which a few parameters (namespaces, etc.) are appended.
    • Invokes Http::get(), which invokes MWHttpRequest::factory() to get a new request object (probably a CurlHttpRequest object) and invokes execute() on it.
    • That method uses the native PHP functions curl_init(), curl_setopt(), curl_exec(), and curl_close() to make the HTTP call to the Java engine; the results are saved in the request object.
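
The same kind of request can be reproduced directly against a search daemon for debugging. The sketch below uses plain HttpURLConnection; the host, wiki name, query and extra parameter are placeholders, and only the URL shape mirrors the $searchUrl built above:

    // Illustrative only: fetch a search result the same way MWSearch does, i.e. an HTTP GET
    // against http://<host>:8123/search/<dbname>/<query>.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class SearchQueryDemo {
        public static void main(String[] args) throws Exception {
            String host = "search-host";                                // placeholder
            String query = URLEncoder.encode("hello world", "UTF-8");
            URL url = new URL("http://" + host + ":8123/search/my_wiki/" + query + "?namespaces=0");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(15000);
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                System.out.println(line);                               // results in the daemon's default format
            }
            in.close();
        }
    }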

Search details (Java)

Most of the code is in subdirectories of src/org/wikimedia/lsearch/. The main class dealing with search itself is search/SearchServer.java; classes interfacing with PHP are in frontend, those dealing with networking are in interoperability, and the main entry point is config/StartupManager.java.

Some important classes are described below.

StartupManager
Performs these steps:
  1. Get local and global configurations and retrieve various parameters (language codes, localization data, etc.)
  2. Invoke static methods createRegistry() and bindRMIObjects() in RMIServer (see below for more on this class).
  3. If this is an indexer machine, start a new HTTPIndexServer [default] or RPCIndexServer [apparently no longer used; see below].
  4. If this is a search machine:
    • Start new SearchServer.
    • Create singleton SearcherCache.
    • Start singleton threads UpdateThread and NetworkStatusThread.
HttpHandler
This is an abstract class (with processRequest() the only abstract method) that extends Thread; it is extended by HTTPIndexDaemon (handles index update requests) and SearchDaemon (handles search requests).
SearchDaemon
Extends HttpHandler; one of these is created for each incoming search request and run by the thread-pool in SearchServer (see below). Provides a definition of processRequest() which does the following:
  1. If non-search request (e.g. /robots.txt, /stats, /status), return relevant data.
  2. Otherwise:
    • Create new SearchEngine (top-level search class) and invoke it to get search results.
    • Return results in one of 3 formats: Standard, JSON, or OPENSEARCH.
HTTPIndexDaemon
Similar to SearchDaemon (above); extends HttpHandler; one of these is created for each incoming index request and run by the thread-pool in HTTPIndexServer (see below). Provides a definition of processRequest() that handles the indexer's maintenance commands (such as the /snapshot and /getStatus requests mentioned above).
SearchServer
Extends Thread. Though not defined as a singleton, appears to be so in practice. Started by StartupManager (see above). Does the following:
  1. Create Statistics and StatisticsThread objects to supply stats to Ganglia.
  2. Create thread-pool of maxThreads [default: 80] threads.
  3. Listen on ServerSocket [default port: 8123]; when a connection is made, create a new SearchDaemon object and run it in the pool if the pool is not full. If the pool is full, log an error and simply close the socket! [?NOTE There may be an off-by-one error in the check to see if the pool is full.]
HTTPIndexServer
Similar to SearchServer above; extends Thread. Though not defined as a singleton, appears to be so in practice. Started by StartupManager. Does the following:
  1. Create thread-pool of 25 (hardcoded) threads.
  2. Listen on ServerSocket [default port: 8321]; when a connection is made, create a new HTTPIndexDaemon object and run it in the pool if the pool is not full. If the pool is full, log an error and simply close the socket! [?NOTE There may be an off-by-one error in the check to see if the pool is full. There is also a potential issue if both servers are run in the same Java process: the count of open requests is a static member of the common base class, so it becomes a combined count of both search and index requests even though the thread pools are separate.]
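
The accept-loop pattern the two servers share looks roughly like the sketch below. It is an illustration only (not the lucene-search-2 code); the pool-full check is written here in its intended form, and the port and thread limit are the SearchServer defaults quoted above.

    // Illustrative only: accept connections, hand each one to a fixed-size pool,
    // and log-and-drop the connection when the pool is already busy.
    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicInteger;

    public class AcceptLoopSketch {
        private static final int MAX_THREADS = 80;      // SearchServer default (see above)
        private static final AtomicInteger open = new AtomicInteger();

        public static void main(String[] args) throws IOException {
            ExecutorService pool = Executors.newFixedThreadPool(MAX_THREADS);
            ServerSocket server = new ServerSocket(8123);
            while (true) {
                final Socket client = server.accept();
                if (open.get() >= MAX_THREADS) {         // pool full: log an error and close the socket
                    System.err.println("pool full, dropping connection from " + client.getInetAddress());
                    client.close();
                    continue;
                }
                open.incrementAndGet();
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            // a real daemon would run SearchDaemon/HTTPIndexDaemon.processRequest() here
                            client.close();
                        } catch (IOException ignored) {
                        } finally {
                            open.decrementAndGet();
                        }
                    }
                });
            }
        }
    }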
IndexDaemon
Simple class that functions as an interface adapter, presenting a much simpler interface to clients of the somewhat complex IndexThread class. It is not clear why this is done via a concrete class rather than an interface implemented by IndexThread.
HttpMonitor
Coming soon
RPCIndexDaemon
No longer used.
RPCIndexServer
No longer used.

Installing MediaWiki and lucene-search-2 for debugging

These instructions are targeted at developers who want to set up an instance of MediaWiki and the Lucene-based search functionality for testing and debugging; the intent here is not to set up a production system.

Details on how to install MediaWiki are at: http://www.mediawiki.org/wiki/Installation. A summary appears below along with some additional details.

  • Download and unpack the MediaWiki tarball (mediawiki-1.20.2 is used here), renaming the unpacked directory to core:
      cd ~/src
      tar xvf mediawiki-1.20.2.tar.gz
      mv mediawiki-1.20.2 core
  • Install prerequisites (if you prefer MySQL to SQLite3, replace the sqlite packages below with the corresponding MySQL packages: mysql-server, php5-mysql):
    list="php5 php5-curl php5-sqlite sqlite3 apache2 git default-jdk ant debhelper javahelper"
    list="$list liblog4j1.2-java libcommons-logging-java libslf4j-java "
    sudo apt-get install $list
  • Check out the MWSearch extension from the git repository, e.g.:
    cd ~/src
    mkdir extensions; cd extensions
    git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/MWSearch.git
  • Make sure the apache/php combo is working by creating a file named info.php at /var/www containing:
   <?php phpinfo(); ?>

Now point your browser (or use wget/curl to fetch the page) at http://localhost/info.php; you should see lots of tables with PHP configuration info (replace localhost with appropriate host name or IP if necessary).

  • Create a data directory somewhere and make it world read/write (this is to allow apache to create an sqlite DB file under it). Also make sure that all the directories from your home directory to the MediaWiki root are world readable and searchable (otherwise you'll get errors from Apache as it searches for .htaccess files), e.g.
    mkdir ~/data; chmod 777 ~/data
    chmod 755 ~ ~/src ~/src/core
  • Reconfigure apache by editing /etc/apache2/sites-available/default to remove unnecessary stuff and also set DocumentRoot to point to the freshly unpacked MediaWiki root above. Something close to this should work (replace xyz by a proper user name):
   <VirtualHost *:80>
       ServerAdmin webmaster@localhost
       DocumentRoot /home/xyz/src/core
       Alias /extensions /home/xyz/src/extensions
       ErrorLog ${APACHE_LOG_DIR}/error.log
       LogLevel warn
       CustomLog ${APACHE_LOG_DIR}/access.log combined
       php_admin_flag engine on
       <Directory /home/xyz/src/core/images>
           php_admin_flag engine off
       </Directory>
       <Directory /home/xyz>
           AllowOverride All
       </Directory>
   </VirtualHost>
  • Reload the apache configuration with:
  sudo /etc/init.d/apache2 reload
  • Now point your browser at your wiki (e.g. http://localhost/index.php); you should see a message like:
   LocalSettings.php not found.
   Please set up the wiki first.

Follow the on-screen instructions to configure MediaWiki; at the end you'll be prompted to download the generated LocalSettings.php file and place it in the MediaWiki root directory to complete the configuration.

  • The previous step can also be done from the commandline, e.g.
      php core/maintenance/install.php --help

Documentation on the various parameters is at: http://www.mediawiki.org/wiki/Manual:Config_script. A sample invocation for a MySQL install might look like this (change dbxyz and DbXyzPass to a suitable user name and password; likewise, wiki_admin and WikiAdminPass to suitable values for the wiki administrator; also change RootPass to the password for the root user of your MySQL installation) [?What's the difference between dbuser and admin user?]:

   #!/bin/bash
   # install MediaWiki from commandline
   opt='--dbtype mysql '
   opt+='--dbuser dbxyz '
   opt+='--dbpass DbXyzPass '
   opt+='--installdbuser root '
   opt+='--installdbpass RootPass '
   opt+='--pass WikiAdminPass'
   php maintenance/install.php $opt my_wiki wiki_admin

This will generate a new LocalSettings.php file, create a new database named my_wiki, and create a number of tables within it; the user table will have a row for wiki_admin. You can edit the generated LocalSettings.php file manually to add additional configuration options as needed; for example, some of these may be useful (replace xyzhost with your hostname):

      require( "$IP/../extensions/MWSearch/MWSearch.php" );
      $wgLuceneHost = 'xyzhost';
      $wgLucenePort = 8123;
      $wgLuceneSearchVersion = '2.1';
      $wgLuceneUseRelated = true;
      $wgEnableLucenePrefixSearch = false;
      $wgSearchType = 'LuceneSearch';
  • Check out lucene-search-2:
     cd ~/src
     git clone https://gerrit.wikimedia.org/r/operations/debs/lucene-search-2.git

There is a top-level README.txt file that describes how to build it; we summarize the steps below.

  • Run ant to build everything; the result should be a local file named LuceneSearch.jar:
     cd lucene-search-2; ant
  • The README.txt file mentions running the configure script but that script is missing from the git checkout. Create it with the following contents:
     #!/bin/bash
     dir=`cd $1; pwd`
     java -cp LuceneSearch.jar org.wikimedia.lsearch.util.Configure $dir

Now run it with the full path to the MediaWiki root directory as an argument, e.g.:

     bash configure ~/src/core

It will examine your MediaWiki configuration and generate these matching configuration files for search:

     lsearch.log4j  lsearch-global.conf  lsearch.conf  config.inc
  • The generated lsearch.log4j uses ScribeAppender, which requires installation of additional packages (without them you'll get Java exceptions when you run the lsearchd daemon); one way to get around this is to remove those references and use a RollingFileAppender:
       log4j.rootLogger=INFO, R
       log4j.appender.R=org.apache.log4j.RollingFileAppender
       log4j.appender.R.File=logs/test.log
       log4j.appender.R.MaxFileSize=10MB
       log4j.appender.R.MaxBackupIndex=2
       log4j.appender.R.layout=org.apache.log4j.PatternLayout
       log4j.appender.R.layout.ConversionPattern=%d{ISO8601} %-5p %c %m%n
       log4j.logger.org.wikimedia.lsearch.interoperability=DEBUG
  • Now get an XML dump (replace /var/tmp with a different location if you prefer; the path to the dump file as well as the name of the file itself may change over time):
     pushd /var/tmp
     file='simplewiktionary-20130113-pages-meta-current.xml.bz2'
     wget http://dumps.wikimedia.org/simplewiktionary/20130113/$file
     popd

and build Lucene indexes from it (the last argument is the name of your wiki as defined in LocalSettings.php by the $wgDBname global variable):

     java -cp LuceneSearch.jar org.wikimedia.lsearch.importer.BuildAll /var/tmp/$file my_wiki

This last command is equivalent to running the build script mentioned in README.txt; it creates a new directory named indexes and a number of directories and index files under it. For the dump file mentioned above, it should take around 5 minutes to complete on a modern machine.
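
If you want to sanity-check an index before starting the daemon, a few lines of Lucene 2.9 code can report its document count. The path argument below is whichever subdirectory BuildAll created under indexes/ (the exact layout is not described here), and the class must be compiled against the same Lucene 2.9 jar used by the build:

    // Illustrative only: open a generated index read-only and print basic statistics.
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    import java.io.File;

    public class IndexCheck {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true); // true = read-only
            System.out.println("documents: " + reader.numDocs()
                    + ", deleted: " + (reader.maxDoc() - reader.numDocs()));
            reader.close();
        }
    }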

  • Finally, you can run the search daemon:
     ./lsearchd &

It listens for search queries on port 8123, so you can test it like this:

       wget http://localhost:8123/search/my_wiki/hello

Logs can be found under the logs directory.

References

These links have useful info about search:

  1. mw:Extension:Lucene-search
  2. Lucene
  3. mw:User:Rainman/search internals