Search

From Wikitech
Jump to: navigation, search

Looking for lsearchd? That's been moved

Contents

In case of emergency

Here is the (sorted) emergency contact list for Search issues on the WMF production cluster:

  • David Causse (dcausse)
  • Erik Bernhardson (ebernhardson)
  • Guillaume Lederrey (gehel)

You can get contact information from officewiki.

Overview

This system has three components: Elastica, CirrusSearch, and Elasticsearch.

Elastica

Elastica is a MediaWiki extension that provides the library to interface with Elasticsearch. It wraps the Elastica library. It has no configuration.

CirrusSearch

CirrusSearch is a MediaWiki extension that provides search support backed by Elasticsearch.

Turning on/off

  • $wmgUseCirrus will set the wiki to use CirrusSearch by default for all searches. Doing so would break searches and search updates. You probably don't want to do this.

Configuration

The canonical location of the configuration documentation is in the CirrusSearch.php file in the extension source. It also contains the defaults. A pool counter configuration example lives in the README in the extension source.

WMF runs the default configuration except for elasticsearch server locations but overrides would live in the CirrusSearch-{common|production|labs}.php files in the mediawiki-config files.

Elasticsearch

Elasticsearch is a multi-node Lucene implementation. No individual node is a single point of failure.

Install

To install Elasticsearch on a node, use role::elasticsearch::server in puppet. This will install the elasticsearch Debian package and setup the service and appropriate monitoring. It should automatically join the cluster.

Configuration

role::elasticsearch::server has no configuration.

In labs a config parameter is required. It is called elasticsearch::cluster_name and names the cluster that the machine should join.

role::elasticsearch::server use ::elasticsearch to install elasticsearch with the following configuration:

Environment Memory Cluster Name Multicast Group Servers
labs 2G configured 224.2.2.4
beta 4G beta-search 224.2.2.4 deployment-elastic0[5678]
eqiad 30G production-search-eqiad 224.2.2.5 elastic10([012][0-9]*or*3[01]) (eqiad)
codfw 30G production-search-codfw 224.2.2.7 elastic20(0[0-9|1[0-9]|2[0-4])

Note that the multicast group and cluster name are different in different production sites. This is because the network is flat from the perspective of Elasticsearch's multicast and it is way easier to reserve separate IPs than it is to keep up with multicast ttl.

Files
/etc/default/elasticsearch Simple settings that must be initialized either on the command line (memory) or really early (memlockall) in execution
/etc/elasticsearch/logging.yml Logging configuration
/etc/logrotate.d/elasticsearch Log rolling managed by logrotate
/etc/elasticsearch/elasticsearch.yml Everything not covered above, e.g. cluster name, multicast group
Puppet

The elasticsearch cluster is provisioned via the elasticsearch module of the operations/puppet repository. Configuration for specific machines is in the hieradata folder of the same operations/puppet repository. A few of the relevant files are linked here, but a grep for 'elasticsearch' in hieradata and poking around the elasticsearch puppet module are the best ways to understand the current configuration.

Provisioning

Configuration

  • Common configuration for all elasticsearch servers (this includes the separate elasticsearch cluster used with the ELK stack, not covered in this document).
  • Common configuration for all eqiad elasticsearch servers.
  • regex.yaml - Contains rack identifiers for each elasticsearch server. This is fed into elasticsearch.yml for rack aware shard distribution.
  • Master eligibility is applied to individual selected hosts.
  • A new elasticsearch server will serve queries as soon as it joins the cluster. It needs to be added to conftool-data to be added to LVS endpoint and receive direct traffic.

SSL certificates

  • Elasticsearch exposes HTTPS endpoints via nginx. Puppet CA is used for certificates. Certificates have to be regenerated after the first installation to ensure they contain the correct SAN entry. The procedure is documented on Puppet CA page.

WMF Production setup

Hardware

The WMF, as of September 2015, operates a single elasticsearch cluster which serves all search requests for WikiMedia sites. This is hosted on elastic{01..31}.eqiad.wmnet.

Up to date rack placement can be found in ops/puppet here

servers cores memory disk
elastic1017-1031 32 128G 2x300GB SSD RAID0 *
elastic1032-1052 32 128G 2x800GB SSD RAID1
elastic2001-2024 32 128G 2x800GB SSD RAID1
  • Software RAID by partition is raid1 for the OS but elastic data will not survive a disk loss
Data distribution

As far as elasticsearch is concerned all of these servers are exactly the same. The hardware within the machines is not taken into account by the shard allocation algorithms. This does lead to some issues where the older machines have a higher load than the newer machines, but fixing it is not yet supported by elasticsearch.

Our largest user of elasticsearch resources is, by far, queries to enwiki. The data for enwiki is split into 7 shards with 1 master plus 3 replica. This means that each individual enwiki query is answered by 7 machines and at any given moment 28 (7*4) of the 31 machines in the cluster are serving enwiki queries. Other popular wikis are included in the table below(wikis can be listed twice, there is a 'content' index and a separate 'everything' index for each wiki):

wiki primaries replicas total
itwiki 9 2 27
dewiki, enwiki 7 3 28
zhwiki, ukwiki, svwiki, nlwiki, frwiki, wikidatawiki, frwikisource, eswiki, enwikisource, ruwiki, itwiki, plwiki, jawiki, ptwiki 7 2 21
viwiki, jawiki, eswiki 6 2 18
arwiki, cawiki, commonswiki, enwiktionary, fawiki, ptwiki, ruwiki, zhwiki zhwikisource 5 2 15
Many many more wikis served by < 15 machines
Load balancing requests

MediaWiki talks to the elasticsearch cluster through LVS. Application servers talk to a single LVS ip address and this balances the requests out across the cluster. The read and write requests are distributed evenly across all available elasticsearch instances with no consideration for data locality.

Shard balance across the cluster

The production configuration sets index.routing.allocation.total_shards_per_node to 1 for all indexes. This means that each server will only contain a single shard for any given index. This combined with setting the number of shards and replicas to an appropriate number ensure that the indices with heavy query volume are spread out across the entire cluster.

The elasticsearch shard distribution algorithm is relatively naive with respect to our use case. It is optimized for having shards of approximately equal size and query volume in all indexes. The WMF use case is very different. As of September 2015 there are 6419 shards with approximately the following size breakdown:

size num shards percent
x > 50gb 33 0.5%
50gb > x > 10gb 56 0.8%
10gb > x > 1gb 1307 20.3%
1gb > x > 100mb 1114 17.3%
100mb > x 3909 60.8%

Due to this load across the cluster needs to be occasionally monitored and shards moved around. This is further discussed in #Trouble.

Indexing

CirrusSearch updates the elasticsearch index by building and upserting almost the entire document on every edit. The revision id of the edit is used as the elasticsearch version number to ensure out-of-order writes by the job queue have no effect on the index correctness. There is, as of September 2015, only one property that is not upserted as normal on every edit. The CirrusSearch\OtherIndex functionality which adds information to the commonswiki index about duplicate files only updates the commonswiki index when the duplicated file, on the other wiki, is indexed.

Realtime updates

The CirrusSearch extension updates the elasticsearch indexes for each and every mediawiki edit. The chain of events between a user clicking the 'save page' button and elasticsearch being updated is roughly as follows:

  • MW core approves of the edit and inserts the LinksUpdate object into DeferredUpdates
  • DeferredUpdates runs the LinksUpdate in the web request process, but after closing the connection to the user (so no extra delays).
  • When LinksUpdate completes it runs a LinksUpdateComplete hook which CirrusSearch listens for. In response to this hook CirrusSearch inserts CirrusSearch\Job\LinksUpdate for this page into the job queue (backed by redis in wmf prod).
  • The CirrusSearch\Job\LinksUpdate job runs CirrusSearch\Updater::updateFromTitle() to re-build the document that represents this page in elasticsearch. For each wikilink that was added or removed this inserts CirrusSearch\Job\IncomingLinkCount to the job queue.
  • The CirrusSearch\Job\IncomingLinkCount job runs CirrusSearch\Updater::updateLinkedArticles() for the title that was added or removed.

Other processes that write to elasticsearch (such as page deletion) are similar. All writes to elasticsearch are funneled through the CirrusSearch\Updater class, but this class does not directly perform writes to the elasticsearch database. This class performs all the necessary calculations and then creates the CirrusSearch\Job\ElasticaWrite job to actually make the request to elasticsearch. When the job is run it creates CirrusSearch\DataSender which transforms the job parameters into the full request and issues it. This is done so that any updates that fail (network errors, cluster maintenance, etc) can be re-inserted into the job queue and executed at a later time without having to re-do all the heavy calculations of what actually needs to change.

Batch updates from the database

CirrusSearch indices can also be populated from the database to bootstrap brand new clusters or to backfill existing indices for periods of time where updates, for whatever reason, were not written to the elasticsearch cluster. These updates are performed with the forceSearchIndex.php maintenance script, the usage of which is described in multiple parts of the #Adminstration section.

Batch updates use a custom job type, the CirrusSearch\Job\MassIndex job. The main script iterates the entire page table and inserts jobs in batches of 10 titles. The MassIndex job kicks off CirrusSearch\Updater::updateFromPages() to perform the actual updates. This is the same process as CirrusSearch\Updater::updateFromTitle, updateFromTitle simply does a couple extra checks around redirect handling that is unnecessary here.

Job queue

CirrusSearch uses the mediawiki job queue for all operations that write to the indices. The jobs can be roughly split into a few groups, as follows:

Primary update jobs

These are triggered by the actions of either users or adminstrators.

  • DeletePages - Removes titles from the search index when they have been deleted.
  • LinksUpdate - Updates a page after it has been edited.
  • MassIndex - Used by the forceSearchIndex.php maintenance script to distribute indexing load across the job runners.
Secondary update jobs

These are triggered by primary update jobs to update pages or indices other than the main document.

  • OtherIndex - Updates the commonswiki index with information about file uploads to all wikis to prevent showing the user duplicate uploads.
  • IncomingLinkCount - Triggers against the linked page when a link is added or removed from a page. Updates the list of links coming into a page from other pages on the same wiki. This information is used as part of the rescoring algorithms. While this always sends the full list to elasticsearch, our custom search-extra plugin for elasticsearch adds functionality that only performs the update if there is more than a 20% change in the content of the incoming links field. While only one field is being updated elasticsearch docs are immutable, it will internally delete the old document and re-index the new version. The 20% change optimization prevents the cluster from being overloaded with writes when a popular template edit triggers link changes in significant numbers of pages.
Backend write jobs

These are triggered by primary and secondary update jobs to represent an individual write request to the cluster. These are typically run in-process from another job and not enqueued. Queue'd jobs of this type mean that writes are not making it to the elasticsearch cluster.

  • ElasticaWrite

Logging

CirrusSearch logs a wide variety of data. A few logs in particular are of interest:

  • The kibana logging dashboard for CirrusSearch contains all of the low-volume logging.
  • CirrusSearchRequests - A textual log line per request from mediawiki to elasticsearch plus a json payload of information. Logged via udp2log to mwlog1001.eqiad.wmnet. This is generally 1500-3000 log lines per second. Can be turned off by setting $wgCirrusSearchLogElasticRequests = false.
  • CirrusSearchRequestSet - This is a replacement for CirrusSearchRequests and is batched together at the php execution level. This simplifies the job of user analysis as they can look at the sum of what we did for a users requests, rather than the individual pieces. This is logged from mediawiki to the mediawiki_CirrusSearchRequests kafka topic. Can be turned off by setting $wgCirrusSearchSampleRequestSetLog = 0.

Administration

All of our (CirrusSearch's) scripts have been designated to run on terbium.eqiad.wmnet.

Adding new wikis

All wikis now get Cirrus by default as primary. They have to opt out in InitializeSettings.php if they only want Cirrus as a secondary search option and want to use lsearchd instead. To add a new Cirrus wiki:

  1. Estimate the number of shards required (one, the default, is fine for new wikis).
  2. Create the search index
    1. addWiki.php should do this automatically for new wikis
    2. There is a chance that this will timeout and fail to create the indexes. The only fix for now is to custom hack a longer timeout into the Elastica library. This is tracked at T107348
  3. Populate the search index

Estimating the number of shards required

If the wiki is small then skip this step. The defaults are good for small wikis. If it isn't, use one of the three methods below to come up with numbers and add them to mediawiki-config's InitializeSettings.php file under wmgCirrusSearchShardCount.

See #Resharding if you are correcting a shard count error.

If it hasn't been indexed yet

You should compare its content and general page counts with a wiki with similarly sized pages. Mostly this means a wiki in a different language of the same type. Enwikisource and frwikisource are similar, for example. Wiktionaries pages are very small. As are wikidatawiki's. In any case pick a number proportional to the number for the other wiki. Don't worry too much about being wrong - resharding is as simple as changing the value in the config and performing an in place reindex. To count the pages I log in to tools-login.pmtpa.wmflabs then

sql $wiki
 SELECT COUNT(*) FROM page WHERE page_namespace = 0;
 SELECT COUNT(*) FROM page WHERE page_namespace != 0;
sql $comparableWiki
 SELECT COUNT(*) FROM page WHERE page_namespace = 0;
 SELECT COUNT(*) FROM page WHERE page_namespace != 0;

Wiki's who's indexes are split other ways then content/general should use the same methodology but be sure to do it once per index and use the right namespaces for counts.

If it has been indexed

You can get the actual size of the primary shards (in GB), divide by two, round the result to a whole number, and spit ball estimate for growth so you don't have to go do this again really soon. Normally I add one to the number if the primary shards already total to at least one GB and it isn't a wiktionary. Don't worry too much about being wrong because you can change this with an in place reindex. Anyway, to get the size log in to an elasticsearch machine and run this:

curl -s localhost:9200/_stats?pretty | grep 'general\|content\|"size"\|count' | less

Count and size are repeated twice. The first time is for the primary shards and the second includes all replicas. You can ignore the replica numbers.

We max out the number of shards at 5 for content indexes and 10 for non-content indexes.

Create the index

mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki $wiki

That'll create the search index with all the proper configuration. Addwiki should have done this automatically for you.

Populate the search index

mkdir -p ~/log
mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --skipLinks --indexOnSkip --queue | tee ~/log/$wiki.parse.log
mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --skipParse --queue | tee ~/log/$wiki.links.log

If the last line of the output of the --skipLinks line doesn't end with "jobs left on the queue" wait a few minutes before launching the second job. The job queue doesn't update its counts quickly and the job might queue everything before the counts catch up and still see the queue as empty.

Health/Activity Monitoring

We have nagios checks that monitor Elasticsearch's claimed health (green, yellow, red) and ganglia monitoring on a number of attributes exposed by Elasticsearch.

Eqiad

Codfw

Monitoring the job queue

The current state of the jobqueue is visible, but only for individual wikis. Enwiki is almost always the busiest index, so we can monitor the state with:

mwscript showJobs.php --wiki=enwiki --group | grep ^cirrus

Under normal operation this will output something like:

ebernhardson@terbium:~$ mwscript showJobs.php --wiki=enwiki --group | grep ^cirrus
cirrusSearchIncomingLinkCount: 19028 queued; 0 claimed (0 active, 0 abandoned); 234 delayed
cirrusSearchLinksUpdatePrioritized: 35 queued; 4 claimed (4 active, 0 abandoned); 0 delayed

This output is explained in the manual. Most CirrusSearch jobs are normal to see here, but there is one exception. The cirrusSearchElasticaWrite is typically created and run in-process from other jobs. The only time cirrusSearchElasticaWrite in inserted into the job queue is when the write operations cannot be performed to the cluster. This may be due to writes being frozen, it could be a network partition between the job runners and the elasticsearch cluster, or it could be that the index being written to is red.

Waiting for Elasticsearch to "go green"

Elasticsearch has a built in cluster health monitor. red means there are unassigned primary shards and is bad because some requests will fail. Yellow means there are unassigned replica shards but the masters are doing just fine. This is mostly fine because technically all requests can still be served but there is less redundancy then normal and there are less shards to handle queries. Performance may be effected in the yellow state but requests won't just fail. Anyway, this is how you wait for the cluster to "go green".

until curl -s localhost:9200/_cluster/health?pretty | tee /tmp/status | grep \"status\" | grep \"green\"; do
  cat /tmp/status
  echo
  sleep 1
done

So you aren't just watching number scroll by for fun I'll explain what they all mean:

{
  "cluster_name" : "labs-search-project",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 30,
  "active_shards" : 55,
  "relocating_shards" : 0,
  "initializing_shards" : 1,
  "unassigned_shards" : 0
}
cluster_name
The name of the cluster.
status
The status we're waiting for.
timed_out
Has this requests for status timed out? I've never seen one time out.
number_of_nodes
How many nodes (data or otherwise) are in the cluster? We plan to have some master only nodes at some point so this should be three more than number_of_data_nodes
number_of_data_nodes
How many nodes that hold data are in the cluster?
active_primary_shards
How many of the primary shards are active? This should be indexes * shards. This number shouldn't change frequently because when the primary shards go off line one of the replica shards should take over immediately. It should be too fast to notice.
active_shards
How many shards are active? This should be indexes * shards * (1 + replicas). This will go down when a machine leaves the cluster. When possible those shards will be reassigned to other machines. This is possible so long as those machines don't already hold a copy of that replica.
relocating_shards
How many shards are currently relocating?
initializing_shards
How many shards are initializing? This mostly means they are being restored from a peer. This should max out at cluster.routing.allocation.node_concurrent_recoveries. We still use the default of 2.
unassigned_shards
How many shards have yet to be assigned? These are just waiting in line to become initializing_shards.

See the Trouble section for what can go wrong with this.

Pausing Indexing

Various forms of cluster maintenance, such as rolling restarts, take significantly less time when the content of the indices stays static and is not being constantly written to. To support this CirrusSearch includes the concept of 'frozen' indices. When an index is frozen no ElasticaWrite jobs are executed. When an ElasticaWrite job tries to run it sees the index it wants to write to is frozen and re-inserts itself to the job queue with an exponential backoff. Individual indic

To freeze all writes to the cluster. It does not matter which --wiki=XXX you use, this is a cluster wide setting stored in an elasticsearch index. The cluster to freeze is selected with --cluster=XXX:

mwscript extensions/CirrusSearch/maintenance/freezeWritesToCluster --wiki=enwiki --cluster=eqiad

Once run all writes will backup into the job queue. The current state of the jobqueue is visible, but only for individual wikis. Enwiki is almost always the busiest index, so we can monitor the state with:

mwscript showJobs.php --wiki=enwiki --group | grep ^cirrus

To thaw out writes to the cluster re-run the initial script with the --thaw option

mwscript extensions/CirrusSearch/maintenance/freezeWritesToCluster --wiki=enwiki --cluster=eqiad --thaw

After thawing the cluster you must monitor the job queue to ensure the number of cirrusSearchElasticaWrite jobs is decreasing as expected. These jobs include very little calculation and quickly drain from the queue.

Ensure you do not leave the cluster in the frozen state for too long. As of Sept 2015 too long is six hours. In WMF production all jobs need to fit within the redis cluster's available memory. If cirrus jobs are allowed to fill up all available redis memory there will be an outage and it will not be limited to search.

Restarting a node

Elasticsearch rebalances shards when machines join and leave the cluster. Because this can take some time we've stopped puppet from restarting Elasticsearch on config file changes. We expect to manually perform rolling restarts to pick up config changes or software updates (via apt-get), at least for the time being. There are two ways to do this: the fast way, and the safe way. At this point we prefer the fast way if the downtime is quick and the safe way if it isn't. The fast way in the Elasticsearch recommended way. The safe way keeps the cluster green the whole time but is really slow and can cause the cluster to get a tad unbalanced if it is running close to the edge on disk space.

The fast way:

es-tool restart-fast

This will instruct Elasticsearch not to allocate new replicas for the duration of the downtime. It should be faster then the simple way because unchanged indexes can be restored on the restarted machine quickly. It still takes some time.

The safe way starts with this to instruct Elasticsearch to move all shards off of this node as fast as it can without going under the appropriate number of replicas:

ip=$(facter ipaddress)
curl -XPUT localhost:9200/_cluster/settings?pretty -d "{
    \"transient\" : {
        \"cluster.routing.allocation.exclude._ip\" : \"$ip\"
    }
}"

Now you wait for all the shards to move off of this one:

host=$(facter hostname)
function moving() {
   curl -s localhost:9200/_cluster/state | jq -c '.nodes as $nodes |
       .routing_table.indices[].shards[][] |
       select(.relocating_node) | {index, shard, from: $nodes[.node].name, to: $nodes[.relocating_node].name}'
}
while moving | grep $host; do echo; sleep 1; done

Now you double check that they are all off. See the advice under stuck in yellow if they aren't. It is probably because Elasticsearch can't find another place for them to go:

curl -s localhost:9200/_cluster/state?pretty | awk '
    BEGIN {more=1}
    {if (/"nodes"/) nodes=1}
    {if (/"metadata"/) nodes=0}
    {if (nodes && !/"name"/) {node_name=$1; gsub(/[",]/, "", node_name)}}
    {if (nodes && /"name"/) {name=$3; gsub(/[",]/, "", name); node_names[node_name]=name}}
    {if (/"node"/) {node=$3; gsub(/[",]/, "", node)}}
    {if (/"shard"/) {shard=$3; gsub(/[",]/, "", shard)}}
    {if (more && /"index"/) {
        index_name=$3
        gsub(/[",]/, "", index_name)
        print "node="node_names[node]" shard="index_name":"shard
    }}
' | grep $host

Now you can restart this node and the cluster with stay green:

sudo /etc/init.d/elasticsearch restart
until curl -s localhost:9200/_cluster/health?pretty ; do
    sleep 1
done

Next you tell Elasticsearch that it can move shards back to this node:

curl -XPUT localhost:9200/_cluster/settings?pretty -d "{
    \"transient\" : {
        \"cluster.routing.allocation.exclude._ip\" : \"\"
    }
}"

Finally you can use the same code that you used to find stuff shards moving from this node to find shards moving to this node:

while moving | grep $host; do echo; sleep 1; done

Rolling restarts

Prior to starting a rolling restart you want to pause all write actions that are performed on the cluster and issue a synced flush. The write operations will be queued. Note that write operations will be dropped if writes are stopped for more than $wgCirrusSearchDropDelayedJobsAfter (at this point set to 3 hours).

First from a deployment machine (terbium, deployment-bastion) run the following script:

ES_SERVER=elastic1001.eqiad.wmnet

# Pause all writes from mediawiki, allowing them to queue up in the job queue. Doesn't
# matter which wiki this is run against, it is a cluster-wide setting.
echo Freezing all mediawiki writes to elasticsearch
mwscript extensions/CirrusSearch/maintenance/freezeWritesToCluster.php --wiki=enwiki

echo Sleeping 5 minutes to ensure mediawiki settles down
for i in {1..30}; do
  echo -n "."
  sleep 10;
done
echo Done waiting

# https://discuss.elastic.co/t/synced-flush-causes-node-to-re start/24220/14
echo Issuing a forced flush
FORCE_FLUSH=$(curl -XPOST "http://$ES_SERVER:9200/_flush?force=true&wait_if_ongoing=true" | jq ._shards.failed)
if [ x"$FORCE_FLUSH" != x"0" ]; then
  echo "Failed to force-flush $FORCE_FLUSH shards"
  exit 1
fi

echo Issuing a synced flush
SYNC_FLUSH=$(curl -XPOST "http://$ES_SERVER:9200/_flush?synced" | jq 'with_entries(select(.value.failed > 0))')
if [ x"$SYNC_FLUSH" != x"null" ]; then
  echo Failed to issue synced-flush: 
  echo $SYNC_FLUSH"
  exit 1
fi

# We also need to prevent apifeatureusage updates. Data for Special:ApiFeatureUsage
# is written to the elasticsearch cluster via logstash. It is not possible to pause
# these writes, so we just need to reject them at the elasticsearch level. This has to
# be done after the synced flush, that won't work with any read-only indexes.
# TODO: DATE+1 isn't made read-only here, rolling over to next day could break this.
DATE=$(date +"%Y.%m.%d")
API_READONLY=$(curl -XPUT http://$ES_SERVER:9200/apifeatureusage-$DATE/_settings -d '{"index": { "blocks": { "read_only": true }}}')
if [ x"$API_READONLY" != x'{"acknowledged":true}' ]; then
  echo "Failed to set apifeatureusage-$DATE to read only: $API_READONLY
  exit 1
fi

This script will perform a rolling restart across all nodes using the fast way mentioned above. It needs to be run from your laptop or other machine that can ssh directly into the elastic servers

# Build the servers file with servers to restart
export prefix=elastic10
export suffix=.eqiad.wmnet
rm -f servers
for i in $(seq -w 1 31); do
    echo $prefix$i$suffix >> servers
done

# Restart them
cat << __commands__ > /tmp/commands
# sudo apt-get update
# sudo apt-get install elasticsearch
# wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.0.deb
# sudo dpkg -i --force-confdef --force-confold elasticsearch-1.1.0.deb
sudo es-tool restart-fast
echo "Bouncing gmond to make sure the statistics are up to date..."
sudo /etc/init.d/ganglia-monitor restart
__commands__

for server in $(cat servers); do
    scp /tmp/commands $server:/tmp/commands
    ssh $server bash /tmp/commands
done

Finally, back on terbium.eqiad.wmnet or deployment-bastion.eqiad.wmflabs, this script needs to enable writes to begin going to elasticsearch again:

ES_SERVER=elastic1001.eqiad.wmnet

echo Thawing cluster-level block of writes by mediawiki to elasticsearch
mwscript extensions/CirrusSearch/maintenance/freezeWritesToCluster.php --wiki=enwiki --thaw

DATE=$(date +"%Y.%m.%d")
API_READONLY=$(curl -XPUT http://$ES_SERVER:9200/apifeatureusage-$DATE/_settings -d '{"index": { "blocks": { "read_only": false }}}')
if [ x"$API_READONLY" != x'{"acknowledged":true}' ]; then
  echo "Failed to set apifeatureusage-$DATE back to read/write mode: $API_READONLY
  exit 1
fi

Recovering from an Elasticsearch outage/interruption in updates

The same script that populates the search index can be run over a more limited list of pages:

mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --from <YYYY-mm-ddTHH:mm:ssZ> --to <YYYY-mm-ddTHH:mm:ssZ>

or:

mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --fromId <id> --toId <id>

So the simplest way to recover from an Elasticsearch outage should be to use --from and --to with the times of the outage. Don't be stingy with the dates - it is better to reindex too many pages than too few.

If there was an outage its probably good to also do:

mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --deletes --from <YYYY-mm-ddTHH:mm:ssZ> --to <YYYY-mm-ddTHH:mm:ssZ>

This will pick up deletes which need to be iterated separately.

This is the script I have in my notes for recovering from an outage:

function outage() {
   wiki=$1
   start=$2
   end=$3
   mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --from $start --to $end --deletes | tee -a ~/cirrus_log/$wiki.outage.log
   mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --from $start --to $end --queue | tee -a ~/cirrus_log/$wiki.outage.log
}
while read wiki ; do
   outage $wiki '2015-03-01T20:00:00Z' '2015-03-02T00:00:00Z'
done < /srv/mediawiki/dblists/all.dblist

In place reindex

Some releases require an in place reindex. This is due to analyzer changes. Sometime the analyzer will only change for wikis in particular languages so only those wikis will need an update. In any case, this is how you perform an in place reindex:

function reindex() {
    cluster=$1
    wiki=$2
    reindex_log="$HOME/cirrus_log/$wiki.$cluster.reindex.log"
    if [ -z "$cluster" -o -z "$wiki" ]; then
        echo "Usage: reindex [cluster] [wiki]"
        return 1
    fi
    TZ=UTC export REINDEX_START=$(date +%Y-%m-%dT%H:%m:%SZ)
    mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki $wiki --cluster $cluster --reindexAndRemoveOk --indexIdentifier now --reindexProcesses 10 2>&1 | tee $reindex_log && \
        mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --cluster $cluster --from $REINDEX_START --deletes | tee -a $reindex_log && \
        mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --cluster $cluster --from $REINDEX_START --queue | tee -a $reindex_log
}
while read wiki ; do
    reindex eqiad $wiki
    reindex codfw $wiki
done < ../reindex_me

Don't worry about incompatibility during the update - CirrusSearch is maintained so that queries and updates will always work against the currently deployed index as well as the new index configuration. Once the new index has finished building (the second command) it'll replace the old one automatically without any interruption of service. Some updates will have been lost during the reindex process. The third command will catch those update

The reindex_me file should be the wikis you want to reindex. For group 0 use:

/srv/mediawiki/dblists/group0.dblist

or

/usr/local/bin/expanddblist group0

For group 1 use:

rm -f allcirrus
while read wiki ; do
    echo 'global $wgCirrusSearchServers; if (isset($wgCirrusSearchServers)) {echo wfWikiId();}' |
        mwscript maintenance/eval.php --wiki $wiki
done < /usr/local/apache/common/all.dblist | sed '/^$/d' | tee allcirrus
mv allcirrus allcirrus.old
sort allcirrus.old | uniq > allcirrus
rm allcirrus.old
diff allcirrus /srv/mediawiki/dblists/group0.dblist | grep '<' | cut -d' ' -f 2 | diff - /usr/local/apache/common/wikipedia.dblist | grep '<' | cut -d' ' -f 2 > cirrus_group1
cp cirrus_group1 reindex_me

For group 2 use:

cat allcirrus /srv/mediawiki/dblists/wikipedia.dblist | sort | uniq -c | grep "  2" | cut -d' ' -f 8 > cirrus_wikipedia
cp cirrus_wikipedia reindex_me

Full reindex

Some releases will require a full, from scratch, reindex. This is due to changes in the way the Mediawiki sends data to Elasticsearch. These changes typically will have to be performed for all deployed wikis and are time consuming. First, do an in place reindex as above. Then use this to make scripts that run the full reindex:

 function make_index_commands() {
     wiki=$1
     mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki $wiki --queue --maxJobs 10000 --pauseForJobs 1000 \
         --forceUpdate --buildChunks 250000 |
         sed -e 's/^/mwscript extensions\/CirrusSearch\/maintenance\//' |
         sed -e 's/$/ | tee -a cirrus_log\/'$wiki'.parse.log/'
 }
 function make_many_reindex_commands() {
     wikis=$(pwd)/$1
     rm -rf cirrus_scripts
     mkdir cirrus_scripts
     pushd cirrus_scripts
     while read wiki ; do
         make_index_commands $wiki
     done < $wikis | split -n r/5
     for script in x*; do sort -R $script > $script.sh && rm $script; done
     popd
 }

Then run the scripts it makes in screen sessions.

Index Warmers

Some indices have warmers, these warmers are usually updated when an inplace reindex is done. In some cases you may want to update the warmers with a query generated from a recent cirrus version. You can regenerate them by issuing the following from terbium:

# Only one wiki
mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php  --wiki wikiid --justCacheWarmers
# All wikis
foreachwiki extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --justCacheWarmers

Some of the updates may fail with a timeout so it's important to verify that all the indices have been updated correctly by checking the queries with :

curl -s -XGET localhost:9200/_warmer/?pretty

These commands are very useful in the case you want to disable an ES feature (e.g. dynamic scripting) used by these queries, before issuing a node restart make sure that all the warmers are compatible with the new configuration and update them if needed.

Dumps

Dumps of all indices are created weekly by a cron as part of the snapshot puppet module. The dumps are available on dumps.wikimedia.org and can be reimported via the elasticsearch _bulk API.

Adding new nodes

Add the new nodes in puppet making sure to include all the current roles and account, sudo, and lvs settings.

Once the node is installed and puppet has run, you should be left with large RAID 0 ext4 /var/lib/elasticsearch mount point. Two things should be tweaked about this mount point:

service elasticsearch stop
umount /var/lib/elasticsearch

# Remove the default 5% blocks reserved for privileged processes.
tune2fs -m 0 /dev/md2

# add the noatime option to fstab.
sed -i 's@/var/lib/elasticsearch ext4    defaults@/var/lib/elasticsearch ext4    defaults,noatime@' /etc/fstab

mount /var/lib/elasticsearch
service elasticsearch start

Add the node to lvs.

You can watch he node suck up shards:

curl -s localhost:9200/_cluster/state?pretty | awk '
    BEGIN {more=1}
    {if (/"nodes"/) nodes=1}
    {if (/"metadata"/) nodes=0}
    {if (nodes && !/"name"/) {node_name=$1; gsub(/[",]/, "", node_name)}}
    {if (nodes && /"name"/) {name=$3; gsub(/[",]/, "", name); node_names[node_name]=name}}
    {if (/"RELOCATING"/) relocating=1}
    {if (/"routing_nodes"/) more=0}
    {if (/"node"/) {from_node=$3; gsub(/[",]/, "", from_node)}}
    {if (/"relocating_node"/) {to_node=$3; gsub(/[",]/, "", to_node)}}
    {if (/"shard"/) {shard=$3; gsub(/[",]/, "", shard)}}
    {if (more && relocating && /"index"/) {
        index_name=$3
        gsub(/[",]/, "", index_name)
        print "from="node_names[from_node]" to="node_names[to_node]" shard="index_name":"shard
        relocating=0
    }}
'

Removing nodes

Push all the shards off the nodes you want to remove like this:

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : "10.x.x.x,10.x.x.y,..."
    }
}'

Then wait for there to be no more relocating shards:

echo "Waiting until relocating shards is 0..."
until curl -s localhost:9200/_cluster/health?pretty | tee /tmp/status | grep '"relocating_shards" : 0,'; do
   cat /tmp/status | grep relocating_shards
   sleep 2
done

Use the cluster state API to make sure all the shards are off of the node:

curl localhost:9200/_cluster/state?pretty | less

Elasticsearch will leave the shard on a node even if it can't find another place for it to go. If you see that then you have probably removed too many nodes. Finally, you can decommission those nodes in puppet. The is no need to do it one at a time because those nodes aren't hosting any shards any more.

Standard decommissioning documentation applies.

Deploying plugins

Plugins are deployed via the repository at https://gerrit.wikimedia.org/r/operations/software/elasticsearch/plugins.

To use it, clone it, install git-fat (e.g. pip install --user git-fat), then run git fat init in the root of the repository. Add your jars and commit as normal but note that git-fat will add them as text files containing hashes.

To get the jars into place build them against http://archiva.wikimedia.org/ then upload them to the "Wikimedia Release Repository". You can also download them from central and verify their checksums, and then upload them to the release repository. If you need a dependency then make sure to verify its checksum and upload it to the "Wikimedia Mirrored Repository".

Once you've got the jars into Archiva wait a while for the git-fat archive to build. Then you can sync it to beta by going to the gerrit page for the change, copying the checkout link, then:

ssh deployment-tin.eqiad.wmflabs
cd /srv/deployment/elasticsearch/plugins/
git deploy start
<paste checkout link>
git deploy sync
<follow instructions>

Since the git deploy/git fat process can be a bit tempermental, verify that the plugins made it:

export prefix=deployment-elastic0
export suffix=.eqiad.wmflabs
rm -f servers
for i in {5..8}; do
    echo $prefix$i$suffix >> servers
done
cat << __commands__ > /tmp/commands
find /srv/deployment/elasticsearch/plugins/ -name "*.jar" -type f | xargs du -h | sort -k 2
__commands__
for server in $(cat servers); do
    scp /tmp/commands $server:/tmp/commands
    ssh $server bash /tmp/commands
done

If a file is 4.0k it is probably not a valid jar file.

Now once the files are synced in beta do a rolling restart in beta and they'll be loaded. Magic, eh?

To get the plugins to production repeat the above process but instead of deployment-bastion.eqiad.wmflabs use tin.eqiad.wmnet and instead of pasting the checkout links do a git pull. Use the following as the list of servers to check and for the rolling restart.

export prefix=elastic100
export suffix=.eqiad.wmnet
rm -f servers
for i in {1..9}; do
    echo $prefix$i$suffix >> servers
done
export prefix=elastic101
for i in {0..9}; do
    echo $prefix$i$suffix >> servers
done
export prefix=elastic102
for i in {0..9}; do
    echo $prefix$i$suffix >> servers
done
export prefix=elastic103
for i in {0..1}; do
    echo $prefix$i$suffix >> servers
done

Inspecting logs

There is two different types of log, we have elasticsearch logs on each node and cirrus logs.

Elasticsearch logs

The logs generated by elasticsearch are located in /var/logs/elasticsearch/ :

  • production-search-eqiad.log is the main log, it contains all errors (mostly failed queries due to syntax errors). It's generally a good idea to keep this log opened on the master and the node involved in administration tasks.
  • production-search-eqiad_index_search_slowlog.log contains the queries that are considered slow (thresholds are described in /etc/elasticsearch/elasticsearch.yml)
  • elasticsearch_hot_threads.log is a snapshot of java threads considered "hot" (generated by the python script elasticsearch_hot_threads_logger.py, it takes a snapshot from hot threads every 10 or 50 seconds depending on the load of the node)

NOTE: the log verbosity can be changed at runtime, see elastic search logging for more details.

Cirrus logs

The logs generated by cirrus are located on fluorine.eqiad.wmnet:/a/mw-log/ :

  • CirrusSearch.log: the main log. Around 300-500 lines generated per second.
  • CirrusSearchRequests.log: contains all requests (queries and updates) sent by cirrus to elastic.Generates between 1500 and 3000+ lines per second.
  • CirrusSearchSlowRequests.log: contains all slow requests (the threshold is currently set to 10s but can be changed with $wgCirrusSearchSlowSearch). Few lines per day.
  • CirrusSearchChangeFailed.log: contains all failed updates. Few lines per day except in case of cluster outage.

Useful commands :

See all errors in realtime (useful when doing maintenance on the elastic cluster)

tail -f /a/mw-log/CirrusSearch.log | grep -v DEBUG

WARNING: you can rapidly get flooded if the pool counter is full.

Measure the throughput between cirrus and elastic (requests/sec) in realtime

tail -f /a/mw-log/CirrusSearchRequests.log | pv -l -i 5 > /dev/null

NOTE: this is an estimation because I'm not sure that all requests are logged here. For instance: I think that the requests sent to the frozen_index are not logged here. You can add something like 150 or 300 qps (guessed by counting the number of "Allowed write" in CirrusSearch.log)

Measure the number of prefix queries per second for enwiki in realtime

tail -f /a/mw-log/CirrusSearchRequests.log | grep enwiki_content | grep " prefix search " | pv -l -i 5 > /dev/null

Multi-DC Operations

Status

We have deployed a 24 node elasticsearch cluster in codfw. This cluster has an LVS at search.svc.codfw.wmnet:9243. The logic for ongoing updates was implemented in https://gerrit.wikimedia.org/r/#/c/237264/. Switch of Elasticsearch to codfw is tested as part of the Datacenter Switch tests. Having Mediawiki run in eqiad and Elasticsearch in codfw is already tested.

Overview

There is no current plan to utilize a second job queue in codfw for the secondary elastic cluster. As update jobs are spawned in eqiad they will be spawned in duplicate with one for each cluster. This will keep both clusters in sync. Prior to that we need to do initial index population.

DC Switch

Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster InitialiseSettings.php. The default value is "local", which means that if mediawiki switches DC, everything should be automatic.

Having search traffic flow between 2 datacenters increases the privacy risks. HTTPS has been deployed (on port 9243) to mitigate this risk.

Removing Duplicate Indices

When index creation bails out, perhaps due to a thrown exception or some such, the cluster can be left with multiple indices of a particular type but only one active. The following will locate and clear out any duplicates:

curl -s elastic1020:9200/_cat/indices | awk '{print $3}' | sed 's/_[0-9]\+$//' | sort | uniq -c | grep -v 1 | awk '{print $2}' > indices_with_dupes
curl -s elastic1020:9200/_cat/aliases | grep -Ff indices_with_dupes | awk '{print $2}' > indices_with_dupes.active
curl -s elastic1020:9200/_cat/indices | grep -Ff indices_with_dupes | grep -vFf indices_with_dupes.active | awk '{print $3}' > indices_to_delete
 while read index; do
  echo $index
  curl -XDELETE elastic1020:9200/${index}
  echo
done < indices_to_delete

Rebalancing shards to even out disk space use

Viewing free disk space per node:

curl -s localhost:9200/_cat/nodes?h=d,h | sort -nr

To move shards away from the nodes with the least disk free, we can lower cluster.routing.allocation.disk.watermark.high settings temporarily. The high watermark is the limit at which Elasticsearch will start moving shards out of the node to free up space. High watermark can be modified by:

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "cluster.routing.allocation.disk.watermark.high" : "70%"
    }
}'

Cleanup of old titlesuggest indices

When title suggest indices need to be updated, a new index is created, the title suggest alias is moved to point at the new index and the old index is removed. In some cases, part of this process fails, leaving 2 versions of a title suggest index. The one not in used can safely be deleted.

Example of duplicated zhwiki_titlesuggest:

 gehel@elastic2001:~$ curl -s localhost:9200/_cat/indices | grep zhwiki_titlesuggest
 green  open zhwiki_titlesuggest_1471950550                    2 5  1244577        0    1.6gb    279mb 
 yellow open zhwiki_titlesuggest_1472031575                    2 5  1244777        0  279.1mb  279.1mb 

Checking which index is in used:

 gehel@elastic2001:~$ curl -s localhost:9200/_cat/aliases | grep zhwiki_titlesuggest
 zhwiki_titlesuggest                    zhwiki_titlesuggest_1471950550                    - - -

Deleting the unused index:

 gehel@elastic2001:~$ curl -XDELETE localhost:9200/zhwiki_titlesuggest_1472031575
 {"acknowledged":true}

Keep an eye on the number of indices on a few nodes

When banning nodes to prepare for some maintenance operations, it is useful to see how many shards are left on those nodes:

 watch -d -n 10 \
 "curl -s localhost:9200/_cat/shards \
   | awk '{print \$8}' \
   | egrep 'elastic10(21|22|23|24)' \
   | sort \
   | uniq -c"

Elasticsearch Curator

elasticsearch-curator is a python tool which provides high level configuration for some maintenance operations. Its configuration is based on action files. Some standard actions are deployed on each elasticsearch node in /etc/curator. For example, you can disable shard routing with:

 sudo curator --config /etc/curator/config.yaml /etc/curator/disable-shard-allocation.yaml

Curator must be run as root to access its log file (/var/log/curator.log).

Explain unallocated shards

Elasticsearch 5.2 introduced a new API endpoint to explain why a shard is unassigned:

 curl -s localhost:9200/_cluster/allocation/explain?pretty

Other elasticsearch clients

Cirrus is not the only application using our "search" elasticsearch cluster. In particular:

  • API Feature requests
  • Translate Extension
  • Phabricator

Constraints on elasticsearch clients

Reading this page should already give you some idea of how our elasticsearch cluster is managed. A short checklist that can help you:

  • Writes need to be robust: Our elasticsearch cluster can go in read-only mode for various reasons (loss of a node, maintenance operations, ...). While this is relatively rare, it is part of normal operation. The client application is responsible to implement queuing / retry.
  • Writes need to be stoppable: During cluster restart, we need to stop all writes to ensure nodes can be restarted fast. A client application needs to provide a way to pause and restart writes. Ideally it should provide a way to flush pending writes as well (this can be used when writes are re-enabled to process all pending writes quickly, before the next read-only period).
  • Multi-DC aware: We have 2 independent clusters, writes have to be duplicated to both clusters, including index creation, the client application has to provide a mechanism to switch from one cluster to the other (in case of loss of datacenter, major maintenance on the cluster, ...)

Trouble

If Elasticsearch is in trouble, it can show itself in many ways. Searches could fail on the wiki, job queue could get backed up with updates, pool counter overflowing with unperformed searches, or just plain old high-load that won't go away. Luckily, Elasticsearch is very good at recovering from failure so most of the time these sorts of problems aren't life threatening. For some more specific problems and their mitigation techniques, see the /Trouble subpage.

Tasks Api

Sometimes pathological queries can get "stuck" burning cpu as they try and execute for many minutes. We can ask elasticsearch what tasks are currently running with the tasks api:

   curl https://search.svc.eqiad.wmnet:9243/_tasks?detailed > tasks

This will dump a multi-megabyte json file that indicates all tasks (such as shard queries, or the parent query that issued shard queries) and how long they have been running. The inclusion of the detailed flag means all shard queries will report the index they are running against, and the parent query will report the exact text of the json search query that cirrussearch provided. For an extended incident this can be useful to track down queries that have been running extended periods of time, but for brief blips it may not have much information.

For a quick look at tasks on a node, you can use something like:

   curl -s 'localhost:9200/_tasks/?detailed&nodes=elastic1022' | \
       jq '.nodes | to_entries | .[0].value.tasks | to_entries
           | map({
               action: .value.action,
               desc: .value.description[0:100], 
               ms:(.value.running_time_in_nanos / 1000000)
           }) | sort_by(.ms)' | \
       less

Other Resources

  • The elasticsearch cat APIs. This contains much of the information you want to know about the cluster. The allocation, shards, nodes, indices, health and recovery reports within are often useful for diagnosing information.
  • The elasticsearch cluster settings api. The contains other interesting information about the current configuration of the cluster. Temporary settings changes, such as changing logging levels, are applied here.