User:EBernhardson/search-rolling restarts

From Wikitech

Rolling restarts

Before starting a rolling restart, pause all writes to the cluster and issue a synced flush. First, from a deployment machine (terbium or deployment-bastion), run the following script:

ES_SERVER=elastic1001.eqiad.wmnet

# Pause all writes from mediawiki, allowing them to queue up in the job queue. Doesn't
# matter which wiki this is run against, it is a cluster-wide setting.
echo Freezing all mediawiki writes to elasticsearch
mwscript extensions/CirrusSearch/maintenance/freezeWritesToCluster.php --wiki=enwiki

echo Sleeping 5 minutes to ensure mediawiki settles down
for i in {1..30}; do
  echo -n "."
  sleep 10;
done
echo Done waiting

# https://discuss.elastic.co/t/synced-flush-causes-node-to-restart/24220/14
echo Issuing a forced flush
FORCE_FLUSH=$(curl -XPOST "http://$ES_SERVER:9200/_flush?force=true&wait_if_ongoing=true" | jq ._shards.failed)
if [ x"$FORCE_FLUSH" != x"0" ]; then
  echo "Failed to force-flush $FORCE_FLUSH shards"
  exit 1
fi

echo Issuing a synced flush
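# Keep only the entries that report failed shards. Depending on the jq version an
# empty result prints as null or {}, so "// {}" normalizes both to {}.
SYNC_FLUSH=$(curl -XPOST "http://$ES_SERVER:9200/_flush/synced" | jq -c 'with_entries(select(.value.failed > 0)) // {}')
if [ x"$SYNC_FLUSH" != x"{}" ]; then
  echo "Failed to issue synced-flush:"
  echo "$SYNC_FLUSH"
  exit 1
fi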

# We also need to prevent apifeatureusage updates. Data for Special:ApiFeatureUsage
# is written to the elasticsearch cluster via logstash. It is not possible to pause
# these writes, so we just need to reject them at the elasticsearch level. This has to
# be done after the synced flush, because a synced flush won't work with any read-only indexes.
# TODO: DATE+1 isn't made read-only here, rolling over to next day could break this.
DATE=$(date +"%Y.%m.%d")
API_READONLY=$(curl -XPUT http://$ES_SERVER:9200/apifeatureusage-$DATE/_settings -d '{"index": { "blocks": { "read_only": true }}}')
if [ x"$API_READONLY" != x'{"acknowledged":true}' ]; then
  echo "Failed to set apifeatureusage-$DATE to read only: $API_READONLY"
  exit 1
fi
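
The TODO above notes that only today's apifeatureusage index is made read-only, so a restart that runs past midnight could see writes land in the next day's index. One possible way to cover both days is sketched below. The function name apifeatureusage_indices is hypothetical, the sketch assumes GNU date (for the -d option), and it does not handle the case where tomorrow's index does not exist yet, in which case the settings request would presumably fail.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the apifeatureusage index names for a given day
# and the day after it. Assumes GNU date (-d). The optional argument
# (YYYY-MM-DD) defaults to today and mainly exists so the function can be
# checked against a fixed date.
apifeatureusage_indices() {
  local base="${1:-$(date +%Y-%m-%d)}"
  date -d "$base" +"apifeatureusage-%Y.%m.%d"
  date -d "$base +1 day" +"apifeatureusage-%Y.%m.%d"
}

# Usage sketch, mirroring the request in the script above (commented out so
# that sourcing this file has no side effects):
# for INDEX in $(apifeatureusage_indices); do
#   API_READONLY=$(curl -XPUT "http://$ES_SERVER:9200/$INDEX/_settings" -d '{"index": { "blocks": { "read_only": true }}}')
#   if [ x"$API_READONLY" != x'{"acknowledged":true}' ]; then
#     echo "Failed to set $INDEX to read only: $API_READONLY"
#     exit 1
#   fi
# done
```

The same list could be reused when switching the indexes back to read/write, so that both steps act on the same names.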

The following script performs a rolling restart across all nodes using the fast restart (es-tool restart-fast). Run it from your laptop or another machine that can ssh directly into the elastic servers:

# Build the servers file with servers to restart
export prefix=elastic10
export suffix=.eqiad.wmnet
rm -f servers
for i in $(seq -w 1 31); do
    echo $prefix$i$suffix >> servers
done

# Restart them
cat << __commands__ > /tmp/commands
# sudo apt-get update
# sudo apt-get install elasticsearch
# wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.0.deb
# sudo dpkg -i --force-confdef --force-confold elasticsearch-1.1.0.deb
sudo es-tool restart-fast
echo "Bouncing gmond to make sure the statistics are up to date..."
sudo /etc/init.d/ganglia-monitor restart
__commands__

for server in $(cat servers); do
    scp /tmp/commands $server:/tmp/commands
    ssh $server bash /tmp/commands
done
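
The loop above moves on to the next server even when the copy or the restart on the previous one fails. A variant that stops at the first failure is sketched below. The function name restart_servers is hypothetical and this is not part of the original procedure. It assumes that ssh's exit status reflects the status of the remote bash /tmp/commands run, and the commands file may need an explicit failure check of its own (for example set -e) for a failed es-tool run to propagate.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the restart loop: stop at the first server whose
# copy or restart step fails, and report which server it was.
restart_servers() {
  local servers_file="$1" commands_file="$2" server
  while read -r server; do
    [ -n "$server" ] || continue   # skip blank lines
    echo "Restarting $server"
    if ! scp "$commands_file" "$server:/tmp/commands" < /dev/null; then
      echo "Copying commands to $server failed" >&2
      return 1
    fi
    # -n stops ssh from reading stdin, which would otherwise swallow the
    # remaining lines of the servers file.
    if ! ssh -n "$server" bash /tmp/commands; then
      echo "Restart on $server failed" >&2
      return 1
    fi
  done < "$servers_file"
}

# Usage sketch (commented out so that sourcing this file has no side effects):
# restart_servers servers /tmp/commands || exit 1
```

Stopping early leaves the remaining nodes untouched, so the cluster can be inspected before continuing with a trimmed servers file.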

Finally, back on terbium.eqiad.wmnet or deployment-bastion.eqiad.wmflabs, run this script to re-enable writes to elasticsearch:

ES_SERVER=elastic1001.eqiad.wmnet

echo Thawing cluster-level block of writes by mediawiki to elasticsearch
mwscript extensions/CirrusSearch/maintenance/freezeWritesToCluster.php --wiki=enwiki --thaw

DATE=$(date +"%Y.%m.%d")
API_READONLY=$(curl -XPUT http://$ES_SERVER:9200/apifeatureusage-$DATE/_settings -d '{"index": { "blocks": { "read_only": false }}}')
if [ x"$API_READONLY" != x'{"acknowledged":true}' ]; then
  echo "Failed to set apifeatureusage-$DATE back to read/write mode: $API_READONLY"
  exit 1
fi