Search/S3 Plugin Enable

Backup and Restore API Calls

Register S3 repository

The below API call registers a snapshot repository with the arbitrary name "elastic_snaps". The exact values of max_(snapshot|restore)_bytes_per_sec depend on circumstances. Snapshotting a 32 shard index needs a lower setting than a 1 shard index. Restoring a 32 shard index to a 35 node cluster needs different settings than restoring the same index to a 6 node cluster. The values below are conservative and hopefully work everywhere, but may take longer than strictly necessary to complete an operation.

More details at Elastic.co .

curl -H 'Content-type: Application/json' -XPUT  http://127.0.0.1:9200/_snapshot/elastic_snaps -d '
{
  "type": "s3",
  "settings": {
   "bucket": "elasticsearch-snapshot",
   "client": "default",
   "endpoint": "https://thanos-swift.discovery.wmnet",
   "path_style_access": "true",
   "chunk_size": "50mb",
   "buffer_size": "50mb",
   "max_snapshot_bytes_per_sec": "20mb",
   "max_restore_bytes_per_sec": "10mb"
  }
}'

Create snapshot (aka backup)

The following API call creates a snapshot, using the S3 repository registered in the above API call:

curl -X PUT "https://search.svc.codfw.wmnet:9243/_snapshot/elastic_snaps/snapshot_t309648?pretty" -H 'Content-Type: application/json' -d'
> {
>   "indices": "commonswiki_file",
>   "include_global_state": false,
>   "metadata": {
>     "taken_by": "bking",
>     "taken_because":  "T309648"
>   }
> }
> '

{
  "accepted" : true
}

A Note on Cross-cluster Restores

Because (as of this writing) all clusters write snapshots to the same endpoint and bucket, you can restore any snapshot to any cluster. Restoring to clusters with wildly different amounts of resources from the snapshot source (such as restoring a snapshot from a large production cluster to a small test cluster) requires changing index settings, see below.

Restore Snapshot

To restore the snapshot we should override a few index settings. The below settings may or may not be appropriate for your use case, please review as necessary. In the specific example below the number_of_replicas was set to 0 with the intent of expanding that post-restore using normal index recovery mechanisms. total_shards_per_node was set to a large value to allow a 32 shard index to be restored into a 6 node cluster.

curl -H 'Content-type: Application/json' -XPOST  \
    https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps/snapshot_t309648/_restore -d '{
        "indices": "commonswiki_file_1647921177",
        "include_global_state": false,
        "index_settings": {
            "index.number_of_replicas": 0,
            "index.auto_expand_replicas": null,
            "index.routing.allocation.total_shards_per_node": 8
        }
    }'

Check Status of Ongoing Restore

The restore operation uses the shard recovery process to restore an index’s primary shards from a snapshot. See the Elastic docs for more details. You can check status with the following command:

curl -H 'Content-type: Application/json' -XGET  \
    https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps/snapshot_t309648/_status?pretty

Post Restore

If the number_of_replicas was set to 0 during restore it's critical that we bring this back to the expected value post-restore. Additionally in this specific example the refresh_interval was increased to the default expected in that specific cluster.

curl -XPUT https://cloudelastic.wikimedia.org:9243/commonswiki_file/_settings -H 'Content-Type: application/json' -d '{
    "index.number_of_replicas": 1,
    "index.refresh_interval": "5m"
}'

Repository Errors

After deleting and creating a snapshot via the snapshot API, we've seen the following error:

Could not read repository data because the contents of the repository do not match its expected state. This is likely the result of either concurrently modifying the contents of the repository by a process other than this cluster or an issue with the repository's underlying storage. The repository has been disabled to prevent corrupting its contents. To re-enable it and continue using it please remove the repository from the cluster and add it again to make the cluster recover the known state of the repository from its physical contents.

This appears to have been a transient issue, but I'm documenting just in case it pops up again. Removing and re-registering the repository might fix such an error.

Why Enable the S3 Plugin?

Currently, there is no easy way to copy data from one cluster to another. The easiest way to do this is to use thanos-swift (an object storage service with an S3-compatible API) to move data around. Elasticsearch has better support for the S3 API as opposed to swift. This is (sadly) pretty common, as the Search platform team has seen with flink already.

Complicating factors

Getting the elastic keystore path right

By default, Elasticsearch requires the S3 client key and secret key (aka username and password) to be stored in its keystore, instead of in config files or its API (this is typical for values it considers sensitive).

We have an unorthodox Elastic environment: specifically, we run 2 or 3 Elasticsearch instances on a single host. As a result, using the elasticsearch-keystore command requires special care.

By default, elasticsearch-keystore invokes java with the wrong es.path.conf. We can override by setting ES_PATH_CONF when invoking elasticsearch-keystore:

export ES_PATH_CONF=/etc/elasticsearch/production-search-psi-codfw; /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.access_key

Permissions are also very important! The keystore file must have permissions root:elasticsearch and mode 0640 . If the elasticsearch service fails to start after a keystore change, check the paths and permissions. A brand-new elasticsearch-keystore file in /etc/elasticsearch/ means the ES_PATH_CONF=environment variable was not respected. If theelasticsearch-keystore file is owned by root:root , the service will not start.

The keystore file has no validation

"...the keystore has no validation to block unsupported settings. Adding unsupported settings to the keystore will cause Elasticsearch to fail to start." More at Elastic's website

The keystore file must be identical across all cluster nodes

Since we don't use shared storage and the keystore file isn't a simple flat file, we do some interesting stuff with puppet to make this work.

Path-style and bucket-style access

We use the thanos-swift cluster as our object store, via its S3-compatible API . "Real" S3 supports bucket-based access, which relies on DNS records. We don't have this, so we must use path-style access. Unfortunately, Elastic added, removed, then re-added support for this feature. This was more of an issue when we were on Elasticsearch 6. We are currently on 7, which officially supports path-style access.

Overloading LVS when creating a snapshot

When creating a snapshot, particularly from the 30+ node production clusters, we can end up overloading the LVS instances that load balance requests going into thanos-swift. We've been notified by network ops previously when creating a 32 shard snapshot with the default max_snapshot_bytes_per_sec of 40mb. Reducing to 20mb allowed the snapshot to complete without setting off alerts. This may not be necessary when snapshotting from a small cluster or an index with only a couple shards.

Premature end of Content-Length delimited message body

When restoring a snapshot, we've seen failures where it appears that data being loaded from thanos-swift into elasticsearch doesn't make it all the way there. Elasticsearch will make multiple attempts to restore the shard and each time it will get a different amount of data that doesn't match the expected index size. We don't know exactly which values made it work, but a successful snapshot/restore of a 32 shard, 1+TB index was taken with aggressively setting these values quite low, to a chunk_size of 50mb and max_restore_bytes_per_sec pulled down to 10mb in a 6 node cluster. Note that when using chunk_size less than 100mb buffer_size has to be reduced to match. It is likely these values could be somewhere between these values and the defaults, but we haven't experimented yet.

index.routing.allocation.total_shards_per_node

When moving an index between clusters it will, by default, load into the new cluster with the total_shards_per_node value that was appropriate to the source cluster but may not be appropriate to the target cluster. If a restore is attempted to a smaller cluster than the source without changing this value the snapshot will eventually fail with an allocation_explanation, from /_cluster/allocation/explain, of cannot allocate because allocation is not permitted to any of the nodes. This value needs to be overridden in the index_settings section of the snapshot restore api call.