Search/Elasticsearch Administration
This page is only for Search Platform-owned Elasticsearch environments. For higher-level documentation and emergency contacts, see this Google doc that is only visible to WMF staff.
Deployment
This documentation has been migrated to https://wikitech.wikimedia.org/wiki/Search/OpenSearch/Administration
Alerts/Dashboards
This documentation has been migrated to https://wikitech.wikimedia.org/wiki/Search/OpenSearch/Administration
Dashboards
This documentation has been migrated to https://wikitech.wikimedia.org/wiki/Search/OpenSearch/Administration
Operations
General Debugging Algorithm for SREs
Follow these steps in order to diagnose and resolve common cluster problems.
Important Notes:
- All commands should be run from a cumin host
- Each CirrusSearch host runs port 9243 (main cluster) plus either port 9443 or 9643 (secondary clusters)
- Examples use eqiad; replace with codfw when troubleshooting the CODFW datacenter
- Most examples show commands for the main cluster (9243); run on the secondary clusters (9443/9643) as needed
Common Issue 1: Cluster Health Issues (Red/Yellow Status)
Step 1: Initial Health Assessment
Start by checking cluster health across all clusters. This is your primary indicator of cluster status:
# Check cluster health (run from cumin host)
curl -s https://search.svc.eqiad.wmnet:9243/_cluster/health | jq .
curl -s https://search.svc.eqiad.wmnet:9443/_cluster/health | jq .
curl -s https://search.svc.eqiad.wmnet:9643/_cluster/health | jq .
Status Interpretation:
- 🟢 Green: All shards allocated, cluster fully operational
- 🟡 Yellow: All primaries allocated, some replicas missing (data redundancy at risk)
- 🔴 Red: Primary shards unassigned (data loss risk, search/indexing impaired)
Actions by Status:
- Yellow: Check for unassigned replicas, verify node capacity
- Red: URGENT - Identify missing primaries, check for node failures
For more detailed information about health checks, see General Health Checks (and what they mean).
Step 2: Node Status Verification
Verify all expected nodes are operational:
# Check all nodes are responsive (run from cumin host)
for port in 9243 9443 9643; do
curl -s "https://search.svc.eqiad.wmnet:${port}/_cat/nodes?v"
done
What to look for:
- All expected nodes present and responsive
- No nodes in "unresponsive" or "disconnected" state
- Consistent node versions across cluster
Step 3: Identify Problematic Shards
Find shards that are not in the STARTED state:
# Find unassigned shards
curl -s https://search.svc.eqiad.wmnet:9243/_cat/shards?v | grep -v STARTED
# Check allocation issues
curl -s https://search.svc.eqiad.wmnet:9243/_cat/allocation?v
Common issues to investigate:
- Unassigned shards (check unassigned.reason)
- Uneven shard distribution across nodes
- Nodes with high disk usage (>85% triggers allocation blocks)
Step 4: Analyze Allocation Decisions
For unassigned shards, understand why they can't be allocated:
# Explain specific shard allocation issues
curl -X GET "https://search.svc.eqiad.wmnet:9243/_cluster/allocation/explain" \
-H 'Content-Type: application/json' -d'{
"index": "problematic_index",
"shard": 0,
"primary": true
}'
Step 5: Monitor Recovery Status
Check for ongoing shard recoveries and identify stuck operations:
# Check active recoveries
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/recovery?active_only&v'
Recovery issues to watch for:
- Stuck recoveries (same stage for extended time)
- Failed recoveries requiring manual intervention
Step 6: Review Cluster Settings
Verify cluster configuration isn't causing issues:
# Check allocation settings
curl -s https://search.svc.eqiad.wmnet:9243/_cluster/settings | jq .persistent.cluster.routing
Important settings to verify:
- Allocation enabled/disabled (see the re-enable example below)
- Node exclusions
- Recovery throttling settings
- Rebalance settings
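If allocation was disabled and never re-enabled (for example, left over from a maintenance window), it can be turned back on with a settings update along these lines (a sketch; check the current value first):
# Re-enable shard allocation if it was left disabled after maintenance
curl -X PUT -H 'Content-Type: application/json' https://search.svc.eqiad.wmnet:9243/_cluster/settings -d'
{ "persistent": { "cluster.routing.allocation.enable": "all" } }'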
Step 7: Take Appropriate Action
Based on your findings, execute the appropriate recovery steps:
For Red status:
- Get any failed nodes back online
- Review allocation explain for specific shards
- Retry failed shard allocations:
curl -s -X POST https://search.svc.eqiad.wmnet:9243/_cluster/reroute?retry_failed=true
For Yellow status:
- Same as red, just less urgent
Step 8: Validation and Monitoring
After taking action, verify the cluster is recovering:
# Re-check cluster health
curl -s https://search.svc.eqiad.wmnet:9243/_cluster/health
# Monitor recovery progress
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/recovery?active_only&v'
Common Issue 2: Performance Issues (Green Cluster)
If the cluster is green but experiencing performance issues (elevated response times, rejections), follow these steps:
Step 1: Monitor Response Times
Check Grafana dashboards for performance metrics:
- Response Time Percentiles: Elasticsearch Percentiles Dashboard
- Pool Counters: Elasticsearch Pool Counters Dashboard
- Cache Hit Rates: Cache Hit Rate Graph (generally only relevant after datacenter switchovers or similar changes)
Key metrics to investigate:
- P95/P99 response time trends
- Pool counter rejection rates (MediaWiki rate limiting)
- Query throughput patterns
- Resource utilization spikes
Step 2: Check Thread Pool Status
Monitor thread pool rejections and queue sizes:
# Check thread pool status (search is most important, but check others as needed)
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/thread_pool/search?h=node_name,node_id,n,l,type,completed,rejected&v'
What to look for:
- High rejection rates (should be close to zero)
- Large queue sizes
- Nodes with significantly higher completed/rejected counts
Common causes:
- Too many "hot" shards on a single node
- Insufficient thread pool capacity
- Resource constraints (CPU/memory)
Step 3: Investigate Pathological Queries
Note: This is more theoretical but might be worth checking for long-running queries.
Check for long-running queries that may be impacting performance:
# Check running tasks
curl -s 'https://search.svc.eqiad.wmnet:9243/_tasks?detailed' | jq '.nodes | to_entries | .[0].value.tasks | to_entries | map({
action: .value.action,
desc: .value.description[0:100],
ms: (.value.running_time_in_nanos / 1000000)
}) | sort_by(.ms)'
Step 4: Take Performance Actions
For elevated response times, check:
- Recent changes in query volume or composition
- Cache hit rates (see the spot-check below)
- Resource utilization patterns
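Cache hit rates can also be spot-checked directly via the node stats API (a sketch; the hit/miss counters are cumulative since node start, so compare two samples taken a few minutes apart):
# Spot-check query and request cache counters across nodes
curl -s https://search.svc.eqiad.wmnet:9243/_nodes/stats/indices/query_cache,request_cache | \
  jq '.nodes[] | {name, query_cache: .indices.query_cache, request_cache: .indices.request_cache}'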
Additional Resources
For specific procedures related to issues found during debugging, see:
- Hardware failures
- Banning nodes from the cluster
- Recovering from an Elasticsearch outage
- Explain unallocated shards
- Thread pools
General Health Checks (and what they mean)
Elasticsearch has a built-in cluster health monitor. Green is the typical operating state and indicates that Elasticsearch has quorum and all shards are replicated adequately (replication is a user-controlled setting; our production clusters are set for 1 primary and 2 replicas). Red means there are unassigned primary shards and is bad because some requests will fail. Yellow means one or more shards are not replicated adequately; performance may be affected in the yellow state, but requests won't outright fail. It's normal for the cluster to occasionally dip into yellow due to transient errors, and Elasticsearch can typically fix these automatically by replicating and/or moving shards.
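When triaging, a quick way to pull out just the health fields that matter is jq field selection (a sketch):
# Summarize the health fields most relevant for triage
curl -s https://search.svc.eqiad.wmnet:9243/_cluster/health | \
  jq '{status, number_of_nodes, unassigned_shards, active_shards_percent_as_number}'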
Hardware failures
Elasticsearch is robust to losing nodes. If a hardware failure is detected, follow these steps:
- Create a ticket for DC Ops. Use the tag ‘ops-codfw’ or ‘ops-eqiad’ depending on the datacenter. (Example ticket). Subscribe to the ticket and make sure you’re in #wikimedia-dcops and #wikimedia-sre, as DC Ops engineers may reach out to you in these rooms.
- Depool the host from the load balancer (example below).
- Ban the host from the Elastic cluster.
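For the depool step, a conftool invocation along these lines should work (a sketch; the hostname is an example):
# Depool the failed host from the load balancer (hostname is an example)
sudo confctl select name=elastic1066.eqiad.wmnet set/pooled=no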
Rolling restarts
Lifecycle work (package updates, Java security updates) requires rolling restarts of the clusters. To that end, we have a cookbook (operational script) with the following options:
- Reboot: Reboot the whole server. Needed for some types of security updates.
- Restart: Only restarts the Elasticsearch service. Typically needed for Java security updates.
- Upgrade: Upgrade Elasticsearch.
For the larger clusters (codfw/eqiad), it’s acceptable to do 3 nodes at once. The smaller clusters (cloudelastic, relforge) are limited to 1 node at a time.
Example cookbook invocation:
sudo -i cookbook sre.elasticsearch.rolling-restart search_eqiad "restart for JVM upgrade" --start-datetime 2019-06-12T08:00:00 --nodes-per-run 3
where:
- search_eqiad is the cluster to be restarted
- --start-datetime 2019-06-12T08:00:00 is the time at which the operation is starting (which allows the cookbook to be restarted without restarting the already restarted servers).
- --nodes-per-run 3 is the maximum number of nodes to restart concurrently
During rolling restarts, it is a good idea to monitor a few Elasticsearch-specific things.
Things that can go wrong:
- some shards are not reallocated: Elasticsearch stops trying to recover shards after too many failures. To force reallocation, use the sre.elasticsearch.force-shard-allocation cookbook.
- master re-election takes too long: There is no way to preemptively force a master re-election. When the current master is restarted, an election will occur. This sometimes takes long enough that it has an impact. This might raise an alert and some traffic might be dropped. This recovers as soon as the new master is elected (1 or 2 minutes). We don't have a good way around this at the moment.
- cookbooks are force-killed or error out: The cookbook uses context managers for most operations that need to be undone (stop puppet, freeze writes, etc.). A force kill might not leave time to clean up. Some operations, like pool/depool, are deliberately not rolled back on exception, because an unknown exception might leave the server in an unknown state that requires manual checking.
- unexpected red status: Don't panic! This is usually due to an index with no alias. You can usually wait for a node to reboot and the problem will fix itself.
Since "usually" doesn't mean "always", I'll give you some more context and what you can do to fix this. Each user-facing index is actually an alias to the most current version of the wiki's index. An index that's in use will have at least one alias:
curl -s https://cloudelastic.wikimedia.org:9243/_alias?pretty
"zhwiktionary_content_1716769642" : {
  "aliases" : {
    "zhwiktionary" : { },
    "zhwiktionary_content" : { }
  }
}
An index without an alias doesn't contain any live data and is safe to delete. For example:
curl -s https://cloudelastic.wikimedia.org:9243/_alias?pretty
"eswiki_content_1716685673" : {
"aliases" : { }
}
You can get a view of all these un-aliased (empty, zero-document) indices via:
curl -s 'https://cloudelastic.wikimedia.org:9243/_cat/indices?h=index,docs.count' | awk '$2 == 0 { print $0 }'
Replacing master-eligibles
To do this safely, update the elastic config (found under unicast_hosts in the Puppet hieradata, here’s an example for CODFW) to add the new masters WITHOUT removing the old masters. Restart each node in the cluster, then update the config to remove the original masters. Restart the cluster again to activate the new masters.
You can also remove masters without restarting the cluster, see the Elastic docs. We haven't tested this one extensively, so don't try this in production.
Banning nodes from the cluster
Before network-related maintenances (such as switch reboots) or decommissioning, you should ban the node(s) from the cluster. This action removes all shards from the node, but it does NOT remove the node from the cluster state. The ban cookbook is the easiest way to ban the node(s). If you’re watching the cluster, you should see shards immediately start to move after your API call is accepted. You should also see used disk space decreasing on the banned node. For temporary bans (like for a switch reboot), don’t forget to unban after the maintenance!
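The cookbook is the preferred interface, but the underlying ban amounts to an allocation-exclude settings update along these lines (a sketch; the node name is an example, and the choice of persistent vs. transient here is an assumption):
# Ban a node by name; Elasticsearch will move its shards elsewhere
curl -X PUT -H 'Content-Type: application/json' https://search.svc.eqiad.wmnet:9243/_cluster/settings -d'
{ "persistent": { "cluster.routing.allocation.exclude._name": "elastic1066*" } }'
# To unban, set the exclusion back to null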
Recovering from an Elasticsearch outage/interruption in updates
Primary shard lost, replica shard considered outdated
This can happen on our smaller clusters (cloudelastic or relforge) as they have little to no redundancy. You can tell Elastic to use a shard it considers outdated with this API call:
curl -XPOST -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/_cluster/reroute -d'
{
  "commands": [
    { "allocate_stale_primary": {
        "index": "wikidatawiki_content_1692131793",
        "shard": 19,
        "accept_data_loss": true,
        "node": "cloudelastic1003-cloudelastic-chi-eqiad"
    }}
  ]
}'
After this, you need to run the backfill procedure for the time period the promoted replica missed.
Cluster Quorum Loss Recovery Procedure
This documentation has been migrated to https://wikitech.wikimedia.org/wiki/Search/OpenSearch/Administration#Cluster_Quorum_Loss_Recovery_Procedure
Avoiding Loss of Quorum by using the Voting Configuration Exclusions API
Placeholder, see this Elastic.co page for more details
Adding new masters/removing old masters
- Add new masters as "unicast_hosts" in our production puppet repo's cirrus.yaml file.
- After merging/puppet-merging, run the rolling operation cookbook with the "restart" argument; this brings the new masters into the cluster.
Check the master status in the elastic API. For example:
curl -s -XGET https://search.svc.codfw.wmnet:9243/_cat/nodes
The current master has an asterisk in the second-to-last field. Example output:
10.192.16.229 61 98 10 4.62 5.14 5.07 dimr * elastic2093-production-search-codfw
If one of the old hosts is still master, force a failover. Do this carefully (one host at a time) until one of the new master-eligibles is the active master.
Remove old hosts from "unicast_hosts" in our production puppet repo's cirrus.yaml file, but DO NOT remove them from the cluster completely (yet).
After merging/puppet-merging, run the rolling operation cookbook with the "restart" argument again. The old master-eligibles will lose their master-eligible status after a cluster restart, which is what we want.
Make/merge a new puppet patch completely removing the old master-eligibles from the cluster.
After this change is merged (and puppet-merged), we also need to run this script, which informs the other ES clusters of the new masters. The script requires manually creating an 'lst' file, here's an example on how to do that.
Adding new nodes
- (Step 1a) If they're cirrus (elasticsearch*) hosts, add the racking info for the new hosts in hieradata/regex.yaml (you can use Netbox to easily get the rack/row information, by searching like so). Otherwise for cloudelastic, add the entries in a yaml file per host in hieradata/hosts.
- (Step 1b) Also add the new nodes to the cluster_hosts list for each of the 3 elasticsearch instances in hieradata/role/$DATA_CENTER/elasticsearch/cirrus.yaml (example patch). Put every host in the main cluster, but divide the hosts between the two smaller clusters as evenly as possible (recall that each cirrus elastic* host runs two instances of Elasticsearch). Run puppet on all elastic* hosts in the same datacenter; this will open the firewall ports needed for the existing hosts to reach the new hosts.
- (Step 1d) Add new entries in conftool-data/node/$DATACENTER.yaml. These entries need to specify which of the two smaller clusters (psi or omega) a given host belongs to. (example patch)
- (Step 2) Flip the not-yet-in-service insetup hosts to role::elasticsearch::cirrus in manifests/site.pp (example patch). Run puppet on the new hosts twice (currently this needs to be done twice due to nonideal order of operations in our/o11y's puppet logic)
- (Step 3) Once the hosts have joined the clusters and are ready to receive requests from mediawiki, we'll need to pool them and set their weights. On the puppetmaster, configure the new hosts like so: sudo confctl select name=elastic206[1-9].* set/weight=10:pooled=yes. See Conftool for further documentation.
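To verify the result, conftool can read the state back (a sketch; hostnames follow the example above):
# Confirm the new hosts are pooled with the expected weight
sudo confctl select 'name=elastic206[1-9].*' get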
Decommissioning Nodes (Non-masters)
Use the ban cookbook to remove the node from the cluster. (See Banning Nodes from the Cluster for more details).
Elasticsearch will leave a shard on a node if it can't find another place for it to go. If you see that, you have probably removed too many nodes.
After you have confirmed that all shards have been moved off the node and the cluster state is yellow or green, follow the WMF standard decommissioning process. When decommissioning nodes, ensure that elasticsearch is cleanly stopped before the node is cut from the network.
Deploying Debian Packages
When we upgrade our Elasticsearch or Debian distro versions, we have to redeploy packages, including:
- elasticsearch-madvise
- liblogstash-gelf-java
- wmf-elasticsearch-search-plugins
In the case where upstream has not changed significantly, you (an SRE or someone with SRE access) can simply copy the Debian package from the previous distro repository, as described on the reprepro page.
Deploying plugins
Plugins (for search) are deployed using the wmf-elasticsearch-search-plugins debian package. This package is built using the ops-es-plugins repository and deployed to our apt repo like any other debian package we maintain.
The list of plugins we currently maintain is:
- extra
- extra-analysis
- highlighter
- s3-repository
They are currently released using the classic maven process (release:prepare|perform) and deployed to maven central. The repository name used for deploying to central is ossrh, and thus the engineer performing the release must have their ~/.m2/settings.xml with the following credentials set:
<settings>
  <servers>
    <server>
      <id>ossrh</id>
      <username>username</username>
      <password>password</password>
    </server>
  </servers>
</settings>
To request write permission to this repository, one must create a request like this one and obtain a +1 from a person who has already been granted access.
Please see discovery-parent-pom for generic advice on the build process of the java projects we maintain.
This section will change once we migrate to Opensearch. See this work-in-progress page for a preview.
Backup and Restore
Via Elastic S3 Snapshot Repository Plugin
See this page for more details.
Inspecting logs
This is specific to Elastic, but you may also need to check Cirrus logs.
Elasticsearch logs
The logs generated by elasticsearch are located in /var/log/elasticsearch/:
- production-search-eqiad.log is the main log; it contains all errors (mostly failed queries due to syntax errors). It's generally a good idea to keep this log open on the master and on the node involved in administration tasks.
- production-search-eqiad_index_search_slowlog.log contains the queries that are considered slow (thresholds are described in /etc/elasticsearch/elasticsearch.yml)
- elasticsearch_hot_threads.log is a snapshot of java threads considered "hot" (generated by the python script elasticsearch_hot_threads_logger.py; it takes a snapshot of hot threads every 10 or 50 seconds depending on the load of the node)
NOTE: the log verbosity can be changed at runtime, see elastic search logging for more details.
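For reference, a runtime verbosity change is a cluster settings update along these lines (a sketch; the logger name is an example, and transient settings are lost on a full cluster restart):
# Raise discovery logging to DEBUG at runtime (logger name is an example)
curl -X PUT -H 'Content-Type: application/json' https://search.svc.eqiad.wmnet:9243/_cluster/settings -d'
{ "transient": { "logger.org.elasticsearch.discovery": "DEBUG" } }'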
Using jstack or jmap or other similar tools to view logs
Our elasticsearch systemd unit sets PrivateTmp=true, which is overall a good thing. But this prevents jstack / jmap / etc. from connecting to the JVM as they expect a socket in the temp directory. The easy/safe workaround to viewing these logs is via nsenter (See T230774 or the following code snippet)
# Bash function for Java debugging via jstack (must be run as root)
systemctl-jstack() {
    if [ "$(whoami)" != 'root' ]; then
        printf "%s\n" "Error, this function should be run as root"
        return 1
    fi
    # Find the service's main PID and the user it runs as
    MAINPID=$(systemctl show "$1" | grep "^MainPID=" | awk -F= '{print $2}')
    SVC_USER=$(systemctl show "$1" | grep "^User=" | awk -F= '{print $2}')
    # Enter the service's mount namespace so jstack can find the JVM attach socket in its private /tmp
    nsenter -t "$MAINPID" -m sudo -u "$SVC_USER" jstack "$MAINPID"
}
Multi-DC / Multi-Cluster Operations
We have deployed full size elasticsearch clusters in eqiad and codfw. The cluster endpoints for readonly traffic are at search.discovery.wmnet and read/write traffic flows through search.svc.{eqiad,codfw}.wmnet. The three clusters are identified as search, search-psi, and search-omega.
Read Operations
By default, application servers are configured to query the elasticsearch cluster in their own datacenter via DNS/Discovery. Traffic can be shifted to a single datacenter for maintenance operations when necessary via the standard confctl commands.
Shift all traffic away from a datacenter
for cluster in search-omega search-psi search; do
sudo confctl --object-type discovery select "dnsdisc=${cluster},name=codfw" set/pooled=false
done
Bring traffic back
for cluster in search-omega search-psi search; do
sudo confctl --object-type discovery select "dnsdisc=${cluster},name=codfw" set/pooled=true
done
Write Operations
Elasticsearch does not support replicating search indices across clusters; instead, we perform writes to every cluster where we want the data stored.
In our streaming updater pipeline, a producer application in each datacenter reads events from mediawiki in the same datacenter and generates a stream of pages that need to be updated. That stream of updates is mirrored to all datacenters by the kafka infrastructure. A consumer application running in the same datacenter as the elasticsearch cluster consumes the updates from all datacenters and applies them to the local cluster.
Removing Duplicate Indices
When index creation bails out, perhaps due to a thrown exception, the cluster can be left with multiple indices of a particular type but only one active. CirrusSearch contains a script, scripts/check_indices.py, that will check the configuration of all wikis in the cluster and then compare the set of expected indices with those actually in the elasticsearch clusters. You can run this script from the mwmaint servers (mwmaint servers are assigned to the role(mediawiki::maintenance) via site.pp in the Puppet repo):
/srv/mediawiki/php/extensions/CirrusSearch/scripts/check_indices.py | jq .
When everything is as expected, the output will be an empty JSON array ([]). Note that it might take a minute or three before giving an answer.
When something is wrong, the output will contain an object for each cluster, and then the cluster will have a list of problems found:
[
  {
    "cluster_name": "production-search-eqiad",
    "problems": [
      {
        "reason": "Looks like Cirrus, but did not expect to exist. Deleted wiki?",
        "index": "ebernhardson_test_first"
      }
    ],
    "url": "https://search.svc.eqiad.wmnet:9243"
  }
]
Indices still have to be manually deleted after reviewing the output above. As we gain confidence in the output through manual review, the tool can be updated with options to automatically apply deletes.
curl -XDELETE https://search.svc.eqiad.wmnet:9243/ebernhardson_test_first
Rebalancing shards to even out disk space use
Why?
If one node has radically different disk space usage than the rest of the cluster, this probably means Elastic has decided some other nodes are not qualified to host shards. One possibility is lack of capacity in a particular rack, as our production cluster config has rack/row awareness.
Viewing free disk space per node:
curl -s localhost:9200/_cat/nodes?h=d,h | sort -nr
To move shards away from the nodes with the least free disk, we can temporarily lower the cluster.routing.allocation.disk.watermark.high setting. The high watermark is the limit at which Elasticsearch will start moving shards off a node to free up space. The high watermark can be modified like so:
curl -H 'Content-Type: application/json' -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.high": "70%"
  }
}'
Keep an eye on the number of indices on a few nodes
When banning nodes to prepare for some maintenance operations, it is useful to see how many shards are left on those nodes.
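One way to do this is to count shard rows per node from the _cat/shards API (a sketch, not an established snippet from this runbook):
# Count shards per node; banned nodes should drain toward zero
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/shards?h=node' | sort | uniq -c | sort -rn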
Elasticsearch Curator
elasticsearch-curator is a python tool which provides high level configuration for some maintenance operations. Its configuration is based on action files. Some standard actions are deployed on each elasticsearch node in /etc/curator. For example, you can disable shard routing with:
sudo curator --config /etc/curator/config.yaml /etc/curator/disable-shard-allocation.yaml
Curator must be run as root to access its log file (/var/log/curator.log).
Explain unallocated shards
Elasticsearch 5.2 introduced a new API endpoint to explain why a shard is unassigned:
curl -s localhost:9200/_cluster/allocation/explain?pretty
Anti-affinity (shards limit) prevents shards from assignment
Our Elastic hosts are spread across four rows in the datacenter (A-D). Within a single row, we allow no more than 2 replicas per shard. This can lead to unassigned replicas, especially during maintenance operations. Check for this with the following command:
curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty " -H 'Content-Type: application/json' -d' | jq '.node_allocation_decisions[] | select (.node_attributes.row == "C ")'
{
"index ": "enwiki_content_1658309446 ",
"shard ": 15,
"primary ": false
}
Shards stuck in recovery
We've seen cases where some shards get stuck in recovery and never complete the recovery process. Temporarily reducing the number of replicas and increasing it again once the cluster is green seems to fix the issue. So far, this has happened only during the 5.6.4 to 6.5.4 upgrade.
curl -k -X PUT "https://localhost:9243/$index/_settings" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "auto_expand_replicas" : "0-2",
    "number_of_replicas" : 2
  }
}
'
No updates
If no updates are flowing, the usual culprits are:
- Kafka queues are having issues (or eventgate, which writes to kafka for mediawiki).
- These should be causing alerts in #wikimedia-operations
- Job runners are having issues
- Review the JobQueue Job dashboard to look for backlogging
- Streaming Updater consumers or producers are having issues
- Review the Streaming Updater and Flink App grafana dashboards
- Check on pod lifetimes. If they are single-digit minutes, the problem could be there.
Stuck in old GC hell
When memory pressure is excessive the GC behaviour of an instance will switch from primarily invoking the young GC to primarily invoking the old GC. If the issue is limited to a single instance it's possible the instance has an unlucky set of shards that require more memory than the average instance. Resolving this situation involves:
- Ban the node from cluster routing
- Wait for all shards to drain
- Restart instance
- Unban the node from cluster routing
Draining all the shards away ensures the instance will get a new selection of shards when it is unbanned. Restarting the instance is not strictly required, but it's nice to start with a fresh JVM after the old one has been misbehaving.
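To confirm the drain before restarting, watching the node's shard count fall to zero works (a sketch; the node name is an example, and the ban itself is applied via the cookbook or the exclude setting shown earlier):
# Watch the banned node's shard count fall to zero before restarting it
watch "curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/shards?h=node' | grep -c elastic1066"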
CirrusSearch full_text eqiad 95th percentile
Latency of full_text queries is a good measure of search performance. As such, we have alerts for it. Unfortunately, one or two overloaded nodes can cause this alert to trigger. The workaround is to ban the troublesome node; then you can try restarting the Elastic service and/or rebooting the bad host.
Pool Counter rejections (search is currently too busy)
Sometimes users' searches may fail with the message: "An error has occurred while searching: Search is currently too busy. Please try again later."
This is generally due to a spike in traffic which triggers MediaWiki's PoolCounter, which in the case of CirrusSearch will drop requests if the traffic spike is too large.
There is an alert mediawiki_cirrus_pool_counter_rejections_rate which should warn if a concerning number of CirrusSearch PoolCounter rejections are detected. The alert can be checked in icinga.
If the alert fires, take some time to consider whether the current PoolCounter thresholds should be raised to allow a larger queue on MediaWiki's end. If the threshold cannot be increased, then all there is to do is verify that the traffic spike is organic as opposed to the result of some bug upstream.
Indexing hung and not making progress
This alert fires when an individual Elastic node's index size doesn't change for 30 minutes or more.
Possible cause: Nodes hung
Elastic nodes can get hung up indexing and stop doing work. We've seen this happen from a bug in our code, although theoretically it could be caused by other issues related to ingest (bad data, too much data, etc). This manifests as increased Search Thread Pool Rejections per Second. When this happens, you may have to temporarily shut off mjolnir services on search-loader hosts to alleviate backpressure.
Another possible cause: Node banned
This can also mean that the host has been banned (removed from the cluster, typically for a maintenance) and we forgot to unban it when finished. This can be fixed by running the ban cookbook again with the 'unban' action.
Deploying Elasticsearch config (.yml/jvm) changes
Elasticsearch needs to be restarted in a controlled manner. When puppet deploys configuration changes for elasticsearch, the changes are applied on disk but the running service is not restarted. After puppet has successfully run on all instances, follow the #Rolling restarts runbook.
CirrusSearch titlesuggest index is too old
This alert fires when titlesuggest indices, which power fuzzy title completion, are older than expected for at least one wiki. This is not an urgent problem; the old version of these indices covers most use cases, but over time the changes compound and become more noticeable. The exact indices that are having problems can be checked with the following query, which will give the 5 oldest titlesuggest indices for the given cluster.
curl -XPOST -H 'Content-Type: application/json' 'https://search.svc.eqiad.wmnet:9243/*_titlesuggest,omega:*_titlesuggest,psi:*_titlesuggest/_search' -d '{
"size": 0,
"aggs": {
"by_index": {
"terms": {
"field": "_index",
"order": {"max_batch_id": "asc"}
},
"aggs": {
"max_batch_id": {"max":{"field": "batch_id"}}
}
}
}
}
' | jq '.aggregations.by_index.buckets[].max_batch_id.value |= (. | strftime("%Y-%m-%d %H:%M UTC"))'
The titlesuggest indices are built by the mediawiki_job_cirrus_build_completion_indices_{eqiad,codfw} systemd timer typically run on mwmaint1002.eqiad.wmnet. The relevant logs, in mwmaint1002.eqiad.wmnet:/var/log/mediawiki/mediawiki_job_cirrus_build_completion_indices_{eqiad,codfw} should be checked to find out why the given titlesuggest indices are not being built.
Thread pools
Elasticsearch thread pools help manage memory consumption on individual nodes. When an individual node reaches its maximum queue size, it will reject further requests associated with that particular queue. Not surprisingly, the search thread pool is the most important one for our use case.
You can track these rejections via the Elasticsearch API:
curl 'localhost:9200/_cat/thread_pool/search?h=node_name,node_id,n,l,type,completed,rejected&v'
Example output:
node_name node_id n l type completed rejected
elastic1071-production-search-eqiad ocZjIxEhQb-TojSu1v6NYQ search 73 fixed_auto_queue_size 140760721 0
elastic1056-production-search-eqiad 8grTo-qVTe-a-QOptlXvAA search 61 fixed_auto_queue_size 79014359 0
elastic1066-production-search-eqiad eS9G_qj7RlC5dFdhH12MMQ search 61 fixed_auto_queue_size 142933677 66406
Too many hot shards can lead to overloaded queues/rejections
Under normal circumstances, the rejection rate should be close to zero. However, if enough shards from the busiest wikis end up on a single host, you will see rejections. In the above example, elastic1066 has too many hot shards, and cannot keep up. You can usually mitigate this issue by:
- banning the node
- waiting for shards to drain
- unbanning
For those with SRE access, the easiest way to do this is via this cookbook.
Identifying the active master and master-eligibles from the shell
The following one-liner will give you a list of master-eligibles:
dc=eqiad  # or codfw
for n in 9243 9443 9643; do curl -s "https://search.svc.${dc}.wmnet:${n}/_cat/nodes" | grep dimr; done
The "m" in "dimr" indicates a master-eligible host:
10.192.48.56 62 96 0 0.33 0.66 0.59 dimr - elastic2084-production-search-codfw
If you see an asterisk '*' in the 9th field, that means the host is the active master:
10.192.16.228 59 96 0 0.38 0.44 0.52 dimr * elastic2092-production-search-omega-codfw
Be sure to run this from outside the Cirrussearch hosts (such as cumin, mwmaint, or deploy servers): each Cirrussearch server only hosts one of the two secondary clusters, so running it locally will miss one of the clusters entirely.
Forcing the cluster to retry shard scheduling
Elasticsearch won't try to reschedule shards forever. Eventually it will stop trying to move shards in state INITIALIZING with reason ALLOCATION_FAILED and set their state to UNASSIGNED. If you've added more capacity and you'd like the cluster to try again, use this API call:
curl -s -XPOST "https://search.svc.${dc}.wmnet:9243/_cluster/reroute?retry_failed=true"
If your call works, you should immediately see shards go into MANUAL_ALLOCATION state in the /_cat/shards API. Note that this doesn't mean the shards will definitely get scheduled; it just means Elastic will try again.
Warmup a cold cluster
If all read traffic was shifted away from a full cluster, we have seen load spikes when bringing traffic back, plausibly due to caches needing to be warmed up. A simple warmup procedure of running a basic query against all indices in the cluster has shown some improvement:
dc="eqiad"
for port in 9243 9443 9643; do
curl -o /dev/null https://search.svc.$dc.wmnet:$port/_search?q=a+and+the+to+on+for+be+of+in
done
CirrusSearchNodeIndexingNotIncreasing alerts
If the alert is for a single host, possible causes include:
- New host online
- Host was banned and needs to be unbanned
- Firewall rules
If the alert is for multiple hosts, it could be a sign that the Cirrus Streaming Updater is not running in that datacenter.
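To confirm whether indexing is actually progressing on the nodes in question, the per-node indexing counters can be sampled and compared over time (a sketch; index_total is cumulative and should be increasing on a healthy node):
# Sample per-node indexing counters; re-run after a few minutes and compare
curl -s https://search.svc.eqiad.wmnet:9243/_nodes/stats/indices/indexing | \
  jq '.nodes[] | {name, index_total: .indices.indexing.index_total}'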
Other elasticsearch clients
Cirrus is not the only application using our "search" elasticsearch cluster. In particular:
- API Feature requests
- Translate Extension
- Phabricator
- Toolhub
Constraints on elasticsearch clients
Reading this page should already give you some idea of how our elasticsearch cluster is managed. A short checklist that can help you:
- Writes need to be robust: Our elasticsearch cluster can go into read-only mode for various reasons (loss of a node, maintenance operations, ...). While this is relatively rare, it is part of normal operation. The client application is responsible for implementing queuing/retries (see the sketch after this list).
- Multi-DC aware: We have 2 independent clusters, and writes have to be duplicated to both, including index creation. The client application has to provide a mechanism to switch from one cluster to the other (in case of loss of a datacenter, major maintenance on the cluster, ...).
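As an illustration of the retry behaviour expected from clients, here is a minimal bash sketch (the index, document, and backoff policy are hypothetical; a real client should queue failed writes rather than drop them):
# Retry a write with linear backoff; a real client should also queue on persistent failure
for attempt in 1 2 3 4 5; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -X PUT \
    -H 'Content-Type: application/json' \
    'https://search.svc.eqiad.wmnet:9243/my_index/_doc/1' -d '{"field": "value"}')
  if [ "$code" = "200" ] || [ "$code" = "201" ]; then break; fi
  sleep $((attempt * 5))
done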
Pathological Queries and the Tasks API
Sometimes pathological queries can get "stuck" burning cpu as they try to execute for many minutes. We can ask elasticsearch what tasks are currently running with the tasks api:
curl https://search.svc.eqiad.wmnet:9243/_tasks?detailed > tasks
This will dump a multi-megabyte json file that lists all tasks (such as shard queries, or the parent query that issued shard queries) and how long they have been running. The inclusion of the detailed flag means all shard queries will report the index they are running against, and the parent query will report the exact text of the json search query that cirrussearch provided. For an extended incident this can be useful to track down queries that have been running for extended periods of time, but for brief blips it may not have much information.
For a quick look at tasks on a node, you can use something like:
curl -s 'localhost:9200/_tasks/?detailed&nodes=_local' | \
jq '.nodes | to_entries | .[0].value.tasks | to_entries
| map({
    action: .value.action,
    desc: .value.description[0:100],
    ms: (.value.running_time_in_nanos / 1000000)
  }) | sort_by(.ms)' | \
less
Performance, Stability, and Fault Tolerance at the Elastic layer
Tuning shard counts
Optimal shard count requires making a tradeoff between competing factors. Each Elasticsearch shard is actually a Lucene index which requires some amount of file descriptors/disk usage, compute, and RAM. So, a higher shard count causes more overhead, due to resource contention as well as "fixed costs".
Since Elasticsearch is designed to be resilient to instance failures, if a node drops out of the cluster, shards must rebalance across the remaining nodes (likewise for changes in instance count, etc). Shard rebalancing is rate-limited by network throughput, and thus excessively large shards can cause the cluster to be stuck "recovering" (rebalancing) for an unacceptable amount of time.
Thus, the optimal shard size is a balancing act between overhead (which is optimized via having larger shards), and rebalancing time (which is optimized via smaller, more numerous shards). Due to the problem of fragmentation, we also don't want a given shard to be too large a % of the available disk capacity.
In most cases we don't want shards to exceed 50GB, and ideally they wouldn't be smaller than 10GB (of course, for small indices this is unavoidable). Although we are currently topping out at 50GB, our recent upgrade to 10Gbps networking across all production clusters could enable us to use even larger shards than 50GB.
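To check actual shard sizes against these targets, _cat/shards can be filtered by index pattern (a sketch; bytes=b makes the store column numerically sortable):
# List enwiki_content shard sizes in bytes, largest first
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/shards/enwiki_content*?h=index,shard,prirep,store&bytes=b' | sort -k4 -rn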
The final consideration is that different indices receive different levels of query volume. enwiki and dewiki have the highest volume. That means that nodes hosting enwiki_content shards will receive disproportionate load relative to nodes that lack those shards. As such, we want to set our total shard count in relation to the number of servers we have available.
For example, a cluster with 36 servers would ideally have a total shard count that is close to a multiple of 36. We generally like to go slightly "under" to be able to weather losing nodes. In short, we want to maintain the invariant that most servers have the same number of shards for a given heavy index, with a few servers having 1 less shard so that we have headroom to lose nodes.
In conclusion: we want shards close to 50GB, but we also need our shard count set to avoid having any particularly hot servers, while making sure the shard count isn't completely even so that we can still afford to lose a few nodes. We absolutely want to avoid having a ridiculous number of extremely tiny shards, since the thrash/resource contention would grind the cluster to a halt.