Search/OpenSearch/Administration
This page covers only Search Platform-owned OpenSearch environments. For higher-level documentation and emergency contacts, see this Google Doc, which is visible only to WMF staff.
Primary Audiences
- The Data Platform SRE team, as they own/maintain the infrastructure.
- The Search Platform Team, as they own/maintain the code that runs on this infrastructure (the lines between DPE SRE and Search Platform are a little blurry sometimes, and that's OK).
- Any other technical Wikimedian that might need to help out in an emergency.
Deployment
The Wikimedia Foundation's two primary datacenters in Virginia (eqiad) and Dallas (codfw) host the OpenSearch services that power Wikipedia's search feature.
Each environment hosts multiple clusters. Let's start with the environments:
Deployment Environments (codfw, eqiad, cloudelastic, relforge)
| Server Names | Purpose | Hardware spec (relative to other envs) | Approximate number of nodes | Cluster-specific config (memory, ports, master-eligibles) |
|---|---|---|---|---|
| cirrussearch1xxx.eqiad.wmnet, cirrussearch2xxx.codfw.wmnet | Production: when you do a search on a Wikipedia site, this is what answers back. | Highest | 50 per datacenter | CODFW |
| cloudelastic1xxx.eqiad.wmnet | Hosts sanitized search indices for use by Toolforge projects. Not considered production. | Same as production, but with bigger disks, fewer nodes, and each node running all 3 clusters. | 6 total, deployed in a single datacenter (EQIAD) | Cloudelastic |
| relforge1xxx.eqiad.wmnet | Testing. Secondary clusters are configured, but don't hold any data. It's OK to break this environment. | Smaller | 3 | relforge |
Next, let's take a closer look at the clusters that exist in each environment.
Clusters (chi, psi, omega)
Each production server hosts the main cluster (chi), plus one of the two secondary clusters (psi or omega). The secondary clusters host the search data for smaller wikis.
Why Multiple Clusters?
Splitting the smaller wikis off into two secondary clusters (psi and omega) reduces cross-node chatter without reducing the number of indices hosted.
Cluster Endpoints
| Cluster | cloudelastic (8xxx ports are read-only) | codfw | eqiad | relforge |
|---|---|---|---|---|
| chi (main) | https://cloudelastic.wikimedia.org:8243 | https://search.svc.codfw.wmnet:9243 | https://search.svc.eqiad.wmnet:9243 | N/A; you must access the cluster locally from one of its nodes. For example: bking@relforge1010:~$ curl -s http://localhost:9200/_cat/nodes |
| omega | https://cloudelastic.wikimedia.org:8443 | https://search.svc.codfw.wmnet:9443 | https://search.svc.eqiad.wmnet:9443 | |
| psi | https://cloudelastic.wikimedia.org:8643 | https://search.svc.codfw.wmnet:9643 | https://search.svc.eqiad.wmnet:9643 | |
Config management
The WMF uses Puppet as its configuration and state management solution. The profile::opensearch::cirrus::server class configures our production cirrussearch nodes.
OpenSearch version support
As of this writing, we use OpenSearch 1.3 in production. We mirror the upstream OpenSearch Debian repositories, and we use pinning to help ensure the correct version of OpenSearch is installed on our hosts.
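For illustration, Debian pinning for a package looks roughly like this. Note this is a sketch only: the actual preferences file is managed by Puppet, and the path, package glob, and version pattern below are assumptions, not the real values.

```
# /etc/apt/preferences.d/opensearch.pref (illustrative path and values)
Package: opensearch*
Pin: version 1.3.*
Pin-Priority: 1001
```

A pin priority above 1000 forces the pinned version to be installed even when a newer version is available in the repository.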
Operations
Hardware failures
OpenSearch is robust to losing nodes. If a hardware failure is detected, follow these steps:
- Create a ticket for DC Ops. Use the tag ‘ops-codfw’ or ‘ops-eqiad’ depending on the datacenter. (Example ticket). Subscribe to the ticket and make sure you’re in #wikimedia-dcops and #wikimedia-sre, as DC Ops engineers may reach out to you in these rooms.
- Depool the host from the load balancer.
- Ban the host from the OpenSearch cluster.
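Under the hood, banning a node amounts to a cluster-level shard allocation exclusion. A minimal sketch of the equivalent API call follows; the hostname is hypothetical and the endpoint is the codfw example from the table above. In practice, prefer the ban cookbook.

```shell
# Hypothetical failed host; in practice use the ban cookbook instead.
HOST_TO_BAN="cirrussearch2042"
ENDPOINT="https://search.svc.codfw.wmnet:9243"
# Build the settings payload: exclude the host's node name(s) from shard allocation.
PAYLOAD="{\"transient\": {\"cluster.routing.allocation.exclude._name\": \"${HOST_TO_BAN}*\"}}"
curl -s -XPUT "${ENDPOINT}/_cluster/settings" \
  -H 'Content-Type: application/json' -d "${PAYLOAD}"
```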
Rolling restarts
Lifecycle work (package updates, Java security updates) requires rolling restarts of the clusters. To that end, we have a cookbook (operational script) with the following options:
- Reboot: Reboot the whole server. Needed for some types of security updates.
- Restart: Only restarts the OpenSearch service. Typically needed for Java security updates.
- Upgrade: Upgrade OpenSearch.
For the larger clusters (codfw/eqiad), it’s acceptable to do 3 nodes at once. The smaller clusters (cloudelastic, relforge) are limited to 1 node at a time.
Example cookbook invocation:
sudo -i cookbook sre.elasticsearch.rolling-restart search_eqiad "restart for JVM upgrade" --start-datetime 2099-06-12T08:00:00 --nodes-per-run 3
where:
- search_eqiad is the cluster to be restarted
- --start-datetime 2099-06-12T08:00:00 is the time at which the operation starts (this allows the cookbook to be re-run without touching servers that were already restarted)
- --nodes-per-run 3 is the maximum number of nodes to restart concurrently
During rolling restarts, it is a good idea to monitor a few OpenSearch-specific things, such as cluster status and shard allocation.
Things that can go wrong:
- some shards are not reallocated: OpenSearch stops trying to recover shards after too many failures. To force reallocation, use the sre.elasticsearch.force-shard-allocation cookbook.
- master re-election takes too long: There is no way to preemptively force a master re-election. When the current master is restarted, an election will occur. This sometimes takes long enough that it has an impact. This might raise an alert and some traffic might be dropped. This recovers as soon as the new master is elected (1 or 2 minutes). We don't have a good way around this at the moment.
- cookbook is force-killed or errors out: The cookbook uses context managers for most operations that need to be undone (stopping Puppet, freezing writes, etc.). A force kill might not leave time to clean up. Some operations, such as pool/depool, are deliberately not rolled back on exception, because an unknown exception might leave the server in an unknown state; these require manual checking.
- unexpected red status: Don't panic! This is usually due to an index with no alias. You can usually wait for a node to reboot and the problem will fix itself.
Since "usually" doesn't mean "always", here's some more context and how to fix it. Each user-facing index name is actually an alias to the most current version of the wiki's index. An index that's in use will have at least one alias:
curl -s https://cloudelastic.wikimedia.org:9243/_alias?pretty
"zhwiktionary_content_1716769642" : {
  "aliases" : {
    "zhwiktionary" : { },
    "zhwiktionary_content" : { }
  }
}
An index without an alias doesn't contain any data and is safe to delete. For example:
curl -s https://cloudelastic.wikimedia.org:9243/_alias?pretty
"eswiki_content_1716685673" : {
  "aliases" : { }
}
You can get a view of all these un-aliased indices via:
curl -s https://cloudelastic.wikimedia.org:9243/_cat/indices | awk '$6 == 0 { print $0 }'
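Before deleting, it's worth double-checking that the index really has no aliases. A minimal sketch, using the example index name from above (the endpoint is assumed reachable from where you run this):

```shell
INDEX="eswiki_content_1716685673"   # example un-aliased index from above
ENDPOINT="https://cloudelastic.wikimedia.org:9243"
# Count the aliases on the index; only delete when the count is zero.
ALIAS_COUNT=$(curl -s "${ENDPOINT}/${INDEX}/_alias" \
  | python3 -c "import sys, json; print(len(json.load(sys.stdin)['${INDEX}']['aliases']))" 2>/dev/null)
if [ "${ALIAS_COUNT}" = "0" ]; then
  curl -s -XDELETE "${ENDPOINT}/${INDEX}"
fi
```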
Replacing master-eligibles
To do this safely, update the OpenSearch config (found under unicast_hosts in the Puppet hieradata, here’s an example for CODFW) to add the new masters WITHOUT removing the old masters. Restart each node in the cluster, then update the config to remove the original masters. Restart the cluster again to activate the new masters.
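An illustrative hieradata fragment for the add-then-remove dance (the key path and hostnames are assumptions; the page only says the list lives under unicast_hosts):

```yaml
# Pass 1: add the new master alongside the old ones, then roll-restart.
# Pass 2: remove the old master from this list, then roll-restart again.
unicast_hosts:
  - cirrussearch2055.codfw.wmnet  # existing master (removed in pass 2)
  - cirrussearch2090.codfw.wmnet  # new master (added in pass 1)
```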
You can also remove masters without restarting the cluster; see the OpenSearch docs. We haven't tested this approach extensively, so don't try it in production.
Banning nodes from the cluster
Some operations (decommissioning, for example) are a bit more stable if you first ban the node(s) from the cluster. This action removes all shards from the node, but it does NOT remove the node from the cluster state. The ban cookbook is the easiest way to ban the node(s). If you’re watching the cluster, you should see shards immediately start to move after your API call is accepted. You should also see used disk space decreasing on the banned node. Note that banning is a 'best-effort' action, and OpenSearch will not violate anti-affinity rules to fulfill a ban request.
Bans should not be permanent
When you're finished with your decom or other maintenance operation, run the cookbook with the 'unban' option. We left a bunch of banned hosts in our cluster state after the Elastic->OpenSearch migration and it led to some cluster quorum failures (see this Phab ticket for more details).
The bans themselves aren't really the issue, but the banned hosts are a part of the cluster state. We should try to keep the cluster state as small as possible, for stability's sake.
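The unban cookbook is the right tool for lifting bans, but for auditing it can help to see the raw exclusion list. A sketch of checking and clearing it via the cluster settings API (the endpoint is the eqiad example from above; setting the exclusion to null clears it entirely):

```shell
ENDPOINT="https://search.svc.eqiad.wmnet:9243"
# Show any allocation exclusions currently in the cluster settings.
curl -s "${ENDPOINT}/_cluster/settings?flat_settings=true" \
  | grep -o '"cluster.routing.allocation.exclude[^"]*"'
# Clear the name-based exclusion list entirely (i.e. unban everything).
curl -s -XPUT "${ENDPOINT}/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'
```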
Deploying our custom OpenSearch plugins Debian package
Our CirrusSearch deployment includes several custom plugins that improve the user search experience:
- extra
- extra-analysis
- highlighter
- s3-repository
These plugins are bundled into a Debian package, which is hosted in the WMF GitLab instance. CI will build the package, but it's the SREs' responsibility to deploy it.
To do this:
- Copy the latest Deb package to the apt repo server. Run this Ansible playbook; it will pull the latest plugins package from the GitLab repo, copy it to the server which hosts our Debian repos (apt1002.wikimedia.org as of this writing), and render a bash script that you'll run in the next step. See the README for more details on how to use the playbook.
- Publish the new package. From the server which hosts the Debian repos, run the script rendered by Ansible. Example command:
bash -x publish-wmf-opensearch-search-plugins_1.3.20+12-bullseye.sh
FIXME: Add a non-default option in the Ansible playbook to run the script instead of forcing the user to invoke it manually.
- Log the command in the SAL. I like to paste the exact command rendered by the script, plus the relevant task ID. Example:
!log bking@apt1002 sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia /home/bking/wmf-opensearch-search-plugins-1.3.20+12-bullseye/wmf-opensearch-search-plugins_1.3.20+12_amd64.changes T407520
- Confirm the new package is available on a single host (any relforge/cloudelastic/cirrussearch host will do).
bking@relforge1008:~$ apt policy wmf-opensearch-search-plugins # we're looking for version 1.3.20+12-bullseye, is it available yet?
wmf-opensearch-search-plugins:
Installed: 1.3.20+9~bullseye
Candidate: 1.3.20+9~bullseye
bking@relforge1008:~$ sudo apt-get update # looks like no, let's try an apt update
bking@relforge1008:~$ apt policy wmf-opensearch-search-plugins
wmf-opensearch-search-plugins:
Installed: 1.3.20+9~bullseye
Candidate: 1.3.20+12~bullseye # after `apt-get update`, apt is aware of the change and we can move on to the next step.
- Install the new package on all hosts
sudo cumin 'A:cloudelastic or A:relforge' 'apt-get update && apt-get install --only-upgrade wmf-opensearch-search-plugins'
sudo cumin 'A:cirrussearch' 'apt-get update && apt-get install --only-upgrade wmf-opensearch-search-plugins'
- Roll-restart the clusters via cookbook
Example commands:
sudo cookbook sre.elasticsearch.rolling-operation relforge --restart --without-lvs --nodes-per-run 1 --start-datetime "2099-11-13T16:56:50" --task-id T407520 "apply wmf-opensearch-search-plugins update"
sudo cookbook sre.elasticsearch.rolling-operation cloudelastic --nodes-per-run 1 --start-datetime "2099-11-13T16:56:50" --task-id T407520 "apply wmf-opensearch-search-plugins update"
sudo cookbook sre.elasticsearch.rolling-operation search_eqiad --nodes-per-run 3 --start-datetime "2099-11-13T16:56:50" --task-id T407520 "apply wmf-opensearch-search-plugins update" # repeat for search_codfw
- Verify that all hosts have the new version of the plugin
Check the version number of the plugins that were updated by browsing to the package repo changelog.
Every changelog entry contains one or more plugin updates. For example, here's a single changelog entry with multiple plugin updates:
wmf-opensearch-search-plugins (1.3.20+9) bullseye; urgency=medium
* Bump opensearch-extra to 1.3.20-wmf6
* Bump cirrus-highlighter to 1.3.20-wmf2
To check if the plugins are active, we'll call the _cat/plugins API route. Let's grep for one of the updated plugins, cirrus-highlighter, and filter out any responses that have the expected version of the plugin, 1.3.20-wmf2.
curl -s https://search.svc.eqiad.wmnet:9243/_cat/plugins | grep cirrus-highlighter | grep -v 1.3.20-wmf2
If the response is empty, that means the plugins were updated everywhere. If the response contains hostnames, such as cirrussearch1119-production-search-eqiad cirrus-highlighter 1.3.20-wmf1 , it means you missed one or more hosts. You may need to update the Debian package and/or restart OpenSearch services on that host.
You'll need to verify this for all environments and clusters as noted above (and yes, we should really create automation to do this for us).
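Until that automation exists, the per-cluster check can be looped over every LVS endpoint from the endpoints table. This is a sketch: relforge still has to be checked locally on one of its nodes, and the plugin name and version are the examples from above.

```shell
PLUGIN="cirrus-highlighter"
EXPECTED="1.3.20-wmf2"   # example version from the changelog above
for endpoint in \
    https://search.svc.eqiad.wmnet:9243 https://search.svc.eqiad.wmnet:9443 https://search.svc.eqiad.wmnet:9643 \
    https://search.svc.codfw.wmnet:9243 https://search.svc.codfw.wmnet:9443 https://search.svc.codfw.wmnet:9643 \
    https://cloudelastic.wikimedia.org:8243 https://cloudelastic.wikimedia.org:8443 https://cloudelastic.wikimedia.org:8643; do
  echo "== ${endpoint}"
  # Any output below the header is a host still running the wrong plugin version.
  curl -s "${endpoint}/_cat/plugins" | grep "${PLUGIN}" | grep -v "${EXPECTED}"
done
```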
Cluster Quorum Loss Recovery Procedure
If the cluster loses quorum, OpenSearch will shut down external access.
Initial troubleshooting
First, check the logs on the master-eligible nodes (the Puppet repo lists master-eligibles). Verify that the opensearch service is started on all master-eligibles. If it is, stop the service on the master-eligible that started its service most recently. In other words, compare the output of
systemctl status opensearch_1@${SERVICE}.service | grep Active
and stop the one with the time closest to the present.
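Rather than eyeballing the Active: lines, systemctl show exposes the start time as a single machine-readable property. A sketch, using the omega/codfw unit name that appears elsewhere on this page:

```shell
# Prints e.g. "ActiveEnterTimestamp=Mon 2025-06-09 14:03:12 UTC".
# Run on each master-eligible and stop the unit with the most recent time.
TS=$(systemctl show -p ActiveEnterTimestamp opensearch_1@production-search-omega-codfw.service | cut -d= -f2-)
# Convert to epoch seconds so times from different hosts sort numerically.
date -d "${TS}" +%s
```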
If that doesn't work (rebuilding the cluster)
/usr/share/opensearch/bin/opensearch-node can recreate a cluster from its last-known state, but it carries a risk of data loss. Only use it if there are no other options.
First, stop the opensearch service everywhere. For example:
systemctl stop opensearch_1@production-search-omega-codfw.service
Run opensearch-node on all master-eligibles, taking care to target the correct OpenSearch cluster (to do this, set the OS_PATH_CONF env var). DO NOT CONFIRM THE COMMAND YET! Example command:
export OS_PATH_CONF=/etc/opensearch/production-search-omega-codfw; opensearch-node unsafe-bootstrap
Check the output of opensearch-node; it contains a value that represents the cluster state. Example output:
Current node cluster state (term, version) pair is (4, 12).
Find the host with the highest (term, version) pair; it has the newest cluster state and will be your initial master on the new cluster you are forming.
Run
export OS_PATH_CONF=/etc/opensearch/production-search-omega-codfw; opensearch-node unsafe-bootstrap
on the initial master and confirm. Next, start the opensearch service:
systemctl start opensearch_1@production-search-omega-codfw.service
You now have a new cluster! Next, we need to get the other nodes to forget about their old cluster.
On the remaining master-eligibles, run
export OS_PATH_CONF=/etc/opensearch/production-search-omega-codfw; opensearch-node detach-cluster
and confirm. This wipes the old cluster state clean. Next, start the OpenSearch service on all remaining master-eligibles:
systemctl start opensearch_1@production-search-omega-codfw.service
Confirm that the other master-eligibles have joined the cluster. Example command:
curl http://0:9400/_cat/nodes
Repeat the detach-cluster and service-start procedure on all data nodes.
Confirm that the cluster is healthy again:
curl http://0:9400/_cat/health?v=true
Note that the data recovery will take some time; you might still get a 503 for several minutes after bringing all nodes into the cluster.
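Recovery progress can be watched with a bounded polling loop instead of repeated manual checks. A sketch using the local example port from above; the attempt count and interval are arbitrary:

```shell
# Poll _cat/health until the cluster leaves red, for at most 20 attempts.
for attempt in $(seq 1 20); do
  status=$(curl -s "http://0:9400/_cat/health?h=status" | tr -d '[:space:]')
  echo "attempt ${attempt}: ${status:-no response}"
  [ -z "${status}" ] && break          # endpoint unreachable; stop polling
  case "${status}" in green|yellow) break ;; esac
  sleep 30
done
```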
Alerts/Dashboards
For the most up-to-date alerts and dashboard information, see the team-search-platform and team-data-platform directories in the alerts repo. Each file contains alerts, and each alert links its associated dashboard. We are working on adding runbook links as well.
FIXME: Add links to alerts that live in the Puppet repo, such as:
- Expected eligible masters check and alert
- Unassigned Shards Alerts
- CirrusSearch full_text eqiad 95th percentile