Incident documentation/2021-11-10 cirrussearch commonsfile outage

From Wikitech

document status: draft

Summary

To test a bug, queries were run against the active production CirrusSearch cluster (eqiad cirrussearch) via a tunnel from mw-vagrant. vagrant provision was (probably) run later while the tunnel was still open. For reasons not fully understood, this caused the script to delete and recreate the index `commonswiki_file_1623767607`.

As a result, any search query made directly for commonswiki files failed. Furthermore, any "cross-wiki" search[1] that included Commons, such as the sidebar results on many wikis, failed as well. English Wikipedia was notably unaffected, because its community disables the Commons integration.

For context, when the Wikipedia search function Special:Search is used, most Wikipedias query their sister wikis along with Commons. Any wiki that included Commons in its "sidebar" (right side of the page) would therefore have had the query fail.

Note that for Wikipedia search, the "Go box" in the top-right corner (how most users search for articles) was not impacted. Only the full search page Special:Search failed, and only on wikis that had Commons as one of the possible sister search results in the right sidebar.

Impact: Users were impacted from 14:00 to 16:32 (about 2.5 hours). All Commons file searches failed, as did Special:Search on many wikis (but notably not English Wikipedia).

Timeline

15:21 First ticket filed by impacted user https://phabricator.wikimedia.org/T295478

15:28 Additional, largely duplicate ticket filed by user https://phabricator.wikimedia.org/T295480

15:32 <Dylsss> Searching for files on Commons is currently impossible, I believe this is quite critical given the whole point of Commons is being a file repository

15:52 Initial attempt to shift cirrussearch traffic to codfw (did not work because the patch was missing a required line) (https://sal.toolforge.org/log/05mNCn0B1jz_IcWuO9iw)

16:32 Search team operator successfully moves all cirrussearch traffic to codfw, resolving user impact (https://sal.toolforge.org/log/8p2xCn0Ba_6PSCT9sorW)

??? (In future) Index successfully restored, and traffic is returned to eqiad

References:

  1. Log events of all affected requests (note: requires Logstash access)

Actionables

  • Future one-off debugging of the sort that triggered this incident, when it requires production data, should be done on cloudelastic, which is an up-to-date read-only Elasticsearch cluster. If production data is needed but data up to one week old is acceptable, relforge should be used instead.
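
The cluster choice described in the actionable above can be sketched as a small decision helper. This is only an illustration of the rule, not an existing tool: the function name `choose_debug_target`, the `DebugTarget` enum, and the "local" fallback for work that needs no production data are all assumptions made for the example.

```python
from enum import Enum


class DebugTarget(Enum):
    """Hypothetical labels for where one-off debugging queries may be run."""
    LOCAL = "local development cluster"  # assumption: not named in the report
    RELFORGE = "relforge"
    CLOUDELASTIC = "cloudelastic"


def choose_debug_target(needs_production_data: bool, stale_ok: bool) -> DebugTarget:
    """Pick a cluster for one-off debugging; production is never an option.

    needs_production_data: the bug can only be reproduced with production data.
    stale_ok: data up to about one week old is acceptable.
    """
    if not needs_production_data:
        # Assumption: without a need for production data, a local cluster suffices.
        return DebugTarget.LOCAL
    if stale_ok:
        # Production data is needed, but week-old data is good enough.
        return DebugTarget.RELFORGE
    # Up-to-date production data is required: use the read-only cluster.
    return DebugTarget.CLOUDELASTIC


if __name__ == "__main__":
    print(choose_debug_target(needs_production_data=True, stale_ok=False).value)
```

The point of the sketch is that the active production cluster never appears as a possible return value, so the decision cannot lead back to the setup that triggered this incident.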