At around 19:27 UTC on May 24th, Wikimedia's S8 database was overwhelmed and became unavailable and in turn, took down several services which depend on that database cluster:
There were two unfortunate coincidences in timing which contributed to significant confusion early in the response to the outage. The timing of the outage coincided with a MediaWiki train deployment, and although nothing had been deployed, that wasn't immediately obvious to SRE, leading to the logical assumption that the outage was related to the train deployment and a call for rollbacks. The general confusion was further exacerbated by an ongoing
NickServ outage on Freenode which caused people to be kicked from private channels and various people were stripped of their nicks, renamed instead to random
Once it was determined that the train deployment was not the culprit, attention turned to database changes.
There has been discrepancies between indexes in the code base of
wb_terms table and real indexes in Wikidata. That was brought to light while examining the table in order to improve the structure specially since this table used be the backend for search in Wikidata but that has been changed to use ElasticSearch instead. One index named 'tmp1' was found there that hasn't been mentioned in the codebase. So it got deleted. But in four places
wb_terms was actually being used for search:
- PropertySuggester extension
- Properties referenced by name from Lua or parser functions in client
- Special:ItemDisambiguation in Wikibase repository
- ArticlePlaceholder integration with Special:Search
tmp1 index was deleted after checking it wasn't being used (by
sys.schema_unused_indexes). This method isn't 100% reliable though (and that is a known issue). There are many things that can make an index to show up there. There was some more on-going work on the
wb_terms table, like emptying a column and deleting other indexes at the same time, which could have contributed to the optimizer getting a different plan for ongoing queries and getting the index
tmp1 marked as not used.
Surprisingly, Special:ItemDisambiguation has been hit around 30K times an hour in the past month (see phab:P7152).
One part that can explain timeline of events better, is replacing the
term_search_key field with empty string that has been running for a twenty days now (see phab:T189779). This field is not being read, because Wikidata uses elastic as the search backend, but still changing the cardinality of the field past a certain tipping point could have caused a sudden change in the query execution plan to use an index that's several orders of magnitude slower. On the other hand,
term_search_key is not used in any query on Wikidata, by virtue if
$wgWBRepoSettings['useTermsTableSearchFields'] being set to false. It's however still possible that the change in
term_search_key's cardinality lead the query planner to change from using
term_text. for queries selecting based on
12:44 - jcrespo indicates there are important ongoing issues (worst query of all wikimedia databases) with `TermSqlIndex::getMatchingTerms` and sets that task to high (not unbreak now because there is not any ongoing outage at that moment) task T194273#4228564
19:22 - Open connections to S8 database cluster started climbing
19:27 - First icinga alert - restbase endpoints health on restbase1010 is CRITICAL
19:29 - twentyafterfour@tin:/srv/mediawiki-staging$ scap sync-file php-1.32.0-wmf.5/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.MobileArticleTarget.js 'sync https://gerrit.wikimedia.org/r/c/434992 refs T195191 T191051'
19:30 - !log twentyafterfour@tin: scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details)
- 19:32 Boshomi reports phab:T195520
19:36 - marostegui - _joe_: looks related to an index we dropped because wikidata team thought it wasn't in use, and looks like it is
- adding an index there will take more than 10h
19:44 reedy@tin: Synchronized wmf-config/Wikibase.php: Disable PropSuggester (duration: 01m 21s)
20:10 < marostegui> ok, I am now killing those queries only
20:11 < _joe_> marostegui: that should mostly fix things
20:11 < bblack> lots of red things are going green!
20:12 < addshore> seeing a mass drop off in fatal monitor
20:12 < volans> db1109 is clean
20:12 < _joe_> if not, please do tell us
20:12 < addshore> wikidata loads for me again now
20:13 < Amir1111WTF> DanielK_WMDE_: there are so many ways to fix that
20:13 < Amir1111WTF> I should write that down
20:13 < _joe_> tendril seems to agree things are now better
20:11 legoktm@tin: Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Disable TermSqlIndex::getMatchingTerms (duration: 01m 20s)
20:14 legoktm@tin: Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Add debug logging (duration: 01m 19s)
20:17 legoktm@tin: Synchronized php-1.32.0-wmf.5/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: wmf.5 this time (duration: 01m 19s)
- Build an analysis tool to discover more discrepancies between database schema in the codebase and "hot fixes" applied on the production database.
- Force using invisible indexes when dropping them and wait for a period of time before the actual drop.
- When adding or modifying indexes on the live system, name them properly, document the rationale for their existence, and file a ticket for them to be reconciled with the code base.
- Done re-add indexes
- Done patch out TermSqlIndex::getMatchingTerms
- Done stop replacing term_search_key field with empty string (gerrit:435113)
- Done Make PropertySuggester use EntitySearchHelper (gerrit:435080)
- Done Make Special:ItemDisambiguation use elastic for backend (gerrit:435079)
- Done Make list=wbsearch in API use elastic for backend (gerrit:435079)
- Done Disable Article Placeholder integration with search
- ???? Disable lookup for property names in clients
- Done Re-enable TermSqlIndex::getMatchingTerms and monitor
- Bump phab:T177453, Look into doing it
- Done Make article placeholder and client lookups use elastic
- Not done Get terms of properties to a dedicated tables and properly index them - Refining the retrieval instead
- Not done Restart replacing term_search_key with empty string. - Killing the table instead
- Not done Drop 'tmp1' index - Killing the table instead
- Add per query timeouts to mediawiki DB abstraction phab:T195792
- In progress replace wb_terms table with something better