Incidents/20180524-wikidata

From Wikitech
Jump to navigation Jump to search

WIP

Summary

At around 19:27 UTC on May 24th, Wikimedia's S8 database was overwhelmed and became unavailable and in turn, took down several services which depend on that database cluster:

  • MediaWiki
  • Cxserver
  • Restbase

There were two unfortunate coincidences in timing which contributed to significant confusion early in the response to the outage. The timing of the outage coincided with a MediaWiki train deployment, and although nothing had been deployed, that wasn't immediately obvious to SRE, leading to the logical assumption that the outage was related to the train deployment and a call for rollbacks. The general confusion was further exacerbated by an ongoing NickServ outage on Freenode which caused people to be kicked from private channels and various people were stripped of their nicks, renamed instead to random GuestNNN nicks.

Once it was determined that the train deployment was not the culprit, attention turned to database changes.

Database issue

There has been discrepancies between indexes in the code base of wb_terms table and real indexes in Wikidata. That was brought to light while examining the table in order to improve the structure specially since this table used be the backend for search in Wikidata but that has been changed to use ElasticSearch instead. One index named 'tmp1' was found there that hasn't been mentioned in the codebase. So it got deleted. But in four places wb_terms was actually being used for search:

  • PropertySuggester extension
  • Properties referenced by name from Lua or parser functions in client
  • Special:ItemDisambiguation in Wikibase repository
  • ArticlePlaceholder integration with Special:Search

The tmp1 index was deleted after checking it wasn't being used (by sys.schema_unused_indexes). This method isn't 100% reliable though (and that is a known issue). There are many things that can make an index to show up there. There was some more on-going work on the wb_terms table, like emptying a column and deleting other indexes at the same time, which could have contributed to the optimizer getting a different plan for ongoing queries and getting the index tmp1 marked as not used.

Surprisingly, Special:ItemDisambiguation has been hit around 30K times an hour in the past month (see phab:P7152).

One part that can explain timeline of events better, is replacing the term_search_key field with empty string that has been running for a twenty days now (see phab:T189779). This field is not being read, because Wikidata uses elastic as the search backend, but still changing the cardinality of the field past a certain tipping point could have caused a sudden change in the query execution plan to use an index that's several orders of magnitude slower. On the other hand, term_search_key is not used in any query on Wikidata, by virtue if $wgWBRepoSettings['useTermsTableSearchFields'] being set to false. It's however still possible that the change in term_search_key's cardinality lead the query planner to change from using term_text_full to term_text. for queries selecting based on term_text.

Timeline

S8 Database metrics from around the time of the incident
  • 12:44 - jcrespo indicates there are important ongoing issues (worst query of all wikimedia databases) with `TermSqlIndex::getMatchingTerms` and sets that task to high (not unbreak now because there is not any ongoing outage at that moment) task T194273#4228564
  • 19:22 - Open connections to S8 database cluster started climbing
  • 19:27 - First icinga alert - restbase endpoints health on restbase1010 is CRITICAL
  • 19:29 - twentyafterfour@tin:/srv/mediawiki-staging$ scap sync-file php-1.32.0-wmf.5/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.MobileArticleTarget.js 'sync https://gerrit.wikimedia.org/r/c/434992 refs T195191 T191051'
  • 19:30 - !log twentyafterfour@tin: scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details)
  • 19:32 Boshomi reports phab:T195520
  • 19:36 - marostegui - _joe_: looks related to an index we dropped because wikidata team thought it wasn't in use, and looks like it is
  • - adding an index there will take more than 10h
  • 19:44 reedy@tin: Synchronized wmf-config/Wikibase.php: Disable PropSuggester (duration: 01m 21s)
  • 20:10 < marostegui> ok, I am now killing those queries only
  • 20:11 < _joe_> marostegui: that should mostly fix things
  • 20:11 < bblack> lots of red things are going green!
  • 20:12 < addshore> seeing a mass drop off in fatal monitor
  • 20:12 < volans> db1109 is clean
  • 20:12 < _joe_> if not, please do tell us
  • 20:12 < addshore> wikidata loads for me again now
  • 20:13 < Amir1111WTF> DanielK_WMDE_: there are so many ways to fix that
  • 20:13 < Amir1111WTF> I should write that down
  • 20:13 < _joe_> tendril seems to agree things are now better
  • 20:11 legoktm@tin: Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Disable TermSqlIndex::getMatchingTerms (duration: 01m 20s)
  • 20:14 legoktm@tin: Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Add debug logging (duration: 01m 19s)
  • 20:17 legoktm@tin: Synchronized php-1.32.0-wmf.5/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: wmf.5 this time (duration: 01m 19s)

Conclusions

wip

Actionables

wip

Meta

  • Build an analysis tool to discover more discrepancies between database schema in the codebase and "hot fixes" applied on the production database.
  • Force using invisible indexes when dropping them and wait for a period of time before the actual drop.
  • When adding or modifying indexes on the live system, name them properly, document the rationale for their existence, and file a ticket for them to be reconciled with the code base.

wb_terms table

short term

  • Yes Done re-add indexes
  • Yes Done patch out TermSqlIndex::getMatchingTerms
  • Yes Done stop replacing term_search_key field with empty string (gerrit:435113)

mid-term

  • Yes Done Make PropertySuggester use EntitySearchHelper (gerrit:435080)
  • Yes Done Make Special:ItemDisambiguation use elastic for backend (gerrit:435079)
  • Yes Done Make list=wbsearch in API use elastic for backend (gerrit:435079)
  • Yes Done Disable Article Placeholder integration with search
  • ???? Disable lookup for property names in clients
  • Yes Done Re-enable TermSqlIndex::getMatchingTerms and monitor
  • Bump phab:T177453, Look into doing it
  • Yes Done Make article placeholder and client lookups use elastic
  • ☒N Not done Get terms of properties to a dedicated tables and properly index them - Refining the retrieval instead
  • ☒N Not done Restart replacing term_search_key with empty string. - Killing the table instead
  • ☒N Not done Drop 'tmp1' index - Killing the table instead
  • Add per query timeouts to mediawiki DB abstraction phab:T195792

long-term

  • Underway In progress replace wb_terms table with something better