Incident documentation/20170503-missing index
On 3 May 2017, after the datacenter switchover back to eqiad, between 14:26 and 15:12 UTC, revision deletions failed at the final step (showing older logs and adding a comment/more options), displaying the following error instead:
A database query error has occurred. This may indicate a bug in the software. [WQnrpgpAADsAAj14qxoAAABF] 2017-05-03 14:39:35: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"
(26 March - 23 April) New schema changes are tested live on eqiad on one slave of each shard.
(Weeks of 23 and 30 April) Several schema changes are applied on eqiad only: https://phabricator.wikimedia.org/T164185
(2017-05-03) Read-only mode disabled after the switchover:
[14:20:28] <logmsgbot> !log END (PASS) - Set MediaWiki in read-write mode in eqiad (db-eqiad config already merged and git pulled) - t08_stop_mediawiki_readonly (switchdc/oblivian@neodymium) (incident starts here)
[14:45:38] <paravoid> Trying to revdel content gives me: A database query error has occurred. This may indicate a bug in the software. [WQnrpgpAADsAAj14qxoAAABF] 2017-05-03 14:39:35: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" --NeilN talk to me 14:42, 3 May 2017 (UTC)
[14:46:56] <RoanKattouw> Error: 1176 Key 'ls_field_val' doesn't exist in table 'log_search' (10.64.16.77)
[14:47:45] <RoanKattouw> INNER JOIN `log_search` FORCE INDEX (ls_field_val) ON ((ls_log_id=log_id))
(A conversation follows about the best/fastest way to deal with the error: either perform a schema change or deploy code.)
[15:06:10] <RoanKattouw> I'm going to apply https://gerrit.wikimedia.org/r/#/c/351653/ on tin and sync it to mwdebug1002
[15:09:52] <RoanKattouw> !log Live-hacked (cherry-picked) https://gerrit.wikimedia.org/r/#/c/351653/ onto naos and synced to mwdebug1002 for testing (incident stops here)
[15:18:41] <RoanKattouw> jynus: Patch deployed at 15:13 (see logmsgbot above), I'm tailing the logs and not seeing any new errors since. We were previously seeing a couple errors per minute but with stops and starts
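The failure mode above can be illustrated with a small, self-contained sketch. It uses SQLite's INDEXED BY clause, which behaves analogously to MySQL's FORCE INDEX for this purpose (a hard error when the named index does not exist, rather than a slow fallback); the table and index names mirror the incident, but the schema columns are simplified:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simplified stand-in for the log_search table, before the schema change
# that creates the ls_field_val index has reached this replica.
db.execute("CREATE TABLE log_search (ls_field TEXT, ls_value TEXT, ls_log_id INTEGER)")

# A query hinted at a missing index is a hard error, not a degraded query plan:
try:
    db.execute("SELECT * FROM log_search INDEXED BY ls_field_val WHERE ls_field = 'rev_deleted'")
    error = None
except sqlite3.OperationalError as e:
    error = str(e)  # "no such index: ls_field_val"

# Once the schema change lands, the identical query works:
db.execute("CREATE INDEX ls_field_val ON log_search (ls_field, ls_value)")
rows = db.execute(
    "SELECT * FROM log_search INDEXED BY ls_field_val WHERE ls_field = 'rev_deleted'"
).fetchall()
```

This is why deploying code that hints at an index before the index exists on every server breaks queries outright, exactly as seen in the 14:46:56 log line.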
Despite the alter tables being tested extensively in production on all shards for a month (see https://phabricator.wikimedia.org/T17441#3133031 and following comments), which prevented worse issues on pagelinks and templatelinks, the testing did not register any errors before the full deployment, probably because the action is infrequent, and because triggering it depended on the master deployment. FORCE INDEX and similar hints are problematic: while they are not equivalent to having different data on different servers, they require the existence of an index (which cannot be deployed simultaneously to all servers, so the hint is not backwards compatible), while producing no immediately noticeable results (unlike data loss). Index hints should be documented thoroughly to prevent further issues.
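One backwards-compatible pattern is to gate the hint behind a flag that is flipped only after the schema change has finished rolling out everywhere. The sketch below is illustrative only (the helper name and flag are hypothetical, not MediaWiki's actual API; the actual fix was the code deployment at https://gerrit.wikimedia.org/r/#/c/351653/ ):

```python
def build_join(use_ls_field_val_hint: bool) -> str:
    """Build the log_search join clause, optionally with the index hint.

    Hypothetical helper for illustration: the hint is only emitted once the
    ls_field_val index is known to exist on every server.
    """
    hint = " FORCE INDEX (ls_field_val)" if use_ls_field_val_hint else ""
    return f"INNER JOIN log_search{hint} ON (ls_log_id = log_id)"

# Phase 1: schema change still rolling out, hint disabled everywhere.
print(build_join(False))  # INNER JOIN log_search ON (ls_log_id = log_id)
# Phase 2: index exists on all replicas, flag flipped, hint enabled.
print(build_join(True))
```

Splitting the rollout into these two phases keeps the query valid on servers that have not yet received the index, at the cost of one extra deployment step.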
Having said that, it is difficult to do more thorough testing before full deployment: the bug caused only about 60 errors (probably including some retries from the same person) in the one-hour window between detection and the fix being released, affecting only admin users performing a specialized action. So it caused a degradation of service, not a full outage, and no data loss.
It was detected quickly, and fixed just as quickly thanks to the help of Roan. This was a calculated risk (following a decision taken by Jaime Crespo, current Senior DBA); being more careful than this would make schema changes even more burdensome. This is a one-time incident, and it or similar ones will most likely never happen again given the special combination of conditions needed (index hints plus an in-flight alter table). Better documentation (which should be a goal to pursue anyway) will minimize, but not completely eliminate, the risk.