Incident documentation/20130628-Site

Summary

At ~04:30 UTC on 2013-06-28 the site went down during a deploy started by Roan K. The outage was caused by an improperly deployed extension (WikibaseDataModel), which made mergeMessageFileList.php abort because of a limitation in that script.

What Happened

04:20: Tim goes to lunch
04:22: Roan updates VisualEditor and starts running scap. Exceptions immediately begin to be logged, of the form:
2013-06-28 04:22:47 srv270 enwiki: [a57c6f8a] /wiki/Main_Page
Exception from line 315 of /usr/local/apache/common-local/php-1.22wmf8/includes/MagicWord.php:
Error: invalid magic word 'coordinates'
The exception rate ramps up as the change is deployed to more servers.
04:30: scap completes
04:31: Roan discovers that the site is broken as a result of the scap
04:39: Roan determines that the problem is due to WikibaseDataModel being listed in extension-list but not checked out
04:42: Roan creates the WikibaseDataModel directory and runs mw-update-l10n
04:45: Roan syncs ExtensionMessages-1.22wmf8.php
04:46: Roan runs scap
04:49: Tim returns from lunch
04:50: scap completes
04:51: Roan runs mw-update-l10n again, notices an error about WikiDataDataBase, and reports that ExtensionMessages-1.22wmf8.php is empty (zero bytes)
04:51: Tim attempts to copy in ExtensionMessages from wmf9 and sync that out, but runs the wrong sync command
04:52: Roan removes WikibaseDataModel from extension-list, successfully rebuilds ExtensionMessages-1.22wmf8.php, and runs the correct sync command
04:52: Roan starts running mw-update-l10n again, which runs until 04:57
04:54: Roan does a "sync-dir php-1.22wmf8/cache/l10n", forgetting that mw-update-l10n is still in progress. The sync-dir runs until Roan kills it at 05:01.
04:57: Tim disables the GeoData extension in CommonSettings.php
04:57: Roan reports that a different extension is now failing instead
04:59: Tim switches all 1.22wmf8 wikis to 1.22wmf7
05:01: The site comes back up

Lessons Learned

  • extension-list was modified in git, but the change was not properly deployed.
  • WikibaseDataModel returns null from the file level of its extension setup file, causing mergeMessageFileList.php to abort even when the file was present.
    • The issue is that this code was moved out of the Wikibase "extension" (git repo) into a stand-alone git repo and extension. The version of Wikibase in wmf8 loads the code from the Wikibase git repo and sets a define, WIKIBASE_DATAMODEL_VERSION (or something similar). When it then tries to load WikibaseDataModel, the constant is already set and the code and classes are already loaded, so the setup file returns instead of loading everything again. Loading it twice could be a problem, as it could load the *wrong* version.
    • The intention is for WikibaseDataModel to be for wmf9 only. We could set configuration in CommonSettings and InitialiseSettings to help control things, but that is error-prone.
    • I think the best option is for mergeMessageFileList.php to skip any missing extension files (e.g. allow the extension to be present in the wmf9 branch and not in wmf8). aude (talk) 19:11, 28 June 2013 (UTC)
  • A bug in mw-update-l10n causes errors from mergeMessageFileList.php to be ignored. A zero-length file is created by mktemp and copied into production, then the l10n cache is rebuilt with no extensions defined. bugzilla:50347
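
The last failure mode above (an ignored error leaving an empty temporary file that is then copied into production) can be guarded against by checking both the exit status and the output size before installing the file. The following is a minimal sketch in Python, not the actual mw-update-l10n code or the fix for bugzilla:50347; the function name regenerate_file and its arguments are hypothetical and chosen only for illustration.

```python
import os
import subprocess
import sys
import tempfile


def regenerate_file(command, destination):
    """Run `command`, capture its stdout in a temp file, and install that
    file at `destination` only if the command succeeded and wrote output.

    Hypothetical helper for illustration; not part of any deployment tool.
    """
    dest_dir = os.path.dirname(os.path.abspath(destination))
    # Create the temp file next to the destination so os.replace is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            result = subprocess.run(command, stdout=tmp)
        # Do not ignore a failing generator.
        if result.returncode != 0:
            raise RuntimeError(
                f"generator exited with status {result.returncode}")
        # A freshly created temp file is zero bytes; never install that.
        if os.path.getsize(tmp_path) == 0:
            raise RuntimeError("generator produced an empty file")
        os.replace(tmp_path, destination)
    finally:
        # Clean up the temp file on any failure path.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With a check like this, a failed rebuild would leave the previous known-good file in place instead of replacing it with a zero-byte one.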

Action Items

  • mergeMessageFileList.php should support comments in its input file. Comments should be added to extension-list in git detailing deployment procedure and potential pitfalls.
  • bugzilla:50347 (the mw-update-l10n bug above, which ignores errors from mergeMessageFileList.php) should be fixed. Done
  • Periodic backups of the /a/common directory including unversioned l10n cache files should be made, so that the site can rapidly be restored to a known-good state in the event of cache corruption.
  • Sanity checks should be done by scap before the files are pushed out, in addition to sanity checks done by MagicWord::get() at runtime.
  • Communication! Code pushes should be done during hours when ops are very likely to be awake, unless it is an emergency.
  • Emergency procedures: we need to establish emergency procedures such as "rollback first, fix later".
  • Update documentation at Heterogeneous_deployment#Add_new_extensions_to_extension-list. Specifically: if an extension is added and needed in a new branch, does it also need to be added to older branches even if it is not used or enabled there? Or can mergeMessageFileList.php be changed so that it ignores missing extensions, maybe with a warning? (See patch 71056. Suggestions? Any reason not to do this?)
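
Two of the items above (comment support in the input file, and skipping missing extensions with a warning) can be sketched together. This is an illustration in Python only; mergeMessageFileList.php itself is PHP, and this is not the content of patch 71056. The function name read_extension_list, the '#' comment syntax, and the warn callback are all assumptions made for the example.

```python
import os
import sys


def read_extension_list(lines, warn=None):
    """Return the extension setup file paths in `lines` that exist on disk.

    Blank lines and '#' comments (an assumed syntax) are ignored. Entries
    whose file is missing are skipped with a warning instead of aborting.
    Note: a path that itself contains '#' would be truncated by this sketch.
    """
    if warn is None:
        warn = lambda msg: print(msg, file=sys.stderr)
    found = []
    for number, raw in enumerate(lines, start=1):
        # Drop anything after '#', then surrounding whitespace.
        path = raw.split("#", 1)[0].strip()
        if not path:
            continue
        if not os.path.isfile(path):
            warn(f"line {number}: extension file not found, skipping: {path}")
            continue
        found.append(path)
    return found
```

A pre-push sanity check could reuse the same parsing but treat any warning as fatal, so that a missing checkout is caught before the files are synced rather than at runtime.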