At ~04:30 UTC on 2013-06-28 the site went down during a deploy started by Roan K. The outage was caused by an improperly deployed extension (WikibaseDataModel), which caused mergeMessageFileList.php to abort (due to a limitation in mergeMessageFileList.php).
- 04:20: Tim goes to lunch
- 04:22: Roan updates VisualEditor and starts running scap. Exceptions immediately begin to be logged, of the form:
    2013-06-28 04:22:47 srv270 enwiki: [a57c6f8a] /wiki/Main_Page
    Exception from line 315 of /usr/local/apache/common-local/php-1.22wmf8/includes/MagicWord.php:
    Error: invalid magic word 'coordinates'
- The exception rate ramps up as the change is deployed to more servers.
- 04:30: scap completes
- 04:31: Roan discovers that the site is broken as a result of the scap
- 04:39: Roan determines that the problem is due to WikibaseDataModel being listed in extension-list but not checked out
- 04:42: Roan creates the WikibaseDataModel directory and runs mw-update-l10n
- 04:45: Roan syncs ExtensionMessages-1.22wmf8.php
- 04:46: Roan runs scap
- 04:49: Tim returns from lunch
- 04:50: scap completes
- 04:51: Roan runs mw-update-l10n again, notices an error about WikiDataDataBase, and reports that ExtensionMessages-1.22wmf8.php is empty (zero bytes)
- 04:51: Tim attempts to copy in ExtensionMessages from wmf9 and sync that out, but runs the wrong sync command
- 04:52: Roan removes WikibaseDataModel from extension-list, successfully rebuilds ExtensionMessages-1.22wmf8.php, and runs the correct sync command
- 04:52: Roan starts running mw-update-l10n again, which runs until 04:57
- 04:54: Roan does a "sync-dir php-1.22wmf8/cache/l10n", forgetting that mw-update-l10n is still in progress. It runs until Roan kills it at 05:01.
- 04:57: Tim disables the GeoData extension in CommonSettings.php
- 04:57: Roan reports that a different extension is now failing instead
- 04:59: Tim switches all 1.22wmf8 wikis to 1.22wmf7
- 05:01: The site comes back up
- extension-list was modified in git, but the change was not properly deployed.
- WikibaseDataModel returns null from the file level of its extension setup file, causing mergeMessageFileList.php to abort even when the file was present.
- The issue is that this code was moved out of the Wikibase "extension" (git repo) into a stand-alone git repo and extension. The version of Wikibase in wmf8 loaded the code from the Wikibase git repo and set a constant, WIKIBASE_DATAMODEL_VERSION (or something). When WikibaseDataModel is then loaded, the constant is already defined and the code and classes are already loaded, so the setup file returns instead of loading everything again. Loading it twice could be a problem, as it could load the *wrong* version.
- The intention is for WikibaseDataModel to be for wmf9 only. We could set configuration in CommonSettings and InitialiseSettings to help control things, but that is error-prone.
- I think the best option is for mergeMessageFileList.php to skip any missing extension files (e.g. allow the extension to be present in the wmf9 branch and not in wmf8). aude (talk) 19:11, 28 June 2013 (UTC)
- A bug in mw-update-l10n causes errors from mergeMessageFileList.php to be ignored. A zero-length file is created by mktemp and copied into production, then the l10n cache is rebuilt with no extensions defined. bugzilla:50347
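The zero-length-file failure described above can be illustrated with a minimal Python sketch. It is not the actual mw-update-l10n script and not the fix made for bugzilla:50347. The function name `rebuild_message_list` and the convention that the generator command takes the output path as its last argument are assumptions made for illustration. The idea is to install the generated file only if the generator exited successfully and wrote something.

```python
import os
import subprocess
import tempfile


def rebuild_message_list(generator_cmd, target_path):
    """Run generator_cmd and install its output at target_path.

    Hypothetical helper: generator_cmd is assumed to write its result to
    the path appended as its final argument. The existing target file is
    left untouched unless the generator succeeds and produces output.
    """
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Like mktemp, mkstemp creates a zero-length file up front.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    os.close(fd)
    try:
        result = subprocess.run(list(generator_cmd) + [tmp_path])
        if result.returncode != 0:
            raise RuntimeError(
                f"generator exited with status {result.returncode}"
            )
        if os.path.getsize(tmp_path) == 0:
            raise RuntimeError("generator produced an empty file")
        # Replace the old file only after both checks pass.
        os.replace(tmp_path, target_path)
    finally:
        # Clean up the temporary file if it was not installed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With checks like these, a failing generator would leave the previous known-good file in place instead of replacing it with an empty one.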
- mergeMessageFileList.php should support comments in its input file. Comments should be added to extension-list in git detailing deployment procedure and potential pitfalls.
- bugzilla:50347 (the mw-update-l10n bug above, which causes errors from mergeMessageFileList.php to be ignored) should be fixed. Done
- Periodic backups of the /a/common directory including unversioned l10n cache files should be made, so that the site can rapidly be restored to a known-good state in the event of cache corruption.
- Sanity checks should be done by scap before the files are pushed out, in addition to sanity checks done by MagicWord::get() at runtime.
- Communication! Code pushes should be done during hours when ops are very likely to be awake, unless it is an emergency.
- Emergency procedures: we need to establish emergency procedures such as "rollback first, fix later".
- Update documentation at Heterogeneous_deployment#Add_new_extensions_to_extension-list. Specifically: if an extension is added and needed in a new branch, does it also need to be added to older branches even if it is not used or enabled there? Or can mergeMessageFileList.php be changed so that it ignores missing extensions, maybe with a warning? (See patch 71056. Suggestions? Any reason not to do this?)
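Two of the proposals above, comment support in the input file and skipping missing extensions with a warning, can be sketched in Python as follows. This is an illustration only: mergeMessageFileList.php is a PHP script, and this is not the content of patch 71056. The function name `read_extension_list` and the use of `#` as the comment marker are assumptions.

```python
import os
import sys


def read_extension_list(list_path, warn=sys.stderr):
    """Return the setup-file paths listed in list_path that exist on disk.

    Hypothetical reader: '#' starts a comment (an assumed syntax), blank
    lines are ignored, and missing files are skipped with a warning
    instead of aborting the whole run.
    """
    found = []
    with open(list_path, encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            # Strip an optional trailing comment, then surrounding whitespace.
            entry = raw.split("#", 1)[0].strip()
            if not entry:
                continue
            if not os.path.isfile(entry):
                print(
                    f"warning: {list_path}:{line_no}: "
                    f"missing extension file {entry}, skipping",
                    file=warn,
                )
                continue
            found.append(entry)
    return found
```

Warning rather than failing lets an extension be present in one branch and absent in another. The trade-off is that a genuinely forgotten checkout also passes, so the warning would need to be visible to whoever is deploying.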