Incidents/2017-01-11 multiversion
Summary
The multiversion code is poorly understood by many deployers: the code is complex and its entry points are a mess. An effort to address this is underway. On Jan 11th, a fairly involved refactor landed and caused a brief outage, despite testing in beta and on mwdebug*, and despite the canary checks.
Timeline
- 18:28: Gerrit #331552 was merged
- Tested on beta, mwdebug, etc.
- 18:56 demon@tin: Synchronized multiversion/MWMultiVersion.php: Attempt #2 for Multiversion cleanup (duration: 00m 41s)
- 19:27 demon@tin: Synchronized php-1.29.0-wmf.7/extensions/FlaggedRevs: Stupid errors (duration: 00m 46s)
- Not technically related, but weird autoloader bugs became more apparent (seen also in TMH) in testing this, so we backported a fix here
- 19:34 demon@tin: Synchronized multiversion: MWVersion fallbacks & such (duration: 00m 56s)
- Outage reported immediately; rollback began
- PHP fatal error: Call to undefined method stdClass::get()
- 19:36 demon@tin: Synchronized multiversion: rollback (duration: 00m 56s)
Conclusions
The canary checks for MediaWiki remain insufficient to catch production errors before code rolls out live. mwdebug* is useful for testing specific config changes, but it does not get "real" traffic, so it is hard to test things extensively there. The multiversion code is incredibly fragile, but we knew this. This refactor is complicated and should be broken down even further than it already is; small changes are best for this endeavor.
Actionables
- Status: Done Include fatal log rate check in scap canary test - task T154646
- Status: Done All entry points (including cli) should be subject to canary checks - task T121597
- Status: Done T152005 did not cause or exacerbate the outage, but it was noticed at the time and its priority was raised
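
To illustrate the first actionable, here is a minimal sketch of what a fatal log rate check could look like. It is not scap's actual implementation: the function name, the parameters, and the threshold values are all hypothetical, and it assumes that fatal error counts for a fixed time window before and after the sync to the canary hosts are already available from the logging system.

```python
def fatal_rate_ok(fatals_before, fatals_after, max_ratio=10.0, floor=5):
    """Decide whether a deploy to the canaries looks safe to continue.

    fatals_before: fatal errors logged in a window before the sync.
    fatals_after:  fatal errors logged in a same-length window after it.
    max_ratio:     hypothetical multiplier; how much growth is tolerated.
    floor:         hypothetical minimum baseline, so that a very quiet
                   "before" window does not make the check trip on noise.
    """
    if fatals_before < 0 or fatals_after < 0:
        raise ValueError("fatal counts must be non-negative")

    # Compare against the larger of the observed baseline and the floor.
    baseline = max(fatals_before, floor)
    allowed = baseline * max_ratio
    return fatals_after <= allowed


if __name__ == "__main__":
    # A small increase passes; a spike like the one in this incident fails.
    print(fatal_rate_ok(fatals_before=2, fatals_after=3))
    print(fatal_rate_ok(fatals_before=2, fatals_after=500))
```

A check of this shape only helps if the canaries receive enough real traffic during the window to exercise the changed code, which is the same limitation noted for mwdebug* in the conclusions.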