Incidents/2017-01-11 multiversion

Summary

The multiversion code is poorly understood by many deployers. The code is complex and its entry points are a mess. An effort to address this has been underway. On January 11th, a fairly involved refactor landed and caused a brief outage, despite being tested in beta, on mwdebug*, and by the canary checks.

Timeline

  • 18:28: Gerrit #331552 was merged
  • tested on beta, mwdebug, etc
  • 18:56 demon@tin: Synchronized multiversion/MWMultiVersion.php: Attempt #2 for Multiversion cleanup (duration: 00m 41s)
  • 19:27 demon@tin: Synchronized php-1.29.0-wmf.7/extensions/FlaggedRevs: Stupid errors (duration: 00m 46s)
    • Not technically related, but testing this made some weird autoloader bugs more apparent (also seen in TMH), so we backported a fix here
  • 19:34 demon@tin: Synchronized multiversion: MWVersion fallbacks & such (duration: 00m 56s)
  • outage immediately reported, began rollback
    • PHP fatal error: Call to undefined method stdClass::get()
  • 19:36 demon@tin: Synchronized multiversion: rollback (duration: 00m 56s)

Conclusions

The canary checks for MediaWiki remain insufficient to catch production errors before code rolls out live. mwdebug* is nice for testing specific config changes, but it does not get "real" traffic, so it's hard to test things extensively there. The multiversion code is incredibly fragile, but we knew this. This refactor is complicated and should be broken down even further than it already is; small changes are best for this endeavor.

Actionables

  • Include fatal log rate check in scap canary test - task T154646 (Status: Done)
  • All entry points (including cli) should be subject to canary checks - task T121597 (Status: Done)
  • T152005 did not cause or exacerbate the outage, but was noticed at the time; priority raised (Status: Done)
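
To illustrate the first actionable, here is a minimal sketch of what a fatal log rate check on the canaries could look like. It is not scap's actual implementation: the names LogWindow and canary_passes and the max_ratio and min_fatals thresholds are all hypothetical, and gathering the counts from the logging backend is left out.

```python
from dataclasses import dataclass


@dataclass
class LogWindow:
    """Fatal counts seen on the canary hosts over one time window (hypothetical type)."""
    fatal_count: int  # number of fatal log entries in the window
    seconds: float    # length of the window in seconds

    @property
    def rate(self) -> float:
        """Fatals per second; 0.0 for an empty window."""
        return self.fatal_count / self.seconds if self.seconds > 0 else 0.0


def canary_passes(before: LogWindow, after: LogWindow,
                  max_ratio: float = 10.0, min_fatals: int = 5) -> bool:
    """Return True if the sync may continue past the canaries.

    Both thresholds are assumed values for illustration, not tuned ones.
    """
    # Ignore a handful of fatals so background noise does not block deploys.
    if after.fatal_count < min_fatals:
        return True
    # Fatals appeared where there were none before: stop the rollout.
    if before.rate == 0.0:
        return False
    # Otherwise fail only if the fatal rate grew by more than max_ratio.
    return after.rate / before.rate <= max_ratio


if __name__ == "__main__":
    baseline = LogWindow(fatal_count=2, seconds=60)
    post_sync = LogWindow(fatal_count=400, seconds=20)
    print("continue" if canary_passes(baseline, post_sync) else "abort and roll back")
```

A check of this shape would only have helped here if the canaries received traffic that exercised the broken code path, which is why the second actionable (covering all entry points) matters as well.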