Incident documentation/20140612-Math

From Wikitech

Summary

A new backwards-incompatible feature (a dependence on a new database table) was added to the Math extension and deployed without the right steps being taken to ensure backwards compatibility or to put the required production changes in place.


Timeline

On Wednesday, June 11th, Kunal noticed a change to the Math extension that mentioned adding a new table. He commented on the relevant bug, informing the developer that db changes need to be done before the change is merged, and linked to the Schema Changes page. Kunal asked whether the current change, if merged without the new table, would cause errors on production, and Physikerwelt replied "Only useres that want to test the MathML rendering will get a database error. That might cause many bug reports". Brad replied with a further warning, pointing to the Schema changes documentation/guidelines.
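
The failure mode discussed on the bug (code that assumes a table which does not exist yet) can be avoided by checking for the table before taking the new code path. The following is only an illustrative sketch in Python with SQLite, not the Math extension's actual code: the table name `new_math_table`, the mode strings, and both function names are hypothetical.

```python
import sqlite3

# Hypothetical name; the real table name is not given in this report.
REQUIRED_TABLE = "new_math_table"


def table_exists(conn, name):
    """Return True if a table with this name exists in the SQLite database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def choose_render_mode(conn, requested):
    """Use the new mode only if its table is present; otherwise fall back to PNG."""
    if requested == "mathml" and not table_exists(conn, REQUIRED_TABLE):
        return "png"
    return requested
```

With a guard of this kind, code merged ahead of the schema change degrades to the old behaviour instead of raising a database error.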

On Thursday, June 12th, we had a spike of HTTP 500s.

Ori detected it (timestamps are in the Pacific time zone):

17:04 <+icinga-wm> PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
17:05 < ori> i certainly hope that this is wrong: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)

The reason was the Math extension:

17:09 < ori> OK, the math extension is still broken
17:10 < ori> i'm going to revert it
17:11 < ori> i'm also going to run scap in the background because i suspect things are in an inconsistent state because of the issue you flagged above mutante
17:20 < ori> 3027 Exception from line 1168 of /usr/local/apache/common-local/php-1.24wmf8/includes/db/Database.php
17:20 < ori> 1132 Exception from line 1168 of /usr/local/apache/common-local/php-1.24wmf7/includes/db/Database.php
17:20 < ori> 394 Exception from line 50 of /usr/local/apache/common-local/php-1.24wmf8/extensions/Math/MathSource.php
17:23 < ori> exceptions have subsided too
17:24 < mutante> ori: yes, the graph looks normal again
17:24 < ori> thank you. i still suspect the database errors have to do with the math extension, and i worry that they'll spike again
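
The counts pasted at 17:20 group exceptions by the file and line that raised them. How they were actually produced is not recorded here; a minimal sketch of such a tally, assuming each log line contains text shaped like "Exception from line N of PATH" (the function name `tally_exceptions` is hypothetical), could look like this:

```python
import re
from collections import Counter

# Assumed message shape, modeled on the lines quoted in the log above.
PATTERN = re.compile(r"Exception from line (\d+) of (\S+)")


def tally_exceptions(lines):
    """Count exceptions per (file path, line number) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(2), int(match.group(1)))] += 1
    return counts
```

Calling `tally_exceptions(log_lines).most_common()` lists the locations from most to least frequent, which makes a single dominant source easy to spot.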

Sean guessed right. The extension was expecting a new database table that had not been requested through the proper process:

17:25 < springle> https://bugzilla.wikimedia.org/show_bug.cgi?id=66492#c9
17:25 < springle> (wild guess)
17:26 < ori> yeah. https://gerrit.wikimedia.org/r/#/c/139068/ didn't help, evidently
17:28 < ori> https://bugzilla.wikimedia.org/show_bug.cgi?id=65793 wtf.

We considered several options to revert:

17:28 < mutante> should we restore https://gerrit.wikimedia.org/r/#/c/138993/  ?
17:28 < mutante> and do the temp. disable thing?
17:31 < ori> well, $wgMathValidModes isn't set to MW_MATH_MATHML presently
17:31 < ori> or doesn't include it, rather
17:31 < ori> so i don't see how that patch would have an effect, but i could use a second pair of eyes
17:34 < ori> change I75f24cb762609d6728247e3758fcc18f2ebfc6e6
17:34 < ori> "Invalid settings for math rendering mode will default to MathMathML."

legoktm pointed out that we should also change the default case and revert the tests:

17:35 < legoktm> wait https://gerrit.wikimedia.org/r/#/c/138572/16/MathRenderer.php,cm
17:35 < legoktm> change the default: case
17:35 < legoktm> default should be PNG
17:36 < ori> i'd prefer to revert, but i'm having a hard time identifying a safe point in the past
17:36 < ori> so your suggestion may be the best one, legoktm
17:38 < legoktm> ori: [05:36:46 PM] <grrrit-wm>I (PS1) Legoktm: Set default fallback rendering option to MW_MATH_PNG [extensions/Math] - https://gerrit.wikimedia.org/r/139301
17:38 < legoktm> for some reason MathTexvc is marked as deprecated
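
The point of changing the default case is that an invalid or unsupported setting should land on the renderer that is known to work, not on the newest one. A small sketch of that idea in Python follows; the constants only echo the names MW_MATH_PNG and MW_MATH_MATHML from the log, and `resolve_mode` is a hypothetical helper, not the code in MathRenderer.php.

```python
# Hypothetical stand-ins for the MW_MATH_PNG / MW_MATH_MATHML settings.
MODE_PNG = "png"
MODE_MATHML = "mathml"


def resolve_mode(requested, valid_modes):
    """Return the requested mode if it is enabled, otherwise the safe PNG default."""
    if requested in valid_modes:
        return requested
    # Default case: fall back to the long-established renderer.
    return MODE_PNG
```

For example, `resolve_mode(MODE_MATHML, {MODE_PNG})` yields `"png"`, so a wiki whose valid modes do not include MathML never reaches the MathML code path.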

A possible bonus issue, production talking to labs:

17:37 < legoktm> also I hope that code wasn't contact a wmflabs domain from prod
17:39 < legoktm> I'm going to have to revert the tests too
17:39 < ori> and yes it does
17:39 < ori> connect to labs i mean

Finally, this revert fixed it:

17:39 < ori> i think it might be best to revert to 1bb3bfa3b5656af5ee57784578996e9513600a4d
17:40 < legoktm> uh, https://gerrit.wikimedia.org/r/#/c/137549/2 how will that help?

Ori synced the revert, which stopped the exceptions, and the site recovered:

17:45 <+logmsgbot> !log ori Synchronized php-1.24wmf9/extensions/Math: Reverting Extension:Math to 1bb3bfa3b5656 (duration: 00m 06s)
17:47 < ori> exceptions subsiding
17:48 < legoktm> I'll start looking into a proper revert
17:48 < ori> thanks
17:48 < ori> yeah, no more exceptions.

Conclusions

Greg had a good conversation with Physikerwelt on the morning of the 13th. Physikerwelt was unaware of many aspects of the deployment cycle/process at WMF, and together they came to a new mutual understanding of how to make changes.

Actionables

  • Status: ongoing - Greg to be more diligent about actively reverting non-backwards-compatible changes before they cause problems.
  • Status: done - Update our extension development documentation as per Physikerwelt's (good) suggestion.
  • Status: in progress - Get more WMF reviewers for the Math extension work (not only will it be reviewed more quickly, but we'll have more institutional knowledge for when things break, as software tends to do).