Incidents/20140612-Math
Summary
A new backwards-incompatible feature (a dependence on a new database table) was added to the Math extension and deployed without the right steps being taken to ensure backwards compatibility or to have the required production changes in place.
Timeline
On Wednesday June 11th Kunal noticed a change to the Math extension that mentioned adding a new table. He commented on the relevant bug, informing the developer that database changes need to be done before the change is merged, and linked to the Schema Changes page. Kunal asked whether the current change, if merged without the new table, would cause errors on production, and Physikerwelt replied "Only useres that want to test the MathML rendering will get a database error. That might cause many bug reports". Brad replied with a stronger warning, again pointing to the Schema changes documentation/guidelines.
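The concern raised on the bug can be illustrated with a minimal sketch of a feature that checks for its table before using it, so that code merged ahead of the schema change degrades to the old behaviour instead of throwing a database error. This is not the Math extension's code: it uses Python's sqlite3 module against an in-memory database, and the table name `math_new_table`, the mode strings and the `render` function are all hypothetical stand-ins.

```python
import sqlite3


def table_exists(conn, name):
    """Return True if a table called `name` exists in this SQLite database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None


def render(conn, tex, mode):
    """Use the new rendering path only when its (hypothetical) table is present."""
    if mode == "mathml" and table_exists(conn, "math_new_table"):
        return "mathml:" + tex
    # Old code path: needs no new table, so it is safe before the schema change.
    return "png:" + tex


conn = sqlite3.connect(":memory:")

# Code deployed before the schema change: falls back instead of erroring.
before = render(conn, "x^2", "mathml")

# Schema change applied: the new path becomes available.
conn.execute("CREATE TABLE math_new_table (hash TEXT PRIMARY KEY, mathml TEXT)")
after = render(conn, "x^2", "mathml")

print(before, after)
```

A guard like this only softens the failure. The schema change still needs to go through the documented process before the feature is switched on.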
On Thursday June 12th we had a spike of HTTP 500 errors.
Ori detected it (timestamps are in the Pacific time zone):
- 17:04 <+icinga-wm> PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
- 17:05 < ori> i certainly hope that this is wrong: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)
The reason was the Math extension:
- 17:09 < ori> OK, the math extension is still broken
- 17:10 < ori> i'm going to revert it
- 17:11 < ori> i'm also going to run scap in the background because i suspect things are in an inconsistent state because of the issue you flagged above mutante
- 17:20 < ori> 3027 Exception from line 1168 of /usr/local/apache/common-local/php-1.24wmf8/includes/db/Database.php
- 17:20 < ori> 1132 Exception from line 1168 of /usr/local/apache/common-local/php-1.24wmf7/includes/db/Database.php
- 17:20 < ori> 394 Exception from line 50 of /usr/local/apache/common-local/php-1.24wmf8/extensions/Math/MathSource.php
- 17:23 < ori> exceptions have subsided too
- 17:24 < mutante> ori: yes, the graph looks normal again
- 17:24 < ori> thank you. i still suspect the database errors have to do with the math extension, and i worry that they'll spike again
Sean guessed right: the extension was expecting a new database table that had not been requested through the proper process:
- 17:25 < springle> https://bugzilla.wikimedia.org/show_bug.cgi?id=66492#c9
- 17:25 < springle> (wild guess)
- 17:26 < ori> yeah. https://gerrit.wikimedia.org/r/#/c/139068/ didn't help, evidently
- 17:28 < ori> https://bugzilla.wikimedia.org/show_bug.cgi?id=65793 wtf.
We considered several options to revert:
- 17:28 < mutante> should we restore https://gerrit.wikimedia.org/r/#/c/138993/ ?
- 17:28 < mutante> and do the temp. disable thing?
- 17:31 < ori> well, $wgMathValidModes isn't set to MW_MATH_MATHML presently
- 17:31 < ori> or doesn't include it, rather
- 17:31 < ori> so i don't see how that patch would have an effect, but i could use a second pair of eyes
- 17:34 < ori> change I75f24cb762609d6728247e3758fcc18f2ebfc6e6
- 17:34 < ori> "Invalid settings for math rendering mode will default to MathMathML."
legoktm pointed out we should also change the default case and revert the tests:
- 17:35 < legoktm> wait https://gerrit.wikimedia.org/r/#/c/138572/16/MathRenderer.php,cm
- 17:35 < legoktm> change the default: case
- 17:35 < legoktm> default should be PNG
- 17:36 < ori> i'd prefer to revert, but i'm having a hard time identifying a safe point in the past
- 17:36 < ori> so your suggestion may be the best one, legoktm
- 17:38 < legoktm> ori: [05:36:46 PM] <grrrit-wm>I (PS1) Legoktm: Set default fallback rendering option to MW_MATH_PNG [extensions/Math] - https://gerrit.wikimedia.org/r/139301
- 17:38 < legoktm> for some reason MathTexvc is marked as deprecated
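The fallback behaviour legoktm proposed can be sketched as follows: a requested rendering mode that is not in the list of valid modes resolves to PNG, which has no new dependencies, rather than to MathML. This is only an illustration in Python, not the extension's PHP. The constants and the `resolve_mode` function are hypothetical stand-ins for `MW_MATH_PNG`, `MW_MATH_MATHML` and the `$wgMathValidModes` check.

```python
# Hypothetical stand-ins for the extension's rendering mode constants.
MODE_PNG = "png"
MODE_SOURCE = "source"
MODE_MATHML = "mathml"


def resolve_mode(requested, valid_modes, fallback=MODE_PNG):
    """Return `requested` if it is enabled, otherwise the conservative fallback."""
    if requested in valid_modes:
        return requested
    # Anything invalid or not enabled falls back to the mode with no new dependencies.
    return fallback


# MathML is not among the valid modes, mirroring the production setting described above.
valid = [MODE_PNG, MODE_SOURCE]

print(resolve_mode(MODE_MATHML, valid))
print(resolve_mode(MODE_SOURCE, valid))
print(resolve_mode("bogus", valid))
```

The point is that the default branch should select the oldest, best-tested path, so a bad or unexpected setting cannot route users into new code.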
Possibly a bonus issue: production talking to labs?
- 17:37 < legoktm> also I hope that code wasn't contact a wmflabs domain from prod
- 17:39 < legoktm> I'm going to have to revert the tests too
- 17:39 < ori> and yes it does
- 17:39 < ori> connect to labs i mean
Finally, this revert fixed it:
- 17:39 < ori> i think it might be best to revert to 1bb3bfa3b5656af5ee57784578996e9513600a4d
- 17:40 < legoktm> uh, https://gerrit.wikimedia.org/r/#/c/137549/2 how will that help?
Ori synced the revert, which stopped the exceptions, and the error rate returned to normal:
- 17:45 <+logmsgbot> !log ori Synchronized php-1.24wmf9/extensions/Math: Reverting Extension:Math to 1bb3bfa3b5656 (duration: 00m 06s)
- 17:47 < ori> exceptions subsiding
- 17:48 < legoktm> I'll start looking into a proper revert
- 17:48 < ori> thanks
- 17:48 < ori> yeah, no more exceptions.
Conclusions
Greg had a good conversation with Physikerwelt on the morning of the 13th. Physikerwelt was unaware of many aspects of the deployment cycle/process at WMF and together they came to a new mutual understanding of how to make changes.
Actionables
- Status: ongoing - Greg to be more diligent about actively reverting non-backwards-compatible changes before they cause problems.
- Status: Done - Update our extension development documentation as per Physikerwelt's (good) suggestion.
- bugzilla:66603 -- done by Physikerwelt
- Status: in-progress - Get more WMF reviewers for the Math extension work (not only will it be reviewed more quickly, but we'll have more institutional knowledge for when things break, as software tends to do)
- See also: RT 6077
- See also: wikitech-l thread