Incident documentation/20151214-Math-and-Mathoid

From Wikitech

Summary

A bug in MathJax, the rendering engine used by Mathoid, coupled with a bug introduced in the Math extension when switching to RESTBase, prevented certain wiki pages containing mathematical formulae from rendering. The user-perceived impact was low because only pages containing nested TeX constructs failed to load, and only for logged-in users who had selected MathML as their preferred rendering mode.

Description

On the Mathoid side, MathJax v2.5 contained a bug that caused the process to enter an endless rendering loop when nested TeX constructs, such as \limits_{\binom{\| x\| =1}{Cx=0}}, were used. This caused Mathoid to refuse connections for short periods of time:

[2015-12-14T11:03:44Z] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: / is CRITICAL: Could not fetch url http://10.64.32.153:10042/: Timeout on connection while downloading http://10.64.32.153:10042/
[2015-12-14T11:05:33Z] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy
[2015-12-14T11:23:43Z] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[2015-12-14T12:25:33Z] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy

In spite of the endless-loop bug, Mathoid remained functional because the master service process monitors its workers and kills them after a time-out if they are unresponsive, which limited the overall impact on rendering.
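
The watchdog behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not Mathoid's actual code (Mathoid is a Node.js service): the function name render_in_worker, the status strings, and the time-out values are assumptions made for this example. The parent runs a job in a child process and kills it if it does not finish in time, so an endless loop costs one worker rather than the whole service.

```python
import subprocess
import sys


def render_in_worker(worker_code: str, timeout_s: float) -> str:
    """Run worker_code in a child Python process under a time-out.

    Hypothetical helper: returns "ok" on a clean exit, "failed" on a
    non-zero exit, and "killed" if the child exceeded timeout_s.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", worker_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # on expiry the child is killed and reaped
        )
    except subprocess.TimeoutExpired:
        # The worker was unresponsive (for example, stuck in an endless loop).
        return "killed"
    return "ok" if done.returncode == 0 else "failed"


if __name__ == "__main__":
    # A well-behaved job finishes; a looping job is killed after the time-out.
    print(render_in_worker("print('rendered')", timeout_s=5.0))
    print(render_in_worker("while True: pass", timeout_s=0.5))
```

The design trade-off is the one seen in this incident: the service stays up, but every request that hits the looping input still occupies a worker until the time-out expires.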

The user-visible impact was also contained because of the recent switch in the Math extension to using RESTBase as a storage proxy, since many of the renders were already present in its storage. Unfortunately, the switch itself introduced a bug which caused the Math extension to ask for a render even for invalid formulae. This caused the Math extension to start throwing exceptions, effectively blocking MediaWiki from displaying Wiki pages containing problematic formulae.
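
The class of fix needed on the Math extension side can be illustrated with a small sketch: validate first, and request a render only for formulae that pass the check, returning an inline error otherwise. The Math extension itself is written in PHP; the Python below is only an illustration, and the names CheckResult, render_formula, check and fetch_render are hypothetical rather than taken from the extension or from RESTBase.

```python
import html
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    """Outcome of validating a TeX string (hypothetical shape)."""
    valid: bool
    error: str = ""


def render_formula(
    tex: str,
    check: Callable[[str], CheckResult],
    fetch_render: Callable[[str], str],
) -> str:
    """Return markup for tex, never requesting a render for invalid input."""
    result = check(tex)
    if not result.valid:
        # Show an error in place of the formula instead of raising, so the
        # rest of the page can still be displayed.
        return '<span class="error">' + html.escape(result.error) + "</span>"
    return fetch_render(tex)
```

Passing the checker and the render fetcher in as callables keeps the sketch independent of any real HTTP client and makes the invalid-input path easy to exercise with stubs.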

Timeline

  • 2015-12-14 11:03: First reports from Icinga about Mathoid issues
  • 2015-12-14 xx:xx: mobrovac restarts Mathoid multiple times over the day
  • 2015-12-14 20:00: An abnormal amount of Math ext exceptions noticed in Logstash, mobrovac starts investigating and files bug T121445
  • 2015-12-14 22:40: PS 259164 fixing the Math extension is merged and lands in 1.27wmf9
  • 2015-12-17 14:00: Users report (bug T121762) that rendering is still crashing, and provide a concrete TeX formula exposing the MathJax bug
  • 2015-12-17 16:30: PS 259164 back-ported to 1.27wmf8; wiki pages are now being displayed, but with an error in lieu of the formula
  • 2015-12-17 18:10: Number of Mathoid workers temporarily increased to 50 to reduce production impact (PS 259765)
  • 2015-12-17 19:40: mobrovac deploys a temporary hotfix for Mathoid (PS 259780)
  • 2015-12-17 23:42: physikerwelt fixes the original MathJax bug, and a new version of Mathoid including it is deployed (PS 259894)

Conclusions

The bug crashing the Math extension could have been caught by tests exercising the rendering of invalid input. However, RESTBase's math-checking endpoint is currently closed to external IPs because it is a POST endpoint, which makes it impossible to test in our CI infrastructure.

As far as Mathoid is concerned, the main problem in this incident was that the formula exposing the bug is an edge case. Furthermore, as the timeline shows, problems appeared 3 days before users reported them, which made it quite hard to find the root cause. Not having logging enabled in production for the Math extension didn't help either.

Actionables

  • In progress: Consider opening up the math check POST RESTBase endpoint - bug T116147
  • In progress: Add more edge-case formulae to Mathoid's test suite
  • Done: Enable logging for the Math extension in production - PS 259168