Incidents/2019-01-17 RESTBase-Parsoid

Summary

On 2019-01-17, Mobile-Content-Service (MCS) was deployed to bump a content version to 1.3.10. Due to a bug in RESTBase, every page-preview summary request that fell through Varnish was re-rendered in an incorrect attempt to update the content version. Because the requests came from the public internet, they included Accept-Language headers for languages that support variants (sr, zh, etc.), and these headers were forwarded to MCS and then back to RESTBase to obtain the matching HTML variant. RESTBase does not store HTML variants and relies on Parsoid for the transformation, so all of these requests started hitting Parsoid for variant conversion, which overloaded Parsoid and caused it to page. The overload was probably due to a bug in Parsoid regarding work assignments. Undeploying MCS quickly mitigated the issue, since RESTBase then stopped making incorrect attempts to upgrade the content version.

Timeline

All times UTC.

  • 21:10: bsitzmann deploys new mobileapps https://gerrit.wikimedia.org/r/#/c/mediawiki/services/mobileapps/deploy/+/484766/
  • 21:10: restbase begins prodigiously logging errors and warnings, which increase over time https://logstash.wikimedia.org/goto/a34dd8ace7c46e09426ed41e18b8b596
  • 03:16: first problem report from Icinga: <+icinga-wm> PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received
  • 03:39: first page from Icinga: <+icinga-wm> PROBLEM - LVS HTTP IPv4 on parsoid.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
  • 04:02: cdanis restarts parsoids in codfw via scap
  • 04:06: scap restarts complete, parsoid still returning errors
  • 04:10: Pchelolo reverts mobileapps deployment
  • 04:12: Services begin to come back up
  • 04:18: All services back to nominal.

Conclusions

  • Log files should be monitored after mobileapps deploys. The mobileapps deployment procedure page has been updated.
  • A bug in the version checking in RESTBase's ensure_content_type.js caused RESTBase to request a re-render of every /page/summary response. (/page/summary requests Parsoid content.)
  • During the outage, Parsoid saw a steady increase in language variant conversion requests that more than doubled the rate (from < 20 reqs/s to > 70 reqs/s). At some point the rate spiked (> 140 reqs/s), which increased Parsoid latencies for these requests. The reason for the spike is not yet understood; the current theory is that a large template on a wiki with language variants was edited, causing a massive number of pages to drop out of the cache.
  • There are no graphs showing the rates or latencies of language variant transforms.
  • When upgrading content, RESTBase must not forward Accept-Language, because doing so would cause it to store content in a non-primary variant.
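
This report does not describe the defect in the version check itself. For illustration only, here is a minimal sketch of a check that decides whether stored content is older than the expected version. It is written in Python rather than RESTBase's own language; the function names and the profile URL format are assumptions and are not taken from ensure_content_type.js. The sketch compares versions numerically, because comparing dotted versions as strings orders "1.3.10" before "1.3.9".

```python
import re

# Hypothetical pattern: assumes the version is the last path segment of the
# profile URL in the Content-Type header.
_PROFILE_VERSION = re.compile(r'profile="[^"]*/(\d+(?:\.\d+)*)"')


def profile_version(content_type):
    """Return the profile version as a tuple of ints, or None if absent."""
    match = _PROFILE_VERSION.search(content_type)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def needs_rerender(stored_content_type, expected_version):
    """Return True only when the stored content is older than expected."""
    stored = profile_version(stored_content_type)
    expected = tuple(int(part) for part in expected_version.split("."))
    if stored is None:
        # No version information: treat the content as outdated.
        return True
    # Tuples compare element by element, so (1, 3, 9) < (1, 3, 10).
    return stored < expected
```

With a check of this shape, content already stored at the expected version is served as-is, and only genuinely older content triggers a re-render.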

Actionables

  • RESTBase should not pass through Accept-Language on content upgrades task T214094
  • Stress test Parsoid HTTP API task T214099
  • Instrument Parsoid language variant conversion task T214103
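
The first actionable can be sketched as follows. This is an illustrative Python sketch, not RESTBase code: the function name and the is_content_upgrade flag are hypothetical. The idea is that a re-render triggered by a content upgrade is an internal request, so it should not carry the client's Accept-Language header, while ordinary requests keep it.

```python
def headers_for_rerender(client_headers, is_content_upgrade):
    """Build the headers for an internal re-render request.

    On a content upgrade, Accept-Language is dropped so that the stored
    content is rendered in the primary variant.
    """
    headers = dict(client_headers)  # copy; do not mutate the caller's dict
    if is_content_upgrade:
        # HTTP header names are case-insensitive, so match on lowercase.
        for name in list(headers):
            if name.lower() == "accept-language":
                del headers[name]
    return headers
```

For example, a request carrying "Accept-Language: sr-ec" would be re-rendered without that header during an upgrade, so no variant conversion would be requested from Parsoid for it.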