Incident documentation/20170623-ORES

From Wikitech
Jump to: navigation, search

ORES scoring service was intermittently unavailable for ~7 hours on 2017-06-23.

Summary

An edit in Spanish wikipedia contained a text that caused a regular expression to go into effectively a forever loop causing the ORES workers to lockup and deny scoring any more requests

Timeline

  • 11:51 Amir1 restarts celery and uwsgi on all eqiad scb nodes
  • 12:21 Ema calles alex on the phone
  • 12:26 uwsgi-ores app gets restarted on scb eqiad cluster
  • 12:26 celery-ores-worker app gets restarted on scb eqiad cluster
  • 12:38 akosiaris disables changeprop so that it doesn't impact ores
  • 12:47 celery-ores-worker app gets restarted on scb eqiad cluster
  • 13:13 celery-ores-worker app gets restarted on scb eqiad cluster
  • 13:21 akosiaris issues flushdb on eqiad ores queue cache after monitoring it and deciding with Amir1 that's it's huge and its ok
  • 13:51 akosiaris issues flushdb on eqiad ores queue cache (again)
  • 13:57 celery-ores-worker app gets restarted on scb eqiad cluster
  • XX:XX A flushdb is erroneously issued on eqiad ores score cache.
  • 14:08 celery-ores-worker app gets restarted on scb eqiad cluster
  • 14:34 uwsgi-ores app gets restarted on scb eqiad cluster
  • 15:51 eqiad directed traffic is shifted to codfw. shortly after codfw starts exhibiting symptoms as well
  • 16:06 uwsgi-ores app gets restarted on scb eqiad cluster
  • 16:19 akosiaris is stracing a celery-ores-worker, finds out it is stuck waiting to read from a pipe. Redis connection is ruled out, it's a pipe, not a socket. That pipe seems to be a celery IPC mechanism.
  • 16:36 celery worker processes are in an endless loop in userspace, akosiaris starts gdb and attaches to one
  • 16:46 processes are discovered to be stuck in userspaces handling unicode input. A utf8 bug with REs theory emerges. akosiaris starts reading the python implementation of REs in sre_lib.h. Uploads backtrace paste at https://p.defau.lt/?Uqrabdz8bSQGl1EOdQ9i6g (later moved in https://phabricator.wikimedia.org/P5633)
  • 17:01 mobrovac disables ORES in changeprop config and re-enables changeprop
  • 17:32 it becomes clear that only some celery workers exhibit the issue. It's definite it's related to user input.
  • 17:32 subbu formulates the hypothesis that the issue maybe to bad regexps that lead to phatological backtracing
  • 17:39 akosiaris wastes time trying to figure out how to get the re pattern from within gdb. gdb's python integration is confusing him (he did not know that back then)
  • 18:16 godog manages to get the user input string from gdb and for that the culprit edit in https://es.wikipedia.org/w/index.php?title=Relieve_terrestre&type=revision&diff=100032810&oldid=100032572 is discovered.
  • 18:22 after some discussions Halfak suggestes inserting in the ores scores cache a hand crafted result for the above change
  • 18:47 crafted data is inserted into redis. team starts counting time, waiting for the issue to reappear in codfw (but it does not). akosiaris continues gdbing in eqiad
  • 19:29 akosiaris restarts processes in 3/4 boxes in eqiad. 45 minutes without an issue.
  • 19:58 amir1 pastes some REs used in mediawiki. subbu gets interested pretty soon finds out an issue with one of the REs.
  • 20:39 subbu reproduces the issue in nodejs. akosiaris can't reproduce in python3
  • 21:02 subbu reproduces in python3 as well. Turns out the entire compound regexp is needed, not just the problematic part to make CPU usage explode exponentially
  • 19:38 pchelko reenables ORES rules for changeprop after akosiaris's request
  • 23:32 akosiaris finally manages to get the re pattern from gdb after realizing python gdb integration messes with gdb output, meaning explicit type casts are needed. That verifies independelty subbu's claim+reproduction (the exact same re is discovered).
  • 23:42 akosiaris bounces finally celery-ores-worker on scb1004
  • 23:45 traffic for eqiad gets routed back to eqiad from codfw by akosiaris

This is a step by step outline of what happened to cause the incident and how it was remedied.

Conclusions

  1. We have some problematic patterns in our regular expressions that will cause problems in rare scenarios
  2. Long running regular expression matching doesn't timeout as expected

Actionables