Incidents/20170623-ORES
Appearance
ORES scoring service was intermittently unavailable for ~7 hours on 2017-06-23.
Summary
An edit in Spanish wikipedia contained a text that caused a regular expression to go into effectively a forever loop causing the ORES workers to lockup and deny scoring any more requests
Timeline
- 11:51 Amir1 restarts celery and uwsgi on all eqiad scb nodes
- 12:21 Ema calles alex on the phone
- 12:26 uwsgi-ores app gets restarted on scb eqiad cluster
- 12:26 celery-ores-worker app gets restarted on scb eqiad cluster
- 12:38 akosiaris disables changeprop so that it doesn't impact ores
- 12:47 celery-ores-worker app gets restarted on scb eqiad cluster
- 13:13 celery-ores-worker app gets restarted on scb eqiad cluster
- 13:21 akosiaris issues flushdb on eqiad ores queue cache after monitoring it and deciding with Amir1 that's it's huge and its ok
- 13:51 akosiaris issues flushdb on eqiad ores queue cache (again)
- 13:57 celery-ores-worker app gets restarted on scb eqiad cluster
- XX:XX A flushdb is erroneously issued on eqiad ores score cache.
- 14:08 celery-ores-worker app gets restarted on scb eqiad cluster
- 14:34 uwsgi-ores app gets restarted on scb eqiad cluster
- 15:51 eqiad directed traffic is shifted to codfw. shortly after codfw starts exhibiting symptoms as well
- 16:06 uwsgi-ores app gets restarted on scb eqiad cluster
- 16:19 akosiaris is stracing a celery-ores-worker, finds out it is stuck waiting to read from a pipe. Redis connection is ruled out, it's a pipe, not a socket. That pipe seems to be a celery IPC mechanism.
- 16:36 celery worker processes are in an endless loop in userspace, akosiaris starts gdb and attaches to one
- 16:46 processes are discovered to be stuck in userspaces handling unicode input. A utf8 bug with REs theory emerges. akosiaris starts reading the python implementation of REs in sre_lib.h. Uploads backtrace paste at https://p.defau.lt/?Uqrabdz8bSQGl1EOdQ9i6g (later moved in https://phabricator.wikimedia.org/P5633)
- 17:01 mobrovac disables ORES in changeprop config and re-enables changeprop
- 17:32 it becomes clear that only some celery workers exhibit the issue. It's definite it's related to user input.
- 17:32 subbu formulates the hypothesis that the issue maybe to bad regexps that lead to phatological backtracing
- 17:39 akosiaris wastes time trying to figure out how to get the re pattern from within gdb. gdb's python integration is confusing him (he did not know that back then)
- 18:16 godog manages to get the user input string from gdb and for that the culprit edit in https://es.wikipedia.org/w/index.php?title=Relieve_terrestre&type=revision&diff=100032810&oldid=100032572 is discovered.
- 18:22 after some discussions Halfak suggestes inserting in the ores scores cache a hand crafted result for the above change
- 18:47 crafted data is inserted into redis. team starts counting time, waiting for the issue to reappear in codfw (but it does not). akosiaris continues gdbing in eqiad
- 19:29 akosiaris restarts processes in 3/4 boxes in eqiad. 45 minutes without an issue.
- 19:58 amir1 pastes some REs used in mediawiki. subbu gets interested pretty soon finds out an issue with one of the REs.
- 20:39 subbu reproduces the issue in nodejs. akosiaris can't reproduce in python3
- 21:02 subbu reproduces in python3 as well. Turns out the entire compound regexp is needed, not just the problematic part to make CPU usage explode exponentially
- 19:38 pchelko reenables ORES rules for changeprop after akosiaris's request
- 23:32 akosiaris finally manages to get the re pattern from gdb after realizing python gdb integration messes with gdb output, meaning explicit type casts are needed. That verifies independelty subbu's claim+reproduction (the exact same re is discovered).
- 23:42 akosiaris bounces finally celery-ores-worker on scb1004
- 23:45 traffic for eqiad gets routed back to eqiad from codfw by akosiaris
This is a step by step outline of what happened to cause the incident and how it was remedied.
Conclusions
- We have some problematic patterns in our regular expressions that will cause problems in rare scenarios
- Long running regular expression matching doesn't timeout as expected
Actionables
- ORES logging into logstash is filled with INFO level regardless of severity (task T168921)
- Audit REs in ORES task T168888 and task T168889
- Investigate why the problematic edit didn't trigger a timeout task T168965