Incidents/20171120-Ext:ORES/minutes

Post Mortem Meeting for train blocker - 20171120-Ext:ORES

Meeting Date:

December 7, 2017

Attendees:

JR, Adam W, Aaron H, Chad H, Greg G, Victoria C

For reference:

Main Task

  • https://phabricator.wikimedia.org/T181006

Related Tasks

  • https://phabricator.wikimedia.org/T181010

Resulting Tasks

  • https://phabricator.wikimedia.org/T181191
  • https://phabricator.wikimedia.org/T181183
  • https://phabricator.wikimedia.org/T181071
  • https://phabricator.wikimedia.org/T181067
  • https://phabricator.wikimedia.org/T181187

Topics:

Brief Summary of Problem(s)

  • A code change to the "threshold API" (used dynamically by RecentChanges (RC), Watchlist (WL), and Contributions) fixed a crash by making the API return "null" (intentionally, as a valid value)
    • Nobody analyzed what would happen when the ORES extension received those null values
  • The extension started receiving "null", and RC/WL/Contribs crashed for users (see the sketch below)
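
A minimal sketch of the kind of graceful degradation later marked as done below, written in Python purely for illustration (the real ORES extension is PHP; the function names, field names, and query shape here are invented): treat a null threshold as "filtering unavailable" rather than letting it flow into query construction.

    # Hypothetical illustration: the real ORES extension is PHP; these names
    # and structures are invented to show the defensive pattern, not the
    # extension's actual code.
    def get_threshold(thresholds: dict, level: str):
        """Return a numeric threshold, or None if the service couldn't supply one."""
        value = thresholds.get(level)      # may now legitimately be None
        if not isinstance(value, (int, float)):
            return None                    # degrade instead of crashing downstream
        return value

    def build_rc_filter(thresholds: dict, level: str):
        threshold = get_threshold(thresholds, level)
        if threshold is None:
            # Pre-fix behaviour: the null leaked into query construction and
            # RC/WL/Contribs threw an exception for users.
            # Graceful behaviour: skip the ORES filter (and log a warning).
            return None
        return {"ores_damaging_score >=": threshold}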

How was/were the problem(s) discovered?

  • The RU/FR wikis threw exceptions on RC/WL/Contribs
  • awight heard about it on IRC (~45 minutes after deployment)
    • Note that deployments to the service can affect the extension
    • In this case, the extension was not being monitored during the service deployment
  • awight wasn't sure what the fix was (discussed in #wikimedia-ai)
  • The rollback started 4 minutes later, but wasn't done correctly right away (related to deployments to a new experimental cluster)
    • Two rollback attempts failed before one succeeded (the successful attempt started an hour later)
      • Note that once the first rollback failed, the revid issue became apparent
    • There were lots of upset people ("pressure") in #wikimedia-operations
    • Before the rollback completed, we disabled ORES on the FR/RU wikis
  • It took 1-2 days to re-enable ORES (post-rollback) and deploy a fix to the ORES extension

How was/were the problem(s) introduced?

  • Why didn't this problem manifest in Beta?
    • A large number of wikis could have been affected. Frwiki isn't even on the beta cluster, but ruwiki was, so we could have reviewed all pages on the available beta wikis
    • There was some configuration in Beta that prevented ORES from being enabled in a way that would crash there (we would only have seen a working page)
      • TODO: (Adam) add documentation to ORES/Deployment explaining that the beta config must be carefully checked, and left empty once deployment to production is complete. (done)
    • Beta is not a perfect mirror of production
  • Collab and Scoring Platform share ownership of the ORES extension
    • The configuration belongs to two teams. We tried to split the code, but that was determined to be too expensive.
    • Part of our deployment pattern involves work on the ORES extension; later, the Collab team does more work on the extension to make scores visible to users.

What was learned?

How could the problem(s) have been avoided?

  • Graceful degradation ({{done}})
    • And for the future: a complete refactor of the extension by a more experienced engineer.
  • Monitoring extension graphs during deployments (including certain common types of failures)
  • Beta config (keep it carefully checked, and empty once deployment to production is complete)
  • Beta quality assurance: too difficult as it stands; 30 pages is too many.

How could we detect this kind of problem before the train/deployment?

  • Low-traffic staging for service changes (see the sketch below)
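
As a concrete, hypothetical example of such a check: a small script run against a low-traffic or staging ORES endpoint before the train could flag null or missing thresholds before any user-facing wiki sees them. The host, response layout, and model name below are assumptions for illustration, not the service's documented contract.

    # Hypothetical pre-deployment smoke check against a staging ORES endpoint.
    # The host, response layout, and model name are assumptions for illustration.
    import sys
    import requests

    STAGING_URL = "https://ores-staging.example.org/v3/scores/ruwiki/"  # placeholder host

    def thresholds_look_sane(model: str = "damaging") -> bool:
        resp = requests.get(
            STAGING_URL,
            params={"models": model, "model_info": "statistics.thresholds"},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        thresholds = (
            data.get("ruwiki", {})
                .get("models", {})
                .get(model, {})
                .get("statistics", {})
                .get("thresholds", {})
        )
        # Fail the check if the thresholds the extension consumes are missing or null.
        return bool(thresholds) and None not in thresholds.values()

    if __name__ == "__main__":
        sys.exit(0 if thresholds_look_sane() else 1)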

What changes have been put in place or are planned to avoid a similar scenario in future deployments?

  • Covered in the incident report and the resulting tasks above.

Action Items

  • (task T182731) greg/JR: reviews of extensions
    • greg: send a note to Daniel Kinzler about extension review
  • (task T182733) chad: the proximity of service deployments to train deployments is a problem; do something about it