Incident documentation/20160801-ORES

From Wikitech

Summary

On 2016-08-01, ORES went down for about one hour during a deployment. The outage had two causes: the Redis password was not read correctly from the configuration, and the new code introduced regressions in the V1 response format.

Timeline

Server admin log entries during this time

Short version

  • 20:33 -- New code is deployed.
  • 21:20 -- A hotfix is deployed (see ores:e8d2475)
  • 21:37 -- The deployment is rolled back to the old code.

Longer with context

  • 20:33 < Amir1> !log deploying 624d777 to ores
  • ...
  • 20:38 < Amir1> and we have problem with redis
  • 20:38 < Amir1> the puppet change is not there
  • ...
  • 21:08 < Amir1> https://phabricator.wikimedia.org/diffusion/1912/browse/master/
  • 21:09 < Amir1> okay, something urgent. Can you update this manually
  • 21:09 < Amir1> it's a mirror
  • 21:09 < Amir1> and it's not updated
  • ...
  • 21:11 < greg-g> Amir1: who's "you"?
  • 21:11 < Amir1> the person who can update that
  • 21:12 (in another channel) < greg-g> we need a root in -operations to help amir1
  • 21:12 (in another channel) < greg-g> now
  • 21:13 (in another channel) < greg-g> mutante: robh ^
  • 21:14 < chasemp> I'm confused on what needs to be changed Amir1
  • 21:14 < chasemp> are you saying you pull from diffusion and it's lagging behind has caused an issue?
  • 21:14 < Amir1> yeah
  • 21:14 < Amir1> chasemp: yup
  • 21:14 < chasemp> diffusion is on a staggered update schedule based on change rate in a repo iirc
  • 21:15 < chasemp> let me see here if I can force it
  • 21:17 < chasemp> !log iridium sudo -u phd /srv/phab/phabricator/bin/repository update 1912
  • 21:20 < Amir1> !log deploying e8d2475 to scb nodes
  • 21:23 < halfak> We're up!
  • 21:23 < Amir1> it's up
  • 21:23 < Amir1> yeah
  • ...
  • 21:27 < greg-g> still seeing the "no model available for {blah}" fatal
  • 21:28 < Amir1> where?
  • 21:28 < MaxSem> in exception log
  • ... (halfak and Amir1 discuss whether anything is broken and, if so, what)
  • 21:33 < Amir1> let's rollback and check later
  • 21:33 < Amir1> mutante: hey, can you revert that?
  • 21:33 < greg-g> halfak: the ORES extension is fataling in prod, it needs to be reverted now
  • 21:34 < grrrit-wm> (PS2) Dzahn: Revert "ores: changes for configs for the refactor" [puppet] - https://gerrit.wikimedia.org/r/302352
  • 21:34 < mutante> Amir1: ^ that?
  • 21:34 < Amir1> yup
  • 21:37 < Amir1> !log deploying 6790ccb
  • 22:41 < MaxSem> huh, ores is still broken?
  • 22:41 < halfak> Oh... It shouldn't be.
  • 22:41 < halfak> MaxSem, looks up to me
  • 22:41 < MaxSem> RuntimeException from line 136 of /srv/mediawiki/php-1.28.0-wmf.12/extensions/ORES/includes/Cache.php: No model available for [361156268]
  • ...
  • 22:43 < MaxSem> not one error, flood of them
  • 22:44 < MaxSem> resumed at 22:23 UTC
  • 22:55 < halfak> MaxSem, it looks to me like error you cited stop by 22:41
  • 22:55 < halfak> The errors that occur after that are expected.

Conclusions

The deployment to beta did not show issues for two reasons.

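  1. The Redis server used in beta requires no password, so a problem with reading the password from the configuration could not surface there.
  2. The ORES extension on beta uses the production installation of the ORES service, so the extension in beta never talked to the newly deployed service code.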
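
The first point suggests a startup check that fails loudly when a password is expected but missing. The following is a minimal sketch of that idea, not the actual ORES code: the `redis`/`password` config layout, the function name `read_redis_password`, and the sample configs are all assumptions made for illustration.

```python
# Hypothetical config layout; the real ORES configuration keys may differ.
def read_redis_password(config, require_password=True):
    """Return the Redis password from a nested config dict.

    Raises ValueError when a password is required but absent, so the
    problem shows up at startup rather than at the first Redis call.
    """
    redis_conf = config.get("redis", {})
    password = redis_conf.get("password")
    if password is None and require_password:
        raise ValueError("redis.password is missing from the configuration")
    return password


# A beta-like config with no password, and a production-like one with a password.
beta_config = {"redis": {"host": "localhost", "port": 6379}}
prod_like_config = {
    "redis": {"host": "localhost", "port": 6379, "password": "s3cret"}
}
```

With `require_password=True`, the beta-like config raises `ValueError` instead of silently passing, which is the kind of difference between beta and production that this incident exposed.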

Actionables