Incident documentation/20150908-ores

From Wikitech

Between 13:50 UTC and 15:50 UTC on 2015-09-08, ORES was down during an upgrade.

Summary

During an upgrade, version issues caused the ORES service to go down for approximately two hours. The wrong version of ORES was pushed to the production servers because an outdated 'master' branch was pushed to the 'deploy' branch before the 'fab deploy_web' and 'fab deploy_celery' scripts were called. As a result, the service could not start at all.

Another issue worth noting is the excessive amount of time spent waiting for pip to update packages that did not need updating in the first place. Although revscoring and ores specify ranges of version requirements that were satisfied by the packages already installed, pip repeatedly insisted on installing new versions.
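To illustrate the point about version ranges, here is a minimal sketch of the check one might expect before any install is attempted: if the installed version already falls inside the required range, nothing needs to be rebuilt. The package versions and ranges in the usage note are hypothetical, and the comparison only handles plain dotted numeric versions. It is not how pip itself resolves requirements.

```python
def parse_version(text, width=4):
    """Turn a dotted numeric version such as '0.16.0' into a padded tuple.

    Simplifying assumption: versions contain only integers and dots.
    Padding makes '0.16' and '0.16.0' compare as equal.
    """
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def satisfies(installed, minimum, below):
    """Return True when minimum <= installed < below."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)


def packages_needing_install(installed, requirements):
    """List the packages that are missing or fall outside their required range.

    installed:    dict of package name -> installed version string
    requirements: dict of package name -> (minimum, below) version strings
    """
    needed = []
    for name, (minimum, below) in requirements.items():
        current = installed.get(name)
        if current is None or not satisfies(current, minimum, below):
            needed.append(name)
    return sorted(needed)
```

With hypothetical data such as installed = {"numpy": "1.9.2"} and requirements = {"numpy": ("1.8.0", "2.0.0")}, packages_needing_install returns an empty list, which is the case where a deployment should not spend time recompiling anything.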

Timeline

  • 2015-09-08 @ 13:50 UTC -- 'fab deploy_worker' command is run
    • ... moments later ores.wmflabs.org stopped responding
  • 2015-09-08 @ 15:50 UTC -- ORES is able to serve requests again

Conclusions

  • pip is great for development, but it's a problem in production. We should never be surprised by a 5-minute compile time when we're deploying code.
  • It's critical to make sure that the right version is on the 'deploy' branch before starting the deployment.
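The second conclusion could be enforced with a small pre-deployment check. The following is a sketch only: the ref names 'origin/master' and 'origin/deploy' are assumptions about how the repository is laid out, and the helper functions are hypothetical rather than part of the existing fab scripts. It compares the commit each ref points at, using 'git rev-parse', and refuses to continue when they differ.

```python
import subprocess


def rev_parse(ref, cwd=".", run=subprocess.check_output):
    """Return the commit hash that a git ref points at.

    `run` is injectable so the logic can be exercised without a repository.
    """
    output = run(["git", "rev-parse", ref], cwd=cwd)
    return output.decode("utf-8").strip()


def deploy_matches(source_ref="origin/master", deploy_ref="origin/deploy",
                   cwd=".", run=subprocess.check_output):
    """Return True when both refs point at the same commit.

    The default ref names are assumptions, not taken from the ORES setup.
    """
    return rev_parse(source_ref, cwd, run) == rev_parse(deploy_ref, cwd, run)


def assert_ready_to_deploy(**kwargs):
    """Raise before any deploy step runs if the deploy branch looks stale."""
    if not deploy_matches(**kwargs):
        raise RuntimeError(
            "deploy branch does not match the expected source branch; "
            "aborting before any servers are touched"
        )
```

Running 'git fetch' before the check would be needed for the remote-tracking refs to reflect the current state of the remote; calling assert_ready_to_deploy() as the first step of a deployment would then stop a stale branch from reaching production.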

Actionables

  • phab:T111806 -- ORES load balancer doesn't rebalance on 500 responses
  • phab:T111826 -- [Spike] Under what circumstances does pip want to re-compile scipy/numpy?
  • phab:T111828 -- ORES/Wikilabels deploy checklist