Talk:Incidents/20160801-ORES
Possible improvements to deploy process?
Hey, after reading the report I am wondering if the impact of the outage could have been reduced with a deploy process that only upgraded one production node as a canary. Without depooling, this would still have caused errors for a portion of requests, but would have avoided a complete outage. With depooling, user impact could have been avoided altogether.
Canary deploys require that the two ORES instances can indeed be upgraded independently. I guess that this is true, as otherwise we would have a single point of failure, but don't know enough about ORES internals to be sure about this. -- gwicke (talk) 00:32, 2 August 2016 (UTC)
- +1 to that. Additionally, doing a deploy in the inactive DC (currently codfw) and doing tests there before proceeding would make the process bullet- and foolproof :) Mobrovac (talk) 11:16, 2 August 2016 (UTC)
- FYI, we do this already and it's documented; I also mentioned it below. Ladsgroup (talk) 17:10, 2 August 2016 (UTC)
- Thank you for the explanation, @Ladsgroup:! The balancing issue is pretty similar to not depooling a node. Issues should still show up in error metrics, at least if there is sufficient traffic. If generating synthetic traffic in codfw is an issue, you could perhaps consider upgrading all of codfw as a canary test. Any request in codfw should then surface errors. Additionally, you could add a canary node in eqiad to test with actual traffic. This is what we generally do with all prod services, precisely because it is the ultimate test with everything (traffic mix, config etc) identical to the remaining eqiad prod nodes. -- gwicke (talk) 22:32, 3 August 2016 (UTC)
- Thanks. Using scb1002 as a canary node doesn't sound too bad. Let me talk to Aaron. Ladsgroup (talk) 10:47, 6 August 2016 (UTC)
Some notes
It was a complicated outage. I want to elaborate more here.
- We have six different ORES setups, including a canary node. It's extremely difficult to maintain these instances, but due to the research-based nature of ORES, it is necessary. Every patch (except urgent fixes) has to pass through five of these instances before being deployed to prod. I've set up a page explaining these setups and their configs. Please check it and give us feedback.
- We are deploying a big refactor. It'll reduce memory pressure to 60-70% of current levels (depending on config and setup) and make requesting different models in a single request several times faster. I anticipated it would introduce new bugs, and we found several of them during the testing period (which took more than a week).
- This deployment required puppet changes. I cherry-picked the puppet change and deployed it in beta to make sure it worked the way I wanted, and it did.
- There were two reasons we pursued the "hot fix" solution instead of reverting and fixing: 1) the problem couldn't be reproduced anywhere except prod, and at the time we didn't know it was a bug in reading the redis password; 2) the change required puppet changes, so a rollback would have meant reverting that puppet change, getting it merged, and running the puppet agent on the scb nodes, which would usually take more time than the fix itself. The biggest lesson I learned was to write puppet config changes in a backward-compatible way. I thought of it, but it seemed overly complicated, so I didn't do it; from now on I will write these changes to be backward compatible, no matter how complicated, so that a revert is as fast as possible (a sketch of this idea follows these notes).
- The canary node: We have a canary node, scb2001.codfw.wmnet. (Since codfw is the backup data center, we were told by Ops that it's fine to use it as a canary node; once codfw starts getting traffic, we should set up another node.) The biggest issue with that canary node is that it's just a web node; the worker role doesn't work properly there. Let me explain how ORES works in a nutshell: when a request for scores comes in, it hits the uwsgi workers; uwsgi checks redis to see if the score is cached, and if it's not, it sends the request to the celery workers using redis as a broker (i.e. redis acts as a load balancer for the scoring part). When several worker (scoring) nodes and web nodes are registered with one redis server, you can't tell where the scoring of a revision will be handled. So, since the canary node is registered with the production redis server, I can test uwsgi on it (as I did here as well), but for scoring a revision it just sends the request to redis and there is no way to know whether it was handled by the canary node or by other nodes (the second sketch after these notes illustrates this flow). The real canary node here is the beta cluster, since it has its own dedicated redis instance. (It's a complicated system; please ask if anything is not clear enough.)
- One of the reasons the outage lasted longer was that the hot fix we made couldn't be pulled into the config repo because of the lag between GitHub and Diffusion. The commit was pushed to the GitHub repo at 11:01 UTC (based on the GitHub IRC bot), and only 17 minutes later, after a manual push by Chase, did the mirror get updated. This needs to be fixed. We should also think about whether we are allowed to make patches directly in production in cases like this (as we do with security fixes).
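The backward-compatibility lesson above could look something like the following. This is only a minimal sketch with made-up key names (score_caches, redis), not the actual ORES config schema or puppet code: the idea is that the service accepts both the old and the new config layout, so a code revert doesn't have to wait for a puppet revert, merge, and agent run on the scb nodes.
<syntaxhighlight lang="python">
# Minimal sketch: read the redis password from either the new or the old
# (hypothetical) config layout, so code and puppet can be rolled back independently.

def get_redis_password(config):
    """Prefer the new-style key, fall back to the old-style key."""
    new_style = config.get('score_caches', {}).get('redis', {}).get('password')
    if new_style is not None:
        return new_style
    # Old layout still deployed (or puppet change reverted): keep working anyway.
    return config.get('redis', {}).get('password')
</syntaxhighlight>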
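And to make the uwsgi/celery/redis flow above more concrete, here is a rough sketch of that pattern. The hostnames, cache key format, and functions (score_revision, handle_request) are made up for illustration; this is not the actual ORES code.
<syntaxhighlight lang="python">
# Sketch of the web-node flow: check the shared redis score cache, and on a miss
# dispatch the scoring job to the celery workers through the redis broker.
import json

from celery import Celery
from redis import Redis

cache = Redis(host='redis.example.internal', port=6379, db=0)  # shared score cache
app = Celery('ores',
             broker='redis://redis.example.internal:6379/1',
             backend='redis://redis.example.internal:6379/2')


@app.task
def score_revision(wiki, model, rev_id):
    # Runs on whichever celery worker picks the task up -- possibly the canary
    # node, possibly any other node registered with the same broker, which is
    # why the web layer can't target the canary's workers specifically.
    return {'wiki': wiki, 'model': model, 'rev_id': rev_id, 'score': 0.5}  # placeholder


def handle_request(wiki, model, rev_id):
    """Roughly what a uwsgi worker does for an incoming score request."""
    key = '{0}:{1}:{2}'.format(wiki, model, rev_id)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: hand the job to the celery workers and wait for the result.
    result = score_revision.delay(wiki, model, rev_id).get(timeout=15)
    cache.set(key, json.dumps(result))
    return result
</syntaxhighlight>
Only a dedicated redis instance (like the one in beta) makes it certain which nodes did the scoring, which matches the point above about the beta cluster being the real canary.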