Incidents/2017-04-26 ORES



Summary

On 2017-04-26, an ORES deployment caused a large number of timeout errors in CODFW, but not in EQIAD. It looked like uwsgi was running old code while celery was running new code. Service restarts did not fix the problem, so we switched traffic to EQIAD, which resolved it.

Timeline

Errors skyrocket between 20:30 and 21:45
2034 UTC
ORES deployment is completed
2041 UTC
Icinga warns of an outage
2048 UTC
Halfak confirms that a bunch of requests are timing out in CODFW but EQIAD is doing OK.
2102 UTC
Phab:T163944 is created to track the issue.
2114 UTC
A service restart is issued via scap and directly (by mutante) via systemctl: "100.0% (6/6) success ratio (>= 100.0% threshold) for command: 'systemctl restart uwsgi-ores'"
2135 UTC
mutante posts a patchset to re-route traffic to EQIAD (https://gerrit.wikimedia.org/r/#/c/350487/).
2143 UTC
The patch is merged and puppet is run.
2145 UTC
Everything is OK again.

Conclusions

At the time of the incident we did not know what caused this, so we filed a task to investigate: T163950 (Investigate failed deploy to CODFW).

Update @ 2017-05-02
The problem was caused by scb2005 and scb2006, two hosts we did not know existed. They were not in our scap config, so new code was not being deployed to them. Adding them to the scap config and deploying to them solved the problem.
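A check along the following lines might have caught the gap before the deploy. This is only a minimal sketch, not how scap or puppet actually expose host lists: the function name `find_undeployed_hosts` is hypothetical, and the host lists are hard-coded for illustration (only scb2005 and scb2006 come from this report; scb2001 through scb2004 are assumed names). In practice the two lists would have to come from wherever the serving pool and the scap targets are defined.

```python
def find_undeployed_hosts(serving_hosts, deploy_targets):
    """Return hosts that serve traffic but are missing from the deploy target list."""
    return sorted(set(serving_hosts) - set(deploy_targets))


# Hypothetical host lists for illustration only.
# Hosts assumed to be serving ORES traffic in CODFW:
serving_hosts = ["scb2001", "scb2002", "scb2003", "scb2004", "scb2005", "scb2006"]
# Hosts assumed to be listed as deploy targets at the time of the incident:
deploy_targets = ["scb2001", "scb2002", "scb2003", "scb2004"]

missing = find_undeployed_hosts(serving_hosts, deploy_targets)
if missing:
    print("Hosts serving traffic but not in deploy targets:", ", ".join(missing))
```

Any host reported by such a comparison would keep running old code after a deployment, which matches the mismatch seen during this incident.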

Actionables