Incidents/20160620-ores

Summary

ores.wikimedia.org was down for about twenty minutes today after we deployed a commit that changed how the config directory is read without enforcing a proper order.
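To illustrate the class of problem, here is a minimal sketch of reading a config directory in a fixed order. It is not the actual ORES code or the actual fix. It assumes the directory holds JSON files whose top-level keys are merged, with later files overriding earlier ones. The function name `load_config_dir` and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_config_dir(config_dir):
    """Merge every *.json file in config_dir into one dict.

    Files are read in sorted filename order, so the result does not
    depend on the order in which the filesystem lists them. Keys in
    later files override keys in earlier files (shallow merge).
    """
    merged = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        with path.open() as handle:
            merged.update(json.load(handle))
    return merged


if __name__ == "__main__":
    # Hypothetical layout: a base file plus a local override.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "00-main.json").write_text(json.dumps({"workers": 4, "debug": False}))
        Path(tmp, "99-local.json").write_text(json.dumps({"workers": 16}))
        print(load_config_dir(tmp))
```

Sorting by filename makes the override order explicit: a numeric prefix such as `00-` or `99-` decides which file wins.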

Timeline

SAL log

10:58 Amir1: deploying bdc1e2b in ores nodes
11:04 deployment finished and ores went down
- The Puppet agent ran and the services (uwsgi-ores, celery-ores-worker) were restarted, but this did not solve the problem
- The logs showed that the problem persisted because the config was being read incorrectly
11:27 Amir1: rollbacking ae71d842dfc0958e06922062dd09d49243332a6a
- ORES went live again
12:13 Amir1: deploying bdc1e2bd only to ores on scb2001 (codfw)
- Did not work as expected. There was no downtime because only that one node in codfw was affected.
13:04 Amir1: deploying 8e65182 to scb2001
- We fixed it in 295214
- Worked perfectly fine
13:06 Amir1: deploying 8e65182 to all ores nodes

Conclusions

A shallow explanation would be to blame the code that reads config directories, which has changed a lot and is now fairly stable. Stopping there would be dangerous. What we really need is a safe method for deploying ORES, which is what we followed the second time today: deploy to a single node first, verify it, and only then deploy to all nodes. All that remains is to document those steps.
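The single-node-first approach seen in the timeline (scb2001 first, then all ORES nodes) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the deployment tooling actually used. The callables `deploy`, `is_healthy` and `rollback` are hypothetical placeholders for whatever the real procedure does, and every node name other than scb2001 is made up.

```python
def staged_deploy(nodes, deploy, is_healthy, rollback):
    """Deploy to the first node, verify it, then continue node by node.

    Stops at the first node that fails its health check, rolls that
    node back, and returns the nodes left running the new revision.
    """
    deployed = []
    for node in nodes:
        deploy(node)
        if not is_healthy(node):
            # Undo only the failing node and stop the rollout here.
            rollback(node)
            break
        deployed.append(node)
    return deployed


if __name__ == "__main__":
    # Hypothetical stand-ins that only print what they would do.
    nodes = ["scb2001", "node-b", "node-c"]
    done = staged_deploy(
        nodes,
        deploy=lambda n: print(f"deploying to {n}"),
        is_healthy=lambda n: True,
        rollback=lambda n: print(f"rolling back {n}"),
    )
    print("deployed:", done)
```

Putting a node that serves no production traffic first in the list means a bad revision fails there without causing downtime, as happened with the 12:13 deploy.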

Actionables

Status: Unresolved. Document safe steps to deploy ORES in prod (bug T138234)