Incidents/20160620-ores
Summary
ores.wikimedia.org was down today for about twenty minutes after we deployed a commit that changed how the config directory is read without reading the files in the proper order.
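To illustrate why read order matters, here is a minimal sketch of loading a config directory deterministically. It is not the ORES code: the function names `deep_merge` and `load_config_dir`, the use of JSON files, and the numeric filename prefixes are all assumptions made for the example.

```python
import json
from pathlib import Path


def deep_merge(base, override):
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config_dir(directory):
    """Load every *.json file in directory, in sorted filename order.

    Hypothetical helper: later files (e.g. 99-local.json) override
    earlier ones (e.g. 00-main.json).
    """
    config = {}
    # sorted() makes the order explicit; the order returned by
    # Path.glob() depends on the filesystem and is not guaranteed.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open() as f:
            config = deep_merge(config, json.load(f))
    return config
```

Without the explicit sort, a local override file could be read before the defaults it is meant to override, and the result would differ from host to host.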
Timeline
- 10:58 Amir1: deploying bdc1e2b in ores nodes
- 11:04 deployment finished and ores went down
- The puppet agent ran and restarted the services (uwsgi-ores, celery-ores-worker), but that did not solve the problem
- Checking the logs showed that the problem persisted because of the bad config reading
- 11:27 Amir1: rolling back ae71d842dfc0958e06922062dd09d49243332a6a
- ORES went live again
- 12:13 Amir1: deploying bdc1e2bd only to ores on scb2001 (codfw)
- Did not work as expected. (No downtime, because it only affected that one node in codfw.)
- 13:04 Amir1: deploying 8e65182 to scb2001
- We fixed it in 295214
- Worked perfectly fine
- 13:06 Amir1: deploying 8e65182 to all ores nodes
Conclusions
A shallow explanation would blame the code that reads the config directories, which has changed a lot and is now fairly stable, but stopping there is dangerous. What we really need is a safe method for deploying ores, which is what we followed the second time today: deploy to a single node first, check that it works, and only then deploy to all nodes. The only thing left is to document these steps.
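The staged approach used for the second deployment can be sketched as follows. This is only an illustration under stated assumptions, not the actual deployment tooling: `staged_deploy` and the `deploy`, `is_healthy`, and `rollback` callables are hypothetical placeholders for whatever commands and checks are really used.

```python
def staged_deploy(nodes, canary, deploy, is_healthy, rollback):
    """Deploy to one canary node first; continue only if it stays healthy.

    All callables are hypothetical hooks supplied by the caller:
      deploy(node)     -> push the new revision to a node
      is_healthy(node) -> return True if the service on the node responds
      rollback(node)   -> restore the previous revision on a node
    Returns True if the revision reached every node, False otherwise.
    """
    deploy(canary)
    if not is_healthy(canary):
        # Only the canary was touched, so only the canary is rolled back.
        rollback(canary)
        return False
    for node in nodes:
        if node != canary:
            deploy(node)
    return True
```

In the timeline above, scb2001 played the role of the canary: the first attempt failed there without causing downtime, and the fixed revision was deployed everywhere only after it worked on that node.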
Actionables
- Status: Unresolved. Document safe steps to deploy ores in prod (bug T138234)