Incident documentation/20160620-ores

From Wikitech

Summary

ores.wikimedia.org was down today for about twenty minutes after we deployed a commit that changed how the config directory is read without the proper ordering.
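The commit itself is not reproduced here. As a rough illustration of why read order matters when a service merges several files from a config directory, here is a minimal Python sketch. It assumes JSON files and numeric filename prefixes purely for the sake of a self-contained example; `deep_merge` and `load_config_dir` are hypothetical names, not the ORES code.

```python
import json
from pathlib import Path


def deep_merge(base, override):
    """Return a new dict in which values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the later file simply replaces the earlier value.
            merged[key] = value
    return merged


def load_config_dir(directory):
    """Merge every *.json file in `directory`, in sorted filename order."""
    config = {}
    # The OS does not guarantee directory listing order, so sort explicitly.
    # Files that sort later override files that sort earlier.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open() as f:
            config = deep_merge(config, json.load(f))
    return config
```

With this approach a file such as `99-local.json` reliably overrides `00-main.json`. Without the explicit sort, which file wins could differ between hosts.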

Timeline

SAL log

  • 10:58 Amir1: deploying bdc1e2b in ores nodes
  • 11:04 deployment finished and ores went down
    • The puppet agent ran and restarted the services (uwsgi-ores, celery-ores-worker). This did not solve the problem.
    • Checking the logs showed that the problem persisted because of bad config reading.
  • 11:27 Amir1: rolling back to ae71d842dfc0958e06922062dd09d49243332a6a
    • ORES went live again
  • 12:13 Amir1: deploying bdc1e2bd only to ores on scb2001 (codfw)
    • Did not work as expected. There was no downtime because the deployment only affected that one node in codfw.
  • 13:04 Amir1: deploying 8e65182 to scb2001
    • We fixed the problem in 295214
    • Worked perfectly fine
  • 13:06 Amir1: deploying 8e65182 to all ores nodes

Conclusions

A very shallow explanation would blame the code that reads the config directories, which has changed a lot and is now in a rather stable state, but stopping there is dangerous. What we really need is a safe method for deploying ORES, which is what we used the second time today: deploy to a single node first, check that it works, and only then deploy to all nodes. The only thing left is to document those steps.
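The single-node-first procedure can be sketched in Python as follows. This is only an outline under stated assumptions, not the actual deployment tooling: `staged_deploy` is a hypothetical name, and `deploy`, `is_healthy` and `rollback` are hypothetical hooks that the caller would have to supply.

```python
def staged_deploy(nodes, deploy, is_healthy, rollback):
    """Deploy to the first node, verify it, then deploy to the rest.

    `deploy(node)`, `is_healthy(node) -> bool` and `rollback(node)` are
    hypothetical hooks supplied by the caller.
    Returns True if every node ended up healthy on the new version.
    """
    if not nodes:
        return True

    # The first node acts as the canary (scb2001 played this role here).
    canary, rest = nodes[0], nodes[1:]
    deploy(canary)
    if not is_healthy(canary):
        # Only one node was touched, so the service as a whole stays up.
        rollback(canary)
        return False

    deployed = [canary]
    for node in rest:
        deploy(node)
        deployed.append(node)
        if not is_healthy(node):
            # Undo everything deployed so far, most recent node first.
            for done in reversed(deployed):
                rollback(done)
            return False
    return True
```

The point of this design is that a bad commit is caught while it affects one node only, which matches what happened with the 12:13 deployment to scb2001.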

Actionables

  • Status: In progress. Document safe steps to deploy ORES in prod (bug T138234)