Incident documentation/20160319-Ores

Summary

ORES went down and responded slowly for ~2 hours today.

Timeline

  • 1930 UTC: New deployment begins
  • 2005 UTC: ORES begins to be overloaded
  • 2025 UTC: A problem with old Jessie installs is discovered (Phab:T130463). It turns out to be a pip versioning issue: https://github.com/pypa/pip/issues/214
  • 2130 UTC: A new cluster is built and requests are being served at the rate that they come in
  • 2300 UTC: A new cluster configuration is complete.

Conclusions

  1. Pip does not remove old versions when installing new wheels. This will need to be done manually.
  2. Our precaching utility will back up during a short outage and unleash its whole backlog of requests on the service when it comes back online.
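
One possible way to soften the second problem is to cap the backlog and discard stale entries, so that a recovering service only receives a bounded number of recent requests. The following is a minimal sketch of that idea, not the actual precaching utility: the class name BoundedBacklog, its parameters, and the drain limit are all hypothetical and chosen for illustration.

```python
import time
from collections import deque


class BoundedBacklog:
    """Hypothetical holding area for precache requests during an outage.

    Keeps at most `max_pending` requests (the oldest are dropped first)
    and skips any request older than `max_age_seconds` when draining.
    """

    def __init__(self, max_pending, max_age_seconds, clock=time.monotonic):
        # A deque with maxlen silently discards the oldest item on overflow.
        self._items = deque(maxlen=max_pending)
        self._max_age = max_age_seconds
        self._clock = clock

    def add(self, request):
        """Queue a request together with the time it was queued."""
        self._items.append((self._clock(), request))

    def drain(self, limit):
        """Return up to `limit` still-fresh requests, oldest first.

        Stale requests encountered along the way are thrown away.
        """
        now = self._clock()
        fresh = []
        while self._items and len(fresh) < limit:
            queued_at, request = self._items.popleft()
            if now - queued_at <= self._max_age:
                fresh.append(request)
        return fresh
```

A caller would invoke drain with a small limit on each tick after the service recovers, so the backlog is released gradually instead of all at once. Whether dropping old precache requests is acceptable depends on how the cache is used, so treat the limits as tunable assumptions.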

Actionables