Incident documentation/20151123-LabsOutage

Summary

As part of switching primary Labs DNS service from holmium to labservices1001, Andrew caused an outage of labs-recursor0 and labs-recursor1, thus breaking resolution of internal Labs IPs from about 05:22 to about 05:31.

Timeline

  • [previously] Andrew is moving Labs designate and DNS services from holmium to labservices1001, in pursuit of the ultimate renaming of holmium.
  • [05:07] Andrew merges a patch which is meant, via hiera, to exchange the assignments of labs-recursor0 and labs-recursor1. This is consistent with the 'primary service on labservices1001' goal, but ignores the fact that the IP for labs-recursor0 is routed to the rack containing holmium and /not/ routed to the rack containing labservices1001, and the reverse holds for the IP for labs-recursor1. So, this patch should have broken DNS immediately. It did not, due to a mistake in the patch which accidentally assigned the IP for labs-recursor0 to both holmium and labservices1001. Consequently labs-recursor0 still works and labs-recursor1 does not.
  • [05:17] Andrew merges a second patch which corrects the typo. At this point labs-recursor0 is assigned to labservices1001 and labs-recursor1 to holmium, both unroutable.
  • [05:22] The above patch is applied, and the first diamond alerts start showing up about DNS resolution failure.
  • [05:30] Andrew realizes the source of the problem and submits a patch returning the IPs to their original homes.
  • [05:32] The above patch is merged, and normal service is restored.
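
The routing constraint behind this timeline can be sketched as a small pre-merge check. This is only an illustration, not something that existed at the time of the incident: the rack labels (rack-A, rack-B), the dictionary layout, and the check_assignments helper are all hypothetical, and the real hiera data is not reproduced here. The sketch assumes each recursor IP is reachable in exactly one rack, as the timeline describes.

```python
# Hypothetical rack labels; the report does not name the actual racks.
HOST_RACK = {"holmium": "rack-A", "labservices1001": "rack-B"}

# Rack to which each recursor service IP is routed. Per the timeline, the IP for
# labs-recursor0 reaches only holmium's rack and the IP for labs-recursor1
# reaches only labservices1001's rack.
SERVICE_ROUTED_TO = {"labs-recursor0": "rack-A", "labs-recursor1": "rack-B"}


def check_assignments(assignments):
    """Return a list of problems found in a {host: service} assignment."""
    problems = []
    seen = {}  # service -> first host it was assigned to
    for host, service in sorted(assignments.items()):
        if service in seen:
            problems.append(f"{service} assigned to both {seen[service]} and {host}")
        else:
            seen[service] = host
        # A service IP only works on a host in the rack the IP is routed to.
        if HOST_RACK[host] != SERVICE_ROUTED_TO[service]:
            problems.append(f"{service} on {host} is unroutable")
    for service in sorted(set(SERVICE_ROUTED_TO) - set(seen)):
        problems.append(f"{service} is not assigned to any host")
    return problems


# The three states described in the timeline.
original = {"holmium": "labs-recursor0", "labservices1001": "labs-recursor1"}
first_patch = {"holmium": "labs-recursor0", "labservices1001": "labs-recursor0"}
second_patch = {"holmium": "labs-recursor1", "labservices1001": "labs-recursor0"}

for name, state in [("original", original),
                    ("05:07 patch", first_patch),
                    ("05:17 patch", second_patch)]:
    print(name, check_assignments(state))
```

Under these assumptions the original assignment reports no problems, the 05:07 patch reports the duplicate assignment of labs-recursor0 along with the missing labs-recursor1, and the 05:17 patch reports both services as unroutable.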