Incidents/2018-02-14 labvirt1008-failure

From Wikitech

Summary

Labvirt1008 seems to have overheated and gone down. This affects tenants as well as virtual VPS infrastructure.

Timeline

Conclusions

We know that our instance storage is local and ephemeral. We should ensure that this is documented for tenants in easy-to-find places, and re-verify that our mechanisms for keeping critical redundant components spread across labvirts are working. In our world, though, a single hypervisor is a special snowflake, and I believe we should have been paged on this outage, but we seem not to have been. It was my understanding that, if nothing else, a full instance storage partition should have paged, as should the failure of that check in this case.

Actionables

  • Coordinate with DC Ops to deal with the overheating (phab:T187292)
  • Look at moving tenant instances to another labvirt (we should have a standing spare)
  • Investigate what should have paged and why it did not (and fix it)