Incidents/20170629-ganeti


Various services were unavailable for ~40 mins

Summary

A human error in the eqiad Ganeti cluster caused widespread issues across various services. Services impacted:

* poolcounter
* OTRS
* LDAP for labs/authentication for various tools (mostly a performance impact, as failover to the other LDAP server kicks in)
* Grafana
* mx1001 (minor, as incoming email is routed to mx2001 instead)
* etherpad (minor)
* bromine (various static microsites)
* kubernetes (unimportant, no services there yet)

The mistake uncovered issues in our remote access infrastructure.

Timeline

  • 15:20 akosiaris, in the final cleanup steps of a migration to multi-row Ganeti in eqiad, accidentally removes networking for 3 Ganeti hardware nodes. This removes networking from all VMs on those hosts. He realizes this immediately and informs others, then tries to log in to the iDRACs of those hosts to fix the issue manually, but none of the 3 hosts' iDRACs respond.
  • 15:21 Puppet alerts start pouring in, as one of the services on the disconnected VMs is PuppetDB.
  • 15:25 Realization that racktables and LDAP auth are both down.
  • 15:30 poolcounter1001 gets depooled from the mediawiki config, as it is one of the affected VMs.
  • 15:33 More hosts' iDRACs are found not to be responding to requests. It becomes clear the problem is widespread, but no root cause is known.
  • 15:39 ganeti1002 has been forcefully rebooted (flea power drain). Its iDRAC is still not responding, but the host is responsive again.
  • 15:40 akosiaris tries to disable puppet agents in eqiad and esams using cumin to avoid some Icinga spam, but fails as PuppetDB (which cumin queries to select hosts) is unresponsive. While some alternatives are proposed, the task is not deemed important enough to justify more investigation/work at that point in time.
  • 15:46 It becomes clear we have 50x errors on upload. The cause is Thumbor using only one poolcounter server, and that server is poolcounter1001 (a client-side failover sketch follows this timeline).
  • 15:48 akosiaris tries to fail over the VMs aluminium.wikimedia.org, bromine.eqiad.wmnet, etherpad1001.eqiad.wmnet, mx1001.wikimedia.org and seaborgium.wikimedia.org to the last remaining Ganeti node. Apart from seaborgium, none of them succeed, as their disks on that node are degraded. This happened because ganeti1001 was rebooted a few minutes before the human error and was still syncing the disks. LDAP is back up, however.
  • 15:56 Everything is back, recoveries start pouring in
  • 19:06 poolcounter1001 gets repooled
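
The 50x errors came from a single point of failure that can be avoided on the client side. Below is a minimal Python sketch of trying a list of PoolCounter servers in order and degrading gracefully when none are reachable, roughly what T169312/T169313 ask for. It is not Thumbor's actual implementation; the server list, port number and plain TCP check are illustrative assumptions.

  import socket

  # Hypothetical server list and port; not taken from the real Thumbor
  # configuration.
  POOLCOUNTER_SERVERS = [
      ("poolcounter1001.eqiad.wmnet", 7531),
      ("poolcounter1002.eqiad.wmnet", 7531),
  ]

  def connect_to_poolcounter(servers=POOLCOUNTER_SERVERS, timeout=1.0):
      """Return a socket to the first reachable PoolCounter server, or None.

      Returning None lets the caller skip throttling instead of failing
      the whole request when every server is down.
      """
      for host, port in servers:
          try:
              return socket.create_connection((host, port), timeout=timeout)
          except OSError:
              continue  # unreachable or timed out; try the next server
      return None

  if __name__ == "__main__":
      conn = connect_to_poolcounter()
      if conn is None:
          print("no PoolCounter reachable; proceeding without throttling")
      else:
          print("connected to", conn.getpeername())
          conn.close()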

Conclusions

  • Humans are the weakest link
  • Our remote access infrastructure has had serious problems for some time, to the point that we could not use it when it was most needed, and we did not know (a reachability-check sketch follows this list)
  • Thumbor only uses 1 poolcounter server
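
The second conclusion points at a monitoring gap: the iDRACs had been unresponsive for a while and nobody noticed until they were needed. Below is a minimal sketch of a routine reachability check, assuming the management interfaces answer TCP on port 443; the hostnames are placeholders, and in practice the list would come from the host inventory and the result would feed Icinga rather than stdout.

  import socket

  # Placeholder management hostnames; a real check would read these from
  # the host inventory rather than hard-coding them.
  IDRAC_HOSTS = [
      "ganeti1001.mgmt.eqiad.wmnet",
      "ganeti1002.mgmt.eqiad.wmnet",
      "ganeti1003.mgmt.eqiad.wmnet",
  ]

  def unreachable_idracs(hosts=IDRAC_HOSTS, port=443, timeout=3.0):
      """Return the management hosts that refuse or time out a TCP connection.

      A plain TCP check on the assumed web UI port is crude, but run
      periodically it would have flagged the dead iDRACs long before an
      outage made them necessary.
      """
      down = []
      for host in hosts:
          try:
              socket.create_connection((host, port), timeout=timeout).close()
          except OSError:
              down.append(host)
      return down

  if __name__ == "__main__":
      for host in unreachable_idracs():
          print("iDRAC unreachable:", host)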

Actionables

  • Thumbor should use >1 poolcounter (task T169312)
  • Thumbor shouldn't fail when poolcounter fails (task T169313)
  • Thumbor should alert/page when thumbs aren't rendered (task T169316)
  • Cumin masters: simplify usage in case of emergency (task T169304)
  • A single PuppetDB per-datacenter going down shouldn't impact puppet runs (task T169318)
  • iDRACs unresponsive (task T169321)
  • OTRS in a HA/Failover Setup (task T169322)
  • Rack documentation app (racktables) in a HA/Failover Setup (task TX)