Incidents/2019-04-15 maps


Summary

Tilerator and Kartotherian failed to restart during a planned restart for an OpenSSL upgrade. The issue was tracked down to the configuration not being readable; permissions were manually reset and the services recovered.

Impact

  • maps.wikimedia.org was partially unavailable for about 10 minutes. See graph.

Detection

  • problem was detected by multiple Icinga checks.

Timeline

  • 13:56Z: rolling restart of kartotherian + tilerator started
  • 14:04Z: first Icinga alert about kartotherian and tilerator being down on maps1001
  • ~14:16Z: permission reset for /srv/deployment/kartotherian on maps100[23]
  • 14:17Z: recovery of kartotherian on maps100[23]
  • 14:18Z: permission reset on /srv/deployment/kartotherian and /srv/deployment/tilerator for all maps servers
  • 14:20Z: last direct recovery message from Icinga
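
The permission resets in the timeline were done by hand. As an illustration only, a pre-restart check for unreadable files under a deployment directory could look like the sketch below. It is not the tooling used during the incident: the function names are hypothetical, the service account's uid and gids must be supplied by the caller, and the check looks only at classic mode bits (it ignores ACLs and directory traversal permissions).

```python
import os
import stat


def is_readable_by(st, uid, gids):
    """Return True if the mode bits in `st` grant read access to uid/gids.

    Simplified model: owner, then group, then other. ACLs are ignored.
    """
    if uid == 0:
        return True  # root bypasses permission bits
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)


def find_unreadable(root, uid, gids):
    """List files under `root` that the given uid/gids could not read."""
    unreadable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # Treat files we cannot even stat (e.g. broken symlinks)
                # as unreadable.
                unreadable.append(path)
                continue
            if not is_readable_by(st, uid, gids):
                unreadable.append(path)
    return sorted(unreadable)
```

Run against /srv/deployment/kartotherian and /srv/deployment/tilerator with the uid and gids of whichever account runs each service, a non-empty result would flag the problem before the restart rather than after it.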

Conclusions

  • Restarting a service should be a trivial operation, but in this case it was not. The restart was not properly tested on a single host before the whole clusters were restarted.
  • The wrong permissions on the config files are not yet explained.
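
The first conclusion amounts to "stop at the first host that does not come back". A minimal sketch of that pattern follows. The `restart` and `is_healthy` callables are hypothetical placeholders for whatever mechanism actually restarts the service and checks it; this is not an existing Wikimedia tool.

```python
def rolling_restart(hosts, restart, is_healthy):
    """Restart hosts one at a time, aborting at the first unhealthy host.

    `restart(host)` and `is_healthy(host)` are caller-supplied callables
    (hypothetical here). Because the loop stops on the first failure, the
    first host in the list effectively acts as the canary.
    """
    done = []
    for host in hosts:
        restart(host)
        if not is_healthy(host):
            raise RuntimeError(
                f"{host} unhealthy after restart; "
                f"stopping after {len(done)} healthy host(s)"
            )
        done.append(host)
    return done
```

With this shape, a restart that fails on the first host leaves the remaining hosts untouched and still serving traffic.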

What went well?

  • good collaboration of multiple people to resolve the issue

What went poorly?

  • a trivial operation went wrong
  • lack of focus by the operator during the restart
  • restart was not tested

Where did we get lucky?

  • N/A

Links to relevant documentation

It is not yet clear what went wrong in this case or what should be documented. The Maps runbook is available, but it does not contain anything that would have helped here.

Actionables