Portal:Cloud VPS/Admin/Runbooks/SystemdUnitDownForLong

The procedures in this runbook require admin permissions to complete.

Error / Incident

A systemd unit has been down on a host for a long time.


This is a fairly generic alert, so there are many possible causes. Some things to try:

  • Try to ssh to the host and check the status of the unit:
systemctl status <unit_name>
  • Check the logs for that unit:
journalctl -u <unit_name> -n 1000

From there, you may need to consult the documentation for the specific service to find out how to debug it.
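The two checks above can be wrapped in a small helper, which may be handy when several units need the same first look. This is only a sketch: the function name triage_unit and the choice of 50 log lines are assumptions made for illustration, not part of any existing tooling. It relies only on standard systemctl and journalctl options.

```shell
#!/usr/bin/env bash
# Sketch of a first-pass triage helper for a single systemd unit.
# triage_unit is a hypothetical name, not an existing tool.
triage_unit() {
    local unit="${1:-}"
    if [ -z "$unit" ]; then
        echo "usage: triage_unit <unit_name>" >&2
        return 2
    fi

    echo "== state of ${unit} =="
    # Print a few unit properties: current state, the result of the last run,
    # the exit status of the main process, and when the unit became inactive.
    systemctl show "$unit" \
        --property=ActiveState,SubState,Result,ExecMainStatus,InactiveEnterTimestamp

    echo "== last log lines for ${unit} =="
    # --no-pager keeps the output usable when piped or pasted elsewhere.
    journalctl -u "$unit" -n 50 --no-pager
}
```

After sourcing the snippet on the affected host, run triage_unit <unit_name>. If the alert does not make the unit name obvious, systemctl list-units --state=failed lists the units currently in a failed state.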

  • Ask on IRC whether someone took some action on the host and forgot about it
  • Check SAL (or sal.toolforge.org) for entries about the host being rebooted or taken down
  • Check Netbox to see whether the host is being decommissioned or has network changes that were not applied
  • Check Puppet for recent changes that might have affected the node

Common issues

Add notes here when you encounter this alert and find its cause.

