Portal:Cloud VPS/Admin/Runbooks/MainProxyDown

From Wikitech

The MainProxyDown alert fires when the monitoring system cannot connect to the primary web proxy service address. This is a highly user-facing outage and likely constitutes an incident.

The procedures in this runbook require admin permissions to complete.

Debugging

In general, there are two components that could fail:

  • The proxy stack itself (i.e. nginx + Redis). A failure here should be detected by the separate per-instance MainProxyInstanceDown (runbook) alert. If that alert is also firing, you can generally focus on the proxy stack and ignore keepalived-related issues.
    • If one of the nginx instances has failed and the other is alive, you may be able to save some time by stopping the keepalived service (and disabling Puppet) on the broken instance, which fails traffic over to the working one.
  • keepalived, which is responsible for ensuring that exactly one of the instances has the service IPs allocated.
    • If neither or both of the instances have the service IPs assigned to the interface, try restarting the keepalived service.
    • If the IPs are correctly allocated on exactly one instance, check that the Neutron ports are configured correctly.
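
The decision flow above can be summarised as a small function. This is only an illustrative sketch, not tooling that exists on the proxies: the InstanceState model, the triage function and the instance names are hypothetical, and the two boolean fields are assumed to be filled in by hand after checking each instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceState:
    """Observed state of one proxy instance (hypothetical model)."""
    name: str
    proxy_healthy: bool      # nginx + Redis are responding on this instance
    holds_service_ips: bool  # the service IPs are assigned to its interface


def triage(instances: list[InstanceState]) -> str:
    """Return the suggestion from the list above that matches the observed state."""
    broken = [i for i in instances if not i.proxy_healthy]
    healthy = [i for i in instances if i.proxy_healthy]
    holders = [i for i in instances if i.holds_service_ips]

    if broken and healthy:
        # One side still works: fail traffic over to it first.
        names = ", ".join(i.name for i in broken)
        return f"stop keepalived and disable Puppet on: {names}"
    if broken:
        # Nothing to fail over to: the proxy stack itself needs fixing.
        return "fix the proxy stack (nginx + Redis) on all instances"
    if len(holders) != 1:
        # keepalived should leave the service IPs on exactly one instance.
        return "restart keepalived"
    return "check the Neutron ports"
```

For example, with a hypothetical healthy instance proxy-a and a broken instance proxy-b, triage returns the failover suggestion for proxy-b. With two healthy instances where neither or both hold the service IPs, it returns the keepalived suggestion.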

Common issues

Support contacts

Old incidents