Portal:Cloud VPS/Admin/Runbooks/RabbitmqNetworkPartition

The procedures in this runbook require admin permissions to complete.

Error / Incident

This alert fires when the rabbitmq servers have lost consensus with one another. This seems to happen now and then, for unexplained reasons: the servers can talk to their clients but not to each other. When this happens we see a lot of RPC and other messaging timeouts in OpenStack services.

A state of -1 means that the metric is not being collected. There may or may not be an actual network partition.
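Each rabbitmq node also exposes a Prometheus endpoint on port 15692 (see the listener list below). If you want to see what that exporter itself is publishing, a rough check is to grep its scrape output for anything partition-related; note that the exact metric name backing this alert is not documented on this page, so treat this only as an exploratory look:

curl -s http://localhost:15692/metrics | grep -i partition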

Debugging

This alert is based on the output of rabbitmqctl cluster_status. Here's what it looks like when everything is healthy:

andrew@cloudrabbit1003:~$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudrabbit1003 ...
Basics

Cluster name: rabbit@cloudrabbit1003.wikimedia.org

Disk Nodes

rabbit@cloudrabbit1001
rabbit@cloudrabbit1002
rabbit@cloudrabbit1003

Running Nodes

rabbit@cloudrabbit1001
rabbit@cloudrabbit1002
rabbit@cloudrabbit1003

Versions

rabbit@cloudrabbit1001: RabbitMQ 3.9.13 on Erlang 24.2.1
rabbit@cloudrabbit1002: RabbitMQ 3.9.13 on Erlang 24.2.1
rabbit@cloudrabbit1003: RabbitMQ 3.9.13 on Erlang 24.2.1

Maintenance status

Node: rabbit@cloudrabbit1001, status: not under maintenance
Node: rabbit@cloudrabbit1002, status: not under maintenance
Node: rabbit@cloudrabbit1003, status: not under maintenance

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rabbit@cloudrabbit1001, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1001, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1001, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1001, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1001, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Node: rabbit@cloudrabbit1002, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1002, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1002, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1002, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1002, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Node: rabbit@cloudrabbit1003, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1003, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1003, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1003, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1003, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS

Feature flags

Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: stream_queue, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled

Note that 'Network Partitions' shows as '(none)'. In case of a partition, that section will list the partitioned servers. By running cluster_status on all three nodes it should be obvious which node has fallen out of consensus.
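To compare just the partition view from each node without scanning the full report, one option (assuming jq is installed on the host, and that the JSON formatter of recent rabbitmqctl releases includes a partitions field) is:

sudo rabbitmqctl cluster_status --formatter json | jq '.partitions'

Run this on cloudrabbit1001, cloudrabbit1002, and cloudrabbit1003: a healthy node reports no partitions, while a partitioned node lists the peers it cannot reach.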

Most often this can be resolved by restarting the rabbit application on the failing host:

andrew@cloudcontrol2001-dev:~$ sudo rabbitmqctl stop_app
Stopping rabbit application on node rabbit@cloudcontrol2001-dev ...
andrew@cloudcontrol2001-dev:~$ sudo rabbitmqctl start_app
Starting node rabbit@cloudcontrol2001-dev ...
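After the restart, re-run the status check on that node (and ideally on its peers) and confirm that 'Network Partitions' is back to '(none)':

sudo rabbitmqctl cluster_status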

If that does not resolve the issue, it might be necessary to reset the failing node, or to reset the entire cluster.
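For reference, resetting a single node and rejoining it to the cluster usually looks something like the sketch below, run on the failing node. Here rabbit@cloudrabbit1001 stands in for whichever peer is still healthy; adjust as needed. This is a sketch rather than a tested procedure, and note that reset wipes the node's local rabbit state:

# on the failing node: stop the app, wipe its local state, rejoin, restart
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl join_cluster rabbit@cloudrabbit1001
sudo rabbitmqctl start_app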