Portal:Cloud VPS/Admin/RabbitMQ
Many OpenStack services communicate with one another via rabbitmq. For example:
- Nova services relay messages to nova-conductor via rabbitmq, and nova-conductor marshals database reads and writes
- When an instance is created via nova-api, nova-api passes a rabbitmq message to nova-scheduler, which then schedules the VM onto a nova-compute node (again via a rabbit message)
- nova-scheduler assesses capacity of compute nodes via rabbitmq messages
- designate-sink subscribes to rabbitmq notifications in order to detect and respond to VM creation/deletion
- etc.
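These services show up on the broker as queues. To get a rough picture of the traffic, you can list the queues on one of the rabbit nodes; the grep pattern below is only illustrative, and the actual queue names depend on which services are deployed:
root@cloudrabbitXXXX:~# rabbitmqctl list_queues name messages consumers | grep -E 'conductor|scheduler|compute|notifications'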
When VM creation is failing, very often the issue is with rabbitmq. Typically rabbit can be restarted with minimal harm, which will prompt all clients to reconnect. Restart each rabbit service one at a time and wait several minutes for it to stabilize before restarting the next.
root@cloudrabbit1001:~# systemctl restart rabbitmq-server
root@cloudrabbit1002:~# systemctl restart rabbitmq-server
root@cloudrabbit1003:~# systemctl restart rabbitmq-server
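After each restart, it is a good idea to confirm that the node has rejoined before moving on to the next one; the 'Running Nodes' section of the cluster status (see Checking cluster health below) should list all three nodes again:
root@cloudrabbit1001:~# rabbitmqctl cluster_status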
Operations
Documentation about some common operations.
Operation: depool a rabbitmq node
For the most part, a single rabbitmq node doesn't need to be depooled for brief periods of unavailability. Clients will just connect to the next rabbitmq server (hopefully).
However, a hard depool may be required for maintenance or any other operation during which it is undesirable for a rabbitmq server to receive client traffic.
Go to the operations/dns.git repository and change the CNAMEs pointing at the server you want to depool (Example gerrit patch against cloudcontrol nodes).
Example:
user@laptop:~/git/wmf/operations/dns:~$ git grep rabbitmq
templates/wikimediacloud.org:rabbitmq01 5M IN CNAME cloudcontrol2001-dev.wikimedia.org.
templates/wikimediacloud.org:rabbitmq02 5M IN CNAME cloudcontrol2004-dev.wikimedia.org.
templates/wikimediacloud.org:rabbitmq03 5M IN CNAME cloudcontrol2005-dev.wikimedia.org.
templates/wikimediacloud.org:rabbitmq01 5M IN CNAME cloudrabbit1001.wikimedia.org.
templates/wikimediacloud.org:rabbitmq02 5M IN CNAME cloudrabbit1002.wikimedia.org.
templates/wikimediacloud.org:rabbitmq03 5M IN CNAME cloudrabbit1003.wikimedia.org.
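The depool patch then points the CNAME of the node being depooled at one of the remaining nodes. A sketch of what such a change might look like (hostnames here just follow the grep output above; the real patch goes through normal gerrit review):
-rabbitmq02 5M IN CNAME cloudrabbit1002.wikimedia.org.
+rabbitmq02 5M IN CNAME cloudrabbit1001.wikimedia.org.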
Operation: new rabbitmq user
- Create a rabbitmq user (for example, for a new openstack service) and grant it access privileges.
root@cloudrabbitXXXX:~# rabbitmqctl add_user username password
root@cloudrabbitXXXX:~# rabbitmqctl set_permissions "username" ".*" ".*" ".*"
It should be enough to create the user on one of the rabbit nodes; it will be replicated to the others.
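To double-check that the new user and its permissions have replicated, you can list them from a node other than the one where the user was created:
root@cloudrabbitXXXX:~# rabbitmqctl list_users
root@cloudrabbitXXXX:~# rabbitmqctl list_user_permissions username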
HA setup
For redundancy we use a cluster of three rabbitmq servers (cloudrabbit100[1-3]) in a primary/secondary relationship. Some documentation about how this is set up can be found at openstack. Most of the pieces of this are puppetized, but when standing up a new cluster a couple of manual steps are needed on each secondary.
On each secondary host (here the primary host is cloudrabbit1001 and the secondary is cloudrabbit1002):
root@cloudrabbit1002:~# rabbitmqctl stop_app
root@cloudrabbit1002:~# rabbitmqctl join_cluster rabbit@cloudrabbit1001.private.eqiad.wikimedia.cloud
root@cloudrabbit1002:~# rabbitmqctl start_app
root@cloudrabbit1002:~# rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
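Once the secondary has joined, it is worth confirming from either node that the cluster sees all the members and that the ha-all policy is in place:
root@cloudrabbit1001:~# rabbitmqctl cluster_status
root@cloudrabbit1001:~# rabbitmqctl list_policies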
Resetting the HA setup
Several times we have run into issues with the HA setup (e.g. T320232) that we could only fix by resetting the cluster completely, using the following procedure:
- You'll want a shell on all three rabbit nodes, for starters:
cloudrabbit100[123].eqiad.wmnet
- On all nodes, stop puppet from messing with us by running sudo disable-puppet
- sudo rabbitmqctl cluster_status on cloudrabbit1001 should claim that all three nodes are up (the 'Running Nodes' section is the interesting bit)
- First on 1003, then on 1002, run sudo rabbitmqctl stop_app and then sudo rabbitmqctl reset, then confirm that 1001 agrees that the node is no longer part of the cluster by running sudo rabbitmqctl cluster_status on 1001.
- Resetting the last node (1001) sometimes just works and sometimes is weird... it might say something like "I can't reset when nothing is running" or it might work (not clear why)
- On 1001, you can now run sudo rabbitmqctl start_app, then enable-puppet, then run-puppet-agent. You should see a bunch of puppet output about creating Rabbit users.
- On 1002, then on 1003, run rabbitmqctl join_cluster rabbit@cloudrabbit1001.private.eqiad.wikimedia.cloud, then rabbitmqctl start_app, then rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
- On 1002, then on 1003, run enable-puppet and run-puppet-agent
- Check that all 3 nodes are part of the cluster by running sudo rabbitmqctl cluster_status on 1001
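Once puppet has recreated the rabbit users and all three nodes have rejoined, a quick sanity check is to confirm that the openstack users and their queues are back (exact names vary by deployment):
$ sudo rabbitmqctl list_users
$ sudo rabbitmqctl list_queues name messages consumers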
Troubleshooting
Lists
$ sudo rabbitmqctl list_exchanges
$ sudo rabbitmqctl list_channels
$ sudo rabbitmqctl list_connections
$ sudo rabbitmqctl list_consumers
$ sudo rabbitmqctl list_queues
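A rapidly growing queue is often the first symptom of a stuck or missing consumer; one way to find the deepest queues is to sort the list_queues output by message count, for example:
$ sudo rabbitmqctl list_queues name messages consumers | sort -k2 -n | tail -n 20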
Logs
/var/log/rabbitmq/rabbit@<hostname>.log
Check local server health
$ sudo rabbitmqctl status
Checking cluster health
If cluster_status hangs, check for stuck processes (see the section below).
$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1003 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1004,rabbit@cloudcontrol1003]},
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[]},
 {alarms,[{rabbit@cloudcontrol1004,[]},{rabbit@cloudcontrol1003,[]}]}]
Viewing stuck/suspicious processes
Note: Suspicious processes are not always a problem. However, if you find a large number of suspicious processes that are not decreasing, this usually indicates a larger issue.
$ sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
2019-07-23 21:34:54 There are 2247 processes.
2019-07-23 21:34:54 Investigated 0 processes this round, 5000ms to go.
...
2019-07-23 21:34:58 Investigated 0 processes this round, 500ms to go.
2019-07-23 21:34:59 Found 0 suspicious processes.
Viewing unacknowledged messages
$ sudo rabbitmqctl list_channels connection messages_unacknowledged
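If a channel reports a large unacknowledged count, the connection column can be cross-referenced against the connection list to see which client host and user it belongs to:
$ sudo rabbitmqctl list_connections name peer_host user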
Recovering from split-brain partitioned nodes
When cluster members lose connectivity with each other they can become partitioned (split-brain). You can check for partitioned hosts with the following command:
$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1003 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1003]},                            # this line should have both hosts listed
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[{rabbit@cloudcontrol1003,[rabbit@cloudcontrol1004]}]},   # this line should NOT have any hosts listed
 {alarms,[{rabbit@cloudcontrol1003,[]}]}]
When this happens you will typically see log messages like the following in /var/log/rabbitmq/rabbit@<hostname>.log:
=ERROR REPORT==== 2-Dec-2019::00:08:08 ===
Channel error on connection <0.27123.2916> (208.80.154.23:52790 -> 208.80.154.23:5672, vhost: '/', user: 'nova'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_f48d171794a340' in vhost '/'"
To recover the cluster you will need to restart rabbitmq:
sudo systemctl restart rabbitmq-server
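After the restart, cluster_status should again list every node under running_nodes and show an empty partitions entry:
$ sudo rabbitmqctl cluster_status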
Failed to consume message - access to vhost refused
When this error happens on any of the openstack components:
Failed to consume message from queue: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down: amqp.exceptions.InternalError: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down
You can try stopping the app, resetting the node, starting the app again, and restarting the vhost:
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app
sudo rabbitmqctl restart_vhost
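One way to confirm the vhost is serving traffic again is to list its queues, which should now succeed instead of erroring out; the affected openstack services should also stop logging the error once they reconnect:
sudo rabbitmqctl list_queues -p / name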