Portal:Cloud VPS/Admin/RabbitMQ
Many OpenStack services communicate with one another via rabbitmq. For example:
- Nova services relay messages to nova-conductor via rabbitmq, and nova-conductor marshals database reads and writes
- When an instance is created via nova-api, nova-api passes a rabbitmq message to nova-scheduler, which then schedules the VM onto a nova-compute node (again via a rabbit message)
- nova-scheduler assesses capacity of compute nodes via rabbitmq messages
- designate-sink subscribes to rabbitmq notifications in order to detect and respond to VM creation/deletion
- etc.
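These services show up on the broker as queues. To get a rough picture of the traffic, you can list the queues on one of the rabbit nodes; the grep pattern below is only illustrative, and the actual queue names depend on which services are deployed:
root@cloudrabbitXXXX:~# rabbitmqctl list_queues name messages consumers | grep -E 'conductor|scheduler|compute|notifications'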
When VM creation is failing, very often the issue is with rabbitmq. Typically rabbit can be restarted with minimal harm, which will prompt all clients to reconnect. Restart each rabbit service one at a time and wait several minutes for it to stabilize before restarting the next.
root@cloudrabbit1001:~# systemctl restart rabbitmq-server
root@cloudrabbit1002:~# systemctl restart rabbitmq-server
root@cloudrabbit1003:~# systemctl restart rabbitmq-server
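After each restart, it is a good idea to confirm that the node has rejoined before moving on to the next one; the 'Running Nodes' section of the cluster status (see Checking cluster health below) should list all three nodes again:
root@cloudrabbit1001:~# rabbitmqctl cluster_status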
Operations
Documentation about some common operations.
Operation: depool a rabbitmq node
For the most part, a single rabbitmq node doesn't need to be depooled for brief periods of unavailability. Clients will just connect to the next rabbitmq server (hopefully).
However, a hard depool may be required for maintenance or any other operation during which it is undesirable for a rabbitmq server to receive client traffic.
Go to the operations/dns.git repository and change the CNAMEs pointing at the server you want to depool (Example gerrit patch against cloudcontrol nodes).
Example:
user@laptop:~/git/wmf/operations/dns:~$ git grep rabbitmq
templates/wikimediacloud.org:rabbitmq01 5M IN CNAME cloudcontrol2001-dev.wikimedia.org.
templates/wikimediacloud.org:rabbitmq02 5M IN CNAME cloudcontrol2004-dev.wikimedia.org.
templates/wikimediacloud.org:rabbitmq03 5M IN CNAME cloudcontrol2005-dev.wikimedia.org.
templates/wikimediacloud.org:rabbitmq01 5M IN CNAME cloudrabbit1001.wikimedia.org.
templates/wikimediacloud.org:rabbitmq02 5M IN CNAME cloudrabbit1002.wikimedia.org.
templates/wikimediacloud.org:rabbitmq03 5M IN CNAME cloudrabbit1003.wikimedia.org.
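The depool patch then points the CNAME of the node being depooled at one of the remaining nodes. A sketch of what such a change might look like (hostnames here just follow the grep output above; the real patch goes through normal gerrit review):
-rabbitmq02 5M IN CNAME cloudrabbit1002.wikimedia.org.
+rabbitmq02 5M IN CNAME cloudrabbit1001.wikimedia.org.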
Operation: new rabbitmq user
- Create a rabbitmq user (for example, for a new openstack service) and grant it access privileges.
root@cloudrabbitXXXX:~# rabbitmqctl add_user username password
root@cloudrabbitXXXX:~# rabbitmqctl set_permissions "username" ".*" ".*" ".*"
It should be enough to create the user on one of the rabbit nodes; it will be replicated to the others.
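To double-check that the new user and its permissions have replicated, you can list them from a node other than the one where the user was created:
root@cloudrabbitXXXX:~# rabbitmqctl list_users
root@cloudrabbitXXXX:~# rabbitmqctl list_user_permissions username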
HA setup
For redundancy we use a cluster of three rabbitmq servers (cloudrabbit100[1-3]) in a primary/secondary relationship. Some documentation about how this is set up can be found at openstack. Most of the pieces of this are puppetized, but when standing up a new cluster a couple of manual steps are needed on each secondary.
On each secondary host (here the primary host is cloudrabbit1001 and the secondary is cloudrabbit1002):
root@cloudrabbit1002:~# rabbitmqctl stop_app
root@cloudrabbit1002:~# rabbitmqctl join_cluster rabbit@cloudrabbit1001.private.eqiad.wikimedia.cloud
root@cloudrabbit1002:~# rabbitmqctl start_app
root@cloudrabbit1002:~# rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
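Once the secondary has joined, it is worth confirming from either node that the cluster sees all the members and that the ha-all policy is in place:
root@cloudrabbit1001:~# rabbitmqctl cluster_status
root@cloudrabbit1001:~# rabbitmqctl list_policies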
Resetting the HA setup
Several times we have run into issues with the HA setup (e.g. T320232) that we could only fix by resetting the cluster completely, using the following procedure:
- You'll want a shell on all three rabbit nodes, for starters:
cloudrabbit100[123].eqiad.wmnet
- On all nodes, stop puppet from messing with us by running sudo disable-puppet
- sudo rabbitmqctl cluster_status on cloudrabbit1001 should claim that all three nodes are up (the 'Running Nodes' section is the interesting bit)
- First on 1003, then on 1002, run sudo rabbitmqctl stop_app and then sudo rabbitmqctl reset, then confirm that 1001 agrees that the node is no longer part of the cluster by running sudo rabbitmqctl cluster_status on 1001.
- Resetting the last node (1001) sometimes just works and sometimes is weird... it might say something like "I can't reset when nothing is running" or it might work (not clear why)
- On 1001, you can now run sudo rabbitmqctl start_app, then enable-puppet, then run-puppet-agent. You should see a bunch of puppet output about creating Rabbit users.
- On 1002, then on 1003, run rabbitmqctl join_cluster rabbit@cloudrabbit1001.private.eqiad.wikimedia.cloud, then rabbitmqctl start_app, then rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
- On 1002, then on 1003, run enable-puppet and run-puppet-agent
- Check that all 3 nodes are part of the cluster by running sudo rabbitmqctl cluster_status on 1001
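Once puppet has recreated the rabbit users and all three nodes have rejoined, a quick sanity check is to confirm that the openstack users and their queues are back (exact names vary by deployment):
$ sudo rabbitmqctl list_users
$ sudo rabbitmqctl list_queues name messages consumers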
Troubleshooting
Lists
$ sudo rabbitmqctl list_exchanges
$ sudo rabbitmqctl list_channels
$ sudo rabbitmqctl list_connections
$ sudo rabbitmqctl list_consumers
$ sudo rabbitmqctl list_queues
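A rapidly growing queue is often the first symptom of a stuck or missing consumer; one way to find the deepest queues is to sort the list_queues output by message count, for example:
$ sudo rabbitmqctl list_queues name messages consumers | sort -k2 -n | tail -n 20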
Logs
/var/log/rabbitmq/rabbit@<hostname>.log
Check local server health
$ sudo rabbitmqctl status
Checking cluster health
If cluster_status hangs, check for stuck processes (see the section below).
$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1003 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1004,rabbit@cloudcontrol1003]},
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[]},
 {alarms,[{rabbit@cloudcontrol1004,[]},{rabbit@cloudcontrol1003,[]}]}]
Viewing stuck/suspicious processes
Note: Suspicious processes are not always a problem. However, if you find a large number of suspicious processes that are not decreasing, this usually indicates a larger issue.
$ sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
2019-07-23 21:34:54 There are 2247 processes.
2019-07-23 21:34:54 Investigated 0 processes this round, 5000ms to go.
...
2019-07-23 21:34:58 Investigated 0 processes this round, 500ms to go.
2019-07-23 21:34:59 Found 0 suspicious processes.
Viewing unacknowledged messages
$ sudo rabbitmqctl list_channels connection messages_unacknowledged
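If a channel reports a large unacknowledged count, the connection column can be cross-referenced against the connection list to see which client host and user it belongs to:
$ sudo rabbitmqctl list_connections name peer_host user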
Recovering from split-brain partitioned nodes
When cluster members lose connectivity with each other they can become partitioned (split-brain). You can check for partitioned hosts with the following command:
$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1003 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1003]},                            # this line should have both hosts listed
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[{rabbit@cloudcontrol1003,[rabbit@cloudcontrol1004]}]},   # this line should NOT have any hosts listed
 {alarms,[{rabbit@cloudcontrol1003,[]}]}]
When this happens you will typically see log messages like the following in /var/log/rabbitmq/rabbit@<hostname>.log:
=ERROR REPORT==== 2-Dec-2019::00:08:08 ===
Channel error on connection <0.27123.2916> (208.80.154.23:52790 -> 208.80.154.23:5672, vhost: '/', user: 'nova'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_f48d171794a340' in vhost '/'"
To recover the cluster you will need to restart rabbitmq:
sudo systemctl restart rabbitmq-server
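After the restart, cluster_status should again list every node under running_nodes and show an empty partitions entry:
$ sudo rabbitmqctl cluster_status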
Failed to consume message - access to vhost refused
When this error happens on any of the openstack components:
Failed to consume message from queue: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down: amqp.exceptions.InternalError: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down
You can try stopping the app, resetting the node, starting the app again, and restarting the vhost:
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app
sudo rabbitmqctl restart_vhost
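One way to confirm the vhost is serving traffic again is to list its queues, which should now succeed instead of erroring out; the affected openstack services should also stop logging the error once they reconnect:
sudo rabbitmqctl list_queues -p / name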