Issues adding nodes during ASG rolling updates #8
Comments
Thank you for your time. Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:
We get at least a dozen questions through various venues every single day, often quite light on details. Getting all the details necessary to reproduce an issue, reach a conclusion, or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're
Feel free to edit out hostnames and other potentially sensitive information. When/if we have enough details and evidence we'd be happy to file a new issue. Thank you.
The lines that explain what's going on:
Your nodes cannot join a cluster, possibly because security groups restrict network communication between them, the Erlang distribution port is not open, or something similar. This plugin cannot and should not do anything about it. It assumes nodes can communicate with each other and with the EC2 API endpoints that describe instances/ASGs.
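A quick way to rule out the network explanation is to probe the relevant ports from one node to another. The sketch below is a minimal example and makes assumptions that are not stated in this thread: epmd on its default port 4369, and the Erlang distribution port at its default of 25672 (AMQP port plus 20000). The helper names `port_is_open` and `check_peer` are hypothetical. Adjust the ports if your nodes are configured differently.

```python
import socket

# Assumed defaults, not taken from this issue: epmd listens on 4369 and the
# Erlang distribution port is the AMQP port + 20000 (25672 for AMQP on 5672).
DEFAULT_PORTS = {"epmd": 4369, "erlang-distribution": 25672}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peer(host, ports=DEFAULT_PORTS, timeout=3.0):
    """Map each named port to whether it is reachable on the given host."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}
```

Calling `check_peer("<peer address>")` from a new node returns something like `{"epmd": True, "erlang-distribution": False}`. Note that an open port only shows that TCP connectivity exists; it does not prove the nodes can authenticate to each other or resolve each other's hostnames.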
I understand, but in the second log block...
These instances all have identical security groups, are in the same VPC, and are attached to the same ASG. The only difference is that the LC is based on a different AMI (because I was changing some settings in the base image and then rebuilding). I'm also able to manually talk to the EPMD port on every instance from every other instance.
I have the same issue, @chuckyz, but the owner of this repo just closed my case without giving it enough time!
This is not a support forum. This plugin does not change how clusters are formed or operate; all it does is discover peers to contact on boot unless the node is already a member of a cluster. There is nothing special about how nodes join each other when this plugin is used, and it logs the operations it performs and their outcomes extensively at debug level. If nodes cannot contact or join their peers, this is not this plugin's fault and should be resolved separately. All edge cases with this plugin (most of them have to do with initial cluster formation, when N nodes boot in parallel and there is no "existing set of nodes" to join) should be discussed on the mailing list.
The cluster formation documentation has an entire section dedicated to the primary edge case: initial formation with N nodes joining simultaneously, which creates a natural race condition.
Hello,
I use a rolling update strategy for my ASG nodes when a new LC is created. Using the ec2_asg Ansible module, this consists of increasing and then reducing the desired count until all instances on old LCs are rotated out in favor of instances running the new LC.
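For readers unfamiliar with this kind of rollout, the sequence can be sketched as a small simulation. This is a simplified illustration only, not the ec2_asg module's actual implementation; the function name `rolling_replace`, the step labels, and the `new-N` instance IDs are all hypothetical.

```python
def rolling_replace(instances, new_lc, batch_size=1):
    """Simulate replacing every instance not on new_lc, batch_size at a time.

    instances: list of (instance_id, launch_config_name) tuples.
    Returns (steps, final_instances), where steps is a list of
    (action, detail) tuples describing what a real rollout would do.
    """
    current = list(instances)
    desired = len(current)  # the steady-state desired count
    steps = []
    counter = 0
    while True:
        old = [inst for inst in current if inst[1] != new_lc]
        if not old:
            break
        batch = old[:batch_size]
        # 1. Raise the desired count so the ASG launches nodes on the new LC.
        steps.append(("set_desired", desired + len(batch)))
        launched = []
        for _ in batch:
            counter += 1
            launched.append((f"new-{counter}", new_lc))
        current.extend(launched)
        # 2. In a real rollout, wait here until the new nodes are healthy
        #    and have joined the RabbitMQ cluster.
        steps.append(("wait_healthy", [inst[0] for inst in launched]))
        # 3. Terminate the old instances and restore the desired count.
        for inst in batch:
            current.remove(inst)
            steps.append(("terminate", inst[0]))
        steps.append(("set_desired", desired))
    return steps, current
```

The important point for this issue is step 2: each new node has to join the existing cluster before the old nodes are removed, otherwise the rollout ends with nodes that never clustered.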
When this happens, I expect all newly created nodes to have no issues joining the existing nodes, as they have the same security groups, and are fully open to communicate with each other.
However, what I end up seeing is this...
This is with the following versions of things:
RPMs:
RabbitMQ:
The config file is as follows:
The newly created instances do not have such a problem after the first is created, but upon changing an LC they are definitely dropping everything.
Ideally this would have no problems so I could do rolling updates (like for Meltdown) or deal with instance outages by using mirrored queues.
In the above log, 172.31.1.17 was the first instance with a new LC, then a node 172.31.1.74 is created, which has the following in the logs...