
Issues adding nodes during ASG rolling updates #8

Closed
chuckyz opened this issue Jan 5, 2018 · 6 comments
chuckyz commented Jan 5, 2018

Hello,

I use a rolling update strategy for my ASG nodes whenever a new Launch Configuration (LC) is created. This consists of increasing/decreasing the desired count until all instances running the old LC are rotated out in favor of instances running the new LC, using the ec2_asg Ansible module.
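For reference, a minimal sketch of one such rotation step, written with boto3 rather than the ec2_asg module (the ASG name, old LC name, and polling interval are assumptions, not values from my setup):

# Hypothetical sketch of one rotation step: scale out by one, wait for all
# instances to be InService, then terminate one instance still on the old LC.
import time
import boto3

ASG_NAME = "rabbitmq-asg"   # assumption
OLD_LC = "rabbitmq-lc-v1"   # assumption
asg = boto3.client("autoscaling", region_name="us-west-2")

group = asg.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME,
                         DesiredCapacity=group["DesiredCapacity"] + 1)

# Wait until the replacement instance is InService before rotating out an old one.
while True:
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
    if all(i["LifecycleState"] == "InService" for i in group["Instances"]):
        break
    time.sleep(15)

# Terminate one old-LC instance and let the desired count shrink back.
old = next(i for i in group["Instances"]
           if i.get("LaunchConfigurationName") == OLD_LC)
asg.terminate_instance_in_auto_scaling_group(
    InstanceId=old["InstanceId"], ShouldDecrementDesiredCapacity=True)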

When this happens, I expect all newly created nodes to have no issues joining the existing nodes, as they have the same security groups and are fully open to communicate with each other.

However, what I end up seeing is this...

2018-01-05 17:50:29.679 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/[email protected] is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-01-05 17:50:29.679 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_aws
2018-01-05 17:50:29.679 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_aws
2018-01-05 17:50:29.679 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_aws does not support registration, skipping randomized startup delay.
2018-01-05 17:50:29.874 [info] <0.193.0> All discovered existing cluster peers: [email protected], [email protected], [email protected]
2018-01-05 17:50:29.875 [info] <0.193.0> Peer nodes we can cluster with: [email protected], [email protected]
2018-01-05 17:50:29.879 [warning] <0.193.0> Could not auto-cluster with node [email protected]: {badrpc,nodedown}
2018-01-05 17:50:29.883 [warning] <0.193.0> Could not auto-cluster with node [email protected]: {badrpc,nodedown}
2018-01-05 17:50:29.883 [warning] <0.193.0> Could not successfully contact any node of: [email protected],[email protected] (as in Erlang distribution). Starting as a blank standalone node...

This is with the following versions of things:

RPMs:

"https://dl.bintray.com/rabbitmq/all/rabbitmq-server/3.7.0/rabbitmq-server-3.7.0-1.el7.noarch.rpm"
"https://dl.bintray.com/rabbitmq/rpm/erlang/20/el/7/x86_64/erlang-20.1.7-1.el7.centos.x86_64.rpm"

RabbitMQ:

{running_applications,
     [{rabbitmq_peer_discovery_aws,
          "AWS-based RabbitMQ peer discovery backend","3.7.0"},
      {rabbitmq_peer_discovery_common,
          "Modules shared by various peer discovery backends","3.7.0"},
      {rabbitmq_management,"RabbitMQ Management Console","3.7.0"},
      {amqp_client,"RabbitMQ AMQP Client","3.7.0"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.7.0"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.7.0"},
      {rabbit,"RabbitMQ","3.7.0"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.7.0"},
      {recon,"Diagnostic tools for production use","2.3.2"},
      {ranch_proxy_protocol,"Ranch Proxy Protocol Transport","1.4.2"},
      {cowboy,"Small, fast, modern HTTP server.","2.0.0"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.4.0"},
      {rabbitmq_aws,
          "A minimalistic AWS API interface used by rabbitmq-autocluster (3.6.x) and other RabbitMQ plugins",
          "3.7.0"},
      {ssl,"Erlang/OTP SSL application","8.2.2"},
      {public_key,"Public key infrastructure","1.5.1"},
      {asn1,"The Erlang ASN1 compiler version 5.0.3","5.0.3"},
      {cowlib,"Support library for manipulating Web protocols.","2.0.0"},
      {crypto,"CRYPTO","4.1"},
      {xmerl,"XML parser","1.3.15"},
      {mnesia,"MNESIA  CXC 138 12","4.15.1"},
      {inets,"INETS  CXC 138 49","6.4.4"},
      {jsx,"a streaming, evented json parsing toolkit","2.8.2"},
      {os_mon,"CPO  CXC 138 46","2.4.3"},
      {lager,"Erlang logging framework","3.5.1"},
      {goldrush,"Erlang event stream processor","0.1.9"},
      {compiler,"ERTS  CXC 138 10","7.1.3"},
      {syntax_tools,"Syntax tools","2.1.3"},
      {sasl,"SASL  CXC 138 11","3.1"},
      {stdlib,"ERTS  CXC 138 10","3.4.2"},
      {kernel,"ERTS  CXC 138 10","5.4"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 20 [erts-9.1.5] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:64] [hipe] [kernel-poll:true]\n"},

The config file is as follows:

cluster_formation.peer_discovery_backend    = rabbit_peer_discovery_aws
cluster_formation.aws.region                = us-west-2
cluster_formation.aws.use_autoscaling_group = true
cluster_formation.aws.use_private_ip        = true

tcp_listen_options.backlog        = 4096
tcp_listen_options.nodelay        = true
tcp_listen_options.linger.on      = true
tcp_listen_options.linger.timeout = 0
tcp_listen_options.sndbuf         = 196608
tcp_listen_options.recbuf         = 196608

management.load_definitions = /etc/rabbitmq/definitions.json

Newly created instances have no such problem once the first one is up, but when the LC changes, the first instance with the new LC definitely drops everything and starts from scratch.

Ideally this would work without problems, so that I could do rolling updates (like for Meltdown) or ride out instance outages by using mirrored queues.
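(For the mirrored-queues part, a minimal sketch of declaring an ha-all policy through the management plugin's HTTP API; the host, credentials, and policy name here are assumptions:)

# Hypothetical sketch: mirror all queues across all nodes by PUTting an
# "ha-all" policy to the management API. %2F is the URL-encoded default vhost "/".
import requests

resp = requests.put(
    "http://localhost:15672/api/policies/%2F/ha-all",
    auth=("guest", "guest"),  # assumption: default credentials
    json={"pattern": ".*", "definition": {"ha-mode": "all"}},
)
resp.raise_for_status()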

In the above log, 172.31.1.17 was the first instance with the new LC; then a node, 172.31.1.74, is created, which has the following in its logs...

2018-01-05 17:52:57.121 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/[email protected] is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-01-05 17:52:57.121 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_aws
2018-01-05 17:52:57.122 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_aws
2018-01-05 17:52:57.122 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_aws does not support registration, skipping randomized startup delay.
2018-01-05 17:52:57.533 [info] <0.193.0> All discovered existing cluster peers: [email protected], [email protected], [email protected]
2018-01-05 17:52:57.533 [info] <0.193.0> Peer nodes we can cluster with: [email protected], [email protected]
2018-01-05 17:52:57.538 [warning] <0.193.0> Could not auto-cluster with node [email protected]: {badrpc,nodedown}
2018-01-05 17:52:57.543 [info] <0.193.0> Node '[email protected]' selected for auto-clustering
2018-01-05 17:52:57.556 [info] <0.193.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2018-01-05 17:52:57.765 [info] <0.193.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2018-01-05 17:52:57.789 [info] <0.193.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2018-01-05 17:52:57.795 [info] <0.193.0> Setting up a table for connection tracking on this node: '[email protected]'
2018-01-05 17:52:57.800 [info] <0.193.0> Setting up a table for per-vhost connection counting on this node: '[email protected]'
2018-01-05 17:52:57.800 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_aws does not support registration, skipping registration.
@michaelklishin
Contributor

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:

  1. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc. (as defined by this team)
  2. We have a certain amount of information to work with

We get at least a dozen questions through various venues every single day, often quite light on details.
At that rate, GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered mailing list material by our team. Please post this to rabbitmq-users.

Getting all the details necessary to reproduce an issue, make a conclusion or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing, or at least sharing as much relevant information as possible on the list:

  • Server, client library and plugin (if applicable) versions used
  • Server logs
  • A code example or terminal transcript that can be used to reproduce
  • Full exception stack traces (not a single line message)
  • rabbitmqctl status (and, if possible, rabbitmqctl environment output)
  • Other relevant things about the environment and workload, e.g. a traffic capture

Feel free to edit out hostnames and other potentially sensitive information.

When/if we have enough details and evidence we'd be happy to file a new issue.

Thank you.

@michaelklishin
Contributor

The lines that explain what's going on:

2018-01-05 17:50:29.879 [warning] <0.193.0> Could not auto-cluster with node [email protected]: {badrpc,nodedown}
2018-01-05 17:50:29.883 [warning] <0.193.0> Could not auto-cluster with node [email protected]: {badrpc,nodedown}
2018-01-05 17:50:29.883 [warning] <0.193.0> Could not successfully contact any node of: [email protected],[email protected] (as in Erlang distribution)

Your nodes cannot join a cluster, possibly because network communication is restricted by security groups, the Erlang distribution port is not open, or similar. This plugin cannot and should not do anything about that. It assumes nodes can communicate with the EC2 API endpoints that describe instances/ASGs, and with each other.
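As a quick way to test exactly what clustering needs, here is a minimal sketch probing the default EPMD and inter-node distribution ports from one node to its peers (4369 and 25672 are the stock defaults; the peer IPs below are taken from the log excerpt):

# Minimal sketch: verify TCP reachability of EPMD (4369) and the Erlang
# distribution listener (25672 by default) for each discovered peer.
import socket

PEERS = ["172.31.1.32", "172.31.5.19"]  # from the log excerpt above
PORTS = [4369, 25672]

for host in PEERS:
    for port in PORTS:
        try:
            with socket.create_connection((host, port), timeout=3):
                print(f"{host}:{port} reachable")
        except OSError as exc:
            print(f"{host}:{port} NOT reachable: {exc}")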

@chuckyz
Author

chuckyz commented Jan 6, 2018

I understand, but in the second log block...

2018-01-05 17:52:57.533 [info] <0.193.0> Peer nodes we can cluster with: [email protected], [email protected]
2018-01-05 17:52:57.538 [warning] <0.193.0> Could not auto-cluster with node [email protected]: {badrpc,nodedown}
2018-01-05 17:52:57.543 [info] <0.193.0> Node '[email protected]' selected for auto-clustering

These instances all have identical security groups, are in the same VPC, and are attached to the same ASG. The only difference is that the LC is based on a different AMI (because I was changing some settings in the base image and then rebuilding).

I'm also able to manually reach the EPMD port on every instance from every other instance.

@youssefNM

I have the same issue, @chuckyz, but the owner of this repo closed my case without giving it enough time!
My nodes are able to communicate with each other, they use the same Erlang cookie, and there is no issue with the instances' security groups: all the required RabbitMQ ports are allowed and tested. Somehow, node discovery using this plugin still fails.
I was able to make it work with a script that does node discovery explicitly, by listing the instances in the same ASG and then executing rabbitmqctl join_cluster to add each node to the cluster (see the sketch below).
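A minimal sketch of that workaround, assuming instance lookup via the EC2 instance metadata service and rabbit@<private-ip> node names as in the logs above (region and error handling are simplified):

# Hypothetical reconstruction of the workaround: find the peers in this
# instance's ASG, then join the first one that accepts the join_cluster call.
import subprocess
import boto3
import requests

instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2).text

autoscaling = boto3.client("autoscaling", region_name="us-west-2")
ec2 = boto3.client("ec2", region_name="us-west-2")

asg_name = autoscaling.describe_auto_scaling_instances(
    InstanceIds=[instance_id])["AutoScalingInstances"][0]["AutoScalingGroupName"]
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
peer_ids = [i["InstanceId"] for i in group["Instances"]
            if i["InstanceId"] != instance_id
            and i["LifecycleState"] == "InService"]

for reservation in ec2.describe_instances(InstanceIds=peer_ids)["Reservations"]:
    for inst in reservation["Instances"]:
        peer = "rabbit@" + inst["PrivateIpAddress"]
        subprocess.run(["rabbitmqctl", "stop_app"], check=True)
        joined = subprocess.run(["rabbitmqctl", "join_cluster", peer]).returncode == 0
        subprocess.run(["rabbitmqctl", "start_app"], check=True)
        if joined:
            raise SystemExit("joined " + peer)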

@michaelklishin
Contributor

michaelklishin commented Jan 7, 2018

This is not a support forum.

This plugin does not change how clusters are formed or operate; all it does is discover peers to contact on boot, unless the node is already a member of a cluster. There is nothing special about how nodes join each other when this plugin is used, and it logs the operations it performs and their outcomes extensively at debug level. If nodes cannot contact or join their peers, that is not this plugin's fault and should be resolved separately.

All edge cases with this plugin (most of which have to do with initial cluster formation, when N nodes boot in parallel and there is no "existing set of nodes" to join) should be discussed on the mailing list.

@rabbitmq rabbitmq locked as off-topic and limited conversation to collaborators Jan 7, 2018
@michaelklishin
Contributor

The cluster formation documentation provides an entire section dedicated to the primary edge case of cluster formation: initial formation, where N nodes join simultaneously and thus create a natural race condition.
