Perform partition checks from OCF HA script #939
Conversation
Partitioned nodes are ordered to restart by the master. It may sound like `autoheal`, but the problem is that the OCF script and `autoheal` are not compatible, because the concepts of master in pacemaker and winner in autoheal are completely unrelated.
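A minimal sketch of the idea, assuming hypothetical helper names (`order_restart` is a placeholder, not the script's actual API; `erl_eval` mirrors the helper used in the diff below, and `rabbit_node_monitor:partitions/0` is the broker-side source of partition data):

```sh
# Hypothetical sketch of the master-side check: ask the local broker whether
# a peer is in our list of partitioned nodes, and if so, order it to restart.
check_and_heal_partition() {
    local peer="$1"
    local partitioned

    # rabbit_node_monitor:partitions() returns the nodes this node
    # considers itself partitioned from.
    partitioned=$(erl_eval "lists:member('%s', rabbit_node_monitor:partitions())." "$peer")

    if [ "$partitioned" = true ]; then
        order_restart "$peer"   # placeholder for the actual restart mechanism
    fi
}
```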
@dmitrymex @bogdando waiting for a go-ahead from you :)
```sh
node_name=$(rabbit_node_name $1)

local seen_as_running
seen_as_running=$(erl_eval "lists:member('%s', rabbit_mnesia:cluster_nodes(running))." "$node_name")
```
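For context, `erl_eval` is presumably a thin wrapper that substitutes its arguments into the expression printf-style and evaluates it on the local node via `rabbitmqctl eval`; a sketch under that assumption (the `OCF_RESKEY_ctl` variable name is likewise assumed):

```sh
# Assumed shape of the erl_eval helper: build the Erlang expression with
# printf-style substitution, then evaluate it on the local broker node.
erl_eval() {
    local expr
    expr=$(printf "$@")
    ${OCF_RESKEY_ctl:-rabbitmqctl} eval "$expr" 2>/dev/null
}
```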
Shall we use this approach in `check_need_join_to()` as well? Your change suggests that `get_running_nodes()` is not good :(
Or maybe we need to update `get_nodes__base()`'s `eval "mnesia:system_info(${infotype})."` as well?
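For comparison, the two query styles produce very different output shapes, which is what a grep-based parser has to cope with (node names illustrative, output shapes as printed by `rabbitmqctl eval`):

```sh
# Erlang term dump that get_nodes__base() has to tokenize:
$ rabbitmqctl eval "mnesia:system_info(running_db_nodes)."
['rabbit@node-1','rabbit@node-2']

# Boolean-per-node style used in this change:
$ rabbitmqctl eval "lists:member('rabbit@node-1', rabbit_mnesia:cluster_nodes(running))."
true
```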
By approach, do you mean using `lists:member/2`, or checking both running and partitioned states? I've added `lists:member/2` because I needed to add another check, and did it just for unification with the new piece of code. It's slightly cleaner IMO, but it's not worth changing it everywhere. And the partition check is just much harder to implement using grep.
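A rough illustration of why grep is a poor fit here: partition information in the status dump is a nested term (which node is partitioned from which), so a textual match has to parse structure, whereas `lists:member/2` reduces the question to a printed boolean. Output shapes and node names below are illustrative:

```sh
# What a grep-based check would have to pick apart, as printed by
# rabbitmqctl cluster_status:
#   {partitions,[{'rabbit@node-1',['rabbit@node-2']}]}

# Versus a boolean answer straight from Erlang:
res=$(${OCF_RESKEY_ctl:-rabbitmqctl} eval \
    "lists:member('rabbit@node-2', rabbit_node_monitor:partitions()).")
[ "$res" = true ] && echo "partitioned from rabbit@node-2"
```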
As for `check_need_join_to()`, it got me thinking: is the check for the running state the correct check here, or should we check the list of known nodes instead? I'm not sure which problem you solve by replacing …
```sh
    ocf_log err "${LH} Failed to check whether '$node_name' is considered running by us"
    # XXX Or should we give remote node benefit of a doubt?
    return 1
elif [ "$seen_as_running" != true ]; then
```
Is it `"${seen_as_running}" != "true"`?
Yes, that's correct. I'm checking here that the given node is missing from the list of running nodes.

These are actually equivalent. The functions return the same values and can be used interchangeably, and wrapping the result in …
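A quick demonstration of the equivalence: in a POSIX `test` expression, an unquoted `true` on the right-hand side is just the literal string `true`, and `$var` vs `${var}` only differ when adjacent characters would make the variable name ambiguous:

```sh
seen_as_running=false
[ "$seen_as_running" != true ]     && echo "bare-word form: not running"
[ "${seen_as_running}" != "true" ] && echo "quoted form:    not running"
# Both lines print; the two tests compare the same two strings.
```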
Is this ready to be merged?
LGTM, and the standard check I do with jepsen (https://github.com/bogdando/rabbitmq-cluster-ocf-vagrant) works for me.
Partitioned nodes are ordered to restart by the master. It may sound like `autoheal`, but the problem is that the OCF script and `autoheal` are not compatible, because the concepts of master in pacemaker and winner in autoheal are completely unrelated.

Upstream change: rabbitmq/rabbitmq-server#939
Change-Id: I79bc2054a0ea0f04917130779e3380777960b1d6
Closes-Bug: 1616581