-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Remove dependency on cluster state API in SpecificMasterNodesIT
#127213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove dependency on cluster state API in SpecificMasterNodesIT
#127213
Conversation
These tests depended on the cluster state API to wait for the master node. This behavior is being removed, so we switch to alternative approaches of waiting for the master node. Relates elastic#127212
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
* @param viaNode the node to check the cluster state one | ||
* @param masterNodeName the master node name that we wait for | ||
*/ | ||
public void awaitAndAssertMasterNode(String viaNode, String masterNodeName) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added these methods in ESIntegTestCase
because I see other use cases for these methods (which I'll get to in follow-up PRs). I'm a little bit on the fence about which class they should live in, though. I would actually like to avoid adding them in this class, as this class is too big already. Maybe InternalTestCluster
? Or an entirely new class?
test/framework/src/main/java/org/elasticsearch/test/ESIntegTestCase.java
Outdated
Show resolved
Hide resolved
ensureOpen(); | ||
if (node == null) { | ||
synchronized (this) { | ||
return getOrBuildRandomNode().getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is super-trappy - getMasterName()
definitely doesn't look like the sort of thing that might start a node, especially since there's no way to configure this magic new node. Tests should know whether they have a node running or not when calling this, and should start a node themselves if they want one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point... I reverted Start new node if none found b9093e2 and added 9aee869
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh hang on wtf calling client()
(and therefore getMasterName()
) in an empty cluster creates a new default node already today?! Can we make it not do that first? Unsure how many tests rely on that behaviour right now but I hope it's not many.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry yeah I'm trying to juggle too many things at the same time (and failing to do so...). That's indeed why I added the behavior initially, to match what getMasterName()
currently does. I can address that in a separate PR first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #127318 to address this.
test/framework/src/main/java/org/elasticsearch/test/ESIntegTestCase.java
Outdated
Show resolved
Hide resolved
This reverts commit b9093e2.
@DaveCTurner this is ready for review again. Could you have a look when you some free time? |
* @param masterNodeName the master node name that we wait for | ||
*/ | ||
public void awaitAndAssertMasterNode(String viaNode, String masterNodeName) throws Exception { | ||
awaitClusterState( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be tempted to use addTemporaryStateListener
here too, because (a) it is more harmonious with awaitAndAssertMasterNotFound
and (b) it avoids the throws Exception
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I thought about that too. I'm fine with using addTemporaryStateListener
here too.
Is there a reason we would intentionally want the behavior of awaitClusterState
? I've been tempted to make awaitClusterState
just use addTemporaryClusterState
(or even getting rid of awaitClusterState
, but that's a larger change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we would intentionally want the behavior of
awaitClusterState
?
IMO not really no. ClusterStateObserver
is really old and suspiciously complicated given what it's actually doing. I'd be happy to migrate things away from it.
try { | ||
Client client = viaNode != null ? client(viaNode) : client(); | ||
return client.admin().cluster().prepareState(TEST_REQUEST_TIMEOUT).get().getState().nodes().getMasterNode().getName(); | ||
ClusterServiceUtils.awaitClusterState(logger, state -> state.nodes().getMasterNode() != null, clusterService(viaNode)); | ||
final ClusterState state = client(viaNode).admin().cluster().prepareState(TEST_REQUEST_TIMEOUT).setLocal(true).get().getState(); | ||
return state.nodes().getMasterNode().getName(); | ||
} catch (Exception e) { | ||
logger.warn("Can't fetch cluster state", e); | ||
throw new RuntimeException("Can't get master node " + e.getMessage(), e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still worry this reads the state twice and might see no master the second time due to election jitter. We could cheat and remember the master node seen in the state on which we're awaiting (also saves the exception handling):
final var masterNameListener = new SubscribableListener<String>();
return safeAwait(ClusterServiceUtils.addTemporaryStateListener(clusterService(viaNode), cs -> {
Optional.ofNullable(cs.nodes().getMasterNode()).ifPresent(masterNode -> masterNameListener.onResponse(masterNode.getName()));
return masterNameListener.isDone();
}).andThen(masterNameListener::addListener));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that too, but I figured "election jitter" would likely disrupt the test anyway. Tests should be designed to be deterministic/consistent, but there's only so much outside influence tests can account for (e.g. election jitter or CI blips).
I'm a little hesitant to "cheat" here. I seem to recall someone saying
Having to obtain the cluster state in another way is a feature, not a bug :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh yeah I thought I'd said that at some point :) On reflection I think you're right, tests should not be calling this if the master isn't stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
These tests depended on the cluster state API to wait for the master node. This behavior is being removed, so we switch to alternative approaches of waiting for the master node.
Relates #127212