Remove dependency on cluster state API in SpecificMasterNodesIT #127213


Merged

Conversation

nielsbauman (Contributor):

These tests depended on the cluster state API to wait for the master node. This behavior is being removed, so we switch to alternative approaches to waiting for the master node.

Relates #127212

@nielsbauman added the >test, Team:Distributed Coordination, v9.1.0, and :Distributed Coordination/Distributed labels on Apr 23, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

```java
 * @param viaNode the node to check the cluster state on
 * @param masterNodeName the master node name that we wait for
 */
public void awaitAndAssertMasterNode(String viaNode, String masterNodeName) throws Exception {
```
Contributor Author:

I added these methods in ESIntegTestCase because I see other use cases for these methods (which I'll get to in follow-up PRs). I'm a little bit on the fence about which class they should live in, though. I would actually like to avoid adding them in this class, as this class is too big already. Maybe InternalTestCluster? Or an entirely new class?
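The helper under discussion waits until a node's observed master matches an expected name, and fails the test if it never does. As a rough, framework-free illustration of that wait-and-assert shape (the `Supplier`-based state source and all names below are hypothetical, not the actual ESIntegTestCase API):

```java
import java.util.function.Supplier;

public class AwaitMaster {
    /**
     * Polls the observed master name until it equals the expected name,
     * or throws AssertionError once the timeout elapses.
     */
    public static void awaitAndAssertMaster(Supplier<String> observedMaster, String expected, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        String seen = null;
        while (System.currentTimeMillis() < deadline) {
            seen = observedMaster.get();
            if (expected.equals(seen)) {
                return; // condition met within the deadline
            }
            try {
                Thread.sleep(5); // back off briefly before re-checking
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting for master", e);
            }
        }
        throw new AssertionError("expected master [" + expected + "] but saw [" + seen + "]");
    }
}
```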

@nielsbauman nielsbauman requested a review from DaveCTurner April 23, 2025 08:03
```java
ensureOpen();
if (node == null) {
    synchronized (this) {
        return getOrBuildRandomNode().getName();
```
Contributor:

I think this is super-trappy - getMasterName() definitely doesn't look like the sort of thing that might start a node, especially since there's no way to configure this magic new node. Tests should know whether they have a node running or not when calling this, and should start a node themselves if they want one.

Contributor Author:

Good point... I reverted "Start new node if none found" (b9093e2) and added 9aee869.

Contributor:

Oh hang on wtf calling client() (and therefore getMasterName()) in an empty cluster creates a new default node already today?! Can we make it not do that first? Unsure how many tests rely on that behaviour right now but I hope it's not many.

Contributor Author:

Ah sorry yeah I'm trying to juggle too many things at the same time (and failing to do so...). That's indeed why I added the behavior initially, to match what getMasterName() currently does. I can address that in a separate PR first.

Contributor Author:

I opened #127318 to address this.

@nielsbauman nielsbauman requested a review from DaveCTurner April 23, 2025 09:21
@DaveCTurner DaveCTurner dismissed their stale review April 23, 2025 09:34

Addressed

@nielsbauman (Contributor Author):

@DaveCTurner this is ready for review again. Could you have a look when you have some free time?

```java
 * @param masterNodeName the master node name that we wait for
 */
public void awaitAndAssertMasterNode(String viaNode, String masterNodeName) throws Exception {
    awaitClusterState(
```
Contributor:

I'd be tempted to use addTemporaryStateListener here too, because (a) it is more harmonious with awaitAndAssertMasterNotFound and (b) it avoids the throws Exception.

Contributor Author:

Yeah I thought about that too. I'm fine with using addTemporaryStateListener here too.

Is there a reason we would intentionally want the behavior of awaitClusterState? I've been tempted to make awaitClusterState just use addTemporaryStateListener (or even to get rid of awaitClusterState entirely, but that's a larger change).

Contributor:

> Is there a reason we would intentionally want the behavior of awaitClusterState?

IMO not really no. ClusterStateObserver is really old and suspiciously complicated given what it's actually doing. I'd be happy to migrate things away from it.
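For context on the listener style being preferred over ClusterStateObserver here: a temporary state listener registers itself, checks a predicate against each published state, and removes itself once the predicate first matches. A minimal stand-alone analogue in plain Java (the `StatePublisher` class below is invented for illustration; the real ClusterServiceUtils API differs):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Toy publisher standing in for a cluster service: pushes state strings to listeners. */
class StatePublisher {
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    void addListener(Consumer<String> listener) { listeners.add(listener); }
    void removeListener(Consumer<String> listener) { listeners.remove(listener); }

    void publish(String state) {
        for (Consumer<String> listener : listeners) listener.accept(state);
    }

    /**
     * Registers a listener that completes the returned future and removes
     * itself the first time the predicate matches a published state.
     */
    CompletableFuture<Void> addTemporaryListener(Predicate<String> predicate) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        Consumer<String> listener = new Consumer<String>() {
            @Override
            public void accept(String state) {
                if (predicate.test(state)) {
                    removeListener(this); // one-shot: deregister on first match
                    done.complete(null);
                }
            }
        };
        addListener(listener);
        return done;
    }
}
```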

Comment on lines 2040 to 2047
```diff
 try {
-    Client client = viaNode != null ? client(viaNode) : client();
-    return client.admin().cluster().prepareState(TEST_REQUEST_TIMEOUT).get().getState().nodes().getMasterNode().getName();
+    ClusterServiceUtils.awaitClusterState(logger, state -> state.nodes().getMasterNode() != null, clusterService(viaNode));
+    final ClusterState state = client(viaNode).admin().cluster().prepareState(TEST_REQUEST_TIMEOUT).setLocal(true).get().getState();
+    return state.nodes().getMasterNode().getName();
 } catch (Exception e) {
     logger.warn("Can't fetch cluster state", e);
     throw new RuntimeException("Can't get master node " + e.getMessage(), e);
 }
```
Contributor:

I still worry this reads the state twice and might see no master the second time due to election jitter. We could cheat and remember the master node seen in the state on which we're awaiting (also saves the exception handling):

```java
final var masterNameListener = new SubscribableListener<String>();
return safeAwait(ClusterServiceUtils.addTemporaryStateListener(clusterService(viaNode), cs -> {
    Optional.ofNullable(cs.nodes().getMasterNode()).ifPresent(masterNode -> masterNameListener.onResponse(masterNode.getName()));
    return masterNameListener.isDone();
}).andThen(masterNameListener::addListener));
```
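The point of completing the listener with the master name is to take the value from the same state snapshot that satisfied the predicate, rather than issuing a second read that might observe a different state. That capture pattern, stripped of the Elasticsearch types, looks roughly like this (plain Java sketch; `CaptureOnMatch` and `capturing` are invented names):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Function;

class CaptureOnMatch {
    /**
     * Returns a state listener that, on the first state where extract yields
     * a non-null value, completes the future with that value -- taken from
     * the same snapshot that matched, never from a second read.
     */
    static <S, T> Consumer<S> capturing(Function<S, T> extract, CompletableFuture<T> result) {
        return state -> {
            T value = extract.apply(state);
            if (value != null) {
                result.complete(value); // value comes from this snapshot
            }
        };
    }
}
```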

Contributor Author:

I thought about that too, but I figured "election jitter" would likely disrupt the test anyway. Tests should be designed to be deterministic/consistent, but there's only so much outside influence tests can account for (e.g. election jitter or CI blips).

I'm a little hesitant to "cheat" here. I seem to recall someone saying

> Having to obtain the cluster state in another way is a feature, not a bug :)

#125195 (comment) 😉

Contributor:

Heh yeah I thought I'd said that at some point :) On reflection I think you're right, tests should not be calling this if the master isn't stable.

@DaveCTurner (Contributor) left a comment:

LGTM

@nielsbauman nielsbauman enabled auto-merge (squash) April 25, 2025 12:03
@nielsbauman nielsbauman merged commit 7bd2b80 into elastic:main Apr 25, 2025
16 checks passed
@nielsbauman nielsbauman deleted the refactor-specific-master-nodes-it branch April 25, 2025 12:12