Remove dependency on cluster state API in SpecificMasterNodesIT #127213


Merged

Conversation

nielsbauman (Contributor):

These tests depended on the cluster state API to wait for the master node. This behavior is being removed, so we switch to alternative approaches to waiting for the master node.

Relates #127212

@nielsbauman added the >test, Team:Distributed Coordination, v9.1.0, and :Distributed Coordination/Distributed labels on Apr 23, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

```java
 * @param viaNode the node to check the cluster state on
 * @param masterNodeName the master node name that we wait for
 */
public void awaitAndAssertMasterNode(String viaNode, String masterNodeName) throws Exception {
```
Contributor Author:

I added these methods in ESIntegTestCase because I see other use cases for these methods (which I'll get to in follow-up PRs). I'm a little bit on the fence about which class they should live in, though. I would actually like to avoid adding them in this class, as this class is too big already. Maybe InternalTestCluster? Or an entirely new class?
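The helper under discussion waits until a node's observed master matches an expected name, and fails the test if it never does. As a rough, framework-free illustration of that wait-and-assert shape (the `Supplier`-based state source and all names below are hypothetical, not the actual ESIntegTestCase API):

```java
import java.util.function.Supplier;

public class AwaitMaster {
    /**
     * Polls the observed master name until it equals the expected name,
     * or throws AssertionError once the timeout elapses.
     */
    public static void awaitAndAssertMaster(Supplier<String> observedMaster, String expected, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        String seen = null;
        while (System.currentTimeMillis() < deadline) {
            seen = observedMaster.get();
            if (expected.equals(seen)) {
                return; // condition met within the deadline
            }
            try {
                Thread.sleep(5); // back off briefly before re-checking
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting for master", e);
            }
        }
        throw new AssertionError("expected master [" + expected + "] but saw [" + seen + "]");
    }
}
```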

@nielsbauman nielsbauman requested a review from DaveCTurner April 23, 2025 08:03
```java
ensureOpen();
if (node == null) {
    synchronized (this) {
        return getOrBuildRandomNode().getName();
```
Contributor:

I think this is super-trappy - getMasterName() definitely doesn't look like the sort of thing that might start a node, especially since there's no way to configure this magic new node. Tests should know whether they have a node running or not when calling this, and should start a node themselves if they want one.

Contributor Author:

Good point... I reverted "Start new node if none found" (b9093e2) and added 9aee869.

Contributor:

Oh hang on wtf calling client() (and therefore getMasterName()) in an empty cluster creates a new default node already today?! Can we make it not do that first? Unsure how many tests rely on that behaviour right now but I hope it's not many.

Contributor Author:

Ah sorry yeah I'm trying to juggle too many things at the same time (and failing to do so...). That's indeed why I added the behavior initially, to match what getMasterName() currently does. I can address that in a separate PR first.

Contributor Author:

I opened #127318 to address this.

@nielsbauman nielsbauman requested a review from DaveCTurner April 23, 2025 09:21
@DaveCTurner DaveCTurner dismissed their stale review April 23, 2025 09:34

Addressed

@nielsbauman (Contributor Author):

@DaveCTurner this is ready for review again. Could you have a look when you have some free time?

```java
 * @param masterNodeName the master node name that we wait for
 */
public void awaitAndAssertMasterNode(String viaNode, String masterNodeName) throws Exception {
    awaitClusterState(
```
Contributor:

I'd be tempted to use addTemporaryStateListener here too, because (a) it is more harmonious with awaitAndAssertMasterNotFound and (b) it avoids the throws Exception.

Contributor Author:

Yeah I thought about that too. I'm fine with using addTemporaryStateListener here too.

Is there a reason we would intentionally want the behavior of awaitClusterState? I've been tempted to make awaitClusterState just use addTemporaryStateListener (or even to get rid of awaitClusterState entirely, but that's a larger change).

Contributor:

> Is there a reason we would intentionally want the behavior of awaitClusterState?

IMO not really no. ClusterStateObserver is really old and suspiciously complicated given what it's actually doing. I'd be happy to migrate things away from it.
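For context on the listener style being preferred over ClusterStateObserver here: a temporary state listener registers itself, checks a predicate against each published state, and removes itself once the predicate first matches. A minimal stand-alone analogue in plain Java (the `StatePublisher` class below is invented for illustration; the real ClusterServiceUtils API differs):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Toy publisher standing in for a cluster service: pushes state strings to listeners. */
class StatePublisher {
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    void addListener(Consumer<String> listener) { listeners.add(listener); }
    void removeListener(Consumer<String> listener) { listeners.remove(listener); }

    void publish(String state) {
        for (Consumer<String> listener : listeners) listener.accept(state);
    }

    /**
     * Registers a listener that completes the returned future and removes
     * itself the first time the predicate matches a published state.
     */
    CompletableFuture<Void> addTemporaryListener(Predicate<String> predicate) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        Consumer<String> listener = new Consumer<String>() {
            @Override
            public void accept(String state) {
                if (predicate.test(state)) {
                    removeListener(this); // one-shot: deregister on first match
                    done.complete(null);
                }
            }
        };
        addListener(listener);
        return done;
    }
}
```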

Comment on lines 2040 to 2047
```diff
 try {
-    Client client = viaNode != null ? client(viaNode) : client();
-    return client.admin().cluster().prepareState(TEST_REQUEST_TIMEOUT).get().getState().nodes().getMasterNode().getName();
+    ClusterServiceUtils.awaitClusterState(logger, state -> state.nodes().getMasterNode() != null, clusterService(viaNode));
+    final ClusterState state = client(viaNode).admin().cluster().prepareState(TEST_REQUEST_TIMEOUT).setLocal(true).get().getState();
+    return state.nodes().getMasterNode().getName();
 } catch (Exception e) {
     logger.warn("Can't fetch cluster state", e);
     throw new RuntimeException("Can't get master node " + e.getMessage(), e);
 }
```
Contributor:

I still worry this reads the state twice and might see no master the second time due to election jitter. We could cheat and remember the master node seen in the state on which we're awaiting (also saves the exception handling):

```java
final var masterNameListener = new SubscribableListener<String>();
return safeAwait(ClusterServiceUtils.addTemporaryStateListener(clusterService(viaNode), cs -> {
    Optional.ofNullable(cs.nodes().getMasterNode()).ifPresent(masterNode -> masterNameListener.onResponse(masterNode.getName()));
    return masterNameListener.isDone();
}).andThen(masterNameListener::addListener));
```
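The point of completing the listener with the master name is to take the value from the same state snapshot that satisfied the predicate, rather than issuing a second read that might observe a different state. That capture pattern, stripped of the Elasticsearch types, looks roughly like this (plain Java sketch; `CaptureOnMatch` and `capturing` are invented names):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Function;

class CaptureOnMatch {
    /**
     * Returns a state listener that, on the first state where extract yields
     * a non-null value, completes the future with that value -- taken from
     * the same snapshot that matched, never from a second read.
     */
    static <S, T> Consumer<S> capturing(Function<S, T> extract, CompletableFuture<T> result) {
        return state -> {
            T value = extract.apply(state);
            if (value != null) {
                result.complete(value); // value comes from this snapshot
            }
        };
    }
}
```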

Contributor Author:

I thought about that too, but I figured "election jitter" would likely disrupt the test anyway. Tests should be designed to be deterministic/consistent, but there's only so much outside influence tests can account for (e.g. election jitter or CI blips).

I'm a little hesitant to "cheat" here. I seem to recall someone saying

> Having to obtain the cluster state in another way is a feature, not a bug :)

#125195 (comment) 😉

Contributor:

Heh yeah I thought I'd said that at some point :) On reflection I think you're right, tests should not be calling this if the master isn't stable.

@DaveCTurner (Contributor) left a comment:

LGTM

@nielsbauman nielsbauman enabled auto-merge (squash) April 25, 2025 12:03
@nielsbauman nielsbauman merged commit 7bd2b80 into elastic:main Apr 25, 2025
16 checks passed
@nielsbauman nielsbauman deleted the refactor-specific-master-nodes-it branch April 25, 2025 12:12