Fix `NullPointerException`s failing `TransportClusterStateActionDisruptionIT` #127523

JeremyDahlgren · 2025-04-29T17:22:39Z

Replaces the use of InternalTestCluster.getMasterName() with a ClusterServiceUtils.addTemporaryStateListener() call that waits for a new master node other than the previous master node. InternalTestCluster.getMasterName() is not safe to use in unstable clusters, per PR #127213.

To reproduce the NullPointerException in getMasterName() seen in the test failures, I added a sleep between the two cluster state accesses:

            ClusterServiceUtils.awaitClusterState(logger, state -> state.nodes().getMasterNode() != null, clusterService(viaNode));
            final ClusterState state = client(viaNode).admin().cluster().prepareState(TEST_REQUEST_TIMEOUT).setLocal(true).get().getState();
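
The failure mode described here is a check-then-act race: the first access confirms that a master exists, and the second access reads the master again after it may have stepped down. A minimal standalone sketch of that race follows. `TwoReadRaceSketch`, its `Node` record, the `master` field, and both methods are hypothetical stand-ins rather than Elasticsearch code, and the `betweenReads` callback plays the role of the added sleep.

```java
import java.util.concurrent.atomic.AtomicReference;

public class TwoReadRaceSketch {
    // Hypothetical stand-in for a discovery node; not an Elasticsearch class.
    record Node(String name) {}

    // Hypothetical stand-in for the elected master; null means no master is elected.
    static final AtomicReference<Node> master = new AtomicReference<>(new Node("node-0"));

    // Unsafe: checks that a master exists, then reads the master a second time.
    static String unsafeMasterName(Runnable betweenReads) {
        if (master.get() == null) {
            throw new IllegalStateException("no master elected");
        }
        betweenReads.run(); // stands in for the sleep; the master may step down here
        return master.get().name(); // the second read can observe null and throw an NPE
    }

    // Safer: reads once and only ever uses that snapshot.
    static String safeMasterName(Runnable betweenReads) {
        Node snapshot = master.get();
        if (snapshot == null) {
            throw new IllegalStateException("no master elected");
        }
        betweenReads.run(); // later changes do not affect the snapshot
        return snapshot.name();
    }

    public static void main(String[] args) {
        Runnable masterStepsDown = () -> master.set(null);

        System.out.println(safeMasterName(masterStepsDown));

        master.set(new Node("node-0")); // restore the master for the second scenario
        try {
            unsafeMasterName(masterStepsDown);
        } catch (NullPointerException e) {
            System.out.println("NPE from the second read");
        }
    }
}
```

Evaluating a single predicate against one cluster state snapshot, as the listener in this PR does, avoids the second read entirely.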

Closes #127466
Closes #127443
Closes #127424
Closes #127423
Closes #127422

elasticsearchmachine · 2025-04-29T17:23:03Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

nielsbauman

I left one small suggestion, other than that LGTM! Great job tracking down the PR that caused this failure.

N.B. GitHub requires you to write out the closes #... for every issue, so you'll have to write

closes #...
closes #...

if you want GitHub to automatically close all of them.

nielsbauman · 2025-04-29T18:06:28Z

...va/org/elasticsearch/action/admin/cluster/state/TransportClusterStateActionDisruptionIT.java

+        var newMasterNodeListener = ClusterServiceUtils.addTemporaryStateListener(
+            internalCluster().clusterService(nonMasterNode),
+            state -> Optional.ofNullable(state.nodes().getMasterNode()).map(m -> m.getName().equals(masterName) == false).orElse(false),
+            TEST_REQUEST_TIMEOUT
+        );
+        safeAwait(newMasterNodeListener, TEST_REQUEST_TIMEOUT);


I think we can do this:

Suggested change

-        var newMasterNodeListener = ClusterServiceUtils.addTemporaryStateListener(
-            internalCluster().clusterService(nonMasterNode),
-            state -> Optional.ofNullable(state.nodes().getMasterNode()).map(m -> m.getName().equals(masterName) == false).orElse(false),
-            TEST_REQUEST_TIMEOUT
-        );
-        safeAwait(newMasterNodeListener, TEST_REQUEST_TIMEOUT);
+        awaitClusterState(
+            logger,
+            nonMasterNode,
+            state -> Optional.ofNullable(state.nodes().getMasterNode()).map(m -> m.getName().equals(masterName) == false).orElse(false)
+        );

That looks a tiny bit cleaner to me and better matches what we're actually trying to achieve (waiting for a cluster state). The only downside is that this changes the timeout: ClusterServiceUtils#awaitClusterState uses a timeout of 30s, whereas assertBusy (the old code) uses a default timeout of 10s. I think we'll want to reduce that 30s timeout at some point anyway, so I don't see that as a reason not to use that method here.

Also, the logger parameter will likely be removed in the future as well, making this even cleaner (but still only a little bit more).
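
The predicate shared by both versions returns false when no master is elected and when the elected master is still the previous one, so the wait only completes once a different node has become master. A minimal standalone sketch of that logic follows. The `Node` and `State` records and the `hasNewMaster` helper are hypothetical stand-ins for illustration, not Elasticsearch classes.

```java
import java.util.Optional;
import java.util.function.Predicate;

public class NewMasterPredicateSketch {
    // Hypothetical stand-ins for a discovery node and a cluster state snapshot.
    record Node(String name) {}
    record State(Node masterNode) {}

    // True only when a master is elected and it is not the previous master.
    static Predicate<State> hasNewMaster(String previousMasterName) {
        return state -> Optional.ofNullable(state.masterNode())
            .map(m -> m.name().equals(previousMasterName) == false)
            .orElse(false); // no master elected: keep waiting
    }

    public static void main(String[] args) {
        Predicate<State> predicate = hasNewMaster("node-0");

        System.out.println(predicate.test(new State(null)));               // false: no master
        System.out.println(predicate.test(new State(new Node("node-0")))); // false: same master
        System.out.println(predicate.test(new State(new Node("node-1")))); // true: new master
    }
}
```

Wrapping the possibly-null master in `Optional.ofNullable` keeps the null check and the name comparison inside one evaluation of one snapshot.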

elasticsearchmachine added the v9.1.0 label Apr 29, 2025

JeremyDahlgren requested a review from nielsbauman April 29, 2025 17:23

nielsbauman approved these changes Apr 29, 2025

JeremyDahlgren added 6 commits April 29, 2025 14:40

Refactor per code review suggestion

2be01d6

Merge branch 'main' into fix/127466

ee2f458

Merge branch 'main' into fix/127466

55641fb

Updated muted-tests.yml

37ba3d6

Merge branch 'main' into fix/127466

2e1f047

Merge branch 'main' into fix/127466

bd3cd62

JeremyDahlgren merged commit de68cb0 into elastic:main Apr 30, 2025
18 checks passed
