Skip to content

Fix NullPointerExceptions failing TransportClusterStateActionDisruptionIT #127523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 30, 2025

Conversation

JeremyDahlgren
Copy link
Contributor

@JeremyDahlgren JeremyDahlgren commented Apr 29, 2025

Replaces the use of InternalTestCluster.getMasterName() with a ClusterServiceUtils.addTemporaryStateListener() call that waits for a new master node other than the previous master node. InternalTestCluster.getMasterName() is not safe to use in unstable clusters, per PR #127213.

To reproduce the NullPointerException in getMasterName() seen in the test failures I added a sleep in between the two cluster state accesses:

            ClusterServiceUtils.awaitClusterState(logger, state -> state.nodes().getMasterNode() != null, clusterService(viaNode));
            final ClusterState state = client(viaNode).admin().cluster().prepareState(TEST_REQUEST_TIMEOUT).setLocal(true).get().getState();

Closes #127466
Closes #127443
Closes #127424
Closes #127423
Closes #127422

Replaces the use of InternalTestCluster.getMasterName() with a
ClusterServiceUtils.addTemporaryStateListener() call that waits
for a new master node other than the previous master node.
InternalTestCluster.getMasterName() is not safe to use in
unstable clusters, per PR 127213.

Closes:
elastic#127466
elastic#127443
elastic#127424
elastic#127423
elastic#127422
@JeremyDahlgren JeremyDahlgren added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Coordination Meta label for Distributed Coordination team labels Apr 29, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one small suggestion, other than that LGTM! Great job tracking down the PR that caused this failure.

N.B. GitHub requires you to write out the closes #... for every issue, so you'll have to write

closes #...
closes #...

if you want GitHub to automatically close all of them.

Comment on lines 216 to 221
var newMasterNodeListener = ClusterServiceUtils.addTemporaryStateListener(
internalCluster().clusterService(nonMasterNode),
state -> Optional.ofNullable(state.nodes().getMasterNode()).map(m -> m.getName().equals(masterName) == false).orElse(false),
TEST_REQUEST_TIMEOUT
);
safeAwait(newMasterNodeListener, TEST_REQUEST_TIMEOUT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do this:

Suggested change
var newMasterNodeListener = ClusterServiceUtils.addTemporaryStateListener(
internalCluster().clusterService(nonMasterNode),
state -> Optional.ofNullable(state.nodes().getMasterNode()).map(m -> m.getName().equals(masterName) == false).orElse(false),
TEST_REQUEST_TIMEOUT
);
safeAwait(newMasterNodeListener, TEST_REQUEST_TIMEOUT);
awaitClusterState(
logger,
nonMasterNode,
state -> Optional.ofNullable(state.nodes().getMasterNode()).map(m -> m.getName().equals(masterName) == false).orElse(false)
);

That looks a tiny bit cleaner to me and matches better what we're actually trying to achieve (which is waiting for the cluster state). The only downside is that this changes the timeout as ClusterServiceUtils#awaitClusterState uses a timeout of 30s whereas asserBusy (the old code) uses a default timeout of 10s. I think we'll want to try to reduce that 30s timeout at some point anyway, so I don't see that as a reason not to use that method here.

Also, the logger parameter will likely be removed in the future as well, making this even cleaner (but still only a little bit more).

@JeremyDahlgren JeremyDahlgren merged commit de68cb0 into elastic:main Apr 30, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.1.0
Projects
None yet
3 participants