[CI] TransportClusterStateActionDisruptionIT class failing #127443

Closed
elasticsearchmachine opened this issue Apr 27, 2025 · 3 comments · Fixed by #127523
Assignees
Labels
:Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. medium-risk An open issue or test failure that is a medium risk to future releases Team:Distributed Coordination Meta label for Distributed Coordination team >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

Build Scans:

Reproduction Line:

undefined

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

undefined

Issue Reasons:

  • [main] 10 failures in class org.elasticsearch.action.admin.cluster.state.TransportClusterStateActionDisruptionIT (2.4% fail rate in 414 executions)
  • [main] 5 failures in pipeline elasticsearch-periodic-platform-support (45.5% fail rate in 11 executions)
  • [main] 2 failures in pipeline elasticsearch-periodic (20.0% fail rate in 10 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. >test-failure Triaged test failures from CI needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Distributed Indexing Meta label for Distributed Indexing team labels Apr 27, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@kingherc kingherc added Team:Distributed Coordination Meta label for Distributed Coordination team :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. and removed :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. Team:Distributed Indexing Meta label for Distributed Indexing team labels Apr 29, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@JeremyDahlgren JeremyDahlgren self-assigned this Apr 29, 2025
@JeremyDahlgren
Contributor

JeremyDahlgren commented Apr 29, 2025

I was able to reproduce the NullPointerException locally by adding a sleep in InternalTestCluster.getMasterName(), between the awaitClusterState() call and the client call that fetches the cluster state to read the master node name.
The getMasterName() method was changed recently in #127213, where David mentions the danger of waiting for a state and then making a second call to get the state. Niels and David agreed that a test shouldn't use getMasterName() if it expects cluster instability. The fix I'm testing ensures the master node is not null in the assertBusy() in TransportClusterStateActionDisruptionIT.runRepeatedlyWhileChangingMaster().
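
For readers unfamiliar with this kind of race, here is a minimal, self-contained sketch of the "wait, then read again" hazard described above. It does not use any Elasticsearch classes: `MasterReadRace`, `waitThenReadAgain`, and `readOnce` are hypothetical names, and an `AtomicReference<String>` stands in for the cluster state. The `Runnable` plays the role of the sleep used in the local reproduction, giving the master a chance to disappear between the two reads.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model (not Elasticsearch code) of a master that can vanish between two reads. */
class MasterReadRace {
    // Stand-in for the published cluster state; null means "no elected master".
    static final AtomicReference<String> master = new AtomicReference<>("node-0");

    /** Unsafe pattern: confirm a master exists, then read the state a second time. */
    static String waitThenReadAgain(Runnable betweenReads) {
        if (master.get() == null) {
            throw new IllegalStateException("no master yet");
        }
        betweenReads.run();   // models the sleep; the cluster may change here
        return master.get();  // second read: may be null by now
    }

    /** Safer pattern: read once and validate the value that was actually read. */
    static String readOnce(Runnable afterRead) {
        String observed = master.get();
        if (observed == null) {
            throw new IllegalStateException("no master yet");
        }
        afterRead.run();      // later changes cannot affect the value already observed
        return observed;
    }

    public static void main(String[] args) {
        Runnable disruption = () -> master.set(null);

        System.out.println(waitThenReadAgain(disruption)); // prints "null"

        master.set("node-0");
        System.out.println(readOnce(disruption));          // prints "node-0"
    }
}
```

In the unsafe variant the caller receives `null` even though the existence check passed, which matches the shape of the failure described in this comment. The single-read variant can still return a stale master, so a test that expects instability has to tolerate that as well.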

@JeremyDahlgren JeremyDahlgren added low-risk An open issue or test failure that is a low risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Apr 29, 2025
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this issue Apr 29, 2025
Replaces the use of InternalTestCluster.getMasterName() with a
ClusterServiceUtils.addTemporaryStateListener() call that waits
for a new master node other than the previous master node.
InternalTestCluster.getMasterName() is not safe to use in
unstable clusters, per PR 127213.

Closes:
elastic#127466
elastic#127443
elastic#127424
elastic#127423
elastic#127422
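
The approach in the commit message (wait, via a temporary listener, for a master that differs from the previous one) can be illustrated with a small standalone sketch. This is not the Elasticsearch implementation and does not reproduce the signature of ClusterServiceUtils.addTemporaryStateListener(); `MasterChangeWaiter`, `publish`, `addTemporaryListener`, and `newMasterOtherThan` are hypothetical names, and the "state" is reduced to the master's name.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Toy model (not Elasticsearch code) of a temporary state listener. */
class MasterChangeWaiter {
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();
    private volatile String master; // null means "no elected master"

    MasterChangeWaiter(String initialMaster) {
        this.master = initialMaster;
    }

    /** Publishes a new state and hands exactly that state to every listener. */
    void publish(String newMaster) {
        master = newMaster;
        for (Consumer<String> listener : listeners) {
            listener.accept(newMaster);
        }
    }

    /** Completes with the first observed master matching the predicate, then unregisters. */
    CompletableFuture<String> addTemporaryListener(Predicate<String> predicate) {
        CompletableFuture<String> future = new CompletableFuture<>();
        Consumer<String> listener = new Consumer<String>() {
            @Override
            public void accept(String observed) {
                if (predicate.test(observed)) {
                    future.complete(observed);
                    listeners.remove(this);
                }
            }
        };
        listeners.add(listener);
        listener.accept(master); // the current state may already satisfy the predicate
        return future;
    }

    /** Predicate for "a master is elected and it is not the previous one". */
    static Predicate<String> newMasterOtherThan(String previousMaster) {
        return m -> m != null && !m.equals(previousMaster);
    }

    public static void main(String[] args) {
        MasterChangeWaiter cluster = new MasterChangeWaiter("node-0");
        CompletableFuture<String> next =
            cluster.addTemporaryListener(newMasterOtherThan("node-0"));

        cluster.publish(null);      // master lost: the future stays incomplete
        cluster.publish("node-1");  // new master elected: the future completes
        System.out.println(next.join()); // prints "node-1"
    }
}
```

Because the listener receives the same state that satisfied the predicate, there is no second read during which the master could become null.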
@JeremyDahlgren JeremyDahlgren added medium-risk An open issue or test failure that is a medium risk to future releases and removed low-risk An open issue or test failure that is a low risk to future releases labels Apr 29, 2025
3 participants