
[CI] ManyShardsIT testCancelUnnecessaryRequests failing #125947

Closed
elasticsearchmachine opened this issue Mar 31, 2025 · 5 comments · Fixed by #126653
Labels:

  • :Analytics/ES|QL (AKA ESQL)
  • needs:risk (Requires assignment of a risk label: low, medium, blocker)
  • Team:Analytics (Meta label for analytical engine team: ESQL/Aggs/Geo)
  • >test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine
Collaborator

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:esql:internalClusterTest" --tests "org.elasticsearch.xpack.esql.action.ManyShardsIT.testCancelUnnecessaryRequests" -Dtests.seed=2B36B5E6A61236E2 -Dtests.jvm.argline="-Des.entitlements.enabled=false" -Dtests.locale=tt-RU -Dtests.timezone=America/Belize -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: safeGet: listener was completed exceptionally

Issue Reasons:

  • [main] 3 failures in test testCancelUnnecessaryRequests (0.3% fail rate in 871 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine added the :Analytics/ES|QL and >test-failure labels on Mar 31, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 3 failures in test testCancelUnnecessaryRequests (0.3% fail rate in 871 executions)

Build Scans:

elasticsearchmachine added the Team:Analytics and needs:risk labels on Mar 31, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-analytical-engine (Team:Analytics)

@alex-spies
Contributor

I can see

 Caused by: java.util.concurrent.ExecutionException: org.elasticsearch.action.NoShardAvailableActionException: no such shard
 at org.elasticsearch.action.support.PlainActionFuture$Sync.getValue(PlainActionFuture.java:279)
 at org.elasticsearch.action.support.PlainActionFuture$Sync.get(PlainActionFuture.java:253)
 at org.elasticsearch.action.support.PlainActionFuture.get(PlainActionFuture.java:74)
 at org.elasticsearch.test.ESTestCase.safeGet(ESTestCase.java:2533)
 ... 44 more

I don't see any WARN logs for this test that would immediately explain what happened.

@dnhatn, git blame has your name (and only your name) plastered over this test suite :D So I hope it's okay to throw this your way. Do you have an idea of this failure's risk level?

@idegtiarenko
Contributor

I looked at this test and noticed that we execute the query immediately after the new data node has started. Note that we do not wait for all shard migrations to complete.

It would really help to have the stack trace from NoShardAvailableActionException: no such shard, but I suspect we do not handle well the case where a shard moves to another node between the can-match phase and the sending of the data request.

If such cases are expected to be retried on the client, I could add a wait for no shard movements to the test.
Otherwise, this is a somewhat trickier case to fix.
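
For illustration only, a "wait for no shard movements" step could look roughly like the sketch below. It is a self-contained stand-in, not the test's actual code: the class name ShardMovementWait, the method waitForNoRelocatingShards, and the IntSupplier that reports the number of relocating shards are all hypothetical. In a real integration test the count would presumably come from the cluster health API rather than a fake supplier.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

public class ShardMovementWait {

    /**
     * Polls the (hypothetical) supplier until it reports zero relocating shards
     * or the timeout elapses. Returns true if the cluster settled in time.
     */
    static boolean waitForNoRelocatingShards(IntSupplier relocatingShards, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (relocatingShards.getAsInt() == 0) {
                return true; // no shard is moving, safe to run the query
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: shards were still relocating
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Fake cluster: three polls report relocating shards, then it settles.
        AtomicInteger remaining = new AtomicInteger(3);
        IntSupplier fakeCluster = () -> Math.max(0, remaining.getAndDecrement());

        boolean settled = waitForNoRelocatingShards(fakeCluster, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println("settled=" + settled);
    }
}
```

The test would call such a helper after starting the new data node and before executing the query, and fail fast if it returns false. This only hides the race in the test; it does not address the production case where a shard moves between can match and the data request.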

idegtiarenko self-assigned this on Apr 3, 2025
@idegtiarenko
Contributor

I can also add logging to confirm this theory.
