Skip to content

[CI] EsqlSpecIT test {rrf.SimpleRrf ASYNC} failing #127063

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
elasticsearchmachine opened this issue Apr 18, 2025 · 6 comments
Closed

[CI] EsqlSpecIT test {rrf.SimpleRrf ASYNC} failing #127063

elasticsearchmachine opened this issue Apr 18, 2025 · 6 comments
Assignees
Labels
:Analytics/ES|QL AKA ESQL low-risk An open issue or test failure that is a low risk to future releases Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Copy link
Collaborator

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:esql:qa:server:single-node:javaRestTest" --tests "org.elasticsearch.xpack.esql.qa.single_node.EsqlSpecIT" -Dtests.method="test {rrf.SimpleRrf ASYNC}" -Dtests.seed=92E84B4D784905FD -Dtests.locale=nmg-CM -Dtests.timezone=America/Louisville -Druntime.java=24

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: Expected a map containing
   lt_1d: <0>
  lt_10s: <0>
   gt_1d: <0>
   lt_1h: <0>
lt_100ms: expected <54> but was <55>
 lt_10ms: <10>
   lt_1m: <0>
  lt_10h: <0>
  lt_10m: <0>
   lt_1s: expected <7> but was <6>

Issue Reasons:

  • [main] 2 failures in test test {rrf.SimpleRrf ASYNC} (0.6% fail rate in 309 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine
Copy link
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test test {rrf.SimpleRrf ASYNC} (0.6% fail rate in 309 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) needs:risk Requires assignment of a risk label (low, medium, blocker) labels Apr 18, 2025
@elasticsearchmachine
Copy link
Collaborator Author

Pinging @elastic/es-analytical-engine (Team:Analytics)

@fang-xing-esql
Copy link
Member

Could you take a look at this rrf test @ioanatia ? Thank you!

@ioanatia
Copy link
Contributor

Looking at the build that's linked in the issue description, there are actually 4k failures due to java.net.ConnectException: Connection refused. The RRF test failures have the same failures. I think the failures in this build are not related.

Then looking at the other linked built, the failure message has to do with an assertion introduced in #126547 :

java.lang.AssertionError: Expected a map containing
   lt_1d: <0>
  lt_10s: <0>
   gt_1d: <0>
   lt_1h: <0>
lt_100ms: expected <54> but was <55>
 lt_10ms: <10>
   lt_1m: <0>
  lt_10h: <0>
  lt_10m: <0>
   lt_1s: expected <7> but was <6>

I don't see how this is any way related to RRF, but I will take a look.

@ioanatia ioanatia added low-risk An open issue or test failure that is a low risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Apr 24, 2025
@ioanatia
Copy link
Contributor

ioanatia commented Apr 24, 2025

Not related to RRF, but to FORK 😄

The error can happen when the reported took time from the response is different than the one we register as part of TookMetrics and it's reported by telemetry.
If the took values fall into different buckets as the one from the description (100ms vs 1s), the test fails.

The root cause is a race condition between the time we call markEndQuery on the main coordinator plan and the sub plans.

We call it once for the main coordinator plan here:

finalListener.map(profiles -> {
execInfo.markEndQuery();
return new Result(finalMainPlan.output(), collectedPages, profiles, execInfo);
})

And we call it when each sub plan has ended:

listener.map(completionInfo -> {
execInfo.markEndQuery(); // TODO: revisit this time recording model as part of INLINESTATS improvements
return new Result(outputAttributes, collectedPages, completionInfo, execInfo);
})

I tested this by adding a Thread.sleep and checking that we move on to reporting the took metrics before the sub plan listeners are finished.

It would be very simple to just not call markEndQuery for the sub plans, but that's not the only change we want.
We want to make sure that the all the final listeners for the sub plans are called before we consider the whole plan executed and move on to returning the results to the user.

@ioanatia
Copy link
Contributor

ioanatia commented Apr 28, 2025

fixed with #127328
unmuted with #127455

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL low-risk An open issue or test failure that is a low risk to future releases Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

3 participants