[CI] SearchWithRandomDisconnectsIT testSearchWithRandomDisconnects failing #122707


Open
elasticsearchmachine opened this issue Feb 16, 2025 · 12 comments
Assignees
Labels
  • needs:risk: Requires assignment of a risk label (low, medium, blocker)
  • :Search Foundations/Search: Catch all for Search Foundations
  • Team:Search Foundations: Meta label for the Search Foundations team in Elasticsearch
  • >test-failure: Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Feb 16, 2025

Build Scans:

Reproduction Line:

./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.search.basic.SearchWithRandomDisconnectsIT.testSearchWithRandomDisconnects" -Dtests.seed=84ACF56E827FA582 -Dtests.locale=mzn-IR -Dtests.timezone=Atlantic/Bermuda -Druntime.java=24

Applicable branches:
9.0

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.Exception: Test abandoned because suite timeout was reached.

Issue Reasons:

  • [9.0] 7 failures in test testSearchWithRandomDisconnects (3.3% fail rate in 215 executions)
  • [9.0] 2 failures in step openjdk21_checkpart1_java-matrix (22.2% fail rate in 9 executions)
  • [9.0] 5 failures in pipeline elasticsearch-periodic (55.6% fail rate in 9 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Search Foundations/Search Catch all for Search Foundations >test-failure Triaged test failures from CI labels Feb 16, 2025
elasticsearchmachine added a commit that referenced this issue Feb 16, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testSearchWithRandomDisconnects (1.1% fail rate in 180 executions)

Build Scans:

@elasticsearchmachine elasticsearchmachine added needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch labels Feb 16, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@drempapis drempapis self-assigned this Feb 21, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 12 failures in test testSearchWithRandomDisconnects (25.0% fail rate in 48 executions)
  • [main] 2 failures in step part1 (25.0% fail rate in 8 executions)
  • [main] 9 failures in step part-1 (26.5% fail rate in 34 executions)
  • [main] 2 failures in pipeline elasticsearch-intake (25.0% fail rate in 8 executions)
  • [main] 10 failures in pipeline elasticsearch-pull-request (29.4% fail rate in 34 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Apr 2, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 5 failures in test testSearchWithRandomDisconnects (2.9% fail rate in 171 executions)
  • [8.x] 2 failures in step part1 (13.3% fail rate in 15 executions)
  • [8.x] 2 failures in step part-1 (13.3% fail rate in 15 executions)
  • [8.x] 2 failures in pipeline elasticsearch-intake (13.3% fail rate in 15 executions)
  • [8.x] 2 failures in pipeline elasticsearch-pull-request (12.5% fail rate in 16 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Apr 4, 2025
andreidan pushed a commit to andreidan/elasticsearch that referenced this issue Apr 9, 2025
@drempapis
Contributor

I collected the logs from an execution on both the main and 8.x branches, and both show the following StackOverflowError.

 <testcase name="classMethod" classname="org.elasticsearch.search.basic.SearchWithRandomDisconnectsIT" time="0.004">
    <failure message="com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3956, name=elasticsearch[node_s3][search][T#4], state=RUNNABLE, group=TGRP-SearchWithRandomDisconnectsIT]" type="com.carrotsearch.randomizedtesting.UncaughtExceptionError">com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3956, name=elasticsearch[node_s3][search][T#4], state=RUNNABLE, group=TGRP-SearchWithRandomDisconnectsIT]
Caused by: java.lang.StackOverflowError
	at __randomizedtesting.SeedInfo.seed([514BE65E1A73EE72]:0)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:569)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:560)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:153)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:176)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:265)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:636)
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.resolveAll(IndexNameExpressionResolver.java:1569)
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.resolveExpressionsToResources(IndexNameExpressionResolver.java:318)
	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.resolveSearchRouting(IndexNameExpressionResolver.java:1075)
	at org.elasticsearch.action.search.TransportSearchAction.getLocalShardsIterator(TransportSearchAction.java:1859)
	at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:1258)
	at org.elasticsearch.action.search.TransportSearchAction.executeLocalSearch(TransportSearchAction.java:1050)
	at org.elasticsearch.action.search.TransportSearchAction.lambda$executeRequest$4(TransportSearchAction.java:373)
	at org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:247)
	at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:109)
	at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:77)
	at org.elasticsearch.action.search.TransportSearchAction.executeRequest(TransportSearchAction.java:538)
	at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:324)
	at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:123)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:135)
	at org.elasticsearch.action.support.TransportAction.handleExecution(TransportAction.java:96)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:59)
	at org.elasticsearch.tasks.TaskManager.registerAndExecute(TaskManager.java:197)
	at org.elasticsearch.client.internal.node.NodeClient.executeLocally(NodeClient.java:106)
	at org.elasticsearch.client.internal.node.NodeClient.doExecute(NodeClient.java:84)
	at org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:140)
	at org.elasticsearch.client.internal.FilterClient.doExecute(FilterClient.java:56)
	at org.elasticsearch.client.internal.support.AbstractClient.execute(AbstractClient.java:140)
	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:55)
	at org.elasticsearch.search.basic.SearchWithRandomDisconnectsIT$1.runMoreSearches(SearchWithRandomDisconnectsIT.java:73)
	at org.elasticsearch.search.basic.SearchWithRandomDisconnectsIT$1.onFailure(SearchWithRandomDisconnectsIT.java:68)

This is the code under inspection:

    while (done.get() == false) {
        final ListenableFuture<SearchResponse> f = new ListenableFuture<>();
        prepareRandomSearch().execute(f);
        if (f.isDone() == false) {
            f.addListener(this);
            return;
        }
    }

As I understand it, the issue lies within the while loop. If prepareRandomSearch().execute(f) completes immediately, f.isDone() returns true and the loop issues another search right away. Each completed search can call back into runMoreSearches() on the same thread (the trace above shows onFailure invoking runMoreSearches), so the calls nest instead of returning and the stack keeps growing until it overflows.
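
To make the failure mode concrete, here is a minimal sketch that is independent of Elasticsearch. It assumes a hypothetical `execute` that always completes its listener on the caller's thread, which is a simplification of what the test sees only some of the time. `ReentrantListenerDemo`, `Listener`, and both `maxDepth...` methods are illustrative names, not part of the test. The sketch counts how deeply the callbacks nest when the listener re-issues the call itself, compared with a plain loop.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ReentrantListenerDemo {
    interface Listener {
        void onDone();
    }

    /** Hypothetical stand-in for an async call that completes on the caller's thread. */
    static void execute(Listener listener) {
        listener.onDone();
    }

    /** The listener starts the next call itself, so every call nests inside the previous one. */
    static int maxDepthRecursive(int searches) {
        AtomicInteger remaining = new AtomicInteger(searches);
        int[] depth = { 0, 0 }; // depth[0] = current nesting, depth[1] = maximum seen
        Listener self = new Listener() {
            @Override
            public void onDone() {
                depth[0]++;
                depth[1] = Math.max(depth[1], depth[0]);
                if (remaining.decrementAndGet() > 0) {
                    execute(this); // re-enters onDone() before this frame returns
                }
                depth[0]--;
            }
        };
        execute(self);
        return depth[1];
    }

    /** A plain loop drives the calls, so each callback returns before the next one starts. */
    static int maxDepthLooping(int searches) {
        int[] depth = { 0, 0 };
        int remaining = searches;
        while (remaining-- > 0) {
            execute(() -> {
                depth[0]++;
                depth[1] = Math.max(depth[1], depth[0]);
                depth[0]--;
            });
        }
        return depth[1];
    }
}
```

In this model the recursive variant nests once per search, while the looping variant never nests more than one level deep. With enough immediately completing searches, the recursive shape is what eventually exhausts the stack.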

@original-brownbear
Member

Ah damn, I thought I had fixed the above by making the stacks a little less deep, but it turns out that wasn't good enough :(
Either we improve the looping here some more so that it does not recurse when a search is immediately done, or we simply fork more. Better looping would be nice, since it makes the test more likely to reproduce the issue it was created to reproduce (searches coinciding with network disconnects), but forking is an OK choice too if that's hard. Let me know if you want to look into this together, happy to help ;)

@drempapis
Contributor

Thank you, @original-brownbear, for the feedback. I have been experimenting with the modification below, which schedules the next call on an executor instead of looping. Please let me know what you think.

    private void runMoreSearches() {
        if (done.get()) {
            finishFuture.onResponse(null);
            return;
        }

        final ListenableFuture<SearchResponse> f = new ListenableFuture<>();
        prepareRandomSearch().execute(f);
        if (f.isDone() == false) {
            f.addListener(this);
        } else {
            executor.execute(this::runMoreSearches);
        }
    }
});
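
As a rough illustration of why forking bounds the stack, here is a self-contained sketch using only `java.util.concurrent`. It models just the "already done" branch of the snippet above, under the assumption that every step completes immediately. `ForkingLoopDemo` and `maxDepthWhenForking` are hypothetical names, and the single-threaded executor is a stand-in for whatever executor the test would actually use.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ForkingLoopDemo {
    /** Runs the given number of steps, queueing each follow-up on an executor, and returns the deepest nesting seen. */
    static int maxDepthWhenForking(int searches) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch finished = new CountDownLatch(1);
        AtomicInteger remaining = new AtomicInteger(searches);
        AtomicInteger depth = new AtomicInteger();
        AtomicInteger maxDepth = new AtomicInteger();
        Runnable step = new Runnable() {
            @Override
            public void run() {
                maxDepth.accumulateAndGet(depth.incrementAndGet(), Math::max);
                try {
                    if (remaining.decrementAndGet() <= 0) {
                        finished.countDown();
                    } else {
                        // Queued rather than called: it only runs after this frame has returned.
                        executor.execute(this);
                    }
                } finally {
                    depth.decrementAndGet();
                }
            }
        };
        try {
            executor.execute(step);
            if (finished.await(30, TimeUnit.SECONDS) == false) {
                throw new IllegalStateException("steps did not finish in time");
            }
            return maxDepth.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

Because each follow-up is queued instead of invoked directly, the nesting depth in this model stays at one regardless of how many steps run. The trade-off mentioned earlier in the thread still applies: every fork adds a hand-off, which may make searches less likely to coincide with disconnects.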

@drempapis
Contributor

After applying the update and running the test repeatedly from the command line, we hit the following exception, which is also reported in the linked executions, e.g., https://gradle-enterprise.elastic.co/s/z7whiq4yjvjyi

Expected: an empty collection
         but: <[LEAK: resource was not cleaned up before it was garbage-collected.
    Recent access records: 
    #1:
        in [elasticsearch[node_s1][search][T#2]][testSearchWithRandomDisconnects {seed=[9C4F49EC3D6BE222:4BFA3E4DF9657D16]}]
        org.elasticsearch.action.ActionListener.respondAndRelease(ActionListener.java:388)
        org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:79)
        org.elasticsearch.action.ActionRunnable$3.accept(ActionRunnable.java:76)
        org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:101)
        org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)
        org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1044)
        org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1095)
        java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:619)
        java.base/java.lang.Thread.run(Thread.java:1447)
    #2:
        in [elasticsearch[node_s1][search][T#2]][testSearchWithRandomDisconnects {seed=[9C4F49EC3D6BE222:4BFA3E4DF9657D16]}]
        org.elasticsearch.action.search.ArraySearchPhaseResults.consumeResult(ArraySearchPhaseResults.java:47)
        org.elasticsearch.action.search.QueryPhaseResultConsumer.consumeResult(QueryPhaseResultConsumer.java:159)
        org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:503)
        org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.onShardResult(SearchQueryThenFetchAsyncAction.java:180)
        org.elasticsearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:277)

....

@original-brownbear
Member

Interesting, @drempapis. Could it be that the executor we're using here rejects the task, or throws some other exception? Something seems to be preventing the search from completing cleanly. But it almost looks like this could be a real bug too, where we add to the QueryPhaseResultConsumer after the search has already failed. That's where I'd start my investigation; looking at the full leak trace, it seems quite possible that that's the case here *sigh* :)
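
To show the kind of bug I mean, here is a deliberately simplified sketch. It is not a claim about what QueryPhaseResultConsumer actually does. `Resource` and `Consumer` are hypothetical classes, and the `guardLateResults` flag exists only to contrast the two behaviours: a result handed over after the consumer has been closed is never released unless the consumer checks for that case.

```java
import java.util.ArrayList;
import java.util.List;

class LateResultLeakDemo {
    /** Hypothetical ref-counted resource, created with one reference held. */
    static final class Resource {
        private int refs = 1;

        void release() {
            refs--;
        }

        boolean leaked() {
            return refs > 0;
        }
    }

    /** Hypothetical consumer that owns every result it is handed until close() is called. */
    static final class Consumer {
        private final List<Resource> held = new ArrayList<>();
        private final boolean guardLateResults;
        private boolean closed;

        Consumer(boolean guardLateResults) {
            this.guardLateResults = guardLateResults;
        }

        synchronized void consume(Resource result) {
            if (closed && guardLateResults) {
                result.release(); // nobody will call close() again, so release right away
                return;
            }
            held.add(result);
        }

        synchronized void close() {
            closed = true;
            for (Resource result : held) {
                result.release();
            }
            held.clear();
        }
    }

    /** Simulates a shard result arriving after the search has failed and the consumer was closed. */
    static boolean leaksWhenResultArrivesAfterClose(boolean guardLateResults) {
        Consumer consumer = new Consumer(guardLateResults);
        consumer.close();
        Resource late = new Resource();
        consumer.consume(late);
        return late.leaked();
    }
}
```

In this model the unguarded consumer keeps a reference that is never released, which is the shape a leak tracker would report. The guarded variant releases the late result immediately.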

@drempapis
Contributor

It’s not related to the executor; the same error occurs in main, and I was able to reproduce it locally. I still need to do some additional debugging, and I'll start with the approach you suggested.

@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.18

Mute Reasons:

  • [8.18] 8 failures in test testSearchWithRandomDisconnects (1.3% fail rate in 626 executions)
  • [8.18] 2 failures in step openjdk21_checkpart1_java-matrix (12.5% fail rate in 16 executions)
  • [8.18] 2 failures in step rocky-9_platform-support-unix (11.8% fail rate in 17 executions)
  • [8.18] 3 failures in pipeline elasticsearch-periodic-platform-support (16.7% fail rate in 18 executions)
  • [8.18] 3 failures in pipeline elasticsearch-periodic (16.7% fail rate in 18 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Apr 18, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch 9.0

Mute Reasons:

  • [9.0] 7 failures in test testSearchWithRandomDisconnects (3.3% fail rate in 215 executions)
  • [9.0] 2 failures in step openjdk21_checkpart1_java-matrix (22.2% fail rate in 9 executions)
  • [9.0] 5 failures in pipeline elasticsearch-periodic (55.6% fail rate in 9 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Apr 29, 2025
3 participants