Skip to content

[CI] IndicesSegmentsRestCancellationIT testCatSegmentsRestCancellation failing #88201

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
fcofdez opened this issue Jun 30, 2022 · 19 comments · Fixed by #90188 or #126686
Closed

[CI] IndicesSegmentsRestCancellationIT testCatSegmentsRestCancellation failing #88201

fcofdez opened this issue Jun 30, 2022 · 19 comments · Fixed by #90188 or #126686
Labels
:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. low-risk An open issue or test failure that is a low risk to future releases stalled Team:Distributed Coordination Meta label for Distributed Coordination team >test-failure Triaged test failures from CI

Comments

@fcofdez
Copy link
Contributor

fcofdez commented Jun 30, 2022

Build scan:
https://gradle-enterprise.elastic.co/s/2o4ugc7mhz6si/tests/:qa:smoke-test-http:javaRestTest/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation

Reproduction line:
./gradlew ':qa:smoke-test-http:javaRestTest' --tests "org.elasticsearch.http.IndicesSegmentsRestCancellationIT.testCatSegmentsRestCancellation" -Dtests.seed=8D552D969E2896B -Dtests.locale=ar-IQ -Dtests.timezone=MIT -Druntime.java=8

Applicable branches:
7.17

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.http.IndicesSegmentsRestCancellationIT&tests.test=testCatSegmentsRestCancellation

90-day delivery stats dashboard

Failure excerpt:

junit.framework.AssertionFailedError: task 83/indices:monitor/segments[n] not cancelled

  at __randomizedtesting.SeedInfo.seed([8D552D969E2896B:5FA268A1A1677E70]:0)
  at junit.framework.Assert.fail(Assert.java:57)
  at junit.framework.Assert.assertTrue(Assert.java:22)
  at junit.framework.TestCase.assertTrue(TestCase.java:192)
  at org.elasticsearch.test.TaskAssertions.lambda$assertAllCancellableTasksAreCancelled$2(TaskAssertions.java:55)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1141)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1114)
  at org.elasticsearch.test.TaskAssertions.assertAllCancellableTasksAreCancelled(TaskAssertions.java:47)
  at org.elasticsearch.http.BlockedSearcherRestCancellationTestCase.runTest(BlockedSearcherRestCancellationTestCase.java:111)
  at org.elasticsearch.http.IndicesSegmentsRestCancellationIT.testCatSegmentsRestCancellation(IndicesSegmentsRestCancellationIT.java:21)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at java.lang.Thread.run(Thread.java:750)

@fcofdez fcofdez added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. >test-failure Triaged test failures from CI labels Jun 30, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@pxsalehi pxsalehi self-assigned this Jul 12, 2022
@pxsalehi
Copy link
Member

I've looked into this. Unfortunately no clue why it fails. That being said, this is nothing specific to 7.17. The test has failed only 3 times since 1 Jan (~2800 runs!), and that's across all branches. I haven't been able to reproduce it.

It seems both tests (indices segment and cat segment) in that suite will result in the same request type (IndicesSegmentsRequest), but only one of those tests has failed!

Since both tests are basically doing the same thing, my best guess is either there is some interference between the two (we could avoid sharing the cluster and use a ESIntegTestCase.Scope.TEST, or we simplify the assertions that the test makes. By the latter I mean, e.g. here if we have already asserted that we have received the cancellation exception, do we still need to reach into the internals of the node and verify the task status? (CC: @DaveCTurner and @fcofdez)

@DaveCTurner
Copy link
Contributor

FWIW I ran tens of thousands of iterations of the whole IndicesSegmentsRestCancellationIT suite on a CPU-stressed machine (usually good for triggering rare concurrency issues) without a single problem. I wonder if we're receiving a segments request from an external influence (e.g. another test suite running concurrently).

@pxsalehi
Copy link
Member

Yes. I also tried 5k-6k runs of the suite on two branches, with no failure.

elasticsearchmachine pushed a commit that referenced this issue Aug 31, 2022
The following two failures happen rarely, but both fail in the same
`assertBusy` block. I don't have a clue why, and couldn't reproduce
them. Considering the amount of checks in that block, maybe a larger
timeout is more suitable. (Also it seems from the test history, it is
not uncommon for those tests to take 2-3s, so every few thousand runs
hitting the 10s timeout seems likely, IMO!)  Relates
#88884,
#88201
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this issue Aug 31, 2022
The following two failures happen rarely, but both fail in the same
`assertBusy` block. I don't have a clue why, and couldn't reproduce
them. Considering the amount of checks in that block, maybe a larger
timeout is more suitable. (Also it seems from the test history, it is
not uncommon for those tests to take 2-3s, so every few thousand runs
hitting the 10s timeout seems likely, IMO!)  Relates
elastic#88884,
elastic#88201
elasticsearchmachine pushed a commit that referenced this issue Aug 31, 2022
The following two failures happen rarely, but both fail in the same
`assertBusy` block. I don't have a clue why, and couldn't reproduce
them. Considering the amount of checks in that block, maybe a larger
timeout is more suitable. (Also it seems from the test history, it is
not uncommon for those tests to take 2-3s, so every few thousand runs
hitting the 10s timeout seems likely, IMO!)  Relates
#88884,
#88201
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this issue Aug 31, 2022
The following two failures happen rarely, but both fail in the same
`assertBusy` block. I don't have a clue why, and couldn't reproduce
them. Considering the amount of checks in that block, maybe a larger
timeout is more suitable. (Also it seems from the test history, it is
not uncommon for those tests to take 2-3s, so every few thousand runs
hitting the 10s timeout seems likely, IMO!)  Relates
elastic#88884,
elastic#88201
@pxsalehi
Copy link
Member

I double checked this with @original-brownbear and he agrees that there is not much to do for this failure other than increasing the timeout, which was done a few weeks ago. We can just leave this open until next week and then close it if there are no new failures.

pxsalehi added a commit that referenced this issue Sep 22, 2022
The check used to entirely skip parent lookup relies on
ConcurrentHashMap#isEmpty() which could return inconsistent results, and
potentially skip the cancellation of a task with a banned parent upon
registration, and it doesn't seem to have a benefit considering the hash
code computation.

Closes #88201
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this issue Sep 22, 2022
The check used to entirely skip parent lookup relies on
ConcurrentHashMap#isEmpty() which could return inconsistent results, and
potentially skip the cancellation of a task with a banned parent upon
registration, and it doesn't seem to have a benefit considering the hash
code computation.

Closes elastic#88201
pxsalehi added a commit to pxsalehi/elasticsearch that referenced this issue Sep 22, 2022
The check used to entirely skip parent lookup relies on
ConcurrentHashMap#isEmpty() which could return inconsistent results, and
potentially skip the cancellation of a task with a banned parent upon
registration, and it doesn't seem to have a benefit considering the hash
code computation.

Closes elastic#88201
elasticsearchmachine pushed a commit that referenced this issue Sep 22, 2022
The check used to entirely skip parent lookup relies on
ConcurrentHashMap#isEmpty() which could return inconsistent results, and
potentially skip the cancellation of a task with a banned parent upon
registration, and it doesn't seem to have a benefit considering the hash
code computation.

Closes #88201
elasticsearchmachine pushed a commit that referenced this issue Sep 22, 2022
The check used to entirely skip parent lookup relies on
ConcurrentHashMap#isEmpty() which could return inconsistent results, and
potentially skip the cancellation of a task with a banned parent upon
registration, and it doesn't seem to have a benefit considering the hash
code computation.

Closes #88201

Co-authored-by: Elastic Machine <[email protected]>
@kingherc
Copy link
Contributor

kingherc commented Jul 3, 2023

Re-appeared in 8.9 at https://gradle-enterprise.elastic.co/s/szrl2yl34rtz2

@kingherc kingherc reopened this Jul 3, 2023
@pxsalehi
Copy link
Member

pxsalehi commented Jul 4, 2023

This issue is from one year ago! I think a new issue would be more fitting. The code state one year apart might be very different, and re-opening this instead of a new one might just be misleading when investigating the issue.

@pxsalehi pxsalehi removed their assignment Jul 4, 2023
@kingherc
Copy link
Contributor

kingherc commented Jul 4, 2023

I believe the error is so similar, that the history might be relevant. I see commits referenced, that might have been close to the root cause, but not exactly. So, I'd leave this to the judgement of whoever handles this ticket. They can close this and open a new one.

@albertzaharovits
Copy link
Contributor

New failure today on main: https://gradle-enterprise.elastic.co/s/zlqhwimhv7cju/tests/:qa:smoke-test-http:javaRestTest/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation?top-execution=1

junit.framework.AssertionFailedError: task 56/indices:monitor/segments[n] not cancelled	
at __randomizedtesting.SeedInfo.seed([52231FF4B54560F8:554258C7DC097E3]:0)	
at junit.framework.Assert.fail(Assert.java:57)	
at junit.framework.Assert.assertTrue(Assert.java:22)	
at junit.framework.TestCase.assertTrue(TestCase.java:192)	
at org.elasticsearch.test.TaskAssertions.lambda$assertAllCancellableTasksAreCancelled$2(TaskAssertions.java:74)	
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1173)	
at org.elasticsearch.test.TaskAssertions.assertAllCancellableTasksAreCancelled(TaskAssertions.java:65)	
at org.elasticsearch.http.BlockedSearcherRestCancellationTestCase.runTest(BlockedSearcherRestCancellationTestCase.java:111)	
at org.elasticsearch.http.IndicesSegmentsRestCancellationIT.testCatSegmentsRestCancellation(IndicesSegmentsRestCancellationIT.java:21)	

And 4 more in the past 7 days https://gradle-enterprise.elastic.co/scans/tests?search.startTimeMax=1689086885271&search.startTimeMin=1688418000000&search.timeZoneId=Europe/Bucharest&tests.container=org.elasticsearch.http.IndicesSegmentsRestCancellationIT&tests.test=testCatSegmentsRestCancellation on various branches.

@volodk85 volodk85 added the low-risk An open issue or test failure that is a low risk to future releases label Oct 23, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 3, 2024
matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this issue Oct 4, 2024
@repantis repantis added Team:Distributed Coordination Meta label for Distributed Coordination team and removed Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Nov 5, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Copy link
Member

ywangd commented Jan 13, 2025

There was a new failure on Dec 19. Unfortunately it is on 7.17 where #113844 is not available. I raised a backport PR for it #120025

ywangd pushed a commit to ywangd/elasticsearch that referenced this issue Jan 13, 2025
elasticsearchmachine pushed a commit that referenced this issue Jan 14, 2025
…113844) (#120025)

* More verbose logging in `IndicesSegmentsRestCancellationIT` (#113844)

Relates #88201

(cherry picked from commit cd42719)

* Fix logger reference

* Use String instead of Strings

---------

Co-authored-by: David Turner <[email protected]>
tvernum pushed a commit that referenced this issue Feb 25, 2025
Add use of Opaque ID HTTP header in task cancellation assertion. In some
tests, like this #88201 `testCatSegmentsRestCancellation`, we assert
that all tasks related to specific HTTP request are cancelled. But we do
blanket approach in assertion block catching all tasks by action name. I
think narrowing down assertion to specific http request in this case
would be more accurate.

It is still not clear why test mentioned above failing, but after hours
of investigation and injecting random delays, I'm inclining more to
@DaveCTurner's comment about interference from other tests or cluster
activity. I added additional log that will report when we spot task with
different opaque id.
tvernum pushed a commit that referenced this issue Feb 25, 2025
Add use of Opaque ID HTTP header in task cancellation assertion. In some
tests, like this #88201 `testCatSegmentsRestCancellation`, we assert
that all tasks related to specific HTTP request are cancelled. But we do
blanket approach in assertion block catching all tasks by action name. I
think narrowing down assertion to specific http request in this case
would be more accurate.

It is still not clear why test mentioned above failing, but after hours
of investigation and injecting random delays, I'm inclining more to
@DaveCTurner's comment about interference from other tests or cluster
activity. I added additional log that will report when we spot task with
different opaque id.
@DaveCTurner
Copy link
Contributor

This reproduces (in 7.17 and main) thusly:

diff --git a/server/src/main/java/org/elasticsearch/rest/action/RestCancellableNodeClient.java b/server/src/main/java/org/elasticsearch/rest/action/RestCancellableNodeClient.java
index 33b3ef35671e..19a120e9b72f 100644
--- a/server/src/main/java/org/elasticsearch/rest/action/RestCancellableNodeClient.java
+++ b/server/src/main/java/org/elasticsearch/rest/action/RestCancellableNodeClient.java
@@ -100,6 +100,13 @@ public class RestCancellableNodeClient extends FilterClient {
         });
         assert task instanceof CancellableTask : action.name() + " is not cancellable";
         final TaskId taskId = new TaskId(client.getLocalNodeId(), task.getId());
+        if (action.name().equals("indices:monitor/segments")) {
+            try {
+                Thread.sleep(500);
+            } catch (InterruptedException e) {
+                throw new AssertionError(e);
+            }
+        }
         closeListener.registerTask(taskHolder, taskId);
         closeListener.maybeRegisterChannel(httpChannel);
     }

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201

Backport of elastic#126686 to `7.17`
elasticsearchmachine pushed a commit that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
elasticsearchmachine pushed a commit that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
elasticsearchmachine pushed a commit that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
elasticsearchmachine pushed a commit that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
elasticsearchmachine pushed a commit that referenced this issue Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201

Backport of #126686 to `7.17`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. low-risk An open issue or test failure that is a low risk to future releases stalled Team:Distributed Coordination Meta label for Distributed Coordination team >test-failure Triaged test failures from CI
Projects
None yet