[CI] IndicesSegmentsRestCancellationIT testCatSegmentsRestCancellation failing #88201
Comments
Pinging @elastic/es-distributed (Team:Distributed)
I've looked into this. Unfortunately, I have no clue why it fails. That said, this is nothing specific to 7.17: the test has failed only 3 times since 1 Jan (~2800 runs!), and that's across all branches. I haven't been able to reproduce it. It seems both tests (indices segments and cat segments) in that suite result in the same request type ( Since both tests are basically doing the same thing, my best guess is that there is some interference between the two (we could avoid sharing the cluster and use a
FWIW I ran tens of thousands of iterations of the whole
Yes. I also tried 5k-6k runs of the suite on two branches, with no failure.
The following two failures happen rarely, but both fail in the same `assertBusy` block. I don't have a clue why and couldn't reproduce them. Considering the number of checks in that block, maybe a larger timeout is more suitable. (It also seems from the test history that it is not uncommon for these tests to take 2-3s, so hitting the 10s timeout every few thousand runs seems likely, IMO!) Relates #88884, #88201
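For context, the test framework's `assertBusy` retries the assertion block until it passes or an overall timeout elapses, so raising the timeout simply widens that retry window. A minimal self-contained sketch of the pattern (the names and back-off values here are illustrative, not the actual implementation):

```java
import java.util.concurrent.TimeUnit;

public class BusyAssertSketch {
    // Retry the check until it passes or the overall timeout elapses,
    // rethrowing the last AssertionError if it never passes.
    static void assertBusy(Runnable check, long maxWait, TimeUnit unit) throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // the assertion finally passed
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // timeout elapsed: surface the last failure
                }
                Thread.sleep(sleepMillis);
                sleepMillis = Math.min(sleepMillis * 2, 1_000); // exponential back-off
            }
        }
    }
}
```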
I double-checked this with @original-brownbear and he agrees that there is not much to do for this failure other than increasing the timeout, which was done a few weeks ago. We can leave this open until next week and then close it if there are no new failures.
The check used to entirely skip the parent lookup relies on `ConcurrentHashMap#isEmpty()`, which can return inconsistent results and thus potentially skip the cancellation of a task with a banned parent upon registration, and it doesn't seem to have a benefit considering the hash code computation. Closes #88201
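For illustration, a minimal sketch of the kind of racy fast-path that commit describes, assuming a hypothetical `bannedParents` map (the real logic lives in the server's task manager): `ConcurrentHashMap#isEmpty()` is a weakly consistent, point-in-time read, so a ban added concurrently from another thread may not be visible yet and the lookup gets skipped.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BanCheckSketch {
    // Hypothetical stand-in for the map of banned parent task ids.
    private final Map<String, String> bannedParents = new ConcurrentHashMap<>();

    void registerChildTask(String parentTaskId, Runnable cancel) {
        // Racy fast-path (the kind the commit removes): isEmpty() may not
        // observe a ban that another thread is adding concurrently, so the
        // cancellation below can be skipped entirely.
        if (bannedParents.isEmpty()) {
            return;
        }
        // The fix is to always do the lookup; the isEmpty() shortcut saves
        // little given the hash code computation is cheap.
        if (bannedParents.containsKey(parentTaskId)) {
            cancel.run(); // parent is banned: cancel the child on registration
        }
    }
}
```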
Re-appeared in 8.9 at https://gradle-enterprise.elastic.co/s/szrl2yl34rtz2
This issue is from one year ago! I think a new issue would be more fitting. The code one year apart might be very different, and re-opening this instead of filing a new one might just be misleading when investigating the issue.
I believe the error is so similar that the history might be relevant. I see commits referenced that might have been close to the root cause, but not exactly. So I'd leave this to the judgement of whoever handles this ticket; they can close this and open a new one.
New failure today on
And 4 more in the past 7 days https://gradle-enterprise.elastic.co/scans/tests?search.startTimeMax=1689086885271&search.startTimeMin=1688418000000&search.timeZoneId=Europe/Bucharest&tests.container=org.elasticsearch.http.IndicesSegmentsRestCancellationIT&tests.test=testCatSegmentsRestCancellation on various branches.
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
…113844) Relates elastic#88201 (cherry picked from commit cd42719)
Add use of Opaque ID HTTP header in task cancellation assertion. In some tests, like `testCatSegmentsRestCancellation` in this issue (#88201), we assert that all tasks related to a specific HTTP request are cancelled, but we take a blanket approach in the assertion block, catching all tasks by action name. I think narrowing the assertion down to the specific HTTP request would be more accurate in this case. It is still not clear why the test mentioned above is failing, but after hours of investigation and injecting random delays, I'm inclining more towards @DaveCTurner's comment about interference from other tests or cluster activity. I added an additional log line that reports when we spot a task with a different opaque ID.
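A rough sketch of the narrowing described, using the low-level REST client's `Request`/`RequestOptions` API; `X-Opaque-Id` is the header Elasticsearch propagates into task headers, while the helper names and the plain header map below are assumptions made to keep the sketch self-contained:

```java
import java.util.Map;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RequestOptions;

class OpaqueIdSketch {
    // Tag the request so the tasks it spawns can be told apart from
    // unrelated tasks that happen to share the same action name.
    static Request taggedCatSegmentsRequest(String opaqueId) {
        Request request = new Request("GET", "/_cat/segments");
        request.setOptions(RequestOptions.DEFAULT.toBuilder().addHeader("X-Opaque-Id", opaqueId));
        return request;
    }

    // When asserting cancellation, match on our opaque id rather than on
    // the action name alone. (TaskInfo accessor names vary across versions,
    // so a plain header map stands in for the task's headers here.)
    static boolean isOurTask(Map<String, String> taskHeaders, String opaqueId) {
        return opaqueId.equals(taskHeaders.get("X-Opaque-Id"));
    }
}
```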
This reproduces (in 7.17 and

```diff
diff --git a/server/src/main/java/org/elasticsearch/rest/action/RestCancellableNodeClient.java b/server/src/main/java/org/elasticsearch/rest/action/RestCancellableNodeClient.java
index 33b3ef35671e..19a120e9b72f 100644
--- a/server/src/main/java/org/elasticsearch/rest/action/RestCancellableNodeClient.java
+++ b/server/src/main/java/org/elasticsearch/rest/action/RestCancellableNodeClient.java
@@ -100,6 +100,13 @@ public class RestCancellableNodeClient extends FilterClient {
});
assert task instanceof CancellableTask : action.name() + " is not cancellable";
final TaskId taskId = new TaskId(client.getLocalNodeId(), task.getId());
+ if (action.name().equals("indices:monitor/segments")) {
+ try {
+ Thread.sleep(500);
+ } catch (InterruptedException e) {
+ throw new AssertionError(e);
+ }
+ }
closeListener.registerTask(taskHolder, taskId);
closeListener.maybeRegisterChannel(httpChannel);
}
```
Today we rely on registering the channel after registering the task to be cancelled, to ensure that the task is cancelled even if the channel is closed concurrently. However, the client may already have processed a cancellable request on the channel, in which case this mechanism doesn't work. With this change we make sure not to register another task after draining the registrations in order to cancel them. Closes elastic#88201
Backport of #126686 to `7.17`
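A minimal sketch of the race and the fix as described above, with hypothetical names (the real code is `RestCancellableNodeClient`'s per-channel close listener): once the close handler has drained the registered tasks to cancel them, any registration arriving afterwards must be cancelled immediately rather than added to a set nobody will look at again.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

class CloseListenerSketch {
    private final Set<Long> taskIds = new HashSet<>();
    private boolean closed = false; // set once the channel close has been observed
    private final Consumer<Long> cancelTask;

    CloseListenerSketch(Consumer<Long> cancelTask) {
        this.cancelTask = cancelTask;
    }

    // Called when a cancellable task starts on this channel.
    synchronized void registerTask(long taskId) {
        if (closed) {
            // The channel already closed and the registrations were drained:
            // cancel right away instead of parking the task in a dead set.
            cancelTask.accept(taskId);
        } else {
            taskIds.add(taskId);
        }
    }

    // Called when the HTTP channel closes: drain and cancel everything,
    // and make sure no later registerTask() call can slip in unnoticed.
    void onChannelClosed() {
        Set<Long> toCancel;
        synchronized (this) {
            closed = true;
            toCancel = new HashSet<>(taskIds);
            taskIds.clear();
        }
        toCancel.forEach(cancelTask);
    }
}
```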
Build scan:
https://gradle-enterprise.elastic.co/s/2o4ugc7mhz6si/tests/:qa:smoke-test-http:javaRestTest/org.elasticsearch.http.IndicesSegmentsRestCancellationIT/testCatSegmentsRestCancellation
Reproduction line:
```
./gradlew ':qa:smoke-test-http:javaRestTest' --tests "org.elasticsearch.http.IndicesSegmentsRestCancellationIT.testCatSegmentsRestCancellation" -Dtests.seed=8D552D969E2896B -Dtests.locale=ar-IQ -Dtests.timezone=MIT -Druntime.java=8
```
Applicable branches:
7.17
Reproduces locally?:
Didn't try
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.http.IndicesSegmentsRestCancellationIT&tests.test=testCatSegmentsRestCancellation
90-day delivery stats dashboard
Failure excerpt: