Skip to content

Fix race condition in RestCancellableNodeClient #126686

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

DaveCTurner
Copy link
Contributor

Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201

Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. auto-backport Automatically create backport pull requests when merged v8.18.1 v8.19.0 v9.0.1 v9.1.0 v7.17.29 v8.17.6 labels Apr 11, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Apr 11, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -113,6 +113,7 @@ private void cancelTask(TaskId taskId) {
private class CloseListener implements ActionListener<Void> {
private final AtomicReference<HttpChannel> channel = new AtomicReference<>();
private final Set<TaskId> tasks = new HashSet<>();
private boolean tasksDrained = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: whatever you prefer really but to me it seems that in these spots it's mostly easier to just make tasks non-final and null ist out to signal tasksDrained, removing the need for the copy and somewhat hardening the design against adding a task after tasks have been drained?
It's also one less field which is always nice :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah fair enough, see 20f492f. Required adding more null checks than I initially expected...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right the stats :) thanks!

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Apr 11, 2025
@elasticsearchmachine elasticsearchmachine merged commit 1461820 into elastic:main Apr 11, 2025
17 checks passed
@DaveCTurner DaveCTurner deleted the 2025/04/11/RestCancellableNodeClient-reuse-bug branch April 11, 2025 15:00
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.18
8.x
9.0
7.17 Commit could not be cherrypicked due to conflicts
8.17

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 126686

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes elastic#88201

Backport of elastic#126686 to `7.17`
@DaveCTurner
Copy link
Contributor Author

Backport to 7.17 is #126703

elasticsearchmachine pushed a commit that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
elasticsearchmachine pushed a commit that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
elasticsearchmachine pushed a commit that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
elasticsearchmachine pushed a commit that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201
elasticsearchmachine pushed a commit that referenced this pull request Apr 11, 2025
Today we rely on registering the channel after registering the task to
be cancelled to ensure that the task is cancelled even if the channel is
closed concurrently. However the client may already have processed a
cancellable request on the channel and therefore this mechanism doesn't
work. With this change we make sure not to register another task after
draining the registrations in order to cancel them.

Closes #88201

Backport of #126686 to `7.17`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. Team:Distributed Coordination Meta label for Distributed Coordination team v7.17.29 v8.17.6 v8.18.1 v8.19.0 v9.0.1 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] IndicesSegmentsRestCancellationIT testCatSegmentsRestCancellation failing
3 participants