Unlink delegate in ReleasableBytesReference once ref count reaches 0 #127058

Merged

Conversation

original-brownbear (Member)

We have some spots where we retain a reference to the `ReleasableBytesReference` instance
well beyond its ref-count reaching `0`. If it itself references Netty buffers or `BigArrays`
that are not pooled (mostly as a result of overflowing the pooled number of bytes for large
messages or under heavy load) then those bytes are not GC-able unless we unlink them here.
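
For illustration, a minimal sketch of the unlinking pattern, using a hypothetical `RefCountedHolder` rather than the actual `ReleasableBytesReference` code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the unlinking pattern, not the real class:
// once the ref count hits zero, the holder drops its reference to the
// underlying buffer, so the bytes become GC-able even if the holder itself
// is retained somewhere well past release.
final class RefCountedHolder {
    private final AtomicInteger refCount = new AtomicInteger(1);
    private byte[] delegate; // plain field: nulled exactly once, on the final decRef

    RefCountedHolder(byte[] delegate) {
        this.delegate = delegate;
    }

    void incRef() {
        int previous = refCount.getAndIncrement();
        assert previous > 0 : "resurrected an already-released holder";
    }

    boolean decRef() {
        int remaining = refCount.decrementAndGet();
        assert remaining >= 0 : "ref count went negative";
        if (remaining == 0) {
            delegate = null; // unlink: break the GC chain to the (possibly unpooled) bytes
            return true;
        }
        return false;
    }

    boolean hasReferences() {
        return refCount.get() > 0;
    }

    byte[] bytes() {
        assert hasReferences() : "accessed after release";
        return delegate;
    }
}
```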
original-brownbear requested a review from a team as a code owner April 18, 2025 13:41
elasticsearchmachine added the Team:Distributed Coordination label Apr 18, 2025
elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

DaveCTurner (Contributor) left a comment

Deserves a few more assertions or at least null checks, e.g. in `isFragment()`, `ramBytesUsed()`, `length()`, and `retainedSlice()`; also `retain()` should probably `mustIncRef` now.

original-brownbear (Member, Author)

Makes sense; added an assertion on the delegate into `hasReferences()`, which we assert on everywhere anyway :)
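
For illustration, the `retain()` part of the suggestion might look roughly like this (hypothetical method bodies, not the exact patch; the actual `hasReferences()` assertion is quoted in the snippet further down the thread):

```java
// Hypothetical sketch of the suggested hardening, not the exact patch.
// mustIncRef() asserts that the increment actually took a live reference,
// where a plain incRef() on a released instance could silently "succeed".
public ReleasableBytesReference retain() {
    refCounted.mustIncRef();
    return this;
}

public int length() {
    assert hasReferences(); // accessors fail fast in tests instead of reading a nulled delegate
    return delegate.length();
}
```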

```diff
@@ -282,7 +281,7 @@ public void testIndicesPrivilegesAreEnforcedForCcrRestoreSessionActions() throws
     GetCcrRestoreFileChunkAction.REMOTE_TYPE,
     new GetCcrRestoreFileChunkRequest(response2.getNode(), sessionUUID2, leaderIndex2FileName, 1, shardId2)
 );
-assertThat(getChunkResponse.getChunk().length(), equalTo(1));
+assertFalse(getChunkResponse.getChunk().hasReferences());
```
original-brownbear (Member, Author)

Heh, this is pretty much exactly what this is about :P This assertion shouldn't have worked, but it did as long as we didn't check for having a reference in the `length()` call. I see no value in the original version, and there's a tiny bit of value in checking that we actually released the reference here, but I guess we could just as well drop any assertion on the response.

DaveCTurner (Contributor) left a comment

LGTM

```java
boolean hasRef = refCounted.hasReferences();
// delegate is nulled out when the ref-count reaches zero, but only via a plain store, and we
// could also be racing with a concurrent decRef, so we need to check #refCounted again in case
// we run into a null delegate but saw a reference before
assert delegate != null || hasRef == false || refCounted.hasReferences() == false;
```
DaveCTurner (Contributor)

We could just check `refCounted.hasReferences()` again; no need to check `hasRef`, right?

original-brownbear (Member, Author)

I got this idea into my head that we might be introducing far more happens-before into everything than necessary via assertions, and that as a result some prod bugs aren't reproducing. But on reflection it doesn't make much sense here: a false `hasRef` read twice should never introduce a happens-before. I'll clean it up before merging :)
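
The cleaned-up assertion presumably just re-reads the ref count, along these lines (a guess at the final shape, not the merged code):

```java
// Hypothetical cleaned-up form: if we still see a live ref count after
// observing the delegate, the delegate must not have been nulled out yet.
assert delegate != null || refCounted.hasReferences() == false;
```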

original-brownbear added the auto-backport label May 6, 2025
original-brownbear (Member, Author)

Thanks David!

original-brownbear merged commit f69fba6 into elastic:main May 6, 2025
17 checks passed
original-brownbear deleted the break-gc-chain-delegate branch May 6, 2025 11:44
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 6, 2025
…lastic#127058)

elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.19

elasticsearchmachine pushed a commit that referenced this pull request May 6, 2025
…127058) (#127749)

Labels: auto-backport, :Distributed Coordination/Network, >non-issue, Team:Distributed Coordination, v8.19.0, v9.1.0