
Evict from the shared blob cache asynchronously #126581


Open · nicktindall wants to merge 22 commits into main

Conversation


@nicktindall nicktindall commented Apr 10, 2025

  • I added a ThrottledTaskRunner to optionally execute the shared blob cache evictions asynchronously (a sketch of the wiring is included after this list)
    • SharedSnapshotIndexEventListener and SharedSnapshotIndexFoldersDeletionListener now call the asynchronous method
    • We limit to 5 concurrent deletion threads by default. I think there are limited gains to be had by increasing concurrency much beyond this, because the removals happen inside a mutex anyway (the scan for what to remove can be done concurrently). Open to reducing it to fewer than 5 if we think that’s appropriate.
  • Whenever we are clearing the cache, we only clear the shared cache for partially mounted indices. It looked like only partially mounted indices use the shared cache. I imagine that if recommendations are followed and people use dedicated frozen nodes, there will be limited impact from that change, because the cache will either be empty or always scanned. Open to removing those changes for simplicity’s sake.
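
For illustration, here is a simplified sketch of how the asynchronous path is wired (the runner name, logging, and error handling here are illustrative, not the exact code):

// Sketch only: throttle evictions to at most 5 concurrent tasks on the generic pool.
private final ThrottledTaskRunner evictionRunner =
    new ThrottledTaskRunner("shared_blob_cache_evictions", 5, threadPool.generic());

public void forceEvictAsync(Predicate<KeyType> cacheKeyPredicate) {
    evictionRunner.enqueueTask(ActionListener.wrap(releasable -> {
        try (releasable) {
            forceEvict(cacheKeyPredicate); // the existing synchronous eviction
        }
    }, e -> logger.warn("failed to evict shared blob cache entries", e)));
}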

I don’t know if there’s anything to be gained by moving evictions in the CacheService off the applier thread any more than they already are. I have concerns about making that more asynchronous than it is, because CacheService has a method called waitForCacheFilesEvictionIfNeeded which takes the shardsEvictionsMutex and blocks until any pending evictions for the specified shard have completed. It uses the pendingShardsEvictions map to know whether there are any pending evictions. If we add another layer of asynchrony, we would potentially be adding a “shadow” queue of evictions that this method doesn’t know about. I wonder if that might break things.
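
A simplified illustration of the concern (not the actual CacheService code, just the shape of the wait):

// Simplified illustration only: waitForCacheFilesEvictionIfNeeded can only wait for
// evictions that were registered in pendingShardsEvictions under the mutex.
void waitForCacheFilesEvictionIfNeeded(ShardId shardId) {
    synchronized (shardsEvictionsMutex) {
        // If another layer enqueued the eviction elsewhere (e.g. on a separate task runner)
        // without registering it here, this returns even though an eviction for the shard
        // is still sitting in that "shadow" queue.
        while (pendingShardsEvictions.containsKey(shardId)) {
            try {
                shardsEvictionsMutex.wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}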

If there are performance issues with CacheService evictions, I think we’d be better off optimising the enqueueing and processing of evictions there. Some ideas for that include:

Reduce the amount of lock contention. There is potentially a lot of contention for the shardsEvictionsMutex between the evicting threads on the generic pool and the calls to markShardAsEvictedInCache. I believe there are ways to reduce that.

Reduce the number of concurrent evictions. Currently there is no limit other than the size of the generic pool. We could add a ThrottledTaskRunner, which might reduce the contention enough to make markShardAsEvictedInCache faster.

Relates: ES-10744

@elasticsearchmachine added the v9.1.0 and needs:triage labels on Apr 10, 2025
@elasticsearchmachine added the Team:Distributed Indexing label and removed needs:triage on Apr 10, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

@elasticsearchmachine (Collaborator)

Hi @nicktindall, I've created a changelog YAML for you.

@henningandersen henningandersen (Contributor) left a comment

Looks good, will leave actual approval to Tanguy.

I think we could maybe do the spawn further out to avoid too many tasks - but it may not really be helpful.

I guess this does not address contention on the CacheService evictions, but we can see whether we need to address that too.

@tlrx tlrx (Member) left a comment

I left some comment about the (lack of) reasons to evict cache entries for partially mounted shards. I do like the new forceEvictAsync method, it might be useful in other places too.

final SharedBlobCacheService<CacheKey> sharedBlobCacheService =
SearchableSnapshotIndexFoldersDeletionListener.this.frozenCacheServiceSupplier.get();
assert sharedBlobCacheService != null : "frozen cache service not initialized";
sharedBlobCacheService.forceEvictAsync(SearchableSnapshots.forceEvictPredicate(shardId, indexSettings.getSettings()));
Member

I'm afraid I don't remember all the decisions around evicting cache entries for partially mounted shards 😞

I suspect that we made it this way to allow cache regions to be reused sooner without waiting for them to decay. It was also useful at the beginning when some corrupted data got written in the cache for some reason, as the forced eviction would clear the mess for us.

But besides this, for partially mounted shards I don't see much reason to force the eviction of cache entries vs. just letting them expire in the cache. And if the shard is quickly relocated then reassigned to the same node, I think there is a risk that the async force eviction now runs concurrently with a shard recovery?

So maybe we could only force-evict asynchronously when the shard is deleted or failed, and leave the entries in the cache if it's no longer assigned.

Contributor Author

Thanks! That sounds reasonable. It's easy to implement in SearchableSnapshotIndexEventListener#beforeIndexRemoved because we have the IndexRemovalReason. It's a bit trickier in SearchableSnapshotIndexFoldersDeletionListener#before(Index|Shard)FoldersDeleted because we lack that context; I'll trace back to where those deletions originate to see if there's an obvious way to propagate it.

Contributor Author

I had a go at propagating the reason for the deletion to the listeners. This allows the listener to trigger the asynchronous eviction when we know the shards/indices aren't coming back (i.e. only on DELETE). It meant changes in a few places.

  • I used the IndexRemovalReason to communicate the reason for deletion. I don't like borrowing that from an unrelated interface but we did already have it in scope in some of these places. If we think it's right to use it I could break it out to be a top-level enum rather than being under IndicesClusterStateService.AllocatedIndices.IndexRemovalReason.
  • There are some places that now take a reasonText and an IndexRemovalReason. We could get rid of the reason text if we don't feel it's adding anything, but it would mean some log messages would change. It sometimes seems to offer more context; for example, the text is different when the IndexService executes a pending delete vs. when it succeeds on the first attempt, and "delete unassigned index" specifies that the index is being deleted despite not being assigned to the local node.
  • I think the only time it's safe to schedule an asynchronous delete is on an IndexRemovalReason.DELETE (see the sketch after this list). I don't think FAILURE is appropriate, because I assume we could retry after one of those? I don't think I have the context to make this call.
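
To make that concrete, a rough sketch of the branching (this assumes the existing DELETED and FAILURE constants of IndexRemovalReason; evictAsync/evictSync are placeholders for the async and sync eviction paths, not real methods):

// Sketch only: async eviction when the index is gone for good, synchronous otherwise.
@Override
public void beforeIndexRemoved(IndexService indexService, IndexRemovalReason reason) {
    if (reason == IndexRemovalReason.DELETED) {
        // the shard is not coming back, so eviction can happen off the applier thread
        evictAsync(indexService.index());
    } else {
        // e.g. FAILURE or NO_LONGER_ASSIGNED: evict synchronously so a retry on this
        // node never races with, or reuses, stale cached data
        evictSync(indexService.index());
    }
}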

Member

Thanks Nick.

  • If we think it's right to use it I could break it out to be a top-level enum rather than being under IndicesClusterStateService.AllocatedIndices.IndexRemovalReason.

That makes sense.

  • There are some places that now take a reasonText and an IndexRemovalReason

Thanks for having kept the reason as text. It provides a bit more context and people are also used to searching for it in logs.

  • I think the only time it's safe to schedule an asynchronous delete is on an IndexRemovalReason.DELETE. I don't think FAILURE is appropriate, because I assume we could retry after one of those?

Yes, it is possible that the failed shard gets reassigned to the same node after it failed. But in that case, we don't really know the cause of the failure and I think it would be preferable to synchronously evict the cache. It makes sure that cached data are cleaned up so that retries will fetch them again from the source of truth (if the cached data were the cause of the failure and we were not evicting them, the shard would never have a chance to recover).

It goes against the purpose of this PR but shard failures should be the exception so I think keeping the synchronous eviction is OK for failures.

…ache/shared/SharedBlobCacheService.java

Co-authored-by: Tanguy Leroux <[email protected]>

this.indexSettings,
shardPaths,
IndexRemovalReason.FAILURE
)
Contributor Author

This may be a mis-categorisation as FAILURE. The javadoc seems to suggest it's deleting remnants of a different shard rather than the shard being created, due to a name collision. So we're deleting not because the shard failed to start, but to clear old state from a shard that used to have the same name as the one being started.

Member

I think it's OK to use FAILURE, but maybe worth a comment?

@tlrx tlrx (Member) left a comment

Sorry for the late review. I left a comment about evictions in case of shard failures, otherwise looks good.


Labels
:Distributed Indexing/Searchable Snapshots, >enhancement, Team:Distributed Indexing, v9.1.0