POC: Add support for per project repo client #126584

Draft · wants to merge 11 commits into base: main
Conversation

ywangd (Member) commented Apr 10, 2025

The intention of this draft PR is to get agreement on the approach for managing per-project repository clients. The current changes are for S3 only, but similar changes should be possible for Azure and GCS as well. I do plan to have a PoC for at least one of them too. Raising the initial PR for early feedback.

A few things worth noting about the per-project clients (see the sketch after this list):

  • Clients are created, deleted, and rebuilt by listening to project secrets changes in the cluster state.
  • The mechanism is separate from, and does not go through, the reloadable plugin framework, which is still used in single-default-project mode and for the Serverless MP cluster-level client.
  • Each project is managed separately, so an update to one project does not impact another. Within the same project, updating or closing one client leads to a rebuild of all of that project's clients, similar to how all clients are rebuilt on an update from the reloadable plugin interface.
  • Overriding client settings with repository settings is not supported, although it is still supported in single-default-project mode.
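
For illustration, here is a minimal sketch of that lifecycle, assuming a hypothetical manager keyed by project id that is invoked whenever project secrets change in the cluster state. The names and types (ClientsHolder, the secrets map) are stand-ins, not the classes introduced in this PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: names and types here are hypothetical stand-ins,
// not the actual classes introduced in this PR.
class PerProjectClientManagerSketch {

    /** Per-project holder so that rebuilding or closing clients is scoped to a single project. */
    static final class ClientsHolder {
        final Map<String, String> clientSettings; // hypothetical per-project client settings/secrets

        ClientsHolder(Map<String, String> clientSettings) {
            this.clientSettings = clientSettings;
        }

        void close() {
            // close the underlying SDK clients of this project
        }
    }

    private final Map<String, ClientsHolder> holders = new ConcurrentHashMap<>();

    /**
     * Invoked on the cluster applier thread whenever project secrets change in the cluster state:
     * creates holders for new projects, rebuilds holders whose secrets changed, and removes
     * holders for deleted projects. Updating one project never touches another project's holder.
     */
    void onProjectSecretsChanged(Map<String, Map<String, String>> secretsByProject) {
        // create or rebuild holders for the current projects
        secretsByProject.forEach((projectId, secrets) -> {
            ClientsHolder existing = holders.get(projectId);
            if (existing == null || existing.clientSettings.equals(secrets) == false) {
                ClientsHolder previous = holders.put(projectId, new ClientsHolder(secrets));
                if (previous != null) {
                    previous.close(); // in the real change this would be forked off the applier thread
                }
            }
        });
        // drop holders for projects that no longer exist
        holders.entrySet().removeIf(entry -> {
            if (secretsByProject.containsKey(entry.getKey()) == false) {
                entry.getValue().close();
                return true;
            }
            return false;
        });
    }
}
```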

Relates: ES-11383

Comment on lines 170 to 188
/**
 * Delegates to {@link #client(RepositoryMetadata)} when
 * 1. the per-project client is disabled, or
 * 2. the blobstore is cluster level (projectId == null).
 * Otherwise, attempts to retrieve a per-project client by the project-id and repository metadata from the
 * per-project client manager. Throws if the project-id or the client does not exist. The client may be initialized lazily.
 */
public AmazonS3Reference client(@Nullable ProjectId projectId, RepositoryMetadata repositoryMetadata) {
    if (perProjectClientManager == null) {
        // Multi-Project is disabled and we have a single default project
        assert ProjectId.DEFAULT.equals(projectId) : projectId;
        return client(repositoryMetadata);
    } else if (projectId == null) {
        // Multi-Project is enabled and we are retrieving a client for the cluster level blobstore
        return client(repositoryMetadata);
    } else {
        return perProjectClientManager.client(projectId, repositoryMetadata);
    }
}
ywangd (Member, Author) commented:

This is the main interface for accessing the repo client in different scopes. Note the use of a null projectId for the cluster-level blobstore, similar to how cluster-level persistent tasks are handled.
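
For illustration, the two call sites this enables might look as follows. This is a hypothetical usage sketch only; the parameter types are the ones from the snippet above and are assumed to be in scope:

```java
// Hypothetical call sites, for illustration only.
void resolveClients(S3Service s3Service, ProjectId projectId, RepositoryMetadata repositoryMetadata) {
    // Cluster-level blobstore: a null project-id delegates to the existing single-client path.
    try (AmazonS3Reference clusterLevel = s3Service.client(null, repositoryMetadata)) {
        // ... use the referenced cluster-level client
    }
    // Project-scoped blobstore: resolved through the per-project client manager.
    try (AmazonS3Reference projectScoped = s3Service.client(projectId, repositoryMetadata)) {
        // ... use the referenced per-project client
    }
}
```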

ywangd added the >non-issue and :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) labels on Apr 10, 2025
ywangd (Member, Author) commented Apr 11, 2025

@DaveCTurner @henningandersen I updated the PR with changes to GCS repository clients as well. Also tightened up clientReference management for S3.

Overall the S3 changes have more complexity due to refcounting for the clients. I ended up using synchronized blocks similar to the existing code. The synchronization is per-project and is needed only for project creation/deletion and client settings updates, so contention should be rather low.

The GCS repo does not allow overriding client settings with repo settings, so it might be feasible to merge the per-project client manager with the existing clientsCache. For now I am keeping them separate since it feels easier to reason about the changes. It might also make sense to keep them separate long term because the clients are refreshed through different mechanisms: the reloadable plugin vs the cluster state listener.

Since Azure does not have a clients cache, I think the changes should be relatively straightforward. It also does not allow overriding client settings with repo settings.

Given the draft status of the PR, I am mostly seeking consensus on the overall approach, e.g. keeping the per-project and existing client management separate, how the per-project manager is hooked in, and how the project-id is passed around and used. Thanks!

DaveCTurner (Contributor) commented:

> I ended up using synchronized blocks similar to the existing code.

I'm not a huge fan of creating/closing the clients while holding a mutex; I think we might need something more refined here, especially since with this change we now might block the cluster applier thread on this work. The CHM-based approach in GCS is better, but still blocking. Creating a client should be cheap enough that if we cannot find a cached client then we can optimistically create a fresh one before trying to replace it in the map, accepting that if multiple threads do this then we'd want to discard the clients created by the threads that lost the race.
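
A minimal sketch of that optimistic pattern, assuming hypothetical createClient/closeQuietly helpers supplied by the caller (not the actual PR code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustrative sketch only: K is the client/settings key and C stands in for the SDK client type.
class OptimisticClientCacheSketch<K, C> {

    private final ConcurrentMap<K, C> cache = new ConcurrentHashMap<>();

    C getOrCreate(K key, Function<K, C> createClient, Consumer<C> closeQuietly) {
        C existing = cache.get(key);
        if (existing != null) {
            return existing;
        }
        // Optimistically build a fresh client without holding any lock; creation is assumed cheap.
        C fresh = createClient.apply(key);
        C winner = cache.putIfAbsent(key, fresh);
        if (winner != null) {
            // Another thread won the race: discard the client we just built and use the winner's.
            closeQuietly.accept(fresh);
            return winner;
        }
        return fresh;
    }
}
```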

Note that with SDKv2 we will own the lifecycle of the HTTP client independently of the S3 SDK client, so could potentially share the HTTP client across multiple projects.

For my education, are we going to be accepting credentials via file-based settings and putting them in the cluster state, or do those still use the keystore?

ywangd (Member, Author) commented Apr 11, 2025

Thanks for the feedback, David.

For the CHM usages, there are two levels. The first is the Map<ProjectId, ClientsHolder>, which tracks the live projects based on cluster state updates. I think this usage is unavoidable, but it should have no contention since all updates happen on the applier thread. The second level of CHM usage is inside GCS's ClientsHolder, i.e. clientCache. This one is never called on the applier thread, so it should have no issue.

I agree that the synchronized usage is not desirable. I tried to avoid it initially, but it proved difficult while maintaining the current semantics. It might be possible if we are OK with creating the same client more than once, as you suggested. But I wonder whether you think it would be an acceptable solution to simply fork client closing to a different thread? Closing the clients is the only use case where the applier thread runs into the synchronization. Since there is no requirement to close the clients synchronously, I feel this could be a simple solution while maintaining all existing semantics. In fact, I think we do want to move closing the clients to a different thread regardless of the synchronization, since it might itself be expensive (by closing the underlying S3 SDK client). What do you think?
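
For illustration, forking the close off the applier thread could look roughly like the following. This is a sketch under assumptions: in Elasticsearch this would use a ThreadPool executor rather than a raw ExecutorService, and the closeables would be the per-project client references.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;

// Illustrative only: not the actual PR code.
final class AsyncClientCloser {

    private final ExecutorService executor;

    AsyncClientCloser(ExecutorService executor) {
        this.executor = executor;
    }

    /** Called from the cluster applier thread: only submits work, never blocks on the close itself. */
    void closeAsync(List<Closeable> staleClients) {
        executor.execute(() -> {
            for (Closeable client : staleClients) {
                try {
                    client.close(); // potentially expensive (shuts down the underlying SDK client)
                } catch (IOException e) {
                    // log and continue; a failed close must not prevent closing the rest
                }
            }
        });
    }
}
```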

> are we going to be accepting credentials via file-based settings and putting them in the cluster state, or do those still use the keystore?

Per-project credentials will be provided via per-project file-based secrets (see ReservedProjectSecretsAction in the serverless repo). Cluster-level credentials will still be provisioned as they are today, i.e. via the reloadable plugin mechanism that originated with the keystore.

DaveCTurner (Contributor) commented:

> I think we do want to move closing the clients to a different thread regardless of the synchronization, since it might itself be expensive (by closing the underlying S3 SDK client). What do you think?

Yes I agree. We should probably block the doClose() of the S3Service until all the clients have finished closing as well. That's probably enough to keep the blocking away from the cluster applier indeed.
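
One way to let doClose() wait for all forked close tasks, sketched here with a java.util.concurrent.Phaser (illustrative only, not the PR's implementation):

```java
import java.util.concurrent.Phaser;

// Illustrative only: tracks in-flight close tasks so shutdown can wait for them to finish.
final class CloseTracker {

    // Registered once for the tracker itself so the phaser does not terminate early.
    private final Phaser inFlight = new Phaser(1);

    /** Call before forking a close task. */
    void onCloseSubmitted() {
        inFlight.register();
    }

    /** Call from the forked task when the close has completed (successfully or not). */
    void onCloseCompleted() {
        inFlight.arriveAndDeregister();
    }

    /** Called from doClose(): blocks until every submitted close task has completed. */
    void awaitAllClosed() {
        // Arrive for the tracker's own registration and wait for the current phase to complete,
        // which happens once all submitted close tasks have arrived.
        inFlight.arriveAndAwaitAdvance();
    }
}
```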

ywangd (Member, Author) commented Apr 29, 2025

I assume we are OK with the overall approach, so I will go ahead and raise the formal PR for S3 first. Thanks for the discussion!

Labels: >non-issue, :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), v9.1.0