Upgrade repository-s3 to AWS SDK v2 #126843


Merged

Conversation

DaveCTurner (Contributor)

Closes #120993

@DaveCTurner requested a review from a team as a code owner April 15, 2025 12:23
@elasticsearchmachine added the Team:Distributed Coordination label (Meta label for Distributed Coordination team) Apr 15, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine (Collaborator)

Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled >breaking and release highlight, you need to update the changelog YAML to fill out the extended information sections.

@DaveCTurner requested a review from nicktindall April 15, 2025 13:29
@mhl-b (Contributor) left a comment

I went through half of the PR and left some comments. Will do the second half tomorrow.

Comment on lines 665 to 669
// S3 SDK stops retrying after TOKEN_BUCKET_SIZE/DEFAULT_EXCEPTION_TOKEN_COST == 500/5 == 100 failures in quick succession
// see software.amazon.awssdk.retries.DefaultRetryStrategy.Legacy.TOKEN_BUCKET_SIZE
// see software.amazon.awssdk.retries.DefaultRetryStrategy.Legacy.DEFAULT_EXCEPTION_TOKEN_COST
// TODO NOMERGE: do we need to use (100 - DEFAULT_EXCEPTION_TOKEN_COST) to avoid running out of tokens?
private final Semaphore failurePermits = new Semaphore(100);
Contributor

Why should the server be aware of a client-side setting? If it's protection against running into the client-side circuit breaker, we can disable the CB in tests. I can't imagine why we would need it here.

Contributor Author

These tests are deliberately checking the SDK retry behaviour - see S3ErroneousHttpHandler. So yeah we could disable the CB, but that would then be testing a different configuration from what users will use in production.
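
For reference, a minimal sketch of the disable-the-CB alternative discussed above, assuming plain SDK v2 builder usage (RetryPolicy.none() disables retries, and with them the token bucket; this is deliberately not what the PR does):

import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.services.s3.S3Client;

// Sketch only: with retries off there is no retry token bucket to exhaust,
// but the client no longer matches the configuration users run in production.
S3Client clientWithoutRetries = S3Client.builder()
    .overrideConfiguration(
        ClientOverrideConfiguration.builder()
            .retryPolicy(RetryPolicy.none())
            .build()
    )
    .build();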


@SuppressForbidden(reason = "implements AWS api that uses java.io.File!")
public class AmazonS3Wrapper extends AbstractAmazonS3 {
public class AmazonS3Wrapper implements S3Client {
Contributor

Wrapping the whole S3Client is largely redundant. We could wrap only the subset of methods we use and expose a simpler API.

Contributor

Perhaps we need to wrap because we use the AmazonS3ClientBuilder and all the logic in S3Service for construction.

Nit: I'd call this thing DelegatingS3Client though to make it clear that it just delegates.

Contributor Author

Yeah the name isn't great (and indeed I also wonder if we could do this more simply) but I'm going to leave it alone for this PR.
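
For illustration, a hedged sketch of what the suggested DelegatingS3Client could look like: SDK v2's S3Client interface gives every operation a default implementation that throws UnsupportedOperationException, so a delegating wrapper only needs to forward the operations actually used (the two shown here are examples, not the PR's full set):

import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectResponse;

class DelegatingS3Client implements S3Client {
    private final S3Client delegate;

    DelegatingS3Client(S3Client delegate) {
        this.delegate = delegate;
    }

    @Override
    public String serviceName() {
        return delegate.serviceName();
    }

    @Override
    public void close() {
        delegate.close();
    }

    // Operations not overridden keep the interface default, which throws
    // UnsupportedOperationException if something calls them unexpectedly.
    @Override
    public ResponseInputStream<GetObjectResponse> getObject(GetObjectRequest request) {
        return delegate.getObject(request);
    }

    @Override
    public PutObjectResponse putObject(PutObjectRequest request, RequestBody body) {
        return delegate.putObject(request, body);
    }
}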

import static org.hamcrest.Matchers.blankOrNullString;
import static org.hamcrest.Matchers.not;

public class S3RepositoryAnalysisRestIT extends AbstractRepositoryAnalysisRestTestCase {

static final boolean USE_FIXTURE = Boolean.parseBoolean(System.getProperty("tests.use.fixture", "true"));

public static final S3HttpFixture s3Fixture = new S3HttpFixture(USE_FIXTURE);
private static final Supplier<String> regionSupplier = new DynamicRegionSupplier();
Contributor

What's the reason for the dynamic-ness? I read the comment inside DynamicRegionSupplier about the static context, but at the same time I see ESTestCase.randomIdentifier() used in static fields in other tests.

Contributor Author

Can you point me to such a place? If you just write `private static final String REGION = ESTestCase.randomIdentifier();` then the test won't even start:

No context information for thread: Thread[id=29, name=SUITE-S3RepositoryAnalysisRestIT-seed#[6FFD4A25FEF46CB7], state=RUNNABLE, group=TGRP-S3RepositoryAnalysisRestIT]. Is this thread running under a class com.carrotsearch.randomizedtesting.RandomizedRunner runner context? Add @RunWith(class com.carrotsearch.randomizedtesting.RandomizedRunner.class) to your test class. Make sure your code accesses random contexts within @BeforeClass and @AfterClass boundary (for example, static test class initializers are not permitted to access random contexts).
java.lang.IllegalStateException: No context information for thread: Thread[id=29, name=SUITE-S3RepositoryAnalysisRestIT-seed#[6FFD4A25FEF46CB7], state=RUNNABLE, group=TGRP-S3RepositoryAnalysisRestIT]. Is this thread running under a class com.carrotsearch.randomizedtesting.RandomizedRunner runner context? Add @RunWith(class com.carrotsearch.randomizedtesting.RandomizedRunner.class) to your test class. Make sure your code accesses random contexts within @BeforeClass and @AfterClass boundary (for example, static test class initializers are not permitted to access random contexts).
	at com.carrotsearch.randomizedtesting.RandomizedContext.context(RandomizedContext.java:249)
	at com.carrotsearch.randomizedtesting.RandomizedContext.current(RandomizedContext.java:134)
	at com.carrotsearch.randomizedtesting.RandomizedTest.getContext(RandomizedTest.java:72)
	at com.carrotsearch.randomizedtesting.RandomizedTest.getRandom(RandomizedTest.java:92)
	at com.carrotsearch.randomizedtesting.RandomizedTest.randomAsciiLettersOfLengthBetween(RandomizedTest.java:588)
	at com.carrotsearch.randomizedtesting.RandomizedTest.randomAsciiOfLengthBetween(RandomizedTest.java:573)
	at org.elasticsearch.test.ESTestCase.randomAlphaOfLengthBetween(ESTestCase.java:1223)
	at org.elasticsearch.test.ESTestCase.randomIdentifier(ESTestCase.java:1279)
	at org.elasticsearch.repositories.blobstore.testkit.analyze.S3RepositoryAnalysisRestIT.<clinit>(S3RepositoryAnalysisRestIT.java:31)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:543)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:631)
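
For illustration, the lazy-supplier pattern that avoids this failure, as a hypothetical sketch (the PR's actual DynamicRegionSupplier is not shown in this excerpt):

import java.util.function.Supplier;
import org.elasticsearch.test.ESTestCase;

class LazyRegionSupplier implements Supplier<String> {
    private String region;

    @Override
    public synchronized String get() {
        // Deferred until first use: get() runs inside the randomized test
        // lifecycle where a RandomizedContext exists, whereas a static
        // initializer runs before any context is installed.
        if (region == null) {
            region = ESTestCase.randomIdentifier();
        }
        return region;
    }
}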


package org.elasticsearch.repositories.s3;

public enum HttpScheme {
@mhl-b (Contributor) Apr 15, 2025

Nit: it also exists in org.elasticsearch.discovery.ec2. It could be defined once in org.elasticsearch.http.

Contributor Author

Yeah, didn't really want this in :server and there's nowhere else that makes sense either. NBD I think.

Comment on lines 102 to 110
private static Region getDefaultRegion() {
return AccessController.doPrivileged((PrivilegedAction<Region>) () -> {
try {
return DefaultAwsRegionProviderChain.builder().build().getRegion();
} catch (Exception e) {
logger.debug("failed to obtain region from default provider chain", e);
return null;
}
});
@mhl-b (Contributor) Apr 15, 2025

I'm not following why createComponents overrides region resolution into a default one, when S3Service uses an explicit -> guesser -> fallback chain. I think a single AwsProfileRegionProvider which is a chain of
explicit -> default (from profile) -> guesser -> fallback can serve all needs for prod and test. Not even sure that fallback should exist; it's kind of a dangerous default.

Contributor Author

Not even sure that fallback should exist; it's kind of a dangerous default.

I agree, but simply refusing to proceed without a region would be a massive breaking change for our users since SDKv1 was lenient in this area. I hope to get to that point in v10 but we will have to live with this lenience for now.

I think you're right, we could compute the region by constructing our own AwsRegionProviderChain and then calling getRegion() on it. I'm not sure I see the advantage vs just computing it directly as we have here but maybe I'm missing something?
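
For comparison, a sketch of the provider-chain shape being discussed, with assumed plumbing such as the explicitRegion parameter (the PR computes the region directly instead):

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.regions.providers.AwsRegionProviderChain;
import software.amazon.awssdk.regions.providers.DefaultAwsRegionProviderChain;

static Region resolveRegion(Region explicitRegion) {
    var chain = new AwsRegionProviderChain(
        () -> {
            if (explicitRegion == null) {
                // Throwing makes the chain fall through to the next provider
                throw new IllegalStateException("no explicit region configured");
            }
            return explicitRegion;
        },
        DefaultAwsRegionProviderChain.builder().build(), // env vars, system props, profile, IMDS
        () -> Region.US_EAST_1 // the lenient fallback both reviewers are wary of
    );
    return chain.getRegion();
}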

@nicktindall (Contributor) left a comment

I got through 43/61 files. I'm down to the tests now. I'll submit this feedback and continue to look at the tests tomorrow.

assertThat(client, instanceOf(ProxyS3RepositoryPlugin.ClientAndCredentials.class));
try (var clientReference = repository.createBlobStore().clientReference()) {
final S3Client client = clientReference.client();
assertThat(client, instanceOf(ProxyS3RepositoryPlugin.ClientAndCredentials.class));
Contributor

Nit: could use asInstanceOf to reduce verbosity (cast and assert in one)? Also repeated below a few times.



.maxUploads(maxUploads)
.overrideConfiguration(
b -> b.putRawQueryParameter(S3BlobStore.CUSTOM_QUERY_PARAMETER_PURPOSE, OperationPurpose.SNAPSHOT_DATA.getKey())
)
Contributor

We sometimes do configureRequestForMetrics and sometimes just add the above parameter. What's the reason for the two different approaches?

Contributor Author

I suspect it's because this cleanup is a background process and it messed up the metrics collection tests. Added TODOs to come back to this.
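
One way the two approaches could converge is a shared helper along these lines (withPurpose is a hypothetical name, not a method in the PR; OperationPurpose and S3BlobStore.CUSTOM_QUERY_PARAMETER_PURPOSE are the PR's own types):

import software.amazon.awssdk.awscore.AwsRequestOverrideConfiguration;

// Hypothetical helper: every call site attaches the purpose parameter the
// same way, leaving a single place to bolt on metrics plumbing later.
static AwsRequestOverrideConfiguration withPurpose(OperationPurpose purpose) {
    return AwsRequestOverrideConfiguration.builder()
        .putRawQueryParameter(S3BlobStore.CUSTOM_QUERY_PARAMETER_PURPOSE, purpose.getKey())
        .build();
}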

s3RepositoriesMetrics.common().operationCounter().incrementBy(1, attributes);
operations.increment();
if (numberOfAwsErrors == requestCount) {
if (overallSuccess == false) {
s3RepositoriesMetrics.common().unsuccessfulOperationCounter().incrementBy(1, attributes);
}

s3RepositoriesMetrics.common().requestCounter().incrementBy(requestCount, attributes);
Contributor

We use requestCount here but responseCount above; is that intentional?

Contributor Author

I think this reflects the existing behaviour: requests tracks the number of HTTP requests which got a response (since we assume those without a response didn't make it to S3 and therefore aren't billable), whereas s3RepositoriesMetrics.common().requestCounter() tracks the number of HTTP requests emitted regardless of whether they got a response. The naming in this area is not good.

@nicktindall (Contributor) Apr 22, 2025

The operation counter should be the count of billable things that happened, I think?

i.e. operations counts the logical blob-store operations, while requests includes things like retries due to throttles or errors that we execute along the way. I think it's right not to count requests for which we received no response, but I think the request counts for the metrics and stats endpoints should be the same? In practice we're probably talking about an exceedingly small number, but my understanding was that the metrics and the in-memory counters were just two ways of exposing the same counter (with the caveat that the in-memory ones get reset when the node restarts, of course).

Contributor Author

The one relevant to billing is requests, whereas the metrics are for our own internal usage, so I'm not sure they should be exactly aligned. In any case, they're already not aligned in main today and I'd rather leave that as-is in this PR, potentially to be revisited in future.

(I agree, the difference should be very small in practice; maybe we should have metrics for both requests and responses to validate that)

Contributor

The value returned in the metering endpoint is operations (see here)

We had to clarify after we did Azure because Azure was reporting "operations" and AWS was reporting "requests". The former being the logical/billing operations the latter being the constituent requests (including retries etc.). The distinction becomes even more tricky in GCP where e.g. a resumable upload is billed as a single operation regardless of how many chunks you upload.

We have just gone for a "best effort" where operations are the billable/logical operations and requests are a raw count of HTTP requests, as that seemed consistent with the original AWS impl. And the rules for how requests are billed are way too complex to reliably count "billable" operations.

I chased up with the perf and billing people and the billing people were interested in "operations" and the perf people were interested in "requests" so the metering endpoint (/_nodes/{nodeId}/_repositories_metering) returns "operations" and the stats endpoint (/_internal/blob_store/stats, serverless only) returns "requests". (see ES-9767)
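
A worked example of the operations/requests distinction, using the counters from the snippet above (the scenario and numbers are hypothetical):

// One logical GetObject that is throttled twice before succeeding:
//   operations += 1  (billable/logical; what the metering endpoint reports)
//   requests   += 3  (initial attempt + 2 retries; what the stats endpoint reports)
s3RepositoriesMetrics.common().operationCounter().incrementBy(1, attributes);
s3RepositoriesMetrics.common().requestCounter().incrementBy(3, attributes);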

@DaveCTurner added the auto-merge-without-approval label (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) Apr 23, 2025
@elasticsearchmachine added the serverless-linked label (Added by automation, don't add manually) Apr 23, 2025
@elasticsearchmachine merged commit b028c0a into elastic:main Apr 24, 2025
17 checks passed
@DaveCTurner deleted the 2025/04/15/repository-s3-sdk-v2 branch April 24, 2025 11:21
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch 8.x: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 126843`.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 24, 2025
elasticsearchmachine pushed a commit that referenced this pull request Apr 24, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 25, 2025
Catching `Exception` instead of `SdkException` in `copyBlob` and
`executeMultipart` led to failures in `S3RepositoryAnalysisRestIT` due
to the injected exceptions getting wrapped in `IOExceptions` that
prevented them from being caught and handled in `BlobAnalyzeAction`.

Repeat of elastic#126731, regressed due to elastic#126843
Closes elastic#127399
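
The shape of the fix, sketched with an assumed method name (only the catch type matters here):

import java.io.IOException;
import software.amazon.awssdk.core.exception.SdkException;

try {
    executeMultipartCopy(request); // hypothetical stand-in for copyBlob / executeMultipart
} catch (SdkException e) {
    // Only SDK-level failures become IOExceptions; anything else, such as the
    // exceptions S3RepositoryAnalysisRestIT injects, propagates unwrapped so
    // BlobAnalyzeAction can still recognise and handle it.
    throw new IOException("failed to copy blob", e);
}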
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 25, 2025
bcully pushed a commit that referenced this pull request Apr 25, 2025
DaveCTurner added a commit that referenced this pull request Apr 26, 2025
nicktindall added a commit to elastic/docs-content that referenced this pull request May 5, 2025
These changes are to bring the docs into alignment with the changes made as part of the S3 upgrade elastic/elasticsearch#126843
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 6, 2025
The `s3.client.CLIENT_NAME.protocol` setting became unused in elastic#126843 as
it is inapplicable in the v2 SDK. However, the v2 SDK requires the
`s3.client.CLIENT_NAME.endpoint` setting to be a URL that includes a
scheme, so in elastic#127489 we prepend a `https://` to the endpoint if needed.
This commit generalizes this slightly so that we prepend `http://` if
the endpoint has no scheme and the `.protocol` setting is set to `http`.
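
A sketch of the described normalisation with assumed names (the real logic lives in the repository-s3 settings code):

// SDK v2 requires a full URL where SDK v1 tolerated a bare host, so prepend
// a scheme only when the endpoint setting lacks one.
static String normalizeEndpoint(String endpoint, String protocol) {
    if (endpoint.contains("://")) {
        return endpoint; // already a full URL
    }
    return ("http".equals(protocol) ? "http://" : "https://") + endpoint;
}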
Labels

- auto-backport (Automatically create backport pull requests when merged)
- auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!)
- backport pending
- >breaking
- :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
- release highlight
- serverless-linked (Added by automation, don't add manually)
- Team:Distributed Coordination (Meta label for Distributed Coordination team)
- v8.19.0
- v9.1.0

Development

Successfully merging this pull request may close these issues:

Upgrade AWS SDK v1 to v2