Upgrade repository-s3 to AWS SDK v2 #126843
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled …
I went through half of the PR. Left some comments. Will do the second half tomorrow.
```java
// S3 SDK stops retrying after TOKEN_BUCKET_SIZE/DEFAULT_EXCEPTION_TOKEN_COST == 500/5 == 100 failures in quick succession
// see software.amazon.awssdk.retries.DefaultRetryStrategy.Legacy.TOKEN_BUCKET_SIZE
// see software.amazon.awssdk.retries.DefaultRetryStrategy.Legacy.DEFAULT_EXCEPTION_TOKEN_COST
// TODO NOMERGE: do we need to use (100 - DEFAULT_EXCEPTION_TOKEN_COST) to avoid running out of tokens?
private final Semaphore failurePermits = new Semaphore(100);
```
Why should the server be aware of a client-side setting? If this is protection against running into the client-side circuit breaker, we can disable the CB in tests. I can't imagine why we would need it here.
These tests are deliberately checking the SDK retry behaviour - see `S3ErroneousHttpHandler`. So yeah, we could disable the CB, but that would then be testing a different configuration from what users will use in production.
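For illustration, a minimal sketch of how a failure-injecting test handler might use the semaphore quoted above to cap injected failures so the SDK's legacy retry token bucket (500 tokens, 5 per retryable failure, hence 100 failures) is never exhausted; `shouldFail()` and `handleSuccessfully()` are hypothetical names, not the actual `S3ErroneousHttpHandler` code:

```java
import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;
import java.util.concurrent.Semaphore;

// Hypothetical sketch, not the actual handler: allow at most 100 injected
// failures so the SDK's retry token bucket cannot run dry mid-test.
private final Semaphore failurePermits = new Semaphore(100);

private void maybeInjectFailure(HttpExchange exchange) throws IOException {
    if (shouldFail() && failurePermits.tryAcquire()) {
        exchange.sendResponseHeaders(500, -1); // injected retryable failure, consumes one permit
    } else {
        handleSuccessfully(exchange);          // no permits left (or no failure wanted): succeed
    }
}
```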
```diff
 @SuppressForbidden(reason = "implements AWS api that uses java.io.File!")
-public class AmazonS3Wrapper extends AbstractAmazonS3 {
+public class AmazonS3Wrapper implements S3Client {
```
Wrapping the whole S3Client is largely redundant. We could wrap only the subset of methods we use and expose a simpler API.
Perhaps we need to wrap because we use the AmazonS3ClientBuilder and all the logic in S3Service for construction.

Nit: I'd call this thing DelegatingS3Client though, to make it clear that it just delegates.
Yeah the name isn't great (and indeed I also wonder if we could do this more simply) but I'm going to leave it alone for this PR.
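As a sketch of the delegation pattern being discussed (using the name suggested above; the PR's actual wrapper forwards more operations), SDK v2 declares S3Client's operation methods as default methods that throw UnsupportedOperationException, so a delegate only needs to forward the operations actually used:

```java
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

// Minimal sketch, not the PR's actual wrapper.
class DelegatingS3Client implements S3Client {
    private final S3Client delegate;

    DelegatingS3Client(S3Client delegate) {
        this.delegate = delegate;
    }

    @Override
    public String serviceName() {
        return delegate.serviceName();
    }

    @Override
    public void close() {
        delegate.close();
    }

    // The remaining S3Client operations default to throwing
    // UnsupportedOperationException, so only the ones actually exercised
    // need explicit forwarding, e.g.:
    @Override
    public ResponseInputStream<GetObjectResponse> getObject(GetObjectRequest request) {
        return delegate.getObject(request);
    }
}
```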
```java
import static org.hamcrest.Matchers.blankOrNullString;
import static org.hamcrest.Matchers.not;

public class S3RepositoryAnalysisRestIT extends AbstractRepositoryAnalysisRestTestCase {

    static final boolean USE_FIXTURE = Boolean.parseBoolean(System.getProperty("tests.use.fixture", "true"));

    public static final S3HttpFixture s3Fixture = new S3HttpFixture(USE_FIXTURE);
    private static final Supplier<String> regionSupplier = new DynamicRegionSupplier();
```
What's the reason for the dynamic-ness? I read the comment inside DynamicRegionSupplier about the static context, but at the same time I see ESTestCase.randomIdentifier() used in static fields in other tests.
Can you point me to such a place? If you just say `private static final String REGION = ESTestCase.randomIdentifier();` then the test won't even start:

```
java.lang.IllegalStateException: No context information for thread: Thread[id=29, name=SUITE-S3RepositoryAnalysisRestIT-seed#[6FFD4A25FEF46CB7], state=RUNNABLE, group=TGRP-S3RepositoryAnalysisRestIT]. Is this thread running under a class com.carrotsearch.randomizedtesting.RandomizedRunner runner context? Add @RunWith(class com.carrotsearch.randomizedtesting.RandomizedRunner.class) to your test class. Make sure your code accesses random contexts within @BeforeClass and @AfterClass boundary (for example, static test class initializers are not permitted to access random contexts).
	at com.carrotsearch.randomizedtesting.RandomizedContext.context(RandomizedContext.java:249)
	at com.carrotsearch.randomizedtesting.RandomizedContext.current(RandomizedContext.java:134)
	at com.carrotsearch.randomizedtesting.RandomizedTest.getContext(RandomizedTest.java:72)
	at com.carrotsearch.randomizedtesting.RandomizedTest.getRandom(RandomizedTest.java:92)
	at com.carrotsearch.randomizedtesting.RandomizedTest.randomAsciiLettersOfLengthBetween(RandomizedTest.java:588)
	at com.carrotsearch.randomizedtesting.RandomizedTest.randomAsciiOfLengthBetween(RandomizedTest.java:573)
	at org.elasticsearch.test.ESTestCase.randomAlphaOfLengthBetween(ESTestCase.java:1223)
	at org.elasticsearch.test.ESTestCase.randomIdentifier(ESTestCase.java:1279)
	at org.elasticsearch.repositories.blobstore.testkit.analyze.S3RepositoryAnalysisRestIT.<clinit>(S3RepositoryAnalysisRestIT.java:31)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:543)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:631)
```
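A minimal sketch of the lazy-supplier workaround this thread is describing (assumed to match DynamicRegionSupplier's intent; the actual implementation may differ): defer the call to randomIdentifier() from the static initializer, where no RandomizedContext exists, to the first get() on a test thread.

```java
import java.util.function.Supplier;
import org.elasticsearch.test.ESTestCase;

// Assumed sketch: resolve the random region lazily, on a thread that has a
// RandomizedContext, instead of in the test class's static initializer.
class DynamicRegionSupplier implements Supplier<String> {
    private volatile String region;

    @Override
    public synchronized String get() {
        if (region == null) {
            region = ESTestCase.randomIdentifier();
        }
        return region;
    }
}
```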
```java
package org.elasticsearch.repositories.s3;

public enum HttpScheme {
```
Nit: it also exists in org.elasticsearch.discovery.ec2. It could be specified in org.elasticsearch.http.
Yeah, didn't really want this in `:server` and there's nowhere else that makes sense either. NBD I think.
```java
private static Region getDefaultRegion() {
    return AccessController.doPrivileged((PrivilegedAction<Region>) () -> {
        try {
            return DefaultAwsRegionProviderChain.builder().build().getRegion();
        } catch (Exception e) {
            logger.debug("failed to obtain region from default provider chain", e);
            return null;
        }
    });
```
I'm not following why createComponents overrides region resolution with a default one, when S3Service uses an explicit -> guesser -> fallback chain. I think a single AwsProfileRegionProvider which is a chain of explicit -> default (from profile) -> guesser -> fallback could serve all needs for both prod and test. I'm not even sure the fallback should exist; it's kind of a dangerous default.
> I'm not even sure the fallback should exist; it's kind of a dangerous default.

I agree, but simply refusing to proceed without a region would be a massive breaking change for our users since SDKv1 was lenient in this area. I hope to get to that point in v10 but we will have to live with this lenience for now.
I think you're right, we could compute the region by constructing our own AwsRegionProviderChain and then calling getRegion() on it. I'm not sure I see the advantage vs just computing it directly as we have here, but maybe I'm missing something?
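For reference, a hedged sketch of what that chain-based variant might look like (the `explicitRegion` parameter and the fallback value are assumptions for illustration, not the PR's actual resolution order):

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.regions.providers.AwsRegionProvider;
import software.amazon.awssdk.regions.providers.AwsRegionProviderChain;
import software.amazon.awssdk.regions.providers.DefaultAwsRegionProviderChain;

// Assumed sketch: explicit setting first, then the SDK's default resolution
// (env vars, system properties, profile, IMDS), then a lenient fallback.
static Region resolveRegion(Region explicitRegion) {
    AwsRegionProvider chain = new AwsRegionProviderChain(
        () -> explicitRegion,                             // may be null: the chain moves on
        DefaultAwsRegionProviderChain.builder().build(),
        () -> Region.US_EAST_1                            // fallback, mirroring SDKv1 leniency
    );
    return chain.getRegion();
}
```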
I got through 43/61 files. I'm down to the tests now. I'll submit this feedback and continue looking at the tests tomorrow.
```diff
-assertThat(client, instanceOf(ProxyS3RepositoryPlugin.ClientAndCredentials.class));
+try (var clientReference = repository.createBlobStore().clientReference()) {
+    final S3Client client = clientReference.client();
+    assertThat(client, instanceOf(ProxyS3RepositoryPlugin.ClientAndCredentials.class));
```
Nit: could use `asInstanceOf` to reduce verbosity (cast and assert in one)? Also repeated below a few times.
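For illustration, the suggestion would look something like this (a sketch assuming ESTestCase's `asInstanceOf(Class, Object)` helper, which asserts the type and returns the cast value):

```java
try (var clientReference = repository.createBlobStore().clientReference()) {
    // asserts instanceOf and casts in one step
    final var client = asInstanceOf(ProxyS3RepositoryPlugin.ClientAndCredentials.class, clientReference.client());
    // ... assertions against client ...
}
```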
```java
.maxUploads(maxUploads)
.overrideConfiguration(
    b -> b.putRawQueryParameter(S3BlobStore.CUSTOM_QUERY_PARAMETER_PURPOSE, OperationPurpose.SNAPSHOT_DATA.getKey())
)
```
We sometimes do configureRequestForMetrics and sometimes just add the above parameter. What's the reason for the two different approaches?
I suspect it's because this cleanup is a background process and it messed up the metrics collection tests. Added TODOs to come back to this.
```diff
 s3RepositoriesMetrics.common().operationCounter().incrementBy(1, attributes);
 operations.increment();
-if (numberOfAwsErrors == requestCount) {
+if (overallSuccess == false) {
     s3RepositoriesMetrics.common().unsuccessfulOperationCounter().incrementBy(1, attributes);
 }

 s3RepositoriesMetrics.common().requestCounter().incrementBy(requestCount, attributes);
```
We use `requestCount` here but `responseCount` above; is that intentional?
I think this reflects the existing behaviour: `requests` tracks the number of HTTP requests which got a response (since we assume those without a response didn't make it to S3 and therefore aren't billable) whereas `s3RepositoriesMetrics.common().requestCounter()` tracks the number of HTTP requests emitted regardless of whether they got a response. The naming in this area is not good.
The operation counter should be the count of billable things that happened, I think? I.e. `operations` is the logical blob store operations and `requests` includes things like retries due to throttles or errors that we execute along the way. I think it's right not to count requests for which we received no response, but I think the request counts for the metrics and stats endpoints should be the same? In practice we're probably talking about an exceedingly small number, but my understanding was that the metrics and the in-memory counters were just two ways of exposing the same counter (with the caveat that the in-memory ones get reset when the node restarts, of course).
The one relevant to billing is `requests`, whereas the metrics are for our own internal usage, so I'm not sure they should be exactly aligned. In any case, they're already not aligned in `main` today and I'd rather leave that as-is in this PR, potentially to be revisited in future.

(I agree, the difference should be very small in practice; maybe we should have metrics for both requests and responses to validate that.)
The value returned in the metering endpoint is `operations` (see here).

We had to clarify this after we did Azure, because Azure was reporting "operations" and AWS was reporting "requests": the former being the logical/billing operations, the latter being the constituent requests (including retries etc.). The distinction becomes even more tricky in GCP where e.g. a resumable upload is billed as a single operation regardless of how many chunks you upload.

We have just gone for a "best effort" at operations being billable/logical operations and requests being a raw count of HTTP requests, as that seemed consistent with the original AWS impl. The rules for how requests are billed are way too complex to reliably count "billable" operations.

I chased this up with the perf and billing people: the billing people were interested in "operations" and the perf people were interested in "requests", so the metering endpoint (`/_nodes/{nodeId}/_repositories_metering`) returns "operations" and the stats endpoint (`/_internal/blob_store/stats`, serverless only) returns "requests". (See ES-9767.)
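To make the distinction concrete, a hypothetical worked example (numbers invented for illustration): a single GetObject that needed three HTTP attempts, two throttled and one successful, counts as one operation but three requests.

```java
// Hypothetical accounting for that single logical read, using the counters
// from the quoted diff above:
s3RepositoriesMetrics.common().operationCounter().incrementBy(1, attributes); // 1 logical operation ("operations")
s3RepositoriesMetrics.common().requestCounter().incrementBy(3, attributes);   // 3 HTTP requests incl. retries ("requests")
```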
💔 Backport failed

You can use sqren/backport to manually backport by running …
Closes elastic#120993. Backport of elastic#126843 to `8.x`.
Catching `Exception` instead of `SdkException` in `copyBlob` and `executeMultipart` led to failures in `S3RepositoryAnalysisRestIT` due to the injected exceptions getting wrapped in `IOExceptions` that prevented them from being caught and handled in `BlobAnalyzeAction`. Repeat of elastic#126731, regressed due to elastic#126843 Closes elastic#127399
These changes are to bring the docs into alignment with the changes made as part of the S3 upgrade elastic/elasticsearch#126843
The `s3.client.CLIENT_NAME.protocol` setting became unused in elastic#126843 as it is inapplicable in the v2 SDK. However, the v2 SDK requires the `s3.client.CLIENT_NAME.endpoint` setting to be a URL that includes a scheme, so in elastic#127489 we prepend a `https://` to the endpoint if needed. This commit generalizes this slightly so that we prepend `http://` if the endpoint has no scheme and the `.protocol` setting is set to `http`.
Closes #120993
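A minimal sketch of the behaviour that commit describes (hypothetical helper name, not the actual implementation):

```java
// Assumed sketch of the endpoint normalisation described above: the v2 SDK
// requires the endpoint to include a scheme, so prepend one when it's missing,
// honouring the legacy s3.client.CLIENT_NAME.protocol setting.
static String withScheme(String endpoint, String protocolSetting) {
    if (endpoint.contains("://")) {
        return endpoint; // already has a scheme: leave it alone
    }
    String scheme = "http".equals(protocolSetting) ? "http" : "https";
    return scheme + "://" + endpoint;
}
```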