[CI] S3RepositoryAnalysisRestIT testRepositoryAnalysis failing #126576

Closed
elasticsearchmachine opened this issue Apr 9, 2025 · 8 comments · Fixed by #126593 or #126731
Assignees: bcully
Labels:
  • :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
  • low-risk (An open issue or test failure that is a low risk to future releases)
  • rca:random-controlled (test failed due to randomization, and is reproducible given the seed)
  • Team:Distributed Coordination (Meta label for Distributed Coordination team)
  • >test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Apr 9, 2025

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:snapshot-repo-test-kit:qa:s3:javaRestTest" --tests "org.elasticsearch.repositories.blobstore.testkit.analyze.S3RepositoryAnalysisRestIT.testRepositoryAnalysis" -Dtests.seed=B71B1505C30AE167 -Dtests.locale=vi-VN -Dtests.timezone=US/Michigan -Druntime.java=24

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

org.elasticsearch.client.ResponseException: method [POST], host [http://[::1]:34247], URI [/_snapshot/repository/_analyze?blob_count=10&seed=-2481787089920117582&max_blob_size=10mb&timeout=120s&concurrency=4], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[{"type":"uncategorized_execution_exception","reason":"Failed execution"}],"type":"repository_verification_exception","reason":"[repository] Elasticsearch observed the storage system underneath this repository behaved incorrectly which indicates it is not suitable for use with Elasticsearch snapshots. Typically this happens when using storage other than AWS S3 which incorrectly claims to be S3-compatible. If so, please report this incompatibility to your storage supplier. Do not report Elasticsearch issues involving storage systems which claim to be S3-compatible unless you can demonstrate that the same issue exists when using a genuine AWS S3 repository. See [https://www.elastic.co/docs/api/doc/elasticsearch
[truncated]

Issue Reasons:

  • [main] 15 failures in test testRepositoryAnalysis (3.0% fail rate in 505 executions)
  • [main] 11 failures in step part-2 (4.6% fail rate in 237 executions)
  • [main] 11 failures in pipeline elasticsearch-pull-request (4.7% fail rate in 236 executions)
  • [main] 2 failures in pipeline elasticsearch-periodic-platform-support (22.2% fail rate in 9 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine added the `:Distributed Coordination/Snapshot/Restore` and `>test-failure` labels on Apr 9, 2025
elasticsearchmachine added a commit that referenced this issue Apr 9, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 consecutive failures in step part-2
  • [main] 2 consecutive failures in pipeline elasticsearch-pull-request
  • [main] 9 failures in test testRepositoryAnalysis (2.3% fail rate in 389 executions)
  • [main] 7 failures in step part-2 (4.1% fail rate in 172 executions)
  • [main] 7 failures in pipeline elasticsearch-pull-request (4.0% fail rate in 175 executions)

Build Scans:

elasticsearchmachine added the `needs:risk` and `Team:Distributed Coordination` labels on Apr 9, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

bcully self-assigned this on Apr 9, 2025
@bcully
Contributor

bcully commented Apr 9, 2025

This is bound to be #125737. I'll look at it tomorrow; I'm rolling onto unplanned anyway, by happy chance.

ywangd added the `medium-risk` label and removed the `needs:risk` label on Apr 10, 2025
@ywangd
Member

ywangd commented Apr 10, 2025

Thanks for the pointer. IIUC, the copyBlob call like this one needs to fork to a different thread, similar to how the early read forks.

In its current form, it fails against both the fixture (where the S3HttpHandler is single-threaded) and the real S3 service, because the source blob has not finished writing (and abortWrite=false).

It's likely low risk, since I don't think we have this usage pattern anywhere other than repo analysis, but I am leaving it as medium-risk just in case.
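A minimal sketch of the kind of fork being suggested (illustrative only; the class, method names, and pool are stand-ins, not the actual BlobAnalyzeAction or BlobContainer code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of forking the copy off the calling thread.
class CopyForkSketch {
    // Stand-in for the repository's SNAPSHOT thread pool.
    private final ExecutorService snapshotPool = Executors.newFixedThreadPool(4);

    // Stand-in for blobContainer.copyBlob(...): the copy can block until the source
    // blob's upload completes (or is aborted), so it must not run on a thread that
    // the upload itself needs in order to make progress.
    private void copyBlob() {}

    void runCopyPhase(Runnable onResponse) {
        // Fork to a different thread instead of calling copyBlob() inline, mirroring
        // how the early read of the blob is forked while the (deliberately
        // unfinished) write is still in flight.
        snapshotPool.execute(() -> {
            copyBlob();
            onResponse.run();
        });
    }
}
```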

DaveCTurner added the `low-risk` label and removed the `medium-risk` label on Apr 10, 2025
@DaveCTurner
Contributor

I think it's a test bug tbh; we can make S3HttpHandler multi-threaded to avoid the deadlock. I like that we wait for the CopyObject to complete (i.e. fail) before sending the last bytes of the upload.
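For reference, a minimal sketch of the multi-threaded fixture idea (illustrative only; the context path and handler construction are assumptions, and only the JDK `HttpServer`/`setExecutor` calls are standard API):

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.util.concurrent.Executors;

public class MultiThreadedS3Fixture {
    public static void main(String[] args) throws Exception {
        // With the default (null) executor the JDK HttpServer handles requests on a
        // single dispatch thread, so a CopyObject handler that blocks waiting for the
        // source upload deadlocks against the PutObject request that would finish it.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        // Giving the server a thread pool lets both requests make progress.
        server.setExecutor(Executors.newCachedThreadPool());
        // server.createContext("/bucket", new S3HttpHandler("bucket")); // fixture handler (illustrative)
        server.start();
    }
}
```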

@DaveCTurner
Contributor

... though, as I mentioned here, I'm also OK with dropping this extra checking for now.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 10, 2025
- Translate a 404 during a multipart copy into a `FileNotFoundException`

- Use multiple threads in `S3HttpHandler` to avoid
  `CopyObject`/`PutObject` deadlock

Closes elastic#126576
bcully pushed a commit to bcully/elasticsearch that referenced this issue Apr 10, 2025
- Translate a 404 during a multipart copy into a `FileNotFoundException`

- Use multiple threads in `S3HttpHandler` to avoid `CopyObject`/`PutObject` deadlock

Closes elastic#126576

(cherry picked from commit b10b35f)
elasticsearchmachine added a commit that referenced this issue Apr 11, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 15 failures in test testRepositoryAnalysis (3.0% fail rate in 505 executions)
  • [main] 11 failures in step part-2 (4.6% fail rate in 237 executions)
  • [main] 11 failures in pipeline elasticsearch-pull-request (4.7% fail rate in 236 executions)
  • [main] 2 failures in pipeline elasticsearch-periodic-platform-support (22.2% fail rate in 9 executions)

Build Scans:

@bcully
Contributor

bcully commented Apr 11, 2025

It looks like 4584b3d causes the BlobWriteAbortedException thrown when reading the blob (when abortWrite is true) to be wrapped in an IOException, which means it does not get caught here.
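In other words, a simplified illustration of the failure mode (not the actual S3BlobContainer/BlobAnalyzeAction code; the method names are stand-ins):

```java
// Simplified sketch of the wrapping problem described above.
class BlobWriteAbortedException extends RuntimeException {}

class WrappingSketch {
    void copyBlob() throws java.io.IOException {
        try {
            readSource(); // the test fixture throws BlobWriteAbortedException when abortWrite=true
        } catch (Exception e) {
            // Catching Exception (rather than only the SDK client exception) wraps the
            // injected marker exception, changing its type as seen by the caller.
            throw new java.io.IOException("Failed execution", e);
        }
    }

    void analyzeCopy() {
        try {
            copyBlob();
        } catch (BlobWriteAbortedException expected) {
            // The expected path for an aborted write; never taken, because the
            // exception now arrives wrapped inside the IOException below.
        } catch (java.io.IOException e) {
            throw new RuntimeException("repository_verification_exception", e);
        }
    }

    void readSource() {
        throw new BlobWriteAbortedException();
    }
}
```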

bcully added a commit that referenced this issue Apr 14, 2025
Catching Exception instead of AmazonClientException in copyBlob and
executeMultipart led to failures in S3RepositoryAnalysisRestIT due to
the injected exceptions getting wrapped in IOExceptions that prevented
them from being caught and handled in BlobAnalyzeAction.

Closes #126576
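A minimal sketch of the fix direction described above (illustrative only; the method shape is an assumption, and `AmazonClientException` is simply the SDK exception type named in the commit message):

```java
import com.amazonaws.AmazonClientException;
import java.io.IOException;

class NarrowedCatchSketch {
    void copyBlob(Runnable doCopy) throws IOException {
        try {
            doCopy.run();
        } catch (AmazonClientException e) { // narrowed back from catch (Exception e)
            throw new IOException("Unable to copy object", e);
        }
        // Anything that is not an SDK client error, such as the injected
        // BlobWriteAbortedException, now propagates unwrapped and can be caught
        // and handled in BlobAnalyzeAction.
    }
}
```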
bcully added the `rca:random-controlled` label on Apr 14, 2025