
[CI] RepositoryAnalysisFailureIT testFailsOnWriteException failing #126747


Closed
elasticsearchmachine opened this issue Apr 12, 2025 · 2 comments · Fixed by #126750
Labels
  • :Distributed Coordination/Snapshot/Restore: Anything directly related to the `_snapshot/*` APIs
  • needs:risk: Requires assignment of a risk label (low, medium, blocker)
  • rca:random-controlled: test failed due to randomization, and is reproducible given the seed
  • Team:Distributed Coordination: Meta label for Distributed Coordination team
  • >test-failure: Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Apr 12, 2025

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:snapshot-repo-test-kit:internalClusterTest" --tests "org.elasticsearch.repositories.blobstore.testkit.analyze.RepositoryAnalysisFailureIT.testFailsOnWriteException" -Dtests.seed=5E27405E3C54F01 -Dtests.locale=es-GQ -Dtests.timezone=America/Chicago -Druntime.java=24

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: safeGet: listener was not completed within the timeout

Issue Reasons:

  • [main] 4 failures in test testFailsOnWriteException (0.5% fail rate in 795 executions)
  • [main] 2 failures in step part-2 (0.8% fail rate in 249 executions)
  • [main] 2 failures in pipeline elasticsearch-pull-request (0.8% fail rate in 248 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Apr 12, 2025
elasticsearchmachine added a commit that referenced this issue Apr 12, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 3 failures in test testFailsOnWriteException (0.4% fail rate in 682 executions)
  • [main] 2 failures in step part-2 (0.8% fail rate in 243 executions)
  • [main] 2 failures in pipeline elasticsearch-pull-request (0.8% fail rate in 242 executions)

Build Scans:

elasticsearchmachine added the needs:risk and Team:Distributed Coordination labels Apr 12, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

bcully added a commit to bcully/elasticsearch that referenced this issue Apr 13, 2025
With the addition of copy coverage in the repository analyzer,
blob count is no longer 1:1 with blob analyzer request count: requests
that create a copy count as two blobs. This can cause
testFailsOnWriteException to fail intermittently, because the test randomly
injects a failure at some request between the first and the blobCount-th,
and that request may never be issued if enough of the earlier requests create copies.

The simple fix is to inject the failure within the first blobCount/2 requests,
which are always issued even if every request creates a copy. An alternative
would be to add a knob to the request that disallows copies and use it
in this test.

Closes elastic#126747
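The counting argument in this commit message can be illustrated with a small Java model. This is a hypothetical sketch, not code from the Elasticsearch test suite: `FailureInjectionBound`, `requestsIssued` and `minRequestsIssued` are invented names, and the model assumes the analyzer keeps issuing requests until `blobCount` blobs have been written, with each request writing one blob, or two if it also creates a copy. Under those assumptions at least ceil(blobCount / 2) requests are always issued, so a failure index drawn from [1, blobCount / 2] is always reached, while one drawn from [1, blobCount] is not.

```java
import java.util.Random;

/**
 * Hypothetical model of the request/blob accounting described in the commit
 * message above. Assumptions (not taken from the Elasticsearch sources): the
 * analyzer keeps issuing requests while blobs remain, and each request writes
 * one blob, or two if it also creates a copy.
 */
class FailureInjectionBound {

    /** Simulates one analysis run and returns the number of requests issued. */
    static int requestsIssued(int blobCount, Random random) {
        int remaining = blobCount;
        int requests = 0;
        while (remaining > 0) {
            // Assumption: a copy is only attempted when at least two blobs remain.
            boolean createsCopy = remaining >= 2 && random.nextBoolean();
            remaining -= createsCopy ? 2 : 1;
            requests++;
        }
        return requests;
    }

    /** Worst case, where every request creates a copy: ceil(blobCount / 2). */
    static int minRequestsIssued(int blobCount) {
        return (blobCount + 1) / 2;
    }

    public static void main(String[] args) {
        int blobCount = 10;
        int worstCase = minRequestsIssued(blobCount); // 5 requests if every one copies

        // Old bound: a failure index drawn from [1, blobCount] can be as large as 10,
        // so it may never be reached when only 5 requests are issued.
        System.out.println("old bound reachable in worst case: " + (blobCount <= worstCase));

        // New bound: a failure index drawn from [1, blobCount / 2] is at most 5,
        // which every run reaches (requires blobCount >= 2).
        Random random = new Random(0);
        int failureIndex = 1 + random.nextInt(blobCount / 2);
        int issued = requestsIssued(blobCount, random);
        System.out.println("new bound reached: " + (failureIndex <= issued));
    }
}
```

Running `main` prints `old bound reachable in worst case: false` followed by `new bound reached: true`. The alternative mentioned in the commit message (a knob that disallows copies) would instead restore the 1:1 relationship, so the original [1, blobCount] range would be safe again.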
bcully added a commit that referenced this issue Apr 14, 2025
bcully added the rca:random-controlled label Apr 14, 2025
bcully added a commit to bcully/elasticsearch that referenced this issue Apr 17, 2025
The fix to elastic#126747 was only for one test. This applies
that change to all the tests in this suite that need it.
bcully added a commit to bcully/elasticsearch that referenced this issue Apr 17, 2025
Fixes elastic#127029

bcully added a commit that referenced this issue Apr 18, 2025
Fixes #127029