
[CI] SearchableSnapshotActionIT testResumingSearchableSnapshotFromPartialToFull failing #125789

Closed
elasticsearchmachine opened this issue Mar 27, 2025 · 3 comments · Fixed by #126605
Assignees
Labels
:Data Management/ILM+SLM (Index and Snapshot lifecycle management)
low-risk (An open issue or test failure that is a low risk to future releases)
Team:Data Management (Meta label for data/management team)
>test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Mar 27, 2025

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:ilm:qa:multi-node:javaRestTest" --tests "org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT.testResumingSearchableSnapshotFromPartialToFull" -Dtests.seed=127DEF3256F9FBDC -Dtests.locale=fy -Dtests.timezone=Europe/Vienna -Druntime.java=24

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: 
Expected: is <false>
     but: was <true>

Issue Reasons:

  • [main] 3 failures in test testResumingSearchableSnapshotFromPartialToFull (0.3% fail rate in 862 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine added the labels :Data Management/ILM+SLM (Index and Snapshot lifecycle management), >test-failure (Triaged test failures from CI), Team:Data Management (Meta label for data/management team), and needs:risk (Requires assignment of a risk label: low, medium, blocker) on Mar 27, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-data-management (Team:Data Management)

@nielsbauman
Contributor

Looking at the logs, ILM is stuck in wait-for-index-color, which is similar to #125683 (and to other test failures we've seen in this test class).

@nielsbauman self-assigned this on Mar 27, 2025
@nielsbauman added the low-risk label (An open issue or test failure that is a low risk to future releases) and removed the needs:risk label (Requires assignment of a risk label: low, medium, blocker) on Mar 27, 2025
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Mar 27, 2025
These tests are very prone to the "quiet test cluster syndrome" and
sometimes need a little push by means of some artificial cluster state
updates. To hopefully be done with these test failures once and for all
(famous last words), I've updated all tests in this class to trigger
cluster state updates where necessary.

This is basically a follow-up of elastic#108162.

Fixes elastic#125683
Fixes elastic#125789
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 3 failures in test testResumingSearchableSnapshotFromPartialToFull (0.3% fail rate in 862 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Mar 31, 2025
afoucret pushed a commit to afoucret/elasticsearch that referenced this issue Apr 1, 2025
nielsbauman added a commit that referenced this issue Apr 1, 2025
ILM sometimes skips a policy/index for a cluster state update if the
step was still running/enqueued while the update came in. That on its
own isn't a problem, but in very quiet clusters, this would mean that
it could take arbitrarily long for the policy step to be run -
i.e. when the next cluster state comes in. We saw this happening in
a few tests, but it could potentially happen in production too.

Fixes #125683
Fixes #125789
Fixes #125867
Fixes #125911
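
The stall described in the commit message above can be illustrated with a small, self-contained sketch. Nothing here is taken from the Elasticsearch codebase: `SkippedUpdateSketch`, `onClusterStateUpdate`, and `onStepCompleted` are hypothetical names, and remembering the skipped update so it can be replayed is only one possible way to avoid the stall, not necessarily what the commit does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Elasticsearch code: a runner that ignores state
// updates while a step is in flight, but remembers the latest one it skipped.
final class SkippedUpdateSketch {
    private boolean stepInFlight = false;
    private Integer skippedVersion = null; // latest update that arrived while busy
    private final List<Integer> evaluated = new ArrayList<>();

    // Called for every (simulated) cluster state update.
    void onClusterStateUpdate(int version) {
        if (stepInFlight) {
            // Without this bookkeeping the update would simply be lost, and in a
            // quiet cluster nothing would trigger the step again.
            skippedVersion = version;
            return;
        }
        stepInFlight = true;
        evaluated.add(version);
    }

    // Called when the in-flight step finishes.
    void onStepCompleted() {
        stepInFlight = false;
        if (skippedVersion != null) {
            int version = skippedVersion;
            skippedVersion = null;
            // Replay the skipped update instead of waiting for the next one.
            onClusterStateUpdate(version);
        }
    }

    List<Integer> evaluatedVersions() {
        return List.copyOf(evaluated);
    }

    // Update 2 arrives while the step for update 1 is still running.
    static List<Integer> demo() {
        SkippedUpdateSketch runner = new SkippedUpdateSketch();
        runner.onClusterStateUpdate(1);
        runner.onClusterStateUpdate(2);
        runner.onStepCompleted();
        return runner.evaluatedVersions();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1, 2]
    }
}
```

If the `skippedVersion` bookkeeping is removed, update 2 is never evaluated unless a third update happens to arrive, which is the "arbitrarily long" wait the message refers to.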
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 1, 2025
ILM sometimes skips a policy/index for a cluster state update if the
step was still running/enqueued while the update came in. That on its
own isn't a problem, but in very quiet clusters, this would mean that
it could take arbitrarily long for the policy step to be run -
i.e. when the next cluster state comes in. We saw this happening in
a few tests, but it could potentially happen in production too.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 1, 2025
ILM sometimes skips a policy/index for a cluster state update if the
step is still running/enqueued when the update comes in. That on its own
isn't a problem, but in very quiet clusters, this would mean that
it could take arbitrarily long for the policy step to be run -
i.e. when the next cluster state comes in. We saw this happening in
a few tests, but it could potentially happen in production too.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
nielsbauman added a commit that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes #125683
Fixes #125789
Fixes #125867
Fixes #125911
Fixes #126053
Fixes #126354
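
The equality problem described in the commit message above can be reproduced in isolation. The following is a minimal sketch, not the actual `WaitForIndexColorStep` code: the two step classes, the `key` field, and the `"restored-"` prefix are hypothetical and exist only to show that a lambda-typed field falls back to identity comparison.

```java
import java.util.Objects;
import java.util.function.BiFunction;

final class StepEqualitySketch {

    // Hypothetical step that includes the functional field in equals().
    static final class StepComparingSupplier {
        private final String key;
        private final BiFunction<String, String, String> indexNameSupplier;

        StepComparingSupplier(String key, BiFunction<String, String, String> indexNameSupplier) {
            this.key = key;
            this.indexNameSupplier = indexNameSupplier;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof StepComparingSupplier)) return false;
            StepComparingSupplier other = (StepComparingSupplier) o;
            // Lambdas inherit Object.equals, so this is a reference comparison.
            return key.equals(other.key) && indexNameSupplier.equals(other.indexNameSupplier);
        }

        @Override
        public int hashCode() {
            return Objects.hash(key, indexNameSupplier);
        }
    }

    // Hypothetical step that leaves the functional field out of equals().
    static final class StepIgnoringSupplier {
        private final String key;
        private final BiFunction<String, String, String> indexNameSupplier;

        StepIgnoringSupplier(String key, BiFunction<String, String, String> indexNameSupplier) {
            this.key = key;
            this.indexNameSupplier = indexNameSupplier;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof StepIgnoringSupplier)) return false;
            return key.equals(((StepIgnoringSupplier) o).key);
        }

        @Override
        public int hashCode() {
            return key.hashCode();
        }
    }

    static boolean comparingSupplierEquals() {
        // Two separate lambda expressions with identical bodies.
        BiFunction<String, String, String> first = (index, state) -> "restored-" + index;
        BiFunction<String, String, String> second = (index, state) -> "restored-" + index;
        return new StepComparingSupplier("wait-for-index-color", first)
            .equals(new StepComparingSupplier("wait-for-index-color", second));
    }

    static boolean ignoringSupplierEquals() {
        BiFunction<String, String, String> first = (index, state) -> "restored-" + index;
        BiFunction<String, String, String> second = (index, state) -> "restored-" + index;
        return new StepIgnoringSupplier("wait-for-index-color", first)
            .equals(new StepIgnoringSupplier("wait-for-index-color", second));
    }

    public static void main(String[] args) {
        System.out.println("comparing supplier: " + comparingSupplierEquals());
        System.out.println("ignoring supplier:  " + ignoringSupplierEquals());
    }
}
```

On common JVMs the first comparison yields `false` and the second `true`: the steps are logically identical, but the lambda field makes the first variant behave like an identity check.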
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
Fixes elastic#126354
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
Fixes elastic#126354

(cherry picked from commit 3231eb2)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes #125683
Fixes #125789
Fixes #125867
Fixes #125911
Fixes #126053
Fixes #126354

(cherry picked from commit 3231eb2)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes #125683
Fixes #125789
Fixes #125867
Fixes #125911
Fixes #126053
Fixes #126354

(cherry picked from commit 3231eb2)

# Conflicts:
#	muted-tests.yml
#	x-pack/plugin/ilm/qa/multi-node/src/javaRestTest/java/org/elasticsearch/xpack/ilm/actions/SearchableSnapshotActionIT.java