[CI] SearchableSnapshotActionIT testSearchableSnapshotsInHotPhasePinnedToHotNodes failing #125683


Closed
elasticsearchmachine opened this issue Mar 26, 2025 · 14 comments · Fixed by #126605
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management low-risk An open issue or test failure that is a low risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:ilm:qa:multi-node:javaRestTest" --tests "org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT.testSearchableSnapshotsInHotPhasePinnedToHotNodes" -Dtests.seed=23428F9DB66FBCCC -Dtests.locale=nl-CW -Dtests.timezone=America/Indianapolis -Druntime.java=24

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: 
Expected: is <{"phase":"hot","action":"complete","name":"complete"}>
     but: was <{"phase":"null","action":"null","name":"null"}>

Issue Reasons:

  • [main] 4 failures in test testSearchableSnapshotsInHotPhasePinnedToHotNodes (2.0% fail rate in 204 executions)
  • [main] 2 failures in pipeline elasticsearch-periodic-platform-support (50.0% fail rate in 4 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Data Management/ILM+SLM Index and Snapshot lifecycle management >test-failure Triaged test failures from CI labels Mar 26, 2025
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 4 failures in test testSearchableSnapshotsInHotPhasePinnedToHotNodes (2.0% fail rate in 204 executions)
  • [main] 2 failures in pipeline elasticsearch-periodic-platform-support (50.0% fail rate in 4 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Mar 26, 2025
…estSearchableSnapshotsInHotPhasePinnedToHotNodes #125683
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Mar 26, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-data-management (Team:Data Management)

@nielsbauman nielsbauman self-assigned this Mar 26, 2025
@nielsbauman
Contributor

nielsbauman commented Mar 26, 2025

My first guess would be that this is related to #122885. I'll have a better look later.

EDIT: after further investigation, it looks like it's unrelated to that PR after all - see my comment below.

@nielsbauman nielsbauman added low-risk An open issue or test failure that is a low risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Mar 26, 2025
@nielsbauman
Contributor

nielsbauman commented Mar 27, 2025

I've been looking at this for some time, and I think I have some idea of what happened, but I'm not 100% sure. I mainly looked at the only build failure that renders for me (Gradle Enterprise is having some issues ATM), which is the PR build linked in the OP. I'm posting my intermediate findings here for now:

The test failed because ILM truly did get stuck - i.e. it wasn't an API returning outdated/wrong information. Right until the end of the test, we see that the first backing index is in the wait-for-index-color step of the searchable_snapshot action, whereas the test expects ILM to simply complete the action without issues. After mounting the searchable snapshot, ILM waits for the restored index (prefixed with restored-) to be GREEN.

Interestingly, though, the index did turn green way before the end of the test (note the timestamp differences):

[2025-03-26T11:00:29,732][INFO ][o.e.c.r.a.AllocationService] [javaRestTest-3] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[restored-.ds-logs-aolgqmqtwy-2025.03.26-000001][0]]])." previous.health="YELLOW" reason="shards started [[restored-.ds-logs-aolgqmqtwy-2025.03.26-000001][0]]"
...
[2025-03-26T11:01:02,273][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [.ds-logs-aolgqmqtwy-2025.03.26-000001] retrieved current step key: {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"}

Right before the restored index turned GREEN, we run the wait-for-index-color step for the first time:

[2025-03-26T11:00:29,730][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [.ds-logs-aolgqmqtwy-2025.03.26-000001] maybe running step (WaitForIndexColorStep) after state change: {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"}
[2025-03-26T11:00:29,730][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [.ds-logs-aolgqmqtwy-2025.03.26-000001] running policy with current-step [{"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"}]

Looking at where that running policy with current-step ... line is logged in the code (here), I see we enqueue an ExecuteStepsUpdateTask. What's confusing me is that I would have expected to see a waiting for cluster state step condition log from this line, but we don't - not until this step is re-executed at the end of the test, when we start cleaning up and thus trigger cluster state updates again. In fact, I don't see any log from this task at all. Looking at ExecuteStepsUpdateTask#doExecute, the only code path I see that would result in the task completing silently (as in, without any logs) is when currentStep.equals(registeredCurrentStep) is false here. I do see one retrieved current step key log all on its own:

[2025-03-26T11:00:29,765][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [.ds-logs-aolgqmqtwy-2025.03.26-000001] retrieved current step key: {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"}

I think that's the one from ExecuteStepsUpdateTask#doExecute here. I'm not very familiar with the Step object or the PolicyStepsRegistry, so I'm not sure why the condition in that if-statement evaluated to false.
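
To make that concrete, here's roughly the shape of the code path I mean - a simplified paraphrase with approximated names, not the actual source:

ClusterState doExecute(ClusterState currentState) {
    // Resolve the step the registry currently associates with this index.
    Step registeredCurrentStep = policyStepsRegistry.getStep(indexMetadata, currentStep.getKey());
    if (currentStep.equals(registeredCurrentStep) == false) {
        // Mismatch: return the state untouched - no "waiting for cluster state step condition"
        // (or any other) log is emitted on this path.
        return currentState;
    }
    // ...otherwise evaluate the step and log its progress (approximated method name).
    return executeStep(currentState, registeredCurrentStep);
}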

While this task is queued / doing God knows what, the index is in IndexLifecycleRunner#busyIndices, causing ILM to (silently!) skip this index when processing incoming cluster state updates here. In this case, one of those skipped cluster state updates is the one that sets the index health to GREEN. Since the test cluster has very little activity going on, there are no subsequent cluster state updates to trigger a retry of the step. That last part is something we can fix by artificially introducing cluster state changes in the test - as we do in various other tests in this test suite - and I think that's what we'll end up having to do here.

@nielsbauman
Contributor

nielsbauman commented Mar 27, 2025

#125789 also seems to be caused by ILM being stuck due to an absence of cluster state updates. I also see we have a dedicated method in this class for triggering cluster state updates:

/**
 * Cause a bit of cluster activity using an empty reroute call in case the `wait-for-index-colour` ILM step missed the
 * notification that partial-index is now GREEN.
 */
private void triggerStateChange() throws IOException {
    Request rerouteRequest = new Request("POST", "/_cluster/reroute");
    client().performRequest(rerouteRequest);
}

That gives me some more confidence that both of these test failures are indeed caused by a quiet cluster.
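
Roughly the direction I have in mind for the test-side fix - a hedged sketch, with the explain helper and index name as placeholders rather than the suite's actual code:

assertBusy(() -> {
    // Empty reroute -> fresh cluster state -> ILM re-evaluates the wait-for-index-color step,
    // even if the original GREEN notification was missed while the index was "busy".
    triggerStateChange();
    Map<String, Object> explain = explainIndex(client(), restoredIndexName); // placeholder helper and index name
    assertThat(explain.get("step"), equalTo("complete"));
}, 30, TimeUnit.SECONDS);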

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Mar 27, 2025
These tests are very prone to the "quiet test cluster syndrome" and
sometimes need a little push by means of some artificial cluster state
updates. To hopefully be done with these test failures once and for all
(famous last words), I've updated all tests in this class to trigger
cluster state updates where necessary.

This is basically a follow-up of elastic#108162.

Fixes elastic#125683
Fixes elastic#125789
@gmarouli
Contributor

Looking at where that running policy with current-step ... line is logged in the code (here), I see we enqueue an ExecuteStepsUpdateTask. What's confusing me is that I would have expected to see a waiting for cluster state step condition log from this line, but we don't - not until this step is re-executed at the end of the test when we start cleaning up and thus triggering cluster state updates again.

Is it possible that this is only executed when there is a cluster state change? That could explain why we don't see any logging from it until the end of the test. Right?

@nielsbauman
Contributor

@gmarouli we only log running policy with current-step ... after a cluster state change - it's logged inside runPolicyAfterStateChange. So we know for sure this was triggered after a cluster state change. Or are you referring to the MasterService only executing tasks in a queue after a cluster state change? Because I don't think that's true - the master service should just be processing queues regardless of cluster state updates.

@gmarouli
Contributor

Or are you referring to the MasterService only executing tasks in a queue after a cluster state change? Because I don't think that's true - the master service should just be processing queues regardless of cluster state updates.

I was referring to this indeed.

@gmarouli
Contributor

gmarouli commented Mar 28, 2025

I think I found the issue in the equality check: it compares the indexNameSupplier, which is a BiFunction, and that can be problematic:

@Override
public boolean equals(Object obj) {
    if (obj == null) {
        return false;
    }
    if (obj.getClass() != getClass()) {
        return false;
    }
    WaitForIndexColorStep other = (WaitForIndexColorStep) obj;
    return super.equals(obj)
        && Objects.equals(this.color, other.color)
        && Objects.equals(this.indexNameSupplier, other.indexNameSupplier);
}

@nielsbauman
Contributor

Yeah I noticed that too and it seemed a little off to me as well, but I disregarded it. Looking at it again, we construct the WaitForIndexColorStep here:

WaitForIndexColorStep waitForGreenIndexHealthStep = new WaitForIndexColorStep(
    waitForGreenRestoredIndexKey,
    copyMetadataKey,
    ClusterHealthStatus.GREEN,
    getRestoredIndexPrefix(waitForGreenRestoredIndexKey)
);

At first glance, that seems fine because we always construct the step in the same way. But, indeed, because we use a BiFunction, that will only be equal if it's the exact same instance (because the lambda doesn't implement an equals method). So, somehow, the step got recreated, causing the equality to be false. That's what you're thinking too, right?
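
For anyone following along, a tiny standalone example - plain Java, not Elasticsearch code - of why two identically constructed lambdas never compare equal:

import java.util.function.BiFunction;

public class LambdaEqualityDemo {
    // Builds a new lambda each time, just like constructing the step builds a new indexNameSupplier.
    static BiFunction<String, Object, String> restoredIndexName(String prefix) {
        return (indexName, ignored) -> prefix + indexName;
    }

    public static void main(String[] args) {
        BiFunction<String, Object, String> first = restoredIndexName("restored-");
        BiFunction<String, Object, String> second = restoredIndexName("restored-");
        System.out.println(first.equals(second)); // false: lambdas only have identity equality
        System.out.println(first.equals(first));  // true: same instance
    }
}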

@gmarouli
Contributor

gmarouli commented Mar 28, 2025

that will only be equal if it's the exact same instance (because the lambda doesn't implement an equals method)

Exactly. If you look at the retrieval of a cached step in the PolicyStepsRegistry, there is a comment:

n.b. we're using instance equality here for the IndexMetadata rather than object equality because it's fast,
this means that we're erring on the side of cache misses (if the IndexMetadata changed in any way, it'll be
a new instance, so we'll miss-and-repopulate the cache for the index in question)

This means the cache might miss more often than we expected when retrieving the step, and with the bug in the equality check it might skip some runs. Considering that the index name is not kept in the state of the class, I think removing the BiFunction from the equality check is a first step in fixing this test failure. What do you think?

Triggering a cluster update might hide issues like this, so it might be interesting to test this first by fixing the equality. What do you think?
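
For context, the retrieval roughly looks like this - a simplified paraphrase with assumed names, not the actual PolicyStepsRegistry source:

// The cached entry is only reused when the IndexMetadata *instance* is unchanged.
Step getStep(IndexMetadata indexMetadata, Step.StepKey stepKey) {
    CachedSteps cached = cache.get(indexMetadata.getIndex());
    if (cached != null && cached.indexMetadata == indexMetadata) { // reference equality, not equals()
        return cached.get(stepKey);
    }
    // Any change to the index metadata produces a new instance, so we rebuild the steps.
    // The rebuilt steps contain brand-new lambdas, which is where the equals() bug bites.
    CachedSteps rebuilt = compileSteps(indexMetadata);
    cache.put(indexMetadata.getIndex(), rebuilt);
    return rebuilt.get(stepKey);
}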

@nielsbauman
Contributor

Triggering a cluster update might hide issues like this, so it might be interesting to test this first by fixing the equality, what do you think?

Tbh, I don't think that will make much of a difference. For instance, even if we fix the equality, the ExecuteStepsUpdateTask might still sit in the queue arbitrarily long, allowing an arbitrary number of cluster state updates to be skipped by ILM for this index. Skipping one or a few cluster state updates isn't the problem; we'll proceed without issues when the next cluster state comes in. The issue is that that next cluster state update could come arbitrarily far in the future - and that's true for both test and production clusters, although it's far less likely to cause any significant impact on production clusters.

omricohenn pushed a commit to omricohenn/elasticsearch that referenced this issue Mar 28, 2025
@gmarouli
Contributor

The issue is that that next cluster state update could come in arbitrarily far in the future

That is true, but this is how ILM currently works, so skipping a cluster update is a bug, and in a test it has a much bigger impact than in production because the clusters are "quiet". That's why we can see the impact better in a test; if we trigger a cluster state update, the test can pass even with this bug. In this test failure, the root cause seems to be this equality check, right?

@nielsbauman
Contributor

In this test failure the root cause seems to be this equality check, right?

What impact do you expect to see if we fix this equality check? My expectation is that, even if we fix it, the task might still run while a new cluster state comes in. Even if it doesn't exit silently but executes properly, it'll still see the old cluster state and just consider the condition of the index color to still be false.

nielsbauman added a commit that referenced this issue Apr 1, 2025
ILM sometimes skips a policy/index for a cluster state update if the
step was still running/enqueued while the update came in. That on its
own isn't a problem, but in very quiet clusters, this would mean that
it could take arbitrarily long for the policy step to be run -
i.e. when the next cluster state comes in. We saw this happening in
a few tests, but it could potentially happen in production too.

Fixes #125683
Fixes #125789
Fixes #125867
Fixes #125911
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 1, 2025
ILM sometimes skips a policy/index for a cluster state update if the
step was still running/enqueued while the update came in. That on its
own isn't a problem, but in very quiet clusters, this would mean that
it could take arbitrarily long for the policy step to be run -
i.e. when the next cluster state comes in. We saw this happening in
a few tests, but it could potentially happen in production too.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 1, 2025
ILM sometimes skips a policy/index for a cluster state update if the
step is still running/enqueued when the update comes in. That on its own
isn't a problem, but in very quiet clusters, this would mean that
it could take arbitrarily long for the policy step to be run -
i.e. when the next cluster state comes in. We saw this happening in
a few tests, but it could potentially happen in production too.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
nielsbauman added a commit that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes #125683
Fixes #125789
Fixes #125867
Fixes #125911
Fixes #126053
Fixes #126354
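
A minimal sketch of what the described change amounts to in WaitForIndexColorStep#equals (assumed to match the commit's intent, not the exact diff):

@Override
public boolean equals(Object obj) {
    if (obj == null || obj.getClass() != getClass()) {
        return false;
    }
    WaitForIndexColorStep other = (WaitForIndexColorStep) obj;
    // indexNameSupplier is intentionally left out: two logically identical steps now compare
    // equal even when their lambdas are different instances.
    return super.equals(obj) && Objects.equals(this.color, other.color);
}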
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
Fixes elastic#126354
nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes elastic#125683
Fixes elastic#125789
Fixes elastic#125867
Fixes elastic#125911
Fixes elastic#126053
Fixes elastic#126354

(cherry picked from commit 3231eb2)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes #125683
Fixes #125789
Fixes #125867
Fixes #125911
Fixes #126053
Fixes #126354

(cherry picked from commit 3231eb2)

# Conflicts:
#	muted-tests.yml
elasticsearchmachine pushed a commit that referenced this issue Apr 10, 2025
The `indexNameSupplier` was included in the equality and is of type
`BiFunction`, which doesn't implement a proper `equals` method by
default - and thus neither do the lambdas. This meant that two instances
of this step would only be considered equal if they were the same
instance. By excluding `indexNameSupplier` from the `equals` method, we
ensure the method works as intended and is able to properly tell the
equality between two instances.

As a side effect, we expect/hope this change will fix a number of tests
that were failing because `WaitForIndexColorStep` missed the last
cluster state update in the test, causing ILM to get stuck and the test
to time out.

Fixes #125683
Fixes #125789
Fixes #125867
Fixes #125911
Fixes #126053
Fixes #126354

(cherry picked from commit 3231eb2)

# Conflicts:
#	muted-tests.yml
#	x-pack/plugin/ilm/qa/multi-node/src/javaRestTest/java/org/elasticsearch/xpack/ilm/actions/SearchableSnapshotActionIT.java