[CI] SearchableSnapshotActionIT testSearchableSnapshotsInHotPhasePinnedToHotNodes failing #125683
Comments
This has been muted on branch main Mute Reasons:
Build Scans: |
…estSearchableSnapshotsInHotPhasePinnedToHotNodes #125683
Pinging @elastic/es-data-management (Team:Data Management) |
EDIT: after further investigation, it looks like it's unrelated to that PR after all - see my comment below. |
I've been looking at this for some time, and I think I have some idea of what happened, but I'm not 100% sure. I mainly looked at the only build failure that renders for me (Gradle Enterprise is having some issues ATM), which is the PR build linked in the OP. I'm posting my intermediate findings here for now:
The test failed because ILM truly did get stuck - i.e. it wasn't an API that was returning outdated/wrong information. Right until the end of the test, we see that the first backing index is in the
Interestingly, though, the index did turn green way before the end of the test (note the timestamp differences):
Right before the restored index turned
Looking at where that
and I think that's the one from While this task is queued/doing God knows what, the index is in |
#125789 also seems to be caused by ILM being stuck due to an absence of cluster state updates. I also see we have a dedicated method in this class for triggering cluster state updates: Lines 1100 to 1107 in 0c50403
That gives me some more confidence that both these tests are indeed caused by a quiet cluster. |
These tests are very prone to the "quiet test cluster syndrome" and sometimes need a little push by means of some artificial cluster state updates. To hopefully be done with these test failures once and for all (famous last words), I've updated all tests in this class to trigger cluster state updates where necessary. This is basically a follow-up of elastic#108162. Fixes elastic#125683 Fixes elastic#125789
Is it possible that this is only executed when there is a cluster change? This could answer why we do not see any logging from it until the end of the test. Right? |
@gmarouli we only log |
I was referring to this indeed. |
I think I found the issue in the equality, it checks the
|
Yeah I noticed that too and it seemed a little off to me as well, but I disregarded it. Looking at it again, we construct the Lines 359 to 364 in 6be6ebe
At first glance, that seems fine because we always construct the step in the same way. But, indeed, because we use a BiFunction, that will only be equal if it's the exact same instance (because the lambda doesn't implement an equals method). So, somehow, the step got recreated, causing the equality to be false. That's what you're thinking too, right?
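To make the lambda point concrete, here is a minimal standalone sketch (not taken from the Elasticsearch code base). The class name `LambdaEqualityDemo` and the factory `supplierFor` are hypothetical and exist only for illustration. Lambdas inherit `equals` from `Object`, so two lambdas built from the same expression are compared by reference. The Java language specification leaves lambda identity unspecified, but on common JVMs each evaluation of a capturing lambda yields a new object.

```java
import java.util.function.BiFunction;

class LambdaEqualityDemo {
    // Hypothetical factory: every call evaluates the same lambda expression,
    // capturing `prefix`, which typically allocates a fresh object.
    static BiFunction<String, String, String> supplierFor(String prefix) {
        return (indexName, state) -> prefix + indexName;
    }

    public static void main(String[] args) {
        BiFunction<String, String, String> first = supplierFor("restored-");
        BiFunction<String, String, String> second = supplierFor("restored-");

        // Both lambdas behave identically for the same input.
        System.out.println(first.apply("my-index", "state").equals(second.apply("my-index", "state")));
        // equals() is Object.equals, i.e. a reference comparison.
        System.out.println(first.equals(first));
        System.out.println(first.equals(second));
    }
}
```

Any class that includes such a field in its own `equals` therefore only compares equal to itself, which matches the behavior described above.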
|
Exactly, if you look into the retrieval of a cached step in the
This means that this cache might have a Triggering a cluster update might hide issues like this, so it might be interesting to test this first by fixing the equality, what do you think? |
Tbh, I don't think that will make much of a difference. For instance, even if we fix the equality, the |
…estSearchableSnapshotsInHotPhasePinnedToHotNodes elastic#125683
That is true, but this is how ILM currently works, so skipping a cluster update is a bug, and in a test it has a much bigger impact than in production because the clusters are "quiet". This is why we can see that impact better in a test: if we trigger a cluster state update, the test can work even with this bug. In this test failure the root cause seems to be this equality check, right? |
What impact do you expect to see if we fix this equality check? My expectation is that, even if we fix it, the task might still run while a new cluster state comes in. Even if it doesn't exit silently but executes properly, it'll still see the old cluster state and just consider the condition of the index color to still be false. |
ILM sometimes skips a policy/index for a cluster state update if the step was still running/enqueued while the update came in. That on its own isn't a problem, but in very quiet clusters, this would mean that it could take arbitrarily long for the policy step to be run - i.e. when the next cluster state comes in. We saw this happening in a few tests, but it could potentially happen in production too. Fixes #125683 Fixes #125789 Fixes #125867 Fixes #125911
ILM sometimes skips a policy/index for a cluster state update if the step was still running/enqueued while the update came in. That on its own isn't a problem, but in very quiet clusters, this would mean that it could take arbitrarily long for the policy step to be run - i.e. when the next cluster state comes in. We saw this happening in a few tests, but it could potentially happen in production too. Fixes elastic#125683 Fixes elastic#125789 Fixes elastic#125867 Fixes elastic#125911 Fixes elastic#126053
ILM sometimes skips a policy/index for a cluster state update if the step is still running/enqueued when the update comes in. That on its own isn't a problem, but in very quiet clusters, this would mean that it could take arbitrarily long for the policy step to be run - i.e. when the next cluster state comes in. We saw this happening in a few tests, but it could potentially happen in production too. Fixes elastic#125683 Fixes elastic#125789 Fixes elastic#125867 Fixes elastic#125911 Fixes elastic#126053
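The skipping behavior described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `SkipOnBusySketch`, `onClusterStateUpdate` and `onStepFinished` are hypothetical names, and none of this is ILM's actual implementation. It shows why a skipped update is harmless in a busy cluster but can stall a quiet one: nothing re-runs the step until another update arrives.

```java
import java.util.HashSet;
import java.util.Set;

class SkipOnBusySketch {
    // Indices whose step is currently running or enqueued (hypothetical bookkeeping).
    private final Set<String> inFlight = new HashSet<>();
    private int executions = 0;

    // Called for every cluster state update. Returns true if the step was
    // started, false if the update was skipped because the step is still busy.
    boolean onClusterStateUpdate(String index) {
        if (!inFlight.add(index)) {
            return false; // skipped: the step never sees this cluster state
        }
        executions++;
        return true;
    }

    // Called when the step completes; nothing here re-checks a skipped update.
    void onStepFinished(String index) {
        inFlight.remove(index);
    }

    int executions() {
        return executions;
    }
}
```

In this model, if the skipped update was the last one in a quiet cluster, the step has run exactly once against the older state and will not run again until some later update happens to arrive.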
The `indexNameSupplier` was included in the equality and is of type `BiFunction`, which doesn't implement a proper `equals` method by default - and thus neither do the lambdas. This meant that two instances of this step would only be considered equal if they were the same instance. By excluding `indexNameSupplier` from the `equals` method, we ensure the method works as intended and is able to properly tell the equality between two instances. As a side effect, we expect/hope this change will fix a number of tests that were failing because `WaitForIndexColorStep` missed the last cluster state update in the test, causing ILM to get stuck and the test to time out. Fixes #125683 Fixes #125789 Fixes #125867 Fixes #125911 Fixes #126053 Fixes #126354
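The shape of that change can be sketched as follows. This is a simplified, hypothetical stand-in rather than the real `WaitForIndexColorStep`: the class `ColorStepSketch`, its fields `key` and `color`, and the factory `build` are assumptions made for illustration. Only the idea of leaving `indexNameSupplier` out of `equals` and `hashCode` comes from the commit message.

```java
import java.util.Objects;
import java.util.function.BiFunction;

class ColorStepSketch {
    final String key;
    final String color;
    // Lambdas only support reference equality, so this field is
    // deliberately left out of equals() and hashCode().
    final BiFunction<String, String, String> indexNameSupplier;

    ColorStepSketch(String key, String color, BiFunction<String, String, String> indexNameSupplier) {
        this.key = key;
        this.color = color;
        this.indexNameSupplier = indexNameSupplier;
    }

    // Hypothetical factory that always builds the step the same way,
    // creating a new capturing lambda on each call.
    static ColorStepSketch build(String key, String color) {
        String prefix = "restored-";
        return new ColorStepSketch(key, color, (indexName, state) -> prefix + indexName);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        ColorStepSketch other = (ColorStepSketch) obj;
        return Objects.equals(key, other.key) && Objects.equals(color, other.color);
    }

    @Override
    public int hashCode() {
        return Objects.hash(key, color);
    }
}
```

With this shape, two steps built separately but with the same key and color compare equal, so a recreated step no longer looks like a different one.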
The `indexNameSupplier` was included in the equality and is of type `BiFunction`, which doesn't implement a proper `equals` method by default - and thus neither do the lambdas. This meant that two instances of this step would only be considered equal if they were the same instance. By excluding `indexNameSupplier` from the `equals` method, we ensure the method works as intended and is able to properly tell the equality between two instances. As a side effect, we expect/hope this change will fix a number of tests that were failing because `WaitForIndexColorStep` missed the last cluster state update in the test, causing ILM to get stuck and the test to time out. Fixes elastic#125683 Fixes elastic#125789 Fixes elastic#125867 Fixes elastic#125911 Fixes elastic#126053 Fixes elastic#126354 (cherry picked from commit 3231eb2) # Conflicts: # muted-tests.yml
The `indexNameSupplier` was included in the equality and is of type `BiFunction`, which doesn't implement a proper `equals` method by default - and thus neither do the lambdas. This meant that two instances of this step would only be considered equal if they were the same instance. By excluding `indexNameSupplier` from the `equals` method, we ensure the method works as intended and is able to properly tell the equality between two instances. As a side effect, we expect/hope this change will fix a number of tests that were failing because `WaitForIndexColorStep` missed the last cluster state update in the test, causing ILM to get stuck and the test to time out. Fixes #125683 Fixes #125789 Fixes #125867 Fixes #125911 Fixes #126053 Fixes #126354 (cherry picked from commit 3231eb2) # Conflicts: # muted-tests.yml # x-pack/plugin/ilm/qa/multi-node/src/javaRestTest/java/org/elasticsearch/xpack/ilm/actions/SearchableSnapshotActionIT.java
Build Scans:
Reproduction Line:
Applicable branches:
main
Reproduces locally?:
N/A
Failure History:
See dashboard
Failure Message:
Issue Reasons:
Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.