-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[CI] SearchableSnapshotActionIT testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped failing #118406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Pinging @elastic/es-data-management (Team:Data Management) |
This has been muted on branch main Mute Reasons:
Build Scans: |
…estUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped #118406
…estUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped elastic#118406
This is a duplicate of #94918. I'm keeping this one open to avoid fighting with the test bot. |
The first build linked in the OP doesn't have any artifacts (i.e. cluster logs) to download - which I'll reach out to Delivery for. The second linked build is unrelated as the failures are caused by PR changes. Looking at recent build failures on our very own Serverless project: https://es-delivery-stats.elastic.dev/app/r/s/YOHTU, I was able to find a build that does have artifacts included, allowing us to have a look at the cluster logs: https://gradle-enterprise.elastic.co/s/6uowhhwr452v2. I haven't actually looked at the logs yet, I'm just posting what I found for now. |
After having taken a quick look at the logs, I see repeated logs of
If I run this test locally, it looks like the restored index is supposed to jump to
and skip the updated hot phase. I'll have another look some other time into why ILM jumped back to the hot phase. |
After staring at the cluster logs some more, I was able to figure out what went wrong here. It's caused by a very specific order of (concurrent, surprise surprise) events, so I'll do my best to try to explain what happened exactly. First, what does the test expect? We start off with the following ILM policy: {
"policy": {
"phases": {
"hot": {
"min_age": "0s",
"actions": {
"rollover": {...},
"searchable_snapshot": {...}
}
},
"warm": {
"min_age": "30d",
"actions": {
"set_priority": {"priority": 999}
}
}
}
}
} We expect ILM to move the index to the {
"policy": {
"phases": {
"hot": {
"min_age": "0s",
"actions": {
"set_priority": {"priority": 10}
}
},
"warm": {
"min_age": "0s",
"actions": {
"shrink": {...},
"force_merge": {...}
}
},
"cold": {
"min_age": "0s",
"actions": {
"searchable_snapshot": {...}
}
},
}
}
} Looking at the cluster logs from the build scan I linked in #118406 (comment), we see the following log:
This is what we want - we want ILM to skip the updated However, then we see the following log:
While we do expect the index to move to the Looking at the code, here's the order of events:
I changed the label to |
I looked some more into this, and my conclusion about what happens here changed a little bit. The fact that we move to a step that does not exist (anymore) is not necessarily problematic. When that happens, we simply execute the last step again (i.e. the Here is the order of events as I currently understand them:
In short, the issue is that we clear |
Great job on this analysis @nielsbauman, super helpful. You are right that adding locking in this should not be done lightly. Before we start exploring different approaches, I wanted to challenge the order of these operations and see if a change there could help. If I understand you analysis:
What I am thinking is to move clearing What do you think? |
I think that is an excellent solution! Thanks a lot for the great idea :) I'll run a test to ensure no tests depend on the current behavior (just to be sure) and then I'll open a PR. |
The following order of events was possible: - An ILM policy update cleared `cachedSteps` - ILM retrieves the step definition for an index, this populates `cachedSteps` with the outdated policy - The updated policy is put in `lifecyclePolicyMap` Any subsequent cache retrievals will see the old step definition. By clearing `cachedSteps` _after_ we update `lifecyclePolicyMap`, we ensure eventual consistency between the policy and the cache. Fixes elastic#118406
The following order of events was possible: - An ILM policy update cleared `cachedSteps` - ILM retrieves the step definition for an index, this populates `cachedSteps` with the outdated policy - The updated policy is put in `lifecyclePolicyMap` Any subsequent cache retrievals will see the old step definition. By clearing `cachedSteps` _after_ we update `lifecyclePolicyMap`, we ensure eventual consistency between the policy and the cache. Fixes #118406 (cherry picked from commit 5383f0f) # Conflicts: # muted-tests.yml
Build Scans:
Reproduction Line:
Applicable branches:
main
Reproduces locally?:
N/A
Failure History:
See dashboard
Failure Message:
Issue Reasons:
Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.
The text was updated successfully, but these errors were encountered: