Skip to content

[CI] SearchableSnapshotActionIT testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped failing #118406

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
elasticsearchmachine opened this issue Dec 10, 2024 · 9 comments · Fixed by #126840
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management medium-risk An open issue or test failure that is a medium risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Copy link
Collaborator

elasticsearchmachine commented Dec 10, 2024

Build Scans:

Reproduction Line:

gradlew ":x-pack:plugin:ilm:qa:multi-node:javaRestTest" --tests "org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT.testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped" -Dtests.seed=3D05FB81E22D2F83 -Dtests.locale=jgo-CM -Dtests.timezone=Australia/Tasmania -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: 
Expected: is "cold"
     but: was "hot"

Issue Reasons:

  • [main] 2 failures in test testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped (0.2% fail rate in 925 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Data Management/ILM+SLM Index and Snapshot lifecycle management >test-failure Triaged test failures from CI labels Dec 10, 2024
@elasticsearchmachine
Copy link
Collaborator Author

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Dec 10, 2024
@dakrone dakrone added low-risk An open issue or test failure that is a low risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Dec 17, 2024
@elasticsearchmachine
Copy link
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped (0.2% fail rate in 925 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Jan 14, 2025
…estUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped #118406
martijnvg pushed a commit to martijnvg/elasticsearch that referenced this issue Jan 14, 2025
…estUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped elastic#118406
@nielsbauman
Copy link
Contributor

This is a duplicate of #94918. I'm keeping this one open to avoid fighting with the test bot.

@nielsbauman
Copy link
Contributor

The first build linked in the OP doesn't have any artifacts (i.e. cluster logs) to download - which I'll reach out to Delivery for. The second linked build is unrelated as the failures are caused by PR changes.

Looking at recent build failures on our very own Serverless project: https://es-delivery-stats.elastic.dev/app/r/s/YOHTU, I was able to find a build that does have artifacts included, allowing us to have a look at the cluster logs: https://gradle-enterprise.elastic.co/s/6uowhhwr452v2. I haven't actually looked at the logs yet, I'm just posting what I found for now.

@nielsbauman
Copy link
Contributor

After having taken a quick look at the logs, I see repeated logs of

[2025-03-19T10:28:13,136][WARN ][o.e.x.i.MoveToNextStepUpdateTask] [javaRestTest-2] policy [policy-WvYtC] for index [[restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001/1knqUX9uRSmCboOSoImnNg]] failed trying to move from step [{"phase":"hot","action":"complete","name":"complete"}] to step [{"phase":"warm","action":"set_priority","name":"set_priority"}]. java.lang.IllegalArgumentException: step [{"phase":"warm","action":"set_priority","name":"set_priority"}] for index [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] with policy [policy-WvYtC] does not exist
	at [email protected]/org.elasticsearch.xpack.ilm.IndexLifecycleTransition.validateTransition(IndexLifecycleTransition.java:99)
	at [email protected]/org.elasticsearch.xpack.ilm.IndexLifecycleTransition.moveClusterStateToStep(IndexLifecycleTransition.java:127)
	at [email protected]/org.elasticsearch.xpack.ilm.MoveToNextStepUpdateTask.doExecute(MoveToNextStepUpdateTask.java:60)

If I run this test locally, it looks like the restored index is supposed to jump to

[2025-03-27T20:52:31,481][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [restored-.ds-logs-ldmsikfyfm-2025.03.27-000001] moving to step [policy-mgoJF] {"phase":"hot","action":"complete","name":"complete"} -> {"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"}

and skip the updated hot phase.

I'll have another look some other time into why ILM jumped back to the hot phase.

@nielsbauman nielsbauman added medium-risk An open issue or test failure that is a medium risk to future releases and removed low-risk An open issue or test failure that is a low risk to future releases labels Mar 28, 2025
@nielsbauman nielsbauman assigned nielsbauman and unassigned samxbr Mar 28, 2025
@nielsbauman
Copy link
Contributor

After staring at the cluster logs some more, I was able to figure out what went wrong here. It's caused by a very specific order of (concurrent, surprise surprise) events, so I'll do my best to try to explain what happened exactly.

First, what does the test expect? We start off with the following ILM policy:

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0s",
		"actions": {
		  "rollover": {...},
		  "searchable_snapshot": {...}
		}
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "set_priority": {"priority": 999} 
        }
      }
    }
  }
}

We expect ILM to move the index to the complete action of the hot phase, where it will be waiting for the warm phase for 30 days - i.e. not within the timeframe of the test. Then, we change the ILM policy to the following:

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0s",
        "actions": {
          "set_priority": {"priority": 10} 
        }
      },
      "warm": {
        "min_age": "0s",
		"actions": {
		  "shrink": {...},
		  "force_merge": {...}
		}
      },
      "cold": {
        "min_age": "0s",
		"actions": {
		  "searchable_snapshot": {...}
		}
      },
    }
  }
}

Looking at the cluster logs from the build scan I linked in #118406 (comment), we see the following log:

[2025-03-19T10:28:13,119][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-2] [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] updated policy [policy-WvYtC] has different phase step keys and will NOT refresh phase definition as it differs too greatly. old ...

This is what we want - we want ILM to skip the updated hot phase and just proceed to the warm phase which now has min_age=0s (instead of the 30d from before). Then we see some logs that ILM is updating the policy in some places (i.e. the cache and policy registry) without abnormalities.

However, then we see the following log:

[2025-03-19T10:28:13,133][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-2] [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] moving to step [policy-WvYtC] {"phase":"hot","action":"complete","name":"complete"} -> {"phase":"warm","action":"set_priority","name":"set_priority"}

While we do expect the index to move to the warm phase, ILM moved it to the set_priority action. With the policy update, we moved the set_priority action from the warm phase to the hot phase. That's why ILM is logging the error I posted in #118406 (comment) about the unknown step. Instead, we expect ILM to move the index to the unfollow action of the warm phase - which I confirmed in the logs of a successful local run of this test.

Looking at the code, here's the order of events:

  1. We fetch the current step key for the index here:
    currentStep = getCurrentStep(stepRegistry, policy, indexMetadata, lifecycleState);

    which corresponds with the following logs:
    [2025-03-19T10:28:13,132][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-2] [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] retrieved current step key: {"phase":"hot","action":"complete","name":"complete"}
    [2025-03-19T10:28:13,133][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-2] parsed steps for policy [policy-WvYtC] in phase [hot], definition: [{"policy":"policy-WvYtC","phase_definition":{"min_age":"0ms","actions":{"rollover":{"max_docs":1},"searchable_snapshot":{"snapshot_repository":"cTUTRWgWBVkC","force_merge_index":true}}},"version":1,"modified_date_in_millis":1742380088938}], steps: [[{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"} => null, {"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"} => {"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"} => {"phase":"hot","action":"unfollow","name":"pause-follower-index"}, {"phase":"hot","action":"unfollow","name":"pause-follower-index"} => {"phase":"hot","action":"unfollow","name":"close-follower-index"}, {"phase":"hot","action":"unfollow","name":"close-follower-index"} => {"phase":"hot","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"hot","action":"unfollow","name":"unfollow-follower-index"} => {"phase":"hot","action":"unfollow","name":"open-follower-index"}, {"phase":"hot","action":"unfollow","name":"open-follower-index"} => {"phase":"hot","action":"unfollow","name":"wait-for-index-color"}, {"phase":"hot","action":"unfollow","name":"wait-for-index-color"} => {"phase":"hot","action":"rollover","name":"check-rollover-ready"}, {"phase":"hot","action":"rollover","name":"check-rollover-ready"} => {"phase":"hot","action":"rollover","name":"attempt-rollover"}, {"phase":"hot","action":"rollover","name":"attempt-rollover"} => {"phase":"hot","action":"rollover","name":"wait-for-active-shards"}, {"phase":"hot","action":"rollover","name":"wait-for-active-shards"} => {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}, {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"} => {"phase":"hot","action":"rollover","name":"set-indexing-complete"}, {"phase":"hot","action":"rollover","name":"set-indexing-complete"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-check-prerequisites"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-check-prerequisites"} => null, {"phase":"hot","action":"searchable_snapshot","name":"check-not-write-index"} => {"phase":"hot","action":"searchable_snapshot","name":"wait-for-shard-history-leases"}, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-shard-history-leases"} => {"phase":"hot","action":"searchable_snapshot","name":"check-ts-end-time-passed"}, {"phase":"hot","action":"searchable_snapshot","name":"check-ts-end-time-passed"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-check-existing-snapshot"} => null, {"phase":"hot","action":"searchable_snapshot","name":"forcemerge"} => {"phase":"hot","action":"searchable_snapshot","name":"segment-count"}, {"phase":"hot","action":"searchable_snapshot","name":"segment-count"} => {"phase":"hot","action":"searchable_snapshot","name":"generate-snapshot-name"}, {"phase":"hot","action":"searchable_snapshot","name":"generate-snapshot-name"} => {"phase":"hot","action":"searchable_snapshot","name":"cleanup-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"cleanup-snapshot"} => {"phase":"hot","action":"searchable_snapshot","name":"create-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"create-snapshot"} => null, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-data-tier"} => {"phase":"hot","action":"searchable_snapshot","name":"mount-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"mount-snapshot"} => {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"}, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"} => {"phase":"hot","action":"searchable_snapshot","name":"copy-execution-state"}, {"phase":"hot","action":"searchable_snapshot","name":"copy-execution-state"} => {"phase":"hot","action":"searchable_snapshot","name":"copy-settings"}, {"phase":"hot","action":"searchable_snapshot","name":"copy-settings"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-on-datastream-check"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-on-datastream-check"} => null, {"phase":"hot","action":"searchable_snapshot","name":"replace-datastream-backing-index"} => {"phase":"hot","action":"searchable_snapshot","name":"delete"}, {"phase":"hot","action":"searchable_snapshot","name":"delete"} => null, {"phase":"hot","action":"searchable_snapshot","name":"swap-aliases"} => null, {"phase":"hot","action":"complete","name":"complete"} => {"phase":"warm","action":"set_priority","name":"set_priority"}]]
    
    Note that the second log line proves that it fetches the old policy
  2. The policy is updated in the policy cache:
    [2025-03-19T10:28:13,133][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-2] updating cached steps for [policy-WvYtC] policy, new steps: [{"phase":"new","action":"init","name":"init"}, {"phase":"new","action":"complete","name":"complete"}, {"phase":"hot","action":"set_priority","name":"set_priority"}, {"phase":"hot","action":"complete","name":"complete"}, {"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"warm","action":"unfollow","name":"pause-follower-index"}, {"phase":"warm","action":"unfollow","name":"close-follower-index"}, {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"warm","action":"unfollow","name":"open-follower-index"}, {"phase":"warm","action":"unfollow","name":"wait-for-index-color"}, {"phase":"warm","action":"migrate","name":"branch-check-skip-action"}, {"phase":"warm","action":"migrate","name":"migrate"}, {"phase":"warm","action":"migrate","name":"check-migration"}, {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"}, {"phase":"warm","action":"shrink","name":"check-not-write-index"}, {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"}, {"phase":"warm","action":"shrink","name":"check-ts-end-time-passed"}, {"phase":"warm","action":"shrink","name":"readonly"}, {"phase":"warm","action":"shrink","name":"check-target-shards-count"}, {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"}, {"phase":"warm","action":"shrink","name":"generate-index-name"}, {"phase":"warm","action":"shrink","name":"set-single-node-allocation"}, {"phase":"warm","action":"shrink","name":"check-shrink-allocation"}, {"phase":"warm","action":"shrink","name":"shrink"}, {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"}, {"phase":"warm","action":"shrink","name":"copy-execution-state"}, {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"}, {"phase":"warm","action":"shrink","name":"aliases"}, {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"shrink","name":"replace-datastream-backing-index"}, {"phase":"warm","action":"shrink","name":"delete"}, {"phase":"warm","action":"forcemerge","name":"branch-forcemerge-check-prerequisites"}, {"phase":"warm","action":"forcemerge","name":"check-not-write-index"}, {"phase":"warm","action":"forcemerge","name":"check-ts-end-time-passed"}, {"phase":"warm","action":"forcemerge","name":"readonly"}, {"phase":"warm","action":"forcemerge","name":"forcemerge"}, {"phase":"warm","action":"forcemerge","name":"segment-count"}, {"phase":"warm","action":"complete","name":"complete"}, {"phase":"cold","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"cold","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"cold","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"cold","action":"unfollow","name":"pause-follower-index"}, {"phase":"cold","action":"unfollow","name":"close-follower-index"}, {"phase":"cold","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"cold","action":"unfollow","name":"open-follower-index"}, {"phase":"cold","action":"unfollow","name":"wait-for-index-color"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-check-prerequisites"}, {"phase":"cold","action":"searchable_snapshot","name":"check-not-write-index"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-shard-history-leases"}, {"phase":"cold","action":"searchable_snapshot","name":"check-ts-end-time-passed"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"forcemerge"}, {"phase":"cold","action":"searchable_snapshot","name":"segment-count"}, {"phase":"cold","action":"searchable_snapshot","name":"generate-snapshot-name"}, {"phase":"cold","action":"searchable_snapshot","name":"cleanup-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"create-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-data-tier"}, {"phase":"cold","action":"searchable_snapshot","name":"mount-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-index-color"}, {"phase":"cold","action":"searchable_snapshot","name":"copy-execution-state"}, {"phase":"cold","action":"searchable_snapshot","name":"copy-settings"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-on-datastream-check"}, {"phase":"cold","action":"searchable_snapshot","name":"replace-datastream-backing-index"}, {"phase":"cold","action":"searchable_snapshot","name":"delete"}, {"phase":"cold","action":"searchable_snapshot","name":"swap-aliases"}, {"phase":"cold","action":"complete","name":"complete"}]
    
  3. We check the min_age of the warm phase here:
    if (isReadyToTransitionToThisPhase(policy, indexMetadata, currentStep.getNextStepKey().phase())) {

    Note: this fetches the latest policy from the cache, which has now been updated
  4. Since the min_age of the new warm phase is used (which is 0s), we proceed to the next step here:
    moveToStep(indexMetadata.getIndex(), policy, currentStep.getKey(), currentStep.getNextStepKey());

    However, this uses the currentStep.getNextStepKey(), which was obtained from before the policy update and thus points to the next step in the old warm phase (i.e. set_priority)

I changed the label to medium-risk because this is an actual bug and could cause issues in production clusters. I think I have some ideas for fixes, but I'll take some time to come up with the best one and possibly first discuss with the team before we proceed with a fix.

@nielsbauman
Copy link
Contributor

I looked some more into this, and my conclusion about what happens here changed a little bit. The fact that we move to a step that does not exist (anymore) is not necessarily problematic. When that happens, we simply execute the last step again (i.e. the complete step) and "recompute" (spoiler alert: not really, see below) the next step and try to move the index to that step again. However, the actual bug here lies a little deeper.

Here is the order of events as I currently understand them:

  1. We start the lifecycle policy update process:

    [2025-03-19T10:28:13,118][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-2] updating index lifecycle policy [policy-WvYtC]

  2. As part of the policy update process, we clear the PolicyStepsRegistry.cachedSteps:

  3. Then we start running the periodic step for the index in question:

    [2025-03-19T10:28:13,132][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-2] [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] retrieved current step key: {"phase":"hot","action":"complete","name":"complete"}
    [2025-03-19T10:28:13,133][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-2] parsed steps for policy [policy-WvYtC] in phase [hot], definition: [{"policy":"policy-WvYtC","phase_definition":{"min_age":"0ms","actions":{"rollover":{"max_docs":1},"searchable_snapshot":{"snapshot_repository":"cTUTRWgWBVkC","force_merge_index":true}}},"version":1,"modified_date_in_millis":1742380088938}], steps: [[{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"} => null, {"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"} => {"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"} => {"phase":"hot","action":"unfollow","name":"pause-follower-index"}, {"phase":"hot","action":"unfollow","name":"pause-follower-index"} => {"phase":"hot","action":"unfollow","name":"close-follower-index"}, {"phase":"hot","action":"unfollow","name":"close-follower-index"} => {"phase":"hot","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"hot","action":"unfollow","name":"unfollow-follower-index"} => {"phase":"hot","action":"unfollow","name":"open-follower-index"}, {"phase":"hot","action":"unfollow","name":"open-follower-index"} => {"phase":"hot","action":"unfollow","name":"wait-for-index-color"}, {"phase":"hot","action":"unfollow","name":"wait-for-index-color"} => {"phase":"hot","action":"rollover","name":"check-rollover-ready"}, {"phase":"hot","action":"rollover","name":"check-rollover-ready"} => {"phase":"hot","action":"rollover","name":"attempt-rollover"}, {"phase":"hot","action":"rollover","name":"attempt-rollover"} => {"phase":"hot","action":"rollover","name":"wait-for-active-shards"}, {"phase":"hot","action":"rollover","name":"wait-for-active-shards"} => {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}, {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"} => {"phase":"hot","action":"rollover","name":"set-indexing-complete"}, {"phase":"hot","action":"rollover","name":"set-indexing-complete"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-check-prerequisites"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-check-prerequisites"} => null, {"phase":"hot","action":"searchable_snapshot","name":"check-not-write-index"} => {"phase":"hot","action":"searchable_snapshot","name":"wait-for-shard-history-leases"}, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-shard-history-leases"} => {"phase":"hot","action":"searchable_snapshot","name":"check-ts-end-time-passed"}, {"phase":"hot","action":"searchable_snapshot","name":"check-ts-end-time-passed"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-check-existing-snapshot"} => null, {"phase":"hot","action":"searchable_snapshot","name":"forcemerge"} => {"phase":"hot","action":"searchable_snapshot","name":"segment-count"}, {"phase":"hot","action":"searchable_snapshot","name":"segment-count"} => {"phase":"hot","action":"searchable_snapshot","name":"generate-snapshot-name"}, {"phase":"hot","action":"searchable_snapshot","name":"generate-snapshot-name"} => {"phase":"hot","action":"searchable_snapshot","name":"cleanup-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"cleanup-snapshot"} => {"phase":"hot","action":"searchable_snapshot","name":"create-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"create-snapshot"} => null, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-data-tier"} => {"phase":"hot","action":"searchable_snapshot","name":"mount-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"mount-snapshot"} => {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"}, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"} => {"phase":"hot","action":"searchable_snapshot","name":"copy-execution-state"}, {"phase":"hot","action":"searchable_snapshot","name":"copy-execution-state"} => {"phase":"hot","action":"searchable_snapshot","name":"copy-settings"}, {"phase":"hot","action":"searchable_snapshot","name":"copy-settings"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-on-datastream-check"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-on-datastream-check"} => null, {"phase":"hot","action":"searchable_snapshot","name":"replace-datastream-backing-index"} => {"phase":"hot","action":"searchable_snapshot","name":"delete"}, {"phase":"hot","action":"searchable_snapshot","name":"delete"} => null, {"phase":"hot","action":"searchable_snapshot","name":"swap-aliases"} => null, {"phase":"hot","action":"complete","name":"complete"} => {"phase":"warm","action":"set_priority","name":"set_priority"}]]
    

    Since we cleared cachedSteps in step 2, we have to parse the steps again (see the second log above).

    phaseSteps = parseStepsFromPhase(policyName, phase, phaseJson);

    We do that based on lifecyclePolicyMap (which at this point still contains the old policy) here
    LifecyclePolicyMetadata policyMetadata = lifecyclePolicyMap.get(policy);

  4. After parsing the steps, we put the parsed steps back into cachedSteps

    cachedSteps.put(indexMetadata.getIndex(), Tuple.tuple(indexMetadata, s));

    This means the old definition of the step key {"phase":"hot","action":"complete","name":"complete"} is stored in cachedSteps, which still points to {"phase":"warm","action":"set_priority","name":"set_priority"} as the next step.

  5. Then, we put the updated policy in lifecyclePolicyMap as part of the policy update process (i.e. a continuation of step 1)

    lifecyclePolicyMap.put(policyMetadata.getName(), policyMetadata);

    [2025-03-19T10:28:13,133][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-2] updating cached steps for [policy-WvYtC] policy, new steps: [{"phase":"new","action":"init","name":"init"}, {"phase":"new","action":"complete","name":"complete"}, {"phase":"hot","action":"set_priority","name":"set_priority"}, {"phase":"hot","action":"complete","name":"complete"}, {"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"warm","action":"unfollow","name":"pause-follower-index"}, {"phase":"warm","action":"unfollow","name":"close-follower-index"}, {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"warm","action":"unfollow","name":"open-follower-index"}, {"phase":"warm","action":"unfollow","name":"wait-for-index-color"}, {"phase":"warm","action":"migrate","name":"branch-check-skip-action"}, {"phase":"warm","action":"migrate","name":"migrate"}, {"phase":"warm","action":"migrate","name":"check-migration"}, {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"}, {"phase":"warm","action":"shrink","name":"check-not-write-index"}, {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"}, {"phase":"warm","action":"shrink","name":"check-ts-end-time-passed"}, {"phase":"warm","action":"shrink","name":"readonly"}, {"phase":"warm","action":"shrink","name":"check-target-shards-count"}, {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"}, {"phase":"warm","action":"shrink","name":"generate-index-name"}, {"phase":"warm","action":"shrink","name":"set-single-node-allocation"}, {"phase":"warm","action":"shrink","name":"check-shrink-allocation"}, {"phase":"warm","action":"shrink","name":"shrink"}, {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"}, {"phase":"warm","action":"shrink","name":"copy-execution-state"}, {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"}, {"phase":"warm","action":"shrink","name":"aliases"}, {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"shrink","name":"replace-datastream-backing-index"}, {"phase":"warm","action":"shrink","name":"delete"}, {"phase":"warm","action":"forcemerge","name":"branch-forcemerge-check-prerequisites"}, {"phase":"warm","action":"forcemerge","name":"check-not-write-index"}, {"phase":"warm","action":"forcemerge","name":"check-ts-end-time-passed"}, {"phase":"warm","action":"forcemerge","name":"readonly"}, {"phase":"warm","action":"forcemerge","name":"forcemerge"}, {"phase":"warm","action":"forcemerge","name":"segment-count"}, {"phase":"warm","action":"complete","name":"complete"}, {"phase":"cold","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"cold","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"cold","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"cold","action":"unfollow","name":"pause-follower-index"}, {"phase":"cold","action":"unfollow","name":"close-follower-index"}, {"phase":"cold","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"cold","action":"unfollow","name":"open-follower-index"}, {"phase":"cold","action":"unfollow","name":"wait-for-index-color"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-check-prerequisites"}, {"phase":"cold","action":"searchable_snapshot","name":"check-not-write-index"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-shard-history-leases"}, {"phase":"cold","action":"searchable_snapshot","name":"check-ts-end-time-passed"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"forcemerge"}, {"phase":"cold","action":"searchable_snapshot","name":"segment-count"}, {"phase":"cold","action":"searchable_snapshot","name":"generate-snapshot-name"}, {"phase":"cold","action":"searchable_snapshot","name":"cleanup-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"create-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-data-tier"}, {"phase":"cold","action":"searchable_snapshot","name":"mount-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-index-color"}, {"phase":"cold","action":"searchable_snapshot","name":"copy-execution-state"}, {"phase":"cold","action":"searchable_snapshot","name":"copy-settings"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-on-datastream-check"}, {"phase":"cold","action":"searchable_snapshot","name":"replace-datastream-backing-index"}, {"phase":"cold","action":"searchable_snapshot","name":"delete"}, {"phase":"cold","action":"searchable_snapshot","name":"swap-aliases"}, {"phase":"cold","action":"complete","name":"complete"}]
    
  6. We then retrieve the min_age of the next phase using the lifecyclePolicyMap (i.e. the new policy) here:

    if (isReadyToTransitionToThisPhase(policy, indexMetadata, currentStep.getNextStepKey().phase())) {

  7. From this point on (i.e. on every retry), since the old step key is in cachedSteps, we will always try to, incorrectly, move to the set_priority step of the warm phase (which doesn't exist anymore).

In short, the issue is that we clear cachedSteps but don't atomically update lifecyclePolicyMap as well. We should be able to solve this by synchronizing/locking both maps to make sure we're updating them atomically. However, that is again more of a "patch" to fix a symptom and begs the question of whether we should consider a different approach.

@gmarouli
Copy link
Contributor

Great job on this analysis @nielsbauman, super helpful.

You are right that adding locking in this should not be done lightly. Before we start exploring different approaches, I wanted to challenge the order of these operations and see if a change there could help.

If I understand you analysis:

  • the cachedSteps is an index->step (thread safe) cache that gets populated by retrieving the steps from lifecyclePolicyMap
  • when we update the lifecyclePolicyMap we also clear the cache so it does not contain outdated data.
  • but we clear the cache before we update the lifecyclePolicyMap which means as you pointed out, that it's possible that the cache will get repopulated with old data and then they will stay there until a different "invalidation" comes along.

What I am thinking is to move clearing cachedSteps to the end after updating the lifecyclePolicyMap. This means that there will be a moment where cachedSteps will contain an outdated entry but soon enough after the update of the lifecyclePolicyMap it will get invalidated, so it will be eventually consistent.

What do you think?

@nielsbauman
Copy link
Contributor

I think that is an excellent solution! Thanks a lot for the great idea :)

I'll run a test to ensure no tests depend on the current behavior (just to be sure) and then I'll open a PR.

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue Apr 15, 2025
The following order of events was possible:
- An ILM policy update cleared `cachedSteps`
- ILM retrieves the step definition for an index, this populates `cachedSteps` with the outdated policy
- The updated policy is put in `lifecyclePolicyMap`

Any subsequent cache retrievals will see the old step definition.

By clearing `cachedSteps` _after_ we update `lifecyclePolicyMap`, we
ensure eventual consistency between the policy and the cache.

Fixes elastic#118406
elasticsearchmachine pushed a commit that referenced this issue Apr 17, 2025
The following order of events was possible:
- An ILM policy update cleared `cachedSteps`
- ILM retrieves the step definition for an index, this populates `cachedSteps` with the outdated policy
- The updated policy is put in `lifecyclePolicyMap`

Any subsequent cache retrievals will see the old step definition.

By clearing `cachedSteps` _after_ we update `lifecyclePolicyMap`, we
ensure eventual consistency between the policy and the cache.

Fixes #118406

(cherry picked from commit 5383f0f)

# Conflicts:
#	muted-tests.yml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management medium-risk An open issue or test failure that is a medium risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants