[CI] SearchableSnapshotActionIT testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped failing #118406

elasticsearchmachine · 2024-12-10T23:14:43Z

Build Scans:

Reproduction Line:

gradlew ":x-pack:plugin:ilm:qa:multi-node:javaRestTest" --tests "org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT.testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped" -Dtests.seed=3D05FB81E22D2F83 -Dtests.locale=jgo-CM -Dtests.timezone=Australia/Tasmania -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.AssertionError: 
Expected: is "cold"
     but: was "hot"

Issue Reasons:

[main] 2 failures in test testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped (0.2% fail rate in 925 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2024-12-10T23:15:06Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2025-01-14T05:29:17Z

This has been muted on branch main

Mute Reasons:

[main] 2 failures in test testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped (0.2% fail rate in 925 executions)

Build Scans:

…estUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped #118406

…estUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped elastic#118406

nielsbauman · 2025-03-27T20:29:00Z

This is a duplicate of #94918. I'm keeping this one open to avoid fighting with the test bot.

nielsbauman · 2025-03-27T20:37:22Z

The first build linked in the OP doesn't have any artifacts (i.e. cluster logs) to download - which I'll reach out to Delivery for. The second linked build is unrelated as the failures are caused by PR changes.

Looking at recent build failures on our very own Serverless project: https://es-delivery-stats.elastic.dev/app/r/s/YOHTU, I was able to find a build that does have artifacts included, allowing us to have a look at the cluster logs: https://gradle-enterprise.elastic.co/s/6uowhhwr452v2. I haven't actually looked at the logs yet, I'm just posting what I found for now.

nielsbauman · 2025-03-27T22:05:43Z

After having taken a quick look at the logs, I see repeated logs of

[2025-03-19T10:28:13,136][WARN ][o.e.x.i.MoveToNextStepUpdateTask] [javaRestTest-2] policy [policy-WvYtC] for index [[restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001/1knqUX9uRSmCboOSoImnNg]] failed trying to move from step [{"phase":"hot","action":"complete","name":"complete"}] to step [{"phase":"warm","action":"set_priority","name":"set_priority"}]. java.lang.IllegalArgumentException: step [{"phase":"warm","action":"set_priority","name":"set_priority"}] for index [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] with policy [policy-WvYtC] does not exist
	at [email protected]/org.elasticsearch.xpack.ilm.IndexLifecycleTransition.validateTransition(IndexLifecycleTransition.java:99)
	at [email protected]/org.elasticsearch.xpack.ilm.IndexLifecycleTransition.moveClusterStateToStep(IndexLifecycleTransition.java:127)
	at [email protected]/org.elasticsearch.xpack.ilm.MoveToNextStepUpdateTask.doExecute(MoveToNextStepUpdateTask.java:60)

If I run this test locally, it looks like the restored index is supposed to jump to

[2025-03-27T20:52:31,481][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [restored-.ds-logs-ldmsikfyfm-2025.03.27-000001] moving to step [policy-mgoJF] {"phase":"hot","action":"complete","name":"complete"} -> {"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"}

and skip the updated hot phase.

I'll have another look some other time into why ILM jumped back to the hot phase.

nielsbauman · 2025-03-28T16:55:26Z

After staring at the cluster logs some more, I was able to figure out what went wrong here. It's caused by a very specific order of (concurrent, surprise surprise) events, so I'll do my best to try to explain what happened exactly.

First, what does the test expect? We start off with the following ILM policy:

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0s",
		"actions": {
		  "rollover": {...},
		  "searchable_snapshot": {...}
		}
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "set_priority": {"priority": 999} 
        }
      }
    }
  }
}

We expect ILM to move the index to the complete action of the hot phase, where it will be waiting for the warm phase for 30 days - i.e. not within the timeframe of the test. Then, we change the ILM policy to the following:

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0s",
        "actions": {
          "set_priority": {"priority": 10} 
        }
      },
      "warm": {
        "min_age": "0s",
		"actions": {
		  "shrink": {...},
		  "force_merge": {...}
		}
      },
      "cold": {
        "min_age": "0s",
		"actions": {
		  "searchable_snapshot": {...}
		}
      },
    }
  }
}

Looking at the cluster logs from the build scan I linked in #118406 (comment), we see the following log:

[2025-03-19T10:28:13,119][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-2] [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] updated policy [policy-WvYtC] has different phase step keys and will NOT refresh phase definition as it differs too greatly. old ...

This is what we want - we want ILM to skip the updated hot phase and just proceed to the warm phase which now has min_age=0s (instead of the 30d from before). Then we see some logs that ILM is updating the policy in some places (i.e. the cache and policy registry) without abnormalities.

However, then we see the following log:

[2025-03-19T10:28:13,133][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-2] [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] moving to step [policy-WvYtC] {"phase":"hot","action":"complete","name":"complete"} -> {"phase":"warm","action":"set_priority","name":"set_priority"}

While we do expect the index to move to the warm phase, ILM moved it to the set_priority action. With the policy update, we moved the set_priority action from the warm phase to the hot phase. That's why ILM is logging the error I posted in #118406 (comment) about the unknown step. Instead, we expect ILM to move the index to the unfollow action of the warm phase - which I confirmed in the logs of a successful local run of this test.

Looking at the code, here's the order of events:

We fetch the current step key for the index here:

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java

Line 182 in 67798dd

    
           currentStep = getCurrentStep(stepRegistry, policy, indexMetadata, lifecycleState);

which corresponds with the following logs:

[2025-03-19T10:28:13,132][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-2] [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] retrieved current step key: {"phase":"hot","action":"complete","name":"complete"}
[2025-03-19T10:28:13,133][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-2] parsed steps for policy [policy-WvYtC] in phase [hot], definition: [{"policy":"policy-WvYtC","phase_definition":{"min_age":"0ms","actions":{"rollover":{"max_docs":1},"searchable_snapshot":{"snapshot_repository":"cTUTRWgWBVkC","force_merge_index":true}}},"version":1,"modified_date_in_millis":1742380088938}], steps: [[{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"} => null, {"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"} => {"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"} => {"phase":"hot","action":"unfollow","name":"pause-follower-index"}, {"phase":"hot","action":"unfollow","name":"pause-follower-index"} => {"phase":"hot","action":"unfollow","name":"close-follower-index"}, {"phase":"hot","action":"unfollow","name":"close-follower-index"} => {"phase":"hot","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"hot","action":"unfollow","name":"unfollow-follower-index"} => {"phase":"hot","action":"unfollow","name":"open-follower-index"}, {"phase":"hot","action":"unfollow","name":"open-follower-index"} => {"phase":"hot","action":"unfollow","name":"wait-for-index-color"}, {"phase":"hot","action":"unfollow","name":"wait-for-index-color"} => {"phase":"hot","action":"rollover","name":"check-rollover-ready"}, {"phase":"hot","action":"rollover","name":"check-rollover-ready"} => {"phase":"hot","action":"rollover","name":"attempt-rollover"}, {"phase":"hot","action":"rollover","name":"attempt-rollover"} => {"phase":"hot","action":"rollover","name":"wait-for-active-shards"}, {"phase":"hot","action":"rollover","name":"wait-for-active-shards"} => {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}, {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"} => {"phase":"hot","action":"rollover","name":"set-indexing-complete"}, {"phase":"hot","action":"rollover","name":"set-indexing-complete"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-check-prerequisites"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-check-prerequisites"} => null, {"phase":"hot","action":"searchable_snapshot","name":"check-not-write-index"} => {"phase":"hot","action":"searchable_snapshot","name":"wait-for-shard-history-leases"}, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-shard-history-leases"} => {"phase":"hot","action":"searchable_snapshot","name":"check-ts-end-time-passed"}, {"phase":"hot","action":"searchable_snapshot","name":"check-ts-end-time-passed"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-check-existing-snapshot"} => null, {"phase":"hot","action":"searchable_snapshot","name":"forcemerge"} => {"phase":"hot","action":"searchable_snapshot","name":"segment-count"}, {"phase":"hot","action":"searchable_snapshot","name":"segment-count"} => {"phase":"hot","action":"searchable_snapshot","name":"generate-snapshot-name"}, {"phase":"hot","action":"searchable_snapshot","name":"generate-snapshot-name"} => {"phase":"hot","action":"searchable_snapshot","name":"cleanup-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"cleanup-snapshot"} => {"phase":"hot","action":"searchable_snapshot","name":"create-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"create-snapshot"} => null, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-data-tier"} => {"phase":"hot","action":"searchable_snapshot","name":"mount-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"mount-snapshot"} => {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"}, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"} => {"phase":"hot","action":"searchable_snapshot","name":"copy-execution-state"}, {"phase":"hot","action":"searchable_snapshot","name":"copy-execution-state"} => {"phase":"hot","action":"searchable_snapshot","name":"copy-settings"}, {"phase":"hot","action":"searchable_snapshot","name":"copy-settings"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-on-datastream-check"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-on-datastream-check"} => null, {"phase":"hot","action":"searchable_snapshot","name":"replace-datastream-backing-index"} => {"phase":"hot","action":"searchable_snapshot","name":"delete"}, {"phase":"hot","action":"searchable_snapshot","name":"delete"} => null, {"phase":"hot","action":"searchable_snapshot","name":"swap-aliases"} => null, {"phase":"hot","action":"complete","name":"complete"} => {"phase":"warm","action":"set_priority","name":"set_priority"}]]

Note that the second log line proves that it fetches the old policy

The policy is updated in the policy cache:

[2025-03-19T10:28:13,133][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-2] updating cached steps for [policy-WvYtC] policy, new steps: [{"phase":"new","action":"init","name":"init"}, {"phase":"new","action":"complete","name":"complete"}, {"phase":"hot","action":"set_priority","name":"set_priority"}, {"phase":"hot","action":"complete","name":"complete"}, {"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"warm","action":"unfollow","name":"pause-follower-index"}, {"phase":"warm","action":"unfollow","name":"close-follower-index"}, {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"warm","action":"unfollow","name":"open-follower-index"}, {"phase":"warm","action":"unfollow","name":"wait-for-index-color"}, {"phase":"warm","action":"migrate","name":"branch-check-skip-action"}, {"phase":"warm","action":"migrate","name":"migrate"}, {"phase":"warm","action":"migrate","name":"check-migration"}, {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"}, {"phase":"warm","action":"shrink","name":"check-not-write-index"}, {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"}, {"phase":"warm","action":"shrink","name":"check-ts-end-time-passed"}, {"phase":"warm","action":"shrink","name":"readonly"}, {"phase":"warm","action":"shrink","name":"check-target-shards-count"}, {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"}, {"phase":"warm","action":"shrink","name":"generate-index-name"}, {"phase":"warm","action":"shrink","name":"set-single-node-allocation"}, {"phase":"warm","action":"shrink","name":"check-shrink-allocation"}, {"phase":"warm","action":"shrink","name":"shrink"}, {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"}, {"phase":"warm","action":"shrink","name":"copy-execution-state"}, {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"}, {"phase":"warm","action":"shrink","name":"aliases"}, {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"shrink","name":"replace-datastream-backing-index"}, {"phase":"warm","action":"shrink","name":"delete"}, {"phase":"warm","action":"forcemerge","name":"branch-forcemerge-check-prerequisites"}, {"phase":"warm","action":"forcemerge","name":"check-not-write-index"}, {"phase":"warm","action":"forcemerge","name":"check-ts-end-time-passed"}, {"phase":"warm","action":"forcemerge","name":"readonly"}, {"phase":"warm","action":"forcemerge","name":"forcemerge"}, {"phase":"warm","action":"forcemerge","name":"segment-count"}, {"phase":"warm","action":"complete","name":"complete"}, {"phase":"cold","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"cold","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"cold","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"cold","action":"unfollow","name":"pause-follower-index"}, {"phase":"cold","action":"unfollow","name":"close-follower-index"}, {"phase":"cold","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"cold","action":"unfollow","name":"open-follower-index"}, {"phase":"cold","action":"unfollow","name":"wait-for-index-color"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-check-prerequisites"}, {"phase":"cold","action":"searchable_snapshot","name":"check-not-write-index"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-shard-history-leases"}, {"phase":"cold","action":"searchable_snapshot","name":"check-ts-end-time-passed"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"forcemerge"}, {"phase":"cold","action":"searchable_snapshot","name":"segment-count"}, {"phase":"cold","action":"searchable_snapshot","name":"generate-snapshot-name"}, {"phase":"cold","action":"searchable_snapshot","name":"cleanup-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"create-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-data-tier"}, {"phase":"cold","action":"searchable_snapshot","name":"mount-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-index-color"}, {"phase":"cold","action":"searchable_snapshot","name":"copy-execution-state"}, {"phase":"cold","action":"searchable_snapshot","name":"copy-settings"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-on-datastream-check"}, {"phase":"cold","action":"searchable_snapshot","name":"replace-datastream-backing-index"}, {"phase":"cold","action":"searchable_snapshot","name":"delete"}, {"phase":"cold","action":"searchable_snapshot","name":"swap-aliases"}, {"phase":"cold","action":"complete","name":"complete"}]

We check the min_age of the warm phase here:

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java

Line 230 in 67798dd

if (isReadyToTransitionToThisPhase(policy, indexMetadata, currentStep.getNextStepKey().phase())) {

Note: this fetches the latest policy from the cache, which has now been updated
Since the min_age of the new warm phase is used (which is 0s), we proceed to the next step here:

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java

Line 231 in 67798dd

moveToStep(indexMetadata.getIndex(), policy, currentStep.getKey(), currentStep.getNextStepKey());

However, this uses the currentStep.getNextStepKey(), which was obtained from before the policy update and thus points to the next step in the old warm phase (i.e. set_priority)

I changed the label to medium-risk because this is an actual bug and could cause issues in production clusters. I think I have some ideas for fixes, but I'll take some time to come up with the best one and possibly first discuss with the team before we proceed with a fix.

nielsbauman · 2025-04-03T10:29:51Z

I looked some more into this, and my conclusion about what happens here changed a little bit. The fact that we move to a step that does not exist (anymore) is not necessarily problematic. When that happens, we simply execute the last step again (i.e. the complete step) and "recompute" (spoiler alert: not really, see below) the next step and try to move the index to that step again. However, the actual bug here lies a little deeper.

Here is the order of events as I currently understand them:

We start the lifecycle policy update process:

[2025-03-19T10:28:13,118][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-2] updating index lifecycle policy [policy-WvYtC]
As part of the policy update process, we clear the PolicyStepsRegistry.cachedSteps:

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/PolicyStepsRegistry.java

Line 109 in 9b353f6

cachedSteps.clear();

Then we start running the periodic step for the index in question:

[2025-03-19T10:28:13,132][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-2] [restored-.ds-logs-bgqwsrqkxy-2025.03.19-000001] retrieved current step key: {"phase":"hot","action":"complete","name":"complete"}
[2025-03-19T10:28:13,133][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-2] parsed steps for policy [policy-WvYtC] in phase [hot], definition: [{"policy":"policy-WvYtC","phase_definition":{"min_age":"0ms","actions":{"rollover":{"max_docs":1},"searchable_snapshot":{"snapshot_repository":"cTUTRWgWBVkC","force_merge_index":true}}},"version":1,"modified_date_in_millis":1742380088938}], steps: [[{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"} => null, {"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"} => {"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"} => {"phase":"hot","action":"unfollow","name":"pause-follower-index"}, {"phase":"hot","action":"unfollow","name":"pause-follower-index"} => {"phase":"hot","action":"unfollow","name":"close-follower-index"}, {"phase":"hot","action":"unfollow","name":"close-follower-index"} => {"phase":"hot","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"hot","action":"unfollow","name":"unfollow-follower-index"} => {"phase":"hot","action":"unfollow","name":"open-follower-index"}, {"phase":"hot","action":"unfollow","name":"open-follower-index"} => {"phase":"hot","action":"unfollow","name":"wait-for-index-color"}, {"phase":"hot","action":"unfollow","name":"wait-for-index-color"} => {"phase":"hot","action":"rollover","name":"check-rollover-ready"}, {"phase":"hot","action":"rollover","name":"check-rollover-ready"} => {"phase":"hot","action":"rollover","name":"attempt-rollover"}, {"phase":"hot","action":"rollover","name":"attempt-rollover"} => {"phase":"hot","action":"rollover","name":"wait-for-active-shards"}, {"phase":"hot","action":"rollover","name":"wait-for-active-shards"} => {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}, {"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"} => {"phase":"hot","action":"rollover","name":"set-indexing-complete"}, {"phase":"hot","action":"rollover","name":"set-indexing-complete"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-check-prerequisites"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-check-prerequisites"} => null, {"phase":"hot","action":"searchable_snapshot","name":"check-not-write-index"} => {"phase":"hot","action":"searchable_snapshot","name":"wait-for-shard-history-leases"}, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-shard-history-leases"} => {"phase":"hot","action":"searchable_snapshot","name":"check-ts-end-time-passed"}, {"phase":"hot","action":"searchable_snapshot","name":"check-ts-end-time-passed"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-check-existing-snapshot"} => null, {"phase":"hot","action":"searchable_snapshot","name":"forcemerge"} => {"phase":"hot","action":"searchable_snapshot","name":"segment-count"}, {"phase":"hot","action":"searchable_snapshot","name":"segment-count"} => {"phase":"hot","action":"searchable_snapshot","name":"generate-snapshot-name"}, {"phase":"hot","action":"searchable_snapshot","name":"generate-snapshot-name"} => {"phase":"hot","action":"searchable_snapshot","name":"cleanup-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"cleanup-snapshot"} => {"phase":"hot","action":"searchable_snapshot","name":"create-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"create-snapshot"} => null, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-data-tier"} => {"phase":"hot","action":"searchable_snapshot","name":"mount-snapshot"}, {"phase":"hot","action":"searchable_snapshot","name":"mount-snapshot"} => {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"}, {"phase":"hot","action":"searchable_snapshot","name":"wait-for-index-color"} => {"phase":"hot","action":"searchable_snapshot","name":"copy-execution-state"}, {"phase":"hot","action":"searchable_snapshot","name":"copy-execution-state"} => {"phase":"hot","action":"searchable_snapshot","name":"copy-settings"}, {"phase":"hot","action":"searchable_snapshot","name":"copy-settings"} => {"phase":"hot","action":"searchable_snapshot","name":"branch-on-datastream-check"}, {"phase":"hot","action":"searchable_snapshot","name":"branch-on-datastream-check"} => null, {"phase":"hot","action":"searchable_snapshot","name":"replace-datastream-backing-index"} => {"phase":"hot","action":"searchable_snapshot","name":"delete"}, {"phase":"hot","action":"searchable_snapshot","name":"delete"} => null, {"phase":"hot","action":"searchable_snapshot","name":"swap-aliases"} => null, {"phase":"hot","action":"complete","name":"complete"} => {"phase":"warm","action":"set_priority","name":"set_priority"}]]

Since we cleared cachedSteps in step 2, we have to parse the steps again (see the second log above).

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/PolicyStepsRegistry.java

Line 389 in 9b353f6

phaseSteps = parseStepsFromPhase(policyName, phase, phaseJson);

We do that based on lifecyclePolicyMap (which at this point still contains the old policy) here

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/PolicyStepsRegistry.java

Line 295 in 9b353f6

LifecyclePolicyMetadata policyMetadata = lifecyclePolicyMap.get(policy);

After parsing the steps, we put the parsed steps back into cachedSteps

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/PolicyStepsRegistry.java

Line 411 in 9b353f6

cachedSteps.put(indexMetadata.getIndex(), Tuple.tuple(indexMetadata, s));

This means the old definition of the step key {"phase":"hot","action":"complete","name":"complete"} is stored in cachedSteps, which still points to {"phase":"warm","action":"set_priority","name":"set_priority"} as the next step.

Then, we put the updated policy in lifecyclePolicyMap as part of the policy update process (i.e. a continuation of step 1)

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/PolicyStepsRegistry.java

Line 148 in 9b353f6

lifecyclePolicyMap.put(policyMetadata.getName(), policyMetadata);

[2025-03-19T10:28:13,133][TRACE][o.e.x.i.PolicyStepsRegistry] [javaRestTest-2] updating cached steps for [policy-WvYtC] policy, new steps: [{"phase":"new","action":"init","name":"init"}, {"phase":"new","action":"complete","name":"complete"}, {"phase":"hot","action":"set_priority","name":"set_priority"}, {"phase":"hot","action":"complete","name":"complete"}, {"phase":"warm","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"warm","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"warm","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"warm","action":"unfollow","name":"pause-follower-index"}, {"phase":"warm","action":"unfollow","name":"close-follower-index"}, {"phase":"warm","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"warm","action":"unfollow","name":"open-follower-index"}, {"phase":"warm","action":"unfollow","name":"wait-for-index-color"}, {"phase":"warm","action":"migrate","name":"branch-check-skip-action"}, {"phase":"warm","action":"migrate","name":"migrate"}, {"phase":"warm","action":"migrate","name":"check-migration"}, {"phase":"warm","action":"shrink","name":"branch-check-prerequisites"}, {"phase":"warm","action":"shrink","name":"check-not-write-index"}, {"phase":"warm","action":"shrink","name":"wait-for-shard-history-leases"}, {"phase":"warm","action":"shrink","name":"check-ts-end-time-passed"}, {"phase":"warm","action":"shrink","name":"readonly"}, {"phase":"warm","action":"shrink","name":"check-target-shards-count"}, {"phase":"warm","action":"shrink","name":"cleanup-shrink-index"}, {"phase":"warm","action":"shrink","name":"generate-index-name"}, {"phase":"warm","action":"shrink","name":"set-single-node-allocation"}, {"phase":"warm","action":"shrink","name":"check-shrink-allocation"}, {"phase":"warm","action":"shrink","name":"shrink"}, {"phase":"warm","action":"shrink","name":"shrunk-shards-allocated"}, {"phase":"warm","action":"shrink","name":"copy-execution-state"}, {"phase":"warm","action":"shrink","name":"branch-on-datastream-check"}, {"phase":"warm","action":"shrink","name":"aliases"}, {"phase":"warm","action":"shrink","name":"is-shrunken-index"}, {"phase":"warm","action":"shrink","name":"replace-datastream-backing-index"}, {"phase":"warm","action":"shrink","name":"delete"}, {"phase":"warm","action":"forcemerge","name":"branch-forcemerge-check-prerequisites"}, {"phase":"warm","action":"forcemerge","name":"check-not-write-index"}, {"phase":"warm","action":"forcemerge","name":"check-ts-end-time-passed"}, {"phase":"warm","action":"forcemerge","name":"readonly"}, {"phase":"warm","action":"forcemerge","name":"forcemerge"}, {"phase":"warm","action":"forcemerge","name":"segment-count"}, {"phase":"warm","action":"complete","name":"complete"}, {"phase":"cold","action":"unfollow","name":"branch-check-unfollow-prerequisites"}, {"phase":"cold","action":"unfollow","name":"wait-for-indexing-complete"}, {"phase":"cold","action":"unfollow","name":"wait-for-follow-shard-tasks"}, {"phase":"cold","action":"unfollow","name":"pause-follower-index"}, {"phase":"cold","action":"unfollow","name":"close-follower-index"}, {"phase":"cold","action":"unfollow","name":"unfollow-follower-index"}, {"phase":"cold","action":"unfollow","name":"open-follower-index"}, {"phase":"cold","action":"unfollow","name":"wait-for-index-color"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-check-prerequisites"}, {"phase":"cold","action":"searchable_snapshot","name":"check-not-write-index"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-shard-history-leases"}, {"phase":"cold","action":"searchable_snapshot","name":"check-ts-end-time-passed"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-check-existing-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"forcemerge"}, {"phase":"cold","action":"searchable_snapshot","name":"segment-count"}, {"phase":"cold","action":"searchable_snapshot","name":"generate-snapshot-name"}, {"phase":"cold","action":"searchable_snapshot","name":"cleanup-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"create-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-data-tier"}, {"phase":"cold","action":"searchable_snapshot","name":"mount-snapshot"}, {"phase":"cold","action":"searchable_snapshot","name":"wait-for-index-color"}, {"phase":"cold","action":"searchable_snapshot","name":"copy-execution-state"}, {"phase":"cold","action":"searchable_snapshot","name":"copy-settings"}, {"phase":"cold","action":"searchable_snapshot","name":"branch-on-datastream-check"}, {"phase":"cold","action":"searchable_snapshot","name":"replace-datastream-backing-index"}, {"phase":"cold","action":"searchable_snapshot","name":"delete"}, {"phase":"cold","action":"searchable_snapshot","name":"swap-aliases"}, {"phase":"cold","action":"complete","name":"complete"}]

We then retrieve the min_age of the next phase using the lifecyclePolicyMap (i.e. the new policy) here:

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleRunner.java

Line 230 in 9b353f6

if (isReadyToTransitionToThisPhase(policy, indexMetadata, currentStep.getNextStepKey().phase())) {
From this point on (i.e. on every retry), since the old step key is in cachedSteps, we will always try to, incorrectly, move to the set_priority step of the warm phase (which doesn't exist anymore).

In short, the issue is that we clear cachedSteps but don't atomically update lifecyclePolicyMap as well. We should be able to solve this by synchronizing/locking both maps to make sure we're updating them atomically. However, that is again more of a "patch" to fix a symptom and begs the question of whether we should consider a different approach.

gmarouli · 2025-04-14T07:50:28Z

Great job on this analysis @nielsbauman, super helpful.

You are right that adding locking in this should not be done lightly. Before we start exploring different approaches, I wanted to challenge the order of these operations and see if a change there could help.

If I understand you analysis:

the cachedSteps is an index->step (thread safe) cache that gets populated by retrieving the steps from lifecyclePolicyMap
when we update the lifecyclePolicyMap we also clear the cache so it does not contain outdated data.
but we clear the cache before we update the lifecyclePolicyMap which means as you pointed out, that it's possible that the cache will get repopulated with old data and then they will stay there until a different "invalidation" comes along.

What I am thinking is to move clearing cachedSteps to the end after updating the lifecyclePolicyMap. This means that there will be a moment where cachedSteps will contain an outdated entry but soon enough after the update of the lifecyclePolicyMap it will get invalidated, so it will be eventually consistent.

What do you think?

nielsbauman · 2025-04-15T09:19:01Z

I think that is an excellent solution! Thanks a lot for the great idea :)

I'll run a test to ensure no tests depend on the current behavior (just to be sure) and then I'll open a PR.

The following order of events was possible: - An ILM policy update cleared `cachedSteps` - ILM retrieves the step definition for an index, this populates `cachedSteps` with the outdated policy - The updated policy is put in `lifecyclePolicyMap` Any subsequent cache retrievals will see the old step definition. By clearing `cachedSteps` _after_ we update `lifecyclePolicyMap`, we ensure eventual consistency between the policy and the cache. Fixes elastic#118406

The following order of events was possible: - An ILM policy update cleared `cachedSteps` - ILM retrieves the step definition for an index, this populates `cachedSteps` with the outdated policy - The updated policy is put in `lifecyclePolicyMap` Any subsequent cache retrievals will see the old step definition. By clearing `cachedSteps` _after_ we update `lifecyclePolicyMap`, we ensure eventual consistency between the policy and the cache. Fixes #118406 (cherry picked from commit 5383f0f) # Conflicts: # muted-tests.yml

elasticsearchmachine added :Data Management/ILM+SLM Index and Snapshot lifecycle management >test-failure Triaged test failures from CI labels Dec 10, 2024

elasticsearchmachine added Team:Data Management Meta label for data/management team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Dec 10, 2024

mattc58 assigned samxbr Dec 12, 2024

dakrone added low-risk An open issue or test failure that is a low risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Dec 17, 2024

elasticsearchmachine added a commit that referenced this issue Jan 14, 2025

Mute org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT t…

476797e

…estUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped #118406

martijnvg pushed a commit to martijnvg/elasticsearch that referenced this issue Jan 14, 2025

Mute org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT t…

fb1b693

…estUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped elastic#118406

nielsbauman added medium-risk An open issue or test failure that is a medium risk to future releases and removed low-risk An open issue or test failure that is a low risk to future releases labels Mar 28, 2025

nielsbauman assigned nielsbauman and unassigned samxbr Mar 28, 2025

nielsbauman mentioned this issue Apr 15, 2025

Fix PolicyStepsRegistry cache concurrency issue #126840

Merged

nielsbauman closed this as completed in #126840 Apr 16, 2025

nielsbauman closed this as completed in 5383f0f Apr 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] SearchableSnapshotActionIT testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped failing #118406

[CI] SearchableSnapshotActionIT testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped failing #118406

elasticsearchmachine commented Dec 10, 2024 •

edited

Loading

elasticsearchmachine commented Dec 10, 2024

elasticsearchmachine commented Jan 14, 2025

nielsbauman commented Mar 27, 2025

nielsbauman commented Mar 27, 2025

nielsbauman commented Mar 27, 2025

nielsbauman commented Mar 28, 2025

nielsbauman commented Apr 3, 2025

gmarouli commented Apr 14, 2025

nielsbauman commented Apr 15, 2025

[CI] SearchableSnapshotActionIT testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped failing #118406

[CI] SearchableSnapshotActionIT testUpdatePolicyToAddPhasesYieldsInvalidActionsToBeSkipped failing #118406

Comments

elasticsearchmachine commented Dec 10, 2024 • edited Loading

elasticsearchmachine commented Dec 10, 2024

elasticsearchmachine commented Jan 14, 2025

nielsbauman commented Mar 27, 2025

nielsbauman commented Mar 27, 2025

nielsbauman commented Mar 27, 2025

nielsbauman commented Mar 28, 2025

nielsbauman commented Apr 3, 2025

gmarouli commented Apr 14, 2025

nielsbauman commented Apr 15, 2025

elasticsearchmachine commented Dec 10, 2024 •

edited

Loading