
Batch ILM policy cluster state updates [#122917] #126529


Merged
lukewhiting merged 8 commits into elastic:main from 122917-batch-ilm-updates on Apr 9, 2025

Conversation

@lukewhiting (Contributor) commented Apr 9, 2025

Switches the ILM PUT lifecycle action to use a batched task queue for executing cluster state updates.

This isn't an optimal solution, as it could use less heap by not creating a new cluster state for each change, but it should greatly speed up execution when performing many ILM updates (such as in tests) without being significantly worse than the current implementation on heap usage.
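
In outline, the batching pattern looks like this (a sketch rather than the PR's exact code; the queue name is an assumption, while createTaskQueue, submitTask, and the UpdateLifecyclePolicyTask constructor arguments mirror the snippets quoted later in this conversation):

// One shared queue lets the master service fold pending put-lifecycle tasks
// into a single published cluster state update.
private final MasterServiceTaskQueue<UpdateLifecyclePolicyTask> taskQueue = clusterService.createTaskQueue(
    "ilm-put-lifecycle",        // queue name (assumed for illustration)
    Priority.NORMAL,            // same priority the unbatched task used
    new IlmLifecycleExecutor()  // the batched executor shown later in this PR
);

// Per request: submit one task. Tasks that arrive while the master is busy
// are executed together in the next batch.
taskQueue.submitTask(
    "put-lifecycle-" + request.getPolicy().getName(),
    new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client),
    null // timeout; see the review thread below about using the task's timeout instead
);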

Future Enhancements:
Refactor the task to not use the ClusterStateAckListener, but instead use a cluster state builder as the input/output of the execute function. This would allow us to combine multiple updates while only using a single builder object.

This would need a fair bit of refactoring, both upstream and downstream of this class (for instance, a new equivalent of SimpleBatchedExecutor that works on builders instead of cluster states).
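
A hypothetical shape for that builder-based executor (purely speculative; BuilderBatchedExecutor does not exist in Elasticsearch, though ClusterStateTaskExecutor, BatchExecutionContext, and the task-context methods used here do):

// Hypothetical sketch: each task mutates a shared builder, so a batch of N
// tasks produces one ClusterState instead of N intermediate ones.
public abstract class BuilderBatchedExecutor<T extends ClusterStateTaskListener> implements ClusterStateTaskExecutor<T> {

    // Tasks apply their change to the shared builder rather than returning
    // a whole new ClusterState.
    public abstract void executeTask(T task, ClusterState.Builder builder) throws Exception;

    @Override
    public ClusterState execute(BatchExecutionContext<T> batchContext) throws Exception {
        ClusterState.Builder builder = ClusterState.builder(batchContext.initialState());
        for (var taskContext : batchContext.taskContexts()) {
            try {
                executeTask(taskContext.getTask(), builder);
                taskContext.success(() -> {}); // ack handling omitted for brevity
            } catch (Exception e) {
                taskContext.onFailure(e);
            }
        }
        return builder.build(); // one state for the whole batch
    }
}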

Fixes #122917

@elasticsearchmachine added the v9.1.0 and needs:triage labels on Apr 9, 2025
@Copilot (Copilot AI) left a comment

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/TransportPutLifecycleAction.java:354

  • The 'logger' reference in the IlmLifecycleExecutor inner class is not clearly declared within its scope; please ensure it is properly defined or referenced from the outer class to avoid compilation issues.
logger.trace("Executed lifecycle policy update:\n{}", task.request.getPolicy().toString());
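
The conventional fix for this kind of concern (a sketch; the PR's actual declaration isn't quoted here) is a static logger on the enclosing transport action, which inner classes such as IlmLifecycleExecutor can reference directly:

// Standard log4j pattern used throughout Elasticsearch: one static logger
// on the outer class, visible to its inner classes.
private static final Logger logger = LogManager.getLogger(TransportPutLifecycleAction.class);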

"put-lifecycle-" + request.getPolicy().getName(),
new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client)
new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client),
null
@lukewhiting (Contributor, Author):

I think an infinite timeout is the right call here, but it would be good to get opinions.

@nielsbauman (Contributor) commented Apr 9, 2025

The current behavior seems to be using request.masterNodeTimeout(). We submit the unbatched task using the timeout from the task here:

createTaskQueue("unbatched", updateTask.priority(), unbatchedExecutor).submitTask(source, updateTask, updateTask.timeout());

and the UpdateLifecyclePolicyTask initializes the timeout here:
this(Priority.NORMAL, request.masterNodeTimeout(), request.ackTimeout(), listener);

@lukewhiting (Contributor, Author):

Right, yes, that makes sense; as it's triggered from an upstream request, we should mirror that. I've switched the code to use the timeout on the task object.
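
The resulting submit call would look roughly like this (approximate; the exact code isn't quoted in this thread, but task.timeout() is the same accessor the unbatched path uses above):

// The task's timeout was initialized from request.masterNodeTimeout(), so
// submitting with task.timeout() mirrors the old unbatched behavior.
taskQueue.submitTask("put-lifecycle-" + request.getPolicy().getName(), task, task.timeout());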

- taskContext.success(() -> taskSucceeded(task, taskResult));
+ Runnable successFunction = () -> taskSucceeded(task, taskResult);
+ if (task instanceof ClusterStateAckListener ackListenerTask) {
+     taskContext.success(successFunction, ackListenerTask);
@lukewhiting (Contributor, Author):

This is to work around the assertion at

// [HISTORICAL NOTE] In the past, tasks executed by the master service would automatically be notified of acks if they implemented

@nielsbauman (Contributor):

I didn't check the rest of your PR properly yet, so maybe this is premature, but we also have SimpleBatchedAckListenerTaskExecutor. Wouldn't that work instead - avoiding these changes?

@lukewhiting (Contributor, Author):

I didn't spot that class, but yes, it works nicely :-) I've refactored the code to use the new class, and the ILM tests seem happy with it.

@nielsbauman (Contributor):

Apologies for not mentioning that before when I suggested using SimpleBatchedExecutor...

@lukewhiting added the :Data Management/ILM+SLM, v8.19.0, and v9.0.1 labels on Apr 9, 2025
@elasticsearchmachine added the Team:Data Management label and removed the needs:triage label on Apr 9, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator):

Hi @lukewhiting, I've created a changelog YAML for you.

@lukewhiting added the needs:triage and auto-backport labels and removed the Team:Data Management label on Apr 9, 2025
@elasticsearchmachine added the Team:Data Management label and removed the needs:triage label on Apr 9, 2025
@elasticsearchmachine (Collaborator):

Hi @lukewhiting, I've updated the changelog YAML for you.

Comment on lines 352 to 356
try {
    return Tuple.tuple(task.execute(clusterState), task);
} catch (Exception e) {
    throw new ElasticsearchException("failed to execute task", e);
}
@nielsbauman (Contributor):

Is there a reason you added the try-catch block here? The SimpleBatchedAckListenerTaskExecutor already catches all exceptions and calls onFailure here:

} catch (Exception e) {
    taskContext.onFailure(e);
}

It looks to me like that's what we want, but maybe I'm missing something.

@lukewhiting (Contributor, Author):

No, just my IDE being fussy and not correctly adding the throws clause to the method it generated for the interface -.- I've updated it now.

Comment on lines +352 to +360
private static class IlmLifecycleExecutor extends SimpleBatchedAckListenerTaskExecutor<UpdateLifecyclePolicyTask> {

    @Override
    public Tuple<ClusterState, ClusterStateAckListener> executeTask(UpdateLifecyclePolicyTask task, ClusterState clusterState)
        throws Exception {
        return Tuple.tuple(task.execute(clusterState), task);
    }

}
@nielsbauman (Contributor):

Since this class is so small now, I think we could also make it a lambda in the constructor, but if you prefer this version, I'm also ok with that - no strong opinion here.

@lukewhiting (Contributor, Author) commented Apr 9, 2025

SimpleBatchedAckListenerTaskExecutor is an abstract class rather than a @FunctionalInterface, so I don't think we can go as simple as a lambda here. We could use an inline anonymous class, but my preference is always a named class for stuff like this, as it improves readability - though I could be persuaded otherwise for something this small...
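
For reference, the anonymous-class alternative would look roughly like this (a sketch; the queue name and wiring are assumptions carried over from the earlier sketch):

// Sketch of the anonymous-class option. Because SimpleBatchedAckListenerTaskExecutor
// is an abstract class rather than a functional interface, a lambda won't compile here.
taskQueue = clusterService.createTaskQueue("ilm-put-lifecycle", Priority.NORMAL,
    new SimpleBatchedAckListenerTaskExecutor<UpdateLifecyclePolicyTask>() {
        @Override
        public Tuple<ClusterState, ClusterStateAckListener> executeTask(UpdateLifecyclePolicyTask task, ClusterState clusterState)
            throws Exception {
            return Tuple.tuple(task.execute(clusterState), task);
        }
    });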

@nielsbauman (Contributor):

Ah yeah, you're right. Then I definitely prefer this named class as well 👍

@nielsbauman (Contributor):

When this change goes in, I'm going to close #111431, #111632, and #111662. Those failures seemed to be caused by slow cluster startup and this PR should reduce that. I'll close them manually as I don't want them to be linked in the changelog - and this PR is only indirectly addressing them at best.

@nielsbauman (Contributor) left a comment

LGTM, thanks for working on this, Luke! An easy but impactful win :) 🚀

@lukewhiting lukewhiting merged commit 7d7fa76 into elastic:main Apr 9, 2025
17 checks passed
@elasticsearchmachine (Collaborator) commented Apr 9, 2025

💔 Backport failed

Branch 8.x: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 126529

@lukewhiting (Contributor, Author):

💚 All backports created successfully

Branch 8.x: backport created successfully

Questions?

Please refer to the Backport tool documentation

elasticsearchmachine pushed a commit that referenced this pull request Apr 10, 2025
* Use a task queue to ensure ILM policy change cluster state updates are batched

* Update docs/changelog/126529.yaml

* Update docs/changelog/126529.yaml

* Switch to using SimpleBatchedAckListenerTaskExecutor

* Get timeout from request

* Ditch the try-catch

(cherry picked from commit 7d7fa76)

# Conflicts:
#	x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/TransportPutLifecycleAction.java
@lukewhiting lukewhiting deleted the 122917-batch-ilm-updates branch April 10, 2025 10:16
Labels
auto-backport, backport pending, :Data Management/ILM+SLM, >enhancement, Team:Data Management, v8.19.0, v9.1.0
Development

Successfully merging this pull request may close these issues.

Batch ILM policy cluster state updates
3 participants