
Batch ILM policy cluster state updates [#122917] #126529


Merged
lukewhiting merged 8 commits into elastic:main from 122917-batch-ilm-updates on Apr 9, 2025

Conversation

@lukewhiting (Contributor) commented Apr 9, 2025

Switches the ILM PUT lifecycle action to use a batched task queue for executing cluster state updates.

This isn't an optimal solution, as it could use less heap by not creating a new cluster state for each change, but it should greatly speed up execution when performing many ILM updates (such as in tests) without being significantly worse than the current implementation on heap usage.
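
In outline, the batching pattern looks like this (a sketch rather than the PR's exact code; the queue name is an assumption, while createTaskQueue, submitTask, and the UpdateLifecyclePolicyTask constructor arguments mirror the snippets quoted later in this conversation):

// One shared queue lets the master service fold pending put-lifecycle tasks
// into a single published cluster state update.
private final MasterServiceTaskQueue<UpdateLifecyclePolicyTask> taskQueue = clusterService.createTaskQueue(
    "ilm-put-lifecycle",        // queue name (assumed for illustration)
    Priority.NORMAL,            // same priority the unbatched task used
    new IlmLifecycleExecutor()  // the batched executor shown later in this PR
);

// Per request: submit one task. Tasks that arrive while the master is busy
// are executed together in the next batch.
taskQueue.submitTask(
    "put-lifecycle-" + request.getPolicy().getName(),
    new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client),
    null // timeout; see the review thread below about using the task's timeout instead
);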

Future Enhancements:
Refactor the task to not use the ClusterStateAckListener, but instead use a cluster state builder as the input/output of the execute function. This would allow us to combine multiple updates while only using a single builder object.

This would need a fair bit of refactoring, both upstream and downstream of this class (for instance, a new equivalent of SimpleBatchedExecutor that works on builders instead of cluster states).
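
A hypothetical shape for that builder-based executor (purely speculative; BuilderBatchedExecutor does not exist in Elasticsearch, though ClusterStateTaskExecutor, BatchExecutionContext, and the task-context methods used here do):

// Hypothetical sketch: each task mutates a shared builder, so a batch of N
// tasks produces one ClusterState instead of N intermediate ones.
public abstract class BuilderBatchedExecutor<T extends ClusterStateTaskListener> implements ClusterStateTaskExecutor<T> {

    // Tasks apply their change to the shared builder rather than returning
    // a whole new ClusterState.
    public abstract void executeTask(T task, ClusterState.Builder builder) throws Exception;

    @Override
    public ClusterState execute(BatchExecutionContext<T> batchContext) throws Exception {
        ClusterState.Builder builder = ClusterState.builder(batchContext.initialState());
        for (var taskContext : batchContext.taskContexts()) {
            try {
                executeTask(taskContext.getTask(), builder);
                taskContext.success(() -> {}); // ack handling omitted for brevity
            } catch (Exception e) {
                taskContext.onFailure(e);
            }
        }
        return builder.build(); // one state for the whole batch
    }
}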

Fixes #122917

@elasticsearchmachine added the v9.1.0 and needs:triage labels on Apr 9, 2025
@Copilot (Copilot AI) left a comment

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/TransportPutLifecycleAction.java:354

  • The 'logger' reference in the IlmLifecycleExecutor inner class is not clearly declared within its scope; please ensure it is properly defined or referenced from the outer class to avoid compilation issues.
logger.trace("Executed lifecycle policy update:\n{}", task.request.getPolicy().toString());
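
The conventional fix for this kind of concern (a sketch; the PR's actual declaration isn't quoted here) is a static logger on the enclosing transport action, which inner classes such as IlmLifecycleExecutor can reference directly:

// Standard log4j pattern used throughout Elasticsearch: one static logger
// on the outer class, visible to its inner classes.
private static final Logger logger = LogManager.getLogger(TransportPutLifecycleAction.class);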

"put-lifecycle-" + request.getPolicy().getName(),
new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client)
new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client),
null
@lukewhiting (Contributor, Author):

I think an infinite timeout is the right call here, but it would be good to get opinions.

@nielsbauman (Contributor) commented Apr 9, 2025

The current behavior seems to be using request.masterNodeTimeout(). We submit the unbatched task using the timeout from the task here:

createTaskQueue("unbatched", updateTask.priority(), unbatchedExecutor).submitTask(source, updateTask, updateTask.timeout());

and the UpdateLifecyclePolicyTask initializes the timeout here:
this(Priority.NORMAL, request.masterNodeTimeout(), request.ackTimeout(), listener);

@lukewhiting (Contributor, Author):

Right, yes, that makes sense; as it's triggered from an upstream request, we should mirror that. I've switched the code to use the timeout on the task object.
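
The resulting submit call would look roughly like this (approximate; the exact code isn't quoted in this thread, but task.timeout() is the same accessor the unbatched path uses above):

// The task's timeout was initialized from request.masterNodeTimeout(), so
// submitting with task.timeout() mirrors the old unbatched behavior.
taskQueue.submitTask("put-lifecycle-" + request.getPolicy().getName(), task, task.timeout());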

- taskContext.success(() -> taskSucceeded(task, taskResult));
+ Runnable successFunction = () -> taskSucceeded(task, taskResult);
+ if (task instanceof ClusterStateAckListener ackListenerTask) {
+     taskContext.success(successFunction, ackListenerTask);
@lukewhiting (Contributor, Author):

This is to work around the assertion at

// [HISTORICAL NOTE] In the past, tasks executed by the master service would automatically be notified of acks if they implemented

@nielsbauman (Contributor):

I didn't check the rest of your PR properly yet, so maybe this is premature, but we also have SimpleBatchedAckListenerTaskExecutor. Wouldn't that work instead - avoiding these changes?

@lukewhiting (Contributor, Author):

I didn't spot that class, but yes, it works nicely :-) I've refactored the code to use the new class, and the ILM tests seem happy with it.

@nielsbauman (Contributor):

Apologies for not mentioning that before when I suggested using SimpleBatchedExecutor...

@lukewhiting added the :Data Management/ILM+SLM, v8.19.0, and v9.0.1 labels on Apr 9, 2025
@elasticsearchmachine added the Team:Data Management label and removed the needs:triage label on Apr 9, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator):

Hi @lukewhiting, I've created a changelog YAML for you.

@lukewhiting added the needs:triage and auto-backport labels and removed the Team:Data Management label on Apr 9, 2025
@elasticsearchmachine added the Team:Data Management label and removed the needs:triage label on Apr 9, 2025
@elasticsearchmachine (Collaborator):

Hi @lukewhiting, I've updated the changelog YAML for you.

Comment on lines 352 to 356
try {
    return Tuple.tuple(task.execute(clusterState), task);
} catch (Exception e) {
    throw new ElasticsearchException("failed to execute task", e);
}
@nielsbauman (Contributor):

Is there a reason you added the try-catch block here? The SimpleBatchedAckListenerTaskExecutor already catches all exceptions and calls onFailure here:

} catch (Exception e) {
    taskContext.onFailure(e);
}

It looks to me like that's what we want, but maybe I'm missing something.

@lukewhiting (Contributor, Author):

No, just my IDE being fussy and not correctly adding the throws clause to the method it generated for the interface -.- I've updated it now.

Comment on lines +352 to +360
private static class IlmLifecycleExecutor extends SimpleBatchedAckListenerTaskExecutor<UpdateLifecyclePolicyTask> {

    @Override
    public Tuple<ClusterState, ClusterStateAckListener> executeTask(UpdateLifecyclePolicyTask task, ClusterState clusterState)
        throws Exception {
        return Tuple.tuple(task.execute(clusterState), task);
    }

}
@nielsbauman (Contributor):

Since this class is so small now, I think we could also make it a lambda in the constructor, but if you prefer this version, I'm also ok with that - no strong opinion here.

@lukewhiting (Contributor, Author) commented Apr 9, 2025

SimpleBatchedAckListenerTaskExecutor is an abstract class rather than a @FunctionalInterface, so I don't think we can go as simple as a lambda here. We could use an inline anonymous class, but my preference is always a named class for stuff like this, as it improves readability - though I could be persuaded otherwise for something this small...
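
For reference, the anonymous-class alternative would look roughly like this (a sketch; the queue name and wiring are assumptions carried over from the earlier sketch):

// Sketch of the anonymous-class option. Because SimpleBatchedAckListenerTaskExecutor
// is an abstract class rather than a functional interface, a lambda won't compile here.
taskQueue = clusterService.createTaskQueue("ilm-put-lifecycle", Priority.NORMAL,
    new SimpleBatchedAckListenerTaskExecutor<UpdateLifecyclePolicyTask>() {
        @Override
        public Tuple<ClusterState, ClusterStateAckListener> executeTask(UpdateLifecyclePolicyTask task, ClusterState clusterState)
            throws Exception {
            return Tuple.tuple(task.execute(clusterState), task);
        }
    });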

@nielsbauman (Contributor):

Ah yeah, you're right. Then I definitely prefer this named class as well 👍

@nielsbauman (Contributor):

When this change goes in, I'm going to close #111431, #111632, and #111662. Those failures seemed to be caused by slow cluster startup and this PR should reduce that. I'll close them manually as I don't want them to be linked in the changelog - and this PR is only indirectly addressing them at best.

@nielsbauman (Contributor) left a comment

LGTM, thanks for working on this, Luke! An easy but impactful win :) 🚀

@lukewhiting lukewhiting merged commit 7d7fa76 into elastic:main Apr 9, 2025
17 checks passed
@elasticsearchmachine (Collaborator) commented Apr 9, 2025

💔 Backport failed

Branch 8.x: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 126529

@lukewhiting (Contributor, Author):

💚 All backports created successfully

Branch 8.x: backport created successfully

Questions?

Please refer to the Backport tool documentation

elasticsearchmachine pushed a commit that referenced this pull request Apr 10, 2025
* Use a task queue to ensure ILM policy change cluster state updates are batched

* Update docs/changelog/126529.yaml

* Update docs/changelog/126529.yaml

* Switch to using SimpleBatchedAckListenerTaskExecutor

* Get timeout from request

* Ditch the try-catch

(cherry picked from commit 7d7fa76)

# Conflicts:
#	x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/TransportPutLifecycleAction.java
@lukewhiting lukewhiting deleted the 122917-batch-ilm-updates branch April 10, 2025 10:16
Labels
auto-backport, backport pending, :Data Management/ILM+SLM, >enhancement, Team:Data Management, v8.19.0, v9.1.0
Development

Successfully merging this pull request may close these issues.

Batch ILM policy cluster state updates
3 participants