-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Batch ILM policy cluster state updates [#122917] #126529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Batch ILM policy cluster state updates [#122917] #126529
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/TransportPutLifecycleAction.java:354
- The 'logger' reference in the IlmLifecycleExecutor inner class is not clearly declared within its scope; please ensure it is properly defined or referenced from the outer class to avoid compilation issues.
logger.trace("Executed lifecycle policy update:\n{}", task.request.getPolicy().toString());
"put-lifecycle-" + request.getPolicy().getName(), | ||
new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client) | ||
new UpdateLifecyclePolicyTask(projectMetadata.id(), request, listener, licenseState, filteredHeaders, xContentRegistry, client), | ||
null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an infinite timeout is the right call here but would be good to get options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current behavior seems to be using request.masterNodeTimeout()
. We submit the unbatched task using the timeout from the task here:
elasticsearch/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
Line 567 in 80deeb8
createTaskQueue("unbatched", updateTask.priority(), unbatchedExecutor).submitTask(source, updateTask, updateTask.timeout()); |
and the
UpdateLifecyclePolicyTask
initializes the timeout here:elasticsearch/server/src/main/java/org/elasticsearch/cluster/AckedClusterStateUpdateTask.java
Line 31 in 9a28516
this(Priority.NORMAL, request.masterNodeTimeout(), request.ackTimeout(), listener); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right yes that makes sense as as it's triggered from an upstream request we should mirror that. Have switched the code to use the timeout on the task object.
taskContext.success(() -> taskSucceeded(task, taskResult)); | ||
Runnable successFunction = () -> taskSucceeded(task, taskResult); | ||
if (task instanceof ClusterStateAckListener ackListenerTask) { | ||
taskContext.success(successFunction, ackListenerTask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to work around the assertion at
elasticsearch/server/src/main/java/org/elasticsearch/cluster/service/MasterService.java
Line 859 in 80deeb8
// [HISTORICAL NOTE] In the past, tasks executed by the master service would automatically be notified of acks if they implemented |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check the rest of your PR properly yet, so maybe this is premature, but we also have SimpleBatchedAckListenerTaskExecutor
. Wouldn't that work instead - avoiding these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't spot that class but yes that works nicely :-) Have refactored it to use the new class and the ILM tests seem to be happy about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for not mentioning that before when I suggested using SimpleBatchedExecutor
...
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @lukewhiting, I've created a changelog YAML for you. |
Hi @lukewhiting, I've updated the changelog YAML for you. |
try { | ||
return Tuple.tuple(task.execute(clusterState), task); | ||
} catch (Exception e) { | ||
throw new ElasticsearchException("failed to execute task", e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you added the try-catch block here? The SimpleBatchedAckListenerTaskExecutor
already catches all exceptions and calls onFailure
here:
Lines 67 to 69 in a59c182
} catch (Exception e) { | |
taskContext.onFailure(e); | |
} |
It looks to me like that's what we want, but maybe I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No just my IDE being fussy and not correctly adding the throws
block to the method it generated for the interface -.- Have updated it now.
private static class IlmLifecycleExecutor extends SimpleBatchedAckListenerTaskExecutor<UpdateLifecyclePolicyTask> { | ||
|
||
@Override | ||
public Tuple<ClusterState, ClusterStateAckListener> executeTask(UpdateLifecyclePolicyTask task, ClusterState clusterState) | ||
throws Exception { | ||
return Tuple.tuple(task.execute(clusterState), task); | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this class is so small now, I think we could also make it a lambda in the constructor, but if you prefer this version, I'm also ok with that - no strong opinion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpleBatchedAckListenerTaskExecutor
is an abstract
rather than an @FunctionalInterface
so I don't think we can go as a simple as a Lambda here. We could use an inline anonymous class here but my preference is always named classes for stuff like this as it improved readability but could be persuaded otherwise for something this small...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah you're right. No then I definitely prefer this named class as well 👍
When this change goes in, I'm going to close #111431, #111632, and #111662. Those failures seemed to be caused by slow cluster startup and this PR should reduce that. I'll close them manually as I don't want them to be linked in the changelog - and this PR is only indirectly addressing them at best. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for working on this, Luke! An easy but impactful win :) 🚀
💔 Backport failed
You can use sqren/backport to manually backport by running |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
* Use a task queue to ensure ILM policy change cluster state updates are batched * Update docs/changelog/126529.yaml * Update docs/changelog/126529.yaml * Switch to using SimpleBatchedAckListenerTaskExecutor * Get timeout from request * Ditch the try-catch (cherry picked from commit 7d7fa76) # Conflicts: # x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/action/TransportPutLifecycleAction.java
Switches ILM PUT lifecycle action to use a batched task queue for executing cluster state updates.
This isn't an optimal solution as it could use less heap space by not creating a new cluster state for each change but it should greatly speed up execution time when performing many ILM updates such as in tests without being significantly worse than current implementation on heap usage.
Future Enhancements:
Refactor the task to not use the
ClusterStateAckListener
but instead use a cluster state builder as it's input / output of the execute function. This would allow us to combine multiple updates while only using a single builder object.This would need a fair bit of refactoring, both upstream and downstream of this class (For instance a new equivalent of
SimpleBatchExecutor
that works on builders instead of cluster states.Fixes #122917