
Add node write load to the ClusterInfo #130411


Open · wants to merge 5 commits into main

Conversation

DiannaHohensee
Contributor

Sets up ClusterInfoService to collect node write load and
pass it into ClusterInfo. The node write load stats are not
yet supplied; they'll be zero/empty in the ClusterInfo for
now.

Relates ES-11990

@DiannaHohensee DiannaHohensee self-assigned this Jul 1, 2025
@DiannaHohensee DiannaHohensee requested a review from a team as a code owner July 1, 2025 21:05
@DiannaHohensee DiannaHohensee added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team labels Jul 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)


  @Override
  protected Collection<Class<? extends Plugin>> getPlugins() {
-     return pluginList(InternalSettingsPlugin.class, BogusEstimatedHeapUsagePlugin.class);
+     return pluginList(InternalSettingsPlugin.class, BogusEstimatedHeapUsagePlugin.class, BogusWriteLoadCollectorPlugin.class);
Contributor Author

I originally thought we'd be collecting the node write load stats in the stateless code. Now it's looking like we may not have a dependency on stateless code to collect the stats, but it's not yet decided. As I've already got this implemented, I'd like to go ahead with it as is, if folks agree. It will be a testing change later, if needed.

Member

I think it will be more than a testing change if we don't rely on stateless for collecting stats? We don't need to use the service provider pattern in that case, i.e. no need for the pluginsService.loadSingletonServiceProvider in the NodeServiceProvider class. That said, I am ok with us going with this for now. We can remove it later if it turns out to be unnecessary.
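For context, the service provider pattern under discussion is conceptually the JDK's ServiceLoader mechanism: the node loads a single provider from the classpath, falling back to a no-op when none is registered. The interface and class names below are illustrative stand-ins, not the actual Elasticsearch classes wired up via pluginsService.loadSingletonServiceProvider:

```java
import java.util.ServiceLoader;

public class WriteLoadCollectorLoader {
    // Hypothetical collector interface standing in for the one this PR adds.
    public interface WriteLoadCollector {
        double currentWriteLoad();
    }

    // Fallback when no plugin registers a provider, mirroring the
    // "zero/empty stats for now" behavior from the PR description.
    static final class NoopWriteLoadCollector implements WriteLoadCollector {
        @Override
        public double currentWriteLoad() {
            return 0.0;
        }
    }

    public static WriteLoadCollector loadCollector() {
        // Picks up the single registered provider, or falls back to a no-op.
        return ServiceLoader.load(WriteLoadCollector.class)
            .findFirst()
            .orElseGet(NoopWriteLoadCollector::new);
    }

    public static void main(String[] args) {
        System.out.println(loadCollector().currentWriteLoad());
    }
}
```

If the stats end up being collected in the server module itself (e.g. via the NodeStats API), this indirection disappears and the collector can be constructed directly, which is the simplification being suggested here.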

Member

The team discussion seems to have settled on either extending the NodeStats API or adding a new transport action for collection purposes. So maybe we can remove it in this PR already?

Contributor Author

Yes, I'd have to change the NodeServiceProvider line, too 👍

My inclination is to ship this as is, without revisiting the testing, so I don't block myself or others picking up the next pieces of project work. And I'm also not 100% confident proposals will survive contact with code, haha.

- // NOTE: We don't serialize estimatedHeapUsages at this stage, to avoid
- // committing to API payloads until the feature is settled
+ // NOTE: We don't serialize estimatedHeapUsages/nodeWriteLoads at this stage, to avoid
+ // committing to API payloads until the features are settled
Contributor Author

I'm not entirely sure what the toXContentChunked is used for, but figured I should follow suit.

Contributor

@nicktindall nicktindall left a comment

LGTM with some comments

String nodeId,
int totalWriteThreadPoolThreads,
int percentWriteThreadPoolUtilization,
long maxTaskTimeInWriteQueueMillis
Contributor

I think we should indicate that these are averages (e.g. averageWriteThreadPoolUtilization and averageWriteThreadPoolQueueLatency)

I used the term "queue latency" to describe the time spent waiting in the queue, I'm not married to that term specifically but I think we should try and be consistent (happy to agree on something else).

Contributor Author

You're right, the terminology doesn't match. Done 👍

public record NodeWriteLoad(
String nodeId,
int totalWriteThreadPoolThreads,
int percentWriteThreadPoolUtilization,
Contributor

Nit: I have a preference for using a float/double here, i.e. a ratio rather than a percentage. I'm not sure why. But for example org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor#pollUtilization returns a double (arguably it should be a float)

It seems like they're more useful for arithmetic, and percentages are more useful for rendering. Just an opinion though.
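A small illustration of the ratio-vs-percentage point (a hypothetical helper, not code from this PR): ratios feed directly into downstream arithmetic such as weighted averaging, whereas percentages would force every consumer to handle /100 conversions and integer-division pitfalls.

```java
public class UtilizationMath {
    // Weighted average of per-node utilization ratios, weighted by pool size.
    static double weightedAverageUtilization(double[] ratios, int[] threads) {
        double weightedSum = 0.0;
        int totalThreads = 0;
        for (int i = 0; i < ratios.length; i++) {
            weightedSum += ratios[i] * threads[i];
            totalThreads += threads[i];
        }
        return weightedSum / totalThreads;
    }

    public static void main(String[] args) {
        // Two nodes: 8 threads at 0.5 utilization, 4 threads fully utilized.
        System.out.println(weightedAverageUtilization(new double[] { 0.5, 1.0 }, new int[] { 8, 4 }));
    }
}
```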

Contributor Author

Done 👍

Member

@ywangd ywangd left a comment

I have a question about the data structure and potential expansion for more thread pools.

Comment on lines 27 to 30
public record NodeWriteLoad(
String nodeId,
int totalWriteThreadPoolThreads,
int percentWriteThreadPoolUtilization,
Member

This structure allows reporting for only a single thread pool. Autoscaling uses stats for all 3 write related thread pools, write, system_write and system_critical_write. Starting out this work with just one thread pool is probably good enough. But I wonder whether the data structure could leave space for expansion? Not sure if this has already been discussed and agreed upon elsewhere. Please let me know if that is the case.

Member

Besides write thread pools, there is a possibility that we might extend this to search thread pools as well?

Contributor Author

> Autoscaling uses stats for all 3 write related thread pools, write, system_write and system_critical_write.

I'm not familiar with system_write and system_critical_write. Looks like system_write might be for system indices?

> But I wonder whether the data structure could leave space for expansion? Not sure if this has already been discussed and agreed upon elsewhere. Please let me know if that is the case.

Nope, it has not been discussed.

> Besides write thread pools, there is a possibility that we might extend this to search thread pools as well?

Yes, I expect we would eventually add search. I think the future changes for that would be fairly straightforward: extending tracking to the search thread pools, sending that information along from the data node to master as well, modifying data structures like this.

Contributor Author

I think it isn't worthwhile to implement search support now, since it would be significant overhead to think that far ahead, and we might end up changing this a few times before we're satisfied with everything for write. Using a new TransportVersion to set up search in the future seems like it'd be okay?

Regarding the system_write and system_critical_write thread pools: I currently have no idea how we'd leverage that extra data to make allocation decisions.

Member

What I meant by "leave space for expansion" is not adding either search or other thread pool stats right now, but making the data structure here something like

record NodeLoad (String nodeId, Map<String, LoadStats> threadsLoadStats) {}

record LoadStats(
    int totalWriteThreadPoolThreads,
    int percentWriteThreadPoolUtilization,
    long maxTaskTimeInWriteQueueMillis
)

where the Map<String, LoadStats> threadsLoadStats is keyed by thread pool name, e.g. write. In this PR, we still just take care of the write thread pool. But the map structure is more generic and organized, and makes it easier to add any future thread pools that we might be interested in without transport version changes.
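A minimal, self-contained sketch of that map-keyed shape in plain Java (the names are illustrative; the real records implement Writeable and live in the Elasticsearch codebase):

```java
import java.util.Map;

public class NodeLoadShape {
    // Per-pool stats; the pool name lives in the map key, not in the record.
    public record LoadStats(int totalThreads, double averageUtilization, long maxQueueLatencyMillis) {}

    // Keyed by thread pool name (e.g. "write"), so adding "search" later is
    // just a new map entry rather than a change to the wire shape.
    public record NodeLoad(String nodeId, Map<String, LoadStats> threadPoolStats) {}

    public static void main(String[] args) {
        NodeLoad load = new NodeLoad("node-1", Map.of("write", new LoadStats(8, 0.75, 120L)));
        System.out.println(load.threadPoolStats().get("write").totalThreads());
    }
}
```

The serialization can then write the map size followed by each (name, stats) pair, which is why a previously unknown pool name deserializes cleanly on older readers of the same format.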

Contributor Author

Sure, it just seems like a lot more than we need right now. Adding search (not just here) will require a TransportVersion and significant changes. I don't see a clear use case for the other thread pools.

I don't have much preference for the structure myself. I'll go ahead with your code suggestion 👍

Contributor Author

@DiannaHohensee DiannaHohensee left a comment

I've updated the data structure. The naming is making my head spin a little, so I'll probably think on that a little more later. The node is sending thread pool stats, which reflect load, which are execution metrics..... Open to further naming suggestions 😌

Member

@ywangd ywangd left a comment

LGTM

Thanks for the iterations. My naming suggestion is optional. Feel free to take or discard. Regardless what names you decide to go with, I'd appreciate if you could make them consistent across all relevant places, e.g.:

  • The Collector class's name and its variable name in NodeServiceProvider and InternalClusterInfoService
  • The field in ClusterInfo, ClusterInfoSimulator and InternalClusterInfoService

* @param threadPoolUsageStatsMap A map of thread pool name ({@link org.elasticsearch.threadpool.ThreadPool.Names}) to the thread pool's
* usage stats ({@link ThreadPoolUsageStats}).
*/
public record NodeExecutionLoad(String nodeId, Map<String, ThreadPoolUsageStats> threadPoolUsageStatsMap) implements Writeable {
Member

How about we use the phrase ThreadPoolUsageStats? I think we can also simplify the name of the map field.

Suggested change:
- public record NodeExecutionLoad(String nodeId, Map<String, ThreadPoolUsageStats> threadPoolUsageStatsMap) implements Writeable {
+ public record NodeThreadPoolUsageStats(String nodeId, Map<String, ThreadPoolUsageStats> threads) implements Writeable {

Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0