Add node write load to the ClusterInfo #130411
Conversation
Sets up ClusterInfoService to collect node write load and pass it into ClusterInfo. The node write load stats are not yet supplied; they'll be zero/empty in the ClusterInfo for now. Relates ES-11990
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
@Override
protected Collection<Class<? extends Plugin>> getPlugins() {
    return pluginList(InternalSettingsPlugin.class, BogusEstimatedHeapUsagePlugin.class);
    return pluginList(InternalSettingsPlugin.class, BogusEstimatedHeapUsagePlugin.class, BogusWriteLoadCollectorPlugin.class);
I originally thought we'd be collecting the node write load stats in the stateless code. Now it's looking like we may not have a dependency on stateless code to collect the stats, but it's not yet decided. As I've already got this implemented, I'd like to go ahead with it as is, if folks agree. It would only be a testing change later, if needed.
I think it will be more than a testing change if we don't rely on stateless for collecting stats? We don't need to use the service provider pattern in that case, i.e. no need for the pluginsService.loadSingletonServiceProvider in the NodeServiceProvider class. That said, I am ok with us going with this for now. We can remove it later if it turns out to be unnecessary.
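To make the pattern concrete, the wiring being discussed would look roughly like this. The collector interface and its method are illustrative names only, not the actual classes in this PR, and the exact loadSingletonServiceProvider overload should be checked against PluginsService:

```java
import java.util.Map;

// Illustrative SPI that a plugin (e.g. stateless) could implement to supply the stats.
public interface WriteLoadCollector {
    Map<String, NodeWriteLoad> collectWriteLoads();

    // Fallback used when no plugin provides an implementation.
    WriteLoadCollector EMPTY = () -> Map.of();
}

// In NodeServiceProvider, roughly (signature assumed, see PluginsService for the real overload):
// WriteLoadCollector collector = pluginsService.loadSingletonServiceProvider(WriteLoadCollector.class, () -> WriteLoadCollector.EMPTY);
```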
The team discussion seems to have decided that we will either extend the NodeStats API or add a new transport action for collection purposes. So maybe we can remove it in this PR already?
Yes, I'd have to change the NodeServiceProvider line, too 👍
My inclination is to ship this as is, without revisiting the testing, so I don't block myself or others picking up the next pieces of project work. And I'm also not 100% confident the proposals will survive contact with the code, haha.
// NOTE: We don't serialize estimatedHeapUsages at this stage, to avoid
// committing to API payloads until the feature is settled
// NOTE: We don't serialize estimatedHeapUsages/nodeWriteLoads at this stage, to avoid
// committing to API payloads until the features are settled
I'm not entirely sure what the toXContentChunked is used for, but figured I should follow suit.
LGTM with some comments
String nodeId,
int totalWriteThreadPoolThreads,
int percentWriteThreadPoolUtilization,
long maxTaskTimeInWriteQueueMillis
I think we should indicate that these are averages (e.g. averageWriteThreadPoolUtilization and averageWriteThreadPoolQueueLatency).
I used the term "queue latency" to describe the time spent waiting in the queue, I'm not married to that term specifically but I think we should try and be consistent (happy to agree on something else).
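Roughly, just to show the suggested names in place (not the final shape of the record, and field types as in the current PR):

```java
public record NodeWriteLoad(
    String nodeId,
    int totalWriteThreadPoolThreads,
    int averageWriteThreadPoolUtilization,         // was percentWriteThreadPoolUtilization
    long averageWriteThreadPoolQueueLatencyMillis  // was maxTaskTimeInWriteQueueMillis
) {}
```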
You're right, the terminology doesn't match. Done 👍
public record NodeWriteLoad(
    String nodeId,
    int totalWriteThreadPoolThreads,
    int percentWriteThreadPoolUtilization,
Nit: I have a preference for using a float/double here, i.e. a ratio rather than a percentage. I'm not sure why. But for example org.elasticsearch.common.util.concurrent.TaskExecutionTimeTrackingEsThreadPoolExecutor#pollUtilization returns a double (arguably it should be a float).
It seems like ratios are more useful for arithmetic, and percentages are more useful for rendering. Just an opinion though.
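A toy snippet of what I mean (illustrative values only, not code from this PR): ratios compose directly, while percentages need scaling before and after any math.

```java
double nodeA = 0.35, nodeB = 0.80;               // utilization as 0.0-1.0 ratios
double mean = (nodeA + nodeB) / 2;               // 0.575, ready for further arithmetic
long percentForDisplay = Math.round(mean * 100); // 58 - only needed when rendering
```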
Done 👍
I have a question about the data structure and potential expansion for more thread pools.
public record NodeWriteLoad(
    String nodeId,
    int totalWriteThreadPoolThreads,
    int percentWriteThreadPoolUtilization,
This structure allows reporting for only a single thread pool. Autoscaling uses stats for all 3 write-related thread pools: write, system_write and system_critical_write. Starting out this work with just one thread pool is probably good enough. But I wonder whether the data structure could leave space for expansion? Not sure if this has already been discussed and agreed upon elsewhere. Please let me know if that is the case.
Besides the write thread pools, there is a possibility that we might extend this to the search thread pools as well?
> Autoscaling uses stats for all 3 write-related thread pools: write, system_write and system_critical_write.
I'm not familiar with system_write and system_critical_write. Looks like system_write might be for system indices?
> But I wonder whether the data structure could leave space for expansion? Not sure if this has already been discussed and agreed upon elsewhere. Please let me know if that is the case.
Nope, it has not been discussed.
> Besides the write thread pools, there is a possibility that we might extend this to the search thread pools as well?
Yes, I expect we would eventually add search. I think the future changes for that would be fairly straightforward: extending tracking to the search thread pools, sending that information along from the data node to master as well, modifying data structures like this.
I think it isn't worthwhile to implement search support now, since it would be significant overhead to think that far ahead, and we might end up changing this a few times before we're satisfied with everything for write. Using a new TransportVersion to set up search in the future seems like it'd be okay?
Regarding the system_write and system_critical_write thread pools... I currently have no idea how we'd leverage that extra data to make allocation decisions.
What I meant by "leave space for expansion" is not adding either search or other thread pool stats right now, but making the data structure here something like

record NodeLoad(String nodeId, Map<String, LoadStats> threadsLoadStats) {}

record LoadStats(
    int totalWriteThreadPoolThreads,
    int percentWriteThreadPoolUtilization,
    long maxTaskTimeInWriteQueueMillis
) {}

where the Map<String, LoadStats> threadsLoadStats is keyed by thread pool name, e.g. write. In this PR, we still just take care of the write thread pool, but the map structure is more generic and organized, and makes it easier to add any future thread pools that we might be interested in without transport version changes.
Sure, it just seems like a lot more than we need right now. Adding search (not just here) will require a TransportVersion and significant changes. I don't see a clear use case for the other thread pools.
I don't have much preference for the structure myself. I'll go ahead with your code suggestion 👍
I've updated the data structure. The naming is making my head spin a little, so I'll probably think on that a little more later. The node is sending thread pool stats, which reflect load, which are execution metrics..... Open to further naming suggestions 😌
LGTM
Thanks for the iterations. My naming suggestion is optional; feel free to take it or discard it. Regardless of what names you decide to go with, I'd appreciate it if you could make them consistent across all relevant places, e.g.:
- The Collector class's name and its variable name in NodeServiceProvider and InternalClusterInfoService
- The field in ClusterInfo, ClusterInfoSimulator and InternalClusterInfoService
 * @param threadPoolUsageStatsMap A map of thread pool name ({@link org.elasticsearch.threadpool.ThreadPool.Names}) to the thread pool's
 *                                usage stats ({@link ThreadPoolUsageStats}).
 */
public record NodeExecutionLoad(String nodeId, Map<String, ThreadPoolUsageStats> threadPoolUsageStatsMap) implements Writeable {
How about we use the phrase ThreadPoolUsageStats? I think we can also simplify the name of the map field.
public record NodeExecutionLoad(String nodeId, Map<String, ThreadPoolUsageStats> threadPoolUsageStatsMap) implements Writeable {
public record NodeThreadPoolUsageStats(String nodeId, Map<String, ThreadPoolUsageStats> threads) implements Writeable {
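In case it's useful, a rough sketch of how the Writeable plumbing for the map-based record could look. This assumes ThreadPoolUsageStats gets a StreamInput constructor; the stream helpers are the standard readMap/writeMap variants, and the shape may of course change with the final naming:

```java
import java.io.IOException;
import java.util.Map;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

public record NodeThreadPoolUsageStats(String nodeId, Map<String, ThreadPoolUsageStats> threads) implements Writeable {

    // Read constructor mirroring writeTo below: node id first, then the per-pool stats map.
    public NodeThreadPoolUsageStats(StreamInput in) throws IOException {
        this(in.readString(), in.readMap(StreamInput::readString, ThreadPoolUsageStats::new));
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(nodeId);
        // Map keyed by thread pool name; each value serializes itself.
        out.writeMap(threads, StreamOutput::writeString, (o, stats) -> stats.writeTo(o));
    }
}
```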