Add thread pool utilisation metric #120363
Conversation
…ool_utilisation_metric
Hi @nicktindall, I've created a changelog YAML for you.
Pinging @elastic/es-core-infra (Team:Core/Infra)
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
…tion_metric' into ES-10530_add_thread_pool_utilisation_metric
@henningandersen pointed out the sampling only occurs once every …
Does this refer to the APM sampling interval? In that case, it is configured as 60s in the …
Yes it does, so even worse! It's on my list to put together a proposal for an approach that would introduce a finer-grained averaging mechanism. It'd mean adding some state and/or threads, similar to what …
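Purely as an illustration of what such a finer-grained mechanism could look like (a hypothetical sketch, not anything in this PR): a background sampler could fold frequent utilisation readings into an exponentially weighted moving average, so the 60s APM poll reads a smoothed value rather than a single point-in-time ratio. The smoothing factor, sample period, and the ScheduledExecutorService/Supplier wiring below are all assumptions made for the sketch.

    // Hypothetical sketch only: a periodic sampler feeding an EWMA, so the 60s poll
    // observes a smoothed utilisation rather than one instantaneous reading.
    private static final double ALPHA = 0.3; // arbitrary smoothing factor for illustration
    private volatile double smoothedUtilisation = 0.0;

    void startSampling(ScheduledExecutorService scheduler, Supplier<Double> instantaneousUtilisation) {
        // sample once a second and blend it into the running average
        scheduler.scheduleAtFixedRate(
            () -> smoothedUtilisation = ALPHA * instantaneousUtilisation.get() + (1 - ALPHA) * smoothedUtilisation,
            1, 1, TimeUnit.SECONDS
        );
    }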
 */
public double getUtilisation() {
    return (double) getActiveCount() / getMaximumPoolSize();
}
Not sure if it's worthwhile having this, or just omitting this metric for non-timed pools altogether. Could be misleading.
I think I'd agree with not reporting anything in this case. If something doesn't have a useful meaning or purpose, then it's just noise: it'll confuse people later who assume it must mean something since it's reported.
Yeah, I'd prefer to omit it too, too different.
👍 addressed in 54db077
I took a look at the logic (not the testing), and left a few superficial notes.
I like the idea of reporting how much capacity of a thread pool has been used over time. It would be a sort of sanity check on other metrics we report that can only be so granular over time.
Do we want to add this information in addition to what's currently being discussed in the thread pool utilization design document? This data may not be the information we wanted to report, but it seems like it would provide nice signal along with other metrics?
final long currentPollTimeNanos = System.nanoTime();
final long executionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;
final long timeSinceLastPoll = currentPollTimeNanos - lastPollTime;
final long maxExecutionTimeSinceLastPollNanos = timeSinceLastPoll * getMaximumPoolSize();
final long maxExecutionTimeSinceLastPollNanos = timeSinceLastPoll * getMaximumPoolSize();
final long maxSupportedExecutionTimeSinceLastPollNanos = timeSinceLastPoll * getMaximumPoolSize();
I changed this one just to maximumExecutionTimeSinceLastPollNanos. I think that's clearer than what was there, and correlates with totalExecutionTimeSinceLastPollNanos. The names are already quite long and I don't think "supported" adds much?
public double getUtilisation() {
    final long currentTotalExecutionTimeNanos = totalExecutionTime.sum();
    final long currentPollTimeNanos = System.nanoTime();
    final long executionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;
final long executionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;
final long totalExecutionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;
Addressed in c248419
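Pieced together from the fragments quoted in the threads above, the whole calculation is roughly the following. This is a sketch only: the two lines that advance lastTotalExecutionTime and lastPollTime are assumed from the "not a read-only method" discussion further down (the method was later renamed pollUtilization), and the final names follow the review suggestions rather than the original diff.

    public double getUtilisation() {
        final long currentTotalExecutionTimeNanos = totalExecutionTime.sum();
        final long currentPollTimeNanos = System.nanoTime();
        // work actually done by the pool's threads since the previous poll
        final long totalExecutionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;
        // the most work the pool could have done: wall-clock interval times the number of threads
        final long maximumExecutionTimeSinceLastPollNanos = (currentPollTimeNanos - lastPollTime) * getMaximumPoolSize();
        // advance the poll state: this is why the method is not read-only
        lastTotalExecutionTime = currentTotalExecutionTimeNanos;
        lastPollTime = currentPollTimeNanos;
        // fraction in [0, 1] of the available thread time that was spent executing tasks
        return (double) totalExecutionTimeSinceLastPollNanos / maximumExecutionTimeSinceLastPollNanos;
    }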
@@ -89,6 +91,21 @@ public int getCurrentQueueSize() {
        return getQueue().size();
    }

    /**
     * This returns the percentage of the maximum time spent since the last poll executing tasks
* This returns the percentage of the maximum time spent since the last poll executing tasks
* Returns the percentage of thread time that was actually used, of the available maximum thread time supported, since the last poll of
* this method.
Should we add a note that this presumes CPUs are always available to run all the threads? Or is that not a fudge factor?
I think it returns the fraction, not the percentage, i.e., a number in [0;1], not [0;100]?
Thanks, addressed in b9b8f89. I think it's fair to ignore scheduler contention; worker threads would always be getting scheduled to some extent by the OS. The more that happens, the slower a CPU-bound task would run, but it doesn't change the fact that the worker thread is "active". Unless I've misunderstood the feedback @DiannaHohensee.
.../org/elasticsearch/common/util/concurrent/TaskExecutionTimeTrackingEsThreadPoolExecutor.java
Looks good, left a few comments.
 * This returns the percentage of the maximum time spent since the last poll executing tasks
 */
@Override
public double getUtilisation() {
Can we call this pollUtilization() to signal more that it is not a read-only method (and start a US/UK/AU spelling fight 🙂)?
Addressed all instances of "utilisation" in b9b8f89
"percentage of maximum threads active for " + name, | ||
"percent", |
I think it is a [0;1] fraction, not a percentage? I prefer that too.
…ool_utilisation_metric
Just a few comments, mainly in the unit test.
server/src/main/java/org/elasticsearch/threadpool/ThreadPool.java
Future<?> future = threadPool.executor(threadPoolName).submit(() -> {
    long innerStartTimeNanos = System.nanoTime();
    safeSleep(100);
    safeAwait(barrier);
    minimumDurationNanos.set(System.nanoTime() - innerStartTimeNanos);
});
safeAwait(barrier);
safeGet(future);
This test sometimes fails on my machine:
Expected: (a value greater than <0.06801617003477693> and a value less than <0.07091782686496141>)
but: a value greater than <0.06801617003477693> <0.0> was less than <0.06801617003477693>
It looks like there is a race here, where the Future returns but the TaskExecutionTimeTrackingEsThreadPoolExecutor.afterExecute() call that increments the totalExecutionTime hasn't finished yet. Since we already have the assert above on threadPool.executor(threadPoolName), one option would be to save the executor reference and wait for getTotalTaskExecutionTime() to be greater than zero. I tried this out and the test would pass consistently in a loop after.
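A rough sketch of that suggestion (identifiers follow the comment above; the exact shape of the fix in the PR may differ):

    // Keep a reference to the tracking executor so the test can wait for afterExecute()
    // to have recorded the task time before asserting on utilisation.
    final var executor = (TaskExecutionTimeTrackingEsThreadPoolExecutor) threadPool.executor(threadPoolName);
    Future<?> future = executor.submit(() -> safeSleep(100));
    safeGet(future);
    // wait for the execution time to be accounted for, closing the race with afterExecute()
    assertBusy(() -> assertThat(executor.getTotalTaskExecutionTime(), greaterThan(0L)));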
Nice! fixed in 287786f
assertLatestLongValueMatches(
    meterRegistry,
    ThreadPool.THREAD_POOL_METRIC_NAME_ACTIVE,
    InstrumentType.LONG_GAUGE,
    threadPoolName,
    equalTo(0L)
);
assertLatestLongValueMatches(
    meterRegistry,
    ThreadPool.THREAD_POOL_METRIC_NAME_CURRENT,
    InstrumentType.LONG_GAUGE,
    threadPoolName,
    equalTo(0L)
);
assertLatestLongValueMatches(
    meterRegistry,
    ThreadPool.THREAD_POOL_METRIC_NAME_COMPLETED,
    InstrumentType.LONG_ASYNC_COUNTER,
    threadPoolName,
    equalTo(0L)
);
assertLatestLongValueMatches(
    meterRegistry,
    ThreadPool.THREAD_POOL_METRIC_NAME_LARGEST,
    InstrumentType.LONG_GAUGE,
    threadPoolName,
    equalTo(0L)
);
Just a suggestion to shorten up the code a bit: you could either group together the common parameters, like:
for (final var metricName : List.of(
ThreadPool.THREAD_POOL_METRIC_NAME_ACTIVE,
ThreadPool.THREAD_POOL_METRIC_NAME_CURRENT,
ThreadPool.THREAD_POOL_METRIC_NAME_LARGEST
)) {
assertLatestLongValueMatches(meterRegistry, metricName, InstrumentType.LONG_GAUGE, threadPoolName, equalTo(0L));
}
This could be done for each of the three sections of assertions in this test. Or maybe create a local lambda to hide the repeated meterRegistry and threadPoolName and only take the parameters that differ in each assertion?
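For the second option, the local helper could look something like this (a hypothetical shape, using java.util.function.BiConsumer, just to illustrate the idea):

    // Hide the repeated meterRegistry/threadPoolName arguments behind a small local helper;
    // each call then only states the parts that differ between assertions.
    final BiConsumer<String, InstrumentType> assertZeroValue = (metricName, instrumentType) ->
        assertLatestLongValueMatches(meterRegistry, metricName, instrumentType, threadPoolName, equalTo(0L));

    assertZeroValue.accept(ThreadPool.THREAD_POOL_METRIC_NAME_ACTIVE, InstrumentType.LONG_GAUGE);
    assertZeroValue.accept(ThreadPool.THREAD_POOL_METRIC_NAME_CURRENT, InstrumentType.LONG_GAUGE);
    assertZeroValue.accept(ThreadPool.THREAD_POOL_METRIC_NAME_COMPLETED, InstrumentType.LONG_ASYNC_COUNTER);
    assertZeroValue.accept(ThreadPool.THREAD_POOL_METRIC_NAME_LARGEST, InstrumentType.LONG_GAUGE);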
I tried to reduce this repetition a bit here... d6d82f2. I've had to do similar before (see AzureBlobStoreRepositoryMetricsTest). I wonder if the RecordingMeterRegistry could provide more utilities in a generic way.
// Let all threads complete
safeAwait(barrier);
futures.forEach(ESTestCase::safeGet);
The test would fail on my machine sometimes after this point. Looks like a similar race condition as in the other test? If I kept the reference to the executor and waited on getOngoingTasks() to be empty, the test would pass consistently.
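A sketch of what that wait might look like, assuming the executor reference mentioned above has been saved as executor and that getOngoingTasks() returns something with isEmpty():

    // Let all threads complete, then wait until the executor has finished its
    // post-execution bookkeeping before asserting on the metrics.
    safeAwait(barrier);
    futures.forEach(ESTestCase::safeGet);
    assertBusy(() -> assertTrue(executor.getOngoingTasks().isEmpty()));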
Thanks! good pick up, addressed in 436cc0b
LGTM
There are existing metrics for the active number of threads, but it seems tricky to go from that to a "utilisation" number because all the pools have different sizes.
This PR adds es.thread_pool.{name}.threads.utilization.current, which will be published by all TaskExecutionTimeTrackingEsThreadPoolExecutor thread pools (where EsExecutors.TaskTrackingConfig#trackExecutionTime is true). The metric is a double gauge indicating what fraction (in [0.0, 1.0]) of the maximum possible execution time was utilised over the polling interval. It's calculated as actualTaskExecutionTime / maximumTaskExecutionTime, so effectively a "mean" value. The metric interval is 60s, so brief spikes won't be apparent in the measure, but the initial goal is to use it to detect hot-spotting, so the 60s average will probably suffice.

Relates ES-10530