Add thread pool utilisation metric #120363


Merged

Conversation

nicktindall
Contributor

@nicktindall nicktindall commented Jan 17, 2025

There are existing metrics for the number of active threads, but it is tricky to derive a "utilisation" number from those because the pools all have different sizes.

This PR adds es.thread_pool.{name}.threads.utilization.current which will be published by all TaskExecutionTimeTrackingEsThreadPoolExecutor thread pools (where EsExecutors.TaskTrackingConfig#trackExecutionTime is true).

The metric is a double gauge indicating what fraction (in [0.0, 1.0]) of the maximum possible execution time was utilised over the polling interval.

It's calculated as actualTaskExecutionTime / maximumTaskExecutionTime, so it's effectively a "mean" value. The metric interval is 60s, so brief spikes won't be apparent in the measure, but the initial goal is to use it to detect hot-spotting, for which the 60s average will probably suffice.
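
For illustration, here is a minimal self-contained sketch of that calculation. The class and field names are simplified assumptions, not the exact code in TaskExecutionTimeTrackingEsThreadPoolExecutor (which accumulates execution time as tasks complete, via afterExecute()); the arithmetic is the fraction described above.

    import java.util.concurrent.atomic.LongAdder;

    // Sketch only: fraction of the maximum possible execution time used since the last poll.
    class UtilizationSketch {
        private final LongAdder totalExecutionTimeNanos = new LongAdder(); // added to as each task completes
        private final int maxPoolSize;
        private long lastTotalExecutionTimeNanos;
        private long lastPollTimeNanos = System.nanoTime();

        UtilizationSketch(int maxPoolSize) {
            this.maxPoolSize = maxPoolSize;
        }

        void recordTaskExecutionNanos(long nanos) {
            totalExecutionTimeNanos.add(nanos);
        }

        // Polled once per metric interval (60s); returns a value in [0.0, 1.0].
        synchronized double pollUtilization() {
            final long currentTotal = totalExecutionTimeNanos.sum();
            final long now = System.nanoTime();
            final long executedNanos = currentTotal - lastTotalExecutionTimeNanos;
            final long maxPossibleNanos = (now - lastPollTimeNanos) * maxPoolSize;
            lastTotalExecutionTimeNanos = currentTotal;
            lastPollTimeNanos = now;
            // Clamped to 1.0 here for illustration; tasks finishing right at the poll
            // boundary can push the raw ratio slightly over.
            return maxPossibleNanos <= 0 ? 0.0 : Math.min(1.0, (double) executedNanos / maxPossibleNanos);
        }
    }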

Relates ES-10530

@nicktindall nicktindall marked this pull request as ready for review January 17, 2025 22:25
@nicktindall nicktindall requested a review from a team as a code owner January 17, 2025 22:25
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jan 17, 2025
@nicktindall nicktindall added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) :Core/Infra/Metrics Metrics and metering infrastructure labels Jan 17, 2025
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Distributed Coordination Meta label for Distributed Coordination team labels Jan 17, 2025
@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Jan 17, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall
Contributor Author

nicktindall commented Jan 24, 2025

@henningandersen pointed out the sampling only occurs once every ~~30s~~ 60s, so this approach will likely be too coarse for our needs. Investigating low-cost alternatives for maintaining a more representative number.

@ywangd
Member

ywangd commented Jan 28, 2025

> sampling only occurs once every 30s

Does this refer to the APM sampling interval? In that case, it is configured as 60s in the serverless-default-settings.yml file.

@nicktindall
Contributor Author

> Does this refer to the APM sampling interval? In that case, it is configured as 60s in the serverless-default-settings.yml file.

Yes it does, so it's even worse! It's on my list to put together a proposal for an approach that would introduce a finer-grained averaging mechanism. It'd mean adding some state and/or threads, similar to what IngestLoadProbe does. I have some ideas.
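
Purely as an illustration of what a finer-grained averaging mechanism could look like (this is not what the PR ended up doing; the names, one-second sampling period and smoothing factor are all assumptions), one option is to sample the instantaneous utilisation on a scheduler thread and keep an exponentially weighted moving average that the 60s APM poll then reads:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.function.DoubleSupplier;

    // Hypothetical sketch: smooth a frequently sampled utilisation reading so the 60s poll
    // reports an average over the interval rather than a single instant.
    class SmoothedUtilization {
        private final DoubleSupplier instantUtilization; // e.g. activeThreads / maxPoolSize
        private final double alpha;                      // smoothing factor, e.g. 0.2
        private volatile double ewma;

        SmoothedUtilization(DoubleSupplier instantUtilization, double alpha) {
            this.instantUtilization = instantUtilization;
            this.alpha = alpha;
            // A real implementation would manage this thread's lifecycle (and probably reuse
            // an existing scheduler) rather than creating one per pool.
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(this::sample, 1, 1, TimeUnit.SECONDS);
        }

        private void sample() {
            ewma = alpha * instantUtilization.getAsDouble() + (1 - alpha) * ewma;
        }

        double current() { // read by the metric gauge
            return ewma;
        }
    }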

     */
    public double getUtilisation() {
        return (double) getActiveCount() / getMaximumPoolSize();
    }
Contributor Author

Not sure if it's worthwhile having this, or just omitting this metric for non-timed pools altogether. Could be misleading.

Contributor

I think I'd agree with not reporting anything in this case. If something doesn't have a useful meaning or purpose, then it's just noise: it'll confuse people later who assume it must mean something since it's reported.

Contributor

Yeah, I'd prefer to omit it too; it's too different.

Contributor Author

@nicktindall nicktindall Mar 2, 2025

👍 addressed in 54db077

Contributor

@DiannaHohensee DiannaHohensee left a comment

I took a look at the logic (not the testing), and left a few superficial notes.

I like the idea of reporting how much of a thread pool's capacity has been used over time. It would be a sort of sanity check on other metrics we report, which can only be so granular over time.

Do we want to add this information in addition to what's currently being discussed in the thread pool utilization design document? This data may not be the information we wanted to report, but it seems like it would provide nice signal along with other metrics?

        final long currentPollTimeNanos = System.nanoTime();
        final long executionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;
        final long timeSinceLastPoll = currentPollTimeNanos - lastPollTime;
        final long maxExecutionTimeSinceLastPollNanos = timeSinceLastPoll * getMaximumPoolSize();
Contributor

Suggested change
-        final long maxExecutionTimeSinceLastPollNanos = timeSinceLastPoll * getMaximumPoolSize();
+        final long maxSupportedExecutionTimeSinceLastPollNanos = timeSinceLastPoll * getMaximumPoolSize();

Contributor Author

I changed this one just to maximumExecutionTimeSinceLastPollNanos. I think that's clearer than what was there, and correlates with totalExecutionTimeSinceLastPollNanos. The names are already quite long and I don't think "supported" adds much?

    public double getUtilisation() {
        final long currentTotalExecutionTimeNanos = totalExecutionTime.sum();
        final long currentPollTimeNanos = System.nanoTime();
        final long executionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;
Contributor

Suggested change
-        final long executionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;
+        final long totalExecutionTimeSinceLastPollNanos = currentTotalExecutionTimeNanos - lastTotalExecutionTime;

Contributor Author

@nicktindall nicktindall Mar 2, 2025

Addressed in c248419

@@ -89,6 +91,21 @@ public int getCurrentQueueSize() {
        return getQueue().size();
    }

    /**
     * This returns the percentage of the maximum time spent since the last poll executing tasks
Contributor

Suggested change
-     * This returns the percentage of the maximum time spent since the last poll executing tasks
+     * Returns the percentage of thread time that was actually used, of the available maximum thread time supported, since the last poll of
+     * this method.

Contributor

Should we add a note that this presumes CPUs are always available to run all the threads? Or is that not a fudge factor?

Contributor

I think it returns the fraction, not the percentage, i.e., a number [0;1], not [0;100]?

Contributor Author

@nicktindall nicktindall Mar 2, 2025

Thanks, addressed in b9b8f89. I think it's fair to ignore scheduler contention: worker threads will always be getting scheduled to some extent by the OS, and the more that happens the slower a CPU-bound task runs, but it doesn't change the fact that the worker thread is "active".

Unless I've misunderstood the feedback @DiannaHohensee

Contributor

@henningandersen henningandersen left a comment

Looks good, left a few comments.

     * This returns the percentage of the maximum time spent since the last poll executing tasks
     */
    @Override
    public double getUtilisation() {
Contributor

Can we call this pollUtilization() to signal more clearly that it is not a read-only method (and start a US/UK/AU spelling fight 🙂)?

Contributor Author

@nicktindall nicktindall Mar 2, 2025

Addressed all instances of "utilisation" in b9b8f89

Comment on lines 350 to 351
"percentage of maximum threads active for " + name,
"percent",
Contributor

I think it is a [0;1] fraction, not a percentage? I prefer that too.

Contributor

@JeremyDahlgren JeremyDahlgren left a comment

Just a few comments, mainly in the unit test.

Comment on lines 520 to 527
        Future<?> future = threadPool.executor(threadPoolName).submit(() -> {
            long innerStartTimeNanos = System.nanoTime();
            safeSleep(100);
            safeAwait(barrier);
            minimumDurationNanos.set(System.nanoTime() - innerStartTimeNanos);
        });
        safeAwait(barrier);
        safeGet(future);
Contributor

This test sometimes fails on my machine:

    Expected: (a value greater than <0.06801617003477693> and a value less than <0.07091782686496141>)
         but: a value greater than <0.06801617003477693> <0.0> was less than <0.06801617003477693>

It looks like there is a race here, where the Future returns before TaskExecutionTimeTrackingEsThreadPoolExecutor.afterExecute(), which increments the totalExecutionTime, has finished. Since we already have the assert above on threadPool.executor(threadPoolName), one option would be to save the executor reference and wait for getTotalTaskExecutionTime() to be greater than zero. I tried this out and the test then passed consistently in a loop.
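
A rough sketch of that wait, hedged: it is written against the helper names quoted in this thread (safeGet, getTotalTaskExecutionTime) plus ESTestCase's assertBusy, so treat the exact shape as an assumption rather than the committed fix.

    // Keep a reference to the tracking executor so the test can wait for afterExecute()
    // to record the task's execution time before asserting on the utilization value.
    final var trackingExecutor =
        (TaskExecutionTimeTrackingEsThreadPoolExecutor) threadPool.executor(threadPoolName);
    safeGet(future);
    // Retry until the accumulated execution time becomes visible (or the wait times out).
    assertBusy(() -> assertThat(trackingExecutor.getTotalTaskExecutionTime(), greaterThan(0L)));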

Contributor Author

Nice! fixed in 287786f

Comment on lines 577 to 604
            assertLatestLongValueMatches(
                meterRegistry,
                ThreadPool.THREAD_POOL_METRIC_NAME_ACTIVE,
                InstrumentType.LONG_GAUGE,
                threadPoolName,
                equalTo(0L)
            );
            assertLatestLongValueMatches(
                meterRegistry,
                ThreadPool.THREAD_POOL_METRIC_NAME_CURRENT,
                InstrumentType.LONG_GAUGE,
                threadPoolName,
                equalTo(0L)
            );
            assertLatestLongValueMatches(
                meterRegistry,
                ThreadPool.THREAD_POOL_METRIC_NAME_COMPLETED,
                InstrumentType.LONG_ASYNC_COUNTER,
                threadPoolName,
                equalTo(0L)
            );
            assertLatestLongValueMatches(
                meterRegistry,
                ThreadPool.THREAD_POOL_METRIC_NAME_LARGEST,
                InstrumentType.LONG_GAUGE,
                threadPoolName,
                equalTo(0L)
            );
Contributor

Just a suggestion to shorten up the code a bit: one option would be to group together the common parameters, like:

            for (final var metricName : List.of(
                ThreadPool.THREAD_POOL_METRIC_NAME_ACTIVE,
                ThreadPool.THREAD_POOL_METRIC_NAME_CURRENT,
                ThreadPool.THREAD_POOL_METRIC_NAME_LARGEST
            )) {
                assertLatestLongValueMatches(meterRegistry, metricName, InstrumentType.LONG_GAUGE, threadPoolName, equalTo(0L));
            }

This could be done for each of the three sections of assertions in this test. Or maybe create a local lambda to hide the repeated meterRegistry and threadPoolName and only take the parameters that differ in each assertion?
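
The local-lambda variant could look something like this (a sketch; the functional interface and helper name are assumptions, while the metric names and assertLatestLongValueMatches come from the test itself):

            // Hypothetical helper (java.util.function.BiConsumer) hiding the repeated
            // meterRegistry / threadPoolName arguments.
            final BiConsumer<String, InstrumentType> assertZeroLongValue = (metricName, instrumentType) ->
                assertLatestLongValueMatches(meterRegistry, metricName, instrumentType, threadPoolName, equalTo(0L));

            assertZeroLongValue.accept(ThreadPool.THREAD_POOL_METRIC_NAME_ACTIVE, InstrumentType.LONG_GAUGE);
            assertZeroLongValue.accept(ThreadPool.THREAD_POOL_METRIC_NAME_CURRENT, InstrumentType.LONG_GAUGE);
            assertZeroLongValue.accept(ThreadPool.THREAD_POOL_METRIC_NAME_COMPLETED, InstrumentType.LONG_ASYNC_COUNTER);
            assertZeroLongValue.accept(ThreadPool.THREAD_POOL_METRIC_NAME_LARGEST, InstrumentType.LONG_GAUGE);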

Contributor Author

I tried to reduce this repetition a bit here... d6d82f2

I've had to do similar before (see AzureBlobStoreRepositoryMetricsTest). I wonder if the RecordingMeterRegistry could provide more utilities in a generic way.

Comment on lines +648 to +650
        // Let all threads complete
        safeAwait(barrier);
        futures.forEach(ESTestCase::safeGet);
Contributor

The test would sometimes fail on my machine after this point. It looks like a similar race condition to the one in the other test. If I kept the reference to the executor and waited for getOngoingTasks() to be empty, the test passed consistently.
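
Again hedged (using the getOngoingTasks() accessor mentioned above and the saved executor reference from the earlier sketch; its exact return type is an assumption), the wait might look like:

        // Wait until afterExecute() has completed for every submitted task before reading the metric.
        assertBusy(() -> assertTrue(trackingExecutor.getOngoingTasks().isEmpty()));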

Contributor Author

Thanks! Good pick-up, addressed in 436cc0b

Contributor

@JeremyDahlgren JeremyDahlgren left a comment

LGTM

@nicktindall nicktindall merged commit 270ca0a into elastic:main Apr 17, 2025
17 checks passed
@nicktindall nicktindall deleted the ES-10530_add_thread_pool_utilisation_metric branch April 17, 2025 01:49
Labels
:Core/Infra/Metrics Metrics and metering infrastructure :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Core/Infra Meta label for core/infra team Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0