
Publish queue latency metrics from tracked thread pools #120488


Conversation

nicktindall (Contributor) commented Jan 21, 2025

We only publish queue latency for thread pools for which EsExecutors.TaskTrackingConfig#trackExecutionTime is true.

We maintain an in-memory ExponentialBucketHistogram for each monitored thread pool to track the queue latencies of the tasks executed. Each time we poll for metrics we publish a hard-coded set of percentiles (I put 50th and 90th to begin with) as gauge values. This makes querying possible with ES|QL and will allow ordering/filtering on those values.

After we've published the values we clear the histogram to start collecting observations for the next interval.

Closes: ES-10531

handlingTimeTracker.clear();
return metricValues;
}
);
nicktindall (Contributor Author):

I wonder if, rather than publishing a time series per percentile (using a percentile attribute), we should publish a metric per percentile.
The metric makes no sense if you don't filter by a percentile label.

Member:

Is it easier to plot different percentiles on the same graph with labels (and group by) compared to two different time series?

nicktindall (Contributor Author):

I don't think that makes a difference, but I'm not sure.

nicktindall (Contributor Author):

Yes, having a look at Kibana just now, it would be much easier to plot as a single metric grouped by the percentiles. As separate metrics we'd need to add them as distinct time series.

private final Function<Runnable, WrappedRunnable> runnableWrapper;
private final ExponentiallyWeightedMovingAverage executionEWMA;
private final LongAdder totalExecutionTime = new LongAdder();
private final boolean trackOngoingTasks;
// The set of currently running tasks and the timestamp of when they started execution in the Executor.
private final Map<Runnable, Long> ongoingTasks = new ConcurrentHashMap<>();
private final HandlingTimeTracker handlingTimeTracker = new HandlingTimeTracker();
nicktindall (Contributor Author), Mar 11, 2025:

By using a HandlingTimeTracker we can publish specific percentile streams. This is as opposed to using an APM histogram metric, which comes with a lot of limitations (it doesn't work in ES|QL, and can't be sorted or filtered on) and is costly to aggregate. The histogram is cleared each time we publish the percentiles, so the percentiles are for the samples received since the last publish.

If we agree with this approach, I think it might be worth moving the HandlingTimeTracker somewhere common and giving it a more generic name (e.g. ExponentialBucketHistogram).

HandlingTimeTracker is the simplest possible solution for publishing percentiles. We could get more accurate metrics over a window decoupled from the metric polling interval if we used something like exponential histograms or decaying histograms, though they would likely incur a larger synchronisation overhead.
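For illustration, a minimal sketch of the kind of fixed-bucket, power-of-two histogram being described (the class and method names here are assumptions for the sketch, not the actual HandlingTimeTracker code):

import java.util.concurrent.atomic.LongAdder;

class QueueLatencyHistogramSketch {
    private static final int BUCKET_COUNT = 18;
    // buckets[0] counts values <= 0, buckets[i] (i >= 1) counts values in [2^(i-1), 2^i),
    // and the last bucket is unbounded above
    private final LongAdder[] buckets = new LongAdder[BUCKET_COUNT];

    QueueLatencyHistogramSketch() {
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LongAdder();
        }
    }

    void addObservation(long observedValue) {
        // 64 - numberOfLeadingZeros(v) is one more than the index of the highest set bit,
        // which maps positive values onto power-of-two buckets
        int bucket = observedValue <= 0 ? 0 : Math.min(BUCKET_COUNT - 1, 64 - Long.numberOfLeadingZeros(observedValue));
        buckets[bucket].increment();
    }

    long[] getSnapshot() {
        long[] snapshot = new long[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            snapshot[i] = buckets[i].sum();
        }
        return snapshot;
    }

    // note: clear() is not atomic with respect to getSnapshot(); that is part of the
    // "simplest possible" trade-off described above
    void clear() {
        for (LongAdder bucket : buckets) {
            bucket.reset();
        }
    }
}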

}
assert false : "We shouldn't ever get here";
return Long.MAX_VALUE;
}
nicktindall (Contributor Author), Mar 11, 2025:

I opted to implement an "exclusive" percentile because it was easier with the way the counts are stored in HandlingTimeTracker (the arrival of a value increments the count of the first bucket with lower bound <= the value). If we wanted to get fancy we could look at interpolation, but I don't think it's necessary.
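A sketch of the "exclusive" bucket-scan percentile described here, continuing the illustrative histogram sketch above (not the actual implementation):

    long getPercentile(float percentile, long[] snapshot, long[] bucketUpperBounds) {
        long totalCount = 0;
        for (long count : snapshot) {
            totalCount += count;
        }
        if (totalCount == 0) {
            return 0;
        }
        // the number of observations that must lie at or below the reported value
        long percentileIndex = (long) Math.ceil(totalCount * percentile);
        for (int i = 0; i < snapshot.length; i++) {
            percentileIndex -= snapshot[i];
            if (percentileIndex <= 0) {
                // report the upper bound of the bucket that contains the percentile
                return bucketUpperBounds[i];
            }
        }
        assert false : "We shouldn't ever get here";
        return Long.MAX_VALUE;
    }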

public String toCompositeString() {
return nodeName == null ? threadPoolName : nodeName + "/" + threadPoolName;
}
}
nicktindall (Contributor Author):

I added this so we could separate the node name and thread pool name for the purpose of generating valid metric names.

For example, es.thread_pool.instance-000003/write.threads.queue.latency.histogram is invalid; we want es.thread_pool.write.threads.queue.latency.histogram instead, so it's better to pass in something structured and unambiguous than to make assumptions about the format of the string.
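As a sketch of what "something structured" might look like (the record name and metricName helper are illustrative, not the exact class in this PR):

    record ThreadPoolIdentity(String nodeName, String threadPoolName) {

        // only the thread pool name goes into the metric name, e.g.
        // "es.thread_pool." + "write" + ".queue.latency.histogram"
        String metricName(String suffix) {
            return "es.thread_pool." + threadPoolName + suffix;
        }

        public String toCompositeString() {
            return nodeName == null ? threadPoolName : nodeName + "/" + threadPoolName;
        }
    }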

ywangd (Member):

I wonder whether an alternative could be to add a new method to each of ExecutorHolder and TaskExecutionTimeTrackingEsThreadPoolExecutor, something like:

class ExecutorHolder {
    void registerMetrics(MeterRegistry meterRegistry) {
        if (executor instanceof TaskExecutionTimeTrackingEsThreadPoolExecutor te) {
            te.registerMetrics(info.name, meterRegistry);
        }
    }
}

class TaskExecutionTimeTrackingEsThreadPoolExecutor {
    void registerMetrics(String name, MeterRegistry meterRegistry) {
        meterRegistry.registerLongsGauge(...);
    }
}

and call ExecutorHolder#registerMetrics after building it in ThreadPool. I think it avoids the need for this class and the associated cascading changes?

nicktindall (Contributor Author):

Yes I think you're on to something there. I will review that tomorrow.

nicktindall (Contributor Author):

OK I've tidied that up now, thanks for the suggestion @ywangd

@nicktindall nicktindall marked this pull request as ready for review March 26, 2025 00:29
@nicktindall nicktindall requested a review from a team as a code owner March 26, 2025 00:29
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Mar 26, 2025
henningandersen (Contributor) left a comment:

Looks good, though I did not do a thorough review.

+ ", failedOrRejected: "
+ failedOrRejected;
if (taskQueueLatency != -1) {
queueWaitTimes.record(TimeValue.timeValueNanos(taskQueueLatency).secondsFrac());
Contributor:

I wonder if we should remove the allocation here and just calculate the seconds frac ourselves? I see that record will also allocate - but I'd like to avoid adding unnecessary overhead here.
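For illustration, the allocation-free calculation this comment alludes to could look something like the following (a sketch of the suggestion, not the change that ended up in e7f5bb6):

    if (taskQueueLatency != -1) {
        // compute fractional seconds directly rather than allocating a TimeValue
        double queueLatencySeconds = taskQueueLatency / 1_000_000_000.0;
        queueWaitTimes.record(queueLatencySeconds);
    }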

nicktindall (Contributor Author):

Addressed in e7f5bb6

+ ", failedOrRejected: "
+ failedOrRejected;
if (taskQueueLatency != -1) {
queueLatencyHistogram.addObservation(TimeUnit.NANOSECONDS.toMillis(taskQueueLatency));
Contributor:

Is this a little late? It seems we could do this in beforeExecute too. It may not matter too much?

I also wonder about a queue build-up and how we can get the latency metric to rise more quickly. But we can use this for now - we only have to be aware of the time at which it is measured. We could see a build-up of latency while queuing is already going down rapidly by the time we observe it, i.e., the problem occurred prior to the latency metric rising heavily.

Probably not too bad for our immediate purpose, but it could be problematic if we rely on this for allocation balancing. It might also be fine - it's not that delayed.

nicktindall (Contributor Author):

Yeah, I had done it here just to be able to use the timestamp we already record for startTime, but we usually take one anyway in beforeExecute, so perhaps we can use that.

I also wonder about a queue build-up and how we can get the latency metric to rise more quickly. But we can use this for now - we only have to be aware of the time at which it is measured. We could see a build-up of latency while queuing is already going down rapidly by the time we observe it, i.e., the problem occurred prior to the latency metric rising heavily.

Yes, it's a good point. I think this will be less of a problem the larger a thread pool is, because for a larger thread pool there should be some tasks moving through fairly regularly, giving us a read of the current latency, even if many slots are bogged down with large tasks.

nicktindall (Contributor Author):

Also, with the metric interval at 60s I think there will be a degree of lateness regardless of how we track this?

nicktindall (Contributor Author):

I changed it to record the latency in beforeExecute in 2b294bd. It has the minor downside that we take an additional System.nanoTime() timestamp. I considered using the timestamp we already take for the ongoing-task tracking, but I was hesitant to do that because nowhere else in TimedRunnable do we expose the timestamps themselves, only the differences between them. If we were to change the source of those timestamps in TimedRunnable (e.g. to use a cached time source or something else) it might be missed and start producing spurious deltas.
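A rough sketch of the beforeExecute approach described here (the names are taken from the surrounding discussion; this is not the exact code in 2b294bd):

    @Override
    protected void beforeExecute(Thread t, Runnable r) {
        super.beforeExecute(t, r);
        if (r instanceof TimedRunnable timedRunnable) {
            timedRunnable.beforeExecute(); // records the start-of-execution timestamp
            final long taskQueueLatency = timedRunnable.getQueueTimeNanos();
            assert taskQueueLatency >= 0 : "expected a valid queue latency, got " + taskQueueLatency;
            queueLatencyHistogram.addObservation(TimeUnit.NANOSECONDS.toMillis(taskQueueLatency));
        }
    }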

@@ -27,12 +35,16 @@
*/
public final class TaskExecutionTimeTrackingEsThreadPoolExecutor extends EsThreadPoolExecutor {

private static final int[] LATENCY_PERCENTILES_TO_REPORT = { 50, 90 };
Contributor:

I think we should add 99 too. To some extent, the max value in the interval seems most important.

I wonder if we also want an actual average, but we can possibly add that later.

At one point I was considering having the current value too. That would be easy to get: just peek at the front element of the queue and diff now minus creation time (see the sketch below). But I think the percentiles, or at least the 99th percentile, also add value, so we can start there.
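For reference, the "current value" idea could be sketched roughly like this (getCreationTimeNanos() is hypothetical; TimedRunnable does not necessarily expose its creation timestamp):

    long currentQueueLatencyNanos() {
        // peek at the task at the head of the queue and report how long it has been waiting
        Runnable head = getQueue().peek();
        if (head instanceof TimedRunnable timedRunnable) {
            return System.nanoTime() - timedRunnable.getCreationTimeNanos(); // hypothetical accessor
        }
        return 0L;
    }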

nicktindall (Contributor Author):

Added p99 in d6e44ed

henningandersen (Contributor) left a comment:

One more comment.

for (int i = 0; i < BUCKET_COUNT; i++) {
percentileIndex -= snapshot[i];
if (percentileIndex <= 0) {
return getBucketUpperBounds()[i];
Contributor:

I think this means we'll get 2x steps in the values reported. I.e., the 90th percentile will be 2048 or 4096 but never anything in between?

I guess this speaks to reporting just the current value instead as a starting point?

nicktindall (Contributor Author), Apr 6, 2025:

Yes, it is a downside of the exponential bucket histogram. We could perhaps use a different power, or linear buckets?

I thought the value of the histogram over the "current" value is that, because we only get a measurement every 60s, the "current" value might not tell the whole story (e.g. with the current value or p99 we might see a momentary spike in latency due to GC or something, but with ~p90 we give that a chance to be diluted by an otherwise healthy rate of queue processing).
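To make the 2x-step concern concrete, a small illustration (bucket bounds assumed to be powers of two, as in the sketch earlier):

    // with power-of-two bucket upper bounds, the reported percentile always snaps to a bound
    long[] upperBounds = { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 };
    // a true 90th-percentile queue latency of ~3000ms falls in the [2048, 4096) bucket,
    // so it is reported as that bucket's upper bound, 4096ms; nothing between 2048 and 4096
    // is ever reported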

JeremyDahlgren (Contributor) left a comment:

It might be good to split out the renaming/refactoring so it is clearer what the new changes are in this PR?

Comment on lines 73 to 84
() -> {
List<LongWithAttributes> metricValues = Arrays.stream(LATENCY_PERCENTILES_TO_REPORT)
.mapToObj(
percentile -> new LongWithAttributes(
queueLatencyHistogram.getPercentile(percentile / 100f),
Map.of("percentile", String.valueOf(percentile))
)
)
.toList();
queueLatencyHistogram.clear();
return metricValues;
}
Contributor:

Is it worth taking the histogram snapshot once, and then reusing it with an additional getPercentile(percentile, histogram) method?

nicktindall (Contributor Author):

It's a good point, because if we ask for a series of percentiles while the thread pool is still running we'll potentially get inconsistent measurements, but I don't think it's such an issue that we should add more infrastructure to address it. It should be quick, and we only take fewer than 5 percentiles, so I think we'd be talking about a difference of very few samples between the measurements.

Contributor:

Maybe it's worth a comment noting that the reporting can miss some entries but that we don't think it's a concern?

I agree that it's good enough as is -- let's get it running -- but it's worth keeping track that we could be missing something here in case it becomes something we want to improve later.

nicktindall (Contributor Author):

I've fixed this one now: we take a single snapshot and calculate all percentiles from it (see 5c6a4a7).
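A sketch of the single-snapshot approach, following the getPercentile(percentile, snapshot) shape suggested above (the actual code in 5c6a4a7 may differ):

    () -> {
        final long[] snapshot = queueLatencyHistogram.getSnapshot();
        List<LongWithAttributes> metricValues = Arrays.stream(LATENCY_PERCENTILES_TO_REPORT)
            .mapToObj(
                percentile -> new LongWithAttributes(
                    queueLatencyHistogram.getPercentile(percentile / 100f, snapshot),
                    Map.of("percentile", String.valueOf(percentile))
                )
            )
            .toList();
        queueLatencyHistogram.clear();
        return metricValues;
    }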

pxsalehi (Member) left a comment:

This all makes sense, but I think separating the refactorings from the new metric would make it easier to review.

DiannaHohensee (Contributor) left a comment:

This all makes sense, but I think separating the refactorings from the new metric would make it easier to review.

++ on splitting out a patch with the HandlingTimeTracker/Tests -> Exponential*/Tests naming refactor, to reduce noise.

We only publish queue latency for thread pools for which EsExecutors.TaskTrackingConfig#trackExecutionTime is true.

This looks like it’ll turn on queue latency tracking for the search thread pools as well as WRITE. Just checking that we want that? I suppose it would be trickier to target just WRITE.


I reviewed the code thoroughly and I think it all functionally makes sense.

I have not yet investigated how the metrics are used, to verify the format we're supplying is satisfactory: do you have any pointers/explanation on that? Otherwise next time (EOD for me) I'll take a look at the history here and the original design document for some leads.

@@ -27,12 +35,16 @@
*/
public final class TaskExecutionTimeTrackingEsThreadPoolExecutor extends EsThreadPoolExecutor {

private static final int[] LATENCY_PERCENTILES_TO_REPORT = { 50, 90, 99 };
public static final String THREAD_POOL_METRIC_NAME_QUEUE_TIME = ".threads.queue.latency.histogram";
Contributor:

Suggested change
public static final String THREAD_POOL_METRIC_NAME_QUEUE_TIME = ".threads.queue.latency.histogram";
public static final String THREAD_POOL_METRIC_NAME_QUEUE_TIME = ".queue.latency.histogram";

I think the full name below would look like "es.thread_pool.write.threads.queue.latency.histogram"?
"threads" seems extraneous, but maybe I'm missing context.

nicktindall (Contributor Author):

Yep, fixed in 6ebc331

}
}

public void addObservation(long observedValue) {
Contributor:

Suggested change
public void addObservation(long observedValue) {
/**
* A negative value will be treated as zero.
*/
public void addObservation(long observedValue) {

Just an edge case I wondered about because theoretically timedRunnable.getQueueTimeNanos() could return -1.

nicktindall (Contributor Author):

It actually shouldn't: notice that we call TimedRunnable#beforeExecute() before we call TimedRunnable#getQueueTimeNanos(); the former sets beforeExecuteTime, which should ensure getQueueTimeNanos() produces a valid value. I've added an assertion to make that clearer in 876ae78.

The javadoc at the top of ExponentialBucketHistogram indicates that the first bucket includes negative infinity.

Contributor:

Agreed, timedRunnable.getQueueTimeNanos() can only return -1 if something goes wrong. It shouldn't happen, but it's possible for it to return -1 if someone uses the TimedRunnable incorrectly (in code added in the future, not the current calling code), so I got to wondering what happens here.

I recently ran into a bug in production that would be caught by a pre-existing assert, but production doesn't run with asserts 🙃 Pros and cons of erroring vs continuing to run 🙂

nicktindall and others added 7 commits May 1, 2025 15:17
nicktindall (Contributor Author):

Thanks for the reviews @pxsalehi @DiannaHohensee @JeremyDahlgren. I addressed your comments in this PR but will now split the histogram changes out to make things clearer. I'll request a re-review when that's done.

nicktindall (Contributor Author):

This looks like it’ll turn on queue latency tracking for the search thread pools as well as WRITE. Just checking that we want that? I suppose it would be trickier to target just WRITE.

It'll actually turn on tracking for quite a few thread pools:

  • WRITE
  • SEARCH
  • SEARCH_COORDINATION
  • AUTO_COMPLETE
  • SYSTEM_WRITE
  • SYSTEM_CRITICAL_WRITE

We could add a flag to enable the tracking in a more targeted way, but I did implement this with minimal overhead in mind, so I don't think it's too much of a concern. Whether it's worth implementing a specific flag for it, I'm not sure. It might be interesting, in the beginning at least, to see how these metrics behave with different workloads.

I have not yet investigated how the metrics are used, to verify the format we're supplying is satisfactory: do you have any pointers/explanation on that? Otherwise next time (EOD for me) I'll take a look at the history here and the original design document for some leads.

The idea was to use utilisation to detect hot-spotting, and queueing as an indicator of the degree of hot-spotting. I mocked up what I think the query might look like to find clusters that are hot-spotting and order them by how hot-spotted they are here. We really only need a number that can be used to rank how hot-spotted a cluster is. Queue delay seems like a good one. Which percentile we end up using will depend on how much queueing we see in these scenarios. That's why I'm publishing a few to begin with.

DiannaHohensee (Contributor) left a comment:

We could add a flag to enable the tracking in a more targeted way, but I did implement this with minimal overhead in mind, so I don't think it's too much of a concern. Whether it's worth implementing a specific flag for it, I'm not sure. It might be interesting, in the beginning at least, to see how these metrics behave with different workloads.

Agreed, seeing it across workloads/threadpools does sound informative. I'm good with the current approach 👍

The idea was to use utilisation to detect hot-spotting, and queueing as an indicator of the degree of hot-spotting. I mocked up what I think the query might look like to find clusters that are hot-spotting and order them by how hot-spotted they are here. We really only need a number that can be used to rank how hot-spotted a cluster is. Queue delay seems like a good one. Which percentile we end up using will depend on how much queueing we see in these scenarios. That's why I'm publishing a few to begin with.

Thanks for the refresher, helpful.

Okay, I'm happy with this -- nice work! Needs a little cleanup after #127597 goes in, and some open threads addressed, but then it'll be good to go 👍


JeremyDahlgren (Contributor) left a comment:

LGTM, thanks for the split in #127597

@nicktindall nicktindall merged commit 62b2e86 into elastic:main May 5, 2025
17 checks passed
@nicktindall nicktindall deleted the ES-10531_add_thread_pool_queue_latency_metric branch May 5, 2025 22:33
Labels
:Core/Infra/Metrics Metrics and metering infrastructure :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Core/Infra Meta label for core/infra team Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0