Publish queue latency metrics from tracked thread pools #120488
Conversation
Closes: ES-10531
        handlingTimeTracker.clear();
        return metricValues;
    }
);
I wonder if, rather than publishing a time-series-per-percentile (using a percentile attribute), we should publish a metric-per-percentile. The metric makes no sense if you don't filter by a percentile label.
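For illustration, a rough sketch of the two shapes being discussed, reusing the registerLongsGauge / LongWithAttributes API that appears elsewhere in this PR (metric names, descriptions and the registerLongGauge variant are assumptions; signatures are approximate):

// Option A (what the PR currently does): one metric; one time series per
// percentile, distinguished by a "percentile" attribute.
meterRegistry.registerLongsGauge(
    "es.thread_pool.write.queue.latency.histogram",
    "Write thread pool queue latency percentiles",
    "ms",
    () -> List.of(
        new LongWithAttributes(histogram.getPercentile(0.50f), Map.of("percentile", "50")),
        new LongWithAttributes(histogram.getPercentile(0.90f), Map.of("percentile", "90"))
    )
);

// Option B: a separate metric per percentile; no attribute filter needed, but
// each percentile is a distinct series to add to a dashboard.
meterRegistry.registerLongGauge(
    "es.thread_pool.write.queue.latency.p90",
    "Write thread pool queue latency, 90th percentile",
    "ms",
    () -> new LongWithAttributes(histogram.getPercentile(0.90f), Map.of())
);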
Is it easier to plot different percentiles on the same graph with labels (and group by) compared to two different time series?
I don't think that makes a difference, but I'm not sure.
Yes, having a look at Kibana just now, it would be much easier to plot as a single metric grouped by the percentiles. As separate metrics we'd need to add them as distinct time series.
private final Function<Runnable, WrappedRunnable> runnableWrapper;
private final ExponentiallyWeightedMovingAverage executionEWMA;
private final LongAdder totalExecutionTime = new LongAdder();
private final boolean trackOngoingTasks;
// The set of currently running tasks and the timestamp of when they started execution in the Executor.
private final Map<Runnable, Long> ongoingTasks = new ConcurrentHashMap<>();
private final HandlingTimeTracker handlingTimeTracker = new HandlingTimeTracker();
By using a HandlingTimeTracker we can publish specific percentile streams. This is as opposed to using an APM histogram metric, which comes with a lot of limitations (it doesn't work in ES|QL, and you can't sort by or filter on it) and is costly to aggregate. The histogram is cleared each time we publish the percentiles, so the percentiles are for samples received since the last publish.
If we agree with this approach, I think it might be worth moving the HandlingTimeTracker somewhere common and giving it a more generic name (e.g. ExponentialBucketHistogram).
HandlingTimeTracker is the simplest possible solution to publishing percentiles. We could get more accurate metrics over a window decoupled from the metric polling interval if we used something like exponential histograms or decaying histograms, though they would likely incur a larger synchronisation overhead.
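To make the shape of that concrete, here is a minimal, self-contained sketch of an exponential-bucket histogram in the spirit of HandlingTimeTracker / ExponentialBucketHistogram (the bucket count, bounds and percentile rule here are illustrative assumptions, not the actual implementation):

import java.util.concurrent.atomic.LongAdder;

class ExponentialBucketHistogramSketch {
    // Assumption: 18 buckets with doubling upper bounds, 1 ms up to 2^17 ms.
    static final int BUCKET_COUNT = 18;
    private final LongAdder[] buckets = new LongAdder[BUCKET_COUNT];

    ExponentialBucketHistogramSketch() {
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LongAdder();
        }
    }

    static int[] bucketUpperBounds() {
        int[] bounds = new int[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            bounds[i] = 1 << i; // 1, 2, 4, 8, ... doubling upper bounds
        }
        return bounds;
    }

    void addObservation(long observedValue) {
        // The last bucket catches everything above the largest bound; the first
        // bucket also absorbs zero and negative values.
        int index = BUCKET_COUNT - 1;
        int[] bounds = bucketUpperBounds();
        for (int i = 0; i < BUCKET_COUNT; i++) {
            if (observedValue <= bounds[i]) {
                index = i;
                break;
            }
        }
        buckets[index].increment();
    }

    long getPercentile(float percentile) {
        // Snapshot the lock-free counters, then walk the buckets until the
        // cumulative count covers the requested fraction of observations.
        long[] snapshot = new long[BUCKET_COUNT];
        long total = 0;
        for (int i = 0; i < BUCKET_COUNT; i++) {
            snapshot[i] = buckets[i].sum();
            total += snapshot[i];
        }
        if (total == 0) {
            return 0; // no observations in this interval
        }
        long remaining = (long) Math.ceil(total * percentile);
        for (int i = 0; i < BUCKET_COUNT; i++) {
            remaining -= snapshot[i];
            if (remaining <= 0) {
                return bucketUpperBounds()[i]; // report the bucket's upper bound
            }
        }
        assert false : "should have found a bucket";
        return Long.MAX_VALUE;
    }

    void clear() {
        // Called after each metric publish so the next interval starts fresh.
        for (LongAdder bucket : buckets) {
            bucket.reset();
        }
    }
}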
    }
    assert false : "We shouldn't ever get here";
    return Long.MAX_VALUE;
}
I opted to implement an "exclusive" percentile because it was easier with the way the counts are stored in HandlingTimeTracker (the arrival of a value increments the count of the first bucket whose lower bound is <= the value). If we wanted to get fancy we could look at interpolation, but I don't think it's necessary.
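As a worked example of what that means in practice (using the illustrative sketch above, with its assumed power-of-two bounds):

ExponentialBucketHistogramSketch h = new ExponentialBucketHistogramSketch();
for (int i = 0; i < 9; i++) {
    h.addObservation(7);      // nine fast tasks land in the <= 8 ms bucket
}
h.addObservation(1000);       // one slow task lands in the <= 1024 ms bucket
long p50 = h.getPercentile(0.50f);  // 8: the upper bound of the bucket covering half the samples
long p99 = h.getPercentile(0.99f);  // 1024: no interpolation between buckets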
public String toCompositeString() {
    return nodeName == null ? threadPoolName : nodeName + "/" + threadPoolName;
    }
}
I added this so we could separate the node name and thread pool name for the purpose of generating valid metric names. For example, es.thread_pool.instance-000003/write.threads.queue.latency.histogram is invalid; we want es.thread_pool.write.threads.queue.latency.histogram instead, so it's better to pass in something structured and unambiguous than to make assumptions about the format of the string.
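Something along these lines (a hypothetical shape, not the exact class added in the diff):

// Hypothetical holder: keeping the parts separate lets the metric name be built
// from the thread pool name alone, while log/stats output can still combine both.
record ThreadPoolIdentity(String nodeName, String threadPoolName) {
    String metricName(String suffix) {
        return "es.thread_pool." + threadPoolName + suffix; // e.g. es.thread_pool.write.queue.latency.histogram
    }

    String toCompositeString() {
        return nodeName == null ? threadPoolName : nodeName + "/" + threadPoolName;
    }
}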
I wonder whether an alternative could be to add a new method to each of ExecutorHolder and TaskExecutionTimeTrackingEsThreadPoolExecutor, something like:
class ExecutorHolder {
    void registerMetrics(MeterRegistry meterRegistry) {
        if (executor instanceof TaskExecutionTimeTrackingEsThreadPoolExecutor te) {
            te.registerMetrics(info.name, meterRegistry);
        }
    }
}

class TaskExecutionTimeTrackingEsThreadPoolExecutor {
    void registerMetrics(String name, MeterRegistry meterRegistry) {
        meterRegistry.registerLongsGauge(...);
    }
}
and call ExecutorHolder#registerMetrics after building it in ThreadPool. I think it avoids the need for this class and the associated cascading changes?
Yes I think you're on to something there. I will review that tomorrow.
OK I've tidied that up now, thanks for the suggestion @ywangd
looks good, did not do a thorough review though.
+ ", failedOrRejected: " | ||
+ failedOrRejected; | ||
if (taskQueueLatency != -1) { | ||
queueWaitTimes.record(TimeValue.timeValueNanos(taskQueueLatency).secondsFrac()); |
I wonder if we should remove the allocation here and just calculate the seconds frac ourselves? I see that record will also allocate though - but I'd like to avoid adding unnecessary overhead here.
Addressed in e7f5bb6
+ ", failedOrRejected: " | ||
+ failedOrRejected; | ||
if (taskQueueLatency != -1) { | ||
queueLatencyHistogram.addObservation(TimeUnit.NANOSECONDS.toMillis(taskQueueLatency)); |
Is this a little late? Seems we could do this in beforeExecute too. May not matter too much?
I also wonder about a queue build-up and how we can get the latency metric to rise quicker. But we can use this for now; we just have to be aware of when the latency is measured. We could see a build-up of latency while queuing may already be going down rapidly when we observe it, i.e. the problem was prior to the latency metric rising heavily.
Probably not too bad for our immediate purpose, but it could be problematic if we rely on this for allocation balancing. Might also be fine - not that delayed.
Yeah, I had done it here just to be able to use the timestamp we already record for startTime, but we usually take one anyway in beforeExecute, so perhaps we can use that.
I also wonder about a queue build-up and how we can get the latency metric to rise quicker. But we can use this for now; we just have to be aware of when the latency is measured. We could see a build-up of latency while queuing may already be going down rapidly when we observe it, i.e. the problem was prior to the latency metric rising heavily.
Yes, it's a good point. I think this will be less of a problem the larger a thread pool is, because for a larger thread pool there should be some tasks moving through fairly regularly, giving us a read of the current latency even if many slots are bogged down with large tasks.
Also, with the metric interval at 60s I think there will be a degree of lateness regardless of how we track this?
I changed it to record the latency in beforeExecute in 2b294bd. It has the minor downside that we take an additional System.nanoTime() timestamp. I considered using the timestamp we already take for the ongoing-task tracking, but was hesitant to do that because nowhere else in TimedRunnable do we expose the timestamps themselves, only the differences between them. If we were to change the source of those timestamps in TimedRunnable (e.g. to use a cached time source or something else) it might be missed and start producing spurious deltas.
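Roughly the shape of that change (a sketch, not the exact commit; anything beyond the identifiers already quoted in this thread is an assumption):

@Override
protected void beforeExecute(Thread t, Runnable r) {
    super.beforeExecute(t, r);
    if (r instanceof TimedRunnable timedRunnable) {
        timedRunnable.beforeExecute();                        // records the execution start timestamp
        long queueNanos = timedRunnable.getQueueTimeNanos();  // creation -> start of execution
        assert queueNanos >= 0 : "queue time should be available once beforeExecute has run";
        queueLatencyHistogram.addObservation(TimeUnit.NANOSECONDS.toMillis(queueNanos));
    }
}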
 */
public final class TaskExecutionTimeTrackingEsThreadPoolExecutor extends EsThreadPoolExecutor {

    private static final int[] LATENCY_PERCENTILES_TO_REPORT = { 50, 90 };
I think we should add 99 too. To some extent, the max value in the interval seems most important.
I wonder if we also want an actual avg, but we can possibly add that later.
At one point I was considering having the current value too. That would be easy to get: just peek at the front element of the queue and diff now minus its creation time. But I think the percentiles, or at least the 99th percentile, also add value, so we can start there.
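For reference, the "current value" variant could look something like this (a sketch; the accessor on TimedRunnable is hypothetical):

// Hypothetical: report how long the task at the head of the queue has been
// waiting right now, i.e. the age of the oldest still-queued task.
long currentQueueLatencyMillis() {
    Runnable head = getQueue().peek();
    if (head instanceof TimedRunnable timed) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - timed.getCreationTimeNanos());
    }
    return 0L; // empty queue, or a task type we don't track
}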
Added p99 in d6e44ed
One more comment.
for (int i = 0; i < BUCKET_COUNT; i++) {
    percentileIndex -= snapshot[i];
    if (percentileIndex <= 0) {
        return getBucketUpperBounds()[i];
I think this means we'll get 2x steps in the values reported. I.e., the 90th percentile will be 2048 or 4096 but never anything in between?
I guess this speaks to reporting just the current value instead as a starting point?
Yes, it is a downside of the exponential bucket histogram. We could perhaps use a different power, or linear buckets?
I thought the value of the histogram over the "current" value is that, because we only get a measure every 60s, the "current" value might not tell the whole story (e.g. with current or p99 we might see a momentary spike in latency, say due to GC, but with ~p90 we give that a chance to be diluted by an otherwise healthy rate of queue processing).
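To make the 2x-step point concrete, using the illustrative sketch from earlier in this thread (assumed power-of-two bounds):

// Observations clustered around ~3 seconds all land in the bucket whose upper
// bound is 4096 ms, so every reported percentile comes out as exactly 4096 ms:
ExponentialBucketHistogramSketch h = new ExponentialBucketHistogramSketch();
for (long latencyMillis : new long[] { 2500, 3000, 3500 }) {
    h.addObservation(latencyMillis);
}
long p90 = h.getPercentile(0.90f); // 4096, never ~3000; the next possible value down is 2048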
It might be good to do the split with the renaming/refactoring so it is clearer what the new changes are in this PR?
() -> {
    List<LongWithAttributes> metricValues = Arrays.stream(LATENCY_PERCENTILES_TO_REPORT)
        .mapToObj(
            percentile -> new LongWithAttributes(
                queueLatencyHistogram.getPercentile(percentile / 100f),
                Map.of("percentile", String.valueOf(percentile))
            )
        )
        .toList();
    queueLatencyHistogram.clear();
    return metricValues;
}
Is it worth taking the histogram snapshot once, and then reusing it with an additional getPercentile(percentile, histogram) method?
It's a good point: if we ask for a series of percentiles while the thread pool is still running we'll potentially get inconsistent measurements, but I don't think it's enough of an issue to add more infrastructure for. It should be quick, and we only take fewer than five percentiles, so we'd be talking about a difference of very few samples between the measurements.
Maybe worth a comment noting that the reporting can miss some entries, but that we don't think it's a concern?
I agree that it's good enough as is (let's get it running), but it's worth keeping in mind that we could be missing something here in case it becomes something we want to improve later.
I've fixed this one now: we take a single snapshot and calculate all the percentiles from it (see 5c6a4a7).
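The shape of that fix, sketched against the publishing lambda above (getSnapshot and the two-argument getPercentile are assumed names based on the suggestion, not necessarily the real signatures):

// Take one snapshot of the bucket counts and compute every requested percentile
// from it, so the reported p50/p90/p99 are mutually consistent even while the
// thread pool keeps recording new observations.
long[] snapshot = queueLatencyHistogram.getSnapshot();
List<LongWithAttributes> metricValues = Arrays.stream(LATENCY_PERCENTILES_TO_REPORT)
    .mapToObj(percentile -> new LongWithAttributes(
        queueLatencyHistogram.getPercentile(percentile / 100f, snapshot),
        Map.of("percentile", String.valueOf(percentile))
    ))
    .toList();
queueLatencyHistogram.clear();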
This all makes sense, but I think separating the refactorings from the new metric would be easier to review.
This all makes sense, but I think separating the refactorings from the new metric would be easier to review.
++ splitting into a patch with the HandlingTimeTracker/Tests -> Exponential*/Tests naming refactor to reduce noise.
We only publish queue latency for thread pools for which EsExecutors.TaskTrackingConfig#trackExecutionTime is true.
This looks like it’ll turn on queue latency tracking for the search thread pools as well as WRITE. Just checking that we want that? I suppose it would be trickier to target just WRITE.
I reviewed the code thoroughly and I think it all functionally makes sense.
I have not yet investigated how the metrics are used, to verify the format we're supplying is satisfactory: do you have any pointers/explanation on that? Otherwise next time (EOD for me) I'll take a look at the history here and the original design document for some leads.
 */
public final class TaskExecutionTimeTrackingEsThreadPoolExecutor extends EsThreadPoolExecutor {

    private static final int[] LATENCY_PERCENTILES_TO_REPORT = { 50, 90, 99 };
    public static final String THREAD_POOL_METRIC_NAME_QUEUE_TIME = ".threads.queue.latency.histogram";
Suggested change:
- public static final String THREAD_POOL_METRIC_NAME_QUEUE_TIME = ".threads.queue.latency.histogram";
+ public static final String THREAD_POOL_METRIC_NAME_QUEUE_TIME = ".queue.latency.histogram";
I think the full name below would look like "es.thread_pool.write.threads.queue.latency.histogram"? "threads" seems extraneous, but maybe I'm missing context.
Yep, fixed in 6ebc331
    }
}

public void addObservation(long observedValue) {
Suggested change: add a javadoc above the method:
/**
 * A negative value will be treated as zero.
 */
public void addObservation(long observedValue) {
Just an edge case I wondered about because theoretically timedRunnable.getQueueTimeNanos() could return -1.
It actually shouldn't: notice that we call TimedRunnable#beforeExecute() before we call TimedRunnable#getQueueTimeNanos(). The former sets beforeExecuteTime, which should ensure getQueueTimeNanos() produces a valid value. I've added an assertion to make that clearer in 876ae78.
The javadoc at the top of ExponentialBucketHistogram indicates that the first bucket includes negative infinity.
Agreed, timedRunnable.getQueueTimeNanos() can only return -1 if something goes wrong. It shouldn't happen, but it is possible to get -1 if someone uses TimedRunnable incorrectly (in code added in the future, not the current calling code), so I got to wondering what would happen here.
I recently ran into a bug in production that would have been caught by a pre-existing assert, but production doesn't run with asserts 🙃 Pros and cons of erroring vs continuing to run 🙂
Thanks for the reviews @pxsalehi @DiannaHohensee @JeremyDahlgren. I addressed your comments in this PR but will now split the histogram changes out to make things clearer. I'll request a re-review when that's done.
It'll actually turn on tracking for quite a few thread pools:
We could add a flag to add the tracking in a more targeted way, but I did implement this with minimal overhead in mind, so I don't think it's too much of a concern. Whether it's worth implementing a specific flag for it I'm not sure. It might be interesting in the beginning at least to see how these metrics behave with different workloads.
The idea was to use utilisation to detect hot-spotting, and queueing as an indicator of the degree-of-hotspotting. I mocked up what I think the query might look like to find clusters that are hot-spotting and order them by how hot-spotted they are here. We really only need a number that can be used to rank how hot-spotted a cluster is. Queue delay seems like a good one. Which percentile we end up using will depend on how much queueing we see in these scenarios. That's why I'm publishing a few to begin with.
We could add a flag to add the tracking in a more targeted way, but I did implement this with minimal overhead in mind, so I don't think it's too much of a concern. Whether it's worth implementing a specific flag for it I'm not sure. It might be interesting in the beginning at least to see how these metrics behave with different workloads.
Agreed, seeing it across workloads/threadpools does sound informative. I'm good with the current approach 👍
The idea was to use utilisation to detect hot-spotting, and queueing as an indicator of the degree-of-hotspotting. I mocked up what I think the query might look like to find clusters that are hot-spotting and order them by how hot-spotted they are here. We really only need a number that can be used to rank how hot-spotted a cluster is. Queue delay seems like a good one. Which percentile we end up using will depend on how much queueing we see in these scenarios. That's why I'm publishing a few to begin with.
Thanks for the refresher, helpful.
Okay, I'm happy with this -- nice work! Needs a little cleanup after #127597 goes in, and some open threads addressed, but then it'll be good to go 👍
LGTM, thanks for the split in #127597
We only publish queue latency for thread pools for which EsExecutors.TaskTrackingConfig#trackExecutionTime is true.
We maintain an in-memory ExponentialBucketHistogram for each monitored thread pool to track the queue latencies of the tasks it executes. Each time we poll for metrics we publish a hard-coded set of percentiles (I put the 50th and 90th to begin with) as gauge values. This makes querying possible with ES|QL and allows ordering/filtering on those values. After we've published the values we clear the histogram to start collecting observations for the next interval.
Closes: ES-10531