-
Notifications
You must be signed in to change notification settings - Fork 14.5k
KAFKA-19493: Incorrect rate metric with larger window size #20152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
@@ -1407,7 +1408,8 @@ public GroupCoordinatorMetrics(Metrics metrics, String metricGrpPrefix) { | |||
this.metricGrpName, | |||
"The number of successful rebalance events per hour, each event is composed of " + | |||
"several failed re-trials until it succeeded"), | |||
new Rate(TimeUnit.HOURS, new WindowedCount()) | |||
new Rate(TimeUnit.HOURS, new WindowedCount()), | |||
new MetricConfig().timeWindow(1, TimeUnit.HOURS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The '1' acts as a magic number; a relevant configuration item may need to be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but the spec says "events per hour".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. The code looks sensible to me, but I would like to see at least one test that tests the behaviour of an hour-long window by advancing the time appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. Some comments.
@@ -1407,7 +1408,8 @@ public GroupCoordinatorMetrics(Metrics metrics, String metricGrpPrefix) { | |||
this.metricGrpName, | |||
"The number of successful rebalance events per hour, each event is composed of " + | |||
"several failed re-trials until it succeeded"), | |||
new Rate(TimeUnit.HOURS, new WindowedCount()) | |||
new Rate(TimeUnit.HOURS, new WindowedCount()), | |||
new MetricConfig().timeWindow(1, TimeUnit.HOURS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any down side of losing the original attributes set in metrics config like recordLevel
, number of samples
and tags
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than fixing by providing another config object, will it not be appropriate to fix the windowSize
method in Rate by also considering the timeUnit
, rather only looking at config.timeWindowMs()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second this. We should fix it in Rate
instead of providing a new MetricConfig. It might raise more problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if I might be overlooking something. Perhaps merely adjusting the calculation method in windowSize
won't resolve the issue. The root cause lies in the time window being too short to retain data for the past hour, while we need to know exactly how many rebalance have occurred within the last hour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your point, but an alternative approach might be to add a new constructor in MetricConfig
instead. This is because the Rate class itself isn't aware of MetricConfig
or timeWindowMs
—it only receives this value as a parameter when the method is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, looks like you replied just after I deleted my earlier comment. I’ve rethought the approach.
The current patch feels more like a workaround.
Instead, we should probably create a variable:
timeWindowMs = Math.max(unit.toMillis(1), config.timeWindowMs())
in Rate
to determine window size, and then pass this timeWindowMs
to SampledStat rather than using config.timeWindowMs()
directly. Additionally, I think other MeasurableStat
that use config.timeWindowMs()
besides Rate
might have the same "larger window size" issue.
Let’s see if we have agreement here or any other ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. It would be great to add some UT to RateTest.java.
@@ -1407,7 +1408,8 @@ public GroupCoordinatorMetrics(Metrics metrics, String metricGrpPrefix) { | |||
this.metricGrpName, | |||
"The number of successful rebalance events per hour, each event is composed of " + | |||
"several failed re-trials until it succeeded"), | |||
new Rate(TimeUnit.HOURS, new WindowedCount()) | |||
new Rate(TimeUnit.HOURS, new WindowedCount()), | |||
new MetricConfig().timeWindow(1, TimeUnit.HOURS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second this. We should fix it in Rate
instead of providing a new MetricConfig. It might raise more problems.
Currently, it appears that the small sampling window size is causing distortion in metrics with larger time windows.
The default configuration samples two windows, each 30 seconds long. If rebalances occur twice—once at the 1st minute and again at the 59th minute—the expected value for rebalance-rate-per-hour should be 2.
However, when calculating at the 59th minute:
Thus, rebalance-rate-per-hour incorrectly reports 60.
This patch manually specifies the timeWindow as 1 hour when registering hourly monitoring metrics.