Skip to content

KAFKA-19493: Incorrect rate metric with larger window size #20152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: trunk
Choose a base branch
from

Conversation

DL1231
Copy link
Contributor

@DL1231 DL1231 commented Jul 11, 2025

Currently, it appears that the small sampling window size is causing distortion in metrics with larger time windows.
The default configuration samples two windows, each 30 seconds long. If rebalances occur twice—once at the 1st minute and again at the 59th minute—the expected value for rebalance-rate-per-hour should be 2.
However, when calculating at the 59th minute:

  • The window length is 2 × 30s = 60 seconds
  • The system only detects 1 rebalance (the latest event within the 60s window)
  • The calculation becomes: 1 / 60s = 60 rebalances per hour
    Thus, rebalance-rate-per-hour incorrectly reports 60.

This patch manually specifies the timeWindow as 1 hour when registering hourly monitoring metrics.

@github-actions github-actions bot added triage PRs from the community consumer clients small Small PRs labels Jul 11, 2025
@@ -1407,7 +1408,8 @@ public GroupCoordinatorMetrics(Metrics metrics, String metricGrpPrefix) {
this.metricGrpName,
"The number of successful rebalance events per hour, each event is composed of " +
"several failed re-trials until it succeeded"),
new Rate(TimeUnit.HOURS, new WindowedCount())
new Rate(TimeUnit.HOURS, new WindowedCount()),
new MetricConfig().timeWindow(1, TimeUnit.HOURS)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The '1' acts as a magic number; a relevant configuration item may need to be added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but the spec says "events per hour".

Copy link
Member

@AndrewJSchofield AndrewJSchofield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. The code looks sensible to me, but I would like to see at least one test that tests the behaviour of an hour-long window by advancing the time appropriately.

Copy link
Contributor

@apoorvmittal10 apoorvmittal10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Some comments.

@@ -1407,7 +1408,8 @@ public GroupCoordinatorMetrics(Metrics metrics, String metricGrpPrefix) {
this.metricGrpName,
"The number of successful rebalance events per hour, each event is composed of " +
"several failed re-trials until it succeeded"),
new Rate(TimeUnit.HOURS, new WindowedCount())
new Rate(TimeUnit.HOURS, new WindowedCount()),
new MetricConfig().timeWindow(1, TimeUnit.HOURS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any down side of losing the original attributes set in metrics config like recordLevel, number of samples and tags?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than fixing by providing another config object, will it not be appropriate to fix the windowSize method in Rate by also considering the timeUnit, rather only looking at config.timeWindowMs()?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second this. We should fix it in Rate instead of providing a new MetricConfig. It might raise more problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if I might be overlooking something. Perhaps merely adjusting the calculation method in windowSize won't resolve the issue. The root cause lies in the time window being too short to retain data for the past hour, while we need to know exactly how many rebalance have occurred within the last hour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your point, but an alternative approach might be to add a new constructor in MetricConfig instead. This is because the Rate class itself isn't aware of MetricConfig or timeWindowMs—it only receives this value as a parameter when the method is called.

Copy link
Collaborator

@Yunyung Yunyung Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, looks like you replied just after I deleted my earlier comment. I’ve rethought the approach.

The current patch feels more like a workaround.
Instead, we should probably create a variable:
timeWindowMs = Math.max(unit.toMillis(1), config.timeWindowMs()) in Rate to determine window size, and then pass this timeWindowMs to SampledStat rather than using config.timeWindowMs() directly. Additionally, I think other MeasurableStat that use config.timeWindowMs() besides Rate might have the same "larger window size" issue.

Let’s see if we have agreement here or any other ideas.

Copy link
Collaborator

@Yunyung Yunyung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. It would be great to add some UT to RateTest.java.

@@ -1407,7 +1408,8 @@ public GroupCoordinatorMetrics(Metrics metrics, String metricGrpPrefix) {
this.metricGrpName,
"The number of successful rebalance events per hour, each event is composed of " +
"several failed re-trials until it succeeded"),
new Rate(TimeUnit.HOURS, new WindowedCount())
new Rate(TimeUnit.HOURS, new WindowedCount()),
new MetricConfig().timeWindow(1, TimeUnit.HOURS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second this. We should fix it in Rate instead of providing a new MetricConfig. It might raise more problems.

@github-actions github-actions bot removed the triage PRs from the community label Jul 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants