KAFKA-19457: Make share group init retry interval configurable. #20104

smjn · 2025-07-04T09:23:09Z

While creating share group init requests in
GroupMetadataManager.shareGroupHeartbeat, we check for topics in the
initializing state and, if they have been in that state for longer than
a certain interval, we issue retry requests for them.
The interval for considering initializing topics as old was based on
offsetCommitTimeoutMs and was not configurable.
In this PR, we remedy the situation by introducing a new config to
supply the value. The default is 30_000 ms, which is a
heuristic based on the fact that the share coordinator persister
retries requests with exponential backoff, with an upper cap of 30_000
ms.
Tests have been updated wherever applicable.
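
The age check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `InitRetrySketch`, the method `topicsToRetry`, the map of topic name to initialization start time, and the use of `>=` for the comparison are all hypothetical and are not taken from GroupMetadataManager.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the actual GroupMetadataManager code.
final class InitRetrySketch {

    // Returns the topics whose initialization started at least
    // retryIntervalMs before nowMs (assumed comparison: >=).
    static List<String> topicsToRetry(Map<String, Long> initializingSinceMs,
                                      long nowMs,
                                      long retryIntervalMs) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> entry : initializingSinceMs.entrySet()) {
            if (nowMs - entry.getValue() >= retryIntervalMs) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Made-up topics and timestamps, purely for demonstration.
        Map<String, Long> initializing = new LinkedHashMap<>();
        initializing.put("orders", 0L);        // initializing for 40s at nowMs
        initializing.put("payments", 25_000L); // initializing for 15s at nowMs
        System.out.println(topicsToRetry(initializing, 40_000L, 30_000L));
    }
}
```

With a 30_000 ms interval, only the topic that has been initializing for 40 seconds is selected for a retry; the one that has been initializing for 15 seconds is left alone.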

Reviewers: Apoorv Mittal, Lan Ding, TaiJuWu, Andrew Schofield

apoorvmittal10

Thanks for the PR, LGTM!

AndrewJSchofield

Thanks for the PR. Just one comment to address.

AndrewJSchofield · 2025-07-04T11:45:02Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorConfig.java

@@ -484,6 +493,9 @@ public GroupCoordinatorConfig(AbstractConfig config) {
        require(shareGroupSessionTimeoutMs <= shareGroupMaxSessionTimeoutMs,
            String.format("%s must be less than or equal to %s",
                SHARE_GROUP_SESSION_TIMEOUT_MS_CONFIG, SHARE_GROUP_MAX_SESSION_TIMEOUT_MS_CONFIG));
+        require(shareGroupInitializeRetryIntervalMs >= offsetCommitTimeoutMs,


I think this needs to be changed a little. The problem is that you've added a new internal (and thus undocumented) configuration to allow configuration of something which should in principle not need tweaking. That's fine. However, the validation of the configs can now break if someone changes a documented configuration, and they will not know about the internal configuration. It would be better to ensure in the code that the shareGroupInitializeRetryIntervalMs is not smaller than the offsetCommitTimeoutMs just using assignment to an appropriate value.
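
To make the difference between the two approaches concrete, here is a small sketch. The class `RetryIntervalSketch` and the methods `validated` and `clamped` are hypothetical names used only for illustration; they are not part of GroupCoordinatorConfig.

```java
// Hypothetical illustration of validation versus assignment; names are assumptions.
final class RetryIntervalSketch {

    // Validation: rejects the configuration if the internal value is
    // below the documented offset commit timeout.
    static int validated(int retryIntervalMs, int offsetCommitTimeoutMs) {
        if (retryIntervalMs < offsetCommitTimeoutMs) {
            throw new IllegalArgumentException(
                "retry interval must be greater than or equal to the offset commit timeout");
        }
        return retryIntervalMs;
    }

    // Assignment: raises the internal value to the documented one
    // when needed, so the configuration is never rejected.
    static int clamped(int retryIntervalMs, int offsetCommitTimeoutMs) {
        return Math.max(retryIntervalMs, offsetCommitTimeoutMs);
    }

    public static void main(String[] args) {
        // An operator raises the documented timeout to 60s without knowing
        // about the internal 30s setting.
        System.out.println("clamped: " + clamped(30_000, 60_000));
        try {
            validated(30_000, 60_000);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With validation, changing only the documented config can make startup fail on a setting the operator has never seen. With assignment, the internal value quietly follows the documented one.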

smjn · 2025-07-05T08:46:26Z

@AndrewJSchofield Thanks for the comments, incorporated.

DL1231

Thanks for the patch, LGTM!

TaiJuWu

Thanks for this patch. I left a comment.

TaiJuWu · 2025-07-06T04:48:11Z

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupCoordinatorConfig.java

@@ -437,6 +445,8 @@ public GroupCoordinatorConfig(AbstractConfig config) {
        this.shareGroupMaxHeartbeatIntervalMs = config.getInt(GroupCoordinatorConfig.SHARE_GROUP_MAX_HEARTBEAT_INTERVAL_MS_CONFIG);
        this.shareGroupMaxSize = config.getInt(GroupCoordinatorConfig.SHARE_GROUP_MAX_SIZE_CONFIG);
        this.shareGroupAssignors = shareGroupAssignors(config);
+        int initializeRetryMs = config.getInt(GroupCoordinatorConfig.SHARE_GROUP_INITIALIZE_RETRY_INTERVAL_MS_CONFIG);
+        this.shareGroupInitializeRetryIntervalMs = Math.max(initializeRetryMs, this.offsetCommitTimeoutMs);


Should this comparison be mentioned in SHARE_GROUP_INITIALIZE_RETRY_INTERVAL_MS_DOC for clarification?

AndrewJSchofield · 2025-07-08T13:35:01Z

@smjn The code change looks good to me. Please could you triage the failed tests and make sure they are appropriately tracked. Thanks.

smjn · 2025-07-08T17:56:31Z

Unrelated Flaky/failed tests:

FAILED ❌ LogRecoveryTest > testHWCheckpointWithFailuresMultipleLogSegments() - already tracked in https://issues.apache.org/jira/browse/KAFKA-19452

FLAKY ⚠️ AuthorizerIntegrationTest > testConsumerGroupHeartbeatWithRegex() - https://issues.apache.org/jira/browse/KAFKA-19481
FLAKY ⚠️ KafkaStreamsTelemetryIntegrationTest > "shouldPassMetrics(String, boolean, String).topologyType=complex, stateUpdaterEnabled=true, groupProtocol=streams" - https://issues.apache.org/jira/browse/KAFKA-19482
FLAKY ⚠️ RemoteIndexCacheTest > testConcurrentCacheDeletedFileExists() - https://issues.apache.org/jira/browse/KAFKA-19483

cc: @AndrewJSchofield

TaiJuWu

LGTM, thanks for this patch.

KAFKA-19457: Make share group init retry interval configurable.

37db9d3

smjn requested a review from AndrewJSchofield July 4, 2025 09:23

github-actions bot added the triage (PRs from the community), group-coordinator, and small (Small PRs) labels Jul 4, 2025

smjn added the KIP-932 (Queues for Kafka) and ci-approved labels and removed the triage (PRs from the community) label Jul 4, 2025

hardcoded default

1b3d057

apoorvmittal10 approved these changes Jul 4, 2025


AndrewJSchofield requested changes Jul 4, 2025


inc comments

4a0b2fd

smjn requested a review from AndrewJSchofield July 5, 2025 08:47

DL1231 approved these changes Jul 6, 2025


TaiJuWu reviewed Jul 6, 2025


smjn added 2 commits July 8, 2025 23:27

inc comments

cfc6639

Merge remote-tracking branch 'apache-kafka/trunk' into KAFKA-19457

7147a5b

smjn requested a review from TaiJuWu July 8, 2025 18:37

TaiJuWu approved these changes Jul 9, 2025


AndrewJSchofield approved these changes Jul 9, 2025


AndrewJSchofield merged commit 8aa5eae into apache:trunk Jul 9, 2025
23 checks passed
