
KAFKA-19227: Piggybacked share fetch acknowledgements performance issue #19612


Merged
merged 2 commits into apache:trunk on May 6, 2025

Conversation

@apoorvmittal10 (Contributor) commented May 1, 2025

The PR fixes an issue that arises when ShareAcknowledgements are piggybacked on ShareFetch. The current default client configuration sets the batch size and max fetch records according to the `max.poll.records` config (default 500), which means all records from a single poll are fetched and then acknowledged together. The default limit on in-flight records for a partition is 200, so previously fetched records have to be acknowledged before another batch can be fetched from the share partition.
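
As a rough illustration (not taken from this PR), the client side of that default setup might look like the sketch below; `ConsumerConfig.MAX_POLL_RECORDS_CONFIG` is the standard consumer property, and passing these properties to a share consumer is an assumption of the sketch:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class ShareConsumerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-share-group");
        // Default 500: batch size and max fetch records follow this value, so
        // one poll's worth of records is fetched and then acknowledged together
        // when acknowledgements are piggybacked on the next fetch.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
        // Broker side, the in-flight record limit per share partition defaults
        // to 200 (as described above), so those piggybacked acknowledgements
        // must land before the next batch can be fetched from that partition.
        // (props would then be passed to the share consumer constructor.)
    }
}
```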

The piggybacked share fetch and acknowledgement calls from KafkaApis are async, and the responses are combined later. If the share fetch starts waiting in purgatory because the in-flight records are currently full, then when the startOffset is moved as part of an acknowledgement, a trigger should fire that tries to complete any pending share fetch requests in purgatory. Otherwise the share fetch requests sit in purgatory until they time out even though records are available, which degrades share fetch performance.
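
A minimal sketch of the idea behind the fix (class and method names here are hypothetical stand-ins, not the actual broker code): when an acknowledgement advances the startOffset and frees in-flight capacity, the share partition proactively asks purgatory to re-check any delayed share fetch keyed on it.

```java
// Hypothetical sketch of the trigger; not the actual SharePartition code.
final class SharePartitionSketch {
    /** Hypothetical stand-in for the delayed share fetch purgatory. */
    interface Purgatory {
        void checkAndComplete(Object key); // re-check delayed operations for this key
    }

    private final Purgatory purgatory;
    private final Object delayedShareFetchKey; // per share-partition key

    SharePartitionSketch(Purgatory purgatory, Object delayedShareFetchKey) {
        this.purgatory = purgatory;
        this.delayedShareFetchKey = delayedShareFetchKey;
    }

    void acknowledge(/* acknowledged offsets */) {
        boolean startOffsetMoved = applyAcknowledgements();
        if (startOffsetMoved) {
            // Without this trigger, a share fetch parked in purgatory because the
            // in-flight record limit was reached would only complete on timeout,
            // even though capacity is now available.
            purgatory.checkAndComplete(delayedShareFetchKey);
        }
    }

    private boolean applyAcknowledgements() {
        // Update in-flight batch state and possibly advance startOffset.
        return true; // placeholder
    }
}
```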

The regular fetch has a single criterion for landing requests in purgatory, the min bytes criterion, so any produce to the respective topic partition triggers a check of pending fetch requests. But a share fetch can wait in purgatory for multiple reasons: 1) min bytes, 2) in-flight records exhaustion, 3) share partition fetch lock contention. The trigger already happens for 1, and this PR fixes 2. We will investigate further whether any handling is required for 3.
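
For context, a simplified sketch of how a delayed share fetch's try-complete check could consult the three conditions above (illustrative only; the method names are assumptions, not the actual DelayedShareFetch implementation):

```java
// Illustrative only: the three reasons a share fetch can stay in purgatory.
abstract class DelayedShareFetchSketch {
    boolean tryCompleteSketch() {
        if (!minBytesSatisfied())                 // reason 1: min bytes (trigger already exists on produce)
            return false;
        if (!inFlightCapacityAvailable())         // reason 2: in-flight records exhausted (this PR adds the trigger)
            return false;
        if (!tryAcquireSharePartitionFetchLock()) // reason 3: fetch lock contention (follow-up investigation)
            return false;
        forceComplete();
        return true;
    }

    abstract boolean minBytesSatisfied();
    abstract boolean inFlightCapacityAvailable();
    abstract boolean tryAcquireSharePartitionFetchLock();
    abstract void forceComplete();
}
```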

Reviewers: Abhinav Dixit [email protected], Andrew Schofield
[email protected]

@github-actions bot added the labels triage (PRs from the community), core (Kafka Broker), KIP-932 (Queues for Kafka), and small (Small PRs) on May 1, 2025
@apoorvmittal10 added the ci-approved label and removed the triage (PRs from the community) label on May 1, 2025
@@ -2405,13 +2424,10 @@ && checkForStartOffsetWithinBatch(inFlightBatch.firstOffset(), inFlightBatch.las
lock.writeLock().unlock();
A contributor commented:
should we have a maybeCompleteDelayedShareFetchRequest in this finally block as well?

@apoorvmittal10 (Contributor Author) replied:
There is already code at line 2439 in this method that triggers it. On release, the cache is updated first, and the persister call only records the correct delivery count, so the notification to purgatory is not tied to the persister.
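
A rough sketch of the ordering described here (the method names are hypothetical, not the actual code at line 2439):

```java
// Hypothetical sketch of the release ordering described in the reply above.
abstract class ReleaseOrderingSketch {
    void releaseAcquiredRecords() {
        updateInFlightCache();                   // 1. cache is updated first (startOffset may move)
        maybeCompleteDelayedShareFetchRequest(); // 2. purgatory is notified here, independent of the persister
        persistDeliveryCountAsync();             // 3. persister call only records the correct delivery count
    }

    abstract void updateInFlightCache();
    abstract void maybeCompleteDelayedShareFetchRequest();
    abstract void persistDeliveryCountAsync();
}
```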

@adixitconfluent (Contributor) commented May 2, 2025

@apoorvmittal10 Thanks for making the change. The change makes sense to me. Can you post the metrics evidence that this is the part of the code causing the problem?

@apoorvmittal10 (Contributor Author) commented May 2, 2025

> @apoorvmittal10 Thanks for making the change. The change makes sense to me. Can you post the metrics evidence that this is the part of the code causing the problem?

You should look at the ExpiresPerSec metric for DelayedShareFetch, with and without produce traffic.

@adixitconfluent (Contributor) left a comment:

LGTM

@AndrewJSchofield (Member) left a comment:

Thanks for the PR. We should increase the locks per share-partition to improve sharing with multiple consumers too.

@apoorvmittal10 (Contributor Author) replied:
> Thanks for the PR. We should increase the locks per share-partition to improve sharing with multiple consumers too.

Sure, I'll update it as part of https://issues.apache.org/jira/browse/KAFKA-19245

@apoorvmittal10 merged commit ac9520b into apache:trunk on May 6, 2025
23 checks passed
shmily7829 pushed a commit to shmily7829/kafka that referenced this pull request May 7, 2025