KAFKA-19160;KAFKA-19164; Improve performance of fetching stable offsets #19497


Open

squah-confluent wants to merge 9 commits into trunk

Conversation

squah-confluent (Contributor) commented Apr 16, 2025

When fetching stable offsets in the group coordinator, we iterate over
all requested partitions. For each partition, we iterate over the
group's ongoing transactions to check if there is a pending
transactional offset commit for that partition.

This can get slow when there are a large number of partitions and a
large number of pending transactions. Instead, maintain a list of
pending transactions per partition to speed up lookups.
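A minimal sketch of the idea, using plain HashMaps in place of the coordinator's snapshottable TimelineHashMap/TimelineHashSet collections; the field name follows the PR, while the surrounding class and method names are illustrative:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch only: plain collections stand in for the
    // coordinator's TimelineHashMap/TimelineHashSet.
    class OpenTransactionsIndexSketch {
        // group id -> topic name -> partition id -> open producer ids
        private final Map<String, Map<String, Map<Integer, Set<Long>>>> openTransactionsByGroupTopicAndPartition =
            new HashMap<>();

        // Maintained whenever a transactional offset commit is replayed.
        void addOpenTransaction(String groupId, String topic, int partition, long producerId) {
            openTransactionsByGroupTopicAndPartition
                .computeIfAbsent(groupId, g -> new HashMap<>())
                .computeIfAbsent(topic, t -> new HashMap<>())
                .computeIfAbsent(partition, p -> new HashSet<>())
                .add(producerId);
        }

        // Before this PR: O(#open transactions) per requested partition.
        // With the index: two map hops plus one partition lookup.
        boolean hasPendingTransactionalOffsets(String groupId, String topic, int partition) {
            Map<String, Map<Integer, Set<Long>>> byTopic =
                openTransactionsByGroupTopicAndPartition.get(groupId);
            if (byTopic == null) return false;
            Map<Integer, Set<Long>> byPartition = byTopic.get(topic);
            return byPartition != null && byPartition.containsKey(partition);
        }
    }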

github-actions bot added the triage (PRs from the community) label Apr 16, 2025
@@ -194,9 +194,16 @@ public OffsetMetadataManager build() {

/**
* The open transactions (producer ids) keyed by group.
* Tracks whether groups have any open transactions.
*/
private final TimelineHashMap<String, TimelineHashSet<Long>> openTransactionsByGroup;
Contributor Author

I wanted to replace this map entirely, but we use it in cleanupExpiredOffsets to check whether a group has any pending transactions.

Member

Can we use openTransactionsByGroupTopicAndPartition for this too? If the group is present in openTransactionsByGroupTopicAndPartition, it means that it has an open transaction.

Contributor Author

We clear entries from that map when offsets are deleted non-transactionally. If all offsets in a transaction are deleted, then there can be an open transaction for a group without a record of it in openTransactionsByGroupTopicAndPartition.
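A tiny simulation of that corner case, with plain maps standing in for the timeline collections (the group, topic, and producer id values are made up):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class OpenTransactionGapDemo {
        public static void main(String[] args) {
            Map<String, Set<Long>> openTransactionsByGroup = new HashMap<>();
            Map<String, Map<String, Map<Integer, Set<Long>>>> openTransactionsByGroupTopicAndPartition =
                new HashMap<>();

            // Producer 42 has a pending transactional offset commit for grp/topicA/0.
            openTransactionsByGroup.computeIfAbsent("grp", g -> new HashSet<>()).add(42L);
            openTransactionsByGroupTopicAndPartition
                .computeIfAbsent("grp", g -> new HashMap<>())
                .computeIfAbsent("topicA", t -> new HashMap<>())
                .computeIfAbsent(0, p -> new HashSet<>())
                .add(42L);

            // All of grp's offsets are deleted non-transactionally, clearing the
            // per-partition entries (and, with them, the whole group subtree).
            openTransactionsByGroupTopicAndPartition.remove("grp");

            // The transaction is still open, so only the group-level map can
            // still answer "does grp have any open transactions?".
            System.out.println(openTransactionsByGroupTopicAndPartition.containsKey("grp")); // false
            System.out.println(openTransactionsByGroup.containsKey("grp"));                  // true
        }
    }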

Contributor Author

I've removed the 2nd map of group ids -> producer ids. Now we only track (group id, topic, partition) -> producer ids. As a side effect, it fixes KAFKA-19164.

@shaan150 left a comment

Left a few comments; they're mostly due to where the file was when I picked it up. That said, changes in areas like this are a good opportunity to improve structure as well. I get that it might add a bit of time now, but it'll pay off long-term by reducing complexity and easing future changes.

}
TimelineHashMap<String, TimelineHashMap<Integer, TimelineHashSet<Long>>> openTransactionsByTopic =
    openTransactionsByGroupTopicAndPartition.get(groupId);
if (openTransactionsByTopic != null) {


There are three nested for loops; this can be avoided.

Something like the following might help (written quickly, untested):

var openTransactionsByTopic = openTransactionsByGroupTopicAndPartition.get(groupId);
if (openTransactionsByTopic == null) return;

for (var topicEntry : openTransactionsByTopic.entrySet()) {
    String topic = topicEntry.getKey();
    var openTransactionsByPartition = topicEntry.getValue();

    for (var partitionEntry : openTransactionsByPartition.entrySet()) {
        int partition = partitionEntry.getKey();

        // Single check per partition
        if (!hasCommittedOffset(groupId, topic, partition)) {
            records.add(GroupCoordinatorRecordHelpers.newOffsetCommitTombstoneRecord(groupId, topic, partition));
            numDeletedOffsets.getAndIncrement();
        }
    }
}

Contributor Author

I missed this when refactoring the code, thank you. The previous code had the same bug, though it was less obvious there. I've fixed it now.

}
});
}
}


This whole area is pretty badly structured; maintainability will prove difficult and the code is potentially fragile.

I would split this out into methods and tidy some of it up. Here's an example of how I'd structure it; bear in mind this code isn't tested and is more of a boilerplate example:

    public void clearOpenTransactions(final String groupId, final String topic, final int partition) {
        final TimelineHashMap<Integer, TimelineHashSet<Long>> partitionMap = getPartitionMap(groupId, topic);
        if (partitionMap == null) return;
    
        final TimelineHashSet<Long> openProducerIds = partitionMap.get(partition);
        if (openProducerIds == null) return;
    
        removePendingOffsets(openProducerIds, groupId, topic, partition);
    
        partitionMap.remove(partition);
        cleanupIfEmpty(partitionMap, getTopicMap(groupId), topic);
        cleanupIfEmpty(getTopicMap(groupId), openTransactionsByGroupTopicAndPartition, groupId);
    }
    
    private TimelineHashMap<Integer, TimelineHashSet<Long>> getPartitionMap(final String groupId, final String topic) {
        final TimelineHashMap<String, TimelineHashMap<Integer, TimelineHashSet<Long>>> topicMap =
            openTransactionsByGroupTopicAndPartition.get(groupId);
        if (topicMap == null) return null;
        return topicMap.get(topic);
    }
    
    private TimelineHashMap<String, TimelineHashMap<Integer, TimelineHashSet<Long>>> getTopicMap(final String groupId) {
        return openTransactionsByGroupTopicAndPartition.get(groupId);
    }
    
    private void removePendingOffsets(
        final Set<Long> producerIds, 
        final String groupId, 
        final String topic, 
        final int partition) {
        for (final Long producerId : producerIds) {
            final Offsets offsets = pendingTransactionalOffsets.get(producerId);
            if (offsets != null) {
                offsets.remove(groupId, topic, partition);
            }
        }
    }
    
    private <K, V extends Map<?, ?>> void cleanupIfEmpty(final V innerMap, final Map<K, V> outerMap, final K key) {
        if (innerMap != null && innerMap.isEmpty()) {
            outerMap.remove(key);
        }
    }

}
}
});
});
});


The previous point stands. Here's roughly how I'd do it; the cleanupIfEmpty method from earlier could be reused if implemented:

public void clearOpenTransactionsForProducer(final long producerId, final PendingOffsets pendingOffsets) {
    for (final Map.Entry<String, Map<String, Map<Integer, OffsetAndMetadata>>> groupEntry : pendingOffsets.offsetsByGroup.entrySet()) {
        final String groupId = groupEntry.getKey();
        final Map<String, Map<Integer, OffsetAndMetadata>> topicOffsets = groupEntry.getValue();

        removeProducerFromGroupTransactions(groupId, producerId);

        final TimelineHashMap<String, TimelineHashMap<Integer, TimelineHashSet<Long>>> topicMap =
            openTransactionsByGroupTopicAndPartition.get(groupId);
        if (topicMap == null) continue;

        processTopicOffsets(producerId, groupId, topicOffsets, topicMap);
        cleanupIfEmpty(topicMap, openTransactionsByGroupTopicAndPartition, groupId);
    }
}

private void removeProducerFromGroupTransactions(final String groupId, final long producerId) {
    final TimelineHashSet<Long> groupTransactions = openTransactionsByGroup.get(groupId);
    if (groupTransactions == null) return;

    groupTransactions.remove(producerId);
    if (groupTransactions.isEmpty()) {
        openTransactionsByGroup.remove(groupId);
    }
}

private void processTopicOffsets(
    final long producerId,
    final String groupId,
    final Map<String, Map<Integer, OffsetAndMetadata>> topicOffsets,
    final TimelineHashMap<String, TimelineHashMap<Integer, TimelineHashSet<Long>>> topicMap) {

    for (final Map.Entry<String, Map<Integer, OffsetAndMetadata>> topicEntry : topicOffsets.entrySet()) {
        final String topic = topicEntry.getKey();
        final Map<Integer, OffsetAndMetadata> partitionOffsets = topicEntry.getValue();

        final TimelineHashMap<Integer, TimelineHashSet<Long>> partitionMap = topicMap.get(topic);
        if (partitionMap == null) continue;

        for (final Integer partitionId : partitionOffsets.keySet()) {
            removeProducerFromPartitionMap(producerId, partitionId, partitionMap);
        }

        cleanupIfEmpty(partitionMap, topicMap, topic);
    }
}

private void removeProducerFromPartitionMap(
    final long producerId,
    final int partitionId,
    final TimelineHashMap<Integer, TimelineHashSet<Long>> partitionMap) {

    final TimelineHashSet<Long> partitionTransactions = partitionMap.get(partitionId);
    if (partitionTransactions == null) return;

    partitionTransactions.remove(producerId);

    if (partitionTransactions.isEmpty()) {
        partitionMap.remove(partitionId);
    }
}

* The open transactions (producer ids) keyed by group id, topic name and partition id.
* Tracks whether partitions have any pending transactional offsets.
*/
private final TimelineHashMap<String, TimelineHashMap<String, TimelineHashMap<Integer, TimelineHashSet<Long>>>> openTransactionsByGroupTopicAndPartition;


I understand your reasoning behind this addition, but it does introduce significant complexity and carries a high risk of data duplication if not carefully managed. While I appreciate this may serve as a stop-gap for now, I do think it’s important that this is revisited going forward. A more maintainable long-term solution, perhaps wrapping this logic in a dedicated structure or abstraction, would help reduce coupling and make it easier to evolve the logic cleanly.
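One hedged sketch of such an abstraction, hiding the nested maps behind a small class (the class and method names are illustrative, and plain collections stand in for the timeline ones):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical wrapper; real coordinator code would build this on
    // TimelineHashMap/TimelineHashSet tied to a SnapshotRegistry.
    final class OpenTransactions {
        private final Map<String, Map<String, Map<Integer, Set<Long>>>> byGroupTopicPartition = new HashMap<>();

        void add(String groupId, String topic, int partition, long producerId) {
            byGroupTopicPartition
                .computeIfAbsent(groupId, g -> new HashMap<>())
                .computeIfAbsent(topic, t -> new HashMap<>())
                .computeIfAbsent(partition, p -> new HashSet<>())
                .add(producerId);
        }

        boolean contains(String groupId, String topic, int partition) {
            Map<String, Map<Integer, Set<Long>>> byTopic = byGroupTopicPartition.get(groupId);
            if (byTopic == null) return false;
            Map<Integer, Set<Long>> byPartition = byTopic.get(topic);
            return byPartition != null && byPartition.containsKey(partition);
        }

        // Removes one producer id and prunes empty levels, keeping the
        // invariant that no level ever holds an empty collection.
        void remove(String groupId, String topic, int partition, long producerId) {
            Map<String, Map<Integer, Set<Long>>> byTopic = byGroupTopicPartition.get(groupId);
            if (byTopic == null) return;
            Map<Integer, Set<Long>> byPartition = byTopic.get(topic);
            if (byPartition == null) return;
            Set<Long> producerIds = byPartition.get(partition);
            if (producerIds == null) return;
            producerIds.remove(producerId);
            if (producerIds.isEmpty()) byPartition.remove(partition);
            if (byPartition.isEmpty()) byTopic.remove(topic);
            if (byTopic.isEmpty()) byGroupTopicPartition.remove(groupId);
        }
    }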

github-actions bot removed the triage (PRs from the community) label Apr 17, 2025
dajac (Member) commented Apr 17, 2025

@squah-confluent Thanks for the patch. Could we write a micro benchmark to demonstrate the gain?

dajac self-requested a review April 17, 2025 10:02
dajac added the KIP-848 (The Next Generation of the Consumer Rebalance Protocol) label Apr 17, 2025
if (openTransactionsByTopic != null) {
    openTransactionsByTopic.forEach((topic, openTransactionsByPartition) -> {
        openTransactionsByPartition.forEach((partition, producerIds) -> {
            producerIds.forEach(producerId -> {
Member

Excuse me, why do we iterate over all the producer ids if we don't actually use them when creating the tombstone?

Contributor Author

I missed this when refactoring the code, thank you. The previous code had the same bug, though it was less obvious there. I've fixed it now.
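The corrected shape might look like this fragment (not self-contained: groupId, records, numDeletedOffsets, and the helper methods come from the surrounding PR code); the point is that at most one tombstone is emitted per partition, regardless of how many producer ids are pending:

    // One existence check and at most one tombstone per partition; the set of
    // producer ids is no longer iterated.
    openTransactionsByTopic.forEach((topic, openTransactionsByPartition) ->
        openTransactionsByPartition.forEach((partition, producerIds) -> {
            if (!hasCommittedOffset(groupId, topic, partition)) {
                records.add(GroupCoordinatorRecordHelpers.newOffsetCommitTombstoneRecord(groupId, topic, partition));
                numDeletedOffsets.getAndIncrement();
            }
        })
    );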

It was kept around in the initial PR to avoid deletion of groups with
open transactions, but it's okay to delete groups with open transactions
without any offsets, since committing the transaction is a no-op. If the
client tries to add more transactional offsets to a deleted group, we
may either recreate the group or return an error depending on the
generation in the request.
squah-confluent changed the title KAFKA-19160: Improve performance of fetching stable offsets → KAFKA-19160;KAFKA-19164; Improve performance of fetching stable offsets Apr 29, 2025
squah-confluent (Contributor, Author) commented Apr 29, 2025

@squah-confluent Thanks for the patch. Could we write a micro benchmark to demonstrate the gain?

Added a benchmark.
Before:

Benchmark                              (partitionCount)  (transactionCount)  Mode  Cnt    Score    Error  Units
TransactionalOffsetFetchBenchmark.run              4000                4000  avgt    5  452.957 ± 7.883  ms/op

After:

Benchmark                              (partitionCount)  (transactionCount)  Mode  Cnt  Score   Error  Units
TransactionalOffsetFetchBenchmark.run              4000                4000  avgt    5  0.196 ± 0.005  ms/op
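
For reference, a hypothetical skeleton of what such a JMH micro benchmark could look like; the PR's actual TransactionalOffsetFetchBenchmark drives the real coordinator code, whereas this sketch only contrasts the two lookup strategies on synthetic data:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @Warmup(iterations = 3)
    @Measurement(iterations = 5)
    @Fork(1)
    public class StableOffsetLookupBenchmark {
        @Param({"4000"})
        public int partitionCount;

        @Param({"4000"})
        public int transactionCount;

        public Set<Long> openProducerIds;               // old path: scanned per partition
        public Map<Integer, Set<Long>> openByPartition; // new path: indexed by partition

        @Setup
        public void setup() {
            openProducerIds = new HashSet<>();
            openByPartition = new HashMap<>();
            // Synthetic data: transaction i touches partition i % partitionCount.
            for (long i = 0; i < transactionCount; i++) {
                openProducerIds.add(i);
                openByPartition.computeIfAbsent((int) (i % partitionCount), k -> new HashSet<>()).add(i);
            }
        }

        @Benchmark
        public int scanPerPartition() {
            // Mimics the old O(partitions * transactions) nested iteration.
            int pending = 0;
            for (int partition = 0; partition < partitionCount; partition++) {
                for (long producerId : openProducerIds) {
                    if (producerId % partitionCount == partition) {
                        pending++;
                        break;
                    }
                }
            }
            return pending;
        }

        @Benchmark
        public int indexedLookup() {
            // One int-keyed lookup per partition via the new index.
            int pending = 0;
            for (int partition = 0; partition < partitionCount; partition++) {
                if (openByPartition.containsKey(partition)) {
                    pending++;
                }
            }
            return pending;
        }
    }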

squah-confluent (Contributor, Author)

@shaan150 I decided to do something a little different and factored out the code into Map interface-like operations.

Comment on lines +275 to +278
private boolean contains(String groupId, String topic, int partition) {
    TimelineHashSet<Long> openTransactions = get(groupId, topic, partition);
    return openTransactions != null;
}
Contributor

Could we do something like

        TimelineHashMap<String, TimelineHashMap<Integer, TimelineHashSet<Long>>> topicMap = 
            openTransactionsByGroup.get(groupId);
            
        if (topicMap == null) return false;
        
        TimelineHashMap<Integer, TimelineHashSet<Long>> partitionMap = topicMap.get(topic);
        return partitionMap != null && partitionMap.containsKey(partition);

to avoid the extra lookup?

Contributor Author

I checked and containsKey does basically the same operation as a get.

public boolean containsKey(Object key) {
    return containsKey(key, SnapshottableHashTable.LATEST_EPOCH);
}

public boolean containsKey(Object key, long epoch) {
    return snapshottableGet(new TimelineHashMapEntry<>(key, null), epoch) != null;
}

public V get(Object key) {
    return get(key, SnapshottableHashTable.LATEST_EPOCH);
}

public V get(Object key, long epoch) {
    Entry<K, V> entry = snapshottableGet(new TimelineHashMapEntry<>(key, null), epoch);
    if (entry == null) {
        return null;
    }
    return entry.getValue();
}

*
* Values in each level of the map will never be empty collections.
*/
private final TimelineHashMap<String, TimelineHashMap<String, TimelineHashMap<Integer, TimelineHashSet<Long>>>> openTransactionsByGroup;
Contributor

The current 3-layer map has already improved the performance significantly. If we're still not happy with the perf, maybe we can consider flattening the map to (group id, topic, partition) → producer id set.
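For reference, the flattened alternative might look like the sketch below, with a hypothetical GroupTopicPartition record as the compound key (as the reply below notes, the author ultimately avoided this, in part because every lookup would allocate a key object):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical flattened index: a single map keyed by a compound key.
    record GroupTopicPartition(String groupId, String topic, int partition) { }

    class FlattenedOpenTransactions {
        private final Map<GroupTopicPartition, Set<Long>> openTransactions = new HashMap<>();

        boolean hasPendingTransactionalOffsets(String groupId, String topic, int partition) {
            // One hash lookup, but each call allocates a key object and
            // re-hashes the group id and topic name.
            return openTransactions.containsKey(new GroupTopicPartition(groupId, topic, partition));
        }
    }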

Contributor Author

I tried a couple of approaches and settled on inlining hasPendingTransactionalOffsets into the offset fetch path. This way we don't do string comparisons of the group id and topic name for every partition and also avoid allocations from a compound key.

Benchmark                              (partitionCount)  (transactionCount)  Mode  Cnt  Score   Error  Units
TransactionalOffsetFetchBenchmark.run              4000                4000  avgt    5  0.129 ± 0.002  ms/op
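
A sketch of what the inlined shape could look like (a simplified fragment with plain Map types; requestedTopics and requestedPartitions are hypothetical stand-ins for the offset fetch request's contents): the group- and topic-level lookups are hoisted out of the per-partition loop, so each partition costs only an int-keyed lookup and no compound key is allocated.

    // Resolve the group- and topic-level maps once per topic; the group id and
    // topic name are each hashed once instead of once per partition.
    Map<String, Map<Integer, Set<Long>>> openTransactionsByTopic =
        openTransactionsByGroup.getOrDefault(groupId, Collections.emptyMap());

    for (String topic : requestedTopics) {
        Map<Integer, Set<Long>> openTransactionsByPartition =
            openTransactionsByTopic.getOrDefault(topic, Collections.emptyMap());
        for (int partition : requestedPartitions(topic)) {
            boolean pending = openTransactionsByPartition.containsKey(partition);
            // ... a partition with a pending transactional offset has no
            // stable offset to return yet.
        }
    }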

Labels
ci-approved KIP-848 The Next Generation of the Consumer Rebalance Protocol performance