KAFKA-19478 [2/N]: Remove task pairs #20127

lucasbru · 2025-07-08T11:32:56Z

Task pairs is an optimization that is enabled in the current sticky task
assignor.

The basic idea is that every time we add a task A to a client that has
tasks B, C, we add pairs (A, B) and (A, C) to a global collection of
pairs. When adding a standby task, we then prioritize creating standby
tasks that create new task pairs. If this does not work, we fall back to
the original behavior.

The complexity of this optimization is fairly significant, and its
usefulness is questionable, the HighAvailabilityAssignor does not seem
to have such an optimization, and the absence of this optimization does
not seem to have caused any problems that I know of. I could not find
any what this optimization is actually trying to achieve.

A side effect of it is that we will sometimes avoid “small loops”, such
as

    Node A: ActiveTask1, StandbyTask2 Node B: ActiveTask2,

StandbyTask1 Node C: ActiveTask3,
StandbyTask4 Node D: ActiveTask4,
StandbyTask3

So a small loop like this, worst case losing two nodes will cause 2
tasks to go down, so the assignor is preferring

    Node A: ActiveTask1, StandbyTask4 Node B: ActiveTask2,

StandbyTask1 Node C: ActiveTask3,
StandbyTask2 Node D: ActiveTask4,
StandbyTask3

Which is a “big loop” assignment, where worst-case losing two nodes will
cause at most 1 task to be unavailable. However, this optimization seems
fairly niche, and also the current implementation does not seem to
implement it in a direct form, but a more relaxed constraint which
usually, does not always avoid small loops. So it remains unclear
whether this is really the intention behind the optimization. The
current unit tests of the StickyTaskAssignor pass even after removing
the optimization.

The pairs optimization has a worst-case quadratic space and time
complexity in the number of tasks, and make a lot of other optimizations
impossible, so I’d suggest we remove it. I don’t think, in its current
form, it is suitable to be implemented in a broker-side assignor. Note,
however, if we identify a useful effect of the code in the future, we
can work on finding an efficient algorithm that can bring the
optimization to our broker-side assignor.

This reduces the runtime of our worst case benchmark by 10x.

Copilot

Pull Request Overview

This PR removes the TaskPairs-based optimization from StickyTaskAssignor to simplify assignment logic and improve performance by eliminating the quadratic data structure.

Dropped TaskPairs initialization, filtering, and inner classes
Updated updateHelpers signature to no longer require a TaskId
Simplified findMemberWithLeastLoad and standby assignment by removing pair checks

Comments suppressed due to low confidence (1)

group-coordinator/src/main/java/org/apache/kafka/coordinator/group/streams/assignor/StickyTaskAssignor.java:218

The parameter taskId is no longer used in this method after removing TaskPairs logic; consider removing it from the signature to avoid confusion and dead code.

        if (members == null || members.isEmpty()) {

...or/src/main/java/org/apache/kafka/coordinator/group/streams/assignor/StickyTaskAssignor.java

aliehsaeedii · 2025-07-09T11:01:29Z

Thanks, @lucasbru, for the optimization.

The current unit tests of the StickyTasksAssignor pass even after removing the optimization.

It’s likely these tests pass because such a closed-loop scenario is rare. However, if that scenario does occur, the tests may fail, leading to flakiness. To avoid introducing flaky tests, it might be best to remove those that specifically check the impact of pairs.

Additionally, it would be useful to understand why omitting pairs causes certain tests to fail for LegacyStickyTaskAssignor. My assumption is that it’s related to the task ID assignment order, which remains consistent in the legacy implementation but can change in the current approach. While this insight may not directly support the optimization, it is important for verifying the current implementation’s correctness.

lucasbru · 2025-07-10T13:29:09Z

It’s likely these tests pass because such a closed-loop scenario is rare. However, if that scenario does occur, the tests may fail, leading to flakiness. To avoid introducing flaky tests, it might be best to remove those that specifically check the impact of pairs.

I think everything is deterministic in the unit tests. Where would the randomness come from? I ran the test 30k times to confirm, it never fails.

Additionally, it would be useful to understand why omitting pairs causes certain tests to fail for LegacyStickyTaskAssignor. My assumption is that it’s related to the task ID assignment order, which remains consistent in the legacy implementation but can change in the current approach. While this insight may not directly support the optimization, it is important for verifying the current implementation’s correctness.

I checked, and I think the reason they pass is only partially due to task ID order.

First, the unit tests in the new assignor will always pass, because it does order comparison instead of unordered comparison of assignments in the tests. So the unit tests related to "Not the same assignment" broke while being ported.

Second, the old assignor does something weird: it assigns active tasks in sorted order, but standby tasks in reverse order. So in the common case, where we have "number of tasks = number of processes", this yields exactly the configuration we try to avoid:

{00000000-0000-0000-0000-000000000001=[activeTasks: ([0_0]) standbyTasks: ([0_3]) 
 00000000-0000-0000-0000-000000000002=[activeTasks: ([0_1]) standbyTasks: ([0_2]) 
 00000000-0000-0000-0000-000000000003=[activeTasks: ([0_2]) standbyTasks: ([0_1]) 
 00000000-0000-0000-0000-000000000004=[activeTasks: ([0_3]) standbyTasks: ([0_0])

In the new algorithm (before optimizations), we use the order defined by HashMaps, and actually a somewhat complicated order based on the Member object which is inserted into a HashSet. That gives a deterministic, but somewhat random looking order, which will not create "pairs" in the test.

bbejeck

Thanks for the PR @lucasbru - LGTM

bbejeck · 2025-07-11T13:57:15Z

@lucasbru failure is relevant - imports format

lucasbru requested review from bbejeck and Copilot July 8, 2025 11:32

github-actions bot added the group-coordinator label Jul 8, 2025

Copilot AI reviewed Jul 8, 2025

View reviewed changes

...or/src/main/java/org/apache/kafka/coordinator/group/streams/assignor/StickyTaskAssignor.java Show resolved Hide resolved

disable task pairs

a529fd0

lucasbru force-pushed the jmh_benchmarks_second_opt branch from 94915fc to a529fd0 Compare July 9, 2025 10:16

remove pairs unit tests

9b1011a

lucasbru requested a review from mjsax July 11, 2025 11:57

bbejeck approved these changes Jul 11, 2025

View reviewed changes

checkstyle

84e8202

lucasbru merged commit 29cf97b into apache:trunk Jul 14, 2025
20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

KAFKA-19478 [2/N]: Remove task pairs #20127

KAFKA-19478 [2/N]: Remove task pairs #20127

Uh oh!

lucasbru commented Jul 8, 2025 •

edited by github-actions bot

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

aliehsaeedii commented Jul 9, 2025

Uh oh!

lucasbru commented Jul 10, 2025

Uh oh!

bbejeck left a comment

Uh oh!

bbejeck commented Jul 11, 2025

Uh oh!

Uh oh!

Uh oh!

KAFKA-19478 [2/N]: Remove task pairs #20127

KAFKA-19478 [2/N]: Remove task pairs #20127

Uh oh!

Conversation

lucasbru commented Jul 8, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

aliehsaeedii commented Jul 9, 2025

Uh oh!

lucasbru commented Jul 10, 2025

Uh oh!

bbejeck left a comment

Choose a reason for hiding this comment

Uh oh!

bbejeck commented Jul 11, 2025

Uh oh!

Uh oh!

Uh oh!

lucasbru commented Jul 8, 2025 •

edited by github-actions bot

Loading