KAFKA-19400: Update AddRaftVoterRequest RPC to version 1 #19982

kevin-wu24 · 2025-06-17T16:08:11Z

Add the ackWhenCommitted boolean field to the AddRaftVoterRequest RPC, and bump the RPC's version to 1.

The default value of ackWhenCommitted is true, and in this case the leader will return a response after committing the VotersRecord generated by the RPC.
If ackWhenCommitted == false, the leader will return a response after writing the VotersRecord to its own local log.
Add unit tests in KafkaRaftClientReconfigTest

ahuang98 · 2025-06-17T23:43:11Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientReconfigTest.java

+
+    // This method sets up the context so a test can send an AddVoter request after
+    // exiting this method
+    private void prepareToSendAddVoter(


this name is a bit deceiving, this is all prep we do on the local (leader) node in order for it to be able to respond favorably to an addvoter request, could we perhaps rename to "prepareLeaderToReceiveAddVoter"?

I'm also not sure how I feel about having a helper method for this, I do see how this is quite a bit of code duplication but I wonder if it might be more clear to have this written out explicitly in the tests (you probably don't need to keep all the same assertions for all the tests)

could we perhaps rename to "prepareLeaderToReceiveAddVoter"

Yeah, this is more accurate.

I wonder if it might be more clear to have this written out explicitly in the tests

I would prefer to have this code in a helper method, which applies to the other comment too. In my opinion, it's clearer to me when all of these tests do have the exact same assertions so long as they apply, because it means our new AckWhenCommitted field (or any other feature) is not unintentionally changing behavior elsewhere. For example, the metric values check prevented me from writing an incorrect implementation!

Another way to think about it is that in these AddVoter unit tests, we're testing the same state of KRaft essentially (i.e. how the local leader handles AddVoter RPC), but each test is changing one thing, which is the ackWhenCommitted value (true, false, or NOT_SUPPORTED). Because we want coverage, I think duplication is a natural side-effect.

Maybe I can just document the helpers/rename them as specific to these AddVoter tests?

ahuang98 · 2025-06-17T23:58:15Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientReconfigTest.java

+        ReplicaKey follower,
+        ReplicaKey newVoter,
+        int epoch
+    ) throws Exception {
        // The new voter is now a voter after writing the VotersRecord to the log
        assertTrue(context.client.quorum().isVoter(newVoter));
        checkLeaderMetricValues(3, 0, 1, context);


this doesn't seem very extensible - maybe the above checks belong outside of this method

thoughts here?

My rationale of breaking this into a shared helper is similar to here: https://github.com/apache/kafka/pull/19982/files/c46387c64df052d0794932d91c8839bcd2e9d4a5#r2153376114.

specifically the checkLeaderMetricValues assuming we have a 3 voter quorum

so I'm wondering if just the two assertions/checks don't belong in the helper

so I'm wondering if just the two assertions/checks don't belong in the helper

We do want the metric check. Specifically the uncommitted voter change metric should go from 1 to 0. Otherwise there is a bug.

specifically the checkLeaderMetricValues assuming we have a 3 voter quorum

Apologies, I did not clarify why I felt making this assumption was okay. My opinion is that the number of nodes is arbitrary in the context of the AddVoter unit tests. These should only check the state of KRaft as the voter attempts to go from X to X + 1.

ahuang98 · 2025-06-17T23:59:45Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientReconfigTest.java

+        "KIP_853_PROTOCOL",
+        "KIP_996_PROTOCOL"
+    })
+    void testAddVoterCompatibility(RaftProtocol protocol) throws Exception {


testAddVoterAckWhenCommittedUnsupported?

I also need to add KIP_1166 here too.

ahuang98 · 2025-07-10T17:34:20Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientReconfigTest.java

+        // Attempt to add new voter to the quorum
+        context.deliverRequest(context.addVoterRequest(Integer.MAX_VALUE, newVoter, newListeners));
+
+        completeApiVersionsForAddVoter(context, newVoter, newAddress);


this is only called in two places, I wonder if we should just remove the helper so folks can see very clearly what is happening in these tests

Are you okay with leaving this in? The completeApiVersionsForAddVoter method is a more general method than commitNewVoterSetForAddVoter, since it does not have metrics checks on the number of voters/observers. This means future tests that want to test AddRaftVoter RPC can call completeApiVersionsForAddVoter as is.

In general, my motivation for grouping code in this way into helpers was to make the test read like the high level description of how AddRaftVoter RPC works:

prepareLeaderToReceiveAddVoter gets the local replica to a state where the context can send the AddRaftVoter RPC and have it trigger an API versions request

The test context delivers an AddRaftVoter RPC

completeApiVersionsForAddVoter completes the API versions RPC contained as part of the AddRaftVoter RPC loop

commitNewVoterSetForAddVoter completes the fetch RPC that is needed to complete AddRaftVoter RPC

Let me know if that makes sense to you.

ahuang98

Thanks Kevin, two comments left about the helpers, but approving the changes in either case :)

jsancio

Thanks for the changes @kevin-wu24 .

jsancio · 2025-07-11T15:28:00Z

raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java

+            currentTimeMs,
+            data.ackWhenCommitted()


Let's swap the order of currentTimeMs and ackWhenCommitted. "Ack when committed" is more important and raft tends to pass the current time as the last parameter.

jsancio · 2025-07-11T15:29:46Z

raft/src/main/java/org/apache/kafka/raft/internals/AddVoterHandler.java

+            time.timer(timeout.getAsLong()),
+            ackWhenCommitted


Similar comment here. Let's swap the order of the timer and the "ack when committed" boolean.

jsancio · 2025-07-11T15:32:04Z

raft/src/main/java/org/apache/kafka/raft/internals/AddVoterHandlerState.java

    private final Timer timeout;
+    private final boolean ackWhenCommitted;


Minor but let's swap this order here and in the constructor.

jsancio · 2025-07-11T16:18:11Z

raft/src/main/java/org/apache/kafka/raft/internals/AddVoterHandler.java

+            current.future().complete(
+                RaftUtil.addVoterResponse(
+                    Errors.NONE,
+                    null
+                )
+            );


Feel free to join some of these lines if they are less than about 100 characters. E.g.:

current.future().complete( RaftUtil.addVoterResponse(Errors.NONE, null) );

jsancio · 2025-07-11T16:27:38Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientReconfigTest.java

+        // Attempt to add new voter to the quorum
+        assertThrows(
+            UnsupportedVersionException.class,
+            () -> context.deliverRequest(
+                context.addVoterRequest(
+                    Integer.MAX_VALUE,
+                    newVoter,
+                    newListeners
+                ).setAckWhenCommitted(false)
+            )
+        );


This is not testing KafkaRaftClient. This is testing the serialization code. It is testing that serializing a false "ack when committed" throws an exception when the target version is 0.

If you want to test this case for KafkaRaftClient, you need to make the local replica a non-voter that can auto-join and show that AddVoterRequest always sets "ack when committed" to false.

This is not testing KafkaRaftClient. This is testing the serialization code. It is testing that serializing a false "ack when committed" throws an exception when the target version is 0.

We want to test the serialization code behaves as expected right? How else would we know that we are throwing an exception that prevents us from sending AddVoter v1 to a node that only supports v0 because ackWhenCommitted has ignorable == false?

If you want to test this case for KafkaRaftClient, you need to make the local replica a non-voter that can auto-join and show that AddVoterRequest always sets "ack when committed" to false.

I was planning to add this as a check in all the KafkaRaftClientAutoJoinTest tests in the other PR after this PR is merged, since all of those tests have the local replica as a non-voter. Specifically, we add another check in assertSentAddVoterRequest that AckWhenCommitted == false for any auto-joining replica.

jsancio · 2025-07-11T16:33:19Z

raft/src/test/java/org/apache/kafka/raft/KafkaRaftClientReconfigTest.java

+        RaftClientTestContext context,
+        int epoch,
+        ReplicaKey leader,
+        ReplicaKey follower,


Interesting in all of the reconfig tests we always go from 2 voters to 3 voters? Maybe we can improve this in a future PR and make this parametrized.

Maybe we can improve this in a future PR and make this parametrized.

Yeah possibly. I think just parameterizing might not be enough, depending on what you want to parametrize over. This is because the number of fetches needed to commit the new voters record increases as the number of voters increases.

add AddRaftVoterRequest version 1

37504a8

github-actions bot added triage PRs from the community kraft clients labels Jun 17, 2025

kevin-wu24 mentioned this pull request Jun 17, 2025

KAFKA-19078: Implement automatic controller addition to cluster metadata partition #19589

Open

Merge branch 'trunk' into KAFKA-19400

c46387c

ahuang98 reviewed Jun 17, 2025

View reviewed changes

code review

85efe33

github-actions bot removed the triage PRs from the community label Jun 18, 2025

jsancio added the ci-approved label Jul 10, 2025

ahuang98 reviewed Jul 10, 2025

View reviewed changes

ahuang98 approved these changes Jul 10, 2025

View reviewed changes

jsancio reviewed Jul 11, 2025

View reviewed changes

code review

84b34a1

		private final Timer timeout;
		private final boolean ackWhenCommitted;

KAFKA-19400: Update AddRaftVoterRequest RPC to version 1 #19982

Are you sure you want to change the base?

KAFKA-19400: Update AddRaftVoterRequest RPC to version 1 #19982

Uh oh!

Conversation

kevin-wu24 commented Jun 17, 2025

Uh oh!

ahuang98 Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kevin-wu24 Jun 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kevin-wu24 Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kevin-wu24 Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahuang98 left a comment

Choose a reason for hiding this comment

Uh oh!

jsancio left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kevin-wu24 Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ahuang98 Jun 17, 2025 •

edited

Loading

kevin-wu24 Jun 18, 2025 •

edited

Loading

kevin-wu24 Jul 10, 2025 •

edited

Loading

kevin-wu24 Jul 10, 2025 •

edited

Loading

kevin-wu24 Jul 11, 2025 •

edited

Loading