
Conversation

@AgraVator (Contributor) commented Oct 1, 2025

Increases the threshold from 5% to 10%, which will help to reduce noise from b/448552373

@AgraVator AgraVator requested a review from a team as a code owner October 1, 2025 08:17
@eshitachandwani (Member) commented

Nit: Since this repo is open source, we shouldn’t include internal bug links. Could you replace it with just the correct bug number? I think this one might be a duplicate bug rather than the intended metabug.

@pawbhard (Contributor) commented Oct 1, 2025

> Nit: Since this repo is open source, we shouldn’t include internal bug links. Could you replace it with just the correct bug number? I think this one might be a duplicate bug rather than the intended metabug.

+1 use b/

@arjan-bal (Contributor) commented

Please wait before submitting this. I'm not 100% sure this is the right fix. I'm going through the test and the circuit breaking docs to see if the test needs to be changed instead.

@arjan-bal (Contributor) commented Oct 7, 2025

From my understanding of the test, the client is configured to send 100 QPS to the servers.

```python
default_test_server, rpc="UnaryCall,EmptyCall", qps=_QPS
```

The server is configured to block upon receiving the calls, and a deadline of 20 seconds is set on the RPCs. This means each RPC is held open for 20 seconds before failing with a DEADLINE_EXCEEDED status.

```python
with self.subTest("11_configure_client_with_keep_open"):
    test_client.update_config.configure(
        rpc_types=grpc_testing.RPC_TYPES_BOTH_CALLS,
        metadata={
            (
                grpc_testing.RPC_TYPE_UNARY_CALL,
                "rpc-behavior",
                "keep-open",
            ),
            (
                grpc_testing.RPC_TYPE_EMPTY_CALL,
                "rpc-behavior",
                "keep-open",
            ),
        },
        timeout_sec=20,
    )
```

Initially, before the client has received the circuit breaking config, it makes a large number of concurrent requests. For example, from the test logs:

```
I1006 09:10:54.669732 125114108895232 xds_k8s_testcase.py:890] [psm-grpc-client-64545cd99d-thnn8] << Received LoadBalancerAccumulatedStatsResponse:
- method: UNARY_CALL
  rpcs_started: 2464
  result:
    (0, OK): 468
    (4, DEADLINE_EXCEEDED): 4
- method: EMPTY_CALL
  rpcs_started: 2464
  result:
    (0, OK): 468
    (4, DEADLINE_EXCEEDED): 4

I1006 09:10:54.669906 125114108895232 xds_k8s_testcase.py:899] [psm-grpc-client-64545cd99d-thnn8] << UNARY_CALL RPCs in flight: 1992, expected 500 ±5%
```
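
For reference, the in-flight number in the last log line is just started minus finished, computed from the accumulated stats. A minimal sketch using the UNARY_CALL numbers above (the real helper lives in the test framework):

```python
# In-flight RPCs = RPCs started - RPCs finished, from the stats above.
rpcs_started = 2464
finished = 468 + 4                 # (0, OK) + (4, DEADLINE_EXCEEDED)
in_flight = rpcs_started - finished
print(in_flight)                   # 1992, matching "RPCs in flight: 1992"
```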

After the client receives the circuit breaking config, it will ensure there are at most 500 UnaryCall and 1000 EmptyCall requests in-flight. In the logs, we can see the number of in-flight calls reducing with time.

```
I0929 20:58:33.484530 128673922162688 xds_k8s_testcase.py:899] [psm-grpc-client-7746d59849-tm852] << EMPTY_CALL RPCs in flight: 1630, expected 1000 ±5%
I0929 20:58:43.495065 128673922162688 grpc.py:79] [psm-grpc-client-7746d59849-tm852:8079] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), timeout=600, wait_for_ready=True)
I0929 20:58:43.537494 128673922162688 xds_k8s_testcase.py:890] [psm-grpc-client-7746d59849-tm852] << Received LoadBalancerAccumulatedStatsResponse:
- method: EMPTY_CALL
  rpcs_started: 10630
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 2194
    (14, UNAVAILABLE): 7921
- method: UNARY_CALL
  rpcs_started: 11098
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 500
    (14, UNAVAILABLE): 10083

I0929 20:58:43.537702 128673922162688 xds_k8s_testcase.py:899] [psm-grpc-client-7746d59849-tm852] << EMPTY_CALL RPCs in flight: 968, expected 1000 ±5%
I0929 20:58:43.537857 128673922162688 xds_k8s_testcase.py:868] Will check again in 5 seconds to verify that RPC count is steady
I0929 20:58:48.543253 128673922162688 grpc.py:79] [psm-grpc-client-7746d59849-tm852:8079] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), timeout=600, wait_for_ready=True)
I0929 20:58:48.584075 128673922162688 xds_k8s_testcase.py:890] [psm-grpc-client-7746d59849-tm852] << Received LoadBalancerAccumulatedStatsResponse:
- method: EMPTY_CALL
  rpcs_started: 11039
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 2603
    (14, UNAVAILABLE): 7921
- method: UNARY_CALL
  rpcs_started: 11454
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 530
    (14, UNAVAILABLE): 10409
```

The test then checks once that the in-flight requests are within 500 ±5% for UnaryCall and 1000 ±5% for EmptyCall.

The problem

At t=20 seconds, the 100 RPCs started at t=0 will fail as their deadlines expire. At the same time, the client will start 100 more RPCs that may succeed. If the test driver queries the in-flight RPCs at this moment, it will see a count between 900 and 1000 for EmptyCall and between 400 and 500 for UnaryCall. Assuming the rate at which RPCs start equals the rate at which RPCs time out, we'll see 950 EmptyCalls and 450 UnaryCalls on average.

The same situation will happen at t=21, 22, ..., 30 and at t=20+30, 21+30, ..., 30+30.

TL;DR: the steady state in the test is cyclic.
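
To make the cycle concrete, here is a toy sketch of the cohort model described above. The per-second batching, instant UNAVAILABLE rejection, and exact-deadline expiry are simplifying assumptions of mine, not behavior taken from the test framework:

```python
# Toy model of the cyclic steady state (assumptions mine, not the test's).
QPS = 100          # EmptyCall RPCs started per second, in one batch
DEADLINE_S = 20    # every admitted RPC fails exactly at its deadline
CAP = 1000         # EmptyCall circuit-breaker threshold

open_rpcs = []     # expiry time of each admitted RPC
for t in range(31):
    open_rpcs = [exp for exp in open_rpcs if exp > t]  # cohort times out
    before_refill = len(open_rpcs)
    admitted = min(QPS, CAP - before_refill)           # excess -> UNAVAILABLE
    open_rpcs += [t + DEADLINE_S] * admitted
    if t >= 18:
        print(f"t={t:2d}s: {before_refill} in flight, "
              f"{len(open_rpcs)} after refill")
```

From t=20 onward this prints 900 in flight before each refill and 1000 after, so a sample taken between a cohort timing out and its replacement can land anywhere in [900, 1000], below the 1000 - 5% = 950 lower bound of the current check.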

@arjan-bal (Contributor) commented Oct 7, 2025

One way to fix the test would be to change the two assertions as follows (a sketch of the bounds follows the list):

  1. In the first check, verify RPCs are within [threshold - 5%, threshold]; even 1% may work. Notice that there is no +5% because circuit breaking must not allow more RPCs than the threshold.
  2. In the second check, verify RPCs are within [threshold - QPS, threshold].
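
A minimal sketch of what those two bounds could look like. The names and structure here are hypothetical, not the framework's actual API:

```python
# Hypothetical assertion helpers for the two proposed checks.
QPS = 100
THRESHOLDS = {"UNARY_CALL": 500, "EMPTY_CALL": 1000}

def check_after_convergence(method: str, in_flight: int) -> None:
    # First check: [threshold - 5%, threshold]. No +5% on the upper bound,
    # because circuit breaking must never admit more RPCs than the threshold.
    threshold = THRESHOLDS[method]
    assert threshold * 0.95 <= in_flight <= threshold, (method, in_flight)

def check_steady_state(method: str, in_flight: int) -> None:
    # Second check: [threshold - QPS, threshold]. Up to QPS RPCs can be
    # between timing out and being replaced at any sampled instant.
    threshold = THRESHOLDS[method]
    assert threshold - QPS <= in_flight <= threshold, (method, in_flight)
```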
