
Conversation

@AgraVator (Contributor) commented Oct 1, 2025

Increases the threshold from 5% to 10%, which will help to reduce noise from b/448552373

@AgraVator AgraVator requested a review from a team as a code owner October 1, 2025 08:17
@eshitachandwani (Member) commented

Nit: Since this repo is open source, we shouldn’t include internal bug links. Could you replace it with just the correct bug number? I think this one might be a duplicate bug rather than the intended metabug.

@pawbhard (Contributor) commented Oct 1, 2025

> Nit: Since this repo is open source, we shouldn’t include internal bug links. Could you replace it with just the correct bug number? I think this one might be a duplicate bug rather than the intended metabug.

+1 use b/

@arjan-bal (Contributor) commented

Please wait before submitting this. I'm not 100% sure this is the right fix. I'm going through the test and the circuit breaking docs to see if the test needs to be changed instead.

@arjan-bal (Contributor) commented Oct 7, 2025

From my understanding of the test, the client is configured to send 100 QPS to the servers.

```python
default_test_server, rpc="UnaryCall,EmptyCall", qps=_QPS
```

The server is configured to block upon receiving the calls, and a deadline of 20 seconds is set on the RPCs. This means each RPC is held open for 20 seconds before failing with a DEADLINE_EXCEEDED status.

```python
with self.subTest("11_configure_client_with_keep_open"):
    test_client.update_config.configure(
        rpc_types=grpc_testing.RPC_TYPES_BOTH_CALLS,
        metadata={
            (
                grpc_testing.RPC_TYPE_UNARY_CALL,
                "rpc-behavior",
                "keep-open",
            ),
            (
                grpc_testing.RPC_TYPE_EMPTY_CALL,
                "rpc-behavior",
                "keep-open",
            ),
        },
        timeout_sec=20,
    )
```

Initially, before the client has received the circuit breaking config, it makes a large number of concurrent requests. For example, from the test logs:

```
I1006 09:10:54.669732 125114108895232 xds_k8s_testcase.py:890] [psm-grpc-client-64545cd99d-thnn8] << Received LoadBalancerAccumulatedStatsResponse:
- method: UNARY_CALL
  rpcs_started: 2464
  result:
    (0, OK): 468
    (4, DEADLINE_EXCEEDED): 4
- method: EMPTY_CALL
  rpcs_started: 2464
  result:
    (0, OK): 468
    (4, DEADLINE_EXCEEDED): 4

I1006 09:10:54.669906 125114108895232 xds_k8s_testcase.py:899] [psm-grpc-client-64545cd99d-thnn8] << UNARY_CALL RPCs in flight: 1992, expected 500 ±5%
```
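
For reference, the in-flight number in the last log line is just started minus finished, computed from the accumulated stats. A minimal sketch using the UNARY_CALL numbers above (the real helper lives in the test framework):

```python
# In-flight RPCs = RPCs started - RPCs finished, from the stats above.
rpcs_started = 2464
finished = 468 + 4                 # (0, OK) + (4, DEADLINE_EXCEEDED)
in_flight = rpcs_started - finished
print(in_flight)                   # 1992, matching "RPCs in flight: 1992"
```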

After the client receives the circuit breaking config, it will ensure there are at most 500 UnaryCall and 1000 EmptyCall requests in-flight. In the logs, we can see the number of in-flight calls reducing with time.

```
I0929 20:58:33.484530 128673922162688 xds_k8s_testcase.py:899] [psm-grpc-client-7746d59849-tm852] << EMPTY_CALL RPCs in flight: 1630, expected 1000 ±5%
I0929 20:58:43.495065 128673922162688 grpc.py:79] [psm-grpc-client-7746d59849-tm852:8079] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), timeout=600, wait_for_ready=True)
I0929 20:58:43.537494 128673922162688 xds_k8s_testcase.py:890] [psm-grpc-client-7746d59849-tm852] << Received LoadBalancerAccumulatedStatsResponse:
- method: EMPTY_CALL
  rpcs_started: 10630
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 2194
    (14, UNAVAILABLE): 7921
- method: UNARY_CALL
  rpcs_started: 11098
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 500
    (14, UNAVAILABLE): 10083

I0929 20:58:43.537702 128673922162688 xds_k8s_testcase.py:899] [psm-grpc-client-7746d59849-tm852] << EMPTY_CALL RPCs in flight: 968, expected 1000 ±5%
I0929 20:58:43.537857 128673922162688 xds_k8s_testcase.py:868] Will check again in 5 seconds to verify that RPC count is steady
I0929 20:58:48.543253 128673922162688 grpc.py:79] [psm-grpc-client-7746d59849-tm852:8079] >> RPC LoadBalancerStatsService.GetClientAccumulatedStats(request=LoadBalancerAccumulatedStatsRequest({}), timeout=600, wait_for_ready=True)
I0929 20:58:48.584075 128673922162688 xds_k8s_testcase.py:890] [psm-grpc-client-7746d59849-tm852] << Received LoadBalancerAccumulatedStatsResponse:
- method: EMPTY_CALL
  rpcs_started: 11039
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 2603
    (14, UNAVAILABLE): 7921
- method: UNARY_CALL
  rpcs_started: 11454
  result:
    (0, OK): 515
    (4, DEADLINE_EXCEEDED): 530
    (14, UNAVAILABLE): 10409
```

The test then checks once that the in-flight requests are within 500 ±5% for UnaryCall and 1000 ±5% for EmptyCall.

The problem

At t=20 seconds, the 100 RPCs started at t=0 will fail as their deadlines expire. At the same time, the client will start 100 more RPCs that may succeed. If the test driver queries the in-flight RPCs at this moment, it will see a count between 900 and 1000 for EmptyCall and between 400 and 500 for UnaryCall. Assuming the rate at which RPCs start equals the rate at which RPCs time out, we'll see 950 EmptyCalls and 450 UnaryCalls on average.

The same situation will happen at t=21, 22, ..., 30 and at t=20+30, 21+30, ..., 30+30.

TL;DR: the steady state in the test is cyclic.
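
To make the cycle concrete, here is a toy sketch of the cohort model described above. The per-second batching, instant UNAVAILABLE rejection, and exact-deadline expiry are simplifying assumptions of mine, not behavior taken from the test framework:

```python
# Toy model of the cyclic steady state (assumptions mine, not the test's).
QPS = 100          # EmptyCall RPCs started per second, in one batch
DEADLINE_S = 20    # every admitted RPC fails exactly at its deadline
CAP = 1000         # EmptyCall circuit-breaker threshold

open_rpcs = []     # expiry time of each admitted RPC
for t in range(31):
    open_rpcs = [exp for exp in open_rpcs if exp > t]  # cohort times out
    before_refill = len(open_rpcs)
    admitted = min(QPS, CAP - before_refill)           # excess -> UNAVAILABLE
    open_rpcs += [t + DEADLINE_S] * admitted
    if t >= 18:
        print(f"t={t:2d}s: {before_refill} in flight, "
              f"{len(open_rpcs)} after refill")
```

From t=20 onward this prints 900 in flight before each refill and 1000 after, so a sample taken between a cohort timing out and its replacement can land anywhere in [900, 1000], below the 1000 - 5% = 950 lower bound of the current check.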

@arjan-bal (Contributor) commented Oct 7, 2025

One way to fix the test would be to change the two assertions as follows (a sketch of the bounds follows the list):

  1. In the first check, verify RPCs are within [threshold - 5%, threshold]; even 1% may work. Notice that there is no +5% because circuit breaking must not allow more RPCs than the threshold.
  2. In the second check, verify RPCs are within [threshold - QPS, threshold].
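
A minimal sketch of what those two bounds could look like. The names and structure here are hypothetical, not the framework's actual API:

```python
# Hypothetical assertion helpers for the two proposed checks.
QPS = 100
THRESHOLDS = {"UNARY_CALL": 500, "EMPTY_CALL": 1000}

def check_after_convergence(method: str, in_flight: int) -> None:
    # First check: [threshold - 5%, threshold]. No +5% on the upper bound,
    # because circuit breaking must never admit more RPCs than the threshold.
    threshold = THRESHOLDS[method]
    assert threshold * 0.95 <= in_flight <= threshold, (method, in_flight)

def check_steady_state(method: str, in_flight: int) -> None:
    # Second check: [threshold - QPS, threshold]. Up to QPS RPCs can be
    # between timing out and being replaced at any sampled instant.
    threshold = THRESHOLDS[method]
    assert threshold - QPS <= in_flight <= threshold, (method, in_flight)
```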
