[Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] Sync tcp patches from upstream #918

Open: opsiff wants to merge 19 commits into base: linux-6.6.y

@opsiff (Member) commented Jul 4, 2025

Merge branch 'tcp-refactor-bhash2'

Kuniyuki Iwashima says:

====================

tcp: Refactor bhash2 and remove sk_bind2_node.

This series refactors the code around bhash2 and removes some bhash2-specific
fields: sock.sk_bind2_node and inet_timewait_sock.tw_bind2_node.

patch 1 : optimise bind() for non-wildcard v4-mapped-v6 address
patch 2 - 4 : optimise bind() conflict tests
patch 5 - 12 : Link bhash2 to bhash and unlink sk from bhash2 to
remove sk_bind2_node

Patch 8 will trigger a false-positive checkpatch error.

v2: resend of https://lore.kernel.org/netdev/[email protected]/

  • Rebase on latest net-next
  • Patch 11
    • Add change in inet_diag_dump_icsk() for recent bhash dump patch

v1: https://lore.kernel.org/netdev/[email protected]/

Merge branch 'tcp-scale-connect-under-pressure'

Eric Dumazet says:

====================
tcp: scale connect() under pressure

Adoption of bhash2 in linux-6.1 made some operations almost twice
as expensive, because of additional locks.

This series adds RCU in __inet_hash_connect() to help the
case where many attempts need to be made before finding
an available 4-tuple.

This brings a ~200 % improvement in this experiment:

Server:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog

Client:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before series:

utime_start=0.288582
utime_end=1.548707
stime_start=20.637138
stime_end=2002.489845
num_transactions=484453
latency_min=0.156279245
latency_max=20.922042756
latency_mean=1.546521274
latency_stddev=3.936005194
num_samples=312537
throughput=47426.00

perf top on the client:

49.54% [kernel] [k] _raw_spin_lock
25.87% [kernel] [k] _raw_spin_lock_bh
5.97% [kernel] [k] queued_spin_lock_slowpath
5.67% [kernel] [k] __inet_hash_connect
3.53% [kernel] [k] __inet6_check_established
3.48% [kernel] [k] inet6_ehashfn
0.64% [kernel] [k] rcu_all_qs

After this series:

utime_start=0.271607
utime_end=3.847111
stime_start=18.407684
stime_end=1997.485557
num_transactions=1350742
latency_min=0.014131929
latency_max=17.895073144
latency_mean=0.505675853 # Nice reduction of latency metrics
latency_stddev=2.125164772
num_samples=307884
throughput=139866.80 # 194 % increase

perf top on client:

56.86% [kernel] [k] __inet6_check_established
17.96% [kernel] [k] __inet_hash_connect
13.88% [kernel] [k] inet6_ehashfn
2.52% [kernel] [k] rcu_all_qs
2.01% [kernel] [k] __cond_resched
0.41% [kernel] [k] _raw_spin_lock

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski [email protected]

Summary by Sourcery

Sync upstream TCP refactor and performance optimizations: restructure bhash2 binding, remove obsolete sk_bind2_node, add RCU fast-path for connect() under pressure, and introduce TCP_BOUND_INACTIVE support for diagnostics

New Features:

  • Scale TCP connect path under high load using an RCU-based fast path in __inet_hash_connect
  • Introduce a TCP_BOUND_INACTIVE pseudo-state and corresponding BPF_TCP_BOUND_INACTIVE support

Enhancements:

  • Refactor bind buckets (bhash2) by removing sk_bind2_node, unifying inet_bind_bucket and inet_bind2_bucket structures, and switching to RCU-safe list operations
  • Optimize bind() conflict checks and IPv4-mapped IPv6 address handling
  • Simplify bind bucket creation and destruction interfaces by eliminating redundant cache parameters
  • Update inet_diag to batch-process and report bound-inactive sockets

Guillaume Nault and others added 19 commits July 4, 2025 14:41
mainline inclusion
from mainline-v6.8-rc1
category: performance

Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
that are bound but haven't yet called connect() or listen().

The code is inspired by the ->lhash2 loop. However there's no manual
test of the source port, since this kind of filtering is already
handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
at a time, to avoid running with bh disabled for too long.

There's no TCP state for bound but otherwise inactive sockets. Such
sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
to only dump listening sockets, actually requests the kernel to dump
sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
bound-only sockets with "ss -l", we therefore need to define a new
pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
explicitly.
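
The filtering problem described above can be sketched with a state bitmask. This is a minimal userspace illustration, not the inet_diag code: the numeric state values are as I recall them from include/net/tcp_states.h and should be treated as assumptions, and `STATE_BIT`, `ss_listen_mask()` and `wants_bound_inactive()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * State numbers as recalled from include/net/tcp_states.h. Treat the exact
 * values as assumptions and check the header in your own tree.
 */
enum {
	TCP_CLOSE = 7,
	TCP_LISTEN = 10,
	TCP_BOUND_INACTIVE = 13, /* pseudo-state, only meaningful to inet_diag */
};

#define STATE_BIT(s) (1u << (s))

/* The mask that "ss -l" requests according to the text: LISTEN or CLOSE. */
static uint32_t ss_listen_mask(void)
{
	return STATE_BIT(TCP_LISTEN) | STATE_BIT(TCP_CLOSE);
}

/* Hypothetical helper: does this request ask for bound-only sockets? */
static bool wants_bound_inactive(uint32_t requested_states)
{
	return (requested_states & STATE_BIT(TCP_BOUND_INACTIVE)) != 0;
}
```

Because the pseudo-state has its own bit, a request that only sets the LISTEN and CLOSE bits never selects bound-only sockets, while user space can opt in by setting the new bit explicitly.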

With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
40000, 64000, 60000, an updated version of iproute2 could work as
follows:

  $ ss -t state bound-inactive
  Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process
  0        0                0.0.0.0:40000           0.0.0.0:*
  0        0                   [::]:60000              [::]:*
  0        0                      *:64000                 *:*

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/b3a84ae61e19c06806eea9c602b3b66e8f0cfc81.1701362867.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 91051f0)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I90f7c930ce5a69adb8139cc593dcc0e231926061
mainline inclusion
from mainline-v6.8-rc1
category: performance

While checking port availability in bind() or listen(), we used only
bhash for all v4-mapped-v6 addresses.  But there is no good reason not
to use bhash2 for v4-mapped-v6 non-wildcard addresses.

Let's do it by returning true in inet_use_bhash2_on_bind().  Then, we
also need to add a test in inet_bind2_bucket_match_addr_any() so that
::ffff:X.X.X.X will match with 0.0.0.0.

Note that sk->sk_rcv_saddr is initialised for v4-mapped-v6 sk in
__inet6_bind().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 5e07e67)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Iea294b9533360379e2da8f657a57437f2bba3534
mainline inclusion
from mainline-v6.8-rc1
category: performance

The protocol family tests in inet_bind2_bucket_addr_match() and
inet_bind2_bucket_match_addr_any() are ordered as follows.

  if (sk->sk_family != tb2->family)
  else if (sk->sk_family == AF_INET6)
  else

This patch rearranges them so that AF_INET6 socket is handled first
to make the following patch tidy, where tb2->family will be removed.

  if (sk->sk_family == AF_INET6)
  else if (tb2->family == AF_INET6)
  else

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 56f3e3f)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I615614fdfad00f22916ef0ce911aabab3d05a0fe
mainline inclusion
from mainline-v6.8-rc1
category: performance

In bhash2, IPv4/IPv6 addresses are saved in two union members,
which complicate address checks in inet_bind2_bucket_addr_match()
and inet_bind2_bucket_match_addr_any() considering uninitialised
memory and v4-mapped-v6 conflicts.

Let's simplify that by saving IPv4 address as v4-mapped-v6 address
and defining tb2.rcv_saddr as tb2.v6_rcv_saddr.s6_addr32[3].

Then, we can compare v6 address as is, and after checking v4-mapped-v6,
we can compare v4 address easily.  Also, we can remove tb2->family.
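
The single-representation idea can be sketched in userspace C as follows. This is only an illustration under assumptions: `struct bucket_addr`, its `words` member (standing in for v6_rcv_saddr.s6_addr32) and all helper names are hypothetical, and the real kernel helpers differ.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the address stored in a bhash2 bucket. */
struct bucket_addr {
	uint32_t words[4]; /* always a full IPv6 address, network byte order */
};

/* First 12 bytes of any v4-mapped-v6 address (::ffff:0:0/96). */
static const uint8_t v4mapped_prefix[12] = {
	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
};

/* Store an IPv4 address (network byte order) as ::ffff:a.b.c.d. */
static void bucket_set_v4(struct bucket_addr *b, uint32_t v4_be)
{
	memcpy(b->words, v4mapped_prefix, sizeof(v4mapped_prefix));
	b->words[3] = v4_be;
}

static bool bucket_is_v4mapped(const struct bucket_addr *b)
{
	return memcmp(b->words, v4mapped_prefix, sizeof(v4mapped_prefix)) == 0;
}

/* With one representation, a v6 comparison is a plain 16-byte compare ... */
static bool bucket_match_v6(const struct bucket_addr *a,
			    const struct bucket_addr *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}

/* ... and a v4 comparison only needs the last 32-bit word. */
static bool bucket_match_v4(const struct bucket_addr *b, uint32_t v4_be)
{
	return bucket_is_v4mapped(b) && b->words[3] == v4_be;
}
```

Since every bucket always holds all 16 bytes, no comparison ever reads an uninitialised union member, and no separate family field is needed to know which member is valid.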

Note these functions will be further refactored in the next patch.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 06a8c04)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Iab0c8d45e3ba55cc3de579524bc29cdceb0f573a
mainline inclusion
from mainline-v6.8-rc1
category: performance

inet_bind2_bucket_addr_match() and inet_bind2_bucket_match_addr_any()
are called for each bhash2 bucket to check conflicts.  Thus, we call
ipv6_addr_any() and ipv6_addr_v4mapped() over and over during bind().

Let's avoid calling them by saving the address type in inet_bind2_bucket.
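
A minimal sketch of that caching, assuming a simplified three-way classification: the address bytes are inspected once when the bucket is created, and later conflict checks only read the cached field. The enum, struct and function names here are hypothetical; the kernel stores its own address-type flags.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical address classes; the kernel uses its own address-type flags. */
enum addr_kind { ADDR_KIND_ANY, ADDR_KIND_V4MAPPED, ADDR_KIND_OTHER };

struct bucket {
	uint8_t addr[16];
	enum addr_kind kind; /* cached when the bucket is created */
};

static enum addr_kind classify(const uint8_t addr[16])
{
	static const uint8_t zero[16];
	static const uint8_t v4mapped_prefix[12] = {
		0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
	};

	if (memcmp(addr, zero, sizeof(zero)) == 0)
		return ADDR_KIND_ANY;
	if (memcmp(addr, v4mapped_prefix, sizeof(v4mapped_prefix)) == 0)
		return ADDR_KIND_V4MAPPED;
	return ADDR_KIND_OTHER;
}

static void bucket_init(struct bucket *b, const uint8_t addr[16])
{
	memcpy(b->addr, addr, 16);
	b->kind = classify(addr); /* the only place the bytes are inspected */
}

/* Per-bucket checks during bind() become a single field comparison. */
static bool bucket_is_any(const struct bucket *b)
{
	return b->kind == ADDR_KIND_ANY;
}
```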

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 5a22bba)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I4a9e5f815f293a7df2b928436dadefb880d60b5f
mainline inclusion
from mainline-v6.8-rc1
category: performance

Later, we no longer link sockets to bhash.  Instead, each bhash2
bucket is linked to the corresponding bhash bucket.

Then, we pass the bhash bucket to bhash2 allocation functions as
tb.  However, tb is already used in inet_bind2_bucket_create() and
inet_bind2_bucket_init() as the bhash2 bucket.

To make the following diff clear, let's use tb2 for the bhash2 bucket
there.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 4dd7108)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I2d9e67e64ce03bada544e8b97d5c94ba176924dd
mainline inclusion
from mainline-v6.8-rc1
category: performance

bhash2 added a new member sk_bind2_node in struct sock to link
sockets to bhash2 in addition to bhash.

bhash is still needed to search conflicting sockets efficiently
from a port for the wildcard address.  However, bhash itself need
not have sockets.

If we link each bhash2 bucket to the corresponding bhash bucket,
we can iterate the same set of the sockets from bhash2 via bhash.

This patch links bhash2 to bhash only, and the actual use will be
in the later patches.  Finally, we will remove sk_bind2_node.
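
The two-level layout can be pictured with plain singly linked lists. This is a sketch under assumptions, not the kernel's hlist-based code: `sock_stub`, `bhash2_bucket`, `bhash_bucket` and `bhash_count_socks()` are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sock_stub {
	int id;
	struct sock_stub *next;           /* next socket in the same bhash2 bucket */
};

struct bhash2_bucket {                    /* keyed by (port, address) */
	struct sock_stub *owners;
	struct bhash2_bucket *bhash_next; /* next child of the same bhash bucket */
};

struct bhash_bucket {                     /* keyed by port only */
	unsigned short port;
	struct bhash2_bucket *bhash2;     /* child buckets instead of sockets */
};

/* Visit every socket bound to a port by walking the child buckets. */
static size_t bhash_count_socks(const struct bhash_bucket *tb)
{
	size_t n = 0;

	for (const struct bhash2_bucket *tb2 = tb->bhash2; tb2; tb2 = tb2->bhash_next)
		for (const struct sock_stub *sk = tb2->owners; sk; sk = sk->next)
			n++;
	return n;
}
```

The port-level bucket holds no sockets of its own, yet the nested walk still reaches the same set of sockets, which is what makes a second per-socket list node unnecessary.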

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 822fb91)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I028bdb7ab02896b944470d24a62e856f78bade05
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance

The following patch adds code in the !inet_use_bhash2_on_bind(sk)
case in inet_csk_bind_conflict().

To avoid adding nesting and to make the change cleaner, this patch
rearranges tests in inet_csk_bind_conflict().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 58655bc)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I6f9482a1f7955f2f959d1d2206147270574cf6a7
mainline inclusion
from mainline-v6.8-rc1
category: performance

Sockets in bhash are also linked to bhash2, but TIME_WAIT sockets
are linked separately in tb2->deathrow.

Let's replace tb->owners iteration in inet_csk_bind_conflict() with
two iterations over tb2->owners and tb2->deathrow.

This can be done safely under bhash's lock because socket insertion/
deletion in bhash2 happens with bhash's lock held.

Note that twsk_for_each_bound_bhash() will be removed later.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit b82ba72)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I808720724e977028ef8513a1a39694300194e93e
mainline inclusion
from mainline-v6.8-rc1
category: performance

We use hlist_empty(&tb->owners) to check if the bhash bucket has a socket.
We can check the child bhash2 buckets instead.

For this to work, the bhash2 bucket must be freed before the bhash bucket.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 8002d44)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Id16a58a73e26c30dcc5dbff80033b2681d71c747
mainline inclusion
from mainline-v6.8-rc1
category: performance

Now we do not use tb->owners and can unlink sockets from bhash.

sk_bind_node/tw_bind_node are available for bhash2 and will be
used in the following patch.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit b2cb9f9)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I9e373e39c2739b2e70149b79b5c4407ffb82d29c
mainline inclusion
from mainline-v6.8-rc1
category: performance

Now we can use sk_bind_node/tw_bind_node for bhash2, which means
we need not link TIME_WAIT sockets separately.

The dead code and sk_bind2_node will be removed in the next patch.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 770041d)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I0531abd39f6210024a9f349d55c25837a75b9c23
mainline inclusion
from mainline-v6.8-rc1
category: performance

Now all sockets including TIME_WAIT are linked to bhash2 using
sock_common.skc_bind_node.

We no longer use inet_bind2_bucket.deathrow, sock.sk_bind2_node,
and inet_timewait_sock.tw_bind2_node.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 8191792)
Change-Id: Ie78b23ea9661ebdb3b8f0a3717d3b34a361ff354
Signed-off-by: Wentao Guan <[email protected]>
[Conflicts: context diff because of KABI]
Signed-off-by: Wentao Guan <[email protected]>
Conflicts:
	include/net/sock.h
mainline inclusion
from mainline-v6.15-rc1
category: performance

tcp_in_quickack_mode() is called from input path for small packets.

It calls __sk_dst_get() which reads sk->sk_dst_cache which has been
put in sock_read_tx group (for good reasons).

Then dst_metric(dst, RTAX_QUICKACK) also incurs extra cache line misses.

Cache RTAX_QUICKACK in icsk->icsk_ack.dst_quick_ack to no longer pull
these cache lines in cases where a delayed ACK is scheduled.

After this patch, the TCP receive path no longer accesses the sock_read_tx
group.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Reviewed-by: Neal Cardwell <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

(cherry picked from commit 1549270)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Ie4afef568644ab39555e776c10592200736a522f
mainline inclusion
from mainline-v6.15-rc1
category: performance

When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
__inet_check_established() and/or __inet6_check_established().

This patch adds an RCU lookup to avoid the spinlock
acquisition when the 4-tuple is found in the hash table.

Note that there are still spin_lock_bh() calls in
__inet_hash_connect() to protect inet_bind_hashbucket,
this will be fixed later in this series.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Tested-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit ae9d5b1)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I6cce9a2afd4062f8622cab4cad0a7d6453c4f33f
mainline inclusion
from mainline-v6.15-rc1
category: performance

There is no reason to call ipv6_addr_type().

Instead, use highly optimized ipv6_addr_any() and ipv6_addr_v4mapped().

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit ca79d80)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Ic23033aad4ec462fbb470cf859952ae341a21b81
mainline inclusion
from mainline-v6.15-rc1
category: performance

Add RCU protection to inet_bind_bucket structure.

- Add rcu_head field to the structure definition.

- Use kfree_rcu() at destroy time, and remove inet_bind_bucket_destroy()
  first argument.

- Use hlist_del_rcu() and hlist_add_head_rcu() methods.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit d186f40)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I57a0c27b179838e46c3d861bbc862dd90298fb22
mainline inclusion
from mainline-v6.15-rc1
category: performance

When __inet_hash_connect() has to try many 4-tuples before
finding an available one, we see a high spinlock cost from
the many spin_lock_bh(&head->lock) performed in its loop.

This patch adds an RCU lookup to avoid the spinlock cost.

check_established() gets a new @rcu_lookup argument.
First reason is to not make any changes while head->lock
is not held.
Second reason is to not make this RCU lookup a second time
after the spinlock has been acquired.
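
The probe-then-lock pattern can be modelled in a few lines of single-threaded C. This is a deliberately simplified sketch: there is no real RCU or spinlock here, a counter stands in for lock acquisitions, and `struct table`, `check_port()` and `pick_port()` are hypothetical names rather than kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPORTS 8

struct table {
	bool in_use[NPORTS];
	unsigned long lock_acquisitions; /* stands in for spin_lock_bh() calls */
};

/*
 * check_established()-like helper. With probe_only set it only reads the
 * table: it changes nothing and does not count as taking the lock.
 */
static bool check_port(struct table *t, size_t port, bool probe_only)
{
	if (probe_only)
		return !t->in_use[port];

	t->lock_acquisitions++;          /* "lock" taken for the real attempt */
	if (t->in_use[port])
		return false;            /* re-check: someone else got it first */
	t->in_use[port] = true;
	return true;
}

/* Try candidates in order; busy ones are skipped without locking. */
static long pick_port(struct table *t)
{
	for (size_t port = 0; port < NPORTS; port++) {
		if (!check_port(t, port, true))
			continue;
		if (check_port(t, port, false))
			return (long)port;
	}
	return -1;
}
```

With five busy candidates ahead of the first free one, this model takes the lock once instead of six times, which mirrors why the lock functions drop out of the perf profile when many 4-tuples have to be tried.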

Tested:

Server:

ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog

Client:

ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before series:

  utime_start=0.288582
  utime_end=1.548707
  stime_start=20.637138
  stime_end=2002.489845
  num_transactions=484453
  latency_min=0.156279245
  latency_max=20.922042756
  latency_mean=1.546521274
  latency_stddev=3.936005194
  num_samples=312537
  throughput=47426.00

perf top on the client:

 49.54%  [kernel]       [k] _raw_spin_lock
 25.87%  [kernel]       [k] _raw_spin_lock_bh
  5.97%  [kernel]       [k] queued_spin_lock_slowpath
  5.67%  [kernel]       [k] __inet_hash_connect
  3.53%  [kernel]       [k] __inet6_check_established
  3.48%  [kernel]       [k] inet6_ehashfn
  0.64%  [kernel]       [k] rcu_all_qs

After this series:

  utime_start=0.271607
  utime_end=3.847111
  stime_start=18.407684
  stime_end=1997.485557
  num_transactions=1350742
  latency_min=0.014131929
  latency_max=17.895073144
  latency_mean=0.505675853  # Nice reduction of latency metrics
  latency_stddev=2.125164772
  num_samples=307884
  throughput=139866.80      # 190 % increase

perf top on client:

 56.86%  [kernel]       [k] __inet6_check_established
 17.96%  [kernel]       [k] __inet_hash_connect
 13.88%  [kernel]       [k] inet6_ehashfn
  2.52%  [kernel]       [k] rcu_all_qs
  2.01%  [kernel]       [k] __cond_resched
  0.41%  [kernel]       [k] _raw_spin_lock

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Tested-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 86c2bc2)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Icf547979f93422af63cd937427ead38f616f0b4d
Signed-off-by: Wentao Guan <[email protected]>
…wildcard addresses.

mainline inclusion
from mainline-v6.9-rc3
category: bugfix

Commit 5e07e67 ("tcp: Use bhash2 for v4-mapped-v6 non-wildcard
address.") introduced bind() regression for v4-mapped-v6 address.

When we bind() the following two addresses on the same port, the 2nd
bind() should succeed but fails now.

  1. [::] w/ IPV6_ONLY
  2. ::ffff:127.0.0.1

After the change, v4-mapped-v6 uses bhash2 instead of bhash to
detect conflict faster, but I forgot to add a necessary change.

During the 2nd bind(), inet_bind2_bucket_match_addr_any() returns
the tb2 bucket of [::], and inet_bhash2_conflict() finally calls
inet_bind_conflict(), which returns true, meaning conflict.

  inet_bhash2_addr_any_conflict
  |- inet_bind2_bucket_match_addr_any  <-- return [::] bucket
  `- inet_bhash2_conflict
     `- __inet_bhash2_conflict <-- checks IPV6_ONLY for AF_INET
        |                          but not for v4-mapped-v6 address
        `- inet_bind_conflict  <-- does not check address

inet_bind_conflict() does not check socket addresses because
__inet_bhash2_conflict() is expected to do so.

However, it checks IPV6_V6ONLY attribute only against AF_INET
socket, and not for v4-mapped-v6 address.

As a result, v4-mapped-v6 address conflicts with v6-only wildcard
address.

To avoid that, let's add the missing test to use bhash2 for
v4-mapped-v6 address.
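
The missing test can be modelled as a small predicate. This is a heavily simplified sketch of the decision described above, assuming the port and bucket have already matched and ignoring SO_REUSEADDR, SO_REUSEPORT and every other condition the kernel checks; all type and function names are hypothetical.

```c
#include <stdbool.h>

enum family { FAM_INET, FAM_INET6 }; /* stand-ins for AF_INET / AF_INET6 */

/* The socket calling bind(). */
struct bind_req {
	enum family family;
	bool v4mapped;       /* address has the form ::ffff:a.b.c.d */
};

/* A socket already bound to [::] on the same port. */
struct bound_sock {
	bool v6only;         /* IPV6_V6ONLY is set */
};

/* Before the fix: only an AF_INET socket is exempt from a v6-only wildcard. */
static bool conflicts_before(const struct bind_req *req,
			     const struct bound_sock *sk2)
{
	if (sk2->v6only && req->family == FAM_INET)
		return false;
	return true;
}

/* After the fix: a v4-mapped-v6 address is exempt as well. */
static bool conflicts_after(const struct bind_req *req,
			    const struct bound_sock *sk2)
{
	if (sk2->v6only && (req->family == FAM_INET || req->v4mapped))
		return false;
	return true;
}
```

In this model, binding ::ffff:127.0.0.1 while a v6-only [::] socket holds the port is reported as a conflict by the first predicate and allowed by the second, matching the regression and its fix as described in the commit message.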

Fixes: 5e07e67 ("tcp: Use bhash2 for v4-mapped-v6 non-wildcard address.")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit ea11144)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Iae912826216b319d6a430a2137849586832ef488
Signed-off-by: Wentao Guan <[email protected]>

sourcery-ai bot commented Jul 4, 2025

Reviewer's Guide

This PR refactors the secondary bind-hash (bhash2) into a unified, RCU-safe bind bucket model (removing sk_bind2_node and deathrow lists), introduces an RCU-based fast path in __inet_hash_connect to drastically improve connect() scalability under load, consolidates bind conflict checks, adds a cached quick-ACK metric to speed up ACK path decisions, and extends the inet_diag interface with a new TCP_BOUND_INACTIVE pseudo-state.

Sequence diagram for RCU-based fast path in __inet_hash_connect

sequenceDiagram
    participant App as Application
    participant Kernel as Kernel TCP Stack
    participant RCU as RCU Read Lock
    participant Hash as Bind Hash Bucket

    App->>Kernel: connect() syscall
    Kernel->>RCU: rcu_read_lock()
    Kernel->>Hash: hlist_for_each_entry_rcu (bind bucket lookup)
    alt Fast path: no conflict
        Kernel->>RCU: rcu_read_unlock()
        Kernel->>App: Success (no lock contention)
    else Conflict or not found
        Kernel->>RCU: rcu_read_unlock()
        Kernel->>Hash: spin_lock_bh (slow path)
        Kernel->>App: Continue with legacy path
    end

Class diagram for refactored bind bucket structures

classDiagram
    class inet_bind_bucket {
        +unsigned short port
        +unsigned short fastreuse
        +unsigned short fastreuseport
        +struct hlist_node node
        +struct hlist_head bhash2
        +struct rcu_head rcu
    }
    class inet_bind2_bucket {
        +struct net *ib_net
        +int l3mdev
        +unsigned short port
        +unsigned short addr_type
        +struct in6_addr v6_rcv_saddr
        +struct hlist_node node
        +struct hlist_node bhash_node
        +struct hlist_head owners
    }
    inet_bind_bucket "1" o-- "*" inet_bind2_bucket : bhash2

Class diagram for sock and inet_timewait_sock after sk_bind2_node removal

classDiagram
    class sock {
        ...
        -struct hlist_node sk_bind2_node
        ...
    }
    class inet_timewait_sock {
        ...
        -struct hlist_node tw_bind2_node
        ...
    }

Class diagram for inet_connection_sock quick-ACK caching

classDiagram
    class inet_connection_sock_ack_block {
        +unsigned int quick
        +unsigned int pingpong
        +unsigned int dst_quick_ack
    }
    class inet_connection_sock {
        +inet_connection_sock_ack_block icsk_ack
        ...
    }

Class diagram for new TCP state and flags

classDiagram
    class tcp_states_h {
        +enum TCP_BOUND_INACTIVE
        +enum TCPF_BOUND_INACTIVE
    }
    class bpf_h {
        +enum BPF_TCP_BOUND_INACTIVE
    }

Class diagram for inet_diag TCP_BOUND_INACTIVE support

classDiagram
    class inet_diag {
        +void inet_diag_dump_icsk(...)
        +TCP_BOUND_INACTIVE pseudo-state
    }

File-Level Changes

Change: Refactor bhash2 and remove sk_bind2_node with RCU support
Details:
  • Replace separate owners/deathrow lists with a single bhash2 list in bind buckets
  • Use hlist_add_head_rcu/hlist_del_rcu and kfree_rcu for bind bucket lifecycle
  • Update inet_bind2_bucket_init/create/destroy to accept the parent bind bucket pointer and eliminate redundant fields
  • Remove sk_bind2_node, sk_add_bind2_node and __sk_del_bind2_node; use unified sk_add_bind_node
  • Adjust timewait socket hashing to use the unified bind bucket list
Files:
  • net/ipv4/inet_hashtables.c
  • include/net/inet_hashtables.h
  • include/net/sock.h
  • net/ipv4/inet_timewait_sock.c
  • include/net/inet_timewait_sock.h
  • net/ipv4/inet_connection_sock.c

Change: Add RCU-based fast path in __inet_hash_connect
Details:
  • Extend check_established and __inet6_check_established signatures to take an rcu_lookup flag
  • Introduce rcu_read_lock/unlock and hlist_for_each_entry_rcu in __inet_hash_connect to probe bind buckets without locks
  • Branch off into an RCU path to skip busy buckets and avoid spin-lock contention
  • Update header prototypes to reflect the new argument
Files:
  • net/ipv4/inet_hashtables.c
  • net/ipv6/inet6_hashtables.c
  • include/net/inet_hashtables.h

Change: Consolidate bind conflict logic
Details:
  • Define sk_for_each_bound_bhash to iterate sockets across unified buckets
  • Simplify inet_csk_bind_conflict to a single code path using the new macro
  • Remove the dual-path conditional on inet_use_bhash2_on_bind
  • Drop the obsolete sk_for_each_bound_bhash2 macro and related fields
Files:
  • net/ipv4/inet_connection_sock.c
  • include/net/sock.h

Change: Cache and apply dst_metric(RTAX_QUICKACK) for quick ACK decisions
Details:
  • Add dst_quick_ack bit field to inet_connection_sock
  • Initialize dst_quick_ack in sk_setup_caps() based on dst_metric
  • Update tcp_in_quickack_mode() to consult the cached field
Files:
  • net/core/sock.c
  • include/net/inet_connection_sock.h
  • net/ipv4/tcp_input.c

Change: Extend inet_diag and add TCP_BOUND_INACTIVE state
Details:
  • Introduce a new TCP_BOUND_INACTIVE pseudo-state and matching TCPF flag
  • Implement batched SKARR_SZ walking of bound-inactive buckets in inet_diag_dump_icsk
  • Add BUILD_BUG_ON check for the new bound_inactive state in tcp.c
  • Update BPF and tcp_states headers to expose TCP_BOUND_INACTIVE
Files:
  • net/ipv4/inet_diag.c
  • include/net/tcp_states.h
  • include/uapi/linux/bpf.h
  • net/ipv4/tcp.c

@Avenger-285714 Avenger-285714 requested a review from Copilot July 4, 2025 07:19

@sourcery-ai sourcery-ai bot left a comment

Hey @opsiff - I've reviewed your changes - here's some feedback:

  • Extract the duplicated RCU‐based port scanning logic in __inet_hash_connect and __inet6_check_established into a common helper to reduce code duplication and ease future maintenance.
  • The resume/pagination state machine in inet_diag_dump_icsk is quite intricate; consider refactoring it into a dedicated iterator or helper to improve readability and avoid off‐by‐one errors.
  • After switching inet_bind_bucket_destroy to kfree_rcu(tb, rcu), please audit all bind/unbind paths to ensure there are no use‐after‐free races under concurrent bind() or connect() workloads.

Copilot

This comment was marked as outdated.

@Avenger-285714
Collaborator

/approve

@deepin-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Avenger-285714

@Copilot Copilot AI left a comment

Pull Request Overview

This PR syncs upstream TCP refactor patches and adds performance optimizations, including RCU fast paths and a new pseudo-state for diagnostics.

  • Refactor and unify bind bucket handling by removing sk_bind2_node and switching to RCU-safe lists.
  • Introduce an RCU-based fast-path in __inet_hash_connect to scale connect() under load.
  • Add TCP_BOUND_INACTIVE pseudo-state and batch-processing support in inet_diag_dump_icsk.

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| net/ipv6/inet6_hashtables.c | Added rcu_lookup parameter and RCU lookup path in established check. |
| net/ipv4/tcp_input.c | Replaced on-the-fly DST lookup with cached dst_quick_ack. |
| net/ipv4/tcp.c | Added BUILD_BUG_ON for new TCP_BOUND_INACTIVE state. |
| net/ipv4/inet_timewait_sock.c | Updated TW socket unhash to use __sk_del_bind_node and RCU destroy. |
| net/ipv4/inet_hashtables.c | Unified bind bucket API, removed legacy owners list, added RCU. |
| net/ipv4/inet_diag.c | Implemented TCP_BOUND_INACTIVE dump with batching (SKARR_SZ). |
| net/ipv4/inet_connection_sock.c | Revised bhash2 conflict logic and added sk_for_each_bound_bhash macro. |
| net/core/sock.c | Cached dst_quick_ack in inet_connection_sock. |
| include/uapi/linux/bpf.h | Defined BPF_TCP_BOUND_INACTIVE constant. |
| include/net/tcp_states.h | Defined TCP_BOUND_INACTIVE state and flag. |
| include/net/sock.h | Removed obsolete sk_bind2_node field and macros. |
| include/net/ipv6.h | Removed deprecated ipv6_addr_v4mapped_any. |
| include/net/inet_timewait_sock.h | Dropped tw_bind2_node and related macro. |
| include/net/inet_hashtables.h | Updated bind bucket structs and function prototypes. |
| include/net/inet_connection_sock.h | Added dst_quick_ack bit to inet_connection_sock. |
Comments suppressed due to low confidence (1)

net/ipv4/inet_diag.c:1083

  • Introduced a new diagnostic code path for bound-inactive sockets; please add unit or integration tests that exercise TCP_BOUND_INACTIVE reporting to ensure the batching logic and state filtering behave correctly.

#define SKARR_SZ 16

/* Dump bound but inactive (not listening, connecting, etc.) sockets */
if (cb->args[0] == 1) {

Copilot AI Jul 6, 2025

[nitpick] The new batching loop for TCP_BOUND_INACTIVE in inet_diag_dump_icsk is quite large and complex. Consider extracting the bind-walk logic into a helper function to improve readability and reduce cyclomatic complexity.

5 participants