[Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] Sync tcp patches from upstream #918

base: linux-6.6.y
Conversation
mainline inclusion
from mainline-v6.8-rc1
category: performance

Walk the hashinfo->bhash2 table so that inet_diag can dump TCP sockets
that are bound but haven't yet called connect() or listen().

The code is inspired by the ->lhash2 loop. However, there's no manual
test of the source port, since this kind of filtering is already
handled by inet_diag_bc_sk(). Also, a maximum of 16 sockets are dumped
at a time, to avoid running with bh disabled for too long.

There's no TCP state for bound but otherwise inactive sockets. Such
sockets normally map to TCP_CLOSE. However, "ss -l", which is supposed
to only dump listening sockets, actually requests the kernel to dump
sockets in either the TCP_LISTEN or TCP_CLOSE states. To avoid dumping
bound-only sockets with "ss -l", we therefore need to define a new
pseudo-state (TCP_BOUND_INACTIVE) that user space will be able to set
explicitly.

With an IPv4, an IPv6 and an IPv6-only socket, bound respectively to
40000, 64000, 60000, an updated version of iproute2 could work as
follows:

    $ ss -t state bound-inactive
    Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
    0      0           0.0.0.0:40000       0.0.0.0:*
    0      0              [::]:60000          [::]:*
    0      0                 *:64000             *:*

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Guillaume Nault <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/b3a84ae61e19c06806eea9c602b3b66e8f0cfc81.1701362867.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 91051f0)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I90f7c930ce5a69adb8139cc593dcc0e231926061
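As context for the patch above, here is a minimal user-space sketch of how a dump of bound-but-inactive sockets could be requested over the inet_diag netlink interface. The TCP_BOUND_INACTIVE value (13 in upstream tcp_states.h, guarded below in case the build headers predate this series) and the request layout follow the standard sock_diag conventions; this is illustration, not code from the PR.

```c
#include <linux/inet_diag.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef TCP_BOUND_INACTIVE
#define TCP_BOUND_INACTIVE 13	/* pseudo-state introduced by this series */
#endif

/* Ask the kernel to dump only bound-but-inactive TCP sockets. */
static int request_bound_inactive(int nlsk)
{
	struct {
		struct nlmsghdr nlh;
		struct inet_diag_req_v2 req;
	} msg = {
		.nlh = {
			.nlmsg_len   = sizeof(msg),
			.nlmsg_type  = SOCK_DIAG_BY_FAMILY,
			.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
		},
		.req = {
			.sdiag_family   = AF_INET,
			.sdiag_protocol = IPPROTO_TCP,
			/* one bit per (pseudo-)state; never set implicitly */
			.idiag_states   = 1U << TCP_BOUND_INACTIVE,
		},
	};

	return send(nlsk, &msg, sizeof(msg), 0);
}
```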
mainline inclusion
from mainline-v6.8-rc1
category: performance

While checking port availability in bind() or listen(), we used only
bhash for all v4-mapped-v6 addresses. But there is no good reason not
to use bhash2 for v4-mapped-v6 non-wildcard addresses.

Let's do it by returning true in inet_use_bhash2_on_bind(). Then, we
also need to add a test in inet_bind2_bucket_match_addr_any() so that
::ffff:X.X.X.X will match with 0.0.0.0.

Note that sk->sk_rcv_saddr is initialised for v4-mapped-v6 sk in
__inet6_bind().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 5e07e67)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Iea294b9533360379e2da8f657a57437f2bba3534
mainline inclusion
from mainline-v6.8-rc1
category: performance

The protocol family tests in inet_bind2_bucket_addr_match() and
inet_bind2_bucket_match_addr_any() are ordered as follows.

    if (sk->sk_family != tb2->family)
    else if (sk->sk_family == AF_INET6)
    else

This patch rearranges them so that the AF_INET6 socket is handled
first, to make the following patch tidy, where tb2->family will be
removed.

    if (sk->sk_family == AF_INET6)
    else if (tb2->family == AF_INET6)
    else

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 56f3e3f)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I615614fdfad00f22916ef0ce911aabab3d05a0fe
mainline inclusion
from mainline-v6.8-rc1
category: performance

In bhash2, IPv4/IPv6 addresses are saved in two union members, which
complicates address checks in inet_bind2_bucket_addr_match() and
inet_bind2_bucket_match_addr_any(), considering uninitialised memory
and v4-mapped-v6 conflicts.

Let's simplify that by saving the IPv4 address as a v4-mapped-v6
address and defining tb2.rcv_saddr as tb2.v6_rcv_saddr.s6_addr32[3].
Then, we can compare the v6 address as is, and after checking
v4-mapped-v6, we can compare the v4 address easily. Also, we can
remove tb2->family.

Note these functions will be further refactored in the next patch.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 06a8c04)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Iab0c8d45e3ba55cc3de579524bc29cdceb0f573a
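A stand-alone demo of the v4-mapped-v6 layout the patch above exploits: the IPv4 address sits in the last 32-bit word of the 128-bit address, so one in6_addr field can represent both families. This only mirrors the representation; it is not the kernel code itself.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct in6_addr mapped;
	struct in_addr v4;
	char buf[INET6_ADDRSTRLEN];

	/* Build ::ffff:192.0.2.1: 80 zero bits, 16 one bits, then IPv4. */
	memset(&mapped, 0, sizeof(mapped));
	mapped.s6_addr[10] = 0xff;
	mapped.s6_addr[11] = 0xff;
	inet_pton(AF_INET, "192.0.2.1", &mapped.s6_addr[12]);

	/* The IPv4 address is the last 32-bit word, which is exactly what
	 * defining tb2.rcv_saddr as tb2.v6_rcv_saddr.s6_addr32[3] relies on. */
	memcpy(&v4, &mapped.s6_addr[12], sizeof(v4));

	printf("%s\n", inet_ntop(AF_INET6, &mapped, buf, sizeof(buf))); /* ::ffff:192.0.2.1 */
	printf("%s\n", inet_ntop(AF_INET, &v4, buf, sizeof(buf)));      /* 192.0.2.1 */
	return 0;
}
```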
mainline inclusion
from mainline-v6.8-rc1
category: performance

inet_bind2_bucket_addr_match() and inet_bind2_bucket_match_addr_any()
are called for each bhash2 bucket to check conflicts. Thus, we call
ipv6_addr_any() and ipv6_addr_v4mapped() over and over during bind().

Let's avoid calling them by saving the address type in
inet_bind2_bucket.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 5a22bba)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I4a9e5f815f293a7df2b928436dadefb880d60b5f
mainline inclusion
from mainline-v6.8-rc1
category: performance

Later, we no longer link sockets to bhash. Instead, each bhash2 bucket
is linked to the corresponding bhash bucket. Then, we pass the bhash
bucket to bhash2 allocation functions as tb.

However, tb is already used in inet_bind2_bucket_create() and
inet_bind2_bucket_init() as the bhash2 bucket. To make the following
diff clear, let's use tb2 for the bhash2 bucket there.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 4dd7108)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I2d9e67e64ce03bada544e8b97d5c94ba176924dd
mainline inclusion
from mainline-v6.8-rc1
category: performance

bhash2 added a new member sk_bind2_node in struct sock to link sockets
to bhash2 in addition to bhash.

bhash is still needed to search conflicting sockets efficiently from a
port for the wildcard address. However, bhash itself need not have
sockets. If we link each bhash2 bucket to the corresponding bhash
bucket, we can iterate the same set of the sockets from bhash2 via
bhash.

This patch links bhash2 to bhash only, and the actual use will be in
the later patches. Finally, we will remove sk_bind2_node.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 822fb91)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I028bdb7ab02896b944470d24a62e856f78bade05
Signed-off-by: Wentao Guan <[email protected]>
mainline inclusion
from mainline-v6.8-rc1
category: performance

The following patch adds code in the !inet_use_bhash2_on_bind(sk) case
in inet_csk_bind_conflict(). To avoid adding nesting and to make the
change cleaner, this patch rearranges tests in inet_csk_bind_conflict().

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 58655bc)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I6f9482a1f7955f2f959d1d2206147270574cf6a7
mainline inclusion
from mainline-v6.8-rc1
category: performance

Sockets in bhash are also linked to bhash2, but TIME_WAIT sockets are
linked separately in tb2->deathrow.

Let's replace the tb->owners iteration in inet_csk_bind_conflict()
with two iterations over tb2->owners and tb2->deathrow.

This can be done safely under bhash's lock because socket insertion/
deletion in bhash2 happens with bhash's lock held.

Note that twsk_for_each_bound_bhash() will be removed later.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit b82ba72)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I808720724e977028ef8513a1a39694300194e93e
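To visualise the bucket relationship this part of the series builds, here is a hedged sketch of a two-level conflict walk once bhash2 buckets hang off their parent bhash bucket. check_conflict() is a placeholder, and field names follow the reviewer's guide diagrams further down this page rather than a verbatim copy of the patch.

```c
/* Sketch only: visit every socket bound to a port by walking
 * bhash bucket -> child bhash2 buckets -> owner sockets. */
static bool bind_conflict_walk(struct inet_bind_bucket *tb,
			       const struct sock *newsk)
{
	struct inet_bind2_bucket *tb2;
	struct sock *sk;

	/* Each bhash2 bucket for this port is linked into tb->bhash2. */
	hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
		/* sk_for_each_bound() iterates via sock's sk_bind_node. */
		sk_for_each_bound(sk, &tb2->owners) {
			if (check_conflict(newsk, sk)) /* placeholder */
				return true;
		}
	}
	return false;
}
```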
mainline inclusion
from mainline-v6.8-rc1
category: performance

We use hlist_empty(&tb->owners) to check if the bhash bucket has a
socket. We can check the child bhash2 buckets instead.

For this to work, the bhash2 bucket must be freed before the bhash
bucket.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 8002d44)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Id16a58a73e26c30dcc5dbff80033b2681d71c747
mainline inclusion
from mainline-v6.8-rc1
category: performance

Now we do not use tb->owners and can unlink sockets from bhash.

sk_bind_node/tw_bind_node are available for bhash2 and will be used in
the following patch.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit b2cb9f9)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I9e373e39c2739b2e70149b79b5c4407ffb82d29c
mainline inclusion
from mainline-v6.8-rc1
category: performance

Now we can use sk_bind_node/tw_bind_node for bhash2, which means we
need not link TIME_WAIT sockets separately.

The dead code and sk_bind2_node will be removed in the next patch.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 770041d)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I0531abd39f6210024a9f349d55c25837a75b9c23
mainline inclusion
from mainline-v6.8-rc1
category: performance

Now all sockets, including TIME_WAIT, are linked to bhash2 using
sock_common.skc_bind_node. We no longer use
inet_bind2_bucket.deathrow, sock.sk_bind2_node, and
inet_timewait_sock.tw_bind2_node.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(cherry picked from commit 8191792)
Change-Id: Ie78b23ea9661ebdb3b8f0a3717d3b34a361ff354
Signed-off-by: Wentao Guan <[email protected]>
[Conflicts: context diff because of KABI]
Signed-off-by: Wentao Guan <[email protected]>

Conflicts:
	include/net/sock.h
mainline inclusion
from mainline-v6.15-rc1
category: performance

tcp_in_quickack_mode() is called from the input path for small
packets.

It calls __sk_dst_get(), which reads sk->sk_dst_cache, which has been
put in the sock_read_tx group (for good reasons). Then dst_metric(dst,
RTAX_QUICKACK) also needs extra cache line misses.

Cache RTAX_QUICKACK in icsk->icsk_ack.dst_quick_ack to no longer pull
these cache lines for the cases a delayed ACK is scheduled.

After this patch, the TCP receive path no longer accesses the
sock_read_tx group.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Reviewed-by: Neal Cardwell <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
(cherry picked from commit 1549270)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Ie4afef568644ab39555e776c10592200736a522f
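A hedged before/after sketch of the hot-path change: the dst dereference plus metric lookup collapses into one cached bit in icsk_ack. The upstream code may differ in detail; the shape is what matters.

```c
/* Before (sketch): pulls sk->sk_dst_cache (sock_read_tx group) plus the
 * dst metrics cache line on every small incoming packet. */
static bool tcp_in_quickack_mode_before(struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);
	const struct dst_entry *dst = __sk_dst_get(sk);

	return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
	       (icsk->icsk_ack.quick && !inet_csk_in_pingpong_mode(sk));
}

/* After (sketch): RTAX_QUICKACK was cached in dst_quick_ack when the
 * route was installed, so the receive path stays within icsk. */
static bool tcp_in_quickack_mode_after(struct sock *sk)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);

	return icsk->icsk_ack.dst_quick_ack ||
	       (icsk->icsk_ack.quick && !inet_csk_in_pingpong_mode(sk));
}
```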
mainline inclusion
from mainline-v6.15-rc1
category: performance

When __inet_hash_connect() has to try many 4-tuples before finding an
available one, we see a high spinlock cost from
__inet_check_established() and/or __inet6_check_established().

This patch adds an RCU lookup to avoid the spinlock acquisition when
the 4-tuple is found in the hash table.

Note that there are still spin_lock_bh() calls in
__inet_hash_connect() to protect inet_bind_hashbucket; this will be
fixed later in this series.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Tested-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit ae9d5b1)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I6cce9a2afd4062f8622cab4cad0a7d6453c4f33f
mainline inclusion
from mainline-v6.15-rc1
category: performance

There is no reason to call ipv6_addr_type(). Instead, use the highly
optimized ipv6_addr_any() and ipv6_addr_v4mapped().

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit ca79d80)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Ic23033aad4ec462fbb470cf859952ae341a21b81
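For reference, user-space equivalents of the two helpers, close to their classic definitions in include/net/ipv6.h (current kernels may use word-size-optimised variants, so treat this as illustrative). Both are branch-free bit tests, which is why they beat the general-purpose ipv6_addr_type() classifier:

```c
#include <arpa/inet.h>	/* htonl() */
#include <stdint.h>

/* Mirrors ipv6_addr_any(): true for :: (all 128 bits zero). */
static inline int addr6_any(const uint32_t w[4])
{
	return (w[0] | w[1] | w[2] | w[3]) == 0;
}

/* Mirrors ipv6_addr_v4mapped(): true for ::ffff:a.b.c.d. */
static inline int addr6_v4mapped(const uint32_t w[4])
{
	return (w[0] | w[1] | (w[2] ^ htonl(0x0000ffff))) == 0;
}
```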
mainline inclusion
from mainline-v6.15-rc1
category: performance

Add RCU protection to the inet_bind_bucket structure.

- Add an rcu_head field to the structure definition.
- Use kfree_rcu() at destroy time, and remove the first argument of
  inet_bind_bucket_destroy().
- Use the hlist_del_rcu() and hlist_add_head_rcu() methods.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit d186f40)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: I57a0c27b179838e46c3d861bbc862dd90298fb22
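A hedged sketch of the destroy path after this change, consistent with the bullets above; the struct is trimmed to the fields that matter here, and the real function may carry extra bookkeeping.

```c
/* Sketch: free the bucket only once no bhash2 children remain, and defer
 * the actual free past an RCU grace period so concurrent lockless readers
 * walking the chain never touch freed memory. */
struct bind_bucket_sketch {
	unsigned short	  port;
	struct hlist_node node;		/* chained in bhash, RCU-safe ops */
	struct hlist_head bhash2;	/* child bhash2 buckets */
	struct rcu_head	  rcu;		/* new field: enables kfree_rcu() */
};

static void bind_bucket_destroy(struct bind_bucket_sketch *tb)
{
	if (hlist_empty(&tb->bhash2)) {
		hlist_del_rcu(&tb->node);	/* readers may still traverse */
		kfree_rcu(tb, rcu);		/* free after a grace period */
	}
}
```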
mainline inclusion
from mainline-v6.15-rc1
category: performance

When __inet_hash_connect() has to try many 4-tuples before finding an
available one, we see a high spinlock cost from the many
spin_lock_bh(&head->lock) performed in its loop.

This patch adds an RCU lookup to avoid the spinlock cost.

check_established() gets a new @rcu_lookup argument. The first reason
is to not make any changes while head->lock is not held. The second
reason is to not make this RCU lookup a second time after the spinlock
has been acquired.

Tested:

Server:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog

Client:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server

Before series:

    utime_start=0.288582
    utime_end=1.548707
    stime_start=20.637138
    stime_end=2002.489845
    num_transactions=484453
    latency_min=0.156279245
    latency_max=20.922042756
    latency_mean=1.546521274
    latency_stddev=3.936005194
    num_samples=312537
    throughput=47426.00

perf top on the client:

    49.54% [kernel] [k] _raw_spin_lock
    25.87% [kernel] [k] _raw_spin_lock_bh
     5.97% [kernel] [k] queued_spin_lock_slowpath
     5.67% [kernel] [k] __inet_hash_connect
     3.53% [kernel] [k] __inet6_check_established
     3.48% [kernel] [k] inet6_ehashfn
     0.64% [kernel] [k] rcu_all_qs

After this series:

    utime_start=0.271607
    utime_end=3.847111
    stime_start=18.407684
    stime_end=1997.485557
    num_transactions=1350742
    latency_min=0.014131929
    latency_max=17.895073144
    latency_mean=0.505675853   # Nice reduction of latency metrics
    latency_stddev=2.125164772
    num_samples=307884
    throughput=139866.80       # 190 % increase

perf top on client:

    56.86% [kernel] [k] __inet6_check_established
    17.96% [kernel] [k] __inet_hash_connect
    13.88% [kernel] [k] inet6_ehashfn
     2.52% [kernel] [k] rcu_all_qs
     2.01% [kernel] [k] __cond_resched
     0.41% [kernel] [k] _raw_spin_lock

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Tested-by: Jason Xing <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit 86c2bc2)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Icf547979f93422af63cd937427ead38f616f0b4d
Signed-off-by: Wentao Guan <[email protected]>
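The control flow of the new fast path, as a hedged sketch of one iteration of the port-search loop. tuple_in_use_rcu() is an invented placeholder; in the real patch the lockless lookup is performed by check_established() itself with @rcu_lookup set to true, and the locked re-check is mandatory because the RCU pre-check can race with concurrent inserts.

```c
	rcu_read_lock();
	if (tuple_in_use_rcu(net, hash, port)) {	/* hypothetical helper */
		rcu_read_unlock();
		continue;	/* busy 4-tuple: next port, head->lock untouched */
	}
	rcu_read_unlock();

	spin_lock_bh(&head->lock);
	/* Re-validate under the lock: the RCU pre-check is only a hint and
	 * can race with a concurrent insert of the same 4-tuple. */
	if (!check_established(death_row, sk, port, &tw, false)) {
		/* commit the new binding, then release the lock */
	}
	spin_unlock_bh(&head->lock);
```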
…wildcard addresses.

mainline inclusion
from mainline-v6.9-rc3
category: bugfix

Commit 5e07e67 ("tcp: Use bhash2 for v4-mapped-v6 non-wildcard
address.") introduced a bind() regression for v4-mapped-v6 addresses.

When we bind() the following two addresses on the same port, the 2nd
bind() should succeed but fails now:

1. [::] w/ IPV6_V6ONLY
2. ::ffff:127.0.0.1

After the change, v4-mapped-v6 uses bhash2 instead of bhash to detect
conflicts faster, but I forgot to add a necessary change.

During the 2nd bind(), inet_bind2_bucket_match_addr_any() returns the
tb2 bucket of [::], and inet_bhash2_conflict() finally calls
inet_bind_conflict(), which returns true, meaning conflict.

    inet_bhash2_addr_any_conflict
    |- inet_bind2_bucket_match_addr_any  <-- return [::] bucket
    `- inet_bhash2_conflict
       `- __inet_bhash2_conflict <-- checks IPV6_V6ONLY for AF_INET
          |                          but not for v4-mapped-v6 address
          `- inet_bind_conflict  <-- does not check address

inet_bind_conflict() does not check socket addresses because
__inet_bhash2_conflict() is expected to do so. However, it checks the
IPV6_V6ONLY attribute only against AF_INET sockets, and not for
v4-mapped-v6 addresses. As a result, a v4-mapped-v6 address conflicts
with a v6-only wildcard address.

To avoid that, let's add the missing test to use bhash2 for
v4-mapped-v6 addresses.

Fixes: 5e07e67 ("tcp: Use bhash2 for v4-mapped-v6 non-wildcard address.")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
(cherry picked from commit ea11144)
Signed-off-by: Wentao Guan <[email protected]>
Change-Id: Iae912826216b319d6a430a2137849586832ef488
Signed-off-by: Wentao Guan <[email protected]>
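To make the regression concrete, here is a stand-alone reproducer along the lines of the commit message (port number arbitrary). On a fixed kernel both bind() calls succeed; on an affected kernel the second one fails with EADDRINUSE:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in6 a1 = { .sin6_family = AF_INET6,
				   .sin6_addr = IN6ADDR_ANY_INIT,
				   .sin6_port = htons(50000) };
	struct sockaddr_in6 a2 = { .sin6_family = AF_INET6,
				   .sin6_port = htons(50000) };
	int one = 1;

	/* 1. [::] with IPV6_V6ONLY */
	int s1 = socket(AF_INET6, SOCK_STREAM, 0);
	setsockopt(s1, IPPROTO_IPV6, IPV6_V6ONLY, &one, sizeof(one));
	if (bind(s1, (struct sockaddr *)&a1, sizeof(a1)))
		perror("bind [::]");

	/* 2. ::ffff:127.0.0.1 must not conflict with a v6-only wildcard. */
	int s2 = socket(AF_INET6, SOCK_STREAM, 0);
	inet_pton(AF_INET6, "::ffff:127.0.0.1", &a2.sin6_addr);
	if (bind(s2, (struct sockaddr *)&a2, sizeof(a2)))
		perror("bind ::ffff:127.0.0.1");  /* fails on affected kernels */

	close(s1);
	close(s2);
	return 0;
}
```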
Reviewer's Guide

This PR refactors the secondary bind-hash (bhash2) into a unified, RCU-safe bind bucket model (removing sk_bind2_node and deathrow lists), introduces an RCU-based fast path in __inet_hash_connect to drastically improve connect() scalability under load, consolidates bind conflict checks, adds a cached quick-ACK metric to speed up ACK-path decisions, and extends the inet_diag interface with a new TCP_BOUND_INACTIVE pseudo-state.

Sequence diagram for the RCU-based fast path in __inet_hash_connect

```mermaid
sequenceDiagram
participant App as Application
participant Kernel as Kernel TCP Stack
participant RCU as RCU Read Lock
participant Hash as Bind Hash Bucket
App->>Kernel: connect() syscall
Kernel->>RCU: rcu_read_lock()
Kernel->>Hash: hlist_for_each_entry_rcu (bind bucket lookup)
alt Fast path: no conflict
Kernel->>RCU: rcu_read_unlock()
Kernel->>App: Success (no lock contention)
else Conflict or not found
Kernel->>RCU: rcu_read_unlock()
Kernel->>Hash: spin_lock_bh (slow path)
Kernel->>App: Continue with legacy path
end
```

Class diagram for refactored bind bucket structures

```mermaid
classDiagram
class inet_bind_bucket {
+unsigned short port
+unsigned short fastreuse
+unsigned short fastreuseport
+struct hlist_node node
+struct hlist_head bhash2
+struct rcu_head rcu
}
class inet_bind2_bucket {
+struct net *ib_net
+int l3mdev
+unsigned short port
+unsigned short addr_type
+struct in6_addr v6_rcv_saddr
+struct hlist_node node
+struct hlist_node bhash_node
+struct hlist_head owners
}
inet_bind_bucket "1" o-- "*" inet_bind2_bucket : bhash2
```

Class diagram for sock and inet_timewait_sock after sk_bind2_node removal

```mermaid
classDiagram
class sock {
...
-struct hlist_node sk_bind2_node
...
}
class inet_timewait_sock {
...
-struct hlist_node tw_bind2_node
...
}
```

Class diagram for inet_connection_sock quick-ACK caching

```mermaid
classDiagram
class inet_connection_sock_ack_block {
+unsigned int quick
+unsigned int pingpong
+unsigned int dst_quick_ack
}
class inet_connection_sock {
+inet_connection_sock_ack_block icsk_ack
...
}
```

Class diagram for new TCP state and flags

```mermaid
classDiagram
class tcp_states_h {
+enum TCP_BOUND_INACTIVE
+enum TCPF_BOUND_INACTIVE
}
class bpf_h {
+enum BPF_TCP_BOUND_INACTIVE
}
```

Class diagram for inet_diag TCP_BOUND_INACTIVE support

```mermaid
classDiagram
class inet_diag {
+void inet_diag_dump_icsk(...)
+TCP_BOUND_INACTIVE pseudo-state
}
```
Hey @opsiff - I've reviewed your changes - here's some feedback:
- Extract the duplicated RCU-based port scanning logic in __inet_hash_connect and __inet6_check_established into a common helper to reduce code duplication and ease future maintenance.
- The resume/pagination state machine in inet_diag_dump_icsk is quite intricate; consider refactoring it into a dedicated iterator or helper to improve readability and avoid off-by-one errors.
- After switching inet_bind_bucket_destroy to kfree_rcu(tb, rcu), please audit all bind/unbind paths to ensure there are no use-after-free races under concurrent bind() or connect() workloads.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Avenger-285714

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Pull Request Overview
This PR syncs upstream TCP refactor patches and adds performance optimizations, including RCU fast paths and a new pseudo-state for diagnostics.
- Refactor and unify bind bucket handling by removing sk_bind2_node and switching to RCU-safe lists.
- Introduce an RCU-based fast path in __inet_hash_connect to scale connect() under load.
- Add the TCP_BOUND_INACTIVE pseudo-state and batch-processing support in inet_diag_dump_icsk.
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
net/ipv6/inet6_hashtables.c | Added rcu_lookup parameter and RCU lookup path in established check. |
net/ipv4/tcp_input.c | Replaced on-the-fly DST lookup with cached dst_quick_ack. |
net/ipv4/tcp.c | Added BUILD_BUG_ON for new TCP_BOUND_INACTIVE state. |
net/ipv4/inet_timewait_sock.c | Updated TW socket unhash to use __sk_del_bind_node and RCU destroy. |
net/ipv4/inet_hashtables.c | Unified bind bucket API, removed legacy owners list, added RCU. |
net/ipv4/inet_diag.c | Implemented TCP_BOUND_INACTIVE dump with batching (SKARR_SZ). |
net/ipv4/inet_connection_sock.c | Revised bhash2 conflict logic and added sk_for_each_bound_bhash macro. |
net/core/sock.c | Cached dst_quick_ack in inet_connection_sock. |
include/uapi/linux/bpf.h | Defined BPF_TCP_BOUND_INACTIVE constant. |
include/net/tcp_states.h | Defined TCP_BOUND_INACTIVE state and flag. |
include/net/sock.h | Removed obsolete sk_bind2_node field and macros. |
include/net/ipv6.h | Removed deprecated ipv6_addr_v4mapped_any. |
include/net/inet_timewait_sock.h | Dropped tw_bind2_node and related macro. |
include/net/inet_hashtables.h | Updated bind bucket structs and function prototypes. |
include/net/inet_connection_sock.h | Added dst_quick_ack bit to inet_connection_sock. |
Comments suppressed due to low confidence (1)
net/ipv4/inet_diag.c:1083
- Introduced a new diagnostic code path for bound-inactive sockets; please add unit or integration tests that exercise TCP_BOUND_INACTIVE reporting to ensure the batching logic and state filtering behave correctly.
```c
#define SKARR_SZ 16

	/* Dump bound but inactive (not listening, connecting, etc.) sockets */
	if (cb->args[0] == 1) {
```
[nitpick] The new batching loop for TCP_BOUND_INACTIVE in inet_diag_dump_icsk is quite large and complex. Consider extracting the bind-walk logic into a helper function to improve readability and reduce cyclomatic complexity.
Merge branch 'tcp-refactor-bhash2'
Kuniyuki Iwashima says:
====================
tcp: Refactor bhash2 and remove sk_bind2_node.
This series refactors code around bhash2 and removes some bhash2-specific
fields: sock.sk_bind2_node and inet_timewait_sock.tw_bind2_node.
patch 1 : optimise bind() for non-wildcard v4-mapped-v6 address
patch 2 - 4 : optimise bind() conflict tests
patch 5 - 12 : Link bhash2 to bhash and unlink sk from bhash2 to
remove sk_bind2_node
Patch 8 will trigger a false-positive error from checkpatch.
v2: resend of https://lore.kernel.org/netdev/[email protected]/
v1: https://lore.kernel.org/netdev/[email protected]/
Merge branch 'tcp-scale-connect-under-pressure'
Eric Dumazet says:
====================
tcp: scale connect() under pressure
Adoption of bhash2 in linux-6.1 made some operations almost twice
more expensive, because of additional locks.
This series adds RCU in __inet_hash_connect() to help the
case where many attempts need to be made before finding
an available 4-tuple.
This brings a ~200 % improvement in this experiment:
Server:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog
Client:
ulimit -n 40000; neper/tcp_crr -T 200 -F 30000 -6 --nolog -c -H server
Before series:
utime_start=0.288582
utime_end=1.548707
stime_start=20.637138
stime_end=2002.489845
num_transactions=484453
latency_min=0.156279245
latency_max=20.922042756
latency_mean=1.546521274
latency_stddev=3.936005194
num_samples=312537
throughput=47426.00
perf top on the client:
49.54% [kernel] [k] _raw_spin_lock
25.87% [kernel] [k] _raw_spin_lock_bh
5.97% [kernel] [k] queued_spin_lock_slowpath
5.67% [kernel] [k] __inet_hash_connect
3.53% [kernel] [k] __inet6_check_established
3.48% [kernel] [k] inet6_ehashfn
0.64% [kernel] [k] rcu_all_qs
After this series:
utime_start=0.271607
utime_end=3.847111
stime_start=18.407684
stime_end=1997.485557
num_transactions=1350742
latency_min=0.014131929
latency_max=17.895073144
latency_mean=0.505675853 # Nice reduction of latency metrics
latency_stddev=2.125164772
num_samples=307884
throughput=139866.80 # 194 % increase
perf top on client:
56.86% [kernel] [k] __inet6_check_established
17.96% [kernel] [k] __inet_hash_connect
13.88% [kernel] [k] inet6_ehashfn
2.52% [kernel] [k] rcu_all_qs
2.01% [kernel] [k] __cond_resched
0.41% [kernel] [k] _raw_spin_lock
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski [email protected]
Summary by Sourcery
Sync upstream TCP refactor and performance optimizations: restructure bhash2 binding, remove obsolete sk_bind2_node, add RCU fast-path for connect() under pressure, and introduce TCP_BOUND_INACTIVE support for diagnostics