summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2024-08-30tcp_bpf: fix return value of tcp_bpf_sendmsg()Cong Wang
When we cork messages in psock->cork, the last message triggers the flushing will result in sending a sk_msg larger than the current message size. In this case, in tcp_bpf_send_verdict(), 'copied' becomes negative at least in the following case: 468 case __SK_DROP: 469 default: 470 sk_msg_free_partial(sk, msg, tosend); 471 sk_msg_apply_bytes(psock, tosend); 472 *copied -= (tosend + delta); // <==== HERE 473 return -EACCES; Therefore, it could lead to the following BUG with a proper value of 'copied' (thanks to syzbot). We should not use negative 'copied' as a return value here. ------------[ cut here ]------------ kernel BUG at net/socket.c:733! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 0 UID: 0 PID: 3265 Comm: syz-executor510 Not tainted 6.11.0-rc3-syzkaller-00060-gd07b43284ab3 #0 Hardware name: linux,dummy-virt (DT) pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : sock_sendmsg_nosec net/socket.c:733 [inline] pc : sock_sendmsg_nosec net/socket.c:728 [inline] pc : __sock_sendmsg+0x5c/0x60 net/socket.c:745 lr : sock_sendmsg_nosec net/socket.c:730 [inline] lr : __sock_sendmsg+0x54/0x60 net/socket.c:745 sp : ffff800088ea3b30 x29: ffff800088ea3b30 x28: fbf00000062bc900 x27: 0000000000000000 x26: ffff800088ea3bc0 x25: ffff800088ea3bc0 x24: 0000000000000000 x23: f9f00000048dc000 x22: 0000000000000000 x21: ffff800088ea3d90 x20: f9f00000048dc000 x19: ffff800088ea3d90 x18: 0000000000000001 x17: 0000000000000000 x16: 0000000000000000 x15: 000000002002ffaf x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: ffff8000815849c0 x9 : ffff8000815b49c0 x8 : 0000000000000000 x7 : 000000000000003f x6 : 0000000000000000 x5 : 00000000000007e0 x4 : fff07ffffd239000 x3 : fbf00000062bc900 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 00000000fffffdef Call trace: sock_sendmsg_nosec net/socket.c:733 [inline] __sock_sendmsg+0x5c/0x60 net/socket.c:745 ____sys_sendmsg+0x274/0x2ac net/socket.c:2597 ___sys_sendmsg+0xac/0x100 net/socket.c:2651 __sys_sendmsg+0x84/0xe0 net/socket.c:2680 __do_sys_sendmsg net/socket.c:2689 [inline] __se_sys_sendmsg net/socket.c:2687 [inline] __arm64_sys_sendmsg+0x24/0x30 net/socket.c:2687 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x48/0x110 arch/arm64/kernel/syscall.c:49 el0_svc_common.constprop.0+0x40/0xe0 arch/arm64/kernel/syscall.c:132 do_el0_svc+0x1c/0x28 arch/arm64/kernel/syscall.c:151 el0_svc+0x34/0xec arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x100/0x12c arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x19c/0x1a0 arch/arm64/kernel/entry.S:598 Code: f9404463 d63f0060 3108441f 54fffe81 (d4210000) ---[ end trace 0000000000000000 ]--- Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Reported-by: syzbot+58c03971700330ce14d8@syzkaller.appspotmail.com Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20240821030744.320934-1-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-29net/ipv4: net: prefer strscpy over strcpyHongbo Li
The deprecated helper strcpy() performs no bounds checking on the destination buffer. This could result in linear overflows beyond the end of the buffer, leading to all kinds of misbehaviors. The safe replacement is strscpy() [1]. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy [1] Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://patch.msgid.link/20240828123224.3697672-7-lihongbo22@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/faraday/ftgmac100.c 4186c8d9e6af ("net: ftgmac100: Ensure tx descriptor updates are visible") e24a6c874601 ("net: ftgmac100: Get link speed and duplex for NC-SI") https://lore.kernel.org/0b851ec5-f91d-4dd3-99da-e81b98c9ed28@kernel.org net/ipv4/tcp.c bac76cf89816 ("tcp: fix forever orphan socket caused by tcp_abort") edefba66d929 ("tcp: rstreason: introduce SK_RST_REASON_TCP_STATE for active reset") https://lore.kernel.org/20240828112207.5c199d41@canb.auug.org.au No adjacent changes. Link: https://patch.msgid.link/20240829130829.39148-1-pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-28tcp: annotate data-races around tcptw->tw_rcv_nxtEric Dumazet
No lock protects tcp tw fields. tcptw->tw_rcv_nxt can be changed from twsk_rcv_nxt_update() while other threads might read this field. Add READ_ONCE()/WRITE_ONCE() annotations, and make sure tcp_timewait_state_process() reads tcptw->tw_rcv_nxt only once. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20240827015250.3509197-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-28tcp: remove volatile qualifier on tw_substateEric Dumazet
Using a volatile qualifier for a specific struct field is unusual. Use instead READ_ONCE()/WRITE_ONCE() where necessary. tcp_timewait_state_process() can change tw_substate while other threads are reading this field. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20240827015250.3509197-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-27tcp: fix forever orphan socket caused by tcp_abortXueming Feng
We have some problem closing zero-window fin-wait-1 tcp sockets in our environment. This patch come from the investigation. Previously tcp_abort only sends out reset and calls tcp_done when the socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only purging the write queue, but not close the socket and left it to the timer. While purging the write queue, tp->packets_out and sk->sk_write_queue is cleared along the way. However tcp_retransmit_timer have early return based on !tp->packets_out and tcp_probe_timer have early return based on !sk->sk_write_queue. This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched and socket not being killed by the timers, converting a zero-windowed orphan into a forever orphan. This patch removes the SOCK_DEAD check in tcp_abort, making it send reset to peer and close the socket accordingly. Preventing the timer-less orphan from happening. According to Lorenzo's email in the v1 thread, the check was there to prevent force-closing the same socket twice. That situation is handled by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is already closed. The -ENOENT code comes from the associate patch Lorenzo made for iproute2-ss; link attached below, which also conform to RFC 9293. At the end of the patch, tcp_write_queue_purge(sk) is removed because it was already called in tcp_done_with_error(). p.s. This is the same patch with v2. Resent due to mis-labeled "changes requested" on patchwork.kernel.org. Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/ Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.") Signed-off-by: Xueming Feng <kuro@kuroa.me> Tested-by: Lorenzo Colitti <lorenzo@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26net/ipv4: fix macro definition sk_for_each_bound_bhashHongbo Li
The macro sk_for_each_bound_bhash accepts a parameter __sk, but it was not used, rather the sk2 is directly used, so we replace the sk2 with __sk in macro. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://patch.msgid.link/20240823070453.3327832-1-lihongbo22@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() processJason Xing
We found that one close-wait socket was reset by the other side due to a new connection reusing the same port which is beyond our expectation, so we have to investigate the underlying reason. The following experiment is conducted in the test environment. We limit the port range from 40000 to 40010 and delay the time to close() after receiving a fin from the active close side, which can help us easily reproduce like what happened in production. Here are three connections captured by tcpdump: 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965525191 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 2769915070 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 127.0.0.1.40002 > 127.0.0.1.9999: Flags [F.], seq 1, ack 1 // a few seconds later, within 60 seconds 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 127.0.0.1.9999 > 127.0.0.1.40002: Flags [.], ack 2 127.0.0.1.40002 > 127.0.0.1.9999: Flags [R], seq 2965525193 // later, very quickly 127.0.0.1.40002 > 127.0.0.1.9999: Flags [S], seq 2965590730 127.0.0.1.9999 > 127.0.0.1.40002: Flags [S.], seq 3120990805 127.0.0.1.40002 > 127.0.0.1.9999: Flags [.], ack 1 As we can see, the first flow is reset because: 1) client starts a new connection, I mean, the second one 2) client tries to find a suitable port which is a timewait socket (its state is timewait, substate is fin_wait2) 3) client occupies that timewait port to send a SYN 4) server finds a corresponding close-wait socket in ehash table, then replies with a challenge ack 5) client sends an RST to terminate this old close-wait socket. I don't think the port selection algo can choose a FIN_WAIT2 socket when we turn on tcp_tw_reuse because on the server side there remain unread data. In some cases, if one side haven't call close() yet, we should not consider it as expendable and treat it at will. Even though, sometimes, the server isn't able to call close() as soon as possible like what we expect, it can not be terminated easily, especially due to a second unrelated connection happening. After this patch, we can see the expected failure if we start a connection when all the ports are occupied in fin_wait2 state: "Ncat: Cannot assign requested address." Reported-by: Jade Dong <jadedong@tencent.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240823001152.31004-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-26Merge tag 'nf-next-24-08-23' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next: Patch #1 fix checksum calculation in nfnetlink_queue with SCTP, segment GSO packet since skb_zerocopy() does not support GSO_BY_FRAGS, from Antonio Ojea. Patch #2 extend nfnetlink_queue coverage to handle SCTP packets, from Antonio Ojea. Patch #3 uses consume_skb() instead of kfree_skb() in nfnetlink, from Donald Hunter. Patch #4 adds a dedicate commit list for sets to speed up intra-transaction lookups, from Florian Westphal. Patch #5 skips removal of element from abort path for the pipapo backend, ditching the shadow copy of this datastructure is sufficient. Patch #6 moves nf_ct_netns_get() out of nf_conncount_init() to let users of conncoiunt decide when to enable conntrack, this is needed by openvswitch, from Xin Long. Patch #7 pass context to all nft_parse_register_load() in preparation for the next patch. Patches #8 and #9 reject loads from uninitialized registers from control plane to remove register initialization from datapath. From Florian Westphal. * tag 'nf-next-24-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: don't initialize registers in nft_do_chain() netfilter: nf_tables: allow loads only when register is initialized netfilter: nf_tables: pass context structure to nft_parse_register_load netfilter: move nf_ct_netns_get out of nf_conncount_init netfilter: nf_tables: do not remove elements if set backend implements .abort netfilter: nf_tables: store new sets in dedicated list netfilter: nfnetlink: convert kfree_skb to consume_skb selftests: netfilter: nft_queue.sh: sctp coverage netfilter: nfnetlink_queue: unbreak SCTP traffic ==================== Link: https://patch.msgid.link/20240822221939.157858-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-23net: nexthop: delete redundant judgment statementsLi Zetao
The initial value of err is -ENOBUFS, and err is guaranteed to be less than 0 before all goto errout. Therefore, on the error path of errout, there is no need to repeatedly judge that err is less than 0, and delete redundant judgments to make the code more concise. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23ipmr: delete redundant judgment statementsLi Zetao
The initial value of err is -ENOBUFS, and err is guaranteed to be less than 0 before all goto errout. Therefore, on the error path of errout, there is no need to repeatedly judge that err is less than 0, and delete redundant judgments to make the code more concise. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-23ipv4: delete redundant judgment statementsLi Zetao
The initial value of err is -ENOBUFS, and err is guaranteed to be less than 0 before all goto errout. Therefore, on the error path of errout, there is no need to repeatedly judge that err is less than 0, and delete redundant judgments to make the code more concise. Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.h c948c0973df5 ("bnxt_en: Don't clear ntuple filters and rss contexts during ethtool ops") f2878cdeb754 ("bnxt_en: Add support to call FW to update a VNIC") Link: https://patch.msgid.link/20240822210125.1542769-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits when using hintsIdo Schimmel
Unmask the upper DSCP bits when performing source validation and routing a packet using the same route from a previously processed packet (hint). In the future, this will allow us to perform the FIB lookup that is performed as part of source validation according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-13-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: udp: Unmask upper DSCP bits during early demuxIdo Schimmel
Unmask the upper DSCP bits when performing source validation for multicast packets during early demux. In the future, this will allow us to perform the FIB lookup which is performed as part of source validation according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-12-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: icmp: Pass full DS field to ip_route_input()Ido Schimmel
Align the ICMP code to other callers of ip_route_input() and pass the full DS field. In the future this will allow us to perform a route lookup according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-11-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits in RTM_GETROUTE input route lookupIdo Schimmel
Unmask the upper DSCP bits when looking up an input route via the RTM_GETROUTE netlink message so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-10-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits in input route lookupIdo Schimmel
Unmask the upper DSCP bits in input route lookup so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-9-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits in fib_compute_spec_dst()Ido Schimmel
As explained in commit 35ebf65e851c ("ipv4: Create and use fib_compute_spec_dst() helper."), the function is used - for example - to determine the source address for an ICMP reply. If we are responding to a multicast or broadcast packet, the source address is set to the source address that we would use if we were to send a packet to the unicast source of the original packet. This address is determined by performing a FIB lookup and using the preferred source address of the resulting route. Unmask the upper DSCP bits of the DS field of the packet that triggered the reply so that in the future the FIB lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-8-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: ipmr: Unmask upper DSCP bits in ipmr_rt_fib_lookup()Ido Schimmel
Unmask the upper DSCP bits when calling ipmr_fib_lookup() so that in the future it could perform the FIB lookup according to the full DSCP value. Note that ipmr_fib_lookup() performs a FIB rule lookup (returning the relevant routing table) and that IPv4 multicast FIB rules do not support matching on TOS / DSCP. However, it is still worth unmasking the upper DSCP bits in case support for DSCP matching is ever added. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-7-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22netfilter: nft_fib: Unmask upper DSCP bitsIdo Schimmel
In a similar fashion to the iptables rpfilter match, unmask the upper DSCP bits of the DS field of the currently tested packet so that in the future the FIB lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-6-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22netfilter: rpfilter: Unmask upper DSCP bitsIdo Schimmel
The rpfilter match performs a reverse path filter test on a packet by performing a FIB lookup with the source and destination addresses swapped. Unmask the upper DSCP bits of the DS field of the tested packet so that in the future the FIB lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits when constructing the Record Route optionIdo Schimmel
The Record Route IP option records the addresses of the routers that routed the packet. In the case of forwarded packets, the kernel performs a route lookup via fib_lookup() and fills in the preferred source address of the matched route. Unmask the upper DSCP bits when performing the lookup so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22ipv4: Unmask upper DSCP bits in NETLINK_FIB_LOOKUP familyIdo Schimmel
The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB lookup according to user provided parameters and communicate the result back to user space. Unmask the upper DSCP bits of the user-provided DS field before invoking the IPv4 FIB lookup API so that in the future the lookup could be performed according to the full DSCP value. No functional changes intended since the upper DSCP bits are masked when comparing against the TOS selectors in FIB rules and routes. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240821125251.1571445-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR including important fixes (from bpf-next point of view): commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp") commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map") No conflicts. Adjacent changes in: include/linux/bpf_verifier.h kernel/bpf/verifier.c tools/testing/selftests/bpf/Makefile Link: https://lore.kernel.org/bpf/20240813234307.82773-1-alexei.starovoitov@gmail.com/ Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-21udp: fix receiving fraglist GSO packetsFelix Fietkau
When assembling fraglist GSO packets, udp4_gro_complete does not set skb->csum_start, which makes the extra validation in __udp_gso_segment fail. Fixes: 89add40066f9 ("net: drop bad gso csum_start and offset in virtio_net_hdr") Signed-off-by: Felix Fietkau <nbd@nbd.name> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240819150621.59833-1-nbd@nbd.name Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20l2tp: use skb_queue_purge in l2tp_ip_destroy_sockJames Chapman
Recent commit ed8ebee6def7 ("l2tp: have l2tp_ip_destroy_sock use ip_flush_pending_frames") was incorrect in that l2tp_ip does not use socket cork and ip_flush_pending_frames is for sockets that do. Use __skb_queue_purge instead and remove the unnecessary lock. Also unexport ip_flush_pending_frames since it was originally exported in commit 4ff8863419cd ("ipv4: export ip_flush_pending_frames") for l2tp and is not used by other modules. Suggested-by: xiyou.wangcong@gmail.com Signed-off-by: James Chapman <jchapman@katalix.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240819143333.3204957-1-jchapman@katalix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20ipv4: Centralize TOS matchingIdo Schimmel
The TOS field in the IPv4 flow information structure ('flowi4_tos') is matched by the kernel against the TOS selector in IPv4 rules and routes. The field is initialized differently by different call sites. Some treat it as DSCP (RFC 2474) and initialize all six DSCP bits, some treat it as RFC 1349 TOS and initialize it using RT_TOS() and some treat it as RFC 791 TOS and initialize it using IPTOS_RT_MASK. What is common to all these call sites is that they all initialize the lower three DSCP bits, which fits the TOS definition in the initial IPv4 specification (RFC 791). Therefore, the kernel only allows configuring IPv4 FIB rules that match on the lower three DSCP bits which are always guaranteed to be initialized by all call sites: # ip -4 rule add tos 0x1c table 100 # ip -4 rule add tos 0x3c table 100 Error: Invalid tos. While this works, it is unlikely to be very useful. RFC 791 that initially defined the TOS and IP precedence fields was updated by RFC 2474 over twenty five years ago where these fields were replaced by a single six bits DSCP field. Extending FIB rules to match on DSCP can be done by adding a new DSCP selector while maintaining the existing semantics of the TOS selector for applications that rely on that. A prerequisite for allowing FIB rules to match on DSCP is to adjust all the call sites to initialize the high order DSCP bits and remove their masking along the path to the core where the field is matched on. However, making this change alone will result in a behavior change. For example, a forwarded IPv4 packet with a DS field of 0xfc will no longer match a FIB rule that was configured with 'tos 0x1c'. This behavior change can be avoided by masking the upper three DSCP bits in 'flowi4_tos' before comparing it against the TOS selectors in FIB rules and routes. Implement the above by adding a new function that checks whether a given DSCP value matches the one specified in the IPv4 flow information structure and invoke it from the three places that currently match on 'flowi4_tos'. Use RT_TOS() for the masking of 'flowi4_tos' instead of IPTOS_RT_MASK since the latter is not uAPI and we should be able to remove it at some point. Include <linux/ip.h> in <linux/in_route.h> since the former defines IPTOS_TOS_MASK which is used in the definition of RT_TOS() in <linux/in_route.h>. No regressions in FIB tests: # ./fib_tests.sh [...] Tests passed: 218 Tests failed: 0 And FIB rule tests: # ./fib_rule_tests.sh [...] Tests passed: 116 Tests failed: 0 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-20netfilter: nft_fib: Mask upper DSCP bits before FIB lookupIdo Schimmel
As part of its functionality, the nftables FIB expression module performs a FIB lookup, but unlike other users of the FIB lookup API, it does so without masking the upper DSCP bits. In particular, this differs from the equivalent iptables match ("rpfilter") that does mask the upper DSCP bits before the FIB lookup. Align the module to other users of the FIB lookup API and mask the upper DSCP bits using IPTOS_RT_MASK before the lookup. No regressions in nft_fib.sh: # ./nft_fib.sh PASS: fib expression did not cause unwanted packet drops PASS: fib expression did drop packets for 1.1.1.1 PASS: fib expression did drop packets for 1c3::c01d PASS: fib expression forward check with policy based routing Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-20ipv4: Mask upper DSCP bits and ECN bits in NETLINK_FIB_LOOKUP familyIdo Schimmel
The NETLINK_FIB_LOOKUP netlink family can be used to perform a FIB lookup according to user provided parameters and communicate the result back to user space. However, unlike other users of the FIB lookup API, the upper DSCP bits and the ECN bits of the DS field are not masked, which can result in the wrong result being returned. Solve this by masking the upper DSCP bits and the ECN bits using IPTOS_RT_MASK. The structure that communicates the request and the response is not exported to user space, so it is unlikely that this netlink family is actually in use [1]. [1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-20netfilter: nf_tables: pass context structure to nft_parse_register_loadFlorian Westphal
Mechanical transformation, no logical changes intended. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-19tcp_metrics: use netlink policy for IPv6 addr len validationJakub Kicinski
Use the netlink policy to validate IPv6 address length. Destination address currently has policy for max len set, and source has no policy validation. In both cases the code does the real check. With correct policy check the code can be removed. Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20240816212245.467745-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-19tcp: prevent concurrent execution of tcp_sk_exit_batchFlorian Westphal
Its possible that two threads call tcp_sk_exit_batch() concurrently, once from the cleanup_net workqueue, once from a task that failed to clone a new netns. In the latter case, error unwinding calls the exit handlers in reverse order for the 'failed' netns. tcp_sk_exit_batch() calls tcp_twsk_purge(). Problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"), this function picks up twsk in any dying netns, not just the one passed in via exit_batch list. This means that the error unwind of setup_net() can "steal" and destroy timewait sockets belonging to the exiting netns. This allows the netns exit worker to proceed to call WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount)); without the expected 1 -> 0 transition, which then splats. At same time, error unwind path that is also running inet_twsk_purge() will splat as well: WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 ... refcount_dec include/linux/refcount.h:351 [inline] inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178 setup_net+0x714/0xb40 net/core/net_namespace.c:375 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110 ... because refcount_dec() of tw_refcount unexpectedly dropped to 0. This doesn't seem like an actual bug (no tw sockets got lost and I don't see a use-after-free) but as erroneous trigger of debug check. Add a mutex to force strict ordering: the task that calls tcp_twsk_purge() blocks other task from doing final _dec_and_test before mutex-owner has removed all tw sockets of dying netns. Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.") Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/ Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/ Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240812222857.29837-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ip: Move INFINITY_LIFE_TIME to addrconf.h.Kuniyuki Iwashima
INFINITY_LIFE_TIME is the common value used in IPv4 and IPv6 but defined in both .c files. Also, 0xffffffff used in addrconf_timeout_fixup() is INFINITY_LIFE_TIME. Let's move INFINITY_LIFE_TIME's definition to addrconf.h Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ipv4: Initialise ifa->hash in inet_alloc_ifa().Kuniyuki Iwashima
Whenever ifa is allocated, we call INIT_HLIST_NODE(&ifa->hash). Let's move it to inet_alloc_ifa(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ipv4: Remove redundant !ifa->ifa_dev check.Kuniyuki Iwashima
Now, ifa_dev is only set in inet_alloc_ifa() and never NULL after ifa gets visible. Let's remove the unneeded NULL check for ifa->ifa_dev. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ipv4: Set ifa->ifa_dev in inet_alloc_ifa().Kuniyuki Iwashima
When a new IPv4 address is assigned via ioctl(SIOCSIFADDR), inet_set_ifa() sets ifa->ifa_dev if it's different from in_dev passed as an argument. In this case, ifa is always a newly allocated object, and ifa->ifa_dev is NULL. inet_set_ifa() can be called for an existing reused ifa, then, this check is always false. Let's set ifa_dev in inet_alloc_ifa() and remove the check in inet_set_ifa(). Now, inet_alloc_ifa() is symmetric with inet_rcu_free_ifa(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15ipv4: Check !in_dev earlier for ioctl(SIOCSIFADDR).Kuniyuki Iwashima
dev->ip_ptr could be NULL if we set an invalid MTU. Even then, if we issue ioctl(SIOCSIFADDR) for a new IPv4 address, devinet_ioctl() allocates struct in_ifaddr and fails later in inet_set_ifa() because in_dev is NULL. Let's move the check earlier. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240809235406.50187-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml c25504a0ba36 ("dt-bindings: net: fsl,qoriq-mc-dpmac: add missed property phys") be034ee6c33d ("dt-bindings: net: fsl,qoriq-mc-dpmac: using unevaluatedProperties") https://lore.kernel.org/20240815110934.56ae623a@canb.auug.org.au drivers/net/dsa/vitesse-vsc73xx-core.c 5b9eebc2c7a5 ("net: dsa: vsc73xx: pass value in phy_write operation") fa63c6434b6f ("net: dsa: vsc73xx: check busy flag in MDIO operations") 2524d6c28bdc ("net: dsa: vsc73xx: use defined values in phy operations") https://lore.kernel.org/20240813104039.429b9fe6@canb.auug.org.au Resolve by using FIELD_PREP(), Stephen's resolution is simpler. Adjacent changes: net/vmw_vsock/af_vsock.c 69139d2919dd ("vsock: fix recursive ->recvmsg calls") 744500d81f81 ("vsock: add support for SIOCOUTQ ioctl") Link: https://patch.msgid.link/20240815141149.33862-1-pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-14tcp: Update window clamping conditionSubash Abhinov Kasiviswanathan
This patch is based on the discussions between Neal Cardwell and Eric Dumazet in the link https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/ It was correctly pointed out that tp->window_clamp would not be updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if (copied <= tp->rcvq_space.space). While it is expected for most setups to leave the sysctl enabled, the latter condition may not end up hitting depending on the TCP receive queue size and the pattern of arriving data. The updated check should be hit only on initial MSS update from TCP_MIN_MSS to measured MSS value and subsequently if there was an update to a larger value. Fixes: 05f76b2d634e ("tcp: Adjust clamping window for applications specifying SO_RCVBUF") Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com> Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-12net: nexthop: Increase weight to u16Petr Machata
In CLOS networks, as link failures occur at various points in the network, ECMP weights of the involved nodes are adjusted to compensate. With high fan-out of the involved nodes, and overall high number of nodes, a (non-)ECMP weight ratio that we would like to configure does not fit into 8 bits. Instead of, say, 255:254, we might like to configure something like 1000:999. For these deployments, the 8-bit weight may not be enough. To that end, in this patch increase the next hop weight from u8 to u16. Increasing the width of an integral type can be tricky, because while the code still compiles, the types may not check out anymore, and numerical errors come up. To prevent this, the conversion was done in two steps. First the type was changed from u8 to a single-member structure, which invalidated all uses of the field. This allowed going through them one by one and audit for type correctness. Then the structure was replaced with a vanilla u16 again. This should ensure that no place was missed. The UAPI for configuring nexthop group members is that an attribute NHA_GROUP carries an array of struct nexthop_grp entries: struct nexthop_grp { __u32 id; /* nexthop id - must exist */ __u8 weight; /* weight of this nexthop */ __u8 resvd1; __u16 resvd2; }; The field resvd1 is currently validated and required to be zero. We can lift this requirement and carry high-order bits of the weight in the reserved field: struct nexthop_grp { __u32 id; /* nexthop id - must exist */ __u8 weight; /* weight of this nexthop */ __u8 weight_high; __u16 resvd2; }; Keeping the fields split this way was chosen in case an existing userspace makes assumptions about the width of the weight field, and to sidestep any endianness issues. The weight field is currently encoded as the weight value minus one, because weight of 0 is invalid. This same trick is impossible for the new weight_high field, because zero must mean actual zero. With this in place: - Old userspace is guaranteed to carry weight_high of 0, therefore configuring 8-bit weights as appropriate. When dumping nexthops with 16-bit weight, it would only show the lower 8 bits. But configuring such nexthops implies existence of userspace aware of the extension in the first place. - New userspace talking to an old kernel will work as long as it only attempts to configure 8-bit weights, where the high-order bits are zero. Old kernel will bounce attempts at configuring >8-bit weights. Renaming reserved fields as they are allocated for some purpose is commonly done in Linux. Whoever touches a reserved field is doing so at their own risk. nexthop_grp::resvd1 in particular is currently used by at least strace, however they carry an own copy of UAPI headers, and the conversion should be trivial. A helper is provided for decoding the weight out of the two fields. Forcing a conversion seems preferable to bending backwards and introducing anonymous unions or whatever. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/483e2fcf4beb0d9135d62e7d27b46fa2685479d4.1723036486.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-12net: nexthop: Add flag to assert that NHGRP reserved fields are zeroPetr Machata
There are many unpatched kernel versions out there that do not initialize the reserved fields of struct nexthop_grp. The issue with that is that if those fields were to be used for some end (i.e. stop being reserved), old kernels would still keep sending random data through the field, and a new userspace could not rely on the value. In this patch, use the existing NHA_OP_FLAGS, which is currently inbound only, to carry flags back to the userspace. Add a flag to indicate that the reserved fields in struct nexthop_grp are zeroed before dumping. This is reliant on the actual fix from commit 6d745cd0e972 ("net: nexthop: Initialize all fields in dumped nexthops"). Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/21037748d4f9d8ff486151f4c09083bcf12d5df8.1723036486.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09udp: Fall back to software USO if IPv6 extension headers are presentJakub Sitnicki
In commit 10154dbded6d ("udp: Allow GSO transmit from devices with no checksum offload") we have intentionally allowed UDP GSO packets marked CHECKSUM_NONE to pass to the GSO stack, so that they can be segmented and checksummed by a software fallback when the egress device lacks these features. What was not taken into consideration is that a CHECKSUM_NONE skb can be handed over to the GSO stack also when the egress device advertises the tx-udp-segmentation / NETIF_F_GSO_UDP_L4 feature. This will happen when there are IPv6 extension headers present, which we check for in __ip6_append_data(). Syzbot has discovered this scenario, producing a warning as below: ip6tnl0: caps=(0x00000006401d7869, 0x00000006401d7869) WARNING: CPU: 0 PID: 5112 at net/core/dev.c:3293 skb_warn_bad_offload+0x166/0x1a0 net/core/dev.c:3291 Modules linked in: CPU: 0 PID: 5112 Comm: syz-executor391 Not tainted 6.10.0-rc7-syzkaller-01603-g80ab5445da62 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/07/2024 RIP: 0010:skb_warn_bad_offload+0x166/0x1a0 net/core/dev.c:3291 [...] Call Trace: <TASK> __skb_gso_segment+0x3be/0x4c0 net/core/gso.c:127 skb_gso_segment include/net/gso.h:83 [inline] validate_xmit_skb+0x585/0x1120 net/core/dev.c:3661 __dev_queue_xmit+0x17a4/0x3e90 net/core/dev.c:4415 neigh_output include/net/neighbour.h:542 [inline] ip6_finish_output2+0xffa/0x1680 net/ipv6/ip6_output.c:137 ip6_finish_output+0x41e/0x810 net/ipv6/ip6_output.c:222 ip6_send_skb+0x112/0x230 net/ipv6/ip6_output.c:1958 udp_v6_send_skb+0xbf5/0x1870 net/ipv6/udp.c:1292 udpv6_sendmsg+0x23b3/0x3270 net/ipv6/udp.c:1588 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0xef/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2585 ___sys_sendmsg net/socket.c:2639 [inline] __sys_sendmmsg+0x3b2/0x740 net/socket.c:2725 __do_sys_sendmmsg net/socket.c:2754 [inline] __se_sys_sendmmsg net/socket.c:2751 [inline] __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2751 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [...] </TASK> We are hitting the bad offload warning because when an egress device is capable of handling segmentation offload requested by skb_shinfo(skb)->gso_type, the chain of gso_segment callbacks won't produce any segment skbs and return NULL. See the skb_gso_ok() branch in {__udp,tcp,sctp}_gso_segment helpers. To fix it, force a fallback to software USO when processing a packet with IPv6 extension headers, since we don't know if these can checksummed by all devices which offer USO. Fixes: 10154dbded6d ("udp: Allow GSO transmit from devices with no checksum offload") Reported-by: syzbot+e15b7e15b8a751a91d9a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000e1609a061d5330ce@google.com/ Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://patch.msgid.link/20240808-udp-gso-egress-from-tunnel-v4-2-f5c5b4149ab9@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Link: https://patch.msgid.link/20240808170148.3629934-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-07tcp: rstreason: let it work finally in tcp_send_active_reset()Jason Xing
Now it's time to let it work by using the 'reason' parameter in the trace world :) Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-07tcp: rstreason: introduce SK_RST_REASON_TCP_DISCONNECT_WITH_DATA for active ↵Jason Xing
reset When user tries to disconnect a socket and there are more data written into tcp write queue, we should tell users about this reset reason. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-07tcp: rstreason: introduce SK_RST_REASON_TCP_KEEPALIVE_TIMEOUT for active resetJason Xing
Introducing this to show the users the reason of keepalive timeout. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-07tcp: rstreason: introduce SK_RST_REASON_TCP_STATE for active resetJason Xing
Introducing a new type TCP_STATE to handle some reset conditions appearing in RFC 793 due to its socket state. Actually, we can look into RFC 9293 which has no discrepancy about this part. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-07tcp: rstreason: introduce SK_RST_REASON_TCP_ABORT_ON_MEMORY for active resetJason Xing
Introducing a new type TCP_ABORT_ON_MEMORY for tcp reset reason to handle out of memory case. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-07tcp: rstreason: introduce SK_RST_REASON_TCP_ABORT_ON_LINGER for active resetJason Xing
Introducing a new type TCP_ABORT_ON_LINGER for tcp reset reason to handle negative linger value case. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>