Age | Commit message (Collapse) | Author |
|
This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.
The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.
When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data.
And in particular, an ACK only serves the intended purpose of allowing
the remote congestion control to grow the congestion window quickly if
the ACK is ACKing or SACKing new data.
The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.
Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This field can be read or written without socket lock being held.
Add annotations to avoid load-store tearing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
SO_MAX_PACING_RATE setsockopt() does not need to hold
the socket lock, because sk->sk_pacing_rate readers
can run fine if the value is changed by other threads,
after adding READ_ONCE() accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While BPF allows to set icsk->->icsk_delack_max
and/or icsk->icsk_rto_min, we have an ip route
attribute (RTAX_RTO_MIN) to be able to tune rto_min,
but nothing to consequently adjust max delayed ack,
which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}).
This makes RTAX_RTO_MIN of almost no practical use,
unless customers are in big trouble.
Modern days datacenter communications want to set
rto_min to ~5 ms, and the max delayed ack one jiffie
smaller to avoid spurious retransmits.
After this patch, an "rto_min 5" route attribute will
effectively lower max delayed ack timers to 4 ms.
Note in the following ss output, "rto:6 ... ato:4"
$ ss -temoi dst XXXXXX
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 [2002:a05:6608:295::]:52950 [2002:a05:6608:297::]:41597
ino:255134 sk:1001 <->
skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack
cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500
rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121
bytes_received:54823120 segs_out:1370582 segs_in:1370580
data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps
pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579
busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920
rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536
While we could argue this patch fixes a bug with RTAX_RTO_MIN,
I do not add a Fixes: tag, so that we can soak it a bit before
asking backports to stable branches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This idea came after a particular workload requested
the quickack attribute set on routes, and a performance
drop was noticed for large bulk transfers.
For high throughput flows, it is best to use one cpu
running the user thread issuing socket system calls,
and a separate cpu to process incoming packets from BH context.
(With TSO/GRO, bottleneck is usually the 'user' cpu)
Problem is the user thread can spend a lot of time while holding
the socket lock, forcing BH handler to queue most of incoming
packets in the socket backlog.
Whenever the user thread releases the socket lock, it must first
process all accumulated packets in the backlog, potentially
adding latency spikes. Due to flood mitigation, having too many
packets in the backlog increases chance of unexpected drops.
Backlog processing unfortunately shifts a fair amount of cpu cycles
from the BH cpu to the 'user' cpu, thus reducing max throughput.
This patch takes advantage of the backlog processing,
and the fact that ACK are mostly cumulative.
The idea is to detect we are in the backlog processing
and defer all eligible ACK into a single one,
sent from tcp_release_cb().
This saves cpu cycles on both sides, and network resources.
Performance of a single TCP flow on a 200Gbit NIC:
- Throughput is increased by 20% (100Gbit -> 120Gbit).
- Number of generated ACK per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.
Benchmark context:
- Regular netperf TCP_STREAM (no zerocopy)
- Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids)
- MAX_SKB_FRAGS = 17 (~60KB per GRO packet)
This feature is guarded by a new sysctl, and enabled by default:
/proc/sys/net/ipv4/tcp_backlog_ack_defer
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This partially reverts c3f9b01849ef ("tcp: tcp_release_cb()
should release socket ownership").
prequeue has been removed by Florian in commit e7942d0633c4
("tcp: remove prequeue support")
__tcp_checksum_complete_user() being gone, we no longer
have to release socket ownership in tcp_release_cb().
This is a prereq for third patch in the series
("net: call prot->release_cb() when processing backlog").
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Every time sk->sk_forward_alloc is read locklessly,
add a READ_ONCE().
Add sk_forward_alloc_add() helper to centralize updates,
to reduce number of WRITE_ONCE().
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Enabling BIG TCP on a low end platform apparently increased
chances of getting flows locked on one busy TX queue.
A similar problem was handled in commit 9b462d02d6dd
("tcp: TCP Small Queues and strange attractors"),
but the strategy worked for either bulk flows,
or 'large enough' RPC. BIG TCP changed how large
RPC needed to be to enable the work around:
If RPC fits in a single skb, TSQ never triggers.
Root cause for the problem is a busy TX queue,
with delayed TX completions.
This patch changes how we set skb->ooo_okay to detect
the case TX completion was not done, but incoming ACK
already was processed and emptied rtx queue.
Update the comment to explain the tricky details.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230817182353.2523746-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For now, skb will be dropped when no memory, which makes client keep
retrans util timeout and it's not friendly to the users.
In this patch, we reply an ACK with zero-window in this case to update
the snd_wnd of the sender to 0. Therefore, the sender won't timeout the
connection and will probe the zero-window with the retransmits.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCP socket saves the minimum required header length in tcp_header_len
of struct tcp_sock, and later the value is used in __tcp_fast_path_on()
to generate a part of TCP header in tcp_sock(sk)->pred_flags.
In tcp_rcv_established(), if the incoming packet has the same pattern
with pred_flags, we enter the fast path and skip full option parsing.
The MD5 option is parsed in tcp_v[46]_rcv(), so we need not parse it
again later in tcp_rcv_established() unless other options exist. We
add TCPOLEN_MD5SIG_ALIGNED to tcp_header_len in two paths to avoid the
slow path.
For passive open connections with MD5, we add TCPOLEN_MD5SIG_ALIGNED
to tcp_header_len in tcp_create_openreq_child() after 3WHS.
On the other hand, we do it in tcp_connect_init() for active open
connections. However, the value is overwritten while processing
SYN+ACK or crossed SYN in tcp_rcv_synsent_state_process().
These two cases will have the wrong value in pred_flags and never go
into the fast path.
We could update tcp_header_len in tcp_rcv_synsent_state_process(), but
a test with slightly modified netperf which uses MD5 for each flow shows
that the slow path is actually a bit faster than the fast path.
On c5.4xlarge EC2 instance (16 vCPU, 32 GiB mem)
$ for i in {1..10}; do
./super_netperf $(nproc) -H localhost -l 10 -- -m 256 -M 256;
done
Avg of 10
* 36e68eadd303 : 10.376 Gbps
* all fast path : 10.374 Gbps (patch v2, See Link)
* all slow path : 10.394 Gbps
The header prediction is not worth adding complexity for MD5, so let's
disable it for MD5.
Link: https://lore.kernel.org/netdev/20230803042214.38309-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230803224552.69398-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
TCP request sockets are lockless, tcp_rsk(req)->ts_recent
can change while being read by another cpu as syzbot noticed.
This is harmless, but we should annotate the known races.
Note that tcp_check_req() changes req->ts_recent a bit early,
we might change this in the future.
BUG: KCSAN: data-race in tcp_check_req / tcp_check_req
write to 0xffff88813c8afb84 of 4 bytes by interrupt on cpu 1:
tcp_check_req+0x694/0xc70 net/ipv4/tcp_minisocks.c:762
tcp_v4_rcv+0x12db/0x1b70 net/ipv4/tcp_ipv4.c:2071
ip_protocol_deliver_rcu+0x356/0x6d0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x13c/0x1a0 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:303 [inline]
ip_local_deliver+0xec/0x1c0 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:468 [inline]
ip_rcv_finish net/ipv4/ip_input.c:449 [inline]
NF_HOOK include/linux/netfilter.h:303 [inline]
ip_rcv+0x197/0x270 net/ipv4/ip_input.c:569
__netif_receive_skb_one_core net/core/dev.c:5493 [inline]
__netif_receive_skb+0x90/0x1b0 net/core/dev.c:5607
process_backlog+0x21f/0x380 net/core/dev.c:5935
__napi_poll+0x60/0x3b0 net/core/dev.c:6498
napi_poll net/core/dev.c:6565 [inline]
net_rx_action+0x32b/0x750 net/core/dev.c:6698
__do_softirq+0xc1/0x265 kernel/softirq.c:571
do_softirq+0x7e/0xb0 kernel/softirq.c:472
__local_bh_enable_ip+0x64/0x70 kernel/softirq.c:396
local_bh_enable+0x1f/0x20 include/linux/bottom_half.h:33
rcu_read_unlock_bh include/linux/rcupdate.h:843 [inline]
__dev_queue_xmit+0xabb/0x1d10 net/core/dev.c:4271
dev_queue_xmit include/linux/netdevice.h:3088 [inline]
neigh_hh_output include/net/neighbour.h:528 [inline]
neigh_output include/net/neighbour.h:542 [inline]
ip_finish_output2+0x700/0x840 net/ipv4/ip_output.c:229
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:317
NF_HOOK_COND include/linux/netfilter.h:292 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:431
dst_output include/net/dst.h:458 [inline]
ip_local_out net/ipv4/ip_output.c:126 [inline]
__ip_queue_xmit+0xa4d/0xa70 net/ipv4/ip_output.c:533
ip_queue_xmit+0x38/0x40 net/ipv4/ip_output.c:547
__tcp_transmit_skb+0x1194/0x16e0 net/ipv4/tcp_output.c:1399
tcp_transmit_skb net/ipv4/tcp_output.c:1417 [inline]
tcp_write_xmit+0x13ff/0x2fd0 net/ipv4/tcp_output.c:2693
__tcp_push_pending_frames+0x6a/0x1a0 net/ipv4/tcp_output.c:2877
tcp_push_pending_frames include/net/tcp.h:1952 [inline]
__tcp_sock_set_cork net/ipv4/tcp.c:3336 [inline]
tcp_sock_set_cork+0xe8/0x100 net/ipv4/tcp.c:3343
rds_tcp_xmit_path_complete+0x3b/0x40 net/rds/tcp_send.c:52
rds_send_xmit+0xf8d/0x1420 net/rds/send.c:422
rds_send_worker+0x42/0x1d0 net/rds/threads.c:200
process_one_work+0x3e6/0x750 kernel/workqueue.c:2408
worker_thread+0x5f2/0xa10 kernel/workqueue.c:2555
kthread+0x1d7/0x210 kernel/kthread.c:379
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
read to 0xffff88813c8afb84 of 4 bytes by interrupt on cpu 0:
tcp_check_req+0x32a/0xc70 net/ipv4/tcp_minisocks.c:622
tcp_v4_rcv+0x12db/0x1b70 net/ipv4/tcp_ipv4.c:2071
ip_protocol_deliver_rcu+0x356/0x6d0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x13c/0x1a0 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:303 [inline]
ip_local_deliver+0xec/0x1c0 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:468 [inline]
ip_rcv_finish net/ipv4/ip_input.c:449 [inline]
NF_HOOK include/linux/netfilter.h:303 [inline]
ip_rcv+0x197/0x270 net/ipv4/ip_input.c:569
__netif_receive_skb_one_core net/core/dev.c:5493 [inline]
__netif_receive_skb+0x90/0x1b0 net/core/dev.c:5607
process_backlog+0x21f/0x380 net/core/dev.c:5935
__napi_poll+0x60/0x3b0 net/core/dev.c:6498
napi_poll net/core/dev.c:6565 [inline]
net_rx_action+0x32b/0x750 net/core/dev.c:6698
__do_softirq+0xc1/0x265 kernel/softirq.c:571
run_ksoftirqd+0x17/0x20 kernel/softirq.c:939
smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164
kthread+0x1d7/0x210 kernel/kthread.c:379
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
value changed: 0x1cd237f1 -> 0x1cd237f2
Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230717144445.653164-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
TCP request sockets are lockless, some of their fields
can change while being read by another cpu as syzbot noticed.
This is usually harmless, but we should annotate the known
races.
This patch takes care of tcp_rsk(req)->txhash,
a separate one is needed for tcp_rsk(req)->ts_recent.
BUG: KCSAN: data-race in tcp_make_synack / tcp_rtx_synack
write to 0xffff8881362304bc of 4 bytes by task 32083 on cpu 1:
tcp_rtx_synack+0x9d/0x2a0 net/ipv4/tcp_output.c:4213
inet_rtx_syn_ack+0x38/0x80 net/ipv4/inet_connection_sock.c:880
tcp_check_req+0x379/0xc70 net/ipv4/tcp_minisocks.c:665
tcp_v6_rcv+0x125b/0x1b20 net/ipv6/tcp_ipv6.c:1673
ip6_protocol_deliver_rcu+0x92f/0xf30 net/ipv6/ip6_input.c:437
ip6_input_finish net/ipv6/ip6_input.c:482 [inline]
NF_HOOK include/linux/netfilter.h:303 [inline]
ip6_input+0xbd/0x1b0 net/ipv6/ip6_input.c:491
dst_input include/net/dst.h:468 [inline]
ip6_rcv_finish+0x1e2/0x2e0 net/ipv6/ip6_input.c:79
NF_HOOK include/linux/netfilter.h:303 [inline]
ipv6_rcv+0x74/0x150 net/ipv6/ip6_input.c:309
__netif_receive_skb_one_core net/core/dev.c:5452 [inline]
__netif_receive_skb+0x90/0x1b0 net/core/dev.c:5566
netif_receive_skb_internal net/core/dev.c:5652 [inline]
netif_receive_skb+0x4a/0x310 net/core/dev.c:5711
tun_rx_batched+0x3bf/0x400
tun_get_user+0x1d24/0x22b0 drivers/net/tun.c:1997
tun_chr_write_iter+0x18e/0x240 drivers/net/tun.c:2043
call_write_iter include/linux/fs.h:1871 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x4ab/0x7d0 fs/read_write.c:584
ksys_write+0xeb/0x1a0 fs/read_write.c:637
__do_sys_write fs/read_write.c:649 [inline]
__se_sys_write fs/read_write.c:646 [inline]
__x64_sys_write+0x42/0x50 fs/read_write.c:646
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff8881362304bc of 4 bytes by task 32078 on cpu 0:
tcp_make_synack+0x367/0xb40 net/ipv4/tcp_output.c:3663
tcp_v6_send_synack+0x72/0x420 net/ipv6/tcp_ipv6.c:544
tcp_conn_request+0x11a8/0x1560 net/ipv4/tcp_input.c:7059
tcp_v6_conn_request+0x13f/0x180 net/ipv6/tcp_ipv6.c:1175
tcp_rcv_state_process+0x156/0x1de0 net/ipv4/tcp_input.c:6494
tcp_v6_do_rcv+0x98a/0xb70 net/ipv6/tcp_ipv6.c:1509
tcp_v6_rcv+0x17b8/0x1b20 net/ipv6/tcp_ipv6.c:1735
ip6_protocol_deliver_rcu+0x92f/0xf30 net/ipv6/ip6_input.c:437
ip6_input_finish net/ipv6/ip6_input.c:482 [inline]
NF_HOOK include/linux/netfilter.h:303 [inline]
ip6_input+0xbd/0x1b0 net/ipv6/ip6_input.c:491
dst_input include/net/dst.h:468 [inline]
ip6_rcv_finish+0x1e2/0x2e0 net/ipv6/ip6_input.c:79
NF_HOOK include/linux/netfilter.h:303 [inline]
ipv6_rcv+0x74/0x150 net/ipv6/ip6_input.c:309
__netif_receive_skb_one_core net/core/dev.c:5452 [inline]
__netif_receive_skb+0x90/0x1b0 net/core/dev.c:5566
netif_receive_skb_internal net/core/dev.c:5652 [inline]
netif_receive_skb+0x4a/0x310 net/core/dev.c:5711
tun_rx_batched+0x3bf/0x400
tun_get_user+0x1d24/0x22b0 drivers/net/tun.c:1997
tun_chr_write_iter+0x18e/0x240 drivers/net/tun.c:2043
call_write_iter include/linux/fs.h:1871 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x4ab/0x7d0 fs/read_write.c:584
ksys_write+0xeb/0x1a0 fs/read_write.c:637
__do_sys_write fs/read_write.c:649 [inline]
__se_sys_write fs/read_write.c:646 [inline]
__x64_sys_write+0x42/0x50 fs/read_write.c:646
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x91d25731 -> 0xe79325cd
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 32078 Comm: syz-executor.4 Not tainted 6.5.0-rc1-syzkaller-00033-geb26cbb1a754 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
Fixes: 58d607d3e52f ("tcp: provide skb->hash to synack packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230717144445.653164-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.
To reproduce: Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()). This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].
As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.
The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender. This problem has previously been identified in
RFC 7323, appendix F [1].
The Linux kernel currently adheres to never shrinking the window.
In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session. A receive buffer full condition should instead result
in a zero window and an indefinite wait.
In practice, this problem is largely hidden for most flows. It
is not applicable to mice flows. Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.
But this problem does show up for other types of flows. Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time. In these cases, we directly
encounter the problem described in [1].
RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].
This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf). This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window
Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323
Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now all tcp_stream_alloc_skb() callers pass @size == 0, we can
remove this parameter.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now all skbs in write queue do not contain any payload in skb->head,
we can remove some dead code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp_send_syn_data() is the last component in TCP transmit
path to put payload in skb->head.
Switch it to use page frags, so that we can remove dead
code later.
This allows to put more payload than previous implementation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp_mtu_probe() is still copying payload from skbs in the write queue,
using skb_copy_bits(), ignoring potential errors.
Modern TCP stack wants to only deal with payload found in page frags,
as this is a prereq for TCPDirect (host stack might not have access
to the payload)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230607214113.1992947-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We can change tcp_sk() to propagate its argument const qualifier,
thanks to container_of_const().
We have two places where a const sock pointer has to be upgraded
to a write one. We have been using const qualifier for lockless
listeners to clearly identify points where writes could happen.
Add tcp_sk_rw() helper to better document these.
tcp_inbound_md5_hash(), __tcp_grow_window(), tcp_reset_check()
and tcp_rack_reo_wnd() get an additional const qualififer
for their @tp local variables.
smc_check_reset_syn_req() also needs a similar change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
net/wireless/nl80211.c
b27f07c50a73 ("wifi: nl80211: fix puncturing bitmap policy")
cbbaf2bb829b ("wifi: nl80211: add a command to enable/disable HW timestamping")
https://lore.kernel.org/all/20230314105421.3608efae@canb.auug.org.au
tools/testing/selftests/net/Makefile
62199e3f1658 ("selftests: net: Add VXLAN MDB test")
13715acf8ab5 ("selftest: Add test for bind() conflicts.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_poll() reads sk->sk_err without socket lock held/owned.
We should used READ_ONCE() here, and update writers
to use WRITE_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp_rtx_synack() now could be called in process context as explained in
0a375c822497 ("tcp: tcp_rtx_synack() can be called from process
context").
tcp_rtx_synack() might call tcp_make_synack(), which will touch per-CPU
variables with preemption enabled. This causes the following BUG:
BUG: using __this_cpu_add() in preemptible [00000000] code: ThriftIO1/5464
caller is tcp_make_synack+0x841/0xac0
Call Trace:
<TASK>
dump_stack_lvl+0x10d/0x1a0
check_preemption_disabled+0x104/0x110
tcp_make_synack+0x841/0xac0
tcp_v6_send_synack+0x5c/0x450
tcp_rtx_synack+0xeb/0x1f0
inet_rtx_syn_ack+0x34/0x60
tcp_check_req+0x3af/0x9e0
tcp_rcv_state_process+0x59b/0x2030
tcp_v6_do_rcv+0x5f5/0x700
release_sock+0x3a/0xf0
tcp_sendmsg+0x33/0x40
____sys_sendmsg+0x2f2/0x490
__sys_sendmsg+0x184/0x230
do_syscall_64+0x3d/0x90
Avoid calling __TCP_INC_STATS() with will touch per-cpu variables. Use
TCP_INC_STATS() which is safe to be called from context switch.
Fixes: 8336886f786f ("tcp: TCP Fast Open Server - support TFO listeners")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230308190745.780221-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
To do that, separate two scenarios:
- where it's the first MD5 key on the system, which means that enabling
of the static key may need to sleep;
- copying of an existing key from a listening socket to the request
socket upon receiving a signed TCP segment, where static key was
already enabled (when the key was added to the listening socket).
Now the life-time of the static branch for TCP-MD5 is until:
- last tcp_md5sig_info is destroyed
- last socket in time-wait state with MD5 key is closed.
Which means that after all sockets with TCP-MD5 keys are gone, the
system gets back the performance of disabled md5-key static branch.
While at here, provide static_key_fast_inc() helper that does ref
counter increment in atomic fashion (without grabbing cpus_read_lock()
on CONFIG_JUMP_LABEL=y). This is needed to add a new user for
a static_key when the caller controls the lifetime of another user.
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use try_cmpxchg() (instead of cmpxchg()) in a more readable way.
oval = smp_load_acquire(&sk->sk_tsq_flags);
do {
...
} while (!try_cmpxchg(&sk->sk_tsq_flags, &oval, nval));
Reduce indentation level.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221110190239.3531280-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
try_cmpxchg() is slighly more efficient (at least on x86),
and smp_load_acquire(&sk->sk_tsq_flags) could avoid a KCSAN report.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221110174829.3403442-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This commit fixes a bug in the tracking of max_packets_out and
is_cwnd_limited. This bug can cause the connection to fail to remember
that is_cwnd_limited is true, causing the connection to fail to grow
cwnd when it should, causing throughput to be lower than it should be.
The following event sequence is an example that triggers the bug:
(a) The connection is cwnd_limited, but packets_out is not at its
peak due to TSO deferral deciding not to send another skb yet.
In such cases the connection can advance max_packets_seq and set
tp->is_cwnd_limited to true and max_packets_out to a small
number.
(b) Then later in the round trip the connection is pacing-limited (not
cwnd-limited), and packets_out is larger. In such cases the
connection would raise max_packets_out to a bigger number but
(unexpectedly) flip tp->is_cwnd_limited from true to false.
This commit fixes that bug.
One straightforward fix would be to separately track (a) the next
window after max_packets_out reaches a maximum, and (b) the next
window after tp->is_cwnd_limited is set to true. But this would
require consuming an extra u32 sequence number.
Instead, to save space we track only the most important
information. Specifically, we track the strongest available signal of
the degree to which the cwnd is fully utilized:
(1) If the connection is cwnd-limited then we remember that fact for
the current window.
(2) If the connection not cwnd-limited then we track the maximum
number of outstanding packets in the current window.
In particular, note that the new logic cannot trigger the buggy
(a)/(b) sequence above because with the new logic a condition where
tp->packets_out > tp->max_packets_out can only trigger an update of
tp->is_cwnd_limited if tp->is_cwnd_limited is false.
This first showed up in a testing of a BBRv2 dev branch, but this
buggy behavior highlighted a general issue with the
tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
the proper rate for any TCP congestion control, including Reno or
CUBIC.
Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_[rw]mem_(max|default), they can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While reading these sysctl variables, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.
- .sysctl_rmem
- .sysctl_rwmem
- .sysctl_rmem_offset
- .sysctl_wmem_offset
- sysctl_tcp_rmem[1, 2]
- sysctl_tcp_wmem[1, 2]
- sysctl_decnet_rmem[1]
- sysctl_decnet_wmem[1]
- sysctl_tipc_rmem[1]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 4a41f453bedfd5e9cd040bad509d9da49feb3e2c.
This to-be-reverted commit was meant to apply a stricter rule for the
stack to enter pingpong mode. However, the condition used to check for
interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
jiffy based and might be too coarse, which delays the stack entering
pingpong mode.
We revert this patch so that we no longer use the above condition to
determine interactive session, and also reduce pingpong threshold to 1.
Fixes: 4a41f453bedf ("tcp: change pingpong threshold to 3")
Reported-by: LemmyHuang <hlm3280@163.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220721204404.388396-1-weiwan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While reading sysctl_tcp_tso_rtt_log, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 65466904b015 ("tcp: adjust TSO packet sizes based on min_rtt")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_min_tso_segs, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 95bd09eb2750 ("tcp: TSO packets automatic sizing")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_limit_output_bytes, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 46d3ceabd8d9 ("tcp: TCP Small Queues")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_workaround_signed_windows, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 15d99e02baba ("[TCP]: sysctl to allow TCP window > 32767 sans wscale")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While reading sysctl_tcp_retrans_collapse, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_slow_start_after_idle, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 35089bb203f4 ("[TCP]: Add tcp_slow_start_after_idle sysctl.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_early_retrans, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: eed530b6c676 ("tcp: early retransmit")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading these knobs, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.
- tcp_sack
- tcp_window_scaling
- tcp_timestamps
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading these sysctl knobs, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.
- tcp_retries1
- tcp_retries2
- tcp_orphan_retries
- tcp_fin_timeout
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_probe_interval, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 05cbc0db03e8 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_probe_threshold, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 6b58e0a5f32d ("ipv4: Use binary search to choose tcp PMTU probe_size")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_min_snd_mss, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 5f3e2bf008c2 ("tcp: add tcp_min_snd_mss sysctl")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_base_mss, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 5d424d5a674f ("[TCP]: MTU probing")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_mtu_probing, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 5d424d5a674f ("[TCP]: MTU probing")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
include/net/sock.h
310731e2f161 ("net: Fix data-races around sysctl_mem.")
e70f3c701276 ("Revert "net: set SK_MEM_QUANTUM to 4096"")
https://lore.kernel.org/all/20220711120211.7c8b7cba@canb.auug.org.au/
net/ipv4/fib_semantics.c
747c14307214 ("ip: fix dflt addr selection for connected nexthop")
d62607c3fe45 ("net: rename reference+tracking helpers")
net/tls/tls.h
include/net/tls.h
3d8c51b25a23 ("net/tls: Check for errors in tls_device_init")
587903142308 ("tls: create an internal header")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While reading sysctl_tcp_ecn_fallback, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 492135557dc0 ("tcp: add rfc3168, section 6.1.1.1. fallback")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While reading sysctl_tcp_ecn, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
current code of __tcp_retransmit_skb only check TCP_SKB_CB(skb)->seq
in send window, and TCP_SKB_CB(skb)->seq_end maybe out of send window.
If receiver has shrunk his window, and skb is out of new window, it
should retransmit a smaller portion of the payload.
test packetdrill script:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) win 65535 <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8>
+.05 < S. 0:0(0) ack 1 win 6000 <mss 1000,nop,nop,sackOK>
+0 > . 1:1(0) ack 1
+0 write(3, ..., 10000) = 10000
+0 > . 1:2001(2000) ack 1 win 65535
+0 > . 2001:4001(2000) ack 1 win 65535
+0 > . 4001:6001(2000) ack 1 win 65535
+.05 < . 1:1(0) ack 4001 win 1001
and tcpdump show:
192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 1:2001, ack 1, win 65535, length 2000
192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 2001:4001, ack 1, win 65535, length 2000
192.168.226.67.55 > 192.0.2.1.8080: Flags [P.], seq 4001:5001, ack 1, win 65535, length 1000
192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 5001:6001, ack 1, win 65535, length 1000
192.0.2.1.8080 > 192.168.226.67.55: Flags [.], ack 4001, win 1001, length 0
192.168.226.67.55 > 192.0.2.1.8080: Flags [.], seq 5001:6001, ack 1, win 65535, length 1000
192.168.226.67.55 > 192.0.2.1.8080: Flags [P.], seq 4001:5001, ack 1, win 65535, length 1000
when cient retract window to 1001, send window is [4001,5002],
but TLP send 5001-6001 packet which is out of send window.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1657532838-20200-1-git-send-email-liyonglong@chinatelecom.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sk_forced_mem_schedule() has a bug similar to ones fixed
in commit 7c80b038d23e ("net: fix sk_wmem_schedule() and
sk_rmem_schedule() errors")
While this bug has little chance to trigger in old kernels,
we need to fix it before the following patch.
Fixes: d83769a580f1 ("tcp: fix possible deadlock in tcp_send_fin()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE.
This might change in the future, but it seems better to avoid the
confusion.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|