<feed xmlns='http://www.w3.org/2005/Atom'>
<title>lwn.git/net/ipv4/sysctl_net_ipv4.c, branch docs-6.9</title>
<subtitle>Linux kernel documentation tree maintained by Jonathan Corbet</subtitle>
<id>http://mirrors.hust.edu.cn/git/lwn.git/atom?h=docs-6.9</id>
<link rel='self' href='http://mirrors.hust.edu.cn/git/lwn.git/atom?h=docs-6.9'/>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/'/>
<updated>2023-12-08T18:44:42+00:00</updated>
<entry>
<title>Use READ/WRITE_ONCE() for IP local_port_range.</title>
<updated>2023-12-08T18:44:42+00:00</updated>
<author>
<name>David Laight</name>
<email>David.Laight@ACULAB.COM</email>
</author>
<published>2023-12-06T13:44:20+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=d9f28735af8781d9c8c6c406c2a102090644133d'/>
<id>urn:sha1:d9f28735af8781d9c8c6c406c2a102090644133d</id>
<content type='text'>
Commit 227b60f5102cd added a seqlock to ensure that the low and high
port numbers were always updated together.
This is overkill because the two 16bit port numbers can be held in
a u32 and read/written in a single instruction.

More recently 91d0b78c5177f added support for finer per-socket limits.
The user-supplied value is 'high &lt;&lt; 16 | low' but they are held
separately and the socket options protected by the socket lock.

Use a u32 containing 'high &lt;&lt; 16 | low' for both the 'net' and 'sk'
fields and use READ_ONCE()/WRITE_ONCE() to ensure both values are
always updated together.

Change (the now trival) inet_get_local_port_range() to a static inline
to optimise the calling code.
(In particular avoiding returning integers by reference.)

Signed-off-by: David Laight &lt;david.laight@aculab.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: David Ahern &lt;dsahern@kernel.org&gt;
Acked-by: Mat Martineau &lt;martineau@kernel.org&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Link: https://lore.kernel.org/r/4e505d4198e946a8be03fb1b4c3072b0@AcuMS.aculab.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: Set pingpong threshold via sysctl</title>
<updated>2023-10-16T21:55:32+00:00</updated>
<author>
<name>Haiyang Zhang</name>
<email>haiyangz@microsoft.com</email>
</author>
<published>2023-10-11T20:30:44+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=562b1fdf061bff9394ccd884456ed1173c224fdc'/>
<id>urn:sha1:562b1fdf061bff9394ccd884456ed1173c224fdc</id>
<content type='text'>
TCP pingpong threshold is 1 by default. But some applications, like SQL DB
may prefer a higher pingpong threshold to activate delayed acks in quick
ack mode for better performance.

The pingpong threshold and related code were changed to 3 in the year
2019 in:
  commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
And reverted to 1 in the year 2022 in:
  commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")

There is no single value that fits all applications.
Add net.ipv4.tcp_pingpong_thresh sysctl tunable, so it can be tuned for
optimal performance based on the application needs.

Signed-off-by: Haiyang Zhang &lt;haiyangz@microsoft.com&gt;
Reviewed-by: Simon Horman &lt;horms@kernel.org&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Link: https://lore.kernel.org/r/1697056244-21888-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: defer regular ACK while processing socket backlog</title>
<updated>2023-09-12T17:10:01+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-09-11T17:05:31+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=133c4c0d37175f510a10fa9bed51e223936073fc'/>
<id>urn:sha1:133c4c0d37175f510a10fa9bed51e223936073fc</id>
<content type='text'>
This idea came after a particular workload requested
the quickack attribute set on routes, and a performance
drop was noticed for large bulk transfers.

For high throughput flows, it is best to use one cpu
running the user thread issuing socket system calls,
and a separate cpu to process incoming packets from BH context.
(With TSO/GRO, bottleneck is usually the 'user' cpu)

Problem is the user thread can spend a lot of time while holding
the socket lock, forcing BH handler to queue most of incoming
packets in the socket backlog.

Whenever the user thread releases the socket lock, it must first
process all accumulated packets in the backlog, potentially
adding latency spikes. Due to flood mitigation, having too many
packets in the backlog increases chance of unexpected drops.

Backlog processing unfortunately shifts a fair amount of cpu cycles
from the BH cpu to the 'user' cpu, thus reducing max throughput.

This patch takes advantage of the backlog processing,
and the fact that ACK are mostly cumulative.

The idea is to detect we are in the backlog processing
and defer all eligible ACK into a single one,
sent from tcp_release_cb().

This saves cpu cycles on both sides, and network resources.

Performance of a single TCP flow on a 200Gbit NIC:

- Throughput is increased by 20% (100Gbit -&gt; 120Gbit).
- Number of generated ACK per second shrinks from 240,000 to 40,000.
- Number of backlog drops per second shrinks from 230 to 0.

Benchmark context:
 - Regular netperf TCP_STREAM (no zerocopy)
 - Intel(R) Xeon(R) Platinum 8481C (Saphire Rapids)
 - MAX_SKB_FRAGS = 17 (~60KB per GRO packet)

This feature is guarded by a new sysctl, and enabled by default:
 /proc/sys/net/ipv4/tcp_backlog_ack_defer

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Acked-by: Soheil Hassas Yeganeh &lt;soheil@google.com&gt;
Acked-by: Dave Taht &lt;dave.taht@gmail.com&gt;
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>networking: Update to register_net_sysctl_sz</title>
<updated>2023-08-15T22:26:18+00:00</updated>
<author>
<name>Joel Granados</name>
<email>joel.granados@gmail.com</email>
</author>
<published>2023-08-09T10:50:03+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=c899710fe7f9f24dd77135875f199359f7b8b774'/>
<id>urn:sha1:c899710fe7f9f24dd77135875f199359f7b8b774</id>
<content type='text'>
Move from register_net_sysctl to register_net_sysctl_sz for all the
networking related files. Do this while making sure to mirror the NULL
assignments with a table_size of zero for the unprivileged users.

We need to move to the new function in preparation for when we change
SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do
so would erroneously allow ARRAY_SIZE() to be called on a pointer. We
hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all
the relevant net sysctl registering functions to register_net_sysctl_sz
in subsequent commits.

An additional size function was added to the following files in order to
calculate the size of an array that is defined in another file:
    include/net/ipv6.h
    net/ipv6/icmp.c
    net/ipv6/route.c
    net/ipv6/sysctl_net_ipv6.c

Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
Signed-off-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: enforce receive buffer memory limits by allowing the tcp window to shrink</title>
<updated>2023-06-17T08:53:53+00:00</updated>
<author>
<name>mfreemon@cloudflare.com</name>
<email>mfreemon@cloudflare.com</email>
</author>
<published>2023-06-12T03:05:24+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=b650d953cd391595e536153ce30b4aab385643ac'/>
<id>urn:sha1:b650d953cd391595e536153ce30b4aab385643ac</id>
<content type='text'>
Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon &lt;mfreemon@cloudflare.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2023-06-08T18:35:14+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2023-06-08T18:34:28+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=449f6bc17a51e68b06cfd742898e5ff3fe6e04d7'/>
<id>urn:sha1:449f6bc17a51e68b06cfd742898e5ff3fe6e04d7</id>
<content type='text'>
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/sch_taprio.c
  d636fc5dd692 ("net: sched: add rcu annotations around qdisc-&gt;qdisc_sleeping")
  dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")

net/ipv4/sysctl_net_ipv4.c
  e209fee4118f ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
  ccce324dabfe ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: fix formatting in sysctl_net_ipv4.c</title>
<updated>2023-06-07T09:57:28+00:00</updated>
<author>
<name>David Morley</name>
<email>morleyd@google.com</email>
</author>
<published>2023-06-06T18:12:33+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=0824a987a58070470351906590a89fa5e0f88ba1'/>
<id>urn:sha1:0824a987a58070470351906590a89fa5e0f88ba1</id>
<content type='text'>
Fix incorrectly formatted tcp_syn_linear_timeouts sysctl in the
ipv4_net_table.

Fixes: ccce324dabfe ("tcp: make the first N SYN RTO backoffs linear")
Signed-off-by: David Morley &lt;morleyd@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Tested-by: David Morley &lt;morleyd@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294</title>
<updated>2023-06-02T08:55:22+00:00</updated>
<author>
<name>Akihiro Suda</name>
<email>suda.gitsendemail@gmail.com</email>
</author>
<published>2023-06-01T03:13:05+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=e209fee4118fe9a449d4d805361eb2de6796be39'/>
<id>urn:sha1:e209fee4118fe9a449d4d805361eb2de6796be39</id>
<content type='text'>
With this commit, all the GIDs ("0 4294967294") can be written to the
"net.ipv4.ping_group_range" sysctl.

Note that 4294967295 (0xffffffff) is an invalid GID (see gid_valid() in
include/linux/uidgid.h), and an attempt to register this number will cause
-EINVAL.

Prior to this commit, only up to GID 2147483647 could be covered.
Documentation/networking/ip-sysctl.rst had "0 4294967295" as an example
value, but this example was wrong and causing -EINVAL.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Co-developed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: Akihiro Suda &lt;akihiro.suda.cz@hco.ntt.co.jp&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: make the first N SYN RTO backoffs linear</title>
<updated>2023-05-11T08:31:16+00:00</updated>
<author>
<name>David Morley</name>
<email>morleyd@google.com</email>
</author>
<published>2023-05-09T18:05:58+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=ccce324dabfe2143519daf50ed8b1ef1d0c542f7'/>
<id>urn:sha1:ccce324dabfe2143519daf50ed8b1ef1d0c542f7</id>
<content type='text'>
Currently the SYN RTO schedule follows an exponential backoff
scheme, which can be unnecessarily conservative in cases where
there are link failures. In such cases, it's better to
aggressively try to retransmit packets, so it takes routers
less time to find a repath with a working link.

We chose a default value for this sysctl of 4, to follow
the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ...
MacOS and IOS have used this backoff schedule for over
a decade, since before this 2009 IETF presentation
discussed the behavior:
https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf

This commit makes the SYN RTO schedule start with a number of
linear backoffs given by the following sysctl:
* tcp_syn_linear_timeouts

This changes the SYN RTO scheme to be: init_rto_val for
tcp_syn_linear_timeouts, exp backoff starting at init_rto_val

For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our
backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ...

Signed-off-by: David Morley &lt;morleyd@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Tested-by: David Morley &lt;morleyd@google.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com
Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
</content>
</entry>
<entry>
<title>tcp: restrict net.ipv4.tcp_app_win</title>
<updated>2023-04-07T07:19:11+00:00</updated>
<author>
<name>YueHaibing</name>
<email>yuehaibing@huawei.com</email>
</author>
<published>2023-04-06T06:34:50+00:00</published>
<link rel='alternate' type='text/html' href='http://mirrors.hust.edu.cn/git/lwn.git/commit/?id=dc5110c2d959c1707e12df5f792f41d90614adaa'/>
<id>urn:sha1:dc5110c2d959c1707e12df5f792f41d90614adaa</id>
<content type='text'>
UBSAN: shift-out-of-bounds in net/ipv4/tcp_input.c:555:23
shift exponent 255 is too large for 32-bit type 'int'
CPU: 1 PID: 7907 Comm: ssh Not tainted 6.3.0-rc4-00161-g62bad54b26db-dirty #206
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 &lt;TASK&gt;
 dump_stack_lvl+0x136/0x150
 __ubsan_handle_shift_out_of_bounds+0x21f/0x5a0
 tcp_init_transfer.cold+0x3a/0xb9
 tcp_finish_connect+0x1d0/0x620
 tcp_rcv_state_process+0xd78/0x4d60
 tcp_v4_do_rcv+0x33d/0x9d0
 __release_sock+0x133/0x3b0
 release_sock+0x58/0x1b0

'maxwin' is int, shifting int for 32 or more bits is undefined behaviour.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: YueHaibing &lt;yuehaibing@huawei.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
