Age | Commit message (Collapse) | Author |
|
After running the 'sendmsg02' program of Linux Test Project (LTP),
kmemleak reports the following memory leak:
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888243866800 (size 2048):
comm "sendmsg02", pid 67, jiffies 4294903166
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 5e 00 00 00 00 00 00 00 ........^.......
01 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace (crc 7e96a3f2):
kmemleak_alloc+0x56/0x90
kmem_cache_alloc_noprof+0x209/0x450
sk_prot_alloc.constprop.0+0x60/0x160
sk_alloc+0x32/0xc0
unix_create1+0x67/0x2b0
unix_create+0x47/0xa0
__sock_create+0x12e/0x200
__sys_socket+0x6d/0x100
__x64_sys_socket+0x1b/0x30
x64_sys_call+0x7e1/0x2140
do_syscall_64+0x54/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Commit 689c398885cc ("af_unix: Defer sock_put() to clean up path in
unix_dgram_sendmsg().") defers sock_put() in the error handling path.
However, it fails to account for the condition 'msg->msg_namelen != 0',
resulting in a memory leak when the code jumps to the 'lookup' label.
Fix issue by calling sock_put() if 'msg->msg_namelen != 0' is met.
Fixes: 689c398885cc ("af_unix: Defer sock_put() to clean up path in unix_dgram_sendmsg().")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Acked-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250225021457.1824-1-ahuang12@lenovo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is based on Donald Hunter's patch.
These functions could fail for various reasons, sometimes
triggering kfree_skb().
* unix_stream_connect() : connect()
* unix_stream_sendmsg() : sendmsg()
* queue_oob() : sendmsg(MSG_OOB)
* unix_dgram_sendmsg() : sendmsg()
Such kfree_skb() is tied to the errno of connect() and
sendmsg(), and we need not define skb drop reasons.
Let's use consume_skb() not to churn kfree_skb() events.
Link: https://lore.kernel.org/netdev/eb30b164-7f86-46bf-a5d3-0f8bda5e9398@redhat.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a follow-up of commit d460b04bc452 ("af_unix: Clean up
error paths in unix_stream_sendmsg().").
If we initialise skb with NULL in unix_stream_sendmsg(), we can
reuse the existing out_pipe label for the SEND_SHUTDOWN check.
Let's rename it and adjust the existing label as out_pipe_lock.
While at it, size and data_len are moved to the while loop scope.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_dgram_disconnected() is called from two places:
1. when a connect()ed socket dis-connect()s or re-connect()s to
another socket
2. when sendmsg() fails because the peer socket that the client
has connect()ed to has been close()d
Then, the client's recv queue is purged to remove all messages from
the old peer socket.
Let's define a new drop reason for that case.
# echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable
# python3
>>> from socket import *
>>>
>>> # s1 has a message from s2
>>> s1, s2 = socketpair(AF_UNIX, SOCK_DGRAM)
>>> s2.send(b'hello world')
>>>
>>> # re-connect() drops the message from s2
>>> s3 = socket(AF_UNIX, SOCK_DGRAM)
>>> s3.bind('')
>>> s1.connect(s3.getsockname())
# cat /sys/kernel/tracing/trace_pipe
python3-250 ... kfree_skb: ... location=skb_queue_purge_reason+0xdc/0x110 reason: UNIX_DISCONNECT
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_stream_read_skb() is called when BPF SOCKMAP reads some data
from a socket in the map.
SOCKMAP does not support MSG_OOB, and reading OOB results in a drop.
Let's set drop reasons respectively.
* SOCKET_CLOSE : the socket in SOCKMAP was close()d
* UNIX_SKIP_OOB : OOB was read from the socket in SOCKMAP
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
AF_UNIX SOCK_STREAM socket supports MSG_OOB.
When OOB data is sent to a socket, recv() will break at that point.
If the next recv() does not have MSG_OOB, the normal data following
the OOB data is returned.
Then, the OOB skb is dropped.
Let's define a new drop reason for that case in manage_oob().
# echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable
# python3
>>> from socket import *
>>> s1, s2 = socketpair(AF_UNIX)
>>> s1.send(b'a', MSG_OOB)
>>> s1.send(b'b')
>>> s2.recv(2)
b'b'
# cat /sys/kernel/tracing/trace_pipe
...
python3-223 ... kfree_skb: ... location=unix_stream_read_generic+0x59e/0xc20 reason: UNIX_SKIP_OOB
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Inflight file descriptors by SCM_RIGHTS hold references to the
struct file.
AF_UNIX sockets could hold references to each other, forming
reference cycles.
Once such sockets are close()d without the fd recv()ed, they
will be unaccessible from userspace but remain in kernel.
__unix_gc() garbage-collects skb with the dead file descriptors
and frees them by __skb_queue_purge().
Let's set SKB_DROP_REASON_SOCKET_CLOSE there.
# echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable
# python3
>>> from socket import *
>>> from array import array
>>>
>>> # Create a reference cycle
>>> s1 = socket(AF_UNIX, SOCK_DGRAM)
>>> s1.bind('')
>>> s1.sendmsg([b"nop"], [(SOL_SOCKET, SCM_RIGHTS, array("i", [s1.fileno()]))], 0, s1.getsockname())
>>> s1.close()
>>>
>>> # Trigger GC
>>> s2 = socket(AF_UNIX)
>>> s2.close()
# cat /sys/kernel/tracing/trace_pipe
...
kworker/u16:2-42 ... kfree_skb: ... location=__unix_gc+0x4ad/0x580 reason: SOCKET_CLOSE
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_sock_destructor() is called as sk->sk_destruct() just before
the socket is actually freed.
Let's use SKB_DROP_REASON_SOCKET_CLOSE for skb_queue_purge().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_release_sock() is called when the last refcnt of struct file
is released.
Let's define a new drop reason SKB_DROP_REASON_SOCKET_CLOSE and
set it for kfree_skb() in unix_release_sock().
# echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable
# python3
>>> from socket import *
>>> s1, s2 = socketpair(AF_UNIX)
>>> s1.send(b'hello world')
>>> s2.close()
# cat /sys/kernel/tracing/trace_pipe
...
python3-280 ... kfree_skb: ... protocol=0 location=unix_release_sock+0x260/0x420 reason: SOCKET_CLOSE
To be precise, unix_release_sock() is also called for a new child
socket in unix_stream_connect() when something fails, but the new
sk does not have skb in the recv queue then and no event is logged.
Note that only tcp_inbound_ao_hash() uses a similar drop reason,
SKB_DROP_REASON_TCP_CLOSE, and this can be generalised later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This makes it possible to disable the MSG_OOB support in .config.
Signed-off-by: Florent Revest <revest@chromium.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241218143334.1507465-1-revest@chromium.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_our_peer() is used only in unix_may_send().
Let's inline it in unix_may_send().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The error path is complicated in unix_dgram_sendmsg() because there
are two timings when other could be non-NULL: when it's fetched from
unix_peer_get() and when it's looked up by unix_find_other().
Let's move unix_peer_get() to the else branch for unix_find_other()
and clean up the error paths.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When other has SOCK_DEAD in unix_dgram_sendmsg(), we hold
unix_state_lock() for the sender socket first.
However, we do not need it for sk->sk_type.
Let's move the lock down a bit.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When other has SOCK_DEAD in unix_dgram_sendmsg(), we call sock_put() for
it first and then set NULL to other before jumping to the error path.
This is to skip sock_put() in the error path.
Let's not set NULL to other and defer the sock_put() to the error path
to clean up the labels later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
There are two paths jumping to the restart label in unix_dgram_sendmsg().
One requires another lookup and sk_filter(), but the other doesn't.
Let's split the label to make each flow more straightforward.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In unix_dgram_sendmsg(), we use a local variable sunaddr pointing
NULL or msg->msg_name based on msg->msg_namelen.
Let's remove sunaddr and simplify the usage.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When other is NULL in unix_dgram_sendmsg(), we check if sunaddr
is NULL before looking up a receiver socket.
There are three paths going through the check, but it's always
false for 2 out of the 3 paths: the first socket lookup and the
second 'goto restart'.
The condition can be true for the first 'goto restart' only when
SOCK_DEAD is flagged for the socket found with msg->msg_name.
Let's move the check to the single appropriate path.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We will introduce skb drop reason for AF_UNIX, then we need to
set an errno and a drop reason for each path.
Let's set an error only when it's needed in unix_dgram_sendmsg().
Then, we need not (re)set 0 to err.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
If we move send_sig() to the SEND_SHUTDOWN check before
the while loop, then we can reuse the same kfree_skb()
after the pipe_err_free label.
Let's gather the scattered kfree_skb()s in error paths.
While at it, some style issues are fixed, and the pipe_err_free
label is renamed to out_pipe to match other label names.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We will introduce skb drop reason for AF_UNIX, then we need to
set an errno and a drop reason for each path.
Let's set an error only when it's needed in unix_stream_sendmsg().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The label order is weird in unix_stream_connect(), and all NULL checks
are unnecessary if reordered.
Let's clean up the error paths to make it easy to set a drop reason
for each path.
While at it, a comment with the old style is updated.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We will introduce skb drop reason for AF_UNIX, then we need to
set an errno and a drop reason for each path.
Let's set an error only when it's needed in unix_stream_connect().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When `skb_splice_from_iter` was introduced, it inadvertently added
checksumming for AF_UNIX sockets. This resulted in significant
slowdowns, for example when using sendfile over unix sockets.
Using the test code in [1] in my test setup (2G single core qemu),
the client receives a 1000M file in:
- without the patch: 1482ms (+/- 36ms)
- with the patch: 652.5ms (+/- 22.9ms)
This commit addresses the issue by marking checksumming as unnecessary in
`unix_stream_sendmsg`
Cc: stable@vger.kernel.org
Signed-off-by: Frederik Deweerdt <deweerdt.lkml@gmail.com>
Fixes: 2e910b95329c ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES")
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/Z1fMaHkRf8cfubuE@xiberoa
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported use-after-free in unix_stream_recv_urg(). [0]
The scenario is
1. send(MSG_OOB)
2. recv(MSG_OOB)
-> The consumed OOB remains in recv queue
3. send(MSG_OOB)
4. recv()
-> manage_oob() returns the next skb of the consumed OOB
-> This is also OOB, but unix_sk(sk)->oob_skb is not cleared
5. recv(MSG_OOB)
-> unix_sk(sk)->oob_skb is used but already freed
The recent commit 8594d9b85c07 ("af_unix: Don't call skb_get() for OOB
skb.") uncovered the issue.
If the OOB skb is consumed and the next skb is peeked in manage_oob(),
we still need to check if the skb is OOB.
Let's do so by falling back to the following checks in manage_oob()
and add the test case in selftest.
Note that we need to add a similar check for SIOCATMARK.
[0]:
BUG: KASAN: slab-use-after-free in unix_stream_read_actor+0xa6/0xb0 net/unix/af_unix.c:2959
Read of size 4 at addr ffff8880326abcc4 by task syz-executor178/5235
CPU: 0 UID: 0 PID: 5235 Comm: syz-executor178 Not tainted 6.11.0-rc5-syzkaller-00742-gfbdaffe41adc #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:93 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
unix_stream_read_actor+0xa6/0xb0 net/unix/af_unix.c:2959
unix_stream_recv_urg+0x1df/0x320 net/unix/af_unix.c:2640
unix_stream_read_generic+0x2456/0x2520 net/unix/af_unix.c:2778
unix_stream_recvmsg+0x22b/0x2c0 net/unix/af_unix.c:2996
sock_recvmsg_nosec net/socket.c:1046 [inline]
sock_recvmsg+0x22f/0x280 net/socket.c:1068
____sys_recvmsg+0x1db/0x470 net/socket.c:2816
___sys_recvmsg net/socket.c:2858 [inline]
__sys_recvmsg+0x2f0/0x3e0 net/socket.c:2888
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5360d6b4e9
Code: 48 83 c4 28 c3 e8 37 17 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff29b3a458 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
RAX: ffffffffffffffda RBX: 00007fff29b3a638 RCX: 00007f5360d6b4e9
RDX: 0000000000002001 RSI: 0000000020000640 RDI: 0000000000000003
RBP: 00007f5360dde610 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00007fff29b3a628 R14: 0000000000000001 R15: 0000000000000001
</TASK>
Allocated by task 5235:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
unpoison_slab_object mm/kasan/common.c:312 [inline]
__kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:338
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook mm/slub.c:3988 [inline]
slab_alloc_node mm/slub.c:4037 [inline]
kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4080
__alloc_skb+0x1c3/0x440 net/core/skbuff.c:667
alloc_skb include/linux/skbuff.h:1320 [inline]
alloc_skb_with_frags+0xc3/0x770 net/core/skbuff.c:6528
sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2815
sock_alloc_send_skb include/net/sock.h:1778 [inline]
queue_oob+0x108/0x680 net/unix/af_unix.c:2198
unix_stream_sendmsg+0xd24/0xf80 net/unix/af_unix.c:2351
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
___sys_sendmsg net/socket.c:2651 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2680
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 5235:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
poison_slab_object+0xe0/0x150 mm/kasan/common.c:240
__kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
kasan_slab_free include/linux/kasan.h:184 [inline]
slab_free_hook mm/slub.c:2252 [inline]
slab_free mm/slub.c:4473 [inline]
kmem_cache_free+0x145/0x350 mm/slub.c:4548
unix_stream_read_generic+0x1ef6/0x2520 net/unix/af_unix.c:2917
unix_stream_recvmsg+0x22b/0x2c0 net/unix/af_unix.c:2996
sock_recvmsg_nosec net/socket.c:1046 [inline]
sock_recvmsg+0x22f/0x280 net/socket.c:1068
__sys_recvfrom+0x256/0x3e0 net/socket.c:2255
__do_sys_recvfrom net/socket.c:2273 [inline]
__se_sys_recvfrom net/socket.c:2269 [inline]
__x64_sys_recvfrom+0xde/0x100 net/socket.c:2269
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff8880326abc80
which belongs to the cache skbuff_head_cache of size 240
The buggy address is located 68 bytes inside of
freed 240-byte region [ffff8880326abc80, ffff8880326abd70)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x326ab
ksm flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
page_type: 0xfdffffff(slab)
raw: 00fff00000000000 ffff88801eaee780 ffffea0000b7dc80 dead000000000003
raw: 0000000000000000 00000000800c000c 00000001fdffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 4686, tgid 4686 (udevadm), ts 32357469485, free_ts 28829011109
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1493
prep_new_page mm/page_alloc.c:1501 [inline]
get_page_from_freelist+0x2e4c/0x2f10 mm/page_alloc.c:3439
__alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4695
__alloc_pages_node_noprof include/linux/gfp.h:269 [inline]
alloc_pages_node_noprof include/linux/gfp.h:296 [inline]
alloc_slab_page+0x5f/0x120 mm/slub.c:2321
allocate_slab+0x5a/0x2f0 mm/slub.c:2484
new_slab mm/slub.c:2537 [inline]
___slab_alloc+0xcd1/0x14b0 mm/slub.c:3723
__slab_alloc+0x58/0xa0 mm/slub.c:3813
__slab_alloc_node mm/slub.c:3866 [inline]
slab_alloc_node mm/slub.c:4025 [inline]
kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4080
__alloc_skb+0x1c3/0x440 net/core/skbuff.c:667
alloc_skb include/linux/skbuff.h:1320 [inline]
alloc_uevent_skb+0x74/0x230 lib/kobject_uevent.c:289
uevent_net_broadcast_untagged lib/kobject_uevent.c:326 [inline]
kobject_uevent_net_broadcast+0x2fd/0x580 lib/kobject_uevent.c:410
kobject_uevent_env+0x57d/0x8e0 lib/kobject_uevent.c:608
kobject_synth_uevent+0x4ef/0xae0 lib/kobject_uevent.c:207
uevent_store+0x4b/0x70 drivers/base/bus.c:633
kernfs_fop_write_iter+0x3a1/0x500 fs/kernfs/file.c:334
new_sync_write fs/read_write.c:497 [inline]
vfs_write+0xa72/0xc90 fs/read_write.c:590
page last free pid 1 tgid 1 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1094 [inline]
free_unref_page+0xd22/0xea0 mm/page_alloc.c:2612
kasan_depopulate_vmalloc_pte+0x74/0x90 mm/kasan/shadow.c:408
apply_to_pte_range mm/memory.c:2797 [inline]
apply_to_pmd_range mm/memory.c:2841 [inline]
apply_to_pud_range mm/memory.c:2877 [inline]
apply_to_p4d_range mm/memory.c:2913 [inline]
__apply_to_page_range+0x8a8/0xe50 mm/memory.c:2947
kasan_release_vmalloc+0x9a/0xb0 mm/kasan/shadow.c:525
purge_vmap_node+0x3e3/0x770 mm/vmalloc.c:2208
__purge_vmap_area_lazy+0x708/0xae0 mm/vmalloc.c:2290
_vm_unmap_aliases+0x79d/0x840 mm/vmalloc.c:2885
change_page_attr_set_clr+0x2fe/0xdb0 arch/x86/mm/pat/set_memory.c:1881
change_page_attr_set arch/x86/mm/pat/set_memory.c:1922 [inline]
set_memory_nx+0xf2/0x130 arch/x86/mm/pat/set_memory.c:2110
free_init_pages arch/x86/mm/init.c:924 [inline]
free_kernel_image_pages arch/x86/mm/init.c:943 [inline]
free_initmem+0x79/0x110 arch/x86/mm/init.c:970
kernel_init+0x31/0x2b0 init/main.c:1476
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Memory state around the buggy address:
ffff8880326abb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880326abc00: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff8880326abc80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880326abd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
ffff8880326abd80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
Fixes: 93c99f21db36 ("af_unix: Don't stop recv(MSG_DONTWAIT) if consumed OOB skb is at the head.")
Reported-by: syzbot+8811381d455e3e9ec788@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8811381d455e3e9ec788
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240905193240.17565-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When OOB skb has been already consumed, manage_oob() returns the next
skb if exists. In such a case, we need to fall back to the else branch
below.
Then, we want to keep holding spin_lock(&sk->sk_receive_queue.lock).
Let's move it out of if-else branch and add lightweight check before
spin_lock() for major use cases without OOB skb.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240905193240.17565-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When OOB skb has been already consumed, manage_oob() returns the next
skb if exists. In such a case, we need to fall back to the else branch
below.
Then, we need to keep two skbs and free them later with consume_skb()
and kfree_skb().
Let's rename unlinked_skb accordingly.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240905193240.17565-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a prep for the later fix.
No functional change intended.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240905193240.17565-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since introduced, OOB skb holds an additional reference count with no
special reason and caused many issues.
Also, kfree_skb() and consume_skb() are used to decrement the count,
which is confusing.
Let's drop the unnecessary skb_get() in queue_oob() and corresponding
kfree_skb(), consume_skb(), and skb_unref().
Now unix_sk(sk)->oob_skb is just a pointer to skb in the receive queue,
so special handing is no longer needed in GC.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240816233921.57800-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
AF_UNIX socket tracks the most recent OOB packet (in its receive queue)
with an `oob_skb` pointer. BPF redirecting does not account for that: when
an OOB packet is moved between sockets, `oob_skb` is left outdated. This
results in a single skb that may be accessed from two different sockets.
Take the easy way out: silently drop MSG_OOB data targeting any socket that
is in a sockmap or a sockhash. Note that such silent drop is akin to the
fate of redirected skb's scm_fp_list (SCM_RIGHTS, SCM_CREDENTIALS).
For symmetry, forbid MSG_OOB in unix_bpf_recvmsg().
Fixes: 314001f0bf92 ("af_unix: Add OOB support")
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20240713200218.2140950-2-mhal@rbox.co
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/phy/aquantia/aquantia.h
219343755eae ("net: phy: aquantia: add missing include guards")
61578f679378 ("net: phy: aquantia: add support for PHY LEDs")
drivers/net/ethernet/wangxun/libwx/wx_hw.c
bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx")
b501d261a5b3 ("net: txgbe: add FDIR ATR support")
https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/
include/linux/mlx5/mlx5_ifc.h
048a403648fc ("net/mlx5: IFC updates for changing max EQs")
99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/
Adjacent changes:
drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c
4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference")
3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard")
include/net/mac80211.h
816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP")
5a009b42e041 ("wifi: mac80211: track changes in AP's TPE")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
KMSAN reported uninit-value access in __unix_walk_scc() [1].
In the list_for_each_entry_reverse() loop, when the vertex's index
equals it's scc_index, the loop uses the variable vertex as a
temporary variable that points to a vertex in scc. And when the loop
is finished, the variable vertex points to the list head, in this case
scc, which is a local variable on the stack (more precisely, it's not
even scc and might underflow the call stack of __unix_walk_scc():
container_of(&scc, struct unix_vertex, scc_entry)).
However, the variable vertex is used under the label prev_vertex. So
if the edge_stack is not empty and the function jumps to the
prev_vertex label, the function will access invalid data on the
stack. This causes the uninit-value access issue.
Fix this by introducing a new temporary variable for the loop.
[1]
BUG: KMSAN: uninit-value in __unix_walk_scc net/unix/garbage.c:478 [inline]
BUG: KMSAN: uninit-value in unix_walk_scc net/unix/garbage.c:526 [inline]
BUG: KMSAN: uninit-value in __unix_gc+0x2589/0x3c20 net/unix/garbage.c:584
__unix_walk_scc net/unix/garbage.c:478 [inline]
unix_walk_scc net/unix/garbage.c:526 [inline]
__unix_gc+0x2589/0x3c20 net/unix/garbage.c:584
process_one_work kernel/workqueue.c:3231 [inline]
process_scheduled_works+0xade/0x1bf0 kernel/workqueue.c:3312
worker_thread+0xeb6/0x15b0 kernel/workqueue.c:3393
kthread+0x3c4/0x530 kernel/kthread.c:389
ret_from_fork+0x6e/0x90 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Uninit was stored to memory at:
unix_walk_scc net/unix/garbage.c:526 [inline]
__unix_gc+0x2adf/0x3c20 net/unix/garbage.c:584
process_one_work kernel/workqueue.c:3231 [inline]
process_scheduled_works+0xade/0x1bf0 kernel/workqueue.c:3312
worker_thread+0xeb6/0x15b0 kernel/workqueue.c:3393
kthread+0x3c4/0x530 kernel/kthread.c:389
ret_from_fork+0x6e/0x90 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
Local variable entries created at:
ref_tracker_free+0x48/0xf30 lib/ref_tracker.c:222
netdev_tracker_free include/linux/netdevice.h:4058 [inline]
netdev_put include/linux/netdevice.h:4075 [inline]
dev_put include/linux/netdevice.h:4101 [inline]
update_gid_event_work_handler+0xaa/0x1b0 drivers/infiniband/core/roce_gid_mgmt.c:813
CPU: 1 PID: 12763 Comm: kworker/u8:31 Not tainted 6.10.0-rc4-00217-g35bb670d65fc #32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Workqueue: events_unbound __unix_gc
Fixes: 3484f063172d ("af_unix: Detect Strongly Connected Components.")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240702160428.10153-1-syoshida@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
e3f02f32a050 ("ionic: fix kernel panic due to multi-buffer handling")
d9c04209990b ("ionic: Mark error paths in the data path as unlikely")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Even if OOB data is recv()ed, ioctl(SIOCATMARK) must return 1 when the
OOB skb is at the head of the receive queue and no new OOB data is queued.
Without fix:
# RUN msg_oob.no_peek.oob ...
# msg_oob.c:305:oob:Expected answ[0] (0) == oob_head (1)
# oob: Test terminated by assertion
# FAIL msg_oob.no_peek.oob
not ok 2 msg_oob.no_peek.oob
With fix:
# RUN msg_oob.no_peek.oob ...
# OK msg_oob.no_peek.oob
ok 2 msg_oob.no_peek.oob
Fixes: 314001f0bf92 ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, recv() is stopped at a consumed OOB skb even if a new
OOB skb is queued and we can ignore the old OOB skb.
>>> from socket import *
>>> c1, c2 = socket(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'hellowor', MSG_OOB)
8
>>> c2.recv(1, MSG_OOB) # consume OOB data stays at middle of recvq.
b'r'
>>> c1.send(b'ld', MSG_OOB)
2
>>> c2.recv(10) # recv() stops at the old consumed OOB
b'hellowo' # should be 'hellowol'
manage_oob() should not stop recv() at the old consumed OOB skb if
there is a new OOB data queued.
Note that TCP behaviour is apparently wrong in this test case because
we can recv() the same OOB data twice.
Without fix:
# RUN msg_oob.no_peek.ex_oob_ahead_break ...
# msg_oob.c:138:ex_oob_ahead_break:AF_UNIX :hellowo
# msg_oob.c:139:ex_oob_ahead_break:Expected:hellowol
# msg_oob.c:141:ex_oob_ahead_break:Expected ret[0] (7) == expected_len (8)
# ex_oob_ahead_break: Test terminated by assertion
# FAIL msg_oob.no_peek.ex_oob_ahead_break
not ok 11 msg_oob.no_peek.ex_oob_ahead_break
With fix:
# RUN msg_oob.no_peek.ex_oob_ahead_break ...
# msg_oob.c:146:ex_oob_ahead_break:AF_UNIX :hellowol
# msg_oob.c:147:ex_oob_ahead_break:TCP :helloworl
# OK msg_oob.no_peek.ex_oob_ahead_break
ok 11 msg_oob.no_peek.ex_oob_ahead_break
Fixes: 314001f0bf92 ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Let's say a socket send()s "hello" with MSG_OOB and "world" without flags,
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX)
>>> c1.send(b'hello', MSG_OOB)
5
>>> c1.send(b'world')
5
and its peer recv()s "hell" and "o".
>>> c2.recv(10)
b'hell'
>>> c2.recv(1, MSG_OOB)
b'o'
Now the consumed OOB skb stays at the head of recvq to return a correct
value for ioctl(SIOCATMARK), which is broken now and fixed by a later
patch.
Then, if peer issues recv() with MSG_DONTWAIT, manage_oob() returns NULL,
so recv() ends up with -EAGAIN.
>>> c2.setblocking(False) # This causes -EAGAIN even with available data
>>> c2.recv(5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BlockingIOError: [Errno 11] Resource temporarily unavailable
However, next recv() will return the following available data, "world".
>>> c2.recv(5)
b'world'
When the consumed OOB skb is at the head of the queue, we need to fetch
the next skb to fix the weird behaviour.
Note that the issue does not happen without MSG_DONTWAIT because we can
retry after manage_oob().
This patch also adds a test case that covers the issue.
Without fix:
# RUN msg_oob.no_peek.ex_oob_break ...
# msg_oob.c:134:ex_oob_break:AF_UNIX :Resource temporarily unavailable
# msg_oob.c:135:ex_oob_break:Expected:ld
# msg_oob.c:137:ex_oob_break:Expected ret[0] (-1) == expected_len (2)
# ex_oob_break: Test terminated by assertion
# FAIL msg_oob.no_peek.ex_oob_break
not ok 8 msg_oob.no_peek.ex_oob_break
With fix:
# RUN msg_oob.no_peek.ex_oob_break ...
# OK msg_oob.no_peek.ex_oob_break
ok 8 msg_oob.no_peek.ex_oob_break
Fixes: 314001f0bf92 ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
After consuming OOB data, recv() reading the preceding data must break at
the OOB skb regardless of MSG_PEEK.
Currently, MSG_PEEK does not stop recv() for AF_UNIX, and the behaviour is
not compliant with TCP.
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX)
>>> c1.send(b'hello', MSG_OOB)
5
>>> c1.send(b'world')
5
>>> c2.recv(1, MSG_OOB)
b'o'
>>> c2.recv(9, MSG_PEEK) # This should return b'hell'
b'hellworld' # even with enough buffer.
Let's fix it by returning NULL for consumed skb and unlinking it only if
MSG_PEEK is not specified.
This patch also adds test cases that add recv(MSG_PEEK) before each recv().
Without fix:
# RUN msg_oob.peek.oob_ahead_break ...
# msg_oob.c:134:oob_ahead_break:AF_UNIX :hellworld
# msg_oob.c:135:oob_ahead_break:Expected:hell
# msg_oob.c:137:oob_ahead_break:Expected ret[0] (9) == expected_len (4)
# oob_ahead_break: Test terminated by assertion
# FAIL msg_oob.peek.oob_ahead_break
not ok 13 msg_oob.peek.oob_ahead_break
With fix:
# RUN msg_oob.peek.oob_ahead_break ...
# OK msg_oob.peek.oob_ahead_break
ok 13 msg_oob.peek.oob_ahead_break
Fixes: 314001f0bf92 ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket,
the listener's sk_peer_pid/sk_peer_cred are copied to the client in
copy_peercred().
Then, two sk_peer_locks are held there; one is client's and another
is listener's.
However, the latter is not needed because we hold the listner's
unix_state_lock() there and unix_listen() cannot update the cred
concurrently.
Let's drop the unnecessary spin_lock() and use the bare spin_lock()
for the client to protect concurrent read by getsockopt(SO_PEERCRED).
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When (AF_UNIX, SOCK_STREAM) socket connect()s to a listening socket,
the listener's sk_peer_pid/sk_peer_cred are copied to the client in
copy_peercred().
Then, the client's sk_peer_pid and sk_peer_cred are always NULL, so
we need not call put_pid() and put_cred() there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
init_peercred() is called in 3 places:
1. socketpair() : both sockets
2. connect() : child socket
3. listen() : listening socket
The first two need not hold sk_peer_lock because no one can
touch the socket.
Let's set cred/pid without holding lock for the two cases and
rename the old init_peercred() to update_peercred() to properly
reflect the use case.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
While GC is cleaning up cyclic references by SCM_RIGHTS,
unix_collect_skb() collects skb in the socket's recvq.
If the socket is TCP_LISTEN, we need to collect skb in the
embryo's queue. Then, both the listener's recvq lock and
the embroy's one are held.
The locking is always done in the listener -> embryo order.
Let's define it as unix_recvq_lock_cmp_fn() instead of using
spin_lock_nested().
Note that the reverse order is defined for consistency.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
sk_diag_dump_icons() acquires embryo's lock by unix_state_lock_nested()
to fetch its peer.
The embryo's ->peer is set to NULL only when its parent listener is
close()d. Then, unix_release_sock() is called for each embryo after
unlinking skb by skb_dequeue().
In sk_diag_dump_icons(), we hold the parent's recvq lock, so we need
not acquire unix_state_lock_nested(), and peer is always non-NULL.
Let's remove unnecessary unix_state_lock_nested() and non-NULL test
for peer.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
sk_diag_dump_peer() and sk_diag_dump() call unix_state_lock() for
sock_i_ino() which reads SOCK_INODE(sk->sk_socket)->i_ino, but it's
protected by sk->sk_callback_lock.
Let's remove unnecessary unix_state_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
While a SOCK_(STREAM|SEQPACKET) socket connect()s to another, we hold
two locks of them by unix_state_lock() and unix_state_lock_nested() in
unix_stream_connect().
Before unix_state_lock_nested(), the following is guaranteed by checking
sk->sk_state:
1. The first socket is TCP_LISTEN
2. The second socket is not the first one
3. Simultaneous connect() must fail
So, the client state can be TCP_CLOSE or TCP_LISTEN or TCP_ESTABLISHED.
Let's define the expected states as unix_state_lock_cmp_fn() instead of
using unix_state_lock_nested().
Note that 2. is detected by debug_spin_lock_before() and 3. cannot be
expressed as lock_cmp_fn.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When a SOCK_(STREAM|SEQPACKET) socket connect()s to another one, we need
to lock the two sockets to check their states in unix_stream_connect().
We use unix_state_lock() for the server and unix_state_lock_nested() for
client with tricky sk->sk_state check to avoid deadlock.
The possible deadlock scenario are the following:
1) Self connect()
2) Simultaneous connect()
The former is simple, attempt to grab the same lock, and the latter is
AB-BA deadlock.
After the server's unix_state_lock(), we check the server socket's state,
and if it's not TCP_LISTEN, connect() fails with -EINVAL.
Then, we avoid the former deadlock by checking the client's state before
unix_state_lock_nested(). If its state is not TCP_LISTEN, we can make
sure that the client and the server are not identical based on the state.
Also, the latter deadlock can be avoided in the same way. Due to the
server sk->sk_state requirement, AB-BA deadlock could happen only with
TCP_LISTEN sockets. So, if the client's state is TCP_LISTEN, we can
give up the second lock to avoid the deadlock.
CPU 1 CPU 2 CPU 3
connect(A -> B) connect(B -> A) listen(A)
--- --- ---
unix_state_lock(B)
B->sk_state == TCP_LISTEN
READ_ONCE(A->sk_state) == TCP_CLOSE
^^^^^^^^^
ok, will lock A unix_state_lock(A)
.--------------' WRITE_ONCE(A->sk_state, TCP_LISTEN)
| unix_state_unlock(A)
|
| unix_state_lock(A)
| A->sk_sk_state == TCP_LISTEN
| READ_ONCE(B->sk_state) == TCP_LISTEN
v ^^^^^^^^^^
unix_state_lock_nested(A) Don't lock B !!
Currently, while checking the client's state, we also check if it's
TCP_ESTABLISHED, but this is unlikely and can be checked after we know
the state is not TCP_CLOSE.
Moreover, if it happens after the second lock, we now jump to the restart
label, but it's unlikely that the server is not found during the retry,
so the jump is mostly to revist the client state check.
Let's remove the retry logic and check the state against TCP_CLOSE first.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
unix_dgram_connect() and unix_dgram_{send,recv}msg() lock the socket
and peer in ascending order of the socket address.
Let's define the order as unix_state_lock_cmp_fn() instead of using
unix_state_lock_nested().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When created, AF_UNIX socket is put into net->unx.table.buckets[],
and the hash is stored in sk->sk_hash.
* unbound socket : 0 <= sk_hash <= UNIX_HASH_MOD
When bind() is called, the socket could be moved to another bucket.
* pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD
* abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1
Then, we call unix_table_double_lock() which locks a single bucket
or two.
Let's define the order as unix_table_lock_cmp_fn() instead of using
spin_lock_nested().
The locking is always done in ascending order of sk->sk_hash, which
is the index of buckets/locks array allocated by kvmalloc_array().
sk_hash_A < sk_hash_B
<=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map
So, the relation of two sk->sk_hash can be derived from the addresses
of dep_map in the array of locks.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts, no adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Read with MSG_PEEK flag loops if the first byte to read is an OOB byte.
commit 22dd70eb2c3d ("af_unix: Don't peek OOB data without MSG_OOB.")
addresses the loop issue but does not address the issue that no data
beyond OOB byte can be read.
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'a', MSG_OOB)
1
>>> c1.send(b'b')
1
>>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
b'b'
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c2.setsockopt(SOL_SOCKET, SO_OOBINLINE, 1)
>>> c1.send(b'a', MSG_OOB)
1
>>> c1.send(b'b')
1
>>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
b'a'
>>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
b'a'
>>> c2.recv(1, MSG_DONTWAIT)
b'a'
>>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
b'b'
>>>
Fixes: 314001f0bf92 ("af_unix: Add OOB support")
Signed-off-by: Rao Shoaib <Rao.Shoaib@oracle.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240611084639.2248934-1-Rao.Shoaib@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/pensando/ionic/ionic_txrx.c
d9c04209990b ("ionic: Mark error paths in the data path as unlikely")
491aee894a08 ("ionic: fix kernel panic in XDP_TX action")
net/ipv6/ip6_fib.c
b4cb4a1391dc ("net: use unrcu_pointer() helper")
b01e1c030770 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While dumping sockets via UNIX_DIAG, we do not hold unix_state_lock().
Let's use READ_ONCE() to read sk->sk_shutdown.
Fixes: e4e541a84863 ("sock-diag: Report shutdown for inet and unix sockets (v2)")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|