summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-12-06selftests/bpf: Add more test cases for LPM trieHou Tao
Add more test cases for LPM trie in test_maps: 1) test_lpm_trie_update_flags It constructs various use cases for BPF_EXIST and BPF_NOEXIST and check whether the return value of update operation is expected. 2) test_lpm_trie_update_full_maps It tests the update operations on a full LPM trie map. Adding new node will fail and overwriting the value of existed node will succeed. 3) test_lpm_trie_iterate_strs and test_lpm_trie_iterate_ints There two test cases test whether the iteration through get_next_key is sorted and expected. These two test cases delete the minimal key after each iteration and check whether next iteration returns the second minimal key. The only difference between these two test cases is the former one saves strings in the LPM trie and the latter saves integers. Without the fix of get_next_key, these two cases will fail as shown below: test_lpm_trie_iterate_strs(1091):FAIL:iterate #2 got abc exp abS test_lpm_trie_iterate_ints(1142):FAIL:iterate #1 got 0x2 exp 0x1 Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-10-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06selftests/bpf: Move test_lpm_map.c to map_testsHou Tao
Move test_lpm_map.c to map_tests/ to include LPM trie test cases in regular test_maps run. Most code remains unchanged, including the use of assert(). Only reduce n_lookups from 64K to 512, which decreases test_lpm_map runtime from 37s to 0.7s. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-9-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Use raw_spinlock_t for LPM trieHou Tao
After switching from kmalloc() to the bpf memory allocator, there will be no blocking operation during the update of LPM trie. Therefore, change trie->lock from spinlock_t to raw_spinlock_t to make LPM trie usable in atomic context, even on RT kernels. The max value of prefixlen is 2048. Therefore, update or deletion operations will find the target after at most 2048 comparisons. Constructing a test case which updates an element after 2048 comparisons under a 8 CPU VM, and the average time and the maximal time for such update operation is about 210us and 900us. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-8-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Switch to bpf mem allocator for LPM trieHou Tao
Multiple syzbot warnings have been reported. These warnings are mainly about the lock order between trie->lock and kmalloc()'s internal lock. See report [1] as an example: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted ------------------------------------------------------ syz.3.2069/15008 is trying to acquire lock: ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ... but task is already holding lock: ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ... which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&trie->lock){-.-.}-{2:2}: __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3a/0x60 trie_delete_elem+0xb0/0x820 ___bpf_prog_run+0x3e51/0xabd0 __bpf_prog_run32+0xc1/0x100 bpf_dispatcher_nop_func ...... bpf_trace_run2+0x231/0x590 __bpf_trace_contention_end+0xca/0x110 trace_contention_end.constprop.0+0xea/0x170 __pv_queued_spin_lock_slowpath+0x28e/0xcc0 pv_queued_spin_lock_slowpath queued_spin_lock_slowpath queued_spin_lock do_raw_spin_lock+0x210/0x2c0 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x42/0x60 __put_partials+0xc3/0x170 qlink_free qlist_free_all+0x4e/0x140 kasan_quarantine_reduce+0x192/0x1e0 __kasan_slab_alloc+0x69/0x90 kasan_slab_alloc slab_post_alloc_hook slab_alloc_node kmem_cache_alloc_node_noprof+0x153/0x310 __alloc_skb+0x2b1/0x380 ...... -> #0 (&n->list_lock){-.-.}-{2:2}: check_prev_add check_prevs_add validate_chain __lock_acquire+0x2478/0x3b30 lock_acquire lock_acquire+0x1b1/0x560 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3a/0x60 get_partial_node.part.0+0x20/0x350 get_partial_node get_partial ___slab_alloc+0x65b/0x1870 __slab_alloc.constprop.0+0x56/0xb0 __slab_alloc_node slab_alloc_node __do_kmalloc_node __kmalloc_node_noprof+0x35c/0x440 kmalloc_node_noprof bpf_map_kmalloc_node+0x98/0x4a0 lpm_trie_node_alloc trie_update_elem+0x1ef/0xe00 bpf_map_update_value+0x2c1/0x6c0 map_update_elem+0x623/0x910 __sys_bpf+0x90c/0x49a0 ... other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&trie->lock); lock(&n->list_lock); lock(&trie->lock); lock(&n->list_lock); *** DEADLOCK *** [1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7 A bpf program attached to trace_contention_end() triggers after acquiring &n->list_lock. The program invokes trie_delete_elem(), which then acquires trie->lock. However, it is possible that another process is invoking trie_update_elem(). trie_update_elem() will acquire trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke get_partial_node() and try to acquire &n->list_lock (not necessarily the same lock object). Therefore, lockdep warns about the circular locking dependency. Invoking kmalloc() before acquiring trie->lock could fix the warning. However, since BPF programs call be invoked from any context (e.g., through kprobe/tracepoint/fentry), there may still be lock ordering problems for internal locks in kmalloc() or trie->lock itself. To eliminate these potential lock ordering problems with kmalloc()'s internal locks, replacing kmalloc()/kfree()/kfree_rcu() with equivalent BPF memory allocator APIs that can be invoked in any context. The lock ordering problems with trie->lock (e.g., reentrance) will be handled separately. Three aspects of this change require explanation: 1. Intermediate and leaf nodes are allocated from the same allocator. Since the value size of LPM trie is usually small, using a single alocator reduces the memory overhead of the BPF memory allocator. 2. Leaf nodes are allocated before disabling IRQs. This handles cases where leaf_size is large (e.g., > 4KB - 8) and updates require intermediate node allocation. If leaf nodes were allocated in IRQ-disabled region, the free objects in BPF memory allocator would not be refilled timely and the intermediate node allocation may fail. 3. Paired migrate_{disable|enable}() calls for node alloc and free. The BPF memory allocator uses per-CPU struct internally, these paired calls are necessary to guarantee correctness. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Fix exact match conditions in trie_get_next_key()Hou Tao
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify an exact match, However, it is incorrect because when the target key doesn't fully match the found node (e.g., node->prefixlen != matchlen), these two nodes may also have the same prefixlen. It will return expected result when the passed key exist in the trie. However when a recently-deleted key or nonexistent key is passed to trie_get_next_key(), it may skip keys and return incorrect result. Fix it by using node->prefixlen == matchlen to identify exact matches. When the condition is true after the search, it also implies node->prefixlen equals key->prefixlen, otherwise, the search would return NULL instead. Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Handle in-place update for full LPM trie correctlyHou Tao
When a LPM trie is full, in-place updates of existing elements incorrectly return -ENOSPC. Fix this by deferring the check of trie->n_entries. For new insertions, n_entries must not exceed max_entries. However, in-place updates are allowed even when the trie is full. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trieHou Tao
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST flags. These flags can be specified by users and are relevant since LPM trie supports exact matches during update. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elemHou Tao
There is no need to call kfree(im_node) when updating element fails, because im_node must be NULL. Remove the unnecessary kfree() for im_node. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06bpf: Remove unnecessary check when updating LPM trieHou Tao
When "node->prefixlen == matchlen" is true, it means that the node is fully matched. If "node->prefixlen == key->prefixlen" is false, it means the prefix length of key is greater than the prefix length of node, otherwise, matchlen will not be equal with node->prefixlen. However, it also implies that the prefix length of node must be less than max_prefixlen. Therefore, "node->prefixlen == trie->max_prefixlen" will always be false when the check of "node->prefixlen == key->prefixlen" returns false. Remove this unnecessary comparison. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04Merge branch 'fixes-for-stack-with-allow_ptr_leaks'Alexei Starovoitov
Kumar Kartikeya Dwivedi says: ==================== Fixes for stack with allow_ptr_leaks Two fixes for usability/correctness gaps when interacting with the stack without CAP_PERFMON (i.e. with allow_ptr_leaks = false). See the commits for details. I've verified that the tests fail when run without the fixes. Changelog: ---------- v3 -> v4 v3: https://lore.kernel.org/bpf/20241202083814.1888784-1-memxor@gmail.com * Address Andrii's comments * Fix bug paperered over by missing CAP_NET_ADMIN in verifier_mtu test * Add warning when undefined CAP_ constant is specified, and fail test * Reorder annotations to be more clear * Verify that fixes fail without patches again * Add Acked-by from Andrii v2 -> v3 v2: https://lore.kernel.org/bpf/20241127212026.3580542-1-memxor@gmail.com * Address comments from Eduard * Fix comment for mark_stack_slot_misc * We can simply always return early when stype == STACK_INVALID * Drop allow_ptr_leaks conditionals * Add Eduard's __caps_unpriv patch into the series * Convert test_verifier_mtu to use it * Move existing tests to __caps_unpriv annotation and verifier_spill_fill.c * Add Acked-by from Eduard v1 -> v2 v1: https://lore.kernel.org/bpf/20241127185135.2753982-1-memxor@gmail.com * Fix CI errors in selftest by removing dependence on BPF_ST ==================== Link: https://patch.msgid.link/20241204044757.1483141-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04selftests/bpf: Add test for narrow spill into 64-bit spilled scalarKumar Kartikeya Dwivedi
Add a test case to verify that without CAP_PERFMON, the test now succeeds instead of failing due to a verification error. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04selftests/bpf: Add test for reading from STACK_INVALID slotsKumar Kartikeya Dwivedi
Ensure that when CAP_PERFMON is dropped, and the verifier sees allow_ptr_leaks as false, we are not permitted to read from a STACK_INVALID slot. Without the fix, the test will report unexpected success in loading. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04selftests/bpf: Introduce __caps_unpriv annotation for testsEduard Zingerman
Add a __caps_unpriv annotation so that tests requiring specific capabilities while dropping the rest can conveniently specify them during selftest declaration instead of munging with capabilities at runtime from the testing binary. While at it, let us convert test_verifier_mtu to use this new support instead. Since we do not want to include linux/capability.h, we only defined the four main capabilities BPF subsystem deals with in bpf_misc.h for use in tests. If the user passes a CAP_SYS_NICE or anything else that's not defined in the header, capability parsing code will return a warning. Also reject strtol returning 0. CAP_CHOWN = 0 but we'll never need to use it, and strtol doesn't errno on failed conversion. Fail the test in such a case. The original diff for this idea is available at link [0]. [0]: https://lore.kernel.org/bpf/a1e48f5d9ae133e19adc6adf27e19d585e06bab4.camel@gmail.com Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> [ Kartikeya: rebase on bpf-next, add warn to parse_caps, convert test_verifier_mtu ] Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04bpf: Fix narrow scalar spill onto 64-bit spilled scalar slotsTao Lyu
When CAP_PERFMON and CAP_SYS_ADMIN (allow_ptr_leaks) are disabled, the verifier aims to reject partial overwrite on an 8-byte stack slot that contains a spilled pointer. However, in such a scenario, it rejects all partial stack overwrites as long as the targeted stack slot is a spilled register, because it does not check if the stack slot is a spilled pointer. Incomplete checks will result in the rejection of valid programs, which spill narrower scalar values onto scalar slots, as shown below. 0: R1=ctx() R10=fp0 ; asm volatile ( @ repro.bpf.c:679 0: (7a) *(u64 *)(r10 -8) = 1 ; R10=fp0 fp-8_w=1 1: (62) *(u32 *)(r10 -8) = 1 attempt to corrupt spilled pointer on stack processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0. Fix this by expanding the check to not consider spilled scalar registers when rejecting the write into the stack. Previous discussion on this patch is at link [0]. [0]: https://lore.kernel.org/bpf/20240403202409.2615469-1-tao.lyu@epfl.ch Fixes: ab125ed3ec1c ("bpf: fix check for attempt to corrupt spilled pointer") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_miscKumar Kartikeya Dwivedi
Inside mark_stack_slot_misc, we should not upgrade STACK_INVALID to STACK_MISC when allow_ptr_leaks is false, since invalid contents shouldn't be read unless the program has the relevant capabilities. The relaxation only makes sense when env->allow_ptr_leaks is true. However, such conversion in privileged mode becomes unnecessary, as invalid slots can be read without being upgraded to STACK_MISC. Currently, the condition is inverted (i.e. checking for true instead of false), simply remove it to restore correct behavior. Fixes: eaf18febd6eb ("bpf: preserve STACK_ZERO slots on partial reg spills") Acked-by: Andrii Nakryiko <andrii@kernel.org> Reported-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-03samples/bpf: Remove unnecessary -I flags from libbpf EXTRA_CFLAGSEduard Zingerman
Commit [0] breaks samples/bpf build: $ make M=samples/bpf ... make -C /path/to/kernel/samples/bpf/../../tools/lib/bpf \ ... EXTRA_CFLAGS=" \ ... -fsanitize=bounds \ -I/path/to/kernel/usr/include \ ... /path/to/kernel/samples/bpf/libbpf/libbpf.a install_headers CC /path/to/kernel/samples/bpf/libbpf/staticobjs/libbpf.o In file included from libbpf.c:29: /path/to/kernel/tools/include/linux/err.h:35:8: error: 'inline' can only appear on functions 35 | static inline void * __must_check ERR_PTR(long error_) | ^ The error is caused by `objtree` variable changing definition from `.` (dot) to an absolute path: - The variable TPROGS_CFLAGS is constructed as follows: ... TPROGS_CFLAGS += -I$(objtree)/usr/include - It is passed as EXTRA_CFLAGS for libbpf compilation: $(LIBBPF): ... ... $(MAKE) -C $(LIBBPF_SRC) RM='rm -rf' EXTRA_CFLAGS="$(TPROGS_CFLAGS)" - Before commit [0], the line passed to libbpf makefile was '-I./usr/include', where '.' referred to LIBBPF_SRC due to -C flag. The directory $(LIBBPF_SRC)/usr/include does not exist and thus was never resolved by C compiler. - After commit [0], the line passed to libbpf makefile became: '<output-dir>/usr/include', this directory exists and is resolved by C compiler. - Both 'tools/include' and 'usr/include' define files err.h and types.h. - libbpf expects headers like 'linux/err.h' and 'linux/types.h' defined in 'tools/include', not 'usr/include', hence the compilation error. This commit removes unnecessary -I flags from libbpf compilation. (libbpf sets up the necessary includes at lib/bpf/Makefile:63). Changes v1 [1] -> v2: - dropped unnecessary replacement of KBUILD_OUTPUT with $(objtree) (Andrii) Changes v2 [2] -> v3: - make sure --sysroot option is set for libbpf's EXTRA_CFLAGS, if $(SYSROOT) is set (Stanislav) [0] commit 13b25489b6f8 ("kbuild: change working directory to external module directory with M=") [1] https://lore.kernel.org/bpf/20241202212154.3174402-1-eddyz87@gmail.com/ [2] https://lore.kernel.org/bpf/20241202234741.3492084-1-eddyz87@gmail.com/ Fixes: 13b25489b6f8 ("kbuild: change working directory to external module directory with M=") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/bpf/20241203182222.3915763-1-eddyz87@gmail.com
2024-12-02bpf: Zero index arg error string for dynptr and iterKumar Kartikeya Dwivedi
Andrii spotted that process_dynptr_func's rejection of incorrect argument register type will print an error string where argument numbers are not zero-indexed, unlike elsewhere in the verifier. Fix this by subtracting 1 from regno. The same scenario exists for iterator messages. Fix selftest error strings that match on the exact argument number while we're at it to ensure clean bisection. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241203002235.3776418-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02Merge branch 'fix-missing-process_iter_arg-type-check'Alexei Starovoitov
Kumar Kartikeya Dwivedi says: ==================== Fix missing process_iter_arg type check I am taking over Tao's earlier patch set that can be found at [0], after an offline discussion. The bug reported in that thread is that process_iter_arg missed a reg->type == PTR_TO_STACK check. Fix this by adding it in, and also address comments from Andrii on the earlier attempt. Include more selftests to ensure the error is caught. [0]: https://lore.kernel.org/bpf/20241107214736.347630-1-tao.lyu@epfl.ch Changelog: ---------- v1 -> v2: v1: https://lore.kernel.org/bpf/20241127230147.4158201-1-memxor@gmail.com ==================== Link: https://patch.msgid.link/20241203000238.3602922-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02selftests/bpf: Add tests for iter arg checkKumar Kartikeya Dwivedi
Add selftests to cover argument type check for iterator kfuncs, and cover all three kinds (new, next, destroy). Without the fix in the previous patch, the selftest would not cause a verifier error. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241203000238.3602922-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-02bpf: Ensure reg is PTR_TO_STACK in process_iter_argTao Lyu
Currently, KF_ARG_PTR_TO_ITER handling missed checking the reg->type and ensuring it is PTR_TO_STACK. Instead of enforcing this in the caller of process_iter_arg, move the check into it instead so that all callers will gain the check by default. This is similar to process_dynptr_func. An existing selftest in verifier_bits_iter.c fails due to this change, but it's because it was passing a NULL pointer into iter_next helper and getting an error further down the checks, but probably meant to pass an uninitialized iterator on the stack (as is done in the subsequent test below it). We will gain coverage for non-PTR_TO_STACK arguments in later patches hence just change the declaration to zero-ed stack object. Fixes: 06accc8779c1 ("bpf: add support for open-coded iterator loops") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Lyu <tao.lyu@epfl.ch> [ Kartikeya: move check into process_iter_arg, rewrite commit log ] Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241203000238.3602922-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-29tools: Override makefile ARCH variable if defined, but emptyBjörn Töpel
There are a number of tools (bpftool, selftests), that require a "bootstrap" build. Here, a bootstrap build is a build host variant of a target. E.g., assume that you're performing a bpftool cross-build on x86 to riscv, a bootstrap build would then be an x86 variant of bpftool. The typical way to perform the host build variant, is to pass "ARCH=" in a sub-make. However, if a variable has been set with a command argument, then ordinary assignments in the makefile are ignored. This side-effect results in that ARCH, and variables depending on ARCH are not set. Workaround by overriding ARCH to the host arch, if ARCH is empty. Fixes: 8859b0da5aac ("tools/bpftool: Fix cross-build") Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Quentin Monnet <qmo@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/bpf/20241127101748.165693-1-bjorn@kernel.org
2024-11-26selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in ↵Zijian Zhang
test_sockmap Add this to more comprehensively test the socket memory accounting logic in the __SK_REDIRECT and __SK_DROP cases of tcp_bpf_sendmsg. We don't have test when apply_bytes are not zero in test_txmsg_redir_wait_sndmem. test_send_large has opt->rate=2, it will invoke sendmsg two times. Specifically, the first sendmsg will trigger the case where the ret value of tcp_bpf_sendmsg_redir is less than 0; while the second sendmsg happens after the 3 seconds timeout, and it will trigger __SK_DROP because socket c2 has been removed from the sockmap/hash. Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20241016234838.3167769-2-zijianzhang@bytedance.com
2024-11-26tcp_bpf: Fix the sk_mem_uncharge logic in tcp_bpf_sendmsgZijian Zhang
The current sk memory accounting logic in __SK_REDIRECT is pre-uncharging tosend bytes, which is either msg->sg.size or a smaller value apply_bytes. Potential problems with this strategy are as follows: - If the actual sent bytes are smaller than tosend, we need to charge some bytes back, as in line 487, which is okay but seems not clean. - When tosend is set to apply_bytes, as in line 417, and (ret < 0), we may miss uncharging (msg->sg.size - apply_bytes) bytes. [...] 415 tosend = msg->sg.size; 416 if (psock->apply_bytes && psock->apply_bytes < tosend) 417 tosend = psock->apply_bytes; [...] 443 sk_msg_return(sk, msg, tosend); 444 release_sock(sk); 446 origsize = msg->sg.size; 447 ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, 448 msg, tosend, flags); 449 sent = origsize - msg->sg.size; [...] 454 lock_sock(sk); 455 if (unlikely(ret < 0)) { 456 int free = sk_msg_free_nocharge(sk, msg); 458 if (!cork) 459 *copied -= free; 460 } [...] 487 if (eval == __SK_REDIRECT) 488 sk_mem_charge(sk, tosend - sent); [...] When running the selftest test_txmsg_redir_wait_sndmem with txmsg_apply, the following warning will be reported: ------------[ cut here ]------------ WARNING: CPU: 6 PID: 57 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x190/0x1a0 Modules linked in: CPU: 6 UID: 0 PID: 57 Comm: kworker/6:0 Not tainted 6.12.0-rc1.bm.1-amd64+ #43 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Workqueue: events sk_psock_destroy RIP: 0010:inet_sock_destruct+0x190/0x1a0 RSP: 0018:ffffad0a8021fe08 EFLAGS: 00010206 RAX: 0000000000000011 RBX: ffff9aab4475b900 RCX: ffff9aab481a0800 RDX: 0000000000000303 RSI: 0000000000000011 RDI: ffff9aab4475b900 RBP: ffff9aab4475b990 R08: 0000000000000000 R09: ffff9aab40050ec0 R10: 0000000000000000 R11: ffff9aae6fdb1d01 R12: ffff9aab49c60400 R13: ffff9aab49c60598 R14: ffff9aab49c60598 R15: dead000000000100 FS: 0000000000000000(0000) GS:ffff9aae6fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffec7e47bd8 CR3: 00000001a1a1c004 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x89/0x130 ? inet_sock_destruct+0x190/0x1a0 ? report_bug+0xfc/0x1e0 ? handle_bug+0x5c/0xa0 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? inet_sock_destruct+0x190/0x1a0 __sk_destruct+0x25/0x220 sk_psock_destroy+0x2b2/0x310 process_scheduled_works+0xa3/0x3e0 worker_thread+0x117/0x240 ? __pfx_worker_thread+0x10/0x10 kthread+0xcf/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x31/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- In __SK_REDIRECT, a more concise way is delaying the uncharging after sent bytes are finalized, and uncharge this value. When (ret < 0), we shall invoke sk_msg_free. Same thing happens in case __SK_DROP, when tosend is set to apply_bytes, we may miss uncharging (msg->sg.size - apply_bytes) bytes. The same warning will be reported in selftest. [...] 468 case __SK_DROP: 469 default: 470 sk_msg_free_partial(sk, msg, tosend); 471 sk_msg_apply_bytes(psock, tosend); 472 *copied -= (tosend + delta); 473 return -EACCES; [...] So instead of sk_msg_free_partial we can do sk_msg_free here. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Fixes: 8ec95b94716a ("bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues") Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20241016234838.3167769-3-zijianzhang@bytedance.com
2024-11-26selftests/bpf: Check for PREEMPTION instead of PREEMPTSebastian Andrzej Siewior
CONFIG_PREEMPT is a preemtion model the so called "Low-Latency Desktop". A different preemption model is PREEMPT_RT the so called "Real-Time". Both implement preemption in kernel and set CONFIG_PREEMPTION. There is also the so called "LAZY PREEMPT" which the "Scheduler controlled preemption model". Here we have also preemption in the kernel the rules are slightly different. Therefore the testsuite should not check for CONFIG_PREEMPT (as one model) but for CONFIG_PREEMPTION to figure out if preemption in the kernel is possible. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20241119161819.qvEcs-n_@linutronix.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25bpftool: fix potential NULL pointer dereferencing in prog_dump()Amir Mohammadi
A NULL pointer dereference could occur if ksyms is not properly checked before usage in the prog_dump() function. Fixes: b053b439b72a ("bpf: libbpf: bpftool: Print bpf_line_info during prog dump") Signed-off-by: Amir Mohammadi <amiremohamadi@yahoo.com> Reviewed-by: Quentin Monnet <qmo@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20241121083413.7214-1-amiremohamadi@yahoo.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25xsk: always clear DMA mapping information when unmapping the poolLarysa Zaremba
When the umem is shared, the DMA mapping is also shared between the xsk pools, therefore it should stay valid as long as at least 1 user remains. However, the pool also keeps the copies of DMA-related information that are initialized in the same way in xp_init_dma_info(), but cleared by xp_dma_unmap() only for the last remaining pool, this causes the problems below. The first one is that the commit adbf5a42341f ("ice: remove af_xdp_zc_qps bitmap") relies on pool->dev to determine the presence of a ZC pool on a given queue, avoiding internal bookkeeping. This works perfectly fine if the UMEM is not shared, but reliably fails otherwise as stated in the linked report. The second one is pool->dma_pages which is dynamically allocated and only freed in xp_dma_unmap(), this leads to a small memory leak. kmemleak does not catch it, but by printing the allocation results after terminating the userspace program it is possible to see that all addresses except the one belonging to the last detached pool are still accessible through the kmemleak dump functionality. Always clear the DMA mapping information from the pool and free pool->dma_pages when unmapping the pool, so that the only difference between results of the last remaining user's call and the ones before would be the destruction of the DMA mapping. Fixes: adbf5a42341f ("ice: remove af_xdp_zc_qps bitmap") Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings") Reported-by: Alasdair McWilliam <alasdair.mcwilliam@outlook.com> Closes: https://lore.kernel.org/PA4P194MB10056F208AF221D043F57A3D86512@PA4P194MB1005.EURP194.PROD.OUTLOOK.COM Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20241122112912.89881-1-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25Merge branch 'bpf-fix-oob-accesses-in-map_delete_elem-callbacks'Alexei Starovoitov
Maciej Fijalkowski says: ==================== bpf: fix OOB accesses in map_delete_elem callbacks v1->v2: - CC stable and collect tags from Toke & John Hi, Jordy reported that for big enough XSKMAPs and DEVMAPs, when deleting elements, OOB writes occur. Reproducer below: // compile with gcc -o map_poc map_poc.c -lbpf #include <errno.h> #include <linux/bpf.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/syscall.h> #include <unistd.h> int main() { // Create a large enough BPF XSK map int map_fd; union bpf_attr create_attr = { .map_type = BPF_MAP_TYPE_XSKMAP, .key_size = sizeof(int), .value_size = sizeof(int), .max_entries = 0x80000000 + 2, }; map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &create_attr, sizeof(create_attr)); if (map_fd < 0) { fprintf(stderr, "Failed to create BPF map: %s\n", strerror(errno)); return 1; } // Delete an element from the map using syscall unsigned int key = 0x80000000 + 1; if (syscall(SYS_bpf, BPF_MAP_DELETE_ELEM, &(union bpf_attr){ .map_fd = map_fd, .key = &key, }, sizeof(union bpf_attr)) < 0) { fprintf(stderr, "Failed to delete element from BPF map: %s\n", strerror(errno)); return 1; } close(map_fd); return 0; } This tiny series changes data types from int to u32 of keys being used for map accesses. Thanks, Maciej ==================== Link: https://patch.msgid.link/20241122121030.716788-1-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25bpf: fix OOB devmap writes when deleting elementsMaciej Fijalkowski
Jordy reported issue against XSKMAP which also applies to DEVMAP - the index used for accessing map entry, due to being a signed integer, causes the OOB writes. Fix is simple as changing the type from int to u32, however, when compared to XSKMAP case, one more thing needs to be addressed. When map is released from system via dev_map_free(), we iterate through all of the entries and an iterator variable is also an int, which implies OOB accesses. Again, change it to be u32. Example splat below: [ 160.724676] BUG: unable to handle page fault for address: ffffc8fc2c001000 [ 160.731662] #PF: supervisor read access in kernel mode [ 160.736876] #PF: error_code(0x0000) - not-present page [ 160.742095] PGD 0 P4D 0 [ 160.744678] Oops: Oops: 0000 [#1] PREEMPT SMP [ 160.749106] CPU: 1 UID: 0 PID: 520 Comm: kworker/u145:12 Not tainted 6.12.0-rc1+ #487 [ 160.757050] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [ 160.767642] Workqueue: events_unbound bpf_map_free_deferred [ 160.773308] RIP: 0010:dev_map_free+0x77/0x170 [ 160.777735] Code: 00 e8 fd 91 ed ff e8 b8 73 ed ff 41 83 7d 18 19 74 6e 41 8b 45 24 49 8b bd f8 00 00 00 31 db 85 c0 74 48 48 63 c3 48 8d 04 c7 <48> 8b 28 48 85 ed 74 30 48 8b 7d 18 48 85 ff 74 05 e8 b3 52 fa ff [ 160.796777] RSP: 0018:ffffc9000ee1fe38 EFLAGS: 00010202 [ 160.802086] RAX: ffffc8fc2c001000 RBX: 0000000080000000 RCX: 0000000000000024 [ 160.809331] RDX: 0000000000000000 RSI: 0000000000000024 RDI: ffffc9002c001000 [ 160.816576] RBP: 0000000000000000 R08: 0000000000000023 R09: 0000000000000001 [ 160.823823] R10: 0000000000000001 R11: 00000000000ee6b2 R12: dead000000000122 [ 160.831066] R13: ffff88810c928e00 R14: ffff8881002df405 R15: 0000000000000000 [ 160.838310] FS: 0000000000000000(0000) GS:ffff8897e0c40000(0000) knlGS:0000000000000000 [ 160.846528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 160.852357] CR2: ffffc8fc2c001000 CR3: 0000000005c32006 CR4: 00000000007726f0 [ 160.859604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 160.866847] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 160.874092] PKRU: 55555554 [ 160.876847] Call Trace: [ 160.879338] <TASK> [ 160.881477] ? __die+0x20/0x60 [ 160.884586] ? page_fault_oops+0x15a/0x450 [ 160.888746] ? search_extable+0x22/0x30 [ 160.892647] ? search_bpf_extables+0x5f/0x80 [ 160.896988] ? exc_page_fault+0xa9/0x140 [ 160.900973] ? asm_exc_page_fault+0x22/0x30 [ 160.905232] ? dev_map_free+0x77/0x170 [ 160.909043] ? dev_map_free+0x58/0x170 [ 160.912857] bpf_map_free_deferred+0x51/0x90 [ 160.917196] process_one_work+0x142/0x370 [ 160.921272] worker_thread+0x29e/0x3b0 [ 160.925082] ? rescuer_thread+0x4b0/0x4b0 [ 160.929157] kthread+0xd4/0x110 [ 160.932355] ? kthread_park+0x80/0x80 [ 160.936079] ret_from_fork+0x2d/0x50 [ 160.943396] ? kthread_park+0x80/0x80 [ 160.950803] ret_from_fork_asm+0x11/0x20 [ 160.958482] </TASK> Fixes: 546ac1ffb70d ("bpf: add devmap, a map for storing net device references") CC: stable@vger.kernel.org Reported-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Jordy Zomer <jordyzomer@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20241122121030.716788-3-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25xsk: fix OOB map writes when deleting elementsMaciej Fijalkowski
Jordy says: " In the xsk_map_delete_elem function an unsigned integer (map->max_entries) is compared with a user-controlled signed integer (k). Due to implicit type conversion, a large unsigned value for map->max_entries can bypass the intended bounds check: if (k >= map->max_entries) return -EINVAL; This allows k to hold a negative value (between -2147483648 and -2), which is then used as an array index in m->xsk_map[k], which results in an out-of-bounds access. spin_lock_bh(&m->lock); map_entry = &m->xsk_map[k]; // Out-of-bounds map_entry old_xs = unrcu_pointer(xchg(map_entry, NULL)); // Oob write if (old_xs) xsk_map_sock_delete(old_xs, map_entry); spin_unlock_bh(&m->lock); The xchg operation can then be used to cause an out-of-bounds write. Moreover, the invalid map_entry passed to xsk_map_sock_delete can lead to further memory corruption. " It indeed results in following splat: [76612.897343] BUG: unable to handle page fault for address: ffffc8fc2e461108 [76612.904330] #PF: supervisor write access in kernel mode [76612.909639] #PF: error_code(0x0002) - not-present page [76612.914855] PGD 0 P4D 0 [76612.917431] Oops: Oops: 0002 [#1] PREEMPT SMP [76612.921859] CPU: 11 UID: 0 PID: 10318 Comm: a.out Not tainted 6.12.0-rc1+ #470 [76612.929189] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [76612.939781] RIP: 0010:xsk_map_delete_elem+0x2d/0x60 [76612.944738] Code: 00 00 41 54 55 53 48 63 2e 3b 6f 24 73 38 4c 8d a7 f8 00 00 00 48 89 fb 4c 89 e7 e8 2d bf 05 00 48 8d b4 eb 00 01 00 00 31 ff <48> 87 3e 48 85 ff 74 05 e8 16 ff ff ff 4c 89 e7 e8 3e bc 05 00 31 [76612.963774] RSP: 0018:ffffc9002e407df8 EFLAGS: 00010246 [76612.969079] RAX: 0000000000000000 RBX: ffffc9002e461000 RCX: 0000000000000000 [76612.976323] RDX: 0000000000000001 RSI: ffffc8fc2e461108 RDI: 0000000000000000 [76612.983569] RBP: ffffffff80000001 R08: 0000000000000000 R09: 0000000000000007 [76612.990812] R10: ffffc9002e407e18 R11: ffff888108a38858 R12: ffffc9002e4610f8 [76612.998060] R13: ffff888108a38858 R14: 00007ffd1ae0ac78 R15: ffffc9002e4610c0 [76613.005303] FS: 00007f80b6f59740(0000) GS:ffff8897e0ec0000(0000) knlGS:0000000000000000 [76613.013517] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [76613.019349] CR2: ffffc8fc2e461108 CR3: 000000011e3ef001 CR4: 00000000007726f0 [76613.026595] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [76613.033841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [76613.041086] PKRU: 55555554 [76613.043842] Call Trace: [76613.046331] <TASK> [76613.048468] ? __die+0x20/0x60 [76613.051581] ? page_fault_oops+0x15a/0x450 [76613.055747] ? search_extable+0x22/0x30 [76613.059649] ? search_bpf_extables+0x5f/0x80 [76613.063988] ? exc_page_fault+0xa9/0x140 [76613.067975] ? asm_exc_page_fault+0x22/0x30 [76613.072229] ? xsk_map_delete_elem+0x2d/0x60 [76613.076573] ? xsk_map_delete_elem+0x23/0x60 [76613.080914] __sys_bpf+0x19b7/0x23c0 [76613.084555] __x64_sys_bpf+0x1a/0x20 [76613.088194] do_syscall_64+0x37/0xb0 [76613.091832] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [76613.096962] RIP: 0033:0x7f80b6d1e88d [76613.100592] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48 [76613.119631] RSP: 002b:00007ffd1ae0ac68 EFLAGS: 00000206 ORIG_RAX: 0000000000000141 [76613.131330] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f80b6d1e88d [76613.142632] RDX: 0000000000000098 RSI: 00007ffd1ae0ad20 RDI: 0000000000000003 [76613.153967] RBP: 00007ffd1ae0adc0 R08: 0000000000000000 R09: 0000000000000000 [76613.166030] R10: 00007f80b6f77040 R11: 0000000000000206 R12: 00007ffd1ae0aed8 [76613.177130] R13: 000055ddf42ce1e9 R14: 000055ddf42d0d98 R15: 00007f80b6fab040 [76613.188129] </TASK> Fix this by simply changing key type from int to u32. Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP") CC: stable@vger.kernel.org Reported-by: Jordy Zomer <jordyzomer@google.com> Suggested-by: Jordy Zomer <jordyzomer@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20241122121030.716788-2-maciej.fijalkowski@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25Merge branch 'bpf-vsock-fix-poll-and-close'Alexei Starovoitov
Michal Luczaj says: ==================== bpf, vsock: Fix poll() and close() Two small fixes for vsock: poll() missing a queue check, and close() not invoking sockmap cleanup. Signed-off-by: Michal Luczaj <mhal@rbox.co> Acked-by: John Fastabend <john.fastabend@gmail.com> --- ==================== Link: https://patch.msgid.link/20241118-vsock-bpf-poll-close-v1-0-f1b9669cacdc@rbox.co Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-25selftest/bpf: Add test for vsock removal from sockmap on close()Michal Luczaj
Make sure the proto::close callback gets invoked on vsock release. Signed-off-by: Michal Luczaj <mhal@rbox.co> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20241118-vsock-bpf-poll-close-v1-4-f1b9669cacdc@rbox.co Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com>
2024-11-25bpf, vsock: Invoke proto::close on close()Michal Luczaj
vsock defines a BPF callback to be invoked when close() is called. However, this callback is never actually executed. As a result, a closed vsock socket is not automatically removed from the sockmap/sockhash. Introduce a dummy vsock_close() and make vsock_release() call proto::close. Note: changes in __vsock_release() look messy, but it's only due to indent level reduction and variables xmas tree reorder. Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://lore.kernel.org/r/20241118-vsock-bpf-poll-close-v1-3-f1b9669cacdc@rbox.co Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com>
2024-11-25selftest/bpf: Add test for af_vsock poll()Michal Luczaj
Verify that vsock's poll() notices when sk_psock::ingress_msg isn't empty. Signed-off-by: Michal Luczaj <mhal@rbox.co> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20241118-vsock-bpf-poll-close-v1-2-f1b9669cacdc@rbox.co Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com>
2024-11-25bpf, vsock: Fix poll() missing a queueMichal Luczaj
When a verdict program simply passes a packet without redirection, sk_msg is enqueued on sk_psock::ingress_msg. Add a missing check to poll(). Fixes: 634f1a7110b4 ("vsock: support sockmap") Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://lore.kernel.org/r/20241118-vsock-bpf-poll-close-v1-1-f1b9669cacdc@rbox.co Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com>
2024-11-25bpf, lsm: Remove getlsmprop hooks BTF IDsThomas Weißschuh
These hooks are not useful for BPF LSM currently. Furthermore a recent renaming introduced build warnings: BTFIDS vmlinux WARN: resolve_btfids: unresolved symbol bpf_lsm_task_getsecid_obj WARN: resolve_btfids: unresolved symbol bpf_lsm_current_getsecid_subj Link: https://lore.kernel.org/lkml/20241123-bpf_lsm_task_getsecid_obj-v1-1-0d0f94649e05@weissschuh.net/ Fixes: 37f670aacd48 ("lsm: use lsm_prop in security_current_getsecid") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20241125-bpf_lsm_task_getsecid_obj-v2-1-c8395bde84e0@weissschuh.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "The biggest change here is eliminating the awful idea that KVM had of essentially guessing which pfns are refcounted pages. The reason to do so was that KVM needs to map both non-refcounted pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain refcounted pages. However, the result was security issues in the past, and more recently the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by struct page but is not refcounted. In particular this broke virtio-gpu blob resources (which directly map host graphics buffers into the guest as "vram" for the virtio-gpu device) with the amdgpu driver, because amdgpu allocates non-compound higher order pages and the tail pages could not be mapped into KVM. This requires adjusting all uses of struct page in the per-architecture code, to always work on the pfn whenever possible. The large series that did this, from David Stevens and Sean Christopherson, also cleaned up substantially the set of functions that provided arch code with the pfn for a host virtual addresses. The previous maze of twisty little passages, all different, is replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages) saving almost 200 lines of code. ARM: - Support for stage-1 permission indirection (FEAT_S1PIE) and permission overlays (FEAT_S1POE), including nested virt + the emulated page table walker - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call was introduced in PSCIv1.3 as a mechanism to request hibernation, similar to the S4 state in ACPI - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As part of it, introduce trivial initialization of the host's MPAM context so KVM can use the corresponding traps - PMU support under nested virtualization, honoring the guest hypervisor's trap configuration and event filtering when running a nested guest - Fixes to vgic ITS serialization where stale device/interrupt table entries are not zeroed when the mapping is invalidated by the VM - Avoid emulated MMIO completion if userspace has requested synchronous external abort injection - Various fixes and cleanups affecting pKVM, vCPU initialization, and selftests LoongArch: - Add iocsr and mmio bus simulation in kernel. - Add in-kernel interrupt controller emulation. - Add support for virtualization extensions to the eiointc irqchip. PPC: - Drop lingering and utterly obsolete references to PPC970 KVM, which was removed 10 years ago. - Fix incorrect documentation references to non-existing ioctls RISC-V: - Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side s390: - New selftests: more ucontrol selftests and CPU model sanity checks - Support for the gen17 CPU model - List registers supported by KVM_GET/SET_ONE_REG in the documentation x86: - Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve documentation, harden against unexpected changes. Even if the hardware A/D tracking is disabled, it is possible to use the hardware-defined A/D bits to track if a PFN is Accessed and/or Dirty, and that removes a lot of special cases. - Elide TLB flushes when aging secondary PTEs, as has been done in x86's primary MMU for over 10 years. - Recover huge pages in-place in the TDP MMU when dirty page logging is toggled off, instead of zapping them and waiting until the page is re-accessed to create a huge mapping. This reduces vCPU jitter. - Batch TLB flushes when dirty page logging is toggled off. This reduces the time it takes to disable dirty logging by ~3x. - Remove the shrinker that was (poorly) attempting to reclaim shadow page tables in low-memory situations. - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Advertise CPUIDs for new instructions in Clearwater Forest - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which in turn can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe; harden the cache accessors to try to prevent similar issues from occuring in the future. The issue that triggered this change was already fixed in 6.12, but was still kinda latent. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups - Switch hugepage recovery thread to use vhost_task. These kthreads can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory); therefore KVM tried to place the thread in the VM's cgroups and charge the CPU time consumed by that work to the VM's container. However the kthreads did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM instances inside could not complete freezing. Fix this by replacing the kthread with a PF_USER_WORKER thread, via the vhost_task abstraction. Another 100+ lines removed, with generally better behavior too like having these threads properly parented in the process tree. - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't really work; there was really nothing to work around anyway: the broken patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the erratum. - Fix 6.12 regression where CONFIG_KVM will be built as a module even if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'. x86 selftests: - x86 selftests can now use AVX. Documentation: - Use rST internal links - Reorganize the introduction to the API document Generic: - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead of RCU, so that running a vCPU on a different task doesn't encounter long due to having to wait for all CPUs become quiescent. In general both reads and writes are rare, but userspace that supports confidential computing is introducing the use of "helper" vCPUs that may jump from one host processor to another. Those will be very happy to trigger a synchronize_rcu(), and the effect on performance is quite the disaster" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits) KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD KVM: x86: add back X86_LOCAL_APIC dependency Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()" KVM: x86: switch hugepage recovery thread to vhost_task KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest Documentation: KVM: fix malformed table irqchip/loongson-eiointc: Add virt extension support LoongArch: KVM: Add irqfd support LoongArch: KVM: Add PCHPIC user mode read and write functions LoongArch: KVM: Add PCHPIC read and write functions LoongArch: KVM: Add PCHPIC device support LoongArch: KVM: Add EIOINTC user mode read and write functions LoongArch: KVM: Add EIOINTC read and write functions LoongArch: KVM: Add EIOINTC device support LoongArch: KVM: Add IPI user mode read and write function LoongArch: KVM: Add IPI read and write function LoongArch: KVM: Add IPI device support LoongArch: KVM: Add iocsr and mmio bus simulation in kernel KVM: arm64: Pass on SVE mapping failures ...
2024-11-23Merge tag 'powerpc-6.13-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Rework kfence support for the HPT MMU to work on systems with >= 16TB of RAM. - Remove the powerpc "maple" platform, used by the "Yellow Dog Powerstation". - Add support for DYNAMIC_FTRACE_WITH_CALL_OPS, DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines. - Add support for running KVM nested guests on Power11. - Other small features, cleanups and fixes. Thanks to Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa Shulyupin, David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert Uytterhoeven, Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard, Lukas Bulwahn, Madhavan Srinivasan, Markus Elfring, Michal Suchanek, Ming Lei, Mukesh Kumar Chaurasiya, Nathan Chancellor, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, Paulo Miguel Almeida, Pavithra Prakash, Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P Bappalige, Shen Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten Blum, Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun, and zhang jiao. * tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (89 commits) EDAC/powerpc: Remove PPC_MAPLE drivers powerpc/perf: Add per-task/process monitoring to vpa_pmu driver powerpc/kvm: Add vpa latency counters to kvm_vcpu_arch docs: ABI: sysfs-bus-event_source-devices-vpa-pmu: Document sysfs event format entries for vpa_pmu powerpc/perf: Add perf interface to expose vpa counters MAINTAINERS: powerpc: Mark Maddy as "M" powerpc/Makefile: Allow overriding CPP powerpc-km82xx.c: replace of_node_put() with __free ps3: Correct some typos in comments powerpc/kexec: Fix return of uninitialized variable macintosh: Use common error handling code in via_pmu_led_init() powerpc/powermac: Use of_property_match_string() in pmac_has_backlight_type() powerpc: remove dead config options for MPC85xx platform support powerpc/xive: Use cpumask_intersects() selftests/powerpc: Remove the path after initialization. powerpc/xmon: symbol lookup length fixed powerpc/ep8248e: Use %pa to format resource_size_t powerpc/ps3: Reorganize kerneldoc parameter names KVM: PPC: Book3S HV: Fix kmv -> kvm typo powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static ...
2024-11-23Merge tag 'mm-stable-2024-11-18-19-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...
2024-11-22Merge tag '6.13-rc-part1-SMB3-client-fixes' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull smb client updates from Steve French: - Fix two SMB3.1.1 POSIX Extensions problems - Fixes for special file handling (symlinks and FIFOs) - Improve compounding - Four cleanup patches - Fix use after free in signing - Add support for handling namespaces for reconnect related upcalls (e.g. for DNS names resolution and auth) - Fix various directory lease problems (directory entry caching), including some important potential use after frees * tag '6.13-rc-part1-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: prevent use-after-free due to open_cached_dir error paths smb: Don't leak cfid when reconnect races with open_cached_dir smb: client: handle max length for SMB symlinks smb: client: get rid of bounds check in SMB2_ioctl_init() smb: client: improve compound padding in encryption smb3: request handle caching when caching directories cifs: Recognize SFU char/block devices created by Windows NFS server on Windows Server <<2012 CIFS: New mount option for cifs.upcall namespace resolution smb/client: Prevent error pointer dereference fs/smb/client: implement chmod() for SMB3 POSIX Extensions smb: cached directories can be more than root file handle smb: client: fix use-after-free of signing key smb: client: Use str_yes_no() helper function smb: client: memcpy() with surrounding object base address cifs: Remove pre-historic unused CIFSSMBCopy
2024-11-22Merge tag 'ovl-update-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs updates from Amir Goldstein: - Fix a syzbot reported NULL pointer deref with bfs lower layers - Fix a copy up failure of large file from lower fuse fs - Followup cleanup of backing_file API from Miklos - Introduction and use of revert/override_creds_light() helpers, that were suggested by Christian as a mitigation to cache line bouncing and false sharing of fields in overlayfs creator_cred long lived struct cred copy. - Store up to two backing file references (upper and lower) in an ovl_file container instead of storing a single backing file in file->private_data. This is used to avoid the practice of opening a short lived backing file for the duration of some file operations and to avoid the specialized use of FDPUT_FPUT in such occasions, that was getting in the way of Al's fd_file() conversions. * tag 'ovl-update-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: Filter invalid inodes with missing lookup function ovl: convert ovl_real_fdget() callers to ovl_real_file() ovl: convert ovl_real_fdget_path() callers to ovl_real_file_path() ovl: store upper real file in ovl_file struct ovl: allocate a container struct ovl_file for ovl private context ovl: do not open non-data lower file for fsync ovl: Optimize override/revert creds ovl: pass an explicit reference of creators creds to callers ovl: use wrapper ovl_revert_creds() fs/backing-file: Convert to revert/override_creds_light() cred: Add a light version of override/revert_creds() backing-file: clean up the API ovl: properly handle large files in ovl_security_fileattr
2024-11-22Merge tag 'unicode-next-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode updates from Gabriel Krisman Bertazi: - constify a read-only struct (Thomas Weißschuh) - fix the error path of unicode_load, avoiding a possible kernel oops if it fails to find the unicode module (André Almeida) - documentation fix, updating a filename in the README (Gan Jie) - add the link of my tree to MAINTAINERS (André Almeida) * tag 'unicode-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: MAINTAINERS: Add Unicode tree unicode: change the reference of database file unicode: Fix utf8_load() error path unicode: constify utf8 data table
2024-11-22Merge tag 'sysctl-6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: "sysctl ctl_table constification: - Constifying ctl_table structs prevents the modification of proc_handler function pointers. All ctl_table struct arguments are const qualified in the sysctl API in such a way that the ctl_table arrays being defined elsewhere and passed through sysctl can be constified one-by-one. We kick the constification off by qualifying user_table in kernel/ucount.c and expect all the ctl_tables to be constified in the coming releases. Misc fixes: - Adjust comments in two places to better reflect the code - Remove superfluous dput calls - Remove Luis from sysctl maintainership - Replace comments about holding a lock with calls to lockdep_assert_held" * tag 'sysctl-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: Reduce dput(child) calls in proc_sys_fill_cache() sysctl: Reorganize kerneldoc parameter names ucounts: constify sysctl table user_table sysctl: update comments to new registration APIs MAINTAINERS: remove me from sysctl sysctl: Convert locking comments to lockdep assertions const_structs.checkpatch: add ctl_table sysctl: make internal ctl_tables const sysctl: allow registration of const struct ctl_table sysctl: move internal interfaces to const struct ctl_table bpf: Constify ctl_table argument of filter function
2024-11-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Seveal fixes scattered across the drivers and a few new features: - Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns - Force disassociate the userspace FD when hns does an async reset - bnxt new features for optimized modify QP to skip certain stayes, CQ coalescing, better debug dumping - mlx5 new data placement ordering feature - Faster destruction of mlx5 devx HW objects - Improvements to RDMA CM mad handling" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (51 commits) RDMA/bnxt_re: Correct the sequence of device suspend RDMA/bnxt_re: Use the default mode of congestion control RDMA/bnxt_re: Support different traffic class IB/cm: Rework sending DREQ when destroying a cm_id IB/cm: Do not hold reference on cm_id unless needed IB/cm: Explicitly mark if a response MAD is a retransmission RDMA/mlx5: Move events notifier registration to be after device registration RDMA/bnxt_re: Cache MSIx info to a local structure RDMA/bnxt_re: Refurbish CQ to NQ hash calculation RDMA/bnxt_re: Refactor NQ allocation RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved RDMA/hns: Fix different dgids mapping to the same dip_idx RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design bnxt_en: Add support for RoCE sriov configuration RDMA/hns: Fix NULL pointer derefernce in hns_roce_map_mr_sg() RDMA/hns: Fix out-of-order issue of requester when setting FENCE RDMA/nldev: Add IB device and net device rename events RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation RDMA/core: Move ib_uverbs_file struct to uverbs_types.h ...
2024-11-22Merge tag 'iommu-updates-v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu updates from Joerg Roedel: "Core Updates: - Convert call-sites using iommu_domain_alloc() to more specific versions and remove function - Introduce iommu_paging_domain_alloc_flags() - Extend support for allocating PASID-capable domains to more drivers - Remove iommu_present() - Some smaller improvements New IOMMU driver for RISC-V Intel VT-d Updates: - Add domain_alloc_paging support - Enable user space IOPFs in non-PASID and non-svm cases - Small code refactoring and cleanups - Add domain replacement support for pasid AMD-Vi Updates: - Adapt to iommu_paging_domain_alloc_flags() interface and alloc V2 page-tables by default - Replace custom domain ID allocator with IDA allocator - Add ops->release_domain() support - Other improvements to device attach and domain allocation code paths ARM-SMMU Updates: - SMMUv2: - Return -EPROBE_DEFER for client devices probing before their SMMU - Devicetree binding updates for Qualcomm MMU-500 implementations - SMMUv3: - Minor fixes and cleanup for NVIDIA's virtual command queue driver - IO-PGTable: - Fix indexing of concatenated PGDs and extend selftest coverage - Remove unused block-splitting support S390 IOMMU: - Implement support for blocking domain Mediatek IOMMU: - Enable 35-bit physical address support for mt8186 OMAP IOMMU driver: - Adapt to recent IOMMU core changes and unbreak driver" * tag 'iommu-updates-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (92 commits) iommu/tegra241-cmdqv: Fix alignment failure at max_n_shift iommu: Make set_dev_pasid op support domain replacement iommu/arm-smmu-v3: Make set_dev_pasid() op support replace iommu/vt-d: Add set_dev_pasid callback for nested domain iommu/vt-d: Make identity_domain_set_dev_pasid() to handle domain replacement iommu/vt-d: Make intel_svm_set_dev_pasid() support domain replacement iommu/vt-d: Limit intel_iommu_set_dev_pasid() for paging domain iommu/vt-d: Make intel_iommu_set_dev_pasid() to handle domain replacement iommu/vt-d: Add iommu_domain_did() to get did iommu/vt-d: Consolidate the struct dev_pasid_info add/remove iommu/vt-d: Add pasid replace helpers iommu/vt-d: Refactor the pasid setup helpers iommu/vt-d: Add a helper to flush cache for updating present pasid entry iommu: Pass old domain to set_dev_pasid op iommu/iova: Fix typo 'adderss' iommu: Add a kdoc to iommu_unmap() iommu/io-pgtable-arm-v7s: Remove split on unmap behavior iommu/io-pgtable-arm: Remove split on unmap behavior iommu/vt-d: Drain PRQs when domain removed from RID iommu/vt-d: Drop pasid requirement for prq initialization ...
2024-11-22Merge tag 'thermal-6.13-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more thermal control updates from Rafael Wysocki: "These update a few thermal drivers used on ARM platforms and thermal tools: - Add SAR2130P compatible to DT bindings in the QCom Tsens driver (Dmitry Baryshkov) - Add static annotation to arrays describing platform sensors in the LVTS Mediatek driver (Colin Ian King) - Switch back to struct platform_driver::remove() from the previous callbacks prototype rework (Uwe Kleine-König) - Add MSM8937 compatible to DT bindings and its support in the QCom Tsens driver (Barnabás Czémán) - Remove a pointless sign test on an unsigned value in k3_bgp_read_temp() in the k3_j72xx_bandgap driver (Rex Nie) - Fix a pointer reference loss when realloc() fails in the thermal library (Zhang Jiao)" * tag 'thermal-6.13-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: tools/thermal: Fix common realloc mistake thermal/drivers/k3_j72xx_bandgap: Simplify code in k3_bgp_read_temp() thermal/drivers/qcom/tsens-v1: Add support for MSM8937 tsens dt-bindings: thermal: tsens: Add MSM8937 thermal: Switch back to struct platform_driver::remove() thermal/drivers/mediatek/lvts_thermal: Make read-only arrays static const dt-bindings: thermal: qcom-tsens: Add SAR2130P compatible
2024-11-22Merge tag 'pm-6.13-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These mostly are updates of cpufreq drivers used on ARM platforms plus one new DT-based cpufreq driver for virtualized guests and two cpuidle changes that should not make any difference on systems currently in the field, but will be needed for future development: - Add virtual cpufreq driver for guest kernels (David Dai) - Minor cleanup to various cpufreq drivers (Andy Shevchenko, Dhruva Gole, Jie Zhan, Jinjie Ruan, Shuosheng Huang, Sibi Sankar, and Yuan Can) - Revert "cpufreq: brcmstb-avs-cpufreq: Fix initial command check" (Colin Ian King) - Improve DT bindings for qcom-hw driver (Dmitry Baryshkov, Konrad Dybcio, and Nikunj Kela) - Make cpuidle_play_dead() try all idle states with :enter_dead() callbacks and change their return type to void (Rafael Wysocki)" * tag 'pm-6.13-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (22 commits) cpuidle: Change :enter_dead() driver callback return type to void cpuidle: Do not return from cpuidle_play_dead() on callback failures arm64: dts: qcom: sc8180x: Add a SoC-specific compatible to cpufreq-hw dt-bindings: cpufreq: cpufreq-qcom-hw: Add SC8180X compatible cpufreq: sun50i: add a100 cpufreq support cpufreq: mediatek-hw: Fix wrong return value in mtk_cpufreq_get_cpu_power() cpufreq: CPPC: Fix wrong return value in cppc_get_cpu_power() cpufreq: CPPC: Fix wrong return value in cppc_get_cpu_cost() cpufreq: loongson3: Check for error code from devm_mutex_init() call cpufreq: scmi: Fix cleanup path when boost enablement fails cpufreq: CPPC: Fix possible null-ptr-deref for cppc_get_cpu_cost() cpufreq: CPPC: Fix possible null-ptr-deref for cpufreq_cpu_get_raw() Revert "cpufreq: brcmstb-avs-cpufreq: Fix initial command check" dt-bindings: cpufreq: cpufreq-qcom-hw: Add SAR2130P compatible cpufreq: add virtual-cpufreq driver dt-bindings: cpufreq: add virtual cpufreq device cpufreq: loongson2: Unregister platform_driver on failure cpufreq: ti-cpufreq: Remove revision offsets in AM62 family cpufreq: ti-cpufreq: Allow backward compatibility for efuse syscon cppc_cpufreq: Remove HiSilicon CPPC workaround ...
2024-11-22Merge tag 'tpmdd-next-6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull TPM updates from Jarkko Sakkinen: "Only some minor tweaks" * tag 'tpmdd-next-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm: atmel: Drop PPC64 specific MMIO setup char: tpm: cr50: Add new device/vendor ID 0x50666666 char: tpm: cr50: Move i2c locking to request/relinquish locality ops char: tpm: cr50: Use generic request/relinquish locality ops tpm: ibmvtpm: Set TPM_OPS_AUTO_STARTUP flag on driver
2024-11-22Merge tag 'mtd/for-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux Pull MTD updates from Miquel Raynal: "MTD device changes: - switch platform_driver back to remove() - misc fixes SPI-NAND changes: - a load of fixes to Winbond manufacturer driver - structure constification Raw NAND changes: - improve the power management of the GPMI driver - Davinci driver clean-ups - fix leak in the Atmel driver - fix some typos in the core SPI NOR changes: - Introduce byte swap support for 8D-8D-8D mode and a user for it: macronix. SPI NOR flashes may swap the bytes on a 16-bit boundary when configured in Octal DTR mode. For such cases the byte order is propagated through SPI MEM to the SPI controllers so that the controllers swap the bytes back at runtime. This avoids breaking the boot sequence because of the endianness problems that appear when the bootloaders use 1-1-1 and the kernel uses 8D-8D-8D with byte swap support. Along with the SPI MEM byte swap support we queue a patch for the SPI MXIC controller that swaps the bytes back at runtime" * tag 'mtd/for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (25 commits) mtd: spi-nor: core: replace dummy buswidth from addr to data mtd: spi-nor: winbond: add "w/ and w/o SFDP" comment mtd: spi-nor: spansion: Use nor->addr_nbytes in octal DTR mode in RD_ANY_REG_OP mtd: Switch back to struct platform_driver::remove() mtd: cfi_cmdset_0002: remove redundant assignment to variable ret mtd: spinand: Constify struct nand_ecc_engine_ops MAINTAINERS: add mailing list for GPMI NAND driver mtd: spinand: winbond: Sort the devices mtd: spinand: winbond: Ignore the last ID characters mtd: spinand: winbond: Fix 512GW, 01GW, 01JW and 02JW ECC information mtd: spinand: winbond: Fix 512GW and 02JW OOB layout mtd: nand: raw: gpmi: improve power management handling mtd: nand: raw: gpmi: switch to SYSTEM_SLEEP_PM_OPS mtd: rawnand: davinci: use generic device property helpers mtd: rawnand: davinci: break the line correctly mtd: rawnand: davinci: order headers alphabetically mtd: rawnand: atmel: Fix possible memory leak mtd: rawnand: Correct multiple typos in comments mtd: hyperbus: rpc-if: Add missing MODULE_DEVICE_TABLE mtd: spi-nor: add support for Macronix Octal flash ...
2024-11-22Merge tag 'clk-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk updates from Stephen Boyd: "The core framework gained a clk provider helper, a clk consumer helper, and some unit tests for the assigned clk rates feature in DeviceTree. On the vendor driver side, we gained a whole pile of SoC driver support detailed below. The majority in the diffstat is Qualcomm, but there's also quite a few Samsung and Mediatek clk driver additions in here as well. The top vendors is quite common, but the sheer amount of new drivers is uncommon, so I'm anticipating a larger number of fixes for clk drivers this cycle. Core: - devm_clk_bulk_get_all_enabled() to return number of clks acquired - devm_clk_hw_register_gate_parent_hw() helper to modernize drivers - KUnit tests for clk-assigned-rates{,-u64} New Drivers: - Marvell PXA1908 SoC clks - Mobileye EyeQ5, EyeQ6L and EyeQ6H clk driver - TWL6030 clk driver - Nuvoton Arbel BMC NPCM8XX SoC clks - MediaTek MT6735 SoC clks - MediaTek MT7620, MT7628 and MT7688 MMC clks - Add a driver for gated fixed rate clocks - Global clock controllers for Qualcomm QCS8300 and IPQ5424 SoCs - Camera, display and video clock controllers for Qualcomm SA8775P SoCs - Global, display, GPU, TCSR, and RPMh clock controllers for Qualcomm SAR2130P - Global, camera, display, GPU, and video clock controllers for Qualcomm SM8475 SoCs - RTC power domain and Battery Backup Function (VBATTB) clock support for the Renesas RZ/G3S SoC - Qualcomm IPQ9574 alpha PLLs - Support for i.MX91 CCM in the i.MX93 driver - Microchip LAN969X SoC clks - Cortex-A55 core clocks and Interrupt Control Unit (ICU) clock and reset on Renesas RZ/V2H(P) - Samsung ExynosAutov920 clk drivers for PERIC1, MISC, HSI0 and HSI1 - Samsung Exynos8895 clk drivers for FSYS0/1, PERIC0/1, PERIS and TOP Updates: - Convert more clk bindings to YAML - Various clk driver cleanups: NULL checks, add const, etc. - Remove END/NUM #defines that count number of clks in various binding headers - Continue moving reset drivers to drivers/reset via auxiliary bus" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (162 commits) clk: clk-loongson2: Fix potential buffer overflow in flexible-array member access clk: Fix invalid execution of clk_set_rate clk: clk-loongson2: Fix memory corruption bug in struct loongson2_clk_provider clk: lan966x: make it selectable for ARCH_LAN969X clk: eyeq: add EyeQ6H west fixed factor clocks clk: eyeq: add EyeQ6H central fixed factor clocks clk: eyeq: add EyeQ5 fixed factor clocks clk: eyeq: add fixed factor clocks infrastructure clk: eyeq: require clock index with phandle in all cases clk: fixed-factor: add clk_hw_register_fixed_factor_index() function dt-bindings: clock: eyeq: add more Mobileye EyeQ5/EyeQ6H clocks dt-bindings: soc: mobileye: set `#clock-cells = <1>` for all compatibles clk: clk-axi-clkgen: make sure to enable the AXI bus clock dt-bindings: clock: axi-clkgen: include AXI clk clk: mmp: Add Marvell PXA1908 MPMU driver clk: mmp: Add Marvell PXA1908 APMU driver clk: mmp: Add Marvell PXA1908 APBCP driver clk: mmp: Add Marvell PXA1908 APBC driver dt-bindings: clock: Add Marvell PXA1908 clock bindings clk: mmp: Switch to use struct u32_fract instead of custom one ...
2024-11-22Merge tag 'backlight-next-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: - Improve handling of LCD power states and interactions with the fbdev subsystem - Introduce new LCD_POWER_ constants to decouple the LCD subsystem from fbdev - Update several drivers to use the new LCD_POWER_ constants - Clarify the semantics of the lcd_ops.controls_device callback - Remove unnecessary includes and dependencies - Remove unused notifier functionality - Simplify code with scoped for-each loops - Fix module autoloading for the ktz8866 driver - Update device tree bindings to yaml format - Minor cleanups and improvements in various drivers * tag 'backlight-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: (33 commits) MAINTAINERS: Use Daniel Thompson's korg address for Backlight work dt-bindings: backlight: Convert zii,rave-sp-backlight.txt to yaml backlight: Remove notifier backlight: ktz8866: Fix module autoloading backlight: 88pm860x_bl: Simplify with scoped for each OF child loop backlight: lcd: Do not include <linux/fb.h> in lcd header backlight: lcd: Remove struct fb_videomode from set_mode callback backlight: lcd: Replace check_fb with controls_device HID: picoLCD: Replace check_fb in favor of struct fb_info.lcd_dev fbdev: omap: Use lcd power constants fbdev: imxfb: Use lcd power constants fbdev: imxfb: Replace check_fb in favor of struct fb_info.lcd_dev fbdev: clps711x-fb: Use lcd power constants fbdev: clps711x-fb: Replace check_fb in favor of struct fb_info.lcd_dev backlight: tdo24m: Use lcd power constants backlight: platform_lcd: Use lcd power constants backlight: platform_lcd: Remove match_fb from struct plat_lcd_data backlight: platform_lcd: Remove include statement for <linux/backlight.h> backlight: otm3225a: Use lcd power constants backlight: ltv350qv: Use lcd power constants ...