lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message	Author
2026-06-08	btrfs: tracepoints: remove double negation in finish ordered extent event	Filipe Manana
	There is no need to add a double negation (!!) to the update field because the field has a boolean type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
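	The point about boolean fields can be illustrated with a small userspace sketch. The struct and function names below are hypothetical stand-ins, not the btrfs tracepoint code; the sketch only shows that conversion to bool already normalizes the value, so "!!" adds nothing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a tracepoint entry that records a bool. */
struct demo_event {
	bool update;
};

/* Storing a bool directly: the value is already 0 or 1. */
static inline int record_plain(bool update)
{
	struct demo_event ev = { .update = update };

	return ev.update;
}

/* Same store with the redundant double negation. */
static inline int record_double_negated(bool update)
{
	struct demo_event ev = { .update = !!update };

	return ev.update;
}

/* Both variants must agree for every possible bool value. */
static inline void demo_check(void)
{
	assert(record_plain(true) == record_double_negated(true));
	assert(record_plain(false) == record_double_negated(false));
}
```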
2026-06-08	btrfs: add ioctl GET_CSUMS to read raw checksums from file range	Mark Harmstone
	Add a new unprivileged BTRFS_IOC_GET_CSUMS ioctl, which can be used to query the on-disk csums for a file range. The ioctl is deliberately per-file rather than exposing raw csum tree lookups, to avoid leaking information to users about files they may not have access to. This is done by userspace passing a struct btrfs_ioctl_get_csums_args to the kernel, which details the offset and length we're interested in, and a buffer for the kernel to write its results into. The kernel writes a struct btrfs_ioctl_get_csums_entry into the buffer, followed by the csums if available. The maximum size of the user buffer is capped to 16MiB. If the extent is an uncompressed, non-NODATASUM extent, the kernel sets the entry type to BTRFS_GET_CSUMS_HAS_CSUMS and follows it with the csums. If it is sparse, preallocated, or beyond the EOF, it sets the type to BTRFS_GET_CSUMS_ZEROED - this is so userspace knows it can use the precomputed hash of the zero sector. Otherwise, it sets the type to BTRFS_GET_CSUMS_NODATASUM, BTRFS_GET_CSUMS_COMPRESSED, BTRFS_GET_CSUM_ENCRYPTED, or BTRFS_GET_CSUM_INLINE. For example, a file with a [0, 4K) hole and [4K, 12K) data extent would produce the following output buffer: | [0, 4K) ZEROED | [4K, 12K) HAS_CSUMS | csum data | We do store the csums of compressed extents, but we deliberately don't return them here: they're calculated over the compressed data, not the uncompressed data that's returned to userspace. Similarly for encrypted data, once encryption is supported, in which the csums will be on the ciphertext. The main use case for this is for speeding up mkfs.btrfs --rootdir. For the case when the source FS is btrfs and using the same csum algorithm, we can avoid having to recalculate the csums - in my synthetic benchmarks (16GB file on a spinning-rust drive), this resulted in a ~11% speed-up (218s to 196s). 
When using the --reflink option added in btrfs-progs v6.16.1, we can forgo reading the data entirely, resulting in a ~2200% speed-up on the same test (128s to 6s). # mkdir rootdir # dd if=/dev/urandom of=rootdir/file bs=4096 count=4194304 (without ioctl) # echo 3 > /proc/sys/vm/drop_caches # time mkfs.btrfs --rootdir rootdir testimg ... real 3m37.965s user 0m5.496s sys 0m6.125s # echo 3 > /proc/sys/vm/drop_caches # time mkfs.btrfs --rootdir rootdir --reflink testimg ... real 2m8.342s user 0m5.472s sys 0m1.667s (with ioctl) # echo 3 > /proc/sys/vm/drop_caches # time mkfs.btrfs --rootdir rootdir testimg ... real 3m15.865s user 0m4.258s sys 0m6.261s # echo 3 > /proc/sys/vm/drop_caches # time mkfs.btrfs --rootdir rootdir --reflink testimg ... real 0m5.847s user 0m2.899s sys 0m0.097s Another notable use case is for deduplication, where reading the checksums may serve as a hint instead of reading the whole file data. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-06-08	firmware: stratix10-svc: Add support to query Arm Trusted Firmware (ATF) version	Tze Yee Ng
	Add an entry in the Stratix10 service layer that allows clients to retrieve the ATF version at runtime, which is useful for system diagnostics, compatibility checks, and ensuring the correct secure firmware is in use. The change introduces: - A new service command definition in the Stratix10 service layer to initiate the ATF version query. - A corresponding macro definition in the header file to expose the command ID for use by other components. The service layer uses a Secure Monitor Call (SMC) to communicate with the ATF and retrieve the version string, which can then be logged or validated by the client application. Signed-off-by: Tze Yee Ng <tze.yee.ng@altera.com> Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
2026-06-08	dmaengine: iop32x-adma: Remove a leftover header file	Vladimir Zapolskiy
	The Intel IOPx3xx platform was completely removed in commit b91a69d162aa ("ARM: iop32x: remove the platform"), so it is also safe to remove the unused, leftover platform-data header file dma-iop32x.h. Signed-off-by: Vladimir Zapolskiy <vz@mleia.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260114051508.3908807-1-vz@mleia.com [vkoul: fixed subsystem tag] Signed-off-by: Vinod Koul <vkoul@kernel.org>
2026-06-08	irqchip/renesas-rzv2h: Add DMA ACK signal routing support	John Madieu
	Some peripherals on RZ/G3E SoCs (SSIU, SPDIF, SCU/SRC, DVC) require explicit ACK signal routing through the ICU via the ICU_DMACKSELk registers for level-based DMA handshaking. Add rzv2h_icu_register_dma_ack() to configure ICU_DMACKSELk, routing a DMAC channel's ACK signal to the specified peripheral. Signed-off-by: John Madieu <john.madieu.xa@bp.renesas.com> Acked-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260525110750.4020112-2-john.madieu.xa@bp.renesas.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2026-06-08	power: sequencing: Add an API to return the pwrseq device's 'dev' pointer	Manivannan Sadhasivam
	The consumer drivers can make use of the pwrseq device's 'dev' pointer to query the pwrseq provider's DT node to check for existence of specific properties. Hence, add an API to return the pwrseq device's 'dev' pointer to consumers. Note that since pwrseq_get() would've increased the pwrseq refcount, there is no need to increase the refcount in this API again. Tested-by: Wei Deng <wei.deng@oss.qualcomm.com> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> Link: https://patch.msgid.link/20260519-pwrseq-m2-bt-v3-6-b39dc2ae3966@oss.qualcomm.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
2026-06-08	mm: Refactor lazy_mmu_mode_pause() and lazy_mmu_mode_resume()	Juergen Gross
	In order to allow pausing and resuming MMU lazy mode for tasks other than current, refactor lazy_mmu_mode_pause() and lazy_mmu_mode_resume(). This will be needed when dropping the Xen PV private lazy MMU bookkeeping. Acked-by: "David Hildenbrand (Arm)" <david@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260526150514.129330-4-jgross@suse.com>
2026-06-08	x86/xen: Drop lazy mode from trace entries	Juergen Gross
	Drop the lazy mode (cpu or mmu) from the xen_mc_batch and xen_mc_issue trace entries. This is done in preparation of removing the xen_lazy_mode percpu variable. Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260526150514.129330-2-jgross@suse.com>
2026-06-08	x86/xen: Cleanup Xen related trace points	Juergen Gross
	Since dropping Xen-PV support for 32-bit, include/trace/events/xen.h contains several stale trace point definitions. Remove them. Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260522152114.77319-3-jgross@suse.com>
2026-06-08	xen: constify xsd_errors array	Len Bao
	The 'xsd_errors' array is initialized in the declaration and never changed. So, constify it to reduce the attack surface. At the same time, use the preferred '__maybe_unused' form over the '__attribute__((unused))' form. Signed-off-by: Len Bao <len.bao@gmx.us> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260523140809.30915-1-len.bao@gmx.us>
2026-06-08	wifi: mac80211: bound S1G TIM PVB walk to the TIM element	Bryam Vargas
	ieee80211_s1g_check_tim() parses the S1G Partial Virtual Bitmap (PVB) of a received TIM element. The TIM is handed in as the element payload: ieee802_11_parse_elems_full() stores elems->tim = elem->data and elems->tim_len = elem->datalen (net/mac80211/parse.c), so the valid bytes are [tim, tim + tim_len). When walking the encoded blocks the function passes the walker an end sentinel of (const u8 *)tim + tim_len + 2, i.e. two bytes past the end of the element. ieee80211_s1g_find_target_block() loops while (ptr + 1 <= end) and dereferences ptr (and the per-mode ieee80211_s1g_len_*() helpers read ptr), so it can read up to two bytes beyond the TIM element -- an out-of-bounds read of adjacent skb/heap data when the TIM is the last element in the frame. The +2 appears to account for the element id/len header, but tim already points past that header at the element payload, so the addend is wrong. Pass the correct element end, (const u8 *)tim + tim_len. Fixes: e0c47c6229c2 ("wifi: mac80211: support parsing S1G TIM PVB") Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Link: https://patch.msgid.link/20260606074341.49135-1-hexlabsecurity@proton.me Signed-off-by: Johannes Berg <johannes.berg@intel.com>
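	The effect of the wrong end sentinel can be sketched in userspace. The walker below is a hypothetical simplification, not the mac80211 code: it only reuses the quoted loop condition (ptr + 1 <= end) and counts how many bytes would be dereferenced for a payload of a given length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical walker: one byte per step while (ptr + 1 <= end).
 * Returns how many bytes the loop body would dereference.
 */
static size_t demo_walk(const uint8_t *ptr, const uint8_t *end)
{
	size_t visited = 0;

	assert(ptr <= end);
	while (ptr + 1 <= end) {
		visited++;	/* the real code would read *ptr here */
		ptr++;
	}
	return visited;
}

/* End sentinel at payload + len: stays inside the element payload. */
static size_t demo_walk_fixed(const uint8_t *payload, size_t len)
{
	return demo_walk(payload, payload + len);
}

/* End sentinel at payload + len + 2: touches two bytes past the payload. */
static size_t demo_walk_old(const uint8_t *payload, size_t len)
{
	return demo_walk(payload, payload + len + 2);
}
```

	For the sketch to stay well defined, the caller's buffer must be at least len + 2 bytes long; in the kernel case that slack does not exist when the TIM is the last element in the frame.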
2026-06-07	Merge tag 'sched-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	Linus Torvalds
	Pull rseq fix from Ingo Molnar: - Fix uninitialized stack variable in rseq_exit_user_update() (Qing Wang) * tag 'sched-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: Fix using an uninitialized stack variable in rseq_exit_user_update()
2026-06-07	rhashtable: Fix rhashtable_next_key() build warnings	Mykyta Yatsenko
	rhashtable.o builds with warnings because the rhashtable_next_key() kdoc in lib/rhashtable.c lacks argument descriptions. Move the rhashtable_next_key() kdoc from the header to the C file, matching other functions. Move rhashtable_next_key() next to the other forward declarations in the header file. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202606061925.WI4bYI8k-lkp@intel.com/ Fixes: 8f4fa9f89b72 ("rhashtable: Add rhashtable_next_key() API") Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260606-rhash_fixes_1-v1-1-932ab036e6bc@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	bpf: Add support for tracing_multi link session	Jiri Olsa
	Add support for session attachment with the tracing_multi link. Add a new BPF_TRACE_FSESSION_MULTI program attach type that follows the BPF_TRACE_FSESSION behaviour but on the tracing_multi link. Such a program is called on entry and exit of the attached function and can pass a cookie value from the entry to the exit execution. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-16-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	bpf: Add support for tracing_multi link cookies	Jiri Olsa
	Add support for specifying cookies for the tracing_multi link. Cookies are provided in an array where each value is paired with the provided BTF ID value at the same array index. Such a cookie can be retrieved by a bpf program with the bpf_get_attach_cookie helper call. We need to sort the cookies array together with the ids array in check_dup_ids, to keep the id->cookie relation. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-15-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
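	Sorting two parallel arrays while keeping the id->cookie pairing can be sketched as follows. This is a userspace illustration with hypothetical names; it uses a plain insertion sort as a stand-in for the kernel's sort_r() with a swap callback, and it is not the check_dup_ids implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap entries i and j in both arrays so ids[k] stays paired with cookies[k]. */
static void demo_swap_pair(uint32_t *ids, uint64_t *cookies, size_t i, size_t j)
{
	uint32_t id = ids[i];
	uint64_t cookie = cookies[i];

	ids[i] = ids[j];
	cookies[i] = cookies[j];
	ids[j] = id;
	cookies[j] = cookie;
}

/* Insertion sort keyed on ids; every move is mirrored in cookies. */
static void demo_sort_ids_with_cookies(uint32_t *ids, uint64_t *cookies, size_t cnt)
{
	assert(ids && cookies);
	for (size_t i = 1; i < cnt; i++)
		for (size_t j = i; j > 0 && ids[j - 1] > ids[j]; j--)
			demo_swap_pair(ids, cookies, j - 1, j);
}
```

	Once the ids are sorted, duplicates end up adjacent, so a single pass comparing neighbours is enough to detect them.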
2026-06-07	bpf: Add support for tracing multi link	Jiri Olsa
	Add a new link that allows attaching a program to multiple function BTF IDs. The link is represented by struct bpf_tracing_multi_link. To configure the link, new fields are added to bpf_attr::link_create to pass an array of BTF IDs: struct { __aligned_u64 ids; __u32 cnt; } tracing_multi; Each BTF ID represents a function (BTF_KIND_FUNC) that the link will attach the bpf program to. We use the previously added bpf_trampoline_multi_attach/detach functions to attach/detach the link. The linkinfo/fdinfo callbacks will be implemented in following changes. Note this is supported only for archs (x86_64) that have ftrace direct and single ops support: CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS && CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS. Note that check_dup_ids uses sort_r (instead of plain sort) because we will use the swap callback in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-14-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	bpf: Add bpf_trampoline_multi_attach/detach functions	Jiri Olsa
	Adding bpf_trampoline_multi_attach/detach functions that allows to attach/detach tracing program to multiple functions/trampolines. The attachment is defined with bpf_program and array of BTF ids of functions to attach the bpf program to. Adding bpf_tracing_multi_link object that holds all the attached trampolines and is initialized in attach and used in detach. The attachment allocates or uses currently existing trampoline for each function to attach and links it with the bpf program. The attach works as follows: - we get all the needed trampolines - lock them and add the bpf program to each (__bpf_trampoline_link_prog) - the trampoline_multi_ops passed in __bpf_trampoline_link_prog gathers ftrace_hash (ip -> trampoline) objects - we call update_ftrace_direct_add/mod to update needed locations - we unlock all the trampolines The detach works as follows: - we lock all the needed trampolines - remove the program from each (__bpf_trampoline_unlink_prog) - the trampoline_multi_ops passed in __bpf_trampoline_unlink_prog gathers ftrace_hash (ip -> trampoline) objects - we call update_ftrace_direct_del/mod to update needed locations - we unlock and put all the trampolines We store the old image/flags in the trampoline before the update and use it in case we need to rollback the attachment. We keep the ftrace_hash objects allocated during attach in the link so they can be used for detach as well. Adding trampoline_(un)lock_all functions to (un)lock all trampolines to gate the tracing_multi attachment. Note this is supported only for archs (x86_64) with ftrace direct and have single ops support. CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS && CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS It also needs CONFIG_BPF_SYSCALL enabled. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-13-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	bpf: Add multi tracing attach types	Jiri Olsa
	Add new program attach types for multi tracing attachment, BPF_TRACE_FENTRY_MULTI and BPF_TRACE_FEXIT_MULTI, along with their base support in verifier code. Programs with such an attach type will use a specific link attachment interface coming in following changes. This was suggested by Andrii some (long) time ago and turned out to be easier than having a special program flag for that. Bpf programs with such types have the 'bpf_multi_func' function set as their attach_btf_id and keep the module reference when it's specified by attach_prog_fd. They are also accepted as sleepable programs during verification, and the real validation for specific BTF_IDs/functions will happen during the multi link attachment in following changes. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-11-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	bpf: Factor fsession link to use struct bpf_tramp_node	Jiri Olsa
	Now that we split trampoline attachment object (bpf_tramp_node) from the link object (bpf_tramp_link) we can use bpf_tramp_node as fsession's fexit attachment object and get rid of the bpf_fsession_link object. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-10-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	bpf: Add struct bpf_tramp_node object	Jiri Olsa
	Adding struct bpf_tramp_node to decouple the link out of the trampoline attachment info. At the moment the object for attaching bpf program to the trampoline is 'struct bpf_tramp_link': struct bpf_tramp_link { struct bpf_link link; struct hlist_node tramp_hlist; u64 cookie; } The link holds the bpf_prog pointer and forces one link - one program binding logic. In following changes we want to attach program to multiple trampolines but we want to keep just one bpf_link object. Splitting struct bpf_tramp_link into: struct bpf_tramp_link { struct bpf_link link; struct bpf_tramp_node node; }; struct bpf_tramp_node { struct bpf_link *link; struct hlist_node tramp_hlist; u64 cookie; }; The 'struct bpf_tramp_link' defines standard single trampoline link and 'struct bpf_tramp_node' is the attachment trampoline object with pointer to the bpf_link object. This will allow us to define link for multiple trampolines, like: struct bpf_tracing_multi_link { struct bpf_link link; ... int nodes_cnt; struct bpf_tracing_multi_node nodes[] __counted_by(nodes_cnt); }; Cc: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-9-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	bpf: Use mutex lock pool for bpf trampolines	Jiri Olsa
	Add a mutex lock pool that replaces the bpf trampoline mutex. For the tracing_multi link coming in following changes we need to lock all the involved trampolines during the attachment. This could mean thousands of mutex locks, which is not convenient. As suggested by Andrii, we can replace the bpf trampoline mutex with a mutex pool, where each trampoline is hashed to one of the locks from the pool. It's better to lock all the pool mutexes (32 at the moment) than thousands of them. A task may hold at most 48 (MAX_LOCK_DEPTH) locks simultaneously, so we keep 32 mutexes (5 bits) in the pool; that way lockdep won't scream when we lock them all in following changes. Remove the mutex_is_locked check in bpf_trampoline_put, because we removed the mutex from bpf_trampoline. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
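	The pool sizing argument can be sketched as follows. This is a userspace illustration under stated assumptions: the hash function and all DEMO_ names are hypothetical, and only the numbers 32 (5 bits) and 48 (MAX_LOCK_DEPTH) come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_POOL_BITS		5
#define DEMO_POOL_SIZE		(1u << DEMO_POOL_BITS)	/* 32 locks, as in the commit */
#define DEMO_MAX_LOCK_DEPTH	48			/* limit cited in the commit */

/*
 * Hypothetical hash: multiply the key by an arbitrary odd 64-bit constant
 * and keep the top DEMO_POOL_BITS bits as the pool index.
 */
static unsigned int demo_pool_index(uint64_t key)
{
	unsigned int idx;

	idx = (unsigned int)((key * 0x9E3779B97F4A7C15ull) >> (64 - DEMO_POOL_BITS));
	assert(idx < DEMO_POOL_SIZE);
	return idx;
}

/* Locking every mutex in the pool must stay within the held-lock limit. */
static int demo_pool_fits_lock_depth(void)
{
	return DEMO_POOL_SIZE <= DEMO_MAX_LOCK_DEPTH;
}
```

	With 6 bits the pool would hold 64 mutexes, which exceeds the 48-lock limit, so 5 bits is the largest power-of-two pool that can be locked in full.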
2026-06-07	ftrace: Add add_ftrace_hash_entry function	Jiri Olsa
	Rename __add_hash_entry to add_ftrace_hash_entry and make it global; it will be used in following changes outside the ftrace.c object. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	ftrace: Add ftrace_hash_remove function	Jiri Olsa
	Add the ftrace_hash_remove function, which removes all entries from a struct ftrace_hash object without freeing them. It will be used in following changes where entries are allocated as part of another structure and are freed separately. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	ftrace: Add ftrace_hash_count function	Jiri Olsa
	Add an external ftrace_hash_count function so we can get the hash count outside of the ftrace object. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07	net/mlx5: Add sd_group_size bits for SD management	Shay Drory
	Currently, mlx5 is querying the MPIR register to get the number of PFs that should comprise the SD group. However, this register does not reflect the correct number in complex deployments. Hence, add an sd_group_size field to nic_vport_context to determine the correct number of PFs, and add an sd_group_size capability bit to indicate whether FW supports it. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260529052359.389413-3-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-06-07	net/mlx5: Update IFC allowed_list_size field bits	Dragos Tatulea
	The vport context allowed_list_size was increased from 12 to 16 bits. Writing to this field is protected by the log_max_current_uc/mc_list capabilities. On older FW versions these capabilities are limited to < 2K and only the high bits of the field are extended. This means that the change is backward compatible with older FW versions. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260529052359.389413-2-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-06-07	filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n	Christian Brauner
	The CONFIG_FILE_LOCKING=n stub for break_lease() takes a 'bool wait' argument, whereas the CONFIG_FILE_LOCKING=y version and every caller pass an openmode as an 'unsigned int mode'. The mismatch was introduced when __break_lease() was reworked to use flags: only the stub was switched to 'bool wait', a stray leftover from the neighbouring break_layout() helper. The real prototype kept 'unsigned int mode'. This was harmless until O_WRONLY changed from the octal literal 00000001 to (1 << 0). clang's -Wtautological-constant-compare then fires on the implicit shift-to-bool conversion at the first FILE_LOCKING=n caller: fs/open.c:112:29: warning: converting the result of '<<' to a boolean always evaluates to true [-Wtautological-constant-compare] 112 | error = break_lease(inode, O_WRONLY); Restore the stub's parameter to 'unsigned int mode' so it matches the real prototype and every caller. The stub still just returns 0, so there is no functional change; it removes the type inconsistency and silences the warning. Root cause diagnosed by Nathan Chancellor. Fixes: 4be9f3cc582a ("filelock: rework the __break_lease API to use flags") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202606071029.DKCs8WOs-lkp@intel.com/ Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
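	The type mismatch can be illustrated in userspace. The two functions below are hypothetical and, unlike the real stub (which just returns 0), they echo back the value they received so the effect of the parameter type is visible.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_O_WRONLY	(1 << 0)
#define DEMO_O_RDWR	(1 << 1)

/* Mismatched parameter: any non-zero open mode arrives as true (1). */
static inline unsigned int demo_stub_bool(bool wait)
{
	return wait;
}

/* Matching parameter: the caller's mode bits arrive unchanged. */
static inline unsigned int demo_stub_mode(unsigned int mode)
{
	return mode;
}

/* Passing the mode through a variable keeps the sketch free of the warning. */
static inline void demo_check_stubs(void)
{
	unsigned int mode = DEMO_O_RDWR;

	assert(demo_stub_mode(mode) == 2);	/* mode preserved */
	assert(demo_stub_bool(mode) == 1);	/* mode collapsed to true */
}
```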
2026-06-07	Merge branch 'for-linus' into for-next	Takashi Iwai
	Signed-off-by: Takashi Iwai <tiwai@suse.de>
2026-06-06	cfi: Include uaccess.h for get_kernel_nofault()	Nathan Chancellor
	After commit 0652a3daa787 ("tracing: Fix CFI violation in probestub being called by tprobes"), there are many build errors when building ARCH=arm multi_v7_defconfig + CONFIG_CFI=y like: In file included from drivers/base/devres.c:17: In file included from drivers/base/trace.h:16: In file included from include/linux/tracepoint.h:23: include/linux/cfi.h:44:6: error: call to undeclared function 'get_kernel_nofault'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 44 | if (get_kernel_nofault(hash, func - cfi_get_offset())) | ^ 1 error generated. get_kernel_nofault() is called in the generic version of cfi_get_func_hash() but nothing ensures uaccess.h is always included for a proper expansion and prototype. Include uaccess.h in cfi.h to clear up the errors. Cc: stable@vger.kernel.org Fixes: 0652a3daa787 ("tracing: Fix CFI violation in probestub being called by tprobes") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-06-06	vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags	Jori Koolstra
	A recent build failure[1] exposed the difficulty of working with the current octal and hex definitions of O_ flags when trying to find a gap for a new flag. This difficulty is compounded by the fact that O_ flags may have architecture-specific values. Replace the hex/octal #defines, which are hard to parse when looking for free bits, with explicit bit shifts like (1 << 11). Also, add comments that identify which architectures redefine some of the seemingly free ("cursed") bits in uapi/asm-generic/fcntl.h. These should not be used to define new O_ flags (for now, at least). The translation was done with Claude Opus 4.8, and verified with a (non-AI) gawk script. The accounting of which architectures claim which bit-gaps in uapi/asm-generic/fcntl.h is also done by hand. [1]: https://lore.kernel.org/all/agruPPybCx8q2XcJ@sirena.org.uk/ Assisted-by: Claude:Opus 4.8 Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl> Link: https://patch.msgid.link/20260604222405.5382-1-jkoolstra@xs4all.nl Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
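	The equivalence between the two spellings can be checked at compile time. The sketch below uses DEMO_-prefixed copies of a few asm-generic octal values rather than the real headers, and the free-bit helper is a hypothetical illustration of why shift notation makes gaps easier to spot.

```c
#include <assert.h>

/* Octal spellings, as the generic values were written before the change. */
#define DEMO_OLD_O_WRONLY	00000001
#define DEMO_OLD_O_CREAT	00000100
#define DEMO_OLD_O_TRUNC	00001000
#define DEMO_OLD_O_NONBLOCK	00004000

/* Shift spellings: the bit position is readable at a glance. */
#define DEMO_NEW_O_WRONLY	(1 << 0)
#define DEMO_NEW_O_CREAT	(1 << 6)
#define DEMO_NEW_O_TRUNC	(1 << 9)
#define DEMO_NEW_O_NONBLOCK	(1 << 11)

static_assert(DEMO_OLD_O_WRONLY == DEMO_NEW_O_WRONLY, "O_WRONLY is bit 0");
static_assert(DEMO_OLD_O_CREAT == DEMO_NEW_O_CREAT, "O_CREAT is bit 6");
static_assert(DEMO_OLD_O_TRUNC == DEMO_NEW_O_TRUNC, "O_TRUNC is bit 9");
static_assert(DEMO_OLD_O_NONBLOCK == DEMO_NEW_O_NONBLOCK, "O_NONBLOCK is bit 11");

/* Return the lowest bit not set in 'used', or -1 if all 32 bits are taken. */
static int demo_lowest_free_bit(unsigned int used)
{
	for (int bit = 0; bit < 32; bit++)
		if (!(used & (1u << bit)))
			return bit;
	return -1;
}
```

	A free bit found this way is only a candidate: as the commit notes, some seemingly free bits are claimed by individual architectures.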
2026-06-06	bpf: Add simple xattr support to bpffs	Daniel Borkmann
	Add support for extended attributes on bpffs inodes so that user space and BPF LSM programs can attach metadata - for example, a content hash or a security label - to a pinned object or directory. BPF LSM or user space tooling can then uniformly look at this (e.g. security.bpf.*) in a similar way to other fs'es. The store is in-memory and non-persistent: it lives only for the lifetime of the mount, like everything else in bpffs. The modelling is similar to tmpfs. bpffs serves the trusted.* and security.* namespaces; user.* is left unsupported. As bpffs is FS_USERNS_MOUNT, security.* is reachable by the unprivileged mounter in a user namespace, and thus we are using the simple_xattr_set_limited infra there (trusted.* needs global CAP_SYS_ADMIN). bpf_fill_super() is open-coded instead of using simple_fill_super(), because the root inode must now be allocated through bpf_fs_alloc_inode() (i.e. carry the bpf_fs_inode wrapper and come from the right cache), which requires s_op (and s_xattr) to be installed before the first inode is created. While at it, also harden s_iflags with SB_I_NOEXEC and SB_I_NODEV. bpf_fs_listxattr() is only reachable through the filesystem via i_op->listxattr, so the BPF token inode is left untouched. Name-based fsetxattr()/fgetxattr() on a token fd still work since the get/set handlers are installed at the superblock. For the security.* namespace, we use simple_xattr_set_limited(), but there was no simple_xattr_add_limited() API yet, which was needed in bpf_fs_initxattrs() to avoid underflows in the accounting. The symlink target is freed in bpf_free_inode() rather than in bpf_destroy_inode() so that it is released only after an RCU grace period, as an RCU path walk following the symlink may still dereference inode->i_link in security_inode_follow_link(). Lastly, the symlink target allocated in bpf_symlink() is switched to GFP_KERNEL_ACCOUNT, so the string is charged to the caller's memcg. 
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20260602074012.416289-1-daniel@iogearbox.net Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-06	simple_xattr: use per-sb cache	Miklos Szeredi
	Move the hash table to the super block to remove excessive overhead in case of small number of xattrs per inode. Add linked list to the inode, used for listxattr and eviction. Listxattr uses rcu protection to iterate the list of xattrs. Before being made per-sb, lazy allocation was protected by inode lock. Now inode lock no longer provides sufficient exclusion, so use cmpxchg() to ensure atomicity. Though I haven't found a description of this pattern, after some research it seems that cmpxchg_release() and READ_ONCE() should provide the necessary memory barriers. Use simple_xattr_free_rcu() in simple_xattrs_free(). This is needed because the hash table is now shared between inodes and lookup on a different inode might be running the compare function on the just freed element within the RCU grace period. Following stats are based on slabinfo diff, after creating 100k empty files, then adding a "user.test=foo" xattr to each: v7.0 (no rhashtable): File creation: 993.40 bytes/file Xattr addition: 79.99 bytes/file v7.1-rc2 (per-inode rhashtable): File creation: 939.73 bytes/file Xattr addition: 1296.08 bytes/file v7.1-rc2 + this patch (per-sb rhashtable) File creation: 946.84 bytes/file Xattr addition: 111.86 bytes/file The overhead of a single xattr is reduced to nearly v7.0 levels. The per xattr overhead is slightly larger due to the addition of three pointers to struct simple_xattr. Fixes: b32c4a213698 ("xattr: add rhashtable-based simple_xattr infrastructure") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://patch.msgid.link/20260605135322.2632068-5-mszeredi@redhat.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
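	The lazy-allocation pattern described above can be sketched with C11 atomics. This is a userspace analogue under stated assumptions, not the kernel code: all names are hypothetical, and acquire/acq_rel orderings stand in for the kernel's READ_ONCE() and cmpxchg_release().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a lazily allocated per-inode structure. */
struct demo_xattrs {
	int nr;
};

/*
 * Return *slot, allocating it on first use. If another thread installs its
 * object first, free ours and use the winner's, so exactly one object is
 * ever published through the slot.
 */
static struct demo_xattrs *demo_lazy_get(struct demo_xattrs *_Atomic *slot)
{
	struct demo_xattrs *cur = atomic_load_explicit(slot, memory_order_acquire);
	struct demo_xattrs *fresh;

	if (cur)
		return cur;

	fresh = calloc(1, sizeof(*fresh));
	if (!fresh)
		return NULL;

	/* Publish 'fresh' only if the slot is still NULL. */
	if (atomic_compare_exchange_strong_explicit(slot, &cur, fresh,
						    memory_order_acq_rel,
						    memory_order_acquire))
		return fresh;

	free(fresh);	/* lost the race; cur now holds the installed object */
	assert(cur);
	return cur;
}
```

	The compare-and-exchange replaces the exclusion that the inode lock used to provide: no lock is taken, yet two racing callers still end up sharing one object.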
2026-06-06	simple_xattr: change interface to pass struct simple_xattrs **	Miklos Szeredi
	Change the simple_xattr API to accept pointer-to-pointer (struct simple_xattrs **) instead of pointer. This allows the functions to handle lazy allocation internally without requiring callers to use simple_xattrs_lazy_alloc(). The simple_xattr_set(), simple_xattr_set_limited() and simple_xattr_add() functions now handle allocation when xattrs is NULL. simple_xattrs_free() now also frees the xattrs structure itself and sets the pointer to NULL. This simplifies callers and removes the need for most callers to explicitly manage xattrs allocation and lifetime. In shmem_initxattrs(), the total required space for all initial xattrs (ispace) is pre-calculated and deducted from sbinfo->free_ispace. Since this patch modifies the function to add new xattrs directly to the inode's &info->xattrs list rather than using a local temporary variable, a failure means that the partially populated info->xattrs list remains attached to the inode. When the VFS caller handles the -ENOMEM error, it drops the newly created inode via iput(), and shmem_free_inode() adds the freed space to sbinfo->free_ispace a second time, permanently inflating the tmpfs free space quota. Fix by subtracting already added xattrs from ispace. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://patch.msgid.link/20260605135322.2632068-4-mszeredi@redhat.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-06	kernfs: fix xattr race condition with multiple superblocks	Miklos Szeredi
	Multiple superblocks with different namespaces can share the same kernfs_node when kernfs_test_super() finds a matching root but different namespace. This means multiple inodes from different superblocks can reference the same kernfs_node->iattr->xattrs structure. The VFS layer only holds per-inode locks during xattr operations, which is insufficient to serialize concurrent xattr modifications on the shared kernfs_node. This can lead to race conditions in simple_xattr_set() where the lookup->replace/remove sequence is not atomic with respect to operations from other superblocks. Fix this by protecting xattr operations with the existing hashed kernfs_locks->open_file_mutex[] array, which is already used to protect per-node open file data. The hashed mutex array provides scalable per-node serialization (scaled by CPU count, up to 1024 locks on 32+ CPU systems) with zero memory overhead. Changes: - Rename open_file_mutex[] to node_mutex[] to reflect dual purpose - Add kernfs_node_lock_ptr() and kernfs_node_lock() helpers - Protect simple_xattr_set() calls in kernfs_xattr_set() and kernfs_vfs_user_xattr_set() with the hashed mutex - Update file.c to use new helpers via compatibility wrappers - Update documentation to explain the extended lock usage Fixes: b32c4a213698 ("xattr: add rhashtable-based simple_xattr infrastructure") Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://sashiko.dev/#/patchset/20260601162454.2116375-1-mszeredi%40redhat.com Assisted-by: Claude:claude-sonnet-4-5 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://patch.msgid.link/20260605135322.2632068-2-mszeredi@redhat.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-05	kconfig: Remove the architecture specific config for Propeller	Rong Xu
	The CONFIG_PROPELLER_CLANG option currently depends on ARCH_SUPPORTS_PROPELLER_CLANG, but this dependency seems unnecessary. Remove ARCH_SUPPORTS_PROPELLER_CLANG and allow users to control Propeller builds solely through CONFIG_PROPELLER_CLANG. This simplifies the kconfig and avoids potential confusion. Move the .llvm_bb_addr_map sections grouping to include/asm-generic/vmlinux.lds.h. The Propeller documentation has been updated to reflect the most recent tool location and now includes instructions for arm64. Contributor Acknowledgments: * SPE instructions: Daniel Hoekwater <hoekwater@google.com> Signed-off-by: Rong Xu <xur@google.com> Suggested-by: Will Deacon <will@kernel.org> Suggested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Yabin Cui <yabinc@google.com> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20260604195612.3757860-3-xur@google.com Signed-off-by: Nathan Chancellor <nathan@kernel.org>
2026-06-05	bpf: Expose signature verdict via bpf_prog_aux	KP Singh
	BPF_PROG_LOAD verifies the loader signature but does not record the outcome on the BPF program. [BPF] LSMs and audit can read attr->signature and attr->keyring_id to infer "was this signed, and if so, against which keyring". Add prog->aux->sig (verdict + keyring_{type,serial}), populated by bpf_prog_load before the LSM hook. keyring_type classifies the keyring the load referenced (builtin, secondary, platform or user), while keyring_serial records the serial of the keyring the signature was actually validated against. System keyrings carry a pseudo key pointer with no user-visible serial and are reported as 0, as are unsigned loads. Failed verifications reject the load before the hook runs, so it observes only either UNSIGNED or VERIFIED. Signed-off-by: KP Singh <kpsingh@kernel.org> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260605213518.544262-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05	bridge: add bridge_flags_bit enum	Eric Dumazet
	We want to use atomic operations for lockless p->flags changes and reads. Add definitions for bits in addition to masks so that we can use test_bit(), clear_bit() and set_bit() in subsequent patches. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260604141343.2124500-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
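A userspace sketch of the bit-index-plus-mask pattern follows. The flag names are hypothetical (the commit message does not list the bridge flags), and `toy_set_bit()`, `toy_clear_bit()` and `toy_test_bit()` are C11 stand-ins for the kernel's `set_bit()`, `clear_bit()` and `test_bit()`, not their implementation. It shows why bit indices are needed: the atomic helpers take a bit number, while existing code keeps using the masks derived from the same enum.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; each enumerator is a bit index, not a mask. */
enum toy_flags_bit {
	TOY_LEARNING_BIT,
	TOY_FLOOD_BIT,
	TOY_ISOLATED_BIT,
};

/* Masks derived from the bit indices, so the two can never disagree. */
#define TOY_LEARNING	(1UL << TOY_LEARNING_BIT)
#define TOY_FLOOD	(1UL << TOY_FLOOD_BIT)
#define TOY_ISOLATED	(1UL << TOY_ISOLATED_BIT)

/* Atomically set bit nr. */
static void toy_set_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_or(addr, 1UL << nr);
}

/* Atomically clear bit nr. */
static void toy_clear_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

/* Read bit nr without taking a lock. */
static bool toy_test_bit(int nr, atomic_ulong *addr)
{
	return atomic_load(addr) & (1UL << nr);
}
```

Setting `TOY_FLOOD_BIT` on a zeroed word leaves the word equal to the `TOY_FLOOD` mask, so mask-based readers and bit-based writers see the same state.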
2026-06-05	bpf: Check tail zero of bpf_prog_info	Leon Hwang
	Since there are 4 bytes of padding at the end of struct bpf_prog_info, they won't be checked by bpf_check_uarg_tail_zero(). pahole -C bpf_prog_info ./vmlinux struct bpf_prog_info { ... __u32 attach_btf_obj_id; /* 220 4 */ __u32 attach_btf_id; /* 224 4 */ /* size: 232, cachelines: 4, members: 38 */ /* sum members: 224 */ /* sum bitfield members: 1 bits, bit holes: 1, sum bit holes: 31 bits */ /* padding: 4 */ /* forced alignments: 9 */ /* last cacheline: 40 bytes */ } __attribute__((__aligned__(8))); If a future kernel extension adds a new 4-byte field, older userspace programs allocating this structure on the stack might inadvertently pass uninitialized stack garbage into the new field, permanently breaking backward compatibility. -- sashiko [1] Fix it by changing sizeof(info) to offsetofend(struct bpf_prog_info, attach_btf_id). And, add "__u32 :32" to the tail of struct bpf_prog_info. [1] https://lore.kernel.org/bpf/20260513224823.6494FC19425@smtp.kernel.org/ Fixes: aba64c7da983 ("bpf: Add verified_insns to bpf_prog_info and fdinfo") Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260605155249.20772-3-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05	bpf: Check tail zero of bpf_map_info	Leon Hwang
	Since there are 4 bytes of padding at the end of struct bpf_map_info, they won't be checked by bpf_check_uarg_tail_zero(). pahole -C bpf_map_info ./vmlinux struct bpf_map_info { ... __u64 hash __attribute__((__aligned__(8))); /* 88 8 */ __u32 hash_size; /* 96 4 */ /* size: 104, cachelines: 2, members: 18 */ /* padding: 4 */ /* forced alignments: 1 */ /* last cacheline: 40 bytes */ } __attribute__((__aligned__(8))); If a future kernel extension adds a new 4-byte field, older userspace programs allocating this structure on the stack might inadvertently pass uninitialized stack garbage into the new field, permanently breaking backward compatibility. -- sashiko [1] Fix it by changing sizeof(info) to offsetofend(struct bpf_map_info, hash_size). And, add "__u32 :32" to the tail of struct bpf_map_info. [1] https://lore.kernel.org/bpf/20260513224823.6494FC19425@smtp.kernel.org/ Fixes: ea2e6467ac36 ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD") Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260605155249.20772-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
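The difference between `sizeof()` and `offsetofend()` for a struct with tail padding can be checked with a small userspace sketch. `struct toy_info` and `tail_is_zero()` are hypothetical and only model the idea of a tail-zero check; they are not the kernel's `bpf_check_uarg_tail_zero()`. The `offsetofend()` macro is written out here in the usual way for userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the first byte after MEMBER inside TYPE. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Hypothetical struct: 8-byte alignment leaves 4 bytes of tail padding. */
struct toy_info {
	uint64_t hash;		/* offset 0, size 8 */
	uint32_t hash_size;	/* offset 8, size 4 */
} __attribute__((aligned(8)));

/*
 * Return 1 if every byte in buf[known, actual) is zero, else 0.
 * 'known' is how many leading bytes the checker treats as real fields.
 */
static int tail_is_zero(const void *buf, size_t known, size_t actual)
{
	const unsigned char *p = buf;

	for (size_t i = known; i < actual; i++)
		if (p[i])
			return 0;
	return 1;
}
```

With `known = sizeof(struct toy_info)` the four padding bytes count as already known and garbage in them passes unnoticed; with `known = offsetofend(struct toy_info, hash_size)` the same garbage is rejected, which is the behaviour the two commits above switch to.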
2026-06-05	block/bdev: Annotate the blk_holder_ops callback functions	Bart Van Assche
	The four callback functions in blk_holder_ops all release the bd_holder_lock. Annotate these functions accordingly. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/be51cf81110f691ebd5868ac2f15ceb847805bc8.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05	block: Annotate the queue limits functions	Bart Van Assche
	Let the thread-safety checker verify whether every start of a queue limits update is followed by a call to a function that finishes a queue limits update. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/8f71062b6d0fcf2b80bc8cda701c453224755439.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-05	vfio/nvgrace-gpu: Add Blackwell-Next GPU readiness check via CXL DVSEC	Ankit Agrawal
	Add a CXL DVSEC-based readiness check for Blackwell-Next GPUs alongside the existing legacy BAR0 polling path. The CXL Device DVSEC offset is discovered at probe time. Probe, fault and read/write paths then branch on that to use either the legacy BAR0 polling or the CXL DVSEC polling. The CXL path polls Memory_Active, requiring MEM_INFO_VALID within 1s and MEM_ACTIVE within Memory_Active_Timeout (up to 256s) as per CXL spec r4.0 sec 8.1.3.8.2. Given the long worst-case wait, the CXL poll runs outside memory_lock, with only a quick readiness check done under the lock. The poll loops sleep with schedule_timeout_killable() and return -EINTR on a fatal signal. This avoids hung-task panics during the long uninterruptible wait. Extend the same killable wait to the legacy BAR0 path as well. In the fault handler the wait runs locklessly before memory_lock. If a reset races in, the in-lock recheck returns -EAGAIN and the wait is retried rather than returning a spurious VM_FAULT_SIGBUS. Add PCI_DVSEC_CXL_MEM_ACTIVE_TIMEOUT to pci_regs.h for the timeout field. Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Kevin Tian <kevin.tian@intel.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260602063015.3915-1-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-06-05	RDMA/hfi1: Open-code rvt_set_ibdev_name()	Arnd Bergmann
	clang warns about a function missing a printf attribute: include/rdma/rdma_vt.h:457:47: error: diagnostic behavior may be improved by adding the 'format(printf, 2, 3)' attribute to the declaration of 'rvt_set_ibdev_name' [-Werror,-Wmissing-format-attribute] 447 | static inline void rvt_set_ibdev_name(struct rvt_dev_info *rdi, | __attribute__((format(printf, 2, 3))) 448 | const char *fmt, const char *name, 449 | const int unit) The helper was originally added as an abstraction for the hfi1 and qib drivers needing the same thing, but now qib is gone, and hfi1 is the only remaining user of rdma_vt. Avoid the warning and allow the compiler to check the format string by open-coding the helper and directly assigning the device name. Fixes: 5084c8ff21f2 ("IB/{rdmavt, hfi1, qib}: Self determine driver name") Link: https://patch.msgid.link/r/20260602140453.3542427-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-05	RDMA/umem: Make ib_umem_is_contiguous() safe on 32 bit	Jason Gunthorpe
	Sashiko points out that roundup_pow_of_two() only operates on unsigned long, but dma_addr_t can be u64. Simplify the algorithm: compute the page size, and if any page size is found and it results in a single block, then the umem is contiguous. Link: https://patch.msgid.link/r/3-v1-88303e9e509f+f7-ib_umem_types_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-05	RDMA/umem: Be careful about boundary conditions in ib_umem_find_best_pgsz()	Jason Gunthorpe
	Several corner cases, especially important on 32 bits: - umem->iova is u64, the function argument should pass in u64 or iova will be truncated - Check that the length is not too large for the iova - Check that lengths > 4G don't overflow the GENMASK Link: https://patch.msgid.link/r/2-v1-88303e9e509f+f7-ib_umem_types_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-05	bpf: Replace scratch PTE atomically when allocating arena pages	Tejun Heo
	apply_range_set_cb() maps the pages for a new arena allocation and returns -EBUSY when the target PTE is already populated. Kernel-fault recovery leaves the per-arena scratch page in unallocated arena PTEs, so a later bpf_arena_alloc_pages() over such a page hits that -EBUSY, and every subsequent allocation of it fails the same way. Allocation must install the real page over scratch instead. Overwriting the scratch PTE in place is a valid->valid change, which arm64 forbids without break-before-make. Route through an invalid entry instead: ptep_try_set() fills only a none slot, so the PTE goes scratch->none->page. On finding scratch, clear it and call flush_tlb_before_set() before retrying. The new flush_tlb_before_set() is a no-op except on arches like arm64 that need the break-before-make TLB invalidate. The loop also copes with a concurrent fault re-scratching the slot. Arches without ptep_try_set() never install the scratch page, so keep the must-be-empty check and set_pte_at() for them. Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page") Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: David Hildenbrand <david@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260601183728.1800490-1-tj@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
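The scratch->none->page sequence can be modelled in userspace with C11 atomics. This is a sketch under stated assumptions, not the kernel code: a PTE is modelled as a single atomic word, `PTE_SCRATCH` is an invented marker value, and `toy_try_set()`, `toy_flush_tlb_before_set()` and `toy_install_page()` are hypothetical names that mirror the roles of `ptep_try_set()`, `flush_tlb_before_set()` and the retry loop described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_NONE	((uintptr_t)0)
#define PTE_SCRATCH	((uintptr_t)1)	/* hypothetical scratch marker */

/* Fill the slot only if it is currently none. */
static bool toy_try_set(_Atomic uintptr_t *pte, uintptr_t page)
{
	uintptr_t expected = PTE_NONE;

	return atomic_compare_exchange_strong(pte, &expected, page);
}

/* Placeholder for the TLB invalidate; nothing to flush in this model. */
static void toy_flush_tlb_before_set(void)
{
}

/*
 * Install 'page', never overwriting a valid entry in place.
 * Returns 0 on success, -1 (modelling -EBUSY) if a real page is mapped.
 */
static int toy_install_page(_Atomic uintptr_t *pte, uintptr_t page)
{
	for (;;) {
		uintptr_t cur = PTE_SCRATCH;

		if (toy_try_set(pte, page))
			return 0;		/* none -> page */
		/* Break: replace scratch with none, but only if still scratch. */
		if (atomic_compare_exchange_strong(pte, &cur, PTE_NONE)) {
			toy_flush_tlb_before_set();
			continue;		/* retry the none -> page step */
		}
		if (cur != PTE_NONE)
			return -1;		/* a real page is already there */
		/* Slot became none concurrently; just retry. */
	}
}
```

If another thread re-installs scratch between the clear and the retry, `toy_try_set()` fails again and the loop repeats the clear, which corresponds to the re-scratching case mentioned in the commit message.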
2026-06-05	bpf: Implement resizable hashmap basic functions	Mykyta Yatsenko
	Use rhashtable_lookup_likely() for lookups, rhashtable_remove_fast() for deletes, and rhashtable_lookup_get_insert_fast() for inserts. Updates modify values in place under RCU rather than allocating a new element and swapping the pointer (as regular htab does). This trades read consistency for performance: concurrent readers may see partial updates. BPF_F_LOCK support and special-field handling (timers, kptrs, etc.) follow in a later commit. Initialize rhashtable with bpf_mem_alloc element cache. Require BPF_F_NO_PREALLOC. Limit max_entries to 2^31. Free elements via rhashtable_free_and_destroy(). Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260605-rhash-v7-4-5b8e05f8630d@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05	rhashtable: Use irq work for shrinking	Herbert Xu
	Use irq work for automatic shrinking so that this may be called in NMI context. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260605-rhash-v7-3-5b8e05f8630d@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05	rhashtable: Add rhashtable_next_key() API	Mykyta Yatsenko
	Introduce a simpler iteration mechanism for rhashtable that lets the caller continue from an arbitrary position by supplying the previous key, without the per-iterator state of the rhashtable_walk_* API. void *rhashtable_next_key(struct rhashtable *ht, const void *prev_key); Caller holds RCU; passes NULL prev_key for the first element or the previously returned key to advance. Walks tbl->future_tbl chain so in-flight rehashes are observed. Best-effort: in case of concurrent resize, provides no guarantees: - may produce duplicate elements - may skip any amount of elements - termination of the loop is not guaranteed in case of sustained rehash. Callers are advised to bound the loop externally or avoid inserting new elements during such a loop. Returns ERR_PTR(-ENOENT) if prev_key is not found. Behavior on tables with duplicate keys is undefined. rhltable is not supported and returns ERR_PTR(-EOPNOTSUPP). Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/20260605-rhash-v7-1-5b8e05f8630d@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
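The calling convention of a stateless "next key" iterator can be sketched with a toy table. This is not rhashtable: `struct toy_table` is a flat array of slots invented for illustration, there is no RCU or resizing, and the `err` out-parameter merely models the `ERR_PTR(-ENOENT)` return described above. The sketch only shows how a caller advances by handing back the previously returned key instead of keeping iterator state.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Hypothetical table: a slot holds a key only when used[] is nonzero. */
struct toy_table {
	int used[TOY_SLOTS];
	int keys[TOY_SLOTS];
};

/*
 * Return a pointer to the key following *prev_key, or to the first key
 * when prev_key is NULL. Returns NULL at the end of the table.
 * *err is set to 1 when prev_key is not present (modelling -ENOENT).
 */
static const int *toy_next_key(const struct toy_table *t,
			       const int *prev_key, int *err)
{
	size_t start = 0;

	*err = 0;
	if (prev_key) {
		size_t i;

		/* Locate the previous key again; no state was kept for it. */
		for (i = 0; i < TOY_SLOTS; i++)
			if (t->used[i] && t->keys[i] == *prev_key)
				break;
		if (i == TOY_SLOTS) {
			*err = 1;
			return NULL;
		}
		start = i + 1;
	}
	for (size_t i = start; i < TOY_SLOTS; i++)
		if (t->used[i])
			return &t->keys[i];
	return NULL;
}
```

A caller loops with `k = toy_next_key(&t, k, &err)` starting from `k = NULL` and stops when the result is NULL. Re-locating the previous key on every step is what makes the iteration restartable from an arbitrary position, at the cost of the weaker guarantees under concurrent modification that the commit message lists.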
2026-06-05	netfilter: conntrack: revert ct extension genid infrastructure	Pablo Neira Ayuso
	This infrastructure is not used anymore after moving ct timeout and helper to use datapath refcount to track object use. Revert commit c56716c69ce1 ("netfilter: extensions: introduce extension genid count"). That patch disables all ct extensions (leading to NULL) for unconfirmed conntracks, when this is only targeted at ct helper and ct timeout. There is also code that dereferences the ct extension without checking for NULL, which could lead to a crash. Fixes: c56716c69ce1 ("netfilter: extensions: introduce extension genid count") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>