summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2026-05-14Merge tag 'net-7.1-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Previous releases - regressions: - ethtool: fix NULL pointer dereference in phy_reply_size - netfilter: - allocate hook ops while under mutex - close dangling table module init race - restore nf_conntrack helper propagation via expectation - tcp: - fix potential UAF in reqsk_timer_handler(). - fix out-of-bounds access for twsk in tcp_ao_established_key(). - vsock: fix empty payload in tap skb for non-linear buffers - hsr: fix NULL pointer dereference in hsr_get_node_data() - eth: - cortina: fix RX drop accounting - ice: fix locking in ice_dcb_rebuild() Previous releases - always broken: - napi: avoid gro timer misfiring at end of busypoll - sched: - dualpi2: initialize timer earlier in dualpi2_init() - sch_cbs: Call qdisc_reset for child qdisc - shaper: - fix ordering issue in net_shaper_commit() - reject handle IDs exceeding internal bit-width - ipv6: flowlabel: enforce per-netns limit for unprivileged callers - tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring - smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint - sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL - batman-adv: - reject new tp_meter sessions during teardown - purge non-released claims - eth: - i40e: cleanup PTP registration on probe failure - idpf: fix double free and use-after-free in aux device error paths - ena: fix potential use-after-free in get_timestamp" * tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) net: phy: DP83TC811: add reading of abilities net: tls: prevent chain-after-chain in plain text SG net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot macsec: use rcu_work to defer TX SA crypto cleanup out of softirq macsec: use rcu_work to defer RX SA crypto cleanup out of softirq macsec: introduce dedicated workqueue for SA crypto cleanup net: net_failover: Fix the deadlock in slave register MAINTAINERS: update atlantic driver maintainer selftests/tc-testing: Add QFQ/CBS qlen underflow test net/sched: sch_cbs: Call qdisc_reset for child qdisc FDDI: defza: Sanitise the reset safety timer net: ethernet: ravb: Do not check URAM suspension when WoL is active ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS net: atm: fix skb leak in sigd_send() default branch net: ethtool: phy: avoid NULL deref when PHY driver is unbound net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled net: shaper: reject QUEUE scope handle with missing id ...
2026-05-14ptrace: slightly saner 'get_dumpable()' logicLinus Torvalds
The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a *bit* more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-14io_uring/rsrc: add huge page accounting for registered buffersJens Axboe
Track huge page references in a per-ring xarray to prevent double accounting when the same huge page is used by multiple registered buffers, either within the same ring or across cloned rings. When registering buffers backed by huge pages, we need to account for RLIMIT_MEMLOCK. But if multiple buffers share the same huge page (common with cloned buffers), we must not account for the same page multiple times. Similarly, we must only unaccount when the last reference to a huge page is released. Maintain a per-ring xarray (hpage_acct) that tracks reference counts for each huge page. When registering a buffer, for each unique huge page, increment its accounting reference count, and only account pages that are newly added. When unregistering a buffer, for each unique huge page, decrement its refcount. Once the refcount hits zero, the page is unaccounted. Note: any account is done against the ctx->user that was assigned when the ring was setup. As before, if root is running the operation, no accounting is done. With these changes, any use of imu->acct_pages is also dead, hence kill it from struct io_mapped_ubuf. This shrinks it from 56b to 48b on a 64-bit arch. Additionally, hpage_already_acct() is gone, which was an O(M*M) scan over current + previous registrations. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-14timers: Fix flseep() typo in kernel-doc commentGitle Mikkelsen
Signed-off-by: Gitle Mikkelsen <gitlem@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260501170616.1402-1-gitlem@gmail.com
2026-05-14slab: fix kernel-docs for mm-apiMarco Elver
The mm-api kernel-docs have been disconnected from their symbols. While the scripts were previously taught to handle the _noprof suffix added by allocation tagging (in 51a7bf0238c2 "scripts/kernel-doc: drop "_noprof" on function prototypes"), this does not handle cases where the internal implementation function has an additional leading underscore. The added optional parameters (via DECL_KMALLOC_PARAMS) further complicate parsing the internal signatures. When the kernel-doc block remains above the internal implementation function but uses the public API name, the documentation generator fails to associate the documented symbol. Simply moving the docs to the macros in slab.h fixes the association but causes loss of types in the generated documentation (rendering as e.g. untyped 'kmalloc(size, flags)' macro). Fix this by: 1. Moving the kernel-doc comment blocks from slub.c to slab.h, placing them directly above the user-facing macros. 2. Providing explicit, typed C prototypes for the documented APIs inside '#if 0 /* kernel-doc */' blocks. 3. Converting the variadic macros for the documented APIs to use explicit arguments to match the documentation. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Link: https://patch.msgid.link/20260511200136.3201646-3-elver@google.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-14slab: improve KMALLOC_PARTITION_RANDOM randomnessMarco Elver
When using CONFIG_KMALLOC_PARTITION_RANDOM, _RET_IP_ was previously used to identify the allocation site. _RET_IP_, however, evaluates to the caller's parent's instruction pointer rather than the actual allocation site; this would lead to collisions where a function performs multiple allocations. With the generalization to kmalloc_token_t, we now generate the token at the outermost macro, and using _THIS_IP_ would fix this for all cases. Unfortunately, the generic implementation of _THIS_IP_ relies on taking the address of a local label, which is considered broken by both GCC [1] and Clang [2] because label addresses are only expected to be used with computed gotos. While the generic version more or less works today, it is known to be brittle. For example, Clang -O2 always returns 1 when this function is inlined: static inline unsigned long get_ip(void) { return ({ __label__ __here; __here: (unsigned long)&&__here; }); } To provide a reliable unique identifier without breaking architectures relying on the generic _THIS_IP_, introduce _CODE_LOCATION_: it resolves to _THIS_IP_ where architectures provide a safe implementation, and falls back to a zero-cost static marker where _THIS_IP_ is broken. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120071 [1] Link: https://github.com/llvm/llvm-project/issues/138272 [2] Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260511200136.3201646-2-elver@google.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-14slab: support for compiler-assisted type-based slab cache partitioningMarco Elver
Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning mode of the latter. Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature available in Clang 22 and later, called "allocation tokens" via __builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM (formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a slab cache to an allocation of type T, regardless of allocation site. The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs the compiler to infer an allocation type from arguments commonly passed to memory-allocating functions and returns a type-derived token ID. The implementation passes kmalloc-args to the builtin: the compiler performs best-effort type inference, and then recognizes common patterns such as `kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also `(T *)kmalloc(...)`. Where the compiler fails to infer a type the fallback token (default: 0) is chosen. Note: kmalloc_obj(..) APIs fix the pattern how size and result type are expressed, and therefore ensures there's not much drift in which patterns the compiler needs to recognize. Specifically, kmalloc_obj() and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the compiler recognizes via the cast to TYPE*. Clang's default token ID calculation is described as [1]: typehashpointersplit: This mode assigns a token ID based on the hash of the allocated type's name, where the top half ID-space is reserved for types that contain pointers and the bottom half for types that do not contain pointers. Separating pointer-containing objects from pointerless objects and data allocations can help mitigate certain classes of memory corruption exploits [2]: attackers who gains a buffer overflow on a primitive buffer cannot use it to directly corrupt pointers or other critical metadata in an object residing in a different, isolated heap region. It is important to note that heap isolation strategies offer a best-effort approach, and do not provide a 100% security guarantee, albeit achievable at relatively low performance cost. Note that this also does not prevent cross-cache attacks: while waiting for future features like SLAB_VIRTUAL [3] to provide physical page isolation, this feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as much as possible today. With all that, my kernel (x86 defconfig) shows me a histogram of slab cache object distribution per /proc/slabinfo (after boot): <slab cache> <objs> <hist> kmalloc-part-15 1465 ++++++++++++++ kmalloc-part-14 2988 +++++++++++++++++++++++++++++ kmalloc-part-13 1656 ++++++++++++++++ kmalloc-part-12 1045 ++++++++++ kmalloc-part-11 1697 ++++++++++++++++ kmalloc-part-10 1489 ++++++++++++++ kmalloc-part-09 965 +++++++++ kmalloc-part-08 710 +++++++ kmalloc-part-07 100 + kmalloc-part-06 217 ++ kmalloc-part-05 105 + kmalloc-part-04 4047 ++++++++++++++++++++++++++++++++++++++++ kmalloc-part-03 183 + kmalloc-part-02 283 ++ kmalloc-part-01 316 +++ kmalloc 1422 ++++++++++++++ The above /proc/slabinfo snapshot shows me there are 6673 allocated objects (slabs 00 - 07) that the compiler claims contain no pointers or it was unable to infer the type of, and 12015 objects that contain pointers (slabs 08 - 15). On a whole, this looks relatively sane. Additionally, when I compile my kernel with -Rpass=alloc-token, which provides diagnostics where (after dead-code elimination) type inference failed, I see 186 allocation sites where the compiler failed to identify a type (down from 966 when I sent the RFC [4]). Some initial review confirms these are mostly variable sized buffers, but also include structs with trailing flexible length arrays. Link: https://clang.llvm.org/docs/AllocToken.html [1] Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/ [2] Link: https://lwn.net/Articles/944647/ [3] Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/ [4] Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434 Acked-by: GONG Ruiqi <gongruiqi1@huawei.com> Co-developed-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260511200136.3201646-1-elver@google.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-05-13mm/page_alloc: fix initialization of tags of the huge zero folio with ↵David Hildenbrand (Arm)
init_on_free __GFP_ZEROTAGS semantics are currently a bit weird, but effectively this flag is only ever set alongside __GFP_ZERO and __GFP_SKIP_KASAN. If we run with init_on_free, we will zero out pages during __free_pages_prepare(), to skip zeroing on the allocation path. However, when allocating with __GFP_ZEROTAG set, post_alloc_hook() will consequently not only skip clearing page content, but also skip clearing tag memory. Not clearing tags through __GFP_ZEROTAGS is irrelevant for most pages that will get mapped to user space through set_pte_at() later: set_pte_at() and friends will detect that the tags have not been initialized yet (PG_mte_tagged not set), and initialize them. However, for the huge zero folio, which will be mapped through a PMD marked as special, this initialization will not be performed, ending up exposing whatever tags were still set for the pages. The docs (Documentation/arch/arm64/memory-tagging-extension.rst) state that allocation tags are set to 0 when a page is first mapped to user space. That no longer holds with the huge zero folio when init_on_free is enabled. Fix it by decoupling __GFP_ZEROTAGS from __GFP_ZERO, passing to tag_clear_highpages() whether we want to also clear page content. Invert the meaning of the tag_clear_highpages() return value to have clearer semantics. Reproduced with the huge zero folio by modifying the check_buffer_fill arm64/mte selftest to use a 2 MiB area, after making sure that pages have a non-0 tag set when freeing (note that, during boot, we will not actually initialize tags, but only set KASAN_TAG_KERNEL in the page flags). $ ./check_buffer_fill 1..20 ... not ok 17 Check initial tags with private mapping, sync error mode and mmap memory not ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory ... This code needs more cleanups; we'll tackle that next, like decoupling __GFP_ZEROTAGS from __GFP_SKIP_KASAN. [akpm@linux-foundation.org: s/__GPF_ZERO/__GFP_ZERO/, per David] Link: https://lore.kernel.org/20260421-zerotags-v2-1-05cb1035482e@kernel.org Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Lance Yang <lance.yang@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-13Drivers: hv: vmbus: Provide option to skip VMBus unload on panicMichael Kelley
Currently, VMBus code initiates a VMBus unload in the panic path so that if a kdump kernel is loaded, it can start fresh in setting up its own VMBus connection. However, a driver for the VMBus virtual frame buffer may need to flush dirty portions of the frame buffer back to the Hyper-V host so that panic information is visible in the graphics console. To support such flushing, provide exported functions for the frame buffer driver to specify that the VMBus unload should not be done by the VMBus driver, and to initiate the VMBus unload itself. Together these allow a frame buffer driver to delay the VMBus unload until after it has completed the flush. Ideally, the VMBus driver could use its own panic-path callback to do the unload after all frame buffer drivers have finished. But DRM frame buffer drivers use the kmsg dump callback, and there are no callbacks after that in the panic path. Hence this somewhat messy approach to properly sequencing the frame buffer flush and the VMBus unload. Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic") Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-05-13Merge tag 'sched_ext-for-7.1-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "The bulk of this is hardening of the new sub-scheduler infrastructure. - UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent sub_kset freed under a racing child, list_del_rcu on an uninitialized list head, ops->priv stomped by concurrent attach/detach, and a UAF in the init-failure error path - Task state-machine reorg closing concurrent enable-vs-dead races: a task exiting during the unlocked init window could trip NULL ops derefs or skip exit_task() cleanup - A scx_link_sched() self-deadlock on scx_sched_lock - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN cpumask without RCU, and stop rejecting BPF schedulers when only cpuset isolated partitions are active - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the failing task rather than the irq_work kthread - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes" * tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched() sched_ext: Drop NONE early return in scx_disable_and_exit_task() sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths sched_ext: Fix ops->priv clobber on concurrent attach/detach selftests/sched_ext: Fix build error in dequeue selftest sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths sched_ext: Close sub-sched init race with post-init DEAD recheck sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings sched_ext: Drop unused scx_find_sub_sched() stub sched_ext: Move scx_error() out of scx_link_sched()'s lock region
2026-05-13Merge tag 'cgroup-for-7.1-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cpuset fixes: - Partition invalidation could return CPUs still in use by sibling partitions, producing overlapping effective_cpus - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed within the same root domain - Pending DL migration state leaked into later attaches when a later can_attach() check failed - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can allocate from any node and exit quickly - dmem: propagate -ENOMEM instead of spinning forever when the fallback pool allocation also fails - selftests/cgroup: percpu test error-path leak, bogus numeric comparison of cpuset strings, and a zero-length read() that silently passed OOM-kill tests * tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Return only actually allocated CPUs during partition invalidation selftests/cgroup: Fix error path leaks in test_percpu_basic cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cgroup/cpuset: Reset DL migration state on can_attach() failure selftests/cgroup: Fix string comparison in write_test selftests/cgroup: Fix cg_read_strcmp() empty string comparison cgroup/dmem: Return -ENOMEM on failed pool preallocation cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
2026-05-13vfio/pci: Make VFIO_PCI_OFFSET_TO_INDEX() return unsignedMatt Evans
VFIO_PCI_OFFSET_TO_INDEX() is used in several places with a signed parameter (e.g. loff_t). Because it makes no sense for a BAR/resource index to be negative, enforce this in the macro. This fixes at least one current issue, where vfio_pci_ioeventfd() uses this macro with an unvalidated signed loff_t returned into a signed type, leading to a possible negative array access. This instance does test against an out-of-bounds positive value, so treating the index as unsigned fixes this issue. Fixes: 89e1f7d4c66d8 ("vfio: Add PCI device driver") Signed-off-by: Matt Evans <mattev@meta.com> Link: https://lore.kernel.org/r/20260511144642.2926799-1-mattev@meta.com Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-05-13block: pass a minsize argument to bio_iov_iter_bounceChristoph Hellwig
When bouncing for block size > PAGE_SIZE file systems that require file system block size alignment (e.g. zoned XFS), the bio needs to be big enough to fit an entire block. Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260507050153.1298375-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-13thermal: hwmon: Use extra_groups for adding temperature attributesRafael J. Wysocki
Instead of passing NULL as the last argument to __hwmon_device_register() in hwmon_device_register_for_thermal() and then adding each temperature sysfs attribute to the hwmon device via device_create_file(), redefine hwmon_device_register_for_thermal() to take an extra_groups argument that will be passed to __hwmon_device_register(), define an attribute group with a proper .is_visible() callback for the temperature attributes and a related attribute groups pointer, and pass the latter to hwmon_device_register_for_thermal(). This causes the code to be way more straightforward and closer to what the other users of the hwmon subsystem do. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/8704209.T7Z3S40VBb@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-05-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "arm64: - Add the pKVM side of the workaround for ARM's erratum 4193714, provided that the EL3 firmware does its part of the job. KVM will refuse to initialise otherwise - Correctly handle 52bit VAs for guest EL2 stage-1 translations when running under NV with E2H==0 - Correctly deal with permission faults in guest_memfd memslots - Fix the steal-time selftest after the infrastructure was reworked - Make sure the host cannot pass a non-sensical clock update to the EL2 tracing infrastructure - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390 ability to run arm64 guests, which will inevitably lead to arm64 code being directly used on s390 - Make sure that EL2 is configured with both exception entry and exit being Context Synchronization Events - Handle the current vcpu being NULL on EL2 panic - Fix the selftest_vcpu memcache being empty at the point of donation or sharing - Check that the memcache has enough capacity before engaging on the share/donate path - Fix __deactivate_fgt() to use its parameter rather than a variable in the macro context s390: - Fix array overrun with large amounts of PCI devices x86: - Never use L0's PAUSE loop exiting while L2 is running, since it's unlikely that a nested guest will help solving the hypervisor's spinlock contention - Fix emulation of MOVNTDQA - Fix typo in Xen hypercall tracepoint - Add back an optimization that was left behind when recently fixing a bug - Add module parameter to disable CET, whose implementation seems to have issues. For now it remains enabled by default Generic: - Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn() Documentation: - Update stale links Selftests: - Fix guest_memfd_test with host page size > guest page size" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits) KVM: VMX: introduce module parameter to disable CET KVM: x86: Swap the dst and src operand for MOVNTDQA KVM: x86: use again the flush argument of __link_shadow_page() KVM: selftests: Ensure gmem file sizes are multiple of host page size Documentation: kvm: update links in the references section of AMD Memory Encryption KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running KVM: x86: Fix Xen hypercall tracepoint argument assignment KVM: Reject wrapped offset in kvm_reset_dirty_gfn() KVM: arm64: Pre-check vcpu memcache for host->guest donate KVM: arm64: Pre-check vcpu memcache for host->guest share KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache KVM: arm64: Fix __deactivate_fgt macro parameter typo KVM: arm64: Guard against NULL vcpu on VHE hyp panic path KVM: arm64: Make EL2 exception entry and exit context-synchronization events MAINTAINERS: Add Steffen as reviewer for KVM/arm64 KVM: arm64: Remove potential UB on nvhe tracing clock update KVM: selftests: arm64: Fix steal_time test after UAPI refactoring KVM: arm64: Handle permission faults with guest_memfd KVM: arm64: nv: Consider the DS bit when translating TCR_EL2 KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests ...
2026-05-13Merge tag 'ib-mfd-gpio-v7.2' of ↵Bartosz Golaszewski
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd into gpio/for-next Immutable branch between MFD and GPIO due for the v7.2 merge window
2026-05-12stddef: Document designated initializer semantics for __TRAILING_OVERLAP()Gustavo A. R. Silva
Document the designated initializer behavior for overlapping storage between NAME and MEMBERS, and clarify the implications for static initialization to help avoid unintended overwrites. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/agD0R-kNbg9YMOCT@kspp Signed-off-by: Kees Cook <kees@kernel.org>
2026-05-13Merge patch series "rust: auxiliary: replace drvdata() with registration data"Danilo Krummrich
Danilo Krummrich <dakr@kernel.org> says: When drvdata() was introduced in commit 6f61a2637abe ("rust: device: introduce Device::drvdata()"), its commit message already noted that a direct accessor to the driver's bus device private data is not commonly required -- bus callbacks provide access through &self, and other entry points (IRQs, workqueues, IOCTLs, etc.) carry their own private data. The sole motivation for drvdata() was inter-driver interaction, e.g. a parent driver deriving its bus device private data from the child driver via the auxiliary bus. However, drvdata() exposes the driver's bus device private data beyond the driver's own scope. This creates ordering constraints -- drvdata may not be set yet when the first caller of drvdata() can appear -- and forces the driver's bus device private data to outlive all registrations that access it; a requirement that causes unnecessary complications. Private data should be private to the entity that issues it; bus device private data belongs to bus callbacks, class device private data to class callbacks, IRQ private data to the IRQ handler, etc. This series replaces drvdata() with a dedicated registration_data pointer on struct auxiliary_device. The parent stores its private data explicitly during registration; the data is private to the registration and lives as long as the Registration object. On teardown, Registration::drop() first triggers auxiliary_device_delete() (unbinding the child), then frees the registration data. Ordering constraints are structural -- the child's lifecycle is scoped to the registration by construction, not by convention. With no remaining use case for drvdata(), drvdata(), match_type_id(), set_type_id() and struct driver_type are removed. This is a prerequisite for [1], which builds on the removal of drvdata() to enable Higher-Ranked Lifetime Types (HRT) for Rust device drivers. [1] https://lore.kernel.org/driver-core/20260427221155.2144848-1-dakr@kernel.org/ Link: https://patch.msgid.link/20260505152400.3905096-1-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-05-12Merge tag 'probes-fixes-v7.1-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist() Since the ftrace adds its NOPs at .kprobes.text section (which stores an array), a wrong entry is added when loading a module which uses "__kprobes" attribute. To solve this, add "notrace" to __kprobes functions - test_kprobes: clear kprobes between test runs Clear all kprobes in the test program after running a test set, because Kunit test can run several times - fprobe: Fix unregister_fprobe() to wait for RCU grace period Since the fprobe data structure is removed with hlist_del_rcu(), it should wait for the RCU grace period. If the caller waits for RCU, we can use the async variant (e.g. eBPF) * tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fprobe: Fix unregister_fprobe() to wait for RCU grace period test_kprobes: clear kprobes between test runs kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
2026-05-12HID: core: introduce hid_safe_input_report()Benjamin Tissoires
hid_input_report() is used in too many places to have a commit that doesn't cross subsystem borders. Instead of changing the API, introduce a new one when things matters in the transport layers: - usbhid - i2chid This effectively revert to the old behavior for those two transport layers. Fixes: 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing bogus memset()") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2026-05-12HID: pass the buffer size to hid_report_raw_eventBenjamin Tissoires
commit 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing bogus memset()") enforced the provided data to be at least the size of the declared buffer in the report descriptor to prevent a buffer overflow. However, we can try to be smarter by providing both the buffer size and the data size, meaning that hid_report_raw_event() can make better decision whether we should plaining reject the buffer (buffer overflow attempt) or if we can safely memset it to 0 and pass it to the rest of the stack. Fixes: 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing bogus memset()") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Acked-by: Johan Hovold <johan@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2026-05-12netfs: Fix potential UAF in netfs_unlock_abandoned_read_pages()David Howells
netfs_unlock_abandoned_read_pages(rreq) accesses the index of the folios it is wanting to unlock and compares that to rreq->no_unlock_folio so that it doesn't unlock a folio being read for netfs_perform_write() or netfs_write_begin(). However, given that netfs_unlock_abandoned_read_pages() is called _after_ NETFS_RREQ_IN_PROGRESS is cleared, the one folio that it's not allowed to dereference is the one specified by ->no_unlock_folio as ownership immediately reverts to the caller. Fix this by storing the folio pointer instead and using that rather than the index. Also fix netfs_unlock_read_folio() where the same applies. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-20-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12netfs: Fix potential for tearing in ->remote_i_size and ->zero_pointDavid Howells
Fix potential tearing in using ->remote_i_size and ->zero_point by copying i_size_read() and i_size_write() and using the same seqcount as for i_size. We need to make sure that netfslib and the filesystems that use it always hold i_lock whilst updating any of the sizes to prevent i_size_seqcount from getting corrupted. Fixes: 4058f742105e ("netfs: Keep track of the actual remote file size") Fixes: 100ccd18bb41 ("netfs: Optimise away reads above the point at which there can be no data") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-6-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12netfs: Fix missing barriers when accessing stream->subrequests locklesslyDavid Howells
The list of subrequests attached to stream->subrequests is accessed without locks by netfs_collect_read_results() and netfs_collect_write_results(), and then they access subreq->flags without taking a barrier after getting the subreq pointer from the list. Relatedly, the functions that build the list don't use any sort of write barrier when constructing the list to make sure that the NETFS_SREQ_IN_PROGRESS flag is perceived to be set first if no lock is taken. Fix this by: (1) Add a new list_add_tail_release() function that uses a release barrier to set the pointer to the new member of the list. (2) Add a new list_first_entry_or_null_acquire() function that uses an acquire barrier to read the pointer to the first member in a list (or return NULL). (3) Use list_add_tail_release() when adding a subreq to ->subrequests. (4) Use list_first_entry_or_null_acquire() when initially accessing the front of the list (when an item is removed, the pointer to the new front iterm is obtained under the same lock). Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Link: https://sashiko.dev/#/patchset/20260326104544.509518-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-4-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12gpio: regmap: Support sparsed fixed directionLinus Walleij
On some regmapped GPIOs apparently only a sparser selection of the lines (not all) are actually fixed direction. Support this situation by adding an optional bitmap indicating which GPIOs are actually fixed direction and which are not. Link: https://lore.kernel.org/linux-gpio/20260501155421.3329862-10-elder@riscstar.com/ Tested-by: Alex Elder <elder@riscstar.com> Signed-off-by: Linus Walleij <linusw@kernel.org> Reviewed-by: Michael Walle <mwalle@kernel.org> Reviewed-by: Alex Elder <elder@riscstar.com> Link: https://patch.msgid.link/20260511-regmap-gpio-sparse-fixed-dir-v3-1-1429ec453be7@kernel.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
2026-05-12rhashtable: give each instance its own lockdep classChristian Brauner
syzbot reported a possible circular locking dependency between &ht->mutex and fs_reclaim: CPU0 (kswapd0) CPU1 (kworker) -------------- -------------- fs_reclaim ht->mutex shmem_evict_inode rhashtable_rehash_alloc simple_xattrs_free bucket_table_alloc(GFP_KERNEL) rhashtable_free_and_destroy __kvmalloc_node mutex_lock(&ht->mutex) might_alloc -> fs_reclaim The two halves of the splat refer to two different events on &ht->mutex. The kswapd0 path is unambiguous: shmem_evict_inode at mm/shmem.c:1429 calls simple_xattrs_free(), which calls rhashtable_free_and_destroy() on the per-inode simple_xattrs rhashtable being torn down with the inode. The previously-recorded ht->mutex -> fs_reclaim edge comes from rht_deferred_worker -> rhashtable_rehash_alloc -> bucket_table_alloc(GFP_KERNEL) -> __kvmalloc_node -> might_alloc -> fs_reclaim. That stack stops at generic library code: there is no subsystem-specific frame above rht_deferred_worker, so the splat does not identify which rhashtable's worker recorded the edge -- only that some rhashtable in the system did. Whether or not that recording happened on the same simple_xattrs ht that is now being destroyed, the predicted deadlock cannot occur: rhashtable_free_and_destroy() does cancel_work_sync(&ht->run_work) before taking ht->mutex, so the deferred worker cannot be running on the instance being torn down. If the recording was on a different rhashtable instance, the two ht->mutex acquisitions are on distinct mutex objects and cannot deadlock either. Lockdep flags a cycle regardless because mutex_init(&ht->mutex) lives on a single source line in rhashtable_init_noprof(), so every ht->mutex in the kernel shares one static lockdep class. Lockdep matches by class, not by instance, and collapses all of these into one node. Lift the lockdep key out of rhashtable_init_noprof() and into the caller. The user-visible rhashtable_init_noprof() / rhltable_init_noprof() identifiers become macros that declare a per-call-site static lock_class_key. Link: https://patch.msgid.link/20260427-work-rhashtable-lockdep-v1-1-f69e8bd91cb2@kernel.org Fixes: c6307674ed82 ("mm: kvmalloc: add non-blocking support for vmalloc") Acked-by: Michal Hocko <mhocko@suse.com> Reported-by: syzbot+5af806780f38a5fe691f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/69e798fe.050a0220.24bfd3.0032.GAE@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12bootconfig: move xbc_snprint_cmdline() to lib/bootconfig.cBreno Leitao
Move xbc_snprint_cmdline() from init/main.c to lib/bootconfig.c so the function (and its xbc_namebuf scratch buffer) becomes part of the shared parser library. tools/bootconfig already compiles lib/bootconfig.c directly, which lets a follow-up patch reuse the same renderer in the userspace tool to convert a bootconfig file into a flat cmdline string at build time. No functional change. Link: https://lore.kernel.org/all/20260508-bootconfig_using_tools-v1-1-1132219aa773@debian.org/ Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-05-11proc: handle subset=pid separately in userns visibility checksAlexey Gladkov
When procfs is mounted with subset=pid, only the dynamic process-related part of the filesystem remains visible. That part cannot be hidden by overmounts, so checking whether an existing procfs mount is fully visible does not make sense for this mode. At the same time, a subset=pid procfs mount must not be used as evidence that a later procfs mount would not reveal additional information. It provides a restricted view of procfs, not the full filesystem view. Mark subset=pid procfs instances as restricted variants. Ignore restricted variants when looking for an already-visible mount, and allow new restricted variants without consulting mnt_already_visible(). Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://patch.msgid.link/4d5e760c3d534dd2e05578d119cc408450053a98.1777278334.git.legion@kernel.org Reviewed-by: Aleksa Sarai <aleksa@amutable.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMINAlexey Gladkov
Cache the mounters credentials and allow access to the net directories contingent of the permissions of the mounter of proc. Do not show /proc/self/net when proc is mounted with subset=pid option and the mounter does not have CAP_NET_ADMIN. To avoid inadvertently allowing access to /proc/<pid>/net, updating mounter credentials is not supported. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://patch.msgid.link/d2466fe9085367f1e24693c437ecb8cff2789660.1777278334.git.legion@kernel.org Reviewed-by: Aleksa Sarai <aleksa@amutable.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTEDChristian Brauner
Whether a filesystem's mounts need to undergo a visibility check in user namespaces is a static property of the filesystem type, not a runtime property of each superblock instance. Both proc and sysfs always set SB_I_USERNS_VISIBLE on their superblocks unconditionally (sysfs does so on first creation, and subsequent mounts reuse the same superblock). Move this flag from sb->s_iflags (SB_I_USERNS_VISIBLE) to file_system_type->fs_flags (FS_USERNS_MOUNT_RESTRICTED) so the intent is expressed at the filesystem type level where it belongs. All check sites are updated to test sb->s_type->fs_flags instead of sb->s_iflags. The SB_I_NOEXEC and SB_I_NODEV flags remain on the superblock as they are runtime properties set during fill_super. Link: https://patch.msgid.link/72887c5b6204dc3adf5a53104f0be6bd8bc4f6cd.1777278334.git.legion@kernel.org Reviewed-by: Aleksa Sarai <aleksa@amutable.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11fs: RCU-ify filesystems listChristian Brauner
The drivers list was protected by an rwlock; every mount, every open of /proc/filesystems and the legacy sysfs(2) syscall walked a hand-rolled singly-linked list under it. /proc/filesystems is especially hot because libselinux causes programs as mundane as mkdir, ls and sed to open and read it on every invocation. Convert the list to an RCU-protected hlist and switch the writer side to a plain spinlock. Writers keep their existing non-sleeping section while readers walk under rcu_read_lock() with no lock traffic: - register_filesystem()/unregister_filesystem() take file_systems_lock, publish via hlist_{add_tail,del_init}_rcu() and invalidate the cached /proc/filesystems string. unregister_filesystem() keeps its synchronize_rcu() after dropping the lock so in-flight readers are drained before the module (and its embedded file_system_type) can go away. - __get_fs_type(), list_bdev_fs_names() and the fs_index()/fs_name()/fs_maxindex() helpers walk the list under rcu_read_lock(). fs_name() continues to drop the read-side lock after try_module_get() and accesses ->name outside the RCU section; the module reference pins the embedded file_system_type across the boundary. struct file_system_type::next becomes struct hlist_node list; no in-tree caller references the old ->next field outside fs/filesystems.c. Link: https://patch.msgid.link/20260425220844.1763933-3-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11proc: allow to mark /proc files permanent outside of fs/proc/Alexey Dobriyan
Add proc_make_permanent() function to mark PDE as permanent to speed up open/read/close (one alloc/free and lock/unlock less). Enable it for built-in code and for compiled-in modules. This function becomes nop magically in modular code. Note, note, note! If built-in code creates and deletes PDEs dynamically (not in init hook), then proc_make_permanent() must not be used. It is intended for simple code: static int __init xxx_module_init(void) { g_pde = proc_create_single(); proc_make_permanent(g_pde); return 0; } static void __exit xxx_module_exit(void) { remove_proc_entry(g_pde); } If module is built-in then exit hook never executed and PDE is permanent so it is OK to mark it as such. If module is module then rmmod will yank PDE, but proc_make_permanent() is nop and core /proc code will do everything right. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Link: https://patch.msgid.link/20260425220844.1763933-2-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11fs: allow lockless ->i_count bumps as long as it does not transition 0->1Mateusz Guzik
With this change only 0->1 and 1->0 transitions need the lock. I verified all places which look at the refcount either only care about it staying 0 (and have the lock enforce it) or don't hold the inode lock to begin with (making the above change irrelevant to their correcness or lack thereof). I also confirmed nfs and btrfs like to call into these a lot and now avoid the lock in the common case, shaving off some atomics. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20260421182538.1215894-4-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11fs: add icount_read_once() and stop open-coding ->i_count loadsMateusz Guzik
Similarly to inode_state_read_once(), it makes the caller spell out they acknowledge instability of the returned value. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20260421182538.1215894-2-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11cgroup/cpuset: Reserve DL bandwidth only for root-domain movesGuopeng Zhang
cpuset_can_attach() currently adds the bandwidth of all migrating SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination cpuset effective CPU masks do not overlap, the whole sum is then reserved in the destination root domain. set_cpus_allowed_dl(), however, subtracts bandwidth from the source root domain only when the affinity change really moves the task between root domains. A DL task can move between cpusets that are still in the same root domain, so including that task in sum_migrate_dl_bw can reserve destination bandwidth without a matching source-side subtraction. Share the root-domain move test with set_cpus_allowed_dl(). Keep nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL task accounting, but add to sum_migrate_dl_bw only for tasks that need a root-domain bandwidth move. Keep using the destination cpuset effective CPU mask and leave the broader can_attach()/attach() transaction model unchanged. Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11device property: initialize the remaining fields of fwnode_handle in ↵Bartosz Golaszewski
fwnode_init() If a firmware node is allocated on the stack (for instance: temporary software node whose life-time we control) or on the heap - but using a non-zeroing allocation function - and initialized using fwnode_init(), its secondary pointer will contain uninitialized memory which likely will be neither NULL nor IS_ERR() and so may end up being dereferenced (for example: in dev_to_swnode()). Set fwnode->secondary to NULL on initialization. While at it: initialize the remaining fields of struct fwnode_handle too just to be sure. Cc: stable@vger.kernel.org Fixes: 01bb86b380a3 ("driver core: Add fwnode_init()") Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Link: https://patch.msgid.link/20260511074927.9473-1-bartosz.golaszewski@oss.qualcomm.com [ Fix typo in commit message. - Danilo ] Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-05-11bpf: Fix s16 truncation for large bpf-to-bpf call offsetsYazhou Tang
Currently, the BPF instruction set allows bpf-to-bpf calls (or internal calls, pseudo calls) to use a 32-bit imm field to represent the relative jump offset. However, when JIT is disabled or falls back to the interpreter, the verifier invokes bpf_patch_call_args() to rewrite the call instruction. In this function, the 32-bit imm is downcast to s16 and stored in the off field. void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth) { stack_depth = max_t(u32, stack_depth, 1); insn->off = (s16) insn->imm; insn->imm = interpreters_args[(round_up(stack_depth, 32) / 32) - 1] - __bpf_call_base_args; insn->code = BPF_JMP | BPF_CALL_ARGS; } If the original imm exceeds the s16 range (i.e., a jump offset greater than 32767 instructions), this downcast silently truncates the offset, resulting in an incorrect call target. Fix this by: 1. In bpf_patch_call_args(), keeping the imm field unchanged and using the off field to store the index of the interpreter function. 2. In ___bpf_prog_run() for the JMP_CALL_ARGS case, retrieving the interpreter function pointer from the interpreters_args array using the off field as the index, and passing the original imm to calculate the last argument of the interpreter function. After these changes, the truncation issue is resolved, and __bpf_call_base_args is also no longer needed and can be removed, which makes the code cleaner. Performance: In ___bpf_prog_run() for the JMP_CALL_ARGS case, changing the retrieval of the interpreter function pointer from pointer addition to direct array indexing improves performance. The possible reason is that the latter has better instruction-level parallelism. See the v5 discussion [1] for more details. [1] https://lore.kernel.org/bpf/f120c3c4-6999-414a-b514-518bb64b4758@zju.edu.cn/ To avoid requiring bpftool changes, keep the new imm/off encoding internal and restore the legacy xlated dump layout in bpf_insn_prepare_dump(). For bpf-to-bpf call offsets that do not fit in s16, export off as 0 instead of a truncated and misleading value. Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter") Fixes: 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump") Suggested-by: Xu Kuohai <xukuohai@huaweicloud.com> Suggested-by: Puranjay Mohan <puranjay@kernel.org> Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Link: https://lore.kernel.org/r/20260506094714.419842-3-tangyazhou@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-11bpf: Fix out-of-bounds read in bpf_patch_call_args()Yazhou Tang
The interpreters_args array only accommodates stack depths up to MAX_BPF_STACK (512 bytes). However, do_misc_fixups() may allow a larger stack depth if JIT is requested. If JIT compilation later fails and falls back to the interpreter, the verifier invokes bpf_patch_call_args() with this oversized stack depth. This causes a load-time out-of-bounds (OOB) read when calculating the interpreter function pointer index. Fix this by changing bpf_patch_call_args() to return an int and explicitly rejecting the JIT fallback (returning -EINVAL) if the stack depth exceeds MAX_BPF_STACK. Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter") Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Acked-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260506094714.419842-2-tangyazhou@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-11tty: add missing tty_driver include to tty_port.hJohan Hovold
Include the definition of struct tty_driver in tty_port.h to keep the header self-contained and avoid build breakage in case anyone includes it before tty_driver.h. Fixes: eb3b0d92c9c3 ("tty: tty_port: add workqueue to flip TTY buffer") Cc: Xin Zhao <jackzxcui1989@163.com> Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://patch.msgid.link/20260506124323.186703-1-johan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-11memstick: Constify the driver id_tableKrzysztof Kozlowski
Just like all other driver structures, the id_table should never be modified by core subsystem parts. Constify this member and actual data structures for increased code safety. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2026-05-11iomap: remove over-strict inline data boundary checkNamjae Jeon
The current iomap_inline_data_valid() check ensures that inline data does not cross a PAGE_SIZE boundary. However, this is an unnecessarily strict constraint. If a filesystem provides a valid iomap::inline_data pointer and iomap::length, we should trust that the caller has mapped sufficient memory for the range, even if it spans across page boundaries. Removing this check allows filesystems to point directly to their internal data structures without forced page-alignment or additional redundant allocations. This remove iomap_inline_data_valid() and its callers in buffered and direct I/O paths. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Link: https://patch.msgid.link/20260511141151.6021-1-linkinjeon@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11nfs: Implement fileattr_get for case sensitivityChuck Lever
An NFS server re-exporting an NFS mount point needs to report the case sensitivity behavior of the underlying filesystem to its clients. NFSD's attribute encoder obtains that information by calling vfs_fileattr_get() on the lower filesystem, so the NFS client must implement fileattr_get to surface what it learned from its own server. The NFS client already retrieves case sensitivity information from servers during mount via PATHCONF (NFSv3) or the FATTR4_CASE_INSENSITIVE/FATTR4_CASE_PRESERVING attributes (NFSv4). Expose this information through fileattr_get by reporting the FS_XFLAG_CASEFOLD and FS_XFLAG_CASENONPRESERVING flags. NFSv2 lacks PATHCONF support, so mounts using that protocol version default to standard POSIX behavior: case-sensitive and case-preserving. PATHCONF is now invoked unconditionally for NFSv2 and NFSv3 mounts so the case-sensitivity capabilities are established even when the user pins server->namelen with the namlen= mount option. That option is orthogonal to case handling, and skipping PATHCONF because namelen was already known would leave the caps unset. The two capability bits carry opposite polarity because their POSIX defaults differ. Most servers are case-sensitive and case- preserving, matching "neither xflag set." NFS_CAP_CASE_INSENSITIVE is set only when the server affirms case insensitivity, so "server said no" and "server did not answer" both collapse to the case- sensitive default. NFS_CAP_CASE_NONPRESERVING follows the same pattern in the opposite direction: set only when the server affirms that it does not preserve case, so that silence or a missing attribute lands on the case-preserving default. The NFSv4 probe checks res.attr_bitmask[0] to distinguish "server said false" from "server omitted the attribute" before setting the bit. Both capability bits are cleared before each probe so a remount, an NFSv4 transparent state migration to a server with different case semantics, or a probe whose reply does not arrive does not retain stale capabilities from the prior probe. Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260507-case-sensitivity-v14-10-e62cc8200435@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11fs: Add case sensitivity flags to file_kattrChuck Lever
Enable upper layers such as NFSD to retrieve case sensitivity information from file systems by adding FS_XFLAG_CASEFOLD and FS_XFLAG_CASENONPRESERVING flags. Filesystems report case-insensitive or case-nonpreserving behavior by setting these flags directly in fa->fsx_xflags. The default (flags unset) indicates POSIX semantics: case-sensitive and case-preserving. Both flags are added to FS_XFLAG_RDONLY_MASK so FS_IOC_FSSETXATTR silently strips them, keeping the new xflags strictly a reporting interface. Callers that want to toggle casefolding continue to use FS_IOC_SETFLAGS with FS_CASEFOLD_FL, the established UAPI on filesystems that support the operation (ext4 and f2fs on empty directories). Case sensitivity information is exported to userspace via the fa_xflags field in the FS_IOC_FSGETXATTR ioctl and file_getattr() system call. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20260507-case-sensitivity-v14-2-e62cc8200435@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-11genirq/msi: Fix typos in msi_domain_ops commentMiles Krause
Fix spelling and possessive typos in the msi_domain_ops comment. No functional change. Signed-off-by: Miles Krause <mileskrause5200@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260505014602.5879-1-mileskrause5200@gmail.com
2026-05-11rust: auxiliary: add registration data to auxiliary devicesDanilo Krummrich
Add a registration_data pointer to struct auxiliary_device, allowing the registering (parent) driver to attach private data to the device at registration time and retrieve it later when called back by the auxiliary (child) driver. By tying the data to the device's registration, Rust drivers can bind the lifetime of device resources to it, since the auxiliary bus guarantees that the parent driver remains bound while the auxiliary device is bound. On the Rust side, Registration<T> takes ownership of the data via ForeignOwnable. A TypeId is stored alongside the data for runtime type checking, making Device::registration_data<T>() a safe method. Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://patch.msgid.link/20260505152400.3905096-3-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-05-11Merge branch 'irq/urgent' into irq/driversThomas Gleixner
to synchronize upstream fixes on which other changes depend on.
2026-05-11irqchip/gic-v5: Move LPI allocation into the LPI domainSascha Bischoff
The IPI and ITS MSI domains currently allocate and release LPIs directly, then pass the selected LPI ID to the parent LPI domain. This leaks the LPI domain's allocation policy into its child domains and forces each child to duplicate part of the parent domain's teardown. Make the LPI domain allocate LPIs in its .alloc() callback and release them in a matching .free() callback. Child domains can then request a parent interrupt without passing an implementation-specific LPI ID, and the LPI lifetime is tied to the domain that owns the LPI namespace. Remove the gicv5_alloc_lpi() and gicv5_free_lpi() wrappers now that no external caller needs to manage LPIs directly. This is a preparatory change for an actual leakage problem in the allocation code and therefore tagged with the same Fixes tag. Fixes: 0f0101325876 ("irqchip/gic-v5: Add GICv5 LPI/IPI support") Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260506093634.382062-2-sascha.bischoff@arm.com
2026-05-11Merge tag 'ib-gpio-add-gpiod-is-single-ended-for-v7.2' of ↵Bartosz Golaszewski
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux into gpio/for-next Immutable branch betweeb the GPIO and I2C trees for v7.2-rc1 - add the gpiod_is_single_ended() helper function
2026-05-11gpiolib: add gpiod_is_single_ended() helperJie Li
The direction of a single-ended (open-drain or open-source) GPIO line cannot always be reliably determined by reading hardware registers. In true open-drain implementations, the "high" state is achieved by entering a high-impedance mode, which many hardware controllers report as "input" even if the software intends to use it as an output. This creates issues for consumer drivers (like I2C) that rely on gpiod_get_direction() to decide if a line can be driven. Introduce gpiod_is_single_ended() to allow consumers to check the software configuration (GPIO_FLAG_OPEN_DRAIN/GPIO_FLAG_OPEN_SOURCE) of a descriptor. This provides a robust way to identify lines that are capable of being driven, regardless of their instantaneous hardware state. Signed-off-by: Jie Li <jie.i.li@nokia.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260511113726.49041-2-jie.i.li@nokia.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
2026-05-11Merge tag 'ib-gpio-add-fwnode-gpiod-get-for-v7.2' of ↵Bartosz Golaszewski
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux into gpio/for-next Immutable branch between the GPIO and PCI trees for v7.2 - add fwnode_gpiod_get() helper to GPIOLIB