| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Previous releases - regressions:
- ethtool: fix NULL pointer dereference in phy_reply_size
- netfilter:
- allocate hook ops while under mutex
- close dangling table module init race
- restore nf_conntrack helper propagation via expectation
- tcp:
- fix potential UAF in reqsk_timer_handler().
- fix out-of-bounds access for twsk in tcp_ao_established_key().
- vsock: fix empty payload in tap skb for non-linear buffers
- hsr: fix NULL pointer dereference in hsr_get_node_data()
- eth:
- cortina: fix RX drop accounting
- ice: fix locking in ice_dcb_rebuild()
Previous releases - always broken:
- napi: avoid gro timer misfiring at end of busypoll
- sched:
- dualpi2: initialize timer earlier in dualpi2_init()
- sch_cbs: Call qdisc_reset for child qdisc
- shaper:
- fix ordering issue in net_shaper_commit()
- reject handle IDs exceeding internal bit-width
- ipv6: flowlabel: enforce per-netns limit for unprivileged callers
- tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
- smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
- sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL
- batman-adv:
- reject new tp_meter sessions during teardown
- purge non-released claims
- eth:
- i40e: cleanup PTP registration on probe failure
- idpf: fix double free and use-after-free in aux device error paths
- ena: fix potential use-after-free in get_timestamp"
* tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
net: phy: DP83TC811: add reading of abilities
net: tls: prevent chain-after-chain in plain text SG
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
macsec: introduce dedicated workqueue for SA crypto cleanup
net: net_failover: Fix the deadlock in slave register
MAINTAINERS: update atlantic driver maintainer
selftests/tc-testing: Add QFQ/CBS qlen underflow test
net/sched: sch_cbs: Call qdisc_reset for child qdisc
FDDI: defza: Sanitise the reset safety timer
net: ethernet: ravb: Do not check URAM suspension when WoL is active
ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
net: atm: fix skb leak in sigd_send() default branch
net: ethtool: phy: avoid NULL deref when PHY driver is unbound
net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled
net: shaper: reject QUEUE scope handle with missing id
...
|
|
The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.
And almost all users do in fact use it only for the case where the task
has a mm pointer.
But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).
It's not what this flag was designed for, but it is what it is.
The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
traditional "drop capabilities" model doesn't make any difference for
this all.
Make it all make a *bit* more sense by saying that if you don't have a
MM pointer, we'll use a cached "last dumpability" flag if the thread
ever had a MM (it will be zero for kernel threads since it is never
set), and require a proper CAP_SYS_PTRACE capability to override.
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Track huge page references in a per-ring xarray to prevent double
accounting when the same huge page is used by multiple registered
buffers, either within the same ring or across cloned rings.
When registering buffers backed by huge pages, we need to account for
RLIMIT_MEMLOCK. But if multiple buffers share the same huge page (common
with cloned buffers), we must not account for the same page multiple
times. Similarly, we must only unaccount when the last reference to a
huge page is released.
Maintain a per-ring xarray (hpage_acct) that tracks reference counts for
each huge page. When registering a buffer, for each unique huge page,
increment its accounting reference count, and only account pages that
are newly added.
When unregistering a buffer, for each unique huge page, decrement its
refcount. Once the refcount hits zero, the page is unaccounted.
Note: any account is done against the ctx->user that was assigned when
the ring was setup. As before, if root is running the operation, no
accounting is done.
With these changes, any use of imu->acct_pages is also dead, hence kill
it from struct io_mapped_ubuf. This shrinks it from 56b to 48b on a
64-bit arch. Additionally, hpage_already_acct() is gone, which was an
O(M*M) scan over current + previous registrations.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Gitle Mikkelsen <gitlem@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260501170616.1402-1-gitlem@gmail.com
|
|
The mm-api kernel-docs have been disconnected from their symbols. While
the scripts were previously taught to handle the _noprof suffix added by
allocation tagging (in 51a7bf0238c2 "scripts/kernel-doc: drop "_noprof"
on function prototypes"), this does not handle cases where the internal
implementation function has an additional leading underscore. The added
optional parameters (via DECL_KMALLOC_PARAMS) further complicate parsing
the internal signatures.
When the kernel-doc block remains above the internal implementation
function but uses the public API name, the documentation generator fails
to associate the documented symbol.
Simply moving the docs to the macros in slab.h fixes the association but
causes loss of types in the generated documentation (rendering as e.g.
untyped 'kmalloc(size, flags)' macro).
Fix this by:
1. Moving the kernel-doc comment blocks from slub.c to slab.h, placing
them directly above the user-facing macros.
2. Providing explicit, typed C prototypes for the documented APIs inside
'#if 0 /* kernel-doc */' blocks.
3. Converting the variadic macros for the documented APIs to use
explicit arguments to match the documentation.
No functional change intended.
Signed-off-by: Marco Elver <elver@google.com>
Link: https://patch.msgid.link/20260511200136.3201646-3-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
|
|
When using CONFIG_KMALLOC_PARTITION_RANDOM, _RET_IP_ was previously used
to identify the allocation site. _RET_IP_, however, evaluates to the
caller's parent's instruction pointer rather than the actual allocation
site; this would lead to collisions where a function performs multiple
allocations.
With the generalization to kmalloc_token_t, we now generate the token at
the outermost macro, and using _THIS_IP_ would fix this for all cases.
Unfortunately, the generic implementation of _THIS_IP_ relies on taking
the address of a local label, which is considered broken by both GCC [1]
and Clang [2] because label addresses are only expected to be used with
computed gotos. While the generic version more or less works today, it
is known to be brittle. For example, Clang -O2 always returns 1 when
this function is inlined:
static inline unsigned long get_ip(void)
{ return ({ __label__ __here; __here: (unsigned long)&&__here; }); }
To provide a reliable unique identifier without breaking architectures
relying on the generic _THIS_IP_, introduce _CODE_LOCATION_: it resolves
to _THIS_IP_ where architectures provide a safe implementation, and
falls back to a zero-cost static marker where _THIS_IP_ is broken.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120071 [1]
Link: https://github.com/llvm/llvm-project/issues/138272 [2]
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Link: https://patch.msgid.link/20260511200136.3201646-2-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
|
|
Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more
flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning
mode of the latter.
Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature
available in Clang 22 and later, called "allocation tokens" via
__builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM
(formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a
slab cache to an allocation of type T, regardless of allocation site.
The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs
the compiler to infer an allocation type from arguments commonly passed
to memory-allocating functions and returns a type-derived token ID. The
implementation passes kmalloc-args to the builtin: the compiler performs
best-effort type inference, and then recognizes common patterns such as
`kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also
`(T *)kmalloc(...)`. Where the compiler fails to infer a type the
fallback token (default: 0) is chosen.
Note: kmalloc_obj(..) APIs fix the pattern how size and result type are
expressed, and therefore ensures there's not much drift in which
patterns the compiler needs to recognize. Specifically, kmalloc_obj()
and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the
compiler recognizes via the cast to TYPE*.
Clang's default token ID calculation is described as [1]:
typehashpointersplit: This mode assigns a token ID based on the hash
of the allocated type's name, where the top half ID-space is reserved
for types that contain pointers and the bottom half for types that do
not contain pointers.
Separating pointer-containing objects from pointerless objects and data
allocations can help mitigate certain classes of memory corruption
exploits [2]: attackers who gains a buffer overflow on a primitive
buffer cannot use it to directly corrupt pointers or other critical
metadata in an object residing in a different, isolated heap region.
It is important to note that heap isolation strategies offer a
best-effort approach, and do not provide a 100% security guarantee,
albeit achievable at relatively low performance cost. Note that this
also does not prevent cross-cache attacks: while waiting for future
features like SLAB_VIRTUAL [3] to provide physical page isolation, this
feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and
init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as
much as possible today.
With all that, my kernel (x86 defconfig) shows me a histogram of slab
cache object distribution per /proc/slabinfo (after boot):
<slab cache> <objs> <hist>
kmalloc-part-15 1465 ++++++++++++++
kmalloc-part-14 2988 +++++++++++++++++++++++++++++
kmalloc-part-13 1656 ++++++++++++++++
kmalloc-part-12 1045 ++++++++++
kmalloc-part-11 1697 ++++++++++++++++
kmalloc-part-10 1489 ++++++++++++++
kmalloc-part-09 965 +++++++++
kmalloc-part-08 710 +++++++
kmalloc-part-07 100 +
kmalloc-part-06 217 ++
kmalloc-part-05 105 +
kmalloc-part-04 4047 ++++++++++++++++++++++++++++++++++++++++
kmalloc-part-03 183 +
kmalloc-part-02 283 ++
kmalloc-part-01 316 +++
kmalloc 1422 ++++++++++++++
The above /proc/slabinfo snapshot shows me there are 6673 allocated
objects (slabs 00 - 07) that the compiler claims contain no pointers or
it was unable to infer the type of, and 12015 objects that contain
pointers (slabs 08 - 15). On a whole, this looks relatively sane.
Additionally, when I compile my kernel with -Rpass=alloc-token, which
provides diagnostics where (after dead-code elimination) type inference
failed, I see 186 allocation sites where the compiler failed to identify
a type (down from 966 when I sent the RFC [4]). Some initial review
confirms these are mostly variable sized buffers, but also include
structs with trailing flexible length arrays.
Link: https://clang.llvm.org/docs/AllocToken.html [1]
Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/ [2]
Link: https://lwn.net/Articles/944647/ [3]
Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/ [4]
Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434
Acked-by: GONG Ruiqi <gongruiqi1@huawei.com>
Co-developed-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Link: https://patch.msgid.link/20260511200136.3201646-1-elver@google.com
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
|
|
init_on_free
__GFP_ZEROTAGS semantics are currently a bit weird, but effectively this
flag is only ever set alongside __GFP_ZERO and __GFP_SKIP_KASAN.
If we run with init_on_free, we will zero out pages during
__free_pages_prepare(), to skip zeroing on the allocation path.
However, when allocating with __GFP_ZEROTAG set, post_alloc_hook() will
consequently not only skip clearing page content, but also skip clearing
tag memory.
Not clearing tags through __GFP_ZEROTAGS is irrelevant for most pages that
will get mapped to user space through set_pte_at() later: set_pte_at() and
friends will detect that the tags have not been initialized yet
(PG_mte_tagged not set), and initialize them.
However, for the huge zero folio, which will be mapped through a PMD
marked as special, this initialization will not be performed, ending up
exposing whatever tags were still set for the pages.
The docs (Documentation/arch/arm64/memory-tagging-extension.rst) state
that allocation tags are set to 0 when a page is first mapped to user
space. That no longer holds with the huge zero folio when init_on_free is
enabled.
Fix it by decoupling __GFP_ZEROTAGS from __GFP_ZERO, passing to
tag_clear_highpages() whether we want to also clear page content.
Invert the meaning of the tag_clear_highpages() return value to have
clearer semantics.
Reproduced with the huge zero folio by modifying the check_buffer_fill
arm64/mte selftest to use a 2 MiB area, after making sure that pages have
a non-0 tag set when freeing (note that, during boot, we will not actually
initialize tags, but only set KASAN_TAG_KERNEL in the page flags).
$ ./check_buffer_fill
1..20
...
not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
not ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
...
This code needs more cleanups; we'll tackle that next, like
decoupling __GFP_ZEROTAGS from __GFP_SKIP_KASAN.
[akpm@linux-foundation.org: s/__GPF_ZERO/__GFP_ZERO/, per David]
Link: https://lore.kernel.org/20260421-zerotags-v2-1-05cb1035482e@kernel.org
Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, VMBus code initiates a VMBus unload in the panic path so
that if a kdump kernel is loaded, it can start fresh in setting up its
own VMBus connection. However, a driver for the VMBus virtual frame
buffer may need to flush dirty portions of the frame buffer back to
the Hyper-V host so that panic information is visible in the graphics
console. To support such flushing, provide exported functions for the
frame buffer driver to specify that the VMBus unload should not be
done by the VMBus driver, and to initiate the VMBus unload itself.
Together these allow a frame buffer driver to delay the VMBus unload
until after it has completed the flush.
Ideally, the VMBus driver could use its own panic-path callback to do
the unload after all frame buffer drivers have finished. But DRM frame
buffer drivers use the kmsg dump callback, and there are no callbacks
after that in the panic path. Hence this somewhat messy approach to
properly sequencing the frame buffer flush and the VMBus unload.
Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic")
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"The bulk of this is hardening of the new sub-scheduler infrastructure.
- UAFs and lifecycle bugs on the sub-sched attach/detach paths:
parent sub_kset freed under a racing child, list_del_rcu on an
uninitialized list head, ops->priv stomped by concurrent
attach/detach, and a UAF in the init-failure error path
- Task state-machine reorg closing concurrent enable-vs-dead races: a
task exiting during the unlocked init window could trip NULL ops
derefs or skip exit_task() cleanup
- A scx_link_sched() self-deadlock on scx_sched_lock
- isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
cpumask without RCU, and stop rejecting BPF schedulers when only
cpuset isolated partitions are active
- PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
the failing task rather than the irq_work kthread
- Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
fixes"
* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
sched_ext: Drop NONE early return in scx_disable_and_exit_task()
sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
sched_ext: Fix ops->priv clobber on concurrent attach/detach
selftests/sched_ext: Fix build error in dequeue selftest
sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
sched_ext: Close sub-sched init race with post-init DEAD recheck
sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work
sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
sched_ext: Drop unused scx_find_sub_sched() stub
sched_ext: Move scx_error() out of scx_link_sched()'s lock region
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- cpuset fixes:
- Partition invalidation could return CPUs still in use by sibling
partitions, producing overlapping effective_cpus
- cpuset_can_attach() over-reserved DL bandwidth on moves that
stayed within the same root domain
- Pending DL migration state leaked into later attaches when a
later can_attach() check failed
- Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
allocate from any node and exit quickly
- dmem: propagate -ENOMEM instead of spinning forever when the fallback
pool allocation also fails
- selftests/cgroup: percpu test error-path leak, bogus numeric
comparison of cpuset strings, and a zero-length read() that silently
passed OOM-kill tests
* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
selftests/cgroup: Fix error path leaks in test_percpu_basic
cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cgroup/cpuset: Reset DL migration state on can_attach() failure
selftests/cgroup: Fix string comparison in write_test
selftests/cgroup: Fix cg_read_strcmp() empty string comparison
cgroup/dmem: Return -ENOMEM on failed pool preallocation
cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
|
|
VFIO_PCI_OFFSET_TO_INDEX() is used in several places with a signed
parameter (e.g. loff_t). Because it makes no sense for a BAR/resource
index to be negative, enforce this in the macro.
This fixes at least one current issue, where vfio_pci_ioeventfd() uses
this macro with an unvalidated signed loff_t returned into a signed
type, leading to a possible negative array access. This instance does
test against an out-of-bounds positive value, so treating the index as
unsigned fixes this issue.
Fixes: 89e1f7d4c66d8 ("vfio: Add PCI device driver")
Signed-off-by: Matt Evans <mattev@meta.com>
Link: https://lore.kernel.org/r/20260511144642.2926799-1-mattev@meta.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
|
|
When bouncing for block size > PAGE_SIZE file systems that require
file system block size alignment (e.g. zoned XFS), the bio needs to
be big enough to fit an entire block.
Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260507050153.1298375-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of passing NULL as the last argument to __hwmon_device_register()
in hwmon_device_register_for_thermal() and then adding each temperature
sysfs attribute to the hwmon device via device_create_file(), redefine
hwmon_device_register_for_thermal() to take an extra_groups argument
that will be passed to __hwmon_device_register(), define an attribute
group with a proper .is_visible() callback for the temperature
attributes and a related attribute groups pointer, and pass the latter
to hwmon_device_register_for_thermal().
This causes the code to be way more straightforward and closer to
what the other users of the hwmon subsystem do.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/8704209.T7Z3S40VBb@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Add the pKVM side of the workaround for ARM's erratum 4193714,
provided that the EL3 firmware does its part of the job. KVM will
refuse to initialise otherwise
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0
- Correctly deal with permission faults in guest_memfd memslots
- Fix the steal-time selftest after the infrastructure was reworked
- Make sure the host cannot pass a non-sensical clock update to the
EL2 tracing infrastructure
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events
- Handle the current vcpu being NULL on EL2 panic
- Fix the selftest_vcpu memcache being empty at the point of donation
or sharing
- Check that the memcache has enough capacity before engaging on the
share/donate path
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context
s390:
- Fix array overrun with large amounts of PCI devices
x86:
- Never use L0's PAUSE loop exiting while L2 is running, since it's
unlikely that a nested guest will help solving the hypervisor's
spinlock contention
- Fix emulation of MOVNTDQA
- Fix typo in Xen hypercall tracepoint
- Add back an optimization that was left behind when recently fixing
a bug
- Add module parameter to disable CET, whose implementation seems to
have issues. For now it remains enabled by default
Generic:
- Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()
Documentation:
- Update stale links
Selftests:
- Fix guest_memfd_test with host page size > guest page size"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
KVM: VMX: introduce module parameter to disable CET
KVM: x86: Swap the dst and src operand for MOVNTDQA
KVM: x86: use again the flush argument of __link_shadow_page()
KVM: selftests: Ensure gmem file sizes are multiple of host page size
Documentation: kvm: update links in the references section of AMD Memory Encryption
KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
KVM: x86: Fix Xen hypercall tracepoint argument assignment
KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
KVM: arm64: Make EL2 exception entry and exit context-synchronization events
MAINTAINERS: Add Steffen as reviewer for KVM/arm64
KVM: arm64: Remove potential UB on nvhe tracing clock update
KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
KVM: arm64: Handle permission faults with guest_memfd
KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd into gpio/for-next
Immutable branch between MFD and GPIO due for the v7.2 merge window
|
|
Document the designated initializer behavior for overlapping storage
between NAME and MEMBERS, and clarify the implications for static
initialization to help avoid unintended overwrites.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/agD0R-kNbg9YMOCT@kspp
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Danilo Krummrich <dakr@kernel.org> says:
When drvdata() was introduced in commit 6f61a2637abe ("rust: device: introduce
Device::drvdata()"), its commit message already noted that a direct accessor to
the driver's bus device private data is not commonly required -- bus callbacks
provide access through &self, and other entry points (IRQs, workqueues, IOCTLs,
etc.) carry their own private data.
The sole motivation for drvdata() was inter-driver interaction, e.g. a parent
driver deriving its bus device private data from the child driver via the
auxiliary bus.
However, drvdata() exposes the driver's bus device private data beyond the
driver's own scope. This creates ordering constraints -- drvdata may not be set
yet when the first caller of drvdata() can appear -- and forces the driver's bus
device private data to outlive all registrations that access it; a requirement
that causes unnecessary complications.
Private data should be private to the entity that issues it; bus device private
data belongs to bus callbacks, class device private data to class callbacks, IRQ
private data to the IRQ handler, etc.
This series replaces drvdata() with a dedicated registration_data pointer on
struct auxiliary_device. The parent stores its private data explicitly during
registration; the data is private to the registration and lives as long as the
Registration object.
On teardown, Registration::drop() first triggers auxiliary_device_delete()
(unbinding the child), then frees the registration data. Ordering constraints
are structural -- the child's lifecycle is scoped to the registration by
construction, not by convention.
With no remaining use case for drvdata(), drvdata(), match_type_id(),
set_type_id() and struct driver_type are removed.
This is a prerequisite for [1], which builds on the removal of drvdata() to
enable Higher-Ranked Lifetime Types (HRT) for Rust device drivers.
[1] https://lore.kernel.org/driver-core/20260427221155.2144848-1-dakr@kernel.org/
Link: https://patch.msgid.link/20260505152400.3905096-1-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
Since the ftrace adds its NOPs at .kprobes.text section (which stores
an array), a wrong entry is added when loading a module which uses
"__kprobes" attribute.
To solve this, add "notrace" to __kprobes functions
- test_kprobes: clear kprobes between test runs
Clear all kprobes in the test program after running a test set,
because Kunit test can run several times
- fprobe: Fix unregister_fprobe() to wait for RCU grace period
Since the fprobe data structure is removed with hlist_del_rcu(), it
should wait for the RCU grace period. If the caller waits for RCU, we
can use the async variant (e.g. eBPF)
* tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fprobe: Fix unregister_fprobe() to wait for RCU grace period
test_kprobes: clear kprobes between test runs
kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
|
|
hid_input_report() is used in too many places to have a commit that
doesn't cross subsystem borders. Instead of changing the API, introduce
a new one when things matters in the transport layers:
- usbhid
- i2chid
This effectively revert to the old behavior for those two transport
layers.
Fixes: 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing bogus memset()")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
|
|
commit 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing
bogus memset()") enforced the provided data to be at least the size of
the declared buffer in the report descriptor to prevent a buffer
overflow. However, we can try to be smarter by providing both the buffer
size and the data size, meaning that hid_report_raw_event() can make
better decision whether we should plaining reject the buffer (buffer
overflow attempt) or if we can safely memset it to 0 and pass it to the
rest of the stack.
Fixes: 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing bogus memset()")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Acked-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
|
|
netfs_unlock_abandoned_read_pages(rreq) accesses the index of the folios it
is wanting to unlock and compares that to rreq->no_unlock_folio so that it
doesn't unlock a folio being read for netfs_perform_write() or
netfs_write_begin().
However, given that netfs_unlock_abandoned_read_pages() is called _after_
NETFS_RREQ_IN_PROGRESS is cleared, the one folio that it's not allowed to
dereference is the one specified by ->no_unlock_folio as ownership
immediately reverts to the caller.
Fix this by storing the folio pointer instead and using that rather than
the index. Also fix netfs_unlock_read_folio() where the same applies.
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20260512123404.719402-20-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix potential tearing in using ->remote_i_size and ->zero_point by copying
i_size_read() and i_size_write() and using the same seqcount as for i_size.
We need to make sure that netfslib and the filesystems that use it always
hold i_lock whilst updating any of the sizes to prevent i_size_seqcount
from getting corrupted.
Fixes: 4058f742105e ("netfs: Keep track of the actual remote file size")
Fixes: 100ccd18bb41 ("netfs: Optimise away reads above the point at which there can be no data")
Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20260512123404.719402-6-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The list of subrequests attached to stream->subrequests is accessed without
locks by netfs_collect_read_results() and netfs_collect_write_results(),
and then they access subreq->flags without taking a barrier after getting
the subreq pointer from the list. Relatedly, the functions that build the
list don't use any sort of write barrier when constructing the list to make
sure that the NETFS_SREQ_IN_PROGRESS flag is perceived to be set first if
no lock is taken.
Fix this by:
(1) Add a new list_add_tail_release() function that uses a release barrier
to set the pointer to the new member of the list.
(2) Add a new list_first_entry_or_null_acquire() function that uses an
acquire barrier to read the pointer to the first member in a list (or
return NULL).
(3) Use list_add_tail_release() when adding a subreq to ->subrequests.
(4) Use list_first_entry_or_null_acquire() when initially accessing the
front of the list (when an item is removed, the pointer to the new
front iterm is obtained under the same lock).
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Link: https://sashiko.dev/#/patchset/20260326104544.509518-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20260512123404.719402-4-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
On some regmapped GPIOs apparently only a sparser selection of the lines
(not all) are actually fixed direction.
Support this situation by adding an optional bitmap indicating which
GPIOs are actually fixed direction and which are not.
Link: https://lore.kernel.org/linux-gpio/20260501155421.3329862-10-elder@riscstar.com/
Tested-by: Alex Elder <elder@riscstar.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Reviewed-by: Michael Walle <mwalle@kernel.org>
Reviewed-by: Alex Elder <elder@riscstar.com>
Link: https://patch.msgid.link/20260511-regmap-gpio-sparse-fixed-dir-v3-1-1429ec453be7@kernel.org
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
|
|
syzbot reported a possible circular locking dependency between
&ht->mutex and fs_reclaim:
CPU0 (kswapd0) CPU1 (kworker)
-------------- --------------
fs_reclaim ht->mutex
shmem_evict_inode rhashtable_rehash_alloc
simple_xattrs_free bucket_table_alloc(GFP_KERNEL)
rhashtable_free_and_destroy __kvmalloc_node
mutex_lock(&ht->mutex) might_alloc -> fs_reclaim
The two halves of the splat refer to two different events on
&ht->mutex.
The kswapd0 path is unambiguous: shmem_evict_inode at mm/shmem.c:1429
calls simple_xattrs_free(), which calls rhashtable_free_and_destroy()
on the per-inode simple_xattrs rhashtable being torn down with the
inode.
The previously-recorded ht->mutex -> fs_reclaim edge comes from
rht_deferred_worker -> rhashtable_rehash_alloc ->
bucket_table_alloc(GFP_KERNEL) -> __kvmalloc_node ->
might_alloc -> fs_reclaim. That stack stops at generic library code:
there is no subsystem-specific frame above rht_deferred_worker, so
the splat does not identify which rhashtable's worker recorded the
edge -- only that some rhashtable in the system did.
Whether or not that recording happened on the same simple_xattrs ht
that is now being destroyed, the predicted deadlock cannot occur:
rhashtable_free_and_destroy() does cancel_work_sync(&ht->run_work)
before taking ht->mutex, so the deferred worker cannot be running on
the instance being torn down. If the recording was on a different
rhashtable instance, the two ht->mutex acquisitions are on distinct
mutex objects and cannot deadlock either.
Lockdep flags a cycle regardless because mutex_init(&ht->mutex) lives
on a single source line in rhashtable_init_noprof(), so every
ht->mutex in the kernel shares one static lockdep class. Lockdep
matches by class, not by instance, and collapses all of these into
one node.
Lift the lockdep key out of rhashtable_init_noprof() and into the
caller. The user-visible rhashtable_init_noprof() /
rhltable_init_noprof() identifiers become macros that declare a
per-call-site static lock_class_key.
Link: https://patch.msgid.link/20260427-work-rhashtable-lockdep-v1-1-f69e8bd91cb2@kernel.org
Fixes: c6307674ed82 ("mm: kvmalloc: add non-blocking support for vmalloc")
Acked-by: Michal Hocko <mhocko@suse.com>
Reported-by: syzbot+5af806780f38a5fe691f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/69e798fe.050a0220.24bfd3.0032.GAE@google.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move xbc_snprint_cmdline() from init/main.c to lib/bootconfig.c so the
function (and its xbc_namebuf scratch buffer) becomes part of the shared
parser library. tools/bootconfig already compiles lib/bootconfig.c
directly, which lets a follow-up patch reuse the same renderer in the
userspace tool to convert a bootconfig file into a flat cmdline string
at build time.
No functional change.
Link: https://lore.kernel.org/all/20260508-bootconfig_using_tools-v1-1-1132219aa773@debian.org/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
When procfs is mounted with subset=pid, only the dynamic process-related
part of the filesystem remains visible. That part cannot be hidden by
overmounts, so checking whether an existing procfs mount is fully
visible does not make sense for this mode.
At the same time, a subset=pid procfs mount must not be used as evidence
that a later procfs mount would not reveal additional information. It
provides a restricted view of procfs, not the full filesystem view.
Mark subset=pid procfs instances as restricted variants. Ignore
restricted variants when looking for an already-visible mount, and allow
new restricted variants without consulting mnt_already_visible().
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://patch.msgid.link/4d5e760c3d534dd2e05578d119cc408450053a98.1777278334.git.legion@kernel.org
Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Cache the mounters credentials and allow access to the net directories
contingent of the permissions of the mounter of proc.
Do not show /proc/self/net when proc is mounted with subset=pid option
and the mounter does not have CAP_NET_ADMIN. To avoid inadvertently
allowing access to /proc/<pid>/net, updating mounter credentials is not
supported.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://patch.msgid.link/d2466fe9085367f1e24693c437ecb8cff2789660.1777278334.git.legion@kernel.org
Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Whether a filesystem's mounts need to undergo a visibility check in user
namespaces is a static property of the filesystem type, not a runtime
property of each superblock instance. Both proc and sysfs always set
SB_I_USERNS_VISIBLE on their superblocks unconditionally (sysfs does so
on first creation, and subsequent mounts reuse the same superblock).
Move this flag from sb->s_iflags (SB_I_USERNS_VISIBLE) to
file_system_type->fs_flags (FS_USERNS_MOUNT_RESTRICTED) so the intent
is expressed at the filesystem type level where it belongs.
All check sites are updated to test sb->s_type->fs_flags instead of
sb->s_iflags. The SB_I_NOEXEC and SB_I_NODEV flags remain on the
superblock as they are runtime properties set during fill_super.
Link: https://patch.msgid.link/72887c5b6204dc3adf5a53104f0be6bd8bc4f6cd.1777278334.git.legion@kernel.org
Reviewed-by: Aleksa Sarai <aleksa@amutable.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The drivers list was protected by an rwlock; every mount, every open
of /proc/filesystems and the legacy sysfs(2) syscall walked a
hand-rolled singly-linked list under it. /proc/filesystems is
especially hot because libselinux causes programs as mundane as
mkdir, ls and sed to open and read it on every invocation.
Convert the list to an RCU-protected hlist and switch the writer side
to a plain spinlock. Writers keep their existing non-sleeping
section while readers walk under rcu_read_lock() with no lock traffic:
- register_filesystem()/unregister_filesystem() take
file_systems_lock, publish via hlist_{add_tail,del_init}_rcu()
and invalidate the cached /proc/filesystems string.
unregister_filesystem() keeps its synchronize_rcu() after
dropping the lock so in-flight readers are drained before the
module (and its embedded file_system_type) can go away.
- __get_fs_type(), list_bdev_fs_names() and the
fs_index()/fs_name()/fs_maxindex() helpers walk the list under
rcu_read_lock(). fs_name() continues to drop the read-side
lock after try_module_get() and accesses ->name outside the RCU
section; the module reference pins the embedded file_system_type
across the boundary.
struct file_system_type::next becomes struct hlist_node list; no
in-tree caller references the old ->next field outside
fs/filesystems.c.
Link: https://patch.msgid.link/20260425220844.1763933-3-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add proc_make_permanent() function to mark PDE as permanent to speed up
open/read/close (one alloc/free and lock/unlock less).
Enable it for built-in code and for compiled-in modules.
This function becomes nop magically in modular code.
Note, note, note!
If built-in code creates and deletes PDEs dynamically (not in init
hook), then proc_make_permanent() must not be used.
It is intended for simple code:
static int __init xxx_module_init(void)
{
g_pde = proc_create_single();
proc_make_permanent(g_pde);
return 0;
}
static void __exit xxx_module_exit(void)
{
remove_proc_entry(g_pde);
}
If module is built-in then exit hook never executed and PDE is
permanent so it is OK to mark it as such.
If module is module then rmmod will yank PDE, but proc_make_permanent()
is nop and core /proc code will do everything right.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Link: https://patch.msgid.link/20260425220844.1763933-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With this change only 0->1 and 1->0 transitions need the lock.
I verified all places which look at the refcount either only care about
it staying 0 (and have the lock enforce it) or don't hold the inode lock
to begin with (making the above change irrelevant to their correcness or
lack thereof).
I also confirmed nfs and btrfs like to call into these a lot and now
avoid the lock in the common case, shaving off some atomics.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20260421182538.1215894-4-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Similarly to inode_state_read_once(), it makes the caller spell out
they acknowledge instability of the returned value.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20260421182538.1215894-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
cpuset_can_attach() currently adds the bandwidth of all migrating
SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination
cpuset effective CPU masks do not overlap, the whole sum is then
reserved in the destination root domain.
set_cpus_allowed_dl(), however, subtracts bandwidth from the source
root domain only when the affinity change really moves the task between
root domains. A DL task can move between cpusets that are still in the
same root domain, so including that task in sum_migrate_dl_bw can reserve
destination bandwidth without a matching source-side subtraction.
Share the root-domain move test with set_cpus_allowed_dl(). Keep
nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL
task accounting, but add to sum_migrate_dl_bw only for tasks that need a
root-domain bandwidth move. Keep using the destination cpuset effective
CPU mask and leave the broader can_attach()/attach() transaction model
unchanged.
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails")
Cc: stable@vger.kernel.org # v6.10+
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
fwnode_init()
If a firmware node is allocated on the stack (for instance: temporary
software node whose life-time we control) or on the heap - but using a
non-zeroing allocation function - and initialized using fwnode_init(),
its secondary pointer will contain uninitialized memory which likely
will be neither NULL nor IS_ERR() and so may end up being dereferenced
(for example: in dev_to_swnode()). Set fwnode->secondary to NULL on
initialization. While at it: initialize the remaining fields of struct
fwnode_handle too just to be sure.
Cc: stable@vger.kernel.org
Fixes: 01bb86b380a3 ("driver core: Add fwnode_init()")
Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260511074927.9473-1-bartosz.golaszewski@oss.qualcomm.com
[ Fix typo in commit message. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
Currently, the BPF instruction set allows bpf-to-bpf calls (or internal
calls, pseudo calls) to use a 32-bit imm field to represent the relative
jump offset.
However, when JIT is disabled or falls back to the interpreter, the
verifier invokes bpf_patch_call_args() to rewrite the call instruction.
In this function, the 32-bit imm is downcast to s16 and stored in the off
field.
void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth)
{
stack_depth = max_t(u32, stack_depth, 1);
insn->off = (s16) insn->imm;
insn->imm = interpreters_args[(round_up(stack_depth, 32) / 32) - 1] -
__bpf_call_base_args;
insn->code = BPF_JMP | BPF_CALL_ARGS;
}
If the original imm exceeds the s16 range (i.e., a jump offset greater
than 32767 instructions), this downcast silently truncates the offset,
resulting in an incorrect call target.
Fix this by:
1. In bpf_patch_call_args(), keeping the imm field unchanged and using the
off field to store the index of the interpreter function.
2. In ___bpf_prog_run() for the JMP_CALL_ARGS case, retrieving the
interpreter function pointer from the interpreters_args array using the
off field as the index, and passing the original imm to calculate the
last argument of the interpreter function.
After these changes, the truncation issue is resolved, and __bpf_call_base_args
is also no longer needed and can be removed, which makes the code cleaner.
Performance: In ___bpf_prog_run() for the JMP_CALL_ARGS case, changing the
retrieval of the interpreter function pointer from pointer addition to
direct array indexing improves performance. The possible reason is that the
latter has better instruction-level parallelism. See the v5 discussion [1]
for more details.
[1] https://lore.kernel.org/bpf/f120c3c4-6999-414a-b514-518bb64b4758@zju.edu.cn/
To avoid requiring bpftool changes, keep the new imm/off encoding internal
and restore the legacy xlated dump layout in bpf_insn_prepare_dump().
For bpf-to-bpf call offsets that do not fit in s16, export off as 0 instead
of a truncated and misleading value.
Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter")
Fixes: 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump")
Suggested-by: Xu Kuohai <xukuohai@huaweicloud.com>
Suggested-by: Puranjay Mohan <puranjay@kernel.org>
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Link: https://lore.kernel.org/r/20260506094714.419842-3-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The interpreters_args array only accommodates stack depths up to
MAX_BPF_STACK (512 bytes). However, do_misc_fixups() may allow a larger
stack depth if JIT is requested.
If JIT compilation later fails and falls back to the interpreter, the
verifier invokes bpf_patch_call_args() with this oversized stack depth.
This causes a load-time out-of-bounds (OOB) read when calculating the
interpreter function pointer index.
Fix this by changing bpf_patch_call_args() to return an int and explicitly
rejecting the JIT fallback (returning -EINVAL) if the stack depth exceeds
MAX_BPF_STACK.
Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter")
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Acked-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260506094714.419842-2-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Include the definition of struct tty_driver in tty_port.h to keep the
header self-contained and avoid build breakage in case anyone includes
it before tty_driver.h.
Fixes: eb3b0d92c9c3 ("tty: tty_port: add workqueue to flip TTY buffer")
Cc: Xin Zhao <jackzxcui1989@163.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260506124323.186703-1-johan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Just like all other driver structures, the id_table should never be
modified by core subsystem parts. Constify this member and actual data
structures for increased code safety.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
The current iomap_inline_data_valid() check ensures that inline data
does not cross a PAGE_SIZE boundary. However, this is an unnecessarily
strict constraint. If a filesystem provides a valid iomap::inline_data
pointer and iomap::length, we should trust that the caller has mapped
sufficient memory for the range, even if it spans across page boundaries.
Removing this check allows filesystems to point directly to their
internal data structures without forced page-alignment or additional
redundant allocations. This remove iomap_inline_data_valid() and
its callers in buffered and direct I/O paths.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Link: https://patch.msgid.link/20260511141151.6021-1-linkinjeon@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
An NFS server re-exporting an NFS mount point needs to report
the case sensitivity behavior of the underlying filesystem to
its clients. NFSD's attribute encoder obtains that information
by calling vfs_fileattr_get() on the lower filesystem, so the
NFS client must implement fileattr_get to surface what it
learned from its own server.
The NFS client already retrieves case sensitivity information
from servers during mount via PATHCONF (NFSv3) or the
FATTR4_CASE_INSENSITIVE/FATTR4_CASE_PRESERVING attributes
(NFSv4). Expose this information through fileattr_get by
reporting the FS_XFLAG_CASEFOLD and FS_XFLAG_CASENONPRESERVING
flags. NFSv2 lacks PATHCONF support, so mounts using that protocol
version default to standard POSIX behavior: case-sensitive and
case-preserving.
PATHCONF is now invoked unconditionally for NFSv2 and NFSv3 mounts
so the case-sensitivity capabilities are established even when the
user pins server->namelen with the namlen= mount option. That option
is orthogonal to case handling, and skipping PATHCONF because
namelen was already known would leave the caps unset.
The two capability bits carry opposite polarity because their POSIX
defaults differ. Most servers are case-sensitive and case-
preserving, matching "neither xflag set." NFS_CAP_CASE_INSENSITIVE
is set only when the server affirms case insensitivity, so "server
said no" and "server did not answer" both collapse to the case-
sensitive default. NFS_CAP_CASE_NONPRESERVING follows the same
pattern in the opposite direction: set only when the server affirms
that it does not preserve case, so that silence or a missing
attribute lands on the case-preserving default. The NFSv4 probe
checks res.attr_bitmask[0] to distinguish "server said false" from
"server omitted the attribute" before setting the bit.
Both capability bits are cleared before each probe so a remount,
an NFSv4 transparent state migration to a server with different
case semantics, or a probe whose reply does not arrive does not
retain stale capabilities from the prior probe.
Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260507-case-sensitivity-v14-10-e62cc8200435@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Enable upper layers such as NFSD to retrieve case sensitivity
information from file systems by adding FS_XFLAG_CASEFOLD and
FS_XFLAG_CASENONPRESERVING flags.
Filesystems report case-insensitive or case-nonpreserving behavior
by setting these flags directly in fa->fsx_xflags. The default
(flags unset) indicates POSIX semantics: case-sensitive and
case-preserving. Both flags are added to FS_XFLAG_RDONLY_MASK so
FS_IOC_FSSETXATTR silently strips them, keeping the new xflags
strictly a reporting interface. Callers that want to toggle
casefolding continue to use FS_IOC_SETFLAGS with FS_CASEFOLD_FL,
the established UAPI on filesystems that support the operation
(ext4 and f2fs on empty directories).
Case sensitivity information is exported to userspace via the
fa_xflags field in the FS_IOC_FSGETXATTR ioctl and file_getattr()
system call.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260507-case-sensitivity-v14-2-e62cc8200435@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix spelling and possessive typos in the msi_domain_ops comment.
No functional change.
Signed-off-by: Miles Krause <mileskrause5200@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260505014602.5879-1-mileskrause5200@gmail.com
|
|
Add a registration_data pointer to struct auxiliary_device, allowing the
registering (parent) driver to attach private data to the device at
registration time and retrieve it later when called back by the
auxiliary (child) driver.
By tying the data to the device's registration, Rust drivers can bind
the lifetime of device resources to it, since the auxiliary bus
guarantees that the parent driver remains bound while the auxiliary
device is bound.
On the Rust side, Registration<T> takes ownership of the data via
ForeignOwnable. A TypeId is stored alongside the data for runtime type
checking, making Device::registration_data<T>() a safe method.
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20260505152400.3905096-3-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
to synchronize upstream fixes on which other changes depend on.
|
|
The IPI and ITS MSI domains currently allocate and release LPIs
directly, then pass the selected LPI ID to the parent LPI domain. This
leaks the LPI domain's allocation policy into its child domains and
forces each child to duplicate part of the parent domain's teardown.
Make the LPI domain allocate LPIs in its .alloc() callback and release
them in a matching .free() callback. Child domains can then request a
parent interrupt without passing an implementation-specific LPI ID,
and the LPI lifetime is tied to the domain that owns the LPI
namespace.
Remove the gicv5_alloc_lpi() and gicv5_free_lpi() wrappers now that no
external caller needs to manage LPIs directly.
This is a preparatory change for an actual leakage problem in the
allocation code and therefore tagged with the same Fixes tag.
Fixes: 0f0101325876 ("irqchip/gic-v5: Add GICv5 LPI/IPI support")
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260506093634.382062-2-sascha.bischoff@arm.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux into gpio/for-next
Immutable branch betweeb the GPIO and I2C trees for v7.2-rc1
- add the gpiod_is_single_ended() helper function
|
|
The direction of a single-ended (open-drain or open-source) GPIO line
cannot always be reliably determined by reading hardware registers.
In true open-drain implementations, the "high" state is achieved by
entering a high-impedance mode, which many hardware controllers report
as "input" even if the software intends to use it as an output.
This creates issues for consumer drivers (like I2C) that rely on
gpiod_get_direction() to decide if a line can be driven.
Introduce gpiod_is_single_ended() to allow consumers to check the
software configuration (GPIO_FLAG_OPEN_DRAIN/GPIO_FLAG_OPEN_SOURCE) of
a descriptor. This provides a robust way to identify lines that are
capable of being driven, regardless of their instantaneous hardware state.
Signed-off-by: Jie Li <jie.i.li@nokia.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/20260511113726.49041-2-jie.i.li@nokia.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux into gpio/for-next
Immutable branch between the GPIO and PCI trees for v7.2
- add fwnode_gpiod_get() helper to GPIOLIB
|