| Age | Commit message | Author |
|
Both the compilation of kernel/time/vsyscall.c, which contains the real
definition of update_vsyscall(), and the other vDSO definitions in
timekeeper_internal.h use CONFIG_GENERIC_GETTIMEOFDAY and not
CONFIG_GENERIC_TIME_VSYSCALL.
Align the code to use a single Kconfig symbol.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260519-vdso-generic_time_vsyscal-v1-2-5c2a5905d5f5@linutronix.de
|
|
The dyntick-idle steal time is currently accounted when the tick restarts
but the stolen idle time is not subtracted from the idle time that was
already accounted. This is to avoid observing the idle time going backward
as the dyntick-idle cputime accessors can't reliably know in advance the
stolen idle time.
In order to maintain a forward progressing idle cputime while subtracting
idle steal time from it, keep track of the previously accounted idle stolen
time and subtract it from _later_ idle cputime accounting.
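The carry-over scheme described above can be modeled in a few lines of
userspace C. This is only a sketch under stated assumptions: struct
idle_acct, idle_acct_steal() and idle_acct_add() are hypothetical names
that do not come from the patch, and the real code operates on per-CPU
kernel stat fields rather than on a standalone structure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not the kernel's data structure. */
struct idle_acct {
	uint64_t idle_ns;        /* idle cputime reported so far, never decreases */
	uint64_t steal_carry_ns; /* idle steal time seen but not yet subtracted */
};

/* Record steal time discovered after the idle time was already accounted. */
static void idle_acct_steal(struct idle_acct *a, uint64_t steal_ns)
{
	assert(a);
	a->steal_carry_ns += steal_ns;
}

/* Account a new idle delta, paying back as much carried steal as it covers. */
static void idle_acct_add(struct idle_acct *a, uint64_t delta_ns)
{
	uint64_t sub;

	assert(a);
	sub = delta_ns < a->steal_carry_ns ? delta_ns : a->steal_carry_ns;
	a->steal_carry_ns -= sub;
	a->idle_ns += delta_ns - sub;
}
```

Because the steal time is only ever deducted from a later delta, a reader
that samples idle_ns between two calls never sees it decrease.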
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-16-frederic@kernel.org
|
|
The last reason why get_cpu_idle/iowait_time_us() may return -1 now is if
the config doesn't support nohz.
The ad-hoc replacement in cpufreq computes jiffies minus the whole busy
cputime. Although this is intended to provide a coherent low resolution
estimate of the idle and iowait time, the implementation is buggy
because jiffies don't start at 0.
Provide a real get_cpu_[idle|iowait]_time_us() off-case instead.
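To make the size of the error concrete, here is a small userspace model.
It is a sketch under stated assumptions: HZ is taken as 1000, the 64-bit
jiffies counter is assumed to start at INITIAL_JIFFIES (defined below the
way the kernel headers define it), and both idle_estimate_*() helpers are
hypothetical names. The second helper only illustrates the magnitude of
the offset; it is not the fix this commit applies.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 1000 /* assumption: depends on the kernel configuration */

/* The 32-bit jiffies value starts five minutes before it would wrap. */
#define INITIAL_JIFFIES ((uint64_t)(uint32_t)(-300 * HZ))

/* Hypothetical stand-in for the fallback: wall time in jiffies minus busy time. */
static uint64_t idle_estimate_jiffies(uint64_t jiffies64, uint64_t busy_jiffies)
{
	assert(jiffies64 >= busy_jiffies);
	return jiffies64 - busy_jiffies;
}

/* Same estimate with the boot offset removed, for comparison only. */
static uint64_t idle_estimate_since_boot(uint64_t jiffies64, uint64_t busy_jiffies)
{
	assert(jiffies64 >= INITIAL_JIFFIES);
	return idle_estimate_jiffies(jiffies64 - INITIAL_JIFFIES, busy_jiffies);
}
```

One second after boot with 400 busy jiffies, the first helper reports
roughly 4.29 billion idle jiffies while the second reports 600.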
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-14-frederic@kernel.org
|
|
Fetching the idle cputime is available through a variety of accessors all
over the place depending on the different accounting flavours and needs:
- idle vtime generic accounting can be accessed by kcpustat_field(),
kcpustat_cpu_fetch(), get_idle/iowait_time() and
get_cpu_idle/iowait_time_us()
- dynticks-idle accounting can only be accessed by get_idle/iowait_time()
or get_cpu_idle/iowait_time_us()
- CONFIG_NO_HZ_COMMON=n idle accounting can be accessed by kcpustat_field(),
kcpustat_cpu_fetch(), or get_idle/iowait_time() but not by
get_cpu_idle/iowait_time_us()
Moreover get_idle/iowait_time() relies on get_cpu_idle/iowait_time_us()
with a nonsensical conversion to microseconds and back to nanoseconds on
the way.
Start consolidating the APIs by removing get_idle/iowait_time() and making
kcpustat_field() and kcpustat_cpu_fetch() work for all cases.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-13-frederic@kernel.org
|
|
Although the dynticks-idle cputime accounting is necessarily tied to the
tick subsystem, the actual related accounting code has no business residing
there and should be part of the scheduler cputime code.
Move away the relevant pieces and state machine to where they belong.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-10-frederic@kernel.org
|
|
The non-vtime dynticks-idle cputime accounting is a big mess that
accumulates within two concurrent statistics, each having their own
shortcomings:
* The accounting for online CPUs which is based on the delta between
  tick_nohz_start_idle() and tick_nohz_stop_idle().
  Pros:
  - Works when the tick is off
  - Has nsecs granularity
  Cons:
  - Accounts idle steal time but doesn't subtract it from idle
    cputime.
  - Assumes CONFIG_IRQ_TIME_ACCOUNTING by not accounting IRQs but
    the IRQ time is simply ignored when
    CONFIG_IRQ_TIME_ACCOUNTING=n
  - The windows between 1) idle task scheduling and the first call
    to tick_nohz_start_idle() and 2) idle task between the last
    tick_nohz_stop_idle() and the rest of the idle time are
    blindspots wrt. cputime accounting (though mostly an insignificant
    amount)
  - Relies on private fields outside of kernel stats, with specific
    accessors.
* The accounting for offline CPUs which is based on ticks and the
  jiffies delta during which the tick was stopped.
  Pros:
  - Handles steal time correctly
  - Handles CONFIG_IRQ_TIME_ACCOUNTING=y and
    CONFIG_IRQ_TIME_ACCOUNTING=n correctly.
  - Handles the whole idle task
  - Accounts directly to kernel stats, without midlayer accumulator.
  Cons:
  - Doesn't elapse when the tick is off, which doesn't make it
    suitable for online CPUs.
  - Has TICK_NSEC granularity (jiffies)
  - Needs to track the dyntick-idle ticks that were accounted and
    subtract them from the total jiffies time spent while the tick
    was stopped. This is an ugly workaround.
Having two different accountings for a single context is not the only
problem: since those accountings are of different natures, it is
possible to observe the global idle time going backward after a CPU goes
offline.
Clean up the situation by introducing a hybrid approach that stays
coherent and works for both online and offline CPUs:
* Tick based or native vtime accounting operate before the idle loop
  is entered and resume once the idle loop prepares to exit.
* When the idle loop starts, switch to dynticks-idle accounting as is
  done currently, except that the statistics accumulate directly to the
  relevant kernel stat fields.
* Private dyntick cputime accounting fields are removed.
* Works in both the online and offline cases.
Further improvements will include:
* Only switch to dynticks-idle cputime accounting when the tick actually
  goes in dynticks mode.
* Handle CONFIG_IRQ_TIME_ACCOUNTING=n correctly such that the
  dynticks-idle accounting still elapses while on IRQs.
* Correctly subtract idle steal cputime from idle time
Reported-by: Xin Zhao <jackzxcui1989@163.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-8-frederic@kernel.org
|
|
Currently the tick subsystem stores the idle cputime accounting in
private fields, allowing cohabitation with architecture idle vtime
accounting. The former is fetched on online CPUs, the latter on offline
CPUs.
For consolidation purposes, architecture vtime accounting will continue
to account the cputime but will make a break when the idle tick is
stopped. The dyntick cputime accounting will then be relayed by the tick
subsystem so that the idle cputime is still seen advancing coherently
even when the tick isn't there to flush the idle vtime.
Prepare for that and introduce three new APIs which will be used in
subsequent patches:
- vtime_dynticks_start() is meant to be called when idle enters
dyntick mode. The idle cputime that elapsed so far is accumulated.
- vtime_dynticks_stop() is meant to be called when idle exits from
dyntick mode. The vtime entry clocks are fast-forwarded to current time
so that idle accounting restarts elapsing from now.
- vtime_reset() is meant to be called from dynticks idle IRQ entry to
fast-forward the clock to current time so that the IRQ time is still
accounted by vtime while nohz cputime is paused.
Also accumulated vtime won't be flushed from dyntick-idle ticks to avoid
accounting the idle cputime twice, along with nohz accounting.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-6-frederic@kernel.org
|
|
Currently whether generic vtime is running or not, the idle cputime is
fetched from the nohz accounting.
However generic vtime already does its own idle cputime accounting. Only
the kernel stat accessors are not plugged to support it.
Read the idle generic vtime cputime when it's running. This will later
allow a clearer split between nohz and vtime cputime accounting.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-5-frederic@kernel.org
|
|
The first parameter to kcpustat_field() is a pointer to the cpu kcpustat to
be fetched from. This parameter is error prone because a copy of a kcpustat
could be passed by accident instead of the original one. Also the kcpustat
structure can already be retrieved with the help of the mandatory CPU
argument.
Remove the needless parameter.
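A userspace model of the pitfall, as a sketch only: struct kcpustat_model,
percpu_kcpustat[], field_old() and field_new() are hypothetical stand-ins
and not the kernel's definitions. It shows how a caller-supplied pointer
and the CPU argument can disagree, while a lookup derived from the CPU
number alone cannot.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4

enum { CPUTIME_USER, CPUTIME_IDLE, NR_STATS };

/* Hypothetical model of the per-CPU statistics. */
struct kcpustat_model {
	uint64_t cpustat[NR_STATS];
};

static struct kcpustat_model percpu_kcpustat[NR_CPUS];

/* Old shape: the caller passes both the structure and the CPU number. */
static uint64_t field_old(const struct kcpustat_model *k, int usage, int cpu)
{
	(void)cpu; /* nothing forces k to be the live data of this CPU */
	return k->cpustat[usage];
}

/* New shape: the structure is looked up from the CPU number. */
static uint64_t field_new(int usage, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(usage >= 0 && usage < NR_STATS);
	return percpu_kcpustat[cpu].cpustat[usage];
}
```

If a caller takes a copy of percpu_kcpustat[1] and the live entry is
updated afterwards, field_old() on the copy keeps returning the old value
while field_new() reads the current one.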
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-4-frederic@kernel.org
|
|
The macro requires callers to pass a stack variable, but not all
callbacks use it. Add (void)__stack to suppress the clang W=1 warning.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260602175204.624401-1-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
rcu-merge.2026.05.24
rcutorture.2026.05.24: Torture-test updates
misc.2026.05.24: Miscellaneous RCU updates
|
|
BPF programs should have no need to look into struct io_ring_ctx. If
anything, most such cases would be anti-patterns like looking up ring
indices directly via the context.
Replace it with a new empty structure, which is just an alias to struct
io_ring_ctx. It'll create a new BTF type and fail verification if a BPF
program tries to access it (beyond the first byte). It'll also give more
flexibility for the future, and otherwise it can be made aligned with
io_ring_ctx as before with struct groups if ever needed or extended in a
different way.
Fixes: d0e437b76bd3c ("io_uring/bpf-ops: implement loop_step with BPF struct_ops")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/5f6ca3649e9e0bae8667db4357e28dd00cd07901.1780394491.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM fixes from Andrew Morton:
"13 hotfixes. All are for MM. 10 are cc:stable and the remaining 3
address post-7.1 issues or aren't considered suitable for backporting.
There's a three-patch series "userfaultfd: verify VMA state across
UFFDIO_COPY retry" from Mike Rapoport which fixes a few uffd things.
The rest are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-06-01-20-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
userfaultfd: remove redundant check in vm_uffd_ops()
userfaultfd: refuse to __mfill_atomic_pte() for unsupported VMAs
userfaultfd: verify VMA state across UFFDIO_COPY retry
mm/huge_memory: update file PMD counter before folio_put()
mm/huge_memory: update file PUD counter before folio_put()
mm/hugetlb_vmemmap: fix incorrect vmemmap restore in rollback
mm/damon/ops-common: call folio_test_lru() after folio_get()
mm/cma: fix reserved page leak on activation failure
mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison
mm/hugetlb: restore reservation on error in hugetlb folio copy paths
mm/cma_debug: fix invalid accesses for inactive CMA areas
memcg: use round-robin victim selection in refill_stock
mm/hugetlb: avoid false positive lockdep assertion
|
|
The empty zero page is used to back any kernel or user space mapping
that is supposed to remain cleared, and so the page itself is never
supposed to be modified.
So mark it as const, which moves it into .rodata rather than .bss: on
most architectures, this ensures that both the kernel's mapping of it
and any aliases that are accessible via the kernel direct (linear) map
are mapped read-only, and cannot be used (inadvertently or maliciously)
to corrupt the contents of the zero page.
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com> says:
This series aims to use named initializers for platform_device_id
arrays. In general these are more readable for humans and more robust
to changes in the respective struct definition.
This robustness is needed as I want to do
Link: https://patch.msgid.link/cover.1779878004.git.u.kleine-koenig@baylibre.com
|
|
Linux 7.1-rc6
|
|
There is a bug in which an uninitialized stack variable is used in
rseq_exit_user_update() as reported by syzbot:
BUG: KMSAN: kernel-infoleak in rseq_set_ids_get_csaddr include/linux/rseq_entry.h:502 [inline]
The local variable:
	struct rseq_ids ids = {
		.cpu_id = task_cpu(t),
		.mm_cid = task_mm_cid(t),
		.node_id = cpu_to_node(ids.cpu_id),
	};
According to the C standard, the evaluation order of expressions in an
initializer list is indeterminately sequenced. The compiler (Clang, in
this KMSAN build) evaluates `cpu_to_node(ids.cpu_id)` *before*
`ids.cpu_id` is initialized with `task_cpu(t)`.
This is fixed by moving the assignment of ids.node_id outside the
structure initialization.
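The shape of the fix can be sketched in standalone C. The names below
(struct rseq_ids_model, cpu_to_node_stub(), make_ids()) are hypothetical
stand-ins so that the example compiles outside the kernel. The point is
only that the dependent member is assigned in its own statement after the
initializer has completed.

```c
#include <assert.h>

/* Hypothetical model of the structure from the report. */
struct rseq_ids_model {
	unsigned int cpu_id;
	unsigned int mm_cid;
	unsigned int node_id;
};

/* Hypothetical stand-in for cpu_to_node(): four CPUs per node. */
static unsigned int cpu_to_node_stub(unsigned int cpu)
{
	return cpu / 4;
}

static struct rseq_ids_model make_ids(unsigned int cpu, unsigned int mm_cid)
{
	/* No initializer expression reads another member of ids. */
	struct rseq_ids_model ids = {
		.cpu_id = cpu,
		.mm_cid = mm_cid,
	};

	/* Separate statement: ids.cpu_id is fully initialized by now. */
	ids.node_id = cpu_to_node_stub(ids.cpu_id);
	assert(ids.node_id == cpu_to_node_stub(cpu));
	return ids;
}
```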
Fixes: 82f572449cfe ("rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode")
Closes: https://syzkaller.appspot.com/bug?extid=185a631927096f9da2fc
Reported-by: syzbot+185a631927096f9da2fc@syzkaller.appspotmail.com
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://patch.msgid.link/20260602030854.574038-1-wangqing7171@gmail.com
|
|
Now that the proxy path uses ->is_blocked, use the '->is_blocked &&
!->blocked_on' state instead of PROXY_WAKING. Notably, this is where a
blocked_on relation is broken but the donor task might still need a return
migration.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260526113322.596522894%40infradead.org
|
|
Add a link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task on whose behalf the owner is running.
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com
|
|
Add a new is_blocked flag to the task struct. This flag is set
by try_to_block_task() and cleared by ttwu_do_wakeup() and
tracks if the task is blocked.
Traditionally this would mirror !p->on_rq, however due to things
like DELAY_DEQUEUE and PROXY_EXEC, this can diverge, so it's
useful to manage separately.
Additionally with this, we might be able to get rid of the
p->se.sched_delayed (ab)use in the core code (eventually).
Taken whole cloth from Peter's email:
https://lore.kernel.org/lkml/20260501132143.GC1026330@noisy.programming.kicks-ass.net/
With a few additional p->is_blocked = 0 in a few cases where
we return current if blocked_on gets zeroed or there is
no owner. This may hint that these current special cases
might be dropped eventually.
This change also helps resolve wait-queue stalls seen with
proxy-execution. See previous patch attempts for details:
https://lore.kernel.org/lkml/20260430215103.2978955-2-jstultz@google.com/
Reported-by: Vineeth Pillai <vineethrp@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-7-jstultz@google.com
|
|
This patch adds logic so try_to_wake_up() will notice if we are
waking a task where blocked_on == PROXY_WAKING, and if necessary
dequeue the task so the wakeup will naturally return-migrate the
donor task back to a cpu it can run on.
This helps performance as we do the dequeue and wakeup under the
locks normally taken in the try_to_wake_up() and avoids having
to do proxy_force_return() from __schedule(), which has to
re-take similar locks and then force a pick again loop.
This was split out from the larger proxy patch, and
significantly reworked.
Credits for the original patch go to:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-6-jstultz@google.com
|
|
Pick up urgent fixes.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
|
|
ktime_get_snapshot() resolves to ktime_get_snapshot_id(CLOCK_REALTIME).
Make it obvious in the code and convert the readout to use the
snapshot::systime and monoraw fields instead of snapshot::real and raw,
which are going away.
Similar to the PPS generators, avoid the more expensive snapshot when
CONFIG_NTP_PPS is disabled.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195557.123410250@kernel.org
|
|
ktime_get_snapshot() provides a snapshot of the underlying clocksource
counter value and the corresponding CLOCK_MONOTONIC_RAW, CLOCK_REALTIME and
CLOCK_BOOTTIME timestamps.
There is no usage of CLOCK_REALTIME and CLOCK_BOOTTIME at the same time and
CLOCK_BOOTTIME support was just added for the ARM64 KVM tracing mechanism,
which needs CLOCK_BOOTTIME and the underlying clocksource counter value.
ktime_get_snapshot() is also not suitable for usage with CLOCK_AUX, but
that's a prerequisite to support PTP hardware timestamping for CLOCK_AUX
steering.
As a first step, rename ktime_get_snapshot() to ktime_get_snapshot_id(),
which now takes a clockid argument to select the clock which needs to be
captured. The result is stored in system_time_snapshot::systime, which will
replace the system_time_snapshot::real/boot members once all usage sites
have been converted.
ktime_get_snapshot() is a simple wrapper which hands in CLOCK_REALTIME as
clockid argument for the conversion period. That means CLOCK_REALTIME is
now captured twice, but that redundancy is only temporary.
As all usage sites of struct system_time_snapshot have to be updated anyway,
rename the 'raw' member to 'monoraw' for clarity.
No functional change vs. current users of ktime_get_snapshot().
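The wrapper arrangement can be pictured with a small userspace model. This
is a sketch under stated assumptions: struct tk_model, struct
snapshot_model, snapshot_id() and snapshot() are hypothetical names, the
clock values are plain numbers rather than real clock readouts, and the
real functions read the timekeeper under its sequence count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clock selector standing in for clockid_t. */
enum clock_model { CLK_REALTIME, CLK_BOOTTIME };

/* Hypothetical frozen view of the timekeeper state. */
struct tk_model {
	uint64_t cycles;       /* clocksource counter value */
	int64_t  mono_ns;      /* CLOCK_MONOTONIC */
	int64_t  monoraw_ns;   /* CLOCK_MONOTONIC_RAW */
	int64_t  offs_real_ns; /* monotonic to realtime offset */
	int64_t  offs_boot_ns; /* monotonic to boottime offset */
};

struct snapshot_model {
	uint64_t cycles;
	int64_t  systime; /* the single clock selected by the caller */
	int64_t  monoraw;
};

/* Capture the counter, the raw clock and exactly one selected clock. */
static void snapshot_id(const struct tk_model *tk, enum clock_model id,
			struct snapshot_model *s)
{
	assert(tk && s);
	s->cycles = tk->cycles;
	s->monoraw = tk->monoraw_ns;
	s->systime = tk->mono_ns +
		(id == CLK_BOOTTIME ? tk->offs_boot_ns : tk->offs_real_ns);
}

/* Compatibility wrapper for the conversion period. */
static void snapshot(const struct tk_model *tk, struct snapshot_model *s)
{
	snapshot_id(tk, CLK_REALTIME, s);
}
```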
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260529195556.971591633@kernel.org
|
|
The loader verifies map->sha against the metadata hash in its
instructions. map->sha is calculated when BPF_OBJ_GET_INFO_BY_FD is
called on the frozen map.
While the map is frozen, the /signed loader/ must also ensure the map
is exclusive, as, without exclusivity (which a hostile host could just
omit when loading the loader), another BPF program with map access can
mutate the contents afterwards, so the check passes on stale data.
With the extra check as part of the signed loader, it now refuses to
move on with map->sha validation if the host set it up wrongly.
Fixes: fb2b0e290147 ("libbpf: Update light skeleton for signing")
Signed-off-by: KP Singh <kpsingh@kernel.org>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260601150248.394863-4-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
bpf_map_get_info_by_fd() is the only caller of the ->map_get_hash
and always invokes it with hash_buf == map->sha and hash_buf_size
of SHA256_DIGEST_SIZE. array_map_get_hash() in turn lets sha256()
write the digest directly into that buffer (map->sha) and then
performs a trailing memcpy(), which evaluates to memcpy(map->sha,
map->sha, 32): a redundant self-copy. The hash_buf_size argument
was never used at all. Simplify this a bit, no functional change.
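A sketch of what the simplified shape might look like, under stated
assumptions: struct map_model, toy_digest() and map_get_hash() are
hypothetical, and toy_digest() is a trivial stand-in for sha256(), not a
cryptographic hash. The callback writes the digest straight into the
map's own buffer, so neither a destination argument nor a trailing copy
is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGEST_SIZE 32 /* same size as a SHA-256 digest */

/* Hypothetical model of a map carrying its own digest buffer. */
struct map_model {
	const uint8_t *data;
	size_t len;
	uint8_t sha[DIGEST_SIZE];
};

/* Toy stand-in for sha256(): NOT a cryptographic hash. */
static void toy_digest(const uint8_t *data, size_t len, uint8_t out[DIGEST_SIZE])
{
	for (size_t i = 0; i < DIGEST_SIZE; i++)
		out[i] = (uint8_t)i;
	for (size_t i = 0; i < len; i++)
		out[i % DIGEST_SIZE] ^= data[i];
}

/* Simplified callback: no hash_buf, no hash_buf_size, no self-copy. */
static void map_get_hash(struct map_model *map)
{
	assert(map && map->data);
	toy_digest(map->data, map->len, map->sha);
}
```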
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260601150248.394863-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Introduce release_reg() to consolidate the release logic shared by both
helpers and kfuncs: dynptr release, kptr_xchg percpu-to-RCU conversion,
regular reference release, and NULL pass-through. NULL pass-through is
only allowed if the prototype indicates the argument may be null.
Determine release_regno from the function prototype/metadata before
argument checking, rather than discovering it dynamically during
argument processing. For helpers, scan the arg_type array in
check_func_proto() via check_proto_release_reg(). For kfuncs, set
release_regno to BPF_REG_1 in bpf_fetch_kfunc_arg_meta() when
KF_RELEASE is set. In the future when we start adding decl_tag to
kfunc arguments, we can just look at the function prototype instead
of a release_regno.
Extract ref_convert_alloc_rcu_protected() and
invalidate_rcu_protected_refs() to make it more clear what the code is
doing. For ref_convert_alloc_rcu_protected(), it pre-converts
MEM_ALLOC | MEM_PERCPU registers to MEM_RCU (clearing id so they
survive), then calls release_reference() to invalidate the remaining
registers and release the reference state.
Add KF_RELEASE to bpf_dynptr_file_discard() so its release_regno is set
via fetch_kfunc_meta rather than being assigned manually in the dynptr
argument processing. Set arg_type to ARG_PTR_TO_DYNPTR for
KF_ARG_PTR_TO_DYNPTR so that check_func_arg_reg_off() correctly allows
non-zero stack offsets for dynptr release arguments, the same as for helpers.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260529014936.2811085-9-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Helpers and kfuncs independently tracked referenced object metadata
using standalone id fields in their respective arg_meta structs.
This led to duplicated logic and inconsistent error handling between the
two paths.
Introduce struct ref_obj_desc to consolidate id and parent_id along with
a count of how many arguments carry a reference. Add update_ref_obj() to
populate it from a bpf_reg_state, replacing open-coded assignments in
check_func_arg(), check_kfunc_args(), and process_iter_arg(). Add
validate_ref_obj() to check for ambiguous ref_obj before using it.
For ref_obj releasing helpers and kfuncs, keep checking it before
calling update_ref_obj() for now. A later patch will make these
functions not depend on ref_obj. For other users of ref_obj, move the
checks to the use locations. For helpers, this means moving the checks
inside helper_multiple_ref_obj_use() to use locations.
is_acquire_function() is dropped as ref_obj is never used.
Pass ref_obj_desc into process_dynptr_func()/mark_stack_slots_dynptr()
instead of a bare parent_id to make it less confusing.
Drop the selftest introduced in 7ec899ac90a2 ("selftests/bpf: Negative
test case for ref_obj_id in args") since the verifier no longer
complains about ambiguous ref_obj if it is not used.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260529014936.2811085-8-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Refactor object relationship tracking in the verifier and fix a dynptr
use-after-free bug where file/skb dynptrs are not invalidated when the
parent referenced object is freed.
Add parent_id to bpf_reg_state to precisely track child-parent
relationships. A child object's parent_id points to the parent object's
id. This replaces the PTR_TO_MEM-specific dynptr_id.
Remove ref_obj_id from bpf_reg_state by folding its role into the
existing id field. Previously, id tracked pointer identity for null
checking while ref_obj_id tracked the owning reference for lifetime
management. These are now unified: acquire helpers and kfuncs set id
to the acquired reference id, and release paths use id directly.
Add reg_is_referenced() which checks if a register is referenced by
looking up its id in the reference array. This replaces all former
ref_obj_id checks.
For release_reference(), invalidating an object now also invalidates
all descendants by traversing the object tree. This is done using
stack-based DFS to avoid recursive call chains of release_reference() ->
unmark_stack_slots_dynptr() -> release_reference(). Referenced objects
encountered during tree traversal are reported as leaked references.
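The traversal can be sketched in userspace C. This is a model under stated
assumptions, not verifier code: struct obj_model and release_tree() are
hypothetical, objects live in a flat array, ids are nonzero, and a
parent_id of 0 means "no parent". An explicit stack replaces the
recursion mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OBJS 8

/* Hypothetical model of a tracked object. */
struct obj_model {
	int id;
	int parent_id; /* 0 means the object has no parent */
	bool valid;
};

/*
 * Invalidate the object(s) carrying @id and all their descendants using an
 * explicit stack instead of recursion. Returns how many were invalidated.
 */
static int release_tree(struct obj_model *objs, int n, int id)
{
	int stack[MAX_OBJS];
	int top = 0, released = 0;

	assert(n >= 0 && n <= MAX_OBJS);
	assert(id != 0);

	/* Seed the stack with the objects that carry the released id. */
	for (int i = 0; i < n; i++) {
		if (objs[i].valid && objs[i].id == id) {
			objs[i].valid = false;
			stack[top++] = i;
			released++;
		}
	}

	/* Each object is pushed at most once, so the stack cannot overflow. */
	while (top > 0) {
		int cur = stack[--top];

		for (int i = 0; i < n; i++) {
			if (objs[i].valid && objs[i].parent_id != 0 &&
			    objs[i].parent_id == objs[cur].id) {
				objs[i].valid = false;
				stack[top++] = i;
				released++;
			}
		}
	}
	return released;
}
```

Marking an object invalid before pushing it bounds the stack depth by the
number of objects and also guards against cycles in malformed input.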
Add parent_id to bpf_reference_state to enable hierarchical reference
tracking. When acquiring a reference, a parent_id can be specified to
link the new reference to an existing one (e.g., referenced dynptrs
acquire a reference with parent_id linking to the parent object's
reference).
Pointer casting:
For pointer casting helpers (bpf_sk_fullsock, bpf_tcp_sock), instead of
propagating ref_obj_id, the cast result reuses the same reference id as
the source pointer. Since the cast may return NULL for a non-NULL input,
the NULL case is explored as a separate verifier branch. This allows
releasing any of the original or cast pointers to invalidate all others.
Referenced dynptrs:
When constructing a referenced dynptr, acquire an intermediate reference
with parent_id linking to the parent referenced object. The dynptr and
all clones share the same parent_id (pointing to the intermediate ref)
but get unique ids for independent slice tracking. Releasing a
referenced dynptr releases the parent reference, which in turn
invalidates all clones and their derived slices.
Owning to non-owning reference conversion:
After converting owning to non-owning by clearing id (e.g.,
object(id=1) -> object(id=0)), the verifier releases the reference
state via release_reference_nomark().
Note that the error message "reference has not been acquired before" in
the helper and kfunc release paths is removed. This message was already
unreachable. The verifier only calls release_reference() after
confirming the reference is valid, so the condition could never trigger
in practice.
Fixes: 870c28588afa ("bpf: net_sched: Add basic bpf qdisc kfuncs")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260529014936.2811085-6-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Simplify dynptr checking for helper and kfunc by unifying it. Remember
the initialized dynptr (i.e., !(arg_type & MEM_UNINIT)) passed to a
dynptr kfunc during process_dynptr_func() so that we can easily
retrieve the information for verification later. By saving it in
meta->dynptr, there is no need to call dynptr helpers such as
dynptr_id(), dynptr_ref_obj_id() and dynptr_type() in check_func_arg().
Remove and open code the helpers in process_dynptr_func() when
saving id, ref_obj_id, and type.
Besides, since dynptr ref_obj_id information is now passed around in
meta->bpf_dynptr_desc, drop the check in helper_multiple_ref_obj_use.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Link: https://lore.kernel.org/r/20260529014936.2811085-3-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The AX.25 subsystem was removed in commit dd8d4bc28ad7
("net: remove ax25 and amateur radio (hamradio) subsystem"),
which removed the ax25_ptr field from struct net_device but
left behind the kdoc comment and documentation.
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20260531134837.4111349-1-costa.shul@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Annotate the following functions used in the issuing path:
ata_qc_issue(), ata_sas_queuecmd(), ata_scsi_qc_issue(),
ata_scsi_translate(), __ata_scsi_queuecmd()
These functions are all used in the issuing path, so context analysis will
be able to verify that the ap lock is held, from where it is taken in
sas_queuecommand() or ata_scsi_queuecmd() all the way down to
ata_qc_issue().
Commenting out the spin_lock_irqsave() successfully results in a compiler
error on Clang 23.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Co-developed-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
|
|
Annotate the following functions with __must_hold(&host->eh_mutex):
* All ata_port_operations.error_handler() implementations.
* ata_eh_reset() and ata_eh_recover() because these functions call
ata_eh_release() and ata_eh_acquire().
* All callers of ata_eh_reset() and ata_eh_recover().
Enable Clang's context analysis. This will cause the build to fail if,
e.g., a locking bug is introduced in an error path. This patch
should not affect the generated assembler code.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[cassel: drop note about clang 23 from commit log]
Signed-off-by: Niklas Cassel <cassel@kernel.org>
|
|
|
|
We need the char/misc/iio fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We need the tty/serial fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We need the USB and Thunderbolt fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The auth.unix.ip and auth.unix.gid caches live in the sunrpc module,
so they cannot use the nfsd generic netlink family. Create a new
"sunrpc" generic netlink family with its own "exportd" multicast
group to support cache upcall notifications for sunrpc-resident
caches.
Define a YAML spec (sunrpc_cache.yaml) with a cache-type enum
(ip_map, unix_gid), a cache-notify multicast event, and the
corresponding uapi header.
Implement sunrpc_cache_notify() in cache.c, which checks for
listeners on the exportd multicast group, builds and sends a
SUNRPC_CMD_CACHE_NOTIFY message with the cache-type attribute.
Register/unregister the sunrpc_nl_family in init_sunrpc() and
cleanup_sunrpc().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add sunrpc_cache_requests_count() and sunrpc_cache_requests_snapshot()
to allow callers to count and snapshot the pending upcall request list
without exposing struct cache_request outside of cache.c.
Both functions skip entries that no longer have CACHE_PENDING set.
The snapshot function takes a cache_get() reference on each item so the
caller can safely use them after the queue_lock is released.
These will be used by the nfsd generic netlink dumpit handler for
svc_export upcall requests.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
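The count/snapshot pattern described in this commit can be sketched in plain userspace C. This is only an illustrative model, not the code in cache.c: struct model_item, struct model_request, model_requests_count() and model_requests_snapshot() are hypothetical names, the queue is a flat array rather than a list, and the queue_lock is represented only by comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a cache item and a queued upcall request. */
struct model_item {
	int refs;               /* reference count on the cache item */
};

struct model_request {
	struct model_item *item;
	bool pending;           /* models the CACHE_PENDING flag */
};

/* Count only the requests that are still pending. */
static size_t model_requests_count(const struct model_request *q, size_t n)
{
	size_t count = 0;

	/* In real code the queue lock would be held across this loop. */
	for (size_t i = 0; i < n; i++)
		if (q[i].pending)
			count++;
	return count;
}

/*
 * Copy up to max pending items into out[], taking a reference on each
 * so the caller may keep using them after the lock is dropped.
 * Returns the number of entries written.
 */
static size_t model_requests_snapshot(const struct model_request *q, size_t n,
				      struct model_item **out, size_t max)
{
	size_t filled = 0;

	/* In real code the queue lock would be held across this loop. */
	for (size_t i = 0; i < n && filled < max; i++) {
		if (!q[i].pending)
			continue;       /* skip entries no longer pending */
		q[i].item->refs++;      /* models taking a cache_get() reference */
		out[filled++] = q[i].item;
	}
	return filled;
}
```

A caller would typically size its array from the count, take the snapshot, and drop each reference once it has finished with the item.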
|
|
A later patch will be changing the kernel to send a netlink notification
when there is a pending cache_request. Add a new cache_notify operation
to struct cache_detail for this purpose.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This function doesn't have anything to do with a timeout. The only
difference is that it warns if there are no listeners. Rename it to
sunrpc_cache_upcall_warn().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Since it will soon also send an upcall via netlink, if configured.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
'vfs/vfs-7.2.directory.delegations' and 'vfs/vfs-7.2.exportfs' into vfs-7.2-merge
|
|
Alexey Charkov <alchark@flipper.net> says:
The Nuvoton NAU8822 codec has four power supply pins: VDDA, VDDB, VDDC
and VDDSPK, which must be online and stable before the device can be
accessed over I2C. On boards where these rails are software-controlled,
probing the codec before the regulators are up results in -ENXIO errors
during register access.
This short series adds optional regulator support to both the device
tree binding and the driver, so platforms that need explicit power
sequencing can describe and enforce it:
Link: https://patch.msgid.link/20260525-nau8822-reg-v2-0-7d37ae393e46@flipper.net
|
|
Carlos Song (OSS) <carlos.song@oss.nxp.com> says:
This series fixes two issues in the fsl-lpspi DMA transfer error paths.
Patch 1 replaces the deprecated dmaengine_terminate_all() with
dmaengine_terminate_sync() across all error paths in
fsl_lpspi_dma_transfer().
Patch 2 fixes a missing RX DMA channel termination when TX descriptor
preparation fails. Since the RX channel is already submitted and issued
before the TX descriptor is prepared, returning -EINVAL without
terminating the RX channel leaves it running against buffers that the
SPI core will unmap, potentially causing memory corruption.
Link: https://patch.msgid.link/20260525062357.3191349-1-carlos.song@oss.nxp.com
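The error-path ordering that patch 2 describes can be modelled in userspace C. This is a sketch under stated assumptions, not the fsl-lpspi driver: struct model_chan, model_terminate_sync() and model_dma_transfer() are hypothetical, and a boolean stands in for whether TX descriptor preparation succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a DMA channel: only tracks whether it is active. */
struct model_chan {
	bool running;
};

/* Models a synchronous terminate: the channel is idle on return. */
static void model_terminate_sync(struct model_chan *c)
{
	c->running = false;
}

/*
 * Models the transfer setup order: RX is submitted and issued first,
 * then the TX descriptor is prepared. If TX preparation fails, RX must
 * be stopped before returning, because the caller will unmap the
 * buffers that RX is still writing to.
 */
static int model_dma_transfer(struct model_chan *rx, struct model_chan *tx,
			      bool tx_prep_ok)
{
	rx->running = true;             /* RX already submitted and issued */

	if (!tx_prep_ok) {
		model_terminate_sync(rx);   /* the step the fix adds */
		return -EINVAL;
	}

	tx->running = true;
	return 0;
}
```

The point of the model is the invariant: whenever the function returns an error, no channel is left running against buffers that are about to be unmapped.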
|
|
Conflicts:
drivers/net/ethernet/microsoft/mana/mana_en.c:
17bfe0a8c014e ("net: mana: Add NULL guards in teardown path to prevent panic on attach failure")
d07efe5a6e641 ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Replace "Indentifies" with "Identifies".
Signed-off-by: Long Wei <longwei27@huawei.com>
Link: https://patch.msgid.link/20260516085653.2193872-1-longwei27@huawei.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
Increment the incoming FLB refcount in liveupdate_flb_get_incoming() so
that the FLB structure cannot be freed while the caller is actively using
it. Add an additional liveupdate_flb_put_incoming() function so the
caller can explicitly indicate when it is done using the FLB data.
During a Live Update, a subsystem might need to hold onto the incoming
File-Lifecycle-Bound (FLB) data for an extended period, such as during
device enumeration. Incrementing the reference count guarantees that the
data remains valid and accessible until the subsystem releases it,
preventing future use-after-free bugs.
Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/r/20260423174032.3140399-3-dmatlack@google.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
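The get/put pairing described above can be illustrated with a small userspace model. Everything here is hypothetical (struct model_flb, model_flb_get_incoming(), model_flb_put_incoming(), model_flb_finish()); it assumes the core holds one reference of its own and is not the liveupdate implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of incoming FLB state. */
struct model_flb {
	unsigned int refs;  /* 1 for the core, plus 1 per active caller */
	bool freed;         /* set when the last reference is dropped */
	int data;           /* stands in for the preserved data */
};

static void model_flb_init(struct model_flb *flb, int data)
{
	flb->refs = 1;
	flb->freed = false;
	flb->data = data;
}

/* Take a reference and hand out the data; NULL once it has been freed. */
static int *model_flb_get_incoming(struct model_flb *flb)
{
	if (flb->refs == 0)
		return NULL;
	flb->refs++;
	return &flb->data;
}

/* Drop a reference; the data is torn down only when the last one goes. */
static void model_flb_put_incoming(struct model_flb *flb)
{
	assert(flb->refs > 0);
	if (--flb->refs == 0)
		flb->freed = true;
}

/* The core finishing its own use of the data drops the core reference. */
static void model_flb_finish(struct model_flb *flb)
{
	model_flb_put_incoming(flb);
}
```

In this model a subsystem that calls the get helper before the core finishes can keep reading the data for as long as it needs, for example across device enumeration, and the teardown happens only at its matching put.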
|
|
Use refcount_t instead of a raw integer to keep track of references on
incoming and outgoing FLBs. Using refcount_t provides protection from
overflow, underflow, and other issues.
Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state")
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/r/20260423174032.3140399-2-dmatlack@google.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
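To show why a dedicated reference-count type is safer than a raw integer, here is a simplified userspace model of a saturating counter. It is a sketch only: struct model_refcount and its helpers are hypothetical names, and the kernel's refcount_t is more elaborate (atomic operations, warnings on misuse).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, non-atomic model of a saturating reference count. */
struct model_refcount {
	unsigned int refs;
};

#define MODEL_REFCOUNT_SATURATED UINT_MAX

static void model_refcount_set(struct model_refcount *r, unsigned int n)
{
	r->refs = n;
}

/*
 * Take a reference unless the count is already zero (the object is
 * being freed). A saturated counter stays saturated instead of wrapping
 * to zero, trading a possible leak for protection from use-after-free.
 */
static bool model_refcount_inc_not_zero(struct model_refcount *r)
{
	if (r->refs == 0)
		return false;
	if (r->refs != MODEL_REFCOUNT_SATURATED)
		r->refs++;
	return true;
}

/*
 * Drop a reference and report whether it was the last one. Dropping
 * from zero (underflow) or from the saturated value is refused.
 */
static bool model_refcount_dec_and_test(struct model_refcount *r)
{
	if (r->refs == 0 || r->refs == MODEL_REFCOUNT_SATURATED)
		return false;
	return --r->refs == 0;
}
```

With a raw integer, an extra put would wrap the counter below zero and an overflow would wrap it back to zero while references are still held; in this model both cases are refused and the counter keeps a safe value.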
|
|
Currently, if CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled,
kho_release_scratch() will initialize the struct pages and set the
migratetype of the KHO scratch regions. Unless the whole scratch fits
below first_deferred_pfn, some of that will be overwritten either by
deferred_init_pages() or memmap_init_reserved_range().
To fix it, make memmap_init_range(), deferred_init_memmap_chunk() and
__init_page_from_nid() recognize KHO scratch regions and set
migratetype of pageblocks in those regions to MIGRATE_CMA.
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Link: https://patch.msgid.link/20260423122538.140993-2-mclapinski@google.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|