| Age | Commit message (Collapse) | Author |
|
In mm_cid_fixup_cpus_to_tasks(), when rq->curr has the target mm and
mm_cid.active is set, the CID is checked with cid_in_transit() before
setting the transition bit. In per-CPU mode a newly forked or exec'd
task can be running with mm_cid.cid == MM_CID_UNSET because CIDs are
assigned lazily on schedule-in. With cid_in_transit() the guard passes
for MM_CID_UNSET (no transit bit), converts it to MM_CID_UNSET |
MM_CID_TRANSIT and stores it back; later mm_cid_schedout() feeds this
to clear_bit() with MM_CID_UNSET as the bit number, triggering an
out-of-bounds write.
Symptoms: this is genuine memory corruption, but a bounded out-of-bounds
write, not an arbitrary one. MM_CID_UNSET is the fixed sentinel BIT(31),
so once the bad value reaches mm_cid_schedout() the cid_from_transit_cid()
strip leaves MM_CID_UNSET, which fails the "cid < max_cids" convergence
test and falls into mm_drop_cid() -> clear_bit(MM_CID_UNSET,
mm_cidmask(mm)). The cid bitmap is embedded in the mm_struct slab object
(after cpu_bitmap and mm_cpus_allowed) and is only num_possible_cpus()
bits wide, so clearing bit 31 is a deterministic OOB bit-clear at a
fixed offset of 2^31 / 8 == 256 MiB past the bitmap base. The address is
not attacker-influenced (fixed sentinel -> fixed offset) and the op only
clears a single bit; what sits 256 MiB further along the direct map is
whatever kernel object happens to live there, so this corrupts one bit of
unpredictable kernel memory -- it is not an arbitrary-address or
arbitrary-value write.
It triggers only in per-CPU CID mode, when a CPU is running an active
task of the target mm whose cid is still MM_CID_UNSET -- the
fork()/execve() window before that task's next schedule-in assigns it a
real CID -- and a per-CPU -> per-task fixup walks over it (the mode
fallback driven by a thread exit, sched_mm_cid_exit(), or by the deferred
max_cids recompute in mm_cid_work_fn()).
In practice syzkaller surfaced it as a KASAN use-after-free reported in
__schedule -> mm_cid_switch_to, where the offending clear_bit() is inlined
via mm_cid_schedout() -> mm_drop_cid().
Guard the transition-bit assignment against MM_CID_UNSET, in addition to
the existing cid_in_transit() check, so the bit is only set on a genuine
task-owned CID. A CPU-owned (MM_CID_ONCPU) CID of a running active task
is handled by the cid_on_cpu(pcp->cid) branch above and never reaches
this path, so excluding MM_CID_UNSET (and the already-transitioning case)
is sufficient.
Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Assisted-by: Claude:claude-opus-4-8 syzkaller
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260616203818.1516263-1-riel@surriel.com
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"SMP load-balancing updates:
- A large series to introduce infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data
within the same Last Level Cache (LLC) domain. By improving cache
locality, the scheduler can reduce cache bouncing and cache misses,
ultimately improving data access efficiency.
Implemented by Chen Yu and Tim Chen, based on early prototype work
by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and
Shrikanth Hegde.
- A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde)
Fair scheduler updates:
- A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing
SMT awareness (Andrea Righi, K Prateek Nayak)
- A series to optimize cfs_rq and sched_entity allocation for better
data locality (Zecheng Li)
- A preparatory series to change fair/cgroup scheduling to a single
runqueue, without the final change (Peter Zijlstra)
- Auto-manage ext/fair dl_server bandwidth (Andrea Righi)
- Fix cpu_util runnable_avg arithmetic (Hongyan Xia)
- Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel)
- Allow account_cfs_rq_runtime() to throttle current hierarchy
(K Prateek Nayak)
- Update util_est after updating util_avg during dequeue, to fix the
util signal update logic, which reduces signal noise (Vincent
Guittot)
Scheduler topology updates:
- Allow multiple domains to claim sched_domain_shared (K Prateek
Nayak)
- Add parameter to split LLC (Peter Zijlstra)
Core scheduler updates:
- Use trace_call__<tp>() to save a static branch (Gabriele Monaco)
Scheduler statistics updates:
- Drop now-stale mul_u64_u64_div_u64() cputime over-approximation
guard (Nicolas Pitre)
Deadline scheduler updates:
- Reject debugfs dl_server writes for offline CPUs (Andrea Righi)
- Fix replenishment logic for non-deferred servers (Yuri Andriaccio)
RT scheduling updates:
- Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt)
- Update default bandwidth for real-time tasks to 1.0 (Yuri
Andriaccio)
Proxy scheduling updates:
- A series to implement Optimized Donor Migration for Proxy Execution
(John Stultz, Peter Zijlstra)
- Various proxy scheduling cleanups and fixes (Peter Zijlstra,
K Prateek Nayak)
Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi,
Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde,
Peter Zijlstra, Liang Luo and Yiyang Chen"
* tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits)
sched/fair: Fix newidle vs core-sched
sched/deadline: Use task_on_rq_migrating() helper
sched/core: Combine separate 'else' and 'if' statements
sched/fair: Fix cpu_util runnable_avg arithmetic
sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()
sched/fair: Call update_curr() before unthrottling the hierarchy
sched/fair: Use throttled_csd_list for local unthrottle
sched/fair: Convert cfs bandwidth throttling to use guards
sched/fair: Allocate cfs_tg_state with percpu allocator
sched/fair: Remove task_group->se pointer array
sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state
sched: restore timer_slack_ns when resetting RT policy on fork
MAINTAINERS: Fix spelling mistake in Peter's name
sched: Simplify ttwu_runnable()
sched/proxy: Remove superfluous clear_task_blocked_in()
sched/proxy: Remove PROXY_WAKING
sched/proxy: Switch proxy to use p->is_blocked
sched/proxy: Only return migrate when needed
sched: Be more strict about p->is_blocked
...
|
|
The kernel coding style recommends using 'else if' instead of
placing 'if' on a separate line after 'else'. This change makes
the code consistent with the rest of the kernel codebase.
Signed-off-by: Liang Luo <luoliang@kylinos.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260608071842.325159-1-luoliang@kylinos.cn
|
|
Although the dynticks-idle cputime accounting is necessarily tied to the
tick subsystem, the actual related accounting code has no business residing
there and should be part of the scheduler cputime code.
Move away the relevant pieces and state machine to where they belong.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260508131647.43868-10-frederic@kernel.org
|
|
Subsequent commits will allow update_curr() to throttle the hierarchy
when the runtime accounting exceeds allocated quota. Call update_curr()
before the unthrottle event, and in tg_unthrottle_up() to catch up on
any remaining runtime and stabilize the "runtime_remaining" and
"throttle_count" for that cfs_rq.
Doing an update_curr() early ensures the cfs_rq is not throttled right
back up again when the unthrottle is in progress.
Since all callers of unthrottle_cfs_rq(), except two, already update the
rq_clock and call rq_clock_start_loop_update(), move the
update_rq_clock() from unthrottle_cfs_rq() to the callers that don't
update the rq_clock.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Benjamin Segall <bsegall@google.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260602052531.11450-1-kprateek.nayak@amd.com
|
|
To remove the cfs_rq pointer array in task_group, allocate the combined
cfs_rq and sched_entity using the per-cpu allocator.
This patch implements the following:
- Changes task_group->cfs_rq from 'struct cfs_rq **' to
'struct cfs_rq __percpu *'.
- Updates memory allocation in alloc_fair_sched_group() and
free_fair_sched_group() to use alloc_percpu() and free_percpu()
respectively.
- Uses the inline accessor tg_cfs_rq(tg, cpu) with per_cpu_ptr() to retrieve
the pointer to cfs_rq for the given task group and CPU.
- Replaces direct accesses tg->cfs_rq[cpu] with calls to the new tg_cfs_rq(tg,
cpu) helper.
- Handles the root_task_group: since struct rq is already a per-cpu variable
(runqueues), its embedded cfs_rq (rq->cfs) is also per-cpu. Therefore, we
assign root_task_group.cfs_rq = &runqueues.cfs.
- Cleanup the code in initializing the root task group.
This change places each CPU's cfs_rq and sched_entity in its local per-cpu
memory area to remove the per-task_group pointer arrays.
Signed-off-by: Zecheng Li <zecheng@google.com>
Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Josh Don <joshdon@google.com>
Link: https://patch.msgid.link/20260522141623.600235-4-zli94@ncsu.edu
|
|
Now that struct sched_entity is co-located with struct cfs_rq for non-root task
groups, the task_group->se pointer array is redundant. The associated
sched_entity can be loaded directly from the cfs_rq.
This patch performs the access conversion with the helpers:
- is_root_task_group(tg): checks if a task group is the root task group. It
compares the task group's address with the global root_task_group variable.
- tg_se(tg, cpu): retrieves the cfs_rq and returns the address of the
co-located se. This function checks if tg is the root task group to ensure
behaving the same of previous tg->se[cpu]. Replaces all accesses that use
the tg->se[cpu] pointer array with calls to the new tg_se(tg, cpu) accessor.
- cfs_rq_se(cfs_rq): simplifies access paths like cfs_rq->tg->se[...] to use
the co-located sched_entity. This function also checks if tg is the root
task group to ensure same behavior.
Since tg_se is not in very hot code paths, and the branch is a register
comparison with an immediate value (`&root_task_group`), the performance impact
is expected to be negligible.
Signed-off-by: Zecheng Li <zecheng@google.com>
Signed-off-by: Zecheng Li <zli94@ncsu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Josh Don <joshdon@google.com>
Link: https://patch.msgid.link/20260522141623.600235-3-zli94@ncsu.edu
|
|
Commit ed4fb6d7ef68 ("hrtimer: Use and report correct timerslack values
for realtime tasks") sets timer_slack_ns to 0 for RT tasks in
__setscheduler_params(). However, when an RT task with SCHED_RESET_ON_FORK
creates child threads, the children inherit timer_slack_ns=0 from the
parent. sched_fork() resets the child's policy to SCHED_NORMAL but does
not restore timer_slack_ns, leaving the child permanently running with
zero slack.
Fix this by restoring timer_slack_ns from default_timer_slack_ns in
sched_fork() when resetting from RT/DL to NORMAL policy, matching the
existing behavior in __setscheduler_params().
Note: this fix alone requires a correct default_timer_slack_ns to be
effective. See the following patch for that fix.
Fixes: ed4fb6d7ef68 ("hrtimer: Use and report correct timerslack values for realtime tasks")
Reported-by: Qiaoting.Lin <linqiaoting@xiaomi.com>
Signed-off-by: Guanyou.Chen <chenguanyou@xiaomi.com>
Signed-off-by: Chunhui.Li <chunhui.li@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260522131000.1664983-2-chenguanyou@xiaomi.com
|
|
Note that both proxy and delayed tasks have ->is_blocked set. Use this one
condition to guard both paths.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260526113322.714832584%40infradead.org
|
|
Per the discussion here:
https://lore.kernel.org/all/20260403112810.GG3738786@noisy.programming.kicks-ass.net/
The reason for this condition is that the signal condition in
try_to_block_task() would set_task_blocked_in_waking(). However, it no longer
does that, in fact, that path does clear_task_blocked_on().
Further, per the discussions here:
https://lore.kernel.org/r/dc61cf77-e541-441d-a708-c40e19aa0db2%40amd.com
https://lore.kernel.org/r//9dd1d24d-45d3-4ee2-8e67-8305b34bfb6d%40amd.com
there are a few other edge cases that needed this. But they're all
variants of PROXY_WAKING leaking out. And since PROXY_WAKING is now
gone, this is no longer needed either.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260526113322.120970670%40infradead.org
|
|
Now that the proxy path uses ->is_blocked, use the '->is_blocked &&
!->blocked_on' state instead of PROXY_WAKING. Notably, this is where a
blocked_on relation is broken but the donor task might still need a return
migration.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260526113322.596522894%40infradead.org
|
|
Rather than gate the proxy paths with p->blocked_on, use p->is_blocked.
This opens up the state: '->is_blocked && !->blocked_on' for future use.
Notably, only proxy and delayed tasks can be ->on_rq && ->is_blocked, and it is
guaranteed that sched_class::pick_task() will never return a delayed task.
Therefore any task returned from pick_next_task() that has ->is_blocked set,
must be a proxy task.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260526113322.477954312%40infradead.org
|
|
Current code will 'unconditionally' return migrate on PROXY_WAKING, even if the
task is (still) on the original CPU.
Check task_cpu(p) against p->waking_cpu, which per proxy_set_task_cpu()
preserves the original CPU the task was on. If they do not mis-match, there is
no need to go through the more expensive wakeup path.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260527082916.GP3126523%40noisy.programming.kicks-ass.net
|
|
Upon entry to try_to_block_task(), p->is_blocked should be false. After all,
the prior wakeup would have made it so per ttwu_do_wakeup().
Ensure this is the case, rather than clearing it in the path that doesn't set
it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260526113322.364017314%40infradead.org
|
|
The reason for the clause in try_to_wake_up() is, per its comment, that
find_proxy_task()'s proxy_deactivate() is not always called with a cleared
p->blocked_on.
However, that seems silly and easily cured. Make sure to always call
proxy_deactivate() with a cleared p->blocked_on such that we might remove this
clause from the common wake-up path.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260526113322.244729903%40infradead.org
|
|
Add link to the task this task is proxying for, and use it so
the mutex owner can do an intelligent hand-off of the mutex to
the task that the owner is running on behalf.
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com
|
|
Add a new is_blocked flag to the task struct. This flag is set
by try_to_block_task() and cleared by ttwu_do_wakeup() and
tracks if the task is blocked.
Traditionally this would mirror !p->on_rq, however due things
like DELAY_DEQUEUE and PROXY_EXEC, this can diverge, so its
useful to manage separately.
Additionally with this, we might be able to get rid of the
p->se.sched_delayed (ab)use in the core code (eventually).
Taken whole cloth from Peter's email:
https://lore.kernel.org/lkml/20260501132143.GC1026330@noisy.programming.kicks-ass.net/
With a few additional p->is_blocked = 0 in a few cases where
we return current if blocked_on gets zeroed or there is
no owner. This may hint that these current special cases
might be dropped eventually.
This change also helps resolve wait-queue stalls seen with
proxy-execution. See previous patch attempts for details:
https://lore.kernel.org/lkml/20260430215103.2978955-2-jstultz@google.com/
Reported-by: Vineeth Pillai <vineethrp@google.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-7-jstultz@google.com
|
|
This patch adds logic so try_to_wake_up() will notice if we are
waking a task where blocked_on == PROXY_WAKING, and if necessary
dequeue the task so the wakeup will naturally return-migrate the
donor task back to a cpu it can run on.
This helps performance as we do the dequeue and wakeup under the
locks normally taken in the try_to_wake_up() and avoids having
to do proxy_force_return() from __schedule(), which has to
re-take similar locks and then force a pick again loop.
This was split out from the larger proxy patch, and
significantly reworked.
Credits for the original patch go to:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-6-jstultz@google.com
|
|
Pull most of the logic out of try_to_block_task() and put it
into block_task() directly, so that we can call block_task() and
not have to worry about the failing cases in try_to_block_task()
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-5-jstultz@google.com
|
|
Historically, the prev value from __schedule() was the rq->curr.
This prev value is passed down through numerous functions, and
used in the class scheduler implementations. The fact that
prev was on_cpu until the end of __schedule(), meant it was
stable across the rq lock drops that the class->balance()
implementations often do.
However, with proxy-exec, the prev passed to functions called
by __schedule() is rq->donor, which may not be the same as
rq->curr and may not be on_cpu, this makes the prev value
potentially unstable across rq lock drops.
A recently found issue with proxy-exec, is when we begin doing
return migration from try_to_wake_up(), its possible we may be
waking up the rq->donor. When we do this, we proxy_resched_idle()
to put_prev_set_next() setting the rq->donor to rq->idle, allowing
the rq->donor to be return migrated and allowed to run.
This however runs into trouble, as on another cpu we might be in
the middle of calling __schedule(). Conceptually the rq lock is
held for the majority of the time, but in calling prev_balance()
its possible the class->balance() handler call may briefly drop the rq lock.
This opens a window for try_to_wake_up() to wake and return migrate the
rq->donor before the class logic reacquires the rq lock.
Unfortunately prev_balance() pass in a prev argument, to which we pass
rq->donor. However this prev value can now become stale and incorrect across a
rq lock drop.
So, to correct this, rework the prev_balance() call so that it does not take a
"prev" argument.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260512025635.2840817-2-jstultz@google.com
|
|
The reason for pick_next_task_fair() is the put/set optimization that
avoids touching the common ancestors. However, it is possible to
implement this in the put_prev_task() and set_next_task() calls as
used in put_prev_set_next_task().
Notably, put_prev_set_next_task() is the only site that:
- calls put_prev_task() with a .next argument;
- calls set_next_task() with .first = true.
This means that put_prev_task() can determine the common hierarchy and
stop there, and then set_next_task() can terminate where put_prev_task
stopped.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260511120628.057634261@infradead.org
|
|
Robots figured out you can read and write this concurrently and got
'upset'. Gemini even noted sched_dynamic_show() can generate
'confusing' output if it observed different values during the
printing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260511120627.176946327@infradead.org
|
|
Commit c9d93a73ce87 ("sched/fair: Drop redundant RCU read lock in NOHZ
kick path") removed the rcu_read_lock()/unlock() pair from
set_cpu_sd_state_busy() and set_cpu_sd_state_idle() on the assumption
that all callers run in a safe context for rcu_dereference_all(): IRQs
disabled or cpus_write_lock() held.
That assumption is wrong for the CPU hotplug teardown path. When CPUs
are taken offline, set_cpu_sd_state_busy() is invoked via:
cpuhp/N kthread
cpuhp_thread_fun()
cpuhp_invoke_callback()
sched_cpu_deactivate()
nohz_balance_exit_idle()
set_cpu_sd_state_busy()
rcu_dereference_all(per_cpu(sd_llc, cpu))
The cpuhp kthread holds cpu_hotplug_lock (percpu-rwsem) but runs with
preemption and IRQs enabled. As a result, lockdep correctly reports a
suspicious RCU usage on CPU offline, e.g.:
# echo 0 > /sys/devices/system/cpu/cpu1/online
=============================
WARNING: suspicious RCU usage
-----------------------------
kernel/sched/fair.c:12793 suspicious rcu_dereference_check() usage!
...
2 locks held by cpuhp/1/20:
#0: (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x42/0x1ae
#1: (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x72/0x1ae
Call Trace:
lockdep_rcu_suspicious
nohz_balance_exit_idle
sched_cpu_deactivate
cpuhp_invoke_callback
cpuhp_thread_fun
smpboot_thread_fn
Fix this by adding RCU read lock coverage to the one caller that lacks
it: nohz_balance_exit_idle() in the CPU hotplug teardown.
The other callers (nohz_balancer_kick() and nohz_balance_enter_idle())
genuinely run with IRQs disabled, so they remain unchanged.
Fixes: c9d93a73ce87 ("sched/fair: Drop redundant RCU read lock in NOHZ kick path")
Closes: https://lore.kernel.org/all/38fe0a1d-1a48-435a-910a-c278024d9ac9@samsung.com/
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260522092523.2046095-1-arighi@nvidia.com
|
|
Merge the cache aware balancer topic branch.
# Conflicts:
# kernel/sched/topology.c
|
|
K Prateek noticed we weren't setting the rq->next_class in
proxy_resched_idle(), when I was debugging an issue seen with
CONFIG_SCHED_PROXY_EXEC and some of Peter's new patches, and
suggested this fix.
So set rq->next_class when we temporarily switch the donor to
idle, so we don't accidentally call wakeup_preempt_fair()
with idle as the donor.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260514234732.3170197-1-jstultz@google.com
|
|
Now, that cpu_smt_mask is defined as cpumask_of(cpu) for
CONFIG_SCHED_SMT=n, it is possible to get rid of the ifdeffery.
Effectively,
- This makes sched_smt_present is defined always
- cpumask_weight(cpumask_of(cpu)) == 1. So sched_smt_present_inc/dec
will never enable the sched_smt_present. Which is expected.
- Paths that were compile-time eliminated become runtime guarded
using static keys.
- Defines set_idle_cores, test_idle_cores, etc which could likely benefit
the CONFIG_SCHED_SMT=n systems to use the same optimizations within the
LLC at wakeups.
- This will expose sched_smt_present symbol for CONFIG_SCHED_SMT=n.
Likely not a concern.
- There is a bloat of code CONFIG_SCHED_SMT=n. (NR_CPUS=2048)
add/remove: 24/18 grow/shrink: 26/28 up/down: 6396/-3188 (3208)
Total: Before=30629880, After=30633088, chg +0.01%
- No code bloat for CONFIG_SCHED_SMT=y, which is expected.
- Add comments around stop_core_cpuslocked on why ifdefs are not
removed.
- This leaves the remaining uses of CONFIG_SCHED_SMT mainly for
topology building bits which has a policy based decision.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260515172456.542799-3-sshegde@linux.ibm.com
|
|
The wrapper functions __trace_set_current_state() and
__trace_set_need_resched() allow the tracepoints to be called from code
outside sched/core.c, those calls are already guarded by a
tracepoint_enabled(<tp>) so there is no need to repeat this check once
again inside the call using trace_<tp>().
Use the new trace_call__<tp>() API to directly call the tracepoint
without check. Those helper functions must be called after the
appropriate check.
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260429094227.34087-1-gmonaco@redhat.com
|
|
A yield-triggered crash can happen when a newly forked sched_entity
enters the fair class with se->rel_deadline unexpectedly set.
The failing sequence is:
1. A task is forked while se->rel_deadline is still set.
2. __sched_fork() initializes vruntime, vlag and other sched_entity
state, but does not clear rel_deadline.
3. On the first enqueue, enqueue_entity() calls place_entity().
4. Because se->rel_deadline is set, place_entity() treats se->deadline
as a relative deadline and converts it to an absolute deadline by
adding the current vruntime.
5. However, the forked entity's deadline is not a valid inherited
relative deadline for this new scheduling instance, so the conversion
produces an abnormally large deadline.
6. If the task later calls sched_yield(), yield_task_fair() advances
se->vruntime to se->deadline.
7. The inflated vruntime is then used by the following enqueue path,
where the vruntime-derived key can overflow when multiplied by the
entity weight.
8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility
calculation, and can eventually make all entities appear ineligible.
pick_next_entity() may then return NULL unexpectedly, leading to a
later NULL dereference.
A captured trace shows the effect clearly. Before yield, the entity's
vruntime was around:
9834017729983308
After yield_task_fair() executed:
se->vruntime = se->deadline
the vruntime jumped to:
19668035460670230
and the deadline was later advanced further to:
19668035463470230
This shows that the deadline had already become abnormally large before
yield_task_fair() copied it into vruntime.
rel_deadline is only meaningful when se->deadline really carries a
relative deadline that still needs to be placed against vruntime. A
freshly forked sched_entity should not inherit or retain this state.
Clear se->rel_deadline in __sched_fork(), together with the other
sched_entity runtime state, so that the first enqueue does not interpret
the new entity's deadline as a stale relative deadline.
Fixes: 82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'")
Analyzed-by: Hui Tang <tanghui20@huawei.com>
Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verification updates from Steven Rostedt:
- Refactor da_monitor header to share handlers across monitor types
No functional changes, only less code duplication.
- Add Hybrid Automata model class
Add a new model class that extends deterministic automata by adding
constraints on transitions and states. Those constraints can take
into account wall-clock time and as such allow RV monitor to make
assertions on real time. Add documentation and code generation
scripts.
- Add stall monitor as hybrid automaton example
Add a monitor that triggers a violation when a task is stalling as an
example of automaton working with real time variables.
- Convert the opid monitor to a hybrid automaton
The opid monitor can be heavily simplified if written as a hybrid
automaton: instead of tracking preempt and interrupt enable/disable
events, it can just run constraints on the preemption/interrupt
states when events like wakeup and need_resched verify.
- Add support for per-object monitors in DA/HA
Allow writing deterministic and hybrid automata monitors for generic
objects (e.g. any struct), by exploiting a hash table where objects
are saved. This allows to track more than just tasks in RV. For
instance it will be used to track deadline entities in deadline
monitors.
- Add deadline tracepoints and move some deadline utilities
Prepare the ground for deadline monitors by defining events and
exporting helpers.
- Add nomiss deadline monitor
Add first example of deadline monitor asserting all entities complete
before their deadline.
- Improve rvgen error handling
Introduce AutomataError exception class and better handle expected
exceptions while showing a backtrace for unexpected ones.
- Improve python code quality in rvgen
Refactor the rvgen generation scripts to align with python best
practices: use f-strings instead of %, use len() instead of
__len__(), remove semicolons, use context managers for file
operations, fix whitespace violations, extract magic strings into
constants, remove unused imports and methods.
- Fix small bugs in rvgen
The generator scripts presented some corner case bugs: logical error
in validating what a correct dot file looks like, fix an isinstance()
check, enforce a dot file has an initial state, fix type annotations
and typos in comments.
- rvgen refactoring
Refactor automata.py to use iterator-based parsing and handle
required arguments directly in argparse.
- Allow epoll in rtapp-sleep monitor
The epoll_wait call is now rt-friendly so it should be allowed in the
sleep monitor as a valid sleep method.
* tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (32 commits)
rv: Allow epoll in rtapp-sleep monitor
rv/rvgen: fix _fill_states() return type annotation
rv/rvgen: fix unbound loop variable warning
rv/rvgen: enforce presence of initial state
rv/rvgen: extract node marker string to class constant
rv/rvgen: fix isinstance check in Variable.expand()
rv/rvgen: make monitor arguments required in rvgen
rv/rvgen: remove unused __get_main_name method
rv/rvgen: remove unused sys import from dot2c
rv/rvgen: refactor automata.py to use iterator-based parsing
rv/rvgen: use class constant for init marker
rv/rvgen: fix DOT file validation logic error
rv/rvgen: fix PEP 8 whitespace violations
rv/rvgen: fix typos in automata and generator docstring and comments
rv/rvgen: use context managers for file operations
rv/rvgen: remove unnecessary semicolons
rv/rvgen: replace __len__() calls with len()
rv/rvgen: replace % string formatting with f-strings
rv/rvgen: remove bare except clauses in generator
rv/rvgen: introduce AutomataError exception class
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- cgroup sub-scheduler groundwork
Multiple BPF schedulers can be attached to cgroups and the dispatch
path is made hierarchical. This involves substantial restructuring of
the core dispatch, bypass, watchdog, and dump paths to be
per-scheduler, along with new infrastructure for scheduler ownership
enforcement, lifecycle management, and cgroup subtree iteration
The enqueue path is not yet updated and will follow in a later cycle
- scx_bpf_dsq_reenq() generalized to support any DSQ including remote
local DSQs and user DSQs
Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched
to local DSQs either run immediately or get reenqueued back through
ops.enqueue(), giving schedulers tighter control over queueing
latency
Also useful for opportunistic CPU sharing across sub-schedulers
- ops.dequeue() was only invoked when the core knew a task was in BPF
data structures, missing scheduling property change events and
skipping callbacks for non-local DSQ dispatches from ops.select_cpu()
Fixed to guarantee exactly one ops.dequeue() call when a task leaves
BPF scheduler custody
- Kfunc access validation moved from runtime to BPF verifier time,
removing runtime mask enforcement
- Idle SMT sibling prioritization in the idle CPU selection path
- Documentation, selftest, and tooling updates. Misc bug fixes and
cleanups
* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
sched_ext: Make string params of __ENUM_set() const
tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
sched_ext: Drop spurious warning on kick during scheduler disable
sched_ext: Warn on task-based SCX op recursion
sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
sched_ext: Remove runtime kfunc mask enforcement
sched_ext: Add verifier-time kfunc context filter
sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
sched_ext: Decouple kfunc unlocked-context check from kf_mask
sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
sched_ext: Drop TRACING access to select_cpu kfuncs
selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
selftests/sched_ext: Improve runner error reporting for invalid arguments
sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
sched_ext: Documentation: Add ops.dequeue() to task lifecycle
tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduling updates:
- Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
- Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
- Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
- Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
- Avoid overflow in enqueue_entity() (K Prateek Nayak)
- Update overutilized detection (Vincent Guittot)
- Prevent negative lag increase during delayed dequeue (Vincent Guittot)
- Clear buddies for preempt_short (Vincent Guittot)
- Implement more complex proportional newidle balance (Peter Zijlstra)
- Increase weight bits for avg_vruntime (Peter Zijlstra)
- Use full weight to __calc_delta() (Peter Zijlstra)
RT and DL scheduling updates:
- Fix incorrect schedstats for rt and dl thread (Dengjun Su)
- Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
- Move group schedulability check to sched_rt_global_validate()
(Michal Koutný)
- Add reporting of runtime left & abs deadline to sched_getattr()
for DEADLINE tasks (Tommaso Cucinotta)
Scheduling topology updates by K Prateek Nayak:
- Compute sd_weight considering cpuset partitions
- Extract "imb_numa_nr" calculation into a separate helper
- Allocate per-CPU sched_domain_shared in s_data
- Switch to assigning "sd->shared" from s_data
- Remove sched_domain_shared allocation with sd_data
Energy-aware scheduling updates:
- Filter false overloaded_group case for EAS (Vincent Guittot)
- PM: EM: Switch to rcu_dereference_all() in wakeup path
(Dietmar Eggemann)
Infrastructure updates:
- Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
Proxy scheduling updates by John Stultz:
- Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
- Minimise repeated sched_proxy_exec() checking
- Fix potentially missing balancing with Proxy Exec
- Fix and improve task::blocked_on et al handling
- Add assert_balance_callbacks_empty() helper
- Add logic to zap balancing callbacks if we pick again
- Move attach_one_task() and attach_task() helpers to sched.h
- Handle blocked-waiter migration (and return migration)
- Add K Prateek Nayak to scheduler reviewers for proxy execution
Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot"
* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
sched/eevdf: Clear buddies for preempt_short
sched/rt: Cleanup global RT bandwidth functions
sched/rt: Move group schedulability check to sched_rt_global_validate()
sched/rt: Skip group schedulable check with rt_group_sched=0
sched/fair: Avoid overflow in enqueue_entity()
sched: Use u64 for bandwidth ratio calculations
sched/fair: Prevent negative lag increase during delayed dequeue
sched/fair: Use sched_energy_enabled()
sched: Handle blocked-waiter migration (and return migration)
sched: Move attach_one_task and attach_task helpers to sched.h
sched: Add logic to zap balance callbacks if we pick again
sched: Add assert_balance_callbacks_empty helper
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
sched: Fix modifying donor->blocked on without proper locking
locking: Add task::blocked_lock to serialize blocked_on state
sched: Fix potentially missing balancing with Proxy Exec
sched: Minimise repeated sched_proxy_exec() checking
sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
sched/core: Get this cpu once in ttwu_queue_cond()
...
|
|
For each runqueue, track the number of tasks with an LLC preference
and how many of them are running on their preferred LLC. This mirrors
nr_numa_running and nr_preferred_running for NUMA balancing, and will
be used by cache-aware load balancing in later patches.
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/459a37102f3d74a4e09ea58401d2094ac731d044.1775065312.git.tim.c.chen@linux.intel.com
|
|
Introduce an index mapping between CPUs and their LLCs. This provides
a roughly continuous per LLC index needed for cache-aware load balancing in
later patches.
The existing per_cpu llc_id usually points to the first CPU of the
LLC domain, which is sparse and unsuitable as an array index. Using
llc_id directly would waste memory.
With the new mapping, CPUs in the same LLC share an approximate
continuous id:
per_cpu(llc_id, CPU=0...15) = 0
per_cpu(llc_id, CPU=16...31) = 1
per_cpu(llc_id, CPU=32...47) = 2
...
Note that the LLC IDs are allocated via bitmask, so the IDs may be
reused during CPU offline->online transitions.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Originally-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/047ef46339e4db497b54a89940a7ebedf27fcf28.1775065312.git.tim.c.chen@linux.intel.com
|
|
Adds infrastructure to enable cache-aware load balancing,
which improves cache locality by grouping tasks that share resources
within the same cache domain. This reduces cache misses and improves
overall data access efficiency.
In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality.
This provides a basic model for cache affinity. While the current code
targets the last-level cache (LLC), the approach could be extended to
other domain types such as clusters (L2) or node-internal groupings.
At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.
In the future, more advanced policies could be integrated through NUMA
balancing-for example, migrating a task to its preferred LLC when spare
capacity exists, or swapping tasks across LLCs to improve cache affinity.
Grouping of tasks could also be generalized from that of a process
to be that of a NUMA group, or be user configurable.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/6269a53221b9439b9ca00d18a9d1946fb64d8cff.1775065312.git.tim.c.chen@linux.intel.com
|
|
to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and
runtime values, but it returns unsigned long. tg_rt_schedulable() also
stores the current group limit and the accumulated child sum in unsigned
long.
On 32-bit builds, large bandwidth ratios can be truncated and the RT
group sum can wrap when enough siblings are present. That can let an
overcommitted RT hierarchy pass the schedulability check, and it also
narrows the helper result for other callers.
Return u64 from to_ratio() and use u64 for the RT group totals so
bandwidth ratios are preserved and compared at full width on both 32-bit
and 64-bit builds.
Fixes: b40b2e8eb521 ("sched: rt: multi level group constraints")
Assisted-by: Codex:GPT-5
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com
|
|
Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.
Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.
Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.
This was split out from the larger proxy patch, and
significantly reworked.
Credits for the original patch go to:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-11-jstultz@google.com
|
|
With proxy-exec, a task is selected to run via pick_next_task(),
and then if it is a mutex blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which we will pick_again can call pick_next_task() to choose
something else.
However, in the first call to pick_next_task(), we may have
had a balance_callback setup by the class scheduler. After we
pick again, its possible pick_next_task_fair() will be called
which calls sched_balance_newidle() and sched_balance_rq().
This will throw a warning:
[ 8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[ 8.796467] Call Trace:
[ 8.796467] <TASK>
[ 8.796467] ? __warn.cold+0xb2/0x14e
[ 8.796467] ? sched_balance_rq+0xe92/0x1250
[ 8.796467] ? report_bug+0x107/0x1a0
[ 8.796467] ? handle_bug+0x54/0x90
[ 8.796467] ? exc_invalid_op+0x17/0x70
[ 8.796467] ? asm_exc_invalid_op+0x1a/0x20
[ 8.796467] ? sched_balance_rq+0xe92/0x1250
[ 8.796467] sched_balance_newidle+0x295/0x820
[ 8.796467] pick_next_task_fair+0x51/0x3f0
[ 8.796467] __schedule+0x23a/0x14b0
[ 8.796467] ? lock_release+0x16d/0x2e0
[ 8.796467] schedule+0x3d/0x150
[ 8.796467] worker_thread+0xb5/0x350
[ 8.796467] ? __pfx_worker_thread+0x10/0x10
[ 8.796467] kthread+0xee/0x120
[ 8.796467] ? __pfx_kthread+0x10/0x10
[ 8.796467] ret_from_fork+0x31/0x50
[ 8.796467] ? __pfx_kthread+0x10/0x10
[ 8.796467] ret_from_fork_asm+0x1a/0x30
[ 8.796467] </TASK>
This is because if a RT task was originally picked, it will
setup the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().
Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.
To handle this, add a zap_balance_callbacks() helper function
which cleans up the balance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-9-jstultz@google.com
|
|
With proxy-exec utilizing pick-again logic, we can end up having
balance callbacks set by the preivous pick_next_task() call left
on the list.
So pull the warning out into a helper function, and make sure we
check it when we pick again.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-8-jstultz@google.com
|
|
return-migration
As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.
Peter helpfully provided the following example with pictures:
"Suppose we have a ww_mutex cycle:
,-+-* Mutex-1 <-.
Task-A ---' | | ,-- Task-B
`-> Mutex-2 *-+-'
Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.
Then the blocked_on->owner chain will go in circles.
Task-A -> Mutex-2
^ |
| v
Mutex-1 <- Task-B
We need two things:
- find_proxy_task() to stop iterating the circle;
- the woken task to 'unblock' and run, such that it can
back-off and re-try the transaction.
Now, the current code [without this patch] does:
__clear_task_blocked_on();
wake_q_add();
And surely clearing ->blocked_on is sufficient to break the
cycle.
Suppose it is Task-B that is made to back-off, then we have:
Task-A -> Mutex-2 -> Task-B (no further blocked_on)
and it would attempt to run Task-B. Or worse, it could directly
pick Task-B and run it, without ever getting into
find_proxy_task().
Now, here is a problem because Task-B might not be runnable on
the CPU it is currently on; and because !task_is_blocked() we
don't get into the proxy paths, so nobody is going to fix this
up.
Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from
getting the task_rq_lock() and] spoils things."
Thus we need more than just a binary concept of the task being
blocked on a mutex or not.
So allow setting blocked_on to PROXY_WAKING as a special value
which specifies the task is no longer blocked, but needs to
be evaluated for return migration *before* it can be run.
This will then be used in a later patch to handle proxy
return-migration.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-7-jstultz@google.com
|
|
Introduce an action enum in find_proxy_task() which allows
us to handle work needed to be done outside the mutex.wait_lock
and task.blocked_lock guard scopes.
This ensures proper locking when we clear the donor's blocked_on
pointer in proxy_deactivate(), and the switch statement will be
useful as we add more cases to handle later in this series.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-6-jstultz@google.com
|
|
So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.
So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com
|
|
K Prateek pointed out that with Proxy Exec, we may have cases
where we context switch in __schedule(), while the donor remains
the same. This could cause balancing issues, since the
put_prev_set_next() logic short-cuts if (prev == next). With
proxy-exec prev is the previous donor, and next is the next
donor. Should the donor remain the same, but different tasks are
picked to actually run, the shortcut will have avoided enqueuing
the sched class balance callback.
So, if we are context switching, add logic to catch the
same-donor case, and trigger the put_prev/set_next calls to
ensure the balance callbacks get enqueued.
Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-4-jstultz@google.com
|
|
Peter noted: Compilers are really bad (as in they utterly refuse)
optimizing (even when marked with __pure) the static branch
things, and will happily emit multiple identical in a row.
So pull out the one obvious sched_proxy_exec() branch in
__schedule() and remove some of the 'implicit' ones in that
path.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-3-jstultz@google.com
|
|
proxy_tag_curr()
With proxy-execution, the scheduler selects the donor, but for
blocked donors, we end up running the lock owner.
This caused some complexity, because the class schedulers make
sure to remove the task they pick from their pushable task
lists, which prevents the donor from being migrated, but there
wasn't then anything to prevent rq->curr from being migrated
if rq->curr != rq->donor.
This was sort of hacked around by calling proxy_tag_curr() on
the rq->curr task if we were running something other then the
donor. proxy_tag_curr() did a dequeue/enqueue pair on the
rq->curr task, allowing the class schedulers to remove it from
their pushable list.
The dequeue/enqueue pair was wasteful, and additonally K Prateek
highlighted that we didn't properly undo things when we stopped
proxying, leaving the lock owner off the pushable list.
After some alternative approaches were considered, Peter
suggested just having the RT/DL classes just avoid migrating
when task_on_cpu().
So rework pick_next_pushable_dl_task() and the rt
pick_next_pushable_task() functions so that they skip over the
first pushable task if it is on_cpu.
Then just drop all of the proxy_tag_curr() logic.
Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-2-jstultz@google.com
|
|
Add the following tracepoints:
* sched_dl_throttle(dl_se, cpu, type):
Called when a deadline entity is throttled
* sched_dl_replenish(dl_se, cpu, type):
Called when a deadline entity's runtime is replenished
* sched_dl_update(dl_se, cpu, type):
Called when a deadline entity updates without throttle or replenish
* sched_dl_server_start(dl_se, cpu, type):
Called when a deadline server is started
* sched_dl_server_stop(dl_se, cpu, type):
Called when a deadline server is stopped
Those tracepoints can be useful to validate the deadline scheduler with
RV and are not exported to tracefs.
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20260330111010.153663-11-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
|
|
Calling smp_processor_id() on:
- In CONFIG_DEBUG_PREEMPT=y, if preemption/irq is disabled, then it does
not print any warning.
- In CONFIG_DEBUG_PREEMPT=n, it doesn't do anything apart from getting
__smp_processor_id
So with both CONFIG_DEBUG_PREEMPT=y/n, in preemption disabled section
it is better to cache the value. It could save a few cycles. Though
tiny, repeated could add up to a small value.
ttwu_queue_cond is called with interrupt disabled. So preemption is
disabled. Hence cache the value once instead.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Link: https://patch.msgid.link/20260323193630.640311-3-sshegde@linux.ibm.com
|
|
Resolve conflict between this change in the upstream kernel:
4c652a47722f ("rseq: Mark rseq_arm_slice_extension_timer() __always_inline")
... and this pending change in timers/core:
0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Chasing vfork()'ed tasks on a CID ownership mode switch requires a full
task list walk, which is obviously expensive on large systems.
Avoid that by keeping a list of tasks using a mm MMCID entity in mm::mm_cid
and walk this list instead. This removes the proven to be flaky counting
logic and avoids a full task list walk in the case of vfork()'ed tasks.
Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.183824481@kernel.org
|
|
This is a leftover from the early versions of this function where it could
be invoked without mm::mm_cid::lock held.
Remove it and add lockdep asserts instead.
Fixes: 653fda7ae73d ("sched/mmcid: Switch over to the new mechanism")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.116363613@kernel.org
|