lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
4 days	sched/mmcid: Fix OOB clear_bit when CID is MM_CID_UNSET in fixup path	Rik van Riel
	In mm_cid_fixup_cpus_to_tasks(), when rq->curr has the target mm and mm_cid.active is set, the CID is checked with cid_in_transit() before setting the transition bit. In per-CPU mode a newly forked or exec'd task can be running with mm_cid.cid == MM_CID_UNSET because CIDs are assigned lazily on schedule-in. With cid_in_transit() the guard passes for MM_CID_UNSET (no transit bit), converts it to MM_CID_UNSET \| MM_CID_TRANSIT and stores it back; later mm_cid_schedout() feeds this to clear_bit() with MM_CID_UNSET as the bit number, triggering an out-of-bounds write. Symptoms: this is genuine memory corruption, but a bounded out-of-bounds write, not an arbitrary one. MM_CID_UNSET is the fixed sentinel BIT(31), so once the bad value reaches mm_cid_schedout() the cid_from_transit_cid() strip leaves MM_CID_UNSET, which fails the "cid < max_cids" convergence test and falls into mm_drop_cid() -> clear_bit(MM_CID_UNSET, mm_cidmask(mm)). The cid bitmap is embedded in the mm_struct slab object (after cpu_bitmap and mm_cpus_allowed) and is only num_possible_cpus() bits wide, so clearing bit 31 is a deterministic OOB bit-clear at a fixed offset of 2^31 / 8 == 256 MiB past the bitmap base. The address is not attacker-influenced (fixed sentinel -> fixed offset) and the op only clears a single bit; what sits 256 MiB further along the direct map is whatever kernel object happens to live there, so this corrupts one bit of unpredictable kernel memory -- it is not an arbitrary-address or arbitrary-value write. It triggers only in per-CPU CID mode, when a CPU is running an active task of the target mm whose cid is still MM_CID_UNSET -- the fork()/execve() window before that task's next schedule-in assigns it a real CID -- and a per-CPU -> per-task fixup walks over it (the mode fallback driven by a thread exit, sched_mm_cid_exit(), or by the deferred max_cids recompute in mm_cid_work_fn()). In practice syzkaller surfaced it as a KASAN use-after-free reported in __schedule -> mm_cid_switch_to, where the offending clear_bit() is inlined via mm_cid_schedout() -> mm_drop_cid(). Guard the transition-bit assignment against MM_CID_UNSET, in addition to the existing cid_in_transit() check, so the bit is only set on a genuine task-owned CID. A CPU-owned (MM_CID_ONCPU) CID of a running active task is handled by the cid_on_cpu(pcp->cid) branch above and never reaches this path, so excluding MM_CID_UNSET (and the already-transitioning case) is sufficient. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Assisted-by: Claude:claude-opus-4-8 syzkaller Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260616203818.1516263-1-riel@surriel.com
9 days	Merge tag 'sched-core-2026-06-14' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "SMP load-balancing updates: - A large series to introduce infrastructure for cache-aware load balancing, with the goal of co-locating tasks that share data within the same Last Level Cache (LLC) domain. By improving cache locality, the scheduler can reduce cache bouncing and cache misses, ultimately improving data access efficiency. Implemented by Chen Yu and Tim Chen, based on early prototype work by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and Shrikanth Hegde. - A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde) Fair scheduler updates: - A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing SMT awareness (Andrea Righi, K Prateek Nayak) - A series to optimize cfs_rq and sched_entity allocation for better data locality (Zecheng Li) - A preparatory series to change fair/cgroup scheduling to a single runqueue, without the final change (Peter Zijlstra) - Auto-manage ext/fair dl_server bandwidth (Andrea Righi) - Fix cpu_util runnable_avg arithmetic (Hongyan Xia) - Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel) - Allow account_cfs_rq_runtime() to throttle current hierarchy (K Prateek Nayak) - Update util_est after updating util_avg during dequeue, to fix the util signal update logic, which reduces signal noise (Vincent Guittot) Scheduler topology updates: - Allow multiple domains to claim sched_domain_shared (K Prateek Nayak) - Add parameter to split LLC (Peter Zijlstra) Core scheduler updates: - Use trace_call__<tp>() to save a static branch (Gabriele Monaco) Scheduler statistics updates: - Drop now-stale mul_u64_u64_div_u64() cputime over-approximation guard (Nicolas Pitre) Deadline scheduler updates: - Reject debugfs dl_server writes for offline CPUs (Andrea Righi) - Fix replenishment logic for non-deferred servers (Yuri Andriaccio) RT scheduling updates: - Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt) - Update default bandwidth for real-time tasks to 1.0 (Yuri Andriaccio) Proxy scheduling updates: - A series to implement Optimized Donor Migration for Proxy Execution (John Stultz, Peter Zijlstra) - Various proxy scheduling cleanups and fixes (Peter Zijlstra, K Prateek Nayak) Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi, Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde, Peter Zijlstra, Liang Luo and Yiyang Chen" * tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits) sched/fair: Fix newidle vs core-sched sched/deadline: Use task_on_rq_migrating() helper sched/core: Combine separate 'else' and 'if' statements sched/fair: Fix cpu_util runnable_avg arithmetic sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() sched/fair: Call update_curr() before unthrottling the hierarchy sched/fair: Use throttled_csd_list for local unthrottle sched/fair: Convert cfs bandwidth throttling to use guards sched/fair: Allocate cfs_tg_state with percpu allocator sched/fair: Remove task_group->se pointer array sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state sched: restore timer_slack_ns when resetting RT policy on fork MAINTAINERS: Fix spelling mistake in Peter's name sched: Simplify ttwu_runnable() sched/proxy: Remove superfluous clear_task_blocked_in() sched/proxy: Remove PROXY_WAKING sched/proxy: Switch proxy to use p->is_blocked sched/proxy: Only return migrate when needed sched: Be more strict about p->is_blocked ...
2026-06-09	sched/core: Combine separate 'else' and 'if' statements	Liang Luo
	The kernel coding style recommends using 'else if' instead of placing 'if' on a separate line after 'else'. This change makes the code consistent with the rest of the kernel codebase. Signed-off-by: Liang Luo <luoliang@kylinos.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260608071842.325159-1-luoliang@kylinos.cn
2026-06-02	tick/sched: Move dyntick-idle cputime accounting to cputime code	Frederic Weisbecker
	Although the dynticks-idle cputime accounting is necessarily tied to the tick subsystem, the actual related accounting code has no business residing there and should be part of the scheduler cputime code. Move away the relevant pieces and state machine to where they belong. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-10-frederic@kernel.org
2026-06-02	sched/fair: Call update_curr() before unthrottling the hierarchy	K Prateek Nayak
	Subsequent commits will allow update_curr() to throttle the hierarchy when the runtime accounting exceeds allocated quota. Call update_curr() before the unthrottle event, and in tg_unthrottle_up() to catch up on any remaining runtime and stabilize the "runtime_remaining" and "throttle_count" for that cfs_rq. Doing an update_curr() early ensures the cfs_rq is not throttled right back up again when the unthrottle is in progress. Since all callers of unthrottle_cfs_rq(), except two, already update the rq_clock and call rq_clock_start_loop_update(), move the update_rq_clock() from unthrottle_cfs_rq() to the callers that don't update the rq_clock. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Benjamin Segall <bsegall@google.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20260602052531.11450-1-kprateek.nayak@amd.com
2026-06-02	sched/fair: Allocate cfs_tg_state with percpu allocator	Zecheng Li
	To remove the cfs_rq pointer array in task_group, allocate the combined cfs_rq and sched_entity using the per-cpu allocator. This patch implements the following: - Changes task_group->cfs_rq from 'struct cfs_rq *' to 'struct cfs_rq __percpu '. - Updates memory allocation in alloc_fair_sched_group() and free_fair_sched_group() to use alloc_percpu() and free_percpu() respectively. - Uses the inline accessor tg_cfs_rq(tg, cpu) with per_cpu_ptr() to retrieve the pointer to cfs_rq for the given task group and CPU. - Replaces direct accesses tg->cfs_rq[cpu] with calls to the new tg_cfs_rq(tg, cpu) helper. - Handles the root_task_group: since struct rq is already a per-cpu variable (runqueues), its embedded cfs_rq (rq->cfs) is also per-cpu. Therefore, we assign root_task_group.cfs_rq = &runqueues.cfs. - Cleanup the code in initializing the root task group. This change places each CPU's cfs_rq and sched_entity in its local per-cpu memory area to remove the per-task_group pointer arrays. Signed-off-by: Zecheng Li <zecheng@google.com> Signed-off-by: Zecheng Li <zli94@ncsu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Josh Don <joshdon@google.com> Link: https://patch.msgid.link/20260522141623.600235-4-zli94@ncsu.edu
2026-06-02	sched/fair: Remove task_group->se pointer array	Zecheng Li
	Now that struct sched_entity is co-located with struct cfs_rq for non-root task groups, the task_group->se pointer array is redundant. The associated sched_entity can be loaded directly from the cfs_rq. This patch performs the access conversion with the helpers: - is_root_task_group(tg): checks if a task group is the root task group. It compares the task group's address with the global root_task_group variable. - tg_se(tg, cpu): retrieves the cfs_rq and returns the address of the co-located se. This function checks if tg is the root task group to ensure behaving the same of previous tg->se[cpu]. Replaces all accesses that use the tg->se[cpu] pointer array with calls to the new tg_se(tg, cpu) accessor. - cfs_rq_se(cfs_rq): simplifies access paths like cfs_rq->tg->se[...] to use the co-located sched_entity. This function also checks if tg is the root task group to ensure same behavior. Since tg_se is not in very hot code paths, and the branch is a register comparison with an immediate value (`&root_task_group`), the performance impact is expected to be negligible. Signed-off-by: Zecheng Li <zecheng@google.com> Signed-off-by: Zecheng Li <zli94@ncsu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Josh Don <joshdon@google.com> Link: https://patch.msgid.link/20260522141623.600235-3-zli94@ncsu.edu
2026-06-02	sched: restore timer_slack_ns when resetting RT policy on fork	Guanyou.Chen
	Commit ed4fb6d7ef68 ("hrtimer: Use and report correct timerslack values for realtime tasks") sets timer_slack_ns to 0 for RT tasks in __setscheduler_params(). However, when an RT task with SCHED_RESET_ON_FORK creates child threads, the children inherit timer_slack_ns=0 from the parent. sched_fork() resets the child's policy to SCHED_NORMAL but does not restore timer_slack_ns, leaving the child permanently running with zero slack. Fix this by restoring timer_slack_ns from default_timer_slack_ns in sched_fork() when resetting from RT/DL to NORMAL policy, matching the existing behavior in __setscheduler_params(). Note: this fix alone requires a correct default_timer_slack_ns to be effective. See the following patch for that fix. Fixes: ed4fb6d7ef68 ("hrtimer: Use and report correct timerslack values for realtime tasks") Reported-by: Qiaoting.Lin <linqiaoting@xiaomi.com> Signed-off-by: Guanyou.Chen <chenguanyou@xiaomi.com> Signed-off-by: Chunhui.Li <chunhui.li@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260522131000.1664983-2-chenguanyou@xiaomi.com
2026-06-02	sched: Simplify ttwu_runnable()	Peter Zijlstra
	Note that both proxy and delayed tasks have ->is_blocked set. Use this one condition to guard both paths. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260526113322.714832584%40infradead.org
2026-06-02	sched/proxy: Remove superfluous clear_task_blocked_in()	Peter Zijlstra
	Per the discussion here: https://lore.kernel.org/all/20260403112810.GG3738786@noisy.programming.kicks-ass.net/ The reason for this condition is that the signal condition in try_to_block_task() would set_task_blocked_in_waking(). However, it no longer does that, in fact, that path does clear_task_blocked_on(). Further, per the discussions here: https://lore.kernel.org/r/dc61cf77-e541-441d-a708-c40e19aa0db2%40amd.com https://lore.kernel.org/r//9dd1d24d-45d3-4ee2-8e67-8305b34bfb6d%40amd.com there are a few other edge cases that needed this. But they're all variants of PROXY_WAKING leaking out. And since PROXY_WAKING is now gone, this is no longer needed either. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260526113322.120970670%40infradead.org
2026-06-02	sched/proxy: Remove PROXY_WAKING	K Prateek Nayak
	Now that the proxy path uses ->is_blocked, use the '->is_blocked && !->blocked_on' state instead of PROXY_WAKING. Notably, this is where a blocked_on relation is broken but the donor task might still need a return migration. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260526113322.596522894%40infradead.org
2026-06-02	sched/proxy: Switch proxy to use p->is_blocked	Peter Zijlstra
	Rather than gate the proxy paths with p->blocked_on, use p->is_blocked. This opens up the state: '->is_blocked && !->blocked_on' for future use. Notably, only proxy and delayed tasks can be ->on_rq && ->is_blocked, and it is guaranteed that sched_class::pick_task() will never return a delayed task. Therefore any task returned from pick_next_task() that has ->is_blocked set, must be a proxy task. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260526113322.477954312%40infradead.org
2026-06-02	sched/proxy: Only return migrate when needed	Peter Zijlstra
	Current code will 'unconditionally' return migrate on PROXY_WAKING, even if the task is (still) on the original CPU. Check task_cpu(p) against p->waking_cpu, which per proxy_set_task_cpu() preserves the original CPU the task was on. If they do not mis-match, there is no need to go through the more expensive wakeup path. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260527082916.GP3126523%40noisy.programming.kicks-ass.net
2026-06-02	sched: Be more strict about p->is_blocked	Peter Zijlstra
	Upon entry to try_to_block_task(), p->is_blocked should be false. After all, the prior wakeup would have made it so per ttwu_do_wakeup(). Ensure this is the case, rather than clearing it in the path that doesn't set it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260526113322.364017314%40infradead.org
2026-06-02	sched/proxy: Optimize try_to_wake_up()	Peter Zijlstra
	The reason for the clause in try_to_wake_up() is, per its comment, that find_proxy_task()'s proxy_deactivate() is not always called with a cleared p->blocked_on. However, that seems silly and easily cured. Make sure to always call proxy_deactivate() with a cleared p->blocked_on such that we might remove this clause from the common wake-up path. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260526113322.244729903%40infradead.org
2026-06-02	sched: Add blocked_donor link to task for smarter mutex handoffs	Peter Zijlstra
	Add link to the task this task is proxying for, and use it so the mutex owner can do an intelligent hand-off of the mutex to the task that the owner is running on behalf. [jstultz: This patch was split out from larger proxy patch] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-8-jstultz@google.com
2026-06-02	sched: Add is_blocked task flag	John Stultz
	Add a new is_blocked flag to the task struct. This flag is set by try_to_block_task() and cleared by ttwu_do_wakeup() and tracks if the task is blocked. Traditionally this would mirror !p->on_rq, however due things like DELAY_DEQUEUE and PROXY_EXEC, this can diverge, so its useful to manage separately. Additionally with this, we might be able to get rid of the p->se.sched_delayed (ab)use in the core code (eventually). Taken whole cloth from Peter's email: https://lore.kernel.org/lkml/20260501132143.GC1026330@noisy.programming.kicks-ass.net/ With a few additional p->is_blocked = 0 in a few cases where we return current if blocked_on gets zeroed or there is no owner. This may hint that these current special cases might be dropped eventually. This change also helps resolve wait-queue stalls seen with proxy-execution. See previous patch attempts for details: https://lore.kernel.org/lkml/20260430215103.2978955-2-jstultz@google.com/ Reported-by: Vineeth Pillai <vineethrp@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-7-jstultz@google.com
2026-06-02	sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case	John Stultz
	This patch adds logic so try_to_wake_up() will notice if we are waking a task where blocked_on == PROXY_WAKING, and if necessary dequeue the task so the wakeup will naturally return-migrate the donor task back to a cpu it can run on. This helps performance as we do the dequeue and wakeup under the locks normally taken in the try_to_wake_up() and avoids having to do proxy_force_return() from __schedule(), which has to re-take similar locks and then force a pick again loop. This was split out from the larger proxy patch, and significantly reworked. Credits for the original patch go to: Peter Zijlstra (Intel) <peterz@infradead.org> Juri Lelli <juri.lelli@redhat.com> Valentin Schneider <valentin.schneider@arm.com> Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-6-jstultz@google.com
2026-06-02	sched: Rework block_task so it can be directly called	John Stultz
	Pull most of the logic out of try_to_block_task() and put it into block_task() directly, so that we can call block_task() and not have to worry about the failing cases in try_to_block_task() Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-5-jstultz@google.com
2026-06-02	sched: Rework prev_balance() to avoid stale prev references	John Stultz
	Historically, the prev value from __schedule() was the rq->curr. This prev value is passed down through numerous functions, and used in the class scheduler implementations. The fact that prev was on_cpu until the end of __schedule(), meant it was stable across the rq lock drops that the class->balance() implementations often do. However, with proxy-exec, the prev passed to functions called by __schedule() is rq->donor, which may not be the same as rq->curr and may not be on_cpu, this makes the prev value potentially unstable across rq lock drops. A recently found issue with proxy-exec, is when we begin doing return migration from try_to_wake_up(), its possible we may be waking up the rq->donor. When we do this, we proxy_resched_idle() to put_prev_set_next() setting the rq->donor to rq->idle, allowing the rq->donor to be return migrated and allowed to run. This however runs into trouble, as on another cpu we might be in the middle of calling __schedule(). Conceptually the rq lock is held for the majority of the time, but in calling prev_balance() its possible the class->balance() handler call may briefly drop the rq lock. This opens a window for try_to_wake_up() to wake and return migrate the rq->donor before the class logic reacquires the rq lock. Unfortunately prev_balance() pass in a prev argument, to which we pass rq->donor. However this prev value can now become stale and incorrect across a rq lock drop. So, to correct this, rework the prev_balance() call so that it does not take a "prev" argument. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260512025635.2840817-2-jstultz@google.com
2026-05-26	sched: Remove sched_class::pick_next_task()	Peter Zijlstra
	The reason for pick_next_task_fair() is the put/set optimization that avoids touching the common ancestors. However, it is possible to implement this in the put_prev_task() and set_next_task() calls as used in put_prev_set_next_task(). Notably, put_prev_set_next_task() is the only site that: - calls put_prev_task() with a .next argument; - calls set_next_task() with .first = true. This means that put_prev_task() can determine the common hierarchy and stop there, and then set_next_task() can terminate where put_prev_task stopped. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260511120628.057634261@infradead.org
2026-05-26	sched: Use {READ,WRITE}_ONCE() for preempt_dynamic_mode	Peter Zijlstra
	Robots figured out you can read and write this concurrently and got 'upset'. Gemini even noted sched_dynamic_show() can generate 'confusing' output if it observed different values during the printing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260511120627.176946327@infradead.org
2026-05-26	sched/fair: Fix RCU usage in NOHZ exit path on CPU offline	Andrea Righi
	Commit c9d93a73ce87 ("sched/fair: Drop redundant RCU read lock in NOHZ kick path") removed the rcu_read_lock()/unlock() pair from set_cpu_sd_state_busy() and set_cpu_sd_state_idle() on the assumption that all callers run in a safe context for rcu_dereference_all(): IRQs disabled or cpus_write_lock() held. That assumption is wrong for the CPU hotplug teardown path. When CPUs are taken offline, set_cpu_sd_state_busy() is invoked via: cpuhp/N kthread cpuhp_thread_fun() cpuhp_invoke_callback() sched_cpu_deactivate() nohz_balance_exit_idle() set_cpu_sd_state_busy() rcu_dereference_all(per_cpu(sd_llc, cpu)) The cpuhp kthread holds cpu_hotplug_lock (percpu-rwsem) but runs with preemption and IRQs enabled. As a result, lockdep correctly reports a suspicious RCU usage on CPU offline, e.g.: # echo 0 > /sys/devices/system/cpu/cpu1/online ============================= WARNING: suspicious RCU usage ----------------------------- kernel/sched/fair.c:12793 suspicious rcu_dereference_check() usage! ... 2 locks held by cpuhp/1/20: #0: (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x42/0x1ae #1: (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x72/0x1ae Call Trace: lockdep_rcu_suspicious nohz_balance_exit_idle sched_cpu_deactivate cpuhp_invoke_callback cpuhp_thread_fun smpboot_thread_fn Fix this by adding RCU read lock coverage to the one caller that lacks it: nohz_balance_exit_idle() in the CPU hotplug teardown. The other callers (nohz_balancer_kick() and nohz_balance_enter_idle()) genuinely run with IRQs disabled, so they remain unchanged. Fixes: c9d93a73ce87 ("sched/fair: Drop redundant RCU read lock in NOHZ kick path") Closes: https://lore.kernel.org/all/38fe0a1d-1a48-435a-910a-c278024d9ac9@samsung.com/ Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260522092523.2046095-1-arighi@nvidia.com
2026-05-19	Merge branch 'sched/cache'	Peter Zijlstra
	Merge the cache aware balancer topic branch. # Conflicts: # kernel/sched/topology.c
2026-05-19	sched: Switch rq->next_class on proxy_resched_idle()	John Stultz
	K Prateek noticed we weren't setting the rq->next_class in proxy_resched_idle(), when I was debugging an issue seen with CONFIG_SCHED_PROXY_EXEC and some of Peter's new patches, and suggested this fix. So set rq->next_class when we temporarily switch the donor to idle, so we don't accidentally call wakeup_preempt_fair() with idle as the donor. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260514234732.3170197-1-jstultz@google.com
2026-05-19	sched: Simplify ifdeffery around cpu_smt_mask	Shrikanth Hegde
	Now, that cpu_smt_mask is defined as cpumask_of(cpu) for CONFIG_SCHED_SMT=n, it is possible to get rid of the ifdeffery. Effectively, - This makes sched_smt_present is defined always - cpumask_weight(cpumask_of(cpu)) == 1. So sched_smt_present_inc/dec will never enable the sched_smt_present. Which is expected. - Paths that were compile-time eliminated become runtime guarded using static keys. - Defines set_idle_cores, test_idle_cores, etc which could likely benefit the CONFIG_SCHED_SMT=n systems to use the same optimizations within the LLC at wakeups. - This will expose sched_smt_present symbol for CONFIG_SCHED_SMT=n. Likely not a concern. - There is a bloat of code CONFIG_SCHED_SMT=n. (NR_CPUS=2048) add/remove: 24/18 grow/shrink: 26/28 up/down: 6396/-3188 (3208) Total: Before=30629880, After=30633088, chg +0.01% - No code bloat for CONFIG_SCHED_SMT=y, which is expected. - Add comments around stop_core_cpuslocked on why ifdefs are not removed. - This leaves the remaining uses of CONFIG_SCHED_SMT mainly for topology building bits which has a policy based decision. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260515172456.542799-3-sshegde@linux.ibm.com
2026-05-05	sched: Use trace_call__<tp>() to save a static branch	Gabriele Monaco
	The wrapper functions __trace_set_current_state() and __trace_set_need_resched() allow the tracepoints to be called from code outside sched/core.c, those calls are already guarded by a tracepoint_enabled(<tp>) so there is no need to repeat this check once again inside the call using trace_<tp>(). Use the new trace_call__<tp>() API to directly call the tracepoint without check. Those helper functions must be called after the appropriate check. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260429094227.34087-1-gmonaco@redhat.com
2026-04-28	sched/fair: Clear rel_deadline when initializing forked entities	Zicheng Qu
	A yield-triggered crash can happen when a newly forked sched_entity enters the fair class with se->rel_deadline unexpectedly set. The failing sequence is: 1. A task is forked while se->rel_deadline is still set. 2. __sched_fork() initializes vruntime, vlag and other sched_entity state, but does not clear rel_deadline. 3. On the first enqueue, enqueue_entity() calls place_entity(). 4. Because se->rel_deadline is set, place_entity() treats se->deadline as a relative deadline and converts it to an absolute deadline by adding the current vruntime. 5. However, the forked entity's deadline is not a valid inherited relative deadline for this new scheduling instance, so the conversion produces an abnormally large deadline. 6. If the task later calls sched_yield(), yield_task_fair() advances se->vruntime to se->deadline. 7. The inflated vruntime is then used by the following enqueue path, where the vruntime-derived key can overflow when multiplied by the entity weight. 8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility calculation, and can eventually make all entities appear ineligible. pick_next_entity() may then return NULL unexpectedly, leading to a later NULL dereference. A captured trace shows the effect clearly. Before yield, the entity's vruntime was around: 9834017729983308 After yield_task_fair() executed: se->vruntime = se->deadline the vruntime jumped to: 19668035460670230 and the deadline was later advanced further to: 19668035463470230 This shows that the deadline had already become abnormally large before yield_task_fair() copied it into vruntime. rel_deadline is only meaningful when se->deadline really carries a relative deadline that still needs to be placed against vruntime. A freshly forked sched_entity should not inherit or retain this state. Clear se->rel_deadline in __sched_fork(), together with the other sched_entity runtime state, so that the first enqueue does not interpret the new entity's deadline as a stale relative deadline. Fixes: 82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'") Analyzed-by: Hui Tang <tanghui20@huawei.com> Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com
2026-04-15	Merge tag 'trace-rv-v7.1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verification updates from Steven Rostedt: - Refactor da_monitor header to share handlers across monitor types No functional changes, only less code duplication. - Add Hybrid Automata model class Add a new model class that extends deterministic automata by adding constraints on transitions and states. Those constraints can take into account wall-clock time and as such allow RV monitor to make assertions on real time. Add documentation and code generation scripts. - Add stall monitor as hybrid automaton example Add a monitor that triggers a violation when a task is stalling as an example of automaton working with real time variables. - Convert the opid monitor to a hybrid automaton The opid monitor can be heavily simplified if written as a hybrid automaton: instead of tracking preempt and interrupt enable/disable events, it can just run constraints on the preemption/interrupt states when events like wakeup and need_resched verify. - Add support for per-object monitors in DA/HA Allow writing deterministic and hybrid automata monitors for generic objects (e.g. any struct), by exploiting a hash table where objects are saved. This allows to track more than just tasks in RV. For instance it will be used to track deadline entities in deadline monitors. - Add deadline tracepoints and move some deadline utilities Prepare the ground for deadline monitors by defining events and exporting helpers. - Add nomiss deadline monitor Add first example of deadline monitor asserting all entities complete before their deadline. - Improve rvgen error handling Introduce AutomataError exception class and better handle expected exceptions while showing a backtrace for unexpected ones. - Improve python code quality in rvgen Refactor the rvgen generation scripts to align with python best practices: use f-strings instead of %, use len() instead of __len__(), remove semicolons, use context managers for file operations, fix whitespace violations, extract magic strings into constants, remove unused imports and methods. - Fix small bugs in rvgen The generator scripts presented some corner case bugs: logical error in validating what a correct dot file looks like, fix an isinstance() check, enforce a dot file has an initial state, fix type annotations and typos in comments. - rvgen refactoring Refactor automata.py to use iterator-based parsing and handle required arguments directly in argparse. - Allow epoll in rtapp-sleep monitor The epoll_wait call is now rt-friendly so it should be allowed in the sleep monitor as a valid sleep method. * tag 'trace-rv-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (32 commits) rv: Allow epoll in rtapp-sleep monitor rv/rvgen: fix _fill_states() return type annotation rv/rvgen: fix unbound loop variable warning rv/rvgen: enforce presence of initial state rv/rvgen: extract node marker string to class constant rv/rvgen: fix isinstance check in Variable.expand() rv/rvgen: make monitor arguments required in rvgen rv/rvgen: remove unused __get_main_name method rv/rvgen: remove unused sys import from dot2c rv/rvgen: refactor automata.py to use iterator-based parsing rv/rvgen: use class constant for init marker rv/rvgen: fix DOT file validation logic error rv/rvgen: fix PEP 8 whitespace violations rv/rvgen: fix typos in automata and generator docstring and comments rv/rvgen: use context managers for file operations rv/rvgen: remove unnecessary semicolons rv/rvgen: replace __len__() calls with len() rv/rvgen: replace % string formatting with f-strings rv/rvgen: remove bare except clauses in generator rv/rvgen: introduce AutomataError exception class ...
2026-04-15	Merge tag 'sched_ext-for-7.1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - cgroup sub-scheduler groundwork Multiple BPF schedulers can be attached to cgroups and the dispatch path is made hierarchical. This involves substantial restructuring of the core dispatch, bypass, watchdog, and dump paths to be per-scheduler, along with new infrastructure for scheduler ownership enforcement, lifecycle management, and cgroup subtree iteration The enqueue path is not yet updated and will follow in a later cycle - scx_bpf_dsq_reenq() generalized to support any DSQ including remote local DSQs and user DSQs Built on top of this, SCX_ENQ_IMMED guarantees that tasks dispatched to local DSQs either run immediately or get reenqueued back through ops.enqueue(), giving schedulers tighter control over queueing latency Also useful for opportunistic CPU sharing across sub-schedulers - ops.dequeue() was only invoked when the core knew a task was in BPF data structures, missing scheduling property change events and skipping callbacks for non-local DSQ dispatches from ops.select_cpu() Fixed to guarantee exactly one ops.dequeue() call when a task leaves BPF scheduler custody - Kfunc access validation moved from runtime to BPF verifier time, removing runtime mask enforcement - Idle SMT sibling prioritization in the idle CPU selection path - Documentation, selftest, and tooling updates. Misc bug fixes and cleanups * tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits) tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY() sched_ext: Make string params of __ENUM_set() const tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap sched_ext: Drop spurious warning on kick during scheduler disable sched_ext: Warn on task-based SCX op recursion sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok() sched_ext: Remove runtime kfunc mask enforcement sched_ext: Add verifier-time kfunc context filter sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() sched_ext: Decouple kfunc unlocked-context check from kf_mask sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked sched_ext: Drop TRACING access to select_cpu kfuncs selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code selftests/sched_ext: Improve runner error reporting for invalid arguments sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name sched_ext: Documentation: Add ops.dequeue() to task lifecycle tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing ...
2026-04-14	Merge tag 'sched-core-2026-04-13' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Fair scheduling updates: - Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle) - Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak) - Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak) - Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak) - Avoid overflow in enqueue_entity() (K Prateek Nayak) - Update overutilized detection (Vincent Guittot) - Prevent negative lag increase during delayed dequeue (Vincent Guittot) - Clear buddies for preempt_short (Vincent Guittot) - Implement more complex proportional newidle balance (Peter Zijlstra) - Increase weight bits for avg_vruntime (Peter Zijlstra) - Use full weight to __calc_delta() (Peter Zijlstra) RT and DL scheduling updates: - Fix incorrect schedstats for rt and dl thread (Dengjun Su) - Skip group schedulable check with rt_group_sched=0 (Michal Koutný) - Move group schedulability check to sched_rt_global_validate() (Michal Koutný) - Add reporting of runtime left & abs deadline to sched_getattr() for DEADLINE tasks (Tommaso Cucinotta) Scheduling topology updates by K Prateek Nayak: - Compute sd_weight considering cpuset partitions - Extract "imb_numa_nr" calculation into a separate helper - Allocate per-CPU sched_domain_shared in s_data - Switch to assigning "sd->shared" from s_data - Remove sched_domain_shared allocation with sd_data Energy-aware scheduling updates: - Filter false overloaded_group case for EAS (Vincent Guittot) - PM: EM: Switch to rcu_dereference_all() in wakeup path (Dietmar Eggemann) Infrastructure updates: - Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari) Proxy scheduling updates by John Stultz: - Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() - Minimise repeated sched_proxy_exec() checking - Fix potentially missing balancing with Proxy Exec - Fix and improve task::blocked_on et al handling - Add assert_balance_callbacks_empty() helper - Add logic to zap balancing callbacks if we pick again - Move attach_one_task() and attach_task() helpers to sched.h - Handle blocked-waiter migration (and return migration) - Add K Prateek Nayak to scheduler reviewers for proxy execution Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot" * tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) sched/eevdf: Clear buddies for preempt_short sched/rt: Cleanup global RT bandwidth functions sched/rt: Move group schedulability check to sched_rt_global_validate() sched/rt: Skip group schedulable check with rt_group_sched=0 sched/fair: Avoid overflow in enqueue_entity() sched: Use u64 for bandwidth ratio calculations sched/fair: Prevent negative lag increase during delayed dequeue sched/fair: Use sched_energy_enabled() sched: Handle blocked-waiter migration (and return migration) sched: Move attach_one_task and attach_task helpers to sched.h sched: Add logic to zap balance callbacks if we pick again sched: Add assert_balance_callbacks_empty helper sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration sched: Fix modifying donor->blocked on without proper locking locking: Add task::blocked_lock to serialize blocked_on state sched: Fix potentially missing balancing with Proxy Exec sched: Minimise repeated sched_proxy_exec() checking sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() MAINTAINERS: Add K Prateek Nayak to scheduler reviewers sched/core: Get this cpu once in ttwu_queue_cond() ...
2026-04-09	sched/cache: Track LLC-preferred tasks per runqueue	Tim Chen
	For each runqueue, track the number of tasks with an LLC preference and how many of them are running on their preferred LLC. This mirrors nr_numa_running and nr_preferred_running for NUMA balancing, and will be used by cache-aware load balancing in later patches. Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/459a37102f3d74a4e09ea58401d2094ac731d044.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Make LLC id continuous	Tim Chen
	Introduce an index mapping between CPUs and their LLCs. This provides a roughly continuous per LLC index needed for cache-aware load balancing in later patches. The existing per_cpu llc_id usually points to the first CPU of the LLC domain, which is sparse and unsuitable as an array index. Using llc_id directly would waste memory. With the new mapping, CPUs in the same LLC share an approximate continuous id: per_cpu(llc_id, CPU=0...15) = 0 per_cpu(llc_id, CPU=16...31) = 1 per_cpu(llc_id, CPU=32...47) = 2 ... Note that the LLC IDs are allocated via bitmask, so the IDs may be reused during CPU offline->online transitions. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Originally-by: K Prateek Nayak <kprateek.nayak@amd.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/047ef46339e4db497b54a89940a7ebedf27fcf28.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Introduce infrastructure for cache-aware load balancing	Peter Zijlstra (Intel)
	Adds infrastructure to enable cache-aware load balancing, which improves cache locality by grouping tasks that share resources within the same cache domain. This reduces cache misses and improves overall data access efficiency. In this initial implementation, threads belonging to the same process are treated as entities that likely share working sets. The mechanism tracks per-process CPU occupancy across cache domains and attempts to migrate threads toward cache-hot domains where their process already has active threads, thereby enhancing locality. This provides a basic model for cache affinity. While the current code targets the last-level cache (LLC), the approach could be extended to other domain types such as clusters (L2) or node-internal groupings. At present, the mechanism selects the CPU within an LLC that has the highest recent runtime. Subsequent patches in this series will use this information in the load-balancing path to guide task placement toward preferred LLCs. In the future, more advanced policies could be integrated through NUMA balancing-for example, migrating a task to its preferred LLC when spare capacity exists, or swapping tasks across LLCs to improve cache affinity. Grouping of tasks could also be generalized from that of a process to be that of a NUMA group, or be user configurable. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/6269a53221b9439b9ca00d18a9d1946fb64d8cff.1775065312.git.tim.c.chen@linux.intel.com
2026-04-07	sched: Use u64 for bandwidth ratio calculations	Joseph Salisbury
	to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and runtime values, but it returns unsigned long. tg_rt_schedulable() also stores the current group limit and the accumulated child sum in unsigned long. On 32-bit builds, large bandwidth ratios can be truncated and the RT group sum can wrap when enough siblings are present. That can let an overcommitted RT hierarchy pass the schedulability check, and it also narrows the helper result for other callers. Return u64 from to_ratio() and use u64 for the RT group totals so bandwidth ratios are preserved and compared at full width on both 32-bit and 64-bit builds. Fixes: b40b2e8eb521 ("sched: rt: multi level group constraints") Assisted-by: Codex:GPT-5 Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com
2026-04-03	sched: Handle blocked-waiter migration (and return migration)	John Stultz
	Add logic to handle migrating a blocked waiter to a remote cpu where the lock owner is runnable. Additionally, as the blocked task may not be able to run on the remote cpu, add logic to handle return migration once the waiting task is given the mutex. Because tasks may get migrated to where they cannot run, also modify the scheduling classes to avoid sched class migrations on mutex blocked tasks, leaving find_proxy_task() and related logic to do the migrations and return migrations. This was split out from the larger proxy patch, and significantly reworked. Credits for the original patch go to: Peter Zijlstra (Intel) <peterz@infradead.org> Juri Lelli <juri.lelli@redhat.com> Valentin Schneider <valentin.schneider@arm.com> Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-11-jstultz@google.com
2026-04-03	sched: Add logic to zap balance callbacks if we pick again	John Stultz
	With proxy-exec, a task is selected to run via pick_next_task(), and then if it is a mutex blocked task, we call find_proxy_task() to find a runnable owner. If the runnable owner is on another cpu, we will need to migrate the selected donor task away, after which we will pick_again can call pick_next_task() to choose something else. However, in the first call to pick_next_task(), we may have had a balance_callback setup by the class scheduler. After we pick again, its possible pick_next_task_fair() will be called which calls sched_balance_newidle() and sched_balance_rq(). This will throw a warning: [ 8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250 ... [ 8.796467] Call Trace: [ 8.796467] <TASK> [ 8.796467] ? __warn.cold+0xb2/0x14e [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] ? report_bug+0x107/0x1a0 [ 8.796467] ? handle_bug+0x54/0x90 [ 8.796467] ? exc_invalid_op+0x17/0x70 [ 8.796467] ? asm_exc_invalid_op+0x1a/0x20 [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] sched_balance_newidle+0x295/0x820 [ 8.796467] pick_next_task_fair+0x51/0x3f0 [ 8.796467] __schedule+0x23a/0x14b0 [ 8.796467] ? lock_release+0x16d/0x2e0 [ 8.796467] schedule+0x3d/0x150 [ 8.796467] worker_thread+0xb5/0x350 [ 8.796467] ? __pfx_worker_thread+0x10/0x10 [ 8.796467] kthread+0xee/0x120 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork+0x31/0x50 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork_asm+0x1a/0x30 [ 8.796467] </TASK> This is because if a RT task was originally picked, it will setup the rq->balance_callback with push_rt_tasks() via set_next_task_rt(). Once the task is migrated away and we pick again, we haven't processed any balance callbacks, so rq->balance_callback is not in the same state as it was the first time pick_next_task was called. To handle this, add a zap_balance_callbacks() helper function which cleans up the balance callbacks without running them. This should be ok, as we are effectively undoing the state set in the first call to pick_next_task(), and when we pick again, the new callback can be configured for the donor task actually selected. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-9-jstultz@google.com
2026-04-03	sched: Add assert_balance_callbacks_empty helper	John Stultz
	With proxy-exec utilizing pick-again logic, we can end up having balance callbacks set by the preivous pick_next_task() call left on the list. So pull the warning out into a helper function, and make sure we check it when we pick again. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-8-jstultz@google.com
2026-04-03	sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy ↵	John Stultz
	return-migration As we add functionality to proxy execution, we may migrate a donor task to a runqueue where it can't run due to cpu affinity. Thus, we must be careful to ensure we return-migrate the task back to a cpu in its cpumask when it becomes unblocked. Peter helpfully provided the following example with pictures: "Suppose we have a ww_mutex cycle: ,-+-* Mutex-1 <-. Task-A ---' \| \| ,-- Task-B `-> Mutex-2 -+-' Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and where Task-B holds Mutex-2 and tries to acquire Mutex-1. Then the blocked_on->owner chain will go in circles. Task-A -> Mutex-2 ^ \| \| v Mutex-1 <- Task-B We need two things: - find_proxy_task() to stop iterating the circle; - the woken task to 'unblock' and run, such that it can back-off and re-try the transaction. Now, the current code [without this patch] does: __clear_task_blocked_on(); wake_q_add(); And surely clearing ->blocked_on is sufficient to break the cycle. Suppose it is Task-B that is made to back-off, then we have: Task-A -> Mutex-2 -> Task-B (no further blocked_on) and it would attempt to run Task-B. Or worse, it could directly pick Task-B and run it, without ever getting into find_proxy_task(). Now, here is a problem because Task-B might not be runnable on the CPU it is currently on; and because !task_is_blocked() we don't get into the proxy paths, so nobody is going to fix this up. Ideally we would have dequeued Task-B alongside of clearing ->blocked_on, but alas, [the lock ordering prevents us from getting the task_rq_lock() and] spoils things." Thus we need more than just a binary concept of the task being blocked on a mutex or not. So allow setting blocked_on to PROXY_WAKING as a special value which specifies the task is no longer blocked, but needs to be evaluated for return migration before* it can be run. This will then be used in a later patch to handle proxy return-migration. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-7-jstultz@google.com
2026-04-03	sched: Fix modifying donor->blocked on without proper locking	John Stultz
	Introduce an action enum in find_proxy_task() which allows us to handle work needed to be done outside the mutex.wait_lock and task.blocked_lock guard scopes. This ensures proper locking when we clear the donor's blocked_on pointer in proxy_deactivate(), and the switch statement will be useful as we add more cases to handle later in this series. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-6-jstultz@google.com
2026-04-03	locking: Add task::blocked_lock to serialize blocked_on state	John Stultz
	So far, we have been able to utilize the mutex::wait_lock for serializing the blocked_on state, but when we move to proxying across runqueues, we will need to add more state and a way to serialize changes to this state in contexts where we don't hold the mutex::wait_lock. So introduce the task::blocked_lock, which nests under the mutex::wait_lock in the locking order, and rework the locking to use it. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com
2026-04-03	sched: Fix potentially missing balancing with Proxy Exec	John Stultz
	K Prateek pointed out that with Proxy Exec, we may have cases where we context switch in __schedule(), while the donor remains the same. This could cause balancing issues, since the put_prev_set_next() logic short-cuts if (prev == next). With proxy-exec prev is the previous donor, and next is the next donor. Should the donor remain the same, but different tasks are picked to actually run, the shortcut will have avoided enqueuing the sched class balance callback. So, if we are context switching, add logic to catch the same-donor case, and trigger the put_prev/set_next calls to ensure the balance callbacks get enqueued. Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/ Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-4-jstultz@google.com
2026-04-03	sched: Minimise repeated sched_proxy_exec() checking	John Stultz
	Peter noted: Compilers are really bad (as in they utterly refuse) optimizing (even when marked with __pure) the static branch things, and will happily emit multiple identical in a row. So pull out the one obvious sched_proxy_exec() branch in __schedule() and remove some of the 'implicit' ones in that path. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-3-jstultz@google.com
2026-04-03	sched: Make class_schedulers avoid pushing current, and get rid of ↵	John Stultz
	proxy_tag_curr() With proxy-execution, the scheduler selects the donor, but for blocked donors, we end up running the lock owner. This caused some complexity, because the class schedulers make sure to remove the task they pick from their pushable task lists, which prevents the donor from being migrated, but there wasn't then anything to prevent rq->curr from being migrated if rq->curr != rq->donor. This was sort of hacked around by calling proxy_tag_curr() on the rq->curr task if we were running something other then the donor. proxy_tag_curr() did a dequeue/enqueue pair on the rq->curr task, allowing the class schedulers to remove it from their pushable list. The dequeue/enqueue pair was wasteful, and additonally K Prateek highlighted that we didn't properly undo things when we stopped proxying, leaving the lock owner off the pushable list. After some alternative approaches were considered, Peter suggested just having the RT/DL classes just avoid migrating when task_on_cpu(). So rework pick_next_pushable_dl_task() and the rt pick_next_pushable_task() functions so that they skip over the first pushable task if it is on_cpu. Then just drop all of the proxy_tag_curr() logic. Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability") Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/ Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-2-jstultz@google.com
2026-03-31	sched: Add deadline tracepoints	Gabriele Monaco
	Add the following tracepoints: * sched_dl_throttle(dl_se, cpu, type): Called when a deadline entity is throttled * sched_dl_replenish(dl_se, cpu, type): Called when a deadline entity's runtime is replenished * sched_dl_update(dl_se, cpu, type): Called when a deadline entity updates without throttle or replenish * sched_dl_server_start(dl_se, cpu, type): Called when a deadline server is started * sched_dl_server_stop(dl_se, cpu, type): Called when a deadline server is stopped Those tracepoints can be useful to validate the deadline scheduler with RV and are not exported to tracefs. Reviewed-by: Phil Auld <pauld@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20260330111010.153663-11-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2026-03-24	sched/core: Get this cpu once in ttwu_queue_cond()	Shrikanth Hegde
	Calling smp_processor_id() on: - In CONFIG_DEBUG_PREEMPT=y, if preemption/irq is disabled, then it does not print any warning. - In CONFIG_DEBUG_PREEMPT=n, it doesn't do anything apart from getting __smp_processor_id So with both CONFIG_DEBUG_PREEMPT=y/n, in preemption disabled section it is better to cache the value. It could save a few cycles. Though tiny, repeated could add up to a small value. ttwu_queue_cond is called with interrupt disabled. So preemption is disabled. Hence cache the value once instead. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Link: https://patch.msgid.link/20260323193630.640311-3-sshegde@linux.ibm.com
2026-03-21	Merge tag 'v7.0-rc4' into timers/core, to resolve conflict	Ingo Molnar
	Resolve conflict between this change in the upstream kernel: 4c652a47722f ("rseq: Mark rseq_arm_slice_extension_timer() __always_inline") ... and this pending change in timers/core: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2026-03-17	Merge tag 'v7.0-rc4' into sched/core, to pick up scheduler fixes	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2026-03-11	sched/mmcid: Avoid full tasklist walks	Thomas Gleixner
	Chasing vfork()'ed tasks on a CID ownership mode switch requires a full task list walk, which is obviously expensive on large systems. Avoid that by keeping a list of tasks using a mm MMCID entity in mm::mm_cid and walk this list instead. This removes the proven to be flaky counting logic and avoids a full task list walk in the case of vfork()'ed tasks. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.183824481@kernel.org
2026-03-11	sched/mmcid: Remove pointless preempt guard	Thomas Gleixner
	This is a leftover from the early versions of this function where it could be invoked without mm::mm_cid::lock held. Remove it and add lockdep asserts instead. Fixes: 653fda7ae73d ("sched/mmcid: Switch over to the new mechanism") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.116363613@kernel.org