lwn.git - Linux kernel documentation tree maintained by Jonathan Corbet

Age	Commit message (Collapse)	Author
7 days	Merge tag 'sched_ext-for-7.2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: "Most of this continues the in-development sub-scheduler support, which lets a root BPF scheduler delegate to nested sub-schedulers. The dispatch-path building blocks landed in 7.1. A follow-up patchset in development will complete enqueue-path support for hierarchical scheduling. This cycle adds most of that infrastructure: - Topological CPU IDs (cids): a dense, topology-ordered CPU numbering where the CPUs of a core, LLC, or NUMA node form contiguous ranges, so a topology unit becomes a (start, length) slice. Raw CPU numbers are sparse and don't track topological closeness, which makes them clumsy for sharding work across sub-schedulers and awkward in BPF. - cmask: bitmaps windowed over a slice of cid space, so a sub-scheduler can track, for example, the idle cids of its shard without a full NR_CPUS cpumask. - A struct_ops variant that cid-form sub-schedulers register with, along with the cid-form kfuncs they call. - BPF arena integration, which sub-scheduler support is built on. The bpf-next additions let the kernel read and write the BPF scheduler's arena directly, turning it into a real kernel/BPF shared-memory channel. Shared state like the per-CPU cmask now lives there. - scx_qmap is reworked to exercise the new arena and cid interfaces. Additionally: - Exit-dump improvements: dump the faulting CPU first, expose the exit CPU to BPF and userspace, and normalize the dump header. - Misc kfuncs and cleanups: a task-ID lookup kfunc, __printf checking on the error and dump formatters, header reorganization, and assorted fixes" * tag 'sched_ext-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (59 commits) sched_ext: Add scx_arena_to_kaddr() / scx_kaddr_to_arena() sched_ext: Make scx_bpf_kick_cid() return s32 sched_ext: Add scx_cmask_test() and scx_cmask_for_each_cid() tools/sched_ext: Order single-cid cmask helpers as (cid, mask) sched_ext: Order single-cid cmask helpers as (cid, mask) selftests/sched_ext: Fix dsq_move_to_local check sched_ext: Guard BPF arena helper calls to fix 32-bit build sched_ext: idle: Fix errno loss in scx_idle_init() sched_ext: Convert ops.set_cmask() to arena-resident cmask sched_ext: Sub-allocator over kernel-claimed BPF arena pages sched_ext: Require an arena for cid-form schedulers sched_ext: Add cmask mask ops sched_ext: Track bits[] storage size in struct scx_cmask sched_ext: Rename scx_cmask.nr_bits to nr_cids tools/sched_ext: scx_qmap: Fix qa arena placement sched_ext: Mark !CONFIG_EXT_SUB_SCHED dummy stubs static inline sched_ext: Replace tryget_task_struct() with get_task_struct() sched_ext: Add scx_task_iter_relock() and use it in scx_root_enable_workfn() sched_ext: Fix ops_cid layout assert sched_ext: Use offsetofend on both sides of the ops_cid layout assert ...
9 days	Merge tag 'sched-core-2026-06-14' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "SMP load-balancing updates: - A large series to introduce infrastructure for cache-aware load balancing, with the goal of co-locating tasks that share data within the same Last Level Cache (LLC) domain. By improving cache locality, the scheduler can reduce cache bouncing and cache misses, ultimately improving data access efficiency. Implemented by Chen Yu and Tim Chen, based on early prototype work by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and Shrikanth Hegde. - A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde) Fair scheduler updates: - A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing SMT awareness (Andrea Righi, K Prateek Nayak) - A series to optimize cfs_rq and sched_entity allocation for better data locality (Zecheng Li) - A preparatory series to change fair/cgroup scheduling to a single runqueue, without the final change (Peter Zijlstra) - Auto-manage ext/fair dl_server bandwidth (Andrea Righi) - Fix cpu_util runnable_avg arithmetic (Hongyan Xia) - Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel) - Allow account_cfs_rq_runtime() to throttle current hierarchy (K Prateek Nayak) - Update util_est after updating util_avg during dequeue, to fix the util signal update logic, which reduces signal noise (Vincent Guittot) Scheduler topology updates: - Allow multiple domains to claim sched_domain_shared (K Prateek Nayak) - Add parameter to split LLC (Peter Zijlstra) Core scheduler updates: - Use trace_call__<tp>() to save a static branch (Gabriele Monaco) Scheduler statistics updates: - Drop now-stale mul_u64_u64_div_u64() cputime over-approximation guard (Nicolas Pitre) Deadline scheduler updates: - Reject debugfs dl_server writes for offline CPUs (Andrea Righi) - Fix replenishment logic for non-deferred servers (Yuri Andriaccio) RT scheduling updates: - Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt) - Update default bandwidth for real-time tasks to 1.0 (Yuri Andriaccio) Proxy scheduling updates: - A series to implement Optimized Donor Migration for Proxy Execution (John Stultz, Peter Zijlstra) - Various proxy scheduling cleanups and fixes (Peter Zijlstra, K Prateek Nayak) Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi, Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde, Peter Zijlstra, Liang Luo and Yiyang Chen" * tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits) sched/fair: Fix newidle vs core-sched sched/deadline: Use task_on_rq_migrating() helper sched/core: Combine separate 'else' and 'if' statements sched/fair: Fix cpu_util runnable_avg arithmetic sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() sched/fair: Call update_curr() before unthrottling the hierarchy sched/fair: Use throttled_csd_list for local unthrottle sched/fair: Convert cfs bandwidth throttling to use guards sched/fair: Allocate cfs_tg_state with percpu allocator sched/fair: Remove task_group->se pointer array sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state sched: restore timer_slack_ns when resetting RT policy on fork MAINTAINERS: Fix spelling mistake in Peter's name sched: Simplify ttwu_runnable() sched/proxy: Remove superfluous clear_task_blocked_in() sched/proxy: Remove PROXY_WAKING sched/proxy: Switch proxy to use p->is_blocked sched/proxy: Only return migrate when needed sched: Be more strict about p->is_blocked ...
2026-06-01	selftests/sched_ext: Fix dsq_move_to_local check	Cheng-Yang Chou
	scan_dsq_pool() checked == 0 against scx_bpf_dsq_move_to_local(), which returns true on success. This inverted success and failure, causing peek_dsq_dispatch() to double-dispatch on success and skip the real_dsq fallback on failure. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-29	selftests/sched_ext: Validate dl_server attach/detach in total_bw test	Andrea Righi
	Extend the total_bw selftest to validate the fair/ext dl_server auto-attach/detach operations. After the existing consistency checks, the test now doubles the fair_server's runtime on every CPU via debugfs and verifies that: 1. total_bw grew after the customization (proves fair_server was attached and apply_params() honored the dl_bw_attached flag), 2. with the minimal BPF scheduler loaded, total_bw drops back to the baseline value (proves fair_server was detached and ext_server was attached at its own default runtime), 3. after unload total_bw matches the doubled value from step 1 (proves fair_server was re-attached with the runtime customization preserved across the load/unload cycle). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260526164420.638711-3-arighi@nvidia.com
2026-05-11	Merge branch 'for-7.1-fixes' into for-7.2	Tejun Heo
	Pull to receive: 9a415cc53711 ("sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path") Conflicts with for-7.2's scx_task_iter_relock() rework. The fix moves put_task_struct(p) past scx_error(); for-7.2 still has it at the old position. Resolved by dropping the old one. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10	selftests/sched_ext: Fix build error in dequeue selftest	Andrea Righi
	Building the dequeue selftest with newer compilers (e.g., gcc 16) triggers the following error: dequeue.c:28:22: error: variable 'sum' set but not used The 'volatile' qualifier prevents the writes from being optimized away, but does not silence the unused variable 'sum' is indeed only written and never read. Consume 'sum' via an empty asm() with a register input constraint. This forces the compiler to keep the accumulated value (preserving the CPU stress loop) and avoiding the build error. Fixes: 658ad2259b3e ("selftests/sched_ext: Add test to validate ops.dequeue() semantics") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-08	selftests/sched_ext: Fix select_cpu_dfl link leak on early return	Cheng-Yang Chou
	If run() exits early via SCX_EQ/SCX_ASSERT (which calls return directly), bpf_link__destroy() is never reached and the BPF scheduler stays loaded. All subsequent tests then fail to attach because SCX is not in the DISABLED state. Move bpf_link into a context struct so cleanup() always destroys it, regardless of how run() exits. Also skip waitpid() for children where fork() returned -1, avoiding waitpid(-1,...) accidentally reaping an unrelated child and triggering the early return path. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-23	selftests/sched_ext: Include common.bpf.h to avoid build failure	Zhao Mengmeng
	In scx-cid patchsets, sched_ext selftest failed to build with following error: non_scx_kfunc_deny.bpf.c:17:6: error: conflicting types for 'scx_bpf_kick_cpu' 17 \| void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym; \| ^ tools/testing/selftests/sched_ext/build/include/vmlinux.h:136300:13: note: previous declaration is here 136300 \| extern void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux aux) __weak __ksym; \| ^ non_scx_kfunc_deny.bpf.c:26:23: error: too few arguments to function call, expected 3, have 2 26 \| scx_bpf_kick_cpu(0, 0); \| ~~~~~~~~~~~~~~~~ ^ tools/testing/selftests/sched_ext/build/include/vmlinux.h:136300:13: note: 'scx_bpf_kick_cpu' declared here 136300 \| extern void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux aux) __weak __ksym; The root cause is on scx core part, but we can avoid this by including common.bpf.h and remove scx_bpf_kick_cpu() to make it more robust, just like the usage in other xx.bpf.c. Link: https://lore.kernel.org/sched-ext/20260421071945.3110084-1-tj@kernel.org/ Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Tested-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-20	selftests/sched_ext: Add non_scx_kfunc_deny test	Cheng-Yang Chou
	Verify that the BPF verifier rejects a non-SCX struct_ops program (tcp_congestion_ops) that attempts to call an SCX kfunc (scx_bpf_kick_cpu). The test expects the load to fail with -EACCES from scx_kfunc_context_filter. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-09	selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message	fangqiurong
	The error path after scx_bpf_create_dsq(real_dsq_id, ...) was reporting test_dsq_id instead of real_dsq_id in the error message, which would mislead debugging. Signed-off-by: fangqiurong <fangqiurong@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-08	selftests/sched_ext: Improve runner error reporting for invalid arguments	Cheng-Yang Chou
	Report an error for './runner foo' (positional arg instead of -t) and for './runner -t foo' when the filter matches no tests. Previously both cases produced no error output. Pre-scan the test list before the main loop so the error is reported immediately, avoiding spurious SKIP output from '-s' when no tests match. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-30	Merge branch 'for-7.0-fixes' into for-7.1	Tejun Heo
	Conflict in kernel/sched/ext.c init_sched_ext_class() between: 415cb193bb97 ("sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback") which adds cpus_to_sync cpumask allocation, and: 84b1a0ea0b7c ("sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs") 8c1b9453fde6 ("sched_ext: Convert deferred_reenq_locals from llist to regular list") which add deferred_reenq init code at the same location. Both are independent additions. Include both. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-30	selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test	Tejun Heo
	Add a test that creates a 3-CPU kick_wait cycle (A->B->C->A). A BPF scheduler kicks the next CPU in the ring with SCX_KICK_WAIT on every enqueue while userspace workers generate continuous scheduling churn via sched_yield(). Without the preceding fix, this hangs the machine within seconds. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Tested-by: Christian Loehle <christian.loehle@arm.com>
2026-03-26	Revert "selftests/sched_ext: Add tests for SCX_ENQ_IMMED and ↵	Tejun Heo
	scx_bpf_dsq_reenq()" This reverts commit c50dcf533149. The tests are superficial, likely AI-generated slop, and flaky. They don't add actual value and just churn the selftests. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-24	selftests/sched_ext: Skip rt_stall on older kernels and list skipped tests	Cheng-Yang Chou
	rt_stall tests the ext DL server which was introduced in commit cd959a356205 ("sched_ext: Add a DL server for sched_ext tasks"). On older kernels that lack this feature, the test calls ksft_exit_fail() internally which terminates the entire runner process, preventing subsequent tests from running. Add a guard in setup() that checks for the ext_server field in struct rq via __COMPAT_struct_has_field() and returns SCX_TEST_SKIP if not present. Also print the names of skipped tests in the results summary, mirroring the existing behavior for failed tests. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-22	selftests/sched_ext: Add tests for SCX_ENQ_IMMED and scx_bpf_dsq_reenq()	zhidao su
	Add three selftests covering features introduced in v7.1: - dsq_reenq: Verify scx_bpf_dsq_reenq() on user DSQs triggers ops.enqueue() with SCX_ENQ_REENQ and SCX_TASK_REENQ_KFUNC in p->scx.flags. - enq_immed: Verify SCX_OPS_ALWAYS_ENQ_IMMED slow path where tasks dispatched to a busy CPU's local DSQ are re-enqueued through ops.enqueue() with SCX_TASK_REENQ_IMMED. - consume_immed: Verify SCX_ENQ_IMMED via the consume path using scx_bpf_dsq_move_to_local___v2() with explicit SCX_ENQ_IMMED. All three tests skip gracefully on kernels that predate the required features by checking availability via __COMPAT_has_ksym() / __COMPAT_read_enum() before loading. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-21	selftests/sched_ext: Return non-zero exit code on test failure	zhidao su
	runner.c always returned 0 regardless of test results. The kselftest framework (tools/testing/selftests/kselftest/runner.sh) invokes the runner binary and treats a non-zero exit code as a test failure; with the old code, failed sched_ext tests were silently hidden from the parent harness even though individual "not ok" TAP lines were emitted. Return 1 when at least one test failed, 0 when all tests passed or were skipped. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-17	selftests/sched_ext: Show failed test names in summary	Cheng-Yang Chou
	When tests fail, the runner only printed the failure count, making it hard to tell which tests failed without scrolling through output. Track failed test names in an array and print them after the summary so failures are immediately visible at the end of the run. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-14	sched_ext: Update selftests to drop ops.cpu_acquire/release()	Cheng-Yang Chou
	ops.cpu_acquire/release() are deprecated by commit a3f5d4822253 ("sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere") in favor of handling CPU preemption via the sched_switch tracepoint. In the maximal selftest, replace the cpu_acquire/release stubs with a minimal sched_switch TP program. Attach all non-struct_ops programs (including the new TP) via maximal__attach() after disabling auto-attach for the maximal_ops struct_ops map, which is managed manually in run(). Apply the same fix to reload_loop, which also uses the maximal skeleton. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-14	sched_ext: Update demo schedulers and selftests to use ↵	Cheng-Yang Chou
	scx_bpf_task_set_dsq_vtime() Direct writes to p->scx.dsq_vtime are deprecated in favor of scx_bpf_task_set_dsq_vtime(). Update scx_simple, scx_flatcg, and select_cpu_vtime selftest to use the new kfunc with scale_by_task_weight_inverse(). Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13	sched_ext/selftests: Fix incorrect include guard comments	Cheng-Yang Chou
	Fix two mismatched closing comments in header include guards: - util.h: closing comment says __SCX_TEST_H__ but the guard is __SCX_TEST_UTIL_H__ - exit_test.h: closing comment has a spurious '#' character before the guard name Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13	selftests/sched_ext: Update scx_bpf_dsq_move_to_local() in kselftests	Andrea Righi
	After commit 860683763ebf ("sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local()") some of the kselftests are failing to build: exit.bpf.c:44:34: error: too few arguments provided to function-like macro invocation 44 \| scx_bpf_dsq_move_to_local(DSQ_ID); Update the kselftests adding the new argument to scx_bpf_dsq_move_to_local(). Fixes: 860683763ebf ("sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13	selftests/sched_ext: Add missing error check for exit__load()	David Carlier
	exit__load(skel) was called without checking its return value. Every other test in the suite wraps the load call with SCX_FAIL_IF(). Add the missing check to be consistent with the rest of the test suite. Fixes: a5db7817af78 ("sched_ext: Add selftests") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-11	sched_ext: Fix incomplete help text usage strings	Cheng-Yang Chou
	Several demo schedulers and the selftest runner had usage strings that omitted options which are actually supported: - scx_central: add missing [-v] - scx_pair: add missing [-v] - scx_qmap: add missing [-S] and [-H] - scx_userland: add missing [-v] - scx_sdt: remove [-f] which no longer exists - runner.c: add missing [-s], [-l], [-q]; drop [-h] which none of the other sched_ext tools list in their usage lines Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06	Merge branch 'for-7.0-fixes' into for-7.1	Tejun Heo
	To prepare for hierarchical scheduling patchset which will cause multiple conflicts otherwise. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-04	sched_ext/selftests: Fix format specifier and buffer length in file_write_long()	Cheng-Yang Chou
	Use %ld (not %lu) for signed long, and pass the actual string length returned by sprintf() to write_text() instead of sizeof(buf). Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02	selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17	Zhao Mengmeng
	When compiling sched_ext selftests using clang 17.0.6, it raised compiler crash and build error: Error at line 68: Unsupport signed division for DAG: 0x55b2f9a60240: i64 = sdiv 0x55b2f9a609b0, Constant:i64<100>, peek_dsq.bpf.c:68:25 @[ peek_dsq.bpf.c:95:4 @[ peek_dsq.bpf.c:169:8 @[ peek _dsq.bpf.c:140:6 ] ] ]Please convert to unsigned div/mod After digging, it's not a compiler error, clang supported Signed division only when using -mcpu=v4, while we use -mcpu=v3 currently, the better way is to use unsigned div, see [1] for details. [1] https://github.com/llvm/llvm-project/issues/70433 Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02	selftests/sched_ext: Add -fms-extensions to bpf build flags	Zhao Mengmeng
	Similar to commit 835a50753579 ("selftests/bpf: Add -fms-extensions to bpf build flags") and commit 639f58a0f480 ("bpftool: Fix build warnings due to MS extensions") Fix "declaration does not declare anything" warning by using -fms-extensions and -Wno-microsoft-anon-tag flags to build bpf programs that #include "vmlinux.h" Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23	selftests/sched_ext: Add test to validate ops.dequeue() semantics	Andrea Righi
	Add a new kselftest to validate that the new ops.dequeue() semantics work correctly for all task lifecycle scenarios, including the distinction between terminal DSQs (where BPF scheduler is done with the task), user DSQs (where BPF scheduler manages the task lifecycle) and BPF data structures, regardless of which event performs the dispatch. The test validates the following scenarios: - From ops.select_cpu(): - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass the BPF scheduler entirely; they never enter BPF custody, so ops.dequeue() is not called, - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also bypass the BPF scheduler, like the local DSQ; ops.dequeue() is not called, - scenario 2 (user DSQ): tasks dispatched to user DSQs from ops.select_cpu(): tasks enter BPF scheduler's custody with full enqueue/dequeue lifecycle tracking and state machine validation, expects 1:1 enqueue/dequeue pairing, - From ops.enqueue(): - scenario 3 (local DSQ): same behavior as scenario 0, - scenario 4 (global DSQ): same behavior as scenario 1, - scenario 5 (user DSQ): same behavior as scenario 2, - scenario 6 (BPF internal queue): tasks are stored in a BPF queue from ops.enqueue() and consumed from ops.dispatch(); similarly to scenario 5, tasks enter BPF scheduler's custody with full lifecycle tracking and 1:1 enqueue/dequeue validation. This verifies that: - terminal DSQ dispatch (local, global) don't trigger ops.dequeue(), - tasks dispatched to user DSQs, either from ops.select_cpu() or ops.enqueue(), enter BPF scheduler's custody and have exact 1:1 enqueue/dequeue pairing, - tasks stored to internal BPF data structures from ops.enqueue() enter BPF scheduler's custody and have exact 1:1 enqueue/dequeue pairing, - dispatch dequeues have no flags (normal workflow), - property change dequeues have the %SCX_DEQ_SCHED_CHANGE flag set, - no duplicate enqueues or invalid state transitions are happening. Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Kuba Piecuch <jpiecuch@google.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23	selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c	Cheng-Yang Chou
	The header <unistd.h> is included twice in rt_stall.c. Remove the redundant inclusion to clean up the code. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23	selftests/sched_ext: Fix unused-result warning for read()	Cheng-Yang Chou
	The read() call in run_test() triggers a warn_unused_result compiler warning, which breaks the build under -Werror. Check the return value of read() and exit the child process on failure to satisfy the compiler and handle pipe read errors. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23	selftests/sched_ext: Abort test loop on signal	Cheng-Yang Chou
	The runner sets exit_req on SIGINT/SIGTERM but ignores it during the main loop. This prevents users from cleanly interrupting a test run. Check exit_req each iteration to safely break out on exit signals. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-13	selftests/sched_ext: Fix rt_stall flaky failure	Ihor Solodrai
	The rt_stall test measures the runtime ratio between an EXT and an RT task pinned to the same CPU, verifying that the deadline server prevents RT tasks from starving SCHED_EXT tasks. It expects the EXT task to get at least 4% of CPU time. The test is flaky because sched_stress_test() calls sleep(RUN_TIME) immediately after fork(), without waiting for the RT child to complete its setup (set_affinity + set_sched). If the RT child experiences scheduling latency before completing setup, that delay eats into the measurement window: the RT child runs for less than RUN_TIME seconds, and the EXT task's measured ratio drops below the 4% threshold. For example, in the failing CI run [1]: EXT=0.140s RT=4.750s total=4.890s (expected ~5.0s) ratio=2.86% < 4% → FAIL The 110ms gap (5.0 - 4.89) corresponds to the RT child's setup time being counted inside the measurement window, during which fewer deadline server ticks fire for the EXT task. Fix by using pipes to synchronize: each child signals the parent after completing its setup, and the parent waits for both signals before starting sleep(RUN_TIME). This ensures the measurement window only counts time when both tasks are fully configured and competing. [1] https://github.com/kernel-patches/bpf/actions/runs/21961895809/job/63442490449 Fixes: be621a76341c ("selftests/sched_ext: Add test for sched_ext dl_server") Assisted-by: claude-opus-4-6-v1 Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-11	Merge tag 'sched_ext-for-6.20' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Move C example schedulers back from the external scx repo to tools/sched_ext as the authoritative source. scx_userland and scx_pair are returning while scx_sdt (BPF arena-based task data management) is new. These schedulers will be dropped from the external repo. - Improve error reporting by adding scx_bpf_error() calls when DSQ creation fails across all in-tree schedulers - Avoid redundant irq_work_queue() calls in destroy_dsq() by only queueing when llist_add() indicates an empty list - Fix flaky init_enable_count selftest by properly synchronizing pre-forked children using a pipe instead of sleep() * tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: selftests/sched_ext: Fix init_enable_count flakiness tools/sched_ext: Fix data header access during free in scx_sdt tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers tools/sched_ext: add arena based scheduler tools/sched_ext: add scx_pair scheduler tools/sched_ext: add scx_userland scheduler sched_ext: Add error logging for dsq creation failures sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()
2026-02-03	selftests/sched_ext: Add test for DL server total_bw consistency	Joel Fernandes
	Add a new kselftest to verify that the total_bw value in /sys/kernel/debug/sched/debug remains consistent across all CPUs under different sched_ext BPF program states: 1. Before a BPF scheduler is loaded 2. While a BPF scheduler is loaded and active 3. After a BPF scheduler is unloaded The test runs CPU stress threads to ensure DL server bandwidth values stabilize before checking consistency. This helps catch potential issues with DL server bandwidth accounting during sched_ext transitions. Co-developed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260126100050.3854740-8-arighi@nvidia.com
2026-02-03	selftests/sched_ext: Add test for sched_ext dl_server	Andrea Righi
	Add a selftest to validate the correct behavior of the deadline server for the ext_sched_class. Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Tested-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260126100050.3854740-7-arighi@nvidia.com
2026-02-02	selftests/sched_ext: Fix init_enable_count flakiness	Tejun Heo
	The init_enable_count test is flaky. The test forks 1024 children before attaching the scheduler to verify that existing tasks get ops.init_task() called. The children were using sleep(1) before exiting. 7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()") changed when tasks are removed from scx_tasks - previously when the task_struct was freed, now immediately in finish_task_switch() when the task dies. Before the commit, pre-forked children would linger on scx_tasks until freed regardless of when they exited, so the scheduler would always see them during iteration. The sleep(1) was unnecessary. After the commit, children are removed as soon as they die. The sleep(1) masks the problem in most cases but the test becomes flaky depending on timing. Fix by synchronizing properly using a pipe. All children block on read() and the parent signals them to exit by closing the write end after attaching the scheduler. The children are auto-reaped so there's no need to wait on them. Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev> Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-12	selftests/sched_ext: flush stdout before test to avoid log spam	Emil Tsalapatis
	The sched_ext selftests runner runs each test in the same process, with each test possibly forking multiple times. When the main runner has not flushed its stdout, the children inherit the buffered output for previous tests and emit it during exit. This causes log spam. Make sure stdout/stderr is fully flushed before each test. Cc: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-15	sched_ext: Add a selftest for scx_bpf_dsq_peek	Ryan Newton
	This commit adds two tests. The first is the most basic unit test: make sure an empty queue peeks as empty, and when we put one element in the queue, make sure peek returns that element. However, even this simple test is a little complicated by the different behavior of scx_bpf_dsq_insert in different calling contexts: - insert is for direct dispatch in enqueue - insert is delayed when called from select_cpu In this case we split the insert and the peek that verifies the result between enqueue/dispatch. Note: An alternative would be to call `scx_bpf_dsq_move_to_local` on an empty queue, which in turn calls `flush_dispatch_buf`, in order to flush the buffered insert. Unfortunately, this is not viable within the enqueue path, as it attempts a voluntary context switch within an RCU read-side critical section. The second test is a stress test that performs many peeks on all DSQs and records the observed tasks. Signed-off-by: Ryan Newton <newton@meta.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-11	selftests/sched_ext: Remove duplicate sched.h header	Jiapeng Chong
	./tools/testing/selftests/sched_ext/hotplug.c: sched.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22941 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-31	Merge tag 'sched_ext-for-6.17' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Add support for cgroup "cpu.max" interface - Code organization cleanup so that ext_idle.c doesn't depend on the source-file-inclusion build method of sched/ - Drop UP paths in accordance with sched core changes - Documentation and other misc changes * tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix scx_bpf_reenqueue_local() reference sched_ext: Drop kfuncs marked for removal in 6.15 sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments sched_ext: Add support for cgroup bandwidth control interface sched_ext, sched/core: Factor out struct scx_task_group sched_ext: Return NULL in llc_span sched_ext: Always use SMP versions in kernel/sched/ext_idle.h sched_ext: Always use SMP versions in kernel/sched/ext_idle.c sched_ext: Always use SMP versions in kernel/sched/ext.h sched_ext: Always use SMP versions in kernel/sched/ext.c sched_ext: Documentation: Clarify time slice handling in task lifecycle sched_ext: Make scx_locked_rq() inline sched_ext: Make scx_rq_bypassing() inline sched_ext: idle: Make local functions static in ext_idle.c sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
2025-07-03	selftests/sched_ext: Fix exit selftest hang on UP	Andrea Righi
	On single-CPU systems, ops.select_cpu() is never called, causing the EXIT_SELECT_CPU test case to wait indefinitely. Avoid the stall by skipping this specific sub-test when only one CPU is available. Reported-by: Phil Auld <pauld@redhat.com> Fixes: a5db7817af780 ("sched_ext: Add selftests") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Phil Auld <pauld@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20	sched_ext: Add support for cgroup bandwidth control interface	Tejun Heo
	From 077814f57f8acce13f91dc34bbd2b7e4911fbf25 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Fri, 13 Jun 2025 15:06:47 -1000 - Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED. - Put bandwidth control interface files for both cgroup v1 and v2 under CONFIG_GROUP_SCHED_BANDWIDTH. - Update tg_bandwidth() to fetch configuration parameters from fair if CONFIG_CFS_BANDWIDTH, SCX otherwise. - Update tg_set_bandwidth() to update the parameters for both fair and SCX. - Add bandwidth control parameters to struct scx_cgroup_init_args. - Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth control parameter updates. - Update scx_qmap and maximal selftest to test the new feature. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-21	selftests/sched_ext: Update test enq_select_cpu_fails	Andrea Righi
	With commit 08699d20467b6 ("sched_ext: idle: Consolidate default idle CPU selection kfuncs") allowing scx_bpf_select_cpu_dfl() to be invoked from multiple contexts, update the test to validate that the kfunc behaves correctly when used from ops.enqueue() and via BPF test_run. Additionally, rename the test to enq_select_cpu, dropping "fails" from the name, as the logic has now been inverted. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20	selftests/sched_ext: Add test for scx_bpf_select_cpu_and() via test_run	Andrea Righi
	Update the allowed_cpus selftest to include a check to validate the behavior of scx_bpf_select_cpu_and() when invoked via a BPF test_run call. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-07	selftests/sched_ext: Add test for scx_bpf_select_cpu_and()	Andrea Righi
	Add a selftest to validate the behavior of the built-in idle CPU selection policy applied to a subset of allowed CPUs, using scx_bpf_select_cpu_and(). Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-24	Merge tag 'sched-core-2025-03-22' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Cancel the slice protection of the idle entity (Zihan Zhou) - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou) - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding) - Refactor can_migrate_task() to elimate looping (I Hsin Cheng) - Add unlikey branch hints to several system calls (Colin Ian King) - Optimize current_clr_polling() on certain architectures (Yujun Dong) Deadline scheduler: (Juri Lelli) - Remove redundant dl_clear_root_domain call - Move dl_rebuild_rd_accounting to cpuset.h Uclamp: - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan) - Optimize sched_uclamp_used static key enabling (Xuewen Yan) Scheduler topology support: (Juri Lelli) - Ignore special tasks when rebuilding domains - Add wrappers for sched_domains_mutex - Generalize unique visiting of root domains - Rebuild root domain accounting after every update - Remove partition_and_rebuild_sched_domains - Stop exposing partition_sched_domains_locked RSEQ: (Michael Jeanson) - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y - Fix segfault on registration when rseq_cs is non-zero - selftests: Add rseq syscall errors test - selftests: Ensure the rseq ABI TLS is actually 1024 bytes Membarriers: - Fix redundant load of membarrier_state (Nysal Jan K.A.) Scheduler debugging: - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior) - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar) Fixes and cleanups: - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli) - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)" * tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling() sched/debug: Remove CONFIG_SCHED_DEBUG sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional sched/debug: Make 'const_debug' tunables unconditional __read_mostly sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE() rseq/selftests: Fix namespace collision with rseq UAPI header include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h sched/topology: Stop exposing partition_sched_domains_locked cgroup/cpuset: Remove partition_and_rebuild_sched_domains sched/topology: Remove redundant dl_clear_root_domain call sched/deadline: Rebuild root domain accounting after every update sched/deadline: Generalize unique visiting of root domains sched/topology: Wrappers for sched_domains_mutex sched/deadline: Ignore special tasks when rebuilding domains tracing: Use preempt_model_str() xtensa: Rely on generic printing of preemption model x86: Rely on generic printing of preemption model s390: Rely on generic printing of preemption model ...
2025-03-19	sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files	Ingo Molnar
	We leave most of the defconfigs alone (there's over 70 of them), but let's remove CONFIG_SCHED_DEBUG from the scheduler self-test Kconfig files. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/Z9szt3MpQmQ56TRd@gmail.com
2025-03-03	sched_ext: Merge branch 'for-6.14-fixes' into for-6.15	Tejun Heo
	Pull for-6.14-fixes to receive: 9360dfe4cbd6 ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()") which conflicts with: 337d1b354a29 ("sched_ext: Move built-in idle CPU selection policy to a separate file") Signed-off-by: Tejun Heo <tj@kernel.org>
2025-02-26	selftests/sched_ext: Add NUMA-aware scheduler test	Andrea Righi
	Add a selftest to validate the behavior of the NUMA-aware scheduler functionalities, including idle CPU selection within nodes, per-node DSQs and CPU to node mapping. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>