diff options
| author | Andrea Righi <arighi@nvidia.com> | 2026-02-18 09:32:17 +0100 |
|---|---|---|
| committer | Tejun Heo <tj@kernel.org> | 2026-02-23 10:01:18 -1000 |
| commit | ebf1ccff79c43f860cbd2f9d6cfab9a462d0cb2d (patch) | |
| tree | 3ce5147ba934fba51424e7441c8fe6c485ff115d /tools/sched_ext/include | |
| parent | 482bb06f83ab26cc055835eb7d94d615520e9de9 (diff) | |
| download | lwn-ebf1ccff79c43f860cbd2f9d6cfab9a462d0cb2d.tar.gz lwn-ebf1ccff79c43f860cbd2f9d6cfab9a462d0cb2d.zip | |
sched_ext: Fix ops.dequeue() semantics
Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.
Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).
BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures from ops.enqueue().
Custody ends when the task is dispatched to a terminal DSQ (such as the
local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed
due to a property change.
Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are never in its custody. Terminal DSQs include:
- Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
where tasks go directly to execution.
- Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
BPF scheduler is considered "done" with the task.
As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.
To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.
New ops.dequeue() semantics:
- ops.dequeue() is invoked exactly once when the task leaves the BPF
scheduler's custody, in one of the following cases:
a) regular dispatch: a task dispatched to a user DSQ or stored in
internal BPF data structures is moved to a terminal DSQ
(ops.dequeue() called without any special flags set),
b) core scheduling dispatch: core-sched picks task before dispatch
(ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
c) property change: task properties modified before dispatch,
(ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).
This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of managed tasks,
- update internal state when tasks change properties.
Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Diffstat (limited to 'tools/sched_ext/include')
| -rw-r--r-- | tools/sched_ext/include/scx/enum_defs.autogen.h | 1 | ||||
| -rw-r--r-- | tools/sched_ext/include/scx/enums.autogen.bpf.h | 2 | ||||
| -rw-r--r-- | tools/sched_ext/include/scx/enums.autogen.h | 1 |
3 files changed, 4 insertions, 0 deletions
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h index c2c33df9292c..dcc945304760 100644 --- a/tools/sched_ext/include/scx/enum_defs.autogen.h +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h @@ -21,6 +21,7 @@ #define HAVE_SCX_CPU_PREEMPT_UNKNOWN #define HAVE_SCX_DEQ_SLEEP #define HAVE_SCX_DEQ_CORE_SCHED_EXEC +#define HAVE_SCX_DEQ_SCHED_CHANGE #define HAVE_SCX_DSQ_FLAG_BUILTIN #define HAVE_SCX_DSQ_FLAG_LOCAL_ON #define HAVE_SCX_DSQ_INVALID diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h index 2f8002bcc19a..5da50f937684 100644 --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h index fedec938584b..fc9a7a4d9dea 100644 --- a/tools/sched_ext/include/scx/enums.autogen.h +++ b/tools/sched_ext/include/scx/enums.autogen.h @@ -46,4 +46,5 @@ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ } while (0) |
