diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2026-04-14 13:33:36 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2026-04-14 13:33:36 -0700 |
| commit | 1c3b68f0d55b5932eb38eda602a61aec6d6f5e5e (patch) | |
| tree | ec22e8344526e4f2968507472f3f578429392dd4 /kernel/sched/fair.c | |
| parent | 33c66eb5e9844429911bf5478c96c60f9f8af9d0 (diff) | |
| parent | 78cde54ea5f03398f1cf6656de2472068f6da966 (diff) | |
| download | lwn-1c3b68f0d55b5932eb38eda602a61aec6d6f5e5e.tar.gz lwn-1c3b68f0d55b5932eb38eda602a61aec6d6f5e5e.zip | |
Merge tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduling updates:
- Skip SCHED_IDLE rq for SCHED_IDLE tasks (Christian Loehle)
- Remove superfluous rcu_read_lock() in the wakeup path (K Prateek Nayak)
- Simplify the entry condition for update_idle_cpu_scan() (K Prateek Nayak)
- Simplify SIS_UTIL handling in select_idle_cpu() (K Prateek Nayak)
- Avoid overflow in enqueue_entity() (K Prateek Nayak)
- Update overutilized detection (Vincent Guittot)
- Prevent negative lag increase during delayed dequeue (Vincent Guittot)
- Clear buddies for preempt_short (Vincent Guittot)
- Implement more complex proportional newidle balance (Peter Zijlstra)
- Increase weight bits for avg_vruntime (Peter Zijlstra)
- Use full weight to __calc_delta() (Peter Zijlstra)
RT and DL scheduling updates:
- Fix incorrect schedstats for rt and dl thread (Dengjun Su)
- Skip group schedulable check with rt_group_sched=0 (Michal Koutný)
- Move group schedulability check to sched_rt_global_validate()
(Michal Koutný)
- Add reporting of runtime left & abs deadline to sched_getattr()
for DEADLINE tasks (Tommaso Cucinotta)
Scheduling topology updates by K Prateek Nayak:
- Compute sd_weight considering cpuset partitions
- Extract "imb_numa_nr" calculation into a separate helper
- Allocate per-CPU sched_domain_shared in s_data
- Switch to assigning "sd->shared" from s_data
- Remove sched_domain_shared allocation with sd_data
Energy-aware scheduling updates:
- Filter false overloaded_group case for EAS (Vincent Guittot)
- PM: EM: Switch to rcu_dereference_all() in wakeup path
(Dietmar Eggemann)
Infrastructure updates:
- Replace use of system_unbound_wq with system_dfl_wq (Marco Crivellari)
Proxy scheduling updates by John Stultz:
- Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
- Minimise repeated sched_proxy_exec() checking
- Fix potentially missing balancing with Proxy Exec
- Fix and improve task::blocked_on et al handling
- Add assert_balance_callbacks_empty() helper
- Add logic to zap balancing callbacks if we pick again
- Move attach_one_task() and attach_task() helpers to sched.h
- Handle blocked-waiter migration (and return migration)
- Add K Prateek Nayak to scheduler reviewers for proxy execution
Misc cleanups and fixes by John Stultz, Joseph Salisbury, Peter
Zijlstra, K Prateek Nayak, Michal Koutný, Randy Dunlap, Shrikanth
Hegde, Vincent Guittot, Zhan Xusheng, Xie Yuanbin and Vincent Guittot"
* tag 'sched-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
sched/eevdf: Clear buddies for preempt_short
sched/rt: Cleanup global RT bandwidth functions
sched/rt: Move group schedulability check to sched_rt_global_validate()
sched/rt: Skip group schedulable check with rt_group_sched=0
sched/fair: Avoid overflow in enqueue_entity()
sched: Use u64 for bandwidth ratio calculations
sched/fair: Prevent negative lag increase during delayed dequeue
sched/fair: Use sched_energy_enabled()
sched: Handle blocked-waiter migration (and return migration)
sched: Move attach_one_task and attach_task helpers to sched.h
sched: Add logic to zap balance callbacks if we pick again
sched: Add assert_balance_callbacks_empty helper
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
sched: Fix modifying donor->blocked on without proper locking
locking: Add task::blocked_lock to serialize blocked_on state
sched: Fix potentially missing balancing with Proxy Exec
sched: Minimise repeated sched_proxy_exec() checking
sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
MAINTAINERS: Add K Prateek Nayak to scheduler reviewers
sched/core: Get this cpu once in ttwu_queue_cond()
...
Diffstat (limited to 'kernel/sched/fair.c')
| -rw-r--r-- | kernel/sched/fair.c | 513 |
1 files changed, 362 insertions, 151 deletions
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2be80780ff51..69361c63353a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -225,6 +225,7 @@ void __init sched_init_granularity(void) update_sysctl(); } +#ifndef CONFIG_64BIT #define WMULT_CONST (~0U) #define WMULT_SHIFT 32 @@ -283,6 +284,12 @@ static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight return mul_u64_u32_shr(delta_exec, fact, shift); } +#else +static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw) +{ + return (delta_exec * weight) / lw->weight; +} +#endif /* * delta /= w @@ -665,25 +672,83 @@ static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se) * Since zero_vruntime closely tracks the per-task service, these * deltas: (v_i - v0), will be in the order of the maximal (virtual) lag * induced in the system due to quantisation. - * - * Also, we use scale_load_down() to reduce the size. - * - * As measured, the max (key * weight) value was ~44 bits for a kernel build. */ +static inline unsigned long avg_vruntime_weight(struct cfs_rq *cfs_rq, unsigned long w) +{ +#ifdef CONFIG_64BIT + if (cfs_rq->sum_shift) + w = max(2UL, w >> cfs_rq->sum_shift); +#endif + return w; +} + +static inline void +__sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ + unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight); + s64 w_vruntime, key = entity_key(cfs_rq, se); + + w_vruntime = key * weight; + WARN_ON_ONCE((w_vruntime >> 63) != (w_vruntime >> 62)); + + cfs_rq->sum_w_vruntime += w_vruntime; + cfs_rq->sum_weight += weight; +} + static void -sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) +sum_w_vruntime_add_paranoid(struct cfs_rq *cfs_rq, struct sched_entity *se) { - unsigned long weight = scale_load_down(se->load.weight); - s64 key = entity_key(cfs_rq, se); + unsigned long weight; + s64 key, tmp; + +again: + weight = avg_vruntime_weight(cfs_rq, se->load.weight); + key = entity_key(cfs_rq, se); + + if (check_mul_overflow(key, weight, &key)) + goto overflow; + + if (check_add_overflow(cfs_rq->sum_w_vruntime, key, &tmp)) + goto overflow; - cfs_rq->sum_w_vruntime += key * weight; + cfs_rq->sum_w_vruntime = tmp; cfs_rq->sum_weight += weight; + return; + +overflow: + /* + * There's gotta be a limit -- if we're still failing at this point + * there's really nothing much to be done about things. + */ + BUG_ON(cfs_rq->sum_shift >= 10); + cfs_rq->sum_shift++; + + /* + * Note: \Sum (k_i * (w_i >> 1)) != (\Sum (k_i * w_i)) >> 1 + */ + cfs_rq->sum_w_vruntime = 0; + cfs_rq->sum_weight = 0; + + for (struct rb_node *node = cfs_rq->tasks_timeline.rb_leftmost; + node; node = rb_next(node)) + __sum_w_vruntime_add(cfs_rq, __node_2_se(node)); + + goto again; +} + +static void +sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ + if (sched_feat(PARANOID_AVG)) + return sum_w_vruntime_add_paranoid(cfs_rq, se); + + __sum_w_vruntime_add(cfs_rq, se); } static void sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) { - unsigned long weight = scale_load_down(se->load.weight); + unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight); s64 key = entity_key(cfs_rq, se); cfs_rq->sum_w_vruntime -= key * weight; @@ -725,7 +790,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq) s64 runtime = cfs_rq->sum_w_vruntime; if (curr) { - unsigned long w = scale_load_down(curr->load.weight); + unsigned long w = avg_vruntime_weight(cfs_rq, curr->load.weight); runtime += entity_key(cfs_rq, curr) * w; weight += w; @@ -735,7 +800,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq) if (runtime < 0) runtime -= (weight - 1); - delta = div_s64(runtime, weight); + delta = div64_long(runtime, weight); } else if (curr) { /* * When there is but one element, it is the average. @@ -764,17 +829,44 @@ static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq); * * -r_max < lag < max(r_max, q) */ -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) +static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avruntime) { u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC; s64 vlag, limit; + vlag = avruntime - se->vruntime; + limit = calc_delta_fair(max_slice, se); + + return clamp(vlag, -limit, limit); +} + +/* + * Delayed dequeue aims to reduce the negative lag of a dequeued task. While + * updating the lag of an entity, check that negative lag didn't increase + * during the delayed dequeue period which would be unfair. + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO + * is set. + * + * Return true if the lag has been adjusted. + */ +static __always_inline +bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se) +{ + s64 vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq)); + bool ret; + WARN_ON_ONCE(!se->on_rq); - vlag = avg_vruntime(cfs_rq) - se->vruntime; - limit = calc_delta_fair(max_slice, se); + if (se->sched_delayed) { + /* previous vlag < 0 otherwise se would not be delayed */ + vlag = max(vlag, se->vlag); + if (sched_feat(DELAY_ZERO)) + vlag = min(vlag, 0); + } + ret = (vlag == se->vlag); + se->vlag = vlag; - se->vlag = clamp(vlag, -limit, limit); + return ret; } /* @@ -801,7 +893,7 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime) long load = cfs_rq->sum_weight; if (curr && curr->on_rq) { - unsigned long weight = scale_load_down(curr->load.weight); + unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight); avg += entity_key(cfs_rq, curr) * weight; load += weight; @@ -1024,7 +1116,7 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) /* * Picking the ->next buddy will affect latency but not fairness. */ - if (sched_feat(PICK_BUDDY) && + if (sched_feat(PICK_BUDDY) && protect && cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { /* ->next will never be delayed */ WARN_ON_ONCE(cfs_rq->next->sched_delayed); @@ -3841,23 +3933,125 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) se_weight(se) * -se->avg.load_sum); } -static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags); +static void +rescale_entity(struct sched_entity *se, unsigned long weight, bool rel_vprot) +{ + unsigned long old_weight = se->load.weight; + + /* + * VRUNTIME + * -------- + * + * COROLLARY #1: The virtual runtime of the entity needs to be + * adjusted if re-weight at !0-lag point. + * + * Proof: For contradiction assume this is not true, so we can + * re-weight without changing vruntime at !0-lag point. + * + * Weight VRuntime Avg-VRuntime + * before w v V + * after w' v' V' + * + * Since lag needs to be preserved through re-weight: + * + * lag = (V - v)*w = (V'- v')*w', where v = v' + * ==> V' = (V - v)*w/w' + v (1) + * + * Let W be the total weight of the entities before reweight, + * since V' is the new weighted average of entities: + * + * V' = (WV + w'v - wv) / (W + w' - w) (2) + * + * by using (1) & (2) we obtain: + * + * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v + * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v + * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v + * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3) + * + * Since we are doing at !0-lag point which means V != v, we + * can simplify (3): + * + * ==> W / (W + w' - w) = w / w' + * ==> Ww' = Ww + ww' - ww + * ==> W * (w' - w) = w * (w' - w) + * ==> W = w (re-weight indicates w' != w) + * + * So the cfs_rq contains only one entity, hence vruntime of + * the entity @v should always equal to the cfs_rq's weighted + * average vruntime @V, which means we will always re-weight + * at 0-lag point, thus breach assumption. Proof completed. + * + * + * COROLLARY #2: Re-weight does NOT affect weighted average + * vruntime of all the entities. + * + * Proof: According to corollary #1, Eq. (1) should be: + * + * (V - v)*w = (V' - v')*w' + * ==> v' = V' - (V - v)*w/w' (4) + * + * According to the weighted average formula, we have: + * + * V' = (WV - wv + w'v') / (W - w + w') + * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w') + * = (WV - wv + w'V' - Vw + wv) / (W - w + w') + * = (WV + w'V' - Vw) / (W - w + w') + * + * ==> V'*(W - w + w') = WV + w'V' - Vw + * ==> V' * (W - w) = (W - w) * V (5) + * + * If the entity is the only one in the cfs_rq, then reweight + * always occurs at 0-lag point, so V won't change. Or else + * there are other entities, hence W != w, then Eq. (5) turns + * into V' = V. So V won't change in either case, proof done. + * + * + * So according to corollary #1 & #2, the effect of re-weight + * on vruntime should be: + * + * v' = V' - (V - v) * w / w' (4) + * = V - (V - v) * w / w' + * = V - vl * w / w' + * = V - vl' + */ + se->vlag = div64_long(se->vlag * old_weight, weight); + + /* + * DEADLINE + * -------- + * + * When the weight changes, the virtual time slope changes and + * we should adjust the relative virtual deadline accordingly. + * + * d' = v' + (d - v)*w/w' + * = V' - (V - v)*w/w' + (d - v)*w/w' + * = V - (V - v)*w/w' + (d - v)*w/w' + * = V + (d - V)*w/w' + */ + if (se->rel_deadline) + se->deadline = div64_long(se->deadline * old_weight, weight); + + if (rel_vprot) + se->vprot = div64_long(se->vprot * old_weight, weight); +} static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, unsigned long weight) { bool curr = cfs_rq->curr == se; bool rel_vprot = false; - u64 vprot; + u64 avruntime = 0; if (se->on_rq) { /* commit outstanding execution time */ update_curr(cfs_rq); - update_entity_lag(cfs_rq, se); - se->deadline -= se->vruntime; + avruntime = avg_vruntime(cfs_rq); + se->vlag = entity_lag(cfs_rq, se, avruntime); + se->deadline -= avruntime; se->rel_deadline = 1; if (curr && protect_slice(se)) { - vprot = se->vprot - se->vruntime; + se->vprot -= avruntime; rel_vprot = true; } @@ -3868,30 +4062,23 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, } dequeue_load_avg(cfs_rq, se); - /* - * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i), - * we need to scale se->vlag when w_i changes. - */ - se->vlag = div_s64(se->vlag * se->load.weight, weight); - if (se->rel_deadline) - se->deadline = div_s64(se->deadline * se->load.weight, weight); - - if (rel_vprot) - vprot = div_s64(vprot * se->load.weight, weight); + rescale_entity(se, weight, rel_vprot); update_load_set(&se->load, weight); do { u32 divider = get_pelt_divider(&se->avg); - se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider); } while (0); enqueue_load_avg(cfs_rq, se); if (se->on_rq) { - place_entity(cfs_rq, se, 0); if (rel_vprot) - se->vprot = se->vruntime + vprot; + se->vprot += avruntime; + se->deadline += avruntime; + se->rel_deadline = 0; + se->vruntime = avruntime - se->vlag; + update_load_add(&cfs_rq->load, se->load.weight); if (!curr) __enqueue_entity(cfs_rq, se); @@ -5165,6 +5352,7 @@ static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { u64 vslice, vruntime = avg_vruntime(cfs_rq); + bool update_zero = false; s64 lag = 0; if (!se->custom_slice) @@ -5181,7 +5369,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) */ if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) { struct sched_entity *curr = cfs_rq->curr; - unsigned long load; + long load, weight; lag = se->vlag; @@ -5239,17 +5427,44 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) */ load = cfs_rq->sum_weight; if (curr && curr->on_rq) - load += scale_load_down(curr->load.weight); + load += avg_vruntime_weight(cfs_rq, curr->load.weight); - lag *= load + scale_load_down(se->load.weight); + weight = avg_vruntime_weight(cfs_rq, se->load.weight); + lag *= load + weight; if (WARN_ON_ONCE(!load)) load = 1; - lag = div_s64(lag, load); + lag = div64_long(lag, load); + + /* + * A heavy entity (relative to the tree) will pull the + * avg_vruntime close to its vruntime position on enqueue. But + * the zero_vruntime point is only updated at the next + * update_deadline()/place_entity()/update_entity_lag(). + * + * Specifically (see the comment near avg_vruntime_weight()): + * + * sum_w_vruntime = \Sum (v_i - v0) * w_i + * + * Note that if v0 is near a light entity, both terms will be + * small for the light entity, while in that case both terms + * are large for the heavy entity, leading to risk of + * overflow. + * + * OTOH if v0 is near the heavy entity, then the difference is + * larger for the light entity, but the factor is small, while + * for the heavy entity the difference is small but the factor + * is large. Avoiding the multiplication overflow. + */ + if (weight > load) + update_zero = true; } se->vruntime = vruntime - lag; - if (se->rel_deadline) { + if (update_zero) + update_zero_vruntime(cfs_rq, -lag); + + if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) { se->deadline += se->vruntime; se->rel_deadline = 0; return; @@ -5399,13 +5614,6 @@ static void clear_delayed(struct sched_entity *se) } } -static inline void finish_delayed_dequeue_entity(struct sched_entity *se) -{ - clear_delayed(se); - if (sched_feat(DELAY_ZERO) && se->vlag > 0) - se->vlag = 0; -} - static bool dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { @@ -5431,6 +5639,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) if (sched_feat(DELAY_DEQUEUE) && delay && !entity_eligible(cfs_rq, se)) { update_load_avg(cfs_rq, se, 0); + update_entity_lag(cfs_rq, se); set_delayed(se); return false; } @@ -5470,7 +5679,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) update_cfs_group(se); if (flags & DEQUEUE_DELAYED) - finish_delayed_dequeue_entity(se); + clear_delayed(se); if (cfs_rq->nr_queued == 0) { update_idle_cfs_rq_clock_pelt(cfs_rq); @@ -6866,16 +7075,15 @@ static inline void hrtick_update(struct rq *rq) static inline bool cpu_overutilized(int cpu) { - unsigned long rq_util_min, rq_util_max; + unsigned long rq_util_max; if (!sched_energy_enabled()) return false; - rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN); rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX); /* Return true only if the utilization doesn't fit CPU's capacity */ - return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu); + return !util_fits_cpu(cpu_util_cfs(cpu), 0, rq_util_max, cpu); } /* @@ -6913,9 +7121,15 @@ static int sched_idle_rq(struct rq *rq) rq->nr_running); } -static int sched_idle_cpu(int cpu) +static int choose_sched_idle_rq(struct rq *rq, struct task_struct *p) +{ + return sched_idle_rq(rq) && !task_has_idle_policy(p); +} + +static int choose_idle_cpu(int cpu, struct task_struct *p) { - return sched_idle_rq(cpu_rq(cpu)); + return available_idle_cpu(cpu) || + choose_sched_idle_rq(cpu_rq(cpu), p); } static void @@ -6931,18 +7145,14 @@ requeue_delayed_entity(struct sched_entity *se) WARN_ON_ONCE(!se->sched_delayed); WARN_ON_ONCE(!se->on_rq); - if (sched_feat(DELAY_ZERO)) { - update_entity_lag(cfs_rq, se); - if (se->vlag > 0) { - cfs_rq->nr_queued--; - if (se != cfs_rq->curr) - __dequeue_entity(cfs_rq, se); - se->vlag = 0; - place_entity(cfs_rq, se, 0); - if (se != cfs_rq->curr) - __enqueue_entity(cfs_rq, se); - cfs_rq->nr_queued++; - } + if (update_entity_lag(cfs_rq, se)) { + cfs_rq->nr_queued--; + if (se != cfs_rq->curr) + __dequeue_entity(cfs_rq, se); + place_entity(cfs_rq, se, 0); + if (se != cfs_rq->curr) + __enqueue_entity(cfs_rq, se); + cfs_rq->nr_queued++; } update_load_avg(cfs_rq, se, 0); @@ -7475,7 +7685,7 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct * if (!sched_core_cookie_match(rq, p)) continue; - if (sched_idle_cpu(i)) + if (choose_sched_idle_rq(rq, p)) return i; if (available_idle_cpu(i)) { @@ -7566,8 +7776,7 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas static inline int __select_idle_cpu(int cpu, struct task_struct *p) { - if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && - sched_cpu_cookie_match(cpu_rq(cpu), p)) + if (choose_idle_cpu(cpu, p) && sched_cpu_cookie_match(cpu_rq(cpu), p)) return cpu; return -1; @@ -7640,7 +7849,8 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu if (!available_idle_cpu(cpu)) { idle = false; if (*idle_cpu == -1) { - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, cpus)) { + if (choose_sched_idle_rq(cpu_rq(cpu), p) && + cpumask_test_cpu(cpu, cpus)) { *idle_cpu = cpu; break; } @@ -7675,7 +7885,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t */ if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) continue; - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) + if (choose_idle_cpu(cpu, p)) return cpu; } @@ -7714,21 +7924,26 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool { struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask); int i, cpu, idle_cpu = -1, nr = INT_MAX; - struct sched_domain_shared *sd_share; - - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); if (sched_feat(SIS_UTIL)) { - sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, target)); - if (sd_share) { - /* because !--nr is the condition to stop scan */ - nr = READ_ONCE(sd_share->nr_idle_scan) + 1; - /* overloaded LLC is unlikely to have idle cpu/core */ - if (nr == 1) - return -1; - } + /* + * Increment because !--nr is the condition to stop scan. + * + * Since "sd" is "sd_llc" for target CPU dereferenced in the + * caller, it is safe to directly dereference "sd->shared". + * Topology bits always ensure it assigned for "sd_llc" abd it + * cannot disappear as long as we have a RCU protected + * reference to one the associated "sd" here. + */ + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1; + /* overloaded LLC is unlikely to have idle cpu/core */ + if (nr == 1) + return -1; } + if (!cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr)) + return -1; + if (static_branch_unlikely(&sched_cluster_active)) { struct sched_group *sg = sd->groups; @@ -7797,7 +8012,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target) for_each_cpu_wrap(cpu, cpus, target) { unsigned long cpu_cap = capacity_of(cpu); - if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu)) + if (!choose_idle_cpu(cpu, p)) continue; fits = util_fits_cpu(task_util, util_min, util_max, cpu); @@ -7868,7 +8083,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) */ lockdep_assert_irqs_disabled(); - if ((available_idle_cpu(target) || sched_idle_cpu(target)) && + if (choose_idle_cpu(target, p) && asym_fits_cpu(task_util, util_min, util_max, target)) return target; @@ -7876,7 +8091,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) * If the previous CPU is cache affine and idle, don't be stupid: */ if (prev != target && cpus_share_cache(prev, target) && - (available_idle_cpu(prev) || sched_idle_cpu(prev)) && + choose_idle_cpu(prev, p) && asym_fits_cpu(task_util, util_min, util_max, prev)) { if (!static_branch_unlikely(&sched_cluster_active) || @@ -7908,7 +8123,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) if (recent_used_cpu != prev && recent_used_cpu != target && cpus_share_cache(recent_used_cpu, target) && - (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && + choose_idle_cpu(recent_used_cpu, p) && cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) && asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) { @@ -8408,10 +8623,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) struct perf_domain *pd; struct energy_env eenv; - rcu_read_lock(); pd = rcu_dereference_all(rd->pd); if (!pd) - goto unlock; + return target; /* * Energy-aware wake-up happens on the lowest sched_domain starting @@ -8421,13 +8635,13 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd))) sd = sd->parent; if (!sd) - goto unlock; + return target; target = prev_cpu; sync_entity_load_avg(&p->se); if (!task_util_est(p) && p_util_min == 0) - goto unlock; + return target; eenv_task_busy_time(&eenv, p, prev_cpu); @@ -8522,7 +8736,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) prev_cpu); /* CPU utilization has changed */ if (prev_delta < base_energy) - goto unlock; + return target; prev_delta -= base_energy; prev_actual_cap = cpu_actual_cap; best_delta = min(best_delta, prev_delta); @@ -8546,7 +8760,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) max_spare_cap_cpu); /* CPU utilization has changed */ if (cur_delta < base_energy) - goto unlock; + return target; cur_delta -= base_energy; /* @@ -8563,7 +8777,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) best_actual_cap = cpu_actual_cap; } } - rcu_read_unlock(); if ((best_fits > prev_fits) || ((best_fits > 0) && (best_delta < prev_delta)) || @@ -8571,11 +8784,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu) target = best_energy_cpu; return target; - -unlock: - rcu_read_unlock(); - - return target; } /* @@ -8620,7 +8828,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr); } - rcu_read_lock(); for_each_domain(cpu, tmp) { /* * If both 'cpu' and 'prev_cpu' are part of this domain, @@ -8646,14 +8853,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags) break; } - if (unlikely(sd)) { - /* Slow path */ - new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag); - } else if (wake_flags & WF_TTWU) { /* XXX always ? */ - /* Fast path */ - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); - } - rcu_read_unlock(); + /* Slow path */ + if (unlikely(sd)) + return sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag); + + /* Fast path */ + if (wake_flags & WF_TTWU) + return select_idle_sibling(p, prev_cpu, new_cpu); return new_cpu; } @@ -8944,8 +9150,10 @@ pick: return; preempt: - if (preempt_action == PREEMPT_WAKEUP_SHORT) + if (preempt_action == PREEMPT_WAKEUP_SHORT) { cancel_protect_slice(se); + clear_buddies(cfs_rq, se); + } resched_curr_lazy(rq); } @@ -9793,32 +10001,6 @@ next: } /* - * attach_task() -- attach the task detached by detach_task() to its new rq. - */ -static void attach_task(struct rq *rq, struct task_struct *p) -{ - lockdep_assert_rq_held(rq); - - WARN_ON_ONCE(task_rq(p) != rq); - activate_task(rq, p, ENQUEUE_NOCLOCK); - wakeup_preempt(rq, p, 0); -} - -/* - * attach_one_task() -- attaches the task returned from detach_one_task() to - * its new rq. - */ -static void attach_one_task(struct rq *rq, struct task_struct *p) -{ - struct rq_flags rf; - - rq_lock(rq, &rf); - update_rq_clock(rq); - attach_task(rq, p); - rq_unlock(rq, &rf); -} - -/* * attach_tasks() -- attaches all tasks detached by detach_tasks() to their * new rq. */ @@ -10055,6 +10237,7 @@ struct sg_lb_stats { unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */ unsigned int group_smt_balance; /* Task on busy SMT be moved */ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ + unsigned int group_overutilized; /* At least one CPU is overutilized in the group */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -10287,6 +10470,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs) static inline bool group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs) { + /* + * With EAS and uclamp, 1 CPU in the group must be overutilized to + * consider the group overloaded. + */ + if (sched_energy_enabled() && !sgs->group_overutilized) + return false; + if (sgs->sum_nr_running <= sgs->group_weight) return false; @@ -10470,14 +10660,12 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd) * @group: sched_group whose statistics are to be updated. * @sgs: variable to hold the statistics for this group. * @sg_overloaded: sched_group is overloaded - * @sg_overutilized: sched_group is overutilized */ static inline void update_sg_lb_stats(struct lb_env *env, struct sd_lb_stats *sds, struct sched_group *group, struct sg_lb_stats *sgs, - bool *sg_overloaded, - bool *sg_overutilized) + bool *sg_overloaded) { int i, nr_running, local_group, sd_flags = env->sd->flags; bool balancing_at_rd = !env->sd->parent; @@ -10499,7 +10687,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->sum_nr_running += nr_running; if (cpu_overutilized(i)) - *sg_overutilized = 1; + sgs->group_overutilized = 1; /* * No need to call idle_cpu() if nr_running is not 0 @@ -11075,6 +11263,7 @@ static void update_idle_cpu_scan(struct lb_env *env, unsigned long sum_util) { struct sched_domain_shared *sd_share; + struct sched_domain *sd = env->sd; int llc_weight, pct; u64 x, y, tmp; /* @@ -11088,11 +11277,7 @@ static void update_idle_cpu_scan(struct lb_env *env, if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE) return; - llc_weight = per_cpu(sd_llc_size, env->dst_cpu); - if (env->sd->span_weight != llc_weight) - return; - - sd_share = rcu_dereference_all(per_cpu(sd_llc_shared, env->dst_cpu)); + sd_share = sd->shared; if (!sd_share) return; @@ -11126,10 +11311,11 @@ static void update_idle_cpu_scan(struct lb_env *env, */ /* equation [3] */ x = sum_util; + llc_weight = sd->span_weight; do_div(x, llc_weight); /* equation [4] */ - pct = env->sd->imbalance_pct; + pct = sd->imbalance_pct; tmp = x * x * pct * pct; do_div(tmp, 10000 * SCHED_CAPACITY_SCALE); tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE); @@ -11170,13 +11356,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd update_group_capacity(env->sd, env->dst_cpu); } - update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized); + update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded); if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; sds->busiest_stat = *sgs; } + sg_overutilized |= sgs->group_overutilized; + /* Now, start updating sd_lb_stats */ sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; @@ -12297,7 +12485,30 @@ static inline void update_newidle_stats(struct sched_domain *sd, unsigned int su sd->newidle_success += success; if (sd->newidle_call >= 1024) { - sd->newidle_ratio = sd->newidle_success; + u64 now = sched_clock(); + s64 delta = now - sd->newidle_stamp; + sd->newidle_stamp = now; + int ratio = 0; + + if (delta < 0) + delta = 0; + + if (sched_feat(NI_RATE)) { + /* + * ratio delta freq + * + * 1024 - 4 s - 128 Hz + * 512 - 2 s - 256 Hz + * 256 - 1 s - 512 Hz + * 128 - .5 s - 1024 Hz + * 64 - .25 s - 2048 Hz + */ + ratio = delta >> 22; + } + + ratio += sd->newidle_success; + + sd->newidle_ratio = min(1024, ratio); sd->newidle_call /= 2; sd->newidle_success /= 2; } @@ -12344,7 +12555,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle) { int continue_balancing = 1; int cpu = rq->cpu; - int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu); + int busy = idle != CPU_IDLE && !sched_idle_rq(rq); unsigned long interval; struct sched_domain *sd; /* Earliest time when we have to do rebalance again */ @@ -12382,7 +12593,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle) * state even if we migrated tasks. Update it. */ idle = idle_cpu(cpu); - busy = !idle && !sched_idle_cpu(cpu); + busy = !idle && !sched_idle_rq(rq); } sd->last_balance = jiffies; interval = get_sd_balance_interval(sd, busy); @@ -12427,14 +12638,14 @@ static inline int on_null_domain(struct rq *rq) */ static inline int find_new_ilb(void) { + int this_cpu = smp_processor_id(); const struct cpumask *hk_mask; int ilb_cpu; hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE); for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) { - - if (ilb_cpu == smp_processor_id()) + if (ilb_cpu == this_cpu) continue; if (idle_cpu(ilb_cpu)) @@ -13004,7 +13215,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf) if (sd->flags & SD_BALANCE_NEWIDLE) { unsigned int weight = 1; - if (sched_feat(NI_RANDOM)) { + if (sched_feat(NI_RANDOM) && sd->newidle_ratio < 1024) { /* * Throw a 1k sided dice; and only run * newidle_balance according to the success @@ -14030,7 +14241,7 @@ void show_numa_stats(struct task_struct *p, struct seq_file *m) tpf = p->numa_faults[task_faults_idx(NUMA_MEM, node, 1)]; } if (ng) { - gsf = ng->faults[task_faults_idx(NUMA_MEM, node, 0)], + gsf = ng->faults[task_faults_idx(NUMA_MEM, node, 0)]; gpf = ng->faults[task_faults_idx(NUMA_MEM, node, 1)]; } print_numa_stats(m, node, tsf, tpf, gsf, gpf); |
