summaryrefslogtreecommitdiff
path: root/kernel/smp.c
AgeCommit message (Collapse)Author
2026-04-17Merge tag 'trace-v7.1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Fix printf format warning for bprintf sunrpc uses a trace_printk() that triggers a printf warning during the compile. Move the __printf() attribute around for when debugging is not enabled the warning will go away - Remove redundant check for EVENT_FILE_FL_FREED in event_filter_write() The FREED flag is checked in the call to event_file_file() and then checked again right afterward, which is unneeded - Clean up event_file_file() and event_file_data() helpers These helper functions played a different role in the past, but now with eventfs, the READ_ONCE() isn't needed. Simplify the code a bit and also add a warning to event_file_data() if the file or its data is not present - Remove updating file->private_data in tracing open All access to the file private data is handled by the helper functions, which do not use file->private_data. Stop updating it on open - Show ENUM names in function arguments via BTF in function tracing When showing the function arguments when func-args option is set for function tracing, if one of the arguments is found to be an enum, show the name of the enum instead of its number - Add new trace_call__##name() API for tracepoints Tracepoints are enabled via static_branch() blocks, where when not enabled, there's only a nop that is in the code where the execution will just skip over it. When tracing is enabled, the nop is converted to a direct jump to the tracepoint code. Sometimes more calculations are required to be performed to update the parameters of the tracepoint. In this case, trace_##name##_enabled() is called which is a static_branch() that gets enabled only when the tracepoint is enabled. This allows the extra calculations to also be skipped by the nop: if (trace_foo_enabled()) { x = bar(); trace_foo(x); } Where the x=bar() is only performed when foo is enabled. The problem with this approach is that there's now two static_branch() calls. One for checking if the tracepoint is enabled, and then again to know if the tracepoint should be called. The second one is redundant Introduce trace_call__foo() that will call the foo() tracepoint directly without doing a static_branch(): if (trace_foo_enabled()) { x = bar(); trace_call__foo(); } - Update various locations to use the new trace_call__##name() API - Move snapshot code out of trace.c Cleaning up trace.c to not be a "dump all", move the snapshot code out of it and into a new trace_snapshot.c file - Clean up some "%*.s" to "%*s" - Allow boot kernel command line options to be called multiple times Have options like: ftrace_filter=foo ftrace_filter=bar ftrace_filter=zoo Equal to: ftrace_filter=foo,bar,zoo - Fix ipi_raise event CPU field to be a CPU field The ipi_raise target_cpus field is defined as a __bitmask(). There is now a __cpumask() field definition. Update the field to use that - Have hist_field_name() use a snprintf() and not a series of strcat() It's safer to use snprintf() that a series of strcat() - Fix tracepoint regfunc balancing A tracepoint can define a "reg" and "unreg" function that gets called before the tracepoint is enabled, and after it is disabled respectively. But on error, after the "reg" func is called and the tracepoint is not enabled, the "unreg" function is not called to tear down what the "reg" function performed - Fix output that shows what histograms are enabled Event variables are displayed incorrectly in the histogram output Instead of "sched.sched_wakeup.$var", it is showing "$sched.sched_wakeup.var" where the '$' is in the incorrect location - Some other simple cleanups * tag 'trace-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (24 commits) selftests/ftrace: Add test case for fully-qualified variable references tracing: Fix fully-qualified variable reference printing in histograms tracepoint: balance regfunc() on func_add() failure in tracepoint_add_func() tracing: Rebuild full_name on each hist_field_name() call tracing: Report ipi_raise target CPUs as cpumask tracing: Remove duplicate latency_fsnotify() stub tracing: Preserve repeated trace_trigger boot parameters tracing: Append repeated boot-time tracing parameters tracing: Remove spurious default precision from show_event_trigger/filter formats cpufreq: Use trace_call__##name() at guarded tracepoint call sites tracing: Remove tracing_alloc_snapshot() when snapshot isn't defined tracing: Move snapshot code out of trace.c and into trace_snapshot.c mm: damon: Use trace_call__##name() at guarded tracepoint call sites btrfs: Use trace_call__##name() at guarded tracepoint call sites spi: Use trace_call__##name() at guarded tracepoint call sites i2c: Use trace_call__##name() at guarded tracepoint call sites kernel: Use trace_call__##name() at guarded tracepoint call sites tracepoint: Add trace_call__##name() API tracing: trace_mmap.h: fix a kernel-doc warning tracing: Pretty-print enum parameters in function arguments ...
2026-03-26smp: Use system_percpu_wq instead of system_wqMarco Crivellari
When a caller enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when no target CPU is specified). The same applies to schedule_work() that is using system_wq and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. Continue the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue() flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") and switch smp_call_on_cpu() to use system_percpu_wq because system_wq is going away once the ongoing workqueue restructuring is done. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251110170332.319314-1-marco.crivellari@suse.com
2026-03-26kernel: Use trace_call__##name() at guarded tracepoint call sitesVineeth Pillai (Google)
Replace trace_foo() with the new trace_call__foo() at sites already guarded by trace_foo_enabled(), avoiding a redundant static_branch_unlikely() re-evaluation inside the tracepoint. trace_call__foo() calls the tracepoint callbacks directly without utilizing the static branch again. Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: "Yury Norov [NVIDIA]" <yury.norov@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Kisel <romank@linux.microsoft.com> Cc: Joel Fernandes <joelagnelf@nvidia.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20260323160052.17528-3-vineeth@bitbyteword.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org> Assisted-by: Claude:claude-sonnet-4-6 Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-25smp: Improve smp_call_function_single() CSD-lock diagnosticsPaul E. McKenney
Both smp_call_function() and smp_call_function_single() use per-CPU call_single_data_t variable to hold the infamous CSD lock. However, while smp_call_function() acquires the destination CPU's CSD lock, smp_call_function_single() instead uses the source CPU's CSD lock. (These are two separate sets of CSD locks, cfd_data and csd_data, respectively.) This otherwise inexplicable pair of choices is explained by their respective queueing properties. If smp_call_function() where to use the sending CPU's CSD lock, that would serialize the destination CPUs' IPI handlers and result in long smp_call_function() latencies, especially on systems with large numbers of CPUs. For its part, if smp_call_function_single() were to use the (single) destination CPU's CSD lock, this would similarly serialize in the case where many CPUs are sending IPIs to a single "victim" CPU. Plus it would result in higher levels of memory contention. Except that if there is no NMI-based stack tracing on a weakly ordered system where remote unsynchronized stack traces are especially unreliable, the improved debugging beats the improved queueing. This improved queueing only matters if a bunch of CPUs are calling smp_call_function_single() concurrently for a single "victim" CPU, which is not the common case. Therefore, make smp_call_function_single() use the destination CPU's csd_data instance in kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG=y where csdlock_debug_enabled is also set. Otherwise, continue to use the source CPU's csd_data. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/25c2eb97-77c8-49a5-80ac-efe78dea272c@paulmck-laptop
2026-03-25smp: Get this_cpu once in smp_call_functionShrikanth Hegde
smp_call_function_single() and smp_call_function_many_cond() disable preemption and cache the CPU number via get_cpu(). Use this cached value throughout the function instead of invoking smp_processor_id() again. [ tglx: Make the copy&pasta'ed change log match the patch ] Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Link: https://patch.msgid.link/20260323193630.640311-4-sshegde@linux.ibm.com
2026-03-25smp: Add missing kernel-doc commentsRandy Dunlap
Add missing kernel-doc comments and rearrange the order of others to prevent all kernel-doc warnings. - add function Returns: sections or format existing comments as kernel-doc - add missing function parameter comments - use "/**" for smp_call_function_any() and on_each_cpu_cond_mask() - correct the commented function name for on_each_cpu_cond_mask() - use correct format for function short descriptions - add all kernel-doc comments for smp_call_on_cpu() - remove kernel-doc comments for raw_smp_processor_id() since there is no prototype for it here (other than !SMP) - in smp.h, rearrange some lines so that the kernel-doc comments for smp_processor_id() are immediately before the macro (to prevent kernel-doc warnings) - remove "Returns" from smp_call_function() since it doesn't return a value Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260310061726.1153764-1-rdunlap@infradead.org
2025-11-19smp: Introduce a helper function to check for pending IPIsUlf Hansson
When governors used during cpuidle try to find the most optimal idle state for a CPU or a group of CPUs, they are known to quite often fail. One reason for this is, that they are not taking into account whether there has been an IPI scheduled for any of the CPUs that are affected by the selected idle state. To enable pending IPIs to be taken into account for cpuidle decisions, introduce a new helper function, cpus_peek_for_pending_ipi(). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2025-09-18smp: Fix up and expand the smp_call_function_many() kerneldocRafael J. Wysocki
The smp_call_function_many() kerneldoc comment got out of sync with the function definition (bool parameter "wait" is incorrectly described as a bitmask in it), so fix it up by copying the "wait" description from the smp_call_function() kerneldoc and add information regarding the handling of the local CPU to it. Fixes: 49b3bd213a9f ("smp: Fix all kernel-doc warnings") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-08-02smp: Fix spelling in on_each_cpu_cond_mask()'s doc-commentRoman Kisel
"boolean" is spelt as "blooean". Fix that. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250722161818.6139-1-romank@linux.microsoft.com
2025-07-29Merge tag 'stop-machine.2025.07.23a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull stop-machine documentation updates from Paul McKenney: - Improve kernel-doc function-header comments - Document preemption and stop_machine() mutual exclusion (Joel Fernandes) * tag 'stop-machine.2025.07.23a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: smp: Document preemption and stop_machine() mutual exclusion stop_machine: Improve kernel-doc function-header comments
2025-07-17smp: Document preemption and stop_machine() mutual exclusionJoel Fernandes
Recently while revising RCU's cpu online checks, there was some discussion around how IPIs synchronize with hotplug. Add comments explaining how preemption disable creates mutual exclusion with CPU hotplug's stop_machine mechanism. The key insight is that stop_machine() atomically updates CPU masks and flushes IPIs with interrupts disabled, and cannot proceed while any CPU (including the IPI sender) has preemption disabled. [ Apply peterz feedback. ] Cc: Andrea Righi <arighi@nvidia.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: rcu@vger.kernel.org Acked-by: Paul E. McKenney <paulmck@kernel.org> Co-developed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2025-07-06smp: Wait only if work was enqueuedRik van Riel
Whenever work is enqueued for a remote CPU, smp_call_function_many_cond() may need to wait for that work to be completed. However, if no work is enqueued for a remote CPU, because the condition func() evaluated to false for all CPUs, there is no need to wait. Set run_remote only if work was enqueued on remote CPUs. Document the difference between "work enqueued", and "CPU needs to be woken up" Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://lore.kernel.org/all/20250703203019.11331ac3@fangorn
2025-07-02smp: Defer check for local execution in smp_call_function_many_cond()Yury Norov [NVIDIA]
Defer check for local execution to the actual place where it is needed, which removes the extra local variable. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250623000010.10124-5-yury.norov@gmail.com
2025-06-26smp: Use cpumask_any_but() in smp_call_function_many_cond()Yury Norov [NVIDIA]
smp_call_function_many_cond() opencodes cpumask_any_but(). Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250623000010.10124-3-yury.norov@gmail.com
2025-06-26smp: Improve locality in smp_call_function_any()Yury Norov [NVIDIA]
smp_call_function_any() tries to make a local call as it's the cheapest option, or switches to a CPU in the same node. If it's not possible, the algorithm gives up and searches for any CPU, in a numerical order. Instead, it can search for the best CPU based on NUMA locality, including the 2nd nearest hop (a set of equidistant nodes), and higher. sched_numa_find_nth_cpu() does exactly that, and also helps to drop most of the housekeeping code. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250623000010.10124-2-yury.norov@gmail.com
2025-01-28Merge tag 'csd-lock.2025.01.28a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull CSD-lock update from Paul McKenney: "Allow runtime modification of the csd_lock_timeout and panic_on_ipistall module parameters" * tag 'csd-lock.2025.01.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: locking/csd-lock: make CSD lock debug tunables writable in /sys
2024-12-11locking/csd-lock: make CSD lock debug tunables writable in /sysRik van Riel
Currently the CSD lock tunables can only be set at boot time in the kernel commandline, but the way these variables are used means there is really no reason not to tune them at runtime through /sys. Make the CSD lock debug tunables tunable through /sys. Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-12-05smp/scf: Evaluate local cond_func() before IPI side-effectsMathieu Desnoyers
In smp_call_function_many_cond(), the local cond_func() is evaluated after triggering the remote CPU IPIs. If cond_func() depends on loading shared state updated by other CPU's IPI handlers func(), then triggering execution of remote CPUs IPI before evaluating cond_func() may have unexpected consequences. One example scenario is evaluating a jiffies delay in cond_func(), which is updated by func() in the IPI handlers. This situation can prevent execution of periodic cleanup code on the local CPU. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20241203163558.3455535-1-mathieu.desnoyers@efficios.com
2024-10-11locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()Paul E. McKenney
Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock() to check for excessive CSD-lock wait times. This works, but does not guarantee monotonic timestamps on x86 due to the sched_clock() function's use of the rdtsc instruction, which does not guarantee ordering. This means that, given successive calls to sched_clock(), the second might return an earlier time than the second, that is, time might seem to go backwards. This can (and does!) result in false-positive CSD-lock wait complaints claiming almost 2^64 nanoseconds of delay. Therefore, switch from sched_clock() to ktime_get_mono_fast_ns(), which does guarantee monotonic timestamps via the rdtsc_ordered() function, which as the name implies, does guarantee ordered timestamps, at least in the absence of calls from NMI handlers, which are not involved in this code path. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Leonardo Bras <leobras@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
2024-08-15smp: print only local CPU info when sched_clock goes backwardRik van Riel
About 40% of all csd_lock warnings observed in our fleet appear to be due to sched_clock() going backward in time (usually only a little bit), resulting in ts0 being larger than ts2. When the local CPU is at fault, we should print out a message reflecting that, rather than trying to get the remote CPU's stack trace. Signed-off-by: Rik van Riel <riel@surriel.com> Tested-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15locking/csd-lock: Use backoff for repeated reports of same incidentPaul E. McKenney
Currently, the CSD-lock diagnostics in CONFIG_CSD_LOCK_WAIT_DEBUG=y kernels are emitted at five-second intervals. Although this has proven to be a good time interval for the first diagnostic, if the target CPU keeps interrupts disabled for way longer than five seconds, the ratio of useful new information to pointless repetition increases considerably. Therefore, back off the time period for repeated reports of the same incident, increasing linearly with the number of reports and logarithmicly with the number of online CPUs. [ paulmck: Apply Dan Carpenter feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leonardo Bras <leobras@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15locking/csd_lock: Provide an indication of ongoing CSD-lock stallPaul E. McKenney
If a CSD-lock stall goes on long enough, it will cause an RCU CPU stall warning. This additional warning provides much additional console-log traffic and little additional information. Therefore, provide a new csd_lock_is_stuck() function that returns true if there is an ongoing CSD-lock stall. This function will be used by the RCU CPU stall warnings to provide a one-line indication of the stall when this function returns true. [ neeraj.upadhyay: Apply Rik van Riel feedback. ] [ neeraj.upadhyay: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leonardo Bras <leobras@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29locking/csd_lock: Print large numbers as negativesPaul E. McKenney
The CSD-lock-hold diagnostics from CONFIG_CSD_LOCK_WAIT_DEBUG are printed in nanoseconds as unsigned long longs, which is a bit obtuse for human readers when timing bugs result in negative CSD-lock hold times. Yes, there are some people to whom it is immediately obvious that 18446744073709551615 is really -1, but for the rest of us... Therefore, print these numbers as signed long longs, making the negative hold times immediately apparent. Reported-by: Rik van Riel <riel@surriel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leonardo Bras <leobras@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-10smp: Add missing destroy_work_on_stack() call in smp_call_on_cpu()Zqiang
For CONFIG_DEBUG_OBJECTS_WORK=y kernels sscs.work defined by INIT_WORK_ONSTACK() is initialized by debug_object_init_on_stack() for the debug check in __init_work() to work correctly. But this lacks the counterpart to remove the tracked object from debug objects again, which will cause a debug object warning once the stack is freed. Add the missing destroy_work_on_stack() invocation to cure that. [ tglx: Massaged changelog ] Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240704065213.13559-1-qiang.zhang1211@gmail.com
2024-06-17smp: Use str_plural() to fix Coccinelle warningsThorsten Blum
Fixes the following two Coccinelle/coccicheck warnings reported by string_choices.cocci: opportunity for str_plural(num_cpus) opportunity for str_plural(num_nodes) Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240508154225.309703-2-thorsten.blum@toblux.com
2023-10-30Merge tag 'csd-lock.2023.10.23a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull CSD lock update from Paul McKenney: "This adds a kernel boot parameter that causes the kernel to panic if one of the call_smp_function() APIs is stalled for more than the specified duration. This is useful in deployments in which a clean panic is preferable to an indefinite stall" * tag 'csd-lock.2023.10.23a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: smp,csd: Throw an error if a CSD lock is stuck for too long
2023-10-17Merge branch 'linus' into smp/coreThomas Gleixner
Pull in upstream to get the fixes so depending changes can be applied.
2023-10-16smp,csd: Throw an error if a CSD lock is stuck for too longRik van Riel
The CSD lock seems to get stuck in 2 "modes". When it gets stuck temporarily, it usually gets released in a few seconds, and sometimes up to one or two minutes. If the CSD lock stays stuck for more than several minutes, it never seems to get unstuck, and gradually more and more things in the system end up also getting stuck. In the latter case, we should just give up, so the system can dump out a little more information about what went wrong, and, with panic_on_oops and a kdump kernel loaded, dump a whole bunch more information about what might have gone wrong. In addition, there is an smp.panic_on_ipistall kernel boot parameter that by default retains the old behavior, but when set enables the panic after the CSD lock has been stuck for more than the specified number of milliseconds, as in 300,000 for five minutes. [ paulmck: Apply Imran Khan feedback. ] [ paulmck: Apply Leonardo Bras feedback. ] Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Imran Khan <imran.f.khan@oracle.com> Reviewed-by: Leonardo Bras <leobras@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org>
2023-09-13smp: Change function signatures to use call_single_data_tLeonardo Bras
call_single_data_t is a size-aligned typedef of struct __call_single_data. This alignment is desirable in order to have smp_call_function*() avoid bouncing an extra cacheline in case of an unaligned csd, given this would hurt performance. Since the removal of struct request->csd in commit 660e802c76c8 ("blk-mq: use percpu csd to remote complete instead of per-rq csd") there are no current users of smp_call_function*() with unaligned csd. Change every 'struct __call_single_data' function parameter to 'call_single_data_t', so we have warnings if any new code tries to introduce an smp_call_function*() call with unaligned csd. Signed-off-by: Leonardo Bras <leobras@redhat.com> Reviewed-by: Guo Ren <guoren@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230831063129.335425-1-leobras@redhat.com
2023-07-10smp: Reduce NMI traffic from CSD waiters to CSD destinationImran Khan
On systems with hundreds of CPUs, if most of the CPUs detect a CSD hang, then all of these waiting CPUs send an NMI to the destination CPU in order to dump its backtrace. Given enough NMIs, the destination CPU will spent much of its time producing backtraces, thus further delaying that CPU's response to the original CSD IPI. In the worst case, by the time destination CPU is done producing all of these backtrace NMIs, the CSD wait timeout will have elapsed so that the waiters resend their backtrace NMIs again, further delaying forward progress. Therefore, to avoid these delays, issue the backtrace NMI only from the first waiter. The destination CPU's other waiters can make use of backtrace obtained from the first waiter's NMI. Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-10smp: Reduce logging due to dump_stack of CSD waitersImran Khan
If a waiter is waiting for CSD lock, its call stack will not change between first and subsequent hang detection for the same CSD lock. Therefore, do dump_stack only for first-time detection for a given waiter. This avoids excessive logging on systems with hundreds of CPUs where repetitive dump_stack from hundreds of CPUs would otherwise flood the console. Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Yury Norov <yury.norov@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-06-16trace,smp: Add tracepoints for scheduling remotelly called functionsLeonardo Bras
Add a tracepoint for when a CSD is queued to a remote CPU's call_single_queue. This allows finding exactly which CPU queued a given CSD when looking at a csd_function_{entry,exit} event, and also enables us to accurately measure IPI delivery time with e.g. a synthetic event: $ echo 'hist:keys=cpu,csd.hex:ts=common_timestamp.usecs' >\ /sys/kernel/tracing/events/smp/csd_queue_cpu/trigger $ echo 'csd_latency unsigned int dst_cpu; unsigned long csd; u64 time' >\ /sys/kernel/tracing/synthetic_events $ echo \ 'hist:keys=common_cpu,csd.hex:'\ 'time=common_timestamp.usecs-$ts:'\ 'onmatch(smp.csd_queue_cpu).trace(csd_latency,common_cpu,csd,$time)' >\ /sys/kernel/tracing/events/smp/csd_function_entry/trigger $ trace-cmd record -e 'synthetic:csd_latency' hackbench $ trace-cmd report <...>-467 [001] 21.824263: csd_queue_cpu: cpu=0 callsite=try_to_wake_up+0x2ea func=sched_ttwu_pending csd=0xffff8880076148b8 <...>-467 [001] 21.824280: ipi_send_cpu: cpu=0 callsite=try_to_wake_up+0x2ea callback=generic_smp_call_function_single_interrupt+0x0 <...>-489 [000] 21.824299: csd_function_entry: func=sched_ttwu_pending csd=0xffff8880076148b8 <...>-489 [000] 21.824320: csd_latency: dst_cpu=0, csd=18446612682193848504, time=36 Suggested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Leonardo Bras <leobras@redhat.com> Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230615065944.188876-7-leobras@redhat.com
2023-06-16trace,smp: Add tracepoints around remotelly called functionsLeonardo Bras
The recently added ipi_send_{cpu,cpumask} tracepoints allow finding sources of IPIs targeting CPUs running latency-sensitive applications. For NOHZ_FULL CPUs, all IPIs are interference, and those tracepoints are sufficient to find them and work on getting rid of them. In some setups however, not *all* IPIs are to be suppressed, but long-running IPI callbacks can still be problematic. Add a pair of tracepoints to mark the start and end of processing a CSD IPI callback, similar to what exists for softirq, workqueue or timer callbacks. Signed-off-by: Leonardo Bras <leobras@redhat.com> Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230615065944.188876-5-leobras@redhat.com
2023-05-15cpu/hotplug: Mark arch_disable_smp_support() and bringup_nonboot_cpus() __initThomas Gleixner
No point in keeping them around. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.551974164@linutronix.de
2023-03-24trace,smp: Trace all smp_function_call*() invocationsPeter Zijlstra
(Ab)use the trace_ipi_send_cpu*() family to trace all smp_function_call*() invocations, not only those that result in an actual IPI. The queued entries log their callback function while the actual IPIs are traced on generic_smp_call_function_single_interrupt(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-03-24trace: Add trace_ipi_send_cpu()Peter Zijlstra
Because copying cpumasks around when targeting a single CPU is a bit daft... Tested-and-reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net
2023-03-24sched, smp: Trace smp callback causing an IPIValentin Schneider
Context ======= The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter which so far has only been fed with NULL. While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing struct layout (meaning their callback func can be accessed without caring about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function attached to its struct. This means we need to check the type of a CSD before eventually dereferencing its associated callback. This isn't as trivial as it sounds: the CSD type is stored in __call_single_node.u_flags, which get cleared right before the callback is executed via csd_unlock(). This implies checking the CSD type before it is enqueued on the call_single_queue, as the target CPU's queue can be flushed before we get to sending an IPI. Furthermore, send_call_function_single_ipi() only has a CPU parameter, and would need to have an additional argument to trickle down the invoked function. This is somewhat silly, as the extra argument will always be pushed down to the function even when nothing is being traced, which is unnecessary overhead. Changes ======= send_call_function_single_ipi() is only used by smp.c, and is defined in sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of a CPU's idle task). Split it into two parts: the scheduler bits remain in sched/core.c, and the actual IPI emission is moved into smp.c. This lets us define an __always_inline helper function that can take the related callback as parameter without creating useless register pressure in the non-traced path which only gains a (disabled) static branch. Do the same thing for the multi IPI case. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230307143558.294354-8-vschneid@redhat.com
2023-03-24smp: reword smp call IPI commentValentin Schneider
Accessing the call_single_queue hasn't involved a spinlock since 2014: 6897fc22ea01 ("kernel: use lockless list for smp_call_function_single") The llist operations (namely cmpxchg() and xchg()) provide similar ordering guarantees, update the comment to lessen confusion. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230307143558.294354-7-vschneid@redhat.com
2023-03-24smp: Trace IPIs sent via arch_send_call_function_ipi_mask()Valentin Schneider
This simply wraps around the arch function and prepends it with a tracepoint, similar to send_call_function_single_ipi(). Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230307143558.294354-4-vschneid@redhat.com
2023-03-24sched, smp: Trace IPIs sent via send_call_function_single_ipi()Valentin Schneider
send_call_function_single_ipi() is the thing that sends IPIs at the bottom of smp_call_function*() via either generic_exec_single() or smp_call_function_many_cond(). Give it an IPI-related tracepoint. Note that this ends up tracing any IPI sent via __smp_call_single_queue(), which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free". Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230307143558.294354-3-vschneid@redhat.com
2023-03-24kernel/smp: Make csdlock_debug= resettablePaul E. McKenney
It is currently possible to set the csdlock_debug_enabled static branch, but not to reset it. This is an issue when several different entities supply kernel boot parameters and also for kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT=y. Therefore, make the csdlock_debug=0 kernel boot parameter turn off debugging. Last one wins! Reported-by: Jes Sorensen <Jes.Sorensen@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230321005516.50558-4-paulmck@kernel.org
2023-03-24locking/csd_lock: Remove per-CPU data indirection from CSD lock debuggingPaul E. McKenney
The diagnostics added by this commit were extremely useful in one instance: a5aabace5fb8 ("locking/csd_lock: Add more data to CSD lock debugging") However, they have not seen much action since, and there have been some concerns expressed that the complexity is not worth the benefit. Therefore, manually revert the following commit preparatory commit: de7b09ef658d ("locking/csd_lock: Prepare more CSD lock debugging") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230321005516.50558-3-paulmck@kernel.org
2023-03-24locking/csd_lock: Remove added data from CSD lock debuggingPaul E. McKenney
The diagnostics added by this commit were extremely useful in one instance: a5aabace5fb8 ("locking/csd_lock: Add more data to CSD lock debugging") However, they have not seen much action since, and there have been some concerns expressed that the complexity is not worth the benefit. Therefore, manually revert this commit, but leave a comment telling people where to find these diagnostics. [ paulmck: Apply Juergen Gross feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230321005516.50558-2-paulmck@kernel.org
2023-03-24locking/csd_lock: Add Kconfig option for csd_debug defaultPaul E. McKenney
The csd_debug kernel parameter works well, but is inconvenient in cases where it is more closely associated with boot loaders or automation than with a particular kernel version or release. Thererfore, provide a new CSD_LOCK_WAIT_DEBUG_DEFAULT Kconfig option that defaults csd_debug to 1 when selected and 0 otherwise, with this latter being the default. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230321005516.50558-1-paulmck@kernel.org
2022-10-10Merge tag 'bitmap-6.1-rc1' of https://github.com/norov/linuxLinus Torvalds
Pull bitmap updates from Yury Norov: - Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES (Phil Auld) - cleanup nr_cpu_ids vs nr_cpumask_bits mess (me) This series cleans that mess and adds new config FORCE_NR_CPUS that allows to optimize cpumask subsystem if the number of CPUs is known at compile-time. - optimize find_bit() functions (me) Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT() macros. - add find_nth_bit() (me) Adds find_nth_bit(), which is ~70 times faster than bitcounting with for_each() loop: for_each_set_bit(bit, mask, size) if (n-- == 0) return bit; Also adds bitmap_weight_and() to let people replace this pattern: tmp = bitmap_alloc(nbits); bitmap_and(tmp, map1, map2, nbits); weight = bitmap_weight(tmp, nbits); bitmap_free(tmp); with a single bitmap_weight_and() call. - repair cpumask_check() (me) After switching cpumask to use nr_cpu_ids, cpumask_check() started generating many false-positive warnings. This series fixes it. - Add for_each_cpu_andnot() and for_each_cpu_andnot() (Valentin Schneider) Extends the API with one more function and applies it in sched/core. * tag 'bitmap-6.1-rc1' of https://github.com/norov/linux: (28 commits) sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot() lib/test_cpumask: Add for_each_cpu_and(not) tests cpumask: Introduce for_each_cpu_andnot() lib/find_bit: Introduce find_next_andnot_bit() cpumask: fix checking valid cpu range lib/bitmap: add tests for for_each() loops lib/find: optimize for_each() macros lib/bitmap: introduce for_each_set_bit_wrap() macro lib/find_bit: add find_next{,_and}_bit_wrap cpumask: switch for_each_cpu{,_not} to use for_each_bit() net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and} cpumask: add cpumask_nth_{,and,andnot} lib/bitmap: remove bitmap_ord_to_pos lib/bitmap: add tests for find_nth_bit() lib: add find_nth{,_and,_andnot}_bit() lib/bitmap: add bitmap_weight_and() lib/bitmap: don't call __bitmap_weight() in kernel code tools: sync find_bit() implementation lib/find_bit: optimize find_next_bit() functions lib/find_bit: create find_first_zero_bit_le() ...
2022-09-20lib/cpumask: add FORCE_NR_CPUS config optionYury Norov
The size of cpumasks is hard-limited by compile-time parameter NR_CPUS, but defined at boot-time when kernel parses ACPI/DT tables, and stored in nr_cpu_ids. In many practical cases, number of CPUs for a target is known at compile time, and can be provided with NR_CPUS. In that case, compiler may be instructed to rely on NR_CPUS as on actual number of CPUs, not an upper limit. It allows to optimize many cpumask routines and significantly shrink size of the kernel image. This patch adds FORCE_NR_CPUS option to teach the compiler to rely on NR_CPUS and enable corresponding optimizations. If FORCE_NR_CPUS=y, kernel will not set nr_cpu_ids at boot, but only check that the actual number of possible CPUs is equal to NR_CPUS, and WARN if that doesn't hold. The new option is especially useful in embedded applications because kernel configurations are unique for each SoC, the number of CPUs is constant and known well, and memory limitations are typically harder. For my 4-CPU ARM64 build with NR_CPUS=4, FORCE_NR_CPUS=y saves 46KB: add/remove: 3/4 grow/shrink: 46/729 up/down: 652/-46952 (-46300) Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-19smp: add set_nr_cpu_ids()Yury Norov
In preparation to support compile-time nr_cpu_ids, add a setter for the variable. This is a no-op for all arches. Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-19smp: don't declare nr_cpu_ids if NR_CPUS == 1Yury Norov
SMP and NR_CPUS are independent options, hence nr_cpu_ids may be declared even if NR_CPUS == 1, which is useless. Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-08-31sched/debug: Try trigger_single_cpu_backtrace(cpu) in dump_cpu_task()Zhen Lei
The trigger_all_cpu_backtrace() function attempts to send an NMI to the target CPU, which usually provides much better stack traces than the dump_cpu_task() function's approach of dumping that stack from some other CPU. So much so that most calls to dump_cpu_task() only happen after a call to trigger_all_cpu_backtrace() has failed. And the exception to this rule really should attempt to use trigger_all_cpu_backtrace() first. Therefore, move the trigger_all_cpu_backtrace() invocation into dump_cpu_task(). Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com>
2022-07-19locking/csd_lock: Change csdlock_debug from early_param to __setupChen Zhongjin
The csdlock_debug kernel-boot parameter is parsed by the early_param() function csdlock_debug(). If set, csdlock_debug() invokes static_branch_enable() to enable csd_lock_wait feature, which triggers a panic on arm64 for kernels built with CONFIG_SPARSEMEM=y and CONFIG_SPARSEMEM_VMEMMAP=n. With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in static_key_enable() and returns NULL, resulting in a NULL dereference because mem_section is initialized only later in sparse_init(). This is also a problem for powerpc because early_param() functions are invoked earlier than jump_label_init(), also resulting in static_key_enable() failures. These failures cause the warning "static key 'xxx' used before call to jump_label_init()". Thus, early_param is too early for csd_lock_wait to run static_branch_enable(), so changes it to __setup to fix these. Fixes: 8d0968cc6b8f ("locking/csd_lock: Add boot parameter for controlling CSD lock debugging") Cc: stable@vger.kernel.org Reported-by: Chen jingwen <chenjingwen6@huawei.com> Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>