summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-03-19Revert "tick/common: Make tick_periodic() check for missing ticks"Thomas Gleixner
This reverts commit d441dceb5dce71150f28add80d36d91bbfccba99 due to boot failures. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com>
2020-03-19time/sched_clock: Expire timer in hardirq contextAhmed S. Darwish
To minimize latency, PREEMPT_RT kernels expires hrtimers in preemptible softirq context by default. This can be overriden by marking the timer's expiry with HRTIMER_MODE_HARD. sched_clock_timer is missing this annotation: if its callback is preempted and the duration of the preemption exceeds the wrap around time of the underlying clocksource, sched clock will get out of sync. Mark the sched_clock_timer for expiry in hard interrupt context. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200309181529.26558-1-a.darwish@linutronix.de
2020-03-04tick/common: Make tick_periodic() check for missing ticksWaiman Long
The tick_periodic() function is used at the beginning part of the bootup process for time keeping while the other clock sources are being initialized. The current code assumes that all the timer interrupts are handled in a timely manner with no missing ticks. That is not actually true. Some ticks are missed and there are some discrepancies between the tick time (jiffies) and the timestamp reported in the kernel log. Some systems, however, are more prone to missing ticks than the others. In the extreme case, the discrepancy can actually cause a soft lockup message to be printed by the watchdog kthread. For example, on a Cavium ThunderX2 Sabre arm64 system: [ 25.496379] watchdog: BUG: soft lockup - CPU#14 stuck for 22s! On that system, the missing ticks are especially prevalent during the smp_init() phase of the boot process. With an instrumented kernel, it was found that it took about 24s as reported by the timestamp for the tick to accumulate 4s of time. Investigation and bisection done by others seemed to point to the commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof") as the culprit. It could also be a firmware issue as new firmware was promised that would fix the issue. To properly address this problem, stop assuming that there will be no missing tick in tick_periodic(). Modify it to follow the example of tick_do_update_jiffies64() by using another reference clock to check for missing ticks. Since the watchdog timer uses running_clock(), it is used here as the reference. With this applied, the soft lockup problem in the affected arm64 system is gone and tick time tracks much more closely to the timestamp time. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200207193929.27308-1-longman@redhat.com
2020-03-04hrtimer: Cast explicitely to u32t in __ktime_divns()Wen Yang
do_div() does a 64-by-32 division at least on 32bit platforms, while the divisor 'div' is explicitly casted to unsigned long, thus 64-bit on 64-bit platforms. The code already ensures that the divisor is less than 2^32. Hence the proper cast type is u32. Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200130130851.29204-1-wenyang@linux.alibaba.com
2020-03-04timekeeping: Prevent 32bit truncation in scale64_check_overflow()Wen Yang
While unlikely the divisor in scale64_check_overflow() could be >= 32bit in scale64_check_overflow(). do_div() truncates the divisor to 32bit at least on 32bit platforms. Use div64_u64() instead to avoid the truncation to 32-bit. [ tglx: Massaged changelog ] Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200120100523.45656-1-wenyang@linux.alibaba.com
2020-03-04posix-cpu-timers: Stop disabling timers on mt-execEric W. Biederman
The reasons why the extra posix_cpu_timers_exit_group() invocation has been added are not entirely clear from the commit message. Today all that posix_cpu_timers_exit_group() does is stop timers that are tracking the task from firing. Every other operation on those timers is still allowed. The practical implication of this is posix_cpu_timer_del() which could not get the siglock after the thread group leader has exited (because sighand == NULL) would be able to run successfully because the timer was already dequeued. With that locking issue fixed there is no point in disabling all of the timers. So remove this ``tempoary'' hack. Fixes: e0a70217107e ("posix-cpu-timers: workaround to suppress the problems with mt exec") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87o8tityzs.fsf@x220.int.ebiederm.org
2020-03-04posix-cpu-timers: Store a reference to a pid not a taskEric W. Biederman
posix cpu timers do not handle the death of a process well. This is most clearly seen when a multi-threaded process calls exec from a thread that is not the leader of the thread group. The posix cpu timer code continues to pin the old thread group leader and is unable to find the siglock from there. This results in posix_cpu_timer_del being unable to delete a timer, posix_cpu_timer_set being unable to set a timer. Further to compensate for the problems in posix_cpu_timer_del on a multi-threaded exec all timers that point at the multi-threaded task are stopped. The code for the timers fundamentally needs to check if the target process/thread is alive. This needs an extra level of indirection. This level of indirection is already available in struct pid. So replace cpu.task with cpu.pid to get the needed extra layer of indirection. In addition to handling things more cleanly this reduces the amount of memory a timer can pin when a process exits and then is reaped from a task_struct to the vastly smaller struct pid. Fixes: e0a70217107e ("posix-cpu-timers: workaround to suppress the problems with mt exec") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87wo86tz6d.fsf@x220.int.ebiederm.org
2020-03-01posix-cpu-timers: Pass the task into arm_timer()Eric W. Biederman
The task has been already computed to take siglock before calling arm_timer. So pass the benefit of that labor into arm_timer(). Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/8736auvdt1.fsf@x220.int.ebiederm.org
2020-03-01posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_groupEric W. Biederman
As of e78c3496790e ("time, signal: Protect resource use statistics with seqlock") cpu_clock_sample_group no longers needs siglock protection. Unfortunately no one realized it at the time. Remove the extra locking that is for cpu_clock_sample_group and not for cpu_clock_sample. This significantly simplifies the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/878skmvdts.fsf@x220.int.ebiederm.org
2020-03-01posix-cpu-timers: cpu_clock_sample_group() no longer needs siglockEric W. Biederman
As of e78c3496790e ("time, signal: Protect resource use statistics with seqlock") cpu_clock_sample_group() no longer needs siglock protection so remove the stale comment. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87eeuevduq.fsf@x220.int.ebiederm.org
2020-02-17posix-timers: Pass lockdep expression to RCU listsAmol Grover
head is traversed using hlist_for_each_entry_rcu outside an RCU read-side critical section but under the protection of hash_lock. Hence, add corresponding lockdep expression to silence false-positive lockdep warnings, and harden RCU lists. [ tglx: Removed the macro and put the condition right where it's used ] Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200216074330.GA14025@workstation-portable
2020-02-17timer: Improve the comment describing schedule_timeout()Alexander Popov
When working commit 6dcd5d7a7a29c1e, a mistake was noticed by Linus: schedule_timeout() was called without setting the task state to anything particular. It calls the scheduler, but doesn't delay anything, because the task stays runnable. That happens because sched_submit_work() does nothing for tasks in TASK_RUNNING state. That turned out to be the intended behavior. Adding a WARN() is not useful as the task could be woken up right after setting the state and before reaching schedule_timeout(). Improve the comment about schedule_timeout() and describe that more explicitly. Signed-off-by: Alexander Popov <alex.popov@linux.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200117225900.16340-1-alex.popov@linux.com
2020-02-17lib/vdso: Move VCLOCK_TIMENS to vdso_clock_modesThomas Gleixner
Move the time namespace indicator clock mode to the other ones for consistency sake. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lkml.kernel.org/r/20200207124403.656097274@linutronix.de
2020-02-17lib/vdso: Avoid highres update if clocksource is not VDSO capableThomas Gleixner
If the current clocksource is not VDSO capable there is no point in updating the high resolution parts of the VDSO data. Replace the architecture specific check with a check for a VDSO capable clocksource and skip the update if there is none. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lkml.kernel.org/r/20200207124403.563379423@linutronix.de
2020-02-17lib/vdso: Cleanup clock mode storage leftoversThomas Gleixner
Now that all architectures are converted to use the generic storage the helpers and conditionals can be removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lkml.kernel.org/r/20200207124403.470699892@linutronix.de
2020-02-17clocksource: Add common vdso clock mode storageThomas Gleixner
All architectures which use the generic VDSO code have their own storage for the VDSO clock mode. That's pointless and just requires duplicate code. Provide generic storage for it. The new Kconfig symbol is intermediate and will be removed once all architectures are converted over. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lkml.kernel.org/r/20200207124403.028046322@linutronix.de
2020-02-15Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes all over the place: - Fix NUMA over-balancing between lightly loaded nodes. This is fallout of the big load-balancer rewrite. - Fix the NOHZ remote loadavg update logic, which fixes anomalies like reported 150 loadavg on mostly idle CPUs. - Fix XFS performance/scalability - Fix throttled groups unbound task-execution bug - Fix PSI procfs boundary condition - Fix the cpu.uclamp.{min,max} cgroup configuration write checks - Fix DocBook annotations - Fix RCU annotations - Fix overly CPU-intensive housekeeper CPU logic loop on large CPU counts" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix kernel-doc warning in attach_entity_load_avg() sched/core: Annotate curr pointer in rq with __rcu sched/psi: Fix OOB write when writing 0 bytes to PSI files sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression sched/fair: Prevent unlimited runtime on throttled group sched/nohz: Optimize get_nohz_timer_target() sched/uclamp: Reject negative values in cpu_uclamp_write() sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains timers/nohz: Update NOHZ load in remote tick sched/core: Don't skip remote tick for idle CPUs
2020-02-14Merge tag 'pm-5.6-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "Fix three issues related to the handling of wakeup events signaled through the ACPI SCI while suspended to idle (Rafael Wysocki) and unexport an internal cpufreq variable (Yangtao Li)" * tag 'pm-5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: PM: s2idle: Prevent spurious SCIs from waking up the system ACPICA: Introduce acpi_any_gpe_status_set() ACPI: PM: s2idle: Avoid possible race related to the EC GPE ACPI: EC: Fix flushing of pending work cpufreq: Make cpufreq_global_kobject static
2020-02-11Merge tag 'trace-v5.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Various fixes: - Fix an uninitialized variable - Fix compile bug to bootconfig userspace tool (in tools directory) - Suppress some error messages of bootconfig userspace tool - Remove unneded CONFIG_LIBXBC from bootconfig - Allocate bootconfig xbc_nodes dynamically. To ease complaints about taking up static memory at boot up - Use of parse_args() to parse bootconfig instead of strstr() usage Prevents issues of double quotes containing the interested string - Fix missing ring_buffer_nest_end() on synthetic event error path - Return zero not -EINVAL on soft disabled synthetic event (soft disabling must be the same as hard disabling, which returns zero) - Consolidate synthetic event code (remove duplicate code)" * tag 'trace-v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Consolidate trace() functions tracing: Don't return -EINVAL when tracing soft disabled synth events tracing: Add missing nest end to synth_event_trace_start() error case tools/bootconfig: Suppress non-error messages bootconfig: Allocate xbc_nodes array dynamically bootconfig: Use parse_args() to find bootconfig and '--' tracing/kprobe: Fix uninitialized variable bug bootconfig: Remove unneeded CONFIG_LIBXBC tools/bootconfig: Fix wrong __VA_ARGS__ usage
2020-02-11sched/fair: Fix kernel-doc warning in attach_entity_load_avg()Randy Dunlap
Fix kernel-doc warning in kernel/sched/fair.c, caused by a recent function parameter removal: ../kernel/sched/fair.c:3526: warning: Excess function parameter 'flags' description in 'attach_entity_load_avg' Fixes: a4f9a0e51bbf ("sched/fair: Remove redundant call to cpufreq_update_util()") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/cbe964e4-6879-fd08-41c9-ef1917414af4@infradead.org
2020-02-11sched/core: Annotate curr pointer in rq with __rcuMadhuparna Bhowmik
This patch fixes the following sparse warnings in sched/core.c and sched/membarrier.c: kernel/sched/core.c:2372:27: error: incompatible types in comparison expression kernel/sched/core.c:4061:17: error: incompatible types in comparison expression kernel/sched/core.c:6067:9: error: incompatible types in comparison expression kernel/sched/membarrier.c:108:21: error: incompatible types in comparison expression kernel/sched/membarrier.c:177:21: error: incompatible types in comparison expression kernel/sched/membarrier.c:243:21: error: incompatible types in comparison expression Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200201125803.20245-1-madhuparnabhowmik10@gmail.com
2020-02-11sched/psi: Fix OOB write when writing 0 bytes to PSI filesSuren Baghdasaryan
Issuing write() with count parameter set to 0 on any file under /proc/pressure/ will cause an OOB write because of the access to buf[buf_size-1] when NUL-termination is performed. Fix this by checking for buf_size to be non-zero. Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
2020-02-11ACPI: PM: s2idle: Avoid possible race related to the EC GPERafael J. Wysocki
It is theoretically possible for the ACPI EC GPE to be set after the s2idle_ops->wake() called from s2idle_loop() has returned and before the subsequent pm_wakeup_pending() check is carried out. If that happens, the resulting wakeup event will cause the system to resume even though it may be a spurious one. To avoid that race, first make the ->wake() callback in struct platform_s2idle_ops return a bool value indicating whether or not to let the system resume and rearrange s2idle_loop() to use that value instad of the direct pm_wakeup_pending() call if ->wake() is present. Next, rework acpi_s2idle_wake() to process EC events and check pm_wakeup_pending() before re-arming the SCI for system wakeup to prevent it from triggering prematurely and add comments to that function to explain the rationale for the new code flow. Fixes: 56b991849009 ("PM: sleep: Simplify suspend-to-idle control flow") Cc: 5.4+ <stable@vger.kernel.org> # 5.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-02-10tracing: Consolidate trace() functionsTom Zanussi
Move the checking, buffer reserve and buffer commit code in synth_event_trace_start/end() into inline functions __synth_event_trace_start/end() so they can also be used by synth_event_trace() and synth_event_trace_array(), and then have all those functions use them. Also, change synth_event_trace_state.enabled to disabled so it only needs to be set if the event is disabled, which is not normally the case. Link: http://lkml.kernel.org/r/b1f3108d0f450e58192955a300e31d0405ab4149.1581374549.git.zanussi@kernel.org Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-10tracing: Don't return -EINVAL when tracing soft disabled synth eventsTom Zanussi
There's no reason to return -EINVAL when tracing a synthetic event if it's soft disabled - treat it the same as if it were hard disabled and return normally. Have synth_event_trace() and synth_event_trace_array() just return normally, and have synth_event_trace_start set the trace state to disabled and return. Link: http://lkml.kernel.org/r/df5d02a1625aff97c9866506c5bada6a069982ba.1581374549.git.zanussi@kernel.org Fixes: 8dcc53ad956d2 ("tracing: Add synth_event_trace() and related functions") Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-10tracing: Add missing nest end to synth_event_trace_start() error caseTom Zanussi
If the ring_buffer reserve in synth_event_trace_start() fails, the matching ring_buffer_nest_end() should be called in the error code, since nothing else will ever call it in this case. Link: http://lkml.kernel.org/r/20abc444b3eeff76425f895815380abe7aa53ff8.1581374549.git.zanussi@kernel.org Fixes: 8dcc53ad956d2 ("tracing: Add synth_event_trace() and related functions") Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-10Merge branch 'for-5.6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "I made a mistake while removing cgroup task list lazy init optimization making the root cgroup.procs show entries for the init_tasks. The zero entries doesn't cause critical failures but does make systemd print out warning messages during boot. Fix it by omitting init_tasks as they should be" * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: init_tasks shouldn't be linked to the root cgroup
2020-02-10tracing/kprobe: Fix uninitialized variable bugGustavo A. R. Silva
There is a potential execution path in which variable *ret* is returned without being properly initialized, previously. Fix this by initializing variable *ret* to 0. Link: http://lkml.kernel.org/r/20200205223404.GA3379@embeddedor Addresses-Coverity-ID: 1491142 ("Uninitialized scalar variable") Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-10sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, ↵Mel Gorman
to fix XFS performance regression The following XFS commit: 8ab39f11d974 ("xfs: prevent CIL push holdoff in log recovery") changed the logic from using bound workqueues to using unbound workqueues. Functionally this makes sense but it was observed at the time that the dbench performance dropped quite a lot and CPU migrations were increased. The current pattern of the task migration is straight-forward. With XFS, an IO issuer delegates work to xlog_cil_push_work ()on an unbound kworker. This runs on a nearby CPU and on completion, dbench wakes up on its old CPU as it is still idle and no migration occurs. dbench then queues the real IO on the blk_mq_requeue_work() work item which runs on a bound kworker which is forced to run on the same CPU as dbench. When IO completes, the bound kworker wakes dbench but as the kworker is a bound but, real task, the CPU is not considered idle and dbench gets migrated by select_idle_sibling() to a new CPU. dbench may ping-pong between two CPUs for a while but ultimately it starts a round-robin of all CPUs sharing the same LLC. High-frequency migration on each IO completion has poor performance overall. It has negative implications both in commication costs and power management. mpstat confirmed that at low thread counts that all CPUs sharing an LLC has low level of activity. Note that even if the CIL patch was reverted, there still would be migrations but the impact is less noticeable. It turns out that individually the scheduler, XFS, blk-mq and workqueues all made sensible decisions but in combination, the overall effect was sub-optimal. This patch special cases the IO issue/completion pattern and allows a bound kworker waker and a task wakee to stack on the same CPU if there is a strong chance they are directly related. The expectation is that the kworker is likely going back to sleep shortly. This is not guaranteed as the IO could be queued asynchronously but there is a very strong relationship between the task and kworker in this case that would justify stacking on the same CPU instead of migrating. There should be few concerns about kworker starvation given that the special casing is only when the kworker is the waker. DBench on XFS MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem UMA machine with 8 cores sharing LLC 5.5.0-rc7 5.5.0-rc7 tipsched-20200124 kworkerstack Amean 1 22.63 ( 0.00%) 20.54 * 9.23%* Amean 2 25.56 ( 0.00%) 23.40 * 8.44%* Amean 4 28.63 ( 0.00%) 27.85 * 2.70%* Amean 8 37.66 ( 0.00%) 37.68 ( -0.05%) Amean 64 469.47 ( 0.00%) 468.26 ( 0.26%) Stddev 1 1.00 ( 0.00%) 0.72 ( 28.12%) Stddev 2 1.62 ( 0.00%) 1.97 ( -21.54%) Stddev 4 2.53 ( 0.00%) 3.58 ( -41.19%) Stddev 8 5.30 ( 0.00%) 5.20 ( 1.92%) Stddev 64 86.36 ( 0.00%) 94.53 ( -9.46%) NUMA machine, 48 CPUs total, 24 CPUs share cache 5.5.0-rc7 5.5.0-rc7 tipsched-20200124 kworkerstack-v1r2 Amean 1 58.69 ( 0.00%) 30.21 * 48.53%* Amean 2 60.90 ( 0.00%) 35.29 * 42.05%* Amean 4 66.77 ( 0.00%) 46.55 * 30.28%* Amean 8 81.41 ( 0.00%) 68.46 * 15.91%* Amean 16 113.29 ( 0.00%) 107.79 * 4.85%* Amean 32 199.10 ( 0.00%) 198.22 * 0.44%* Amean 64 478.99 ( 0.00%) 477.06 * 0.40%* Amean 128 1345.26 ( 0.00%) 1372.64 * -2.04%* Stddev 1 2.64 ( 0.00%) 4.17 ( -58.08%) Stddev 2 4.35 ( 0.00%) 5.38 ( -23.73%) Stddev 4 6.77 ( 0.00%) 6.56 ( 3.00%) Stddev 8 11.61 ( 0.00%) 10.91 ( 6.04%) Stddev 16 18.63 ( 0.00%) 19.19 ( -3.01%) Stddev 32 38.71 ( 0.00%) 38.30 ( 1.06%) Stddev 64 100.28 ( 0.00%) 91.24 ( 9.02%) Stddev 128 186.87 ( 0.00%) 160.34 ( 14.20%) Dbench has been modified to report the time to complete a single "load file". This is a more meaningful metric for dbench that a throughput metric as the benchmark makes many different system calls that are not throughput-related Patch shows a 9.23% and 48.53% reduction in the time to process a load file with the difference partially explained by the number of CPUs sharing a LLC. In a separate run, task migrations were almost eliminated by the patch for low client counts. In case people have issue with the metric used for the benchmark, this is a comparison of the throughputs as reported by dbench on the NUMA machine. dbench4 Throughput (misleading but traditional) 5.5.0-rc7 5.5.0-rc7 tipsched-20200124 kworkerstack-v1r2 Hmean 1 321.41 ( 0.00%) 617.82 * 92.22%* Hmean 2 622.87 ( 0.00%) 1066.80 * 71.27%* Hmean 4 1134.56 ( 0.00%) 1623.74 * 43.12%* Hmean 8 1869.96 ( 0.00%) 2212.67 * 18.33%* Hmean 16 2673.11 ( 0.00%) 2806.13 * 4.98%* Hmean 32 3032.74 ( 0.00%) 3039.54 ( 0.22%) Hmean 64 2514.25 ( 0.00%) 2498.96 * -0.61%* Hmean 128 1778.49 ( 0.00%) 1746.05 * -1.82%* Note that this is somewhat specific to XFS and ext4 shows no performance difference as it does not rely on kworkers in the same way. No major problem was observed running other workloads on different machines although not all tests have completed yet. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200128154006.GD3466@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-02-09Merge tag 'kbuild-v5.6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - fix randconfig to generate a sane .config - rename hostprogs-y / always to hostprogs / always-y, which are more natual syntax. - optimize scripts/kallsyms - fix yes2modconfig and mod2yesconfig - make multiple directory targets ('make foo/ bar/') work * tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: make multiple directory targets work kconfig: Invalidate all symbols after changing to y or m. kallsyms: fix type of kallsyms_token_table[] scripts/kallsyms: change table to store (strcut sym_entry *) scripts/kallsyms: rename local variables in read_symbol() kbuild: rename hostprogs-y/always to hostprogs/always-y kbuild: fix the document to use extra-y for vmlinux.lds kconfig: fix broken dependency in randconfig-generated .config
2020-02-09Merge tag 'x86-urgent-2020-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of fixes for X86: - Ensure that the PIT is set up when the local APIC is disable or configured in legacy mode. This is caused by an ordering issue introduced in the recent changes which skip PIT initialization when the TSC and APIC frequencies are already known. - Handle malformed SRAT tables during early ACPI parsing which caused an infinite loop anda boot hang. - Fix a long standing race in the affinity setting code which affects PCI devices with non-maskable MSI interrupts. The problem is caused by the non-atomic writes of the MSI address (destination APIC id) and data (vector) fields which the device uses to construct the MSI message. The non-atomic writes are mandated by PCI. If both fields change and the device raises an interrupt after writing address and before writing data, then the MSI block constructs a inconsistent message which causes interrupts to be lost and subsequent malfunction of the device. The fix is to redirect the interrupt to the new vector on the current CPU first and then switch it over to the new target CPU. This allows to observe an eventually raised interrupt in the transitional stage (old CPU, new vector) to be observed in the APIC IRR and retriggered on the new target CPU and the new vector. The potential spurious interrupts caused by this are harmless and can in the worst case expose a buggy driver (all handlers have to be able to deal with spurious interrupts as they can and do happen for various reasons). - Add the missing suspend/resume mechanism for the HYPERV hypercall page which prevents resume hibernation on HYPERV guests. This change got lost before the merge window. - Mask the IOAPIC before disabling the local APIC to prevent potentially stale IOAPIC remote IRR bits which cause stale interrupt lines after resume" * tag 'x86-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/apic: Mask IOAPIC entries when disabling the local APIC x86/hyperv: Suspend/resume the hypercall page for hibernation x86/apic/msi: Plug non-maskable MSI affinity race x86/boot: Handle malformed SRAT tables during early ACPI parsing x86/timer: Don't skip PIT setup when APIC is disabled or in legacy mode
2020-02-09Merge tag 'smp-urgent-2020-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP fixes from Thomas Gleixner: "Two fixes for the SMP related functionality: - Make the UP version of smp_call_function_single() match SMP semantics when called for a not available CPU. Instead of emitting a warning and assuming that the function call target is CPU0, return a proper error code like the SMP version does. - Remove a superfluous check in smp_call_function_many_cond()" * tag 'smp-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp/up: Make smp_call_function_single() match SMP semantics smp: Remove superfluous cond_func check in smp_call_function_many_cond()
2020-02-09Merge tag 'perf-urgent-2020-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A set of fixes and improvements for the perf subsystem: Kernel fixes: - Install cgroup events to the correct CPU context to prevent a potential list double add - Prevent an integer underflow in the perf mlock accounting - Add a missing prototype for arch_perf_update_userpage() Tooling: - Add a missing unlock in the error path of maps__insert() in perf maps. - Fix the build with the latest libbfd - Fix the perf parser so it does not delete parse event terms, which caused a regression for using perf with the ARM CoreSight as the sink configuration was missing due to the deletion. - Fix the double free in the perf CPU map merging test case - Add the missing ustring support for the perf probe command" * tag 'perf-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf maps: Add missing unlock to maps__insert() error case perf probe: Add ustring support for perf probe command perf: Make perf able to build with latest libbfd perf test: Fix test case Merge cpu map perf parse: Copy string to perf_evsel_config_term perf parse: Refactor 'struct perf_evsel_config_term' kernel/events: Add a missing prototype for arch_perf_update_userpage() perf/cgroups: Install cgroup events to correct cpuctx perf/core: Fix mlock accounting in perf_mmap()
2020-02-09Merge tag 'timers-urgent-2020-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two small fixes for the time(r) subsystem: - Handle a subtle race between the clocksource watchdog and a concurrent clocksource watchdog stop/start sequence correctly to prevent a timer double add bug. - Fix the file path for the core time namespace file" * tag 'timers-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Prevent double add_timer_on() for watchdog_timer MAINTAINERS: Correct path to time namespace source file
2020-02-09Merge tag 'irq-urgent-2020-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt fixes from Thomas Gleixner: "A set of fixes for the interrupt subsystem: - Provision only ACPI enabled redistributors on GICv3 - Use the proper command colums when building the INVALL command for the GICv3-ITS - Ensure the allocation of the L2 vPE table for GICv4.1 - Correct the GICv4.1 VPROBASER programming so it uses the proper size - A set of small GICv4.1 tidy up patches - Configuration cleanup for C-SKY interrupt chip - Clarify the function documentation for irq_set_wake() to document that the wakeup functionality is orthogonal to the irq disable/enable mechanism" * tag 'irq-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Rename VPENDBASER/VPROPBASER accessors irqchip/gic-v3-its: Remove superfluous WARN_ON irqchip/gic-v4.1: Drop 'tmp' in inherit_vpe_l1_table_from_rd() irqchip/gic-v4.1: Ensure L2 vPE table is allocated at RD level irqchip/gic-v4.1: Set vpe_l1_base for all redistributors irqchip/gic-v4.1: Fix programming of GICR_VPROPBASER_4_1_SIZE genirq: Clarify that irq wake state is orthogonal to enable/disable irqchip/gic-v3-its: Reference to its_invall_cmd descriptor when building INVALL irqchip: Some Kconfig cleanup for C-SKY irqchip/gic-v3: Only provision redistributors that are enabled in ACPI
2020-02-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Unbalanced locking in mwifiex_process_country_ie, from Brian Norris. 2) Fix thermal zone registration in iwlwifi, from Andrei Otcheretianski. 3) Fix double free_irq in sgi ioc3 eth, from Thomas Bogendoerfer. 4) Use after free in mptcp, from Florian Westphal. 5) Use after free in wireguard's root_remove_peer_lists, from Eric Dumazet. 6) Properly access packets heads in bonding alb code, from Eric Dumazet. 7) Fix data race in skb_queue_len(), from Qian Cai. 8) Fix regression in r8169 on some chips, from Heiner Kallweit. 9) Fix XDP program ref counting in hv_netvsc, from Haiyang Zhang. 10) Certain kinds of set link netlink operations can cause a NULL deref in the ipv6 addrconf code. Fix from Eric Dumazet. 11) Don't cancel uninitialized work queue in drop monitor, from Ido Schimmel. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits) net: thunderx: use proper interface type for RGMII mt76: mt7615: fix max_nss in mt7615_eeprom_parse_hw_cap bpf: Improve bucket_log calculation logic selftests/bpf: Test freeing sockmap/sockhash with a socket in it bpf, sockhash: Synchronize_rcu before free'ing map bpf, sockmap: Don't sleep while holding RCU lock on tear-down bpftool: Don't crash on missing xlated program instructions bpf, sockmap: Check update requirements after locking drop_monitor: Do not cancel uninitialized work item mlxsw: spectrum_dpipe: Add missing error path mlxsw: core: Add validation of hardware device types for MGPIR register mlxsw: spectrum_router: Clear offload indication from IPv6 nexthops on abort selftests: mlxsw: Add test cases for local table route replacement mlxsw: spectrum_router: Prevent incorrect replacement of local table routes net: dsa: microchip: enable module autoprobe ipv6/addrconf: fix potential NULL deref in inet6_set_link_af() dpaa_eth: support all modes with rate adapting PHYs net: stmmac: update pci platform data to use phy_interface net: stmmac: xgmac: fix missing IFF_MULTICAST checki in dwxgmac2_set_filter net: stmmac: fix missing IFF_MULTICAST check in dwmac4_set_filter ...
2020-02-08Merge branch 'merge.nfs-fs_parse.1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs file system parameter updates from Al Viro: "Saner fs_parser.c guts and data structures. The system-wide registry of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is the horror switch() in fs_parse() that would have to grow another case every time something got added to that system-wide registry. New syntax types can be added by filesystems easily now, and their namespace is that of functions - not of system-wide enum members. IOW, they can be shared or kept private and if some turn out to be widely useful, we can make them common library helpers, etc., without having to do anything whatsoever to fs_parse() itself. And we already get that kind of requests - the thing that finally pushed me into doing that was "oh, and let's add one for timeouts - things like 15s or 2h". If some filesystem really wants that, let them do it. Without somebody having to play gatekeeper for the variants blessed by direct support in fs_parse(), TYVM. Quite a bit of boilerplate is gone. And IMO the data structures make a lot more sense now. -200LoC, while we are at it" * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits) tmpfs: switch to use of invalfc() cgroup1: switch to use of errorfc() et.al. procfs: switch to use of invalfc() hugetlbfs: switch to use of invalfc() cramfs: switch to use of errofc() et.al. gfs2: switch to use of errorfc() et.al. fuse: switch to use errorfc() et.al. ceph: use errorfc() and friends instead of spelling the prefix out prefix-handling analogues of errorf() and friends turn fs_param_is_... into functions fs_parse: handle optional arguments sanely fs_parse: fold fs_parameter_desc/fs_parameter_spec fs_parser: remove fs_parameter_description name field add prefix to fs_context->log ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log new primitive: __fs_parse() switch rbd and libceph to p_log-based primitives struct p_log, variants of warnf() et.al. taking that one instead teach logfc() to handle prefices, give it saner calling conventions get rid of cg_invalf() ...
2020-02-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2020-02-07 The following pull-request contains BPF updates for your *net* tree. We've added 15 non-merge commits during the last 10 day(s) which contain a total of 12 files changed, 114 insertions(+), 31 deletions(-). The main changes are: 1) Various BPF sockmap fixes related to RCU handling in the map's tear- down code, from Jakub Sitnicki. 2) Fix macro state explosion in BPF sk_storage map when calculating its bucket_log on allocation, from Martin KaFai Lau. 3) Fix potential BPF sockmap update race by rechecking socket's established state under lock, from Lorenz Bauer. 4) Fix crash in bpftool on missing xlated instructions when kptr_restrict sysctl is set, from Toke Høiland-Jørgensen. 5) Fix i40e's XSK wakeup code to return proper error in busy state and various misc fixes in xdpsock BPF sample code, from Maciej Fijalkowski. 6) Fix the way modifiers are skipped in BTF in the verifier while walking pointers to avoid program rejection, from Alexei Starovoitov. 7) Fix Makefile for runqslower BPF tool to i) rebuild on libbpf changes and ii) to fix undefined reference linker errors for older gcc version due to order of passed gcc parameters, from Yulia Kartseva and Song Liu. 8) Fix a trampoline_count BPF kselftest warning about missing braces around initializer, from Andrii Nakryiko. 9) Fix up redundant "HAVE" prefix from large INSN limit kernel probe in bpftool, from Michal Rostecki. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07genirq: Clarify that irq wake state is orthogonal to enable/disableStephen Boyd
There's some confusion around if an irq that's disabled with disable_irq() can still wake the system from sleep states such as "suspend to RAM". Clarify this in the kernel documentation for irq_set_irq_wake() so that it's clear that an irq can be disabled and still wake the system if it has been marked for wakeup. Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lkml.kernel.org/r/20200206191521.94559-1-swboyd@chromium.org
2020-02-07cgroup1: switch to use of errorfc() et.al.Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07fs_parse: fold fs_parameter_desc/fs_parameter_specAl Viro
The former contains nothing but a pointer to an array of the latter... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07fs_parser: remove fs_parameter_description name fieldEric Sandeen
Unused now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07get rid of cg_invalf()Al Viro
pointless alias for invalf()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07smp/up: Make smp_call_function_single() match SMP semanticsPaul E. McKenney
In CONFIG_SMP=y kernels, smp_call_function_single() returns -ENXIO when invoked for a non-existent CPU. In contrast, in CONFIG_SMP=n kernels, a splat is emitted and smp_call_function_single() otherwise silently ignores its "cpu" argument, instead pretending that the caller intended to have something happen on CPU 0. Given that there is now code that expects smp_call_function_single() to return an error if a bad CPU was specified, this difference in semantics needs to be addressed. Bring the semantics of the CONFIG_SMP=n version of smp_call_function_single() into alignment with its CONFIG_SMP=y counterpart. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200205143409.GA7021@paulmck-ThinkPad-P72
2020-02-06Merge tag 'kgdb-fixes-5.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb fix from Daniel Thompson: "One of the simplifications added for 5.6-rc1 has caused build regressions on some platforms (it was reported for sparc64). This fixes it with a revert" * tag 'kgdb-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: Revert "kdb: Get rid of confusing diag msg from "rd" if current task has no regs"
2020-02-06Revert "kdb: Get rid of confusing diag msg from "rd" if current task has no ↵Daniel Thompson
regs" This reverts commit bbfceba15f8d1260c328a254efc2b3f2deae4904. When DBG_MAX_REG_NUM is zero then a number of symbols are conditionally defined. It is therefore not possible to check it using C expressions. Reported-by: Anatoly Pugachev <matorola@gmail.com> Acked-by: Doug Anderson <dianders@chromium.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-02-06Merge tag 'trace-v5.6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - Added new "bootconfig". This looks for a file appended to initrd to add boot config options, and has been discussed thoroughly at Linux Plumbers. Very useful for adding kprobes at bootup. Only enabled if "bootconfig" is on the real kernel command line. - Created dynamic event creation. Merges common code between creating synthetic events and kprobe events. - Rename perf "ring_buffer" structure to "perf_buffer" - Rename ftrace "ring_buffer" structure to "trace_buffer" Had to rename existing "trace_buffer" to "array_buffer" - Allow trace_printk() to work withing (some) tracing code. - Sort of tracing configs to be a little better organized - Fixed bug where ftrace_graph hash was not being protected properly - Various other small fixes and clean ups * tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits) bootconfig: Show the number of nodes on boot message tools/bootconfig: Show the number of bootconfig nodes bootconfig: Add more parse error messages bootconfig: Use bootconfig instead of boot config ftrace: Protect ftrace_graph_hash with ftrace_sync ftrace: Add comment to why rcu_dereference_sched() is open coded tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu tracing: Annotate ftrace_graph_hash pointer with __rcu bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline tracing: Use seq_buf for building dynevent_cmd string tracing: Remove useless code in dynevent_arg_pair_add() tracing: Remove check_arg() callbacks from dynevent args tracing: Consolidate some synth_event_trace code tracing: Fix now invalid var_ref_vals assumption in trace action tracing: Change trace_boot to use synth_event interface tracing: Move tracing selftests to bottom of menu tracing: Move mmio tracer config up with the other tracers tracing: Move tracing test module configs together tracing: Move all function tracing configs together tracing: Documentation for in-kernel synthetic event API ...
2020-02-05ftrace: Protect ftrace_graph_hash with ftrace_syncSteven Rostedt (VMware)
As function_graph tracer can run when RCU is not "watching", it can not be protected by synchronize_rcu() it requires running a task on each CPU before it can be freed. Calling schedule_on_each_cpu(ftrace_sync) needs to be used. Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72 Cc: stable@vger.kernel.org Fixes: b9b0c831bed26 ("ftrace: Convert graph filter to use hash tables") Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-05ftrace: Add comment to why rcu_dereference_sched() is open codedSteven Rostedt (VMware)
Because the function graph tracer can execute in sections where RCU is not "watching", the rcu_dereference_sched() for the has needs to be open coded. This is fine because the RCU "flavor" of the ftrace hash is protected by its own RCU handling (it does its own little synchronization on every CPU and does not rely on RCU sched). Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-05tracing: Annotate ftrace_graph_notrace_hash pointer with __rcuAmol Grover
Fix following instances of sparse error kernel/trace/ftrace.c:5667:29: error: incompatible types in comparison kernel/trace/ftrace.c:5813:21: error: incompatible types in comparison kernel/trace/ftrace.c:5868:36: error: incompatible types in comparison kernel/trace/ftrace.c:5870:25: error: incompatible types in comparison Use rcu_dereference_protected to dereference the newly annotated pointer. Link: http://lkml.kernel.org/r/20200205055701.30195-1-frextrite@gmail.com Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>