summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2014-06-03Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull scheduler updates from Ingo Molnar: "The main scheduling related changes in this cycle were: - various sched/numa updates, for better performance - tree wide cleanup of open coded nice levels - nohz fix related to rq->nr_running use - cpuidle changes and continued consolidation to improve the kernel/sched/idle.c high level idle scheduling logic. As part of this effort I pulled cpuidle driver changes from Rafael as well. - standardized idle polling amongst architectures - continued work on preparing better power/energy aware scheduling - sched/rt updates - misc fixlets and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits) sched/numa: Decay ->wakee_flips instead of zeroing sched/numa: Update migrate_improves/degrades_locality() sched/numa: Allow task switch if load imbalance improves sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice() sched: Initialize rq->age_stamp on processor start sched, nohz: Change rq->nr_running to always use wrappers sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance() sched: Use clamp() and clamp_val() to make sys_nice() more readable sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups() sched/numa: Fix initialization of sched_domain_topology for NUMA sched: Call select_idle_sibling() when not affine_sd sched: Simplify return logic in sched_read_attr() sched: Simplify return logic in sched_copy_attr() sched: Fix exec_start/task_hot on migrated tasks arm64: Remove TIF_POLLING_NRFLAG metag: Remove TIF_POLLING_NRFLAG sched/idle: Make cpuidle_idle_call() void sched/idle: Reflow cpuidle_idle_call() sched/idle: Delay clearing the polling bit ...
2014-06-03Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull perf updates from Ingo Molnar: "The tooling changes maintained by Jiri Olsa until Arnaldo is on vacation: User visible changes: - Add -F option for specifying output fields (Namhyung Kim) - Propagate exit status of a command line workload for record command (Namhyung Kim) - Use tid for finding thread (Namhyung Kim) - Clarify the output of perf sched map plus small sched command fixes (Dongsheng Yang) - Wire up perf_regs and unwind support for ARM64 (Jean Pihet) - Factor hists statistics counts processing which in turn also fixes several bugs in TUI report command (Namhyung Kim) - Add --percentage option to control absolute/relative percentage output (Namhyung Kim) - Add --list-cmds to 'kmem', 'mem', 'lock' and 'sched', for use by completion scripts (Ramkumar Ramachandra) Development/infrastructure changes and fixes: - Android related fixes for pager and map dso resolving (Michael Lentine) - Add libdw DWARF post unwind support for ARM (Jean Pihet) - Consolidate types.h for ARM and ARM64 (Jean Pihet) - Fix possible null pointer dereference in session.c (Masanari Iida) - Cleanup, remove unused variables in map_switch_event() (Dongsheng Yang) - Remove nr_state_machine_bugs in perf latency (Dongsheng Yang) - Remove usage of trace_sched_wakeup(.success) (Peter Zijlstra) - Cleanups for perf.h header (Jiri Olsa) - Consolidate types.h and export.h within tools (Borislav Petkov) - Move u64_swap union to its single user's header, evsel.h (Borislav Petkov) - Fix for s390 to properly parse tracepoints plus test code (Alexander Yarygin) - Handle EINTR error for readn/writen (Namhyung Kim) - Add a test case for hists filtering (Namhyung Kim) - Share map_groups among threads of the same group (Arnaldo Carvalho de Melo, Jiri Olsa) - Making some code (cpu node map and report parse callchain callback) global to be usable by upcomming changes (Don Zickus) - Fix pmu object compilation error (Jiri Olsa) Kernel side changes: - intrusive uprobes fixes from Oleg Nesterov. Since the interface is admin-only, and the bug only affects user-space ("any probed jmp/call can kill the application"), we queued these fixes via the development tree, as a special exception. - more fuzzer motivated race fixes and related refactoring and robustization. - allow PMU drivers to be built as modules. (No actual module yet, because the x86 Intel uncore module wasn't ready in time for this)" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits) perf tools: Add automatic remapping of Android libraries perf tools: Add cat as fallback pager perf tests: Add a testcase for histogram output sorting perf tests: Factor out print_hists_*() perf tools: Introduce reset_output_field() perf tools: Get rid of obsolete hist_entry__sort_list perf hists: Reset width of output fields with header length perf tools: Skip elided sort entries perf top: Add --fields option to specify output fields perf report/tui: Fix a bug when --fields/sort is given perf tools: Add ->sort() member to struct sort_entry perf report: Add -F option to specify output fields perf tools: Call perf_hpp__init() before setting up GUI browsers perf tools: Consolidate management of default sort orders perf tools: Allow hpp fields to be sort keys perf ui: Get rid of callback from __hpp__fmt() perf tools: Consolidate output field handling to hpp format routines perf tools: Use hpp formats to sort final output perf tools: Support event grouping in hpp ->sort() perf tools: Use hpp formats to sort hist entries ...
2014-06-03Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__*() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__*() arch,doc: Convert smp_mb__*() arch,xtensa: Convert smp_mb__*() arch,x86: Convert smp_mb__*() arch,tile: Convert smp_mb__*() arch,sparc: Convert smp_mb__*() arch,sh: Convert smp_mb__*() arch,score: Convert smp_mb__*() arch,s390: Convert smp_mb__*() arch,powerpc: Convert smp_mb__*() arch,parisc: Convert smp_mb__*() arch,openrisc: Convert smp_mb__*() arch,mn10300: Convert smp_mb__*() arch,mips: Convert smp_mb__*() arch,metag: Convert smp_mb__*() arch,m68k: Convert smp_mb__*() arch,m32r: Convert smp_mb__*() arch,ia64: Convert smp_mb__*() ...
2014-06-03Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull RCU changes from Ingo Molnar: "The main RCU changes in this cycle were: - RCU torture-test changes. - variable-name renaming cleanup. - update RCU documentation. - miscellaneous fixes. - patch to suppress RCU stall warnings while sysrq requests are being processed" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits) rcu: Provide API to suppress stall warnings while sysrc runs rcu: Variable name changed in tree_plugin.h and used in tree.c torture: Remove unused definition torture: Remove __init from torture_init_begin/end torture: Check for multiple concurrent torture tests locktorture: Remove reference to nonexistent Kconfig parameter rcutorture: Run rcu_torture_writer at normal priority rcutorture: Note diffs from git commits rcutorture: Add missing destroy_timer_on_stack() rcutorture: Explicitly test synchronous grace-period primitives rcutorture: Add tests for get_state_synchronize_rcu() rcutorture: Test RCU-sched primitives in TREE_PREEMPT_RCU kernels torture: Use elapsed time to detect hangs rcutorture: Check for rcu_torture_fqs creation errors torture: Better summary diagnostics for build failures torture: Notice if an all-zero cpumask is passed inside a critical section rcutorture: Make rcu_torture_reader() use cond_resched() sched,rcu: Make cond_resched() report RCU quiescent states percpu: Fix raw_cpu_inc_return() rcutorture: Export RCU grace-period kthread wait state to rcutorture ...
2014-06-03Merge tag 'tty-3.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty into next Pull tty/serial driver updates from Greg KH: "Here is the big tty / serial driver pull request for 3.16-rc1. A variety of different serial driver fixes and updates and additions, nothing huge, and no real major core tty changes at all. All have been in linux-next for a while" * tag 'tty-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (84 commits) Revert "serial: imx: remove the DMA wait queue" serial: kgdb_nmi: Improve console integration with KDB I/O serial: kgdb_nmi: Switch from tasklets to real timers serial: kgdb_nmi: Use container_of() to locate private data serial: cpm_uart: No LF conversion in put_poll_char() serial: sirf: Fix compilation failure console: Remove superfluous readonly check console: Use explicit pointer type for vc_uni_pagedir* fields vgacon: Fix & cleanup refcounting ARM: tty: Move HVC DCC assembly to arch/arm tty/hvc/hvc_console: Fix wakeup of HVC thread on hvc_kick() drivers/tty/n_hdlc.c: replace kmalloc/memset by kzalloc vt: emulate 8- and 24-bit colour codes. printk/of_serial: fix serial console cessation part way through boot. serial: 8250_dma: check the result of TX buffer mapping serial: uart: add hw flow control support configuration tty/serial: at91: add interrupts for modem control lines tty/serial: at91: use mctrl_gpio helpers tty/serial: Add GPIOLIB helpers for controlling modem lines ARM: at91: gpio: implement get_direction ...
2014-06-03Merge tag 'driver-core-3.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core into next Pull driver core / kernfs changes from Greg KH: "Here is the "big" pull request for 3.16-rc1. Not a lot of changes here, some kernfs work, a revert of a very old driver core change that ended up cauing some memory leaks on driver probe error paths, and other minor things. As was pointed out earlier today, one commit here, 26fc9cd200ec ("kernfs: move the last knowledge of sysfs out from kernfs") is also needed in your 3.15-final branch as well. If you could cherry-pick it there, it would be most appreciated by Andy Lutomirski to prevent a regression there. All of these have been in linux-next for a while" * tag 'driver-core-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: crypto/nx/nx-842: dev_set_drvdata can no longer fail kernfs: move the last knowledge of sysfs out from kernfs sysfs: fix attribute_group bin file path on removal sysfs.h: don't return a void-valued expression in sysfs_remove_file init.h: Update initcall_sync variants to fix build errors driver core: Inline dev_set/get_drvdata driver core: dev_get_drvdata: Don't check for NULL dev driver core: dev_set_drvdata returns void driver core: dev_set_drvdata can no longer fail driver core: Move driver_data back to struct device lib/devres.c: fix checkpatch warnings lib/devres.c: use dev in devm_request_and_ioremap kobject: Make support for uevent_helper optional. kernfs: make kernfs_notify() trigger inotify events too kernfs: implement kernfs_root->supers list
2014-06-02Merge tag 'pci-v3.16-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into next Pull PCI changes from Bjorn Helgaas: "Enumeration - Notify driver before and after device reset (Keith Busch) - Use reset notification in NVMe (Keith Busch) NUMA - Warn if we have to guess host bridge node information (Myron Stowe) - Work around AMD Fam15h BIOSes that fail to provide _PXM (Suravee Suthikulpanit) - Clean up and mark early_root_info_init() as deprecated (Suravee Suthikulpanit) Driver binding - Add "driver_override" for force specific binding (Alex Williamson) - Fail "new_id" addition for devices we already know about (Bandan Das) Resource management - Support BAR sizes up to 8GB (Nikhil Rao, Alan Cox) - Don't move IORESOURCE_PCI_FIXED resources (Bjorn Helgaas) - Mark SBx00 HPET BAR as IORESOURCE_PCI_FIXED (Bjorn Helgaas) - Fail safely if we can't handle BARs larger than 4GB (Bjorn Helgaas) - Reject BAR above 4GB if dma_addr_t is too small (Bjorn Helgaas) - Don't convert BAR address to resource if dma_addr_t is too small (Bjorn Helgaas) - Don't set BAR to zero if dma_addr_t is too small (Bjorn Helgaas) - Don't print anything while decoding is disabled (Bjorn Helgaas) - Don't add disabled subtractive decode bus resources (Bjorn Helgaas) - Add resource allocation comments (Bjorn Helgaas) - Restrict 64-bit prefetchable bridge windows to 64-bit resources (Yinghai Lu) - Assign i82875p_edac PCI resources before adding device (Yinghai Lu) PCI device hotplug - Remove unnecessary "dev->bus" test (Bjorn Helgaas) - Use PCI_EXP_SLTCAP_PSN define (Bjorn Helgaas) - Fix rphahp endianess issues (Laurent Dufour) - Acknowledge spurious "cmd completed" event (Rajat Jain) - Allow hotplug service drivers to operate in polling mode (Rajat Jain) - Fix cpqphp possible NULL dereference (Rickard Strandqvist) MSI - Replace pci_enable_msi_block() by pci_enable_msi_exact() (Alexander Gordeev) - Replace pci_enable_msix() by pci_enable_msix_exact() (Alexander Gordeev) - Simplify populate_msi_sysfs() (Jan Beulich) Virtualization - Add Intel Patsburg (X79) root port ACS quirk (Alex Williamson) - Mark RTL8110SC INTx masking as broken (Alex Williamson) Generic host bridge driver - Add generic PCI host controller driver (Will Deacon) Freescale i.MX6 - Use new clock names (Lucas Stach) - Drop old IRQ mapping (Lucas Stach) - Remove optional (and unused) IRQs (Lucas Stach) - Add support for MSI (Lucas Stach) - Fix imx6_add_pcie_port() section mismatch warning (Sachin Kamat) Renesas R-Car - Add gen2 device tree support (Ben Dooks) - Use new OF interrupt mapping when possible (Lucas Stach) - Add PCIe driver (Phil Edworthy) - Add PCIe MSI support (Phil Edworthy) - Add PCIe device tree bindings (Phil Edworthy) Samsung Exynos - Remove unnecessary OOM messages (Jingoo Han) - Fix add_pcie_port() section mismatch warning (Sachin Kamat) Synopsys DesignWare - Make MSI ISR shared IRQ aware (Lucas Stach) Miscellaneous - Check for broken config space aliasing (Alex Williamson) - Update email address (Ben Hutchings) - Fix Broadcom CNB20LE unintended sign extension (Bjorn Helgaas) - Fix incorrect vgaarb conditional in WARN_ON() (Bjorn Helgaas) - Remove unnecessary __ref annotations (Bjorn Helgaas) - Add arch/x86/kernel/quirks.c to MAINTAINERS PCI file patterns (Bjorn Helgaas) - Fix use of uninitialized MPS value (Bjorn Helgaas) - Tidy x86/gart messages (Bjorn Helgaas) - Fix return value from pci_user_{read,write}_config_*() (Gavin Shan) - Turn pcibios_penalize_isa_irq() into a weak function (Hanjun Guo) - Remove unused serial device IDs (Jean Delvare) - Use designated initialization in PCI_VDEVICE (Mark Rustad) - Fix powerpc NULL dereference in pci_root_buses traversal (Mike Qiu) - Configure MPS on ARM (Murali Karicheri) - Remove unnecessary includes of <linux/init.h> (Paul Gortmaker) - Move Open Firmware devspec attribute to PCI common code (Sebastian Ott) - Use pdev->dev.groups for attribute creation on s390 (Sebastian Ott) - Remove pcibios_add_platform_entries() (Sebastian Ott) - Add new ID for Intel GPU "spurious interrupt" quirk (Thomas Jarosch) - Rename pci_is_bridge() to pci_has_subordinate() (Yijing Wang) - Add and use new pci_is_bridge() interface (Yijing Wang) - Make pci_bus_add_device() void (Yijing Wang) DMA API - Clarify physical/bus address distinction in docs (Bjorn Helgaas) - Fix typos in docs (Emilio López) - Update dma_pool_create ()and dma_pool_alloc() descriptions (Gioh Kim) - Change dma_declare_coherent_memory() CPU address to phys_addr_t (Bjorn Helgaas) - Pass GAPSPCI_DMA_BASE CPU & bus address to dma_declare_coherent_memory() (Bjorn Helgaas)" * tag 'pci-v3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (92 commits) MAINTAINERS: Add generic PCI host controller driver PCI: generic: Add generic PCI host controller driver PCI: imx6: Add support for MSI PCI: designware: Make MSI ISR shared IRQ aware PCI: imx6: Remove optional (and unused) IRQs PCI: imx6: Drop old IRQ mapping PCI: imx6: Use new clock names i82875p_edac: Assign PCI resources before adding device ARM/PCI: Call pcie_bus_configure_settings() to set MPS PCI: imx6: Fix imx6_add_pcie_port() section mismatch warning PCI: Make pci_bus_add_device() void PCI: exynos: Fix add_pcie_port() section mismatch warning PCI: Introduce new device binding path using pci_dev.driver_override PCI: rcar: Add gen2 device tree support PCI: cpqphp: Fix possible null pointer dereference PCI: rcar: Add R-Car PCIe device tree bindings PCI: rcar: Add MSI support for PCIe PCI: rcar: Add Renesas R-Car PCIe driver PCI: Fix return value from pci_user_{read,write}_config_*() PCI: exynos: Remove unnecessary OOM messages ...
2014-06-01Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Various fixlets, mostly related to the (root-only) SCHED_DEADLINE policy, but also a hotplug bug fix and a fix for a NR_CPUS related overallocation bug causing a suspend/resume regression" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix hotplug vs. set_cpus_allowed_ptr() sched/cpupri: Replace NR_CPUS arrays sched/deadline: Replace NR_CPUS arrays sched/deadline: Restrict user params max value to 2^63 ns sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINE sched: Disallow sched_attr::sched_policy < 0 sched: Make sched_setattr() correctly return -EFBIG
2014-05-31Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core futex/rtmutex fixes from Thomas Gleixner: "Three fixlets for long standing issues in the futex/rtmutex code unearthed by Dave Jones syscall fuzzer: - Add missing early deadlock detection checks in the futex code - Prevent user space from attaching a futex to kernel threads - Make the deadlock detector of rtmutex work again Looks large, but is more comments than code change" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rtmutex: Fix deadlock detector for real futex: Prevent attaching to kernel threads futex: Add another early deadlock detection check
2014-05-28printk/of_serial: fix serial console cessation part way through boot.Stephen Chivers
Commit 5f5c9ae56c38942623f69c3e6dc6ec78e4da2076 "serial_core: Unregister console in uart_remove_one_port()" fixed a crash where a serial port was removed but not deregistered as a console. There is a side effect of that commit for platforms having serial consoles and of_serial configured (CONFIG_SERIAL_OF_PLATFORM). The serial console is disabled midway through the boot process. This cessation of the serial console affects PowerPC computers such as the MVME5100 and SAM440EP. The sequence is: bootconsole [udbg0] enabled .... serial8250/16550 driver initialises and registers its UARTS, one of these is the serial console. console [ttyS0] enabled .... of_serial probes "platform" devices, registering them as it goes. One of these is the serial console. console [ttyS0] disabled. The disabling of the serial console is due to: a. unregister_console in printk not clearing the CONS_ENABLED bit in the console flags, even though it has announced that the console is disabled; and b. of_platform_serial_probe in of_serial not setting the port type before it registers with serial8250_register_8250_port. This patch ensures that the serial console is re-enabled when of_serial registers a serial port that corresponds to the designated console. Signed-off-by: Stephen Chivers <schivers@csc.com> Tested-by: Stephen Chivers <schivers@csc.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [unregister_console] Cc: stable <stable@vger.kernel.org> # 3.15 === The above failure was identified in Linux-3.15-rc2. Tested using MVME5100 and SAM440EP PowerPC computers with kernels built from Linux-3.15-rc5 and tty-next. The continued operation of the serial console is vital for computers such as the MVME5100 as that Single Board Computer does not have any grapical/display hardware. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-28rtmutex: Fix deadlock detector for realThomas Gleixner
The current deadlock detection logic does not work reliably due to the following early exit path: /* * Drop out, when the task has no waiters. Note, * top_waiter can be NULL, when we are in the deboosting * mode! */ if (top_waiter && (!task_has_pi_waiters(task) || top_waiter != task_top_pi_waiter(task))) goto out_unlock_pi; So this not only exits when the task has no waiters, it also exits unconditionally when the current waiter is not the top priority waiter of the task. So in a nested locking scenario, it might abort the lock chain walk and therefor miss a potential deadlock. Simple fix: Continue the chain walk, when deadlock detection is enabled. We also avoid the whole enqueue, if we detect the deadlock right away (A-A). It's an optimization, but also prevents that another waiter who comes in after the detection and before the task has undone the damage observes the situation and detects the deadlock and returns -EDEADLOCK, which is wrong as the other task is not in a deadlock situation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-28Merge branch 'merge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull two powerpc fixes from Ben Herrenschmidt: "Here's a pair of powerpc fixes for 3.15 which are also going to stable. One's a fix for building with newer binutils (the problem currently only affects the BookE kernels but the affected macro might come back into use on BookS platforms at any time). Unfortunately, the binutils maintainer did a backward incompatible change to a construct that we use so we have to add Makefile check. The other one is a fix for CPUs getting stuck in kexec when running single threaded. Since we routinely use kexec on power (including in our newer bootloaders), I deemed that important enough" * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST mode powerpc: Fix 64 bit builds with binutils 2.24
2014-05-28powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST modeSrivatsa S. Bhat
If we try to perform a kexec when the machine is in ST (Single-Threaded) mode (ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we get the following messages during boot: [ 0.089866] POWER8 performance monitor hardware support registered [ 0.089985] power8-pmu: PMAO restore workaround active. [ 5.095419] Processor 1 is stuck. [ 10.097933] Processor 2 is stuck. [ 15.100480] Processor 3 is stuck. [ 20.102982] Processor 4 is stuck. [ 25.105489] Processor 5 is stuck. [ 30.108005] Processor 6 is stuck. [ 35.110518] Processor 7 is stuck. [ 40.113369] Processor 9 is stuck. [ 45.115879] Processor 10 is stuck. [ 50.118389] Processor 11 is stuck. [ 55.120904] Processor 12 is stuck. [ 60.123425] Processor 13 is stuck. [ 65.125970] Processor 14 is stuck. [ 70.128495] Processor 15 is stuck. [ 75.131316] Processor 17 is stuck. Note that only the sibling threads are stuck, while the primary threads (0, 8, 16 etc) boot just fine. Looking closer at the previous step of kexec, we observe that kexec tries to wakeup (bring online) the sibling threads of all the cores, before performing kexec: [ 9464.131231] Starting new kernel [ 9464.148507] kexec: Waking offline cpu 1. [ 9464.148552] kexec: Waking offline cpu 2. [ 9464.148600] kexec: Waking offline cpu 3. [ 9464.148636] kexec: Waking offline cpu 4. [ 9464.148671] kexec: Waking offline cpu 5. [ 9464.148708] kexec: Waking offline cpu 6. [ 9464.148743] kexec: Waking offline cpu 7. [ 9464.148779] kexec: Waking offline cpu 9. [ 9464.148815] kexec: Waking offline cpu 10. [ 9464.148851] kexec: Waking offline cpu 11. [ 9464.148887] kexec: Waking offline cpu 12. [ 9464.148922] kexec: Waking offline cpu 13. [ 9464.148958] kexec: Waking offline cpu 14. [ 9464.148994] kexec: Waking offline cpu 15. [ 9464.149030] kexec: Waking offline cpu 17. Instrumenting this piece of code revealed that the cpu_up() operation actually fails with -EBUSY. Thus, only the primary threads of all the cores are online during kexec, and hence this is a sure-shot receipe for disaster, as explained in commit e8e5c2155b (powerpc/kexec: Fix orphaned offline CPUs across kexec), as well as in the comment above wake_offline_cpus(). It turns out that cpu_up() was returning -EBUSY because the variable 'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was done by migrate_to_reboot_cpu() inside kernel_kexec(). Now, migrate_to_reboot_cpu() was originally written with the assumption that any further code will not need to perform CPU hotplug, since we are anyway in the reboot path. However, kexec is clearly not such a case, since we depend on onlining CPUs, atleast on powerpc. So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the kexec path, to fix this regression in kexec on powerpc. Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we can catch such issues more easily in the future. Fixes: c97102ba963 (kexec: migrate to reboot cpu) Cc: stable@vger.kernel.org Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-05-27kernfs: move the last knowledge of sysfs out from kernfsJianyu Zhan
There is still one residue of sysfs remaining: the sb_magic SYSFS_MAGIC. However this should be kernfs user specific, so this patch moves it out. Kerrnfs user should specify their magic number while mouting. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-23Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "The biggest commit is an irqtime accounting loop latency fix, the rest are misc fixes all over the place: deadline scheduling, docs, numa, balancer and a bad to-idle latency fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/numa: Initialize newidle balance stats in sd_numa_init() sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance() sched: Skip double execution of pick_next_task_fair() sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check sched/deadline: Fix memory leak sched/deadline: Fix sched_yield() behavior sched: Sanitize irq accounting madness sched/docbook: Fix 'make htmldocs' warnings caused by missing description
2014-05-23Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "The biggest changes are fixes for races that kept triggering Trinity crashes, plus liblockdep build fixes and smaller misc fixes. The liblockdep bits in perf/urgent are a pull mistake - they should have been in locking/urgent - but by the time I noticed other commits were added and testing was done :-/ Sorry about that" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix a race between ring_buffer_detach() and ring_buffer_attach() perf: Prevent false warning in perf_swevent_add perf: Limit perf_event_attr::sample_period to 63 bits tools/liblockdep: Remove all build files when doing make clean tools/liblockdep: Build liblockdep from tools/Makefile perf/x86/intel: Fix Silvermont's event constraints perf: Fix perf_event_init_context() perf: Fix race in removing an event
2014-05-23resources: Clarify sanity check messageBjorn Helgaas
The resource map sanity check message is a bit confusing. Change it to be more readable: -resource map sanity check conflict: 0xfed10000 0xfed15fff 0xfed10000 0xfed13fff pnp 00:01 +resource sanity check: requesting [mem 0xfed10000-0xfed15fff], which spans more than pnp 00:01 [mem 0xfed10000-0xfed13fff] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2014-05-23Merge 3.15-rc6 into driver-core-nextGreg Kroah-Hartman
We want the kernfs fixes in this branch as well for testing. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-22Merge branch 'rcu/next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: " 1. Update RCU documentation. These were posted to LKML at https://lkml.org/lkml/2014/4/28/634. 2. Miscellaneous fixes. These were posted to LKML at https://lkml.org/lkml/2014/4/28/645. 3. Torture-test changes. These were posted to LKML at https://lkml.org/lkml/2014/4/28/667. 4. Variable-name renaming cleanup, sent separately due to conflicts. This was posted to LKML at https://lkml.org/lkml/2014/5/13/854. 5. Patch to suppress RCU stall warnings while sysrq requests are being processed. This patch is the RCU portions of the patch that Rik posted to LKML at https://lkml.org/lkml/2014/4/29/457. The reason for pushing this patch ahead instead of waiting until 3.17 is that the NMI-based stack traces are messing up sysrq output, and in some cases also messing up the system as well." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched/numa: Decay ->wakee_flips instead of zeroingRik van Riel
Affine wakeups have the potential to interfere with NUMA placement. If a task wakes up too many other tasks, affine wakeups will get disabled. However, regardless of how many other tasks it wakes up, it gets re-enabled once a second, potentially interfering with NUMA placement of other tasks. By decaying wakee_wakes in half instead of zeroing it, we can avoid that problem for some workloads. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: chegu_vinod@hp.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20140516001332.67f91af2@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched/numa: Update migrate_improves/degrades_locality()Rik van Riel
Update the migrate_improves/degrades_locality() functions with knowledge of pseudo-interleaving. Do not consider moving tasks around within the set of group's active nodes as improving or degrading locality. Instead, leave the load balancer free to balance the load between a numa_group's active nodes. Also, switch from the group/task_weight functions to the group/task_fault functions. The "weight" functions involve a division, but both calls use the same divisor, so there's no point in doing that from these functions. On a 4 node (x10 core) system, performance of SPECjbb2005 seems unaffected, though the number of migrations with 2 8-warehouse wide instances seems to have almost halved, due to the scheduler running each instance on a single node. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Link: http://lkml.kernel.org/r/20140515130306.61aae7db@cuia.bos.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched/numa: Allow task switch if load imbalance improvesRik van Riel
Currently the NUMA balancing code only allows moving tasks between NUMA nodes when the load on both nodes is in balance. This breaks down when the load was imbalanced to begin with. Allow tasks to be moved between NUMA nodes if the imbalance is small, or if the new imbalance is be smaller than the original one. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: mgorman@suse.de Cc: chegu_vinod@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/20140514132221.274b3463@annuminas.surriel.com
2014-05-22sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match ↵xiaofeng.yan
the current upstream code Signed-off-by: xiaofeng.yan <xiaofeng.yan@huawei.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1399605687-18094-1-git-send-email-xiaofeng.yan@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Consolidate open coded implementations of nice level frobbing into ↵Dongsheng Yang
nice_to_rlimit() and rlimit_to_nice() Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Initialize rq->age_stamp on processor startCorey Minyard
If the sched_clock time starts at a large value, the kernel will spin in sched_avg_update for a long time while rq->age_stamp catches up with rq->clock. The comment in kernel/sched/clock.c says that there is no strict promise that it starts at zero. So initialize rq->age_stamp when a cpu starts up to avoid this. I was seeing long delays on a simulator that didn't start the clock at zero. This might also be an issue on reboots on processors that don't re-initialize the timer to zero on reset, and when using kexec. Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1399574859-11714-1-git-send-email-minyard@acm.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched, nohz: Change rq->nr_running to always use wrappersKirill Tkhai
Sometimes ->nr_running may cross 2 but interrupt is not being sent to rq's cpu. In this case we don't reenable the timer. Looks like this may be the reason for rare unexpected effects, if nohz is enabled. Patch replaces all places of direct changing of nr_running and makes add_nr_running() caring about crossing border. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()Jason Low
Currently, in idle_balance(), we update rq->next_balance when we pull_tasks. However, it is also important to update this in the !pulled_tasks case too. When the CPU is "busy" (the CPU isn't idle), rq->next_balance gets computed using sd->busy_factor (so we increase the balance interval when the CPU is busy). However, when the CPU goes idle, rq->next_balance could still be set to a large value that was computed with the sd->busy_factor. Thus, we need to also update rq->next_balance in idle_balance() in the cases where !pulled_tasks too, so that rq->next_balance gets updated without taking the busy_factor into account when the CPU is about to go idle. This patch makes rq->next_balance get updated independently of whether or not we pulled_task. Also, we add logic to ensure that we always traverse at least 1 of the sched domains to get a proper next_balance value for updating rq->next_balance. Additionally, since load_balance() modifies the sd->balance_interval, we need to re-obtain the sched domain's interval after the call to load_balance() in rebalance_domains() before we update rq->next_balance. This patch adds and uses 2 new helper functions, update_next_balance() and get_sd_balance_interval() to update next_balance and obtain the sched domain's balance_interval. Signed-off-by: Jason Low <jason.low2@hp.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: daniel.lezcano@linaro.org Cc: alex.shi@linaro.org Cc: efault@gmx.de Cc: vincent.guittot@linaro.org Cc: morten.rasmussen@arm.com Cc: aswin@hp.com Link: http://lkml.kernel.org/r/1399596562.2200.7.camel@j-VirtualBox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Use clamp() and clamp_val() to make sys_nice() more readableDongsheng Yang
Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1399541715-19568-1-git-send-email-yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()Dietmar Eggemann
There is no need to zero struct sched_group member cpumask and struct sched_group_power member power since both structures are already allocated as zeroed memory in __sdt_alloc(). This patch has been tested with BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power); in build_sched_groups() on ARM TC2 and INTEL i5 M520 platform including CPU hotplug scenarios. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1398865178-12577-1-git-send-email-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched/numa: Fix initialization of sched_domain_topology for NUMAVincent Guittot
Jet Chen has reported a kernel panics when booting qemu-system-x86_64 with kvm64 cpu. A panic occured while building the sched_domain. In sched_init_numa, we create a new topology table in which both default levels and numa levels are copied. The last row of the table must have a null pointer in the mask field. The current implementation doesn't add this last row in the computation of the table size. So we add 1 row in the allocation size that will be used as the last row of the table. The kzalloc will ensure that the mask field is NULL. Reported-by: Jet Chen <jet.chen@intel.com> Tested-by: Jet Chen <jet.chen@intel.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: fengguang.wu@intel.com Link: http://lkml.kernel.org/r/1399972261-25693-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Call select_idle_sibling() when not affine_sdRik van Riel
On smaller systems, the top level sched domain will be an affine domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE wakeup. This seems to be working well. On larger systems, with the node distance between far away NUMA nodes being > RECLAIM_DISTANCE, select_idle_sibling is only called if the waker and the wakee are on nodes less than RECLAIM_DISTANCE apart. This patch leaves in place the policy of not pulling the task across nodes on such systems, while fixing the issue that select_idle_sibling is not called at all in certain circumstances. The code will look for an idle CPU in the same CPU package as the CPU where the task ran previously. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: morten.rasmussen@arm.com Cc: george.mccollister@gmail.com Cc: ktkhai@parallels.com Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Link: http://lkml.kernel.org/r/20140514114037.2d93266f@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Simplify return logic in sched_read_attr()Michael Kerrisk
Gotos are chained pointlessly here, and the 'out' label can be dispensed with. Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/536CEC29.9090503@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Simplify return logic in sched_copy_attr()Michael Kerrisk
The logic in this function is a little contorted, clean it up: * Rather than having chained gotos for the -EFBIG case, just return -EFBIG directly. * Now, the label 'out' is no longer needed, and 'ret' must be zero zero by the time we fall through to this point, so just return 0. Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/536CEC24.9080201@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Fix exec_start/task_hot on migrated tasksBen Segall
task_hot checks exec_start on any runnable task, but if it has been migrated since the it last ran, then exec_start is a clock_task from another cpu. If the old cpu's clock_task was sufficiently far ahead of this cpu's then the task will not be considered for another migration until it has run. Instead reset exec_start whenever a task is migrated, since it is presumably no longer hot anyway. Signed-off-by: Ben Segall <bsegall@google.com> [ Made it compile. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140515225920.7179.13924.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22Merge branch 'sched/urgent' into sched/core to avoid conflicts with upcoming ↵Ingo Molnar
changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22Merge branch 'pm-cpuidle' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm into sched/core Pull scheduling related CPU idle updates from Rafael J. Wysocki. Conflicts: kernel/sched/idle.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22Merge tag 'v3.15-rc6' into sched/core, to pick up the latest fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Fix hotplug vs. set_cpus_allowed_ptr()Lai Jiangshan
Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael wang <wangyun@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Toshi Kani <toshi.kani@hp.com> Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched/cpupri: Replace NR_CPUS arraysPeter Zijlstra
Tejun reported that his resume was failing due to order-3 allocations from sched_domain building. Replace the NR_CPUS arrays in there with a dynamically allocated array. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched/deadline: Replace NR_CPUS arraysPeter Zijlstra
Tejun reported that his resume was failing due to order-3 allocations from sched_domain building. Replace the NR_CPUS arrays in there with a dynamically allocated array. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-kat4gl1m5a6dwy6nzuqox45e@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched/deadline: Restrict user params max value to 2^63 nsJuri Lelli
Michael Kerrisk noticed that creating SCHED_DEADLINE reservations with certain parameters (e.g, a runtime of something near 2^64 ns) can cause a system freeze for some amount of time. The problem is that in the interface we have u64 sched_runtime; while internally we need to have a signed runtime (to cope with budget overruns) s64 runtime; At the time we setup a new dl_entity we copy the first value in the second. The cast turns out with negative values when sched_runtime is too big, and this causes the scheduler to go crazy right from the start. Moreover, considering how we deal with deadlines wraparound (s64)(a - b) < 0 we also have to restrict acceptable values for sched_{deadline,period}. This patch fixes the thing checking that user parameters are always below 2^63 ns (still large enough for everyone). It also rewrites other conditions that we check, since in __checkparam_dl we don't have to deal with deadline wraparounds and what we have now erroneously fails when the difference between values is too big. Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Dario Faggioli<raistlin@linux.it> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINEPeter Zijlstra
The way we read POSIX one should only call sched_getparam() when sched_getscheduler() returns either SCHED_FIFO or SCHED_RR. Given that we currently return sched_param::sched_priority=0 for all others, extend the same behaviour to SCHED_DEADLINE. Requested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Cc: linux-man <linux-man@vger.kernel.org> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Disallow sched_attr::sched_policy < 0Peter Zijlstra
The scheduler uses policy=-1 to preserve the current policy state to implement sys_sched_setparam(), this got exposed to userspace by accident through sys_sched_setattr(), cure this. Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22sched: Make sched_setattr() correctly return -EFBIGMichael Kerrisk
The documented[1] behavior of sched_attr() in the proposed man page text is: sched_attr::size must be set to the size of the structure, as in sizeof(struct sched_attr), if the provided structure is smaller than the kernel structure, any additional fields are assumed '0'. If the provided structure is larger than the kernel structure, the kernel verifies all additional fields are '0' if not the syscall will fail with -E2BIG. As currently implemented, sched_copy_attr() returns -EFBIG for for this case, but the logic in sys_sched_setattr() converts that error to -EFAULT. This patch fixes the behavior. [1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760 Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-21Merge branch 'for-3.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull more cgroup fixes from Tejun Heo: "Three more patches to fix cgroup_freezer breakage due to the recent cgroup internal locking changes - an operation cgroup_freezer was using now requires sleepable context and cgroup_freezer was invoking that while holding a spin lock. cgroup_freezer was using an overly elaborate hierarchical locking scheme. While it's possible to convert the hierarchical spinlocks directly to mutexes, this patch simplifies the overall locking so that it uses a global mutex. This has the added benefit of avoiding iterating potentially huge number of tasks under a spinlock. While the patch is on the larger side in the devel cycle, the changes made are mostly straight-forward and the locking logic is a lot simpler afterwards" * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix rcu_read_lock() leak in update_if_frozen() cgroup_freezer: replace freezer->lock with freezer_mutex cgroup: introduce task_css_is_root()
2014-05-20Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single bug fix for a long standing issue: - Updating the expiry value of a relative timer _after_ letting the idle logic select a target cpu for the timer based on its stale expiry value is outright stupid. Thanks to Viresh for spotting the brainfart" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Set expiry time before switch_hrtimer_base()
2014-05-19rcu: Provide API to suppress stall warnings while sysrc runsRik van Riel
Some sysrq handlers can run for a long time, because they dump a lot of data onto a serial console. Having RCU stall warnings pop up in the middle of them only makes the problem worse. This commit provides rcu_sysrq_start() and rcu_sysrq_end() APIs to temporarily suppress RCU CPU stall warnings while a sysrq request is handled. Signed-off-by: Rik van Riel <riel@redhat.com> [ paulmck: Fix TINY_RCU build error. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-19perf/events/core: Drop unused variable after cleanupBorislav Petkov
... in 3a497f48637 ("perf: Simplify perf_event_exit_task_context()") Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1399720259-28275-1-git-send-email-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-19perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()Peter Zijlstra
Alexander noticed that we use RCU iteration on rb->event_list but do not use list_{add,del}_rcu() to add,remove entries to that list, nor do we observe proper grace periods when re-using the entries. Merge ring_buffer_detach() into ring_buffer_attach() such that attaching to the NULL buffer is detaching. Furthermore, ensure that between any 'detach' and 'attach' of the same event we observe the required grace period, but only when strictly required. In effect this means that only ioctl(.request = PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the normal initial attach and final detach will not be delayed. This patch should, I think, do the right thing under all circumstances, the 'normal' cases all should never see the extra grace period, but the two cases: 1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a ring_buffer set, will now observe the required grace period between removing itself from the old and attaching itself to the new buffer. This case is 'simple' in that both buffers are present in perf_event_set_output() one could think an unconditional synchronize_rcu() would be sufficient; however... 2) an event that has a buffer attached, the buffer is destroyed (munmap) and then the event is attached to a new/different buffer using PERF_EVENT_IOC_SET_OUTPUT. This case is more complex because the buffer destruction does: ring_buffer_attach(.rb = NULL) followed by the ioctl() doing: ring_buffer_attach(.rb = foo); and we still need to observe the grace period between these two calls due to us reusing the event->rb_entry list_head. In order to make 2 happen we use Paul's latest cond_synchronize_rcu() call. Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-19perf: Prevent false warning in perf_swevent_addJiri Olsa
The perf cpu offline callback takes down all cpu context events and releases swhash->swevent_hlist. This could race with task context software event being just scheduled on this cpu via perf_swevent_add while cpu hotplug code already cleaned up event's data. The race happens in the gap between the cpu notifier code and the cpu being actually taken down. Note that only cpu ctx events are terminated in the perf cpu hotplug code. It's easily reproduced with: $ perf record -e faults perf bench sched pipe while putting one of the cpus offline: # echo 0 > /sys/devices/system/cpu/cpu1/online Console emits following warning: WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0() Modules linked in: CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G W 3.14.0+ #256 Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008 0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005 0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046 ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001 Call Trace: [<ffffffff81665a23>] dump_stack+0x4f/0x7c [<ffffffff8104732c>] warn_slowpath_common+0x8c/0xc0 [<ffffffff8104737a>] warn_slowpath_null+0x1a/0x20 [<ffffffff8110fb3d>] perf_swevent_add+0x18d/0x1a0 [<ffffffff811162ae>] event_sched_in.isra.75+0x9e/0x1f0 [<ffffffff8111646a>] group_sched_in+0x6a/0x1f0 [<ffffffff81083dd5>] ? sched_clock_local+0x25/0xa0 [<ffffffff811167e6>] ctx_sched_in+0x1f6/0x450 [<ffffffff8111757b>] perf_event_sched_in+0x6b/0xa0 [<ffffffff81117a4b>] perf_event_context_sched_in+0x7b/0xc0 [<ffffffff81117ece>] __perf_event_task_sched_in+0x43e/0x460 [<ffffffff81096f1e>] ? put_lock_stats.isra.18+0xe/0x30 [<ffffffff8107b3c8>] finish_task_switch+0xb8/0x100 [<ffffffff8166a7de>] __schedule+0x30e/0xad0 [<ffffffff81172dd2>] ? pipe_read+0x3e2/0x560 [<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70 [<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70 [<ffffffff8166b464>] preempt_schedule_irq+0x44/0x70 [<ffffffff816707f0>] retint_kernel+0x20/0x30 [<ffffffff8109e60a>] ? lockdep_sys_exit+0x1a/0x90 [<ffffffff812a4234>] lockdep_sys_exit_thunk+0x35/0x67 [<ffffffff81679321>] ? sysret_check+0x5/0x56 Fixing this by tracking the cpu hotplug state and displaying the WARN only if current cpu is initialized properly. Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: stable@vger.kernel.org Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>