summaryrefslogtreecommitdiff
path: root/include/linux/sched.h
AgeCommit message (Collapse)Author
2016-08-21pipe: limit the per-user amount of pages allocated in pipesWilly Tarreau
commit 759c01142a5d0f364a462346168a56de28a80f52 upstream. On no-so-small systems, it is possible for a single process to cause an OOM condition by filling large pipes with data that are never read. A typical process filling 4000 pipes with 1 MB of data will use 4 GB of memory. On small systems it may be tricky to set the pipe max size to prevent this from happening. This patch makes it possible to enforce a per-user soft limit above which new pipes will be limited to a single page, effectively limiting them to 4 kB each, as well as a hard limit above which no new pipes may be created for this user. This has the effect of protecting the system against memory abuse without hurting other users, and still allowing pipes to work correctly though with less data at once. The limit are controlled by two new sysctls : pipe-user-pages-soft, and pipe-user-pages-hard. Both may be disabled by setting them to zero. The default soft limit allows the default number of FDs per process (1024) to create pipes of the default size (64kB), thus reaching a limit of 64MB before starting to create only smaller pipes. With 256 processes limited to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for a user. The hard limit is disabled by default to avoid breaking existing applications that make intensive use of pipes (eg: for splicing). CVE-2016-2847 Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Signed-off-by: Chas Williams <3chas3@gmail.com>
2016-01-28unix: properly account for FDs passed over unix socketswilly tarreau
[ Upstream commit 712f4aad406bb1ed67f3f98d04c044191f0ff593 ] It is possible for a process to allocate and accumulate far more FDs than the process' limit by sending them over a unix socket then closing them to keep the process' fd count low. This change addresses this problem by keeping track of the number of FDs in flight per user and preventing non-privileged processes from having more FDs in flight than their configured FD limit. Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-03include/linux/sched.h: don't use task->pid/tgid in ↵Oleg Nesterov
same_thread_group/has_group_leader_pid commit e1403b8edf669ff49bbdf602cc97fefa2760cb15 upstream. task_struct->pid/tgid should go away. 1. Change same_thread_group() to use task->signal for comparison. 2. Change has_group_leader_pid(task) to compare task_pid(task) with signal->leader_pid. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm: memcg: handle non-error OOM situations more gracefullyJohannes Weiner
commit 4942642080ea82d99ab5b653abb9a12b7ba31f4a upstream. Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") assumed that only a few places that can trigger a memcg OOM situation do not return VM_FAULT_OOM, like optional page cache readahead. But there are many more and it's impractical to annotate them all. First of all, we don't want to invoke the OOM killer when the failed allocation is gracefully handled, so defer the actual kill to the end of the fault handling as well. This simplifies the code quite a bit for added bonus. Second, since a failed allocation might not be the abrupt end of the fault, the memcg OOM handler needs to be re-entrant until the fault finishes for subsequent allocation attempts. If an allocation is attempted after the task already OOMed, allow it to bypass the limit so that it can quickly finish the fault and invoke the OOM killer. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm: memcg: do not trap chargers with full callstack on OOMJohannes Weiner
commit 3812c8c8f3953921ef18544110dafc3505c1ac62 upstream. The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm: memcg: enable memcg OOM killer only for user faultsJohannes Weiner
commit 519e52473ebe9db5cdef44670d5a97f1fd53d721 upstream. System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30mm: clear __GFP_FS when PF_MEMALLOC_NOIO is setJunxiao Bi
commit 934f3072c17cc8886f4c043b47eeeb1b12f8de33 upstream. commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O during memory allocation") introduces PF_MEMALLOC_NOIO flag to avoid doing I/O inside memory allocation, __GFP_IO is cleared when this flag is set, but __GFP_FS implies __GFP_IO, it should also be cleared. Or it may still run into I/O, like in superblock shrinker. And this will make the kernel run into the deadlock case described in that commit. See Dave Chinner's comment about io in superblock shrinker: Filesystem shrinkers do indeed perform IO from the superblock shrinker and have for years. Even clean inodes can require IO before they can be freed - e.g. on an orphan list, need truncation of post-eof blocks, need to wait for ordered operations to complete before it can be freed, etc. IOWs, Ext4, btrfs and XFS all can issue and/or block on arbitrary amounts of IO in the superblock shrinker context. XFS, in particular, has been doing transactions and IO from the VFS inode cache shrinker since it was first introduced.... Fix this by clearing __GFP_FS in memalloc_noio_flags(), this function has masked all the gfp_mask that will be passed into fs for the processes setting PF_MEMALLOC_NOIO in the direct reclaim path. v1 thread at: https://lkml.org/lkml/2014/9/3/32 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: joyce.xue <xuejiufei@huawei.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05introduce for_each_thread() to replace the buggy while_each_thread()Oleg Nesterov
commit 0c740d0afc3bff0a097ad03a1c8df92757516f5c upstream. while_each_thread() and next_thread() should die, almost every lockless usage is wrong. 1. Unless g == current, the lockless while_each_thread() is not safe. while_each_thread(g, t) can loop forever if g exits, next_thread() can't reach the unhashed thread in this case. Note that this can happen even if g is the group leader, it can exec. 2. Even if while_each_thread() itself was correct, people often use it wrongly. It was never safe to just take rcu_read_lock() and loop unless you verify that pid_alive(g) == T, even the first next_thread() can point to the already freed/reused memory. This patch adds signal_struct->thread_head and task->thread_node to create the normal rcu-safe list with the stable head. The new for_each_thread(g, t) helper is always safe under rcu_read_lock() as long as this task_struct can't go away. Note: of course it is ugly to have both task_struct->thread_node and the old task_struct->thread_group, we will kill it later, after we change the users of while_each_thread() to use for_each_thread(). Perhaps we can kill it even before we convert all users, we can reimplement next_thread(t) using the new thread_head/thread_node. But we can't do this right now because this will lead to subtle behavioural changes. For example, do/while_each_thread() always sees at least one task, while for_each_thread() can do nothing if the whole thread group has died. Or thread_group_empty(), currently its semantics is not clear unless thread_group_leader(p) and we need to audit the callers before we can change it. So this patch adds the new interface which has to coexist with the old one for some time, hopefully the next changes will be more or less straightforward and the old one will go away soon. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: "Ma, Xindong" <xindong.ma@intel.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29exec/ptrace: fix get_dumpable() incorrect testsKees Cook
commit d049f74f2dbe71354d43d393ac3a188947811348 upstream. The get_dumpable() return value is not boolean. Most users of the function actually want to be testing for non-SUID_DUMP_USER(1) rather than SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a protected state. Almost all places did this correctly, excepting the two places fixed in this patch. Wrong logic: if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ } or if (dumpable == 0) { /* be protective */ } or if (!dumpable) { /* be protective */ } Correct logic: if (dumpable != SUID_DUMP_USER) { /* be protective */ } or if (dumpable != 1) { /* be protective */ } Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a user was able to ptrace attach to processes that had dropped privileges to that user. (This may have been partially mitigated if Yama was enabled.) The macros have been moved into the file that declares get/set_dumpable(), which means things like the ia64 code can see them too. CVE-2013-2929 Reported-by: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29sched, idle: Fix the idle polling state logicPeter Zijlstra
commit ea8117478918a4734586d35ff530721b682425be upstream. Mike reported that commit 7d1a9417 ("x86: Use generic idle loop") regressed several workloads and caused excessive reschedule interrupts. The patch in question failed to notice that the x86 code had an inverted sense of the polling state versus the new generic code (x86: default polling, generic: default !polling). Fix the two prominent x86 mwait based idle drivers and introduce a few new generic polling helpers (fixing the wrong smp_mb__after_clear_bit usage). Also switch the idle routines to using tif_need_resched() which is an immediate TIF_NEED_RESCHED test as opposed to need_resched which will end up being slightly different. Reported-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: lenb@kernel.org Cc: tglx@linutronix.de Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29Revert "x86 get_unmapped_area(): use proper mmap base for bottom-up direction"Linus Torvalds
commit 5ea80f76a56605a190a7ea16846c82aa63dbd0aa upstream. This reverts commit df54d6fa54275ce59660453e29d1228c2b45a826. The commit isn't necessarily wrong, but because it recalculates the random mmap_base every time, it seems to confuse user memory allocators that expect contiguous mmap allocations even when the mmap address isn't specified. In particular, the MATLAB Java runtime seems to be unhappy. See https://bugzilla.kernel.org/show_bug.cgi?id=60774 So we'll want to apply the random offset only once, and Radu has a patch for that. Revert this older commit in order to apply the other one. Reported-by: Jeff Shorey <shoreyjeff@gmail.com> Cc: Radu Caragea <sinaelgl@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-20x86 get_unmapped_area(): use proper mmap base for bottom-up directionRadu Caragea
commit df54d6fa54275ce59660453e29d1228c2b45a826 upstream. When the stack is set to unlimited, the bottomup direction is used for mmap-ings but the mmap_base is not used and thus effectively renders ASLR for mmapings along with PIE useless. Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Adrian Sendroiu <molecula2788@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-11Merge git://git.infradead.org/users/eparis/auditLinus Torvalds
Pull audit changes from Eric Paris: "Al used to send pull requests every couple of years but he told me to just start pushing them to you directly. Our touching outside of core audit code is pretty straight forward. A couple of interface changes which hit net/. A simple argument bug calling audit functions in namei.c and the removal of some assembly branch prediction code on ppc" * git://git.infradead.org/users/eparis/audit: (31 commits) audit: fix message spacing printing auid Revert "audit: move kaudit thread start from auditd registration to kaudit init" audit: vfs: fix audit_inode call in O_CREAT case of do_last audit: Make testing for a valid loginuid explicit. audit: fix event coverage of AUDIT_ANOM_LINK audit: use spin_lock in audit_receive_msg to process tty logging audit: do not needlessly take a lock in tty_audit_exit audit: do not needlessly take a spinlock in copy_signal audit: add an option to control logging of passwords with pam_tty_audit audit: use spin_lock_irqsave/restore in audit tty code helper for some session id stuff audit: use a consistent audit helper to log lsm information audit: push loginuid and sessionid processing down audit: stop pushing loginid, uid, sessionid as arguments audit: remove the old depricated kernel interface audit: make validity checking generic audit: allow checking the type of audit message in the user filter audit: fix build break when AUDIT_DEBUG == 2 audit: remove duplicate export of audit_enabled Audit: do not print error when LSMs disabled ...
2013-05-08Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: "It might look big in volume, but when categorized, not a lot of drivers are touched. The pull request contains: - mtip32xx fixes from Micron. - A slew of drbd updates, this time in a nicer series. - bcache, a flash/ssd caching framework from Kent. - Fixes for cciss" * 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits) bcache: Use bd_link_disk_holder() bcache: Allocator cleanup/fixes cciss: bug fix to prevent cciss from loading in kdump crash kernel cciss: add cciss_allow_hpsa module parameter drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions mtip32xx: Workaround for unaligned writes bcache: Make sure blocksize isn't smaller than device blocksize bcache: Fix merge_bvec_fn usage for when it modifies the bvm bcache: Correctly check against BIO_MAX_PAGES bcache: Hack around stuff that clones up to bi_max_vecs bcache: Set ra_pages based on backing device's ra_pages bcache: Take data offset from the bdev superblock. mtip32xx: mtip32xx: Disable TRIM support mtip32xx: fix a smatch warning bcache: Disable broken btree fuzz tester bcache: Fix a format string overflow bcache: Fix a minor memory leak on device teardown bcache: Documentation updates bcache: Use WARN_ONCE() instead of __WARN() bcache: Add missing #include <linux/prefetch.h> ...
2013-05-07aio: don't include aio.h in sched.hKent Overstreet
Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-05Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull 'full dynticks' support from Ingo Molnar: "This tree from Frederic Weisbecker adds a new, (exciting! :-) core kernel feature to the timer and scheduler subsystems: 'full dynticks', or CONFIG_NO_HZ_FULL=y. This feature extends the nohz variable-size timer tick feature from idle to busy CPUs (running at most one task) as well, potentially reducing the number of timer interrupts significantly. This feature got motivated by real-time folks and the -rt tree, but the general utility and motivation of full-dynticks runs wider than that: - HPC workloads get faster: CPUs running a single task should be able to utilize a maximum amount of CPU power. A periodic timer tick at HZ=1000 can cause a constant overhead of up to 1.0%. This feature removes that overhead - and speeds up the system by 0.5%-1.0% on typical distro configs even on modern systems. - Real-time workload latency reduction: CPUs running critical tasks should experience as little jitter as possible. The last remaining source of kernel-related jitter was the periodic timer tick. - A single task executing on a CPU is a pretty common situation, especially with an increasing number of cores/CPUs, so this feature helps desktop and mobile workloads as well. The cost of the feature is mainly related to increased timer reprogramming overhead when a CPU switches its tick period, and thus slightly longer to-idle and from-idle latency. Configuration-wise a third mode of operation is added to the existing two NOHZ kconfig modes: - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named as a config option. This is the traditional Linux periodic tick design: there's a HZ tick going on all the time, regardless of whether a CPU is idle or not. - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the periodic tick when a CPU enters idle mode. - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the tick when a CPU is idle, also slows the tick down to 1 Hz (one timer interrupt per second) when only a single task is running on a CPU. The .config behavior is compatible: existing !CONFIG_NO_HZ and CONFIG_NO_HZ=y settings get translated to the new values, without the user having to configure anything. CONFIG_NO_HZ_FULL is turned off by default. This feature is based on a lot of infrastructure work that has been steadily going upstream in the last 2-3 cycles: related RCU support and non-periodic cputime support in particular is upstream already. This tree adds the final pieces and activates the feature. The pull request is marked RFC because: - it's marked 64-bit only at the moment - the 32-bit support patch is small but did not get ready in time. - it has a number of fresh commits that came in after the merge window. The overwhelming majority of commits are from before the merge window, but still some aspects of the tree are fresh and so I marked it RFC. - it's a pretty wide-reaching feature with lots of effects - and while the components have been in testing for some time, the full combination is still not very widely used. That it's default-off should reduce its regression abilities and obviously there are no known regressions with CONFIG_NO_HZ_FULL=y enabled either. - the feature is not completely idempotent: there is no 100% equivalent replacement for a periodic scheduler/timer tick. In particular there's ongoing work to map out and reduce its effects on scheduler load-balancing and statistics. This should not impact correctness though, there are no known regressions related to this feature at this point. - it's a pretty ambitious feature that with time will likely be enabled by most Linux distros, and we'd like you to make input on its design/implementation, if you dislike some aspect we missed. Without flaming us to crisp! :-) Future plans: - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off the periodic tick altogether when there's a single busy task on a CPU. We'd first like 1 Hz to be exposed more widely before we go for the 0 Hz target though. - once we reach 0 Hz we can remove the periodic tick assumption from nr_running>=2 as well, by essentially interrupting busy tasks only as frequently as the sched_latency constraints require us to do - once every 4-40 msecs, depending on nr_running. I am personally leaning towards biting the bullet and doing this in v3.10, like the -rt tree this effort has been going on for too long - but the final word is up to you as usual. More technical details can be found in Documentation/timers/NO_HZ.txt" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits) sched: Keep at least 1 tick per second for active dynticks tasks rcu: Fix full dynticks' dependency on wide RCU nocb mode nohz: Protect smp_processor_id() in tick_nohz_task_switch() nohz_full: Add documentation. cputime_nsecs: use math64.h for nsec resolution conversion helpers nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config nohz: Reduce overhead under high-freq idling patterns nohz: Remove full dynticks' superfluous dependency on RCU tree nohz: Fix unavailable tick_stop tracepoint in dynticks idle nohz: Add basic tracing nohz: Select wide RCU nocb for full dynticks nohz: Disable the tick when irq resume in full dynticks CPU nohz: Re-evaluate the tick for the new task after a context switch nohz: Prepare to stop the tick on irq exit nohz: Implement full dynticks kick nohz: Re-evaluate the tick from the scheduler IPI sched: New helper to prevent from stopping the tick in full dynticks sched: Kick full dynticks CPU that have more than one task enqueued. perf: New helper to prevent full dynticks CPUs from stopping tick perf: Kick full dynticks CPU if events rotation is needed ...
2013-05-04sched: Keep at least 1 tick per second for active dynticks tasksFrederic Weisbecker
The scheduler doesn't yet fully support environments with a single task running without a periodic tick. In order to ensure we still maintain the duties of scheduler_tick(), keep at least 1 tick per second. This makes sure that we keep the progression of various scheduler accounting and background maintainance even with a very low granularity. Examples include cpu load, sched average, CFS entity vruntime, avenrun and events such as load balancing, amongst other details handled in sched_class::task_tick(). This limitation will be removed in the future once we get these individual items to work in full dynticks CPUs. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-05-02Merge commit '8700c95adb03' into timers/nohzFrederic Weisbecker
The full dynticks tree needs the latest RCU and sched upstream updates in order to fix some dependencies. Merge a common upstream merge point that has these updates. Conflicts: include/linux/perf_event.h kernel/rcutree.h kernel/rcutree_plugin.h Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2013-04-30exec: do not abuse ->cred_guard_mutex in threadgroup_lock()Oleg Nesterov
threadgroup_lock() takes signal->cred_guard_mutex to ensure that thread_group_leader() is stable. This doesn't look nice, the scope of this lock in do_execve() is huge. And as Dave pointed out this can lead to deadlock, we have the following dependencies: do_execve: cred_guard_mutex -> i_mutex cgroup_mount: i_mutex -> cgroup_mutex attach_task_by_pid: cgroup_mutex -> cred_guard_mutex Change de_thread() to take threadgroup_change_begin() around the switch-the-leader code and change threadgroup_lock() to avoid ->cred_guard_mutex. Note that de_thread() can't sleep with ->group_rwsem held, this can obviously deadlock with the exiting leader if the writer is active, so it does threadgroup_change_end() before schedule(). Reported-by: Dave Jones <davej@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30coredump: only SIGKILL should interrupt the coredumping taskOleg Nesterov
There are 2 well known and ancient problems with coredump/signals, and a lot of related bug reports: - do_coredump() clears TIF_SIGPENDING but of course this can't help if, say, SIGCHLD comes after that. In this case the coredump can fail unexpectedly. See for example wait_for_dump_helper()->signal_pending() check but there are other reasons. - At the same time, dumping a huge core on the slow media can take a lot of time/resources and there is no way to kill the coredumping task reliably. In particular this is not oom_kill-friendly. This patch tries to fix the 1st problem, and makes the preparation for the next changes. We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate that this process dumps the core. prepare_signal() checks this flag and nacks any signal except SIGKILL. Note that this check tries to be conservative, in the long term we should probably treat the SIGNAL_GROUP_EXIT case equally but this needs more discussion. See marc.info/?l=linux-kernel&m=120508897917439 Notes: - recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP. The patch assumes that dump_write/etc paths should never call it, but we can change it as well. - There is another source of TIF_SIGPENDING, freezer. This will be addressed separately. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Neil Horman <nhorman@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30audit: add an option to control logging of passwords with pam_tty_auditRichard Guy Briggs
Most commands are entered one line at a time and processed as complete lines in non-canonical mode. Commands that interactively require a password, enter canonical mode to do this while shutting off echo. This pair of features (icanon and !echo) can be used to avoid logging passwords by audit while still logging the rest of the command. Adding a member (log_passwd) to the struct audit_tty_status passed in by pam_tty_audit allows control of canonical mode without echo per task. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2013-04-30Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core timer updates from Ingo Molnar: "The main changes in this cycle's merge are: - Implement shadow timekeeper to shorten in kernel reader side blocking, by Thomas Gleixner. - Posix timers enhancements by Pavel Emelyanov: - allocate timer ID per process, so that exact timer ID allocations can be re-created be checkpoint/restore code. - debuggability and tooling (/proc/PID/timers, etc.) improvements. - suspend/resume enhancements by Feng Tang: on certain new Intel Atom processors (Penwell and Cloverview), there is a feature that the TSC won't stop in S3 state, so the TSC value won't be reset to 0 after resume. This can be taken advantage of by the generic via the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to recover/approximate sleep time, the main (and precise) clocksource can be used. - Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many CPUs the file goes beyond 4MB of size and thus the current simplistic seqfile approach fails. Convert /proc/timer_list to a proper seq_file with its own iterator. - Cleanups and refactorings of the core timekeeping code by John Stultz. - International Atomic Clock time is managed by the NTP code internally currently but not exposed externally. Separate the TAI code out and add CLOCK_TAI support and TAI support to the hrtimer and posix-timer code, by John Stultz. - Add deep idle support enhacement to the broadcast clockevents core timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ clockevents feature (which will be utilized by future clockevents driver updates), which allows the use of IRQ affinities to avoid spurious wakeups of idle CPUs - the right CPU with an expiring timer will be woken. - Add new ARM bcm281xx clocksource driver, by Christian Daudt - ... various other fixes and cleanups" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) clockevents: Set dummy handler on CPU_DEAD shutdown timekeeping: Update tk->cycle_last in resume posix-timers: Remove unused variable clockevents: Switch into oneshot mode even if broadcast registered late timer_list: Convert timer list to be a proper seq_file timer_list: Split timer_list_show_tickdevices posix-timers: Show sigevent info in proc file posix-timers: Introduce /proc/PID/timers file posix timers: Allocate timer id per process (v2) timekeeping: Make sure to notify hrtimers when TAI offset changes hrtimer: Fix ktime_add_ns() overflow on 32bit architectures hrtimer: Add expiry time overflow check in hrtimer_interrupt timekeeping: Shorten seq_count region timekeeping: Implement a shadow timekeeper timekeeping: Delay update of clock->cycle_last timekeeping: Store cycle_last value in timekeeper struct as well ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state timekeeping: Simplify tai updating from do_adjtimex timekeeping: Hold timekeepering locks in do_adjtimex and hardpps timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex() ...
2013-04-30Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP/hotplug changes from Ingo Molnar: "This is a pretty large, multi-arch series unifying and generalizing the various disjunct pieces of idle routines that architectures have historically copied from each other and have grown in random, wildly inconsistent and sometimes buggy directions: 101 files changed, 455 insertions(+), 1328 deletions(-) this went through a number of review and test iterations before it was committed, it was tested on various architectures, was exposed to linux-next for quite some time - nevertheless it might cause problems on architectures that don't read the mailing lists and don't regularly test linux-next. This cat herding excercise was motivated by the -rt kernel, and was brought to you by Thomas "the Whip" Gleixner." * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) idle: Remove GENERIC_IDLE_LOOP config switch um: Use generic idle loop ia64: Make sure interrupts enabled when we "safe_halt()" sparc: Use generic idle loop idle: Remove unused ARCH_HAS_DEFAULT_IDLE bfin: Fix typo in arch_cpu_idle() xtensa: Use generic idle loop x86: Use generic idle loop unicore: Use generic idle loop tile: Use generic idle loop tile: Enter idle with preemption disabled sh: Use generic idle loop score: Use generic idle loop s390: Use generic idle loop powerpc: Use generic idle loop parisc: Use generic idle loop openrisc: Use generic idle loop mn10300: Use generic idle loop mips: Use generic idle loop microblaze: Use generic idle loop ...
2013-04-30Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this development cycle were: - full dynticks preparatory work by Frederic Weisbecker - factor out the cpu time accounting code better, by Li Zefan - multi-CPU load balancer cleanups and improvements by Joonsoo Kim - various smaller fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) sched: Fix init NOHZ_IDLE flag sched: Prevent to re-select dst-cpu in load_balance() sched: Rename load_balance_tmpmask to load_balance_mask sched: Move up affinity check to mitigate useless redoing overhead sched: Don't consider other cpus in our group in case of NEWLY_IDLE sched: Explicitly cpu_idle_type checking in rebalance_domains() sched: Change position of resched_cpu() in load_balance() sched: Fix wrong rq's runnable_avg update with rt tasks sched: Document task_struct::personality field sched/cpuacct/UML: Fix header file dependency bug on the UML build cgroup: Kill subsys.active flag sched/cpuacct: No need to check subsys active state sched/cpuacct: Initialize cpuacct subsystem earlier sched/cpuacct: Initialize root cpuacct earlier sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically sched/cpuacct: Clean up cpuacct.h sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field() sched/cpuacct: Remove redundant NULL checks in cpuacct_charge() sched/cpuacct: Add cpuacct_acount_field() sched/cpuacct: Add cpuacct_init() ...
2013-04-29Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "A lot of activities on workqueue side this time. The changes achieve the followings. - WQ_UNBOUND workqueues - the workqueues which are per-cpu - are updated to be able to interface with multiple backend worker pools. This involved a lot of churning but the end result seems actually neater as unbound workqueues are now a lot closer to per-cpu ones. - The ability to interface with multiple backend worker pools are used to implement unbound workqueues with custom attributes. Currently the supported attributes are the nice level and CPU affinity. It may be expanded to include cgroup association in future. The attributes can be specified either by calling apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if the workqueue in question is exported through sysfs. The backend worker pools are keyed by the actual attributes and shared by any workqueues which share the same attributes. When attributes of a workqueue are changed, the workqueue binds to the worker pool with the specified attributes while leaving the work items which are already executing in its previous worker pools alone. This allows converting custom worker pool implementations which want worker attribute tuning to use workqueues. The writeback pool is already converted in block tree and there are a couple others are likely to follow including btrfs io workers. - WQ_UNBOUND's ability to bind to multiple worker pools is also used to make it NUMA-aware. Because there's no association between work item issuer and the specific worker assigned to execute it, before this change, using unbound workqueue led to unnecessary cross-node bouncing and it couldn't be helped by autonuma as it requires tasks to have implicit node affinity and workers are assigned randomly. After these changes, an unbound workqueue now binds to multiple NUMA-affine worker pools so that queued work items are executed in the same node. This is turned on by default but can be disabled system-wide or for individual workqueues. Crypto was requesting NUMA affinity as encrypting data across different nodes can contribute noticeable overhead and doing it per-cpu was too limiting for certain cases and IO throughput could be bottlenecked by one CPU being fully occupied while others have idle cycles. While the new features required a lot of changes including restructuring locking, it didn't complicate the execution paths much. The unbound workqueue handling is now closer to per-cpu ones and the new features are implemented by simply associating a workqueue with different sets of backend worker pools without changing queue, execution or flush paths. As such, even though the amount of change is very high, I feel relatively safe in that it isn't likely to cause subtle issues with basic correctness of work item execution and handling. If something is wrong, it's likely to show up as being associated with worker pools with the wrong attributes or OOPS while workqueue attributes are being changed or during CPU hotplug. While this creates more backend worker pools, it doesn't add too many more workers unless, of course, there are many workqueues with unique combinations of attributes. Assuming everything else is the same, NUMA awareness costs an extra worker pool per NUMA node with online CPUs. There are also a couple things which are being routed outside the workqueue tree. - block tree pulled in workqueue for-3.10 so that writeback worker pool can be converted to unbound workqueue with sysfs control exposed. This simplifies the code, makes writeback workers NUMA-aware and allows tuning nice level and CPU affinity via sysfs. - The conversion to workqueue means that there's no 1:1 association between a specific worker, which makes writeback folks unhappy as they want to be able to tell which filesystem caused a problem from backtrace on systems with many filesystems mounted. This is resolved by allowing work items to set debug info string which is printed when the task is dumped. As this change involves unifying implementations of dump_stack() and friends in arch codes, it's being routed through Andrew's -mm tree." * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits) workqueue: use kmem_cache_free() instead of kfree() workqueue: avoid false negative WARN_ON() in destroy_workqueue() workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity workqueue: implement NUMA affinity for unbound workqueues workqueue: introduce put_pwq_unlocked() workqueue: introduce numa_pwq_tbl_install() workqueue: use NUMA-aware allocation for pool_workqueues workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() workqueue: map an unbound workqueues to multiple per-node pool_workqueues workqueue: move hot fields of workqueue_struct to the end workqueue: make workqueue->name[] fixed len workqueue: add workqueue->unbound_attrs workqueue: determine NUMA node of workers accourding to the allowed cpumask workqueue: drop 'H' from kworker names of unbound worker pools workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] workqueue: move pwq_pool_locking outside of get/put_unbound_pool() workqueue: fix memory leak in apply_workqueue_attrs() workqueue: fix unbound workqueue attrs hashing / comparison workqueue: fix race condition in unbound workqueue free path workqueue: remove pwq_lock which is no longer used ...
2013-04-29Merge branch 'core-locking-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking changes from Ingo Molnar: "The most noticeable change are mutex speedups from Waiman Long, for higher loads. These scalability changes should be most noticeable on larger server systems. There are also cleanups, fixes and debuggability improvements." * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Consolidate bug messages into a single print_lockdep_off() function lockdep: Print out additional debugging advice when we hit lockdep BUGs mutex: Back out architecture specific check for negative mutex count mutex: Queue mutex spinners with MCS lock to reduce cacheline contention mutex: Make more scalable by doing less atomic operations mutex: Move mutex spinning code from sched/core.c back to mutex.c locking/rtmutex/tester: Set correct permissions on sysfs files lockdep: Remove unnecessary 'hlock_next' variable
2013-04-26sched: Fix init NOHZ_IDLE flagVincent Guittot
On my SMP platform which is made of 5 cores in 2 clusters, I have the nr_busy_cpu field of sched_group_power struct that is not null when the platform is fully idle - which makes the scheduler unhappy. The root cause is: During the boot sequence, some CPUs reach the idle loop and set their NOHZ_IDLE flag while waiting for others CPUs to boot. But the nr_busy_cpus field is initialized later with the assumption that all CPUs are in the busy state whereas some CPUs have already set their NOHZ_IDLE flag. More generally, the NOHZ_IDLE flag must be initialized when new sched_domains are created in order to ensure that NOHZ_IDLE and nr_busy_cpus are aligned. This condition can be ensured by adding a synchronize_rcu() between the destruction of old sched_domains and the creation of new ones so the NOHZ_IDLE flag will not be updated with old sched_domain once it has been initialized. But this solution introduces a additionnal latency in the rebuild sequence that is called during cpu hotplug. As suggested by Frederic Weisbecker, another solution is to have the same rcu lifecycle for both NOHZ_IDLE and sched_domain struct. A new nohz_idle field is added to sched_domain so both status and sched_domain will share the same RCU lifecycle and will be always synchronized. In addition, there is no more need to protect nohz_idle against concurrent access as it is only modified by 2 exclusive functions called by local cpu. This solution has been prefered to the creation of a new struct with an extra pointer indirection for sched_domain. The synchronization is done at the cost of : - An additional indirection and a rcu_dereference for accessing nohz_idle. - We use only the nohz_idle field of the top sched_domain. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: linaro-kernel@lists.linaro.org Cc: peterz@infradead.org Cc: fweisbec@gmail.com Cc: pjt@google.com Cc: rostedt@goodmis.org Cc: efault@gmx.de Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org [ Fixed !NO_HZ build bug. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24Merge branch 'linus' into timers/coreThomas Gleixner
Reason: Get upstream fixes before adding conflicting code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-22sched: New helper to prevent from stopping the tick in full dynticksFrederic Weisbecker
Provide a new helper to be called from the full dynticks engine before stopping the tick in order to make sure we don't stop it when there is more than one task running on the CPU. This way we make sure that the tick stays alive to maintain fairness. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19mutex: Move mutex spinning code from sched/core.c back to mutex.cWaiman Long
As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler feature bit was really just an early hack to make with/without mutex-spinning testable. So it is no longer necessary. This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and move the mutex spinning code from kernel/sched/core.c back to kernel/mutex.c which is where they should belong. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Norton Scott J <scott.norton@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-17posix timers: Allocate timer id per process (v2)Pavel Emelyanov
Currently kernel generates IDs for posix timers in a global manner -- there's a kernel-wide IDR tree from which IDs are created. This makes it impossible to recreate a timer with a desired ID (in particular this is done by the CRIU checkpoint-restore project) -- since these IDs are global it may happen, that at the time we recreate a timer, the ID we want for it is already busy by some other timer. In order to address this, replace the IDR tree with a global hash table for timers and makes timer IDs unique per signal_struct (to which timers are linked anyway). With this, two timers belonging to different processes may have equal IDs and we can recreate either of them with the ID we want. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Matthew Helsley <matt.helsley@gmail.com> Link: http://lkml.kernel.org/r/513D9FF5.9010004@parallels.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-12kthread: Prevent unpark race which puts threads on the wrong cpuThomas Gleixner
The smpboot threads rely on the park/unpark mechanism which binds per cpu threads on a particular core. Though the functionality is racy: CPU0 CPU1 CPU2 unpark(T) wake_up_process(T) clear(SHOULD_PARK) T runs leave parkme() due to !SHOULD_PARK bind_to(CPU2) BUG_ON(wrong CPU) We cannot let the tasks move themself to the target CPU as one of those tasks is actually the migration thread itself, which requires that it starts running on the target cpu right away. The solution to this problem is to prevent wakeups in park mode which are not from unpark(). That way we can guarantee that the association of the task to the target cpu is working correctly. Add a new task state (TASK_PARKED) which prevents other wakeups and use this state explicitly for the unpark wakeup. Peter noticed: Also, since the task state is visible to userspace and all the parked tasks are still in the PID space, its a good hint in ps and friends that these tasks aren't really there for the moment. The migration thread has another related issue. CPU0 CPU1 Bring up CPU2 create_thread(T) park(T) wait_for_completion() parkme() complete() sched_set_stop_task() schedule(TASK_PARKED) The sched_set_stop_task() call is issued while the task is on the runqueue of CPU1 and that confuses the hell out of the stop_task class on that cpu. So we need the same synchronizaion before sched_set_stop_task(). Reported-by: Dave Jones <davej@redhat.com> Reported-and-tested-by: Dave Hansen <dave@sr71.net> Reported-and-tested-by: Borislav Petkov <bp@alien8.de> Acked-by: Peter Ziljstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: dhillf@gmail.com Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-12sched: Document task_struct::personality fieldAndrei Epure
Signed-off-by: Andrei Epure <epure.andrei@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1365701429-4721-1-git-send-email-epure.andrei@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-08idle: Implement set/clr functions for need_resched pollThomas Gleixner
Implement set/clear functions for the idle need_resched poll implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Magnus Damm <magnus.damm@gmail.com> Link: http://lkml.kernel.org/r/20130321215233.518839807@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-08arch: Consolidate tsk_is_polling()Thomas Gleixner
Move it to a common place. Preparatory patch for implementing set/clear for the idle need_resched poll implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Magnus Damm <magnus.damm@gmail.com> Link: http://lkml.kernel.org/r/20130321215233.446034505@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-03nohz: Rename CONFIG_NO_HZ to CONFIG_NO_HZ_COMMONFrederic Weisbecker
We are planning to convert the dynticks Kconfig options layout into a choice menu. The user must be able to easily pick any of the following implementations: constant periodic tick, idle dynticks, full dynticks. As this implies a mutual exclusion, the two dynticks implementions need to converge on the selection of a common Kconfig option in order to ease the sharing of a common infrastructure. It would thus seem pretty natural to reuse CONFIG_NO_HZ to that end. It already implements all the idle dynticks code and the full dynticks depends on all that code for now. So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ. On the other hand we want to stay backward compatible: if CONFIG_NO_HZ is set in an older config file, we want to enable CONFIG_NO_HZ_IDLE by default. But we can't afford both at the same time or we run into a circular dependency: 1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select CONFIG_NO_HZ 2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE We might be able to support that from Kconfig/Kbuild but it may not be wise to introduce such a confusing behaviour. So to solve this, create a new CONFIG_NO_HZ_COMMON option which gathers the common code between idle and full dynticks (that common code for now is simply the idle dynticks code) and select it from their referring Kconfig. Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ to it for backward compatibility. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-23bcache: A block layer cacheKent Overstreet
Does writethrough and writeback caching, handles unclean shutdown, and has a bunch of other nifty features motivated by real world usage. See the wiki at http://bcache.evilpiepirate.org for more. Signed-off-by: Kent Overstreet <koverstreet@google.com>
2013-03-21nohz: Wake up full dynticks CPUs when a timer gets enqueuedFrederic Weisbecker
Wake up a CPU when a timer list timer is enqueued there and the target is part of the full dynticks range. Sending an IPI to it makes it reconsidering the next timer to program on top of recent updates. This may later be improved by checking if the tick is really stopped on the target. This would need some careful synchronization though. So deal with such optimization later and start simple. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-19sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITYTejun Heo
PF_THREAD_BOUND was originally used to mark kernel threads which were bound to a specific CPU using kthread_bind() and a task with the flag set allows cpus_allowed modifications only to itself. Workqueue is currently abusing it to prevent userland from meddling with cpus_allowed of workqueue workers. What we need is a flag to prevent userland from messing with cpus_allowed of certain kernel tasks. In kernel, anyone can (incorrectly) squash the flag, and, for worker-type usages, restricting cpus_allowed modification to the task itself doesn't provide meaningful extra proection as other tasks can inject work items to the task anyway. This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY. sched_setaffinity() checks the flag and return -EINVAL if set. set_cpus_allowed_ptr() is no longer affected by the flag. This will allow simplifying workqueue worker CPU affinity management. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-08Merge branch 'sched/cputime' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into sched/core Pull cputime changes from Frederic Weisbecker: * Generalize exception handling * Fix race in context tracking state restore on return from exception and irq exit kernel preemption * Fix cputime scaling in full dynticks accounting dynamic off-case * Fix default Kconfig value Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-07cputime: Dynamically scale cputime for full dynticks accountingFrederic Weisbecker
The full dynticks cputime accounting is able to account either using the tick or the context tracking subsystem. This way the housekeeping CPU can keep the low overhead tick based solution. This latter mode has a low jiffies resolution granularity and need to be scaled against CFS precise runtime accounting to improve its result. We are doing this for CONFIG_TICK_CPU_ACCOUNTING, now we also need to expand it to full dynticks accounting dynamic off-case as well. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Kevin Hilman <khilman@linaro.org> Cc: Mats Liljegren <mats.liljegren@enea.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-06sched: Move group scheduling functions out of include/linux/sched.hLi Zefan
- Make sched_group_{set_,}runtime(), sched_group_{set_,}period() and sched_rt_can_attach() static. - Move sched_{create,destroy,online,offline}_group() to kernel/sched/sched.h. - Remove declaration of sched_group_shares(). Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A7C5.3000708@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-06sched: Make default_scale_freq_power() staticLi Zefan
As default_scale_{freq,smt}_power() and update_rt_power() are used in kernel/sched/fair.c only, annotate them as static functions. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A7AF.8010900@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-06sched: Move struct sched_class to kernel/sched/sched.hLi Zefan
It's used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A79F.8090502@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-06sched: Move wake flags to kernel/sched/sched.hLi Zefan
They are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A78E.7040609@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-06sched: Move struct sched_group to kernel/sched/sched.hLi Zefan
Move struct sched_group_power and sched_group and related inline functions to kernel/sched/sched.h, as they are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-06sched: Move SCHED_LOAD_SHIFT macros to kernel/sched/sched.hLi Zefan
They are used internally only. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A771.4070104@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-06sched: Remove test_sd_parent()Li Zefan
It's unused. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A75F.4070202@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-03-06sched: Remove some dummy functionsLi Zefan
No one will call those functions if CONFIG_SCHED_DEBUG=n. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5135A748.3050206@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-27coredump: remove redundant defines for dumpable statesKees Cook
The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_* defines introduced in 54b501992dd2 ("coredump: warn about unsafe suid_dumpable / core_pattern combo"). Remove the new ones, and use the prior values instead. Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Chen Gang <gang.chen@asianux.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@linux.intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>