summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2016-07-11bpf, perf: delay release of BPF prog after grace periodDaniel Borkmann
[ Upstream commit ceb56070359b7329b5678b5d95a376fcb24767be ] Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved destruction of BPF program from free_event_rcu() callback to __free_event(), which is problematic if used with tail calls: if prog A is attached as trace event directly, but at the same time present in a tail call map used by another trace event program elsewhere, then we need to delay destruction via RCU grace period since it can still be in use by the program doing the tail call (the prog first needs to be dropped from the tail call map, then trace event with prog A attached destroyed, so we get immediate destruction). Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Jann Horn <jann@thejh.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24sched: panic on corrupted stack endJann Horn
commit 29d6455178a09e1dc340380c582b13356227e8df upstream. Until now, hitting this BUG_ON caused a recursive oops (because oops handling involves do_exit(), which calls into the scheduler, which in turn raises an oops), which caused stuff below the stack to be overwritten until a panic happened (e.g. via an oops in interrupt context, caused by the overwritten CPU index in the thread_info). Just panic directly. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24bpf, trace: use READ_ONCE for retrieving file ptrDaniel Borkmann
[ Upstream commit 5b6c1b4d46b0dae4edea636a776d09f2064f4cd7 ] In bpf_perf_event_read() and bpf_perf_event_output(), we must use READ_ONCE() for fetching the struct file pointer, which could get updated concurrently, so we must prevent the compiler from potential refetching. We already do this with tail calls for fetching the related bpf_prog, but not so on stored perf events. Semantics for both are the same with regards to updates. Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24bpf, inode: disallow userns mountsDaniel Borkmann
[ Upstream commit 612bacad78ba6d0a91166fc4487af114bac172a8 ] Follow-up to commit e27f4a942a0e ("bpf: Use mount_nodev not mount_ns to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag. The original idea was to have a per mountns instance instead of a single global fs instance, but that didn't work out and we had to switch to mount_nodev() model. The intent of that middle ground was that we avoid users who don't play nice to create endless instances of bpf fs which are difficult to control and discover from an admin point of view, but at the same time it would have allowed us to be more flexible with regard to namespaces. Therefore, since we now did the switch to mount_nodev() as a fix where individual instances are created, we also need to remove userns mount flag along with it to avoid running into mentioned situation. I don't expect any breakage at this early point in time with removing the flag and we can revisit this later should the requirement for this come up with future users. This and commit e27f4a942a0e have been split to facilitate tracking should any of them run into the unlikely case of causing a regression. Fixes: b2197755b263 ("bpf: add support for persistent maps/progs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24bpf: Use mount_nodev not mount_ns to mount the bpf filesystemEric W. Biederman
[ Upstream commit e27f4a942a0ee4b84567a3c6cfa84f273e55cbb7 ] While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the bpf filesystem. Looking at the code I saw a broken usage of mount_ns with current->nsproxy->mnt_ns. As the code does not acquire a reference to the mount namespace it can not possibly be correct to store the mount namespace on the superblock as it does. Replace mount_ns with mount_nodev so that each mount of the bpf filesystem returns a distinct instance, and the code is not buggy. In discussion with Hannes Frederic Sowa it was reported that the use of mount_ns was an attempt to have one bpf instance per mount namespace, in an attempt to keep resources that pin resources from hiding. That intent simply does not work, the vfs is not built to allow that kind of behavior. Which means that the bpf filesystem really is buggy both semantically and in it's implemenation as it does not nor can it implement the original intent. This change is userspace visible, but my experience with similar filesystems leads me to believe nothing will break with a model of each mount of the bpf filesystem is distinct from all others. Fixes: b2197755b263 ("bpf: add support for persistent maps/progs") Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-07wait/ptrace: assume __WALL if the child is tracedOleg Nesterov
commit bf959931ddb88c4e4366e96dd22e68fa0db9527c upstream. The following program (simplified version of generated by syzkaller) #include <pthread.h> #include <unistd.h> #include <sys/ptrace.h> #include <stdio.h> #include <signal.h> void *thread_func(void *arg) { ptrace(PTRACE_TRACEME, 0,0,0); return 0; } int main(void) { pthread_t thread; if (fork()) return 0; while (getppid() != 1) ; pthread_create(&thread, NULL, thread_func, NULL); pthread_join(thread, NULL); return 0; } creates an unreapable zombie if /sbin/init doesn't use __WALL. This is not a kernel bug, at least in a sense that everything works as expected: debugger should reap a traced sub-thread before it can reap the leader, but without __WALL/__WCLONE do_wait() ignores sub-threads. Unfortunately, it seems that /sbin/init in most (all?) distributions doesn't use it and we have to change the kernel to avoid the problem. Note also that most init's use sys_waitid() which doesn't allow __WALL, so the necessary user-space fix is not that trivial. This patch just adds the "ptrace" check into eligible_child(). To some degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal is mostly ignored when the tracee reports to debugger. Or WSTOPPED, the tracer doesn't need to set this flag to wait for the stopped tracee. This obviously means the user-visible change: __WCLONE and __WALL no longer have any meaning for debugger. And I can only hope that this won't break something, but at least strace/gdb won't suffer. We could make a more conservative change. Say, we can take __WCLONE into account, or !thread_group_leader(). But it would be nice to not complicate these historical/confusing checks. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Pedro Alves <palves@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: <syzkaller@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systemsVik Heyndrickx
commit 20878232c52329f92423d27a60e48b6a6389e0dd upstream. Systems show a minimal load average of 0.00, 0.01, 0.05 even when they have no load at all. Uptime and /proc/loadavg on all systems with kernels released during the last five years up until kernel version 4.6-rc5, show a 5- and 15-minute minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on idle systems, but the way the kernel calculates this value prevents it from getting lower than the mentioned values. Likewise but not as obviously noticeable, a fully loaded system with no processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95 (multiplied by number of cores). Once the (old) load becomes 93 or higher, it mathematically can never get lower than 93, even when the active (load) remains 0 forever. This results in the strange 0.00, 0.01, 0.05 uptime values on idle systems. Note: 93/2048 = 0.0454..., which rounds up to 0.05. It is not correct to add a 0.5 rounding (=1024/2048) here, since the result from this function is fed back into the next iteration again, so the result of that +0.5 rounding value then gets multiplied by (2048-2037), and then rounded again, so there is a virtual "ghost" load created, next to the old and active load terms. By changing the way the internally kept value is rounded, that internal value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon increasing load, the internally kept load value is rounded up, when the load is decreasing, the load value is rounded down. The modified code was tested on nohz=off and nohz kernels. It was tested on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was tested on single, dual, and octal cores system. It was tested on virtual hosts and bare hardware. No unwanted effects have been observed, and the problems that the patch intended to fix were indeed gone. Tested-by: Damien Wyart <damien.wyart@free.fr> Signed-off-by: Vik Heyndrickx <vik.heyndrickx@veribox.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Doug Smythies <dsmythies@telus.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 0f004f5a696a ("sched: Cure more NO_HZ load average woes") Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01ring-buffer: Prevent overflow of size in ring_buffer_resize()Steven Rostedt (Red Hat)
commit 59643d1535eb220668692a5359de22545af579f6 upstream. If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE then the DIV_ROUND_UP() will return zero. Here's the details: # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb tracing_entries_write() processes this and converts kb to bytes. 18014398509481980 << 10 = 18446744073709547520 and this is passed to ring_buffer_resize() as unsigned long size. size = DIV_ROUND_UP(size, BUF_PAGE_SIZE); Where DIV_ROUND_UP(a, b) is (a + b - 1)/b BUF_PAGE_SIZE is 4080 and here 18446744073709547520 + 4080 - 1 = 18446744073709551599 where 18446744073709551599 is still smaller than 2^64 2^64 - 18446744073709551599 = 17 But now 18446744073709551599 / 4080 = 4521260802379792 and size = size * 4080 = 18446744073709551360 This is checked to make sure its still greater than 2 * 4080, which it is. Then we convert to the number of buffer pages needed. nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE) but this time size is 18446744073709551360 and 2^64 - (18446744073709551360 + 4080 - 1) = -3823 Thus it overflows and the resulting number is less than 4080, which makes 3823 / 4080 = 0 an nr_pages is set to this. As we already checked against the minimum that nr_pages may be, this causes the logic to fail as well, and we crash the kernel. There's no reason to have the two DIV_ROUND_UP() (that's just result of historical code changes), clean up the code and fix this bug. Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-01ring-buffer: Use long for nr_pages to avoid overflow failuresSteven Rostedt (Red Hat)
commit 9b94a8fba501f38368aef6ac1b30e7335252a220 upstream. The size variable to change the ring buffer in ftrace is a long. The nr_pages used to update the ring buffer based on the size is int. On 64 bit machines this can cause an overflow problem. For example, the following will cause the ring buffer to crash: # cd /sys/kernel/debug/tracing # echo 10 > buffer_size_kb # echo 8556384240 > buffer_size_kb Then you get the warning of: WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260 Which is: RB_WARN_ON(cpu_buffer, nr_removed); Note each ring buffer page holds 4080 bytes. This is because: 1) 10 causes the ring buffer to have 3 pages. (10kb requires 3 * 4080 pages to hold) 2) (2^31 / 2^10 + 1) * 4080 = 8556384240 The value written into buffer_size_kb is shifted by 10 and then passed to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760 3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE which is 4080. 8761737461760 / 4080 = 2147484672 4) nr_pages is subtracted from the current nr_pages (3) and we get: 2147484669. This value is saved in a signed integer nr_pages_to_update 5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int turns into the value of -2147482627 6) As the value is a negative number, in update_pages_handler() it is negated and passed to rb_remove_pages() and 2147482627 pages will be removed, which is much larger than 3 and it causes the warning because not all the pages asked to be removed were removed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001 Fixes: 7a8e76a3829f1 ("tracing: unified trace buffer") Reported-by: Hao Qin <QEver.cn@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-13Merge branch 'for-4.6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "During v4.6-rc1 cgroup namespace support was merged. There is an issue where it's impossible to tell whether a given cgroup mount point is bind mounted or namespaced. Serge has been working on the issue but it took longer than expected to resolve, so the late pull request. Given that it's a completely new feature and the patches don't touch anything else, the risk seems acceptable. However, if this is too late, an alternative is plugging new cgroup ns creation for v4.6 and retrying for v4.7" * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix compile warning kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces kernfs_path_from_node_locked: don't overwrite nlen
2016-05-13Merge branch 'for-4.6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "CPU hotplug callbacks can invoke DOWN_FAILED w/o preceding DOWN_PREPARE which can trigger a WARN_ON() in workqueue. The bug has been there for a very long time. It only triggers if CPU down fails at a specific point and I don't think it has adverse effects other than the warning messages. The fix is very low impact" * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix rebind bound workers warning
2016-05-13Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "This is a revert to fix an interactivity problem. The proper fixes for the problems that the reverted commit exposed are now in sched/core (consisting of 3 patches), but were too risky for v4.6 and will arrive in the v4.7 merge window" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "sched/fair: Fix fairness issue on migration"
2016-05-12workqueue: fix rebind bound workers warningWanpeng Li
------------[ cut here ]------------ WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0 Modules linked in: CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31 Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013 0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010 0000000000000000 0000000000000000 0000000000000000 ffff881037babba8 ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046 Call Trace: dump_stack+0x89/0xd4 __warn+0xfd/0x120 warn_slowpath_null+0x1d/0x20 rebind_workers+0x1c0/0x1d0 workqueue_cpu_up_callback+0xf5/0x1d0 notifier_call_chain+0x64/0x90 ? trace_hardirqs_on_caller+0xf2/0x220 ? notify_prepare+0x80/0x80 __raw_notifier_call_chain+0xe/0x10 __cpu_notify+0x35/0x50 notify_down_prepare+0x5e/0x80 ? notify_prepare+0x80/0x80 cpuhp_invoke_callback+0x73/0x330 ? __schedule+0x33e/0x8a0 cpuhp_down_callbacks+0x51/0xc0 cpuhp_thread_fun+0xc1/0xf0 smpboot_thread_fn+0x159/0x2a0 ? smpboot_create_threads+0x80/0x80 kthread+0xef/0x110 ? wait_for_completion+0xf0/0x120 ? schedule_tail+0x35/0xf0 ret_from_fork+0x22/0x50 ? __init_kthread_worker+0x70/0x70 ---[ end trace eb12ae47d2382d8f ]--- notify_down_prepare: attempt to take down CPU 0 failed This bug can be reproduced by below config w/ nohz_full= all cpus: CONFIG_BOOTPARAM_HOTPLUG_CPU0=y CONFIG_DEBUG_HOTPLUG_CPU0=y CONFIG_NO_HZ_FULL=y As Thomas pointed out: | If a down prepare callback fails, then DOWN_FAILED is invoked for all | callbacks which have successfully executed DOWN_PREPARE. | | But, workqueue has actually two notifiers. One which handles | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE. | | Now look at the priorities of those callbacks: | | CPU_PRI_WORKQUEUE_UP = 5 | CPU_PRI_WORKQUEUE_DOWN = -5 | | So the call order on DOWN_PREPARE is: | | CB 1 | CB ... | CB workqueue_up() -> Ignores DOWN_PREPARE | CB ... | CB X ---> Fails | | So we call up to CB X with DOWN_FAILED | | CB 1 | CB ... | CB workqueue_up() -> Handles DOWN_FAILED | CB ... | CB X-1 | | So the problem is that the workqueue stuff handles DOWN_FAILED in the up | callback, while it should do it in the down callback. Which is not a good idea | either because it wants to be called early on rollback... | | Brilliant stuff, isn't it? The hotplug rework will solve this problem because | the callbacks become symetric, but for the existing mess, we need some | workaround in the workqueue code. The boot CPU handles housekeeping duty(unbound timers, workqueues, timekeeping, ...) on behalf of full dynticks CPUs. It must remain online when nohz full is enabled. There is a priority set to every notifier_blocks: workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down So tick_nohz_cpu_down callback failed when down prepare cpu 0, and notifier_blocks behind tick_nohz_cpu_down will not be called any more, which leads to workers are actually not unbound. Then hotplug state machine will fallback to undo and online cpu 0 again. Workers will be rebound unconditionally even if they are not unbound and trigger the warning in this progress. This patch fix it by catching !DISASSOCIATED to avoid rebind bound workers. Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
2016-05-12cgroup: fix compile warningFelipe Balbi
commit 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") added the following compile warning: kernel/cgroup.c: In function ‘cgroup_show_path’: kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable] int len = 0, ret = 0; ^ fix it. Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-12perf/core: Disable the event on a truncated AUX recordAlexander Shishkin
When the PMU driver reports a truncated AUX record, it effectively means that there is no more usable room in the event's AUX buffer (even though there may still be some room, so that perf_aux_output_begin() doesn't take action). At this point the consumer still has to be woken up and the event has to be disabled, otherwise the event will just keep spinning between perf_aux_output_begin() and perf_aux_output_end() until its context gets unscheduled. Again, for cpu-wide events this means never, so once in this condition, they will be forever losing data. Fix this by disabling the event and waking up the consumer in case of a truncated AUX record. Reported-by: Markus Metzger <markus.t.metzger@intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1462886313-13660-3-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-11Revert "sched/fair: Fix fairness issue on migration"Ingo Molnar
Mike reported that this recent commit: 3a47d5124a95 ("sched/fair: Fix fairness issue on migration") ... broke interactivity and the signal starvation test. We have a proper fix series in the works but ran out of time for v4.6, so revert the commit. Reported-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A UP kernel cpufreq fix and a rt/dl scheduler corner case fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/rt, sched/dl: Don't push if task's scheduling class was changed sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakage
2016-05-10sched/rt, sched/dl: Don't push if task's scheduling class was changedXunlei Pang
We got this warning: WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 set_task_cpu+0x1af/0x1c0 [...] Call Trace: dump_stack+0x63/0x87 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 set_task_cpu+0x1af/0x1c0 push_dl_task.part.34+0xea/0x180 push_dl_tasks+0x17/0x30 __balance_callback+0x45/0x5c __sched_setscheduler+0x906/0xb90 SyS_sched_setattr+0x150/0x190 do_syscall_64+0x62/0x110 entry_SYSCALL64_slow_path+0x25/0x25 This corresponds to: WARN_ON_ONCE(p->state == TASK_RUNNING && p->sched_class == &fair_sched_class && (p->on_rq && !task_on_rq_migrating(p))) It happens because in find_lock_later_rq(), the task whose scheduling class was changed to fair class is still pushed away as if it were a deadline task ... So, check in find_lock_later_rq() after double_lock_balance(), if the scheduling class of the deadline task was changed, break and retry. Apply the same logic to RT tasks. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/1462767091-1215-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-09perf/core: Change the default paranoia level to 2Andy Lutomirski
Allowing unprivileged kernel profiling lets any user dump follow kernel control flow and dump kernel registers. This most likely allows trivial kASLR bypassing, and it may allow other mischief as well. (Off the top of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads could be quite interesting.) Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-09cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespacesSerge E. Hallyn
Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)" cat /proc/self/mountinfo | grep freezer | awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-07sched/fair: Fix !CONFIG_SMP kernel cpufreq governor breakageRafael J. Wysocki
The following commit: 34e2c555f3e1 ("cpufreq: Add mechanism for registering utilization update callbacks") overlooked the fact that update_load_avg(), where CFS invokes cpufreq utilization update callbacks, becomes an empty stub on UP kernels. In consequence, if !CONFIG_SMP, cpufreq governors are never invoked from CFS and they do not have a chance to evaluate CPU performace levels and update them often enough. Needless to say, things don't work as expected then. Fix the problem by making the !CONFIG_SMP stub of update_load_avg() invoke cpufreq update callbacks too. Reported-by: Steve Muckle <steve.muckle@linaro.org> Tested-by: Steve Muckle <steve.muckle@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Steve Muckle <steve.muckle@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux PM list <linux-pm@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Fixes: 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization update callbacks) Link: http://lkml.kernel.org/r/6282396.VVEdgVYxO3@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-06Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "This contains a single fix that fixes a nohz tick stopping bug when mixed-poliocy SCHED_FIFO and SCHED_RR tasks are present on a runqueue" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()
2016-05-03Merge tag 'trace-fixes-v4.6-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Chunyu Hu noticed that if one writes into the trigger files within the ftrace subsystem of events that it can cause an oops. This file is only writable by root, but still is a bug that needs to be fixed" * tag 'trace-fixes-v4.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Don't display trigger file for events that can't be enabled
2016-05-03tracing: Don't display trigger file for events that can't be enabledChunyu Hu
Currently register functions for events will be called through the 'reg' field of event class directly without any check when seting up triggers. Triggers for events that don't support register through debug fs (events under events/ftrace are for trace-cmd to read event format, and most of them don't have a register function except events/ftrace/functionx) can't be enabled at all, and an oops will be hit when setting up trigger for those events, so just not creating them is an easy way to avoid the oops. Link: http://lkml.kernel.org/r/1462275274-3911-1-git-send-email-chuhu@redhat.com Cc: stable@vger.kernel.org # 3.14+ Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) MODULE_FIRMWARE firmware string not correct for iwlwifi 8000 chips, from Sara Sharon. 2) Fix SKB size checks in batman-adv stack on receive, from Sven Eckelmann. 3) Leak fix on mac80211 interface add error paths, from Johannes Berg. 4) Cannot invoke napi_disable() with BH disabled in myri10ge driver, fix from Stanislaw Gruszka. 5) Fix sign extension problem when computing feature masks in net_gso_ok(), from Marcelo Ricardo Leitner. 6) lan78xx driver doesn't count packets and packet lengths in its statistics properly, fix from Woojung Huh. 7) Fix the buffer allocation sizes in pegasus USB driver, from Petko Manolov. 8) Fix refcount overflows in bpf, from Alexei Starovoitov. 9) Unified dst cache handling introduced a preempt warning in ip_tunnel, fix by resetting rather then setting the cached route. From Paolo Abeni. 10) Listener hash collision test fix in soreuseport, from Craig Gallak * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits) gre: do not pull header in ICMP error processing net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case tipc: only process unicast on intended node cxgb3: fix out of bounds read net/smscx5xx: use the device tree for mac address soreuseport: Fix TCP listener hash collision net: l2tp: fix reversed udp6 checksum flags ip_tunnel: fix preempt warning in ip tunnel creation/updating samples/bpf: fix trace_output example bpf: fix check_map_func_compatibility logic bpf: fix refcnt overflow drivers: net: cpsw: use of_phy_connect() in fixed-link case dt: cpsw: phy-handle, phy_id, and fixed-link are mutually exclusive drivers: net: cpsw: don't ignore phy-mode if phy-handle is used drivers: net: cpsw: fix segfault in case of bad phy-handle drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac config MAINTAINERS: net: Change maintainer for GRETH 10/100/1G Ethernet MAC device driver gre: reject GUE and FOU in collect metadata mode pegasus: fixes reported packet length pegasus: fixes URB buffer allocation size; ...
2016-04-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge fixes from Andrew Morton: "20 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: Documentation/sysctl/vm.txt: update numa_zonelist_order description lib/stackdepot.c: allow the stack trace hash to be zero rapidio: fix potential NULL pointer dereference mm/memory-failure: fix race with compound page split/merge ocfs2/dlm: return zero if deref_done message is successfully handled Ananth has moved kcov: don't profile branches in kcov kcov: don't trace the code coverage code mm: wake kcompactd before kswapd's short sleep .mailmap: add Frank Rowand mm/hwpoison: fix wrong num_poisoned_pages accounting mm: call swap_slot_free_notify() with page lock held mm: vmscan: reclaim highmem zone if buffer_heads is over limit numa: fix /proc/<pid>/numa_maps for THP mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check mailmap: fix Krzysztof Kozlowski's misspelled name thp: keep huge zero page pinned until tlb flush mm: exclude HugeTLB pages from THP page_mapped() logic kexec: export OFFSET(page.compound_head) to find out compound tail page kexec: update VMCOREINFO for compound_order/dtor
2016-04-28Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "x86 PMU driver fixes plus a core code race fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix incorrect lbr_sel_mask value perf/x86/intel/pt: Don't die on VMXON perf/core: Fix perf_event_open() vs. execve() race perf/x86/amd: Set the size of event map array to PERF_COUNT_HW_MAX perf/core: Make sysctl_perf_cpu_time_max_percent conform to documentation perf/x86/intel/rapl: Add missing Haswell model perf/x86/intel: Add model number for Skylake Server to perf
2016-04-28Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Two lockdep fixes" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Fix lock_chain::base size locking/lockdep: Fix ->irq_context calculation
2016-04-28kcov: don't profile branches in kcovAndrey Ryabinin
Profiling 'if' statements in __sanitizer_cov_trace_pc() leads to unbound recursion and crash: __sanitizer_cov_trace_pc() -> ftrace_likely_update -> __sanitizer_cov_trace_pc() ... Define DISABLE_BRANCH_PROFILING to disable this tracer. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28kcov: don't trace the code coverage codeJames Morse
Kcov causes the compiler to add a call to __sanitizer_cov_trace_pc() in every basic block. Ftrace patches in a call to _mcount() to each function it has annotated. Letting these mechanisms annotate each other is a bad thing. Break the loop by adding 'notrace' to __sanitizer_cov_trace_pc() so that ftrace won't try to patch this code. This patch lets arm64 with KCOV and STACK_TRACER boot. Signed-off-by: James Morse <james.morse@arm.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28kexec: export OFFSET(page.compound_head) to find out compound tail pageAtsushi Kumagai
PageAnon() always look at head page to check PAGE_MAPPING_ANON and tail page's page->mapping has just a poisoned data since commit 1c290f642101 ("mm: sanitize page->mapping for tail pages"). If makedumpfile checks page->mapping of a compound tail page to distinguish anonymous page as usual, it must fail in newer kernel. So it's necessary to export OFFSET(page.compound_head) to avoid checking compound tail pages. The problem is that unnecessary hugepages won't be removed from a dump file in kernels 4.5.x and later. This means that extra disk space would be consumed. It's a problem, but not critical. Signed-off-by: Atsushi Kumagai <ats-kumagai@wm.jp.nec.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28kexec: update VMCOREINFO for compound_order/dtorAtsushi Kumagai
makedumpfile refers page.lru.next to get the order of compound pages for page filtering. However, now the order is stored in page.compound_order, hence VMCOREINFO should be updated to export the offset of page.compound_order. The fact is, page.compound_order was introduced already in kernel 4.0, but the offset of it was the same as page.lru.next until kernel 4.3, so this was not actual problem. The above can be said also for page.lru.prev and page.compound_dtor, it's necessary to detect hugetlbfs pages. Further, the content was changed from direct address to the ID which means dtor. The problem is that unnecessary hugepages won't be removed from a dump file in kernels 4.4.x and later. This means that extra disk space would be consumed. It's a problem, but not critical. Signed-off-by: Atsushi Kumagai <ats-kumagai@wm.jp.nec.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28bpf: fix check_map_func_compatibility logicAlexei Starovoitov
The commit 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter") introduced clever way to check bpf_helper<->map_type compatibility. Later on commit a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") adjusted the logic and inadvertently broke it. Get rid of the clever bool compare and go back to two-way check from map and from helper perspective. Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28bpf: fix refcnt overflowAlexei Starovoitov
On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK, the malicious application may overflow 32-bit bpf program refcnt. It's also possible to overflow map refcnt on 1Tb system. Impose 32k hard limit which means that the same bpf program or map cannot be shared by more than 32k processes. Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28perf/core: Fix perf_event_open() vs. execve() racePeter Zijlstra
Jann reported that the ptrace_may_access() check in find_lively_task_by_vpid() is racy against exec(). Specifically: perf_event_open() execve() ptrace_may_access() commit_creds() ... if (get_dumpable() != SUID_DUMP_USER) perf_event_exit_task(); perf_install_in_context() would result in installing a counter across the creds boundary. Fix this by wrapping lots of perf_event_open() in cred_guard_mutex. This should be fine as perf_event_exit_task() is already called with cred_guard_mutex held, so all perf locks already nest inside it. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()Peter Zijlstra
Chris Metcalf reported a that sched_can_stop_tick() sometimes fails to re-enable the tick. His observed problem is that rq->cfs.nr_running can be 1 even though there are multiple runnable CFS tasks. This happens in the cgroup case, in which case cfs.nr_running is the number of runnable entities for that level. If there is a single runnable cgroup (which can have an arbitrary number of runnable child entries itself) rq->cfs.nr_running will be 1. However, looking at that function I think there's more problems with it. It seems to assume that if there's FIFO tasks, those will run. This is incorrect. The FIFO task can have a lower prio than an RR task, in which case the RR task will run. So the whole fifo_nr_running test seems misplaced, it should go after the rr_nr_running tests. That is, only if !rr_nr_running, can we use fifo_nr_running like this. Reported-by: Chris Metcalf <cmetcalf@mellanox.com> Tested-by: Chris Metcalf <cmetcalf@mellanox.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <kernellwp@gmail.com> Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model") Link: http://lkml.kernel.org/r/20160421160315.GK24771@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-27Merge branch 'for-4.6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "So, it turns out we had a silly bug in the most fundamental part of workqueue for a very long time. AFAICS, this dates back to pre-git era and has quite likely been there from the time workqueue was first introduced. A work item uses its PENDING bit to synchronize multiple queuers. Anyone who wins the PENDING bit owns the pending state of the work item. Whether a queuer wins or loses the race, one thing should be guaranteed - there will soon be at least one execution of the work item - where "after" means that the execution instance would be able to see all the changes that the queuer has made prior to the queueing attempt. Unfortunately, we were missing a smp_mb() after clearing PENDING for execution, so nothing guaranteed visibility of the changes that a queueing loser has made, which manifested as a reproducible blk-mq stall. Lots of kudos to Roman for debugging the problem. The patch for -stable is the minimal one. For v3.7, Peter is working on a patch to make the code path slightly more efficient and less fragile" * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix ghost PENDING flag while doing MQ IO
2016-04-27Merge branch 'for-4.6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two patches to fix a deadlock which can be easily triggered if memcg charge moving is used. This bug was introduced while converting threadgroup locking to a global percpu_rwsem and is caused by cgroup controller task migration path depending on the ability to create new kthreads. cpuset had a similar issue which was fixed by performing heavy-lifting operations asynchronous to task migration. The two patches fix the same issue in memcg in a similar way. The first patch makes the mechanism generic and the second relocates memcg charge moving outside the migration path. Given that we don't want to perform heavy operations while writelocking threadgroup lock anyway, moving them out of the way is a desirable solution. One thing to note is that the problem was difficult to debug because lockdep couldn't figure out the deadlock condition. Looking into how to improve that" * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: memcg: relocate charge moving from ->attach to ->post_attach cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
2016-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Handle v4/v6 mixed sockets properly in soreuseport, from Craig Gallak. 2) Bug fixes for the new macsec facility (missing kmalloc NULL checks, missing locking around netdev list traversal, etc.) from Sabrina Dubroca. 3) Fix handling of host routes on ifdown in ipv6, from David Ahern. 4) Fix double-fdput in bpf verifier. From Jann Horn. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits) bpf: fix double-fdput in replace_map_fd_with_map_ptr() net: ipv6: Delete host routes on an ifdown Revert "ipv6: Revert optional address flusing on ifdown." net/mlx4_en: fix spurious timestamping callbacks net: dummy: remove note about being Y by default cxgbi: fix uninitialized flowi6 ipv6: Revert optional address flusing on ifdown. ipv4/fib: don't warn when primary address is missing if in_dev is dead net/mlx5: Add pci shutdown callback net/mlx5_core: Remove static from local variable net/mlx5e: Use vport MTU rather than physical port MTU net/mlx5e: Fix minimum MTU net/mlx5e: Device's mtu field is u16 and not int net/mlx5_core: Add ConnectX-5 to list of supported devices net/mlx5e: Fix MLX5E_100BASE_T define net/mlx5_core: Fix soft lockup in steering error flow qlcnic: Update version to 5.3.64 net: stmmac: socfpga: Remove re-registration of reset controller macsec: fix netlink attribute validation macsec: add missing macsec prefix in uapi ...
2016-04-26bpf: fix double-fdput in replace_map_fd_with_map_ptr()Jann Horn
When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode references a non-map file descriptor as a map file descriptor, the error handling code called fdput() twice instead of once (in __bpf_map_get() and in replace_map_fd_with_map_ptr()). If the file descriptor table of the current task is shared, this causes f_count to be decremented too much, allowing the struct file to be freed while it is still in use (use-after-free). This can be exploited to gain root privileges by an unprivileged user. This bug was introduced in commit 0246e64d9a5f ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only exploitable since commit 1be7f75d1668 ("bpf: enable non-root eBPF programs") because previously, CAP_SYS_ADMIN was required to reach the vulnerable code. (posted publicly according to request by maintainer) Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26workqueue: fix ghost PENDING flag while doing MQ IORoman Pen
The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list with the following backtrace: [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds. [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6 [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000 [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server] [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000 [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0 [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38 [ 601.350965] Call Trace: [ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 [ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80 [ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230 [ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220 [ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0 [ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 [ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110 [ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10 [ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70 [ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90 [ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0 [ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40 [ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140 [ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30 [ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70 [ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50 [ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0 [ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100 [ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0 [ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0 [ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50 The underlying device is a null_blk, with default parameters: queue_mode = MQ submit_queues = 1 Verification that nullb0 has something inflight: root@pserver8:~# cat /sys/block/nullb0/inflight 0 1 root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \; ... /sys/block/nullb0/mq/0/cpu2/rq_list CTX pending: ffff8838038e2400 ... During debug it became clear that stalled request is always inserted in the rq_list from the following path: save_stack_trace_tsk + 34 blk_mq_insert_requests + 231 blk_mq_flush_plug_list + 281 blk_flush_plug_list + 199 wait_on_page_bit + 192 __filemap_fdatawait_range + 228 filemap_fdatawait_range + 20 filemap_write_and_wait_range + 63 blkdev_fsync + 27 vfs_fsync_range + 73 blkdev_write_iter + 202 __vfs_write + 170 vfs_write + 169 kernel_write + 56 So blk_flush_plug_list() was called with from_schedule == true. If from_schedule is true, that means that finally blk_mq_insert_requests() offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue, i.e. it calls kblockd_schedule_delayed_work_on(). That means, that we race with another CPU, which is about to execute __blk_mq_run_hw_queue() work. Further debugging shows the following traces from different CPUs: CPU#0 CPU#1 ---------------------------------- ------------------------------- reqeust A inserted STORE hctx->ctx_map[0] bit marked kblockd_schedule...() returns 1 <schedule to kblockd workqueue> request B inserted STORE hctx->ctx_map[1] bit marked kblockd_schedule...() returns 0 *** WORK PENDING bit is cleared *** flush_busy_ctxs() is executed, but bit 1, set by CPU#1, is not observed As a result request B pended forever. This behaviour can be explained by speculative LOAD of hctx->ctx_map on CPU#0, which is reordered with clear of PENDING bit and executed _before_ actual STORE of bit 1 on CPU#1. The proper fix is an explicit full barrier <mfence>, which guarantees that clear of PENDING bit is to be executed before all possible speculative LOADS or STORES inside actual work function. Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Michael Wang <yun.wang@profitbricks.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2016-04-25cgroup, cpuset: replace cpuset_post_attach_flush() with ↵Tejun Heo
cgroup_subsys->post_attach callback Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset kicks off asynchronous NUMA node migration if necessary during task migration and flushes it from cpuset_post_attach_flush() which is called at the end of __cgroup_procs_write(). This is to avoid performing migration with cgroup_threadgroup_rwsem write-locked which can lead to deadlock through dependency on kworker creation. memcg has a similar issue with charge moving, so let's convert it to an official callback rather than the current one-off cpuset specific function. This patch adds cgroup_subsys->post_attach callback and makes cpuset register cpuset_post_attach_flush() as its ->post_attach. The conversion is mostly one-to-one except that the new callback is called under cgroup_mutex. This is to guarantee that no other migration operations are started before ->post_attach callbacks are finished. cgroup_mutex is one of the outermost mutex in the system and has never been and shouldn't be a problem. We can add specialized synchronization around __cgroup_procs_write() but I don't think there's any noticeable benefit. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
2016-04-23Merge branches 'perf-urgent-for-linus', 'smp-urgent-for-linus' and ↵Linus Torvalds
'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf, cpu hotplug and timer fixes from Ingo Molnar: "perf: - A single tooling fix for a user-triggerable segfault. CPU hotplug: - Fix a CPU hotplug corner case regression, introduced by the recent hotplug rework timers: - Fix a boot hang in the ARM based Tango SoC clocksource driver" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf intel-pt: Fix segfault tracing transactions * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Fix rollback during error-out in __cpu_disable() * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/tango-xtal: Fix boot hang due to incorrect test
2016-04-23Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Misc fixes: pvqspinlocks: - an instrumentation fix futexes: - preempt-count vs pagefault_disable decouple corner case fix - futex requeue plist race window fix - futex UNLOCK_PI transaction fix for a corner case" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: asm-generic/futex: Re-enable preemption in futex_atomic_cmpxchg_inatomic() futex: Acknowledge a new waiter in counter before plist futex: Handle unlock_pi race gracefully locking/pvqspinlock: Fix division by zero in qstat_read()
2016-04-23Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "A core irq affinity masks related fix and a MIPS irqchip driver fix" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/mips-gic: Don't overrun pcpu_masks array genirq: Dont allow affinity mask to be updated on IPIs
2016-04-23lockdep: Fix lock_chain::base sizePeter Zijlstra
lock_chain::base is used to store an index into the chain_hlocks[] array, however that array contains more elements than can be indexed using the u16. Change the lock_chain structure to use a bitfield to encode the data it needs and add BUILD_BUG_ON() assertions to check the fields are wide enough. Also, for DEBUG_LOCKDEP, assert that we don't run out of elements of that array; as that would wreck the collision detectoring. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160330093659.GS3408@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23locking/lockdep: Fix ->irq_context calculationBoqun Feng
task_irq_context() returns the encoded irq_context of the task, the return value is encoded in the same as ->irq_context of held_lock. Always return 0 if !(CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING) Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sasha.levin@oracle.com Link: http://lkml.kernel.org/r/1455602265-16490-2-git-send-email-boqun.feng@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-23perf/core: Make sysctl_perf_cpu_time_max_percent conform to documentationPeter Zijlstra
Markus reported that 0 should also disable the throttling we per Documentation/sysctl/kernel.txt. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 91a612eea9a3 ("perf/core: Fix dynamic interrupt throttle") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-22cpu/hotplug: Fix rollback during error-out in __cpu_disable()Sebastian Andrzej Siewior
The recent introduction of the hotplug thread which invokes the callbacks on the plugged cpu, cased the following regression: If takedown_cpu() fails, then we run into several issues: 1) The rollback of the target cpu states is not invoked. That leaves the smp threads and the hotplug thread in disabled state. 2) notify_online() is executed due to a missing skip_onerr flag. That causes that both CPU_DOWN_FAILED and CPU_ONLINE notifications are invoked which confuses quite some notifiers. 3) The CPU_DOWN_FAILED notification is not invoked on the target CPU. That's not an issue per se, but it is inconsistent and in consequence blocks the patches which rely on these states being invoked on the target CPU and not on the controlling cpu. It also does not preserve the strict call order on rollback which is problematic for the ongoing state machine conversion as well. To fix this we add a rollback flag to the remote callback machinery and invoke the rollback including the CPU_DOWN_FAILED notification on the remote cpu. Further mark the notify online state with 'skip_onerr' so we don't get a double invokation. This workaround will go away once we moved the unplug invocation to the target cpu itself. [ tglx: Massaged changelog and moved the CPU_DOWN_FAILED notifiaction to the target cpu ] Fixes: 4cb28ced23c4 ("cpu/hotplug: Create hotplug threads") Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-s390@vger.kernel.org Cc: rt@linutronix.de Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Link: http://lkml.kernel.org/r/20160408124015.GA21960@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix memory leak in iwlwifi, from Matti Gottlieb. 2) Add missing registration of netfilter arp_tables into initial namespace, from Florian Westphal. 3) Fix potential NULL deref in DecNET routing code. 4) Restrict NETLINK_URELEASE to truly bound sockets only, from Dmitry Ivanov. 5) Fix dst ref counting in VRF, from David Ahern. 6) Fix TSO segmenting limits in i40e driver, from Alexander Duyck. 7) Fix heap leak in PACKET_DIAG_MCLIST, from Mathias Krause. 8) Ravalidate IPV6 datagram socket cached routes properly, particularly with UDP, from Martin KaFai Lau. 9) Fix endian bug in RDS dp_ack_seq handling, from Qing Huang. 10) Fix stats typing in bcmgenet driver, from Eric Dumazet. 11) Openvswitch needs to orphan SKBs before ipv6 fragmentation handing, from Joe Stringer. 12) SPI device reference leak in spi_ks8895 PHY driver, from Mark Brown. 13) atl2 doesn't actually support scatter-gather, so don't advertise the feature. From Ben Hucthings. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits) openvswitch: use flow protocol when recalculating ipv6 checksums Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets atl2: Disable unimplemented scatter/gather feature net/mlx4_en: Split SW RX dropped counter per RX ring net/mlx4_core: Don't allow to VF change global pause settings net/mlx4_core: Avoid repeated calls to pci enable/disable net/mlx4_core: Implement pci_resume callback net: phy: spi_ks8895: Don't leak references to SPI devices net: ethernet: davinci_emac: Fix platform_data overwrite net: ethernet: davinci_emac: Fix Unbalanced pm_runtime_enable qede: Fix single MTU sized packet from firmware GRO flow qede: Fix setting Skb network header qede: Fix various memory allocation error flows for fastpath tcp: Merge tx_flags and tskey in tcp_shifted_skb tcp: Merge tx_flags and tskey in tcp_collapse_retrans drivers: net: cpsw: fix wrong regs access in cpsw_ndo_open tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks openvswitch: Orphan skbs before IPv6 defrag Revert "Prevent NUll pointer dereference with two PHYs on cpsw" VSOCK: Only check error on skb_recv_datagram when skb is NULL ...