Age | Commit message (Collapse) | Author |
|
commit d194e5d666225b04c7754471df0948f645b6ab3a upstream.
The final version of commit 637241a900cb ("kmsg: honor dmesg_restrict
sysctl on /dev/kmsg") lost few hooks, as result security_syslog() are
processed incorrectly:
- open of /dev/kmsg checks syslog access permissions by using
check_syslog_permissions() where security_syslog() is not called if
dmesg_restrict is set.
- syslog syscall and /proc/kmsg calls do_syslog() where security_syslog
can be executed twice (inside check_syslog_permissions() and then
directly in do_syslog())
With this patch security_syslog() is called once only in all
syslog-related operations regardless of dmesg_restrict value.
Fixes: 637241a900cb ("kmsg: honor dmesg_restrict sysctl on /dev/kmsg")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 6224beb12e190ff11f3c7d4bf50cb2922878f600 upstream.
Fengguang Wu's tests triggered a bug in the branch tracer's start up
test when CONFIG_DEBUG_PREEMPT set. This was because that config
adds some debug logic in the per cpu field, which calls back into
the branch tracer.
The branch tracer has its own recursive checks, but uses a per cpu
variable to implement it. If retrieving the per cpu variable calls
back into the branch tracer, you can see how things will break.
Instead of using a per cpu variable, use the trace_recursion field
of the current task struct. Simply set a bit when entering the
branch tracing and clear it when leaving. If the bit is set on
entry, just don't do the tracing.
There's also the case with lockdep, as the local_irq_save() called
before the recursion can also trigger code that can call back into
the function. Changing that to a raw_local_irq_save() will protect
that as well.
This prevents the recursion and the inevitable crash that follows.
Link: http://lkml.kernel.org/r/20150630141803.GA28071@wfg-t540p.sh.intel.com
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 6b88f44e161b9ee2a803e5b2b1fbcf4e20e8b980 upstream.
While debugging a WARN_ON() for filtering, I found that it is possible
for the filter string to be referenced after its end. With the filter:
# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
The filter_parse() function can call infix_get_op() which calls
infix_advance() that updates the infix filter pointers for the cnt
and tail without checking if the filter is already at the end, which
will put the cnt to zero and the tail beyond the end. The loop then calls
infix_next() that has
ps->infix.cnt--;
return ps->infix.string[ps->infix.tail++];
The cnt will now be below zero, and the tail that is returned is
already passed the end of the filter string. So far the allocation
of the filter string usually has some buffer that is zeroed out, but
if the filter string is of the exact size of the allocated buffer
there's no guarantee that the charater after the nul terminating
character will be zero.
Luckily, only root can write to the filter.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit b4875bbe7e68f139bd3383828ae8e994a0df6d28 upstream.
When testing the fix for the trace filter, I could not come up with
a scenario where the operand count goes below zero, so I added a
WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
that it can happen (although the filter would be wrong).
# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
That is, a single operation without any operands will hit the path
where the WARN_ON_ONCE() can trigger. Although this is harmless,
and the filter is reported as a error. But instead of spitting out
a warning to the kernel dmesg, just fail nicely and report it via
the proper channels.
Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 7e01b5acd88b3f3108d8c4ce44e3205d67437202 upstream.
Introduce KEXEC_CONTROL_MEMORY_GFP to allow the architecture code
to override the gfp flags of the allocation for the kexec control
page. The loop in kimage_alloc_normal_control_pages allocates pages
with GFP_KERNEL until a page is found that happens to have an
address smaller than the KEXEC_CONTROL_MEMORY_LIMIT. On systems
with a large memory size but a small KEXEC_CONTROL_MEMORY_LIMIT
the loop will keep allocating memory until the oom killer steps in.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 6e91f8cb138625be96070b778d9ba71ce520ea7e upstream.
If, at the time __rcu_process_callbacks() is invoked, there are callbacks
in Tiny RCU's callback list, but none of them are ready to be invoked,
the current list-management code will knit the non-ready callbacks out
of the list. This can result in hangs and possibly worse. This commit
therefore inserts a check for there being no callbacks that can be
invoked immediately.
This bug is unlikely to occur -- you have to get a new callback between
the time rcu_sched_qs() or rcu_bh_qs() was called, but before we get to
__rcu_process_callbacks(). It was detected by the addition of RCU-bh
testing to rcutorture, which in turn was instigated by Iftekhar Ahmed's
mutation testing. Although this bug was made much more likely by
915e8a4fe45e (rcu: Remove fastpath from __rcu_process_callbacks()), this
did not cause the bug, but rather made it much more probable. That
said, it takes more than 40 hours of rcutorture testing, on average,
for this bug to appear, so this fix cannot be considered an emergency.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 2cf30dc180cea808077f003c5116388183e54f9e upstream.
When the following filter is used it causes a warning to trigger:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: No error
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
Modules linked in: bnep lockd grace bluetooth ...
CPU: 3 PID: 1223 Comm: bash Tainted: G W 4.1.0-rc3-test+ #450
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
Call Trace:
[<ffffffff816ed4f9>] dump_stack+0x4c/0x6e
[<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0
[<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80
[<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81159065>] replace_preds+0x3c5/0x990
[<ffffffff811596b2>] create_filter+0x82/0xb0
[<ffffffff81159944>] apply_event_filter+0xd4/0x180
[<ffffffff81152bbf>] event_filter_write+0x8f/0x120
[<ffffffff811db2a8>] __vfs_write+0x28/0xe0
[<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0
[<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0
[<ffffffff811dc408>] vfs_write+0xb8/0x1b0
[<ffffffff811dc72f>] SyS_write+0x4f/0xb0
[<ffffffff816f5217>] system_call_fastpath+0x12/0x6a
---[ end trace e11028bd95818dcd ]---
Worse yet, reading the error message (the filter again) it says that
there was no error, when there clearly was. The issue is that the
code that checks the input does not check for balanced ops. That is,
having an op between a closed parenthesis and the next token.
This would only cause a warning, and fail out before doing any real
harm, but it should still not caues a warning, and the error reported
should work:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: Meaningless filter expression
And give no kernel warning.
Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
[js: 3.12 has no OP_NOT yet, I dropped that if]
|
|
commit 108029323910c5dd1ef8fa2d10da1ce5fbce6e12 upstream.
The producer should be used producer_fifo as its sched_priority,
so correct it.
Link: http://lkml.kernel.org/r/1433923957-67842-1-git-send-email-long.wanglong@huawei.com
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 35f71bc0a09a45924bed268d8ccd0d3407bc476f upstream.
copy_process will report any failure in alloc_pid as ENOMEM currently
which is misleading because the pid allocation might fail not only when
the memory is short but also when the pid space is consumed already.
The current man page even mentions this case:
: EAGAIN
:
: A system-imposed limit on the number of threads was encountered.
: There are a number of limits that may trigger this error: the
: RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
: limits the number of processes and threads for a real user ID, was
: reached; the kernel's system-wide limit on the number of processes
: and threads, /proc/sys/kernel/threads-max, was reached (see
: proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
: was reached (see proc(5)).
so the current behavior is also incorrect wrt. documentation. POSIX man
page also suggest returing EAGAIN when the process count limit is reached.
This patch simply propagates error code from alloc_pid and makes sure we
return -EAGAIN due to reservation failure. This will make behavior of
fork closer to both our documentation and POSIX.
alloc_pid might alsoo fail when the reaper in the pid namespace is dead
(the namespace basically disallows all new processes) and there is no
good error code which would match documented ones. We have traditionally
returned ENOMEM for this case which is misleading as well but as per
Eric W. Biederman this behavior is documented in man pid_namespaces(7)
: If the "init" process of a PID namespace terminates, the kernel
: terminates all of the processes in the namespace via a SIGKILL signal.
: This behavior reflects the fact that the "init" process is essential for
: the correct operation of a PID namespace. In this case, a subsequent
: fork(2) into this PID namespace will fail with the error ENOMEM; it is
: not possible to create a new processes in a PID namespace whose "init"
: process has terminated.
and introducing a new error code would be too risky so let's stick to
ENOMEM for this case.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 28423ad283d5348793b0c45cc9b1af058e776fd6 upstream.
While debugging an issue with excessive softirq usage, I encountered the
following note in commit 3e339b5dae24a706 ("softirq: Use hotplug thread
infrastructure"):
[ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]
...but despite this note, the patch still calls RCU with IRQs disabled.
This seemingly innocuous change caused a significant regression in softirq
CPU usage on the sending side of a large TCP transfer (~1 GB/s): when
introducing 0.01% packet loss, the softirq usage would jump to around 25%,
spiking as high as 50%. Before the change, the usage would never exceed 5%.
Moving the call to rcu_note_context_switch() after the cond_sched() call,
as it was originally before the hotplug patch, completely eliminated this
problem.
Signed-off-by: Calvin Owens <calvinowens@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit b72c186999e689cb0b055ab1c7b3cd8fffbeb5ed upstream.
ptrace_resume() is called when the tracee is still __TASK_TRACED. We set
tracee->exit_code and then wake_up_state() changes tracee->state. If the
tracer's sub-thread does wait() in between, task_stopped_code(ptrace => T)
wrongly looks like another report from tracee.
This confuses debugger, and since wait_task_stopped() clears ->exit_code
the tracee can miss a signal.
Test-case:
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/ptrace.h>
#include <pthread.h>
#include <assert.h>
int pid;
void *waiter(void *arg)
{
int stat;
for (;;) {
assert(pid == wait(&stat));
assert(WIFSTOPPED(stat));
if (WSTOPSIG(stat) == SIGHUP)
continue;
assert(WSTOPSIG(stat) == SIGCONT);
printf("ERR! extra/wrong report:%x\n", stat);
}
}
int main(void)
{
pthread_t thread;
pid = fork();
if (!pid) {
assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
for (;;)
kill(getpid(), SIGHUP);
}
assert(pthread_create(&thread, NULL, waiter, NULL) == 0);
for (;;)
ptrace(PTRACE_CONT, pid, 0, SIGCONT);
return 0;
}
Note for stable: the bug is very old, but without 9899d11f6544 "ptrace:
ensure arch_ptrace/ptrace_request can never race with SIGKILL" the fix
should use lock_task_sighand(child).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Pavel Labath <labath@google.com>
Tested-by: Pavel Labath <labath@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 80a9b64e2c156b6523e7a01f2ba6e5d86e722814 upstream.
It has come to my attention that this_cpu_read/write are horrible on
architectures other than x86. Worse yet, they actually disable
preemption or interrupts! This caused some unexpected tracing results
on ARM.
101.356868: preempt_count_add <-ring_buffer_lock_reserve
101.356870: preempt_count_sub <-ring_buffer_lock_reserve
The ring_buffer_lock_reserve has recursion protection that requires
accessing a per cpu variable. But since preempt_disable() is traced, it
too got traced while accessing the variable that is suppose to prevent
recursion like this.
The generic version of this_cpu_read() and write() are:
#define this_cpu_generic_read(pcp) \
({ typeof(pcp) ret__; \
preempt_disable(); \
ret__ = *this_cpu_ptr(&(pcp)); \
preempt_enable(); \
ret__; \
})
#define this_cpu_generic_to_op(pcp, val, op) \
do { \
unsigned long flags; \
raw_local_irq_save(flags); \
*__this_cpu_ptr(&(pcp)) op val; \
raw_local_irq_restore(flags); \
} while (0)
Which is unacceptable for locations that know they are within preempt
disabled or interrupt disabled locations.
Paul McKenney stated that __this_cpu_() versions produce much better code on
other architectures than this_cpu_() does, if we know that the call is done in
a preempt disabled location.
I also changed the recursive_unlock() to use two local variables instead
of accessing the per_cpu variable twice.
Link: http://lkml.kernel.org/r/20150317114411.GE3589@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20150317104038.312e73d1@gandalf.local.home
Acked-by: Christoph Lameter <cl@linux.com>
Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Tested-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 746db9443ea57fd9c059f62c4bfbf41cf224fe13 upstream.
When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.
I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.
Signed-off-by: Brian Silverman <brian@peloton-tech.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: austin@peloton-tech.com
Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit d525211f9d1be8b523ec7633f080f2116f5ea536 upstream.
Vince reported a watchdog lockup like:
[<ffffffff8115e114>] perf_tp_event+0xc4/0x210
[<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
[<ffffffff810b7f10>] lock_release+0x130/0x260
[<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
[<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
[<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
[<ffffffff811f71ce>] send_sigio+0xae/0x100
[<ffffffff811f72b7>] kill_fasync+0x97/0xf0
[<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
[<ffffffff8115d103>] perf_pending_event+0x33/0x60
[<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
[<ffffffff8114e448>] irq_work_run+0x18/0x40
[<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
[<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
Which is caused by an irq_work generating new irq_work and therefore
not allowing forward progress.
This happens because processing the perf irq_work triggers another
perf event (tracepoint stuff) which in turn generates an irq_work ad
infinitum.
Avoid this by raising the recursion counter in the irq_work -- which
effectively disables all software events (including tracepoints) from
actually triggering again.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit ff7e0055bb5ddbbb320cdd8dfd3e18672bddd2ad upstream.
The commit
4982223e51e8 module: set nx before marking module MODULE_STATE_COMING.
introduced a regression: if a module fails to parse its arguments or
if mod_sysfs_setup fails, then the module's memory will be freed
while still read-only. Anything that reuses that memory will crash
as soon as it tries to write to it.
Cc: stable@vger.kernel.org # v3.16
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 4982223e51e8ea9d09bb33c8323b5ec1877b2b51 upstream.
We currently set RO & NX on modules very late: after we move them from
MODULE_STATE_UNFORMED to MODULE_STATE_COMING, and after we call
parse_args() (which can exec code in the module).
Much better is to do it in complete_formation() and then call
the notifier.
This means that the notifiers will be called on a module which
is already RO & NX, so that may cause problems (ftrace already
changed so they're unaffected).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 80df28476505ed4e6701c3448c63c9229a50c655 upstream.
As sysctl_hung_task_timeout_sec is unsigned long, when this value is
larger then LONG_MAX/HZ, the function schedule_timeout_interruptible in
watchdog will return immediately without sleep and with print :
schedule_timeout: wrong timeout value ffffffffffffff83
and then the funtion watchdog will call schedule_timeout_interruptible
again and again. The screen will be filled with
"schedule_timeout: wrong timeout value ffffffffffffff83"
This patch does some check and correction in sysctl, to let the function
schedule_timeout_interruptible allways get the valid parameter.
Signed-off-by: Liu Hua <sdu.liu@huawei.com>
Tested-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 29183a70b0b828500816bd794b3fe192fce89f73 upstream.
Additional validation of adjtimex freq values to avoid
potential multiplication overflows were added in commit
5e5aeb4367b (time: adjtimex: Validate the ADJ_FREQUENCY values)
Unfortunately the patch used LONG_MAX/MIN instead of
LLONG_MAX/MIN, which was fine on 64-bit systems, but being
much smaller on 32-bit systems caused false positives
resulting in most direct frequency adjustments to fail w/
EINVAL.
ntpd only does direct frequency adjustments at startup, so
the issue was not as easily observed there, but other time
sync applications like ptpd and chrony were more effected by
the bug.
See bugs:
https://bugzilla.kernel.org/show_bug.cgi?id=92481
https://bugzilla.redhat.com/show_bug.cgi?id=1188074
This patch changes the checks to use LLONG_MAX for
clarity, and additionally the checks are disabled
on 32-bit systems since LLONG_MAX/PPM_SCALE is always
larger then the 32-bit long freq value, so multiplication
overflows aren't possible there.
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Reported-by: George Joseph <george.joseph@fairview5.com>
Tested-by: George Joseph <george.joseph@fairview5.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Link: http://lkml.kernel.org/r/1423553436-29747-1-git-send-email-john.stultz@linaro.org
[ Prettified the changelog and the comments a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Christian Riesch <christian.riesch@omicron.at>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 283cb41f426b723a0255702b761b0fc5d1b53a81 upstream.
The cpuset.sched_relax_domain_level can control how far we do
immediate load balancing on a system. However, it was found on recent
kernels that echo'ing a value into cpuset.sched_relax_domain_level
did not reduce any immediate load balancing.
The reason this occurred was because the update_domain_attr_tree() traversal
did not update for the "top_cpuset". This resulted in nothing being changed
when modifying the sched_relax_domain_level parameter.
This patch is able to address that problem by having update_domain_attr_tree()
allow updates for the root in the cpuset traversal.
Fixes: fc560a26acce ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()")
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
PREEMPT_NONE
commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream.
cancel[_delayed]_work_sync() are implemented using
__cancel_work_timer() which grabs the PENDING bit using
try_to_grab_pending() and then flushes the work item with PENDING set
to prevent the on-going execution of the work item from requeueing
itself.
try_to_grab_pending() can always grab PENDING bit without blocking
except when someone else is doing the above flushing during
cancelation. In that case, try_to_grab_pending() returns -ENOENT. In
this case, __cancel_work_timer() currently invokes flush_work(). The
assumption is that the completion of the work item is what the other
canceling task would be waiting for too and thus waiting for the same
condition and retrying should allow forward progress without excessive
busy looping
Unfortunately, this doesn't work if preemption is disabled or the
latter task has real time priority. Let's say task A just got woken
up from flush_work() by the completion of the target work item. If,
before task A starts executing, task B gets scheduled and invokes
__cancel_work_timer() on the same work item, its try_to_grab_pending()
will return -ENOENT as the work item is still being canceled by task A
and flush_work() will also immediately return false as the work item
is no longer executing. This puts task B in a busy loop possibly
preventing task A from executing and clearing the canceling state on
the work item leading to a hang.
task A task B worker
executing work
__cancel_work_timer()
try_to_grab_pending()
set work CANCELING
flush_work()
block for work completion
completion, wakes up A
__cancel_work_timer()
while (forever) {
try_to_grab_pending()
-ENOENT as work is being canceled
flush_work()
false as work is no longer executing
}
This patch removes the possible hang by updating __cancel_work_timer()
to explicitly wait for clearing of CANCELING rather than invoking
flush_work() after try_to_grab_pending() fails with -ENOENT.
Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
v3: bit_waitqueue() can't be used for work items defined in vmalloc
area. Switched to custom wake function which matches the target
work item and exclusive wait and wakeup.
v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
the target bit waitqueue has wait_bit_queue's on it. Use
DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu
Vizoso.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rabin Vincent <rabin.vincent@axis.com>
Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
Tested-by: Rabin Vincent <rabin.vincent@axis.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 30a22c215a0007603ffc08021f2e8b64018517dd upstream.
commit 6ae9200f2cab7 ("enlarge console.name") increased the storage
for the console name to 16 bytes, but not the corresponding
struct console_cmdline::name storage. Console names longer than
8 bytes cause read beyond end-of-string and failure to match
console; I'm not sure if there are other unexpected consequences.
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 146755923262037fc4c54abc28c04b1103f3cc51 upstream.
The output of KDB 'summary' command should report MemTotal, MemFree
and Buffers output in kB. Current codes report in unit of pages.
A define of K(x) as
is defined in the code, but not used.
This patch would apply the define to convert the values to kB.
Please include me on Cc on replies. I do not subscribe to linux-kernel.
Signed-off-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 7215853e985a4bef1a6c14e00e89dfec84f1e457 upstream.
Commit 6edb2a8a385f0cdef51dae37ff23e74d76d8a6ce introduced
an array map_pages that contains the addresses returned by
kmap_atomic. However, when unmapping those pages, map_pages[0]
is unmapped before map_pages[1], breaking the nesting requirement
as specified in the documentation for kmap_atomic/kunmap_atomic.
This was caught by the highmem debug code present in kunmap_atomic.
Fix the loop to do the unmapping properly.
Link: http://lkml.kernel.org/r/1418871056-6614-1-git-send-email-markivx@codeaurora.org
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Reported-by: Lime Yang <limey@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 4bee96860a65c3a62d332edac331b3cf936ba3ad upstream.
The following race exists in the smpboot percpu threads management:
CPU0 CPU1
cpu_up(2)
get_online_cpus();
smpboot_create_threads(2);
smpboot_register_percpu_thread();
for_each_online_cpu();
__smpboot_create_thread();
__cpu_up(2);
This results in a missing per cpu thread for the newly onlined cpu2 and
in a NULL pointer dereference on a consecutive offline of that cpu.
Proctect smpboot_register_percpu_thread() with get_online_cpus() to
prevent that.
[ tglx: Massaged changelog and removed the change in
smpboot_unregister_percpu_thread() because that's an
optimization and therefor not stable material. ]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1406777421-12830-1-git-send-email-laijs@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 29187a9eeaf362d8422e62e17a22a6e115277a49 upstream.
A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in timely
manner before proceeding to execute work items.
This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value. This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.
Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function. need_to_create_worker() tests the following
conditions.
pending work items && !nr_running && !nr_idle
The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume and while it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable inbetween making it non-zero.
If this happens, manage_worker() could return false even with zero
nr_idle making the worker, the last idle one, proceed to execute work
items. If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.
This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.
We can leave the early exit condition alone and just ignore the return
value but the only reason it was put there is because the
manage_workers() used to perform both creations and destructions of
workers and thus the function may be invoked while the pool is trying
to reduce the number of workers. Now that manage_workers() is called
only when more workers are needed, the only case this early exit
condition is triggered is rare race conditions rendering it pointless.
Tested with simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.
tj: Updated to v3.14 where manage_workers() is responsible not only
for creating more workers but also destroying surplus ones.
maybe_create_worker() needs to keep its early exit condition to
avoid creating a new worker when manage_workers() is called to
destroy surplus ones. Other than that, the adaptabion is
straight-forward. Both maybe_{create|destroy}_worker() functions
are converted to return void and manage_workers() returns %false
iff it lost manager arbitration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 5e5aeb4367b450a28f447f6d5ab57d8f2ab16a5f upstream.
Verify that the frequency value from userspace is valid and makes sense.
Unverified values can cause overflows later on.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: Fix up bug for negative values and drop redunent cap check]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 6ada1fc0e1c4775de0e043e1bd3ae9d065491aa5 upstream.
An unvalidated user input is multiplied by a constant, which can result in
an undefined behaviour for large values. While this is validated later,
we should avoid triggering undefined behaviour.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: include trivial milisecond->microsecond correction noticed
by Andy]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit c291ee622165cb2c8d4e7af63fffd499354a23be upstream.
Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.
CPU0 CPU1
show_interrupts()
desc = irq_to_desc(X);
free_desc(desc)
remove_from_radix_tree();
kfree(desc);
raw_spinlock_irq(&desc->lock);
/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremly hard to trigger, but possible.
The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.
For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.
Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.
Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.
Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.
Fixes: 1f5a5b87f78f "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit a5fd9733a30d18d7ac23f17080e7e07bb3205b69 upstream.
commit 4dbd27711cd9 "tick: export nohz tick idle symbols for module
use" was merged via the thermal tree without an explicit ack from the
relevant maintainers.
The exports are abused by the intel powerclamp driver which implements
a fake idle state from a sched FIFO task. This causes all kinds of
wreckage in the NOHZ core code which rightfully assumes that
tick_nohz_idle_enter/exit() are only called from the idle task itself.
Recent changes in the NOHZ core lead to a failure of the powerclamp
driver and now people try to hack completely broken and backwards
workarounds into the NOHZ core code. This is completely unacceptable
and just papers over the real problem. There are way more subtle
issues lurking around the corner.
The real solution is to fix the powerclamp driver by rewriting it with
a sane concept, but that's beyond the scope of this.
So the only solution for now is to remove the calls into the core NOHZ
code from the powerclamp trainwreck along with the exports.
Fixes: d6d71ee4a14a "PM: Introduce Intel PowerClamp Driver"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Pan Jacob jun <jacob.jun.pan@intel.com>
Cc: LKP <lkp@01.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 9fc81d87420d0d3fd62d5e5529972c0ad9eab9cc upstream.
We allow PMU driver to change the cpu on which the event
should be installed to. This happened in patch:
e2d37cd213dc ("perf: Allow the PMU driver to choose the CPU on which to install events")
This patch also forces all the group members to follow
the currently opened events cpu if the group happened
to be moved.
This and the change of event->cpu in perf_install_in_context()
function introduced in:
0cda4c023132 ("perf: Introduce perf_pmu_migrate_context()")
forces group members to change their event->cpu,
if the currently-opened-event's PMU changed the cpu
and there is a group move.
Above behaviour causes problem for breakpoint events,
which uses event->cpu to touch cpu specific data for
breakpoints accounting. By changing event->cpu, some
breakpoints slots were wrongly accounted for given
cpu.
Vinces's perf fuzzer hit this issue and caused following
WARN on my setup:
WARNING: CPU: 0 PID: 20214 at arch/x86/kernel/hw_breakpoint.c:119 arch_install_hw_breakpoint+0x142/0x150()
Can't find any breakpoint slot
[...]
This patch changes the group moving code to keep the event's
original cpu.
Reported-by: Vince Weaver <vince@deater.net>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Yan, Zheng <zheng.z.yan@intel.com>
Link: http://lkml.kernel.org/r/1418243031-20367-3-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 24c037ebf5723d4d9ab0996433cee4f96c292a4d upstream.
alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
fails because disable_pid_allocation() was called by the exiting
child_reaper.
We could simply move get_pid_ns() down to successful return, but this fix
tries to be as trivial as possible.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 041d7b98ffe59c59fdd639931dea7d74f9aa9a59 upstream.
A regression was caused by commit 780a7654cee8:
audit: Make testing for a valid loginuid explicit.
(which in turn attempted to fix a regression caused by e1760bd)
When audit_krule_to_data() fills in the rules to get a listing, there was a
missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID.
This broke userspace by not returning the same information that was sent and
expected.
The rule:
auditctl -a exit,never -F auid=-1
gives:
auditctl -l
LIST_RULES: exit,never f24=0 syscall=all
when it should give:
LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all
Tag it so that it is reported the same way it was set. Create a new
private flags audit_krule field (pflags) to store it that won't interact with
the public one from the API.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272 upstream.
Now that setgroups can be disabled and not reenabled, setting gid_map
without privielge can now be enabled when setgroups is disabled.
This restores most of the functionality that was lost when unprivileged
setting of gid_map was removed. Applications that use this functionality
will need to check to see if they use setgroups or init_groups, and if they
don't they can be fixed by simply disabling setgroups before writing to
gid_map.
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.
- Expose the knob to user space through a proc file /proc/<pid>/setgroups
A value of "deny" means the setgroups system call is disabled in the
current processes user namespace and can not be enabled in the
future in this user namespace.
A value of "allow" means the segtoups system call is enabled.
- Descendant user namespaces inherit the value of setgroups from
their parents.
- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.
- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.
This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.
A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Prodcess with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit f0d62aec931e4ae3333c797d346dc4f188f454ba upstream.
Generalize id_map_mutex so it can be used for more state of a user namespace.
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit f95d7918bd1e724675de4940039f2865e5eec5fe upstream.
If you did not create the user namespace and are allowed
to write to uid_map or gid_map you should already have the necessary
privilege in the parent user namespace to establish any mapping
you want so this will not affect userspace in practice.
Limiting unprivileged uid mapping establishment to the creator of the
user namespace makes it easier to verify all credentials obtained with
the uid mapping can be obtained without the uid mapping without
privilege.
Limiting unprivileged gid mapping establishment (which is temporarily
absent) to the creator of the user namespace also ensures that the
combination of uid and gid can already be obtained without privilege.
This is part of the fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 80dd00a23784b384ccea049bfb3f259d3f973b9d upstream.
setresuid allows the euid to be set to any of uid, euid, suid, and
fsuid. Therefor it is safe to allow an unprivileged user to map
their euid and use CAP_SETUID privileged with exactly that uid,
as no new credentials can be obtained.
I can not find a combination of existing system calls that allows setting
uid, euid, suid, and fsuid from the fsuid making the previous use
of fsuid for allowing unprivileged mappings a bug.
This is part of a fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit be7c6dba2332cef0677fbabb606e279ae76652c3 upstream.
As any gid mapping will allow and must allow for backwards
compatibility dropping groups don't allow any gid mappings to be
established without CAP_SETGID in the parent user namespace.
For a small class of applications this change breaks userspace
and removes useful functionality. This small class of applications
includes tools/testing/selftests/mount/unprivilged-remount-test.c
Most of the removed functionality will be added back with the addition
of a one way knob to disable setgroups. Once setgroups is disabled
setting the gid_map becomes as safe as setting the uid_map.
For more common applications that set the uid_map and the gid_map
with privilege this change will have no affect.
This is part of a fix for CVE-2014-8989.
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream.
setgroups is unique in not needing a valid mapping before it can be called,
in the case of setgroups(0, NULL) which drops all supplemental groups.
The design of the user namespace assumes that CAP_SETGID can not actually
be used until a gid mapping is established. Therefore add a helper function
to see if the user namespace gid mapping has been established and call
that function in the setgroups permission check.
This is part of the fix for CVE-2014-8989, being able to drop groups
without privilege using user namespaces.
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 0542f17bf2c1f2430d368f44c8fcf2f82ec9e53e upstream.
The rule is simple. Don't allow anything that wouldn't be allowed
without unprivileged mappings.
It was previously overlooked that establishing gid mappings would
allow dropping groups and potentially gaining permission to files and
directories that had lesser permissions for a specific group than for
all other users.
This is the rule needed to fix CVE-2014-8989 and prevent any other
security issues with new_idmap_permitted.
The reason for this rule is that the unix permission model is old and
there are programs out there somewhere that take advantage of every
little corner of it. So allowing a uid or gid mapping to be
established without privielge that would allow anything that would not
be allowed without that mapping will result in expectations from some
code somewhere being violated. Violated expectations about the
behavior of the OS is a long way to say a security issue.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream.
Today there are 3 instances of setgroups and due to an oversight their
permission checking has diverged. Add a common function so that
they may all share the same permission checking code.
This corrects the current oversight in the current permission checks
and adds a helper to avoid this in the future.
A user namespace security fix will update this new helper, shortly.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 42f74461a5b60cf6b42887e6d2ff5b7be4abf1ca upstream.
SFR reported this 2013-05-15:
> After merging the final tree, today's linux-next build (i386 defconfig)
> produced this warning:
>
> kernel/auditfilter.c: In function 'audit_data_to_entry':
> kernel/auditfilter.c:426:3: warning: this decimal constant is unsigned only
> in ISO C90 [enabled by default]
>
> Introduced by commit 780a7654cee8 ("audit: Make testing for a valid
> loginuid explicit") from Linus' tree.
Replace this decimal constant in the code with a macro to make it more readable
(add to the unsigned cast to quiet the warning).
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 82975bc6a6df743b9a01810fb32cb65d0ec5d60b upstream.
x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but
not on non-paranoid returns. I suspect that this is a mistake and that
the code only works because int3 is paranoid.
Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME
from the uprobes code.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 2aa792e6faf1a00f5accf1f69e87e11a390ba2cd upstream.
The rcu_gp_kthread_wake() function checks for three conditions before
waking up grace period kthreads:
* Is the thread we are trying to wake up the current thread?
* Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
* Is there no thread created for this flavour, hence nothing to wake up?
If any one of these condition is true, we do not call wake_up().
It was found that there are quite a few avoidable wake ups both during
idle time and under stress induced by rcutorture.
Idle:
Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0
rcutorture:
Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0
Here case{1-3} are the cases listed above. We can avoid these wake
ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
period kthreads.
There is a comment about an implied barrier supplied by the wake_up()
logic. This barrier is necessary for the awakened thread to see the
updated ->gp_flags. This flag is always being updated with the root node
lock held. Also, the awakened thread tries to acquire the root node lock
before reading ->gp_flags because of which there is proper ordering.
Hence this commit tries to avoid calling wake_up() whenever we can by
using rcu_gp_kthread_wake() function.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 48a7639ce80cf279834d0d44865e49ecd714f37d upstream.
The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread. This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.
Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off. This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.
However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs. In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress. Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling. In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls. Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.
In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.
This commit therefore makes this change. The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed. A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ Pranith: backport to 3.13-stable: just rcu_gp_kthread_wake(),
prereq for 2aa792e "rcu: Use rcu_gp_kthread_wake() to wake up grace
period kthreads" ]
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 799b601451b21ebe7af0e6e8f6e2ccd4683c5064 upstream.
Audit rules disappear when an inode they watch is evicted from the cache.
This is likely not what we want.
The guilty commit is "fsnotify: allow marks to not pin inodes in core",
which didn't take into account that audit_tree adds watches with a zero
mask.
Adding any mask should fix this.
Fixes: 90b1e7a57880 ("fsnotify: allow marks to not pin inodes in core")
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 6891c4509c792209c44ced55a60f13954cb50ef4 upstream.
If userland creates a timer without specifying a sigevent info, we'll
create one ourself, using a stack local variable. Particularly will we
use the timer ID as sival_int. But as sigev_value is a union containing
a pointer and an int, that assignment will only partially initialize
sigev_value on systems where the size of a pointer is bigger than the
size of an int. On such systems we'll copy the uninitialized stack bytes
from the timer_create() call to userland when the timer actually fires
and we're going to deliver the signal.
Initialize sigev_value with 0 to plug the stack info leak.
Found in the PaX patch, written by the PaX Team.
Fixes: 5a9fa7307285 ("posix-timers: kill ->it_sigev_signo and...")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: PaX Team <pageexec@freemail.hu>
Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 94fb823fcb4892614f57e59601bb9d4920f24711 upstream.
If a device's dev_pm_ops::freeze callback fails during the QUIESCE
phase, we don't rollback things correctly calling the thaw and complete
callbacks. This could leave some devices in a suspended state in case of
an error during resuming from hibernation.
Signed-off-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 5695be142e203167e3cb515ef86a88424f3524eb upstream.
PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.
Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.
Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.
Changes since v1
- push the re-check loop out of freeze_processes into
check_frozen_processes and invert the condition to make the code more
readable as per Rafael
Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|