AgeCommit messageAuthor
2010-07-31v2.6.33.6-rt27Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-29UIO: Add a driver for Hilscher netX-based fieldbus cardsHans J. Koch
This patch adds a Userspace IO driver for netX-based fieldbus cards by Hilscher (see http://www.hilscher.com). At the moment, cifX and comX cards are supported. The userspace part for this driver is provided by Hilscher and should come with the card. The driver has been in use for several months now and has been tested by people at Hilscher and Linutronix. Signed-off-by: Hans J. Koch <hjk@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
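For reference, userspace drives any UIO device through the same tiny interface: read() on the device node blocks until an interrupt and yields an event count, and mmap() with offset N*pagesize maps the driver's Nth memory region. A minimal sketch; the device node, region size and mapping index are illustrative, not specific to the netX cards:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR);
        if (fd < 0)
            return 1;

        /* Map the first memory region the driver exports (mapping 0). */
        void *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0 * getpagesize());
        if (bar == MAP_FAILED)
            return 1;

        /* Block until the next interrupt; the value read is the total
         * interrupt count seen so far. */
        uint32_t count;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("interrupt #%u\n", count);

        munmap(bar, 4096);
        close(fd);
        return 0;
    }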
2010-07-20dca: Fix fallout from raw_spinlock conversionThomas Gleixner
Compiling stuff helps even if there is no hardware to test on. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20powerpc: cpu-hotplug: Prevent softirq wakeup on wrong CPUThomas Gleixner
When the plugged CPU sets the online flag it enables interrupts and goes idle. When an interrupt happens and tries to wake up a softirq thread before the other cpu sets the active flag, the softirq thread is put on one of the active cpus. Prevent this by waiting for the cpu_active bit. Temporary workaround. Needs more thought. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20drivers/dca: Convert dca_lock to a raw spinlockMike Galbraith
drivers/dca: convert dca_lock to a raw spinlock

[ 25.607517] igb 0000:01:00.0: DCA enabled
[ 25.607524] BUG: sleeping function called from invalid context at kernel/rtmutex.c:684
[ 25.607528] pcnt: 1 0 in_atomic(): 1, irqs_disabled(): 0, pid: 1805, name: modprobe
[ 25.607532] Pid: 1805, comm: modprobe Tainted: G M W N 2.6.33.5-rt23-rt_debug #2
[ 25.607536] Call Trace:
[ 25.607557] [<ffffffff820078a1>] try_stack_unwind+0x151/0x1a0
[ 25.607566] [<ffffffff820062c2>] dump_trace+0x92/0x370
[ 25.607573] [<ffffffff8200731c>] show_trace_log_lvl+0x5c/0x80
[ 25.607578] [<ffffffff82007355>] show_trace+0x15/0x20
[ 25.607587] [<ffffffff823f4588>] dump_stack+0x77/0x8f
[ 25.607595] [<ffffffff82043f2a>] __might_sleep+0x11a/0x130
[ 25.607602] [<ffffffff823f7b93>] rt_spin_lock+0x83/0x90
[ 25.607611] [<ffffffffa0209138>] dca_common_get_tag+0x28/0x80 [dca]
[ 25.607622] [<ffffffffa02091c8>] dca3_get_tag+0x18/0x20 [dca]
[ 25.607634] [<ffffffffa0244e71>] igb_update_dca+0xb1/0x1d0 [igb]
[ 25.607649] [<ffffffffa0244ff5>] igb_setup_dca+0x65/0x80 [igb]
[ 25.607663] [<ffffffffa02535a6>] igb_probe+0x946/0xe4d [igb]
[ 25.607678] [<ffffffff82247517>] local_pci_probe+0x17/0x20
[ 25.607686] [<ffffffff82248661>] pci_device_probe+0x121/0x130
[ 25.607699] [<ffffffff822e4832>] driver_probe_device+0xd2/0x2e0
[ 25.607707] [<ffffffff822e4adb>] __driver_attach+0x9b/0xa0
[ 25.607714] [<ffffffff822e3d1b>] bus_for_each_dev+0x6b/0xa0
[ 25.607720] [<ffffffff822e4591>] driver_attach+0x21/0x30
[ 25.607727] [<ffffffff822e3425>] bus_add_driver+0x1e5/0x350
[ 25.607734] [<ffffffff822e4e41>] driver_register+0x81/0x160
[ 25.607742] [<ffffffff8224890f>] __pci_register_driver+0x6f/0xf0
[ 25.607752] [<ffffffffa011505b>] igb_init_module+0x5b/0x5d [igb]
[ 25.607769] [<ffffffff820001dd>] do_one_initcall+0x3d/0x1a0
[ 25.607778] [<ffffffff820961f6>] sys_init_module+0xe6/0x270
[ 25.607786] [<ffffffff82003232>] system_call_fastpath+0x16/0x1b
[ 25.607794] [<00007f84d6783f4a>] 0x7f84d6783f4a

[ tglx: Fixed the domain allocation which was calling kzalloc from the irq disabled section ]

Signed-off-by: Mike Galbraith <efault@gmx.de>
LKML-Reference: <1278491341.7926.4.camel@marge.simson.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
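The conversion itself follows the usual raw_spinlock pattern; a sketch of the idea, not the exact diff (the kzalloc fix noted above is separate):

    static DEFINE_RAW_SPINLOCK(dca_lock);   /* was: DEFINE_SPINLOCK(dca_lock) */

    /* in dca_common_get_tag() and the other lock sites: */
    unsigned long flags;

    raw_spin_lock_irqsave(&dca_lock, flags);
    /* ... look up the provider for this device/cpu and compute the tag ... */
    raw_spin_unlock_irqrestore(&dca_lock, flags);

A raw_spinlock_t stays a spinning lock on RT, so it may legally be taken from the atomic probe path shown in the trace.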
2010-07-20cpu-hotplug: Prevent softirq wakeup on wrong CPUThomas Gleixner
When the plugged CPU sets the online flag it enables interrupts and goes idle. When an interrupt happens and tries to wake up a softirq thread before the other cpu sets the active flag, the softirq thread is put on one of the active cpus. Prevent this by waiting for the cpu_active bit. Temporary workaround. Needs more thought. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
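A minimal sketch of the described workaround; the exact placement in the hotplug path is illustrative:

    /* On the freshly onlined CPU: wait until the hotplug machinery has set
     * our cpu_active bit, so a softirq-thread wakeup cannot be routed to
     * some other cpu in the meantime. */
    while (!cpu_active(smp_processor_id()))
        cpu_relax();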
2010-07-20cpu-hotplug: Don't wake up the desched thread from idle_task_exit()Thomas Gleixner
When the idle task exits we do not want to wake the cpu-bound desched thread. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20x86: mce: Convert cmci_discover_lock to raw_spinlockThomas Gleixner
cmci_discover_lock is taken in atomic contexts (cpu bringup). Convert it to a raw_spinlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20net: iptables: Fix xt_info lockingThomas Gleixner
xt_info locking is an open-coded rwlock which works fine in mainline, but on RT it's racy. Replace it with a real rwlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20suspend: Prevent might sleep splatsThomas Gleixner
timekeeping suspend/resume calls read_persistent_clock() which takes rtc_lock. That results in might sleep warnings because at that point we run with interrupts disabled. We cannot convert rtc_lock to a raw spinlock as that would trigger other might sleep warnings. As a temporary workaround we disable the might sleep warnings by setting system_state to SYSTEM_SUSPEND before calling sysdev_suspend() and restoring it to SYSTEM_RUNNING after sysdev_resume(). Needs to be revisited. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
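A sketch of the workaround in the suspend path; SYSTEM_SUSPEND is a state added by the RT patches (not mainline 2.6.33), and error handling is abbreviated:

    int error;

    system_state = SYSTEM_SUSPEND;
    error = sysdev_suspend(PMSG_SUSPEND);
    if (!error) {
        /* ... enter the low-power state ... */
        sysdev_resume();
    }
    system_state = SYSTEM_RUNNING;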
2010-07-20futex: Protect against pi_blocked_on corruption during requeue PIDarren Hart
The requeue_pi mechanism introduced proxy locking of the rtmutex. This creates a scenario where a task can wake up, not knowing it has been enqueued on an rtmutex. Blocking on the hb->lock can overwrite a valid value in current->pi_blocked_on, leading to an inconsistent state. Prevent overwriting pi_blocked_on by serializing on the waiter's pi_lock and using the new PI_WAKEUP_INPROGRESS state flag to indicate a waiter that has been woken by a timeout or signal. This prevents the rtmutex code from adding the waiter to the rtmutex wait list, returning EAGAIN to futex_requeue(), which will in turn ignore the waiter during a requeue. Care is taken to allow current to block on locks even if PI_WAKEUP_INPROGRESS is set. During normal wakeup, this results in one less hb->lock protected section. In the pre-requeue-timeout-or-signal wakeup, this removes the "greedy locking" behavior; no attempt will be made to acquire the lock. [ tglx: take pi_lock with lock_irq(), removed paranoid warning, plugged pi_state and pi_blocked_on leak, adjusted some comments ] Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <4C3C1DCF.9090509@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20sched: Fix TASK_WAKING vs fork deadlockPeter Zijlstra
Oleg noticed a few races with the TASK_WAKING usage on fork. - since TASK_WAKING is basically a spinlock, it should be IRQ safe - since we set TASK_WAKING (*) without holding rq->lock it could be there still is a rq->lock holder, thereby not actually providing full serialization. (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING. Cure the second issue by not setting TASK_WAKING in sched_fork(), but only temporarily in wake_up_new_task() while calling select_task_rq(). Cure the first by holding rq->lock around the select_task_rq() call, this will disable IRQs, this however requires that we push down the rq->lock release into select_task_rq_fair()'s cgroup stuff. Because select_task_rq_fair() still needs to drop the rq->lock we cannot fully get rid of TASK_WAKING. [ upstream commit: 0017d735092844118bef006696a750a0e4ef6ebd ] Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
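A rough sketch of the reworked wake_up_new_task(); simplified from the actual patch, which also pushes the rq->lock release into select_task_rq_fair()'s cgroup code:

    struct rq *rq;
    unsigned long flags;
    int cpu;

    rq = task_rq_lock(p, &flags);          /* holds rq->lock, IRQs off */
    p->state = TASK_WAKING;                /* set only here, briefly */
    cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
    set_task_cpu(p, cpu);
    p->state = TASK_RUNNING;
    activate_task(rq, p, 0);
    /* ... preemption check, stats ... */
    task_rq_unlock(rq, &flags);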
2010-07-20sched: Make select_fallback_rq() cpuset friendlyOleg Nesterov
Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems with select_fallback_rq(). It can be called from any context and can't use any cpuset locks including task_lock(). It is called when the task doesn't have online cpus in ->cpus_allowed but ttwu/etc must be able to find a suitable cpu. I am not proud of this patch. Everything which needs such a fat comment can't be good even if correct. But I'd prefer to not change the locking rules in the code I hardly understand, and in any case I believe this simple change makes the code much more correct compared to the deadlocks we currently have. [ upstream commit: 9084bb8246ea935b98320554229e2f371f7f52fa ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091027.GA9155@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20sched: _cpu_down(): Don't play with current->cpus_allowedOleg Nesterov
_cpu_down() changes the current task's affinity and then recovers it at the end. The problems are well known: we can't restore old_allowed if it was bound to the now-dead-cpu, and we can race with the userspace which can change cpu-affinity during unplug. _cpu_down() should not play with current->cpus_allowed at all. Instead, take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable() removes the dying cpu from cpu_online_mask. [ upstream commit: 6a1bdc1b577ebcb65f6603c57f8347309bc4ab13 ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091023.GA9148@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20sched: sched_exec(): Remove the select_fallback_rq() logicOleg Nesterov
sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless. This can race with other CPUs updating our ->cpus_allowed, and this looks meaningless to me. The task is current and running, it must have online cpus in ->cpus_allowed, the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu, this likely means we raced with set_cpus_allowed() which was called for a reason, so why should sched_exec() retry and call ->select_task_rq() again? Change the code to call sched_class->select_task_rq() directly and do nothing if the returned cpu is wrong after re-checking under rq->lock. From now on, task_struct->cpus_allowed is always stable under TASK_WAKING, and select_fallback_rq() is always called under rq->lock or when the caller owns TASK_WAKING (select_task_rq). [ upstream commit: 30da688ef6b76e01969b00608202fff1eed2accc ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091019.GA9141@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20sched: move_task_off_dead_cpu(): Remove retry logicOleg Nesterov
The previous patch preserved the retry logic, but it looks unneeded. __migrate_task() can only fail if we raced with migration after we dropped the lock, but in this case the caller of set_cpus_allowed/etc must initiate migration itself if ->on_rq == T. We already fixed p->cpus_allowed, the changes in active/online masks must be visible to the racer, and it should migrate the task to an online cpu correctly. [ upstream commit: c1804d547dc098363443667609c272d1e4d15ee8 ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091014.GA9138@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20sched: move_task_off_dead_cpu(): Take rq->lock around select_fallback_rq()Oleg Nesterov
move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed lockless. We can race with set_cpus_allowed() running in parallel. Change it to take rq->lock around select_fallback_rq(). Note that it is not trivial to move this spin_lock() into select_fallback_rq(); we must recheck that the task was not migrated after we take the lock, and other callers do not need this lock. To avoid the races with other callers of select_fallback_rq() which rely on TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise. The owner of TASK_WAKING must update ->cpus_allowed and choose the correct CPU anyway, and the subsequent __migrate_task() is just meaningless because p->se.on_rq must be false. Alternatively, we could change select_task_rq() to take rq->lock right after it calls sched_class->select_task_rq(), but this looks a bit ugly. Also, change it to not assume irqs are disabled and absorb __migrate_task_irq(). [ upstream: 1445c08d06c5594895b4fae952ef8a457e89c390 ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091010.GA9131@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-20sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked codeOleg Nesterov
This patch just states the fact that the cpusets/cpuhotplug interaction is broken and removes the deadlockable code which only pretends to work. - cpuset_lock() doesn't really work. It is needed for cpuset_cpus_allowed_locked() but we can't take this lock in the try_to_wake_up()->select_fallback_rq() path. - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex, stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take cpuset_lock() and hangs forever because CPU is already dead and thus T can't be scheduled. - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock() which is not irq-safe, but try_to_wake_up() can be called from irq. Kill them, and change select_fallback_rq() to use cpu_possible_mask, like we currently do without CONFIG_CPUSETS. Also, with or without this patch, with or without CONFIG_CPUSETS, the callers of select_fallback_rq() can race with each other or with set_cpus_allowed() paths. The subsequent patches try to fix these problems. [ upstream: 897f0b3c3ff40b443c84e271bef19bd6ae885195 ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091003.GA9123@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
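A condensed sketch of the fallback logic this change leaves behind (the real function also warns once when it has to break affinity):

    static int select_fallback_rq(int cpu, struct task_struct *p)
    {
        int dest_cpu;

        /* Prefer an allowed, active cpu on the same node. */
        for_each_cpu_and(dest_cpu, cpumask_of_node(cpu_to_node(cpu)),
                         &p->cpus_allowed)
            if (cpu_active(dest_cpu))
                return dest_cpu;

        /* Then any allowed, active cpu. */
        for_each_cpu_and(dest_cpu, &p->cpus_allowed, cpu_active_mask)
            return dest_cpu;

        /* No choice left: break affinity via cpu_possible_mask. */
        cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
        return cpumask_any(cpu_active_mask);
    }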
2010-07-20sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlockOleg Nesterov
Trivial typo fix. rq->migration_thread can be NULL after task_rq_unlock(), this is why we have "mt" which should be used instead. [ upstream: 47a70985e5c093ae03d8ccf633c70a93761d86f2 ] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100330165829.GA18284@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-13v2.6.33.6-rt26Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.33.yThomas Gleixner
Conflicts: Makefile Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-13v2.6.33.5-rt25Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-13vfs: Revert the scalability patchesThomas Gleixner
We still have sporadic and hard to debug problems. Revert them for now and revisit with Nick's new version. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-13v2.6.33.5-rt24Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-05Linux 2.6.33.6Greg Kroah-Hartman
2010-07-05sctp: fix append error cause to ERROR chunk correctlyWei Yongjun
commit 2e3219b5c8a2e44e0b83ae6e04f52f20a82ac0f2 upstream. Commit 5fa782c2f5ef6c2e4f04d3e228412c9b4a4c8809 ("sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173) (v4)") caused the 'error cause' to never be added to the ERROR chunk, due to a typo in the length validity check in sctp_init_cause_fixed(). Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05qla2xxx: Disable MSI on qla24xx chips other than QLA2432.Ben Hutchings
commit 6377a7ae1ab82859edccdbc8eaea63782efb134d upstream. On specific platforms, MSI is unreliable on some of the QLA24xx chips, resulting in fatal I/O errors under load, as reported in <http://bugs.debian.org/572322> and by some RHEL customers. Signed-off-by: Giridhar Malavali <giridhar.malavali@qlogic.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05KEYS: find_keyring_by_name() can gain access to a freed keyringToshiyuki Okajima
commit cea7daa3589d6b550546a8c8963599f7c1a3ae5c upstream.

find_keyring_by_name() can gain access to a keyring that has had its reference count reduced to zero, and is thus ready to be freed. This then allows the dead keyring to be brought back into use whilst it is being destroyed.

The following timeline illustrates the process:

|(cleaner)                            (user)
|
| free_user(user)                     sys_keyctl()
|  |                                   |
| key_put(user->session_keyring)      keyctl_get_keyring_ID()
| ||    //=> keyring->usage = 0        |
| |schedule_work(&key_cleanup_task)   lookup_user_key()
| ||                                   |
| kmem_cache_free(,user)               |
| .                                    |[KEY_SPEC_USER_KEYRING]
| .                                   install_user_keyrings()
| .                                    ||
| key_cleanup() [<= worker_thread()]   ||
| |                                    ||
| [spin_lock(&key_serial_lock)]        |[mutex_lock(&key_user_keyr..mutex)]
| |                                    ||
| atomic_read() == 0                   ||
| |{ rb_ease(&key->serial_node,) }     ||
| |                                    ||
| [spin_unlock(&key_serial_lock)]      |find_keyring_by_name()
| |                                    |||
| keyring_destroy(keyring)             ||[read_lock(&keyring_name_lock)]
| ||                                   |||
| |[write_lock(&keyring_name_lock)]    ||atomic_inc(&keyring->usage)
| |.                                   ||| *** GET freeing keyring ***
| |.                                   ||[read_unlock(&keyring_name_lock)]
| ||                                   ||
| |list_del()                          |[mutex_unlock(&key_user_k..mutex)]
| ||                                   |
| |[write_unlock(&keyring_name_lock)]  ** INVALID keyring is returned **
| |                                    .
| kmem_cache_free(,keyring)            .
|                                      .
|                                      atomic_dec(&keyring->usage)
v                                      *** DESTROYED ***
TIME

If CONFIG_SLUB_DEBUG=y then we may see the following message generated:

=============================================================================
BUG key_jar: Poison overwritten
-----------------------------------------------------------------------------
INFO: 0xffff880197a7e200-0xffff880197a7e200. First byte 0x6a instead of 0x6b
INFO: Allocated in key_alloc+0x10b/0x35f age=25 cpu=1 pid=5086
INFO: Freed in key_cleanup+0xd0/0xd5 age=12 cpu=1 pid=10
INFO: Slab 0xffffea000592cb90 objects=16 used=2 fp=0xffff880197a7e200 flags=0x200000000000c3
INFO: Object 0xffff880197a7e200 @offset=512 fp=0xffff880197a7e300
Bytes b4 0xffff880197a7e1f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Object 0xffff880197a7e200: 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b jkkkkkkkkkkkkkkk

Alternatively, we may see a system panic happen, such as:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
IP: [<ffffffff810e61a3>] kmem_cache_alloc+0x5b/0xe9
PGD 6b2b4067 PUD 6a80d067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/kernel/kexec_crash_loaded
CPU 1
...
Pid: 31245, comm: su Not tainted 2.6.34-rc5-nofixed-nodebug #2 D2089/PRIMERGY
RIP: 0010:[<ffffffff810e61a3>] [<ffffffff810e61a3>] kmem_cache_alloc+0x5b/0xe9
RSP: 0018:ffff88006af3bd98 EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88007d19900b
RDX: 0000000100000000 RSI: 00000000000080d0 RDI: ffffffff81828430
RBP: ffffffff81828430 R08: ffff88000a293750 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000100000 R12: 00000000000080d0
R13: 00000000000080d0 R14: 0000000000000296 R15: ffffffff810f20ce
FS: 00007f97116bc700(0000) GS:ffff88000a280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000001 CR3: 000000006a91c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process su (pid: 31245, threadinfo ffff88006af3a000, task ffff8800374414c0)
Stack:
 0000000512e0958e 0000000000008000 ffff880037f8d180 0000000000000001
 0000000000000000 0000000000008001 ffff88007d199000 ffffffff810f20ce
 0000000000008000 ffff88006af3be48 0000000000000024 ffffffff810face3
Call Trace:
 [<ffffffff810f20ce>] ? get_empty_filp+0x70/0x12f
 [<ffffffff810face3>] ? do_filp_open+0x145/0x590
 [<ffffffff810ce208>] ? tlb_finish_mmu+0x2a/0x33
 [<ffffffff810ce43c>] ? unmap_region+0xd3/0xe2
 [<ffffffff810e4393>] ? virt_to_head_page+0x9/0x2d
 [<ffffffff81103916>] ? alloc_fd+0x69/0x10e
 [<ffffffff810ef4ed>] ? do_sys_open+0x56/0xfc
 [<ffffffff81008a02>] ? system_call_fastpath+0x16/0x1b
Code: 0f 1f 44 00 00 49 89 c6 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 60 e8 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 14 4c 89 f9 83 ca ff 44 89 e6 48 89 ef
RIP [<ffffffff810e61a3>] kmem_cache_alloc+0x5b/0xe9

The problem is that find_keyring_by_name() does not confirm that the keyring is valid before accepting it. Skipping keyrings that have been reduced to a zero count seems the way to go. To this end, use atomic_inc_not_zero() to increment the usage count and skip the candidate keyring if that returns false.

The following script _may_ cause the bug to happen, but there's no guarantee as the window of opportunity is small:

#!/bin/sh
LOOP=100000
USER=dummy_user
/bin/su -c "exit;" $USER || { /usr/sbin/adduser -m $USER; add=1; }
for ((i=0; i<LOOP; i++))
do
	/bin/su -c "echo '$i' > /dev/null" $USER
done
(( add == 1 )) && /usr/sbin/userdel -r $USER
exit

Note that the nominated user must not be in use.

An alternative way of testing this may be:

for ((i=0; i<100000; i++))
do
	keyctl session foo /bin/true || break
done >&/dev/null

as that uses a keyring named "foo" rather than relying on the user and user-session named keyrings.

Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
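A condensed sketch of the fix in find_keyring_by_name(); the surrounding checks (key type, revocation flag, user namespace) are elided:

    struct key *keyring;

    list_for_each_entry(keyring, &keyring_name_hash[bucket], type_data.link) {
        if (strcmp(keyring->description, name) != 0)
            continue;
        /* Only take a reference if the count is still non-zero;
         * otherwise the keyring is already queued for destruction. */
        if (!atomic_inc_not_zero(&keyring->usage))
            continue;
        goto found;    /* valid reference taken */
    }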
2010-07-05KEYS: Return more accurate error codesDan Carpenter
commit 4d09ec0f705cf88a12add029c058b53f288cfaa2 upstream. We were using the wrong variable here so the error codes weren't being returned properly. The original code returns -ENOKEY. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05parisc: clear floating point exception flag on SIGFPE signalHelge Deller
commit 550f0d922286556c7ea43974bb7921effb5a5278 upstream. Clear the floating point exception flag before returning to user space. This is needed, else the libc trampoline handler may hit the same SIGFPE again while building up a trampoline to a signal handler. Fixes debian bug #559406. Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05KVM: SVM: Don't allow nested guest to VMMCALL into hostJoerg Roedel
This patch disables the possibility for an l2-guest to do a VMMCALL directly into the host. This would happen if the l1-hypervisor doesn't intercept VMMCALL and the l2-guest executes this instruction. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 0d945bd9351199744c1e89d57a70615b6ee9f394)
2010-07-05KVM: x86: Inject #GP with the right rip on efer writesRoedel, Joerg
This patch fixes a bug in the KVM efer-msr write path. If a guest writes to a reserved efer bit the set_efer function injects the #GP directly. The architecture dependent wrmsr function does not see this, assumes success and advances the rip. This results in a #GP in the guest with the wrong rip. This patch fixes this by reporting efer write errors back to the architectural wrmsr function. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit b69e8caef5b190af48c525f6d715e7b7728a77f6)
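A simplified sketch of the corrected flow, with the validity checks abbreviated:

    static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
    {
        if (efer & efer_reserved_bits)
            return 1;    /* caller injects #GP; rip not advanced */
        /* ... further checks against guest cpuid ... */
        vcpu->arch.efer = efer;
        return 0;
    }

so that the common wrmsr handler can turn a non-zero return into a #GP at the faulting instruction.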
2010-07-05KVM: x86: Add missing locking to arch specific vcpu ioctlsAvi Kivity
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 8fbf065d625617bbbf6b72d5f78f84ad13c8b547)
2010-07-05KVM: PPC: Add missing vcpu_load()/vcpu_put() in vcpu ioctlsAvi Kivity
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 98001d8d017cea1ee0f9f35c6227bbd63ef5005b)
2010-07-05KVM: Fix wallclock version writing raceAvi Kivity
Wallclock writing uses an unprotected global variable to hold the version; this can cause one guest to interfere with another if both write their wallclock at the same time. Acked-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 9ed3c444ab8987c7b219173a2f7807e3f71e234e)
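A simplified sketch of the fix: the version is derived from the guest page being written rather than from a shared global, following the usual odd/even (update in progress / consistent) protocol:

    static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
    {
        u32 version;

        if (kvm_read_guest(kvm, wall_clock, &version, sizeof(version)))
            return;
        if (version & 1)
            ++version;    /* first-time write: make it even */
        ++version;        /* odd: update in progress */
        kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
        /* ... fill in the pvclock wall clock data ... */
        ++version;        /* even: update complete */
        kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
    }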
2010-07-05KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_rootsAvi Kivity
On svm, kvm_read_pdptr() may require reading guest memory, which can sleep. Push the spinlock into mmu_alloc_roots(), and only take it after we've read the pdptr. Tested-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 8facbbff071ff2b19268d3732e31badc60471e21)
2010-07-05KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)Shane Wang
Per document, for feature control MSR: Bit 1 enables VMXON in SMX operation. If the bit is clear, execution of VMXON in SMX operation causes a general-protection exception. Bit 2 enables VMXON outside SMX operation. If the bit is clear, execution of VMXON outside SMX operation causes a general-protection exception. This patch is to enable this kind of check with SMX for VMXON in KVM. Signed-off-by: Shane Wang <shane.wang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit cafd66595d92591e4bd25c3904e004fc6f897e2d)
2010-07-05KVM: MMU: Segregate shadow pages with different cr0.wpAvi Kivity
When cr0.wp=0, we may shadow a gpte having u/s=1 and r/w=0 with an spte having u/s=0 and r/w=1. This allows excessive access if the guest sets cr0.wp=1 and accesses through this spte. Fix by making cr0.wp part of the base role; we'll have different sptes for the two cases and the problem disappears. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 3dbe141595faa48a067add3e47bba3205b79d33c)
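A sketch of the idea; the role bitfield is abbreviated to the relevant member:

    union kvm_mmu_page_role {
        unsigned word;
        struct {
            unsigned level:4;
            /* ... other role bits ... */
            unsigned cr0_wp:1;    /* new: shadow pages now keyed on cr0.wp */
        };
    };

    /* at role computation time: */
    vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu);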
2010-07-05KVM: x86: Check LMA bit before set_eferSheng Yang
kvm_x86_ops->set_efer() would execute vcpu->arch.efer = efer, so the subsequent checking of the LMA bit didn't work. Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit a3d204e28579427609c3d15d2310127ebaa47d94)
2010-07-05KVM: Don't allow lmsw to clear cr0.peAvi Kivity
The current lmsw implementation allows the guest to clear cr0.pe, contrary to the manual, which breaks EMM386.EXE. Fix by ORing the old cr0.pe with lmsw's operand. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit f78e917688edbf1f14c318d2e50dc8e7dad20445)
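The fix amounts to preserving the current PE bit; a sketch of the resulting handler:

    void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
    {
        unsigned long cr0 = kvm_read_cr0(vcpu);

        /* lmsw touches only cr0 bits 0-3 and may set, but never
         * clear, PE (bit 0): keep the old PE and OR in the operand. */
        kvm_set_cr0(vcpu, (cr0 & ~0x0eul) | (msw & 0x0f));
    }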
2010-07-05x86, paravirt: Add a global synchronization point for pvclockGlauber Costa
In recent stress tests, it was found that pvclock-based systems could seriously warp in smp systems. Using ingo's time-warp-test.c, I could trigger a scenario as bad as 1.5mi warps a minute in some systems. (to be fair, it wasn't that bad in most of them). Investigating further, I found out that such warps were caused by the very offset-based calculation pvclock is based on. This happens even on some machines that report constant_tsc in their tsc flags, especially on multi-socket ones. Two reads of the same kernel timestamp at approx the same time will likely have tsc timestamped on different occasions too. This means the delta we calculate is unpredictable at best, and can probably be smaller in a cpu that is legitimately reading the clock in a forward occasion. Some adjustments on the host could make this window less likely to happen, but still, it pretty much poses as an intrinsic problem of the mechanism. A while ago, I thought about using a shared variable anyway, to hold the clock's last state, but gave up due to the high contention locking was likely to introduce, possibly rendering the thing useless on big machines. I argue, however, that locking is not necessary. We do a read-and-return sequence in pvclock, and between read and return, the global value can have changed. However, it can only have changed by means of an addition of a positive value. So if we detected that our clock timestamp is less than the current global, we know that we need to return a higher one, even though it is not exactly the one we compared to. OTOH, if we detect we're greater than the current time source, we atomically replace the value with our new readings. This does cause contention on big boxes (but big here means *BIG*), but it seems like a good trade-off, since it provides us with a time source guaranteed to be stable wrt time warps. After this patch is applied, I don't see a single warp in time during 5 days of execution, in any of the machines I saw them before. Signed-off-by: Glauber Costa <glommer@redhat.com> Acked-by: Zachary Amsden <zamsden@redhat.com> CC: Jeremy Fitzhardinge <jeremy@goop.org> CC: Avi Kivity <avi@redhat.com> CC: Marcelo Tosatti <mtosatti@redhat.com> CC: Zachary Amsden <zamsden@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 489fb490dbf8dab0249ad82b56688ae3842a79e8)
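A condensed sketch of the mechanism; the version loop that reads the per-vcpu time info into ret is elided:

    static atomic64_t last_value = ATOMIC64_INIT(0);

    cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
    {
        u64 ret, last;

        /* ... compute ret = system_time + scaled tsc delta ... */

        do {
            last = atomic64_read(&last_value);
            if (ret < last)
                return last;    /* someone already saw a later time */
        } while (atomic64_cmpxchg(&last_value, last, ret) != last);

        return ret;
    }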
2010-07-05KVM: SVM: Report emulated SVM features to userspaceJoerg Roedel
This patch implements the reporting of the emulated SVM features to userspace instead of the real hardware capabilities. Every real hardware capability needs emulation in nested svm so the old behavior was broken. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit c2c63a493924e09a1984d1374a0e60dfd54fc0b0)
2010-07-05KVM: x86: Add callback to let modules decide over some supported cpuid bitsJoerg Roedel
This patch adds the get_supported_cpuid callback to kvm_x86_ops. It will be used in do_cpuid_ent to delegate the decision about some supported cpuid bits to the architecture modules. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit d4330ef2fb2236a1e3a176f0f68360f4c0a8661b)
2010-07-05KVM: PPC: Do not create debugfs if fail to create vcpuWei Yongjun
If we fail to create the vcpu, we should not create the debugfs entry for it. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Alexander Graf <agraf@suse.de> Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 06056bfb944a0302a8f22eb45f09123de7fb417b)
2010-07-05KVM: s390: Fix possible memory leak in kvm_arch_vcpu_create()Wei Yongjun
This patch fixes a possible memory leak in kvm_arch_vcpu_create() under s390, which would happen when kvm_arch_vcpu_create() fails. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Carsten Otte <cotte@de.ibm.com> Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 7b06bf2ffa15e119c7439ed0b024d44f66d7b605)
2010-07-05KVM: SVM: Fix wrong interrupt injection in enable_irq_windowsJoerg Roedel
The nested_svm_intr() function does not execute the vmexit anymore. Therefore we may still be in the nested state after that function ran. This patch changes the nested_svm_intr() function to return whether the irq window could be enabled. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 8fe546547cf6857a9d984bfe2f2194910f3fc5d0)
2010-07-05KVM: SVM: Don't sync nested cr8 to lapic and backJoerg Roedel
This patch makes syncing of the guest tpr to the lapic conditional on !nested. Otherwise a nested guest using the TPR could freeze the guest. Another important change this patch introduces is that the cr8 intercept bits are no longer ORed at vmrun emulation if the guest sets VINTR_MASKING in its VMCB. The reason is that nested cr8 accesses must always be handled by the nested hypervisor because they change the shadow version of the tpr. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 88ab24adc7142506c8583ac36a34fa388300b750)
2010-07-05KVM: SVM: Fix nested msr intercept handlingJoerg Roedel
The nested_svm_exit_handled_msr() function maps only one page of the guests msr permission bitmap. This patch changes the code to use kvm_read_guest to fix the bug. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit 4c7da8cb43c09e71a405b5aeaa58a1dbac3c39e9)
2010-07-05KVM: SVM: Sync all control registers on nested vmexitJoerg Roedel
Currently the vmexit emulation does not sync control registers where the access is typically intercepted by the nested hypervisor. But we can not count on those intercepts, so sync these registers too and make the code architecturally more correct. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit cdbbdc1210223879450555fee04c29ebf116576b)
2010-07-05KVM: SVM: Fix schedule-while-atomic on nested exception handlingJoerg Roedel
Move the actual vmexit routine out of code that runs with irqs and preemption disabled. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> (cherry picked from commit b8e88bc8ffba5fe53fb8d8a0a4be3bbcffeebe56)