summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2016-07-14KVM: x86: use physical LAPIC array for logical x2APICRadim Krčmář
Logical x2APIC IDs map injectively to physical x2APIC IDs, so we can reuse the physical array for them. This allows us to save space by sizing the logical maps according to the needs of xAPIC. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14KVM: x86: add kvm_apic_map_get_dest_lapicRadim Krčmář
kvm_irq_delivery_to_apic_fast and kvm_intr_is_single_vcpu_fast both compute the interrupt destination. Factor the code. 'struct kvm_lapic **dst = NULL' had to be added to silence GCC. GCC might complain about potential NULL access in the future, because it missed conditions that avoided uninitialized uses of dst. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14KVM: x86: bump KVM_SOFT_MAX_VCPUS to 240Radim Krčmář
240 has been well tested by Red Hat. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14kvm: vmx: advertise support for ept execute onlyBandan Das
MMU now knows about execute only mappings, so advertise the feature to L1 hypervisors Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14kvm: mmu: track read permission explicitly for shadow EPT page tablesBandan Das
To support execute only mappings on behalf of L1 hypervisors, reuse ACC_USER_MASK to signify if the L1 hypervisor has the R bit set. For the nested EPT case, we assumed that the U bit was always set since there was no equivalent in EPT page tables. Strictly speaking, this was not necessary because handle_ept_violation never set PFERR_USER_MASK in the error code (uf=0 in the parlance of update_permission_bitmask). We now have to set both U and UF correctly, respectively in FNAME(gpte_access) and in handle_ept_violation. Also in handle_ept_violation bit 3 of the exit qualification is not enough to detect a present PTE; all three bits 3-5 have to be checked. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14kvm: mmu: don't set the present bit unconditionallyBandan Das
To support execute only mappings on behalf of L1 hypervisors, we need to teach set_spte() to honor all three of L1's XWR bits. As a start, add a new variable "shadow_present_mask" that will be set for non-EPT shadow paging and clear for EPT. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14kvm: mmu: remove is_present_gpte()Bandan Das
We have two versions of the above function. To prevent confusion and bugs in the future, remove the non-FNAME version entirely and replace all calls with the actual check. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14kvm: mmu: extend the is_present check to 32 bitsBandan Das
This is safe because this function is called on host controlled page table and non-present/non-MMIO sptes never use bits 1..31. For the EPT case, this ensures that cases where only the execute bit is set is marked valid. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11KVM: VMX: introduce vm_{entry,exit}_control_reset_shadowPaolo Bonzini
There is no reason to read the entry/exit control fields of the VMCS and immediately write back the same value. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11KVM: nVMX: keep preemption timer enabled during L2 executionPaolo Bonzini
Because the vmcs12 preemption timer is emulated through a separate hrtimer, we can keep on using the preemption timer in the vmcs02 to emulare L1's TSC deadline timer. However, the corresponding bit in the pin-based execution control field must be kept consistent between vmcs01 and vmcs02. On vmentry we copy it into the vmcs02; on vmexit the preemption timer must be disabled in the vmcs01 if a preemption timer vmexit happened while in guest mode. The preemption timer value in the vmcs02 is set by vmx_vcpu_run, so it need not be considered in prepare_vmcs02. Cc: Yunhong Jiang <yunhong.jiang@intel.com> Cc: Haozhong Zhang <haozhong.zhang@intel.com> Tested-by: Wanpeng Li <kernellwp@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11KVM: nVMX: avoid incorrect preemption timer vmexit in nested guestWanpeng Li
The preemption timer for nested VMX is emulated by hrtimer which is started on L2 entry, stopped on L2 exit and evaluated via the check_nested_events hook. However, nested_vmx_exit_handled is always returning true for preemption timer vmexit. Then, the L1 preemption timer vmexit is captured and be treated as a L2 preemption timer vmexit, causing NULL pointer dereferences or worse in the L1 guest's vmexit handler: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) PGD 0 Oops: 0010 [#1] SMP Call Trace: ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm] handle_preemption_timer+0xe/0x20 [kvm_intel] vmx_handle_exit+0x169/0x15a0 [kvm_intel] ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm] kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm] ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm] ? vcpu_load+0x1c/0x60 [kvm] ? kvm_arch_vcpu_load+0x57/0x260 [kvm] kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm] do_vfs_ioctl+0x96/0x6a0 ? __fget_light+0x2a/0x90 SyS_ioctl+0x79/0x90 do_syscall_64+0x68/0x180 entry_SYSCALL64_slow_path+0x25/0x25 Code: Bad RIP value. RIP [< (null)>] (null) RSP <ffff8800b5263c48> CR2: 0000000000000000 ---[ end trace 9c70c48b1a2bc66e ]--- This can be reproduced readily by preemption timer enabled on L0 and disabled on L1. Return false since preemption timer vmexits must never be reflected to L2. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11KVM: VMX: reflect broken preemption timer in vmcs_configPaolo Bonzini
Simplify cpu_has_vmx_preemption_timer. This is consistent with the rest of setup_vmcs_config and preparatory for the next patch. Tested-by: Wanpeng Li <kernellwp@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-05KVM: x86: Use ARRAY_SIZE instead of dividing sizeof array with sizeof an elementWei Yongjun
Use ARRAY_SIZE instead of dividing sizeof array with sizeof an element Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01KVM: vmx: fix missed cancellation of TSC deadline timerWanpeng Li
INFO: rcu_sched detected stalls on CPUs/tasks: 1-...: (11800 GPs behind) idle=45d/140000000000000/0 softirq=0/0 fqs=21663 (detected by 0, t=65016 jiffies, g=11500, c=11499, q=719) Task dump for CPU 1: qemu-system-x86 R running task 0 3529 3525 0x00080808 ffff8802021791a0 ffff880212895040 0000000000000001 00007f1c2c00db40 ffff8801dd20fcd3 ffffc90002b98000 ffff8801dd20fc88 ffff8801dd20fcf8 0000000000000286 ffff8801dd2ac538 ffff8801dd20fcc0 ffffffffc06949c9 Call Trace: ? kvm_write_guest_cached+0xb9/0x160 [kvm] ? __delay+0xf/0x20 ? wait_lapic_expire+0x14a/0x200 [kvm] ? kvm_arch_vcpu_ioctl_run+0xcbe/0x1b00 [kvm] ? kvm_arch_vcpu_ioctl_run+0xe34/0x1b00 [kvm] ? kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm] ? __fget+0x5/0x210 ? do_vfs_ioctl+0x96/0x6a0 ? __fget_light+0x2a/0x90 ? SyS_ioctl+0x79/0x90 ? do_syscall_64+0x7c/0x1e0 ? entry_SYSCALL64_slow_path+0x25/0x25 This can be reproduced readily by running a full dynticks guest(since hrtimer in guest is heavily used) w/ lapic_timer_advance disabled. If fail to program hardware preemption timer, we will fallback to hrtimer based method, however, a previous programmed preemption timer miss to cancel in this scenario which results in one hardware preemption timer and one hrtimer emulated tsc deadline timer run simultaneously. So sometimes the target guest deadline tsc is earlier than guest tsc, which leads to the computation in vmx_set_hv_timer can underflow and cause delta_tsc to be set a huge value, then host soft lockup as above. This patch fix it by cancelling the previous programmed preemption timer if there is once we failed to program the new preemption timer and fallback to hrtimer based method. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01KVM: x86: introduce cancel_hv_tscdeadlineWanpeng Li
Introduce cancel_hv_tscdeadline() to encapsulate preemption timer cancel stuff. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01KVM: vmx: fix underflow in TSC deadline calculationPaolo Bonzini
If the TSC deadline timer is programmed really close to the deadline or even in the past, the computation in vmx_set_hv_timer can underflow and cause delta_tsc to be set to a huge value. This generally results in vmx_set_hv_timer returning -ERANGE, but we can fix it by limiting delta_tsc to be positive or zero. Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01KVM: x86: use guest_exit_irqoffPaolo Bonzini
This gains a few clock cycles per vmexit. On Intel there is no need anymore to enable the interrupts in vmx_handle_external_intr, since we are using the "acknowledge interrupt on exit" feature. AMD needs to do that, and must be careful to avoid the interrupt shadow. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01KVM: x86: always use "acknowledge interrupt on exit"Paolo Bonzini
This is necessary to simplify handle_external_intr in the next patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01KVM: remove kvm_guest_enter/exit wrappersPaolo Bonzini
Use the functions from context_tracking.h directly. Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23kvm: x86: use getboottime64Arnd Bergmann
KVM reads the current boottime value as a struct timespec in order to calculate the guest wallclock time, resulting in an overflow in 2038 on 32-bit systems. The data then gets passed as an unsigned 32-bit number to the guest, and that in turn overflows in 2106. We cannot do much about the second overflow, which affects both 32-bit and 64-bit hosts, but we can ensure that they both behave the same way and don't overflow until 2106, by using getboottime64() to read a timespec64 value. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23KVM: VMX: enable guest access to LMCE related MSRsAshok Raj
On Intel platforms, this patch adds LMCE to KVM MCE supported capabilities and handles guest access to LMCE related MSRs. Signed-off-by: Ashok Raj <ashok.raj@intel.com> [Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported Only enable LMCE on Intel platform Check MSR_IA32_FEATURE_CONTROL when handling guest access to MSR_IA32_MCG_EXT_CTL] Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROLHaozhong Zhang
KVM currently does not check the value written to guest MSR_IA32_FEATURE_CONTROL, though bits corresponding to disabled features may be set. This patch makes KVM to validate individual bits written to guest MSR_IA32_FEATURE_CONTROL according to enabled features. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23KVM: VMX: move msr_ia32_feature_control to vcpu_vmxHaozhong Zhang
msr_ia32_feature_control will be used for LMCE and not depend only on nested anymore, so move it from struct nested_vmx to struct vcpu_vmx. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16kvm: vmx: hook preemption timer supportYunhong Jiang
Hook the VMX preemption timer to the "hv timer" functionality added by the previous patch. This includes: checking if the feature is supported, if the feature is broken on the CPU, the hooks to setup/clean the VMX preemption timer, arming the timer on vmentry and handling the vmexit. A module parameter states if the VMX preemption timer should be utilized. Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com> [Move hv_deadline_tsc to struct vcpu_vmx, use -1 as the "unset" value. Put all VMX bits here. Enable it by default #yolo. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16kvm: vmx: rename vmx_pre/post_block to pi_pre/post_blockYunhong Jiang
Prepare to switch from preemption timer to hrtimer in the vmx_pre/post_block. Current functions are only for posted interrupt, rename them accordingly. Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16KVM: x86: support using the vmx preemption timer for tsc deadline timerYunhong Jiang
The VMX preemption timer can be used to virtualize the TSC deadline timer. The VMX preemption timer is armed when the vCPU is running, and a VMExit will happen if the virtual TSC deadline timer expires. When the vCPU thread is blocked because of HLT, KVM will switch to use an hrtimer, and then go back to the VMX preemption timer when the vCPU thread is unblocked. This solution avoids the complex OS's hrtimer system, and the host timer interrupt handling cost, replacing them with a little math (for guest->host TSC and host TSC->preemption timer conversion) and a cheaper VMexit. This benefits latency for isolated pCPUs. [A word about performance... Yunhong reported a 30% reduction in average latency from cyclictest. I made a similar test with tscdeadline_latency from kvm-unit-tests, and measured - ~20 clock cycles loss (out of ~3200, so less than 1% but still statistically significant) in the worst case where the test halts just after programming the TSC deadline timer - ~800 clock cycles gain (25% reduction in latency) in the best case where the test busy waits. I removed the VMX bits from Yunhong's patch, to concentrate them in the next patch - Paolo] Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16kvm: lapic: separate start_sw_tscdeadline from start_apic_timerYunhong Jiang
The function to start the tsc deadline timer virtualization will be used also by the pre_block hook when we use the preemption timer; change it to a separate function. No logic changes. Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16KVM: remove kvm_vcpu_compatiblePaolo Bonzini
The new created_vcpus field makes it possible to avoid the race between irqchip and VCPU creation in a much nicer way; just check under kvm->lock whether a VCPU has already been created. We can then remove KVM_APIC_ARCHITECTURE too, because at this point the symbol is only governing the default definition of kvm_vcpu_compatible. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16x86/kvm/svm: Simplify cpu_has_svm()Borislav Petkov
Use already cached CPUID information instead of querying CPUID again. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Joerg Roedel <joro@8bytes.org> Cc: kvm@vger.kernel.org Cc: x86@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-14KVM: x86: Fix typosAndrea Gelmini
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-03kvm/x86: remove unnecessary header file inclusionKai Huang
arch/x86/kvm/iommu.c includes <linux/intel-iommu.h> and <linux/dmar.h>, which both are unnecessary, in fact incorrect to be here as they are intel specific. Building kvm on x86 passed after removing above inclusion. Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-03KVM: x86: protect KVM_CREATE_PIT/KVM_CREATE_PIT2 with kvm->lockPaolo Bonzini
The syzkaller folks reported a NULL pointer dereference that seems to be cause by a race between KVM_CREATE_IRQCHIP and KVM_CREATE_PIT2. The former takes kvm->lock (except when registering the devices, which needs kvm->slots_lock); the latter takes kvm->slots_lock only. Change KVM_CREATE_PIT2 to follow the same model as KVM_CREATE_IRQCHIP. Testcase: #include <pthread.h> #include <linux/kvm.h> #include <fcntl.h> #include <sys/ioctl.h> #include <stdint.h> #include <string.h> #include <stdlib.h> #include <sys/syscall.h> #include <unistd.h> long r[23]; void* thr1(void* arg) { struct kvm_pit_config pitcfg = { .flags = 4 }; switch ((long)arg) { case 0: r[2] = open("/dev/kvm", O_RDONLY|O_ASYNC); break; case 1: r[3] = ioctl(r[2], KVM_CREATE_VM, 0); break; case 2: r[4] = ioctl(r[3], KVM_CREATE_IRQCHIP, 0); break; case 3: r[22] = ioctl(r[3], KVM_CREATE_PIT2, &pitcfg); break; } return 0; } int main(int argc, char **argv) { long i; pthread_t th[4]; memset(r, -1, sizeof(r)); for (i = 0; i < 4; i++) { pthread_create(&th[i], 0, thr, (void*)i); if (argc > 1 && rand()%2) usleep(rand()%1000); } usleep(20000); return 0; } Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-03KVM: x86: rename process_smi to enter_smm, process_smi_request to process_smiPaolo Bonzini
Make the function names more similar between KVM_REQ_NMI and KVM_REQ_SMI. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-03KVM: x86: avoid simultaneous queueing of both IRQ and SMIPaolo Bonzini
If the processor exits to KVM while delivering an interrupt, the hypervisor then requeues the interrupt for the next vmentry. Trying to enter SMM in this same window causes to enter non-root mode in emulated SMM (i.e. with IF=0) and with a request to inject an IRQ (i.e. with a valid VM-entry interrupt info field). This is invalid guest state (SDM 26.3.1.4 "Check on Guest RIP and RFLAGS") and the processor fails vmentry. The fix is to defer the injection from KVM_REQ_SMI to KVM_REQ_EVENT, like we already do for e.g. NMIs. This patch doesn't change the name of the process_smi function so that it can be applied to stable releases. The next patch will modify the names so that process_nmi and process_smi handle respectively KVM_REQ_NMI and KVM_REQ_SMI. This is especially common with Windows, probably due to the self-IPI trick that it uses to deliver deferred procedure calls (DPCs). Reported-by: Laszlo Ersek <lersek@redhat.com> Reported-by: Michał Zegan <webczat_200@poczta.onet.pl> Fixes: 64d6067057d9658acb8675afcfba549abdb7fc16 Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02KVM: x86: fix OOPS after invalid KVM_SET_DEBUGREGSPaolo Bonzini
MOV to DR6 or DR7 causes a #GP if an attempt is made to write a 1 to any of bits 63:32. However, this is not detected at KVM_SET_DEBUGREGS time, and the next KVM_RUN oopses: general protection fault: 0000 [#1] SMP CPU: 2 PID: 14987 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1 Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012 [...] Call Trace: [<ffffffffa072c93d>] kvm_arch_vcpu_ioctl_run+0x141d/0x14e0 [kvm] [<ffffffffa071405d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm] [<ffffffff81241648>] do_vfs_ioctl+0x298/0x480 [<ffffffff812418a9>] SyS_ioctl+0x79/0x90 [<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71 Code: 55 83 ff 07 48 89 e5 77 27 89 ff ff 24 fd 90 87 80 81 0f 23 fe 5d c3 0f 23 c6 5d c3 0f 23 ce 5d c3 0f 23 d6 5d c3 0f 23 de 5d c3 <0f> 23 f6 5d c3 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 RIP [<ffffffff810639eb>] native_set_debugreg+0x2b/0x40 RSP <ffff88005836bd50> Testcase (beautified/reduced from syzkaller output): #include <unistd.h> #include <sys/syscall.h> #include <string.h> #include <stdint.h> #include <linux/kvm.h> #include <fcntl.h> #include <sys/ioctl.h> long r[8]; int main() { struct kvm_debugregs dr = { 0 }; r[2] = open("/dev/kvm", O_RDONLY); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7); memcpy(&dr, "\x5d\x6a\x6b\xe8\x57\x3b\x4b\x7e\xcf\x0d\xa1\x72" "\xa3\x4a\x29\x0c\xfc\x6d\x44\x00\xa7\x52\xc7\xd8" "\x00\xdb\x89\x9d\x78\xb5\x54\x6b\x6b\x13\x1c\xe9" "\x5e\xd3\x0e\x40\x6f\xb4\x66\xf7\x5b\xe3\x36\xcb", 48); r[7] = ioctl(r[4], KVM_SET_DEBUGREGS, &dr); r[6] = ioctl(r[4], KVM_RUN, 0); } Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02KVM: fail KVM_SET_VCPU_EVENTS with invalid exception numberPaolo Bonzini
This cannot be returned by KVM_GET_VCPU_EVENTS, so it is okay to return EINVAL. It causes a WARN from exception_type: WARNING: CPU: 3 PID: 16732 at arch/x86/kvm/x86.c:345 exception_type+0x49/0x50 [kvm]() CPU: 3 PID: 16732 Comm: a.out Tainted: G W 4.4.6-300.fc23.x86_64 #1 Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012 0000000000000286 000000006308a48b ffff8800bec7fcf8 ffffffff813b542e 0000000000000000 ffffffffa0966496 ffff8800bec7fd30 ffffffff810a40f2 ffff8800552a8000 0000000000000000 00000000002c267c 0000000000000001 Call Trace: [<ffffffff813b542e>] dump_stack+0x63/0x85 [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0 [<ffffffff810a423a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa0924809>] exception_type+0x49/0x50 [kvm] [<ffffffffa0934622>] kvm_arch_vcpu_ioctl_run+0x10a2/0x14e0 [kvm] [<ffffffffa091c04d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm] [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480 [<ffffffff812414a9>] SyS_ioctl+0x79/0x90 [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71 ---[ end trace b1a0391266848f50 ]--- Testcase (beautified/reduced from syzkaller output): #include <unistd.h> #include <sys/syscall.h> #include <string.h> #include <stdint.h> #include <fcntl.h> #include <sys/ioctl.h> #include <linux/kvm.h> long r[31]; int main() { memset(r, -1, sizeof(r)); r[2] = open("/dev/kvm", O_RDONLY); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[7] = ioctl(r[3], KVM_CREATE_VCPU, 0); struct kvm_vcpu_events ve = { .exception.injected = 1, .exception.nr = 0xd4 }; r[27] = ioctl(r[7], KVM_SET_VCPU_EVENTS, &ve); r[30] = ioctl(r[7], KVM_RUN, 0); return 0; } Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02KVM: x86: avoid vmalloc(0) in the KVM_SET_CPUIDPaolo Bonzini
This causes an ugly dmesg splat. Beautified syzkaller testcase: #include <unistd.h> #include <sys/syscall.h> #include <sys/ioctl.h> #include <fcntl.h> #include <linux/kvm.h> long r[8]; int main() { struct kvm_cpuid2 c = { 0 }; r[2] = open("/dev/kvm", O_RDWR); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[4] = ioctl(r[3], KVM_CREATE_VCPU, 0x8); r[7] = ioctl(r[4], KVM_SET_CPUID, &c); return 0; } Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02kvm: x86: avoid warning on repeated KVM_SET_TSS_ADDRPaolo Bonzini
Found by syzkaller: WARNING: CPU: 3 PID: 15175 at arch/x86/kvm/x86.c:7705 __x86_set_memory_region+0x1dc/0x1f0 [kvm]() CPU: 3 PID: 15175 Comm: a.out Tainted: G W 4.4.6-300.fc23.x86_64 #1 Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012 0000000000000286 00000000950899a7 ffff88011ab3fbf0 ffffffff813b542e 0000000000000000 ffffffffa0966496 ffff88011ab3fc28 ffffffff810a40f2 00000000000001fd 0000000000003000 ffff88014fc50000 0000000000000000 Call Trace: [<ffffffff813b542e>] dump_stack+0x63/0x85 [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0 [<ffffffff810a423a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa09251cc>] __x86_set_memory_region+0x1dc/0x1f0 [kvm] [<ffffffffa092521b>] x86_set_memory_region+0x3b/0x60 [kvm] [<ffffffffa09bb61c>] vmx_set_tss_addr+0x3c/0x150 [kvm_intel] [<ffffffffa092f4d4>] kvm_arch_vm_ioctl+0x654/0xbc0 [kvm] [<ffffffffa091d31a>] kvm_vm_ioctl+0x9a/0x6f0 [kvm] [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480 [<ffffffff812414a9>] SyS_ioctl+0x79/0x90 [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71 Testcase: #include <unistd.h> #include <sys/ioctl.h> #include <fcntl.h> #include <string.h> #include <linux/kvm.h> long r[8]; int main() { memset(r, -1, sizeof(r)); r[2] = open("/dev/kvm", O_RDONLY|O_TRUNC); r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul); r[5] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul); r[7] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul); return 0; } Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02KVM: Handle MSR_IA32_PERF_CTLDmitry Bilunov
Intel CPUs having Turbo Boost feature implement an MSR to provide a control interface via rdmsr/wrmsr instructions. One could detect the presence of this feature by issuing one of these instructions and handling the #GP exception which is generated in case the referenced MSR is not implemented by the CPU. KVM's vCPU model behaves exactly as a real CPU in this case by injecting a fault when MSR_IA32_PERF_CTL is called (which KVM does not support). However, some operating systems use this register during an early boot stage in which their kernel is not capable of handling #GP correctly, causing #DP and finally a triple fault effectively resetting the vCPU. This patch implements a dummy handler for MSR_IA32_PERF_CTL to avoid the crashes. Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02KVM: x86: avoid write-tearing of TDPNadav Amit
In theory, nothing prevents the compiler from write-tearing PTEs, or split PTE writes. These partially-modified PTEs can be fetched by other cores and cause mayhem. I have not really encountered such case in real-life, but it does seem possible. For example, the compiler may try to do something creative for kvm_set_pte_rmapp() and perform multiple writes to the PTE. Signed-off-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-05-27Merge branch 'for-linus-4.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: "This contains a nice FPU fixup from Eli Cooper for UML" * 'for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: add extended processor state save/restore support um: extend fpstate to _xstate to support YMM registers um: fix FPU state preservation around signal handlers
2016-05-27mm: remove more IS_ERR_VALUE abusesLinus Torvalds
The do_brk() and vm_brk() return value was "unsigned long" and returned the starting address on success, and an error value on failure. The reasons are entirely historical, and go back to it basically behaving like the mmap() interface does. However, nobody actually wanted that interface, and it causes totally pointless IS_ERR_VALUE() confusion. What every single caller actually wants is just the simpler integer return of zero for success and negative error number on failure. So just convert to that much clearer and more common calling convention, and get rid of all the IS_ERR_VALUE() uses wrt vm_brk(). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27Merge tag 'platform-drivers-x86-v4.7-1' of ↵Linus Torvalds
git://git.infradead.org/users/dvhart/linux-platform-drivers-x86 Pull x86 platform driver updates from Darren Hart: "Mostly minor updates and cleanups. One new power management controller driver for Intel Core SoCs. platform/x86: - Add PMC Driver for Intel Core SoC dell-rbtn: - Ignore ACPI notifications if device is suspended thinkpad_acpi: - save kbdlight state on suspend and restore it on resume intel_menlow: - reduce code duplication asus-wmi: - provide access to ALS control ideapad-laptop: - add a new WMI string for ESC key surfacepro3_button: - Add a warning when switching to tablet mode sony-laptop: - Avoid oops on module unload for older laptops intel_telemetry: - Constify telemetry_core_ops structures fujitsu-laptop: - Use IS_ENABLED() instead of checking for built-in or module asus-laptop: - correct error handling in sysfs_acpi_set - remove redundant initializers - correct error handling in asus_read_brightness() fujitsu-laptop: - Support radio LED" * tag 'platform-drivers-x86-v4.7-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: platform/x86: Add PMC Driver for Intel Core SoC dell-rbtn: Ignore ACPI notifications if device is suspended thinkpad_acpi: save kbdlight state on suspend and restore it on resume intel_menlow: reduce code duplication asus-wmi: provide access to ALS control ideapad-laptop: add a new WMI string for ESC key surfacepro3_button: Add a warning when switching to tablet mode sony-laptop: Avoid oops on module unload for older laptops intel_telemetry: Constify telemetry_core_ops structures fujitsu-laptop: Use IS_ENABLED() instead of checking for built-in or module asus-laptop: correct error handling in sysfs_acpi_set asus-laptop: remove redundant initializers asus-laptop: correct error handling in asus_read_brightness() fujitsu-laptop: Support radio LED
2016-05-27Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull second batch of KVM updates from Radim Krčmář: "General: - move kvm_stat tool from QEMU repo into tools/kvm/kvm_stat (kvm_stat had nothing to do with QEMU in the first place -- the tool only interprets debugfs) - expose per-vm statistics in debugfs and support them in kvm_stat (KVM always collected per-vm statistics, but they were summarised into global statistics) x86: - fix dynamic APICv (VMX was improperly configured and a guest could access host's APIC MSRs, CVE-2016-4440) - minor fixes ARM changes from Christoffer Dall: - new vgic reimplementation of our horribly broken legacy vgic implementation. The two implementations will live side-by-side (with the new being the configured default) for one kernel release and then we'll remove the legacy one. - fix for a non-critical issue with virtual abort injection to guests" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits) tools: kvm_stat: Add comments tools: kvm_stat: Introduce pid monitoring KVM: Create debugfs dir and stat files for each VM MAINTAINERS: Add kvm tools tools: kvm_stat: Powerpc related fixes tools: Add kvm_stat man page tools: Add kvm_stat vm monitor script kvm:vmx: more complete state update on APICv on/off KVM: SVM: Add more SVM_EXIT_REASONS KVM: Unify traced vector format svm: bitwise vs logical op typo KVM: arm/arm64: vgic-new: Synchronize changes to active state KVM: arm/arm64: vgic-new: enable build KVM: arm/arm64: vgic-new: implement mapped IRQ handling KVM: arm/arm64: vgic-new: Wire up irqfd injection KVM: arm/arm64: vgic-new: Add vgic_v2/v3_enable KVM: arm/arm64: vgic-new: vgic_init: implement map_resources KVM: arm/arm64: vgic-new: vgic_init: implement vgic_init KVM: arm/arm64: vgic-new: vgic_init: implement vgic_create KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init ...
2016-05-27platform/x86: Add PMC Driver for Intel Core SoCRajneesh Bhardwaj
This patch adds the Power Management Controller driver as a PCI driver for Intel Core SoC architecture. This driver can utilize debugging capabilities and supported features as exposed by the Power Management Controller. Please refer to the below specification for more details on PMC features. http://www.intel.in/content/www/in/en/chipsets/100-series-chipset-datasheet-vol-2.html The current version of this driver exposes SLP_S0_RESIDENCY counter. This counter can be used for detecting fragile SLP_S0 signal related failures and take corrective actions when PCH SLP_S0 signal is not asserted after kernel freeze as part of suspend to idle flow (echo freeze > /sys/power/state). Intel Platform Controller Hub (PCH) asserts SLP_S0 signal when it detects favorable conditions to enter its low power mode. As a pre-requisite the SoC should be in deepest possible Package C-State and devices should be in low power mode. For example, on Skylake SoC the deepest Package C-State is Package C10 or PC10. Suspend to idle flow generally leads to PC10 state but PC10 state may not be sufficient for realizing the platform wide power potential which SLP_S0 signal assertion can provide. SLP_S0 signal is often connected to the Embedded Controller (EC) and the Power Management IC (PMIC) for other platform power management related optimizations. In general, SLP_S0 assertion == PC10 + PCH low power mode + ModPhy Lanes power gated + PLL Idle. As part of this driver, a mechanism to read the SLP_S0_RESIDENCY is exposed as an API and also debugfs features are added to indicate SLP_S0 signal assertion residency in microseconds. echo freeze > /sys/power/state wake the system cat /sys/kernel/debug/pmc_core/slp_s0_residency_usec Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com> Signed-off-by: Vishwanath Somayaji <vishwanath.somayaji@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Darren Hart <dvhart@linux.intel.com>
2016-05-26Merge branch 'kbuild' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kbuild updates from Michal Marek: - new option CONFIG_TRIM_UNUSED_KSYMS which does a two-pass build and unexports symbols which are not used in the current config [Nicolas Pitre] - several kbuild rule cleanups [Masahiro Yamada] - warning option adjustments for gcov etc [Arnd Bergmann] - a few more small fixes * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (31 commits) kbuild: move -Wunused-const-variable to W=1 warning level kbuild: fix if_change and friends to consider argument order kbuild: fix adjust_autoksyms.sh for modules that need only one symbol kbuild: fix ksym_dep_filter when multiple EXPORT_SYMBOL() on the same line gcov: disable -Wmaybe-uninitialized warning gcov: disable tree-loop-im to reduce stack usage gcov: disable for COMPILE_TEST Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES Kbuild: change CC_OPTIMIZE_FOR_SIZE definition kbuild: forbid kernel directory to contain spaces and colons kbuild: adjust ksym_dep_filter for some cmd_* renames kbuild: Fix dependencies for final vmlinux link kbuild: better abstract vmlinux sequential prerequisites kbuild: fix call to adjust_autoksyms.sh when output directory specified kbuild: Get rid of KBUILD_STR kbuild: rename cmd_as_s_S to cmd_cpp_s_S kbuild: rename cmd_cc_i_c to cmd_cpp_i_c kbuild: drop redundant "PHONY += FORCE" kbuild: delete unnecessary "@:" kbuild: mark help target as PHONY ...
2016-05-25Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes: EFI, entry code, pkeys and MPX fixes, TASK_SIZE cleanups and a tsc frequency table fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Switch from TASK_SIZE to TASK_SIZE_MAX in the page fault code x86/fsgsbase/64: Use TASK_SIZE_MAX for FSBASE/GSBASE upper limits x86/mm/mpx: Work around MPX erratum SKD046 x86/entry/64: Fix stack return address retrieval in thunk x86/efi: Fix 7-parameter efi_call()s x86/cpufeature, x86/mm/pkeys: Fix broken compile-time disabling of pkeys x86/tsc: Add missing Cherrytrail frequency to the table
2016-05-25Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "Mostly tooling and PMU driver fixes, but also a number of late updates such as the reworking of the call-chain size limiting logic to make call-graph recording more robust, plus tooling side changes for the new 'backwards ring-buffer' extension to the perf ring-buffer" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits) perf record: Read from backward ring buffer perf record: Rename variable to make code clear perf record: Prevent reading invalid data in record__mmap_read perf evlist: Add API to pause/resume perf trace: Use the ptr->name beautifier as default for "filename" args perf trace: Use the fd->name beautifier as default for "fd" args perf report: Add srcline_from/to branch sort keys perf evsel: Record fd into perf_mmap perf evsel: Add overwrite attribute and check write_backward perf tools: Set buildid dir under symfs when --symfs is provided perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced perf annotate: Sort list of recognised instructions perf annotate: Fix identification of ARM blt and bls instructions perf tools: Fix usage of max_stack sysctl perf callchain: Stop validating callchains by the max_stack sysctl perf trace: Fix exit_group() formatting perf top: Use machine->kptr_restrict_warned perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1 perf machine: Do not bail out if not managing to read ref reloc symbol perf/x86/intel/p4: Trival indentation fix, remove space ...
2016-05-25kvm:vmx: more complete state update on APICv on/offRoman Kagan
The function to update APICv on/off state (in particular, to deactivate it when enabling Hyper-V SynIC) is incomplete: it doesn't adjust APICv-related fields among secondary processor-based VM-execution controls. As a result, Windows 2012 guests get stuck when SynIC-based auto-EOI interrupt intersected with e.g. an IPI in the guest. In addition, the MSR intercept bitmap isn't updated every time "virtualize x2APIC mode" is toggled. This path can only be triggered by a malicious guest, because Windows didn't use x2APIC but rather their own synthetic APIC access MSRs; however a guest running in a SynIC-enabled VM could switch to x2APIC and thus obtain direct access to host APIC MSRs (CVE-2016-4440). The patch fixes those omissions. Signed-off-by: Roman Kagan <rkagan@virtuozzo.com> Reported-by: Steve Rutherford <srutherford@google.com> Reported-by: Yang Zhang <yang.zhang.wz@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-24Merge tag 'for-linus-4.7-rc0-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen bug fixes from David Vrabel. * tag 'for-linus-4.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: use same main loop for counting and remapping pages xen/events: Don't move disabled irqs xen/x86: actually allocate legacy interrupts on PV guests Xen: don't warn about 2-byte wchar_t in efi xen/gntdev: reduce copy batch size to 16 xen/x86: don't lose event interrupts