diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-01-17 13:03:37 -0800 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-01-17 13:03:37 -0800 |
commit | 09d1c6a80f2cf94c6e70be919203473d4ab8e26c (patch) | |
tree | 144604e6cf9f513c45c4d035548ac1760e7dac11 /Documentation | |
parent | 1b1934dbbdcf9aa2d507932ff488cec47999cf3f (diff) | |
parent | 1c6d984f523f67ecfad1083bb04c55d91977bb15 (diff) | |
download | lwn-09d1c6a80f2cf94c6e70be919203473d4ab8e26c.tar.gz lwn-09d1c6a80f2cf94c6e70be919203473d4ab8e26c.zip |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.
- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters
and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.
- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV support to
that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list
selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing
flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 6 | ||||
-rw-r--r-- | Documentation/virt/kvm/api.rst | 219 | ||||
-rw-r--r-- | Documentation/virt/kvm/locking.rst | 7 |
3 files changed, 213 insertions, 19 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6ee0f9a5da70..a36cf8cc582c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3996,9 +3996,9 @@ vulnerability. System may allow data leaks with this option. - no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized - steal time accounting. steal time is computed, but - won't influence scheduler behaviour + no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV] Disable + paravirtualized steal time accounting. steal time is + computed, but won't influence scheduler behaviour nosync [HW,M68K] Disables sync negotiation for all devices. diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 7025b3751027..3ec0b7a455a0 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -147,10 +147,29 @@ described as 'basic' will be available. The new VM has no virtual cpus and no memory. You probably want to use 0 as machine type. +X86: +^^^^ + +Supported X86 VM types can be queried via KVM_CAP_VM_TYPES. + +S390: +^^^^^ + In order to create user controlled virtual machines on S390, check KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as privileged user (CAP_SYS_ADMIN). +MIPS: +^^^^^ + +To use hardware assisted virtualization on MIPS (VZ ASE) rather than +the default trap & emulate implementation (which changes the virtual +memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the +flag KVM_VM_MIPS_VZ. + +ARM64: +^^^^^^ + On arm64, the physical address size for a VM (IPA Size limit) is limited to 40bits by default. The limit can be configured if the host supports the extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use @@ -608,18 +627,6 @@ interrupt number dequeues the interrupt. This is an asynchronous vcpu ioctl and can be invoked from any thread. -4.17 KVM_DEBUG_GUEST --------------------- - -:Capability: basic -:Architectures: none -:Type: vcpu ioctl -:Parameters: none) -:Returns: -1 on error - -Support for this has been removed. Use KVM_SET_GUEST_DEBUG instead. - - 4.18 KVM_GET_MSRS ----------------- @@ -6192,6 +6199,130 @@ to know what fields can be changed for the system register described by ``op0, op1, crn, crm, op2``. KVM rejects ID register values that describe a superset of the features supported by the system. +4.140 KVM_SET_USER_MEMORY_REGION2 +--------------------------------- + +:Capability: KVM_CAP_USER_MEMORY2 +:Architectures: all +:Type: vm ioctl +:Parameters: struct kvm_userspace_memory_region2 (in) +:Returns: 0 on success, -1 on error + +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that +allows mapping guest_memfd memory into a guest. All fields shared with +KVM_SET_USER_MEMORY_REGION identically. Userspace can set KVM_MEM_GUEST_MEMFD +in flags to have KVM bind the memory region to a given guest_memfd range of +[guest_memfd_offset, guest_memfd_offset + memory_size]. The target guest_memfd +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and +the target range must not be bound to any other memory region. All standard +bounds checks apply (use common sense). + +:: + + struct kvm_userspace_memory_region2 { + __u32 slot; + __u32 flags; + __u64 guest_phys_addr; + __u64 memory_size; /* bytes */ + __u64 userspace_addr; /* start of the userspace allocated memory */ + __u64 guest_memfd_offset; + __u32 guest_memfd; + __u32 pad1; + __u64 pad2[14]; + }; + +A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) and +userspace_addr (shared memory). However, "valid" for userspace_addr simply +means that the address itself must be a legal userspace address. The backing +mapping for userspace_addr is not required to be valid/populated at the time of +KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated +on-demand. + +When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes +userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE +state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute +is '0' for all gfns. Userspace can control whether memory is shared/private by +toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed. + +4.141 KVM_SET_MEMORY_ATTRIBUTES +------------------------------- + +:Capability: KVM_CAP_MEMORY_ATTRIBUTES +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_memory_attributes (in) +:Returns: 0 on success, <0 on error + +KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range +of guest physical memory. + +:: + + struct kvm_memory_attributes { + __u64 address; + __u64 size; + __u64 attributes; + __u64 flags; + }; + + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3) + +The address and size must be page aligned. The supported attributes can be +retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES. If +executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes +supported by that VM. If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES +returns all attributes supported by KVM. The only attribute defined at this +time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being +guest private memory. + +Note, there is no "get" API. Userspace is responsible for explicitly tracking +the state of a gfn/page as needed. + +The "flags" field is reserved for future extensions and must be '0'. + +4.142 KVM_CREATE_GUEST_MEMFD +---------------------------- + +:Capability: KVM_CAP_GUEST_MEMFD +:Architectures: none +:Type: vm ioctl +:Parameters: struct kvm_create_guest_memfd(in) +:Returns: 0 on success, <0 on error + +KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor +that refers to it. guest_memfd files are roughly analogous to files created +via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage, +and are automatically released when the last reference is dropped. Unlike +"regular" memfd_create() files, guest_memfd files are bound to their owning +virtual machine (see below), cannot be mapped, read, or written by userspace, +and cannot be resized (guest_memfd files do however support PUNCH_HOLE). + +:: + + struct kvm_create_guest_memfd { + __u64 size; + __u64 flags; + __u64 reserved[6]; + }; + +Conceptually, the inode backing a guest_memfd file represents physical memory, +i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The +file itself, which is bound to a "struct kvm", is that instance's view of the +underlying memory, e.g. effectively provides the translation of guest addresses +to host memory. This allows for use cases where multiple KVM structures are +used to manage a single virtual machine, e.g. when performing intrahost +migration of a virtual machine. + +KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2, +and more specifically via the guest_memfd and guest_memfd_offset fields in +"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset +into the guest_memfd instance. For a given guest_memfd file, there can be at +most one mapping per page, i.e. binding multiple memory regions to a single +guest_memfd range is not allowed (any number of memory regions can be bound to +a single guest_memfd file, but the bound ranges must not overlap). + +See KVM_SET_USER_MEMORY_REGION2 for additional details. + 5. The kvm_run structure ======================== @@ -6826,6 +6957,30 @@ spec refer, https://github.com/riscv/riscv-sbi-doc. :: + /* KVM_EXIT_MEMORY_FAULT */ + struct { + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3) + __u64 flags; + __u64 gpa; + __u64 size; + } memory_fault; + +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that +could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the +guest physical address range [gpa, gpa + size) of the fault. The 'flags' field +describes properties of the faulting access that are likely pertinent: + + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred + on a private memory access. When clear, indicates the fault occurred on a + shared access. + +Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it +accompanies a return code of '-1', not '0'! errno will always be set to EFAULT +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume +kvm_run.exit_reason is stale/undefined for all other error numbers. + +:: + /* KVM_EXIT_NOTIFY */ struct { #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0) @@ -7858,6 +8013,27 @@ This capability is aimed to mitigate the threat that malicious VMs can cause CPU stuck (due to event windows don't open up) and make the CPU unavailable to host or other VMs. +7.34 KVM_CAP_MEMORY_FAULT_INFO +------------------------------ + +:Architectures: x86 +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. + +The presence of this capability indicates that KVM_RUN will fill +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if +there is a valid memslot but no backing VMA for the corresponding host virtual +address. + +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set +to KVM_EXIT_MEMORY_FAULT. + +Note: Userspaces which attempt to resolve memory faults so that they can retry +KVM_RUN are encouraged to guard against repeatedly receiving the same +error/annotated fault. + +See KVM_EXIT_MEMORY_FAULT for more information. + 8. Other capabilities. ====================== @@ -8374,6 +8550,7 @@ PVHVM guests. Valid flags are:: #define KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL (1 << 4) #define KVM_XEN_HVM_CONFIG_EVTCHN_SEND (1 << 5) #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG (1 << 6) + #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE (1 << 7) The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG ioctl is available, for the guest to set its hypercall page. @@ -8417,6 +8594,11 @@ behave more correctly, not using the XEN_RUNSTATE_UPDATE flag until/unless specifically enabled (by the guest making the hypercall, causing the VMM to enable the KVM_XEN_ATTR_TYPE_RUNSTATE_UPDATE_FLAG attribute). +The KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag indicates that KVM supports +clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be +done when the KVM_CAP_XEN_HVM ioctl sets the +KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag. + 8.31 KVM_CAP_PPC_MULTITCE ------------------------- @@ -8596,6 +8778,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a 64-bit bitmap (each bit describing a block size). The default value is 0, to disable the eager page splitting. +8.41 KVM_CAP_VM_TYPES +--------------------- + +:Capability: KVM_CAP_MEMORY_ATTRIBUTES +:Architectures: x86 +:Type: system ioctl + +This capability returns a bitmap of support VM types. The 1-setting of bit @n +means the VM type with value @n is supported. Possible values of @n are:: + + #define KVM_X86_DEFAULT_VM 0 + #define KVM_X86_SW_PROTECTED_VM 1 + 9. Known KVM API problems ========================= diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst index 3a034db5e55f..02880d5552d5 100644 --- a/Documentation/virt/kvm/locking.rst +++ b/Documentation/virt/kvm/locking.rst @@ -43,10 +43,9 @@ On x86: - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock and kvm->arch.xen.xen_lock -- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and - kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and - cannot be taken without already holding kvm->arch.mmu_lock (typically with - ``read_lock`` for the TDP MMU, thus the need for additional spinlocks). +- kvm->arch.mmu_lock is an rwlock; critical sections for + kvm->arch.tdp_mmu_pages_lock and kvm->arch.mmu_unsync_pages_lock must + also take kvm->arch.mmu_lock Everything else is a leaf: no other lock is taken inside the critical sections. |