Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "LoongArch:

   - Add more CPUCFG mask bits

   - Improve feature detection

   - Add lazy load support for FPU and binary translation (LBT)
     register state

   - Fix return value for memory reads from and writes to in-kernel
     devices

   - Add support for detecting preemption from within a guest

   - Add KVM steal time test case to tools/selftests

  ARM:

   - Add support for FEAT_IDST, allowing ID registers that are not
     implemented to be reported as a normal trap rather than as an
     UNDEF exception

   - Add sanitisation of the VTCR_EL2 register, fixing a number of
     UXN/PXN/XN bugs in the process

   - Full handling of RESx bits, instead of only RES0, resulting in
     SCTLR_EL2 being added to the list of sanitised registers

   - More pKVM fixes for features that are not supposed to be exposed
     to guests

   - Make sure that MTE being disabled on the pKVM host doesn't give it
     the ability to attack the hypervisor

   - Allow pKVM's host stage-2 mappings to use the Force Write Back
     version of the memory attributes by using the "pass-through"
     encoding

   - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the
     guest

   - Preliminary work for guest GICv5 support

   - A bunch of debugfs fixes, removing pointless custom iterators
     stored in guest data structures

   - A small set of FPSIMD cleanups

   - Selftest fixes addressing the incorrect alignment of page
     allocation

   - Other assorted low-impact fixes and spelling fixes

  RISC-V:

   - Fixes for issues discovered by KVM API fuzzing in
     kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and
     kvm_riscv_vcpu_aia_imsic_update()

   - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM

   - Transparent huge page support for hypervisor page tables

   - Adjust the number of available guest irq files based on MMIO
     register sizes found in the device tree or the ACPI tables

   - Add RISC-V specific paging modes to KVM selftests

   - Detect paging mode at runtime for selftests

  s390:

   - Performance improvement for vSIE (aka nested virtualization)

   - Completely new memory management.

     s390 was a special snowflake that enlisted help from the
     architecture's page table management to build hypervisor page
     tables, in particular enabling sharing the last level of page
     tables. This however was a lot of code (~3K lines) in order to
     support KVM, and also blocked several features.

     The biggest advantage is that the page size of userspace is
     completely independent of the page size used by the guest:
     userspace can mix normal pages, THPs and hugetlbfs as it sees fit,
     and in fact transparent hugepages were not possible before. It's
     also now possible to have nested guests and guests with huge pages
     running on the same host

   - Maintainership change for s390 vfio-pci

   - Small quality of life improvement for protected guests

  x86:

   - Add support for giving the guest full ownership of PMU hardware
     (context switched around the fastpath run loop) and allowing
     direct access to data MSRs and PMCs (restricted by the vPMU
     model).

     KVM still intercepts access to control registers, e.g. to enforce
     event filtering and to prevent the guest from profiling sensitive
     host state. This is more accurate, since it has no risk of
     contention and thus dropped events, and also has significantly
     less overhead.

     For more information, see the commit message for merge commit
     bf2c3138ae36 ("Merge tag 'kvm-x86-pmu-6.20' ...")

   - Disallow changing the virtual CPU model if L2 is active, for all
     the same reasons KVM disallows changing the model after the first
     KVM_RUN

   - Fix a bug where KVM would incorrectly reject host accesses to PV
     MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled,
     even if those were advertised as supported to userspace

   - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs,
     where KVM would attempt to read CR3 when configuring an async #PF
     entry

   - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in
     KVM (for x86 only) to enforce usage of
     EXPORT_SYMBOL_FOR_KVM_INTERNAL. Only a few exports are intended
     for external usage, and those are allowed explicitly

   - When checking nested events after a vCPU is unblocked, ignore
     -EBUSY instead of WARNing. Userspace can sometimes put the vCPU
     into what should be an impossible state, and spurious exit to
     userspace on -EBUSY does not really do anything to solve the issue

   - Also throw in the towel and drop the WARN on INIT/SIPI being
     blocked when vCPU is in Wait-For-SIPI, which also resulted in
     playing whack-a-mole with syzkaller stuffing architecturally
     impossible states into KVM

   - Add support for new Intel instructions that don't require anything
     beyond enumerating feature flags to userspace

   - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2

   - Add WARNs to guard against modifying KVM's CPU caps outside of the
     intended setup flow, as nested VMX in particular is sensitive to
     unexpected changes in KVM's golden configuration

   - Add a quirk to allow userspace to opt-in to actually suppress EOI
     broadcasts when the suppression feature is enabled by the guest
     (currently limited to split IRQCHIP, i.e. userspace I/O APIC).

     Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an
     option as some userspaces have come to rely on KVM's buggy
     behavior (KVM advertises Suppress EOI Broadcast irrespective of
     whether or not userspace I/O APIC supports Directed EOIs)

   - Clean up KVM's handling of marking mapped vCPU pages dirty

   - Drop a pile of *ancient* sanity checks hidden behind KVM's unused
     ASSERT() macro, most of which could be trivially triggered by the
     guest and/or user, and all of which were useless

   - Fold "struct dest_map" into its sole user, "struct rtc_status", to
     make it more obvious what the weird parameter is used for, and to
     allow dropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n

   - Bury all of ioapic.h, i8254.h and related ioctls (including
     KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y

   - Add a regression test for recent APICv update fixes

   - Handle "hardware APIC ISR", a.k.a. SVI, updates in
     kvm_apic_update_apicv() to consolidate the updates, and to
     co-locate SVI updates with the updates for KVM's own cache of ISR
     information

   - Drop a dead function declaration

   - Minor cleanups

  x86 (Intel):

   - Rework KVM's handling of VMCS updates while L2 is active to
     temporarily switch to vmcs01 instead of deferring the update until
     the next nested VM-Exit.

     The deferred updates approach directly contributed to several
     bugs, was proving to be a maintenance burden due to the difficulty
     in auditing the correctness of deferred updates, and was polluting
     "struct nested_vmx" with a growing pile of booleans

   - Fix an SGX bug where KVM would incorrectly try to handle EPCM page
     faults, and instead always reflect them into the guest. Since KVM
     doesn't shadow EPCM entries, EPCM violations cannot be due to KVM
     interference and can't be resolved by KVM

   - Fix a bug where KVM would register its posted interrupt wakeup
     handler even if loading kvm-intel.ko ultimately failed

   - Disallow access to vmcb12 fields that aren't fully supported,
     mostly to avoid weirdness and complexity for FRED and other
     features, where KVM wants to enable VMCS shadowing for fields that
     conditionally exist

   - Print out the "bad" offsets and values if kvm-intel.ko refuses to
     load (or refuses to online a CPU) due to a VMCS config mismatch

  x86 (AMD):

   - Drop a user-triggerable WARN on nested_svm_load_cr3() failure

   - Add support for virtualizing ERAPS. Note, correct virtualization
     of ERAPS relies on an upcoming, publicly announced change in the
     APM to reduce the set of conditions where hardware (i.e. KVM)
     *must* flush the RAP

   - Ignore nSVM intercepts for instructions that are not supported
     according to L1's virtual CPU model

   - Add support for expedited writes to the fast MMIO bus, a la VMX's
     fastpath for EPT Misconfig

   - Don't set GIF when clearing EFER.SVME, as GIF exists independently
     of SVM, and allow userspace to restore nested state with GIF=0

   - Treat exit_code as an unsigned 64-bit value through all of KVM

   - Add support for fetching SNP certificates from userspace

   - Fix a bug where KVM would use vmcb02 instead of vmcb01 when
     emulating VMLOAD or VMSAVE on behalf of L2

   - Misc fixes and cleanups

  x86 selftests:

   - Add a regression test for TPR<=>CR8 synchronization and IRQ
     masking

   - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU
     support, and extend x86's infrastructure to support EPT and NPT
     (for L2 guests)

   - Extend several nested VMX tests to also cover nested SVM

   - Add a selftest for nested VMLOAD/VMSAVE

   - Rework the nested dirty log test, originally added as a regression
     test for PML where KVM logged L2 GPAs instead of L1 GPAs, to
     improve test coverage and to hopefully make the test easier to
     understand and maintain

  guest_memfd:

   - Remove kvm_gmem_populate()'s preparation tracking and half-baked
     hugepage handling. SEV/SNP was the only user of the tracking and
     it can do it via the RMP

   - Retroactively document and enforce (for SNP) that
     KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the
     source page to be 4KiB aligned, to avoid non-trivial complexity
     for something that no known VMM seems to be doing and to avoid an
     API special case for in-place conversion, which simply can't
     support unaligned sources

   - When populating guest_memfd memory, GUP the source page in common
     code and pass the refcounted page to the vendor callback, instead
     of letting vendor code do the heavy lifting. Doing so avoids a
     looming deadlock bug with in-place conversion due to an AB-BA
     conflict between mmap_lock and guest_memfd's filemap invalidate
     lock

  Generic:

   - Fix a bug where KVM would ignore the vCPU's selected address space
     when creating a vCPU-specific mapping of guest memory. Actually
     this bug could not be hit even on x86, the only architecture with
     multiple address spaces, but it's a bug nevertheless"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits)
  KVM: s390: Increase permitted SE header size to 1 MiB
  MAINTAINERS: Replace backup for s390 vfio-pci
  KVM: s390: vsie: Fix race in acquire_gmap_shadow()
  KVM: s390: vsie: Fix race in walk_guest_tables()
  KVM: s390: Use guest address to mark guest page dirty
  irqchip/riscv-imsic: Adjust the number of available guest irq files
  RISC-V: KVM: Transparent huge page support
  RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test
  RISC-V: KVM: Allow Zalasr extensions for Guest/VM
  KVM: riscv: selftests: Add riscv vm satp modes
  KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test
  riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM
  RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized
  RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr()
  RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr()
  RISC-V: KVM: Remove unnecessary 'ret' assignment
  KVM: s390: Add explicit padding to struct kvm_s390_keyop
  KVM: LoongArch: selftests: Add steal time test case
  LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side
  LoongArch: KVM: Add paravirt preempt feature in hypervisor side
  ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2026-02-13 11:31:15 -0800
committer: Linus Torvalds <torvalds@linux-foundation.org> 2026-02-13 11:31:15 -0800
commit: cb5573868ea85ddbc74dd9a917acd1e434d21390 (patch)
tree: c6180274e0a4a3dbd41050a8d1dd9d786558ebb8 /arch/s390/kernel
parent: c87c79345ea8ff917354e9223d44034e4aef2690 (diff)
parent: b1195183ed42f1522fae3fe44ebee3af437aa000 (diff)
1 files changed, 23 insertions, 119 deletions
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index ed46950be86f..a284f98d9716 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -134,14 +134,15 @@ static int uv_destroy(unsigned long paddr)
  */
 int uv_destroy_folio(struct folio *folio)
 {
+	unsigned long i;
 	int rc;
 
-	/* Large folios cannot be secure */
-	if (unlikely(folio_test_large(folio)))
-		return 0;
-
 	folio_get(folio);
-	rc = uv_destroy(folio_to_phys(folio));
+	for (i = 0; i < (1 << folio_order(folio)); i++) {
+		rc = uv_destroy(folio_to_phys(folio) + i * PAGE_SIZE);
+		if (rc)
+			break;
+	}
 	if (!rc)
 		clear_bit(PG_arch_1, &folio->flags.f);
 	folio_put(folio);
@@ -183,14 +184,15 @@ EXPORT_SYMBOL_GPL(uv_convert_from_secure);
  */
 int uv_convert_from_secure_folio(struct folio *folio)
 {
+	unsigned long i;
 	int rc;
 
-	/* Large folios cannot be secure */
-	if (unlikely(folio_test_large(folio)))
-		return 0;
-
 	folio_get(folio);
-	rc = uv_convert_from_secure(folio_to_phys(folio));
+	for (i = 0; i < (1 << folio_order(folio)); i++) {
+		rc = uv_convert_from_secure(folio_to_phys(folio) + i * PAGE_SIZE);
+		if (rc)
+			break;
+	}
 	if (!rc)
 		clear_bit(PG_arch_1, &folio->flags.f);
 	folio_put(folio);
@@ -207,39 +209,6 @@ int uv_convert_from_secure_pte(pte_t pte)
 	return uv_convert_from_secure_folio(pfn_folio(pte_pfn(pte)));
 }
 
-/**
- * should_export_before_import - Determine whether an export is needed
- * before an import-like operation
- * @uvcb: the Ultravisor control block of the UVC to be performed
- * @mm: the mm of the process
- *
- * Returns whether an export is needed before every import-like operation.
- * This is needed for shared pages, which don't trigger a secure storage
- * exception when accessed from a different guest.
- *
- * Although considered as one, the Unpin Page UVC is not an actual import,
- * so it is not affected.
- *
- * No export is needed also when there is only one protected VM, because the
- * page cannot belong to the wrong VM in that case (there is no "other VM"
- * it can belong to).
- *
- * Return: true if an export is needed before every import, otherwise false.
- */
-static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
-{
-	/*
-	 * The misc feature indicates, among other things, that importing a
-	 * shared page from a different protected VM will automatically also
-	 * transfer its ownership.
-	 */
-	if (uv_has_feature(BIT_UV_FEAT_MISC))
-		return false;
-	if (uvcb->cmd == UVC_CMD_UNPIN_PAGE_SHARED)
-		return false;
-	return atomic_read(&mm->context.protected_count) > 1;
-}
-
 /*
  * Calculate the expected ref_count for a folio that would otherwise have no
  * further pins. This was cribbed from similar functions in other places in
@@ -279,7 +248,7 @@ static int expected_folio_refs(struct folio *folio)
  *          (it's the same logic as split_folio()), and the folio must be
  *          locked.
  */
-static int __make_folio_secure(struct folio *folio, struct uv_cb_header *uvcb)
+int __make_folio_secure(struct folio *folio, struct uv_cb_header *uvcb)
 {
 	int expected, cc = 0;
 
@@ -309,20 +278,7 @@ static int __make_folio_secure(struct folio *folio, struct uv_cb_header *uvcb)
 		return -EAGAIN;
 	return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
 }
-
-static int make_folio_secure(struct mm_struct *mm, struct folio *folio, struct uv_cb_header *uvcb)
-{
-	int rc;
-
-	if (!folio_trylock(folio))
-		return -EAGAIN;
-	if (should_export_before_import(uvcb, mm))
-		uv_convert_from_secure(folio_to_phys(folio));
-	rc = __make_folio_secure(folio, uvcb);
-	folio_unlock(folio);
-
-	return rc;
-}
+EXPORT_SYMBOL(__make_folio_secure);
 
 /**
  * s390_wiggle_split_folio() - try to drain extra references to a folio and
@@ -337,7 +293,7 @@ static int make_folio_secure(struct mm_struct *mm, struct folio *folio, struct u
  *		   but another attempt can be made;
  *	   -EINVAL in case of other folio splitting errors. See split_folio().
  */
-static int s390_wiggle_split_folio(struct mm_struct *mm, struct folio *folio)
+int s390_wiggle_split_folio(struct mm_struct *mm, struct folio *folio)
 {
 	int rc, tried_splits;
 
@@ -409,56 +365,7 @@ static int s390_wiggle_split_folio(struct mm_struct *mm, struct folio *folio)
 	}
 	return -EAGAIN;
 }
-
-int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header *uvcb)
-{
-	struct vm_area_struct *vma;
-	struct folio_walk fw;
-	struct folio *folio;
-	int rc;
-
-	mmap_read_lock(mm);
-	vma = vma_lookup(mm, hva);
-	if (!vma) {
-		mmap_read_unlock(mm);
-		return -EFAULT;
-	}
-	folio = folio_walk_start(&fw, vma, hva, 0);
-	if (!folio) {
-		mmap_read_unlock(mm);
-		return -ENXIO;
-	}
-
-	folio_get(folio);
-	/*
-	 * Secure pages cannot be huge and userspace should not combine both.
-	 * In case userspace does it anyway this will result in an -EFAULT for
-	 * the unpack. The guest is thus never reaching secure mode.
-	 * If userspace plays dirty tricks and decides to map huge pages at a
-	 * later point in time, it will receive a segmentation fault or
-	 * KVM_RUN will return -EFAULT.
-	 */
-	if (folio_test_hugetlb(folio))
-		rc = -EFAULT;
-	else if (folio_test_large(folio))
-		rc = -E2BIG;
-	else if (!pte_write(fw.pte) || (pte_val(fw.pte) & _PAGE_INVALID))
-		rc = -ENXIO;
-	else
-		rc = make_folio_secure(mm, folio, uvcb);
-	folio_walk_end(&fw, vma);
-	mmap_read_unlock(mm);
-
-	if (rc == -E2BIG || rc == -EBUSY) {
-		rc = s390_wiggle_split_folio(mm, folio);
-		if (!rc)
-			rc = -EAGAIN;
-	}
-	folio_put(folio);
-
-	return rc;
-}
-EXPORT_SYMBOL_GPL(make_hva_secure);
+EXPORT_SYMBOL_GPL(s390_wiggle_split_folio);
 
 /*
  * To be called with the folio locked or with an extra reference! This will
@@ -470,21 +377,18 @@ int arch_make_folio_accessible(struct folio *folio)
 {
 	int rc = 0;
 
-	/* Large folios cannot be secure */
-	if (unlikely(folio_test_large(folio)))
-		return 0;
-
 	/*
-	 * PG_arch_1 is used in 2 places:
-	 * 1. for storage keys of hugetlb folios and KVM
-	 * 2. As an indication that this small folio might be secure. This can
-	 *    overindicate, e.g. we set the bit before calling
-	 *    convert_to_secure.
-	 * As secure pages are never large folios, both variants can co-exists.
+	 * PG_arch_1 is used as an indication that this small folio might be
+	 * secure. This can overindicate, e.g. we set the bit before calling
+	 * convert_to_secure.
 	 */
 	if (!test_bit(PG_arch_1, &folio->flags.f))
 		return 0;
 
+	/* Large folios cannot be secure. */
+	if (WARN_ON_ONCE(folio_test_large(folio)))
+		return -EFAULT;
+
 	rc = uv_pin_shared(folio_to_phys(folio));
 	if (!rc) {
 		clear_bit(PG_arch_1, &folio->flags.f);
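
The per-page loop that this diff adds to uv_destroy_folio() and uv_convert_from_secure_folio() can be modelled in plain userspace C. The sketch below is not kernel code: sketch_uv_destroy(), sketch_destroy_folio(), struct sketch_uv_log and the 4 KiB SKETCH_PAGE_SIZE are hypothetical stand-ins introduced here for illustration. It only shows the pattern of issuing one call per base page of an order-N folio and stopping at the first non-zero return code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB base pages */
#define SKETCH_MAX_CALLS 64

/* Hypothetical recorder standing in for the Ultravisor call. */
struct sketch_uv_log {
	unsigned long paddr[SKETCH_MAX_CALLS];	/* addresses seen, in order */
	size_t ncalls;				/* number of calls made so far */
	size_t fail_at;				/* call index that fails; >= SKETCH_MAX_CALLS: never */
};

/* Records the address and fails only on the configured call index. */
int sketch_uv_destroy(struct sketch_uv_log *log, unsigned long paddr)
{
	size_t idx;

	assert(log->ncalls < SKETCH_MAX_CALLS);
	idx = log->ncalls;
	log->paddr[log->ncalls++] = paddr;
	return idx == log->fail_at ? -5 : 0;
}

/*
 * Same shape as the loop in the diff: visit each of the (1 << order)
 * base pages of the folio and stop at the first error.
 */
int sketch_destroy_folio(struct sketch_uv_log *log, unsigned long base,
			 unsigned int order)
{
	unsigned long i;
	int rc = 0;

	for (i = 0; i < (1UL << order); i++) {
		rc = sketch_uv_destroy(log, base + i * SKETCH_PAGE_SIZE);
		if (rc)
			break;
	}
	return rc;
}
```

With order 2 and no injected failure, the model issues four calls at consecutive page-sized offsets from the base address. With a failure injected on the second call, it stops after two calls and returns that error, which in the real functions means PG_arch_1 is left set.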