| author | Linus Torvalds <torvalds@linux-foundation.org> | 2026-04-17 07:18:03 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2026-04-17 07:18:03 -0700 |
| commit | 01f492e1817e858d1712f2489d0afbaa552f417b (patch) | |
| tree | 9ba6df223570acd45ccb2ba647407f75f4393eab /arch/arm64/kvm | |
| parent | e55d98e7756135f32150b9b8f75d580d0d4b2dd3 (diff) | |
| parent | 6b802031877a995456c528095c41d1948546bf45 (diff) | |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Arm:
- Add support for tracing in the standalone EL2 hypervisor code,
which should help both debugging and performance analysis. This
uses the new infrastructure for 'remote' trace buffers that can be
exposed by non-kernel entities such as firmware, and which came
through the tracing tree
- Add support for GICv5 Per Processor Interrupts (PPIs), as the
starting point for supporting the new GIC architecture in KVM
- Finally add support for pKVM protected guests, where pages are
unmapped from the host as they are faulted into the guest and can
be shared back from the guest using pKVM hypercalls. Protected
guests are created using a new machine type identifier. As the
elusive guestmem has not yet delivered on its promises, anonymous
memory is also supported
This is only a first step towards full isolation from the host; for
example, the CPU register state and DMA accesses are not yet
isolated. Because this does not yet fully deliver what it
promises, it is hidden behind CONFIG_ARM_PKVM_GUEST +
'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is
created. Caveat emptor
- Rework the dreaded user_mem_abort() function to make it more
maintainable, reducing the amount of state being exposed to the
various helpers and rendering a substantial amount of state
immutable
- Expand the Stage-2 page table dumper to support NV shadow page
tables on a per-VM basis
- Tidy up the pKVM PSCI proxy code to be slightly less hard to
follow
- Fix both SPE and TRBE in non-VHE configurations so that they do not
generate spurious, out of context table walks that ultimately lead
to very bad HW lockups
- A small set of patches fixing the Stage-2 MMU freeing in error
cases
- Tighten up the accepted SMC immediate value to be only #0 for host
SMCCC calls
- The usual cleanups and other selftest churn
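The machine-type gating for protected guests described in the Arm list above can be pictured with a small sketch. It is an illustration only: `check_vm_type()` and the `VM_TYPE_*` values are hypothetical names and numbers, not the kernel's uAPI constants. It assumes two rules, both visible in the arm.c hunk of this merge: unknown type bits are rejected, and the protected type is refused unless pKVM is active.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical values for illustration; the real uAPI constants may differ. */
#define VM_TYPE_IPA_SIZE_MASK	0xffUL
#define VM_TYPE_PROTECTED	(1UL << 31)
#define VM_TYPE_MASK		(VM_TYPE_IPA_SIZE_MASK | VM_TYPE_PROTECTED)

/* Returns 0 if the requested machine type is acceptable, -EINVAL otherwise. */
int check_vm_type(unsigned long type, bool pkvm_enabled)
{
	/* Reject any bit that is not part of the known type mask. */
	if (type & ~VM_TYPE_MASK)
		return -EINVAL;

	/* A protected guest can only be created when pKVM is active. */
	if ((type & VM_TYPE_PROTECTED) && !pkvm_enabled)
		return -EINVAL;

	return 0;
}
```

With these assumed values, `check_vm_type(VM_TYPE_PROTECTED | 40, false)` fails with `-EINVAL`, while the same type succeeds once `pkvm_enabled` is true.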
LoongArch:
- Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel()
- Add DMSINTC irqchip in kernel support
RISC-V:
- Fix steal time shared memory alignment checks
- Fix vector context allocation leak
- Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()
- Fix double-free of sdata in kvm_pmu_clear_snapshot_area()
- Fix integer overflow in kvm_pmu_validate_counter_mask()
- Fix shift-out-of-bounds in make_xfence_request()
- Fix lost write protection on huge pages during dirty logging
- Split huge pages during fault handling for dirty logging
- Skip CSR restore if VCPU is reloaded on the same core
- Implement kvm_arch_has_default_irqchip() for KVM selftests
- Factor out ISA checks into separate sources
- Add hideleg to struct kvm_vcpu_config
- Factor out VCPU config into separate sources
- Support configuration of per-VM HGATP mode from KVM user space
s390:
- Support for ESA (31-bit) guests inside nested hypervisors
- Remove restriction on memslot alignment, which is not needed
anymore with the new gmap code
- Fix LPSW/E to update the bear (which of course is the breaking
event address register)
x86:
- Shut up various UBSAN warnings on reading module parameters before
they are initialized
- Don't zero-allocate page tables that are used for splitting
hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in
the page table and thus write all bytes
- As an optimization, bail early when trying to unsync 4KiB mappings
if the target gfn can just be mapped with a 2MiB hugepage
x86 generic:
- Copy single-chunk MMIO write values into struct kvm_vcpu (more
precisely struct kvm_mmio_fragment) to fix use-after-free stack
bugs where KVM would dereference a stack pointer after an exit to
userspace
- Clean up and comment the emulated MMIO code to try to make it
easier to maintain (not necessarily "easy", but "easier")
- Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of
VMX and SVM enabling) as it is needed for trusted I/O
- Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions
- Immediately fail the build if a required #define is missing in one
of KVM's headers that is included multiple times
- Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
exception, mostly to prevent syzkaller from abusing the uAPI to
trigger WARNs, but also because it can help prevent userspace from
unintentionally crashing the VM
- Exempt SMM from CPUID faulting on Intel, as per the spec
- Misc hardening and cleanup changes
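The single-chunk MMIO fix in the x86 generic list above replaces a pointer into a caller's stack frame with an owned copy of the value. The sketch below shows that idea under stated assumptions: `struct mmio_fragment` and `mmio_fragment_set()` are hypothetical stand-ins, not the layout or helpers of the kernel's `struct kvm_mmio_fragment`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fragment descriptor; not the kernel's actual layout. */
struct mmio_fragment {
	uint64_t gpa;
	unsigned int len;
	uint8_t val[8];	/* owned copy instead of a pointer to the caller's stack */
};

/* Copy up to 8 bytes of write data; returns 0 on success, -1 if too large. */
int mmio_fragment_set(struct mmio_fragment *frag, uint64_t gpa,
		      const void *data, unsigned int len)
{
	/* Refuse oversized writes before touching the fragment. */
	if (len > sizeof(frag->val))
		return -1;

	frag->gpa = gpa;
	frag->len = len;
	memcpy(frag->val, data, len);	/* the caller's buffer may now go away */
	return 0;
}
```

Because the bytes live inside the fragment, the value stays valid after the function that produced it has returned, which is the property a stored stack pointer cannot provide.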
x86 (AMD):
- Fix and optimize IRQ window inhibit handling for AVIC; make it
per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple
vCPUs have to-be-injected IRQs
- Clean up and optimize the OSVW handling, avoiding a bug in which
KVM would overwrite state when enabling virtualization on multiple
CPUs in parallel. This should not be a problem because OSVW should
usually be the same for all CPUs
- Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains
about a "too large" size based purely on user input
- Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION
- Disallow synchronizing a VMSA of an already-launched/encrypted
vCPU, as doing so for an SNP guest will crash the host due to an
RMP violation page fault
- Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped
queries are required to hold kvm->lock, and enforce it by lockdep.
Fix various bugs where sev_guest() was not ensured to be stable for
the whole duration of a function or ioctl
- Convert a pile of kvm->lock SEV code to guard()
- Play nicer with userspace that does not enable
KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6
as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the
payload would end up in EXITINFO2 rather than CR2, for example).
Only set CR2 and DR6 when consumption of the payload is imminent,
but on the other hand force delivery of the payload in all paths
where userspace retrieves CR2 or DR6
- Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT
instead of vmcb02->save.cr2. The value is out of sync after a
save/restore or after a #PF is injected into L2
- Fix a class of nSVM bugs where some fields written by the CPU are
not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so
are not up-to-date when saved by KVM_GET_NESTED_STATE
- Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE
and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly
initialized after save+restore
- Add a variety of missing nSVM consistency checks
- Fix several bugs where KVM failed to correctly update VMCB fields
on nested #VMEXIT
- Fix several bugs where KVM failed to correctly synthesize #UD or
#GP for SVM-related instructions
- Add support for save+restore of virtualized LBRs (on SVM)
- Refactor various helpers and macros to improve clarity and
(hopefully) make the code easier to maintain
- Aggressively sanitize fields when copying from vmcb12, to guard
against unintentionally allowing L1 to utilize yet-to-be-defined
features
- Fix several bugs where KVM botched rAX legality checks when
emulating SVM instructions. There are remaining issues in that KVM
doesn't handle size prefix overrides for 64-bit guests
- Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
instead of somewhat arbitrarily synthesizing #GP (i.e. don't double
down on AMD's architectural but sketchy behavior of generating #GP
for "unsupported" addresses)
- Cache all used vmcb12 fields to further harden against TOCTOU bugs
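The guard() conversions mentioned in the AMD list above rely on scope-based cleanup: the lock is released automatically on every return path. A rough userspace analogue, assuming the GCC/Clang `cleanup` attribute, might look like the sketch below. `struct toy_lock`, `LOCK_GUARD()` and `set_timer_ppi()` are hypothetical names, the lock is a single-threaded stand-in for a mutex, and the 16..31 range check is an arbitrary example.

```c
#include <assert.h>
#include <errno.h>

/* Toy single-threaded lock standing in for a real mutex. */
struct toy_lock {
	int held;
};

/* Take the lock and hand back the pointer so it can initialise a guard. */
static struct toy_lock *toy_lock_take(struct toy_lock *lock)
{
	assert(!lock->held);
	lock->held = 1;
	return lock;
}

/* Runs automatically when the guard variable goes out of scope. */
static void toy_lock_drop(struct toy_lock **guard)
{
	(*guard)->held = 0;
}

/* Hypothetical analogue of guard(mutex)(&lock); GCC/Clang extension. */
#define LOCK_GUARD(lock) \
	struct toy_lock *guard_ \
	__attribute__((cleanup(toy_lock_drop), unused)) = toy_lock_take(lock)

static struct toy_lock config_lock;
static int ppis_immutable;
static int timer_ppi;

/* Every return path releases config_lock without an explicit unlock or goto. */
int set_timer_ppi(int irq)
{
	LOCK_GUARD(&config_lock);

	if (ppis_immutable)
		return -EBUSY;
	if (irq < 16 || irq > 31)
		return -EINVAL;

	timer_ppi = irq;
	return 0;
}
```

The early returns are the point of the pattern: error paths no longer need an `out:` label and a manual unlock.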
x86 (Intel):
- Drop obsolete branch hint prefixes from the VMX instruction macros
- Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
register input when appropriate
- Code cleanups
guest_memfd:
- Don't mark guest_memfd folios as accessed, as guest_memfd doesn't
support reclaim, the memory is unevictable, and there is no storage
to write back to
LoongArch selftests:
- Add KVM PMU test cases
s390 selftests:
- Enable more memory selftests
x86 selftests:
- Add support for Hygon CPUs in KVM selftests
- Fix a bug in the MSR test where it would get false failures on
AMD/Hygon CPUs with exactly one of RDPID or RDTSCP
- Add an MADV_COLLAPSE testcase for guest_memfd as a regression test
for a bug where the kernel would attempt to collapse guest_memfd
folios against KVM's will"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits)
KVM: x86: use inlines instead of macros for is_sev_*guest
x86/virt: Treat SVM as unsupported when running as an SEV+ guest
KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
KVM: SEV: use mutex guard in snp_handle_guest_req()
KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
KVM: SEV: use mutex guard in snp_launch_update()
KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
KVM: SEV: WARN on unhandled VM type when initializing VM
KVM: LoongArch: selftests: Add PMU overflow interrupt test
KVM: LoongArch: selftests: Add basic PMU event counting test
KVM: LoongArch: selftests: Add cpucfg read/write helpers
LoongArch: KVM: Add DMSINTC inject msi to vCPU
LoongArch: KVM: Add DMSINTC device support
LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
...
Diffstat (limited to 'arch/arm64/kvm')
53 files changed, 4513 insertions, 777 deletions
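The arch_timer.c part of this diff masks each timer interrupt number with 0x1f before setting a bit in a 32-bit map, because a GICv5-style interrupt ID may carry its type in the top bits. The sketch below illustrates why the mask matters. The encoding (type in bits 31:29, PPI type value 1) and the names `make_ppi()` and `timer_ppis_are_unique()` are assumptions made for this example, and `__builtin_popcount()` is a GCC/Clang builtin standing in for the kernel's `hweight32()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding for illustration: type in the top bits, ID in the low bits. */
#define INTID_TYPE_SHIFT	29
#define INTID_TYPE_PPI		1u

/* Build a typed interrupt ID from a plain PPI number. */
uint32_t make_ppi(uint32_t id)
{
	return (INTID_TYPE_PPI << INTID_TYPE_SHIFT) | id;
}

/* True if all n timer interrupts map to distinct PPI numbers. */
bool timer_ppis_are_unique(const uint32_t *irqs, unsigned int n)
{
	uint32_t ppis = 0;

	assert(n <= 32);
	for (unsigned int i = 0; i < n; i++)
		ppis |= UINT32_C(1) << (irqs[i] & 0x1f);	/* drop the type bits */

	/* Duplicates collapse onto one bit, so the popcount falls short of n. */
	return (unsigned int)__builtin_popcount(ppis) == n;
}
```

Without the mask, a typed ID would be used directly as a shift count far larger than 31, which is undefined behaviour in C.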
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 7d1f22fd490b..449154f9a485 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -42,9 +42,27 @@ menuconfig KVM
 
 	  If unsure, say N.
 
+if KVM
+
+config PTDUMP_STAGE2_DEBUGFS
+	bool "Present the stage-2 pagetables to debugfs"
+	depends on DEBUG_KERNEL
+	depends on DEBUG_FS
+	depends on ARCH_HAS_PTDUMP
+	select PTDUMP
+	default n
+	help
+	  Say Y here if you want to show the stage-2 kernel pagetables
+	  layout in a debugfs file. This information is only useful for kernel developers
+	  who are working in architecture specific areas of the kernel.
+	  It is probably not a good idea to enable this feature in a production
+	  kernel.
+
+	  If in doubt, say N.
+
 config NVHE_EL2_DEBUG
 	bool "Debug mode for non-VHE EL2 object"
-	depends on KVM
+	default n
 	help
 	  Say Y here to enable the debug mode for the non-VHE KVM EL2 object.
 	  Failure reports will BUG() in the hypervisor. This is intended for
@@ -52,11 +70,30 @@ config NVHE_EL2_DEBUG
 
 	  If unsure, say N.
 
-config PROTECTED_NVHE_STACKTRACE
-	bool "Protected KVM hypervisor stacktraces"
-	depends on NVHE_EL2_DEBUG
+if NVHE_EL2_DEBUG
+
+config NVHE_EL2_TRACING
+	bool
+	depends on TRACING && FTRACE
+	select TRACE_REMOTE
+	default y
+
+config PKVM_DISABLE_STAGE2_ON_PANIC
+	bool "Disable the host stage-2 on panic"
 	default n
 	help
+	  Relax the host stage-2 on hypervisor panic to allow the kernel to
+	  unwind and symbolize the hypervisor stacktrace. This however tampers
+	  the system security. This is intended for local EL2 hypervisor
+	  development.
+
+	  If unsure, say N.
+
+config PKVM_STACKTRACE
+	bool "Protected KVM hypervisor stacktraces"
+	depends on PKVM_DISABLE_STAGE2_ON_PANIC
+	default y
+	help
 	  Say Y here to enable pKVM hypervisor stacktraces on hyp_panic()
 
 	  If using protected nVHE mode, but cannot afford the associated
@@ -65,21 +102,6 @@ config PROTECTED_NVHE_STACKTRACE
 
 	  If unsure, or not using protected nVHE (pKVM), say N.
 
-config PTDUMP_STAGE2_DEBUGFS
-	bool "Present the stage-2 pagetables to debugfs"
-	depends on KVM
-	depends on DEBUG_KERNEL
-	depends on DEBUG_FS
-	depends on ARCH_HAS_PTDUMP
-	select PTDUMP
-	default n
-	help
-	  Say Y here if you want to show the stage-2 kernel pagetables
-	  layout in a debugfs file. This information is only useful for kernel developers
-	  who are working in architecture specific areas of the kernel.
-	  It is probably not a good idea to enable this feature in a production
-	  kernel.
-
-	  If in doubt, say N.
-
+endif # NVHE_EL2_DEBUG
+endif # KVM
 endif # VIRTUALIZATION
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 3ebc0570345c..59612d2f277c 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -30,6 +30,8 @@ kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
 kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
 kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o
 
+kvm-$(CONFIG_NVHE_EL2_TRACING) += hyp_trace.o
+
 always-y := hyp_constants.h hyp-constants.s
 
 define rule_gen_hyp_constants
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 600f250753b4..cbea4d9ee955 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -56,6 +56,12 @@ static struct irq_ops arch_timer_irq_ops = {
 	.get_input_level = kvm_arch_timer_get_input_level,
 };
 
+static struct irq_ops arch_timer_irq_ops_vgic_v5 = {
+	.get_input_level = kvm_arch_timer_get_input_level,
+	.queue_irq_unlock = vgic_v5_ppi_queue_irq_unlock,
+	.set_direct_injection = vgic_v5_set_ppi_dvi,
+};
+
 static int nr_timers(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu_has_nv(vcpu))
@@ -447,6 +453,17 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
 	if (userspace_irqchip(vcpu->kvm))
 		return;
 
+	/* Skip injecting on GICv5 for directly injected (DVI'd) timers */
+	if (vgic_is_v5(vcpu->kvm)) {
+		struct timer_map map;
+
+		get_timer_map(vcpu, &map);
+
+		if (map.direct_ptimer == timer_ctx ||
+		    map.direct_vtimer == timer_ctx)
+			return;
+	}
+
kvm_vgic_inject_irq(vcpu->kvm, vcpu, timer_irq(timer_ctx), timer_ctx->irq.level, @@ -674,6 +691,7 @@ static void kvm_timer_vcpu_load_gic(struct arch_timer_context *ctx) phys_active = kvm_vgic_map_is_active(vcpu, timer_irq(ctx)); phys_active |= ctx->irq.level; + phys_active |= vgic_is_v5(vcpu->kvm); set_timer_irq_phys_active(ctx, phys_active); } @@ -740,13 +758,11 @@ static void kvm_timer_vcpu_load_nested_switch(struct kvm_vcpu *vcpu, ret = kvm_vgic_map_phys_irq(vcpu, map->direct_vtimer->host_timer_irq, - timer_irq(map->direct_vtimer), - &arch_timer_irq_ops); + timer_irq(map->direct_vtimer)); WARN_ON_ONCE(ret); ret = kvm_vgic_map_phys_irq(vcpu, map->direct_ptimer->host_timer_irq, - timer_irq(map->direct_ptimer), - &arch_timer_irq_ops); + timer_irq(map->direct_ptimer)); WARN_ON_ONCE(ret); } } @@ -864,7 +880,8 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu) get_timer_map(vcpu, &map); if (static_branch_likely(&has_gic_active_state)) { - if (vcpu_has_nv(vcpu)) + /* We don't do NV on GICv5, yet */ + if (vcpu_has_nv(vcpu) && !vgic_is_v5(vcpu->kvm)) kvm_timer_vcpu_load_nested_switch(vcpu, &map); kvm_timer_vcpu_load_gic(map.direct_vtimer); @@ -934,6 +951,12 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu) if (kvm_vcpu_is_blocking(vcpu)) kvm_timer_blocking(vcpu); + + if (vgic_is_v5(vcpu->kvm)) { + set_timer_irq_phys_active(map.direct_vtimer, false); + if (map.direct_ptimer) + set_timer_irq_phys_active(map.direct_ptimer, false); + } } void kvm_timer_sync_nested(struct kvm_vcpu *vcpu) @@ -1097,10 +1120,19 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu) HRTIMER_MODE_ABS_HARD); } +/* + * This is always called during kvm_arch_init_vm, but will also be + * called from kvm_vgic_create if we have a vGICv5. + */ void kvm_timer_init_vm(struct kvm *kvm) { + /* + * Set up the default PPIs - note that we adjust them based on + * the model of the GIC as GICv5 uses a different way to + * describing interrupts. 
+ */ for (int i = 0; i < NR_KVM_TIMERS; i++) - kvm->arch.timer_data.ppi[i] = default_ppi[i]; + kvm->arch.timer_data.ppi[i] = get_vgic_ppi(kvm, default_ppi[i]); } void kvm_timer_cpu_up(void) @@ -1269,7 +1301,15 @@ static int timer_irq_set_irqchip_state(struct irq_data *d, static void timer_irq_eoi(struct irq_data *d) { - if (!irqd_is_forwarded_to_vcpu(d)) + /* + * On a GICv5 host, we still need to call EOI on the parent for + * PPIs. The host driver already handles irqs which are forwarded to + * vcpus, and skips the GIC CDDI while still doing the GIC CDEOI. This + * is required to emulate the EOIMode=1 on GICv5 hardware. Failure to + * call EOI unsurprisingly results in *BAD* lock-ups. + */ + if (!irqd_is_forwarded_to_vcpu(d) || + kvm_vgic_global_state.type == VGIC_V5) irq_chip_eoi_parent(d); } @@ -1333,7 +1373,8 @@ static int kvm_irq_init(struct arch_timer_kvm_info *info) host_vtimer_irq = info->virtual_irq; kvm_irq_fixup_flags(host_vtimer_irq, &host_vtimer_irq_flags); - if (kvm_vgic_global_state.no_hw_deactivation) { + if (kvm_vgic_global_state.no_hw_deactivation || + kvm_vgic_global_state.type == VGIC_V5) { struct fwnode_handle *fwnode; struct irq_data *data; @@ -1351,7 +1392,8 @@ static int kvm_irq_init(struct arch_timer_kvm_info *info) return -ENOMEM; } - arch_timer_irq_ops.flags |= VGIC_IRQ_SW_RESAMPLE; + if (kvm_vgic_global_state.no_hw_deactivation) + arch_timer_irq_ops.flags |= VGIC_IRQ_SW_RESAMPLE; WARN_ON(irq_domain_push_irq(domain, host_vtimer_irq, (void *)TIMER_VTIMER)); } @@ -1501,11 +1543,18 @@ static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu) if (kvm_vgic_set_owner(vcpu, irq, ctx)) break; + /* With GICv5, the default PPI is what you get -- nothing else */ + if (vgic_is_v5(vcpu->kvm) && irq != get_vgic_ppi(vcpu->kvm, default_ppi[i])) + break; + /* - * We know by construction that we only have PPIs, so - * all values are less than 32. + * We know by construction that we only have PPIs, so all values + * are less than 32 for non-GICv5 VGICs. 
On GICv5, they are + * architecturally defined to be under 32 too. However, we mask + * off most of the bits as we might be presented with a GICv5 + * style PPI where the type is encoded in the top-bits. */ - ppis |= BIT(irq); + ppis |= BIT(irq & 0x1f); } valid = hweight32(ppis) == nr_timers(vcpu); @@ -1543,6 +1592,7 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu) { struct arch_timer_cpu *timer = vcpu_timer(vcpu); struct timer_map map; + struct irq_ops *ops; int ret; if (timer->enabled) @@ -1563,20 +1613,22 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu) get_timer_map(vcpu, &map); + ops = vgic_is_v5(vcpu->kvm) ? &arch_timer_irq_ops_vgic_v5 : + &arch_timer_irq_ops; + + for (int i = 0; i < nr_timers(vcpu); i++) + kvm_vgic_set_irq_ops(vcpu, timer_irq(vcpu_get_timer(vcpu, i)), ops); + ret = kvm_vgic_map_phys_irq(vcpu, map.direct_vtimer->host_timer_irq, - timer_irq(map.direct_vtimer), - &arch_timer_irq_ops); + timer_irq(map.direct_vtimer)); if (ret) return ret; - if (map.direct_ptimer) { + if (map.direct_ptimer) ret = kvm_vgic_map_phys_irq(vcpu, map.direct_ptimer->host_timer_irq, - timer_irq(map.direct_ptimer), - &arch_timer_irq_ops); - } - + timer_irq(map.direct_ptimer)); if (ret) return ret; @@ -1603,15 +1655,14 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) if (get_user(irq, uaddr)) return -EFAULT; - if (!(irq_is_ppi(irq))) + if (!(irq_is_ppi(vcpu->kvm, irq))) return -EINVAL; - mutex_lock(&vcpu->kvm->arch.config_lock); + guard(mutex)(&vcpu->kvm->arch.config_lock); if (test_bit(KVM_ARCH_FLAG_TIMER_PPIS_IMMUTABLE, &vcpu->kvm->arch.flags)) { - ret = -EBUSY; - goto out; + return -EBUSY; } switch (attr->attr) { @@ -1628,8 +1679,7 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) idx = TIMER_HPTIMER; break; default: - ret = -ENXIO; - goto out; + return -ENXIO; } /* @@ -1639,8 +1689,6 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) */ vcpu->kvm->arch.timer_data.ppi[idx] = 
irq; -out: - mutex_unlock(&vcpu->kvm->arch.config_lock); return ret; } diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 410ffd41fd73..176cbe8baad3 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -24,6 +24,7 @@ #define CREATE_TRACE_POINTS #include "trace_arm.h" +#include "hyp_trace.h" #include <linux/uaccess.h> #include <asm/ptrace.h> @@ -35,6 +36,7 @@ #include <asm/kvm_arm.h> #include <asm/kvm_asm.h> #include <asm/kvm_emulate.h> +#include <asm/kvm_hyp.h> #include <asm/kvm_mmu.h> #include <asm/kvm_nested.h> #include <asm/kvm_pkvm.h> @@ -45,6 +47,9 @@ #include <kvm/arm_hypercalls.h> #include <kvm/arm_pmu.h> #include <kvm/arm_psci.h> +#include <kvm/arm_vgic.h> + +#include <linux/irqchip/arm-gic-v5.h> #include "sys_regs.h" @@ -203,6 +208,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) { int ret; + if (type & ~KVM_VM_TYPE_ARM_MASK) + return -EINVAL; + mutex_init(&kvm->arch.config_lock); #ifdef CONFIG_LOCKDEP @@ -234,9 +242,12 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) * If any failures occur after this is successful, make sure to * call __pkvm_unreserve_vm to unreserve the VM in hyp. 
*/ - ret = pkvm_init_host_vm(kvm); + ret = pkvm_init_host_vm(kvm, type); if (ret) - goto err_free_cpumask; + goto err_uninit_mmu; + } else if (type & KVM_VM_TYPE_ARM_PROTECTED) { + ret = -EINVAL; + goto err_uninit_mmu; } kvm_vgic_early_init(kvm); @@ -252,6 +263,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) return 0; +err_uninit_mmu: + kvm_uninit_stage2_mmu(kvm); err_free_cpumask: free_cpumask_var(kvm->arch.supported_cpus); err_unshare_kvm: @@ -301,6 +314,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) if (is_protected_kvm_enabled()) pkvm_destroy_hyp_vm(kvm); + kvm_uninit_stage2_mmu(kvm); kvm_destroy_mpidr_data(kvm); kfree(kvm->arch.sysreg_masks); @@ -613,6 +627,9 @@ static bool kvm_vcpu_should_clear_twi(struct kvm_vcpu *vcpu) if (unlikely(kvm_wfi_trap_policy != KVM_WFX_NOTRAP_SINGLE_TASK)) return kvm_wfi_trap_policy == KVM_WFX_NOTRAP; + if (vgic_is_v5(vcpu->kvm)) + return single_task_running(); + return single_task_running() && vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 && (atomic_read(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count) || @@ -705,6 +722,8 @@ nommu: if (!cpumask_test_cpu(cpu, vcpu->kvm->arch.supported_cpus)) vcpu_set_on_unsupported_cpu(vcpu); + + vcpu->arch.pid = pid_nr(vcpu->pid); } void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) @@ -934,6 +953,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu) return ret; } + ret = vgic_v5_finalize_ppi_state(kvm); + if (ret) + return ret; + if (is_protected_kvm_enabled()) { ret = pkvm_create_hyp_vm(kvm); if (ret) @@ -1439,10 +1462,11 @@ static int vcpu_interrupt_line(struct kvm_vcpu *vcpu, int number, bool level) int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level, bool line_status) { - u32 irq = irq_level->irq; unsigned int irq_type, vcpu_id, irq_num; struct kvm_vcpu *vcpu = NULL; bool level = irq_level->level; + u32 irq = irq_level->irq; + unsigned long *mask; irq_type = (irq >> KVM_ARM_IRQ_TYPE_SHIFT) & KVM_ARM_IRQ_TYPE_MASK; vcpu_id = (irq >> 
KVM_ARM_IRQ_VCPU_SHIFT) & KVM_ARM_IRQ_VCPU_MASK; @@ -1472,16 +1496,37 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_level, if (!vcpu) return -EINVAL; - if (irq_num < VGIC_NR_SGIS || irq_num >= VGIC_NR_PRIVATE_IRQS) + if (vgic_is_v5(kvm)) { + if (irq_num >= VGIC_V5_NR_PRIVATE_IRQS) + return -EINVAL; + + /* + * Only allow PPIs that are explicitly exposed to + * usespace to be driven via KVM_IRQ_LINE + */ + mask = kvm->arch.vgic.gicv5_vm.userspace_ppis; + if (!test_bit(irq_num, mask)) + return -EINVAL; + + /* Build a GICv5-style IntID here */ + irq_num = vgic_v5_make_ppi(irq_num); + } else if (irq_num < VGIC_NR_SGIS || + irq_num >= VGIC_NR_PRIVATE_IRQS) { return -EINVAL; + } return kvm_vgic_inject_irq(kvm, vcpu, irq_num, level, NULL); case KVM_ARM_IRQ_TYPE_SPI: if (!irqchip_in_kernel(kvm)) return -ENXIO; - if (irq_num < VGIC_NR_PRIVATE_IRQS) - return -EINVAL; + if (vgic_is_v5(kvm)) { + /* Build a GICv5-style IntID here */ + irq_num = vgic_v5_make_spi(irq_num); + } else { + if (irq_num < VGIC_NR_PRIVATE_IRQS) + return -EINVAL; + } return kvm_vgic_inject_irq(kvm, NULL, irq_num, level, NULL); } @@ -2414,6 +2459,10 @@ static int __init init_subsystems(void) kvm_register_perf_callbacks(); + err = kvm_hyp_trace_init(); + if (err) + kvm_err("Failed to initialize Hyp tracing\n"); + out: if (err) hyp_cpu_pm_exit(); @@ -2465,7 +2514,7 @@ static int __init do_pkvm_init(u32 hyp_va_bits) preempt_disable(); cpu_hyp_init_context(); ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size, - num_possible_cpus(), kern_hyp_va(per_cpu_base), + kern_hyp_va(per_cpu_base), hyp_va_bits); cpu_hyp_init_features(); @@ -2507,6 +2556,7 @@ static void kvm_hyp_init_symbols(void) { kvm_nvhe_sym(id_aa64pfr0_el1_sys_val) = get_hyp_id_aa64pfr0_el1(); kvm_nvhe_sym(id_aa64pfr1_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1); + kvm_nvhe_sym(id_aa64pfr2_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64PFR2_EL1); kvm_nvhe_sym(id_aa64isar0_el1_sys_val) = 
read_sanitised_ftr_reg(SYS_ID_AA64ISAR0_EL1); kvm_nvhe_sym(id_aa64isar1_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR1_EL1); kvm_nvhe_sym(id_aa64isar2_el1_sys_val) = read_sanitised_ftr_reg(SYS_ID_AA64ISAR2_EL1); @@ -2529,6 +2579,9 @@ static void kvm_hyp_init_symbols(void) kvm_nvhe_sym(hfgitr2_masks) = hfgitr2_masks; kvm_nvhe_sym(hdfgrtr2_masks)= hdfgrtr2_masks; kvm_nvhe_sym(hdfgwtr2_masks)= hdfgwtr2_masks; + kvm_nvhe_sym(ich_hfgrtr_masks) = ich_hfgrtr_masks; + kvm_nvhe_sym(ich_hfgwtr_masks) = ich_hfgwtr_masks; + kvm_nvhe_sym(ich_hfgitr_masks) = ich_hfgitr_masks; /* * Flush entire BSS since part of its data containing init symbols is read @@ -2674,6 +2727,8 @@ static int __init init_hyp_mode(void) kvm_nvhe_sym(kvm_arm_hyp_percpu_base)[cpu] = (unsigned long)page_addr; } + kvm_nvhe_sym(hyp_nr_cpus) = num_possible_cpus(); + /* * Map the Hyp-code called directly from the host */ diff --git a/arch/arm64/kvm/config.c b/arch/arm64/kvm/config.c index d9f553cbf9df..f35b8dddd7c1 100644 --- a/arch/arm64/kvm/config.c +++ b/arch/arm64/kvm/config.c @@ -225,6 +225,7 @@ struct reg_feat_map_desc { #define FEAT_MTPMU ID_AA64DFR0_EL1, MTPMU, IMP #define FEAT_HCX ID_AA64MMFR1_EL1, HCX, IMP #define FEAT_S2PIE ID_AA64MMFR3_EL1, S2PIE, IMP +#define FEAT_GCIE ID_AA64PFR2_EL1, GCIE, IMP static bool not_feat_aa64el3(struct kvm *kvm) { @@ -1277,6 +1278,58 @@ static const struct reg_bits_to_feat_map vtcr_el2_feat_map[] = { static const DECLARE_FEAT_MAP(vtcr_el2_desc, VTCR_EL2, vtcr_el2_feat_map, FEAT_AA64EL2); +static const struct reg_bits_to_feat_map ich_hfgrtr_feat_map[] = { + NEEDS_FEAT(ICH_HFGRTR_EL2_ICC_APR_EL1 | + ICH_HFGRTR_EL2_ICC_IDRn_EL1 | + ICH_HFGRTR_EL2_ICC_CR0_EL1 | + ICH_HFGRTR_EL2_ICC_HPPIR_EL1 | + ICH_HFGRTR_EL2_ICC_PCR_EL1 | + ICH_HFGRTR_EL2_ICC_ICSR_EL1 | + ICH_HFGRTR_EL2_ICC_IAFFIDR_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_HMRn_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_ENABLERn_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_PENDRn_EL1 | + ICH_HFGRTR_EL2_ICC_PPI_PRIORITYRn_EL1 | + 
ICH_HFGRTR_EL2_ICC_PPI_ACTIVERn_EL1, + FEAT_GCIE), +}; + +static const DECLARE_FEAT_MAP_FGT(ich_hfgrtr_desc, ich_hfgrtr_masks, + ich_hfgrtr_feat_map, FEAT_GCIE); + +static const struct reg_bits_to_feat_map ich_hfgwtr_feat_map[] = { + NEEDS_FEAT(ICH_HFGWTR_EL2_ICC_APR_EL1 | + ICH_HFGWTR_EL2_ICC_CR0_EL1 | + ICH_HFGWTR_EL2_ICC_PCR_EL1 | + ICH_HFGWTR_EL2_ICC_ICSR_EL1 | + ICH_HFGWTR_EL2_ICC_PPI_ENABLERn_EL1 | + ICH_HFGWTR_EL2_ICC_PPI_PENDRn_EL1 | + ICH_HFGWTR_EL2_ICC_PPI_PRIORITYRn_EL1 | + ICH_HFGWTR_EL2_ICC_PPI_ACTIVERn_EL1, + FEAT_GCIE), +}; + +static const DECLARE_FEAT_MAP_FGT(ich_hfgwtr_desc, ich_hfgwtr_masks, + ich_hfgwtr_feat_map, FEAT_GCIE); + +static const struct reg_bits_to_feat_map ich_hfgitr_feat_map[] = { + NEEDS_FEAT(ICH_HFGITR_EL2_GICCDEN | + ICH_HFGITR_EL2_GICCDDIS | + ICH_HFGITR_EL2_GICCDPRI | + ICH_HFGITR_EL2_GICCDAFF | + ICH_HFGITR_EL2_GICCDPEND | + ICH_HFGITR_EL2_GICCDRCFG | + ICH_HFGITR_EL2_GICCDHM | + ICH_HFGITR_EL2_GICCDEOI | + ICH_HFGITR_EL2_GICCDDI | + ICH_HFGITR_EL2_GICRCDIA | + ICH_HFGITR_EL2_GICRCDNMIA, + FEAT_GCIE), +}; + +static const DECLARE_FEAT_MAP_FGT(ich_hfgitr_desc, ich_hfgitr_masks, + ich_hfgitr_feat_map, FEAT_GCIE); + static void __init check_feat_map(const struct reg_bits_to_feat_map *map, int map_size, u64 resx, const char *str) { @@ -1328,6 +1381,9 @@ void __init check_feature_map(void) check_reg_desc(&sctlr_el2_desc); check_reg_desc(&mdcr_el2_desc); check_reg_desc(&vtcr_el2_desc); + check_reg_desc(&ich_hfgrtr_desc); + check_reg_desc(&ich_hfgwtr_desc); + check_reg_desc(&ich_hfgitr_desc); } static bool idreg_feat_match(struct kvm *kvm, const struct reg_bits_to_feat_map *map) @@ -1460,6 +1516,13 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt) val |= compute_fgu_bits(kvm, &hdfgrtr2_desc); val |= compute_fgu_bits(kvm, &hdfgwtr2_desc); break; + case ICH_HFGRTR_GROUP: + val |= compute_fgu_bits(kvm, &ich_hfgrtr_desc); + val |= compute_fgu_bits(kvm, &ich_hfgwtr_desc); + break; + case ICH_HFGITR_GROUP: + val |= 
compute_fgu_bits(kvm, &ich_hfgitr_desc); + break; default: BUG(); } @@ -1531,6 +1594,15 @@ struct resx get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg) case VTCR_EL2: resx = compute_reg_resx_bits(kvm, &vtcr_el2_desc, 0, 0); break; + case ICH_HFGRTR_EL2: + resx = compute_reg_resx_bits(kvm, &ich_hfgrtr_desc, 0, 0); + break; + case ICH_HFGWTR_EL2: + resx = compute_reg_resx_bits(kvm, &ich_hfgwtr_desc, 0, 0); + break; + case ICH_HFGITR_EL2: + resx = compute_reg_resx_bits(kvm, &ich_hfgitr_desc, 0, 0); + break; default: WARN_ON_ONCE(1); resx = (typeof(resx)){}; @@ -1565,6 +1637,12 @@ static __always_inline struct fgt_masks *__fgt_reg_to_masks(enum vcpu_sysreg reg return &hdfgrtr2_masks; case HDFGWTR2_EL2: return &hdfgwtr2_masks; + case ICH_HFGRTR_EL2: + return &ich_hfgrtr_masks; + case ICH_HFGWTR_EL2: + return &ich_hfgwtr_masks; + case ICH_HFGITR_EL2: + return &ich_hfgitr_masks; default: BUILD_BUG_ON(1); } @@ -1585,8 +1663,8 @@ static __always_inline void __compute_fgt(struct kvm_vcpu *vcpu, enum vcpu_sysre clear |= ~nested & m->nmask; } - val |= set; - val &= ~clear; + val |= set | m->res1; + val &= ~(clear | m->res0); *vcpu_fgt(vcpu, reg) = val; } @@ -1606,6 +1684,32 @@ static void __compute_hdfgwtr(struct kvm_vcpu *vcpu) *vcpu_fgt(vcpu, HDFGWTR_EL2) |= HDFGWTR_EL2_MDSCR_EL1; } +static void __compute_ich_hfgrtr(struct kvm_vcpu *vcpu) +{ + __compute_fgt(vcpu, ICH_HFGRTR_EL2); + + /* + * ICC_IAFFIDR_EL1 *always* needs to be trapped when running a guest. + * + * We also trap accesses to ICC_IDR0_EL1 to allow us to completely hide + * FEAT_GCIE_LEGACY from the guest, and to (potentially) present fewer + * ID bits than the host supports. + */ + *vcpu_fgt(vcpu, ICH_HFGRTR_EL2) &= ~(ICH_HFGRTR_EL2_ICC_IAFFIDR_EL1 | + ICH_HFGRTR_EL2_ICC_IDRn_EL1); +} + +static void __compute_ich_hfgwtr(struct kvm_vcpu *vcpu) +{ + __compute_fgt(vcpu, ICH_HFGWTR_EL2); + + /* + * We present a different subset of PPIs the guest from what + * exist in real hardware. 
We only trap writes, not reads. + */ + *vcpu_fgt(vcpu, ICH_HFGWTR_EL2) &= ~(ICH_HFGWTR_EL2_ICC_PPI_ENABLERn_EL1); +} + void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) { if (!cpus_have_final_cap(ARM64_HAS_FGT)) @@ -1618,12 +1722,17 @@ void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu) __compute_hdfgwtr(vcpu); __compute_fgt(vcpu, HAFGRTR_EL2); - if (!cpus_have_final_cap(ARM64_HAS_FGT2)) - return; + if (cpus_have_final_cap(ARM64_HAS_FGT2)) { + __compute_fgt(vcpu, HFGRTR2_EL2); + __compute_fgt(vcpu, HFGWTR2_EL2); + __compute_fgt(vcpu, HFGITR2_EL2); + __compute_fgt(vcpu, HDFGRTR2_EL2); + __compute_fgt(vcpu, HDFGWTR2_EL2); + } - __compute_fgt(vcpu, HFGRTR2_EL2); - __compute_fgt(vcpu, HFGWTR2_EL2); - __compute_fgt(vcpu, HFGITR2_EL2); - __compute_fgt(vcpu, HDFGRTR2_EL2); - __compute_fgt(vcpu, HDFGWTR2_EL2); + if (cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) { + __compute_ich_hfgrtr(vcpu); + __compute_ich_hfgwtr(vcpu); + __compute_fgt(vcpu, ICH_HFGITR_EL2); + } } diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c index 22d497554c94..dba7ced74ca5 100644 --- a/arch/arm64/kvm/emulate-nested.c +++ b/arch/arm64/kvm/emulate-nested.c @@ -2053,6 +2053,60 @@ static const struct encoding_to_trap_config encoding_to_fgt[] __initconst = { SR_FGT(SYS_AMEVCNTR0_EL0(2), HAFGRTR, AMEVCNTR02_EL0, 1), SR_FGT(SYS_AMEVCNTR0_EL0(1), HAFGRTR, AMEVCNTR01_EL0, 1), SR_FGT(SYS_AMEVCNTR0_EL0(0), HAFGRTR, AMEVCNTR00_EL0, 1), + + /* + * ICH_HFGRTR_EL2 & ICH_HFGWTR_EL2 + */ + SR_FGT(SYS_ICC_APR_EL1, ICH_HFGRTR, ICC_APR_EL1, 0), + SR_FGT(SYS_ICC_IDR0_EL1, ICH_HFGRTR, ICC_IDRn_EL1, 0), + SR_FGT(SYS_ICC_CR0_EL1, ICH_HFGRTR, ICC_CR0_EL1, 0), + SR_FGT(SYS_ICC_HPPIR_EL1, ICH_HFGRTR, ICC_HPPIR_EL1, 0), + SR_FGT(SYS_ICC_PCR_EL1, ICH_HFGRTR, ICC_PCR_EL1, 0), + SR_FGT(SYS_ICC_ICSR_EL1, ICH_HFGRTR, ICC_ICSR_EL1, 0), + SR_FGT(SYS_ICC_IAFFIDR_EL1, ICH_HFGRTR, ICC_IAFFIDR_EL1, 0), + SR_FGT(SYS_ICC_PPI_HMR0_EL1, ICH_HFGRTR, ICC_PPI_HMRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_HMR1_EL1, ICH_HFGRTR, 
ICC_PPI_HMRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_ENABLER0_EL1, ICH_HFGRTR, ICC_PPI_ENABLERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_ENABLER1_EL1, ICH_HFGRTR, ICC_PPI_ENABLERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_CPENDR0_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_CPENDR1_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_SPENDR0_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_SPENDR1_EL1, ICH_HFGRTR, ICC_PPI_PENDRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR0_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR1_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR2_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR3_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR4_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR5_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR6_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR7_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR8_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR9_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR10_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR11_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR12_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR13_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR14_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_PRIORITYR15_EL1, ICH_HFGRTR, ICC_PPI_PRIORITYRn_EL1, 0), + SR_FGT(SYS_ICC_PPI_CACTIVER0_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_CACTIVER1_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_SACTIVER0_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), + SR_FGT(SYS_ICC_PPI_SACTIVER1_EL1, ICH_HFGRTR, ICC_PPI_ACTIVERn_EL1, 0), + + /* + * 
ICH_HFGITR_EL2 + */ + SR_FGT(GICV5_OP_GIC_CDEN, ICH_HFGITR, GICCDEN, 0), + SR_FGT(GICV5_OP_GIC_CDDIS, ICH_HFGITR, GICCDDIS, 0), + SR_FGT(GICV5_OP_GIC_CDPRI, ICH_HFGITR, GICCDPRI, 0), + SR_FGT(GICV5_OP_GIC_CDAFF, ICH_HFGITR, GICCDAFF, 0), + SR_FGT(GICV5_OP_GIC_CDPEND, ICH_HFGITR, GICCDPEND, 0), + SR_FGT(GICV5_OP_GIC_CDRCFG, ICH_HFGITR, GICCDRCFG, 0), + SR_FGT(GICV5_OP_GIC_CDHM, ICH_HFGITR, GICCDHM, 0), + SR_FGT(GICV5_OP_GIC_CDEOI, ICH_HFGITR, GICCDEOI, 0), + SR_FGT(GICV5_OP_GIC_CDDI, ICH_HFGITR, GICCDDI, 0), + SR_FGT(GICV5_OP_GICR_CDIA, ICH_HFGITR, GICRCDIA, 0), + SR_FGT(GICV5_OP_GICR_CDNMIA, ICH_HFGITR, GICRCDNMIA, 0), }; /* @@ -2127,6 +2181,9 @@ FGT_MASKS(hfgwtr2_masks, HFGWTR2_EL2); FGT_MASKS(hfgitr2_masks, HFGITR2_EL2); FGT_MASKS(hdfgrtr2_masks, HDFGRTR2_EL2); FGT_MASKS(hdfgwtr2_masks, HDFGWTR2_EL2); +FGT_MASKS(ich_hfgrtr_masks, ICH_HFGRTR_EL2); +FGT_MASKS(ich_hfgwtr_masks, ICH_HFGWTR_EL2); +FGT_MASKS(ich_hfgitr_masks, ICH_HFGITR_EL2); static __init bool aggregate_fgt(union trap_config tc) { @@ -2162,6 +2219,14 @@ static __init bool aggregate_fgt(union trap_config tc) rmasks = &hfgitr2_masks; wmasks = NULL; break; + case ICH_HFGRTR_GROUP: + rmasks = &ich_hfgrtr_masks; + wmasks = &ich_hfgwtr_masks; + break; + case ICH_HFGITR_GROUP: + rmasks = &ich_hfgitr_masks; + wmasks = NULL; + break; } rresx = rmasks->res0 | rmasks->res1; @@ -2232,6 +2297,9 @@ static __init int check_all_fgt_masks(int ret) &hfgitr2_masks, &hdfgrtr2_masks, &hdfgwtr2_masks, + &ich_hfgrtr_masks, + &ich_hfgwtr_masks, + &ich_hfgitr_masks, }; int err = 0; diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index cc7d5d1709cb..54aedf93c78b 100644 --- a/arch/arm64/kvm/handle_exit.c +++ b/arch/arm64/kvm/handle_exit.c @@ -539,7 +539,7 @@ void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr, /* All hyp bugs, including warnings, are treated as fatal. 
*/ if (!is_protected_kvm_enabled() || - IS_ENABLED(CONFIG_NVHE_EL2_DEBUG)) { + IS_ENABLED(CONFIG_PKVM_DISABLE_STAGE2_ON_PANIC)) { struct bug_entry *bug = find_bug(elr_in_kimg); if (bug) diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h index 0b50ddd530f3..98b2976837b1 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -233,6 +233,18 @@ static inline void __activate_traps_hfgxtr(struct kvm_vcpu *vcpu) __activate_fgt(hctxt, vcpu, HDFGWTR2_EL2); } +static inline void __activate_traps_ich_hfgxtr(struct kvm_vcpu *vcpu) +{ + struct kvm_cpu_context *hctxt = host_data_ptr(host_ctxt); + + if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) + return; + + __activate_fgt(hctxt, vcpu, ICH_HFGRTR_EL2); + __activate_fgt(hctxt, vcpu, ICH_HFGWTR_EL2); + __activate_fgt(hctxt, vcpu, ICH_HFGITR_EL2); +} + #define __deactivate_fgt(htcxt, vcpu, reg) \ do { \ write_sysreg_s(ctxt_sys_reg(hctxt, reg), \ @@ -265,6 +277,19 @@ static inline void __deactivate_traps_hfgxtr(struct kvm_vcpu *vcpu) __deactivate_fgt(hctxt, vcpu, HDFGWTR2_EL2); } +static inline void __deactivate_traps_ich_hfgxtr(struct kvm_vcpu *vcpu) +{ + struct kvm_cpu_context *hctxt = host_data_ptr(host_ctxt); + + if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF)) + return; + + __deactivate_fgt(hctxt, vcpu, ICH_HFGRTR_EL2); + __deactivate_fgt(hctxt, vcpu, ICH_HFGWTR_EL2); + __deactivate_fgt(hctxt, vcpu, ICH_HFGITR_EL2); + +} + static inline void __activate_traps_mpam(struct kvm_vcpu *vcpu) { u64 clr = MPAM2_EL2_EnMPAMSM; @@ -332,6 +357,7 @@ static inline void __activate_traps_common(struct kvm_vcpu *vcpu) } __activate_traps_hfgxtr(vcpu); + __activate_traps_ich_hfgxtr(vcpu); __activate_traps_mpam(vcpu); } @@ -349,6 +375,7 @@ static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu) write_sysreg_s(ctxt_sys_reg(hctxt, HCRX_EL2), SYS_HCRX_EL2); __deactivate_traps_hfgxtr(vcpu); + __deactivate_traps_ich_hfgxtr(vcpu); 
__deactivate_traps_mpam(); } diff --git a/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h b/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h new file mode 100644 index 000000000000..1258bc84477f --- /dev/null +++ b/arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h @@ -0,0 +1,23 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__ +#define __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__ + +#include <asm/kvm_hypevents.h> + +#include <linux/arm-smccc.h> + +#define hyp_smccc_1_1_smc(...) \ + do { \ + trace_hyp_exit(NULL, HYP_REASON_SMC); \ + arm_smccc_1_1_smc(__VA_ARGS__); \ + trace_hyp_enter(NULL, HYP_REASON_SMC); \ + } while (0) + +#define hyp_smccc_1_2_smc(...) \ + do { \ + trace_hyp_exit(NULL, HYP_REASON_SMC); \ + arm_smccc_1_2_smc(__VA_ARGS__); \ + trace_hyp_enter(NULL, HYP_REASON_SMC); \ + } while (0) + +#endif /* __ARM64_KVM_HYP_NVHE_ARM_SMCCC_H__ */ diff --git a/arch/arm64/kvm/hyp/include/nvhe/clock.h b/arch/arm64/kvm/hyp/include/nvhe/clock.h new file mode 100644 index 000000000000..9f429f5c0664 --- /dev/null +++ b/arch/arm64/kvm/hyp/include/nvhe/clock.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ARM64_KVM_HYP_NVHE_CLOCK_H +#define __ARM64_KVM_HYP_NVHE_CLOCK_H +#include <linux/types.h> + +#include <asm/kvm_hyp.h> + +#ifdef CONFIG_NVHE_EL2_TRACING +void trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc); +u64 trace_clock(void); +#else +static inline void +trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) { } +static inline u64 trace_clock(void) { return 0; } +#endif +#endif diff --git a/arch/arm64/kvm/hyp/include/nvhe/define_events.h b/arch/arm64/kvm/hyp/include/nvhe/define_events.h new file mode 100644 index 000000000000..776d4c6cb702 --- /dev/null +++ b/arch/arm64/kvm/hyp/include/nvhe/define_events.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#undef HYP_EVENT +#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ + struct hyp_event_id hyp_event_id_##__name \ 
+ __section(".hyp.event_ids."#__name) = { \ + .enabled = ATOMIC_INIT(0), \ + } + +#define HYP_EVENT_MULTI_READ +#include <asm/kvm_hypevents.h> +#undef HYP_EVENT_MULTI_READ + +#undef HYP_EVENT diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h index 5f9d56754e39..3cbfae0e3dda 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h @@ -27,18 +27,22 @@ extern struct host_mmu host_mmu; enum pkvm_component_id { PKVM_ID_HOST, PKVM_ID_HYP, - PKVM_ID_FFA, + PKVM_ID_GUEST, }; -extern unsigned long hyp_nr_cpus; - int __pkvm_prot_finalize(void); int __pkvm_host_share_hyp(u64 pfn); +int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn); +int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn); int __pkvm_host_unshare_hyp(u64 pfn); int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages); int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages); int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages); int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages); +int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu); +int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu); +int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys); +int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm); int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu, enum kvm_pgtable_prot prot); int __pkvm_host_unshare_guest(u64 gfn, u64 nr_pages, struct pkvm_hyp_vm *hyp_vm); @@ -70,6 +74,8 @@ static __always_inline void __load_host_stage2(void) #ifdef CONFIG_NVHE_EL2_DEBUG void pkvm_ownership_selftest(void *base); +struct pkvm_hyp_vcpu *init_selftest_vm(void *virt); +void teardown_selftest_vm(void); #else static inline void pkvm_ownership_selftest(void *base) { } #endif diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h index dee1a406b0c2..b50712d47f6d 100644 --- 
a/arch/arm64/kvm/hyp/include/nvhe/memory.h +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h @@ -30,8 +30,14 @@ enum pkvm_page_state { * struct hyp_page. */ PKVM_NOPAGE = BIT(0) | BIT(1), + + /* + * 'Meta-states' which aren't encoded directly in the PTE's SW bits (or + * the hyp_vmemmap entry for the host) + */ + PKVM_POISON = BIT(2), }; -#define PKVM_PAGE_STATE_MASK (BIT(0) | BIT(1)) +#define PKVM_PAGE_STATE_VMEMMAP_MASK (BIT(0) | BIT(1)) #define PKVM_PAGE_STATE_PROT_MASK (KVM_PGTABLE_PROT_SW0 | KVM_PGTABLE_PROT_SW1) static inline enum kvm_pgtable_prot pkvm_mkstate(enum kvm_pgtable_prot prot, @@ -108,12 +114,12 @@ static inline void set_host_state(struct hyp_page *p, enum pkvm_page_state state static inline enum pkvm_page_state get_hyp_state(struct hyp_page *p) { - return p->__hyp_state_comp ^ PKVM_PAGE_STATE_MASK; + return p->__hyp_state_comp ^ PKVM_PAGE_STATE_VMEMMAP_MASK; } static inline void set_hyp_state(struct hyp_page *p, enum pkvm_page_state state) { - p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_MASK; + p->__hyp_state_comp = state ^ PKVM_PAGE_STATE_VMEMMAP_MASK; } /* diff --git a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h index 184ad7a39950..c904647d2f76 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/pkvm.h +++ b/arch/arm64/kvm/hyp/include/nvhe/pkvm.h @@ -73,8 +73,12 @@ int __pkvm_init_vm(struct kvm *host_kvm, unsigned long vm_hva, unsigned long pgd_hva); int __pkvm_init_vcpu(pkvm_handle_t handle, struct kvm_vcpu *host_vcpu, unsigned long vcpu_hva); -int __pkvm_teardown_vm(pkvm_handle_t handle); +int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn); +int __pkvm_start_teardown_vm(pkvm_handle_t handle); +int __pkvm_finalize_teardown_vm(pkvm_handle_t handle); + +struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle); struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle, unsigned int vcpu_idx); void pkvm_put_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu); @@ -84,6 +88,7 @@ struct pkvm_hyp_vm 
*get_pkvm_hyp_vm(pkvm_handle_t handle); struct pkvm_hyp_vm *get_np_pkvm_hyp_vm(pkvm_handle_t handle); void put_pkvm_hyp_vm(struct pkvm_hyp_vm *hyp_vm); +bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code); bool kvm_handle_pvm_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code); bool kvm_handle_pvm_restricted(struct kvm_vcpu *vcpu, u64 *exit_code); void kvm_init_pvm_id_regs(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/kvm/hyp/include/nvhe/trace.h b/arch/arm64/kvm/hyp/include/nvhe/trace.h new file mode 100644 index 000000000000..8813ff250f8e --- /dev/null +++ b/arch/arm64/kvm/hyp/include/nvhe/trace.h @@ -0,0 +1,70 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef __ARM64_KVM_HYP_NVHE_TRACE_H +#define __ARM64_KVM_HYP_NVHE_TRACE_H + +#include <linux/trace_remote_event.h> + +#include <asm/kvm_hyptrace.h> + +static inline pid_t __tracing_get_vcpu_pid(struct kvm_cpu_context *host_ctxt) +{ + struct kvm_vcpu *vcpu; + + if (!host_ctxt) + host_ctxt = host_data_ptr(host_ctxt); + + vcpu = host_ctxt->__hyp_running_vcpu; + + return vcpu ? vcpu->arch.pid : 0; +} + +#define HE_PROTO(__args...) __args +#define HE_ASSIGN(__args...) 
__args +#define HE_STRUCT RE_STRUCT +#define he_field re_field + +#ifdef CONFIG_NVHE_EL2_TRACING + +#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ + REMOTE_EVENT_FORMAT(__name, __struct); \ + extern struct hyp_event_id hyp_event_id_##__name; \ + static __always_inline void trace_##__name(__proto) \ + { \ + struct remote_event_format_##__name *__entry; \ + size_t length = sizeof(*__entry); \ + \ + if (!atomic_read(&hyp_event_id_##__name.enabled)) \ + return; \ + __entry = tracing_reserve_entry(length); \ + if (!__entry) \ + return; \ + __entry->hdr.id = hyp_event_id_##__name.id; \ + __assign \ + tracing_commit_entry(); \ + } + +void *tracing_reserve_entry(unsigned long length); +void tracing_commit_entry(void); + +int __tracing_load(unsigned long desc_va, size_t desc_size); +void __tracing_unload(void); +int __tracing_enable(bool enable); +int __tracing_swap_reader(unsigned int cpu); +void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc); +int __tracing_reset(unsigned int cpu); +int __tracing_enable_event(unsigned short id, bool enable); +#else +static inline void *tracing_reserve_entry(unsigned long length) { return NULL; } +static inline void tracing_commit_entry(void) { } +#define HYP_EVENT(__name, __proto, __struct, __assign, __printk) \ + static inline void trace_##__name(__proto) {} + +static inline int __tracing_load(unsigned long desc_va, size_t desc_size) { return -ENODEV; } +static inline void __tracing_unload(void) { } +static inline int __tracing_enable(bool enable) { return -ENODEV; } +static inline int __tracing_swap_reader(unsigned int cpu) { return -ENODEV; } +static inline void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) { } +static inline int __tracing_reset(unsigned int cpu) { return -ENODEV; } +static inline int __tracing_enable_event(unsigned short id, bool enable) { return -ENODEV; } +#endif +#endif diff --git a/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h 
b/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h index ba5382c12787..32d7b7746e8e 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h +++ b/arch/arm64/kvm/hyp/include/nvhe/trap_handler.h @@ -16,4 +16,6 @@ __always_unused int ___check_reg_ ## reg; \ type name = (type)cpu_reg(ctxt, (reg)) +void inject_host_exception(u64 esr); + #endif /* __ARM64_KVM_NVHE_TRAP_HANDLER_H__ */ diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile index a244ec25f8c5..62cdfbff7562 100644 --- a/arch/arm64/kvm/hyp/nvhe/Makefile +++ b/arch/arm64/kvm/hyp/nvhe/Makefile @@ -17,7 +17,7 @@ ccflags-y += -fno-stack-protector \ hostprogs := gen-hyprel HOST_EXTRACFLAGS += -I$(objtree)/include -lib-objs := clear_page.o copy_page.o memcpy.o memset.o +lib-objs := clear_page.o copy_page.o memcpy.o memset.o tishift.o lib-objs := $(addprefix ../../../lib/, $(lib-objs)) CFLAGS_switch.nvhe.o += -Wno-override-init @@ -26,11 +26,15 @@ hyp-obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o hyp-main.o hyp-smp.o psci-relay.o early_alloc.o page_alloc.o \ cache.o setup.o mm.o mem_protect.o sys_regs.o pkvm.o stacktrace.o ffa.o hyp-obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \ - ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o + ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o ../vgic-v5-sr.o hyp-obj-y += ../../../kernel/smccc-call.o hyp-obj-$(CONFIG_LIST_HARDENED) += list_debug.o +hyp-obj-$(CONFIG_NVHE_EL2_TRACING) += clock.o trace.o events.o hyp-obj-y += $(lib-objs) +# Path to simple_ring_buffer.c +CFLAGS_trace.nvhe.o += -I$(srctree)/kernel/trace/ + ## ## Build rules for compiling nVHE hyp code ## Output of this folder is `kvm_nvhe.o`, a partially linked object diff --git a/arch/arm64/kvm/hyp/nvhe/clock.c b/arch/arm64/kvm/hyp/nvhe/clock.c new file mode 100644 index 000000000000..32fc4313fe43 --- /dev/null +++ b/arch/arm64/kvm/hyp/nvhe/clock.c @@ -0,0 +1,65 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright 
(C) 2025 Google LLC + * Author: Vincent Donnefort <vdonnefort@google.com> + */ + +#include <nvhe/clock.h> + +#include <asm/arch_timer.h> +#include <asm/div64.h> + +static struct clock_data { + struct { + u32 mult; + u32 shift; + u64 epoch_ns; + u64 epoch_cyc; + u64 cyc_overflow64; + } data[2]; + u64 cur; +} trace_clock_data; + +static u64 __clock_mult_uint128(u64 cyc, u32 mult, u32 shift) +{ + __uint128_t ns = (__uint128_t)cyc * mult; + + ns >>= shift; + + return (u64)ns; +} + +/* Does not guarantee no reader on the modified bank. */ +void trace_clock_update(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc) +{ + struct clock_data *clock = &trace_clock_data; + u64 bank = clock->cur ^ 1; + + clock->data[bank].mult = mult; + clock->data[bank].shift = shift; + clock->data[bank].epoch_ns = epoch_ns; + clock->data[bank].epoch_cyc = epoch_cyc; + clock->data[bank].cyc_overflow64 = ULONG_MAX / mult; + + smp_store_release(&clock->cur, bank); +} + +/* Use untrusted host data */ +u64 trace_clock(void) +{ + struct clock_data *clock = &trace_clock_data; + u64 bank = smp_load_acquire(&clock->cur); + u64 cyc, ns; + + cyc = __arch_counter_get_cntvct() - clock->data[bank].epoch_cyc; + + if (likely(cyc < clock->data[bank].cyc_overflow64)) { + ns = cyc * clock->data[bank].mult; + ns >>= clock->data[bank].shift; + } else { + ns = __clock_mult_uint128(cyc, clock->data[bank].mult, + clock->data[bank].shift); + } + + return (u64)ns + clock->data[bank].epoch_ns; +} diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c index 2a1c0f49792b..f8904391c125 100644 --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c @@ -14,20 +14,20 @@ #include <asm/kvm_hyp.h> #include <asm/kvm_mmu.h> -static void __debug_save_spe(u64 *pmscr_el1) +static void __debug_save_spe(void) { - u64 reg; + u64 *pmscr_el1, *pmblimitr_el1; - /* Clear pmscr in case of early return */ - *pmscr_el1 = 0; + pmscr_el1 = host_data_ptr(host_debug_state.pmscr_el1); + 
pmblimitr_el1 = host_data_ptr(host_debug_state.pmblimitr_el1); /* * At this point, we know that this CPU implements * SPE and is available to the host. * Check if the host is actually using it ? */ - reg = read_sysreg_s(SYS_PMBLIMITR_EL1); - if (!(reg & BIT(PMBLIMITR_EL1_E_SHIFT))) + *pmblimitr_el1 = read_sysreg_s(SYS_PMBLIMITR_EL1); + if (!(*pmblimitr_el1 & BIT(PMBLIMITR_EL1_E_SHIFT))) return; /* Yes; save the control register and disable data generation */ @@ -37,18 +37,29 @@ static void __debug_save_spe(u64 *pmscr_el1) /* Now drain all buffered data to memory */ psb_csync(); + dsb(nsh); + + /* And disable the profiling buffer */ + write_sysreg_s(0, SYS_PMBLIMITR_EL1); + isb(); } -static void __debug_restore_spe(u64 pmscr_el1) +static void __debug_restore_spe(void) { - if (!pmscr_el1) + u64 pmblimitr_el1 = *host_data_ptr(host_debug_state.pmblimitr_el1); + + if (!(pmblimitr_el1 & BIT(PMBLIMITR_EL1_E_SHIFT))) return; /* The host page table is installed, but not yet synchronised */ isb(); + /* Re-enable the profiling buffer. */ + write_sysreg_s(pmblimitr_el1, SYS_PMBLIMITR_EL1); + isb(); + /* Re-enable data generation */ - write_sysreg_el1(pmscr_el1, SYS_PMSCR); + write_sysreg_el1(*host_data_ptr(host_debug_state.pmscr_el1), SYS_PMSCR); } static void __trace_do_switch(u64 *saved_trfcr, u64 new_trfcr) @@ -57,12 +68,54 @@ static void __trace_do_switch(u64 *saved_trfcr, u64 new_trfcr) write_sysreg_el1(new_trfcr, SYS_TRFCR); } -static bool __trace_needs_drain(void) +static void __trace_drain_and_disable(void) { - if (is_protected_kvm_enabled() && host_data_test_flag(HAS_TRBE)) - return read_sysreg_s(SYS_TRBLIMITR_EL1) & TRBLIMITR_EL1_E; + u64 *trblimitr_el1 = host_data_ptr(host_debug_state.trblimitr_el1); + bool needs_drain = is_protected_kvm_enabled() ? 
+ host_data_test_flag(HAS_TRBE) : + host_data_test_flag(TRBE_ENABLED); + + if (!needs_drain) { + *trblimitr_el1 = 0; + return; + } + + *trblimitr_el1 = read_sysreg_s(SYS_TRBLIMITR_EL1); + if (*trblimitr_el1 & TRBLIMITR_EL1_E) { + /* + * The host has enabled the Trace Buffer Unit so we have + * to beat the CPU with a stick until it stops accessing + * memory. + */ - return host_data_test_flag(TRBE_ENABLED); + /* First, ensure that our prior write to TRFCR has stuck. */ + isb(); + + /* Now synchronise with the trace and drain the buffer. */ + tsb_csync(); + dsb(nsh); + + /* + * With no more trace being generated, we can disable the + * Trace Buffer Unit. + */ + write_sysreg_s(0, SYS_TRBLIMITR_EL1); + if (cpus_have_final_cap(ARM64_WORKAROUND_2064142)) { + /* + * Some CPUs are so good, we have to drain 'em + * twice. + */ + tsb_csync(); + dsb(nsh); + } + + /* + * Ensure that the Trace Buffer Unit is disabled before + * we start mucking with the stage-2 and trap + * configuration. + */ + isb(); + } } static bool __trace_needs_switch(void) @@ -79,21 +132,34 @@ static void __trace_switch_to_guest(void) __trace_do_switch(host_data_ptr(host_debug_state.trfcr_el1), *host_data_ptr(trfcr_while_in_guest)); - - if (__trace_needs_drain()) { - isb(); - tsb_csync(); - } + __trace_drain_and_disable(); } static void __trace_switch_to_host(void) { + u64 trblimitr_el1 = *host_data_ptr(host_debug_state.trblimitr_el1); + + if (trblimitr_el1 & TRBLIMITR_EL1_E) { + /* Re-enable the Trace Buffer Unit for the host. */ + write_sysreg_s(trblimitr_el1, SYS_TRBLIMITR_EL1); + isb(); + if (cpus_have_final_cap(ARM64_WORKAROUND_2038923)) { + /* + * Make sure the unit is re-enabled before we + * poke TRFCR. 
+ */ + isb(); + } + } + __trace_do_switch(host_data_ptr(trfcr_while_in_guest), *host_data_ptr(host_debug_state.trfcr_el1)); } -static void __debug_save_brbe(u64 *brbcr_el1) +static void __debug_save_brbe(void) { + u64 *brbcr_el1 = host_data_ptr(host_debug_state.brbcr_el1); + *brbcr_el1 = 0; /* Check if the BRBE is enabled */ @@ -109,8 +175,10 @@ static void __debug_save_brbe(u64 *brbcr_el1) write_sysreg_el1(0, SYS_BRBCR); } -static void __debug_restore_brbe(u64 brbcr_el1) +static void __debug_restore_brbe(void) { + u64 brbcr_el1 = *host_data_ptr(host_debug_state.brbcr_el1); + if (!brbcr_el1) return; @@ -122,11 +190,11 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu) { /* Disable and flush SPE data generation */ if (host_data_test_flag(HAS_SPE)) - __debug_save_spe(host_data_ptr(host_debug_state.pmscr_el1)); + __debug_save_spe(); /* Disable BRBE branch records */ if (host_data_test_flag(HAS_BRBE)) - __debug_save_brbe(host_data_ptr(host_debug_state.brbcr_el1)); + __debug_save_brbe(); if (__trace_needs_switch()) __trace_switch_to_guest(); @@ -140,9 +208,9 @@ void __debug_switch_to_guest(struct kvm_vcpu *vcpu) void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu) { if (host_data_test_flag(HAS_SPE)) - __debug_restore_spe(*host_data_ptr(host_debug_state.pmscr_el1)); + __debug_restore_spe(); if (host_data_test_flag(HAS_BRBE)) - __debug_restore_brbe(*host_data_ptr(host_debug_state.brbcr_el1)); + __debug_restore_brbe(); if (__trace_needs_switch()) __trace_switch_to_host(); } diff --git a/arch/arm64/kvm/hyp/nvhe/events.c b/arch/arm64/kvm/hyp/nvhe/events.c new file mode 100644 index 000000000000..add9383aadb5 --- /dev/null +++ b/arch/arm64/kvm/hyp/nvhe/events.c @@ -0,0 +1,25 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Google LLC + * Author: Vincent Donnefort <vdonnefort@google.com> + */ + +#include <nvhe/mm.h> +#include <nvhe/trace.h> + +#include <nvhe/define_events.h> + +int __tracing_enable_event(unsigned short id, bool 
enable) +{ + struct hyp_event_id *event_id = &__hyp_event_ids_start[id]; + atomic_t *enabled; + + if (event_id >= __hyp_event_ids_end) + return -EINVAL; + + enabled = hyp_fixmap_map(__hyp_pa(&event_id->enabled)); + atomic_set(enabled, enable); + hyp_fixmap_unmap(); + + return 0; +} diff --git a/arch/arm64/kvm/hyp/nvhe/ffa.c b/arch/arm64/kvm/hyp/nvhe/ffa.c index 94161ea1cd60..1af722771178 100644 --- a/arch/arm64/kvm/hyp/nvhe/ffa.c +++ b/arch/arm64/kvm/hyp/nvhe/ffa.c @@ -26,10 +26,10 @@ * the duration and are therefore serialised. */ -#include <linux/arm-smccc.h> #include <linux/arm_ffa.h> #include <asm/kvm_pkvm.h> +#include <nvhe/arm-smccc.h> #include <nvhe/ffa.h> #include <nvhe/mem_protect.h> #include <nvhe/memory.h> @@ -147,7 +147,7 @@ static int ffa_map_hyp_buffers(u64 ffa_page_count) { struct arm_smccc_1_2_regs res; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_FN64_RXTX_MAP, .a1 = hyp_virt_to_phys(hyp_buffers.tx), .a2 = hyp_virt_to_phys(hyp_buffers.rx), @@ -161,7 +161,7 @@ static int ffa_unmap_hyp_buffers(void) { struct arm_smccc_1_2_regs res; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_RXTX_UNMAP, .a1 = HOST_FFA_ID, }, &res); @@ -172,7 +172,7 @@ static int ffa_unmap_hyp_buffers(void) static void ffa_mem_frag_tx(struct arm_smccc_1_2_regs *res, u32 handle_lo, u32 handle_hi, u32 fraglen, u32 endpoint_id) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_MEM_FRAG_TX, .a1 = handle_lo, .a2 = handle_hi, @@ -184,7 +184,7 @@ static void ffa_mem_frag_tx(struct arm_smccc_1_2_regs *res, u32 handle_lo, static void ffa_mem_frag_rx(struct arm_smccc_1_2_regs *res, u32 handle_lo, u32 handle_hi, u32 fragoff) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_MEM_FRAG_RX, .a1 = handle_lo, .a2 = handle_hi, @@ -196,7 +196,7 
@@ static void ffa_mem_frag_rx(struct arm_smccc_1_2_regs *res, u32 handle_lo, static void ffa_mem_xfer(struct arm_smccc_1_2_regs *res, u64 func_id, u32 len, u32 fraglen) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = func_id, .a1 = len, .a2 = fraglen, @@ -206,7 +206,7 @@ static void ffa_mem_xfer(struct arm_smccc_1_2_regs *res, u64 func_id, u32 len, static void ffa_mem_reclaim(struct arm_smccc_1_2_regs *res, u32 handle_lo, u32 handle_hi, u32 flags) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_MEM_RECLAIM, .a1 = handle_lo, .a2 = handle_hi, @@ -216,7 +216,7 @@ static void ffa_mem_reclaim(struct arm_smccc_1_2_regs *res, u32 handle_lo, static void ffa_retrieve_req(struct arm_smccc_1_2_regs *res, u32 len) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_FN64_MEM_RETRIEVE_REQ, .a1 = len, .a2 = len, @@ -225,7 +225,7 @@ static void ffa_retrieve_req(struct arm_smccc_1_2_regs *res, u32 len) static void ffa_rx_release(struct arm_smccc_1_2_regs *res) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_RX_RELEASE, }, res); } @@ -728,7 +728,7 @@ static int hyp_ffa_post_init(void) size_t min_rxtx_sz; struct arm_smccc_1_2_regs res; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ .a0 = FFA_ID_GET, }, &res); if (res.a0 != FFA_SUCCESS) @@ -737,7 +737,7 @@ static int hyp_ffa_post_init(void) if (res.a2 != HOST_FFA_ID) return -EINVAL; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs){ .a0 = FFA_FEATURES, .a1 = FFA_FN64_RXTX_MAP, }, &res); @@ -788,7 +788,7 @@ static void do_ffa_version(struct arm_smccc_1_2_regs *res, * first if TEE supports it. 
*/ if (FFA_MINOR_VERSION(ffa_req_version) < FFA_MINOR_VERSION(hyp_ffa_version)) { - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_VERSION, .a1 = ffa_req_version, }, res); @@ -824,7 +824,7 @@ static void do_ffa_part_get(struct arm_smccc_1_2_regs *res, goto out_unlock; } - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_PARTITION_INFO_GET, .a1 = uuid0, .a2 = uuid1, @@ -939,7 +939,7 @@ int hyp_ffa_init(void *pages) if (kvm_host_psci_config.smccc_version < ARM_SMCCC_VERSION_1_2) return 0; - arm_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { + hyp_smccc_1_2_smc(&(struct arm_smccc_1_2_regs) { .a0 = FFA_VERSION, .a1 = FFA_VERSION_1_2, }, &res); diff --git a/arch/arm64/kvm/hyp/nvhe/host.S b/arch/arm64/kvm/hyp/nvhe/host.S index eef15b374abb..f337770ec459 100644 --- a/arch/arm64/kvm/hyp/nvhe/host.S +++ b/arch/arm64/kvm/hyp/nvhe/host.S @@ -120,12 +120,11 @@ SYM_FUNC_START(__hyp_do_panic) mov x29, x0 -#ifdef CONFIG_NVHE_EL2_DEBUG +#ifdef PKVM_DISABLE_STAGE2_ON_PANIC /* Ensure host stage-2 is disabled */ mrs x0, hcr_el2 bic x0, x0, #HCR_VM msr_hcr_el2 x0 - isb tlbi vmalls12e1 dsb nsh #endif @@ -291,13 +290,3 @@ SYM_CODE_START(__kvm_hyp_host_forward_smc) ret SYM_CODE_END(__kvm_hyp_host_forward_smc) - -/* - * kvm_host_psci_cpu_entry is called through br instruction, which requires - * bti j instruction as compilers (gcc and llvm) doesn't insert bti j for external - * functions, but bti c instead. 
- */ -SYM_CODE_START(kvm_host_psci_cpu_entry) - bti j - b __kvm_host_psci_cpu_entry -SYM_CODE_END(kvm_host_psci_cpu_entry) diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S index 445eb0743af2..89cb553be1e5 100644 --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S @@ -173,9 +173,8 @@ SYM_CODE_END(___kvm_hyp_init) * x0: struct kvm_nvhe_init_params PA */ SYM_CODE_START(kvm_hyp_cpu_entry) - mov x1, #1 // is_cpu_on = true + ldr x29, =__kvm_host_psci_cpu_on_entry b __kvm_hyp_init_cpu -SYM_CODE_END(kvm_hyp_cpu_entry) /* * PSCI CPU_SUSPEND / SYSTEM_SUSPEND entry point @@ -183,32 +182,17 @@ SYM_CODE_END(kvm_hyp_cpu_entry) * x0: struct kvm_nvhe_init_params PA */ SYM_CODE_START(kvm_hyp_cpu_resume) - mov x1, #0 // is_cpu_on = false - b __kvm_hyp_init_cpu -SYM_CODE_END(kvm_hyp_cpu_resume) + ldr x29, =__kvm_host_psci_cpu_resume_entry -/* - * Common code for CPU entry points. Initializes EL2 state and - * installs the hypervisor before handing over to a C handler. - * - * x0: struct kvm_nvhe_init_params PA - * x1: bool is_cpu_on - */ -SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu) +SYM_INNER_LABEL(__kvm_hyp_init_cpu, SYM_L_LOCAL) mov x28, x0 // Stash arguments - mov x29, x1 /* Check that the core was booted in EL2. */ mrs x0, CurrentEL cmp x0, #CurrentEL_EL2 - b.eq 2f - - /* The core booted in EL1. KVM cannot be initialized on it. */ -1: wfe - wfi - b 1b + b.ne 1f -2: msr SPsel, #1 // We want to use SP_EL{1,2} + msr SPsel, #1 // We want to use SP_EL2 init_el2_hcr 0 @@ -218,11 +202,16 @@ SYM_CODE_START_LOCAL(__kvm_hyp_init_cpu) mov x0, x28 bl ___kvm_hyp_init // Clobbers x0..x2 - /* Leave idmap. */ - mov x0, x29 - ldr x1, =kvm_host_psci_cpu_entry - br x1 -SYM_CODE_END(__kvm_hyp_init_cpu) + /* Leave idmap -- using BLR is OK, LR is restored from host context */ + blr x29 + + // The core booted in EL1, or the C code unexpectedly returned. + // Either way, KVM cannot be initialized on it. 
+1:	wfe
+	wfi
+	b 1b
+SYM_CODE_END(kvm_hyp_cpu_resume)
+SYM_CODE_END(kvm_hyp_cpu_entry)
 
 SYM_CODE_START(__kvm_handle_stub_hvc)
 	/*
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index e7790097db93..73f2e0221e70 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -12,12 +12,14 @@
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_host.h>
 #include <asm/kvm_hyp.h>
+#include <asm/kvm_hypevents.h>
 #include <asm/kvm_mmu.h>
 
 #include <nvhe/ffa.h>
 #include <nvhe/mem_protect.h>
 #include <nvhe/mm.h>
 #include <nvhe/pkvm.h>
+#include <nvhe/trace.h>
 #include <nvhe/trap_handler.h>
 
 DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
@@ -136,6 +138,8 @@ static void flush_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu)
 	hyp_vcpu->vcpu.arch.vsesr_el2 = host_vcpu->arch.vsesr_el2;
 
 	hyp_vcpu->vcpu.arch.vgic_cpu.vgic_v3 = host_vcpu->arch.vgic_cpu.vgic_v3;
+
+	hyp_vcpu->vcpu.arch.pid = host_vcpu->arch.pid;
 }
 
 static void sync_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu)
@@ -169,9 +173,6 @@ static void handle___pkvm_vcpu_load(struct kvm_cpu_context *host_ctxt)
 	DECLARE_REG(u64, hcr_el2, host_ctxt, 3);
 	struct pkvm_hyp_vcpu *hyp_vcpu;
 
-	if (!is_protected_kvm_enabled())
-		return;
-
 	hyp_vcpu = pkvm_load_hyp_vcpu(handle, vcpu_idx);
 	if (!hyp_vcpu)
 		return;
@@ -188,12 +189,8 @@ static void handle___pkvm_vcpu_load(struct kvm_cpu_context *host_ctxt)
 
 static void handle___pkvm_vcpu_put(struct kvm_cpu_context *host_ctxt)
 {
-	struct pkvm_hyp_vcpu *hyp_vcpu;
+	struct pkvm_hyp_vcpu *hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 
-	if (!is_protected_kvm_enabled())
-		return;
-
-	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 	if (hyp_vcpu)
 		pkvm_put_hyp_vcpu(hyp_vcpu);
 }
@@ -248,6 +245,26 @@ static int pkvm_refill_memcache(struct pkvm_hyp_vcpu *hyp_vcpu)
 				 &host_vcpu->arch.pkvm_memcache);
 }
 
+static void handle___pkvm_host_donate_guest(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(u64, pfn, host_ctxt, 1);
+	DECLARE_REG(u64, gfn, host_ctxt, 2);
+	struct pkvm_hyp_vcpu *hyp_vcpu;
+	int ret = -EINVAL;
+
+	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
+	if (!hyp_vcpu || !pkvm_hyp_vcpu_is_protected(hyp_vcpu))
+		goto out;
+
+	ret = pkvm_refill_memcache(hyp_vcpu);
+	if (ret)
+		goto out;
+
+	ret = __pkvm_host_donate_guest(pfn, gfn, hyp_vcpu);
+out:
+	cpu_reg(host_ctxt, 1) = ret;
+}
+
 static void handle___pkvm_host_share_guest(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(u64, pfn, host_ctxt, 1);
@@ -257,9 +274,6 @@ static void handle___pkvm_host_share_guest(struct kvm_cpu_context *host_ctxt)
 	struct pkvm_hyp_vcpu *hyp_vcpu;
 	int ret = -EINVAL;
 
-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 	if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu))
 		goto out;
@@ -281,9 +295,6 @@ static void handle___pkvm_host_unshare_guest(struct kvm_cpu_context *host_ctxt)
 	struct pkvm_hyp_vm *hyp_vm;
 	int ret = -EINVAL;
 
-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vm = get_np_pkvm_hyp_vm(handle);
 	if (!hyp_vm)
 		goto out;
@@ -301,9 +312,6 @@ static void handle___pkvm_host_relax_perms_guest(struct kvm_cpu_context *host_ct
 	struct pkvm_hyp_vcpu *hyp_vcpu;
 	int ret = -EINVAL;
 
-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 	if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu))
 		goto out;
@@ -321,9 +329,6 @@ static void handle___pkvm_host_wrprotect_guest(struct kvm_cpu_context *host_ctxt
 	struct pkvm_hyp_vm *hyp_vm;
 	int ret = -EINVAL;
 
-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vm = get_np_pkvm_hyp_vm(handle);
 	if (!hyp_vm)
 		goto out;
@@ -343,9 +348,6 @@ static void handle___pkvm_host_test_clear_young_guest(struct kvm_cpu_context *ho
 	struct pkvm_hyp_vm *hyp_vm;
 	int ret = -EINVAL;
 
-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vm = get_np_pkvm_hyp_vm(handle);
 	if (!hyp_vm)
 		goto out;
@@ -362,9 +364,6 @@ static void handle___pkvm_host_mkyoung_guest(struct kvm_cpu_context *host_ctxt)
 	struct pkvm_hyp_vcpu *hyp_vcpu;
 	int ret = -EINVAL;
 
-	if (!is_protected_kvm_enabled())
-		goto out;
-
 	hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
 	if (!hyp_vcpu || pkvm_hyp_vcpu_is_protected(hyp_vcpu))
 		goto out;
@@ -424,12 +423,8 @@ static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 static void handle___pkvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);
-	struct pkvm_hyp_vm *hyp_vm;
+	struct pkvm_hyp_vm *hyp_vm = get_np_pkvm_hyp_vm(handle);
 
-	if (!is_protected_kvm_enabled())
-		return;
-
-	hyp_vm = get_np_pkvm_hyp_vm(handle);
 	if (!hyp_vm)
 		return;
 
@@ -486,17 +481,15 @@ static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
 	DECLARE_REG(unsigned long, size, host_ctxt, 2);
-	DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
-	DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
-	DECLARE_REG(u32, hyp_va_bits, host_ctxt, 5);
+	DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 3);
+	DECLARE_REG(u32, hyp_va_bits, host_ctxt, 4);
 
 	/*
 	 * __pkvm_init() will return only if an error occurred, otherwise it
 	 * will tail-call in __pkvm_init_finalise() which will have to deal
 	 * with the host context directly.
 	 */
-	cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base,
-					    hyp_va_bits);
+	cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, per_cpu_base, hyp_va_bits);
 }
 
 static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
@@ -582,11 +575,115 @@ static void handle___pkvm_init_vcpu(struct kvm_cpu_context *host_ctxt)
 	cpu_reg(host_ctxt, 1) = __pkvm_init_vcpu(handle, host_vcpu, vcpu_hva);
 }
 
-static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt)
+static void handle___pkvm_vcpu_in_poison_fault(struct kvm_cpu_context *host_ctxt)
+{
+	int ret;
+	struct pkvm_hyp_vcpu *hyp_vcpu = pkvm_get_loaded_hyp_vcpu();
+
+	ret = hyp_vcpu ? __pkvm_vcpu_in_poison_fault(hyp_vcpu) : -EINVAL;
+	cpu_reg(host_ctxt, 1) = ret;
+}
+
+static void handle___pkvm_force_reclaim_guest_page(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
+
+	cpu_reg(host_ctxt, 1) = __pkvm_host_force_reclaim_page_guest(phys);
+}
+
+static void handle___pkvm_reclaim_dying_guest_page(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);
+	DECLARE_REG(u64, gfn, host_ctxt, 2);
+
+	cpu_reg(host_ctxt, 1) = __pkvm_reclaim_dying_guest_page(handle, gfn);
+}
+
+static void handle___pkvm_start_teardown_vm(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);
 
-	cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle);
+	cpu_reg(host_ctxt, 1) = __pkvm_start_teardown_vm(handle);
+}
+
+static void handle___pkvm_finalize_teardown_vm(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(pkvm_handle_t, handle, host_ctxt, 1);
+
+	cpu_reg(host_ctxt, 1) = __pkvm_finalize_teardown_vm(handle);
+}
+
+static void handle___tracing_load(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(unsigned long, desc_hva, host_ctxt, 1);
+	DECLARE_REG(size_t, desc_size, host_ctxt, 2);
+
+	cpu_reg(host_ctxt, 1) = __tracing_load(desc_hva, desc_size);
+}
+
+static void handle___tracing_unload(struct kvm_cpu_context *host_ctxt)
+{
+	__tracing_unload();
+}
+
+static void handle___tracing_enable(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(bool, enable, host_ctxt, 1);
+
+	cpu_reg(host_ctxt, 1) = __tracing_enable(enable);
+}
+
+static void handle___tracing_swap_reader(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(unsigned int, cpu, host_ctxt, 1);
+
+	cpu_reg(host_ctxt, 1) = __tracing_swap_reader(cpu);
+}
+
+static void handle___tracing_update_clock(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(u32, mult, host_ctxt, 1);
+	DECLARE_REG(u32, shift, host_ctxt, 2);
+	DECLARE_REG(u64, epoch_ns, host_ctxt, 3);
+	DECLARE_REG(u64, epoch_cyc, host_ctxt, 4);
+
+	__tracing_update_clock(mult, shift, epoch_ns, epoch_cyc);
+}
+
+static void handle___tracing_reset(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(unsigned int, cpu, host_ctxt, 1);
+
+	cpu_reg(host_ctxt, 1) = __tracing_reset(cpu);
+}
+
+static void handle___tracing_enable_event(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(unsigned short, id, host_ctxt, 1);
+	DECLARE_REG(bool, enable, host_ctxt, 2);
+
+	cpu_reg(host_ctxt, 1) = __tracing_enable_event(id, enable);
+}
+
+static void handle___tracing_write_event(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(u64, id, host_ctxt, 1);
+
+	trace_selftest(id);
+}
+
+static void handle___vgic_v5_save_apr(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1);
+
+	__vgic_v5_save_apr(kern_hyp_va(cpu_if));
+}
+
+static void handle___vgic_v5_restore_vmcr_apr(struct kvm_cpu_context *host_ctxt)
+{
+	DECLARE_REG(struct vgic_v5_cpu_if *, cpu_if, host_ctxt, 1);
+
+	__vgic_v5_restore_vmcr_apr(kern_hyp_va(cpu_if));
 }
 
 typedef void (*hcall_t)(struct kvm_cpu_context *);
@@ -603,14 +700,6 @@ static const hcall_t host_hcall[] = {
 	HANDLE_FUNC(__vgic_v3_get_gic_config),
 	HANDLE_FUNC(__pkvm_prot_finalize),
 
-	HANDLE_FUNC(__pkvm_host_share_hyp),
-	HANDLE_FUNC(__pkvm_host_unshare_hyp),
-	HANDLE_FUNC(__pkvm_host_share_guest),
-	HANDLE_FUNC(__pkvm_host_unshare_guest),
-	HANDLE_FUNC(__pkvm_host_relax_perms_guest),
-	HANDLE_FUNC(__pkvm_host_wrprotect_guest),
-	HANDLE_FUNC(__pkvm_host_test_clear_young_guest),
-	HANDLE_FUNC(__pkvm_host_mkyoung_guest),
 	HANDLE_FUNC(__kvm_adjust_pc),
 	HANDLE_FUNC(__kvm_vcpu_run),
 	HANDLE_FUNC(__kvm_flush_vm_context),
@@ -622,20 +711,44 @@ static const hcall_t host_hcall[] = {
 	HANDLE_FUNC(__kvm_timer_set_cntvoff),
 	HANDLE_FUNC(__vgic_v3_save_aprs),
 	HANDLE_FUNC(__vgic_v3_restore_vmcr_aprs),
+	HANDLE_FUNC(__vgic_v5_save_apr),
+	HANDLE_FUNC(__vgic_v5_restore_vmcr_apr),
+
+	HANDLE_FUNC(__pkvm_host_share_hyp),
+	HANDLE_FUNC(__pkvm_host_unshare_hyp),
+	HANDLE_FUNC(__pkvm_host_donate_guest),
+	HANDLE_FUNC(__pkvm_host_share_guest),
+	HANDLE_FUNC(__pkvm_host_unshare_guest),
+	HANDLE_FUNC(__pkvm_host_relax_perms_guest),
+	HANDLE_FUNC(__pkvm_host_wrprotect_guest),
+	HANDLE_FUNC(__pkvm_host_test_clear_young_guest),
+	HANDLE_FUNC(__pkvm_host_mkyoung_guest),
 	HANDLE_FUNC(__pkvm_reserve_vm),
 	HANDLE_FUNC(__pkvm_unreserve_vm),
 	HANDLE_FUNC(__pkvm_init_vm),
 	HANDLE_FUNC(__pkvm_init_vcpu),
-	HANDLE_FUNC(__pkvm_teardown_vm),
+	HANDLE_FUNC(__pkvm_vcpu_in_poison_fault),
+	HANDLE_FUNC(__pkvm_force_reclaim_guest_page),
+	HANDLE_FUNC(__pkvm_reclaim_dying_guest_page),
+	HANDLE_FUNC(__pkvm_start_teardown_vm),
+	HANDLE_FUNC(__pkvm_finalize_teardown_vm),
 	HANDLE_FUNC(__pkvm_vcpu_load),
 	HANDLE_FUNC(__pkvm_vcpu_put),
 	HANDLE_FUNC(__pkvm_tlb_flush_vmid),
+	HANDLE_FUNC(__tracing_load),
+	HANDLE_FUNC(__tracing_unload),
+	HANDLE_FUNC(__tracing_enable),
+	HANDLE_FUNC(__tracing_swap_reader),
+	HANDLE_FUNC(__tracing_update_clock),
+	HANDLE_FUNC(__tracing_reset),
+	HANDLE_FUNC(__tracing_enable_event),
+	HANDLE_FUNC(__tracing_write_event),
 };
 
 static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(unsigned long, id, host_ctxt, 0);
-	unsigned long hcall_min = 0;
+	unsigned long hcall_min = 0, hcall_max = -1;
 	hcall_t hfn;
 
 	/*
@@ -647,14 +760,19 @@ static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
 	 * basis. This is all fine, however, since __pkvm_prot_finalize
 	 * returns -EPERM after the first call for a given CPU.
 	 */
-	if (static_branch_unlikely(&kvm_protected_mode_initialized))
-		hcall_min = __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize;
+	if (static_branch_unlikely(&kvm_protected_mode_initialized)) {
+		hcall_min = __KVM_HOST_SMCCC_FUNC_MIN_PKVM;
+	} else {
+		hcall_max = __KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM;
+	}
 
 	id &= ~ARM_SMCCC_CALL_HINTS;
 	id -= KVM_HOST_SMCCC_ID(0);
 
-	if (unlikely(id < hcall_min || id >= ARRAY_SIZE(host_hcall)))
+	if (unlikely(id < hcall_min || id > hcall_max ||
+		     id >= ARRAY_SIZE(host_hcall))) {
 		goto inval;
+	}
 
 	hfn = host_hcall[id];
 	if (unlikely(!hfn))
@@ -670,14 +788,22 @@ inval:
 
 static void default_host_smc_handler(struct kvm_cpu_context *host_ctxt)
 {
+	trace_hyp_exit(host_ctxt, HYP_REASON_SMC);
 	__kvm_hyp_host_forward_smc(host_ctxt);
+	trace_hyp_enter(host_ctxt, HYP_REASON_SMC);
 }
 
 static void handle_host_smc(struct kvm_cpu_context *host_ctxt)
 {
 	DECLARE_REG(u64, func_id, host_ctxt, 0);
+	u64 esr = read_sysreg_el2(SYS_ESR);
 	bool handled;
 
+	if (esr & ESR_ELx_xVC_IMM_MASK) {
+		cpu_reg(host_ctxt, 0) = SMCCC_RET_NOT_SUPPORTED;
+		goto exit_skip_instr;
+	}
+
 	func_id &= ~ARM_SMCCC_CALL_HINTS;
 
 	handled = kvm_host_psci_handler(host_ctxt, func_id);
@@ -686,47 +812,57 @@ static void handle_host_smc(struct kvm_cpu_context *host_ctxt)
 	if (!handled)
 		default_host_smc_handler(host_ctxt);
 
+exit_skip_instr:
 	/* SMC was trapped, move ELR past the current PC. */
 	kvm_skip_host_instr();
 }
 
-/*
- * Inject an Undefined Instruction exception into the host.
- *
- * This is open-coded to allow control over PSTATE construction without
- * complicating the generic exception entry helpers.
- */
-static void inject_undef64(void)
+void inject_host_exception(u64 esr)
 {
-	u64 spsr_mask, vbar, sctlr, old_spsr, new_spsr, esr, offset;
+	u64 sctlr, spsr_el1, spsr_el2, exc_offset = except_type_sync;
+	const u64 spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT |
+			      PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT;
 
-	spsr_mask = PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | PSR_V_BIT | PSR_DIT_BIT | PSR_PAN_BIT;
-
-	vbar = read_sysreg_el1(SYS_VBAR);
-	sctlr = read_sysreg_el1(SYS_SCTLR);
-	old_spsr = read_sysreg_el2(SYS_SPSR);
+	spsr_el1 = spsr_el2 = read_sysreg_el2(SYS_SPSR);
+	switch (spsr_el1 & (PSR_MODE_MASK | PSR_MODE32_BIT)) {
+	case PSR_MODE_EL0t:
+		exc_offset += LOWER_EL_AArch64_VECTOR;
+		break;
+	case PSR_MODE_EL0t | PSR_MODE32_BIT:
+		exc_offset += LOWER_EL_AArch32_VECTOR;
+		break;
+	default:
+		exc_offset += CURRENT_EL_SP_ELx_VECTOR;
+	}
 
-	new_spsr = old_spsr & spsr_mask;
-	new_spsr |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT;
-	new_spsr |= PSR_MODE_EL1h;
+	spsr_el2 &= spsr_mask;
+	spsr_el2 |= PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT |
+		    PSR_MODE_EL1h;
 
+	sctlr = read_sysreg_el1(SYS_SCTLR);
 	if (!(sctlr & SCTLR_EL1_SPAN))
-		new_spsr |= PSR_PAN_BIT;
+		spsr_el2 |= PSR_PAN_BIT;
 
 	if (sctlr & SCTLR_ELx_DSSBS)
-		new_spsr |= PSR_SSBS_BIT;
+		spsr_el2 |= PSR_SSBS_BIT;
 
 	if (system_supports_mte())
-		new_spsr |= PSR_TCO_BIT;
+		spsr_el2 |= PSR_TCO_BIT;
 
-	esr = (ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT) | ESR_ELx_IL;
-	offset = CURRENT_EL_SP_ELx_VECTOR + except_type_sync;
+	if (esr_fsc_is_translation_fault(esr))
+		write_sysreg_el1(read_sysreg_el2(SYS_FAR), SYS_FAR);
 
 	write_sysreg_el1(esr, SYS_ESR);
 	write_sysreg_el1(read_sysreg_el2(SYS_ELR), SYS_ELR);
-	write_sysreg_el1(old_spsr, SYS_SPSR);
-	write_sysreg_el2(vbar + offset, SYS_ELR);
-	write_sysreg_el2(new_spsr, SYS_SPSR);
+	write_sysreg_el1(spsr_el1, SYS_SPSR);
+	write_sysreg_el2(read_sysreg_el1(SYS_VBAR) + exc_offset, SYS_ELR);
+	write_sysreg_el2(spsr_el2, SYS_SPSR);
+}
+
+static void inject_host_undef64(void)
+{
+	inject_host_exception((ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT) |
+			      ESR_ELx_IL);
 }
 
 static bool handle_host_mte(u64 esr)
@@ -749,7 +885,7 @@ static bool handle_host_mte(u64 esr)
 		return false;
 	}
 
-	inject_undef64();
+	inject_host_undef64();
 	return true;
 }
 
@@ -757,15 +893,19 @@ void handle_trap(struct kvm_cpu_context *host_ctxt)
 {
 	u64 esr = read_sysreg_el2(SYS_ESR);
 
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_HVC64:
+		trace_hyp_enter(host_ctxt, HYP_REASON_HVC);
 		handle_host_hcall(host_ctxt);
 		break;
 	case ESR_ELx_EC_SMC64:
+		trace_hyp_enter(host_ctxt, HYP_REASON_SMC);
 		handle_host_smc(host_ctxt);
 		break;
 	case ESR_ELx_EC_IABT_LOW:
 	case ESR_ELx_EC_DABT_LOW:
+		trace_hyp_enter(host_ctxt, HYP_REASON_HOST_ABORT);
 		handle_host_mem_abort(host_ctxt);
 		break;
 	case ESR_ELx_EC_SYS64:
@@ -775,4 +915,6 @@ void handle_trap(struct kvm_cpu_context *host_ctxt)
 	default:
 		BUG();
 	}
+
+	trace_hyp_exit(host_ctxt, HYP_REASON_ERET_HOST);
 }
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
index d724f6d69302..7a02837203d1 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
+++ b/arch/arm64/kvm/hyp/nvhe/hyp.lds.S
@@ -16,6 +16,12 @@ SECTIONS {
 	HYP_SECTION(.text)
 	HYP_SECTION(.data..ro_after_init)
 	HYP_SECTION(.rodata)
+#ifdef CONFIG_NVHE_EL2_TRACING
+	. = ALIGN(PAGE_SIZE);
+	BEGIN_HYP_SECTION(.event_ids)
+		*(SORT(.hyp.event_ids.*))
+	END_HYP_SECTION
+#endif
 
 	/*
 	 * .hyp..data..percpu needs to be page aligned to maintain the same
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index d815265bd374..28a471d1927c 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -18,6 +18,7 @@
 #include <nvhe/memory.h>
 #include <nvhe/mem_protect.h>
 #include <nvhe/mm.h>
+#include <nvhe/trap_handler.h>
 
 #define KVM_HOST_S2_FLAGS (KVM_PGTABLE_S2_AS_S1 | KVM_PGTABLE_S2_IDMAP)
 
@@ -461,8 +462,15 @@ static bool range_is_memory(u64 start, u64 end)
 static inline int __host_stage2_idmap(u64 start, u64 end,
 				      enum kvm_pgtable_prot prot)
 {
+	/*
+	 * We don't make permission changes to the host idmap after
+	 * initialisation, so we can squash -EAGAIN to save callers
+	 * having to treat it like success in the case that they try to
+	 * map something that is already mapped.
+	 */
 	return kvm_pgtable_stage2_map(&host_mmu.pgt, start, end - start, start,
-				      prot, &host_s2_pool, 0);
+				      prot, &host_s2_pool,
+				      KVM_PGTABLE_WALK_IGNORE_EAGAIN);
 }
 
 /*
@@ -504,7 +512,7 @@ static int host_stage2_adjust_range(u64 addr, struct kvm_mem_range *range)
 		return ret;
 
 	if (kvm_pte_valid(pte))
-		return -EAGAIN;
+		return -EEXIST;
 
 	if (pte) {
 		WARN_ON(addr_is_memory(addr) &&
@@ -541,24 +549,99 @@ static void __host_update_page_state(phys_addr_t addr, u64 size, enum pkvm_page_
 	set_host_state(page, state);
 }
 
-int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
+#define KVM_HOST_DONATION_PTE_OWNER_MASK GENMASK(3, 1)
+#define KVM_HOST_DONATION_PTE_EXTRA_MASK GENMASK(59, 4)
+static int host_stage2_set_owner_metadata_locked(phys_addr_t addr, u64 size,
+						 u8 owner_id, u64 meta)
 {
+	kvm_pte_t annotation;
 	int ret;
 
+	if (owner_id == PKVM_ID_HOST)
+		return -EINVAL;
+
 	if (!range_is_memory(addr, addr + size))
 		return -EPERM;
 
-	ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_mmu.pgt,
-			      addr, size, &host_s2_pool, owner_id);
-	if (ret)
-		return ret;
+	if (!FIELD_FIT(KVM_HOST_DONATION_PTE_OWNER_MASK, owner_id))
+		return -EINVAL;
 
-	/* Don't forget to update the vmemmap tracking for the host */
-	if (owner_id == PKVM_ID_HOST)
-		__host_update_page_state(addr, size, PKVM_PAGE_OWNED);
-	else
+	if (!FIELD_FIT(KVM_HOST_DONATION_PTE_EXTRA_MASK, meta))
+		return -EINVAL;
+
+	annotation = FIELD_PREP(KVM_HOST_DONATION_PTE_OWNER_MASK, owner_id) |
+		     FIELD_PREP(KVM_HOST_DONATION_PTE_EXTRA_MASK, meta);
+	ret = host_stage2_try(kvm_pgtable_stage2_annotate, &host_mmu.pgt,
+			      addr, size, &host_s2_pool,
+			      KVM_HOST_INVALID_PTE_TYPE_DONATION, annotation);
+	if (!ret)
 		__host_update_page_state(addr, size, PKVM_NOPAGE);
 
+	return ret;
+}
+
+int host_stage2_set_owner_locked(phys_addr_t addr, u64 size, u8 owner_id)
+{
+	int ret = -EINVAL;
+
+	switch (owner_id) {
+	case PKVM_ID_HOST:
+		if (!range_is_memory(addr, addr + size))
+			return -EPERM;
+
+		ret = host_stage2_idmap_locked(addr, size, PKVM_HOST_MEM_PROT);
+		if (!ret)
+			__host_update_page_state(addr, size, PKVM_PAGE_OWNED);
+		break;
+	case PKVM_ID_HYP:
+		ret = host_stage2_set_owner_metadata_locked(addr, size,
+							    owner_id, 0);
+		break;
+	}
+
+	return ret;
+}
+
+#define KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK GENMASK(15, 0)
+/* We need 40 bits for the GFN to cover a 52-bit IPA with 4k pages and LPA2 */
+#define KVM_HOST_PTE_OWNER_GUEST_GFN_MASK GENMASK(55, 16)
+static u64 host_stage2_encode_gfn_meta(struct pkvm_hyp_vm *vm, u64 gfn)
+{
+	pkvm_handle_t handle = vm->kvm.arch.pkvm.handle;
+
+	BUILD_BUG_ON((pkvm_handle_t)-1 > KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK);
+	WARN_ON(!FIELD_FIT(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn));
+
+	return FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK, handle) |
+	       FIELD_PREP(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, gfn);
+}
+
+static int host_stage2_decode_gfn_meta(kvm_pte_t pte, struct pkvm_hyp_vm **vm,
+				       u64 *gfn)
+{
+	pkvm_handle_t handle;
+	u64 meta;
+
+	if (WARN_ON(kvm_pte_valid(pte)))
+		return -EINVAL;
+
+	if (FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) !=
+	    KVM_HOST_INVALID_PTE_TYPE_DONATION) {
+		return -EINVAL;
+	}
+
+	if (FIELD_GET(KVM_HOST_DONATION_PTE_OWNER_MASK, pte) != PKVM_ID_GUEST)
+		return -EPERM;
+
+	meta = FIELD_GET(KVM_HOST_DONATION_PTE_EXTRA_MASK, pte);
+	handle = FIELD_GET(KVM_HOST_PTE_OWNER_GUEST_HANDLE_MASK, meta);
+	*vm = get_vm_by_handle(handle);
+	if (!*vm) {
+		/* We probably raced with teardown; try again */
+		return -EAGAIN;
+	}
+
+	*gfn = FIELD_GET(KVM_HOST_PTE_OWNER_GUEST_GFN_MASK, meta);
 	return 0;
 }
 
@@ -605,11 +688,43 @@ unlock:
 	return ret;
 }
 
+static void host_inject_mem_abort(struct kvm_cpu_context *host_ctxt)
+{
+	u64 ec, esr, spsr;
+
+	esr = read_sysreg_el2(SYS_ESR);
+	spsr = read_sysreg_el2(SYS_SPSR);
+
+	/* Repaint the ESR to report a same-level fault if taken from EL1 */
+	if ((spsr & PSR_MODE_MASK) != PSR_MODE_EL0t) {
+		ec = ESR_ELx_EC(esr);
+		if (ec == ESR_ELx_EC_DABT_LOW)
+			ec = ESR_ELx_EC_DABT_CUR;
+		else if (ec == ESR_ELx_EC_IABT_LOW)
+			ec = ESR_ELx_EC_IABT_CUR;
+		else
+			WARN_ON(1);
+		esr &= ~ESR_ELx_EC_MASK;
+		esr |= ec << ESR_ELx_EC_SHIFT;
+	}
+
+	/*
+	 * Since S1PTW should only ever be set for stage-2 faults, we're pretty
+	 * much guaranteed that it won't be set in ESR_EL1 by the hardware. So,
+	 * let's use that bit to allow the host abort handler to differentiate
+	 * this abort from normal userspace faults.
+	 *
+	 * Note: although S1PTW is RES0 at EL1, it is guaranteed by the
+	 * architecture to be backed by flops, so it should be safe to use.
+	 */
+	esr |= ESR_ELx_S1PTW;
+	inject_host_exception(esr);
+}
+
 void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
 {
 	struct kvm_vcpu_fault_info fault;
 	u64 esr, addr;
-	int ret = 0;
 
 	esr = read_sysreg_el2(SYS_ESR);
 	if (!__get_fault_info(esr, &fault)) {
@@ -628,8 +743,16 @@ void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
 	BUG_ON(!(fault.hpfar_el2 & HPFAR_EL2_NS));
 	addr = FIELD_GET(HPFAR_EL2_FIPA, fault.hpfar_el2) << 12;
 
-	ret = host_stage2_idmap(addr);
-	BUG_ON(ret && ret != -EAGAIN);
+	switch (host_stage2_idmap(addr)) {
+	case -EPERM:
+		host_inject_mem_abort(host_ctxt);
+		fallthrough;
+	case -EEXIST:
+	case 0:
+		break;
+	default:
+		BUG();
+	}
 }
 
 struct check_walk_data {
@@ -707,8 +830,20 @@ static int __hyp_check_page_state_range(phys_addr_t phys, u64 size, enum pkvm_pa
 	return 0;
 }
 
+static bool guest_pte_is_poisoned(kvm_pte_t pte)
+{
+	if (kvm_pte_valid(pte))
+		return false;
+
+	return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) ==
+	       KVM_GUEST_INVALID_PTE_TYPE_POISONED;
+}
+
 static enum pkvm_page_state guest_get_page_state(kvm_pte_t pte, u64 addr)
 {
+	if (guest_pte_is_poisoned(pte))
+		return PKVM_POISON;
+
 	if (!kvm_pte_valid(pte))
 		return PKVM_NOPAGE;
 
@@ -727,6 +862,77 @@ static int __guest_check_page_state_range(struct pkvm_hyp_vm *vm, u64 addr,
 	return check_page_state_range(&vm->pgt, addr, size, &d);
 }
 
+static int get_valid_guest_pte(struct pkvm_hyp_vm *vm, u64 ipa, kvm_pte_t *ptep, u64 *physp)
+{
+	kvm_pte_t pte;
+	u64 phys;
+	s8 level;
+	int ret;
+
+	ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level);
+	if (ret)
+		return ret;
+	if (guest_pte_is_poisoned(pte))
+		return -EHWPOISON;
+	if (!kvm_pte_valid(pte))
+		return -ENOENT;
+	if (level != KVM_PGTABLE_LAST_LEVEL)
+		return -E2BIG;
+
+	phys = kvm_pte_to_phys(pte);
+	ret = check_range_allowed_memory(phys, phys + PAGE_SIZE);
+	if (WARN_ON(ret))
+		return ret;
+
+	*ptep = pte;
+	*physp = phys;
+
+	return 0;
+}
+
+int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu)
+{
+	struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(hyp_vcpu);
+	kvm_pte_t pte;
+	s8 level;
+	u64 ipa;
+	int ret;
+
+	switch (kvm_vcpu_trap_get_class(&hyp_vcpu->vcpu)) {
+	case ESR_ELx_EC_DABT_LOW:
+	case ESR_ELx_EC_IABT_LOW:
+		if (kvm_vcpu_trap_is_translation_fault(&hyp_vcpu->vcpu))
+			break;
+		fallthrough;
+	default:
+		return -EINVAL;
+	}
+
+	/*
+	 * The host has the faulting IPA when it calls us from the guest
+	 * fault handler but we retrieve it ourselves from the FAR so as
+	 * to avoid exposing an "oracle" that could reveal data access
+	 * patterns of the guest after initial donation of its pages.
+	 */
+	ipa = kvm_vcpu_get_fault_ipa(&hyp_vcpu->vcpu);
+	ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(&hyp_vcpu->vcpu));
+
+	guest_lock_component(vm);
+	ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level);
+	if (ret)
+		goto unlock;
+
+	if (level != KVM_PGTABLE_LAST_LEVEL) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	ret = guest_pte_is_poisoned(pte);
+unlock:
+	guest_unlock_component(vm);
+	return ret;
+}
+
 int __pkvm_host_share_hyp(u64 pfn)
 {
 	u64 phys = hyp_pfn_to_phys(pfn);
@@ -753,6 +959,72 @@ unlock:
 	return ret;
 }
 
+int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn)
+{
+	struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu);
+	u64 phys, ipa = hyp_pfn_to_phys(gfn);
+	kvm_pte_t pte;
+	int ret;
+
+	host_lock_component();
+	guest_lock_component(vm);
+
+	ret = get_valid_guest_pte(vm, ipa, &pte, &phys);
+	if (ret)
+		goto unlock;
+
+	ret = -EPERM;
+	if (pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte)) != PKVM_PAGE_OWNED)
+		goto unlock;
+	if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE))
+		goto unlock;
+
+	ret = 0;
+	WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
+				       pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED),
+				       &vcpu->vcpu.arch.pkvm_memcache, 0));
+	WARN_ON(__host_set_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED));
+unlock:
+	guest_unlock_component(vm);
+	host_unlock_component();
+
+	return ret;
+}
+
+int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn)
+{
+	struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu);
+	u64 meta, phys, ipa = hyp_pfn_to_phys(gfn);
+	kvm_pte_t pte;
+	int ret;
+
+	host_lock_component();
+	guest_lock_component(vm);
+
+	ret = get_valid_guest_pte(vm, ipa, &pte, &phys);
+	if (ret)
+		goto unlock;
+
+	ret = -EPERM;
+	if (pkvm_getstate(kvm_pgtable_stage2_pte_prot(pte)) != PKVM_PAGE_SHARED_OWNED)
+		goto unlock;
+	if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED))
+		goto unlock;
+
+	ret = 0;
+	meta = host_stage2_encode_gfn_meta(vm, gfn);
+	WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE,
+						      PKVM_ID_GUEST, meta));
+	WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
+				       pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED),
+				       &vcpu->vcpu.arch.pkvm_memcache, 0));
+unlock:
+	guest_unlock_component(vm);
+	host_unlock_component();
+
+	return ret;
+}
+
 int __pkvm_host_unshare_hyp(u64 pfn)
 {
 	u64 phys = hyp_pfn_to_phys(pfn);
@@ -960,6 +1232,176 @@ static int __guest_check_transition_size(u64 phys, u64 ipa, u64 nr_pages, u64 *s
 	return 0;
 }
 
+static void hyp_poison_page(phys_addr_t phys)
+{
+	void *addr = hyp_fixmap_map(phys);
+
+	memset(addr, 0, PAGE_SIZE);
+	/*
+	 * Prefer kvm_flush_dcache_to_poc() over __clean_dcache_guest_page()
+	 * here as the latter may elide the CMO under the assumption that FWB
+	 * will be enabled on CPUs that support it. This is incorrect for the
+	 * host stage-2 and would otherwise lead to a malicious host potentially
+	 * being able to read the contents of newly reclaimed guest pages.
+	 */
+	kvm_flush_dcache_to_poc(addr, PAGE_SIZE);
+	hyp_fixmap_unmap();
+}
+
+static int host_stage2_get_guest_info(phys_addr_t phys, struct pkvm_hyp_vm **vm,
+				      u64 *gfn)
+{
+	enum pkvm_page_state state;
+	kvm_pte_t pte;
+	s8 level;
+	int ret;
+
+	if (!addr_is_memory(phys))
+		return -EFAULT;
+
+	state = get_host_state(hyp_phys_to_page(phys));
+	switch (state) {
+	case PKVM_PAGE_OWNED:
+	case PKVM_PAGE_SHARED_OWNED:
+	case PKVM_PAGE_SHARED_BORROWED:
+		/* The access should no longer fault; try again. */
+		return -EAGAIN;
+	case PKVM_NOPAGE:
+		break;
+	default:
+		return -EPERM;
+	}
+
+	ret = kvm_pgtable_get_leaf(&host_mmu.pgt, phys, &pte, &level);
+	if (ret)
+		return ret;
+
+	if (WARN_ON(level != KVM_PGTABLE_LAST_LEVEL))
+		return -EINVAL;
+
+	return host_stage2_decode_gfn_meta(pte, vm, gfn);
+}
+
+int __pkvm_host_force_reclaim_page_guest(phys_addr_t phys)
+{
+	struct pkvm_hyp_vm *vm;
+	u64 gfn, ipa, pa;
+	kvm_pte_t pte;
+	int ret;
+
+	phys &= PAGE_MASK;
+
+	hyp_spin_lock(&vm_table_lock);
+	host_lock_component();
+
+	ret = host_stage2_get_guest_info(phys, &vm, &gfn);
+	if (ret)
+		goto unlock_host;
+
+	ipa = hyp_pfn_to_phys(gfn);
+	guest_lock_component(vm);
+	ret = get_valid_guest_pte(vm, ipa, &pte, &pa);
+	if (ret)
+		goto unlock_guest;
+
+	WARN_ON(pa != phys);
+	if (guest_get_page_state(pte, ipa) != PKVM_PAGE_OWNED) {
+		ret = -EPERM;
+		goto unlock_guest;
+	}
+
+	/* We really shouldn't be allocating, so don't pass a memcache */
+	ret = kvm_pgtable_stage2_annotate(&vm->pgt, ipa, PAGE_SIZE, NULL,
+					  KVM_GUEST_INVALID_PTE_TYPE_POISONED,
+					  0);
+	if (ret)
+		goto unlock_guest;
+
+	hyp_poison_page(phys);
+	WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST));
+unlock_guest:
+	guest_unlock_component(vm);
+unlock_host:
+	host_unlock_component();
+	hyp_spin_unlock(&vm_table_lock);
+
+	return ret;
+}
+
+int __pkvm_host_reclaim_page_guest(u64 gfn, struct pkvm_hyp_vm *vm)
+{
+	u64 ipa = hyp_pfn_to_phys(gfn);
+	kvm_pte_t pte;
+	u64 phys;
+	int ret;
+
+	host_lock_component();
+	guest_lock_component(vm);
+
+	ret = get_valid_guest_pte(vm, ipa, &pte, &phys);
+	if (ret)
+		goto unlock;
+
+	switch (guest_get_page_state(pte, ipa)) {
+	case PKVM_PAGE_OWNED:
+		WARN_ON(__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE));
+		hyp_poison_page(phys);
+		break;
+	case PKVM_PAGE_SHARED_OWNED:
+		WARN_ON(__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED));
+		break;
+	default:
+		ret = -EPERM;
+		goto unlock;
+	}
+
+	WARN_ON(kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE));
+	WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST));
+
+unlock:
+	guest_unlock_component(vm);
+	host_unlock_component();
+
+	/*
+	 * -EHWPOISON implies that the page was forcefully reclaimed already
+	 * so return success for the GUP pin to be dropped.
+	 */
+	return ret && ret != -EHWPOISON ? ret : 0;
+}
+
+int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu)
+{
+	struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu);
+	u64 phys = hyp_pfn_to_phys(pfn);
+	u64 ipa = hyp_pfn_to_phys(gfn);
+	u64 meta;
+	int ret;
+
+	host_lock_component();
+	guest_lock_component(vm);
+
+	ret = __host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_OWNED);
+	if (ret)
+		goto unlock;
+
+	ret = __guest_check_page_state_range(vm, ipa, PAGE_SIZE, PKVM_NOPAGE);
+	if (ret)
+		goto unlock;
+
+	meta = host_stage2_encode_gfn_meta(vm, gfn);
+	WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE,
+						      PKVM_ID_GUEST, meta));
+	WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
+				       pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED),
+				       &vcpu->vcpu.arch.pkvm_memcache, 0));
+
+unlock:
+	guest_unlock_component(vm);
+	host_unlock_component();
+
+	return ret;
+}
+
 int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu *vcpu,
 			    enum kvm_pgtable_prot prot)
 {
@@ -1206,53 +1648,18 @@ struct pkvm_expected_state {
 
 static struct pkvm_expected_state selftest_state;
 static struct hyp_page *selftest_page;
-
-static struct pkvm_hyp_vm selftest_vm = {
-	.kvm = {
-		.arch = {
-			.mmu = {
-				.arch = &selftest_vm.kvm.arch,
-				.pgt = &selftest_vm.pgt,
-			},
-		},
-	},
-};
-
-static struct pkvm_hyp_vcpu selftest_vcpu = {
-	.vcpu = {
-		.arch = {
-			.hw_mmu = &selftest_vm.kvm.arch.mmu,
-		},
-		.kvm = &selftest_vm.kvm,
-	},
-};
-
-static void init_selftest_vm(void *virt)
-{
-	struct hyp_page *p = hyp_virt_to_page(virt);
-	int i;
-
-	selftest_vm.kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr;
-	WARN_ON(kvm_guest_prepare_stage2(&selftest_vm, virt));
-
-	for (i = 0; i < pkvm_selftest_pages(); i++) {
-		if (p[i].refcount)
-			continue;
-		p[i].refcount = 1;
-		hyp_put_page(&selftest_vm.pool, hyp_page_to_virt(&p[i]));
-	}
-}
+static struct pkvm_hyp_vcpu *selftest_vcpu;
 
 static u64 selftest_ipa(void)
 {
-	return BIT(selftest_vm.pgt.ia_bits - 1);
+	return BIT(selftest_vcpu->vcpu.arch.hw_mmu->pgt->ia_bits - 1);
 }
 
 static void assert_page_state(void)
 {
 	void *virt = hyp_page_to_virt(selftest_page);
 	u64 size = PAGE_SIZE << selftest_page->order;
-	struct pkvm_hyp_vcpu *vcpu = &selftest_vcpu;
+	struct pkvm_hyp_vcpu *vcpu = selftest_vcpu;
 	u64 phys = hyp_virt_to_phys(virt);
 	u64 ipa[2] = { selftest_ipa(), selftest_ipa() + PAGE_SIZE };
 	struct pkvm_hyp_vm *vm;
@@ -1267,10 +1674,10 @@ static void assert_page_state(void)
 	WARN_ON(__hyp_check_page_state_range(phys, size, selftest_state.hyp));
 	hyp_unlock_component();
 
-	guest_lock_component(&selftest_vm);
+	guest_lock_component(vm);
 	WARN_ON(__guest_check_page_state_range(vm, ipa[0], size, selftest_state.guest[0]));
 	WARN_ON(__guest_check_page_state_range(vm, ipa[1], size, selftest_state.guest[1]));
-	guest_unlock_component(&selftest_vm);
+	guest_unlock_component(vm);
 }
 
 #define assert_transition_res(res, fn, ...) \
@@ -1283,14 +1690,15 @@ void pkvm_ownership_selftest(void *base)
 {
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RWX;
 	void *virt = hyp_alloc_pages(&host_s2_pool, 0);
-	struct pkvm_hyp_vcpu *vcpu = &selftest_vcpu;
-	struct pkvm_hyp_vm *vm = &selftest_vm;
+	struct pkvm_hyp_vcpu *vcpu;
 	u64 phys, size, pfn, gfn;
+	struct pkvm_hyp_vm *vm;
 
 	WARN_ON(!virt);
 	selftest_page = hyp_virt_to_page(virt);
 	selftest_page->refcount = 0;
-	init_selftest_vm(base);
+	selftest_vcpu = vcpu = init_selftest_vm(base);
+	vm = pkvm_hyp_vcpu_to_hyp_vm(vcpu);
 
 	size = PAGE_SIZE << selftest_page->order;
 	phys = hyp_virt_to_phys(virt);
@@ -1309,6 +1717,7 @@ void pkvm_ownership_selftest(void *base)
 	assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size);
 	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot);
 	assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
 
 	selftest_state.host = PKVM_PAGE_OWNED;
 	selftest_state.hyp = PKVM_NOPAGE;
@@ -1328,6 +1737,7 @@ void pkvm_ownership_selftest(void *base)
 	assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1);
 	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot);
 	assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
 
 	assert_transition_res(0, hyp_pin_shared_mem, virt, virt + size);
 	assert_transition_res(0, hyp_pin_shared_mem, virt, virt + size);
@@ -1340,6 +1750,7 @@ void pkvm_ownership_selftest(void *base)
 	assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1);
 	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot);
 	assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
 
 	hyp_unpin_shared_mem(virt, virt + size);
 	assert_page_state();
@@ -1359,6 +1770,7 @@ void pkvm_ownership_selftest(void *base)
 	assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1);
 	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot);
 	assert_transition_res(-ENOENT, __pkvm_host_unshare_guest, gfn, 1, vm);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
 	assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size);
 
 	selftest_state.host = PKVM_PAGE_OWNED;
@@ -1375,6 +1787,7 @@ void pkvm_ownership_selftest(void *base)
 	assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn);
 	assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn);
 	assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
 	assert_transition_res(-EPERM, hyp_pin_shared_mem, virt, virt + size);
 
 	selftest_state.guest[1] = PKVM_PAGE_SHARED_BORROWED;
@@ -1389,9 +1802,69 @@ void pkvm_ownership_selftest(void *base)
 	assert_transition_res(0, __pkvm_host_unshare_guest, gfn + 1, 1, vm);
 
 	selftest_state.host = PKVM_NOPAGE;
+	selftest_state.guest[0] = PKVM_PAGE_OWNED;
+	assert_transition_res(0, __pkvm_host_donate_guest, pfn, gfn, vcpu);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu);
+	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot);
+	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot);
+	assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1);
+	assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1);
+	assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn);
+	assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn);
+	assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1);
+
+	selftest_state.host = PKVM_PAGE_SHARED_BORROWED;
+	selftest_state.guest[0] = PKVM_PAGE_SHARED_OWNED;
+	assert_transition_res(0, __pkvm_guest_share_host, vcpu, gfn);
+	assert_transition_res(-EPERM, __pkvm_guest_share_host, vcpu, gfn);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu);
+	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot);
+	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot);
+	assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1);
+	assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1);
+	assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn);
+	assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn);
+	assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1);
+
+	selftest_state.host = PKVM_NOPAGE;
+	selftest_state.guest[0] = PKVM_PAGE_OWNED;
+	assert_transition_res(0, __pkvm_guest_unshare_host, vcpu, gfn);
+	assert_transition_res(-EPERM, __pkvm_guest_unshare_host, vcpu, gfn);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu);
+	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot);
+	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn + 1, 1, vcpu, prot);
+	assert_transition_res(-EPERM, __pkvm_host_share_ffa, pfn, 1);
+	assert_transition_res(-EPERM, __pkvm_host_donate_hyp, pfn, 1);
+	assert_transition_res(-EPERM, __pkvm_host_share_hyp, pfn);
+	assert_transition_res(-EPERM, __pkvm_host_unshare_hyp, pfn);
+	assert_transition_res(-EPERM, __pkvm_hyp_donate_host, pfn, 1);
+
+	selftest_state.host = PKVM_PAGE_OWNED;
+	selftest_state.guest[0] = PKVM_POISON;
+	assert_transition_res(0, __pkvm_host_force_reclaim_page_guest, phys);
+	assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu);
+	assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot);
+	assert_transition_res(-EHWPOISON, __pkvm_guest_share_host, vcpu, gfn);
+	assert_transition_res(-EHWPOISON, __pkvm_guest_unshare_host, vcpu,
gfn); + + selftest_state.host = PKVM_NOPAGE; + selftest_state.guest[1] = PKVM_PAGE_OWNED; + assert_transition_res(0, __pkvm_host_donate_guest, pfn, gfn + 1, vcpu); + + selftest_state.host = PKVM_PAGE_OWNED; + selftest_state.guest[1] = PKVM_NOPAGE; + assert_transition_res(0, __pkvm_host_reclaim_page_guest, gfn + 1, vm); + assert_transition_res(-EPERM, __pkvm_host_donate_guest, pfn, gfn, vcpu); + assert_transition_res(-EPERM, __pkvm_host_share_guest, pfn, gfn, 1, vcpu, prot); + + selftest_state.host = PKVM_NOPAGE; selftest_state.hyp = PKVM_PAGE_OWNED; assert_transition_res(0, __pkvm_host_donate_hyp, pfn, 1); + teardown_selftest_vm(); selftest_page->refcount = 1; hyp_put_page(&host_s2_pool, virt); } diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c index 4d8fcc7a3a41..3b0bee496bff 100644 --- a/arch/arm64/kvm/hyp/nvhe/mm.c +++ b/arch/arm64/kvm/hyp/nvhe/mm.c @@ -244,7 +244,7 @@ static void *fixmap_map_slot(struct hyp_fixmap_slot *slot, phys_addr_t phys) void *hyp_fixmap_map(phys_addr_t phys) { - return fixmap_map_slot(this_cpu_ptr(&fixmap_slots), phys); + return fixmap_map_slot(this_cpu_ptr(&fixmap_slots), phys) + offset_in_page(phys); } static void fixmap_clear_slot(struct hyp_fixmap_slot *slot) @@ -366,7 +366,7 @@ void *hyp_fixblock_map(phys_addr_t phys, size_t *size) #ifdef HAS_FIXBLOCK *size = PMD_SIZE; hyp_spin_lock(&hyp_fixblock_lock); - return fixmap_map_slot(&hyp_fixblock_slot, phys); + return fixmap_map_slot(&hyp_fixblock_slot, phys) + offset_in_page(phys); #else *size = PAGE_SIZE; return hyp_fixmap_map(phys); diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c index 2f029bfe4755..7ed96d64d611 100644 --- a/arch/arm64/kvm/hyp/nvhe/pkvm.c +++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c @@ -4,6 +4,8 @@ * Author: Fuad Tabba <tabba@google.com> */ +#include <kvm/arm_hypercalls.h> + #include <linux/kvm_host.h> #include <linux/mm.h> @@ -222,6 +224,7 @@ static struct pkvm_hyp_vm **vm_table; void pkvm_hyp_vm_table_init(void *tbl) { + 
BUILD_BUG_ON((u64)HANDLE_OFFSET + KVM_MAX_PVMS > (pkvm_handle_t)-1); WARN_ON(vm_table); vm_table = tbl; } @@ -229,10 +232,12 @@ void pkvm_hyp_vm_table_init(void *tbl) /* * Return the hyp vm structure corresponding to the handle. */ -static struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) +struct pkvm_hyp_vm *get_vm_by_handle(pkvm_handle_t handle) { unsigned int idx = vm_handle_to_idx(handle); + hyp_assert_lock_held(&vm_table_lock); + if (unlikely(idx >= KVM_MAX_PVMS)) return NULL; @@ -255,7 +260,10 @@ struct pkvm_hyp_vcpu *pkvm_load_hyp_vcpu(pkvm_handle_t handle, hyp_spin_lock(&vm_table_lock); hyp_vm = get_vm_by_handle(handle); - if (!hyp_vm || hyp_vm->kvm.created_vcpus <= vcpu_idx) + if (!hyp_vm || hyp_vm->kvm.arch.pkvm.is_dying) + goto unlock; + + if (hyp_vm->kvm.created_vcpus <= vcpu_idx) goto unlock; hyp_vcpu = hyp_vm->vcpus[vcpu_idx]; @@ -719,6 +727,55 @@ void __pkvm_unreserve_vm(pkvm_handle_t handle) hyp_spin_unlock(&vm_table_lock); } +#ifdef CONFIG_NVHE_EL2_DEBUG +static struct pkvm_hyp_vm selftest_vm = { + .kvm = { + .arch = { + .mmu = { + .arch = &selftest_vm.kvm.arch, + .pgt = &selftest_vm.pgt, + }, + }, + }, +}; + +static struct pkvm_hyp_vcpu selftest_vcpu = { + .vcpu = { + .arch = { + .hw_mmu = &selftest_vm.kvm.arch.mmu, + }, + .kvm = &selftest_vm.kvm, + }, +}; + +struct pkvm_hyp_vcpu *init_selftest_vm(void *virt) +{ + struct hyp_page *p = hyp_virt_to_page(virt); + int i; + + selftest_vm.kvm.arch.mmu.vtcr = host_mmu.arch.mmu.vtcr; + WARN_ON(kvm_guest_prepare_stage2(&selftest_vm, virt)); + + for (i = 0; i < pkvm_selftest_pages(); i++) { + if (p[i].refcount) + continue; + p[i].refcount = 1; + hyp_put_page(&selftest_vm.pool, hyp_page_to_virt(&p[i])); + } + + selftest_vm.kvm.arch.pkvm.handle = __pkvm_reserve_vm(); + insert_vm_table_entry(selftest_vm.kvm.arch.pkvm.handle, &selftest_vm); + return &selftest_vcpu; +} + +void teardown_selftest_vm(void) +{ + hyp_spin_lock(&vm_table_lock); + remove_vm_table_entry(selftest_vm.kvm.arch.pkvm.handle); + 
hyp_spin_unlock(&vm_table_lock); +} +#endif /* CONFIG_NVHE_EL2_DEBUG */ + /* * Initialize the hypervisor copy of the VM state using host-donated memory. * @@ -859,7 +916,54 @@ teardown_donated_memory(struct kvm_hyp_memcache *mc, void *addr, size_t size) unmap_donated_memory_noclear(addr, size); } -int __pkvm_teardown_vm(pkvm_handle_t handle) +int __pkvm_reclaim_dying_guest_page(pkvm_handle_t handle, u64 gfn) +{ + struct pkvm_hyp_vm *hyp_vm = get_pkvm_hyp_vm(handle); + int ret = -EINVAL; + + if (!hyp_vm) + return ret; + + if (hyp_vm->kvm.arch.pkvm.is_dying) + ret = __pkvm_host_reclaim_page_guest(gfn, hyp_vm); + + put_pkvm_hyp_vm(hyp_vm); + return ret; +} + +static struct pkvm_hyp_vm *get_pkvm_unref_hyp_vm_locked(pkvm_handle_t handle) +{ + struct pkvm_hyp_vm *hyp_vm; + + hyp_assert_lock_held(&vm_table_lock); + + hyp_vm = get_vm_by_handle(handle); + if (!hyp_vm || hyp_page_count(hyp_vm)) + return NULL; + + return hyp_vm; +} + +int __pkvm_start_teardown_vm(pkvm_handle_t handle) +{ + struct pkvm_hyp_vm *hyp_vm; + int ret = 0; + + hyp_spin_lock(&vm_table_lock); + hyp_vm = get_pkvm_unref_hyp_vm_locked(handle); + if (!hyp_vm || hyp_vm->kvm.arch.pkvm.is_dying) { + ret = -EINVAL; + goto unlock; + } + + hyp_vm->kvm.arch.pkvm.is_dying = true; +unlock: + hyp_spin_unlock(&vm_table_lock); + + return ret; +} + +int __pkvm_finalize_teardown_vm(pkvm_handle_t handle) { struct kvm_hyp_memcache *mc, *stage2_mc; struct pkvm_hyp_vm *hyp_vm; @@ -869,14 +973,9 @@ int __pkvm_teardown_vm(pkvm_handle_t handle) int err; hyp_spin_lock(&vm_table_lock); - hyp_vm = get_vm_by_handle(handle); - if (!hyp_vm) { - err = -ENOENT; - goto err_unlock; - } - - if (WARN_ON(hyp_page_count(hyp_vm))) { - err = -EBUSY; + hyp_vm = get_pkvm_unref_hyp_vm_locked(handle); + if (!hyp_vm || !hyp_vm->kvm.arch.pkvm.is_dying) { + err = -EINVAL; goto err_unlock; } @@ -922,3 +1021,121 @@ err_unlock: hyp_spin_unlock(&vm_table_lock); return err; } + +static u64 __pkvm_memshare_page_req(struct kvm_vcpu *vcpu, u64 ipa) +{ + u64 
elr;
+
+	/* Fake up a data abort (level 3 translation fault on write) */
+	vcpu->arch.fault.esr_el2 = (ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT) |
+				   ESR_ELx_WNR | ESR_ELx_FSC_FAULT |
+				   FIELD_PREP(ESR_ELx_FSC_LEVEL, 3);
+
+	/* Shuffle the IPA around into the HPFAR */
+	vcpu->arch.fault.hpfar_el2 = (HPFAR_EL2_NS | (ipa >> 8)) & HPFAR_MASK;
+
+	/* This is a virtual address. 0's good. Let's go with 0. */
+	vcpu->arch.fault.far_el2 = 0;
+
+	/* Rewind the ELR so we return to the HVC once the IPA is mapped */
+	elr = read_sysreg(elr_el2);
+	elr -= 4;
+	write_sysreg(elr, elr_el2);
+
+	return ARM_EXCEPTION_TRAP;
+}
+
+static bool pkvm_memshare_call(u64 *ret, struct kvm_vcpu *vcpu, u64 *exit_code)
+{
+	struct pkvm_hyp_vcpu *hyp_vcpu;
+	u64 ipa = smccc_get_arg1(vcpu);
+
+	if (!PAGE_ALIGNED(ipa))
+		goto out_guest;
+
+	hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu);
+	switch (__pkvm_guest_share_host(hyp_vcpu, hyp_phys_to_pfn(ipa))) {
+	case 0:
+		ret[0] = SMCCC_RET_SUCCESS;
+		goto out_guest;
+	case -ENOENT:
+		/*
+		 * Convert the exception into a data abort so that the page
+		 * being shared is mapped into the guest next time.
+		 */
+		*exit_code = __pkvm_memshare_page_req(vcpu, ipa);
+		goto out_host;
+	}
+
+out_guest:
+	return true;
+out_host:
+	return false;
+}
+
+static void pkvm_memunshare_call(u64 *ret, struct kvm_vcpu *vcpu)
+{
+	struct pkvm_hyp_vcpu *hyp_vcpu;
+	u64 ipa = smccc_get_arg1(vcpu);
+
+	if (!PAGE_ALIGNED(ipa))
+		return;
+
+	hyp_vcpu = container_of(vcpu, struct pkvm_hyp_vcpu, vcpu);
+	if (!__pkvm_guest_unshare_host(hyp_vcpu, hyp_phys_to_pfn(ipa)))
+		ret[0] = SMCCC_RET_SUCCESS;
+}
+
+/*
+ * Handler for protected VM HVC calls.
+ *
+ * Returns true if the hypervisor has handled the exit (and control
+ * should return to the guest) or false if it hasn't (and the handling
+ * should be performed by the host).
+ */
+bool kvm_handle_pvm_hvc64(struct kvm_vcpu *vcpu, u64 *exit_code)
+{
+	u64 val[4] = { SMCCC_RET_INVALID_PARAMETER };
+	bool handled = true;
+
+	switch (smccc_get_function(vcpu)) {
+	case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
+		val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
+		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_HYP_MEMINFO);
+		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_SHARE);
+		val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MEM_UNSHARE);
+		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_HYP_MEMINFO_FUNC_ID:
+		if (smccc_get_arg1(vcpu) ||
+		    smccc_get_arg2(vcpu) ||
+		    smccc_get_arg3(vcpu)) {
+			break;
+		}
+
+		val[0] = PAGE_SIZE;
+		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_SHARE_FUNC_ID:
+		if (smccc_get_arg2(vcpu) ||
+		    smccc_get_arg3(vcpu)) {
+			break;
+		}
+
+		handled = pkvm_memshare_call(val, vcpu, exit_code);
+		break;
+	case ARM_SMCCC_VENDOR_HYP_KVM_MEM_UNSHARE_FUNC_ID:
+		if (smccc_get_arg2(vcpu) ||
+		    smccc_get_arg3(vcpu)) {
+			break;
+		}
+
+		pkvm_memunshare_call(val, vcpu);
+		break;
+	default:
+		/* Punt everything else back to the host, for now. */
+		handled = false;
+	}
+
+	if (handled)
+		smccc_set_retval(vcpu, val[0], val[1], val[2], val[3]);
+	return handled;
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/psci-relay.c b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
index c3e196fb8b18..e20db999e328 100644
--- a/arch/arm64/kvm/hyp/nvhe/psci-relay.c
+++ b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
@@ -6,11 +6,12 @@
 
 #include <asm/kvm_asm.h>
 #include <asm/kvm_hyp.h>
+#include <asm/kvm_hypevents.h>
 #include <asm/kvm_mmu.h>
-#include <linux/arm-smccc.h>
 #include <linux/kvm_host.h>
 #include <uapi/linux/psci.h>
 
+#include <nvhe/arm-smccc.h>
 #include <nvhe/memory.h>
 #include <nvhe/trap_handler.h>
 
@@ -65,7 +66,7 @@ static unsigned long psci_call(unsigned long fn, unsigned long arg0,
 {
 	struct arm_smccc_res res;
 
-	arm_smccc_1_1_smc(fn, arg0, arg1, arg2, &res);
+	hyp_smccc_1_1_smc(fn, arg0, arg1, arg2, &res);
 	return res.a0;
 }
 
@@ -200,30 +201,42 @@ static int psci_system_suspend(u64 func_id, struct kvm_cpu_context *host_ctxt)
 			 __hyp_pa(init_params), 0);
 }
 
-asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on)
+static void __noreturn __kvm_host_psci_cpu_entry(unsigned long pc, unsigned long r0)
 {
-	struct psci_boot_args *boot_args;
-	struct kvm_cpu_context *host_ctxt;
-
-	host_ctxt = host_data_ptr(host_ctxt);
+	struct kvm_cpu_context *host_ctxt = host_data_ptr(host_ctxt);
 
-	if (is_cpu_on)
-		boot_args = this_cpu_ptr(&cpu_on_args);
-	else
-		boot_args = this_cpu_ptr(&suspend_args);
+	trace_hyp_enter(host_ctxt, HYP_REASON_PSCI);
 
-	cpu_reg(host_ctxt, 0) = boot_args->r0;
-	write_sysreg_el2(boot_args->pc, SYS_ELR);
-
-	if (is_cpu_on)
-		release_boot_args(boot_args);
+	cpu_reg(host_ctxt, 0) = r0;
+	write_sysreg_el2(pc, SYS_ELR);
 
 	write_sysreg_el1(INIT_SCTLR_EL1_MMU_OFF, SYS_SCTLR);
 	write_sysreg(INIT_PSTATE_EL1, SPSR_EL2);
 
+	trace_hyp_exit(host_ctxt, HYP_REASON_PSCI);
 	__host_enter(host_ctxt);
 }
 
+asmlinkage void __noreturn __kvm_host_psci_cpu_on_entry(void)
+{
+	struct psci_boot_args *boot_args = this_cpu_ptr(&cpu_on_args);
+	unsigned long pc, r0;
+
+	pc = READ_ONCE(boot_args->pc);
+	r0 = READ_ONCE(boot_args->r0);
+
+	release_boot_args(boot_args);
+
+	__kvm_host_psci_cpu_entry(pc, r0);
+}
+
+asmlinkage void __noreturn __kvm_host_psci_cpu_resume_entry(void)
+{
+	struct psci_boot_args *boot_args = this_cpu_ptr(&suspend_args);
+
+	__kvm_host_psci_cpu_entry(boot_args->pc, boot_args->r0);
+}
+
 static unsigned long psci_0_1_handler(u64 func_id, struct kvm_cpu_context *host_ctxt)
 {
 	if (is_psci_0_1(cpu_off, func_id) || is_psci_0_1(migrate, func_id))
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 90bd014e952f..d8e5b563fd3d 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -341,8 +341,7 @@ out:
 	__host_enter(host_ctxt);
 }
 
-int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
-		unsigned long *per_cpu_base, u32 hyp_va_bits)
+int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long *per_cpu_base, u32 hyp_va_bits)
 {
 	struct kvm_nvhe_init_params *params;
 	void *virt = hyp_phys_to_virt(phys);
@@ -355,7 +354,6 @@ int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
 		return -EINVAL;
 
 	hyp_spin_lock_init(&pkvm_pgd_lock);
-	hyp_nr_cpus = nr_cpus;
 
 	ret = divide_memory_pool(virt, size);
 	if (ret)
diff --git a/arch/arm64/kvm/hyp/nvhe/stacktrace.c b/arch/arm64/kvm/hyp/nvhe/stacktrace.c
index 5b6eeab1a774..7c832d60d22b 100644
--- a/arch/arm64/kvm/hyp/nvhe/stacktrace.c
+++ b/arch/arm64/kvm/hyp/nvhe/stacktrace.c
@@ -34,7 +34,7 @@ static void hyp_prepare_backtrace(unsigned long fp, unsigned long pc)
 	stacktrace_info->pc = pc;
 }
 
-#ifdef CONFIG_PROTECTED_NVHE_STACKTRACE
+#ifdef CONFIG_PKVM_STACKTRACE
 #include <asm/stacktrace/nvhe.h>
 
 DEFINE_PER_CPU(unsigned long [NVHE_STACKTRACE_SIZE/sizeof(long)], pkvm_stacktrace);
@@ -134,11 +134,11 @@ static void pkvm_save_backtrace(unsigned long fp, unsigned long pc)
 
 	unwind(&state, pkvm_save_backtrace_entry, &idx);
 }
-#else /* !CONFIG_PROTECTED_NVHE_STACKTRACE */
+#else /* !CONFIG_PKVM_STACKTRACE */
 static void pkvm_save_backtrace(unsigned long fp, unsigned long pc)
 {
 }
-#endif /* CONFIG_PROTECTED_NVHE_STACKTRACE */
+#endif /* CONFIG_PKVM_STACKTRACE */
 
 /*
  * kvm_nvhe_prepare_backtrace - prepare to dump the nVHE backtrace
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 779089e42681..8d1df3d33595 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -7,7 +7,6 @@
 #include <hyp/switch.h>
 #include <hyp/sysreg-sr.h>
 
-#include <linux/arm-smccc.h>
 #include <linux/kvm_host.h>
 #include <linux/types.h>
 #include <linux/jump_label.h>
@@ -21,6 +20,7 @@
 #include <asm/kvm_asm.h>
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_hyp.h>
+#include <asm/kvm_hypevents.h>
 #include <asm/kvm_mmu.h>
 #include <asm/fpsimd.h>
 #include <asm/debug-monitors.h>
@@ -44,6 +44,9 @@ struct fgt_masks hfgwtr2_masks;
 struct fgt_masks hfgitr2_masks;
 struct fgt_masks hdfgrtr2_masks;
 struct fgt_masks hdfgwtr2_masks;
+struct fgt_masks ich_hfgrtr_masks;
+struct fgt_masks ich_hfgwtr_masks;
+struct fgt_masks ich_hfgitr_masks;
 
 extern void kvm_nvhe_prepare_backtrace(unsigned long fp, unsigned long pc);
 
@@ -110,6 +113,12 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
 /* Save VGICv3 state on non-VHE systems */
 static void __hyp_vgic_save_state(struct kvm_vcpu *vcpu)
 {
+	if (vgic_is_v5(kern_hyp_va(vcpu->kvm))) {
+		__vgic_v5_save_state(&vcpu->arch.vgic_cpu.vgic_v5);
+		__vgic_v5_save_ppi_state(&vcpu->arch.vgic_cpu.vgic_v5);
+		return;
+	}
+
 	if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) {
 		__vgic_v3_save_state(&vcpu->arch.vgic_cpu.vgic_v3);
 		__vgic_v3_deactivate_traps(&vcpu->arch.vgic_cpu.vgic_v3);
@@ -119,6 +128,12 @@ static void __hyp_vgic_save_state(struct kvm_vcpu *vcpu)
 /* Restore VGICv3 state on non-VHE systems */
 static void __hyp_vgic_restore_state(struct kvm_vcpu *vcpu)
 {
+	if (vgic_is_v5(kern_hyp_va(vcpu->kvm))) {
+		__vgic_v5_restore_state(&vcpu->arch.vgic_cpu.vgic_v5);
+		__vgic_v5_restore_ppi_state(&vcpu->arch.vgic_cpu.vgic_v5);
+		return;
+	}
+
 	if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) {
 		__vgic_v3_activate_traps(&vcpu->arch.vgic_cpu.vgic_v3);
 		__vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3);
@@ -190,6 +205,7 @@ static const exit_handler_fn hyp_exit_handlers[] = {
 
 static const exit_handler_fn pvm_exit_handlers[] = {
 	[0 ... ESR_ELx_EC_MAX]		= NULL,
+	[ESR_ELx_EC_HVC64]		= kvm_handle_pvm_hvc64,
 	[ESR_ELx_EC_SYS64]		= kvm_handle_pvm_sys64,
 	[ESR_ELx_EC_SVE]		= kvm_handle_pvm_restricted,
 	[ESR_ELx_EC_FP_ASIMD]		= kvm_hyp_handle_fpsimd,
@@ -278,7 +294,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	 * We're about to restore some new MMU state. Make sure
 	 * ongoing page-table walks that have started before we
 	 * trapped to EL2 have completed. This also synchronises the
-	 * above disabling of BRBE, SPE and TRBE.
+	 * above disabling of BRBE.
 	 *
 	 * See DDI0487I.a D8.1.5 "Out-of-context translation regimes",
 	 * rule R_LFHQG and subsequent information statements.
@@ -308,10 +324,13 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	__debug_switch_to_guest(vcpu);
 
 	do {
+		trace_hyp_exit(host_ctxt, HYP_REASON_ERET_GUEST);
+
 		/* Jump in the fire! */
 		exit_code = __guest_enter(vcpu);
 
 		/* And we're baaack! */
+		trace_hyp_enter(host_ctxt, HYP_REASON_GUEST_EXIT);
 	} while (fixup_guest_exit(vcpu, &exit_code));
 
 	__sysreg_save_state_nvhe(guest_ctxt);
diff --git a/arch/arm64/kvm/hyp/nvhe/sys_regs.c b/arch/arm64/kvm/hyp/nvhe/sys_regs.c
index 06d28621722e..8c3fbb413a06 100644
--- a/arch/arm64/kvm/hyp/nvhe/sys_regs.c
+++ b/arch/arm64/kvm/hyp/nvhe/sys_regs.c
@@ -20,6 +20,7 @@
  */
 u64 id_aa64pfr0_el1_sys_val;
 u64 id_aa64pfr1_el1_sys_val;
+u64 id_aa64pfr2_el1_sys_val;
 u64 id_aa64isar0_el1_sys_val;
 u64 id_aa64isar1_el1_sys_val;
 u64 id_aa64isar2_el1_sys_val;
@@ -108,6 +109,11 @@ static const struct pvm_ftr_bits pvmid_aa64pfr1[] = {
 	FEAT_END
 };
 
+static const struct pvm_ftr_bits pvmid_aa64pfr2[] = {
+	MAX_FEAT(ID_AA64PFR2_EL1, GCIE, NI),
+	FEAT_END
+};
+
 static const struct pvm_ftr_bits pvmid_aa64mmfr0[] = {
 	MAX_FEAT_ENUM(ID_AA64MMFR0_EL1, PARANGE, 40),
 	MAX_FEAT_ENUM(ID_AA64MMFR0_EL1, ASIDBITS, 16),
@@ -221,6 +227,8 @@ static u64 pvm_calc_id_reg(const struct kvm_vcpu *vcpu, u32 id)
 		return get_restricted_features(vcpu, id_aa64pfr0_el1_sys_val, pvmid_aa64pfr0);
 	case SYS_ID_AA64PFR1_EL1:
 		return get_restricted_features(vcpu, id_aa64pfr1_el1_sys_val, pvmid_aa64pfr1);
+	case SYS_ID_AA64PFR2_EL1:
+		return get_restricted_features(vcpu, id_aa64pfr2_el1_sys_val, pvmid_aa64pfr2);
 	case SYS_ID_AA64ISAR0_EL1:
 		return id_aa64isar0_el1_sys_val;
 	case SYS_ID_AA64ISAR1_EL1:
@@ -392,6 +400,14 @@ static const struct sys_reg_desc pvm_sys_reg_descs[] = {
 	/* Cache maintenance by set/way operations are restricted. */
 
 	/* Debug and Trace Registers are restricted. */
+	RAZ_WI(SYS_DBGBVRn_EL1(0)),
+	RAZ_WI(SYS_DBGBCRn_EL1(0)),
+	RAZ_WI(SYS_DBGWVRn_EL1(0)),
+	RAZ_WI(SYS_DBGWCRn_EL1(0)),
+	RAZ_WI(SYS_MDSCR_EL1),
+	RAZ_WI(SYS_OSLAR_EL1),
+	RAZ_WI(SYS_OSLSR_EL1),
+	RAZ_WI(SYS_OSDLR_EL1),
 
 	/* Group 1 ID registers */
 	HOST_HANDLED(SYS_REVIDR_EL1),
@@ -431,7 +447,7 @@ static const struct sys_reg_desc pvm_sys_reg_descs[] = {
 	/* CRm=4 */
 	AARCH64(SYS_ID_AA64PFR0_EL1),
 	AARCH64(SYS_ID_AA64PFR1_EL1),
-	ID_UNALLOCATED(4,2),
+	AARCH64(SYS_ID_AA64PFR2_EL1),
 	ID_UNALLOCATED(4,3),
 	AARCH64(SYS_ID_AA64ZFR0_EL1),
 	ID_UNALLOCATED(4,5),
diff --git a/arch/arm64/kvm/hyp/nvhe/trace.c b/arch/arm64/kvm/hyp/nvhe/trace.c
new file mode 100644
index 000000000000..a6ca27b18e15
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/trace.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025 Google LLC
+ * Author: Vincent Donnefort <vdonnefort@google.com>
+ */
+
+#include <nvhe/clock.h>
+#include <nvhe/mem_protect.h>
+#include <nvhe/mm.h>
+#include <nvhe/trace.h>
+
+#include <asm/percpu.h>
+#include <asm/kvm_mmu.h>
+#include <asm/local.h>
+
+#include "simple_ring_buffer.c"
+
+static DEFINE_PER_CPU(struct simple_rb_per_cpu, __simple_rbs);
+
+static struct hyp_trace_buffer {
+	struct simple_rb_per_cpu __percpu *simple_rbs;
+	void *bpages_backing_start;
+	size_t bpages_backing_size;
+	hyp_spinlock_t lock;
+} trace_buffer = {
+	.simple_rbs = &__simple_rbs,
+	.lock = __HYP_SPIN_LOCK_UNLOCKED,
+};
+
+static bool hyp_trace_buffer_loaded(struct hyp_trace_buffer *trace_buffer)
+{
+	return trace_buffer->bpages_backing_size > 0;
+}
+
+void *tracing_reserve_entry(unsigned long length)
+{
+	return simple_ring_buffer_reserve(this_cpu_ptr(trace_buffer.simple_rbs), length,
+					  trace_clock());
+}
+
+void tracing_commit_entry(void)
+{
+	simple_ring_buffer_commit(this_cpu_ptr(trace_buffer.simple_rbs));
+}
+
+static int __admit_host_mem(void *start, u64 size)
+{
+	if (!PAGE_ALIGNED(start) || !PAGE_ALIGNED(size) || !size)
+		return -EINVAL;
+
+	if (!is_protected_kvm_enabled())
+		return 0;
+
+	return __pkvm_host_donate_hyp(hyp_virt_to_pfn(start), size >> PAGE_SHIFT);
+}
+
+static void __release_host_mem(void *start, u64 size)
+{
+	if (!is_protected_kvm_enabled())
+		return;
+
+	WARN_ON(__pkvm_hyp_donate_host(hyp_virt_to_pfn(start), size >> PAGE_SHIFT));
+}
+
+static int hyp_trace_buffer_load_bpage_backing(struct hyp_trace_buffer *trace_buffer,
+					       struct hyp_trace_desc *desc)
+{
+	void *start = (void *)kern_hyp_va(desc->bpages_backing_start);
+	size_t size = desc->bpages_backing_size;
+	int ret;
+
+	ret = __admit_host_mem(start, size);
+	if (ret)
+		return ret;
+
+	memset(start, 0, size);
+
+	trace_buffer->bpages_backing_start = start;
+	trace_buffer->bpages_backing_size = size;
+
+	return 0;
+}
+
+static void hyp_trace_buffer_unload_bpage_backing(struct hyp_trace_buffer *trace_buffer)
+{
+	void *start = trace_buffer->bpages_backing_start;
+	size_t size = trace_buffer->bpages_backing_size;
+
+	if (!size)
+		return;
+
+	memset(start, 0, size);
+
+	__release_host_mem(start, size);
+
+	trace_buffer->bpages_backing_start = 0;
+	trace_buffer->bpages_backing_size = 0;
+}
+
+static void *__pin_shared_page(unsigned long kern_va)
+{
+	void *va = kern_hyp_va((void *)kern_va);
+
+	if (!is_protected_kvm_enabled())
+		return va;
+
+	return hyp_pin_shared_mem(va, va + PAGE_SIZE) ? NULL : va;
+}
+
+static void __unpin_shared_page(void *va)
+{
+	if (!is_protected_kvm_enabled())
+		return;
+
+	hyp_unpin_shared_mem(va, va + PAGE_SIZE);
+}
+
+static void hyp_trace_buffer_unload(struct hyp_trace_buffer *trace_buffer)
+{
+	int cpu;
+
+	hyp_assert_lock_held(&trace_buffer->lock);
+
+	if (!hyp_trace_buffer_loaded(trace_buffer))
+		return;
+
+	for (cpu = 0; cpu < hyp_nr_cpus; cpu++)
+		simple_ring_buffer_unload_mm(per_cpu_ptr(trace_buffer->simple_rbs, cpu),
+					     __unpin_shared_page);
+
+	hyp_trace_buffer_unload_bpage_backing(trace_buffer);
+}
+
+static int hyp_trace_buffer_load(struct hyp_trace_buffer *trace_buffer,
+				 struct hyp_trace_desc *desc)
+{
+	struct simple_buffer_page *bpages;
+	struct ring_buffer_desc *rb_desc;
+	int ret, cpu;
+
+	hyp_assert_lock_held(&trace_buffer->lock);
+
+	if (hyp_trace_buffer_loaded(trace_buffer))
+		return -EINVAL;
+
+	ret = hyp_trace_buffer_load_bpage_backing(trace_buffer, desc);
+	if (ret)
+		return ret;
+
+	bpages = trace_buffer->bpages_backing_start;
+	for_each_ring_buffer_desc(rb_desc, cpu, &desc->trace_buffer_desc) {
+		ret = simple_ring_buffer_init_mm(per_cpu_ptr(trace_buffer->simple_rbs, cpu),
+						 bpages, rb_desc, __pin_shared_page,
+						 __unpin_shared_page);
+		if (ret)
+			break;
+
+		bpages += rb_desc->nr_page_va;
+	}
+
+	if (ret)
+		hyp_trace_buffer_unload(trace_buffer);
+
+	return ret;
+}
+
+static bool hyp_trace_desc_validate(struct hyp_trace_desc *desc, size_t desc_size)
+{
+	struct ring_buffer_desc *rb_desc;
+	unsigned int cpu;
+	size_t nr_bpages;
+	void *desc_end;
+
+	/*
+	 * Both desc_size and bpages_backing_size are untrusted host-provided
+	 * values. We rely on __pkvm_host_donate_hyp() to enforce their validity.
+	 */
+	desc_end = (void *)desc + desc_size;
+	nr_bpages = desc->bpages_backing_size / sizeof(struct simple_buffer_page);
+
+	for_each_ring_buffer_desc(rb_desc, cpu, &desc->trace_buffer_desc) {
+		/* Can we read nr_page_va? */
+		if ((void *)rb_desc + struct_size(rb_desc, page_va, 0) > desc_end)
+			return false;
+
+		/* Overflow desc? */
+		if ((void *)rb_desc + struct_size(rb_desc, page_va, rb_desc->nr_page_va) > desc_end)
+			return false;
+
+		/* Overflow bpages backing memory? */
+		if (nr_bpages < rb_desc->nr_page_va)
+			return false;
+
+		if (cpu >= hyp_nr_cpus)
+			return false;
+
+		if (cpu != rb_desc->cpu)
+			return false;
+
+		nr_bpages -= rb_desc->nr_page_va;
+	}
+
+	return true;
+}
+
+int __tracing_load(unsigned long desc_hva, size_t desc_size)
+{
+	struct hyp_trace_desc *desc = (struct hyp_trace_desc *)kern_hyp_va(desc_hva);
+	int ret;
+
+	ret = __admit_host_mem(desc, desc_size);
+	if (ret)
+		return ret;
+
+	if (!hyp_trace_desc_validate(desc, desc_size))
+		goto err_release_desc;
+
+	hyp_spin_lock(&trace_buffer.lock);
+
+	ret = hyp_trace_buffer_load(&trace_buffer, desc);
+
+	hyp_spin_unlock(&trace_buffer.lock);
+
+err_release_desc:
+	__release_host_mem(desc, desc_size);
+	return ret;
+}
+
+void __tracing_unload(void)
+{
+	hyp_spin_lock(&trace_buffer.lock);
+	hyp_trace_buffer_unload(&trace_buffer);
+	hyp_spin_unlock(&trace_buffer.lock);
+}
+
+int __tracing_enable(bool enable)
+{
+	int cpu, ret = enable ? -EINVAL : 0;
+
+	hyp_spin_lock(&trace_buffer.lock);
+
+	if (!hyp_trace_buffer_loaded(&trace_buffer))
+		goto unlock;
+
+	for (cpu = 0; cpu < hyp_nr_cpus; cpu++)
+		simple_ring_buffer_enable_tracing(per_cpu_ptr(trace_buffer.simple_rbs, cpu),
+						  enable);
+
+	ret = 0;
+
+unlock:
+	hyp_spin_unlock(&trace_buffer.lock);
+
+	return ret;
+}
+
+int __tracing_swap_reader(unsigned int cpu)
+{
+	int ret = -ENODEV;
+
+	if (cpu >= hyp_nr_cpus)
+		return -EINVAL;
+
+	hyp_spin_lock(&trace_buffer.lock);
+
+	if (hyp_trace_buffer_loaded(&trace_buffer))
+		ret = simple_ring_buffer_swap_reader_page(
+				per_cpu_ptr(trace_buffer.simple_rbs, cpu));
+
+	hyp_spin_unlock(&trace_buffer.lock);
+
+	return ret;
+}
+
+void __tracing_update_clock(u32 mult, u32 shift, u64 epoch_ns, u64 epoch_cyc)
+{
+	int cpu;
+
+	/* After this loop, all CPUs are observing the new bank... */
+	for (cpu = 0; cpu < hyp_nr_cpus; cpu++) {
+		struct simple_rb_per_cpu *simple_rb = per_cpu_ptr(trace_buffer.simple_rbs, cpu);
+
+		while (READ_ONCE(simple_rb->status) == SIMPLE_RB_WRITING)
+			;
+	}
+
+	/* ...we can now override the old one and swap. */
+	trace_clock_update(mult, shift, epoch_ns, epoch_cyc);
+}
+
+int __tracing_reset(unsigned int cpu)
+{
+	int ret = -ENODEV;
+
+	if (cpu >= hyp_nr_cpus)
+		return -EINVAL;
+
+	hyp_spin_lock(&trace_buffer.lock);
+
+	if (hyp_trace_buffer_loaded(&trace_buffer))
+		ret = simple_ring_buffer_reset(per_cpu_ptr(trace_buffer.simple_rbs, cpu));
+
+	hyp_spin_unlock(&trace_buffer.lock);
+
+	return ret;
+}
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 30226f2d5564..0c1defa5fb0f 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -114,11 +114,6 @@ static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, s8 level)
 	return pte;
 }
 
-static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
-{
-	return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
-}
-
 static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data,
 				  const struct kvm_pgtable_visit_ctx *ctx,
 				  enum kvm_pgtable_walk_flags visit)
@@ -581,7 +576,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
 struct stage2_map_data {
 	const u64 phys;
 	kvm_pte_t attr;
-	u8 owner_id;
+	kvm_pte_t pte_annot;
 
 	kvm_pte_t *anchor;
 	kvm_pte_t *childp;
@@ -798,7 +793,11 @@ static bool stage2_pte_is_counted(kvm_pte_t pte)
 
 static bool stage2_pte_is_locked(kvm_pte_t pte)
 {
-	return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
+	if (kvm_pte_valid(pte))
+		return false;
+
+	return FIELD_GET(KVM_INVALID_PTE_TYPE_MASK, pte) ==
+	       KVM_INVALID_PTE_TYPE_LOCKED;
 }
 
 static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
@@ -829,6 +828,7 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
 				 struct kvm_s2_mmu *mmu)
 {
 	struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+	kvm_pte_t locked_pte;
 
 	if (stage2_pte_is_locked(ctx->old)) {
 		/*
@@ -839,7 +839,9 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
 		return false;
 	}
 
-	if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED))
+	locked_pte =
FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK, + KVM_INVALID_PTE_TYPE_LOCKED); + if (!stage2_try_set_pte(ctx, locked_pte)) return false; if (!kvm_pgtable_walk_skip_bbm_tlbi(ctx)) { @@ -964,7 +966,7 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx, if (!data->annotation) new = kvm_init_valid_leaf_pte(phys, data->attr, ctx->level); else - new = kvm_init_invalid_leaf_owner(data->owner_id); + new = data->pte_annot; /* * Skip updating the PTE if we are trying to recreate the exact @@ -1118,16 +1120,18 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, return ret; } -int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size, - void *mc, u8 owner_id) +int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size, + void *mc, enum kvm_invalid_pte_type type, + kvm_pte_t pte_annot) { int ret; struct stage2_map_data map_data = { .mmu = pgt->mmu, .memcache = mc, - .owner_id = owner_id, .force_pte = true, .annotation = true, + .pte_annot = pte_annot | + FIELD_PREP(KVM_INVALID_PTE_TYPE_MASK, type), }; struct kvm_pgtable_walker walker = { .cb = stage2_map_walker, @@ -1136,7 +1140,10 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size, .arg = &map_data, }; - if (owner_id > KVM_MAX_OWNER_ID) + if (pte_annot & ~KVM_INVALID_PTE_ANNOT_MASK) + return -EINVAL; + + if (!type || type == KVM_INVALID_PTE_TYPE_LOCKED) return -EINVAL; ret = kvm_pgtable_walk(pgt, addr, size, &walker); diff --git a/arch/arm64/kvm/hyp/vgic-v5-sr.c b/arch/arm64/kvm/hyp/vgic-v5-sr.c new file mode 100644 index 000000000000..47e6bcd43702 --- /dev/null +++ b/arch/arm64/kvm/hyp/vgic-v5-sr.c @@ -0,0 +1,166 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025, 2026 - Arm Ltd + */ + +#include <linux/irqchip/arm-gic-v5.h> + +#include <asm/kvm_hyp.h> + +void __vgic_v5_save_apr(struct vgic_v5_cpu_if *cpu_if) +{ + cpu_if->vgic_apr = read_sysreg_s(SYS_ICH_APR_EL2); +} + +static void 
__vgic_v5_compat_mode_disable(void) +{ + sysreg_clear_set_s(SYS_ICH_VCTLR_EL2, ICH_VCTLR_EL2_V3, 0); + isb(); +} + +void __vgic_v5_restore_vmcr_apr(struct vgic_v5_cpu_if *cpu_if) +{ + __vgic_v5_compat_mode_disable(); + + write_sysreg_s(cpu_if->vgic_vmcr, SYS_ICH_VMCR_EL2); + write_sysreg_s(cpu_if->vgic_apr, SYS_ICH_APR_EL2); +} + +void __vgic_v5_save_ppi_state(struct vgic_v5_cpu_if *cpu_if) +{ + /* + * The following code assumes that the bitmap storage that we have for + * PPIs is either 64 (architected PPIs, only) or 128 bits (architected & + * impdef PPIs). + */ + BUILD_BUG_ON(VGIC_V5_NR_PRIVATE_IRQS % 64); + + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit, + read_sysreg_s(SYS_ICH_PPI_ACTIVER0_EL2), 0, 64); + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr, + read_sysreg_s(SYS_ICH_PPI_PENDR0_EL2), 0, 64); + + cpu_if->vgic_ppi_priorityr[0] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR0_EL2); + cpu_if->vgic_ppi_priorityr[1] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR1_EL2); + cpu_if->vgic_ppi_priorityr[2] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR2_EL2); + cpu_if->vgic_ppi_priorityr[3] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR3_EL2); + cpu_if->vgic_ppi_priorityr[4] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR4_EL2); + cpu_if->vgic_ppi_priorityr[5] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR5_EL2); + cpu_if->vgic_ppi_priorityr[6] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR6_EL2); + cpu_if->vgic_ppi_priorityr[7] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR7_EL2); + + if (VGIC_V5_NR_PRIVATE_IRQS == 128) { + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->activer_exit, + read_sysreg_s(SYS_ICH_PPI_ACTIVER1_EL2), 64, 64); + bitmap_write(host_data_ptr(vgic_v5_ppi_state)->pendr, + read_sysreg_s(SYS_ICH_PPI_PENDR1_EL2), 64, 64); + + cpu_if->vgic_ppi_priorityr[8] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR8_EL2); + cpu_if->vgic_ppi_priorityr[9] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR9_EL2); + cpu_if->vgic_ppi_priorityr[10] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR10_EL2); + cpu_if->vgic_ppi_priorityr[11] = 
read_sysreg_s(SYS_ICH_PPI_PRIORITYR11_EL2); + cpu_if->vgic_ppi_priorityr[12] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR12_EL2); + cpu_if->vgic_ppi_priorityr[13] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR13_EL2); + cpu_if->vgic_ppi_priorityr[14] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR14_EL2); + cpu_if->vgic_ppi_priorityr[15] = read_sysreg_s(SYS_ICH_PPI_PRIORITYR15_EL2); + } + + /* Now that we are done, disable DVI */ + write_sysreg_s(0, SYS_ICH_PPI_DVIR0_EL2); + write_sysreg_s(0, SYS_ICH_PPI_DVIR1_EL2); +} + +void __vgic_v5_restore_ppi_state(struct vgic_v5_cpu_if *cpu_if) +{ + DECLARE_BITMAP(pendr, VGIC_V5_NR_PRIVATE_IRQS); + + /* We assume 64 or 128 PPIs - see above comment */ + BUILD_BUG_ON(VGIC_V5_NR_PRIVATE_IRQS % 64); + + /* Enable DVI so that the guest's interrupt config takes over */ + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_dvir, 0, 64), + SYS_ICH_PPI_DVIR0_EL2); + + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_activer, 0, 64), + SYS_ICH_PPI_ACTIVER0_EL2); + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_enabler, 0, 64), + SYS_ICH_PPI_ENABLER0_EL2); + + /* Update the pending state of the NON-DVI'd PPIs, only */ + bitmap_andnot(pendr, host_data_ptr(vgic_v5_ppi_state)->pendr, + cpu_if->vgic_ppi_dvir, VGIC_V5_NR_PRIVATE_IRQS); + write_sysreg_s(bitmap_read(pendr, 0, 64), SYS_ICH_PPI_PENDR0_EL2); + + write_sysreg_s(cpu_if->vgic_ppi_priorityr[0], + SYS_ICH_PPI_PRIORITYR0_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[1], + SYS_ICH_PPI_PRIORITYR1_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[2], + SYS_ICH_PPI_PRIORITYR2_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[3], + SYS_ICH_PPI_PRIORITYR3_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[4], + SYS_ICH_PPI_PRIORITYR4_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[5], + SYS_ICH_PPI_PRIORITYR5_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[6], + SYS_ICH_PPI_PRIORITYR6_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[7], + SYS_ICH_PPI_PRIORITYR7_EL2); + + if (VGIC_V5_NR_PRIVATE_IRQS == 128) { + /* Enable DVI 
so that the guest's interrupt config takes over */ + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_dvir, 64, 64), + SYS_ICH_PPI_DVIR1_EL2); + + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_activer, 64, 64), + SYS_ICH_PPI_ACTIVER1_EL2); + write_sysreg_s(bitmap_read(cpu_if->vgic_ppi_enabler, 64, 64), + SYS_ICH_PPI_ENABLER1_EL2); + write_sysreg_s(bitmap_read(pendr, 64, 64), + SYS_ICH_PPI_PENDR1_EL2); + + write_sysreg_s(cpu_if->vgic_ppi_priorityr[8], + SYS_ICH_PPI_PRIORITYR8_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[9], + SYS_ICH_PPI_PRIORITYR9_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[10], + SYS_ICH_PPI_PRIORITYR10_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[11], + SYS_ICH_PPI_PRIORITYR11_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[12], + SYS_ICH_PPI_PRIORITYR12_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[13], + SYS_ICH_PPI_PRIORITYR13_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[14], + SYS_ICH_PPI_PRIORITYR14_EL2); + write_sysreg_s(cpu_if->vgic_ppi_priorityr[15], + SYS_ICH_PPI_PRIORITYR15_EL2); + } else { + write_sysreg_s(0, SYS_ICH_PPI_DVIR1_EL2); + + write_sysreg_s(0, SYS_ICH_PPI_ACTIVER1_EL2); + write_sysreg_s(0, SYS_ICH_PPI_ENABLER1_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PENDR1_EL2); + + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR8_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR9_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR10_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR11_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR12_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR13_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR14_EL2); + write_sysreg_s(0, SYS_ICH_PPI_PRIORITYR15_EL2); + } +} + +void __vgic_v5_save_state(struct vgic_v5_cpu_if *cpu_if) +{ + cpu_if->vgic_vmcr = read_sysreg_s(SYS_ICH_VMCR_EL2); + cpu_if->vgic_icsr = read_sysreg_s(SYS_ICC_ICSR_EL1); +} + +void __vgic_v5_restore_state(struct vgic_v5_cpu_if *cpu_if) +{ + write_sysreg_s(cpu_if->vgic_icsr, SYS_ICC_ICSR_EL1); +} diff --git a/arch/arm64/kvm/hyp/vhe/Makefile 
b/arch/arm64/kvm/hyp/vhe/Makefile index afc4aed9231a..9695328bbd96 100644 --- a/arch/arm64/kvm/hyp/vhe/Makefile +++ b/arch/arm64/kvm/hyp/vhe/Makefile @@ -10,4 +10,4 @@ CFLAGS_switch.o += -Wno-override-init obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \ - ../fpsimd.o ../hyp-entry.o ../exception.o + ../fpsimd.o ../hyp-entry.o ../exception.o ../vgic-v5-sr.o diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c new file mode 100644 index 000000000000..8b7f2bf2fba8 --- /dev/null +++ b/arch/arm64/kvm/hyp_trace.c @@ -0,0 +1,442 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Google LLC + * Author: Vincent Donnefort <vdonnefort@google.com> + */ + +#include <linux/cpumask.h> +#include <linux/trace_remote.h> +#include <linux/tracefs.h> +#include <linux/simple_ring_buffer.h> + +#include <asm/arch_timer.h> +#include <asm/kvm_host.h> +#include <asm/kvm_hyptrace.h> +#include <asm/kvm_mmu.h> + +#include "hyp_trace.h" + +/* Same 10min used by clocksource when width is more than 32-bits */ +#define CLOCK_MAX_CONVERSION_S 600 +/* + * Time to give for the clock init. Long enough to get a good mult/shift + * estimation. Short enough to not delay the tracing start too much. + */ +#define CLOCK_INIT_MS 100 +/* + * Time between clock checks. Must be small enough to catch clock deviation when + * it is still tiny. 
+ */ +#define CLOCK_UPDATE_MS 500 + +static struct hyp_trace_clock { + u64 cycles; + u64 cyc_overflow64; + u64 boot; + u32 mult; + u32 shift; + struct delayed_work work; + struct completion ready; + struct mutex lock; + bool running; +} hyp_clock; + +static void __hyp_clock_work(struct work_struct *work) +{ + struct delayed_work *dwork = to_delayed_work(work); + struct hyp_trace_clock *hyp_clock; + struct system_time_snapshot snap; + u64 rate, delta_cycles; + u64 boot, delta_boot; + + hyp_clock = container_of(dwork, struct hyp_trace_clock, work); + + ktime_get_snapshot(&snap); + boot = ktime_to_ns(snap.boot); + + delta_boot = boot - hyp_clock->boot; + delta_cycles = snap.cycles - hyp_clock->cycles; + + /* Compare hyp clock with the kernel boot clock */ + if (hyp_clock->mult) { + u64 err, cur = delta_cycles; + + if (WARN_ON_ONCE(cur >= hyp_clock->cyc_overflow64)) { + __uint128_t tmp = (__uint128_t)cur * hyp_clock->mult; + + cur = tmp >> hyp_clock->shift; + } else { + cur *= hyp_clock->mult; + cur >>= hyp_clock->shift; + } + cur += hyp_clock->boot; + + err = abs_diff(cur, boot); + /* No deviation, only update epoch if necessary */ + if (!err) { + if (delta_cycles >= (hyp_clock->cyc_overflow64 >> 1)) + goto fast_forward; + + goto resched; + } + + /* Warn if the error is above tracing precision (1us) */ + if (err > NSEC_PER_USEC) + pr_warn_ratelimited("hyp trace clock off by %lluus\n", + err / NSEC_PER_USEC); + } + + rate = div64_u64(delta_cycles * NSEC_PER_SEC, delta_boot); + + clocks_calc_mult_shift(&hyp_clock->mult, &hyp_clock->shift, + rate, NSEC_PER_SEC, CLOCK_MAX_CONVERSION_S); + + /* Add a comfortable 50% margin */ + hyp_clock->cyc_overflow64 = (U64_MAX / hyp_clock->mult) >> 1; + +fast_forward: + hyp_clock->cycles = snap.cycles; + hyp_clock->boot = boot; + kvm_call_hyp_nvhe(__tracing_update_clock, hyp_clock->mult, + hyp_clock->shift, hyp_clock->boot, hyp_clock->cycles); + complete(&hyp_clock->ready); + +resched: + schedule_delayed_work(&hyp_clock->work, + 
msecs_to_jiffies(CLOCK_UPDATE_MS)); +} + +static void hyp_trace_clock_enable(struct hyp_trace_clock *hyp_clock, bool enable) +{ + struct system_time_snapshot snap; + + if (hyp_clock->running == enable) + return; + + if (!enable) { + cancel_delayed_work_sync(&hyp_clock->work); + hyp_clock->running = false; + } + + ktime_get_snapshot(&snap); + + hyp_clock->boot = ktime_to_ns(snap.boot); + hyp_clock->cycles = snap.cycles; + hyp_clock->mult = 0; + + init_completion(&hyp_clock->ready); + INIT_DELAYED_WORK(&hyp_clock->work, __hyp_clock_work); + schedule_delayed_work(&hyp_clock->work, msecs_to_jiffies(CLOCK_INIT_MS)); + wait_for_completion(&hyp_clock->ready); + hyp_clock->running = true; +} + +/* Access to this struct within the trace_remote_callbacks are protected by the trace_remote lock */ +static struct hyp_trace_buffer { + struct hyp_trace_desc *desc; + size_t desc_size; +} trace_buffer; + +static int __map_hyp(void *start, size_t size) +{ + if (is_protected_kvm_enabled()) + return 0; + + return create_hyp_mappings(start, start + size, PAGE_HYP); +} + +static int __share_page(unsigned long va) +{ + return kvm_share_hyp((void *)va, (void *)va + 1); +} + +static void __unshare_page(unsigned long va) +{ + kvm_unshare_hyp((void *)va, (void *)va + 1); +} + +static int hyp_trace_buffer_alloc_bpages_backing(struct hyp_trace_buffer *trace_buffer, size_t size) +{ + int nr_bpages = (PAGE_ALIGN(size) / PAGE_SIZE) + 1; + size_t backing_size; + void *start; + + backing_size = PAGE_ALIGN(sizeof(struct simple_buffer_page) * nr_bpages * + num_possible_cpus()); + + start = alloc_pages_exact(backing_size, GFP_KERNEL_ACCOUNT); + if (!start) + return -ENOMEM; + + trace_buffer->desc->bpages_backing_start = (unsigned long)start; + trace_buffer->desc->bpages_backing_size = backing_size; + + return __map_hyp(start, backing_size); +} + +static void hyp_trace_buffer_free_bpages_backing(struct hyp_trace_buffer *trace_buffer) +{ + free_pages_exact((void 
*)trace_buffer->desc->bpages_backing_start, + trace_buffer->desc->bpages_backing_size); +} + +static void hyp_trace_buffer_unshare_hyp(struct hyp_trace_buffer *trace_buffer, int last_cpu) +{ + struct ring_buffer_desc *rb_desc; + int cpu, p; + + for_each_ring_buffer_desc(rb_desc, cpu, &trace_buffer->desc->trace_buffer_desc) { + if (cpu > last_cpu) + break; + + __share_page(rb_desc->meta_va); + for (p = 0; p < rb_desc->nr_page_va; p++) + __unshare_page(rb_desc->page_va[p]); + } +} + +static int hyp_trace_buffer_share_hyp(struct hyp_trace_buffer *trace_buffer) +{ + struct ring_buffer_desc *rb_desc; + int cpu, p, ret = 0; + + for_each_ring_buffer_desc(rb_desc, cpu, &trace_buffer->desc->trace_buffer_desc) { + ret = __share_page(rb_desc->meta_va); + if (ret) + break; + + for (p = 0; p < rb_desc->nr_page_va; p++) { + ret = __share_page(rb_desc->page_va[p]); + if (ret) + break; + } + + if (ret) { + for (p--; p >= 0; p--) + __unshare_page(rb_desc->page_va[p]); + break; + } + } + + if (ret) + hyp_trace_buffer_unshare_hyp(trace_buffer, cpu--); + + return ret; +} + +static struct trace_buffer_desc *hyp_trace_load(unsigned long size, void *priv) +{ + struct hyp_trace_buffer *trace_buffer = priv; + struct hyp_trace_desc *desc; + size_t desc_size; + int ret; + + if (WARN_ON(trace_buffer->desc)) + return ERR_PTR(-EINVAL); + + desc_size = trace_buffer_desc_size(size, num_possible_cpus()); + if (desc_size == SIZE_MAX) + return ERR_PTR(-E2BIG); + + desc_size = PAGE_ALIGN(desc_size); + desc = (struct hyp_trace_desc *)alloc_pages_exact(desc_size, GFP_KERNEL); + if (!desc) + return ERR_PTR(-ENOMEM); + + ret = __map_hyp(desc, desc_size); + if (ret) + goto err_free_desc; + + trace_buffer->desc = desc; + + ret = hyp_trace_buffer_alloc_bpages_backing(trace_buffer, size); + if (ret) + goto err_free_desc; + + ret = trace_remote_alloc_buffer(&desc->trace_buffer_desc, desc_size, size, + cpu_possible_mask); + if (ret) + goto err_free_backing; + + ret = hyp_trace_buffer_share_hyp(trace_buffer); + 
if (ret) + goto err_free_buffer; + + ret = kvm_call_hyp_nvhe(__tracing_load, (unsigned long)desc, desc_size); + if (ret) + goto err_unload_pages; + + return &desc->trace_buffer_desc; + +err_unload_pages: + hyp_trace_buffer_unshare_hyp(trace_buffer, INT_MAX); + +err_free_buffer: + trace_remote_free_buffer(&desc->trace_buffer_desc); + +err_free_backing: + hyp_trace_buffer_free_bpages_backing(trace_buffer); + +err_free_desc: + free_pages_exact(desc, desc_size); + trace_buffer->desc = NULL; + + return ERR_PTR(ret); +} + +static void hyp_trace_unload(struct trace_buffer_desc *desc, void *priv) +{ + struct hyp_trace_buffer *trace_buffer = priv; + + if (WARN_ON(desc != &trace_buffer->desc->trace_buffer_desc)) + return; + + kvm_call_hyp_nvhe(__tracing_unload); + hyp_trace_buffer_unshare_hyp(trace_buffer, INT_MAX); + trace_remote_free_buffer(desc); + hyp_trace_buffer_free_bpages_backing(trace_buffer); + free_pages_exact(trace_buffer->desc, trace_buffer->desc_size); + trace_buffer->desc = NULL; +} + +static int hyp_trace_enable_tracing(bool enable, void *priv) +{ + hyp_trace_clock_enable(&hyp_clock, enable); + + return kvm_call_hyp_nvhe(__tracing_enable, enable); +} + +static int hyp_trace_swap_reader_page(unsigned int cpu, void *priv) +{ + return kvm_call_hyp_nvhe(__tracing_swap_reader, cpu); +} + +static int hyp_trace_reset(unsigned int cpu, void *priv) +{ + return kvm_call_hyp_nvhe(__tracing_reset, cpu); +} + +static int hyp_trace_enable_event(unsigned short id, bool enable, void *priv) +{ + struct hyp_event_id *event_id = lm_alias(&__hyp_event_ids_start[id]); + struct page *page; + atomic_t *enabled; + void *map; + + if (is_protected_kvm_enabled()) + return kvm_call_hyp_nvhe(__tracing_enable_event, id, enable); + + enabled = &event_id->enabled; + page = virt_to_page(enabled); + map = vmap(&page, 1, VM_MAP, PAGE_KERNEL); + if (!map) + return -ENOMEM; + + enabled = map + offset_in_page(enabled); + atomic_set(enabled, enable); + + vunmap(map); + + return 0; +} + +static int 
hyp_trace_clock_show(struct seq_file *m, void *v) +{ + seq_puts(m, "[boot]\n"); + + return 0; +} +DEFINE_SHOW_ATTRIBUTE(hyp_trace_clock); + +static ssize_t hyp_trace_write_event_write(struct file *f, const char __user *ubuf, + size_t cnt, loff_t *pos) +{ + unsigned long val; + int ret; + + ret = kstrtoul_from_user(ubuf, cnt, 10, &val); + if (ret) + return ret; + + kvm_call_hyp_nvhe(__tracing_write_event, val); + + return cnt; +} + +static const struct file_operations hyp_trace_write_event_fops = { + .write = hyp_trace_write_event_write, +}; + +static int hyp_trace_init_tracefs(struct dentry *d, void *priv) +{ + if (!tracefs_create_file("write_event", 0200, d, NULL, &hyp_trace_write_event_fops)) + return -ENOMEM; + + return tracefs_create_file("trace_clock", 0440, d, NULL, &hyp_trace_clock_fops) ? + 0 : -ENOMEM; +} + +static struct trace_remote_callbacks trace_remote_callbacks = { + .init = hyp_trace_init_tracefs, + .load_trace_buffer = hyp_trace_load, + .unload_trace_buffer = hyp_trace_unload, + .enable_tracing = hyp_trace_enable_tracing, + .swap_reader_page = hyp_trace_swap_reader_page, + .reset = hyp_trace_reset, + .enable_event = hyp_trace_enable_event, +}; + +static const char *__hyp_enter_exit_reason_str(u8 reason); + +#include <asm/kvm_define_hypevents.h> + +static const char *__hyp_enter_exit_reason_str(u8 reason) +{ + static const char strs[][12] = { + "smc", + "hvc", + "psci", + "host_abort", + "guest_exit", + "eret_host", + "eret_guest", + "unknown", + }; + + return strs[min(reason, HYP_REASON_UNKNOWN)]; +} + +static void __init hyp_trace_init_events(void) +{ + struct hyp_event_id *hyp_event_id = __hyp_event_ids_start; + struct remote_event *event = __hyp_events_start; + int id = 0; + + /* Events on both sides hypervisor are sorted */ + for (; event < __hyp_events_end; event++, hyp_event_id++, id++) + event->id = hyp_event_id->id = id; +} + +int __init kvm_hyp_trace_init(void) +{ + int cpu; + + if (is_kernel_in_hyp_mode()) + return 0; + + 
for_each_possible_cpu(cpu) { + const struct arch_timer_erratum_workaround *wa = + per_cpu(timer_unstable_counter_workaround, cpu); + + if (IS_ENABLED(CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND) && + wa && wa->read_cntvct_el0) { + pr_warn("hyp trace can't handle CNTVCT workaround '%s'\n", wa->desc); + return -EOPNOTSUPP; + } + } + + hyp_trace_init_events(); + + return trace_remote_register("hypervisor", &trace_remote_callbacks, &trace_buffer, + __hyp_events_start, __hyp_events_end - __hyp_events_start); +} diff --git a/arch/arm64/kvm/hyp_trace.h b/arch/arm64/kvm/hyp_trace.h new file mode 100644 index 000000000000..c991b1ec65f1 --- /dev/null +++ b/arch/arm64/kvm/hyp_trace.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __ARM64_KVM_HYP_TRACE_H__ +#define __ARM64_KVM_HYP_TRACE_H__ + +#ifdef CONFIG_NVHE_EL2_TRACING +int kvm_hyp_trace_init(void); +#else +static inline int kvm_hyp_trace_init(void) { return 0; } +#endif +#endif diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 17d64a1e11e5..d089c107d9b7 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -340,6 +340,9 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size, bool may_block) { + if (kvm_vm_is_protected(kvm_s2_mmu_to_kvm(mmu))) + return; + __unmap_stage2_range(mmu, start, size, may_block); } @@ -878,9 +881,6 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type) u64 mmfr0, mmfr1; u32 phys_shift; - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) - return -EINVAL; - phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); if (is_protected_kvm_enabled()) { phys_shift = kvm_ipa_limit; @@ -1013,6 +1013,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t out_destroy_pgtable: kvm_stage2_destroy(pgt); + mmu->pgt = NULL; out_free_pgtable: kfree(pgt); return err; @@ -1400,10 +1401,10 @@ static bool 
fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot, */ static long transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot, - unsigned long hva, kvm_pfn_t *pfnp, - phys_addr_t *ipap) + unsigned long hva, kvm_pfn_t *pfnp, gfn_t *gfnp) { kvm_pfn_t pfn = *pfnp; + gfn_t gfn = *gfnp; /* * Make sure the adjustment is done only for THP pages. Also make @@ -1419,7 +1420,8 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot, if (sz < PMD_SIZE) return PAGE_SIZE; - *ipap &= PMD_MASK; + gfn &= ~(PTRS_PER_PMD - 1); + *gfnp = gfn; pfn &= ~(PTRS_PER_PMD - 1); *pfnp = pfn; @@ -1512,25 +1514,22 @@ static bool kvm_vma_is_cacheable(struct vm_area_struct *vma) } } -static int prepare_mmu_memcache(struct kvm_vcpu *vcpu, bool topup_memcache, - void **memcache) +static void *get_mmu_memcache(struct kvm_vcpu *vcpu) { - int min_pages; - if (!is_protected_kvm_enabled()) - *memcache = &vcpu->arch.mmu_page_cache; + return &vcpu->arch.mmu_page_cache; else - *memcache = &vcpu->arch.pkvm_memcache; - - if (!topup_memcache) - return 0; + return &vcpu->arch.pkvm_memcache; +} - min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu); +static int topup_mmu_memcache(struct kvm_vcpu *vcpu, void *memcache) +{ + int min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu); if (!is_protected_kvm_enabled()) - return kvm_mmu_topup_memory_cache(*memcache, min_pages); + return kvm_mmu_topup_memory_cache(memcache, min_pages); - return topup_hyp_memcache(*memcache, min_pages); + return topup_hyp_memcache(memcache, min_pages); } /* @@ -1543,54 +1542,63 @@ static int prepare_mmu_memcache(struct kvm_vcpu *vcpu, bool topup_memcache, * TLB invalidation from the guest and used to limit the invalidation scope if a * TTL hint or a range isn't provided. 
*/ -static void adjust_nested_fault_perms(struct kvm_s2_trans *nested, - enum kvm_pgtable_prot *prot, - bool *writable) +static enum kvm_pgtable_prot adjust_nested_fault_perms(struct kvm_s2_trans *nested, + enum kvm_pgtable_prot prot) { - *writable &= kvm_s2_trans_writable(nested); + if (!kvm_s2_trans_writable(nested)) + prot &= ~KVM_PGTABLE_PROT_W; if (!kvm_s2_trans_readable(nested)) - *prot &= ~KVM_PGTABLE_PROT_R; + prot &= ~KVM_PGTABLE_PROT_R; - *prot |= kvm_encode_nested_level(nested); + return prot | kvm_encode_nested_level(nested); } -static void adjust_nested_exec_perms(struct kvm *kvm, - struct kvm_s2_trans *nested, - enum kvm_pgtable_prot *prot) +static enum kvm_pgtable_prot adjust_nested_exec_perms(struct kvm *kvm, + struct kvm_s2_trans *nested, + enum kvm_pgtable_prot prot) { if (!kvm_s2_trans_exec_el0(kvm, nested)) - *prot &= ~KVM_PGTABLE_PROT_UX; + prot &= ~KVM_PGTABLE_PROT_UX; if (!kvm_s2_trans_exec_el1(kvm, nested)) - *prot &= ~KVM_PGTABLE_PROT_PX; + prot &= ~KVM_PGTABLE_PROT_PX; + + return prot; } -static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_s2_trans *nested, - struct kvm_memory_slot *memslot, bool is_perm) +struct kvm_s2_fault_desc { + struct kvm_vcpu *vcpu; + phys_addr_t fault_ipa; + struct kvm_s2_trans *nested; + struct kvm_memory_slot *memslot; + unsigned long hva; +}; + +static int gmem_abort(const struct kvm_s2_fault_desc *s2fd) { - bool write_fault, exec_fault, writable; + bool write_fault, exec_fault; enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; - struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt; + struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt; unsigned long mmu_seq; struct page *page; - struct kvm *kvm = vcpu->kvm; + struct kvm *kvm = s2fd->vcpu->kvm; void *memcache; kvm_pfn_t pfn; gfn_t gfn; int ret; - ret = prepare_mmu_memcache(vcpu, true, &memcache); + memcache = get_mmu_memcache(s2fd->vcpu); + ret = 
topup_mmu_memcache(s2fd->vcpu, memcache); if (ret) return ret; - if (nested) - gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT; + if (s2fd->nested) + gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT; else - gfn = fault_ipa >> PAGE_SHIFT; + gfn = s2fd->fault_ipa >> PAGE_SHIFT; - write_fault = kvm_is_write_fault(vcpu); - exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu); + write_fault = kvm_is_write_fault(s2fd->vcpu); + exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu); VM_WARN_ON_ONCE(write_fault && exec_fault); @@ -1598,26 +1606,24 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, /* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). */ smp_rmb(); - ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL); + ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL); if (ret) { - kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE, + kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE, write_fault, exec_fault, false); return ret; } - writable = !(memslot->flags & KVM_MEM_READONLY); - - if (nested) - adjust_nested_fault_perms(nested, &prot, &writable); - - if (writable) + if (!(s2fd->memslot->flags & KVM_MEM_READONLY)) prot |= KVM_PGTABLE_PROT_W; + if (s2fd->nested) + prot = adjust_nested_fault_perms(s2fd->nested, prot); + if (exec_fault || cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) prot |= KVM_PGTABLE_PROT_X; - if (nested) - adjust_nested_exec_perms(kvm, nested, &prot); + if (s2fd->nested) + prot = adjust_nested_exec_perms(kvm, s2fd->nested, prot); kvm_fault_lock(kvm); if (mmu_invalidate_retry(kvm, mmu_seq)) { @@ -1625,85 +1631,122 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, goto out_unlock; } - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE, + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE, __pfn_to_phys(pfn), prot, memcache, flags); out_unlock: - kvm_release_faultin_page(kvm, page, !!ret, writable); + kvm_release_faultin_page(kvm, 
page, !!ret, prot & KVM_PGTABLE_PROT_W); kvm_fault_unlock(kvm); - if (writable && !ret) - mark_page_dirty_in_slot(kvm, memslot, gfn); + if ((prot & KVM_PGTABLE_PROT_W) && !ret) + mark_page_dirty_in_slot(kvm, s2fd->memslot, gfn); return ret != -EAGAIN ? ret : 0; } -static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_s2_trans *nested, - struct kvm_memory_slot *memslot, unsigned long hva, - bool fault_is_perm) +struct kvm_s2_fault_vma_info { + unsigned long mmu_seq; + long vma_pagesize; + vm_flags_t vm_flags; + unsigned long max_map_size; + struct page *page; + kvm_pfn_t pfn; + gfn_t gfn; + bool device; + bool mte_allowed; + bool is_vma_cacheable; + bool map_writable; + bool map_non_cacheable; +}; + +static int pkvm_mem_abort(const struct kvm_s2_fault_desc *s2fd) { - int ret = 0; - bool topup_memcache; - bool write_fault, writable; - bool exec_fault, mte_allowed, is_vma_cacheable; - bool s2_force_noncacheable = false, vfio_allow_any_uc = false; - unsigned long mmu_seq; - phys_addr_t ipa = fault_ipa; + unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE; + struct kvm_vcpu *vcpu = s2fd->vcpu; + struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt; + struct mm_struct *mm = current->mm; struct kvm *kvm = vcpu->kvm; - struct vm_area_struct *vma; - short vma_shift; - void *memcache; - gfn_t gfn; - kvm_pfn_t pfn; - bool logging_active = memslot_is_logging(memslot); - bool force_pte = logging_active; - long vma_pagesize, fault_granule; - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R; - struct kvm_pgtable *pgt; + void *hyp_memcache; struct page *page; - vm_flags_t vm_flags; - enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; + int ret; - if (fault_is_perm) - fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu); - write_fault = kvm_is_write_fault(vcpu); - exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu); - VM_WARN_ON_ONCE(write_fault && exec_fault); + hyp_memcache = get_mmu_memcache(vcpu); + ret = 
topup_mmu_memcache(vcpu, hyp_memcache); + if (ret) + return -ENOMEM; - /* - * Permission faults just need to update the existing leaf entry, - * and so normally don't require allocations from the memcache. The - * only exception to this is when dirty logging is enabled at runtime - * and a write fault needs to collapse a block entry into a table. - */ - topup_memcache = !fault_is_perm || (logging_active && write_fault); - ret = prepare_mmu_memcache(vcpu, topup_memcache, &memcache); + ret = account_locked_vm(mm, 1, true); if (ret) return ret; - /* - * Let's check if we will get back a huge page backed by hugetlbfs, or - * get block mapping for device MMIO region. - */ - mmap_read_lock(current->mm); - vma = vma_lookup(current->mm, hva); - if (unlikely(!vma)) { - kvm_err("Failed to find VMA for hva 0x%lx\n", hva); - mmap_read_unlock(current->mm); - return -EFAULT; + mmap_read_lock(mm); + ret = pin_user_pages(s2fd->hva, 1, flags, &page); + mmap_read_unlock(mm); + + if (ret == -EHWPOISON) { + kvm_send_hwpoison_signal(s2fd->hva, PAGE_SHIFT); + ret = 0; + goto dec_account; + } else if (ret != 1) { + ret = -EFAULT; + goto dec_account; + } else if (!folio_test_swapbacked(page_folio(page))) { + /* + * We really can't deal with page-cache pages returned by GUP + * because (a) we may trigger writeback of a page for which we + * no longer have access and (b) page_mkclean() won't find the + * stage-2 mapping in the rmap so we can get out-of-whack with + * the filesystem when marking the page dirty during unpinning + * (see cc5095747edf ("ext4: don't BUG if someone dirty pages + * without asking ext4 first")). + * + * Ideally we'd just restrict ourselves to anonymous pages, but + * we also want to allow memfd (i.e. shmem) pages, so check for + * pages backed by swap in the knowledge that the GUP pin will + * prevent try_to_unmap() from succeeding. 
+ */ + ret = -EIO; + goto unpin; } - if (force_pte) + write_lock(&kvm->mmu_lock); + ret = pkvm_pgtable_stage2_map(pgt, s2fd->fault_ipa, PAGE_SIZE, + page_to_phys(page), KVM_PGTABLE_PROT_RWX, + hyp_memcache, 0); + write_unlock(&kvm->mmu_lock); + if (ret) { + if (ret == -EAGAIN) + ret = 0; + goto unpin; + } + + return 0; +unpin: + unpin_user_pages(&page, 1); +dec_account: + account_locked_vm(mm, 1, false); + return ret; +} + +static short kvm_s2_resolve_vma_size(const struct kvm_s2_fault_desc *s2fd, + struct kvm_s2_fault_vma_info *s2vi, + struct vm_area_struct *vma) +{ + short vma_shift; + + if (memslot_is_logging(s2fd->memslot)) { + s2vi->max_map_size = PAGE_SIZE; vma_shift = PAGE_SHIFT; - else - vma_shift = get_vma_page_shift(vma, hva); + } else { + s2vi->max_map_size = PUD_SIZE; + vma_shift = get_vma_page_shift(vma, s2fd->hva); + } switch (vma_shift) { #ifndef __PAGETABLE_PMD_FOLDED case PUD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE)) + if (fault_supports_stage2_huge_mapping(s2fd->memslot, s2fd->hva, PUD_SIZE)) break; fallthrough; #endif @@ -1711,12 +1754,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, vma_shift = PMD_SHIFT; fallthrough; case PMD_SHIFT: - if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) + if (fault_supports_stage2_huge_mapping(s2fd->memslot, s2fd->hva, PMD_SIZE)) break; fallthrough; case CONT_PTE_SHIFT: vma_shift = PAGE_SHIFT; - force_pte = true; + s2vi->max_map_size = PAGE_SIZE; fallthrough; case PAGE_SHIFT: break; @@ -1724,21 +1767,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, WARN_ONCE(1, "Unknown vma_shift %d", vma_shift); } - vma_pagesize = 1UL << vma_shift; - - if (nested) { + if (s2fd->nested) { unsigned long max_map_size; - max_map_size = force_pte ? 
PAGE_SIZE : PUD_SIZE; - - ipa = kvm_s2_trans_output(nested); + max_map_size = min(s2vi->max_map_size, PUD_SIZE); /* * If we're about to create a shadow stage 2 entry, then we * can only create a block mapping if the guest stage 2 page * table uses at least as big a mapping. */ - max_map_size = min(kvm_s2_trans_size(nested), max_map_size); + max_map_size = min(kvm_s2_trans_size(s2fd->nested), max_map_size); /* * Be careful that if the mapping size falls between @@ -1749,30 +1788,46 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, else if (max_map_size >= PAGE_SIZE && max_map_size < PMD_SIZE) max_map_size = PAGE_SIZE; - force_pte = (max_map_size == PAGE_SIZE); - vma_pagesize = min_t(long, vma_pagesize, max_map_size); - vma_shift = __ffs(vma_pagesize); + s2vi->max_map_size = max_map_size; + vma_shift = min_t(short, vma_shift, __ffs(max_map_size)); } + return vma_shift; +} + +static bool kvm_s2_fault_is_perm(const struct kvm_s2_fault_desc *s2fd) +{ + return kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); +} + +static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd, + struct kvm_s2_fault_vma_info *s2vi) +{ + struct vm_area_struct *vma; + struct kvm *kvm = s2fd->vcpu->kvm; + + mmap_read_lock(current->mm); + vma = vma_lookup(current->mm, s2fd->hva); + if (unlikely(!vma)) { + kvm_err("Failed to find VMA for hva 0x%lx\n", s2fd->hva); + mmap_read_unlock(current->mm); + return -EFAULT; + } + + s2vi->vma_pagesize = BIT(kvm_s2_resolve_vma_size(s2fd, s2vi, vma)); + /* * Both the canonical IPA and fault IPA must be aligned to the * mapping size to ensure we find the right PFN and lay down the * mapping in the right place. 
*/ - fault_ipa = ALIGN_DOWN(fault_ipa, vma_pagesize); - ipa = ALIGN_DOWN(ipa, vma_pagesize); - - gfn = ipa >> PAGE_SHIFT; - mte_allowed = kvm_vma_mte_allowed(vma); - - vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED; + s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT; - vm_flags = vma->vm_flags; + s2vi->mte_allowed = kvm_vma_mte_allowed(vma); - is_vma_cacheable = kvm_vma_is_cacheable(vma); + s2vi->vm_flags = vma->vm_flags; - /* Don't use the VMA after the unlock -- it may have vanished */ - vma = NULL; + s2vi->is_vma_cacheable = kvm_vma_is_cacheable(vma); /* * Read mmu_invalidate_seq so that KVM can detect if the results of @@ -1782,24 +1837,50 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs * with the smp_wmb() in kvm_mmu_invalidate_end(). */ - mmu_seq = kvm->mmu_invalidate_seq; + s2vi->mmu_seq = kvm->mmu_invalidate_seq; mmap_read_unlock(current->mm); - pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0, - &writable, &page); - if (pfn == KVM_PFN_ERR_HWPOISON) { - kvm_send_hwpoison_signal(hva, vma_shift); - return 0; - } - if (is_error_noslot_pfn(pfn)) + return 0; +} + +static gfn_t get_canonical_gfn(const struct kvm_s2_fault_desc *s2fd, + const struct kvm_s2_fault_vma_info *s2vi) +{ + phys_addr_t ipa; + + if (!s2fd->nested) + return s2vi->gfn; + + ipa = kvm_s2_trans_output(s2fd->nested); + return ALIGN_DOWN(ipa, s2vi->vma_pagesize) >> PAGE_SHIFT; +} + +static int kvm_s2_fault_pin_pfn(const struct kvm_s2_fault_desc *s2fd, + struct kvm_s2_fault_vma_info *s2vi) +{ + int ret; + + ret = kvm_s2_fault_get_vma_info(s2fd, s2vi); + if (ret) + return ret; + + s2vi->pfn = __kvm_faultin_pfn(s2fd->memslot, get_canonical_gfn(s2fd, s2vi), + kvm_is_write_fault(s2fd->vcpu) ? 
FOLL_WRITE : 0, + &s2vi->map_writable, &s2vi->page); + if (unlikely(is_error_noslot_pfn(s2vi->pfn))) { + if (s2vi->pfn == KVM_PFN_ERR_HWPOISON) { + kvm_send_hwpoison_signal(s2fd->hva, __ffs(s2vi->vma_pagesize)); + return 0; + } return -EFAULT; + } /* * Check if this is non-struct page memory PFN, and cannot support * CMOs. It could potentially be unsafe to access as cacheable. */ - if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(pfn)) { - if (is_vma_cacheable) { + if (s2vi->vm_flags & (VM_PFNMAP | VM_MIXEDMAP) && !pfn_is_map_memory(s2vi->pfn)) { + if (s2vi->is_vma_cacheable) { /* * Whilst the VMA owner expects cacheable mapping to this * PFN, hardware also has to support the FWB and CACHE DIC @@ -1812,8 +1893,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * S2FWB and CACHE DIC are mandatory to avoid the need for * cache maintenance. */ - if (!kvm_supports_cacheable_pfnmap()) - ret = -EFAULT; + if (!kvm_supports_cacheable_pfnmap()) { + kvm_release_faultin_page(s2fd->vcpu->kvm, s2vi->page, true, false); + return -EFAULT; + } } else { /* * If the page was identified as device early by looking at @@ -1825,21 +1908,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * In both cases, we don't let transparent_hugepage_adjust() * change things at the last minute. */ - s2_force_noncacheable = true; + s2vi->map_non_cacheable = true; } - } else if (logging_active && !write_fault) { - /* - * Only actually map the page as writable if this was a write - * fault. 
- */ - writable = false; + + s2vi->device = true; } - if (exec_fault && s2_force_noncacheable) - ret = -ENOEXEC; + return 1; +} - if (ret) - goto out_put_page; +static int kvm_s2_fault_compute_prot(const struct kvm_s2_fault_desc *s2fd, + const struct kvm_s2_fault_vma_info *s2vi, + enum kvm_pgtable_prot *prot) +{ + struct kvm *kvm = s2fd->vcpu->kvm; + + if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu) && s2vi->map_non_cacheable) + return -ENOEXEC; /* * Guest performs atomic/exclusive operations on memory with unsupported @@ -1847,99 +1932,167 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, * and trigger the exception here. Since the memslot is valid, inject * the fault back to the guest. */ - if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(vcpu))) { - kvm_inject_dabt_excl_atomic(vcpu, kvm_vcpu_get_hfar(vcpu)); - ret = 1; - goto out_put_page; + if (esr_fsc_is_excl_atomic_fault(kvm_vcpu_get_esr(s2fd->vcpu))) { + kvm_inject_dabt_excl_atomic(s2fd->vcpu, kvm_vcpu_get_hfar(s2fd->vcpu)); + return 1; } - if (nested) - adjust_nested_fault_perms(nested, &prot, &writable); + *prot = KVM_PGTABLE_PROT_R; + + if (s2vi->map_writable && (s2vi->device || + !memslot_is_logging(s2fd->memslot) || + kvm_is_write_fault(s2fd->vcpu))) + *prot |= KVM_PGTABLE_PROT_W; + + if (s2fd->nested) + *prot = adjust_nested_fault_perms(s2fd->nested, *prot); + + if (kvm_vcpu_trap_is_exec_fault(s2fd->vcpu)) + *prot |= KVM_PGTABLE_PROT_X; + + if (s2vi->map_non_cacheable) + *prot |= (s2vi->vm_flags & VM_ALLOW_ANY_UNCACHED) ? 
+ KVM_PGTABLE_PROT_NORMAL_NC : KVM_PGTABLE_PROT_DEVICE; + else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) + *prot |= KVM_PGTABLE_PROT_X; + + if (s2fd->nested) + *prot = adjust_nested_exec_perms(kvm, s2fd->nested, *prot); + + if (!kvm_s2_fault_is_perm(s2fd) && !s2vi->map_non_cacheable && kvm_has_mte(kvm)) { + /* Check the VMM hasn't introduced a new disallowed VMA */ + if (!s2vi->mte_allowed) + return -EFAULT; + } + + return 0; +} + +static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd, + const struct kvm_s2_fault_vma_info *s2vi, + enum kvm_pgtable_prot prot, + void *memcache) +{ + enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED; + bool writable = prot & KVM_PGTABLE_PROT_W; + struct kvm *kvm = s2fd->vcpu->kvm; + struct kvm_pgtable *pgt; + long perm_fault_granule; + long mapping_size; + kvm_pfn_t pfn; + gfn_t gfn; + int ret; kvm_fault_lock(kvm); - pgt = vcpu->arch.hw_mmu->pgt; - if (mmu_invalidate_retry(kvm, mmu_seq)) { - ret = -EAGAIN; + pgt = s2fd->vcpu->arch.hw_mmu->pgt; + ret = -EAGAIN; + if (mmu_invalidate_retry(kvm, s2vi->mmu_seq)) goto out_unlock; - } + + perm_fault_granule = (kvm_s2_fault_is_perm(s2fd) ? + kvm_vcpu_trap_get_perm_fault_granule(s2fd->vcpu) : 0); + mapping_size = s2vi->vma_pagesize; + pfn = s2vi->pfn; + gfn = s2vi->gfn; /* * If we are not forced to use page mapping, check if we are * backed by a THP and thus use block mapping if possible. 
*/ - if (vma_pagesize == PAGE_SIZE && !(force_pte || s2_force_noncacheable)) { - if (fault_is_perm && fault_granule > PAGE_SIZE) - vma_pagesize = fault_granule; - else - vma_pagesize = transparent_hugepage_adjust(kvm, memslot, - hva, &pfn, - &fault_ipa); - - if (vma_pagesize < 0) { - ret = vma_pagesize; - goto out_unlock; - } - } - - if (!fault_is_perm && !s2_force_noncacheable && kvm_has_mte(kvm)) { - /* Check the VMM hasn't introduced a new disallowed VMA */ - if (mte_allowed) { - sanitise_mte_tags(kvm, pfn, vma_pagesize); + if (mapping_size == PAGE_SIZE && + !(s2vi->max_map_size == PAGE_SIZE || s2vi->map_non_cacheable)) { + if (perm_fault_granule > PAGE_SIZE) { + mapping_size = perm_fault_granule; } else { - ret = -EFAULT; - goto out_unlock; + mapping_size = transparent_hugepage_adjust(kvm, s2fd->memslot, + s2fd->hva, &pfn, + &gfn); + if (mapping_size < 0) { + ret = mapping_size; + goto out_unlock; + } } } - if (writable) - prot |= KVM_PGTABLE_PROT_W; - - if (exec_fault) - prot |= KVM_PGTABLE_PROT_X; - - if (s2_force_noncacheable) { - if (vfio_allow_any_uc) - prot |= KVM_PGTABLE_PROT_NORMAL_NC; - else - prot |= KVM_PGTABLE_PROT_DEVICE; - } else if (cpus_have_final_cap(ARM64_HAS_CACHE_DIC)) { - prot |= KVM_PGTABLE_PROT_X; - } - - if (nested) - adjust_nested_exec_perms(kvm, nested, &prot); + if (!perm_fault_granule && !s2vi->map_non_cacheable && kvm_has_mte(kvm)) + sanitise_mte_tags(kvm, pfn, mapping_size); /* * Under the premise of getting a FSC_PERM fault, we just need to relax - * permissions only if vma_pagesize equals fault_granule. Otherwise, + * permissions only if mapping_size equals perm_fault_granule. Otherwise, * kvm_pgtable_stage2_map() should be called to change block size. */ - if (fault_is_perm && vma_pagesize == fault_granule) { + if (mapping_size == perm_fault_granule) { /* * Drop the SW bits in favour of those stored in the * PTE, which will be preserved. 
*/ prot &= ~KVM_NV_GUEST_MAP_SZ; - ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags); + ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn), + prot, flags); } else { - ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize, - __pfn_to_phys(pfn), prot, - memcache, flags); + ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size, + __pfn_to_phys(pfn), prot, + memcache, flags); } out_unlock: - kvm_release_faultin_page(kvm, page, !!ret, writable); + kvm_release_faultin_page(kvm, s2vi->page, !!ret, writable); kvm_fault_unlock(kvm); - /* Mark the page dirty only if the fault is handled successfully */ - if (writable && !ret) - mark_page_dirty_in_slot(kvm, memslot, gfn); + /* + * Mark the page dirty only if the fault is handled successfully, + * making sure we adjust the canonical IPA if the mapping size has + * been updated (via a THP upgrade, for example). + */ + if (writable && !ret) { + phys_addr_t ipa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi)); + ipa &= ~(mapping_size - 1); + mark_page_dirty_in_slot(kvm, s2fd->memslot, gpa_to_gfn(ipa)); + } - return ret != -EAGAIN ? ret : 0; + if (ret != -EAGAIN) + return ret; + return 0; +} -out_put_page: - kvm_release_page_unused(page); - return ret; +static int user_mem_abort(const struct kvm_s2_fault_desc *s2fd) +{ + bool perm_fault = kvm_vcpu_trap_is_permission_fault(s2fd->vcpu); + struct kvm_s2_fault_vma_info s2vi = {}; + enum kvm_pgtable_prot prot; + void *memcache; + int ret; + + /* + * Permission faults just need to update the existing leaf entry, + * and so normally don't require allocations from the memcache. The + * only exception to this is when dirty logging is enabled at runtime + * and a write fault needs to collapse a block entry into a table. 
+ */ + memcache = get_mmu_memcache(s2fd->vcpu); + if (!perm_fault || (memslot_is_logging(s2fd->memslot) && + kvm_is_write_fault(s2fd->vcpu))) { + ret = topup_mmu_memcache(s2fd->vcpu, memcache); + if (ret) + return ret; + } + + /* + * Let's check if we will get back a huge page backed by hugetlbfs, or + * get block mapping for device MMIO region. + */ + ret = kvm_s2_fault_pin_pfn(s2fd, &s2vi); + if (ret != 1) + return ret; + + ret = kvm_s2_fault_compute_prot(s2fd, &s2vi, &prot); + if (ret) { + kvm_release_page_unused(s2vi.page); + return ret; + } + + return kvm_s2_fault_map(s2fd, &s2vi, prot, memcache); } /* Resolve the access fault by making the page young again. */ @@ -2202,15 +2355,27 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) goto out_unlock; } - VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) && - !write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu)); + const struct kvm_s2_fault_desc s2fd = { + .vcpu = vcpu, + .fault_ipa = fault_ipa, + .nested = nested, + .memslot = memslot, + .hva = hva, + }; + + if (kvm_vm_is_protected(vcpu->kvm)) { + ret = pkvm_mem_abort(&s2fd); + } else { + VM_WARN_ON_ONCE(kvm_vcpu_trap_is_permission_fault(vcpu) && + !write_fault && + !kvm_vcpu_trap_is_exec_fault(vcpu)); + + if (kvm_slot_has_gmem(memslot)) + ret = gmem_abort(&s2fd); + else + ret = user_mem_abort(&s2fd); + } - if (kvm_slot_has_gmem(memslot)) - ret = gmem_abort(vcpu, fault_ipa, nested, memslot, - esr_fsc_is_permission_fault(esr)); - else - ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva, - esr_fsc_is_permission_fault(esr)); if (ret == 0) ret = 1; out: @@ -2223,7 +2388,7 @@ out_unlock: bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) { - if (!kvm->arch.mmu.pgt) + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) return false; __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT, @@ -2238,7 +2403,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { u64 size = (range->end - range->start) << 
PAGE_SHIFT; - if (!kvm->arch.mmu.pgt) + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) return false; return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt, @@ -2254,7 +2419,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { u64 size = (range->end - range->start) << PAGE_SHIFT; - if (!kvm->arch.mmu.pgt) + if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm)) return false; return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt, @@ -2411,6 +2576,19 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, hva_t hva, reg_end; int ret = 0; + if (kvm_vm_is_protected(kvm)) { + /* Cannot modify memslots once a pVM has run. */ + if (pkvm_hyp_vm_is_created(kvm) && + (change == KVM_MR_DELETE || change == KVM_MR_MOVE)) { + return -EPERM; + } + + if (new && + new->flags & (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)) { + return -EPERM; + } + } + if (change != KVM_MR_CREATE && change != KVM_MR_MOVE && change != KVM_MR_FLAGS_ONLY) return 0; diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c index 2c43097248b2..883b6c1008fb 100644 --- a/arch/arm64/kvm/nested.c +++ b/arch/arm64/kvm/nested.c @@ -735,8 +735,10 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size; /* Make sure we don't forget to do the laundry */ - if (kvm_s2_mmu_valid(s2_mmu)) + if (kvm_s2_mmu_valid(s2_mmu)) { + kvm_nested_s2_ptdump_remove_debugfs(s2_mmu); s2_mmu->pending_unmap = true; + } /* * The virtual VMID (modulo CnP) will be used as a key when matching @@ -750,6 +752,8 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu) s2_mmu->tlb_vtcr = vcpu_read_sys_reg(vcpu, VTCR_EL2); s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, HCR_EL2) & HCR_VM; + kvm_nested_s2_ptdump_create_debugfs(s2_mmu); + out: atomic_inc(&s2_mmu->refcnt); @@ -1558,6 +1562,11 @@ u64 limit_nv_id_reg(struct kvm *kvm, u32 reg, u64 val) ID_AA64PFR1_EL1_MTE); break; + case 
SYS_ID_AA64PFR2_EL1: + /* GICv5 is not yet supported for NV */ + val &= ~ID_AA64PFR2_EL1_GCIE; + break; + case SYS_ID_AA64MMFR0_EL1: /* Hide ExS, Secure Memory */ val &= ~(ID_AA64MMFR0_EL1_EXS | diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c index d7a0f69a9982..053e4f733e4b 100644 --- a/arch/arm64/kvm/pkvm.c +++ b/arch/arm64/kvm/pkvm.c @@ -88,7 +88,7 @@ void __init kvm_hyp_reserve(void) static void __pkvm_destroy_hyp_vm(struct kvm *kvm) { if (pkvm_hyp_vm_is_created(kvm)) { - WARN_ON(kvm_call_hyp_nvhe(__pkvm_teardown_vm, + WARN_ON(kvm_call_hyp_nvhe(__pkvm_finalize_teardown_vm, kvm->arch.pkvm.handle)); } else if (kvm->arch.pkvm.handle) { /* @@ -192,10 +192,16 @@ int pkvm_create_hyp_vm(struct kvm *kvm) { int ret = 0; + /* + * Synchronise with kvm_arch_prepare_memory_region(), as we + * prevent memslot modifications on a pVM that has been run. + */ + mutex_lock(&kvm->slots_lock); mutex_lock(&kvm->arch.config_lock); if (!pkvm_hyp_vm_is_created(kvm)) ret = __pkvm_create_hyp_vm(kvm); mutex_unlock(&kvm->arch.config_lock); + mutex_unlock(&kvm->slots_lock); return ret; } @@ -219,9 +225,10 @@ void pkvm_destroy_hyp_vm(struct kvm *kvm) mutex_unlock(&kvm->arch.config_lock); } -int pkvm_init_host_vm(struct kvm *kvm) +int pkvm_init_host_vm(struct kvm *kvm, unsigned long type) { int ret; + bool protected = type & KVM_VM_TYPE_ARM_PROTECTED; if (pkvm_hyp_vm_is_created(kvm)) return -EINVAL; @@ -236,6 +243,11 @@ int pkvm_init_host_vm(struct kvm *kvm) return ret; kvm->arch.pkvm.handle = ret; + kvm->arch.pkvm.is_protected = protected; + if (protected) { + pr_warn_once("kvm: protected VMs are experimental and for development only, tainting kernel\n"); + add_taint(TAINT_USER, LOCKDEP_STILL_OK); + } return 0; } @@ -322,15 +334,38 @@ int pkvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu, return 0; } -static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 end) +static int __pkvm_pgtable_stage2_reclaim(struct kvm_pgtable *pgt, u64 
start, u64 end) { struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); pkvm_handle_t handle = kvm->arch.pkvm.handle; struct pkvm_mapping *mapping; int ret; - if (!handle) - return 0; + for_each_mapping_in_range_safe(pgt, start, end, mapping) { + struct page *page; + + ret = kvm_call_hyp_nvhe(__pkvm_reclaim_dying_guest_page, + handle, mapping->gfn); + if (WARN_ON(ret)) + continue; + + page = pfn_to_page(mapping->pfn); + WARN_ON_ONCE(mapping->nr_pages != 1); + unpin_user_pages_dirty_lock(&page, 1, true); + account_locked_vm(current->mm, 1, false); + pkvm_mapping_remove(mapping, &pgt->pkvm_mappings); + kfree(mapping); + } + + return 0; +} + +static int __pkvm_pgtable_stage2_unshare(struct kvm_pgtable *pgt, u64 start, u64 end) +{ + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); + pkvm_handle_t handle = kvm->arch.pkvm.handle; + struct pkvm_mapping *mapping; + int ret; for_each_mapping_in_range_safe(pgt, start, end, mapping) { ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_guest, handle, mapping->gfn, @@ -347,7 +382,21 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, u64 addr, u64 size) { - __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); + pkvm_handle_t handle = kvm->arch.pkvm.handle; + + if (!handle) + return; + + if (pkvm_hyp_vm_is_created(kvm) && !kvm->arch.pkvm.is_dying) { + WARN_ON(kvm_call_hyp_nvhe(__pkvm_start_teardown_vm, handle)); + kvm->arch.pkvm.is_dying = true; + } + + if (kvm_vm_is_protected(kvm)) + __pkvm_pgtable_stage2_reclaim(pgt, addr, addr + size); + else + __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); } void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt) @@ -365,31 +414,58 @@ int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, struct kvm_hyp_memcache *cache = mc; u64 gfn = addr >> PAGE_SHIFT; u64 pfn = phys >> PAGE_SHIFT; + u64 end = addr + size; int ret; - if (size != 
PAGE_SIZE && size != PMD_SIZE) - return -EINVAL; - lockdep_assert_held_write(&kvm->mmu_lock); + mapping = pkvm_mapping_iter_first(&pgt->pkvm_mappings, addr, end - 1); - /* - * Calling stage2_map() on top of existing mappings is either happening because of a race - * with another vCPU, or because we're changing between page and block mappings. As per - * user_mem_abort(), same-size permission faults are handled in the relax_perms() path. - */ - mapping = pkvm_mapping_iter_first(&pgt->pkvm_mappings, addr, addr + size - 1); - if (mapping) { - if (size == (mapping->nr_pages * PAGE_SIZE)) - return -EAGAIN; - - /* Remove _any_ pkvm_mapping overlapping with the range, bigger or smaller. */ - ret = __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); - if (ret) - return ret; - mapping = NULL; + if (kvm_vm_is_protected(kvm)) { + /* Protected VMs are mapped using RWX page-granular mappings */ + if (WARN_ON_ONCE(size != PAGE_SIZE)) + return -EINVAL; + + if (WARN_ON_ONCE(prot != KVM_PGTABLE_PROT_RWX)) + return -EINVAL; + + /* + * We either raced with another vCPU or the guest PTE + * has been poisoned by an erroneous host access. + */ + if (mapping) { + ret = kvm_call_hyp_nvhe(__pkvm_vcpu_in_poison_fault); + return ret ? -EFAULT : -EAGAIN; + } + + ret = kvm_call_hyp_nvhe(__pkvm_host_donate_guest, pfn, gfn); + } else { + if (WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE)) + return -EINVAL; + + /* + * We either raced with another vCPU or we're changing between + * page and block mappings. As per user_mem_abort(), same-size + * permission faults are handled in the relax_perms() path. + */ + if (mapping) { + if (size == (mapping->nr_pages * PAGE_SIZE)) + return -EAGAIN; + + /* + * Remove _any_ pkvm_mapping overlapping with the range, + * bigger or smaller. 
+ */ + ret = __pkvm_pgtable_stage2_unshare(pgt, addr, end); + if (ret) + return ret; + + mapping = NULL; + } + + ret = kvm_call_hyp_nvhe(__pkvm_host_share_guest, pfn, gfn, + size / PAGE_SIZE, prot); } - ret = kvm_call_hyp_nvhe(__pkvm_host_share_guest, pfn, gfn, size / PAGE_SIZE, prot); if (WARN_ON(ret)) return ret; @@ -404,9 +480,14 @@ int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size, int pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size) { - lockdep_assert_held_write(&kvm_s2_mmu_to_kvm(pgt->mmu)->mmu_lock); + struct kvm *kvm = kvm_s2_mmu_to_kvm(pgt->mmu); - return __pkvm_pgtable_stage2_unmap(pgt, addr, addr + size); + if (WARN_ON(kvm_vm_is_protected(kvm))) + return -EPERM; + + lockdep_assert_held_write(&kvm->mmu_lock); + + return __pkvm_pgtable_stage2_unshare(pgt, addr, addr + size); } int pkvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size) @@ -416,6 +497,9 @@ int pkvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size) struct pkvm_mapping *mapping; int ret = 0; + if (WARN_ON(kvm_vm_is_protected(kvm))) + return -EPERM; + lockdep_assert_held(&kvm->mmu_lock); for_each_mapping_in_range_safe(pgt, addr, addr + size, mapping) { ret = kvm_call_hyp_nvhe(__pkvm_host_wrprotect_guest, handle, mapping->gfn, @@ -447,6 +531,9 @@ bool pkvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr, u64 struct pkvm_mapping *mapping; bool young = false; + if (WARN_ON(kvm_vm_is_protected(kvm))) + return false; + lockdep_assert_held(&kvm->mmu_lock); for_each_mapping_in_range_safe(pgt, addr, addr + size, mapping) young |= kvm_call_hyp_nvhe(__pkvm_host_test_clear_young_guest, handle, mapping->gfn, @@ -458,12 +545,18 @@ bool pkvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr, u64 int pkvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr, enum kvm_pgtable_prot prot, enum kvm_pgtable_walk_flags flags) { + if (WARN_ON(kvm_vm_is_protected(kvm_s2_mmu_to_kvm(pgt->mmu)))) 
+ return -EPERM; + return kvm_call_hyp_nvhe(__pkvm_host_relax_perms_guest, addr >> PAGE_SHIFT, prot); } void pkvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr, enum kvm_pgtable_walk_flags flags) { + if (WARN_ON(kvm_vm_is_protected(kvm_s2_mmu_to_kvm(pgt->mmu)))) + return; + WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_mkyoung_guest, addr >> PAGE_SHIFT)); } @@ -485,3 +578,15 @@ int pkvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, WARN_ON_ONCE(1); return -EINVAL; } + +/* + * Forcefully reclaim a page from the guest, zeroing its contents and + * poisoning the stage-2 pte so that pages can no longer be mapped at + * the same IPA. The page remains pinned until the guest is destroyed. + */ +bool pkvm_force_reclaim_guest_page(phys_addr_t phys) +{ + int ret = kvm_call_hyp_nvhe(__pkvm_force_reclaim_guest_page, phys); + + return !ret || ret == -EAGAIN; +} diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c index 93cc9bbb5cec..e1860acae641 100644 --- a/arch/arm64/kvm/pmu-emul.c +++ b/arch/arm64/kvm/pmu-emul.c @@ -939,7 +939,8 @@ int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu) * number against the dimensions of the vgic and make sure * it's valid. */ - if (!irq_is_ppi(irq) && !vgic_valid_spi(vcpu->kvm, irq)) + if (!irq_is_ppi(vcpu->kvm, irq) && + !vgic_valid_spi(vcpu->kvm, irq)) return -EINVAL; } else if (kvm_arm_pmu_irq_initialized(vcpu)) { return -EINVAL; @@ -961,8 +962,13 @@ static int kvm_arm_pmu_v3_init(struct kvm_vcpu *vcpu) if (!vgic_initialized(vcpu->kvm)) return -ENODEV; - if (!kvm_arm_pmu_irq_initialized(vcpu)) - return -ENXIO; + if (!kvm_arm_pmu_irq_initialized(vcpu)) { + if (!vgic_is_v5(vcpu->kvm)) + return -ENXIO; + + /* Use the architected irq number for GICv5. 
*/ + vcpu->arch.pmu.irq_num = KVM_ARMV8_PMU_GICV5_IRQ; + } ret = kvm_vgic_set_owner(vcpu, vcpu->arch.pmu.irq_num, &vcpu->arch.pmu); @@ -987,11 +993,15 @@ static bool pmu_irq_is_valid(struct kvm *kvm, int irq) unsigned long i; struct kvm_vcpu *vcpu; + /* On GICv5, the PMUIRQ is architecturally mandated to be PPI 23 */ + if (vgic_is_v5(kvm) && irq != KVM_ARMV8_PMU_GICV5_IRQ) + return false; + kvm_for_each_vcpu(i, vcpu, kvm) { if (!kvm_arm_pmu_irq_initialized(vcpu)) continue; - if (irq_is_ppi(irq)) { + if (irq_is_ppi(vcpu->kvm, irq)) { if (vcpu->arch.pmu.irq_num != irq) return false; } else { @@ -1142,7 +1152,7 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) return -EFAULT; /* The PMU overflow interrupt can be a PPI or a valid SPI. */ - if (!(irq_is_ppi(irq) || irq_is_spi(irq))) + if (!(irq_is_ppi(vcpu->kvm, irq) || irq_is_spi(vcpu->kvm, irq))) return -EINVAL; if (!pmu_irq_is_valid(kvm, irq)) diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c index 6a8836207a79..c9140e22abcf 100644 --- a/arch/arm64/kvm/ptdump.c +++ b/arch/arm64/kvm/ptdump.c @@ -10,19 +10,20 @@ #include <linux/kvm_host.h> #include <linux/seq_file.h> +#include <asm/cpufeature.h> #include <asm/kvm_mmu.h> #include <asm/kvm_pgtable.h> #include <asm/ptdump.h> #define MARKERS_LEN 2 #define KVM_PGTABLE_MAX_LEVELS (KVM_PGTABLE_LAST_LEVEL + 1) +#define S2FNAMESZ sizeof("0x0123456789abcdef-0x0123456789abcdef-s2-disabled") struct kvm_ptdump_guest_state { - struct kvm *kvm; + struct kvm_s2_mmu *mmu; struct ptdump_pg_state parser_state; struct addr_marker ipa_marker[MARKERS_LEN]; struct ptdump_pg_level level[KVM_PGTABLE_MAX_LEVELS]; - struct ptdump_range range[MARKERS_LEN]; }; static const struct ptdump_prot_bits stage2_pte_bits[] = { @@ -112,10 +113,9 @@ static int kvm_ptdump_build_levels(struct ptdump_pg_level *level, u32 start_lvl) return 0; } -static struct kvm_ptdump_guest_state *kvm_ptdump_parser_create(struct kvm *kvm) +static struct kvm_ptdump_guest_state 
*kvm_ptdump_parser_create(struct kvm_s2_mmu *mmu) { struct kvm_ptdump_guest_state *st; - struct kvm_s2_mmu *mmu = &kvm->arch.mmu; struct kvm_pgtable *pgtable = mmu->pgt; int ret; @@ -131,17 +131,8 @@ static struct kvm_ptdump_guest_state *kvm_ptdump_parser_create(struct kvm *kvm) st->ipa_marker[0].name = "Guest IPA"; st->ipa_marker[1].start_address = BIT(pgtable->ia_bits); - st->range[0].end = BIT(pgtable->ia_bits); - - st->kvm = kvm; - st->parser_state = (struct ptdump_pg_state) { - .marker = &st->ipa_marker[0], - .level = -1, - .pg_level = &st->level[0], - .ptdump.range = &st->range[0], - .start_address = 0, - }; + st->mmu = mmu; return st; } @@ -149,16 +140,20 @@ static int kvm_ptdump_guest_show(struct seq_file *m, void *unused) { int ret; struct kvm_ptdump_guest_state *st = m->private; - struct kvm *kvm = st->kvm; - struct kvm_s2_mmu *mmu = &kvm->arch.mmu; - struct ptdump_pg_state *parser_state = &st->parser_state; + struct kvm_s2_mmu *mmu = st->mmu; + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); struct kvm_pgtable_walker walker = (struct kvm_pgtable_walker) { .cb = kvm_ptdump_visitor, - .arg = parser_state, + .arg = &st->parser_state, .flags = KVM_PGTABLE_WALK_LEAF, }; - parser_state->seq = m; + st->parser_state = (struct ptdump_pg_state) { + .marker = &st->ipa_marker[0], + .level = -1, + .pg_level = &st->level[0], + .seq = m, + }; write_lock(&kvm->mmu_lock); ret = kvm_pgtable_walk(mmu->pgt, 0, BIT(mmu->pgt->ia_bits), &walker); @@ -169,14 +164,15 @@ static int kvm_ptdump_guest_show(struct seq_file *m, void *unused) static int kvm_ptdump_guest_open(struct inode *m, struct file *file) { - struct kvm *kvm = m->i_private; + struct kvm_s2_mmu *mmu = m->i_private; + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); struct kvm_ptdump_guest_state *st; int ret; if (!kvm_get_kvm_safe(kvm)) return -ENOENT; - st = kvm_ptdump_parser_create(kvm); + st = kvm_ptdump_parser_create(mmu); if (IS_ERR(st)) { ret = PTR_ERR(st); goto err_with_kvm_ref; @@ -194,7 +190,7 @@ err_with_kvm_ref: 
static int kvm_ptdump_guest_close(struct inode *m, struct file *file) { - struct kvm *kvm = m->i_private; + struct kvm *kvm = kvm_s2_mmu_to_kvm(m->i_private); void *st = ((struct seq_file *)file->private_data)->private; kfree(st); @@ -229,14 +225,15 @@ static int kvm_pgtable_levels_show(struct seq_file *m, void *unused) static int kvm_pgtable_debugfs_open(struct inode *m, struct file *file, int (*show)(struct seq_file *, void *)) { - struct kvm *kvm = m->i_private; + struct kvm_s2_mmu *mmu = m->i_private; + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu); struct kvm_pgtable *pgtable; int ret; if (!kvm_get_kvm_safe(kvm)) return -ENOENT; - pgtable = kvm->arch.mmu.pgt; + pgtable = mmu->pgt; ret = single_open(file, show, pgtable); if (ret < 0) @@ -256,7 +253,7 @@ static int kvm_pgtable_levels_open(struct inode *m, struct file *file) static int kvm_pgtable_debugfs_close(struct inode *m, struct file *file) { - struct kvm *kvm = m->i_private; + struct kvm *kvm = kvm_s2_mmu_to_kvm(m->i_private); kvm_put_kvm(kvm); return single_release(m, file); @@ -276,12 +273,36 @@ static const struct file_operations kvm_pgtable_levels_fops = { .release = kvm_pgtable_debugfs_close, }; +void kvm_nested_s2_ptdump_create_debugfs(struct kvm_s2_mmu *mmu) +{ + struct dentry *dent; + char file_name[S2FNAMESZ]; + + snprintf(file_name, sizeof(file_name), "0x%016llx-0x%016llx-s2-%sabled", + mmu->tlb_vttbr, + mmu->tlb_vtcr, + mmu->nested_stage2_enabled ? 
"en" : "dis"); + + dent = debugfs_create_file(file_name, 0400, + mmu->arch->debugfs_nv_dentry, mmu, + &kvm_ptdump_guest_fops); + + mmu->shadow_pt_debugfs_dentry = dent; +} + +void kvm_nested_s2_ptdump_remove_debugfs(struct kvm_s2_mmu *mmu) +{ + debugfs_remove(mmu->shadow_pt_debugfs_dentry); +} + void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) { debugfs_create_file("stage2_page_tables", 0400, kvm->debugfs_dentry, - kvm, &kvm_ptdump_guest_fops); - debugfs_create_file("ipa_range", 0400, kvm->debugfs_dentry, kvm, - &kvm_pgtable_range_fops); + &kvm->arch.mmu, &kvm_ptdump_guest_fops); + debugfs_create_file("ipa_range", 0400, kvm->debugfs_dentry, + &kvm->arch.mmu, &kvm_pgtable_range_fops); debugfs_create_file("stage2_levels", 0400, kvm->debugfs_dentry, - kvm, &kvm_pgtable_levels_fops); + &kvm->arch.mmu, &kvm_pgtable_levels_fops); + if (cpus_have_final_cap(ARM64_HAS_NESTED_VIRT)) + kvm->arch.debugfs_nv_dentry = debugfs_create_dir("nested", kvm->debugfs_dentry); } diff --git a/arch/arm64/kvm/stacktrace.c b/arch/arm64/kvm/stacktrace.c index af5eec681127..9724c320126b 100644 --- a/arch/arm64/kvm/stacktrace.c +++ b/arch/arm64/kvm/stacktrace.c @@ -197,7 +197,7 @@ static void hyp_dump_backtrace(unsigned long hyp_offset) kvm_nvhe_dump_backtrace_end(); } -#ifdef CONFIG_PROTECTED_NVHE_STACKTRACE +#ifdef CONFIG_PKVM_STACKTRACE DECLARE_KVM_NVHE_PER_CPU(unsigned long [NVHE_STACKTRACE_SIZE/sizeof(long)], pkvm_stacktrace); @@ -225,12 +225,12 @@ static void pkvm_dump_backtrace(unsigned long hyp_offset) kvm_nvhe_dump_backtrace_entry((void *)hyp_offset, stacktrace[i]); kvm_nvhe_dump_backtrace_end(); } -#else /* !CONFIG_PROTECTED_NVHE_STACKTRACE */ +#else /* !CONFIG_PKVM_STACKTRACE */ static void pkvm_dump_backtrace(unsigned long hyp_offset) { - kvm_err("Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE\n"); + kvm_err("Cannot dump pKVM nVHE stacktrace: !CONFIG_PKVM_STACKTRACE\n"); } -#endif /* CONFIG_PROTECTED_NVHE_STACKTRACE */ +#endif /* CONFIG_PKVM_STACKTRACE */ /* * 
kvm_nvhe_dump_backtrace - Dump KVM nVHE hypervisor backtrace.
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index c1e0dea903a1..148fc3400ea8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -681,6 +681,91 @@ static bool access_gic_dir(struct kvm_vcpu *vcpu,
 	return true;
 }

+static bool access_gicv5_idr0(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+			      const struct sys_reg_desc *r)
+{
+	if (p->is_write)
+		return undef_access(vcpu, p, r);
+
+	/*
+	 * Expose KVM's priority- and ID-bits to the guest, but not GCIE_LEGACY.
+	 *
+	 * Note: for GICv5 the mimic the way that the num_pri_bits and
+	 * num_id_bits fields are used with GICv3:
+	 * - num_pri_bits stores the actual number of priority bits, whereas the
+	 *   register field stores num_pri_bits - 1.
+	 * - num_id_bits stores the raw field value, which is 0b0000 for 16 bits
+	 *   and 0b0001 for 24 bits.
+	 */
+	p->regval = FIELD_PREP(ICC_IDR0_EL1_PRI_BITS, vcpu->arch.vgic_cpu.num_pri_bits - 1) |
+		    FIELD_PREP(ICC_IDR0_EL1_ID_BITS, vcpu->arch.vgic_cpu.num_id_bits);
+
+	return true;
+}
+
+static bool access_gicv5_iaffid(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+				const struct sys_reg_desc *r)
+{
+	if (p->is_write)
+		return undef_access(vcpu, p, r);
+
+	/*
+	 * For GICv5 VMs, the IAFFID value is the same as the VPE ID. The VPE ID
+	 * is the same as the VCPU's ID.
+	 */
+	p->regval = FIELD_PREP(ICC_IAFFIDR_EL1_IAFFID, vcpu->vcpu_id);
+
+	return true;
+}
+
+static bool access_gicv5_ppi_enabler(struct kvm_vcpu *vcpu,
+				     struct sys_reg_params *p,
+				     const struct sys_reg_desc *r)
+{
+	unsigned long *mask = vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask;
+	struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+	int i;
+
+	/* We never expect to get here with a read! */
+	if (WARN_ON_ONCE(!p->is_write))
+		return undef_access(vcpu, p, r);
+
+	/*
+	 * If we're only handling architected PPIs and the guest writes to the
+	 * enable for the non-architected PPIs, we just return as there's
+	 * nothing to do at all. We don't even allocate the storage for them in
+	 * this case.
+	 */
+	if (VGIC_V5_NR_PRIVATE_IRQS == 64 && p->Op2 % 2)
+		return true;
+
+	/*
+	 * Merge the raw guest write into out bitmap at an offset of either 0 or
+	 * 64, then and it with our PPI mask.
+	 */
+	bitmap_write(cpu_if->vgic_ppi_enabler, p->regval, 64 * (p->Op2 % 2), 64);
+	bitmap_and(cpu_if->vgic_ppi_enabler, cpu_if->vgic_ppi_enabler, mask,
+		   VGIC_V5_NR_PRIVATE_IRQS);
+
+	/*
+	 * Sync the change in enable states to the vgic_irqs. We consider all
+	 * PPIs as we don't expose many to the guest.
+	 */
+	for_each_set_bit(i, mask, VGIC_V5_NR_PRIVATE_IRQS) {
+		u32 intid = vgic_v5_make_ppi(i);
+		struct vgic_irq *irq;
+
+		irq = vgic_get_vcpu_irq(vcpu, intid);
+
+		scoped_guard(raw_spinlock_irqsave, &irq->irq_lock)
+			irq->enabled = test_bit(i, cpu_if->vgic_ppi_enabler);
+
+		vgic_put_irq(vcpu->kvm, irq);
+	}
+
+	return true;
+}
+
 static bool trap_raz_wi(struct kvm_vcpu *vcpu,
 			struct sys_reg_params *p,
 			const struct sys_reg_desc *r)
@@ -1758,6 +1843,7 @@ static u8 pmuver_to_perfmon(u8 pmuver)

 static u64 sanitise_id_aa64pfr0_el1(const struct kvm_vcpu *vcpu, u64 val);
 static u64 sanitise_id_aa64pfr1_el1(const struct kvm_vcpu *vcpu, u64 val);
+static u64 sanitise_id_aa64pfr2_el1(const struct kvm_vcpu *vcpu, u64 val);
 static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val);

 /* Read a sanitised cpufeature ID register by sys_reg_desc */
@@ -1783,10 +1869,7 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu,
 		val = sanitise_id_aa64pfr1_el1(vcpu, val);
 		break;
 	case SYS_ID_AA64PFR2_EL1:
-		val &= ID_AA64PFR2_EL1_FPMR |
-		       (kvm_has_mte(vcpu->kvm) ?
-			ID_AA64PFR2_EL1_MTEFAR | ID_AA64PFR2_EL1_MTESTOREONLY :
-			0);
+		val = sanitise_id_aa64pfr2_el1(vcpu, val);
 		break;
 	case SYS_ID_AA64ISAR1_EL1:
 		if (!vcpu_has_ptrauth(vcpu))
@@ -1985,7 +2068,7 @@ static u64 sanitise_id_aa64pfr0_el1(const struct kvm_vcpu *vcpu, u64 val)
 		val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, CSV3, IMP);
 	}

-	if (vgic_is_v3(vcpu->kvm)) {
+	if (vgic_host_has_gicv3()) {
 		val &= ~ID_AA64PFR0_EL1_GIC_MASK;
 		val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP);
 	}
@@ -2027,6 +2110,23 @@ static u64 sanitise_id_aa64pfr1_el1(const struct kvm_vcpu *vcpu, u64 val)
 	return val;
 }

+static u64 sanitise_id_aa64pfr2_el1(const struct kvm_vcpu *vcpu, u64 val)
+{
+	val &= ID_AA64PFR2_EL1_FPMR |
+	       ID_AA64PFR2_EL1_MTEFAR |
+	       ID_AA64PFR2_EL1_MTESTOREONLY;
+
+	if (!kvm_has_mte(vcpu->kvm)) {
+		val &= ~ID_AA64PFR2_EL1_MTEFAR;
+		val &= ~ID_AA64PFR2_EL1_MTESTOREONLY;
+	}
+
+	if (vgic_host_has_gicv5())
+		val |= SYS_FIELD_PREP_ENUM(ID_AA64PFR2_EL1, GCIE, IMP);
+
+	return val;
+}
+
 static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val)
 {
 	val = ID_REG_LIMIT_FIELD_ENUM(val, ID_AA64DFR0_EL1, DebugVer, V8P8);
@@ -2177,14 +2277,6 @@ static int set_id_aa64pfr0_el1(struct kvm_vcpu *vcpu,
 	     (vcpu_has_nv(vcpu) && !FIELD_GET(ID_AA64PFR0_EL1_EL2, user_val)))
 		return -EINVAL;

-	/*
-	 * If we are running on a GICv5 host and support FEAT_GCIE_LEGACY, then
-	 * we support GICv3. Fail attempts to do anything but set that to IMP.
-	 */
-	if (vgic_is_v3_compat(vcpu->kvm) &&
-	    FIELD_GET(ID_AA64PFR0_EL1_GIC_MASK, user_val) != ID_AA64PFR0_EL1_GIC_IMP)
-		return -EINVAL;
-
 	return set_id_reg(vcpu, rd, user_val);
 }

@@ -2224,6 +2316,12 @@ static int set_id_aa64pfr1_el1(struct kvm_vcpu *vcpu,
 	return set_id_reg(vcpu, rd, user_val);
 }

+static int set_id_aa64pfr2_el1(struct kvm_vcpu *vcpu,
+			       const struct sys_reg_desc *rd, u64 user_val)
+{
+	return set_id_reg(vcpu, rd, user_val);
+}
+
 /*
  * Allow userspace to de-feature a stage-2 translation granule but prevent it
  * from claiming the impossible.
@@ -3205,10 +3303,11 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 				       ID_AA64PFR1_EL1_RES0 |
 				       ID_AA64PFR1_EL1_MPAM_frac |
 				       ID_AA64PFR1_EL1_MTE)),
-	ID_WRITABLE(ID_AA64PFR2_EL1,
-		    ID_AA64PFR2_EL1_FPMR |
-		    ID_AA64PFR2_EL1_MTEFAR |
-		    ID_AA64PFR2_EL1_MTESTOREONLY),
+	ID_FILTERED(ID_AA64PFR2_EL1, id_aa64pfr2_el1,
+		    (ID_AA64PFR2_EL1_FPMR |
+		     ID_AA64PFR2_EL1_MTEFAR |
+		     ID_AA64PFR2_EL1_MTESTOREONLY |
+		     ID_AA64PFR2_EL1_GCIE)),
 	ID_UNALLOCATED(4,3),
 	ID_WRITABLE(ID_AA64ZFR0_EL1, ~ID_AA64ZFR0_EL1_RES0),
 	ID_HIDDEN(ID_AA64SMFR0_EL1),
@@ -3394,6 +3493,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_ICC_AP1R1_EL1), undef_access },
 	{ SYS_DESC(SYS_ICC_AP1R2_EL1), undef_access },
 	{ SYS_DESC(SYS_ICC_AP1R3_EL1), undef_access },
+	{ SYS_DESC(SYS_ICC_IDR0_EL1), access_gicv5_idr0 },
+	{ SYS_DESC(SYS_ICC_IAFFIDR_EL1), access_gicv5_iaffid },
+	{ SYS_DESC(SYS_ICC_PPI_ENABLER0_EL1), access_gicv5_ppi_enabler },
+	{ SYS_DESC(SYS_ICC_PPI_ENABLER1_EL1), access_gicv5_ppi_enabler },
 	{ SYS_DESC(SYS_ICC_DIR_EL1), access_gic_dir },
 	{ SYS_DESC(SYS_ICC_RPR_EL1), undef_access },
 	{ SYS_DESC(SYS_ICC_SGI1R_EL1), access_gic_sgi },
@@ -5650,6 +5753,8 @@ void kvm_calculate_traps(struct kvm_vcpu *vcpu)
 	compute_fgu(kvm, HFGRTR2_GROUP);
 	compute_fgu(kvm, HFGITR2_GROUP);
 	compute_fgu(kvm, HDFGRTR2_GROUP);
+	compute_fgu(kvm, ICH_HFGRTR_GROUP);
+	compute_fgu(kvm, ICH_HFGITR_GROUP);

 	set_bit(KVM_ARCH_FLAG_FGU_INITIALIZED, &kvm->arch.flags);
 out:
@@ -5670,23 +5775,58 @@ int kvm_finalize_sys_regs(struct kvm_vcpu *vcpu)

 	guard(mutex)(&kvm->arch.config_lock);

+	if (vcpu_has_nv(vcpu)) {
+		int ret = kvm_init_nv_sysregs(vcpu);
+		if (ret)
+			return ret;
+	}
+
+	if (kvm_vm_has_ran_once(kvm))
+		return 0;
+
 	/*
 	 * This hacks into the ID registers, so only perform it when the
 	 * first vcpu runs, or the kvm_set_vm_id_reg() helper will scream.
 	 */
-	if (!irqchip_in_kernel(kvm) && !kvm_vm_has_ran_once(kvm)) {
+	if (!irqchip_in_kernel(kvm)) {
 		u64 val;

 		val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC;
 		kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, val);
+		val = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1) & ~ID_AA64PFR2_EL1_GCIE;
+		kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1, val);
 		val = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC;
 		kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, val);
-	}
-
-	if (vcpu_has_nv(vcpu)) {
-		int ret = kvm_init_nv_sysregs(vcpu);
-		if (ret)
-			return ret;
+	} else {
+		/*
+		 * Certain userspace software - QEMU - samples the system
+		 * register state without creating an irqchip, then blindly
+		 * restores the state prior to running the final guest. This
+		 * means that it restores the virtualization & emulation
+		 * capabilities of the host system, rather than something that
+		 * reflects the final guest state. Moreover, it checks that the
+		 * state was "correctly" restored (i.e., verbatim), bailing if
+		 * it isn't, so masking off invalid state isn't an option.
+		 *
+		 * On GICv5 hardware that supports FEAT_GCIE_LEGACY we can run
+		 * both GICv3- and GICv5-based guests. Therefore, we initially
+		 * present both ID_AA64PFR0.GIC and ID_AA64PFR2.GCIE as IMP to
+		 * reflect that userspace can create EITHER a vGICv3 or a
+		 * vGICv5. This is an architecturally invalid combination, of
+		 * course. Once an in-kernel GIC is created, the sysreg state is
+		 * updated to reflect the actual, valid configuration.
+		 *
+		 * Setting both the GIC and GCIE features to IMP unsurprisingly
+		 * results in guests falling over, and hence we need to fix up
+		 * this mess in KVM. Before running for the first time we yet
+		 * again ensure that the GIC and GCIE fields accurately reflect
+		 * the actual hardware the guest should see.
+		 *
+		 * This hack allows legacy QEMU-based GICv3 guests to run
+		 * unmodified on compatible GICv5 hosts, and avoids the inverse
+		 * problem for GICv5-based guests in the future.
+		 */
+		kvm_vgic_finalize_idregs(kvm);
 	}

 	return 0;
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index e9b8b5fc480c..933983bb2005 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -66,12 +66,11 @@ static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type);
  * or through the generic KVM_CREATE_DEVICE API ioctl.
  * irqchip_in_kernel() tells you if this function succeeded or not.
  * @kvm: kvm struct pointer
- * @type: KVM_DEV_TYPE_ARM_VGIC_V[23]
+ * @type: KVM_DEV_TYPE_ARM_VGIC_V[235]
  */
 int kvm_vgic_create(struct kvm *kvm, u32 type)
 {
 	struct kvm_vcpu *vcpu;
-	u64 aa64pfr0, pfr1;
 	unsigned long i;
 	int ret;

@@ -132,8 +131,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)

 	if (type == KVM_DEV_TYPE_ARM_VGIC_V2)
 		kvm->max_vcpus = VGIC_V2_MAX_CPUS;
-	else
+	else if (type == KVM_DEV_TYPE_ARM_VGIC_V3)
 		kvm->max_vcpus = VGIC_V3_MAX_CPUS;
+	else if (type == KVM_DEV_TYPE_ARM_VGIC_V5)
+		kvm->max_vcpus = min(VGIC_V5_MAX_CPUS,
+				     kvm_vgic_global_state.max_gic_vcpus);

 	if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) {
 		ret = -E2BIG;
@@ -145,19 +147,20 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
 	kvm->arch.vgic.implementation_rev = KVM_VGIC_IMP_REV_LATEST;
 	kvm->arch.vgic.vgic_dist_base = VGIC_ADDR_UNDEF;

-	aa64pfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC;
-	pfr1 = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC;
-
-	if (type == KVM_DEV_TYPE_ARM_VGIC_V2) {
+	switch (type) {
+	case KVM_DEV_TYPE_ARM_VGIC_V2:
 		kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF;
-	} else {
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V3:
 		INIT_LIST_HEAD(&kvm->arch.vgic.rd_regions);
-		aa64pfr0 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP);
-		pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3);
+		break;
 	}

-	kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, aa64pfr0);
-	kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, pfr1);
+	/*
+	 * We've now created the GIC. Update the system register state
+	 * to accurately reflect what we've created.
+	 */
+	kvm_vgic_finalize_idregs(kvm);

 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		ret = vgic_allocate_private_irqs_locked(vcpu, type);
@@ -179,6 +182,15 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
 	if (type == KVM_DEV_TYPE_ARM_VGIC_V3)
 		kvm->arch.vgic.nassgicap = system_supports_direct_sgis();

+	/*
+	 * We now know that we have a GICv5. The Arch Timer PPI interrupts may
+	 * have been initialised at this stage, but will have done so assuming
+	 * that we have an older GIC, meaning that the IntIDs won't be
+	 * correct. We init them again, and this time they will be correct.
+	 */
+	if (type == KVM_DEV_TYPE_ARM_VGIC_V5)
+		kvm_timer_init_vm(kvm);
+
 out_unlock:
 	mutex_unlock(&kvm->arch.config_lock);
 	kvm_unlock_all_vcpus(kvm);
@@ -259,9 +271,65 @@ int kvm_vgic_vcpu_nv_init(struct kvm_vcpu *vcpu)
 	return ret;
 }

+static void vgic_allocate_private_irq(struct kvm_vcpu *vcpu, int i, u32 type)
+{
+	struct vgic_irq *irq = &vcpu->arch.vgic_cpu.private_irqs[i];
+
+	INIT_LIST_HEAD(&irq->ap_list);
+	raw_spin_lock_init(&irq->irq_lock);
+	irq->vcpu = NULL;
+	irq->target_vcpu = vcpu;
+	refcount_set(&irq->refcount, 0);
+
+	irq->intid = i;
+	if (vgic_irq_is_sgi(i)) {
+		/* SGIs */
+		irq->enabled = 1;
+		irq->config = VGIC_CONFIG_EDGE;
+	} else {
+		/* PPIs */
+		irq->config = VGIC_CONFIG_LEVEL;
+	}
+
+	switch (type) {
+	case KVM_DEV_TYPE_ARM_VGIC_V3:
+		irq->group = 1;
+		irq->mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V2:
+		irq->group = 0;
+		irq->targets = BIT(vcpu->vcpu_id);
+		break;
+	}
+}
+
+static void vgic_v5_allocate_private_irq(struct kvm_vcpu *vcpu, int i, u32 type)
+{
+	struct vgic_irq *irq = &vcpu->arch.vgic_cpu.private_irqs[i];
+	u32 intid = vgic_v5_make_ppi(i);
+
+	INIT_LIST_HEAD(&irq->ap_list);
+	raw_spin_lock_init(&irq->irq_lock);
+	irq->vcpu = NULL;
+	irq->target_vcpu = vcpu;
+	refcount_set(&irq->refcount, 0);
+
+	irq->intid = intid;
+
+	/* The only Edge architected PPI is the SW_PPI */
+	if (i == GICV5_ARCH_PPI_SW_PPI)
+		irq->config = VGIC_CONFIG_EDGE;
+	else
+		irq->config = VGIC_CONFIG_LEVEL;
+
+	/* Register the GICv5-specific PPI ops */
+	vgic_v5_set_ppi_ops(vcpu, intid);
+}
+
 static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type)
 {
 	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+	u32 num_private_irqs;
 	int i;

 	lockdep_assert_held(&vcpu->kvm->arch.config_lock);
@@ -269,8 +337,13 @@ static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type)
 	if (vgic_cpu->private_irqs)
 		return 0;

+	if (vgic_is_v5(vcpu->kvm))
+		num_private_irqs = VGIC_V5_NR_PRIVATE_IRQS;
+	else
+		num_private_irqs = VGIC_NR_PRIVATE_IRQS;
+
 	vgic_cpu->private_irqs = kzalloc_objs(struct vgic_irq,
-					      VGIC_NR_PRIVATE_IRQS,
+					      num_private_irqs,
 					      GFP_KERNEL_ACCOUNT);

 	if (!vgic_cpu->private_irqs)
@@ -280,34 +353,11 @@ static int vgic_allocate_private_irqs_locked(struct kvm_vcpu *vcpu, u32 type)
 	 * Enable and configure all SGIs to be edge-triggered and
 	 * configure all PPIs as level-triggered.
 	 */
-	for (i = 0; i < VGIC_NR_PRIVATE_IRQS; i++) {
-		struct vgic_irq *irq = &vgic_cpu->private_irqs[i];
-
-		INIT_LIST_HEAD(&irq->ap_list);
-		raw_spin_lock_init(&irq->irq_lock);
-		irq->intid = i;
-		irq->vcpu = NULL;
-		irq->target_vcpu = vcpu;
-		refcount_set(&irq->refcount, 0);
-		if (vgic_irq_is_sgi(i)) {
-			/* SGIs */
-			irq->enabled = 1;
-			irq->config = VGIC_CONFIG_EDGE;
-		} else {
-			/* PPIs */
-			irq->config = VGIC_CONFIG_LEVEL;
-		}
-
-		switch (type) {
-		case KVM_DEV_TYPE_ARM_VGIC_V3:
-			irq->group = 1;
-			irq->mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
-			break;
-		case KVM_DEV_TYPE_ARM_VGIC_V2:
-			irq->group = 0;
-			irq->targets = BIT(vcpu->vcpu_id);
-			break;
-		}
+	for (i = 0; i < num_private_irqs; i++) {
+		if (vgic_is_v5(vcpu->kvm))
+			vgic_v5_allocate_private_irq(vcpu, i, type);
+		else
+			vgic_allocate_private_irq(vcpu, i, type);
 	}

 	return 0;
@@ -366,7 +416,11 @@ int kvm_vgic_vcpu_init(struct kvm_vcpu *vcpu)

 static void kvm_vgic_vcpu_reset(struct kvm_vcpu *vcpu)
 {
-	if (kvm_vgic_global_state.type == VGIC_V2)
+	const struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+
+	if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V5)
+		vgic_v5_reset(vcpu);
+	else if (kvm_vgic_global_state.type == VGIC_V2)
 		vgic_v2_reset(vcpu);
 	else
 		vgic_v3_reset(vcpu);
@@ -397,22 +451,28 @@ int vgic_init(struct kvm *kvm)
 	if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus))
 		return -EBUSY;

-	/* freeze the number of spis */
-	if (!dist->nr_spis)
-		dist->nr_spis = VGIC_NR_IRQS_LEGACY - VGIC_NR_PRIVATE_IRQS;
+	if (!vgic_is_v5(kvm)) {
+		/* freeze the number of spis */
+		if (!dist->nr_spis)
+			dist->nr_spis = VGIC_NR_IRQS_LEGACY - VGIC_NR_PRIVATE_IRQS;

-	ret = kvm_vgic_dist_init(kvm, dist->nr_spis);
-	if (ret)
-		goto out;
+		ret = kvm_vgic_dist_init(kvm, dist->nr_spis);
+		if (ret)
+			return ret;

-	/*
-	 * Ensure vPEs are allocated if direct IRQ injection (e.g. vSGIs,
-	 * vLPIs) is supported.
-	 */
-	if (vgic_supports_direct_irqs(kvm)) {
-		ret = vgic_v4_init(kvm);
+		/*
+		 * Ensure vPEs are allocated if direct IRQ injection (e.g. vSGIs,
+		 * vLPIs) is supported.
+		 */
+		if (vgic_supports_direct_irqs(kvm)) {
+			ret = vgic_v4_init(kvm);
+			if (ret)
+				return ret;
+		}
+	} else {
+		ret = vgic_v5_init(kvm);
 		if (ret)
-			goto out;
+			return ret;
 	}

 	kvm_for_each_vcpu(idx, vcpu, kvm)
@@ -420,12 +480,12 @@ int vgic_init(struct kvm *kvm)

 	ret = kvm_vgic_setup_default_irq_routing(kvm);
 	if (ret)
-		goto out;
+		return ret;

 	vgic_debug_init(kvm);
 	dist->initialized = true;
-out:
-	return ret;
+
+	return 0;
 }

 static void kvm_vgic_dist_destroy(struct kvm *kvm)
@@ -569,6 +629,7 @@ int vgic_lazy_init(struct kvm *kvm)
 int kvm_vgic_map_resources(struct kvm *kvm)
 {
 	struct vgic_dist *dist = &kvm->arch.vgic;
+	bool needs_dist = true;
 	enum vgic_type type;
 	gpa_t dist_base;
 	int ret = 0;
@@ -587,21 +648,29 @@ int kvm_vgic_map_resources(struct kvm *kvm)
 	if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V2) {
 		ret = vgic_v2_map_resources(kvm);
 		type = VGIC_V2;
-	} else {
+	} else if (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) {
 		ret = vgic_v3_map_resources(kvm);
 		type = VGIC_V3;
+	} else {
+		ret = vgic_v5_map_resources(kvm);
+		type = VGIC_V5;
+		needs_dist = false;
 	}

 	if (ret)
 		goto out;

-	dist_base = dist->vgic_dist_base;
-	mutex_unlock(&kvm->arch.config_lock);
+	if (needs_dist) {
+		dist_base = dist->vgic_dist_base;
+		mutex_unlock(&kvm->arch.config_lock);

-	ret = vgic_register_dist_iodev(kvm, dist_base, type);
-	if (ret) {
-		kvm_err("Unable to register VGIC dist MMIO regions\n");
-		goto out_slots;
+		ret = vgic_register_dist_iodev(kvm, dist_base, type);
+		if (ret) {
+			kvm_err("Unable to register VGIC dist MMIO regions\n");
+			goto out_slots;
+		}
+	} else {
+		mutex_unlock(&kvm->arch.config_lock);
 	}

 	smp_store_release(&dist->ready, true);
@@ -617,6 +686,35 @@ out_slots:
 	return ret;
 }

+void kvm_vgic_finalize_idregs(struct kvm *kvm)
+{
+	u32 type = kvm->arch.vgic.vgic_model;
+	u64 aa64pfr0, aa64pfr2, pfr1;
+
+	aa64pfr0 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1) & ~ID_AA64PFR0_EL1_GIC;
+	aa64pfr2 = kvm_read_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1) & ~ID_AA64PFR2_EL1_GCIE;
+	pfr1 = kvm_read_vm_id_reg(kvm, SYS_ID_PFR1_EL1) & ~ID_PFR1_EL1_GIC;
+
+	switch (type) {
+	case KVM_DEV_TYPE_ARM_VGIC_V2:
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V3:
+		aa64pfr0 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR0_EL1, GIC, IMP);
+		if (kvm_supports_32bit_el0())
+			pfr1 |= SYS_FIELD_PREP_ENUM(ID_PFR1_EL1, GIC, GICv3);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V5:
+		aa64pfr2 |= SYS_FIELD_PREP_ENUM(ID_AA64PFR2_EL1, GCIE, IMP);
+		break;
+	default:
+		WARN_ONCE(1, "Unknown VGIC type!!!\n");
+	}
+
+	kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR0_EL1, aa64pfr0);
+	kvm_set_vm_id_reg(kvm, SYS_ID_AA64PFR2_EL1, aa64pfr2);
+	kvm_set_vm_id_reg(kvm, SYS_ID_PFR1_EL1, pfr1);
+}
+
 /* GENERIC PROBE */

 void kvm_vgic_cpu_up(void)
diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c b/arch/arm64/kvm/vgic/vgic-kvm-device.c
index 3d1a776b716d..a96c77dccf35 100644
--- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
+++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
@@ -336,6 +336,10 @@ int kvm_register_vgic_device(unsigned long type)
 			break;
 		ret = kvm_vgic_register_its_device();
 		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V5:
+		ret = kvm_register_device_ops(&kvm_arm_vgic_v5_ops,
+					      KVM_DEV_TYPE_ARM_VGIC_V5);
+		break;
 	}

 	return ret;
@@ -639,7 +643,7 @@ static int vgic_v3_set_attr(struct kvm_device *dev,
 		if (vgic_initialized(dev->kvm))
 			return -EBUSY;

-		if (!irq_is_ppi(val))
+		if (!irq_is_ppi(dev->kvm, val))
 			return -EINVAL;

 		dev->kvm->arch.vgic.mi_intid = val;
@@ -715,3 +719,104 @@ struct kvm_device_ops kvm_arm_vgic_v3_ops = {
 	.get_attr = vgic_v3_get_attr,
 	.has_attr = vgic_v3_has_attr,
 };
+
+static int vgic_v5_get_userspace_ppis(struct kvm_device *dev,
+				      struct kvm_device_attr *attr)
+{
+	struct vgic_v5_vm *gicv5_vm = &dev->kvm->arch.vgic.gicv5_vm;
+	u64 __user *uaddr = (u64 __user *)(long)attr->addr;
+	int ret;
+
+	guard(mutex)(&dev->kvm->arch.config_lock);
+
+	/*
+	 * We either support 64 or 128 PPIs. In the former case, we need to
+	 * return 0s for the second 64 bits as we have no storage backing those.
+	 */
+	ret = put_user(bitmap_read(gicv5_vm->userspace_ppis, 0, 64), uaddr);
+	if (ret)
+		return ret;
+	uaddr++;
+
+	if (VGIC_V5_NR_PRIVATE_IRQS == 128)
+		ret = put_user(bitmap_read(gicv5_vm->userspace_ppis, 64, 128), uaddr);
+	else
+		ret = put_user(0, uaddr);
+
+	return ret;
+}
+
+static int vgic_v5_set_attr(struct kvm_device *dev,
+			    struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_ARM_VGIC_GRP_ADDR:
+	case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
+	case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
+		return -ENXIO;
+	case KVM_DEV_ARM_VGIC_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_ARM_VGIC_CTRL_INIT:
+			return vgic_set_common_attr(dev, attr);
+		case KVM_DEV_ARM_VGIC_USERSPACE_PPIS:
+		default:
+			return -ENXIO;
+		}
+	default:
+		return -ENXIO;
+	}
+
+}
+
+static int vgic_v5_get_attr(struct kvm_device *dev,
+			    struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_ARM_VGIC_GRP_ADDR:
+	case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
+	case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
+		return -ENXIO;
+	case KVM_DEV_ARM_VGIC_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_ARM_VGIC_CTRL_INIT:
+			return vgic_get_common_attr(dev, attr);
+		case KVM_DEV_ARM_VGIC_USERSPACE_PPIS:
+			return vgic_v5_get_userspace_ppis(dev, attr);
+		default:
+			return -ENXIO;
+		}
+	default:
+		return -ENXIO;
+	}
+}
+
+static int vgic_v5_has_attr(struct kvm_device *dev,
+			    struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_ARM_VGIC_GRP_ADDR:
+	case KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS:
+	case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
+		return -ENXIO;
+	case KVM_DEV_ARM_VGIC_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_ARM_VGIC_CTRL_INIT:
+			return 0;
+		case KVM_DEV_ARM_VGIC_USERSPACE_PPIS:
+			return 0;
+		default:
+			return -ENXIO;
+		}
+	default:
+		return -ENXIO;
+	}
+}
+
+struct kvm_device_ops kvm_arm_vgic_v5_ops = {
+	.name = "kvm-arm-vgic-v5",
+	.create = vgic_create,
+	.destroy = vgic_destroy,
+	.set_attr = vgic_v5_set_attr,
+	.get_attr = vgic_v5_get_attr,
+	.has_attr = vgic_v5_has_attr,
+};
diff --git a/arch/arm64/kvm/vgic/vgic-mmio.c b/arch/arm64/kvm/vgic/vgic-mmio.c
index a573b1f0c6cb..74d76dec9730 100644
--- a/arch/arm64/kvm/vgic/vgic-mmio.c
+++ b/arch/arm64/kvm/vgic/vgic-mmio.c
@@ -842,18 +842,46 @@ vgic_find_mmio_region(const struct vgic_register_region *regions,

 void vgic_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr)
 {
-	if (kvm_vgic_global_state.type == VGIC_V2)
-		vgic_v2_set_vmcr(vcpu, vmcr);
-	else
+	const struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+
+	switch (dist->vgic_model) {
+	case KVM_DEV_TYPE_ARM_VGIC_V5:
+		vgic_v5_set_vmcr(vcpu, vmcr);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V3:
 		vgic_v3_set_vmcr(vcpu, vmcr);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V2:
+		if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif))
+			vgic_v3_set_vmcr(vcpu, vmcr);
+		else
+			vgic_v2_set_vmcr(vcpu, vmcr);
+		break;
+	default:
+		BUG();
+	}
 }

 void vgic_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr)
 {
-	if (kvm_vgic_global_state.type == VGIC_V2)
-		vgic_v2_get_vmcr(vcpu, vmcr);
-	else
+	const struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
+
+	switch (dist->vgic_model) {
+	case KVM_DEV_TYPE_ARM_VGIC_V5:
+		vgic_v5_get_vmcr(vcpu, vmcr);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V3:
 		vgic_v3_get_vmcr(vcpu, vmcr);
+		break;
+	case KVM_DEV_TYPE_ARM_VGIC_V2:
+		if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif))
+			vgic_v3_get_vmcr(vcpu, vmcr);
+		else
+			vgic_v2_get_vmcr(vcpu, vmcr);
+		break;
+	default:
+		BUG();
+	}
 }

 /*
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 6a355eca1934..9e841e7afd4a 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -499,7 +499,7 @@ void vcpu_set_ich_hcr(struct kvm_vcpu *vcpu)
 {
 	struct vgic_v3_cpu_if *vgic_v3 = &vcpu->arch.vgic_cpu.vgic_v3;

-	if (!vgic_is_v3(vcpu->kvm))
+	if (!vgic_host_has_gicv3())
 		return;

 	/* Hide GICv3 sysreg if necessary */
diff --git a/arch/arm64/kvm/vgic/vgic-v5.c b/arch/arm64/kvm/vgic/vgic-v5.c
index 331651087e2c..fdd39ea7f83e 100644
--- a/arch/arm64/kvm/vgic/vgic-v5.c
+++ b/arch/arm64/kvm/vgic/vgic-v5.c
@@ -1,28 +1,52 @@
 // SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2025, 2026 Arm Ltd.
+ */

 #include <kvm/arm_vgic.h>
+
+#include <linux/bitops.h>
 #include <linux/irqchip/arm-vgic-info.h>

 #include "vgic.h"

+static struct vgic_v5_ppi_caps ppi_caps;
+
+/*
+ * Not all PPIs are guaranteed to be implemented for GICv5. Deterermine which
+ * ones are, and generate a mask.
+ */
+static void vgic_v5_get_implemented_ppis(void)
+{
+	if (!cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF))
+		return;
+
+	/*
+	 * If we have KVM, we have EL2, which means that we have support for the
+	 * EL1 and EL2 Physical & Virtual timers.
+	 */
+	__assign_bit(GICV5_ARCH_PPI_CNTHP, ppi_caps.impl_ppi_mask, 1);
+	__assign_bit(GICV5_ARCH_PPI_CNTV, ppi_caps.impl_ppi_mask, 1);
+	__assign_bit(GICV5_ARCH_PPI_CNTHV, ppi_caps.impl_ppi_mask, 1);
+	__assign_bit(GICV5_ARCH_PPI_CNTP, ppi_caps.impl_ppi_mask, 1);
+
+	/* The SW_PPI should be available */
+	__assign_bit(GICV5_ARCH_PPI_SW_PPI, ppi_caps.impl_ppi_mask, 1);
+
+	/* The PMUIRQ is available if we have the PMU */
+	__assign_bit(GICV5_ARCH_PPI_PMUIRQ, ppi_caps.impl_ppi_mask, system_supports_pmuv3());
+}
+
 /*
  * Probe for a vGICv5 compatible interrupt controller, returning 0 on success.
- * Currently only supports GICv3-based VMs on a GICv5 host, and hence only
- * registers a VGIC_V3 device.
  */
 int vgic_v5_probe(const struct gic_kvm_info *info)
 {
+	bool v5_registered = false;
 	u64 ich_vtr_el2;
 	int ret;

-	if (!cpus_have_final_cap(ARM64_HAS_GICV5_LEGACY))
-		return -ENODEV;
-
 	kvm_vgic_global_state.type = VGIC_V5;
-	kvm_vgic_global_state.has_gcie_v3_compat = true;
-
-	/* We only support v3 compat mode - use vGICv3 limits */
-	kvm_vgic_global_state.max_gic_vcpus = VGIC_V3_MAX_CPUS;

 	kvm_vgic_global_state.vcpu_base = 0;
 	kvm_vgic_global_state.vctrl_base = NULL;
@@ -30,6 +54,38 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
 	kvm_vgic_global_state.has_gicv4 = false;
 	kvm_vgic_global_state.has_gicv4_1 = false;

+	/*
+	 * GICv5 is currently not supported in Protected mode. Skip the
+	 * registration of GICv5 completely to make sure no guests can create a
+	 * GICv5-based guest.
+	 */
+	if (is_protected_kvm_enabled()) {
+		kvm_info("GICv5-based guests are not supported with pKVM\n");
+		goto skip_v5;
+	}
+
+	kvm_vgic_global_state.max_gic_vcpus = VGIC_V5_MAX_CPUS;
+
+	vgic_v5_get_implemented_ppis();
+
+	ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V5);
+	if (ret) {
+		kvm_err("Cannot register GICv5 KVM device.\n");
+		goto skip_v5;
+	}
+
+	v5_registered = true;
+	kvm_info("GCIE system register CPU interface\n");
+
+skip_v5:
+	/* If we don't support the GICv3 compat mode we're done. */
+	if (!cpus_have_final_cap(ARM64_HAS_GICV5_LEGACY)) {
+		if (!v5_registered)
+			return -ENODEV;
+		return 0;
+	}
+
+	kvm_vgic_global_state.has_gcie_v3_compat = true;
 	ich_vtr_el2 = kvm_call_hyp_ret(__vgic_v3_get_gic_config);
 	kvm_vgic_global_state.ich_vtr_el2 = (u32)ich_vtr_el2;

@@ -45,6 +101,10 @@ int vgic_v5_probe(const struct gic_kvm_info *info)
 		return ret;
 	}

+	/* We potentially limit the max VCPUs further than we need to here */
+	kvm_vgic_global_state.max_gic_vcpus = min(VGIC_V3_MAX_CPUS,
+						  VGIC_V5_MAX_CPUS);
+
 	static_branch_enable(&kvm_vgic_global_state.gicv3_cpuif);
 	kvm_info("GCIE legacy system register CPU interface\n");

@@ -52,3 +112,424 @@ int vgic_v5_probe(const struct gic_kvm_info *info)

 	return 0;
 }
+
+void vgic_v5_reset(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * We always present 16-bits of ID space to the guest, irrespective of
+	 * the host allowing more.
+	 */
+	vcpu->arch.vgic_cpu.num_id_bits = ICC_IDR0_EL1_ID_BITS_16BITS;
+
+	/*
+	 * The GICv5 architeture only supports 5-bits of priority in the
+	 * CPUIF (but potentially fewer in the IRS).
+	 */
+	vcpu->arch.vgic_cpu.num_pri_bits = 5;
+}
+
+int vgic_v5_init(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long idx;
+
+	if (vgic_initialized(kvm))
+		return 0;
+
+	kvm_for_each_vcpu(idx, vcpu, kvm) {
+		if (vcpu_has_nv(vcpu)) {
+			kvm_err("Nested GICv5 VMs are currently unsupported\n");
+			return -EINVAL;
+		}
+	}
+
+	/* We only allow userspace to drive the SW_PPI, if it is implemented. */
+	bitmap_zero(kvm->arch.vgic.gicv5_vm.userspace_ppis,
+		    VGIC_V5_NR_PRIVATE_IRQS);
+	__assign_bit(GICV5_ARCH_PPI_SW_PPI,
+		     kvm->arch.vgic.gicv5_vm.userspace_ppis,
+		     VGIC_V5_NR_PRIVATE_IRQS);
+	bitmap_and(kvm->arch.vgic.gicv5_vm.userspace_ppis,
+		   kvm->arch.vgic.gicv5_vm.userspace_ppis,
+		   ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS);
+
+	return 0;
+}
+
+int vgic_v5_map_resources(struct kvm *kvm)
+{
+	if (!vgic_initialized(kvm))
+		return -EBUSY;
+
+	return 0;
+}
+
+int vgic_v5_finalize_ppi_state(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu0;
+	int i;
+
+	if (!vgic_is_v5(kvm))
+		return 0;
+
+	guard(mutex)(&kvm->arch.config_lock);
+
+	/*
+	 * If SW_PPI has been advertised, then we know we already
+	 * initialised the whole thing, and we can return early. Yes,
+	 * this is pretty hackish as far as state tracking goes...
+	 */
+	if (test_bit(GICV5_ARCH_PPI_SW_PPI, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask))
+		return 0;
+
+	/* The PPI state for all VCPUs should be the same. Pick the first. */
+	vcpu0 = kvm_get_vcpu(kvm, 0);
+
+	bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS);
+	bitmap_zero(kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr, VGIC_V5_NR_PRIVATE_IRQS);
+
+	for_each_set_bit(i, ppi_caps.impl_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) {
+		const u32 intid = vgic_v5_make_ppi(i);
+		struct vgic_irq *irq;
+
+		irq = vgic_get_vcpu_irq(vcpu0, intid);
+
+		/* Expose PPIs with an owner or the SW_PPI, only */
+		scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) {
+			if (irq->owner || i == GICV5_ARCH_PPI_SW_PPI) {
+				__assign_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, 1);
+				__assign_bit(i, kvm->arch.vgic.gicv5_vm.vgic_ppi_hmr,
+					     irq->config == VGIC_CONFIG_LEVEL);
+			}
+		}
+
+		vgic_put_irq(vcpu0->kvm, irq);
+	}
+
+	return 0;
+}
+
+static u32 vgic_v5_get_effective_priority_mask(struct kvm_vcpu *vcpu)
+{
+	struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+	u32 highest_ap, priority_mask, apr;
+
+	/*
+	 * If the guest's CPU has not opted to receive interrupts, then the
+	 * effective running priority is the highest priority. Just return 0
+	 * (the highest priority).
+	 */
+	if (!FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_EN, cpu_if->vgic_vmcr))
+		return 0;
+
+	/*
+	 * Counting the number of trailing zeros gives the current active
+	 * priority. Explicitly use the 32-bit version here as we have 32
+	 * priorities. 32 then means that there are no active priorities.
+	 */
+	apr = cpu_if->vgic_apr;
+	highest_ap = apr ? __builtin_ctz(apr) : 32;
+
+	/*
+	 * An interrupt is of sufficient priority if it is equal to or
+	 * greater than the priority mask. Add 1 to the priority mask
+	 * (i.e., lower priority) to match the APR logic before taking
+	 * the min. This gives us the lowest priority that is masked.
+	 */
+	priority_mask = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_VPMR, cpu_if->vgic_vmcr);
+
+	return min(highest_ap, priority_mask + 1);
+}
+
+/*
+ * For GICv5, the PPIs are mostly directly managed by the hardware. We (the
+ * hypervisor) handle the pending, active, enable state save/restore, but don't
+ * need the PPIs to be queued on a per-VCPU AP list. Therefore, sanity check the
+ * state, unlock, and return.
+ */
+bool vgic_v5_ppi_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq,
+				  unsigned long flags)
+	__releases(&irq->irq_lock)
+{
+	struct kvm_vcpu *vcpu;
+
+	lockdep_assert_held(&irq->irq_lock);
+
+	if (WARN_ON_ONCE(!__irq_is_ppi(KVM_DEV_TYPE_ARM_VGIC_V5, irq->intid)))
+		goto out_unlock_fail;
+
+	vcpu = irq->target_vcpu;
+	if (WARN_ON_ONCE(!vcpu))
+		goto out_unlock_fail;
+
+	raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
+
+	/* Directly kick the target VCPU to make sure it sees the IRQ */
+	kvm_make_request(KVM_REQ_IRQ_PENDING, vcpu);
+	kvm_vcpu_kick(vcpu);
+
+	return true;
+
+out_unlock_fail:
+	raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
+
+	return false;
+}
+
+/*
+ * Sets/clears the corresponding bit in the ICH_PPI_DVIR register.
+ */
+void vgic_v5_set_ppi_dvi(struct kvm_vcpu *vcpu, struct vgic_irq *irq, bool dvi)
+{
+	struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+	u32 ppi;
+
+	lockdep_assert_held(&irq->irq_lock);
+
+	ppi = vgic_v5_get_hwirq_id(irq->intid);
+	__assign_bit(ppi, cpu_if->vgic_ppi_dvir, dvi);
+}
+
+static struct irq_ops vgic_v5_ppi_irq_ops = {
+	.queue_irq_unlock = vgic_v5_ppi_queue_irq_unlock,
+	.set_direct_injection = vgic_v5_set_ppi_dvi,
+};
+
+void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid)
+{
+	kvm_vgic_set_irq_ops(vcpu, vintid, &vgic_v5_ppi_irq_ops);
+}
+
+/*
+ * Sync back the PPI priorities to the vgic_irq shadow state for any interrupts
+ * exposed to the guest (skipping all others).
+ */
+static void vgic_v5_sync_ppi_priorities(struct kvm_vcpu *vcpu)
+{
+	struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5;
+	u64 priorityr;
+	int i;
+
+	/*
+	 * We have up to 16 PPI Priority regs, but only have a few interrupts
+	 * that the guest is allowed to use. Limit our sync of PPI priorities to
+	 * those actually exposed to the guest by first iterating over the mask
+	 * of exposed PPIs.
+	 */
+	for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) {
+		u32 intid = vgic_v5_make_ppi(i);
+		struct vgic_irq *irq;
+		int pri_idx, pri_reg, pri_bit;
+		u8 priority;
+
+		/*
+		 * Determine which priority register and the field within it to
+		 * extract.
+ */ + pri_reg = i / 8; + pri_idx = i % 8; + pri_bit = pri_idx * 8; + + priorityr = cpu_if->vgic_ppi_priorityr[pri_reg]; + priority = field_get(GENMASK(pri_bit + 4, pri_bit), priorityr); + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + irq->priority = priority; + + vgic_put_irq(vcpu->kvm, irq); + } +} + +bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu) +{ + unsigned int priority_mask; + int i; + + priority_mask = vgic_v5_get_effective_priority_mask(vcpu); + + /* + * If the combined priority mask is 0, nothing can be signalled! In the + * case where the guest has disabled interrupt delivery for the vcpu + * (via ICV_CR0_EL1.EN->ICH_VMCR_EL2.EN), we calculate the priority mask + * as 0 too (the highest possible priority). + */ + if (!priority_mask) + return false; + + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, VGIC_V5_NR_PRIVATE_IRQS) { + u32 intid = vgic_v5_make_ppi(i); + bool has_pending = false; + struct vgic_irq *irq; + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + if (irq->enabled && irq->priority < priority_mask) + has_pending = irq->hw ? vgic_get_phys_line_level(irq) : irq_is_pending(irq); + + vgic_put_irq(vcpu->kvm, irq); + + if (has_pending) + return true; + } + + return false; +} + +/* + * Detect any PPIs state changes, and propagate the state with KVM's + * shadow structures. 
+ */ +void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + unsigned long *activer, *pendr; + int i; + + activer = host_data_ptr(vgic_v5_ppi_state)->activer_exit; + pendr = host_data_ptr(vgic_v5_ppi_state)->pendr; + + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, + VGIC_V5_NR_PRIVATE_IRQS) { + u32 intid = vgic_v5_make_ppi(i); + struct vgic_irq *irq; + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { + irq->active = test_bit(i, activer); + + /* This is an OR to avoid losing incoming edges! */ + if (irq->config == VGIC_CONFIG_EDGE) + irq->pending_latch |= test_bit(i, pendr); + } + + vgic_put_irq(vcpu->kvm, irq); + } + + /* + * Re-inject the exit state as entry state next time! + * + * Note that the write of the Enable state is trapped, and hence there + * is nothing to explicitly sync back here as we already have the latest + * copy by definition. + */ + bitmap_copy(cpu_if->vgic_ppi_activer, activer, VGIC_V5_NR_PRIVATE_IRQS); +} + +void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu) +{ + DECLARE_BITMAP(pendr, VGIC_V5_NR_PRIVATE_IRQS); + int i; + + /* + * Time to enter the guest - we first need to build the guest's + * ICC_PPI_PENDRx_EL1, however. + */ + bitmap_zero(pendr, VGIC_V5_NR_PRIVATE_IRQS); + for_each_set_bit(i, vcpu->kvm->arch.vgic.gicv5_vm.vgic_ppi_mask, + VGIC_V5_NR_PRIVATE_IRQS) { + u32 intid = vgic_v5_make_ppi(i); + struct vgic_irq *irq; + + irq = vgic_get_vcpu_irq(vcpu, intid); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) { + __assign_bit(i, pendr, irq_is_pending(irq)); + if (irq->config == VGIC_CONFIG_EDGE) + irq->pending_latch = false; + } + + vgic_put_irq(vcpu->kvm, irq); + } + + /* + * Copy the shadow state to the pending reg that will be written to the + * ICH_PPI_PENDRx_EL2 regs. While the guest is running we track any + * incoming changes to the pending state in the vgic_irq structures. 
The + * incoming changes are merged with the outgoing changes on the return + * path. + */ + bitmap_copy(host_data_ptr(vgic_v5_ppi_state)->pendr, pendr, + VGIC_V5_NR_PRIVATE_IRQS); +} + +void vgic_v5_load(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + + /* + * On the WFI path, vgic_load is called a second time. The first is when + * scheduling in the vcpu thread again, and the second is when leaving + * WFI. Skip the second instance as it serves no purpose and just + * restores the same state again. + */ + if (cpu_if->gicv5_vpe.resident) + return; + + kvm_call_hyp(__vgic_v5_restore_vmcr_apr, cpu_if); + + cpu_if->gicv5_vpe.resident = true; +} + +void vgic_v5_put(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + + /* + * Do nothing if we're not resident. This can happen in the WFI path + * where we do a vgic_put in the WFI path and again later when + * descheduling the thread. We risk losing VMCR state if we sync it + * twice, so instead return early in this case. 
+ */ + if (!cpu_if->gicv5_vpe.resident) + return; + + kvm_call_hyp(__vgic_v5_save_apr, cpu_if); + + cpu_if->gicv5_vpe.resident = false; + + /* The shadow priority is only updated on entering WFI */ + if (vcpu_get_flag(vcpu, IN_WFI)) + vgic_v5_sync_ppi_priorities(vcpu); +} + +void vgic_v5_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcrp) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + u64 vmcr = cpu_if->vgic_vmcr; + + vmcrp->en = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_EN, vmcr); + vmcrp->pmr = FIELD_GET(FEAT_GCIE_ICH_VMCR_EL2_VPMR, vmcr); +} + +void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcrp) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + u64 vmcr; + + vmcr = FIELD_PREP(FEAT_GCIE_ICH_VMCR_EL2_VPMR, vmcrp->pmr) | + FIELD_PREP(FEAT_GCIE_ICH_VMCR_EL2_EN, vmcrp->en); + + cpu_if->vgic_vmcr = vmcr; +} + +void vgic_v5_restore_state(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + + __vgic_v5_restore_state(cpu_if); + __vgic_v5_restore_ppi_state(cpu_if); + dsb(sy); +} + +void vgic_v5_save_state(struct kvm_vcpu *vcpu) +{ + struct vgic_v5_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v5; + + __vgic_v5_save_state(cpu_if); + __vgic_v5_save_ppi_state(cpu_if); + dsb(sy); +} diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c index e22b79cfff96..1e9fe8764584 100644 --- a/arch/arm64/kvm/vgic/vgic.c +++ b/arch/arm64/kvm/vgic/vgic.c @@ -86,6 +86,10 @@ static struct vgic_irq *vgic_get_lpi(struct kvm *kvm, u32 intid) */ struct vgic_irq *vgic_get_irq(struct kvm *kvm, u32 intid) { + /* Non-private IRQs are not yet implemented for GICv5 */ + if (vgic_is_v5(kvm)) + return NULL; + /* SPIs */ if (intid >= VGIC_NR_PRIVATE_IRQS && intid < (kvm->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS)) { @@ -94,7 +98,7 @@ struct vgic_irq *vgic_get_irq(struct kvm *kvm, u32 intid) } /* LPIs */ - if (intid >= VGIC_MIN_LPI) + if (irq_is_lpi(kvm, intid)) return vgic_get_lpi(kvm, intid); return 
NULL; @@ -105,6 +109,18 @@ struct vgic_irq *vgic_get_vcpu_irq(struct kvm_vcpu *vcpu, u32 intid) if (WARN_ON(!vcpu)) return NULL; + if (vgic_is_v5(vcpu->kvm)) { + u32 int_num, hwirq_id; + + if (!__irq_is_ppi(KVM_DEV_TYPE_ARM_VGIC_V5, intid)) + return NULL; + + hwirq_id = FIELD_GET(GICV5_HWIRQ_ID, intid); + int_num = array_index_nospec(hwirq_id, VGIC_V5_NR_PRIVATE_IRQS); + + return &vcpu->arch.vgic_cpu.private_irqs[int_num]; + } + /* SGIs and PPIs */ if (intid < VGIC_NR_PRIVATE_IRQS) { intid = array_index_nospec(intid, VGIC_NR_PRIVATE_IRQS); @@ -123,7 +139,7 @@ static void vgic_release_lpi_locked(struct vgic_dist *dist, struct vgic_irq *irq static __must_check bool __vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq) { - if (irq->intid < VGIC_MIN_LPI) + if (!irq_is_lpi(kvm, irq->intid)) return false; return refcount_dec_and_test(&irq->refcount); @@ -148,7 +164,7 @@ void vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq) * Acquire/release it early on lockdep kernels to make locking issues * in rare release paths a bit more obvious. 
*/ - if (IS_ENABLED(CONFIG_LOCKDEP) && irq->intid >= VGIC_MIN_LPI) { + if (IS_ENABLED(CONFIG_LOCKDEP) && irq_is_lpi(kvm, irq->intid)) { guard(spinlock_irqsave)(&dist->lpi_xa.xa_lock); } @@ -186,7 +202,7 @@ void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu) raw_spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags); list_for_each_entry_safe(irq, tmp, &vgic_cpu->ap_list_head, ap_list) { - if (irq->intid >= VGIC_MIN_LPI) { + if (irq_is_lpi(vcpu->kvm, irq->intid)) { raw_spin_lock(&irq->irq_lock); list_del(&irq->ap_list); irq->vcpu = NULL; @@ -404,6 +420,9 @@ bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq, lockdep_assert_held(&irq->irq_lock); + if (irq->ops && irq->ops->queue_irq_unlock) + return irq->ops->queue_irq_unlock(kvm, irq, flags); + retry: vcpu = vgic_target_oracle(irq); if (irq->vcpu || !vcpu) { @@ -521,12 +540,12 @@ int kvm_vgic_inject_irq(struct kvm *kvm, struct kvm_vcpu *vcpu, if (ret) return ret; - if (!vcpu && intid < VGIC_NR_PRIVATE_IRQS) + if (!vcpu && irq_is_private(kvm, intid)) return -EINVAL; trace_vgic_update_irq_pending(vcpu ? 
vcpu->vcpu_idx : 0, intid, level); - if (intid < VGIC_NR_PRIVATE_IRQS) + if (irq_is_private(kvm, intid)) irq = vgic_get_vcpu_irq(vcpu, intid); else irq = vgic_get_irq(kvm, intid); @@ -553,10 +572,27 @@ int kvm_vgic_inject_irq(struct kvm *kvm, struct kvm_vcpu *vcpu, return 0; } +void kvm_vgic_set_irq_ops(struct kvm_vcpu *vcpu, u32 vintid, + struct irq_ops *ops) +{ + struct vgic_irq *irq = vgic_get_vcpu_irq(vcpu, vintid); + + BUG_ON(!irq); + + scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) + irq->ops = ops; + + vgic_put_irq(vcpu->kvm, irq); +} + +void kvm_vgic_clear_irq_ops(struct kvm_vcpu *vcpu, u32 vintid) +{ + kvm_vgic_set_irq_ops(vcpu, vintid, NULL); +} + /* @irq->irq_lock must be held */ static int kvm_vgic_map_irq(struct kvm_vcpu *vcpu, struct vgic_irq *irq, - unsigned int host_irq, - struct irq_ops *ops) + unsigned int host_irq) { struct irq_desc *desc; struct irq_data *data; @@ -576,20 +612,25 @@ static int kvm_vgic_map_irq(struct kvm_vcpu *vcpu, struct vgic_irq *irq, irq->hw = true; irq->host_irq = host_irq; irq->hwintid = data->hwirq; - irq->ops = ops; + + if (irq->ops && irq->ops->set_direct_injection) + irq->ops->set_direct_injection(vcpu, irq, true); + return 0; } /* @irq->irq_lock must be held */ static inline void kvm_vgic_unmap_irq(struct vgic_irq *irq) { + if (irq->ops && irq->ops->set_direct_injection) + irq->ops->set_direct_injection(irq->target_vcpu, irq, false); + irq->hw = false; irq->hwintid = 0; - irq->ops = NULL; } int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, - u32 vintid, struct irq_ops *ops) + u32 vintid) { struct vgic_irq *irq = vgic_get_vcpu_irq(vcpu, vintid); unsigned long flags; @@ -598,7 +639,7 @@ int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, BUG_ON(!irq); raw_spin_lock_irqsave(&irq->irq_lock, flags); - ret = kvm_vgic_map_irq(vcpu, irq, host_irq, ops); + ret = kvm_vgic_map_irq(vcpu, irq, host_irq); raw_spin_unlock_irqrestore(&irq->irq_lock, flags); vgic_put_irq(vcpu->kvm, irq); 
@@ -685,7 +726,7 @@ int kvm_vgic_set_owner(struct kvm_vcpu *vcpu, unsigned int intid, void *owner) return -EAGAIN; /* SGIs and LPIs cannot be wired up to any device */ - if (!irq_is_ppi(intid) && !vgic_valid_spi(vcpu->kvm, intid)) + if (!irq_is_ppi(vcpu->kvm, intid) && !vgic_valid_spi(vcpu->kvm, intid)) return -EINVAL; irq = vgic_get_vcpu_irq(vcpu, intid); @@ -812,8 +853,13 @@ retry: vgic_release_deleted_lpis(vcpu->kvm); } -static inline void vgic_fold_lr_state(struct kvm_vcpu *vcpu) +static void vgic_fold_state(struct kvm_vcpu *vcpu) { + if (vgic_is_v5(vcpu->kvm)) { + vgic_v5_fold_ppi_state(vcpu); + return; + } + if (!*host_data_ptr(last_lr_irq)) return; @@ -1002,7 +1048,10 @@ static inline bool can_access_vgic_from_kernel(void) static inline void vgic_save_state(struct kvm_vcpu *vcpu) { - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + /* No switch statement here. See comment in vgic_restore_state() */ + if (vgic_is_v5(vcpu->kvm)) + vgic_v5_save_state(vcpu); + else if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) vgic_v2_save_state(vcpu); else __vgic_v3_save_state(&vcpu->arch.vgic_cpu.vgic_v3); @@ -1011,20 +1060,24 @@ static inline void vgic_save_state(struct kvm_vcpu *vcpu) /* Sync back the hardware VGIC state into our emulation after a guest's run. 
*/ void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu) { - /* If nesting, emulate the HW effect from L0 to L1 */ - if (vgic_state_is_nested(vcpu)) { - vgic_v3_sync_nested(vcpu); - return; - } + if (vgic_is_v3(vcpu->kvm)) { + /* If nesting, emulate the HW effect from L0 to L1 */ + if (vgic_state_is_nested(vcpu)) { + vgic_v3_sync_nested(vcpu); + return; + } - if (vcpu_has_nv(vcpu)) - vgic_v3_nested_update_mi(vcpu); + if (vcpu_has_nv(vcpu)) + vgic_v3_nested_update_mi(vcpu); + } if (can_access_vgic_from_kernel()) vgic_save_state(vcpu); - vgic_fold_lr_state(vcpu); - vgic_prune_ap_list(vcpu); + vgic_fold_state(vcpu); + + if (!vgic_is_v5(vcpu->kvm)) + vgic_prune_ap_list(vcpu); } /* Sync interrupts that were deactivated through a DIR trap */ @@ -1040,12 +1093,34 @@ void kvm_vgic_process_async_update(struct kvm_vcpu *vcpu) static inline void vgic_restore_state(struct kvm_vcpu *vcpu) { - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + /* + * As nice as it would be to restructure this code into a switch + * statement as can be found elsewhere, the logic quickly gets ugly. + * + * __vgic_v3_restore_state() is doing a lot of heavy lifting here. It is + * required for GICv3-on-GICv3, GICv2-on-GICv3, GICv3-on-GICv5, and the + * no-in-kernel-irqchip case on GICv3 hardware. Hence, adding a switch + * here results in much more complex code. + */ + if (vgic_is_v5(vcpu->kvm)) + vgic_v5_restore_state(vcpu); + else if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) vgic_v2_restore_state(vcpu); else __vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3); } +static void vgic_flush_state(struct kvm_vcpu *vcpu) +{ + if (vgic_is_v5(vcpu->kvm)) { + vgic_v5_flush_ppi_state(vcpu); + return; + } + + scoped_guard(raw_spinlock, &vcpu->arch.vgic_cpu.ap_list_lock) + vgic_flush_lr_state(vcpu); +} + /* Flush our emulation state into the GIC hardware before entering the guest. 
*/ void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu) { @@ -1082,42 +1157,69 @@ void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu) DEBUG_SPINLOCK_BUG_ON(!irqs_disabled()); - scoped_guard(raw_spinlock, &vcpu->arch.vgic_cpu.ap_list_lock) - vgic_flush_lr_state(vcpu); + vgic_flush_state(vcpu); if (can_access_vgic_from_kernel()) vgic_restore_state(vcpu); - if (vgic_supports_direct_irqs(vcpu->kvm)) + if (vgic_supports_direct_irqs(vcpu->kvm) && kvm_vgic_global_state.has_gicv4) vgic_v4_commit(vcpu); } void kvm_vgic_load(struct kvm_vcpu *vcpu) { + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; + if (unlikely(!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm))) { if (has_vhe() && static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) __vgic_v3_activate_traps(&vcpu->arch.vgic_cpu.vgic_v3); return; } - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) - vgic_v2_load(vcpu); - else + switch (dist->vgic_model) { + case KVM_DEV_TYPE_ARM_VGIC_V5: + vgic_v5_load(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V3: vgic_v3_load(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V2: + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + vgic_v3_load(vcpu); + else + vgic_v2_load(vcpu); + break; + default: + BUG(); + } } void kvm_vgic_put(struct kvm_vcpu *vcpu) { + const struct vgic_dist *dist = &vcpu->kvm->arch.vgic; + if (unlikely(!irqchip_in_kernel(vcpu->kvm) || !vgic_initialized(vcpu->kvm))) { if (has_vhe() && static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) __vgic_v3_deactivate_traps(&vcpu->arch.vgic_cpu.vgic_v3); return; } - if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) - vgic_v2_put(vcpu); - else + switch (dist->vgic_model) { + case KVM_DEV_TYPE_ARM_VGIC_V5: + vgic_v5_put(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V3: vgic_v3_put(vcpu); + break; + case KVM_DEV_TYPE_ARM_VGIC_V2: + if (static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif)) + vgic_v3_put(vcpu); + else + vgic_v2_put(vcpu); + break; + 
default: + BUG(); + } } int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu) @@ -1128,6 +1230,9 @@ int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu) unsigned long flags; struct vgic_vmcr vmcr; + if (vgic_is_v5(vcpu->kvm)) + return vgic_v5_has_pending_ppi(vcpu); + if (!vcpu->kvm->arch.vgic.enabled) return false; diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index c9b3bb07e483..9d941241c8a2 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -187,6 +187,7 @@ static inline u64 vgic_ich_hcr_trap_bits(void) * registers regardless of the hardware backed GIC used. */ struct vgic_vmcr { + u32 en; /* GICv5-specific */ u32 grpen0; u32 grpen1; @@ -363,6 +364,19 @@ void vgic_debug_init(struct kvm *kvm); void vgic_debug_destroy(struct kvm *kvm); int vgic_v5_probe(const struct gic_kvm_info *info); +void vgic_v5_reset(struct kvm_vcpu *vcpu); +int vgic_v5_init(struct kvm *kvm); +int vgic_v5_map_resources(struct kvm *kvm); +void vgic_v5_set_ppi_ops(struct kvm_vcpu *vcpu, u32 vintid); +bool vgic_v5_has_pending_ppi(struct kvm_vcpu *vcpu); +void vgic_v5_flush_ppi_state(struct kvm_vcpu *vcpu); +void vgic_v5_fold_ppi_state(struct kvm_vcpu *vcpu); +void vgic_v5_load(struct kvm_vcpu *vcpu); +void vgic_v5_put(struct kvm_vcpu *vcpu); +void vgic_v5_set_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); +void vgic_v5_get_vmcr(struct kvm_vcpu *vcpu, struct vgic_vmcr *vmcr); +void vgic_v5_restore_state(struct kvm_vcpu *vcpu); +void vgic_v5_save_state(struct kvm_vcpu *vcpu); static inline int vgic_v3_max_apr_idx(struct kvm_vcpu *vcpu) { @@ -425,15 +439,6 @@ void vgic_its_invalidate_all_caches(struct kvm *kvm); int vgic_its_inv_lpi(struct kvm *kvm, struct vgic_irq *irq); int vgic_its_invall(struct kvm_vcpu *vcpu); -bool system_supports_direct_sgis(void); -bool vgic_supports_direct_msis(struct kvm *kvm); -bool vgic_supports_direct_sgis(struct kvm *kvm); - -static inline bool vgic_supports_direct_irqs(struct kvm *kvm) -{ - return 
vgic_supports_direct_msis(kvm) || vgic_supports_direct_sgis(kvm); -} - int vgic_v4_init(struct kvm *kvm); void vgic_v4_teardown(struct kvm *kvm); void vgic_v4_configure_vsgis(struct kvm *kvm); @@ -447,6 +452,11 @@ static inline bool kvm_has_gicv3(struct kvm *kvm) return kvm_has_feat(kvm, ID_AA64PFR0_EL1, GIC, IMP); } +static inline bool kvm_has_gicv5(struct kvm *kvm) +{ + return kvm_has_feat(kvm, ID_AA64PFR2_EL1, GCIE, IMP); +} + void vgic_v3_flush_nested(struct kvm_vcpu *vcpu); void vgic_v3_sync_nested(struct kvm_vcpu *vcpu); void vgic_v3_load_nested(struct kvm_vcpu *vcpu); @@ -454,15 +464,32 @@ void vgic_v3_put_nested(struct kvm_vcpu *vcpu); void vgic_v3_handle_nested_maint_irq(struct kvm_vcpu *vcpu); void vgic_v3_nested_update_mi(struct kvm_vcpu *vcpu); -static inline bool vgic_is_v3_compat(struct kvm *kvm) +static inline bool vgic_host_has_gicv3(void) { - return cpus_have_final_cap(ARM64_HAS_GICV5_CPUIF) && + /* + * Either the host is a native GICv3, or it is GICv5 with + * FEAT_GCIE_LEGACY. + */ + return kvm_vgic_global_state.type == VGIC_V3 || kvm_vgic_global_state.has_gcie_v3_compat; } -static inline bool vgic_is_v3(struct kvm *kvm) +static inline bool vgic_host_has_gicv5(void) +{ + return kvm_vgic_global_state.type == VGIC_V5; +} + +bool system_supports_direct_sgis(void); +bool vgic_supports_direct_msis(struct kvm *kvm); +bool vgic_supports_direct_sgis(struct kvm *kvm); + +static inline bool vgic_supports_direct_irqs(struct kvm *kvm) { - return kvm_vgic_global_state.type == VGIC_V3 || vgic_is_v3_compat(kvm); + /* GICv5 always supports direct IRQs */ + if (vgic_is_v5(kvm)) + return true; + + return vgic_supports_direct_msis(kvm) || vgic_supports_direct_sgis(kvm); } int vgic_its_debug_init(struct kvm_device *dev); |
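Reader's note on the effective priority mask used by vgic_v5_has_pending_ppi() in the vgic-v5.c hunk above. The comments there describe the mask as the minimum of the highest active priority (the lowest set bit of the APR, or 32 when nothing is active) and the priority mask plus one, with 0 returned when the interface is disabled. The following is a minimal user-space sketch of that arithmetic, not the kernel code itself: the function names, the plain-integer parameters, and the assumption that the mask value fits in 0..31 are illustrative stand-ins for the FIELD_GET() accesses on vgic_vmcr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value used when no priority is active (32 priorities, so 32 is "none"). */
#define NO_ACTIVE_PRIORITY 32u

/*
 * Hypothetical stand-in for the kernel helper: 'enabled' and 'vpmr' model the
 * EN and VPMR fields of the VMCR, 'apr' models the active priority register.
 * Lower numbers mean higher priority.
 */
static unsigned int effective_priority_mask(bool enabled, uint32_t apr,
                                            unsigned int vpmr)
{
    unsigned int highest_ap;
    unsigned int first_masked;

    /* Assumption of this sketch: the mask field holds one of 32 priorities. */
    assert(vpmr <= 31);

    /* Delivery disabled: treat the running priority as the highest (0). */
    if (!enabled)
        return 0;

    /* The lowest set bit of the APR is the highest active priority. */
    highest_ap = apr ? (unsigned int)__builtin_ctz(apr) : NO_ACTIVE_PRIORITY;

    /* Add one so the mask uses the same "strictly below" rule as the APR. */
    first_masked = vpmr + 1;

    return highest_ap < first_masked ? highest_ap : first_masked;
}

/* An interrupt can be signalled only if its priority is below the mask. */
static bool can_signal(unsigned int irq_priority, unsigned int mask)
{
    return mask != 0 && irq_priority < mask;
}
```

With these definitions, an idle interface with a mask value of 7 yields an effective mask of 8, so an interrupt at priority 7 passes and one at priority 8 does not; an active priority of 4 lowers the effective mask to 4 regardless of the mask field.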

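Reader's note on vgic_v5_sync_ppi_priorities() in the vgic-v5.c hunk above. That function locates a PPI's priority by dividing the PPI index by 8 to pick a register, then extracting a 5-bit field at an 8-bit stride within it. A small sketch of the same indexing follows, assuming plain 64-bit values in place of cpu_if->vgic_ppi_priorityr[]; the function name and the macro names are hypothetical, and the bound of 128 PPIs is inferred from the "up to 16 PPI Priority regs" comment rather than taken from a kernel constant.

```c
#include <assert.h>
#include <stdint.h>

#define PPIS_PER_PRIORITY_REG 8u      /* eight priority fields per register */
#define PRIORITY_FIELD_STRIDE 8u      /* each field starts on a byte boundary */
#define PRIORITY_FIELD_MASK   0x1fULL /* bits [4:0] of the field, as in GENMASK(pri_bit + 4, pri_bit) */
#define NR_PRIORITY_REGS      16u     /* assumption drawn from the comment in the diff */

/* Hypothetical helper: return the 5-bit priority stored for PPI number 'ppi'. */
static uint8_t ppi_priority(const uint64_t *priority_regs, unsigned int ppi)
{
    unsigned int reg = ppi / PPIS_PER_PRIORITY_REG;
    unsigned int shift = (ppi % PPIS_PER_PRIORITY_REG) * PRIORITY_FIELD_STRIDE;

    /* Stay within the register file assumed by this sketch. */
    assert(reg < NR_PRIORITY_REGS);

    return (uint8_t)((priority_regs[reg] >> shift) & PRIORITY_FIELD_MASK);
}
```

For example, PPI 10 lives in register 1 at bit offset 16, and only the low five bits of each byte contribute to the result, so a byte of 0xff reads back as priority 31.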