- 29 Apr, 2022 28 commits
-
-
Paolo Bonzini authored
Remove another duplicate field of struct kvm_mmu. This time it's the root level for page table walking; the separate field is always initialized as cpu_role.base.level, so its users can look up the CPU mode directly instead. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
root_role.level is always the same value as shadow_level: - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu - it's the level argument when going through kvm_init_shadow_ept_mmu - it's assigned directly from new_role.base.level when going through shadow_mmu_init_context Remove the duplication and get the level directly from the role. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Do not lead init_kvm_*mmu into the temptation of poking into struct kvm_mmu_role_regs, by passing to it directly the CPU mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Shadow MMUs compute their role from cpu_role.base, simply by adjusting the root level. It's one line of code, so do not place it in a separate function. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Before the separation of the CPU and the MMU role, CR0.PG was not available in the base MMU role, because two-dimensional paging always used direct=1 in the MMU role. However, now that the raw role is snapshotted in mmu->cpu_role, the value of CR0.PG always matches both !cpu_role.base.direct and cpu_role.base.level > 0. There is no need to store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg accessor by hand that takes care of the conversion. Use cpu_role.base.level since the future of the direct field is unclear. Likewise, CR4.PAE is now always present in the CPU role as !cpu_role.base.has_4_byte_gpte. The inversion makes certain tests on the MMU role easier, and is easily hidden by the is_cr4_pae accessor when operating on the CPU role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
It is quite confusing that the "full" union is called kvm_mmu_role but is used for the "cpu_role" field of struct kvm_mmu. Rename it to kvm_cpu_role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
mmu_role represents the role of the root of the page tables. It does not need any extended bits, as those govern only KVM's page table walking; the is_* functions used for page table walking always use the CPU role. ext.valid is not present anymore in the MMU role, but an all-zero MMU role is impossible because the level field is never zero in the MMU role. So just zap the whole mmu_role in order to force invalidation after CPUID is updated. While making this change, which requires touching almost every occurrence of "mmu_role", rename it to "root_role". Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Now that the MMU role is separate from the CPU role, it can be a truthful description of the format of the shadow pages. This includes whether the shadow pages use the NX bit; so force the efer_nx field of the MMU role when TDP is disabled, and remove the hardcoding it in the callers of reset_shadow_zero_bits_mask. In fact, the initialization of reserved SPTE bits can now be made common to shadow paging and shadow NPT; move it to shadow_mmu_init_context. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Pass the already-computed CPU role, instead of redoing it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Inline kvm_calc_mmu_role_common into its sole caller, and simplify it by removing the computation of unnecessary bits. Extended bits are unnecessary because page walking uses the CPU role, and EFER.NX/CR0.WP can be set to one unconditionally---matching the format of shadow pages rather than the format of guest pages. The MMU role for two dimensional paging does still depend on the CPU role, even if only barely so, due to SMM and guest mode; for consistency, pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying the vcpu with is_smm or is_guest_mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
kvm_calc_shadow_root_page_role_common is the same as kvm_calc_cpu_role except for the level, which is overwritten afterwards in kvm_calc_shadow_mmu_root_page_role and kvm_calc_shadow_npt_root_page_role. role.base.direct is already set correctly for the CPU role, and CR0.PG=1 is required for VMRUN so it will also be correct for nested NPT. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The ept_ad field is used during page walk to determine if the guest PTEs have accessed and dirty bits. In the MMU role, the ad_disabled bit represents whether the *shadow* PTEs have the bits, so it would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just !mmu->mmu_role.base.ad_disabled. However, the similar field in the CPU mode, ad_disabled, is initialized correctly: to the opposite value of ept_ad for shadow EPT, and zero for non-EPT guest paging modes (which always have A/D bits). It is therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode, like other page-format fields; it just has to be inverted to account for the different polarity. In fact, now that the CPU mode is distinct from the MMU roles, it would even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and use !mmu->cpu_role.base.ad_disabled instead. I am not doing this because the macro has a small effect in terms of dead code elimination: text data bss dec hex 103544 16665 112 120321 1d601 # as of this patch 103746 16665 112 120523 1d6cb # without PT_HAVE_ACCESSED_DIRTY Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The root_level can be found in the cpu_role (in fact the field is superfluous and could be removed, but one thing at a time). Since there is only one usage left of role_regs_to_root_level, inline it into kvm_calc_cpu_role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Snapshot the state of the processor registers that govern page walk into a new field of struct kvm_mmu. This is a more natural representation than having it *mostly* in mmu_role but not exclusively; the delta right now is represented in other fields, such as root_level. The nested MMU now has only the CPU role; and in fact the new function kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role, except that it has role.base.direct equal to !CR0.PG. For a walk-only MMU, "direct" has no meaning, but we set it to !CR0.PG so that role.ext.cr0_pg can go away in a future patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The argument is always false now that kvm_mmu_calc_root_page_role has been removed. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround with an explicit, common workaround in kvm_inject_emulated_page_fault(). Aside from being a hack, the current approach is brittle and incomplete, e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(), and nVMX fails to apply the workaround when VMX is intercepting #PF due to allow_smaller_maxphyaddr=1. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
If accessed bits are not supported there simple isn't any distinction between accessed and non-accessed gPTEs, so the comment does not make much sense. Rephrase it in terms of what happens if accessed bits *are* supported. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The init_kvm_*mmu functions, with the exception of shadow NPT, do not need to know the full values of CR0/CR4/EFER; they only need to know the bits that make up the "role". This cleanup however will take quite a few incremental steps. As a start, pull the common computation of the struct kvm_mmu_role_regs into their caller: all of them extract the struct from the vcpu as the very first step. Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
struct kvm_mmu_role_regs is computed just once and then accessed. Use const to make this clearer, even though the const fields of struct kvm_mmu_role_regs already prevent (or make it harder...) to modify the contents of the struct. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The role.base.smm flag is always zero when setting up shadow EPT, do not bother copying it over from vcpu->arch.root_mmu. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Clear enable_mmio_caching if hardware can't support MMIO caching and use the dedicated flag to detect if MMIO caching is enabled instead of assuming shadow_mmio_value==0 means MMIO caching is disabled. TDX will use a zero value even when caching is enabled, and is_mmio_spte() isn't so hot that it needs to avoid an extra memory access, i.e. there's no reason to be super clever. And the clever approach may not even be more performant, e.g. gcc-11 lands the extra check on a non-zero value inline, but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra uops for non-MMIO SPTEs. Cc: Isaku Yamahata <isaku.yamahata@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220420002747.3287931-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When determining whether or not a SPTE needs to have SME/SEV's memory encryption flag set, do the moderately expensive host MMIO pfn check if and only if the memory encryption mask is non-zero. Note, KVM could further optimize the host MMIO checks by making a single call to kvm_is_mmio_pfn(), but the tdp_enabled path (for EPT's memtype handling) will likely be split out to a separate flow[*]. At that point, a better approach would be to shove the call to kvm_is_mmio_pfn() into VMX code so that AMD+NPT without SME doesn't get hit with an unnecessary lookup. [*] https://lkml.kernel.org/r/20220321224358.1305530-3-bgardon@google.comSigned-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220415004909.2216670-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Babu Moger authored
The TSC_AUX virtualization feature allows AMD SEV-ES guests to securely use TSC_AUX (auxiliary time stamp counter data) in the RDTSCP and RDPID instructions. The TSC_AUX value is set using the WRMSR instruction to the TSC_AUX MSR (0xC0000103). It is read by the RDMSR, RDTSCP and RDPID instructions. If the read/write of the TSC_AUX MSR is intercepted, then RDTSCP and RDPID must also be intercepted when TSC_AUX virtualization is present. However, the RDPID instruction can't be intercepted. This means that when TSC_AUX virtualization is present, RDTSCP and TSC_AUX MSR read/write must not be intercepted for SEV-ES (or SEV-SNP) guests. Signed-off-by: Babu Moger <babu.moger@amd.com> Message-Id: <165040164424.1399644.13833277687385156344.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Babu Moger authored
The TSC_AUX Virtualization feature allows AMD SEV-ES guests to securely use TSC_AUX (auxiliary time stamp counter data) MSR in RDTSCP and RDPID instructions. The TSC_AUX MSR is typically initialized to APIC ID or another unique identifier so that software can quickly associate returned TSC value with the logical processor. Add the feature bit and also include it in the kvm for detection. Signed-off-by: Babu Moger <babu.moger@amd.com> Acked-by: Borislav Petkov <bp@suse.de> Message-Id: <165040157111.1399644.6123821125319995316.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Fixes for (relatively) old bugs, to be merged in both the -rc and next development trees. The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between 5.18 and commit c24a950e ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata for SEV-ES", 2022-04-13).
-
Mingwei Zhang authored
KVM uses lookup_address_in_mm() to detect the hugepage size that the host uses to map a pfn. The function suffers from several issues: - no usage of READ_ONCE(*). This allows multiple dereference of the same page table entry. The TOCTOU problem because of that may cause KVM to incorrectly treat a newly generated leaf entry as a nonleaf one, and dereference the content by using its pfn value. - the information returned does not match what KVM needs; for non-present entries it returns the level at which the walk was terminated, as long as the entry is not 'none'. KVM needs level information of only 'present' entries, otherwise it may regard a non-present PXE entry as a present large page mapping. - the function is not safe for mappings that can be torn down, because it does not disable IRQs and because it returns a PTE pointer which is never safe to dereference after the function returns. So implement the logic for walking host page tables directly in KVM, and stop using lookup_address_in_mm(). Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220429031757.2042406-1-mizhang@google.com> [Inline in host_pfn_mapping_level, ensure no semantic change for its callers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags member that at the time was unused. Unfortunately this extensibility mechanism has several issues: - x86 is not writing the member, so it would not be possible to use it on x86 except for new events - the member is not aligned to 64 bits, so the definition of the uAPI struct is incorrect for 32- on 64-bit userspace. This is a problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately usage of flags was only introduced in 5.18. Since padding has to be introduced, place a new field in there that tells if the flags field is valid. To allow further extensibility, in fact, change flags to an array of 16 values, and store how many of the values are valid. The availability of the new ndata field is tied to a system capability; all architectures are changed to fill in the field. To avoid breaking compilation of userspace that was using the flags field, provide a userspace-only union to overlap flags with data[0]. The new field is placed at the same offset for both 32- and 64-bit userspace. Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Message-Id: <20220422103013.34832-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR. The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the guest, possibly with help from userspace, manages to coerce KVM into creating a SPTE for an "impossible" gfn, KVM will leak the associated shadow pages (page tables): WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57 kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2d0 [kvm] kvm_vm_release+0x1d/0x30 [kvm] __fput+0x82/0x240 task_work_run+0x5b/0x90 exit_to_user_mode_prepare+0xd2/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> On bare metal, encountering an impossible gpa in the page fault path is well and truly impossible, barring CPU bugs, as the CPU will signal #PF during the gva=>gpa translation (or a similar failure when stuffing a physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR of the underlying hardware, in which case the hardware will not fault on the illegal-from-KVM's-perspective gpa. Alternatively, KVM could continue allowing the dodgy behavior and simply zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's a (minor) waste of cycles, and more importantly, KVM can't reasonably support impossible memslots when running on bare metal (or with an accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if KVM is running as a guest is not a safe option as the host isn't required to announce itself to the guest in any way, e.g. doesn't need to set the HYPERVISOR CPUID bit. A second alternative to disallowing the memslot behavior would be to disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That restriction is undesirable as there are legitimate use cases for doing so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous systems so that VMs can be migrated between hosts with different MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess. Note that any guest.MAXPHYADDR is valid with shadow paging, and it is even useful in order to test KVM with MAXPHYADDR=52 (i.e. without any reserved physical address bits). The now common kvm_mmu_max_gfn() is inclusive instead of exclusive. The memslot and TDP MMU code want an exclusive value, but the name implies the returned value is inclusive, and the MMIO path needs an inclusive check. Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220428233416.2446833-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 13 Apr, 2022 12 commits
-
-
Sean Christopherson authored
Exit to userspace when emulating an atomic guest access if the CMPXCHG on the userspace address faults. Emulating the access as a write and thus likely treating it as emulated MMIO is wrong, as KVM has already confirmed there is a valid, writable memslot. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use the recently introduce __try_cmpxchg_user() to emulate atomic guest accesses via the associated userspace address instead of mapping the backing pfn into kernel address space. Using kvm_vcpu_map() is unsafe as it does not coordinate with KVM's mmu_notifier to ensure the hva=>pfn translation isn't changed/unmapped in the memremap() path, i.e. when there's no struct page and thus no elevated refcount. Fixes: 42e35f80 ("KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Use the recently introduced __try_cmpxchg_user() to update guest PTE A/D bits instead of mapping the PTE into kernel address space. The VM_PFNMAP path is broken as it assumes that vm_pgoff is the base pfn of the mapped VMA range, which is conceptually wrong as vm_pgoff is the offset relative to the file and has nothing to do with the pfn. The horrific hack worked for the original use case (backing guest memory with /dev/mem), but leads to accessing "random" pfns for pretty much any other VM_PFNMAP case. Fixes: bd53cb35 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs") Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org> Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Zijlstra authored
Add support for CMPXCHG loops on userspace addresses. Provide both an "unsafe" version for tight loops that do their own uaccess begin/end, as well as a "safe" version for use cases where the CMPXCHG is not buried in a loop, e.g. KVM will resume the guest instead of looping when emulation of a guest atomic accesses fails the CMPXCHG. Provide 8-byte versions for 32-bit kernels so that KVM can do CMPXCHG on guest PAE PTEs, which are accessed via userspace addresses. Guard the asm_volatile_goto() variation with CC_HAS_ASM_GOTO_TIED_OUTPUT, the "+m" constraint fails on some compilers that otherwise support CC_HAS_ASM_GOTO_OUTPUT. Cc: stable@vger.kernel.org Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add a config option to guard (future) usage of asm_volatile_goto() that includes "tied outputs", i.e. "+" constraints that specify both an input and output parameter. clang-13 has a bug[1] that causes compilation of such inline asm to fail, and KVM wants to use a "+m" constraint to implement a uaccess form of CMPXCHG[2]. E.g. the test code fails with <stdin>:1:29: error: invalid operand in inline asm: '.long (${1:l}) - .' int foo(int *x) { asm goto (".long (%l[bar]) - .\n": "+m"(*x) ::: bar); return *x; bar: return 0; } ^ <stdin>:1:29: error: unknown token in expression <inline asm>:1:9: note: instantiated into assembly here .long () - . ^ 2 errors generated. on clang-13, but passes on gcc (with appropriate asm goto support). The bug is fixed in clang-14, but won't be backported to clang-13 as the changes are too invasive/risky. gcc also had a similar bug[3], fixed in gcc-11, where gcc failed to account for its behavior of assigning two numbers to tied outputs (one for input, one for output) when evaluating symbolic references. [1] https://github.com/ClangBuiltLinux/linux/issues/1512 [2] https://lore.kernel.org/all/YfMruK8%2F1izZ2VHS@google.com [3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98096Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Peter Gonda authored
If an SEV-ES guest requests termination, exit to userspace with KVM_EXIT_SYSTEM_EVENT and a dedicated SEV_TERM type instead of -EINVAL so that userspace can take appropriate action. See AMD's GHCB spec section '4.1.13 Termination Request' for more details. Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Gonda <pgonda@google.com> Reported-by: kernel test robot <lkp@intel.com> Message-Id: <20220407210233.782250-1-pgonda@google.com> [Add documentatino. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Clear the IDT vectoring field in vmcs12 on next VM-Exit due to a double or triple fault. Per the SDM, a VM-Exit isn't considered to occur during event delivery if the exit is due to an intercepted double fault or a triple fault. Opportunistically move the default clearing (no event "pending") into the helper so that it's more obvious that KVM does indeed handle this case. Note, the double fault case is worded rather wierdly in the SDM: The original event results in a double-fault exception that causes the VM exit directly. Temporarily ignoring injected events, double faults can _only_ occur if an exception occurs while attempting to deliver a different exception, i.e. there's _always_ an original event. And for injected double fault, while there's no original event, injected events are never subject to interception. Presumably the SDM is calling out that a the vectoring info will be valid if a different exit occurs after a double fault, e.g. if a #PF occurs and is intercepted while vectoring #DF, then the vectoring info will show the double fault. In other words, the clause can simply be read as: The VM exit is caused by a double-fault exception. Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1") Cc: Chenyi Qiang <chenyi.qiang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220407002315.78092-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Don't modify vmcs12 exit fields except EXIT_REASON and EXIT_QUALIFICATION when performing a nested VM-Exit due to failed VM-Entry. Per the SDM, only the two aformentioned fields are filled and "All other VM-exit information fields are unmodified". Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220407002315.78092-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove WARNs that sanity check that KVM never lets a triple fault for L2 escape and incorrectly end up in L1. In normal operation, the sanity check is perfectly valid, but it incorrectly assumes that it's impossible for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through KVM_RUN (which guarantees kvm_check_nested_state() will see and handle the triple fault). The WARN can currently be triggered if userspace injects a machine check while L2 is active and CR4.MCE=0. And a future fix to allow save/restore of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't lost on migration, will make it trivially easy for userspace to trigger the WARN. Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is tempting, but wrong, especially if/when the request is saved/restored, e.g. if userspace restores events (including a triple fault) and then restores nested state (which may forcibly leave guest mode). Ignoring the fact that KVM doesn't currently provide the necessary APIs, it's userspace's responsibility to manage pending events during save/restore. ------------[ cut here ]------------ WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel] Modules linked in: kvm_intel kvm irqbypass CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel] Call Trace: <TASK> vmx_leave_nested+0x30/0x40 [kvm_intel] vmx_set_nested_state+0xca/0x3e0 [kvm_intel] kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm] kvm_vcpu_ioctl+0x4b9/0x660 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ---[ end trace 0000000000000000 ]--- Fixes: cb6a32c2 ("KVM: x86: Handle triple fault in L2 without killing L1") Cc: stable@vger.kernel.org Cc: Chenyi Qiang <chenyi.qiang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220407002315.78092-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Like Xu authored
Use static calls to improve kvm_pmu_ops performance, following the same pattern and naming scheme used by kvm-x86-ops.h. Here are the worst fenced_rdtsc() cycles numbers for the kvm_pmu_ops functions that is most often called (up to 7 digits of calls) when running a single perf test case in a guest on an ICX 2.70GHz host (mitigations=on): | legacy | static call ------------------------------------------------------------ .pmc_idx_to_pmc | 1304840 | 994872 (+23%) .pmc_is_enabled | 978670 | 1011750 (-3%) .msr_idx_to_pmc | 47828 | 41690 (+12%) .is_valid_msr | 28786 | 30108 (-4%) Signed-off-by: Like Xu <likexu@tencent.com> [sean: Handle static call updates in pmu.c, tweak changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220329235054.3534728-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Like Xu authored
The pmu_ops should be moved to kvm_x86_init_ops and tagged as __initdata. That'll save those precious few bytes, and more importantly make the original ops unreachable, i.e. make it harder to sneak in post-init modification bugs. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220329235054.3534728-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Like Xu authored
Replace the kvm_pmu_ops pointer in common x86 with an instance of the struct to save one pointer dereference when invoking functions. Copy the struct by value to set the ops during kvm_init(). Signed-off-by: Like Xu <likexu@tencent.com> [sean: Move pmc_is_enabled(), make kvm_pmu_ops static] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220329235054.3534728-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-