- 08 Dec, 2021 40 commits
-
-
Paolo Bonzini authored
Merge bugfix for enlightened MSR Bitmap, before adding support to KVM for exposing the feature to nested guests.
-
Hou Wenlong authored
em_rdmsr() and em_wrmsr() return X86EMUL_IO_NEEDED if MSR accesses required an exit to userspace. However, x86_emulate_insn() doesn't return X86EMUL_*, so x86_emulate_instruction() doesn't directly act on X86EMUL_IO_NEEDED; instead, it looks for other signals to differentiate between PIO, MMIO, etc. causing RDMSR/WRMSR emulation not to exit to userspace now. Nevertheless, if the userspace_msr_exit_test testcase in selftests is changed to test RDMSR/WRMSR with a forced emulation prefix, the test passes. What happens is that first userspace exit information is filled but the userspace exit does not happen. Because x86_emulate_instruction() returns 1, the guest retries the instruction---but this time RIP has already been adjusted past the forced emulation prefix, so the guest executes RDMSR/WRMSR and the userspace exit finally happens. Since the X86EMUL_IO_NEEDED path has provided a complete_userspace_io callback, x86_emulate_instruction() can just return 0 if the callback is not NULL. Then RDMSR/WRMSR instruction emulation will exit to userspace directly, without the RDMSR/WRMSR vmexit. Fixes: 1ae09954 ("KVM: x86: Allow deflecting unknown MSR accesses to user space") Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <56f9df2ee5c05a81155e2be366c9dc1f7adc8817.1635842679.git.houwenlong93@linux.alibaba.com>
-
Vitaly Kuznetsov authored
When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from clean fields to make Hyper-V aware of the change. For KVM's L1s, this is done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr(). MSR bitmap for L2 is build in nested_vmx_prepare_msr_bitmap() by blending MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't check if the resulting bitmap is different and never cleans HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and may result in Hyper-V missing the update. The issue could've been solved by calling evmcs_touch_msr_bitmap() for eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so would not give any performance benefits (compared to not using Enlightened MSR Bitmap at all). 3-level nesting is also not a very common setup nowadays. Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for now. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211129094704.326635-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Hou Wenlong authored
If msr access triggers an exit to userspace, the complete_userspace_io callback would skip instruction by vendor callback for kvm_skip_emulated_instruction(). However, when msr access comes from the emulator, e.g. if kvm.force_emulation_prefix is enabled and the guest uses rdmsr/wrmsr with kvm prefix, VM_EXIT_INSTRUCTION_LEN in vmcs is invalid and kvm_emulate_instruction() should be used to skip instruction instead. As Sean noted, unlike the previous case, there's no #UD if unrestricted guest is disabled and the guest accesses an MSR in Big RM. So the correct way to fix this is to attach a different callback when the msr access comes from the emulator. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <34208da8f51580a06e45afefac95afea0e3f96e3.1635842679.git.houwenlong93@linux.alibaba.com>
-
Hou Wenlong authored
The next patch would use kvm_emulate_instruction() with EMULTYPE_SKIP in complete_userspace_io callback to fix a problem in msr access emulation. However, EMULTYPE_SKIP only updates RIP, more things like updating interruptibility state and injecting single-step #DBs would be done in the callback. Since the emulator also does those things after x86_emulate_insn(), add a new emulation type to pair with EMULTYPE_SKIP to do those things for completion of user exits within the emulator. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679.git.houwenlong93@linux.alibaba.com>
-
Sean Christopherson authored
Truncate the new EIP to a 32-bit value when handling EMULTYPE_SKIP as the decode phase does not truncate _eip. Wrapping the 32-bit boundary is legal if and only if CS is a flat code segment, but that check is implicitly handled in the form of limit checks in the decode phase. Opportunstically prepare for a future fix by storing the result of any truncation in "eip" instead of "_eip". Fixes: 1957aa63 ("KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <093eabb1eab2965201c9b018373baf26ff256d85.1635842679.git.houwenlong93@linux.alibaba.com>
-
Li RongQing authored
merge pv_eoi_get_pending and pv_eoi_clr_pending into a single function pv_eoi_test_and_clear_pending, which returns and clear the value of the pending bit. This makes it possible to clear the pending bit only if the guest had set it, and otherwise skip the call to pv_eoi_put_user(). This can save up to 300 nsec on AMD EPYC processors. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1636026974-50555-2-git-send-email-lirongqing@baidu.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Li RongQing authored
If guest gives MSR_KVM_PV_EOI_EN a wrong value, this printk() will be trigged, and kernel log is spammed with the useless message Fixes: 0d88800d ("kvm: x86: ioapic and apic debug macros cleanup") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Cc: stable@kernel.org Message-Id: <1636026974-50555-1-git-send-email-lirongqing@baidu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs, and nested NPT treats them like all other non-leaf page table levels instead of caching them. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
This bit is very close to mean "role.quadrant is not in use", except that it is false also when the MMU is mapping guest physical addresses directly. In that case, role.quadrant is indeed not in use, but there are no guest PTEs at all. Changing the name and direction of the bit removes the special case, since a guest with paging disabled, or not considering guest paging structures as is the case for two-dimensional paging, does not have to deal with 4-byte guest PTEs. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
Using ept_caps_to_lpage_level is simpler. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
The level of supported large page on nEPT affects the rsvds_bits_mask. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
Bit 7 on pte depends on the level of supported large page. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
Reduce an indirect function call (retpoline) and some intialization code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
The mmu->gva_to_gpa() has no "struct kvm_mmu *mmu", so an extra FNAME(gva_to_gpa_nested) is needed. Add the parameter can simplify the code. And it makes it explicit that the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2 gpa in translate_nested_gpa() via the new parameter. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
role.quadrant is only valid when gpte size is 4 bytes and only be calculated when gpte size is 4 bytes. Although "vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL" also means gpte size is 4 bytes, but using "!role.gpte_is_8_bytes" is clearer Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-15-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
role.gpte_is_8_bytes is unused when role.direct; there is no point in changing a bit in the role, the value that was set when the MMU is initialized is just fine. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-14-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
The body of __kvm_mmu_free_some_pages() has been removed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-13-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
The allocation of special roots is moved to mmu_alloc_special_roots(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-12-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
It is never used. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
And remove clear_page() on it. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
It returns the only proper NPT level, so the "max" in the name is not appropriate. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
The variable name is changed in the code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
MSR_IA32_TSC_ADJUST can be left to the default way which also uese kvm_set_msr_common(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
The host CR3 in the vcpu thread can only be changed when scheduling. Moving the code in vmx_prepare_switch_to_guest() makes the code simpler. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
Aoid earlier modification. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
The value of host MSR_IA32_SYSENTER_ESP is known to be constant for each CPU: (cpu_entry_stack(cpu) + 1) when 32 bit syscall is enabled or NULL is 32 bit syscall is not enabled. So rdmsrl() can be avoided for the first case and both rdmsrl() and vmcs_writel() can be avoided for the second case. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
It is unchanged in most cases. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211111144527.88852-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
It has no user. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-15-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Document the meaning of the three combinations of regs_avail and regs_dirty. Update regs_dirty just after writeback instead of doing it later after vmexit. After vmexit, instead, we clear the regs_avail bits corresponding to lazily-loaded registers. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
When vcpu->arch.cr3 is changed, it is marked dirty, so vmcs.GUEST_CR3 can be updated only when kvm_register_is_dirty(vcpu, VCPU_EXREG_CR3). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-12-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
When vcpu->arch.cr3 is changed, it should be marked dirty unless it is being updated to the value of the architecture guest CR3 (i.e. VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled). This patch has no functionality changed because kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is superset of kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with additional change to vcpu->arch.regs_dirty, but no code uses regs_dirty for VCPU_EXREG_CR3. (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead to test if VCPU_EXREG_CR3 dirty which means current code (ab)uses regs_avail for VCPU_EXREG_CR3 dirty information.) Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
VCPU_EXREG_CR3 is never cleared from vcpu->arch.regs_avail or vcpu->arch.regs_dirty in SVM; therefore, marking CR3 as available is merely a NOP, and testing it will likewise always succeed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
The comment had been added in the commit 689f3bf2 ("KVM: x86: unify callbacks to load paging root") and its related code was removed later, and it has nothing to do with the next line of code. So the comment should be removed too. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
Not functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
In set_cr4_guest_host_mask(), all cr4 pdptr bits are already set to be intercepted in an unclear way. Add X86_CR4_PDPTR_BITS to make it clear and self-documented. No functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-6-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
In set_cr4_guest_host_mask(), X86_CR4_PGE is set to be intercepted when !enable_ept just because X86_CR4_PGE is the only bit that is responsible for flushing TLB but listed in KVM_POSSIBLE_CR4_GUEST_BITS. It is clearer and self-documented to use X86_CR4_TLBFLUSH_BITS instead. No functionality changed. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
Use the same logic to handle the availability of VCPU_EXREG_PDPTR as VMX, also removing a branch in svm_vcpu_run(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
mmu->pdptrs[] and vmcs.GUEST_PDPTR[0-3] are synced, so mmu->pdptrs is available and GUEST_PDPTR[0-3] is not dirty. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
For VMX with EPT, dirty PDPTRs need to be loaded before the next vmentry via vmx_load_mmu_pgd() But not all paths that call load_pdptrs() will cause vmx_load_mmu_pgd() to be invoked. Normally, kvm_mmu_reset_context() is used to cause KVM_REQ_LOAD_MMU_PGD, but sometimes it is skipped: * commit d81135a5("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs() when changing CR0.CD and CR0.NW. * commit 21823fbd("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") skips KVM_REQ_LOAD_MMU_PGD after load_pdptrs() when rewriting the CR3 with the same value. * commit a91a7c70("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after load_pdptrs() when changing CR4.PGE. Fixes: d81135a5 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed") Fixes: 21823fbd ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") Fixes: a91a7c70 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211108124407.12187-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-