- 04 Oct, 2021 11 commits
-
-
Anup Patel authored
This patch implements all required functions for programming the stage2 page table for each Guest/VM. At high-level, the flow of stage2 related functions is similar from KVM ARM/ARM64 implementation but the stage2 page table format is quite different for KVM RISC-V. [jiangyifei: stage2 dirty log support] Signed-off-by: Yifei Jiang <jiangyifei@huawei.com> Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
We implement a simple VMID allocator for Guests/VMs which: 1. Detects number of VMID bits at boot-time 2. Uses atomic number to track VMID version and increments VMID version whenever we run-out of VMIDs 3. Flushes Guest TLBs on all host CPUs whenever we run-out of VMIDs 4. Force updates HW Stage2 VMID for each Guest VCPU whenever VMID changes using VCPU request KVM_REQ_UPDATE_HGATP Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alexander Graf <graf@amazon.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
We get illegal instruction trap whenever Guest/VM executes WFI instruction. This patch handles WFI trap by blocking the trapped VCPU using kvm_vcpu_block() API. The blocked VCPU will be automatically resumed whenever a VCPU interrupt is injected from user-space or from in-kernel IRQCHIP emulation. Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
We will get stage2 page faults whenever Guest/VM access SW emulated MMIO device or unmapped Guest RAM. This patch implements MMIO read/write emulation by extracting MMIO details from the trapped load/store instruction and forwarding the MMIO read/write to user-space. The actual MMIO emulation will happen in user-space and KVM kernel module will only take care of register updates before resuming the trapped VCPU. The handling for stage2 page faults for unmapped Guest RAM will be implemeted by a separate patch later. [jiangyifei: ioeventfd and in-kernel mmio device support] Signed-off-by: Yifei Jiang <jiangyifei@huawei.com> Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alexander Graf <graf@amazon.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
This patch implements the VCPU world-switch for KVM RISC-V. The KVM RISC-V world-switch (i.e. __kvm_riscv_switch_to()) mostly switches general purpose registers, SSTATUS, STVEC, SSCRATCH and HSTATUS CSRs. Other CSRs are switched via vcpu_load() and vcpu_put() interface in kvm_arch_vcpu_load() and kvm_arch_vcpu_put() functions respectively. Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alexander Graf <graf@amazon.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
For KVM RISC-V, we use KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls to access VCPU config and registers from user-space. We have three types of VCPU registers: 1. CONFIG - these are VCPU config and capabilities 2. CORE - these are VCPU general purpose registers 3. CSR - these are VCPU control and status registers The CONFIG register available to user-space is ISA. The ISA register is a read and write register where user-space can only write the desired VCPU ISA capabilities before running the VCPU. The CORE registers available to user-space are PC, RA, SP, GP, TP, A0-A7, T0-T6, S0-S11 and MODE. Most of these are RISC-V general registers except PC and MODE. The PC register represents program counter whereas the MODE register represent VCPU privilege mode (i.e. S/U-mode). The CSRs available to user-space are SSTATUS, SIE, STVEC, SSCRATCH, SEPC, SCAUSE, STVAL, SIP, and SATP. All of these are read/write registers. In future, more VCPU register types will be added (such as FP) for the KVM_GET_ONE_REG/KVM_SET_ONE_REG ioctls. Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
This patch implements VCPU interrupts and requests which are both asynchronous events. The VCPU interrupts can be set/unset using KVM_INTERRUPT ioctl from user-space. In future, the in-kernel IRQCHIP emulation will use kvm_riscv_vcpu_set_interrupt() and kvm_riscv_vcpu_unset_interrupt() functions to set/unset VCPU interrupts. Important VCPU requests implemented by this patch are: KVM_REQ_SLEEP - set whenever VCPU itself goes to sleep state KVM_REQ_VCPU_RESET - set whenever VCPU reset is requested The WFI trap-n-emulate (added later) will use KVM_REQ_SLEEP request and kvm_riscv_vcpu_has_interrupt() function. The KVM_REQ_VCPU_RESET request will be used by SBI emulation (added later) to power-up a VCPU in power-off state. The user-space can use the GET_MPSTATE/SET_MPSTATE ioctls to get/set power state of a VCPU. Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alexander Graf <graf@amazon.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
This patch implements VCPU create, init and destroy functions required by generic KVM module. We don't have much dynamic resources in struct kvm_vcpu_arch so these functions are quite simple for KVM RISC-V. Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alexander Graf <graf@amazon.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
This patch adds initial skeletal KVM RISC-V support which has: 1. A simple implementation of arch specific VM functions except kvm_vm_ioctl_get_dirty_log() which will implemeted in-future as part of stage2 page loging. 2. Stubs of required arch specific VCPU functions except kvm_arch_vcpu_ioctl_run() which is semi-complete and extended by subsequent patches. 3. Stubs for required arch specific stage2 MMU functions. Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alexander Graf <graf@amazon.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
-
Anup Patel authored
This patch adds asm/kvm_csr.h for RISC-V hypervisor extension related defines. Signed-off-by: Anup Patel <anup.patel@wdc.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alexander Graf <graf@amazon.com> Message-Id: <20210927114016.1089328-2-anup.patel@wdc.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Merge tag 'kvm-s390-master-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master KVM: s390: allow to compile without warning with W=1
-
- 30 Sep, 2021 4 commits
-
-
Sean Christopherson authored
Rework the CPU selection in the migration worker to ensure the specified number of migrations are performed when the test iteslf is affined to a subset of CPUs. The existing logic skips iterations if the target CPU is not in the original set of possible CPUs, which causes the test to fail if too many iterations are skipped. ==== Test Assertion Failure ==== rseq_test.c:228: i > (NR_TASK_MIGRATIONS / 2) pid=10127 tid=10127 errno=4 - Interrupted system call 1 0x00000000004018e5: main at rseq_test.c:227 2 0x00007fcc8fc66bf6: ?? ??:0 3 0x0000000000401959: _start at ??:? Only performed 4 KVM_RUNs, task stalled too much? Calculate the min/max possible CPUs as a cheap "best effort" to avoid high runtimes when the test is affined to a small percentage of CPUs. Alternatively, a list or xarray of the possible CPUs could be used, but even in a horrendously inefficient setup, such optimizations are not needed because the runtime is completely dominated by the cost of migrating the task, and the absolute runtime is well under a minute in even truly absurd setups, e.g. running on a subset of vCPUs in a VM that is heavily overcommited (16 vCPUs per pCPU). Fixes: 61e52f16 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs") Reported-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210929234112.1862848-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Check whether a CPUID entry's index is significant before checking for a matching index to hack-a-fix an undefined behavior bug due to consuming uninitialized data. RESET/INIT emulation uses kvm_cpuid() to retrieve CPUID.0x1, which does _not_ have a significant index, and fails to initialize the dummy variable that doubles as EBX/ECX/EDX output _and_ ECX, a.k.a. index, input. Practically speaking, it's _extremely_ unlikely any compiler will yield code that causes problems, as the compiler would need to inline the kvm_cpuid() call to detect the uninitialized data, and intentionally hose the kernel, e.g. insert ud2, instead of simply ignoring the result of the index comparison. Although the sketchy "dummy" pattern was introduced in SVM by commit 66f7b72e ("KVM: x86: Make register state after reset conform to specification"), it wasn't actually broken until commit 7ff6c035 ("KVM: x86: Remove stateful CPUID handling") arbitrarily swapped the order of operations such that "index" was checked before the significant flag. Avoid consuming uninitialized data by reverting to checking the flag before the index purely so that the fix can be easily backported; the offending RESET/INIT code has been refactored, moved, and consolidated from vendor code to common x86 since the bug was introduced. A future patch will directly address the bad RESET/INIT behavior. The undefined behavior was detected by syzbot + KernelMemorySanitizer. BUG: KMSAN: uninit-value in cpuid_entry2_find arch/x86/kvm/cpuid.c:68 BUG: KMSAN: uninit-value in kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103 BUG: KMSAN: uninit-value in kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183 cpuid_entry2_find arch/x86/kvm/cpuid.c:68 [inline] kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103 [inline] kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183 kvm_vcpu_reset+0x13fb/0x1c20 arch/x86/kvm/x86.c:10885 kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923 vcpu_enter_guest+0xfd2/0x6d80 arch/x86/kvm/x86.c:9534 vcpu_run+0x7f5/0x18d0 arch/x86/kvm/x86.c:9788 kvm_arch_vcpu_ioctl_run+0x245b/0x2d10 arch/x86/kvm/x86.c:10020 Local variable ----dummy@kvm_vcpu_reset created at: kvm_vcpu_reset+0x1fb/0x1c20 arch/x86/kvm/x86.c:10812 kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923 Reported-by: syzbot+f3985126b746b3d59c9d@syzkaller.appspotmail.com Reported-by: Alexander Potapenko <glider@google.com> Fixes: 2a24be79 ("KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping") Fixes: 7ff6c035 ("KVM: x86: Remove stateful CPUID handling") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20210929222426.1855730-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Zelin Deng authored
hv_clock is preallocated to have only HVC_BOOT_ARRAY_SIZE (64) elements; if the PTP_SYS_OFFSET_PRECISE ioctl is executed on vCPUs whose index is 64 of higher, retrieving the struct pvclock_vcpu_time_info pointer with "src = &hv_clock[cpu].pvti" will result in an out-of-bounds access and a wild pointer. Change it to "this_cpu_pvti()" which is guaranteed to be valid. Fixes: 95a3d445 ("Switch kvmclock data to a PER_CPU variable") Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com> Cc: <stable@vger.kernel.org> Message-Id: <1632892429-101194-3-git-send-email-zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Zelin Deng authored
There're other modules might use hv_clock_per_cpu variable like ptp_kvm, so move it into kvmclock.h and export the symbol to make it visiable to other modules. Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com> Cc: <stable@vger.kernel.org> Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 28 Sep, 2021 2 commits
-
-
Janosch Frank authored
The latest compile changes pointed us to a few instances where we use the kernel documentation style but don't explain all variables or don't adhere to it 100%. It's easy to fix so let's do that. Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
-
Oliver Upton authored
There is no need to clobber a register that is only being read from. Oops. Drop the XMM register from the clobbers list. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210927223621.50178-1-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 27 Sep, 2021 1 commit
-
-
Zhenzhong Duan authored
When updating the host's mask for its MSR_IA32_TSX_CTRL user return entry, clear the mask in the found uret MSR instead of vmx->guest_uret_msrs[i]. Modifying guest_uret_msrs directly is completely broken as 'i' does not point at the MSR_IA32_TSX_CTRL entry. In fact, it's guaranteed to be an out-of-bounds accesses as is always set to kvm_nr_uret_msrs in a prior loop. By sheer dumb luck, the fallout is limited to "only" failing to preserve the host's TSX_CTRL_CPUID_CLEAR. The out-of-bounds access is benign as it's guaranteed to clear a bit in a guest MSR value, which are always zero at vCPU creation on both x86-64 and i386. Cc: stable@vger.kernel.org Fixes: 8ea8b8d6 ("KVM: VMX: Use common x86's uret MSR list as the one true list") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210926015545.281083-1-zhenzhong.duan@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 24 Sep, 2021 3 commits
-
-
Paolo Bonzini authored
Merge tag 'kvmarm-fixes-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/arm64 fixes for 5.15, take #1 - Add missing FORCE target when building the EL2 object - Fix a PMU probe regression on some platforms
-
Oliver Upton authored
Compiling the KVM selftests with clang emits the following warning: >> include/x86_64/processor.h:297:25: error: variable 'xmm0' is uninitialized when used here [-Werror,-Wuninitialized] >> return (unsigned long)xmm0; where xmm0 is accessed via an uninitialized register variable. Indeed, this is a misuse of register variables, which really should only be used for specifying register constraints on variables passed to inline assembly. Rather than attempting to read xmm registers via register variables, just explicitly perform the movq from the desired xmm register. Fixes: 783e9e51 ("kvm: selftests: add API testing infrastructure") Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210924005147.1122357-1-oupton@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Oliver Upton authored
While x86 does not require any additional setup to use the ucall infrastructure, arm64 needs to set up the MMIO address used to signal a ucall to userspace. rseq_test does not initialize the MMIO address, resulting in the test spinning indefinitely. Fix the issue by calling ucall_init() during setup. Fixes: 61e52f16 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs") Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210923220033.4172362-1-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 23 Sep, 2021 7 commits
-
-
Lai Jiangshan authored
There is no user of tlbs_dirty. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
If gpte is changed from non-present to present, the guest doesn't need to flush tlb per SDM. So the host must synchronze sp before link it. Otherwise the guest might use a wrong mapping. For example: the guest first changes a level-1 pagetable, and then links its parent to a new place where the original gpte is non-present. Finally the guest can access the remapped area without flushing the tlb. The guest's behavior should be allowed per SDM, but the host kvm mmu makes it wrong. Fixes: 4731d4c7 ("KVM: MMU: out of sync shadow core") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Lai Jiangshan authored
When kvm->tlbs_dirty > 0, some rmaps might have been deleted without flushing tlb remotely after kvm_sync_page(). If @gfn was writable before and it's rmaps was deleted in kvm_sync_page(), and if the tlb entry is still in a remote running VCPU, the @gfn is not safely protected. To fix the problem, kvm_sync_page() does the remote flush when needed to avoid the problem. Fixes: a4ee1ca4 ("KVM: MMU: delay flush all tlbs on sync_page path") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
These field correspond to features that we don't expose yet to L2 While currently there are no CVE worthy features in this field, if AMD adds more features to this field, that could allow guest escapes similar to CVE-2021-3653 and CVE-2021-3656. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-6-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
GP SVM errata workaround made the #GP handler always emulate the SVM instructions. However these instructions #GP in case the operand is not 4K aligned, but the workaround code didn't check this and we ended up emulating these instructions anyway. This is only an emulation accuracy check bug as there is no harm for KVM to read/write unaligned vmcb images. Fixes: 82a11e9c ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
Test that if: * L1 disables virtual interrupt masking, and INTR intercept. * L1 setups a virtual interrupt to be injected to L2 and enters L2 with interrupts disabled, thus the virtual interrupt is pending. * Now an external interrupt arrives in L1 and since L1 doesn't intercept it, it should be delivered to L2 when it enables interrupts. to do this L0 (abuses) V_IRQ to setup an interrupt window, and returns to L2. * L2 enables interrupts. This should trigger the interrupt window, injection of the external interrupt and delivery of the virtual interrupt that can now be done. * Test that now L2 gets those interrupts. This is the test that demonstrates the issue that was fixed in the previous patch. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
In svm_clear_vintr we try to restore the virtual interrupt injection that might be pending, but we fail to restore the interrupt vector. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 22 Sep, 2021 12 commits
-
-
Fares Mehanna authored
Intel PMU MSRs is in msrs_to_save_all[], so add AMD PMU MSRs to have a consistent behavior between Intel and AMD when using KVM_GET_MSRS, KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST. We have to add legacy and new MSRs to handle guests running without X86_FEATURE_PERFCTR_CORE. Signed-off-by: Fares Mehanna <faresx@amazon.de> Message-Id: <20210915133951.22389-1-faresx@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
If L1 had invalid state on VM entry (can happen on SMM transactions when we enter from real mode, straight to nested guest), then after we load 'host' state from VMCS12, the state has to become valid again, but since we load the segment registers with __vmx_set_segment we weren't always updating emulation_required. Update emulation_required explicitly at end of load_vmcs12_host_state. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
It is possible that when non root mode is entered via special entry (!from_vmentry), that is from SMM or from loading the nested state, the L2 state could be invalid in regard to non unrestricted guest mode, but later it can become valid. (for example when RSM emulation restores segment registers from SMRAM) Thus delay the check to VM entry, where we will check this and fail. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
Since no actual VM entry happened, the VM exit information is stale. To avoid this, synthesize an invalid VM guest state VM exit. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
Use return statements instead of nested if, and fix error path to free all the maps that were allocated. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
Currently the KVM_REQ_GET_NESTED_STATE_PAGES on SVM only reloads PDPTRs, and MSR bitmap, with former not really needed for SMM as SMM exit code reloads them again from SMRAM'S CR3, and later happens to work since MSR bitmap isn't modified while in SMM. Still it is better to be consistient with VMX. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
When exiting SMM, pdpts are loaded again from the guest memory. This fixes a theoretical bug, when exit from SMM triggers entry to the nested guest which re-uses some of the migration code which uses this flag as a workaround for a legacy userspace. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
Otherwise guest entry code might see incorrect L1 state (e.g paging state). Fixes: 37be407b ("KVM: nSVM: Fix L1 state corruption upon return from SMM") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
Windows Server 2022 with Hyper-V role enabled failed to boot on KVM when enlightened VMCS is advertised. Debugging revealed there are two exposed secondary controls it is not happy with: SECONDARY_EXEC_ENABLE_VMFUNC and SECONDARY_EXEC_SHADOW_VMCS. These controls are known to be unsupported, as there are no corresponding fields in eVMCSv1 (see the comment above EVMCS1_UNSUPPORTED_2NDEXEC definition). Previously, commit 31de3d25 ("x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()") introduced the required filtering mechanism for VMX MSRs but for some reason put only known to be problematic (and not full EVMCS1_UNSUPPORTED_* lists) controls there. Note, Windows Server 2022 seems to have gained some sanity check for VMX MSRs: it doesn't even try to launch a guest when there's something it doesn't like, nested_evmcs_check_controls() mechanism can't catch the problem. Let's be bold this time and instead of playing whack-a-mole just filter out all unsupported controls from VMX MSRs. Fixes: 31de3d25 ("x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210907163530.110066-1-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Check for a NULL cpumask_var_t when kicking multiple vCPUs via cpumask_available(), which performs a !NULL check if and only if cpumasks are configured to be allocated off-stack. This is a meaningless optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more importantly helps document that the NULL check is necessary even though all callers pass in a local variable. No functional change intended. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210827092516.1027264-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu is read exactly once, and by ensuring the vCPU is booted from guest mode if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if kvm_request_needs_ipi() returns true. Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or kvm_request_needs_ipi()) means the target vCPU could get migrated (change vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpud and reading vcpu->mode. If that happens, the kick/IPI will be sent to the old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs. Although failing to kick the vCPU is not exactly ideal, practically speaking it cannot cause a functional issue unless there is also a bug in the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s behavior. The purpose of sending an IPI is purely to get a vCPU into the host (or out of reading SPTEs) so that the vCPU can recognize a change in state, e.g. a KVM_REQ_* request. If vCPU's handling of the state change is required for correctness, KVM must ensure either the vCPU sees the change before entering the guest, or that the sender sees the vCPU as running in guest mode. All architectures handle this by (a) sending the request before calling kvm_vcpu_kick() and (b) checking for requests _after_ setting vcpu->mode. x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to ensure it kicks and waits for vCPUs that started reading SPTEs _before_ MMU changes were finalized, but any vCPU that starts reading after MMU changes were finalized will see the new state and can continue on uninterrupted. For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g. x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied upon for functional correctness, e.g. in the dirty log case, userspace cannot assume it has a 100% complete log if vCPUs are still running. All that said, eliminate the benign race since the cost of doing so is an "extra" atomic cmpxchg() in the case where the target vCPU is loaded by the current pCPU or is not loaded at all. I.e. the kick will be skipped due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as opposed to the kick being skipped because of the cpu checks. Keep the "cpu != me" checks even though they appear useless/impossible at first glance. x86 processes guest IPI writes in a fast path that runs in IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or READING_SHADOW_PAGE_TABLES is perfectly reasonable. Note, a race with the cpu_online() check in kvm_vcpu_kick() likely persists, e.g. the vCPU could exit guest mode and get offlined between the cpu_online() check and the sending of smp_send_reschedule(). But, the online check appears to exist only to avoid a WARN in x86's native_smp_send_reschedule() that fires if the target CPU is not online. The reschedule WARN exists because CPU offlining takes the CPU out of the scheduling pool, i.e. the WARN is intended to detect the case where the kernel attempts to schedule a task on an offline CPU. The actual sending of the IPI is a non-issue as at worst it will simpy be dropped on the floor. In other words, KVM's usurping of the reschedule IPI could theoretically trigger a WARN if the stars align, but there will be no loss of functionality. [*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a Cc: Venkatesh Srinivas <venkateshs@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: 97222cc8 ("KVM: Emulate local APIC in kernel") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
KASAN reports the following issue: BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm] Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798 CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- --- Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020 Call Trace: dump_stack+0xa5/0xe6 print_address_description.constprop.0+0x18/0x130 ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm] __kasan_report.cold+0x7f/0x114 ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm] kasan_report+0x38/0x50 kasan_check_range+0xf5/0x1d0 kvm_make_vcpus_request_mask+0x174/0x440 [kvm] kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm] ? kvm_arch_exit+0x110/0x110 [kvm] ? sched_clock+0x5/0x10 ioapic_write_indirect+0x59f/0x9e0 [kvm] ? static_obj+0xc0/0xc0 ? __lock_acquired+0x1d2/0x8c0 ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm] The problem appears to be that 'vcpu_bitmap' is allocated as a single long on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear the lower 16 bits of it with bitmap_zero() for no particular reason (my guess would be that 'bitmap' and 'vcpu_bitmap' variables in kvm_bitmap_or_dest_vcpus() caused the confusion: while the later is indeed 16-bit long, the later should accommodate all possible vCPUs). Fixes: 7ee30bc1 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs") Fixes: 9a2ae9f6 ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap") Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210827092516.1027264-7-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-