- 01 Jun, 2020 9 commits
-
-
Paolo Bonzini authored
Restore the INT_CTL value from the guest's VMCB once we've stopped using it, so that virtual interrupts can be injected as requested by L1. V_TPR is up-to-date however, and it can change if the guest writes to CR8, so keep it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
In preparation for nested SVM save/restore, store all data that matters from the VMCB control area into svm->nested. It will then become part of the nested SVM state that is saved by KVM_SET_NESTED_STATE and restored by KVM_GET_NESTED_STATE, just like the cached vmcs12 for nVMX. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Allow placing the VMCB structs on the stack or in other structs without wasting too much space. Add BUILD_BUG_ON as a quick safeguard against typos. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
This will come in handy when we put a struct vmcb_control_area in svm->nested. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Use l1_tsc_offset to compute svm->vcpu.arch.tsc_offset and svm->vmcb->control.tsc_offset, instead of relying on hsave. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Everything that is needed during nested state restore is now part of nested_prepare_vmcb_control. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Split out filling svm->vmcb.save and svm->vmcb.control before VMRUN. Only the latter will be useful when restoring nested SVM state. This patch introduces no semantic change, so the MMU setup is still done in nested_prepare_vmcb_save. The next patch will clean up things. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
When restoring SVM nested state, the control state cache in svm->nested will have to be filled, but the save state will not have to be moved into svm->vmcb. Therefore, pull the code that handles the control area into a separate function. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Unmapping the nested VMCB in enter_svm_guest_mode is a bit of a wart, since the map argument is not used elsewhere in the function. There are just two callers, and those are also the place where kvm_vcpu_map is called, so it is cleaner to unmap there. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 28 May, 2020 6 commits
-
-
Paolo Bonzini authored
vmx_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as an optimization, but this is only correct before the nested vmentry. If userspace is modifying CR3 with KVM_SET_SREGS after the VM has already been put in guest mode, the value of CR3 will not be updated. Remove the optimization, which almost never triggers anyway. Fixes: 04f11ef4 ("KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
svm_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as an optimization, but this is only correct before the nested vmentry. If userspace is modifying CR3 with KVM_SET_SREGS after the VM has already been put in guest mode, the value of CR3 will not be updated. Remove the optimization, which almost never triggers anyway. This was was added in commit 689f3bf2 ("KVM: x86: unify callbacks to load paging root", 2020-03-16) just to keep the two vendor-specific modules closer, but we'll fix VMX too. Fixes: 689f3bf2 ("KVM: x86: unify callbacks to load paging root") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The usual drill at this point, except there is no code to remove because this case was not handled at all. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
All events now inject vmexits before vmentry rather than after vmexit. Therefore, exit_required is not set anymore and we can remove it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
This allows exceptions injected by the emulator to be properly delivered as vmexits. The code also becomes simpler, because we can just let all L0-intercepted exceptions go through the usual path. In particular, our emulation of the VMX #DB exit qualification is very much simplified, because the vmexit injection path can use kvm_deliver_exception_payload to update DR6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
In case an interrupt arrives after nested.check_events but before the call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt window even if the interrupt is actually going to be a vmexit. This is useless rather than harmful, but it really complicates reasoning about SVM's handling of the VINTR intercept. We'd like to never bother with the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in that case there is no interrupt window and we can just exit the nested guest whenever we want. This patch moves the opening of the interrupt window inside inject_pending_event. This consolidates the check for pending interrupt/NMI/SMI in one place, and makes KVM's usage of immediate exits more consistent, extending it beyond just nested virtualization. There are two functional changes here. They only affect corner cases, but overall they simplify the inject_pending_event. - re-injection of still-pending events will also use req_immediate_exit instead of using interrupt-window intercepts. This should have no impact on performance on Intel since it simply replaces an interrupt-window or NMI-window exit for a preemption-timer exit. On AMD, which has no equivalent of the preemption time, it may incur some overhead but an actual effect on performance should only be visible in pathological cases. - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true if an interrupt, NMI or SMI is blocked by nested_run_pending. This makes sense because entering the VM will allow it to make progress and deliver the event. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 27 May, 2020 17 commits
-
-
Paolo Bonzini authored
Instead of calling kvm_event_needs_reinjection, track its future return value in a variable. This will be useful in the next patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Vitaly Kuznetsov authored
L2 guest hang is observed after 'exit_required' was dropped and nSVM switched to check_nested_events() completely. The hang is a busy loop when e.g. KVM is emulating an instruction (e.g. L2 is accessing MMIO space and we drop to userspace). After nested_svm_vmexit() and when L1 is doing VMRUN nested guest's RIP is not advanced so KVM goes into emulating the same instruction which caused nested_svm_vmexit() and the loop continues. nested_svm_vmexit() is not new, however, with check_nested_events() we're now calling it later than before. In case by that time KVM has modified register state we may pick stale values from VMCB when trying to save nested guest state to nested VMCB. nVMX code handles this case correctly: sync_vmcs02_to_vmcs12() called from nested_vmx_vmexit() does e.g 'vmcs12->guest_rip = kvm_rip_read(vcpu)' and this ensures KVM-made modifications are preserved. Do the same for nSVM. Generally, nested_vmx_vmexit()/nested_svm_vmexit() need to pick up all nested guest state modifications done by KVM after vmexit. It would be great to find a way to express this in a way which would not require to manually track these changes, e.g. nested_{vmcb,vmcs}_get_field(). Co-debugged-with: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200527090102.220647-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Initialize vcpu->arch.tdp_level during vCPU creation to avoid consuming garbage if userspace calls KVM_RUN without first calling KVM_SET_CPUID. Fixes: e93fd3b3 ("KVM: x86/mmu: Capture TDP level when updating CPUID") Reported-by: syzbot+904752567107eefb728c@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200527085400.23759-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Restoring the ASID from the hsave area on VMEXIT is wrong, because its value depends on the handling of TLB flushes. Just skipping the field in copy_vmcb_control_area will do. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Async page faults have to be trapped in the host (L1 in this case), since the APF reason was passed from L0 to L1 and stored in the L1 APF data page. This was completely reversed: the page faults were passed to the guest, a L2 hypervisor. Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
彭浩(Richard) authored
pic_in_kernel(), ioapic_in_kernel() and irqchip_kernel() have the same implementation. Signed-off-by: Peng Hao <richard.peng@oppo.com> Message-Id: <HKAPR02MB4291D5926EA10B8BFE9EA0D3E0B70@HKAPR02MB4291.apcprd02.prod.outlook.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Haiwei Li authored
There is a bad indentation in next&queue branch. The patch looks like fixes nothing though it fixes the indentation. Before fixing: if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) { kvm_skip_emulated_instruction(vcpu); ret = EXIT_FASTPATH_EXIT_HANDLED; } break; case MSR_IA32_TSCDEADLINE: After fixing: if (!handle_fastpath_set_x2apic_icr_irqoff(vcpu, data)) { kvm_skip_emulated_instruction(vcpu); ret = EXIT_FASTPATH_EXIT_HANDLED; } break; case MSR_IA32_TSCDEADLINE: Signed-off-by: Haiwei Li <lihaiwei@tencent.com> Message-Id: <2f78457e-f3a7-3bc9-e237-3132ee87f71e@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Miaohe Lin authored
The second "/* fall through */" in rmode_exception() makes code harder to read. Replace it with "return" to indicate they are different cases, only the #DB and #BP check vcpu->guest_debug, while others don't care. And this also improves the readability. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Message-Id: <1582080348-20827-1-git-send-email-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Take a u32 for the index in has_emulated_msr() to match hardware, which treats MSR indices as unsigned 32-bit values. Functionally, taking a signed int doesn't cause problems with the current code base, but could theoretically cause problems with 32-bit KVM, e.g. if the index were checked via a less-than statement, which would evaluate incorrectly for MSR indices with bit 31 set. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200218234012.7110-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove unnecessary brackets from a case statement that unintentionally encapsulates unrelated case statements in the same switch statement. While technically legal and functionally correct syntax, the brackets are visually confusing and potentially dangerous, e.g. the last of the encapsulated case statements has an undocumented fall-through that isn't flagged by compilers due the encapsulation. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200218234012.7110-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
The migration functionality was left incomplete in commit 5ef8acbd ("KVM: nVMX: Emulate MTF when performing instruction emulation", 2020-02-23), fix it. Fixes: 5ef8acbd ("KVM: nVMX: Emulate MTF when performing instruction emulation") Cc: stable@vger.kernel.org Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Merge AMD fixes before doing more development work.
-
Paolo Bonzini authored
Merge tag 'kvm-s390-next-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD KVM: s390: Cleanups for 5.8 - vsie (nesting) cleanups - remove unneeded semicolon
-
Paolo Bonzini authored
We can simply look at bits 52-53 to identify MMIO entries in KVM's page tables. Therefore, there is no need to pass a mask to kvm_mmu_set_mmio_spte_mask. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
This msr is only available when the host supports WAITPKG feature. This breaks a nested guest, if the L1 hypervisor is set to ignore unknown msrs, because the only other safety check that the kernel does is that it attempts to read the msr and rejects it if it gets an exception. Cc: stable@vger.kernel.org Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200523161455.3940-3-mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Maxim Levitsky authored
Even though we might not allow the guest to use WAITPKG's new instructions, we should tell KVM that the feature is supported by the host CPU. Note that vmx_waitpkg_supported checks that WAITPKG _can_ be set in secondary execution controls as specified by VMX capability MSR, rather that we actually enable it for a guest. Cc: stable@vger.kernel.org Fixes: e69e72fa ("KVM: x86: Add support for user wait instructions") Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200523161455.3940-2-mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Set the mmio_value to '0' instead of simply clearing the present bit to squash a benign warning in kvm_mmu_set_mmio_spte_mask() that complains about the mmio_value overlapping the lower GFN mask on systems with 52 bits of PA space. Opportunistically clean up the code and comments. Cc: stable@vger.kernel.org Fixes: d43e2675 ("KVM: x86: only do L1TF workaround on affected processors") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200527084909.23492-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 20 May, 2020 2 commits
-
-
Paolo Bonzini authored
Merge tag 'noinstr-x86-kvm-2020-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
-
Paolo Bonzini authored
rcuwait_active only returns whether w->task is not NULL. This is exactly one of the usecases that are mentioned in the documentation for rcu_access_pointer() where it is correct to bypass lockdep checks. This avoids a splat from kvm_vcpu_on_spin(). Reported-by: Wanpeng Li <kernellwp@gmail.com> Tested-by: Wanpeng Li <kernellwp@gmail.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 19 May, 2020 6 commits
-
-
Thomas Gleixner authored
The async page fault injection into kernel space creates more problems than it solves. The host has absolutely no knowledge about the state of the guest if the fault happens in CPL0. The only restriction for the host is interrupt disabled state. If interrupts are enabled in the guest then the exception can hit arbitrary code. The HALT based wait in non-preemotible code is a hacky replacement for a proper hypercall. For the ongoing work to restrict instrumentation and make the RCU idle interaction well defined the required extra work for supporting async pagefault in CPL0 is just not justified and creates complexity for a dubious benefit. The CPL3 injection is well defined and does not cause any issues as it is more or less the same as a regular page fault from CPL3. Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.369802541@linutronix.de
-
Thomas Gleixner authored
While working on the entry consolidation I stumbled over the KVM async page fault handler and kvm_async_pf_task_wait() in particular. It took me a while to realize that the randomly sprinkled around rcu_irq_enter()/exit() invocations are just cargo cult programming. Several patches "fixed" RCU splats by curing the symptoms without noticing that the code is flawed from a design perspective. The main problem is that this async injection is not based on a proper handshake mechanism and only respects the minimal requirement, i.e. the guest is not in a state where it has interrupts disabled. Aside of that the actual code is a convoluted one fits it all swiss army knife. It is invoked from different places with different RCU constraints: 1) Host side: vcpu_enter_guest() kvm_x86_ops->handle_exit() kvm_handle_page_fault() kvm_async_pf_task_wait() The invocation happens from fully preemptible context. 2) Guest side: The async page fault interrupted: a) user space b) preemptible kernel code which is not in a RCU read side critical section c) non-preemtible kernel code or a RCU read side critical section or kernel code with CONFIG_PREEMPTION=n which allows not to differentiate between #2b and #2c. RCU is watching for: #1 The vCPU exited and current is definitely not the idle task #2a The #PF entry code on the guest went through enter_from_user_mode() which reactivates RCU #2b There is no preemptible, interrupts enabled code in the kernel which can run with RCU looking away. (The idle task is always non preemptible). I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU voodoo at all. In #2c RCU is eventually not watching, but as that state cannot schedule anyway there is no point to worry about it so it has to invoke rcu_irq_enter() before running that code. This can be optimized, but this will be done as an extra step in course of the entry code consolidation work. So the proper solution for this is to: - Split kvm_async_pf_task_wait() into schedule and halt based waiting interfaces which share the enqueueing code. - Add comments (condensed form of this changelog) to spare others the time waste and pain of reverse engineering all of this with the help of uncomprehensible changelogs and code history. - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(), user mode and schedulable kernel side async page faults (#1, #2a, #2b) - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel case (#2c). For this case also remove the rcu_irq_exit()/enter() pair around the halt as it is just a pointless exercise: - vCPUs can VMEXIT at any random point and can be scheduled out for an arbitrary amount of time by the host and this is not any different except that it voluntary triggers the exit via halt. - The interrupted context could have RCU watching already. So the rcu_irq_exit() before the halt is not gaining anything aside of confusing the reader. Claiming that this might prevent RCU stalls is just an illusion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de
-
Andy Lutomirski authored
KVM overloads #PF to indicate two types of not-actually-page-fault events. Right now, the KVM guest code intercepts them by modifying the IDT and hooking the #PF vector. This makes the already fragile fault code even harder to understand, and it also pollutes call traces with async_page_fault and do_async_page_fault for normal page faults. Clean it up by moving the logic into do_page_fault() using a static branch. This gets rid of the platform trap_init override mechanism completely. [ tglx: Fixed up 32bit, removed error code from the async functions and massaged coding style ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.169270470@linutronix.de
-
Thomas Gleixner authored
Force inlining of the helpers and mark the instrumentable parts accordingly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134341.672545766@linutronix.de
-
Peter Zijlstra authored
Force inlining and prevent instrumentation of all sorts by marking the functions which are invoked from low level entry code with 'noinstr'. Split the irqflags tracking into two parts. One which does the heavy lifting while RCU is watching and the final one which can be invoked after RCU is turned off. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.484532537@linutronix.de
-
Thomas Gleixner authored
trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer core itself is safe, but the resulting tracepoints can be utilized by e.g. BPF which is unsafe. Provide variants which do not contain the lockdep invocation so the lockdep and tracer invocations can be split at the call site and placed properly. This is required because lockdep needs to be aware of the state before switching away from RCU idle and after switching to RCU idle because these transitions can take locks. As these code pathes are going to be non-instrumentable the tracer can be invoked after RCU is turned on and before the switch to RCU idle. So for these new variants there is no need to invoke the rcuidle aware tracer functions. Name them so they match the lockdep counterparts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134100.270771162@linutronix.de
-