1. 04 Feb, 2021 40 commits
    • KVM: SVM: use vmsave/vmload for saving/restoring additional host state · e79b91bb
      Michael Roth authored
      Using a guest workload which simply issues 'hlt' in a tight loop to
      generate VMEXITs, it was observed (on a recent EPYC processor) that a
      significant amount of the VMEXIT overhead measured on the host was the
      result of MSR reads/writes in svm_vcpu_load/svm_vcpu_put according to
      perf:
      
        67.49%--kvm_arch_vcpu_ioctl_run
                |
                |--23.13%--vcpu_put
                |          kvm_arch_vcpu_put
                |          |
                |          |--21.31%--native_write_msr
                |          |
                |           --1.27%--svm_set_cr4
                |
                |--16.11%--vcpu_load
                |          |
                |           --15.58%--kvm_arch_vcpu_load
                |                     |
                |                     |--13.97%--svm_set_cr4
                |                     |          |
                |                     |          |--12.64%--native_read_msr
      
      Most of these MSRs relate to 'syscall'/'sysenter' and segment bases, and
      can be saved/restored using 'vmsave'/'vmload' instructions rather than
      explicit MSR reads/writes. In doing so there is a significant reduction
      in the svm_vcpu_load/svm_vcpu_put overhead measured for the above
      workload:
      
        50.92%--kvm_arch_vcpu_ioctl_run
                |
                |--19.28%--disable_nmi_singlestep
                |
                |--13.68%--vcpu_load
                |          kvm_arch_vcpu_load
                |          |
                |          |--9.19%--svm_set_cr4
                |          |          |
                |          |           --6.44%--native_read_msr
                |          |
                |           --3.55%--native_write_msr
                |
                |--6.05%--kvm_inject_nmi
                |--2.80%--kvm_sev_es_mmio_read
                |--2.19%--vcpu_put
                |          |
                |           --1.25%--kvm_arch_vcpu_put
                |                     native_write_msr
      
      Quantifying this further, if we look at the raw cycle counts for a
      normal iteration of the above workload (according to 'rdtscp'),
      kvm_arch_vcpu_ioctl_run() takes ~4600 cycles from start to finish with
      the current behavior. Using 'vmsave'/'vmload', this is reduced to
      ~2800 cycles, a savings of 39%.
      
      While this approach doesn't seem to manifest in any noticeable
      improvement for more realistic workloads like UnixBench, netperf, and
      kernel builds, likely due to their exit paths generally involving IO
      with comparatively high latencies, it does improve overall overhead
      of KVM_RUN significantly, which may still be noticeable for certain
      situations. It also simplifies some aspects of the code.
      
      With this change, explicit save/restore is no longer needed for the
      following host MSRs, since they are documented[1] as being part of the
      VMCB State Save Area:
      
        MSR_STAR, MSR_LSTAR, MSR_CSTAR,
        MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE,
        MSR_IA32_SYSENTER_CS,
        MSR_IA32_SYSENTER_ESP,
        MSR_IA32_SYSENTER_EIP,
        MSR_FS_BASE, MSR_GS_BASE
      
      and only the following MSR needs individual handling in
      svm_vcpu_put/svm_vcpu_load:
      
        MSR_TSC_AUX
      
      We could drop the host_save_user_msrs array/loop and instead handle
      MSR read/write of MSR_TSC_AUX directly, but we leave that for now as
      a potential follow-up.
      
      Since 'vmsave'/'vmload' also handles the LDTR and FS/GS segment
      registers (and associated hidden state)[2], some of the code
      previously used to handle this is no longer needed, so we drop it
      as well.
      
      The first public release of the SVM spec[3] also documents the same
      handling for the host state in question, so we make these changes
      unconditionally.
      
      Also worth noting is that we 'vmsave' to the same page that is
      subsequently used by 'vmrun' to record some additional host state. This
      is okay, since, in accordance with the spec[2], the additional state
      written to the page by 'vmrun' does not overwrite any fields written by
      'vmsave'. This has also been confirmed through testing (for the above
      CPU, at least).
      
      [1] AMD64 Architecture Programmer's Manual, Rev 3.33, Volume 2, Appendix B, Table B-2
      [2] AMD64 Architecture Programmer's Manual, Rev 3.31, Volume 3, Chapter 4, VMSAVE/VMLOAD
      [3] Secure Virtual Machine Architecture Reference Manual, Rev 3.01
      Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-Id: <20210202190126.2185715-2-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
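As a quick check of the arithmetic in the vmsave/vmload commit message above, the sketch below recomputes the quoted savings from the two approximate cycle counts (~4600 and ~2800). The helper name `percent_saved` is hypothetical and is not part of the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Percentage of cycles saved, rounded to the nearest whole percent.
 * Hypothetical helper for illustration only.
 */
static unsigned int percent_saved(uint64_t before, uint64_t after)
{
	assert(before > 0 && after <= before);
	return (unsigned int)(((before - after) * 100 + before / 2) / before);
}
```

With the figures from the commit message, `percent_saved(4600, 2800)` evaluates to 39, which matches the stated savings.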
    • KVM: SVM: Use asm goto to handle unexpected #UD on SVM instructions · 35a78319
      Sean Christopherson authored
      Add svm_asm*() macros, a la the existing vmx_asm*() macros, to handle
      faults on SVM instructions instead of using the generic __ex(), a.k.a.
      __kvm_handle_fault_on_reboot().  Using asm goto generates slightly
      better code as it eliminates the in-line JMP+CALL sequences that are
      needed by __kvm_handle_fault_on_reboot() to avoid triggering BUG()
      from fixup (which generates bad stack traces).
      
      Using SVM specific macros also drops the last user of __ex() and the
      last asm linkage to kvm_spurious_fault(), and adds a helper for
      VMSAVE, which may gain an additional call site in the future (as part
      of optimizing the SVM context switching).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20201231002702.22237077-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Use the kernel's version of VMXOFF · 6a289139
      Sean Christopherson authored
      Drop kvm_cpu_vmxoff() in favor of the kernel's cpu_vmxoff().  Modify the
      latter to return -EIO on fault so that KVM can invoke
      kvm_spurious_fault() when appropriate.  In addition to the obvious code
      reuse, dropping kvm_cpu_vmxoff() also eliminates VMX's last usage of the
      __ex()/__kvm_handle_fault_on_reboot() macros, thus helping pave the way
      toward dropping them entirely.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20201231002702.22237077-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Move Intel PT shenanigans out of VMXON/VMXOFF flows · 5ef940bd
      Sean Christopherson authored
      Move the Intel PT tracking outside of the VMXON/VMXOFF helpers so that
      a future patch can drop KVM's kvm_cpu_vmxoff() in favor of the kernel's
      cpu_vmxoff() without an associated PT functional change, and without
      losing symmetry between the VMXON and VMXOFF flows.
      
      Barring undocumented behavior, this should have no meaningful effects
      as Intel PT behavior does not interact with CR4.VMXE.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20201231002702.22237077-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/nVMX: Use __vmx_vcpu_run in nested_vmx_check_vmentry_hw · 150f17bf
      Uros Bizjak authored
      Replace inline assembly in nested_vmx_check_vmentry_hw
      with a call to __vmx_vcpu_run.  The function is not
      performance critical, so (double) GPR save/restore
      in __vmx_vcpu_run can be tolerated, as far as performance
      effects are concerned.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      [sean: dropped versioning info from changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20201231002702.22237077-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/virt: Mark flags and memory as clobbered by VMXOFF · 53666664
      David P. Reed authored
      Explicitly tell the compiler that VMXOFF modifies flags (like all VMX
      instructions), and mark memory as clobbered since VMXOFF must not be
      reordered and also may have memory side effects (though the kernel
      really shouldn't be accessing the root VMCS anyways).
      
      Practically speaking, adding the clobbers is most likely a nop; the
      primary motivation is to properly document VMXOFF's behavior.
      
      For the flags clobber, both Clang and GCC automatically mark flags as
      clobbered; this is noted in commit 4b1e5478 ("KVM/x86: Use assembly
      instruction mnemonics instead of .byte streams"), which intentionally
      removed the previous clobber.  But, neither Clang nor GCC documents
      this behavior, and there's no downside to including the clobber.
      
      For the memory clobber, the RFLAGS.IF and CR4.VMXE manipulations that
      immediately follow VMXOFF have compiler barriers of their own, i.e.
      VMXOFF can't get reordered after clearing CR4.VMXE, which is really
      what's of interest.
      
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David P. Reed <dpreed@deepplum.com>
      [sean: rewrote changelog, dropped comment adjustments]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20201231002702.22237077-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/reboot: Force all cpus to exit VMX root if VMX is supported · ed727361
      Sean Christopherson authored
      Force all CPUs to do VMXOFF (via NMI shootdown) during an emergency
      reboot if VMX is _supported_, as VMX being off on the current CPU does
      not prevent other CPUs from being in VMX root (post-VMXON).  This fixes
      a bug where a crash/panic reboot could leave other CPUs in VMX root and
      prevent them from being woken via INIT-SIPI-SIPI in the new kernel.
      
      Fixes: d176720d ("x86: disable VMX on all CPUs on reboot")
      Cc: stable@vger.kernel.org
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David P. Reed <dpreed@deepplum.com>
      [sean: reworked changelog and further tweaked comment]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20201231002702.22237077-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/virt: Eat faults on VMXOFF in reboot flows · aec511ad
      Sean Christopherson authored
      Silently ignore all faults on VMXOFF in the reboot flows as such faults
      are all but guaranteed to be due to the CPU not being in VMX root.
      Because (a) VMXOFF may be executed in NMI context, e.g. after VMXOFF but
      before CR4.VMXE is cleared, (b) there's no way to query the CPU's VMX
      state without faulting, and (c) the whole point is to get out of VMX
      root, eating faults is the simplest way to achieve the desired behavior.
      
      Technically, VMXOFF can fault (or fail) for other reasons, but all other
      fault and failure scenarios are mode related, i.e. the kernel would have
      to magically end up in RM, V86, compat mode, at CPL>0, or running with
      the SMI Transfer Monitor active.  The kernel is beyond hosed if any of
      those scenarios are encountered; trying to do something fancy in the
      error path to handle them cleanly is pointless.
      
      Fixes: 1e993114 ("x86: asm/virtext.h: add cpu_vmxoff() inline function")
      Reported-by: David P. Reed <dpreed@deepplum.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20201231002702.22237077-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: use static calls to reduce kvm_x86_ops overhead · b3646477
      Jason Baron authored
      Convert kvm_x86_ops to use static calls. Note that all kvm_x86_ops are
      covered here except for 'pmu_ops' and 'nested ops'.
      
      Here are some numbers running cpuid in a loop of 1 million calls averaged
      over 5 runs, measured in the vm (lower is better).
      
      Intel Xeon 3000MHz:
      
                 |default    |mitigations=off
      -------------------------------------
      vanilla    |.671s      |.486s
      static call|.573s(-15%)|.458s(-6%)
      
      AMD EPYC 2500MHz:
      
                 |default    |mitigations=off
      -------------------------------------
      vanilla    |.710s      |.609s
      static call|.664s(-6%) |.609s(0%)
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Message-Id: <e057bf1b8a7ad15652df6eeba3f907ae758d3399.1610680941.git.jbaron@akamai.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: introduce definitions to support static calls for kvm_x86_ops · 9af5471b
      Jason Baron authored
      Use static calls to improve kvm_x86_ops performance. Introduce the
      definitions that will be used by a subsequent patch to actualize the
      savings. Add a new kvm-x86-ops.h header that can be used for the
      definition of static calls. This header is also intended to be
      used to simplify the definition of svm_x86_ops and vmx_x86_ops.
      
      Note that all functions in kvm_x86_ops are covered here except for
      'pmu_ops' and 'nested ops'. I think they can be covered by static
      calls in a similar manner, but were omitted from this series to
      reduce scope and because I don't think they have as large a
      performance impact.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Message-Id: <e5cc82ead7ab37b2dceb0837a514f3f8bea4f8d1.1610680941.git.jbaron@akamai.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
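The idea of a single ops header that drives several definitions can be illustrated with a small userspace sketch. This is not the kernel's kvm-x86-ops.h and it does not implement static calls (which patch call sites at runtime). It only shows how one op list can generate both the struct members and a vendor table, assuming uniformly prefixed function names. All `demo_`/`DEMO_` names and the simplified `int (*)(int)` signatures are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical op list. The header described above would enumerate the
 * real kvm_x86_ops members; these two names are only illustrative.
 */
#define DEMO_X86_OPS(OP)	\
	OP(get_cpl)		\
	OP(has_emulated_msr)

/* Expansion 1: one function pointer per listed op. */
#define DEMO_DECLARE(name) int (*name)(int arg);
struct demo_x86_ops {
	DEMO_X86_OPS(DEMO_DECLARE)
};

/* Vendor implementations share a uniform "vmx_" prefix (simplified signatures). */
static int vmx_get_cpl(int arg) { return arg & 3; }
static int vmx_has_emulated_msr(int arg) { return arg != 0; }

/* Expansion 2: the uniform prefix lets a macro fill in the whole table. */
#define DEMO_INIT(name) .name = vmx_##name,
static const struct demo_x86_ops vmx_demo_ops = {
	DEMO_X86_OPS(DEMO_INIT)
};

/* Expansion 3: count the ops from the same list. */
#define DEMO_COUNT(name) + 1
enum { DEMO_NR_OPS = 0 DEMO_X86_OPS(DEMO_COUNT) };

/* Usage: the call goes through the table that was built from the list. */
static int demo_call_get_cpl(int arg)
{
	assert(vmx_demo_ops.get_cpl != NULL);
	return vmx_demo_ops.get_cpl(arg);
}
```

Because every expansion is driven by the same list, adding an op in one place keeps the struct, the vendor table, and the count in sync, which is the kind of simplification the commit message refers to.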
    • KVM: X86: prepend vmx/svm prefix to additional kvm_x86_ops functions · b6a7cc35
      Jason Baron authored
      A subsequent patch introduces macros in preparation for simplifying the
      definition for vmx_x86_ops and svm_x86_ops. Making the naming more uniform
      expands the coverage of the macros. Add vmx/svm prefix to the following
      functions: update_exception_bitmap(), enable_nmi_window(),
      enable_irq_window(), update_cr8_intercept(), and enable_smi_window().
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Message-Id: <ed594696f8e2c2b2bfc747504cee9bbb2a269300.1610680941.git.jbaron@akamai.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Stop using deprecated jump label APIs · 6e4e3b4d
      Cun Li authored
      The use of 'struct static_key' and 'static_key_false' is
      deprecated. Use the new API.
      Signed-off-by: Cun Li <cun.jia.li@gmail.com>
      Message-Id: <20210111152435.50275-1-cun.jia.li@gmail.com>
      [Make it compile.  While at it, rename kvm_no_apic_vcpu to
       kvm_has_noapic_vcpu; the former reads too much like "true if
       no vCPU has an APIC". - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Fix #GP handling for doubly-nested virtualization · 14c2bf81
      Wei Huang authored
      In the nested-on-nested case (L0, L1, and L2 are all hypervisors), we
      do not support emulation of the vVMLOAD/VMSAVE feature. Instead, the
      L0 hypervisor can inject the proper #VMEXIT to inform L1 of what is
      happening, so that L1 can avoid invoking the #GP workaround.  For this
      reason we turn on the guest VM's X86_FEATURE_SVME_ADDR_CHK bit, so that
      KVM running inside a VM receives the notification and changes its
      behavior.
      
      Similarly, we check whether the vcpu is in guest mode before emulating
      the vmware-backdoor instructions. For the nested-on-nested case, we
      let the guest handle it.
      Co-developed-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Wei Huang <wei.huang2@amd.com>
      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210126081831.570253-5-wei.huang2@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Add support for SVM instruction address check change · 3b9c723e
      Wei Huang authored
      New AMD CPUs check the #VMEXIT intercept on special SVM instructions
      before checking their EAX against reserved memory regions.
      This change is indicated by CPUID_0x8000000A_EDX[28]. If it is 1, #VMEXIT
      is triggered before #GP, so KVM doesn't need to intercept and emulate #GP
      faults: any #GP that still occurs is supposed to be triggered.
      Co-developed-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Wei Huang <wei.huang2@amd.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210126081831.570253-4-wei.huang2@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
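The feature check described in the commit above comes down to testing one bit of a CPUID output register. The sketch below takes the EDX value of leaf 0x8000000A as an input instead of executing CPUID, so it stays portable. The helper names and the simple policy in `need_gp_intercept()` are assumptions for illustration. The real KVM logic may depend on further conditions, such as whether nested virtualization is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit position taken from the commit message: CPUID_0x8000000A_EDX[28]. */
#define SVME_ADDR_CHK_BIT 28u

/* Test one bit of a 32-bit CPUID output register. */
static bool cpuid_reg_bit(uint32_t reg, unsigned int bit)
{
	assert(bit < 32);	/* shifting by 32 or more would be undefined */
	return (reg >> bit) & 1u;
}

/*
 * Hypothetical policy helper: #GP interception is only needed as a
 * workaround when the CPU does NOT report the new check ordering.
 */
static bool need_gp_intercept(uint32_t cpuid_8000000a_edx)
{
	return !cpuid_reg_bit(cpuid_8000000a_edx, SVME_ADDR_CHK_BIT);
}
```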
    • KVM: SVM: Add emulation support for #GP triggered by SVM instructions · 82a11e9c
      Bandan Das authored
      While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD
      CPUs check EAX against reserved memory regions (e.g. SMM memory on host)
      before checking VMCB's instruction intercept. If EAX falls into such
      memory areas, #GP is triggered before VMEXIT. This causes problems under
      nested virtualization. To solve this, KVM needs to trap #GP and
      check which instruction triggered it. For VM execution instructions,
      KVM emulates these instructions.
      Co-developed-by: Wei Huang <wei.huang2@amd.com>
      Signed-off-by: Wei Huang <wei.huang2@amd.com>
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Message-Id: <20210126081831.570253-3-wei.huang2@amd.com>
      [Conditionally enable #GP intercept. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Factor out x86 instruction emulation with decoding · 4aa2691d
      Wei Huang authored
      Move the instruction decode part out of x86_emulate_instruction() so
      that it can be used in other places. Also, kvm_clear_exception_queue()
      is moved inside the if-statement as it doesn't apply when KVM is coming
      back from userspace.
      Co-developed-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Wei Huang <wei.huang2@amd.com>
      Message-Id: <20210126081831.570253-2-wei.huang2@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW · 9a3ecd5e
      Chenyi Qiang authored
      DR6_INIT contains the 1-reserved bits as well as the bit that is cleared
      to 0 when the condition (e.g. RTM) happens. The value can be used to
      initialize dr6 and also be the XOR mask between the #DB exit
      qualification (or payload) and DR6.
      
      Considering that DR6_INIT is used as an initial value only once, rename it
      to DR6_ACTIVE_LOW and apply it in other places, which makes the
      upcoming changes for the bus lock debug exception simpler.
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20210202090433.13441-2-chenyi.qiang@intel.com>
      [Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
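The role of DR6_ACTIVE_LOW as both an initial value and an XOR mask, as described in the commit above, can be sketched as follows. The constant values are assumptions chosen to follow the x86 DR6 bit layout (B0-B3 in bits 0-3, BD/BS/BT in bits 13-15, RTM in bit 16) and are not copied from the patch. `dr6_apply_payload()` is a hypothetical helper, not the kernel's function. The payload is taken to be active high for every bit, so a set RTM bit in the payload must end up as a cleared bit in DR6.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, following the x86 DR6 bit layout. */
#define DR_TRAP_BITS	0x0000000fu	/* B0-B3 */
#define DR6_RTM		(1u << 16)	/* active low: 0 means "inside RTM" */
#define DR6_ACTIVE_LOW	0xffff0ff0u	/* reserved-1 bits plus active-low bits */
#define DR6_VOLATILE	0x0001e00fu	/* bits that a #DB may change */
#define DR6_FIXED_1	(DR6_ACTIVE_LOW & ~DR6_VOLATILE)

/* Fold an active-high #DB payload into a DR6 value. */
static uint32_t dr6_apply_payload(uint32_t dr6, uint32_t payload)
{
	/* A payload must not name a fixed-1 bit, or the XOR would clear it. */
	assert((payload & DR6_FIXED_1) == 0);

	dr6 &= ~DR_TRAP_BITS;			/* drop stale B0-B3 */
	dr6 |= DR6_ACTIVE_LOW;			/* fixed-1 and active-low bits default to 1 */
	dr6 |= payload;				/* active-high bits: set if reported */
	dr6 ^= payload & DR6_ACTIVE_LOW;	/* active-low bits: clear if reported */
	return dr6;
}
```

For example, starting from `DR6_ACTIVE_LOW` and applying a payload of `0x1u | DR6_RTM` sets B0 and clears bit 16, giving `0xfffe0ff1`.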
    • selftests: kvm/x86: add test for pmu msr MSR_IA32_PERF_CAPABILITIES · f88d4f2f
      Like Xu authored
      This test will check the effect of various CPUID settings on the
      MSR_IA32_PERF_CAPABILITIES MSR, check that whatever user space writes
      with KVM_SET_MSR is _not_ modified from the guest and can be retrieved
      with KVM_GET_MSR, and check that invalid LBR formats are rejected.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Message-Id: <20210201051039.255478-12-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES · be635e34
      Like Xu authored
      Userspace can enable the guest LBR feature when the exact supported
      LBR format value is written to MSR_IA32_PERF_CAPABILITIES
      and the LBR is also compatible with the vPMU version and host cpu model.
      
      The LBR could be enabled on the guest if host perf supports LBR
      (checked via x86_perf_get_lbr()) and the vcpu model is compatible
      with the host one.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Message-Id: <20210201051039.255478-11-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx/pmu: Release guest LBR event via lazy release mechanism · 9aa4f622
      Like Xu authored
      The vPMU uses GUEST_LBR_IN_USE_IDX (bit 58) in 'pmu->pmc_in_use' to
      indicate whether a guest LBR event is still needed by the vcpu. If the
      vcpu no longer accesses LBR related registers within a scheduling time
      slice, and the enable bit of LBR has been unset, vPMU will treat the
      guest LBR event as an ordinary vPMC event and release it
      as usual. Also, the pass-through state of the LBR record msrs is cancelled.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Message-Id: <20210201051039.255478-10-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx/pmu: Emulate legacy freezing LBRs on virtual PMI · e6209a3b
      Like Xu authored
      The current vPMU only supports Architecture Version 2. According to
      Intel SDM "17.4.7 Freezing LBR and Performance Counters on PMI", if
      IA32_DEBUGCTL.Freeze_LBR_On_PMI = 1, the LBR is frozen on the virtual
      PMI and KVM emulates this by clearing the LBR bit (bit 0) in
      IA32_DEBUGCTL. The guest then needs to re-enable IA32_DEBUGCTL.LBR
      to resume recording branches.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Message-Id: <20210201051039.255478-9-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
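The freeze behavior described in the commit above can be modeled as a pure function on the IA32_DEBUGCTL value. This is a minimal sketch, not the KVM implementation. The LBR bit position (bit 0) comes from the commit message, while the Freeze_LBR_On_PMI position (bit 11) is an assumption based on the usual IA32_DEBUGCTL layout. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEBUGCTLMSR_LBR			(1ULL << 0)	/* bit 0, per the commit message */
#define DEBUGCTLMSR_FREEZE_LBRS_ON_PMI	(1ULL << 11)	/* assumed bit position */

/* Emulate the freeze: a virtual PMI clears LBR only if the freeze bit is set. */
static uint64_t debugctl_on_virtual_pmi(uint64_t debugctl)
{
	if (debugctl & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)
		debugctl &= ~DEBUGCTLMSR_LBR;
	return debugctl;
}

/* Usage example: the guest must set LBR again to resume recording. */
static void debugctl_example(void)
{
	uint64_t v = DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;

	v = debugctl_on_virtual_pmi(v);
	assert(!(v & DEBUGCTLMSR_LBR));		/* frozen by the PMI */

	v |= DEBUGCTLMSR_LBR;			/* guest re-enables LBR */
	assert(v & DEBUGCTLMSR_LBR);
}
```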
    • KVM: vmx/pmu: Reduce the overhead of LBR pass-through or cancellation · 9254beaa
      Like Xu authored
      When the LBR record msrs have already been passed through, there is no
      need to call vmx_update_intercept_for_lbr_msrs() again and again, and
      vice versa.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Message-Id: <20210201051039.255478-8-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
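The optimization in the commit above amounts to caching the current pass-through state and skipping the update when nothing changes. The sketch below shows that pattern in isolation. The struct, its fields, and the `demo_` functions are hypothetical stand-ins, with a counter taking the place of the expensive intercept update.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_lbr_desc {
	bool msr_passthrough;		/* cached intercept state */
	unsigned int bitmap_updates;	/* counts the expensive updates */
};

/* Stand-in for the expensive per-MSR intercept update. */
static void demo_update_intercepts(struct demo_lbr_desc *lbr, bool passthrough)
{
	lbr->bitmap_updates++;
	lbr->msr_passthrough = passthrough;
}

/* Only touch the intercepts when the requested state differs from the cache. */
static void demo_set_passthrough(struct demo_lbr_desc *lbr, bool passthrough)
{
	if (lbr->msr_passthrough == passthrough)
		return;		/* already in the requested state */
	demo_update_intercepts(lbr, passthrough);
}

/* Usage example: repeated requests for the same state cost nothing extra. */
static void demo_passthrough_example(void)
{
	struct demo_lbr_desc lbr = { false, 0 };

	demo_set_passthrough(&lbr, true);
	demo_set_passthrough(&lbr, true);	/* no-op */
	assert(lbr.bitmap_updates == 1);

	demo_set_passthrough(&lbr, false);
	demo_set_passthrough(&lbr, false);	/* no-op */
	assert(lbr.bitmap_updates == 2);
}
```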
    • KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE · 1b5ac322
      Like Xu authored
      In addition to DEBUGCTLMSR_LBR, any KVM trap caused by an LBR msr access
      will result in the creation of a per-vcpu guest LBR event.
      
      If the guest LBR event is scheduled on with the corresponding vcpu context,
      KVM will pass through all LBR record msrs to the guest. The LBR callstack
      mechanism implemented in the host helps save/restore the guest LBR
      records during the event context switches, which avoids a lot of the
      overhead of saving/restoring tens of LBR msrs (e.g. 32 LBR record
      entries) in the much more frequent VMX transitions.
      
      To avoid reclaiming LBR resources from any higher priority event on the
      host, KVM always checks the existence of the guest LBR event and its state
      as late as possible before vm-entry. A negative result cancels the
      pass-through state, and it also prevents real register accesses and
      potential data leakage. If the host reclaims the LBR between two checks,
      the interception state and LBR records can be safely preserved due to
      native save/restore support from the guest LBR event.
      
      KVM emits a pr_warn() when the LBR hardware is unavailable to the
      guest LBR event. The administrator is supposed to remind users that the
      guest result may be inaccurate if someone is using LBR to record the
      hypervisor on the host side.
      Suggested-by: Andi Kleen <ak@linux.intel.com>
      Co-developed-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Message-Id: <20210201051039.255478-7-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx/pmu: Create a guest LBR event when vcpu sets DEBUGCTLMSR_LBR · 8e12911b
      Like Xu authored
      When the vcpu sets DEBUGCTLMSR_LBR in MSR_IA32_DEBUGCTLMSR, the KVM handler
      creates a guest LBR event which enables the callstack mode and to which no
      hardware counter is assigned. Host perf schedules and enables this
      event as usual but in an exclusive way.
      
      For now the guest LBR event is released when the vPMU is reset; soon,
      the lazy release mechanism will be applied to this event as it is to a vPMC.
      Suggested-by: Andi Kleen <ak@linux.intel.com>
      Co-developed-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Message-Id: <20210201051039.255478-6-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabled · c6462363
      Like Xu authored
      Userspace could set bits [0, 5] of the IA32_PERF_CAPABILITIES
      MSR, which describe the record format stored in the LBR records.
      
      The LBR will be enabled on the guest if host perf supports LBR
      (checked via x86_perf_get_lbr()) and the vcpu model is compatible
      with the host one.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Message-Id: <20210201051039.255478-4-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
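The LBR format field mentioned in the commit above occupies bits [0, 5] of IA32_PERF_CAPABILITIES, so extracting it is a single mask. The sketch below also includes a hypothetical validity check. The acceptance policy in it (a zero format means LBR is not requested, otherwise the guest format must equal the host format) is an assumption for illustration and is not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMU_CAP_LBR_FMT 0x3fULL	/* bits [0, 5], per the commit message */

/* Extract the LBR record format field from a PERF_CAPABILITIES value. */
static uint64_t lbr_fmt(uint64_t perf_capabilities)
{
	return perf_capabilities & PMU_CAP_LBR_FMT;
}

/*
 * Assumed policy: accept a guest value whose format field is either zero
 * (LBR not requested) or exactly the host's format.
 */
static bool guest_lbr_fmt_ok(uint64_t guest_caps, uint64_t host_caps)
{
	uint64_t fmt = lbr_fmt(guest_caps);

	return fmt == 0 || fmt == lbr_fmt(host_caps);
}

/* Usage example with made-up capability values. */
static void lbr_fmt_example(void)
{
	assert(guest_lbr_fmt_ok(0x05, 0x05));	/* exact match */
	assert(!guest_lbr_fmt_ok(0x04, 0x05));	/* mismatched format */
	assert(guest_lbr_fmt_ok(0x00, 0x05));	/* LBR not requested */
}
```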
    • KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabled · 9c9520ce
      Paolo Bonzini authored
      Userspace could set bits [0, 5] of the IA32_PERF_CAPABILITIES
      MSR, which describe the record format stored in the LBR records.
      
      The LBR will be enabled on the guest if host perf supports LBR
      (checked via x86_perf_get_lbr()) and the vcpu model is compatible
      with the host one.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Message-Id: <20210201051039.255478-4-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/pmu: preserve IA32_PERF_CAPABILITIES across CPUID refresh · a7557539
      Paolo Bonzini authored
      Once MSR_IA32_PERF_CAPABILITIES is changed via vmx_set_msr(), the
      value should not be changed by cpuid(). To ensure that the new value
      is kept, the default initialization path is moved to intel_pmu_init().
      The effective value of the MSR will be 0 if PDCM is clear, however.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/vmx: Make vmx_set_intercept_for_msr() non-static · 252e365e
      Like Xu authored
      To make code responsibilities clear, we may reuse and invoke
      vmx_set_intercept_for_msr() in other vmx-specific files (e.g. pmu_intel.c),
      so expose it in order to pass through LBR msrs later.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Message-Id: <20210201051039.255478-2-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: read/write MSR_IA32_DEBUGCTLMSR from GUEST_IA32_DEBUGCTL · d855066f
      Like Xu authored
      SVM already has specific handlers of MSR_IA32_DEBUGCTLMSR in the
      svm_get/set_msr, so the x86 common part can be safely moved to VMX.
      This allows KVM to store the bits it supports in GUEST_IA32_DEBUGCTL.
      
      Add vmx_supported_debugctl() to refactor the logic that raises #GP.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Message-Id: <20210108013704.134985-2-like.xu@linux.intel.com>
      [Merge parts of Chenyi Qiang's "KVM: X86: Expose bus lock debug exception
       to guest". - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Use x2apic_mode to avoid RDMSR when querying PI state · 563c54c4
      Sean Christopherson authored
      Use x2apic_mode instead of x2apic_enabled() when adjusting the
      destination ID during Posted Interrupt updates.  This avoids the costly
      RDMSR that is hidden behind x2apic_enabled().
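
      The idea can be sketched as follows. This is a minimal illustration,
      not the kernel code: x2apic_mode_cached and pi_notification_dest() are
      hypothetical names standing in for the exported x2apic_mode flag and
      for the Posted Interrupt destination update. It assumes the usual
      encoding in which xAPIC mode places the 8-bit APIC ID in bits 15:8 of
      the destination field, while x2APIC mode uses the full 32-bit ID.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a flag that is set once during APIC setup.
 * Reading it is a plain memory load, whereas a helper that re-reads the
 * APIC base MSR pays for an RDMSR on every call.
 */
static bool x2apic_mode_cached;

/* Compute the posted-interrupt notification destination for an APIC ID. */
static uint32_t pi_notification_dest(uint32_t apic_id)
{
	if (x2apic_mode_cached)
		return apic_id;			/* x2APIC: full 32-bit ID */

	return (apic_id << 8) & 0xFF00;		/* xAPIC: 8-bit ID in bits 15:8 */
}
```

      Because the flag only changes when the APIC mode is configured, every
      later query can rely on the cached value instead of touching the MSR.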
      Reported-by: luferry <luferry@163.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210115220354.434807-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      563c54c4
    • Sean Christopherson's avatar
      x86/apic: Export x2apic_mode for use by KVM in "warm" path · db7d8e47
      Sean Christopherson authored
      Export x2apic_mode so that KVM can query whether x2APIC is active
      without having to incur the RDMSR in x2apic_enabled().  When Posted
      Interrupts are in use for a guest with an assigned device, KVM ends up
      checking for x2APIC at least once every time a vCPU halts.  KVM could
      obviously snapshot x2apic_enabled() to avoid the RDMSR, but that's
      rather silly given that x2apic_mode holds the exact info needed by KVM.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210115220354.434807-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      db7d8e47
    • Chenyi Qiang's avatar
      KVM: X86: Add the Document for KVM_CAP_X86_BUS_LOCK_EXIT · c32b1b89
      Chenyi Qiang authored
      Introduce a new capability named KVM_CAP_X86_BUS_LOCK_EXIT, which is
      used to handle bus locks detected in the guest. It allows userspace to
      apply custom throttling policies to mitigate the 'noisy neighbour' problem.
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20201106090315.18606-5-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c32b1b89
    • Chenyi Qiang's avatar
      KVM: VMX: Enable bus lock VM exit · fe6b6bc8
      Chenyi Qiang authored
      A virtual machine can exploit bus locks to degrade the performance of
      the system. A bus lock can be caused by a split-locked access to
      writeback (WB) memory or by using locks on uncacheable (UC) memory. A
      bus lock is typically >1000 cycles slower than an atomic operation
      within a cache line. It also disrupts performance on other cores (which
      must wait for the bus lock to be released before their memory
      operations can complete).
      
      To address the threat, bus lock VM exit is introduced to notify the VMM
      when a bus lock was acquired, allowing it to enforce throttling or other
      policy based mitigations.
      
      A VMM can enable VM exits due to bus locks by setting a new "Bus Lock
      Detection" VM-execution control (bit 30 of the Secondary Processor-based
      VM-execution controls). If delivery of this VM exit was preempted by a
      higher priority VM exit (e.g. EPT misconfiguration, EPT violation, APIC
      access VM exit, APIC write VM exit, exception bitmap exiting), bit 26 of
      the exit reason field in the VMCS is set to 1.
      
      In the current implementation, KVM exposes this capability through
      KVM_CAP_X86_BUS_LOCK_EXIT. The user can get the supported mode bitmap
      (i.e. off and exit) and enable it explicitly (disabled by default). If
      KVM detects a bus lock in the guest, it exits to userspace even when
      the current exit reason is handled by KVM internally. A new flag,
      KVM_RUN_BUS_LOCK, is set in vcpu->run->flags to inform userspace that
      a bus lock was detected in the guest.
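
      The bit positions described above can be sketched as follows. This is
      an illustration only: all DEMO_* names and both helper functions are
      hypothetical, the value chosen for DEMO_RUN_BUS_LOCK is an assumption
      rather than the value in the KVM headers, and only the "preempted"
      case signalled through bit 26 of the exit reason is modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 30 of the secondary processor-based VM-execution controls. */
#define DEMO_SECONDARY_EXEC_BUS_LOCK_DETECTION	(1u << 30)
/* Bit 26 of the exit reason: a bus lock exit was preempted. */
#define DEMO_EXIT_REASON_BUS_LOCK_DETECTED	(1u << 26)
/* Assumed flag value, for illustration only. */
#define DEMO_RUN_BUS_LOCK			(1u << 1)

/* Is bus lock VM exiting enabled in the given secondary controls? */
static bool demo_bus_lock_exit_enabled(uint32_t secondary_ctls)
{
	return (secondary_ctls & DEMO_SECONDARY_EXEC_BUS_LOCK_DETECTION) != 0;
}

/*
 * Return the run flags to report to userspace: the bus lock flag is added
 * when bit 26 is set, regardless of the basic exit reason in bits 15:0.
 */
static uint16_t demo_report_bus_lock(uint32_t exit_reason, uint16_t run_flags)
{
	if (exit_reason & DEMO_EXIT_REASON_BUS_LOCK_DETECTED)
		run_flags |= DEMO_RUN_BUS_LOCK;

	return run_flags;
}
```

      A userspace VMM following this scheme would check the flag after each
      return from the vCPU run loop and apply its throttling policy there.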
      
      Documentation for bus lock VM exit is now available in the latest
      "Intel Architecture Instruction Set Extensions Programming Reference".
      
      Document Link:
      https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
      
      Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20201106090315.18606-4-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fe6b6bc8
    • Chenyi Qiang's avatar
      KVM: X86: Reset the vcpu->run->flags at the beginning of vcpu_run · 15aad3be
      Chenyi Qiang authored
      Reset vcpu->run->flags at the beginning of kvm_arch_vcpu_ioctl_run().
      This avoids requiring every chunk of code that sets a flag to also
      clear it, which would increase the odds of missing a case and ending
      up with a flag in an undefined state.
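
      A minimal sketch of the pattern, using hypothetical names (struct
      demo_run, DEMO_RUN_FLAG_EVENT, demo_vcpu_run()) rather than the real
      KVM structures: the flags are cleared once on entry, so each later
      code path only ever needs to set bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RUN_FLAG_EVENT	(1u << 0)	/* hypothetical flag */

struct demo_run {
	uint16_t flags;
};

/* Hypothetical run loop entry point. */
static void demo_vcpu_run(struct demo_run *run, bool event_seen)
{
	/* Single reset point: stale flags from the previous run are dropped. */
	run->flags = 0;

	/* Individual paths only set flags; none of them has to clear any. */
	if (event_seen)
		run->flags |= DEMO_RUN_FLAG_EVENT;
}
```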
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20201106090315.18606-3-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      15aad3be
    • Sean Christopherson's avatar
      KVM: VMX: Convert vcpu_vmx.exit_reason to a union · 8e533240
      Sean Christopherson authored
      Convert vcpu_vmx.exit_reason from a u32 to a union (of size u32).  The
      full VM_EXIT_REASON field is comprised of a 16-bit basic exit reason in
      bits 15:0, and single-bit modifiers in bits 31:16.
      
      Historically, KVM has only had to worry about handling the "failed
      VM-Entry" modifier, which could only be set in very specific flows and
      required dedicated handling.  I.e. manually stripping the FAILED_VMENTRY
      bit was a somewhat viable approach.  But even with only a single bit to
      worry about, KVM has had several bugs related to comparing a basic exit
      reason against the full exit reason stored in vcpu_vmx.
      
      Upcoming Intel features, e.g. SGX, will add new modifier bits that can
      be set on more or less any VM-Exit, as opposed to the significantly more
      restricted FAILED_VMENTRY, i.e. correctly handling everything in one-off
      flows isn't scalable.  Tracking exit reason in a union forces code to
      explicitly choose between consuming the full exit reason and the basic
      exit, and is a convenient way to document and access the modifiers.
      
      No functional change intended.
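
      The layout described above can be sketched as a standalone union. This
      is an illustration with hypothetical names (union demo_exit_reason,
      demo_is_basic_exit()), not the definition added by the patch; it lumps
      bits 30:16 into one field and assumes the "failed VM-Entry" modifier
      is bit 31. Bit-field ordering is implementation-defined in C, so the
      sketch assumes GCC or Clang on a little-endian target, where fields
      are allocated starting from the least significant bit.

```c
#include <stdbool.h>
#include <stdint.h>

union demo_exit_reason {
	struct {
		uint32_t basic		: 16;	/* bits 15:0, basic exit reason */
		uint32_t modifiers	: 15;	/* bits 30:16, other modifiers */
		uint32_t failed_vmentry	: 1;	/* bit 31 */
	};
	uint32_t full;				/* the raw 32-bit field */
};

/* Compare against the basic reason only, ignoring every modifier bit. */
static bool demo_is_basic_exit(union demo_exit_reason reason, uint16_t basic)
{
	return reason.basic == basic;
}
```

      With this shape, a comparison such as reason.full == 33 and a
      comparison such as reason.basic == 33 are visibly different at the
      call site, which is the property the conversion is after.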
      
      Cc: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20201106090315.18606-2-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8e533240
    • Brijesh Singh's avatar
      KVM/SVM: add support for SEV attestation command · 2c07ded0
      Brijesh Singh authored
      SEV FW version >= 0.23 added a new command that can be used to query
      the attestation report containing the SHA-256 digest of the guest memory
      encrypted through the KVM_SEV_LAUNCH_UPDATE_{DATA, VMSA} commands and
      to sign the report with the Platform Endorsement Key (PEK).
      
      See the SEV FW API spec section 6.8 for more details.
      
      Note that there already exists a command (KVM_SEV_LAUNCH_MEASURE) that
      can be used to get the SHA-256 digest. The main difference between
      KVM_SEV_LAUNCH_MEASURE and KVM_SEV_ATTESTATION_REPORT is that the latter
      can be called while the guest is running and the measurement value is
      signed with the PEK.
      
      Cc: James Bottomley <jejb@linux.ibm.com>
      Cc: Tom Lendacky <Thomas.Lendacky@amd.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: John Allen <john.allen@amd.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: linux-crypto@vger.kernel.org
      Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Tested-by: James Bottomley <jejb@linux.ibm.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Message-Id: <20210104151749.30248-1-brijesh.singh@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2c07ded0
    • Ben Gardon's avatar
      KVM: selftests: Disable dirty logging with vCPUs running · c1d1650f
      Ben Gardon authored
      Disabling dirty logging is much more interesting from a testing
      perspective if the vCPUs are still running. This also exercises the
      code path in which collapsible SPTEs must be faulted back in at a higher
      level after disabling dirty logging.
      
      To: linux-kselftest@vger.kernel.org
      CC: Peter Xu <peterx@redhat.com>
      CC: Andrew Jones <drjones@redhat.com>
      CC: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20210202185734.1680553-29-bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c1d1650f
    • Ben Gardon's avatar
      KVM: selftests: Add backing src parameter to dirty_log_perf_test · 9e965bb7
      Ben Gardon authored
      Add a parameter to control the backing memory type for
      dirty_log_perf_test so that the test can be run with hugepages.
      
      To: linux-kselftest@vger.kernel.org
      CC: Peter Xu <peterx@redhat.com>
      CC: Andrew Jones <drjones@redhat.com>
      CC: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20210202185734.1680553-28-bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9e965bb7
    • Ben Gardon's avatar
      KVM: selftests: Add memslot modification stress test · f73a3446
      Ben Gardon authored
      Add a memslot modification stress test in which a memslot is repeatedly
      created and removed while vCPUs access memory in another memslot. Most
      userspaces do not create or remove memslots on running VMs which makes
      it hard to test races in adding and removing memslots without a
      dedicated test. Adding and removing a memslot also has the effect of
      tearing down the entire paging structure, which leads to more page
      faults and pressure on the page fault handling path than a one-and-done
      memory population test.
      Reviewed-by: Jacob Xu <jacobhxu@google.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20210112214253.463999-7-bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f73a3446
    • Ben Gardon's avatar
      KVM: selftests: Add option to overlap vCPU memory access · 82f91337
      Ben Gardon authored
      Add an option to overlap the ranges of memory each vCPU accesses instead
      of partitioning them. This option increases the probability of multiple
      vCPUs faulting on the same page at the same time, which can trigger
      interesting races if there are bugs in the page fault handler or
      elsewhere in the kernel.
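
      The difference between the two modes can be sketched as a range
      calculation. This is an illustration under stated assumptions, not
      the selftest code: struct demo_vcpu_range and demo_range_for_vcpu()
      are hypothetical names, and the sketch assumes that in overlap mode
      every vCPU walks the whole test region.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_vcpu_range {
	uint64_t start;	/* first guest address this vCPU touches */
	uint64_t bytes;	/* size of the range this vCPU touches */
};

/* Pick the memory range for one vCPU out of nr_vcpus. */
static struct demo_vcpu_range demo_range_for_vcpu(uint64_t base,
						  uint64_t per_vcpu_bytes,
						  uint32_t vcpu_id,
						  uint32_t nr_vcpus,
						  bool overlap)
{
	struct demo_vcpu_range range;

	if (overlap) {
		/* Every vCPU covers the full region, so faults can collide. */
		range.start = base;
		range.bytes = per_vcpu_bytes * nr_vcpus;
	} else {
		/* Each vCPU gets a private, non-overlapping slice. */
		range.start = base + (uint64_t)vcpu_id * per_vcpu_bytes;
		range.bytes = per_vcpu_bytes;
	}

	return range;
}
```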
      Reviewed-by: Jacob Xu <jacobhxu@google.com>
      Reviewed-by: Makarand Sonare <makarandsonare@google.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20210112214253.463999-6-bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      82f91337