1. 18 Jun, 2019 20 commits
    • Sean Christopherson's avatar
      KVM: nVMX: Use descriptive names for VMCS sync functions and flags · 3731905e
      Sean Christopherson authored
      Nested virtualization involves copying data between many different types
      of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS.  Rename a variety
      of functions and flags to document both the source and destination of
      each sync.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3731905e
    • Sean Christopherson's avatar
      KVM: nVMX: Lift sync_vmcs12() out of prepare_vmcs12() · f4f8316d
      Sean Christopherson authored
      ... to make it more obvious that sync_vmcs12() is invoked on all nested
      VM-Exits, e.g. hiding sync_vmcs12() in prepare_vmcs12() makes it appear
      that guest state is NOT propagated to vmcs12 for a normal VM-Exit.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f4f8316d
    • Sean Christopherson's avatar
      KVM: nVMX: Track vmcs12 offsets for shadowed VMCS fields · 1c6f0b47
      Sean Christopherson authored
      The vmcs12 fields offsets are constant and known at compile time.  Store
      the associated offset for each shadowed field to avoid the costly lookup
      in vmcs_field_to_offset() when copying between vmcs12 and the shadow
      VMCS.  Avoiding the costly lookup reduces the latency of copying by
      ~100 cycles in each direction.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1c6f0b47
    • Sean Christopherson's avatar
      KVM: nVMX: Intercept VMWRITEs to GUEST_{CS,SS}_AR_BYTES · b6437805
      Sean Christopherson authored
      VMMs frequently read the guest's CS and SS AR bytes to detect 64-bit
      mode and CPL respectively, but effectively never write said fields once
      the VM is initialized.  Intercepting VMWRITEs for the two fields saves
      ~55 cycles in copy_shadow_to_vmcs12().
      
      Because some Intel CPUs, e.g. Haswell, drop the reserved bits of the
      guest access rights fields on VMWRITE, exposing the fields to L1 for
      VMREAD but not VMWRITE leads to inconsistent behavior between L1 and L2.
      On hardware that drops the bits, L1 will see the stripped down value due
      to reading the value from hardware, while L2 will see the full original
      value as stored by KVM.  To avoid such an inconsistency, emulate the
      behavior on all CPUS, but only for intercepted VMWRITEs so as to avoid
      introducing pointless latency into copy_shadow_to_vmcs12(), e.g. if the
      emulation were added to vmcs12_write_any().
      
      Since the AR_BYTES emulation is done only for intercepted VMWRITE, if a
      future patch (re)exposed AR_BYTES for both VMWRITE and VMREAD, then KVM
      would end up with incosistent behavior on pre-Haswell hardware, e.g. KVM
      would drop the reserved bits on intercepted VMWRITE, but direct VMWRITE
      to the shadow VMCS would not drop the bits.  Add a WARN in the shadow
      field initialization to detect any attempt to expose an AR_BYTES field
      without updating vmcs12_write_any().
      
      Note, emulation of the AR_BYTES reserved bit behavior is based on a
      patch[1] from Jim Mattson that applied the emulation to all writes to
      vmcs12 so that live migration across different generations of hardware
      would not introduce divergent behavior.  But given that live migration
      of nested state has already been enabled, that ship has sailed (not to
      mention that no sane VMM will be affected by this behavior).
      
      [1] https://patchwork.kernel.org/patch/10483321/
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b6437805
    • Sean Christopherson's avatar
      KVM: nVMX: Intercept VMWRITEs to read-only shadow VMCS fields · fadcead0
      Sean Christopherson authored
      Allowing L1 to VMWRITE read-only fields is only beneficial in a double
      nesting scenario, e.g. no sane VMM will VMWRITE VM_EXIT_REASON in normal
      non-nested operation.  Intercepting RO fields means KVM doesn't need to
      sync them from the shadow VMCS to vmcs12 when running L2.  The obvious
      downside is that L1 will VM-Exit more often when running L3, but it's
      likely safe to assume most folks would happily sacrifice a bit of L3
      performance, which may not even be noticeable in the grande scheme, to
      improve L2 performance across the board.
      
      Not intercepting fields tagged read-only also allows for additional
      optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO
      since those fields are rarely written by a VMMs, but read frequently.
      
      When utilizing a shadow VMCS with asymmetric R/W and R/O bitmaps, fields
      that cause VM-Exit on VMWRITE but not VMREAD need to be propagated to
      the shadow VMCS during VMWRITE emulation, otherwise a subsequence VMREAD
      from L1 will consume a stale value.
      
      Note, KVM currently utilizes asymmetric bitmaps when "VMWRITE any field"
      is not exposed to L1, but only so that it can reject the VMWRITE, i.e.
      propagating the VMWRITE to the shadow VMCS is a new requirement, not a
      bug fix.
      
      Eliminating the copying of RO fields reduces the latency of nested
      VM-Entry (copy_shadow_to_vmcs12()) by ~100 cycles (plus 40-50 cycles
      if/when the AR_BYTES fields are exposed RO).
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fadcead0
    • Sean Christopherson's avatar
      KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn · 95b5a48c
      Sean Christopherson authored
      Per commit 1b6269db ("KVM: VMX: Handle NMIs before enabling
      interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
      to "make sure we handle NMI on the current cpu, and that we don't
      service maskable interrupts before non-maskable ones".  The other
      exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
      have similar requirements, and are located there to avoid extra VMREADs
      since VMX bins hardware exceptions and NMIs into a single exit reason.
      
      Clean up the code and eliminate the vaguely named complete_atomic_exit()
      by moving the interrupts-disabled exception and NMI handling into the
      existing handle_external_intrs() callback, and rename the callback to
      a more appropriate name.  Rename VMexit handlers throughout so that the
      atomic and non-atomic counterparts have similar names.
      
      In addition to improving code readability, this also ensures the NMI
      handler is run with the host's debug registers loaded in the unlikely
      event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
      field is also improved when handling NMIs (and #MCs) as the handler
      will run after updating said field.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      [Naming cleanups. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      95b5a48c
    • Sean Christopherson's avatar
      KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code · 165072b0
      Sean Christopherson authored
      VMX can conditionally call kvm_{before,after}_interrupt() since KVM
      always uses "ack interrupt on exit" and therefore explicitly handles
      interrupts as opposed to blindly enabling irqs.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      165072b0
    • Sean Christopherson's avatar
      KVM: VMX: Store the host kernel's IDT base in a global variable · 2342080c
      Sean Christopherson authored
      Although the kernel may use multiple IDTs, KVM should only ever see the
      "real" IDT, e.g. the early init IDT is long gone by the time KVM runs
      and the debug stack IDT is only used for small windows of time in very
      specific flows.
      
      Before commit a547c6db ("KVM: VMX: Enable acknowledge interupt on
      vmexit"), the kernel's IDT base was consumed by KVM only when setting
      constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE.  Because constant
      host state is done once per vCPU, there was ostensibly no need to cache
      the kernel's IDT base.
      
      When support for "ack interrupt on exit" was introduced, KVM added a
      second consumer of the IDT base as handling already-acked interrupts
      requires directly calling the interrupt handler, i.e. KVM uses the IDT
      base to find the address of the handler.  Because interrupts are a fast
      path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE.
      Presumably, the IDT base was cached on a per-vCPU basis simply because
      the existing code grabbed the IDT base on a per-vCPU (VMCS) basis.
      
      Note, all post-boot IDTs use the same handlers for external interrupts,
      i.e. the "ack interrupt on exit" use of the IDT base would be unaffected
      even if the cached IDT somehow did not match the current IDT.  And as
      for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the
      above analysis is wrong then KVM has had a bug since the beginning of
      time since KVM has effectively been caching the IDT at vCPU creation
      since commit a8b732ca01c ("[PATCH] kvm: userspace interface").
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2342080c
    • Sean Christopherson's avatar
      KVM: VMX: Read cached VM-Exit reason to detect external interrupt · 49def500
      Sean Christopherson authored
      Generic x86 code invokes the kvm_x86_ops external interrupt handler on
      all VM-Exits regardless of the actual exit type.  Use the already-cached
      EXIT_REASON to determine if the VM-Exit was due to an interrupt, thus
      avoiding an extra VMREAD (to query VM_EXIT_INTR_INFO) for all other
      types of VM-Exit.
      
      In addition to avoiding the extra VMREAD, checking the EXIT_REASON
      instead of VM_EXIT_INTR_INFO makes it more obvious that
      vmx_handle_external_intr() is called for all VM-Exits, e.g. someone
      unfamiliar with the flow might wonder under what condition(s)
      VM_EXIT_INTR_INFO does not contain a valid interrupt, which is
      simply not possible since KVM always runs with "ack interrupt on exit".
      
      WARN once if VM_EXIT_INTR_INFO doesn't contain a valid interrupt on
      an EXTERNAL_INTERRUPT VM-Exit, as such a condition would indicate a
      hardware bug.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      49def500
    • Paolo Bonzini's avatar
      kvm: nVMX: small cleanup in handle_exception · 2ea72039
      Paolo Bonzini authored
      The reason for skipping handling of NMI and #MC in handle_exception is
      the same, namely they are handled earlier by vmx_complete_atomic_exit.
      Calling the machine check handler (which just returns 1) is misleading,
      don't do it.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2ea72039
    • Sean Christopherson's avatar
      KVM: VMX: Fix handling of #MC that occurs during VM-Entry · beb8d93b
      Sean Christopherson authored
      A previous fix to prevent KVM from consuming stale VMCS state after a
      failed VM-Entry inadvertantly blocked KVM's handling of machine checks
      that occur during VM-Entry.
      
      Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways,
      depending on when the #MC is recognoized.  As it pertains to this bug
      fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY
      is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to
      indicate the VM-Entry failed.
      
      If a machine-check event occurs during a VM entry, one of the following occurs:
       - The machine-check event is handled as if it occurred before the VM entry:
              ...
       - The machine-check event is handled after VM entry completes:
              ...
       - A VM-entry failure occurs as described in Section 26.7. The basic
         exit reason is 41, for "VM-entry failure due to machine-check event".
      
      Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in
      vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit().
      Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY
      in a sane fashion and also simplifies vmx_complete_atomic_exit() since
      VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh.
      
      Fixes: b060ca3b ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      beb8d93b
    • Paolo Bonzini's avatar
      KVM: x86: move MSR_IA32_POWER_CTL handling to common code · 73f624f4
      Paolo Bonzini authored
      Make it available to AMD hosts as well, just in case someone is trying
      to use an Intel processor's CPUID setup.
      Suggested-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      73f624f4
    • Wei Yang's avatar
      kvm: x86: offset is ensure to be in range · 4cb8b116
      Wei Yang authored
      In function apic_mmio_write(), the offset has been checked in:
      
         * apic_mmio_in_range()
         * offset & 0xf
      
      These two ensures offset is in range [0x010, 0xff0].
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4cb8b116
    • Wei Yang's avatar
      kvm: x86: use same convention to name kvm_lapic_{set,clear}_vector() · ee171d2f
      Wei Yang authored
      apic_clear_vector() is the counterpart of kvm_lapic_set_vector(),
      while they have different naming convention.
      
      Rename it and move together to arch/x86/kvm/lapic.h. Also fix one typo
      in comment by hand.
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ee171d2f
    • Wei Yang's avatar
      kvm: x86: check kvm_apic_sw_enabled() is enough · 7d2296bf
      Wei Yang authored
      On delivering irq to apic, we iterate on vcpu and do the check like
      this:
      
          kvm_apic_present(vcpu)
          kvm_lapic_enabled(vpu)
              kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic)
      
      Since we have already checked kvm_apic_present(), it is reasonable to
      replace kvm_lapic_enabled() with kvm_apic_sw_enabled().
      Signed-off-by: default avatarWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7d2296bf
    • Marcelo Tosatti's avatar
      kvm: x86: add host poll control msrs · 2d5ba19b
      Marcelo Tosatti authored
      Add an MSRs which allows the guest to disable
      host polling (specifically the cpuidle-haltpoll,
      when performing polling in the guest, disables
      host side polling).
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2d5ba19b
    • Eugene Korenevsky's avatar
      kvm: vmx: segment limit check: use access length · fdb28619
      Eugene Korenevsky authored
      There is an imperfection in get_vmx_mem_address(): access length is ignored
      when checking the limit. To fix this, pass access length as a function argument.
      The access length is usually obvious since it is used by callers after
      get_vmx_mem_address() call, but for vmread/vmwrite it depends on the
      state of 64-bit mode.
      Signed-off-by: default avatarEugene Korenevsky <ekorenevsky@gmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fdb28619
    • Eugene Korenevsky's avatar
      kvm: vmx: fix limit checking in get_vmx_mem_address() · c1a9acbc
      Eugene Korenevsky authored
      Intel SDM vol. 3, 5.3:
      The processor causes a
      general-protection exception (or, if the segment is SS, a stack-fault
      exception) any time an attempt is made to access the following addresses
      in a segment:
      - A byte at an offset greater than the effective limit
      - A word at an offset greater than the (effective-limit – 1)
      - A doubleword at an offset greater than the (effective-limit – 3)
      - A quadword at an offset greater than the (effective-limit – 7)
      
      Therefore, the generic limit checking error condition must be
      
      exn = (off > limit + 1 - access_len) = (off + access_len - 1 > limit)
      
      but not
      
      exn = (off + access_len > limit)
      
      as for now.
      
      Also avoid integer overflow of `off` at 32-bit KVM by casting it to u64.
      
      Note: access length is currently sizeof(u64) which is incorrect. This
      will be fixed in the subsequent patch.
      Signed-off-by: default avatarEugene Korenevsky <ekorenevsky@gmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c1a9acbc
    • Like Xu's avatar
      KVM: x86: Add Intel CPUID.1F cpuid emulation support · a87f2d3a
      Like Xu authored
      Add support to expose Intel V2 Extended Topology Enumeration Leaf for
      some new systems with multiple software-visible die within each package.
      
      Because unimplemented and unexposed leaves should be explicitly reported
      as zero, there is no need to limit cpuid.0.eax to the maximum value of
      feature configuration but limit it to the highest leaf implemented in
      the current code. A single clamping seems sufficient and cheaper.
      Co-developed-by: default avatarXiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: default avatarXiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: default avatarLike Xu <like.xu@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a87f2d3a
    • Liran Alon's avatar
      KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15 · 1fc5d194
      Liran Alon authored
      Make all code consistent with kvm_deliver_exception_payload() by using
      appropriate symbolic constant instead of hard-coded number.
      Reviewed-by: default avatarNikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: default avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1fc5d194
  2. 13 Jun, 2019 1 commit
    • Paolo Bonzini's avatar
      KVM: x86: clean up conditions for asynchronous page fault handling · 1dfdb45e
      Paolo Bonzini authored
      Even when asynchronous page fault is disabled, KVM does not want to pause
      the host if a guest triggers a page fault; instead it will put it into
      an artificial HLT state that allows running other host processes while
      allowing interrupt delivery into the guest.
      
      However, the way this feature is triggered is a bit confusing.
      First, it is not used for page faults while a nested guest is
      running: but this is not an issue since the artificial halt
      is completely invisible to the guest, either L1 or L2.  Second,
      it is used even if kvm_halt_in_guest() returns true; in this case,
      the guest probably should not pay the additional latency cost of the
      artificial halt, and thus we should handle the page fault in a
      completely synchronous way.
      
      By introducing a new function kvm_can_deliver_async_pf, this patch
      commonizes the code that chooses whether to deliver an async page fault
      (kvm_arch_async_page_not_present) and the code that chooses whether a
      page fault should be handled synchronously (kvm_can_do_async_pf).
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1dfdb45e
  3. 05 Jun, 2019 7 commits
  4. 04 Jun, 2019 12 commits
    • Andrew Jones's avatar
      kvm: selftests: ucall improvements · 2c7c5d3d
      Andrew Jones authored
      Make sure we complete the I/O after determining we have a ucall,
      which is I/O. Also allow the *uc parameter to optionally be NULL.
      It's quite possible that a test case will only care about the
      return value, like for example when looping on a check for
      UCALL_DONE.
      Signed-off-by: default avatarAndrew Jones <drjones@redhat.com>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2c7c5d3d
    • Wanpeng Li's avatar
      KVM: X86: Emulate MSR_IA32_MISC_ENABLE MWAIT bit · 511a8556
      Wanpeng Li authored
      MSR IA32_MISC_ENABLE bit 18, according to SDM:
      
      | When this bit is set to 0, the MONITOR feature flag is not set (CPUID.01H:ECX[bit 3] = 0).
      | This indicates that MONITOR/MWAIT are not supported.
      |
      | Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 0.
      |
      | When this bit is set to 1 (default), MONITOR/MWAIT are supported (CPUID.01H:ECX[bit 3] = 1).
      
      The CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR bit,
      CPUID.01H:ECX[bit 3] is a better guard than kvm_mwait_in_guest().
      kvm_mwait_in_guest() affects the behavior of MONITOR/MWAIT, not its
      guest visibility.
      
      This patch implements toggling of the CPUID bit based on guest writes
      to the MSR.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      [Fixes for backwards compatibility - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      511a8556
    • Wanpeng Li's avatar
      KVM: X86: Provide a capability to disable cstate msr read intercepts · b5170063
      Wanpeng Li authored
      Allow guest reads CORE cstate when exposing host CPU power management capabilities
      to the guest. PKG cstate is restricted to avoid a guest to get the whole package
      information in multi-tenant scenario.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b5170063
    • Wanpeng Li's avatar
      KVM: Documentation: Add disable pause exits to KVM_CAP_X86_DISABLE_EXITS · 8ffdaa7f
      Wanpeng Li authored
      Commit b31c114b (KVM: X86: Provide a capability to disable PAUSE intercepts)
      forgot to add the KVM_X86_DISABLE_EXITS_PAUSE into api doc. This patch adds
      it.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8ffdaa7f
    • Xiaoyao Li's avatar
      kvm: x86: refine kvm_get_arch_capabilities() · 4d22c17c
      Xiaoyao Li authored
      1. Using X86_FEATURE_ARCH_CAPABILITIES to enumerate the existence of
      MSR_IA32_ARCH_CAPABILITIES to avoid using rdmsrl_safe().
      
      2. Since kvm_get_arch_capabilities() is only used in this file, making
      it static.
      Signed-off-by: default avatarXiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4d22c17c
    • Sean Christopherson's avatar
      KVM: Directly return result from kvm_arch_check_processor_compat() · f257d6dc
      Sean Christopherson authored
      Add a wrapper to invoke kvm_arch_check_processor_compat() so that the
      boilerplate ugliness of checking virtualization support on all CPUs is
      hidden from the arch specific code.  x86's implementation in particular
      is quite heinous, as it unnecessarily propagates the out-param pattern
      into kvm_x86_ops.
      
      While the x86 specific issue could be resolved solely by changing
      kvm_x86_ops, make the change for all architectures as returning a value
      directly is prettier and technically more robust, e.g. s390 doesn't set
      the out param, which could lead to subtle breakage in the (highly
      unlikely) scenario where the out-param was not pre-initialized by the
      caller.
      
      Opportunistically annotate svm_check_processor_compat() with __init.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: default avatarCornelia Huck <cohuck@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f257d6dc
    • Suthikulpanit, Suravee's avatar
      kvm: svm/avic: Do not send AVIC doorbell to self · 0532dd52
      Suthikulpanit, Suravee authored
      AVIC doorbell is used to notify a running vCPU that interrupts
      has been injected into the vCPU AVIC backing page. Current logic
      checks only if a VCPU is running before sending a doorbell.
      However, the doorbell is not necessary if the destination
      CPU is itself.
      
      Add logic to check currently running CPU before sending doorbell.
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Reviewed-by: default avatarAlexander Graf <graf@amazon.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0532dd52
    • Wanpeng Li's avatar
      KVM: LAPIC: Optimize timer latency further · b6c4bc65
      Wanpeng Li authored
      Advance lapic timer tries to hidden the hypervisor overhead between the
      host emulated timer fires and the guest awares the timer is fired. However,
      it just hidden the time between apic_timer_fn/handle_preemption_timer ->
      wait_lapic_expire, instead of the real position of vmentry which is
      mentioned in the orignial commit d0659d94 ("KVM: x86: add option to
      advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
      the end of wait_lapic_expire and before world switch on my haswell desktop.
      
      This patch tries to narrow the last gap(wait_lapic_expire -> world switch),
      it takes the real overhead time between apic_timer_fn/handle_preemption_timer
      and before world switch into consideration when adaptively tuning timer
      advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles
      on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b6c4bc65
    • Wanpeng Li's avatar
      KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexit · ec0671d5
      Wanpeng Li authored
      wait_lapic_expire() call was moved above guest_enter_irqoff() because of
      its tracepoint, which violated the RCU extended quiescent state invoked
      by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
      below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
      VM-Enter, but trace it after VM-Exit. This can help us to move
      wait_lapic_expire() just before vmentry in the later patch.
      
      [1] Commit 8b89fe1f ("kvm: x86: move tracepoints outside extended quiescent state")
      [2] https://patchwork.kernel.org/patch/7821111/
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Suggested-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      [Track whether wait_lapic_expire was called, and do not invoke the tracepoint
       if not. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ec0671d5
    • Wanpeng Li's avatar
      KVM: LAPIC: Extract adaptive tune timer advancement logic · 84ea3aca
      Wanpeng Li authored
      Extract adaptive tune timer advancement logic to a single function.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      [Rename new function. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      84ea3aca
    • Vitaly Kuznetsov's avatar
      KVM/nSVM: properly map nested VMCB · 8f38302c
      Vitaly Kuznetsov authored
      Commit 8c5fbf1a ("KVM/nSVM: Use the new mapping API for mapping guest
      memory") broke nested SVM completely: kvm_vcpu_map()'s second parameter is
      GFN so vmcb_gpa needs to be converted with gpa_to_gfn(), not the other way
      around.
      
      Fixes: 8c5fbf1a ("KVM/nSVM: Use the new mapping API for mapping guest memory")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8f38302c
    • Kai Huang's avatar
      kvm: x86: Fix reserved bits related calculation errors caused by MKTME · f3ecb59d
      Kai Huang authored
      Intel MKTME repurposes several high bits of physical address as 'keyID'
      for memory encryption thus effectively reduces platform's maximum
      physical address bits. Exactly how many bits are reduced is configured
      by BIOS. To honor such HW behavior, the repurposed bits are reduced from
      cpuinfo_x86->x86_phys_bits when MKTME is detected in CPU detection.
      Similarly, AMD SME/SEV also reduces physical address bits for memory
      encryption, and cpuinfo->x86_phys_bits is reduced too when SME/SEV is
      detected, so for both MKTME and SME/SEV, boot_cpu_data.x86_phys_bits
      doesn't hold physical address bits reported by CPUID anymore.
      
      Currently KVM treats bits from boot_cpu_data.x86_phys_bits to 51 as
      reserved bits, but it's not true anymore for MKTME, since MKTME treats
      those reduced bits as 'keyID', but not reserved bits. Therefore
      boot_cpu_data.x86_phys_bits cannot be used to calculate reserved bits
      anymore, although we can still use it for AMD SME/SEV since SME/SEV
      treats the reduced bits differently -- they are treated as reserved
      bits, the same as other reserved bits in page table entity [1].
      
      Fix by introducing a new 'shadow_phys_bits' variable in KVM x86 MMU code
      to store the effective physical bits w/o reserved bits -- for MKTME,
      it equals to physical address reported by CPUID, and for SME/SEV, it is
      boot_cpu_data.x86_phys_bits.
      
      Note that for the physical address bits reported to guest should remain
      unchanged -- KVM should report physical address reported by CPUID to
      guest, but not boot_cpu_data.x86_phys_bits. Because for Intel MKTME,
      there's no harm if guest sets up 'keyID' bits in guest page table (since
      MKTME only works at physical address level), and KVM doesn't even expose
      MKTME to guest. Arguably, for AMD SME/SEV, guest is aware of SEV thus it
      should adjust boot_cpu_data.x86_phys_bits when it detects SEV, therefore
      KVM should still reports physcial address reported by CPUID to guest.
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarKai Huang <kai.huang@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f3ecb59d