1. 18 Jun, 2019 21 commits
    • KVM: nVMX: Add helpers to identify shadowed VMCS fields · e2174295
      Sean Christopherson authored
      So that future optimizations related to shadowed fields don't need to
      define their own switch statement.
      
      Add a BUILD_BUG_ON() to ensure at least one of the types (RW vs RO) is
      defined when including vmcs_shadow_fields.h (guess who keeps mistyping
      SHADOW_FIELD_RO as SHADOW_FIELD_R0).
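
      As a rough illustration of the kind of helper this enables, the sketch
      below generates "is this field shadowed?" predicates from a single field
      list. It is an assumption-laden simplification: the SHADOW_FIELDS() list
      macro, the EMIT_* macros and the three-entry field subset are
      hypothetical, whereas the commit describes a header
      (vmcs_shadow_fields.h) that is included with SHADOW_FIELD_RO and/or
      SHADOW_FIELD_RW defined.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative subset of VMCS field encodings. */
enum {
    VM_EXIT_REASON = 0x4402,
    GUEST_CR3      = 0x6802, /* deliberately not shadowed in this sketch */
    GUEST_RSP      = 0x681c,
    GUEST_RIP      = 0x681e,
};

/* Hypothetical single-macro field list: RO() tags read-only fields, RW() read/write. */
#define SHADOW_FIELDS(RO, RW) \
    RO(VM_EXIT_REASON)        \
    RW(GUEST_RSP)             \
    RW(GUEST_RIP)

#define EMIT_CASE(field) case field:
#define EMIT_NOTHING(field)

/* True if the field is shadowed for both VMREAD and VMWRITE. */
static bool is_shadow_field_rw(unsigned long field)
{
    switch (field) {
    SHADOW_FIELDS(EMIT_NOTHING, EMIT_CASE)
        return true;
    default:
        return false;
    }
}

/* True if the field is shadowed for VMREAD only. */
static bool is_shadow_field_ro(unsigned long field)
{
    switch (field) {
    SHADOW_FIELDS(EMIT_CASE, EMIT_NOTHING)
        return true;
    default:
        return false;
    }
}
```

      Because both predicates expand the same list, adding a field in one
      place updates every switch statement built from it.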
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Use descriptive names for VMCS sync functions and flags · 3731905e
      Sean Christopherson authored
      Nested virtualization involves copying data between many different types
      of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS.  Rename a variety
      of functions and flags to document both the source and destination of
      each sync.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Lift sync_vmcs12() out of prepare_vmcs12() · f4f8316d
      Sean Christopherson authored
      ... to make it more obvious that sync_vmcs12() is invoked on all nested
      VM-Exits, e.g. hiding sync_vmcs12() in prepare_vmcs12() makes it appear
      that guest state is NOT propagated to vmcs12 for a normal VM-Exit.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Track vmcs12 offsets for shadowed VMCS fields · 1c6f0b47
      Sean Christopherson authored
      The vmcs12 field offsets are constant and known at compile time.  Store
      the associated offset for each shadowed field to avoid the costly lookup
      in vmcs_field_to_offset() when copying between vmcs12 and the shadow
      VMCS.  Avoiding the costly lookup reduces the latency of copying by
      ~100 cycles in each direction.
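
      A minimal sketch of the idea, assuming a heavily trimmed stand-in for
      struct vmcs12: each table entry pairs a field encoding with an offset
      computed by offsetof() at compile time, so the copy loop never has to
      translate an encoding into an offset at run time. The struct layout, the
      SHADOW_FIELD() macro and read_shadowed() are hypothetical names used
      only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed stand-in for struct vmcs12 (all fields widened to 64 bits). */
struct vmcs12_sketch {
    uint64_t vm_exit_reason;
    uint64_t guest_rsp;
    uint64_t guest_rip;
};

/* One entry per shadowed field: the VMCS encoding plus its vmcs12 offset. */
struct shadow_field {
    uint16_t encoding;
    uint16_t offset;
};

static_assert(sizeof(struct vmcs12_sketch) <= UINT16_MAX,
              "offsets must fit in 16 bits");

#define SHADOW_FIELD(enc, member) \
    { (enc), offsetof(struct vmcs12_sketch, member) }

static const struct shadow_field shadow_fields[] = {
    SHADOW_FIELD(0x4402, vm_exit_reason),
    SHADOW_FIELD(0x681c, guest_rsp),
    SHADOW_FIELD(0x681e, guest_rip),
};

/* Read the i-th shadowed field straight from its precomputed offset. */
static uint64_t read_shadowed(const struct vmcs12_sketch *v, size_t i)
{
    uint64_t val;

    memcpy(&val, (const char *)v + shadow_fields[i].offset, sizeof(val));
    return val;
}
```

      A real implementation would also have to honor each field's width; the
      sketch sidesteps that by making every member 64 bits wide.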
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Intercept VMWRITEs to GUEST_{CS,SS}_AR_BYTES · b6437805
      Sean Christopherson authored
      VMMs frequently read the guest's CS and SS AR bytes to detect 64-bit
      mode and CPL respectively, but effectively never write said fields once
      the VM is initialized.  Intercepting VMWRITEs for the two fields saves
      ~55 cycles in copy_shadow_to_vmcs12().
      
      Because some Intel CPUs, e.g. Haswell, drop the reserved bits of the
      guest access rights fields on VMWRITE, exposing the fields to L1 for
      VMREAD but not VMWRITE leads to inconsistent behavior between L1 and L2.
      On hardware that drops the bits, L1 will see the stripped down value due
      to reading the value from hardware, while L2 will see the full original
      value as stored by KVM.  To avoid such an inconsistency, emulate the
      behavior on all CPUs, but only for intercepted VMWRITEs so as to avoid
      introducing pointless latency into copy_shadow_to_vmcs12(), e.g. if the
      emulation were added to vmcs12_write_any().
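
      The masking itself can be sketched as follows, assuming the access-rights
      layout documented in the SDM, where bits 11:8 and 31:17 are reserved.
      The macro and function names are hypothetical, and this is not
      necessarily how the patch spells the mask.

```c
#include <assert.h>
#include <stdint.h>

/* Reserved bits in a VMCS segment access-rights field. */
#define AR_RESERVED_LOW   0x00000f00u  /* bits 11:8  */
#define AR_RESERVED_HIGH  0xfffe0000u  /* bits 31:17 */
#define AR_VALID_MASK     (~(AR_RESERVED_LOW | AR_RESERVED_HIGH))

static_assert(AR_VALID_MASK == 0x1f0ffu, "unexpected access-rights mask");

/* Mimic hardware that drops the reserved bits when the field is written. */
static uint32_t ar_bytes_drop_reserved(uint32_t ar)
{
    return ar & AR_VALID_MASK;
}
```

      Applying the helper only on the intercepted VMWRITE path keeps the
      extra work out of the shadow-to-vmcs12 copy loop.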
      
      Since the AR_BYTES emulation is done only for intercepted VMWRITE, if a
      future patch (re)exposed AR_BYTES for both VMWRITE and VMREAD, then KVM
      would end up with inconsistent behavior on pre-Haswell hardware, e.g. KVM
      would drop the reserved bits on intercepted VMWRITE, but direct VMWRITE
      to the shadow VMCS would not drop the bits.  Add a WARN in the shadow
      field initialization to detect any attempt to expose an AR_BYTES field
      without updating vmcs12_write_any().
      
      Note, emulation of the AR_BYTES reserved bit behavior is based on a
      patch[1] from Jim Mattson that applied the emulation to all writes to
      vmcs12 so that live migration across different generations of hardware
      would not introduce divergent behavior.  But given that live migration
      of nested state has already been enabled, that ship has sailed (not to
      mention that no sane VMM will be affected by this behavior).
      
      [1] https://patchwork.kernel.org/patch/10483321/
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Intercept VMWRITEs to read-only shadow VMCS fields · fadcead0
      Sean Christopherson authored
      Allowing L1 to VMWRITE read-only fields is only beneficial in a double
      nesting scenario, e.g. no sane VMM will VMWRITE VM_EXIT_REASON in normal
      non-nested operation.  Intercepting RO fields means KVM doesn't need to
      sync them from the shadow VMCS to vmcs12 when running L2.  The obvious
      downside is that L1 will VM-Exit more often when running L3, but it's
      likely safe to assume most folks would happily sacrifice a bit of L3
      performance, which may not even be noticeable in the grand scheme, to
      improve L2 performance across the board.
      
      Not intercepting fields tagged read-only also allows for additional
      optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO
      since those fields are rarely written by VMMs, but read frequently.
      
      When utilizing a shadow VMCS with asymmetric R/W and R/O bitmaps, fields
      that cause VM-Exit on VMWRITE but not VMREAD need to be propagated to
      the shadow VMCS during VMWRITE emulation, otherwise a subsequent VMREAD
      from L1 will consume a stale value.
      
      Note, KVM currently utilizes asymmetric bitmaps when "VMWRITE any field"
      is not exposed to L1, but only so that it can reject the VMWRITE, i.e.
      propagating the VMWRITE to the shadow VMCS is a new requirement, not a
      bug fix.
      
      Eliminating the copying of RO fields reduces the latency of nested
      VM-Entry (copy_shadow_to_vmcs12()) by ~100 cycles (plus 40-50 cycles
      if/when the AR_BYTES fields are exposed RO).
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn · 95b5a48c
      Sean Christopherson authored
      Per commit 1b6269db ("KVM: VMX: Handle NMIs before enabling
      interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
      to "make sure we handle NMI on the current cpu, and that we don't
      service maskable interrupts before non-maskable ones".  The other
      exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
      have similar requirements, and are located there to avoid extra VMREADs
      since VMX bins hardware exceptions and NMIs into a single exit reason.
      
      Clean up the code and eliminate the vaguely named complete_atomic_exit()
      by moving the interrupts-disabled exception and NMI handling into the
      existing handle_external_intrs() callback, and rename the callback to
      a more appropriate name.  Rename VMexit handlers throughout so that the
      atomic and non-atomic counterparts have similar names.
      
      In addition to improving code readability, this also ensures the NMI
      handler is run with the host's debug registers loaded in the unlikely
      event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
      field is also improved when handling NMIs (and #MCs) as the handler
      will run after updating said field.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Naming cleanups. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code · 165072b0
      Sean Christopherson authored
      VMX can conditionally call kvm_{before,after}_interrupt() since KVM
      always uses "ack interrupt on exit" and therefore explicitly handles
      interrupts as opposed to blindly enabling irqs.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Store the host kernel's IDT base in a global variable · 2342080c
      Sean Christopherson authored
      Although the kernel may use multiple IDTs, KVM should only ever see the
      "real" IDT, e.g. the early init IDT is long gone by the time KVM runs
      and the debug stack IDT is only used for small windows of time in very
      specific flows.
      
      Before commit a547c6db ("KVM: VMX: Enable acknowledge interupt on
      vmexit"), the kernel's IDT base was consumed by KVM only when setting
      constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE.  Because constant
      host state is done once per vCPU, there was ostensibly no need to cache
      the kernel's IDT base.
      
      When support for "ack interrupt on exit" was introduced, KVM added a
      second consumer of the IDT base as handling already-acked interrupts
      requires directly calling the interrupt handler, i.e. KVM uses the IDT
      base to find the address of the handler.  Because interrupts are a fast
      path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE.
      Presumably, the IDT base was cached on a per-vCPU basis simply because
      the existing code grabbed the IDT base on a per-vCPU (VMCS) basis.
      
      Note, all post-boot IDTs use the same handlers for external interrupts,
      i.e. the "ack interrupt on exit" use of the IDT base would be unaffected
      even if the cached IDT somehow did not match the current IDT.  And as
      for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the
      above analysis is wrong then KVM has had a bug since the beginning of
      time since KVM has effectively been caching the IDT at vCPU creation
      since commit a8b732ca01c ("[PATCH] kvm: userspace interface").
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Read cached VM-Exit reason to detect external interrupt · 49def500
      Sean Christopherson authored
      Generic x86 code invokes the kvm_x86_ops external interrupt handler on
      all VM-Exits regardless of the actual exit type.  Use the already-cached
      EXIT_REASON to determine if the VM-Exit was due to an interrupt, thus
      avoiding an extra VMREAD (to query VM_EXIT_INTR_INFO) for all other
      types of VM-Exit.
      
      In addition to avoiding the extra VMREAD, checking the EXIT_REASON
      instead of VM_EXIT_INTR_INFO makes it more obvious that
      vmx_handle_external_intr() is called for all VM-Exits, e.g. someone
      unfamiliar with the flow might wonder under what condition(s)
      VM_EXIT_INTR_INFO does not contain a valid interrupt, which is
      simply not possible since KVM always runs with "ack interrupt on exit".
      
      WARN once if VM_EXIT_INTR_INFO doesn't contain a valid interrupt on
      an EXTERNAL_INTERRUPT VM-Exit, as such a condition would indicate a
      hardware bug.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: small cleanup in handle_exception · 2ea72039
      Paolo Bonzini authored
      The reason for skipping handling of NMI and #MC in handle_exception is
      the same, namely they are handled earlier by vmx_complete_atomic_exit.
      Calling the machine check handler (which just returns 1) is misleading;
      don't do it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Fix handling of #MC that occurs during VM-Entry · beb8d93b
      Sean Christopherson authored
      A previous fix to prevent KVM from consuming stale VMCS state after a
      failed VM-Entry inadvertently blocked KVM's handling of machine checks
      that occur during VM-Entry.
      
      Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways,
      depending on when the #MC is recognized.  As it pertains to this bug
      fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY
      is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to
      indicate the VM-Entry failed.
      
      If a machine-check event occurs during a VM entry, one of the following occurs:
       - The machine-check event is handled as if it occurred before the VM entry:
              ...
       - The machine-check event is handled after VM entry completes:
              ...
       - A VM-entry failure occurs as described in Section 26.7. The basic
         exit reason is 41, for "VM-entry failure due to machine-check event".
      
      Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in
      vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit().
      Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY
      in a sane fashion and also simplifies vmx_complete_atomic_exit() since
      VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh.
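
      To make the bit layout concrete, here is a small sketch of how such an
      exit reason can be decoded, assuming the 32-bit exit-reason format
      described above (basic reason in the low 16 bits, bit 31 set on a failed
      VM-Entry). The helper names are hypothetical; only the two constants
      correspond to values quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXIT_REASON_MCE_DURING_VMENTRY   41u
#define VMX_EXIT_REASONS_FAILED_VMENTRY  0x80000000u  /* bit 31 */

/* The basic exit reason occupies the low 16 bits of the field. */
static uint16_t basic_exit_reason(uint32_t exit_reason)
{
    return (uint16_t)(exit_reason & 0xffffu);
}

/* Bit 31 flags a VM-Exit that occurred during VM-Entry. */
static bool vmentry_failed(uint32_t exit_reason)
{
    return (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) != 0;
}

/* Compare only the basic reason so that bit 31 does not hide the #MC. */
static bool is_mce_during_vmentry(uint32_t exit_reason)
{
    return basic_exit_reason(exit_reason) == EXIT_REASON_MCE_DURING_VMENTRY;
}
```

      Checking the basic reason first lets the caller deal with the machine
      check before any generic "failed VM-Entry" bail-out is taken.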
      
      Fixes: b060ca3b ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: move MSR_IA32_POWER_CTL handling to common code · 73f624f4
      Paolo Bonzini authored
      Make it available to AMD hosts as well, just in case someone is trying
      to use an Intel processor's CPUID setup.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: offset is ensure to be in range · 4cb8b116
      Wei Yang authored
      In function apic_mmio_write(), the offset has been checked in:
      
         * apic_mmio_in_range()
         * offset & 0xf
      
      These two checks ensure the offset is in the range [0x010, 0xff0].
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: use same convention to name kvm_lapic_{set,clear}_vector() · ee171d2f
      Wei Yang authored
      apic_clear_vector() is the counterpart of kvm_lapic_set_vector(),
      but the two follow different naming conventions.
      
      Rename it and move the pair together to arch/x86/kvm/lapic.h. Also fix
      one typo in a comment by hand.
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: check kvm_apic_sw_enabled() is enough · 7d2296bf
      Wei Yang authored
      When delivering an irq to an APIC, we iterate over the vCPUs and
      perform the following checks:
      
          kvm_apic_present(vcpu)
          kvm_lapic_enabled(vcpu)
              kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic)
      
      Since we have already checked kvm_apic_present(), it is reasonable to
      replace kvm_lapic_enabled() with kvm_apic_sw_enabled().
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: add host poll control msrs · 2d5ba19b
      Marcelo Tosatti authored
      Add an MSR which allows the guest to disable
      host polling (specifically, the cpuidle-haltpoll
      driver disables host-side polling when it performs
      polling in the guest).
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: segment limit check: use access length · fdb28619
      Eugene Korenevsky authored
      There is an imperfection in get_vmx_mem_address(): access length is ignored
      when checking the limit. To fix this, pass access length as a function argument.
      The access length is usually obvious since it is used by callers after
      get_vmx_mem_address() call, but for vmread/vmwrite it depends on the
      state of 64-bit mode.
      Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: fix limit checking in get_vmx_mem_address() · c1a9acbc
      Eugene Korenevsky authored
      Intel SDM vol. 3, 5.3:
      The processor causes a general-protection exception (or, if the segment
      is SS, a stack-fault exception) any time an attempt is made to access
      the following addresses in a segment:
      - A byte at an offset greater than the effective limit
      - A word at an offset greater than the (effective-limit – 1)
      - A doubleword at an offset greater than the (effective-limit – 3)
      - A quadword at an offset greater than the (effective-limit – 7)
      
      Therefore, the generic limit checking error condition must be
      
      exn = (off > limit + 1 - access_len) = (off + access_len - 1 > limit)
      
      but not
      
      exn = (off + access_len > limit)
      
      as it is now.
      
      Also avoid integer overflow of `off` on 32-bit KVM by casting it to u64.
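
      The difference between the two conditions can be checked with a small
      standalone sketch. The function names and the 32-bit parameter types are
      assumptions chosen for illustration; only the two expressions come from
      the formulas above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Corrected check: the last byte touched is off + access_len - 1, and the
 * access faults only if that byte lies beyond the limit. Widening to 64
 * bits keeps the sum from wrapping. access_len must be at least 1.
 */
static bool limit_exceeded(uint32_t off, uint32_t limit, unsigned int access_len)
{
    return (uint64_t)off + access_len - 1 > limit;
}

/* Previous check, kept for comparison: off by one, and wraps in 32 bits. */
static bool limit_exceeded_old(uint32_t off, uint32_t limit, unsigned int access_len)
{
    return off + access_len > limit;
}
```

      With a limit of 0xfff, a one-byte access at offset 0xfff is legal under
      the corrected check but is rejected by the old one, and an eight-byte
      access at offset 0xffffffff is caught only by the widened arithmetic.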
      
      Note: access length is currently sizeof(u64) which is incorrect. This
      will be fixed in the subsequent patch.
      Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add Intel CPUID.1F cpuid emulation support · a87f2d3a
      Like Xu authored
      Add support to expose Intel V2 Extended Topology Enumeration Leaf for
      some new systems with multiple software-visible dies within each package.
      
      Because unimplemented and unexposed leaves should be explicitly reported
      as zero, there is no need to limit cpuid.0.eax to the maximum value of
      the feature configuration; limiting it to the highest leaf implemented
      in the current code is enough. A single clamp seems sufficient and cheaper.
      Co-developed-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15 · 1fc5d194
      Liran Alon authored
      Make all code consistent with kvm_deliver_exception_payload() by using
      appropriate symbolic constant instead of hard-coded number.
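
      For reference, a short sketch of why the two spellings are equivalent,
      assuming the usual x86 debug-register definitions in which DR_TRAP0
      through DR_TRAP3 are the four low bits of DR6. The helper name
      dr6_clear_trap_bits() is hypothetical.

```c
#include <assert.h>

/* Per-breakpoint trap bits in DR6. */
#define DR_TRAP0  0x1
#define DR_TRAP1  0x2
#define DR_TRAP2  0x4
#define DR_TRAP3  0x8
#define DR_TRAP_BITS  (DR_TRAP0 | DR_TRAP1 | DR_TRAP2 | DR_TRAP3)

/* The symbolic constant is exactly the hard-coded 15 it replaces. */
static_assert(DR_TRAP_BITS == 15, "DR_TRAP_BITS must equal 15");

/* Clear the four breakpoint-hit bits while leaving the rest of DR6 intact. */
static unsigned long dr6_clear_trap_bits(unsigned long dr6)
{
    return dr6 & ~(unsigned long)DR_TRAP_BITS;
}
```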
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 13 Jun, 2019 1 commit
    • KVM: x86: clean up conditions for asynchronous page fault handling · 1dfdb45e
      Paolo Bonzini authored
      Even when asynchronous page fault is disabled, KVM does not want to pause
      the host if a guest triggers a page fault; instead it will put it into
      an artificial HLT state that allows running other host processes while
      allowing interrupt delivery into the guest.
      
      However, the way this feature is triggered is a bit confusing.
      First, it is not used for page faults while a nested guest is
      running, but this is not an issue since the artificial halt
      is completely invisible to the guest, either L1 or L2.  Second,
      it is used even if kvm_halt_in_guest() returns true; in this case,
      the guest probably should not pay the additional latency cost of the
      artificial halt, and thus we should handle the page fault in a
      completely synchronous way.
      
      By introducing a new function kvm_can_deliver_async_pf, this patch
      commonizes the code that chooses whether to deliver an async page fault
      (kvm_arch_async_page_not_present) and the code that chooses whether a
      page fault should be handled synchronously (kvm_can_do_async_pf).
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 05 Jun, 2019 7 commits
  4. 04 Jun, 2019 11 commits