20 Feb, 2019 (40 commits)
    • x86: kvmguest: use TSC clocksource if invariant TSC is exposed · 7539b174
      Marcelo Tosatti authored
      The invariant TSC bit has the following meaning:
      
      "The time stamp counter in newer processors may support an enhancement,
      referred to as invariant TSC. Processor's support for invariant TSC
      is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run
      at a constant rate in all ACPI P-, C-, and T-states. This is the
      architectural behavior moving forward. On processors with invariant TSC
      support, the OS may use the TSC for wall clock timer services (instead
      of ACPI or HPET timers). TSC reads are much more efficient and do not
      incur the overhead associated with a ring transition or access to a
      platform resource."
      
      IOW, the TSC does not change frequency. In such a case, and with
      TSC scaling hardware available to handle migration, it is possible
      to use the TSC clocksource directly, whose system calls are
      faster.
      
      Reduce the rating of kvmclock clocksource to allow TSC clocksource
      to be the default if invariant TSC is exposed.
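
      As a rough sketch of the idea (the exact set of feature checks here is an
      assumption, not necessarily the literal patch), kvmclock drops its rating
      just below the TSC clocksource's default of 300 when the conditions hold:

        if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) &&
            boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
            boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
            !check_tsc_unstable())
                kvm_clock.rating = 299;  /* let the TSC clocksource (300) win */
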
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      
      v2: Use feature bits and tsc_unstable() check (Sean Christopherson)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5
      Nir Weiner authored
      grow_halt_poll_ns() has strange behaviour when
      (vcpu->halt_poll_ns != 0) &&
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      
      In this case, vcpu->halt_poll_ns will be multiplied by the grow factor
      (halt_poll_ns_grow), which requires several grow iterations in order
      to reach a value bigger than halt_poll_ns_grow_start.
      This means that growing vcpu->halt_poll_ns from a value of 0 is faster
      than growing it from a positive value less than halt_poll_ns_grow_start,
      which is misleading and inaccurate.
      
      Fix the issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
      to halt_poll_ns_grow_start whenever
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start),
      regardless of whether vcpu->halt_poll_ns is 0.
      
      Use READ_ONCE() to get a consistent value in all cases.
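
      A simplified sketch of the resulting grow logic, combining the change here
      with the two halt-polling patches below (the module-parameter start value
      and the never-shrink rule); tracing and clamping details are omitted:

        static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
                unsigned int val = vcpu->halt_poll_ns;
                unsigned int grow = READ_ONCE(halt_poll_ns_grow);
                unsigned int grow_start = READ_ONCE(halt_poll_ns_grow_start);

                /* A grow factor of zero means "do not grow"; never shrink here. */
                if (!grow)
                        return;

                val *= grow;

                /* Never grow from a value below the configured start value. */
                if (val < grow_start)
                        val = grow_start;

                vcpu->halt_poll_ns = val;
        }
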
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Nir Weiner authored
      The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
      start value when raising up vcpu->halt_poll_ns.
      It effectively sets the timeout of the first polling session.
      This value has a significant effect on how tolerant we are to outliers.
      In the common case, a higher value is better - we will spend more time
      in the polling busyloop, handle events/interrupts faster and get
      better performance.
      But for outliers it puts us in a busy loop that does nothing.
      Even if the shrink factor is zero, we will still waste time on the first
      iteration.
      The optimal value changes between different workloads. It depends on
      the outlier rate and on polling session length.
      As this value has a significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed as a module parameter.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71
      Nir Weiner authored
      grow_halt_poll_ns() has strange behaviour when
      (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).
      
      In this case, vcpu->halt_poll_ns will be set to zero.
      That results in shrinking instead of growing.
      
      Fix the issue by changing grow_halt_poll_ns() to not modify
      vcpu->halt_poll_ns when halt_poll_ns_grow is zero.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Suggested-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes() · 8ab3c471
      Sean Christopherson authored
      ...via a new helper, __kvm_mmu_zap_all().  An alternative to passing a
      'bool mmio_only' would be to pass a callback function to filter the
      shadow page, i.e. to make __kvm_mmu_zap_all() generic and reusable, but
      zapping all shadow pages is a last resort, i.e. making the helper less
      extensible is a feature of sorts.  And the explicit MMIO parameter makes
      it easy to preserve the WARN_ON_ONCE() if a restart is triggered when
      zapping MMIO sptes.
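
      A simplified sketch of the consolidated helper; it also reflects the
      invalid-root skipping, the WARN and the rescheduling described in the
      patches below, and details such as the zapped-page count are omitted:

        static void __kvm_mmu_zap_all(struct kvm *kvm, bool mmio_only)
        {
                struct kvm_mmu_page *sp, *node;
                LIST_HEAD(invalid_list);
                int ign;

                spin_lock(&kvm->mmu_lock);
      restart:
                list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
                        if (mmio_only && !sp->mmio_cached)
                                continue;
                        if (sp->role.invalid && sp->root_count)
                                continue;
                        /*
                         * Restart only if the list became unstable; that should
                         * never happen when zapping only MMIO sptes.
                         */
                        if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign)) {
                                WARN_ON_ONCE(mmio_only);
                                goto restart;
                        }
                        if (cond_resched_lock(&kvm->mmu_lock))
                                goto restart;
                }
                kvm_mmu_commit_zap_page(kvm, &invalid_list);
                spin_unlock(&kvm->mmu_lock);
        }
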
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children · 24efe61f
      Sean Christopherson authored
      Paolo expressed a concern that kvm_mmu_zap_mmio_sptes() could have a
      quadratic runtime[1], i.e. restarting the spte walk while zapping only
      MMIO sptes could result in re-walking large portions of the list over
      and over due to the non-MMIO sptes encountered before the restart not
      being removed.
      
      At the time, the concern was legitimate as the walk was restarted when
      any spte was zapped.  But that is no longer the case as the walk is now
      restarted iff one or more children have been zapped, which is necessary
      because zapping children makes the active_mmu_pages list unstable.
      
      Furthermore, it should be impossible for an MMIO spte to have children,
      i.e. zapping an MMIO spte should never result in zapping children.  In
      other words, kvm_mmu_zap_mmio_sptes() should never restart its walk, and
      so should always execute in linear time.  WARN if this assertion fails.
      
      Although it should never be needed, leave the restart logic in place.
      In normal operation, the cost is at worst an extra CMP+Jcc, and if for
      some reason the list does become unstable, not restarting would likely
      crash KVM, or worse, the kernel.
      
      [1] https://patchwork.kernel.org/patch/10756589/#22452085
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Differentiate between nr zapped and list unstable · 83cdb568
      Sean Christopherson authored
      The return value of kvm_mmu_prepare_zap_page() has evolved to become
      overloaded to convey two separate pieces of information.  1) was at
      least one page zapped and 2) has the list of MMU pages become unstable.
      
      In its original incarnation (as kvm_mmu_zap_page()), there was no
      return value at all.  Commit 07385413 ("KVM: MMU: awareness of new
      kvm_mmu_zap_page behaviour") added a return value in preparation for
      commit 4731d4c7 ("KVM: MMU: out of sync shadow core").  Although
      the return value was of type 'int', it was actually used as a boolean
      to indicate whether or not active_mmu_pages may have become unstable due
      to zapping children.  Walking a list with list_for_each_entry_safe()
      only protects against deleting/moving the current entry, i.e. zapping a
      child page would break iteration due to modifying any number of entries.
      
      Later, commit 60c8aec6 ("KVM: MMU: use page array in unsync walk")
      modified mmu_zap_unsync_children() to return an approximation of the
      number of children zapped.  This was not intentional, it was simply a
      side effect of how the code was written.
      
      The unintended side effect was then morphed into an actual feature by
      commit 77662e00 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling
      path"), which modified kvm_mmu_change_mmu_pages() to use the number of
      zapped pages when determining the number of MMU pages in use by the VM.
      
      Finally, commit 54a4f023 ("KVM: MMU: make kvm_mmu_zap_page() return
      the number of pages it actually freed") added the initial page to the
      return value to make its behavior more consistent with what most users
      would expect.  Incorporating the initial parent page in the return value
      of kvm_mmu_zap_page() breaks the original usage of restarting a list
      walk on a non-zero return value to handle a potentially unstable list,
      i.e. walks will unnecessarily restart when any page is zapped.
      
      Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e.
      return a boolean to indicate that the list may be unstable and move the
      number of zapped children to a dedicated parameter.  Since the majority
      of callers to kvm_mmu_prepare_zap_page() don't care about either return
      value, preserve the current definition of kvm_mmu_prepare_zap_page() by
      making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page().  This
      avoids having to update every call site and also provides cleaner code
      for functions that only care about the number of pages zapped.
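
      Roughly, the resulting split looks like this (a sketch; exact signatures
      may differ slightly):

        /*
         * Returns true if active_mmu_pages may have become unstable, i.e. if the
         * caller must restart its walk of the list.  The number of pages zapped,
         * including unsync children, is returned via @nr_zapped.
         */
        static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
                                               struct kvm_mmu_page *sp,
                                               struct list_head *invalid_list,
                                               int *nr_zapped);

        /* Wrapper preserving the old "number of pages zapped" return value. */
        static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
                                            struct list_head *invalid_list)
        {
                int nr_zapped;

                __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
                return nr_zapped;
        }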
      
      Fixes: 54a4f023 ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: fast invalidate all pages" · ea145aac
      Sean Christopherson authored
      Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
      from the original series[1], now that all users of the fast invalidate
      mechanism are gone.
      
      This reverts commit 5304b8d3.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes · 5d6317ca
      Sean Christopherson authored
      Call cond_resched_lock() when zapping all sptes to reschedule if needed
      or to release and reacquire mmu_lock in case of contention.  There is no
      need to flush or zap when temporarily dropping mmu_lock as zapping all
      sptes is done only when the owning userspace VMM has exited or when the
      VM is being destroyed, i.e. there is no interplay with memslots or MMIO
      generations to worry about.
      
      Be paranoid and restart the walk if mmu_lock is dropped to avoid any
      potential issues with consuming a stale iterator.  The overhead in doing
      so is negligible as at worst there will be a few root shadow pages at
      the head of the list, i.e. the iterator is essentially the head of the
      list already.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: skip over invalid root pages when zapping all sptes · 8a674adc
      Sean Christopherson authored
      ...to guarantee forward progress.  When zapped, root pages are marked
      invalid and moved to the head of the active pages list until they are
      explicitly freed.  Theoretically, having unzappable root pages at the
      head of the list could prevent kvm_mmu_zap_all() from making forward
      progress were a future patch to add a loop restart after processing a
      page, e.g. to drop mmu_lock on contention.
      
      Although kvm_mmu_prepare_zap_page() can theoretically take action on
      invalid pages, e.g. to zap unsync children, functionally it's not
      necessary (root pages will be re-zapped when freed) and practically
      speaking the odds of e.g. @unsync or @unsync_children becoming %true
      while zapping all pages is basically nil.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: x86: use the fast way to invalidate all pages" · 7390de1e
      Sean Christopherson authored
      Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all().
      Flushing all shadow entries is only done during VM teardown, i.e.
      kvm_arch_flush_shadow_all() is only called when the associated MM struct
      is being released or when the VM instance is being freed.
      
      Although the performance of teardown itself isn't critical, KVM should
      still voluntarily schedule to play nice with the rest of the kernel;
      but that can be done without the fast invalidate mechanism in a future
      patch.
      
      This reverts commit 6ca18b69.
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints" · b59c4830
      Sean Christopherson authored
      ...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
      is one part of reverting all patches from the series that introduced
      the mechanism[1].
      
      This reverts commit 2248b023.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages" · 42560fb1
      Sean Christopherson authored
      ...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
      is one part of reverting all patches from the series that introduced
      the mechanism[1].
      
      This reverts commit 35006126.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: zap pages in batch" · 43d2b14b
      Sean Christopherson authored
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      reverting all patches from the series that introduced the mechanism[1].
      
      This reverts commit e7d11c7a.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: collapse TLB flushes when zap all pages" · 210f4942
      Sean Christopherson authored
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      reverting all patches from the series that introduced the mechanism[1].
      
      This reverts commit f34d251d.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: reclaim the zapped-obsolete page first" · 52d5dedc
      Sean Christopherson authored
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      reverting all patches from the series that introduced the mechanism[1].
      
      This reverts commit 365c8868.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove is_obsolete() call · 5ff05683
      Sean Christopherson authored
      Unwinding usage of is_obsolete() is a step towards removing x86's fast
      invalidate mechanism, i.e. this is one part of reverting all patches
      from the series that introduced the mechanism[1].
      
      This is a partial revert of commit 05988d72 ("KVM: MMU: reduce
      KVM_REQ_MMU_RELOAD when root page is zapped").
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes · 571c5af0
      Sean Christopherson authored
      Call cond_resched_lock() when zapping MMIO to reschedule if needed or to
      release and reacquire mmu_lock in case of contention.  There is no need
      to flush or zap when temporarily dropping mmu_lock as zapping MMIO sptes
      is done when holding the memslots lock and with the "update in-progress"
      bit set in the memslots generation, which disables MMIO spte caching.
      The walk does need to be restarted if mmu_lock is dropped as the active
      pages list may be modified.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes" · 4771450c
      Sean Christopherson authored
      Revert back to a dedicated (and slower) mechanism for handling the
      scenario where all MMIO shadow PTEs need to be zapped due to overflowing
      the MMIO generation number.  The MMIO generation scenario is almost
      literally a one-in-a-million occurrence, i.e. is not a performance
      sensitive scenario.
      
      Restoring kvm_mmu_zap_mmio_sptes() leaves VM teardown as the only user
      of kvm_mmu_invalidate_zap_all_pages() and paves the way for removing
      the fast invalidate mechanism altogether.
      
      This reverts commit a8eca9dc.
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: document fast invalidate all pages" · a592a3b8
      Sean Christopherson authored
      Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
      from the original series[1].
      
      Though not explicitly stated, for all intents and purposes the fast
      invalidate mechanism was added to speed up the scenario where removing
      a memslot, e.g. as part of reading a PCI ROM, caused KVM to
      flush all shadow entries[1].  Now that the memslot case flushes only
      shadow entries belonging to the memslot, i.e. doesn't use the fast
      invalidate mechanism, the only remaining usages of the mechanism are
      when the VM is being destroyed and when the MMIO generation rolls
      over.
      
      When a VM is being destroyed, either there are no active vcpus, i.e.
      there's no lock contention, or the VM has ungracefully terminated, in
      which case we want to reclaim its pages as quickly as possible, i.e.
      not release the MMU lock if there are still CPUs executing in the VM.
      
      The MMIO generation scenario is almost literally a one-in-a-million
      occurrence, i.e. is not a performance sensitive scenario.
      
      Given that lock-breaking is not desirable (VM teardown) or irrelevant
      (MMIO generation overflow), remove the fast invalidate mechanism to
      simplify the code (a small amount) and to discourage future code from
      zapping all pages, as using such a big hammer should be a last resort.
      
      This reverts commit f6f8adee.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only the relevant pages when removing a memslot · 4e103134
      Sean Christopherson authored
      Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
      handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
      that actually belong to the memslot being removed.  This improves
      performance, especially when the deleted memslot has only a few shadow
      entries, or even no entries.  E.g. a microbenchmark to access regular
      memory while concurrently reading PCI ROM to trigger memslot deletion
      showed a 5% improvement in throughput.
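
      A rough sketch of the per-memslot zap; the iteration helper and field names
      are assumptions, and the flush/reschedule handling is left out:

        spin_lock(&kvm->mmu_lock);
        for (i = 0; i < slot->npages; i++) {
                gfn_t gfn = slot->base_gfn + i;

                /* Zap only shadow pages that map this memslot's gfns. */
                for_each_valid_sp(kvm, sp, gfn) {
                        if (sp->gfn != gfn)
                                continue;
                        kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
                }
        }
        kvm_mmu_commit_zap_page(kvm, &invalid_list);
        spin_unlock(&kvm->mmu_lock);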
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap() · a2113634
      Sean Christopherson authored
      ...and into a separate helper, kvm_mmu_remote_flush_or_zap(), that does
      not require a vcpu so that the code can be (re)used by
      kvm_mmu_invalidate_zap_pages_in_memslot().
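
      The new helper, roughly (a sketch, not necessarily the exact code):

        static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
                                                struct list_head *invalid_list,
                                                bool remote_flush)
        {
                if (!remote_flush && list_empty(invalid_list))
                        return false;

                if (!list_empty(invalid_list))
                        kvm_mmu_commit_zap_page(kvm, invalid_list);
                else
                        kvm_flush_remote_tlbs(kvm);
                return true;
        }
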
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move slot_level_*() helper functions up a few lines · 85875a13
      Sean Christopherson authored
      ...so that kvm_mmu_invalidate_zap_pages_in_memslot() can utilize the
      helpers in future patches.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move the memslot update in-progress flag to bit 63 · 164bf7e5
      Sean Christopherson authored
      ...now that KVM won't explode by moving it out of bit 0.  Using bit 63
      eliminates the need to jump over bit 0, e.g. when calculating a new
      memslots generation or when propagating the memslots generation to an
      MMIO spte.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Remove the hack to trigger memslot generation wraparound · 0e32958e
      Sean Christopherson authored
      x86 captures a subset of the memslot generation (19 bits) in its MMIO
      sptes so that it can expedite emulated MMIO handling by checking only
      the relevant spte, i.e. doesn't need to do a full page fault walk.
      
      Because the MMIO sptes capture only 19 bits (due to limited space in
      the sptes), there is a non-zero probability that the MMIO generation
      could wrap, e.g. after 500k memslot updates.  Since normal usage is
      extremely unlikely to result in 500k memslot updates, a hack was added
      by commit 69c9ea93 ("KVM: MMU: init kvm generation close to mmio
      wrap-around value") to offset the MMIO generation in order to trigger
      a wraparound, e.g. after 150 memslot updates.
      
      When separate memslot generation sequences were assigned to each
      address space, commit 00f034a1 ("KVM: do not bias the generation
      number in kvm_current_mmio_generation") moved the offset logic into the
      initialization of the memslot generation itself so that the per-address
      space bit(s) were not dropped/corrupted by the MMIO shenanigans.
      
      Remove the offset hack for three reasons:
      
        - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
          wrapping the generation doesn't actually test the interesting case
          of having stale MMIO sptes with the new generation number, e.g. old
          sptes with a generation number of 0.
      
        - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
          performance rather important since the probability of invalidating
          MMIO sptes jumps from "effectively never" to "fairly likely".  This
          limits what can be done in future patches, e.g. to simplify the
          invalidation code, as doing so without proper caution could lead to
          a noticeable performance regression.
      
        - Forcing the memslots generation, which is a 64-bit number, to wrap
          prevents KVM from assuming the memslots generation will never wrap.
          This in turn prevents KVM from using an arbitrary bit for the
          "update in-progress" flag, e.g. using bit 63 would immediately
          collide with using a large value as the starting generation number.
          The "update in-progress" flag is effectively forced into bit 0 so
          that it's (subtly) taken into account when incrementing the
          generation.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Refactor the MMIO SPTE generation handling · cae7ed3c
      Sean Christopherson authored
      The code to propagate the memslots generation number into MMIO sptes is
      a bit convoluted.  The "what" is relatively straightforward, e.g. the
      comment explaining which bits go where is quite readable, but the "how"
      requires a lot of staring to understand what is happening.  For example,
      'MMIO_GEN_LOW_SHIFT' is actually used to calculate the high bits of the
      spte, while 'MMIO_SPTE_GEN_LOW_SHIFT' is used to calculate the low bits.
      
      Refactor the code to:
      
        - use #defines whose values align with the bits defined in the comment
        - use consistent code for both the high and low mask
        - explicitly highlight the handling of bit 0 (update in-progress flag)
        - explicitly call out that the defines are for MMIO sptes (to avoid
          confusion with the per-vCPU MMIO cache, which uses the full memslots
          generation)
      
      In addition to making the code a little less magical, this paves the way
      for moving the update in-progress flag to bit 63 without having to
      simultaneously rewrite all of the MMIO spte code.
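
      For illustration only (the bit positions and macro names below are
      assumptions, not the authoritative layout), the refactored encoding splits
      a generation across two spte bit ranges along these lines:

        #define MMIO_SPTE_GEN_LOW_START         3
        #define MMIO_SPTE_GEN_HIGH_START        52
        #define MMIO_SPTE_GEN_LOW_MASK          GENMASK_ULL(11, 3)
        #define MMIO_SPTE_GEN_HIGH_MASK         GENMASK_ULL(61, 52)

        /* gen[8:0] -> spte[11:3], gen[18:9] -> spte[61:52] */
        static u64 generation_mmio_spte_mask(u64 gen)
        {
                return ((gen << MMIO_SPTE_GEN_LOW_START) & MMIO_SPTE_GEN_LOW_MASK) |
                       (((gen >> 9) << MMIO_SPTE_GEN_HIGH_START) & MMIO_SPTE_GEN_HIGH_MASK);
        }

        static u64 get_mmio_spte_generation(u64 spte)
        {
                return ((spte & MMIO_SPTE_GEN_LOW_MASK) >> MMIO_SPTE_GEN_LOW_START) |
                       (((spte & MMIO_SPTE_GEN_HIGH_MASK) >> MMIO_SPTE_GEN_HIGH_START) << 9);
        }
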
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use a u64 when passing the MMIO gen around · 5192f9b9
      Sean Christopherson authored
      KVM currently uses an 'unsigned int' for the MMIO generation number
      despite it being derived from the 64-bit memslots generation and
      being propagated to (potentially) 64-bit sptes.  There is no hidden
      agenda behind using an 'unsigned int', it's done simply because the
      MMIO generation will never set bits above bit 19.
      
      Passing a u64 will allow the "update in-progress" flag to be relocated
      from bit 0 to bit 63 and removes the need to cast the generation back
      to a u64 when propagating it to a spte.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Explicitly define the "memslot update in-progress" bit · 361209e0
      Sean Christopherson authored
      KVM uses bit 0 of the memslots generation as an "update in-progress"
      flag, which is used by x86 to prevent caching MMIO access while the
      memslots are changing.  Although the intended behavior is flag-like,
      e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
      caching data from in-flux memslots, the implementation oftentimes treats
      the bit as part of the generation number itself, e.g. each memslot update
      increments the generation twice, once to set the flag and once to clear it.
      
      Prior to commit 4bd518f1 ("KVM: use separate generations for
      each address space"), incorporating the "update in-progress" bit into
      the generation number largely made sense, e.g. "real" generations are
      even, "bogus" generations are odd, most code doesn't need to be aware of
      the bit, etc...
      
      Now that unique memslots generation numbers are assigned to each address
      space, stealthing the in-progress status into the generation number
      results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
      over bit 0 when initializing the memslots generation without any hint as
      to why.
      
      Explicitly define the flag and convert as much code as possible (which
      isn't much) to actually treat it like a flag.  This paves the way for
      eventually using a different bit for "update in-progress" so that it can
      be a flag in truth instead of an awkward extension to the generation
      number.
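
      Conceptually (the macro name and exact usage are assumptions):

        /* Bit 0 of the memslots generation, set while an update is in flight. */
        #define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS      BIT_ULL(0)

        /* e.g. when publishing the new memslots: */
        slots->generation = old_memslots->generation | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
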
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux · ddfd1730
      Sean Christopherson authored
      When installing new memslots, KVM sets bit 0 of the generation number to
      indicate that an update is in-progress.  Until the update is complete,
      there are no guarantees as to whether a vCPU will see the old or the new
      memslots.  Explicitly prevent caching MMIO accesses so as to avoid using
      an access cached from the old memslots after the new memslots have been
      installed.
      
      Note that it is unclear whether or not disabling caching during the
      update window is strictly necessary as there is no definitive
      documentation as to what ordering guarantees KVM provides with respect
      to updating memslots.  That being said, the MMIO spte code does not
      allow reusing sptes created while an update is in-progress, and the
      associated documentation explicitly states:
      
          We do not want to use MMIO sptes created with an odd generation
          number, ...  If KVM is unlucky and creates an MMIO spte while the
          low bit is 1, the next access to the spte will always be a cache miss.
      
      At the very least, disabling the per-vCPU MMIO cache during updates will
      make its behavior consistent with the MMIO spte behavior and
      documentation.
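
      A sketch of the guard added to the MMIO caching path (the exact form of the
      check is an assumption; at this point the in-progress flag is still bit 0):

        /* Do not cache anything derived from in-flux memslots. */
        if (kvm_memslots(vcpu->kvm)->generation & 1)
                return;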
      
      Fixes: 56f17dd3 ("kvm: x86: fix stale mmio cache bug")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Detect MMIO generation wrap in any address space · e1359e2b
      Sean Christopherson authored
      The check to detect a wrap of the MMIO generation explicitly looks for a
      generation number of zero.  Now that unique memslots generation numbers
      are assigned to each address space, only address space 0 will get a
      generation number of exactly zero when wrapping.  E.g. when address
      space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
      wrap to 0x2.  Adjust the MMIO generation to strip the address space
      modifier prior to checking for a wrap.
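
      Illustratively (the mask name is an assumption):

        /* True if the (masked) MMIO generation wrapped, in any address space. */
        static bool mmio_gen_wrapped(u64 slots_generation)
        {
                /* MMIO_GEN_MASK strips the address-space modifier bits. */
                return (slots_generation & MMIO_GEN_MASK) == 0;
        }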
      
      Fixes: 4bd518f1 ("KVM: use separate generations for each address space")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258
      Sean Christopherson authored
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
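
      A sketch of the resulting ordering in install_new_memslots() (simplified;
      the u64 generation parameter matches the description above):

        rcu_assign_pointer(kvm->memslots[as_id], slots);
        synchronize_srcu_expedited(&kvm->srcu);

        /* Zap stale MMIO sptes *before* the new generation becomes visible. */
        kvm_arch_memslots_updated(kvm, gen);
        slots->generation = gen;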
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: Add memcg accounting to KVM allocations · 41836839
      Ben Gardon authored
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
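
      The change pattern itself is small; an illustrative (assumed) call site:

        /*
         * GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT: the allocation is
         * charged to the allocating process's (i.e. the VMM's) memory cgroup.
         */
        vmx->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
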
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: svm: Add memcg accounting to KVM allocations · 1ec69647
      Ben Gardon authored
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Add memcg accounting to KVM allocations · 254272ce
      Ben Gardon authored
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. In x86, they include:
      	vcpu->arch.pio_data
      These allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: Add memcg accounting to KVM allocations · b12ce36a
      Ben Gardon authored
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. They include:
              vcpu->run
              kvm->coalesced_mmio_ring
      These allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: do not start the preemption timer hrtimer unnecessarily · 359a6c3d
      Paolo Bonzini authored
      The preemption timer can be started even if there is a vmentry
      failure during or after loading guest state.  That is pointless;
      move the call after all conditions have been checked.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: Fix typos in vmentry/vmexit control setting · d9293597
      Yu Zhang authored
      Previously, commit f99e3daf ("KVM: x86: Add Intel PT
      virtualization work mode") offered a framework to support
      Intel PT virtualization. However, the patch has some typos in
      vmx_vmentry_ctrl() and vmx_vmexit_ctrl(), e.g. it used the wrong
      flags and the wrong variable, which will cause VM entry failures
      later.
      
      Fixes: f99e3daf ("KVM: x86: Add Intel PT virtualization work mode")
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: cleanup freeing of nested state · b4b65b56
      Paolo Bonzini authored
      Ensure that the VCPU free path goes through vmx_leave_nested and
      thus nested_vmx_vmexit, so that the cancellation of the timer does
      not have to be in free_nested.  In addition, because some paths through
      nested_vmx_vmexit do not go through sync_vmcs12, the cancellation of
      the timer is moved there.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Sync the pending Posted-Interrupts · 81b01667
      Luwei Kang authored
      Some Posted-Interrupts from passthrough devices may be lost or
      overwritten when the vCPU is in runnable state.
      
      The SN (Suppress Notification) bit of the PID (Posted Interrupt
      Descriptor) will be set when the vCPU is preempted (vCPU in
      KVM_MP_STATE_RUNNABLE state but not running on a physical CPU). If a
      posted interrupt arrives at this time, the IRQ remapping facility will
      set the corresponding bit in the PIR (Posted Interrupt Requests) without
      setting ON (Outstanding Notification).
      So this interrupt cannot be synced to the APIC virtualization registers
      and will not be handled by the guest, because ON is zero.
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      [Eliminate the pi_clear_sn fast path. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: expose MOVDIR64B CPU feature into VM. · c029b5de
      Liu Jingqi authored
      MOVDIR64B moves 64-bytes as direct-store with 64-bytes write atomicity.
      Direct store is implemented by using write combining (WC) for writing
      data directly into memory without caching the data.
      
      Availability of the MOVDIR64B instruction is indicated by the presence
      of the CPUID feature flag MOVDIR64B (CPUID.0x07.0x0:ECX[bit 28]).
      
      This patch exposes the movdir64b feature to the guest.
      
      The release document is available at the link below:
      https://software.intel.com/sites/default/files/managed/c5/15/\
      architecture-instruction-set-extensions-programming-reference.pdf
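
      Conceptually, the change adds a single bit to the CPUID 7.0 ECX feature
      mask that KVM reports to the guest (a sketch, not the literal diff):

        /* F(MOVDIR64B) corresponds to CPUID.0x07.0x0:ECX[bit 28]. */
        kvm_cpuid_7_0_ecx_x86_features |= F(MOVDIR64B);
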
      Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
      Cc: Xu Tao <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>