  1. 04 Aug, 2021 3 commits
    • KVM: X86: Optimize zapping rmap · a75b5404
      Peter Xu authored
      Using rmap_get_first() and rmap_remove() to zap a huge rmap list can be
      slow.  The easy way is to traverse the rmap list, collecting the a/d bits
      and freeing the slots along the way.
      
      Provide a pte_list_destroy() and do exactly that.
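      
      Below is a minimal sketch of the idea, for illustration only: the
      descriptor layout is simplified and zap_spte() is a hypothetical callback
      standing in for the real spte-clearing helper that collects the a/d bits.
      
          /* Simplified descriptor layout (see also the per-array counter patch
           * in this series, which adds spte_count). */
          struct pte_list_desc {
                  struct pte_list_desc *more;     /* next array in the chain */
                  u64 spte_count;                 /* used entries in sptes[] */
                  u64 *sptes[PTE_LIST_EXT];
          };
          
          static void pte_list_destroy(struct kvm_rmap_head *rmap_head,
                                       void (*zap_spte)(u64 *sptep))
          {
                  struct pte_list_desc *desc, *next;
                  u64 i;
          
                  if (!rmap_head->val)
                          return;
          
                  if (!(rmap_head->val & 1)) {
                          /* Single spte: no descriptor array was allocated. */
                          zap_spte((u64 *)rmap_head->val);
                  } else {
                          desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
                          while (desc) {
                                  /* Collect a/d bits for every spte ... */
                                  for (i = 0; i < desc->spte_count; i++)
                                          zap_spte(desc->sptes[i]);
                                  next = desc->more;
                                  /* ... and free the slot along the way (the
                                   * real code frees via its kmem_cache). */
                                  kfree(desc);
                                  desc = next;
                          }
                  }
                  rmap_head->val = 0;
          }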
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220605.26377-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Optimize pte_list_desc with per-array counter · 13236e25
      Peter Xu authored
      Add a counter field to pte_list_desc to simplify the add/remove/loop
      logic.  E.g., we no longer need to loop over the array in most cases.
      
      This will make more sense once the array size is switched to a larger
      value; otherwise the counter would be a waste.
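      
      For illustration, the shape of the change is roughly the following (a
      sketch, not the exact upstream layout; the helper names are made up):
      
          struct pte_list_desc {
                  struct pte_list_desc *more;
                  /*
                   * New: number of used entries in sptes[], so add/remove no
                   * longer scan the array for a free or last slot.
                   */
                  u64 spte_count;
                  u64 *sptes[PTE_LIST_EXT];
          };
          
          /* Appending to a non-full descriptor becomes O(1). */
          static void pte_list_desc_push(struct pte_list_desc *desc, u64 *spte)
          {
                  desc->sptes[desc->spte_count++] = spte;
          }
          
          /* Removal fills the hole with the last used entry and shrinks. */
          static void pte_list_desc_remove(struct pte_list_desc *desc, u64 i)
          {
                  desc->sptes[i] = desc->sptes[--desc->spte_count];
                  desc->sptes[desc->spte_count] = NULL;
          }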
      
      Initially I wanted to store a tail pointer at the head of the array list
      so we would not need to traverse the list at least when pushing new
      entries (without the counter we traverse both the list and the array).
      However that would need slightly more change without much benefit, e.g.,
      after we grow the number of entries per array, traversing the list is not
      so expensive.
      
      So let's keep it simple while still getting as much benefit as we can
      from just these few extra lines of changes (not to mention the code reads
      more easily without the array loops).
      
      Using the same test case that forks 500 children and recycles them
      ("./rmap_fork 500" [1]), this patch further speeds up the total fork time
      by about 4%, for a total of 33% over the vanilla kernel:
      
              Vanilla:      473.90 (+-5.93%)
              3->15 slots:  366.10 (+-4.94%)
              Add counter:  351.00 (+-3.70%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220602.26327-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger · dc1cff96
      Peter Xu authored
      Currently an rmap array element only contains 3 entries.  However, for
      EPT=N there can be a lot of guest pages that accumulate tens or even
      hundreds of rmap entries.
      
      A typical distribution for a 6G guest (even if idle) shows this in the
      rmap count statistics:
      
      Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
      Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
      Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
      Level=1G:       32      0       0       0       0       0       0       0       0       0       0
      
      With some more forks, some pages will grow even larger rmap counts.
      
      This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the
      general EPT=N use case: we follow the list less often and the loops over
      PTE_LIST_EXT are slightly more efficient; but it is still not too large,
      so there is less waste when the array is not full.
      
      It should not affect EPT=Y, since EPT normally has only zero or one rmap
      entry for each page, so no array is even allocated.
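      
      In code terms the change is just the constant; the new value of 15 below
      is taken from the "3->15 slots" label in the benchmark of the follow-up
      counter patch above:
      
          -#define PTE_LIST_EXT 3
          +#define PTE_LIST_EXT 15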
      
      With a test case that forks 500 children and recycles them ("./rmap_fork
      500" [1]), this patch speeds up fork time by about 29%.
      
          Before: 473.90 (+-5.93%)
          After:  366.10 (+-4.94%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-6-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 03 Aug, 2021 3 commits
    • KVM: const-ify all relevant uses of struct kvm_memory_slot · 269e9552
      Hamza Mahfooz authored
      As alluded to in commit f36f3f28 ("KVM: add "new" argument to
      kvm_arch_commit_memory_region"), a number of other places where struct
      kvm_memory_slot is used need to be refactored to preserve the "const"-ness
      of struct kvm_memory_slot across the board.
      Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
      Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
      [Do not touch body of slot_rmap_walk_init. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Don't take mmu_lock for range invalidation unless necessary · 071064f1
      Paolo Bonzini authored
      Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications
      that are unrelated to KVM.  This is possible now that memslot updates are
      blocked from range_start() to range_end(); that ensures that lock elision
      happens in both or none, and therefore that mmu_notifier_count updates
      (which must occur while holding mmu_lock for write) are always paired
      across start->end.
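      
      A toy sketch of the pairing requirement (not the actual kvm_main.c code;
      range_touches_memslots() is a stand-in for "does this hva range overlap
      any memslot"):
      
          /*
           * start() and end() must make the same take/skip decision, otherwise
           * mmu_notifier_count would be incremented without a matching
           * decrement.  Blocking memslot updates between start() and end()
           * (see the patch below) guarantees the answer cannot change.
           */
          static void range_start(struct kvm *kvm, unsigned long start,
                                  unsigned long end)
          {
                  if (!range_touches_memslots(kvm, start, end))
                          return;                 /* elide mmu_lock entirely */
          
                  KVM_MMU_LOCK(kvm);
                  kvm->mmu_notifier_count++;
                  /* ... unmap/flush the affected range ... */
                  KVM_MMU_UNLOCK(kvm);
          }
          
          static void range_end(struct kvm *kvm, unsigned long start,
                                unsigned long end)
          {
                  if (!range_touches_memslots(kvm, start, end))
                          return;                 /* same decision as start() */
          
                  KVM_MMU_LOCK(kvm);
                  kvm->mmu_notifier_count--;
                  KVM_MMU_UNLOCK(kvm);
          }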
      
      Based on patches originally written by Ben Gardon.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Block memslot updates across range_start() and range_end() · 52ac8b35
      Paolo Bonzini authored
      We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
      notifications that are unrelated to KVM.  Because mmu_notifier_count
      must be modified while holding mmu_lock for write, and must always
      be paired across start->end to stay balanced, lock elision must
      happen in both or none.  Therefore, in preparation for this change,
      this patch prevents memslot updates across range_start() and range_end().
      
      Note, technically flag-only memslot updates could be allowed in parallel,
      but stalling a memslot update for a relatively short amount of time is
      not a scalability issue, and this is all more than complex enough.
      
      A long note on the locking: a previous version of the patch used an rwsem
      to block the memslot update while the MMU notifier ran, but this resulted
      in the following deadlock involving the pseudo-lock tagged as
      "mmu_notifier_invalidate_range_start".
      
         ======================================================
         WARNING: possible circular locking dependency detected
         5.12.0-rc3+ #6 Tainted: G           OE
         ------------------------------------------------------
         qemu-system-x86/3069 is trying to acquire lock:
         ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
      
         but task is already holding lock:
         ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
      
         which lock already depends on the new lock.
      
      This corresponds to the following MMU notifier logic:
      
          invalidate_range_start
            take pseudo lock
            down_read()           (*)
            release pseudo lock
          invalidate_range_end
            take pseudo lock      (**)
            up_read()
            release pseudo lock
      
      At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
      at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.
      
      This could cause a deadlock (ignoring for a second that the pseudo lock
      is not a lock):
      
      - invalidate_range_start waits on down_read(), because the rwsem is
      held by install_new_memslots
      
      - install_new_memslots waits on down_write(), because the rwsem is
      held till (another) invalidate_range_end finishes
      
      - invalidate_range_end waits on the pseudo lock, held by
      invalidate_range_start.
      
      Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
      it would change the *shared* rwsem readers into *shared recursive*
      readers), so open-code the wait using a readers count and a
      spinlock.  This also allows handling blockable and non-blockable
      critical sections in the same way.
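      
      A minimal sketch of that open-coded wait, using a plain wait queue for
      clarity (the names and types below are illustrative, not the fields the
      patch actually adds to struct kvm):
      
          #include <linux/spinlock.h>
          #include <linux/wait.h>
          
          struct invalidate_gate {
                  spinlock_t        lock;
                  int               active;  /* in-flight start()..end() pairs */
                  wait_queue_head_t wq;
          };
          
          /* range_start(): "readers" never block, so there is no reader
           * fairness left to feed the lockdep cycle above. */
          static void gate_enter(struct invalidate_gate *g)
          {
                  spin_lock(&g->lock);
                  g->active++;
                  spin_unlock(&g->lock);
          }
          
          /* range_end(): drop the count, wake a pending memslot update. */
          static void gate_exit(struct invalidate_gate *g)
          {
                  bool wake;
          
                  spin_lock(&g->lock);
                  wake = (--g->active == 0);
                  spin_unlock(&g->lock);
                  if (wake)
                          wake_up(&g->wq);
          }
          
          /* install_new_memslots(): stall until no invalidation is in flight;
           * this is the theoretically unbounded wait mentioned below. */
          static void gate_wait_idle(struct invalidate_gate *g)
          {
                  wait_event(g->wq, !g->active);
          }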
      
      Losing the rwsem fairness does theoretically allow MMU notifiers to
      block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
      retry scheme in mmu_interval_read_begin also uses wait/wake_up
      and is likewise not fair.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 02 Aug, 2021 34 commits