  1. 02 May, 2024 1 commit
    • KVM: x86/mmu: Fix a largely theoretical race in kvm_mmu_track_write() · 226d9b8f
      Sean Christopherson authored
      Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
      to plug a (very, very theoretical) race where kvm_mmu_track_write() could
      miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
      *stale* SPTEs.
      
      Without the barriers, because modern x86 CPUs allow (per the SDM):
      
        Reads may be reordered with older writes to different locations but not
        with older writes to the same location.
      
      it's possible that the following could happen (in terms of values being
      visible/resolved):
      
       CPU0                          CPU1
       read memory[gfn] (=Y)
                                     memory[gfn] Y=>X
                                     read indirect_shadow_pages (=0)
       indirect_shadow_pages 0=>1
      
      or conversely:
      
       CPU0                          CPU1
       indirect_shadow_pages 0=>1
                                     read indirect_shadow_pages (=0)
       read memory[gfn] (=Y)
                                     memory[gfn] Y=>X
      
      E.g. in the below scenario, CPU0 could fail to zap SPTEs, and CPU1 could
      fail to retry the faulting instruction, resulting in KVM entering the
      guest with a stale SPTE (map PTE=X instead of PTE=Y).
      
      PTE = X;
      
      CPU0:
          emulator_write_phys()
          PTE = Y
          kvm_page_track_write()
            kvm_mmu_track_write()
            // memory barrier missing here
            if (indirect_shadow_pages)
                zap();
      
      CPU1:
         FNAME(page_fault)
           FNAME(walk_addr)
             FNAME(walk_addr_generic)
               gw->pte = PTE; // X
      
           FNAME(fetch)
             kvm_mmu_get_child_sp
               kvm_mmu_get_shadow_page
                 __kvm_mmu_get_shadow_page
                   kvm_mmu_alloc_shadow_page
                     account_shadowed
                       indirect_shadow_pages++
                       // memory barrier missing here
             if (FNAME(gpte_changed)) // if (PTE == X)
                 return RET_PF_RETRY;
      
      In practice, this bug likely cannot be observed as both the 0=>1
      transition and reordering of this scope are extremely rare occurrences.
      
      Note, if the cost of the barrier (which is simply a locked ADD, see commit
      450cbdd0 ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE"))
      is problematic, KVM could avoid the barrier by bailing early when
      kvm_memslots_have_rmaps() returns false.  But the odds of the barrier being
      problematic are extremely low, *and* the odds of the extra checks being
      meaningfully faster overall are also low.
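
      As a rough sketch of where the barriers described above land (function
      signatures abbreviated and bodies trimmed; illustrative only, not the
      actual diff):

        static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
        {
                /* ... bookkeeping for the new indirect shadow page ... */
                kvm->arch.indirect_shadow_pages++;
                /*
                 * Pair with kvm_mmu_track_write(): make the 0->1 transition
                 * visible before this vCPU re-reads the guest PTE it shadowed.
                 */
                smp_mb();
        }

        void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
                                 const u8 *new, int bytes)
        {
                /*
                 * Pair with account_shadowed(): order the caller's write to
                 * guest memory before the read of indirect_shadow_pages.
                 */
                smp_mb();
                if (!vcpu->kvm->arch.indirect_shadow_pages)
                        return;
                /* ... zap the now-stale SPTEs for the written gfn ... */
        }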
      
      Link: https://lore.kernel.org/r/20240423193114.2887673-1-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  2. 06 Mar, 2024 3 commits
    • mm/treewide: drop pXd_large() · e72c7c2b
      Peter Xu authored
      They're not used anymore; drop all of them.
      
      Link: https://lkml.kernel.org/r/20240305043750.93762-10-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/treewide: replace pud_large() with pud_leaf() · 0a845e0f
      Peter Xu authored
      pud_large() is always defined as pud_leaf().  Merge their usages.  Chose
      pud_leaf() because it is a global API, while pud_large() is not.
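
      At a typical call site the conversion is purely mechanical, e.g.
      (hypothetical snippet):

        /* before */
        if (pud_large(*pud))
                return pud_pfn(*pud);

        /* after */
        if (pud_leaf(*pud))
                return pud_pfn(*pud);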
      
      Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/treewide: replace pmd_large() with pmd_leaf() · 2f709f7b
      Peter Xu authored
      pmd_large() is always defined as pmd_leaf().  Merge their usages.  Chose
      pmd_leaf() because it is a global API, while pmd_large() is not.
      
      Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 04 Mar, 2024 1 commit
  4. 23 Feb, 2024 3 commits
    • KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing · d02c357e
      Sean Christopherson authored
      Retry page faults without acquiring mmu_lock, and without even faulting
      the page into the primary MMU, if the resolved gfn is covered by an active
      invalidation.  Contending for mmu_lock is especially problematic on
      preemptible kernels as the mmu_notifier invalidation task will yield
      mmu_lock (see rwlock_needbreak()), delay the in-progress invalidation, and
      ultimately increase the latency of resolving the page fault.  And in the
      worst case scenario, yielding will be accompanied by a remote TLB flush,
      e.g. if the invalidation covers a large range of memory and vCPUs are
      accessing addresses that were already zapped.
      
      Faulting the page into the primary MMU is similarly problematic, as doing
      so may acquire locks that need to be taken for the invalidation to
      complete (the primary MMU has finer grained locks than KVM's MMU), and/or
      may cause unnecessary churn (getting/putting pages, marking them accessed,
      etc).
      
      Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
      iterators to perform more work before yielding, but that wouldn't solve
      the lock contention and would negatively affect scenarios where a vCPU is
      trying to fault in an address that is NOT covered by the in-progress
      invalidation.
      
      Add a dedicated lockless version of the range-based retry check to avoid
      false positives on the sanity check on start+end WARN, and so that it's
      super obvious that checking for a racing invalidation without holding
      mmu_lock is unsafe (though obviously useful).
      
      Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that pre-checking
      invalidation in a loop won't put KVM into an infinite loop, e.g. due to
      caching the in-progress flag and never seeing it go to '0'.
      
      Force a load of mmu_invalidate_seq as well, even though it isn't strictly
      necessary to avoid an infinite loop, as doing so improves the probability
      that KVM will detect an invalidation that already completed before
      acquiring mmu_lock and bailing anyways.
      
      Do the pre-check even for non-preemptible kernels, as waiting to detect
      the invalidation until mmu_lock is held guarantees the vCPU will observe
      the worst case latency in terms of handling the fault, and can generate
      even more mmu_lock contention.  E.g. the vCPU will acquire mmu_lock,
      detect retry, drop mmu_lock, re-enter the guest, retake the fault, and
      eventually re-acquire mmu_lock.  This behavior is also why there are no
      new starvation issues due to losing the fairness guarantees provided by
      rwlocks: if the vCPU needs to retry, it _must_ drop mmu_lock, i.e. waiting
      on mmu_lock doesn't guarantee forward progress in the face of _another_
      mmu_notifier invalidation event.
      
      Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
      may generate a load into a register instead of doing a direct comparison
      (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
      is a few bytes of code and maaaaybe a cycle or three.
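
      As a hedged sketch of the lockless pre-check (field and helper names
      follow the terminology above; the exact code may differ):

        /*
         * Illustrative: check for a racing invalidation without mmu_lock.
         * READ_ONCE() forces a fresh load on every call so a retry loop
         * cannot get stuck on a cached "in progress" value.
         */
        static bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
                                                    unsigned long mmu_seq,
                                                    gfn_t gfn)
        {
                if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
                    gfn >= kvm->mmu_invalidate_range_start &&
                    gfn < kvm->mmu_invalidate_range_end)
                        return true;

                return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
        }

        /* In the fault path, before faulting in the pfn or taking mmu_lock: */
        if (fault->slot &&
            mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
                return RET_PF_RETRY;
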
      Reported-by: Yan Zhao <yan.y.zhao@intel.com>
      Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
      Reported-by: Friedrich Weber <f.weber@proxmox.com>
      Cc: Kai Huang <kai.huang@intel.com>
      Cc: Yan Zhao <yan.y.zhao@intel.com>
      Cc: Yuan Yao <yuan.yao@linux.intel.com>
      Cc: Xu Yilun <yilun.xu@linux.intel.com>
      Acked-by: Kai Huang <kai.huang@intel.com>
      Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
      Link: https://lore.kernel.org/r/20240222012640.2820927-1-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: x86/mmu: Free TDP MMU roots while holding mmu_lock for read · 576a15de
      Sean Christopherson authored
      Free TDP MMU roots from vCPU context while holding mmu_lock for read; it
      is completely legal to invoke kvm_tdp_mmu_put_root() as a reader.  This
      eliminates the last mmu_lock writer in the TDP MMU's "fast zap" path
      after requesting vCPUs to reload roots, i.e. allows KVM to zap invalidated
      roots, free obsolete roots, and allocate new roots in parallel.
      
      On large VMs, e.g. 100+ vCPUs, allowing the bulk of the "fast zap"
      operation to run in parallel with freeing and allocating roots reduces the
      worst case latency for a vCPU to reload a root from 2-3ms to <100us.
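
      The gist of the locking change, as a hedged sketch (the surrounding
      logic in the root-freeing path is omitted):

        /*
         * Illustrative: dropping the last reference to a TDP MMU root from
         * vCPU context, and thus zapping and freeing it, now only requires
         * mmu_lock held for read.
         */
        read_lock(&kvm->mmu_lock);
        kvm_tdp_mmu_put_root(kvm, root);
        read_unlock(&kvm->mmu_lock);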
      
      Link: https://lore.kernel.org/r/20240111020048.844847-9-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for read · f5238c2a
      Sean Christopherson authored
      When allocating a new TDP MMU root, check for a usable root while holding
      mmu_lock for read and only acquire mmu_lock for write if a new root needs
      to be created.  There is no need to serialize other MMU operations if a
      vCPU is simply grabbing a reference to an existing root; holding mmu_lock
      for write is "necessary" (spoiler alert, it's not strictly necessary) only
      to ensure KVM doesn't end up with duplicate roots.
      
      Allowing vCPUs to get "new" roots in parallel is beneficial to VM boot and
      to setups that frequently delete memslots, i.e. which force all vCPUs to
      reload all roots.
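
      As a hedged sketch of the flow (the helper names below are illustrative,
      not the exact functions):

        /* Fast path: reuse an existing, usable root under read-lock. */
        read_lock(&vcpu->kvm->mmu_lock);
        root = kvm_tdp_mmu_try_get_existing_root(vcpu);   /* hypothetical helper */
        read_unlock(&vcpu->kvm->mmu_lock);
        if (root)
                return root;

        /* Slow path: serialize only the allocation of a brand new root. */
        write_lock(&vcpu->kvm->mmu_lock);
        root = kvm_tdp_mmu_alloc_root(vcpu);              /* hypothetical helper */
        write_unlock(&vcpu->kvm->mmu_lock);
        return root;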
      
      Link: https://lore.kernel.org/r/20240111020048.844847-7-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  5. 01 Feb, 2024 1 commit
  6. 31 Jan, 2024 1 commit
  7. 10 Jan, 2024 1 commit
  8. 03 Jan, 2024 1 commit
  9. 01 Dec, 2023 3 commits
  10. 29 Nov, 2023 2 commits
  11. 14 Nov, 2023 3 commits
    • KVM: Allow arch code to track number of memslot address spaces per VM · eed52e43
      Sean Christopherson authored
      Let x86 track the number of address spaces on a per-VM basis so that KVM
      can disallow SMM memslots for confidential VMs.  Confidential VMs are
      fundamentally incompatible with emulating SMM, which as the name suggests
      requires being able to read and write guest memory and register state.
      
      Disallowing SMM will simplify support for guest private memory, as KVM
      will not need to worry about tracking memory attributes for multiple
      address spaces (SMM is the only "non-default" address space across all
      architectures).
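
      Conceptually, the per-VM hook ends up looking something like the sketch
      below (hook name per my reading of the series; the condition helper is
      hypothetical -- the point is that a VM without SMM advertises a single
      memslot address space):

        static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
        {
                /* kvm_vm_allows_smm() is a hypothetical stand-in. */
                return kvm_vm_allows_smm(kvm) ? 2 : 1;
        }
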
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20231027182217.3615211-23-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Handle page fault for private memory · 8dd2eee9
      Chao Peng authored
      Add support for resolving page faults on guest private memory for VMs
      that differentiate between "shared" and "private" memory.  For such VMs,
      KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
      hva-based shared memory, and KVM needs to map in the "correct" variant,
      i.e. KVM needs to map the gfn shared/private as appropriate based on the
      current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
      
      For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
      shared vs. private via a bit in the guest page tables, i.e. what the guest
      wants may conflict with the current memory attributes.  To support such
      "implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
      to forward the request to userspace.  Add a new flag for memory faults,
      KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
      map memory as shared vs. private.
      
      Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
      so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
      needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
      exit on missing mappings when handling guest page fault VM-Exits.  In
      that case, userspace will want to know RWX information in order to
      correctly/precisely resolve the fault.
      
      Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
      always come from the host userspace page tables, and private mappings
      always come from a guest_memfd instance.
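
      A hedged sketch of the resulting fault-path decision (helper names are
      approximate; treat the snippet as illustrative):

        /*
         * The fault's shared/private "request" must match the gfn's current
         * KVM_MEMORY_ATTRIBUTE_PRIVATE state, otherwise the implicit
         * conversion is punted to userspace via KVM_EXIT_MEMORY_FAULT with
         * KVM_MEMORY_EXIT_FLAG_PRIVATE set as appropriate.
         */
        if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
                kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
                return -EFAULT;
        }

        if (fault->is_private)
                return kvm_faultin_pfn_private(vcpu, fault);  /* from guest_memfd */

        /* Shared: fall through to the normal hva-based path. */
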
      Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20231027182217.3615211-21-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Disallow hugepages when memory attributes are mixed · 90b4fe17
      Chao Peng authored
      Disallow creating hugepages with mixed memory attributes, e.g. shared
      versus private, as mapping a hugepage in this case would allow the guest
      to access memory with the wrong attributes, e.g. overlaying private memory
      with a shared hugepage.
      
      Track whether or not attributes are mixed via the existing disallow_lpage
      field, but use the most significant bit of 'disallow_lpage' to indicate
      that a hugepage has mixed attributes instead of using the normal
      refcounting.  Whether or not attributes are mixed is binary; either they
      are or they aren't.  Attempting to squeeze that info into the refcount is
      unnecessarily complex as it would require knowing the previous state of
      the mixed count when updating attributes.  Using a flag means KVM just
      needs to ensure the current status is reflected in the memslots.
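
      As a hedged sketch of the flag (bit position and helper names are
      illustrative):

        /* The MSB of disallow_lpage doubles as the "mixed attributes" flag. */
        #define KVM_LPAGE_MIXED_FLAG    BIT(31)

        static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
                                       int level)
        {
                lpage_info_slot(gfn, slot, level)->disallow_lpage |=
                        KVM_LPAGE_MIXED_FLAG;
        }

        static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
                                        int level)
        {
                return lpage_info_slot(gfn, slot, level)->disallow_lpage &
                       KVM_LPAGE_MIXED_FLAG;
        }
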
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  12. 13 Nov, 2023 1 commit
  13. 18 Oct, 2023 1 commit
  14. 09 Oct, 2023 1 commit
  15. 04 Oct, 2023 1 commit
    • kvm: mmu: dynamically allocate the x86-mmu shrinker · e5985c40
      Qi Zheng authored
      Use new APIs to dynamically allocate the x86-mmu shrinker.
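
      The "new APIs" are the dynamic shrinker allocation/registration helpers;
      a hedged sketch of the resulting registration (callback names are KVM's
      existing ones, the rest is illustrative):

        mmu_shrinker = shrinker_alloc(0, "x86-mmu");
        if (!mmu_shrinker)
                goto out;

        mmu_shrinker->count_objects = mmu_shrink_count;
        mmu_shrinker->scan_objects  = mmu_shrink_scan;
        mmu_shrinker->seeks = DEFAULT_SEEKS * 10;

        shrinker_register(mmu_shrinker);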
      
      Link: https://lkml.kernel.org/r/20230911094444.68966-3-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  16. 23 Sep, 2023 2 commits
    • KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously · 0df9dab8
      Sean Christopherson authored
      Stop zapping invalidated TDP MMU roots via work queue now that KVM
      preserves TDP MMU roots until they are explicitly invalidated.  Zapping
      roots asynchronously was effectively a workaround to avoid stalling a vCPU
      for an extended duration if a vCPU unloaded a root, which at the time
      happened whenever the guest toggled CR0.WP (a frequent operation for some
      guest kernels).
      
      While a clever hack, zapping roots via an unbound worker had subtle,
      unintended consequences on host scheduling, especially when zapping
      multiple roots, e.g. as part of a memslot deletion.  Because the work of zapping a
      root is no longer bound to the task that initiated the zap, things like
      the CPU affinity and priority of the original task get lost.  Losing the
      affinity and priority can be especially problematic if unbound workqueues
      aren't affined to a small number of CPUs, as zapping multiple roots can
      cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
      the CPUs KVM is already using to run vCPUs.
      
      When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
      zap can result in KVM occupying all logical CPUs for ~8ms, and result in
      high priority tasks not being scheduled in in a timely manner.  In v5.15,
      which doesn't preserve unloaded roots, the issues were even more noticeable
      as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
      
      Consuming all CPUs for an extended duration can lead to significant jitter
      throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
      is a semi-frequent operation as memslots are deleted and recreated with
      different host virtual addresses to react to host GPU drivers allocating
      and freeing GPU blobs.  On ChromeOS, the jitter manifests as audio blips
      during games due to the audio server's tasks not getting scheduled in
      promptly, despite the tasks having a high realtime priority.
      
      Deleting memslots isn't exactly a fast path and should be avoided when
      possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
      memslot shenanigans, but KVM is squarely in the wrong.  Not to mention
      that removing the async zapping eliminates a non-trivial amount of
      complexity.
      
      Note, one of the subtle behaviors hidden behind the async zapping is that
      KVM would zap invalidated roots only once (ignoring partial zaps from
      things like mmu_notifier events).  Preserve this behavior by adding a flag
      to identify roots that are scheduled to be zapped versus roots that have
      already been zapped but not yet freed.
      
      Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
      encounter invalid roots, as it's not at all obvious why zapping
      invalidated roots shouldn't simply zap all invalid roots.
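
      A hedged sketch of that flag (the field name and surrounding code are
      simplified and may not match the tree exactly):

        /* When invalidating, mark each live root as scheduled to be zapped. */
        list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
                if (!root->role.invalid) {
                        root->tdp_mmu_scheduled_root_to_zap = true;
                        root->role.invalid = true;
                }
        }

        /* Later, the zap path processes only the roots scheduled above. */
        if (!root->tdp_mmu_scheduled_root_to_zap)
                continue;
        root->tdp_mmu_scheduled_root_to_zap = false;
        tdp_mmu_zap_root(kvm, root, true /* shared */);
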
      Reported-by: Pattara Teerapong <pteerapong@google.com>
      Cc: David Stevens <stevensd@google.com>
      Cc: Yiwei Zhang <zzyiwei@google.com>
      Cc: Paul Hsia <paulhsia@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20230916003916.2545000-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe() · 441a5dfc
      Paolo Bonzini authored
      All callers except the MMU notifier want to process all address spaces.
      Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe()
      and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe().
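
      Conceptually the split looks like the sketch below (purely illustrative;
      the kernel's actual macros are structured differently):

        /* Wrapper: walk the roots of every address space. */
        #define for_each_tdp_mmu_root_yield_safe(_kvm, _root)                      \
                for (int __as_id = 0; __as_id < KVM_ADDRESS_SPACE_NUM; __as_id++)  \
                        __for_each_tdp_mmu_root_yield_safe(_kvm, _root, __as_id)

      The mmu_notifier path, which is inherently per address space, keeps
      passing the slot's as_id to the double-underscore variant.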
      
      Extracted out of a patch by Sean Christopherson <seanjc@google.com>
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  17. 21 Sep, 2023 1 commit
    • KVM: x86/mmu: Open code leaf invalidation from mmu_notifier · 50107e8b
      Sean Christopherson authored
      The mmu_notifier path is a bit of a special snowflake, e.g. it zaps only a
      single address space (because it's per-slot), and can't always yield.
      Because of this, it calls kvm_tdp_mmu_zap_leafs() in ways that no one
      else does.
      
      Iterate manually over the leafs in response to an mmu_notifier
      invalidation, instead of invoking kvm_tdp_mmu_zap_leafs().  Drop the
      @can_yield param from kvm_tdp_mmu_zap_leafs() as its sole remaining
      caller unconditionally passes "true".
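
      After the change, the exported helper reduces to something like the
      sketch below (signature and internals simplified; illustrative only):

        /* Zap leaf SPTEs in [start, end) across all roots; always may yield. */
        bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end,
                                   bool flush)
        {
                struct kvm_mmu_page *root;

                for_each_tdp_mmu_root_yield_safe(kvm, root)
                        flush = tdp_mmu_zap_leafs(kvm, root, start, end,
                                                  true, flush);

                return flush;
        }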
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20230916003916.2545000-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  18. 31 Aug, 2023 13 commits