1. 13 Aug, 2021 4 commits
    • Paolo Bonzini's avatar
      Merge branch 'kvm-tdpmmu-fixes' into HEAD · 9a63b451
      Paolo Bonzini authored
      Merge topic branch with fixes for 5.14-rc6 and 5.15 merge window.
      9a63b451
    • Sean Christopherson's avatar
      KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock · ce25681d
      Sean Christopherson authored
      Add yet another spinlock for the TDP MMU and take it when marking indirect
      shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
      nested TDP, KVM may encounter shadow pages for the TDP entries managed by
      L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
      is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
      misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
      which runs with mmu_lock held for read, not write.
      
      Lack of a critical section manifests most visibly as an underflow of
      unsync_children in clear_unsync_child_bit() due to unsync_children being
      corrupted when multiple CPUs write it without a critical section and
      without atomic operations.  But underflow is the best case scenario.  The
      worst case scenario is that unsync_children prematurely hits '0' and
      leads to guest memory corruption due to KVM neglecting to properly sync
      shadow pages.
      
      Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
      would functionally be ok.  Usurping the lock could degrade performance when
      building upper level page tables on different vCPUs, especially since the
      unsync flow could hold the lock for a comparatively long time depending on
      the number of indirect shadow pages and the depth of the paging tree.
      
      For simplicity, take the lock for all MMUs, even though KVM could fairly
      easily know that mmu_lock is held for write.  If mmu_lock is held for
      write, there cannot be contention for the inner spinlock, and marking
      shadow pages unsync across multiple vCPUs will be slow enough that
      bouncing the kvm_arch cacheline should be in the noise.
      
      Note, even though L2 could theoretically be given access to its own EPT
      entries, a nested MMU must hold mmu_lock for write and thus cannot race
      against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
      be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
      that is running with the TDP MMU enabled.  Holding mmu_lock for read also
      prevents the indirect shadow page from being freed.  But as above, keep
      it simple and always take the lock.
      
      Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
      effectively disable unsync behavior for nested TDP.  Write protecting leaf
      shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
      VMMs typically don't modify TDP entries, but the same may not hold true for
      non-standard use cases and/or VMMs that are migrating physical pages (from
      L1's perspective).
      
      Alternative #2, the unsync logic could be made thread safe.  In theory,
      simply converting all relevant kvm_mmu_page fields to atomics and using
      atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
      would be required, (b) the code churn would be substantial, and (c) legacy
      shadow paging would incur additional atomic operations in performance
      sensitive paths for no benefit (to legacy shadow paging).
      
      Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210812181815.3378104-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ce25681d
    • Sean Christopherson's avatar
      KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs · 0103098f
      Sean Christopherson authored
      Set the min_level for the TDP iterator at the root level when zapping all
      SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
      SPTE will recursively zap all its children, thus there is no need for the
      iterator to attempt to step down.  This avoids rereading the top-level
      SPTEs after they are zapped by causing try_step_down() to short-circuit.
      
      In most cases, optimizing try_step_down() will be in the noise as the cost
      of zapping SPTEs completely dominates the overall time.  The optimization
      is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
      is zapping in response to multiple memslot updates when userspace is adding
      and removing read-only memslots for option ROMs.  In that case, the task
      doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
      for read and thus can be a noisy neighbor of sorts.
      Reviewed-by: default avatarBen Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0103098f
    • Sean Christopherson's avatar
      KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs · 524a1e4e
      Sean Christopherson authored
      Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
      really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
      SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
      not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
      paging, the "zap all" flows will skip top-level SPTEs whose range extends
      beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
      
      Use the current upper bound (based on host.MAXPHYADDR) to detect that the
      caller wants to zap all SPTEs, e.g. instead of using the max theoretical
      gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
      to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
      
      Add a WARN on kmv->arch.tdp_mmu_pages when the TDP MMU is destroyed to
      help future debuggers should KVM decide to leak SPTEs again.
      
      The bug is most easily reproduced by running (and unloading!) KVM in a
      VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
      
        =============================================================================
        BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
        CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x45/0x59
         slab_err+0x95/0xc9
         __kmem_cache_shutdown.cold+0x3c/0x158
         kmem_cache_destroy+0x3d/0xf0
         kvm_mmu_module_exit+0xa/0x30 [kvm]
         kvm_arch_exit+0x5d/0x90 [kvm]
         kvm_exit+0x78/0x90 [kvm]
         vmx_exit+0x1a/0x50 [kvm_intel]
         __x64_sys_delete_module+0x13f/0x220
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      524a1e4e
  2. 10 Aug, 2021 2 commits
    • Paolo Bonzini's avatar
      Merge branch 'kvm-vmx-secctl' into HEAD · c3e9434c
      Paolo Bonzini authored
      Merge common topic branch for 5.14-rc6 and 5.15 merge window.
      c3e9434c
    • Sean Christopherson's avatar
      KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation · 7b9cae02
      Sean Christopherson authored
      Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to
      effectively get the controls for the current VMCS, as opposed to using
      vmx->secondary_exec_controls, which is the cached value of KVM's desired
      controls for vmcs01 and truly not reflective of any particular VMCS.
      
      While the waitpkg control is not dynamic, i.e. vmcs01 will always hold
      the same waitpkg configuration as vmx->secondary_exec_controls, the same
      does not hold true for vmcs02 if the L1 VMM hides the feature from L2.
      If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL,
      L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP.
      
      Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210810171952.2758100-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7b9cae02
  3. 06 Aug, 2021 8 commits
    • David Matlack's avatar
      KVM: selftests: Move vcpu_args_set into perf_test_util · 32bdc019
      David Matlack authored
      perf_test_util is used to set up KVM selftests where vCPUs touch a
      region of memory. The guest code is implemented in perf_test_util.c (not
      the calling selftests). The guest code requires a 1 parameter, the
      vcpuid, which has to be set by calling vcpu_args_set(vm, vcpu_id, 1,
      vcpu_id).
      
      Today all of the selftests that use perf_test_util are making this call.
      Instead, perf_test_util should just do it. This will save some code but
      more importantly prevents mistakes since totally non-obvious that this
      needs to be called and failing to do so results in vCPUs not accessing
      the right regions of memory.
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20210805172821.2622793-1-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      32bdc019
    • David Matlack's avatar
      KVM: selftests: Support multiple slots in dirty_log_perf_test · 609e6202
      David Matlack authored
      Introduce a new option to dirty_log_perf_test: -x number_of_slots. This
      causes the test to attempt to split the region of memory into the given
      number of slots. If the region cannot be evenly divided, the test will
      fail.
      
      This allows testing with more than one slot and therefore measure how
      performance scales with the number of memslots.
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-8-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      609e6202
    • David Matlack's avatar
      KVM: x86/mmu: Rename __gfn_to_rmap to gfn_to_rmap · 93e083d4
      David Matlack authored
      gfn_to_rmap was removed in the previous patch so there is no need to
      retain the double underscore on __gfn_to_rmap.
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-7-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      93e083d4
    • David Matlack's avatar
      KVM: x86/mmu: Leverage vcpu->last_used_slot for rmap_add and rmap_recycle · 601f8af0
      David Matlack authored
      rmap_add() and rmap_recycle() both run in the context of the vCPU and
      thus we can use kvm_vcpu_gfn_to_memslot() to look up the memslot. This
      enables rmap_add() and rmap_recycle() to take advantage of
      vcpu->last_used_slot and avoid expensive memslot searching.
      
      This change improves the performance of "Populate memory time" in
      dirty_log_perf_test with tdp_mmu=N. In addition to improving the
      performance, "Populate memory time" no longer scales with the number
      of memslots in the VM.
      
      Command                         | Before           | After
      ------------------------------- | ---------------- | -------------
      ./dirty_log_perf_test -v64 -x1  | 15.18001570s     | 14.99469366s
      ./dirty_log_perf_test -v64 -x64 | 18.71336392s     | 14.98675076s
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-6-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      601f8af0
    • David Matlack's avatar
      KVM: x86/mmu: Leverage vcpu->last_used_slot in tdp_mmu_map_handle_target_level · 081de470
      David Matlack authored
      The existing TDP MMU methods to handle dirty logging are vcpu-agnostic
      since they can be driven by MMU notifiers and other non-vcpu-specific
      events in addition to page faults. However this means that the TDP MMU
      is not benefiting from the new vcpu->last_used_slot. Fix that by
      introducing a tdp_mmu_map_set_spte_atomic() which is only called during
      a TDP page fault and has access to the kvm_vcpu for fast slot lookups.
      
      This improves "Populate memory time" in dirty_log_perf_test by 5%:
      
      Command                         | Before           | After
      ------------------------------- | ---------------- | -------------
      ./dirty_log_perf_test -v64 -x64 | 5.472321072s     | 5.169832886s
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-5-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      081de470
    • David Matlack's avatar
      KVM: Cache the last used slot index per vCPU · fe22ed82
      David Matlack authored
      The memslot for a given gfn is looked up multiple times during page
      fault handling. Avoid binary searching for it multiple times by caching
      the most recently used slot. There is an existing VM-wide last_used_slot
      but that does not work well for cases where vCPUs are accessing memory
      in different slots (see performance data below).
      
      Another benefit of caching the most recently use slot (versus looking
      up the slot once and passing around a pointer) is speeding up memslot
      lookups *across* faults and during spte prefetching.
      
      To measure the performance of this change I ran dirty_log_perf_test with
      64 vCPUs and 64 memslots and measured "Populate memory time" and
      "Iteration 2 dirty memory time".  Tests were ran with eptad=N to force
      dirty logging to use fast_page_fault so its performance could be
      measured.
      
      Config     | Metric                        | Before | After
      ---------- | ----------------------------- | ------ | ------
      tdp_mmu=Y  | Populate memory time          | 6.76s  | 5.47s
      tdp_mmu=Y  | Iteration 2 dirty memory time | 2.83s  | 0.31s
      tdp_mmu=N  | Populate memory time          | 20.4s  | 18.7s
      tdp_mmu=N  | Iteration 2 dirty memory time | 2.65s  | 0.30s
      
      The "Iteration 2 dirty memory time" results are especially compelling
      because they are equivalent to running the same test with a single
      memslot. In other words, fast_page_fault performance no longer scales
      with the number of memslots.
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-4-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fe22ed82
    • David Matlack's avatar
      KVM: Move last_used_slot logic out of search_memslots · 0f22af94
      David Matlack authored
      Make search_memslots unconditionally search all memslots and move the
      last_used_slot logic up one level to __gfn_to_memslot. This is in
      preparation for introducing a per-vCPU last_used_slot.
      
      As part of this change convert existing callers of search_memslots to
      __gfn_to_memslot to avoid making any functional changes.
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-3-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0f22af94
    • David Matlack's avatar
      KVM: Rename lru_slot to last_used_slot · 87689270
      David Matlack authored
      lru_slot is used to keep track of the index of the most-recently used
      memslot. The correct acronym would be "mru" but that is not a common
      acronym. So call it last_used_slot which is a bit more obvious.
      Suggested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarDavid Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-2-dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      87689270
  4. 05 Aug, 2021 2 commits
    • Sean Christopherson's avatar
      KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds · d5aaad6f
      Sean Christopherson authored
      Take a signed 'long' instead of an 'unsigned long' for the number of
      pages to add/subtract to the total number of pages used by the MMU.  This
      fixes a zero-extension bug on 32-bit kernels that effectively corrupts
      the per-cpu counter used by the shrinker.
      
      Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
      kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
      an unsigned 32-bit value on 32-bit kernels.  As a result, the value used
      to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
      sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
      4294967295 and effectively corrupts the counter.
      
      This was found by a staggering amount of sheer dumb luck when running
      kvm-unit-tests on a 32-bit KVM build.  The shrinker just happened to kick
      in while running tests and do_shrink_slab() logged an error about trying
      to free a negative number of objects.  The truly lucky part is that the
      kernel just happened to be a slightly stale build, as the shrinker no
      longer yells about negative objects as of commit 18bb473e ("mm:
      vmscan: shrink deferred objects proportional to priority").
      
       vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460
      
      Fixes: bc8a3d89 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20210804214609.1096003-1-seanjc@google.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d5aaad6f
    • Paolo Bonzini's avatar
      KVM: xen: do not use struct gfn_to_hva_cache · 319afe68
      Paolo Bonzini authored
      gfn_to_hva_cache is not thread-safe, so it is usually used only within
      a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
      implementation has such a cache in kvm->arch, but it is not really
      used except to store the location of the shared info page.  Replace
      shinfo_set and shinfo_cache with just the value that is passed via
      KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
      initialization value is not zero anymore and therefore kvm_xen_init_vm
      needs to be introduced.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      319afe68
  5. 04 Aug, 2021 8 commits
    • Maxim Levitsky's avatar
      KVM: selftests: fix hyperv_clock test · 13c2c3cf
      Maxim Levitsky authored
      The test was mistakenly using addr_gpa2hva on a gva and that happened
      to work accidentally.  Commit 106a2e76 ("KVM: selftests: Lower the
      min virtual address for misc page allocations") revealed this bug.
      
      Fixes: 2c7f76b4 ("selftests: kvm: Add basic Hyper-V clocksources tests", 2021-03-18)
      Signed-off-by: default avatarMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210804112057.409498-1-mlevitsk@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      13c2c3cf
    • Mingwei Zhang's avatar
      KVM: SVM: improve the code readability for ASID management · bb2baeb2
      Mingwei Zhang authored
      KVM SEV code uses bitmaps to manage ASID states. ASID 0 was always skipped
      because it is never used by VM. Thus, in existing code, ASID value and its
      bitmap postion always has an 'offset-by-1' relationship.
      
      Both SEV and SEV-ES shares the ASID space, thus KVM uses a dynamic range
      [min_asid, max_asid] to handle SEV and SEV-ES ASIDs separately.
      
      Existing code mixes the usage of ASID value and its bitmap position by
      using the same variable called 'min_asid'.
      
      Fix the min_asid usage: ensure that its usage is consistent with its name;
      allocate extra size for ASID 0 to ensure that each ASID has the same value
      with its bitmap position. Add comments on ASID bitmap allocation to clarify
      the size change.
      Signed-off-by: default avatarMingwei Zhang <mizhang@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Alper Gun <alpergun@google.com>
      Cc: Dionna Glaze <dionnaglaze@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vipin Sharma <vipinsh@google.com>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Message-Id: <20210802180903.159381-1-mizhang@google.com>
      [Fix up sev_asid_free to also index by ASID, as suggested by Sean
       Christopherson, and use nr_asids in sev_cpu_init. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bb2baeb2
    • Sean Christopherson's avatar
      KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB · 179c6c27
      Sean Christopherson authored
      Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
      freeing an SEV ASID.  The consumer, pre_sev_run(), indexes the array by
      the raw ASID, thus KVM could get a false negative when checking for a
      different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
      a new VM.
      
      Note, this cannot cause a functional issue _in the current code_, as
      pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
      last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
      guaranteed to mismatch on the first VMRUN.  However, prior to commit
      8a14fe4f ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
      last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
      last_cpu variable.  Thus it's theoretically possible that older versions
      of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
      and VMCB exactly match those of a prior VM.
      
      Fixes: 70cd94e6 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      179c6c27
    • Paolo Bonzini's avatar
      KVM: Do not leak memory for duplicate debugfs directories · 85cd39af
      Paolo Bonzini authored
      KVM creates a debugfs directory for each VM in order to store statistics
      about the virtual machine.  The directory name is built from the process
      pid and a VM fd.  While generally unique, it is possible to keep a
      file descriptor alive in a way that causes duplicate directories, which
      manifests as these messages:
      
        [  471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!
      
      Even though this should not happen in practice, it is more or less
      expected in the case of KVM for testcases that call KVM_CREATE_VM and
      close the resulting file descriptor repeatedly and in parallel.
      
      When this happens, debugfs_create_dir() returns an error but
      kvm_create_vm_debugfs() goes on to allocate stat data structs which are
      later leaked.  The slow memory leak was spotted by syzkaller, where it
      caused OOM reports.
      
      Since the issue only affects debugfs, do a lookup before calling
      debugfs_create_dir, so that the message is downgraded and rate-limited.
      While at it, ensure kvm->debugfs_dentry is NULL rather than an error
      if it is not created.  This fixes kvm_destroy_vm_debugfs, which was not
      checking IS_ERR_OR_NULL correctly.
      
      Cc: stable@vger.kernel.org
      Fixes: 536a6f88 ("KVM: Create debugfs dir and stat files for each VM")
      Reported-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Suggested-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      85cd39af
    • Like Xu's avatar
      KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces · e79f49c3
      Like Xu authored
      Based on our observations, after any vm-exit associated with vPMU, there
      are at least two or more perf interfaces to be called for guest counter
      emulation, such as perf_event_{pause, read_value, period}(), and each one
      will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes
      more severe when guest use counters in a multiplexed manner.
      
      Holding a lock once and completing the KVM request operations in the perf
      context would introduce a set of impractical new interfaces. So we can
      further optimize the vPMU implementation by avoiding repeated calls to
      these interfaces in the KVM context for at least one pattern:
      
      After we call perf_event_pause() once, the event will be disabled and its
      internal count will be reset to 0. So there is no need to pause it again
      or read its value. Once the event is paused, event period will not be
      updated until the next time it's resumed or reprogrammed. And there is
      also no need to call perf_event_period twice for a non-running counter,
      considering the perf_event for a running counter is never paused.
      
      Based on this implementation, for the following common usage of
      sampling 4 events using perf on a 4u8g guest:
      
        echo 0 > /proc/sys/kernel/watchdog
        echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
        echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
        echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
        for i in `seq 1 1 10`
        do
        taskset -c 0 perf record \
        -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
        /root/br_instr a
        done
      
      the average latency of the guest NMI handler is reduced from
      37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
      Also, in addition to collecting more samples, no loss of sampling
      accuracy was observed compared to before the optimization.
      Signed-off-by: default avatarLike Xu <likexu@tencent.com>
      Message-Id: <20210728120705.6855-1-likexu@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      e79f49c3
    • Peter Xu's avatar
      KVM: X86: Optimize zapping rmap · a75b5404
      Peter Xu authored
      Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
      slow.  The easy way is to travers the rmap list, collecting the a/d bits and
      free the slots along the way.
      
      Provide a pte_list_destroy() and do exactly that.
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Message-Id: <20210730220605.26377-1-peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a75b5404
    • Peter Xu's avatar
      KVM: X86: Optimize pte_list_desc with per-array counter · 13236e25
      Peter Xu authored
      Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
      logic.  E.g., we don't need to loop over the array any more for most reasons.
      
      This will make more sense after we've switched the array size to be larger
      otherwise the counter will be a waste.
      
      Initially I wanted to store a tail pointer at the head of the array list so we
      don't need to traverse the list at least for pushing new ones (if without the
      counter we traverse both the list and the array).  However that'll need
      slightly more change without a huge lot benefit, e.g., after we grow entry
      numbers per array the list traversing is not so expensive.
      
      So let's be simple but still try to get as much benefit as we can with just
      these extra few lines of changes (not to mention the code looks easier too
      without looping over arrays).
      
      I used the same a test case to fork 500 child and recycle them ("./rmap_fork
      500" [1]), this patch further speeds up the total fork time of about 4%, which
      is a total of 33% of vanilla kernel:
      
              Vanilla:      473.90 (+-5.93%)
              3->15 slots:  366.10 (+-4.94%)
              Add counter:  351.00 (+-3.70%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Message-Id: <20210730220602.26327-1-peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      13236e25
    • Peter Xu's avatar
      KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger · dc1cff96
      Peter Xu authored
      Currently rmap array element only contains 3 entries.  However for EPT=N there
      could have a lot of guest pages that got tens of even hundreds of rmap entry.
      
      A normal distribution of a 6G guest (even if idle) shows this with rmap count
      statistics:
      
      Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
      Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
      Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
      Level=1G:       32      0       0       0       0       0       0       0       0       0       0
      
      If we do some more fork some pages will grow even larger rmap counts.
      
      This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general
      use case of EPT=N as we do list reference less and the loops over PTE_LIST_EXT
      will be slightly more efficient; but still not too large so less waste when
      array not full.
      
      It should not affecting EPT=Y since EPT normally only has zero or one rmap
      entry for each page, so no array is even allocated.
      
      With a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]),
      this patch speeds up fork time of about 29%.
      
          Before: 473.90 (+-5.93%)
          After:  366.10 (+-4.94%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-6-peterx@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      dc1cff96
  6. 03 Aug, 2021 7 commits
    • Vitaly Kuznetsov's avatar
      KVM: selftests: Test access to XMM fast hypercalls · 2476b5a1
      Vitaly Kuznetsov authored
      Check that #UD is raised if bit 16 is clear in
      HYPERV_CPUID_FEATURES.EDX and an 'XMM fast' hypercall is issued.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarSiddharth Chandrasekaran <sidcha@amazon.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210730122625.112848-5-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2476b5a1
    • Vitaly Kuznetsov's avatar
      KVM: x86: hyper-v: Check if guest is allowed to use XMM registers for hypercall input · 4e62aa96
      Vitaly Kuznetsov authored
      TLFS states that "Availability of the XMM fast hypercall interface is
      indicated via the “Hypervisor Feature Identification” CPUID Leaf
      (0x40000003, see section 2.4.4) ... Any attempt to use this interface
      when the hypervisor does not indicate availability will result in a #UD
      fault."
      
      Implement the check for 'strict' mode (KVM_CAP_HYPERV_ENFORCE_CPUID).
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarSiddharth Chandrasekaran <sidcha@amazon.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210730122625.112848-4-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4e62aa96
    • Vitaly Kuznetsov's avatar
      KVM: x86: Introduce trace_kvm_hv_hypercall_done() · f5714bbb
      Vitaly Kuznetsov authored
      Hypercall failures are unusual with potentially far going consequences
      so it would be useful to see their results when tracing.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarSiddharth Chandrasekaran <sidcha@amazon.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210730122625.112848-3-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f5714bbb
    • Vitaly Kuznetsov's avatar
      KVM: x86: hyper-v: Check access to hypercall before reading XMM registers · 2e2f1e8d
      Vitaly Kuznetsov authored
      In case guest doesn't have access to the particular hypercall we can avoid
      reading XMM registers.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: default avatarSiddharth Chandrasekaran <sidcha@amazon.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210730122625.112848-2-vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2e2f1e8d
    • Hamza Mahfooz's avatar
      KVM: const-ify all relevant uses of struct kvm_memory_slot · 269e9552
      Hamza Mahfooz authored
      As alluded to in commit f36f3f28 ("KVM: add "new" argument to
      kvm_arch_commit_memory_region"), a bunch of other places where struct
      kvm_memory_slot is used, needs to be refactored to preserve the
      "const"ness of struct kvm_memory_slot across-the-board.
      Signed-off-by: default avatarHamza Mahfooz <someguy@effective-light.com>
      Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
      [Do not touch body of slot_rmap_walk_init. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      269e9552
    • Paolo Bonzini's avatar
      KVM: Don't take mmu_lock for range invalidation unless necessary · 071064f1
      Paolo Bonzini authored
      Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications
      that are unrelated to KVM.  This is possible now that memslot updates are
      blocked from range_start() to range_end(); that ensures that lock elision
      happens in both or none, and therefore that mmu_notifier_count updates
      (which must occur while holding mmu_lock for write) are always paired
      across start->end.
      
      Based on patches originally written by Ben Gardon.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      071064f1
    • Paolo Bonzini's avatar
      KVM: Block memslot updates across range_start() and range_end() · 52ac8b35
      Paolo Bonzini authored
      We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
      notifications that are unrelated to KVM.  Because mmu_notifier_count
      must be modified while holding mmu_lock for write, and must always
      be paired across start->end to stay balanced, lock elision must
      happen in both or none.  Therefore, in preparation for this change,
      this patch prevents memslot updates across range_start() and range_end().
      
      Note, technically flag-only memslot updates could be allowed in parallel,
      but stalling a memslot update for a relatively short amount of time is
      not a scalability issue, and this is all more than complex enough.
      
      A long note on the locking: a previous version of the patch used an rwsem
      to block the memslot update while the MMU notifier run, but this resulted
      in the following deadlock involving the pseudo-lock tagged as
      "mmu_notifier_invalidate_range_start".
      
         ======================================================
         WARNING: possible circular locking dependency detected
         5.12.0-rc3+ #6 Tainted: G           OE
         ------------------------------------------------------
         qemu-system-x86/3069 is trying to acquire lock:
         ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
      
         but task is already holding lock:
         ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
      
         which lock already depends on the new lock.
      
      This corresponds to the following MMU notifier logic:
      
          invalidate_range_start
            take pseudo lock
            down_read()           (*)
            release pseudo lock
          invalidate_range_end
            take pseudo lock      (**)
            up_read()
            release pseudo lock
      
      At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
      at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
      
      This could cause a deadlock (ignoring for a second that the pseudo lock
      is not a lock):
      
      - invalidate_range_start waits on down_read(), because the rwsem is
      held by install_new_memslots
      
      - install_new_memslots waits on down_write(), because the rwsem is
      held till (another) invalidate_range_end finishes
      
      - invalidate_range_end sits waits on the pseudo lock, held by
      invalidate_range_start.
      
      Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
      it would change the *shared* rwsem readers into *shared recursive*
      readers), so open-code the wait using a readers count and a
      spinlock.  This also allows handling blockable and non-blockable
      critical section in the same way.
      
      Losing the rwsem fairness does theoretically allow MMU notifiers to
      block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
      retry scheme in mmu_interval_read_begin also uses wait/wake_up
      and is likewise not fair.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      52ac8b35
  7. 02 Aug, 2021 9 commits