1. 17 Nov, 2022 14 commits
  2. 16 Nov, 2022 14 commits
  3. 09 Nov, 2022 12 commits
    • KVM: replace direct irq.h inclusion · d663b8a2
      Paolo Bonzini authored
      virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source
      directory (i.e. not from arch/*/include) for the sole purpose of retrieving
      irqchip_in_kernel.
      
      Making the function inline in a header that is already included,
      such as asm/kvm_host.h, is not possible because it needs to look at
      struct kvm which is defined after asm/kvm_host.h is included.  So add a
      kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is
      only performance critical on arm64 and x86, and the non-inline function
      is enough on all other architectures.
      
      irq.h can then be deleted from all architectures except x86.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
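The ordering problem described in this commit can be illustrated with a minimal, self-contained sketch. Generic code sees only a forward declaration of struct kvm and calls an out-of-line hook, while the arch side keeps its inline helper. The field `has_in_kernel_irqchip` and the caller `generic_irq_route_check()` are hypothetical names for illustration, not the kernel's actual definitions.

```c
#include <stdbool.h>

/* Generic side: only a forward declaration of struct kvm is visible here. */
struct kvm;

/* Out-of-line hook that each architecture provides. */
bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);

/* Hypothetical generic caller standing in for code in virt/kvm/irqchip.c. */
static int generic_irq_route_check(struct kvm *kvm)
{
	/* The layout of struct kvm is not needed to ask the question. */
	return kvm_arch_irqchip_in_kernel(kvm) ? 0 : -1;
}

/* Arch side: the full definition arrives later. The field is made up. */
struct kvm {
	bool has_in_kernel_irqchip;
};

/* The arch keeps a fast inline helper for its own hot paths. */
static inline bool irqchip_in_kernel(struct kvm *kvm)
{
	return kvm->has_in_kernel_irqchip;
}

/* The non-inline wrapper is all that generic code ever calls. */
bool kvm_arch_irqchip_in_kernel(struct kvm *kvm)
{
	return irqchip_in_kernel(kvm);
}
```

The extra function call is paid only by generic callers; arch code that cares about performance can keep using the inline helper directly.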
    • KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter · de0f6195
      Like Xu authored
      Defer reprogramming counters and handling overflow via KVM_REQ_PMU
      when incrementing counters.  KVM skips emulated WRMSR in the VM-Exit
      fastpath, the fastpath runs with IRQs disabled, skipping instructions
      can increment and reprogram counters, reprogramming counters can
      sleep, and sleeping is disallowed while IRQs are disabled.
      
       [*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
       [*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM
       [*] preempt_count: 1, expected: 0
       [*] RCU nest depth: 0, expected: 0
       [*] INFO: lockdep is turned off.
       [*] irq event stamp: 0
       [*] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       [*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0
       [*] softirqs last  enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0
       [*] softirqs last disabled at (0): [<0000000000000000>] 0x0
       [*] Preemption disabled at:
       [*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm]
       [*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2
       [*] Call Trace:
       [*]  <TASK>
       [*]  dump_stack_lvl+0x6c/0x9b
       [*]  __might_resched.cold+0x22e/0x297
       [*]  __mutex_lock+0xc0/0x23b0
       [*]  perf_event_ctx_lock_nested+0x18f/0x340
       [*]  perf_event_pause+0x1a/0x110
       [*]  reprogram_counter+0x2af/0x1490 [kvm]
       [*]  kvm_pmu_trigger_event+0x429/0x950 [kvm]
       [*]  kvm_skip_emulated_instruction+0x48/0x90 [kvm]
       [*]  handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm]
       [*]  vmx_vcpu_run+0x268e/0x3b80 [kvm_intel]
       [*]  vcpu_enter_guest+0x1d22/0x3dc0 [kvm]
      
      Add a field to kvm_pmc to track the previous counter value in order
      to defer overflow detection to kvm_pmu_handle_event() (the counter must
      be paused before handling overflow, and that may increment the counter).
      
      Opportunistically shrink sizeof(struct kvm_pmc) a bit.
      
      Suggested-by: Wanpeng Li <wanpengli@tencent.com>
      Fixes: 9cd803d4 ("KVM: x86: Update vPMCs when retiring instructions")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Link: https://lore.kernel.org/r/20220831085328.45489-6-likexu@tencent.com
      [sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220923001355.3741194-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
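A rough sketch of the deferral idea follows, under the assumption that overflow is detected by comparing the wrapped counter against the saved previous value. The struct layout and the names `pmc_increment()` and `pmc_handle_event()` are illustrative only and do not reproduce the kernel's kvm_pmc or its handlers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a virtual PMU counter (fields are assumptions). */
struct pmc {
	uint64_t counter;
	uint64_t prev_counter;   /* value before the most recent increment */
	uint64_t mask;           /* counter width expressed as a bitmask */
	bool reprogram_pending;  /* models a pending deferred request */
	unsigned int overflows;  /* how many overflows were handled */
};

/* Safe in an atomic context: only bumps the value and records a request. */
static void pmc_increment(struct pmc *pmc)
{
	pmc->prev_counter = pmc->counter;
	pmc->counter = (pmc->counter + 1) & pmc->mask;
	pmc->reprogram_pending = true;
}

/* Runs later, where sleeping is allowed: detect the wrap and handle it. */
static void pmc_handle_event(struct pmc *pmc)
{
	if (!pmc->reprogram_pending)
		return;
	pmc->reprogram_pending = false;

	/* A counter smaller than its previous value means it wrapped. */
	if (pmc->counter < pmc->prev_counter)
		pmc->overflows++;

	pmc->prev_counter = 0;
}
```

The point of the split is that the increment path never does anything that could block; all heavier work waits for the deferred handler.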
    • KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event() · 68fb4757
      Like Xu authored
      Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus
      deferring reprogramming to kvm_pmu_handle_event(), to avoid reprogramming
      a counter multiple times during a single VM-Exit.
      
      Deferring programming will also allow KVM to fix a bug where immediately
      reprogramming a counter can result in sleeping (taking a mutex) while
      interrupts are disabled in the VM-Exit fastpath.
      
      Introduce kvm_pmu_request_counter_reprogam() to make it obvious that
      KVM is _requesting_ a reprogram and not actually doing the reprogram.
      
      Opportunistically refine related comments to avoid misunderstandings.
      
      Signed-off-by: Like Xu <likexu@tencent.com>
      Link: https://lore.kernel.org/r/20220831085328.45489-5-likexu@tencent.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220923001355.3741194-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
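The batching described here can be sketched as a pending bitmap plus a single deferred pass. This is a single-threaded illustration with made-up names (`struct pmu`, `request_counter_reprogram()`, `handle_event()`); the kernel's real helpers, locking and atomics are not reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_COUNTERS 64

/* Simplified PMU state; all fields are assumptions for illustration. */
struct pmu {
	uint64_t reprogram_pmi;                    /* one pending bit per counter */
	bool request_pending;                      /* models a pending request */
	unsigned int reprogram_calls[NR_COUNTERS]; /* reprograms done per counter */
};

/* Requesting is cheap and idempotent: repeated requests collapse to one bit. */
static void request_counter_reprogram(struct pmu *pmu, unsigned int idx)
{
	pmu->reprogram_pmi |= 1ULL << idx;
	pmu->request_pending = true;
}

/* Deferred handler: reprogram each pending counter exactly once. */
static unsigned int handle_event(struct pmu *pmu)
{
	uint64_t pending = pmu->reprogram_pmi;
	unsigned int done = 0;

	pmu->reprogram_pmi = 0;
	pmu->request_pending = false;

	for (unsigned int idx = 0; idx < NR_COUNTERS; idx++) {
		if (pending & (1ULL << idx)) {
			pmu->reprogram_calls[idx]++; /* stands in for the real work */
			done++;
		}
	}
	return done;
}
```

Three requests for the same counter during one exit cost a single reprogram in the handler, which is the saving the commit message describes.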
    • KVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowed · dcbb816a
      Sean Christopherson authored
      When reprogramming a counter, clear the counter's "reprogram pending" bit
      if the counter is disabled (by the guest) or is disallowed (by the
      userspace filter).  In both cases, there's no need to re-attempt
      programming on the next coincident KVM_REQ_PMU as enabling the counter by
      either method will trigger reprogramming.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220923001355.3741194-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/pmu: Force reprogramming of all counters on PMU filter change · f1c5651f
      Sean Christopherson authored
      Force vCPUs to reprogram all counters on a PMU filter change to provide
      a sane ABI for userspace.  Use the existing KVM_REQ_PMU to do the
      programming, and take advantage of the fact that the reprogram_pmi bitmap
      fits in a u64 to set all bits in a single atomic update.  Note, setting
      the bitmap and making the request needs to be done _after_ the SRCU
      synchronization to ensure that vCPUs will reprogram using the new filter.
      
      KVM's current "lazy" approach is confusing and non-deterministic.  It's
      confusing because, from a developer perspective, the code is buggy as it
      makes zero sense to let userspace modify the filter but then not actually
      enforce the new filter.  The lazy approach is non-deterministic because
      KVM enforces the filter whenever a counter is reprogrammed, not just on
      guest WRMSRs, i.e. a guest might gain/lose access to an event at random
      times depending on what is going on in the host.
      
      Note, the resulting behavior is still non-deterministic while the filter
      is in flux.  If userspace wants to guarantee deterministic behavior, all
      vCPUs should be paused during the filter update.
      
      Jim Mattson <jmattson@google.com>
      
      Fixes: 66bb8a06 ("KVM: x86: PMU Event Filter")
      Cc: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220923001355.3741194-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
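The ordering requirement in this commit (publish the new filter, wait for readers of the old one, then mark every counter pending in one store) might look roughly like the following C11 sketch. The types, `NR_VCPUS`, and the no-op `wait_for_filter_readers()` are placeholders; the latter merely marks where the kernel performs its SRCU synchronization.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_VCPUS 4

/* Per-vCPU PMU state reduced to what the sketch needs (assumed layout). */
struct vcpu {
	_Atomic uint64_t reprogram_pmi; /* one pending bit per counter */
	atomic_bool pmu_request;        /* models a pending PMU request */
};

struct vm {
	const void *_Atomic filter;     /* currently published filter */
	struct vcpu vcpus[NR_VCPUS];
};

/* Placeholder for waiting until no reader can still see the old filter. */
static void wait_for_filter_readers(struct vm *vm)
{
	(void)vm;
}

static void set_pmu_filter(struct vm *vm, const void *new_filter)
{
	atomic_store(&vm->filter, new_filter);
	wait_for_filter_readers(vm);

	/* Only now mark counters pending, so reprogramming uses the new filter. */
	for (int i = 0; i < NR_VCPUS; i++) {
		/* The bitmap fits in 64 bits, so one store sets every bit. */
		atomic_store(&vm->vcpus[i].reprogram_pmi, UINT64_MAX);
		atomic_store(&vm->vcpus[i].pmu_request, true);
	}
}
```

If the bits were set before the wait, a vCPU could consume the request and reprogram against the old filter, which is the case the commit message warns about.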
    • KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zapped · 3a056757
      Sean Christopherson authored
      Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the
      TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed
      NX huge page regardless of the MMU type.  Recovery runs while holding
      mmu_lock for write and so it should be impossible to get false positives
      on the WARN.
      
      Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221019165618.927057-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() · 76901e56
      Mingwei Zhang authored
      Explicitly check if a NX huge page is disallowed when determining if a
      page fault needs to be forced to use a smaller sized page.  KVM currently
      assumes that the NX huge page mitigation is the only scenario where KVM
      will force a shadow page instead of a huge page, and so unnecessarily
      keeps an existing shadow page instead of replacing it with a huge page.
      
      Any scenario that causes KVM to zap leaf SPTEs may result in having a SP
      that can be made huge without violating the NX huge page mitigation.
      E.g. prior to commit 5ba7c4c6 ("KVM: x86/MMU: Zap non-leaf SPTEs when
      disabling dirty logging"), KVM would keep shadow pages after disabling
      dirty logging due to a live migration being canceled, resulting in
      degraded performance due to running with 4KiB pages instead of huge pages.
      
      Although the dirty logging case is "fixed", that fix is coincidental,
      i.e. is an implementation detail, and there are other scenarios where KVM
      will zap leaf SPTEs.  E.g. zapping leaf SPTEs in response to a host page
      migration (mmu_notifier invalidation) to create a huge page would yield a
      similar result; KVM would see the shadow-present non-leaf SPTE and assume
      a huge page is disallowed.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      [sean: use spte_to_child_sp(), massage changelog, fold into if-statement]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-Id: <20221019165618.927057-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add helper to convert SPTE value to its shadow page · 5e3edd7e
      Sean Christopherson authored
      Add a helper to convert a SPTE to its shadow page to deduplicate a
      variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
      get the shadow page for a SPTE without dropping high bits.
      
      Opportunistically add a comment in mmu_free_root_page() documenting why
      it treats the root HPA as a SPTE.
      
      No functional change intended.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221019165618.927057-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
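The bug class this helper guards against can be shown with a small user-space model: an SPTE carries flag bits above and below the physical address, so the address must be masked out before it is used to find the child page. The mask value (bits 51:12), the frame table and the struct contents are assumptions for illustration; only the helper name `spte_to_child_sp()` comes from the commit.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
/* Illustrative mask covering physical address bits 51:12 (an assumption). */
#define SPTE_ADDR_MASK (((1ULL << 52) - 1) & ~((1ULL << PAGE_SHIFT) - 1))
#define NR_FRAMES 16

/* Stand-in for the shadow page metadata. */
struct shadow_page {
	int level;
};

/* Toy lookup table: one shadow page descriptor per physical frame. */
static struct shadow_page frame_to_sp[NR_FRAMES];

/* Expects a clean physical address; returns NULL if it is out of range. */
static struct shadow_page *to_shadow_page(uint64_t pa)
{
	uint64_t frame = pa >> PAGE_SHIFT;

	return frame < NR_FRAMES ? &frame_to_sp[frame] : NULL;
}

/* The helper: always drop the non-address bits before the lookup. */
static struct shadow_page *spte_to_child_sp(uint64_t spte)
{
	return to_shadow_page(spte & SPTE_ADDR_MASK);
}
```

Passing a raw SPTE with a high bit set straight to `to_shadow_page()` misses the table in this model, whereas going through the helper finds the intended page.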
    • KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages · d25ceb92
      Sean Christopherson authored
      Track the number of TDP MMU "shadow" pages instead of tracking the pages
      themselves. With the NX huge page list manipulation moved out of the common
      linking flow, eliminating the list-based tracking means the happy path of
      adding a shadow page doesn't need to acquire a spinlock and can instead
      inc/dec an atomic.
      
      Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
      pages is very, very useful for detecting KVM bugs.
      
      Tracking the number of pages will also make it trivial to expose the
      counter to userspace as a stat in the future, which may or may not be
      desirable.
      
      Note, the TDP MMU needs to use a separate counter (and stat if that ever
      comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
      supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
      TDP MMU consumes so few pages relative to shadow paging), and including TDP
      MMU pages in that counter would break both the shrinker and shadow MMUs,
      e.g. if a VM is using nested TDP.
      
      Cc: Yan Zhao <yan.y.zhao@intel.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-Id: <20221019165618.927057-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
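Replacing a lock-protected list with a counter can be sketched in a few lines of C11. The struct and function names below are hypothetical, and the kernel uses its own atomic types rather than `<stdatomic.h>`; the sketch only shows why the common path no longer needs a lock while a leak check at teardown remains possible.

```c
#include <stdatomic.h>

/* Hypothetical per-VM TDP MMU state: a count of pages, not a list of them. */
struct tdp_mmu {
	atomic_long pages;
};

/* Linking a new page table page: a single lock-free increment. */
static void tdp_account_page(struct tdp_mmu *mmu)
{
	atomic_fetch_add(&mmu->pages, 1);
}

/* Freeing a page table page: the matching decrement. */
static void tdp_unaccount_page(struct tdp_mmu *mmu)
{
	atomic_fetch_sub(&mmu->pages, 1);
}

/* At teardown, any non-zero value indicates leaked pages. */
static long tdp_leaked_pages(struct tdp_mmu *mmu)
{
	return atomic_load(&mmu->pages);
}
```

The counter cannot say which page leaked, only that one did, which matches the trade-off in the commit title: track the number of pages, but not the pages themselves.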
    • KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE · 61f94478
      Sean Christopherson authored
      Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
      visible to other readers, i.e. before setting its SPTE.  This will allow
      KVM to query the flag when determining if a shadow page can be replaced
      by a NX huge page without violating the rules of the mitigation.
      
      Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
      for another CPU to see a shadow page without an up-to-date
      nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
      dance.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-Id: <20221019165618.927057-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
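The "initialize, then publish" ordering can be modelled with C11 release/acquire atomics. This is an analogy under stated assumptions: the kernel publishes the shadow page by writing an SPTE using its own primitives, whereas the sketch uses a plain atomic pointer slot and invented function names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Only the flag from the commit message is modelled. */
struct shadow_page {
	bool nx_huge_page_disallowed;
};

/* Stand-in for the SPTE slot through which readers find the page. */
typedef _Atomic(struct shadow_page *) spte_slot_t;

/* Writer: fill in the flag first, then make the page reachable. */
static void link_sp(spte_slot_t *slot, struct shadow_page *sp, bool disallowed)
{
	sp->nx_huge_page_disallowed = disallowed;
	/* Release ordering: the flag write is visible before the pointer is. */
	atomic_store_explicit(slot, sp, memory_order_release);
}

/* Reader: anyone who sees the pointer also sees an up-to-date flag. */
static bool slot_disallows_huge_page(spte_slot_t *slot)
{
	struct shadow_page *sp =
		atomic_load_explicit(slot, memory_order_acquire);

	return sp && sp->nx_huge_page_disallowed;
}
```

Setting the flag after the store to the slot would open a window in which a concurrent reader finds the page but reads a stale flag, which is why the order matters when the lock is not held exclusively.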
    • KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs · b5b0977f
      Sean Christopherson authored
      Account and track NX huge pages for nonpaging MMUs so that a future
      enhancement to precisely check if a shadow page can't be replaced by a NX
      huge page doesn't get false positives.  Without correct tracking, KVM can
      get stuck in a loop if an instruction is fetching and writing data on the
      same huge page, e.g. KVM installs a small executable page on the fetch
      fault, replaces it with an NX huge page on the write fault, and faults
      again on the fetch.
      
      Alternatively, and perhaps ideally, KVM would simply not enforce the
      workaround for nonpaging MMUs.  The guest has no page tables to abuse
      and KVM is guaranteed to switch to a different MMU on CR0.PG being
      toggled, so there are no security or performance concerns.  However, getting
      make_spte() to play nice now and in the future is unnecessarily complex.
      
      In the current code base, make_spte() can enforce the mitigation if TDP
      is enabled or the MMU is indirect, but make_spte() may not always have a
      vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
      promotion when disabling dirty logging.
      
      Without a vCPU/MMU, KVM could either pass in the correct information
      and/or derive it from the shadow page, but the former is ugly and the
      latter subtly non-trivial due to the possibility of direct shadow pages
      in indirect MMUs.  Given that using shadow paging with an unpaged guest
      is far from top priority _and_ has been subjected to the workaround since
      its inception, keep it simple and just fix the accounting glitch.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20221019165618.927057-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename NX huge pages fields/functions for consistency · 55c510e2
      Sean Christopherson authored
      Rename most of the variables/functions involved in the NX huge page
      mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
      vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
      applies only to the NX huge page mitigation, not to any condition that
      prevents creating a huge page.
      
      Add a comment explaining what the newly named "possible_nx_huge_pages"
      tracks.
      
      Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
      stone.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20221019165618.927057-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>