1. 11 Jul, 2022 22 commits
  2. 24 Jun, 2022 18 commits
    • KVM: selftests: Enhance handling WRMSR ICR register in x2APIC mode · 4b88b1a5
      Zeng Guang authored
      Hardware may write the x2APIC ICR register directly instead of going
      through software emulation in some circumstances, e.g. when Intel IPI
      virtualization is enabled. In that case the normal reserved-bit check
      applies: the reserved bits must be written as zero, otherwise the write
      causes #GP. So mask out those reserved bits from the data written to
      the vICR register.
      
      Remove the Delivery Status bit emulation from the test case, as this
      flag is invalid and not needed in x2APIC mode. KVM may skip clearing
      it during interrupt dispatch, which would lead to spurious test failures.
      
      Opportunistically correct the vector number for the test that sends an
      IPI to non-existent vCPUs.
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220623094511.26066-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
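
      The reserved-bit masking described in this commit can be sketched as
      follows. This is a minimal illustration, not the selftest's code: the
      helper names are hypothetical, and the reserved ranges (31:20, 17:16
      and 13:12 of the low half) are an assumption based on the x2APIC ICR
      layout in the Intel SDM.

```c
#include <stdint.h>

/* Build a 64-bit mask covering bits hi..lo inclusive. */
#define GENMASK64(hi, lo) \
	((~UINT64_C(0) >> (63 - (hi))) & (~UINT64_C(0) << (lo)))

/*
 * Assumed reserved bits in the x2APIC ICR: 31:20, 17:16 and 13:12.
 * Bit 12 is the xAPIC Delivery Status flag, which has no meaning in
 * x2APIC mode.
 */
#define X2APIC_ICR_RSVD_MASK \
	(GENMASK64(31, 20) | GENMASK64(17, 16) | GENMASK64(13, 12))

/* Clear the reserved bits so a hardware-checked WRMSR would not #GP. */
uint64_t x2apic_icr_sanitize(uint64_t val)
{
	return val & ~X2APIC_ICR_RSVD_MASK;
}
```

      The destination field in bits 63:32 and the vector, delivery mode and
      shorthand fields pass through unchanged; only the reserved bits are
      cleared.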
    • KVM: selftests: Add a self test for CMCI and UCNA emulations. · eede2065
      Jue Wang authored
      This patch adds a self test that verifies user space can inject
      UnCorrectable No Action required (UCNA) memory errors to the guest.
      It also verifies that incorrectly configured MSRs for Corrected
      Machine Check Interrupt (CMCI) emulation will result in #GP.
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-9-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Enable CMCI capability by default and handle injected UCNA errors · aebc3ca1
      Jue Wang authored
      This patch enables MCG_CMCI_P by default in kvm_mce_cap_supported. It
      reuses ioctl KVM_X86_SET_MCE to implement injection of UnCorrectable
      No Action required (UCNA) errors, signaled via Corrected Machine
      Check Interrupt (CMCI).
      
      Neither the CMCI emulation nor the UCNA emulation depends on hardware.
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-8-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs. · 281b5278
      Jue Wang authored
      This patch adds the emulation of IA32_MCi_CTL2 registers to KVM. A
      separate mci_ctl2_banks array is used to keep the existing mce_banks
      register layout intact.
      
      In Machine Check Architecture, in addition to MCG_CMCI_P, bit 30 of
      the per-bank register IA32_MCi_CTL2 controls whether Corrected Machine
      Check error reporting is enabled.
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-7-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
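
      The two-level enable described in this commit (a global MCG_CMCI_P
      capability plus bit 30 of the per-bank IA32_MCi_CTL2 register) could be
      modelled roughly like this. The struct, the bank count and the helper
      name are hypothetical; only the bit positions and the separate
      mci_ctl2_banks array come from the architecture and the commit text.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_CMCI_P       (UINT64_C(1) << 10) /* IA32_MCG_CAP: CMCI supported */
#define MCI_CTL2_CMCI_EN (UINT64_C(1) << 30) /* IA32_MCi_CTL2: CMCI enabled */
#define MAX_BANKS        32                  /* hypothetical bank count */

/* Hypothetical per-vCPU machine check state. */
struct mce_state {
	uint64_t mcg_cap;
	uint64_t mci_ctl2_banks[MAX_BANKS]; /* kept apart from the other bank registers */
};

/* CMCI is signalled for a bank only if both the capability and the bank bit are set. */
bool cmci_enabled(const struct mce_state *s, unsigned int bank)
{
	if (bank >= MAX_BANKS)
		return false;
	return (s->mcg_cap & MCG_CMCI_P) &&
	       (s->mci_ctl2_banks[bank] & MCI_CTL2_CMCI_EN);
}
```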
    • KVM: x86: Use kcalloc to allocate the mce_banks array. · 087acc4e
      Jue Wang authored
      This patch updates the allocation of mce_banks with the array allocation
      API (kcalloc) as a precedent for the later mci_ctl2_banks to implement
      per-bank control of Corrected Machine Check Interrupt (CMCI).
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-6-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
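
      As a userspace analogy for the array-allocation API, calloc() plays
      the role of kcalloc() here: the element count and element size are
      passed separately and the memory comes back zeroed. The helper name
      and the assumption of four 64-bit registers per bank are illustrative
      only.

```c
#include <stdint.h>
#include <stdlib.h>

#define REGS_PER_BANK 4 /* assumption: CTL, STATUS, ADDR and MISC per bank */

/*
 * Allocate zeroed storage for nr_banks banks. Passing the count and the
 * per-bank size as separate arguments lets the allocator compute the
 * product itself rather than the caller doing an unchecked multiply.
 */
uint64_t *alloc_mce_banks(size_t nr_banks)
{
	return calloc(nr_banks, REGS_PER_BANK * sizeof(uint64_t));
}
```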
    • KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic. · 4b903561
      Jue Wang authored
      This patch calculates the number of LVT entries as part of
      KVM_X86_MCE_SETUP, conditioned on the presence of the MCG_CMCI_P bit
      in MCG_CAP, and stores the result in kvm_lapic. It translates from an
      APIC_LVTx register to an index in the lapic_lvt_entry enum. It extends
      the APIC_LVTx macro, as well as other lapic write/reset handling, to
      support Corrected Machine Check Interrupt.
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-5-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
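
      A rough sketch of the two pieces this commit describes: counting LVT
      entries based on MCG_CMCI_P, and special-casing the CMCI register in
      the index-to-offset macro. The register offsets follow the xAPIC
      register map; the function name is hypothetical and the code is a
      standalone approximation rather than the patch itself.

```c
#include <stdint.h>

#define MCG_CMCI_P   (UINT64_C(1) << 10)
#define APIC_LVTCMCI 0x2f0 /* not contiguous with the other LVT registers */
#define APIC_LVTT    0x320

enum lapic_lvt_entry {
	LVT_TIMER,
	LVT_THERMAL_MONITOR,
	LVT_PERFORMANCE_COUNTER,
	LVT_LINT0,
	LVT_LINT1,
	LVT_ERROR,
	LVT_CMCI,
	KVM_APIC_MAX_NR_LVT_ENTRIES,
};

/* The CMCI register sits below LVTT, so it cannot use the linear formula. */
#define APIC_LVTx(x) ((x) == LVT_CMCI ? APIC_LVTCMCI : APIC_LVTT + 0x10 * (x))

/* One entry fewer when the guest's MCG_CAP lacks MCG_CMCI_P. */
int calc_nr_lvt_entries(uint64_t mcg_cap)
{
	return KVM_APIC_MAX_NR_LVT_ENTRIES - !(mcg_cap & MCG_CMCI_P);
}
```

      Because LVT_CMCI is the last enum value, dropping it from the count
      leaves the indices of the other entries unchanged.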
    • KVM: x86: Add APIC_LVTx() macro. · 987f625e
      Jue Wang authored
      An APIC_LVTx macro is introduced to calculate the APIC_LVTx register
      offset based on the index in the lapic_lvt_entry enum. Later patches
      will extend the APIC_LVTx macro to support the APIC_LVTCMCI register
      in order to implement Corrected Machine Check Interrupt signaling.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-4-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
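
      To illustrate why an index-based macro is convenient, the sketch below
      uses the linear form of APIC_LVTx to mask every LVT register in a
      loop. The register page is modelled as a plain array of 32-bit words,
      and the function name and NR_LVT constant are hypothetical.

```c
#include <stdint.h>

#define APIC_LVTT       0x320
#define APIC_LVT_MASKED (1u << 16)
#define NR_LVT          6 /* timer, thermal, perf counter, LINT0, LINT1, error */

/* LVT registers are spaced 0x10 bytes apart, starting at the timer entry. */
#define APIC_LVTx(x) (APIC_LVTT + 0x10 * (x))

/* page models the 4 KiB APIC register page as 1024 32-bit words. */
void lapic_mask_all_lvts(uint32_t *page)
{
	for (int i = 0; i < NR_LVT; i++)
		page[APIC_LVTx(i) / 4] = APIC_LVT_MASKED;
}
```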
    • KVM: x86: Fill apic_lvt_mask with enums / explicit entries. · 1d8c681f
      Jue Wang authored
      This patch defines a lapic_lvt_entry enum used as explicit indices to
      the apic_lvt_mask array. In later patches an LVT_CMCI entry will be added to
      implement the Corrected Machine Check Interrupt signaling.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-3-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
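
      Indexing the mask array by enum can be expressed with C designated
      initializers, so that each entry is tied to a name rather than to its
      position. The mask values below are assumptions derived from the xAPIC
      LVT bit layout, NR_LVT_ENTRIES and lvt_sanitize() are hypothetical
      names, and the table is a simplified stand-in for the real one.

```c
#include <stdint.h>

#define APIC_VECTOR_MASK 0x000ffu
#define APIC_MODE_MASK   0x00700u
#define APIC_SEND_PENDING (1u << 12)
#define APIC_LVT_MASKED   (1u << 16)

#define LVT_MASK  (APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK)
/* LINT entries additionally allow delivery mode, polarity, remote IRR, trigger mode. */
#define LINT_MASK (LVT_MASK | APIC_MODE_MASK | (1u << 13) | (1u << 14) | (1u << 15))

enum lapic_lvt_entry {
	LVT_TIMER,
	LVT_THERMAL_MONITOR,
	LVT_PERFORMANCE_COUNTER,
	LVT_LINT0,
	LVT_LINT1,
	LVT_ERROR,
	NR_LVT_ENTRIES,
};

/* Each slot is named explicitly, so reordering the enum cannot misalign the table. */
static const uint32_t apic_lvt_mask[NR_LVT_ENTRIES] = {
	[LVT_TIMER]               = LVT_MASK,
	[LVT_THERMAL_MONITOR]     = LVT_MASK | APIC_MODE_MASK,
	[LVT_PERFORMANCE_COUNTER] = LVT_MASK | APIC_MODE_MASK,
	[LVT_LINT0]               = LINT_MASK,
	[LVT_LINT1]               = LINT_MASK,
	[LVT_ERROR]               = LVT_MASK,
};

/* Keep only the bits that are writable for the given LVT entry. */
uint32_t lvt_sanitize(enum lapic_lvt_entry idx, uint32_t val)
{
	return val & apic_lvt_mask[idx];
}
```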
    • KVM: x86: Make APIC_VERSION capture only the magic 0x14UL. · 951ceb94
      Jue Wang authored
      Refactor APIC_VERSION so that the maximum number of LVT entries is
      inserted at runtime rather than compile time. This will be used in a
      subsequent commit to expose the LVT CMCI Register to VMs that support
      Corrected Machine Check error counting/signaling
      (IA32_MCG_CAP.MCG_CMCI_P=1).
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-2-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
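
      With APIC_VERSION reduced to the constant 0x14UL, the version register
      value can be composed at runtime from the number of LVT entries. The
      sketch assumes the Max LVT Entry field occupies bits 23:16 and holds
      the entry count minus one; the function name is hypothetical and other
      version-register bits are ignored.

```c
#include <stdint.h>

#define APIC_VERSION 0x14UL

/* Combine the fixed version with a runtime Max LVT Entry field (count - 1). */
uint32_t apic_version_reg(unsigned int nr_lvt_entries)
{
	return (uint32_t)(APIC_VERSION |
			  ((unsigned long)(nr_lvt_entries - 1) << 16));
}
```

      Under these assumptions a vCPU with six LVT entries would read 0x50014
      and one that also exposes the CMCI entry would read 0x60014.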
    • KVM: x86/mmu: Avoid unnecessary flush on eager page split · 03787394
      Paolo Bonzini authored
      The TLB flush before installing the newly-populated lower level
      page table is unnecessary if the lower-level page table maps
      the huge page identically.  KVM knows this is the case if it did not
      reuse an existing shadow page table, so tell drop_large_spte() to skip
      the flush in that case.
      
      Extracted from a patch by David Matlack.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs · ada51a9d
      David Matlack authored
      Add support for Eager Page Splitting pages that are mapped by nested
      MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
      pages, and then splitting all 2MiB pages to 4KiB pages.
      
      Note, Eager Page Splitting is limited to nested MMUs as a policy rather
      than due to any technical reason (the sp->role.guest_mode check could
      just be deleted and Eager Page Splitting would work correctly for all
      shadow MMU pages). There is really no reason to support Eager Page
      Splitting for tdp_mmu=N, since such support will eventually be phased
      out, and there is no current use case supporting Eager Page Splitting on
      hosts where TDP is either disabled or unavailable in hardware.
      Furthermore, future improvements to nested MMU scalability may diverge
      the code from the legacy shadow paging implementation. These
      improvements will be simpler to make if Eager Page Splitting does not
      have to worry about legacy shadow paging.
      
      Splitting huge pages mapped by nested MMUs requires dealing with some
      extra complexity beyond that of the TDP MMU:
      
      (1) The shadow MMU has a limit on the number of shadow pages that are
          allowed to be allocated. So, as a policy, Eager Page Splitting
          refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
          pages available.
      
      (2) Splitting a huge page may end up re-using an existing lower level
          shadow page table. This is unlike the TDP MMU which always allocates
          new shadow page tables when splitting.
      
      (3) When installing the lower level SPTEs, they must be added to the
          rmap which may require allocating additional pte_list_desc structs.
      
      Case (2) is especially interesting since it may require a TLB flush,
      unlike the TDP MMU which can fully split huge pages without any TLB
      flushes. Specifically, an existing lower level page table may point to
      even lower level page tables that are not fully populated, effectively
      unmapping a portion of the huge page, which requires a flush.  As of
      this commit, a flush is always done after dropping the huge page
      and before installing the lower level page table.
      
      This TLB flush could instead be delayed until the MMU lock is about to be
      dropped, which would batch flushes for multiple splits.  However these
      flushes should be rare in practice (a huge page must be aliased in
      multiple SPTEs and have been split for NX Huge Pages in only some of
      them). Flushing immediately is simpler to plumb and also reduces the
      chances of tripping over a CPU bug (e.g. see iTLB multihit).
      
      [ This commit is based off of the original implementation of Eager Page
        Splitting from Peter in Google's kernel from 2016. ]
      Suggested-by: Peter Feiner <pfeiner@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
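
      The level-by-level walk (1GiB to 2MiB, then 2MiB to 4KiB) and the
      free-page policy from point (1) can be illustrated with some simple
      arithmetic. This is a back-of-the-envelope model, not KVM code: the
      function names are hypothetical and the value used for
      KVM_MIN_FREE_MMU_PAGES is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPTES_PER_TABLE    512 /* entries in one x86-64 page table */
#define MIN_FREE_MMU_PAGES 5   /* stand-in for KVM_MIN_FREE_MMU_PAGES (assumed value) */

/* Level 1 maps 4KiB, level 2 maps 2MiB, level 3 maps 1GiB. */
uint64_t page_size_at_level(int level)
{
	return UINT64_C(4096) << (9 * (level - 1));
}

/* Page tables needed to split one huge page at `level` all the way to 4KiB. */
uint64_t tables_needed(int level)
{
	uint64_t tables = 0, pages = 1;

	for (; level > 1; level--) {
		tables += pages;          /* one new table per page at this level */
		pages *= SPTES_PER_TABLE; /* each table yields 512 smaller pages */
	}
	return tables;
}

/* Policy from point (1): refuse when MIN_FREE_MMU_PAGES or fewer are available. */
bool may_split(uint64_t available_mmu_pages)
{
	return available_mmu_pages > MIN_FREE_MMU_PAGES;
}
```

      Fully splitting a single 1GiB page therefore needs 513 page tables in
      this model (one for the 2MiB level plus 512 for the 4KiB level), which
      is why the shadow page limit matters.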
    • KVM: Allow for different capacities in kvm_mmu_memory_cache structs · 837f66c7
      David Matlack authored
      Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
      declaration time rather than being fixed for all declarations. This will
      be used in a follow-up commit to declare a cache in x86 with a capacity
      of 512+ objects without having to increase the capacity of all caches in
      KVM.
      
      This change requires that each cache now specify its capacity at runtime,
      since the cache struct itself no longer has a fixed capacity known at
      compile time. To protect against someone accidentally defining a
      kvm_mmu_memory_cache struct directly (without the extra storage), this
      commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
      
      In order to support different capacities, this commit changes the
      objects pointer array to be dynamically allocated the first time the
      cache is topped-up.
      
      While here, opportunistically clean up the stack-allocated
      kvm_mmu_memory_cache structs in riscv and arm64 to use designated
      initializers.
      
      No functional change intended.
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-22-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
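
      A userspace model of the idea: the capacity is a field set where the
      cache is declared, the objects pointer array is allocated on the first
      top-up, and a missing capacity is rejected (standing in for the
      WARN_ON()). The struct layout, the obj_size field and the function
      names are hypothetical simplifications.

```c
#include <stdlib.h>

struct mmu_memory_cache {
	int capacity;    /* chosen by whoever declares the cache */
	int nobjs;       /* objects currently held */
	size_t obj_size; /* hypothetical: size of each cached object */
	void **objects;  /* allocated lazily on the first top-up */
};

/* Fill the cache if it holds fewer than min objects. Returns 0 or -1. */
int cache_topup(struct mmu_memory_cache *mc, int min)
{
	if (mc->capacity <= 0 || min > mc->capacity)
		return -1; /* models the WARN_ON() for a missing capacity */

	if (!mc->objects) {
		mc->objects = calloc((size_t)mc->capacity, sizeof(void *));
		if (!mc->objects)
			return -1;
	}

	if (mc->nobjs >= min)
		return 0;

	while (mc->nobjs < mc->capacity) {
		void *obj = calloc(1, mc->obj_size);

		if (!obj)
			return mc->nobjs >= min ? 0 : -1;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}

/* Pop one pre-allocated object, or NULL if the cache is empty. */
void *cache_alloc(struct mmu_memory_cache *mc)
{
	return mc->nobjs ? mc->objects[--mc->nobjs] : NULL;
}
```

      A caller would declare the cache with a designated initializer, e.g.
      `struct mmu_memory_cache mc = { .capacity = 512, .obj_size = 64 };`,
      so that different caches can pick different capacities.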
    • KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page() · 0cd8dc73
      Paolo Bonzini authored
      Before allocating a child shadow page table, all callers check
      whether the parent already points to a huge page and, if so, they
      drop that SPTE.  This is done by drop_large_spte().
      
      However, dropping the large SPTE is really only necessary before the
      sp is installed.  While the sp is returned by kvm_mmu_get_child_sp(),
      installing it happens later in __link_shadow_page().  Move the call
      there instead of having it in each and every caller.
      
      To ensure that the shadow page is not linked twice if it was present,
      do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent:
      instead, return an error value if the shadow page already existed.
      This is a bit more verbose, but clearer than NULL.
      
      Finally, now that the drop_large_spte() name is not taken anymore,
      remove the two underscores in front of __drop_large_spte().
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels · 20d49186
      David Matlack authored
      Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
      is fine for now since KVM never creates intermediate huge pages during
      dirty logging. In other words, KVM always replaces 1GiB pages directly
      with 4KiB pages, so there is no reason to look for collapsible 2MiB
      pages.
      
      However, this will stop being true once the shadow MMU participates in
      eager page splitting. During eager page splitting, each 1GiB is first
      split into 2MiB pages and then those are split into 4KiB pages. The
      intermediate 2MiB pages may be left behind if an error condition causes
      eager page splitting to bail early.
      
      No functional change intended.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-20-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU · 47855da0
      David Matlack authored
      Currently make_huge_page_split_spte() assumes execute permissions can be
      granted to any 4K SPTE when splitting huge pages. This is true for the
      TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
      shadowing a non-executable huge page.
      
      To fix this, pass in the role of the child shadow page where the huge
      page will be split and derive the execution permission from that.  This
      is correct because huge pages are always split with a direct shadow page
      and thus the shadow page role contains the correct access permissions.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-19-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Cache the access bits of shadowed translations · 6a97575d
      David Matlack authored
      Splitting huge pages requires allocating/finding shadow pages to replace
      the huge page. Shadow pages are keyed, in part, off the guest access
      permissions they are shadowing. For fully direct MMUs, there is no
      shadowing so the access bits in the shadow page role are always ACC_ALL.
      But during shadow paging, the guest can enforce whatever access
      permissions it wants.
      
      In particular, eager page splitting needs to know the permissions to use
      for the subpages, but KVM cannot retrieve them from the guest page
      tables because eager page splitting does not have a vCPU.  Fortunately,
      the guest access permissions are easy to cache whenever page faults or
      FNAME(sync_page) update the shadow page tables; this is an extension of
      the existing cache of the shadowed GFNs in the gfns array of the shadow
      page.  The access bits only take up 3 bits, which leaves 61 bits
      for gfns, which is more than enough.
      
      Now that the gfns array caches more information than just GFNs, rename
      it to shadowed_translation.
      
      While here, preemptively fix up the WARN_ON() that detects gfn
      mismatches in direct SPs. The WARN_ON() was paired with a
      pr_err_ratelimited(), which means that users could sometimes see the
      WARN without the accompanying error message. Fix this by outputting the
      error message as part of the WARN splat, and opportunistically make
      them WARN_ONCE() because if these ever fire, they are all but guaranteed
      to fire a lot and will bring down the kernel.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
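
      The packing of a GFN and its three access bits into one 64-bit slot
      might look like the following. The shift of exactly 3 bits mirrors the
      arithmetic in the commit message (3 + 61 bits); the actual encoding in
      KVM may use a different shift, and the helper names are hypothetical.

```c
#include <stdint.h>

#define ACC_EXEC_MASK  1u
#define ACC_WRITE_MASK 2u
#define ACC_USER_MASK  4u
#define ACC_ALL        (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

#define ACCESS_BITS 3 /* assumption: low 3 bits hold access, the rest the GFN */

/* Store the GFN in the upper 61 bits and the access bits in the low 3. */
uint64_t pack_translation(uint64_t gfn, unsigned int access)
{
	return (gfn << ACCESS_BITS) | (access & ACC_ALL);
}

uint64_t translation_gfn(uint64_t t)
{
	return t >> ACCESS_BITS;
}

unsigned int translation_access(uint64_t t)
{
	return (unsigned int)(t & ACC_ALL);
}
```

      Eager page splitting can then read the cached access bits for a
      subpage without walking the guest page tables, which it cannot do
      because it has no vCPU.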
    • KVM: x86/mmu: Update page stats in __rmap_add() · 81cb4657
      David Matlack authored
      Update the page stats in __rmap_add() rather than at the call site. This
      will avoid having to manually update page stats when splitting huge
      pages in a subsequent commit.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu · 2ff9039a
      David Matlack authored
      Allow adding new entries to the rmap and linking shadow pages without a
      struct kvm_vcpu pointer by moving the implementation of rmap_add() and
      link_shadow_page() into inner helper functions.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>