1. 20 Jun, 2022 30 commits
    • Sean Christopherson's avatar
      KVM: VMX: Give host userspace full control of MSR_IA32_PERF_CAPABILITIES · 0f4a7185
      Sean Christopherson authored
      Do not clear manipulate MSR_IA32_PERF_CAPABILITIES in intel_pmu_refresh(),
      i.e. give userspace full control over capability/read-only MSRs.  KVM is
      not a babysitter, it is userspace's responsiblity to provide a valid and
      coherent vCPU model.
      
      Attempting to "help" the guest by forcing a consistent model creates edge
      cases, and ironicially leads to inconsistent behavior.
      
      Example #1:  KVM doesn't do intel_pmu_refresh() when userspace writes
      the MSR.
      
      Example #2: KVM doesn't clear the bits when the PMU is disabled, or when
      there's no architectural PMU.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220611005755.753273-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0f4a7185
    • Sean Christopherson's avatar
      KVM: x86: Give host userspace full control of MSR_IA32_MISC_ENABLES · 9fc22296
      Sean Christopherson authored
      Give userspace full control of the read-only bits in MISC_ENABLES, i.e.
      do not modify bits on PMU refresh and do not preserve existing bits when
      userspace writes MISC_ENABLES.  With a few exceptions where KVM doesn't
      expose the necessary controls to userspace _and_ there is a clear cut
      association with CPUID, e.g. reserved CR4 bits, KVM does not own the vCPU
      and should not manipulate the vCPU model on behalf of "dummy user space".
      
      The argument that KVM is doing userspace a favor because "the order of
      setting vPMU capabilities and MSR_IA32_MISC_ENABLE is not strictly
      guaranteed" is specious, as attempting to configure MSRs on behalf of
      userspace inevitably leads to edge cases precisely because KVM does not
      prescribe a specific order of initialization.
      
      Example #1: intel_pmu_refresh() consumes and modifies the vCPU's
      MSR_IA32_PERF_CAPABILITIES, and so assumes userspace initializes config
      MSRs before setting the guest CPUID model.  If userspace sets CPUID
      first, then KVM will mark PEBS as available when arch.perf_capabilities
      is initialized with a non-zero PEBS format, thus creating a bad vCPU
      model if userspace later disables PEBS by writing PERF_CAPABILITIES.
      
      Example #2: intel_pmu_refresh() does not clear PERF_CAP_PEBS_MASK in
      MSR_IA32_PERF_CAPABILITIES if there is no vPMU, making KVM inconsistent
      in its desire to be consistent.
      
      Example #3: intel_pmu_refresh() does not clear MSR_IA32_MISC_ENABLE_EMON
      if KVM_SET_CPUID2 is called multiple times, first with a vPMU, then
      without a vPMU.  While slightly contrived, it's plausible a VMM could
      reflect KVM's default vCPU and then operate on KVM's copy of CPUID to
      later clear the vPMU settings, e.g. see KVM's selftests.
      
      Example #4: Enumerating an Intel vCPU on an AMD host will not call into
      intel_pmu_refresh() at any point, and so the BTS and PEBS "unavailable"
      bits will be left clear, without any way for userspace to set them.
      
      Keep the "R" behavior of the bit 7, "EMON available", for the guest.
      Unlike the BTS and PEBS bits, which are fully "RO", the EMON bit can be
      written with a different value, but that new value is ignored.
      
      Cc: Like Xu <likexu@tencent.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reported-by: default avatarkernel test robot <oliver.sang@intel.com>
      Message-Id: <20220611005755.753273-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      9fc22296
    • Dongliang Mu's avatar
      x86: kvm: remove NULL check before kfree · e20918f6
      Dongliang Mu authored
      kfree can handle NULL pointer as its argument.
      According to coccinelle isnullfree check, remove NULL check
      before kfree operation.
      Signed-off-by: default avatarDongliang Mu <mudongliangabcd@gmail.com>
      Message-Id: <20220614133458.147314-1-dzm91@hust.edu.cn>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e20918f6
    • Sean Christopherson's avatar
      KVM: Do not zero initialize 'pfn' in hva_to_pfn() · 943dfea8
      Sean Christopherson authored
      Drop the unnecessary initialization of the local 'pfn' variable in
      hva_to_pfn().  First and foremost, '0' is not an invalid pfn, it's a
      perfectly valid pfn on most architectures.  I.e. if hva_to_pfn() were to
      return an "uninitializd" pfn, it would actually be interpeted as a legal
      pfn by most callers.
      
      Second, hva_to_pfn() can't return an uninitialized pfn as hva_to_pfn()
      explicitly sets pfn to an error value (or returns an error value directly)
      if a helper returns failure, and all helpers set the pfn on success.
      
      The zeroing of 'pfn' was introduced by commit 2fc84311 ("KVM:
      reorganize hva_to_pfn"), probably to avoid "uninitialized variable"
      warnings on statements that return pfn.  However, no compiler seems
      to produce them, making the initialization unnecessary.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      943dfea8
    • Sean Christopherson's avatar
      KVM: x86/mmu: Shove refcounted page dependency into host_pfn_mapping_level() · 5d49f08c
      Sean Christopherson authored
      Move the check that restricts mapping huge pages into the guest to pfns
      that are backed by refcounted 'struct page' memory into the helper that
      actually "requires" a 'struct page', host_pfn_mapping_level().  In
      addition to deduplicating code, moving the check to the helper eliminates
      the subtle requirement that the caller check that the incoming pfn is
      backed by a refcounted struct page, and as an added bonus avoids an extra
      pfn_to_page() lookup.
      
      Note, the is_error_noslot_pfn() check in kvm_mmu_hugepage_adjust() needs
      to stay where it is, as it guards against dereferencing a NULL memslot in
      the kvm_slot_dirty_track_enabled() that follows.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-11-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5d49f08c
    • Sean Christopherson's avatar
      KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() · b14b2690
      Sean Christopherson authored
      Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
      to better reflect what KVM is actually checking, and to eliminate extra
      pfn_to_page() lookups.  The kvm_release_pfn_*() an kvm_try_get_pfn()
      helpers in particular benefit from "refouncted" nomenclature, as it's not
      all that obvious why KVM needs to get/put refcounts for some PG_reserved
      pages (ZERO_PAGE and ZONE_DEVICE).
      
      Add a comment to call out that the list of exceptions to PG_reserved is
      all but guaranteed to be incomplete.  The list has mostly been compiled
      by people throwing noodles at KVM and finding out they stick a little too
      well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages
      didn't get freed.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-10-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b14b2690
    • Sean Christopherson's avatar
      KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page() · 284dc493
      Sean Christopherson authored
      Operate on a 'struct page' instead of a pfn when checking if a page is a
      ZONE_DEVICE page, and rename the helper accordingly.  Generally speaking,
      KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do
      anything special for ZONE_DEVICE memory.  Rather, KVM wants to treat
      ZONE_DEVICE memory like regular memory, and the need to identify
      ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In
      other words, KVM should only ever check for ZONE_DEVICE memory after KVM
      has already verified that there is a struct page associated with the pfn.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-9-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      284dc493
    • Sean Christopherson's avatar
      KVM: Remove kvm_vcpu_gfn_to_page() and kvm_vcpu_gpa_to_page() · b1624f99
      Sean Christopherson authored
      Drop helpers to convert a gfn/gpa to a 'struct page' in the context of a
      vCPU.  KVM doesn't require that guests be backed by 'struct page' memory,
      thus any use of helpers that assume 'struct page' is bound to be flawed,
      as was the case for the recently removed last user in x86's nested VMX.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-8-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b1624f99
    • Sean Christopherson's avatar
      KVM: Don't WARN if kvm_pfn_to_page() encounters a "reserved" pfn · 6573a691
      Sean Christopherson authored
      Drop a WARN_ON() if kvm_pfn_to_page() encounters a "reserved" pfn, which
      in this context means a struct page that has PG_reserved but is not a/the
      ZERO_PAGE and is not a ZONE_DEVICE page.  The usage, via gfn_to_page(),
      in x86 is safe as gfn_to_page() is used only to retrieve a page from
      KVM-controlled memslot, but the usage in PPC and s390 operates on
      arbitrary gfns and thus memslots that can be backed by incompatible
      memory.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-7-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6573a691
    • Sean Christopherson's avatar
      KVM: nVMX: Use kvm_vcpu_map() to get/pin vmcs12's APIC-access page · fe1911aa
      Sean Christopherson authored
      Use kvm_vcpu_map() to get/pin the backing for vmcs12's APIC-access page,
      there's no reason it has to be restricted to 'struct page' backing.  The
      APIC-access page actually doesn't need to be backed by anything, which is
      ironically why it got left behind by the series which introduced
      kvm_vcpu_map()[1]; the plan was to shove a dummy pfn into vmcs02[2], but
      that code never got merged.
      
      Switching the APIC-access page to kvm_vcpu_map() doesn't preclude using a
      magic pfn in the future, and will allow a future patch to drop
      kvm_vcpu_gpa_to_page().
      
      [1] https://lore.kernel.org/all/1547026933-31226-1-git-send-email-karahmed@amazon.de
      [2] https://lore.kernel.org/lkml/1543845551-4403-1-git-send-email-karahmed@amazon.deSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-6-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fe1911aa
    • Sean Christopherson's avatar
      KVM: Avoid pfn_to_page() and vice versa when releasing pages · 8e1c6914
      Sean Christopherson authored
      Invert the order of KVM's page/pfn release helpers so that the "inner"
      helper operates on a page instead of a pfn.  As pointed out by Linus[*],
      converting between struct page and a pfn isn't necessarily cheap, and
      that's not even counting the overhead of is_error_noslot_pfn() and
      kvm_is_reserved_pfn().  Even if the checks were dirt cheap, there's no
      reason to convert from a page to a pfn and back to a page, just to mark
      the page dirty/accessed or to put a reference to the page.
      
      Opportunistically drop a stale declaration of kvm_set_page_accessed()
      from kvm_host.h (there was no implementation).
      
      No functional change intended.
      
      [*] https://lore.kernel.org/all/CAHk-=wifQimj2d6npq-wCi5onYPjzQg4vyO4tFcPJJZr268cRw@mail.gmail.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8e1c6914
    • Sean Christopherson's avatar
      KVM: Don't set Accessed/Dirty bits for ZERO_PAGE · a1040b0d
      Sean Christopherson authored
      Don't set Accessed/Dirty bits for a struct page with PG_reserved set,
      i.e. don't set A/D bits for the ZERO_PAGE.  The ZERO_PAGE (or pages
      depending on the architecture) should obviously never be written, and
      similarly there's no point in marking it accessed as the page will never
      be swapped out or reclaimed.  The comment in page-flags.h is quite clear
      that PG_reserved pages should be managed only by their owner, and
      strictly following that mandate also simplifies KVM's logic.
      
      Fixes: 7df003c8 ("KVM: fix overflow of zero page refcount with ksm running")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a1040b0d
    • Sean Christopherson's avatar
      KVM: Drop bogus "pfn != 0" guard from kvm_release_pfn() · 28b85ae0
      Sean Christopherson authored
      Remove a check from kvm_release_pfn() to bail if the provided @pfn is
      zero.  Zero is a perfectly valid pfn on most architectures, and should
      not be used to indicate an error or an invalid pfn.  The bogus check was
      added by commit 91724814 ("x86/kvm: Cache gfn to pfn translation"),
      which also did the bad thing of zeroing the pfn and gfn to mark a cache
      invalid.  Thankfully, that bad behavior was axed by commit 357a18ad
      ("KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache").
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      28b85ae0
    • Sean Christopherson's avatar
      KVM: x86/mmu: Use common logic for computing the 32/64-bit base PA mask · 70e41c31
      Sean Christopherson authored
      Use common logic for computing PT_BASE_ADDR_MASK for 32-bit, 64-bit, and
      EPT paging.  Both PAGE_MASK and the new-common logic are supsersets of
      what is actually needed for 32-bit paging.  PAGE_MASK sets bits 63:12 and
      the former GUEST_PT64_BASE_ADDR_MASK sets bits 51:12, so regardless of
      which value is used, the result will always be bits 31:12.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-9-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      70e41c31
    • Sean Christopherson's avatar
      KVM: x86/mmu: Truncate paging32's PT_BASE_ADDR_MASK to 32 bits · f7384b88
      Sean Christopherson authored
      Truncate paging32's PT_BASE_ADDR_MASK to a pt_element_t, i.e. to 32 bits.
      Ignoring PSE huge pages, the mask is only used in conjunction with gPTEs,
      which are 32 bits, and so the address is limited to bits 31:12.
      
      PSE huge pages encoded PA bits 39:32 in PTE bits 20:13, i.e. need custom
      logic to handle their funky encoding regardless of PT_BASE_ADDR_MASK.
      
      Note, PT_LVL_OFFSET_MASK is somewhat confusing in that it computes the
      offset of the _gfn_, not of the gpa, i.e. not having bits 63:32 set in
      PT_BASE_ADDR_MASK is again correct.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-8-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f7384b88
    • Paolo Bonzini's avatar
      KVM: x86/mmu: Use common macros to compute 32/64-bit paging masks · f6b8ea6d
      Paolo Bonzini authored
      Dedup the code for generating (most of) the per-type PT_* masks in
      paging_tmpl.h.  The relevant macros only vary based on the number of bits
      per level, and that smidge of info is already provided in a common form
      as PT_LEVEL_BITS.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-7-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f6b8ea6d
    • Sean Christopherson's avatar
      KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs · 2ca3129e
      Sean Christopherson authored
      Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs
      (PT64).  SPTE and PT64 are _mostly_ the same, but the few differences are
      quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and
      guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is
      very much specific to SPTEs.
      
      Opportunistically (and temporarily) move most guest macros into paging.h
      to clearly associate them with shadow paging, and to ensure that they're
      not used as of this commit.  A future patch will eliminate them entirely.
      
      Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's
      needed for the quadrant calculation in kvm_mmu_get_page().  The quadrant
      calculation is hot enough (when using shadow paging with 32-bit guests)
      that adding a per-context helper is undesirable, and burying the
      computation in paging_tmpl.h with a forward declaration isn't exactly an
      improvement.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-6-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2ca3129e
    • Sean Christopherson's avatar
      KVM: x86/mmu: Dedup macros for computing various page table masks · 42c88ff8
      Sean Christopherson authored
      Provide common helper macros to generate various masks, shifts, etc...
      for 32-bit vs. 64-bit page tables.  Only the inputs differ, the actual
      calculations are identical.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      42c88ff8
    • Sean Christopherson's avatar
      KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h · b3fcdb04
      Sean Christopherson authored
      Move a handful of one-off macros and helpers for 32-bit PSE paging into
      paging_tmpl.h and hide them behind "PTTYPE == 32".  Under no circumstance
      should anything but 32-bit shadow paging care about PSE paging.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b3fcdb04
    • Sean Christopherson's avatar
      KVM: VMX: Refactor 32-bit PSE PT creation to avoid using MMU macro · 1ae20e0b
      Sean Christopherson authored
      Compute the number of PTEs to be filled for the 32-bit PSE page tables
      using the page size and the size of each entry.  While using the MMU's
      PT32_ENT_PER_PAGE macro is arguably better in isolation, removing VMX's
      usage will allow a future namespacing cleanup to move the guest page
      table macros into paging_tmpl.h, out of the reach of code that isn't
      directly related to shadow paging.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1ae20e0b
    • Sean Christopherson's avatar
      KVM: x86: Use lapic_in_kernel() to query in-kernel APIC in APICv helper · b8e1b962
      Sean Christopherson authored
      Use lapic_in_kernel() in kvm_vcpu_apicv_active() to take advantage of the
      kvm_has_noapic_vcpu static branch.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614230548.3852141-6-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b8e1b962
    • Sean Christopherson's avatar
      KVM: x86: Move "apicv_active" into "struct kvm_lapic" · ce0a58f4
      Sean Christopherson authored
      Move the per-vCPU apicv_active flag into KVM's local APIC instance.
      APICv is fully dependent on an in-kernel local APIC, but that's not at
      all clear when reading the current code due to the flag being stored in
      the generic kvm_vcpu_arch struct.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614230548.3852141-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ce0a58f4
    • Sean Christopherson's avatar
      KVM: x86: Check for in-kernel xAPIC when querying APICv for directed yield · ae801e13
      Sean Christopherson authored
      Use kvm_vcpu_apicv_active() to check if APICv is active when seeing if a
      vCPU is a candidate for directed yield due to a pending ACPIv interrupt.
      This will allow moving apicv_active into kvm_lapic without introducing a
      potential NULL pointer deref (kvm_vcpu_apicv_active() effectively adds a
      pre-check on the vCPU having an in-kernel APIC).
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614230548.3852141-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ae801e13
    • Sean Christopherson's avatar
      KVM: x86: Drop @vcpu parameter from kvm_x86_ops.hwapic_isr_update() · d39850f5
      Sean Christopherson authored
      Drop the unused @vcpu parameter from hwapic_isr_update().  AMD/AVIC is
      unlikely to implement the helper, and VMX/APICv doesn't need the vCPU as
      it operates on the current VMCS.  The result is somewhat odd, but allows
      for a decent amount of (future) cleanup in the APIC code.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614230548.3852141-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d39850f5
    • Sean Christopherson's avatar
      KVM: SVM: Drop unused AVIC / kvm_x86_ops declarations · ec1d7e6a
      Sean Christopherson authored
      Drop a handful of unused AVIC function declarations whose implementations
      were removed during the conversion to optional static calls.
      
      No functional change intended.
      
      Fixes: abb6d479 ("KVM: x86: make several APIC virtualization callbacks optional")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614230548.3852141-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ec1d7e6a
    • Sean Christopherson's avatar
      KVM: nVMX: Update vmcs12 on BNDCFGS write, not at vmcs02=>vmcs12 sync · 913d6c9b
      Sean Christopherson authored
      Update vmcs12->guest_bndcfgs on intercepted writes to BNDCFGS from L2
      instead of waiting until vmcs02 is synchronized to vmcs12.  KVM always
      intercepts BNDCFGS accesses, so the only way the value in vmcs02 can
      change is via KVM's explicit VMWRITE during emulation.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614215831.3762138-6-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      913d6c9b
    • Sean Christopherson's avatar
      KVM: nVMX: Save BNDCFGS to vmcs12 iff relevant controls are exposed to L1 · 308a4fff
      Sean Christopherson authored
      Save BNDCFGS to vmcs12 (from vmcs02) if and only if at least of one of
      the load-on-entry or clear-on-exit fields for BNDCFGS is enumerated as an
      allowed-1 bit in vmcs12.  Skipping the field avoids an unnecessary VMREAD
      when MPX is supported but not exposed to L1.
      
      Per Intel's SDM:
      
        If the processor supports either the 1-setting of the "load IA32_BNDCFGS"
        VM-entry control or that of the "clear IA32_BNDCFGS" VM-exit control, the
        contents of the IA32_BNDCFGS MSR are saved into the corresponding field.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614215831.3762138-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      308a4fff
    • Sean Christopherson's avatar
      KVM: nVMX: Rename nested.vmcs01_* fields to nested.pre_vmenter_* · 5d76b1f8
      Sean Christopherson authored
      Rename the fields in struct nested_vmx used to snapshot pre-VM-Enter
      values to reflect that they can hold L2's values when restoring nested
      state, e.g. if userspace restores MSRs before nested state.  As crazy as
      it seems, restoring MSRs before nested state actually works (because KVM
      goes out if it's way to make it work), even though the initial MSR writes
      will hit vmcs01 despite holding L2 values.
      
      Add a related comment to vmx_enter_smm() to call out that using the
      common VM-Exit and VM-Enter helpers to emulate SMI and RSM is wrong and
      broken.  The few MSRs that have snapshots _could_ be fixed by taking a
      snapshot prior to the forced VM-Exit instead of at forced VM-Enter, but
      that's just the tip of the iceberg as the rather long list of MSRs that
      aren't snapshotted (hello, VM-Exit MSR load list) can't be handled this
      way.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614215831.3762138-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5d76b1f8
    • Sean Christopherson's avatar
      KVM: nVMX: Snapshot pre-VM-Enter DEBUGCTL for !nested_run_pending case · 764643a6
      Sean Christopherson authored
      If a nested run isn't pending, snapshot vmcs01.GUEST_IA32_DEBUGCTL
      irrespective of whether or not VM_ENTRY_LOAD_DEBUG_CONTROLS is set in
      vmcs12.  When restoring nested state, e.g. after migration, without a
      nested run pending, prepare_vmcs02() will propagate
      nested.vmcs01_debugctl to vmcs02, i.e. will load garbage/zeros into
      vmcs02.GUEST_IA32_DEBUGCTL.
      
      If userspace restores nested state before MSRs, then loading garbage is a
      non-issue as loading DEBUGCTL will also update vmcs02.  But if usersepace
      restores MSRs first, then KVM is responsible for propagating L2's value,
      which is actually thrown into vmcs01, into vmcs02.
      
      Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
      is all kinds of bizarre and ideally would not be supported.  Sadly, some
      VMMs do exactly that and rely on KVM to make things work.
      
      Note, there's still a lurking SMM bug, as propagating vmcs01's DEBUGCTL
      to vmcs02 across RSM may corrupt L2's DEBUGCTL.  But KVM's entire VMX+SMM
      emulation is flawed as SMI+RSM should not toouch _any_ VMCS when use the
      "default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.
      
      Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
      Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614215831.3762138-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      764643a6
    • Sean Christopherson's avatar
      KVM: nVMX: Snapshot pre-VM-Enter BNDCFGS for !nested_run_pending case · fa578398
      Sean Christopherson authored
      If a nested run isn't pending, snapshot vmcs01.GUEST_BNDCFGS irrespective
      of whether or not VM_ENTRY_LOAD_BNDCFGS is set in vmcs12.  When restoring
      nested state, e.g. after migration, without a nested run pending,
      prepare_vmcs02() will propagate nested.vmcs01_guest_bndcfgs to vmcs02,
      i.e. will load garbage/zeros into vmcs02.GUEST_BNDCFGS.
      
      If userspace restores nested state before MSRs, then loading garbage is a
      non-issue as loading BNDCFGS will also update vmcs02.  But if usersepace
      restores MSRs first, then KVM is responsible for propagating L2's value,
      which is actually thrown into vmcs01, into vmcs02.
      
      Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
      is all kinds of bizarre and ideally would not be supported.  Sadly, some
      VMMs do exactly that and rely on KVM to make things work.
      
      Note, there's still a lurking SMM bug, as propagating vmcs01.GUEST_BNDFGS
      to vmcs02 across RSM may corrupt L2's BNDCFGS.  But KVM's entire VMX+SMM
      emulation is flawed as SMI+RSM should not toouch _any_ VMCS when use the
      "default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.
      
      Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
      Fixes: 62cf9bd8 ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS")
      Cc: stable@vger.kernel.org
      Cc: Lei Wang <lei4.wang@intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220614215831.3762138-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fa578398
  2. 15 Jun, 2022 10 commits