08 Jan, 2020 (18 commits)
    • KVM: x86/mmu: WARN if root_hpa is invalid when handling a page fault · 6948199a
      Sean Christopherson authored
      WARN if root_hpa is invalid when handling a page fault.  The check on
      root_hpa exists for historical reasons that no longer apply to the
      current KVM code base.
      
      Remove an equivalent debug-only warning in direct_page_fault(), whose
      existence more or less confirms that root_hpa should always be valid
      when handling a page fault.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6948199a
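      The end state described above is a sanity check at the very top of the
      common fault path.  A minimal sketch, assuming the check lives in
      kvm_mmu_page_fault() and bails with RET_PF_RETRY (the message confirms
      the intent, not the exact diff):

          /* Reaching a page fault with no valid root is a KVM bug. */
          if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
              return RET_PF_RETRY;

      The two entries below cover the matching WARNs in __direct_map() and
      FNAME(fetch) and the original consolidation of the scattered checks.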
    • KVM: x86/mmu: WARN on an invalid root_hpa · 0c7a98e3
      Sean Christopherson authored
      WARN on the existing invalid root_hpa checks in __direct_map() and
      FNAME(fetch).  The "legitimate" path that invalidated root_hpa in the
      middle of a page fault is long since gone, i.e. it should no longer be
      possible to invalidate root_hpa in the middle of a page fault[*].
      
      The root_hpa checks were added by two related commits
      
        989c6b34 ("KVM: MMU: handle invalid root_hpa at __direct_map")
        37f6a4e2 ("KVM: x86: handle invalid root_hpa everywhere")
      
      to fix a bug where nested_vmx_vmexit() could be called *in the middle*
      of a page fault.  At the time, vmx_interrupt_allowed(), which was and
      still is used by kvm_can_do_async_pf() via ->interrupt_allowed(),
      directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate
      a VM-Exit on a pending interrupt.  Emulating the nested VM-Exit resulted
      in root_hpa being invalidated by kvm_mmu_reset_context() without
      explicitly terminating the page fault.
      
      Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN
      on an invalid root_hpa to detect any flows that reset the MMU while
      handling a page fault.  The broken vmx_interrupt_allowed() behavior has
      long since been fixed and resetting the MMU during a page fault should
      not be considered legal behavior.
      
      [*] It's actually technically possible in FNAME(page_fault)() because it
          calls inject_page_fault() when the guest translation is invalid, but
          in that case the page fault handling is immediately terminated.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c7a98e3
    • KVM: x86/mmu: Move root_hpa validity checks to top of page fault handler · ddce6208
      Sean Christopherson authored
      Add a check on root_hpa at the beginning of the page fault handler to
      consolidate several checks on root_hpa that are scattered throughout the
      page fault code.  This is a preparatory step towards eventually removing
      such checks altogether, or at the very least WARNing if an invalid root
      is encountered.  Remove only the checks that can be easily audited to
      confirm that root_hpa cannot be invalidated between their current
      location and the new check in kvm_mmu_page_fault(), and aren't currently
      protected by mmu_lock, i.e. keep the checks in __direct_map() and
      FNAME(fetch) for the time being.
      
      The root_hpa checks that are consolidated were all added by commit
      
        37f6a4e2 ("KVM: x86: handle invalid root_hpa everywhere")
      
      which was a follow up to a bug fix for __direct_map(), commit
      
        989c6b34 ("KVM: MMU: handle invalid root_hpa at __direct_map")
      
      At the time, nested VMX had, in hindsight, crazy handling of nested
      interrupts and would trigger a nested VM-Exit in ->interrupt_allowed(),
      and thus unexpectedly reset the MMU in flows such as kvm_can_do_async_pf().
      
      Now that the wonky nested VM-Exit behavior is gone, the root_hpa checks
      are bogus and confusing, e.g. it's not at all obvious what they actually
      protect against, and at first glance they appear to be broken since many
      of them run without holding mmu_lock.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ddce6208
    • KVM: x86/mmu: Move calls to thp_adjust() down a level · 4cd071d1
      Sean Christopherson authored
      Move the calls to thp_adjust() down a level from the page fault handlers
      to the map/fetch helpers and remove the page count shuffling done in
      thp_adjust().
      
      Despite holding a reference to the underlying page while processing a
      page fault, the page fault flows don't actually rely on holding a
      reference to the page when thp_adjust() is called.  At that point, the
      fault handlers hold mmu_lock, which prevents mmu_notifier from completing
      any invalidations, and have verified no invalidations from mmu_notifier
      have occurred since the page reference was acquired (which is done prior
      to taking mmu_lock).
      
      The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a
      quirk that is necessitated because thp_adjust() modifies the pfn that is
      consumed by its caller.  Because the page fault handlers call
      kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the
      reference to the correct pfn purely for correctness when the pfn is
      released.
      
      Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn
      adjustment doesn't change the pfn as seen by the page fault handlers,
      i.e. the pfn released by the page fault handlers is the same pfn that
      was returned by gfn_to_pfn().
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4cd071d1
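      For context, the release/get "dance" being removed looks roughly like
      the fragment below; the variable names are the conventional ones in
      thp_adjust() and are assumptions here, not a quote of the code:

          /*
           * Old thp_adjust() quirk: promoting the mapping to a huge page
           * realigns the pfn handed back to the caller.  Because the caller
           * later calls kvm_release_pfn_clean() on whatever pfn it ends up
           * with, the page reference has to follow the realigned pfn.
           */
          mask = KVM_PAGES_PER_HPAGE(level) - 1;
          if (pfn & mask) {
              kvm_release_pfn_clean(pfn);   /* drop the ref on the old pfn  */
              pfn &= ~mask;                 /* align to the huge page start */
              kvm_get_pfn(pfn);             /* take a ref on the new pfn    */
              *pfnp = pfn;
          }

      Once the adjustment happens inside __direct_map()/FNAME(fetch), the
      adjusted pfn never escapes back to the fault handler, so the
      release/get pair can simply be dropped.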
    • KVM: x86/mmu: Move transparent_hugepage_adjust() above __direct_map() · 0885904d
      Sean Christopherson authored
      Move thp_adjust() above __direct_map() in preparation for calling
      thp_adjust() from __direct_map() and FNAME(fetch).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0885904d
    • KVM: x86/mmu: Consolidate tdp_page_fault() and nonpaging_page_fault() · 0f90e1c1
      Sean Christopherson authored
      Consolidate the direct MMU page fault handlers into a common helper,
      direct_page_fault().  Except for unique max level conditions, the tdp
      and nonpaging fault handlers are functionally identical.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0f90e1c1
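      One plausible shape of the consolidation, with the wrappers supplying
      only their max level; the parameter list of direct_page_fault() below
      is an assumption based on the description, not the upstream signature:

          static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
                                       u32 error_code, bool prefault,
                                       int max_level, bool is_tdp);

          static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
                                          u32 error_code, bool prefault)
          {
              /* Shadow paging without a guest walker: cap at 2MB mappings. */
              return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code,
                                       prefault, PT_DIRECTORY_LEVEL, false);
          }

          static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
                                    u32 error_code, bool prefault)
          {
              /* TDP: the real code first lowers max_level for MTRRs, etc. */
              return direct_page_fault(vcpu, gpa, error_code, prefault,
                                       PT_MAX_HUGEPAGE_LEVEL, true);
          }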
    • KVM: x86/mmu: Rename lpage_disallowed to account_disallowed_nx_lpage · 2cb70fd4
      Sean Christopherson authored
      Rename __direct_map()'s param that controls whether or not a disallowed
      NX large page should be accounted to match what it actually does.  The
      nonpaging_page_fault() case unconditionally passes %false for the param
      even though it locally sets lpage_disallowed.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2cb70fd4
    • KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level · 2f57b705
      Sean Christopherson authored
      Persist the max page level calculated via gfn_lpage_is_disallowed() to
      the max level "returned" by mapping_level() so that it's naturally taken
      into account by the max level check that conditions calling
      transparent_hugepage_adjust().
      
      Drop the gfn_lpage_is_disallowed() check in thp_adjust() as it's now
      handled by mapping_level() and its callers.
      
      Add a comment to document the behavior of host_mapping_level() and its
      interaction with max level and transparent huge pages.
      
      Note, transferring the gfn_lpage_is_disallowed() check from thp_adjust() to
      mapping_level() superficially affects how changes to a memslot's
      disallow_lpage count will be handled due to thp_adjust() being run while
      holding mmu_lock.
      
      In the more common case where a different vCPU increments the count via
      account_shadowed(), gfn_lpage_is_disallowed() is rechecked by set_spte()
      to ensure a writable large page isn't created.
      
      In the less common case where the count is decremented to zero due to
      all shadow pages in the memslot being zapped, THP behavior now matches
      hugetlbfs behavior in the sense that a small page will be created when a
      large page could be used if the count reaches zero in the minuscule
      window between mapping_level() and acquiring mmu_lock.
      
      Lastly, the new THP behavior also follows hugetlbfs behavior in the
      absurdly unlikely scenario of a memslot being moved such that the
      memslot's compatibility with respect to large pages changes, but without
      changing the validity of the gfn->pfn walk.  I.e. if a memslot is moved
      between mapping_level() and snapshotting mmu_seq, it's theoretically
      possible to consume a stale disallow_lpage count.  But, since KVM zaps
      all shadow pages when moving a memslot and forces all vCPUs to reload a
      new MMU, the inserted spte will always be thrown away prior to
      completing the memslot move, i.e. whether or not the spte accurately
      reflects disallow_lpage is irrelevant.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2f57b705
    • KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU · cbe1e6f0
      Sean Christopherson authored
      Restrict the max level for a shadow page based on the guest's level
      instead of capping the level after the fact for host-mapped huge pages,
      e.g. hugetlbfs pages.  Explicitly capping the max level using the guest
      mapping level also eliminates FNAME(page_fault)'s subtle dependency on
      THP only supporting 2mb pages.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cbe1e6f0
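      As a one-line sketch of the idea, assuming the usual guest_walker
      variable in FNAME(page_fault):

          /* The guest's own mapping level bounds the shadow mapping level. */
          max_level = min(max_level, walker.level);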
    • KVM: x86/mmu: Refactor handling of forced 4k pages in page faults · 39ca1ecb
      Sean Christopherson authored
      Refactor the page fault handlers and mapping_level() to track the max
      allowed page level instead of only tracking if a 4k page is mandatory
      due to one restriction or another.  This paves the way for cleanly
      consolidating tdp_page_fault() and nonpaging_page_fault(), and for
      eliminating a redundant check on mmu_gfn_lpage_is_disallowed().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      39ca1ecb
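      A sketch of the new shape; whether mapping_level() takes the max level
      by pointer is an assumption for illustration:

          int max_level = PT_MAX_HUGEPAGE_LEVEL;

          /* Each restriction lowers the ceiling instead of setting a bool. */
          if (lpage_disallowed)                   /* was: force_pt_level = true */
              max_level = PT_PAGE_TABLE_LEVEL;

          level = mapping_level(vcpu, gfn, &max_level);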
    • KVM: x86/mmu: Refactor the per-slot level calculation in mapping_level() · f0f37e22
      Sean Christopherson authored
      Invert the loop which adjusts the allowed page level based on what's
      compatible with the associated memslot to use a largest-to-smallest
      page size walk.  This paves the way for passing around a "max level"
      variable instead of having redundant checks and/or multiple booleans.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f0f37e22
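      The inverted walk reads roughly as follows, assuming lpage_info_slot()
      and the disallow_lpage field keep their usual roles:

          /* Walk from the largest still-allowed page size down toward 4k. */
          for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
              struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, max_level);

              if (!linfo->disallow_lpage)
                  break;
          }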
    • KVM: x86/mmu: Refactor handling of cache consistency with TDP · cb9b88c6
      Sean Christopherson authored
      Pre-calculate the max level for a TDP page with respect to MTRR cache
      consistency in preparation of replacing force_pt_level with max_level,
      and eventually combining the bulk of nonpaging_page_fault() and
      tdp_page_fault() into a common helper.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cb9b88c6
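      The pre-calculation amounts to shrinking the allowed level until the
      whole candidate huge page has a consistent MTRR type.  A sketch; the
      helper is the existing kvm_mtrr_check_gfn_range_consistency(), the loop
      structure is an approximation:

          for (max_level = PT_MAX_HUGEPAGE_LEVEL;
               max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
              int page_num = KVM_PAGES_PER_HPAGE(max_level);
              gfn_t base = gfn & ~(page_num - 1);

              /* Never map a range with mixed MTRR types as one huge SPTE. */
              if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
                  break;
          }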
    • KVM: x86/mmu: Move nonpaging_page_fault() below try_async_pf() · 9f1a8526
      Sean Christopherson authored
      Move nonpaging_page_fault() below try_async_pf() to eliminate the
      forward declaration of try_async_pf() and to prepare for combining the
      bulk of nonpaging_page_fault() and tdp_page_fault() into a common
      helper.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9f1a8526
    • KVM: x86/mmu: Fold nonpaging_map() into nonpaging_page_fault() · 367fd790
      Sean Christopherson authored
      Fold nonpaging_map() into its sole caller, nonpaging_page_fault(), in
      preparation for combining the bulk of nonpaging_page_fault() and
      tdp_page_fault() into a common helper.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      367fd790
    • KVM: x86/mmu: Move definition of make_mmu_pages_available() up · ba7888dd
      Sean Christopherson authored
      Move make_mmu_pages_available() above its first user to put it closer
      to related code and eliminate a forward declaration.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ba7888dd
    • KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM · 736c291c
      Sean Christopherson authored
      Convert a plethora of parameters and variables in the MMU and page fault
      flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
      
      Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
      addresses.  When TDP is enabled, the fault address is a guest physical
      address and thus can be a 64-bit value, even when both KVM and its guest
      are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
      64-bit field, not a natural width field.
      
      Using a gva_t for the fault address means KVM will incorrectly drop the
      upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
      translate L2 GPAs to L1 GPAs.
      
      Opportunistically rename variables and parameters to better reflect the
      dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
      "addr" instead of "vaddr" when the address may be either a GVA or an L2
      GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
      a confusing "gpa_t gva" declaration; this also sets the stage for a
      future patch to combine nonpaging_page_fault() and tdp_page_fault() with
      minimal churn.
      
      Sprinkle in a few comments to document flows where an address is known
      to be a GVA and thus can be safely truncated to a 32-bit value.  Add
      WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
      document such cases and detect bugs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      736c291c
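      The truncation hazard follows directly from the types involved; the
      typedefs below match include/linux/kvm_types.h, the rest is a made-up
      illustration:

          typedef unsigned long gva_t;   /* 32 bits on a 32-bit kernel    */
          typedef u64           gpa_t;   /* always 64 bits (PAE/EPT/NPT)  */

          static void demo_truncation(void)
          {
              /* With TDP the fault address is a guest *physical* address. */
              gpa_t fault_addr = 0x123456000ULL;  /* GPA above 4GiB           */
              gva_t truncated  = fault_addr;      /* 0x23456000 on 32-bit KVM */

              (void)truncated;
          }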
    • KVM: x86: Add a WARN on TIF_NEED_FPU_LOAD in kvm_load_guest_fpu() · 95145c25
      Sean Christopherson authored
      WARN once in kvm_load_guest_fpu() if TIF_NEED_FPU_LOAD is observed, as
      that would mean that KVM is corrupting userspace's FPU by saving
      unknown register state into arch.user_fpu.  Add a comment to explain
      why KVM WARNs on TIF_NEED_FPU_LOAD instead of implementing logic
      similar to fpu__copy().
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      95145c25
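      The added guard is essentially one line plus a comment.  A sketch of
      the top of kvm_load_guest_fpu(); whether the real code uses WARN_ON or
      WARN_ON_ONCE is a detail not taken from the message:

          /*
           * If TIF_NEED_FPU_LOAD is set, the CPU registers do not hold
           * userspace's FPU state, so saving them into arch.user_fpu would
           * clobber it with unknown state.  KVM should never get here in
           * that condition, hence a WARN rather than replicating
           * fpu__copy()'s handling.
           */
          WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));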
    • KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platform · f958bd23
      Sean Christopherson authored
      Unlike most state managed by XSAVE, MPX is initialized to zero on INIT.
      Because INITs are usually recognized in the context of a VCPU_RUN call,
      kvm_vcpu_reset() puts the guest's FPU so that the FPU state is resident
      in memory, zeros the MPX state, and reloads FPU state to hardware.  But,
      in the unlikely event that an INIT is recognized during
      kvm_arch_vcpu_ioctl_get_mpstate() via kvm_apic_accept_events(),
      kvm_vcpu_reset() will call kvm_put_guest_fpu() without a preceding
      kvm_load_guest_fpu() and corrupt the guest's FPU state (and possibly
      userspace's FPU state as well).
      
      Given that MPX is being removed from the kernel[*], fix the bug with the
      simple-but-ugly approach of loading the guest's FPU during
      KVM_GET_MP_STATE.
      
      [*] See commit f240652b ("x86/mpx: Remove MPX APIs").
      
      Fixes: f775b13e ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f958bd23
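      A simplified sketch of the shape of the fix; populating mp_state is
      elided and the comment paraphrases the reasoning above:

          int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
                                              struct kvm_mp_state *mp_state)
          {
              vcpu_load(vcpu);

              /*
               * kvm_apic_accept_events() may recognize an INIT and call
               * kvm_vcpu_reset(), which puts and reloads the guest FPU
               * around zeroing the MPX state.  Load the guest FPU first so
               * that put/reload operate on state that is actually loaded.
               */
              if (kvm_mpx_supported())
                  kvm_load_guest_fpu(vcpu);

              kvm_apic_accept_events(vcpu);

              mp_state->mp_state = vcpu->arch.mp_state;   /* simplified */

              if (kvm_mpx_supported())
                  kvm_put_guest_fpu(vcpu);

              vcpu_put(vcpu);
              return 0;
          }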