1. 18 Jun, 2021 7 commits
  2. 17 Jun, 2021 33 commits
    • KVM: x86/mmu: Fix TDP MMU page table level · f1b83255
      Kai Huang authored
      The TDP MMU iterator's level is identical to the page table's actual
      level.  For instance, for the last-level page table (whose entries point
      to 4K pages), iter->level is 1 (PG_LEVEL_4K), and for the root page
      table under 5-level paging, iter->level is mmu->shadow_root_level, which
      is 5.  However, struct kvm_mmu_page's level is currently not set
      correctly when the page is allocated in kvm_tdp_mmu_map().  When the
      iterator hits a non-present SPTE and needs to allocate a new child page
      table, iter->level, which is the level of the page table to which the
      non-present SPTE belongs, is currently used.  This results in struct
      kvm_mmu_page's level always being its parent's level (except the root
      table's level, which is initialized explicitly using
      mmu->shadow_root_level).
      
      This is kinda wrong, and not consistent with the existing non-TDP-MMU
      code.  Fortunately sp->role.level is only used in handle_removed_tdp_mmu_page()
      and kvm_tdp_mmu_zap_sp(), and they are already aware of this and behave
      correctly.  However to make it consistent with legacy MMU code (and fix
      the issue that both root page table and its child page table have
      shadow_root_level), use iter->level - 1 in kvm_tdp_mmu_map(), and change
      handle_removed_tdp_mmu_page() and kvm_tdp_mmu_zap_sp() accordingly.
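      As a minimal sketch (helper names follow the 5.13-era TDP MMU code and
      are illustrative, not a verbatim quote of the patch):

        /* kvm_tdp_mmu_map(): the new child table sits one level below the
         * table containing the non-present SPTE, so allocate its struct
         * kvm_mmu_page with iter->level - 1 instead of iter->level. */
        sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);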
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <bcb6569b6e96cb78aaa7b50640e6e6b53291a74e.1623717884.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Fix pf_fixed count in tdp_mmu_map_handle_target_level() · 857f8474
      Kai Huang authored
      Currently pf_fixed is not increased when prefault is true.  This is not
      correct, since prefault here really means "async page fault completed".
      In that case, the original page fault from the guest was morphed into an
      async page fault and pf_fixed was not increased at that time.  So when
      prefault indicates the async page fault has completed, pf_fixed should
      be increased.
      
      Additionally, pf_fixed is currently increased even when the page fault
      is spurious, whereas the legacy MMU increases pf_fixed only when the
      page fault returns RET_PF_EMULATE or RET_PF_FIXED.
      
      To fix the above two issues, increase pf_fixed when the return value is
      not RET_PF_SPURIOUS (RET_PF_RETRY has already been ruled out by
      reaching here).
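      In code form, the fix amounts to keying the stat update off the return
      value instead of off prefault; a sketch of the tail of
      tdp_mmu_map_handle_target_level() (placement approximate):

        /* RET_PF_RETRY cannot reach this point, so count everything that
         * isn't spurious, including async-#PF completions (prefault) and
         * RET_PF_EMULATE, matching the legacy MMU. */
        if (ret != RET_PF_SPURIOUS)
                vcpu->stat.pf_fixed++;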
      
      More information:
      https://lore.kernel.org/kvm/cover.1620200410.git.kai.huang@intel.com/T/#mbb5f8083e58a2cd262231512b9211cbe70fc3bd5
      
      Fixes: bb18842e ("kvm: x86/mmu: Add TDP MMU PF handler")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <2ea8b7f5d4f03c99b32bc56fc982e1e4e3d3fc6b.1623717884.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Fix return value in tdp_mmu_map_handle_target_level() · 57a3e96d
      Kai Huang authored
      Currently tdp_mmu_map_handle_target_level() returns 0, which is
      RET_PF_RETRY, when the page fault is actually fixed.  This makes
      kvm_tdp_mmu_map() also return RET_PF_RETRY in this case, instead of
      RET_PF_FIXED.  Fix by initializing ret to RET_PF_FIXED.
      
      Note that kvm_mmu_page_fault() resumes the guest on both RET_PF_RETRY
      and RET_PF_FIXED, which means in practice returning either won't make a
      difference, so this fix alone isn't strictly necessary for the stable
      tree.
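      The fix itself is a one-line initialization, sketched:

        /* tdp_mmu_map_handle_target_level(): start from RET_PF_FIXED rather
         * than 0 (which aliases RET_PF_RETRY), so the successful SPTE
         * install path reports the fault as fixed. */
        int ret = RET_PF_FIXED;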
      
      Fixes: bb18842e ("kvm: x86/mmu: Add TDP MMU PF handler")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <f9e8956223a586cd28c090879a8ff40f5eb6d609.1623717884.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Keep stored TMCCT register value 0 after KVM_SET_LAPIC · 2735886c
      Wanpeng Li authored
      KVM_GET_LAPIC stores the current value of TMCCT, and KVM_SET_LAPIC's
      memcpy stores it in vcpu->arch.apic->regs.  KVM_SET_LAPIC can instead
      store zero in vcpu->arch.apic->regs after the memcpy, so that the stored
      value is always zero.  This is safe because the TMCCT is always computed
      on-demand and never directly readable.
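      A sketch of the shape of the fix, assuming it lands right after the
      register copy in kvm_apic_set_state() (the placement and the
      __kvm_lapic_set_reg() helper are assumptions here):

        memcpy(vcpu->arch.apic->regs, s->regs, sizeof(*s));
        /* TMCCT is computed on demand from the hrtimer and never read back
         * from the register page, so keep the stored copy at zero. */
        __kvm_lapic_set_reg(vcpu->arch.apic->regs, APIC_TMCCT, 0);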
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623223000-18116-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Introduce KVM_HC_MAP_GPA_RANGE hypercall · 0dbb1123
      Ashish Kalra authored
      This hypercall is used by the SEV guest to notify a change in the page
      encryption status to the hypervisor. The hypercall should be invoked
      only when the encryption attribute is changed from encrypted -> decrypted
      and vice versa. By default all guest pages are considered encrypted.
      
      The hypercall exits to userspace to manage the guest shared regions and
      integrate with the userspace VMM's migration code.
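      For reference, a guest-side invocation looks roughly like this (a0 =
      GPA, a1 = number of pages, a2 = attributes; the surrounding code is a
      sketch, the constants come from the uAPI header):

        /* Tell the hypervisor that [gpa, gpa + npages * 4K) is now shared
         * (decrypted).  KVM exits to userspace with KVM_EXIT_HYPERCALL so
         * the VMM can track shared regions for migration. */
        ret = kvm_hypercall3(KVM_HC_MAP_GPA_RANGE, gpa, npages,
                             KVM_MAP_GPA_RANGE_DECRYPTED |
                             KVM_MAP_GPA_RANGE_PAGE_SZ_4K);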
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: switch per-VM stats to u64 · e3cb6fa0
      Paolo Bonzini authored
      Make them the same type as vCPU stats.  There is no reason
      to limit the counters to unsigned long.
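      Sketch of the resulting definition (field list abbreviated):

        struct kvm_vm_stat {
                u64 mmu_shadow_zapped;
                u64 mmu_pte_write;
                u64 remote_tlb_flush;
                u64 nx_lpage_splits;
                /* ...remaining counters likewise switch from ulong to u64. */
        };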
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Grab nx_lpage_splits as an unsigned long before division · ade74e14
      Sean Christopherson authored
      Snapshot kvm->stats.nx_lpage_splits into a local unsigned long to avoid
      64-bit division on 32-bit kernels.  Casting to an unsigned long is safe
      because the maximum number of shadow pages, n_max_mmu_pages, is also an
      unsigned long, i.e. KVM will start recycling shadow pages before the
      number of splits can exceed a 32-bit value.
      
        ERROR: modpost: "__udivdi3" [arch/x86/kvm/kvm.ko] undefined!
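
      Sketch of the fix in kvm_recover_nx_lpages() (names per the code of
      that era):

        /* Snapshot the u64 stat into an unsigned long: the value is bounded
         * by n_max_mmu_pages (an unsigned long), and the unsigned long
         * division avoids __udivdi3 on 32-bit kernels. */
        unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
        unsigned long to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;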
      
      Fixes: 7ee093d4f3f5 ("KVM: switch per-VM stats to u64")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210615162905.2132937-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Check for pending interrupts when APICv is getting disabled · bca66dbc
      Vitaly Kuznetsov authored
      When APICv is active, interrupt injection doesn't raise a KVM_REQ_EVENT
      request (see __apic_accept_irq()) as the required work is done by hardware.
      In case KVM_REQ_APICV_UPDATE collides with such injection, the interrupt
      may never get delivered.
      
      Currently, the described situation is hardly possible: all
      kvm_request_apicv_update() calls normally happen upon VM creation when
      no interrupts are pending. We are, however, going to move unconditional
      kvm_request_apicv_update() call from kvm_hv_activate_synic() to
      synic_update_vector() and without this fix 'hyperv_connections' test from
      kvm-unit-tests gets stuck on IPI delivery attempt right after configuring
      a SynIC route which triggers APICv disablement.
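      The fix, approximately (at the end of kvm_vcpu_update_apicv(); the
      exact placement is a sketch):

        /* An interrupt accepted while APICv was active did not set
         * KVM_REQ_EVENT; re-check pending interrupts now that injection
         * is software-driven again. */
        if (!vcpu->arch.apicv_active)
                kvm_make_request(KVM_REQ_EVENT, vcpu);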
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210609150911.1471882c-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Drop redundant checks on vmcs12 in EPTP switching emulation · c5ffd408
      Sean Christopherson authored
      Drop the explicit check on EPTP switching being enabled.  The EPTP
      switching check is handled in the generic VMFUNC function check, while
      the underlying VMFUNC enablement check is done by hardware and redone
      by generic VMFUNC emulation.
      
      The vmcs12 EPT check is handled by KVM at VM-Enter in the form of a
      consistency check; keep it, but add a WARN.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: WARN if subtly-impossible VMFUNC conditions occur · 546e8398
      Sean Christopherson authored
      WARN and inject #UD when emulating VMFUNC for L2 if the function is
      out-of-bounds or if VMFUNC is not enabled in vmcs12.  Neither condition
      should occur in practice, as the CPU is supposed to prioritize the #UD
      over VM-Exit for out-of-bounds input and KVM is supposed to enable
      VMFUNC in vmcs02 if and only if it's enabled in vmcs12, but neither of
      those dependencies is obvious.
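      Sketch of the added guard in the VMFUNC exit handler:

        /* The CPU gives #UD priority over VM-Exit for function > 63, and
         * VMFUNC is enabled in vmcs02 iff it's enabled in vmcs12, so
         * neither condition should be reachable; WARN and inject #UD. */
        if (WARN_ON_ONCE(function > 63 || !nested_cpu_has_vmfunc(vmcs12))) {
                kvm_queue_exception(vcpu, UD_VECTOR);
                return 1;
        }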
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop pointless @reset_roots from kvm_init_mmu() · c9060662
      Sean Christopherson authored
      Remove the @reset_roots param from kvm_init_mmu(); the one user,
      kvm_mmu_reset_context(), has already unloaded the MMU and thus freed and
      invalidated all roots.  This also happens to be why the reset_roots=true
      path doesn't leak roots; they're already invalid.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Defer MMU sync on PCID invalidation · e62f1aa8
      Sean Christopherson authored
      Defer the MMU sync on PCID invalidation so that multiple sync requests in
      a single VM-Exit are batched.  This is a very minor optimization as
      checking for unsync'd children is quite cheap.
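      In code, the direct sync becomes a request; sketched (placement
      approximate):

        /* Was a direct kvm_mmu_sync_roots(vcpu); syncing via request lets
         * several invalidations in one VM-Exit collapse into a single sync
         * before reentering the guest. */
        kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);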
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-13-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Use fast PGD switch when emulating VMFUNC[EPTP_SWITCH] · 39353ab5
      Sean Christopherson authored
      Use __kvm_mmu_new_pgd() via kvm_init_shadow_ept_mmu() to emulate
      VMFUNC[EPTP_SWITCH] instead of nuking all MMUs.  EPTP_SWITCH is the EPT
      equivalent of MOV to CR3, i.e. is a perfect fit for the common PGD flow,
      the only hiccup being that A/D enabling is buried in the EPTP.  But, that
      is easily handled by bouncing through kvm_init_shadow_ept_mmu().
      
      Explicitly request a guest TLB flush if VPID is disabled.  Per Intel's
      SDM, if VPID is disabled, "an EPTP-switching VMFUNC invalidates combined
      mappings associated with VPID 0000H (for all PCIDs and for all EP4TA
      values, where EP4TA is the value of bits 51:12 of EPTP)".
      
      Note, this technically is a very bizarre bug fix of sorts if L2 is using
      PAE paging, as avoiding the full MMU reload also avoids incorrectly
      reloading the PDPTEs, which the SDM explicitly states are not touched:
      
        If PAE paging is in use, an EPTP-switching VMFUNC does not load the
        four page-directory-pointer-table entries (PDPTEs) from the
        guest-physical address in CR3. The logical processor continues to use
        the four guest-physical addresses already present in the PDPTEs. The
        guest-physical address in CR3 is not translated through the new EPT
        paging structures (until some operation that would load the PDPTEs).
      
      In addition to optimizing L2's MMU shenanigans, avoiding the full reload
      also optimizes L1's MMU as KVM_REQ_MMU_RELOAD wipes out all roots in both
      root_mmu and guest_mmu.
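      Sketch of the resulting emulation flow (helper names approximate the
      patch):

        if (vmcs12->ept_pointer != new_eptp) {
                vmcs12->ept_pointer = new_eptp;
                /* Common fast-PGD path; A/D enabling is part of the EPTP
                 * and is handled via kvm_init_shadow_ept_mmu(). */
                nested_ept_new_eptp(vcpu);

                /* Without VPID, the SDM requires invalidating combined
                 * mappings tagged with VPID 0000H. */
                if (!nested_cpu_has_vpid(vmcs12))
                        kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
        }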
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use KVM_REQ_TLB_FLUSH_GUEST to handle INVPCID(ALL) emulation · 28f28d45
      Sean Christopherson authored
      Use KVM_REQ_TLB_FLUSH_GUEST instead of KVM_REQ_MMU_RELOAD when emulating
      INVPCID of all contexts.  In the current code, this is a glorified nop as
      TLB_FLUSH_GUEST becomes kvm_mmu_unload(), same as MMU_RELOAD, when TDP
      is disabled, which is the only time INVPCID is intercepted+emulated.
      In the future, reusing TLB_FLUSH_GUEST will simplify optimizing paths
      that emulate a guest TLB flush, e.g. by synchronizing as needed instead
      of completely unloading all MMUs.
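      Sketch of the change in kvm_handle_invpcid():

        case INVPCID_TYPE_ALL_NON_GLOBAL:
                /* KVM doesn't track global entries; fall through. */
                fallthrough;
        case INVPCID_TYPE_ALL_INCL_GLOBAL:
                /* Was KVM_REQ_MMU_RELOAD; a guest-scoped flush is
                 * equivalent here and allows future sync-based paths. */
                kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
                return kvm_skip_emulated_instruction(vcpu);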
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Free only guest_mode (L2) roots on INVVPID w/o EPT · 25b62c62
      Sean Christopherson authored
      When emulating INVVPID for L1, free only L2+ roots, using the guest_mode
      tag in the MMU role to identify L2+ roots.  From L1's perspective, its
      own TLB entries use VPID=0, and INVVPID is not required to invalidate such
      entries.  Per Intel's SDM, INVVPID _may_ invalidate entries with VPID=0,
      but it is not required to do so.
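      One plausible shape of the targeted freeing (the loop is illustrative,
      not the literal patch; the key point is keying off role.guest_mode):

        unsigned long roots_to_free = 0;
        int i;

        /* Free only roots created for L2 (role.guest_mode = 1); L1's own
         * entries use VPID=0 and may be kept. */
        for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
                struct kvm_mmu_page *sp =
                        to_shadow_page(mmu->prev_roots[i].hpa);

                if (sp && sp->role.guest_mode)
                        roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
        }
        kvm_mmu_free_roots(vcpu, mmu, roots_to_free);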
      
      Cc: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Consolidate VM-Enter/VM-Exit TLB flush and MMU sync logic · 50a41796
      Sean Christopherson authored
      Drop the dedicated nested_vmx_transition_mmu_sync() now that the MMU sync
      is handled via KVM_REQ_TLB_FLUSH_GUEST, and fold that flush into the
      all-encompassing nested_vmx_transition_tlb_flush().
      
      Opportunistically add a comment explaining why nested EPT never needs to
      sync the MMU on VM-Enter.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop skip MMU sync and TLB flush params from "new PGD" helpers · b5129100
      Sean Christopherson authored
      Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
      all call sites unconditionally skip both the sync and flush.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: Move TLB flushing logic (or lack thereof) to dedicated helper · d2e56019
      Sean Christopherson authored
      Introduce nested_svm_transition_tlb_flush() and use it to force an MMU sync
      and TLB flush on nSVM VM-Enter and VM-Exit instead of sneaking the logic
      into the __kvm_mmu_new_pgd() call sites.  Add a partial todo list to
      document issues that need to be addressed before the unconditional sync
      and flush can be modified to look more like nVMX's logic.
      
      In addition to making nSVM's forced flushing more overt (guess who keeps
      losing track of it), the new helper brings further convergence between
      nSVM and nVMX, and also sets the stage for dropping the "skip" params
      from __kvm_mmu_new_pgd().
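      The new helper, approximately as added (the TODO wording here is a
      paraphrase):

        static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
        {
                /*
                 * TODO: optimize the unconditional sync/flush, e.g. honor
                 * the guest's TLB_CONTROL and track ASIDs per VMCB, to get
                 * closer to nVMX's logic.
                 */
                kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
                kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
        }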
      
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Unconditionally skip MMU sync/TLB flush in MOV CR3's PGD switch · 415b1a01
      Sean Christopherson authored
      Stop leveraging the MMU sync and TLB flush requested by the fast PGD
      switch helper now that kvm_set_cr3() manually handles the necessary sync,
      frees, and TLB flush.  This will allow dropping the params from the fast
      PGD helpers since nested SVM is now the odd blob out.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush · 21823fbd
      Sean Christopherson authored
      Flush and sync all PGDs for the current/target PCID on MOV CR3 with a
      TLB flush, i.e. without PCID_NOFLUSH set.  Paraphrasing Intel's SDM
      regarding the behavior of MOV to CR3:
      
        - If CR4.PCIDE = 0, invalidates all TLB entries associated with PCID
          000H and all entries in all paging-structure caches associated with
          PCID 000H.
      
        - If CR4.PCIDE = 1 and NOFLUSH=0, invalidates all TLB entries
          associated with the PCID specified in bits 11:0, and all entries in
          all paging-structure caches associated with that PCID. It is not
          required to invalidate entries in the TLBs and paging-structure
          caches that are associated with other PCIDs.
      
        - If CR4.PCIDE=1 and NOFLUSH=1, is not required to invalidate any TLB
          entries or entries in paging-structure caches.
      
      Extract and reuse the logic for INVPCID(single) which is effectively the
      same flow and works even if CR4.PCIDE=0, as the current PCID will be '0'
      in that case, thus honoring the requirement of flushing PCID=0.
      
      Continue passing skip_tlb_flush to kvm_mmu_new_pgd() even though it
      _should_ be redundant; the clean up will be done in a future patch.  The
      overhead of an unnecessary nop sync is minimal (especially compared to
      the actual sync), and the TLB flush is handled via request.  Avoiding
      the negligible overhead is not worth the risk of breaking kernels that
      backport the fix.
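      Sketch of the extracted helper and its call site (names follow the
      patch; details are approximate):

        static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu,
                                        unsigned long pcid)
        {
                struct kvm_mmu *mmu = vcpu->arch.mmu;
                unsigned long roots_to_free = 0;
                int i;

                /* Flush/sync every cached PGD tagged with this PCID, not
                 * just the current one.  With CR4.PCIDE=0 the PCID is 0. */
                for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
                        if (kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd) == pcid)
                                roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);

                kvm_mmu_free_roots(vcpu, mmu, roots_to_free);
        }

        /* kvm_set_cr3(), after the PGD switch: */
        if (!skip_tlb_flush)
                kvm_invalidate_pcid(vcpu, kvm_get_active_pcid(vcpu));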
      
      Fixes: 956bf353 ("kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest")
      Cc: Junaid Shahid <junaids@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Don't clobber nested MMU's A/D status on EPTP switch · 272b0a99
      Sean Christopherson authored
      Drop bogus logic that incorrectly clobbers the accessed/dirty enabling
      status of the nested MMU on an EPTP switch.  When nested EPT is enabled,
      walk_mmu points at L2's _legacy_ page tables, not L1's EPT for L2.
      
      This is likely a benign bug, as mmu->ept_ad is never consumed (since the
      MMU is not a nested EPT MMU), and stuffing mmu_role.base.ad_disabled will
      never propagate into future shadow pages since the nested MMU isn't used
      to map anything, just to walk L2's page tables.
      
      Note, KVM also does a full MMU reload, i.e. the guest_mmu will be
      recreated using the new EPTP, and thus any change in A/D enabling will be
      properly recognized in the relevant MMU.
      
      Fixes: 41ab9372 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Ensure 64-bit shift when checking VMFUNC bitmap · 0e75225d
      Sean Christopherson authored
      Use BIT_ULL() instead of an open-coded shift to check whether or not a
      function is enabled in L1's VMFUNC bitmap.  This is a benign bug as KVM
      supports only bit 0, and will fail VM-Enter if any other bits are set,
      i.e. bits 63:32 are guaranteed to be zero.
      
      Note, "function" is bounded by hardware as VMFUNC will #UD before taking
      a VM-Exit if the function is greater than 63.
      
      Before:
        if ((vmcs12->vm_function_control & (1 << function)) == 0)
         0x000000000001a916 <+118>:	mov    $0x1,%eax
         0x000000000001a91b <+123>:	shl    %cl,%eax
         0x000000000001a91d <+125>:	cltq
         0x000000000001a91f <+127>:	and    0x128(%rbx),%rax
      
      After:
        if (!(vmcs12->vm_function_control & BIT_ULL(function & 63)))
         0x000000000001a955 <+117>:	mov    0x128(%rbx),%rdx
         0x000000000001a95c <+124>:	bt     %rax,%rdx
      
      Fixes: 27c42a1b ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Sync all PGDs on nested transition with shadow paging · 07ffaf34
      Sean Christopherson authored
      Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
      VM-Exit when VPID is disabled for L2.  kvm_mmu_new_pgd() syncs only the
      current PGD, which can theoretically leave stale, unsync'd entries in a
      previous guest PGD, which could be consumed if L2 is allowed to load CR3
      with PCID_NOFLUSH=1.
      
      Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
      be utilized for its obvious purpose of emulating a guest TLB flush.
      
      Note, there is no change to the actual TLB flush executed by KVM, even
      though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT.  When VPID is
      disabled for L2, vpid02 is guaranteed to be '0', and thus
      nested_get_vpid02() will return the VPID that is shared by L1 and L2.
      
      Generate the request outside of kvm_mmu_new_pgd(), as getting the common
      helper to correctly identify which request is needed is quite painful.
      E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
      a TLB flush from the L1 kernel's perspective does not invalidate EPT
      mappings.  And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
      simplification by moving the logic into nested_vmx_transition_tlb_flush().
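      Sketch of the request site on nested transitions (approximate):

        /* VPID disabled for L2: the transition architecturally flushes the
         * shared VPID, so flush and sync all of the guest's PGDs, not just
         * the current one (a stale previous PGD could otherwise be reused
         * via MOV CR3 with PCID_NOFLUSH=1). */
        if (!nested_cpu_has_vpid(vmcs12))
                kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);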
      
      Fixes: 41fab65e ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never lost · 8f7663ce
      Vitaly Kuznetsov authored
      Do KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE for a freshly restored VM
      (before the first KVM_RUN) to check that KVM_STATE_NESTED_EVMCS is not
      lost.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-12-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Request to sync eVMCS from VMCS12 after migration · 8629b625
      Vitaly Kuznetsov authored
      VMCS12 is used to keep the authoritative state during nested state
      migration.  In case the 'need_vmcs12_to_shadow_sync' flag is set, we're
      in between an L2->L1 vmexit and an L1 guest run, when the actual sync to
      the enlightened (or shadow) VMCS happens.  Nested state, however, has
      no flag for 'need_vmcs12_to_shadow_sync', so vmx_set_nested_state()->
      set_current_vmptr() always sets it.  The enlightened vmptrld path lacks
      that quirk, so some VMCS12 changes may not get properly reflected to the
      eVMCS and L1 will see an incorrect state.
      
      Note, during L2 execution or when need_vmcs12_to_shadow_sync is not
      set the change is effectively a nop: in the former case all changes
      will get reflected during the first L2->L1 vmexit and in the latter
      case VMCS12 and eVMCS are already in sync (thanks to
      copy_enlightened_to_vmcs12() in vmx_get_nested_state()).
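      The core of the fix is a one-line request in the enlightened vmptrld
      path, sketched (exact placement is an assumption):

        /* Mirror set_current_vmptr(): the restored nested state may have
         * been saved between an L2->L1 vmexit and the next L1 run, so
         * request a VMCS12 -> eVMCS sync. */
        vmx->nested.need_vmcs12_to_shadow_sync = true;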
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-11-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02() · dc313385
      Vitaly Kuznetsov authored
      When nested state migration happens during L1's execution, it is
      incorrect to modify the eVMCS, as it is L1 who 'owns' it at that moment.
      At least genuine Hyper-V seems to be rather unhappy when the 'clean
      fields' data changes underneath it.
      
      'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12()
      and prepare_vmcs02_rare(), so we can reset it from prepare_vmcs02() instead.
      
      While at it, update a comment stating why exactly we need to reset
      'hv_clean_fields' data from L0.
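      Sketch of the relocated reset (constant name per the Hyper-V TLFS
      definitions in the KVM tree):

        /* prepare_vmcs02(): L0 has consumed the clean-fields hints for this
         * vmentry; clear them here rather than in
         * copy_enlightened_to_vmcs12(), so L1's copy isn't modified while
         * L1 is running (e.g. during nested state migration). */
        vmx->nested.hv_evmcs->hv_clean_fields &=
                ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;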
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-10-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid() · b7685cfd
      Vitaly Kuznetsov authored
      'need_vmcs12_to_shadow_sync' is used for both shadow and enlightened
      VMCS sync when we exit to L1.  The comment in nested_vmx_failValid()
      validly states why the shadow VMCS sync can be omitted, but this doesn't
      apply to the enlightened VMCS, as it 'shadows' all VMCS12 fields.
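      The added hunk, roughly:

        /* VM_INSTRUCTION_ERROR is not shadowed, so the shadow VMCS sync can
         * still be skipped, but the eVMCS 'shadows' every field and must be
         * written back before L1 resumes. */
        if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr))
                vmx->nested.need_vmcs12_to_shadow_sync = true;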
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-9-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state() · d6bf71a1
      Vitaly Kuznetsov authored
      'Clean fields' data from the enlightened VMCS is only valid upon
      vmentry: the L1 hypervisor is not obliged to keep it up-to-date while it
      is mangling L2's state, so a KVM_GET_NESTED_STATE request may come at a
      wrong moment, when actual eVMCS changes are unsynchronized with
      'hv_clean_fields'.  As VMCS12 is used as the source of ultimate truth
      upon migration, we must make sure we pick up all the changes to the
      eVMCS, and thus the 'clean fields' data must be ignored.
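      The call-site change, roughly (copy_enlightened_to_vmcs12() gains an
      hv_clean_fields parameter, and passing 0 treats every field as dirty):

        /* vmx_get_nested_state(): the clean-fields data may be stale
         * outside of vmentry, so copy the entire eVMCS into vmcs12. */
        copy_enlightened_to_vmcs12(vmx, 0);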
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-8-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Release enlightened VMCS on VMCLEAR · 3b19b81a
      Vitaly Kuznetsov authored
      Unlike VMREAD/VMWRITE/VMPTRLD, VMCLEAR is a valid instruction when an
      enlightened VMCS is in use.  The TLFS has the following brief
      description: "The L1 hypervisor can execute a VMCLEAR instruction to
      transition an enlightened VMCS from the active to the non-active
      state".  Normally, this change can be ignored, as unmapping the active
      eVMCS can be postponed until the next VMLAUNCH instruction, but when
      nested state is migrated with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE,
      keeping the eVMCS mapped may result in its synchronization with VMCS12,
      which is incorrect: the L1 hypervisor is free to reuse inactive eVMCS
      memory for something else.
      
      An inactive eVMCS can therefore simply be unmapped after VMCLEAR.
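      Sketch of the handle_vmclear() change (the condition's exact shape is
      an assumption):

        /* VMCLEAR transitions the active eVMCS to non-active; unmap it now
         * instead of keeping it around until the next VMLAUNCH. */
        if (vmptr == vmx->nested.hv_evmcs_vmptr)
                nested_release_evmcs(vcpu);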
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-7-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Introduce 'EVMPTR_MAP_PENDING' post-migration state · 27849968
      Vitaly Kuznetsov authored
      Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
      cannot be called directly from vmx_set_nested_state() as KVM may not have
      all the information yet (e.g. the HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
      restored yet).  The enlightened VMCS is mapped later, while getting the
      nested state pages.  In the meantime, vmx->nested.hv_evmcs_vmptr remains
      'EVMPTR_INVALID', which is indistinguishable from the 'eVMCS is not in
      use' case.  This leads to certain issues; in particular, if
      KVM_GET_NESTED_STATE is called right after KVM_SET_NESTED_STATE, the
      KVM_STATE_NESTED_EVMCS flag in the resulting state will be unset (and
      such state will later fail to load).
      
      Introduce 'EVMPTR_MAP_PENDING' state to detect not-yet-mapped eVMCS after
      restore. With this, the 'is_guest_mode(vcpu)' hack in vmx_has_valid_vmcs12()
      is no longer needed.
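      The new state and the updated validity check, per the patch:

        #define EVMPTR_INVALID     (-1ULL)
        #define EVMPTR_MAP_PENDING (-2ULL)

        static inline bool evmptr_is_valid(u64 evmptr)
        {
                /* MAP_PENDING: eVMCS known from nested state but not yet
                 * mapped while getting the nested state pages. */
                return evmptr != EVMPTR_INVALID && evmptr != EVMPTR_MAP_PENDING;
        }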
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-6-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Make copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() return 'void' · 25641caf
      Vitaly Kuznetsov authored
      copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() don't return
      any result; make them return 'void'.
      
      No functional change intended.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-5-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Release eVMCS when enlightened VMENTRY was disabled · 02761716
      Vitaly Kuznetsov authored
      In theory, L1 can try to disable enlightened VMENTRY in the VP assist
      page and then issue VMLAUNCH/VMRESUME.  While
      nested_vmx_handle_enlightened_vmptrld() properly handles this as
      'EVMPTRLD_DISABLED', the previously mapped eVMCS remains mapped and thus
      all evmptr_is_valid() checks will still pass and nested_vmx_run() will
      proceed when it shouldn't.
      
      Release eVMCS immediately when we detect that enlightened vmentry was
      disabled by L1.
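      The fix, approximately (in nested_vmx_handle_enlightened_vmptrld()):

        if (unlikely(!nested_enlightened_vmentry(vcpu, &evmcs_gpa))) {
                nested_release_evmcs(vcpu);     /* drop any stale mapping */
                return EVMPTRLD_DISABLED;
        }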
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Don't set 'dirty_vmcs12' flag on enlightened VMPTRLD · 6a789ca5
      Vitaly Kuznetsov authored
      'dirty_vmcs12' is only checked in prepare_vmcs02_early()/prepare_vmcs02()
      and both checks look like:
      
       'vmx->nested.dirty_vmcs12 || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)'
      
      so for the eVMCS case the flag changes nothing.  Drop the assignment to avoid
      the confusion.
      
      No functional change intended.
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210526132026.270394-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>