17 Apr, 2021 (40 commits)
    • KVM: Move x86's MMU notifier memslot walkers to generic code · 3039bcc7
      Sean Christopherson authored
      Move the hva->gfn lookup for MMU notifiers into common code.  Every arch
      does a similar lookup, and some arch code is all but identical across
      multiple architectures.
      
      In addition to consolidating code, this will allow introducing
      optimizations that will benefit all architectures without incurring
      multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
      relevant range exists in the memslots.
      
      The use of __always_inline to avoid indirect call retpolines, as done by
      x86, may also benefit other architectures.
      
      Consolidating the lookups also fixes a wart in x86, where the legacy MMU
      and TDP MMU each do their own memslot walks.
      
      Lastly, future enhancements to the memslot implementation, e.g. to add an
      interval tree to track host address, will need to touch far less arch
      specific code.
      
      MIPS, PPC, and arm64 will be converted one at a time in future patches.
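      As a rough standalone model of the intended flow (types and names here are
      simplified stand-ins, not KVM's real structures), the generic handler resolves
      the hva overlap per memslot and takes mmu_lock only when at least one slot is
      actually affected:
      
         #include <stdbool.h>
         #include <stdio.h>
         
         #define PAGE_SHIFT 12
         
         struct memslot {                      /* simplified stand-in for a KVM memslot */
                 unsigned long userspace_addr; /* start hva */
                 unsigned long npages;
                 unsigned long base_gfn;
         };
         
         /* Clamp the notifier's hva range to one slot; false if no overlap. */
         static bool hva_to_gfn_range(const struct memslot *s,
                                      unsigned long start, unsigned long end,
                                      unsigned long *gs, unsigned long *ge)
         {
                 unsigned long s_start = s->userspace_addr;
                 unsigned long s_end = s_start + (s->npages << PAGE_SHIFT);
                 unsigned long lo = start > s_start ? start : s_start;
                 unsigned long hi = end < s_end ? end : s_end;
         
                 if (lo >= hi)
                         return false;
                 *gs = s->base_gfn + ((lo - s_start) >> PAGE_SHIFT);
                 *ge = s->base_gfn + ((hi - s_start + (1UL << PAGE_SHIFT) - 1) >> PAGE_SHIFT);
                 return true;
         }
         
         static void handle_hva_range(struct memslot *slots, int n,
                                      unsigned long start, unsigned long end)
         {
                 bool locked = false;
                 unsigned long gs, ge;
         
                 for (int i = 0; i < n; i++) {
                         if (!hva_to_gfn_range(&slots[i], start, end, &gs, &ge))
                                 continue;
                         if (!locked) {
                                 locked = true;
                                 printf("take mmu_lock\n");   /* only if a slot overlaps */
                         }
                         printf("arch handler for gfns [%#lx, %#lx)\n", gs, ge);
                 }
                 if (locked)
                         printf("release mmu_lock\n");
         }
         
         int main(void)
         {
                 struct memslot slots[] = { { 0x100000, 16, 0x100 }, { 0x800000, 16, 0x800 } };
         
                 handle_hva_range(slots, 2, 0x101000, 0x103000);
                 return 0;
         }
      
      In this model a notification that misses every memslot never takes mmu_lock,
      which is the kind of optimization the consolidation is meant to enable.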
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210402005658.3024832-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Assert that notifier count is elevated in .change_pte() · c13fda23
      Sean Christopherson authored
      In KVM's .change_pte() notification callback, replace the notifier
      sequence bump with a WARN_ON assertion that the notifier count is
      elevated.  An elevated count provides stricter protections than bumping
      the sequence, and the sequence is guaranteed to be bumped before the
      count hits zero.
      
      When .change_pte() was added by commit 828502d3 ("ksm: add
      mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary
      as .change_pte() would be invoked without any surrounding notifications.
      
      However, since commit 6bdb913f ("mm: wrap calls to set_pte_at_notify
      with invalidate_range_start and invalidate_range_end"), all calls to
      .change_pte() are guaranteed to be surrounded by start() and end(), and
      so are guaranteed to run with an elevated notifier count.
      
      Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a
      bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats
      the purpose of .change_pte().  Every arch's kvm_set_spte_hva() assumes
      .change_pte() is called when the relevant SPTE is present in KVM's MMU,
      as the original goal was to accelerate Kernel Samepage Merging (KSM) by
      updating KVM's SPTEs without requiring a VM-Exit (due to invalidating
      the SPTE).  I.e. it means that .change_pte() is effectively dead code
      on _all_ architectures.
      
      x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
      is guaranteed due to the prior invalidation.  PPC simply unmaps the SPTE,
      which again should be a nop due to the invalidation.  arm64 is a bit
      murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
      called without a cache pointer, which means it will map an entry if and
      only if an existing PTE was found.
      
      For now, take advantage of the bug to simplify future consolidation of
      KVM's MMU notifier code.  Doing so will not greatly complicate fixing
      .change_pte(), assuming it's even worth fixing.  .change_pte() has been
      broken for 8+ years and no one has complained.  Even if there are
      KSM+KVM users that care deeply about its performance, the benefits of
      avoiding VM-Exits via .change_pte() need to be reevaluated to justify
      the added complexity and testing burden.  Ripping out .change_pte()
      entirely would be a lot easier.
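      A toy model of the invariant being asserted (plain C, not KVM's actual
      notifier code): start() elevates a count, end() bumps the sequence before
      dropping the count, and change_pte() now merely asserts the count is non-zero:
      
         #include <assert.h>
         #include <stdio.h>
         
         struct notifier_state {
                 long count;   /* in-progress invalidations */
                 long seq;     /* bumped once per completed invalidation */
         };
         
         static void invalidate_range_start(struct notifier_state *s) { s->count++; }
         
         static void invalidate_range_end(struct notifier_state *s)
         {
                 s->seq++;     /* sequence is bumped before the count can hit zero */
                 s->count--;
         }
         
         static void change_pte(struct notifier_state *s)
         {
                 /* Guaranteed by the surrounding start()/end() pair. */
                 assert(s->count > 0);
                 printf("change_pte: count=%ld seq=%ld\n", s->count, s->seq);
         }
         
         int main(void)
         {
                 struct notifier_state s = { 0, 0 };
         
                 invalidate_range_start(&s);
                 change_pte(&s);
                 invalidate_range_end(&s);
                 return 0;
         }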
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MIPS: defer flush to generic MMU notifier code · fe9a5b05
      Paolo Bonzini authored
      Return 1 from kvm_unmap_hva_range and kvm_set_spte_hva if a flush is
      needed, so that the generic code can coalesce the flushes.
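      A minimal sketch of the split (names here are hypothetical, not the MIPS
      code): the arch hook only reports whether a flush is needed, and the generic
      handler ORs the results together and flushes once:
      
         #include <stdbool.h>
         #include <stdio.h>
         
         /* Arch hook: unmap one range, report whether a TLB flush is needed. */
         static bool arch_unmap_range(unsigned long start, unsigned long end)
         {
                 printf("unmap [%#lx, %#lx)\n", start, end);
                 return true;      /* the "return 1" from the commit message */
         }
         
         /* Generic notifier path: coalesce the flushes from all ranges. */
         static void generic_unmap_hva_ranges(unsigned long (*r)[2], int n)
         {
                 bool flush = false;
         
                 for (int i = 0; i < n; i++)
                         flush |= arch_unmap_range(r[i][0], r[i][1]);
                 if (flush)
                         printf("one coalesced remote TLB flush\n");
         }
         
         int main(void)
         {
                 unsigned long ranges[][2] = { { 0x1000, 0x3000 }, { 0x8000, 0x9000 } };
         
                 generic_unmap_hva_ranges(ranges, 2);
                 return 0;
         }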
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MIPS: let generic code call prepare_flush_shadow · 566a0bee
      Paolo Bonzini authored
      Since all calls to kvm_flush_remote_tlbs must be preceded by
      kvm_mips_callbacks->prepare_flush_shadow, repurpose
      kvm_arch_flush_remote_tlb to invoke it.  This makes it possible
      to use the TLB flushing mechanism provided by the generic MMU
      notifier code.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MIPS: rework flush_shadow_* callbacks into one that prepares the flush · 5194552f
      Paolo Bonzini authored
      Both trap-and-emulate and VZ have a single implementation that covers
      both .flush_shadow_all and .flush_shadow_memslot, and both of them end
      with a call to kvm_flush_remote_tlbs.
      
      Unify the callbacks into one and extract the call to kvm_flush_remote_tlbs.
      The next patches will pull it further out of the architecture-specific
      MMU notifier functions kvm_unmap_hva_range and kvm_set_spte_hva.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: constify kvm_arch_flush_remote_tlbs_memslot · 6c9dd6d2
      Paolo Bonzini authored
      memslots are stored in RCU and there should be no need to
      change them.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Explicitly use GFP_KERNEL_ACCOUNT for 'struct kvm_vcpu' allocations · 85f47930
      Sean Christopherson authored
      Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious that
      the allocations are accounted, to make it easier to audit KVM's
      allocations in the future, and to be consistent with other cache usage in
      KVM.
      
      When using SLAB/SLUB, this is a nop as the cache itself is created with
      SLAB_ACCOUNT.
      
      When using SLOB, there are caveats within caveats.  SLOB doesn't honor
      SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU
      allocations now being accounted.   But, even that depends on internal
      SLOB details as SLOB will only go to the page allocator when its cache is
      depleted.  That just happens to be extremely likely for vCPUs because the
      size of kvm_vcpu is larger than a page for almost all combinations of
      architecture and page size.  Whether or not the SLOB behavior is by
      design is unknown; it's just as likely that no SLOB users care about
      accounting and so no one has bothered to implement support in SLOB.
      Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup
      users, if any exist.
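      A hedged kernel-style sketch of the pattern (names made up, not KVM's actual
      kvm_vcpu_cache setup): the cache carries SLAB_ACCOUNT, and the call site still
      passes GFP_KERNEL_ACCOUNT so the accounting intent is explicit:
      
         #include <linux/errno.h>
         #include <linux/slab.h>
         
         struct example_vcpu {
                 unsigned long state[64];      /* placeholder payload */
         };
         
         static struct kmem_cache *example_vcpu_cache;
         
         static int example_cache_init(void)
         {
                 /* SLAB_ACCOUNT makes SLUB/SLAB account every object in the cache. */
                 example_vcpu_cache = kmem_cache_create("example_vcpu",
                                                        sizeof(struct example_vcpu),
                                                        0, SLAB_ACCOUNT, NULL);
                 return example_vcpu_cache ? 0 : -ENOMEM;
         }
         
         static struct example_vcpu *example_vcpu_alloc(void)
         {
                 /* GFP_KERNEL_ACCOUNT charges the caller's memcg; redundant for
                  * SLUB/SLAB given SLAB_ACCOUNT above, but it is what makes SLOB
                  * account at all. */
                 return kmem_cache_zalloc(example_vcpu_cache, GFP_KERNEL_ACCOUNT);
         }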
      Reviewed-by: Wanpeng Li <kernellwp@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210406190740.4055679-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: protect TDP MMU pages only down to required level · dbb6964e
      Paolo Bonzini authored
      When using manual protection of dirty pages, it is not necessary
      to protect nested page tables down to the 4K level; instead KVM
      can protect only hugepages in order to split them lazily, and
      delay write protection at 4K-granularity until KVM_CLEAR_DIRTY_LOG.
      This was overlooked in the TDP MMU, so do it there as well.
      
      Fixes: a6a0b05d ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
      Cc: Ben Gardon <bgardon@google.com>
      Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: s390x: implement KVM_CAP_SET_GUEST_DEBUG2 · a43b80b7
      Maxim Levitsky authored
      Define KVM_GUESTDBG_VALID_MASK and use it to implement this capability.
      Compile tested only.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401135451.1004564-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: aarch64: implement KVM_CAP_SET_GUEST_DEBUG2 · fa18aca9
      Maxim Levitsky authored
      Move KVM_GUESTDBG_VALID_MASK to kvm_host.h
      and use it to return the value of this capability.
      Compile tested only.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401135451.1004564-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2 · 7e582ccb
      Maxim Levitsky authored
      Store the supported bits in the KVM_GUESTDBG_VALID_MASK
      macro, similar to how arm64 does this.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401135451.1004564-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: introduce KVM_CAP_SET_GUEST_DEBUG2 · 8b13c364
      Paolo Bonzini authored
      This capability will allow the user to know which KVM_GUESTDBG_* bits
      are supported.
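      From userspace the capability is queried like any other extension; a minimal
      check might look like this (assumes headers from a kernel that defines
      KVM_CAP_SET_GUEST_DEBUG2):
      
         #include <fcntl.h>
         #include <stdio.h>
         #include <sys/ioctl.h>
         #include <unistd.h>
         #include <linux/kvm.h>
         
         int main(void)
         {
                 int kvm = open("/dev/kvm", O_RDONLY);
                 int mask;
         
                 if (kvm < 0) {
                         perror("open /dev/kvm");
                         return 1;
                 }
                 /* Non-zero result is the mask of supported KVM_GUESTDBG_* bits;
                  * 0 means the capability is absent (older kernel). */
                 mask = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SET_GUEST_DEBUG2);
                 printf("KVM_CAP_SET_GUEST_DEBUG2 -> 0x%x\n", mask);
                 close(kvm);
                 return 0;
         }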
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401135451.1004564-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: pending exceptions must not be blocked by an injected event · 4020da3b
      Maxim Levitsky authored
      Injected interrupts/NMIs should not block a pending exception; rather,
      the exception should either be lost if the nested hypervisor doesn't
      intercept it (as in stock x86), or be delivered in the
      exitintinfo/IDT_VECTORING_INFO field as part of the VMexit that
      corresponds to the pending exception.
      
      The only reason for an exception to be blocked is a pending nested run
      (which can't really happen currently, but is still worth checking for).
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401143817.1030695-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: remove redundant semi-colon · b9c36fde
      Yang Yingliang authored
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Message-Id: <20210401142514.1688199-1-yangyingliang@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: call nested_svm_load_cr3 on nested state load · 232f75d3
      Maxim Levitsky authored
      While KVM's MMU should be fully reset by the load of nested CR0/CR3/CR4
      via KVM_SET_SREGS, we are not in nested mode yet when we do it, and
      therefore only root_mmu is reset.
      
      On regular nested entries we call nested_svm_load_cr3, which both updates
      the guest's CR3 in the MMU when needed and re-initializes the MMU, which
      in turn initializes walk_mmu as well when nested paging is enabled in
      both host and guest.
      
      Since we don't call nested_svm_load_cr3 on nested state load, walk_mmu
      can be left uninitialized, which can lead to a NULL pointer dereference:
      if we get a nested page fault right after entering the nested guest for
      the first time after migration and decide to emulate it, the emulator
      tries to access walk_mmu->gva_to_gpa, which is NULL.
      
      Therefore we should call this function on nested state load as well.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401141814.1029036-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: dump_vmcs should include the autoload/autostore MSR lists · 8486039a
      David Edmondson authored
      When dumping the current VMCS state, include the MSRs that are being
      automatically loaded/stored during VM entry/exit.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210318120841.133123-6-david.edmondson@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: dump_vmcs should show the effective EFER · 0702a3cb
      David Edmondson authored
      If EFER is not being loaded from the VMCS, show the effective value by
      reference to the MSR autoload list or calculation.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210318120841.133123-5-david.edmondson@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: dump_vmcs should consider only the load controls of EFER/PAT · 5518da62
      David Edmondson authored
      When deciding whether to dump the GUEST_IA32_EFER and GUEST_IA32_PAT
      fields of the VMCS, examine only the VM entry load controls, as saving
      on VM exit has no effect on whether VM entry succeeds or fails.
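      Roughly the shape of the intended check, shown as a fragment with the VMX
      control and field names for illustration only (not the literal patch):
      
         u32 vmentry_ctl = vmcs_read32(VM_ENTRY_CONTROLS);
         
         /* Print each guest field iff its own VM-entry load control is set;
          * the VM-exit "save" controls are irrelevant to whether entry works. */
         if (vmentry_ctl & VM_ENTRY_LOAD_IA32_EFER)
                 pr_err("EFER= 0x%016llx\n", vmcs_read64(GUEST_IA32_EFER));
         if (vmentry_ctl & VM_ENTRY_LOAD_IA32_PAT)
                 pr_err("PAT = 0x%016llx\n", vmcs_read64(GUEST_IA32_PAT));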
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210318120841.133123-4-david.edmondson@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: dump_vmcs should not conflate EFER and PAT presence in VMCS · 699e1b2e
      David Edmondson authored
      Show EFER and PAT based on their individual entry/exit controls.
      Signed-off-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210318120841.133123-3-david.edmondson@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: dump_vmcs should not assume GUEST_IA32_EFER is valid · d9e46d34
      David Edmondson authored
      If the VM entry/exit controls for loading/saving MSR_EFER are either
      not available (an older processor or explicitly disabled) or not
      used (host and guest values are the same), reading GUEST_IA32_EFER
      from the VMCS returns an inaccurate value.
      
      Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide
      whether to print the PDPTRs - always do so if the fields exist.
      
      Fixes: 4eb64dce ("KVM: x86: dump VMCS on invalid entry")
      Signed-off-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210318120841.133123-2-david.edmondson@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: improve SYSENTER emulation on AMD · adc2a237
      Maxim Levitsky authored
      Currently, to support Intel->AMD migration, if the CPU vendor is
      GenuineIntel we emulate the full 64-bit value of the
      MSR_IA32_SYSENTER_{EIP|ESP} MSRs, and we also emulate the
      sysenter/sysexit instructions in long mode.
      
      (The emulator still refuses to emulate sysenter in 64-bit mode, on the
      grounds that the code for that wasn't tested and likely has no users.)
      
      However, when virtual vmload/vmsave is enabled, the vmload instruction
      updates these 32-bit MSRs without triggering their MSR intercept, which
      leads to stale values in KVM's shadow copy of these MSRs, which relies
      on the intercept to stay up to date.
      
      Fix/optimize this by doing the following:
      
      1. Enable the MSR intercepts for the SYSENTER MSRs iff vendor=GenuineIntel.
         (This is both a tiny optimization and also ensures that if the guest
         CPU vendor is AMD, the MSRs will be 32 bits wide, as AMD defines them.)
      
      2. Store only the high 32-bit half of these MSRs on interception, and
         combine it with the hardware MSR value on intercepted reads/writes
         iff vendor=GenuineIntel.
      
      3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
         (It is somewhat insane to set vendor=GenuineIntel and still enable
         SVM for the guest, but well, whatever.)
         Then zero the high 32-bit parts when KVM intercepts and emulates vmload.
      
      Thanks a lot to Paolo Bonzini for helping me with fixing this in the most
      correct way.
      
      This patch fixes nested migration of 32-bit nested guests, which was
      broken because incorrect cached values of the SYSENTER MSRs were stored
      in the migration stream if L1 changed these MSRs with vmload prior to
      L2 entry.
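      A toy model of point 2 above (field names are illustrative, not the real
      vcpu_svm layout): hardware keeps only the low 32 bits, so the high half lives
      in software and the two are stitched together on reads:
      
         #include <stdint.h>
         #include <stdio.h>
         
         struct sysenter_shadow {
                 uint32_t eip_hi;              /* software copy of bits 63:32 */
         };
         
         static uint32_t hw_sysenter_eip_lo;   /* stands in for the value the CPU holds */
         
         static void write_sysenter_eip(struct sysenter_shadow *s, uint64_t data,
                                        int guest_is_intel)
         {
                 /* Only an "Intel" guest gets the 64-bit view; otherwise the MSR
                  * stays 32 bits wide, matching AMD's definition. */
                 s->eip_hi = guest_is_intel ? (uint32_t)(data >> 32) : 0;
                 hw_sysenter_eip_lo = (uint32_t)data;
         }
         
         static uint64_t read_sysenter_eip(const struct sysenter_shadow *s)
         {
                 return ((uint64_t)s->eip_hi << 32) | hw_sysenter_eip_lo;
         }
         
         int main(void)
         {
                 struct sysenter_shadow s = { 0 };
         
                 write_sysenter_eip(&s, 0xffffffff81000000ull, 1);
                 printf("read back: %#llx\n", (unsigned long long)read_sysenter_eip(&s));
                 return 0;
         }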
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401111928.996871-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: add guest_cpuid_is_intel · c1df4aac
      Maxim Levitsky authored
      This is similar to the existing 'guest_cpuid_is_amd_or_hygon'.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401111928.996871-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Account a variety of miscellaneous allocations · eba04b20
      Sean Christopherson authored
      Switch to GFP_KERNEL_ACCOUNT for a handful of allocations that are
      clearly associated with a single task/VM.
      
      Note, there are several SEV allocations that aren't accounted, but
      those can (hopefully) be fixed by using the local stack for memory.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331023025.2485960-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Do not allow SEV/SEV-ES initialization after vCPUs are created · 8727906f
      Sean Christopherson authored
      Reject KVM_SEV_INIT and KVM_SEV_ES_INIT if they are attempted after one
      or more vCPUs have been created.  KVM assumes a VM is tagged SEV/SEV-ES
      prior to vCPU creation, e.g. init_vmcb() needs to mark the VMCB as SEV
      enabled, and svm_create_vcpu() needs to allocate the VMSA.  At best,
      creating vCPUs before SEV/SEV-ES init will lead to unexpected errors
      and/or behavior, and at worst it will crash the host, e.g.
      sev_launch_update_vmsa() will dereference a null svm->vmsa pointer.
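      Conceptually the fix is a guard at the top of the SEV init path, along these
      lines (an illustrative fragment, not the literal diff):
      
         /* SEV/SEV-ES state must be established before any vCPU exists. */
         if (kvm->created_vcpus)
                 return -EINVAL;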
      
      Fixes: 1654efcb ("KVM: SVM: Add KVM_SEV_INIT command")
      Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
      Cc: stable@vger.kernel.org
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331031936.2495277-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Do not set sev->es_active until KVM_SEV_ES_INIT completes · 9fa1521d
      Sean Christopherson authored
      Set sev->es_active only after the guts of KVM_SEV_ES_INIT succeeds.  If
      the command fails, e.g. because SEV is already active or there are no
      available ASIDs, then es_active will be left set even though the VM is
      not fully SEV-ES capable.
      
      Refactor the code so that "es_active" is passed on the stack instead of
      being prematurely shoved into sev_info, both to avoid having to unwind
      sev_info and so that it's more obvious what actually consumes es_active
      in sev_guest_init() and its helpers.
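      A generic sketch of the commit-on-success pattern (names hypothetical, not
      the SEV code): compute the flag in a local and store it into long-lived state
      only after every fallible step has passed:
      
         #include <errno.h>
         #include <stdbool.h>
         #include <stdio.h>
         
         struct sev_like_info {
                 bool active;
                 bool es_active;
                 int asid;
         };
         
         static int allocate_asid(bool es_active)
         {
                 return es_active ? 42 : 7;    /* stand-in; pretend this can fail */
         }
         
         static int guest_init(struct sev_like_info *info, bool es_active)
         {
                 int asid;
         
                 if (info->active)
                         return -EBUSY;        /* already initialized */
         
                 asid = allocate_asid(es_active);    /* consumes the local only */
                 if (asid < 0)
                         return asid;
         
                 /* Nothing can fail past this point, so commit the state. */
                 info->active = true;
                 info->es_active = es_active;
                 info->asid = asid;
                 return 0;
         }
         
         int main(void)
         {
                 struct sev_like_info info = { 0 };
         
                 printf("init=%d es_active=%d\n", guest_init(&info, true), info.es_active);
                 return 0;
         }
      
      A failed init leaves the struct untouched, which is exactly what keeping
      es_active on the stack buys.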
      
      Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
      Cc: stable@vger.kernel.org
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331031936.2495277-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Use online_vcpus, not created_vcpus, to iterate over vCPUs · c36b16d2
      Sean Christopherson authored
      Use the kvm_for_each_vcpu() helper to iterate over vCPUs when encrypting
      VMSAs for SEV, which effectively switches to use online_vcpus instead of
      created_vcpus.  This fixes a possible null-pointer dereference as
      created_vcpus does not guarantee a vCPU exists, since it is updated at
      the very beginning of KVM_CREATE_VCPU.  created_vcpus exists to allow the
      bulk of vCPU creation to run in parallel, while still correctly
      restricting the maximum number of vCPUs.
      
      Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
      Cc: stable@vger.kernel.org
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331031936.2495277-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Simplify code for aging SPTEs in TDP MMU · 8f8f52a4
      Sean Christopherson authored
      Use a basic NOT+AND sequence to clear the Accessed bit in TDP MMU SPTEs,
      as opposed to the fancy ffs()+clear_bit() logic that was copied from the
      legacy MMU.  The legacy MMU uses clear_bit() because it is operating on
      the SPTE itself, i.e. clearing needs to be atomic.  The TDP MMU operates
      on a local variable that it later writes back to the SPTE, so the clearing
      doesn't need to be atomic, nor does the value even need to be resident in memory.
      
      Opportunistically drop unnecessary initialization of new_spte, it's
      guaranteed to be written before being accessed.
      
      Using NOT+AND instead of ffs()+clear_bit() reduces the sequence from:
      
         0x0000000000058be6 <+134>:	test   %rax,%rax
         0x0000000000058be9 <+137>:	je     0x58bf4 <age_gfn_range+148>
         0x0000000000058beb <+139>:	test   %rax,%rdi
         0x0000000000058bee <+142>:	je     0x58cdc <age_gfn_range+380>
         0x0000000000058bf4 <+148>:	mov    %rdi,0x8(%rsp)
         0x0000000000058bf9 <+153>:	mov    $0xffffffff,%edx
         0x0000000000058bfe <+158>:	bsf    %eax,%edx
         0x0000000000058c01 <+161>:	movslq %edx,%rdx
         0x0000000000058c04 <+164>:	lock btr %rdx,0x8(%rsp)
         0x0000000000058c0b <+171>:	mov    0x8(%rsp),%r15
      
      to:
      
         0x0000000000058bdd <+125>:	test   %rax,%rax
         0x0000000000058be0 <+128>:	je     0x58beb <age_gfn_range+139>
         0x0000000000058be2 <+130>:	test   %rax,%r8
         0x0000000000058be5 <+133>:	je     0x58cc0 <age_gfn_range+352>
         0x0000000000058beb <+139>:	not    %rax
         0x0000000000058bee <+142>:	and    %r8,%rax
         0x0000000000058bf1 <+145>:	mov    %rax,%r15
      
      thus eliminating several memory accesses, including a locked access.
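      The difference in C is just this (a standalone toy; the mask value is made up,
      whereas KVM's real code uses its shadow_accessed_mask):
      
         #include <stdint.h>
         #include <stdio.h>
         
         #define ACCESSED_MASK (1ull << 5)     /* illustrative bit position */
         
         int main(void)
         {
                 uint64_t old_spte = 0x80000000000000a5ull;     /* arbitrary example */
                 uint64_t new_spte = old_spte & ~ACCESSED_MASK; /* NOT + AND, no atomics */
         
                 printf("old=%#llx new=%#llx\n",
                        (unsigned long long)old_spte, (unsigned long long)new_spte);
                 return 0;
         }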
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331004942.2444916-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove spurious clearing of dirty bit from TDP MMU SPTE · 6d9aafb9
      Sean Christopherson authored
      Don't clear the dirty bit when aging a TDP MMU SPTE (in response to an MMU
      notifier event).  Prematurely clearing the dirty bit could cause spurious
      PML updates if aging a page happened to coincide with dirty logging.
      
      Note, tdp_mmu_set_spte_no_acc_track() flows into __handle_changed_spte(),
      so the host PFN will be marked dirty, i.e. there is no potential for data
      corruption.
      
      Fixes: a6a0b05d ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331004942.2444916-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint · 6dfbd6b5
      Sean Christopherson authored
      Remove x86's trace_kvm_age_page() tracepoint.  It's mostly redundant with
      the common trace_kvm_age_hva() tracepoint, and if there is a need for the
      extra details, e.g. gfn, referenced, etc... those details should be added
      to the common tracepoint so that all architectures and MMUs benefit from
      the info.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-19-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move arm64's MMU notifier trace events to generic code · 501b9185
      Sean Christopherson authored
      Move arm64's MMU notifier trace events into common code in preparation
      for doing the hva->gfn lookup in common code.  The alternative would be
      to trace the gfn instead of hva, but that's not obviously better and
      could also be done in common code.  Tracing the notifiers is also quite
      handy for debug regardless of architecture.
      
      Remove a completely redundant tracepoint from PPC e500.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move prototypes for MMU notifier callbacks to generic code · 5f7c292b
      Sean Christopherson authored
      Move the prototypes for the MMU notifier callbacks out of arch code and
      into common code.  There is no benefit to having each arch replicate the
      prototypes since any deviation from the invocation in common code will
      explode.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing SPTE · aaaac889
      Sean Christopherson authored
      Use the leaf-only TDP iterator when changing the SPTE in reaction to an
      MMU notifier.  Practically speaking, this is a nop since the guts of the
      loop explicitly looks for 4k SPTEs, which are always leaf SPTEs.  Switch
      the iterator to match age_gfn_range() and test_age_gfn() so that a future
      patch can consolidate the core iterating logic.
      
      No real functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Pass address space ID to TDP MMU root walkers · a3f15bda
      Sean Christopherson authored
      Move the address space ID check that is performed when iterating over
      roots into the macro helpers to consolidate code.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range() · 2b9663d8
      Sean Christopherson authored
      Pass the address space ID to TDP MMU's primary "zap gfn range" helper to
      allow the MMU notifier paths to iterate over memslots exactly once.
      Currently, both the legacy MMU and TDP MMU iterate over memslots when
      looking for an overlapping hva range, which can be quite costly if there
      are a large number of memslots.
      
      Add a "flush" parameter so that iterating over multiple address spaces
      in the caller will continue to do the right thing when yielding while a
      flush is pending from a previous address space.
      
      Note, this also has a functional change in the form of coalescing TLB
      flushes across multiple address spaces in kvm_zap_gfn_range(), and also
      optimizes the TDP MMU to utilize range-based flushing when running as L1
      with Hyper-V enlightenments.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-6-seanjc@google.com>
      [Keep separate for loops to prepare for other incoming patches. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range zap · 1a61b7db
      Sean Christopherson authored
      Gather pending TLB flushes across both address spaces when zapping a
      given gfn range.  This requires feeding "flush" back into subsequent
      calls, but on the plus side sets the stage for further batching
      between the legacy MMU and TDP MMU.  It also allows refactoring the
      address space iteration to cover the legacy and TDP MMUs without
      introducing truly ugly code.
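      A standalone sketch of the flush threading (names hypothetical): each
      per-address-space zap takes the pending-flush state in, returns it updated,
      and only the caller flushes, once, at the very end:
      
         #include <stdbool.h>
         #include <stdio.h>
         
         static bool zap_one_as(int as_id, unsigned long start, unsigned long end,
                                bool flush)
         {
                 bool zapped = (as_id == 0);   /* pretend only address space 0 had mappings */
         
                 if (zapped)
                         printf("zapped as=%d [%#lx, %#lx)\n", as_id, start, end);
                 return flush || zapped;       /* keep any previously pending flush */
         }
         
         static void zap_gfn_range(unsigned long start, unsigned long end)
         {
                 bool flush = false;
         
                 for (int as_id = 0; as_id < 2; as_id++)   /* both address spaces */
                         flush = zap_one_as(as_id, start, end, flush);
         
                 if (flush)
                         printf("single remote TLB flush\n");
         }
         
         int main(void)
         {
                 zap_gfn_range(0x1000, 0x2000);
                 return 0;
         }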
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs · 142ccde1
      Sean Christopherson authored
      Gather pending TLB flushes across both the legacy and TDP MMUs when
      zapping collapsible SPTEs to avoid multiple flushes if both the legacy
      MMU (for nested guests) and TDP MMU have mappings for the memslot.
      
      Note, this also optimizes the TDP MMU to flush only the relevant range
      when running as L1 with Hyper-V enlightenments.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy MMU · 302695a5
      Sean Christopherson authored
      Place the onus on the caller of slot_handle_*() to flush the TLB, rather
      than handling the flush in the helper, and rename parameters accordingly.
      This will allow future patches to coalesce flushes between address spaces
      and between the legacy and TDP MMUs.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible SPTEs · af95b53e
      Sean Christopherson authored
      When zapping collapsible SPTEs across multiple roots, gather pending
      flushes and perform a single remote TLB flush at the end, as opposed to
      flushing after processing every root.
      
      Note, flush may be cleared by the result of zap_collapsible_spte_range().
      This is intended and correct, e.g. yielding may have serviced a prior
      pending flush.
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/vPMU: Forbid reading from MSR_F15H_PERF MSRs when guest doesn't have X86_FEATURE_PERFCTR_CORE · c28fa560
      Vitaly Kuznetsov authored
      The MSR_F15H_PERF_CTL0-5 and MSR_F15H_PERF_CTR0-5 MSRs have a CPUID bit
      assigned to them (X86_FEATURE_PERFCTR_CORE); when it isn't exposed to the
      guest, the correct behavior is to inject #GP and not just return zero.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210329124804.170173-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: If VMRUN is single-stepped, queue the #DB intercept in nested_svm_vmexit() · 9a7de6ec
      Krish Sadhukhan authored
      According to APM, the #DB intercept for a single-stepped VMRUN must happen
      after the completion of that instruction, when the guest does #VMEXIT to
      the host. However, in the current implementation of KVM, the #DB intercept
      for a single-stepped VMRUN happens after the completion of the instruction
      that follows the VMRUN instruction. When the #DB intercept handler is
      invoked, it shows the RIP of the instruction that follows VMRUN, instead
      of VMRUN itself. This is an incorrect RIP as far as single-stepping VMRUN
      is concerned.
      
      This patch fixes the problem by checking, in nested_svm_vmexit(), for the
      condition that the VMRUN instruction is being single-stepped and if so,
      queues the pending #DB intercept so that the #DB is accounted for before
      we execute L1's next instruction.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Message-Id: <20210323175006.73249-2-krish.sadhukhan@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>