1. 17 Apr, 2021 24 commits
    • KVM: x86: dump_vmcs should not assume GUEST_IA32_EFER is valid · d9e46d34
      David Edmondson authored
      If the VM entry/exit controls for loading/saving MSR_EFER are either
      not available (an older processor or explicitly disabled) or not
      used (host and guest values are the same), reading GUEST_IA32_EFER
      from the VMCS returns an inaccurate value.
      
      Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide
      whether to print the PDPTRs - always do so if the fields exist.
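      
      As a hedged sketch (not the verbatim diff; it assumes the PDPTR fields
      are present whenever EPT is enabled and reuses the existing VMX helpers
      and pr_err format), the resulting logic looks roughly like:
      
         if (secondary_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
                 pr_err("PDPTR0 = 0x%016llx  PDPTR1 = 0x%016llx\n",
                        vmcs_read64(GUEST_PDPTR0), vmcs_read64(GUEST_PDPTR1));
                 pr_err("PDPTR2 = 0x%016llx  PDPTR3 = 0x%016llx\n",
                        vmcs_read64(GUEST_PDPTR2), vmcs_read64(GUEST_PDPTR3));
         }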
      
      Fixes: 4eb64dce ("KVM: x86: dump VMCS on invalid entry")
      Signed-off-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210318120841.133123-2-david.edmondson@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: improve SYSENTER emulation on AMD · adc2a237
      Maxim Levitsky authored
      Maxim Levitsky authored
      Currently, to support Intel->AMD migration, if the CPU vendor is GenuineIntel,
      we emulate the full 64-bit value of the MSR_IA32_SYSENTER_{EIP|ESP}
      MSRs, and we also emulate the sysenter/sysexit instructions in long mode.
      
      (The emulator still refuses to emulate sysenter in 64-bit mode, on the
      grounds that the code for that wasn't tested and likely has no users.)
      
      However, when virtual vmload/vmsave is enabled, the vmload instruction
      updates these 32-bit MSRs without triggering their MSR intercept, which
      leaves stale values in KVM's shadow copy of these MSRs, which relies on
      the intercept to stay up to date.
      
      Fix/optimize this by doing the following:
      
      1. Enable the MSR intercepts for the SYSENTER MSRs iff vendor=GenuineIntel
         (This is both a tiny optimization and also ensures that, in case the
         guest CPU vendor is AMD, the MSRs will be 32 bits wide, as AMD
         defined them).
      
      2. Store only the high 32-bit part of these MSRs on interception and
         combine it with the hardware MSR value on intercepted reads/writes
         iff vendor=GenuineIntel.
      
      3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
         (It is somewhat insane to set vendor=GenuineIntel and still enable
         SVM for the guest, but well, whatever.)
         Then zero the high 32-bit parts when KVM intercepts and emulates vmload.
      
      Thanks a lot to Paolo Bonzini for helping me with fixing this in the most
      correct way.
      
      This patch fixes nested migration of 32-bit nested guests, which was
      broken because incorrect cached values of the SYSENTER MSRs were stored
      in the migration stream if L1 changed these MSRs with vmload prior to
      L2 entry.
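      
      For illustration, an intercepted read then looks roughly like the sketch
      below (field names such as sysenter_eip_hi are the shadow copy added by
      this patch; treat this as a sketch rather than the verbatim diff):
      
         case MSR_IA32_SYSENTER_EIP:
                 msr_info->data = (u32)svm->vmcb01.ptr->save.sysenter_eip;
                 if (guest_cpuid_is_intel(vcpu))
                         msr_info->data |= (u64)svm->sysenter_eip_hi << 32;
                 break;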
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401111928.996871-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: add guest_cpuid_is_intel · c1df4aac
      Maxim Levitsky authored
      Maxim Levitsky authored
      This is similar to the existing 'guest_cpuid_is_amd_or_hygon'.
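      
      A rough sketch of the helper (assuming it mirrors the existing
      guest_cpuid_is_amd_or_hygon() and the is_guest_vendor_intel() check):
      
         static inline bool guest_cpuid_is_intel(struct kvm_vcpu *vcpu)
         {
                 struct kvm_cpuid_entry2 *best;
      
                 best = kvm_find_cpuid_entry(vcpu, 0, 0);
                 return best && is_guest_vendor_intel(best->ebx, best->ecx, best->edx);
         }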
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210401111928.996871-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Account a variety of miscellaneous allocations · eba04b20
      Sean Christopherson authored
      Sean Christopherson authored
      Switch to GFP_KERNEL_ACCOUNT for a handful of allocations that are
      clearly associated with a single task/VM.
      
      Note, there are several SEV allocations that aren't accounted, but
      those can (hopefully) be fixed by using the local stack for memory.
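      
      The pattern of the change, with a placeholder allocation (foo is purely
      illustrative):
      
         /* Before: not charged to the VM's memory cgroup. */
         foo = kzalloc(sizeof(*foo), GFP_KERNEL);
      
         /* After: accounted against the task/VM that triggered the allocation. */
         foo = kzalloc(sizeof(*foo), GFP_KERNEL_ACCOUNT);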
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331023025.2485960-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Do not allow SEV/SEV-ES initialization after vCPUs are created · 8727906f
      Sean Christopherson authored
      Sean Christopherson authored
      Reject KVM_SEV_INIT and KVM_SEV_ES_INIT if they are attempted after one
      or more vCPUs have been created.  KVM assumes a VM is tagged SEV/SEV-ES
      prior to vCPU creation, e.g. init_vmcb() needs to mark the VMCB as SEV
      enabled, and svm_create_vcpu() needs to allocate the VMSA.  At best,
      creating vCPUs before SEV/SEV-ES init will lead to unexpected errors
      and/or behavior, and at worst it will crash the host, e.g.
      sev_launch_update_vmsa() will dereference a null svm->vmsa pointer.
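      
      The guard itself is tiny; a hedged sketch of the check added early in the
      SEV/SEV-ES init path:
      
         /* SEV/SEV-ES state must be set up before any vCPU is created. */
         if (kvm->created_vcpus)
                 return -EINVAL;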
      
      Fixes: 1654efcb ("KVM: SVM: Add KVM_SEV_INIT command")
      Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
      Cc: stable@vger.kernel.org
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331031936.2495277-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Do not set sev->es_active until KVM_SEV_ES_INIT completes · 9fa1521d
      Sean Christopherson authored
      Sean Christopherson authored
      Set sev->es_active only after the guts of KVM_SEV_ES_INIT succeeds.  If
      the command fails, e.g. because SEV is already active or there are no
      available ASIDs, then es_active will be left set even though the VM is
      not fully SEV-ES capable.
      
      Refactor the code so that "es_active" is passed on the stack instead of
      being prematurely shoved into sev_info, both to avoid having to unwind
      sev_info and so that it's more obvious what actually consumes es_active
      in sev_guest_init() and its helpers.
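      
      A simplified, hedged sketch of the resulting flow (error handling
      trimmed; the point is that sev_info is only written on success):
      
         bool es_active = argp->id == KVM_SEV_ES_INIT;
         int ret;
      
         ret = sev_platform_init(&argp->error);
         if (ret)
                 return ret;     /* es_active never reaches sev_info */
      
         sev->active = true;
         sev->es_active = es_active;
         return 0;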
      
      Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
      Cc: stable@vger.kernel.org
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331031936.2495277-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Use online_vcpus, not created_vcpus, to iterate over vCPUs · c36b16d2
      Sean Christopherson authored
      Sean Christopherson authored
      Use the kvm_for_each_vcpu() helper to iterate over vCPUs when encrypting
      VMSAs for SEV, which effectively switches to use online_vcpus instead of
      created_vcpus.  This fixes a possible null-pointer dereference as
      created_vcpus does not guarantee a vCPU exists, since it is updated at
      the very beginning of KVM_CREATE_VCPU.  created_vcpus exists to allow the
      bulk of vCPU creation to run in parallel, while still correctly
      restricting the maximum number of vCPUs.
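      
      A hedged sketch of the new iteration (the per-vCPU encryption step is
      elided):
      
         struct kvm_vcpu *vcpu;
         int i;
      
         kvm_for_each_vcpu(i, vcpu, kvm) {
                 struct vcpu_svm *svm = to_svm(vcpu);
      
                 /* ... issue SEV LAUNCH_UPDATE_VMSA against svm->vmsa ... */
         }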
      
      Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
      Cc: stable@vger.kernel.org
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331031936.2495277-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Simplify code for aging SPTEs in TDP MMU · 8f8f52a4
      Sean Christopherson authored
      Sean Christopherson authored
      Use a basic NOT+AND sequence to clear the Accessed bit in TDP MMU SPTEs,
      as opposed to the fancy ffs()+clear_bit() logic that was copied from the
      legacy MMU.  The legacy MMU uses clear_bit() because it is operating on
      the SPTE itself, i.e. clearing needs to be atomic.  The TDP MMU operates
      on a local variable that it later writes to the SPTE, and so doesn't need
      to be atomic or even resident in memory.
      
      Opportunistically drop the unnecessary initialization of new_spte; it's
      guaranteed to be written before being accessed.
      
      Using NOT+AND instead of ffs()+clear_bit() reduces the sequence from:
      
         0x0000000000058be6 <+134>:	test   %rax,%rax
         0x0000000000058be9 <+137>:	je     0x58bf4 <age_gfn_range+148>
         0x0000000000058beb <+139>:	test   %rax,%rdi
         0x0000000000058bee <+142>:	je     0x58cdc <age_gfn_range+380>
         0x0000000000058bf4 <+148>:	mov    %rdi,0x8(%rsp)
         0x0000000000058bf9 <+153>:	mov    $0xffffffff,%edx
         0x0000000000058bfe <+158>:	bsf    %eax,%edx
         0x0000000000058c01 <+161>:	movslq %edx,%rdx
         0x0000000000058c04 <+164>:	lock btr %rdx,0x8(%rsp)
         0x0000000000058c0b <+171>:	mov    0x8(%rsp),%r15
      
      to:
      
         0x0000000000058bdd <+125>:	test   %rax,%rax
         0x0000000000058be0 <+128>:	je     0x58beb <age_gfn_range+139>
         0x0000000000058be2 <+130>:	test   %rax,%r8
         0x0000000000058be5 <+133>:	je     0x58cc0 <age_gfn_range+352>
         0x0000000000058beb <+139>:	not    %rax
         0x0000000000058bee <+142>:	and    %r8,%rax
         0x0000000000058bf1 <+145>:	mov    %rax,%r15
      
      thus eliminating several memory accesses, including a locked access.
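      
      The two forms, side by side (sketch; both clear the Accessed bit in the
      local copy of the SPTE before it is written back):
      
         /* Legacy-MMU-style: an atomic op on what is just a local variable. */
         clear_bit((ffs(shadow_accessed_mask) - 1), (unsigned long *)&new_spte);
      
         /* TDP MMU replacement: a plain bitwise operation. */
         new_spte &= ~shadow_accessed_mask;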
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331004942.2444916-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove spurious clearing of dirty bit from TDP MMU SPTE · 6d9aafb9
      Sean Christopherson authored
      Sean Christopherson authored
      Don't clear the dirty bit when aging a TDP MMU SPTE (in response to an
      MMU notifier event).  Prematurely clearing the dirty bit could cause
      spurious PML updates if aging a page happened to coincide with dirty
      logging.
      
      Note, tdp_mmu_set_spte_no_acc_track() flows into __handle_changed_spte(),
      so the host PFN will be marked dirty, i.e. there is no potential for data
      corruption.
      
      Fixes: a6a0b05d ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210331004942.2444916-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint · 6dfbd6b5
      Sean Christopherson authored
      Sean Christopherson authored
      Remove x86's trace_kvm_age_page() tracepoint.  It's mostly redundant with
      the common trace_kvm_age_hva() tracepoint, and if there is a need for the
      extra details, e.g. gfn, referenced, etc... those details should be added
      to the common tracepoint so that all architectures and MMUs benefit from
      the info.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-19-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move arm64's MMU notifier trace events to generic code · 501b9185
      Sean Christopherson authored
      Sean Christopherson authored
      Move arm64's MMU notifier trace events into common code in preparation
      for doing the hva->gfn lookup in common code.  The alternative would be
      to trace the gfn instead of hva, but that's not obviously better and
      could also be done in common code.  Tracing the notifiers is also quite
      handy for debug regardless of architecture.
      
      Remove a completely redundant tracepoint from PPC e500.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move prototypes for MMU notifier callbacks to generic code · 5f7c292b
      Sean Christopherson authored
      Sean Christopherson authored
      Move the prototypes for the MMU notifier callbacks out of arch code and
      into common code.  There is no benefit to having each arch replicate the
      prototypes since any deviation from the invocation in common code will
      explode.
      
      No functional change intended.
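      
      For reference, the callback prototypes in question at the time of this
      series are roughly:
      
         int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start,
                                 unsigned long end, unsigned flags);
         int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
         int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
         int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);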
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing SPTE · aaaac889
      Sean Christopherson authored
      Sean Christopherson authored
      Use the leaf-only TDP iterator when changing the SPTE in reaction to an
      MMU notifier.  Practically speaking, this is a nop since the guts of the
      loop explicitly looks for 4k SPTEs, which are always leaf SPTEs.  Switch
      the iterator to match age_gfn_range() and test_age_gfn() so that a future
      patch can consolidate the core iterating logic.
      
      No real functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Pass address space ID to TDP MMU root walkers · a3f15bda
      Sean Christopherson authored
      Sean Christopherson authored
      Move the address space ID check that is performed when iterating over
      roots into the macro helpers to consolidate code.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range() · 2b9663d8
      Sean Christopherson authored
      Sean Christopherson authored
      Pass the address space ID to TDP MMU's primary "zap gfn range" helper to
      allow the MMU notifier paths to iterate over memslots exactly once.
      Currently, both the legacy MMU and TDP MMU iterate over memslots when
      looking for an overlapping hva range, which can be quite costly if there
      are a large number of memslots.
      
      Add a "flush" parameter so that iterating over multiple address spaces
      in the caller will continue to do the right thing when yielding while a
      flush is pending from a previous address space.
      
      Note, this also has a functional change in the form of coalescing TLB
      flushes across multiple address spaces in kvm_zap_gfn_range(), and also
      optimizes the TDP MMU to utilize range-based flushing when running as L1
      with Hyper-V enlightenments.
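      
      A hedged sketch of the calling pattern (the zap helper's exact name and
      signature here are illustrative): the pending-flush state is carried
      across both address spaces and flushed once at the end.
      
         bool flush = false;
         int i;
      
         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
                 flush = zap_gfn_range_for_as(kvm, i, gfn_start, gfn_end, flush);
      
         if (flush)
                 kvm_flush_remote_tlbs(kvm);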
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-6-seanjc@google.com>
      [Keep separate for loops to prepare for other incoming patches. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range zap · 1a61b7db
      Sean Christopherson authored
      Sean Christopherson authored
      Gather pending TLB flushes across both address spaces when zapping a
      given gfn range.  This requires feeding "flush" back into subsequent
      calls, but on the plus side sets the stage for further batching
      between the legacy MMU and TDP MMU.  It also allows refactoring the
      address space iteration to cover the legacy and TDP MMUs without
      introducing truly ugly code.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs · 142ccde1
      Sean Christopherson authored
      Sean Christopherson authored
      Gather pending TLB flushes across both the legacy and TDP MMUs when
      zapping collapsible SPTEs to avoid multiple flushes if both the legacy
      MMU (for nested guests) and TDP MMU have mappings for the memslot.
      
      Note, this also optimizes the TDP MMU to flush only the relevant range
      when running as L1 with Hyper-V enlightenments.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy MMU · 302695a5
      Sean Christopherson authored
      Sean Christopherson authored
      Place the onus on the caller of slot_handle_*() to flush the TLB, rather
      than handling the flush in the helper, and rename parameters accordingly.
      This will allow future patches to coalesce flushes between address spaces
      and between the legacy and TDP MMUs.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible SPTEs · af95b53e
      Sean Christopherson authored
      Sean Christopherson authored
      When zapping collapsible SPTEs across multiple roots, gather pending
      flushes and perform a single remote TLB flush at the end, as opposed to
      flushing after processing every root.
      
      Note, flush may be cleared by the result of zap_collapsible_spte_range().
      This is intended and correct, e.g. yielding may have serviced a prior
      pending flush.
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210326021957.1424875-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/vPMU: Forbid reading from MSR_F15H_PERF MSRs when guest doesn't have X86_FEATURE_PERFCTR_CORE · c28fa560
      Vitaly Kuznetsov authored
      
      The MSR_F15H_PERF_CTL0-5 and MSR_F15H_PERF_CTR0-5 MSRs have a CPUID bit
      assigned to them (X86_FEATURE_PERFCTR_CORE), and when it isn't exposed to
      the guest the correct behavior is to inject #GP and not just return zero.
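      
      A hedged sketch of the guard, placed where the AMD vPMU resolves these
      MSRs to a counter (so the access falls through and is rejected with #GP):
      
         if (!guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE))
                 return NULL;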
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210329124804.170173-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: If VMRUN is single-stepped, queue the #DB intercept in nested_svm_vmexit() · 9a7de6ec
      Krish Sadhukhan authored
      Krish Sadhukhan authored
      According to the APM, the #DB intercept for a single-stepped VMRUN must
      happen after the completion of that instruction, when the guest does
      #VMEXIT to the host. However, in the current implementation of KVM, the
      #DB intercept for a single-stepped VMRUN happens after the completion of
      the instruction that follows the VMRUN instruction. When the #DB
      intercept handler is invoked, it shows the RIP of the instruction that
      follows VMRUN, instead of VMRUN itself. This is an incorrect RIP as far
      as single-stepping VMRUN is concerned.
      
      This patch fixes the problem by checking, in nested_svm_vmexit(), for the
      condition that the VMRUN instruction is being single-stepped and if so,
      queues the pending #DB intercept so that the #DB is accounted for before
      we execute L1's next instruction.
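      
      A hedged sketch of the check at the end of nested_svm_vmexit(): if L1's
      RFLAGS.TF is set, the VMRUN that just completed was being single-stepped,
      so queue the #DB now.
      
         if (unlikely(svm->vmcb->save.rflags & X86_EFLAGS_TF))
                 kvm_queue_exception(&svm->vcpu, DB_VECTOR);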
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oraacle.com>
      Message-Id: <20210323175006.73249-2-krish.sadhukhan@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: load PDPTRs outside mmu_lock · 4a38162e
      Paolo Bonzini authored
      Paolo Bonzini authored
      On SVM, reading PDPTRs might access guest memory, which might fault
      and thus might sleep.  On the other hand, it is not possible to
      release the lock after make_mmu_pages_available has been called.
      
      Therefore, push the call to make_mmu_pages_available and the
      mmu_lock critical section down into mmu_alloc_direct_roots and
      mmu_alloc_shadow_roots.
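      
      An abridged, hedged sketch of the new shape of mmu_alloc_shadow_roots():
      the PDPTEs are read (and may fault and sleep) before mmu_lock is taken.
      
         if (mmu->root_level == PT32E_ROOT_LEVEL)
                 for (i = 0; i < 4; ++i)
                         pdptrs[i] = mmu->get_pdptr(vcpu, i);   /* may sleep */
      
         write_lock(&vcpu->kvm->mmu_lock);
         r = make_mmu_pages_available(vcpu);
         if (r < 0)
                 goto out_unlock;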
      Reported-by: Wanpeng Li <wanpengli@tencent.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge remote-tracking branch 'tip/x86/sgx' into kvm-next · d9bd0082
      Paolo Bonzini authored
      Paolo Bonzini authored
      Pull generic x86 SGX changes needed to support SGX in virtual machines.
    • Merge tag 'kvm-s390-next-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD · 387cb8e8
      Paolo Bonzini authored
      
      KVM: s390: Fix potential crash in preemptible kernels
      
      There is a potential race for preemptible kernels, where
      the host kernel would get a fault when it is preempted at
      the wrong point in time.
  2. 15 Apr, 2021 2 commits
  3. 12 Apr, 2021 1 commit
  4. 08 Apr, 2021 1 commit
    • x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section() · ae40aaf6
      Jarkko Sakkinen authored
      The commit in Fixes: changed the SGX EPC page sanitization to end up in
      sgx_free_epc_page() which puts clean and sanitized pages on the free
      list.
      
      This was done for the reason that it is best to keep the logic to assign
      available-for-use EPC pages to the correct NUMA lists in a single
      location.
      
      sgx_nr_free_pages is also incremented by sgx_free_epc_page(), but the
      pages which are being added there per EPC section do not belong on the
      free list yet because they haven't been sanitized - they land on the
      dirty list first and the sanitization happens later when ksgxd starts
      massaging them.
      
      So remove that addition there and have sgx_free_epc_page() do that
      solely.
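      
      For reference, a hedged sketch of the one place the counter is now
      maintained, inside sgx_free_epc_page(): the page is put on its node's
      free list and only then counted as free.
      
         spin_lock(&node->lock);
         list_add_tail(&page->list, &node->free_page_list);
         sgx_nr_free_pages++;
         spin_unlock(&node->lock);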
      
       [ bp: Sanitize commit message too. ]
      
      Fixes: 51ab30eb ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list")
      Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20210408092924.7032-1-jarkko@kernel.org
  5. 06 Apr, 2021 10 commits
  6. 02 Apr, 2021 2 commits