1. 12 Nov, 2021 2 commits
  2. 11 Nov, 2021 38 commits
    • Merge branch 'kvm-5.16-fixes' into kvm-master · f5396f2d
      Paolo Bonzini authored
      * Fix misuse of gfn-to-pfn cache when recording guest steal time / preempted status
      
      * Fix selftests on APICv machines
      
      * Fix sparse warnings
      
      * Fix detection of KVM features in CPUID
      
      * Cleanups for bogus writes to MSR_KVM_PV_EOI_EN
      
      * Fixes and cleanups for MSR bitmap handling
      
      * Cleanups for INVPCID
      
      * Make x86 KVM_SOFT_MAX_VCPUS consistent with other architectures
    • Merge branch 'kvm-sev-move-context' into kvm-master · 1f058331
      Paolo Bonzini authored
      Add support for AMD SEV and SEV-ES intra-host migration.  Intra-host
      migration provides a low-cost mechanism for userspace VMM upgrades.
      
      In the common case for intra host migration, we can rely on the normal
      ioctls for passing data from one VMM to the next. SEV, SEV-ES, and other
      confidential compute environments make most of this information opaque, and
      render KVM ioctls such as "KVM_GET_REGS" irrelevant.  As a result, we need
      the ability to pass this opaque metadata from one VMM to the next. The
      easiest way to do this is to leave this data in the kernel, and transfer
      ownership of the metadata from one KVM VM (or vCPU) to the next.  In-kernel
      hand off makes it possible to move any data that would be
      unsafe/impossible for the kernel to hand directly to userspace, and
      cannot be reproduced using data that can be handed to userspace.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop arbitrary KVM_SOFT_MAX_VCPUS · da1bfd52
      Vitaly Kuznetsov authored
      KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of
      VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports
      either num_online_cpus() or num_present_cpus(), s390 has multiple
      constants depending on hardware features. On x86, KVM reports an
      arbitrary value of '710' which is supposed to be the maximum tested
      value but it's possible to test all KVM_MAX_VCPUS even when there are
      fewer physical CPUs available.
      
      Drop the arbitrary '710' value and return num_online_cpus() on x86 as
      well. The recommendation will match other architectures and will mean
      'no CPU overcommit'.
      
      For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning
      when the requested vCPU number exceeds it. The static limit of '710'
      is quite weird as smaller systems with just a few physical CPUs should
      certainly "recommend" less.
      Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211111134733.86601-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
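A minimal sketch of how a VMM might act on the recommendation described in the commit above. It assumes the recommended value (KVM_CAP_NR_VCPUS) and the hard limit (KVM_CAP_MAX_VCPUS) have already been read with KVM_CHECK_EXTENSION. The enum and function names are hypothetical and are not taken from QEMU.

```c
#include <stdio.h>

/* Hypothetical result codes for this sketch; not part of QEMU or KVM. */
enum vcpu_check {
	VCPU_OK,	/* at or below the recommended count */
	VCPU_WARN,	/* above the recommendation, within the hard limit */
	VCPU_REJECT,	/* above the hard limit */
};

/*
 * 'recommended' stands for the value of KVM_CAP_NR_VCPUS (num_online_cpus()
 * on x86 after this change); 'hard_max' stands for KVM_CAP_MAX_VCPUS.
 */
static enum vcpu_check check_vcpu_count(int requested, int recommended,
					int hard_max)
{
	if (requested > hard_max)
		return VCPU_REJECT;
	if (requested > recommended) {
		fprintf(stderr, "warning: %d vCPUs requested, %d recommended\n",
			requested, recommended);
		return VCPU_WARN;
	}
	return VCPU_OK;
}
```

With the old static value of 710, the warning branch would almost never trigger on a small host; with num_online_cpus() it triggers as soon as vCPUs are overcommitted.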
    • KVM: Move INVPCID type check from vmx and svm to the common kvm_handle_invpcid() · 796c83c5
      Vipin Sharma authored
      Handle #GP on INVPCID due to an invalid type in the common switch
      statement instead of relying on the callers (VMX and SVM) to manually
      validate the type.
      
      Unlike INVVPID and INVEPT, INVPCID is not explicitly documented to check
      the type before reading the operand from memory, so deferring the
      type validity check until after that point is architecturally allowed.
      Signed-off-by: Vipin Sharma <vipinsh@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109174426.2350547-3-vipinsh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
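The idea in the commit above can be sketched as a single switch whose default case rejects invalid types, so that callers need no check of their own. This is a simplified userspace model, not the kernel function: the handler name is hypothetical and the actual TLB flushing is elided.

```c
#include <stdbool.h>

/* INVPCID invalidation types defined by the x86 architecture. */
enum invpcid_type {
	INVPCID_TYPE_INDIV_ADDR      = 0,
	INVPCID_TYPE_SINGLE_CTXT     = 1,
	INVPCID_TYPE_ALL_INCL_GLOBAL = 2,
	INVPCID_TYPE_ALL_NON_GLOBAL  = 3,
};

/*
 * Hypothetical stand-in for the common handler. Returns true if the
 * instruction would complete and false if the guest should receive #GP.
 */
static bool handle_invpcid_sketch(unsigned long type)
{
	switch (type) {
	case INVPCID_TYPE_INDIV_ADDR:
	case INVPCID_TYPE_SINGLE_CTXT:
	case INVPCID_TYPE_ALL_INCL_GLOBAL:
	case INVPCID_TYPE_ALL_NON_GLOBAL:
		/* A real handler performs the type-specific flush here. */
		return true;
	default:
		/* Invalid type: rejected once here, not in each caller. */
		return false;
	}
}
```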
    • KVM: VMX: Add a helper function to retrieve the GPR index for INVPCID, INVVPID, and INVEPT · 329bd56c
      Vipin Sharma authored
      handle_invept(), handle_invvpid(), handle_invpcid() read the same reg2
      field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that
      holds the invalidation type. Add a helper to retrieve reg2 from VMX
      instruction info to consolidate and document the shift+mask magic.
      Signed-off-by: Vipin Sharma <vipinsh@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109174426.2350547-2-vipinsh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
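As an illustration of the "shift+mask magic" mentioned in the commit above: for these three instructions the reg2 field occupies bits 31:28 of the VM-exit instruction-information value. The helper name below is illustrative and may differ from the one the patch adds.

```c
#include <stdint.h>

/*
 * Extract the reg2 field (bits 31:28) from a VMX instruction-information
 * value. The result is the index of the GPR holding the invalidation type.
 */
static inline int instr_info_reg2(uint32_t vmx_instr_info)
{
	return (vmx_instr_info >> 28) & 0xf;
}
```

Centralizing the expression means the bit position is documented in one place instead of being repeated in three exit handlers.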
    • KVM: nVMX: Clean up x2APIC MSR handling for L2 · a5e0c252
      Sean Christopherson authored
      Clean up the x2APIC MSR bitmap interception code for L2, which is the last
      holdout of open coded bitmap manipulations.  Freshen up the SDM/PRM
      comment, rename the function to make it abundantly clear the funky
      behavior is x2APIC specific, and explain _why_ vmcs01's bitmap is ignored
      (the previous comment was flat out wrong for x2APIC behavior).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109013047.2041518-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Macrofy the MSR bitmap getters and setters · 0cacb80b
      Sean Christopherson authored
      Add builder macros to generate the MSR bitmap helpers to reduce the
      amount of copy-paste code, especially with respect to all the magic
      numbers needed to calc the correct bit location.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109013047.2041518-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
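A userspace sketch of the builder-macro approach described in the commit above, assuming the architectural 4 KiB MSR bitmap layout (read-low at 0x000, read-high at 0x400, write-low at 0x800, write-high at 0xc00). The macro and helper names are hypothetical, and a byte array stands in for the kernel's bitmap and bitops.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal bit operations on a byte array (stand-ins for kernel bitops). */
static inline bool bit_test(uint8_t *map, uint32_t nr)
{
	return (map[nr / 8] >> (nr % 8)) & 1;
}

static inline bool bit_set(uint8_t *map, uint32_t nr)
{
	map[nr / 8] |= (uint8_t)(1u << (nr % 8));
	return true;
}

static inline bool bit_clear(uint8_t *map, uint32_t nr)
{
	map[nr / 8] &= (uint8_t)~(1u << (nr % 8));
	return true;
}

/*
 * Generate one helper per (action, access) pair. 'base' is the byte offset
 * of the low-MSR region; the high-MSR region follows 0x400 bytes later.
 * The return value is only meaningful for the 'test' action.
 */
#define BUILD_MSR_BITMAP_HELPER(action, access, base)                       \
static inline bool msr_bitmap_##action##_##access(uint8_t *bitmap,          \
						  uint32_t msr)             \
{                                                                           \
	if (msr <= 0x1fff)                                                  \
		return bit_##action(bitmap + (base), msr);                  \
	if (msr >= 0xc0000000 && msr <= 0xc0001fff)                         \
		return bit_##action(bitmap + (base) + 0x400, msr & 0x1fff); \
	return true; /* outside both ranges: always intercepted */          \
}

#define BUILD_MSR_BITMAP_HELPERS(access, base)       \
	BUILD_MSR_BITMAP_HELPER(test, access, base)  \
	BUILD_MSR_BITMAP_HELPER(set, access, base)   \
	BUILD_MSR_BITMAP_HELPER(clear, access, base)

BUILD_MSR_BITMAP_HELPERS(read, 0x000)
BUILD_MSR_BITMAP_HELPERS(write, 0x800)
```

The two invocations at the end expand to six helpers, so the range checks and offsets are written once rather than copied into each function.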
    • KVM: nVMX: Handle dynamic MSR intercept toggling · 67f4b996
      Sean Christopherson authored
      Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2,
      and always update the relevant bits in vmcs02.  This fixes two distinct,
      but intertwined bugs related to dynamic MSR bitmap modifications.
      
      The first issue is that KVM fails to enable MSR interception in vmcs02
      for the FS/GS base MSRs if L1 first runs L2 with interception disabled,
      and later enables interception.
      
      The second issue is that KVM fails to honor userspace MSR filtering when
      preparing vmcs02.
      
      Fix both issues simultaneously as fixing only one of the issues (doesn't
      matter which) would create a mess that no one should have to bisect.
      Fixing only the first bug would exacerbate the MSR filtering issue as
      userspace would see inconsistent behavior depending on the whims of L1.
      Fixing only the second bug (MSR filtering) effectively requires fixing
      the first, as the nVMX code only knows how to transition vmcs02's
      bitmap from 1->0.
      
      Move the various accessors/mutators that are currently buried in vmx.c
      into vmx.h so that they can be shared by the nested code.
      
      Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering")
      Fixes: d69129b4 ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible")
      Cc: stable@vger.kernel.org
      Cc: Alexander Graf <graf@amazon.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109013047.2041518-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
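A deliberately reduced model of the two merge strategies discussed in the commit above, tracking a single MSR as one boolean per bitmap (true means "intercept"). The struct and function names are hypothetical; the real code operates on full bitmaps and also accounts for userspace MSR filters.

```c
#include <stdbool.h>

/*
 * Intercept state of one MSR. vmcs01 is L0's bitmap for L1, vmcs12 is
 * L1's bitmap for L2, and vmcs02 is what hardware uses while L2 runs.
 */
struct msr_intercept_state {
	bool vmcs01;
	bool vmcs12;
	bool vmcs02;
};

/* Problematic pattern: vmcs02 can only ever transition from 1 to 0. */
static void merge_clear_only(struct msr_intercept_state *s)
{
	if (!s->vmcs12)
		s->vmcs02 = false;
}

/* Fixed pattern: consult both levels and always write the result. */
static void merge_always_update(struct msr_intercept_state *s)
{
	s->vmcs02 = s->vmcs01 || s->vmcs12;
}
```

In the clear-only variant, an L1 that first disables and later enables interception leaves vmcs02 stale, which corresponds to the first bug; ignoring vmcs01 corresponds to the second.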
    • KVM: nVMX: Query current VMCS when determining if MSR bitmaps are in use · 7dfbc624
      Sean Christopherson authored
      Check the current VMCS controls to determine if an MSR write will be
      intercepted due to MSR bitmaps being disabled.  In the nested VMX case,
      KVM will disable MSR bitmaps in vmcs02 if they're disabled in vmcs12 or
      if KVM can't map L1's bitmaps for whatever reason.
      
      Note, the bad behavior is relatively benign in the current code base as
      KVM sets all bits in vmcs02's MSR bitmap by default, clears bits if and
      only if L0 KVM also disables interception of an MSR, and only uses the
      buggy helper for MSR_IA32_SPEC_CTRL.  Because KVM explicitly tests WRMSR
      before disabling interception of MSR_IA32_SPEC_CTRL, the flawed check
      will only result in KVM reading MSR_IA32_SPEC_CTRL from hardware when it
      isn't strictly necessary.
      
      Tag the fix for stable in case a future fix wants to use
      msr_write_intercepted(), in which case a buggy implementation in older
      kernels could prove subtly problematic.
      
      Fixes: d28b387f ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109013047.2041518-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
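The check described in the commit above could look roughly like the following userspace sketch. It assumes a reduced, hypothetical view of the currently loaded VMCS; only the control bit position ("use MSR bitmaps", bit 28 of the primary processor-based controls) and the bitmap layout are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

/* "Use MSR bitmaps" is bit 28 of the primary processor-based controls. */
#define CPU_BASED_USE_MSR_BITMAPS (1u << 28)

/* Hypothetical, reduced view of a loaded VMCS for this sketch. */
struct vmcs_sketch {
	uint32_t exec_controls;
	uint8_t  msr_bitmap[4096];
};

/* Test the write-intercept bit for 'msr' in a 4 KiB MSR bitmap. */
static bool write_bit_set(const uint8_t *bitmap, uint32_t msr)
{
	uint32_t base;

	if (msr <= 0x1fff) {
		base = 0x800;		/* write bitmap, low MSRs */
	} else if (msr >= 0xc0000000 && msr <= 0xc0001fff) {
		base = 0xc00;		/* write bitmap, high MSRs */
		msr &= 0x1fff;
	} else {
		return true;		/* not covered: always intercepted */
	}
	return (bitmap[base + msr / 8] >> (msr % 8)) & 1;
}

static bool msr_write_intercepted_sketch(const struct vmcs_sketch *current,
					 uint32_t msr)
{
	/* With bitmaps disabled in the current VMCS, every WRMSR exits. */
	if (!(current->exec_controls & CPU_BASED_USE_MSR_BITMAPS))
		return true;

	return write_bit_set(current->msr_bitmap, msr);
}
```

The important detail is that the controls are read from whichever VMCS is current (vmcs01 or vmcs02), not from a value cached for vmcs01.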
    • KVM: x86: Don't update vcpu->arch.pv_eoi.msr_val when a bogus value was written to MSR_KVM_PV_EOI_EN · afd67ee3
      Vitaly Kuznetsov authored
      When the kvm_gfn_to_hva_cache_init() call from kvm_lapic_set_pv_eoi() fails,
      the MSR write to MSR_KVM_PV_EOI_EN results in #GP, so it is reasonable to
      expect that the value KVM keeps internally is not updated.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211108152819.12485-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Rename kvm_lapic_enable_pv_eoi() · 77c3323f
      Vitaly Kuznetsov authored
      kvm_lapic_enable_pv_eoi() is a misnomer as the function is also
      used to disable PV EOI. Rename it to kvm_lapic_set_pv_eoi().
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211108152819.12485-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURES · 760849b1
      Paul Durrant authored
      Currently when kvm_update_cpuid_runtime() runs, it assumes that the
      KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true,
      however, if Hyper-V support is enabled. In this case the KVM leaves will
      be offset.
      
      This patch introduces a new 'kvm_cpuid_base' field into struct
      kvm_vcpu_arch to track the location of the KVM leaves and function
      kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the
      leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a
      definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now
      target the correct leaf.
      
      NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is introduced
            into processor.h to avoid having duplicate code for the iteration
            over possible hypervisor base leaves.
      Signed-off-by: Paul Durrant <pdurrant@amazon.com>
      Message-Id: <20211105095101.5384-3-pdurrant@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
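A sketch of the signature search described in the commit above, run against a hypothetical in-memory CPUID table instead of a vCPU. The entry struct and the lookup function are illustrative; the iteration range and stride follow the usual hypervisor leaf convention (0x40000000 up to 0x40010000 in steps of 0x100) and should be read as an assumption about the macro's definition.

```c
#include <stdint.h>
#include <string.h>

/* Iterate over the candidate hypervisor CPUID base leaves. */
#define for_each_possible_hypervisor_cpuid_base(function) \
	for (function = 0x40000000; function < 0x40010000; function += 0x100)

/* Hypothetical stand-in for one entry of the CPUID table set by userspace. */
struct cpuid_entry_sketch {
	uint32_t function;
	uint32_t eax, ebx, ecx, edx;
};

/* Return the base leaf carrying the KVM signature, or 0 if none is found. */
static uint32_t find_kvm_cpuid_base(const struct cpuid_entry_sketch *entries,
				    int nent)
{
	/* 12 bytes: "KVMKVMKVM" followed by three NUL bytes. */
	static const char kvm_sig[12] = "KVMKVMKVM\0\0";
	uint32_t function;
	int i;

	for_each_possible_hypervisor_cpuid_base(function) {
		for (i = 0; i < nent; i++) {
			uint32_t sig[3];

			if (entries[i].function != function)
				continue;

			/* The signature is returned in EBX, ECX, EDX. */
			sig[0] = entries[i].ebx;
			sig[1] = entries[i].ecx;
			sig[2] = entries[i].edx;
			if (!memcmp(sig, kvm_sig, sizeof(kvm_sig)))
				return function;
		}
	}
	return 0;
}
```

When Hyper-V enlightenments occupy 0x40000000, the KVM signature is typically found at 0x40000100, and the features leaf is then looked up relative to that base rather than at the fixed 0x40000001.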
    • KVM: x86: Add helper to consolidate core logic of SET_CPUID{2} flows · 8b44b174
      Sean Christopherson authored
      Move the core logic of SET_CPUID and SET_CPUID2 to a common helper; the
      only difference between the two ioctls is the format of the userspace
      struct.  A future fix will add yet more code to the core logic.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211105095101.5384-2-pdurrant@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: mmu: Use fast PF path for access tracking of huge pages when possible · 10c30de0
      Junaid Shahid authored
      The fast page fault path bails out on write faults to huge pages in
      order to accommodate dirty logging. This change adds a check to do that
      only when dirty logging is actually enabled, so that access tracking for
      huge pages can still use the fast path for write faults in the common
      case.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211104003359.2201967-1-junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Properly dereference rcu-protected TDP MMU sptep iterator · c435d4b7
      Sean Christopherson authored
      Wrap the read of iter->sptep in tdp_mmu_map_handle_target_level() with
      rcu_dereference().  Shadow pages in the TDP MMU, and thus their SPTEs,
      are protected by rcu.
      
      This fixes a Sparse warning at tdp_mmu.c:900:51:
        warning: incorrect type in argument 1 (different address spaces)
        expected unsigned long long [usertype] *sptep
        got unsigned long long [noderef] [usertype] __rcu *[usertype] sptep
      
      Fixes: 7158bee4 ("KVM: MMU: pass kvm_mmu_page struct to make_spte")
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211103161833.3769487-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active · cae72dcc
      Maxim Levitsky authored
      KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected using
      standard kvm's inject_pending_event, and not via APICv/AVIC.
      
      Since this is a debug feature, just inhibit APICv/AVIC while
      KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU.
      
      Fixes: 61e5f69e ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Convert return type of *is_valid_rdpmc_ecx() to bool · e6cd31f1
      Jim Mattson authored
      These function names sound like predicates, and they have siblings,
      *is_valid_msr(), which _are_ predicates. Moreover, there are comments
      that essentially warn that these functions behave unexpectedly.
      
      Flip the polarity of the return values, so that they become
      predicates, and convert the boolean result to a success/failure code
      at the outer call site.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211105202058.1048757-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
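The polarity flip in the commit above can be pictured with the following sketch. The PMU model and both function names are hypothetical, and the use of -EINVAL at the outer call site is an assumption made for illustration only.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PMU model: counter indices below nr_counters are valid. */
struct pmu_sketch {
	uint32_t nr_counters;
};

/* Predicate: true means the index is valid, as the name suggests. */
static bool is_valid_rdpmc_ecx_sketch(const struct pmu_sketch *pmu,
				      uint32_t idx)
{
	return idx < pmu->nr_counters;
}

/* Outer call site: convert the boolean into a success/failure code. */
static int check_rdpmc_sketch(const struct pmu_sketch *pmu, uint32_t idx)
{
	return is_valid_rdpmc_ecx_sketch(pmu, idx) ? 0 : -EINVAL;
}
```

Keeping the 0/error convention at the outermost layer only lets the inner helpers read naturally in conditions such as `if (!is_valid_...())`.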
    • KVM: x86: Fix recording of guest steal time / preempted status · 7e2175eb
      David Woodhouse authored
      In commit b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is
      not missed") we switched to using a gfn_to_pfn_cache for accessing the
      guest steal time structure in order to allow for an atomic xchg of the
      preempted field. This has a couple of problems.
      
      Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the
      atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a
      guest vCPU using an IOMEM page for its steal time would never have its
      preempted field set.
      
      Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it
      should have been. There are two stages to the GFN->PFN conversion;
      first the GFN is converted to a userspace HVA, and then that HVA is
      looked up in the process page tables to find the underlying host PFN.
      Correct invalidation of the latter would require being hooked up to the
      MMU notifiers, but that doesn't happen---so it just keeps mapping and
      unmapping the *wrong* PFN after the userspace page tables change.
      
      In the !IOMEM case at least the stale page *is* pinned all the time it's
      cached, so it won't be freed and reused by anyone else while still
      receiving the steal time updates. The map/unmap dance only takes care
      of the KVM administrivia such as marking the page dirty.
      
      Until the gfn_to_pfn cache handles the remapping automatically by
      integrating with the MMU notifiers, we might as well not get a
      kernel mapping of it, and use the perfectly serviceable userspace HVA
      that we already have.  We just need to implement the atomic xchg on
      the userspace address with appropriate exception handling, which is
      fairly trivial.
      
      Cc: stable@vger.kernel.org
      Fixes: b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org>
      [I didn't entirely agree with David's assessment of the
       usefulness of the gfn_to_pfn cache, and integrated the outcome
       of the discussion in the above commit message. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
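To illustrate why the commit above needs an atomic exchange on the preempted field, here is a userspace sketch using C11 atomics. The struct and function names are hypothetical; the flag values match the KVM paravirt ABI. The kernel performs the exchange on a userspace address with exception handling for faults, which this sketch omits.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define KVM_VCPU_PREEMPTED (1 << 0)
#define KVM_VCPU_FLUSH_TLB (1 << 1)

/* Reduced stand-in for the guest-visible steal time record. */
struct steal_time_sketch {
	_Atomic uint8_t preempted;
};

/* Host side, when the vCPU is scheduled out: mark it preempted. */
static void set_preempted_sketch(struct steal_time_sketch *st)
{
	atomic_store(&st->preempted, KVM_VCPU_PREEMPTED);
}

/*
 * Host side, when the vCPU is scheduled back in: clear the field and, in
 * the same atomic step, learn whether another vCPU requested a TLB flush.
 * A separate read followed by a write could lose a flag set in between.
 */
static bool clear_preempted_sketch(struct steal_time_sketch *st)
{
	uint8_t old = atomic_exchange(&st->preempted, 0);

	return old & KVM_VCPU_FLUSH_TLB;
}
```

A true return value means the host should flush the guest TLB before resuming the vCPU.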
    • selftest: KVM: Add intra host migration tests · 6a581508
      Peter Gonda authored
      Adds testcases for intra host migration for SEV and SEV-ES. Also adds
      a locking test to confirm that no deadlock exists.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20211021174303.385706-6-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • selftest: KVM: Add open sev dev helper · 7a6ab3cf
      Peter Gonda authored
      Refactors out open path support from open_kvm_dev_path_or_exit() and
      adds new helper for SEV device path.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20211021174303.385706-5-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SEV: Add support for SEV-ES intra host migration · 0b020f5a
      Peter Gonda authored
      For SEV-ES to work with intra host migration the VMSAs, GHCB metadata,
      and other SEV-ES info need to be preserved along with the guest's
      memory.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Reviewed-by: Marc Orr <marcorr@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20211021174303.385706-4-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SEV: Add support for SEV intra host migration · b5663931
      Peter Gonda authored
      For SEV to work with intra host migration, contents of the SEV info struct
      such as the ASID (used to index the encryption key in the AMD SP) and
      the list of memory regions need to be transferred to the target VM.
      This change adds a command for a target VMM to get a source SEV VM's sev
      info.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Marc Orr <marcorr@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20211021174303.385706-3-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SEV: provide helpers to charge/uncharge misc_cg · 91b692a0
      Paolo Bonzini authored
      Avoid code duplication across all callers of misc_cg_try_charge and
      misc_cg_uncharge.  The resource type for KVM is always derived from
      sev->es_active, and the quantity is always 1.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: generalize "bugged" VM to "dead" VM · f4d31653
      Paolo Bonzini authored
      Generalize KVM_REQ_VM_BUGGED so that it can be called even in cases
      where it is by design that the VM cannot be operated upon.  In this
      case any KVM_BUG_ON should still warn, so introduce a new flag
      kvm->vm_dead that is separate from kvm->vm_bugged.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
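A small model of the relationship between the two flags in the commit above: marking a VM dead does not mark it bugged, while a detected bug marks it both. The struct and function names are hypothetical, and the request that kicks vCPUs out of the guest is omitted.

```c
#include <stdbool.h>
#include <stdio.h>

/* Reduced stand-in for struct kvm with only the two state flags. */
struct vm_sketch {
	bool vm_bugged;
	bool vm_dead;
	int  warnings;	/* counts how many times a bug was reported */
};

/* The VM can no longer be operated upon, possibly by design. */
static void vm_dead_sketch(struct vm_sketch *vm)
{
	vm->vm_dead = true;
}

/* A KVM bug was detected: the VM is bugged and therefore also dead. */
static void vm_bugged_sketch(struct vm_sketch *vm)
{
	vm->vm_bugged = true;
	vm_dead_sketch(vm);
}

/* Warn on the first bug only, even if the VM was already dead by design. */
static bool bug_on_sketch(struct vm_sketch *vm, bool cond)
{
	if (cond && !vm->vm_bugged) {
		vm->warnings++;
		fprintf(stderr, "KVM bug detected\n");
		vm_bugged_sketch(vm);
	}
	return cond;
}
```

Keeping the flags separate is what lets the warning still fire for a VM that was already made dead for a legitimate reason.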
    • KVM: SEV: Refactor out sev_es_state struct · b67a4cc3
      Peter Gonda authored
      Move SEV-ES vCPU metadata into new sev_es_state struct from vcpu_svm.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
      Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20211021174303.385706-2-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge branch 'kvm-guest-sev-migration' into kvm-master · b9ecb9a9
      Paolo Bonzini authored
      Add guest API and guest kernel support for SEV live migration.
      
      Introduces a new hypercall to notify the host of changes to the page
      encryption status.  If the page is encrypted then it must be migrated
      through the SEV firmware or a helper VM sharing the key.  If the page is
      not encrypted then it can be migrated normally by userspace.  This new
      hypercall is invoked using paravirt_ops.
      
      Conflicts: sev_active() replaced by cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).
    • x86/kvm: Add kexec support for SEV Live Migration. · 73f1b4fe
      Ashish Kalra authored
      Reset the host's shared pages list related to kernel
      specific page encryption status settings before we load a
      new kernel by kexec. We cannot reset the complete
      shared pages list here as we need to retain the
      UEFI/OVMF firmware specific settings.
      
      The host's shared pages list is maintained for the
      guest to keep track of all unencrypted guest memory regions,
      therefore we need to explicitly mark all shared pages as
      encrypted again before rebooting into the new guest kernel.
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Message-Id: <3e051424ab839ea470f88333273d7a185006754f.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm: Add guest support for detecting and enabling SEV Live Migration feature. · f4495615
      Ashish Kalra authored
      The guest support for detecting and enabling the SEV live migration
      feature uses the following logic:

       - kvm_init_platform() checks whether the guest is booted under EFI

         - If not EFI,

           i) If kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), issue a wrmsrl()
              to enable the SEV live migration support

         - If EFI,

           i) If kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), read
              the UEFI variable which indicates OVMF support for live migration

           ii) If the variable indicates live migration is supported, issue a wrmsrl() to
               enable the SEV live migration support

      The EFI live migration check is done using a late_initcall() callback.

      Also, ensure that the __bss_decrypted section is marked as decrypted in the
      hypervisor's guest page encryption status tracking.
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Message-Id: <b4453e4c87103ebef12217d2505ea99a1c3e0f0f.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
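The decision flow listed in the commit above can be condensed into two small functions, one for the early platform init step and one for the late EFI check. This is only a sketch: the enum, the function names, and the boolean inputs are hypothetical stand-ins for the real feature and UEFI variable queries.

```c
#include <stdbool.h>

/* Hypothetical outcomes of the early platform-init step. */
enum migration_action {
	MIG_NONE,		/* host does not advertise the feature */
	MIG_ENABLE_NOW,		/* non-EFI boot: write the MSR immediately */
	MIG_DEFER_TO_EFI,	/* EFI boot: decide in the late initcall */
};

/*
 * 'has_migration_control' models
 * kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL).
 */
static enum migration_action early_migration_action(bool booted_via_efi,
						    bool has_migration_control)
{
	if (!has_migration_control)
		return MIG_NONE;

	return booted_via_efi ? MIG_DEFER_TO_EFI : MIG_ENABLE_NOW;
}

/*
 * Late step for EFI boots: enable only if the UEFI variable reports that
 * OVMF supports live migration.
 */
static bool late_efi_should_enable(bool has_migration_control,
				   bool efi_var_reports_support)
{
	return has_migration_control && efi_var_reports_support;
}
```

The EFI decision is deferred because the variable can only be read once EFI runtime services are available.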
    • EFI: Introduce the new AMD Memory Encryption GUID. · 2f70ddb1
      Ashish Kalra authored
      Introduce a new AMD Memory Encryption GUID which is currently
      used for defining a new UEFI environment variable which indicates
      UEFI/OVMF support for the SEV live migration feature. This variable
      is set up when UEFI/OVMF detects host/hypervisor support for SEV
      live migration and later this variable is read by the kernel using
      EFI runtime services to verify if OVMF supports the live migration
      feature.
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Message-Id: <1cea22976d2208f34d47e0c1ce0ecac816c13111.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • mm: x86: Invoke hypercall when page encryption status is changed · 064ce6c5
      Brijesh Singh authored
      Invoke a hypercall when a memory region is changed from encrypted ->
      decrypted and vice versa. The hypervisor needs to know the page encryption
      status during guest migration.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Message-Id: <0a237d5bb08793916c7790a3e653a2cbe7485761.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm: Add AMD SEV specific Hypercall3 · 08c2336d
      Brijesh Singh authored
      The KVM hypercall framework relies on the alternatives framework to patch
      VMCALL -> VMMCALL on AMD platforms. If a hypercall is made before
      apply_alternatives() is called then it defaults to VMCALL. This approach
      works fine on a non-SEV guest: a VMCALL causes #UD, and the hypervisor
      is able to decode the instruction and do the right thing. But
      when SEV is active, guest memory is encrypted with the guest key and the
      hypervisor is not able to decode the instruction bytes.
      
      To highlight the need for this interface, consider the flow leading up
      to apply_alternatives():
      setup_arch() calls init_hypervisor_platform(), which detects
      the hypervisor platform the kernel is running under, and then the
      hypervisor-specific initialization code can make early hypercalls.
      For example, KVM specific initialization in case of SEV will try
      to mark the "__bss_decrypted" section's encryption state via early
      page encryption status hypercalls.
      
      Now, apply_alternatives() is called much later when setup_arch()
      calls check_bugs(), so we do need some kind of an early,
      pre-alternatives hypercall interface. Other cases of pre-alternatives
      hypercalls include marking per-cpu GHCB pages as decrypted on SEV-ES
      and per-cpu apf_reason, steal_time and kvm_apic_eoi as decrypted for
      SEV generally.
      
      Add an SEV-specific hypercall3 that unconditionally uses VMMCALL. The
      hypercall will be used by the SEV guest to notify the hypervisor about
      the encryption status of pages.
      
      The kvm_sev_hypercall3() function is abstracted and used as follows:
      all these early hypercalls are made through the early_set_memory_XX()
      interfaces, which in turn invoke pv_ops (paravirt_ops).
      
      This early_set_memory_XX() -> pv_ops.mmu.notify_page_enc_status_changed()
      is a generic interface and can easily have SEV, TDX and any other
      future platform specific abstractions added to it.
      
      Currently, the pv_ops.mmu.notify_page_enc_status_changed() callback is
      set up to invoke kvm_sev_hypercall3() in the SEV case.
      
      Similarly, in the TDX case, pv_ops.mmu.notify_page_enc_status_changed()
      can be set up to invoke a TDX-specific callback.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <6fd25c749205dd0b1eb492c60d41b124760cc6ae.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      08c2336d
    • Linus Torvalds's avatar
      Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · debe436e
      Linus Torvalds authored
      Pull ext4 updates from Ted Ts'o:
       "Only bug fixes and cleanups for ext4 this merge window.
      
        Of note are fixes for the combination of the inline_data and
        fast_commit features, and more accurately calculating when to schedule
        additional lazy inode table init, especially when CONFIG_HZ is 100HZ"
      
      * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: fix error code saved on super block during file system abort
        ext4: inline data inode fast commit replay fixes
        ext4: commit inline data during fast commit
        ext4: scope ret locally in ext4_try_to_trim_range()
        ext4: remove an unused variable warning with CONFIG_QUOTA=n
        ext4: fix boolreturn.cocci warnings in fs/ext4/name.c
        ext4: prevent getting empty inode buffer
        ext4: move ext4_fill_raw_inode() related functions
        ext4: factor out ext4_fill_raw_inode()
        ext4: prevent partial update of the extent blocks
        ext4: check for inconsistent extents between index and leaf block
        ext4: check for out-of-order index extents in ext4_valid_extent_entries()
        ext4: convert from atomic_t to refcount_t on ext4_io_end->count
        ext4: refresh the ext4_ext_path struct after dropping i_data_sem.
        ext4: ensure enough credits in ext4_ext_shift_path_extents
        ext4: correct the left/middle/right debug message for binsearch
        ext4: fix lazy initialization next schedule time computation in more granular unit
        Revert "ext4: enforce buffer head state assertion in ext4_da_map_blocks"
      debe436e
    • Linus Torvalds's avatar
      Merge tag 'for-5.16-deadlock-fix-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 6070dcc8
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "Fix for a deadlock when direct/buffered IO is done on a mmaped file
        and a fault happens (details in the patch). There's a fstest
        generic/647 that triggers the problem and makes testing hard"
      
      * tag 'for-5.16-deadlock-fix-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix deadlock due to page faults during direct IO reads and writes
      6070dcc8
    • Linus Torvalds's avatar
      Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux · 38764c73
      Linus Torvalds authored
      Pull nfsd updates from Bruce Fields:
       "A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
        support for a filehandle format deprecated 20 years ago, and further
        xdr-related cleanup from Chuck"
      
      * tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
        nfsd4: remove obselete comment
        nfsd: document server-to-server-copy parameters
        NFSD:fix boolreturn.cocci warning
        nfsd: update create verifier comment
        SUNRPC: Change return value type of .pc_encode
        SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
        NFSD: Save location of NFSv4 COMPOUND status
        SUNRPC: Change return value type of .pc_decode
        SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
        SUNRPC: De-duplicate .pc_release() call sites
        SUNRPC: Simplify the SVC dispatch code path
        SUNRPC: Capture value of xdr_buf::page_base
        SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
        svcrdma: Split svcrmda_wc_{read,write} tracepoints
        svcrdma: Split the svcrdma_wc_send() tracepoint
        svcrdma: Split the svcrdma_wc_receive() tracepoint
        NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
        SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
        NFSD: Initialize pointer ni with NULL and not plain integer 0
        NFSD: simplify struct nfsfh
        ...
      38764c73
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 2ec20f48
      Linus Torvalds authored
      Pull NFS client updates from Trond Myklebust:
       "Highlights include:
      
        Features:
         - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
         - Optimisations for READDIR and the 'ls -l' style workload
         - Further replacements of dprintk() with tracepoints and other
           tracing improvements
         - Ensure we re-probe NFSv4 server capabilities when the user does a
           "mount -o remount"
      
        Bugfixes:
         - Fix an Oops in pnfs_mark_request_commit()
         - Fix up deadlocks in the commit code
         - Fix regressions in NFSv2/v3 attribute revalidation due to the
           change_attr_type optimisations
         - Fix some dentry verifier races
         - Fix some missing dentry verifier settings
         - Fix a performance regression in nfs_set_open_stateid_locked()
         - SUNRPC was sending multiple SYN calls when re-establishing a TCP
           connection.
         - Fix multiple NFSv4 issues due to missing sanity checking of server
           return values
         - Fix a potential Oops when FREE_STATEID races with an unmount
      
        Cleanups:
         - Clean up the labelled NFS code
         - Remove unused header <linux/pnfs_osd_xdr.h>"
      
      * tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
        NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
        NFS: Remove the nfs4_label argument from decode_getattr_*() functions
        NFS: Remove the nfs4_label argument from nfs_setsecurity
        NFS: Remove the nfs4_label argument from nfs_fhget()
        NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
        NFS: Remove the nfs4_label argument from nfs_instantiate()
        NFS: Remove the nfs4_label from the nfs_setattrres
        NFS: Remove the nfs4_label from the nfs4_getattr_res
        NFS: Remove the f_label from the nfs4_opendata and nfs_openres
        NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
        NFS: Remove the label from the nfs4_lookup_res struct
        NFS: Remove the nfs4_label from the nfs4_link_res struct
        NFS: Remove the nfs4_label from the nfs4_create_res struct
        NFS: Remove the nfs4_label from the nfs_entry struct
        NFS: Create a new nfs_alloc_fattr_with_label() function
        NFS: Always initialise fattr->label in nfs_fattr_alloc()
        NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
        NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
        NFSv4: Remove unnecessary 'minor version' check
        NFSv4: Fix potential Oops in decode_op_map()
        ...
      2ec20f48
    • Linus Torvalds's avatar
      Merge branch 'exit-cleanups-for-v5.16' of... · 5147da90
      Linus Torvalds authored
      Merge branch 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
      
      Pull exit cleanups from Eric Biederman:
       "While looking at some issues related to the exit path in the kernel I
        found several instances where the code is not using the existing
        abstractions properly.
      
        This set of changes introduces force_fatal_sig, a way of sending a
        signal and not allowing it to be caught, and corrects the misuse of
        the existing abstractions that I found.
      
        A lot of the misuse of the existing abstractions are silly things such
        as doing something after calling a no return function, rolling BUG by
        hand, doing more work than necessary to terminate a kernel thread, or
        calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
      
        In the review a deficiency in force_fatal_sig and force_sig_seccomp
        was found, where ptrace or sigaction could prevent the delivery of
        the signal. I have added a change that adds SA_IMMUTABLE, which
        makes it impossible to interrupt the delivery of those signals and
        allows backporting to fix force_sig_seccomp.
      
        And Arnd found an issue where a function passed to kthread_run had the
        wrong prototype, and after my cleanup was failing to build."
      
      * 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
        soc: ti: fix wkup_m3_rproc_boot_thread return type
        signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
        signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
        exit/r8188eu: Replace the macro thread_exit with a simple return 0
        exit/rtl8712: Replace the macro thread_exit with a simple return 0
        exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
        signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
        signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
        signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
        exit/syscall_user_dispatch: Send ordinary signals on failure
        signal: Implement force_fatal_sig
        exit/kthread: Have kernel threads return instead of calling do_exit
        signal/s390: Use force_sigsegv in default_trap_handler
        signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
        signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
        signal/sparc: In setup_tsb_params convert open coded BUG into BUG
        signal/powerpc: On swapcontext failure force SIGSEGV
        signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
        signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
        signal/sparc32: Remove unreachable do_exit in do_sparc_fault
        ...
      5147da90
    • Linus Torvalds's avatar
      Merge tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · a41b7445
      Linus Torvalds authored
      Pull prctl updates from Christian Brauner:
       "This contains the missing prctl uapi pieces for PR_SCHED_CORE.
      
        In order to activate core scheduling the caller is expected to specify
        the scope of the new core scheduling domain.
      
        For example, passing 2 in the 4th argument of
      
           prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, <pid>,  2, 0);
      
        would indicate that the new core scheduling domain encompasses all
        tasks in the process group of <pid>. Specifying 0 would only create a
        core scheduling domain for the thread identified by <pid> and 1 would
        encompass the whole thread-group of <pid>.
      
        Note, the values 0, 1, and 2 correspond to PIDTYPE_PID, PIDTYPE_TGID,
        and PIDTYPE_PGID. A first version tried to expose those values
        directly to which I objected because:
      
         - PIDTYPE_* is an enum that is kernel internal which we should not
           expose to userspace directly.
      
         - PIDTYPE_* indicates what a given struct pid is used for; it doesn't
           express a scope.
      
        But what the 4th argument of PR_SCHED_CORE prctl() expresses is the
        scope of the operation, i.e. the scope of the core scheduling domain
        at creation time. So Eugene's patch now simply introduces three new
        defines PR_SCHED_CORE_SCOPE_THREAD, PR_SCHED_CORE_SCOPE_THREAD_GROUP,
        and PR_SCHED_CORE_SCOPE_PROCESS_GROUP. They simply express what
        happens.
      
        This has been on the mailing list for quite a while with all relevant
        scheduler folks Cced. I announced multiple times that I'd pick this up
        if I don't see or hear anyone else doing it. None of this touches
        proper scheduler code but only concerns uapi so I think this is fine.
      
        With core scheduling being quite common now for vm managers (e.g.
        moving individual vcpu threads into their own core scheduling domain)
        and container managers (e.g. moving the init process into its own core
        scheduling domain and letting all created children inherit it) having
        to rely on raw numbers passed as the 4th argument in prctl() is a bit
        annoying and everyone is starting to come up with their own defines"
      
      * tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
      a41b7445
    • Linus Torvalds's avatar
      Merge tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 6752de1a
      Linus Torvalds authored
      Pull pidfd updates from Christian Brauner:
       "Various places in the kernel have picked up pidfds.
      
        The two most recent additions have probably been the ability to use
        pidfds in bpf maps and the usage of pidfds in mm-based syscalls such
        as process_mrelease() and process_madvise().
      
        The same pattern to turn a pidfd into a struct task exists in two
        places. One of those places used PIDTYPE_TGID while the other one used
        PIDTYPE_PID even though it is clearly documented in all pidfd-helpers
        that pidfds __currently__ only refer to thread-group leaders (subject
        to change in the future if need be).
      
        This isn't a bug per se but has the potential to be one if we allow
        pidfds to refer to individual threads. If that happens we want to
        audit all codepaths that make use of them to ensure they can deal with
        pidfds referring to individual threads.
      
        This adds a simple helper to turn a pidfd into a struct task making it
        easy to grep for such places. Plus, it gets rid of code-duplication"
      
      * tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        mm: use pidfd_get_task()
        pid: add pidfd_get_task() helper
      6752de1a