1. 10 May, 2024 5 commits
    • KVM: guest_memfd: Add hook for initializing memory · 3bb2531e
      Paolo Bonzini authored
      guest_memfd pages are generally expected to be in some arch-defined
      initial state prior to using them for guest memory. For SEV-SNP this
      initial state is 'private', or 'guest-owned', and requires additional
      operations to move these pages into a 'private' state by updating the
      corresponding entries in the RMP table.
      
      Allow for an arch-defined hook to handle updates of this sort, and go
      ahead and implement one for x86 so KVM implementations like AMD SVM can
      register a kvm_x86_ops callback to handle these updates for SEV-SNP
      guests.
      
      The preparation callback is always called when allocating/grabbing
      folios via gmem, and it is up to the architecture to keep track of
      whether or not the pages are already in the expected state (e.g. the RMP
      table in the case of SEV-SNP).
      
      In some cases, it is necessary to defer the preparation of the pages to
      handle things like in-place encryption of initial guest memory payloads
      before marking these pages as 'private'/'guest-owned'.  Add an argument
      (always true for now) to kvm_gmem_get_folio() that allows for the
      preparation callback to be bypassed.  To detect possible issues in
      the way userspace initializes memory, it is only possible to add an
      unprepared page if it is not already included in the filemap.
      
      Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-Id: <20231230172351.574091-5-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
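The preparation hook and its bypass argument described in the commit above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: every `toy_` name is hypothetical, two arrays stand in for the filemap and for arch state such as the RMP table, and nothing here is the kernel's kvm_gmem_get_folio().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

/* Toy stand-in for the filemap: which indices already hold a folio. */
static bool toy_in_filemap[TOY_NR_PAGES];
/* Toy stand-in for arch-tracked state (e.g. the RMP table for SEV-SNP). */
static bool toy_arch_private[TOY_NR_PAGES];
static int toy_prepare_calls;

/* Hypothetical arch hook: called on every grab, so it must be idempotent. */
static int toy_arch_prepare(size_t index)
{
	assert(index < TOY_NR_PAGES);
	toy_prepare_calls++;
	toy_arch_private[index] = true;
	return 0;
}

/* Returns 0 on success or a negative errno value. */
static int toy_gmem_get_folio(size_t index, bool prepare)
{
	if (index >= TOY_NR_PAGES)
		return -EINVAL;

	if (!prepare) {
		/* An unprepared page may only be added if it is not present yet. */
		if (toy_in_filemap[index])
			return -EEXIST;
		toy_in_filemap[index] = true;
		return 0;
	}

	toy_in_filemap[index] = true;
	return toy_arch_prepare(index);
}
```

In this model, grabbing the same index twice with `prepare == true` calls the hook twice and leaves the state unchanged, while a second unprepared grab of the same index fails with `-EEXIST`.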
    • KVM: guest_memfd: limit overzealous WARN · fa30b0dc
      Paolo Bonzini authored
      Because kvm_gmem_get_pfn() is called from the page fault path without
      any of the slots_lock, filemap lock or mmu_lock taken, it is
      possible for it to race with kvm_gmem_unbind().  This is not a
      problem, as any PTE that is installed temporarily will be zapped
      before the guest has the occasion to run.
      
      However, it is not possible to have a complete unbind+bind
      racing with the page fault, because deleting the memslot
      will call synchronize_srcu_expedited() and wait for the
      page fault to be resolved.  Thus, we can still warn if
      the file is there and is not the one we expect.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: guest_memfd: pass error up from filemap_grab_folio · 70623723
      Paolo Bonzini authored
      Some SNP ioctls will require the page not to be in the pagecache, and as such they
      will want to return EEXIST to userspace.  Start by passing the error up from
      filemap_grab_folio.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode · 1d23040c
      Michael Roth authored
      truncate_inode_pages_range() may attempt to zero pages before truncating
      them, and this will occur before arch-specific invalidations can be
      triggered via .invalidate_folio/.free_folio hooks via kvm_gmem_aops. For
      AMD SEV-SNP this would result in an RMP #PF being generated by the
      hardware, which is currently treated as fatal (and even if specifically
      allowed for, would not result in anything other than garbage being
      written to guest pages due to encryption). On Intel TDX this would also
      result in undesirable behavior.
      
      Set the AS_INACCESSIBLE flag to prevent the MM from attempting
      unexpected accesses of this sort during operations like truncation.
      
      This may also in some cases yield a decent performance improvement for
      guest_memfd userspace implementations that hole-punch ranges immediately
      after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since
      the current implementation of truncate_inode_pages_range() always ends
      up zero'ing an entire 4K range if it is backed by a 2M folio.
      
      Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-ID: <20240329212444.395559-6-michael.roth@amd.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory · c72ceafb
      Michael Roth authored
      filemap users like guest_memfd may use page cache pages to
      allocate/manage memory that is only intended to be accessed by guests
      via hardware protections like encryption. Writes to memory of this sort
      in common paths like truncation may cause unexpected behavior such as
      writing garbage instead of zeros when attempting to zero pages, or
      worse, triggering hardware protections that are considered fatal as far
      as the kernel is concerned.
      
      Introduce a new address_space flag, AS_INACCESSIBLE, and use this
      initially to prevent zero'ing of pages during truncation, with the
      understanding that it is up to the owner of the mapping to handle this
      specially if needed.
      
      This is admittedly a rather blunt solution, but it seems like
      there are no other places that should take into account the
      flag to keep its promise.
      
      Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/
      Cc: Matthew Wilcox <willy@infradead.org>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-ID: <20240329212444.395559-5-michael.roth@amd.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
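The effect of an address_space flag that suppresses zeroing during truncation can be illustrated with a small user-space model. This is a sketch only: the `toy_` types and helpers are hypothetical, the 16-byte "page" is arbitrary, and the code does not reproduce truncate_inode_pages_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16
#define TOY_AS_INACCESSIBLE (1u << 0)	/* hypothetical flag bit */

struct toy_mapping {
	unsigned int flags;
	unsigned char page[TOY_PAGE_SIZE];
};

static bool toy_mapping_inaccessible(const struct toy_mapping *m)
{
	return (m->flags & TOY_AS_INACCESSIBLE) != 0;
}

/*
 * Partial truncation: zero [off, off + len) unless the mapping's owner
 * declared the memory inaccessible. Returns true if bytes were zeroed.
 */
static bool toy_truncate_partial(struct toy_mapping *m, size_t off, size_t len)
{
	assert(off <= TOY_PAGE_SIZE && len <= TOY_PAGE_SIZE - off);

	if (toy_mapping_inaccessible(m))
		return false;	/* the owner handles this range specially */

	memset(m->page + off, 0, len);
	return true;
}
```

With the flag clear, the requested range is zeroed; with the flag set, the helper leaves the bytes alone, which is the behaviour the two AS_INACCESSIBLE commits rely on for guest-owned pages.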
  2. 07 May, 2024 17 commits
    • KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns · 2b1f4355
      Sean Christopherson authored
      WARN if __kvm_faultin_pfn() generates a "no slot" pfn, and gracefully
      handle the unexpected behavior instead of continuing on with dangerous
      state, e.g. tdp_mmu_map_handle_target_level() _only_ checks fault->slot,
      and so could install a bogus PFN into the guest.
      
      The existing code is functionally ok, because kvm_faultin_pfn() pre-checks
      all of the cases that result in KVM_PFN_NOSLOT, but it is unnecessarily
      unsafe as it relies on __gfn_to_pfn_memslot() getting the _exact_ same
      memslot, i.e. not a re-retrieved pointer with KVM_MEMSLOT_INVALID set.
      And checking only fault->slot would fall apart if KVM ever added a flag or
      condition that forced emulation, similar to how KVM handles writes to
      read-only memslots.
      
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values · f3310e62
      Sean Christopherson authored
      Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault()
      to harden KVM against using "uninitialized" values.  In quotes because the
      fields are actually zero-initialized, and zero is a legal value for both
      page frame numbers and virtual addresses.  E.g. failure to set "pfn" prior
      to creating an SPTE could result in KVM pointing at physical address '0',
      which is far less desirable than KVM generating a SPTE with reserved PA
      bits set and thus effectively killing the VM.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
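The hardening idea above (start from an explicit error sentinel rather than a zero that is also a legal value) might look like this in isolation. The sentinel values and all `toy_` names are assumptions made for illustration; the kernel's real KVM_PFN_ERR_* and KVM_HVA_ERR_BAD definitions are different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinels, chosen only so that they differ from the legal value 0. */
#define TOY_PFN_ERR UINT64_MAX
#define TOY_HVA_ERR UINT64_MAX

struct toy_page_fault {
	uint64_t gfn;
	uint64_t pfn;
	uint64_t hva;
};

/* Start every fault with pfn/hva set to error values instead of zero. */
static struct toy_page_fault toy_fault_init(uint64_t gfn)
{
	struct toy_page_fault f = {
		.gfn = gfn,
		.pfn = TOY_PFN_ERR,
		.hva = TOY_HVA_ERR,
	};
	return f;
}

/* Refuse to build a mapping from a fault whose pfn was never filled in. */
static bool toy_install_mapping(const struct toy_page_fault *f, uint64_t *spte)
{
	assert(f != NULL && spte != NULL);

	if (f->pfn == TOY_PFN_ERR)
		return false;

	*spte = f->pfn << 12;
	return true;
}
```

Because pfn 0 is legal, a zero-initialized fault would pass the check and map physical address 0; with the sentinel, a forgotten assignment is caught instead.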
    • KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot" faults · 36d44927
      Sean Christopherson authored
      Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot"
      fault to ensure that KVM doesn't use a bogus virtual address, e.g. if
      there *was* a slot but it's unusable (APIC access page), or if there
      really was no slot, in which case fault->hva will be '0' (which is a
      legal address for x86).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn() · f6adeae8
      Sean Christopherson authored
      Handle the "no memslot" case at the beginning of kvm_faultin_pfn(), just
      after the private versus shared check, so that there's no need to
      repeatedly query whether or not a slot exists.  This also makes it more
      obvious that, except for private vs. shared attributes, the process of
      faulting in a pfn simply doesn't apply to gfns without a slot.
      
      Opportunistically stuff @fault's metadata in kvm_handle_noslot_fault() so
      that it doesn't need to be duplicated in all paths that invoke
      kvm_handle_noslot_fault(), and to minimize the probability of not stuffing
      the right fields.
      
      Leave the existing handle behind, but convert it to a WARN, to guard
      against __kvm_faultin_pfn() unexpectedly nullifying fault->slot.
      
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
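The ordering this commit and its neighbours in the series describe (private vs. shared check first, then the no-slot case, then the actual pfn lookup) can be sketched as a small decision function. This is a simplified model, not the kernel's kvm_faultin_pfn(): the `toy_` names, the struct layout, and the choice of `-EFAULT` for both rejected cases are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Non-negative outcomes; negative return values are errno codes. */
enum toy_outcome {
	TOY_EMULATE_MMIO = 1,	/* no memslot: treat the access as emulated MMIO */
	TOY_RESOLVE_PFN,	/* slot exists: go on to resolve a pfn */
};

struct toy_fault {
	bool is_private;	/* the access itself is private */
	bool gfn_is_private;	/* the gfn's current attribute */
	bool has_slot;
};

/* Private accesses to emulated MMIO are rejected here, in one place. */
static int toy_handle_noslot_fault(const struct toy_fault *f)
{
	if (f->is_private)
		return -EFAULT;
	return TOY_EMULATE_MMIO;
}

static int toy_faultin_pfn(const struct toy_fault *f)
{
	assert(f != NULL);

	/* 1. Attribute mismatch wins over everything else. */
	if (f->is_private != f->gfn_is_private)
		return -EFAULT;

	/* 2. No slot: bail early, nothing to fault in. */
	if (!f->has_slot)
		return toy_handle_noslot_fault(f);

	/* 3. Only now would the real pfn lookup happen. */
	return TOY_RESOLVE_PFN;
}
```

Checking attributes before slot validity means a mismatched access yields the same result whether or not a memslot exists, which is the ABI consistency argument made further down in this log.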
    • KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn() · cd272fc4
      Sean Christopherson authored
      Move the checks related to the validity of an access to a memslot from the
      inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn().  This
      allows emulating accesses to the APIC access page, which don't need to
      resolve a pfn, even if there is a relevant in-progress mmu_notifier
      invalidation.  Ditto for accesses to KVM internal memslots from L2, which
      KVM also treats as emulated MMIO.
      
      More importantly, this will allow for future cleanup by having the
      "no memslot" case bail from kvm_faultin_pfn() very early on.
      
      Go to rather extreme and gross lengths to make the change a glorified
      nop, e.g. call into __kvm_faultin_pfn() even when there is no slot, as the
      related code is very subtle.  E.g. fault->slot can be nullified if it
      points at the APIC access page, some flows in KVM x86 expect fault->pfn
      to be KVM_PFN_NOSLOT, while others check only fault->slot, etc.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-13-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Explicitly disallow private accesses to emulated MMIO · bde9f9d2
      Sean Christopherson authored
      Explicitly detect and disallow private accesses to emulated MMIO in
      kvm_handle_noslot_fault() instead of relying on kvm_faultin_pfn_private()
      to perform the check.  This will allow the page fault path to go straight
      to kvm_handle_noslot_fault() without bouncing through __kvm_faultin_pfn().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240228024147.41573-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't force emulation of L2 accesses to non-APIC internal slots · 5bd74f6e
      Sean Christopherson authored
      Allow mapping KVM's internal memslots used for EPT without unrestricted
      guest into L2, i.e. allow mapping the hidden TSS and the identity mapped
      page tables into L2.  Unlike the APIC access page, there is no correctness
      issue with letting L2 access the "hidden" memory.  Allowing these memslots
      to be mapped into L2 fixes a largely theoretical bug where KVM could
      incorrectly emulate subsequent _L1_ accesses as MMIO, and also ensures
      consistent KVM behavior for L2.
      
      If KVM is using TDP, but L1 is using shadow paging for L2, then routing
      through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO,
      and create an MMIO SPTE.  Creating an MMIO SPTE is ok, but only because
      kvm_mmu_page_role.guest_mode ensures KVM uses different roots for L1 vs.
      L2.  But vcpu->arch.mmio_gfn will remain valid, and could cause KVM to
      incorrectly treat an L1 access to the hidden TSS or identity mapped page
      tables as MMIO.
      
      Furthermore, forcing L2 accesses to be treated as "no slot" faults doesn't
      actually prevent exposing KVM's internal memslots to L2, it simply forces
      KVM to emulate the access.  In most cases, that will trigger MMIO,
      amusingly due to filling vcpu->arch.mmio_gfn, but also because
      vcpu_is_mmio_gpa() unconditionally treats APIC accesses as MMIO, i.e. APIC
      accesses are ok.  But the hidden TSS and identity mapped page tables could
      go either way (MMIO or access the private memslot's backing memory).
      
      Alternatively, the inconsistent emulator behavior could be addressed by
      forcing MMIO emulation for L2 access to all internal memslots, not just to
      the APIC.  But that's arguably less correct than letting L2 access the
      hidden TSS and identity mapped page tables, not to mention that it's
      *extremely* unlikely anyone cares what KVM does in this case.  From L1's
      perspective there is R/W memory at those memslots, the memory just happens
      to be initialized with non-zero data.  Making the memory disappear when it
      is accessed by L2 is far more magical and arbitrary than the memory
      existing in the first place.
      
      The APIC access page is special because KVM _must_ emulate the access to
      do the right thing (emulate an APIC access instead of reading/writing the
      APIC access page).  And despite what commit 3a2936de ("kvm: mmu: Don't
      expose private memslots to L2") said, it's not just necessary when L1 is
      accelerating L2's virtual APIC, it's just as important (likely *more*
      important) for correctness when L1 is passing through its own APIC to L2.
      
      Fixes: 3a2936de ("kvm: mmu: Don't expose private memslots to L2")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move private vs. shared check above slot validity checks · 44f42ef3
      Sean Christopherson authored
      Prioritize private vs. shared gfn attribute checks above slot validity
      checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
      userspace if there is no memslot, but emulate accesses to the APIC access
      page even if the attributes mismatch.
      
      Fixes: 8dd2eee9 ("KVM: x86/mmu: Handle page fault for private memory")
      Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
      Cc: Chao Peng <chao.p.peng@linux.intel.com>
      Cc: Fuad Tabba <tabba@google.com>
      Cc: Michael Roth <michael.roth@amd.com>
      Cc: Isaku Yamahata <isaku.yamahata@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: WARN and skip MMIO cache on private, reserved page faults · 07702e5a
      Sean Christopherson authored
      WARN and skip the emulated MMIO fastpath if a private, reserved page fault
      is encountered, as private+reserved should be an impossible combination
      (KVM should never create an MMIO SPTE for a private access).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240228024147.41573-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: check for invalid async page faults involving private memory · cd389f50
      Paolo Bonzini authored
      Right now the error code is not used when an async page fault is completed.
      This is not a problem in the current code, but it is untidy.  For protected
      VMs, we will also need to check that the page attributes match the current
      state of the page, because asynchronous page faults can only occur on
      shared pages (private pages go through kvm_faultin_pfn_private() instead of
      __gfn_to_pfn_memslot()).
      
      Start by piping the error code from kvm_arch_setup_async_pf() to
      kvm_arch_async_page_ready() via the architecture-specific async page
      fault data.  For now, it can be used to assert that there are no
      async page faults on private memory.
      
      Extracted from a patch by Isaku Yamahata.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use synthetic page fault error code to indicate private faults · b3d5dc62
      Sean Christopherson authored
      Add and use a synthetic, KVM-defined page fault error code to indicate
      whether a fault is to private vs. shared memory.  TDX and SNP have
      different mechanisms for reporting private vs. shared, and KVM's
      software-protected VMs have no mechanism at all.  Usurp an error code
      flag to avoid having to plumb another parameter to kvm_mmu_page_fault()
      and friends.
      
      Alternatively, KVM could borrow AMD's PFERR_GUEST_ENC_MASK, i.e. set it
      for TDX and software-protected VMs as appropriate, but that would require
      *clearing* the flag for SEV and SEV-ES VMs, which support encrypted
      memory at the hardware layer, but don't utilize private memory at the
      KVM layer.
      
      Opportunistically add a comment to call out that the logic for software-
      protected VMs is (and was before this commit) broken for nested MMUs, i.e.
      for nested TDP, as the GPA is an L2 GPA.  Punt on trying to play nice with
      nested MMUs as there is a _lot_ of functionality that simply doesn't work
      for software-protected VMs, e.g. all of the paths where KVM accesses guest
      memory need to be updated to be aware of private vs. shared memory.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20240228024147.41573-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
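A synthetic "private" flag that is derived from vendor-specific information, rather than borrowed from a hardware bit, could be modelled roughly as follows. This is a sketch under assumptions: bit 34 for the guest-encrypted flag follows the APM excerpt quoted later in this log, while the position of the synthetic bit, the VM-type enum, and every `toy_` name are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hardware-reported on AMD: bit 34, per the APM excerpt quoted in this log. */
#define TOY_PFERR_GUEST_ENC		(1ULL << 34)
/* KVM-defined synthetic flag; the bit position here is hypothetical. */
#define TOY_PFERR_PRIVATE_ACCESS	(1ULL << 62)

enum toy_vm_type {
	TOY_VM_SEV,	/* encrypted in hardware, but no private memory in KVM */
	TOY_VM_SNP,	/* uses private memory in KVM */
};

/* Translate a hardware #NPF error code into the code the MMU consumes. */
static uint64_t toy_npf_error_code(enum toy_vm_type type, uint64_t hw_error_code)
{
	uint64_t error_code = hw_error_code;

	/* Hardware must never set a bit that collides with the synthetic flag. */
	assert(!(hw_error_code & TOY_PFERR_PRIVATE_ACCESS));

	if (type == TOY_VM_SNP && (hw_error_code & TOY_PFERR_GUEST_ENC))
		error_code |= TOY_PFERR_PRIVATE_ACCESS;

	return error_code;
}
```

Keeping the synthetic flag separate means an SEV guest's encrypted access never has to be "un-flagged": the hardware bit passes through untouched and the private flag simply stays clear.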
    • KVM: x86/mmu: WARN if upper 32 bits of legacy #PF error code are non-zero · 7bdbb820
      Sean Christopherson authored
      WARN if bits 63:32 are non-zero when handling an intercepted legacy #PF,
      as the error code for #PF is limited to 32 bits (and in practice, 16 bits
      on Intel CPUs).  This behavior is architectural, is part of KVM's ABI
      (see kvm_vcpu_events.error_code), and is explicitly documented as being
      preserved for intercepted #PF in both the APM:
      
        The error code saved in EXITINFO1 is the same as would be pushed onto
        the stack by a non-intercepted #PF exception in protected mode.
      
      and even more explicitly in the SDM as VMCS.VM_EXIT_INTR_ERROR_CODE is a
      32-bit field.
      
      Simply drop the upper bits if hardware provides garbage, as spurious
      information should do no harm (though in all likelihood hardware is buggy
      and the kernel is doomed).
      
      Handling all upper 32 bits in the #PF path will allow moving the sanity
      check on synthetic checks from kvm_mmu_page_fault() to npf_interception(),
      which in turn will allow deriving PFERR_PRIVATE_ACCESS from AMD's
      PFERR_GUEST_ENC_MASK without running afoul of the sanity check.
      
      Note, this is also why Intel uses bit 15 for SGX (highest bit on Intel CPUs)
      and AMD uses bit 31 for RMP (highest bit on AMD CPUs); using the highest
      bit minimizes the probability of a collision with the "other" vendor,
      without needing to plumb more bits through microcode.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Message-ID: <20240228024147.41573-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
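The "warn, then drop the upper bits" behaviour described above reduces to a check and a mask. A minimal sketch, assuming a hypothetical helper name and using a boolean out-parameter in place of the kernel's WARN machinery:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sanitize the error code of an intercepted legacy #PF.
 * *warned stands in for a kernel WARN: it is set when bits 63:32 are non-zero.
 */
static uint64_t toy_legacy_pf_error_code(uint64_t error_code, bool *warned)
{
	assert(warned != NULL);

	*warned = (error_code >> 32) != 0;

	/* Keep only the architectural 32 bits; anything above is garbage. */
	return error_code & 0xffffffffULL;
}
```

For example, an input of `0x400000015` would set `*warned` and return `0x15`, while a plain `0x15` passes through unchanged without a warning.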
    • KVM: x86/mmu: Pass full 64-bit error code when handling page faults · c9710130
      Isaku Yamahata authored
      Plumb the full 64-bit error code throughout the page fault handling code
      so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
      will be used to determine whether or not a fault is private vs. shared.
      
      Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
      the behavior of permission_fault() when invoked in the page fault path, as
      KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().
      
      Continue passing '0' from the async #PF worker, as guest_memfd and thus
      private memory doesn't support async page faults.
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      [mdr: drop references/changes on rebase, update commit message]
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Message-ID: <20240228024147.41573-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler · dee281e4
      Sean Christopherson authored
      Move the sanity check that hardware never sets bits that collide with
      KVM-defined synthetic bits from kvm_mmu_page_fault() to npf_interception(),
      i.e. make the sanity check #NPF specific.  The legacy #PF path already
      WARNs if _any_ of bits 63:32 are set, and the error code that comes from
      VMX's EPT Violation and Misconfig is 100% synthesized (KVM morphs VMX's
      EXIT_QUALIFICATION into error code flags).
      
      Add a compile-time assert in the legacy #PF handler to make sure that
      KVM-defined flags are covered by its existing sanity check on the upper bits.
      
      Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we
      are removing the comment that defined it.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
      Message-ID: <20240228024147.41573-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Define more SEV+ page fault error bits/flags for #NPF · 9b62e03e
      Sean Christopherson authored
      Define more #NPF error code flags that are relevant to SEV+ (mostly SNP)
      guests, as specified by the APM:
      
       * Bit 31 (RMP):   Set to 1 if the fault was caused due to an RMP check or a
                         VMPL check failure, 0 otherwise.
       * Bit 34 (ENC):   Set to 1 if the guest’s effective C-bit was 1, 0 otherwise.
       * Bit 35 (SIZEM): Set to 1 if the fault was caused by a size mismatch between
                         PVALIDATE or RMPADJUST and the RMP, 0 otherwise.
       * Bit 36 (VMPL):  Set to 1 if the fault was caused by a VMPL permission
                         check failure, 0 otherwise.
      
      Note, the APM is *extremely* misleading, and strongly implies that the
      above flags can _only_ be set for #NPF exits from SNP guests.  That is a
      lie, as bit 34 (C-bit=1, i.e. was encrypted) can be set when running _any_
      flavor of SEV guest on SNP capable hardware.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240228024147.41573-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
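The four bit positions listed in the commit message above translate directly into masks. The sketch below uses those positions as given; the `TOY_` macro names, the struct, and the decoder are hypothetical and are not the kernel's PFERR_* definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BIT_ULL(n)		(1ULL << (n))

/* Bit numbers as listed in the commit message; names are illustrative. */
#define TOY_PFERR_GUEST_RMP	TOY_BIT_ULL(31)
#define TOY_PFERR_GUEST_ENC	TOY_BIT_ULL(34)
#define TOY_PFERR_GUEST_SIZEM	TOY_BIT_ULL(35)
#define TOY_PFERR_GUEST_VMPL	TOY_BIT_ULL(36)

struct toy_npf_info {
	bool rmp;	/* RMP or VMPL check failure */
	bool enc;	/* guest's effective C-bit was 1 */
	bool sizem;	/* size mismatch with the RMP */
	bool vmpl;	/* VMPL permission check failure */
};

/* Decode the SEV-related flags of a #NPF error code. */
static void toy_decode_npf(uint64_t error_code, struct toy_npf_info *info)
{
	assert(info != NULL);

	info->rmp = (error_code & TOY_PFERR_GUEST_RMP) != 0;
	info->enc = (error_code & TOY_PFERR_GUEST_ENC) != 0;
	info->sizem = (error_code & TOY_PFERR_GUEST_SIZEM) != 0;
	info->vmpl = (error_code & TOY_PFERR_GUEST_VMPL) != 0;
}
```

Three of the four flags live above bit 31, which is why the series first plumbs the full 64-bit error code through the page fault path.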
    • KVM: x86: Remove separate "bit" defines for page fault error code masks · 63b6206e
      Sean Christopherson authored
      Open code the bit number directly in the PFERR_* masks and drop the
      intermediate PFERR_*_BIT defines, as having to bounce through two macros
      just to see which flag corresponds to which bit is quite annoying, as is
      having to define two macros just to add recognition of a new flag.
      
      Use ternary operator to derive the bit in permission_fault(), the one
      function that actually needs the bit number as part of clever shifting
      to avoid conditional branches.  Generally the compiler is able to turn
      it into a conditional move, and if not it's not really a big deal.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240228024147.41573-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
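The trade-off mentioned above (shifting a boolean by a bit number versus selecting a mask with a ternary) can be shown generically. This is not permission_fault() itself: the flag position, the final shift by one, and the `toy_` names are assumptions chosen to make the comparison concrete.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the mask is defined; there is no separate *_BIT macro. */
#define TOY_FLAG_MASK	(1u << 3)

/* Branch-free style that needs the bit number of the flag. */
static unsigned int toy_index_shift(unsigned int bits, bool cond,
				    unsigned int flag_bit)
{
	assert(flag_bit < 32);
	return (bits | ((unsigned int)cond << flag_bit)) >> 1;
}

/* Ternary style that needs only the mask; compilers can often emit a cmov. */
static unsigned int toy_index_ternary(unsigned int bits, bool cond)
{
	return (bits | (cond ? TOY_FLAG_MASK : 0u)) >> 1;
}
```

Both helpers compute the same index for every input when `flag_bit` is 3, so dropping the bit-number macro costs nothing in behaviour.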
    • KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation · d0bf8e6e
      Sean Christopherson authored
      Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
      triggers emulation of any kind, as KVM doesn't currently support emulating
      access to guest private memory.  Practically speaking, private faults and
      emulation are already mutually exclusive, but there are many flows that
      can result in KVM returning RET_PF_EMULATE, and adding one last check
      to harden against weird, unexpected combinations and/or KVM bugs is
      inexpensive.
      Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-ID: <20240228024147.41573-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
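The "one last check" described above amounts to a filter on the fault handler's return value. A minimal sketch, assuming hypothetical `TOY_RET_PF_*` values and helper name (the kernel's RET_PF_* enum has more members and different numbering):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-ins for the page fault return codes. */
enum toy_ret_pf {
	TOY_RET_PF_RETRY = 1,
	TOY_RET_PF_EMULATE,
	TOY_RET_PF_FIXED,
};

/*
 * Final filter on a fault result: emulating an access to private memory
 * is not supported, so turn that combination into -EFAULT.
 */
static int toy_finish_page_fault(int ret, bool is_private)
{
	assert(ret < 0 || (ret >= TOY_RET_PF_RETRY && ret <= TOY_RET_PF_FIXED));

	if (ret == TOY_RET_PF_EMULATE && is_private)
		return -EFAULT;

	return ret;
}
```

Placing the check at the single exit point covers every flow that can produce an emulate result, without having to audit each one.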
  3. 19 Apr, 2024 1 commit
  4. 12 Apr, 2024 7 commits
    • Paolo Bonzini's avatar
      1ab157ce
    • KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument · 2325a21a
      Sean Christopherson authored
      TDX uses a different ABI to get information about a VM exit.  Pass intr_info to
      the NMI and INTR handlers instead of pulling it from vcpu_vmx in
      preparation for sharing the bulk of the handlers with TDX.
      
      When the guest TD exits to the VMM, RAX holds the status and exit reason, RCX
      holds the exit qualification, etc., rather than the VMCS fields, because the
      VMM doesn't have access to the VMCS.  The eventual code will be:
      
      VMX:
        - get exit reason, intr_info, exit_qualification, etc. from the VMCS
        - call NMI/INTR handlers (common code)
      
      TDX:
        - get exit reason, intr_info, exit_qualification, etc. from guest
          registers
        - call NMI/INTR handlers (common code)
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX · 5f18c642
      Paolo Bonzini authored
      KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
      to operate on VM.  TDX doesn't allow VMM to operate VMCS directly.
      Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
      indirectly operate those data structures.  This means we must have a TDX
      version of kvm_x86_ops.
      
      The existing global struct kvm_x86_ops already defines an interface which
      can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
      structure.  To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
      will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
      TDX at run time.
      
      To split the runtime switch, the VMX implementation, and the TDX
      implementation, add main.c, and move out the vmx_x86_ops hooks in
      preparation for adding TDX.  Use 'vt' for the naming scheme as a nod to
      VT-x and as a concatenation of VmxTdx.
      
      The eventually converted code will look like this:
      
      vmx.c:
        vmx_op() { ... }
        VMX initialization
      tdx.c:
        tdx_op() { ... }
        TDX initialization
      x86_ops.h:
        vmx_op();
        tdx_op();
      main.c:
        static vt_op() { if (tdx) tdx_op() else vmx_op() }
        static struct kvm_x86_ops vt_x86_ops = {
              .op = vt_op,
        };
        initialization functions call both VMX and TDX initialization
      
      Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
      vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().
      Co-developed-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarIsaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: default avatarBinbin Wu <binbin.wu@linux.intel.com>
      Reviewed-by: default avatarXiaoyao Li <xiaoyao.li@intel.com>
      Reviewed-by: default avatarYuan Yao <yuan.yao@intel.com>
      Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5f18c642
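      The run-time switch described in the commit above can be illustrated
      with a minimal, self-contained sketch. Everything in it is an
      assumption made for illustration: struct vm, its is_td flag, the
      one-callback struct x86_ops and the return values are hypothetical
      stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-VM state; the real struct kvm carries far more. */
struct vm {
	bool is_td;	/* true for a TDX guest (TD), false for a plain VMX guest */
};

/* Stand-ins for the implementations that would live in vmx.c and tdx.c. */
static int vmx_op(struct vm *vm) { (void)vm; return 1; }
static int tdx_op(struct vm *vm) { (void)vm; return 2; }

/* main.c: one system-wide entry point that dispatches per VM. */
static int vt_op(struct vm *vm)
{
	return vm->is_td ? tdx_op(vm) : vmx_op(vm);
}

/* Hypothetical ops table with a single callback, mirroring ".op = vt_op". */
struct x86_ops {
	int (*op)(struct vm *vm);
};

static const struct x86_ops vt_x86_ops = {
	.op = vt_op,
};

/* Self-check: the same global table serves both kinds of guest. */
static void vt_dispatch_selftest(void)
{
	struct vm plain = { .is_td = false };
	struct vm td = { .is_td = true };

	assert(vt_x86_ops.op(&plain) == 1);
	assert(vt_x86_ops.op(&td) == 2);
}
```

      The point of the wrapper is that the ops table stays global while the
      choice between the two back ends is made per VM on every call.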
    • Sean Christopherson's avatar
      KVM: x86: Split core of hypercall emulation to helper function · e913ef15
      Sean Christopherson authored
      By necessity, TDX will use a different register ABI for hypercalls.
      Break out the core functionality so that it may be reused for TDX.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e913ef15
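      As a rough sketch of what splitting out the core buys, the helper
      below takes the hypercall number and arguments as plain values, and
      two thin wrappers fetch them from different registers. The register
      assignments, the hypercall number and the error value are made up for
      illustration and are not claimed to match either real ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register file; field names are illustrative only. */
struct regs {
	uint64_t rax, rbx, rcx, r10, r11, r12;
};

/* Core emulation: knows nothing about which registers carried the values. */
static uint64_t hypercall_core(uint64_t nr, uint64_t a0, uint64_t a1)
{
	switch (nr) {
	case 1:		/* made-up hypercall number: add the two arguments */
		return a0 + a1;
	default:
		return (uint64_t)-1;	/* made-up "unknown hypercall" code */
	}
}

/* Wrapper for one register ABI: number in rax, arguments in rbx/rcx. */
static void hypercall_abi_a(struct regs *r)
{
	r->rax = hypercall_core(r->rax, r->rbx, r->rcx);
}

/* Wrapper for a second, different ABI: number in r10, arguments in r11/r12. */
static void hypercall_abi_b(struct regs *r)
{
	r->r10 = hypercall_core(r->r10, r->r11, r->r12);
}

/* Self-check: both ABIs reach the same core logic. */
static void hypercall_selftest(void)
{
	struct regs a = { .rax = 1, .rbx = 2, .rcx = 3 };
	struct regs b = { .r10 = 1, .r11 = 40, .r12 = 2 };

	hypercall_abi_a(&a);
	hypercall_abi_b(&b);
	assert(a.rax == 5);
	assert(b.r10 == 42);
}
```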
    • Paolo Bonzini's avatar
      Merge branch 'kvm-sev-init2' into HEAD · f9cecb3c
      Paolo Bonzini authored
      The idea that no parameter would ever be necessary when enabling SEV or
      SEV-ES for a VM was decidedly optimistic.  The first source of variability
      that was encountered is the desired set of VMSA features, as that affects
      the measurement of the VM's initial state and cannot be changed
      arbitrarily by the hypervisor.
      
      This series adds all the APIs that are needed to customize the features,
      with room for future enhancements:
      
      - a new /dev/kvm device attribute to retrieve the set of supported
        features (right now, only debug swap)
      
      - a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
        replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT
      
      It then puts the new op to work by including the VMSA features as a field
      of the struct.  The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full
      set of supported VMSA features for backwards compatibility; but I am
      considering also making them use zero as the feature mask, and will gladly
      adjust the patches if so requested.
      
      In order to avoid creating *two* new KVM_MEMORY_ENCRYPT_OP sub-operations,
      I decided that I could as well make SEV and SEV-ES use VM types.  This
      allows SEV-SNP to reuse the KVM_SEV_INIT2 ioctl.
      
      And while at it, KVM_SEV_INIT2 also includes two bugfixes.  First of all,
      SEV-ES VMs, when created with the new VM type instead of KVM_SEV_ES_INIT,
      reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor
      once the VMSA has been encrypted...  which is how the API should have
      always behaved.  Second, they also synchronize the FPU and AVX state.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f9cecb3c
    • Paolo Bonzini's avatar
      Merge branch 'mm-delete-change-gpte' into HEAD · 531f5200
      Paolo Bonzini authored
      The .change_pte() MMU notifier callback was intended as an optimization
      and for this reason it was initially called without a surrounding
      mmu_notifier_invalidate_range_{start,end}() pair.  It was only ever
      implemented by KVM (which was also the original user of MMU notifiers)
      and the rules on when to call set_pte_at_notify() rather than set_pte_at()
      have always been pretty obscure.
      
      It may seem a miracle that it has never caused any hard-to-trigger
      bugs, but there's a good reason for that: KVM's implementation has
      been nonfunctional for a good part of its existence.  Already in
      2012, commit 6bdb913f ("mm: wrap calls to set_pte_at_notify with
      invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
      .change_pte() callback to occur within an invalidate_range_start/end()
      pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
      .change_pte() has no hope of finding a sPTE to change.
      
      Therefore, all the code for .change_pte() can be removed from both KVM
      and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      531f5200
    • Paolo Bonzini's avatar
      mm: replace set_pte_at_notify() with just set_pte_at() · f7842747
      Paolo Bonzini authored
      With the demise of the .change_pte() MMU notifier callback, there is no
      notification happening in set_pte_at_notify().  It is a synonym of
      set_pte_at() and can be replaced with it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7842747
  5. 11 Apr, 2024 10 commits
    • Paolo Bonzini's avatar
      mmu_notifier: remove the .change_pte() callback · 997308f9
      Paolo Bonzini authored
      The scope of set_pte_at_notify() has reduced more and more through the
      years.  Initially, it was meant for when the change to the PTE was
      not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
      that has not been so for over ten years.  During all this period
      the only implementation of .change_pte() was KVM and it
      had no actual functionality, because it was called after
      mmu_notifier_invalidate_range_start() zapped the secondary PTE.
      
      Now that this (nonfunctional) user of the .change_pte() callback is
      gone, the whole callback can be removed.  For now, leave in place
      set_pte_at_notify() even though it is just a synonym for set_pte_at().
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Message-ID: <20240405115815.3226315-4-pbonzini@redhat.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      997308f9
    • Paolo Bonzini's avatar
      KVM: remove unused argument of kvm_handle_hva_range() · 5257de95
      Paolo Bonzini authored
      The only user was kvm_mmu_notifier_change_pte(), which is now gone.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Message-ID: <20240405115815.3226315-3-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5257de95
    • Paolo Bonzini's avatar
      KVM: delete .change_pte MMU notifier callback · f3b65bba
      Paolo Bonzini authored
      The .change_pte() MMU notifier callback was intended as an
      optimization. The original point of it was that KSM could tell KVM to flip
      its secondary PTE to a new location without having to first zap it. At
      the time there was also an .invalidate_page() callback; both of them were
      *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
      and .invalidate_page() also doubled as a fallback implementation of
      .change_pte().
      
      Later on, however, both callbacks were changed to occur within an
      invalidate_range_start/end() block.
      
      In the case of .change_pte(), commit 6bdb913f ("mm: wrap calls to
      set_pte_at_notify with invalidate_range_start and invalidate_range_end",
      2012-10-09) did so to remove the fallback from .invalidate_page() to
      .change_pte() and allow sleepable .invalidate_page() hooks.
      
      This however made KVM's usage of the .change_pte() callback completely
      moot, because KVM unmaps the sPTEs during .invalidate_range_start()
      and therefore .change_pte() has no hope of finding a sPTE to change.
      Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
      well as all the architecture specific implementations.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Anup Patel <anup@brainfault.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Reviewed-by: Bibo Mao <maobibo@loongson.cn>
      Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f3b65bba
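      The reason .change_pte() became a no-op for KVM can be shown with a
      toy model. This is a sketch under stated assumptions, not kernel code:
      struct toy_mmu and the toy_* functions are hypothetical and model a
      single secondary PTE (sPTE) with nothing else around it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one secondary PTE (sPTE) that is either present or zapped. */
struct toy_mmu {
	bool spte_present;
	unsigned long spte_pfn;
};

/* Models .invalidate_range_start(): KVM unmaps the sPTE. */
static void toy_invalidate_range_start(struct toy_mmu *m)
{
	m->spte_present = false;
}

/*
 * Models .change_pte(): it can only retarget an sPTE that still exists.
 * Returns true if it actually changed something.
 */
static bool toy_change_pte(struct toy_mmu *m, unsigned long new_pfn)
{
	if (!m->spte_present)
		return false;
	m->spte_pfn = new_pfn;
	return true;
}

static void toy_selftest(void)
{
	struct toy_mmu m = { .spte_present = true, .spte_pfn = 100 };

	/* Original design: called on its own, the callback does real work. */
	assert(toy_change_pte(&m, 200));
	assert(m.spte_pfn == 200);

	/* After the 2012 change: always preceded by invalidate_range_start(). */
	toy_invalidate_range_start(&m);
	assert(!toy_change_pte(&m, 300));
	assert(m.spte_pfn == 200);	/* unchanged; the sPTE is gone */
}
```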
    • Paolo Bonzini's avatar
      selftests: kvm: add test for transferring FPU state into VMSA · 8c53183d
      Paolo Bonzini authored
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240404121327.3107131-18-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8c53183d
    • Paolo Bonzini's avatar
      selftests: kvm: split "launch" phase of SEV VM creation · 4c180a57
      Paolo Bonzini authored
      Allow the caller to set the initial state of the VM.  Doing this
      before sev_vm_launch() matters for SEV-ES, since that is the
      place where the VMSA is updated and after which the guest state
      becomes sealed.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240404121327.3107131-17-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4c180a57
    • Paolo Bonzini's avatar
      selftests: kvm: switch to using KVM_X86_*_VM · d18c8648
      Paolo Bonzini authored
      This removes the concept of "subtypes", instead letting the tests use the
      proper VM types that were recently added.  While sev_init_vm() and sev_es_init_vm()
      are still able to operate with the legacy KVM_SEV_INIT and KVM_SEV_ES_INIT
      ioctls, this is limited to VMs that are created manually with
      vm_create_barebones().
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240404121327.3107131-16-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d18c8648
    • Paolo Bonzini's avatar
      selftests: kvm: add tests for KVM_SEV_INIT2 · dfc083a1
      Paolo Bonzini authored
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240404121327.3107131-15-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dfc083a1
    • Paolo Bonzini's avatar
      KVM: SEV: allow SEV-ES DebugSwap again · 4dd5ecac
      Paolo Bonzini authored
      The DebugSwap feature of SEV-ES provides a way for confidential guests
      to use data breakpoints.  Its status is recorded in the VMSA, and therefore
      attestation signatures depend on whether it is enabled or not.  In order
      to avoid invalidating the signatures depending on the host machine, it
      was disabled by default (see commit 5abf6dce, "SEV: disable SEV-ES
      DebugSwap by default", 2024-03-09).
      
      However, we now have a new API to create SEV VMs that allows enabling
      DebugSwap based on what the user tells KVM to do, and we also changed the
      legacy KVM_SEV_ES_INIT API to never enable DebugSwap.  It is therefore
      possible to re-enable the feature without breaking compatibility with
      kernels that pre-date the introduction of DebugSwap, so go ahead.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240404121327.3107131-14-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4dd5ecac
    • Paolo Bonzini's avatar
      KVM: SEV: introduce KVM_SEV_INIT2 operation · 4f5defae
      Paolo Bonzini authored
      The idea that no parameter would ever be necessary when enabling SEV or
      SEV-ES for a VM was decidedly optimistic.  In fact, in some sense it's
      already a parameter whether SEV or SEV-ES is desired.  Another possible
      source of variability is the desired set of VMSA features, as that affects
      the measurement of the VM's initial state and cannot be changed
      arbitrarily by the hypervisor.
      
      Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
      and put the new op to work by including the VMSA features as a field of the
      struct.  The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
      supported VMSA features for backwards compatibility.
      
      The struct also includes the usual bells and whistles for future
      extensibility: a flags field that must be zero for now, and some padding
      at the end.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4f5defae
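      A struct-taking init operation of the kind described above typically
      validates its input along these lines. The sketch is hypothetical:
      struct toy_sev_init, its field layout, the feature bit and
      toy_sev_init2() are assumptions for illustration and are not the
      kernel's uAPI definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical parameter struct; layout and names are illustrative. */
struct toy_sev_init {
	uint64_t vmsa_features;	/* features the caller wants enabled */
	uint32_t flags;		/* must be zero for now */
	uint32_t pad[9];	/* reserved for future extensions */
};

#define TOY_FEATURE_DEBUG_SWAP	(1ULL << 5)	/* made-up bit position */

/*
 * Reject anything this version does not understand, so that the flags
 * field and unknown feature bits remain free for later extensions.
 */
static int toy_sev_init2(const struct toy_sev_init *p, uint64_t supported)
{
	if (p->flags)
		return -EINVAL;
	if (p->vmsa_features & ~supported)
		return -EINVAL;
	return 0;
}

static void toy_sev_selftest(void)
{
	struct toy_sev_init ok = { .vmsa_features = TOY_FEATURE_DEBUG_SWAP };
	struct toy_sev_init bad_flags = { .flags = 1 };
	struct toy_sev_init bad_feat = { .vmsa_features = 1ULL << 63 };

	assert(toy_sev_init2(&ok, TOY_FEATURE_DEBUG_SWAP) == 0);
	assert(toy_sev_init2(&bad_flags, TOY_FEATURE_DEBUG_SWAP) == -EINVAL);
	assert(toy_sev_init2(&bad_feat, TOY_FEATURE_DEBUG_SWAP) == -EINVAL);
}
```

      Because the feature mask is supplied by the caller rather than derived
      from the host, the measured initial state no longer depends on which
      machine the VM happens to be launched on.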
    • Paolo Bonzini's avatar
      KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time · eb444186
      Paolo Bonzini authored
      SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA.
      Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark
      FPU contents as confidential after it has been copied and encrypted into
      the VMSA.
      
      Since the XSAVE state for AVX is the first, it does not need the
      compacted-state handling of get_xsave_addr().  However, there are other
      parts of XSAVE state in the VMSA that currently are not handled, and
      the validation logic of get_xsave_addr() is pointless to duplicate
      in KVM, so move get_xsave_addr() to public FPU API; it is really just
      a facility to operate on XSAVE state and does not expose any internal
      details of arch/x86/kernel/fpu.
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-ID: <20240404121327.3107131-12-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eb444186
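      The remark that AVX state needs no compacted-state handling can be
      checked with a small offset calculation. This is a simplified sketch:
      the toy_layout table is an assumption (real offsets and sizes are
      enumerated by CPUID, and the PKRU entry in particular is only a
      typical value), and the optional 64-byte alignment of components in
      the compacted format is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSAVE_LEGACY_AREA	512	/* x87/SSE region */
#define XSAVE_HEADER		64
#define XSAVE_EXT_BASE		(XSAVE_LEGACY_AREA + XSAVE_HEADER)	/* 576 */

/* Per-component layout; on real hardware this comes from CPUID. */
struct toy_xcomp {
	size_t std_offset;	/* offset in the standard (non-compacted) format */
	size_t size;
};

/* Assumed table: only index 2 (AVX) and index 9 (PKRU) are filled in. */
static const struct toy_xcomp toy_layout[16] = {
	[2] = { .std_offset = 576,  .size = 256 },
	[9] = { .std_offset = 2688, .size = 8 },
};

/*
 * Offset of component nr (2..15) in the compacted format: enabled
 * components are packed back to back after the header.
 */
static size_t toy_compacted_offset(uint64_t enabled, unsigned int nr)
{
	size_t off = XSAVE_EXT_BASE;
	unsigned int i;

	for (i = 2; i < nr; i++)
		if (enabled & (1ULL << i))
			off += toy_layout[i].size;
	return off;
}

static void toy_xsave_selftest(void)
{
	uint64_t enabled = (1ULL << 2) | (1ULL << 9);

	/* AVX is the first extended component: same offset in both formats. */
	assert(toy_compacted_offset(enabled, 2) == toy_layout[2].std_offset);

	/* A later component moves once the gaps are squeezed out. */
	assert(toy_compacted_offset(enabled, 9) == 832);
	assert(toy_compacted_offset(enabled, 9) != toy_layout[9].std_offset);
}
```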