1. 07 Dec, 2023 12 commits
  2. 29 Nov, 2023 1 commit
  3. 21 Nov, 2023 1 commit
  4. 14 Nov, 2023 22 commits
    • Merge branch 'kvm-guestmemfd' into HEAD · 6c370dc6
      Paolo Bonzini authored
      Introduce several new KVM uAPIs to ultimately create a guest-first memory
      subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM
      to provide features, enhancements, and optimizations that are kludgy
      or outright impossible to implement in a generic memory subsystem.
      
      The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which,
      similar to the generic memfd_create(), creates an anonymous file and
      returns a file descriptor that refers to it.  Again like "regular"
      memfd files, guest_memfd files live in RAM, have volatile storage,
      and are automatically released when the last reference is dropped.
      The key difference from memfd files (and every other memory subsystem)
      is that guest_memfd files are bound to their owning virtual machine,
      cannot be mapped, read, or written by userspace, and cannot be resized.
      guest_memfd files do however support PUNCH_HOLE, which can be used to
      convert a guest memory area between the shared and guest-private states.
      
      A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
      specify attributes for a given page of guest memory.  In the long term,
      it will likely be extended to allow userspace to specify per-gfn RWX
      protections, including allowing memory to be writable in the guest
      without it also being writable in host userspace.
      
      The immediate and driving use case for guest_memfd is Confidential
      (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
      For such use cases, being able to map memory into KVM guests without
      requiring said memory to be mapped into the host is a hard requirement.
      While SEV+ and TDX prevent untrusted software from reading guest private
      data by encrypting guest memory, pKVM provides confidentiality and
      integrity *without* relying on memory encryption.  In addition, with
      SEV-SNP and especially TDX, accessing guest private memory can be fatal
      to the host, i.e. KVM must prevent host userspace from accessing
      guest memory irrespective of hardware behavior.
      
      Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
      for example hardening userspace against unintentional accesses to guest
      memory.  As mentioned earlier, KVM's ABI uses userspace VMA protections to
      define the allowed guest protections (with an exception granted to mapping
      guest memory executable), and similarly KVM currently requires the guest
      mapping size to be a strict subset of the host userspace mapping size.
      Decoupling the mapping sizes would allow userspace to precisely map
      only what is needed and with the required permissions, without impacting
      guest performance.
      
      A guest-first memory subsystem also provides clearer line of sight to
      things like a dedicated memory pool (for slice-of-hardware VMs) and
      elimination of "struct page" (for offload setups where userspace _never_
      needs to DMA from or into guest memory).
      
      guest_memfd is the result of 3+ years of development and exploration;
      taking on memory management responsibilities in KVM was not the first,
      second, or even third choice for supporting CoCo VMs.  But after many
      failed attempts to avoid KVM-specific backing memory, and looking at
      where things ended up, it is quite clear that of all approaches tried,
      guest_memfd is the simplest, most robust, and most extensible, and the
      right thing to do for KVM and the kernel at-large.
      
      The "development cycle" for this version is going to be very short;
      ideally, next week I will merge it as is in kvm/next, taking this through
      the KVM tree for 6.8 immediately after the end of the merge window.
      The series is still based on 6.6 (plus KVM changes for 6.7) so it
      will require a small fixup for changes to get_file_rcu() introduced in
      6.7 by commit 0ede61d8 ("file: convert to SLAB_TYPESAFE_BY_RCU").
      The fixup will be done as part of the merge commit, and most of the text
      above will become the commit message for the merge.
      
      Pending post-merge work includes:
      - hugepage support
      - looking into using the restrictedmem framework for guest memory
      - introducing a testing mechanism to poison memory, possibly using
        the same memory attributes introduced here
      - SNP and TDX support
      
      There are two non-KVM patches buried in the middle of this series:
      
        fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
        mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
      
      The first is small and mostly suggested-by Christian Brauner; the second
      a bit less so but it was written by an mm person (Vlastimil Babka).
    • KVM: selftests: Add a memory region subtest to validate invalid flags · 5d743164
      Sean Christopherson authored
      Add a subtest to set_memory_region_test to verify that KVM rejects invalid
      flags and combinations with -EINVAL.  KVM might or might not fail with
      EINVAL anyways, but we can at least try.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231031002049.3915752-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Test KVM exit behavior for private memory/access · e3577788
      Ackerley Tng authored
      "Testing private access when memslot gets deleted" tests the behavior
      of KVM when a private memslot gets deleted while the VM is using the
      private memslot. When KVM looks up the deleted (slot = NULL) memslot,
      KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.
      
      In the second test, upon a private access to non-private memslot, KVM
      should also exit to userspace with KVM_EXIT_MEMORY_FAULT.
      
      Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
      KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
      KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.
      Signed-off-by: Ackerley Tng <ackerleytng@google.com>
      [sean: call out the similarities with set_memory_region_test]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-36-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add basic selftest for guest_memfd() · 8a89efd4
      Chao Peng authored
      Add a selftest to verify the basic functionality of guest_memfd():
      
      + file descriptor created with the guest_memfd() ioctl does not allow
        read/write/mmap operations
      + file size and block size as returned from fstat are as expected
      + fallocate(FALLOC_FL_PUNCH_HOLE) on the fd requires offset/length to
        be page aligned
      + invalid inputs (misaligned size, invalid flags) are rejected
      + file size and inode are unique (the innocuous-sounding
        anon_inode_getfile() backs all files with a single inode...)
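
The alignment rule in the list above can be sketched as a small predicate. This is only an illustration of the check a test might mirror, not the selftest's or KVM's actual code; the helper name `sketch_punch_hole_range_is_aligned` and the explicit `page_size` parameter are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if both the offset and the length of a
 * PUNCH_HOLE request are multiples of page_size.
 */
static inline bool sketch_punch_hole_range_is_aligned(uint64_t offset,
						       uint64_t len,
						       uint64_t page_size)
{
	/* The mask trick below only works for a nonzero power of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* OR-ing the two values lets one mask test cover both. */
	return ((offset | len) & (page_size - 1)) == 0;
}
```

A caller would typically pass the value of sysconf(_SC_PAGESIZE) as `page_size` and expect a misaligned request to be rejected.
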
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Co-developed-by: Ackerley Tng <ackerleytng@google.com>
      Signed-off-by: Ackerley Tng <ackerleytng@google.com>
      Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-35-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Expand set_memory_region_test to validate guest_memfd() · 2feabb85
      Chao Peng authored
      Expand set_memory_region_test to exercise various positive and negative
      testcases for private memory.
      
       - Non-guest_memfd() file descriptor for private memory
       - guest_memfd() from different VM
       - Overlapping bindings
       - Unaligned bindings
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Co-developed-by: Ackerley Tng <ackerleytng@google.com>
      Signed-off-by: Ackerley Tng <ackerleytng@google.com>
      [sean: trim the testcases to remove duplicate coverage]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-34-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper · e6f4f345
      Chao Peng authored
      Add helpers to invoke KVM_SET_USER_MEMORY_REGION2 directly so that tests
      can validate features that are unique to "version 2" of "set user
      memory region", e.g. do negative testing on gmem_fd and gmem_offset.
      
      Provide a raw version as well as an assert-success version to reduce
      the amount of boilerplate code needed for basic usage.
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Ackerley Tng <ackerleytng@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-33-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add x86-only selftest for private memory conversions · 43f623f3
      Vishal Annapurve authored
      Add a selftest to exercise implicit/explicit conversion functionality
      within KVM and verify:
      
       - Shared memory is visible to host userspace
       - Private memory is not visible to host userspace
       - Host userspace and guest can communicate over shared memory
       - Data in shared backing is preserved across conversions (test's
         host userspace doesn't free the data)
       - Private memory is bound to the lifetime of the VM
      
      Ideally, KVM's selftests infrastructure would be reworked to allow backing
      a single region of guest memory with multiple memslots for _all_ backing
      types and shapes, i.e. ideally the code for using a single backing fd
      across multiple memslots would work for "regular" memory as well.  But
      sadly, support for KVM_CREATE_GUEST_MEMFD has languished for far too long,
      and overhauling selftests' memslots infrastructure would likely open a can
      of worms, i.e. delay things even further.
      
      In addition to the more obvious tests, verify that PUNCH_HOLE actually
      frees memory.  Directly verifying that KVM frees memory is impractical, if
      it's even possible, so instead indirectly verify memory is freed by
      asserting that the guest reads zeroes after a PUNCH_HOLE.  E.g. if KVM
      zaps SPTEs but doesn't actually punch a hole in the inode, the subsequent
      read will still see the previous value.  And obviously punching a hole
      shouldn't cause explosions.
      
      Let the user specify the number of memslots in the private mem conversion
      test, i.e. don't require the number of memslots to be '1' or "nr_vcpus".
      Creating more memslots than vCPUs is particularly interesting, e.g. it can
      result in a single KVM_SET_MEMORY_ATTRIBUTES spanning multiple memslots.
      To keep the math reasonable, align each vCPU's chunk to at least 2MiB (the
      size is 2MiB+4KiB), and require the total size to be cleanly divisible by
      the number of memslots.  The goal is to be able to validate that KVM plays
      nice with multiple memslots; being able to create a truly arbitrary number
      of memslots doesn't add meaningful value, i.e. isn't worth the cost.
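
The sizing arithmetic in the paragraph above can be sketched as follows. Only the numbers (a 2MiB+4KiB per-vCPU payload, 2MiB alignment, and divisibility by the memslot count) come from the commit message; the function names and the exact shape of the check are assumptions, and the real test may apply further constraints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SZ_4K (UINT64_C(4) << 10)
#define SKETCH_SZ_2M (UINT64_C(2) << 20)

/* Round x up to the next multiple of align (align must be a power of two). */
static inline uint64_t sketch_align_up(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Each vCPU needs 2MiB+4KiB, which rounds up to 4MiB at 2MiB alignment. */
static inline uint64_t sketch_per_vcpu_chunk_size(void)
{
	return sketch_align_up(SKETCH_SZ_2M + SKETCH_SZ_4K, SKETCH_SZ_2M);
}

/*
 * Hypothetical check: accept a memslot count only if the total size splits
 * into equally sized slots, and report that slot size to the caller.
 */
static inline bool sketch_memslot_layout_is_valid(uint32_t nr_vcpus,
						  uint32_t nr_memslots,
						  uint64_t *slot_size)
{
	uint64_t total = (uint64_t)nr_vcpus * sketch_per_vcpu_chunk_size();

	if (nr_memslots == 0 || total % nr_memslots != 0)
		return false;

	*slot_size = total / nr_memslots;
	return true;
}
```

With two vCPUs and four memslots, for example, the 8MiB total splits into 2MiB slots, so each vCPU's chunk spans two memslots.
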
      
      Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
      KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
      KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.
      Signed-off-by: Vishal Annapurve <vannapurve@google.com>
      Co-developed-by: Ackerley Tng <ackerleytng@google.com>
      Signed-off-by: Ackerley Tng <ackerleytng@google.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-32-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data · 242331df
      Sean Christopherson authored
      Add GUEST_SYNC[1-6]() so that tests can pass the maximum amount of
      information supported via ucall(), without needing to resort to shared
      memory.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-31-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type · 672eaa35
      Sean Christopherson authored
      Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
      along with the KVM-defined "type" for use when creating a new VM.  "mode"
      tracks physical and virtual address properties, as well as the preferred
      backing memory type, while "type" corresponds to the VM type.
      
      Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD
      without needing an entirely separate set of helpers.  At this time,
      guest_memfd is effectively usable only by confidential VM types in the
      form of guest private memory, and it's expected that x86 will double down
      and require unique VM types for TDX and SNP guests.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-30-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) · 01244fce
      Vishal Annapurve authored
      Add helpers for x86 guests to invoke the KVM_HC_MAP_GPA_RANGE hypercall,
      which KVM will forward to userspace and thus can be used by tests to
      coordinate private<=>shared conversions between host userspace code and
      guest code.
      Signed-off-by: Vishal Annapurve <vannapurve@google.com>
      [sean: drop shared/private helpers (let tests specify flags)]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-29-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add helpers to convert guest memory b/w private and shared · f7fa6749
      Vishal Annapurve authored
      Add helpers to convert memory between private and shared via KVM's
      memory attributes, as well as helpers to free/allocate guest_memfd memory
      via fallocate().  Userspace, i.e. tests, is NOT required to do fallocate()
      when converting memory, as the attributes are the single source of truth.
      Provide allocate() helpers so that tests can mimic a userspace that frees
      private memory on conversion, e.g. to prioritize memory usage over
      performance.
      Signed-off-by: Vishal Annapurve <vannapurve@google.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-28-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add support for creating private memslots · bb2968ad
      Sean Christopherson authored
      Add support for creating "private" memslots via KVM_CREATE_GUEST_MEMFD and
      KVM_SET_USER_MEMORY_REGION2.  Make vm_userspace_mem_region_add() a wrapper
      to its effective replacement, vm_mem_add(), so that private memslots are
      fully opt-in, i.e. don't require updating all tests that add memory regions.
      
      Pivot on the KVM_MEM_PRIVATE flag instead of the validity of the "gmem"
      file descriptor so that simple tests can let vm_mem_add() do the heavy
      lifting of creating the guest memfd, but also allow the caller to pass in
      an explicit fd+offset so that fancier tests can do things like back
      multiple memslots with a single file.  If the caller passes in a fd, dup()
      the fd so that (a) __vm_mem_region_delete() can close the fd associated
      with the memory region without needing yet another flag, and (b) so that
      the caller can safely close its copy of the fd without having to first
      destroy memslots.
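
The dup() ownership pattern described above can be sketched with plain POSIX calls. This is a minimal illustration under stated assumptions, not the selftests library code: `struct sketch_mem_region` and both helpers are hypothetical stand-ins for the region bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for a memory region that owns a backing fd. */
struct sketch_mem_region {
	int fd;
};

/*
 * Take a private duplicate of the caller's fd so that the region and the
 * caller can each close their own descriptor independently.
 * Returns 0 on success, -1 if dup() fails.
 */
static inline int sketch_region_init(struct sketch_mem_region *r, int caller_fd)
{
	assert(r != NULL);
	r->fd = dup(caller_fd);
	return r->fd < 0 ? -1 : 0;
}

/* Close the region's copy unconditionally; no "do I own this fd?" flag. */
static inline void sketch_region_destroy(struct sketch_mem_region *r)
{
	assert(r != NULL);
	if (r->fd >= 0)
		close(r->fd);
	r->fd = -1;
}
```

Because dup() returns a new descriptor that refers to the same open file description, closing the caller's copy leaves the region's copy usable, and vice versa.
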
      Co-developed-by: Ackerley Tng <ackerleytng@google.com>
      Signed-off-by: Ackerley Tng <ackerleytng@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-27-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 · 8d99e347
      Sean Christopherson authored
      Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
      support for guest private memory can be added without needing an entirely
      separate set of helpers.
      
      Note, this obviously makes selftests backwards-incompatible with older KVM
      versions from this point forward.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-26-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper · 335869c3
      Sean Christopherson authored
      Drop kvm_userspace_memory_region_find(), it's unused and a terrible API
      (probably why it's unused).  If anything outside of kvm_util.c needs to
      get at the memslot, userspace_mem_region_find() can be exposed to give
      others full access to all memory region/slot information.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-25-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add support for "protected VMs" that can utilize private memory · 89ea60c2
      Sean Christopherson authored
      Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
      and testing vehicle for Confidential (CoCo) VMs, and potentially to even
      become a "real" product in the distant future, e.g. a la pKVM.
      
      The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
      Intel's TDX, but those technologies are extremely complex (understatement),
      difficult to debug, don't support running as nested guests, and require
      hardware that isn't universally accessible.  I.e. relying on SEV-SNP or TDX
      for maintaining guest private memory isn't a realistic option.
      
      At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
      selftests for guest_memfd and private memory support without requiring
      unique hardware.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20231027182217.3615211-24-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Allow arch code to track number of memslot address spaces per VM · eed52e43
      Sean Christopherson authored
      Let x86 track the number of address spaces on a per-VM basis so that KVM
      can disallow SMM memslots for confidential VMs.  Confidential VMs are
      fundamentally incompatible with emulating SMM, which as the name suggests
      requires being able to read and write guest memory and register state.
      
      Disallowing SMM will simplify support for guest private memory, as KVM
      will not need to worry about tracking memory attributes for multiple
      address spaces (SMM is the only "non-default" address space across all
      architectures).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20231027182217.3615211-23-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro · 2333afa1
      Sean Christopherson authored
      Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
      KVM_ADDRESS_SPACE_NUM.
      
      No functional change intended.
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20231027182217.3615211-22-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Handle page fault for private memory · 8dd2eee9
      Chao Peng authored
      Add support for resolving page faults on guest private memory for VMs
      that differentiate between "shared" and "private" memory.  For such VMs,
      KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
      hva-based shared memory, and KVM needs to map in the "correct" variant,
      i.e. KVM needs to map the gfn shared/private as appropriate based on the
      current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
      
      For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
      shared vs. private via a bit in the guest page tables, i.e. what the guest
      wants may conflict with the current memory attributes.  To support such
      "implicit" conversion requests, exit to userspace with KVM_EXIT_MEMORY_FAULT
      to forward the request to userspace.  Add a new flag for memory faults,
      KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
      map memory as shared vs. private.
      
      Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
      so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
      needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
      exit on missing mappings when handling guest page fault VM-Exits.  In
      that case, userspace will want to know RWX information in order to
      correctly/precisely resolve the fault.
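
A rough sketch of the flag layout described above. Only the position of the private flag (bit 3) comes from the commit message; the ordering of R, W, and X within bits 0-2 and every `SKETCH_`/`sketch_` name are assumptions made for illustration, since the message only reserves those bits for possible future use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed ordering of the reserved low bits; not defined by the commit. */
#define SKETCH_FAULT_FLAG_R       (UINT64_C(1) << 0)
#define SKETCH_FAULT_FLAG_W       (UINT64_C(1) << 1)
#define SKETCH_FAULT_FLAG_X       (UINT64_C(1) << 2)
/* Bit 3 marks a private access, matching the attribute bit position. */
#define SKETCH_FAULT_FLAG_PRIVATE (UINT64_C(1) << 3)

#define SKETCH_FAULT_RWX_MASK \
	(SKETCH_FAULT_FLAG_R | SKETCH_FAULT_FLAG_W | SKETCH_FAULT_FLAG_X)

/* True if the guest asked to map the faulting gfn as private. */
static inline bool sketch_fault_is_private(uint64_t flags)
{
	return (flags & SKETCH_FAULT_FLAG_PRIVATE) != 0;
}

/* Extract the (currently hypothetical) RWX portion of the flags. */
static inline uint64_t sketch_fault_rwx(uint64_t flags)
{
	return flags & SKETCH_FAULT_RWX_MASK;
}
```

Keeping the private flag out of bits 0-2 means the RWX bits can be added later without renumbering anything userspace already depends on.
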
      
      Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
      always come from the host userspace page tables, and private mappings
      always come from a guest_memfd instance.
      Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20231027182217.3615211-21-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Disallow hugepages when memory attributes are mixed · 90b4fe17
      Chao Peng authored
      Disallow creating hugepages with mixed memory attributes, e.g. shared
      versus private, as mapping a hugepage in this case would allow the guest
      to access memory with the wrong attributes, e.g. overlaying private memory
      with a shared hugepage.
      
      Track whether or not attributes are mixed via the existing
      disallow_lpage field, but use the most significant bit in 'disallow_lpage'
      to indicate a hugepage has mixed attributes instead of using the normal
      refcounting.  Whether or not attributes are mixed is binary; either they
      are or they aren't.  Attempting to squeeze that info into the refcount is
      unnecessarily complex as it would require knowing the previous state of
      the mixed count when updating attributes.  Using a flag means KVM just
      needs to ensure the current status is reflected in the memslots.
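
The flag-plus-refcount idea can be sketched like this. It is a simplified model, not KVM's code: it assumes a 32-bit unsigned field, and the struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a 32-bit field whose top bit is the "mixed attributes" flag. */
#define SKETCH_LPAGE_MIXED_FLAG (UINT32_C(1) << 31)

struct sketch_lpage_info {
	uint32_t disallow_lpage; /* bits 0-30: refcount, bit 31: mixed flag */
};

/* The refcount is whatever remains after masking off the flag bit. */
static inline uint32_t sketch_lpage_refcount(const struct sketch_lpage_info *li)
{
	return li->disallow_lpage & ~SKETCH_LPAGE_MIXED_FLAG;
}

/* Ordinary refcounted "disallow" reasons only ever touch the low bits. */
static inline void sketch_lpage_get(struct sketch_lpage_info *li)
{
	assert(sketch_lpage_refcount(li) < ~SKETCH_LPAGE_MIXED_FLAG);
	li->disallow_lpage++;
}

static inline void sketch_lpage_put(struct sketch_lpage_info *li)
{
	assert(sketch_lpage_refcount(li) > 0);
	li->disallow_lpage--;
}

/* Mixed-ness is binary, so it is written outright with no prior state. */
static inline void sketch_lpage_set_mixed(struct sketch_lpage_info *li, bool mixed)
{
	if (mixed)
		li->disallow_lpage |= SKETCH_LPAGE_MIXED_FLAG;
	else
		li->disallow_lpage &= ~SKETCH_LPAGE_MIXED_FLAG;
}

/* A hugepage is allowed only if no reason at all is recorded. */
static inline bool sketch_lpage_allowed(const struct sketch_lpage_info *li)
{
	return li->disallow_lpage == 0;
}
```

Setting the flag is idempotent, which is the property the commit message relies on: the updater recomputes "mixed or not" and stores it, with no need to know what was stored before.
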
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN · ee605e31
      Sean Christopherson authored
      Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
      the probability of exiting to userspace with a stale run->exit_reason that
      *appears* to be valid.
      
      To support fd-based guest memory (guest memory without a corresponding
      userspace virtual address), KVM will exit to userspace for various memory
      related errors, which userspace *may* be able to resolve, instead of using
      e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
      utilize the same functionality to let userspace "intercept" and handle
      memory faults when the userspace mapping is missing, i.e. when fast gup()
      fails.
      
      Because many of KVM's internal APIs related to guest memory use '0' to
      indicate "success, continue on" and not "exit to userspace", reporting
      memory faults/errors to userspace will set run->exit_reason and
      corresponding fields in the run structure in conjunction with a
      non-zero, negative return code, e.g. -EFAULT or -EHWPOISON.  And because
      KVM already returns -EFAULT in many paths, there's a relatively high
      probability that KVM could return -EFAULT without setting run->exit_reason,
      in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
      whatever exit reason happened to be in the run structure.
      
      Note, KVM must wait until after run->immediate_exit is serviced to
      sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
      preserved across KVM_RUN when run->immediate_exit is true.
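
The ordering constraint above can be modeled with a toy run function. This is a sketch under stated assumptions rather than KVM's implementation: the struct, the constants, the `fault_without_reason` knob, and the -EINTR return for an immediate exit are all part of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for exit reasons; the values are local to this sketch. */
#define SKETCH_EXIT_UNKNOWN 0u
#define SKETCH_EXIT_IO      2u

struct sketch_run {
	uint8_t  immediate_exit;
	uint32_t exit_reason;
};

/*
 * Toy model of a run call. fault_without_reason simulates an internal path
 * that returns -EFAULT but forgets to fill in exit_reason.
 */
static inline int sketch_vcpu_run(struct sketch_run *run, bool fault_without_reason)
{
	assert(run != NULL);

	/* Serviced first: exit_reason must be preserved on this path. */
	if (run->immediate_exit)
		return -EINTR;

	/* Only now is it safe to wipe any stale value. */
	run->exit_reason = SKETCH_EXIT_UNKNOWN;

	if (fault_without_reason)
		return -EFAULT;

	run->exit_reason = SKETCH_EXIT_IO;
	return 0;
}
```

In the model, a silent -EFAULT leaves userspace looking at the "unknown" reason instead of a stale but plausible one from the previous exit.
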
      
      Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com
      Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20231027182217.3615211-19-seanjc@google.com>
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory · a7800aa8
      Sean Christopherson authored
      Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
      memory that is tied to a specific KVM virtual machine and whose primary
      purpose is to serve guest memory.
      
      A guest-first memory subsystem allows for optimizations and enhancements
      that are kludgy or outright infeasible to implement/support in a generic
      memory subsystem.  With guest_memfd, guest protections and mapping sizes
      are fully decoupled from host userspace mappings.   E.g. KVM currently
      doesn't support mapping memory as writable in the guest without it also
      being writable in host userspace, as KVM's ABI uses VMA protections to
      define the allowed guest protections.  Userspace can fudge this by
      establishing two mappings, a writable mapping for the guest and readable
      one for itself, but that’s suboptimal on multiple fronts.
      
      Similarly, KVM currently requires the guest mapping size to be a strict
      subset of the host userspace mapping size, e.g. KVM doesn’t support
      creating a 1GiB guest mapping unless userspace also has a 1GiB guest
      mapping.  Decoupling the mapping sizes would allow userspace to precisely
      map only what is needed without impacting guest performance, e.g. to
      harden against unintentional accesses to guest memory.
      
      Decoupling guest and userspace mappings may also allow for a cleaner
      alternative to high-granularity mappings for HugeTLB, which has reached a
      bit of an impasse and is unlikely to ever be merged.
      
      A guest-first memory subsystem also provides clearer line of sight to
      things like a dedicated memory pool (for slice-of-hardware VMs) and
      elimination of "struct page" (for offload setups where userspace _never_
      needs to mmap() guest memory).
      
      More immediately, being able to map memory into KVM guests without mapping
      said memory into the host is critical for Confidential VMs (CoCo VMs), the
      initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
      untrusted software from reading guest private data by encrypting guest
      memory with a key that isn't usable by the untrusted host, projects such
      as Protected KVM (pKVM) provide confidentiality and integrity *without*
      relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
      private memory can be fatal to the host, i.e. KVM must prevent host
      userspace from accessing guest memory irrespective of hardware behavior.
      
      Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
      being mappable only by KVM (or a similarly enlightened kernel subsystem).
      That approach was abandoned largely due to it needing to play games with
      PROT_NONE to prevent userspace from accessing guest memory.
      
      Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
      guest private memory into userspace, but that approach failed to meet
      several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
      wouldn't easily be able to enforce a 1:1 page:guest association, let alone
      a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
      that isn't backed by 'struct page', e.g. if devices gain support for
      exposing encrypted memory regions to guests.
      
      Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
      dedicated file-based guest memory.  That approach made it as far as v10
      before feedback from Hugh Dickins and Christian Brauner (and others) led
      to its demise.
      
      Hugh's objection was that piggybacking shmem made no sense for KVM's use
      case as KVM didn't actually *want* the features provided by shmem.  I.e.
      KVM was using memfd() and shmem to avoid having to manage memory directly,
      not because memfd() and shmem were the optimal solution, e.g. things like
      read/write/mmap in shmem were dead weight.
      
      Christian pointed out flaws with implementing a partial overlay (wrapping
      only _some_ of shmem), e.g. poking at inode_operations or super_operations
      would show shmem stuff, but address_space_operations and file_operations
      would show KVM's overlay.  Paraphrasing heavily, Christian suggested KVM
      stop being lazy and create a proper API.
      
      Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
      Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
      Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
      Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
      Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
      Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
      Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
      Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
      Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
      Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
      Cc: Fuad Tabba <tabba@google.com>
      Cc: Vishal Annapurve <vannapurve@google.com>
      Cc: Ackerley Tng <ackerleytng@google.com>
      Cc: Jarkko Sakkinen <jarkko@kernel.org>
      Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Quentin Perret <qperret@google.com>
      Cc: Michael Roth <michael.roth@amd.com>
      Cc: Wang <wei.w.wang@intel.com>
      Cc: Liam Merwick <liam.merwick@oracle.com>
      Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
      Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Co-developed-by: Ackerley Tng <ackerleytng@google.com>
      Signed-off-by: Ackerley Tng <ackerleytng@google.com>
      Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Co-developed-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-17-seanjc@google.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a7800aa8
    • Paolo Bonzini's avatar
      fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() · 4f0b9194
      Paolo Bonzini authored
      The call to the inode_init_security_anon() LSM hook is not the sole
      reason to use anon_inode_getfile_secure() or anon_inode_getfd_secure().
      For example, the functions also allow one to create a file with non-zero
      size, without needing a full-blown filesystem.  In this case, you don't
      need a "secure" version, just unique inodes; the current name of the
      functions is confusing and does not explain well the difference with
      the more "standard" anon_inode_getfile() and anon_inode_getfd().
      
      Of course, there is another side of the coin; neither io_uring nor
      userfaultfd strictly speaking need distinct inodes, and it is not
      that clear anymore that anon_inode_create_get{file,fd}() allow the LSM
      to intercept and block the inode's creation.  If one was so inclined,
      anon_inode_getfile_secure() and anon_inode_getfd_secure() could be kept,
      using the shared inode or a new one depending on CONFIG_SECURITY.
      However, this is probably overkill, and potentially a cause of bugs in
      different configurations.  Therefore, just add a comment to io_uring
      and userfaultfd explaining the choice of the function.
      
      While at it, remove the export for what is now anon_inode_create_getfd().
      There is no in-tree module that uses it, and the old name is gone anyway.
      If anybody actually needs the symbol, they can ask or they can just use
      anon_inode_create_getfile(), which will be exported very soon for use
      in KVM.
      Suggested-by: Christian Brauner <brauner@kernel.org>
      Reviewed-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4f0b9194
  5. 13 Nov, 2023 4 commits
    • Sean Christopherson's avatar
      mm: Add AS_UNMOVABLE to mark mapping as completely unmovable · 0003e2a4
      Sean Christopherson authored
      Add an "unmovable" flag for mappings that cannot be migrated under any
      circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
      which will not support compaction/migration, at least not in the
      foreseeable future.
      
      Test AS_UNMOVABLE under folio lock as already done for the async
      compaction/dirty folio case, as the mapping can be removed by truncation
      while compaction is running.  To avoid having to lock every folio with a
      mapping, assume/require that unmovable mappings are also unevictable, and
      have mapping_set_unmovable() also set AS_UNEVICTABLE.
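
The "unmovable implies unevictable" rule can be modelled in a few lines of plain C. This is a userspace sketch only: the MODEL_* bit values, struct model_mapping, and every function name below are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real AS_* values live in the kernel. */
enum {
	MODEL_AS_UNEVICTABLE = 1u << 0,
	MODEL_AS_UNMOVABLE   = 1u << 1,
};

struct model_mapping {
	unsigned long flags;
};

/* Setting "unmovable" always sets "unevictable" as well. */
void model_mapping_set_unmovable(struct model_mapping *m)
{
	m->flags |= MODEL_AS_UNMOVABLE | MODEL_AS_UNEVICTABLE;
}

bool model_mapping_unevictable(const struct model_mapping *m)
{
	return (m->flags & MODEL_AS_UNEVICTABLE) != 0;
}

bool model_mapping_unmovable(const struct model_mapping *m)
{
	bool unmovable = (m->flags & MODEL_AS_UNMOVABLE) != 0;

	/* Invariant the scheme relies on: unmovable implies unevictable. */
	assert(!unmovable || model_mapping_unevictable(m));
	return unmovable;
}

/*
 * Compaction-side shortcut in this model: an evictable mapping cannot be
 * unmovable, so only unevictable mappings need the locked recheck.
 */
bool model_needs_locked_check(const struct model_mapping *m)
{
	return model_mapping_unevictable(m);
}
```

Because of the invariant, a scanner in this model can skip the folio lock for any mapping that is still evictable.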
      
      Cc: Matthew Wilcox <willy@infradead.org>
      Co-developed-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0003e2a4
    • Chao Peng's avatar
      KVM: Introduce per-page memory attributes · 5a475554
      Chao Peng authored
      In confidential computing usages, whether a page is private or shared is
      necessary information for KVM to perform operations like page fault
      handling, page zapping etc. There are other potential use cases for
      per-page memory attributes, e.g. to make memory read-only (or no-exec,
      or exec-only, etc.) without having to modify memslots.
      
      Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
      KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
      attributes to a guest memory range.
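
A userspace caller might mark a range private roughly as sketched below. The helper name set_range_private() is hypothetical, and the fallback definitions are an assumption about the upstream uAPI layout, included only so the sketch builds against older headers. As with other VM-scoped ioctls, it is issued on the VM file descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Fallbacks for older uAPI headers; assumed to match the upstream layout. */
#ifndef KVM_SET_MEMORY_ATTRIBUTES
struct kvm_memory_attributes {
	__u64 address;
	__u64 size;
	__u64 attributes;
	__u64 flags;
};
#define KVM_SET_MEMORY_ATTRIBUTES _IOW(KVMIO, 0xd2, struct kvm_memory_attributes)
#endif
#ifndef KVM_MEMORY_ATTRIBUTE_PRIVATE
#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
#endif

static_assert(sizeof(struct kvm_memory_attributes) == 32,
	      "unexpected kvm_memory_attributes layout");

/*
 * Hypothetical helper: mark [gpa, gpa + size) as private or shared.
 * Returns 0 on success, or -1 with errno set.
 */
int set_range_private(int vm_fd, uint64_t gpa, uint64_t size, bool is_private)
{
	struct kvm_memory_attributes attr = {
		.address = gpa,
		.size = size,
		.attributes = is_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
		.flags = 0,
	};

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
}
```

A careful caller would probe KVM_CAP_MEMORY_ATTRIBUTES first; that check is omitted here to keep the sketch short.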
      
      Use an xarray to store the per-page attributes internally, with a naive,
      not fully optimized implementation, i.e. prioritize correctness over
      performance for the initial implementation.
      
      Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
      attributes/protections in the future, e.g. to give userspace fine-grained
      control over read, write, and execute protections for guest memory.
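
The bit layout described above can be written out as follows. Only the PRIVATE bit comes from this patch; the ATTR_READ/WRITE/EXEC names are hypothetical placeholders for the reserved low bits, and attrs_valid() is an illustrative check, not KVM's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the reserved low bits (not defined by this patch). */
#define ATTR_READ	(1ULL << 0)
#define ATTR_WRITE	(1ULL << 1)
#define ATTR_EXEC	(1ULL << 2)
/* Bit 3, same position as the PRIVATE attribute described above. */
#define ATTR_PRIVATE	(1ULL << 3)

/* In this sketch, PRIVATE is the only attribute that may be set today. */
#define ATTR_SUPPORTED	ATTR_PRIVATE

static_assert(ATTR_PRIVATE == 8, "PRIVATE must sit above the three RWX bits");

/* Reject any attribute word that carries a not-yet-supported bit. */
bool attrs_valid(uint64_t attrs)
{
	return (attrs & ~ATTR_SUPPORTED) == 0;
}

bool attrs_private(uint64_t attrs)
{
	return (attrs & ATTR_PRIVATE) != 0;
}
```

Keeping bits 0-2 free means future RWX protections would not require renumbering PRIVATE.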
      
      Provide arch hooks for handling attribute changes before and after common
      code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
      relevant mappings, and the "post" hook to track whether or not hugepages
      can be used to map the range.
      
      To simplify the implementation, wrap the entire sequence with
      kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
      guaranteed to be an invalidation.  For the initial use case, x86 *will*
      always invalidate memory, and preventing arch code from creating new
      mappings while the attributes are in flux makes it much easier to reason
      about the correctness of consuming attributes.
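
Why the begin/end bracket makes attribute consumers easy to reason about can be shown with a single-threaded model. The struct and function names below are hypothetical, and locking is omitted; the sketch only captures the "in progress" counter plus sequence number idea.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the VM-wide invalidation state. */
struct inval_state {
	unsigned long seq;		/* bumped when an invalidation ends */
	unsigned int in_progress;	/* number of open begin/end brackets */
};

void invalidate_begin(struct inval_state *s)
{
	s->in_progress++;
}

void invalidate_end(struct inval_state *s)
{
	assert(s->in_progress > 0);	/* end must pair with a begin */
	s->seq++;
	s->in_progress--;
}

/*
 * A fault handler snapshots 'seq', looks up the attributes, then calls this
 * before installing a mapping.  It must retry if an update is still open or
 * if one completed after the snapshot was taken.
 */
bool fault_must_retry(const struct inval_state *s, unsigned long snapshot)
{
	return s->in_progress != 0 || s->seq != snapshot;
}
```

In this model, a mapping built from attributes read before or during an update is never installed.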
      
      It's possible that future usages may not require an invalidation, e.g.
      if KVM ends up supporting RWX protections and userspace grants _more_
      protections, but again opt for simplicity and punt optimizations to
      if/when they are needed.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
      Cc: Fuad Tabba <tabba@google.com>
      Cc: Xu Yilun <yilun.xu@intel.com>
      Cc: Mickaël Salaün <mic@digikod.net>
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20231027182217.3615211-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5a475554
    • Sean Christopherson's avatar
      KVM: Drop .on_unlock() mmu_notifier hook · 193bbfaa
      Sean Christopherson authored
      Drop the .on_unlock() mmu_notifier hook now that it's no longer used for
      notifying arch code that memory has been reclaimed.  Adding .on_unlock()
      and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
      resulted in .on_lock() and .on_unlock() having divergent and asymmetric
      behavior, and set future developers up for failure, i.e. all but asked for
      bugs where KVM relied on using .on_unlock() to try to run a callback while
      holding mmu_lock.
      
      Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
      guard against future bugs of this nature.
      Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20231027182217.3615211-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      193bbfaa
    • Sean Christopherson's avatar
      KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory · cec29eef
      Sean Christopherson authored
      Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
      __kvm_handle_hva_range() return whether or not an overlapping memslot
      was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
      works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
      mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.
      
      Use a small struct to return the tuple of the notifier-specific return,
      plus whether or not overlap was found.  Because the iteration helpers are
      __always_inlined, practically speaking, the struct will never actually be
      returned from a function call (not to mention the size of the struct will
      be two bytes in practice).
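
The "small struct as a tuple" pattern looks roughly like this in isolation. All names below (range_result, model_slot, handle_range(), notify_range(), reclaim_calls) are hypothetical and only mirror the shape of the idea; the always_inline attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdbool.h>

/* Two results in one return value: handler result plus "overlap found". */
struct range_result {
	bool ret;
	bool found_memslot;
};

static_assert(sizeof(struct range_result) == 2, "expected a two-byte struct");

struct model_slot {
	unsigned long start, end;	/* half-open range [start, end) */
};

/* Trivial handler that reports work done for every overlapping slot. */
bool handler_true(const struct model_slot *slot)
{
	(void)slot;
	return true;
}

/* Always inlined, so the struct need not cross a real call boundary. */
static inline __attribute__((always_inline)) struct range_result
handle_range(const struct model_slot *slots, int nr, unsigned long start,
	     unsigned long end, bool (*handler)(const struct model_slot *))
{
	struct range_result r = { .ret = false, .found_memslot = false };

	for (int i = 0; i < nr; i++) {
		if (slots[i].start >= end || slots[i].end <= start)
			continue;	/* no overlap with [start, end) */
		r.found_memslot = true;
		r.ret |= handler(&slots[i]);
	}
	return r;
}

int reclaim_calls;	/* counts the stand-in "memory reclaimed" callback */

bool notify_range(const struct model_slot *slots, int nr,
		  unsigned long start, unsigned long end)
{
	struct range_result r = handle_range(slots, nr, start, end, handler_true);

	/* Runs after the helper returns, i.e. after any lock was dropped. */
	if (r.found_memslot)
		reclaim_calls++;
	return r.ret;
}
```

The caller decides what to do with found_memslot, so no unlock-time callback is needed inside the helper.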
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20231027182217.3615211-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cec29eef