1. 29 Apr, 2022 22 commits
    • Paolo Bonzini's avatar
      KVM: x86/mmu: remove extended bits from mmu_role, rename field · 7a458f0e
      Paolo Bonzini authored
      mmu_role represents the role of the root of the page tables.
      It does not need any extended bits, as those govern only KVM's
      page table walking; the is_* functions used for page table
      walking always use the CPU role.
      
      ext.valid is not present anymore in the MMU role, but an
      all-zero MMU role is impossible because the level field is
      never zero in the MMU role.  So just zap the whole mmu_role
      in order to force invalidation after CPUID is updated.
      
      While making this change, which requires touching almost every
      occurrence of "mmu_role", rename it to "root_role".
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7a458f0e
    • Paolo Bonzini's avatar
      KVM: x86/mmu: store shadow EFER.NX in the MMU role · 362505de
      Paolo Bonzini authored
      Now that the MMU role is separate from the CPU role, it can be a
      truthful description of the format of the shadow pages.  This includes
      whether the shadow pages use the NX bit; so force the efer_nx field
      of the MMU role when TDP is disabled, and remove the hardcoding it in
      the callers of reset_shadow_zero_bits_mask.
      
      In fact, the initialization of reserved SPTE bits can now be made common
      to shadow paging and shadow NPT; move it to shadow_mmu_init_context.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      362505de
    • Paolo Bonzini's avatar
      KVM: x86/mmu: cleanup computation of MMU roles for shadow paging · f417e145
      Paolo Bonzini authored
      Pass the already-computed CPU role, instead of redoing it.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f417e145
    • Paolo Bonzini's avatar
      KVM: x86/mmu: cleanup computation of MMU roles for two-dimensional paging · 2ba67677
      Paolo Bonzini authored
      Inline kvm_calc_mmu_role_common into its sole caller, and simplify it
      by removing the computation of unnecessary bits.
      
      Extended bits are unnecessary because page walking uses the CPU role,
      and EFER.NX/CR0.WP can be set to one unconditionally---matching the
      format of shadow pages rather than the format of guest pages.
      
      The MMU role for two dimensional paging does still depend on the CPU role,
      even if only barely so, due to SMM and guest mode; for consistency,
      pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying
      the vcpu with is_smm or is_guest_mode.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2ba67677
    • Paolo Bonzini's avatar
      KVM: x86/mmu: remove kvm_calc_shadow_root_page_role_common · 19b5dcc3
      Paolo Bonzini authored
      kvm_calc_shadow_root_page_role_common is the same as
      kvm_calc_cpu_role except for the level, which is overwritten
      afterwards in kvm_calc_shadow_mmu_root_page_role
      and kvm_calc_shadow_npt_root_page_role.
      
      role.base.direct is already set correctly for the CPU role,
      and CR0.PG=1 is required for VMRUN so it will also be
      correct for nested NPT.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      19b5dcc3
    • Paolo Bonzini's avatar
      KVM: x86/mmu: remove ept_ad field · ec283cb1
      Paolo Bonzini authored
      The ept_ad field is used during page walk to determine if the guest PTEs
      have accessed and dirty bits.  In the MMU role, the ad_disabled
      bit represents whether the *shadow* PTEs have the bits, so it
      would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just
      !mmu->mmu_role.base.ad_disabled.
      
      However, the similar field in the CPU mode, ad_disabled, is initialized
      correctly: to the opposite value of ept_ad for shadow EPT, and zero
      for non-EPT guest paging modes (which always have A/D bits).  It is
      therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode,
      like other page-format fields; it just has to be inverted to account
      for the different polarity.
      
      In fact, now that the CPU mode is distinct from the MMU roles, it would
      even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and
      use !mmu->cpu_role.base.ad_disabled instead.  I am not doing this because
      the macro has a small effect in terms of dead code elimination:
      
         text	   data	    bss	    dec	    hex
       103544	  16665	    112	 120321	  1d601    # as of this patch
       103746	  16665	    112	 120523	  1d6cb    # without PT_HAVE_ACCESSED_DIRTY
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ec283cb1
    • Paolo Bonzini's avatar
      KVM: x86/mmu: do not recompute root level from kvm_mmu_role_regs · 60f3cb60
      Paolo Bonzini authored
      The root_level can be found in the cpu_role (in fact the field
      is superfluous and could be removed, but one thing at a time).
      Since there is only one usage left of role_regs_to_root_level,
      inline it into kvm_calc_cpu_role.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      60f3cb60
    • Paolo Bonzini's avatar
      KVM: x86/mmu: split cpu_role from mmu_role · e5ed0fb0
      Paolo Bonzini authored
      Snapshot the state of the processor registers that govern page walk into
      a new field of struct kvm_mmu.  This is a more natural representation
      than having it *mostly* in mmu_role but not exclusively; the delta
      right now is represented in other fields, such as root_level.
      
      The nested MMU now has only the CPU role; and in fact the new function
      kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role,
      except that it has role.base.direct equal to !CR0.PG.  For a walk-only
      MMU, "direct" has no meaning, but we set it to !CR0.PG so that
      role.ext.cr0_pg can go away in a future patch.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e5ed0fb0
    • Paolo Bonzini's avatar
      KVM: x86/mmu: remove "bool base_only" arguments · b8980508
      Paolo Bonzini authored
      The argument is always false now that kvm_mmu_calc_root_page_role has
      been removed.
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b8980508
    • Sean Christopherson's avatar
      KVM: x86: Clean up and document nested #PF workaround · 6819af75
      Sean Christopherson authored
      Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
      with an explicit, common workaround in kvm_inject_emulated_page_fault().
      Aside from being a hack, the current approach is brittle and incomplete,
      e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
      and nVMX fails to apply the workaround when VMX is intercepting #PF due
      to allow_smaller_maxphyaddr=1.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6819af75
    • Paolo Bonzini's avatar
      KVM: x86/mmu: rephrase unclear comment · 25cc0565
      Paolo Bonzini authored
      If accessed bits are not supported there simple isn't any distinction
      between accessed and non-accessed gPTEs, so the comment does not make
      much sense.  Rephrase it in terms of what happens if accessed bits
      *are* supported.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      25cc0565
    • Paolo Bonzini's avatar
      KVM: x86/mmu: pull computation of kvm_mmu_role_regs to kvm_init_mmu · 39e7e2bf
      Paolo Bonzini authored
      The init_kvm_*mmu functions, with the exception of shadow NPT,
      do not need to know the full values of CR0/CR4/EFER; they only
      need to know the bits that make up the "role".  This cleanup
      however will take quite a few incremental steps.  As a start,
      pull the common computation of the struct kvm_mmu_role_regs
      into their caller: all of them extract the struct from the vcpu
      as the very first step.
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      39e7e2bf
    • Paolo Bonzini's avatar
      KVM: x86/mmu: constify uses of struct kvm_mmu_role_regs · 82ffa13f
      Paolo Bonzini authored
      struct kvm_mmu_role_regs is computed just once and then accessed.  Use
      const to make this clearer, even though the const fields of struct
      kvm_mmu_role_regs already prevent (or make it harder...) to modify
      the contents of the struct.
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      82ffa13f
    • Paolo Bonzini's avatar
      KVM: x86/mmu: nested EPT cannot be used in SMM · daed87b8
      Paolo Bonzini authored
      The role.base.smm flag is always zero when setting up shadow EPT,
      do not bother copying it over from vcpu->arch.root_mmu.
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      daed87b8
    • Sean Christopherson's avatar
      KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled · 8b9e74bf
      Sean Christopherson authored
      Clear enable_mmio_caching if hardware can't support MMIO caching and use
      the dedicated flag to detect if MMIO caching is enabled instead of
      assuming shadow_mmio_value==0 means MMIO caching is disabled.  TDX will
      use a zero value even when caching is enabled, and is_mmio_spte() isn't
      so hot that it needs to avoid an extra memory access, i.e. there's no
      reason to be super clever.  And the clever approach may not even be more
      performant, e.g. gcc-11 lands the extra check on a non-zero value inline,
      but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra
      uops for non-MMIO SPTEs.
      
      Cc: Isaku Yamahata <isaku.yamahata@intel.com>
      Cc: Kai Huang <kai.huang@intel.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220420002747.3287931-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8b9e74bf
    • Sean Christopherson's avatar
      KVM: x86/mmu: Check for host MMIO exclusion from mem encrypt iff necessary · 65936229
      Sean Christopherson authored
      When determining whether or not a SPTE needs to have SME/SEV's memory
      encryption flag set, do the moderately expensive host MMIO pfn check if
      and only if the memory encryption mask is non-zero.
      
      Note, KVM could further optimize the host MMIO checks by making a single
      call to kvm_is_mmio_pfn(), but the tdp_enabled path (for EPT's memtype
      handling) will likely be split out to a separate flow[*].  At that point,
      a better approach would be to shove the call to kvm_is_mmio_pfn() into
      VMX code so that AMD+NPT without SME doesn't get hit with an unnecessary
      lookup.
      
      [*] https://lkml.kernel.org/r/20220321224358.1305530-3-bgardon@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220415004909.2216670-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      65936229
    • Babu Moger's avatar
      KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts · 296d5a17
      Babu Moger authored
      The TSC_AUX virtualization feature allows AMD SEV-ES guests to securely use
      TSC_AUX (auxiliary time stamp counter data) in the RDTSCP and RDPID
      instructions. The TSC_AUX value is set using the WRMSR instruction to the
      TSC_AUX MSR (0xC0000103). It is read by the RDMSR, RDTSCP and RDPID
      instructions. If the read/write of the TSC_AUX MSR is intercepted, then
      RDTSCP and RDPID must also be intercepted when TSC_AUX virtualization
      is present. However, the RDPID instruction can't be intercepted. This means
      that when TSC_AUX virtualization is present, RDTSCP and TSC_AUX MSR
      read/write must not be intercepted for SEV-ES (or SEV-SNP) guests.
      Signed-off-by: default avatarBabu Moger <babu.moger@amd.com>
      Message-Id: <165040164424.1399644.13833277687385156344.stgit@bmoger-ubuntu>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      296d5a17
    • Babu Moger's avatar
      x86/cpufeatures: Add virtual TSC_AUX feature bit · f3090339
      Babu Moger authored
      The TSC_AUX Virtualization feature allows AMD SEV-ES guests to securely use
      TSC_AUX (auxiliary time stamp counter data) MSR in RDTSCP and RDPID
      instructions.
      
      The TSC_AUX MSR is typically initialized to APIC ID or another unique
      identifier so that software can quickly associate returned TSC value
      with the logical processor.
      
      Add the feature bit and also include it in the kvm for detection.
      Signed-off-by: default avatarBabu Moger <babu.moger@amd.com>
      Acked-by: default avatarBorislav Petkov <bp@suse.de>
      Message-Id: <165040157111.1399644.6123821125319995316.stgit@bmoger-ubuntu>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      f3090339
    • Paolo Bonzini's avatar
      Merge branch 'kvm-fixes-for-5.18-rc5' into HEAD · 71d7c575
      Paolo Bonzini authored
      Fixes for (relatively) old bugs, to be merged in both the -rc and next
      development trees.
      
      The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between
      5.18 and commit c24a950e ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata
      for SEV-ES", 2022-04-13).
      71d7c575
    • Mingwei Zhang's avatar
      KVM: x86/mmu: fix potential races when walking host page table · 44187235
      Mingwei Zhang authored
      KVM uses lookup_address_in_mm() to detect the hugepage size that the host
      uses to map a pfn.  The function suffers from several issues:
      
       - no usage of READ_ONCE(*). This allows multiple dereference of the same
         page table entry. The TOCTOU problem because of that may cause KVM to
         incorrectly treat a newly generated leaf entry as a nonleaf one, and
         dereference the content by using its pfn value.
      
       - the information returned does not match what KVM needs; for non-present
         entries it returns the level at which the walk was terminated, as long
         as the entry is not 'none'.  KVM needs level information of only 'present'
         entries, otherwise it may regard a non-present PXE entry as a present
         large page mapping.
      
       - the function is not safe for mappings that can be torn down, because it
         does not disable IRQs and because it returns a PTE pointer which is never
         safe to dereference after the function returns.
      
      So implement the logic for walking host page tables directly in KVM, and
      stop using lookup_address_in_mm().
      
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarMingwei Zhang <mizhang@google.com>
      Message-Id: <20220429031757.2042406-1-mizhang@google.com>
      [Inline in host_pfn_mapping_level, ensure no semantic change for its
       callers. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      44187235
    • Paolo Bonzini's avatar
      KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Paolo Bonzini authored
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
      - the member is not aligned to 64 bits, so the definition of the
        uAPI struct is incorrect for 32- on 64-bit userspace.  This is a
        problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
        usage of flags was only introduced in 5.18.
      
      Since padding has to be introduced, place a new field in there
      that tells if the flags field is valid.  To allow further extensibility,
      in fact, change flags to an array of 16 values, and store how many
      of the values are valid.  The availability of the new ndata field
      is tied to a system capability; all architectures are changed to
      fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d495f942
    • Sean Christopherson's avatar
      KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson authored
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      86931ff7
  2. 13 Apr, 2022 18 commits