1. 01 Aug, 2022 5 commits
    • selftests: KVM: Provide descriptive assertions in kvm_binary_stats_test · 7eebae78
      Oliver Upton authored
      As it turns out, tests sometimes fail. When that is the case, packing
      the test assertion with as much relevant information as possible helps
      track down the problem more quickly.
      
      Sharpen up the stat descriptor assertions in kvm_binary_stats_test to
      more precisely describe the reason for the test assertion and which
      stat is to blame.
      Signed-off-by: Oliver Upton <oupton@google.com>
      Reviewed-by: Andrew Jones <andrew.jones@linux.dev>
      Message-Id: <20220719143134.3246798-3-oliver.upton@linux.dev>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • selftests: KVM: Check stat name before other fields · ad5b0727
      Oliver Upton authored
      In order to provide more useful test assertions that describe the broken
      stats descriptor, perform a sanity check on the stat name before any
      other descriptor field. While at it, avoid dereferencing the name field
      if the sanity check fails, as it is more likely to contain garbage.
      Signed-off-by: Oliver Upton <oupton@google.com>
      Reviewed-by: Andrew Jones <andrew.jones@linux.dev>
      Message-Id: <20220719143134.3246798-2-oliver.upton@linux.dev>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove unused variable · 31f6e383
      Paolo Bonzini authored
      The last use of 'pfn' went away with the same-named argument to
      host_pfn_mapping_level; now that the hugepage level is obtained
      exclusively from the host page tables, kvm_mmu_zap_collapsible_spte
      does not need to know host pfns at all.
      
      Fixes: a8ac499b ("KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'kvmarm-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · c4edb2ba
      Paolo Bonzini authored
      KVM/arm64 updates for 5.20:
      
      - Unwinder implementations for both nVHE modes (classic and
        protected), complete with an overflow stack
      
      - Rework of the sysreg access from userspace, with a complete
        rewrite of the vgic-v3 view to align with the rest of the
        infrastructure
      
      - Disaggregation of the vcpu flags into separate sets to better
        track their usage model
      
      - A fix for the GICv2-on-v3 selftest
      
      - A small set of cosmetic fixes
    • Merge remote-tracking branch 'kvm/next' into kvm-next-5.20 · 63f4b210
      Paolo Bonzini authored
      KVM/s390, KVM/x86 and common infrastructure changes for 5.20
      
      x86:
      
      * Permit guests to ignore single-bit ECC errors
      
      * Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
      
      * Intel IPI virtualization
      
      * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
      
      * PEBS virtualization
      
      * Simplify PMU emulation by just using PERF_TYPE_RAW events
      
      * More accurate event reinjection on SVM (avoid retrying instructions)
      
      * Allow getting/setting the state of the speaker port data bit
      
      * Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
      
      * "Notify" VM exit (detect microarchitectural hangs) for Intel
      
      * Cleanups for MCE MSR emulation
      
      s390:
      
      * add an interface to provide a hypervisor dump for secure guests
      
      * improve selftests to use TAP interface
      
      * enable interpretive execution of zPCI instructions (for PCI passthrough)
      
      * First part of deferred teardown
      
      * CPU Topology
      
      * PV attestation
      
      * Minor fixes
      
      Generic:
      
      * new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
      
      x86:
      
      * Use try_cmpxchg64 instead of cmpxchg64
      
      * Bugfixes
      
      * Ignore benign host accesses to PMU MSRs when PMU is disabled
      
      * Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
      
      * x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
      
      * Port eager page splitting to shadow MMU as well
      
      * Enable CMCI capability by default and handle injected UCNA errors
      
      * Expose pid of vcpu threads in debugfs
      
      * x2AVIC support for AMD
      
      * cleanup PIO emulation
      
      * Fixes for LLDT/LTR emulation
      
      * Don't require refcounted "struct page" to create huge SPTEs
      
      x86 cleanups:
      
      * Use separate namespaces for guest PTEs and shadow PTEs bitmasks
      
      * PIO emulation
      
      * Reorganize rmap API, mostly around rmap destruction
      
      * Do not workaround very old KVM bugs for L0 that runs with nesting enabled
      
      * new selftests API for CPUID
  2. 29 Jul, 2022 12 commits
  3. 28 Jul, 2022 23 commits
    • KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs() · 7edc3a68
      Kai Huang authored
      Now kvm_tdp_mmu_zap_leafs() only zaps leaf SPTEs but not any non-root
      pages within that GFN range anymore, so the comment around it isn't
      right.
      
      Fix it by shifting the comment from tdp_mmu_zap_leafs() instead of
      duplicating it, as tdp_mmu_zap_leafs() is static and is only called by
      kvm_tdp_mmu_zap_leafs().
      
      Opportunistically tweak the blurb about SPTEs being cleared to (a) say
      "zapped" instead of "cleared" because "cleared" will be wrong if/when
      KVM allows a non-zero value for non-present SPTE (i.e. for Intel TDX),
      and (b) to clarify that a flush is needed if and only if a SPTE has been
      zapped since MMU lock was last acquired.
      
      Fixes: f47e5bbb ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap")
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <20220728030452.484261-1-kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog · 6fac42f1
      Jarkko Sakkinen authored
      As Virtual Machine Save Area (VMSA) is essential in troubleshooting
      attestation, dump it to the klog with the KERN_DEBUG level of priority.
      
      Cc: Jarkko Sakkinen <jarkko@kernel.org>
      Suggested-by: Harald Hoyer <harald@profian.com>
      Signed-off-by: Jarkko Sakkinen <jarkko@profian.com>
      Message-Id: <20220728050919.24113-1-jarkko@profian.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT · 6c6ab524
      Sean Christopherson authored
      Treat the NX bit as valid when using NPT, as KVM will set the NX bit when
      the NX huge page mitigation is enabled (mindblowing) and trigger the WARN
      that fires on reserved SPTE bits being set.
      
      KVM has required NX support for SVM since commit b26a71a1 ("KVM: SVM:
      Refuse to load kvm_amd if NX support is not available") for exactly this
      reason, but apparently it never occurred to anyone to actually test NPT
      with the mitigation enabled.
      
        ------------[ cut here ]------------
        spte = 0x800000018a600ee7, level = 2, rsvd bits = 0x800f0000001fe000
        WARNING: CPU: 152 PID: 15966 at arch/x86/kvm/mmu/spte.c:215 make_spte+0x327/0x340 [kvm]
        Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022
        RIP: 0010:make_spte+0x327/0x340 [kvm]
        Call Trace:
         <TASK>
         tdp_mmu_map_handle_target_level+0xc3/0x230 [kvm]
         kvm_tdp_mmu_map+0x343/0x3b0 [kvm]
         direct_page_fault+0x1ae/0x2a0 [kvm]
         kvm_tdp_page_fault+0x7d/0x90 [kvm]
         kvm_mmu_page_fault+0xfb/0x2e0 [kvm]
         npf_interception+0x55/0x90 [kvm_amd]
         svm_invoke_exit_handler+0x31/0xf0 [kvm_amd]
         svm_handle_exit+0xf6/0x1d0 [kvm_amd]
         vcpu_enter_guest+0xb6d/0xee0 [kvm]
         ? kvm_pmu_trigger_event+0x6d/0x230 [kvm]
         vcpu_run+0x65/0x2c0 [kvm]
         kvm_arch_vcpu_ioctl_run+0x355/0x610 [kvm]
         kvm_vcpu_ioctl+0x551/0x610 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x44/0xa0
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220723013029.1753623-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Do not block APIC write for non ICR registers · 1bd9dfec
      Suravee Suthikulpanit authored
      Commit 5413bcba ("KVM: x86: Add support for vICR APIC-write
      VM-Exits in x2APIC mode") introduced logic to prevent APIC writes
      for offsets other than ICR in kvm_apic_write_nodecode().
      This breaks x2AVIC support, which requires KVM to trap and emulate
      x2APIC MSR writes.
      
      Therefore, remove the warning and modify the logic to allow the MSR
      writes.
      
      Fixes: 5413bcba ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode")
      Cc: Zeng Guang <guang.zeng@intel.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <20220725053356.4275-1-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Do not virtualize MSR accesses for APIC LVTT register · 0a8735a6
      Suravee Suthikulpanit authored
      AMD does not support the APIC TSC-deadline timer mode. AVIC hardware
      will generate a #GP fault when the guest kernel writes 1 to bit 18
      of the APIC LVTT register (offset 0x32) to set the timer mode.
      (Note: bit 18 is reserved on AMD systems.)
      
      Therefore, always intercept and let KVM emulate the MSR accesses.
      
      Fixes: f3d7c8aa6882 ("KVM: SVM: Fix x2APIC MSRs interception")
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <20220725033428.3699-1-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Verify VMX MSRs can be restored to KVM-supported values · ce30d8b9
      Sean Christopherson authored
      Verify that KVM allows toggling VMX MSR bits to be "more" restrictive,
      and also allows restoring each MSR to KVM's original, less restrictive
      value.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Set UMIP bit CR4_FIXED1 MSR when emulating UMIP · a910b5ab
      Sean Christopherson authored
      Make UMIP an "allowed-1" bit in the CR4_FIXED1 MSR when KVM is emulating UMIP.
      KVM emulates UMIP for both L1 and L2, and so should enumerate that L2 is
      allowed to have CR4.UMIP=1.  Not setting the bit doesn't immediately
      break nVMX, as KVM does set/clear the bit in CR4_FIXED1 in response to a
      guest CPUID update, i.e. KVM will correctly (dis)allow nested VM-Entry
      based on whether or not UMIP is exposed to L1.  That said, KVM should
      enumerate the bit as being allowed from time zero, e.g. userspace will
      see the wrong value if the MSR is read before CPUID is written.
      
      Fixes: 0367f205 ("KVM: vmx: add support for emulating UMIP")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control" · 9389d577
      Paolo Bonzini authored
      This reverts commit 03a8871a.
      
      Since commit 03a8871a ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
      VM-{Entry,Exit} control"), KVM has taken ownership of the "load
      IA32_PERF_GLOBAL_CTRL" VMX entry/exit control bits, trying to set these
      bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID
      supports the architectural PMU (CPUID[EAX=0Ah].EAX[7:0]=1), and clear
      otherwise.
      
      This was a misguided attempt at mimicking what commit 5f76f6f5
      ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled",
      2018-10-01) did for MPX.  However, that commit was a workaround for
      another KVM bug and not something that should be imitated.  Mucking with
      the VMX MSRs creates a subtle, difficult to maintain ABI as KVM must
      ensure that any internal changes, e.g. to how KVM handles _any_ guest
      CPUID changes, yield the same functional result.  Therefore, KVM's policy
      is to let userspace have full control of the guest vCPU model so long
      as the host kernel is not at risk.
      
      Now that KVM really truly ensures kvm_set_msr() will succeed by loading
      PERF_GLOBAL_CTRL if and only if it exists, revert KVM's misguided and
      roundabout behavior.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [sean: make it a pure revert]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220722224409.1336532-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Attempt to load PERF_GLOBAL_CTRL on nVMX xfer iff it exists · 4496a6f9
      Sean Christopherson authored
      Attempt to load PERF_GLOBAL_CTRL during nested VM-Enter/VM-Exit if and
      only if the MSR exists (according to the guest vCPU model).  KVM has very
      misguided handling of VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL and
      attempts to force the nVMX MSR settings to match the vPMU model, i.e. to
      hide/expose the control based on whether or not the MSR exists from the
      guest's perspective.
      
      KVM's modifications fail to handle the scenario where the vPMU is hidden
      from the guest _after_ being exposed to the guest, e.g. by userspace
      doing multiple KVM_SET_CPUID2 calls, which is allowed if done before any
      KVM_RUN.  nested_vmx_pmu_refresh() is called if and only if there's a
      recognized vPMU, i.e. KVM will leave the bits in the allow state and then
      ultimately reject the MSR load and WARN.
      
      KVM should not force the VMX MSRs in the first place.  KVM taking control
      of the MSRs was a misguided attempt at mimicking what commit 5f76f6f5
      ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled",
      2018-10-01) did for MPX.  However, the MPX commit was a workaround for
      another KVM bug and not something that should be imitated (and should
      never have been done in the first place).
      
      In other words, KVM's ABI _should_ be that userspace has full control
      over the MSRs, at which point triggering the WARN that loading the MSR
      must not fail is trivial.
      
      The intent of the WARN is still valid; KVM has consistency checks to
      ensure that vmcs12->{guest,host}_ia32_perf_global_ctrl is valid.  The
      problem is that '0' must be considered a valid value at all times, and so
      the simple/obvious solution is to just not actually load the MSR when it
      does not exist.  It is userspace's responsibility to provide a sane vCPU
      model, i.e. KVM is well within its ABI and Intel's VMX architecture to
      skip the loads if the MSR does not exist.
      
      Fixes: 03a8871a ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220722224409.1336532-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Add helper to check if the guest PMU has PERF_GLOBAL_CTRL · b663f0b5
      Sean Christopherson authored
      Add a helper to check if the guest PMU has PERF_GLOBAL_CTRL, which is
      unintuitive _and_ diverges from Intel's architecturally defined behavior.
      Even worse, KVM currently implements the check using two different (but
      equivalent) checks, _and_ there has been at least one attempt to add a
      _third_ flavor.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220722224409.1336532-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Mark all PERF_GLOBAL_(OVF)_CTRL bits reserved if there's no vPMU · 93255bf9
      Sean Christopherson authored
      Mark all MSR_CORE_PERF_GLOBAL_CTRL and MSR_CORE_PERF_GLOBAL_OVF_CTRL bits
      as reserved if there is no guest vPMU.  The nVMX VM-Entry consistency
      checks do not check for a valid vPMU prior to consuming the masks via
      kvm_valid_perf_global_ctrl(), i.e. may incorrectly allow a non-zero mask
      to be loaded via VM-Enter or VM-Exit (well, attempted to be loaded, the
      actual MSR load will be rejected by intel_is_valid_msr()).
      
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220722224409.1336532-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled" · 8805875a
      Paolo Bonzini authored
      Since commit 5f76f6f5 ("KVM: nVMX: Do not expose MPX VMX controls
      when guest MPX disabled"), KVM has taken ownership of the "load
      IA32_BNDCFGS" and "clear IA32_BNDCFGS" VMX entry/exit controls,
      trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS
      MSRs if the guest's CPUID supports MPX, and clear otherwise.
      
      The intent of the patch was to apply it to L0 in order to work around
      L1 kernels that lack the fix in commit 691bd434 ("kvm: vmx: allow
      host to access guest MSR_IA32_BNDCFGS", 2017-07-04): by hiding the
      control bits from L0, L1 hides BNDCFGS from KVM_GET_MSR_INDEX_LIST,
      and the L1 bug is neutralized even in the absence of commit 691bd434.
      
      This was perhaps a sensible kludge at the time, but a horrible
      idea in the long term and in fact it has not been extended to
      other CPUID bits like these:
      
        X86_FEATURE_LM => VM_EXIT_HOST_ADDR_SPACE_SIZE, VM_ENTRY_IA32E_MODE,
                          VMX_MISC_SAVE_EFER_LMA
      
        X86_FEATURE_TSC => CPU_BASED_RDTSC_EXITING, CPU_BASED_USE_TSC_OFFSETTING,
                           SECONDARY_EXEC_TSC_SCALING
      
        X86_FEATURE_INVPCID_SINGLE => SECONDARY_EXEC_ENABLE_INVPCID
      
        X86_FEATURE_MWAIT => CPU_BASED_MONITOR_EXITING, CPU_BASED_MWAIT_EXITING
      
        X86_FEATURE_INTEL_PT => SECONDARY_EXEC_PT_CONCEAL_VMX, SECONDARY_EXEC_PT_USE_GPA,
                                VM_EXIT_CLEAR_IA32_RTIT_CTL, VM_ENTRY_LOAD_IA32_RTIT_CTL
      
        X86_FEATURE_XSAVES => SECONDARY_EXEC_XSAVES
      
      These days it's sort of common knowledge that any MSR in
      KVM_GET_MSR_INDEX_LIST must allow *at least* setting it with KVM_SET_MSR
      to a default value, so it is unlikely that something like commit
      5f76f6f5 will be needed again.  So revert it, at the potential cost
      of breaking L1s with a 6 year old kernel.  While in principle the L0 owner
      doesn't control what runs on L1, such an old hypervisor would probably
      have many other bugs.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported value · f8ae08f9
      Sean Christopherson authored
      Restrict the nVMX MSRs based on KVM's config, not based on the guest's
      current config.  Using the guest's config to audit the new config
      prevents userspace from restoring the original config (KVM's config) if
      at any point in the past the guest's config was restricted in any way.
      
      Fixes: 62cc6b9d ("KVM: nVMX: support restore of VMX capability MSRs")
      Cc: stable@vger.kernel.org
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Rename handle_vm{on,off}() to handle_vmx{on,off}() · a645c2b5
      Sean Christopherson authored
      Rename the exit handlers for VMXON and VMXOFF to match the instruction
      names; the terms "vmon" and "vmoff" are not used anywhere in Intel's
      documentation, nor are they used elsewhere in KVM.
      
      Sadly, the exit reasons are exposed to userspace and so cannot be renamed
      without breaking userspace. :-(
      
      Fixes: ec378aee ("KVM: nVMX: Implement VMXON and VMXOFF")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4 · c7d855c2
      Sean Christopherson authored
      Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed
      per the associated nested VMX MSRs' fixed0/1 settings.  KVM cannot rely
      on hardware to perform the checks, even for the few checks that have
      higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in
      hardware while running the guest, (b) there may be incompatible CR0/CR4 bits
      that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may
      have further restricted the allowed CR0/CR4 values by manipulating the
      guest's nested VMX MSRs.
      
      Note, despite a very strong desire to throw shade at Jim, commit
      70f3aac9 ("kvm: nVMX: Remove superfluous VMX instruction fault checks")
      is not to blame for the buggy behavior (though the comment...).  That
      commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks
      (though it did erroneously drop the CPL check, but that has already been
      remedied).  KVM may force CR0.PE=1, but will do so only when also
      forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033
      Fixes: ec378aee ("KVM: nVMX: Implement VMXON and VMXOFF")
      Reported-by: Eric Li <ercli@ucdavis.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Account for KVM reserved CR4 bits in consistency checks · ca58f3aa
      Sean Christopherson authored
      Check that the guest (L2) and host (L1) CR4 values that would be loaded
      by nested VM-Enter and VM-Exit respectively are valid with respect to
      KVM's (L0 host) allowed CR4 bits.  Failure to check KVM reserved bits
      would allow L1 to load an illegal CR4 (or trigger hardware VM-Fail or
      failed VM-Entry) by massaging guest CPUID to allow features that are not
      supported by KVM.  Amusingly, KVM itself is an accomplice in its doom, as
      KVM adjusts L1's MSR_IA32_VMX_CR4_FIXED1 to allow L1 to enable bits for
      L2 based on L1's CPUID model.
      
      Note, although nested_{guest,host}_cr4_valid() are _currently_ used if
      and only if the vCPU is post-VMXON (nested.vmxon == true), that may not
      be true in the future, e.g. emulating VMXON has a bug where it doesn't
      check the allowed/required CR0/CR4 bits.
      
      Cc: stable@vger.kernel.org
      Fixes: 3899152c ("KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bits · c33f6f22
      Sean Christopherson authored
      Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits
      checks, into a separate helper, __kvm_is_valid_cr4(), and export only the
      inner helper to vendor code in order to prevent nested VMX from calling
      back into vmx_is_valid_cr4() via kvm_is_valid_cr4().
      
      On SVM, this is a nop as SVM doesn't place any additional restrictions on
      CR4.
      
      On VMX, this is also currently a nop, but only because nested VMX is
      missing checks on reserved CR4 bits for nested VM-Enter.  That bug will
      be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is,
      but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's
      restrictions on CR0/CR4.  The cleanest and most intuitive way to fix the
      VMXON bug is to use nested_host_cr{0,4}_valid().  If the CR4 variant
      routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do
      the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's
      restrictions if and only if the vCPU is post-VMXON.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Add an option to run vCPUs while disabling dirty logging · cfe12e64
      Sean Christopherson authored
      Add a command line option to dirty_log_perf_test to run vCPUs for the
      entire duration of disabling dirty logging.  By default, the test stops
      running vCPUs before disabling dirty logging, which is faster but
      less interesting as it doesn't stress KVM's handling of contention
      between page faults and the zapping of collapsible SPTEs.  Enabling the
      flag also lets the user verify that KVM is indeed rebuilding zapped SPTEs
      as huge pages by checking KVM's pages_{1g,2m,4k} stats.  Without vCPUs to
      fault in the zapped SPTEs, the stats will show that KVM is zapping pages,
      but they never show whether or not KVM actually allows huge pages to be
      recreated.
      
      Note!  Enabling the flag can _significantly_ increase runtime, especially
      if the thread that's disabling dirty logging doesn't have a dedicated
      pCPU, e.g. if all pCPUs are used to run vCPUs.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715232107.3775620-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEs · 85f44f8c
      Sean Christopherson authored
      When zapping collapsible SPTEs in the TDP MMU, don't bottom out on a leaf
      SPTE now that KVM doesn't require a PFN to compute the host mapping level,
      i.e. now that there's no need to first find a leaf SPTE and then step
      back up.
      
      Drop the now unused tdp_iter_step_up(), as it is not the safest of
      helpers (using any of the low level iterators requires some understanding
      of the various side effects).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715232107.3775620-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level() · 65e3b446
      Sean Christopherson authored
      Add a comment to document how host_pfn_mapping_level() can be used safely,
      as the line between safe and dangerous is quite thin.  E.g. if KVM were
      to ever support in-place promotion to create huge pages, consuming the
      level is safe if the caller holds mmu_lock and checks that there's an
      existing _leaf_ SPTE, but unsafe if the caller only checks that there's a
      non-leaf SPTE.
      
      Opportunistically tweak the existing comments to explicitly document why
      KVM needs to use READ_ONCE().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220715232107.3775620-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs · a8ac499b
      Sean Christopherson authored
      Drop the requirement that a pfn be backed by a refcounted, compound,
      or ZONE_DEVICE struct page, and instead rely solely on the host page
      tables to identify huge pages.  The PageCompound() check is a remnant
      of an old implementation that identified (well, attempted to identify)
      huge pages without walking the host page tables.  The ZONE_DEVICE check
      was added as an exception to the PageCompound() requirement.  In other
      words, neither check is actually a hard requirement; if the primary MMU
      has a pfn backed with a huge page, then KVM can back the pfn with a
      huge page regardless of the backing store.
      
      Dropping the @pfn parameter will also allow KVM to query the max host
      mapping level without having to first get the pfn, which is advantageous
      for use outside of the page fault path where KVM wants to take action if
      and only if a page can be mapped huge, i.e. avoids the pfn lookup for
      gfns that can't be backed with a huge page.
      
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220715232107.3775620-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're used · d5e90a69
      Sean Christopherson authored
      Restrict the mapping level for SPTEs based on the guest MTRRs if and only
      if KVM may actually use the guest MTRRs to compute the "real" memtype.
      For all forms of paging, guest MTRRs are purely virtual in the sense that
      they are completely ignored by hardware, i.e. they affect the memtype
      only if software manually consumes them.  The only scenario where KVM
      consumes the guest MTRRs is when shadow_memtype_mask is non-zero and the
      guest has non-coherent DMA, in all other cases KVM simply leaves the PAT
      field in SPTEs as '0' to encode WB memtype.
      
      Note, KVM may still ultimately ignore guest MTRRs, e.g. if the backing
      pfn is host MMIO, but false positives are ok as they only cause a slight
      performance blip (unless the guest is doing weird things with its MTRRs,
      which is extremely unlikely).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220715230016.3762909-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add shadow mask for effective host MTRR memtype · 38bf9d7b
      Sean Christopherson authored
      Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask
      instead of relying on TDP being enabled, as NPT doesn't need a non-zero
      mask.  This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero
      for NPT anyways.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220715230016.3762909-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>