1. 25 May, 2022 3 commits
    • Merge tag 'kvmarm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD · 47e8eec8
      Paolo Bonzini authored
      KVM/arm64 updates for 5.19
      
      - Add support for the ARMv8.6 WFxT extension
      
      - Guard pages for the EL2 stacks
      
      - Trap and emulate AArch32 ID registers to hide unsupported features
      
      - Ability to select and save/restore the set of hypercalls exposed
        to the guest
      
      - Support for PSCI-initiated suspend in collaboration with userspace
      
      - GICv3 register-based LPI invalidation support
      
      - Move host PMU event merging into the vcpu data structure
      
      - GICv3 ITS save/restore fixes
      
      - The usual set of small-scale cleanups and fixes
      
      [Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated
       from 4 to 6. - Paolo]
    • KVM: selftests: x86: Fix test failure on arch lbr capable platforms · 825be3b5
      Yang Weijiang authored
      On Arch LBR capable platforms, LBR_FMT in the perf capability MSR is 0x3f,
      so the last format test will fail. Use a truly invalid format (0x30) for
      the test if it's running on these platforms. Opportunistically change
      the file name to reflect the tests actually carried out.
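      
      As a sketch, the runtime selection could look like this (PMU_CAP_LBR_FMT
      mirrors the selftest's mask; the surrounding scaffolding is illustrative,
      not the actual test code):
      
      	#define PMU_CAP_LBR_FMT	0x3f	/* PERF_CAPABILITIES.LBR_FMT, bits 5:0 */
      
      	uint64_t host_cap = kvm_get_feature_msr(MSR_IA32_PERF_CAPABILITIES);
      	uint64_t lbr_fmt = host_cap & PMU_CAP_LBR_FMT;
      
      	/* 0x3f is the Arch LBR format and thus *valid* on those parts;
      	 * 0x30 is reported by no CPU and stays invalid everywhere. */
      	uint64_t bad_fmt = (lbr_fmt == PMU_CAP_LBR_FMT) ? 0x30 : PMU_CAP_LBR_FMT;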
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
      Message-Id: <20220512084046.105479-1-weijiang.yang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Trace LAPIC timer expiration on every vmentry · e0ac5351
      Wanpeng Li authored
      In commit ec0671d5 ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
      tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
      was moved after guest_exit_irqoff() because invoking tracepoints within
      kvm_guest_enter/kvm_guest_exit caused a lockdep splat.
      
      These days this is not necessary, because commit 87fa7f3e ("x86/kvm:
      Move context tracking where it belongs", 2020-07-09) restricted
      the RCU extended quiescent state to be closer to vmentry/vmexit.
      Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
      because it will be reported even if vcpu_enter_guest causes multiple
      vmentries via the IPI/Timer fast paths, and it allows the removal of
      advance_expire_delta.
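      
      Condensed, the new placement looks roughly like this (a simplified sketch
      of lapic.c; the actual function also handles the timer-advance
      adjustment):
      
      	static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
      	{
      		struct kvm_timer *ktimer = &vcpu->arch.apic->lapic_timer;
      		u64 guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
      
      		/* Firing here reports the delta on *every* vmentry,
      		 * including the IPI/timer fast paths, and removes the
      		 * need to cache advance_expire_delta for later. */
      		trace_kvm_wait_lapic_expire(vcpu->vcpu_id,
      					    guest_tsc - ktimer->tsc_deadline);
      
      		if (guest_tsc < ktimer->tsc_deadline)
      			__wait_lapic_expire(vcpu,
      					    ktimer->tsc_deadline - guest_tsc);
      	}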
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 16 May, 2022 12 commits
    • Merge branch kvm-arm64/its-save-restore-fixes-5.19 into kvmarm-master/next · 5c0ad551
      Marc Zyngier authored
      * kvm-arm64/its-save-restore-fixes-5.19:
        : .
        : Tighten the ITS save/restore infrastructure to fail early rather
        : than late. Patches courtesy of Ricardo Koller.
        : .
        KVM: arm64: vgic: Undo work in failed ITS restores
        KVM: arm64: vgic: Do not ignore vgic_its_restore_cte failures
        KVM: arm64: vgic: Add more checks when restoring ITS tables
        KVM: arm64: vgic: Check that new ITEs could be saved in guest memory
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/misc-5.19 into kvmarm-master/next · 822ca7f8
      Marc Zyngier authored
      * kvm-arm64/misc-5.19:
        : .
        : Misc fixes and general improvements for KVM/arm64:
        :
        : - Better handle out of sequence sysregs in the global tables
        :
        : - Remove a couple of unnecessary loads from constant pool
        :
        : - Drop unnecessary pKVM checks
        :
        : - Add all known M1 implementations to the SEIS workaround
        :
        : - Cleanup kerneldoc warnings
        : .
        KVM: arm64: vgic-v3: List M1 Pro/Max as requiring the SEIS workaround
        KVM: arm64: pkvm: Don't mask already zeroed FEAT_SVE
        KVM: arm64: pkvm: Drop unnecessary FP/SIMD trap handler
        KVM: arm64: nvhe: Eliminate kernel-doc warnings
        KVM: arm64: Avoid unnecessary absolute addressing via literals
        KVM: arm64: Print emulated register table name when it is unsorted
        KVM: arm64: Don't BUG_ON() if emulated register table is unsorted
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/per-vcpu-host-pmu-data into kvmarm-master/next · 8794b4f5
      Marc Zyngier authored
      * kvm-arm64/per-vcpu-host-pmu-data:
        : .
        : Pass the host PMU state in the vcpu to avoid the use of additional
        : shared memory between EL1 and EL2 (this obviously only applies
        : to nVHE and Protected setups).
        :
        : Patches courtesy of Fuad Tabba.
        : .
        KVM: arm64: pmu: Restore compilation when HW_PERF_EVENTS isn't selected
        KVM: arm64: Reenable pmu in Protected Mode
        KVM: arm64: Pass pmu events to hyp via vcpu
        KVM: arm64: Repack struct kvm_pmu to reduce size
        KVM: arm64: Wrapper for getting pmu_events
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/vgic-invlpir into kvmarm-master/next · ec2cff6c
      Marc Zyngier authored
      * kvm-arm64/vgic-invlpir:
        : .
        : Implement MMIO-based LPI invalidation for vGICv3.
        : .
        KVM: arm64: vgic-v3: Advertise GICR_CTLR.{IR, CES} as a new GICD_IIDR revision
        KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation
        KVM: arm64: vgic-v3: Expose GICR_CTLR.RWP when disabling LPIs
        irqchip/gic-v3: Exposes bit values for GICR_CTLR.{IR, CES}
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/psci-suspend into kvmarm-master/next · 3b8e21e3
      Marc Zyngier authored
      * kvm-arm64/psci-suspend:
        : .
        : Add support for PSCI SYSTEM_SUSPEND and allow userspace to
        : filter the wake-up events.
        :
        : Patches courtesy of Oliver.
        : .
        Documentation: KVM: Fix title level for PSCI_SUSPEND
        selftests: KVM: Test SYSTEM_SUSPEND PSCI call
        selftests: KVM: Refactor psci_test to make it amenable to new tests
        selftests: KVM: Use KVM_SET_MP_STATE to power off vCPU in psci_test
        selftests: KVM: Create helper for making SMCCC calls
        selftests: KVM: Rename psci_cpu_on_test to psci_test
        KVM: arm64: Implement PSCI SYSTEM_SUSPEND
        KVM: arm64: Add support for userspace to suspend a vCPU
        KVM: arm64: Return a value from check_vcpu_requests()
        KVM: arm64: Rename the KVM_REQ_SLEEP handler
        KVM: arm64: Track vCPU power state using MP state values
        KVM: arm64: Dedupe vCPU power off helpers
        KVM: arm64: Don't depend on fallthrough to hide SYSTEM_RESET2
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • Merge branch kvm-arm64/hcall-selection into kvmarm-master/next · 0586e28a
      Marc Zyngier authored
      * kvm-arm64/hcall-selection:
        : .
        : Introduce a new set of virtual sysregs for userspace to
        : select the hypercalls it wants to see exposed to the guest.
        :
        : Patches courtesy of Raghavendra and Oliver.
        : .
        KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run
        KVM: arm64: Hide KVM_REG_ARM_*_BMAP_BIT_COUNT from userspace
        Documentation: Fix index.rst after psci.rst renaming
        selftests: KVM: aarch64: Add the bitmap firmware registers to get-reg-list
        selftests: KVM: aarch64: Introduce hypercall ABI test
        selftests: KVM: Create helper for making SMCCC calls
        selftests: KVM: Rename psci_cpu_on_test to psci_test
        tools: Import ARM SMCCC definitions
        Docs: KVM: Add doc for the bitmap firmware registers
        Docs: KVM: Rename psci.rst to hypercalls.rst
        KVM: arm64: Add vendor hypervisor firmware register
        KVM: arm64: Add standard hypervisor firmware register
        KVM: arm64: Setup a framework for hypercall bitmap firmware registers
        KVM: arm64: Factor out firmware register handling from psci.c
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run · 528ada28
      Marc Zyngier authored
      We generally want to disallow hypercall bitmaps being changed
      once vcpus have already run. But we must allow the write if
      the written value is unchanged so that userspace can rewrite
      the register file on reboot, for example.
      
      Without this, a QEMU-based VM will fail to reboot correctly.
      
      The original code was correct; it was me who introduced
      the regression.
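      
      A minimal sketch of the resulting rule (control flow simplified; the
      helper and variable names are illustrative of the real code):
      
      	if (kvm_vm_has_ran_once(kvm)) {
      		/* Rewriting the current value must succeed so that
      		 * userspace can replay the register file on reboot... */
      		if (val != *fw_reg_bmap)
      			return -EBUSY;	/* ...but actual changes are refused */
      		return 0;
      	}
      
      	WRITE_ONCE(*fw_reg_bmap, val);
      	return 0;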
      
      Fixes: 05714cab ("KVM: arm64: Setup a framework for hypercall bitmap firmware registers")
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm64: vgic: Undo work in failed ITS restores · 8c5e74c9
      Ricardo Koller authored
      Failed ITS restores should clean up all state restored until the
      failure. There is some cleanup already present when failing to restore
      some tables, but it's not complete. Add the missing cleanup.
      
      Note that this changes the behavior in case of a failed restore of the
      device tables.
      
      	restore ioctl:
      	1. restore collection tables
      	2. restore device tables
      
      With this commit, failures in 2. clean up everything created so far,
      including state created by 1.
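      
      The resulting shape of the ioctl, as a sketch (helper names are
      illustrative of the vgic-its code, not verbatim):
      
      	static int vgic_its_restore_tables(struct kvm *kvm, struct vgic_its *its)
      	{
      		int ret;
      
      		ret = vgic_its_restore_collection_table(its);	/* step 1 */
      		if (ret)
      			return ret;
      
      		ret = vgic_its_restore_device_tables(its);	/* step 2 */
      		if (ret)
      			/* undo everything, including step 1's state */
      			vgic_its_free_collection_list(kvm, its);
      
      		return ret;
      	}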
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220510001633.552496-5-ricarkol@google.com
    • KVM: arm64: vgic: Do not ignore vgic_its_restore_cte failures · a1ccfd6f
      Ricardo Koller authored
      Restoring a corrupted collection entry (such as an out-of-range ID) is
      silently ignored and treated as success. More specifically, a
      vgic_its_restore_cte failure is treated as success by
      vgic_its_restore_collection_table.  vgic_its_restore_cte uses positive
      and negative numbers to return error, and +1 to return success.  The
      caller then uses "ret > 0" to check for success.
      
      Fix this by having vgic_its_restore_cte only return negative numbers on
      error.  Do this by changing alloc_collection return codes to only return
      negative numbers on error.
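      
      The pitfall, in a simplified before/after sketch (not the actual diff):
      
      	/* Before: vgic_its_restore_cte could return a positive error,
      	 * which this check silently counted as success: */
      	ret = vgic_its_restore_cte(its, gpa, esz);
      	if (ret > 0)		/* "valid entry, keep scanning" */
      		continue;
      
      	/* After: +1 means a valid entry, 0 means the last entry, and
      	 * errors are strictly negative, so the check is sound: */
      	if (ret < 0)
      		return ret;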
      Signed-off-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220510001633.552496-4-ricarkol@google.com
    • KVM: arm64: vgic: Add more checks when restoring ITS tables · 243b1f6c
      Ricardo Koller authored
      Try to improve the predictability of ITS save/restores (and debuggability
      of failed ITS saves) by failing early on restore when trying to read
      corrupted tables.
      
      Restoring the ITS tables does some checks for corrupted tables, but not as
      many as in a save: an overflowing device ID will be detected on save but
      not on restore.  The consequence is that restoring a corrupted table won't
      be detected until the next save, with the ITS possibly not working as expected
      after the restore.  As an example, if the guest sets tables overlapping
      each other, which would most likely result in some corrupted table, this is
      what we would see from the host point of view:
      
      	guest sets base addresses that overlap each other
      	save ioctl
      	restore ioctl
      	save ioctl (fails)
      
      Ideally, we would like the first save to fail, but overlapping tables could
      actually be intended by the guest. So, let's at least fail on the restore
      with some checks: like checking that device and event IDs don't overflow
      their tables.
      Signed-off-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220510001633.552496-3-ricarkol@google.com
    • KVM: arm64: vgic: Check that new ITEs could be saved in guest memory · cafe7e54
      Ricardo Koller authored
      Try to improve the predictability of ITS save/restores by failing
      commands that would lead to failed saves. More specifically, fail any
      command that adds an entry into an ITS table that is not in guest
      memory, which would otherwise lead to a failed ITS save ioctl. There
      are already checks for collection and device entries, but not for
      ITEs.  Add the corresponding check for the ITT when adding ITEs.
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220510001633.552496-2-ricarkol@google.com
    • KVM: arm64: pmu: Restore compilation when HW_PERF_EVENTS isn't selected · 20492a62
      Marc Zyngier authored
      Moving kvm_pmu_events into the vcpu (and referring to it) broke the
      somewhat unusual case where the kernel has no support for a PMU
      at all.
      
      In order to solve this, move things around a bit so that we can
      easily avoid referring to the pmu structure outside of PMU-aware
      code. As a bonus, pmu.c isn't compiled in when HW_PERF_EVENTS
      isn't selected.
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/202205161814.KQHpOzsJ-lkp@intel.com
  3. 15 May, 2022 6 commits
  4. 12 May, 2022 13 commits
    • KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps · 6ba1e04f
      Vipin Sharma authored
      Avoid calling handlers on empty rmap entries and skip to the next
      non-empty rmap entry.
      
      Empty rmap entries are a no-op in handlers.
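      
      A simplified sketch of the iterator change (the real slot_rmap_walk_next
      also advances the gfn and level bookkeeping):
      
      	static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iter)
      	{
      		/* Skip ahead until a non-empty rmap head is found, so the
      		 * handler is never invoked on an empty entry. */
      		while (++iter->rmap <= iter->end_rmap) {
      			if (iter->rmap->val)
      				return;
      		}
      		iter->rmap = NULL;	/* walk complete */
      	}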
      Signed-off-by: Vipin Sharma <vipinsh@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220502220347.174664-1-vipinsh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Include MKTME KeyID bits in shadow_zero_check · 3c5c3245
      Kai Huang authored
      Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should
      never be set in an SPTE.  Set shadow_me_value to 0 and shadow_me_mask to
      cover all MKTME KeyID bits so that they are included in shadow_zero_check.
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask · e54f1ff2
      Kai Huang authored
      Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of
      high physical address bits as 'KeyID' bits.  Intel Trust Domain
      Extensions (TDX) further steals part of the MKTME KeyID bits as TDX
      private KeyID bits.  TDX private KeyID bits cannot be set in any mapping
      in the host kernel since they can only be accessed by software running
      inside a new CPU isolated mode.  And unlike AMD's SME, the host kernel
      doesn't set any legacy MKTME KeyID bits in any mapping either.
      Therefore, it's not legitimate for KVM to set any KeyID bits in an SPTE
      which maps guest memory.
      
      KVM maintains shadow_zero_check bits to represent which bits must be
      zero for an SPTE which maps guest memory.  MKTME KeyID bits should be
      added to shadow_zero_check.  Currently, shadow_me_mask is used by AMD to
      set the sme_me_mask in SPTEs, and shadow_me_mask is excluded from
      shadow_zero_check.  So initializing shadow_me_mask to represent all
      MKTME KeyID bits doesn't work for VMX (on the contrary, they must be set
      in shadow_zero_check).
      
      Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
      and repurpose shadow_me_mask as 'all possible memory encryption bits'.
      The new schematic of them will be:
      
       - shadow_me_value: the memory encryption bit(s) that will be set to the
         SPTE (the original shadow_me_mask).
       - shadow_me_mask: all possible memory encryption bits (which is a super
         set of shadow_me_value).
       - For now, shadow_me_value is supposed to be set by SVM and VMX
         respectively, and it is a constant during KVM's lifetime.  This
         perhaps doesn't fit MKTME, but for now the host kernel doesn't
         support it (and perhaps never will).
       - Bits in shadow_me_mask are set to shadow_zero_check, except the bits
         in shadow_me_value.
      
      Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
      Replace shadow_me_mask with shadow_me_value in almost all code paths,
      except the one in PT64_PERM_MASK, which is used by need_remote_flush()
      to determine whether remote TLB flush is needed.  This should still use
      shadow_me_mask as any encryption bit change should need a TLB flush.
      And for AMD, move initializing shadow_me_value/shadow_me_mask from
      kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
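      
      A minimal sketch of the new helper and the invariant it encodes
      (simplified; the WARN is illustrative):
      
      	u64 __read_mostly shadow_me_value;	/* bits actually set in SPTEs   */
      	u64 __read_mostly shadow_me_mask;	/* all possible encryption bits */
      
      	void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
      	{
      		/* shadow_me_value must be a subset of shadow_me_mask */
      		WARN_ON(me_value & ~me_mask);
      		shadow_me_value = me_value;
      		shadow_me_mask = me_mask;
      	}
      
      	/* SVM passes sme_me_mask for both; VMX passes 0 and the KeyID bits. */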
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename reset_rsvds_bits_mask() · c919e881
      Kai Huang authored
      Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
      it clearer that it resets the reserved bits check for guest's page table
      entries.
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Expand and clean up page fault stats · 1075d41e
      Sean Christopherson authored
      Expand and clean up the page fault stats.  The current stats are at best
      incomplete, and at worst misleading.  Differentiate between faults that
      are actually fixed vs those that result in an MMIO SPTE being created,
      track faults that are spurious, faults that trigger emulation, faults
      that are fixed in the fast path, and last but not least, track the
      number of faults that are taken.
      
      Note, the number of faults that require emulation for write-protected
      shadow pages can roughly be calculated by subtracting the number of MMIO
      SPTEs created from the overall number of faults that trigger emulation.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults · 8d5265b1
      Sean Christopherson authored
      Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
      path for TDP page faults.  The generated code is identical, and the #ifdef
      makes it dangerously difficult to extend the logic (guess who forgot to
      add an "else" inside the #ifdef and ran through the page fault handler
      twice).
      
      No functional or binary change intended.
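      
      The pattern in question, as an illustrative before/after sketch (not the
      actual diff):
      
      	/* Before: the fast path hides behind an #ifdef, and a missing
      	 * "else" means the fault gets handled a second time below. */
      #ifdef CONFIG_RETPOLINE
      	if (fault.is_tdp)
      		return kvm_tdp_page_fault(vcpu, &fault);
      #endif
      	return vcpu->arch.mmu->page_fault(vcpu, &fault);
      
      	/* After: identical generated code, but both arms are always
      	 * visible to the compiler and to readers. */
      	if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
      		return kvm_tdp_page_fault(vcpu, &fault);
      	return vcpu->arch.mmu->page_fault(vcpu, &fault);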
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Make all page fault handlers internal to the MMU · 8a009d5b
      Sean Christopherson authored
      Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
      of the page fault handling collateral that was in mmu.h purely for the
      async #PF handler into mmu_internal.h, where it belongs.  This will allow
      kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
      expose those enums outside of the MMU.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" · 5276c616
      Sean Christopherson authored
      Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
      kvm_faultin_pfn() to signal that the page fault handler should continue
      doing its thing.  Aside from being gross and inefficient, using a boolean
      return to signal continue vs. stop makes it extremely difficult to add
      more helpers and/or move existing code to a helper.
      
      E.g. hypothetically, if nested MMUs were to gain a separate page fault
      handler in the future, everything up to the "is self-modifying PTE" check
      can be shared by all shadow MMUs, but communicating up the stack whether
      to continue on or stop becomes a nightmare.
      
      More concretely, proposed support for private guest memory ran into a
      similar issue, where it'll be forced to forego a helper in order to yield
      sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
      
      No functional change intended.
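      
      As a sketch, the calling-convention change looks roughly like this
      (condensed from the mmu code):
      
      	/* Before: a bool "stop?" plus an int out-param with the verdict */
      	if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
      		return r;
      
      	/* After: one RET_PF_* value; RET_PF_CONTINUE means "keep going",
      	 * anything else propagates straight up the stack. */
      	r = handle_abnormal_pfn(vcpu, fault, ACC_ALL);
      	if (r != RET_PF_CONTINUE)
      		return r;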
      
      Cc: David Matlack <dmatlack@google.com>
      Cc: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" · 5c64aba5
      Sean Christopherson authored
      Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
      faults in the access tracking case, and drop the exec/NX check that
      becomes redundant as a result.  No sane hardware will generate an access
      that is both an instruction fetch and a write, i.e. it's a waste of
      cycles.  If hardware goes off the rails, or KVM runs under a misguided
      hypervisor, spuriously running through the fast path is benign (KVM has
      unknowingly been doing exactly that for years).
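      
      Roughly the resulting shape of the check (a simplified sketch, not the
      exact upstream code):
      
      	static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
      	{
      		if (fault->rsvd)
      			return false;
      
      		/* A !PRESENT fault can only be fixed fast when A/D bits
      		 * are disabled, i.e. when it may be an access-tracking
      		 * fault that just needs the SPTE restored. */
      		if (!fault->present)
      			return !kvm_ad_enabled();
      
      		/* Present SPTE: only write-protection faults qualify; the
      		 * old exec/NX test is subsumed by this single check. */
      		return fault->write;
      	}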
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Sean Christopherson authored
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault via the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: clean up pi_wakeup_handler · 91ab933f
      Li RongQing authored
      Passing per_cpu() to list_for_each_entry() causes the macro to be
      evaluated N+1 times for N sleeping vCPUs.  This is a very small
      inefficiency, and the code is cleaner if the address of the per-CPU
      variable is loaded earlier.  Do this for both the list and the spinlock.
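      
      A sketch of the change, using the wakeup_vcpus_on_cpu naming from
      vmx/posted_intr.c (simplified):
      
      	/* Before: per_cpu() was re-evaluated on every iteration, e.g.
      	 *   list_for_each_entry(vmx, &per_cpu(wakeup_vcpus_on_cpu, cpu), ...)
      	 */
      
      	/* After: load the per-CPU addresses once, up front */
      	struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
      	raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
      
      	raw_spin_lock(spinlock);
      	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
      		if (pi_test_on(&vmx->pi_desc))
      			kvm_vcpu_wake_up(&vmx->vcpu);
      	}
      	raw_spin_unlock(spinlock);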
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Message-Id: <1649244302-6777-1-git-send-email-lirongqing@baidu.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness · 33fbe6be
      Maxim Levitsky authored
      This shows up as a TDP MMU leak when running nested.  A non-working
      cmpxchg on L0 makes L1 install two different shadow pages under the same
      spte, and one of them is leaked.
      
      Fixes: 1c2361f6 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 10 May, 2022 2 commits
  6. 06 May, 2022 2 commits
    • KVM: arm64: nvhe: Eliminate kernel-doc warnings · bd61395a
      Randy Dunlap authored
      Don't use begin-kernel-doc notation (/**) for comments that are not in
      kernel-doc format.
      
      This prevents these kernel-doc warnings:
      
      arch/arm64/kvm/hyp/nvhe/switch.c:126: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
       * Disable host events, enable guest events
      arch/arm64/kvm/hyp/nvhe/switch.c:146: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
       * Disable guest events, enable host events
      arch/arm64/kvm/hyp/nvhe/switch.c:164: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
       * Handler for protected VM restricted exceptions.
      arch/arm64/kvm/hyp/nvhe/switch.c:176: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
       * Handler for protected VM MSR, MRS or System instruction execution in AArch64.
      arch/arm64/kvm/hyp/nvhe/switch.c:196: warning: Function parameter or member 'vcpu' not described in 'kvm_handle_pvm_fpsimd'
      arch/arm64/kvm/hyp/nvhe/switch.c:196: warning: Function parameter or member 'exit_code' not described in 'kvm_handle_pvm_fpsimd'
      arch/arm64/kvm/hyp/nvhe/switch.c:196: warning: expecting prototype for Handler for protected floating(). Prototype was for kvm_handle_pvm_fpsimd() instead
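      
      The fix itself is mechanical; a representative hunk would look like this
      (illustrative, not the actual diff):
      
      	-/**
      	+/*
      	  * Disable host events, enable guest events
      	  */
      
      Only comments that are in true kernel-doc format (documenting a specific
      function and its parameters) should open with "/**".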
      
      Fixes: 09cf57eb ("KVM: arm64: Split hyp/switch.c to VHE/nVHE")
      Fixes: 1423afcb ("KVM: arm64: Trap access to pVM restricted features")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Fuad Tabba <tabba@google.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: David Brazdil <dbrazdil@google.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Alexandru Elisei <alexandru.elisei@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: kvmarm@lists.cs.columbia.edu
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220430050123.2844-1-rdunlap@infradead.org
    • KVM: arm64: Avoid unnecessary absolute addressing via literals · 7ee74cc7
      Ard Biesheuvel authored
      There are a few cases in the nVHE code where we take the absolute
      address of a symbol via a literal pool entry, and subsequently translate
      it to another address space (PA, kimg VA, kernel linear VA, etc).
      Originally, this literal was needed because we relied on a different
      translation for absolute references, but this is no longer the case, so
      we can simply use relative addressing instead. This removes a couple of
      RELA entries pointing into the .text segment.
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220428140350.3303481-1-ardb@kernel.org
  7. 05 May, 2022 1 commit
  8. 04 May, 2022 1 commit