1. 22 Jun, 2023 1 commit
    • Gavin Shan's avatar
      KVM: Avoid illegal stage2 mapping on invalid memory slot · 2230f9e1
      Gavin Shan authored
      We run into guest hang in edk2 firmware when KSM is kept as running on
      the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
      device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
      buffered write. The status is returned by reading the memory region of
      the pflash device and the read request should have been forwarded to QEMU
      and emulated by it. Unfortunately, the read request is covered by an
      illegal stage2 mapping when the guest hang issue occurs. The read request
      is completed with QEMU bypassed and wrong status is fetched. The edk2
      firmware runs into an infinite loop with the wrong status.
      
      The illegal stage2 mapping is populated due to same page sharing by KSM
      at (C) even the associated memory slot has been marked as invalid at (B)
      when the memory slot is requested to be deleted. It's notable that the
      active and inactive memory slots can't be swapped when we're in the middle
      of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
      is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
      to zero again. Besides, the swapping from the active to the inactive memory
      slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
      corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
      
        CPU-A                    CPU-B
        -----                    -----
                                 ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
                                 kvm_vm_ioctl_set_memory_region
                                 kvm_set_memory_region
                                 __kvm_set_memory_region
                                 kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
                                   kvm_invalidate_memslot
                                     kvm_copy_memslot
                                     kvm_replace_memslot
                                     kvm_swap_active_memslots        (A)
                                     kvm_arch_flush_shadow_memslot   (B)
        same page sharing by KSM
        kvm_mmu_notifier_invalidate_range_start
              :
        kvm_mmu_notifier_change_pte
          kvm_handle_hva_range
          __kvm_handle_hva_range
          kvm_set_spte_gfn            (C)
              :
        kvm_mmu_notifier_invalidate_range_end
      
      Fix the issue by skipping the invalid memory slot at (C) to avoid the
      illegal stage2 mapping so that the read request for the pflash's status
      is forwarded to QEMU and emulated by it. In this way, the correct pflash's
      status can be returned from QEMU to break the infinite loop in the edk2
      firmware.
      
      We tried a git-bisect and the first problematic commit is cd4c7183 ("
      KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
      clean_dcache_guest_page() is called after the memory slots are iterated
      in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
      before the iteration on the memory slots before this commit. This change
      literally enlarges the racy window between kvm_mmu_notifier_change_pte()
      and memory slot removal so that we're able to reproduce the issue in a
      practical test case. However, the issue exists since commit d5d8184d
      ("KVM: ARM: Memory virtualization setup").
      
      Cc: stable@vger.kernel.org # v3.9+
      Fixes: d5d8184d ("KVM: ARM: Memory virtualization setup")
      Reported-by: default avatarShuai Hu <hshuai@redhat.com>
      Reported-by: default avatarZhenyu Zhang <zhenyzha@redhat.com>
      Signed-off-by: default avatarGavin Shan <gshan@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Reviewed-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarShaoqin Huang <shahuang@redhat.com>
      Message-Id: <20230615054259.14911-1-gshan@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2230f9e1
  2. 03 Jun, 2023 5 commits
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-fixes-6.4' of https://github.com/kvm-x86/linux into HEAD · f211b450
      Paolo Bonzini authored
      KVM x86 fixes for 6.4
      
       - Fix a memslot lookup bug in the NX recovery thread that could
         theoretically let userspace bypass the NX hugepage mitigation
      
       - Fix a s/BLOCKING/PENDING bug in SVM's vNMI support
      
       - Account exit stats for fastpath VM-Exits that never leave the super
         tight run-loop
      
       - Fix an out-of-bounds bug in the optimized APIC map code, and add a
         regression test for the race.
      f211b450
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-fixes-6.4-3' of... · 49661a52
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 6.4, take #3
      
      - Fix the reported address of a watchpoint forwarded to userspace
      
      - Fix the freeing of the root of stage-2 page tables
      
      - Stop creating spurious PMU events to perform detection of the
        default PMU and use the existing PMU list instead.
      49661a52
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-fixes-6.4-2' of... · 26f31498
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 6.4, take #2
      
      - Address some fallout of the locking rework, this time affecting
        the way the vgic is configured
      
      - Fix an issue where the page table walker frees a subtree and
        then proceeds with walking what it has just freed...
      
      - Check that a given PA donated to the gues is actually memory
        (only affecting pKVM)
      
      - Correctly handle MTE CMOs by Set/Way
      26f31498
    • Michal Luczaj's avatar
      KVM: selftests: Add test for race in kvm_recalculate_apic_map() · 47d2804b
      Michal Luczaj authored
      Keep switching between LAPIC_MODE_X2APIC and LAPIC_MODE_DISABLED during
      APIC map construction to hunt for TOCTOU bugs in KVM.  KVM's optimized map
      recalc makes multiple passes over the list of vCPUs, and the calculations
      ignore vCPU's whose APIC is hardware-disabled, i.e. there's a window where
      toggling LAPIC_MODE_DISABLED is quite interesting.
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      Co-developed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20230602233250.1014316-4-seanjc@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      47d2804b
    • Sean Christopherson's avatar
      KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-bounds · 4364b287
      Sean Christopherson authored
      Bail from kvm_recalculate_phys_map() and disable the optimized map if the
      target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was added
      and/or enabled its local APIC after the map was allocated.  This fixes an
      out-of-bounds access bug in the !x2apic_format path where KVM would write
      beyond the end of phys_map.
      
      Check the x2APIC ID regardless of whether or not x2APIC is enabled,
      as KVM's hardcodes x2APIC ID to be the vCPU ID, i.e. it can't change, and
      the map allocation in kvm_recalculate_apic_map() doesn't check for x2APIC
      being enabled, i.e. the check won't get false postivies.
      
      Note, this also affects the x2apic_format path, which previously just
      ignored the "x2apic_id > new->max_apic_id" case.  That too is arguably a
      bug fix, as ignoring the vCPU meant that KVM would not send interrupts to
      the vCPU until the next map recalculation.  In practice, that "bug" is
      likely benign as a newly present vCPU/APIC would immediately trigger a
      recalc.  But, there's no functional downside to disabling the map, and
      a future patch will gracefully handle the -E2BIG case by retrying instead
      of simply disabling the optimized map.
      
      Opportunistically add a sanity check on the xAPIC ID size, along with a
      comment explaining why the xAPIC ID is guaranteed to be "good".
      Reported-by: default avatarMichal Luczaj <mhal@rbox.co>
      Fixes: 5b84b029 ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs")
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20230602233250.1014316-2-seanjc@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      4364b287
  3. 02 Jun, 2023 3 commits
    • Sean Christopherson's avatar
      KVM: x86: Account fastpath-only VM-Exits in vCPU stats · 8b703a49
      Sean Christopherson authored
      Increment vcpu->stat.exits when handling a fastpath VM-Exit without
      going through any part of the "slow" path.  Not bumping the exits stat
      can result in wildly misleading exit counts, e.g. if the primary reason
      the guest is exiting is to program the TSC deadline timer.
      
      Fixes: 404d5d7b ("KVM: X86: Introduce more exit_fastpath_completion enum values")
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20230602011920.787844-2-seanjc@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      8b703a49
    • Maciej S. Szmigiero's avatar
      KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK · b2ce8997
      Maciej S. Szmigiero authored
      While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware
      I noticed that with vCPU count large enough (> 16) they sometimes froze at
      boot.
      With vCPU count of 64 they never booted successfully - suggesting some kind
      of a race condition.
      
      Since adding "vnmi=0" module parameter made these guests boot successfully
      it was clear that the problem is most likely (v)NMI-related.
      
      Running kvm-unit-tests quickly showed failing NMI-related tests cases, like
      "multiple nmi" and "pending nmi" from apic-split, x2apic and xapic tests
      and the NMI parts of eventinj test.
      
      The issue was that once one NMI was being serviced no other NMI was allowed
      to be set pending (NMI limit = 0), which was traced to
      svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather
      than for the "NMI pending" flag.
      
      Fix this by testing for the right flag in svm_is_vnmi_pending().
      Once this is done, the NMI-related kvm-unit-tests pass successfully and
      the Windows guest no longer freezes at boot.
      
      Fixes: fa4c027a ("KVM: x86: Add support for SVM's Virtual NMI")
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      b2ce8997
    • Sean Christopherson's avatar
      KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker · 817fa998
      Sean Christopherson authored
      Factor in the address space (non-SMM vs. SMM) of the target shadow page
      when recovering potential NX huge pages, otherwise KVM will retrieve the
      wrong memslot when zapping shadow pages that were created for SMM.  The
      bug most visibly manifests as a WARN on the memslot being non-NULL, but
      the worst case scenario is that KVM could unaccount the shadow page
      without ensuring KVM won't install a huge page, i.e. if the non-SMM slot
      is being dirty logged, but the SMM slot is not.
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015
       kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
       CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re
       RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
       RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
       RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450
       R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98
       R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003
       FS:  0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0
       Call Trace:
        <TASK>
      __pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm]
        kvm_vm_worker_thread+0x106/0x1c0 [kvm]
        kthread+0xd9/0x100
        ret_from_fork+0x2c/0x50
        </TASK>
       ---[ end trace 0000000000000000 ]---
      
      This bug was exposed by commit edbdb43f ("KVM: x86: Preserve TDP MMU
      roots until they are explicitly invalidated"), which allowed KVM to retain
      SMM TDP MMU roots effectively indefinitely.  Before commit edbdb43f,
      KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages
      once all vCPUs exited SMM, which made the window where this bug (recovering
      an SMM NX huge page) could be encountered quite tiny.  To hit the bug, the
      NX recovery thread would have to run while at least one vCPU was in SMM.
      Most VMs typically only use SMM during boot, and so the problematic shadow
      pages were gone by the time the NX recovery thread ran.
      
      Now that KVM preserves TDP MMU roots until they are explicitly invalidated
      (e.g. by a memslot deletion), the window to trigger the bug is effectively
      never closed because most VMMs don't delete memslots after boot (except
      for a handful of special scenarios).
      
      Fixes: eb298605 ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages")
      Reported-by: default avatarFabio Coatti <fabio.coatti@gmail.com>
      Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      817fa998
  4. 31 May, 2023 3 commits
  5. 30 May, 2023 1 commit
  6. 24 May, 2023 3 commits
  7. 21 May, 2023 3 commits
  8. 19 May, 2023 9 commits
    • Michal Luczaj's avatar
      KVM: Fix vcpu_array[0] races · afb2acb2
      Michal Luczaj authored
      In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
      access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
      no failure path requiring vcpu removal and destruction. Such order is
      important because vcpu_array accessors may end up referencing vcpu at
      vcpu_array[0] even before online_vcpus is set to 1.
      
      When online_vcpus=0, any call to kvm_get_vcpu() goes through
      array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
      
      	int num_vcpus = atomic_read(&kvm->online_vcpus);
      	i = array_index_nospec(i, num_vcpus);
      	return xa_load(&kvm->vcpu_array, i);
      
      Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
      an "empty" range, but actually [0, ULONG_MAX]:
      
      	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
      			  (atomic_read(&kvm->online_vcpus) - 1))
      
      In both cases, such online_vcpus=0 edge case, even if leading to
      unnecessary calls to XArray API, should not be an issue; requesting
      unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
      
      However, this means that when the first vCPU is created and inserted in
      vcpu_array *and* before online_vcpus is incremented, code calling
      kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
      
      This should not pose a problem assuming that once a vcpu is stored in
      vcpu_array, it will remain there, but that's not the case:
      kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
      file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
      from the vcpu_array, then destroyed:
      
      	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
      	r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
      	kvm_get_kvm(kvm);
      	r = create_vcpu_fd(vcpu);
      	if (r < 0) {
      		xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
      		kvm_put_kvm_no_destroy(kvm);
      		goto unlock_vcpu_destroy;
      	}
      	atomic_inc(&kvm->online_vcpus);
      
      This results in a possible race condition when a reference to a vcpu is
      acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
      vcpu is destroyed.
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
      Cc: stable@vger.kernel.org
      Fixes: c5b07754 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      afb2acb2
    • Jacob Xu's avatar
      KVM: VMX: Fix header file dependency of asm/vmx.h · 3367eeab
      Jacob Xu authored
      Include a definition of WARN_ON_ONCE() before using it.
      
      Fixes: bb1fcc70 ("KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT")
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarJacob Xu <jacobhxu@google.com>
      [reworded commit message; changed <asm/bug.h> to <linux/bug.h>]
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220225012959.1554168-1-jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3367eeab
    • Sean Christopherson's avatar
      KVM: Don't enable hardware after a restart/shutdown is initiated · e0ceec22
      Sean Christopherson authored
      Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
      been initiated to avoid re-enabling hardware between kvm_reboot() and
      machine_{halt,power_off,restart}().  The restart case is especially
      problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
      SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
      is unable to wake and rendezvous with APs.
      
      Note, this bug, and the original issue that motivated the addition of
      kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
      In a "normal" reboot, userspace will gracefully teardown userspace before
      triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
      that might do ioctl(KVM_CREATE_VM) is long gone.
      
      Fixes: 8e1c1815 ("KVM: VMX: Disable VMX when system shutdown")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Acked-by: default avatarMarc Zyngier <maz@kernel.org>
      Message-Id: <20230512233127.804012-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e0ceec22
    • Sean Christopherson's avatar
      KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown · 6735150b
      Sean Christopherson authored
      Use syscore_ops.shutdown to disable hardware virtualization during a
      reboot instead of using the dedicated reboot_notifier so that KVM disables
      virtualization _after_ system_state has been updated.  This will allow
      fixing a race in KVM's handling of a forced reboot where KVM can end up
      enabling hardware virtualization between kernel_restart_prepare() and
      machine_restart().
      
      Rename KVM's hook to match the syscore op to avoid any possible confusion
      from wiring up a "reboot" helper to a "shutdown" hook (neither "shutdown
      nor "reboot" is completely accurate as the hook handles both).
      
      Opportunistically rewrite kvm_shutdown()'s comment to make it less VMX
      specific, and to explain why kvm_rebooting exists.
      
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Oliver Upton <oliver.upton@linux.dev>
      Cc: James Morse <james.morse@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Zenghui Yu <yuzenghui@huawei.com>
      Cc: kvmarm@lists.linux.dev
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
      Cc: Anup Patel <anup@brainfault.org>
      Cc: Atish Patra <atishp@atishpatra.org>
      Cc: kvm-riscv@lists.infradead.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Acked-by: default avatarMarc Zyngier <maz@kernel.org>
      Message-Id: <20230512233127.804012-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6735150b
    • Will Deacon's avatar
      KVM: arm64: Prevent unconditional donation of unmapped regions from the host · 09cce60b
      Will Deacon authored
      Since host stage-2 mappings are created lazily, we cannot rely solely on
      the pte in order to recover the target physical address when checking a
      host-initiated memory transition as this permits donation of unmapped
      regions corresponding to MMIO or "no-map" memory.
      
      Instead of inspecting the pte, move the addr_is_allowed_memory() check
      into the host callback function where it is passed the physical address
      directly from the walker.
      
      Cc: Quentin Perret <qperret@google.com>
      Fixes: e82edcc7 ("KVM: arm64: Implement do_share() helper for sharing memory")
      Signed-off-by: default avatarWill Deacon <will@kernel.org>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20230518095844.1178-1-will@kernel.org
      09cce60b
    • Jean-Philippe Brucker's avatar
      KVM: arm64: vgic: Fix a comment · 62548732
      Jean-Philippe Brucker authored
      It is host userspace, not the guest, that issues
      KVM_DEV_ARM_VGIC_GRP_CTRL
      Signed-off-by: default avatarJean-Philippe Brucker <jean-philippe@linaro.org>
      Reviewed-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20230518100914.2837292-5-jean-philippe@linaro.org
      62548732
    • Jean-Philippe Brucker's avatar
      KVM: arm64: vgic: Fix locking comment · c38b8400
      Jean-Philippe Brucker authored
      It is now config_lock that must be held, not kvm lock. Replace the
      comment with a lockdep annotation.
      
      Fixes: f0032773 ("KVM: arm64: Use config_lock to protect vgic state")
      Signed-off-by: default avatarJean-Philippe Brucker <jean-philippe@linaro.org>
      Reviewed-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20230518100914.2837292-4-jean-philippe@linaro.org
      c38b8400
    • Jean-Philippe Brucker's avatar
      KVM: arm64: vgic: Wrap vgic_its_create() with config_lock · 9cf2f840
      Jean-Philippe Brucker authored
      vgic_its_create() changes the vgic state without holding the
      config_lock, which triggers a lockdep warning in vgic_v4_init():
      
      [  358.667941] WARNING: CPU: 3 PID: 178 at arch/arm64/kvm/vgic/vgic-v4.c:245 vgic_v4_init+0x15c/0x7a8
      ...
      [  358.707410]  vgic_v4_init+0x15c/0x7a8
      [  358.708550]  vgic_its_create+0x37c/0x4a4
      [  358.709640]  kvm_vm_ioctl+0x1518/0x2d80
      [  358.710688]  __arm64_sys_ioctl+0x7ac/0x1ba8
      [  358.711960]  invoke_syscall.constprop.0+0x70/0x1e0
      [  358.713245]  do_el0_svc+0xe4/0x2d4
      [  358.714289]  el0_svc+0x44/0x8c
      [  358.715329]  el0t_64_sync_handler+0xf4/0x120
      [  358.716615]  el0t_64_sync+0x190/0x194
      
      Wrap the whole of vgic_its_create() with config_lock since, in addition
      to calling vgic_v4_init(), it also modifies the global kvm->arch.vgic
      state.
      
      Fixes: f0032773 ("KVM: arm64: Use config_lock to protect vgic state")
      Signed-off-by: default avatarJean-Philippe Brucker <jean-philippe@linaro.org>
      Reviewed-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20230518100914.2837292-3-jean-philippe@linaro.org
      9cf2f840
    • Jean-Philippe Brucker's avatar
      KVM: arm64: vgic: Fix a circular locking issue · 59112e9c
      Jean-Philippe Brucker authored
      Lockdep reports a circular lock dependency between the srcu and the
      config_lock:
      
      [  262.179917] -> #1 (&kvm->srcu){.+.+}-{0:0}:
      [  262.182010]        __synchronize_srcu+0xb0/0x224
      [  262.183422]        synchronize_srcu_expedited+0x24/0x34
      [  262.184554]        kvm_io_bus_register_dev+0x324/0x50c
      [  262.185650]        vgic_register_redist_iodev+0x254/0x398
      [  262.186740]        vgic_v3_set_redist_base+0x3b0/0x724
      [  262.188087]        kvm_vgic_addr+0x364/0x600
      [  262.189189]        vgic_set_common_attr+0x90/0x544
      [  262.190278]        vgic_v3_set_attr+0x74/0x9c
      [  262.191432]        kvm_device_ioctl+0x2a0/0x4e4
      [  262.192515]        __arm64_sys_ioctl+0x7ac/0x1ba8
      [  262.193612]        invoke_syscall.constprop.0+0x70/0x1e0
      [  262.195006]        do_el0_svc+0xe4/0x2d4
      [  262.195929]        el0_svc+0x44/0x8c
      [  262.196917]        el0t_64_sync_handler+0xf4/0x120
      [  262.198238]        el0t_64_sync+0x190/0x194
      [  262.199224]
      [  262.199224] -> #0 (&kvm->arch.config_lock){+.+.}-{3:3}:
      [  262.201094]        __lock_acquire+0x2b70/0x626c
      [  262.202245]        lock_acquire+0x454/0x778
      [  262.203132]        __mutex_lock+0x190/0x8b4
      [  262.204023]        mutex_lock_nested+0x24/0x30
      [  262.205100]        vgic_mmio_write_v3_misc+0x5c/0x2a0
      [  262.206178]        dispatch_mmio_write+0xd8/0x258
      [  262.207498]        __kvm_io_bus_write+0x1e0/0x350
      [  262.208582]        kvm_io_bus_write+0xe0/0x1cc
      [  262.209653]        io_mem_abort+0x2ac/0x6d8
      [  262.210569]        kvm_handle_guest_abort+0x9b8/0x1f88
      [  262.211937]        handle_exit+0xc4/0x39c
      [  262.212971]        kvm_arch_vcpu_ioctl_run+0x90c/0x1c04
      [  262.214154]        kvm_vcpu_ioctl+0x450/0x12f8
      [  262.215233]        __arm64_sys_ioctl+0x7ac/0x1ba8
      [  262.216402]        invoke_syscall.constprop.0+0x70/0x1e0
      [  262.217774]        do_el0_svc+0xe4/0x2d4
      [  262.218758]        el0_svc+0x44/0x8c
      [  262.219941]        el0t_64_sync_handler+0xf4/0x120
      [  262.221110]        el0t_64_sync+0x190/0x194
      
      Note that the current report, which can be triggered by the vgic_irq
      kselftest, is a triple chain that includes slots_lock, but after
      inverting the slots_lock/config_lock dependency, the actual problem
      reported above remains.
      
      In several places, the vgic code calls kvm_io_bus_register_dev(), which
      synchronizes the srcu, while holding config_lock (#1). And the MMIO
      handler takes the config_lock while holding the srcu read lock (#0).
      
      Break dependency #1, by registering the distributor and redistributors
      without holding config_lock. The ITS also uses kvm_io_bus_register_dev()
      but already relies on slots_lock to serialize calls.
      
      The distributor iodev is created on the first KVM_RUN call. Multiple
      threads will race for vgic initialization, and only the first one will
      see !vgic_ready() under the lock. To serialize those threads, rely on
      slots_lock rather than config_lock.
      
      Redistributors are created earlier, through KVM_DEV_ARM_VGIC_GRP_ADDR
      ioctls and vCPU creation. Similarly, serialize the iodev creation with
      slots_lock, and the rest with config_lock.
      
      Fixes: f0032773 ("KVM: arm64: Use config_lock to protect vgic state")
      Signed-off-by: default avatarJean-Philippe Brucker <jean-philippe@linaro.org>
      Reviewed-by: default avatarOliver Upton <oliver.upton@linux.dev>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20230518100914.2837292-2-jean-philippe@linaro.org
      59112e9c
  9. 17 May, 2023 1 commit
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-fixes-6.4-1' of... · 39428f6e
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 6.4, take #1
      
      - Plug a race in the stage-2 mapping code where the IPA and the PA
        would end up being out of sync
      
      - Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
      
      - FP/SVE/SME documentation update, in the hope that this field
        becomes clearer...
      
      - Add workaround for the usual Apple SEIS brokenness
      
      - Random comment fixes
      39428f6e
  10. 14 May, 2023 11 commits