    KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlock · 44d17459
    Sean Christopherson authored
    Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
    on x86 due to a chain of locks and SRCU synchronizations.  Translating the
    below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
    CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
    fairness of r/w semaphores).
    
        CPU0                     CPU1                     CPU2
    1   lock(&kvm->slots_lock);
    2                                                     lock(&vcpu->mutex);
    3                                                     lock(&kvm->srcu);
    4                            lock(cpu_hotplug_lock);
    5                            lock(kvm_lock);
    6                            lock(&kvm->slots_lock);
    7                                                     lock(cpu_hotplug_lock);
    8   sync(&kvm->srcu);
    
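    A minimal sketch of the idea (illustrative only, not the literal diff;
    the helper names and the stub callback here are assumptions):

      #include <linux/cpu.h>      /* cpus_read_lock()/cpus_read_unlock() */
      #include <linux/mutex.h>
      #include <linux/smp.h>      /* on_each_cpu() */

      /*
       * Dedicated lock for kvm_usage_count so that refcounting virtualization
       * enabling never nests under kvm_lock, i.e. so the hotplug paths no
       * longer create a kvm_lock <=> cpu_hotplug_lock dependency.
       */
      static DEFINE_MUTEX(kvm_usage_lock);
      static int kvm_usage_count;

      /* Hypothetical stub standing in for the arch per-CPU enable hook. */
      static void hardware_enable_nolock(void *junk)
      {
              /* enable virtualization on this CPU */
      }

      static int hardware_enable_all(void)
      {
              /*
               * cpu_hotplug_lock is ordered outside kvm_usage_lock, and
               * kvm_lock is not taken at all on this path (error handling
               * elided for brevity).
               */
              cpus_read_lock();
              mutex_lock(&kvm_usage_lock);
              if (!kvm_usage_count++)
                      on_each_cpu(hardware_enable_nolock, NULL, 1);
              mutex_unlock(&kvm_usage_lock);
              cpus_read_unlock();
              return 0;
      }
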
    Note, there are likely more potential deadlocks in KVM x86, e.g. the same
    pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
    __kvmclock_cpufreq_notifier():
    
      cpuhp_cpufreq_online()
      |
      -> cpufreq_online()
         |
         -> cpufreq_gov_performance_limits()
            |
            -> __cpufreq_driver_target()
               |
               -> __target_index()
                  |
                  -> cpufreq_freq_transition_begin()
                     |
                     -> cpufreq_notify_transition()
                        |
                        -> ... __kvmclock_cpufreq_notifier()
    
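    The pattern in question, sketched below (simplified; the real notifier in
    arch/x86/kvm/x86.c also tracks per-CPU frequencies and sends IPIs):

      /* Simplified sketch, not the complete upstream function. */
      static void __kvmclock_cpufreq_notifier(struct cpufreq_freqs *freq, int cpu)
      {
              struct kvm *kvm;
              struct kvm_vcpu *vcpu;
              unsigned long i;

              /*
               * Reached from the cpufreq transition path, potentially with
               * cpu_hotplug_lock already held, yet kvm_lock must be taken to
               * safely walk vm_list: cpu_hotplug_lock => kvm_lock.
               */
              mutex_lock(&kvm_lock);
              list_for_each_entry(kvm, &vm_list, vm_list) {
                      kvm_for_each_vcpu(i, vcpu, kvm) {
                              if (vcpu->cpu != cpu)
                                      continue;
                              kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
                      }
              }
              mutex_unlock(&kvm_lock);
      }
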
    But, actually triggering such deadlocks is beyond rare due to the
    combination of dependencies and timings involved.  E.g. the cpufreq
    notifier is only used on older CPUs without a constant TSC, mucking with
    the NX hugepage mitigation while VMs are running is very uncommon, and
    doing so while also onlining/offlining a CPU (necessary to generate
    contention on cpu_hotplug_lock) would be even more unusual.
    
    The most robust solution to the general cpu_hotplug_lock issue is likely
    to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
    notifier doesn't need to take kvm_lock.  For now, settle for fixing the
    most blatant deadlock, as switching to an RCU-protected list is a much
    more involved change, but add a comment in locking.rst to call out that
    care needs to be taken when holding kvm_lock and walking vm_list.
    
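    A rough sketch of that possible future direction (hypothetical, not part
    of this patch): readers would traverse vm_list under RCU instead of
    kvm_lock, with VM creation/destruction switching to list_add_rcu() and
    list_del_rcu() plus an RCU grace period before freeing struct kvm.  The
    helper name below is made up for illustration:

      /* Hypothetical reader-side walk if vm_list were RCU-protected. */
      static void kvmclock_update_all_vms(void)
      {
              struct kvm *kvm;

              rcu_read_lock();
              list_for_each_entry_rcu(kvm, &vm_list, vm_list)
                      kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
              rcu_read_unlock();
      }
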
      ======================================================
      WARNING: possible circular locking dependency detected
      6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
      ------------------------------------------------------
      tee/35048 is trying to acquire lock:
      ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]
    
      but task is already holding lock:
      ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]
    
      which lock already depends on the new lock.
    
       the existing dependency chain (in reverse order) is:
    
      -> #3 (kvm_lock){+.+.}-{3:3}:
             __mutex_lock+0x6a/0xb40
             mutex_lock_nested+0x1f/0x30
             kvm_dev_ioctl+0x4fb/0xe50 [kvm]
             __se_sys_ioctl+0x7b/0xd0
             __x64_sys_ioctl+0x21/0x30
             x64_sys_call+0x15d0/0x2e60
             do_syscall_64+0x83/0x160
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
      -> #2 (cpu_hotplug_lock){++++}-{0:0}:
             cpus_read_lock+0x2e/0xb0
             static_key_slow_inc+0x16/0x30
             kvm_lapic_set_base+0x6a/0x1c0 [kvm]
             kvm_set_apic_base+0x8f/0xe0 [kvm]
             kvm_set_msr_common+0x9ae/0xf80 [kvm]
             vmx_set_msr+0xa54/0xbe0 [kvm_intel]
             __kvm_set_msr+0xb6/0x1a0 [kvm]
             kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
             kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
             __se_sys_ioctl+0x7b/0xd0
             __x64_sys_ioctl+0x21/0x30
             x64_sys_call+0x15d0/0x2e60
             do_syscall_64+0x83/0x160
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
      -> #1 (&kvm->srcu){.+.+}-{0:0}:
             __synchronize_srcu+0x44/0x1a0
             synchronize_srcu_expedited+0x21/0x30
             kvm_swap_active_memslots+0x110/0x1c0 [kvm]
             kvm_set_memslot+0x360/0x620 [kvm]
             __kvm_set_memory_region+0x27b/0x300 [kvm]
             kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
             kvm_vm_ioctl+0x295/0x650 [kvm]
             __se_sys_ioctl+0x7b/0xd0
             __x64_sys_ioctl+0x21/0x30
             x64_sys_call+0x15d0/0x2e60
             do_syscall_64+0x83/0x160
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
      -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
             __lock_acquire+0x15ef/0x2e30
             lock_acquire+0xe0/0x260
             __mutex_lock+0x6a/0xb40
             mutex_lock_nested+0x1f/0x30
             set_nx_huge_pages+0x179/0x1e0 [kvm]
             param_attr_store+0x93/0x100
             module_attr_store+0x22/0x40
             sysfs_kf_write+0x81/0xb0
             kernfs_fop_write_iter+0x133/0x1d0
             vfs_write+0x28d/0x380
             ksys_write+0x70/0xe0
             __x64_sys_write+0x1f/0x30
             x64_sys_call+0x281b/0x2e60
             do_syscall_64+0x83/0x160
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
    
    Cc: Chao Gao <chao.gao@intel.com>
    Fixes: 0bf50497 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
    Cc: stable@vger.kernel.org
    Reviewed-by: Kai Huang <kai.huang@intel.com>
    Acked-by: Kai Huang <kai.huang@intel.com>
    Tested-by: Farrah Chen <farrah.chen@intel.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-ID: <20240830043600.127750-2-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>