    drm/amdkfd: don't use dqm lock during device reset/suspend/resume · 2c99a547
    Philip Yang authored
    If device reset/suspend/resume fails for some reason, the dqm lock is
    held forever, which causes a deadlock. Below is a kernel backtrace from
    when an application opens kfd after suspend/resume has failed.
    
    Instead of holding the dqm lock in pre_reset and releasing it in
    post_reset, add a dqm->sched_running flag which is modified in
    dqm->ops.start and dqm->ops.stop. The flag doesn't need extra lock
    protection because all reads and writes happen inside the dqm lock.
    
    For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
    sched_running flag before sending the updated runlist.
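The flag scheme described above can be sketched as a small userspace model. This is an illustration only, not the driver code: `struct model_dqm`, `model_start`, `model_stop`, `model_map_queues` and `model_update_queues` are hypothetical names, a pthread mutex stands in for the dqm lock, and a counter stands in for sending a runlist. The point is that the lock is held only for the duration of each call, so a failed reset or resume leaves the flag false but leaves the lock free.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the amdkfd structures. */
struct model_dqm {
    pthread_mutex_t lock;   /* stands in for the dqm lock */
    bool sched_running;     /* read and written only while lock is held */
    int runlists_sent;      /* counts simulated runlist submissions */
};

void model_dqm_init(struct model_dqm *dqm)
{
    pthread_mutex_init(&dqm->lock, NULL);
    dqm->sched_running = false;
    dqm->runlists_sent = 0;
}

/* Models ops.start: take the lock briefly, set the flag, release. */
void model_start(struct model_dqm *dqm)
{
    pthread_mutex_lock(&dqm->lock);
    dqm->sched_running = true;
    pthread_mutex_unlock(&dqm->lock);
}

/* Models ops.stop: the lock is NOT kept across the stopped period. */
void model_stop(struct model_dqm *dqm)
{
    pthread_mutex_lock(&dqm->lock);
    dqm->sched_running = false;
    pthread_mutex_unlock(&dqm->lock);
}

/* Caller must hold dqm->lock. Skips the runlist update when stopped. */
int model_map_queues(struct model_dqm *dqm)
{
    if (!dqm->sched_running)
        return 0;           /* scheduler stopped: nothing is sent */
    dqm->runlists_sent++;   /* stands in for sending the runlist */
    return 0;
}

/* Models any entry point that changes queue state and remaps queues. */
int model_update_queues(struct model_dqm *dqm)
{
    int ret;

    pthread_mutex_lock(&dqm->lock);
    ret = model_map_queues(dqm);
    pthread_mutex_unlock(&dqm->lock);
    return ret;
}
```

In this model, calling `model_update_queues` after `model_stop` returns immediately without blocking and without sending anything, which is the behavior the flag is meant to provide in place of a lock held from pre_reset to post_reset.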
    
    v2: For the no-HWS case, when the device is stopped, don't call
    load/destroy_mqd for eviction, restore and create queue, and avoid
    dumping HQDs through debugfs.
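The v2 note for the no-HWS case can be illustrated in the same spirit. The sketch below is one plausible reading, not the driver code: `struct nohws_dqm`, `struct nohws_queue` and the three functions are hypothetical, and the `loaded_on_hw` field stands in for the effect of load_mqd/destroy_mqd. It assumes software bookkeeping is still updated while the device is stopped and only the hardware access is skipped.

```c
#include <stdbool.h>

/* Hypothetical model of a queue and a no-HWS queue manager. */
struct nohws_queue {
    bool is_active;      /* software bookkeeping */
    bool loaded_on_hw;   /* stands in for the effect of load/destroy_mqd */
};

struct nohws_dqm {
    bool sched_running;  /* assumed to be accessed under the dqm lock */
};

/* Eviction: update bookkeeping, touch hardware only when running. */
int nohws_evict_queue(struct nohws_dqm *dqm, struct nohws_queue *q)
{
    q->is_active = false;
    if (!dqm->sched_running)
        return 0;              /* device stopped: skip destroy_mqd */
    q->loaded_on_hw = false;   /* stands in for destroy_mqd */
    return 0;
}

/* Restore: same guard before the simulated load_mqd. */
int nohws_restore_queue(struct nohws_dqm *dqm, struct nohws_queue *q)
{
    q->is_active = true;
    if (!dqm->sched_running)
        return 0;              /* device stopped: skip load_mqd */
    q->loaded_on_hw = true;    /* stands in for load_mqd */
    return 0;
}

/* A debugfs-style dump would read hardware registers, so it is refused
 * while the device is stopped. */
bool nohws_debugfs_dump_allowed(const struct nohws_dqm *dqm)
{
    return dqm->sched_running;
}
```

Under these assumptions, an eviction issued while stopped changes only `is_active`; the hardware-facing state is left for the next start to reconcile.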
    
    Backtrace of dqm lock deadlock:
    
    [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds.
    [Thu Oct 17 16:43:37 2019]       Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
    [Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947 0x80000000
    [Thu Oct 17 16:43:37 2019] Call Trace:
    [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
    [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
    [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
    [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
    [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
    [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
    [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0 [amdgpu]
    [Thu Oct 17 16:43:37 2019]  kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
    [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220 [amdgpu]
    [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
    [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
    [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
    [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
    [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
    [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
    [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
    [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
    [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
    [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
    [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Philip Yang <Philip.Yang@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
kfd_device.c 31.3 KB