    drm/amd/amdgpu: fix a potential deadlock in gpu reset · 9c2876d5
    Lang Yu authored
    When amdgpu_ib_ring_tests fails, the reset logic calls
    amdgpu_device_ip_suspend twice. The second call tries to take
    adev->dm.dc_lock in dm_suspend while it is already held, and
    the task deadlocks.
    Deadlock log:
    
    [  805.655192] amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
    [  806.290952] [drm] free PSP TMR buffer
    
    [  806.319406] ============================================
    [  806.320315] WARNING: possible recursive locking detected
    [  806.321225] 5.11.0-custom #1 Tainted: G        W  OEL
    [  806.322135] --------------------------------------------
    [  806.323043] cat/2593 is trying to acquire lock:
    [  806.323825] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu]
    [  806.325668]
                   but task is already holding lock:
    [  806.326664] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu]
    [  806.328430]
                   other info that might help us debug this:
    [  806.329539]  Possible unsafe locking scenario:
    
    [  806.330549]        CPU0
    [  806.330983]        ----
    [  806.331416]   lock(&adev->dm.dc_lock);
    [  806.332086]   lock(&adev->dm.dc_lock);
    [  806.332738]
                    *** DEADLOCK ***
    
    [  806.333747]  May be due to missing lock nesting notation
    
    [  806.334899] 3 locks held by cat/2593:
    [  806.335537]  #0: ffff888100d3f1b8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_read+0x4e/0x110
    [  806.337009]  #1: ffff888136b1fd78 (&adev->reset_sem){++++}-{3:3}, at: amdgpu_device_lock_adev+0x42/0x94 [amdgpu]
    [  806.339018]  #2: ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu]
    [  806.340869]
                   stack backtrace:
    [  806.341621] CPU: 6 PID: 2593 Comm: cat Tainted: G        W  OEL    5.11.0-custom #1
    [  806.342921] Hardware name: AMD Celadon-CZN/Celadon-CZN, BIOS WLD0C23N_Weekly_20_12_2 12/23/2020
    [  806.344413] Call Trace:
    [  806.344849]  dump_stack+0x93/0xbd
    [  806.345435]  __lock_acquire.cold+0x18a/0x2cf
    [  806.346179]  lock_acquire+0xca/0x390
    [  806.346807]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
    [  806.347813]  __mutex_lock+0x9b/0x930
    [  806.348454]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
    [  806.349434]  ? amdgpu_device_indirect_rreg+0x58/0x70 [amdgpu]
    [  806.350581]  ? _raw_spin_unlock_irqrestore+0x47/0x50
    [  806.351437]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
    [  806.352437]  ? rcu_read_lock_sched_held+0x4f/0x80
    [  806.353252]  ? rcu_read_lock_sched_held+0x4f/0x80
    [  806.354064]  mutex_lock_nested+0x1b/0x20
    [  806.354747]  ? mutex_lock_nested+0x1b/0x20
    [  806.355457]  dm_suspend+0xb8/0x1d0 [amdgpu]
    [  806.356427]  ? soc15_common_set_clockgating_state+0x17d/0x19 [amdgpu]
    [  806.357736]  amdgpu_device_ip_suspend_phase1+0x78/0xd0 [amdgpu]
    [  806.360394]  amdgpu_device_ip_suspend+0x21/0x70 [amdgpu]
    [  806.362926]  amdgpu_device_pre_asic_reset+0xb3/0x270 [amdgpu]
    [  806.365560]  amdgpu_device_gpu_recover.cold+0x679/0x8eb [amdgpu]
    Signed-off-by: Lang Yu <Lang.Yu@amd.com>
    Acked-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_device.c 145 KB