1. 10 Apr, 2024 37 commits
  2. 27 Mar, 2024 3 commits
    • Alex Deucher's avatar
      drm/amdgpu: always force full reset for SOC21 · d7f14876
      Alex Deucher authored
      There are cases where soft reset seems to succeed, but
      does not, so always use mode1/2 for now.
      Reviewed-by: default avatarHarish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      d7f14876
    • Harish Kasiviswanathan's avatar
      drm/amdkfd: Reset GPU on queue preemption failure · d7e8ddc3
      Harish Kasiviswanathan authored
      Currently, with F32 HWS GPU reset is only when unmap queue fails.
      
      However, if compute queue doesn't repond to preemption request in time
      unmap will return without any error. In this case, only preemption error
      is logged and Reset is not triggered. Call GPU reset in this case also.
      Reviewed-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarHarish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Reviewed-by: default avatarMukul Joshi <mukul.joshi@amd.com>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      d7e8ddc3
    • Johannes Weiner's avatar
      drm/amdgpu: fix deadlock while reading mqd from debugfs · c25d09bc
      Johannes Weiner authored
      An errant disk backup on my desktop got into debugfs and triggered the
      following deadlock scenario in the amdgpu debugfs files. The machine
      also hard-resets immediately after those lines are printed (although I
      wasn't able to reproduce that part when reading by hand):
      
      [ 1318.016074][ T1082] ======================================================
      [ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
      [ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
      [ 1318.017598][ T1082] ------------------------------------------------------
      [ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
      [ 1318.018585][ T1082] ffff98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
      [ 1318.019084][ T1082]
      [ 1318.019084][ T1082] but task is already holding lock:
      [ 1318.020052][ T1082] ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
      [ 1318.020607][ T1082]
      [ 1318.020607][ T1082] which lock already depends on the new lock.
      [ 1318.020607][ T1082]
      [ 1318.022081][ T1082]
      [ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
      [ 1318.023083][ T1082]
      [ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
      [ 1318.024114][ T1082]        __ww_mutex_lock.constprop.0+0xe0/0x12f0
      [ 1318.024639][ T1082]        ww_mutex_lock+0x32/0x90
      [ 1318.025161][ T1082]        dma_resv_lockdep+0x18a/0x330
      [ 1318.025683][ T1082]        do_one_initcall+0x6a/0x350
      [ 1318.026210][ T1082]        kernel_init_freeable+0x1a3/0x310
      [ 1318.026728][ T1082]        kernel_init+0x15/0x1a0
      [ 1318.027242][ T1082]        ret_from_fork+0x2c/0x40
      [ 1318.027759][ T1082]        ret_from_fork_asm+0x11/0x20
      [ 1318.028281][ T1082]
      [ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
      [ 1318.029297][ T1082]        dma_resv_lockdep+0x16c/0x330
      [ 1318.029790][ T1082]        do_one_initcall+0x6a/0x350
      [ 1318.030263][ T1082]        kernel_init_freeable+0x1a3/0x310
      [ 1318.030722][ T1082]        kernel_init+0x15/0x1a0
      [ 1318.031168][ T1082]        ret_from_fork+0x2c/0x40
      [ 1318.031598][ T1082]        ret_from_fork_asm+0x11/0x20
      [ 1318.032011][ T1082]
      [ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
      [ 1318.032778][ T1082]        __lock_acquire+0x14bf/0x2680
      [ 1318.033141][ T1082]        lock_acquire+0xcd/0x2c0
      [ 1318.033487][ T1082]        __might_fault+0x58/0x80
      [ 1318.033814][ T1082]        amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
      [ 1318.034181][ T1082]        full_proxy_read+0x55/0x80
      [ 1318.034487][ T1082]        vfs_read+0xa7/0x360
      [ 1318.034788][ T1082]        ksys_read+0x70/0xf0
      [ 1318.035085][ T1082]        do_syscall_64+0x94/0x180
      [ 1318.035375][ T1082]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
      [ 1318.035664][ T1082]
      [ 1318.035664][ T1082] other info that might help us debug this:
      [ 1318.035664][ T1082]
      [ 1318.036487][ T1082] Chain exists of:
      [ 1318.036487][ T1082]   &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
      [ 1318.036487][ T1082]
      [ 1318.037310][ T1082]  Possible unsafe locking scenario:
      [ 1318.037310][ T1082]
      [ 1318.037838][ T1082]        CPU0                    CPU1
      [ 1318.038101][ T1082]        ----                    ----
      [ 1318.038350][ T1082]   lock(reservation_ww_class_mutex);
      [ 1318.038590][ T1082]                                lock(reservation_ww_class_acquire);
      [ 1318.038839][ T1082]                                lock(reservation_ww_class_mutex);
      [ 1318.039083][ T1082]   rlock(&mm->mmap_lock);
      [ 1318.039328][ T1082]
      [ 1318.039328][ T1082]  *** DEADLOCK ***
      [ 1318.039328][ T1082]
      [ 1318.040029][ T1082] 1 lock held by tar/1082:
      [ 1318.040259][ T1082]  #0: ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
      [ 1318.040560][ T1082]
      [ 1318.040560][ T1082] stack backtrace:
      [ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
      [ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
      [ 1318.041614][ T1082] Call Trace:
      [ 1318.041895][ T1082]  <TASK>
      [ 1318.042175][ T1082]  dump_stack_lvl+0x4a/0x80
      [ 1318.042460][ T1082]  check_noncircular+0x145/0x160
      [ 1318.042743][ T1082]  __lock_acquire+0x14bf/0x2680
      [ 1318.043022][ T1082]  lock_acquire+0xcd/0x2c0
      [ 1318.043301][ T1082]  ? __might_fault+0x40/0x80
      [ 1318.043580][ T1082]  ? __might_fault+0x40/0x80
      [ 1318.043856][ T1082]  __might_fault+0x58/0x80
      [ 1318.044131][ T1082]  ? __might_fault+0x40/0x80
      [ 1318.044408][ T1082]  amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
      [ 1318.044749][ T1082]  full_proxy_read+0x55/0x80
      [ 1318.045042][ T1082]  vfs_read+0xa7/0x360
      [ 1318.045333][ T1082]  ksys_read+0x70/0xf0
      [ 1318.045623][ T1082]  do_syscall_64+0x94/0x180
      [ 1318.045913][ T1082]  ? do_syscall_64+0xa0/0x180
      [ 1318.046201][ T1082]  ? lockdep_hardirqs_on+0x7d/0x100
      [ 1318.046487][ T1082]  ? do_syscall_64+0xa0/0x180
      [ 1318.046773][ T1082]  ? do_syscall_64+0xa0/0x180
      [ 1318.047057][ T1082]  ? do_syscall_64+0xa0/0x180
      [ 1318.047337][ T1082]  ? do_syscall_64+0xa0/0x180
      [ 1318.047611][ T1082]  entry_SYSCALL_64_after_hwframe+0x46/0x4e
      [ 1318.047887][ T1082] RIP: 0033:0x7f480b70a39d
      [ 1318.048162][ T1082] Code: 91 ba 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb b2 e8 18 a3 01 00 0f 1f 84 00 00 00 00 00 80 3d a9 3c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
      [ 1318.048769][ T1082] RSP: 002b:00007ffde77f5c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      [ 1318.049083][ T1082] RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f480b70a39d
      [ 1318.049392][ T1082] RDX: 0000000000000800 RSI: 000055c9f2120c00 RDI: 0000000000000008
      [ 1318.049703][ T1082] RBP: 0000000000000800 R08: 000055c9f2120a94 R09: 0000000000000007
      [ 1318.050011][ T1082] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c9f2120c00
      [ 1318.050324][ T1082] R13: 0000000000000008 R14: 0000000000000008 R15: 0000000000000800
      [ 1318.050638][ T1082]  </TASK>
      
      amdgpu_debugfs_mqd_read() holds a reservation when it calls
      put_user(), which may fault and acquire the mmap_sem. This violates
      the established locking order.
      
      Bounce the mqd data through a kernel buffer to get put_user() out of
      the illegal section.
      
      Fixes: 445d85e3 ("drm/amdgpu: add debugfs interface for reading MQDs")
      Cc: stable@vger.kernel.org # v6.5+
      Reviewed-by: default avatarShashank Sharma <shashank.sharma@amd.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      c25d09bc