Commits · c435bce6af9b2a277662698875a689c389358f17 · Kirill Smelkov / linux

10 Apr, 2024 37 commits

drm/amd/display: Add extra DMUB logging to track message timeout · c435bce6

Alvin Lee authored Mar 21, 2024

[Description]
- Add logging for first DMUB inbox message that timed out to diagnostic
  data
- It is useful to track the first failed message for debug purposes
  because once DMUB becomes hung (typically on a message), it will
  remain hung and all subsequent messages. In these cases we're
  interested in knowing which is the first message that failed.
Reviewed-by: Josip Pavic <josip.pavic@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c435bce6

drm/amd/display: add root clock control function pointer to fix display corruption · de2d1105

Xi (Alex) Liu authored Mar 20, 2024

[Why and how]

External display has corruption because no root clock control function. Add the function pointer to fix the issue.
Reviewed-by: Daniel Miess <daniel.miess@amd.com>
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Xi (Alex) Liu <xi.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

de2d1105

drm/amd/display: Disable Z8 minimum stutter period check for DCN35 · 211a06df

Nicholas Kazlauskas authored Mar 20, 2024

[Why]
The threshold is no longer useful for blocking suboptimal power states
for DCN35 based on real measurement.

[How]
Reduce to the minimum threshold duration, 1us.
Reviewed-by: Gabe Teeger <gabe.teeger@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

211a06df

drm/amd/display: Add extra logging for HUBP and OTG · 7eb9d1e0

Alvin Lee authored Mar 20, 2024

[Description]
Add extra logging for DCSURF_FLIP_CNTL, DCHUBP_CNTL,
OTG_MASTER_EN, and OTG_DOUBLE_BUFFER_CONTROL for more
debuggability for a system crash.
Reviewed-by: Samson Tam <samson.tam@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7eb9d1e0

drm/amd/display: Add OTG check for set AV mute · 75d5f90d

Leo (Hanghong) Ma authored Mar 19, 2024

[Why && How]
OTG can be disabled before setting dpms on. Add check to skip wait
when setting AV mute if OTG is disabled.
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Leo (Hanghong) Ma <hanghong.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

75d5f90d

drm/amd/display: Skip on writeback when it's not applicable · cf82a80a

Alex Hung authored Mar 15, 2024

[WHY]
dynamic memory safety error detector (KASAN) catches and generates error
messages "BUG: KASAN: slab-out-of-bounds" as writeback connector does not
support certain features which are not initialized.

[HOW]
Skip them when connector type is DRM_MODE_CONNECTOR_WRITEBACK.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3199Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Rodrigo Siqueira <rodrigo.siqueira@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

cf82a80a

drm/amd/display: Allow HPO PG for DCN35 · b3f98c00

Duncan Ma authored Mar 08, 2024

[Why]
HPO can be power gated unconditionally for
DCN35.

[How]
Set disable flag to false.
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Duncan Ma <duncan.ma@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b3f98c00

drm/amd/display: Enable RCO for HDMISTREAMCLK in DCN35 · dbfb51d1

Daniel Miess authored Mar 15, 2024

[Why & How]
Enable root clock optimization for HDMISTREAMCLK and only
disable it when it's actively being used.
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Daniel Miess <daniel.miess@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dbfb51d1

drm/amd/display: Add dummy interface for tracing DCN32 SMU messages · beb9764a

George Shen authored Mar 08, 2024

[Why/How]
Some issues may require a trace of the previous SMU messages from DC to
understand the context and aid in debugging. Actual logging to be
implemented when needed.
Reviewed-by: Josip Pavic <josip.pavic@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: George Shen <george.shen@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

beb9764a

drm/amd/display: Enable DTBCLK DTO earlier in the sequence · 14f9db42

Sung Joon Kim authored Mar 19, 2024

[why]
As per programming guide, we need to
enable the virtual pixel clock via DTBCLK
DTO and ungate the clock before we begin
programming OPP/OPTC control registers.
Otherwise, the double-buffered registers
will be left pending until the clocks are enabled.

[how]
Move the DTBCLK DTO programming up to
where we do the legacy DP DTO programming.
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Sung Joon Kim <sungjoon.kim@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

14f9db42

drm/amd/display: update pipe topology log to support subvp · 5db346c2

Wenjing Liu authored Mar 19, 2024

[why]
There is an ambiguity in subvp pipe topology log. The log doesn't show
subvp relation to main stream and it is not clear that certain stream
is an internal stream for subvp pipes.

[how]
Separate subvp pipe topology logging from main pipe topology. Log main
stream indices instead of the internal stream for subvp pipes.
The following is a sample log showing 2 streams with subvp enabled on
both:

   pipe topology update
 ________________________
| plane0  slice0  stream0|
|DPP1----OPP1----OTG1----|
| plane0  slice0  stream1|
|DPP0----OPP0----OTG0----|
|    (phantom pipes)     |
| plane0  slice0  stream0|
|DPP3----OPP3----OTG3----|
| plane0  slice0  stream1|
|DPP2----OPP2----OTG2----|
|________________________|
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Wenjing Liu <wenjing.liu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5db346c2

drm/amd/display: Add dmub additional interface support for FAMS · 89e5f42c

Dillon Varone authored Mar 15, 2024

[WHY&HOW]
Update dmub and driver interface for future FAMS revisions.
Reviewed-by: Anthony Koo <anthony.koo@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Dillon Varone <dillon.varone@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

89e5f42c

drm/amdgpu: use vm_update_mode=0 as default in sriov for gfx10.3 onwards · 3e2dacca

Danijel Slivka authored Mar 27, 2024

Apply this rule to all newer asics in sriov case.
For asic with VF MMIO access protection avoid using CPU for VM table updates.
CPU pagetable updates have issues with HDP flush as VF MMIO access protection
blocks write to BIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL register
during sriov runtime.
Moved the check to amdgpu_device_init() to ensure it is done after
amdgpu_device_ip_early_init() where the IP versions are discovered.
Signed-off-by: Danijel Slivka <danijel.slivka@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3e2dacca

drm/amd/amdgpu: add pipe1 hardware support · b7a1a0ef

Arunpravin Paneer Selvam authored Jun 06, 2022

Enable pipe1 support starting from SIENNA CICHLID asic

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2117Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b7a1a0ef

drm/amdgpu: select HDP ref/mask according to gfx ring pipe · 0453e5f2

ZhenGuo Yin authored Mar 29, 2024

Use correct ref/mask for differnent gfx ring pipe.
This should fix the gfx hang issue after enabling gfx pipe1.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2117Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0453e5f2

drm/amd/display: handle invalid connector indices · 60df5628

Joshua Aberback authored Mar 15, 2024

[Why]
The function to count the number of valid connectors does not
guarantee that the first n indices are valid, only that there
exist n valid indices. When invalid indices are present, this
results in later valid connectors being missed, as processing
would end after checking n indices.

[How]
 - count valid indices separately from total indices examined
 - add explicit definition of MAX_LINKS
Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Joshua Aberback <joshua.aberback@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

60df5628

drm/amd/display: FEC overhead should be checked once for mst slot nums · 8b2cb32c

Hersen Wu authored Mar 11, 2024

[Why] Mst slot nums equals to pbn / pbn_div.

Today, pbn_div refers to dm_mst_get_pbn_divider ->
dc_link_bandwidth_kbps. In dp_link_bandwidth_kbps,
which includes effect of FEC overhead already. As
result, we should not include effect of FEC overhead
again while calculating pbn by kpbs_to_peak_pbn
(stream_kbps).

[How] Include FEC overhead within dp_link_bandwidth_kbps.
Remove FEC overhead from kbps_to_peak_pbn.
Reviewed-by: Wayne Lin <wayne.lin@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Hersen Wu <hersenxs.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8b2cb32c

drm/amdgpu: implement IRQ_STATE_ENABLE for SDMA v4.4.2 · 029faefb

Tao Zhou authored Mar 28, 2024

SDMA_CNTL is not set in some cases, driver configures it by itself.

v2: simplify code
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

029faefb

drm/amd/display: Expand supported Replay residency mode · af8999c5

Leon Huang authored Mar 04, 2024

[Why]
Dmub provides several Replay residency calculation methods,
but current interface only supports either ALPM or PHY mode

[How]
Modify the interface for supporting different types
of Replay residency calculation.
Reviewed-by: Robin Chen <robin.chen@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Leon Huang <leon.huang1@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

af8999c5

drm/amd/display: Toggle additional RCO options in DCN35 · c9c70395

Daniel Miess authored Nov 03, 2023

[Why]
With root clock optimization now enabled for DCN35 there
are still RCO registers still not being toggled

[How]
Add in logic to toggle RCO registers for DPPCLK,
DPSTREAMCLK and DSCCLK
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Daniel Miess <daniel.miess@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c9c70395

drm/amd/display: optimize dml2 pipe resource allocation order · 62297b71

Wenjing Liu authored Mar 15, 2024

[why]
There could be cases that we are transition from MPC to ODM combine.
In this case if we map pipes before unmapping MPC pipes, we might
temporarly run out of pipes. The change reorders pipe resource
allocation. So we unmapping pipes before mapping new pipes.
Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Wenjing Liu <wenjing.liu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

62297b71

drm/amd/display: fix underflow in some two display subvp/non-subvp configs · 7ae0caf3

Samson Tam authored Mar 15, 2024

[Why]
In two display configuration, switching between subvp and non-subvp
 may cause underflow because it moves an existing pipe between
 displays

[How]
Create helper function for applying pipe split flags
Apply pipe split flags prior to deciding on subvp
During subvp check, do not merge pipes, so it can retain previous
 pipe configuration
Add check for prev odm pipe in subvp check
For single display subvp case, use same odm policy for phantom pipes
 as main subvp pipe
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Samson Tam <samson.tam@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7ae0caf3

drm/amd/display: Add timing pixel encoding for mst mode validation · 4df96ba6

Hersen Wu authored Mar 11, 2024

[Why] Mode pbn is not calculated correctly because timing pixel encoding is
not checked within convert_dc_color_depth_into_bpc.

[How] Get mode kbps from dc_bandwidth_in_kbps_from_timing, then calculate
pbn by kbps_to_peak_pbn.
Reviewed-by: Wayne Lin <wayne.lin@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Hersen Wu <hersenxs.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4df96ba6

drm/amd/display: Fix compiler redefinition warnings for certain configs · fec85f99

Mounika Adhuri authored Mar 15, 2024

[why & how]
Modified definitions of 1 function and 2 structs to remove warnings on
certain specific compiler configurations due to redefinition.
Reviewed-by: Martin Leung <martin.leung@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Mounika Adhuri <moadhuri@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

fec85f99

drm/amdgpu: add smu 14.0.1 discovery support · f5a3507c

Yifan Zhang authored Dec 12, 2023

This patch to add smu 14.0.1 support
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f5a3507c

drm/amd/swsmu: Update smu v14.0.0 headers to be 14.0.1 compatible · d045f4ad

lima1002 authored Jan 29, 2024

update ppsmc.h pmfw.h and driver_if.h for smu v14_0_1
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: lima1002 <li.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d045f4ad

drm/amdgpu : Increase the mes log buffer size as per new MES FW version · 8966c316

shaoyunl authored Mar 22, 2024

From MES version 0x54, the log entry increased and require the log buffer
size to be increased. The 16k is maximum size agreed
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8966c316

Documentation: add a page on amdgpu debugging · 23808afc

Alex Deucher authored Mar 13, 2024

Covers GPU page fault debugging and adds a reference
to umr.

v2: update client ids to include SQC/G
v3: Remove duplicate text
v4: add umr documentation link, fix typo
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

23808afc

drm/amdgpu : Add mes_log_enable to control mes log feature · e58acb76

shaoyunl authored Mar 22, 2024

The MES log might slow down the performance for extra step of log the data,
disable it by default and introduce a parameter can enable it when necessary
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e58acb76

drm/amd/pm: fixes a random hang in S4 for SMU v13.0.4/11 · fedb6ae4

Tim Huang authored Mar 27, 2024

While doing multiple S4 stress tests, GC/RLC/PMFW get into
an invalid state resulting into hard hangs.

Adding a GFX reset as workaround just before sending the
MP1_UNLOAD message avoids this failure.
Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

fedb6ae4

drm/amdgpu: refine function signature of amdgpu_aca_get_error_data() · 81d96e8b

Yang Wang authored Mar 28, 2024

refine function signature of amdgpu_aca_get_error_data();
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

81d96e8b

drm/amd/display: add DCN 351 version for microcode load · 2dbe9c2b

Li Ma authored Mar 28, 2024

There is a new DCN veriosn 3.5.1 need to load
Signed-off-by: Li Ma <li.ma@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2dbe9c2b

drm/amdgpu: Reset dGPU if suspend got aborted · df3c7dc5

Lijo Lazar authored Feb 14, 2024

For SOC21 ASICs, there is an issue in re-enabling PM features if a
suspend got aborted. In such cases, reset the device during resume
phase. This is a workaround till a proper solution is finalized.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

df3c7dc5

drm/amdgpu: add IP's FW information to devcoredump · 6a0e1baf

Sunil Khatri authored Mar 26, 2024

Add FW information of all the IP's in the devcoredump.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

6a0e1baf

drm/amdgpu/umsch: reinitialize write pointer in hw init · 108ab31b

Lang Yu authored Mar 25, 2024

Otherwise the old one will be used during GPU reset.
That's not expected.
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

108ab31b

drm/amdgpu: Refine IB schedule error logging · cd409dbc

Lijo Lazar authored Mar 21, 2024

Downgrade to debug information when IBs are skipped. Also, use dev_* to
identify the device.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

cd409dbc

drm/amdgpu: make amdgpu device attr_update() function more efficient · 2ea6f4d9

Yang Wang authored Mar 26, 2024

v1:
add a new enumeration type to identify device attribute node,
this method is relatively more efficient compared with 'strcmp' in
update_attr() function.

v2:
rename device_attr_type to device_attr_id.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Ma Jun <majun@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2ea6f4d9

27 Mar, 2024 3 commits

drm/amdgpu: always force full reset for SOC21 · d7f14876

Alex Deucher authored Mar 23, 2024

There are cases where soft reset seems to succeed, but
does not, so always use mode1/2 for now.
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d7f14876

drm/amdkfd: Reset GPU on queue preemption failure · d7e8ddc3

Harish Kasiviswanathan authored Mar 26, 2024

Currently, with F32 HWS GPU reset is only when unmap queue fails.

However, if compute queue doesn't repond to preemption request in time
unmap will return without any error. In this case, only preemption error
is logged and Reset is not triggered. Call GPU reset in this case also.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d7e8ddc3

drm/amdgpu: fix deadlock while reading mqd from debugfs · c25d09bc

Johannes Weiner authored Mar 07, 2024

An errant disk backup on my desktop got into debugfs and triggered the
following deadlock scenario in the amdgpu debugfs files. The machine
also hard-resets immediately after those lines are printed (although I
wasn't able to reproduce that part when reading by hand):

[ 1318.016074][ T1082] ======================================================
[ 1318.016607][ T1082] WARNING: possible circular locking dependency detected
[ 1318.017107][ T1082] 6.8.0-rc7-00015-ge0c8221b72c0 #17 Not tainted
[ 1318.017598][ T1082] ------------------------------------------------------
[ 1318.018096][ T1082] tar/1082 is trying to acquire lock:
[ 1318.018585][ T1082] ffff98c44175d6a0 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x40/0x80
[ 1318.019084][ T1082]
[ 1318.019084][ T1082] but task is already holding lock:
[ 1318.020052][ T1082] ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.020607][ T1082]
[ 1318.020607][ T1082] which lock already depends on the new lock.
[ 1318.020607][ T1082]
[ 1318.022081][ T1082]
[ 1318.022081][ T1082] the existing dependency chain (in reverse order) is:
[ 1318.023083][ T1082]
[ 1318.023083][ T1082] -> #2 (reservation_ww_class_mutex){+.+.}-{3:3}:
[ 1318.024114][ T1082]        __ww_mutex_lock.constprop.0+0xe0/0x12f0
[ 1318.024639][ T1082]        ww_mutex_lock+0x32/0x90
[ 1318.025161][ T1082]        dma_resv_lockdep+0x18a/0x330
[ 1318.025683][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.026210][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.026728][ T1082]        kernel_init+0x15/0x1a0
[ 1318.027242][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.027759][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.028281][ T1082]
[ 1318.028281][ T1082] -> #1 (reservation_ww_class_acquire){+.+.}-{0:0}:
[ 1318.029297][ T1082]        dma_resv_lockdep+0x16c/0x330
[ 1318.029790][ T1082]        do_one_initcall+0x6a/0x350
[ 1318.030263][ T1082]        kernel_init_freeable+0x1a3/0x310
[ 1318.030722][ T1082]        kernel_init+0x15/0x1a0
[ 1318.031168][ T1082]        ret_from_fork+0x2c/0x40
[ 1318.031598][ T1082]        ret_from_fork_asm+0x11/0x20
[ 1318.032011][ T1082]
[ 1318.032011][ T1082] -> #0 (&mm->mmap_lock){++++}-{3:3}:
[ 1318.032778][ T1082]        __lock_acquire+0x14bf/0x2680
[ 1318.033141][ T1082]        lock_acquire+0xcd/0x2c0
[ 1318.033487][ T1082]        __might_fault+0x58/0x80
[ 1318.033814][ T1082]        amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu]
[ 1318.034181][ T1082]        full_proxy_read+0x55/0x80
[ 1318.034487][ T1082]        vfs_read+0xa7/0x360
[ 1318.034788][ T1082]        ksys_read+0x70/0xf0
[ 1318.035085][ T1082]        do_syscall_64+0x94/0x180
[ 1318.035375][ T1082]        entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.035664][ T1082]
[ 1318.035664][ T1082] other info that might help us debug this:
[ 1318.035664][ T1082]
[ 1318.036487][ T1082] Chain exists of:
[ 1318.036487][ T1082]   &mm->mmap_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
[ 1318.036487][ T1082]
[ 1318.037310][ T1082]  Possible unsafe locking scenario:
[ 1318.037310][ T1082]
[ 1318.037838][ T1082]        CPU0                    CPU1
[ 1318.038101][ T1082]        ----                    ----
[ 1318.038350][ T1082]   lock(reservation_ww_class_mutex);
[ 1318.038590][ T1082]                                lock(reservation_ww_class_acquire);
[ 1318.038839][ T1082]                                lock(reservation_ww_class_mutex);
[ 1318.039083][ T1082]   rlock(&mm->mmap_lock);
[ 1318.039328][ T1082]
[ 1318.039328][ T1082]  *** DEADLOCK ***
[ 1318.039328][ T1082]
[ 1318.040029][ T1082] 1 lock held by tar/1082:
[ 1318.040259][ T1082]  #0: ffff98c4c13f55f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_debugfs_mqd_read+0x6a/0x250 [amdgpu]
[ 1318.040560][ T1082]
[ 1318.040560][ T1082] stack backtrace:
[ 1318.041053][ T1082] CPU: 22 PID: 1082 Comm: tar Not tainted 6.8.0-rc7-00015-ge0c8221b72c0 #17 3316c85d50e282c5643b075d1f01a4f6365e39c2
[ 1318.041329][ T1082] Hardware name: Gigabyte Technology Co., Ltd. B650 AORUS PRO AX/B650 AORUS PRO AX, BIOS F20 12/14/2023
[ 1318.041614][ T1082] Call Trace:
[ 1318.041895][ T1082]  <TASK>
[ 1318.042175][ T1082]  dump_stack_lvl+0x4a/0x80
[ 1318.042460][ T1082]  check_noncircular+0x145/0x160
[ 1318.042743][ T1082]  __lock_acquire+0x14bf/0x2680
[ 1318.043022][ T1082]  lock_acquire+0xcd/0x2c0
[ 1318.043301][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043580][ T1082]  ? __might_fault+0x40/0x80
[ 1318.043856][ T1082]  __might_fault+0x58/0x80
[ 1318.044131][ T1082]  ? __might_fault+0x40/0x80
[ 1318.044408][ T1082]  amdgpu_debugfs_mqd_read+0x103/0x250 [amdgpu 8fe2afaa910cbd7654c8cab23563a94d6caebaab]
[ 1318.044749][ T1082]  full_proxy_read+0x55/0x80
[ 1318.045042][ T1082]  vfs_read+0xa7/0x360
[ 1318.045333][ T1082]  ksys_read+0x70/0xf0
[ 1318.045623][ T1082]  do_syscall_64+0x94/0x180
[ 1318.045913][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.046201][ T1082]  ? lockdep_hardirqs_on+0x7d/0x100
[ 1318.046487][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.046773][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.047057][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.047337][ T1082]  ? do_syscall_64+0xa0/0x180
[ 1318.047611][ T1082]  entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 1318.047887][ T1082] RIP: 0033:0x7f480b70a39d
[ 1318.048162][ T1082] Code: 91 ba 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb b2 e8 18 a3 01 00 0f 1f 84 00 00 00 00 00 80 3d a9 3c 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 53 48 83
[ 1318.048769][ T1082] RSP: 002b:00007ffde77f5c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1318.049083][ T1082] RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f480b70a39d
[ 1318.049392][ T1082] RDX: 0000000000000800 RSI: 000055c9f2120c00 RDI: 0000000000000008
[ 1318.049703][ T1082] RBP: 0000000000000800 R08: 000055c9f2120a94 R09: 0000000000000007
[ 1318.050011][ T1082] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c9f2120c00
[ 1318.050324][ T1082] R13: 0000000000000008 R14: 0000000000000008 R15: 0000000000000800
[ 1318.050638][ T1082]  </TASK>

amdgpu_debugfs_mqd_read() holds a reservation when it calls
put_user(), which may fault and acquire the mmap_sem. This violates
the established locking order.

Bounce the mqd data through a kernel buffer to get put_user() out of
the illegal section.

Fixes: 445d85e3 ("drm/amdgpu: add debugfs interface for reading MQDs")
Cc: stable@vger.kernel.org # v6.5+
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c25d09bc