Commits · 9fbfeaf110714dd6176e209230569c2dd9a9ad08 · Kirill Smelkov / linux

25 Apr, 2022 12 commits

drm/amd/display: Maintain current link settings in link loss interrupt · 9fbfeaf1

Gary Li authored Apr 14, 2022

[Why]
DP compliance test case 400.3.2.3 is failed because in link loss interrupt
the current link settings is not used in the DP link training.

[How]
In link loss interrupt, use the current link settings in the following DP
link training.
Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Gary Li <garyli12@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9fbfeaf1

drm/amd/display: Remove ddc write and read size checking · e953cd08

Leo Ma authored Mar 23, 2022

[Why]
Customer found I2C over AUX using ADL_Display_DDCBlockAccess_Get
will fail when sending more than 256 bytes of data;

[How]
Remove the write and read size checking to allow sending data more
than 256 bytes;
Reviewed-by: Martin Leung <Martin.Leung@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Leo Ma <hanghong.ma@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e953cd08

drm/amd/display: read PSR-SU cap DPCD for specific panel · d9f442e9

David Zhang authored Apr 11, 2022

[why & how]
For some specific eDP panel, we'd check the PSR-SU cap during boot
by reading the vendor specific DPCD, otherwise it will cause to
false report the eDP panel which supports PSR-SU as an non-PSR-SU
panel.

- add the vendor specific DPCD address in ddc_service_types header
- if specific eDP panel detected, check vendor specific DPCD for
  PSR-SU cap
Reviewed-by: Aurabindo Jayamohanan Pillai <Aurabindo.Pillai@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: David Zhang <dingchen.zhang@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d9f442e9

drm/amd/display: Don't pass HostVM by default on DCN3.1 · 4a0caac0

Michael Strauss authored Apr 05, 2022

[WHY]
Roll back previous change to stop passing this value by default, instead
add a debug flag to override to previous behaviour (or force HostVM calcs)
Reviewed-by: Nicholas Kazlauskas <Nicholas.Kazlauskas@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Michael Strauss <michael.strauss@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4a0caac0

drm/amd/display: Reset cached PSR parameters after hibernate · d2069326

Evgenii Krasnikov authored Apr 05, 2022

[WHY]
After hibernate system might be using old invalid psr_power_opt and
psr_allow_active that never get reset

[HOW]
Reset cached Panel Self Refresh parameters when PSR is first configured
for eDP in dc_link_setup_psr.
Reviewed-by: Harry Vanzylldejong <harry.vanzylldejong@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Evgenii Krasnikov <Evgenii.Krasnikov@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d2069326

drm/amd/display: Add Audio readback registers · e955b547

Ilya Bakoulin authored Mar 28, 2022

[Why]
Can be useful for verifying the correctness of audio output.
Reviewed-by: Aric Cyr <Aric.Cyr@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Ilya Bakoulin <Ilya.Bakoulin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e955b547

drm/amd/display: update dcn315 clk table read · 89c342a9

Dmytro Laktyushkin authored Apr 08, 2022

Clean up the sequence by making sure clk_mgr always builds a
reasonable clock table regardless of what we read from smu
by moving all defaults from resource soc struct to clk_mgr.

Now the only thing resource soc update does is read
the clock table and apply any DC specific policy decisions
to how clocks are populated in dml soc.
Reviewed-by: Qingqing Zhuo <Qingqing.Zhuo@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Dmytro Laktyushkin <Dmytro.Laktyushkin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

89c342a9

drm/amd/display: 3.2.182 · 259f249c

Aric Cyr authored Apr 10, 2022

This version brings along following improvements:
- Fix HDCP QUERY Error for eDP and Tiled
- Insert smu busy status before sending another request
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Aric Cyr <aric.cyr@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

259f249c

drm/amd/display: Fix HDCP QUERY Error for eDP and Tiled · 84ebd73e

Mustapha Ghaddar authored Apr 07, 2022

[WHY]
For dio_output_encoder ID we are relying on SW concept which is
invisible to HW

[HOW]
Needed to create separate cases for when DPIA and non DPIA for
dio link encoder ID
Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com>
Reviewed-by: James Zhang <james.zhang@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Mustapha Ghaddar <mghaddar@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

84ebd73e

drm/amd/display: Insert smu busy status before sending another request · 721af39f

Oliver Logush authored Apr 01, 2022

[why]
Need to check if result register is busy before sending another request

[how]
Call method to check if result register is busy
Reviewed-by: Charlene Liu <Charlene.Liu@amd.com>
Acked-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Oliver Logush <oliver.logush@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

721af39f

drm/amdkfd: Ignore bogus signals from MEC efficiently · c3eb12df

Felix Kuehling authored Apr 07, 2022

MEC firmware sometimes sends signal interrupts without a valid context ID
on end of pipe events that don't intend to signal any HSA signals.
This triggers the slow path in kfd_signal_event_interrupt that scans the
entire event page for signaled events. Detect these signals in the top
half interrupt handler to stop processing them as early as possible.

Because we now always treat event ID 0 as invalid, reserve that ID during
process initialization.

v2: Update firmware version checks to support more GPUs
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c3eb12df

drm/amdgpu: Remove useless kfree · b3ef3205

Haowen Bai authored Apr 22, 2022

After alloc fail, we do not need to kfree.
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b3ef3205

22 Apr, 2022 6 commits

drm/amdgpu: Ta fw needs to be loaded for SRIOV aldebaran · a2443ef0

David Yu authored Apr 22, 2022

Load ta fw during psp_init_sriov_microcode to enable XGMI.
It is required to be loaded by both guest and host starting
from Arcturus. Cap fw needs to be loaded first.
Signed-off-by: David Yu <David.Yu@amd.com>
Reviewed-by: Shaoyun.liu <Shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a2443ef0

drm/amd/pm: fix the deadlock issue observed on SI · 114f0887

Evan Quan authored Apr 08, 2022

The adev->pm.mutx is already held at the beginning of
amdgpu_dpm_compute_clocks/amdgpu_dpm_enable_uvd/amdgpu_dpm_enable_vce.
But on their calling path, amdgpu_display_bandwidth_update will be
called and thus its sub functions amdgpu_dpm_get_sclk/mclk. They
will then try to acquire the same adev->pm.mutex and deadlock will
occur.

By placing amdgpu_display_bandwidth_update outside of adev->pm.mutex
protection(considering logically they do not need such protection) and
restructuring the call flow accordingly, we can eliminate the deadlock
issue. This comes with no real logics change.

Fixes: 3712e7a4 ("drm/amd/pm: unified lock protections in amdgpu_dpm.c")
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Link: https://lore.kernel.org/all/9e689fea-6c69-f4b0-8dee-32c4cf7d8f9c@molgen.mpg.de/
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1957Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

114f0887

drm/amdgpu: add RAS fatal error interrupt handler · b3c76814

Tao Zhou authored Apr 19, 2022

The fatal error handler is independent from general ras interrupt
handler since there is no related IH ring.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b3c76814

drm/amdgpu: add RAS poison consumption handler (v2) · 66f87949

Tao Zhou authored Apr 19, 2022

Add support for general RAS poison consumption handler.

v2: remove callback function for poison consumption.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

66f87949

drm/amdgpu: add RAS poison creation handler (v2) · 50a7d025

Tao Zhou authored Apr 08, 2022

Prepare for the implementation of poison consumption handler.

v2: separate umc handler from poison creation.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

50a7d025

drm/amdkfd: use kvcalloc() instead of kvmalloc() in kfd_migrate · cc9d82fc

Yang Wang authored Apr 21, 2022

simplify programming with existing functions.
Signed-off-by: Yang Wang <KevinYang.Wang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

cc9d82fc

21 Apr, 2022 12 commits

drm/amd/amdgpu: Update PF2VF header · e15c9d06

Bokun Zhang authored Apr 21, 2022

- In the latest version of the header, there is a variable name change.
  This should not cause any backward compatibility since the variable is
  at the same offset in the struct.
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e15c9d06

drm/amd/amdgpu: Properly indent PF2VF header · 451913e9

Bokun Zhang authored Apr 21, 2022

- Clean up the identation in the header file
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

451913e9

drm/amd/amdgpu: Update MIT license in SRIOV msg header · c649287a

Bokun Zhang authored Apr 21, 2022

- Update MIT license header
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c649287a

drm/amdgpu/display: make hubp31_program_extended_blank static · 72f05e3b

Alex Deucher authored Apr 21, 2022

It's not used outside of dcn31_hubp.c.
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

72f05e3b

drm/amd/display: Fix memory leak in dcn21_clock_source_create · e4f1e3a2

Miaoqian Lin authored Apr 21, 2022

When dcn20_clk_src_construct() fails, we need to release clk_src.

Fixes: 6f4e6361 ("drm/amd/display: Add Renoir resource (v2)")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e4f1e3a2

drm/amd/display: Remove useless code · 754fc182

Haowen Bai authored Apr 21, 2022

aux_rep only memset but no use at all, so we drop it.
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

754fc182

drm/amdgpu: don't runtime suspend if there are displays attached (v3) · 4020c228

Alex Deucher authored Dec 28, 2021

We normally runtime suspend when there are displays attached if they
are in the DPMS off state, however, if something wakes the GPU
we send a hotplug event on resume (in case any displays were connected
while the GPU was in suspend) which can cause userspace to light
up the displays again soon after they were turned off.

Prior to
commit 087451f3 ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's."),
the driver took a runtime pm reference when the fbdev emulation was
enabled because we didn't implement proper shadowing support for
vram access when the device was off so the device never runtime
suspended when there was a console bound. Once that commit landed,
we now utilize the core fb helper implementation which properly
handles the emulation, so runtime pm now suspends in cases where it did
not before. Ultimately, we need to sort out why runtime suspend in not
working in this case for some users, but this should restore similar
behavior to before.

v2: move check into runtime_suspend
v3: wake ups -> wakeups in comment, retain pm_runtime behavior in
runtime_idle callback

Fixes: 087451f3 ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")
Link: https://lore.kernel.org/r/20220403132322.51c90903@darkstar.example.org/Tested-by: Michele Ballabio <ballabio.m@gmail.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4020c228

Revert "drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too" · 515d7ceb

Lang Yu authored Apr 20, 2022

This reverts commit 36bf9321.

It causes SVM regressions on Vega10 with XNACK-ON. Just revert it
at the moment.

./kfdtest --gtest_filter=KFDSVMRangeTest.MigratePolicyTest
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

515d7ceb

drm/amdgpu: Add debugfs TA load/unload/invoke support · e50d9ba0

Candice Li authored Apr 17, 2022

v1:
Add debugfs support to load/unload/invoke TA in runtime.

v2:
1. Update some variables to static.
2. Use PAGE_ALIGN to calculate shared buf size directly.
3. Remove fp check.
4. Update debugfs from read to write.
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e50d9ba0

drm/amdgpu: Use indirect buffer and save response status for TA load/invoke · fe96e563

Candice Li authored Apr 17, 2022

The upcoming TA debugfs interface needs to use indirect buffer
when performing TA invoke and check psp response status for TA
load and invoke.
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

fe96e563

drm/amdkfd: CRIU add support for GWS queues · 747eea07

David Yat Sin authored Apr 13, 2022

Add support to checkpoint/restore GWS (Global Wave Sync) queues.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

747eea07

drm/amdkfd: Fix GWS queue count · ab4d51d4

David Yat Sin authored Apr 18, 2022

dqm->gws_queue_count and pdd->qpd.mapped_gws_queue need to be updated
each time the queue gets evicted.

Fixes: b8020b03 ("drm/amdkfd: Enable over-subscription with >1 GWS queue")
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ab4d51d4

19 Apr, 2022 10 commits

MAINTAINERS: add docs entry to AMDGPU · 4ae6eeed

Tales Lelo da Aparecida authored Apr 15, 2022

To make sure maintainers of amdgpu drivers are aware of any changes
 in their documentation, add its entry to MAINTAINERS.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tales Lelo da Aparecida <tales.aparecida@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4ae6eeed

Documentation/gpu: Add entries to amdgpu glossary · 6954e5ba

Tales Lelo da Aparecida authored Apr 15, 2022

Add missing acronyms to the amdgppu glossary.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1939Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tales Lelo da Aparecida <tales.aparecida@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

6954e5ba

drm/radeon/kms: change evergreen_default_state table from global to static · 79847f13

Tom Rix authored Apr 16, 2022

evergreen_default_state and evergreen_default_size are only
used in evergreen.c.  Single file symbols should be static.
So move their definitions to evergreen_blit_shaders.h
and change their storage-class-specifier to static.

Remove unneeded evergreen_blit_shader.c

evergreen_ps/vs definitions were removed with
commit 4f862967 ("drm/radeon/kms: remove r6xx+ blit copy routines")
So their declarations in evergreen_blit_shader.h
are not needed, so remove them.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

79847f13

drm/amd/display: add virtual_setup_stream_attribute decl to header · 3eccf76c

Tom Rix authored Apr 18, 2022

Smatch reports this issue
virtual_link_hwss.c:32:6: warning: symbol
  'virtual_setup_stream_attribute' was not declared.
  Should it be static?

virtual_setup_stream_attribute is only used in
virtual_link_hwss.c, but the other functions in the
file are declared in the header file and used elsewhere.
For consistency, add the virtual_setup_stream_attribute
decl to virtual_link_hwss.h.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3eccf76c

drm/amd/pm: fix double free in si_parse_power_table() · f3fa2bec

Keita Suzuki authored Apr 19, 2022

In function si_parse_power_table(), array adev->pm.dpm.ps and its member
is allocated. If the allocation of each member fails, the array itself
is freed and returned with an error code. However, the array is later
freed again in si_dpm_fini() function which is called when the function
returns an error.

This leads to potential double free of the array adev->pm.dpm.ps, as
well as leak of its array members, since the members are not freed in
the allocation function and the array is not nulled when freed.
In addition adev->pm.dpm.num_ps, which keeps track of the allocated
array member, is not updated until the member allocation is
successfully finished, this could also lead to either use after free,
or uninitialized variable access in si_dpm_fini().

Fix this by postponing the free of the array until si_dpm_fini() and
increment adev->pm.dpm.num_ps everytime the array member is allocated.
Signed-off-by: Keita Suzuki <keitasuzuki.park@sslab.ics.keio.ac.jp>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f3fa2bec

drm/amd/display: make hubp1_wait_pipe_read_start() static · a26b9e0b

Tales Lelo da Aparecida authored Apr 15, 2022

It's a local function, let's make it static.

AGD: remove prototype in dcn10_hubp.h
Signed-off-by: Tales Lelo da Aparecida <tales.aparecida@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a26b9e0b

amdgpu/pm: Clarify documentation of error handling in send_smc_mesg · f24044bd

Darren Powell authored Apr 05, 2022

Clarify the smu_cmn_send_smc_msg_with_param documentation to mention two
cases exist where messages are silently dropped with no error returned.
These cases occur in unusual situations where either:
 1. the message type is not allowed to a virtual GPU, or
 2. a PCI recovery is underway and the HW is not yet in sync with the SW

For more details see
 commit 4ea5081c ("drm/amd/powerplay: enable SMC message filter")
 commit bf36b52e ("drm/amdgpu: Avoid accessing HW when suspending SW state")

(v2)
  Reworked with suggestions from Luben & Paul

(v3)
  Updated wording as per Luben's feedback
  Corrected error stating all messages denied on virtual GPU
  (each GPU has mask of which messages are allowed)
Signed-off-by: Darren Powell <darren.powell@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f24044bd

drm/amdgpu/pm: fix the null pointer while the smu is disabled · eea5c7b3

Huang Rui authored Apr 14, 2022

It needs to check if the pp_funcs is initialized while release the
context, otherwise it will trigger null pointer panic while the software
smu is not enabled.

[ 1109.404555] BUG: kernel NULL pointer dereference, address: 0000000000000078
[ 1109.404609] #PF: supervisor read access in kernel mode
[ 1109.404638] #PF: error_code(0x0000) - not-present page
[ 1109.404657] PGD 0 P4D 0
[ 1109.404672] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1109.404701] CPU: 7 PID: 9150 Comm: amdgpu_test Tainted: G           OEL    5.16.0-custom #1
[ 1109.404732] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 1109.404765] RIP: 0010:amdgpu_dpm_force_performance_level+0x1d/0x170 [amdgpu]
[ 1109.405109] Code: 5d c3 44 8b a3 f0 80 00 00 eb e5 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 08 4c 8b b7 f0 7d 00 00 <49> 83 7e 78 00 0f 84 f2 00 00 00 80 bf 87 80 00 00 00 48 89 fb 0f
[ 1109.405176] RSP: 0018:ffffaf3083ad7c20 EFLAGS: 00010282
[ 1109.405203] RAX: 0000000000000000 RBX: ffff9796b1c14600 RCX: 0000000002862007
[ 1109.405229] RDX: ffff97968591c8c0 RSI: 0000000000000001 RDI: ffff9796a3700000
[ 1109.405260] RBP: ffffaf3083ad7c50 R08: ffffffff9897de00 R09: ffff979688d9db60
[ 1109.405286] R10: 0000000000000000 R11: ffff979688d9db90 R12: 0000000000000001
[ 1109.405316] R13: ffff9796a3700000 R14: 0000000000000000 R15: ffff9796a3708fc0
[ 1109.405345] FS:  00007ff055cff180(0000) GS:ffff9796bfdc0000(0000) knlGS:0000000000000000
[ 1109.405378] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1109.405400] CR2: 0000000000000078 CR3: 000000000a394000 CR4: 00000000000506e0
[ 1109.405434] Call Trace:
[ 1109.405445]  <TASK>
[ 1109.405456]  ? delete_object_full+0x1d/0x20
[ 1109.405480]  amdgpu_ctx_set_stable_pstate+0x7c/0xa0 [amdgpu]
[ 1109.405698]  amdgpu_ctx_fini.part.0+0xcb/0x100 [amdgpu]
[ 1109.405911]  amdgpu_ctx_do_release+0x71/0x80 [amdgpu]
[ 1109.406121]  amdgpu_ctx_ioctl+0x52d/0x550 [amdgpu]
[ 1109.406327]  ? _raw_spin_unlock+0x1a/0x30
[ 1109.406354]  ? drm_gem_handle_delete+0x81/0xb0 [drm]
[ 1109.406400]  ? amdgpu_ctx_get_entity+0x2c0/0x2c0 [amdgpu]
[ 1109.406609]  drm_ioctl_kernel+0xb6/0x140 [drm]
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

eea5c7b3

drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too · 36bf9321

Lang Yu authored Apr 15, 2022

The idea is from
commit a50fe707 ("drm/amdkfd: Only apply heavy-weight TLB flush on Aldebaran")
and
commit f61c40c0 ("drm/amdkfd: enable heavy-weight TLB flush on Arcturus").

At the moment, heavy-weight TLB could cause problems on ASICs except
Aldebaran and Arcturus.

A simple hipMallocManaged/hipFree program could trigger this issue.

[   97.787657] amdgpu 0000:01:00.0: amdgpu: wait for kiq fence error: 0.
[  106.868758] amdgpu: qcm fence wait loop timeout expired
[  106.868966] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  106.869203] amdgpu: Failed to evict process queues
[  106.869261] amdgpu: Failed to quiesce KFD
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

36bf9321

drm/amdkfd: move kfd_flush_tlb_after_unmap into kfd_priv.h · 459ccca5

Lang Yu authored Apr 14, 2022

To make kfd_flush_tlb_after_unmap visible in kfd_svm.c,
move it into kfd_priv.h. And change it to an inline function.
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

459ccca5