Commits · 1ade5f84cc25ddd02161859b59345dca9aabc2e8 · Kirill Smelkov / linux

01 Jul, 2021 21 commits

drm/amdkfd: skip invalid pages during migrations · 1ade5f84

Alex Sierra authored May 12, 2021

Invalid pages can result from pages that have already been migrated
by a copy-on-write procedure, or from pages that were never
migrated to VRAM in the first place. This is no longer an issue,
as pranges now support mixed memory domains (CPU/GPU).
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: classify and map mixed svm range pages in GPU · 1d5dbfe6

Alex Sierra authored May 05, 2021

[Why]
svm ranges can have mixed pages from device or system memory.
A good example is when a prange has been allocated in VRAM and a
copy-on-write is then triggered by a fork. This invalidates some pages
inside the prange, ending up in mixed pages.

[How]
Classify each page inside a prange by its type, device or
system memory, during the dma mapping call. If a page corresponds
to the VRAM domain, a flag is set in its dma_addr entry for each GPU.
Then, at GPU page table mapping time, each group of contiguous pages of
the same type is mapped with its proper pte flags.

v2:
Instead of using ttm_res to calculate vram pfns in the svm_range, this is now
done by setting the real VRAM physical address in the dma_addr array.
This makes VRAM management more flexible and removes the need to keep
a BO reference in the svm_range.

v3:
Remove mapping member from svm_range
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
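
The grouping step described under [How] can be sketched as follows. This is a minimal illustration, not the driver code: the flag name `SKETCH_VRAM_FLAG`, the `sketch_run` struct, and `sketch_split_runs()` are hypothetical, and the sketch assumes the domain is encoded in the low bit of each address entry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: the low bit of an address entry marks a VRAM page. */
#define SKETCH_VRAM_FLAG 1ULL

struct sketch_run {
    size_t first;   /* index of the first page in the run */
    size_t count;   /* number of contiguous pages of the same type */
    int is_vram;    /* 1 if the run is in VRAM, 0 if in system memory */
};

/*
 * Split addrs[0..npages) into runs of consecutive pages that share the
 * same domain. Writes at most max_runs entries and returns how many
 * runs were written.
 */
static size_t sketch_split_runs(const uint64_t *addrs, size_t npages,
                                struct sketch_run *runs, size_t max_runs)
{
    size_t nruns = 0;
    size_t i = 0;

    while (i < npages && nruns < max_runs) {
        int is_vram = (addrs[i] & SKETCH_VRAM_FLAG) != 0;
        size_t start = i;

        /* Advance while the page type stays the same. */
        while (i < npages &&
               ((addrs[i] & SKETCH_VRAM_FLAG) != 0) == is_vram)
            i++;

        runs[nruns].first = start;
        runs[nruns].count = i - start;
        runs[nruns].is_vram = is_vram;
        nruns++;
    }
    return nruns;
}
```

Each resulting run could then be mapped in one call with the pte flags that match its domain.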

drm/amdkfd: use hmm range fault to get both domain pfns · 278a7087

Alex Sierra authored Jun 23, 2021

Now that a prange can have mixed domains (VRAM or SYSRAM),
neither actual_loc nor svm_bo can be used to check its current
domain and eventually get its pfns to map them in the GPU.
Instead, pfns from both domains are now obtained from
hmm_range_fault through the amdgpu_hmm_range_get_pages
call. This is done every time a GPU map occurs.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: get owner ref in validate and map · 1fc160cf

Alex Sierra authored May 06, 2021

Get the proper owner reference for the amdgpu_hmm_range_get_pages function.
This is useful for partial migrations: it avoids migrating VRAM pages
back to system memory when they are accessible by all devices in the
same memory domain,
e.g. multiple devices in the same hive.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: set owner ref to svm range prefault · a010d98a

Alex Sierra authored May 06, 2021

svm_range_prefault is called right before migrations to VRAM,
to make sure pages are resident in system memory before the migration.
With partial migrations, the owner reference is used by hmm range get pages
to avoid migrating pages that are already in the same VRAM domain.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: add owner ref param to get hmm pages · 8c21fc49

Alex Sierra authored May 06, 2021

The parameter is used as dev_private_owner to decide whether device
pages in the range need to be migrated back to system memory, based
on whether or not they are in the same memory domain.
In this case, the reference could come from the same memory domain
with devices connected to the same hive.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: device pgmap owner at the svm migrate init · 3a61dae8

Alex Sierra authored May 05, 2021

GPUs in the same XGMI hive have direct access to all
members' VRAM. When mapping memory to a GPU, we don't need
hmm_range_fault to fault device-private pages in the same
hive back to the host. Identifying the page owner as the hive,
rather than the individual GPU, accomplishes this.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
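
The owner comparison described above can be illustrated with a small sketch. All names here (`sketch_hive`, `sketch_gpu`, `sketch_pgmap_owner()`, `sketch_needs_host_fault()`) are hypothetical and only model the idea; the fallback to the GPU itself when there is no hive is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_hive { int id; };

struct sketch_gpu {
    struct sketch_hive *hive;   /* NULL when the GPU is not part of a hive */
};

/* Owner token: the hive if there is one, otherwise the GPU itself. */
static const void *sketch_pgmap_owner(const struct sketch_gpu *gpu)
{
    return gpu->hive ? (const void *)gpu->hive : (const void *)gpu;
}

/*
 * A device-private page only needs to be faulted back to the host when
 * its owner differs from the owner of the GPU that is mapping it.
 */
static bool sketch_needs_host_fault(const struct sketch_gpu *page_gpu,
                                    const struct sketch_gpu *mapping_gpu)
{
    return sketch_pgmap_owner(page_gpu) != sketch_pgmap_owner(mapping_gpu);
}
```

With the hive as owner, two GPUs in the same hive compare equal, so pages in a peer's VRAM are left in place.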

drm/amdkfd: inc counter on child ranges with xnack off · 9e4a91cd

Alex Sierra authored Jun 29, 2021

During GPU page table invalidation with xnack off, new range
splits may occur concurrently in the same prange, creating a new
child per split. Each child should also increment its
invalid counter to ensure GPU page table updates in these
ranges.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Extend DMUB diagnostic logging to DCN3.1 · 1d40ef90

Nicholas Kazlauskas authored Jun 30, 2021

[Why & How]
Extend existing support for DCN2.1 DMUB diagnostic logging to
DCN3.1 so we can collect useful information if the DMUB hangs.
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Update NV SIMD-per-CU to 2 · aa615811

Joseph Greathouse authored Jun 29, 2021

Navi series GPUs have 2 SIMDs per CU (and then 2 CUs per WGP).
The NV enum headers incorrectly listed this as 4, which later meant
we were incorrectly reporting the number of SIMDs in the HSA
topology. This could cause problems down the line for user-space
applications that want to launch a fixed amount of work to each
SIMD.
Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
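
The effect of the wrong constant is simple arithmetic. In the sketch below, `sketch_simd_count()` is a hypothetical helper, and the CU count used in the note afterwards is an illustrative figure, not taken from the commit.

```c
/* Total SIMDs reported = compute units x SIMDs per compute unit. */
static unsigned int sketch_simd_count(unsigned int num_cu,
                                      unsigned int simd_per_cu)
{
    return num_cu * simd_per_cu;
}
```

For a part with 40 CUs, the old value of 4 would report 160 SIMDs where the corrected value of 2 reports 80, so an application sizing its work per SIMD would launch twice as much as intended.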

drm/amdgpu: add new dimgrey cavefish DID · 06ac9b6c

Alex Deucher authored Jun 28, 2021

Add new PCI device id.
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amd/pm: skip PrepareMp1ForUnload message in s0ix · 0e212522

Shyam Sundar S K authored Jun 28, 2021

The documentation around PrepareMp1ForUnload message says that
anything sent to SMU after this command would be stalled as the
PMFW would not be in a state to take further job requests.

Technically this is right in the S3 scenario. But this might
not be the case during s0ix, as the PMC driver would be the last
to send the OS_HINT to the SMU. If the SMU gets a PrepareMp1ForUnload
message before the OS_HINT, this would stall the entire S0ix process.

Results show that this message to the SMU is not required during S0ix,
so skip it.
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: move apu flags initialization to the start of device init · 9f6a7857

Huang Rui authored Jun 22, 2021

On some ASICs, we need to adjust behavior according to the apu flags
at a very early stage.
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Respect CONFIG_FRAME_WARN=0 in dml Makefile · 25f178bb

Reka Norman authored Jun 29, 2021

Setting CONFIG_FRAME_WARN=0 should disable 'stack frame larger than'
warnings. This is useful for example in KASAN builds. Make the dml
Makefile respect this config.

Fixes the following build warnings with CONFIG_KASAN=y and
CONFIG_FRAME_WARN=0:

drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_mode_vba_30.c:3642:6:
warning: stack frame size of 2216 bytes in function
'dml30_ModeSupportAndSystemConfigurationFull' [-Wframe-larger-than=]
drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c:3957:6:
warning: stack frame size of 2568 bytes in function
'dml31_ModeSupportAndSystemConfigurationFull' [-Wframe-larger-than=]
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Reka Norman <rekanorman@google.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/radeon: Add the missed drm_gem_object_put() in radeon_user_framebuffer_create() · 9ba85914

Jing Xiangfeng authored Jun 29, 2021

radeon_user_framebuffer_create() fails to call drm_gem_object_put() in
an error path. Add the missing function call to fix it.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
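
The general shape of this kind of fix can be sketched with a toy reference count. Everything below is hypothetical (`sketch_obj`, `sketch_obj_lookup()`, `sketch_obj_put()`, `sketch_fb_create()`); it does not reproduce the radeon code or say which error path was affected, only the pattern of dropping a reference before returning an error.

```c
#include <stddef.h>

struct sketch_obj {
    int refcount;
};

/* Take a reference, as a lookup helper would. */
static struct sketch_obj *sketch_obj_lookup(struct sketch_obj *obj)
{
    if (obj)
        obj->refcount++;
    return obj;
}

/* Drop a reference. */
static void sketch_obj_put(struct sketch_obj *obj)
{
    obj->refcount--;
}

/*
 * Returns 0 on success (the new reference is kept by the framebuffer)
 * or -1 on failure (the reference taken here has been dropped again).
 */
static int sketch_fb_create(struct sketch_obj *obj, int init_fails)
{
    struct sketch_obj *ref = sketch_obj_lookup(obj);

    if (!ref)
        return -1;

    if (init_fails) {
        /* Without this put, the error path would leak the reference. */
        sketch_obj_put(ref);
        return -1;
    }
    return 0;
}
```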

drm/amdgpu/dc: Really fix DCN3.1 Makefile for PPC64 · c339a80d

Michal Suchanek authored Jun 23, 2021

Also copy over the part that makes old gcc handling cross-platform.

Fixes: df7a1658 ("drm/amdgpu/dc: fix DCN3.1 Makefile for PPC64")
Fixes: 926d6972 ("drm/amd/display: Add DCN3.1 blocks to the DC Makefile")
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/radeon: Call radeon_suspend_kms() in radeon_pci_shutdown() for Loongson64 · c1bfd74b

Tiezhu Yang authored Jun 28, 2021

On the Loongson64 platform used with Radeon GPU, shutdown or reboot failed
when console=tty is in the boot cmdline.

radeon_suspend_kms() puts the hw in the suspend state, and in particular sets
the fb state to FBINFO_STATE_SUSPENDED:

        if (fbcon) {
                console_lock();
                radeon_fbdev_set_suspend(rdev, 1);
                console_unlock();
        }

This then avoids any further fb operations in the related functions:

        if (p->state != FBINFO_STATE_RUNNING)
                return;

So call radeon_suspend_kms() in radeon_pci_shutdown() for Loongson64 to fix
this issue. This looks like the same kind of workaround as the one used for powerpc.
Co-developed-by: Jianmin Lv <lvjianmin@loongson.cn>
Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amdgpu: Set ttm caching flags during bo allocation · 8dbe43e9

Oak Zeng authored Jun 28, 2021

The ttm caching flags (ttm_cached, ttm_write_combined etc) are
used to determine a buffer object's mapping attributes in both
CPU page table and GPU page table (when that buffer is also
accessed by GPU). Currently the ttm caching flags are set in
function amdgpu_ttm_io_mem_reserve which is called during
DRM_AMDGPU_GEM_MMAP ioctl. This has a problem since the GPU
mapping of the buffer object (ioctl DRM_AMDGPU_GEM_VA) can
happen earlier than the mmap time, thus the GPU page table
update code can't pick up the right ttm caching flags to
decide the right GPU page table attributes.

This patch moves the ttm caching flags setting to function
amdgpu_vram_mgr_new. This function is called during the
first step of buffer object creation (e.g. DRM_AMDGPU_GEM_CREATE),
so later CPU and GPU mapping function calls will both
pick up this flag for CPU/GPU page table setup.

v2: rebase (Alex)
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Suggested-by: Christian Koenig <Christian.Koenig@amd.com>
Reviewed-by: Christian Koenig <Christian.Koenig@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Tested-by: Po Huang <Po.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: fix null pointer access in gpu reset · b66596f6

Guchun Chen authored Jun 28, 2021

During GPU reset, when receiving a DMCUB OUTBOX0 interrupt,
DAL code will set it to be the OUTBOX interrupt and set the hw interrupt.
However, the OUTBOX interrupt is not registered yet, so a NULL pointer
access occurs.

Call Trace:
  dal_irq_service_set+0x30/0x90 [amdgpu]
  dc_interrupt_set+0x24/0x30 [amdgpu]
  amdgpu_dm_set_dmub_outbox_irq_state+0x22/0x30 [amdgpu]
  amdgpu_irq_update+0x77/0xa0 [amdgpu]
  amdgpu_irq_gpu_reset_resume_helper+0x67/0xa0 [amdgpu]
  amdgpu_do_asic_reset+0x219/0x260 [amdgpu]
  amdgpu_device_gpu_recover.cold+0x8c5/0xb64 [amdgpu]
  amdgpu_debugfs_gpu_recover_show+0x2c/0x60 [amdgpu]
  seq_read_iter+0xc2/0x450
  ? do_anonymous_page+0x22c/0x3b0
  seq_read+0xf9/0x140
  full_proxy_read+0x5c/0x90
  vfs_read+0xaa/0x190
  ksys_read+0x67/0xe0
  __x64_sys_read+0x1a/0x20

Fixes: effbf6ca ("drm/amdgpu/display: remove an old DCN3 guard")
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-and-tested-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: fix incorrrect valid irq check · e38ca7e4

Guchun Chen authored Jun 28, 2021

A valid DAL irq should be < DAL_IRQ_SOURCES_NUMBER.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-and-tested-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
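
A strict upper bound matters because the "number of sources" enumerator is a count, one past the last usable index. The sketch below uses a hypothetical enum and helper (`sketch_irq_source`, `sketch_irq_valid()`); the lower-bound check against an "invalid" entry is an assumption of this sketch, not something stated in the commit.

```c
#include <stdbool.h>

enum sketch_irq_source {
    SKETCH_IRQ_INVALID = 0,
    SKETCH_IRQ_A,
    SKETCH_IRQ_B,
    SKETCH_IRQ_SOURCES_NUMBER   /* count: one past the last valid source */
};

/*
 * A source is valid only if it is strictly below the count. Accepting
 * src == SKETCH_IRQ_SOURCES_NUMBER would index one element past the end
 * of any table sized by that count.
 */
static bool sketch_irq_valid(int src)
{
    return src > SKETCH_IRQ_INVALID && src < SKETCH_IRQ_SOURCES_NUMBER;
}
```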

drm/amdgpu: enable sdma0 tmz for Raven/Renoir(V2) · e2329e74

Aaron Liu authored Jun 25, 2021

Without the driver loaded, SDMA0_UTCL1_PAGE.TMZ_ENABLE is set to 1
by default on all ASICs. On Raven/Renoir, the sdma golden settings
change SDMA0_UTCL1_PAGE.TMZ_ENABLE to 0.
This patch restores SDMA0_UTCL1_PAGE.TMZ_ENABLE to 1.
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

30 Jun, 2021 19 commits

drm/amd/amdgpu: enable gpu recovery for beige_goby · a2f55040

Chengming Gui authored Apr 26, 2021

Enable gpu recovery for beige_goby.
Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

amdgpu/pm: remove code duplication in show_power_cap calls · 91161b06

Darren Powell authored Jun 24, 2021

 v3: updated patch to apply to latest code
 v2: reorder to check pointers before calling pm_runtime_* functions

 Create a generic function and call it with an enum from
 * amdgpu_hwmon_show_power_cap_max
 * amdgpu_hwmon_show_power_cap
 * amdgpu_hwmon_show_power_cap_default

=== Test ===
AMDGPU_PCI_ADDR=`lspci -nn | grep "VGA\|Display" | cut -d " " -f 1`
AMDGPU_HWMON=`ls -la /sys/class/hwmon | grep $AMDGPU_PCI_ADDR | cut -d " " -f 10`
HWMON_DIR=/sys/class/hwmon/${AMDGPU_HWMON}

cp pp_show_power_cap.txt{,.old}
lspci -nn | grep "VGA\|Display" > pp_show_power_cap.test.log
FILES="
power1_cap
power1_cap_max
power1_cap_default "

for f in $FILES
do
  echo  $f = `cat $HWMON_DIR/$f` >> pp_show_power_cap.test.log
done
Signed-off-by: Darren Powell <darren.powell@amd.com>
Reviewed-by: Kevin Wang <kevin1.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
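
The deduplication described in this commit, one generic function selected by an enum, can be sketched as below. The names (`sketch_cap_kind`, `sketch_power`, `sketch_show_power_cap_generic()` and the three wrappers) are hypothetical and the real hwmon callbacks have different signatures; this only shows the pattern.

```c
enum sketch_cap_kind {
    SKETCH_CAP_CURRENT,
    SKETCH_CAP_MAX,
    SKETCH_CAP_DEFAULT
};

struct sketch_power {
    unsigned int cur;   /* current power cap */
    unsigned int max;   /* maximum power cap */
    unsigned int def;   /* default power cap */
};

/* One shared implementation; the enum selects which limit to report. */
static unsigned int
sketch_show_power_cap_generic(const struct sketch_power *p,
                              enum sketch_cap_kind kind)
{
    switch (kind) {
    case SKETCH_CAP_MAX:
        return p->max;
    case SKETCH_CAP_DEFAULT:
        return p->def;
    case SKETCH_CAP_CURRENT:
    default:
        return p->cur;
    }
}

/* The three entry points become thin wrappers. */
static unsigned int sketch_show_power_cap(const struct sketch_power *p)
{
    return sketch_show_power_cap_generic(p, SKETCH_CAP_CURRENT);
}

static unsigned int sketch_show_power_cap_max(const struct sketch_power *p)
{
    return sketch_show_power_cap_generic(p, SKETCH_CAP_MAX);
}

static unsigned int sketch_show_power_cap_default(const struct sketch_power *p)
{
    return sketch_show_power_cap_generic(p, SKETCH_CAP_DEFAULT);
}
```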

drm/amdgpu/display: drop unused variable · ed509955

Alex Deucher authored Jun 29, 2021

Remove unused variable.

Fixes: e7d9560a ("Revert "drm/amd/display: Fix overlay validation by considering cursors"")
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Revert "drm/amd/display: Fix overlay validation by considering cursors" · e7d9560a

Rodrigo Siqueira authored Jun 16, 2021

This reverts commit 33f409e6.

The patch that we are reverting here was originally applied because it
fixes multiple IGT issues and flickering in Android. However, after a
discussion with Sean Paul and Mark, it looks like this patch might
cause problems on ChromeOS. For this reason, we decided to revert this
patch.

Cc: Nicholas Kazlauskas <Nicholas.Kazlauskas@amd.com>
Cc: Harry Wentland <Harry.Wentland@amd.com>
Cc: Hersen Wu <hersenxs.wu@amd.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: Mark Yacoub <markyacoub@chromium.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Reviewed-by: Sean Paul <seanpaul@chromium.org>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

amdgpu/nv.c - Added codec query for Beige Goby · b3a24461

Veerabadhran Gopalakrishnan authored Jun 19, 2021

Added the Beige Goby capabilities in codec query.

v2: fix build error and indent (James)
Signed-off-by: Veerabadhran Gopalakrishnan <veerabadhran.gopalakrishnan@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: enable tmz on yellow carp · c8af9390

Aaron Liu authored Jun 21, 2021

The tmz functions have been verified on yellow carp, so enable tmz by
default.
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: implement counters for vm fault and migration · d4ebc200

Philip Yang authored Jun 22, 2021

Add a helper function to get the process device data structure from adev to
update counters.

Updates to the vm faults, page_in, and page_out counters will not be executed in
parallel; use WRITE_ONCE to avoid any form of compiler optimization.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: add sysfs counters for vm fault and migration · 751580b3

Philip Yang authored Jun 16, 2021

This is part of the SVM profiling API: export sysfs counters for
per-process, per-GPU vm retry faults and pages migrated in and out of GPU vram.

Counters will not be updated in parallel in the GPU retry fault handler and
the migration to vram/ram path; use READ_ONCE to avoid compiler
optimization.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
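
The writer/reader pairing described in this commit and the previous one can be sketched in user-space C. The macros below are simplified stand-ins for the kernel's WRITE_ONCE and READ_ONCE, built on a volatile cast and the GCC/Clang `__typeof__` extension; the struct and function names are hypothetical. As in the commit text, the sketch assumes a single updater at a time, so the increment itself does not need to be atomic.

```c
#include <stdint.h>

/* Simplified stand-ins: force exactly one store or load through volatile. */
#define SKETCH_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define SKETCH_READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

struct sketch_counters {
    uint64_t faults;     /* vm retry faults */
    uint64_t page_in;    /* pages migrated into VRAM */
    uint64_t page_out;   /* pages migrated out of VRAM */
};

/* Updater side: one writer at a time, so a plain read-modify-write. */
static void sketch_count_fault(struct sketch_counters *c)
{
    SKETCH_WRITE_ONCE(c->faults, c->faults + 1);
}

static void sketch_count_page_in(struct sketch_counters *c, uint64_t npages)
{
    SKETCH_WRITE_ONCE(c->page_in, c->page_in + npages);
}

/* Reader side, as a sysfs show callback might do it. */
static uint64_t sketch_show_faults(struct sketch_counters *c)
{
    return SKETCH_READ_ONCE(c->faults);
}

static uint64_t sketch_show_page_in(struct sketch_counters *c)
{
    return SKETCH_READ_ONCE(c->page_in);
}
```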

drm/amdkfd: fix sysfs kobj leak · dcdb4d90

Philip Yang authored Jun 21, 2021

Fix 3 cases of kobj leak, each of which causes a memory leak:

kobj_type must have a release() method to free memory from the release
callback. A NULL default_attrs is not needed to init a kobj.

sysfs files created under kobj_status should be removed with kobj_status
as the parent kobject.

Remove queue sysfs files when releasing a queue from the process MMU notifier
release callback.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: add helper function for kfd sysfs create · 75ae84c8

Philip Yang authored Jun 16, 2021

No functional change. Modify kfd_sysfs_create_file to take a kobject as a
parameter, so it becomes a common helper function that removes duplicate code
and simplifies creating new kfd sysfs files in the future.

Move the pr_warn for sysfs file creation failure into the helper function.
Make the helper function return void because callers don't use its
return value.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
