- 30 Dec, 2021 1 commit
-
-
Leslie Shi authored
drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure [Why] In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it will start the error handle path immediately and call into amdgpu_device_unmap_mmio as well to release mapped VRAM. However, in the following release callback, driver stills visits the unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs. [How] call amdgpu_device_unmap_mmio() if device is unplugged to prevent invalid memory address in vcn_v3_0_sw_fini() when GPU initialization failure. Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 28 Dec, 2021 14 commits
-
-
Zongmin Zhou authored
Some boards(like RX550) seem to have garbage in the upper 16 bits of the vram size register. Check for this and clamp the size properly. Fixes boards reporting bogus amounts of vram. after add this patch,the maximum GPU VRAM size is 64GB, otherwise only 64GB vram size will be used. Signed-off-by: Zongmin Zhou<zhouzongmin@kylinos.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Nicholas Kazlauskas authored
[Why] A porting error on a previous patch left the block of code that causes the crash from a NULL pointer dereference. More specifically, we try to access link_enc before it's assigned in the USB4 case in the following assignment: config.dio_output_idx = link_enc->transmitter - TRANSMITTER_UNIPHY_A; [How] That assignment occurs later depending on the ASIC version. It's only needed on DCN31 and only after link_enc is already assigned. Fixes: 98643044 ("drm/amd/display: fix a crash on USB4 over C20 PHY") Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
sashank saye authored
For Aldebaran chip passthrough case we need to intimate SMU about special handling for SBR.On older chips we send LightSBR to SMU, enabling the same for Aldebaran. Slight difference, compared to previous chips, is on Aldebaran, SMU would do a heavy reset on SBR. Hence, the word Heavy instead of Light SBR is used for SMU to differentiate. Reviewed by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: sashank saye <sashank.saye@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
When an application having open file access to a node forks, its shared mappings also get reflected in the address space of child process even though it cannot access them with the object permissions applied. With the existing permission checks on the gem objects, it might be reasonable to also create the VMAs with VM_DONTCOPY flag so a user space application doesn't need to explicitly call the madvise(addr, len, MADV_DONTFORK) system call to prevent the pages in the mapped range to appear in the address space of the child process. It also prevents the memory leaks due to additional reference counts on the mapped BOs in the child process that prevented freeing the memory in the parent for which we had worked around earlier in the user space inside the thunk library. Additionally, we faced this issue when using CRIU to checkpoint restore an application that had such inherited mappings in the child which confuse CRIU when it mmaps on restore. Having this flag set for the render node VMAs helps. VMAs mapped via KFD already take care of this so this is needed only for the render nodes. To limit the impact of the change to user space consumers such as OpenGL etc, limit it to KFD BOs only. Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
CP supports unmap queue with reset mode which only destroys specific queue without affecting others. Replacing whole gpu reset with reset queue mode for RAS poison consumption saves much time, and we can also fallback to gpu reset solution if reset queue fails. v2: Return directly if process is NULL; Reset queue solution is not applicable to SDMA, fallback to legacy way; Call kfd_unref_process after lookup process. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
The new interface unmaps queues with reset mode for the process consumes RAS poison, it's only for compute queue. v2: rename the function to reset_queues. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
So we can set reset mode for unmap operation, no functional change. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
Add a reset parameter for umc page retirement, let user decide whether call gpu reset in umc page retirement. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
Expand RLCG interface for new GC read & write commands. New interface will only be used if the PF enables the flag in pf2vf msg. v2: Added a description for the scratch registers Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by: David Nieto <david.nieto@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
Driver needs to call get_xgmi_info() before ip_init to determine whether it needs to handle a pending hive reset. Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by: David Nieto <david.nieto@amd.com> Reviewed by: shaoyun.liu <Shaoyun.lui@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
Modify GC register access from MMIO to RLCG if the indirect flag is set Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by: David Nieto <david.nieto@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
Modify GC register access from MMIO to RLCG if the indirect flag is set v2: Replaced ternary operator with if-else for better readability Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by: David Nieto <david.nieto@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
Add helper macros to change register access from direct to indirect. Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by: David Nieto <david.nieto@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Bokun Zhang authored
Recently, there is security policy update under SRIOV. We need to filter the registers that hit the violation and move the code to the host driver side so that the guest driver can execute correctly. Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 16 Dec, 2021 12 commits
-
-
Alex Deucher authored
Used on gfx9 based systems. Fixes incorrect CU counts reported in the kernel log. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1833Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Jiapeng Chong authored
Fix the following coccicheck warning: ./drivers/gpu/drm/amd/display/dc/clk_mgr/dcn301/vg_clk_mgr.c:744:35-36: WARNING comparing pointer to 0. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Alex Deucher authored
Some old registers leftover from pre-silicon. No longer relevant on real hardware. Remove. Reviewed-by: James Zhu <James.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
We want to be able to call virt data exchange conditionally after gmc sw init to reserve bad pages as early as possible. Since this is a conditional call, we will need to call it again unconditionally later in the init sequence. Refactor the data exchange function so it can be called multiple times without re-initializing the work item. v2: Cleaned up the code. Kept the original call to init_exchange_data() inside early init to initialize the work item, afterwards call exchange_data() when needed. Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed By: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Changcheng Deng authored
Use max() and min() in order to make code cleaner. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
chen gong authored
Play a video on the raven (or PCO, raven2) platform, and then do the S3 test. When resume, the following error will be reported: amdgpu 0000:02:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec test failed (-110) [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vcn_v1_0> failed -110 amdgpu 0000:02:00.0: amdgpu: amdgpu_device_ip_resume failed (-110). PM: dpm_run_callback(): pci_pm_resume+0x0/0x90 returns -110 [why] When playing the video: The power state flag of the vcn block is set to POWER_STATE_ON. When doing suspend: There is no change to the power state flag of the vcn block, it is still POWER_STATE_ON. When doing resume: Need to open the power gate of the vcn block and set the power state flag of the VCN block to POWER_STATE_ON. But at this time, the power state flag of the vcn block is already POWER_STATE_ON. The power status flag check in the "8f2cdef drm/amd/pm: avoid duplicate powergate/ungate setting" patch will return the amdgpu_dpm_set_powergating_by_smu function directly. As a result, the gate of the power was not opened, causing the subsequent ring test to fail. [how] In the suspend function of the vcn block, explicitly change the power state flag of the vcn block to POWER_STATE_OFF. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828Signed-off-by: chen gong <curry.gong@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Lijo Lazar authored
Fix the message argument. 0: Allow power down 1: Disallow power down Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Mario Limonciello authored
This is still needed for thoes in case the firmware fails to load then the message is the only way to tell what firmware was on them Suggested-by: Lijo Lazar <lijo.Lazar@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Philip Yang authored
Add svm_range_bo_unref_async to schedule work to wait for svm_bo eviction work done and then free svm_bo. __do_munmap put_page is atomic context, call svm_range_bo_unref_async to avoid warning invalid wait context. Other non atomic context call svm_range_bo_unref. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Huang Rui authored
The job embedded fence donesn't initialize the flags at dma_fence_init(). Then we will go a wrong way in amdgpu_fence_get_timeline_name callback and trigger a null pointer panic once we enabled the trace event here. So introduce new amdgpu_fence object to indicate the job embedded fence. [ 156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0 [ 156.131804] #PF: supervisor read access in kernel mode [ 156.131811] #PF: error_code(0x0000) - not-present page [ 156.131817] PGD 0 P4D 0 [ 156.131824] Oops: 0000 [#1] PREEMPT SMP PTI [ 156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G OE 5.16.0-rc1-custom #1 [ 156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016 [ 156.131848] RIP: 0010:strlen+0x0/0x20 [ 156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31 [ 156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206 [ 156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b [ 156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0 [ 156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000 [ 156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0 [ 156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007 [ 156.131914] FS: 0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000 [ 156.131923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0 [ 156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 156.131949] Call Trace: [ 156.131953] <TASK> [ 156.131957] ? trace_event_raw_event_dma_fence+0xcc/0x200 [ 156.131973] ? ring_buffer_unlock_commit+0x23/0x130 [ 156.131982] dma_fence_init+0x92/0xb0 [ 156.131993] amdgpu_fence_emit+0x10d/0x2b0 [amdgpu] [ 156.132302] amdgpu_ib_schedule+0x2f9/0x580 [amdgpu] [ 156.132586] amdgpu_job_run+0xed/0x220 [amdgpu] v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot) Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Prike Liang authored
In the s0ix entry need retain gfx in the gfxoff state,so here need't set gfx cgpg in the S0ix suspend-resume process. Moreover move the S0ix check into SMU12 can simplify the code condition check. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Mario Limonciello authored
Since this variable was made available by the previous commit, use it to make function access cleaner. Suggested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 14 Dec, 2021 13 commits
-
-
Xiang wangx authored
Delete the redundant word 'we'. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Yann Dirson authored
Signed-off-by: Yann Dirson <ydirson@free.fr> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Yann Dirson authored
Those are not today pulled by the sphinx doc, but better be ready. Signed-off-by: Yann Dirson <ydirson@free.fr> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Yann Dirson authored
This is Alex' description from the "Looking for clarifications around gfx/kcq/kiq" thread, edited to fit as ReST. Original text: https://www.spinics.net/lists/amd-gfx/msg71383.htmlOriginally-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Yann Dirson <ydirson@free.fr> Acked-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Yann Dirson authored
This describes in broad lines the how an AMD GPU is organized, in terms of hardware blocks. This is Alex' description from the "gpu block diagram" thread, edited to fit as ReST. Original text: https://www.spinics.net/lists/amd-gfx/msg71543.htmlOriginally-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Yann Dirson <ydirson@free.fr> Acked-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Lang Yu authored
Memory is allocated for gpu_metrics_table in renoir_init_smc_tables(), but not freed in int smu_v12_0_fini_smc_tables(). Free it! Fixes: 95868b85 ("drm/amd/powerplay: add Renoir support for gpu metrics export") Signed-off-by: Lang Yu <lang.yu@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Jingwen Chen authored
[Why] gmc bo will be pinned during loading amdgpu and reset in SRIOV while only unpinned in unload amdgpu [How] add amdgpu_in_reset and sriov judgement to skip pin bo v2: fix wrong judgement Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Reviewed-by: Horace Chen <horace.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Jingwen Chen authored
[Why] psp tmr bo will be pinned during loading amdgpu and reset in SRIOV while only unpinned in unload amdgpu [How] add amdgpu_in_reset and sriov judgement to skip pin bo v2: fix wrong judgement Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Reviewed-by: Horace Chen <horace.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Lijo Lazar authored
Power states are not valid for arcturus and aldebaran, no need to allocate memory. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Evan Quan authored
Pair the operations did in GMC ->hw_init and ->hw_fini. That can help to maintain correct cached state for GMC and avoid unintention gate operation dropping due to wrong cached state. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1828Signed-off-by: Evan Quan <evan.quan@amd.com> Acked-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Guchun Chen authored
Updated for consistency when accessing drm_device from amdgpu driver. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Evan Quan authored
As the smu_context will be invisible from outside(of power). Also, the smu_debug_mask can be shared around all power code instead of some specific framework(swSMU) only. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
Host initiated VF FLR may fail if someone else is already holding a read_lock. Change from down_write_trylock to down_write to guarantee the reset goes through. Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-