- 13 Sep, 2019 40 commits
-
-
Yong Zhao authored
The issue was accidentally introduced recently. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
save umc error records to ras bad page array v2: add bad pages before gpu reset v3: add NULL check for adev->umc.funcs Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
support eeprom records load and save for ras, move EEPROM records storing to bad page reserving v2: remove redundant check for con->eh_data Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
Fix typo which messed up the calculation. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
Restoring clock gating break SMU opeartion afterwards, avoid this until this further invistigated with SMU. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
Protect from concurrent SMU accesses. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Roman Li authored
[Why] The earlier patch "Hook up calls to do stereo mux and dig programming..." doesn't include update for dcn21. [How] Align dcn21 gpio settings with updated stereo control interface. Signed-off-by: Roman Li <Roman.Li@amd.com> Acked-by: Aaron Liu <aaron.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Yong Zhao authored
This optimizes out the pci device id usage in KFD and makes the code more maintainable. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Felix Kuehling authored
These wptrs must be pinned and GPU accessible when this is called from hqd_load functions. So they should never fault. This resolves a circular lock dependency issue involving four locks including the DQM lock and mmap_sem. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Aaron Liu authored
With stutter mode enabled, NMI prints frequently. Disable stutter for the moment because NMI warning storm, and will enable it back till the issue is addressed Signed-off-by: Aaron Liu <aaron.liu@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Aaron Liu authored
This patch updates MP1_BASE in renoir_ip_offset.h Signed-off-by: Aaron Liu <aaron.liu@amd.com> Acked-by: Roman Li <roman.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Prike Liang authored
With the common interface print_clk_levels can get the following dpm clock: -pp_dpm_dcefclk -pp_dpm_fclk -pp_dpm_mclk -pp_dpm_sclk -pp_dpm_socclk Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Kevin Wang <kevin1.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
John Clements authored
Removed redundant goto statement Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
John Clements authored
Add support for loading XGMI/RAS FW Signed-off-by: John Clements <john.clements@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
change r type from bool to int, suitable for both bool and int return value Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Kevin Wang authored
fix bellow patch issue: drm/amd/powerplay: introduce smu table id type to handle the smu table for each asic ---- "This patch introduces new smu table type, it's to handle the different smu table defines for each asic with the same smu ip." before: use smu->table_count to represent the actual table count in smc firmware use actual table count to check smu function parameter about smu table after: use logic table count "SMU_TABLE_COUNT" to check function parameter because table id already mapped in smu driver, and smu function will use logic table id not actual table id to check func parameter. v2: squash in warning fix Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Jack Zhang authored
add sw_fini interface of df_funcs. This interface will remove sysfs file of df_cntr_avail function. The old behavior only create sysfs of df_cntr_avail in sw_init, but never remove it for lack of sw_fini interface. With this,driver will report create sysfs fail when it's loaded for the second time. Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
UMC RAS feature requires access to UMC & RSMU registers Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
amdgpu_nbio_ras_late_init is used to init nbio specfic ras debugfs/sysfs node and nbio specific interrupt handler. It can be shared among nbio generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
amdgpu_mmhub_ras_late_init is used to init mmhub specfic ras debugfs/sysfs node and mmhub specific interrupt handler. It can be shared among mmhub generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
amdgpu_sdma_ras_late_init is used to init sdma specfic ras debugfs/sysfs node and sdma specific interrupt handler. It can be shared among sdma generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
amdgpu_gfx_ras_late_init is used to init gfx specfic ras debugfs/sysfs node and gfx specific interrupt handler. It can be shared among gfx generations Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
amdgpu_gmc_ras_late_init is used to init gmc specfic ras debugfs/sysfs node and gmc specific interrupt handler. It can be shared among gmc generations. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
to prevent access to dangling pointers Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
Issue 1: In XGMI case amdgpu_device_lock_adev for other devices in hive was called to late, after access to their repsective schedulers. So relocate the lock to the begining of accessing the other devs. Issue 2: Using amdgpu_device_ip_need_full_reset to switch the device list from all devices in hive to the single 'master' device who owns this reset call is wrong because when stopping schedulers we iterate all the devices in hive but when restarting we will only reactivate the 'master' device. Also, in case amdgpu_device_pre_asic_reset conlcudes that full reset IS needed we then have to stop schedulers for all devices in hive and not only the 'master' but with amdgpu_device_ip_need_full_reset we already missed the opprotunity do to so. So just remove this logic and always stop and start all schedulers for all devices in hive. Also minor cleanup and print fix. v4: Minor coding style fix. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
Trying to evict things from the current working set doesn't work that well anymore because of per VM BOs. Rely on reserving VRAM for page tables to avoid contention. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
This hopefully helps reduce the contention for page tables. v2: adjust maximum reported VRAM size as well Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
Make VM updates depend on the moving fence instead of the exclusive one. Makes it less likely to actually have a dependency. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Evan Quan authored
These are needed for smu_reset support. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Jack Gui <Jack.Gui@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Evan Quan authored
Need to update in cache feature enablement status after pp_feature settings. Another fix for the commit below: drm/amd/powerplay: implment sysfs feature status function in smu V2: update smu_feature_update_enable_state() and relates V3: use bitmap_or and bitmap_andnot Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Jack Gui <Jack.Gui@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Evan Quan authored
Force clock level is for dpm manual mode only. Reported-by: Candice Li <candice.li@amd.com> Signed-off-by: Evan Quan <evan.quan@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Jack Gui <Jack.Gui@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
gds clearing workaround should only be applied on asics that support gfx ras Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
free ras_if if ras is not supported Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
mmhub callback functions are not initialized for all the ASICs Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Felix Kuehling authored
This workaround is better handled in user mode in a way that doesn't require allocating extra memory and breaking userptr BOs. The TLB bug is a performance bug, not a functional or security bug. Hence it is safe to remove this kernel part of the workaround to allow a better workaround using only virtual address alignments in user mode. v2: Removed VI_BO_SIZE_ALIGN definition Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Felix Kuehling authored
For compute VRAM allocations on Arturus use the new RW mtype for non-coherent local memory, CC mtype for coherent local memory and PTE_SNOOPED bit for invalidating non-dirty cache lines on remote XGMI mappings. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Tested-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-