An error occurred fetching the project authors.
- 03 Oct, 2019 7 commits
-
-
Tao Zhou authored
simplify the code of accessing to eeprom_control struct Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
remove mmhub_funcs in adev Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Alex Deucher authored
Add new sections to amdgpu.rst, fix up formatting issues, add additional documentation to each section. Acked-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Guchun Chen authored
null ptr should be checked first to avoid null ptr access Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Alex Deucher authored
We are reserving vram pages so they should be aligned to the GPU page size. Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
There are two cases of reserve error should be ignored: 1) a ras bad page has been allocated (used by someone); 2) a ras bad page has been reserved (duplicate error injection for one page); DRM_ERROR is unnecessary for the failure of bad page reserve Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Adam Zerella authored
Some of the documentation formatting could be improved which will resolve some Sphinx amdgpu build warnings e.g WARNING: Unexpected indentation. WARNING: Block quote ends without a blank line; unexpected unindent. WARNING: Inline emphasis start-string without end-string. Signed-off-by:
Adam Zerella <adam.zerella@gmail.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 17 Sep, 2019 1 commit
-
-
Christian König authored
The placement is something TTM/BO internal and the RAS code should avoid touching that directly. Add a helper to create a BO at a fixed location and use that instead. v2: squash in fixes (Alex) Signed-off-by:
Christian König <christian.koenig@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 16 Sep, 2019 6 commits
-
-
Guchun Chen authored
Use debugfs_remove_recursive to remove the whole debugfs directory instead of removing the node one by one. Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Guchun Chen authored
Call pcie bif ras query/inject in amdgpu ras. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
allow inject error to XGMI block via debugfs node ras_ctrl Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
The table grows quickly during debug/development effort when multiple RAS errors are injected. Allow to avoid this by setting table header back to empty if needed. v2: Switch to debugfs entry instead of load time parameter. Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
move umc ras init from ras module to umc block, generic ras module should pay less attention to specific ras block. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
Fixes driver load regression on APUs. Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 13 Sep, 2019 7 commits
-
-
Tao Zhou authored
ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
support eeprom records load and save for ras, move EEPROM records storing to bad page reserving v2: remove redundant check for con->eh_data Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by:
Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
In late_init for ras, the helper function will be used to 1). disable ras feature if the IP block is masked as disabled 2). send enable feature command if the ip block was masked as enabled 3). create debugfs/sysfs node per IP block 4). register interrupt handler v2: check ih_info.cb to decide add interrupt handler or not v3: add ras_late_fini for cleanup all the ras fs node and remove interrupt handler Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
Ras controller interrupt and Ras err event athub interrupt are two dedicated interrupts for RAS support. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 23 Aug, 2019 1 commit
-
-
Guchun Chen authored
Use unsigned long type for the same ras count variable. This will avoid overflow on 64 bit system. Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 12 Aug, 2019 3 commits
-
-
Tao Zhou authored
feature mask info is enough for rocm tool, "cat /sys/class/drm/card0/device/ras/features" will get the info like this: feature mask: 0x3ffb Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
call mmhub ras query/inject in amdgpu ras Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
ras sub block index could be passed from shell command Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 06 Aug, 2019 1 commit
-
-
Tao Zhou authored
remove confused ras error type info Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 02 Aug, 2019 3 commits
-
-
Tao Zhou authored
ce can also trigger interrupt, and even both ce and ue error can be found in one ras query, distinguishing between ce and ue in interrupt handler is uncessary. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Suggested-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
correctable error can also trigger interrupt in some ras blocks Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
umc error address query can get ce/ue error address and clear error status Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 31 Jul, 2019 9 commits
-
-
Dennis Li authored
check gfx error count in both ras querry function and ras interrupt handler. gfx ras is still disabled by default due to known stability issue found in gpu reset. Signed-off-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
error injection address is not in gpu address space Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
only ue and ce errors are supported Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
add error data as parameter for ras interrupt cb and process it Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
more than one error address may be recorded in one query Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
create new amdgpu_umc structure to for more umc settings in future and switch to the new structure Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
v1: increase ras ce/ue error count v2: log the number of correctable and uncorrectable errors Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
check umc error count in both ras querry function and ras interrupt handler Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
These are common structures that can be included by IP specific source files Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 18 Jul, 2019 2 commits
-
-
Hawking Zhang authored
this function is not needed any more. error injection is the only way to validate ras but it can't be executed in amdgpu_ras_init, where gpu is even not initialized Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
error injection to other IP blocks (except UMC) will be enabled until RAS feature stablize on those IP blocks Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-