Commits · 0771b0bf0790295b141cc30644a1b0b3e22a331e · Kirill Smelkov / linux

03 Oct, 2019 7 commits

drm/amdgpu: simplify the access to eeprom_control struct · 0771b0bf

Tao Zhou authored 5 years ago

Simplify the code that accesses the eeprom_control struct.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0771b0bf

drm/amdgpu: replace mmhub_funcs with mmhub.funcs · d65bf1f8

Tao Zhou authored 5 years ago

remove mmhub_funcs in adev
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d65bf1f8

drm/amdgpu/ras: fix and update the documentation for RAS · f77c7109

Alex Deucher authored 5 years ago

Add new sections to amdgpu.rst, fix up formatting issues,
add additional documentation to each section.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f77c7109

drm/amdgpu: avoid null pointer dereference · 8a3e801f

Guchun Chen authored 5 years ago

The pointer should be checked for NULL before it is dereferenced.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8a3e801f
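
The ordering this commit describes can be sketched in a few lines of C. This is only an illustration: `struct ras_ctx` and `ras_ctx_error_count` are hypothetical names, not the amdgpu types or functions touched by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver context; not the real amdgpu type. */
struct ras_ctx {
    int error_count;
};

/*
 * Return the stored error count, or -1 when no context is available.
 * The NULL test comes before the first dereference of ctx.
 */
int ras_ctx_error_count(const struct ras_ctx *ctx)
{
    if (ctx == NULL)
        return -1;
    return ctx->error_count;
}
```

Calling `ras_ctx_error_count(NULL)` returns -1 instead of faulting, which is the behavior the check is meant to guarantee.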

drm/amdgpu/ras: use GPU PAGE_SIZE/SHIFT for reserving pages · a142ba88

Alex Deucher authored 5 years ago

We are reserving vram pages so they should be aligned to the
GPU page size.
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a142ba88
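
A minimal sketch of why the GPU page size matters here, assuming a 4 KiB GPU page (shift of 12). The macro and function names below are illustrative and are not taken from the patch. On a kernel built with larger CPU pages, using the CPU page shift for VRAM addresses would select the wrong range.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: a 4 KiB GPU page, independent of the CPU page size. */
#define GPU_PAGE_SHIFT 12
#define GPU_PAGE_SIZE  (1ULL << GPU_PAGE_SHIFT)

/* GPU page frame number that contains a VRAM byte address. */
uint64_t gpu_addr_to_pfn(uint64_t addr)
{
    return addr >> GPU_PAGE_SHIFT;
}

/* Start address of the GPU page for a given page frame number. */
uint64_t gpu_pfn_to_addr(uint64_t pfn)
{
    return pfn << GPU_PAGE_SHIFT;
}

/* Round a VRAM address down to the start of its GPU page. */
uint64_t gpu_page_align_down(uint64_t addr)
{
    return addr & ~(GPU_PAGE_SIZE - 1);
}
```

For example, a faulty address of 0x12345 falls in GPU page frame 0x12, so the range to reserve starts at 0x12000 and spans `GPU_PAGE_SIZE` bytes.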

drm/amdgpu: replace DRM_ERROR with DRM_WARN in ras_reserve_bad_pages · ae115c81

Tao Zhou authored 5 years ago

There are two cases in which a reserve error should be ignored:
1) a ras bad page has been allocated (used by someone);
2) a ras bad page has been reserved (duplicate error injection for one page);

DRM_ERROR is unnecessary when reserving a bad page fails.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ae115c81

docs: drm/amdgpu: Resolve build warnings · 879e723d

Adam Zerella authored 5 years ago

Some of the documentation formatting could be improved,
which will resolve some Sphinx amdgpu build warnings, e.g.:

WARNING: Unexpected indentation.
WARNING: Block quote ends without a blank line; unexpected unindent.
WARNING: Inline emphasis start-string without end-string.
Signed-off-by: Adam Zerella <adam.zerella@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

879e723d

17 Sep, 2019 1 commit

drm/amdgpu: cleanup creating BOs at fixed location (v2) · de7b45ba

Christian König authored 5 years ago

The placement is something TTM/BO internal and the RAS code should
avoid touching that directly.

Add a helper to create a BO at a fixed location and use that instead.

v2: squash in fixes (Alex)
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

de7b45ba

16 Sep, 2019 6 commits

drm/amdgpu: fix ras ctrl debugfs node leak · 012dd14d

Guchun Chen authored 5 years ago

Use debugfs_remove_recursive to remove the whole debugfs
directory instead of removing the node one by one.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

012dd14d

drm/amdgpu: support pcie bif ras query and inject · d7bd680d

Guchun Chen authored 5 years ago

Call pcie bif ras query/inject in amdgpu ras.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d7bd680d

drm/amdgpu: enable error injection to XGMI block via debugfs · f3170352

Hawking Zhang authored 5 years ago

Allow injecting errors into the XGMI block via the debugfs node ras_ctrl.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f3170352

drm/amdgpu: Allow to reset to EEPROM table. · 084fe13b

Andrey Grodzovsky authored 5 years ago

The table grows quickly during debug/development effort when
multiple RAS errors are injected. Allow this to be avoided by
resetting the table header to empty when needed.

v2: Switch to debugfs entry instead of load time parameter.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

084fe13b

drm/amdgpu: move umc ras init to umc block · 4930aabe

Tao Zhou authored 5 years ago

Move umc ras init from the ras module to the umc block; the generic
ras module should pay less attention to specific ras blocks.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4930aabe

drm/amdgpu: Avoid RAS recovery init when no RAS support. · 4d1337d2

Andrey Grodzovsky authored 5 years ago

Fixes driver load regression on APUs.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4d1337d2

13 Sep, 2019 7 commits

drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place · 1a6fc071

Tao Zhou authored 5 years ago

ras recovery_init should be called after ttm init;
bad page reserve should be placed before gpu reset, since i2c
may be unstable during gpu reset.
Add cleanup for recovery_init and recovery_fini.

v2: add more comment and print.
    remove cancel_work_sync in recovery_init.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1a6fc071

drm/amdgpu: Hook EEPROM table to RAS · 78ad00c9

Tao Zhou authored 5 years ago

support eeprom records load and save for ras,
move EEPROM records storing to bad page reserving

v2: remove redundant check for con->eh_data
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

78ad00c9

drm/amdgpu: change ras bps type to eeprom table record structure · 9dc23a63

Tao Zhou authored 5 years ago

change bps type from retired page to eeprom table record, prepare for
saving umc error records to eeprom
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9dc23a63

drm/amdgpu: Add system auto reboot to RAS. · d5ea093e

Andrey Grodzovsky authored 5 years ago

In case of a RAS error, allow the user to configure automatic system
reboot through ras_ctrl.
This is also part of the temporary workaround for the RAS
hang problem.

v4: Use latest kernel API for disk sync.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d5ea093e

drm/amdgpu: Avoid HW GPU reset for RAS. · 7c6e68c7

Andrey Grodzovsky authored 5 years ago

Problem:
Under certain conditions, when some IP blocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.

Temporary fix until a proper solution in PSP/SMU is ready:
When an uncorrectable error happens, the DF unconditionally
broadcasts error event packets to all its clients/slaves upon
receiving the fatal error event and freezes all its outbound queues;
the err_event_athub interrupt is then triggered.
In that case we use this interrupt to issue a GPU reset. The GPU reset
code is modified for this case to avoid a HW reset: it only stops the
schedulers, detaches the fences of all in-progress and not-yet-scheduled
jobs, sets an error code on them and signals them.
It also rejects any new incoming job submissions from user space.
All this is done to notify the applications of the problem.

v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count

v3:
Update based on previous bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive members.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7c6e68c7
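
The temporary fix described in this commit can be pictured with a small user-space model. Everything below is hypothetical: the struct and function names are invented for illustration, and `-EIO` is a placeholder error code rather than the value the driver actually sets.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a job's fence state. */
struct job_sketch {
    int  fence_error;   /* 0 while no error has been recorded */
    bool signaled;      /* true once waiters have been woken */
};

/* Hypothetical model of the device state relevant to the workaround. */
struct dev_sketch {
    bool sched_stopped; /* schedulers no longer pick up jobs */
    bool in_ras_error;  /* set once the fatal error was handled */
};

/*
 * Handle the fatal error without touching the hardware: stop scheduling,
 * fail every job that has not completed yet, and remember the error state.
 */
void ras_soft_reset_sketch(struct dev_sketch *dev,
                           struct job_sketch *jobs, size_t n)
{
    assert(dev != NULL);
    dev->sched_stopped = true;
    for (size_t i = 0; i < n; i++) {
        if (!jobs[i].signaled) {
            jobs[i].fence_error = -EIO; /* placeholder error code */
            jobs[i].signaled = true;    /* wake anyone waiting on it */
        }
    }
    dev->in_ras_error = true;
}

/* New submissions are rejected once the device is in the error state. */
int submit_job_sketch(const struct dev_sketch *dev)
{
    return dev->in_ras_error ? -EIO : 0;
}
```

Jobs that had already completed keep their result; only in-flight and queued jobs are marked as failed, so applications waiting on those fences learn about the problem instead of hanging.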

drm/amdgpu: add helper function to do common ras_late_init/fini (v3) · b293e891

Hawking Zhang authored 5 years ago

In late_init for ras, the helper function will be used to
1). disable ras feature if the IP block is masked as disabled
2). send enable feature command if the ip block was masked as enabled
3). create debugfs/sysfs node per IP block
4). register interrupt handler

v2: check ih_info.cb to decide add interrupt handler or not

v3: add ras_late_fini to clean up all the ras fs nodes and remove
the interrupt handler
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b293e891
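
The four late_init steps and the v3 cleanup listed in this commit can be sketched as follows. This is a hypothetical model: the struct, its fields and the function names are invented here, and returning early for a disabled block is an assumption, not something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-IP-block state; not the real amdgpu structures. */
struct ras_block_sketch {
    bool masked_enabled;   /* block is enabled in the feature mask */
    bool feature_on;       /* result of the enable/disable command */
    bool fs_created;       /* debugfs/sysfs nodes exist */
    bool irq_registered;   /* interrupt handler is installed */
    void (*irq_cb)(void);  /* optional interrupt callback (v2 check) */
};

/* Trivial callback so callers have something to register. */
void ras_noop_irq_cb_sketch(void)
{
}

void ras_late_init_sketch(struct ras_block_sketch *blk)
{
    assert(blk != NULL);
    if (!blk->masked_enabled) {
        blk->feature_on = false;       /* 1) disable the feature */
        return;                        /* assumed: nothing else to set up */
    }
    blk->feature_on = true;            /* 2) send the enable command */
    blk->fs_created = true;            /* 3) create debugfs/sysfs nodes */
    if (blk->irq_cb != NULL)           /* 4) only with a callback (v2) */
        blk->irq_registered = true;
}

/* v3: undo what late_init set up. */
void ras_late_fini_sketch(struct ras_block_sketch *blk)
{
    assert(blk != NULL);
    blk->fs_created = false;
    blk->irq_registered = false;
}
```

Keeping the sequence in one helper means each IP block only supplies its own data (mask state and optional callback) rather than repeating the same four steps.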

drm/amdgpu: add ras_controller and err_event_athub interrupt support · 4e644fff

Hawking Zhang authored 5 years ago

Ras controller interrupt and Ras err event athub interrupt are two dedicated
interrupts for RAS support.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4e644fff

23 Aug, 2019 1 commit

drm/amdgpu: correct ras error count type · 64cc5414

Guchun Chen authored 5 years ago

Use the unsigned long type consistently for the same ras count variable.
This avoids overflow on 64-bit systems.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

64cc5414
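
A small sketch of the kind of truncation a consistent count type avoids. It assumes a 64-bit system where unsigned long is 64 bits wide and uses fixed-width types to make the widths explicit; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Storing a 64-bit count in a 32-bit variable keeps only the low 32 bits. */
uint32_t store_count_narrow(uint64_t count)
{
    return (uint32_t)count;
}

/* Keeping the same 64-bit type end to end preserves the value. */
uint64_t store_count_wide(uint64_t count)
{
    return count;
}
```

With a count of 0x100000001, the narrow variant yields 1 while the wide variant returns the original value unchanged.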

12 Aug, 2019 3 commits

drm/amdgpu: remove ras block's feature status info in sysfs · 5212a3bd

Tao Zhou authored 5 years ago

feature mask info is enough for rocm tool,
"cat /sys/class/drm/card0/device/ras/features" will get the
info like this:

feature mask: 0x3ffb
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5212a3bd
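
A tool reading the feature mask shown above might decode it along these lines. The sketch assumes each bit corresponds to one RAS block index; the commit message does not spell out that mapping, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if the given bit is set in the mask, 0 otherwise. */
int ras_feature_enabled(uint32_t mask, unsigned int bit)
{
    if (bit >= 32)
        return 0;          /* out of range for a 32-bit mask */
    return (int)((mask >> bit) & 1u);
}

/* Count how many bits are set in the mask. */
int ras_feature_count(uint32_t mask)
{
    int n = 0;

    while (mask != 0) {
        n += (int)(mask & 1u);
        mask >>= 1;
    }
    return n;
}
```

For the example value 0x3ffb, bits 0 through 13 are set except bit 2, so 13 features are reported as enabled.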

drm/amdgpu: support mmhub ras in amdgpu ras · 9fb2d8de

Tao Zhou authored 5 years ago

call mmhub ras query/inject in amdgpu ras
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9fb2d8de

drm/amdgpu: add sub block parameter in ras inject command · 44494f96

Tao Zhou authored 5 years ago

ras sub block index could be passed from shell command
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

44494f96

06 Aug, 2019 1 commit

drm/amdgpu: update ras sysfs feature info · 2a3c7ff6

Tao Zhou authored 5 years ago

remove confusing ras error type info
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2a3c7ff6

02 Aug, 2019 3 commits

drm/amdgpu: replace AMDGPU_RAS_UE with AMDGPU_RAS_SUCCESS · bd2280da

Tao Zhou authored 5 years ago

ce can also trigger an interrupt, and both ce and ue errors can even be
found in one ras query, so distinguishing between ce and ue in the
interrupt handler is unnecessary.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Suggested-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bd2280da

drm/amdgpu: support ce interrupt in ras module · 51437623

Tao Zhou authored 5 years ago

a correctable error can also trigger an interrupt in some ras blocks
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

51437623

drm/amdgpu: add error address query for umc ras · 13b7c46c

Tao Zhou authored 5 years ago

umc error address query can get ce/ue error address and clear error
status
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

13b7c46c

31 Jul, 2019 9 commits

drm/amdgpu: support gfx ras error injection and err_cnt query · 83b0582c

Dennis Li authored 5 years ago

check gfx error count in both ras query function and
ras interrupt handler.

gfx ras is still disabled by default due to a known stability
issue found in gpu reset.
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

83b0582c

drm/amdgpu: remove ras_reserve_vram in ras injection · 7cdc2ee3

Tao Zhou authored 5 years ago

error injection address is not in gpu address space
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7cdc2ee3

drm/amdgpu: add check for ras error type · e1063493

Tao Zhou authored 5 years ago

only ue and ce errors are supported
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e1063493

drm/amdgpu: allow ras interrupt callback to return error data · cf04dfd0

Tao Zhou authored 5 years ago

add error data as parameter for ras interrupt cb and process it
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

cf04dfd0
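
One way to picture a callback that reports error data to the interrupt path is sketched below. The types and names are hypothetical and simplified; they only illustrate passing an error-data structure into the callback and processing the result, with correctable and uncorrectable counts reported together.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error data filled in by one query. */
struct ras_err_data_sketch {
    unsigned long ue_count;   /* uncorrectable errors found */
    unsigned long ce_count;   /* correctable errors found */
};

/* Hypothetical callback type: fill err_data, return 0 on success. */
typedef int (*ras_ih_cb_sketch)(struct ras_err_data_sketch *err_data);

/* Example callback that reports two correctable and one uncorrectable error. */
int example_query_cb_sketch(struct ras_err_data_sketch *err_data)
{
    err_data->ce_count += 2;
    err_data->ue_count += 1;
    return 0;
}

/* Run the callback for one interrupt and add its results to the totals. */
int ras_handle_interrupt_sketch(ras_ih_cb_sketch cb,
                                struct ras_err_data_sketch *totals)
{
    struct ras_err_data_sketch err_data = { 0, 0 };
    int ret;

    assert(cb != NULL && totals != NULL);
    ret = cb(&err_data);
    if (ret == 0) {
        totals->ue_count += err_data.ue_count;
        totals->ce_count += err_data.ce_count;
    }
    return ret;
}
```

Because the callback returns both counts in one structure, the caller does not need to know in advance which kind of error raised the interrupt.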

drm/amdgpu: add support for recording ras error address · 6f102dba

Tao Zhou authored 5 years ago

more than one error address may be recorded in one query
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

6f102dba

drm/amdgpu: switch to amdgpu_umc structure · 045c0216

Tao Zhou authored 5 years ago

create a new amdgpu_umc structure to hold more umc
settings in the future and switch to the new structure
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

045c0216

drm/amdgpu: add ras error count after each query (v2) · 05a58345

Tao Zhou authored 5 years ago

v1: increase ras ce/ue error count
v2: log the number of correctable and uncorrectable errors
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

05a58345

drm/amdgpu: query umc error count · 939e2258

Hawking Zhang authored 5 years ago

check umc error count in both ras query function and
ras interrupt handler
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

939e2258

drm/amdgpu: move some ras data structure to amdgpu_ras.h · 7af25d5b

Hawking Zhang authored 5 years ago

These are common structures that can be included by IP specific
source files
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7af25d5b

18 Jul, 2019 2 commits

drm/amdgpu: drop ras self test · 33c976c9

Hawking Zhang authored 5 years ago

This function is not needed any more. Error injection is
the only way to validate ras, but it can't be executed in
amdgpu_ras_init, where the gpu is not even initialized.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

33c976c9

drm/amdgpu: only allow error injection to UMC IP block · a5dd40ca

Hawking Zhang authored 5 years ago

Error injection to other IP blocks (except UMC) will be enabled
once the RAS feature stabilizes on those IP blocks.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a5dd40ca