- 07 Feb, 2022 23 commits
-
-
Rajneesh Bhardwaj authored
During checkpoint stage, save the shared virtual memory ranges and attributes for the target process. A process may contain a number of svm ranges and each range might contain a number of attributes. While not all attributes may be applicable for a given prange but during checkpoint we store all possible values for the max possible attribute types. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
A KFD process may contain a number of virtual address ranges for shared virtual memory management and each such range can have many SVM attributes spanning across various nodes within the process boundary. This change reports the total number of such SVM ranges and their total private data size by extending the PROCESS_INFO op of the the CRIU IOCTL to discover the svm ranges in the target process and a future patches brings in the required support for checkpoint and restore for SVM ranges. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
Currently the SVM ranges use actual_gpu_id but with Checkpoint Restore support its possible that the SVM ranges can be resumed on another node where the actual_gpu_id may not be same as the original (user_gpu_id) gpu id. So modify svm code to use user_gpu_id. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
Both svm_range_get_attr and svm_range_set_attr helpers use mm struct from current but for a Checkpoint or Restore operation, the current->mm will fetch the mm for the CRIU master process. So modify these helpers to accept the task mm for a target kfd process to support Checkpoint Restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
Recoverable page faults are represented by the xnack mode setting inside a kfd process and are used to represent the device page faults. For CR, we don't consider negative values which are typically used for querying the current xnack mode without modifying it. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
KFD buffer objects do not associate a GEM handle with them so cannot directly be used with libdrm to initiate a system dma (sDMA) operation to speedup the checkpoint and restore operation so export them as dmabuf objects and use with libdrm helper (amdgpu_bo_import) to further process the sdma command submissions. With sDMA, we see huge improvement in checkpoint and restore operations compared to the generic pci based access via host data path. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
When doing a restore on a different node, the gpu_id's on the restore node may be different. But the user space application will still refer use the original gpu_id's in the ioctl calls. Adding code to create a gpu id mapping so that kfd can determine actual gpu_id during the user ioctl's. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
Add support to existing CRIU ioctl's to save and restore events during criu checkpoint and restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
Checkpoint contents of queue control stacks on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
Checkpoint contents of queue MQD's on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
When re-creating queues during CRIU restore, restore the queue with the same doorbell id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
When re-creating queues during CRIU restore, restore the queue with the same sdma id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
When re-creating queues during CRIU restore, restore the queue with the same queue id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
Add support to existing CRIU ioctl's to save number of queues and queue properties for each queue during checkpoint and re-create queues on restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
David Yat Sin authored
Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO op the queues will be stay in an evicted state. Once the plugin is done draining BO contents, it is safe to perform an UNPAUSE op for the queues to resume. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
This adds support to create userptr BOs on restore and introduces a new ioctl op to restart memory notifiers for the restored userptr BOs. When doing CRIU restore MMU notifications can happen anytime after we call amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the restore process i.e. criu_resume ioctl op is received, and the process is ready to be resumed. This ioctl is different from other KFD CRIU ioctls since its called by CRIU master restore process for all the target processes being resumed by CRIU. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
This implements the KFD CRIU Restore ioctl that lays the basic foundation for the CRIU restore operation. It provides support to create the buffer objects corresponding to the checkpointed image. This ioctl creates various types of buffer objects such as VRAM, MMIO, Doorbell, GTT based on the date sent from the userspace plugin. The data mostly contains the previously checkpointed KFD images from some KFD processs. While restoring a criu process, attach old IDR values to newly created BOs. This also adds the minimal gpu mapping support for a single gpu checkpoint restore use case. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
This adds support to discover the buffer objects that belong to a process being checkpointed. The data corresponding to these buffer objects is returned to user space plugin running under criu master context which then stores this info to recreate these buffer objects during a restore operation. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
This IOCTL op is expected to be called as a precursor to the actual Checkpoint operation. This does the basic discovery into the target process seized by CRIU and relays the information to the userspace that utilizes it to start the Checkpoint operation via another dedicated IOCTL op. The process_info IOCTL op determines the number of GPUs, buffer objects that are associated with the target process, its process id in caller's namespace since /proc/pid/mem interface maybe used to drain the contents of the discovered buffer objects in userspace and getpid returns the pid of CRIU dumper process. Also the pid of a process inside a container might be different than its global pid so return the ns pid. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Rajneesh Bhardwaj authored
Checkpoint-Restore in userspace (CRIU) is a powerful tool that can snapshot a running process and later restore it on same or a remote machine but expects the processes that have a device file (e.g. GPU) associated with them, provide necessary driver support to assist CRIU and its extensible plugin interface. Thus, In order to support the Checkpoint-Restore of any ROCm process, the AMD Radeon Open Compute Kernel driver, needs to provide a set of new APIs that provide necessary VRAM metadata and its contents to a userspace component (CRIU plugin) that can store it in form of image files. This introduces some new ioctls which will be used to checkpoint-Restore any KFD bound user process. KFD only allows ioctl calls from the same process that opened the KFD file descriptor. Since these ioctls are expected to be called from a KFD criu plugin which has elevated ptrace attached privileges and CAP_CHECKPOINT_RESTORE capabilities attached with the file descriptors so modify KFD to allow such calls. (API redesigned by David Yat Sin) Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Luben Tuikov authored
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes the dmesg log to fill up with the same message, e.g, [18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function. Make it dev_dbg_once(), as it isn't something correctible during boot or thereafter, so printing just once is sufficient. Also sanitize the message. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: yipechai <YiPeng.Chai@amd.com> Fixes: 8b0fb0e9 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
Some people complained about the name and this matches much more Linux naming conventions for object functions. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
Whenever a bo_va structure is added or removed the VM and eventually added BO should be locked. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 02 Feb, 2022 17 commits
-
-
Magali Lemes authored
Assigning 0L to a pointer variable caused the following warning: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dsc/rc_calc_fpu.c:71:40: warning: Using plain integer as NULL pointer In order to remove this warning, this commit assigns a NULL pointer to the pointer variable that caused this issue. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Magali Lemes <magalilemes00@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Darren Powell authored
(v3) Rewrote patchset to order patches as (API, hw impl, usecase) - added API for new power management function emit_clk_levels This function should duplicate the functionality of print_clk_levels, but this solution passes the buffer base and write offset down the stack. - new powerplay function emit_clock_levels, implemented by smu_emit_ppclk_levels() This function parallels the implementation of smu_print_ppclk_levels and calls emit_clk_levels, and allows the returns of errors - new helper function smu_convert_to_smuclk called by smu_print_ppclk_levels and smu_emit_ppclk_levels Signed-off-by: Darren Powell <darren.powell@amd.com> Reviewed-By: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Somalapuram Amaranath authored
trace_amdgpu_vm_update_ptes trace unable to log when nptes too large Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Mario Limonciello authored
dGPUs connected to Intel systems configured for suspend to idle will not have the power rails cut at suspend and resetting the GPU may lead to problematic behaviors. Fixes: e25443d2 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
We ran into the problem that clearing really larger buffer (60GiB) caused an SDMA timeout. Restructure the function to use the dst window instead of mapping the whole buffer into the GART and then fill only 2MiB/256MiB chunks at a time. v2: rebase on restructured window map. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
Instead of limiting the size before we call the mapping function let the function itself limit the size. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
That should never happen, but make sure that we only warn instead of crash. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Christian König authored
We probably never trigger this, but the logic inside the check is inverted. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Aun-Ali Zaidi authored
The eDP link rate reported by the DP_MAX_LINK_RATE dpcd register (0xa) is contradictory to the highest rate supported reported by EDID (0xc = LINK_RATE_RBR2). The effects of this compounded with commit '4a8ca46b ("drm/amd/display: Default max bpc to 16 for eDP")' results in no display modes being found and a dark panel. For now, simply force the maximum supported link rate for the eDP attached 2018 15" Apple Retina panels. Additionally, we must also check the firmware revision since the device ID reported by the DPCD is identical to that of the more capable 16,1, incorrectly quirking it. We also use said firmware check to quirk the refreshed 15,1 models with Vega graphics as they use a slightly newer firmware version. Tested-by: Aun-Ali Zaidi <admin@kodeit.net> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Aun-Ali Zaidi <admin@kodeit.net> Signed-off-by: Aditya Garg <gargaditya08@live.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Yang Li authored
Eliminate the follow smatch warning: drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c:2246 dp_perform_8b_10b_link_training() warn: inconsistent indenting Reviewed-by: Harry Wentland <harry.wentland@amd.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Fangzhi Zuo authored
DP2 sequence is triggered only if VESA certified cable is detected. Force DP2 sequence with uncertified cable for testing purpose. Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Fangzhi Zuo <Jerry.Zuo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Aric Cyr authored
This version brings along following fixes: - DC refactor and bug fixes for DP links - Bug fixes for DP2 - Fix regressions causing display not light up - Improved debug trace - Improved DP AUX transfer - Updated watermark latencies to fix underflows in some modes Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Aric Cyr <aric.cyr@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Anthony Koo authored
- Correct number of reserved bits in cmd_lock_hw - Extend bits of hw_lock_client to allow for more clients Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Anthony Koo <Anthony.Koo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Wenjing Liu authored
[why] Move link_hwss to its own folder as part of DC LIB and break it down to separate file one for each type of backend for code isolation. Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Reviewed-by: Jun Lei <Jun.Lei@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Wenjing Liu <wenjing.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Wenjing Liu authored
[why] Isolate the way to obtain link_hwss from the actual implemenation of link_hwss. So the caller can call link_hwss without knowing the implementation detail of link_hwss. Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Reviewed-by: Jun Lei <Jun.Lei@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Wenjing Liu <wenjing.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Wenjing Liu authored
[why] Clean up dc_link_hwss file in the preparation of breaking it down to file for each encoder type. We temporarly move the original dp link functions in link_hwss back to dc_link_dp. We will break dc_link_dp down after link_hwss is in good shape. Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Reviewed-by: Jun Lei <Jun.Lei@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Wenjing Liu <wenjing.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Wenjing Liu authored
[why] Factor set dp lane settings to link_hwss. v2: fix statement with no effect warning (Alex) Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Reviewed-by: Jun Lei <Jun.Lei@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Wenjing Liu <wenjing.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-