1. 21 Aug, 2024 16 commits
    • drm/amdkfd: Update BadOpcode Interrupt handling with MES · eb067d65
      Mukul Joshi authored
      Based on the recommendation of MEC FW, update BadOpcode interrupt
      handling when using the MES scheduler: unmap all queues, remove the
      queue that got the interrupt from scheduling, and remap the rest of
      the queues. This prevents the case where unmapping only the bad queue
      fails and thereby causes a GPU reset.
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
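The recovery order described in this commit (unmap everything, drop the offending queue, remap the rest) can be modeled in a few lines of user-space C. This is only a sketch: `struct sketch_queue` and `sketch_handle_bad_opcode` are hypothetical names, not driver symbols, and the real driver issues MES firmware requests rather than flipping flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user queue known to the scheduler. */
struct sketch_queue {
    bool mapped;   /* currently mapped to hardware */
    bool evicted;  /* permanently removed from scheduling */
};

/* Model of the three-step sequence: unmap all, evict the bad queue, remap the rest. */
static void sketch_handle_bad_opcode(struct sketch_queue *qs, size_t n, size_t bad)
{
    size_t i;

    assert(bad < n);

    /* Step 1: unmap every queue so the bad one is never unmapped on its own. */
    for (i = 0; i < n; i++)
        qs[i].mapped = false;

    /* Step 2: take the offending queue out of scheduling. */
    qs[bad].evicted = true;

    /* Step 3: map the remaining queues back. */
    for (i = 0; i < n; i++)
        if (!qs[i].evicted)
            qs[i].mapped = true;
}
```

The point of the ordering is that the bad queue is only ever unmapped as part of a bulk unmap, which is the case the commit message says avoids a failed unmap and the resulting GPU reset.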
    • drm/amdkfd: Update queue unmap after VM fault with MES · 9a16042f
      Mukul Joshi authored
      MEC FW expects MES to unmap all queues when a VM fault is observed
      on a queue and to resume them once the affected process is terminated.
      Use the MES Suspend and Resume APIs to achieve this.
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Implement MES Suspend and Resume APIs for GFX11 · ccf8ef6b
      Mukul Joshi authored
      Add implementation for MES Suspend and Resume APIs to unmap/map
      all queues for GFX11. Support for GFX12 will be added when the
      corresponding firmware support is in place.
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: Enable processes isolation on gfx9 · 87758a0e
      Amber Lin authored
      When amdgpu enables enforce_isolation, KFD enables single-process mode in
      HWS and sets the exec_cleaner_shader bit in MAP_PROCESS.
      Signed-off-by: Amber Lin <Amber.Lin@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu/gfx_v9_4_3: Apply Isolation Enforcement to GFX & Compute rings · f846250b
      Srinivasan Shanmugam authored
      This commit applies isolation enforcement to the GFX and Compute rings
      in the gfx_v9_4_3 module.
      
      The commit sets `amdgpu_gfx_enforce_isolation_ring_begin_use` and
      `amdgpu_gfx_enforce_isolation_ring_end_use` as the functions to be
      called when a ring begins and ends its use, respectively.
      
      `amdgpu_gfx_enforce_isolation_ring_begin_use` is called when a ring
      begins its use. This function cancels any scheduled
      `enforce_isolation_work` and, if necessary, signals the Kernel Fusion
      Driver (KFD) to stop the runqueue.
      
      `amdgpu_gfx_enforce_isolation_ring_end_use` is called when a ring ends
      its use. This function schedules `enforce_isolation_work` to be run
      after a delay.
      
      These functions are part of the Enforce Isolation Handler, which
      enforces shader isolation on AMD GPUs to prevent data leakage between
      different processes.
      
      The commit also includes a check for the type of the ring. If the type
      of the ring is `AMDGPU_RING_TYPE_COMPUTE`, the `xcp_id` of the
      `enforce_isolation` structure in the `gfx` structure of the
      `amdgpu_device` is set to the `xcp_id` of the ring. This ensures that
      the correct `xcp_id` is used when enforcing isolation on compute rings.
      The `xcp_id` is an identifier for an XCP partition, and different rings
      can be associated with different XCP partitions.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
    • drm/amdgpu/gfx9: Apply Isolation Enforcement to GFX & Compute rings · b710dbe5
      Srinivasan Shanmugam authored
      This commit applies isolation enforcement to the GFX and Compute rings
      in the gfx_v9_0 module.
      
      The commit sets `amdgpu_gfx_enforce_isolation_ring_begin_use` and
      `amdgpu_gfx_enforce_isolation_ring_end_use` as the functions to be
      called when a ring begins and ends its use, respectively.
      
      `amdgpu_gfx_enforce_isolation_ring_begin_use` is called when a ring
      begins its use. This function cancels any scheduled
      `enforce_isolation_work` and, if necessary, signals the Kernel Fusion
      Driver (KFD) to stop the runqueue.
      
      `amdgpu_gfx_enforce_isolation_ring_end_use` is called when a ring ends
      its use. This function schedules `enforce_isolation_work` to be run
      after a delay.
      
      These functions are part of the Enforce Isolation Handler, which
      enforces shader isolation on AMD GPUs to prevent data leakage between
      different processes.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
    • drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization · afefd6f2
      Srinivasan Shanmugam authored
      This commit introduces the Enforce Isolation Handler designed to enforce
      shader isolation on AMD GPUs, which helps to prevent data leakage
      between different processes.
      
      The handler counts the number of emitted fences for each GFX and compute
      ring. If there are any fences, it schedules the `enforce_isolation_work`
      to be run after a delay of `GFX_SLICE_PERIOD`. If there are no fences,
      it signals the Kernel Fusion Driver (KFD) to resume the runqueue.
      
      The function is synchronized using the `enforce_isolation_mutex`.
      
      This commit also introduces a reference count mechanism
      (kfd_sch_req_count) to keep track of the number of requests to enable
      the KFD scheduler. When a request to enable the KFD scheduler is made,
      the reference count is decremented. When the reference count reaches
      zero, a delayed work is scheduled to enforce isolation after a delay of
      GFX_SLICE_PERIOD.
      
      When a request to disable the KFD scheduler is made, the function first
      checks if the reference count is zero. If it is, it cancels the delayed
      work for enforcing isolation and checks if the KFD scheduler is active.
      If the KFD scheduler is active, it sends a request to stop the KFD
      scheduler and sets the KFD scheduler state to inactive. Then, it
      increments the reference count.
      
      The function is synchronized using the kfd_sch_mutex to ensure that the
      KFD scheduler state and reference count are updated atomically.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
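The reference-count behaviour described in this commit message can be illustrated with a small user-space model. Everything here is an assumption made for illustration: `struct sketch_gate` and the `sketch_*` functions are hypothetical names, the delayed work is reduced to a boolean, and the locking that the driver does with `kfd_sch_mutex` is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state guarded by kfd_sch_mutex in the driver. */
struct sketch_gate {
    unsigned int req_count; /* outstanding requests to keep KFD stopped */
    bool kfd_active;        /* whether the KFD scheduler is running */
    bool work_pending;      /* stands in for the delayed isolation work */
};

/* Request to disable the KFD scheduler (a ring begins use). */
static void sketch_stop_kfd(struct sketch_gate *g)
{
    if (g->req_count == 0) {
        g->work_pending = false;   /* cancel the delayed work */
        if (g->kfd_active)
            g->kfd_active = false; /* the driver would ask KFD to stop here */
    }
    g->req_count++;
}

/* Request to enable the KFD scheduler (a ring ends use). */
static void sketch_start_kfd(struct sketch_gate *g)
{
    assert(g->req_count > 0);      /* an unbalanced call would underflow */
    g->req_count--;
    if (g->req_count == 0)
        g->work_pending = true;    /* the driver would arm work for GFX_SLICE_PERIOD */
}

/* Delayed work firing with no fences outstanding: KFD may run again. */
static void sketch_work_fired(struct sketch_gate *g)
{
    if (g->work_pending && g->req_count == 0) {
        g->work_pending = false;
        g->kfd_active = true;
    }
}
```

In this model only the first stop request and the last start request change scheduler state; nested requests in between just adjust the counter, which is the property the reference count exists to provide.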
    • drm/amdkfd: APIs to stop/start KFD scheduling · 234eebe1
      Amber Lin authored
      Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
      compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
      When amdgpu_amdkfd_stop_sched() is called, KFD unmaps queues from the
      runlist. If users send ioctls to KFD to create queues, the queues are
      added but not mapped to the runlist (so not scheduled) until
      amdgpu_amdkfd_start_sched() is called.
      
      v2: fix build (Alex)
      Signed-off-by: Amber Lin <Amber.Lin@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
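The stop/start semantics described above, including the case of queues created while scheduling is stopped, can be sketched as a toy queue manager. The names `struct sketch_mgr`, `sketch_stop_sched`, `sketch_start_sched` and `sketch_create_queue` are hypothetical and only mirror the behaviour stated in the commit message; they are not the KFD implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_QUEUES 8

/* Hypothetical queue manager: tracks which queues are on the runlist. */
struct sketch_mgr {
    bool sched_stopped;
    size_t nr_queues;
    bool mapped[SKETCH_MAX_QUEUES];
};

static void sketch_set_all(struct sketch_mgr *m, bool on)
{
    size_t i;

    assert(m != NULL);
    for (i = 0; i < m->nr_queues; i++)
        m->mapped[i] = on;
}

/* Stop scheduling: every queue leaves the runlist. */
static void sketch_stop_sched(struct sketch_mgr *m)
{
    m->sched_stopped = true;
    sketch_set_all(m, false);
}

/* Resume scheduling: every known queue, old or new, is mapped. */
static void sketch_start_sched(struct sketch_mgr *m)
{
    m->sched_stopped = false;
    sketch_set_all(m, true);
}

/* Queue creation always succeeds while there is room, but mapping is deferred
 * when scheduling is stopped. Returns the queue index, or -1 when full. */
static int sketch_create_queue(struct sketch_mgr *m)
{
    size_t idx;

    if (m->nr_queues == SKETCH_MAX_QUEUES)
        return -1;
    idx = m->nr_queues++;
    m->mapped[idx] = !m->sched_stopped;
    return (int)idx;
}
```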
    • drm/amdgpu/gfx9: Add cleaner shader support for GFX9.4.4 hardware · b1f49ff9
      Srinivasan Shanmugam authored
      This commit extends the cleaner shader feature to support GFX9.4.4
      hardware.
      
      The cleaner shader feature is used to clear or initialize certain GPU
      resources, such as Local Data Share (LDS), Vector General Purpose
      Registers (VGPRs), and Scalar General Purpose Registers (SGPRs). This
      operation needs to be performed in isolation, with no other tasks
      running on the GPU at the same time.
      
      Previously, the cleaner shader feature was implemented for GFX9.4.3
      hardware. This commit adds support for GFX9.4.4 hardware by allowing the
      cleaner shader to be used with this hardware version.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu/gfx9: Add cleaner shader for GFX9.4.3 · 33528831
      Srinivasan Shanmugam authored
      This commit adds the cleaner shader microcode for GFX9.4.3 GPUs. The
      cleaner shader is a piece of GPU code that is used to clear or
      initialize certain GPU resources, such as Local Data Share (LDS), Vector
      General Purpose Registers (VGPRs), and Scalar General Purpose Registers
      (SGPRs).
      
      Clearing these resources is important for ensuring data isolation
      between different workloads running on the GPU. Without the cleaner
      shader, residual data from a previous workload could potentially be
      accessed by a subsequent workload, leading to data leaks and incorrect
      computation results.
      
      The cleaner shader microcode is represented as an array of 32-bit words
      (`gfx_9_4_3_cleaner_shader_hex`). This array is the binary
      representation of the cleaner shader code, which is written in a
      low-level GPU instruction set.
      
      When the cleaner shader feature is enabled, the AMDGPU driver loads this
      array into a specific location in the GPU memory. The GPU then reads
      this memory location to fetch and execute the cleaner shader
      instructions.
      
      The cleaner shader is executed automatically by the GPU at the end of
      each workload, before the next workload starts. This ensures that all
      GPU resources are in a clean state before the start of each workload.
      
      This addition is part of the cleaner shader feature implementation. The
      cleaner shader feature helps improve GPU performance and resource
      utilization by cleaning up GPU resources after they are used. It also
      enhances security and reliability by preventing data leaks between
      workloads.
      
      v2: fix copyright date (Alex)
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu/gfx9: Implement cleaner shader support for GFX9.4.3 hardware · d4c38154
      Srinivasan Shanmugam authored
      The patch modifies the gfx_v9_4_3_kiq_set_resources function to write
      the cleaner shader's memory controller address to the ring buffer. It
      also adds a new function, gfx_v9_4_3_ring_emit_cleaner_shader, which
      emits the PACKET3_RUN_CLEANER_SHADER packet to the ring buffer.
      
      This patch adds support for the PACKET3_RUN_CLEANER_SHADER packet in the
      gfx_v9_4_3 module. This packet is used to emit the cleaner shader, which
      is used to clear GPU memory before it's reused, helping to prevent data
      leakage between different processes.
      
      Finally, the patch updates the ring function structures to include the
      new gfx_v9_4_3_ring_emit_cleaner_shader function. This allows the
      cleaner shader to be emitted as part of the ring's operations.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu/gfx9: Implement cleaner shader support for GFX9 hardware · c2e70d30
      Srinivasan Shanmugam authored
      The patch modifies the gfx_v9_0_kiq_set_resources function to write
      the cleaner shader's memory controller address to the ring buffer. It
      also adds a new function, gfx_v9_0_ring_emit_cleaner_shader, which
      emits the PACKET3_RUN_CLEANER_SHADER packet to the ring buffer.
      
      This patch adds support for the PACKET3_RUN_CLEANER_SHADER packet in the
      gfx_v9_0 module. This packet is used to emit the cleaner shader, which
      is used to clear GPU memory before it's reused, helping to prevent data
      leakage between different processes.
      
      Finally, the patch updates the ring function structures to include the
      new gfx_v9_0_ring_emit_cleaner_shader function. This allows the
      cleaner shader to be emitted as part of the ring's operations.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Add PACKET3_RUN_CLEANER_SHADER for cleaner shader execution · 22ff907d
      Srinivasan Shanmugam authored
      This commit adds the PACKET3_RUN_CLEANER_SHADER definition. This packet
      is a command packet used to instruct the GPU to execute the cleaner
      shader.
      
      The cleaner shader is a piece of GPU code that is used to clear or
      initialize certain GPU resources, such as Local Data Share (LDS), Vector
      General Purpose Registers (VGPRs), and Scalar General Purpose Registers
      (SGPRs). Clearing these resources is important for ensuring data
      isolation between different workloads running on the GPU.
      
      The PACKET3_RUN_CLEANER_SHADER packet is used to trigger the execution
      of the cleaner shader on the GPU. The packet consists of a header
      followed by a RESERVED field, which is programmed to zero. When the GPU
      receives this packet, it fetches and executes the cleaner shader
      instructions from the location specified in the packet.
      
      The cleaner shader feature helps enhance security and reliability by
      preventing data leaks between workloads.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
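The packet layout described above (a header followed by one RESERVED dword programmed to zero) can be sketched as follows. The header encoding follows the usual PM4 type-3 layout (type in bits 31:30, count in bits 29:16, opcode in bits 15:8). The opcode value `0xD2`, the `SKETCH_*` macros, and the flat array standing in for a ring buffer are assumptions for illustration, not definitions taken from this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PACKET_TYPE3           3u
#define SKETCH_OP_RUN_CLEANER_SHADER  0xD2u /* assumed opcode value */

/* Build a type-3 header. 'count' is the number of body dwords minus one. */
static uint32_t sketch_packet3(uint32_t op, uint32_t count)
{
    return (SKETCH_PACKET_TYPE3 << 30) |
           ((count & 0x3FFFu) << 16) |
           ((op & 0xFFu) << 8);
}

/* Write the two-dword packet at 'wptr' and return the new write pointer. */
static size_t sketch_emit_cleaner_shader(uint32_t *ring, size_t wptr, size_t ring_dw)
{
    assert(wptr + 2 <= ring_dw);   /* caller must have reserved space */

    ring[wptr++] = sketch_packet3(SKETCH_OP_RUN_CLEANER_SHADER, 0);
    ring[wptr++] = 0;              /* RESERVED, programmed to zero */
    return wptr;
}
```

With a single body dword the count field is zero, so under the assumed opcode the header comes out as `0xC000D200`.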
    • drm/amdgpu: Add sysfs interface for running cleaner shader · d361ad5d
      Srinivasan Shanmugam authored
      This patch adds a new sysfs interface for running the cleaner shader on
      AMD GPUs. The cleaner shader is used to clear GPU memory before it's
      reused, which can help prevent data leakage between different processes.
      
      The new sysfs file is write-only and is named `run_cleaner_shader`.
      Write the number of the partition to this file to trigger the cleaner shader
      on that partition. There is only one partition on GPUs which do not
      support partitioning.
      
      Changes made in this patch:
      
      - Added `amdgpu_set_run_cleaner_shader` function to handle writes to the
        `run_cleaner_shader` sysfs file.
      - Added `run_cleaner_shader` to the list of device attributes in
        `amdgpu_device_attrs`.
      - Updated `default_attr_update` to handle `run_cleaner_shader`.
      - Added `AMDGPU_DEVICE_ATTR_WO` macro to create write-only device
        attributes.
      
      v2: fix error handling (Alex)
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
    • drm/amdgpu: Add enforce_isolation sysfs attribute · e189be9b
      Srinivasan Shanmugam authored
      This commit adds a new sysfs attribute 'enforce_isolation' to control
      the 'enforce_isolation' setting per GPU. The attribute can be read and
      written, and accepts values 0 (disabled) and 1 (enabled).
      
      When 'enforce_isolation' is enabled, reserved VMIDs are allocated for
      each ring. When it's disabled, the reserved VMIDs are freed.
      
      The set function locks a mutex before changing the 'enforce_isolation'
      flag and the VMIDs, and unlocks it afterwards. This ensures that these
      operations are atomic and prevents race conditions and other concurrency
      issues.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
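The write path of such an attribute (accept only 0 or 1, then update the flag and the reserved VMIDs together) might look roughly like the user-space model below. `struct sketch_iso` and `sketch_enforce_isolation_store` are hypothetical names, the VMID bookkeeping is reduced to a counter, and the mutex the commit describes is only indicated by a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-GPU state touched by the attribute. */
struct sketch_iso {
    bool enforce;
    int reserved_vmids; /* stand-in for the per-ring reserved VMIDs */
};

/* Parse "0" or "1" (optionally newline-terminated) and apply it.
 * Returns 0 on success or -EINVAL on malformed or out-of-range input. */
static int sketch_enforce_isolation_store(struct sketch_iso *s, const char *buf,
                                          int nr_rings)
{
    char *end;
    long val;

    assert(s != NULL && buf != NULL);

    errno = 0;
    val = strtol(buf, &end, 10);
    if (errno || end == buf || (*end != '\0' && *end != '\n'))
        return -EINVAL;
    if (val != 0 && val != 1)
        return -EINVAL;

    /* The driver holds a mutex around the updates below; omitted here. */
    if (val && !s->enforce)
        s->reserved_vmids = nr_rings; /* reserve one VMID per ring */
    else if (!val && s->enforce)
        s->reserved_vmids = 0;        /* release them again */
    s->enforce = (val != 0);

    return 0;
}
```

Rejecting invalid input before touching any state keeps the flag and the VMID reservation consistent with each other, which is the same goal the mutex serves against concurrent writers.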
    • drm/amdgpu: Enforce isolation as part of the job · dba1a6cf
      Srinivasan Shanmugam authored
      This patch adds a new parameter 'enforce_isolation' to the amdgpu_job
      structure. This parameter is used to determine whether shader isolation
      should be enforced for a job. The enforce_isolation parameter is then
      stored in the amdgpu_job structure and used when flushing the VM.
      
      The enforce_isolation field of the amdgpu_job structure is set directly
      after the job is allocated.
      
      This change allows more fine-grained control over shader isolation,
      making it possible to enforce isolation on a per-job basis rather than
      globally. This can be useful in scenarios where only certain jobs
      require isolation.
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
  2. 16 Aug, 2024 24 commits