Commit db0f246c authored by André Almeida, committed by Christian König

drm/doc: Document DRM device reset expectations

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
Acked-by: Sebastian Wick <sebastian.wick@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230929092509.42042-1-andrealmeid@igalia.com
parent 988d0ff2
@@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
mmapped regular files. Threads cause additional pain with signal
handling as well.

Device reset
============

The GPU stack is complex and prone to errors, from hardware bugs to faulty
applications and everything in the many layers in between. Some errors
require resetting the device in order to make the device usable again. This
section describes the expectations for DRM and usermode drivers when a
device resets and how to propagate the reset status.

Device resets cannot be disabled without tainting the kernel, since disabling
them can lead to hanging the entire kernel through shrinkers/mmu_notifiers.
Userspace's role in device resets is to propagate the message to the
application and to apply any special policy for blocking guilty applications,
if there is one. A corollary is that debugging a hung GPU context requires
hardware support for preempting that context while it is stopped.

Kernel Mode Driver
------------------

The KMD is responsible for checking if the device needs a reset, and for
performing it as needed. Usually a hang is detected when a job gets stuck
executing. The KMD should keep track of resets, because userspace can query the
reset status of a specific context at any time. This is needed to propagate to
the rest of the stack that a reset has happened. Currently, this is implemented
by each driver separately, with no common DRM interface. Ideally this should be
properly integrated into the DRM scheduler to provide a common ground for all
drivers. After a reset, the KMD should reject new command submissions for
affected contexts.
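
The bookkeeping described above can be pictured with a small, driver-agnostic
sketch. It is not taken from any real driver: the ``sketch_`` structure and
functions are hypothetical, and the ``-ECANCELED`` return value is an
assumption, since each driver currently picks its own interface.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-context bookkeeping; not a real DRM structure. */
struct sketch_ctx {
	uint32_t reset_count;	/* resets that affected this context */
	bool guilty;		/* a job from this context caused a hang */
	bool banned;		/* reject further submissions */
};

/* Called by the (hypothetical) hang handler once the device was reset. */
static void sketch_ctx_mark_reset(struct sketch_ctx *ctx, bool guilty)
{
	ctx->reset_count++;
	ctx->guilty = ctx->guilty || guilty;
	ctx->banned = true;
}

/* Submission path: refuse work from contexts affected by a reset. */
static int sketch_ctx_submit(const struct sketch_ctx *ctx)
{
	if (ctx->banned)
		return -ECANCELED;	/* assumed errno; drivers differ */
	return 0;
}

/* Query path: userspace may ask about the reset status at any time. */
static uint32_t sketch_ctx_query_resets(const struct sketch_ctx *ctx)
{
	return ctx->reset_count;
}
```

A context that was never marked keeps submitting normally; once
``sketch_ctx_mark_reset()`` has run, every later submission is refused and the
query reports a non-zero reset count.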

User Mode Driver
----------------

After command submission, the UMD should check whether the submission was
accepted or rejected. After a reset, the KMD should reject submissions, and the
UMD can issue an ioctl to the KMD to check the reset status; it can check more
often if it requires it. After detecting a reset, the UMD reports it to the
application using the appropriate API error code, as explained in the section
below about robustness.
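
As a rough illustration of that flow, the sketch below models the submit and
reset-query ioctls as two callbacks. Everything named ``sketch_`` or ``fake_``
is hypothetical and only stands in for driver-specific interfaces; the fake
KMD exists solely to exercise the logic.

```c
#include <stdbool.h>

/* Hypothetical API-level results the UMD reports to the application. */
enum sketch_result {
	SKETCH_OK,
	SKETCH_CONTEXT_LOST,	/* a device reset affected this context */
	SKETCH_SUBMIT_FAILED,	/* rejected for an unrelated reason */
};

/* Stand-ins for the driver-specific submit and reset-query ioctls. */
struct sketch_kmd_ops {
	int (*submit)(void *ctx);	/* 0 on success, negative on rejection */
	bool (*was_reset)(void *ctx);	/* true if the context saw a reset */
};

static enum sketch_result sketch_umd_submit(const struct sketch_kmd_ops *kmd,
					     void *ctx)
{
	if (kmd->submit(ctx) == 0)
		return SKETCH_OK;

	/* Submission rejected: ask the KMD whether a reset is the cause. */
	if (kmd->was_reset(ctx))
		return SKETCH_CONTEXT_LOST;

	return SKETCH_SUBMIT_FAILED;
}

/* Fake KMD for illustration: ctx points to a "reset happened" flag. */
static int fake_submit(void *ctx)
{
	return *(bool *)ctx ? -1 : 0;
}

static bool fake_was_reset(void *ctx)
{
	return *(bool *)ctx;
}
```

The ``SKETCH_CONTEXT_LOST`` result is the point where a real UMD would map the
reset to the error code of the graphics API in use.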

Robustness
----------

The only way to try to keep a graphical API context working after a reset is
for it to comply with the robustness aspects of the graphical API that it is
using.

Graphical APIs provide ways for applications to deal with device resets.
However, there is no guarantee that the app will use such features correctly,
and a userspace that doesn't support robust interfaces (like a non-robust
OpenGL context, or an API without any robustness support, like libva) leaves
the robustness handling entirely to the userspace driver. There is no strong
community consensus on what the userspace driver should do in that case,
since all reasonable approaches have some clear downsides.

OpenGL
~~~~~~

Apps using OpenGL should use the available robust interfaces, like the
extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
interface reports whether a reset has happened; if so, all the context state is
considered lost and the app proceeds by creating a new context. There's no
consensus on what to do if robustness is not in use.
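
For illustration, an app with a robust context would typically poll
``glGetGraphicsResetStatusARB()`` (for example once per frame) and act on the
value it returns. The sketch below only shows the decision step, so that it
builds without a GL context; the ``sketch_`` names are hypothetical, and the
enum values are repeated from ``GL_ARB_robustness`` in case the GL headers are
not included.

```c
/* Values from GL_ARB_robustness; normally provided by the GL headers. */
#ifndef GL_NO_ERROR
#define GL_NO_ERROR 0
#endif
#ifndef GL_GUILTY_CONTEXT_RESET_ARB
#define GL_GUILTY_CONTEXT_RESET_ARB   0x8253
#define GL_INNOCENT_CONTEXT_RESET_ARB 0x8254
#define GL_UNKNOWN_CONTEXT_RESET_ARB  0x8255
#endif

/* Hypothetical app-side policy, not part of any API. */
enum sketch_gl_action {
	SKETCH_KEEP_CONTEXT,
	SKETCH_RECREATE_CONTEXT,
};

/* 'status' is the value returned by glGetGraphicsResetStatusARB(). */
static enum sketch_gl_action sketch_handle_reset_status(unsigned int status)
{
	switch (status) {
	case GL_NO_ERROR:
		return SKETCH_KEEP_CONTEXT;
	case GL_GUILTY_CONTEXT_RESET_ARB:
	case GL_INNOCENT_CONTEXT_RESET_ARB:
	case GL_UNKNOWN_CONTEXT_RESET_ARB:
	default:
		/* All context state is considered lost: start over. */
		return SKETCH_RECREATE_CONTEXT;
	}
}
```

Whether the context was guilty or innocent does not change the recovery step
in this sketch; an app could still use the distinction for logging or for
deciding whether to replay the same workload.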

Vulkan
~~~~~~

Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` on submissions.
This error code means, among other things, that a device reset has happened and
that the app needs to recreate its contexts to keep going.
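
A minimal sketch of that check follows. It takes the ``VkResult`` of a
submission (for example from ``vkQueueSubmit()``) as a plain integer so that
it builds without the Vulkan headers; the ``SKETCH_VK_`` constants mirror the
values of ``VK_SUCCESS`` and ``VK_ERROR_DEVICE_LOST``, and the
``sketch_`` action names are hypothetical.

```c
/* Mirrors of VkResult values; real code uses <vulkan/vulkan.h> instead. */
#define SKETCH_VK_SUCCESS		0
#define SKETCH_VK_ERROR_DEVICE_LOST	(-4)

/* Hypothetical app-side policy, not part of the Vulkan API. */
enum sketch_vk_action {
	SKETCH_VK_CONTINUE,
	SKETCH_VK_RECREATE_DEVICE,	/* tear down and rebuild device state */
	SKETCH_VK_FAIL,			/* some other error: report it */
};

/* 'result' is the VkResult returned by a queue submission. */
static enum sketch_vk_action sketch_after_submit(int result)
{
	if (result == SKETCH_VK_SUCCESS)
		return SKETCH_VK_CONTINUE;
	if (result == SKETCH_VK_ERROR_DEVICE_LOST)
		return SKETCH_VK_RECREATE_DEVICE;
	return SKETCH_VK_FAIL;
}
```

Since ``VK_ERROR_DEVICE_LOST`` can have causes other than a reset, the sketch
treats it only as a signal to rebuild, not as proof that a reset occurred.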

Reporting causes of resets
--------------------------

Apart from propagating the reset through the stack so apps can recover, it's
really useful for driver developers to learn more about what caused the reset in
the first place. DRM devices should make use of devcoredump to store relevant
information about the reset, so this information can be added to user bug
reports.
.. _drm_driver_ioctl:
IOCTL Support on Device Nodes
......