    drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini · 5ad7bbf3
    Guilherme G. Piccoli authored
    Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
    routine, but that function is expected to be called only after the
    respective init function, drm_sched_init(), has executed successfully.
    
    We recently hit a driver probe failure on the Steam Deck, and
    drm_sched_fini() was called even though its counterpart had never
    been called, causing the following oops:
    
    amdgpu: probe of 0000:04:00.0 failed with error -110
    BUG: kernel NULL pointer dereference, address: 0000000000000090
    PGD 0 P4D 0
    Oops: 0002 [#1] PREEMPT SMP NOPTI
    CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
    Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
    RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
    [...]
    Call Trace:
     <TASK>
     amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
     amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
     amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
     devm_drm_dev_init_release+0x49/0x70
     [...]
    
    To prevent that, check whether the drm_sched was properly initialized
    for a given ring before calling its fini counterpart.
    
    Ideally we'd use sched.ready for that, since that field is set as the last
    step of drm_sched_init(). But amdgpu seems to "override" the meaning of that
    field: in the above oops, for example, it was a GFX ring causing the crash, and
    sched.ready was set to true in the ring init routine regardless of the state
    of the DRM scheduler. Hence, we ended up using sched.ops as per Christian's
    suggestion [0], and also removed the no_scheduler check [1].
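
As an illustration of the check described above, here is a minimal userspace sketch in C. All `mock_*` types and functions are hypothetical stand-ins invented for this example; they are not the kernel's scheduler structures and not the actual patch. The only thing taken from the text is the idea of testing sched.ops, rather than sched.ready, before calling the fini routine.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the scheduler ops table (assumption). */
struct mock_sched_ops { int unused; };

/* Hypothetical stand-in for the scheduler state (assumption). */
struct mock_sched {
    const struct mock_sched_ops *ops; /* set only by mock_sched_init() */
    bool ready;                       /* may also be set by ring init code */
    int fini_calls;                   /* counts fini invocations, for demo only */
};

struct mock_ring {
    struct mock_sched sched;
};

static const struct mock_sched_ops mock_ops = { 0 };

/* Models a successful scheduler init: ops is assigned, ready is set last. */
void mock_sched_init(struct mock_sched *sched)
{
    sched->ops = &mock_ops;
    sched->ready = true;
}

/* Models the fini routine, which must only run after a successful init. */
void mock_sched_fini(struct mock_sched *sched)
{
    sched->fini_calls++;
    sched->ready = false;
}

/*
 * Models the guarded teardown: fini is skipped when ops was never set,
 * even if some other code path flipped ready to true.
 * Returns true if the fini routine was actually called.
 */
bool mock_fence_driver_sw_fini(struct mock_ring *ring)
{
    if (!ring->sched.ops)
        return false;

    mock_sched_fini(&ring->sched);
    return true;
}
```

In this sketch, a zero-initialized ring whose `ready` flag was set by hand (mimicking the ring init routine) is skipped by `mock_fence_driver_sw_fini()`, while a ring that went through `mock_sched_init()` is torn down exactly once.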
    
    [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
    [1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/
    
    Fixes: 067f44c8 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
    Suggested-by: Christian König <christian.koenig@amd.com>
    Cc: Guchun Chen <guchun.chen@amd.com>
    Cc: Luben Tuikov <luben.tuikov@amd.com>
    Cc: Mario Limonciello <mario.limonciello@amd.com>
    Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
    Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org
amdgpu_fence.c 24.6 KB