    drm/gpu-sched: fix force APP kill hang(v4) · 8ee3a52e
    Emily Deng authored
    issue:
    VMC page faults occur if an application is force-killed during a
    3DMark test. The cause is that entity_fini() manually signals
    all jobs in the entity's queue, which confuses the sync/dependency
    mechanism:
    
    1) A page fault occurs in SDMA's clear job, which operates on the
    shadow buffer: the shadow buffer's GART table is cleaned by
    ttm_bo_release() because the fence in its reservation was fake-signaled
    by entity_fini() when SIGKILL was received.
    
    2) A page fault occurs in the GFX job because, during the lifetime
    of the GFX job, entity_fini() manually fake-signals all jobs from its
    entity. The unmapping/clear-PTE job that depends on those result
    fences is therefore treated as satisfied, SDMA starts clearing the
    PTEs, and this leads to a GFX page fault.
    
    fix:
    1) entity_fini() should at least wait for all already-scheduled jobs to
    complete when SIGKILL is the cause.
    
    2) If a signaled fence is about to clear an entity's dependency, that
    entity should be marked guilty so that its job does not actually run,
    since the dependency was only fake-signaled.
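
The second point of the fix can be illustrated with a small user-space model. This is only a sketch: `toy_fence`, `toy_entity` and the three helpers are hypothetical names invented for the example, not the scheduler's real types or functions, and the error code is a placeholder.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real types. */
struct toy_fence {
    bool signaled;
    int error;                      /* 0 = real completion, <0 = fake-signaled */
};

struct toy_entity {
    struct toy_fence *dependency;   /* fence the next job waits on */
    bool guilty;                    /* set when a dependency was faked */
};

/* Fake-signal a fence, as happens to queued jobs of a killed process. */
void toy_fence_kill(struct toy_fence *f)
{
    f->error = -ESRCH;              /* placeholder error code for this sketch */
    f->signaled = true;
}

/* Called when a signaled fence clears the entity's dependency. */
void toy_entity_clear_dep(struct toy_entity *e)
{
    if (e->dependency && e->dependency->signaled) {
        if (e->dependency->error)
            e->guilty = true;       /* the result is not real: poison the entity */
        e->dependency = NULL;
    }
}

/* Returns true only if the job may really run on the hardware. */
bool toy_entity_job_may_run(const struct toy_entity *e)
{
    return e->dependency == NULL && !e->guilty;
}
```

In this model, a job whose dependency completed normally becomes runnable, while a job whose dependency was fake-signaled is left with a guilty entity and is never allowed to run.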
    
    v2:
    Split drm_sched_entity_fini() into two functions:
    1) The first one does the waiting, removes the entity from the
    runqueue, and returns an error when the process was killed.
    2) The second one then goes over the entity, installs it as the
    completion signal for the remaining jobs, and signals all jobs
    with an error code.
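
A minimal user-space sketch of the two-part split described above. The names `demo_entity`, `demo_entity_do_release` and `demo_entity_cleanup` are hypothetical, the blocking wait is replaced by simply marking scheduled jobs as done, the errno values are placeholders, and the completion-signal installation is not modeled.

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_MAX_JOBS 8

struct demo_job {
    bool scheduled;   /* already handed to the hardware */
    bool done;
    int error;        /* 0 on normal completion */
};

struct demo_entity {
    struct demo_job jobs[DEMO_MAX_JOBS];
    int njobs;
    bool on_runqueue;
    int fini_status;
};

/* Part 1: wait for scheduled jobs, leave the runqueue, report kill status. */
int demo_entity_do_release(struct demo_entity *e, bool process_killed)
{
    for (int i = 0; i < e->njobs; i++)
        if (e->jobs[i].scheduled)
            e->jobs[i].done = true;   /* stands in for blocking until done */
    e->on_runqueue = false;
    e->fini_status = process_killed ? -EINTR : 0;  /* placeholder errno */
    return e->fini_status;
}

/* Part 2: signal every job that never reached the hardware with an error.
 * Returns the number of jobs that were killed this way. */
int demo_entity_cleanup(struct demo_entity *e)
{
    int killed = 0;

    for (int i = 0; i < e->njobs; i++) {
        if (!e->jobs[i].done) {
            e->jobs[i].error = -ESRCH;   /* placeholder error code */
            e->jobs[i].done = true;
            killed++;
        }
    }
    return killed;
}
```

Jobs that were already scheduled finish without an error in part 1; only jobs still sitting in the queue are completed with an error in part 2.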
    
    v3:
    1) Replace fini1 and fini2 with better names.
    2) Call the first part before the VM teardown in
    amdgpu_driver_postclose_kms() and the second part
    after the VM teardown.
    3) Keep the original function drm_sched_entity_fini() to
    refine the code.
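
The call ordering from item 2 of the v3 notes can be sketched as a trace. All names here (`trace_log`, `trace_postclose` and the three step functions) are hypothetical stand-ins for illustration; the real logic lives in amdgpu_driver_postclose_kms() and the scheduler.

```c
#include <stddef.h>

/* Records the order in which the teardown steps run. */
struct trace_log {
    const char *steps[4];
    size_t count;
};

void trace_add(struct trace_log *log, const char *step)
{
    if (log->count < sizeof(log->steps) / sizeof(log->steps[0]))
        log->steps[log->count++] = step;
}

/* Hypothetical stand-ins for the three phases named in the v3 notes. */
void trace_entity_part1(struct trace_log *log)
{
    trace_add(log, "wait_scheduled_jobs");   /* first part of entity fini */
}

void trace_vm_teardown(struct trace_log *log)
{
    trace_add(log, "vm_teardown");
}

void trace_entity_part2(struct trace_log *log)
{
    trace_add(log, "kill_queued_jobs");      /* second part of entity fini */
}

/* Models the ordering in the file-close path: part 1, VM teardown, part 2. */
void trace_postclose(struct trace_log *log)
{
    trace_entity_part1(log);
    trace_vm_teardown(log);
    trace_entity_part2(log);
}
```

The point of the ordering in this sketch is that jobs already on the hardware have finished before the VM is torn down, and the remaining queued jobs are only killed afterwards.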
    
    v4:
    1) Rename entity->finished to entity->last_scheduled.
    2) Rename drm_sched_entity_fini_job_cb() to
    drm_sched_entity_kill_jobs_cb().
    3) Pass NULL to drm_sched_entity_fini_job_cb() if -ENOENT.
    4) Change the type of entity->fini_status to "int".
    5) Remove the check on entity->finished.
    Signed-off-by: Monk Liu <Monk.Liu@amd.com>
    Signed-off-by: Emily Deng <Emily.Deng@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
gpu_scheduler.c 21.3 KB