    sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable · a8532fac
    Tejun Heo authored
    During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
    on every task. To do this, it does get_task_struct() on each iterated task,
    drops the lock and then calls ops.init_task().
    
    However, a TASK_DEAD task may already have lost all its usage count and be
    waiting for an RCU grace period to be freed. If get_task_struct() is called
    on such a task, a use-after-free can happen. To avoid such situations,
    scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
    as they are never going to be scheduled again.
    
    Unfortunately, a racing sched_setscheduler(2) can grab the task before the
    task is unhashed and then continue to e.g. move the task from RT to SCX
    after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
    gone through scx_ops_init_task(), scx_ops_enable_task() called from
    switching_to_scx() triggers the following warning:
    
      sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
      WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
      ...
      RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
      ...
       switching_to_scx+0x13/0xa0
       __sched_setscheduler+0x84e/0xa50
       do_sched_setscheduler+0x104/0x1c0
       __x64_sys_sched_setscheduler+0x18/0x30
       do_syscall_64+0x7b/0x140
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
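
    The "0 -> 3" in the warning can be read against a small userspace model of
    the task state machine. This is only a sketch: the names below are
    hypothetical, and it assumes the states are numbered in the order none (0),
    init (1), ready (2), enabled (3), with "enabled" reachable only from
    "ready". Under that assumption, a task that never went through
    scx_ops_init_task() is still in state 0 when switching_to_scx() tries to
    move it to state 3.

```c
#include <stdbool.h>

/* Hypothetical model of the per-task SCX states (ordering is an assumption). */
enum model_task_state {
	MODEL_TASK_NONE,	/* 0: init callback has not run for this task */
	MODEL_TASK_INIT,	/* 1: init callback ran, task not yet ready */
	MODEL_TASK_READY,	/* 2: fully initialized, not on SCX */
	MODEL_TASK_ENABLED,	/* 3: fully initialized and on SCX */
};

/* Return true if moving from @prev to @next is a valid transition. */
static bool model_transition_ok(enum model_task_state prev,
				enum model_task_state next)
{
	switch (next) {
	case MODEL_TASK_NONE:
		return true;			/* reset is always allowed */
	case MODEL_TASK_INIT:
		return prev == MODEL_TASK_NONE;
	case MODEL_TASK_READY:
		return prev != MODEL_TASK_NONE;
	case MODEL_TASK_ENABLED:
		return prev == MODEL_TASK_READY;	/* 0 -> 3 fails here */
	}
	return false;
}
```

    In this model, model_transition_ok(MODEL_TASK_NONE, MODEL_TASK_ENABLED) is
    false, which corresponds to the skipped-task case reported above.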
    
    As in the ops_disable path, it just doesn't seem like a good idea to leave
    any task in an inconsistent state, even when the task is dead. The root
    cause is that ops_enable could not reliably tell whether a task is truly
    dead (no one else is looking at it and it's about to be freed) and tested
    TASK_DEAD as a proxy instead. Fix it by testing the task's usage count
    directly.
    
    - ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all
      tasks, @include_dead is removed from scx_task_iter_next_locked() along
      with dead task filtering.
    
    - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct()
      fails.
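
    The "skip iff tryget fails" rule can be sketched in userspace with C11
    atomics. This is an illustration, not the kernel code: model_task,
    model_tryget(), model_put() and model_init_all() are hypothetical names,
    and the kernel helper presumably relies on an increment-unless-zero
    refcount primitive rather than the open-coded loop shown here. The point
    is that a task whose usage count has already reached zero is never
    resurrected, while every other task is pinned and initialized regardless
    of its TASK_DEAD state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task with a usage count. */
struct model_task {
	atomic_int usage;	/* 0 means the task is awaiting free */
	bool initialized;	/* set once the init step has run */
};

/* Take a reference unless the count is already zero; NULL on failure. */
static struct model_task *model_tryget(struct model_task *t)
{
	int old = atomic_load(&t->usage);

	while (old != 0) {
		/* On failure, @old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&t->usage, &old, old + 1))
			return t;
	}
	return NULL;
}

/* Drop a reference; this model does not free the task at zero. */
static void model_put(struct model_task *t)
{
	int old = atomic_fetch_sub(&t->usage, 1);

	assert(old > 0);	/* catch reference count underflow */
}

/* Initialize every task that can still be pinned; return how many were. */
static size_t model_init_all(struct model_task *tasks, size_t nr)
{
	size_t nr_init = 0;

	for (size_t i = 0; i < nr; i++) {
		struct model_task *t = model_tryget(&tasks[i]);

		if (!t)
			continue;	/* usage already zero: skip the task */
		/* The kernel would drop the lock and call the sleepable init here. */
		t->initialized = true;
		nr_init++;
		model_put(t);
	}
	return nr_init;
}
```

    Calling model_init_all() on an array that contains one task with a zero
    usage count leaves that task untouched and initializes the rest.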

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: David Vernet <void@manifault.com>
    Cc: Peter Zijlstra <peterz@infradead.org>