• Peter Zijlstra's avatar
    sched/rt: Fix PI handling vs. sched_setscheduler() · ff77e468
    Peter Zijlstra authored
    Andrea Parri reported:
    
    > I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not
    > handled correctly:
    >
    >     T1 (prio = 20)
    >        lock(rtmutex);
    >
    >     T2 (prio = 20)
    >        blocks on rtmutex  (rt_nr_boosted = 0 on T1's rq)
    >
    >     T1 (prio = 20)
    >        sys_set_scheduler(prio = 0)
    >           [new_effective_prio == oldprio]
    >           T1 prio = 20    (rt_nr_boosted = 0 on T1's rq)
    >
    > The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted());
    > in particular, if we continue with
    >
    >    T1 (prio = 20)
    >       unlock(rtmutex)
    >          wakeup(T2)
    >          adjust_prio(T1)
    >             [prio != rt_mutex_getprio(T1)]
    >	    dequeue(T1)
    >	       rt_nr_boosted = (unsigned long)(-1)
    >	       ...
    >             T1 prio = 0
    >
    > then we end up leaving rt_nr_boosted in an "inconsistent" state.
    >
    > The simple program attached could reproduce the previous scenario; note
    > that, as a consequence of the presence of this state, the "assertion"
    >
    >     WARN_ON(!rt_nr_running && rt_nr_boosted)
    >
    > from dec_rt_group() may trigger.
    
    So normally we dequeue/enqueue tasks in sched_setscheduler(), which
    would ensure the accounting stays correct. However in the early PI path
    we fail to do so.
    
    So this was introduced at around v3.14, by:
    
      c365c292 ("sched: Consider pi boosting in setscheduler()")
    
    which fixed another problem exactly because that dequeue/enqueue, joy.
    
    Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it
    preserve runqueue location with that option. This requires decoupling
    the on_rt_rq() state from being on the list.
    
    In order to allow for explicit movement during the SAVE/RESTORE,
    introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these
    cases to preserve other invariants.
    
    Respecting the SAVE/RESTORE flags also has the (nice) side-effect that
    things like sys_nice()/sys_sched_setaffinity() also do not reorder
    FIFO tasks (whereas they used to before this patch).
    Reported-by: default avatarAndrea Parri <parri.andrea@gmail.com>
    Tested-by: default avatarAndrea Parri <parri.andrea@gmail.com>
    Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Juri Lelli <juri.lelli@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
    ff77e468
sched.h 44.1 KB