    sched/features: Fix hrtick reprogramming · 156ec6f4
    Juri Lelli authored
    Hung tasks and RCU stall cases were reported on systems which were not
    100% busy. Investigation of such unexpected cases (no sign of potential
    starvation caused by tasks hogging the system) pointed out that the
    periodic sched tick timer wasn't serviced anymore after a certain point
    and that caused all machinery that depends on it (timers, RCU, etc.) to
    stop working as well. This issue was, however, only reproducible if
    HRTICK was enabled.
    
    Looking at core dumps, it was found that the rbtree of the hrtimer base
    that is also used for the hrtick was corrupted (i.e. the next timer as
    seen from the base root and the actual leftmost node obtained by
    traversing the tree were different). The same base is also used for the
    periodic tick hrtimer, which might get "lost" if the rbtree gets
    corrupted.
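
    The kind of inconsistency described above can be illustrated with a
    small userspace sketch. This is not kernel code: toy_timer, toy_base and
    the helpers are hypothetical names, and a sorted list stands in for the
    rbtree with its cached leftmost node. It only shows how changing the
    expiry of a timer that is still enqueued makes the cached "next" timer
    disagree with what a full traversal finds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an hrtimer: only the expiry matters here. */
struct toy_timer {
	long expires;
	struct toy_timer *link;
};

/*
 * Hypothetical stand-in for an hrtimer base: a list sorted by expiry plus
 * a cached pointer to the earliest timer. The kernel uses an rbtree with a
 * cached leftmost node; the list is an assumption made for brevity.
 */
struct toy_base {
	struct toy_timer *head;	/* sorted by expires, ascending */
	struct toy_timer *next;	/* cached earliest timer */
};

/* Insert at the position given by the expiry and refresh the cache. */
static void toy_enqueue(struct toy_base *b, struct toy_timer *t)
{
	struct toy_timer **pp = &b->head;

	while (*pp && (*pp)->expires <= t->expires)
		pp = &(*pp)->link;
	t->link = *pp;
	*pp = t;
	b->next = b->head;
}

/* Walk the whole queue and return the timer with the smallest expiry. */
static struct toy_timer *toy_actual_earliest(const struct toy_base *b)
{
	struct toy_timer *min = b->head;

	for (struct toy_timer *t = b->head; t; t = t->link)
		if (t->expires < min->expires)
			min = t;
	return min;
}

/* Returns 1 when the cached "next" agrees with a full traversal. */
static int toy_base_consistent(const struct toy_base *b)
{
	return b->next == toy_actual_earliest(b);
}
```

    With a periodic tick timer at expiry 100 and an hrtick timer at 200 both
    enqueued, the base is consistent. Writing a new expiry of 50 into the
    hrtick timer without dequeuing it first leaves the cache pointing at the
    tick timer while the traversal finds the hrtick timer.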
    
    Much like what is described in commit 1f71addd ("tick/sched: Do not
    mess with an enqueued hrtimer"), there is a race window between
    hrtimer_set_expires() in hrtick_start() and hrtimer_start_expires() in
    __hrtick_restart() in which the former might be operating on an already
    queued hrtick hrtimer, which might lead to corruption of the base.
    
    Use hrtimer_start() (which removes the timer before enqueuing it back) to
    ensure hrtick hrtimer reprogramming is entirely guarded by the base
    lock, so that no race conditions can occur.
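
    The remove-then-re-enqueue pattern can be sketched in userspace as
    follows. This is an illustration under stated assumptions, not the
    kernel implementation: toy_timer, toy_base and toy_start() are
    hypothetical names, a sorted list replaces the rbtree, and a C11
    atomic_flag spinlock stands in for the base lock. The point is that the
    dequeue, the expiry update and the re-enqueue all happen inside one
    locked section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_timer {
	long expires;
	int queued;
	struct toy_timer *link;
};

struct toy_base {
	atomic_flag lock;	/* stand-in for the hrtimer base lock */
	struct toy_timer *head;	/* sorted by expires, ascending */
};

static void toy_lock(struct toy_base *b)
{
	while (atomic_flag_test_and_set(&b->lock))
		;	/* spin until the flag is released */
}

static void toy_unlock(struct toy_base *b)
{
	atomic_flag_clear(&b->lock);
}

/* Unlink the timer from the queue; caller holds the lock. */
static void toy_remove(struct toy_base *b, struct toy_timer *t)
{
	struct toy_timer **pp = &b->head;

	while (*pp && *pp != t)
		pp = &(*pp)->link;
	if (*pp)
		*pp = t->link;
	t->link = NULL;
	t->queued = 0;
}

/* Insert the timer at its sorted position; caller holds the lock. */
static void toy_insert(struct toy_base *b, struct toy_timer *t)
{
	struct toy_timer **pp = &b->head;

	while (*pp && (*pp)->expires <= t->expires)
		pp = &(*pp)->link;
	t->link = *pp;
	*pp = t;
	t->queued = 1;
}

/* (Re)program a timer: the expiry is only written while it is dequeued. */
static void toy_start(struct toy_base *b, struct toy_timer *t, long expires)
{
	toy_lock(b);
	if (t->queued)
		toy_remove(b, t);
	t->expires = expires;
	toy_insert(b, t);
	toy_unlock(b);
}

/* Returns 1 when the queue is in ascending expiry order. */
static int toy_sorted(const struct toy_base *b)
{
	for (struct toy_timer *t = b->head; t && t->link; t = t->link)
		if (t->expires > t->link->expires)
			return 0;
	return 1;
}
```

    Reprogramming an already queued timer through toy_start() moves it to
    its new position instead of changing its key in place, so the queue
    stays ordered.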

    Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
    Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
    Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20210208073554.14629-2-juri.lelli@redhat.com