    [PATCH] sched: fix ->nr_uninterruptible handling bugs
    PREEMPT_RT on SMP systems triggered weird (very high) load-average
    values rather easily, which turned out to be caused by a mainline
    kernel ->nr_uninterruptible handling bug in try_to_wake_up().
    
    the following code:
    
            if (old_state == TASK_UNINTERRUPTIBLE) {
                    old_rq->nr_uninterruptible--;
    
    potentially executes with old_rq being != rq, and hence updates
    ->nr_uninterruptible without the runqueue lock held. Given a
    sufficiently concurrent preemption workload, the count can get out of
    whack: racing updates get lost, permanently skewing the global count.
    Nothing except the load average uses nr_uninterruptible(), so this
    condition can go unnoticed quite easily.
    
    the fix is to always update ->nr_uninterruptible on the runqueue the
    task is currently on. (this is also a tiny performance plus for
    try_to_wake_up(), as a stack slot gets freed up.)
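
    schematically, the fixed wakeup-side accounting behaves like the
    standalone model below. This is a minimal sketch, not the patch
    itself: struct rq is pared down to the one field that matters, the
    function name is illustrative, and the locking is only noted in a
    comment.

            #define TASK_UNINTERRUPTIBLE    2       /* as in the 2.6 kernel */

            struct rq {
                    long nr_uninterruptible;
                    /* lock, priority arrays etc. elided in this model */
            };

            /*
             * Model of the fixed try_to_wake_up() accounting: the
             * decrement happens on 'rq', the runqueue the task currently
             * sits on and whose lock the caller holds -- never on some
             * other runqueue without its lock.
             */
            static void wake_up_account(struct rq *rq, long old_state)
            {
                    /* rq->lock is assumed held here */
                    if (old_state == TASK_UNINTERRUPTIBLE)
                            rq->nr_uninterruptible--;
            }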
    
    while fixing this bug I found three other ->nr_uninterruptible
    related bugs:
    
     - the update should be moved from deactivate_task() into schedule(),
       because e.g. setscheduler() does deactivate_task()+activate_task(),
       which in turn may result in a -1 counter skew if setscheduler() is
       done asynchronously on a task that is still on the runqueue but has
       already set ->state to TASK_UNINTERRUPTIBLE.

       sys_sched_setscheduler() is used rarely, but the bug is real. (The
       fix is also a small performance enhancement.)
    
       The rules for ->nr_uninterruptible updating are the following: it
       gets increased only by schedule(), when a task that is in
       TASK_UNINTERRUPTIBLE state is moved off the runqueue. It gets
       decreased by try_to_wake_up(), by the first wakeup that materially
       changes the state from TASK_UNINTERRUPTIBLE back to TASK_RUNNING
       and moves the task to the runqueue. (the first sketch after this
       list models the schedule() side.)
    
     - on CPU-hotplug down we might zap a CPU that has a nonzero counter.
       Due to the fuzzy nature of the global counter, a CPU might hold a
       nonzero ->nr_uninterruptible count even once it has no tasks left.
       The solution is to 'migrate' the counter to another runqueue. (see
       the second sketch after this list.)
    
     - we should not return negative counter values from the
       nr_uninterruptible() function, since it reads the per-runqueue
       counters without taking the runqueue locks, so the total sum might
       be slightly above or slightly below the real count. (see the third
       sketch after this list.)
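
    in sketch form, the first of these fixes -- the schedule()-side
    counter rule -- looks like the standalone model below (a minimal
    sketch with the locking and the real dequeue elided; the names are
    illustrative, not the actual sched.c code):

            #define TASK_UNINTERRUPTIBLE    2

            struct rq {
                    long nr_uninterruptible;
            };

            /*
             * Model of the schedule()-side rule: the counter goes up only
             * here, at the moment a TASK_UNINTERRUPTIBLE task actually
             * leaves the runqueue -- not in deactivate_task(), which
             * setscheduler() may call transiently. rq->lock is held.
             */
            static void schedule_dequeue(struct rq *rq, long prev_state)
            {
                    if (prev_state == TASK_UNINTERRUPTIBLE)
                            rq->nr_uninterruptible++;
                    /* deactivate_task(prev, rq); ... pick the next task */
            }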
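
    the second fix, migrating the counter off a dying CPU, in the same
    sketch style (both runqueue locks are assumed held by the caller; a
    simplified model of what the patch does, not the exact code):

            struct rq {
                    long nr_uninterruptible;
            };

            /*
             * Model of the hotplug fix: fold the dead CPU's (possibly
             * nonzero, possibly negative) contribution into a surviving
             * runqueue, so the global sum stays intact.
             */
            static void migrate_nr_uninterruptible(struct rq *rq_src,
                                                   struct rq *rq_dest)
            {
                    rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
                    rq_src->nr_uninterruptible = 0;
            }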
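
    and the third fix, the clamped reader (NR_CPUS and the runqueues
    array stand in for the kernel's per-CPU runqueues; again a sketch,
    not the exact sched.c code):

            #define NR_CPUS 4

            static struct rq {
                    long nr_uninterruptible;
            } runqueues[NR_CPUS];

            /*
             * The per-runqueue counters are read locklessly, so the sum
             * can transiently dip below zero; clamp it to zero instead of
             * returning a bogus (huge, once cast to unsigned) value.
             */
            static unsigned long nr_uninterruptible(void)
            {
                    long sum = 0;
                    int i;

                    for (i = 0; i < NR_CPUS; i++)
                            sum += runqueues[i].nr_uninterruptible;

                    if (sum < 0)
                            sum = 0;

                    return (unsigned long)sum;
            }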
    
    I tested the attached patch on x86 SMP and it solves the load-average
    problem. (I have tested CPU_HOTPLUG compilation but not
    functionality.) I think this is a must-have for 2.6.10, because there
    are apps that go berserk if the load average is too high (e.g.
    sendmail).
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>