Ingo Molnar authored
PREEMPT_RT on SMP systems triggered weird (very high) load-average values rather easily, which turned out to be a mainline-kernel ->nr_uninterruptible handling bug in try_to_wake_up(). The following code:

    if (old_state == TASK_UNINTERRUPTIBLE) {
        old_rq->nr_uninterruptible--;

potentially executes with old_rq being != rq, and hence updates ->nr_uninterruptible without the lock held. Given a sufficiently concurrent preemption workload, the count can get out of whack and updates may get lost, permanently skewing the global count. Nothing except the load average uses nr_uninterruptible(), so this condition can go unnoticed quite easily.

The fix is to always update ->nr_uninterruptible on the runqueue where the task currently is. (This is also a tiny performance plus for try_to_wake_up(), as a stack slot gets freed up.)

While fixing this bug I found three other ->nr_uninterruptible-related bugs:

- The update should be moved from deactivate_task() into schedule(), because e.g. setscheduler() does deactivate_task()+activate_task(), which may result in a -1 counter skew if setscheduler() is done asynchronously on a task that is still on the runqueue but has already set ->state to TASK_UNINTERRUPTIBLE. sys_sched_setscheduler() is used rarely, but the bug is real. (The fix is also a small performance enhancement.)

  The rules for ->nr_uninterruptible updating are the following: it is increased by schedule() only, when a task is moved off the runqueue while in TASK_UNINTERRUPTIBLE state. It is decreased by try_to_wake_up(), by the first wakeup that materially changes the state from TASK_UNINTERRUPTIBLE back to TASK_RUNNING and moves the task onto the runqueue.

- On CPU-hotplug-down we might zap a CPU that has a nonzero counter. Due to the fuzzy nature of the global counter, a CPU might hold a nonzero ->nr_uninterruptible count even if it has no tasks anymore. The solution is to 'migrate' the counter to another runqueue.

- We should not return negative counter values from the nr_uninterruptible() function, since it accesses the per-runqueue counters without taking the runqueue locks, so the total sum might be slightly above or slightly below the real count.

I tested the attached patch on x86 SMP and it solves the load-average problem. (I have tested CPU_HOTPLUG compilation but not functionality.) I think this is a must-have for 2.6.10, because there are apps that go berserk if the load average is too high (e.g. sendmail).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
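[Editor's note: to make the lost-update failure mode concrete, here is a minimal user-space C sketch, not kernel code. The shared counter, thread count, and iteration count are all illustrative: two threads do unlocked read-modify-write updates on one counter, modeling two CPUs touching a remote runqueue's ->nr_uninterruptible without holding its lock. The increments and decrements are balanced, so a race-free run would end at exactly zero.]

    /* lost-update demo: balanced +1/-1 updates, no locking */
    #include <pthread.h>
    #include <stdio.h>

    static volatile long count;    /* stands in for ->nr_uninterruptible */

    static void *hammer(void *arg)
    {
        long delta = (long)arg;

        for (int i = 0; i < 1000000; i++)
            count += delta;        /* non-atomic read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t inc, dec;

        pthread_create(&inc, NULL, hammer, (void *)+1L);
        pthread_create(&dec, NULL, hammer, (void *)-1L);
        pthread_join(inc, NULL);
        pthread_join(dec, NULL);
        /* should print 0; on SMP it usually does not */
        printf("final count = %ld\n", count);
        return 0;
    }

Built with cc -pthread, repeated runs drift by differing nonzero amounts, which is exactly the "permanently skewed" global count the message describes.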
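[Editor's note: the accounting rule and the clamped reader can likewise be modeled in a few lines of user-space C. This is a sketch of the rule as stated above, not the kernel's implementation; struct rq, the function names, and the two-CPU setup are illustrative. The point is that a single per-runqueue counter may legitimately go transiently negative (a wakeup is charged to CPU1 before the matching sleep is charged to CPU0), so the lockless summing function clamps rather than report a negative total.]

    #include <stdio.h>

    #define NR_CPUS 2

    struct rq {
        long nr_uninterruptible;   /* updated under rq->lock in the kernel */
    };

    static struct rq runqueues[NR_CPUS];

    /* schedule(): the only place the count goes up -- a task moves
     * off the runqueue while in TASK_UNINTERRUPTIBLE state */
    static void deactivate_uninterruptible(struct rq *rq)
    {
        rq->nr_uninterruptible++;
    }

    /* try_to_wake_up(): decrement on the runqueue the task is woken
     * onto -- the one whose lock is actually held -- not on old_rq */
    static void wake_uninterruptible(struct rq *rq)
    {
        rq->nr_uninterruptible--;
    }

    /* lockless global sum; clamp so a transiently skewed snapshot
     * never feeds a negative value into the load average */
    static long nr_uninterruptible(void)
    {
        long sum = 0;

        for (int i = 0; i < NR_CPUS; i++)
            sum += runqueues[i].nr_uninterruptible;
        return sum < 0 ? 0 : sum;
    }

    int main(void)
    {
        wake_uninterruptible(&runqueues[1]);       /* wakeup seen first: CPU1 at -1 */
        printf("transient sum = %ld\n", nr_uninterruptible());  /* clamped to 0 */
        deactivate_uninterruptible(&runqueues[0]); /* matching sleep on CPU0 */
        printf("settled sum   = %ld\n", nr_uninterruptible());  /* 0 */
        return 0;
    }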
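[Editor's note: finally, the hotplug fix. The dying CPU's residual count is folded into a surviving runqueue so the global sum is preserved. Again a user-space sketch under the same toy struct rq; in the kernel the equivalent step would run during CPU-hotplug-down with the relevant runqueue locks held.]

    #include <stdio.h>

    struct rq { long nr_uninterruptible; };

    /* fold the dead CPU's residual count into a live runqueue */
    static void migrate_nr_uninterruptible(struct rq *rq_src, struct rq *rq_dest)
    {
        rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
        rq_src->nr_uninterruptible = 0;
    }

    int main(void)
    {
        struct rq dying = { .nr_uninterruptible = 3 };  /* residual count */
        struct rq alive = { .nr_uninterruptible = 1 };

        migrate_nr_uninterruptible(&dying, &alive);
        /* prints alive=4 dying=0: the global sum of 4 is preserved */
        printf("alive=%ld dying=%ld\n",
               alive.nr_uninterruptible, dying.nr_uninterruptible);
        return 0;
    }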
4395e879