1. 26 Feb, 2010 1 commit
    • sched: Fix SCHED_MC regression caused by change in sched cpu_power · dd5feea1
      Suresh Siddha authored
      On platforms like a dual-socket quad-core system, the scheduler load
      balancer fails to detect load imbalances in certain scenarios. This
      leads to situations where one socket is completely busy (all 4 cores
      running 4 tasks) while the other socket sits completely idle. That hurts
      performance, since those 4 tasks share the memory controller, last-level
      cache bandwidth, etc., and we also don't take as much advantage of
      turbo mode as we would like.
      
      Some of the comparisons in the scheduler load-balancing code compare the
      "weighted cpu load that is scaled wrt the sched_group's cpu_power" with
      the "weighted average load per task that is not scaled wrt the
      sched_group's cpu_power". While this has probably been broken for a
      longer time (for multi-socket NUMA nodes, etc.), the problem was
      aggravated by this recent change:
      
       |
       |  commit f93e65c1
       |  Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
       |  Date:   Tue Sep 1 10:34:32 2009 +0200
       |
       |	sched: Restore __cpu_power to a straight sum of power
       |
      
      Also with this change, the sched group cpu power alone no longer reflects
      the group capacity that is needed to implement MC, MT performance
      (default) and power-savings (user-selectable) policies.
      
      We need to use the computed group capacity (sgs.group_capacity, computed
      via the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to find out
      whether the group with the maximum load is above its capacity, how much
      load to move, etc.
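      As an illustration, here is a minimal standalone C sketch of the units
      mismatch and of the capacity check; SCHED_LOAD_SCALE matches the kernel
      constant, but the struct, field names and numbers below are illustrative
      assumptions, not the kernel's load-balancer code:

      /*
       * Minimal standalone sketch (not kernel code) of the units mismatch
       * described above.  All values are made up for illustration.
       */
      #include <stdio.h>

      #define SCHED_LOAD_SCALE 1024UL

      struct group_stats {
              unsigned long group_load;     /* sum of weighted cpu loads        */
              unsigned long group_power;    /* sum of the group's cpu_power     */
              unsigned long group_capacity; /* nr of tasks the group can carry  */
              unsigned long nr_running;     /* tasks running in the group       */
      };

      /* Group load normalized by cpu_power (in units of SCHED_LOAD_SCALE). */
      static unsigned long scaled_avg_load(const struct group_stats *sgs)
      {
              return sgs->group_load * SCHED_LOAD_SCALE / sgs->group_power;
      }

      /* Average load per task, deliberately *not* scaled by cpu_power. */
      static unsigned long avg_load_per_task(const struct group_stats *sgs)
      {
              return sgs->nr_running ? sgs->group_load / sgs->nr_running : 0;
      }

      int main(void)
      {
              /*
               * Busy socket: 4 cores each running one task of weight 1024,
               * with cpu_power summing to more than 4 * SCHED_LOAD_SCALE.
               * group_capacity of 1 mimics the SD_PREFER_SIBLING clamping
               * mentioned above; all numbers are illustrative.
               */
              struct group_stats busy = {
                      .group_load     = 4 * 1024,
                      .group_power    = 4 * 1178,
                      .group_capacity = 1,
                      .nr_running     = 4,
              };

              /* The two quantities are in different units, so comparing them
               * directly (the bug) can hide a real imbalance. */
              printf("scaled avg_load=%lu, avg_load_per_task=%lu\n",
                     scaled_avg_load(&busy), avg_load_per_task(&busy));

              /* The capacity check asks the right question instead. */
              printf("group over capacity: %s\n",
                     busy.nr_running > busy.group_capacity ? "yes" : "no");
              return 0;
      }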
      Reported-by: Ma Ling <ling.ma@intel.com>
      Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      [ -v2: build fix ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org> # [2.6.32.x, 2.6.33.x]
      LKML-Reference: <1266970432.11588.22.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 17 Feb, 2010 1 commit
    • sched: Don't use possibly stale sched_class · 83ab0aa0
      Thomas Gleixner authored
      setscheduler() saves task->sched_class outside of the rq->lock held
      region for a check after the setscheduler changes have become
      effective. That might result in checking a stale value.
      
      rtmutex_setprio() has the same problem; it is protected against
      setscheduler() by p->pi_lock, but for correctness' sake (and to avoid
      setting a bad example) it needs to be fixed as well.
      
      Retrieve task->sched_class inside of the rq->lock held region.
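      A minimal illustrative sketch of the locking pattern the fix enforces;
      the struct and mutex below are local stand-ins, not the kernel's
      task_struct or rq->lock:

      #include <pthread.h>

      struct task {
              pthread_mutex_t lock;     /* stands in for rq->lock              */
              const void *sched_class;  /* may be changed by concurrent setters */
      };

      void change_policy(struct task *p, const void *new_class)
      {
              /* Buggy pattern: a snapshot taken outside the lock can go stale:
               *   const void *prev_class = p->sched_class;
               */
              pthread_mutex_lock(&p->lock);
              const void *prev_class = p->sched_class; /* fixed: read under lock */
              p->sched_class = new_class;
              pthread_mutex_unlock(&p->lock);

              (void)prev_class; /* compared against the new class after the change */
      }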
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@kernel.org
  3. 16 Feb, 2010 4 commits
    • Merge branch 'sched/urgent' into sched/core · 6e40f5bb
      Thomas Gleixner authored
      Conflicts: kernel/sched.c
      
      Necessary due to the urgent fixes, which conflict with the code move
      from sched.c to sched_fair.c.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: Fix race between ttwu() and task_rq_lock() · 0970d299
      Peter Zijlstra authored
      Thomas found that due to ttwu() changing a task's cpu without holding
      the rq->lock, task_rq_lock() might end up locking the wrong rq.
      
      Avoid this by serializing against TASK_WAKING.
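      A standalone sketch of the lock/re-check/retry idea this implies; the
      types and the "waking" flag below are illustrative stand-ins, not the
      actual kernel implementation:

      #include <stdatomic.h>
      #include <pthread.h>

      struct runqueue { pthread_mutex_t lock; };

      struct task {
              _Atomic(struct runqueue *) rq;  /* may change during a wakeup   */
              atomic_bool waking;             /* stands in for TASK_WAKING    */
      };

      struct runqueue *task_rq_lock(struct task *p)
      {
              for (;;) {
                      struct runqueue *rq = atomic_load(&p->rq);

                      pthread_mutex_lock(&rq->lock);
                      /* Retry if the task migrated or is still being woken. */
                      if (rq == atomic_load(&p->rq) && !atomic_load(&p->waking))
                              return rq;      /* locked the right rq          */
                      pthread_mutex_unlock(&rq->lock);
              }
      }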
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1266241712.15770.420.camel@laptop>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: Fix SMT scheduler regression in find_busiest_queue() · 9000f05c
      Suresh Siddha authored
      Fix an SMT scheduler performance regression that leads to a scenario
      where the SMT threads in one core are completely idle while both SMT
      threads in another core (on the same socket) are busy.
      
      This is caused by this commit (with the problematic code highlighted)
      
         commit bdb94aa5
         Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
         Date:   Tue Sep 1 10:34:38 2009 +0200
      
         sched: Try to deal with low capacity
      
         @@ -4203,15 +4223,18 @@ find_busiest_queue()
         ...
      	for_each_cpu(i, sched_group_cpus(group)) {
         +	unsigned long power = power_of(i);
      
         ...
      
         -	wl = weighted_cpuload(i);
         +	wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
         +	wl /= power;
      
         -	if (rq->nr_running == 1 && wl > imbalance)
         +	if (capacity && rq->nr_running == 1 && wl > imbalance)
      		continue;
      
      On an SMT system, the power of an HT logical cpu will be 589, and the
      scheduler load imbalance (for scenarios like the one mentioned above)
      can be approximately 1024 (SCHED_LOAD_SCALE). With the above change,
      scaling the weighted load by the cpu power makes "wl > imbalance" true,
      ultimately causing find_busiest_queue() to return NULL and
      load_balance() to think the load is well balanced. But in fact one of
      the tasks could be moved to the idle core for optimal performance.
      
      We don't need to compare the weighted load (wl) scaled by the cpu power
      against the imbalance. In that condition we already know there is only a
      single task ("rq->nr_running == 1"), and the comparison between
      imbalance and wl is there to make sure we select the correct-priority
      thread which matches the imbalance. So we really need to compare the
      imbalance with the original weighted load of the cpu, not the scaled load.
      
      But in the other conditions, where we want the most hammered (busiest)
      cpu, we can use the scaled load to take cpu power into account in
      addition to the actual load on that cpu, so that we move load away from
      the cpu that is most hammered relative to its actual capacity, as
      compared with the rest of the cpus in that busiest group.
      
      Fix it.
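      A minimal standalone sketch of the corrected comparison described above;
      the array, the constants and the imbalance value are illustrative, not
      kernel code:

      #include <stdio.h>

      #define SCHED_LOAD_SCALE 1024UL

      struct cpu { unsigned long wl, power, nr_running; };

      int main(void)
      {
              /* Two busy HT siblings in one core, numbers from the text above. */
              struct cpu cpus[2] = {
                      { .wl = 1024, .power = 589, .nr_running = 1 },
                      { .wl = 1024, .power = 589, .nr_running = 1 },
              };
              unsigned long imbalance = 1024; /* ~SCHED_LOAD_SCALE             */
              unsigned long max_scaled = 0;
              int busiest = -1;

              for (int i = 0; i < 2; i++) {
                      /* "Is this single task already bigger than the imbalance?"
                       * uses the *unscaled* load; the old, scaled comparison
                       * (1780 > 1024) would have skipped both cpus here. */
                      if (cpus[i].nr_running == 1 && cpus[i].wl > imbalance)
                              continue;

                      /* ... but the most hammered cpu is still picked by the
                       * load scaled with its cpu_power. */
                      unsigned long scaled =
                              cpus[i].wl * SCHED_LOAD_SCALE / cpus[i].power;
                      if (scaled > max_scaled) {
                              max_scaled = scaled;
                              busiest = i;
                      }
              }
              printf("busiest cpu: %d\n", busiest);
              return 0;
      }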
      Reported-by: Ma Ling <ling.ma@intel.com>
      Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
      Cc: stable@kernel.org [2.6.32.x]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: Fix sched_mv_power_savings for !SMT · 28f53181
      Vaidyanathan Srinivasan authored
      Fix sched_mc_power_savings for pre-Nehalem platforms.
      A child sched domain should clear SD_PREFER_SIBLING if its parent will
      have SD_POWERSAVINGS_BALANCE set, because the two flags are contradictory.
      
      Set the flags correctly based on sched_mc_power_savings.
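      An illustrative sketch of that flag interaction; the flag values and the
      helper below are local stand-ins, not the kernel's domain-setup code:

      #define SD_PREFER_SIBLING       0x0001
      #define SD_POWERSAVINGS_BALANCE 0x0002

      struct sched_domain { unsigned int flags; };

      void reconcile_flags(struct sched_domain *child,
                           struct sched_domain *parent)
      {
              /*
               * Power-savings balancing wants to pack tasks onto fewer
               * packages, while SD_PREFER_SIBLING spreads them out; if the
               * parent asks for the former, the child must not ask for the
               * latter.
               */
              if (parent->flags & SD_POWERSAVINGS_BALANCE)
                      child->flags &= ~SD_PREFER_SIBLING;
      }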
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100208100555.GD2931@dirshya.in.ibm.com>
      Cc: stable@kernel.org [2.6.32.x]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  4. 09 Feb, 2010 1 commit
  5. 08 Feb, 2010 4 commits
  6. 06 Feb, 2010 4 commits
  7. 05 Feb, 2010 25 commits