1. 02 Mar, 2011 9 commits
  2. 18 Feb, 2011 1 commit
  3. 17 Feb, 2011 30 commits
    • kernel/user.c: add lock release annotation on free_user() · d02522a9
      Namhyung Kim authored
      commit 571428be upstream.
      
      free_user() releases uidhash_lock but was missing the annotation.  Add it.
      This removes the following sparse warnings:
      
       include/linux/spinlock.h:339:9: warning: context imbalance in 'free_user' - unexpected unlock
       kernel/user.c:120:6: warning: context imbalance in 'free_uid' - wrong count at exit
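      
      A minimal sketch of the shape of the fix (abbreviated, not the literal
      kernel/user.c hunk): sparse's __releases() marks a function that exits
      with the named lock dropped, matching the unlock done inside.
      
      static void free_user(struct user_struct *up, unsigned long flags)
      	__releases(&uidhash_lock)
      {
      	uid_hash_remove(up);
      	spin_unlock_irqrestore(&uidhash_lock, flags);
      	/* ... defer the rest of the teardown ... */
      }
      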
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dhaval Giani <dhaval.giani@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      
      d02522a9
    • sched: Remove some dead code · c3d5f1e8
      Dan Carpenter authored
      commit 618765801ebc271fe0ba3eca99fcfd62a1f786e1 upstream.
      
      This was left over from "7c941438 sched: Remove USER_SCHED"
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      LKML-Reference: <20100315082148.GD18181@bicker>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      c3d5f1e8
    • sched: Fix wake_affine() vs RT tasks · 97fc6c0d
      Peter Zijlstra authored
      Commit: e51fd5e2 upstream
      
      Mike reports that since e9e9250b (sched: Scale down cpu_power due to RT
      tasks), wake_affine() goes funny on RT tasks due to them still having a
      !0 weight and wake_affine() still subtracts that from the rq weight.
      
      Since nobody should be using se->weight for RT tasks, set the value to
      zero. Also, since we now use ->cpu_power to normalize rq weights to
      account for RT cpu usage, add that factor into the imbalance computation.
      Reported-by: Mike Galbraith <efault@gmx.de>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1275316109.27810.22969.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      97fc6c0d
    • sched: Fix idle balancing · 354f5613
      Nikhil Rao authored
      Commit: d5ad140b upstream
      
      An earlier commit reverts idle balancing throttling reset to fix a 30%
      regression in volanomark throughput. We still need to reset idle_stamp
      when we pull a task in newidle balance.
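      
      A minimal sketch of that reset (placement and surrounding code
      approximate, not the literal fair.c hunk):
      
      /* in idle_balance(), after trying load_balance() on each domain: */
      if (pulled_task) {
      	this_rq->idle_stamp = 0;	/* we found work: stop throttling */
      	break;
      }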
      Reported-by: Alex Shi <alex.shi@intel.com>
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1290022924-3548-1-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      354f5613
    • sched: Fix volanomark performance regression · 82dd2a0c
      Alex Shi authored
      Commit: b5482cfa upstream
      
      Commit fab47622 triggers excessive idle balancing, causing a ~30% loss in
      volanomark throughput. Remove idle balancing throttle reset.
      Originally-by: Alex Shi <alex.shi@intel.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1289928732.5169.211.camel@maggy.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      82dd2a0c
    • sched: Fix cross-sched-class wakeup preemption · 134f7fee
      Peter Zijlstra authored
      Commit: 1e5a7405 upstream
      
      Instead of dealing with sched classes inside each check_preempt_curr()
      implementation, pull out this logic into the generic wakeup preemption
      path.
      
      This fixes a hang in KVM (and others) where we are waiting for the
      stop machine thread to run ...
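      
      A simplified, hedged sketch of the generic rule this describes (class
      iteration details approximate):
      
      void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
      {
      	const struct sched_class *class;
      
      	if (p->sched_class == rq->curr->sched_class) {
      		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
      		return;
      	}
      
      	for_each_class(class) {
      		if (class == rq->curr->sched_class)
      			break;			/* curr's class is higher: no preempt */
      		if (class == p->sched_class) {
      			resched_task(rq->curr);	/* waking class is higher: preempt */
      			break;
      		}
      	}
      }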
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1288891946.2039.31.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      134f7fee
    • sched: Use group weight, idle cpu metrics to fix imbalances during idle · aa68c032
      Suresh Siddha authored
      Commit: aae6d3dd upstream
      
      Currently we consider a sched domain to be well balanced when the imbalance
      is less than the domain's imbalance_pct. As the number of cores and threads
      are increasing, current values of imbalance_pct (for example 25% for a
      NUMA domain) are not enough to detect imbalances like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
      24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
      socket. Leading to an idle HT cpu.
      
      b) On a hypothetical 2 socket NHM-EX system (each socket having 8 cores and
      16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
      socket and 7 on another socket. Leaving one core in a socket idle
      whereas in another socket we have a core having both its HT siblings busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of number of logical cpus in the domain), it
      can potentially cause more task migrations across sched groups in an
      overloaded case.
      
      Fix this by using imbalance_pct only during newly_idle and busy
      load balancing. And during idle load balancing, check if there
      is an imbalance in the number of idle cpus between the busiest and this
      sched_group, or if the busiest group has more tasks than its group weight
      that the idle cpu in this_group can pull.
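      
      Illustrative pseudo-C only (the exact conditions and field names in
      find_busiest_group()/update_sd_lb_stats() differ upstream):
      
      if (idle == CPU_IDLE &&
          (busiest->idle_cpus < this->idle_cpus ||
           busiest->sum_nr_running > busiest->group_weight))
      	force_balance = 1;	/* imbalanced even below imbalance_pct */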
      Reported-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      aa68c032
    • sched, cgroup: Fixup broken cgroup movement · 2bdf3dc4
      Peter Zijlstra authored
      Commit: b2b5ce02 upstream
      
      Dima noticed that we fail to correct the ->vruntime of sleeping tasks
      when we move them between cgroups.
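      
      A hedged sketch of the correction (simplified; the real helper also
      covers the on-runqueue case):
      
      /* moving a sleeping task between cgroups: keep its vruntime relative */
      if (!on_rq)
      	p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
      set_task_rq(p, task_cpu(p));
      if (!on_rq)
      	p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;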
      Reported-by: Dima Zavin <dima@android.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <1287150604.29097.1513.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      2bdf3dc4
    • sched: Export account_system_vtime() · ea63ff2b
      Ingo Molnar authored
      Commit: b7dadc38 upstream
      
      KVM uses it for example:
      
       ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!
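      
      The fix is presumably just an export so modular users such as kvm.ko can
      link against the symbol; a sketch:
      
      EXPORT_SYMBOL_GPL(account_system_vtime);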
      
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      ea63ff2b
    • sched: Call tick_check_idle before __irq_enter · 19d3e3cb
      Venkatesh Pallipadi authored
      Commit: d267f87f upstream
      
      When the CPU is idle, on the first interrupt irq_enter calls tick_check_idle()
      to notify of the interruption from idle. But there is a problem if this call
      is done after __irq_enter, as all routines in __irq_enter may see
      stale time because tick_check_idle has not yet been done.
      
      Specifically, this affects trace calls in __irq_enter when they use the global
      clock, and also the account_system_vtime change in this patch, as it wants
      to use sched_clock_cpu() to do proper irq timing.
      
      But, tick_check_idle was moved after __irq_enter intentionally to
      prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a9:
      
          irq: call __irq_enter() before calling the tick_idle_check
          Impact: avoid spurious ksoftirqd wakeups
      
      Moving tick_check_idle() before __irq_enter and wrapping it with
      local_bh_enable/disable would solve both the problems.
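      
      A hedged, simplified sketch of the resulting ordering (the underscore
      variant of bh enable is an assumption, used to avoid running softirqs
      at this point):
      
      void irq_enter(void)
      {
      	int cpu = smp_processor_id();
      
      	rcu_irq_enter();
      	if (idle_cpu(cpu) && !in_interrupt()) {
      		local_bh_disable();	/* prevents a spurious ksoftirqd wakeup */
      		tick_check_idle(cpu);
      		_local_bh_enable();
      	}
      	__irq_enter();
      }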
      Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      19d3e3cb
    • sched: Remove irq time from available CPU power · c8c88559
      Venkatesh Pallipadi authored
      Commit: aa483808 upstream
      
      The idea was suggested by Peter Zijlstra here:
      
        http://marc.info/?l=linux-kernel&m=127476934517534&w=2
      
      irq time is technically not available to the tasks running on the CPU.
      This patch removes irq time from CPU power piggybacking on
      sched_rt_avg_update().
      
      Tested this by keeping CPU X busy with a network intensive task having 75%
      of a single CPU's irq processing (hard+soft) on a 4-way system. Then started
      seven cycle soakers on the system. Without this change, there will be two tasks
      on each CPU. With this change, there is a single task on the irq busy CPU X and
      the remaining 7 tasks are spread around among the other 3 CPUs.
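      
      A hedged sketch of the accounting hook (variable names such as
      cur_irq_time are illustrative, not the upstream identifiers):
      
      delta = cur_irq_time - rq->prev_irq_time;	/* irq time since last update */
      rq->prev_irq_time = cur_irq_time;
      sched_rt_avg_update(rq, delta);		/* scale_rt_power() now sees it */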
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      c8c88559
    • sched: Do not account irq time to current task · 3a69989d
      Venkatesh Pallipadi authored
      Commit: 305e6835 upstream
      
      Scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means, if the interrupt processing was
      for some other task in the system, then the current task ends up being
      penalized as it gets shorter runtime than otherwise.
      
      Change sched task accounting to account only actual task time from the
      currently running task. update_curr() now modifies delta_exec to
      depend on rq->clock_task.
      
      Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can
      extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But that's
      for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
      spending 75%+ of its time in irq processing. CPU 3 spending around 35%
      time running nc task.
      
      Now, if I run another CPU intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task as remaining
      time is actually spent on irq processing.
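      
      A hedged sketch of what "delta_exec depends on rq->clock_task" means
      inside update_curr() (simplified):
      
      u64 now = rq_of(cfs_rq)->clock_task;	/* rq clock minus irq time */
      unsigned long delta_exec = (unsigned long)(now - curr->exec_start);
      
      __update_curr(cfs_rq, curr, delta_exec);
      curr->exec_start = now;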
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      3a69989d
    • x86: Add IRQ_TIME_ACCOUNTING · 3b7d4d54
      Venkatesh Pallipadi authored
      Commit: e82b8e4e upstream
      
      This patch adds IRQ_TIME_ACCOUNTING option on x86 and runtime enables it
      when TSC is enabled.
      
      This change just enables fine-grained irq time accounting; it isn't used yet.
      Following patches use it for different purposes.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-6-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      3b7d4d54
    • sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · 5e7ce6ec
      Venkatesh Pallipadi authored
      Commit: b52bfee4 upstream
      
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
      the fine granularity accounting of user, system, hardirq, softirq times.
      Adding that option on archs like x86 will be challenging however, given the
      state of TSC reliability on various platforms and also the overhead it will
      add in syscall entry exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
      
      This patch just adds the core logic and does not enable this logic yet.
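      
      A heavily simplified, hedged sketch of the per-cpu bookkeeping (variable
      names approximate; the real code also handles nesting and irq-safety):
      
      void account_system_vtime(struct task_struct *curr)
      {
      	u64 now = sched_clock_cpu(smp_processor_id());
      	s64 delta = now - __this_cpu_read(irq_start_time);
      
      	__this_cpu_write(irq_start_time, now);
      	if (hardirq_count())
      		__this_cpu_add(cpu_hardirq_time, delta);
      	else if (in_serving_softirq())
      		__this_cpu_add(cpu_softirq_time, delta);
      }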
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      5e7ce6ec
    • sched: Add a PF flag for ksoftirqd identification · 9b511401
      Venkatesh Pallipadi authored
      Commit: 6cdd5199 upstream
      
      To account softirq time cleanly in scheduler, we need to identify whether
      softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
      Add PF_KSOFTIRQD for that purpose.
      
      As all PF flag bits are currently taken, create space by moving one of the
      infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
      with some other state fields.
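      
      A hedged sketch (the flag value shown is illustrative):
      
      #define PF_KSOFTIRQD	0x00000001	/* I am ksoftirqd */
      
      static int run_ksoftirqd(void *__bind_cpu)
      {
      	current->flags |= PF_KSOFTIRQD;	/* lets the accounting code spot us */
      	/* ... existing softirq processing loop ... */
      	return 0;
      }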
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      9b511401
    • sched: Remove unused PF_ALIGNWARN flag · 82f7e90e
      Dave Young authored
      Commit: 637bbdc5 upstream
      
      PF_ALIGNWARN is not implemented and, as its comment says, it was meant
      for the 486.
      
      It is not likely someone will implement this flag feature.
      So remove this flag and leave the valuable 0x00000001 free for
      future use.
      Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20100913121903.GB22238@darkstar>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      82f7e90e
    • sched: Consolidate account_system_vtime extern declaration · 95824433
      Venkatesh Pallipadi authored
      Commit: e1e10a26 upstream
      
      Just a minor cleanup patch that makes things easier for the following patches.
      No functionality change in this patch.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      95824433
    • sched: Fix softirq time accounting · 49c6f4a2
      Venkatesh Pallipadi authored
      Commit: 75e1056f upstream
      
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is, softirq processing uses local_bh_disable internally. There
      is no way, later in the flow, to differentiate between whether softirq is
      being processed or bh has merely been disabled. So, a hardirq when bh is
      disabled results in time being wrongly accounted as softirq.
      
      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well, as account_system_time() in normal tick based accounting also uses
      softirq_count, which will be set even when not in softirq with bh disabled.
      
      Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
      for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
      processing. The patch below does that and adds API in_serving_softirq() which
      returns whether we are currently processing softirq or not.
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
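      
      A hedged sketch of the resulting bookkeeping (simplified):
      
      #define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)
      
      /* local_bh_disable()/enable() add and remove SOFTIRQ_DISABLE_OFFSET,
       * while actual softirq processing adds only SOFTIRQ_OFFSET, so the low
       * softirq bit now means "currently serving a softirq": */
      #define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)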
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      49c6f4a2
    • sched: Drop group_capacity to 1 only if local group has extra capacity · 1d3d2371
      Nikhil Rao authored
      Commit: 75dd321d upstream
      
      When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
      only if the local group has extra capacity. The extra check prevents the case
      where you always pull from the heaviest group when it is already under-utilized
      (possible when a large weight task outweighs the tasks on the system).
      
      For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
      scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
      and each task is running on one core. In this case, we observe the following
      events when balancing at the NUMA domain:
      
      - find_busiest_group() will always pick the sched group containing the niced
        task to be the busiest group.
      - find_busiest_queue() will then always pick one of the cpus running the
        nice0 task (never picks the cpu with the nice -15 task since
        weighted_cpuload > imbalance).
      - The load balancer fails to migrate the task since it is the running task
        and increments sd->nr_balance_failed.
      - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
        at which point it kicks off the active load balancer, wakes up the migration
        thread and kicks the nice 0 task off the cpu.
      
      The load balancer doesn't stop until we kick out all nice 0 tasks from
      the sched group, leaving you with 3 idle cpus and one cpu running the
      nice -15 task.
      
      When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
      domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
      not relevant because the niced task has a very large weight.
      
      In this patch, we add an extra condition to the "if(prefer_sibling)" check in
      update_sd_lb_stats(). We drop the capacity of a group only if the local group
      has extra capacity, ie. nr_running < group_capacity. This patch preserves the
      original intent of the prefer_siblings check (to spread tasks across the system
      in low utilization scenarios) and fixes the case above.
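      
      A hedged sketch of that extra condition (field names approximate):
      
      if (prefer_sibling && !local_group && sds->this_has_capacity)
      	sgs.group_capacity = min(sgs.group_capacity, 1UL);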
      
      It helps in the following ways:
      - In low utilization cases (where nr_tasks << nr_cpus), we still drop
        group_capacity down to 1 if we prefer siblings.
      - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
        likely be > sgs.group_capacity.
      - When balancing large weight tasks, if the local group does not have extra
        capacity, we do not pick the group with the niced task as the busiest group.
        This prevents failed balances, active migration and the under-utilization
        described above.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      1d3d2371
    • sched: Force balancing on newidle balance if local group has capacity · 703482e7
      Nikhil Rao authored
      Commit: fab47622 upstream
      
      This patch forces a load balance on a newly idle cpu when the local group has
      extra capacity and the busiest group does not have any. It improves system
      utilization when balancing tasks with a large weight differential.
      
      Under certain situations, such as a niced down task (i.e. nice = -15) in the
      presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
      kicks away other tasks because of its large weight. This leads to sub-optimal
      utilization of the machine. Even though the sched group has capacity, it does
      not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
      
      With this patch, if the local group has extra capacity, we shortcut the checks
      in f_b_g() and try to pull a task over. A sched group has extra capacity if the
      group capacity is greater than the number of running tasks in that group.
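      
      A hedged sketch of the shortcut in f_b_g() (simplified):
      
      if (idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
          !sds.busiest_has_capacity)
      	goto force_balance;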
      
      Thanks to Mike Galbraith for discussions leading to this patch and for the
      insight to reuse SD_NEWIDLE_BALANCE.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      703482e7
    • sched: Set group_imb only if a task can be pulled from the busiest cpu · 6e1d0fe9
      Nikhil Rao authored
      Commit: 2582f0eb upstream
      
      When cycling through sched groups to determine the busiest group, set
      group_imb only if the busiest cpu has more than 1 runnable task. This patch
      fixes the case where two cpus in a group have one runnable task each, but there
      is a large weight differential between these two tasks. The load balancer is
      unable to migrate any task from this group, and hence does not consider this
      group to be imbalanced.
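      
      An illustrative sketch only (the exact load threshold differs upstream):
      set the flag only when the busiest cpu actually has a task to give away.
      
      if ((max_cpu_load - min_cpu_load) > 2 * avg_load_per_task &&
          max_nr_running > 1)
      	sgs->group_imb = 1;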
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
      [ small code readability edits ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      6e1d0fe9
    • sched: Do not consider SCHED_IDLE tasks to be cache hot · 215856a4
      Nikhil Rao authored
      Commit: ef8002f6 upstream
      
      This patch adds a check in task_hot to return if the task has SCHED_IDLE
      policy. SCHED_IDLE tasks have very low weight, and when run with regular
      workloads, are typically scheduled many milliseconds apart. There is no
      need to consider these tasks hot for load balancing.
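      
      A hedged sketch of the early return described above:
      
      static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
      {
      	if (p->sched_class != &fair_sched_class)
      		return 0;
      
      	if (unlikely(p->policy == SCHED_IDLE))
      		return 0;	/* never treat SCHED_IDLE tasks as cache hot */
      
      	/* ... existing buddy and sysctl_sched_migration_cost checks ... */
      	return 1;
      }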
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      215856a4
    • sched: fix RCU lockdep splat from task_group() · acb2c6dc
      Peter Zijlstra authored
      Commit: 6506cf6c upstream
      
      This addresses the following RCU lockdep splat:
      
      [0.051203] CPU0: AMD QEMU Virtual CPU version 0.12.4 stepping 03
      [0.052999] lockdep: fixing up alternatives.
      [0.054105]
      [0.054106] ===================================================
      [0.054999] [ INFO: suspicious rcu_dereference_check() usage. ]
      [0.054999] ---------------------------------------------------
      [0.054999] kernel/sched.c:616 invoked rcu_dereference_check() without protection!
      [0.054999]
      [0.054999] other info that might help us debug this:
      [0.054999]
      [0.054999]
      [0.054999] rcu_scheduler_active = 1, debug_locks = 1
      [0.054999] 3 locks held by swapper/1:
      [0.054999]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff814be933>] cpu_up+0x42/0x6a
      [0.054999]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810400d8>] cpu_hotplug_begin+0x2a/0x51
      [0.054999]  #2:  (&rq->lock){-.-...}, at: [<ffffffff814be2f7>] init_idle+0x2f/0x113
      [0.054999]
      [0.054999] stack backtrace:
      [0.054999] Pid: 1, comm: swapper Not tainted 2.6.35 #1
      [0.054999] Call Trace:
      [0.054999]  [<ffffffff81068054>] lockdep_rcu_dereference+0x9b/0xa3
      [0.054999]  [<ffffffff810325c3>] task_group+0x7b/0x8a
      [0.054999]  [<ffffffff810325e5>] set_task_rq+0x13/0x40
      [0.054999]  [<ffffffff814be39a>] init_idle+0xd2/0x113
      [0.054999]  [<ffffffff814be78a>] fork_idle+0xb8/0xc7
      [0.054999]  [<ffffffff81068717>] ? mark_held_locks+0x4d/0x6b
      [0.054999]  [<ffffffff814bcebd>] do_fork_idle+0x17/0x2b
      [0.054999]  [<ffffffff814bc89b>] native_cpu_up+0x1c1/0x724
      [0.054999]  [<ffffffff814bcea6>] ? do_fork_idle+0x0/0x2b
      [0.054999]  [<ffffffff814be876>] _cpu_up+0xac/0x127
      [0.054999]  [<ffffffff814be946>] cpu_up+0x55/0x6a
      [0.054999]  [<ffffffff81ab562a>] kernel_init+0xe1/0x1ff
      [0.054999]  [<ffffffff81003854>] kernel_thread_helper+0x4/0x10
      [0.054999]  [<ffffffff814c353c>] ? restore_args+0x0/0x30
      [0.054999]  [<ffffffff81ab5549>] ? kernel_init+0x0/0x1ff
      [0.054999]  [<ffffffff81003850>] ? kernel_thread_helper+0x0/0x10
      [0.056074] Booting Node   0, Processors  #1lockdep: fixing up alternatives.
      [0.130045]  #2lockdep: fixing up alternatives.
      [0.203089]  #3 Ok.
      [0.275286] Brought up 4 CPUs
      [0.276005] Total of 4 processors activated (16017.17 BogoMIPS).
      
      The cgroup_subsys_state structures referenced by idle tasks are never
      freed, because the idle tasks should be part of the root cgroup,
      which is not removable.
      
      The problem is that while we do in fact hold rq->lock, the newly spawned
      idle thread's cpu is not yet set to the correct cpu, so the lockdep check
      in task_group():
      
        lockdep_is_held(&task_rq(p)->lock)
      
      will fail.
      
      But this is a chicken and egg problem.  Setting the CPU's runqueue requires
      that the CPU's runqueue already be set.  ;-)
      
      So insert an RCU read-side critical section to avoid the complaint.
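      
      A hedged sketch of that critical section in init_idle() (simplified):
      
      rcu_read_lock();
      __set_task_cpu(idle, cpu);	/* reaches task_group()'s lockdep check */
      rcu_read_unlock();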
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      acb2c6dc
    • sched: suppress RCU lockdep splat in task_fork_fair · f4de371f
      Paul E. McKenney authored
      Commit: b0a0f667 upstream
      
      > ===================================================
      > [ INFO: suspicious rcu_dereference_check() usage. ]
      > ---------------------------------------------------
      > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
      >
      > other info that might help us debug this:
      >
      > rcu_scheduler_active = 1, debug_locks = 1
      > 1 lock held by ifup/23517:
      >   #0:  (&rq->lock){-.-.-.}, at: [<c042f782>] task_fork_fair+0x3b/0x108
      >
      > stack backtrace:
      > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
      > Call Trace:
      >   [<c075e219>] ? printk+0xf/0x16
      >   [<c0455842>] lockdep_rcu_dereference+0x74/0x7d
      >   [<c0426854>] task_group+0x6d/0x79
      >   [<c042686e>] set_task_rq+0xe/0x57
      >   [<c042f79e>] task_fork_fair+0x57/0x108
      >   [<c042e965>] sched_fork+0x82/0xf9
      >   [<c04334b3>] copy_process+0x569/0xe8e
      >   [<c0433ef0>] do_fork+0x118/0x262
      >   [<c076302f>] ? do_page_fault+0x16a/0x2cf
      >   [<c044b80c>] ? up_read+0x16/0x2a
      >   [<c04085ae>] sys_clone+0x1b/0x20
      >   [<c04030a5>] ptregs_clone+0x15/0x30
      >   [<c0402f1c>] ? sysenter_do_call+0x12/0x38
      
      Here a newly created task is having its runqueue assigned.  The new task
      is not yet on the tasklist, so cannot go away.  This is therefore a false
      positive, suppress with an RCU read-side critical section.
      
      Reported-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      f4de371f
    • sched: Give CPU bound RT tasks preference · f1d70344
      stable-bot for Steven Rostedt authored
      From: Steven Rostedt <srostedt@redhat.com>
      
      Commit: b3bc211c upstream
      
      If a high priority task is waking up on a CPU that is running a
      lower priority task that is bound to a CPU, see if we can move the
      high RT task to another CPU first. Note, if all other CPUs are
      running higher priority tasks than the CPU-bound current task,
      then it will be preempted regardless.
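      
      A hedged, simplified sketch of the wakeup-time decision (not the literal
      rt.c code):
      
      if (unlikely(rt_task(curr)) &&
          (curr->rt.nr_cpus_allowed < 2 || curr->prio <= p->prio) &&
          p->rt.nr_cpus_allowed > 1) {
      	int target = find_lowest_rq(p);
      
      	if (target != -1)
      		cpu = target;	/* run p elsewhere, leave the pinned task alone */
      }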
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.888922071@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      f1d70344
    • sched: Try not to migrate higher priority RT tasks · f266611e
      Steven Rostedt authored
      Commit: 43fa5460 upstream
      
      When first working on the RT scheduler design, we concentrated on
      keeping all CPUs running RT tasks instead of having multiple RT
      tasks on a single CPU waiting for the migration thread to move
      them. Instead we take a more proactive stance and push or pull RT
      tasks from one CPU to another on wakeup or scheduling.
      
      When an RT task wakes up on a CPU that is running another RT task,
      instead of preempting it and killing the cache of the running RT
      task, we look to see if we can migrate the RT task that is waking
      up, even if the RT task waking up is of higher priority.
      
      This may sound a bit odd, but RT tasks should be limited in
      migration by the user anyway. But in practice, people do not do
      this, which causes high prio RT tasks to bounce around the CPUs.
      This becomes even worse when we have priority inheritance, because
      a high prio task can block on a lower prio task and boost its
      priority. When the lower prio task wakes up the high prio task, if
      it happens to be on the same CPU it will migrate off of it.
      
      But in reality, the above does not happen much either, because the
      wake up of the lower prio task, which has already been boosted, if
      it was on the same CPU as the higher prio task, it would then
      migrate off of it. But anyway, we do not want to migrate them
      either.
      
      To examine the scheduling, I created a test program and examined it
      under kernelshark. The test program created CPU * 2 threads, where
      each thread had a different priority. The program takes different
      options. The options used in this change log was to have priority
      inheritance mutexes or not.
      
      All threads did the following loop:
      
      static void grab_lock(long id, int iter, int l)
      {
      	ftrace_write("thread %ld iter %d, taking lock %d\n",
      		     id, iter, l);
      	pthread_mutex_lock(&locks[l]);
      	ftrace_write("thread %ld iter %d, took lock %d\n",
      		     id, iter, l);
      	busy_loop(nr_tasks - id);
      	ftrace_write("thread %ld iter %d, unlock lock %d\n",
      		     id, iter, l);
      	pthread_mutex_unlock(&locks[l]);
      }
      
      void *start_task(void *id)
      {
      	[...]
      	while (!done) {
      		for (l = 0; l < nr_locks; l++) {
      			grab_lock(id, i, l);
      			ftrace_write("thread %ld iter %d sleeping\n",
      				     id, i);
      			ms_sleep(id);
      		}
      		i++;
      	}
      	[...]
      }
      
      The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
      ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
      to the ftrace buffer to help analyze via ftrace.
      
      The higher the id, the higher the prio, the shorter it does the
      busy loop, but the longer it spins. This is usually the case with
      RT tasks, the lower priority tasks usually run longer than higher
      priority tasks.
      
      At the end of the test, it records the number of loops each thread
      took, as well as the number of voluntary preemptions, non-voluntary
      preemptions, and number of migrations each thread took, taking the
      information from /proc/$$/sched and /proc/$$/status.
      
      Running this on a 4 CPU processor, the results without changes to
      the kernel looked like this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         53      3220       1470             98
        1:        562       773        724             98
        2:        752       933       1375             98
        3:        749        39        697             98
        4:        758         5        515             98
        5:        764         2        679             99
        6:        761         2        535             99
        7:        757         3        346             99
      
      total:     5156       4977      6341            787
      
      Each thread regardless of priority migrated a few hundred times.
      The higher priority tasks, were a little better but still took
      quite an impact.
      
      By letting higher priority tasks bump the lower prio task from the
      CPU, things changed a bit:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         37      2835       1937             98
        1:        666      1821       1865             98
        2:        654      1003       1385             98
        3:        664       635        973             99
        4:        698       197        352             99
        5:        703       101        159             99
        6:        708         1         75             99
        7:        713         1          2             99
      
      total:     4843       6594      6748            789
      
      The total # of migrations did not change (several runs showed the
      difference all within the noise). But we now see a dramatic
      improvement to the higher priority tasks. (kernelshark showed that
      the watchdog timer bumped the highest priority task to give it the
      2 count. This was actually consistent with every run).
      
      Notice that the # of iterations did not change either.
      
      The above was with priority inheritance mutexes. That is, when the
      higher priority task blocked on a lower priority task, the lower
      priority task would inherit the higher priority task (which shows
      why task 6 was bumped so many times). When not using priority
      inheritance mutexes, the current kernel shows this:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:         56      3101       1892             95
        1:        594       713        937             95
        2:        625       188        618             95
        3:        628         4        491             96
        4:        640         7        468             96
        5:        631         2        501             96
        6:        641         1        466             96
        7:        643         2        497             96
      
      total:     4458       4018      5870            765
      
      Not much changed with or without priority inheritance mutexes. But
      if we let the high priority task bump lower priority tasks on
      wakeup we see:
      
      Task        vol    nonvol   migrated     iterations
      ----        ---    ------   --------     ----------
        0:        115      3439       2782             98
        1:        633      1354       1583             99
        2:        652       919       1218             99
        3:        645       713        934             99
        4:        690         3          3             99
        5:        694         1          4             99
        6:        720         3          4             99
        7:        747         0          1            100
      
      Which shows an even bigger change. The big difference between task 3
      and task 4 is because we have only 4 CPUs on the machine, causing
      the 4 highest prio tasks to always have preference.
      
      Although I did not measure cache misses, and I'm sure there would
      be little to measure since the test was not data intensive, I could
      imagine large improvements for higher priority tasks when dealing
      with lower priority tasks. Thus, I'm satisfied with making the
      change and agreeing with what Gregory Haskins argued a few years
      ago when we first had this discussion.
      
      One final note. All tasks in the above tests were RT tasks. Any RT
      task will always preempt a non RT task that is running on the CPU
      the RT task wants to run on.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Gregory Haskins <ghaskins@novell.com>
      LKML-Reference: <20100921024138.605460343@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      f266611e
    • sched: Increment cache_nice_tries only on periodic lb · 00f3566f
      Venkatesh Pallipadi authored
      Commit: 58b26c4c upstream
      
      The scheduler uses cache_nice_tries as an indicator to do cache_hot and
      active load balance, when normal load balance fails. Currently,
      this value is changed on any failed load balance attempt. That ends
      up being not so nice to workloads that enter/exit idle often, as
      they do more frequent new_idle balance and that pretty soon results
      in cache hot tasks being pulled in.
      
      Making the cache_nice_tries ignore failed new_idle balance seems to
      make better sense. With that only the failed load balance in
      periodic load balance gets accounted and the rate of accumulation
      of cache_nice_tries will not depend on idle entry/exit (short
      running sleep-wakeup kind of tasks). This reduces movement of
      cache_hot tasks.
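      
      A hedged sketch of the change (simplified): only periodic-balance
      failures feed the counter that is compared against cache_nice_tries.
      
      if (idle != CPU_NEWLY_IDLE)
      	sd->nr_balance_failed++;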
      
      schedstat diff (after-before) excerpt from a workload that has a
      frequent and short wakeup-idle pattern (:2 in the cpu column below refers
      to the NEWIDLE idx). This snapshot was across ~400 seconds.
      
      Without this change:
      domainstats:  domain0
       cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
       0:2  306487   219575    73167  110069413    44583    19070     1172   218403
       1:2  292139   194853    81421  120893383    50745    21902     1259   193594
       2:2  283166   174607    91359  129699642    54931    23688     1287   173320
       3:2  273998   161788    93991  132757146    57122    24351     1366   160422
       4:2  289851   215692    62190  83398383    36377    13680      851   214841
       5:2  316312   222146    77605  117582154    49948    20281      988   221158
       6:2  297172   195596    83623  122133390    52801    21301      929   194667
       7:2  283391   178078    86378  126622761    55122    22239      928   177150
       8:2  297655   210359    72995  110246694    45798    19777     1125   209234
       9:2  297357   202011    79363  119753474    50953    22088     1089   200922
      10:2  278797   178703    83180  122514385    52969    22726     1128   177575
      11:2  272661   167669    86978  127342327    55857    24342     1195   166474
      12:2  293039   204031    73211  110282059    47285    19651      948   203083
      13:2  289502   196762    76803  114712942    49339    20547     1016   195746
      14:2  264446   169609    78292  115715605    50459    21017      982   168627
      15:2  260968   163660    80142  116811793    51483    21281     1064   162596
      
      With this change:
      domainstats:  domain0
       cpu     cnt      bln      fld      imb     gain    hgain  nobusyq  nobusyg
       0:2  272347   187380    77455  105420270    24975        1      953   186427
       1:2  267276   172360    86234  116242264    28087        6     1028   171332
       2:2  259769   156777    93281  123243134    30555        1     1043   155734
       3:2  250870   143129    97627  127370868    32026        6     1188   141941
       4:2  248422   177116    64096  78261112    22202        2      757   176359
       5:2  275595   180683    84950  116075022    29400        6      778   179905
       6:2  262418   162609    88944  119256898    31056        4      817   161792
       7:2  252204   147946    92646  122388300    32879        4      824   147122
       8:2  262335   172239    81631  110477214    26599        4      864   171375
       9:2  261563   164775    88016  117203621    28331        3      849   163926
      10:2  243389   140949    93379  121353071    29585        2      909   140040
      11:2  242795   134651    98310  124768957    30895        2     1016   133635
      12:2  255234   166622    79843  104696912    26483        4      746   165876
      13:2  244944   151595    83855  109808099    27787        3      801   150794
      14:2  241301   140982    89935  116954383    30403        6      845   140137
      15:2  232271   128564    92821  119185207    31207        4     1416   127148
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      00f3566f
    • sched: Move sched_avg_update() to update_cpu_load() · deb28d43
      Suresh Siddha authored
      Commit: da2b71ed upstream
      
      Currently sched_avg_update() (which updates rt_avg stats in the rq)
      is getting called from scale_rt_power() (in the load balance context)
      which doesn't take rq->lock.
      
      Fix it by moving the sched_avg_update() to more appropriate
      update_cpu_load() where the CFS load gets updated as well.
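      
      A hedged sketch of the new placement (surrounding code elided):
      
      static void update_cpu_load(struct rq *this_rq)
      {
      	/* ... existing cpu_load[] decay, done under rq->lock ... */
      
      	sched_avg_update(this_rq);
      }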
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
      deb28d43
    • sched: Remove remaining USER_SCHED code · 05db2a0c
      Li Zefan authored
      Commit: 32bd7eb5 upstream
      
      This is left over from commit 7c941438 ("sched: Remove USER_SCHED").
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: David Howells <dhowells@redhat.com>
      LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      05db2a0c
    • sched: Remove USER_SCHED · b271aebc
      Dhaval Giani authored
      Commit: 7c941438 upstream
      
      Remove the USER_SCHED feature. It has been scheduled to be removed in
      2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2
      
      [trace from referenced thread]
      [1046577.884289] general protection fault: 0000 [#1] SMP
      [1046577.911332] last sysfs file: /sys/devices/platform/coretemp.7/temp1_input
      [1046577.938715] CPU 3
      [1046577.965814] Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables coretemp k8temp
      [1046577.994456] Pid: 38, comm: events/3 Not tainted 2.6.32.27intel #1 X8DT3
      [1046578.023166] RIP: 0010:[] [] sched_destroy_group+0x3c/0x10d
      [1046578.052639] RSP: 0000:ffff88043e5abe10 EFLAGS: 00010097
      [1046578.081360] RAX: ffff880139fa5540 RBX: ffff8803d18419c0 RCX: ffff8801d2f8fb78
      [1046578.109903] RDX: dead000000200200 RSI: 0000000000000000 RDI: 0000000000000000
      [1046578.109905] RBP: 0000000000000246 R08: 0000000000000020 R09: ffffffff816339b8
      [1046578.109907] R10: 0000000004e6e5f0 R11: 0000000000000006 R12: ffffffff816339b8
      [1046578.109909] R13: ffff8803d63ac4e0 R14: ffff88043e582340 R15: ffffffff8104a216
      [1046578.109911] FS: 0000000000000000(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
      [1046578.109914] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      [1046578.109915] CR2: 00007f55ab220000 CR3: 00000001e5797000 CR4: 00000000000006e0
      [1046578.109917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [1046578.109919] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [1046578.109922] Process events/3 (pid: 38, threadinfo ffff88043e5aa000, task ffff88043e582340)
      [1046578.109923] Stack:
      [1046578.109924] ffff8803d63ac498 ffff8803d63ac4d8 ffff8803d63ac440 ffffffff8104a2c3
      [1046578.109927] <0> ffff88043e5abef8 ffff880028276040 ffff8803d63ac4d8 ffffffff81050395
      [1046578.109929] <0> ffff88043e582340 ffff88043e5826c8 ffff88043e582340 ffff88043e5abfd8
      [1046578.109932] Call Trace:
      [1046578.109938] [] ? cleanup_user_struct+0xad/0xcc
      [1046578.109942] [] ? worker_thread+0x148/0x1d4
      [1046578.109946] [] ? autoremove_wake_function+0x0/0x2e
      [1046578.109948] [] ? worker_thread+0x0/0x1d4
      [1046578.109951] [] ? kthread+0x79/0x81
      [1046578.109955] [] ? child_rip+0xa/0x20
      [1046578.109957] [] ? kthread+0x0/0x81
      [1046578.109959] [] ? child_rip+0x0/0x20
      [1046578.109961] Code: 3c 00 4c 8b 25 02 98 3d 00 48 89 c5 83 cf ff eb 5c 48 8b 43 10 48 63 f7 48 8b 04 f0 48 8b 90 80 00 00 00 48 8b 48 78 48 89 51 08 <48> 89 0a 48 b9 00 02 20 00 00 00 ad de 48 89 88 80 00 00 00 48
      [1046578.109975] RIP [] sched_destroy_group+0x3c/0x10d
      [1046578.109979] RSP
      [1046578.109981] ---[ end trace 5ebc2944b7872d4a ]---
      Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1263990378.24844.3.camel@localhost>
      LKML-Reference: http://marc.info/?l=linux-kernel&m=129466345327931
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      b271aebc