1. 28 Aug, 2009 1 commit
    • sched: Fix division by zero - really · 34d76c41
      Peter Zijlstra authored
      When re-computing the shares for each task group's cpu
      representation we need the ratio of weight on each cpu vs the
      total weight of the sched domain.
      
      Since load-balancing is loosely (read: not) synchronized, the
      weight of individual cpus can change between doing the sum and
      calculating the ratio.
      
      The previous patch dealt with only one of the race scenarios.
      This patch sidesteps them all by saving a snapshot of all the
      individual cpu weights, thereby always working on a consistent
      set.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: torvalds@linux-foundation.org
      Cc: jes@sgi.com
      Cc: jens.axboe@oracle.com
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1251371336.18584.77.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
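The snapshot idea in this commit message can be sketched in user-space C. This is an illustration only, not the kernel patch: `distribute_shares`, `NR_CPUS_SKETCH` and the handling of an all-zero sum are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4  /* hypothetical CPU count, for illustration only */

/*
 * Split group_shares across CPUs in proportion to their weights.
 * live_weight[] may be modified concurrently by other CPUs, so each entry
 * is read exactly once into snap[]; the sum and every ratio are then
 * computed from that private copy and always agree with each other.
 */
static void distribute_shares(const volatile unsigned long *live_weight,
                              unsigned long group_shares,
                              unsigned long *shares_out)
{
    unsigned long snap[NR_CPUS_SKETCH];
    unsigned long sum = 0;
    size_t i;

    assert(live_weight != NULL && shares_out != NULL);

    for (i = 0; i < NR_CPUS_SKETCH; i++) {
        snap[i] = live_weight[i];  /* single read of the live value */
        sum += snap[i];
    }

    for (i = 0; i < NR_CPUS_SKETCH; i++) {
        /* Assumption: an all-zero snapshot simply yields zero shares. */
        shares_out[i] = sum ? group_shares * snap[i] / sum : 0;
    }
}
```

Because the divisor is derived from the same values as the numerators, it cannot drop to zero between the summing pass and the ratio pass, which is the race the message describes.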
  2. 21 Aug, 2009 1 commit
  3. 20 Aug, 2009 1 commit
  4. 18 Aug, 2009 1 commit
  5. 09 Aug, 2009 1 commit
    • sched: Add default defines for PREEMPT_ACTIVE · 8e5b59a2
      Arnd Bergmann authored
      The PREEMPT_ACTIVE setting doesn't actually need to be
      arch-specific, so set up a sane default for all arches to
      (hopefully) migrate to.
      
      > if we look at linux/hardirq.h, it makes this claim:
      >  * - bit 28 is the PREEMPT_ACTIVE flag
      > if that's true, then why are we letting any arch set this define ?  a
      > quick survey shows that half the arches (11) are using 0x10000000 (bit
      > 28) while the other half (10) are using 0x4000000 (bit 26).  and then
      > there is the ia64 oddity which uses bit 30.  the exact value here
      > shouldnt really matter across arches though should it ?
      
      Actually, alpha, arm and avr32 also use bit 30 (0x40000000);
      only five (or eight, depending on how you count)
      architectures (blackfin, h8300, m68k, s390 and sparc) use bit
      26.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
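A minimal sketch of the "default unless the arch overrides it" pattern follows. It assumes the default is bit 28, as the quoted hardirq.h comment claims; the actual patch may derive the value differently. `mask_to_bit` is a hypothetical helper used only to check the hex values quoted in the discussion.

```c
#include <assert.h>

/*
 * An architecture header could define PREEMPT_ACTIVE before this point.
 * If none does, fall back to a common default (assumed here: bit 28).
 */
#ifndef PREEMPT_ACTIVE
#define PREEMPT_ACTIVE 0x10000000UL
#endif

/* Hypothetical helper: return the bit index of a single-bit mask. */
static int mask_to_bit(unsigned long mask)
{
    int bit = 0;

    /* Exactly one bit must be set. */
    assert(mask != 0 && (mask & (mask - 1)) == 0);

    while ((mask >>= 1) != 0)
        bit++;
    return bit;
}
```

With this helper, the three values mentioned in the thread map to bits 28 (0x10000000), 26 (0x4000000) and 30 (0x40000000).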
  6. 02 Aug, 2009 13 commits
    • sched: Fix cpupri build on !CONFIG_SMP · bcf08df3
      Ingo Molnar authored
      This build bug:
      
       In file included from kernel/sched.c:1765:
       kernel/sched_rt.c: In function ‘has_pushable_tasks’:
       kernel/sched_rt.c:1069: error: ‘struct rt_rq’ has no member named ‘pushable_tasks’
       kernel/sched_rt.c: In function ‘pick_next_task_rt’:
       kernel/sched_rt.c:1084: error: ‘struct rq’ has no member named ‘post_schedule’
      
      Triggers because both pushable_tasks and post_schedule are
      SMP-only fields.
      
      Move has_pushable_tasks() to the SMP section and #ifdef the post_schedule use.
      
      Cc: Gregory Haskins <ghaskins@novell.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Ensure the migration task doesn't go away during use · 693525e3
      Peter Zijlstra authored
      Like sched_migrate_task(), set_cpus_allowed_ptr() should hold
      onto the migration thread too.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Add debug check to task_of() · 8f48894f
      Peter Zijlstra authored
      A frequent mistake appears to be to call task_of() on a
      scheduler entity that is not actually a task, which can result
      in a wild pointer.
      
      Add a check to catch these mistakes.
      Suggested-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
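The kind of check described here can be illustrated with a small user-space sketch. The struct layouts, the `is_task` flag and the use of `assert()` are assumptions for the example; the kernel uses its own entity test and warning machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sched_entity_sketch {
    int is_task;  /* 1 if embedded in a task, 0 if it represents a group */
};

struct task_sketch {
    int pid;
    struct sched_entity_sketch se;
};

/*
 * Recover the enclosing task from its embedded scheduler entity.
 * The pointer arithmetic is only valid when the entity really lives
 * inside a task, so verify that first instead of returning a wild pointer.
 */
static struct task_sketch *task_of_sketch(struct sched_entity_sketch *se)
{
    assert(se != NULL && se->is_task);
    return (struct task_sketch *)((char *)se -
                                  offsetof(struct task_sketch, se));
}
```

Calling `task_of_sketch()` on a group entity trips the assertion immediately, rather than letting a bogus pointer propagate.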
    • sched: Fully integrate cpus_active_map and root-domain code · 00aec93d
      Gregory Haskins authored
      Reflect "active" cpus in the rq->rd->online field, instead of
      the online_map.
      
      The motivation is that things that use the root-domain code
      (such as cpupri) only care about cpus classified as "active"
      anyway. By synchronizing the root-domain state with the active
      map, we allow several optimizations.
      
      For instance, we can remove an extra cpumask_and from the
      scheduler hotpath by utilizing rq->rd->online (since it is now
      a cached version of cpu_active_map & rq->rd->span).
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Max Krasnyansky <maxk@qualcomm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090730145723.25226.24493.stgit@dev.haskins.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Enhance the pre/post scheduling logic · 3f029d3c
      Gregory Haskins authored
      We currently have an explicit "needs_post" vtable method which
      returns a stack variable for whether we should later run
      post-schedule.  This leads to an awkward exchange of the
      variable as it bubbles back up out of the context switch. Peter
      Zijlstra observed that this information could be stored in the
      run-queue itself instead of handled on the stack.
      
      Therefore, we revert to the method of having context_switch
      return void, and update an internal rq->post_schedule variable
      when we require further processing.
      
      In addition, we fix a race condition where we try to access
      current->sched_class without holding the rq->lock.  This is
      technically racy, as the sched-class could change out from
      under us.  Instead, we reference the per-rq post_schedule
      variable with the runqueue unlocked, but with preemption
      disabled to see if we need to reacquire the rq->lock.
      
      Finally, we clean the code up slightly by removing the #ifdef
      CONFIG_SMP conditionals from the schedule() call, and implement
      some inline helper functions instead.
      
      This patch passes checkpatch, and rt-migrate.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
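The pattern of storing the "needs post-schedule" decision in the run-queue can be sketched as follows. All names are hypothetical, and the lock is modelled by a simple depth counter rather than a real spinlock.

```c
#include <assert.h>

/* Hypothetical, simplified run-queue. */
struct rq_sketch {
    int post_schedule;  /* set under the lock when follow-up work is needed */
    int lock_depth;     /* stands in for rq->lock: 0 = unlocked */
    int pushes;         /* counts how often the follow-up work ran */
};

/*
 * Called with the run-queue "locked": record the need for post-schedule
 * work in the rq itself instead of returning it up the call chain.
 */
static void pick_next_sketch(struct rq_sketch *rq, int has_pushable)
{
    rq->post_schedule = has_pushable;
}

/*
 * Called after the lock has been dropped. The flag is tested without the
 * lock; the lock is only retaken when there is something to do.
 */
static void post_schedule_sketch(struct rq_sketch *rq)
{
    assert(rq->lock_depth == 0);

    if (!rq->post_schedule)
        return;

    rq->lock_depth++;        /* reacquire the lock */
    rq->pushes++;            /* placeholder for the class-specific work */
    rq->post_schedule = 0;
    rq->lock_depth--;        /* release it again */
}
```

The context-switch path then needs no return value, and the common case (flag clear) never touches the lock.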
    • sched: Add new prio to cpupri before removing old prio · c3a2ae3d
      Steven Rostedt authored
      We need to add the new prio to the cpupri accounting before
      removing the old prio. Removing the old prio first opens a
      race window in which the cpu is absent from pri_active and
      therefore not visible to RT pushes and pulls. This could keep
      an RT task from migrating appropriately and create a very
      large latency.
      
      This bug was found with the use of ftrace sched events and
      trace_printk.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090729042526.438281019@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
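The ordering argument can be illustrated with a single-CPU sketch. The structure and function names are hypothetical; the real cpupri code tracks per-priority cpumasks and counts.

```c
#include <assert.h>

#define NR_PRIO_SKETCH 4  /* hypothetical number of priority levels */

/* For one CPU: at which priority levels is it currently advertised? */
struct cpupri_sketch {
    int has_cpu[NR_PRIO_SKETCH];
};

/* Number of levels at which a lockless reader could find the CPU. */
static int visible_levels(const struct cpupri_sketch *cp)
{
    int i, n = 0;

    for (i = 0; i < NR_PRIO_SKETCH; i++)
        n += cp->has_cpu[i] != 0;
    return n;
}

/*
 * Move the CPU from oldpri to newpri. Adding first means the CPU is
 * briefly listed twice, but never listed nowhere.
 */
static void cpupri_set_sketch(struct cpupri_sketch *cp, int oldpri, int newpri)
{
    assert(oldpri >= 0 && oldpri < NR_PRIO_SKETCH);
    assert(newpri >= 0 && newpri < NR_PRIO_SKETCH);

    if (oldpri == newpri)
        return;

    cp->has_cpu[newpri] = 1;          /* step 1: add the new prio */
    assert(visible_levels(cp) >= 1);  /* a reader here still sees the CPU */
    cp->has_cpu[oldpri] = 0;          /* step 2: remove the old prio */
}
```

Reversing the two steps would leave a point between them where `visible_levels()` returns 0, which corresponds to the window described in the commit message.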
    • sched: Check for pushing rt tasks after all scheduling · da19ab51
      Steven Rostedt authored
      The current method for pushing RT tasks after scheduling only
      happens after a context switch. But we found cases where a task
      is set up on a run queue to be pushed but the push never
      happens because the schedule chooses the same task.
      
      This bug was found with the help of Gregory Haskins and the use
      of ftrace (trace_printk). It took several days for both of us
      analyzing the code and the trace output to find this.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090729042526.205923666@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Optimize unused cgroup configuration · e7097159
      Peter Zijlstra authored
      When cgroup group scheduling is built in, skip some code paths
      if we don't have any (but the root) cgroups configured.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix cgroup smp fairness · a5004278
      Peter Zijlstra authored
      Commit ec4e0e2f ("fix
      inconsistency when redistribute per-cpu tg->cfs_rq shares")
      broke cgroup smp fairness.
      
      In order to avoid starvation of newly placed tasks, we never
      quite set the share of an empty cpu group-task to 0, but
      instead we set it as if there's a single NICE-0 task present.
      
      If however we actually set this in cfs_rq[cpu]->shares, that
      means the total shares for that group will be slightly inflated
      every time we balance, causing the observed unfairness.
      
      Fix this by setting cfs_rq[cpu]->shares to 0 but actually
      setting the effective weight of the related se to the inflated
      number.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1248696557.6987.1615.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
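The distinction between the stored share and the effective weight might look like the sketch below. The struct, the function and `NICE_0_LOAD_SKETCH` (taken as 1024 here) are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>

#define NICE_0_LOAD_SKETCH 1024UL  /* assumed weight of one nice-0 task */

/* Hypothetical per-cpu view of a task group. */
struct group_cpu_sketch {
    unsigned long shares;     /* value that is summed when rebalancing */
    unsigned long se_weight;  /* weight the group entity actually runs with */
};

/*
 * Recompute one cpu's portion of the group's shares. An empty cpu is
 * treated as if it held a single nice-0 task so newly placed tasks are not
 * starved, but that inflated value is kept out of ->shares so it does not
 * leak into the group's total.
 */
static void update_group_cpu_sketch(struct group_cpu_sketch *g,
                                    unsigned long rq_weight,
                                    unsigned long sd_rq_weight,
                                    unsigned long sd_shares)
{
    unsigned long shares;
    int boost = 0;

    assert(sd_rq_weight != 0);

    if (rq_weight == 0) {
        boost = 1;
        rq_weight = NICE_0_LOAD_SKETCH;
    }

    shares = sd_shares * rq_weight / sd_rq_weight;

    g->shares = boost ? 0 : shares;  /* accounting value */
    g->se_weight = shares;           /* effective value */
}
```

For an empty cpu the accounting value stays at 0 while the entity still receives a small non-zero weight.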
    • Merge branch 'sched/urgent' into sched/core · 8e9ed8b0
      Ingo Molnar authored
      Merge reason: avoid upcoming patch conflict.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix race in cpupri introduced by cpumask_var changes · 07903af1
      Gregory Haskins authored
      Background:
      
      Several race conditions in the scheduler have cropped up
      recently, which Steven and I have tracked down using ftrace.
      The most recent one turns out to be a race in how the scheduler
      determines a suitable migration target for RT tasks, introduced
      recently with commit:
      
          commit 68e74568
          Date:   Tue Nov 25 02:35:13 2008 +1030
      
              sched: convert struct cpupri_vec cpumask_var_t.
      
      The original design of cpupri allowed lockless readers to
      quickly determine a best-estimate target.  Races between the
      pri_active bitmap and the vec->mask were handled in the
      original code because we would detect and return "0" when this
      occurred.  The design was predicated on the *effective*
      atomicity (*) of caching the result of cpus_and() between the
      cpus_allowed and the vec->mask.
      
      Commit 68e74568 changed the behavior such that vec->mask is
      accessed multiple times.  This introduces a subtle race, the
      result of which means we can have a result that returns "1",
      but with an empty bitmap.
      
      *) yes, we know cpus_and() is not a locked operator across the
         entire composite array, but it is implicitly atomic on a
         per-word basis which is all the design required to work.
      
      Implementation:
      
      Rather than forgoing the lockless design, or reverting to a
      stack-based cpumask_t, we simply check for when the race has
      been encountered and continue processing in the event that the
      race is hit.  This renders the removal race as if the priority
      bit had been atomically cleared as well, and allows the
      algorithm to execute correctly.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      CC: Rusty Russell <rusty@rustcorp.com.au>
      CC: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090730145728.25226.92769.stgit@dev.haskins.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
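The "detect the race and keep going" approach can be sketched with one-word masks. The function and parameter names are hypothetical, and the real code operates on `cpumask_var_t` rather than plain words.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the lowest active priority level that has a CPU the task may run on.
 * pri_active[idx] and vec_mask[idx] are updated separately by writers, so a
 * level can look active while its mask is already empty. Instead of
 * reporting success with an empty result, such a level is skipped, as if
 * its priority bit had been cleared atomically with the mask.
 */
static int cpupri_find_sketch(const int *pri_active,
                              const volatile unsigned long *vec_mask,
                              int nr_prio,
                              unsigned long allowed,
                              unsigned long *lowest_mask)
{
    int idx;

    assert(lowest_mask != NULL);

    for (idx = 0; idx < nr_prio; idx++) {
        unsigned long candidates;

        if (!pri_active[idx])
            continue;

        /* Read the shared mask once and work on the local result. */
        candidates = vec_mask[idx] & allowed;

        if (candidates == 0)
            continue;  /* raced with removal, or no allowed CPU here */

        *lowest_mask = candidates;
        return 1;
    }
    return 0;
}
```

A return value of 1 therefore always comes with a non-empty `*lowest_mask`.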
    • sched: Fix latencytop and sleep profiling vs group scheduling · e414314c
      Peter Zijlstra authored
      The latencytop and sleep accounting code assumes that every
      scheduler entity represents a task; with group scheduling this
      is not so.
      
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix cond_resched_lock() in !CONFIG_PREEMPT · 716a4234
      Frederic Weisbecker authored
      The might_sleep() test inside cond_resched_lock() assumes the
      spinlock is held and that preemption is therefore disabled.
      This is true with CONFIG_PREEMPT, but otherwise taking the
      lock doesn't change preempt_count().
      
      Check by starting from the appropriate preempt offset depending
      on the config.
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1248458723-12146-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
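A config-dependent starting offset for the sleep check could be sketched like this. The macro and function names carry a `_SKETCH`/`_sketch` suffix to mark them as hypothetical; `CONFIG_PREEMPT_SKETCH` stands in for the real config option and is left undefined here.

```c
#include <assert.h>

/* Uncomment to model a preemptible kernel (hypothetical switch): */
/* #define CONFIG_PREEMPT_SKETCH 1 */

/*
 * With preemption support, holding one spinlock raises the preempt count
 * by one; without it, the count is unchanged by the lock.
 */
#ifdef CONFIG_PREEMPT_SKETCH
#define LOCK_OFFSET_SKETCH 1
#else
#define LOCK_OFFSET_SKETCH 0
#endif

/* Returns 1 if the count differs from what the caller legitimately holds. */
static int count_mismatch_sketch(int preempt_count, int expected_offset)
{
    assert(preempt_count >= 0 && expected_offset >= 0);
    return preempt_count != expected_offset;
}

/* The check a cond_resched_lock()-style helper would perform. */
static int cond_resched_lock_would_warn_sketch(int preempt_count)
{
    return count_mismatch_sketch(preempt_count, LOCK_OFFSET_SKETCH);
}
```

Comparing against a fixed offset of 1 regardless of configuration would flag every caller on a non-preemptible build, which is the false positive the commit addresses.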
  7. 01 Aug, 2009 1 commit
  8. 31 Jul, 2009 19 commits
  9. 30 Jul, 2009 2 commits