1. 06 May, 2010 4 commits
    • Tejun Heo's avatar
      sched: kill paranoia check in synchronize_sched_expedited() · 94458d5e
      Tejun Heo authored
      The paranoid check which verifies that the cpu_stop callback is
      actually called on all online cpus is completely superflous.  It's
      guaranteed by cpu_stop facility and if it didn't work as advertised
      other things would go horribly wrong and trying to recover using
      synchronize_sched() wouldn't be very meaningful.
      
      Kill the paranoid check.  Removal of this feature is done as a
      separate step so that it can serve as a bisection point if something
      actually goes wrong.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      94458d5e
    • Tejun Heo's avatar
      sched: replace migration_thread with cpu_stop · 969c7921
      Tejun Heo authored
      Currently migration_thread is serving three purposes - migration
      pusher, context to execute active_load_balance() and forced context
      switcher for expedited RCU synchronize_sched.  All three roles are
      hardcoded into migration_thread() and determining which job is
      scheduled is slightly messy.
      
      This patch kills migration_thread and replaces all three uses with
      cpu_stop.  The three different roles of migration_thread() are
      splitted into three separate cpu_stop callbacks -
      migration_cpu_stop(), active_load_balance_cpu_stop() and
      synchronize_sched_expedited_cpu_stop() - and each use case now simply
      asks cpu_stop to execute the callback as necessary.
      
      synchronize_sched_expedited() was implemented with private
      preallocated resources and custom multi-cpu queueing and waiting
      logic, both of which are provided by cpu_stop.
      synchronize_sched_expedited_count is made atomic and all other shared
      resources along with the mutex are dropped.
      
      synchronize_sched_expedited() also implemented a check to detect cases
      where not all the callback got executed on their assigned cpus and
      fall back to synchronize_sched().  If called with cpu hotplug blocked,
      cpu_stop already guarantees that and the condition cannot happen;
      otherwise, stop_machine() would break.  However, this patch preserves
      the paranoid check using a cpumask to record on which cpus the stopper
      ran so that it can serve as a bisection point if something actually
      goes wrong theree.
      
      Because the internal execution state is no longer visible,
      rcu_expedited_torture_stats() is removed.
      
      This patch also renames cpu_stop threads to from "stopper/%d" to
      "migration/%d".  The names of these threads ultimately don't matter
      and there's no reason to make unnecessary userland visible changes.
      
      With this patch applied, stop_machine() and sched now share the same
      resources.  stop_machine() is faster without wasting any resources and
      sched migration users are much cleaner.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Josh Triplett <josh@freedesktop.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      969c7921
    • Tejun Heo's avatar
      stop_machine: reimplement using cpu_stop · 3fc1f1e2
      Tejun Heo authored
      Reimplement stop_machine using cpu_stop.  As cpu stoppers are
      guaranteed to be available for all online cpus,
      stop_machine_create/destroy() are no longer necessary and removed.
      
      With resource management and synchronization handled by cpu_stop, the
      new implementation is much simpler.  Asking the cpu_stop to execute
      the stop_cpu() state machine on all online cpus with cpu hotplug
      disabled is enough.
      
      stop_machine itself doesn't need to manage any global resources
      anymore, so all per-instance information is rolled into struct
      stop_machine_data and the mutex and all static data variables are
      removed.
      
      The previous implementation created and destroyed RT workqueues as
      necessary which made stop_machine() calls highly expensive on very
      large machines.  According to Dimitri Sivanich, preventing the dynamic
      creation/destruction makes booting faster more than twice on very
      large machines.  cpu_stop resources are preallocated for all online
      cpus and should have the same effect.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      3fc1f1e2
    • Tejun Heo's avatar
      cpu_stop: implement stop_cpu[s]() · 1142d810
      Tejun Heo authored
      Implement a simplistic per-cpu maximum priority cpu monopolization
      mechanism.  A non-sleeping callback can be scheduled to run on one or
      multiple cpus with maximum priority monopolozing those cpus.  This is
      primarily to replace and unify RT workqueue usage in stop_machine and
      scheduler migration_thread which currently is serving multiple
      purposes.
      
      Four functions are provided - stop_one_cpu(), stop_one_cpu_nowait(),
      stop_cpus() and try_stop_cpus().
      
      This is to allow clean sharing of resources among stop_cpu and all the
      migration thread users.  One stopper thread per cpu is created which
      is currently named "stopper/CPU".  This will eventually replace the
      migration thread and take on its name.
      
      * This facility was originally named cpuhog and lived in separate
        files but Peter Zijlstra nacked the name and thus got renamed to
        cpu_stop and moved into stop_machine.c.
      
      * Better reporting of preemption leak as per Peter's suggestion.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      1142d810
  2. 23 Apr, 2010 3 commits
    • Suresh Siddha's avatar
      sched: Fix select_idle_sibling() logic in select_task_rq_fair() · 99bd5e2f
      Suresh Siddha authored
      Issues in the current select_idle_sibling() logic in select_task_rq_fair()
      in the context of a task wake-up:
      
      a) Once we select the idle sibling, we use that domain (spanning the cpu that
         the task is currently woken-up and the idle sibling that we found) in our
         wake_affine() decisions. This domain is completely different from the
         domain(we are supposed to use) that spans the cpu that the task currently
         woken-up and the cpu where the task previously ran.
      
      b) We do select_idle_sibling() check only for the cpu that the task is
         currently woken-up on. If select_task_rq_fair() selects the previously run
         cpu for waking the task, doing a select_idle_sibling() check
         for that cpu also helps and we don't do this currently.
      
      c) In the scenarios where the cpu that the task is woken-up is busy but
         with its HT siblings are idle, we are selecting the task be woken-up
         on the idle HT sibling instead of a core that it previously ran
         and currently completely idle. i.e., we are not taking decisions based on
         wake_affine() but directly selecting an idle sibling that can cause
         an imbalance at the SMT/MC level which will be later corrected by the
         periodic load balancer.
      
      Fix this by first going through the load imbalance calculations using
      wake_affine() and once we make a decision of woken-up cpu vs previously-ran cpu,
      then choose a possible idle sibling for waking up the task on.
      Signed-off-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1270079265.7835.8.camel@sbs-t61.sc.intel.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      99bd5e2f
    • Peter Zijlstra's avatar
      sched: Pre-compute cpumask_weight(sched_domain_span(sd)) · 669c55e9
      Peter Zijlstra authored
      Dave reported that his large SPARC machines spend lots of time in
      hweight64(), try and optimize some of those needless cpumask_weight()
      invocations (esp. with the large offstack cpumasks these are very
      expensive indeed).
      Reported-by: default avatarDavid Miller <davem@davemloft.net>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      669c55e9
    • Peter Zijlstra's avatar
      sched: Cure load average vs NO_HZ woes · 74f5187a
      Peter Zijlstra authored
      Chase reported that due to us decrementing calc_load_task prematurely
      (before the next LOAD_FREQ sample), the load average could be scewed
      by as much as the number of CPUs in the machine.
      
      This patch, based on Chase's patch, cures the problem by keeping the
      delta of the CPU going into NO_HZ idle separately and folding that in
      on the next LOAD_FREQ update.
      
      This restores the balance and we get strict LOAD_FREQ period samples.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarChase Douglas <chase.douglas@canonical.com>
      LKML-Reference: <1271934490.1776.343.camel@laptop>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      74f5187a
  3. 15 Apr, 2010 2 commits
  4. 14 Apr, 2010 1 commit
  5. 13 Apr, 2010 30 commits