  1. 06 Mar, 2009 1 commit
  2. 05 Mar, 2009 1 commit
    • sched: don't rebalance if attached on NULL domain · 8a0be9ef
      Frederic Weisbecker authored
      Impact: fix function graph trace hang / drop pointless softirq on UP
      
      While debugging a function graph trace hang on an old PII, I saw
      that it spent most of its time in the timer interrupt, and that
      the domain rebalancing softirq was the biggest contributor.
      
      The timer interrupt calls trigger_load_balance(), which decides
      whether it is worth scheduling a rebalancing softirq.
      
      In the case of a UP kernel build, no problem arises because sched
      domains do not come into play at all.
      
      In the case of an SMP kernel build running on an SMP box, there is
      still no problem: the softirq is raised each time we reach the
      next_balance time.
      
      In the case of an SMP kernel build running on a UP box (most distros
      ship SMP kernels by default, whatever box you have), the CPU is
      attached to the NULL sched domain, so an unexpected behaviour
      results:
      
      trigger_load_balance() raises the rebalancing softirq; later, in
      softirq context, run_rebalance_domains() calls rebalance_domains(),
      where the for_each_domain(cpu, sd) loop is never entered because we
      are attached to the NULL domain. This means rq->next_balance is
      never updated. So on every subsequent timer tick we enter
      trigger_load_balance(), which always re-raises the rebalancing
      softirq:
      
      if (time_after_eq(jiffies, rq->next_balance))
      	raise_softirq(SCHED_SOFTIRQ);
      
      So for each tick, we process this pointless softirq.
      
      This patch fixes it by checking whether we are attached to the NULL
      domain before raising the softirq. Another possible fix would be
      to set rq->next_balance to the maximal possible jiffies value when
      we are attached to the NULL domain.
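
      As a minimal sketch of that check (illustrative kernel-style fragment,
      not the literal patch; on_null_domain() is the helper name assumed
      here):

      static inline int on_null_domain(int cpu)
      {
      	return !rcu_dereference(cpu_rq(cpu)->sd);
      }

      /* in trigger_load_balance(): skip the softirq on a NULL domain */
      if (time_after_eq(jiffies, rq->next_balance) &&
          likely(!on_null_domain(cpu)))
      	raise_softirq(SCHED_SOFTIRQ);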
      
      v2: build fix on UP
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <49af242d.1c07d00a.32d5.ffffc019@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 02 Mar, 2009 1 commit
  4. 27 Feb, 2009 1 commit
  5. 26 Feb, 2009 2 commits
  6. 16 Feb, 2009 1 commit
  7. 15 Feb, 2009 1 commit
  8. 12 Feb, 2009 1 commit
  9. 11 Feb, 2009 1 commit
  10. 05 Feb, 2009 1 commit
    • wait: prevent exclusive waiter starvation · 777c6c5f
      Johannes Weiner authored
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no one
      will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they were interrupted by other means than a wake
      up through the queue.
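
      A hedged sketch of what abort_exclusive_wait() looks like per the
      description above (not necessarily the exact upstream code; the locked
      wake-up helper and its arguments vary between kernel versions):

      void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
      			  unsigned int mode, void *key)
      {
      	unsigned long flags;

      	__set_current_state(TASK_RUNNING);
      	spin_lock_irqsave(&q->lock, flags);
      	if (!list_empty(&wait->task_list))
      		list_del_init(&wait->task_list);	/* still queued: just leave */
      	else if (waitqueue_active(q))
      		__wake_up_locked(q, mode);		/* already woken: pass it on */
      	spin_unlock_irqrestore(&q->lock, flags);
      }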
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 04 Feb, 2009 1 commit
    • sched: fix nohz load balancer on cpu offline · 483b4ee6
      Suresh Siddha authored
      Christian Borntraeger reports:
      
      > After a logical cpu offline, even on a completely idle system, there
      > is one cpu with full ticks. It turns out that nohz.cpu_mask still has
      > the offlined cpu set.
      >
      > In select_nohz_load_balancer() we check if the system is completely
      > idle to turn off load balancing. We compare cpu_online_map with
      > nohz.cpu_mask.  Since cpu_online_map is updated on cpu unplug,
      > but nohz.cpu_mask is not, the check fails and the scheduler believes
      > that we need an "idle load balancer" even on a fully idle system.
      > Since the ilb cpu does not deactivate the timer tick, this breaks NOHZ.
      
      Fix select_nohz_load_balancer() so that it does not set the cpu in
      nohz.cpu_mask while that cpu is going offline.
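
      Roughly, the idea is (hedged sketch, not the literal diff):

      /* select_nohz_load_balancer(), on the stop_tick path */
      if (!cpu_active(cpu)) {
      	/* going offline: never record this cpu in nohz.cpu_mask,
      	 * and give up the idle-load-balancer role if we hold it */
      	if (atomic_read(&nohz.load_balancer) == cpu)
      		atomic_cmpxchg(&nohz.load_balancer, cpu, -1);
      	return 0;
      }
      cpumask_set_cpu(cpu, nohz.cpu_mask);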
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 01 Feb, 2009 2 commits
  13. 15 Jan, 2009 3 commits
  14. 14 Jan, 2009 4 commits
  15. 12 Jan, 2009 1 commit
  16. 11 Jan, 2009 2 commits
  17. 07 Jan, 2009 1 commit
    • sched: fix possible recursive rq->lock · da8d5089
      Peter Zijlstra authored
      Vaidyanathan Srinivasan reported:
      
       > =============================================
       > [ INFO: possible recursive locking detected ]
       > 2.6.28-autotest-tip-sv #1
       > ---------------------------------------------
       > klogd/5062 is trying to acquire lock:
       >  (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
       >
       > but task is already holding lock:
       >  (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
      
      This happens with sched_mc set to 2 (it is off by default).
      
      Strictly speaking we will not deadlock, because ttwu will not be able to
      place the migration task on our rq; but since the code can deal with
      both rqs getting unlocked, this seems the easiest way out.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  18. 06 Jan, 2009 2 commits
  19. 05 Jan, 2009 2 commits
  20. 03 Jan, 2009 1 commit
    • sched: put back some stack hog changes that were undone in kernel/sched.c · 6ca09dfc
      Mike Travis authored
      Impact: prevents panic from stack overflow on numa-capable machines.
      
      Some of the "removal of stack hogs" changes in kernel/sched.c that used
      node_to_cpumask_ptr were undone by the early cpumask API updates,
      causing a panic due to stack overflow.  This patch restores those
      savings by using cpumask_of_node(), which returns a 'const struct cpumask *'.
      
      In addition, cpu_coregroup_map is replaced with cpu_coregroup_mask, further
      reducing stack usage.  (Both of these updates removed 9 FIXMEs!)
      
      Also:
         Pick up some remaining changes from the old 'cpumask_t' functions to
         the new 'struct cpumask *' functions.
      
         Optimize memory traffic by allocating each percpu local_cpu_mask on the
         same node as the referring cpu.
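
      For illustration, the kind of transformation the cpumask_of_node()
      change implies (sketch, not the actual diff): a cpumask_t on the stack
      is NR_CPUS bits, i.e. 512 bytes with NR_CPUS=4096, while
      cpumask_of_node() only hands back a pointer to an already-existing mask.

      /* before: copies a whole NR_CPUS-bit mask onto the stack */
      cpumask_t mask = node_to_cpumask(cpu_to_node(cpu));

      /* after: no stack allocation at all */
      const struct cpumask *mask = cpumask_of_node(cpu_to_node(cpu));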
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  21. 31 Dec, 2008 2 commits
    • [PATCH] idle cputime accounting · 79741dd3
      Martin Schwidefsky authored
      The cpu time spent by the idle process actually doing something is
      currently accounted as idle time.  This is plain wrong; the architectures
      that support VIRT_CPU_ACCOUNTING=y can do better and distinguish between
      the time spent doing nothing and the time spent by idle doing work.  The
      first is accounted with account_idle_time and the second with
      account_system_time.
      
      The architectures that use the account_xxx_time interface directly, and
      not the account_xxx_ticks interface, now need to do the check for the
      idle process in their arch code.  In particular, to improve the system vs
      true idle time accounting, the arch code needs to measure the true idle
      time instead of just testing for the idle process.
      
      To improve the tick-based accounting as well, we would need an
      architecture primitive that can tell us whether the pt_regs of the
      interrupted context points to the magic instruction that halts the cpu.
      
      In addition, idle time is no longer added to the stime of the idle process.
      This field now contains the system time of the idle process as it should
      be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
      every tick that occurs while idle is running will be accounted as idle
      time.
      
      This patch contains the necessary common code changes to be able to
      distinguish idle system time and true idle time. The architectures with
      support for VIRT_CPU_ACCOUNTING need some changes to exploit this.
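
      Roughly, the distinction an arch now has to make looks like this
      (hedged sketch; "doing_real_work" stands in for whatever measurement
      the architecture uses, and the exact signatures may differ):

      /* arch accounting hook with VIRT_CPU_ACCOUNTING=y */
      if (idle_task(smp_processor_id()) == current) {
      	if (doing_real_work)	/* e.g. irq/softirq work done while "idle" */
      		account_system_time(current, hardirq_offset,
      				    cputime, cputime_scaled);
      	else
      		account_idle_time(cputime);
      }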
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
    • [PATCH] fix scaled & unscaled cputime accounting · 457533a7
      Martin Schwidefsky authored
      The utimescaled / stimescaled fields in the task structure and the
      global cpustat should be set on all architectures.  On s390 the calls
      to account_user_time_scaled and account_system_time_scaled have never
      been added.  In addition, system time that is accounted as guest time
      to the user time of a process is accounted to the scaled system time
      instead of the scaled user time.
      
      To fix the bugs and to prevent future forgetfulness, this patch merges
      account_system_time_scaled into account_system_time and
      account_user_time_scaled into account_user_time.
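
      The merged interfaces implied by that description (prototypes hedged
      from the text; the scaled value simply becomes an extra parameter):

      void account_user_time(struct task_struct *p, cputime_t cputime,
      		       cputime_t cputime_scaled);
      void account_system_time(struct task_struct *p, int hardirq_offset,
      			 cputime_t cputime, cputime_t cputime_scaled);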
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  22. 29 Dec, 2008 5 commits
    • sched: create "pushable_tasks" list to limit pushing to one attempt · 917b627d
      Gregory Haskins authored
      The RT scheduler employs a "push/pull" design to actively balance tasks
      within the system (on a per disjoint cpuset basis).  When a task is
      awoken, it is immediately determined if there are any lower priority
      cpus which should be preempted.  This is opposed to the way normal
      SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
      operation to occur before spreading out load.
      
      When a particular RQ has more than 1 active RT task, it is said to
      be in an "overloaded" state.  Once this occurs, the system enters
      the active balancing mode, where it will try to push the task away,
      or persuade a different cpu to pull it over.  The system will stay
      in this state until it falls back to <= 1 queued RT task per RQ.
      
      However, the current implementation suffers from a limitation in the
      push logic.  Once overloaded, all tasks (other than current) on the
      RQ are analyzed on every push operation, even if it was previously
      unpushable (due to affinity, etc.).  What's more, the operation stops
      at the first task that is unpushable and will not look at items
      lower in the queue.  This causes two problems:
      
      1) We can have the same tasks analyzed over and over again during each
         push, which extends out the fast path in the scheduler for no
         gain.  Consider an RQ that has dozens of tasks that are bound to a
         core.  Each one of those tasks will be encountered and skipped
         for each push operation while they are queued.
      
      2) There may be lower-priority tasks under the unpushable task that
         could have been successfully pushed, but will never be considered
         until either the unpushable task is cleared, or a pull operation
         succeeds.  The net result is a potential latency source for
         mid-priority tasks.
      
      This patch aims to rectify these two conditions by introducing a new
      priority sorted list: "pushable_tasks".  A task is added to the list
      each time a task is activated or preempted.  It is removed from the
      list any time it is deactivated, made current, or fails to push.
      
      This works because a task only needs to be attempted to push once.
      After an initial failure to push, the other cpus will eventually try to
      pull the task when the conditions are proper.  This also solves the
      problem that we don't completely analyze all tasks due to encountering
      an unpushable task.  Now every task will have a push attempted (when
      appropriate).
      
      This reduces latency both by shortening the critical section of the
      rq->lock for certain workloads, and by making sure the algorithm
      considers all eligible tasks in the system.
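
      A hedged sketch of the bookkeeping described above, assuming the list is
      a priority-sorted plist hanging off the runqueue (names approximate):

      /* rq->rt.pushable_tasks: plist ordered by task priority */
      static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
      {
      	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
      	plist_node_init(&p->pushable_tasks, p->prio);
      	plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);
      }

      static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
      {
      	plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
      }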
      
      [ rostedt: added a couple more BUG_ONs ]
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
    • sched: add sched_class->needs_post_schedule() member · 967fc046
      Gregory Haskins authored
      We currently run class->post_schedule() outside of the rq->lock, which
      means that we need to test for the need to post_schedule outside of
      the lock to avoid a forced reacquisition.  This is currently not a problem
      as we only look at rq->rt.overloaded.  However, we want to enhance this
      going forward to look at more state to reduce the need to post_schedule to
      a bare minimum set.  Therefore, we introduce a new member function called
      needs_post_schedule() which tests for the post_schedule condition without
      actually performing the work.  Therefore it is safe to call this
      function before the rq->lock is released, because we are guaranteed not
      to drop the lock at an intermediate point (such as what post_schedule()
      may do).
      
      We will use this later in the series.
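
      A sketch of the hook, hedged from the description above (the rt-class
      implementation shown is just the existing rq->rt.overloaded test):

      /* new sched_class member: cheap test, must not drop rq->lock */
      int (*needs_post_schedule)(struct rq *rq);

      /* sched_rt.c */
      static int needs_post_schedule_rt(struct rq *rq)
      {
      	return rq->rt.overloaded;
      }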
      
      [ rostedt: removed paranoid BUG_ON ]
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
    • sched: make double-lock-balance fair · 8f45e2b5
      Gregory Haskins authored
      double_lock_balance() currently favors logically lower cpus since they
      often do not have to release their own lock to acquire a second lock.
      The result is that logically higher cpus can get starved when there is
      a lot of pressure on the RQs.  This can result in higher latencies on
      higher cpu-ids.
      
      This patch makes the algorithm more fair by forcing all paths to have
      to release both locks before acquiring them again.  Since callsites to
      double_lock_balance already consider it a potential preemption/reschedule
      point, they have the proper logic to recheck for atomicity violations.
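
      A hedged sketch of the fair variant described: unconditionally drop the
      held lock, then take both runqueue locks in a fixed order, so logically
      lower cpus lose their built-in advantage.

      static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
      {
      	spin_unlock(&this_rq->lock);
      	double_rq_lock(this_rq, busiest);	/* always locks both, in order */

      	return 1;	/* this_rq->lock was dropped: caller must recheck state */
      }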
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
    • sched: pull only one task during NEWIDLE balancing to limit critical section · 7e96fa58
      Gregory Haskins authored
      git-id c4acb2c0 attempted to limit
      newidle critical section length by stopping after at least one task
      was moved.  Further investigation has shown that there are other
      paths nested deeper inside the algorithm that still allow long
      latencies to occur with newidle balancing.  This patch applies
      the same technique inside balance_tasks() to limit the duration of
      this optional balancing operation.
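
      Roughly, the placement is (hedged sketch of the hunk, not the exact
      patch):

      /* in balance_tasks(), right after a task has been moved */
      #ifdef CONFIG_PREEMPT
      	/*
      	 * NEWIDLE balancing runs with rq locks held; stop after one task
      	 * so a preemptible kernel keeps this critical section short.
      	 */
      	if (idle == CPU_NEWLY_IDLE)
      		break;
      #endif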
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      CC: Nick Piggin <npiggin@suse.de>
    • sched: track the next-highest priority on each runqueue · e864c499
      Gregory Haskins authored
      We will use this later in the series to reduce the amount of rq-lock
      contention during a pull operation.
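
      Presumably something along these lines (hedged sketch): keep both the
      highest and the next-highest queued RT priority per rt runqueue, so a
      pulling cpu can cheaply decide whether a remote rq is worth locking.

      struct rt_rq {
      	/* ... */
      	struct {
      		int curr;	/* highest queued rt task prio */
      		int next;	/* next highest */
      	} highest_prio;
      	/* ... */
      };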
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  23. 26 Dec, 2008 1 commit
  24. 25 Dec, 2008 1 commit
  25. 23 Dec, 2008 1 commit