1. 27 Mar, 2015 13 commits
    • Preeti U Murthy's avatar
      sched: Improve load balancing in the presence of idle CPUs · d4573c3e
      Preeti U Murthy authored
      When a CPU is kicked to do nohz idle balancing, it wakes up to do load
      balancing on itself, followed by load balancing on behalf of idle CPUs.
      But it may end up with load after the load balancing attempt on itself.
      This aborts nohz idle balancing. As a result several idle CPUs are left
      without tasks till such a time that an ILB CPU finds it unfavorable to
      pull tasks upon itself. This delays spreading of load across idle CPUs
      and worse, clutters only a few CPUs with tasks.
      
      The effect of the above problem was observed on an SMT8 POWER server
      with 2 levels of numa domains. Busy loops equal to number of cores were
      spawned. Since load balancing on fork/exec is discouraged across numa
      domains, all busy loops would start on one of the numa domains. However
      it was expected that eventually one busy loop would run per core across
      all domains due to nohz idle load balancing. But it was observed that it
      took as long as 10 seconds to spread the load across numa domains.
      
      Further investigation showed that this was a consequence of the
      following:
      
       1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
          load balancing [Given the experiment, upto 6 CPUs per core could be
          potentially idle in this domain.]
      
       2. However the ILB CPU would call load_balance() on itself before
          initiating nohz idle load balancing.
      
       3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
          tasks from its sibling cores to even out load.
      
       4. Now that the ILB CPU was no longer idle, it would abort nohz idle
          load balancing
      
      As a result the opportunities to spread load across numa domains were
      lost until such a time that the cores within the first numa domain had
      equal number of tasks among themselves.  This is a pretty bad scenario,
      since the cores within the first numa domain would have as many as 4
      tasks each, while cores in the neighbouring numa domains would all
      remain idle.
      
      Fix this, by checking if a CPU was woken up to do nohz idle load
      balancing, before it does load balancing upon itself. This way we allow
      idle CPUs across the system to do load balancing which results in
      quicker spread of load, instead of performing load balancing within the
      local sched domain hierarchy of the ILB CPU alone under circumstances
      such as above.
      Signed-off-by: default avatarPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarJason Low <jason.low2@hp.com>
      Cc: benh@kernel.crashing.org
      Cc: daniel.lezcano@linaro.org
      Cc: efault@gmx.de
      Cc: iamjoonsoo.kim@lge.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: riel@redhat.com
      Cc: srikar@linux.vnet.ibm.com
      Cc: svaidy@linux.vnet.ibm.com
      Cc: tim.c.chen@linux.intel.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      d4573c3e
    • Peter Zijlstra's avatar
      sched: Optimize freq invariant accounting · dfbca41f
      Peter Zijlstra authored
      Currently the freq invariant accounting (in
      __update_entity_runnable_avg() and sched_rt_avg_update()) get the
      scale factor from a weak function call, this means that even for archs
      that default on their implementation the compiler cannot see into this
      function and optimize the extra scaling math away.
      
      This is sad, esp. since its a 64-bit multiplication which can be quite
      costly on some platforms.
      
      So replace the weak function with #ifdef and __always_inline goo. This
      is not quite as nice from an arch support PoV but should at least
      result in compile time errors if done wrong.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      dfbca41f
    • Vincent Guittot's avatar
      sched: Move CFS tasks to CPUs with higher capacity · 1aaf90a4
      Vincent Guittot authored
      When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
      capacity for CFS tasks can be significantly reduced. Once we detect such
      situation by comparing cpu_capacity_orig and cpu_capacity, we trig an idle
      load balance to check if it's worth moving its tasks on an idle CPU.
      
      It's worth trying to move the task before the CPU is fully utilized to
      minimize the preemption by irq or RT tasks.
      
      Once the idle load_balance has selected the busiest CPU, it will look for an
      active load balance for only two cases:
      
        - There is only 1 task on the busiest CPU.
      
        - We haven't been able to move a task of the busiest rq.
      
      A CPU with a reduced capacity is included in the 1st case, and it's worth to
      actively migrate its task if the idle CPU has got more available capacity for
      CFS tasks. This test has been added in need_active_balance.
      
      As a sidenote, this will not generate more spurious ilb because we already
      trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
      has a task, we will trig the ilb once for migrating the task.
      
      The nohz_kick_needed function has been cleaned up a bit while adding the new
      test
      
      env.src_cpu and env.src_rq must be set unconditionnally because they are used
      in need_active_balance which is called even if busiest->nr_running equals 1
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1aaf90a4
    • Vincent Guittot's avatar
      sched: Add SD_PREFER_SIBLING for SMT level · caff37ef
      Vincent Guittot authored
      Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
      the scheduler will place at least one task per core.
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarPreeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      caff37ef
    • Vincent Guittot's avatar
      sched: Remove unused struct sched_group_capacity::capacity_orig · dc7ff76e
      Vincent Guittot authored
      The 'struct sched_group_capacity::capacity_orig' field is no longer used
      in the scheduler so we can remove it.
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      dc7ff76e
    • Vincent Guittot's avatar
      sched: Replace capacity_factor by usage · ea67821b
      Vincent Guittot authored
      The scheduler tries to compute how many tasks a group of CPUs can handle by
      assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
      SCHED_CAPACITY_SCALE.
      
      'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group
      by SCHED_LOAD_SCALE to estimate how many task can run in the group. Then, it
      compares this value with the sum of nr_running to decide if the group is
      overloaded or not.
      
      But the 'group_capacity_factor' concept is hardly working for SMT systems, it
      sometimes works for big cores but fails to do the right thing for little cores.
      
      Below are two examples to illustrate the problem that this patch solves:
      
      1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
         (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2
         (div_round_closest(3x640/1024) = 2) which means that it will be seen as
         overloaded even if we have only one task per CPU.
      
      2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
         (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
         (at max and thanks to the fix [0] for SMT system that prevent the apparition
         of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
         reduced to nearly nothing), the capacity factor of the group will still be 4
         (div_round_closest(3*1512/1024) = 5 which is cap to 4 with [0]).
      
      So, this patch tries to solve this issue by removing capacity_factor and
      replacing it with the 2 following metrics:
      
        - The available CPU's capacity for CFS tasks which is already used by
          load_balance().
      
        - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
          has been re-introduced to compute the usage of a CPU by CFS tasks.
      
      'group_capacity_factor' and 'group_has_free_capacity' has been removed and replaced
      by 'group_no_capacity'. We compare the number of task with the number of CPUs and
      we evaluate the level of utilization of the CPUs to define if a group is
      overloaded or if a group has capacity to handle more tasks.
      
      For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
      so it will be selected in priority (among the overloaded groups). Since [1],
      SD_PREFER_SIBLING is no more concerned by the computation of 'load_above_capacity'
      because local is not overloaded.
      
      [1] 9a5d9ba6 ("sched/fair: Allow calculate_imbalance() to move idle cpus")
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
      [ Tidied up the changelog. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ea67821b
    • Vincent Guittot's avatar
      sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage · 8bb5b00c
      Vincent Guittot authored
      Monitor the usage level of each group of each sched_domain level. The usage is
      the portion of cpu_capacity_orig that is currently used on a CPU or group of
      CPUs. We use the utilization_load_avg to evaluate the usage level of each
      group.
      
      The utilization_load_avg only takes into account the running time of the CFS
      tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
      utilized. Nevertheless, we must cap utilization_load_avg which can be
      temporally greater than SCHED_LOAD_SCALE after the migration of a task on this
      CPU and until the metrics are stabilized.
      
      The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
      running load on the CPU whereas the available capacity for the CFS task is in
      the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
      by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
      of the CPU to get the usage of the latter. The usage can then be compared with
      the available capacity (ie cpu_capacity) to deduct the usage level of a CPU.
      
      The frequency scaling invariance of the usage is not taken into account in this
      patch, it will be solved in another patch which will deal with frequency
      scaling invariance on the utilization_load_avg.
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8bb5b00c
    • Vincent Guittot's avatar
      sched: Add struct rq::cpu_capacity_orig · ca6d75e6
      Vincent Guittot authored
      This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
      before being altered by rt tasks and/or IRQ
      
      The cpu_capacity_orig will be used:
      
        - to detect when the capacity of a CPU has been noticeably reduced so we can
          trig load balance to look for a CPU with better capacity. As an example, we
          can detect when a CPU handles a significant amount of irq
          (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
          scheduler whereas CPUs, which are really idle, are available.
      
        - evaluate the available capacity for CFS tasks
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Acked-by: default avatarMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ca6d75e6
    • Vincent Guittot's avatar
      sched: Make scale_rt invariant with frequency · b5b4860d
      Vincent Guittot authored
      The average running time of RT tasks is used to estimate the remaining compute
      capacity for CFS tasks. This remaining capacity is the original capacity scaled
      down by a factor (aka scale_rt_capacity). This estimation of available capacity
      must also be invariant with frequency scaling.
      
      A frequency scaling factor is applied on the running time of the RT tasks for
      computing scale_rt_capacity.
      
      In sched_rt_avg_update(), we now scale the RT execution time like below:
      
        rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
      
      Then, scale_rt_capacity can be summarized by:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total
      
      with available = total - rq->rt_avg
      
      This has been been optimized in current code by:
      
        scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)
      
      But we can also developed the equation like below:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)
      
      and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in
      the computation of rq->rt_avg and scale_rt_capacity().
      
      so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
      and
      scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)
      
      arch_scale_frequency_capacity() will be called in the hot path of the scheduler
      which implies to have a short and efficient function.
      
      As an example, arch_scale_frequency_capacity() should return a cached value that
      is updated periodically outside of the hot path.
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      b5b4860d
    • Morten Rasmussen's avatar
      sched: Make sched entity usage tracking scale-invariant · 0c1dc6b2
      Morten Rasmussen authored
      Apply frequency scale-invariance correction factor to usage tracking.
      
      Each segment of the running_avg_sum geometric series is now scaled by the
      current frequency so the utilization_avg_contrib of each entity will be
      invariant with frequency scaling.
      
      As a result, utilization_load_avg which is the sum of utilization_avg_contrib,
      becomes invariant too. So the usage level that is returned by get_cpu_usage(),
      stays relative to the max frequency as the cpu_capacity which is is compared against.
      
      Then, we want the keep the load tracking values in a 32-bit type, which implies
      that the max value of {runnable|running}_avg_sum must be lower than
      2^32/88761=48388 (88761 is the max weigth of a task). As LOAD_AVG_MAX = 47742,
      arch_scale_freq_capacity() must return a value less than
      (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024).
      So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow.
      Signed-off-by: default avatarMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0c1dc6b2
    • Vincent Guittot's avatar
      sched: Remove frequency scaling from cpu_capacity · a8faa8f5
      Vincent Guittot authored
      Now that arch_scale_cpu_capacity has been introduced to scale the original
      capacity, the arch_scale_freq_capacity is no longer used (it was
      previously used by ARM arch).
      
      Remove arch_scale_freq_capacity from the computation of cpu_capacity.
      The frequency invariance will be handled in the load tracking and not in
      the CPU capacity. arch_scale_freq_capacity will be revisited for scaling
      load with the current frequency of the CPUs in a later patch.
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-4-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a8faa8f5
    • Morten Rasmussen's avatar
      sched: Track group sched_entity usage contributions · 21f44866
      Morten Rasmussen authored
      Add usage contribution tracking for group entities. Unlike
      se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
      entities is the sum of se->avg.utilization_avg_contrib for all entities on the
      group runqueue.
      
      It is _not_ influenced in any way by the task group h_load. Hence it is
      representing the actual cpu usage of the group, not its intended load
      contribution which may differ significantly from the utilization on
      lightly utilized systems.
      Signed-off-by: default avatarMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      21f44866
    • Vincent Guittot's avatar
      sched: Add sched_avg::utilization_avg_contrib · 36ee28e4
      Vincent Guittot authored
      Add new statistics which reflect the average time a task is running on the CPU
      and the sum of these running time of the tasks on a runqueue. The latter is
      named utilization_load_avg.
      
      This patch is based on the usage metric that was proposed in the 1st
      versions of the per-entity load tracking patchset by Paul Turner
      <pjt@google.com> but that has be removed afterwards. This version differs from
      the original one in the sense that it's not linked to task_group.
      
      The rq's utilization_load_avg will be used to check if a rq is overloaded or
      not instead of trying to compute how many tasks a group of CPUs can handle.
      
      Rename runnable_avg_period into avg_period as it is now used with both
      runnable_avg_sum and running_avg_sum.
      
      Add some descriptions of the variables to explain their differences.
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      36ee28e4
  2. 23 Mar, 2015 4 commits
    • Steven Rostedt's avatar
      sched/rt: Use IPI to trigger RT task push migration instead of pulling · b6366f04
      Steven Rostedt authored
      When debugging the latencies on a 40 core box, where we hit 300 to
      500 microsecond latencies, I found there was a huge contention on the
      runqueue locks.
      
      Investigating it further, running ftrace, I found that it was due to
      the pulling of RT tasks.
      
      The test that was run was the following:
      
       cyclictest --numa -p95 -m -d0 -i100
      
      This created a thread on each CPU, that would set its wakeup in iterations
      of 100 microseconds. The -d0 means that all the threads had the same
      interval (100us). Each thread sleeps for 100us and wakes up and measures
      its latencies.
      
      cyclictest is maintained at:
       git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
      
      What happened was another RT task would be scheduled on one of the CPUs
      that was running our test, when the other CPU tests went to sleep and
      scheduled idle. This caused the "pull" operation to execute on all
      these CPUs. Each one of these saw the RT task that was overloaded on
      the CPU of the test that was still running, and each one tried
      to grab that task in a thundering herd way.
      
      To grab the task, each thread would do a double rq lock grab, grabbing
      its own lock as well as the rq of the overloaded CPU. As the sched
      domains on this box was rather flat for its size, I saw up to 12 CPUs
      block on this lock at once. This caused a ripple affect with the
      rq locks especially since the taking was done via a double rq lock, which
      means that several of the CPUs had their own rq locks held while trying
      to take this rq lock. As these locks were blocked, any wakeups or load
      balanceing on these CPUs would also block on these locks, and the wait
      time escalated.
      
      I've tried various methods to lessen the load, but things like an
      atomic counter to only let one CPU grab the task wont work, because
      the task may have a limited affinity, and we may pick the wrong
      CPU to take that lock and do the pull, to only find out that the
      CPU we picked isn't in the task's affinity.
      
      Instead of doing the PULL, I now have the CPUs that want the pull to
      send over an IPI to the overloaded CPU, and let that CPU pick what
      CPU to push the task to. No more need to grab the rq lock, and the
      push/pull algorithm still works fine.
      
      With this patch, the latency dropped to just 150us over a 20 hour run.
      Without the patch, the huge latencies would trigger in seconds.
      
      I've created a new sched feature called RT_PUSH_IPI, which is enabled
      by default.
      
      When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
      and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
      is enabled, the IPI is sent to the overloaded CPU to do a push.
      
      To enabled or disable this at run time:
      
       # mount -t debugfs nodev /sys/kernel/debug
       # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
      or
       # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
      
      Update: This original patch would send an IPI to all CPUs in the RT overload
      list. But that could theoretically cause the reverse issue. That is, there
      could be lots of overloaded RT queues and one CPU lowers its priority. It would
      then send an IPI to all the overloaded RT queues and they could then all try
      to grab the rq lock of the CPU lowering its priority, and then we have the
      same problem.
      
      The latest design sends out only one IPI to the first overloaded CPU. It tries to
      push any tasks that it can, and then looks for the next overloaded CPU that can
      push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
      tasks that have priorities greater than the source CPU are covered. In case the
      source CPU lowers its priority again, a flag is set to tell the IPI traversal to
      restart with the first RT overloaded CPU after the source CPU.
      Parts-suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.homeSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      b6366f04
    • Steven Rostedt's avatar
      irq_work: Fix build failure when CONFIG_IRQ_WORK is not defined · 71ad00d6
      Steven Rostedt authored
      When CONFIG_IRQ_WORK is not defined (difficult to do, as it also
      requires CONFIG_PRINTK not to be defined), we get a build failure:
      
      	kernel/built-in.o: In function `flush_smp_call_function_queue':
      	kernel/smp.c:263: undefined reference to `irq_work_run'
      	kernel/smp.c:263: undefined reference to `irq_work_run'
      	Makefile:933: recipe for target 'vmlinux' failed
      
      Simplest thing to do is to make irq_work_run() a nop when not set.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/20150319101851.4d224d9b@gandalf.local.homeSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      71ad00d6
    • Ingo Molnar's avatar
    • Brian Silverman's avatar
      sched: Fix RLIMIT_RTTIME when PI-boosting to RT · 746db944
      Brian Silverman authored
      When non-realtime tasks get priority-inheritance boosted to a realtime
      scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
      counter used for checking this (the same one used for SCHED_RR
      timeslices) was not getting reset. This meant that tasks running with a
      non-realtime scheduling class which are repeatedly boosted to a realtime
      one, but never block while they are running realtime, eventually hit the
      timeout without ever running for a time over the limit. This patch
      resets the realtime timeslice counter when un-PI-boosting from an RT to
      a non-RT scheduling class.
      
      I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
      mutex which induces priority boosting and spins while boosted that gets
      killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
      applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
      does happen eventually with PREEMPT_VOLUNTARY kernels.
      Signed-off-by: default avatarBrian Silverman <brian@peloton-tech.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: austin@peloton-tech.com
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      746db944
  3. 22 Mar, 2015 7 commits
  4. 21 Mar, 2015 11 commits
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma · f8975224
      Linus Torvalds authored
      Pull slave dmaengine fixes from Vinod Koul:
       "Four fixes for dw, pl08x, imx-sdma and at_hdmac driver.  Nothing
        unusual here, simple fixes to these drivers"
      
      * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
        dmaengine: pl08x: Define capabilities for generic capabilities reporting
        dmaengine: dw: append MODULE_ALIAS for platform driver
        dmaengine: imx-sdma: switch to dynamic context mode after script loaded
        dmaengine: at_hdmac: Fix calculation of the residual bytes
      f8975224
    • Linus Torvalds's avatar
      Merge tag 'pm+acpi-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 3d7a6db5
      Linus Torvalds authored
      Pull power management and ACPI fixes from Rafael Wysocki:
       "These are fixes for recent regressions (PCI/ACPI resources and at91
        RTC locking), a stable-candidate powercap RAPL driver fix and two ARM
        cpuidle fixes (one stable-candidate too).
      
        Specifics:
      
         - Revert a recent PCI commit related to IRQ resources management that
           introduced a regression for drivers attempting to bind to devices
           whose previous drivers did not balance pci_enable_device() and
           pci_disable_device() as expected (Rafael J Wysocki).
      
         - Fix a deadlock in at91_rtc_interrupt() introduced by a typo in a
           recent commit related to wakeup interrupt handling (Dan Carpenter).
      
         - Allow the power capping RAPL (Running-Average Power Limit) driver
           to use different energy units for domains within one CPU package
           which is necessary to handle Intel Haswell EP processors correctly
           (Jacob Pan).
      
         - Improve the cpuidle mvebu driver's handling of Armada XP SoCs by
           updating the target residency and exit latency numbers for those
           chips (Sebastien Rannou).
      
         - Prevent the cpuidle mvebu driver from calling cpu_pm_enter() twice
           in a row before cpu_pm_exit() is called on the same CPU which
           breaks the core's assumptions regarding the usage of those
           functions (Gregory Clement)"
      
      * tag 'pm+acpi-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        Revert "x86/PCI: Refine the way to release PCI IRQ resources"
        rtc: at91rm9200: double locking bug in at91_rtc_interrupt()
        powercap / RAPL: handle domains with different energy units
        cpuidle: mvebu: Update cpuidle thresholds for Armada XP SOCs
        cpuidle: mvebu: Fix the CPU PM notifier usage
      3d7a6db5
    • Linus Torvalds's avatar
      Merge git://people.freedesktop.org/~airlied/linux · 97448d5b
      Linus Torvalds authored
      Pull drm updates from Dave Airlie:
       "A bunch of fixes across drivers:
      
        radeon:
           disable two ended allocation for now, it breaks some stuff
      
        amdkfd:
           misc fixes
      
        nouveau:
           fix irq loop problem, add basic support for GM206 (new hw)
      
        i915:
           fix some WARNs people were seeing
      
        exynos:
           fix some iommu interactions causing boot failures"
      
      * git://people.freedesktop.org/~airlied/linux:
        drm/radeon: drop ttm two ended allocation
        drm/exynos: fix the initialization order in FIMD
        drm/exynos: fix typo config name correctly.
        drm/exynos: Check for NULL dereference of crtc
        drm/exynos: IS_ERR() vs NULL bug
        drm/exynos: remove unused files
        drm/i915: Make sure the primary plane is enabled before reading out the fb state
        drm/nouveau/bios: fix i2c table parsing for dcb 4.1
        drm/nouveau/device/gm100: Basic GM206 bring up (as copy of GM204)
        drm/nouveau/device: post write to NV_PMC_BOOT_1 when flipping endian switch
        drm/nouveau/gr/gf100: fix some accidental or'ing of buffer addresses
        drm/nouveau/fifo/nv04: remove the loop from the interrupt handler
        drm/radeon: Changing number of compute pipe lines
        drm/amdkfd: Fix SDMA queue init. in non-HWS mode
        drm/amdkfd: destroy mqd when destroying kernel queue
        drm/i915: Ensure plane->state->fb stays in sync with plane->fb
      97448d5b
    • Linus Torvalds's avatar
      Merge tag 'devicetree-fixes-for-4.0-part2' of... · bb8ef2fb
      Linus Torvalds authored
      Merge tag 'devicetree-fixes-for-4.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
      
      Pull more DeviceTree fixes vfom Rob Herring:
      
       - revert setting stdout-path as preferred console.  This caused
         regressions in PowerMACs and other systems.
      
       - yet another fix for stdout-path option parsing.
      
       - fix error path handling in of_irq_parse_one
      
      * tag 'devicetree-fixes-for-4.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
        Revert "of: Fix premature bootconsole disable with 'stdout-path'"
        of: handle both '/' and ':' in path strings
        of: unittest: Add option string test case with longer path
        of/irq: Fix of_irq_parse_one() returned error codes
      bb8ef2fb
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · e477f3e0
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "Here are current target-pending fixes for v4.0-rc5 code that have made
        their way into the queue over the last weeks.
      
        The fixes this round include:
      
         - Fix long-standing iser-target logout bug related to early
           conn_logout_comp completion, resulting in iscsi_conn use-after-tree
           OOpsen.  (Sagi + nab)
      
         - Fix long-standing tcm_fc bug in ft_invl_hw_context() failure
           handing for DDP hw offload.  (DanC)
      
         - Fix incorrect use of unprotected __transport_register_session() in
           tcm_qla2xxx + other single local se_node_acl fabrics.  (Bart)
      
         - Fix reference leak in target_submit_cmd() -> target_get_sess_cmd()
           for ack_kref=1 failure path.  (Bart)
      
         - Fix pSCSI backend ->get_device_type() statistics OOPs with
           un-configured device.  (Olaf + nab)
      
         - Fix virtual LUN=0 target_configure_device failure OOPs at modprobe
           time.  (Claudio + nab)
      
         - Fix FUA write false positive failure regression in v4.0-rc1 code.
           (Christophe Vu-Brugier + HCH)"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        target: do not reject FUA CDBs when write cache is enabled but emulate_write_cache is 0
        target: Fix virtual LUN=0 target_configure_device failure OOPs
        target/pscsi: Fix NULL pointer dereference in get_device_type
        tcm_fc: missing curly braces in ft_invl_hw_context()
        target: Fix reference leak in target_get_sess_cmd() error path
        loop/usb/vhost-scsi/xen-scsiback: Fix use of __transport_register_session
        tcm_qla2xxx: Fix incorrect use of __transport_register_session
        iscsi-target: Avoid early conn_logout_comp for iser connections
        Revert "iscsi-target: Avoid IN_LOGOUT failure case for iser-target"
        target: Disallow changing of WRITE cache/FUA attrs after export
      e477f3e0
    • Linus Torvalds's avatar
      Merge tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · da6b9a20
      Linus Torvalds authored
      Pull devicemapper fixes from Mike Snitzer:
       "A handful of stable fixes for DM:
         - fix thin target to always zero-fill reads to unprovisioned blocks
         - fix to interlock device destruction's suspend from internal
           suspends
         - fix 2 snapshot exception store handover bugs
         - fix dm-io to cope with DISCARD and WRITE_SAME capabilities changing"
      
      * tag 'dm-4.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm io: deal with wandering queue limits when handling REQ_DISCARD and REQ_WRITE_SAME
        dm snapshot: suspend merging snapshot when doing exception handover
        dm snapshot: suspend origin when doing exception handover
        dm: hold suspend_lock while suspending device during device deletion
        dm thin: fix to consistently zero-fill reads to unprovisioned blocks
      da6b9a20
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 521d4746
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "Most of these are fixing extent reservation accounting, or corners
        with tree writeback during commit.
      
        Josef's set does add a test, which isn't strictly a fix, but it'll
        keep us from making this same mistake again"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: fix outstanding_extents accounting in DIO
        Btrfs: add sanity test for outstanding_extents accounting
        Btrfs: just free dummy extent buffers
        Btrfs: account merges/splits properly
        Btrfs: prepare block group cache before writing
        Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
        Btrfs: account for the correct number of extents for delalloc reservations
        Btrfs: fix merge delalloc logic
        Btrfs: fix comp_oper to get right order
        Btrfs: catch transaction abortion after waiting for it
        btrfs: fix sizeof format specifier in btrfs_check_super_valid()
      521d4746
    • Linus Torvalds's avatar
      Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linux · 0d122f74
      Linus Torvalds authored
      Pull nfsd bufix from Bruce Fields:
       "This is a fix for a crash easily triggered by 4.1 activity to a server
        built with CONFIG_NFSD_PNFS.
      
        There are some more bugfixes queued up that I intend to pass along
        next week, but this is the most critical"
      
      * 'for-4.0' of git://linux-nfs.org/~bfields/linux:
        Subject: nfsd: don't recursively call nfsd4_cb_layout_fail
      0d122f74
    • Linus Torvalds's avatar
      Merge tag 'upstream-4.0-rc5' of git://git.infradead.org/linux-ubifs · c6ef8145
      Linus Torvalds authored
      Pull UBI fix from Artem Bityutskiy:
       "This fixes a bug introduced during the v4.0 merge window where we
        forgot to put braces where they should be"
      
      * tag 'upstream-4.0-rc5' of git://git.infradead.org/linux-ubifs:
        UBI: fix missing brace control flow
      c6ef8145
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 60ed380e
      Linus Torvalds authored
      Pull arm64 fixes from Catalin Marinas:
      
       - mm switching fix where the kernel pgd ends up in the user TTBR0 after
         returning from an EFI run-time services call
      
       - fix __GFP_ZERO handling for atomic pool and CMA DMA allocations (the
         generic code does get the gfp flags, so it's left with the arch code
         to memzero accordingly)
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Honor __GFP_ZERO in dma allocations
        arm64: efi: don't restore TTBR0 if active_mm points at init_mm
      60ed380e
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm · 62a202d7
      Linus Torvalds authored
      Pull ARM fixes from Russell King:
       "Another few ARM fixes.  Fabrice fixed the L2 cache DT parsing to allow
        prefetch configuration to be specified even when the cache size
        parsing fails.
      
        Laura noticed that the setting of page attributes wasn't working for
        modules due to is_module_addr() always returning false.
      
        Marc Gonzalez (aka Mason) noticed a potential latent bug with the way
        we read one of the CPUID registers (where we could attempt to read a
        non-present CPUID register which may fault.)
      
        I've fixed an issue where 32-bit DMA masks were failing with memory
        which extended to the top of physical address space, and I've also
        added debugging output of the page tables when we hit a data access
        exception which we don't specifically handle - prompted by the lack of
        information in a bug report"
      
      * 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
        ARM: 8313/1: Use read_cpuid_ext() macro instead of inline asm
        ARM: 8311/1: Don't use is_module_addr in setting page attributes
        ARM: 8310/1: l2c: Fix prefetch settings dt parsing
        ARM: dump pgd, pmd and pte states on unhandled data abort faults
        ARM: dma-api: fix off-by-one error in __dma_supported()
      62a202d7
  5. 20 Mar, 2015 5 commits