1. 30 Mar, 2017 2 commits
    • Yuyang Du's avatar
      sched/fair: Optimize ___update_sched_avg() · a481db34
      Yuyang Du authored
      The main PELT function ___update_load_avg(), which implements the
      accumulation and progression of the geometric average series, is
      implemented along the following lines for the scenario where the time
      delta spans all 3 possible sections (see figure below):
      
        1. add the remainder of the last incomplete period
        2. decay old sum
        3. accumulate new sum in full periods since last_update_time
        4. accumulate the current incomplete period
        5. update averages
      
      Or:
      
                  d1          d2           d3
                  ^           ^            ^
                  |           |            |
                |<->|<----------------->|<--->|
        ... |---x---|------| ... |------|-----x (now)
      
        load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +	(1,2)
      
                                              p
      	      weight * scale * 1024 * \Sum y^n +		(3)
                                             n=1
      
      	      weight * scale * d3 * y^0				(4)
      
        load_avg' = load_sum' / LOAD_AVG_MAX				(5)
      
      Where:
      
       d1 - is the delta part completing the remainder of the last
            incomplete period,
       d2 - is the delta part spannind complete periods, and
       d3 - is the delta part starting the current incomplete period.
      
      We can simplify the code in two steps; the first step is to separate
      the first term into new and old parts like:
      
        (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
      					       weight * scale * d1 * y^(p+1)
      
      Once we've done that, its easy to see that all new terms carry the
      common factors:
      
        weight * scale
      
      If we factor those out, we arrive at the form:
      
        load_sum' = load_sum * y^(p+1) +
      
      	      weight * scale * (d1 * y^(p+1) +
      
      					 p
      			        1024 * \Sum y^n +
      					n=1
      
      				d3 * y^0)
      
      Which results in a simpler, smaller and faster implementation.
      Signed-off-by: default avatarYuyang Du <yuyang.du@intel.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: matt@codeblueprint.co.uk
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a481db34
    • Peter Zijlstra's avatar
      sched/fair: Explicitly generate __update_load_avg() instances · 0ccb977f
      Peter Zijlstra authored
      The __update_load_avg() function is an __always_inline because its
      used with constant propagation to generate different variants of the
      code without having to duplicate it (which would be prone to bugs).
      
      Explicitly instantiate the 3 variants.
      
      Note that most of this is called from rather hot paths, so reducing
      branches is good.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0ccb977f
  2. 27 Mar, 2017 1 commit
    • Srikar Dronamraju's avatar
      sched/fair: Prefer sibiling only if local group is under-utilized · 05b40e05
      Srikar Dronamraju authored
      If the child domain prefers tasks to go siblings, the local group could
      end up pulling tasks to itself even if the local group is almost equally
      loaded as the source group.
      
      Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload.
      Everytime, local group has capacity and source group has atleast 2 threads,
      local group tries to pull the task. This causes the threads to constantly
      move between different cores. This is even more profound if the cores have
      more threads, like in Power 8, smt 8 mode.
      
      Fix this by only allowing local group to pull a task, if the source group
      has more number of tasks than the local group.
      
      Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.
      
      Without patch:
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
                     366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
                   3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
                   3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
                   5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
                   2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
                  10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )
      
      With patch:
      
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
                     123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
                   3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
                     189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
                   5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
                     506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
                  10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )
      
      Which show that in these workloads CPU migrations get reduced significantly.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      05b40e05
  3. 23 Mar, 2017 2 commits
  4. 16 Mar, 2017 16 commits
    • Peter Zijlstra's avatar
      sched/core: Avoid double update_rq_clock() in move_queued_task() · 15ff991e
      Peter Zijlstra authored
      Address this case:
      
        WARNING: CPU: 0 PID: 2070 at ../kernel/sched/core.c:109 update_rq_clock+0x74/0x80
        rq->clock_update_flags & RQCF_UPDATED
      
        Call Trace:
        update_rq_clock()
        move_queued_task()
        __set_cpus_allowed_ptr()
        ...
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      15ff991e
    • Peter Zijlstra's avatar
      sched/core: Fix double update_rq_clock) calls in attach_task()/detach_task() · 5704ac0a
      Peter Zijlstra authored
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5704ac0a
    • Peter Zijlstra's avatar
      sched/core: Avoid obvious double update_rq_clock() · 7a57f32a
      Peter Zijlstra authored
      Add DEQUEUE_NOCLOCK to all places where we just did an
      update_rq_clock() already.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7a57f32a
    • Peter Zijlstra's avatar
      sched/core: Simplify update_rq_clock() in __schedule() · bce4dc80
      Peter Zijlstra authored
      Instead of relying on deactivate_task() to call update_rq_clock() and
      handling the case where it didn't happen (task_on_rq_queued),
      unconditionally do update_rq_clock() and skip any further updates.
      
      This also avoids a double update on deactivate_task() + ttwu_local().
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      bce4dc80
    • Peter Zijlstra's avatar
      sched/core: Make sched_ttwu_pending() atomic in time · 77558e4d
      Peter Zijlstra authored
      Since all tasks on the wake_list are woken under a single rq->lock
      avoid calling update_rq_clock() for each task.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      77558e4d
    • Peter Zijlstra's avatar
      sched/core: Add ENQUEUE_NOCLOCK to ENQUEUE_RESTORE · 7134b3e9
      Peter Zijlstra authored
      In all cases, ENQUEUE_RESTORE should also have ENQUEUE_NOCLOCK because
      DEQUEUE_SAVE will have done an update_rq_clock().
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7134b3e9
    • Peter Zijlstra's avatar
      sched/core: Add {EN,DE}QUEUE_NOCLOCK flags · 0a67d1ee
      Peter Zijlstra authored
      Currently {en,de}queue_task() do an unconditional update_rq_clock().
      However since we want to avoid duplicate updates, so that each
      rq->lock section appears atomic in time, we need to be able to skip
      these clock updates.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0a67d1ee
    • Peter Zijlstra's avatar
      sched/core: Add rq->lock wrappers · 8a8c69c3
      Peter Zijlstra authored
      The missing update_rq_clock() check can work with partial rq->lock
      wrappery, since a missing wrapper can cause the warning to not be
      emitted when it should have, but cannot cause the warning to trigger
      when it should not have.
      
      The duplicate update_rq_clock() check however can cause false warnings
      to trigger. Therefore add more comprehensive rq->lock wrappery.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8a8c69c3
    • Peter Zijlstra's avatar
      sched/core: Add WARNING for multiple update_rq_clock() calls · 26ae58d2
      Peter Zijlstra authored
      Now that we have no missing calls, add a warning to find multiple
      calls.
      
      By having only a single update_rq_clock() call per rq-lock section,
      the section appears 'atomic' wrt time.
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      26ae58d2
    • Steven Rostedt (VMware)'s avatar
      sched/rt: Add comments describing the RT IPI pull method · 3e777f99
      Steven Rostedt (VMware) authored
      While looking into optimizations for the RT scheduler IPI logic, I realized
      that the comments are lacking to describe it efficiently. It deserves a
      lengthy description describing its design.
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home
      [ Small typographical edits. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      3e777f99
    • Steven Rostedt (VMware)'s avatar
      sched/deadline: Use deadline instead of period when calculating overflow · 2317d5f1
      Steven Rostedt (VMware) authored
      I was testing Daniel's changes with his test case, and tweaked it a
      little. Instead of having the runtime equal to the deadline, I
      increased the deadline ten fold.
      
      Daniel's test case had:
      
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      
      To make it more interesting, I changed it to:
      
      	attr.sched_runtime  =  2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 20 * 1000 * 1000;		/* 20 ms */
      	attr.sched_period   =  2 * 1000 * 1000 * 1000;	/* 2 s */
      
      The results were rather surprising. The behavior that Daniel's patch
      was fixing came back. The task started using much more than .1% of the
      CPU. More like 20%.
      
      Looking into this I found that it was due to the dl_entity_overflow()
      constantly returning true. That's because it uses the relative period
      against relative runtime vs the absolute deadline against absolute
      runtime.
      
        runtime / (deadline - t) > dl_runtime / dl_period
      
      There's even a comment mentioning this, and saying that when relative
      deadline equals relative period, that the equation is the same as using
      deadline instead of period. That comment is backwards! What we really
      want is:
      
        runtime / (deadline - t) > dl_runtime / dl_deadline
      
      We care about if the runtime can make its deadline, not its period. And
      then we can say "when the deadline equals the period, the equation is
      the same as using dl_period instead of dl_deadline".
      
      After correcting this, now when the task gets enqueued, it can throttle
      correctly, and Daniel's fix to the throttling of sleeping deadline
      tasks works even when the runtime and deadline are not the same.
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2317d5f1
    • Daniel Bristot de Oliveira's avatar
      sched/deadline: Throttle a constrained deadline task activated after the deadline · df8eac8c
      Daniel Bristot de Oliveira authored
      During the activation, CBS checks if it can reuse the current task's
      runtime and period. If the deadline of the task is in the past, CBS
      cannot use the runtime, and so it replenishes the task. This rule
      works fine for implicit deadline tasks (deadline == period), and the
      CBS was designed for implicit deadline tasks. However, a task with
      constrained deadline (deadine < period) might be awakened after the
      deadline, but before the next period. In this case, replenishing the
      task would allow it to run for runtime / deadline. As in this case
      deadline < period, CBS enables a task to run for more than the
      runtime / period. In a very loaded system, this can cause a domino
      effect, making other tasks miss their deadlines.
      
      To avoid this problem, in the activation of a constrained deadline
      task after the deadline but before the next period, throttle the
      task and set the replenishing timer to the begin of the next period,
      unless it is boosted.
      
      Reproducer:
      
       --------------- %< ---------------
        int main (int argc, char **argv)
        {
      	int ret;
      	int flags = 0;
      	unsigned long l = 0;
      	struct timespec ts;
      	struct sched_attr attr;
      
      	memset(&attr, 0, sizeof(attr));
      	attr.size = sizeof(attr);
      
      	attr.sched_policy   = SCHED_DEADLINE;
      	attr.sched_runtime  = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_deadline = 2 * 1000 * 1000;		/* 2 ms */
      	attr.sched_period   = 2 * 1000 * 1000 * 1000;	/* 2 s */
      
      	ts.tv_sec = 0;
      	ts.tv_nsec = 2000 * 1000;			/* 2 ms */
      
      	ret = sched_setattr(0, &attr, flags);
      
      	if (ret < 0) {
      		perror("sched_setattr");
      		exit(-1);
      	}
      
      	for(;;) {
      		/* XXX: you may need to adjust the loop */
      		for (l = 0; l < 150000; l++);
      		/*
      		 * The ideia is to go to sleep right before the deadline
      		 * and then wake up before the next period to receive
      		 * a new replenishment.
      		 */
      		nanosleep(&ts, NULL);
      	}
      
      	exit(0);
        }
        --------------- >% ---------------
      
      On my box, this reproducer uses almost 50% of the CPU time, which is
      obviously wrong for a task with 2/2000 reservation.
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      df8eac8c
    • Daniel Bristot de Oliveira's avatar
      sched/deadline: Make sure the replenishment timer fires in the next period · 5ac69d37
      Daniel Bristot de Oliveira authored
      Currently, the replenishment timer is set to fire at the deadline
      of a task. Although that works for implicit deadline tasks because the
      deadline is equals to the begin of the next period, that is not correct
      for constrained deadline tasks (deadline < period).
      
      For instance:
      
      f.c:
       --------------- %< ---------------
      int main (void)
      {
      	for(;;);
      }
       --------------- >% ---------------
      
        # gcc -o f f.c
      
        # trace-cmd record -e sched:sched_switch                              \
      				   -e syscalls:sys_exit_sched_setattr   \
         chrt -d --sched-runtime  490000000					\
                 --sched-deadline 500000000					\
      	   --sched-period  1000000000 0 ./f
      
        # trace-cmd report | grep "{pid of ./f}"
      
      After setting parameters, the task is replenished and continue running
      until being throttled:
      
               f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0
      
      The task is throttled after running 492318 ms, as expected:
      
               f-11295 [003] 13322.606094: sched_switch:   f:11295 [-1] R ==> watchdog/3:32 [0]
      
      But then, the task is replenished 500719 ms after the first
      replenishment:
      
          <idle>-0     [003] 13322.614495: sched_switch:   swapper/3:0 [120] R ==> f:11295 [-1]
      
      Running for 490277 ms:
      
               f-11295 [003] 13323.104772: sched_switch:   f:11295 [-1] R ==>  swapper/3:0 [120]
      
      Hence, in the first period, the task runs 2 * runtime, and that is a bug.
      
      During the first replenishment, the next deadline is set one period away.
      So the runtime / period starts to be respected. However, as the second
      replenishment took place in the wrong instant, the next replenishment
      will also be held in a wrong instant of time. Rather than occurring in
      the nth period away from the first activation, it is taking place
      in the (nth period - relative deadline).
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarLuca Abeni <luca.abeni@santannapisa.it>
      Reviewed-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: default avatarJuri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5ac69d37
    • Matt Fleming's avatar
      sched/loadavg: Use {READ,WRITE}_ONCE() for sample window · caeb5882
      Matt Fleming authored
      'calc_load_update' is accessed without any kind of locking and there's
      a clear assumption in the code that only a single value is read or
      written.
      
      Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
      unintentionally seeing multiple values, or having the load/stores
      split.
      
      Technically the loads in calc_global_*() don't require this since
      those are the only functions that update 'calc_load_update', but I've
      added the READ_ONCE() for consistency.
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.ukSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      caeb5882
    • Matt Fleming's avatar
      sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting · 6e5f32f7
      Matt Fleming authored
      If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
      the pending sample window time on exit, setting the next update not
      one window into the future, but two.
      
      This situation on exiting NO_HZ is described by:
      
        this_rq->calc_load_update < jiffies < calc_load_update
      
      In this scenario, what we should be doing is:
      
        this_rq->calc_load_update = calc_load_update		     [ next window ]
      
      But what we actually do is:
      
        this_rq->calc_load_update = calc_load_update + LOAD_FREQ   [ next+1 window ]
      
      This has the effect of delaying load average updates for potentially
      up to ~9seconds.
      
      This can result in huge spikes in the load average values due to
      per-cpu uninterruptible task counts being out of sync when accumulated
      across all CPUs.
      
      It's safe to update the per-cpu active count if we wake between sample
      windows because any load that we left in 'calc_load_idle' will have
      been zero'd when the idle load was folded in calc_global_load().
      
      This issue is easy to reproduce before,
      
        commit 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      
      just by forking short-lived process pipelines built from ps(1) and
      grep(1) in a loop. I'm unable to reproduce the spikes after that
      commit, but the bug still seems to be present from code review.
      Signed-off-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: commit 5167e8d5 ("sched/nohz: Rewrite and fix load-avg computation -- again")
      Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.ukSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6e5f32f7
    • Wanpeng Li's avatar
      sched/deadline: Add missing update_rq_clock() in dl_task_timer() · dcc3b5ff
      Wanpeng Li authored
      The following warning can be triggered by hot-unplugging the CPU
      on which an active SCHED_DEADLINE task is running on:
      
       ------------[ cut here ]------------
       WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
       rq->clock_update_flags < RQCF_ACT_SKIP
       CPU: 7 PID: 0 Comm: swapper/7 Tainted: G    B           4.11.0-rc1+ #24
       Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
       Call Trace:
        <IRQ>
        dump_stack+0x85/0xc4
        __warn+0x172/0x1b0
        warn_slowpath_fmt+0xb4/0xf0
        ? __warn+0x1b0/0x1b0
        ? debug_check_no_locks_freed+0x2c0/0x2c0
        ? cpudl_set+0x3d/0x2b0
        replenish_dl_entity+0x71e/0xc40
        enqueue_task_dl+0x2ea/0x12e0
        ? dl_task_timer+0x777/0x990
        ? __hrtimer_run_queues+0x270/0xa50
        dl_task_timer+0x316/0x990
        ? enqueue_task_dl+0x12e0/0x12e0
        ? enqueue_task_dl+0x12e0/0x12e0
        __hrtimer_run_queues+0x270/0xa50
        ? hrtimer_cancel+0x20/0x20
        ? hrtimer_interrupt+0x119/0x600
        hrtimer_interrupt+0x19c/0x600
        ? trace_hardirqs_off+0xd/0x10
        local_apic_timer_interrupt+0x74/0xe0
        smp_apic_timer_interrupt+0x76/0xa0
        apic_timer_interrupt+0x93/0xa0
      
      The DL task will be migrated to a suitable later deadline rq once the DL
      timer fires and currnet rq is offline. The rq clock of the new rq should
      be updated. This patch fixes it by updating the rq clock after holding
      the new rq's rq lock.
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      dcc3b5ff
  5. 15 Mar, 2017 6 commits
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 69eea5a4
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Four small fixes for this cycle:
      
         - followup fix from Neil for a fix that went in before -rc2, ensuring
           that we always see the full per-task bio_list.
      
         - fix for blk-mq-sched from me that ensures that we retain similar
           direct-to-issue behavior on running the queue.
      
         - fix from Sagi fixing a potential NULL pointer dereference in blk-mq
           on spurious CPU unplug.
      
         - a memory leak fix in writeback from Tahsin, fixing a case where
           device removal of a mounted device can leak a struct
           wb_writeback_work"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly()
        writeback: fix memory leak in wb_queue_work()
        blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
        blk: Ensure users for current->bio_list can see the full list.
      69eea5a4
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 95422dec
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "This is a rather large set of fixes. The bulk are for lpfc correcting
        a lot of issues in the new NVME driver code which just went in in the
        merge window.
      
        The others are:
      
         - fix a hang in the vmware paravirt driver caused by incorrect
           handling of the new MSI vector allocation
      
         - long standing bug in storvsc, which recent block changes turned
           from being a harmless annoyance into a hang
      
         - yet more fallout (in mpt3sas) from the changes to device blocking
      
        The remainder are small fixes and updates"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (34 commits)
        scsi: lpfc: Add shutdown method for kexec
        scsi: storvsc: Workaround for virtual DVD SCSI version
        scsi: lpfc: revise version number to 11.2.0.10
        scsi: lpfc: code cleanups in NVME initiator discovery
        scsi: lpfc: code cleanups in NVME initiator base
        scsi: lpfc: correct rdp diag portnames
        scsi: lpfc: remove dead sli3 nvme code
        scsi: lpfc: correct double print
        scsi: lpfc: Rename LPFC_MAX_EQ_DELAY to LPFC_MAX_EQ_DELAY_EQID_CNT
        scsi: lpfc: Rework lpfc Kconfig for NVME options
        scsi: lpfc: add transport eh_timed_out reference
        scsi: lpfc: Fix eh_deadline setting for sli3 adapters.
        scsi: lpfc: add NVME exchange aborts
        scsi: lpfc: Fix nvme allocation bug on failed nvme_fc_register_localport
        scsi: lpfc: Fix IO submission if WQ is full
        scsi: lpfc: Fix NVME CMD IU byte swapped word 1 problem
        scsi: lpfc: Fix RCTL value on NVME LS request and response
        scsi: lpfc: Fix crash during Hardware error recovery on SLI3 adapters
        scsi: lpfc: fix missing spin_unlock on sql_list_lock
        scsi: lpfc: don't dereference dma_buf->iocbq before null check
        ...
      95422dec
    • Linus Torvalds's avatar
      Merge tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · aabcf5fc
      Linus Torvalds authored
      Pull gfs2 fix from Bob Peterson:
       "This is an emergency patch for 4.11-rc3
      
        The GFS2 developers uncovered a really nasty problem that can lead to
        random corruption and kernel panic, much like the last one. Andreas
        Gruenbacher wrote a simple one-line patch to fix the problem."
      
      * tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
        gfs2: Avoid alignment hole in struct lm_lockname
      aabcf5fc
    • Linus Torvalds's avatar
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · defc7d75
      Linus Torvalds authored
      Pull crypto fixes from Herbert Xu:
      
       - self-test failure of crc32c on powerpc
      
       - regressions of ecb(aes) when used with xts/lrw in s5p-sss
      
       - a number of bugs in the omap RNG driver
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: s5p-sss - Fix spinlock recursion on LRW(AES)
        hwrng: omap - Do not access INTMASK_REG on EIP76
        hwrng: omap - use devm_clk_get() instead of of_clk_get()
        hwrng: omap - write registers after enabling the clock
        crypto: s5p-sss - Fix completing crypto request in IRQ handler
        crypto: powerpc - Fix initialisation of crc32c context
      defc7d75
    • Andreas Gruenbacher's avatar
      gfs2: Avoid alignment hole in struct lm_lockname · 28ea06c4
      Andreas Gruenbacher authored
      Commit 88ffbf3e switches to using rhashtables for glocks, hashing over
      the entire struct lm_lockname instead of its individual fields.  On some
      architectures, struct lm_lockname contains a hole of uninitialized
      memory due to alignment rules, which now leads to incorrect hash values.
      Get rid of that hole.
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      CC: <stable@vger.kernel.org> #v4.3+
      28ea06c4
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ae50dfd6
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
          from Steffen Klassert.
      
       2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
          Alexei Starovoitov.
      
       3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
          Michal Schmidt.
      
       4) We can get a divide by zero when TCP socket are morphed into
          listening state, fix from Eric Dumazet.
      
       5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
          skb_complete_tx_timestamp(). From Eric Dumazet.
      
       6) Use after free in dccp_feat_activate_values(), also from Eric
          Dumazet.
      
       7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
          Jarod Wilson.
      
       8) Fix use after free in vrf_xmit(), from David Ahern.
      
       9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
          Alexey Kodanev.
      
      10) Properly check napi_complete_done() return value in order to decide
          whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
          Lendacky.
      
      11) Fix double free of hwmon device in marvell phy driver, from Andrew
          Lunn.
      
      12) Don't crash on malformed netlink attributes in act_connmark, from
          Etienne Noss.
      
      13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
          from Sabrina Dubroca.
      
      14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
          Florian Westphal.
      
      15) Fix routing redirect races in dccp and tcp, basically the ICMP
          handler can't modify the socket's cached route in it's locked by the
          user at this moment. From Jon Maxwell.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
        qed: Enable iSCSI Out-of-Order
        qed: Correct out-of-bound access in OOO history
        qed: Fix interrupt flags on Rx LL2
        qed: Free previous connections when releasing iSCSI
        qed: Fix mapping leak on LL2 rx flow
        qed: Prevent creation of too-big u32-chains
        qed: Align CIDs according to DORQ requirement
        mlxsw: reg: Fix SPVMLR max record count
        mlxsw: reg: Fix SPVM max record count
        net: Resend IGMP memberships upon peer notification.
        dccp: fix memory leak during tear-down of unsuccessful connection request
        tun: fix premature POLLOUT notification on tun devices
        dccp/tcp: fix routing redirect race
        ucc/hdlc: fix two little issue
        vxlan: fix ovs support
        net: use net->count to check whether a netns is alive or not
        bridge: drop netfilter fake rtable unconditionally
        ipv6: avoid write to a possibly cloned skb
        net: wimax/i2400m: fix NULL-deref at probe
        isdn/gigaset: fix NULL-deref at probe
        ...
      ae50dfd6
  6. 14 Mar, 2017 13 commits