- 30 Mar, 2017 2 commits
-
-
Yuyang Du authored
The main PELT function ___update_load_avg(), which implements the accumulation and progression of the geometric average series, is implemented along the following lines for the scenario where the time delta spans all 3 possible sections (see figure below): 1. add the remainder of the last incomplete period 2. decay old sum 3. accumulate new sum in full periods since last_update_time 4. accumulate the current incomplete period 5. update averages Or: d1 d2 d3 ^ ^ ^ | | | |<->|<----------------->|<--->| ... |---x---|------| ... |------|-----x (now) load_sum' = (load_sum + weight * scale * d1) * y^(p+1) + (1,2) p weight * scale * 1024 * \Sum y^n + (3) n=1 weight * scale * d3 * y^0 (4) load_avg' = load_sum' / LOAD_AVG_MAX (5) Where: d1 - is the delta part completing the remainder of the last incomplete period, d2 - is the delta part spannind complete periods, and d3 - is the delta part starting the current incomplete period. We can simplify the code in two steps; the first step is to separate the first term into new and old parts like: (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) + weight * scale * d1 * y^(p+1) Once we've done that, its easy to see that all new terms carry the common factors: weight * scale If we factor those out, we arrive at the form: load_sum' = load_sum * y^(p+1) + weight * scale * (d1 * y^(p+1) + p 1024 * \Sum y^n + n=1 d3 * y^0) Which results in a simpler, smaller and faster implementation. Signed-off-by: Yuyang Du <yuyang.du@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: matt@codeblueprint.co.uk Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
The __update_load_avg() function is an __always_inline because its used with constant propagation to generate different variants of the code without having to duplicate it (which would be prone to bugs). Explicitly instantiate the 3 variants. Note that most of this is called from rather hot paths, so reducing branches is good. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 27 Mar, 2017 1 commit
-
-
Srikar Dronamraju authored
If the child domain prefers tasks to go siblings, the local group could end up pulling tasks to itself even if the local group is almost equally loaded as the source group. Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload. Everytime, local group has capacity and source group has atleast 2 threads, local group tries to pull the task. This causes the threads to constantly move between different cores. This is even more profound if the cores have more threads, like in Power 8, smt 8 mode. Fix this by only allowing local group to pull a task, if the source group has more number of tasks than the local group. Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine. Without patch: Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs): 1,440 context-switches # 0.001 K/sec ( +- 1.26% ) 366 cpu-migrations # 0.000 K/sec ( +- 5.58% ) 3,933 page-faults # 0.002 K/sec ( +- 11.08% ) Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs): 6,287 context-switches # 0.001 K/sec ( +- 3.65% ) 3,776 cpu-migrations # 0.001 K/sec ( +- 4.84% ) 5,702 page-faults # 0.001 K/sec ( +- 9.36% ) Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs): 8,776 context-switches # 0.001 K/sec ( +- 0.73% ) 2,790 cpu-migrations # 0.000 K/sec ( +- 0.98% ) 10,540 page-faults # 0.001 K/sec ( +- 3.12% ) With patch: Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs): 1,133 context-switches # 0.001 K/sec ( +- 4.72% ) 123 cpu-migrations # 0.000 K/sec ( +- 3.42% ) 3,858 page-faults # 0.002 K/sec ( +- 8.52% ) Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs): 2,169 context-switches # 0.000 K/sec ( +- 6.19% ) 189 cpu-migrations # 0.000 K/sec ( +- 12.75% ) 5,917 page-faults # 0.001 K/sec ( +- 8.09% ) Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs): 5,333 context-switches # 0.001 K/sec ( +- 5.91% ) 506 cpu-migrations # 0.000 K/sec ( +- 3.35% ) 10,792 page-faults # 0.001 K/sec ( +- 7.75% ) Which show that in these workloads CPU migrations get reduced significantly. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 23 Mar, 2017 2 commits
-
-
Vincent Guittot authored
A regression of the FTQ noise has been reported by Ying Huang, on the following hardware: 8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory ... which was caused by this commit: commit 4e516076 ("sched/fair: Propagate asynchrous detach") The only part of the patch that can increase the noise is the update of blocked load of group entity in update_blocked_averages(). We can optimize this call and skip the update of group entity if its load and utilization are already null and there is no pending propagation of load in the task group. This optimization partly restores the noise score. A more agressive optimization has been tried but has shown worse score. Reported-by: ying.huang@linux.intel.com Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: ying.huang@intel.com Fixes: 4e516076 ("sched/fair: Propagate asynchrous detach") Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org [ Fixed typos, improved layout. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Wanpeng Li authored
This can be reproduced by running rt-migrate-test: WARNING: CPU: 2 PID: 2195 at kernel/locking/lockdep.c:3670 lock_unpin_lock() unpinning an unpinned lock ... Call Trace: dump_stack() __warn() warn_slowpath_fmt() lock_unpin_lock() __balance_callback() __schedule() schedule() futex_wait_queue_me() futex_wait() do_futex() SyS_futex() do_syscall_64() entry_SYSCALL64_slow_path() Revert the rq_lock_irqsave() usage here, the whole point of the balance_callback() was to allow dropping rq->lock. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 8a8c69c3 ("sched/core: Add rq->lock wrappers") Link: http://lkml.kernel.org/r/1489718719-3951-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 16 Mar, 2017 16 commits
-
-
Peter Zijlstra authored
Address this case: WARNING: CPU: 0 PID: 2070 at ../kernel/sched/core.c:109 update_rq_clock+0x74/0x80 rq->clock_update_flags & RQCF_UPDATED Call Trace: update_rq_clock() move_queued_task() __set_cpus_allowed_ptr() ... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Add DEQUEUE_NOCLOCK to all places where we just did an update_rq_clock() already. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Instead of relying on deactivate_task() to call update_rq_clock() and handling the case where it didn't happen (task_on_rq_queued), unconditionally do update_rq_clock() and skip any further updates. This also avoids a double update on deactivate_task() + ttwu_local(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Since all tasks on the wake_list are woken under a single rq->lock avoid calling update_rq_clock() for each task. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
In all cases, ENQUEUE_RESTORE should also have ENQUEUE_NOCLOCK because DEQUEUE_SAVE will have done an update_rq_clock(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Currently {en,de}queue_task() do an unconditional update_rq_clock(). However since we want to avoid duplicate updates, so that each rq->lock section appears atomic in time, we need to be able to skip these clock updates. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
The missing update_rq_clock() check can work with partial rq->lock wrappery, since a missing wrapper can cause the warning to not be emitted when it should have, but cannot cause the warning to trigger when it should not have. The duplicate update_rq_clock() check however can cause false warnings to trigger. Therefore add more comprehensive rq->lock wrappery. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Now that we have no missing calls, add a warning to find multiple calls. By having only a single update_rq_clock() call per rq-lock section, the section appears 'atomic' wrt time. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Steven Rostedt (VMware) authored
While looking into optimizations for the RT scheduler IPI logic, I realized that the comments are lacking to describe it efficiently. It deserves a lengthy description describing its design. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Clark Williams <williams@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170228155030.30c69068@gandalf.local.home [ Small typographical edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Steven Rostedt (VMware) authored
I was testing Daniel's changes with his test case, and tweaked it a little. Instead of having the runtime equal to the deadline, I increased the deadline ten fold. Daniel's test case had: attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ To make it more interesting, I changed it to: attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ The results were rather surprising. The behavior that Daniel's patch was fixing came back. The task started using much more than .1% of the CPU. More like 20%. Looking into this I found that it was due to the dl_entity_overflow() constantly returning true. That's because it uses the relative period against relative runtime vs the absolute deadline against absolute runtime. runtime / (deadline - t) > dl_runtime / dl_period There's even a comment mentioning this, and saying that when relative deadline equals relative period, that the equation is the same as using deadline instead of period. That comment is backwards! What we really want is: runtime / (deadline - t) > dl_runtime / dl_deadline We care about if the runtime can make its deadline, not its period. And then we can say "when the deadline equals the period, the equation is the same as using dl_period instead of dl_deadline". After correcting this, now when the task gets enqueued, it can throttle correctly, and Daniel's fix to the throttling of sleeping deadline tasks works even when the runtime and deadline are not the same. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Daniel Bristot de Oliveira authored
During the activation, CBS checks if it can reuse the current task's runtime and period. If the deadline of the task is in the past, CBS cannot use the runtime, and so it replenishes the task. This rule works fine for implicit deadline tasks (deadline == period), and the CBS was designed for implicit deadline tasks. However, a task with constrained deadline (deadine < period) might be awakened after the deadline, but before the next period. In this case, replenishing the task would allow it to run for runtime / deadline. As in this case deadline < period, CBS enables a task to run for more than the runtime / period. In a very loaded system, this can cause a domino effect, making other tasks miss their deadlines. To avoid this problem, in the activation of a constrained deadline task after the deadline but before the next period, throttle the task and set the replenishing timer to the begin of the next period, unless it is boosted. Reproducer: --------------- %< --------------- int main (int argc, char **argv) { int ret; int flags = 0; unsigned long l = 0; struct timespec ts; struct sched_attr attr; memset(&attr, 0, sizeof(attr)); attr.size = sizeof(attr); attr.sched_policy = SCHED_DEADLINE; attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ ts.tv_sec = 0; ts.tv_nsec = 2000 * 1000; /* 2 ms */ ret = sched_setattr(0, &attr, flags); if (ret < 0) { perror("sched_setattr"); exit(-1); } for(;;) { /* XXX: you may need to adjust the loop */ for (l = 0; l < 150000; l++); /* * The ideia is to go to sleep right before the deadline * and then wake up before the next period to receive * a new replenishment. */ nanosleep(&ts, NULL); } exit(0); } --------------- >% --------------- On my box, this reproducer uses almost 50% of the CPU time, which is obviously wrong for a task with 2/2000 reservation. Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Daniel Bristot de Oliveira authored
Currently, the replenishment timer is set to fire at the deadline of a task. Although that works for implicit deadline tasks because the deadline is equals to the begin of the next period, that is not correct for constrained deadline tasks (deadline < period). For instance: f.c: --------------- %< --------------- int main (void) { for(;;); } --------------- >% --------------- # gcc -o f f.c # trace-cmd record -e sched:sched_switch \ -e syscalls:sys_exit_sched_setattr \ chrt -d --sched-runtime 490000000 \ --sched-deadline 500000000 \ --sched-period 1000000000 0 ./f # trace-cmd report | grep "{pid of ./f}" After setting parameters, the task is replenished and continue running until being throttled: f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0 The task is throttled after running 492318 ms, as expected: f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0] But then, the task is replenished 500719 ms after the first replenishment: <idle>-0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1] Running for 490277 ms: f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120] Hence, in the first period, the task runs 2 * runtime, and that is a bug. During the first replenishment, the next deadline is set one period away. So the runtime / period starts to be respected. However, as the second replenishment took place in the wrong instant, the next replenishment will also be held in a wrong instant of time. Rather than occurring in the nth period away from the first activation, it is taking place in the (nth period - relative deadline). Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luca Abeni <luca.abeni@santannapisa.it> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Matt Fleming authored
'calc_load_update' is accessed without any kind of locking and there's a clear assumption in the code that only a single value is read or written. Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid unintentionally seeing multiple values, or having the load/stores split. Technically the loads in calc_global_*() don't require this since those are the only functions that update 'calc_load_update', but I've added the READ_ONCE() for consistency. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.ukSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Matt Fleming authored
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to the pending sample window time on exit, setting the next update not one window into the future, but two. This situation on exiting NO_HZ is described by: this_rq->calc_load_update < jiffies < calc_load_update In this scenario, what we should be doing is: this_rq->calc_load_update = calc_load_update [ next window ] But what we actually do is: this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ] This has the effect of delaying load average updates for potentially up to ~9seconds. This can result in huge spikes in the load average values due to per-cpu uninterruptible task counts being out of sync when accumulated across all CPUs. It's safe to update the per-cpu active count if we wake between sample windows because any load that we left in 'calc_load_idle' will have been zero'd when the idle load was folded in calc_global_load(). This issue is easy to reproduce before, commit 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking") just by forking short-lived process pipelines built from ps(1) and grep(1) in a loop. I'm unable to reproduce the spikes after that commit, but the bug still seems to be present from code review. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: commit 5167e8d5 ("sched/nohz: Rewrite and fix load-avg computation -- again") Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.ukSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Wanpeng Li authored
The following warning can be triggered by hot-unplugging the CPU on which an active SCHED_DEADLINE task is running on: ------------[ cut here ]------------ WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40 rq->clock_update_flags < RQCF_ACT_SKIP CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016 Call Trace: <IRQ> dump_stack+0x85/0xc4 __warn+0x172/0x1b0 warn_slowpath_fmt+0xb4/0xf0 ? __warn+0x1b0/0x1b0 ? debug_check_no_locks_freed+0x2c0/0x2c0 ? cpudl_set+0x3d/0x2b0 replenish_dl_entity+0x71e/0xc40 enqueue_task_dl+0x2ea/0x12e0 ? dl_task_timer+0x777/0x990 ? __hrtimer_run_queues+0x270/0xa50 dl_task_timer+0x316/0x990 ? enqueue_task_dl+0x12e0/0x12e0 ? enqueue_task_dl+0x12e0/0x12e0 __hrtimer_run_queues+0x270/0xa50 ? hrtimer_cancel+0x20/0x20 ? hrtimer_interrupt+0x119/0x600 hrtimer_interrupt+0x19c/0x600 ? trace_hardirqs_off+0xd/0x10 local_apic_timer_interrupt+0x74/0xe0 smp_apic_timer_interrupt+0x76/0xa0 apic_timer_interrupt+0x93/0xa0 The DL task will be migrated to a suitable later deadline rq once the DL timer fires and currnet rq is offline. The rq clock of the new rq should be updated. This patch fixes it by updating the rq clock after holding the new rq's rq lock. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 15 Mar, 2017 6 commits
-
-
git://git.kernel.dk/linux-blockLinus Torvalds authored
Pull block fixes from Jens Axboe: "Four small fixes for this cycle: - followup fix from Neil for a fix that went in before -rc2, ensuring that we always see the full per-task bio_list. - fix for blk-mq-sched from me that ensures that we retain similar direct-to-issue behavior on running the queue. - fix from Sagi fixing a potential NULL pointer dereference in blk-mq on spurious CPU unplug. - a memory leak fix in writeback from Tahsin, fixing a case where device removal of a mounted device can leak a struct wb_writeback_work" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly() writeback: fix memory leak in wb_queue_work() blk-mq: Fix tagset reinit in the presence of cpu hot-unplug blk: Ensure users for current->bio_list can see the full list.
-
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds authored
Pull SCSI fixes from James Bottomley: "This is a rather large set of fixes. The bulk are for lpfc correcting a lot of issues in the new NVME driver code which just went in in the merge window. The others are: - fix a hang in the vmware paravirt driver caused by incorrect handling of the new MSI vector allocation - long standing bug in storvsc, which recent block changes turned from being a harmless annoyance into a hang - yet more fallout (in mpt3sas) from the changes to device blocking The remainder are small fixes and updates" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (34 commits) scsi: lpfc: Add shutdown method for kexec scsi: storvsc: Workaround for virtual DVD SCSI version scsi: lpfc: revise version number to 11.2.0.10 scsi: lpfc: code cleanups in NVME initiator discovery scsi: lpfc: code cleanups in NVME initiator base scsi: lpfc: correct rdp diag portnames scsi: lpfc: remove dead sli3 nvme code scsi: lpfc: correct double print scsi: lpfc: Rename LPFC_MAX_EQ_DELAY to LPFC_MAX_EQ_DELAY_EQID_CNT scsi: lpfc: Rework lpfc Kconfig for NVME options scsi: lpfc: add transport eh_timed_out reference scsi: lpfc: Fix eh_deadline setting for sli3 adapters. scsi: lpfc: add NVME exchange aborts scsi: lpfc: Fix nvme allocation bug on failed nvme_fc_register_localport scsi: lpfc: Fix IO submission if WQ is full scsi: lpfc: Fix NVME CMD IU byte swapped word 1 problem scsi: lpfc: Fix RCTL value on NVME LS request and response scsi: lpfc: Fix crash during Hardware error recovery on SLI3 adapters scsi: lpfc: fix missing spin_unlock on sql_list_lock scsi: lpfc: don't dereference dma_buf->iocbq before null check ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2Linus Torvalds authored
Pull gfs2 fix from Bob Peterson: "This is an emergency patch for 4.11-rc3 The GFS2 developers uncovered a really nasty problem that can lead to random corruption and kernel panic, much like the last one. Andreas Gruenbacher wrote a simple one-line patch to fix the problem." * tag 'gfs2-4.11-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Avoid alignment hole in struct lm_lockname
-
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds authored
Pull crypto fixes from Herbert Xu: - self-test failure of crc32c on powerpc - regressions of ecb(aes) when used with xts/lrw in s5p-sss - a number of bugs in the omap RNG driver * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: s5p-sss - Fix spinlock recursion on LRW(AES) hwrng: omap - Do not access INTMASK_REG on EIP76 hwrng: omap - use devm_clk_get() instead of of_clk_get() hwrng: omap - write registers after enabling the clock crypto: s5p-sss - Fix completing crypto request in IRQ handler crypto: powerpc - Fix initialisation of crc32c context
-
Andreas Gruenbacher authored
Commit 88ffbf3e switches to using rhashtables for glocks, hashing over the entire struct lm_lockname instead of its individual fields. On some architectures, struct lm_lockname contains a hole of uninitialized memory due to alignment rules, which now leads to incorrect hash values. Get rid of that hole. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> CC: <stable@vger.kernel.org> #v4.3+
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds authored
Pull networking fixes from David Miller: 1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver, from Steffen Klassert. 2) Fix crashes when user tries to get_next_key on an LPM bpf map, from Alexei Starovoitov. 3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from Michal Schmidt. 4) We can get a divide by zero when TCP socket are morphed into listening state, fix from Eric Dumazet. 5) Fix socket refcounting bugs in skb_complete_wifi_ack() and skb_complete_tx_timestamp(). From Eric Dumazet. 6) Use after free in dccp_feat_activate_values(), also from Eric Dumazet. 7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from Jarod Wilson. 8) Fix use after free in vrf_xmit(), from David Ahern. 9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from Alexey Kodanev. 10) Properly check napi_complete_done() return value in order to decide whether to re-enable IRQs or not in amd-xgbe driver, from Thomas Lendacky. 11) Fix double free of hwmon device in marvell phy driver, from Andrew Lunn. 12) Don't crash on malformed netlink attributes in act_connmark, from Etienne Noss. 13) Don't remove routes with a higher metric in ipv6 ECMP route replace, from Sabrina Dubroca. 14) Don't write into a cloned SKB in ipv6 fragmentation handling, from Florian Westphal. 15) Fix routing redirect races in dccp and tcp, basically the ICMP handler can't modify the socket's cached route in it's locked by the user at this moment. From Jon Maxwell. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits) qed: Enable iSCSI Out-of-Order qed: Correct out-of-bound access in OOO history qed: Fix interrupt flags on Rx LL2 qed: Free previous connections when releasing iSCSI qed: Fix mapping leak on LL2 rx flow qed: Prevent creation of too-big u32-chains qed: Align CIDs according to DORQ requirement mlxsw: reg: Fix SPVMLR max record count mlxsw: reg: Fix SPVM max record count net: Resend IGMP memberships upon peer notification. dccp: fix memory leak during tear-down of unsuccessful connection request tun: fix premature POLLOUT notification on tun devices dccp/tcp: fix routing redirect race ucc/hdlc: fix two little issue vxlan: fix ovs support net: use net->count to check whether a netns is alive or not bridge: drop netfilter fake rtable unconditionally ipv6: avoid write to a possibly cloned skb net: wimax/i2400m: fix NULL-deref at probe isdn/gigaset: fix NULL-deref at probe ...
-
- 14 Mar, 2017 13 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds authored
Pull cgroup fixes from Tejun Heo: "Three cgroup fixes. Nothing critical: - the pids controller could trigger suspicious RCU warning spuriously. Fixed. - in the debug controller, %p -> %pK to protect kernel pointer from getting exposed. - documentation formatting fix" * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroups: censor kernel pointer in debug files cgroup/pids: remove spurious suspicious RCU usage warning cgroup: Fix indenting in PID controller documentation
-
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libataLinus Torvalds authored
Pull libata fixes from Tejun Heo: "Three libata fixes: - fix for a circular reference bug in sysfs code which prevented pata_legacy devices from being released after probe failure, which in turn prevented devres from releasing the associated resources. - drop spurious WARN in the command issue path which can be triggered by a legitimate passthrough command. - an ahci_qoriq specific fix" * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: ahci: qoriq: correct the sata ecc setting error libata: drop WARN from protocol error in ata_sff_qc_issue() libata: transport: Remove circular dependency at free time
-
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds authored
Pull workqueue fix from Tejun Heo: "If a delayed work is queued with NULL @wq, workqueue code explodes after the timer expires at which point it's difficult to tell who the culprit was. This actually happened and the offender was net/smc this time. Add an explicit sanity check for it in the queueing path" * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
-
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpuLinus Torvalds authored
Pull percpu fixes from Tejun Heo: - the allocation path was updating pcpu_nr_empty_pop_pages without the required locking which can lead to incorrect handling of empty chunks (e.g. keeping too many around), which is buggy but shouldn't lead to critical failures. Fixed by adding the locking - a trivial patch to drop an unused param from pcpu_get_pages() * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: remove unused chunk_alloc parameter from pcpu_get_pages() percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages
-
David S. Miller authored
Yuval Mintz says: ==================== qed: Fixes series This address several different issues in qed. The more significant portions: Patch #1 would cause timeout when qedr utilizes the highest CIDs availble for it [or when future qede adapters would utilize queues in some constellations]. Patch #4 fixes a leak of mapped addresses; When iommu is enabled, offloaded storage protocols might eventually run out of resources and fail to map additional buffers. Patches #6,#7 were missing in the initial iSCSI infrastructure submissions, and would hamper qedi's stability when it reaches out-of-order scenarios. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Mintz, Yuval authored
Missing in the initial submission, qed fails to propagate qedi's request to enable OOO to firmware. Fixes: fc831825 ("qed: Add support for hardware offloaded iSCSI") Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Mintz, Yuval authored
Need to set the number of entries in database, otherwise the logic would quickly surpass the array. Fixes: 1d6cff4f ("qed: Add iSCSI out of order packet handling") Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ram Amrani authored
Before iterating over the the LL2 Rx ring, the ring's spinlock is taken via spin_lock_irqsave(). The actual processing of the packet [including handling by the protocol driver] is done without said lock, so qed releases the spinlock and re-claims it afterwards. Problem is that the final spin_lock_irqrestore() at the end of the iteration uses the original flags saved from the initial irqsave() instead of the flags from the most recent irqsave(). So it's possible that the interrupt status would be incorrect at the end of the processing. Fixes: 0a7fb11c ("qed: Add Light L2 support"); CC: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Mintz, Yuval authored
Fixes: fc831825 ("qed: Add support for hardware offloaded iSCSI") Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Mintz, Yuval authored
When receiving an Rx LL2 packet, qed fails to unmap the previous buffer. Fixes: 0a7fb11c ("qed: Add Light L2 support"); Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tomer Tayar authored
Current Logic would allow the creation of a chain with U32_MAX + 1 elements, when the actual maximum supported by the driver infrastructure is U32_MAX. Fixes: a91eb52a ("qed: Revisit chain implementation") Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ram Amrani authored
The Doorbell HW block can be configured at a granularity of 16 x CIDs, so we need to make sure that the actual number of CIDs configured would be a multiplication of 16. Today, when RoCE is enabled - given that the number is unaligned, doorbelling the higher CIDs would fail to reach the firmware and would eventually timeout. Fixes: dbb799c3 ("qed: Initialize hardware for new protocols") Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Jiri Pirko says: ==================== mlxsw: Couple of fixes Couple or small fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-