- 04 Dec, 2015 17 commits
-
-
Waiman Long authored
If a system with large number of sockets was driven to full utilization, it was found that the clock tick handling occupied a rather significant proportion of CPU time when fair group scheduling and autogroup were enabled. Running a java benchmark on a 16-socket IvyBridge-EX system, the perf profile looked like: 10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt 9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt 8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer 8.56% 0.00% java [kernel.vmlinux] [k] update_process_times 8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick 6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair 5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares In particular, the high CPU time consumed by update_cfs_shares() was mostly due to contention on the cacheline that contained the task_group's load_avg statistical counter. This cacheline may also contains variables like shares, cfs_rq & se which are accessed rather frequently during clock tick processing. This patch moves the load_avg variable into another cacheline separated from the other frequently accessed variables. It also creates a cacheline aligned kmemcache for task_group to make sure that all the allocated task_group's are cacheline aligned. By doing so, the perf profile became: 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares The %cpu time is still pretty high, but it is better than before. The benchmark results before and after the patch was as follows: Before patch - Max-jOPs: 907533 Critical-jOps: 134877 After patch - Max-jOPs: 916011 Critical-jOps: 142366 Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ben Segall <bsegall@google.com> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Waiman Long authored
Part of the responsibility of the update_sg_lb_stats() function is to update the idle_cpus statistical counter in struct sg_lb_stats. This check is done by calling idle_cpu(). The idle_cpu() function, in turn, checks a number of fields within the run queue structure such as rq->curr and rq->nr_running. With the current layout of the run queue structure, rq->curr and rq->nr_running are in separate cachelines. The rq->curr variable is checked first followed by nr_running. As nr_running is also accessed by update_sg_lb_stats() earlier, it makes no sense to load another cacheline when nr_running is not 0 as idle_cpu() will always return false in this case. This patch eliminates this redundant cacheline load by checking the cached nr_running before calling idle_cpu(). Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Andi Kleen authored
When building a kernel with a gcc 6 snapshot the compiler complains about unused const static variables for prio_to_weight and prio_to_mult for multiple scheduler files (all but core.c and autogroup.c) The way the array is currently declared it will be duplicated in every scheduler file that includes sched.h, which seems rather wasteful. Move the array out of line into core.c. I also added a sched_ prefix to avoid any potential name space collisions. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1448859583-3252-1-git-send-email-andi@firstfloor.orgSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Frederic Weisbecker authored
The cputime can only be updated by the current task itself, even in vtime case. So we can safely use seqcount instead of seqlock as there is no writer concurrency involved. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-8-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Frederic Weisbecker authored
Readers need to know if vtime runs at all on some CPU somewhere, this is a fast-path check to determine if we need to check further the need to add up any tickless cputime delta. This fast path check uses context tracking state because vtime is tied to context tracking as of now. This check appears to be confusing though so lets use a vtime function that deals with context tracking details in vtime implementation instead. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-7-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Frederic Weisbecker authored
vtime_accounting_enabled() checks if vtime is running on the current CPU and is as such a misnomer. Lets rename it to a function that reflect its locality. We are going to need the current name for a function that tells if vtime runs at all on some CPU. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-6-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Frederic Weisbecker authored
When a task runs on a housekeeper (a CPU running with the periodic tick with neighbours running tickless), it doesn't account cputime using vtime but relies on the tick. Such a task has its vtime_snap_whence value set to VTIME_INACTIVE. Readers won't handle that correctly though. As long as vtime is running on some CPU, readers incorretly assume that vtime runs on all CPUs and always compute the tickless cputime delta, which is only junk on housekeepers. So lets fix this with checking that the target runs on a vtime CPU through the appropriate state check before computing the tickless delta. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-5-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Frederic Weisbecker authored
VTIME_SLEEPING state happens either when: 1) The task is sleeping and no tickless delta is to be added on the task cputime stats. 2) The CPU isn't running vtime at all, so the same properties of 1) applies. Lets rename the vtime symbol to reflect both states. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-4-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Hiroshi Shimamoto authored
There is an extra cost in task_cputime() and task_cputime_scaled() when nohz_full is not activated. When vtime accounting is not enabled, we don't need to get deltas of utime and stime under vtime seqlock. This patch removes that cost with adding a shortcut route if vtime accounting is not enabled. Use context_tracking_is_enabled() to check if vtime is accounting on some cpu, in which case only we need to check the tickless cputime delta. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-3-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Byungchul Park authored
The current code accounts for the time a task was absent from the fair class (per ATTACH_AGE_LOAD). However it does not work correctly when a task got migrated or moved to another cgroup while outside of the fair class. This patch tries to address that by aging on migration. We locklessly read the 'last_update_time' stamp from both the old and new cfs_rq, ages the load upto the old time, and sets it to the new time. These timestamps should in general not be more than 1 tick apart from one another, so there is a definite bound on things. Signed-off-by: Byungchul Park <byungchul.park@lge.com> [ Changelog, a few edits and !SMP build fix ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Ingo Molnar authored
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Oleg noticed that its possible to falsely observe p->on_cpu == 0 such that we'll prematurely continue with the wakeup and effectively run p on two CPUs at the same time. Even though the overlap is very limited; the task is in the middle of being scheduled out; it could still result in corruption of the scheduler data structures. CPU0 CPU1 set_current_state(...) <preempt_schedule> context_switch(X, Y) prepare_lock_switch(Y) Y->on_cpu = 1; finish_lock_switch(X) store_release(X->on_cpu, 0); try_to_wake_up(X) LOCK(p->pi_lock); t = X->on_cpu; // 0 context_switch(Y, X) prepare_lock_switch(X) X->on_cpu = 1; finish_lock_switch(Y) store_release(Y->on_cpu, 0); </preempt_schedule> schedule(); deactivate_task(X); X->on_rq = 0; if (X->on_rq) // false if (t) while (X->on_cpu) cpu_relax(); context_switch(X, ..) finish_lock_switch(X) store_release(X->on_cpu, 0); Avoid the load of X->on_cpu being hoisted over the X->on_rq load. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Explain how the control dependency and smp_rmb() end up providing ACQUIRE semantics and pair with smp_store_release() in finish_lock_switch(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Hiroshi Shimamoto authored
/proc/stats shows invalid gtime when the thread is running in guest. When vtime accounting is not enabled, we cannot get a valid delta. The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap is only updated when vtime accounting is runtime enabled. This patch makes task_gtime() just return gtime without computing the buggy non-existing tickless delta when vtime accounting is not enabled. Use context_tracking_is_enabled() to check if vtime is accounting on some cpu, in which case only we need to check the tickless delta. This way we fix the gtime value regression on machines not running nohz full. The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full. I ran and stop a busy loop in VM and see the gtime in host. Dump the 43rd field which shows the gtime in every second: # while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done S 4348 R 7064566 R 7064766 R 7064967 R 7065168 S 4759 S 4759 During running busy loop, it returns large value. After applying this patch, we can see right gtime. # while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done S 5338 R 5365 R 5465 R 5566 R 5666 S 5726 S 5726 Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Xunlei Pang authored
root_domain::rto_mask allocated through alloc_cpumask_var() contains garbage data, this may cause problems. For instance, When doing pull_rt_task(), it may do useless iterations if rto_mask retains some extra garbage bits. Worse still, this violates the isolated domain rule for clustered scheduling using cpuset, because the tasks(with all the cpus allowed) belongs to one root domain can be pulled away into another root domain. The patch cleans the garbage by using zalloc_cpumask_var() instead of alloc_cpumask_var() for root_domain::rto_mask allocation, thereby addressing the issues. Do the same thing for root_domain's other cpumask memembers: dlo_mask, span, and online. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Sasha Levin authored
Because wakeups can (fundamentally) be late, a task might not be in the expected state. Therefore testing against a task's state is racy, and can yield false positives. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: oleg@redhat.com Fixes: 9067ac85 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task") Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Vladimir reported getting RCU stall warnings and bisected it back to commit: 74316201 ("sched: Remove proliferation of wait_on_bit() action functions") That commit inadvertently reversed the calls to schedule() and signal_pending(), thereby not handling the case where the signal receives while we sleep. Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: mark.rutland@arm.com Cc: neilb@suse.de Cc: oleg@redhat.com Fixes: 74316201 ("sched: Remove proliferation of wait_on_bit() action functions") Fixes: cbbce822 ("SCHED: add some "wait..on_bit...timeout()" interfaces.") Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.netSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
- 23 Nov, 2015 19 commits
-
-
Byungchul Park authored
The comment describing migrate_task_rq_fair() says that the caller should hold p->pi_lock. But in some cases the caller can hold task_rq(p)->lock instead of p->pi_lock. So the comment is broken and this patch fixes it. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447806899-20303-1-git-send-email-byungchul.park@lge.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
1. Change this code to use preempt_count_inc/preempt_count_dec; this way it works even if CONFIG_PREEMPT_COUNT=n, and we avoid the unnecessary __preempt_schedule() check (stop_sched_class is not preemptible). And this makes clear that we only want to make preempt_count() != 0 for __might_sleep() / schedule_debug(). 2. Change WARN_ONCE() to use %pf to print the function name and remove kallsyms_lookup/ksym_buf. 3. Move "int ret" into the "if (work)" block, this looks more consistent. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Milos Vyletel <milos@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151115193332.GA8281@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
Change cpu_stop_queue_work() and cpu_stopper_thread() to check done != NULL before cpu_stop_signal_done(done). This makes the code more clean imo, note that cpu_stopper_thread() has to do this check anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Milos Vyletel <milos@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151115193329.GA8274@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
Now that cpu_stop_done->executed becomes write-only (ignoring WARN_ON() checks) we can remove it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Milos Vyletel <milos@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151115193326.GA8269@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
Change queue_stop_cpus_work() to return true if it queues at least one work, this means that the caller should wait. __stop_cpus() can check the value returned by queue_stop_cpus_work() and avoid done.executed, just like stop_one_cpu() does. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Milos Vyletel <milos@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151115193323.GA8262@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
Change stop_one_cpu() to return -ENOENT if cpu_stop_queue_work() fails. Otherwise we know that ->executed must be true after wait_for_completion() so we can just return done.ret. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Milos Vyletel <milos@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151115193320.GA8259@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
Change cpu_stop_queue_work() to return true if the work was queued and change stop_one_cpu_nowait() to return the result of cpu_stop_queue_work(). This makes it more useful, for example now you can alloc cpu_stop_work for stop_one_cpu_nowait() and free it in the callback or if stop_one_cpu_nowait() fails, currently this is impossible because you can't know if @fn will be called or not. Also, this allows to kill cpu_stop_done->executed, see the next changes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Milos Vyletel <milos@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151117170523.GA13955@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
Now that stop_two_cpus() path does not check cpu_active() we can remove preempt_disable(), it was only needed to ensure that stop_machine() can not be called after we observe cpu_active() == T and before we queue the new work. Also, turn the pointless and confusing ->executed check into WARN_ON(). We know that both works must be executed, otherwise we have a bug. And in fact I think that done->executed should die, see the next changes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Milos Vyletel <milos@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151115193314.GA8249@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Oleg Nesterov authored
stop_one_cpu_nowait(fn) will crash the kernel if the callback returns nonzero, work->done == NULL in this case. This needs more cleanups, cpu_stop_signal_done() is called right after we check done != NULL and it does the same check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Milos Vyletel <milos@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20151115193311.GA8242@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Geliang Tang authored
Use list_is_singular() to check if run_list has only one entry. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a5453fafd735affcf28e53a1d0a3d6965cb5dbb5.1447582547.git.geliangtang@163.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Joonwoo Park authored
At present scheduler resets task's wait start timestamp when the task migrates to another rq. This misleads scheduler itself into reporting less wait time than actual by omitting time spent for waiting prior to migration and also more wait count than actual by counting migration as wait end event which can be seen by trace or /proc/<pid>/sched with CONFIG_SCHEDSTATS=y. Carry forward migrating task's wait time prior to migration and don't count migration as a wait end event to fix such statistics error. In order to determine whether task is migrating mark task->on_rq with TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration. Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: ohaugan@codeaurora.org Link: http://lkml.kernel.org/r/20151113033854.GA4247@codeaurora.orgSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
There is a fundamental mismatch between the runtime based NUMA scanning at the task level, and the wall clock time NUMA scanning at the mm level. On a severely overloaded system, with very large processes, this mismatch can cause the system to spend all of its time in change_prot_numa(). This can happen if the task spends at least two ticks in change_prot_numa(), and only gets two ticks of CPU time in the real time between two scan intervals of the mm. This patch ensures that a task never spends more than 3% of run time scanning PTEs. It does that by ensuring that in-between task_numa_work() runs, the task spends at least 32x as much time on other things than it did on task_numa_work(). This is done stochastically: if a timer tick happens, or the task gets rescheduled during task_numa_work(), we delay a future run of task_numa_work() until the task has spent at least 32x the amount of CPU time doing something else, as it spent inside task_numa_work(). The longer task_numa_work() takes, the more likely it is this happens. If task_numa_work() takes very little time, chances are low that that code will do anything, but we will not care. Reported-and-tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: mgorman@suse.de Link: http://lkml.kernel.org/r/1446756983-28173-3-git-send-email-riel@redhat.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Byungchul Park authored
Usually the tick can be stopped for an idle CPU in NOHZ. However in NOHZ_FULL mode, a non-idle CPU's tick can also be stopped. However, update_cpu_load_nohz() does not consider the case a non-idle CPU's tick has been stopped at all. This patch makes the update_cpu_load_nohz() know if the calling path comes from NOHZ_FULL or idle NOHZ. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1447115762-19734-3-git-send-email-byungchul.park@lge.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Byungchul Park authored
There are some cases where distance between ticks is more than one tick while the CPU is not idle, e.g. full NOHZ. However __update_cpu_load() assumes it is the idle tickless case if the distance between ticks is more than 1, even though it can be the active tickless case as well. Thus in the active tickless case, updating the CPU load will not be performed correctly. Where the current code assumes the load for each tick is zero, this is (obviously) not true in non-idle tickless case. We can approximately consider the load ~= this_rq->cpu_load[0] during tickless in non-idle tickless case. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1444816056-11886-2-git-send-email-byungchul.park@lge.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Kosuku reports that there were a fair number of buggy waitqueue_active() users and this function deserves a big comment in order to avoid growing more. Reported-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Dietmar Eggemann authored
Commit cd126afe ("sched/fair: Remove rq's runnable avg") got rid of rq->avg and so there is no need to update it any more when entering or exiting idle. Remove the now empty functions idle_{enter|exit}_fair(). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/1445342681-17171-1-git-send-email-dietmar.eggemann@arm.comSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Arnd Bergmann authored
The push_irq_work_func() function is conditionally defined only when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the forward declaration remains visibile without HAVE_RT_PUSH_IPI, causing a gcc warning in ARM64 allnoconfig: kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function] This changes the code to use the same condition for both the declaration and the function definition, which gets rid of the warning. As Peter Zijlstra, we can possibly get rid of the whole HAVE_RT_PUSH_IPI thing after: 8053871d ("smp: Fix smp_call_function_single_async() locking") Until that is done, this patch can be used to avoid the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b6366f04 ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfelSigned-off-by: Ingo Molnar <mingo@kernel.org>
-
Linus Torvalds authored
-
- 22 Nov, 2015 4 commits
-
-
Linus Torvalds authored
Merge slub bulk allocator updates from Andrew Morton: "This missed the merge window because I was waiting for some repairs to come in. Nothing actually uses the bulk allocator yet and the changes to other code paths are pretty small. And the net guys are waiting for this so they can start merging the client code" More comments from Jesper Dangaard Brouer: "The kmem_cache_alloc_bulk() call, in mm/slub.c, were included in previous kernel. The present version contains a bug. Vladimir Davydov noticed it contained a bug, when kernel is compiled with CONFIG_MEMCG_KMEM (see commit 03ec0ed5: "slub: fix kmem cgroup bug in kmem_cache_alloc_bulk"). Plus the mem cgroup counterpart in kmem_cache_free_bulk() were missing (see commit 03374518 "slub: add missing kmem cgroup support to kmem_cache_free_bulk"). I don't consider the fix stable-material because there are no in-tree users of the API. But with known bugs (for memcg) I cannot start using the API in the net-tree" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: slab/slub: adjust kmem_cache_alloc_bulk API slub: add missing kmem cgroup support to kmem_cache_free_bulk slub: fix kmem cgroup bug in kmem_cache_alloc_bulk slub: optimize bulk slowpath free by detached freelist slub: support for bulk free with SLUB freelists
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/ttyLinus Torvalds authored
Pull tty/serial fixes from Greg KH: "Here are a few small tty/serial driver fixes for 4.4-rc2 that resolve some reported problems. All have been in linux-next, full details are in the shortlog below" * tag 'tty-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: serial: export fsl8250_handle_irq serial: 8250_mid: Add missing dependency tty: audit: Fix audit source serial: etraxfs-uart: Fix crash serial: fsl_lpuart: Fix earlycon support bcm63xx_uart: Use the device name when registering an interrupt tty: Fix direct use of tty buffer work tty: Fix tty_send_xchar() lock order inversion
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/stagingLinus Torvalds authored
Pull staging/IIO fixes from Greg KH: "Here are some staging and iio driver fixes for 4.4-rc2. All of these are in response to issues that have been reported and have been in linux-next for a while" * tag 'staging-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: Revert "Staging: wilc1000: coreconfigurator: Drop unneeded wrapper functions" iio: adc: xilinx: Fix VREFN scale iio: si7020: Swap data byte order iio: adc: vf610_adc: Fix division by zero error iio:ad7793: Fix ad7785 product ID iio: ad5064: Fix ad5629/ad5669 shift iio:ad5064: Make sure ad5064_i2c_write() returns 0 on success iio: lpc32xx_adc: fix warnings caused by enabling unprepared clock staging: iio: select IRQ_WORK for IIO_DUMMY_EVGEN vf610_adc: Fix internal temperature calculation
-
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usbLinus Torvalds authored
Pull USB fixes from Greg KH: "Here are a number of USB fixes and new device ids for 4.4-rc2. All have been in linux-next and the details are in the shortlog" * tag 'usb-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (28 commits) usblp: do not set TASK_INTERRUPTIBLE before lock USB: MAINTAINERS: cxacru usb: kconfig: fix warning of select USB_OTG USB: option: add XS Stick W100-2 from 4G Systems xhci: Fix a race in usb2 LPM resume, blocking U3 for usb2 devices usb: xhci: fix checking ep busy for CFC xhci: Workaround to get Intel xHCI reset working more reliably usb: chipidea: imx: fix a possible NULL dereference usb: chipidea: usbmisc_imx: fix a possible NULL dereference usb: chipidea: otg: gadget module load and unload support usb: chipidea: debug: disable usb irq while role switch ARM: dts: imx27.dtsi: change the clock information for usb usb: chipidea: imx: refine clock operations to adapt for all platforms usb: gadget: atmel_usba_udc: Expose correct device speed usb: musb: enable usb_dma parameter usb: phy: phy-mxs-usb: fix a possible NULL dereference usb: dwc3: gadget: let us set lower max_speed usb: musb: fix tx fifo flush handling usb: gadget: f_loopback: fix the warning during the enumeration usb: dwc2: host: Fix remote wakeup when not in DWC2_L2 ...
-