An error occurred fetching the project authors.
- 22 Feb, 2014 5 commits
-
-
Thomas Gleixner authored
The following scenario does not work correctly: Runqueue of CPUx contains two runnable and pinned tasks: T1: SCHED_FIFO, prio 80 T2: SCHED_FIFO, prio 80 T1 is on the cpu and executes the following syscalls (classic priority ceiling scenario): sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90); ... sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80); ... Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to sleep the scheduler picks T2. Surprise! The same happens w/o actual preemption when T1 is forced into the scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task() which returns T2. So T1 gets preempted and scheduled out. This happens because sched_setscheduler() dequeues T1 from the prio 90 list and then enqueues it on the tail of the prio 80 list behind T2. This violates the POSIX spec and surprises user space which relies on the guarantee that SCHED_FIFO tasks are not scheduled out unless they give the CPU up voluntarily or are preempted by a higher priority task. In the latter case the preempted task must get back on the CPU after the preempting task schedules out again. We fixed a similar issue already in commit 60db48ca (sched: Queue a deboosted task to the head of the RT prio queue). The same treatment is necessary for sched_setscheduler(). So enqueue to head of the prio bucket list if the priority of the task is lowered. It might be possible that existing user space relies on the current behaviour, but it can be considered highly unlikely due to the corner case nature of the application scenario. Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-6-git-send-email-bigeasy@linutronix.deSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Thomas Gleixner authored
If the policy and priority remain unchanged a possible modification of p->sched_reset_on_fork gets lost in the early exit path. Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> [ Rebase ontop of v3.14-rc1. ] Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-5-git-send-email-bigeasy@linutronix.deSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Thomas Gleixner authored
might_sleep() can tell us where interrupts have been disabled, but we have no idea what disabled preemption. Add some debug infrastructure. Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-4-git-send-email-bigeasy@linutronix.deSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Thomas Gleixner authored
Idle is not allowed to call sleeping functions ever! Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-3-git-send-email-bigeasy@linutronix.deSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Thomas Gleixner authored
We stumbled in RT over a SMP bringup issue on ARM where the idle->on_rq == 0 was causing try_to_wakeup() on the other cpu to run into nada land. After adding that idle->on_rq = 1; I was able to find the root cause of the lockup: the idle task on the newly woken up cpu was fiddling with a sleeping spinlock, which is a nono. I kept the init of idle->on_rq to keep the state consistent and to avoid another long lasting debug session. As a side note, the whole debug mess could have been avoided if might_sleep() would have yelled when called from the idle task. That's fixed with patch 2/6 - and that one actually has a changelog :) Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-2-git-send-email-bigeasy@linutronix.deSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 21 Feb, 2014 1 commit
-
-
Peter Zijlstra authored
Dan Carpenter reported: > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338) > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005) Kirill also spotted that migrate_tasks() will have an instant NULL deref because pick_next_task() will immediately deref prev. Instead of fixing all the corner cases because migrate_tasks() can pass in a NULL prev task in the unlikely case of hot-un-plug, provide a fake task such that we can remove all the NULL checks from the far more common paths. A further problem; not previously spotted; is that because we pushed pre_schedule() and idle_balance() into pick_next_task() we now need to avoid those getting called and pulling more tasks on our dying CPU. We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1. We also note that since we call pick_next_task() exactly the amount of times we have runnable tasks present, we should never land in idle_balance(). Fixes: 38033c37 ("sched: Push down pre_schedule() and idle_balance()") Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Reported-by:
Kirill Tkhai <tkhai@yandex.ru> Reported-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.netSigned-off-by:
Thomas Gleixner <tglx@linutronix.de>
-
- 11 Feb, 2014 2 commits
-
-
Alex Shi authored
Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost. It's useful to know these values in debug mode. Signed-off-by:
Alex Shi <alex.shi@linaro.org> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/52E0F3BF.5020904@linaro.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
This patch both merged idle_balance() and pre_schedule() and pushes both of them into pick_next_task(). Conceptually pre_schedule() and idle_balance() are rather similar, both are used to pull more work onto the current CPU. We cannot however first move idle_balance() into pre_schedule_fair() since there is no guarantee the last runnable task is a fair task, and thus we would miss newidle balances. Similarly, the dl and rt pre_schedule calls must be ran before idle_balance() since their respective tasks have higher priority and it would not do to delay their execution searching for less important tasks first. However, by noticing that pick_next_tasks() already traverses the sched_class hierarchy in the right order, we can get the right behaviour and do away with both calls. We must however change the special case optimization to also require that prev is of sched_class_fair, otherwise we can miss doing a dl or rt pull where we needed one. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 10 Feb, 2014 3 commits
-
-
Peter Zijlstra authored
In order to avoid having to do put/set on a whole cgroup hierarchy when we context switch, push the put into pick_next_task() so that both operations are in the same function. Further changes then allow us to possibly optimize away redundant work. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptopSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Daniel Lezcano authored
idle_balance() modifies the rq->idle_stamp field, making this information shared across core.c and fair.c. As we know if the cpu is going to idle or not with the previous patch, let's encapsulate the rq->idle_stamp information in core.c by moving it up to the caller. The idle_balance() function returns true in case a balancing occured and the cpu won't be idle, false if no balance happened and the cpu is going idle. Signed-off-by:
Daniel Lezcano <daniel.lezcano@linaro.org> Cc: alex.shi@linaro.org Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Daniel Lezcano authored
The cpu parameter passed to idle_balance() is not needed as it could be retrieved from 'struct rq.' Signed-off-by:
Daniel Lezcano <daniel.lezcano@linaro.org> Cc: alex.shi@linaro.org Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389949444-14821-1-git-send-email-daniel.lezcano@linaro.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 09 Feb, 2014 1 commit
-
-
Dongsheng Yang authored
As patch "sched: Move the priority specific bits into a new header file" exposes the priority related macros in linux/sched/prio.h, we don't have to implement task_nice() in kernel/sched/core.c any more. This patch implements it in linux/sched/sched.h as static inline function, saving the kernel stack and enhancing performance a bit. Signed-off-by:
Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Cc: clark.williams@gmail.com Cc: rostedt@goodmis.org Cc: raistlin@linux.it Cc: juri.lelli@gmail.com Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 28 Jan, 2014 3 commits
-
-
Rik van Riel authored
Tracing the code that decides the active nodes has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitudes more memory than the threads that do all the active work. This resulted in the node with the garbage collector being marked the only active node in the group. This issue is avoided if we weigh the statistics by CPU use of each task in the numa group, instead of by how many faults each thread has occurred. To achieve this, we normalize the number of faults to the fraction of faults that occurred on each node, and then multiply that fraction by the fraction of CPU time the task has used since the last time task_numa_placement was invoked. This way the nodes in the active node mask will be the ones where the tasks from the numa group are most actively running, and the influence of eg. the garbage collector and other do-little threads is properly minimized. On a 4 node system, using CPU use statistics calculated over a longer interval results in about 1% fewer page migrations with two 32-warehouse specjbb runs on a 4 node system, and about 5% fewer page migrations, as well as 1% better throughput, with two 8-warehouse specjbb runs, as compared with the shorter term statistics kept by the scheduler. Signed-off-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-7-git-send-email-riel@redhat.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Rik van Riel authored
In order to get a more consistent naming scheme, making it clear which fault statistics track memory locality, and which track CPU locality, rename the memory fault statistics. Suggested-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-3-git-send-email-riel@redhat.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Not all classes implement (or can implement) a useful get_rr_interval() function, default to a 0 time-slice for them. This fixes a crash reported by Tommi Rantala. Reported-by:
Tommi Rantala <tt.rantala@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140127105413.GC11314@laptop.programming.kicks-ass.netSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 24 Jan, 2014 1 commit
-
-
Andi Kleen authored
Add a working sysctl to enable/disable automatic numa memory balancing at runtime. This allows us to track down performance problems with this feature and is generally a good idea. This was possible earlier through debugfs, but only with special debugging options set. Also fix the boot message. [akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/] Signed-off-by:
Andi Kleen <ak@linux.intel.com> Acked-by:
Mel Gorman <mgorman@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 22 Jan, 2014 1 commit
-
-
Mel Gorman authored
This patch adds three tracepoints o trace_sched_move_numa when a task is moved to a node o trace_sched_swap_numa when a task is swapped with another task o trace_sched_stick_numa when a numa-related migration fails The tracepoints allow the NUMA scheduler activity to be monitored and the following high-level metrics can be calculated o NUMA migrated stuck nr trace_sched_stick_numa o NUMA migrated idle nr trace_sched_move_numa o NUMA migrated swapped nr trace_sched_swap_numa o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen) o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped) o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid Maybe a small number of these are acceptable but a high number would be a major surprise. It would be even worse if bounces are frequent. o NUMA avg task migs. Average number of migrations for tasks o NUMA stddev task mig Self-explanatory o NUMA max task migs. Maximum number of migrations for a single task In general the intent of the tracepoints is to help diagnose problems where automatic NUMA balancing appears to be doing an excessive amount of useless work. [akpm@linux-foundation.org: remove semicolon-after-if, repair coding-style] Signed-off-by:
Mel Gorman <mgorman@suse.de> Reviewed-by:
Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 16 Jan, 2014 7 commits
-
-
Peter Zijlstra authored
With the introduction of sched_attr::sched_nice we need to check if we've got permission to actually change the nice value. Daniel found that can_nice() would always fail; and upon inspection it turns out that can_nice() only tests to see if we can lower the nice value, but it doesn't validate if we're lowering or not. Therefore amend the test to only call can_nice() when we lower the nice value. Reported-and-Tested-by:
Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: raistlin@linux.it Cc: juri.lelli@gmail.com Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Fixes: d50dde5a ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI") Link: http://lkml.kernel.org/r/20140116165425.GA9481@laptop.programming.kicks-ass.netSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
I noticed the new sched_{set,get}attr() calls didn't properly deal with the SCHED_RESET_ON_FORK hack. Instead of propagating the flags in high bits nonsense use the brand spanking new attr::sched_flags field. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Link: http://lkml.kernel.org/r/20140115162242.GJ31570@twins.programming.kicks-ass.netSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Fengguang Wu reported the following build warning: > kernel/sched/core.c:3067 __sched_setscheduler() warn: unsigned 'attr->sched_priority' is never less than zero. Since it doesn't make sense for attr::sched_priority to be negative, remove the check, since we already test for an upper limit any actual negative values passed in through the old param::sched_priority field will still be detected. Reported-by:
Fengguang Wu <fengguang.wu@intel.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Fixes: d50dde5a ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI") Link: http://lkml.kernel.org/n/tip-fid9nalzii2r5voxtf4eh5kz@git.kernel.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Wu reported LTP failures: > ltp.sched_setparam02.1.TFAIL > ltp.sched_setparam02.2.TFAIL > ltp.sched_setparam02.3.TFAIL > ltp.sched_setparam03.1.TFAIL There were 2 things wrong; firstly __setscheduler() failed on sched_setparam()'s policy = -1, fix that by reading from p->policy in that case. Secondly, getparam() (and getattr()) would still report !0 sched_priority for !FIFO/RR tasks after having been such. So unconditionally set p->rt_priority. Reported-by:
Fengguang Wu <fengguang.wu@intel.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Fixes: d50dde5a ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI") Link: http://lkml.kernel.org/r/20140115153320.GH31570@twins.programming.kicks-ass.netSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Previously sched_setscheduler() and sched_setparam() would not affect the nice value of a task, restore this behaviour. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: raistlin@linux.it Cc: juri.lelli@gmail.com Cc: Michael wang <wangyun@linux.vnet.ibm.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Fixes: d50dde5a ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI") Link: http://lkml.kernel.org/r/20140115113015.GB31570@twins.programming.kicks-ass.netSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Juri Lelli authored
Fengguang Wu's kbuild test robot reported the following new htmldocs warnings: >>> Warning(kernel/sched/core.c:3380): No description found for parameter 'uattr' >>> Warning(kernel/sched/core.c:3380): Excess function parameter 'attr' description in 'sys_sched_setattr' >>> Warning(kernel/sched/core.c:3520): No description found for parameter 'uattr' >>> Warning(kernel/sched/core.c:3520): Excess function parameter 'attr' description in 'sys_sched_getattr' The second argument to sys_sched_{setattr,getattr}() is named uattr (not attr). Reported-by:
Fengguang Wu <fengguang.wu@intel.com> Signed-off-by:
Juri Lelli <juri.lelli@gmail.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Dario Faggioli <raistlin@linux.it> Fixes: d50dde5a ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI") Link: http://lkml.kernel.org/r/52D5552D.5000102@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
fix these new sparse warnings: >> kernel/sched/core.c:305:14: sparse: symbol 'sysctl_sched_dl_period' was not declared. Should it be static? >> kernel/sched/core.c:306:5: sparse: symbol 'sysctl_sched_dl_runtime' was not declared. Should it be static? Better still, they're completely unused so remove them. Reported-by:
Fengguang Wu <fengguang.wu@intel.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Link: http://lkml.kernel.org/n/tip-ke0shkG7vMnzmcdqhhiymyem@git.kernel.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 15 Jan, 2014 1 commit
-
-
Kevin Hilman authored
While calculating the scheduler tick max deferment, the delta is converted from microseconds to nanoseconds through a multiplication against NSEC_PER_USEC. But this microseconds operand is an unsigned int, thus the result may likely overflow. The result is cast to u64 but only once the operation is completed, which is too late to avoid overflown result. This is currently not a problem because the scheduler tick max deferment is 1 second. But this may become an issue as we plan to make this value tunable. So lets fix this by casting the usecs value to u64 before multiplying by NSECS_PER_USEC. Also to prevent from this kind of mistake to happen again, move this ad-hoc jiffies -> nsecs conversion to a new helper. Signed-off-by:
Kevin Hilman <khilman@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Kevin Hilman <khilman@linaro.org> Link: http://lkml.kernel.org/r/1387315388-31676-2-git-send-email-khilman@linaro.org [move ad-hoc conversion to jiffies_to_nsecs helper] Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com>
-
- 13 Jan, 2014 13 commits
-
-
Peter Zijlstra authored
With various drivers wanting to inject idle time; we get people calling idle routines outside of the idle loop proper. Therefore we need to be extra careful about not missing TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations. While looking at this, I also realized there's a small window in the existing idle loop where we can miss TIF_NEED_RESCHED; when it hits right after the tif_need_resched() test at the end of the loop but right before the need_resched() test at the start of the loop. So move preempt_fold_need_resched() out of the loop where we're guaranteed to have TIF_NEED_RESCHED set. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-x9jgh45oeayzajz2mjt0y7d6@git.kernel.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Daniel Lezcano authored
The cpu information is already stored in the struct rq, so no need to pass it as parameter to the trigger_load_balance function. Cc: linaro-kernel@lists.linaro.org Cc: preeti.lkml@gmail.com Cc: mingo@redhat.com Cc: peterz@infradead.org Signed-off-by:
Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389008085-9069-2-git-send-email-daniel.lezcano@linaro.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
The current hotplug admission control is broken because: CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task() cannot fail and hard assumes it _will_ move all tasks off of the dying cpu, failing this will break hotplug. The much simpler solution is a DOWN_PREPARE handler that fails when removing one CPU gets us below the total allocated bandwidth. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131220171343.GL2480@laptop.programming.kicks-ass.netSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Remove the deadline specific sysctls for now. The problem with them is that the interaction with the exisiting rt knobs is nearly impossible to get right. The current (as per before this patch) situation is that the rt and dl bandwidth is completely separate and we enforce rt+dl < 100%. This is undesirable because this means that the rt default of 95% leaves us hardly any room, even though dl tasks are saver than rt tasks. Another proposed solution was (a discarted patch) to have the dl bandwidth be a fraction of the rt bandwidth. This is highly confusing imo. Furthermore neither proposal is consistent with the situation we actually want; which is rt tasks ran from a dl server. In which case the rt bandwidth is a direct subset of dl. So whichever way we go, the introduction of dl controls at this point is painful. Therefore remove them and instead share the rt budget. This means that for now the rt knobs are used for dl admission control and the dl runtime is accounted against the rt runtime. I realise that this isn't entirely desirable either; but whatever we do we appear to need to change the interface later, so better have a small interface for now. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
For now deadline tasks are not allowed to set smp affinity; however the current tests are wrong, cure this. The test in __sched_setscheduler() also uses an on-stack cpumask_t which is a no-no. Change both tests to use cpumask_subset() such that we test the root domain span to be a subset of the cpus_allowed mask. This way we're sure the tasks can always run on all CPUs they can be balanced over, and have no effective affinity constraints. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-fyqtb1lapxca3lhsxv9cumdc@git.kernel.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Juri Lelli authored
Data from tests confirmed that the original active load balancing logic didn't scale neither in the number of CPU nor in the number of tasks (as sched_rt does). Here we provide a global data structure to keep track of deadlines of the running tasks in the system. The structure is composed by a bitmask showing the free CPUs and a max-heap, needed when the system is heavily loaded. The implementation and concurrent access scheme are kept simple by design. However, our measurements show that we can compete with sched_rt on large multi-CPUs machines [1]. Only the push path is addressed, the extension to use this structure also for pull decisions is straightforward. However, we are currently evaluating different (in order to decrease/avoid contention) data structures to solve possibly both problems. We are also going to re-run tests considering recent changes inside cpupri [2]. [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf [2] http://www.spinics.net/lists/linux-rt-users/msg06778.htmlSigned-off-by:
Juri Lelli <juri.lelli@gmail.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Dario Faggioli authored
In order of deadline scheduling to be effective and useful, it is important that some method of having the allocation of the available CPU bandwidth to tasks and task groups under control. This is usually called "admission control" and if it is not performed at all, no guarantee can be given on the actual scheduling of the -deadline tasks. Since when RT-throttling has been introduced each task group have a bandwidth associated to itself, calculated as a certain amount of runtime over a period. Moreover, to make it possible to manipulate such bandwidth, readable/writable controls have been added to both procfs (for system wide settings) and cgroupfs (for per-group settings). Therefore, the same interface is being used for controlling the bandwidth distrubution to -deadline tasks and task groups, i.e., new controls but with similar names, equivalent meaning and with the same usage paradigm are added. However, more discussion is needed in order to figure out how we want to manage SCHED_DEADLINE bandwidth at the task group level. Therefore, this patch adds a less sophisticated, but actually very sensible, mechanism to ensure that a certain utilization cap is not overcome per each root_domain (the single rq for !SMP configurations). Another main difference between deadline bandwidth management and RT-throttling is that -deadline tasks have bandwidth on their own (while -rt ones doesn't!), and thus we don't need an higher level throttling mechanism to enforce the desired bandwidth. This patch, therefore: - adds system wide deadline bandwidth management by means of: * /proc/sys/kernel/sched_dl_runtime_us, * /proc/sys/kernel/sched_dl_period_us, that determine (i.e., runtime / period) the total bandwidth available on each CPU of each root_domain for -deadline tasks; - couples the RT and deadline bandwidth management, i.e., enforces that the sum of how much bandwidth is being devoted to -rt -deadline tasks to stay below 100%. This means that, for a root_domain comprising M CPUs, -deadline tasks can be created until the sum of their bandwidths stay below: M * (sched_dl_runtime_us / sched_dl_period_us) It is also possible to disable this bandwidth management logic, and be thus free of oversubscribing the system up to any arbitrary level. Signed-off-by:
Dario Faggioli <raistlin@linux.it> Signed-off-by:
Juri Lelli <juri.lelli@gmail.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Dario Faggioli authored
Some method to deal with rt-mutexes and make sched_dl interact with the current PI-coded is needed, raising all but trivial issues, that needs (according to us) to be solved with some restructuring of the pi-code (i.e., going toward a proxy execution-ish implementation). This is under development, in the meanwhile, as a temporary solution, what this commits does is: - ensure a pi-lock owner with waiters is never throttled down. Instead, when it runs out of runtime, it immediately gets replenished and it's deadline is postponed; - the scheduling parameters (relative deadline and default runtime) used for that replenishments --during the whole period it holds the pi-lock-- are the ones of the waiting task with earliest deadline. Acting this way, we provide some kind of boosting to the lock-owner, still by using the existing (actually, slightly modified by the previous commit) pi-architecture. We would stress the fact that this is only a surely needed, all but clean solution to the problem. In the end it's only a way to re-start discussion within the community. So, as always, comments, ideas, rants, etc.. are welcome! :-) Signed-off-by:
Dario Faggioli <raistlin@linux.it> Signed-off-by:
Juri Lelli <juri.lelli@gmail.com> [ Added !RT_MUTEXES build fix. ] Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Peter Zijlstra authored
Turn the pi-chains from plist to rb-tree, in the rt_mutex code, and provide a proper comparison function for -deadline and -priority tasks. This is done mainly because: - classical prio field of the plist is just an int, which might not be enough for representing a deadline; - manipulating such a list would become O(nr_deadline_tasks), which might be to much, as the number of -deadline task increases. Therefore, an rb-tree is used, and tasks are queued in it according to the following logic: - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the one with the higher (lower, actually!) prio wins; - among a -priority and a -deadline task, the latter always wins; - among two -deadline tasks, the one with the earliest deadline wins. Queueing and dequeueing functions are changed accordingly, for both the list of a task's pi-waiters and the list of tasks blocked on a pi-lock. Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Dario Faggioli <raistlin@linux.it> Signed-off-by:
Juri Lelli <juri.lelli@gmail.com> Signed-off-again-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Harald Gustafsson authored
Make it possible to specify a period (different or equal than deadline) for -deadline tasks. Relative deadlines (D_i) are used on task arrivals to generate new scheduling (absolute) deadlines as "d = t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d = d + P_i" when the budget is zero. This is in general useful to model (and schedule) tasks that have slow activation rates (long periods), but have to be scheduled soon once activated (short deadlines). Signed-off-by:
Harald Gustafsson <harald.gustafsson@ericsson.com> Signed-off-by:
Dario Faggioli <raistlin@linux.it> Signed-off-by:
Juri Lelli <juri.lelli@gmail.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-7-git-send-email-juri.lelli@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Juri Lelli authored
Introduces data structures relevant for implementing dynamic migration of -deadline tasks and the logic for checking if runqueues are overloaded with -deadline tasks and for choosing where a task should migrate, when it is the case. Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can be moved among CPUs when necessary. It is also possible to bind a task to a (set of) CPU(s), thus restricting its capability of migrating, or forbidding migrations at all. The very same approach used in sched_rt is utilised: - -deadline tasks are kept into CPU-specific runqueues, - -deadline tasks are migrated among runqueues to achieve the following: * on an M-CPU system the M earliest deadline ready tasks are always running; * affinity/cpusets settings of all the -deadline tasks is always respected. Therefore, this very special form of "load balancing" is done with an active method, i.e., the scheduler pushes or pulls tasks between runqueues when they are woken up and/or (de)scheduled. IOW, every time a preemption occurs, the descheduled task might be sent to some other CPU (depending on its deadline) to continue executing (push). On the other hand, every time a CPU becomes idle, it might pull the second earliest deadline ready task from some other CPU. To enforce this, a pull operation is always attempted before taking any scheduling decision (pre_schedule()), as well as a push one after each scheduling decision (post_schedule()). In addition, when a task arrives or wakes up, the best CPU where to resume it is selected taking into account its affinity mask, the system topology, but also its deadline. E.g., from the scheduling point of view, the best CPU where to wake up (and also where to push) a task is the one which is running the task with the latest deadline among the M executing ones. In order to facilitate these decisions, per-runqueue "caching" of the deadlines of the currently running and of the first ready task is used. Queued but not running tasks are also parked in another rb-tree to speed-up pushes. Signed-off-by:
Juri Lelli <juri.lelli@gmail.com> Signed-off-by:
Dario Faggioli <raistlin@linux.it> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Dario Faggioli authored
Introduces the data structures, constants and symbols needed for SCHED_DEADLINE implementation. Core data structure of SCHED_DEADLINE are defined, along with their initializers. Hooks for checking if a task belong to the new policy are also added where they are needed. Adds a scheduling class, in sched/dl.c and a new policy called SCHED_DEADLINE. It is an implementation of the Earliest Deadline First (EDF) scheduling algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) that makes it possible to isolate the behaviour of tasks between each other. The typical -deadline task will be made up of a computation phase (instance) which is activated on a periodic or sporadic fashion. The expected (maximum) duration of such computation is called the task's runtime; the time interval by which each instance need to be completed is called the task's relative deadline. The task's absolute deadline is dynamically calculated as the time instant a task (better, an instance) activates plus the relative deadline. The EDF algorithms selects the task with the smallest absolute deadline as the one to be executed first, while the CBS ensures each task to run for at most its runtime every (relative) deadline length time interval, avoiding any interference between different tasks (bandwidth isolation). Thanks to this feature, also tasks that do not strictly comply with the computational model sketched above can effectively use the new policy. To summarize, this patch: - introduces the data structures, constants and symbols needed; - implements the core logic of the scheduling algorithm in the new scheduling class file; - provides all the glue code between the new scheduling class and the core scheduler and refines the interactions between sched/dl and the other existing scheduling classes. Signed-off-by:
Dario Faggioli <raistlin@linux.it> Signed-off-by:
Michael Trimarchi <michael@amarulasolutions.com> Signed-off-by:
Fabio Checconi <fchecconi@gmail.com> Signed-off-by:
Juri Lelli <juri.lelli@gmail.com> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
Dario Faggioli authored
Add the syscalls needed for supporting scheduling algorithms with extended scheduling parameters (e.g., SCHED_DEADLINE). In general, it makes possible to specify a periodic/sporadic task, that executes for a given amount of runtime at each instance, and is scheduled according to the urgency of their own timing constraints, i.e.: - a (maximum/typical) instance execution time, - a minimum interval between consecutive instances, - a time constraint by which each instance must be completed. Thus, both the data structure that holds the scheduling parameters of the tasks and the system calls dealing with it must be extended. Unfortunately, modifying the existing struct sched_param would break the ABI and result in potentially serious compatibility issues with legacy binaries. For these reasons, this patch: - defines the new struct sched_attr, containing all the fields that are necessary for specifying a task in the computational model described above; - defines and implements the new scheduling related syscalls that manipulate it, i.e., sched_setattr() and sched_getattr(). Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a proof of concept and for developing and testing purposes. Making them available on other architectures is straightforward. Since no "user" for these new parameters is introduced in this patch, the implementation of the new system calls is just identical to their already existing counterpart. Future patches that implement scheduling policies able to exploit the new data structure must also take care of modifying the sched_*attr() calls accordingly with their own purposes. Signed-off-by:
Dario Faggioli <raistlin@linux.it> [ Rewrote to use sched_attr. ] Signed-off-by:
Juri Lelli <juri.lelli@gmail.com> [ Removed sched_setscheduler2() for now. ] Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.comSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 17 Dec, 2013 1 commit
-
-
Mel Gorman authored
Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL dereference on sd_busy but the fix also altered what scheduling domain it used for the 'sd_llc' percpu variable. One impact of this is that a task selecting a runqueue may consider idle CPUs that are not cache siblings as candidates for running. Tasks are then running on CPUs that are not cache hot. This was found through bisection where ebizzy threads were not seeing equal performance and it looked like a scheduling fairness issue. This patch mitigates but does not completely fix the problem on all machines tested implying there may be an additional bug or a common root cause. Here are the average range of performance seen by individual ebizzy threads. It was tested on top of candidate patches related to x86 TLB range flushing. 4-core machine 3.13.0-rc3 3.13.0-rc3 vanilla fixsd-v3r3 Mean 1 0.00 ( 0.00%) 0.00 ( 0.00%) Mean 2 0.34 ( 0.00%) 0.10 ( 70.59%) Mean 3 1.29 ( 0.00%) 0.93 ( 27.91%) Mean 4 7.08 ( 0.00%) 0.77 ( 89.12%) Mean 5 193.54 ( 0.00%) 2.14 ( 98.89%) Mean 6 151.12 ( 0.00%) 2.06 ( 98.64%) Mean 7 115.38 ( 0.00%) 2.04 ( 98.23%) Mean 8 108.65 ( 0.00%) 1.92 ( 98.23%) 8-core machine Mean 1 0.00 ( 0.00%) 0.00 ( 0.00%) Mean 2 0.40 ( 0.00%) 0.21 ( 47.50%) Mean 3 23.73 ( 0.00%) 0.89 ( 96.25%) Mean 4 12.79 ( 0.00%) 1.04 ( 91.87%) Mean 5 13.08 ( 0.00%) 2.42 ( 81.50%) Mean 6 23.21 ( 0.00%) 69.46 (-199.27%) Mean 7 15.85 ( 0.00%) 101.72 (-541.77%) Mean 8 109.37 ( 0.00%) 19.13 ( 82.51%) Mean 12 124.84 ( 0.00%) 28.62 ( 77.07%) Mean 16 113.50 ( 0.00%) 24.16 ( 78.71%) It's eliminated for one machine and reduced for another. Signed-off-by:
Mel Gorman <mgorman@suse.de> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: H Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.deSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 11 Dec, 2013 1 commit
-
-
Peter Zijlstra authored
Yinghai reported that he saw a /0 in sg_capacity on his EX parts. Make sure to always initialize power_orig now that we actually use it. Ideally build_sched_domains() -> init_sched_groups_power() would also initialize this; but for some yet unexplained reason some setups seem to miss updates there. Reported-by:
Yinghai Lu <yinghai@kernel.org> Tested-by:
Yinghai Lu <yinghai@kernel.org> Signed-off-by:
Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.orgSigned-off-by:
Ingo Molnar <mingo@kernel.org>
-