1. 16 Jan, 2014 10 commits
  2. 13 Jan, 2014 30 commits
    • Peter Zijlstra's avatar
      sched, thermal: Clean up preempt_enable_no_resched() abuse · 130816ce
      Peter Zijlstra authored
      The only valid use of preempt_enable_no_resched() is if the very next
      line is schedule() or if we know preemption cannot actually be enabled
      by that statement due to known more preempt_count 'refs'.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: rjw@rjwysocki.net
      Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Cc: rui.zhang@intel.com
      Cc: jacob.jun.pan@linux.intel.com
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: hpa@zytor.com
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: lenb@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-zcfvacdlvlr63qmnn5i58vuj@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      130816ce
    • Peter Zijlstra's avatar
      sched, net: Fixup busy_loop_us_clock() · 37089834
      Peter Zijlstra authored
      The only valid use of preempt_enable_no_resched() is if the very next
      line is schedule() or if we know preemption cannot actually be enabled
      by that statement due to known more preempt_count 'refs'.
      
      This busy_poll stuff looks to be completely and utterly broken,
      sched_clock() can return utter garbage with interrupts enabled (rare
      but still) and it can drift unbounded between CPUs.
      
      This means that if you get preempted/migrated and your new CPU is
      years behind on the previous CPU we get to busy spin for a _very_ long
      time.
      
      There is a _REASON_ sched_clock() warns about preemptability -
      papering over it with a preempt_disable()/preempt_enable_no_resched()
      is just terminal brain damage on so many levels.
      
      Replace sched_clock() usage with local_clock() which has a bounded
      drift between CPUs (<2 jiffies).
      
      There is a further problem with the entire busy wait poll thing in
      that the spin time is additive to the syscall timeout, not inclusive.
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: rui.zhang@intel.com
      Cc: jacob.jun.pan@linux.intel.com
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: hpa@zytor.com
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: lenb@kernel.org
      Cc: rjw@rjwysocki.net
      Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      37089834
    • Peter Zijlstra's avatar
      sched, net: Clean up preempt_enable_no_resched() abuse · 1774e9f3
      Peter Zijlstra authored
      The only valid use of preempt_enable_no_resched() is if the very next
      line is schedule() or if we know preemption cannot actually be enabled
      by that statement due to known more preempt_count 'refs'.
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: rjw@rjwysocki.net
      Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: rui.zhang@intel.com
      Cc: jacob.jun.pan@linux.intel.com
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: hpa@zytor.com
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: lenb@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1774e9f3
    • Peter Zijlstra's avatar
      sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding · 8cb75e0c
      Peter Zijlstra authored
      With various drivers wanting to inject idle time; we get people
      calling idle routines outside of the idle loop proper.
      
      Therefore we need to be extra careful about not missing
      TIF_NEED_RESCHED -> PREEMPT_NEED_RESCHED propagations.
      
      While looking at this, I also realized there's a small window in the
      existing idle loop where we can miss TIF_NEED_RESCHED; when it hits
      right after the tif_need_resched() test at the end of the loop but
      right before the need_resched() test at the start of the loop.
      
      So move preempt_fold_need_resched() out of the loop where we're
      guaranteed to have TIF_NEED_RESCHED set.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-x9jgh45oeayzajz2mjt0y7d6@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8cb75e0c
    • Ingo Molnar's avatar
      Merge branch 'x86/idle' into sched/core · c9c89868
      Ingo Molnar authored
      Merge these x86 specific bits - we are going to add generic bits as well.
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      c9c89868
    • Peter Zijlstra's avatar
      sched/preempt, locking: Rework local_bh_{dis,en}able() · 0bd3a173
      Peter Zijlstra authored
      Currently local_bh_disable() is out-of-line for no apparent reason.
      So inline it to save a few cycles on call/return nonsense, the
      function body is a single add on x86 (a few loads and store extra on
      load/store archs).
      
      Also expose two new local_bh functions:
      
        __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt);
      
      Which implement the actual local_bh_{dis,en}able() behaviour.
      
      The next patch uses the exposed @cnt argument to optimize bh lock
      functions.
      
      With build fixes from Jacob Pan.
      
      Cc: rjw@rjwysocki.net
      Cc: rui.zhang@intel.com
      Cc: jacob.jun.pan@linux.intel.com
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: hpa@zytor.com
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: lenb@kernel.org
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0bd3a173
    • Peter Zijlstra's avatar
      sched/clock, x86: Avoid a runtime condition in native_sched_clock() · 10b033d4
      Peter Zijlstra authored
      Use a static_key to avoid touching tsc_disabled and a runtime
      condition in native_sched_clock() -- less cachelines touched is always
      better.
      
                              MAINLINE   PRE       POST
      
          sched_clock_stable: 1          1         1
          (cold) sched_clock: 329841     215295    213039
          (cold) local_clock: 301773     220773    216084
          (warm) sched_clock: 38375      25659     25231
          (warm) local_clock: 100371     27242     27601
          (warm) rdtsc:       27340      24208     24203
          sched_clock_stable: 0          0         0
          (cold) sched_clock: 382634     237019    240055
          (cold) local_clock: 396890     294819    299942
          (warm) sched_clock: 38194      25609     25276
          (warm) local_clock: 143452     71232     73232
          (warm) rdtsc:       27345      24243     24244
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-hrz87bo37qke25bty6pnfy4b@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      10b033d4
    • Peter Zijlstra's avatar
      sched/clock: Fix up clear_sched_clock_stable() · 6577e42a
      Peter Zijlstra authored
      The below tells us the static_key conversion has a problem; since the
      exact point of clearing that flag isn't too important, delay the flip
      and use a workqueue to process it.
      
      [ ] TSC synchronization [CPU#0 -> CPU#22]:
      [ ] Measured 8 cycles TSC warp between CPUs, turning off TSC clock.
      [ ]
      [ ] ======================================================
      [ ] [ INFO: possible circular locking dependency detected ]
      [ ] 3.13.0-rc3-01745-g848b0d0322cb-dirty #637 Not tainted
      [ ] -------------------------------------------------------
      [ ] swapper/0/1 is trying to acquire lock:
      [ ]  (jump_label_mutex){+.+...}, at: [<ffffffff8115a637>] jump_label_lock+0x17/0x20
      [ ]
      [ ] but task is already holding lock:
      [ ]  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60
      [ ]
      [ ] which lock already depends on the new lock.
      [ ]
      [ ]
      [ ] the existing dependency chain (in reverse order) is:
      [ ]
      [ ] -> #1 (cpu_hotplug.lock){+.+.+.}:
      [ ]        [<ffffffff810def00>] lock_acquire+0x90/0x130
      [ ]        [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
      [ ]        [<ffffffff81093fdc>] get_online_cpus+0x3c/0x60
      [ ]        [<ffffffff8104cc67>] arch_jump_label_transform+0x37/0x130
      [ ]        [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80
      [ ]        [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0
      [ ]        [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0
      [ ]        [<ffffffff810c0f65>] sched_feat_set+0xf5/0x100
      [ ]        [<ffffffff810c5bdc>] set_numabalancing_state+0x2c/0x30
      [ ]        [<ffffffff81d12f3d>] numa_policy_init+0x1af/0x1b7
      [ ]        [<ffffffff81cebdf4>] start_kernel+0x35d/0x41f
      [ ]        [<ffffffff81ceb5a5>] x86_64_start_reservations+0x2a/0x2c
      [ ]        [<ffffffff81ceb6a2>] x86_64_start_kernel+0xfb/0xfe
      [ ]
      [ ] -> #0 (jump_label_mutex){+.+...}:
      [ ]        [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0
      [ ]        [<ffffffff810def00>] lock_acquire+0x90/0x130
      [ ]        [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
      [ ]        [<ffffffff8115a637>] jump_label_lock+0x17/0x20
      [ ]        [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0
      [ ]        [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
      [ ]        [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
      [ ]        [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
      [ ]        [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
      [ ]        [<ffffffff810941cb>] _cpu_up+0xdb/0x160
      [ ]        [<ffffffff810942c9>] cpu_up+0x79/0x90
      [ ]        [<ffffffff81d0af6b>] smp_init+0x60/0x8c
      [ ]        [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
      [ ]        [<ffffffff8164e32e>] kernel_init+0xe/0x130
      [ ]        [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
      [ ]
      [ ] other info that might help us debug this:
      [ ]
      [ ]  Possible unsafe locking scenario:
      [ ]
      [ ]        CPU0                    CPU1
      [ ]        ----                    ----
      [ ]   lock(cpu_hotplug.lock);
      [ ]                                lock(jump_label_mutex);
      [ ]                                lock(cpu_hotplug.lock);
      [ ]   lock(jump_label_mutex);
      [ ]
      [ ]  *** DEADLOCK ***
      [ ]
      [ ] 2 locks held by swapper/0/1:
      [ ]  #0:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81094037>] cpu_maps_update_begin+0x17/0x20
      [ ]  #1:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109408b>] cpu_hotplug_begin+0x2b/0x60
      [ ]
      [ ] stack backtrace:
      [ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637
      [ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
      [ ]  ffffffff82c9c270 ffff880236843bb8 ffffffff8165c5f5 ffffffff82c9c270
      [ ]  ffff880236843bf8 ffffffff81658c02 ffff880236843c80 ffff8802368586a0
      [ ]  ffff880236858678 0000000000000001 0000000000000002 ffff880236858000
      [ ] Call Trace:
      [ ]  [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a
      [ ]  [<ffffffff81658c02>] print_circular_bug+0x1f9/0x207
      [ ]  [<ffffffff810de141>] __lock_acquire+0x1701/0x1eb0
      [ ]  [<ffffffff816680ff>] ? __atomic_notifier_call_chain+0x8f/0xb0
      [ ]  [<ffffffff810def00>] lock_acquire+0x90/0x130
      [ ]  [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
      [ ]  [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
      [ ]  [<ffffffff81661f83>] mutex_lock_nested+0x63/0x3e0
      [ ]  [<ffffffff8115a637>] ? jump_label_lock+0x17/0x20
      [ ]  [<ffffffff8115a637>] jump_label_lock+0x17/0x20
      [ ]  [<ffffffff8115aa3b>] static_key_slow_inc+0x6b/0xb0
      [ ]  [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
      [ ]  [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
      [ ]  [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
      [ ]  [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
      [ ]  [<ffffffff810941cb>] _cpu_up+0xdb/0x160
      [ ]  [<ffffffff810942c9>] cpu_up+0x79/0x90
      [ ]  [<ffffffff81d0af6b>] smp_init+0x60/0x8c
      [ ]  [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
      [ ]  [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
      [ ]  [<ffffffff8164e32e>] kernel_init+0xe/0x130
      [ ]  [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
      [ ]  [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
      [ ] ------------[ cut here ]------------
      [ ] WARNING: CPU: 0 PID: 1 at /usr/src/linux-2.6/kernel/smp.c:374 smp_call_function_many+0xad/0x300()
      [ ] Modules linked in:
      [ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc3-01745-g848b0d0322cb-dirty #637
      [ ] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
      [ ]  0000000000000009 ffff880236843be0 ffffffff8165c5f5 0000000000000000
      [ ]  ffff880236843c18 ffffffff81093d8c 0000000000000000 0000000000000000
      [ ]  ffffffff81ccd1a0 ffffffff810ca951 0000000000000000 ffff880236843c28
      [ ] Call Trace:
      [ ]  [<ffffffff8165c5f5>] dump_stack+0x4e/0x7a
      [ ]  [<ffffffff81093d8c>] warn_slowpath_common+0x8c/0xc0
      [ ]  [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
      [ ]  [<ffffffff81093dda>] warn_slowpath_null+0x1a/0x20
      [ ]  [<ffffffff8110b72d>] smp_call_function_many+0xad/0x300
      [ ]  [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
      [ ]  [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
      [ ]  [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
      [ ]  [<ffffffff8110ba96>] smp_call_function+0x46/0x80
      [ ]  [<ffffffff8104f200>] ? arch_unregister_cpu+0x30/0x30
      [ ]  [<ffffffff8110bb3c>] on_each_cpu+0x3c/0xa0
      [ ]  [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20
      [ ]  [<ffffffff810ca951>] ? sched_clock_tick+0x1/0xa0
      [ ]  [<ffffffff8104f964>] text_poke_bp+0x64/0xd0
      [ ]  [<ffffffff810ca950>] ? sched_clock_idle_sleep_event+0x20/0x20
      [ ]  [<ffffffff8104ccde>] arch_jump_label_transform+0xae/0x130
      [ ]  [<ffffffff8115a3cf>] __jump_label_update+0x5f/0x80
      [ ]  [<ffffffff8115a48d>] jump_label_update+0x9d/0xb0
      [ ]  [<ffffffff8115aa6d>] static_key_slow_inc+0x9d/0xb0
      [ ]  [<ffffffff810ca775>] clear_sched_clock_stable+0x15/0x20
      [ ]  [<ffffffff810503b3>] mark_tsc_unstable+0x23/0x70
      [ ]  [<ffffffff810772cb>] check_tsc_sync_source+0x14b/0x150
      [ ]  [<ffffffff81076612>] native_cpu_up+0x3a2/0x890
      [ ]  [<ffffffff810941cb>] _cpu_up+0xdb/0x160
      [ ]  [<ffffffff810942c9>] cpu_up+0x79/0x90
      [ ]  [<ffffffff81d0af6b>] smp_init+0x60/0x8c
      [ ]  [<ffffffff81cebf42>] kernel_init_freeable+0x8c/0x197
      [ ]  [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
      [ ]  [<ffffffff8164e32e>] kernel_init+0xe/0x130
      [ ]  [<ffffffff8166beec>] ret_from_fork+0x7c/0xb0
      [ ]  [<ffffffff8164e320>] ? rest_init+0xd0/0xd0
      [ ] ---[ end trace 6ff1df5620c49d26 ]---
      [ ] tsc: Marking TSC unstable due to check_tsc_sync_source failed
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-v55fgqj3nnyqnngmvuu8ep6h@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6577e42a
    • Peter Zijlstra's avatar
      sched/clock, x86: Use a static_key for sched_clock_stable · 35af99e6
      Peter Zijlstra authored
      In order to avoid the runtime condition and variable load turn
      sched_clock_stable into a static_key.
      
      Also provide a shorter implementation of local_clock() and
      cpu_clock(int) when sched_clock_stable==1.
      
                              MAINLINE   PRE       POST
      
          sched_clock_stable: 1          1         1
          (cold) sched_clock: 329841     221876    215295
          (cold) local_clock: 301773     234692    220773
          (warm) sched_clock: 38375      25602     25659
          (warm) local_clock: 100371     33265     27242
          (warm) rdtsc:       27340      24214     24208
          sched_clock_stable: 0          0         0
          (cold) sched_clock: 382634     235941    237019
          (cold) local_clock: 396890     297017    294819
          (warm) sched_clock: 38194      25233     25609
          (warm) local_clock: 143452     71234     71232
          (warm) rdtsc:       27345      24245     24243
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      35af99e6
    • Peter Zijlstra's avatar
      sched/clock: Remove local_irq_disable() from the clocks · ef08f0ff
      Peter Zijlstra authored
      Now that x86 no longer requires IRQs disabled for sched_clock() and
      ia64 never had this requirement (it doesn't seem to do cpufreq at
      all), we can remove the requirement of disabling IRQs.
      
                              MAINLINE   PRE        POST
      
          sched_clock_stable: 1          1          1
          (cold) sched_clock: 329841     257223     221876
          (cold) local_clock: 301773     309889     234692
          (warm) sched_clock: 38375      25280      25602
          (warm) local_clock: 100371     85268      33265
          (warm) rdtsc:       27340      24247      24214
          sched_clock_stable: 0          0          0
          (cold) sched_clock: 382634     301224     235941
          (cold) local_clock: 396890     399870     297017
          (warm) sched_clock: 38194      25630      25233
          (warm) local_clock: 143452     129629     71234
          (warm) rdtsc:       27345      24307      24245
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-36e5kohiasnr106d077mgubp@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ef08f0ff
    • Peter Zijlstra's avatar
      sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs · 20d1c86a
      Peter Zijlstra authored
      Use a ring-buffer like multi-version object structure which allows
      always having a coherent object; we use this to avoid having to
      disable IRQs while reading sched_clock() and avoids a problem when
      getting an NMI while changing the cyc2ns data.
      
                              MAINLINE   PRE        POST
      
          sched_clock_stable: 1          1          1
          (cold) sched_clock: 329841     331312     257223
          (cold) local_clock: 301773     310296     309889
          (warm) sched_clock: 38375      38247      25280
          (warm) local_clock: 100371     102713     85268
          (warm) rdtsc:       27340      27289      24247
          sched_clock_stable: 0          0          0
          (cold) sched_clock: 382634     372706     301224
          (cold) local_clock: 396890     399275     399870
          (warm) sched_clock: 38194      38124      25630
          (warm) local_clock: 143452     148698     129629
          (warm) rdtsc:       27345      27365      24307
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-s567in1e5ekq2nlyhn8f987r@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      20d1c86a
    • Peter Zijlstra's avatar
      sched/clock, x86: Move some cyc2ns() code around · 57c67da2
      Peter Zijlstra authored
      There are no __cycles_2_ns() users outside of arch/x86/kernel/tsc.c,
      so move it there.
      
      There are no cycles_2_ns() users.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-01lslnavfgo3kmbo4532zlcj@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      57c67da2
    • Peter Zijlstra's avatar
      sched/clock, x86: Use mul_u64_u32_shr() for native_sched_clock() · 5dd12c21
      Peter Zijlstra authored
      Use mul_u64_u32_shr() so that x86_64 can use a single 64x64->128 mul.
      
      Before:
      
      0000000000000560 <native_sched_clock>:
       560:   44 8b 1d 00 00 00 00    mov    0x0(%rip),%r11d        # 567 <native_sched_clock+0x7>
       567:   55                      push   %rbp
       568:   48 89 e5                mov    %rsp,%rbp
       56b:   45 85 db                test   %r11d,%r11d
       56e:   75 4f                   jne    5bf <native_sched_clock+0x5f>
       570:   0f 31                   rdtsc
       572:   89 c0                   mov    %eax,%eax
       574:   48 c1 e2 20             shl    $0x20,%rdx
       578:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
       57f:   48 09 c2                or     %rax,%rdx
       582:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
       589:   65 8b 04 25 00 00 00    mov    %gs:0x0,%eax
       590:   00
       591:   48 98                   cltq
       593:   48 8b 34 c5 00 00 00    mov    0x0(,%rax,8),%rsi
       59a:   00
       59b:   48 89 d0                mov    %rdx,%rax
       59e:   81 e2 ff 03 00 00       and    $0x3ff,%edx
       5a4:   48 c1 e8 0a             shr    $0xa,%rax
       5a8:   48 0f af 14 0e          imul   (%rsi,%rcx,1),%rdx
       5ad:   48 0f af 04 0e          imul   (%rsi,%rcx,1),%rax
       5b2:   5d                      pop    %rbp
       5b3:   48 03 04 3e             add    (%rsi,%rdi,1),%rax
       5b7:   48 c1 ea 0a             shr    $0xa,%rdx
       5bb:   48 01 d0                add    %rdx,%rax
       5be:   c3                      retq
      
      After:
      
      0000000000000550 <native_sched_clock>:
       550:   8b 3d 00 00 00 00       mov    0x0(%rip),%edi        # 556 <native_sched_clock+0x6>
       556:   55                      push   %rbp
       557:   48 89 e5                mov    %rsp,%rbp
       55a:   48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
       55e:   85 ff                   test   %edi,%edi
       560:   75 2c                   jne    58e <native_sched_clock+0x3e>
       562:   0f 31                   rdtsc
       564:   89 c0                   mov    %eax,%eax
       566:   48 c1 e2 20             shl    $0x20,%rdx
       56a:   48 09 c2                or     %rax,%rdx
       56d:   65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
       574:   00 00
       576:   89 c0                   mov    %eax,%eax
       578:   48 f7 e2                mul    %rdx
       57b:   65 48 8b 0c 25 00 00    mov    %gs:0x0,%rcx
       582:   00 00
       584:   c9                      leaveq
       585:   48 0f ac d0 0a          shrd   $0xa,%rdx,%rax
       58a:   48 01 c8                add    %rcx,%rax
       58d:   c3                      retq
      
                              MAINLINE   POST
      
          sched_clock_stable: 1	   1
          (cold) sched_clock: 329841     331312
          (cold) local_clock: 301773     310296
          (warm) sched_clock: 38375      38247
          (warm) local_clock: 100371     102713
          (warm) rdtsc:       27340      27289
          sched_clock_stable: 0          0
          (cold) sched_clock: 382634     372706
          (cold) local_clock: 396890     399275
          (warm) sched_clock: 38194      38124
          (warm) local_clock: 143452     148698
          (warm) rdtsc:       27345      27365
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-piu203ses5y1g36bnyw2n16x@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5dd12c21
    • Peter Zijlstra's avatar
      sched/preempt: Take away preempt_enable_no_resched() from modules · 62b94a08
      Peter Zijlstra authored
      Discourage drivers/modules to be creative with preemption.
      
      Sadly all is implemented in macros and inline so if they want to do
      evil they still can, but at least try and discourage some.
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Cc: rui.zhang@intel.com
      Cc: jacob.jun.pan@linux.intel.com
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: hpa@zytor.com
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: lenb@kernel.org
      Cc: rjw@rjwysocki.net
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-fn7h6vu8wtgxk0ih402qcijx@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      62b94a08
    • Peter Zijlstra's avatar
      locking: Optimize lock_bh functions · 9ea4c380
      Peter Zijlstra authored
      Currently all _bh_ lock functions do two preempt_count operations:
      
        local_bh_disable();
        preempt_disable();
      
      and for the unlock:
      
        preempt_enable_no_resched();
        local_bh_enable();
      
      Since its a waste of perfectly good cycles to modify the same variable
      twice when you can do it in one go; use the new
      __local_bh_{dis,en}able_ip() functions that allow us to provide a
      preempt_count value to add/sub.
      
      So define SOFTIRQ_LOCK_OFFSET as the offset a _bh_ lock needs to
      add/sub to be done in one go.
      
      As a bonus it gets rid of the preempt_enable_no_resched() usage.
      
      This reduces a 1000 loops of:
      
        spin_lock_bh(&bh_lock);
        spin_unlock_bh(&bh_lock);
      
      from 53596 cycles to 51995 cycles. I didn't do enough measurements to
      say for absolute sure that the result is significant but the the few
      runs I did for each suggest it is so.
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: jacob.jun.pan@linux.intel.com
      Cc: Mike Galbraith <bitbucket@online.de>
      Cc: hpa@zytor.com
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: lenb@kernel.org
      Cc: rjw@rjwysocki.net
      Cc: rui.zhang@intel.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      9ea4c380
    • Daniel Lezcano's avatar
      sched: Factor out the on_null_domain() checks in trigger_load_balance() · c726099e
      Daniel Lezcano authored
      The test on_null_domain is done twice in the trigger_load_balance function.
      
      Move the test at the begin of the function, so there is only one check.
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-9-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      c726099e
    • Daniel Lezcano's avatar
      sched: Pass 'struct rq' to nohz_idle_balance() · 208cb16b
      Daniel Lezcano authored
      The cpu information is stored in the struct rq. Pass the struct rq to
      nohz_idle_balance, so all the functions called in run_rebalance_domains have
      the same parameters and the 'this_cpu' variable becomes pointless.
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      [ Added !SMP build fix. ]
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-8-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      208cb16b
    • Daniel Lezcano's avatar
      sched: Pass 'struct rq' to rebalance_domains() · f7ed0a89
      Daniel Lezcano authored
      The cpu information is stored in the struct rq and the caller of the
      rebalance_domains function pass the cpu to retrieve the struct rq but
      it already has the struct rq info. Replace the cpu parameter with the
      struct rq.
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-7-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      f7ed0a89
    • Daniel Lezcano's avatar
      sched: Remove unused parameter from nohz_balancer_kick() · 0aeeeeba
      Daniel Lezcano authored
      The cpu parameter is no longer needed in nohz_balancer_kick, let's remove
      the parameter.
      Reviewed-by: default avatarPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-6-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0aeeeeba
    • Daniel Lezcano's avatar
    • Daniel Lezcano's avatar
      sched: Pass 'struct rq' to on_null_domain() · 63f609b1
      Daniel Lezcano authored
      The on_null_domain() function is getting the cpu to retrieve the struct rq
      associated with it.
      
      Pass 'struct rq' directly to the function as the caller already has the info.
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-4-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      63f609b1
    • Daniel Lezcano's avatar
      sched: Reduce nohz_kick_needed() parameters · 4a725627
      Daniel Lezcano authored
      The cpu information is already stored in the struct rq, so no need to pass it
      as parameter to the nohz_kick_needed function.
      
      The caller of this function just called idle_cpu() before to fill the
      rq->idle_balance field.
      
      Use rq->cpu and rq->idle_balance.
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-3-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      4a725627
    • Daniel Lezcano's avatar
      sched: Reduce trigger_load_balance() parameters · 7caff66f
      Daniel Lezcano authored
      The cpu information is already stored in the struct rq, so no need to pass it
      as parameter to the trigger_load_balance function.
      
      Cc: linaro-kernel@lists.linaro.org
      Cc: preeti.lkml@gmail.com
      Cc: mingo@redhat.com
      Cc: peterz@infradead.org
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-2-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7caff66f
    • Peter Zijlstra's avatar
      sched/deadline: Fix hotplug admission control · de212f18
      Peter Zijlstra authored
      The current hotplug admission control is broken because:
      
        CPU_DYING -> migration_call() -> migrate_tasks() -> __migrate_task()
      
      cannot fail and hard assumes it _will_ move all tasks off of the dying
      cpu, failing this will break hotplug.
      
      The much simpler solution is a DOWN_PREPARE handler that fails when
      removing one CPU gets us below the total allocated bandwidth.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20131220171343.GL2480@laptop.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      de212f18
    • Peter Zijlstra's avatar
      sched/deadline: Remove the sysctl_sched_dl knobs · 1724813d
      Peter Zijlstra authored
      Remove the deadline specific sysctls for now. The problem with them is
      that the interaction with the exisiting rt knobs is nearly impossible
      to get right.
      
      The current (as per before this patch) situation is that the rt and dl
      bandwidth is completely separate and we enforce rt+dl < 100%. This is
      undesirable because this means that the rt default of 95% leaves us
      hardly any room, even though dl tasks are saver than rt tasks.
      
      Another proposed solution was (a discarted patch) to have the dl
      bandwidth be a fraction of the rt bandwidth. This is highly
      confusing imo.
      
      Furthermore neither proposal is consistent with the situation we
      actually want; which is rt tasks ran from a dl server. In which case
      the rt bandwidth is a direct subset of dl.
      
      So whichever way we go, the introduction of dl controls at this point
      is painful. Therefore remove them and instead share the rt budget.
      
      This means that for now the rt knobs are used for dl admission control
      and the dl runtime is accounted against the rt runtime. I realise that
      this isn't entirely desirable either; but whatever we do we appear to
      need to change the interface later, so better have a small interface
      for now.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1724813d
    • Peter Zijlstra's avatar
      sched/deadline: Fix up the smp-affinity mask tests · e4099a5e
      Peter Zijlstra authored
      For now deadline tasks are not allowed to set smp affinity; however
      the current tests are wrong, cure this.
      
      The test in __sched_setscheduler() also uses an on-stack cpumask_t
      which is a no-no.
      
      Change both tests to use cpumask_subset() such that we test the root
      domain span to be a subset of the cpus_allowed mask. This way we're
      sure the tasks can always run on all CPUs they can be balanced over,
      and have no effective affinity constraints.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-fyqtb1lapxca3lhsxv9cumdc@git.kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e4099a5e
    • Juri Lelli's avatar
      sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap · 6bfd6d72
      Juri Lelli authored
      Data from tests confirmed that the original active load balancing
      logic didn't scale neither in the number of CPU nor in the number of
      tasks (as sched_rt does).
      
      Here we provide a global data structure to keep track of deadlines
      of the running tasks in the system. The structure is composed by
      a bitmask showing the free CPUs and a max-heap, needed when the system
      is heavily loaded.
      
      The implementation and concurrent access scheme are kept simple by
      design. However, our measurements show that we can compete with sched_rt
      on large multi-CPUs machines [1].
      
      Only the push path is addressed, the extension to use this structure
      also for pull decisions is straightforward. However, we are currently
      evaluating different (in order to decrease/avoid contention) data
      structures to solve possibly both problems. We are also going to re-run
      tests considering recent changes inside cpupri [2].
      
       [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
       [2] http://www.spinics.net/lists/linux-rt-users/msg06778.htmlSigned-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6bfd6d72
    • Dario Faggioli's avatar
      sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks · 332ac17e
      Dario Faggioli authored
      In order of deadline scheduling to be effective and useful, it is
      important that some method of having the allocation of the available
      CPU bandwidth to tasks and task groups under control.
      This is usually called "admission control" and if it is not performed
      at all, no guarantee can be given on the actual scheduling of the
      -deadline tasks.
      
      Since when RT-throttling has been introduced each task group have a
      bandwidth associated to itself, calculated as a certain amount of
      runtime over a period. Moreover, to make it possible to manipulate
      such bandwidth, readable/writable controls have been added to both
      procfs (for system wide settings) and cgroupfs (for per-group
      settings).
      
      Therefore, the same interface is being used for controlling the
      bandwidth distrubution to -deadline tasks and task groups, i.e.,
      new controls but with similar names, equivalent meaning and with
      the same usage paradigm are added.
      
      However, more discussion is needed in order to figure out how
      we want to manage SCHED_DEADLINE bandwidth at the task group level.
      Therefore, this patch adds a less sophisticated, but actually
      very sensible, mechanism to ensure that a certain utilization
      cap is not overcome per each root_domain (the single rq for !SMP
      configurations).
      
      Another main difference between deadline bandwidth management and
      RT-throttling is that -deadline tasks have bandwidth on their own
      (while -rt ones doesn't!), and thus we don't need an higher level
      throttling mechanism to enforce the desired bandwidth.
      
      This patch, therefore:
      
       - adds system wide deadline bandwidth management by means of:
          * /proc/sys/kernel/sched_dl_runtime_us,
          * /proc/sys/kernel/sched_dl_period_us,
         that determine (i.e., runtime / period) the total bandwidth
         available on each CPU of each root_domain for -deadline tasks;
      
       - couples the RT and deadline bandwidth management, i.e., enforces
         that the sum of how much bandwidth is being devoted to -rt
         -deadline tasks to stay below 100%.
      
      This means that, for a root_domain comprising M CPUs, -deadline tasks
      can be created until the sum of their bandwidths stay below:
      
          M * (sched_dl_runtime_us / sched_dl_period_us)
      
      It is also possible to disable this bandwidth management logic, and
      be thus free of oversubscribing the system up to any arbitrary level.
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      332ac17e
    • Dario Faggioli's avatar
      sched/deadline: Add SCHED_DEADLINE inheritance logic · 2d3d891d
      Dario Faggioli authored
      Some method to deal with rt-mutexes and make sched_dl interact with
      the current PI-coded is needed, raising all but trivial issues, that
      needs (according to us) to be solved with some restructuring of
      the pi-code (i.e., going toward a proxy execution-ish implementation).
      
      This is under development, in the meanwhile, as a temporary solution,
      what this commits does is:
      
       - ensure a pi-lock owner with waiters is never throttled down. Instead,
         when it runs out of runtime, it immediately gets replenished and it's
         deadline is postponed;
      
       - the scheduling parameters (relative deadline and default runtime)
         used for that replenishments --during the whole period it holds the
         pi-lock-- are the ones of the waiting task with earliest deadline.
      
      Acting this way, we provide some kind of boosting to the lock-owner,
      still by using the existing (actually, slightly modified by the previous
      commit) pi-architecture.
      
      We would stress the fact that this is only a surely needed, all but
      clean solution to the problem. In the end it's only a way to re-start
      discussion within the community. So, as always, comments, ideas, rants,
      etc.. are welcome! :-)
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      [ Added !RT_MUTEXES build fix. ]
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2d3d891d
    • Peter Zijlstra's avatar
      rtmutex: Turn the plist into an rb-tree · fb00aca4
      Peter Zijlstra authored
      Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
      and provide a proper comparison function for -deadline and
      -priority tasks.
      
      This is done mainly because:
       - classical prio field of the plist is just an int, which might
         not be enough for representing a deadline;
       - manipulating such a list would become O(nr_deadline_tasks),
         which might be to much, as the number of -deadline task increases.
      
      Therefore, an rb-tree is used, and tasks are queued in it according
      to the following logic:
       - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
         one with the higher (lower, actually!) prio wins;
       - among a -priority and a -deadline task, the latter always wins;
       - among two -deadline tasks, the one with the earliest deadline
         wins.
      
      Queueing and dequeueing functions are changed accordingly, for both
      the list of a task's pi-waiters and the list of tasks blocked on
      a pi-lock.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      Signed-off-again-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      fb00aca4