1. 14 Apr, 2021 3 commits
    • rseq: Optimize rseq_update_cpu_id() · 60af388d
      Eric Dumazet authored
      The two put_user() calls in rseq_update_cpu_id() are replaced by a pair
      of unsafe_put_user() calls with the appropriate user-access surroundings.
      
      This removes one stac/clac pair on x86 in the fast path.
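      
      A minimal sketch of the resulting pattern (the variable names and the
      fault label follow the surrounding function and are assumptions; the
      point is the single user-access window around both stores):
      
        /* One stac/clac pair instead of two: a single user-access window
         * wraps both stores; on fault, unsafe_put_user() jumps to a local
         * label that calls user_access_end() and returns -EFAULT.
         */
        if (!user_access_begin(rseq, sizeof(*rseq)))
                return -EFAULT;
        unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
        unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
        user_access_end();
        return 0;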
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lkml.kernel.org/r/20210413203352.71350-2-eric.dumazet@gmail.com
    • signal: Allow tasks to cache one sigqueue struct · 4bad58eb
      Thomas Gleixner authored
      The idea for this originates from the real-time tree, where it was
      introduced to make signal delivery for realtime applications more
      efficient. In quite a few of these application scenarios a control task
      signals workers to start their computations. There is usually only one
      signal per worker in flight. This works nicely as long as the kmem cache
      allocations do not hit the slow path and cause latencies.
      
      To cure this, an optimistic caching was introduced (limited to RT tasks)
      which allows a task to cache a single sigqueue in a pointer in task_struct
      instead of handing it back to the kmem cache after consuming a signal. When
      the next signal is sent to the task, the cached sigqueue is used instead of
      allocating a new one. This solved the problem for this set of application
      scenarios nicely.
      
      The task cache is not preallocated, so the first signal sent to a task
      always goes through the kmem cache allocator. The cached sigqueue stays
      around until the task exits and is freed when task::sighand is dropped.
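      
      As a rough illustration of the mechanism (a minimal sketch only; the
      field name sigqueue_cache and the helper shapes are assumptions, not
      necessarily the merged implementation, and the required serialization
      against concurrent senders is omitted):
      
        /* Free side (sketch): stash the sigqueue in the consuming task
         * instead of returning it to the kmem cache.
         */
        if (!READ_ONCE(current->sigqueue_cache))
                WRITE_ONCE(current->sigqueue_cache, q);
        else
                kmem_cache_free(sigqueue_cachep, q);
      
        /* Alloc side (sketch): when sending to task t, prefer its cached
         * entry and fall back to the kmem cache.
         */
        q = READ_ONCE(t->sigqueue_cache);
        if (q)
                WRITE_ONCE(t->sigqueue_cache, NULL);
        else
                q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);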
      
      After posting this solution for mainline the discussion came up whether
      this would be useful in general and should not be limited to realtime
      tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org
      
      One concern leading to the original limitation was to avoid a large
      number of pointlessly cached sigqueues in live tasks. The other concern
      was the interaction with RLIMIT_SIGPENDING, as these cached sigqueues are
      not accounted for.
      
      The accounting problem is real, but on the other hand slightly academic.
      After gathering some statistics it turned out that after boot of a regular
      distro install there are fewer than 10 sigqueues cached in ~1500 tasks.
      
      In case of a 'mass fork and fire signal to child' scenario the extra 80
      bytes of memory per task are well in the noise of the overall memory
      consumption of the fork bomb.
      
      If this should be limited, it would need an extra counter in struct
      user, more atomic instructions and a separate rlimit. Yet another
      tunable which is mostly unused.
      
      The caching is actually used. After boot and a full kernel compile on a
      64-CPU machine with make -j128, the number of 'allocations' looks like this:
      
        From slab:	   23996
        From task cache: 52223
      
      I.e. it reduces the number of slab cache operations by ~68%.
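      
      (That is, 52223 of the 23996 + 52223 = 76219 sigqueue 'allocations' were
      served from the task cache, roughly 68.5%.)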
      
      A typical pattern there is:
      
      <...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
      <...>-58488 __sigqueue_free:   cache ffff8881132df460
      <...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
        bash-1149 exit_task_sighand: free ffff8881132df460
        bash-1149 __sigqueue_free:   cache ffff8881103dc550
      
      The interesting sequence is that the exiting task 58488 grabs the sigqueue
      from bash's task cache to signal exit and bash sticks it back into its own
      cache. Lather, rinse and repeat.
      
      The caching is probably not noticeable for the general use case, but the
      benefit for latency-sensitive applications is clear. While kmem caches
      usually just serve from the fast path, slab merging (the default) can,
      depending on the usage pattern of the merged slabs, cause occasional
      slow-path allocations.
      
      The time spared per cached entry is a few microseconds per signal, which
      is not relevant for e.g. a kernel build, but for signal-heavy workloads
      it's measurable.
      
      As there is no real downside of this caching mechanism making it
      unconditionally available is preferred over more conditional code or new
      magic tunables.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
    • signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc() · 69995ebb
      Thomas Gleixner authored
      There is no point in having the conditional at the callsite.
      
      Just hand in the allocation mode flag to __sigqueue_alloc() and use it to
      initialize sigqueue::flags.
      
      No functional change.
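      
      Schematically, for the preallocation path (a sketch; the exact
      __sigqueue_alloc() parameter list is an assumption based on the
      description above):
      
        /* Before: the caller fixes up the flags after allocating. */
        q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
        if (q)
                q->flags |= SIGQUEUE_PREALLOC;
      
        /* After: the flag is handed in and __sigqueue_alloc() simply does
         * q->flags = sigqueue_flags; regular signal delivery passes 0.
         */
        q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0, SIGQUEUE_PREALLOC);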
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210322092258.898677147@linutronix.de
  2. 09 Apr, 2021 4 commits
    • sched/fair: Introduce a CPU capacity comparison helper · 4aed8aa4
      Valentin Schneider authored
      During load-balance, groups classified as group_misfit_task are filtered
      out if they do not pass
      
        group_smaller_max_cpu_capacity(<candidate group>, <local group>);
      
      which itself employs fits_capacity() to compare the sgc->max_capacity of
      both groups.
      
      Due to the underlying 20% margin, fits_capacity(X, 1024) will return
      false for any X > 819 (i.e. anything above ~80% of 1024). Tough luck, the
      capacity_orig values on e.g. the Pixel 4 are {261, 871, 1024}. If a
      CPU-bound task ends up on one of those "medium" CPUs, misfit migration
      will never intentionally upmigrate it to a CPU of higher capacity due to
      the aforementioned margin.
      
      One may argue the 20% margin of fits_capacity() is excessive with the
      advent of counter-enhanced load tracking (APERF/MPERF, AMUs), but one
      point here is that fits_capacity() is meant to compare a utilization
      value to a capacity value, whereas here it is being used to compare two
      capacity values. As CPU capacity and task utilization have different
      dynamics, a sensible approach here would be to add a new helper dedicated
      to comparing CPU capacities.
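      
      For illustration, such a dedicated helper could compare two capacity
      values with a much tighter margin than fits_capacity(). This is a
      minimal sketch only; the name capacity_greater and the ~5% margin are
      assumptions, not necessarily what the merged patch uses:
      
        /* Sketch: is cap1 noticeably greater than cap2? With a ~5% margin,
         * capacity_greater(1024, 871) is true, so a misfit task on a
         * "medium" CPU can still be upmigrated to a bigger CPU.
         */
        #define capacity_greater(cap1, cap2)  ((cap1) * 1024 > (cap2) * 1078)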
      
      Also note that comparing capacity extrema of local and source sched_groups
      doesn't make much sense when, at the end of the day, the imbalance will be
      pulled by a known env->dst_cpu, whose capacity can be anywhere within the
      local group's capacity extrema.
      
      While at it, replace group_smaller_{min, max}_cpu_capacity() with
      comparisons of the source group's min/max capacity and the destination
      CPU's capacity.
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Qais Yousef <qais.yousef@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
      Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
    • sched/fair: Clean up active balance nr_balance_failed trickery · 23fb06d9
      Valentin Schneider authored
      When triggering an active load balance, sd->nr_balance_failed is set to
      such a value that any further can_migrate_task() using said sd will ignore
      the output of task_hot().
      
      This behaviour makes sense, as active load balance intentionally preempts
      an rq's running task to migrate it right away, but this asynchronous write
      is a bit shoddy, as the stopper thread might run
      active_load_balance_cpu_stop before the sd->nr_balance_failed write either
      becomes visible to the stopper's CPU or even happens on the CPU that
      appended the stopper work.
      
      Add a struct lb_env flag to denote active balancing, and use it in
      can_migrate_task(). Remove the sd->nr_balance_failed write that served the
      same purpose. Clean up the LBF_DST_PINNED active balance special case.
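      
      The shape of the change, as a rough sketch (the flag name LBF_ACTIVE_LB
      and its exact placement are assumptions based on the description above):
      
        /* Sketch: the active balancer sets the flag when building its
         * lb_env, e.g. .flags = LBF_ACTIVE_LB, and can_migrate_task()
         * then lets the migration through without consulting task_hot()
         * or nr_balance_failed:
         */
        if (env->flags & LBF_ACTIVE_LB)
                return 1;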
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
    • sched/fair: Ignore percpu threads for imbalance pulls · 9bcb959d
      Lingutla Chandrasekhar authored
      During load balance, LBF_SOME_PINNED will be set if any candidate task
      cannot be detached due to CPU affinity constraints. This can result in
      setting env->sd->parent->sgc->group_imbalance, which can lead to a group
      being classified as group_imbalanced (rather than any of the other, lower
      group_type) when balancing at a higher level.
      
      In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
      set due to per-CPU kthreads being the only other runnable tasks on any
      given rq. This results in changing the group classification during
      load-balance at higher levels when in reality there is nothing that can be
      done for this affinity constraint: per-CPU kthreads, as the name implies,
      don't get to move around (modulo hotplug shenanigans).
      
      It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
      with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
      an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
      we can leverage here to not set LBF_SOME_PINNED.
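      
      Concretely, the check could sit in can_migrate_task()'s affinity-pinned
      path, roughly like this (a sketch; the exact placement is an assumption):
      
        if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
                /* Per-CPU kthreads are where they need to be; don't let
                 * them mark the group as pinned/imbalanced (sketch).
                 */
                if (kthread_is_per_cpu(p))
                        return 0;
      
                env->flags |= LBF_SOME_PINNED;
                /* ... existing LBF_SOME_PINNED / LBF_DST_PINNED handling ... */
        }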
      
      Note that the aforementioned classification to group_imbalance (when
      nothing can be done) is especially problematic on big.LITTLE systems, which
      have a topology the likes of:
      
        DIE [          ]
        MC  [    ][    ]
             0  1  2  3
             L  L  B  B
      
        arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
      
      Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
      level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
      the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
      CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
      can significantly delay the upmigration of said misfit tasks. Systems
      relying on ASYM_PACKING are likely to face similar issues.
      Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
      [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
      [Reword changelog]
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
    • sched/fair: Bring back select_idle_smt(), but differently · c722f35b
      Rik van Riel authored
      Mel Gorman did some nice work in 9fe1f127 ("sched/fair: Merge
      select_idle_core/cpu()"), resulting in the kernel being more efficient
      at finding an idle CPU, and in tasks spending less time waiting to be
      run, both according to the schedstats run_delay numbers, and according
      to measured application latencies. Yay.
      
      The flip side of this is that we see more task migrations (about 30%
      more), higher cache misses, higher memory bandwidth utilization, and
      higher CPU use, for the same number of requests/second.
      
      This is most pronounced on a memcache type workload, which saw a
      consistent 1-3% increase in total CPU use on the system, due to those
      increased task migrations leading to higher L2 cache miss numbers, and
      higher memory utilization. The exclusive L3 cache on Skylake does us
      no favors there.
      
      On our web serving workload, that effect is usually negligible.
      
      It appears that the increased number of CPU migrations is generally a
      good thing, since it leads to lower cpu_delay numbers, reflecting the
      fact that tasks get to run faster. However, the reduced locality and
      the corresponding increase in L2 cache misses hurts a little.
      
      The patch below appears to fix the regression, while keeping the
      benefit of the lower cpu_delay numbers, by reintroducing
      select_idle_smt with a twist: when a socket has no idle cores, check
      to see if the sibling of "prev" is idle, before searching all the
      other CPUs.
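      
      The reintroduced helper could look roughly like this (a sketch based on
      the description above; the exact signature and call site are
      assumptions), scanning only the SMT siblings of the previous CPU that
      are allowed for the task and inside the LLC domain:
      
        static int select_idle_smt(struct task_struct *p,
                                   struct sched_domain *sd, int target)
        {
                int cpu;
      
                /* Only look at the SMT siblings of 'target' (i.e. prev). */
                for_each_cpu(cpu, cpu_smt_mask(target)) {
                        if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
                            !cpumask_test_cpu(cpu, sched_domain_span(sd)))
                                continue;
                        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                                return cpu;
                }
      
                return -1;
        }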
      
      This fixes both the occasional 9% regression on the web serving
      workload, and the continuous 2% CPU use regression on the memcache
      type workload.
      
      With Mel's patches and this patch together, task migrations are still
      high, but L2 cache misses, memory bandwidth, and CPU time used are
      back down to what they were before. The p95 and p99 response times for
      the memcache type application improve by about 10% over what they were
      before Mel's patches got merged.
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
  3. 08 Apr, 2021 1 commit
  4. 25 Mar, 2021 3 commits
  5. 23 Mar, 2021 4 commits
  6. 21 Mar, 2021 1 commit
    • sched: Fix various typos · 3b03706f
      Ingo Molnar authored
      Fix ~42 single-word typos in scheduler code comments.
      
      We have accumulated a few fun ones over the years. :-)
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: linux-kernel@vger.kernel.org
  7. 17 Mar, 2021 1 commit
  8. 10 Mar, 2021 2 commits
  9. 06 Mar, 2021 21 commits