1. 19 Feb, 2022 7 commits
    • arm64: Support PREEMPT_DYNAMIC · 1b2d3451
      Mark Rutland authored
      This patch enables support for PREEMPT_DYNAMIC on arm64, allowing the
      preemption model to be chosen at boot time.
      
      Specifically, this patch selects HAVE_PREEMPT_DYNAMIC_KEY, so that each
      preemption function is an out-of-line call with an early return
      depending upon a static key. This leaves almost all the codegen up to
      the compiler, and side-steps a number of pain points with static calls
      (e.g. interaction with CFI schemes). This should have no worse overhead
      than using non-inline static calls, as those use out-of-line trampolines
      with early returns.
      
      For example, the dynamic_cond_resched() wrapper looks as follows when
      enabled. When disabled, the first `B` is replaced with a `NOP`,
      resulting in an early return.
      
      | <dynamic_cond_resched>:
      |        bti     c
      |        b       <dynamic_cond_resched+0x10>     // or `nop`
      |        mov     w0, #0x0
      |        ret
      |        mrs     x0, sp_el0
      |        ldr     x0, [x0, #8]
      |        cbnz    x0, <dynamic_cond_resched+0x8>
      |        paciasp
      |        stp     x29, x30, [sp, #-16]!
      |        mov     x29, sp
      |        bl      <preempt_schedule_common>
      |        mov     w0, #0x1
      |        ldp     x29, x30, [sp], #16
      |        autiasp
      |        ret
      
      ... compared to the regular form of the function:
      
      | <__cond_resched>:
      |        bti     c
      |        mrs     x0, sp_el0
      |        ldr     x1, [x0, #8]
      |        cbz     x1, <__cond_resched+0x18>
      |        mov     w0, #0x0
      |        ret
      |        paciasp
      |        stp     x29, x30, [sp, #-16]!
      |        mov     x29, sp
      |        bl      <preempt_schedule_common>
      |        mov     w0, #0x1
      |        ldp     x29, x30, [sp], #16
      |        autiasp
      |        ret
      
      Since arm64 does not yet use the generic entry code, we must define our
      own `sk_dynamic_irqentry_exit_cond_resched`, which will be
      enabled/disabled by the common code in kernel/sched/core.c. All other
      preemption functions and associated static keys are defined there.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-8-mark.rutland@arm.com
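The arch-defined key described above can be pictured with a small user-space model. This is only a sketch: a plain flag stands in for a real static key (which patches instructions rather than testing memory), and every `model_` name is hypothetical. Only `sk_dynamic_irqentry_exit_cond_resched` is taken from the commit message.

```c
#include <stdbool.h>

/* User-space stand-in for a static key: a plain flag, not patched code. */
struct model_static_key {
	bool enabled;
};

/* Defined by the architecture, since arm64 does not use the generic entry code. */
static struct model_static_key sk_dynamic_irqentry_exit_cond_resched = { true };

/* Stand-in for the common code that flips the key when the model changes. */
static void model_key_set(struct model_static_key *key, bool on)
{
	key->enabled = on;
}

/* The entry code consults the key before considering preemption on IRQ exit. */
static bool model_need_irq_preemption(void)
{
	return sk_dynamic_irqentry_exit_cond_resched.enabled;
}
```

In this model, disabling the key makes `model_need_irq_preemption()` return false, so the IRQ exit path would skip preemption entirely.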
    • arm64: entry: Centralize preemption decision · 8e12ab7c
      Mark Rutland authored
      For historical reasons, the decision of whether or not to preempt is
      spread across arm64_preempt_schedule_irq() and __el1_irq(), and it would
      be clearer if this were all in one place.
      
      Also, arm64_preempt_schedule_irq() calls lockdep_assert_irqs_disabled(),
      but this is redundant, as we have a subsequent identical assertion in
      __exit_to_kernel_mode(), and preempt_schedule_irq() will
      BUG_ON(!irqs_disabled()) anyway.
      
      This patch removes the redundant assertion and centralizes the
      preemption decision making within arm64_preempt_schedule_irq().
      
      Other than the slight change to assertion behaviour, there should be no
      functional change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-7-mark.rutland@arm.com
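A rough user-space sketch of what "centralizing the decision" means: every check lives in one function, and the caller invokes it unconditionally. The structure, the `model_` names and the exact set of conditions are assumptions made for illustration. The real arm64 function consults kernel state that is not modelled here.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state consulted when returning from an IRQ. */
struct model_cpu_state {
	bool preemption_configured;  /* kernel built with preemption support */
	bool need_resched;           /* a reschedule has been requested */
	unsigned int preempt_count;  /* non-zero: preemption is disabled */
	unsigned int schedule_calls; /* how often the stand-in scheduler ran */
};

/* Stand-in for preempt_schedule_irq(). */
static void model_preempt_schedule_irq(struct model_cpu_state *s)
{
	s->schedule_calls++;
	s->need_resched = false;
}

/* All of the checks sit here, so the IRQ exit path needs none of its own. */
static void model_arm64_preempt_schedule_irq(struct model_cpu_state *s)
{
	if (!s->preemption_configured)
		return;
	if (s->preempt_count != 0)
		return;
	if (!s->need_resched)
		return;
	model_preempt_schedule_irq(s);
}
```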
    • sched/preempt: Add PREEMPT_DYNAMIC using static keys · 99cf983c
      Mark Rutland authored
      Where an architecture selects HAVE_STATIC_CALL but not
      HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline
      which will either branch to a callee or return to the caller.
      
      On such architectures, a number of constraints can conspire to make
      those trampolines more complicated and potentially less useful than we'd
      like. For example:
      
      * Hardware and software control flow integrity schemes can require the
        addition of "landing pad" instructions (e.g. `BTI` for arm64), which
        will also be present at the "real" callee.
      
      * Limited branch ranges can require that trampolines generate or load an
        address into a register and perform an indirect branch (or at least
        have a slow path that does so). This loses some of the benefits of
        having a direct branch.
      
      * Interaction with SW CFI schemes can be complicated and fragile, e.g.
        requiring that we recognise idiomatic codegen and remove
        indirections, at least until clang provides more helpful
        mechanisms for dealing with this.
      
      For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we
      really only need to enable/disable specific preemption functions. We can
      achieve the same effect without a number of the pain points above by
      using static keys to fold early returns into the preemption functions
      themselves rather than in an out-of-line trampoline, effectively
      inlining the trampoline into the start of the function.
      
      For arm64, this results in good code generation. For example, the
      dynamic_cond_resched() wrapper looks as follows when enabled. When
      disabled, the first `B` is replaced with a `NOP`, resulting in an early
      return.
      
      | <dynamic_cond_resched>:
      |        bti     c
      |        b       <dynamic_cond_resched+0x10>     // or `nop`
      |        mov     w0, #0x0
      |        ret
      |        mrs     x0, sp_el0
      |        ldr     x0, [x0, #8]
      |        cbnz    x0, <dynamic_cond_resched+0x8>
      |        paciasp
      |        stp     x29, x30, [sp, #-16]!
      |        mov     x29, sp
      |        bl      <preempt_schedule_common>
      |        mov     w0, #0x1
      |        ldp     x29, x30, [sp], #16
      |        autiasp
      |        ret
      
      ... compared to the regular form of the function:
      
      | <__cond_resched>:
      |        bti     c
      |        mrs     x0, sp_el0
      |        ldr     x1, [x0, #8]
      |        cbz     x1, <__cond_resched+0x18>
      |        mov     w0, #0x0
      |        ret
      |        paciasp
      |        stp     x29, x30, [sp, #-16]!
      |        mov     x29, sp
      |        bl      <preempt_schedule_common>
      |        mov     w0, #0x1
      |        ldp     x29, x30, [sp], #16
      |        autiasp
      |        ret
      
      Any architecture which implements static keys should be able to use this
      to implement PREEMPT_DYNAMIC with similar cost to non-inlined static
      calls. Since this is likely to have greater overhead than (inlined)
      static calls, PREEMPT_DYNAMIC is only defaulted to enabled when
      HAVE_PREEMPT_DYNAMIC_CALL is selected.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-6-mark.rutland@arm.com
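The control flow in the two disassembly listings above corresponds roughly to the following user-space model. It is a sketch under stated assumptions: a plain flag replaces the static key (the kernel patches a `B`/`NOP` instead of loading a variable), a single boolean replaces the preempt-count check, and all `model_` names are hypothetical.

```c
#include <stdbool.h>

static bool sk_dynamic_cond_resched = true; /* stand-in for the static key */
static bool model_need_resched;             /* stand-in for the resched check */
static int model_schedule_calls;            /* counts stand-in scheduler calls */

/* Stand-in for preempt_schedule_common(). */
static void model_preempt_schedule_common(void)
{
	model_schedule_calls++;
	model_need_resched = false;
}

/* Regular form: returns 1 if it rescheduled, 0 otherwise. */
static int model_cond_resched(void)
{
	if (!model_need_resched)
		return 0;
	model_preempt_schedule_common();
	return 1;
}

/* Dynamic form: the early return is folded into the start of the function. */
static int model_dynamic_cond_resched(void)
{
	if (!sk_dynamic_cond_resched)
		return 0; /* key disabled: behaves like the `NOP` case */
	return model_cond_resched();
}
```

With the key disabled, the wrapper returns 0 without ever looking at the resched state, which is the early-return behaviour the commit message describes.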
    • sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY · 33c64734
      Mark Rutland authored
      Now that the enabled/disabled states for the preemption functions are
      declared alongside their definitions, the core PREEMPT_DYNAMIC logic is
      no longer tied to GENERIC_ENTRY, and can safely be selected so long as
      an architecture provides enabled/disabled states for
      irqentry_exit_cond_resched().
      
      Make it possible to select HAVE_PREEMPT_DYNAMIC without GENERIC_ENTRY.
      
      For existing users of HAVE_PREEMPT_DYNAMIC there should be no functional
      change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-5-mark.rutland@arm.com
    • sched/preempt: Simplify irqentry_exit_cond_resched() callers · 4624a14f
      Mark Rutland authored
      Currently callers of irqentry_exit_cond_resched() need to be aware of
      whether the function should be indirected via a static call, leading to
      ugly ifdeffery in callers.
      
      Save them the hassle with a static inline wrapper that does the right
      thing. The raw_irqentry_exit_cond_resched() will also be useful in
      subsequent patches which will add conditional wrappers for preemption
      functions.
      
      Note: in arch/x86/entry/common.c, xen_pv_evtchn_do_upcall() always calls
      irqentry_exit_cond_resched() directly, even when PREEMPT_DYNAMIC is in
      use. I believe this is a latent bug (which this patch corrects), but I'm
      not entirely certain this wasn't deliberate.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-4-mark.rutland@arm.com
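The wrapper idea can be sketched in user space as follows. A function pointer stands in for the static call, `MODEL_PREEMPT_DYNAMIC` stands in for the kernel configuration option, and the `model_` names are hypothetical. Only `raw_irqentry_exit_cond_resched` is named in the commit message, and its body here is a placeholder.

```c
/* Stand-in for the configuration option; set to 0 to model the direct call. */
#define MODEL_PREEMPT_DYNAMIC 1

static int model_resched_calls;

/* The real work; callers never need to name this directly. */
static void raw_irqentry_exit_cond_resched(void)
{
	model_resched_calls++;
}

/* Target used when the function is dynamically disabled. */
static void model_noop(void)
{
}

#if MODEL_PREEMPT_DYNAMIC
/* Function pointer standing in for a static call target. */
static void (*model_static_call_target)(void) = raw_irqentry_exit_cond_resched;
#endif

/* The conditional compilation lives here, once, instead of in every caller. */
static inline void model_irqentry_exit_cond_resched(void)
{
#if MODEL_PREEMPT_DYNAMIC
	model_static_call_target();
#else
	raw_irqentry_exit_cond_resched();
#endif
}

/* A caller: no ifdeffery required. */
static void model_irq_exit(void)
{
	model_irqentry_exit_cond_resched();
}
```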
    • sched/preempt: Refactor sched_dynamic_update() · 8a69fe0b
      Mark Rutland authored
      Currently sched_dynamic_update() needs to open-code the enabled/disabled
      function names for each preemption model it supports, when in practice
      this is a boolean enabled/disabled state for each function.
      
      Make this clearer and avoid repetition by defining the enabled/disabled
      states at the function definition, and using helper macros to perform the
      static_call_update(). Where x86 currently overrides the enabled
      function, it is made to provide both the enabled and disabled states for
      consistency, with defaults provided by the core code otherwise.
      
      In subsequent patches this will allow us to support PREEMPT_DYNAMIC
      without static calls.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-3-mark.rutland@arm.com
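A user-space sketch of the refactoring pattern: each function declares its enabled and disabled states next to its definition, and helper macros pick the right one by token pasting. Function pointers stand in for static calls, the `model_` names are hypothetical, and the mapping of preemption models to enabled functions is an assumption for illustration rather than something stated in the commit message.

```c
typedef int (*model_fn_t)(void);

/* Shared "disabled" state: do nothing and report that nothing happened. */
static int model_ret0(void)
{
	return 0;
}

/* cond_resched: states declared alongside the definition. */
static int model_cond_resched_impl(void)
{
	return 1;
}
#define cond_resched_dynamic_enabled	model_cond_resched_impl
#define cond_resched_dynamic_disabled	model_ret0
static model_fn_t cond_resched_call = model_cond_resched_impl;

/* preempt_schedule: same pattern. */
static int model_preempt_schedule_impl(void)
{
	return 1;
}
#define preempt_schedule_dynamic_enabled	model_preempt_schedule_impl
#define preempt_schedule_dynamic_disabled	model_ret0
static model_fn_t preempt_schedule_call = model_preempt_schedule_impl;

/* Helpers: the update code only says "enable" or "disable". */
#define preempt_dynamic_enable(f)	(f##_call = f##_dynamic_enabled)
#define preempt_dynamic_disable(f)	(f##_call = f##_dynamic_disabled)

enum model_preempt_mode { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };

static void model_sched_dynamic_update(enum model_preempt_mode mode)
{
	switch (mode) {
	case MODEL_NONE:
	case MODEL_VOLUNTARY:
		preempt_dynamic_enable(cond_resched);
		preempt_dynamic_disable(preempt_schedule);
		break;
	case MODEL_FULL:
		preempt_dynamic_disable(cond_resched);
		preempt_dynamic_enable(preempt_schedule);
		break;
	}
}
```

Because the update function never names a concrete target, the same call sites could later be backed by something other than static calls, which is the motivation the commit message gives.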
    • sched/preempt: Move PREEMPT_DYNAMIC logic later · 4c748558
      Mark Rutland authored
      The PREEMPT_DYNAMIC logic in kernel/sched/core.c patches static calls
      for a bunch of preemption functions. While most are defined prior to
      this, the definition of cond_resched() is later in the file, and so we
      only have its declarations from include/linux/sched.h.
      
      In subsequent patches we'd like to define some macros alongside the
      definition of each of the preemption functions, which we can use within
      sched_dynamic_update(). For this to be possible, the PREEMPT_DYNAMIC
      logic needs to be placed after the various preemption functions.
      
      As a preparatory step, this patch moves the PREEMPT_DYNAMIC logic after
      the various preemption functions, with no other changes -- this is
      purely a move.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lore.kernel.org/r/20220214165216.2231574-2-mark.rutland@arm.com
  2. 16 Feb, 2022 12 commits
  3. 11 Feb, 2022 4 commits
    • sched/numa-balancing: Move some document to make it consistent with the code · 3624ba7b
      Huang Ying authored
      After commit 8a99b683 ("sched: Move SCHED_DEBUG sysctl to
      debugfs"), some NUMA balancing sysctls enclosed with SCHED_DEBUG have
      been moved to debugfs.  This patch moves the documentation for these
      sysctls from
      
        Documentation/admin-guide/sysctl/kernel.rst
      
      to
      
        Documentation/scheduler/sched-debug.rst
      
      to make the documentation consistent with the code.
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20220210052514.3038279-1-ying.huang@intel.com
    • sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs · e496132e
      Mel Gorman authored
      Commit 7d2b5dd0 ("sched/numa: Allow a floating imbalance between NUMA
      nodes") allowed an imbalance between NUMA nodes such that communicating
      tasks would not be pulled apart by the load balancer. This works fine when
      there is a 1:1 relationship between LLC and node but can be suboptimal
      for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
      
      Zen* has multiple LLCs per node with local memory channels and due to
      the allowed imbalance, it's far harder to tune some workloads to run
      optimally than it is on hardware that has 1 LLC per node. This patch
      allows an imbalance to exist up to the point where LLCs should be balanced
      between nodes.
      
      On a Zen3 machine running STREAM parallelised with OMP to have one instance
      per LLC and without binding, the results are
      
                                  5.17.0-rc0             5.17.0-rc0
                                     vanilla       sched-numaimb-v6
      MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
      MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
      MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
      MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
      
      STREAM can use directives to force the spread if the OpenMP is new
      enough but that doesn't help if an application uses threads and
      it's not known in advance how many threads will be created.
      
      Coremark is a CPU and cache intensive benchmark parallelised with
      threads. When running with 1 thread per core, the vanilla kernel
      allows threads to contend on cache. With the patch:
      
                                     5.17.0-rc0             5.17.0-rc0
                                        vanilla       sched-numaimb-v5
      Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
      Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
      Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
      Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
      CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
      
      It can also make a big difference for semi-realistic workloads
      like specjbb which can execute arbitrary numbers of threads without
      advance knowledge of how they should be placed. Even in cases where
      the average performance is neutral, the results are more stable.
      
                                     5.17.0-rc0             5.17.0-rc0
                                        vanilla       sched-numaimb-v6
      Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
      Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
      Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
      Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
      Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
      Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
      Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
      Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
      Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
      Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
      Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
      Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
      Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
      Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
      Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
      Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
      Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
      Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
      Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
      Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
      Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
      Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
      Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
      Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
      Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
      Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
      Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
      Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
      Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
      Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
      Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
      Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
      Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
      Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
      
      Similarly, for embarrassingly parallel problems like NPB-ep, there are
      improvements due to better spreading across LLC when the machine is not
      fully utilised.
      
                                    vanilla       sched-numaimb-v6
      Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
      Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
      Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
      CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
      Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
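The percentages in the tables above can be reproduced with simple arithmetic. The sketch below assumes the convention that a positive figure always means "better": throughput and score rows compare patched against vanilla directly, while rows where lower is better (Stddev, CoeffVar, and the NPB-ep times) flip the sign. The function names are hypothetical, and the convention is inferred from the numbers rather than stated in the commit message.

```c
/* Gain in percent for metrics where a larger value is better (MB/sec, score). */
static double model_gain_higher_is_better(double vanilla, double patched)
{
	return (patched / vanilla - 1.0) * 100.0;
}

/* Gain in percent for metrics where a smaller value is better (time, stddev). */
static double model_gain_lower_is_better(double vanilla, double patched)
{
	return (1.0 - patched / vanilla) * 100.0;
}
```

For instance, the STREAM copy-16 row (162596.94 to 580559.74 MB/sec) evaluates to about 257.05%, and the Coremark Stddev row (15247.04 to 24966.82) to about -63.75%, matching the figures quoted in the tables.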
    • sched/fair: Improve consistency of allowed NUMA balance calculations · 2cfb7a1b
      Mel Gorman authored
      There are inconsistencies when determining if a NUMA imbalance is allowed
      that should be corrected.
      
      o allow_numa_imbalance changes types and is not always examining
        the destination group so both the type should be corrected as
        well as the naming.
      o find_idlest_group uses the sched_domain's weight instead of the
        group weight which is different to find_busiest_group
      o find_busiest_group uses the source group instead of the destination
        which is different to task_numa_find_cpu
      o Both find_idlest_group and find_busiest_group should account
        for the number of running tasks if a move was allowed to be
        consistent with task_numa_find_cpu
      
      Fixes: 7d2b5dd0 ("sched/numa: Allow a floating imbalance between NUMA nodes")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
      Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net
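The last point in the list above, accounting for the task that is about to move, can be illustrated with a small sketch. The threshold used here (an imbalance is tolerated while the destination runs fewer tasks than a quarter of its CPUs) is an assumption chosen for the example, and the `model_` names are hypothetical. The commit message itself does not state the rule.

```c
#include <stdbool.h>

/*
 * Assumed rule for illustration: the destination group is lightly loaded
 * while its running tasks number fewer than a quarter of its CPUs.
 */
static bool model_allow_numa_imbalance(int dst_running, int dst_weight)
{
	return dst_running < (dst_weight >> 2);
}

/* Evaluate the destination group as it would look after the move. */
static bool model_allow_after_move(int dst_running, int dst_weight)
{
	return model_allow_numa_imbalance(dst_running + 1, dst_weight);
}
```

With a 16-CPU destination already running 3 tasks, checking the current count would allow the imbalance, but checking the count after the move would not. Applying the same "after the move" view in every caller is the kind of consistency the patch aims for.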
    • selftests/rseq: Change type of rseq_offset to ptrdiff_t · 889c5d60
      Mathieu Desnoyers authored
      Just before the 2.35 release of glibc, the __rseq_offset userspace ABI
      was changed from int to ptrdiff_t.
      
      Adapt to this change in the kernel selftests.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://sourceware.org/pipermail/libc-alpha/2022-February/136024.html
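A minimal sketch of why a pointer-difference type suits this value, assuming the offset is added to a thread pointer to locate the rseq area. The buffer, the offset value and the `model_` names are hypothetical stand-ins. This does not use the real glibc symbol.

```c
#include <stddef.h>

/* Hypothetical memory block standing in for per-thread storage. */
static char model_thread_block[64];

/* Hypothetical offset; ptrdiff_t can hold negative pointer differences. */
static ptrdiff_t model_rseq_offset = -16;

/* Locate the area by adding the signed offset to the thread pointer. */
static void *model_rseq_area(void *thread_pointer, ptrdiff_t offset)
{
	return (char *)thread_pointer + offset;
}
```

Declaring the offset as `ptrdiff_t` on both sides keeps the declaration in the selftests in agreement with the one exported by the C library.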
  4. 02 Feb, 2022 16 commits
  5. 27 Jan, 2022 1 commit