1. 16 Feb, 2022 12 commits
  2. 11 Feb, 2022 4 commits
    • sched/numa-balancing: Move some document to make it consistent with the code · 3624ba7b
      Huang Ying authored
      After commit 8a99b683 ("sched: Move SCHED_DEBUG sysctl to
      debugfs"), some NUMA balancing sysctls enclosed with SCHED_DEBUG have
      been moved to debugfs.  This patch moves the documentation for these
      sysctls from
      
        Documentation/admin-guide/sysctl/kernel.rst
      
      to
      
        Documentation/scheduler/sched-debug.rst
      
      to make the documentation consistent with the code.
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20220210052514.3038279-1-ying.huang@intel.com
    • sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs · e496132e
      Mel Gorman authored
      Commit 7d2b5dd0 ("sched/numa: Allow a floating imbalance between NUMA
      nodes") allowed an imbalance between NUMA nodes such that communicating
      tasks would not be pulled apart by the load balancer. This works fine when
      there is a 1:1 relationship between LLC and node but can be suboptimal
      for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
      
      Zen* has multiple LLCs per node with local memory channels and due to
      the allowed imbalance, it's far harder to tune some workloads to run
      optimally than it is on hardware that has 1 LLC per node. This patch
      allows an imbalance to exist up to the point where LLCs should be balanced
      between nodes.
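
The rule described above can be sketched in C as follows. This is an illustrative model, not the kernel's code: the names `numa_imbalance_limit` and `imbalance_allowed`, their parameters, and the 25% cutoff used for the single-LLC case are assumptions made for this example.

```c
#include <stdbool.h>

/*
 * Illustrative only: compute how many tasks may run on a node before the
 * load balancer stops tolerating a NUMA imbalance.
 *
 * node_cpus: number of CPUs in the NUMA node (hypothetical parameter)
 * llc_cpus:  number of CPUs sharing one last-level cache (hypothetical)
 */
static int numa_imbalance_limit(int node_cpus, int llc_cpus)
{
	int nr_llcs;

	/* Reject nonsensical input instead of dividing by zero. */
	if (node_cpus <= 0 || llc_cpus <= 0)
		return 0;

	nr_llcs = node_cpus / llc_cpus;

	/* One LLC per node: assume a cutoff of 25% of the node's CPUs. */
	if (nr_llcs <= 1)
		return node_cpus / 4;

	/* Several LLCs per node: allow roughly one task per LLC. */
	return nr_llcs;
}

/* An imbalance is tolerated while the destination stays within the limit. */
static bool imbalance_allowed(int dst_running, int limit)
{
	return dst_running <= limit;
}
```

In this model, a node with 64 CPUs and 8 CPUs per LLC has a limit of 8, so a ninth task placed on that node is no longer covered by the allowed imbalance and the balancer may spread it to another node.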
      
      On a Zen3 machine running STREAM parallelised with OMP to have one
      instance per LLC and without binding, the results are
      
                                  5.17.0-rc0             5.17.0-rc0
                                     vanilla       sched-numaimb-v6
      MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
      MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
      MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
      MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
      
      STREAM can use directives to force the spread if the OpenMP is new
      enough but that doesn't help if an application uses threads and
      it's not known in advance how many threads will be created.
      
      Coremark is a CPU and cache intensive benchmark parallelised with
      threads. When running with 1 thread per core, the vanilla kernel
      allows threads to contend on cache. With the patch:
      
                                     5.17.0-rc0             5.17.0-rc0
                                        vanilla       sched-numaimb-v5
      Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
      Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
      Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
      Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
      CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
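
The percentage columns in these tables appear to be relative differences against the vanilla kernel, with the sign arranged so that a positive value is an improvement. The sketch below reproduces that arithmetic. The function names are hypothetical, and the sign convention is inferred from the figures rather than stated in the commit.

```c
/* Gain for metrics where higher is better (throughput, score). */
static double gain_higher_is_better(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}

/* Gain for metrics where lower is better (elapsed time, stddev). */
static double gain_lower_is_better(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

With the Coremark figures above, the first function gives about 10.11% for the harmonic mean (388607.33 to 427877.08) and the second gives about -63.75% for the standard deviation (15247.04 to 24966.82), matching the table.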
      
      It can also make a big difference for semi-realistic workloads
      like specjbb which can execute arbitrary numbers of threads without
      advance knowledge of how they should be placed. Even in cases where
      the average performance is neutral, the results are more stable.
      
                                     5.17.0-rc0             5.17.0-rc0
                                        vanilla       sched-numaimb-v6
      Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
      Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
      Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
      Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
      Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
      Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
      Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
      Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
      Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
      Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
      Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
      Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
      Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
      Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
      Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
      Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
      Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
      Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
      Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
      Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
      Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
      Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
      Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
      Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
      Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
      Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
      Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
      Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
      Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
      Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
      Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
      Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
      Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
      Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
      
      Similarly, for embarrassingly parallel problems like NPB-ep, there are
      improvements due to better spreading across LLC when the machine is not
      fully utilised.
      
                                    vanilla       sched-numaimb-v6
      Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
      Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
      Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
      CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
      Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
    • sched/fair: Improve consistency of allowed NUMA balance calculations · 2cfb7a1b
      Mel Gorman authored
      There are inconsistencies when determining if a NUMA imbalance is allowed
      that should be corrected.
      
      o allow_numa_imbalance changes types and is not always examining
        the destination group so both the type should be corrected as
        well as the naming.
      o find_idlest_group uses the sched_domain's weight instead of the
        group weight which is different to find_busiest_group
      o find_busiest_group uses the source group instead of the destination
        which is different to task_numa_find_cpu
      o Both find_idlest_group and find_busiest_group should account
        for the number of running tasks if a move was allowed to be
        consistent with task_numa_find_cpu
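
One way to picture the intended consistency is a single helper that every call site uses in the same way: it always inspects the destination group and counts the task that would arrive if the move went ahead. The sketch below is illustrative only. `struct group_stats`, `numa_imbalance_ok` and the 25% threshold are assumptions for the example, not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a scheduling group's statistics. */
struct group_stats {
	int nr_running;   /* tasks currently running in the group */
	int weight;       /* number of CPUs in the group */
};

/*
 * Decide whether moving one more task into dst may leave an imbalance.
 * The +1 accounts for the task that would be running after the move.
 * The 25% threshold is an assumption for this sketch.
 */
static bool numa_imbalance_ok(const struct group_stats *dst)
{
	int running_after_move = dst->nr_running + 1;

	return running_after_move < dst->weight / 4;
}
```

Because the check takes the destination group and always adds the incoming task, two callers that describe the same move get the same answer, which is the property the list above asks for.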
      
      Fixes: 7d2b5dd0 ("sched/numa: Allow a floating imbalance between NUMA nodes")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
      Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net
    • selftests/rseq: Change type of rseq_offset to ptrdiff_t · 889c5d60
      Mathieu Desnoyers authored
      Just before the 2.35 release of glibc, the __rseq_offset userspace ABI
      was changed from int to ptrdiff_t.
      
      Adapt to this change in the kernel selftests.
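
The type matters because the offset is added to the thread pointer to locate the per-thread rseq area, so a signed, pointer-sized quantity is the natural fit. A minimal sketch of that address computation follows. `area_from_thread_pointer` is a hypothetical helper, and the thread pointer is passed in as a plain argument so the example does not depend on glibc 2.35 or on any architecture-specific way of reading it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: apply a signed, pointer-sized offset to a base
 * address, as one would with the thread pointer and __rseq_offset.
 */
static void *area_from_thread_pointer(void *thread_pointer, ptrdiff_t offset)
{
	/* Unsigned arithmetic wraps, so negative offsets work as expected. */
	return (void *)((uintptr_t)thread_pointer + (uintptr_t)offset);
}
```

Both positive and negative offsets are handled, since the area may sit on either side of the thread pointer depending on the architecture's TLS layout.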
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://sourceware.org/pipermail/libc-alpha/2022-February/136024.html
  3. 02 Feb, 2022 16 commits
  4. 27 Jan, 2022 8 commits