1. 16 Feb, 2022 3 commits
    • sched/numa: Avoid migrating task to CPU-less node · 5c7b1aaf
      Huang Ying authored
      In a typical memory tiering system, there is no CPU in the slow (PMEM)
      NUMA nodes.  But if the number of hint page faults on a PMEM node is
      the maximum for a task, the current NUMA balancing policy may try to
      place the task on that PMEM node instead of a DRAM node.  This is
      unreasonable because PMEM nodes have no CPUs.  To fix this, this patch
      ignores CPU-less nodes when searching for a task's migration target
      node.
      
      To test the patch, we run a workload that accesses more memory on the
      PMEM node than on the DRAM node.  Without the patch, the PMEM node is
      chosen as the preferred node in task_numa_placement(); with the patch,
      the DRAM node is chosen instead.
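
      A minimal sketch of the idea (not necessarily the literal patch): skip
      nodes without CPUs while scanning candidate target nodes.
      node_state(nid, N_CPU) is an existing kernel predicate; the loop shape
      here is illustrative.

        /* Sketch: never consider a CPU-less (e.g. PMEM) node as a target. */
        for_each_online_node(nid) {
                if (!node_state(nid, N_CPU))
                        continue;
                /* ... otherwise evaluate nid as a candidate target node ... */
        }
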
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
    • sched/numa: Fix NUMA topology for systems with CPU-less nodes · 0fb3978b
      Huang Ying authored
      The NUMA topology parameters (sched_numa_topology_type,
      sched_domains_numa_levels, sched_max_numa_distance, etc.) identified
      by the scheduler may be wrong for systems with CPU-less nodes.
      
      For example, the ACPI SLIT of a system with CPU-less persistent
      memory (Intel Optane DCPMM) nodes is as follows,
      
      [000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
      [004h 0004   4]                 Table Length : 0000042C
      [008h 0008   1]                     Revision : 01
      [009h 0009   1]                     Checksum : 59
      [00Ah 0010   6]                       Oem ID : "XXXX"
      [010h 0016   8]                 Oem Table ID : "XXXXXXX"
      [018h 0024   4]                 Oem Revision : 00000001
      [01Ch 0028   4]              Asl Compiler ID : "INTL"
      [020h 0032   4]        Asl Compiler Revision : 20091013
      
      [024h 0036   8]                   Localities : 0000000000000004
      [02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
      [030h 0048   4]                 Locality   1 : 15 0A 1C 11
      [034h 0052   4]                 Locality   2 : 11 1C 0A 1C
      [038h 0056   4]                 Locality   3 : 1C 11 1C 0A
      
      The `numactl -H` output is as follows,
      
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
      node 0 size: 64136 MB
      node 0 free: 5981 MB
      node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 64466 MB
      node 1 free: 10415 MB
      node 2 cpus:
      node 2 size: 253952 MB
      node 2 free: 253920 MB
      node 3 cpus:
      node 3 size: 253952 MB
      node 3 free: 253951 MB
      node distances:
      node   0   1   2   3
        0:  10  21  17  28
        1:  21  10  28  17
        2:  17  28  10  28
        3:  28  17  28  10
      
      In this system there are only 2 sockets, and both DRAM and PMEM DIMMs
      are installed in each memory controller.  Although the physical NUMA
      topology is simple, the logical NUMA topology becomes a little
      complex.  Because both distance(0, 1) and distance(1, 3) are less
      than distance(0, 3), it appears that node 1 sits between node 0 and
      node 3, and the whole system appears to be a glueless-mesh NUMA
      topology type.  But it is definitely not; node 3 doesn't even have a
      CPU.
      
      This isn't a practical problem yet, because the PMEM nodes (nodes 2
      and 3 in the example system) are offline by default during system
      boot, so init_numa_topology_type() called during boot ignores them
      and sets sched_numa_topology_type to NUMA_DIRECT.  At runtime,
      init_numa_topology_type() is only called when a CPU of a
      never-onlined-before node gets plugged in, and there is no CPU in
      the PMEM nodes.  But it seems better to fix this to make the code
      more robust.
      
      To test the potential problem, we used a debug patch to call
      init_numa_topology_type() when the PMEM node is onlined (in
      __set_migration_target_nodes()).  With that, the NUMA parameters
      identified by the scheduler are as follows,
      
      sched_numa_topology_type:	NUMA_GLUELESS_MESH
      sched_domains_numa_levels:	4
      sched_max_numa_distance:	28
      
      To fix the issue, CPU-less nodes are ignored when the NUMA topology
      parameters are identified.  Because a node may gain or lose CPUs at
      run time due to CPU hotplug, the NUMA topology parameters also need
      to be re-initialized on CPU hotplug.
      
      With the patch, the NUMA parameters identified for the example system
      above are as follows,
      
      sched_numa_topology_type:	NUMA_DIRECT
      sched_domains_numa_levels:	2
      sched_max_numa_distance:	21
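
      As a rough sketch of the approach (the real sched_init_numa() logic
      is more involved), only pairs of CPU-containing nodes contribute to
      the identified parameters:

        /* Sketch: derive distance levels only from nodes that have CPUs. */
        for_each_node_state(i, N_CPU) {
                for_each_node_state(j, N_CPU) {
                        int d = node_distance(i, j);

                        sched_max_numa_distance = max(sched_max_numa_distance, d);
                        /* ... also record d as a distance level ... */
                }
        }
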
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
    • sched: replace cpumask_weight with cpumask_empty where appropriate · 1087ad4e
      Yury Norov authored
      In some places, kernel/sched code calls cpumask_weight() to check if
      any bit of a given cpumask is set.  We can do this more efficiently
      with cpumask_empty(), because cpumask_empty() stops traversing the
      cpumask as soon as it finds the first set bit, while cpumask_weight()
      counts all bits unconditionally.
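
      For example (an illustrative call site, not a specific one from the
      patch):

        /* Before: counts every set bit just to test for "any bit set". */
        if (!cpumask_weight(mask))
                return;

        /* After: stops scanning at the first set bit. */
        if (cpumask_empty(mask))
                return;
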
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com
  2. 11 Feb, 2022 4 commits
    • sched/numa-balancing: Move some document to make it consistent with the code · 3624ba7b
      Huang Ying authored
      After commit 8a99b683 ("sched: Move SCHED_DEBUG sysctl to
      debugfs"), some NUMA balancing sysctls enclosed in SCHED_DEBUG have
      been moved to debugfs.  This patch moves the documentation for these
      sysctls from
      
        Documentation/admin-guide/sysctl/kernel.rst
      
      to
      
        Documentation/scheduler/sched-debug.rst
      
      to make the documentation consistent with the code.
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20220210052514.3038279-1-ying.huang@intel.com
    • sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs · e496132e
      Mel Gorman authored
      Commit 7d2b5dd0 ("sched/numa: Allow a floating imbalance between NUMA
      nodes") allowed an imbalance between NUMA nodes such that communicating
      tasks would not be pulled apart by the load balancer. This works fine when
      there is a 1:1 relationship between LLC and node but can be suboptimal
      for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
      
      Zen* has multiple LLCs per node with local memory channels and due to
      the allowed imbalance, it's far harder to tune some workloads to run
      optimally than it is on hardware that has 1 LLC per node. This patch
      allows an imbalance to exist up to the point where LLCs should be balanced
      between nodes.
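
      Conceptually, this is a simplified sketch of the new cut-off (the
      names are illustrative, not the actual sd_init()/imbalance code):

        /* Sketch: scale the allowed NUMA imbalance to the LLC topology. */
        static unsigned int allowed_numa_imbalance(unsigned int node_cpus,
                                                   unsigned int llc_cpus)
        {
                unsigned int nr_llcs = node_cpus / llc_cpus;

                if (nr_llcs <= 1)
                        return node_cpus >> 2;  /* 1 LLC: up to 25% of the node */

                /* multiple LLCs: allow roughly one extra task per LLC before
                 * spreading tasks to another node */
                return nr_llcs;
        }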
      
      On a Zen3 machine running STREAM parallelised with OpenMP to have one
      instance per LLC, and without binding, the results are
      
                                  5.17.0-rc0             5.17.0-rc0
                                     vanilla       sched-numaimb-v6
      MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
      MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
      MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
      MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
      
      STREAM can use directives to force the spread if the OpenMP runtime
      is new enough, but that doesn't help if an application uses threads
      and it's not known in advance how many threads will be created.
      
      Coremark is a CPU- and cache-intensive benchmark parallelised with
      threads.  When running with 1 thread per core, the vanilla kernel
      allows threads to contend on cache.  With the patch:
      
                                     5.17.0-rc0             5.17.0-rc0
                                        vanilla       sched-numaimb-v5
      Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
      Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
      Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
      Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
      CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
      
      It can also make a big difference for semi-realistic workloads
      like specjbb which can execute arbitrary numbers of threads without
      advance knowledge of how they should be placed. Even in cases where
      the average performance is neutral, the results are more stable.
      
                                     5.17.0-rc0             5.17.0-rc0
                                        vanilla       sched-numaimb-v6
      Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
      Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
      Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
      Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
      Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
      Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
      Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
      Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
      Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
      Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
      Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
      Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
      Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
      Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
      Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
      Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
      Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
      Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
      Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
      Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
      Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
      Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
      Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
      Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
      Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
      Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
      Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
      Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
      Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
      Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
      Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
      Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
      Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
      Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
      
      Similarly, for embarrassingly parallel problems like NPB-ep, there are
      improvements due to better spreading across LLCs when the machine is
      not fully utilised.
      
                                    vanilla       sched-numaimb-v6
      Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
      Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
      Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
      CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
      Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
      Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
      Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
    • sched/fair: Improve consistency of allowed NUMA balance calculations · 2cfb7a1b
      Mel Gorman authored
      There are inconsistencies when determining whether a NUMA imbalance is
      allowed that should be corrected (see the sketch after the list below).
      
      o allow_numa_imbalance changes types and does not always examine
        the destination group, so both the type and the naming should be
        corrected.
      o find_idlest_group uses the sched_domain's weight instead of the
        group weight which is different to find_busiest_group
      o find_busiest_group uses the source group instead of the destination
        which is different to task_numa_find_cpu
      o Both find_idlest_group and find_busiest_group should account
        for the number of running tasks if a move was allowed to be
        consistent with task_numa_find_cpu
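
      A sketch of the unified check (types and naming here are illustrative):
      compare a group's running task count, including the task being placed,
      against a quarter of that group's weight, and use the same helper from
      task_numa_find_cpu(), find_idlest_group() and find_busiest_group().

        static inline bool allow_numa_imbalance(unsigned int running,
                                                unsigned int weight)
        {
                return running < (weight >> 2);
        }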
      
      Fixes: 7d2b5dd0 ("sched/numa: Allow a floating imbalance between NUMA nodes")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
      Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net
    • selftests/rseq: Change type of rseq_offset to ptrdiff_t · 889c5d60
      Mathieu Desnoyers authored
      Just before the 2.35 release of glibc, the __rseq_offset userspace ABI
      was changed from int to ptrdiff_t.
      
      Adapt to this change in the kernel selftests.
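
      For context, glibc 2.35 declares the symbol as a ptrdiff_t offset from
      the thread pointer.  A minimal userspace check (assuming glibc >= 2.35
      headers) looks like:

        #include <stdio.h>
        #include <sys/rseq.h>   /* glibc 2.35+: __rseq_offset, __rseq_size */

        int main(void)
        {
                /* __rseq_offset is now ptrdiff_t, so %td is the right
                 * conversion for it; __rseq_size is an unsigned int. */
                printf("rseq area: offset %td, size %u\n",
                       __rseq_offset, __rseq_size);
                return 0;
        }
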
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://sourceware.org/pipermail/libc-alpha/2022-February/136024.html
  3. 02 Feb, 2022 16 commits
  4. 27 Jan, 2022 8 commits
  5. 23 Jan, 2022 6 commits
    • Linux 5.17-rc1 · e783362e
      Linus Torvalds authored
    • Merge tag 'perf-tools-for-v5.17-2022-01-22' of... · 40c84321
      Linus Torvalds authored
      Merge tag 'perf-tools-for-v5.17-2022-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull more perf tools updates from Arnaldo Carvalho de Melo:
      
       - Fix printing 'phys_addr' in 'perf script'.
      
       - Fix failure to add events with 'perf probe' in ppc64 due to not
         removing leading dot (ppc64 ABIv1).
      
       - Fix cpu_map__item() python binding building.
      
       - Support event alias in form foo-bar-baz, add pmu-events and
         parse-event tests for it.
      
       - No need to setup affinities when starting a workload or attaching to
         a pid.
      
       - Use path__join() to compose a path instead of ad-hoc snprintf()
         equivalent.
      
       - Override attr->sample_period for non-libpfm4 events.
      
       - Use libperf cpumap APIs instead of accessing the internal state
         directly.
      
       - Sync x86 arch prctl headers and files changed by the new
         set_mempolicy_home_node syscall with the kernel sources.
      
       - Remove duplicate include in cpumap.h.
      
       - Remove redundant err variable.
      
      * tag 'perf-tools-for-v5.17-2022-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf tools: Remove redundant err variable
        perf test: Add parse-events test for aliases with hyphens
        perf test: Add pmu-events test for aliases with hyphens
        perf parse-events: Support event alias in form foo-bar-baz
        perf evsel: Override attr->sample_period for non-libpfm4 events
        perf cpumap: Remove duplicate include in cpumap.h
        perf cpumap: Migrate to libperf cpumap api
        perf python: Fix cpu_map__item() building
        perf script: Fix printing 'phys_addr' failure issue
        tools headers UAPI: Sync files changed by new set_mempolicy_home_node syscall
        tools headers UAPI: Sync x86 arch prctl headers with the kernel sources
        perf machine: Use path__join() to compose a path instead of snprintf(dir, '/', filename)
        perf evlist: No need to setup affinities when disabling events for pid targets
        perf evlist: No need to setup affinities when enabling events for pid targets
        perf stat: No need to setup affinities when starting a workload
        perf affinity: Allow passing a NULL arg to affinity__cleanup()
        perf probe: Fix ppc64 'perf probe add events failed' case
    • Merge tag 'trace-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 67bfce0e
      Linus Torvalds authored
      Pull ftrace fix from Steven Rostedt:
       "Fix s390 breakage from sorting mcount tables.
      
        The latest merge of the tracing tree sorts the mcount table at build
        time. But s390 appears to do things differently (like always) and
        replaces the sorted table back to the original unsorted one. As the
        ftrace algorithm depends on it being sorted, bad things happen when it
        is not, and s390 experienced those bad things.
      
        Add a new config to tell the boot if the mcount table is sorted or
        not, and allow s390 to opt out of it"
      
      * tag 'trace-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        ftrace: Fix assuming build time sort works for s390
    • ftrace: Fix assuming build time sort works for s390 · 6b9b6413
      Steven Rostedt (Google) authored
      Since mcount_loc needs to be sorted for ftrace to work properly,
      sorting it at build time rather than at boot speeds up the boot
      process and can save milliseconds.  Unfortunately, this change broke
      s390, which modifies the mcount_loc locations after the sorting takes
      place and puts back the unsorted locations.  Since the sorting is
      skipped at boot when the table is believed to have been sorted at
      build time, ftrace can crash because its algorithms depend on the
      list being sorted.

      Add a new config, BUILDTIME_MCOUNT_SORT, that is set when
      BUILDTIME_TABLE_SORT is set but not if S390 is set.  Use this config
      to determine whether sorting should take place at boot.
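
      On the boot side, the guard then looks roughly like the sketch below
      (sort() and ftrace_cmp_ips() are existing ftrace helpers; the exact
      call site is in ftrace's init path):

        /* Sketch: sort at boot unless the build-time sort is trustworthy. */
        if (!IS_ENABLED(CONFIG_BUILDTIME_MCOUNT_SORT))
                sort(start, count, sizeof(*start), ftrace_cmp_ips, NULL);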
      
      Link: https://lore.kernel.org/all/yt9dee51ctfn.fsf@linux.ibm.com/
      
      Fixes: 72b3942a ("scripts: ftrace - move the sort-processing in ftrace_init")
      Reported-by: Sven Schnelle <svens@linux.ibm.com>
      Tested-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
    • Merge tag 'kbuild-fixes-v5.17' of... · 473aec0e
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Bring include/uapi/linux/nfc.h into the UAPI compile-test coverage
      
       - Revert the workaround of CONFIG_CC_IMPLICIT_FALLTHROUGH
      
       - Fix build errors in certs/Makefile
      
      * tag 'kbuild-fixes-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        certs: Fix build error when CONFIG_MODULE_SIG_KEY is empty
        certs: Fix build error when CONFIG_MODULE_SIG_KEY is PKCS#11 URI
        Revert "Makefile: Do not quote value for CONFIG_CC_IMPLICIT_FALLTHROUGH"
        usr/include/Makefile: add linux/nfc.h to the compile-test coverage
    • Merge tag 'bitmap-5.17-rc1' of git://github.com/norov/linux · 3689f9f8
      Linus Torvalds authored
      Pull bitmap updates from Yury Norov:
      
       - introduce for_each_set_bitrange()
      
       - use find_first_*_bit() instead of find_next_*_bit() where possible
      
       - unify for_each_bit() macros
      
      * tag 'bitmap-5.17-rc1' of git://github.com/norov/linux:
        vsprintf: rework bitmap_list_string
        lib: bitmap: add performance test for bitmap_print_to_pagebuf
        bitmap: unify find_bit operations
        mm/percpu: micro-optimize pcpu_is_populated()
        Replace for_each_*_bit_from() with for_each_*_bit() where appropriate
        find: micro-optimize for_each_{set,clear}_bit()
        include/linux: move for_each_bit() macros from bitops.h to find.h
        cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
        tools: sync tools/bitmap with mother linux
        all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
        cpumask: use find_first_and_bit()
        lib: add find_first_and_bit()
        arch: remove GENERIC_FIND_FIRST_BIT entirely
        include: move find.h from asm_generic to linux
        bitops: move find_bit_*_le functions from le.h to find.h
        bitops: protect find_first_{,zero}_bit properly
  6. 22 Jan, 2022 3 commits