    sched/numa: Fix NUMA topology for systems with CPU-less nodes · 0fb3978b
    Huang Ying authored
    The NUMA topology parameters (sched_numa_topology_type,
    sched_domains_numa_levels, sched_max_numa_distance, etc.)
    identified by the scheduler may be wrong for systems with CPU-less
    nodes.
    
    For example, the ACPI SLIT of a system with CPU-less persistent
    memory (Intel Optane DCPMM) nodes is as follows,
    
    [000h 0000   4]                    Signature : "SLIT"    [System Locality Information Table]
    [004h 0004   4]                 Table Length : 0000042C
    [008h 0008   1]                     Revision : 01
    [009h 0009   1]                     Checksum : 59
    [00Ah 0010   6]                       Oem ID : "XXXX"
    [010h 0016   8]                 Oem Table ID : "XXXXXXX"
    [018h 0024   4]                 Oem Revision : 00000001
    [01Ch 0028   4]              Asl Compiler ID : "INTL"
    [020h 0032   4]        Asl Compiler Revision : 20091013
    
    [024h 0036   8]                   Localities : 0000000000000004
    [02Ch 0044   4]                 Locality   0 : 0A 15 11 1C
    [030h 0048   4]                 Locality   1 : 15 0A 1C 11
    [034h 0052   4]                 Locality   2 : 11 1C 0A 1C
    [038h 0056   4]                 Locality   3 : 1C 11 1C 0A
    
    While the `numactl -H` output is as follows,
    
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
    node 0 size: 64136 MB
    node 0 free: 5981 MB
    node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
    node 1 size: 64466 MB
    node 1 free: 10415 MB
    node 2 cpus:
    node 2 size: 253952 MB
    node 2 free: 253920 MB
    node 3 cpus:
    node 3 size: 253952 MB
    node 3 free: 253951 MB
    node distances:
    node   0   1   2   3
      0:  10  21  17  28
      1:  21  10  28  17
      2:  17  28  10  28
      3:  28  17  28  10
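
The SLIT entries are hexadecimal bytes, while numactl prints decimal
distances. As a quick cross-check, the following sketch decodes the four
locality rows quoted above; the helper name decode_slit is made up for
this illustration and is not part of any kernel or ACPI tool.

```python
# SLIT locality rows copied from the table above (hexadecimal bytes).
slit_rows = [
    "0A 15 11 1C",
    "15 0A 1C 11",
    "11 1C 0A 1C",
    "1C 11 1C 0A",
]


def decode_slit(rows):
    """Convert rows of space-separated hex bytes into a decimal distance matrix."""
    return [[int(byte, 16) for byte in row.split()] for row in rows]


distances = decode_slit(slit_rows)

# Print the matrix in a layout similar to the "node distances" block.
for node, row in enumerate(distances):
    print(f"{node}: " + " ".join(f"{d:3d}" for d in row))
```

The decoded rows match the node distances reported by `numactl -H`.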
    
    This system has only 2 sockets.  Both DRAM and PMEM DIMMs are
    installed on each memory controller.  Although the physical NUMA
    topology is simple, the logical NUMA topology becomes a little
    complex.  Because both distance(0, 1) and distance(1, 3) are less
    than distance(0, 3), it appears that node 1 sits between node 0
    and node 3, and the whole system appears to have a glueless mesh
    NUMA topology.  But it definitely does not; there is not even a CPU
    in node 3.
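
The "node 1 sits between node 0 and node 3" observation can be checked
mechanically. The sketch below is only a userspace illustration built on
the distance table above; the function name intermediate_nodes is
hypothetical and does not exist in the kernel.

```python
# Node distance matrix from the `numactl -H` output above.
DIST = [
    [10, 21, 17, 28],
    [21, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]


def intermediate_nodes(dist, a, b):
    """Return nodes that are strictly closer to both a and b than a is to b."""
    limit = dist[a][b]
    return [
        c
        for c in range(len(dist))
        if c not in (a, b) and dist[a][c] < limit and dist[b][c] < limit
    ]


print(intermediate_nodes(DIST, 0, 3))  # node 1 looks like a hop between 0 and 3
print(intermediate_nodes(DIST, 0, 1))  # nothing sits between the two sockets
```

For the pair (0, 3) the only candidate is node 1, which is what makes the
distance table look like a mesh even though node 3 has no CPUs.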
    
    This is not a practical problem yet, because the PMEM nodes (node
    2 and node 3 in the example system) are offline by default during
    system boot.  So init_numa_topology_type(), called during system
    boot, ignores them and sets sched_numa_topology_type to NUMA_DIRECT.
    At run time, init_numa_topology_type() is only called when a CPU of
    a never-onlined-before node gets plugged in, and there is no CPU in
    the PMEM nodes.  Still, it is better to fix this to make the code
    more robust.
    
    To test the potential problem, we used a debug patch that calls
    init_numa_topology_type() when the PMEM node is onlined (in
    __set_migration_target_nodes()).  With that, the NUMA parameters
    identified by the scheduler are as follows,
    
    sched_numa_topology_type:	NUMA_GLUELESS_MESH
    sched_domains_numa_levels:	4
    sched_max_numa_distance:	28
    
    To fix the issue, the CPU-less nodes are ignored when the NUMA
    topology parameters are identified.  Because CPU hotplug can make a
    node gain or lose all of its CPUs at run time, the NUMA topology
    parameters need to be re-initialized on CPU hotplug too.
    
    With the patch, the NUMA parameters identified for the example system
    above are as follows,
    
    sched_numa_topology_type:	NUMA_DIRECT
    sched_domains_numa_levels:	2
    sched_max_numa_distance:	21
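
Both sets of numbers can be reproduced with a small userspace model. This
is a sketch under stated assumptions, not the kernel code: it assumes that
the number of levels is the count of distinct distances among the
considered nodes, and that the type is decided by looking for an
intermediate node between a pair at maximum distance. The function name
numa_parameters and the set NODES_WITH_CPUS are hypothetical.

```python
# Node distance matrix from the `numactl -H` output above.
DIST = [
    [10, 21, 17, 28],
    [21, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]
ALL_NODES = {0, 1, 2, 3}
NODES_WITH_CPUS = {0, 1}  # nodes 2 and 3 are CPU-less PMEM nodes


def numa_parameters(dist, nodes):
    """Model (type, levels, max distance) for the given set of nodes."""
    nodes = sorted(nodes)
    # Assumption: one level per distinct distance, local distance included.
    levels = sorted({dist[a][b] for a in nodes for b in nodes})
    max_distance = levels[-1]
    if len(levels) <= 2:
        return "NUMA_DIRECT", len(levels), max_distance
    for a in nodes:
        for b in nodes:
            if dist[a][b] < max_distance:
                continue
            # a and b are at maximum distance: look for a node between them.
            for c in nodes:
                if dist[a][c] < max_distance and dist[b][c] < max_distance:
                    return "NUMA_GLUELESS_MESH", len(levels), max_distance
            return "NUMA_BACKPLANE", len(levels), max_distance
    # Not reached: some pair always has the maximum distance.
    return "NUMA_DIRECT", len(levels), max_distance


print(numa_parameters(DIST, ALL_NODES))        # all nodes considered
print(numa_parameters(DIST, NODES_WITH_CPUS))  # CPU-less nodes ignored
```

Considering all four nodes gives the mesh result seen with the debug
patch, while restricting the model to nodes 0 and 1 gives the values
reported with the fix.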

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com