    powerpc/topology: Get topology for shared processors at boot · 2ea62630
    Srikar Dronamraju authored
    On a shared LPAR, Phyp will not update the CPU associativity at boot
    time. Just after boot, the system recognizes itself as a shared LPAR
    and triggers a request for the correct CPU associativity. But by then
    the scheduler has already created/destroyed its sched domains.
    
    This causes
      - Broken load balancing across nodes, causing islands of cores.
      - Performance degradation, especially if the system is lightly
        loaded.
      - dmesg wrongly reporting all CPUs to be in Node 0.
      - Messages in dmesg saying "BUG: arch topology borken".
      - With commit 051f3ca0 ("sched/topology: Introduce NUMA identity
        node sched domain"), RCU stalls at boot.
    
    The sched_domains_numa_masks table, which is used to generate
    cpumasks, is only created at boot time, just before the sched domains
    are built, and is never updated afterwards. Hence it is better to get
    the topology correct before the sched domains are created.
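
    For context, a simplified sketch of how the scheduler builds and
    caches those masks (modelled on sched_init_numa() in
    kernel/sched/topology.c; abbreviated, not the verbatim kernel code).
    The per-level, per-node cpumasks are derived from node_distance()
    exactly once and are not rebuilt if a CPU later moves to a different
    node:

      /*
       * nr_levels = number of distinct NUMA distance values
       * (abbreviated; the kernel tracks this as sched_domains_numa_levels).
       */
      for (i = 0; i < nr_levels; i++) {
              sched_domains_numa_masks[i] =
                      kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);

              for (j = 0; j < nr_node_ids; j++) {
                      struct cpumask *mask =
                              kzalloc(cpumask_size(), GFP_KERNEL);

                      sched_domains_numa_masks[i][j] = mask;

                      /* Nodes within this level's distance share a mask. */
                      for_each_node(k)
                              if (node_distance(j, k) <=
                                  sched_domains_numa_distance[i])
                                      cpumask_or(mask, mask,
                                                 cpumask_of_node(k));
              }
      }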
    
    For example, on a 64-core POWER8 shared LPAR, dmesg reports:
    
      Brought up 512 CPUs
      Node 0 CPUs: 0-511
      Node 1 CPUs:
      Node 2 CPUs:
      Node 3 CPUs:
      Node 4 CPUs:
      Node 5 CPUs:
      Node 6 CPUs:
      Node 7 CPUs:
      Node 8 CPUs:
      Node 9 CPUs:
      Node 10 CPUs:
      Node 11 CPUs:
      ...
      BUG: arch topology borken
           the DIE domain not a subset of the NUMA domain
      BUG: arch topology borken
           the DIE domain not a subset of the NUMA domain
    
    numactl/lscpu output will still be correct, with cores spread across
    all nodes:
    
      Socket(s):             64
      NUMA node(s):          12
      Model:                 2.0 (pvr 004d 0200)
      Model name:            POWER8 (architected), altivec supported
      Hypervisor vendor:     pHyp
      Virtualization type:   para
      L1d cache:             64K
      L1i cache:             32K
      NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
      NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
      NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
      NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
      NUMA node4 CPU(s):     208-215,304-311,400-407,496-503
      NUMA node5 CPU(s):     168-175,264-271,360-367,456-463
      NUMA node6 CPU(s):     128-135,224-231,320-327,416-423
      NUMA node7 CPU(s):     136-143,232-239,328-335,424-431
      NUMA node8 CPU(s):     216-223,312-319,408-415,504-511
      NUMA node9 CPU(s):     144-151,240-247,336-343,432-439
      NUMA node10 CPU(s):    152-159,248-255,344-351,440-447
      NUMA node11 CPU(s):    160-167,256-263,352-359,448-455
    
    Currently on this LPAR, the scheduler detects two levels of NUMA and
    creates NUMA sched domains for all CPUs, but it also finds a single
    DIE domain consisting of all CPUs. Hence it deletes all the NUMA
    sched domains.
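
    The "arch topology borken" messages come from a sanity check the
    scheduler runs while building each domain, roughly the following
    (from build_sched_domain() in kernel/sched/topology.c, lightly
    trimmed): if a child domain's span is not a subset of its parent's
    span, it warns and widens the parent as a fixup.

      if (!cpumask_subset(sched_domain_span(child),
                          sched_domain_span(sd))) {
              pr_err("BUG: arch topology borken\n");
      #ifdef CONFIG_SCHED_DEBUG
              pr_err("     the %s domain not a subset of the %s domain\n",
                     child->name, sd->name);
      #endif
              /* Fixup: make the parent span at least the child's CPUs. */
              cpumask_or(sched_domain_span(sd), sched_domain_span(sd),
                         sched_domain_span(child));
      }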
    
    To address this, detect the shared processor and update the topology
    soon after the CPUs are set up, so that the correct topology is in
    place just before the scheduler creates its sched domains.
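
    A minimal sketch of that flow, assuming a new helper in the powerpc
    NUMA code (the helper name and call site below are illustrative, not
    the full patch):

      /* arch/powerpc/mm/numa.c: on a shared LPAR, mark every CPU as
       * needing an associativity update and process the updates
       * immediately.
       */
      void __init shared_proc_topology_init(void)
      {
              if (lppaca_shared_proc(get_lppaca())) {
                      bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
                                  nr_cpumask_bits);
                      numa_update_cpu_topology(false);
              }
      }

      /* arch/powerpc/kernel/smp.c, smp_cpus_done(): run the update after
       * the secondary CPUs are up but before the scheduler builds its
       * sched domains.
       */
      shared_proc_topology_init();
      dump_numa_cpu_topology();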
    
    With the fix, dmesg reports:
    
      numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
      numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
      numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
      numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
      numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
      numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
      numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
      numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
      numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
      numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
      numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
      numa: Node 11 CPUs: 160-167 256-263 352-359 448-455
    
    and lscpu also reports:
    
      Socket(s):             64
      NUMA node(s):          12
      Model:                 2.0 (pvr 004d 0200)
      Model name:            POWER8 (architected), altivec supported
      Hypervisor vendor:     pHyp
      Virtualization type:   para
      L1d cache:             64K
      L1i cache:             32K
      NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
      NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
      NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
      NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
      NUMA node4 CPU(s):     208-215,304-311,400-407,496-503
      NUMA node5 CPU(s):     168-175,264-271,360-367,456-463
      NUMA node6 CPU(s):     128-135,224-231,320-327,416-423
      NUMA node7 CPU(s):     136-143,232-239,328-335,424-431
      NUMA node8 CPU(s):     216-223,312-319,408-415,504-511
      NUMA node9 CPU(s):     144-151,240-247,336-343,432-439
      NUMA node10 CPU(s):    152-159,248-255,344-351,440-447
      NUMA node11 CPU(s):    160-167,256-263,352-359,448-455

    Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
    Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    [mpe: Trim / format change log]
    Tested-by: Michael Ellerman <mpe@ellerman.id.au>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>