1. 06 Oct, 2020 8 commits
  2. 18 Sep, 2020 18 commits
  3. 16 Sep, 2020 14 commits
    • Merge coregroup support into next · b5c8a293
      Michael Ellerman authored
      From Srikar's cover letter, with some reformatting:
      
      Clean up existing powerpc topologies and add coregroup support on
      powerpc. A coregroup is a group (subset) of cores of a DIE that share
      a resource.
      
      Summary of some of the testing done with coregroup patchset.
      
      It includes ebizzy, schbench, perf bench sched pipe and topology
      verification. On the left side are results from powerpc/next tree and
      on the right are the results with the patchset applied. Topological
      verification clearly shows that there is no change in topology with
      and without the patches on all 3 classes of systems that were
      tested.
      
      Power 9 PowerNV (2 Node/ 160 Cpu System)
      ----------------------------------------
      
      Baseline                                                                Baseline + Coregroup Support
      
        N      Min       Max    Median       Avg        Stddev                  N      Min       Max    Median       Avg      Stddev
      100   993884   1276090   1173476   1165914     54867.201                100   910470   1279820   1171095   1162091    67363.28
      
      ^ ebizzy (Throughput of 100 iterations of 30 seconds; higher throughput is better)

      schbench (latency, hence lower is better)
      Latency percentiles (usec)                                              Latency percentiles (usec)
              50.0th: 455                                                             50.0th: 454
              75.0th: 533                                                             75.0th: 543
              90.0th: 683                                                             90.0th: 701
              95.0th: 743                                                             95.0th: 737
              *99.0th: 815                                                            *99.0th: 805
              99.5th: 839                                                             99.5th: 835
              99.9th: 913                                                             99.9th: 893
              min=0, max=1011                                                         min=0, max=2833
      
      perf bench sched pipe (less time and higher ops/sec are better)
      Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
      Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes
      
           Total time: 6.083 [sec]                                                 Total time: 6.303 [sec]
      
             6.083576 usecs/op                                                       6.303318 usecs/op
               164377 ops/sec                                                          158646 ops/sec
      
      Power 9 LPAR (2 Node/ 128 Cpu System)
      -------------------------------------
      
      Baseline                                                                Baseline + Coregroup Support
      
        N       Min       Max    Median         Avg      Stddev                 N       Min       Max    Median         Avg      Stddev
      100   1058029   1295393   1200414   1188306.7   56786.538               100    943264   1287619   1180522   1168473.2   64469.955
      
      ^ ebizzy (Throughput of 100 iterations of 30 seconds; higher throughput is better)

      schbench (latency, hence lower is better)
      Latency percentiles (usec)                                              Latency percentiles (usec)
              50.0000th: 34                                                           50.0000th: 39
              75.0000th: 46                                                           75.0000th: 52
              90.0000th: 53                                                           90.0000th: 68
              95.0000th: 56                                                           95.0000th: 77
              *99.0000th: 61                                                          *99.0000th: 89
              99.5000th: 63                                                           99.5000th: 94
              99.9000th: 81                                                           99.9000th: 169
              min=0, max=8405                                                         min=0, max=23674
      
      perf bench sched pipe (less time and higher ops/sec are better)
      Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
      Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes
      
           Total time: 8.768 [sec]                                                 Total time: 5.217 [sec]
      
             8.768400 usecs/op                                                       5.217625 usecs/op
               114045 ops/sec                                                          191658 ops/sec
      
      Power 8 LPAR (8 Node/ 256 Cpu System)
      -------------------------------------
      
      Baseline                                                                Baseline + Coregroup Support
      
        N       Min       Max    Median         Avg      Stddev                 N      Min      Max   Median        Avg     Stddev
      100   1267615   1965234   1707423   1689137.6   144363.29               100  1175357  1924262  1691104  1664792.1   145876.4
      
      ^ ebizzy (Throughput of 100 iterations of 30 seconds; higher throughput is better)

      schbench (latency, hence lower is better)
      Latency percentiles (usec)                                              Latency percentiles (usec)
              50.0th: 37                                                              50.0th: 36
              75.0th: 51                                                              75.0th: 48
              90.0th: 59                                                              90.0th: 55
              95.0th: 63                                                              95.0th: 59
              *99.0th: 71                                                             *99.0th: 67
              99.5th: 75                                                              99.5th: 72
              99.9th: 105                                                             99.9th: 170
              min=0, max=18560                                                        min=0, max=27031
      
      perf bench sched pipe (less time and higher ops/sec are better)
      Running 'sched/pipe' benchmark:                                         Running 'sched/pipe' benchmark:
      Executed 1000000 pipe operations between two processes                  Executed 1000000 pipe operations between two processes
      
           Total time: 6.013 [sec]                                                 Total time: 5.930 [sec]
      
             6.013963 usecs/op                                                       5.930724 usecs/op
               166279 ops/sec                                                          168613 ops/sec
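To put the ebizzy averages above on a common scale, the relative change can be worked out per system. The following is a minimal sketch that only redoes arithmetic on the Avg and Stddev columns already reported; the dictionary and function names are illustrative and not part of the patchset.

```python
# (baseline avg, baseline stddev, patched avg), copied from the ebizzy tables above.
ebizzy = {
    "Power 9 PowerNV": (1165914.0, 54867.201, 1162091.0),
    "Power 9 LPAR": (1188306.7, 56786.538, 1168473.2),
    "Power 8 LPAR": (1689137.6, 144363.29, 1664792.1),
}


def percent_change(before, after):
    """Relative change of `after` with respect to `before`, in percent."""
    return (after - before) / before * 100.0


# Percent change of the average throughput for each system.
changes = {
    name: percent_change(base_avg, patched_avg)
    for name, (base_avg, _stddev, patched_avg) in ebizzy.items()
}

# True when the shift in the average is smaller than one baseline stddev.
within_one_stddev = {
    name: abs(patched_avg - base_avg) < stddev
    for name, (base_avg, stddev, patched_avg) in ebizzy.items()
}

for name in ebizzy:
    print(f"{name}: {changes[name]:+.2f}% (within 1 stddev: {within_one_stddev[name]})")
```

In all three cases the shift in the average is smaller than one baseline standard deviation.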
      
      Topology verification on Power9
      Power9 / powernv / SMT4
      
        $ tail /proc/cpuinfo
        cpu             : POWER9, altivec supported
        clock           : 3600.000000MHz
        revision        : 2.2 (pvr 004e 1202)
      
        timebase        : 512000000
        platform        : PowerNV
        model           : 9006-22P
        machine         : PowerNV 9006-22P
        firmware        : OPAL
        MMU             : Radix
      
      Baseline                                                                Baseline + Coregroup Support
      
        lscpu                                                                 lscpu
        ------                                                                ------
        Architecture:        ppc64le                                          Architecture:        ppc64le
        Byte Order:          Little Endian                                    Byte Order:          Little Endian
        CPU(s):              160                                              CPU(s):              160
        On-line CPU(s) list: 0-159                                            On-line CPU(s) list: 0-159
        Thread(s) per core:  4                                                Thread(s) per core:  4
        Core(s) per socket:  20                                               Core(s) per socket:  20
        Socket(s):           2                                                Socket(s):           2
        NUMA node(s):        2                                                NUMA node(s):        2
        Model:               2.2 (pvr 004e 1202)                              Model:               2.2 (pvr 004e 1202)
        Model name:          POWER9, altivec supported                        Model name:          POWER9, altivec supported
        CPU max MHz:         3800.0000                                        CPU max MHz:         3800.0000
        CPU min MHz:         2166.0000                                        CPU min MHz:         2166.0000
        L1d cache:           32K                                              L1d cache:           32K
        L1i cache:           32K                                              L1i cache:           32K
        L2 cache:            512K                                             L2 cache:            512K
        L3 cache:            10240K                                           L3 cache:            10240K
        NUMA node0 CPU(s):   0-79                                             NUMA node0 CPU(s):   0-79
        NUMA node8 CPU(s):   80-159                                           NUMA node8 CPU(s):   80-159
      
        grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
        -----------------------------------------------------                 -----------------------------------------------------
        /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT                   /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT
        /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE                 /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE
        /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE                   /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE
        /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA                  /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA
      
        grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags               grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
        ------------------------------------------------------                ------------------------------------------------------
        /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391                 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391
        /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327                 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327
        /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071                 /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071
        /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801                /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801
      
      Baseline
      
        head /proc/schedstat
        --------------------
        version 15
        timestamp 4295043536
        cpu0 0 0 0 0 0 0 9597119314 2408913694 11897
        domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu1 0 0 0 0 0 0 4941435230 11106132 1583
        domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
      Baseline + Coregroup Support
      
        head /proc/schedstat
        --------------------
        version 15
        timestamp 4296311826
        cpu0 0 0 0 0 0 0 3353674045024 3781680865826 297483
        domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu1 0 0 0 0 0 0 3337873293332 4231590033856 229090
        domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
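The domain lines in /proc/schedstat carry the span of each sched domain as a comma-separated hexadecimal cpumask (32-bit words, most significant word first). A small sketch for turning such a mask into a CPU list, using the cpu0 masks from the output above; the helper name is illustrative.

```python
def cpus_in_mask(mask):
    """Return the sorted CPU numbers set in a comma-separated hex cpumask."""
    value = int(mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


# cpu0 domain spans from the SMT4 output above.
smt = cpus_in_mask("00000000,00000000,00000000,00000000,0000000f")
cache = cpus_in_mask("00000000,00000000,00000000,00000000,000000ff")
die = cpus_in_mask("00000000,00000000,0000ffff,ffffffff,ffffffff")
numa = cpus_in_mask("ffffffff,ffffffff,ffffffff,ffffffff,ffffffff")

print(len(smt), len(cache), len(die), len(numa))
```

The four spans hold 4, 8, 80 and 160 CPUs, which agrees with the lscpu output (4 threads per core, CPUs 0-79 on node 0, 160 CPUs in total).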
      
        Post sudo ppc64_cpu --smt=1                                           Post sudo ppc64_cpu --smt=1
        ---------------------                                                 ---------------------
        grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
        -----------------------------------------------------                 -----------------------------------------------------
        /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE                 /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE
        /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE                   /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE
        /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA                  /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA
      
        grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags               grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
        ------------------------------------------------------                ------------------------------------------------------
        /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327                 /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327
        /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071                 /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071
        /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801                /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801
      
      Baseline:
      
        head /proc/schedstat
        --------------------
        version 15
        timestamp 4295046242
        cpu0 0 0 0 0 0 0 10978610020 2658997390 13068
        domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu4 0 0 0 0 0 0 5408663896 95701034 7697
        domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
      Baseline + Coregroup Support
      
        head /proc/schedstat
        --------------------
        version 15
        timestamp 4296314905
        cpu0 0 0 0 0 0 0 3355392013536 3781975150576 298723
        domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu4 0 0 0 0 0 0 3351637920996 4427329763050 256776
        domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
      Similar verification was done on Power 8 (8 Node / 256 CPU LPAR) and
      Power 9 (2 Node / 128 CPU LPAR), and both showed the topology before
      and after the patchset to be identical. If interested, I can provide
      the same.
      
      On Power 9 (with device-tree enablement to show coregroups):
      
        $ tail /proc/cpuinfo
        processor     : 127
        cpu           : POWER9 (architected), altivec supported
        clock         : 3000.000000MHz
        revision      : 2.2 (pvr 004e 0202)
      
        timebase      : 512000000
        platform      : pSeries
        model         : IBM,9008-22L
        machine       : CHRP IBM,9008-22L
        MMU           : Hash
      
      Before patchset:
      
        $ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
        SMT
        CACHE
        DIE
        NUMA
      
        $ head /proc/schedstat
        version 15
        timestamp 4318242208
        cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
        domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu1 0 0 0 0 0 0 24177439200 413887604 75393
        domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      
      After patchset:
      
        $ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
        SMT
        CACHE
        MC
        DIE
        NUMA
      
        $ head /proc/schedstat
        version 15
        timestamp 4318242208
        cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
        domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain2 00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain3 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        domain4 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        cpu1 0 0 0 0 0 0 24177439200 413887604 75393
        domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
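One way to sanity-check the new hierarchy is to confirm that each domain span of cpu0 is contained in the next larger one and to count the CPUs per level. This is a sketch over the masks printed above, pairing them with the domain names in the order shown; the function names are illustrative.

```python
def parse_mask(mask):
    """Parse a comma-separated hex cpumask (32-bit words, most significant first)."""
    return int(mask.replace(",", ""), 16)


# cpu0 domain spans after the patchset, copied from the output above.
after_patchset = {
    "SMT": "00000000,00000000,00000000,00000055",
    "CACHE": "00000000,00000000,00000000,000000ff",
    "MC": "00000000,00000000,00000000,ffffffff",
    "DIE": "00000000,00000000,ffffffff,ffffffff",
    "NUMA": "ffffffff,ffffffff,ffffffff,ffffffff",
}


def cpu_counts(domains):
    """Number of CPUs in each domain span."""
    return {name: bin(parse_mask(mask)).count("1") for name, mask in domains.items()}


def spans_are_nested(domains):
    """True if every span is a subset of the span of the next level up."""
    values = [parse_mask(mask) for mask in domains.values()]
    return all((inner & outer) == inner for inner, outer in zip(values, values[1:]))


counts = cpu_counts(after_patchset)
nested = spans_are_nested(after_patchset)
print(counts, nested)
```

Here the MC level covers 32 CPUs, half of the 64-CPU DIE span, so on this system the coregroup splits each DIE in two.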
    • powerpc/smp: Implement cpu_to_coregroup_id · fa35e868
      Srikar Dronamraju authored
      Lookup the coregroup id from the associativity array.
      
      If unable to detect the coregroup id, fall back on the core id.
      This ensures that the sched_domain degenerates and an extra sched
      domain is not created.
      
      Ideally this function should have been implemented in
      arch/powerpc/kernel/smp.c. However, if it's implemented in mm/numa.c, we
      don't need to find the primary domain again.
      
      If the device-tree mentions more than one coregroup, then the kernel
      implements only the last or the smallest coregroup, which currently
      corresponds to the penultimate domain in the device-tree.
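The lookup and its fallback can be modelled outside the kernel. The following is a toy sketch, not the kernel implementation: it assumes an associativity array whose first cell holds the number of domain ids that follow, with the last id referring to the core, and the parameter names are hypothetical.

```python
def cpu_to_coregroup_id(associativity, primary_domain_index, core_id):
    """Toy model: return the coregroup id, or core_id when none is detected.

    associativity: [count, id_1, ..., id_count] (assumed layout; the last
        id refers to the core).
    primary_domain_index: 1-based position of the primary reference domain
        (hypothetical parameter name).
    core_id: value to fall back on.
    """
    if not associativity or primary_domain_index < 1:
        return core_id
    count = associativity[0]
    if count != len(associativity) - 1:
        return core_id  # malformed array: treat as "unable to detect"
    penultimate = count - 1
    # A coregroup exists only if the primary domain sits above the
    # penultimate domain; the penultimate entry is then the coregroup id.
    if primary_domain_index < penultimate:
        return associativity[penultimate]
    return core_id


with_coregroup = cpu_to_coregroup_id([5, 0, 0, 1, 7, 40], 3, 10)
without_coregroup = cpu_to_coregroup_id([5, 0, 0, 1, 7, 40], 4, 10)
print(with_coregroup, without_coregroup)
```

Returning the core id in the fallback case gives every core its own group, which is what lets the extra sched domain degenerate.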
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com
    • powerpc/smp: Create coregroup domain · 72730bfc
      Srikar Dronamraju authored
      Add percpu coregroup maps and masks to create coregroup domain.
      If a coregroup doesn't exist, the coregroup domain will be degenerated
      in favour of SMT/CACHE domain. Do note this patch is only creating stubs
      for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
      would be in a subsequent patch.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
    • powerpc/smp: Allocate cpumask only after searching thread group · 6e086302
      Srikar Dronamraju authored
      If allocated earlier and the search fails, then cpu_l1_cache_map cpumask
      is unnecessarily cleared. However cpu_l1_cache_map can be allocated /
      cleared after we search thread group.
      
      Please note CONFIG_CPUMASK_OFFSTACK is not set on Powerpc. Hence cpumask
      allocated by zalloc_cpumask_var_node is never freed.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-9-srikar@linux.vnet.ibm.com
    • powerpc/numa: Detect support for coregroup · f9f130ff
      Srikar Dronamraju authored
      Add support for grouping cores based on the device-tree classification.
      - The last domain in the associativity domains always refers to the
        core.
      - If the primary reference domain happens to be the penultimate domain
        in the associativity domains device-tree property, then there are no
        coregroups. However, if it's not the penultimate domain, then there
        are coregroups. There can be more than one coregroup. For now we
        are interested in the last or the smallest coregroups, i.e. one
        sub-group per DIE.
      
      Currently no firmware exposes this grouping. Hence
      allow the basis for grouping to be abstract. Once the firmware starts
      using this grouping, code would be added to detect the type of grouping
      and adjust the sd domain flags accordingly.
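The detection rule described above can be illustrated by listing which associativity levels would count as coregroups. This is a sketch under stated assumptions: positions are 1-based, the last position is the core, and both function names are hypothetical rather than taken from the kernel.

```python
def coregroup_levels(domain_count, primary_domain_index):
    """1-based positions of the domains strictly between the primary
    reference domain and the core (the last domain)."""
    return list(range(primary_domain_index + 1, domain_count))


def chosen_coregroup_level(domain_count, primary_domain_index):
    """The last (smallest) coregroup level, or None when there is none."""
    levels = coregroup_levels(domain_count, primary_domain_index)
    return levels[-1] if levels else None


# Primary domain is the penultimate one: no coregroups.
none_found = chosen_coregroup_level(domain_count=4, primary_domain_index=3)

# Two levels sit between the primary domain and the core: pick the smaller.
picked = chosen_coregroup_level(domain_count=5, primary_domain_index=2)
print(none_found, picked)
```

When several levels qualify, the one closest to the core is chosen, matching the stated interest in the smallest coregroup.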
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
    • powerpc/smp: Optimize start_secondary · caa8e29d
      Srikar Dronamraju authored
      In start_secondary, even if shared_cache was already set, the system
      does a redundant match for cpumask. This redundant check can be removed
      by checking if shared_cache is already set.
      
      While here, localize the sibling_mask variable to within the if
      condition.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-7-srikar@linux.vnet.ibm.com
    • powerpc/smp: Dont assume l2-cache to be superset of sibling · f6606cfd
      Srikar Dronamraju authored
      Current code assumes that the cpumask of cpus sharing an l2-cache will
      always be a superset of cpu_sibling_mask.
      
      Let's stop making that assumption. cpu_l2_cache_mask is a superset of
      cpu_sibling_mask if and only if shared_caches is set.
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200913171038.GB11808@linux.vnet.ibm.com
    • powerpc/smp: Move topology fixups into a new function · 3c6032a8
      Srikar Dronamraju authored
      Move topology fixup based on the platform attributes into its own
      function which is called just before set_sched_topology.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-5-srikar@linux.vnet.ibm.com
    • powerpc/smp: Move powerpc_topology above · 5e93f16a
      Srikar Dronamraju authored
      Move the powerpc_topology description higher up in the file.
      This helps in using functions in this file and avoids forward
      declarations.
      
      No other functional changes.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-4-srikar@linux.vnet.ibm.com
    • powerpc/smp: Merge Power9 topology with Power topology · 2ef0ca54
      Srikar Dronamraju authored
      A new sched_domain_topology_level was added just for Power9. However, the
      same can be achieved by merging powerpc_topology with power9_topology,
      which makes the code simpler, especially when adding a new sched
      domain.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-3-srikar@linux.vnet.ibm.com
    • powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES · d0fd24bb
      Srikar Dronamraju authored
      Fix a build warning in a non-CONFIG_NEED_MULTIPLE_NODES build:
      "error: _numa_cpu_lookup_table_ undeclared"
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-2-srikar@linux.vnet.ibm.com
    • powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
      Srikar Dronamraju authored
      Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
      possible nodes marks node 0 as online at boot. However, in practice,
      there are systems which have node 0 as memoryless and cpuless.
      
      This can cause numa_balancing to be enabled on systems with only one node
      with memory and CPUs. The existence of this dummy node, which is cpuless
      and memoryless, can confuse users/scripts looking at the output of lscpu /
      numactl.
      
      By marking node 0 as offline, let's stop assuming that node 0 is
      always online. If node 0 has CPUs or memory that are online, node 0 will
      again be set as online.
      
      v5.8
       available: 2 nodes (0,2)
       node 0 cpus:
       node 0 size: 0 MB
       node 0 free: 0 MB
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31490 MB
       node distances:
       node   0   2
         0:  10  20
         2:  20  10
      
      proc and sys files
      ------------------
       /sys/devices/system/node/online:            0,2
       /proc/sys/kernel/numa_balancing:            1
       /sys/devices/system/node/has_cpu:           2
       /sys/devices/system/node/has_memory:        2
       /sys/devices/system/node/has_normal_memory: 2
       /sys/devices/system/node/possible:          0-31
      
      v5.8 + patch
      ------------------
       available: 1 nodes (2)
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31487 MB
       node distances:
       node   2
         2:  10
      
      proc and sys files
      ------------------
      /sys/devices/system/node/online:            2
      /proc/sys/kernel/numa_balancing:            0
      /sys/devices/system/node/has_cpu:           2
      /sys/devices/system/node/has_memory:        2
      /sys/devices/system/node/has_normal_memory: 2
      /sys/devices/system/node/possible:          0-31
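The effect on the sysfs values above can be modelled in a few lines. This is a simplified sketch, not kernel code: the function names are hypothetical, and the rule that numa_balancing is enabled when more than one node is online is a simplification assumed from the listings above.

```python
def online_nodes(possible, has_cpu, has_memory, node0_always_online):
    """Nodes considered online: those with CPUs or memory, plus node 0
    when the old behaviour (node 0 marked online at boot) is modelled."""
    online = {node for node in possible if node in has_cpu or node in has_memory}
    if node0_always_online:
        online.add(0)
    return sorted(online)


def numa_balancing_enabled(online):
    """Assumed simplification: enabled only with more than one online node."""
    return len(online) > 1


possible = range(32)  # /sys/devices/system/node/possible: 0-31
has_cpu = {2}
has_memory = {2}

before = online_nodes(possible, has_cpu, has_memory, node0_always_online=True)
after = online_nodes(possible, has_cpu, has_memory, node0_always_online=False)
print(before, numa_balancing_enabled(before))
print(after, numa_balancing_enabled(after))
```

With the inputs from the listings, the model yields nodes 0 and 2 online with balancing on for v5.8, and only node 2 online with balancing off once node 0 is no longer assumed online.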
      
      Example of a system with online CPUs/memory on node 0.
      (Same output with and without the patch)
      numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 0 size: 32482 MB
      node 0 free: 22994 MB
      node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 2 size: 0 MB
      node 2 free: 0 MB
      node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
      node 3 size: 0 MB
      node 3 free: 0 MB
      node distances:
      node   0   1   2   3
        0:  10  20  40  40
        1:  20  10  40  40
        2:  40  40  10  20
        3:  40  40  20  10
      
      Note: On Powerpc, cpu_to_node of possible but not present cpus would
      previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
      numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
      queried from vphn"). Without these 2 commits, a Powerpc system might crash.
      
      1. User space applications like numactl and lscpu that parse sysfs tend to
      believe there is an extra online node. This tends to confuse users and
      applications. Other user space applications start believing that the system
      was not able to use all the resources (i.e. missing resources) or that the
      system was not set up correctly.
      
      2. The existence of the dummy node also leads to inconsistent information.
      The number of online nodes is inconsistent with the information in the
      device-tree and resource-dump.
      
      3. When the dummy node is present, single node non-NUMA systems end up showing
      up as NUMA systems and numa_balancing gets enabled. This means we take
      the hit from the unnecessary numa hinting faults.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
    • powerpc/numa: Prefer node id queried from vphn · 6398eaa2
      Srikar Dronamraju authored
      The node id queried from the static device tree may not
      be correct. For example, it may always show 0 on a shared processor.
      Hence prefer the node id queried from vphn and fall back on the device
      tree based node id if the vphn query fails.
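The preference order can be written down as a small decision function. This is an illustrative sketch only: the function name is hypothetical, and representing a failed vphn query as None is a convention chosen for the example.

```python
def pick_node_id(vphn_node_id, device_tree_node_id):
    """Prefer the node id from the vphn query; fall back on the
    device-tree value when the query failed (modelled here as None)
    or returned a negative id."""
    if vphn_node_id is not None and vphn_node_id >= 0:
        return vphn_node_id
    return device_tree_node_id


# Shared processor case: the static device tree reports 0, vphn reports 2.
preferred = pick_node_id(vphn_node_id=2, device_tree_node_id=0)

# vphn query failed: use the device-tree value.
fallback = pick_node_id(vphn_node_id=None, device_tree_node_id=1)
print(preferred, fallback)
```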
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
    • powerpc/numa: Set numa_node for all possible cpus · a874f100
      Srikar Dronamraju authored
      A Powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 does not have any
      cpus or memory attached to it. As per PAPR, the node affinity of a cpu is
      only available once it's present / online. For all cpus that are possible
      but not present, cpu_to_node() would point to node 0.
      
      To ensure a cpuless, memoryless dummy node is not online, powerpc needs
      to make sure cpu_to_node for all possible but not present cpus is set to
      a proper node.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com