    sched/fair: Optimize should_we_balance() for large SMT systems

    should_we_balance() is called in load_balance() to find out if the CPU that
    is trying to do the load balance is the right one or not.
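
    For context, should_we_balance() gates the rest of load_balance(); the call site
    looks roughly like the abridged sketch below (details may differ between kernel
    versions):

      /* Inside load_balance(), before any task movement is attempted: */
      if (!should_we_balance(&env)) {
              *continue_balancing = 0;
              goto out_balanced;
      }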
    
    With commit:
    
      b1bfeab9 ("sched/fair: Consider the idle state of the whole core for load balance")
    
    the code tries to find an idle core to do the load balancing
    and falls back on an idle sibling CPU if there is no idle core.
    
    However, on larger SMT systems, it could needlessly keep iterating to find an idle
    CPU by scanning all the CPUs of a non-idle core. If the core is not idle and the
    first idle SMT sibling has already been found, there is no need to check the
    remaining SMT siblings of that core for idleness.
    
    For example, with SMT4, say Core0 has CPUs 0, 2, 4 and 6, CPU0 is busy, the rest
    are idle, and the balancing domain is MC/DIE. CPU2 will be recorded as the first
    idle_smt, and the same check would then be repeated for CPU4 and CPU6, which is
    unnecessary. Since is_core_idle() loops through all CPUs in the SMT mask, the
    effect is multiplied by the weight of smt_mask: with just 1 CPU busy, we skip the
    check for 2 CPUs (CPU4 and CPU6) and thereby avoid iterating over 8 CPUs inside
    is_core_idle(). The effect is larger in the DIE/NUMA domains, where there are
    more cores.
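
    To illustrate, below is a simplified sketch of the CPU-selection loop in
    should_we_balance() with this sibling-skipping applied. It is not the literal
    patch: the per-CPU scratch mask (swb_tmpmask / swb_cpus) is an illustrative
    stand-in used to make the sketch self-contained, while is_core_idle(),
    group_balance_mask() and group_balance_cpu() are the existing fair.c helpers.

      struct sched_group *sg = env->sd->groups;
      /* Scratch copy of the balance mask; name and setup are illustrative. */
      struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(swb_tmpmask);
      int cpu, idle_smt = -1;

      cpumask_copy(swb_cpus, group_balance_mask(sg));
      /* Try to find the first idle CPU. */
      for_each_cpu_and(cpu, swb_cpus, env->cpus) {
              if (!idle_cpu(cpu))
                      continue;

              if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
                      /*
                       * Busy core: remember its first idle SMT sibling in case
                       * no fully idle core is found ...
                       */
                      if (idle_smt == -1)
                              idle_smt = cpu;
                      /*
                       * ... and drop this core's remaining siblings from the
                       * iteration mask: once one idle sibling of a busy core
                       * is known, checking the others cannot change the result.
                       */
      #ifdef CONFIG_SCHED_SMT
                      cpumask_andnot(swb_cpus, swb_cpus, cpu_smt_mask(cpu));
      #endif
                      continue;
              }

              /* First idle CPU of an idle core: balance only if it is us. */
              return cpu == env->dst_cpu;
      }

      /* No idle core found: fall back to the first idle SMT sibling, if any. */
      if (idle_smt != -1)
              return idle_smt == env->dst_cpu;

      /* Otherwise the group's designated balance CPU does the work. */
      return group_balance_cpu(sg) == env->dst_cpu;

    With the mask pruned this way, the busy SMT4 core in the example above is visited
    once instead of three times, and each skipped visit also avoids the smt_mask-wide
    loop inside is_core_idle().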
    
    Testing and performance evaluation
    ==================================
    
    The test has been done on a system which has 12 big cores, i.e. 24 small cores
    with SMT=4:
    
      lscpu
      Architecture:            ppc64le
        Byte Order:            Little Endian
      CPU(s):                  96
        On-line CPU(s) list:   0-95
      Model name:              POWER10 (architected), altivec supported
        Thread(s) per core:    8
    
    The funclatency bcc tool was used to evaluate the time taken by
    should_we_balance(). For the base tip/sched/core kernel the time was collected by
    making should_we_balance() noinline, so that the tracer can attach to it. Time is
    in nanoseconds. The values were collected by running the funclatency tracer for
    60 seconds and are the average of 3 such runs. This shows the expected reduction
    in time with the patch.
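
    For reference, making should_we_balance() noinline for the measurement is a
    one-line, measurement-only change (it is not part of the patch); roughly:

      -static int should_we_balance(struct lb_env *env)
      +static noinline int should_we_balance(struct lb_env *env)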
    
    tip/sched/core was at commit:
    
      2f88c8e8 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")
    
    Results:
    
    	------------------------------------------------------------------------------
    	workload			   tip/sched/core	with_patch(%gain)
    	------------------------------------------------------------------------------
    	idle system				 809.3		 695.0(16.45)
    	stress ng – 12 threads -l 100		1013.5		 893.1(13.49)
    	stress ng – 24 threads -l 100		1073.5		 980.0(9.54)
    	stress ng – 48 threads -l 100		 683.0		 641.0(6.55)
    	stress ng – 96 threads -l 100		2421.0		2300(5.26)
    	stress ng – 96 threads -l 15		 375.5		 377.5(-0.53)
    	stress ng – 96 threads -l 25		 635.5		 637.5(-0.31)
    	stress ng – 96 threads -l 35		 934.0		 891.0(4.83)
    
    Ran schbench (old), hackbench and stress_ng to evaluate the workload performance
    between tip/sched/core and with the patch. No modifications were made to
    tip/sched/core.
    
    TL;DR:
    
    Good improvement is seen with schbench. When hackbench and stress_ng run for
    longer, good improvement is seen as well.
    
    	------------------------------------------------------------------------------
    	schbench(old)		            tip		+patch(%gain)
    	10 iterations			sched/core
    	------------------------------------------------------------------------------
    	1 Threads
    	50.0th:		      		    8.00       9.00(-12.50)
    	75.0th:   			    9.60       9.00(6.25)
    	90.0th:   			   11.80      10.20(13.56)
    	95.0th:   			   12.60      10.40(17.46)
    	99.0th:   			   13.60      11.90(12.50)
    	99.5th:   			   14.10      12.60(10.64)
    	99.9th:   			   15.90      14.60(8.18)
    	2 Threads
    	50.0th:   			    9.90       9.20(7.07)
    	75.0th:   			   12.60      10.10(19.84)
    	90.0th:   			   15.50      12.00(22.58)
    	95.0th:   			   17.70      14.00(20.90)
    	99.0th:   			   21.20      16.90(20.28)
    	99.5th:   			   22.60      17.50(22.57)
    	99.9th:   			   30.40      19.40(36.18)
    	4 Threads
    	50.0th:   			   12.50      10.60(15.20)
    	75.0th:   			   15.30      12.00(21.57)
    	90.0th:   			   18.60      14.10(24.19)
    	95.0th:   			   21.30      16.20(23.94)
    	99.0th:   			   26.00      20.70(20.38)
    	99.5th:   			   27.60      22.50(18.48)
    	99.9th:   			   33.90      31.40(7.37)
    	8 Threads
    	50.0th:   			   16.30      14.30(12.27)
    	75.0th:   			   20.20      17.40(13.86)
    	90.0th:   			   24.50      21.90(10.61)
    	95.0th:   			   27.30      24.70(9.52)
    	99.0th:   			   35.00      31.20(10.86)
    	99.5th:   			   46.40      33.30(28.23)
    	99.9th:   			   89.30      57.50(35.61)
    	16 Threads
    	50.0th:   			   22.70      20.70(8.81)
    	75.0th:   			   30.10      27.40(8.97)
    	90.0th:   			   36.00      32.80(8.89)
    	95.0th:   			   39.60      36.40(8.08)
    	99.0th:   			   49.20      44.10(10.37)
    	99.5th:   			   64.90      50.50(22.19)
    	99.9th:   			  143.50     100.60(29.90)
    	32 Threads
    	50.0th:   			   34.60      35.50(-2.60)
    	75.0th:   			   48.20      50.50(-4.77)
    	90.0th:   			   59.20      62.40(-5.41)
    	95.0th:   			   65.20      69.00(-5.83)
    	99.0th:   			   80.40      83.80(-4.23)
    	99.5th:   			  102.10      98.90(3.13)
    	99.9th:   			  727.10     506.80(30.30)
    
    schbench does improve in general. There is some run-to-run variation with
    schbench; a validation run was done to confirm that the trend is similar.
    
    	------------------------------------------------------------------------------
    	hackbench				tip	   +patch(%gain)
    	20 iterations, 50000 loops	     sched/core
    	------------------------------------------------------------------------------
    	Process 10 groups                :      11.74      11.70(0.34)
    	Process 20 groups                :      22.73      22.69(0.18)
    	Process 30 groups                :      33.39      33.40(-0.03)
    	Process 40 groups                :      43.73      43.61(0.27)
    	Process 50 groups                :      53.82      54.35(-0.98)
    	Process 60 groups                :      64.16      65.29(-1.76)
    	thread 10 Time                   :      12.81      12.79(0.16)
    	thread 20 Time                   :      24.63      24.47(0.65)
    	Process(Pipe) 10 Time            :       6.40       6.34(0.94)
    	Process(Pipe) 20 Time            :      10.62      10.63(-0.09)
    	Process(Pipe) 30 Time            :      15.09      14.84(1.66)
    	Process(Pipe) 40 Time            :      19.42      19.01(2.11)
    	Process(Pipe) 50 Time            :      24.04      23.34(2.91)
    	Process(Pipe) 60 Time            :      28.94      27.51(4.94)
    	thread(Pipe) 10 Time             :       6.96       6.87(1.29)
    	thread(Pipe) 20 Time             :      11.74      11.73(0.09)
    
    hackbench shows a slight improvement with pipes and a slight degradation with
    processes.
    
    	------------------------------------------------------------------------------
    	stress_ng				tip        +patch(%gain)
    	10 iterations 100000 cpu_ops	     sched/core
    	------------------------------------------------------------------------------
    
    	--cpu=96 -util=100 Time taken  	 :       5.30,       5.01(5.47)
    	--cpu=48 -util=100 Time taken    :       7.94,       6.73(15.24)
    	--cpu=24 -util=100 Time taken    :      11.67,       8.75(25.02)
    	--cpu=12 -util=100 Time taken    :      15.71,      15.02(4.39)
    	--cpu=96 -util=10 Time taken     :      22.71,      22.19(2.29)
    	--cpu=96 -util=20 Time taken     :      12.14,      12.37(-1.89)
    	--cpu=96 -util=30 Time taken     :       8.76,       8.86(-1.14)
    	--cpu=96 -util=40 Time taken     :       7.13,       7.14(-0.14)
    	--cpu=96 -util=50 Time taken     :       6.10,       6.13(-0.49)
    	--cpu=96 -util=60 Time taken     :       5.42,       5.41(0.18)
    	--cpu=96 -util=70 Time taken     :       4.94,       4.94(0.00)
    	--cpu=96 -util=80 Time taken     :       4.56,       4.53(0.66)
    	--cpu=96 -util=90 Time taken     :       4.27,       4.26(0.23)
    
    Good improvement is seen with 24 CPUs. In this case only one CPU per core is busy
    and no core is idle. Decent improvement in the 100% utilization case; no
    difference at the other utilization levels.
    
    Fixes: b1bfeab9 ("sched/fair: Consider the idle state of the whole core for load balance")
    Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20230902081204.232218-1-sshegde@linux.vnet.ibm.com