• Chen Yu's avatar
    sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg · 70fb5ccf
    Chen Yu authored
    [Problem Statement]
    select_idle_cpu() might spend too much time searching for an idle CPU,
    when the system is overloaded.
    
    The following histogram is the time spent in select_idle_cpu(),
    when running 224 instances of netperf on a system with 112 CPUs
    per LLC domain:
    
    @usecs:
    [0]                  533 |                                                    |
    [1]                 5495 |                                                    |
    [2, 4)             12008 |                                                    |
    [4, 8)            239252 |                                                    |
    [8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
    [16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
    [32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
    [128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
    [256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
    [512, 1K)        2600472 |@@@@@@@@@                                           |
    [1K, 2K)          927912 |@@@                                                 |
    [2K, 4K)          218720 |                                                    |
    [4K, 8K)           98161 |                                                    |
    [8K, 16K)          37722 |                                                    |
    [16K, 32K)          6715 |                                                    |
    [32K, 64K)           477 |                                                    |
    [64K, 128K)            7 |                                                    |
    
    netperf latency usecs:
    =======
    case            	load    	    Lat_99th	    std%
    TCP_RR          	thread-224	      257.39	(  0.21)
    
    The time spent in select_idle_cpu() is visible to netperf and might have a negative
    impact.
    
    [Symptom analysis]
    The patch [1] from Mel Gorman has been applied to track the efficiency
    of select_idle_sibling. Copy the indicators here:
    
    SIS Search Efficiency(se_eff%):
            A ratio expressed as a percentage of runqueues scanned versus
            idle CPUs found. A 100% efficiency indicates that the target,
            prev or recent CPU of a task was idle at wakeup. The lower the
            efficiency, the more runqueues were scanned before an idle CPU
            was found.
    
    SIS Domain Search Efficiency(dom_eff%):
            Similar, except only for the slower SIS
    	patch.
    
    SIS Fast Success Rate(fast_rate%):
            Percentage of SIS that used target, prev or
    	recent CPUs.
    
    SIS Success rate(success_rate%):
            Percentage of scans that found an idle CPU.
    
    The test is based on Aubrey's schedtests tool, including netperf, hackbench,
    schbench and tbench.
    
    Test on vanilla kernel:
    schedstat_parse.py -f netperf_vanilla.log
    case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
    TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
    TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
    TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
    TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
    TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
    TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
    TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
    TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
    UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
    UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
    UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
    UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
    UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
    UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
    UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
    UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860
    
    schedstat_parse.py -f schbench_vanilla.log
    (each group has 28 tasks)
    case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
    normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
    normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
    normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
    normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
    normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
    normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
    normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
    normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549
    
    schedstat_parse.py -f hackbench_vanilla.log
    (each group has 28 tasks)
    case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
    process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
    process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
    process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
    process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
    process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
    process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
    process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
    process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
    process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
    process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
    process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
    process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
    process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
    process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
    process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
    process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
    threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
    threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
    threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
    threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
    threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
    threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
    threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
    threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
    threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
    threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
    threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
    threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
    threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
    threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
    threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
    threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312
    
    schedstat_parse.py -f tbench_vanilla.log
    case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
    loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
    loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
    loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
    loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
    loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
    loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
    loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
    loopback	 224 threads	        0.000	       0.000	       0.000	       0.000
    
    According to the test above, if the system becomes busy, the
    SIS Search Efficiency(se_eff%) drops significantly. Although some
    benchmarks would finally find an idle CPU(success_rate% = 100%), it is
    doubtful whether it is worth it to search the whole LLC domain.
    
    [Proposal]
    It would be ideal to have a crystal ball to answer this question:
    How many CPUs must a wakeup path walk down, before it can find an idle
    CPU? Many potential metrics could be used to predict the number.
    One candidate is the sum of util_avg in this LLC domain. The benefit
    of choosing util_avg is that it is a metric of accumulated historic
    activity, which seems to be smoother than instantaneous metrics
    (such as rq->nr_running). Besides, choosing the sum of util_avg
    would help predict the load of the LLC domain more precisely, because
    SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
    time.
    
    In summary, the lower the util_avg is, the more select_idle_cpu()
    should scan for idle CPU, and vice versa. When the sum of util_avg
    in this LLC domain hits 85% or above, the scan stops. The reason to
    choose 85% as the threshold is that this is the imbalance_pct(117)
    when a LLC sched group is overloaded.
    
    Introduce the quadratic function:
    
    y = SCHED_CAPACITY_SCALE - p * x^2
    and y'= y / SCHED_CAPACITY_SCALE
    
    x is the ratio of sum_util compared to the CPU capacity:
    x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
    y' is the ratio of CPUs to be scanned in the LLC domain,
    and the number of CPUs to scan is calculated by:
    
    nr_scan = llc_weight * y'
    
    Choosing quadratic function is because:
    [1] Compared to the linear function, it scans more aggressively when the
        sum_util is low.
    [2] Compared to the exponential function, it is easier to calculate.
    [3] It seems that there is no accurate mapping between the sum of util_avg
        and the number of CPUs to be scanned. Use heuristic scan for now.
    
    For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
    sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
    scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...
    
    For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
    sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
    scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...
    
    Furthermore, to minimize the overhead of calculating the metrics in
    select_idle_cpu(), borrow the statistics from periodic load balance.
    As mentioned by Abel, on a platform with 112 CPUs per LLC, the
    sum_util calculated by periodic load balance after 112 ms would
    decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
    in reflecting the latest utilization. But it is a trade-off.
    Checking the util_avg in newidle load balance would be more frequent,
    but it brings overhead - multiple CPUs write/read the per-LLC shared
    variable and introduces cache contention. Tim also mentioned that,
    it is allowed to be non-optimal in terms of scheduling for the
    short-term variations, but if there is a long-term trend in the load
    behavior, the scheduler can adjust for that.
    
    When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
    calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
    Mel suggested, SIS_UTIL should be enabled by default.
    
    This patch is based on the util_avg, which is very sensitive to the
    CPU frequency invariance. There is an issue that, when the max frequency
    has been clamp, the util_avg would decay insanely fast when
    the CPU is idle. Commit addca285 ("cpufreq: intel_pstate: Handle no_turbo
    in frequency invariance") could be used to mitigate this symptom, by adjusting
    the arch_max_freq_ratio when turbo is disabled. But this issue is still
    not thoroughly fixed, because the current code is unaware of the user-specified
    max CPU frequency.
    
    [Test result]
    
    netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
    175% 200% of CPU number respectively. Hackbench and schbench were launched
    by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.
    
    The following is the benchmark result comparison between
    baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
    indicates better performance.
    
    Each netperf test is a:
    netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
    netperf.throughput
    =======
    case            	load    	baseline(std%)	compare%( std%)
    TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
    TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
    TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
    TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
    TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
    TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
    TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
    TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
    UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
    UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
    UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
    UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
    UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
    UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
    UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
    UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)
    
    Take the 224 threads as an example, the SIS search metrics changes are
    illustrated below:
    
        vanilla                    patched
       4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
         38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
      128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
       5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg
    
    There is -87.9% less CPU scans after patched, which indicates lower overhead.
    Besides, with this patch applied, there is -13% less rq lock contention
    in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
    .try_to_wake_up.default_wake_function.woken_wake_function.
    This might help explain the performance improvement - Because this patch allows
    the waking task to remain on the previous CPU, rather than grabbing other CPUs'
    lock.
    
    Each hackbench test is a:
    hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
    hackbench.throughput
    =========
    case            	load    	baseline(std%)	compare%( std%)
    process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
    process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
    process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
    process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
    process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
    process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
    process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
    process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
    threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
    threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
    threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
    threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
    threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
    threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
    threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
    threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)
    
    Each tbench test is a:
    tbench -t 100 $job 127.0.0.1
    tbench.throughput
    ======
    case            	load    	baseline(std%)	compare%( std%)
    loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
    loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
    loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
    loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
    loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
    loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
    loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
    loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)
    
    Each schbench test is a:
    schbench -m $job -t 28 -r 100 -s 30000 -c 30000
    schbench.latency_90%_us
    ========
    case            	load    	baseline(std%)	compare%( std%)
    normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
    normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
    normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
    normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)
    
    *Consider the Standard Deviation, this -7.36% regression might not be valid.
    
    Also, a OLTP workload with a commercial RDBMS has been tested, and there
    is no significant change.
    
    There were concerns that unbalanced tasks among CPUs would cause problems.
    For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
    bound to CPU0~CPU6, while CPU7 is idle:
    
              CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
    util_avg  1024    1024    1024    1024    1024    1024    1024    0
    
    Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
    select_idle_cpu() will not scan, thus CPU7 is undetected during scan.
    But according to Mel, it is unlikely the CPU7 will be idle all the time
    because CPU7 could pull some tasks via CPU_NEWLY_IDLE.
    
    lkp(kernel test robot) has reported a regression on stress-ng.sock on a
    very busy system. According to the sched_debug statistics, it might be caused
    by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this
    might introduce more context switch, especially involuntary preemption, which
    impacts a busy stress-ng. This regression has shown that, not all benchmarks
    in every scenario benefit from idle CPU scan limit, and it needs further
    investigation.
    
    Besides, there is slight regression in hackbench's 16 groups case when the
    LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
    in an LLC domain with 16 CPUs. Because the cost to search for an idle one
    among 16 CPUs is negligible. The current patch aims to propose a generic
    solution and only considers the util_avg. Something like the below could
    be applied on top of the current patch to fulfill the requirement:
    
    	if (llc_weight <= 16)
    		nr_scan = nr_scan * 32 / llc_weight;
    
    For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large.
    The smaller the CPU number this LLC domain has, the larger nr_scan will be
    expanded. This needs further investigation.
    
    There is also ongoing work[2] from Abel to filter out the busy CPUs during
    wakeup, to further speed up the idle CPU scan. And it could be a following-up
    optimization on top of this change.
    Suggested-by: default avatarTim Chen <tim.c.chen@intel.com>
    Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
    Signed-off-by: default avatarChen Yu <yu.c.chen@intel.com>
    Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: default avatarYicong Yang <yangyicong@hisilicon.com>
    Tested-by: default avatarMohini Narkhede <mohini.narkhede@intel.com>
    Tested-by: default avatarK Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
    70fb5ccf
fair.c 312 KB