- 13 Jul, 2022 1 commit
-
-
Vincent Guittot authored
The capacity of the CPU available for CFS tasks can be reduced because of other activities running on the latter. In such case, it's worth trying to move CFS tasks on a CPU with more available capacity. The rework of the load balance has filtered the case when the CPU is classified to be fully busy but its capacity is reduced. Check if CPU's capacity is reduced while gathering load balance statistic and classify it group_misfit_task instead of group_fully_busy so we can try to move the load on another CPU. Reported-by: David Chen <david.chen@nutanix.com> Reported-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: David Chen <david.chen@nutanix.com> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org
-
- 04 Jul, 2022 2 commits
-
-
Uros Bizjak authored
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in set_nr_{and_not,if}_polling. x86 cmpxchg returns success in ZF flag, so this change saves a compare after cmpxchg. The definition of cmpxchg based fetch_or was changed in the same way as atomic_fetch_##op definitions were changed in e6790e4b. Also declare these two functions as inline to ensure inlining. In the case of set_nr_and_not_polling, the compiler (gcc) tries to outsmart itself by constructing the boolean return value with logic operations on the fetched value, and these extra operations enlarge the function over the inlining threshold value. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com
-
Josh Don authored
4feee7d1 previously added per-task forced idle accounting. This patch extends this to also include cgroups. rstat is used for cgroup accounting, except for the root, which uses kcpustat in order to bypass the need for doing an rstat flush when reading root stats. Only cgroup v2 is supported. Similar to the task accounting, the cgroup accounting requires that schedstats is enabled. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com
-
- 28 Jun, 2022 14 commits
-
-
Vincent Donnefort authored
find_energy_efficient_cpu() integrates a margin to protect tasks from bouncing back and forth from a CPU to another. This margin is set as being 6% of the total current energy estimated on the system. This however does not work for two reasons: 1. The energy estimation is not a good absolute value: compute_energy() used in feec() is a good estimation for task placement as it allows to compare the energy with and without a task. The computed delta will give a good overview of the cost for a certain task placement. It, however, doesn't work as an absolute estimation for the total energy of the system. First it adds the contribution to idle CPUs into the energy, second it mixes util_avg with util_est values. util_avg contains the near history for a CPU usage, it doesn't tell at all what the current utilization is. A system that has been quite busy in the near past will hold a very high energy and then a high margin preventing any task migration to a lower capacity CPU, wasting energy. It even creates a negative feedback loop: by holding the tasks on a less efficient CPU, the margin contributes in keeping the energy high. 2. The margin handicaps small tasks: On a system where the workload is composed mostly of small tasks (which is often the case on Android), the overall energy will be high enough to create a margin none of those tasks can cross. On a Pixel4, a small utilization of 5% on all the CPUs creates a global estimated energy of 140 joules, as per the Energy Model declaration of that same device. This means, after applying the 6% margin that any migration must save more than 8 joules to happen. No task with a utilization lower than 40 would then be able to migrate away from the biggest CPU of the system. The 6% of the overall system energy was brought by the following patch: (eb92692b sched/fair: Speed-up energy-aware wake-ups) It was previously 6% of the prev_cpu energy. Also, the following one made this margin value conditional on the clusters where the task fits: (8d4c97c1 sched/fair: Only compute base_energy_pd if necessary) We could simply revert that margin change to what it was, but the original version didn't have strong grounds neither and as demonstrated in (1.) the estimated energy isn't a good absolute value. Instead, removing it completely. It is indeed, made possible by recent changes that improved energy estimation comparison fairness (sched/fair: Remove task_util from effective utilization in feec()) (PM: EM: Increase energy calculation precision) and task utilization stabilization (sched/fair: Decay task util_avg during migration) Without a margin, we could have feared bouncing between CPUs. But running LISA's eas_behaviour test coverage on three different platforms (Hikey960, RB-5 and DB-845) showed no issue. Removing the energy margin enables more energy-optimized placements for a more energy efficient system. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-8-vdonnefort@google.com
-
Vincent Donnefort authored
The energy estimation in find_energy_efficient_cpu() (feec()) relies on the computation of the effective utilization for each CPU of a perf domain (PD). This effective utilization is then used as an estimation of the busy time for this pd. The function effective_cpu_util() which gives this value, scales the utilization relative to IRQ pressure on the CPU to take into account that the IRQ time is hidden from the task clock. The IRQ scaling is as follow: effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of the CPU and irq the IRQ avg time. If now we take as an example a task placement which doesn't raise the OPP on the candidate CPU, we can write the energy delta as: delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) - effective_cpu_util(cpu_util)) = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util We end-up with an energy delta depending on the IRQ avg time, which is a problem: first the time spent on IRQs by a CPU has no effect on the additional energy that would be consumed by a task. Second, we don't want to favour a CPU with a higher IRQ avg time value. Nonetheless, we need to take the IRQ avg time into account. If a task placement raises the PD's frequency, it will increase the energy cost for the entire time where the CPU is busy. A solution is to only use effective_cpu_util() with the CPU contribution part. The task contribution is added separately and scaled according to prev_cpu's IRQ time. No change for the FREQUENCY_UTIL component of the energy estimation. We still want to get the actual frequency that would be selected after the task placement. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com
-
Dietmar Eggemann authored
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays invariant after Energy Model creation, i.e. it is not updated after CPU hotplug operations. That's why the PD mask is used in conjunction with the cpu_online_mask (or Sched Domain cpumask). Thereby the cpu_online_mask is fetched multiple times (in compute_energy()) during a run-queue selection for a task. cpu_online_mask may change during this time which can lead to wrong energy calculations. To be able to avoid this, use the select_rq_mask per-cpu cpumask to create a cpumask out of PD cpumask and cpu_online_mask and pass it through the function calls of the EAS run-queue selection path. The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection (find_energy_efficient_cpu()) is now ANDed not only with the SD mask but also with the cpu_online_mask. This is fine since this cpumask has to be in syc with the one used for energy computation (compute_energy()). An exclusive cpuset setup with at least one asymmetric CPU capacity island (hence the additional AND with the SD cpumask) is the obvious exception here. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com
-
Dietmar Eggemann authored
On 21/06/2022 11:04, Vincent Donnefort wrote: > From: Dietmar Eggemann <dietmar.eggemann@arm.com> https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered that this patch doesn't build anymore (on tip sched/core or linux-next) because of commit f5b2eeb4 ("sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()"). New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to select_rq_mask below. -- >8 -- Decouple the name of the per-cpu cpumask select_idle_mask from its usage in select_idle_[cpu/capacity]() of the CFS run-queue selection (select_task_rq_fair()). This is to support the reuse of this cpumask in the Energy Aware Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue selection. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com
-
Dietmar Eggemann authored
effective_cpu_util() already has a `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions. Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a a compile-time constant (SCHED_CAPACITY_SCALE). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
-
Vincent Donnefort authored
Before being migrated to a new CPU, a task sees its PELT values synchronized with rq last_update_time. Once done, that same task will also have its sched_avg last_update_time reset. This means the time between the migration and the last clock update will not be accounted for in util_avg and a discontinuity will appear. This issue is amplified by the PELT clock scaling. It takes currently one tick after the CPU being idle to let clock_pelt catching up clock_task. This is especially problematic for asymmetric CPU capacity systems which need stable util_avg signals for task placement and energy estimation. Ideally, this problem would be solved by updating the runqueue clocks before the migration. But that would require taking the runqueue lock which is quite expensive [1]. Instead estimate the missing time and update the task util_avg with that value. To that end, we need sched_clock_cpu() but it is a costly function. Limit the usage to the case where the source CPU is idle as we know this is when the clock is having the biggest risk of being outdated. See comment in migrate_se_pelt_lag() for more details about how the PELT value is estimated. Notice though this estimation doesn't take into account IRQ and Paravirt time. [1] https://lkml.kernel.org/r/20190709115759.10451-1-chris.redpath@arm.comSigned-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-3-vdonnefort@google.com
-
Vincent Donnefort authored
Introducing macro helpers u64_u32_{store,load}() to factorize lockless accesses to u64 variables for 32-bits architectures. Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To accommodate the later where the copy lies outside of the structure (cfs_rq.last_udpate_time_copy instead of sched_avg.last_update_time_copy), use the _copy() version of those helpers. Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and therefore, have a small penalty for 32-bits machines in set_task_rq_fair() and init_cfs_rq(). Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com
-
Chen Yu authored
[Problem Statement] select_idle_cpu() might spend too much time searching for an idle CPU, when the system is overloaded. The following histogram is the time spent in select_idle_cpu(), when running 224 instances of netperf on a system with 112 CPUs per LLC domain: @usecs: [0] 533 | | [1] 5495 | | [2, 4) 12008 | | [4, 8) 239252 | | [8, 16) 4041924 |@@@@@@@@@@@@@@ | [16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [256, 512) 4507667 |@@@@@@@@@@@@@@@ | [512, 1K) 2600472 |@@@@@@@@@ | [1K, 2K) 927912 |@@@ | [2K, 4K) 218720 | | [4K, 8K) 98161 | | [8K, 16K) 37722 | | [16K, 32K) 6715 | | [32K, 64K) 477 | | [64K, 128K) 7 | | netperf latency usecs: ======= case load Lat_99th std% TCP_RR thread-224 257.39 ( 0.21) The time spent in select_idle_cpu() is visible to netperf and might have a negative impact. [Symptom analysis] The patch [1] from Mel Gorman has been applied to track the efficiency of select_idle_sibling. Copy the indicators here: SIS Search Efficiency(se_eff%): A ratio expressed as a percentage of runqueues scanned versus idle CPUs found. A 100% efficiency indicates that the target, prev or recent CPU of a task was idle at wakeup. The lower the efficiency, the more runqueues were scanned before an idle CPU was found. SIS Domain Search Efficiency(dom_eff%): Similar, except only for the slower SIS patch. SIS Fast Success Rate(fast_rate%): Percentage of SIS that used target, prev or recent CPUs. SIS Success rate(success_rate%): Percentage of scans that found an idle CPU. The test is based on Aubrey's schedtests tool, including netperf, hackbench, schbench and tbench. Test on vanilla kernel: schedstat_parse.py -f netperf_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% TCP_RR 28 threads 99.978 18.535 99.995 100.000 TCP_RR 56 threads 99.397 5.671 99.964 100.000 TCP_RR 84 threads 21.721 6.818 73.632 100.000 TCP_RR 112 threads 12.500 5.533 59.000 100.000 TCP_RR 140 threads 8.524 4.535 49.020 100.000 TCP_RR 168 threads 6.438 3.945 40.309 99.999 TCP_RR 196 threads 5.397 3.718 32.320 99.982 TCP_RR 224 threads 4.874 3.661 25.775 99.767 UDP_RR 28 threads 99.988 17.704 99.997 100.000 UDP_RR 56 threads 99.528 5.977 99.970 100.000 UDP_RR 84 threads 24.219 6.992 76.479 100.000 UDP_RR 112 threads 13.907 5.706 62.538 100.000 UDP_RR 140 threads 9.408 4.699 52.519 100.000 UDP_RR 168 threads 7.095 4.077 44.352 100.000 UDP_RR 196 threads 5.757 3.775 35.764 99.991 UDP_RR 224 threads 5.124 3.704 28.748 99.860 schedstat_parse.py -f schbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% normal 1 mthread 99.152 6.400 99.941 100.000 normal 2 mthreads 97.844 4.003 99.908 100.000 normal 3 mthreads 96.395 2.118 99.917 99.998 normal 4 mthreads 55.288 1.451 98.615 99.804 normal 5 mthreads 7.004 1.870 45.597 61.036 normal 6 mthreads 3.354 1.346 20.777 34.230 normal 7 mthreads 2.183 1.028 11.257 21.055 normal 8 mthreads 1.653 0.825 7.849 15.549 schedstat_parse.py -f hackbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% process-pipe 1 group 99.991 7.692 99.999 100.000 process-pipe 2 groups 99.934 4.615 99.997 100.000 process-pipe 3 groups 99.597 3.198 99.987 100.000 process-pipe 4 groups 98.378 2.464 99.958 100.000 process-pipe 5 groups 27.474 3.653 89.811 99.800 process-pipe 6 groups 20.201 4.098 82.763 99.570 process-pipe 7 groups 16.423 4.156 77.398 99.316 process-pipe 8 groups 13.165 3.920 72.232 98.828 process-sockets 1 group 99.977 5.882 99.999 100.000 process-sockets 2 groups 99.927 5.505 99.996 100.000 process-sockets 3 groups 99.397 3.250 99.980 100.000 process-sockets 4 groups 79.680 4.258 98.864 99.998 process-sockets 5 groups 7.673 2.503 63.659 92.115 process-sockets 6 groups 4.642 1.584 58.946 88.048 process-sockets 7 groups 3.493 1.379 49.816 81.164 process-sockets 8 groups 3.015 1.407 40.845 75.500 threads-pipe 1 group 99.997 0.000 100.000 100.000 threads-pipe 2 groups 99.894 2.932 99.997 100.000 threads-pipe 3 groups 99.611 4.117 99.983 100.000 threads-pipe 4 groups 97.703 2.624 99.937 100.000 threads-pipe 5 groups 22.919 3.623 87.150 99.764 threads-pipe 6 groups 18.016 4.038 80.491 99.557 threads-pipe 7 groups 14.663 3.991 75.239 99.247 threads-pipe 8 groups 12.242 3.808 70.651 98.644 threads-sockets 1 group 99.990 6.667 99.999 100.000 threads-sockets 2 groups 99.940 5.114 99.997 100.000 threads-sockets 3 groups 99.469 4.115 99.977 100.000 threads-sockets 4 groups 87.528 4.038 99.400 100.000 threads-sockets 5 groups 6.942 2.398 59.244 88.337 threads-sockets 6 groups 4.359 1.954 49.448 87.860 threads-sockets 7 groups 2.845 1.345 41.198 77.102 threads-sockets 8 groups 2.871 1.404 38.512 74.312 schedstat_parse.py -f tbench_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% loopback 28 threads 99.976 18.369 99.995 100.000 loopback 56 threads 99.222 7.799 99.934 100.000 loopback 84 threads 19.723 6.819 70.215 100.000 loopback 112 threads 11.283 5.371 55.371 99.999 loopback 140 threads 0.000 0.000 0.000 0.000 loopback 168 threads 0.000 0.000 0.000 0.000 loopback 196 threads 0.000 0.000 0.000 0.000 loopback 224 threads 0.000 0.000 0.000 0.000 According to the test above, if the system becomes busy, the SIS Search Efficiency(se_eff%) drops significantly. Although some benchmarks would finally find an idle CPU(success_rate% = 100%), it is doubtful whether it is worth it to search the whole LLC domain. [Proposal] It would be ideal to have a crystal ball to answer this question: How many CPUs must a wakeup path walk down, before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in this LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems to be smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg would help predict the load of the LLC domain more precisely, because SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time. In summary, the lower the util_avg is, the more select_idle_cpu() should scan for idle CPU, and vice versa. When the sum of util_avg in this LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this is the imbalance_pct(117) when a LLC sched group is overloaded. Introduce the quadratic function: y = SCHED_CAPACITY_SCALE - p * x^2 and y'= y / SCHED_CAPACITY_SCALE x is the ratio of sum_util compared to the CPU capacity: x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE) y' is the ratio of CPUs to be scanned in the LLC domain, and the number of CPUs to scan is calculated by: nr_scan = llc_weight * y' Choosing quadratic function is because: [1] Compared to the linear function, it scans more aggressively when the sum_util is low. [2] Compared to the exponential function, it is easier to calculate. [3] It seems that there is no accurate mapping between the sum of util_avg and the number of CPUs to be scanned. Use heuristic scan for now. For a platform with 112 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 112 111 108 102 93 81 65 47 25 1 0 ... For a platform with 16 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 16 15 15 14 13 11 9 6 3 0 0 ... Furthermore, to minimize the overhead of calculating the metrics in select_idle_cpu(), borrow the statistics from periodic load balance. As mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util calculated by periodic load balance after 112 ms would decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the latest utilization. But it is a trade-off. Checking the util_avg in newidle load balance would be more frequent, but it brings overhead - multiple CPUs write/read the per-LLC shared variable and introduces cache contention. Tim also mentioned that, it is allowed to be non-optimal in terms of scheduling for the short-term variations, but if there is a long-term trend in the load behavior, the scheduler can adjust for that. When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested, SIS_UTIL should be enabled by default. This patch is based on the util_avg, which is very sensitive to the CPU frequency invariance. There is an issue that, when the max frequency has been clamp, the util_avg would decay insanely fast when the CPU is idle. Commit addca285 ("cpufreq: intel_pstate: Handle no_turbo in frequency invariance") could be used to mitigate this symptom, by adjusting the arch_max_freq_ratio when turbo is disabled. But this issue is still not thoroughly fixed, because the current code is unaware of the user-specified max CPU frequency. [Test result] netperf and tbench were launched with 25% 50% 75% 100% 125% 150% 175% 200% of CPU number respectively. Hackbench and schbench were launched by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times. The following is the benchmark result comparison between baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare% indicates better performance. Each netperf test is a: netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100 netperf.throughput ======= case load baseline(std%) compare%( std%) TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40) TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20) TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40) TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22) TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19) TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18) TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43) TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85) UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33) UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21) UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32) UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61) UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04) UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16) UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93) UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62) Take the 224 threads as an example, the SIS search metrics changes are illustrated below: vanilla patched 4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg 38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg 128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg 5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg There is -87.9% less CPU scans after patched, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested .try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement - Because this patch allows the waking task to remain on the previous CPU, rather than grabbing other CPUs' lock. Each hackbench test is a: hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100 hackbench.throughput ========= case load baseline(std%) compare%( std%) process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47) process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81) process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02) process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02) process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13) process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14) process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26) process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03) threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14) threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74) threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13) threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06) threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53) threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26) threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24) threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11) Each tbench test is a: tbench -t 100 $job 127.0.0.1 tbench.throughput ====== case load baseline(std%) compare%( std%) loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09) loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17) loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13) loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03) loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19) loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27) loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17) loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22) Each schbench test is a: schbench -m $job -t 28 -r 100 -s 30000 -c 30000 schbench.latency_90%_us ======== case load baseline(std%) compare%( std%) normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)* normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79) normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64) normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28) *Consider the Standard Deviation, this -7.36% regression might not be valid. Also, a OLTP workload with a commercial RDBMS has been tested, and there is no significant change. There were concerns that unbalanced tasks among CPUs would cause problems. For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to CPU0~CPU6, while CPU7 is idle: CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 util_avg 1024 1024 1024 1024 1024 1024 1024 0 Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%, select_idle_cpu() will not scan, thus CPU7 is undetected during scan. But according to Mel, it is unlikely the CPU7 will be idle all the time because CPU7 could pull some tasks via CPU_NEWLY_IDLE. lkp(kernel test robot) has reported a regression on stress-ng.sock on a very busy system. According to the sched_debug statistics, it might be caused by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this might introduce more context switch, especially involuntary preemption, which impacts a busy stress-ng. This regression has shown that, not all benchmarks in every scenario benefit from idle CPU scan limit, and it needs further investigation. Besides, there is slight regression in hackbench's 16 groups case when the LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively in an LLC domain with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is negligible. The current patch aims to propose a generic solution and only considers the util_avg. Something like the below could be applied on top of the current patch to fulfill the requirement: if (llc_weight <= 16) nr_scan = nr_scan * 32 / llc_weight; For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large. The smaller the CPU number this LLC domain has, the larger nr_scan will be expanded. This needs further investigation. There is also ongoing work[2] from Abel to filter out the busy CPUs during wakeup, to further speed up the idle CPU scan. And it could be a following-up optimization on top of this change. Suggested-by: Tim Chen <tim.c.chen@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
-
Christian Göttsche authored
sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler() a CAP_SYS_NICE audit event unconditionally, even when the requested operation does not require that capability / is unprivileged, i.e. for reducing niceness. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to allow the task the capability CAP_SYS_NICE, while it does not need it, violating the principle of least privilege. Conduct privilged/unprivileged categorization first and perform a capable test (and at most once) only if needed. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com
-
Zhang Qiao authored
As of commit afe06efd ("sched: Extend scheduler's asym packing") group_first_cpu() became an unused function, remove it. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220617181151.29980-3-zhangqiao22@huawei.com
-
Zhang Qiao authored
" *" is redundant. so remove it. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220617181151.29980-2-zhangqiao22@huawei.com
-
Michael Jeanson authored
When checking for libc rseq support in the library constructor, don't only depend on the symbols presence, check that the registration was completed. This targets a scenario where the libc has rseq support but it is not wired for the current architecture in 'bits/rseq.h', we want to fallback to our internal registration mechanism. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20220614154830.1367382-4-mjeanson@efficios.com
-
Michael Jeanson authored
This header is also used in librseq where it can be included in C++ code, add a space between literals and string macros. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20220614154830.1367382-3-mjeanson@efficios.com
-
Michael Jeanson authored
Make the RISC-V rseq selftests compatible with glibc-2.35 by using the rseq_get_abi() helper. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20220614154830.1367382-2-mjeanson@efficios.com
-
- 13 Jun, 2022 10 commits
-
-
Tianchen Ding authored
Wakelist can help avoid cache bouncing and offload the overhead of waker cpu. So far, using wakelist within the same llc only happens on WF_ON_CPU, and this limitation could be removed to further improve wakeup performance. The commit 518cd623 ("sched: Only queue remote wakeups when crossing cache boundaries") disabled queuing tasks on wakelist when the cpus share llc. This is because, at that time, the scheduler must send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also supports TIF_POLLING, so this is not a problem now when the wakee cpu is in idle polling. Benefits: Queuing the task on idle cpu can help improving performance on waker cpu and utilization on wakee cpu, and further improve locality because the wakee cpu can handle its own rq. This patch helps improving rt on our real java workloads where wakeup happens frequently. Consider the normal condition (CPU0 and CPU1 share same llc) Before this patch: CPU0 CPU1 select_task_rq() idle rq_lock(CPU1->rq) enqueue_task(CPU1->rq) notify CPU1 (by sending IPI or CPU1 polling) resched() After this patch: CPU0 CPU1 select_task_rq() idle add to wakelist of CPU1 notify CPU1 (by sending IPI or CPU1 polling) rq_lock(CPU1->rq) enqueue_task(CPU1->rq) resched() We see CPU0 can finish its work earlier. It only needs to put task to wakelist and return. While CPU1 is idle, so let itself handle its own runqueue data. This patch brings no difference about IPI. This patch only takes effect when the wakee cpu is: 1) idle polling 2) idle not polling For 1), there will be no IPI with or without this patch. For 2), there will always be an IPI before or after this patch. Before this patch: waker cpu will enqueue task and check preempt. Since "idle" will be sure to be preempted, waker cpu must send a resched IPI. After this patch: waker cpu will put the task to the wakelist of wakee cpu, and send an IPI. Benchmark: We've tested schbench, unixbench, and hachbench on both x86 and arm64. On x86 (Intel Xeon Platinum 8269CY): schbench -m 2 -t 8 Latency percentiles (usec) before after 50.0000th: 8 6 75.0000th: 10 7 90.0000th: 11 8 95.0000th: 12 8 *99.0000th: 13 10 99.5000th: 15 11 99.9000th: 18 14 Unixbench with full threads (104) before after Dhrystone 2 using register variables 3011862938 3009935994 -0.06% Double-Precision Whetstone 617119.3 617298.5 0.03% Execl Throughput 27667.3 27627.3 -0.14% File Copy 1024 bufsize 2000 maxblocks 785871.4 784906.2 -0.12% File Copy 256 bufsize 500 maxblocks 210113.6 212635.4 1.20% File Copy 4096 bufsize 8000 maxblocks 2328862.2 2320529.1 -0.36% Pipe Throughput 145535622.8 145323033.2 -0.15% Pipe-based Context Switching 3221686.4 3583975.4 11.25% Process Creation 101347.1 103345.4 1.97% Shell Scripts (1 concurrent) 120193.5 123977.8 3.15% Shell Scripts (8 concurrent) 17233.4 17138.4 -0.55% System Call Overhead 5300604.8 5312213.6 0.22% hackbench -g 1 -l 100000 before after Time 3.246 2.251 On arm64 (Ampere Altra): schbench -m 2 -t 8 Latency percentiles (usec) before after 50.0000th: 14 10 75.0000th: 19 14 90.0000th: 22 16 95.0000th: 23 16 *99.0000th: 24 17 99.5000th: 24 17 99.9000th: 28 25 Unixbench with full threads (80) before after Dhrystone 2 using register variables 3536194249 3537019613 0.02% Double-Precision Whetstone 629383.6 629431.6 0.01% Execl Throughput 65920.5 65846.2 -0.11% File Copy 1024 bufsize 2000 maxblocks 1063722.8 1064026b.8 0.03% File Copy 256 bufsize 500 maxblocks 322684.5 318724.5 -1.23% File Copy 4096 bufsize 8000 maxblocks 2348285.3 2328804.8 -0.83% Pipe Throughput 133542875.3 131619389.8 -1.44% Pipe-based Context Switching 3215356.1 3576945.1 11.25% Process Creation 108520.5 120184.6 10.75% Shell Scripts (1 concurrent) 122636.3 121888 -0.61% Shell Scripts (8 concurrent) 17462.1 17381.4 -0.46% System Call Overhead 4429998.9 44350061.7 0.11% hackbench -g 1 -l 100000 before after Time 4.217 2.916 Our patch has improvement on schbench, hackbench and Pipe-based Context Switching of unixbench when there exists idle cpus, and no obvious regression on other tests of unixbench. This can help improve rt in scenes where wakeup happens frequently. Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
-
Tianchen Ding authored
The commit 2ebb1771 ("sched/core: Offload wakee task activation if it the wakee is descheduling") checked rq->nr_running <= 1 to avoid task stacking when WF_ON_CPU. Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu (WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through the deactivate_task() in __schedule(), thus p has been accounted out of rq->nr_running. As such, the task being the only runnable task on the rq implies reading rq->nr_running == 0 at that point. The benchmark result is in [1]. [1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/Suggested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
-
Josh Don authored
While doing newidle load balancing, it is possible for new tasks to arrive, such as with pending wakeups. newidle_balance() already accounts for this by exiting the sched_domain load_balance() iteration if it detects these cases. This is very important for minimizing wakeup latency. However, if we are already in load_balance(), we may stay there for a while before returning back to newidle_balance(). This is most exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A very straightforward workaround to this is to adjust should_we_balance() to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are detected. This was tested with the following reproduction: - two threads that take turns sleeping and waking each other up are affined to two cores - a large number of threads with 100% utilization are pinned to all other cores Without this patch, wakeup latency was ~120us for the pair of threads, almost entirely spent in load_balance(). With this patch, wakeup latency is ~6us. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
-
Yajun Deng authored
sysctl_sched_dl_period_max and sysctl_sched_dl_period_min are unsigned integer, but proc_dointvec() wouldn't return error even if we set a negative number. Use proc_douintvec_minmax() instead of proc_dointvec(). Add extra1 for sysctl_sched_dl_period_max and extra2 for sysctl_sched_dl_period_min. It's just an optimization for match data and proc_handler in struct ctl_table. The 'if (period < min || period > max)' in __checkparam_dl() will work fine even if there hasn't this patch. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220607101807.249965-1-yajun.deng@linux.dev
-
Chengming Zhou authored
We notice the rq leaf_cfs_rq_list has two problems when do bugfix backports and some test profiling. 1. cfs_rqs under throttled subtree could be added to the list, and make their fully decayed ancestors on the list, even though not needed. 2. #1 also make the leaf_cfs_rq_list management complex and error prone, this is the list of related bugfix so far: commit 31bc6aea ("sched/fair: Optimize update_blocked_averages()") commit fe61468b ("sched/fair: Fix enqueue_task_fair warning") commit b34cb07d ("sched/fair: Fix enqueue_task_fair() warning some more") commit 39f23ce0 ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list") commit 0258bdfa ("sched/fair: Fix unfairness caused by missing load decay") commit a7b359fc ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") commit fdaba61e ("sched/fair: Ensure that the CFS parent is added after unthrottling") commit 2630cde2 ("sched/fair: Add ancestors of unthrottled undecayed cfs_rq") commit 31bc6aea ("sched/fair: Optimize update_blocked_averages()") delete every cfs_rq under throttled subtree from rq->leaf_cfs_rq_list, and delete the throttled_hierarchy() test in update_blocked_averages(), which optimized update_blocked_averages(). But those later bugfix add cfs_rqs under throttled subtree back to rq->leaf_cfs_rq_list again, with their fully decayed ancestors, for the integrity of rq->leaf_cfs_rq_list. This patch takes another method, skip all cfs_rqs under throttled hierarchy when list_add_leaf_cfs_rq(), to completely make cfs_rqs under throttled subtree off the leaf_cfs_rq_list. So we don't need to consider throttled related things in enqueue_entity(), unthrottle_cfs_rq() and enqueue_task_fair(), which simplify the code a lot. Also optimize update_blocked_averages() since cfs_rqs under throttled hierarchy and their ancestors won't be on the leaf_cfs_rq_list. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220601021848.76943-1-zhouchengming@bytedance.com
-
K Prateek Nayak authored
In the case of systems containing multiple LLCs per socket, like AMD Zen systems, users want to spread bandwidth hungry applications across multiple LLCs. Stream is one such representative workload where the best performance is obtained by limiting one stream thread per LLC. To ensure this, users are known to pin the tasks to a specify a subset of the CPUs consisting of one CPU per LLC while running such bandwidth hungry tasks. Suppose we kickstart a multi-threaded task like stream with 8 threads using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3 server where each socket contains 128 CPUs (0-63,128-191 in one socket, 64-127,192-255 in another socket) Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8 Here each CPU in the list is from a different LLC and 4 of those LLCs are on one socket, while the other 4 are on another socket. Ideally we would prefer that each stream thread runs on a different CPU from the allowed list of CPUs. However, the current heuristics in find_idlest_group() do not allow this during the initial placement. Suppose the first socket (0-63,128-191) is our local group from which we are kickstarting the stream tasks. The first four stream threads will be placed in this socket. When it comes to placing the 5th thread, all the allowed CPUs are from the local group (0,16,32,48) would have been taken. However, the current scheduler code simply checks if the number of tasks in the local group is fewer than the allowed numa-imbalance threshold. This threshold was previously 25% of the NUMA domain span (in this case threshold = 32) but after the v6 of Mel's patchset "Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip, Commit: e496132e ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs in the NUMA domain, for processors with multiple LLCs. (in this case threshold = 8). For this example, the number of tasks will always be within threshold and thus all the 8 stream threads will be woken up on the first socket thereby resulting in sub-optimal performance. The following sched_wakeup_new tracepoint output shows the initial placement of tasks in the current tip/sched/core on the Zen3 machine: stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032 stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048 stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000 stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016 stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016 stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016 stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016 Once the first four threads are distributed among the allowed CPUs of socket one, the rest of the treads start piling on these same CPUs when clearly there are CPUs on the second socket that can be used. Following the initial pile up on a small number of CPUs, though the load-balancer eventually kicks in, it takes a while to get to {4}{4} and even {4}{4} isn't stable as we observe a bunch of ping ponging between {4}{4} to {5}{3} and back before a stable state is reached much later (1 Stream thread per allowed CPU) and no more migration is required. We can detect this piling and avoid it by checking if the number of allowed CPUs in the local group are fewer than the number of tasks running in the local group and use this information to spread the 5th task out into the next socket (after all, the goal in this slowpath is to find the idlest group and the idlest CPU during the initial placement!). The following sched_wakeup_new tracepoint output shows the initial placement of tasks after adding this fix on the Zen3 machine: stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032 stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048 stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000 stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112 stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096 stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080 stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064 We see that threads are using all of the allowed CPUs and there is no pileup. No output is generated for tracepoint sched_migrate_task with this patch due to a perfect initial placement which removes the need for balancing later on - both across NUMA boundaries and within NUMA boundaries for stream. Following are the results from running 8 Stream threads with and without pinning on a dual socket Zen3 Machine (2 x 64C/128T): During the testing of this patch, the tip sched/core was at commit: 089c02ae "ftrace: Use preemption model accessors for trace header printout" Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8 5.18.0-rc1 5.18.0-rc1 5.18.0-rc1 tip sched/core tip sched/core tip sched/core (no pinning) + pinning + this-patch + pinning Copy: 109364.74 (0.00 pct) 94220.50 (-13.84 pct) 158301.28 (44.74 pct) Scale: 109670.26 (0.00 pct) 90210.59 (-17.74 pct) 149525.64 (36.34 pct) Add: 129029.01 (0.00 pct) 101906.00 (-21.02 pct) 186658.17 (44.66 pct) Triad: 127260.05 (0.00 pct) 106051.36 (-16.66 pct) 184327.30 (44.84 pct) Pinning currently hurts the performance compared to unbound case on tip/sched/core. With the addition of this patch, we are able to outperform tip/sched/core by a good margin with pinning. Following are the results from running 16 Stream threads with and without pinning on a dual socket IceLake Machine (2 x 32C/64T): NUMA Topology of Intel Skylake machine: Node 1: 0,2,4,6 ... 126 (Even numbers) Node 2: 1,3,5,7 ... 127 (Odd numbers) Pinning is done using: numactl -C 0-15 ./stream16 5.18.0-rc1 5.18.0-rc1 5.18.0-rc1 tip sched/core tip sched/core tip sched/core (no pinning) +pinning + this-patch + pinning Copy: 85815.31 (0.00 pct) 149819.21 (74.58 pct) 156807.48 (82.72 pct) Scale: 64795.60 (0.00 pct) 97595.07 (50.61 pct) 99871.96 (54.13 pct) Add: 71340.68 (0.00 pct) 111549.10 (56.36 pct) 114598.33 (60.63 pct) Triad: 68890.97 (0.00 pct) 111635.16 (62.04 pct) 114589.24 (66.33 pct) In case of Icelake machine, with single LLC per socket, pinning across the two sockets reduces cache contention, thus showing great improvement in pinned case which is further benefited by this patch. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@amd.com
-
Mel Gorman authored
For a single LLC per node, a NUMA imbalance is allowed up until 25% of CPUs sharing a node could be active. One intent of the cut-off is to avoid an imbalance of memory channels but there is no topological information based on active memory channels. Furthermore, there can be differences between nodes depending on the number of populated DIMMs. A cut-off of 25% was arbitrary but generally worked. It does have a severe corner cases though when an parallel workload is using 25% of all available CPUs over-saturates memory channels. This can happen due to the initial forking of tasks that get pulled more to one node after early wakeups (e.g. a barrier synchronisation) that is not quickly corrected by the load balancer. The LB may fail to act quickly as the parallel tasks are considered to be poor migrate candidates due to locality or cache hotness. On a range of modern Intel CPUs, 12.5% appears to be a better cut-off assuming all memory channels are populated and is used as the new cut-off point. A minimum of 1 is specified to allow a communicating pair to remain local even for CPUs with low numbers of cores. For modern AMDs, there are multiple LLCs and are not affected. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net
-
Mel Gorman authored
The imbalance limitations are applied inconsistently at fork time and at runtime. At fork, a new task can remain local until there are too many running tasks even if the degree of imbalance is larger than NUMA_IMBALANCE_MIN which is different to runtime. Secondly, the imbalance figure used during load balancing is different to the one used at NUMA placement. Load balancing uses the number of tasks that must move to restore imbalance where as NUMA balancing uses the total imbalance. In combination, it is possible for a parallel workload that uses a small number of CPUs without applying scheduler policies to have very variable run-to-run performance. [lkp@intel.com: Fix build breakage for arc-allyesconfig] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-4-mgorman@techsingularity.net
-
Mel Gorman authored
If a destination node has spare capacity but there is an imbalance then two tasks are selected for swapping. If the tasks have no numa group or are within the same NUMA group, it's simply shuffling tasks around without having any impact on the compute imbalance. Instead, it's just punishing one task to help another. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-3-mgorman@techsingularity.net
-
Mel Gorman authored
On clone, numa_migrate_retry is inherited from the parent which means that the first NUMA placement of a task is non-deterministic. This affects when load balancing recognises numa tasks and whether to migrate "regular", "remote" or "all" tasks between NUMA scheduler domains. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-2-mgorman@techsingularity.net
-
- 12 Jun, 2022 10 commits
-
-
Linus Torvalds authored
-
Linus Torvalds authored
Merge tag 'platform-drivers-x86-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver fixes from Hans de Goede: "Highlights: - Fix hp-wmi regression on HP Omen laptops introduced in 5.18 - Several hardware-id additions - A couple of other tiny fixes" * tag 'platform-drivers-x86-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: platform/x86/intel: hid: Add Surface Go to VGBS allow list platform/x86: hp-wmi: Use zero insize parameter only when supported platform/x86: hp-wmi: Resolve WMI query failures on some devices platform/x86: gigabyte-wmi: Add support for B450M DS3H-CF platform/x86: gigabyte-wmi: Add Z690M AORUS ELITE AX DDR4 support platform/x86: barco-p50-gpio: Add check for platform_driver_register platform/x86/intel: pmc: Support Intel Raptorlake P platform/x86/intel: Fix pmt_crashlog array reference platform/mellanox: Add static in struct declaration. platform/mellanox: Spelling s/platfom/platform/
-
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds authored
Pull workqueue fixes from Tejun Heo: "Tetsuo's patch to trigger build warnings if system-wide wq's are flushed along with a TP type update and trivial comment update" * tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Switch to new kerneldoc syntax for named variable macro argument workqueue: Fix type of cpu in trace event workqueue: Wrap flush_workqueue() using a macro
-
Linus Torvalds authored
Merge tag 'kbuild-fixes-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Make the *.mod build rule portable for POSIX awk - Fix regression of 'make nsdeps' - Make scripts/check-local-export working for older bash versions - Fix scripts/gdb to extract the .config data from vmlinux * tag 'kbuild-fixes-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: scripts/gdb: change kernel config dumping method scripts/check-local-export: avoid 'wait $!' for process substitution scripts/nsdeps: adjust to the format change of *.mod files kbuild: avoid regex RS for POSIX awk
-
git://git.samba.org/sfrench/cifs-2.6Linus Torvalds authored
Pull cifs client fixes from Steve French: "Three reconnect fixes, all for stable as well. One of these three reconnect fixes does address a problem with multichannel reconnect, but this does not include the additional fix (still being tested) for dynamically detecting multichannel adapter changes which will improve those reconnect scenarios even more" * tag '5.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: populate empty hostnames for extra channels cifs: return errors during session setup during reconnects cifs: fix reconnect on smb3 mount types
-
git://git.kernel.org/pub/scm/linux/kernel/git/crng/randomLinus Torvalds authored
Pull random number generator fixes from Jason Donenfeld: - A fix for a 5.19 regression for a case in which early device tree initializes the RNG, which flips a static branch. On most plaforms, jump labels aren't initialized until much later, so this caused splats. On a few mailing list threads, we cooked up easy fixes for arm64, arm32, and risc-v. But then things looked slightly more involved for xtensa, powerpc, arc, and mips. And at that point, when we're patching 7 architectures in a place before the console is even available, it seems like the cost/risk just wasn't worth it. So random.c works around it now by checking the already exported `static_key_initialized` boolean, as though somebody already ran into this issue in the past. I'm not super jazzed about that; it'd be prettier to not have to complicate downstream code. But I suppose it's practical. - A few small code nits and adding a missing __init annotation. - A change to the default config values to use the cpu and bootloader's seeds for initializing the RNG earlier. This brings them into line with what all the distros do (Fedora/RHEL, Debian, Ubuntu, Gentoo, Arch, NixOS, Alpine, SUSE, and Void... at least), and moreover will now give us test coverage in various test beds that might have caught the above device tree bug earlier. - A change to WireGuard CI's configuration to increase test coverage around the RNG. - A documentation comment fix to unrelated maintainerless CRC code that I was asked to take, I guess because it has to do with polynomials (which the RNG thankfully no longer uses). * tag 'random-5.19-rc2-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: wireguard: selftests: use maximum cpu features and allow rng seeding random: remove rng_has_arch_random() random: credit cpu and bootloader seeds by default random: do not use jump labels before they are initialized random: account for arch randomness in bits random: mark bootloader randomness code as __init random: avoid checking crng_ready() twice in random_init() crc-itu-t: fix typo in CRC ITU-T polynomial comment
-
Duke Lee authored
The Surface Go reports Chassis Type 9 (Laptop,) so the device needs to be added to dmi_vgbs_allow_list to enable tablet mode when an attached Type Cover is folded back. BugLink: https://github.com/linux-surface/linux-surface/issues/837Signed-off-by: Duke Lee <krnhotwings@gmail.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20220607213654.5567-1-krnhotwings@gmail.comReviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
-
Bedant Patnaik authored
commit be9d73e6 ("platform/x86: hp-wmi: Fix 0x05 error code reported by several WMI calls") and commit 12b19f14 ("platform/x86: hp-wmi: Fix hp_wmi_read_int() reporting error (0x05)") cause ACPI BIOS Error (bug): Attempt to CreateField of length zero (20211217/dsopcode-133) because of the ACPI method HWMC, which unconditionally creates a Field of size (insize*8) bits: CreateField (Arg1, 0x80, (Local5 * 0x08), DAIN) In cases where args->insize = 0, the Field size is 0, resulting in an error. Fix this by using zero insize only if 0x5 error code is returned Tested on Omen 15 AMD (2020) board ID: 8786. Fixes: be9d73e6 ("platform/x86: hp-wmi: Fix 0x05 error code reported by several WMI calls") Signed-off-by: Bedant Patnaik <bedant.patnaik@gmail.com> Tested-by: Jorge Lopez <jorge.lopez2@hp.com> Link: https://lore.kernel.org/r/41be46743d21c78741232a47bbb5f1cdbcc3d21e.camel@gmail.comReviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
-
Jorge Lopez authored
WMI queries fail on some devices where the ACPI method HWMC unconditionally attempts to create Fields beyond the buffer if the buffer is too small, this breaks essential features such as power profiles: CreateByteField (Arg1, 0x10, D008) CreateByteField (Arg1, 0x11, D009) CreateByteField (Arg1, 0x12, D010) CreateDWordField (Arg1, 0x10, D032) CreateField (Arg1, 0x80, 0x0400, D128) In cases where args->data had zero length, ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [D008] at bit offset/length 128/8 exceeds size of target Buffer (128 bits) (20211217/dsopcode-198) was obtained. ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [D009] at bit offset/length 136/8 exceeds size of target Buffer (136bits) (20211217/dsopcode-198) The original code created a buffer size of 128 bytes regardless if the WMI call required a smaller buffer or not. This particular behavior occurs in older BIOS and reproduced in OMEN laptops. Newer BIOS handles buffer sizes properly and meets the latest specification requirements. This is the reason why testing with a dynamically allocated buffer did not uncover any failures with the test systems at hand. This patch was tested on several OMEN, Elite, and Zbooks. It was confirmed the patch resolves HPWMI_FAN GET/SET calls in an OMEN Laptop 15-ek0xxx. No problems were reported when testing on several Elite and Zbooks notebooks. Fixes: 4b4967cb ("platform/x86: hp-wmi: Changing bios_args.data to be dynamically allocated") Signed-off-by: Jorge Lopez <jorge.lopez2@hp.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20220608212923.8585-2-jorge.lopez2@hp.comReviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
-
Jonathan Neuschäfer authored
The syntax without dots is available since commit 43756e34 ("scripts/kernel-doc: Add support for named variable macro arguments"). The same HTML output is produced with and without this patch. Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 11 Jun, 2022 3 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linuxLinus Torvalds authored
Pull gpio fixes from Bartosz Golaszewski: "A set of fixes. Most address the new warning we emit at build time when irq chips are not immutable with some additional tweaks to gpio-crystalcove from Andy and a small tweak to gpio-dwapd. - make irq_chip structs immutable in several Diolan and intel drivers to get rid of the new warning we emit when fiddling with irq chips - don't print error messages on probe deferral in gpio-dwapb" * tag 'gpio-fixes-for-v5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpio: dwapb: Don't print error on -EPROBE_DEFER gpio: dln2: make irq_chip immutable gpio: sch: make irq_chip immutable gpio: merrifield: make irq_chip immutable gpio: wcove: make irq_chip immutable gpio: crystalcove: Join function declarations and long lines gpio: crystalcove: Use specific type and API for IRQ number gpio: crystalcove: make irq_chip immutable
-
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds authored
Pull SCSI fixes from James Bottomley: "Driver fixes and and one core patch. Nine of the driver patches are minor fixes and reworks to lpfc and the rest are trivial and minor fixes elsewhere" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: pmcraid: Fix missing resource cleanup in error case scsi: ipr: Fix missing/incorrect resource cleanup in error case scsi: mpt3sas: Fix out-of-bounds compiler warning scsi: lpfc: Update lpfc version to 14.2.0.4 scsi: lpfc: Allow reduced polling rate for nvme_admin_async_event cmd completion scsi: lpfc: Add more logging of cmd and cqe information for aborted NVMe cmds scsi: lpfc: Fix port stuck in bypassed state after LIP in PT2PT topology scsi: lpfc: Resolve NULL ptr dereference after an ELS LOGO is aborted scsi: lpfc: Address NULL pointer dereference after starget_to_rport() scsi: lpfc: Resolve some cleanup issues following SLI path refactoring scsi: lpfc: Resolve some cleanup issues following abort path refactoring scsi: lpfc: Correct BDE type for XMIT_SEQ64_WQE in lpfc_ct_reject_event() scsi: vmw_pvscsi: Expand vcpuHint to 16 bits scsi: sd: Fix interpretation of VPD B9h length
-
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds authored
Pull virtio fixes from Michael Tsirkin: "Fixes all over the place, most notably fixes for latent bugs in drivers that got exposed by suppressing interrupts before DRIVER_OK, which in turn has been done by 8b4ec69d ("virtio: harden vring IRQ")" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: um: virt-pci: set device ready in probe() vdpa: make get_vq_group and set_group_asid optional virtio: Fix all occurences of the "the the" typo vduse: Fix NULL pointer dereference on sysfs access vringh: Fix loop descriptors check in the indirect cases vdpa/mlx5: clean up indenting in handle_ctrl_vlan() vdpa/mlx5: fix error code for deleting vlan virtio-mmio: fix missing put_device() when vm_cmdline_parent registration failed vdpa/mlx5: Fix syntax errors in comments virtio-rng: make device ready before making request
-