- 20 Oct, 2023 1 commit
-
-
Joel Fernandes (Google) authored
How ILB is triggered without IPIs is cryptic. Out of mercy for future code readers, document it in code comments. The comments are derived from a discussion with Vincent in a past review. Signed-off-by:
Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231020014031.919742-2-joel@joelfernandes.org
-
- 18 Oct, 2023 1 commit
-
-
Jiapeng Chong authored
./kernel/sched/fair.c: linux/sched/cond_resched.h is included more than once. Reported-by:
Abaci Robot <abaci@linux.alibaba.com> Signed-off-by:
Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231018062759.44375-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6907
-
- 16 Oct, 2023 1 commit
-
-
Fan Yu authored
The PSI trigger code is now making a distinction between privileged and unprivileged triggers, after the following commit: 65457b74 ("sched/psi: Rename existing poll members in preparation") But some comments have not been modified along with the code, so they need to be updated. This will help readers better understand the code. Signed-off-by:
Fan Yu <fan.yu9@zte.com.cn> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/202310161920399921184@zte.com.cn
-
- 13 Oct, 2023 3 commits
-
-
Mathieu Desnoyers authored
The PELT acronym definition can be found right at the top of kernel/sched/pelt.c (of course), but it cannot be found through use of:

	grep -r PELT kernel/sched/

Add the acronym "(PELT)" after "Per Entity Load Tracking" at the top of the source file. Signed-off-by:
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231012125824.1260774-1-mathieu.desnoyers@efficios.com
-
Peter Zijlstra authored
Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning.

Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete().

The race scenario is:

	CPU0					CPU1

						// doing _cpu_down()

  __set_cpus_allowed_ptr()
    task_rq_lock();
						takedown_cpu()
						  stop_machine_cpuslocked(take_cpu_down..)

						<PREEMPT: cpu_stopper_thread()
						  MULTI_STOP_PREPARE
						  ...
    __set_cpus_allowed_ptr_locked()
      affine_move_task()
        task_rq_unlock();

  <PREEMPT: cpu_stopper_thread()\>
						  ack_state()
						MULTI_STOP_RUN
						  take_cpu_down()
						    __cpu_disable();
						    stop_machine_park();
						      stopper->enabled = false;
						 />
   />
        stop_one_cpu_nowait(.fn = migration_cpu_stop);
          if (stopper->enabled) // false!!!

That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete.

OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be run under the scheduler locks.

Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination:

	preempt_disable();
	task_rq_unlock(...);
	if (!stop_pending)
		stop_one_cpu_nowait(...)
	preempt_enable();

This respects the lock ordering constraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed.

Apply this pattern to all similar stop_one_cpu_nowait() invocations. Fixes: 6d337eab ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by:
"Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by:
"Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
-
Haifeng Xu authored
We could bail out early when psi was disabled. Signed-off-by:
Haifeng Xu <haifeng.xu@shopee.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20230926115722.467833-1-haifeng.xu@shopee.com
-
- 12 Oct, 2023 1 commit
-
-
Peter Zijlstra authored
While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain for the package mask. :-) Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the name less ambiguous. [ Shrikanth Hegde: rename on s390 as well. ] [ Valentin Schneider: also rename it in the comments. ] [ mingo: port to recent kernels & find all remaining occurrences. ] Reported-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Acked-by:
Valentin Schneider <vschneid@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Acked-by:
Heiko Carstens <hca@linux.ibm.com> Acked-by:
Gautham R. Shenoy <gautham.shenoy@amd.com> Acked-by:
Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
-
- 11 Oct, 2023 2 commits
-
-
Yang Yang authored
The 'update_total' parameter of update_triggers() is always true after the previous commit: 80cc1d1d ("sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes") If the 'changed_states & group->rtpoll_states' condition is true, 'new_stall' in update_triggers() will be true, and then 'update_total' should also be true. So update_total is redundant - remove it. [ mingo: Changelog updates ] Signed-off-by:
Yang Yang <yang.yang29@zte.com.cn> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/202310101645437859599@zte.com.cn
-
Yang Yang authored
When psimon wakes up and there are no state changes for ->rtpoll_states, it's unnecessary to update triggers and ->rtpoll_total because the pressures being monitored by the user have not changed. This will help to slightly reduce unnecessary computations of PSI. [ mingo: Changelog updates ] Signed-off-by:
Yang Yang <yang.yang29@zte.com.cn> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/202310101641075436843@zte.com.cn
-
- 10 Oct, 2023 7 commits
-
-
Colin Ian King authored
There is a comment that refers to cpu_load, however, this cpu_load was removed with: 55627e3c ("sched/core: Remove rq->cpu_load[]") ... back in 2019. The comment does not make sense with respect to this removed array, so remove the comment. Signed-off-by:
Colin Ian King <colin.i.king@gmail.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010155744.1381065-1-colin.i.king@gmail.com
-
Mel Gorman authored
VMAs are skipped if there is no recent fault activity but this represents a chicken-and-egg problem as there may be no fault activity if the PTEs are never updated to trap NUMA hints. There is an indirect reliance on scanning to be forced early in the lifetime of a task but this may fail to detect changes in phase behaviour. Force inactive VMAs to be scanned when all other eligible VMAs have been updated within the same scan sequence.

Test results in general look good with some changes in performance, both negative and positive, depending on whether the additional scanning and faulting was beneficial or not to the workload. The autonuma benchmark workload NUMA01_THREADLOCAL was picked for closer examination. The workload creates two processes with numerous threads and thread-local storage that is zero-filled in a loop. It exercises the corner case where unrelated threads may skip VMAs that are thread-local to another thread and still has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

	6.6.0-rc2-sched-numabtrace-v1
	=============================
	    649 reason=scan_delay
	  9,094 reason=unsuitable
	 48,915 reason=shared_ro
	143,919 reason=inaccessible
	193,050 reason=pid_inactive

	6.6.0-rc2-sched-numabselective-v1
	=================================
	    146 reason=seq_completed
	    622 reason=ignore_pid_inactive
	    624 reason=scan_delay
	  6,570 reason=unsuitable
	 16,101 reason=shared_ro
	 27,608 reason=inaccessible
	 41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored (ignore_pid_inactive) to ensure a VMA with some activity is completely scanned. In addition, a small number of VMAs are scanned when no other eligible VMA is available during a single scan window (seq_completed). The number of times a VMA is skipped due to no PID activity from the scanning task (pid_inactive) drops dramatically. It is expected that this will increase the number of PTEs updated for NUMA hinting faults as well as hinting faults but these represent PTEs that would otherwise have been missed. The tradeoff is scan+fault overhead versus improving locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the workload is as follows:

	                                 6.6.0-rc2              6.6.0-rc2
	                       sched-numabtrace-v1 sched-numabselective-v1
	Min      elsp-NUMA01_THREADLOCAL  174.22 (   0.00%)  117.64 (  32.48%)
	Amean    elsp-NUMA01_THREADLOCAL  175.68 (   0.00%)  123.34 *  29.79%*
	Stddev   elsp-NUMA01_THREADLOCAL    1.20 (   0.00%)    4.06 (-238.20%)
	CoeffVar elsp-NUMA01_THREADLOCAL    0.68 (   0.00%)    3.29 (-381.70%)
	Max      elsp-NUMA01_THREADLOCAL  177.18 (   0.00%)  128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

	                            6.6.0-rc2              6.6.0-rc2
	                  sched-numabtrace-v1 sched-numabselective-v1
	Duration User            91201.80               63506.64
	Duration System           2015.53                1819.78
	Duration Elapsed          1234.77                 868.37

In this specific case, system CPU time was not increased but it's not universally true.

From vmstat, the NUMA scanning and fault activity is as follows:

	                                     6.6.0-rc2              6.6.0-rc2
	                           sched-numabtrace-v1 sched-numabselective-v1
	Ops NUMA base-page range updates      64272.00            26374386.00
	Ops NUMA PTE updates                  36624.00               55538.00
	Ops NUMA PMD updates                     54.00               51404.00
	Ops NUMA hint faults                  15504.00               75786.00
	Ops NUMA hint local faults %          14860.00               56763.00
	Ops NUMA hint local percent              95.85                  74.90
	Ops NUMA pages migrated                1629.00             6469222.00

Both the number of PTE updates and hint faults is dramatically increased. While this is superficially unfortunate, it represents ranges that were simply skipped without the patch. As a result of the scanning and hinting faults, many more pages were also migrated but as the time to completion is reduced, the overhead is offset by the gain. Signed-off-by:
Mel Gorman <mgorman@techsingularity.net> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Tested-by:
Raghavendra K T <raghavendra.kt@amd.com> Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
-
Mel Gorman authored
NUMA Balancing skips VMAs when the current task has not trapped a NUMA fault within the VMA. If the VMA is skipped then mm->numa_scan_offset advances and a task that is trapping faults within the VMA may never fully update PTEs within the VMA. Force tasks to update PTEs for partially scanned VMAs. The VMA will be tagged for NUMA hints by some task but this removes some of the benefit of tracking PID activity within a VMA. A follow-on patch will mitigate this problem. The test cases and machines evaluated did not trigger the corner case so the performance results are neutral with only small changes within the noise from normal test-to-test variance. However, the next patch makes the corner case easier to trigger. Signed-off-by:
Mel Gorman <mgorman@techsingularity.net> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Tested-by:
Raghavendra K T <raghavendra.kt@amd.com> Link: https://lore.kernel.org/r/20231010083143.19593-6-mgorman@techsingularity.net
-
Raghavendra K T authored
Recent NUMA hinting faulting activity is reset approximately every VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not accessed a VMA then the reset check is missed and the reset is potentially deferred forever. Check if the PID activity information should be reset before checking if the current task recently trapped a NUMA hinting fault. [ mgorman@techsingularity.net: Rewrite changelog ] Suggested-by:
Mel Gorman <mgorman@techsingularity.net> Signed-off-by:
Raghavendra K T <raghavendra.kt@amd.com> Signed-off-by:
Mel Gorman <mgorman@techsingularity.net> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-5-mgorman@techsingularity.net
-
Mel Gorman authored
NUMA balancing skips or scans VMAs for a variety of reasons. In preparation for completing scans of VMAs regardless of PID access, trace the reasons why a VMA was skipped. In a later patch, the tracing will be used to track if a VMA was forcibly scanned. Signed-off-by:
Mel Gorman <mgorman@techsingularity.net> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-4-mgorman@techsingularity.net
-
Mel Gorman authored
sched/numa: Rename vma_numab_state::access_pids[] => ::pids_active[], ::next_pid_reset => ::pids_active_reset

The access_pids[] field name is somewhat ambiguous as no PIDs are accessed. Similarly, it's not clear that next_pid_reset is related to access_pids[]. Rename the fields to more accurately reflect their purpose. [ mingo: Rename in the comments too. ] Signed-off-by:
Mel Gorman <mgorman@techsingularity.net> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-3-mgorman@techsingularity.net
-
Mel Gorman authored
Document the intended usage of the fields. [ mingo: Reformatted to take less vertical space & tidied it up. ] Signed-off-by:
Mel Gorman <mgorman@techsingularity.net> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-2-mgorman@techsingularity.net
-
- 09 Oct, 2023 9 commits
-
-
Ingo Molnar authored
Move it out of the .c file into the shared scheduler-internal header file, to gain type-checking. Signed-off-by:
Ingo Molnar <mingo@kernel.org> Cc: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Cc: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
-
Shrikanth Hegde authored
The 'sched_energy_aware' sysctl is available for the admin to disable/enable energy aware scheduling (EAS). EAS is enabled only if a few conditions are met by the platform: asymmetric CPU capacity, no SMT, the schedutil CPUfreq governor, frequency-invariant load tracking, etc. A platform may boot without EAS capability, but could gain such capability at runtime - for example, by changing/registering the cpufreq governor to schedutil.

At present, even though the platform doesn't support EAS, this sysctl returns 1, and it ends up calling build_perf_domains() on a write of 1 and is a NOP on a write of 0. That is confusing and unnecessary. The desired behavior is for this sysctl to enable/disable EAS on supported platforms; on unsupported platforms a write returns a "not supported" error and a read returns empty. So:

 - sched_energy_aware returns empty: EAS is not possible at this moment. This includes EAS-capable platforms which have at least one EAS condition false during startup, e.g. not using the schedutil cpufreq governor.
 - sched_energy_aware returns 0: EAS is supported but disabled by the admin.
 - sched_energy_aware returns 1: EAS is supported and enabled.

The user can find out the reason why EAS is not possible by checking the info messages. sched_is_eas_possible() returns true if the platform can do EAS at this moment. Signed-off-by:
Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Tested-by:
Pierre Gondois <pierre.gondois@arm.com> Reviewed-by:
Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
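A hedged sketch of the resulting sysctl handler behaviour, assuming it lives in the usual sched_energy_aware_handler(); sched_is_eas_possible() is the helper named in the changelog, everything else is simplified and may differ from the merged patch:

	static int sched_energy_aware_handler(struct ctl_table *table, int write,
					      void *buffer, size_t *lenp, loff_t *ppos)
	{
		if (!sched_is_eas_possible(cpu_active_mask)) {
			if (write)
				return -EOPNOTSUPP;	/* write: "not supported" error */

			*lenp = 0;			/* read: return empty */
			return 0;
		}

		/* EAS is possible: existing 0/1 parsing (and perf domain rebuild) elided. */
		return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
	}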
-
Yang Yang authored
update_triggers() always returns now + group->rtpoll_min_period, and the return value is only used by psi_rtpoll_work(), so change update_triggers() to a void function and set group->rtpoll_next_update = now + group->rtpoll_min_period directly. This avoids unnecessary passing of the function's return value and simplifies the function. [ mingo: Updated changelog ] Suggested-by:
Suren Baghdasaryan <surenb@google.com> Signed-off-by:
Yang Yang <yang.yang29@zte.com.cn> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/202310092024289721617@zte.com.cn
-
Pierre Gondois authored
The Energy Aware Scheduler (EAS) estimates the energy consumption of placing a task on different CPUs. The goal is to minimize this energy consumption. Estimating the energy of different task placements is increasingly complex with the size of the platform. To avoid having a slow wake-up path, EAS is only enabled if this complexity is low enough.

The current complexity limit was set in:

  b68a4c0d ("sched/topology: Disable EAS on inappropriate platforms")

... based on the first implementation of EAS, which was re-computing the power of the whole platform for each task placement scenario, see:

  390031e4 ("sched/fair: Introduce an energy estimation helper function")

... but the complexity of EAS was reduced in:

  eb92692b ("sched/fair: Speed-up energy-aware wake-ups")

... and find_energy_efficient_cpu() (feec) algorithm was updated in:

  3e8c6c9a ("sched/fair: Remove task_util from effective utilization in feec()")

find_energy_efficient_cpu() (feec) is now doing:

	feec()
	\_ for_each_pd(pd) [0]
	   // get max_spare_cap_cpu and compute_prev_delta
	   \_ for_each_cpu(pd) [1]

	   \_ eenv_pd_busy_time(pd) [2]
	      \_ for_each_cpu(pd)

	   // compute_energy(pd) without the task
	   \_ eenv_pd_max_util(pd, -1) [3.0]
	      \_ for_each_cpu(pd)
	   \_ em_cpu_energy(pd, -1)
	      \_ for_each_ps(pd)

	   // compute_energy(pd) with the task on prev_cpu
	   \_ eenv_pd_max_util(pd, prev_cpu) [3.1]
	      \_ for_each_cpu(pd)
	   \_ em_cpu_energy(pd, prev_cpu)
	      \_ for_each_ps(pd)

	   // compute_energy(pd) with the task on max_spare_cap_cpu
	   \_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
	      \_ for_each_cpu(pd)
	   \_ em_cpu_energy(pd, max_spare_cap_cpu)
	      \_ for_each_ps(pd)

[3.1] happens only once since prev_cpu is unique. With the same definitions for nr_pd, nr_cpus and nr_ps, the complexity is of:

	  nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
	+ ([nr_cpus in pd] + [nr_ps in pd])

	   [0]  * (        [1] + [2]    +          [3.0] + [3.2]             )
	+ [3.1]

	= nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
	+ [nr_cpus in prev pd] + nr_ps

The complexity limit was set to 2048 in:

  b68a4c0d ("sched/topology: Disable EAS on inappropriate platforms")

... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8 performance states each". For the same platform, the complexity would actually be of:

	16 * (4 + 2 * 7) + 1 + 7 = 296

Since the EAS complexity was greatly reduced since the limit was introduced, bigger platforms can handle EAS. For instance, a platform with 112 CPUs with 7 performance states each would not reach it:

	112 * (4 + 2 * 7) + 1 + 7 = 2024

To reflect this improvement in the underlying EAS code, remove the EAS complexity check. Note that a limit on the number of CPUs still holds against EM_MAX_NUM_CPUS to avoid overflows during the energy estimation. [ mingo: Updates to the changelog. ] Signed-off-by:
Pierre Gondois <Pierre.Gondois@arm.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Lukasz Luba <lukasz.luba@arm.com> Reviewed-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com
-
Vincent Guittot authored
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity() instead.

The scheduler uses 3 methods to get access to a CPU's max compute capacity:

 - arch_scale_cpu_capacity(cpu), which is the default way to get a CPU's capacity.
 - the cpu_capacity_orig field, which is periodically updated with arch_scale_cpu_capacity().
 - capacity_orig_of(cpu), which encapsulates rq->cpu_capacity_orig.

There is no real need to save the value returned by arch_scale_cpu_capacity() in struct rq. arch_scale_cpu_capacity() returns:

 - either a per_cpu variable,
 - or a const value for systems which have only one capacity.

Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere. No functional changes.

Some performance tests on Arm64:

 - small SMP device (hikey): no noticeable changes
 - HMP device (RB5): hackbench shows minor improvement (1-2%)
 - large SMP (thx2): hackbench and tbench show minor improvement (1%)

Signed-off-by:
Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org
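An illustrative before/after fragment of the substitution (a sketch, not the patch itself); capacity_orig_of() and arch_scale_cpu_capacity() are the existing helpers listed above:

	unsigned long max_cap;

	/* Before: read the copy cached in struct rq. */
	max_cap = capacity_orig_of(cpu);		/* rq->cpu_capacity_orig */

	/* After: query the architecture directly; same value, no cached field. */
	max_cap = arch_scale_cpu_capacity(cpu);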
-
Yajun Deng authored
Doing this matches the natural type of 'int' based calculus in sched_rt_handler(), and also enables adding a correct upper bounds check on the sysctl interface. [ mingo: Rewrote the changelog. ] Signed-off-by:
Yajun Deng <yajun.deng@linux.dev> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231008021538.3063250-1-yajun.deng@linux.dev
-
Ingo Molnar authored
find_new_ilb() returns nr_cpu_ids on failure - which is the usual cpumask bitops return pattern, but is weird & unnecessary in this context: not only is it a global variable, it is a +1 out of bounds CPU index and also has different signedness ... Its only user, kick_ilb(), then checks the return against nr_cpu_ids to decide to return. There's no other use. So instead of this, use a standard -1 return on failure to find an idle CPU, as the argument is signed already. Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20231006102518.2452758-4-mingo@kernel.org
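A rough sketch of the new convention at the call site, assuming kick_ilb() in kernel/sched/fair.c keeps its current shape:

	static void kick_ilb(unsigned int flags)
	{
		int ilb_cpu;

		/* ... */

		ilb_cpu = find_new_ilb();
		if (ilb_cpu < 0)	/* previously: if (ilb_cpu >= nr_cpu_ids) */
			return;

		/* ... kick the chosen idle load-balance CPU ... */
	}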
-
Ingo Molnar authored
Use 'ilb_cpu' consistently in both functions. Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20231006102518.2452758-3-mingo@kernel.org
-
Ingo Molnar authored
 - Fix incorrect/misleading comments,
 - clarify some others,
 - fix typos & grammar,
 - and use more consistent style throughout.

Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lore.kernel.org/r/20231006102518.2452758-2-mingo@kernel.org
-
- 07 Oct, 2023 7 commits
-
-
Yajun Deng authored
Multiple blocked tasks are printed when the system hangs. They may have the same parent pid, but belong to different task groups. Printing tgid lets users better know whether these tasks are from the same task group or not. Signed-off-by:
Yajun Deng <yajun.deng@linux.dev> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230720080516.1515297-1-yajun.deng@linux.dev
-
Waiman Long authored
Commit bf5835bc ("intel_idle: Disable IBRS during long idle") disables IBRS when the cstate is 6 or lower. However, there are some use cases where a customer may want to use max_cstate=1 to lower latency. Such use cases will suffer from the performance degradation caused by the enabling of IBRS in the sibling idle thread. Add a "ibrs_off" module parameter to force disable IBRS and the CPUIDLE_FLAG_IRQ_ENABLE flag if set. In the case of a Skylake server with max_cstate=1, this new ibrs_off option will likely increase the IRQ response latency as IRQ will now be disabled. When running SPECjbb2015 with cstates set to C1 on a Skylake system. First test when the kernel is booted with: "intel_idle.ibrs_off": max-jOPS = 117828, critical-jOPS = 66047 Then retest when the kernel is booted without the "intel_idle.ibrs_off" added: max-jOPS = 116408, critical-jOPS = 58958 That means booting with "intel_idle.ibrs_off" improves performance by: max-jOPS: +1.2%, which could be considered noise range. critical-jOPS: +12%, which is definitely a solid improvement. The admin-guide/pm/intel_idle.rst file is updated to add a description about the new "ibrs_off" module parameter. Signed-off-by:
Waiman Long <longman@redhat.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Acked-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230727184600.26768-5-longman@redhat.com
-
Waiman Long authored
When intel_idle_ibrs() is called, it modifies the SPEC_CTRL MSR to 0 in order to disable IBRS. However, the new MSR value isn't reflected in x86_spec_ctrl_current, which is at odds with the other code that keeps track of its state in that percpu variable. Use the new __update_spec_ctrl() to have the x86_spec_ctrl_current percpu value properly updated. Since spec-ctrl.h includes both msr.h and nospec-branch.h, we can remove those from the include file list. Signed-off-by:
Waiman Long <longman@redhat.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Acked-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230727184600.26768-4-longman@redhat.com
-
Waiman Long authored
Commit bf5835bc ("intel_idle: Disable IBRS during long idle") disables IBRS when the CPU enters long idle. However, when a CPU becomes offline, the IBRS bit is still set when X86_FEATURE_KERNEL_IBRS is enabled. That will impact the performance of a sibling CPU. Mitigate this performance impact by clearing all the mitigation bits in SPEC_CTRL MSR when offline. When the CPU is online again, it will be re-initialized and so restoring the SPEC_CTRL value isn't needed. Add a comment to say that native_play_dead() is a __noreturn function, but it can't be marked as such to avoid confusion about the missing MSR restoration code. When DPDK is running on an isolated CPU thread processing network packets in user space while its sibling thread is idle. The performance of the busy DPDK thread with IBRS on and off in the sibling idle thread are: IBRS on IBRS off ------- -------- packets/second: 7.8M 10.4M avg tsc cycles/packet: 282.26 209.86 This is a 25% performance degradation. The test system is a Intel Xeon 4114 CPU @ 2.20GHz. [ mingo: Extended the changelog with performance data from the 0/4 mail. ] Signed-off-by:
Waiman Long <longman@redhat.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Acked-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230727184600.26768-3-longman@redhat.com
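A hedged sketch of the idea; the exact placement on the play-dead/offline path and the use of __update_spec_ctrl() (from the companion patch below) are assumptions, not the literal diff:

	/*
	 * Hypothetical fragment on the offline path: clear every mitigation
	 * bit in the SPEC_CTRL MSR before the CPU parks. Nothing needs
	 * restoring, as an onlined CPU is re-initialized from scratch.
	 */
	if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS))
		__update_spec_ctrl(0);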
-
Waiman Long authored
Add a new __update_spec_ctrl() helper which is a variant of update_spec_ctrl() that can be used in a noinstr function. Suggested-by:
Peter Zijlstra <peterz@infradead.org> Signed-off-by:
Waiman Long <longman@redhat.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Acked-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230727184600.26768-2-longman@redhat.com
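A sketch of what such a noinstr-safe helper could look like (the merged version may differ slightly); x86_spec_ctrl_current and MSR_IA32_SPEC_CTRL are existing kernel symbols:

	/*
	 * Like update_spec_ctrl(), but safe to call from noinstr code:
	 * a plain percpu write plus a raw MSR write, no instrumentation.
	 */
	static __always_inline void __update_spec_ctrl(u64 val)
	{
		__this_cpu_write(x86_spec_ctrl_current, val);
		native_wrmsrl(MSR_IA32_SPEC_CTRL, val);
	}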
-
Ingo Molnar authored
The following commit: 9b3c4ab3 ("sched,rcu: Rework try_invoke_on_locked_down_task()") ... renamed try_invoke_on_locked_down_task() to task_call_func(), but forgot to update the comment in try_to_wake_up(). But it turns out that the smp_rmb() doesn't live in task_call_func() either, it was moved to __task_needs_rq_lock() in: 91dabf33 ("sched: Fix race in task_call_func()") Fix that now. Also fix the s/smb/smp typo while at it. Reported-by:
Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230731085759.11443-1-zhangqiao22@huawei.com
-
Ingo Molnar authored
Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- 05 Oct, 2023 1 commit
-
-
Xuewen Yan authored
When cpufreq's policy is 'single', there is a scenario that will cause sg_policy's next_freq to be unable to update.

When the CPU's util is always max, cpufreq runs at the max frequency. If we then change the policy's scaling_max_freq to a lower freq, the sg_policy's next_freq needs to change to the lower freq; however, because the CPU is busy, next_freq keeps the old max freq.

For example, cpu7 is a single CPU:

	unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done&
	[1] 4737
	unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
	pid 4737's current affinity mask: ff
	pid 4737's new affinity mask: 80
	unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
	2301000
	unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
	2301000
	unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
	unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
	2171000

At this point the sg_policy's next_freq stays at 2301000, which is wrong.

To fix this, add a check for the ->need_freq_update flag. [ mingo: Clarified the changelog. ] Co-developed-by:
Guohua Yan <guohua.yan@unisoc.com> Signed-off-by:
Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by:
Guohua Yan <guohua.yan@unisoc.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Acked-by:
"Rafael J. Wysocki" <rafael@kernel.org> Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
-
- 03 Oct, 2023 3 commits
-
-
Yu Liao authored
<linux/psi.h> and "autogroup.h" are included twice, remove the duplicate header inclusion. Signed-off-by:
Yu Liao <liaoyu15@huawei.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230802021501.2511569-1-liaoyu15@huawei.com
-
Peter Zijlstra authored
The expectation is that placing a task at avg_vruntime() makes it eligible. Turns out there is a corner case where this is not the case. Specifically, avg_vruntime() relies on the fact that integer division is a flooring function (e.g. it discards the remainder). By this property the value returned is slightly left of the true average. However! when the average is negative (relative to min_vruntime) the effect is flipped and it becomes a ceil, with the result that the returned value is just right of the average and thus not eligible. Fixes: af4cf404 ("sched/fair: Add cfs_rq::avg_vruntime") Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org>
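A standalone worked example of the arithmetic at play (not scheduler code): C integer division truncates toward zero, which behaves as a floor only for non-negative values; once the (relative) average goes negative, truncation rounds up, i.e. acts as a ceiling:

	#include <stdio.h>

	int main(void)
	{
		/* Positive sum: truncation == floor, result lies left of the true average. */
		long long avg_pos = 7 / 2;			/* 3, true average is 3.5 */

		/*
		 * Negative sum (relative to min_vruntime): truncation == ceiling,
		 * the result lies right of the true average, hence "not eligible".
		 */
		long long avg_neg = -7 / 2;			/* -3, true average is -3.5 */

		/* One way to get a sign-aware floor: step down when a negative remainder exists. */
		long long s = -7, n = 2;
		long long floored = s / n - ((s % n) < 0);	/* -4 */

		printf("%lld %lld %lld\n", avg_pos, avg_neg, floored);
		return 0;
	}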
-
Peter Zijlstra authored
Tasks that never consume their full slice would not update their slice value. This means that tasks that are spawned before the sysctl scaling keep their original (UP) slice length. Fixes: 147f3efa ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net
-
- 02 Oct, 2023 4 commits
-
-
Kir Kolyshkin authored
Both glibc and musl define 'struct sched_param' in sched.h, while the kernel has it in uapi/linux/sched/types.h, making it cumbersome to use sched_getattr(2) or sched_setattr(2) from userspace. For example, something like this:

	#include <sched.h>
	#include <linux/sched/types.h>

	struct sched_attr sa;

will result in "error: redefinition of ‘struct sched_param’" (note the code doesn't need sched_param at all -- it needs struct sched_attr plus some stuff from sched.h).

The situation is, glibc is not going to provide a wrapper for sched_{get,set}attr, thus the need to include linux/sched/types.h directly, which leads to the above problem. Thus, userspace is left with a few sub-par choices when it wants to use e.g. sched_setattr(2), such as maintaining a copy of the struct sched_attr definition, or using some other ugly tricks.

OTOH, 'struct sched_param' is well known, defined in POSIX, and it won't ever be changed (as that would break backward compatibility). So, while 'struct sched_param' is indeed part of the kernel uapi, exposing it the way it's done now creates an issue, and hiding it (like this patch does) fixes that issue, hopefully without creating another one: common userspace software relies on libc headers, and as for "special" software (like libc), it looks like glibc and musl do not rely on kernel headers for the 'struct sched_param' definition (but let's Cc their mailing lists in case it's otherwise).

The alternative to this patch would be to move struct sched_attr to, say, linux/sched.h, or linux/sched/attr.h (the new file).

Oh, and here is the previous attempt to fix the issue: https://lore.kernel.org/all/20200528135552.GA87103@google.com/

While I support Linus' arguments, the issue is still here and needs to be fixed. [ mingo: Linus is right, this shouldn't be needed - but on the other hand I agree that this header is not really helpful to user-space as-is. So let's pretend that <uapi/linux/sched/types.h> is only about sched_attr, and call this commit a workaround for user-space breakage that it in reality is ... Also, remove the Fixes tag. ] Signed-off-by:
Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230808030357.1213829-1-kolyshkin@gmail.com
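A minimal userspace sketch of what this unblocks, assuming the patch is applied so that <sched.h> and <linux/sched/types.h> can be included together; since glibc still provides no wrapper, sched_setattr(2) is invoked via syscall(2), and the field values below are illustrative:

	#include <sched.h>		/* libc: SCHED_* policies, struct sched_param */
	#include <linux/sched/types.h>	/* kernel uapi: struct sched_attr */
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdio.h>

	int main(void)
	{
		struct sched_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.sched_policy = SCHED_FIFO;	/* needs CAP_SYS_NICE / root */
		attr.sched_priority = 10;

		/* pid 0 == calling thread, flags == 0 */
		if (syscall(SYS_sched_setattr, 0, &attr, 0))
			perror("sched_setattr");

		return 0;
	}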
-
Cyril Hrubis authored
Standardize on a single variant. Signed-off-by:
Cyril Hrubis <chrubis@suse.cz> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231002115553.3007-4-chrubis@suse.cz
-
Cyril Hrubis authored
 - Describe explicitly that sched_rt_runtime_us is allocated from sched_rt_period_us and hence always less or equal to that value.
 - The limit for sched_rt_runtime_us is not INT_MAX-1, but rather it's limited by the value of sched_rt_period_us. If sched_rt_period_us is INT_MAX then sched_rt_runtime_us can be set to INT_MAX as well.

Signed-off-by:
Cyril Hrubis <chrubis@suse.cz> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231002115553.3007-3-chrubis@suse.cz
-
Cyril Hrubis authored
The validation of the value written to sched_rt_period_us was broken because:

 - sysctl_sched_rt_period is declared as unsigned int,
 - it is parsed by proc_dointvec(),
 - the range is asserted after the value is parsed by proc_dointvec().

Because of this, negative values written to the file were stored in an unsigned integer and later interpreted as large positive integers, which passed the check:

	if (sysctl_sched_rt_period <= 0)
		return -EINVAL;

This commit fixes the parsing by setting an explicit range for both period_us and runtime_us in the sched_rt_sysctls table and processing the values with proc_dointvec_minmax() instead.

Alternatively, if we wanted to use the full range of unsigned int for the period value, we would have to split the proc_handler and use proc_douintvec() for it; however, even Documentation/scheduler/sched-rt-group.rst describes the range as 1 to INT_MAX.

As far as I can tell, the only problem this causes is that the sysctl file allows writing negative values which, when read back, may confuse userspace.

There is also an LTP test being submitted for these sysctl files at:

  http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/

Signed-off-by:
Cyril Hrubis <chrubis@suse.cz> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
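A hedged sketch of the kind of sysctl table entry this implies (kernel/sched/rt.c); the exact field values in the merged patch may differ:

	static struct ctl_table sched_rt_sysctls[] = {
		{
			.procname	= "sched_rt_period_us",
			.data		= &sysctl_sched_rt_period,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= sched_rt_handler,
			.extra1		= SYSCTL_ONE,		/* rejects 0 and negative values */
			.extra2		= SYSCTL_INT_MAX,	/* caps the period at INT_MAX */
		},
		/* sched_rt_runtime_us gets an analogous range-checked entry. */
		{}
	};

Inside sched_rt_handler() the values would then be parsed with proc_dointvec_minmax(), which enforces the extra1/extra2 range before the scheduler ever sees them.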
-