Commit bfe8eb3b authored by Linus Torvalds

Merge tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Energy scheduling:

   - Consolidate how the max compute capacity is used in the scheduler
     and how we calculate the frequency for a level of utilization.

   - Rework interface between the scheduler and the schedutil governor

   - Simplify the util_est logic

  Deadline scheduler:

   - Work more towards reducing SCHED_DEADLINE starvation of low
     priority (e.g., SCHED_OTHER) tasks when higher priority tasks
     monopolize CPU cycles, via the introduction of 'deadline servers'
     (nested/2-level scheduling).

     "Fair servers" to make use of this facility are not introduced yet.

  EEVDF:

   - Introduce O(1) fastpath for EEVDF task selection

  NUMA balancing:

   - Tune the NUMA-balancing vma scanning logic some more, to better
     distribute the probability of a particular vma getting scanned.

  Plus misc fixes, cleanups and updates"

* tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  sched/fair: Fix tg->load when offlining a CPU
  sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
  sched/fair: Use all little CPUs for CPU-bound workloads
  sched/fair: Simplify util_est
  sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
  arm64/amu: Use capacity_ref_freq() to set AMU ratio
  cpufreq/cppc: Set the frequency used for computing the capacity
  cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
  energy_model: Use a fixed reference frequency
  cpufreq/schedutil: Use a fixed reference frequency
  cpufreq: Use the fixed and coherent frequency for scaling capacity
  sched/topology: Add a new arch_scale_freq_ref() method
  freezer,sched: Clean saved_state when restoring it during thaw
  sched/fair: Update min_vruntime for reweight_entity() correctly
  sched/doc: Update documentation after renames and synchronize Chinese version
  sched/cpufreq: Rework iowait boost
  sched/cpufreq: Rework schedutil governor performance estimation
  sched/pelt: Avoid underestimation of task utilization
  sched/timers: Explain why idle task schedules out on remote timer enqueue
  sched/cpuidle: Comment about timers requirements VS idle handler
  ...
parents aac4de46 cdb3033e
...@@ -180,7 +180,7 @@ This is the (partial) list of the hooks: ...@@ -180,7 +180,7 @@ This is the (partial) list of the hooks:
compat_yield sysctl is turned on; in that case, it places the scheduling compat_yield sysctl is turned on; in that case, it places the scheduling
entity at the right-most end of the red-black tree. entity at the right-most end of the red-black tree.
- check_preempt_curr(...) - wakeup_preempt(...)
This function checks if a task that entered the runnable state should This function checks if a task that entered the runnable state should
preempt the currently running task. preempt the currently running task.
...@@ -189,10 +189,10 @@ This is the (partial) list of the hooks: ...@@ -189,10 +189,10 @@ This is the (partial) list of the hooks:
This function chooses the most appropriate task eligible to run next. This function chooses the most appropriate task eligible to run next.
- set_curr_task(...) - set_next_task(...)
This function is called when a task changes its scheduling class or changes This function is called when a task changes its scheduling class, changes
its task group. its task group or is scheduled.
- task_tick(...) - task_tick(...)
......
...@@ -90,8 +90,8 @@ For more detail see: ...@@ -90,8 +90,8 @@ For more detail see:
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
UTIL_EST / UTIL_EST_FASTUP UTIL_EST
========================== ========
Because periodic tasks have their averages decayed while they sleep, even Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a though when running their expected utilization will be the same, they suffer a
...@@ -99,8 +99,7 @@ though when running their expected utilization will be the same, they suffer a ...@@ -99,8 +99,7 @@ though when running their expected utilization will be the same, they suffer a
To alleviate this (a default enabled option) UTIL_EST drives an Infinite To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR highest. UTIL_EST filters to instantly increase and only decay on decrease.
filter to instantly increase and only decay on decrease.
A further runqueue wide sum (of runnable tasks) is maintained of: A further runqueue wide sum (of runnable tasks) is maintained of:
......
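
As a rough sketch of the asymmetric util_est filter described in the
schedutil.rst hunk above (the function name below is made up for this note;
the in-tree logic is util_est_update() in kernel/sched/fair.c and handles
more cases, e.g. the UTIL_AVG_UNCHANGED flag and faster-decay heuristics):

	/*
	 * util_est keeps, per task, an EWMA of the utilization seen at
	 * dequeue time: ramp up instantly, decay slowly with an IIR
	 * weight of 1/4 (UTIL_EST_WEIGHT_SHIFT == 2).
	 */
	static unsigned int util_est_filter_sketch(unsigned int ewma,
						   unsigned int dequeued)
	{
		if (dequeued >= ewma)
			return dequeued;	/* instant increase */

		/* ewma += (dequeued - ewma) / 4, i.e. slow decay */
		return ewma - ((ewma - dequeued) >> 2);
	}
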
...@@ -80,7 +80,7 @@ p->se.vruntime。一旦p->se.vruntime变得足够大,其它的任务将成为 ...@@ -80,7 +80,7 @@ p->se.vruntime。一旦p->se.vruntime变得足够大,其它的任务将成为
CFS使用纳秒粒度的计时,不依赖于任何jiffies或HZ的细节。因此CFS并不像之前的调度器那样 CFS使用纳秒粒度的计时,不依赖于任何jiffies或HZ的细节。因此CFS并不像之前的调度器那样
有“时间片”的概念,也没有任何启发式的设计。唯一可调的参数(你需要打开CONFIG_SCHED_DEBUG)是: 有“时间片”的概念,也没有任何启发式的设计。唯一可调的参数(你需要打开CONFIG_SCHED_DEBUG)是:
/sys/kernel/debug/sched/min_granularity_ns /sys/kernel/debug/sched/base_slice_ns
它可以用来将调度器从“桌面”模式(也就是低时延)调节为“服务器”(也就是高批处理)模式。 它可以用来将调度器从“桌面”模式(也就是低时延)调节为“服务器”(也就是高批处理)模式。
它的默认设置是适合桌面的工作负载。SCHED_BATCH也被CFS调度器模块处理。 它的默认设置是适合桌面的工作负载。SCHED_BATCH也被CFS调度器模块处理。
...@@ -147,7 +147,7 @@ array)。 ...@@ -147,7 +147,7 @@ array)。
这个函数的行为基本上是出队,紧接着入队,除非compat_yield sysctl被开启。在那种情况下, 这个函数的行为基本上是出队,紧接着入队,除非compat_yield sysctl被开启。在那种情况下,
它将调度实体放在红黑树的最右端。 它将调度实体放在红黑树的最右端。
- check_preempt_curr(...) - wakeup_preempt(...)
这个函数检查进入可运行状态的任务能否抢占当前正在运行的任务。 这个函数检查进入可运行状态的任务能否抢占当前正在运行的任务。
...@@ -155,9 +155,9 @@ array)。 ...@@ -155,9 +155,9 @@ array)。
这个函数选择接下来最适合运行的任务。 这个函数选择接下来最适合运行的任务。
- set_curr_task(...) - set_next_task(...)
这个函数在任务改变调度类或改变任务组时被调用。 这个函数在任务改变调度类,改变任务组时,或者任务被调度时被调用。
- task_tick(...) - task_tick(...)
......
...@@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最 ...@@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最
- Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" - Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
UTIL_EST / UTIL_EST_FASTUP UTIL_EST
========================== ========
由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同, 由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同,
因此它们在再次运行后会面临(DVFS)的上涨。 因此它们在再次运行后会面临(DVFS)的上涨。
为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应 为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应
(Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。 (Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。
另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加, UTIL_EST滤波使其在遇到更高值时立刻增加,而遇到低值时会缓慢衰减。
仅在利用率下降时衰减。
进一步,运行队列的(可运行任务的)利用率之和由下式计算: 进一步,运行队列的(可运行任务的)利用率之和由下式计算:
......
...@@ -13,6 +13,7 @@ ...@@ -13,6 +13,7 @@
#define arch_set_freq_scale topology_set_freq_scale #define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale #define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant #define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref
#endif #endif
/* Replace task scheduler's default cpu-invariant accounting */ /* Replace task scheduler's default cpu-invariant accounting */
......
...@@ -23,6 +23,7 @@ void update_freq_counters_refs(void); ...@@ -23,6 +23,7 @@ void update_freq_counters_refs(void);
#define arch_set_freq_scale topology_set_freq_scale #define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale #define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant #define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref
#ifdef CONFIG_ACPI_CPPC_LIB #ifdef CONFIG_ACPI_CPPC_LIB
#define arch_init_invariance_cppc topology_init_cpu_capacity_cppc #define arch_init_invariance_cppc topology_init_cpu_capacity_cppc
......
...@@ -82,7 +82,12 @@ int __init parse_acpi_topology(void) ...@@ -82,7 +82,12 @@ int __init parse_acpi_topology(void)
#undef pr_fmt #undef pr_fmt
#define pr_fmt(fmt) "AMU: " fmt #define pr_fmt(fmt) "AMU: " fmt
static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale); /*
* Ensure that amu_scale_freq_tick() will return SCHED_CAPACITY_SCALE until
* the CPU capacity and its associated frequency have been correctly
* initialized.
*/
static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
static DEFINE_PER_CPU(u64, arch_const_cycles_prev); static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
static DEFINE_PER_CPU(u64, arch_core_cycles_prev); static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
static cpumask_var_t amu_fie_cpus; static cpumask_var_t amu_fie_cpus;
...@@ -112,14 +117,14 @@ static inline bool freq_counters_valid(int cpu) ...@@ -112,14 +117,14 @@ static inline bool freq_counters_valid(int cpu)
return true; return true;
} }
static int freq_inv_set_max_ratio(int cpu, u64 max_rate, u64 ref_rate) void freq_inv_set_max_ratio(int cpu, u64 max_rate)
{ {
u64 ratio; u64 ratio, ref_rate = arch_timer_get_rate();
if (unlikely(!max_rate || !ref_rate)) { if (unlikely(!max_rate || !ref_rate)) {
pr_debug("CPU%d: invalid maximum or reference frequency.\n", WARN_ONCE(1, "CPU%d: invalid maximum or reference frequency.\n",
cpu); cpu);
return -EINVAL; return;
} }
/* /*
...@@ -139,12 +144,10 @@ static int freq_inv_set_max_ratio(int cpu, u64 max_rate, u64 ref_rate) ...@@ -139,12 +144,10 @@ static int freq_inv_set_max_ratio(int cpu, u64 max_rate, u64 ref_rate)
ratio = div64_u64(ratio, max_rate); ratio = div64_u64(ratio, max_rate);
if (!ratio) { if (!ratio) {
WARN_ONCE(1, "Reference frequency too low.\n"); WARN_ONCE(1, "Reference frequency too low.\n");
return -EINVAL; return;
} }
per_cpu(arch_max_freq_scale, cpu) = (unsigned long)ratio; WRITE_ONCE(per_cpu(arch_max_freq_scale, cpu), (unsigned long)ratio);
return 0;
} }
static void amu_scale_freq_tick(void) static void amu_scale_freq_tick(void)
...@@ -195,10 +198,7 @@ static void amu_fie_setup(const struct cpumask *cpus) ...@@ -195,10 +198,7 @@ static void amu_fie_setup(const struct cpumask *cpus)
return; return;
for_each_cpu(cpu, cpus) { for_each_cpu(cpu, cpus) {
if (!freq_counters_valid(cpu) || if (!freq_counters_valid(cpu))
freq_inv_set_max_ratio(cpu,
cpufreq_get_hw_max_freq(cpu) * 1000ULL,
arch_timer_get_rate()))
return; return;
} }
......
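
A rough worked example of the ratio stored by freq_inv_set_max_ratio() above,
with assumed rates that are not taken from the patch: for a 25 MHz arch timer
as the reference counter and a 2.5 GHz maximum CPU rate,

	ratio = (25000000 << (2 * SCHED_CAPACITY_SHIFT)) / 2500000000
	      = 26214400000000 / 2500000000 ~= 10485

so when the AMU counters later show the CPU running at its maximum rate (core
cycles advancing about 100x faster than the constant cycles), the per-tick
frequency scale works out to roughly 100 * 10485 >> SCHED_CAPACITY_SHIFT ~= 1023,
i.e. about SCHED_CAPACITY_SCALE.
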
...@@ -9,6 +9,7 @@ ...@@ -9,6 +9,7 @@
#define arch_set_freq_scale topology_set_freq_scale #define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale #define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant #define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref
/* Replace task scheduler's default cpu-invariant accounting */ /* Replace task scheduler's default cpu-invariant accounting */
#define arch_scale_cpu_capacity topology_get_cpu_scale #define arch_scale_cpu_capacity topology_get_cpu_scale
......
...@@ -39,6 +39,9 @@ ...@@ -39,6 +39,9 @@
#include <linux/rwsem.h> #include <linux/rwsem.h>
#include <linux/wait.h> #include <linux/wait.h>
#include <linux/topology.h> #include <linux/topology.h>
#include <linux/dmi.h>
#include <linux/units.h>
#include <asm/unaligned.h>
#include <acpi/cppc_acpi.h> #include <acpi/cppc_acpi.h>
...@@ -1760,3 +1763,104 @@ unsigned int cppc_get_transition_latency(int cpu_num) ...@@ -1760,3 +1763,104 @@ unsigned int cppc_get_transition_latency(int cpu_num)
return latency_ns; return latency_ns;
} }
EXPORT_SYMBOL_GPL(cppc_get_transition_latency); EXPORT_SYMBOL_GPL(cppc_get_transition_latency);
/* Minimum struct length needed for the DMI processor entry we want */
#define DMI_ENTRY_PROCESSOR_MIN_LENGTH 48
/* Offset in the DMI processor structure for the max frequency */
#define DMI_PROCESSOR_MAX_SPEED 0x14
/* Callback function used to retrieve the max frequency from DMI */
static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
{
const u8 *dmi_data = (const u8 *)dm;
u16 *mhz = (u16 *)private;
if (dm->type == DMI_ENTRY_PROCESSOR &&
dm->length >= DMI_ENTRY_PROCESSOR_MIN_LENGTH) {
u16 val = (u16)get_unaligned((const u16 *)
(dmi_data + DMI_PROCESSOR_MAX_SPEED));
*mhz = val > *mhz ? val : *mhz;
}
}
/* Look up the max frequency in DMI */
static u64 cppc_get_dmi_max_khz(void)
{
u16 mhz = 0;
dmi_walk(cppc_find_dmi_mhz, &mhz);
/*
* Real stupid fallback value, just in case there is no
* actual value set.
*/
mhz = mhz ? mhz : 1;
return KHZ_PER_MHZ * mhz;
}
/*
* If CPPC lowest_freq and nominal_freq registers are exposed then we can
* use them to convert perf to freq and vice versa. The conversion is
* extrapolated as an affine function passing by the 2 points:
* - (Low perf, Low freq)
* - (Nominal perf, Nominal freq)
*/
unsigned int cppc_perf_to_khz(struct cppc_perf_caps *caps, unsigned int perf)
{
s64 retval, offset = 0;
static u64 max_khz;
u64 mul, div;
if (caps->lowest_freq && caps->nominal_freq) {
mul = caps->nominal_freq - caps->lowest_freq;
mul *= KHZ_PER_MHZ;
div = caps->nominal_perf - caps->lowest_perf;
offset = caps->nominal_freq * KHZ_PER_MHZ -
div64_u64(caps->nominal_perf * mul, div);
} else {
if (!max_khz)
max_khz = cppc_get_dmi_max_khz();
mul = max_khz;
div = caps->highest_perf;
}
retval = offset + div64_u64(perf * mul, div);
if (retval >= 0)
return retval;
return 0;
}
EXPORT_SYMBOL_GPL(cppc_perf_to_khz);
unsigned int cppc_khz_to_perf(struct cppc_perf_caps *caps, unsigned int freq)
{
s64 retval, offset = 0;
static u64 max_khz;
u64 mul, div;
if (caps->lowest_freq && caps->nominal_freq) {
mul = caps->nominal_perf - caps->lowest_perf;
div = caps->nominal_freq - caps->lowest_freq;
/*
* We don't need to convert to kHz for computing offset and can
* directly use nominal_freq and lowest_freq as the div64_u64
* will remove the frequency unit.
*/
offset = caps->nominal_perf -
div64_u64(caps->nominal_freq * mul, div);
/* But we need it for computing the perf level. */
div *= KHZ_PER_MHZ;
} else {
if (!max_khz)
max_khz = cppc_get_dmi_max_khz();
mul = caps->highest_perf;
div = max_khz;
}
retval = offset + div64_u64(freq * mul, div);
if (retval >= 0)
return retval;
return 0;
}
EXPORT_SYMBOL_GPL(cppc_khz_to_perf);
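
A worked example of the affine conversion above, with made-up CPPC capability
values (not from any real platform): lowest_perf = 20 at lowest_freq = 800 MHz
and nominal_perf = 100 at nominal_freq = 2400 MHz give

	mul    = (2400 - 800) * KHZ_PER_MHZ              = 1600000
	div    = 100 - 20                                = 80
	offset = 2400 * KHZ_PER_MHZ - 100 * 1600000 / 80 = 400000

so cppc_perf_to_khz() maps perf 50 to 400000 + 50 * 1600000 / 80 = 1400000 kHz,
and cppc_khz_to_perf() maps 1400000 kHz back to perf 50.
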
...@@ -19,6 +19,7 @@ ...@@ -19,6 +19,7 @@
#include <linux/init.h> #include <linux/init.h>
#include <linux/rcupdate.h> #include <linux/rcupdate.h>
#include <linux/sched.h> #include <linux/sched.h>
#include <linux/units.h>
#define CREATE_TRACE_POINTS #define CREATE_TRACE_POINTS
#include <trace/events/thermal_pressure.h> #include <trace/events/thermal_pressure.h>
...@@ -26,7 +27,8 @@ ...@@ -26,7 +27,8 @@
static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data); static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask; static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant; static bool scale_freq_invariant;
static DEFINE_PER_CPU(u32, freq_factor) = 1; DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 1;
EXPORT_PER_CPU_SYMBOL_GPL(capacity_freq_ref);
static bool supports_scale_freq_counters(const struct cpumask *cpus) static bool supports_scale_freq_counters(const struct cpumask *cpus)
{ {
...@@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure); ...@@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure);
* operating on stale data when hot-plug is used for some CPUs. The * operating on stale data when hot-plug is used for some CPUs. The
* @capped_freq reflects the currently allowed max CPUs frequency due to * @capped_freq reflects the currently allowed max CPUs frequency due to
* thermal capping. It might be also a boost frequency value, which is bigger * thermal capping. It might be also a boost frequency value, which is bigger
* than the internal 'freq_factor' max frequency. In such case the pressure * than the internal 'capacity_freq_ref' max frequency. In such case the
* value should simply be removed, since this is an indication that there is * pressure value should simply be removed, since this is an indication that
* no thermal throttling. The @capped_freq must be provided in kHz. * there is no thermal throttling. The @capped_freq must be provided in kHz.
*/ */
void topology_update_thermal_pressure(const struct cpumask *cpus, void topology_update_thermal_pressure(const struct cpumask *cpus,
unsigned long capped_freq) unsigned long capped_freq)
...@@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus, ...@@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
cpu = cpumask_first(cpus); cpu = cpumask_first(cpus);
max_capacity = arch_scale_cpu_capacity(cpu); max_capacity = arch_scale_cpu_capacity(cpu);
max_freq = per_cpu(freq_factor, cpu); max_freq = arch_scale_freq_ref(cpu);
/* Convert to MHz scale which is used in 'freq_factor' */
capped_freq /= 1000;
/* /*
* Handle properly the boost frequencies, which should simply clean * Handle properly the boost frequencies, which should simply clean
...@@ -279,13 +278,13 @@ void topology_normalize_cpu_scale(void) ...@@ -279,13 +278,13 @@ void topology_normalize_cpu_scale(void)
capacity_scale = 1; capacity_scale = 1;
for_each_possible_cpu(cpu) { for_each_possible_cpu(cpu) {
capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu); capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity_scale = max(capacity, capacity_scale); capacity_scale = max(capacity, capacity_scale);
} }
pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale); pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
for_each_possible_cpu(cpu) { for_each_possible_cpu(cpu) {
capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu); capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT, capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale); capacity_scale);
topology_set_cpu_scale(cpu, capacity); topology_set_cpu_scale(cpu, capacity);
...@@ -321,15 +320,15 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) ...@@ -321,15 +320,15 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
cpu_node, raw_capacity[cpu]); cpu_node, raw_capacity[cpu]);
/* /*
* Update freq_factor for calculating early boot cpu capacities. * Update capacity_freq_ref for calculating early boot CPU capacities.
* For non-clk CPU DVFS mechanism, there's no way to get the * For non-clk CPU DVFS mechanism, there's no way to get the
* frequency value now, assuming they are running at the same * frequency value now, assuming they are running at the same
* frequency (by keeping the initial freq_factor value). * frequency (by keeping the initial capacity_freq_ref value).
*/ */
cpu_clk = of_clk_get(cpu_node, 0); cpu_clk = of_clk_get(cpu_node, 0);
if (!PTR_ERR_OR_ZERO(cpu_clk)) { if (!PTR_ERR_OR_ZERO(cpu_clk)) {
per_cpu(freq_factor, cpu) = per_cpu(capacity_freq_ref, cpu) =
clk_get_rate(cpu_clk) / 1000; clk_get_rate(cpu_clk) / HZ_PER_KHZ;
clk_put(cpu_clk); clk_put(cpu_clk);
} }
} else { } else {
...@@ -345,11 +344,16 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) ...@@ -345,11 +344,16 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
return !ret; return !ret;
} }
void __weak freq_inv_set_max_ratio(int cpu, u64 max_rate)
{
}
#ifdef CONFIG_ACPI_CPPC_LIB #ifdef CONFIG_ACPI_CPPC_LIB
#include <acpi/cppc_acpi.h> #include <acpi/cppc_acpi.h>
void topology_init_cpu_capacity_cppc(void) void topology_init_cpu_capacity_cppc(void)
{ {
u64 capacity, capacity_scale = 0;
struct cppc_perf_caps perf_caps; struct cppc_perf_caps perf_caps;
int cpu; int cpu;
...@@ -366,6 +370,10 @@ void topology_init_cpu_capacity_cppc(void) ...@@ -366,6 +370,10 @@ void topology_init_cpu_capacity_cppc(void)
(perf_caps.highest_perf >= perf_caps.nominal_perf) && (perf_caps.highest_perf >= perf_caps.nominal_perf) &&
(perf_caps.highest_perf >= perf_caps.lowest_perf)) { (perf_caps.highest_perf >= perf_caps.lowest_perf)) {
raw_capacity[cpu] = perf_caps.highest_perf; raw_capacity[cpu] = perf_caps.highest_perf;
capacity_scale = max_t(u64, capacity_scale, raw_capacity[cpu]);
per_cpu(capacity_freq_ref, cpu) = cppc_perf_to_khz(&perf_caps, raw_capacity[cpu]);
pr_debug("cpu_capacity: CPU%d cpu_capacity=%u (raw).\n", pr_debug("cpu_capacity: CPU%d cpu_capacity=%u (raw).\n",
cpu, raw_capacity[cpu]); cpu, raw_capacity[cpu]);
continue; continue;
...@@ -376,7 +384,18 @@ void topology_init_cpu_capacity_cppc(void) ...@@ -376,7 +384,18 @@ void topology_init_cpu_capacity_cppc(void)
goto exit; goto exit;
} }
topology_normalize_cpu_scale(); for_each_possible_cpu(cpu) {
freq_inv_set_max_ratio(cpu,
per_cpu(capacity_freq_ref, cpu) * HZ_PER_KHZ);
capacity = raw_capacity[cpu];
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(cpu));
}
schedule_work(&update_topology_flags_work); schedule_work(&update_topology_flags_work);
pr_debug("cpu_capacity: cpu_capacity initialization done\n"); pr_debug("cpu_capacity: cpu_capacity initialization done\n");
...@@ -410,8 +429,11 @@ init_cpu_capacity_callback(struct notifier_block *nb, ...@@ -410,8 +429,11 @@ init_cpu_capacity_callback(struct notifier_block *nb,
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus); cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
for_each_cpu(cpu, policy->related_cpus) for_each_cpu(cpu, policy->related_cpus) {
per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000; per_cpu(capacity_freq_ref, cpu) = policy->cpuinfo.max_freq;
freq_inv_set_max_ratio(cpu,
per_cpu(capacity_freq_ref, cpu) * HZ_PER_KHZ);
}
if (cpumask_empty(cpus_to_visit)) { if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale(); topology_normalize_cpu_scale();
......
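
A rough worked example of the normalization above, with assumed DT/CPPC inputs
that are not from the patch: a little CPU with raw_capacity = 500 at
capacity_freq_ref = 1800000 kHz and a big CPU with raw_capacity = 1024 at
2400000 kHz give capacity_scale = 1024 * 2400000, so the little CPU ends up at

	((500 * 1800000) << SCHED_CAPACITY_SHIFT) / (1024 * 2400000) = 375

while the big CPU lands at SCHED_CAPACITY_SCALE (1024), matching what
topology_normalize_cpu_scale() computes.
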
...@@ -16,7 +16,6 @@ ...@@ -16,7 +16,6 @@
#include <linux/delay.h> #include <linux/delay.h>
#include <linux/cpu.h> #include <linux/cpu.h>
#include <linux/cpufreq.h> #include <linux/cpufreq.h>
#include <linux/dmi.h>
#include <linux/irq_work.h> #include <linux/irq_work.h>
#include <linux/kthread.h> #include <linux/kthread.h>
#include <linux/time.h> #include <linux/time.h>
...@@ -27,12 +26,6 @@ ...@@ -27,12 +26,6 @@
#include <acpi/cppc_acpi.h> #include <acpi/cppc_acpi.h>
/* Minimum struct length needed for the DMI processor entry we want */
#define DMI_ENTRY_PROCESSOR_MIN_LENGTH 48
/* Offset in the DMI processor structure for the max frequency */
#define DMI_PROCESSOR_MAX_SPEED 0x14
/* /*
* This list contains information parsed from per CPU ACPI _CPC and _PSD * This list contains information parsed from per CPU ACPI _CPC and _PSD
* structures: e.g. the highest and lowest supported performance, capabilities, * structures: e.g. the highest and lowest supported performance, capabilities,
...@@ -291,97 +284,9 @@ static inline void cppc_freq_invariance_exit(void) ...@@ -291,97 +284,9 @@ static inline void cppc_freq_invariance_exit(void)
} }
#endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */ #endif /* CONFIG_ACPI_CPPC_CPUFREQ_FIE */
/* Callback function used to retrieve the max frequency from DMI */
static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
{
const u8 *dmi_data = (const u8 *)dm;
u16 *mhz = (u16 *)private;
if (dm->type == DMI_ENTRY_PROCESSOR &&
dm->length >= DMI_ENTRY_PROCESSOR_MIN_LENGTH) {
u16 val = (u16)get_unaligned((const u16 *)
(dmi_data + DMI_PROCESSOR_MAX_SPEED));
*mhz = val > *mhz ? val : *mhz;
}
}
/* Look up the max frequency in DMI */
static u64 cppc_get_dmi_max_khz(void)
{
u16 mhz = 0;
dmi_walk(cppc_find_dmi_mhz, &mhz);
/*
* Real stupid fallback value, just in case there is no
* actual value set.
*/
mhz = mhz ? mhz : 1;
return (1000 * mhz);
}
/*
* If CPPC lowest_freq and nominal_freq registers are exposed then we can
* use them to convert perf to freq and vice versa. The conversion is
* extrapolated as an affine function passing by the 2 points:
* - (Low perf, Low freq)
* - (Nominal perf, Nominal perf)
*/
static unsigned int cppc_cpufreq_perf_to_khz(struct cppc_cpudata *cpu_data,
unsigned int perf)
{
struct cppc_perf_caps *caps = &cpu_data->perf_caps;
s64 retval, offset = 0;
static u64 max_khz;
u64 mul, div;
if (caps->lowest_freq && caps->nominal_freq) {
mul = caps->nominal_freq - caps->lowest_freq;
div = caps->nominal_perf - caps->lowest_perf;
offset = caps->nominal_freq - div64_u64(caps->nominal_perf * mul, div);
} else {
if (!max_khz)
max_khz = cppc_get_dmi_max_khz();
mul = max_khz;
div = caps->highest_perf;
}
retval = offset + div64_u64(perf * mul, div);
if (retval >= 0)
return retval;
return 0;
}
static unsigned int cppc_cpufreq_khz_to_perf(struct cppc_cpudata *cpu_data,
unsigned int freq)
{
struct cppc_perf_caps *caps = &cpu_data->perf_caps;
s64 retval, offset = 0;
static u64 max_khz;
u64 mul, div;
if (caps->lowest_freq && caps->nominal_freq) {
mul = caps->nominal_perf - caps->lowest_perf;
div = caps->nominal_freq - caps->lowest_freq;
offset = caps->nominal_perf - div64_u64(caps->nominal_freq * mul, div);
} else {
if (!max_khz)
max_khz = cppc_get_dmi_max_khz();
mul = caps->highest_perf;
div = max_khz;
}
retval = offset + div64_u64(freq * mul, div);
if (retval >= 0)
return retval;
return 0;
}
static int cppc_cpufreq_set_target(struct cpufreq_policy *policy, static int cppc_cpufreq_set_target(struct cpufreq_policy *policy,
unsigned int target_freq, unsigned int target_freq,
unsigned int relation) unsigned int relation)
{ {
struct cppc_cpudata *cpu_data = policy->driver_data; struct cppc_cpudata *cpu_data = policy->driver_data;
unsigned int cpu = policy->cpu; unsigned int cpu = policy->cpu;
...@@ -389,7 +294,7 @@ static int cppc_cpufreq_set_target(struct cpufreq_policy *policy, ...@@ -389,7 +294,7 @@ static int cppc_cpufreq_set_target(struct cpufreq_policy *policy,
u32 desired_perf; u32 desired_perf;
int ret = 0; int ret = 0;
desired_perf = cppc_cpufreq_khz_to_perf(cpu_data, target_freq); desired_perf = cppc_khz_to_perf(&cpu_data->perf_caps, target_freq);
/* Return if it is exactly the same perf */ /* Return if it is exactly the same perf */
if (desired_perf == cpu_data->perf_ctrls.desired_perf) if (desired_perf == cpu_data->perf_ctrls.desired_perf)
return ret; return ret;
...@@ -417,7 +322,7 @@ static unsigned int cppc_cpufreq_fast_switch(struct cpufreq_policy *policy, ...@@ -417,7 +322,7 @@ static unsigned int cppc_cpufreq_fast_switch(struct cpufreq_policy *policy,
u32 desired_perf; u32 desired_perf;
int ret; int ret;
desired_perf = cppc_cpufreq_khz_to_perf(cpu_data, target_freq); desired_perf = cppc_khz_to_perf(&cpu_data->perf_caps, target_freq);
cpu_data->perf_ctrls.desired_perf = desired_perf; cpu_data->perf_ctrls.desired_perf = desired_perf;
ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls); ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
...@@ -530,7 +435,7 @@ static int cppc_get_cpu_power(struct device *cpu_dev, ...@@ -530,7 +435,7 @@ static int cppc_get_cpu_power(struct device *cpu_dev,
min_step = min_cap / CPPC_EM_CAP_STEP; min_step = min_cap / CPPC_EM_CAP_STEP;
max_step = max_cap / CPPC_EM_CAP_STEP; max_step = max_cap / CPPC_EM_CAP_STEP;
perf_prev = cppc_cpufreq_khz_to_perf(cpu_data, *KHz); perf_prev = cppc_khz_to_perf(perf_caps, *KHz);
step = perf_prev / perf_step; step = perf_prev / perf_step;
if (step > max_step) if (step > max_step)
...@@ -550,8 +455,8 @@ static int cppc_get_cpu_power(struct device *cpu_dev, ...@@ -550,8 +455,8 @@ static int cppc_get_cpu_power(struct device *cpu_dev,
perf = step * perf_step; perf = step * perf_step;
} }
*KHz = cppc_cpufreq_perf_to_khz(cpu_data, perf); *KHz = cppc_perf_to_khz(perf_caps, perf);
perf_check = cppc_cpufreq_khz_to_perf(cpu_data, *KHz); perf_check = cppc_khz_to_perf(perf_caps, *KHz);
step_check = perf_check / perf_step; step_check = perf_check / perf_step;
/* /*
...@@ -561,8 +466,8 @@ static int cppc_get_cpu_power(struct device *cpu_dev, ...@@ -561,8 +466,8 @@ static int cppc_get_cpu_power(struct device *cpu_dev,
*/ */
while ((*KHz == prev_freq) || (step_check != step)) { while ((*KHz == prev_freq) || (step_check != step)) {
perf++; perf++;
*KHz = cppc_cpufreq_perf_to_khz(cpu_data, perf); *KHz = cppc_perf_to_khz(perf_caps, perf);
perf_check = cppc_cpufreq_khz_to_perf(cpu_data, *KHz); perf_check = cppc_khz_to_perf(perf_caps, *KHz);
step_check = perf_check / perf_step; step_check = perf_check / perf_step;
} }
...@@ -591,7 +496,7 @@ static int cppc_get_cpu_cost(struct device *cpu_dev, unsigned long KHz, ...@@ -591,7 +496,7 @@ static int cppc_get_cpu_cost(struct device *cpu_dev, unsigned long KHz,
perf_caps = &cpu_data->perf_caps; perf_caps = &cpu_data->perf_caps;
max_cap = arch_scale_cpu_capacity(cpu_dev->id); max_cap = arch_scale_cpu_capacity(cpu_dev->id);
perf_prev = cppc_cpufreq_khz_to_perf(cpu_data, KHz); perf_prev = cppc_khz_to_perf(perf_caps, KHz);
perf_step = CPPC_EM_CAP_STEP * perf_caps->highest_perf / max_cap; perf_step = CPPC_EM_CAP_STEP * perf_caps->highest_perf / max_cap;
step = perf_prev / perf_step; step = perf_prev / perf_step;
...@@ -679,10 +584,6 @@ static struct cppc_cpudata *cppc_cpufreq_get_cpu_data(unsigned int cpu) ...@@ -679,10 +584,6 @@ static struct cppc_cpudata *cppc_cpufreq_get_cpu_data(unsigned int cpu)
goto free_mask; goto free_mask;
} }
/* Convert the lowest and nominal freq from MHz to KHz */
cpu_data->perf_caps.lowest_freq *= 1000;
cpu_data->perf_caps.nominal_freq *= 1000;
list_add(&cpu_data->node, &cpu_data_list); list_add(&cpu_data->node, &cpu_data_list);
return cpu_data; return cpu_data;
...@@ -724,20 +625,16 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy) ...@@ -724,20 +625,16 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
* Set min to lowest nonlinear perf to avoid any efficiency penalty (see * Set min to lowest nonlinear perf to avoid any efficiency penalty (see
* Section 8.4.7.1.1.5 of ACPI 6.1 spec) * Section 8.4.7.1.1.5 of ACPI 6.1 spec)
*/ */
policy->min = cppc_cpufreq_perf_to_khz(cpu_data, policy->min = cppc_perf_to_khz(caps, caps->lowest_nonlinear_perf);
caps->lowest_nonlinear_perf); policy->max = cppc_perf_to_khz(caps, caps->nominal_perf);
policy->max = cppc_cpufreq_perf_to_khz(cpu_data,
caps->nominal_perf);
/* /*
* Set cpuinfo.min_freq to Lowest to make the full range of performance * Set cpuinfo.min_freq to Lowest to make the full range of performance
* available if userspace wants to use any perf between lowest & lowest * available if userspace wants to use any perf between lowest & lowest
* nonlinear perf * nonlinear perf
*/ */
policy->cpuinfo.min_freq = cppc_cpufreq_perf_to_khz(cpu_data, policy->cpuinfo.min_freq = cppc_perf_to_khz(caps, caps->lowest_perf);
caps->lowest_perf); policy->cpuinfo.max_freq = cppc_perf_to_khz(caps, caps->nominal_perf);
policy->cpuinfo.max_freq = cppc_cpufreq_perf_to_khz(cpu_data,
caps->nominal_perf);
policy->transition_delay_us = cppc_cpufreq_get_transition_delay_us(cpu); policy->transition_delay_us = cppc_cpufreq_get_transition_delay_us(cpu);
policy->shared_type = cpu_data->shared_type; policy->shared_type = cpu_data->shared_type;
...@@ -773,7 +670,7 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy) ...@@ -773,7 +670,7 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
boost_supported = true; boost_supported = true;
/* Set policy->cur to max now. The governors will adjust later. */ /* Set policy->cur to max now. The governors will adjust later. */
policy->cur = cppc_cpufreq_perf_to_khz(cpu_data, caps->highest_perf); policy->cur = cppc_perf_to_khz(caps, caps->highest_perf);
cpu_data->perf_ctrls.desired_perf = caps->highest_perf; cpu_data->perf_ctrls.desired_perf = caps->highest_perf;
ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls); ret = cppc_set_perf(cpu, &cpu_data->perf_ctrls);
...@@ -863,7 +760,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu) ...@@ -863,7 +760,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
delivered_perf = cppc_perf_from_fbctrs(cpu_data, &fb_ctrs_t0, delivered_perf = cppc_perf_from_fbctrs(cpu_data, &fb_ctrs_t0,
&fb_ctrs_t1); &fb_ctrs_t1);
return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf); return cppc_perf_to_khz(&cpu_data->perf_caps, delivered_perf);
} }
static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state) static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
...@@ -878,11 +775,9 @@ static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state) ...@@ -878,11 +775,9 @@ static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
} }
if (state) if (state)
policy->max = cppc_cpufreq_perf_to_khz(cpu_data, policy->max = cppc_perf_to_khz(caps, caps->highest_perf);
caps->highest_perf);
else else
policy->max = cppc_cpufreq_perf_to_khz(cpu_data, policy->max = cppc_perf_to_khz(caps, caps->nominal_perf);
caps->nominal_perf);
policy->cpuinfo.max_freq = policy->max; policy->cpuinfo.max_freq = policy->max;
ret = freq_qos_update_request(policy->max_freq_req, policy->max); ret = freq_qos_update_request(policy->max_freq_req, policy->max);
...@@ -937,7 +832,7 @@ static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu) ...@@ -937,7 +832,7 @@ static unsigned int hisi_cppc_cpufreq_get_rate(unsigned int cpu)
if (ret < 0) if (ret < 0)
return -EIO; return -EIO;
return cppc_cpufreq_perf_to_khz(cpu_data, desired_perf); return cppc_perf_to_khz(&cpu_data->perf_caps, desired_perf);
} }
static void cppc_check_hisi_workaround(void) static void cppc_check_hisi_workaround(void)
......
...@@ -454,7 +454,7 @@ void cpufreq_freq_transition_end(struct cpufreq_policy *policy, ...@@ -454,7 +454,7 @@ void cpufreq_freq_transition_end(struct cpufreq_policy *policy,
arch_set_freq_scale(policy->related_cpus, arch_set_freq_scale(policy->related_cpus,
policy->cur, policy->cur,
policy->cpuinfo.max_freq); arch_scale_freq_ref(policy->cpu));
spin_lock(&policy->transition_lock); spin_lock(&policy->transition_lock);
policy->transition_ongoing = false; policy->transition_ongoing = false;
...@@ -2174,7 +2174,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy, ...@@ -2174,7 +2174,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,
policy->cur = freq; policy->cur = freq;
arch_set_freq_scale(policy->related_cpus, freq, arch_set_freq_scale(policy->related_cpus, freq,
policy->cpuinfo.max_freq); arch_scale_freq_ref(policy->cpu));
cpufreq_stats_record_transition(policy, freq); cpufreq_stats_record_transition(policy, freq);
if (trace_cpu_frequency_enabled()) { if (trace_cpu_frequency_enabled()) {
......
...@@ -144,6 +144,8 @@ extern int cppc_set_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls); ...@@ -144,6 +144,8 @@ extern int cppc_set_perf(int cpu, struct cppc_perf_ctrls *perf_ctrls);
extern int cppc_set_enable(int cpu, bool enable); extern int cppc_set_enable(int cpu, bool enable);
extern int cppc_get_perf_caps(int cpu, struct cppc_perf_caps *caps); extern int cppc_get_perf_caps(int cpu, struct cppc_perf_caps *caps);
extern bool cppc_perf_ctrs_in_pcc(void); extern bool cppc_perf_ctrs_in_pcc(void);
extern unsigned int cppc_perf_to_khz(struct cppc_perf_caps *caps, unsigned int perf);
extern unsigned int cppc_khz_to_perf(struct cppc_perf_caps *caps, unsigned int freq);
extern bool acpi_cpc_valid(void); extern bool acpi_cpc_valid(void);
extern bool cppc_allow_fast_switch(void); extern bool cppc_allow_fast_switch(void);
extern int acpi_get_psd_map(unsigned int cpu, struct cppc_cpudata *cpu_data); extern int acpi_get_psd_map(unsigned int cpu, struct cppc_cpudata *cpu_data);
......
...@@ -27,6 +27,13 @@ static inline unsigned long topology_get_cpu_scale(int cpu) ...@@ -27,6 +27,13 @@ static inline unsigned long topology_get_cpu_scale(int cpu)
void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity); void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);
DECLARE_PER_CPU(unsigned long, capacity_freq_ref);
static inline unsigned long topology_get_freq_ref(int cpu)
{
return per_cpu(capacity_freq_ref, cpu);
}
DECLARE_PER_CPU(unsigned long, arch_freq_scale); DECLARE_PER_CPU(unsigned long, arch_freq_scale);
static inline unsigned long topology_get_freq_scale(int cpu) static inline unsigned long topology_get_freq_scale(int cpu)
...@@ -92,6 +99,7 @@ void update_siblings_masks(unsigned int cpu); ...@@ -92,6 +99,7 @@ void update_siblings_masks(unsigned int cpu);
void remove_cpu_topology(unsigned int cpuid); void remove_cpu_topology(unsigned int cpuid);
void reset_cpu_topology(void); void reset_cpu_topology(void);
int parse_acpi_topology(void); int parse_acpi_topology(void);
void freq_inv_set_max_ratio(int cpu, u64 max_rate);
#endif #endif
#endif /* _LINUX_ARCH_TOPOLOGY_H_ */ #endif /* _LINUX_ARCH_TOPOLOGY_H_ */
...@@ -1203,6 +1203,7 @@ void arch_set_freq_scale(const struct cpumask *cpus, ...@@ -1203,6 +1203,7 @@ void arch_set_freq_scale(const struct cpumask *cpus,
{ {
} }
#endif #endif
/* the following are really really optional */ /* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs; extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs; extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
......
...@@ -224,7 +224,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, ...@@ -224,7 +224,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util, unsigned long max_util, unsigned long sum_util,
unsigned long allowed_cpu_cap) unsigned long allowed_cpu_cap)
{ {
unsigned long freq, scale_cpu; unsigned long freq, ref_freq, scale_cpu;
struct em_perf_state *ps; struct em_perf_state *ps;
int cpu; int cpu;
...@@ -241,11 +241,10 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, ...@@ -241,11 +241,10 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
*/ */
cpu = cpumask_first(to_cpumask(pd->cpus)); cpu = cpumask_first(to_cpumask(pd->cpus));
scale_cpu = arch_scale_cpu_capacity(cpu); scale_cpu = arch_scale_cpu_capacity(cpu);
ps = &pd->table[pd->nr_perf_states - 1]; ref_freq = arch_scale_freq_ref(cpu);
max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap); max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ps->frequency, scale_cpu); freq = map_util_freq(max_util, ref_freq, scale_cpu);
/* /*
* Find the lowest performance state of the Energy Model above the * Find the lowest performance state of the Energy Model above the
......
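
For reference, with assumed numbers (not from the patch): map_util_freq(util,
ref_freq, scale_cpu) is util * ref_freq / scale_cpu, so with scale_cpu = 1024
and a reference frequency of 2000000 kHz, a max_util of 512 requests

	512 * 2000000 / 1024 = 1000000 kHz

and em_cpu_energy() then picks the lowest Energy Model performance state at or
above that frequency; the reference frequency now comes from
arch_scale_freq_ref() rather than from the highest performance state.
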
...@@ -600,6 +600,9 @@ struct vma_numab_state { ...@@ -600,6 +600,9 @@ struct vma_numab_state {
*/ */
unsigned long pids_active[2]; unsigned long pids_active[2];
/* MM scan sequence ID when scan first started after VMA creation */
int start_scan_seq;
/* /*
* MM scan sequence ID when the VMA was last completely scanned. * MM scan sequence ID when the VMA was last completely scanned.
* A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
......
...@@ -63,11 +63,13 @@ struct robust_list_head; ...@@ -63,11 +63,13 @@ struct robust_list_head;
struct root_domain; struct root_domain;
struct rq; struct rq;
struct sched_attr; struct sched_attr;
struct sched_dl_entity;
struct seq_file; struct seq_file;
struct sighand_struct; struct sighand_struct;
struct signal_struct; struct signal_struct;
struct task_delay_info; struct task_delay_info;
struct task_group; struct task_group;
struct task_struct;
struct user_event_mm; struct user_event_mm;
/* /*
...@@ -413,42 +415,6 @@ struct load_weight { ...@@ -413,42 +415,6 @@ struct load_weight {
u32 inv_weight; u32 inv_weight;
}; };
/**
* struct util_est - Estimation utilization of FAIR tasks
* @enqueued: instantaneous estimated utilization of a task/cpu
* @ewma: the Exponential Weighted Moving Average (EWMA)
* utilization of a task
*
* Support data structure to track an Exponential Weighted Moving Average
* (EWMA) of a FAIR task's utilization. New samples are added to the moving
* average each time a task completes an activation. Sample's weight is chosen
* so that the EWMA will be relatively insensitive to transient changes to the
* task's workload.
*
* The enqueued attribute has a slightly different meaning for tasks and cpus:
* - task: the task's util_avg at last task dequeue time
* - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
* Thus, the util_est.enqueued of a task represents the contribution on the
* estimated utilization of the CPU where that task is currently enqueued.
*
* Only for tasks we track a moving average of the past instantaneous
* estimated utilization. This allows to absorb sporadic drops in utilization
* of an otherwise almost periodic task.
*
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
* updates. When a task is dequeued, its util_est should not be updated if its
* util_avg has not been updated in the meantime.
* This information is mapped into the MSB bit of util_est.enqueued at dequeue
* time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
* for a task) it is safe to use MSB.
*/
struct util_est {
unsigned int enqueued;
unsigned int ewma;
#define UTIL_EST_WEIGHT_SHIFT 2
#define UTIL_AVG_UNCHANGED 0x80000000
} __attribute__((__aligned__(sizeof(u64))));
/* /*
* The load/runnable/util_avg accumulates an infinite geometric series * The load/runnable/util_avg accumulates an infinite geometric series
* (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c). * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
...@@ -503,9 +469,20 @@ struct sched_avg { ...@@ -503,9 +469,20 @@ struct sched_avg {
unsigned long load_avg; unsigned long load_avg;
unsigned long runnable_avg; unsigned long runnable_avg;
unsigned long util_avg; unsigned long util_avg;
struct util_est util_est; unsigned int util_est;
} ____cacheline_aligned; } ____cacheline_aligned;
/*
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
* updates. When a task is dequeued, its util_est should not be updated if its
* util_avg has not been updated in the meantime.
* This information is mapped into the MSB bit of util_est at dequeue time.
* Since max value of util_est for a task is 1024 (PELT util_avg for a task)
* it is safe to use MSB.
*/
#define UTIL_EST_WEIGHT_SHIFT 2
#define UTIL_AVG_UNCHANGED 0x80000000
struct sched_statistics { struct sched_statistics {
#ifdef CONFIG_SCHEDSTATS #ifdef CONFIG_SCHEDSTATS
u64 wait_start; u64 wait_start;
...@@ -523,7 +500,7 @@ struct sched_statistics { ...@@ -523,7 +500,7 @@ struct sched_statistics {
u64 block_max; u64 block_max;
s64 sum_block_runtime; s64 sum_block_runtime;
u64 exec_max; s64 exec_max;
u64 slice_max; u64 slice_max;
u64 nr_migrations_cold; u64 nr_migrations_cold;
...@@ -553,7 +530,7 @@ struct sched_entity { ...@@ -553,7 +530,7 @@ struct sched_entity {
struct load_weight load; struct load_weight load;
struct rb_node run_node; struct rb_node run_node;
u64 deadline; u64 deadline;
u64 min_deadline; u64 min_vruntime;
struct list_head group_node; struct list_head group_node;
unsigned int on_rq; unsigned int on_rq;
...@@ -607,6 +584,9 @@ struct sched_rt_entity { ...@@ -607,6 +584,9 @@ struct sched_rt_entity {
#endif #endif
} __randomize_layout; } __randomize_layout;
typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
struct sched_dl_entity { struct sched_dl_entity {
struct rb_node rb_node; struct rb_node rb_node;
...@@ -654,6 +634,7 @@ struct sched_dl_entity { ...@@ -654,6 +634,7 @@ struct sched_dl_entity {
unsigned int dl_yielded : 1; unsigned int dl_yielded : 1;
unsigned int dl_non_contending : 1; unsigned int dl_non_contending : 1;
unsigned int dl_overrun : 1; unsigned int dl_overrun : 1;
unsigned int dl_server : 1;
/* /*
* Bandwidth enforcement timer. Each -deadline task has its * Bandwidth enforcement timer. Each -deadline task has its
...@@ -668,7 +649,20 @@ struct sched_dl_entity { ...@@ -668,7 +649,20 @@ struct sched_dl_entity {
* timer is needed to decrease the active utilization at the correct * timer is needed to decrease the active utilization at the correct
* time. * time.
*/ */
struct hrtimer inactive_timer; struct hrtimer inactive_timer;
/*
* Bits for DL-server functionality. Also see the comment near
* dl_server_update().
*
* @rq the runqueue this server is for
*
* @server_has_tasks() returns true if @server_pick return a
* runnable task.
*/
struct rq *rq;
dl_server_has_tasks_f server_has_tasks;
dl_server_pick_f server_pick;
#ifdef CONFIG_RT_MUTEXES #ifdef CONFIG_RT_MUTEXES
/* /*
...@@ -795,6 +789,7 @@ struct task_struct { ...@@ -795,6 +789,7 @@ struct task_struct {
struct sched_entity se; struct sched_entity se;
struct sched_rt_entity rt; struct sched_rt_entity rt;
struct sched_dl_entity dl; struct sched_dl_entity dl;
struct sched_dl_entity *dl_server;
const struct sched_class *sched_class; const struct sched_class *sched_class;
#ifdef CONFIG_SCHED_CORE #ifdef CONFIG_SCHED_CORE
......
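
As a hedged sketch of how the new dl_server hooks above could be wired up by a
would-be "fair server" (which, per the merge message, is not part of this pull;
the fair-side helper names below are hypothetical, only the sched_dl_entity
fields come from the diff):

	static bool fair_server_has_tasks(struct sched_dl_entity *dl_se)
	{
		return !!dl_se->rq->cfs.h_nr_running;
	}

	static struct task_struct *fair_server_pick(struct sched_dl_entity *dl_se)
	{
		return pick_next_task_fair(dl_se->rq, NULL, NULL);
	}

	/* during runqueue setup (illustrative only): */
	dl_se->rq = rq;
	dl_se->server_has_tasks = fair_server_has_tasks;
	dl_se->server_pick = fair_server_pick;

A task returned through ->server_pick() runs on behalf of the server, which is
what the p->dl_server back-pointer added to task_struct records.
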
...@@ -279,6 +279,14 @@ void arch_update_thermal_pressure(const struct cpumask *cpus, ...@@ -279,6 +279,14 @@ void arch_update_thermal_pressure(const struct cpumask *cpus,
{ } { }
#endif #endif
#ifndef arch_scale_freq_ref
static __always_inline
unsigned int arch_scale_freq_ref(int cpu)
{
return 0;
}
#endif
static inline int task_node(const struct task_struct *p) static inline int task_node(const struct task_struct *p)
{ {
return cpu_to_node(task_cpu(p)); return cpu_to_node(task_cpu(p));
......
...@@ -493,33 +493,30 @@ DEFINE_EVENT_SCHEDSTAT(sched_stat_template, sched_stat_blocked, ...@@ -493,33 +493,30 @@ DEFINE_EVENT_SCHEDSTAT(sched_stat_template, sched_stat_blocked,
*/ */
DECLARE_EVENT_CLASS(sched_stat_runtime, DECLARE_EVENT_CLASS(sched_stat_runtime,
TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime), TP_PROTO(struct task_struct *tsk, u64 runtime),
TP_ARGS(tsk, __perf_count(runtime), vruntime), TP_ARGS(tsk, __perf_count(runtime)),
TP_STRUCT__entry( TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN ) __array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid ) __field( pid_t, pid )
__field( u64, runtime ) __field( u64, runtime )
__field( u64, vruntime )
), ),
TP_fast_assign( TP_fast_assign(
memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
__entry->pid = tsk->pid; __entry->pid = tsk->pid;
__entry->runtime = runtime; __entry->runtime = runtime;
__entry->vruntime = vruntime;
), ),
TP_printk("comm=%s pid=%d runtime=%Lu [ns] vruntime=%Lu [ns]", TP_printk("comm=%s pid=%d runtime=%Lu [ns]",
__entry->comm, __entry->pid, __entry->comm, __entry->pid,
(unsigned long long)__entry->runtime, (unsigned long long)__entry->runtime)
(unsigned long long)__entry->vruntime)
); );
DEFINE_EVENT(sched_stat_runtime, sched_stat_runtime, DEFINE_EVENT(sched_stat_runtime, sched_stat_runtime,
TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime), TP_PROTO(struct task_struct *tsk, u64 runtime),
TP_ARGS(tsk, runtime, vruntime)); TP_ARGS(tsk, runtime));
/* /*
* Tracepoint for showing priority inheritance modifying a tasks * Tracepoint for showing priority inheritance modifying a tasks
......
...@@ -187,6 +187,7 @@ static int __restore_freezer_state(struct task_struct *p, void *arg) ...@@ -187,6 +187,7 @@ static int __restore_freezer_state(struct task_struct *p, void *arg)
if (state != TASK_RUNNING) { if (state != TASK_RUNNING) {
WRITE_ONCE(p->__state, state); WRITE_ONCE(p->__state, state);
p->saved_state = TASK_RUNNING;
return 1; return 1;
} }
......
...@@ -1131,6 +1131,28 @@ static void wake_up_idle_cpu(int cpu) ...@@ -1131,6 +1131,28 @@ static void wake_up_idle_cpu(int cpu)
if (cpu == smp_processor_id()) if (cpu == smp_processor_id())
return; return;
/*
* Set TIF_NEED_RESCHED and send an IPI if in the non-polling
* part of the idle loop. This forces an exit from the idle loop
* and a round trip to schedule(). Now this could be optimized
* because a simple new idle loop iteration is enough to
* re-evaluate the next tick. Provided some re-ordering of tick
* nohz functions that would need to follow TIF_NR_POLLING
* clearing:
*
* - On most archs, a simple fetch_or on ti::flags with a
* "0" value would be enough to know if an IPI needs to be sent.
*
* - x86 needs to perform a last need_resched() check between
* monitor and mwait which doesn't take timers into account.
* There a dedicated TIF_TIMER flag would be required to
* fetch_or here and be checked along with TIF_NEED_RESCHED
* before mwait().
*
* However, remote timer enqueue is not such a frequent event
* and testing of the above solutions didn't appear to report
* much benefits.
*/
if (set_nr_and_not_polling(rq->idle)) if (set_nr_and_not_polling(rq->idle))
smp_send_reschedule(cpu); smp_send_reschedule(cpu);
else else
...@@ -2124,12 +2146,14 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags) ...@@ -2124,12 +2146,14 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
enqueue_task(rq, p, flags); enqueue_task(rq, p, flags);
p->on_rq = TASK_ON_RQ_QUEUED; WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
} }
void deactivate_task(struct rq *rq, struct task_struct *p, int flags) void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{ {
p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING; WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
dequeue_task(rq, p, flags); dequeue_task(rq, p, flags);
} }
...@@ -3795,6 +3819,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, ...@@ -3795,6 +3819,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
rq->idle_stamp = 0; rq->idle_stamp = 0;
} }
#endif #endif
p->dl_server = NULL;
} }
/* /*
...@@ -4509,10 +4535,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) ...@@ -4509,10 +4535,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
memset(&p->stats, 0, sizeof(p->stats)); memset(&p->stats, 0, sizeof(p->stats));
#endif #endif
RB_CLEAR_NODE(&p->dl.rb_node); init_dl_entity(&p->dl);
init_dl_task_timer(&p->dl);
init_dl_inactive_task_timer(&p->dl);
__dl_clear_params(p);
INIT_LIST_HEAD(&p->rt.run_list); INIT_LIST_HEAD(&p->rt.run_list);
p->rt.timeout = 0; p->rt.timeout = 0;
...@@ -6004,12 +6027,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) ...@@ -6004,12 +6027,27 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
p = pick_next_task_idle(rq); p = pick_next_task_idle(rq);
} }
/*
* This is the fast path; it cannot be a DL server pick;
* therefore even if @p == @prev, ->dl_server must be NULL.
*/
if (p->dl_server)
p->dl_server = NULL;
return p; return p;
} }
restart: restart:
put_prev_task_balance(rq, prev, rf); put_prev_task_balance(rq, prev, rf);
/*
* We've updated @prev and no longer need the server link, clear it.
* Must be done before ->pick_next_task() because that can (re)set
* ->dl_server.
*/
if (prev->dl_server)
prev->dl_server = NULL;
for_each_class(class) { for_each_class(class) {
p = class->pick_next_task(rq); p = class->pick_next_task(rq);
if (p) if (p)
...@@ -7429,18 +7467,13 @@ int sched_core_idle_cpu(int cpu) ...@@ -7429,18 +7467,13 @@ int sched_core_idle_cpu(int cpu)
* required to meet deadlines. * required to meet deadlines.
*/ */
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
enum cpu_util_type type, unsigned long *min,
struct task_struct *p) unsigned long *max)
{ {
unsigned long dl_util, util, irq, max; unsigned long util, irq, scale;
struct rq *rq = cpu_rq(cpu); struct rq *rq = cpu_rq(cpu);
max = arch_scale_cpu_capacity(cpu); scale = arch_scale_cpu_capacity(cpu);
if (!uclamp_is_used() &&
type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt)) {
return max;
}
/* /*
* Early check to see if IRQ/steal time saturates the CPU, can be * Early check to see if IRQ/steal time saturates the CPU, can be
...@@ -7448,45 +7481,49 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, ...@@ -7448,45 +7481,49 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* update_irq_load_avg(). * update_irq_load_avg().
*/ */
irq = cpu_util_irq(rq); irq = cpu_util_irq(rq);
if (unlikely(irq >= max)) if (unlikely(irq >= scale)) {
return max; if (min)
*min = scale;
if (max)
*max = scale;
return scale;
}
if (min) {
/*
* The minimum utilization returns the highest level between:
* - the computed DL bandwidth needed with the IRQ pressure which
* steals time to the deadline task.
* - The minimum performance requirement for CFS and/or RT.
*/
*min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
/*
* When an RT task is runnable and uclamp is not used, we must
* ensure that the task will run at maximum compute capacity.
*/
if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt))
*min = max(*min, scale);
}
/* /*
* Because the time spend on RT/DL tasks is visible as 'lost' time to * Because the time spend on RT/DL tasks is visible as 'lost' time to
* CFS tasks and we use the same metric to track the effective * CFS tasks and we use the same metric to track the effective
* utilization (PELT windows are synchronized) we can directly add them * utilization (PELT windows are synchronized) we can directly add them
* to obtain the CPU's actual utilization. * to obtain the CPU's actual utilization.
*
* CFS and RT utilization can be boosted or capped, depending on
* utilization clamp constraints requested by currently RUNNABLE
* tasks.
* When there are no CFS RUNNABLE tasks, clamps are released and
* frequency will be gracefully reduced with the utilization decay.
*/ */
util = util_cfs + cpu_util_rt(rq); util = util_cfs + cpu_util_rt(rq);
if (type == FREQUENCY_UTIL) util += cpu_util_dl(rq);
util = uclamp_rq_util_with(rq, util, p);
dl_util = cpu_util_dl(rq);
/* /*
* For frequency selection we do not make cpu_util_dl() a permanent part * The maximum hint is a soft bandwidth requirement, which can be lower
* of this sum because we want to use cpu_bw_dl() later on, but we need * than the actual utilization because of uclamp_max requirements.
* to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
* that we select f_max when there is no idle time.
*
* NOTE: numerical errors or stop class might cause us to not quite hit
* saturation when we should -- something for later.
*/ */
if (util + dl_util >= max) if (max)
return max; *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
/* if (util >= scale)
* OTOH, for energy computation we need the estimated running time, so return scale;
* include util_dl and ignore dl_bw.
*/
if (type == ENERGY_UTIL)
util += dl_util;
/* /*
* There is still idle time; further improve the number by using the * There is still idle time; further improve the number by using the
...@@ -7497,28 +7534,15 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, ...@@ -7497,28 +7534,15 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* U' = irq + --------- * U * U' = irq + --------- * U
* max * max
*/ */
util = scale_irq_capacity(util, irq, max); util = scale_irq_capacity(util, irq, scale);
util += irq; util += irq;
/* return min(scale, util);
* Bandwidth required by DEADLINE must always be granted while, for
* FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
* to gracefully reduce the frequency when no tasks show up for longer
* periods of time.
*
* Ideally we would like to set bw_dl as min/guaranteed freq and util +
* bw_dl as requested freq. However, cpufreq is not yet ready for such
* an interface. So, we only do the latter for now.
*/
if (type == FREQUENCY_UTIL)
util += cpu_bw_dl(rq);
return min(max, util);
} }
unsigned long sched_cpu_util(int cpu) unsigned long sched_cpu_util(int cpu)
{ {
return effective_cpu_util(cpu, cpu_util_cfs(cpu), ENERGY_UTIL, NULL); return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);
} }
#endif /* CONFIG_SMP */ #endif /* CONFIG_SMP */
......
...@@ -47,7 +47,7 @@ struct sugov_cpu { ...@@ -47,7 +47,7 @@ struct sugov_cpu {
u64 last_update; u64 last_update;
unsigned long util; unsigned long util;
unsigned long bw_dl; unsigned long bw_min;
/* The field below is for single-CPU policies only: */ /* The field below is for single-CPU policies only: */
#ifdef CONFIG_NO_HZ_COMMON #ifdef CONFIG_NO_HZ_COMMON
...@@ -114,6 +114,28 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy) ...@@ -114,6 +114,28 @@ static void sugov_deferred_update(struct sugov_policy *sg_policy)
} }
} }
/**
* get_capacity_ref_freq - get the reference frequency that has been used to
* correlate frequency and compute capacity for a given cpufreq policy. We use
* the CPU managing it for the arch_scale_freq_ref() call in the function.
* @policy: the cpufreq policy of the CPU in question.
*
* Return: the reference CPU frequency to compute a capacity.
*/
static __always_inline
unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
{
unsigned int freq = arch_scale_freq_ref(policy->cpu);
if (freq)
return freq;
if (arch_scale_freq_invariant())
return policy->cpuinfo.max_freq;
return policy->cur;
}
/** /**
* get_next_freq - Compute a new frequency for a given cpufreq policy. * get_next_freq - Compute a new frequency for a given cpufreq policy.
* @sg_policy: schedutil policy object to compute the new frequency for. * @sg_policy: schedutil policy object to compute the new frequency for.
...@@ -140,10 +162,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, ...@@ -140,10 +162,9 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
unsigned long util, unsigned long max) unsigned long util, unsigned long max)
{ {
struct cpufreq_policy *policy = sg_policy->policy; struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ? unsigned int freq;
policy->cpuinfo.max_freq : policy->cur;
util = map_util_perf(util); freq = get_capacity_ref_freq(policy);
freq = map_util_freq(util, freq, max); freq = map_util_freq(util, freq, max);
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update) if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
...@@ -153,14 +174,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, ...@@ -153,14 +174,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq); return cpufreq_driver_resolve_freq(policy, freq);
} }
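A rough example of the mapping (assuming a platform where arch_scale_freq_ref() reports a 2 GHz reference): with util = 640 (DVFS headroom already applied by the caller) and max = 1024, map_util_freq() requests 2000000 kHz * 640 / 1024 = 1250000 kHz, which cpufreq_driver_resolve_freq() then rounds to a real operating point.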
static void sugov_get_util(struct sugov_cpu *sg_cpu) unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max)
{
/* Add dvfs headroom to actual utilization */
actual = map_util_perf(actual);
/* Actually we don't need to target the max performance */
if (actual < max)
max = actual;
/*
* Ensure at least minimum performance while providing more compute
* capacity when possible.
*/
return max(min, max);
}
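Illustrative numbers for the helper above (assuming the usual 25% DVFS headroom from map_util_perf()): actual = 512 becomes 640, which is above an uclamp'ed max of 600, so max stays 600 and the result is max(128, 600) = 600; a small actual of 64 becomes 80, shrinking max to 80, and the result is max(128, 80) = 128, i.e. the minimum performance floor wins.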
static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
{ {
unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu); unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
struct rq *rq = cpu_rq(sg_cpu->cpu);
sg_cpu->bw_dl = cpu_bw_dl(rq); util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util, util = max(util, boost);
FREQUENCY_UTIL, NULL); sg_cpu->bw_min = min;
sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
} }
/** /**
...@@ -251,18 +289,16 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, ...@@ -251,18 +289,16 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
 * This mechanism is designed to boost tasks that frequently wait on IO, while * This mechanism is designed to boost tasks that frequently wait on IO, while
 * being more conservative on tasks that do only sporadic IO operations. * being more conservative on tasks that do only sporadic IO operations.
*/ */
static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
unsigned long max_cap) unsigned long max_cap)
{ {
unsigned long boost;
/* No boost currently required */ /* No boost currently required */
if (!sg_cpu->iowait_boost) if (!sg_cpu->iowait_boost)
return; return 0;
/* Reset boost if the CPU appears to have been idle enough */ /* Reset boost if the CPU appears to have been idle enough */
if (sugov_iowait_reset(sg_cpu, time, false)) if (sugov_iowait_reset(sg_cpu, time, false))
return; return 0;
if (!sg_cpu->iowait_boost_pending) { if (!sg_cpu->iowait_boost_pending) {
/* /*
...@@ -271,7 +307,7 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, ...@@ -271,7 +307,7 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
sg_cpu->iowait_boost >>= 1; sg_cpu->iowait_boost >>= 1;
if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) { if (sg_cpu->iowait_boost < IOWAIT_BOOST_MIN) {
sg_cpu->iowait_boost = 0; sg_cpu->iowait_boost = 0;
return; return 0;
} }
} }
...@@ -281,10 +317,7 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, ...@@ -281,10 +317,7 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
* sg_cpu->util is already in capacity scale; convert iowait_boost * sg_cpu->util is already in capacity scale; convert iowait_boost
* into the same scale so we can compare. * into the same scale so we can compare.
*/ */
boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT; return (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
if (sg_cpu->util < boost)
sg_cpu->util = boost;
} }
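For example, an iowait_boost of 512 (half of SCHED_CAPACITY_SCALE) on a CPU whose max_cap is 800 converts to (512 * 800) >> 10 = 400; sugov_get_util() then folds that in via util = max(util, boost) before applying the uclamp bounds.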
#ifdef CONFIG_NO_HZ_COMMON #ifdef CONFIG_NO_HZ_COMMON
...@@ -306,7 +339,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } ...@@ -306,7 +339,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
*/ */
static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu) static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
{ {
if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl) if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
sg_cpu->sg_policy->limits_changed = true; sg_cpu->sg_policy->limits_changed = true;
} }
...@@ -314,6 +347,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, ...@@ -314,6 +347,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
u64 time, unsigned long max_cap, u64 time, unsigned long max_cap,
unsigned int flags) unsigned int flags)
{ {
unsigned long boost;
sugov_iowait_boost(sg_cpu, time, flags); sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time; sg_cpu->last_update = time;
...@@ -322,8 +357,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, ...@@ -322,8 +357,8 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
return false; return false;
sugov_get_util(sg_cpu); boost = sugov_iowait_apply(sg_cpu, time, max_cap);
sugov_iowait_apply(sg_cpu, time, max_cap); sugov_get_util(sg_cpu, boost);
return true; return true;
} }
...@@ -407,8 +442,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, ...@@ -407,8 +442,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util) sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util; sg_cpu->util = prev_util;
cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl), cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
map_util_perf(sg_cpu->util), max_cap); sg_cpu->util, max_cap);
sg_cpu->sg_policy->last_freq_update_time = time; sg_cpu->sg_policy->last_freq_update_time = time;
} }
...@@ -424,9 +459,10 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) ...@@ -424,9 +459,10 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
for_each_cpu(j, policy->cpus) { for_each_cpu(j, policy->cpus) {
struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j); struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
unsigned long boost;
sugov_get_util(j_sg_cpu); boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
sugov_iowait_apply(j_sg_cpu, time, max_cap); sugov_get_util(j_sg_cpu, boost);
util = max(j_sg_cpu->util, util); util = max(j_sg_cpu->util, util);
} }
......
...@@ -628,8 +628,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu) ...@@ -628,8 +628,8 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{ {
s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread; s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, left_deadline = -1, spread;
struct sched_entity *last, *first; struct sched_entity *last, *first, *root;
struct rq *rq = cpu_rq(cpu); struct rq *rq = cpu_rq(cpu);
unsigned long flags; unsigned long flags;
...@@ -644,15 +644,20 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) ...@@ -644,15 +644,20 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SPLIT_NS(cfs_rq->exec_clock)); SPLIT_NS(cfs_rq->exec_clock));
raw_spin_rq_lock_irqsave(rq, flags); raw_spin_rq_lock_irqsave(rq, flags);
root = __pick_root_entity(cfs_rq);
if (root)
left_vruntime = root->min_vruntime;
first = __pick_first_entity(cfs_rq); first = __pick_first_entity(cfs_rq);
if (first) if (first)
left_vruntime = first->vruntime; left_deadline = first->deadline;
last = __pick_last_entity(cfs_rq); last = __pick_last_entity(cfs_rq);
if (last) if (last)
right_vruntime = last->vruntime; right_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime; min_vruntime = cfs_rq->min_vruntime;
raw_spin_rq_unlock_irqrestore(rq, flags); raw_spin_rq_unlock_irqrestore(rq, flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
SPLIT_NS(left_deadline));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_vruntime", SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_vruntime",
SPLIT_NS(left_vruntime)); SPLIT_NS(left_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime", SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
...@@ -679,8 +684,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) ...@@ -679,8 +684,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_avg); cfs_rq->avg.runnable_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg", SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg); cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued", SEQ_printf(m, " .%-30s: %u\n", "util_est",
cfs_rq->avg.util_est.enqueued); cfs_rq->avg.util_est);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg", SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg); cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg", SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
...@@ -1070,8 +1075,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, ...@@ -1070,8 +1075,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_avg); P(se.avg.runnable_avg);
P(se.avg.util_avg); P(se.avg.util_avg);
P(se.avg.last_update_time); P(se.avg.last_update_time);
P(se.avg.util_est.ewma); PM(se.avg.util_est, ~UTIL_AVG_UNCHANGED);
PM(se.avg.util_est.enqueued, ~UTIL_AVG_UNCHANGED);
#endif #endif
#ifdef CONFIG_UCLAMP_TASK #ifdef CONFIG_UCLAMP_TASK
__PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value); __PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value);
......
...@@ -83,7 +83,6 @@ SCHED_FEAT(WA_BIAS, true) ...@@ -83,7 +83,6 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization. * UtilEstimation. Use estimated CPU utilization.
*/ */
SCHED_FEAT(UTIL_EST, true) SCHED_FEAT(UTIL_EST, true)
SCHED_FEAT(UTIL_EST_FASTUP, true)
SCHED_FEAT(LATENCY_WARN, false) SCHED_FEAT(LATENCY_WARN, false)
......
...@@ -258,6 +258,36 @@ static void do_idle(void) ...@@ -258,6 +258,36 @@ static void do_idle(void)
while (!need_resched()) { while (!need_resched()) {
rmb(); rmb();
/*
* Interrupts shouldn't be re-enabled from that point on until
* the CPU sleeping instruction is reached. Otherwise an interrupt
* may fire and queue a timer that would be ignored until the CPU
		 * wakes from the sleeping instruction, and testing need_resched()
		 * alone does not reveal that a timer reprogram is pending.
*
* Several cases to consider:
*
* - SLEEP-UNTIL-PENDING-INTERRUPT based instructions such as
* "wfi" or "mwait" are fine because they can be entered with
		 *   interrupts disabled.
*
		 * - the sti;mwait() pair is fine because interrupts are
* re-enabled only upon the execution of mwait, leaving no gap
* in-between.
*
* - ROLLBACK based idle handlers with the sleeping instruction
* called with interrupts enabled are NOT fine. In this scheme
* when the interrupt detects it has interrupted an idle handler,
* it rolls back to its beginning which performs the
* need_resched() check before re-executing the sleeping
		 *   instruction. This can cause a needed timer reprogram to be missed.
* If such a scheme is really mandatory due to the lack of an
* appropriate CPU sleeping instruction, then a FAST-FORWARD
* must instead be applied: when the interrupt detects it has
* interrupted an idle handler, it must resume to the end of
* this idle handler so that the generic idle loop is iterated
* again to reprogram the tick.
*/
local_irq_disable(); local_irq_disable();
if (cpu_is_offline(cpu)) { if (cpu_is_offline(cpu)) {
......
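A minimal sketch of the window the new comment warns about (illustrative only; cpu_do_sleep() and cpu_do_sleep_irqs_off() are hypothetical stand-ins for an architecture's sleeping instruction):

	/* UNSAFE (ROLLBACK-style handler entered with interrupts enabled):
	 * an IRQ can fire between the check and the sleep, queue a timer
	 * that needs the tick reprogrammed, and we fall asleep anyway. */
	local_irq_enable();
	if (!need_resched())
		cpu_do_sleep();			/* hypothetical */

	/* SAFE: wfi/mwait-like instructions can be entered with IRQs
	 * disabled, so nothing can slip in between check and sleep. */
	local_irq_disable();
	if (!need_resched())
		cpu_do_sleep_irqs_off();	/* hypothetical */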
...@@ -52,13 +52,13 @@ static inline void cfs_se_util_change(struct sched_avg *avg) ...@@ -52,13 +52,13 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
return; return;
/* Avoid store if the flag has been already reset */ /* Avoid store if the flag has been already reset */
enqueued = avg->util_est.enqueued; enqueued = avg->util_est;
if (!(enqueued & UTIL_AVG_UNCHANGED)) if (!(enqueued & UTIL_AVG_UNCHANGED))
return; return;
/* Reset flag to report util_avg has been updated */ /* Reset flag to report util_avg has been updated */
enqueued &= ~UTIL_AVG_UNCHANGED; enqueued &= ~UTIL_AVG_UNCHANGED;
WRITE_ONCE(avg->util_est.enqueued, enqueued); WRITE_ONCE(avg->util_est, enqueued);
} }
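For reference, with util_est now a plain scalar carrying the flag in-band, a reader masks it out much like the debug code above does (a sketch; p is assumed to be a struct task_struct pointer):

	/* util_est carries UTIL_AVG_UNCHANGED in-band; mask it off when reading. */
	unsigned int est = READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;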
static inline u64 rq_clock_pelt(struct rq *rq) static inline u64 rq_clock_pelt(struct rq *rq)
......
...@@ -1002,24 +1002,15 @@ static void update_curr_rt(struct rq *rq) ...@@ -1002,24 +1002,15 @@ static void update_curr_rt(struct rq *rq)
{ {
struct task_struct *curr = rq->curr; struct task_struct *curr = rq->curr;
struct sched_rt_entity *rt_se = &curr->rt; struct sched_rt_entity *rt_se = &curr->rt;
u64 delta_exec; s64 delta_exec;
u64 now;
if (curr->sched_class != &rt_sched_class) if (curr->sched_class != &rt_sched_class)
return; return;
now = rq_clock_task(rq); delta_exec = update_curr_common(rq);
delta_exec = now - curr->se.exec_start; if (unlikely(delta_exec <= 0))
if (unlikely((s64)delta_exec <= 0))
return; return;
schedstat_set(curr->stats.exec_max,
max(curr->stats.exec_max, delta_exec));
trace_sched_stat_runtime(curr, delta_exec, 0);
update_current_exec_runtime(curr, now, delta_exec);
if (!rt_bandwidth_enabled()) if (!rt_bandwidth_enabled())
return; return;
......
...@@ -273,8 +273,6 @@ struct rt_bandwidth { ...@@ -273,8 +273,6 @@ struct rt_bandwidth {
unsigned int rt_period_active; unsigned int rt_period_active;
}; };
void __dl_clear_params(struct task_struct *p);
static inline int dl_bandwidth_enabled(void) static inline int dl_bandwidth_enabled(void)
{ {
return sysctl_sched_rt_runtime >= 0; return sysctl_sched_rt_runtime >= 0;
...@@ -315,6 +313,33 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att ...@@ -315,6 +313,33 @@ extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *att
extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial); extern int dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int dl_bw_check_overflow(int cpu); extern int dl_bw_check_overflow(int cpu);
/*
* SCHED_DEADLINE supports servers (nested scheduling) with the following
* interface:
*
* dl_se::rq -- runqueue we belong to.
*
* dl_se::server_has_tasks() -- used on bandwidth enforcement; we 'stop' the
* server when it runs out of tasks to run.
*
* dl_se::server_pick() -- nested pick_next_task(); we yield the period if this
* returns NULL.
*
* dl_server_update() -- called from update_curr_common(), propagates runtime
* to the server.
*
* dl_server_start()
* dl_server_stop() -- start/stop the server when it has (no) tasks.
*
* dl_server_init() -- initializes the server.
*/
extern void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec);
extern void dl_server_start(struct sched_dl_entity *dl_se);
extern void dl_server_stop(struct sched_dl_entity *dl_se);
extern void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
dl_server_has_tasks_f has_tasks,
dl_server_pick_f pick);
#ifdef CONFIG_CGROUP_SCHED #ifdef CONFIG_CGROUP_SCHED
struct cfs_rq; struct cfs_rq;
...@@ -2179,6 +2204,10 @@ extern const u32 sched_prio_to_wmult[40]; ...@@ -2179,6 +2204,10 @@ extern const u32 sched_prio_to_wmult[40];
* MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location * MOVE - paired with SAVE/RESTORE, explicitly does not preserve the location
* in the runqueue. * in the runqueue.
* *
* NOCLOCK - skip the update_rq_clock() (avoids double updates)
*
* MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
*
* ENQUEUE_HEAD - place at front of runqueue (tail if not specified) * ENQUEUE_HEAD - place at front of runqueue (tail if not specified)
* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline) * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
* ENQUEUE_MIGRATED - the task was migrated during wakeup * ENQUEUE_MIGRATED - the task was migrated during wakeup
...@@ -2189,6 +2218,7 @@ extern const u32 sched_prio_to_wmult[40]; ...@@ -2189,6 +2218,7 @@ extern const u32 sched_prio_to_wmult[40];
#define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */ #define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */
#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */ #define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */ #define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
#define ENQUEUE_WAKEUP 0x01 #define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02 #define ENQUEUE_RESTORE 0x02
...@@ -2203,6 +2233,7 @@ extern const u32 sched_prio_to_wmult[40]; ...@@ -2203,6 +2233,7 @@ extern const u32 sched_prio_to_wmult[40];
#define ENQUEUE_MIGRATED 0x00 #define ENQUEUE_MIGRATED 0x00
#endif #endif
#define ENQUEUE_INITIAL 0x80 #define ENQUEUE_INITIAL 0x80
#define ENQUEUE_MIGRATING 0x100
#define RETRY_TASK ((void *)-1UL) #define RETRY_TASK ((void *)-1UL)
...@@ -2212,6 +2243,8 @@ struct affinity_context { ...@@ -2212,6 +2243,8 @@ struct affinity_context {
unsigned int flags; unsigned int flags;
}; };
extern s64 update_curr_common(struct rq *rq);
struct sched_class { struct sched_class {
#ifdef CONFIG_UCLAMP_TASK #ifdef CONFIG_UCLAMP_TASK
...@@ -2425,8 +2458,7 @@ extern struct rt_bandwidth def_rt_bandwidth; ...@@ -2425,8 +2458,7 @@ extern struct rt_bandwidth def_rt_bandwidth;
extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime); extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq); extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
extern void init_dl_task_timer(struct sched_dl_entity *dl_se); extern void init_dl_entity(struct sched_dl_entity *dl_se);
extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se);
#define BW_SHIFT 20 #define BW_SHIFT 20
#define BW_UNIT (1 << BW_SHIFT) #define BW_UNIT (1 << BW_SHIFT)
...@@ -2822,6 +2854,7 @@ DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq, ...@@ -2822,6 +2854,7 @@ DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
double_rq_lock(_T->lock, _T->lock2), double_rq_lock(_T->lock, _T->lock2),
double_rq_unlock(_T->lock, _T->lock2)) double_rq_unlock(_T->lock, _T->lock2))
extern struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq); extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq); extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);
...@@ -2961,24 +2994,14 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {} ...@@ -2961,24 +2994,14 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
#endif #endif
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
/**
* enum cpu_util_type - CPU utilization type
* @FREQUENCY_UTIL: Utilization used to select frequency
* @ENERGY_UTIL: Utilization used during energy calculation
*
* The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
* need to be aggregated differently depending on the usage made of them. This
* enum is used within effective_cpu_util() to differentiate the types of
* utilization expected by the callers, and adjust the aggregation accordingly.
*/
enum cpu_util_type {
FREQUENCY_UTIL,
ENERGY_UTIL,
};
unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
enum cpu_util_type type, unsigned long *min,
struct task_struct *p); unsigned long *max);
unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max);
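A usage sketch of the reworked interface (illustrative only; cpu is assumed to be a valid CPU id, and the calls mirror what sugov_get_util() and sched_cpu_util() do in this series):

	unsigned long min, max;
	unsigned long util, perf, busy;

	/* Frequency-style aggregation: get the bounds, then apply DVFS headroom. */
	util = effective_cpu_util(cpu, cpu_util_cfs_boost(cpu), &min, &max);
	perf = sugov_effective_cpu_perf(cpu, util, min, max);

	/* Energy-style aggregation: only the raw utilization is needed. */
	busy = effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL);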
/* /*
* Verify the fitness of task @p to run on @cpu taking into account the * Verify the fitness of task @p to run on @cpu taking into account the
...@@ -3035,59 +3058,6 @@ static inline bool uclamp_rq_is_idle(struct rq *rq) ...@@ -3035,59 +3058,6 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return rq->uclamp_flags & UCLAMP_FLAG_IDLE; return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
} }
/**
* uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
* @rq: The rq to clamp against. Must not be NULL.
* @util: The util value to clamp.
* @p: The task to clamp against. Can be NULL if you want to clamp
* against @rq only.
*
* Clamps the passed @util to the max(@rq, @p) effective uclamp values.
*
* If sched_uclamp_used static key is disabled, then just return the util
* without any clamping since uclamp aggregation at the rq level in the fast
* path is disabled, rendering this operation a NOP.
*
* Use uclamp_eff_value() if you don't care about uclamp values at rq level. It
* will return the correct effective uclamp value of the task even if the
* static key is disabled.
*/
static __always_inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
struct task_struct *p)
{
unsigned long min_util = 0;
unsigned long max_util = 0;
if (!static_branch_likely(&sched_uclamp_used))
return util;
if (p) {
min_util = uclamp_eff_value(p, UCLAMP_MIN);
max_util = uclamp_eff_value(p, UCLAMP_MAX);
/*
* Ignore last runnable task's max clamp, as this task will
* reset it. Similarly, no need to read the rq's min clamp.
*/
if (uclamp_rq_is_idle(rq))
goto out;
}
min_util = max_t(unsigned long, min_util, uclamp_rq_get(rq, UCLAMP_MIN));
max_util = max_t(unsigned long, max_util, uclamp_rq_get(rq, UCLAMP_MAX));
out:
/*
* Since CPU's {min,max}_util clamps are MAX aggregated considering
* RUNNABLE tasks with _different_ clamps, we can end up with an
* inversion. Fix it now when the clamps are applied.
*/
if (unlikely(min_util >= max_util))
return min_util;
return clamp(util, min_util, max_util);
}
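For instance (illustrative numbers): if one runnable task requests UCLAMP_MIN = 800 while the rq-level UCLAMP_MAX aggregates to 600, the clamps invert (min_util > max_util) and the removed helper returned min_util, i.e. 800, rather than calling clamp() with an inverted range.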
/* Is the rq being capped/throttled by uclamp_max? */ /* Is the rq being capped/throttled by uclamp_max? */
static inline bool uclamp_rq_is_capped(struct rq *rq) static inline bool uclamp_rq_is_capped(struct rq *rq)
{ {
...@@ -3125,13 +3095,6 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p, ...@@ -3125,13 +3095,6 @@ static inline unsigned long uclamp_eff_value(struct task_struct *p,
return SCHED_CAPACITY_SCALE; return SCHED_CAPACITY_SCALE;
} }
static inline
unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
struct task_struct *p)
{
return util;
}
static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; } static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
static inline bool uclamp_is_used(void) static inline bool uclamp_is_used(void)
...@@ -3261,16 +3224,6 @@ extern int sched_dynamic_mode(const char *str); ...@@ -3261,16 +3224,6 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode); extern void sched_dynamic_update(int mode);
#endif #endif
static inline void update_current_exec_runtime(struct task_struct *curr,
u64 now, u64 delta_exec)
{
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
curr->se.exec_start = now;
cgroup_account_cputime(curr, delta_exec);
}
#ifdef CONFIG_SCHED_MM_CID #ifdef CONFIG_SCHED_MM_CID
#define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */ #define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
......
...@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq) ...@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
static void put_prev_task_stop(struct rq *rq, struct task_struct *prev) static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
{ {
struct task_struct *curr = rq->curr; update_curr_common(rq);
u64 now, delta_exec;
now = rq_clock_task(rq);
delta_exec = now - curr->se.exec_start;
if (unlikely((s64)delta_exec < 0))
delta_exec = 0;
schedstat_set(curr->stats.exec_max,
max(curr->stats.exec_max, delta_exec));
update_current_exec_runtime(curr, now, delta_exec);
} }
/* /*
......