    sched/fair: Fix task utilization accountability in compute_energy() · 0372e1cf
    Vincent Donnefort authored
    find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
    energy delta as follows:
    
      feec(task)
        for_each_pd
          base_energy = compute_energy(task, -1, pd)
            -> for_each_cpu(pd)
               -> cpu_util_next(cpu, task, -1)
    
          energy_delta = compute_energy(task, dst_cpu, pd)
            -> for_each_cpu(pd)
               -> cpu_util_next(cpu, task, dst_cpu)
          energy_delta -= base_energy
    
    Then it picks the best CPU as being the one that minimizes energy_delta.
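    
    The selection step above can be sketched in plain C. This is only an
    illustration, not the kernel code: struct pd_estimate and
    pick_best_cpu() are hypothetical names, and the sketch assumes that
    placing the task never lowers the energy of a perf_domain.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-pd result: energy without the task and with it on dst_cpu. */
struct pd_estimate {
	int dst_cpu;
	unsigned long base_energy;
	unsigned long energy_with_task;
};

/* Return the candidate CPU with the smallest energy delta, or -1 if n == 0. */
int pick_best_cpu(const struct pd_estimate *pds, size_t n)
{
	unsigned long best_delta = ULONG_MAX;
	int best_cpu = -1;

	assert(pds != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* Assumption: energy_with_task >= base_energy, so no underflow. */
		unsigned long delta = pds[i].energy_with_task - pds[i].base_energy;

		if (best_cpu == -1 || delta < best_delta) {
			best_delta = delta;
			best_cpu = pds[i].dst_cpu;
		}
	}

	return best_cpu;
}
```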
    
    cpu_util_next() estimates the CPU utilization that would result if the
    task were placed on dst_cpu, as follows:
    
      max(cpu_util + task_util, cpu_util_est + _task_util_est)
    
    The task contribution to the energy delta can then be either:
    
      (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
          and _task_util_est > cpu_util.
      (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
    
      (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
       otherwise must be small enough so that feec() takes the CPU as a
       potential target for the task placement)
    
    This is problematic for feec(), as cpu_util_next() might give an unfair
    advantage to a CPU which is mostly busy (2) compared to one which is
    mostly idle (1). Since _task_util_est is always bigger than task_util in
    feec() (as the task is waking up), the task contribution to the energy
    looks smaller on mostly busy CPUs (2), and this breaks the energy
    comparison.
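    
    The bias can be reproduced with a few lines of C. This is a sketch with
    made-up numbers, not the kernel implementation: the function names are
    hypothetical, and the baseline without the task is assumed to be
    max(cpu_util, cpu_util_est).

```c
#include <assert.h>

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Estimate from the text: max(cpu_util + task_util, cpu_util_est + _task_util_est). */
unsigned long old_util_with_task(unsigned long cpu_util, unsigned long cpu_util_est,
				 unsigned long task_util, unsigned long task_util_est)
{
	return max_ul(cpu_util + task_util, cpu_util_est + task_util_est);
}

/* What the task appears to add on top of the assumed baseline. */
unsigned long old_task_contribution(unsigned long cpu_util, unsigned long cpu_util_est,
				    unsigned long task_util, unsigned long task_util_est)
{
	unsigned long base = max_ul(cpu_util, cpu_util_est);
	unsigned long with = old_util_with_task(cpu_util, cpu_util_est,
						task_util, task_util_est);

	assert(with >= base);
	return with - base;
}
```

    With task_util = 100 and _task_util_est = 300, a mostly idle CPU
    (cpu_util = 0, cpu_util_est = 0) is charged 300 for the task, while a
    mostly busy CPU (cpu_util = 400, cpu_util_est = 50) is charged only 100.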
    
    This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
    their cpu_util < _task_util_est (1) while others will maintain cpu_util >
    _task_util_est (2).
    
    Fix this problem by always using max(task_util, _task_util_est) as the
    task contribution to the energy (ENERGY_UTIL). The new estimated CPU
    utilization for the energy is then:
    
      max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
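    
    As a minimal sketch of the formula above (hypothetical helper names, not
    the code in fair.c), the task contribution no longer depends on how busy
    the CPU is:

```c
#include <assert.h>

static unsigned long umax(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* New energy estimate: max(cpu_util, cpu_util_est) + max(task_util, _task_util_est). */
unsigned long new_energy_util(unsigned long cpu_util, unsigned long cpu_util_est,
			      unsigned long task_util, unsigned long task_util_est)
{
	return umax(cpu_util, cpu_util_est) + umax(task_util, task_util_est);
}

/* The task's share of the estimate: the same value on every CPU. */
unsigned long new_task_contribution(unsigned long cpu_util, unsigned long cpu_util_est,
				    unsigned long task_util, unsigned long task_util_est)
{
	unsigned long total = new_energy_util(cpu_util, cpu_util_est,
					      task_util, task_util_est);

	return total - umax(cpu_util, cpu_util_est);
}
```

    With task_util = 100 and _task_util_est = 300, both an idle CPU (0, 0)
    and a busy CPU (cpu_util = 400, cpu_util_est = 50) are charged 300.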
    
    compute_energy() still needs to know which OPP would be selected if the
    task were migrated into the perf_domain (FREQUENCY_UTIL). Hence,
    cpu_util_next() is still used to estimate the maximum util within the pd.
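    
    The split between the two estimates can be sketched as follows. This is
    an assumption-laden illustration rather than the actual compute_energy():
    struct cpu_sample, struct pd_util and pd_util_with_task() are
    hypothetical, and a dst_cpu of -1 means the task is placed nowhere in
    the pd.

```c
#include <stddef.h>

struct cpu_sample { unsigned long util, util_est; };
struct pd_util { unsigned long sum_util, max_util; };

static unsigned long max2(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

struct pd_util pd_util_with_task(const struct cpu_sample *cpus, size_t nr_cpus,
				 int dst_cpu, unsigned long task_util,
				 unsigned long task_util_est)
{
	struct pd_util out = { 0, 0 };

	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned long energy_util = max2(cpus[cpu].util, cpus[cpu].util_est);
		unsigned long freq_util = energy_util;

		if ((int)cpu == dst_cpu) {
			/* Energy: always charge max(task_util, _task_util_est). */
			energy_util += max2(task_util, task_util_est);
			/* Frequency: keep the cpu_util_next()-style estimate. */
			freq_util = max2(cpus[cpu].util + task_util,
					 cpus[cpu].util_est + task_util_est);
		}

		out.sum_util += energy_util;                  /* feeds the energy sum */
		out.max_util = max2(out.max_util, freq_util); /* drives OPP selection */
	}

	return out;
}
```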
    
    Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Quentin Perret <qperret@google.com>
    Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com