• Chen Yu's avatar
    cpufreq: intel_pstate: Handle no_turbo in frequency invariance · addca285
    Chen Yu authored
    Problem statement:
    
    Once the user has disabled turbo frequency by
    
    # echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    
    the cfs_rq's util_avg becomes quite small when compared with
    CPU capacity.
    
    Step to reproduce:
    
    # echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    
    # ./x86_cpuload --count 1 --start 3 --timeout 100 --busy 99
    
    would launch 1 thread and bind it to CPU3, lasting for 100 seconds,
    with a CPU utilization of 99%. [1]
    
    top result:
    %Cpu3  : 98.4 us,  0.0 sy,  0.0 ni,  1.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    
    check util_avg:
    cat /sys/kernel/debug/sched/debug | grep "cfs_rq\[3\]" -A 20 | grep util_avg
      .util_avg                      : 611
    
    So the util_avg/cpu capacity is 611/1024, which is much smaller than
    98.4% shown in the top result.
    
    This might impact some logic in the scheduler. For example,
    group_is_overloaded() would compare the group_capacity and group_util
    in the sched group, to check if this sched group is overloaded or not.
    With this gap, even when there is a nearly 100% workload, the sched
    group will not be regarded as overloaded. Besides group_is_overloaded(),
    there are also other victims. There is a ongoing work that aims to
    optimize the task wakeup in a LLC domain. The main idea is to stop
    searching idle CPUs if the sched domain is overloaded[2]. This proposal
    also relies on the util_avg/CPU capacity to decide whether the LLC
    domain is overloaded.
    
    Analysis:
    
    CPU frequency invariance has caused this difference. In summary,
    the util_sum of cfs rq would decay quite fast when the CPU is in
    idle, when the CPU frequency invariance is enabled.
    
    The detail is as followed:
    
    As depicted in update_rq_clock_pelt(), when the frequency invariance
    is enabled, there would be two clock variables on each rq, clock_task
    and clock_pelt:
    
       The clock_pelt scales the time to reflect the effective amount of
       computation done during the running delta time but then syncs back to
       clock_task when rq is idle.
    
       absolute time    | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
       @ max frequency  ------******---------------******---------------
       @ half frequency ------************---------************---------
       clock pelt       | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16
    
    The fast decay of util_sum during idle is due to:
    
     1. rq->clock_pelt is always behind rq->clock_task
     2. rq->last_update is updated to rq->clock_pelt' after invoking
        ___update_load_sum()
     3. Then the CPU becomes idle, the rq->clock_pelt' would be suddenly
        increased a lot to rq->clock_task
     4. Enters ___update_load_sum() again, the idle period is calculated by
        rq->clock_task - rq->last_update, AKA, rq->clock_task - rq->clock_pelt'.
        The lower the CPU frequency is, the larger the delta =
        rq->clock_task - rq->clock_pelt' will be. Since the idle period will be
        used to decay the util_sum only, the util_sum drops significantly during
        idle period.
    
    Proposal:
    
    This symptom is not only caused by disabling turbo frequency, but it
    would also appear if the user limits the max frequency at runtime.
    
    Because, if the frequency is always lower than the max frequency,
    CPU frequency invariance would decay the util_sum quite fast during
    idle.
    
    As some end users would disable turbo after boot up, this patch aims to
    present this symptom and deals with turbo scenarios for now.
    
    It might be ideal if CPU frequency invariance is aware of the max CPU
    frequency (user specified) at runtime in the future.
    
    Link: https://github.com/yu-chen-surf/x86_cpuload.git #1
    Link: https://lore.kernel.org/lkml/20220310005228.11737-1-yu.c.chen@intel.com/ #2
    Signed-off-by: default avatarChen Yu <yu.c.chen@intel.com>
    Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: default avatarGiovanni Gherdovich <ggherdovich@suse.cz>
    Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
    addca285
intel_pstate.c 90.4 KB