    sched/core: Offload wakee task activation if the wakee is descheduling · 2ebb1771
    Mel Gorman authored
    The previous commit:
    
      c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    
    avoids spinning on p->on_cpu when the task is descheduling, but only if the
    wakee is on a CPU that does not share cache with the waker.
    
    This patch offloads the activation of the wakee to the CPU that is about to
    go idle if the task is the only one on the runqueue. This potentially allows
    the waker task to continue making progress when the wakeup is not strictly
    synchronous.
    
    This is very obvious with netperf UDP_STREAM running on localhost. The
    waker is sending packets as quickly as possible without waiting for any
    reply. It frequently wakes the server for the processing of packets and
    when netserver is using local memory, it quickly completes the processing
    and goes back to idle. The waker often observes that netserver is still
    on_cpu (descheduling) and spins excessively, leading to a drop in throughput.
    
    This compares three kernels: 5.7-rc6 (vanilla), 5.7-rc6 with "sched:
    Optimize ttwu() spinning on p->on_cpu" applied (optttwu-v1r1), and
    5.7-rc6 with this patch applied on top (localwakelist-v1r2).
    
                                      5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                        vanilla           optttwu-v1r1     localwakelist-v1r2
    Hmean     send-64         251.49 (   0.00%)      258.05 *   2.61%*      305.59 *  21.51%*
    Hmean     send-128        497.86 (   0.00%)      519.89 *   4.43%*      600.25 *  20.57%*
    Hmean     send-256        944.90 (   0.00%)      997.45 *   5.56%*     1140.19 *  20.67%*
    Hmean     send-1024      3779.03 (   0.00%)     3859.18 *   2.12%*     4518.19 *  19.56%*
    Hmean     send-2048      7030.81 (   0.00%)     7315.99 *   4.06%*     8683.01 *  23.50%*
    Hmean     send-3312     10847.44 (   0.00%)    11149.43 *   2.78%*    12896.71 *  18.89%*
    Hmean     send-4096     13436.19 (   0.00%)    13614.09 (   1.32%)    15041.09 *  11.94%*
    Hmean     send-8192     22624.49 (   0.00%)    23265.32 *   2.83%*    24534.96 *   8.44%*
    Hmean     send-16384    34441.87 (   0.00%)    36457.15 *   5.85%*    35986.21 *   4.48%*
    
    Note that this benefit is not universal to all wakeups; it applies only
    to the case where the waker often spins on p->on_cpu.
    
    The impact can be seen from a "perf sched latency" report generated from
    a single iteration of one packet size:
    
       -----------------------------------------------------------------------------------------------------------------
        Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
       -----------------------------------------------------------------------------------------------------------------
    
      vanilla
        netperf:4337          |  21709.193 ms |     2932 | avg:    0.002 ms | max:    0.041 ms | max at:    112.154512 s
        netserver:4338        |  14629.459 ms |  5146990 | avg:    0.001 ms | max: 1615.864 ms | max at:    140.134496 s
    
      localwakelist-v1r2
        netperf:4339          |  29789.717 ms |     2460 | avg:    0.002 ms | max:    0.059 ms | max at:    138.205389 s
        netserver:4340        |  18858.767 ms |  7279005 | avg:    0.001 ms | max:    0.362 ms | max at:    135.709683 s
       -----------------------------------------------------------------------------------------------------------------
    
    Note that the average wakeup delay is quite small both on the vanilla
    kernel and with the two patches applied. However, the vanilla kernel
    shows significant outliers, with a maximum measured delay of 1615
    milliseconds; with both patches applied, the maximum delay is never
    worse than 0.362 ms, despite a much higher rate of context switching.
    
    Similarly, a separate profile of cycles showed that 2.83% of all cycles
    were spent in try_to_wake_up(), with almost half of those cycles spent
    spinning on p->on_cpu. With the two patches, the percentage of cycles
    spent in try_to_wake_up() drops to 1.13%.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Jirka Hladky <jhladky@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: valentin.schneider@arm.com
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Rik van Riel <riel@surriel.com>
    Link: https://lore.kernel.org/r/20200524202956.27665-3-mgorman@techsingularity.net