    x86, sched: Add support for frequency invariance · 1567c3e3
    Giovanni Gherdovich authored
    Implement arch_scale_freq_capacity() for 'modern' x86. This function
    is used by the scheduler to correctly account usage in the face of
    DVFS.
    
    The present patch addresses Intel processors specifically and has positive
    performance and performance-per-watt implications for the schedutil cpufreq
    governor, bringing it closer to, if not on-par with, the powersave governor
    from the intel_pstate driver/framework.
    
    Large performance gains are obtained when the machine is lightly loaded and
    no regressions are observed at saturation. The benchmarks with the largest
    gains are kernel compilation, tbench (the networking version of dbench) and
    shell-intensive workloads.
    
    1. FREQUENCY INVARIANCE: MOTIVATION
       * Without it, a task looks larger if the CPU runs slower
    
    2. PECULIARITIES OF X86
       * freq invariance accounting requires knowing the ratio freq_curr/freq_max
       2.1 CURRENT FREQUENCY
           * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
       2.2 MAX FREQUENCY
           * It varies with time (turbo). As an approximation, we set it to a
             constant, i.e. 4-cores turbo frequency.
    
    3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
       * The invariant schedutil's formula has no feedback loop and reacts faster
         to utilization changes
    
    4. KNOWN LIMITATIONS
       * In some cases tasks can't reach max util despite how hard they try
    
    5. PERFORMANCE TESTING
       5.1 MACHINES
           * Skylake, Broadwell, Haswell
       5.2 SETUP
           * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
             active cores turbo w/ invariant schedutil, and intel_pstate/powersave
       5.3 BENCHMARK RESULTS
           5.3.1 NEUTRAL BENCHMARKS
                 * NAS Parallel Benchmark (HPC), hackbench
           5.3.2 NON-NEUTRAL BENCHMARKS
                 * tbench (10-30% better), kernbench (10-15% better),
                   shell-intensive-scripts (30-50% better)
                 * no regressions
           5.3.3 SELECTION OF DETAILED RESULTS
           5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
                 * dbench (5% worse on one machine), kernbench (3% worse),
                   tbench (5-10% better), shell-intensive-scripts (10-40% better)
    
    6. MICROARCH'ES ADDRESSED HERE
       * Xeon Core before the Scalable Performance processors line (Xeon
         Gold/Platinum etc. have different MSR semantics for querying turbo levels)
    
    7. REFERENCES
       * MMTests performance testing framework, github.com/gormanm/mmtests
    
     +-------------------------------------------------------------------------+
     | 1. FREQUENCY INVARIANCE: MOTIVATION
     +-------------------------------------------------------------------------+
    
    For example, suppose a CPU has two frequencies: 500 and 1000 MHz. When
    running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
    appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
    false impression this CPU is almost at capacity, even though it can go
    faster [*]. In a nutshell, without frequency scale-invariance tasks look
    larger just because the CPU is running slower.
    
    [*] (footnote: this assumes a linear frequency/performance relation, which
    everybody knows to be false, but given the realities it's the best
    approximation we can make.)
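
    The example above can be reproduced with a few lines of arithmetic. This is
    only a sketch under the linear frequency/performance assumption from the
    footnote; the helper names are made up for illustration and are not part of
    the patch.

```python
def apparent_util(util_at_max, freq_curr, freq_max):
    """Busy fraction seen at freq_curr for work that needs util_at_max at freq_max.

    Assumes the linear frequency/performance relation from the footnote.
    """
    return min(1.0, util_at_max * freq_max / freq_curr)


def invariant_util(apparent, freq_curr, freq_max):
    """Scale the observed busy fraction by freq_curr / freq_max."""
    return apparent * freq_curr / freq_max


# A task that needs 1/3 of a CPU at 1000 MHz, observed while running at 500 MHz.
seen = apparent_util(1 / 3, freq_curr=500, freq_max=1000)
print(f"apparent utilization at 500 MHz: {seen:.3f}")
print(f"frequency-invariant utilization: {invariant_util(seen, 500, 1000):.3f}")
```

    Scaling by freq_curr / freq_max brings the observed 2/3 back to the 1/3 the
    task would consume at full speed.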
    
     +-------------------------------------------------------------------------+
     | 2. PECULIARITIES OF X86
     +-------------------------------------------------------------------------+
    
    Accounting for frequency changes in PELT signals requires the computation of
    the ratio freq_curr / freq_max. On x86 neither of those terms is readily
    available.
    
    2.1 CURRENT FREQUENCY
    =====================
    
    Since modern x86 has hardware control over the actual frequency we run
    at (because of, amongst other things, Turbo Mode), we cannot simply use
    the frequency as requested through cpufreq.
    
    Instead we use the APERF/MPERF MSRs to compute the effective frequency
    over the recent past. Also, because reading MSRs is expensive, we don't
    do so every time we need the value, but amortize the cost by doing it
    every tick.
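
    As a rough illustration of the arithmetic (not the kernel code itself), the
    sketch below derives BusyMHz from a pair of APERF/MPERF deltas and turns it
    into a 1024-based scale factor. The function names and the sample counter
    deltas are hypothetical; the sketch assumes delta_mperf is non-zero and
    caps the result at 1024.

```python
SCHED_CAPACITY_SCALE = 1024  # fixed-point unit; full capacity == 1024


def busy_mhz(delta_aperf, delta_mperf, freq_base_mhz):
    """Effective frequency over the sampling window: dAPERF / dMPERF * freq_base."""
    return delta_aperf / delta_mperf * freq_base_mhz


def freq_scale(delta_aperf, delta_mperf, freq_base_mhz, freq_max_mhz):
    """freq_curr / freq_max in 1024-based fixed point, capped at 1024.

    Integer arithmetic only; assumes delta_mperf > 0.
    """
    scale = (delta_aperf * freq_base_mhz * SCHED_CAPACITY_SCALE
             // (delta_mperf * freq_max_mhz))
    return min(scale, SCHED_CAPACITY_SCALE)


# Hypothetical deltas sampled over one tick on a CPU with a 3500 MHz base
# frequency and freq_max set to 3700 MHz.
print(busy_mhz(1_750_000, 3_500_000, 3500))          # effective MHz
print(freq_scale(1_750_000, 3_500_000, 3500, 3700))  # scale factor
print(freq_scale(3_900_000, 3_500_000, 3500, 3700))  # above freq_max: capped
```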
    
    2.2 MAX FREQUENCY
    =================
    
    Obtaining freq_max is also non-trivial because at any time the hardware can
    provide a frequency boost to a selected subset of cores if the package has
    enough power to spare (e.g. Turbo Boost). This means that the maximum frequency
    available to a given core changes with time.
    
    The approach taken in this change is to arbitrarily set freq_max to a constant
    value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
    microarchitectures, after evaluating the following candidates:
    
        * 1-core (1C) turbo frequency (the fastest turbo state available)
        * around base frequency (a.k.a. max P-state)
        * something in between, such as 4C turbo
    
    To interpret these options, consider that this is the denominator in
    freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
    util_avg and load_avg. A large denominator will undershoot (util_avg looks a
    bit smaller than it really is); vice versa, with a smaller denominator PELT
    signals will tend to overshoot. Given that PELT drives frequency selection
    in the schedutil governor, we will have:
    
        freq_max set to     | effect on DVFS
        --------------------+------------------
        1C turbo            | power efficiency (lower freq choices)
        base freq           | performance (higher util_avg, higher freq requests)
        4C turbo            | a bit of both
    
    4C turbo proves to be a good compromise in a number of benchmarks (see below).
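
    To make the undershoot/overshoot argument concrete, the sketch below scales
    the same raw utilization with the three candidate denominators, using the
    8x-SKYLAKE-UMA frequencies from section 5.1. The raw utilization of 512 and
    the current frequency of 3700 MHz are arbitrary example inputs, and the
    helper is illustrative only.

```python
SCHED_CAPACITY_SCALE = 1024

# Candidate freq_max values in MHz, taken from the 8x-SKYLAKE-UMA table.
FREQ_MAX_CANDIDATES = {"1C turbo": 3900, "4C turbo": 3700, "base freq": 3500}


def scaled_util(raw_util, freq_curr, freq_max):
    """raw_util * freq_curr / freq_max, in integer arithmetic, capped at 1024."""
    return min(SCHED_CAPACITY_SCALE, raw_util * freq_curr // freq_max)


# Same raw utilization (512) and same current frequency (3700 MHz) each time;
# only the denominator changes.
for name, freq_max in FREQ_MAX_CANDIDATES.items():
    print(f"{name:10s} -> util_avg {scaled_util(512, 3700, freq_max)}")
```

    The larger denominator yields the smallest util_avg and the smaller one the
    largest, which is the trade-off summarized in the table above.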
    
     +-------------------------------------------------------------------------+
     | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
     +-------------------------------------------------------------------------+
    
    Once an architecture implements a frequency scale-invariant utilization (the
    PELT signal util_avg), schedutil switches its frequency selection formula from
    
        freq_next = 1.25 * freq_curr * util            [non-invariant util signal]
    
    to
    
        freq_next = 1.25 * freq_max * util             [invariant util signal]
    
    where, in the second formula, freq_max is set to the 1C turbo frequency (max
    turbo). The advantage of the second formula, whose usage we unlock with this
    patch, is that freq_next doesn't depend on the current frequency in an
    iterative fashion, but can jump to any frequency in a single update. This
    absence of feedback in the formula makes it quicker to react to utilization
    changes and more robust against pathological instabilities.
    
    Compare it to the update formula of intel_pstate/powersave:
    
        freq_next = 1.25 * freq_max * Busy%
    
    where again freq_max is 1C turbo and Busy% is the percentage of time not spent
    idling (calculated with delta_MPERF / delta_TSC); essentially the same as
    invariant schedutil, and largely responsible for intel_pstate/powersave's good
    reputation. The non-invariant schedutil formula is derived from the invariant
    one by approximating util_inv with util_raw * freq_curr / freq_max, but this
    has limitations.
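
    The relation between the two schedutil formulas can be checked numerically.
    In this sketch utilization is a fraction in [0, 1] rather than a 1024-based
    value, and the function names and sample numbers are illustrative
    assumptions.

```python
import math


def next_freq_non_invariant(freq_curr, util_raw):
    """freq_next = 1.25 * freq_curr * util, for a non-invariant util signal."""
    return 1.25 * freq_curr * util_raw


def next_freq_invariant(freq_max, util_inv):
    """freq_next = 1.25 * freq_max * util, for an invariant util signal."""
    return 1.25 * freq_max * util_inv


def approx_util_inv(util_raw, freq_curr, freq_max):
    """The approximation util_inv ~= util_raw * freq_curr / freq_max."""
    return util_raw * freq_curr / freq_max


freq_max, freq_curr, util_raw = 3900, 1950, 0.6
util_inv = approx_util_inv(util_raw, freq_curr, freq_max)

# Under the approximation both formulas request the same frequency.
a = next_freq_non_invariant(freq_curr, util_raw)
b = next_freq_invariant(freq_max, util_inv)
print(a, b, math.isclose(a, b))

# The invariant formula has no freq_curr term: for a given util_inv the
# request is the same whatever frequency the CPU is currently running at.
print(next_freq_invariant(freq_max, 0.8))
```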
    
    Testing shows improved performance due to better frequency selections when
    the machine is lightly loaded, and essentially no change in behaviour at
    saturation / overutilization.
    
     +-------------------------------------------------------------------------+
     | 4. KNOWN LIMITATIONS
     +-------------------------------------------------------------------------+
    
    It's been shown that it is possible to create pathological scenarios where a
    CPU-bound task cannot reach max utilization, if the normalizing factor
    freq_max is fixed to a constant value (see [Lelli-2018]).
    
    If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
    cores in a package doing some busywork, and observe that none of those tasks
    will ever reach max util (1024) because they're all running at less than the
    4C turbo frequency.
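
    A back-of-the-envelope version of this scenario, using the
    80x-BROADWELL-NUMA turbo levels from section 5.1, is sketched below. It
    assumes that with N busy cores each core runs exactly at the N-core turbo
    frequency, which is a simplification; the helper name is hypothetical.

```python
SCHED_CAPACITY_SCALE = 1024
FREQ_MAX_MHZ = 3300  # 4C turbo on 80x-BROADWELL-NUMA


def max_reachable_util(freq_actual_mhz, freq_max_mhz=FREQ_MAX_MHZ):
    """Utilization ceiling of a fully busy CPU running at freq_actual_mhz."""
    return min(SCHED_CAPACITY_SCALE,
               SCHED_CAPACITY_SCALE * freq_actual_mhz // freq_max_mhz)


# N-core turbo frequencies (MHz) for 4, 5 and 8 busy cores.
for cores, freq in [(4, 3300), (5, 3200), (8, 2900)]:
    print(f"{cores} busy cores at {freq} MHz -> max util {max_reachable_util(freq)}")
```

    With five or more busy cores the ceiling stays below 1024, which is the
    limitation described above.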
    
    While this concern still applies, we believe the performance benefit of
    frequency scale-invariant PELT signals outweighs the cost of this limitation.
    
     [Lelli-2018]
     https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/
    
     +-------------------------------------------------------------------------+
     | 5. PERFORMANCE TESTING
     +-------------------------------------------------------------------------+
    
    5.1 MACHINES
    ============
    
    We tested the patch on three machines, with Skylake, Broadwell and Haswell
    CPUs. The details are below, together with the available turbo ratios as
    reported by the appropriate MSRs.
    
    * 8x-SKYLAKE-UMA:
      Single socket E3-1240 v5, Skylake 4 cores/8 threads
      Max EFFiciency, BASE frequency and available turbo levels (MHz):
    
        EFFIC    800 |********
        BASE    3500 |***********************************
        4C      3700 |*************************************
        3C      3800 |**************************************
        2C      3900 |***************************************
        1C      3900 |***************************************
    
    * 80x-BROADWELL-NUMA:
      Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
      Max EFFiciency, BASE frequency and available turbo levels (MHz):
    
        EFFIC   1200 |************
        BASE    2200 |**********************
        8C      2900 |*****************************
        7C      3000 |******************************
        6C      3100 |*******************************
        5C      3200 |********************************
        4C      3300 |*********************************
        3C      3400 |**********************************
        2C      3600 |************************************
        1C      3600 |************************************
    
    * 48x-HASWELL-NUMA
      Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
      Max EFFiciency, BASE frequency and available turbo levels (MHz):
    
        EFFIC   1200 |************
        BASE    2300 |***********************
        12C     2600 |**************************
        11C     2600 |**************************
        10C     2600 |**************************
        9C      2600 |**************************
        8C      2600 |**************************
        7C      2600 |**************************
        6C      2600 |**************************
        5C      2700 |***************************
        4C      2800 |****************************
        3C      2900 |*****************************
        2C      3100 |*******************************
        1C      3100 |*******************************
    
    5.2 SETUP
    =========
    
    * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
      driver in passive mode.
    * The rationale for choosing the various freq_max values to test has been to
      try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
      on all machines), plus one more value closer to base_freq but still in the
      turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
    * In addition we've run all tests with intel_pstate/powersave for comparison.
    * The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
    * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
      with active intel_pstate on this machine use that.
    
    This gives, in terms of combinations tested on each machine:
    
    * 8x-SKYLAKE-UMA
      * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
      * intel_pstate active + powersave + HWP
      * invariant schedutil, freq_max = 1C turbo
      * invariant schedutil, freq_max = 3C turbo
      * invariant schedutil, freq_max = 4C turbo
    
    * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
      * [same as 8x-SKYLAKE-UMA, but not HWP capable]
      * invariant schedutil, freq_max = 8C turbo
        (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")
    
    5.3 BENCHMARK RESULTS
    =====================
    
    5.3.1 NEUTRAL BENCHMARKS
    ------------------------
    
    Tests that didn't show any measurable difference in performance on any of the
    test machines between non-invariant schedutil and our patch are:
    
    * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
      computational kernel
    * flexible I/O (FIO)
    * hackbench (using threads or processes, and using pipes or sockets)
    
    5.3.2 NON-NEUTRAL BENCHMARKS
    ----------------------------
    
    What follow are summary tables where each benchmark result is given a score.
    
    * A tilde (~) means a neutral result, i.e. no difference from baseline.
    * Scores are computed with the ratio result_new / result_baseline, so a tilde
      means a score of 1.00.
    * The results in the score ratio are the geometric means of results running
      the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
      ... number of processes; for pgbench: varying the number of clients, and so
      on).
    * The first three tables show higher-is-better kind of tests (i.e. measured in
      operations/second), the subsequent three show lower-is-better kind of tests
      (i.e. the workload is fixed and we measure elapsed time, think kernbench).
    * "gitsource" is a name we made up for the test consisting in running the
      entire unit tests suite of the Git SCM and measuring how long it takes. We
      take it as a typical example of shell-intensive serialized workload.
    * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
      columns show invariant schedutil for different values of freq_max. 4C turbo
      is circled as it's the value we've chosen for the final implementation.
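
    As an illustration of how such a score can be computed, the sketch below
    takes the per-client tbench4 figures for 48x-HASWELL-NUMA (baseline and
    4C-turbo columns of the detailed table in section 5.3.3) and forms the
    ratio of their geometric means. The function names are made up for the
    example, and the exact MMTests procedure may differ in details.

```python
import math


def geometric_mean(values):
    """exp of the arithmetic mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def score(new, baseline):
    """Ratio result_new / result_baseline of the two geometric means."""
    return geometric_mean(new) / geometric_mean(baseline)


# tbench4 on 48x-HASWELL-NUMA, MB/sec for 1, 2, 4, ..., 192 clients.
baseline = [126.73, 258.04, 514.30, 1111.38, 2286.47,
            4704.84, 7578.04, 6998.52, 6901.35]
turbo_4c = [130.64, 281.66, 623.88, 1714.16, 3155.02,
            4804.38, 7499.72, 6942.54, 6856.93]

print(f"tbench4 score, 4C turbo vs baseline: {score(turbo_4c, baseline):.2f}")
```

    The result comes out at roughly 1.13, in line with the tbench4 entry for
    this machine in the summary tables.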
    
    80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                             +------+
                     I_PSTATE   1C     3C    | 4C   |  8C
    pgbench-ro           1.14   ~      ~     | 1.11 |  1.14
    pgbench-rw           ~      ~      ~     | ~    |  ~
    netperf-udp          1.06   ~      1.06  | 1.05 |  1.07
    netperf-tcp          ~      1.03   ~     | 1.01 |  1.02
    tbench4              1.57   1.18   1.22  | 1.30 |  1.56
                                             +------+
    
    8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                             +------+
                 I_PSTATE/HWP   1C     3C    | 4C   |
    pgbench-ro           ~      ~      ~     | ~    |
    pgbench-rw           ~      ~      ~     | ~    |
    netperf-udp          ~      ~      ~     | ~    |
    netperf-tcp          ~      ~      ~     | ~    |
    tbench4              1.30   1.14   1.14  | 1.16 |
                                             +------+
    
    48x-HASWELL-NUMA (comparison ratio; higher is better)
                                             +------+
                     I_PSTATE   1C     3C    | 4C   |  12C
    pgbench-ro           1.15   ~      ~     | 1.06 |  1.16
    pgbench-rw           ~      ~      ~     | ~    |  ~
    netperf-udp          1.05   0.97   1.04  | 1.04 |  1.02
    netperf-tcp          0.96   1.01   1.01  | 1.01 |  1.01
    tbench4              1.50   1.05   1.13  | 1.13 |  1.25
                                             +------+
    
    In the tables above we see that active intel_pstate is slightly better than our
    4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
    read-only pgbench and much better on tbench. Both cases are notable in that
    they show that lowering our freq_max (to 8C-turbo and 12C-turbo on
    80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
    schedutil to get closer.
    
    If we ignore active intel_pstate and focus on the comparison with baseline
    alone, there are several instances of double-digit performance improvement.
    
    80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                             +------+
                     I_PSTATE   1C     3C    | 4C   |  8C
    dbench4              1.23   0.95   0.95  | 0.95 |  0.95
    kernbench            0.93   0.83   0.83  | 0.83 |  0.82
    gitsource            0.98   0.49   0.49  | 0.49 |  0.48
                                             +------+
    
    8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                             +------+
                 I_PSTATE/HWP   1C     3C    | 4C   |
    dbench4              ~      ~      ~     | ~    |
    kernbench            ~      ~      ~     | ~    |
    gitsource            0.92   0.55   0.55  | 0.55 |
                                             +------+
    
    48x-HASWELL-NUMA (comparison ratio; lower is better)
                                             +------+
                     I_PSTATE   1C     3C    | 4C   |  8C
    dbench4              ~      ~      ~     | ~    |  ~
    kernbench            0.94   0.90   0.89  | 0.90 |  0.90
    gitsource            0.97   0.69   0.69  | 0.69 |  0.69
                                             +------+
    
    dbench is not very remarkable here, unless we notice how poorly active
    intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
    non-invariant schedutil. We repeated that run getting consistent results. Out
    of scope for the patch at hand, but deserving future investigation. Other than
    that, we previously ran this campaign with Linux v5.0 and saw the patch doing
    better on dbench at the time. We haven't checked closely and can only speculate
    at this point.
    
    On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
    the detailed tables that the gains concentrate on low process counts (lightly
    loaded machines).
    
    The test we call "gitsource" (running the git unit test suite, a long-running
    single-threaded shell script) appears rather spectacular in this table (gains
    of 30-50% depending on the machine). It is to be noted, however, that
    gitsource has no adjustable parameters (such as the number of jobs in
    kernbench, which we average over in order to get a single-number summary
    score) and is exactly the kind of low-parallelism workload that benefits the
    most from this patch. When looking at the detailed tables of kernbench or
    tbench4, at low process or client counts one can see similar numbers.
    
    5.3.3 SELECTION OF DETAILED RESULTS
    -----------------------------------
    
    Machine            : 48x-HASWELL-NUMA
    Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
    Varying parameter  : number of clients
    Unit               : MB/sec (higher is better)
    
                       5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Hmean  1        126.73  +- 0.31% (        )      315.91  +- 0.66% ( 149.28%)      125.03  +- 0.76% (  -1.34%)
    Hmean  2        258.04  +- 0.62% (        )      614.16  +- 0.51% ( 138.01%)      269.58  +- 1.45% (   4.47%)
    Hmean  4        514.30  +- 0.67% (        )     1146.58  +- 0.54% ( 122.94%)      533.84  +- 1.99% (   3.80%)
    Hmean  8       1111.38  +- 2.52% (        )     2159.78  +- 0.38% (  94.33%)     1359.92  +- 1.56% (  22.36%)
    Hmean  16      2286.47  +- 1.36% (        )     3338.29  +- 0.21% (  46.00%)     2720.20  +- 0.52% (  18.97%)
    Hmean  32      4704.84  +- 0.35% (        )     4759.03  +- 0.43% (   1.15%)     4774.48  +- 0.30% (   1.48%)
    Hmean  64      7578.04  +- 0.27% (        )     7533.70  +- 0.43% (  -0.59%)     7462.17  +- 0.65% (  -1.53%)
    Hmean  128     6998.52  +- 0.16% (        )     6987.59  +- 0.12% (  -0.16%)     6909.17  +- 0.14% (  -1.28%)
    Hmean  192     6901.35  +- 0.25% (        )     6913.16  +- 0.10% (   0.17%)     6855.47  +- 0.21% (  -0.66%)
    
                                 5.2.0 3C-turbo                   5.2.0 4C-turbo                  5.2.0 12C-turbo
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Hmean  1        128.43  +- 0.28% (   1.34%)      130.64  +- 3.81% (   3.09%)      153.71  +- 5.89% (  21.30%)
    Hmean  2        311.70  +- 6.15% (  20.79%)      281.66  +- 3.40% (   9.15%)      305.08  +- 5.70% (  18.23%)
    Hmean  4        641.98  +- 2.32% (  24.83%)      623.88  +- 5.28% (  21.31%)      906.84  +- 4.65% (  76.32%)
    Hmean  8       1633.31  +- 1.56% (  46.96%)     1714.16  +- 0.93% (  54.24%)     2095.74  +- 0.47% (  88.57%)
    Hmean  16      3047.24  +- 0.42% (  33.27%)     3155.02  +- 0.30% (  37.99%)     3634.58  +- 0.15% (  58.96%)
    Hmean  32      4734.31  +- 0.60% (   0.63%)     4804.38  +- 0.23% (   2.12%)     4674.62  +- 0.27% (  -0.64%)
    Hmean  64      7699.74  +- 0.35% (   1.61%)     7499.72  +- 0.34% (  -1.03%)     7659.03  +- 0.25% (   1.07%)
    Hmean  128     6935.18  +- 0.15% (  -0.91%)     6942.54  +- 0.10% (  -0.80%)     7004.85  +- 0.12% (   0.09%)
    Hmean  192     6901.62  +- 0.12% (   0.00%)     6856.93  +- 0.10% (  -0.64%)     6978.74  +- 0.10% (   1.12%)
    
    This is one of the cases where the patch still can't surpass active
    intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
    visible up to 16 clients and the saturated scenario is the same as baseline.
    
    The scores in the summary table from the previous sections are ratios of
    geometric means of the results over different clients, as seen in this table.
    
    Machine            : 80x-BROADWELL-NUMA
    Benchmark          : kernbench (kernel compilation)
    Varying parameter  : number of jobs
    Unit               : seconds (lower is better)
    
                       5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Amean  2        379.68  +- 0.06% (        )      330.20  +- 0.43% (  13.03%)      285.93  +- 0.07% (  24.69%)
    Amean  4        200.15  +- 0.24% (        )      175.89  +- 0.22% (  12.12%)      153.78  +- 0.25% (  23.17%)
    Amean  8        106.20  +- 0.31% (        )       95.54  +- 0.23% (  10.03%)       86.74  +- 0.10% (  18.32%)
    Amean  16        56.96  +- 1.31% (        )       53.25  +- 1.22% (   6.50%)       48.34  +- 1.73% (  15.13%)
    Amean  32        34.80  +- 2.46% (        )       33.81  +- 0.77% (   2.83%)       30.28  +- 1.59% (  12.99%)
    Amean  64        26.11  +- 1.63% (        )       25.04  +- 1.07% (   4.10%)       22.41  +- 2.37% (  14.16%)
    Amean  128       24.80  +- 1.36% (        )       23.57  +- 1.23% (   4.93%)       21.44  +- 1.37% (  13.55%)
    Amean  160       24.85  +- 0.56% (        )       23.85  +- 1.17% (   4.06%)       21.25  +- 1.12% (  14.49%)
    
                                 5.2.0 3C-turbo                   5.2.0 4C-turbo                   5.2.0 8C-turbo
    - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Amean  2        284.08  +- 0.13% (  25.18%)      283.96  +- 0.51% (  25.21%)      285.05  +- 0.21% (  24.92%)
    Amean  4        153.18  +- 0.22% (  23.47%)      154.70  +- 1.64% (  22.71%)      153.64  +- 0.30% (  23.24%)
    Amean  8         87.06  +- 0.28% (  18.02%)       86.77  +- 0.46% (  18.29%)       86.78  +- 0.22% (  18.28%)
    Amean  16        48.03  +- 0.93% (  15.68%)       47.75  +- 1.99% (  16.17%)       47.52  +- 1.61% (  16.57%)
    Amean  32        30.23  +- 1.20% (  13.14%)       30.08  +- 1.67% (  13.57%)       30.07  +- 1.67% (  13.60%)
    Amean  64        22.59  +- 2.02% (  13.50%)       22.63  +- 0.81% (  13.32%)       22.42  +- 0.76% (  14.12%)
    Amean  128       21.37  +- 0.67% (  13.82%)       21.31  +- 1.15% (  14.07%)       21.17  +- 1.93% (  14.63%)
    Amean  160       21.68  +- 0.57% (  12.76%)       21.18  +- 1.74% (  14.77%)       21.22  +- 1.00% (  14.61%)
    
    The patch outperforms active intel_pstate (and baseline) by a considerable
    margin; the summary table from the previous section says 4C turbo and active
    intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
    0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
    noticeable difference with regard to the value of freq_max.
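
    The percentages in parentheses and the 0.83/0.93 comparison follow from
    simple ratios. The sketch below reproduces a few of them; the sign
    convention (positive means better than baseline) is inferred from the
    figures in the tables rather than stated by the tooling, and the helper
    name is hypothetical.

```python
def gain_percent(baseline, new, higher_is_better):
    """Relative gain over baseline in percent; positive means better."""
    if higher_is_better:
        return (new - baseline) / baseline * 100
    return (baseline - new) / baseline * 100


# kernbench, 2 jobs, 80x-BROADWELL-NUMA: seconds, lower is better.
print(f"{gain_percent(379.68, 283.96, higher_is_better=False):.2f}%")

# tbench4, 1 client, 48x-HASWELL-NUMA: MB/sec, higher is better.
print(f"{gain_percent(126.73, 315.91, higher_is_better=True):.2f}%")

# Two scores against the same baseline can be compared by dividing them:
# 4C turbo (0.83) versus active intel_pstate (0.93) on kernbench.
print(f"{0.83 / 0.93:.2f}")
```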
    
    Machine            : 8x-SKYLAKE-UMA
    Benchmark          : gitsource (time to run the git unit test suite)
    Varying parameter  : none
    Unit               : seconds (lower is better)
    
                                5.2.0 vanilla           5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
    - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Amean         858.85  +- 1.16% (        )      791.94  +- 0.21% (   7.79%)      474.95 (  44.70%)
    
                               5.2.0 3C-turbo                   5.2.0 4C-turbo
    - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Amean         475.26  +- 0.20% (  44.66%)      474.34  +- 0.13% (  44.77%)
    
    In this test, which is of interest as representing shell-intensive
    (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
    intel_pstate/powersave by a whopping 40% margin.
    
    5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
    ---------------------------------------------
    
    The following table shows average power consumption in watts for each
    benchmark. Data comes from turbostat (package average), which in turn is read
    from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
    it's reasonable to ignore other power consumers (such as memory or I/O). Also,
    we don't have a power meter available in the lab so RAPL is the best we have.
    
    turbostat sampled average power every 10 seconds for the entire duration of
    each benchmark. We took all those values and averaged them (i.e. we don't
    have detail on a per-parameter granularity, only on whole benchmarks).
    
    80x-BROADWELL-NUMA (power consumption, watts)
                                                        +--------+
                   BASELINE I_PSTATE       1C       3C  |     4C |      8C
    pgbench-ro       130.01   142.77   131.11   132.45  | 134.65 |  136.84
    pgbench-rw        68.30    60.83    71.45    71.70  |  71.65 |   72.54
    dbench4           90.25    59.06   101.43    99.89  | 101.10 |  102.94
    netperf-udp       65.70    69.81    66.02    68.03  |  68.27 |   68.95
    netperf-tcp       88.08    87.96    88.97    88.89  |  88.85 |   88.20
    tbench4          142.32   176.73   153.02   163.91  | 165.58 |  176.07
    kernbench         92.94   101.95   114.91   115.47  | 115.52 |  115.10
    gitsource         40.92    41.87    75.14    75.20  |  75.40 |   75.70
                                                        +--------+
    8x-SKYLAKE-UMA (power consumption, watts)
                                                        +--------+
                  BASELINE I_PSTATE/HWP    1C       3C  |     4C |
    pgbench-ro        46.49    46.68    46.56    46.59  |  46.52 |
    pgbench-rw        29.34    31.38    30.98    31.00  |  31.00 |
    dbench4           27.28    27.37    27.49    27.41  |  27.38 |
    netperf-udp       22.33    22.41    22.36    22.35  |  22.36 |
    netperf-tcp       27.29    27.29    27.30    27.31  |  27.33 |
    tbench4           41.13    45.61    43.10    43.33  |  43.56 |
    kernbench         42.56    42.63    43.01    43.01  |  43.01 |
    gitsource         13.32    13.69    17.33    17.30  |  17.35 |
                                                        +--------+
    48x-HASWELL-NUMA (power consumption, watts)
                                                        +--------+
                   BASELINE I_PSTATE       1C       3C  |     4C |     12C
    pgbench-ro       128.84   136.04   129.87   132.43  | 132.30 |  134.86
    pgbench-rw        37.68    37.92    37.17    37.74  |  37.73 |   37.31
    dbench4           28.56    28.73    28.60    28.73  |  28.70 |   28.79
    netperf-udp       56.70    60.44    56.79    57.42  |  57.54 |   57.52
    netperf-tcp       75.49    75.27    75.87    76.02  |  76.01 |   75.95
    tbench4          115.44   139.51   119.53   123.07  | 123.97 |  130.22
    kernbench         83.23    91.55    95.58    95.69  |  95.72 |   96.04
    gitsource         36.79    36.99    39.99    40.34  |  40.35 |   40.23
                                                        +--------+
    
    A lower power consumption isn't necessarily better; it depends on what is done
    with that energy. Here are tables with the ratio of performance-per-watt on
    each machine and benchmark. Higher is always better; a tilde (~) means a
    neutral ratio (i.e. 1.00).
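
    One way to arrive at such a ratio, assuming it is the performance ratio
    divided by the power ratio (the formula is not spelled out here, so this is
    an assumption), is sketched below with numbers taken from the summary and
    power tables. Since the summary scores are already rounded, the last digit
    may differ from the tables in other cases.

```python
def perf_per_watt_ratio(score, power_new, power_baseline, higher_is_better):
    """Performance ratio divided by power ratio, both relative to baseline.

    For lower-is-better benchmarks (elapsed time) the performance ratio is
    taken as the inverse of the time score.
    """
    perf_ratio = score if higher_is_better else 1 / score
    return perf_ratio / (power_new / power_baseline)


# 80x-BROADWELL-NUMA, 4C turbo vs baseline.
print(f"tbench4:   {perf_per_watt_ratio(1.30, 165.58, 142.32, True):.2f}")
print(f"kernbench: {perf_per_watt_ratio(0.83, 115.52, 92.94, False):.2f}")

# 8x-SKYLAKE-UMA, 4C turbo vs baseline.
print(f"gitsource: {perf_per_watt_ratio(0.55, 17.35, 13.32, False):.2f}")
```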
    
    80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                         +------+
                 I_PSTATE     1C     3C  |   4C |    8C
    pgbench-ro       1.04   1.06   0.94  | 1.07 |  1.08
    pgbench-rw       1.10   0.97   0.96  | 0.96 |  0.97
    dbench4          1.24   0.94   0.95  | 0.94 |  0.92
    netperf-udp      ~      1.02   1.02  | ~    |  1.02
    netperf-tcp      ~      1.02   ~     | ~    |  1.02
    tbench4          1.26   1.10   1.06  | 1.12 |  1.26
    kernbench        0.98   0.97   0.97  | 0.97 |  0.98
    gitsource        ~      1.11   1.11  | 1.11 |  1.13
                                         +------+
    
    8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                         +------+
             I_PSTATE/HWP     1C     3C  |   4C |
    pgbench-ro       ~      ~      ~     | ~    |
    pgbench-rw       0.95   0.97   0.96  | 0.96 |
    dbench4          ~      ~      ~     | ~    |
    netperf-udp      ~      ~      ~     | ~    |
    netperf-tcp      ~      ~      ~     | ~    |
    tbench4          1.17   1.09   1.08  | 1.10 |
    kernbench        ~      ~      ~     | ~    |
    gitsource        1.06   1.40   1.40  | 1.40 |
                                         +------+
    
    48x-HASWELL-NUMA  (performance-per-watt ratios; higher is better)
                                         +------+
                 I_PSTATE     1C     3C  |   4C |   12C
    pgbench-ro       1.09   ~      1.09  | 1.03 |  1.11
    pgbench-rw       ~      0.86   ~     | ~    |  0.86
    dbench4          ~      1.02   1.02  | 1.02 |  ~
    netperf-udp      ~      0.97   1.03  | 1.02 |  ~
    netperf-tcp      0.96   ~      ~     | ~    |  ~
    tbench4          1.24   ~      1.06  | 1.05 |  1.11
    kernbench        0.97   0.97   0.98  | 0.97 |  0.96
    gitsource        1.03   1.33   1.32  | 1.32 |  1.33
                                         +------+
    
    These results are overall pleasing: in plenty of cases we observe
    performance-per-watt improvements. The few regressions (read/write pgbench and
    dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
    percentage points (it has a 10-15% performance improvement, but apparently the
    increase in power consumption is larger than that). tbench4 and gitsource, which
    benefit the most from the patch, keep a positive score in this table which is
    a welcome surprise; that suggests that in those particular workloads the
    non-invariant schedutil (and active intel_pstate, too) makes some rather
    suboptimal frequency selections.
    
    +-------------------------------------------------------------------------+
    | 6. MICROARCH'ES ADDRESSED HERE
    +-------------------------------------------------------------------------+
    
    The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
    MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
    respectively. This excludes the recent Xeon Scalable Performance processors
    line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.
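
    For orientation, the sketch below decodes a base ratio and a 4C turbo ratio
    from two register values. The layout used (bits 15:8 of MSR_PLATFORM_INFO
    for the base ratio, byte N-1 of MSR_TURBO_RATIO_LIMIT for the N-core turbo
    ratio, 100 MHz per ratio unit) is an assumption made for this example, and
    the register values are made up to match the 8x-SKYLAKE-UMA table; they
    were not read from real hardware.

```python
BUS_CLOCK_MHZ = 100          # assumed ratio-to-MHz multiplier
SCHED_CAPACITY_SCALE = 1024


def base_ratio(platform_info):
    """Base frequency ratio, assumed to sit in bits 15:8."""
    return (platform_info >> 8) & 0xFF


def turbo_ratio(turbo_ratio_limit, active_cores):
    """Turbo ratio for N active cores, assumed to sit in byte N-1."""
    return (turbo_ratio_limit >> (8 * (active_cores - 1))) & 0xFF


# Hypothetical register contents: base 35; turbo 39/39/38/37 for 1C..4C.
platform_info = 0x23 << 8
turbo_ratio_limit = 0x25262727

base = base_ratio(platform_info)
turbo_4c = turbo_ratio(turbo_ratio_limit, 4)
print(base * BUS_CLOCK_MHZ, turbo_4c * BUS_CLOCK_MHZ)

# freq_max / freq_base in 1024-based fixed point.
print(turbo_4c * SCHED_CAPACITY_SCALE // base)
```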
    
    Subsequent patches will address:
    
    * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
    * Xeon Phi (Knights Landing, Knights Mill)
    * Atom Silvermont
    
    +-------------------------------------------------------------------------+
    | 7. REFERENCES
    +-------------------------------------------------------------------------+
    
    Tests have been run with the help of the MMTests performance testing
    framework, see github.com/gormanm/mmtests. The configuration file names for
    the benchmarks used are:
    
        db-pgbench-timed-ro-small-xfs
        db-pgbench-timed-rw-small-xfs
        io-dbench4-async-xfs
        network-netperf-unbound
        network-tbench
        scheduler-unbound
        workload-kerndevel-xfs
        workload-shellscripts-xfs
        hpc-nas-c-class-mpi-full-xfs
        hpc-nas-c-class-omp-full
    
    All those benchmarks are generally available on the web:
    
    pgbench: https://www.postgresql.org/docs/10/pgbench.html
    netperf: https://hewlettpackard.github.io/netperf/
    dbench/tbench: https://dbench.samba.org/
    gitsource: git unit test suite, github.com/git/git
    NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
    hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Doug Smythies <dsmythies@telus.net>
    Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz