1. 28 Jan, 2020 14 commits
    • sched/rt: Optimize checking group RT scheduler constraints · b4fb015e
      Konstantin Khlebnikov authored
      The group RT scheduler contains protection against setting zero runtime
      for a cgroup with RT tasks. Right now function tg_set_rt_bandwidth() iterates
      over all CPU cgroups and calls tg_has_rt_tasks() for any cgroup whose
      runtime is zero (not only for the changed one). Default RT runtime is zero,
      thus tg_has_rt_tasks() will be called for almost all CPU cgroups.
      
      This protection already is slightly racy: runtime limit could be changed
      between cpu_cgroup_can_attach() and cpu_cgroup_attach() because changing
      cgroup attribute does not lock cgroup_mutex while attach does not lock
      rt_constraints_mutex. Changing task scheduler class also races with
      changing rt runtime: check in __sched_setscheduler() isn't protected.
      
      Function tg_has_rt_tasks() iterates over all threads in the system.
      This gives NR_CGROUPS * NR_TASKS operations under a single tasklist_lock
      held for read in tg_set_rt_bandwidth(). Any concurrent attempt to take
      tasklist_lock for write (for example fork) will get stuck with IRQs disabled.
      
      This patch makes two optimizations:
      1) Remove locking tasklist_lock and iterate only tasks in cgroup
      2) Call tg_has_rt_tasks() iff rt runtime changes from non-zero to zero
      
      All changed code is under CONFIG_RT_GROUP_SCHED.
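
The two optimizations can be pictured with a toy model. This is only a sketch: the `Cgroup` class and the `set_rt_runtime()` helper are hypothetical stand-ins invented for illustration, not the kernel's tg_set_rt_bandwidth() code, and locking is ignored entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    """Hypothetical stand-in for a CPU cgroup."""
    rt_runtime_us: int = 0
    # Scheduling policy of each task attached to this cgroup.
    task_policies: list = field(default_factory=list)


def has_rt_tasks(cg):
    # Optimization 1: scan only the tasks attached to this cgroup,
    # not every thread in the system.
    return any(p in ("SCHED_FIFO", "SCHED_RR") for p in cg.task_policies)


def set_rt_runtime(cg, new_runtime_us):
    # Optimization 2: the scan is needed only when the runtime
    # drops from non-zero to zero.
    if cg.rt_runtime_us != 0 and new_runtime_us == 0 and has_rt_tasks(cg):
        return False  # refuse: RT tasks would be left with no runtime
    cg.rt_runtime_us = new_runtime_us
    return True
```

With this shape, writing zero to a cgroup whose runtime is already zero (the case exercised by the testcase) never triggers a task scan.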
      
      Testcase:
      
       # mkdir /sys/fs/cgroup/cpu/test{1..10000}
       # echo 0 | tee /sys/fs/cgroup/cpu/test*/cpu.rt_runtime_us
      
      At the same time, without the patch, fork time will be >100ms:
      
       # perf trace -e clone --duration 100 stress-ng --fork 1
      
      Also remote ping will show timings >100ms caused by irq latency.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/157996383820.4651.11292439232549211693.stgit@buzz
    • sched/fair: Optimize select_idle_core() · bec2860a
      Srikar Dronamraju authored
      Currently we loop through all threads of a core to evaluate if the core is
      idle or not. This is unnecessary. If a thread of a core is not idle, skip
      evaluating the other threads of that core. Also, while clearing the cpumask,
      the bits of all CPUs of a core can be cleared in one shot.
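
The loop shape described above can be sketched as follows. This is an illustrative model using Python sets in place of cpumasks; the function signature, the `smt_siblings` mapping and the `is_idle` callback are assumptions made for the example, not the kernel interface.

```python
def select_idle_core(candidates, smt_siblings, is_idle):
    """Return the first CPU whose whole core is idle, or None."""
    remaining = set(candidates)
    for cpu in sorted(candidates):
        if cpu not in remaining:
            continue  # this CPU's core was already evaluated
        siblings = smt_siblings[cpu]
        # all() stops at the first busy thread, so the remaining
        # threads of a busy core are never inspected.
        if all(is_idle(t) for t in siblings):
            return cpu
        # Drop every thread of the busy core from the mask in one shot.
        remaining -= set(siblings)
    return None
```

On an SMT 8 core whose second thread is busy, this inspects two threads instead of eight before moving on to the next core.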
      
      Collecting ticks on a Power 9 SMT 8 system around select_idle_core
      while running schbench shows us
      
      (units are in ticks, hence lower is better)
      Without patch
          N        Min     Max     Median         Avg      Stddev
      x 130        151    1083        284   322.72308   144.41494
      
      With patch
          N        Min     Max     Median         Avg      Stddev   Improvement
      x 164         88     610        201   225.79268   106.78943        30.03%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20191206172422.6578-1-srikar@linux.vnet.ibm.com
    • x86/intel_pstate: Handle runtime turbo disablement/enablement in frequency invariance · 918229cd
      Giovanni Gherdovich authored
      On some platforms, such as the Dell XPS 13 laptop, the firmware disables turbo
      when the machine is disconnected from AC and, vice versa, enables it again
      when it's reconnected. In these cases a _PPC ACPI notification is issued.
      
      The scheduler needs to know freq_max for frequency-invariant calculations.
      To account for turbo availability coming and going, record freq_max at boot as
      if turbo were available and store it in a helper variable. Use a setter
      function to swap between freq_base and freq_max every time turbo goes off or on.
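
A minimal sketch of the save-and-swap idea follows. The class and method names are hypothetical, chosen for this example only; the real change operates on kernel variables and is driven by the intel_pstate driver.

```python
class FreqInvariance:
    """Hypothetical holder for the values described above (MHz)."""

    def __init__(self, freq_base, freq_max_turbo):
        self.freq_base = freq_base
        # Recorded once at boot, as if turbo were available.
        self.saved_freq_max = freq_max_turbo
        self.freq_max = freq_max_turbo

    def set_turbo_disabled(self, disabled):
        # Called whenever turbo availability changes.
        self.freq_max = self.freq_base if disabled else self.saved_freq_max

    def scale(self, freq_curr):
        # The ratio the scheduler uses for frequency-invariant accounting.
        return freq_curr / self.freq_max
```

Keeping the boot-time value in a separate field is what lets turbo be restored later without re-reading any hardware state.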
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20200122151617.531-7-ggherdovich@suse.cz
    • x86, sched: Add support for frequency invariance on ATOM · 298c6f99
      Giovanni Gherdovich authored
      The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
      accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core
      turbo ratio.
      
      We intended to perform tests validating that this patch doesn't regress in
      terms of energy efficiency, given that this is the primary concern on Atom
      processors. Alas, we found out that turbostat doesn't support reading RAPL
      interfaces on our test machine (Airmont), and we don't have external equipment
      to measure power consumption; all we have is the performance results of the
      benchmarks we ran.
      
      Test machine:
      
      Platform    : Dell Wyse 3040 Thin Client[1]
      CPU Model   : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
      Fam/Mod/Ste : 6:76:4
      Topology    : 1 socket, 4 cores / 4 threads
      Memory      : 2G
      Storage     : onboard flash, XFS filesystem
      
      [1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client
      
      Base frequency and available turbo levels (MHz):
      
          Min Operating Freq   266 |***
          Low Freq Mode        800 |********
          Base Freq           2400 |************************
          4 Cores             2800 |****************************
          3 Cores             2800 |****************************
          2 Cores             3200 |********************************
          1 Core              3200 |********************************
      
      Tested kernels:
      
      Baseline      : v5.4-rc1,              intel_pstate passive,  schedutil
      Comparison #1 : v5.4-rc1,              intel_pstate active ,  powersave
      Comparison #2 : v5.4-rc1, this patch,  intel_pstate passive,  schedutil
      
      tbench, hackbench and kernbench performed the same under all three kernels;
      dbench ran faster with intel_pstate/powersave and the git unit tests were a
      lot faster with intel_pstate/powersave and invariant schedutil wrt the
      baseline. Not that any of this is terribly interesting anyway, one doesn't buy
      an Atom system to go fast. Power consumption regressions aren't expected but
      we lack the equipment to make that measurement. Turbostat seems to think that
      reading RAPL on this machine isn't a good idea and we're trusting that
      decision.
      
      comparison ratio of performance with baseline; 1.00 means neutral,
      lower is better:
      
                            I_PSTATE      FREQ-INV
          ----------------------------------------
          dbench                0.90             ~
          kernbench             0.98          0.97
          gitsource             0.63          0.43
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20200122151617.531-6-ggherdovich@suse.cz
    • x86, sched: Add support for frequency invariance on ATOM_GOLDMONT* · eacf0474
      Giovanni Gherdovich authored
      The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
      accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and
      GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency
      reported by the CPU.
      
      The encoding of turbo ratios for GOLDMONT* is identical to the one for
      SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to
      a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more
      conservative frequency selections (favoring power efficiency).
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20200122151617.531-5-ggherdovich@suse.cz
    • x86, sched: Add support for frequency invariance on XEON_PHI_KNL/KNM · 8bea0dfb
      Giovanni Gherdovich authored
      The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
      accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
      reported by the CPU.
      
      Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
      one or two turbo frequencies; in the former case that's 100 MHz above the base
      frequency, in the latter case the two levels are 100 MHz and 200 MHz above
      base frequency.
      
      We set freq_max to the second-highest frequency reported by the CPU. This
      could be the base frequency (if only one turbo level is available) or the first
      turbo level (if two levels are available). The rationale is to compromise
      between power efficiency and performance -- going straight to max turbo would
      favor efficiency and blindly using base freq would favor performance.
      
      For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
      to get the available frequencies (taken from a comment in turbostat's sources):
      
          [0] -- Reserved
          [7:1] -- Base value of number of active cores of bucket 1.
          [15:8] -- Base value of freq ratio of bucket 1.
          [20:16] -- +ve delta of number of active cores of bucket 2.
          i.e. active cores of bucket 2 =
          active cores of bucket 1 + delta
          [23:21] -- Negative delta of freq ratio of bucket 2.
          i.e. freq ratio of bucket 2 =
          freq ratio of bucket 1 - delta
          [28:24]-- +ve delta of number of active cores of bucket 3.
          [31:29]-- -ve delta of freq ratio of bucket 3.
          [36:32]-- +ve delta of number of active cores of bucket 4.
          [39:37]-- -ve delta of freq ratio of bucket 4.
          [44:40]-- +ve delta of number of active cores of bucket 5.
          [47:45]-- -ve delta of freq ratio of bucket 5.
          [52:48]-- +ve delta of number of active cores of bucket 6.
          [55:53]-- -ve delta of freq ratio of bucket 6.
          [60:56]-- +ve delta of number of active cores of bucket 7.
          [63:61]-- -ve delta of freq ratio of bucket 7.
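
The bit layout above can be decoded as sketched below. The function names are made up for this example, and the "second-highest distinct ratio" helper is one possible reading of the rule stated earlier in this message, not a copy of the kernel code. The register value used in the usage note is synthetic, not read from hardware.

```python
def decode_knl_turbo_ratio_limit(msr):
    """Return a list of (active_cores, freq_ratio) for buckets 1..7."""
    cores = (msr >> 1) & 0x7F      # bits [7:1]
    ratio = (msr >> 8) & 0xFF      # bits [15:8]
    buckets = [(cores, ratio)]
    # Buckets 2..7 are 8 bits each: a 5-bit core delta followed by
    # a 3-bit ratio delta, both relative to the previous bucket.
    for shift in range(16, 64, 8):
        cores += (msr >> shift) & 0x1F
        ratio -= (msr >> (shift + 5)) & 0x7
        buckets.append((cores, ratio))
    return buckets


def second_highest_ratio(buckets):
    """Second-highest distinct ratio; falls back to the only one (assumption)."""
    distinct = sorted({r for _, r in buckets}, reverse=True)
    return distinct[1] if len(distinct) > 1 else distinct[0]
```

For instance, a synthetic value built as `(30 << 1) | (12 << 8) | (31 << 16) | (1 << 21) | (7 << 24)` decodes to 30 cores at ratio 12, then 61 and 68 cores at ratio 11, and the helper picks ratio 11.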
      
      1. PERFORMANCE EVALUATION: TBENCH +5%
      2. NEUTRAL BENCHMARKS (ALL OTHERS)
      3. TEST SETUP
      
      1. PERFORMANCE EVALUATION: TBENCH +5%
      -------------------------------------
      
      A performance evaluation was conducted on a Knights Mill machine (see "Test
      Setup" below), where the frequency-invariance patch (on schedutil) is compared
      to both non-invariant schedutil and active intel_pstate with powersave: all
      three tested kernels behave the same performance-wise and with regard to power
      consumption (performance per watt). The only notable difference is tbench:
      
      comparison ratio of performance with baseline; 1.00 means neutral,
      higher is better:
      
                            I_PSTATE      FREQ-INV
          ----------------------------------------
          tbench                1.04          1.05
      
      performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:
      
                            I_PSTATE      FREQ-INV
          ----------------------------------------
          tbench                1.03          1.04
      
      which essentially means that frequency-invariant schedutil is 5% better than
      baseline, the same as intel_pstate+powersave.
      
      As the results above are averaged over the varying parameter, here is the
      detailed table.
      
      Varying parameter  : number of clients
      Unit               : MB/sec (higher is better)
      
                          5.2.0 vanilla (BASELINE)                 5.2.0 intel_pstate                     5.2.0 freq-inv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean   1         49.06  +- 2.12% (        )         51.66  +- 1.52% (   5.30%)         52.87  +- 0.88% (   7.76%)
      Hmean   2         93.82  +- 0.45% (        )        103.24  +- 0.70% (  10.05%)        105.90  +- 0.70% (  12.88%)
      Hmean   4        192.46  +- 1.15% (        )        215.95  +- 0.60% (  12.21%)        215.78  +- 1.43% (  12.12%)
      Hmean   8        406.74  +- 2.58% (        )        438.58  +- 0.36% (   7.83%)        437.61  +- 0.97% (   7.59%)
      Hmean   16       857.70  +- 1.22% (        )        890.26  +- 0.72% (   3.80%)        889.11  +- 0.73% (   3.66%)
      Hmean   32      1760.10  +- 0.92% (        )       1791.70  +- 0.44% (   1.79%)       1787.95  +- 0.44% (   1.58%)
      Hmean   64      3183.50  +- 0.34% (        )       3183.19  +- 0.36% (  -0.01%)       3187.53  +- 0.36% (   0.13%)
      Hmean   128     4830.96  +- 0.31% (        )       4846.53  +- 0.30% (   0.32%)       4855.86  +- 0.30% (   0.52%)
      Hmean   256     5467.98  +- 0.38% (        )       5793.80  +- 0.28% (   5.96%)       5821.94  +- 0.17% (   6.47%)
      Hmean   512     5398.10  +- 0.06% (        )       5745.56  +- 0.08% (   6.44%)       5503.68  +- 0.07% (   1.96%)
      Hmean   1024    5290.43  +- 0.63% (        )       5221.07  +- 0.47% (  -1.31%)       5277.22  +- 0.80% (  -0.25%)
      Hmean   1088    5139.71  +- 0.57% (        )       5236.02  +- 0.71% (   1.87%)       5190.57  +- 0.41% (   0.99%)
      
      2. NEUTRAL BENCHMARKS (ALL OTHERS)
      ----------------------------------
      
      * pgbench (both read/write and read-only)
      * NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
      * hackbench
      * netperf
      * dbench
      * kernbench
      * gitsource (git unit test suite)
      
      3. TEST SETUP
      -------------
      
      Test machine:
      
      CPU Model   : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill)
      Fam/Mod/Ste : 6:133:0
      Topology    : 1 socket, 68 cores / 272 threads
      Memory      : 96G
      Storage     : rotary, XFS filesystem
      
      Max EFFICiency, BASE frequency and available turbo levels (MHz):
      
          EFFIC   1000 |**********
          BASE    1100 |***********
          68C     1100 |***********
          30C     1200 |************
      
      Tested kernels:
      
      Baseline      : v5.2,              intel_pstate passive,  schedutil
      Comparison #1 : v5.2,              intel_pstate active ,  powersave
      Comparison #2 : v5.2, this patch,  intel_pstate passive,  schedutil
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20200122151617.531-4-ggherdovich@suse.cz
    • x86, sched: Add support for frequency invariance on SKYLAKE_X · 2a0abc59
      Giovanni Gherdovich authored
      The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
      accounting. On SKYLAKE_X CPUs set freq_max to the highest frequency that can
      be sustained by a group of at least 4 cores.
      
      From the changelog of commit 31e07522 ("tools/power turbostat: fix
      decoding for GLM, DNV, SKX turbo-ratio limits"):
      
       >   Newer processors do not hard-code the the number of cpus in each bin
       >   to {1, 2, 3, 4, 5, 6, 7, 8}  Rather, they can specify any number
       >   of CPUS in each of the 8 bins:
       >
       >   eg.
       >
       >   ...
       >   37 * 100.0 = 3600.0 MHz max turbo 4 active cores
       >   38 * 100.0 = 3700.0 MHz max turbo 3 active cores
       >   39 * 100.0 = 3800.0 MHz max turbo 2 active cores
       >   39 * 100.0 = 3900.0 MHz max turbo 1 active cores
       >
       >   could now look something like this:
       >
       >   ...
       >   37 * 100.0 = 3600.0 MHz max turbo 16 active cores
       >   38 * 100.0 = 3700.0 MHz max turbo 8 active cores
       >   39 * 100.0 = 3800.0 MHz max turbo 4 active cores
       >   39 * 100.0 = 3900.0 MHz max turbo 2 active cores
      
      This encoding of turbo levels applies to both SKYLAKE_X and GOLDMONT/GOLDMONT_D,
      but we treat these two classes in separate commits because their freq_max
      values need to be different. For SKX we prefer a lower freq_max in the ratio
      freq_curr/freq_max, allowing load and utilization to overshoot and the
      schedutil governor to be more performance-oriented. Models from the Atom
      series (such as GOLDMONT*) are handled in a forthcoming commit as they have to
      favor power-efficiency over performance.
      
      Results from a performance evaluation follow.
      
      1. TEST SETUP
      2. NEUTRAL BENCHMARKS
      3. NON-NEUTRAL BENCHMARKS
      4. DETAILED TABLES
      
      1. TEST SETUP
      -------------
      
      Test machine:
      
      CPU Model   : Intel Xeon Platinum 8260L CPU @ 2.40GHz (a.k.a. Cascade Lake)
      Fam/Mod/Ste : 6:85:6
      Topology    : 2 sockets, 24 cores / 48 threads each socket
      Memory      : 192G
      Storage     : SSD, XFS filesystem
      
      Max EFFICiency, BASE frequency and available turbo levels (MHz):
      
          EFFIC   1000 |**********
          BASE    2400 |************************
          24C     3100 |*******************************
          20C     3300 |*********************************
          16C     3600 |************************************
          12C     3600 |************************************
          8C      3600 |************************************
          4C      3700 |*************************************
          2C      3900 |***************************************
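
Applied to the turbo levels listed above, the rule stated at the top of this commit message ("the highest frequency that can be sustained by a group of at least 4 cores") can be sketched like this. The helper name and the list-of-pairs representation are assumptions for illustration; the kernel reads these values from MSRs.

```python
def freq_max_for_group(turbo_levels, min_cores=4):
    """Highest frequency sustainable by at least `min_cores` active cores."""
    return max(mhz for cores, mhz in turbo_levels if cores >= min_cores)


# (active cores, MHz) pairs copied from the turbo table above.
cascade_lake = [(24, 3100), (20, 3300), (16, 3600), (12, 3600),
                (8, 3600), (4, 3700), (2, 3900)]

freq_max = freq_max_for_group(cascade_lake)
```

The 2C level is excluded because it cannot be sustained by four cores, so the result on this machine is the 4C level.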
      
      Tested kernels:
      
      Baseline      : v5.2,              intel_pstate passive,  schedutil
      Comparison #1 : v5.2,              intel_pstate active ,  powersave+HWP
      Comparison #2 : v5.2, this patch,  intel_pstate passive,  schedutil
      
      2. NEUTRAL BENCHMARKS
      ---------------------
      
      * pgbench read/write
      * NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
      * hackbench
      * netperf
      
      3. NON-NEUTRAL BENCHMARKS
      -------------------------
      
      comparison ratio with baseline; 1.00 means neutral, higher is better:
      
                            I_PSTATE      FREQ-INV
          ----------------------------------------
          pgbench read-only     1.10             ~
          tbench                1.82          1.14
      
      comparison ratio with baseline; 1.00 means neutral, lower is better:
      
                            I_PSTATE      FREQ-INV
          ----------------------------------------
          dbench                   ~          0.97
          kernbench             0.88          0.78
          gitsource[*]             ~          0.46
      
      [*] "gitsource" consists of running git's unit tests
      tilde (~) means 1.00, i.e. result identical to baseline
      
      Performance per watt:
      
      performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:
      
                            I_PSTATE      FREQ-INV
          ----------------------------------------
          dbench                0.92          0.91
          tbench                1.26          1.04
          kernbench             0.95          0.96
          gitsource             1.03          1.30
      
      Similarly to earlier Xeons, measurable performance gains over non-invariant
      schedutil are observed on dbench, tbench, kernel compilation and running the
      git unit tests suite. Looking at the detailed tables shows that the patch
      scores the largest difference when the machine is lightly loaded. Power
      efficiency suffers slightly on kernbench and a bit more on dbench, but largely
      improves on gitsource (which also runs considerably faster). For reference, we
      also report results using active intel_pstate with powersave and HWP; the
      largest gap between non-invariant schedutil and intel_pstate+powersave is
      still tbench, which runs 82% better and with 26% improved efficiency on the
      latter configuration -- this divide isn't closed yet by frequency-invariant
      schedutil.
      
      4. DETAILED TABLES
      ------------------
      
      Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
      Varying parameter  : number of clients
      Unit               : MB/sec (higher is better)
      
                           5.2.0 vanilla (BASELINE)            5.2.0 intel_pstate/HWP                    5.2.0 freq-inv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean   1         183.56  +- 0.21% (        )       516.12  +- 0.57% ( 181.18%)       185.59  +- 0.59% (   1.11%)
      Hmean   2         365.75  +- 0.25% (        )      1015.14  +- 0.33% ( 177.55%)       402.59  +- 4.48% (  10.07%)
      Hmean   4         720.99  +- 0.44% (        )      1951.75  +- 0.28% ( 170.70%)       738.39  +- 1.72% (   2.41%)
      Hmean   8        1449.93  +- 0.34% (        )      3830.56  +- 0.24% ( 164.19%)      1750.36  +- 4.65% (  20.72%)
      Hmean   16       2874.26  +- 0.57% (        )      7381.62  +- 0.53% ( 156.82%)      4348.35  +- 2.22% (  51.29%)
      Hmean   32       6116.17  +- 5.10% (        )     13013.05  +- 0.08% ( 112.76%)      8980.35  +- 0.66% (  46.83%)
      Hmean   64      14485.04  +- 3.46% (        )     17835.12  +- 0.35% (  23.13%)     16540.73  +- 0.51% (  14.19%)
      Hmean   128     30779.16  +- 3.20% (        )     32796.94  +- 2.13% (   6.56%)     31512.58  +- 0.20% (   2.38%)
      Hmean   256     34664.66  +- 0.81% (        )     34604.67  +- 0.46% (  -0.17%)     34943.70  +- 0.25% (   0.80%)
      Hmean   384     33957.51  +- 0.11% (        )     34091.50  +- 0.14% (   0.39%)     33921.41  +- 0.09% (  -0.11%)
      
      Benchmark          : kernbench (kernel compilation)
      Varying parameter  : number of jobs
      Unit               : seconds (lower is better)
      
                          5.2.0 vanilla (BASELINE)             5.2.0 intel_pstate/HWP                     5.2.0 freq-inv
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean   2        332.94  +- 0.40% (        )        260.16  +- 0.45% (  21.86%)        233.56  +- 0.21% (  29.85%)
      Amean   4        173.04  +- 0.43% (        )        138.76  +- 0.03% (  19.81%)        123.59  +- 0.11% (  28.58%)
      Amean   8         89.65  +- 0.20% (        )         73.54  +- 0.09% (  17.97%)         65.69  +- 0.10% (  26.72%)
      Amean   16        48.08  +- 1.41% (        )         41.64  +- 1.61% (  13.40%)         36.00  +- 1.80% (  25.11%)
      Amean   32        28.78  +- 0.72% (        )         26.61  +- 1.99% (   7.55%)         23.19  +- 1.68% (  19.43%)
      Amean   64        20.46  +- 1.85% (        )         19.76  +- 0.35% (   3.42%)         17.38  +- 0.92% (  15.06%)
      Amean   128       18.69  +- 1.70% (        )         17.59  +- 1.04% (   5.90%)         15.73  +- 1.40% (  15.85%)
      Amean   192       18.82  +- 1.01% (        )         17.76  +- 0.77% (   5.67%)         15.57  +- 1.80% (  17.28%)
      
      Benchmark          : gitsource (time to run the git unit test suite)
      Varying parameter  : none
      Unit               : seconds (lower is better)
      
                       5.2.0 vanilla (BASELINE)           5.2.0 intel_pstate/HWP                    5.2.0 freq-inv
      - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean         792.49  +- 0.20% (        )      779.35  +- 0.24% (   1.66%)      427.14  +- 0.16% (  46.10%)
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20200122151617.531-3-ggherdovich@suse.cz
    • x86, sched: Add support for frequency invariance · 1567c3e3
      Giovanni Gherdovich authored
      Implement arch_scale_freq_capacity() for 'modern' x86. This function
      is used by the scheduler to correctly account usage in the face of
      DVFS.
      
      The present patch addresses Intel processors specifically and has positive
      performance and performance-per-watt implications for the schedutil cpufreq
      governor, bringing it closer to, if not on-par with, the powersave governor
      from the intel_pstate driver/framework.
      
      Large performance gains are obtained when the machine is lightly loaded and
      no regressions are observed at saturation. The benchmarks with the largest
      gains are kernel compilation, tbench (the networking version of dbench) and
      shell-intensive workloads.
      
      1. FREQUENCY INVARIANCE: MOTIVATION
         * Without it, a task looks larger if the CPU runs slower
      
      2. PECULIARITIES OF X86
         * freq invariance accounting requires knowing the ratio freq_curr/freq_max
         2.1 CURRENT FREQUENCY
             * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
         2.2 MAX FREQUENCY
             * It varies with time (turbo). As an approximation, we set it to a
               constant, i.e. 4-cores turbo frequency.
      
      3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
         * The invariant schedutil's formula has no feedback loop and reacts faster
           to utilization changes
      
      4. KNOWN LIMITATIONS
         * In some cases tasks can't reach max util despite how hard they try
      
      5. PERFORMANCE TESTING
         5.1 MACHINES
             * Skylake, Broadwell, Haswell
         5.2 SETUP
             * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
               active cores turbo w/ invariant schedutil, and intel_pstate/powersave
         5.3 BENCHMARK RESULTS
             5.3.1 NEUTRAL BENCHMARKS
                   * NAS Parallel Benchmark (HPC), hackbench
             5.3.2 NON-NEUTRAL BENCHMARKS
                   * tbench (10-30% better), kernbench (10-15% better),
                     shell-intensive-scripts (30-50% better)
                   * no regressions
             5.3.3 SELECTION OF DETAILED RESULTS
             5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
                   * dbench (5% worse on one machine), kernbench (3% worse),
                     tbench (5-10% better), shell-intensive-scripts (10-40% better)
      
      6. MICROARCH'ES ADDRESSED HERE
         * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum
           etc have different MSR semantics for querying turbo levels)
      
      7. REFERENCES
         * MMTests performance testing framework, github.com/gormanm/mmtests
      
       +-------------------------------------------------------------------------+
       | 1. FREQUENCY INVARIANCE: MOTIVATION
       +-------------------------------------------------------------------------+
      
      For example, suppose a CPU has two frequencies: 500 and 1000 MHz. When
      running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
      appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
      false impression this CPU is almost at capacity, even though it can go
      faster [*]. In a nutshell, without frequency scale-invariance tasks look
      larger just because the CPU is running slower.
      
      [*] (footnote: this assumes a linear frequency/performance relation; which
      everybody knows to be false, but given realities it's the best approximation
      we can make.)
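
The 500/1000 MHz example can be checked with a few lines of arithmetic. This sketch assumes the same linear frequency/performance relation as the footnote; the function names are invented for the example.

```python
from fractions import Fraction


def apparent_util(util_at_max, freq_curr, freq_max):
    """Busy fraction observed at freq_curr, assuming linear scaling."""
    return min(Fraction(1), util_at_max * Fraction(freq_max, freq_curr))


def invariant_util(raw_util, freq_curr, freq_max):
    """Scale the observed busy fraction by freq_curr / freq_max."""
    return raw_util * Fraction(freq_curr, freq_max)


# A task that needs 1/3 of the CPU at 1000 MHz, observed at 500 MHz.
raw = apparent_util(Fraction(1, 3), 500, 1000)
corrected = invariant_util(raw, 500, 1000)
```

Here `raw` is the inflated 2/3 figure from the text, and multiplying it by freq_curr/freq_max recovers the frequency-independent 1/3.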
      
       +-------------------------------------------------------------------------+
       | 2. PECULIARITIES OF X86
       +-------------------------------------------------------------------------+
      
      Accounting for frequency changes in PELT signals requires the computation of
      the ratio freq_curr / freq_max. On x86 neither of those terms is readily
      available.
      
      2.1 CURRENT FREQUENCY
      =====================
      
      Since modern x86 has hardware control over the actual frequency we run
      at (because of, amongst other things, Turbo-Mode), we cannot simply use
      the frequency as requested through cpufreq.
      
      Instead we use the APERF/MPERF MSRs to compute the effective frequency
      over the recent past. Also, because reading MSRs is expensive, don't
      do so every time we need the value, but amortize the cost by doing it
      every tick.
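
The estimate delta_APERF / delta_MPERF * freq_base, sampled once per tick and cached, can be sketched as follows. The `TickSampler` class and the `read_counters` callback are hypothetical; in the kernel the counters are MSRs read on the local CPU.

```python
def effective_freq(aperf_prev, aperf_now, mperf_prev, mperf_now, freq_base):
    """freq_curr ~= delta_APERF / delta_MPERF * freq_base ("BusyMHz")."""
    delta_aperf = aperf_now - aperf_prev
    delta_mperf = mperf_now - mperf_prev
    if delta_mperf == 0:
        return None  # no reference cycles elapsed; nothing to report
    return delta_aperf * freq_base / delta_mperf


class TickSampler:
    """Reads the counters once per tick and caches the result."""

    def __init__(self, read_counters, freq_base):
        self.read_counters = read_counters   # returns (aperf, mperf)
        self.freq_base = freq_base
        self.prev = read_counters()
        self.freq_curr = None

    def tick(self):
        now = self.read_counters()
        f = effective_freq(self.prev[0], now[0],
                           self.prev[1], now[1], self.freq_base)
        if f is not None:
            self.freq_curr = f   # consumers read this cached value
        self.prev = now
```

Consumers between ticks read the cached `freq_curr` instead of touching the counters, which is the amortization the paragraph above describes.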
      
      2.2 MAX FREQUENCY
      =================
      
      Obtaining freq_max is also non-trivial because at any time the hardware can
      provide a frequency boost to a selected subset of cores if the package has
      enough power to spare (eg: Turbo Boost). This means that the maximum frequency
      available to a given core changes with time.
      
      The approach taken in this change is to arbitrarily set freq_max to a constant
      value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
      microarchitectures, after evaluating the following candidates:
      
          * 1-core (1C) turbo frequency (the fastest turbo state available)
          * around base frequency (a.k.a. max P-state)
          * something in between, such as 4C turbo
      
      To interpret these options, consider that this is the denominator in
      freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
      util_avg and load_avg. A large denominator will undershoot (util_avg looks a
      bit smaller than it really is); vice versa, with a smaller denominator PELT
      signals will tend to overshoot. Given that PELT drives frequency selection
      in the schedutil governor, we will have:
      
          freq_max set to     | effect on DVFS
          --------------------+------------------
          1C turbo            | power efficiency (lower freq choices)
          base freq           | performance (higher util_avg, higher freq requests)
          4C turbo            | a bit of both
      
      4C turbo proves to be a good compromise in a number of benchmarks (see below).
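
      To make the effect of the denominator concrete, here is a small sketch of
      the freq_curr/freq_max ratio in fixed point. The function name is
      hypothetical, and capping the ratio at 1.0 is an assumption of this
      sketch. The frequencies used in the comment are the base, 4C and 1C
      levels of 80x-BROADWELL-NUMA from section 5.1.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SCALE 1024u /* fixed-point 1.0, same as max util */

/*
 * freq_curr / freq_max in fixed point, capped at 1.0.
 * The cap is an assumption of this sketch.
 */
uint32_t freq_scale(uint32_t freq_curr_mhz, uint32_t freq_max_mhz)
{
    uint64_t scale;

    assert(freq_max_mhz > 0);
    scale = (uint64_t)freq_curr_mhz * CAPACITY_SCALE / freq_max_mhz;
    return scale > CAPACITY_SCALE ? CAPACITY_SCALE : (uint32_t)scale;
}

/*
 * A core running at 2200 MHz, with the three candidate denominators:
 *   freq_scale(2200, 3600)  1C turbo  -> smallest factor
 *   freq_scale(2200, 3300)  4C turbo  -> in between
 *   freq_scale(2200, 2200)  base freq -> already 1.0
 */
```

      The larger the denominator, the smaller the factor applied to util_avg,
      which matches the undershoot/overshoot reasoning above.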
      
       +-------------------------------------------------------------------------+
       | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
       +-------------------------------------------------------------------------+
      
      Once an architecture implements a frequency scale-invariant utilization (the
      PELT signal util_avg), schedutil switches its frequency selection formula from
      
          freq_next = 1.25 * freq_curr * util            [non-invariant util signal]
      
      to
      
          freq_next = 1.25 * freq_max * util             [invariant util signal]
      
      where, in the second formula, freq_max is set to the 1C turbo frequency (max
      turbo). The advantage of the second formula, whose usage we unlock with this
      patch, is that freq_next doesn't depend on the current frequency in an
      iterative fashion, but can jump to any frequency in a single update. This
      absence of feedback in the formula makes it quicker to react to utilization
      changes and more robust against pathological instabilities.
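
      The two formulas can be sketched in integer arithmetic as follows. The
      function names are hypothetical and util is expressed in [0, 1024]; the
      1.25 factor is written as freq + freq/4. This is a sketch of the formulas
      only, without the clamping a real governor applies.

```c
#include <stdint.h>

#define UTIL_SCALE 1024u /* util of 1.0 in fixed point */

/* 1.25 * freq * util, with util in [0, UTIL_SCALE]. */
uint32_t map_util_freq(uint32_t freq_mhz, uint32_t util)
{
    uint64_t boosted = (uint64_t)freq_mhz + (freq_mhz >> 2); /* 1.25 * freq */

    return (uint32_t)(boosted * util / UTIL_SCALE);
}

/* Non-invariant signal: the request depends on the current frequency. */
uint32_t next_freq_noninvariant(uint32_t freq_curr_mhz, uint32_t util_raw)
{
    return map_util_freq(freq_curr_mhz, util_raw);
}

/* Invariant signal: the request depends only on freq_max and util. */
uint32_t next_freq_invariant(uint32_t freq_max_mhz, uint32_t util_inv)
{
    return map_util_freq(freq_max_mhz, util_inv);
}
```

      With util at 512 (50%) and freq_max at 3600 MHz, the invariant formula
      asks for 2250 MHz whatever the current frequency is, while the
      non-invariant one asks for 750 MHz when the CPU currently runs at 1200
      MHz and for 2250 MHz only when it already runs at 3600 MHz.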
      
      Compare it to the update formula of intel_pstate/powersave:
      
          freq_next = 1.25 * freq_max * Busy%
      
      where again freq_max is 1C turbo and Busy% is the percentage of time not spent
      idling (calculated with delta_MPERF / delta_TSC); essentially the same as
      invariant schedutil, and largely responsible for intel_pstate/powersave's good
      reputation. The non-invariant schedutil formula is derived from the invariant
      one by approximating util_inv with util_raw * freq_curr / freq_max, but this
      has limitations.
      
      Testing shows improved performance due to better frequency selections when
      the machine is lightly loaded, and essentially no change in behaviour at
      saturation / overutilization.
      
       +-------------------------------------------------------------------------+
       | 4. KNOWN LIMITATIONS
       +-------------------------------------------------------------------------+
      
      It's been shown that it is possible to create pathological scenarios where a
      CPU-bound task cannot reach max utilization, if the normalizing factor
      freq_max is fixed to a constant value (see [Lelli-2018]).
      
      If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
      cores in a package doing some busywork, and observe that none of those tasks
      will ever reach max util (1024) because they're all running at less than the
      4C turbo frequency.
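
      A worked example of this ceiling, using the turbo levels of
      80x-BROADWELL-NUMA listed in section 5.1. The function name is
      hypothetical, and the sketch assumes every busy core runs exactly at the
      turbo level advertised for that number of active cores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Turbo levels (MHz) for 1..8 active cores on 80x-BROADWELL-NUMA. */
const uint32_t turbo_mhz[] = { 3600, 3600, 3400, 3300, 3200, 3100, 3000, 2900 };

#define FREQ_MAX_MHZ 3300u /* 4C turbo, the constant used as freq_max */
#define MAX_UTIL     1024u

/* Highest util a CPU-bound task can reach with n_busy cores pegged. */
uint32_t util_ceiling(size_t n_busy)
{
    uint64_t util;

    assert(n_busy >= 1 && n_busy <= sizeof(turbo_mhz) / sizeof(turbo_mhz[0]));
    util = (uint64_t)MAX_UTIL * turbo_mhz[n_busy - 1] / FREQ_MAX_MHZ;
    return util > MAX_UTIL ? MAX_UTIL : (uint32_t)util;
}
```

      Up to 4 busy cores the ceiling is 1024; with 5 busy cores it drops to
      992, and with 8 busy cores to 899.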
      
      While this concern still applies, we believe the performance benefit of
      frequency scale-invariant PELT signals outweighs the cost of this limitation.
      
       [Lelli-2018]
       https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/
      
       +-------------------------------------------------------------------------+
       | 5. PERFORMANCE TESTING
       +-------------------------------------------------------------------------+
      
      5.1 MACHINES
      ============
      
      We tested the patch on three machines, with Skylake, Broadwell and Haswell
      CPUs. The details are below, together with the available turbo ratios as
      reported by the appropriate MSRs.
      
      * 8x-SKYLAKE-UMA:
        Single socket E3-1240 v5, Skylake 4 cores/8 threads
        Max EFFiciency, BASE frequency and available turbo levels (MHz):
      
          EFFIC    800 |********
          BASE    3500 |***********************************
          4C      3700 |*************************************
          3C      3800 |**************************************
          2C      3900 |***************************************
          1C      3900 |***************************************
      
      * 80x-BROADWELL-NUMA:
        Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
        Max EFFiciency, BASE frequency and available turbo levels (MHz):
      
          EFFIC   1200 |************
          BASE    2200 |**********************
          8C      2900 |*****************************
          7C      3000 |******************************
          6C      3100 |*******************************
          5C      3200 |********************************
          4C      3300 |*********************************
          3C      3400 |**********************************
          2C      3600 |************************************
          1C      3600 |************************************
      
      * 48x-HASWELL-NUMA
        Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
        Max EFFiciency, BASE frequency and available turbo levels (MHz):
      
          EFFIC   1200 |************
          BASE    2300 |***********************
          12C     2600 |**************************
          11C     2600 |**************************
          10C     2600 |**************************
          9C      2600 |**************************
          8C      2600 |**************************
          7C      2600 |**************************
          6C      2600 |**************************
          5C      2700 |***************************
          4C      2800 |****************************
          3C      2900 |*****************************
          2C      3100 |*******************************
          1C      3100 |*******************************
      
      5.2 SETUP
      =========
      
      * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
        driver in passive mode.
      * The rationale for choosing the various freq_max values to test has been to
        try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
        on all machines), plus one more value closer to base_freq but still in the
        turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
      * In addition we've run all tests with intel_pstate/powersave for comparison.
      * The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
      * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
        with active intel_pstate on this machine use that.
      
      This gives, in terms of combinations tested on each machine:
      
      * 8x-SKYLAKE-UMA
        * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
        * intel_pstate active + powersave + HWP
        * invariant schedutil, freq_max = 1C turbo
        * invariant schedutil, freq_max = 3C turbo
        * invariant schedutil, freq_max = 4C turbo
      
      * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
        * [same as 8x-SKYLAKE-UMA, but not HWP capable]
        * invariant schedutil, freq_max = 8C turbo
          (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")
      
      5.3 BENCHMARK RESULTS
      =====================
      
      5.3.1 NEUTRAL BENCHMARKS
      ------------------------
      
      Tests that didn't show any measurable difference in performance on any of the
      test machines between non-invariant schedutil and our patch are:
      
      * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
        computational kernel
      * flexible I/O (FIO)
      * hackbench (using threads or processes, and using pipes or sockets)
      
      5.3.2 NON-NEUTRAL BENCHMARKS
      ----------------------------
      
      What follows are summary tables where each benchmark result is given a score.
      
      * A tilde (~) means a neutral result, i.e. no difference from baseline.
      * Scores are computed with the ratio result_new / result_baseline, so a tilde
        means a score of 1.00.
      * The results in the score ratio are the geometric means of results running
        the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
        ... number of processes; for pgbench: varying the number of clients, and so
        on).
      * The first three tables show higher-is-better kind of tests (i.e. measured in
        operations/second), the subsequent three show lower-is-better kind of tests
        (i.e. the workload is fixed and we measure elapsed time, think kernbench).
      * "gitsource" is a name we made up for the test consisting in running the
        entire unit test suite of the Git SCM and measuring how long it takes. We
        take it as a typical example of shell-intensive serialized workload.
      * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
        columns show invariant schedutil for different values of freq_max. 4C turbo
        is circled as it's the value we've chosen for the final implementation.
      
      80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                               +------+
                       I_PSTATE   1C     3C    | 4C   |  8C
      pgbench-ro           1.14   ~      ~     | 1.11 |  1.14
      pgbench-rw           ~      ~      ~     | ~    |  ~
      netperf-udp          1.06   ~      1.06  | 1.05 |  1.07
      netperf-tcp          ~      1.03   ~     | 1.01 |  1.02
      tbench4              1.57   1.18   1.22  | 1.30 |  1.56
                                               +------+
      
      8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                               +------+
                   I_PSTATE/HWP   1C     3C    | 4C   |
      pgbench-ro           ~      ~      ~     | ~    |
      pgbench-rw           ~      ~      ~     | ~    |
      netperf-udp          ~      ~      ~     | ~    |
      netperf-tcp          ~      ~      ~     | ~    |
      tbench4              1.30   1.14   1.14  | 1.16 |
                                               +------+
      
      48x-HASWELL-NUMA (comparison ratio; higher is better)
                                               +------+
                       I_PSTATE   1C     3C    | 4C   |  12C
      pgbench-ro           1.15   ~      ~     | 1.06 |  1.16
      pgbench-rw           ~      ~      ~     | ~    |  ~
      netperf-udp          1.05   0.97   1.04  | 1.04 |  1.02
      netperf-tcp          0.96   1.01   1.01  | 1.01 |  1.01
      tbench4              1.50   1.05   1.13  | 1.13 |  1.25
                                               +------+
      
      In the tables above we see that active intel_pstate is slightly better than our
      4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
      read-only pgbench and much better on tbench. Both cases are notable in that
      they show that lowering our freq_max (to 8C-turbo and 12C-turbo on
      80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
      schedutil to get closer.
      
      If we ignore active intel_pstate and focus on the comparison with baseline
      alone, there are several instances of double-digit performance improvement.
      
      80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                               +------+
                       I_PSTATE   1C     3C    | 4C   |  8C
      dbench4              1.23   0.95   0.95  | 0.95 |  0.95
      kernbench            0.93   0.83   0.83  | 0.83 |  0.82
      gitsource            0.98   0.49   0.49  | 0.49 |  0.48
                                               +------+
      
      8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                               +------+
                   I_PSTATE/HWP   1C     3C    | 4C   |
      dbench4              ~      ~      ~     | ~    |
      kernbench            ~      ~      ~     | ~    |
      gitsource            0.92   0.55   0.55  | 0.55 |
                                               +------+
      
      48x-HASWELL-NUMA (comparison ratio; lower is better)
                                               +------+
                       I_PSTATE   1C     3C    | 4C   |  8C
      dbench4              ~      ~      ~     | ~    |  ~
      kernbench            0.94   0.90   0.89  | 0.90 |  0.90
      gitsource            0.97   0.69   0.69  | 0.69 |  0.69
                                               +------+
      
      dbench is not very remarkable here, unless we notice how poorly active
      intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
      non-invariant schedutil. We repeated that run getting consistent results. Out
      of scope for the patch at hand, but deserving future investigation. Other than
      that, we previously ran this campaign with Linux v5.0 and saw the patch doing
      better on dbench at the time. We haven't checked closely and can only speculate
      at this point.
      
      On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
      the detailed tables that the gains concentrate on low process counts (lightly
      loaded machines).
      
      The test we call "gitsource" (running the git unit test suite, a long-running
      single-threaded shell script) appears rather spectacular in this table (gains
      of 30-50% depending on the machine). It is to be noted, however, that
      gitsource has no adjustable parameters (such as the number of jobs in
      kernbench, which we average over in order to get a single-number summary
      score) and is exactly the kind of low-parallelism workload that benefits the
      most from this patch. When looking at the detailed tables of kernbench or
      tbench4, at low process or client counts one can see similar numbers.
      
      5.3.3 SELECTION OF DETAILED RESULTS
      -----------------------------------
      
      Machine            : 48x-HASWELL-NUMA
      Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
      Varying parameter  : number of clients
      Unit               : MB/sec (higher is better)
      
                         5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean  1        126.73  +- 0.31% (        )      315.91  +- 0.66% ( 149.28%)      125.03  +- 0.76% (  -1.34%)
      Hmean  2        258.04  +- 0.62% (        )      614.16  +- 0.51% ( 138.01%)      269.58  +- 1.45% (   4.47%)
      Hmean  4        514.30  +- 0.67% (        )     1146.58  +- 0.54% ( 122.94%)      533.84  +- 1.99% (   3.80%)
      Hmean  8       1111.38  +- 2.52% (        )     2159.78  +- 0.38% (  94.33%)     1359.92  +- 1.56% (  22.36%)
      Hmean  16      2286.47  +- 1.36% (        )     3338.29  +- 0.21% (  46.00%)     2720.20  +- 0.52% (  18.97%)
      Hmean  32      4704.84  +- 0.35% (        )     4759.03  +- 0.43% (   1.15%)     4774.48  +- 0.30% (   1.48%)
      Hmean  64      7578.04  +- 0.27% (        )     7533.70  +- 0.43% (  -0.59%)     7462.17  +- 0.65% (  -1.53%)
      Hmean  128     6998.52  +- 0.16% (        )     6987.59  +- 0.12% (  -0.16%)     6909.17  +- 0.14% (  -1.28%)
      Hmean  192     6901.35  +- 0.25% (        )     6913.16  +- 0.10% (   0.17%)     6855.47  +- 0.21% (  -0.66%)
      
                                   5.2.0 3C-turbo                   5.2.0 4C-turbo                  5.2.0 12C-turbo
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean  1        128.43  +- 0.28% (   1.34%)      130.64  +- 3.81% (   3.09%)      153.71  +- 5.89% (  21.30%)
      Hmean  2        311.70  +- 6.15% (  20.79%)      281.66  +- 3.40% (   9.15%)      305.08  +- 5.70% (  18.23%)
      Hmean  4        641.98  +- 2.32% (  24.83%)      623.88  +- 5.28% (  21.31%)      906.84  +- 4.65% (  76.32%)
      Hmean  8       1633.31  +- 1.56% (  46.96%)     1714.16  +- 0.93% (  54.24%)     2095.74  +- 0.47% (  88.57%)
      Hmean  16      3047.24  +- 0.42% (  33.27%)     3155.02  +- 0.30% (  37.99%)     3634.58  +- 0.15% (  58.96%)
      Hmean  32      4734.31  +- 0.60% (   0.63%)     4804.38  +- 0.23% (   2.12%)     4674.62  +- 0.27% (  -0.64%)
      Hmean  64      7699.74  +- 0.35% (   1.61%)     7499.72  +- 0.34% (  -1.03%)     7659.03  +- 0.25% (   1.07%)
      Hmean  128     6935.18  +- 0.15% (  -0.91%)     6942.54  +- 0.10% (  -0.80%)     7004.85  +- 0.12% (   0.09%)
      Hmean  192     6901.62  +- 0.12% (   0.00%)     6856.93  +- 0.10% (  -0.64%)     6978.74  +- 0.10% (   1.12%)
      
      This is one of the cases where the patch still can't surpass active
      intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
      visible up to 16 clients and the saturated scenario is the same as baseline.
      
      The scores in the summary table from the previous sections are ratios of
      geometric means of the results over different clients, as seen in this table.
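
      As a sanity check, the summary score can be recomputed from the detailed
      table. The sketch below uses the BASELINE and 4C-turbo columns of the
      tbench4 table above; the function names are hypothetical, and the n-th
      root is done by bisection only to keep the example free of libm.

```c
#include <assert.h>
#include <stddef.h>

/* n-th root of x (x > 0) by bisection; avoids a libm dependency. */
double nth_root(double x, size_t n)
{
    double lo = 0.0, hi = x > 1.0 ? x : 1.0;
    int iter;

    for (iter = 0; iter < 100; iter++) {
        double mid = (lo + hi) / 2.0, p = 1.0;
        size_t i;

        for (i = 0; i < n; i++)
            p *= mid;
        if (p < x)
            lo = mid;
        else
            hi = mid;
    }
    return (lo + hi) / 2.0;
}

/* Ratio of geometric means, computed as the geometric mean of the ratios. */
double score(const double *new_res, const double *base_res, size_t n)
{
    double prod = 1.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        prod *= new_res[i] / base_res[i];
    return nth_root(prod, n);
}

/* tbench4 on 48x-HASWELL-NUMA, MB/sec for 1, 2, 4, ..., 192 clients. */
const double tbench_baseline[] = { 126.73, 258.04, 514.30, 1111.38, 2286.47,
                                   4704.84, 7578.04, 6998.52, 6901.35 };
const double tbench_4c[] = { 130.64, 281.66, 623.88, 1714.16, 3155.02,
                             4804.38, 7499.72, 6942.54, 6856.93 };
```

      Calling score(tbench_4c, tbench_baseline, 9) gives a value between 1.12
      and 1.13, in line with the 1.13 reported for tbench4 in the 4C column of
      the 48x-HASWELL-NUMA summary table.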
      
      Machine            : 80x-BROADWELL-NUMA
      Benchmark          : kernbench (kernel compilation)
      Varying parameter  : number of jobs
      Unit               : seconds (lower is better)
      
                         5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean  2        379.68  +- 0.06% (        )      330.20  +- 0.43% (  13.03%)      285.93  +- 0.07% (  24.69%)
      Amean  4        200.15  +- 0.24% (        )      175.89  +- 0.22% (  12.12%)      153.78  +- 0.25% (  23.17%)
      Amean  8        106.20  +- 0.31% (        )       95.54  +- 0.23% (  10.03%)       86.74  +- 0.10% (  18.32%)
      Amean  16        56.96  +- 1.31% (        )       53.25  +- 1.22% (   6.50%)       48.34  +- 1.73% (  15.13%)
      Amean  32        34.80  +- 2.46% (        )       33.81  +- 0.77% (   2.83%)       30.28  +- 1.59% (  12.99%)
      Amean  64        26.11  +- 1.63% (        )       25.04  +- 1.07% (   4.10%)       22.41  +- 2.37% (  14.16%)
      Amean  128       24.80  +- 1.36% (        )       23.57  +- 1.23% (   4.93%)       21.44  +- 1.37% (  13.55%)
      Amean  160       24.85  +- 0.56% (        )       23.85  +- 1.17% (   4.06%)       21.25  +- 1.12% (  14.49%)
      
                                   5.2.0 3C-turbo                   5.2.0 4C-turbo                   5.2.0 8C-turbo
      - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean  2        284.08  +- 0.13% (  25.18%)      283.96  +- 0.51% (  25.21%)      285.05  +- 0.21% (  24.92%)
      Amean  4        153.18  +- 0.22% (  23.47%)      154.70  +- 1.64% (  22.71%)      153.64  +- 0.30% (  23.24%)
      Amean  8         87.06  +- 0.28% (  18.02%)       86.77  +- 0.46% (  18.29%)       86.78  +- 0.22% (  18.28%)
      Amean  16        48.03  +- 0.93% (  15.68%)       47.75  +- 1.99% (  16.17%)       47.52  +- 1.61% (  16.57%)
      Amean  32        30.23  +- 1.20% (  13.14%)       30.08  +- 1.67% (  13.57%)       30.07  +- 1.67% (  13.60%)
      Amean  64        22.59  +- 2.02% (  13.50%)       22.63  +- 0.81% (  13.32%)       22.42  +- 0.76% (  14.12%)
      Amean  128       21.37  +- 0.67% (  13.82%)       21.31  +- 1.15% (  14.07%)       21.17  +- 1.93% (  14.63%)
      Amean  160       21.68  +- 0.57% (  12.76%)       21.18  +- 1.74% (  14.77%)       21.22  +- 1.00% (  14.61%)
      
      The patch outperforms active intel_pstate (and baseline) by a considerable
      margin; the summary table from the previous section says 4C turbo and active
      intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
      0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
      noticeable difference with regard to the value of freq_max.
      
      Machine            : 8x-SKYLAKE-UMA
      Benchmark          : gitsource (time to run the git unit test suite)
      Varying parameter  : none
      Unit               : seconds (lower is better)
      
                                  5.2.0 vanilla           5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
      - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean         858.85  +- 1.16% (        )      791.94  +- 0.21% (   7.79%)      474.95 (  44.70%)
      
                                 5.2.0 3C-turbo                   5.2.0 4C-turbo
      - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean         475.26  +- 0.20% (  44.66%)      474.34  +- 0.13% (  44.77%)
      
      In this test, which is of interest as representing shell-intensive
      (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
      intel_pstate/powersave by a whopping 40% margin.
      
      5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
      ---------------------------------------------
      
      The following table shows average power consumption in watts for each
      benchmark. Data comes from turbostat (package average), which in turn is read
      from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
      it's reasonable to ignore other power consumers (such as memory or I/O). Also,
      we don't have a power meter available in the lab so RAPL is the best we have.
      
      turbostat sampled average power every 10 seconds for the entire duration of
      each benchmark. We took all those values and averaged them (i.e. we don't
      have detail on a per-parameter granularity, only on whole benchmarks).
      
      80x-BROADWELL-NUMA (power consumption, watts)
                                                          +--------+
                     BASELINE I_PSTATE       1C       3C  |     4C |      8C
      pgbench-ro       130.01   142.77   131.11   132.45  | 134.65 |  136.84
      pgbench-rw        68.30    60.83    71.45    71.70  |  71.65 |   72.54
      dbench4           90.25    59.06   101.43    99.89  | 101.10 |  102.94
      netperf-udp       65.70    69.81    66.02    68.03  |  68.27 |   68.95
      netperf-tcp       88.08    87.96    88.97    88.89  |  88.85 |   88.20
      tbench4          142.32   176.73   153.02   163.91  | 165.58 |  176.07
      kernbench         92.94   101.95   114.91   115.47  | 115.52 |  115.10
      gitsource         40.92    41.87    75.14    75.20  |  75.40 |   75.70
                                                          +--------+
      8x-SKYLAKE-UMA (power consumption, watts)
                                                          +--------+
                    BASELINE I_PSTATE/HWP    1C       3C  |     4C |
      pgbench-ro        46.49    46.68    46.56    46.59  |  46.52 |
      pgbench-rw        29.34    31.38    30.98    31.00  |  31.00 |
      dbench4           27.28    27.37    27.49    27.41  |  27.38 |
      netperf-udp       22.33    22.41    22.36    22.35  |  22.36 |
      netperf-tcp       27.29    27.29    27.30    27.31  |  27.33 |
      tbench4           41.13    45.61    43.10    43.33  |  43.56 |
      kernbench         42.56    42.63    43.01    43.01  |  43.01 |
      gitsource         13.32    13.69    17.33    17.30  |  17.35 |
                                                          +--------+
      48x-HASWELL-NUMA (power consumption, watts)
                                                          +--------+
                     BASELINE I_PSTATE       1C       3C  |     4C |     12C
      pgbench-ro       128.84   136.04   129.87   132.43  | 132.30 |  134.86
      pgbench-rw        37.68    37.92    37.17    37.74  |  37.73 |   37.31
      dbench4           28.56    28.73    28.60    28.73  |  28.70 |   28.79
      netperf-udp       56.70    60.44    56.79    57.42  |  57.54 |   57.52
      netperf-tcp       75.49    75.27    75.87    76.02  |  76.01 |   75.95
      tbench4          115.44   139.51   119.53   123.07  | 123.97 |  130.22
      kernbench         83.23    91.55    95.58    95.69  |  95.72 |   96.04
      gitsource         36.79    36.99    39.99    40.34  |  40.35 |   40.23
                                                          +--------+
      
      A lower power consumption isn't necessarily better, it depends on what is done
      with that energy. Here are tables with the ratio of performance-per-watt on
      each machine and benchmark. Higher is always better; a tilde (~) means a
      neutral ratio (i.e. 1.00).
      
      80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                           +------+
                   I_PSTATE     1C     3C  |   4C |    8C
      pgbench-ro       1.04   1.06   0.94  | 1.07 |  1.08
      pgbench-rw       1.10   0.97   0.96  | 0.96 |  0.97
      dbench4          1.24   0.94   0.95  | 0.94 |  0.92
      netperf-udp      ~      1.02   1.02  | ~    |  1.02
      netperf-tcp      ~      1.02   ~     | ~    |  1.02
      tbench4          1.26   1.10   1.06  | 1.12 |  1.26
      kernbench        0.98   0.97   0.97  | 0.97 |  0.98
      gitsource        ~      1.11   1.11  | 1.11 |  1.13
                                           +------+
      
      8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                           +------+
               I_PSTATE/HWP     1C     3C  |   4C |
      pgbench-ro       ~      ~      ~     | ~    |
      pgbench-rw       0.95   0.97   0.96  | 0.96 |
      dbench4          ~      ~      ~     | ~    |
      netperf-udp      ~      ~      ~     | ~    |
      netperf-tcp      ~      ~      ~     | ~    |
      tbench4          1.17   1.09   1.08  | 1.10 |
      kernbench        ~      ~      ~     | ~    |
      gitsource        1.06   1.40   1.40  | 1.40 |
                                           +------+
      
      48x-HASWELL-NUMA  (performance-per-watt ratios; higher is better)
                                           +------+
                   I_PSTATE     1C     3C  |   4C |   12C
      pgbench-ro       1.09   ~      1.09  | 1.03 |  1.11
      pgbench-rw       ~      0.86   ~     | ~    |  0.86
      dbench4          ~      1.02   1.02  | 1.02 |  ~
      netperf-udp      ~      0.97   1.03  | 1.02 |  ~
      netperf-tcp      0.96   ~      ~     | ~    |  ~
      tbench4          1.24   ~      1.06  | 1.05 |  1.11
      kernbench        0.97   0.97   0.98  | 0.97 |  0.96
      gitsource        1.03   1.33   1.32  | 1.32 |  1.33
                                           +------+
      
      These results are overall pleasing: in plenty of cases we observe
      performance-per-watt improvements. The few regressions (read/write pgbench and
      dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
      percentage points (it has a 10-15% performance improvement, but apparently the
      increase in power consumption is larger than that). tbench4 and gitsource, which
      benefit the most from the patch, keep a positive score in this table which is
      a welcome surprise; that suggests that in those particular workloads the
      non-invariant schedutil (and active intel_pstate, too) makes some rather
      suboptimal frequency selections.
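
      The formula behind the performance-per-watt ratios is not spelled out
      above. The sketch below assumes it is the performance ratio divided by
      the power ratio, both taken against baseline, with elapsed-time scores
      inverted first. Under that assumption it reproduces the tabulated values
      (e.g. gitsource and tbench4 with 4C turbo on 80x-BROADWELL-NUMA). The
      function name is hypothetical.

```c
#include <assert.h>

/*
 * perf_ratio: performance of the new configuration relative to baseline,
 *             higher is better. For lower-is-better scores (elapsed
 *             time), pass 1.0 / time_ratio.
 * watts_new, watts_base: average package power of the two runs.
 */
double perf_per_watt_ratio(double perf_ratio, double watts_new,
                           double watts_base)
{
    assert(watts_new > 0.0 && watts_base > 0.0);
    return perf_ratio / (watts_new / watts_base);
}
```

      For gitsource on 80x-BROADWELL-NUMA with 4C turbo,
      perf_per_watt_ratio(1.0 / 0.49, 75.40, 40.92) is about 1.11; for tbench4
      on the same machine, perf_per_watt_ratio(1.30, 165.58, 142.32) is about
      1.12.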
      
      +-------------------------------------------------------------------------+
      | 6. MICROARCH'ES ADDRESSED HERE
      +-------------------------------------------------------------------------+
      
      The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
      MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
      respectively. This excludes the recent Xeon Scalable Performance processors
      line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.
      
      Subsequent patches will address:
      
      * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
      * Xeon Phi (Knights Landing, Knights Mill)
      * Atom Silvermont
      
      +-------------------------------------------------------------------------+
      | 7. REFERENCES
      +-------------------------------------------------------------------------+
      
      Tests have been run with the help of the MMTests performance testing
      framework, see github.com/gormanm/mmtests. The configuration file names for
      the benchmarks used are:
      
          db-pgbench-timed-ro-small-xfs
          db-pgbench-timed-rw-small-xfs
          io-dbench4-async-xfs
          network-netperf-unbound
          network-tbench
          scheduler-unbound
          workload-kerndevel-xfs
          workload-shellscripts-xfs
          hpc-nas-c-class-mpi-full-xfs
          hpc-nas-c-class-omp-full
      
      All those benchmarks are generally available on the web:
      
      pgbench: https://www.postgresql.org/docs/10/pgbench.html
      netperf: https://hewlettpackard.github.io/netperf/
      dbench/tbench: https://dbench.samba.org/
      gitsource: git unit test suite, github.com/git/git
      NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
      hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Doug Smythies <dsmythies@telus.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
      1567c3e3
    • Vincent Guittot's avatar
      sched/fair: Prevent unlimited runtime on throttled group · 2a4b03ff
      Vincent Guittot authored
      When a running task is moved to a throttled task group and there is no
      other task enqueued on the CPU, the task can keep running using 100% CPU
      regardless of the bandwidth allocated to the group, even though its cfs_rq
      is throttled. Furthermore, the group entity of the cfs_rq and its parents are
      not enqueued but only set as curr on their respective cfs_rqs.
      
      We have the following sequence:
      
      sched_move_task
        -dequeue_task: dequeue task and group_entities.
        -put_prev_task: put task and group entities.
        -sched_change_group: move task to new group.
        -enqueue_task: enqueue only task but not group entities because cfs_rq is
          throttled.
        -set_next_task : set task and group_entities as current sched_entity of
          their cfs_rq.
      
      Another impact is that the root cfs_rq runnable_load_avg at root rq stays
      null because the group_entities are not enqueued. This situation will stay
      the same until an "external" event triggers a reschedule. Let's trigger it
      immediately instead.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Ben Segall <bsegall@google.com>
      Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
      2a4b03ff
    • Wanpeng Li's avatar
      sched/nohz: Optimize get_nohz_timer_target() · e938b9c9
      Wanpeng Li authored
      On a machine where CPU 0 is used for housekeeping and the other 39 CPUs in
      the same socket are in nohz_full mode, ftrace shows a large amount of time
      burned in the loop searching for the nearest busy housekeeper CPU.
      
        2)               |                        get_nohz_timer_target() {
        2)   0.240 us    |                          housekeeping_test_cpu();
        2)   0.458 us    |                          housekeeping_test_cpu();
      
        ...
      
        2)   0.292 us    |                          housekeeping_test_cpu();
        2)   0.240 us    |                          housekeeping_test_cpu();
        2)   0.227 us    |                          housekeeping_any_cpu();
        2) + 43.460 us   |                        }
      
      This patch optimizes the search by looking for the nearest housekeeper
      CPU in the housekeeping cpumask only, which reduces the worst-case search
      time from ~44us to < 10us in my testing. In addition, the last iterated
      busy housekeeper can become a random candidate, whereas the current CPU
      is a better fallback if it is a housekeeper.
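
      The selection order described above can be sketched in userspace C. This
      is a simplified model under stated assumptions: CPUs are bits in a 64-bit
      mask, cpu is below 64, and topology is ignored, so "nearest" becomes
      "lowest numbered". cpu_mask_t and pick_timer_cpu() are hypothetical names,
      not kernel interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_mask_t;

static bool mask_test(cpu_mask_t mask, int cpu)
{
	return (mask >> cpu) & 1;
}

/* Pick a CPU to run a timer on behalf of 'cpu' (0 <= cpu < 64). */
static int pick_timer_cpu(int cpu, cpu_mask_t housekeeping, cpu_mask_t idle)
{
	bool cpu_is_hk = mask_test(housekeeping, cpu);
	int i;

	/* A busy housekeeper keeps its own timers. */
	if (cpu_is_hk && !mask_test(idle, cpu))
		return cpu;

	/* Visit housekeepers only, never the nohz_full CPUs. */
	for (i = 0; i < 64; i++) {
		if (i == cpu || !mask_test(housekeeping, i))
			continue;
		if (!mask_test(idle, i))
			return i;
	}

	/* No busy housekeeper: prefer the current CPU if it is one. */
	if (cpu_is_hk)
		return cpu;

	/* Otherwise fall back to any housekeeper. */
	for (i = 0; i < 64; i++)
		if (mask_test(housekeeping, i))
			return i;

	return cpu;
}
```

      With CPU 0 as the only housekeeper, a call from a nohz_full CPU tests a
      single candidate instead of walking every CPU in the socket.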
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lkml.kernel.org/r/1578876627-11938-1-git-send-email-wanpengli@tencent.com
      e938b9c9
    • Qais Yousef's avatar
      sched/uclamp: Reject negative values in cpu_uclamp_write() · b562d140
      Qais Yousef authored
      The check that a value written to cpu.uclamp.{min,max} is within the
      range [0:100] wasn't working because of the signed comparison:
      
      	if (req.percent > UCLAMP_PERCENT_SCALE) {
      		req.ret = -ERANGE;
      		return req;
      	}
      
      	# echo -1 > cpu.uclamp.min
      	# cat cpu.uclamp.min
      	42949671.96
      
      Cast req.percent into u64 to force the comparison to be unsigned and
      work as intended in capacity_from_percent().
      
      	# echo -1 > cpu.uclamp.min
      	sh: write error: Numerical result out of range
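
      The effect of the cast can be reproduced in plain C. This is a minimal
      sketch: PERCENT_SCALE and both helper names are assumptions made for
      illustration, with 100.00% assumed to be stored as the integer 10000.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for this sketch: 100.00% stored as an integer with two decimals. */
#define PERCENT_SCALE 10000

/* Signed comparison: any negative value is below the limit and is accepted. */
static bool in_range_signed(int64_t percent)
{
	return !(percent > PERCENT_SCALE);
}

/* Unsigned comparison: a negative value wraps to a huge number and fails. */
static bool in_range_unsigned(int64_t percent)
{
	return !((uint64_t)percent > PERCENT_SCALE);
}
```

      in_range_signed(-1) returns true, which is the bug, while
      in_range_unsigned(-1) returns false. Values from 0 to PERCENT_SCALE
      pass both checks.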
      
      Fixes: 2480c093 ("sched/uclamp: Extend CPU's cgroup controller")
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
      b562d140
    • Mel Gorman's avatar
      sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains · b396f523
      Mel Gorman authored
      The CPU load balancer balances between different domains to spread load
      and strives to have equal balance everywhere. Communicating tasks can
      migrate so they are topologically close to each other but these decisions
      are independent. On a lightly loaded NUMA machine, two communicating tasks
      pulled together at wakeup time can be pushed apart by the load balancer.
      In isolation, the load balancer decision is fine but it ignores the tasks'
      data locality and the wakeup/LB paths continually conflict. NUMA balancing
      is also a factor but it also simply conflicts with the load balancer.
      
      This patch allows a fixed degree of imbalance of two tasks to exist
      between NUMA domains regardless of utilisation levels. In many cases,
      this prevents communicating tasks being pulled apart. It was evaluated
      whether the imbalance should be scaled to the domain size. However, no
      additional benefit was measured across a range of workloads and machines
      and scaling adds the risk that lower domains have to be rebalanced. While
      this could change again in the future, such a change should specify the
      use case and benefit.
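
      The rule can be sketched as a small helper. This is a sketch, not the
      patch itself: the helper name, its parameters and the choice to test the
      number of running tasks in the busiest group are assumptions. Only the
      fixed allowance of two tasks comes from the text above.

```c
#include <stdbool.h>

/* The fixed imbalance of two tasks described in the text. */
#define NUMA_IMBALANCE_MIN 2

/*
 * Return the imbalance the load balancer should act on.
 * Hypothetical helper: a NUMA domain whose busiest group runs at most
 * NUMA_IMBALANCE_MIN tasks is treated as balanced.
 */
static long numa_adjusted_imbalance(long imbalance, bool is_numa_domain,
				    unsigned int busiest_nr_running)
{
	if (is_numa_domain && busiest_nr_running <= NUMA_IMBALANCE_MIN)
		return 0;

	/* Non-NUMA domains and busier groups are balanced as usual. */
	return imbalance;
}
```

      A pair of communicating tasks on one node therefore yields an imbalance
      of 0 and stays together, while a third runnable task restores normal
      balancing.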
      
      The most obvious impact is on netperf TCP_STREAM -- two simple
      communicating tasks with some softirq offload depending on the
      transmission rate.
      
       2-socket Haswell machine 48 core, HT enabled
       netperf-tcp -- mmtests config config-network-netperf-unbound
      			      baseline              lbnuma-v3
       Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
       Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
       Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
       Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
       Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
       Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
       Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
       Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
       Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
       Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
       Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
       Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
       Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
       Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
       Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
       Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
       Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
       Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)
      
      In this case, there was a sizable improvement to performance and
      a general reduction in variance. However, this is not universal.
      For most machines, the impact was roughly a 3% performance gain.
      
       Ops NUMA base-page range updates       19796.00         292.00
       Ops NUMA PTE updates                   19796.00         292.00
       Ops NUMA PMD updates                       0.00           0.00
       Ops NUMA hint faults                   16113.00         143.00
       Ops NUMA hint local faults %            8407.00         142.00
       Ops NUMA hint local percent               52.18          99.30
       Ops NUMA pages migrated                 4244.00           1.00
      
      Without the patch, only 52.18% of sampled accesses are local.  In an
      earlier changelog, 100% of sampled accesses were local and indeed on
      most machines, this was still the case. In this specific case, the
      local sampled rate was 99.3% but note the "base-page range updates"
      and "PTE updates".  The activity with the patch is negligible, as was
      the number of faults. The small number of pages migrated were related to
      shared libraries.  A 2-socket Broadwell showed better results on average
      but they are not presented for brevity as the performance was similar
      except it showed 100% of the sampled NUMA hints were local. The patch
      holds up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
      
      For dbench, the impact depends on the filesystem used and the number of
      clients. On XFS, there is little difference as the clients typically
      communicate with workqueues which have a separate class of scheduler
      problem at the moment. For ext4, performance is generally better,
      particularly for small numbers of clients as NUMA balancing activity is
      negligible with the patch applied.
      
      A more interesting example is the Facebook schbench which uses a
      number of messaging threads to communicate with worker threads. In this
      configuration, one messaging thread is used per NUMA node and the number of
      worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
      for response latency are then reported.
      
       Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
       Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
       Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
       Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
       Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
       Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
       Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
       Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
       Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
       Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
       Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
       Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
       Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
       Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)
      
      For higher numbers of worker threads, the differences become negligible
      but it's interesting to note the difference in wakeup latency at low
      utilisation and mpstat confirms that activity was almost all on one node
      until the number of worker threads increases.
      
      Hackbench generally showed neutral results across a range of machines.
      This is different to earlier versions of the patch which allowed imbalances
      for higher degrees of utilisation. perf bench pipe showed negligible
      differences in overall performance as the differences are very close to
      the noise.
      
      An earlier prototype of the patch showed major regressions for NAS C-class
      when running with only half of the available CPUs -- 20-30% performance
      hits were measured at the time. With this version of the patch, the impact
      is negligible with small gains/losses within the noise measured. This is
      because the number of threads far exceeds the small imbalance the patch
      cares about. Similarly, there were reports of regressions for the autonuma
      benchmark against earlier versions but again, normal load balancing now
      applies for that workload.
      
      In general, the patch simply seeks to avoid unnecessary cross-node
      migrations in the basic case where imbalances are very small.  For low
      utilisation communicating workloads, this patch generally behaves better
      with less NUMA balancing activity. For high utilisation, there is no
      change in behaviour.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Phil Auld <pauld@redhat.com>
      Tested-by: Phil Auld <pauld@redhat.com>
      Link: https://lkml.kernel.org/r/20200114101319.GO3466@techsingularity.net
      b396f523
    • Peter Zijlstra (Intel)'s avatar
      timers/nohz: Update NOHZ load in remote tick · ebc0f83c
      Peter Zijlstra (Intel) authored
      The way loadavg is tracked during nohz only pays attention to the load
      upon entering nohz.  This can be particularly noticeable if full nohz is
      entered while non-idle, and then the cpu goes idle and stays that way for
      a long time.
      
      Use the remote tick to ensure that full nohz cpus report their deltas
      within a reasonable time.
      
      [ swood: Added changelog and removed recheck of stopped tick. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Scott Wood <swood@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
      ebc0f83c
    • Scott Wood's avatar
      sched/core: Don't skip remote tick for idle CPUs · 488603b8
      Scott Wood authored
      This will be used in the next patch to get a loadavg update from
      nohz cpus.  The delta check is skipped because idle_sched_class
      doesn't update se.exec_start.
      Signed-off-by: Scott Wood <swood@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
      488603b8
  2. 20 Jan, 2020 1 commit
  3. 17 Jan, 2020 14 commits
  4. 25 Dec, 2019 9 commits
  5. 23 Dec, 2019 2 commits