  1. 28 Jan, 2020 6 commits
    • x86, sched: Add support for frequency invariance · 1567c3e3
      Giovanni Gherdovich authored
      Implement arch_scale_freq_capacity() for 'modern' x86. This function
      is used by the scheduler to correctly account usage in the face of
      DVFS.
      
      The present patch addresses Intel processors specifically and has positive
      performance and performance-per-watt implications for the schedutil cpufreq
      governor, bringing it closer to, if not on par with, the powersave governor
      from the intel_pstate driver/framework.
      
      Large performance gains are obtained when the machine is lightly loaded and
      no regressions are observed at saturation. The benchmarks with the largest
      gains are kernel compilation, tbench (the networking version of dbench) and
      shell-intensive workloads.
      
      1. FREQUENCY INVARIANCE: MOTIVATION
         * Without it, a task looks larger if the CPU runs slower
      
      2. PECULIARITIES OF X86
         * freq invariance accounting requires knowing the ratio freq_curr/freq_max
         2.1 CURRENT FREQUENCY
             * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
         2.2 MAX FREQUENCY
             * It varies with time (turbo). As an approximation, we set it to a
               constant, i.e. 4-cores turbo frequency.
      
      3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
         * The invariant schedutil's formula has no feedback loop and reacts faster
           to utilization changes
      
      4. KNOWN LIMITATIONS
         * In some cases tasks can't reach max util despite how hard they try
      
      5. PERFORMANCE TESTING
         5.1 MACHINES
             * Skylake, Broadwell, Haswell
         5.2 SETUP
             * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
               active cores turbo w/ invariant schedutil, and intel_pstate/powersave
         5.3 BENCHMARK RESULTS
             5.3.1 NEUTRAL BENCHMARKS
                   * NAS Parallel Benchmark (HPC), hackbench
             5.3.2 NON-NEUTRAL BENCHMARKS
                   * tbench (10-30% better), kernbench (10-15% better),
                     shell-intensive-scripts (30-50% better)
                   * no regressions
             5.3.3 SELECTION OF DETAILED RESULTS
             5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
                   * dbench (5% worse on one machine), kernbench (3% worse),
                     tbench (5-10% better), shell-intensive-scripts (10-40% better)
      
      6. MICROARCH'ES ADDRESSED HERE
         * Xeon Core processors before the Scalable Performance line (Xeon Gold/Platinum
           etc. have different MSR semantics for querying turbo levels)
      
      7. REFERENCES
         * MMTests performance testing framework, github.com/gormanm/mmtests
      
       +-------------------------------------------------------------------------+
       | 1. FREQUENCY INVARIANCE: MOTIVATION
       +-------------------------------------------------------------------------+
      
      For example, suppose a CPU has two frequencies: 500 and 1000 MHz. When
      running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
      appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
      false impression this CPU is almost at capacity, even though it can go
      faster [*]. In a nutshell, without frequency scale-invariance tasks look
      larger just because the CPU is running slower.
      
      [*] (footnote: this assumes a linear frequency/performance relation; which
      everybody knows to be false, but given realities it's the best approximation
      we can make.)
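
      To make the scaling concrete, here is a small self-contained sketch (plain
      userspace C, using the figures from the example above; the relation
      util_inv = util_raw * freq_curr / freq_max is the same approximation
      discussed in section 3):

        #include <stdio.h>

        int main(void)
        {
                double freq_max  = 1000.0;    /* MHz */
                double freq_curr =  500.0;    /* MHz, current DVFS state */
                double util_raw  = 2.0 / 3.0; /* clock-time share seen at 500 MHz */

                /* Scale by freq_curr/freq_max to recover the invariant picture. */
                double util_inv = util_raw * freq_curr / freq_max;

                printf("raw: %.1f%%  invariant: %.1f%%\n",
                       100.0 * util_raw, 100.0 * util_inv); /* 66.7% vs 33.3% */
                return 0;
        }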
      
       +-------------------------------------------------------------------------+
       | 2. PECULIARITIES OF X86
       +-------------------------------------------------------------------------+
      
      Accounting for frequency changes in PELT signals requires the computation of
      the ratio freq_curr / freq_max. On x86 neither of those terms is readily
      available.
      
      2.1 CURRENT FREQUENCY
      ====================
      
      Since modern x86 has hardware control over the actual frequency we run
      at (because amongst other things, Turbo-Mode), we cannot simply use
      the frequency as requested through cpufreq.
      
      Instead we use the APERF/MPERF MSRs to compute the effective frequency
      over the recent past. Also, because reading MSRs is expensive, don't
      do so every time we need the value, but amortize the cost by doing it
      every tick.
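
      As a self-contained illustration of the arithmetic (the deltas are made-up
      numbers; the actual implementation keeps per-CPU copies of the previous
      APERF/MPERF reads and refreshes them from the tick, as described above):

        #include <inttypes.h>
        #include <stdio.h>

        int main(void)
        {
                /* Counter deltas accumulated since the last sample (made up). */
                uint64_t delta_aperf = 150000000;
                uint64_t delta_mperf = 100000000;
                uint64_t freq_base   = 2200;   /* MHz, e.g. the Broadwell box below */

                /* "BusyMHz": effective frequency over the sampling period. */
                uint64_t freq_curr = freq_base * delta_aperf / delta_mperf;

                printf("freq_curr ~= %" PRIu64 " MHz\n", freq_curr);  /* 3300 MHz */
                return 0;
        }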
      
      2.2 MAX FREQUENCY
      =================
      
      Obtaining freq_max is also non-trivial because at any time the hardware can
      provide a frequency boost to a selected subset of cores if the package has
      enough power to spare (eg: Turbo Boost). This means that the maximum frequency
      available to a given core changes with time.
      
      The approach taken in this change is to arbitrarily set freq_max to a constant
      value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
      microarchitectures, after evaluating the following candidates:
      
          * 1-core (1C) turbo frequency (the fastest turbo state available)
          * around base frequency (a.k.a. max P-state)
          * something in between, such as 4C turbo
      
      To interpret these options, consider that this is the denominator in
      freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
      util_avg and load_avg. A large denominator will undershoot (util_avg looks a
      bit smaller than it really is); vice versa, with a smaller denominator PELT
      signals will tend to overshoot. Given that PELT drives frequency selection
      in the schedutil governor, we will have:
      
          freq_max set to     | effect on DVFS
          --------------------+------------------
          1C turbo            | power efficiency (lower freq choices)
          base freq           | performance (higher util_avg, higher freq requests)
          4C turbo            | a bit of both
      
      4C turbo proves to be a good compromise in a number of benchmarks (see below).
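
      The following self-contained sketch shows how the two quantities can be
      combined into a single scaling factor, assuming the chosen freq_max is
      folded into a fixed-point ratio against the base frequency at boot
      (SCHED_CAPACITY_SCALE, 1024, is the scheduler's fixed-point "1.0"; the
      APERF/MPERF deltas are made up, and the real code differs in detail):

        #include <inttypes.h>
        #include <stdio.h>

        #define SCHED_CAPACITY_SCALE 1024ULL

        int main(void)
        {
                /* 80x-BROADWELL-NUMA figures from section 5.1, in MHz. */
                uint64_t base_freq = 2200;
                uint64_t turbo_4c  = 3300;

                /* Computed once at boot: freq_max relative to the base frequency. */
                uint64_t max_freq_ratio = turbo_4c * SCHED_CAPACITY_SCALE / base_freq;

                /*
                 * Per tick: freq_curr/freq_max as a 0..1024 factor.  Since
                 * delta_APERF/delta_MPERF is freq_curr/freq_base, dividing it
                 * by max_freq_ratio yields freq_curr/freq_max.
                 */
                uint64_t delta_aperf = 140, delta_mperf = 100;
                uint64_t freq_scale = delta_aperf * SCHED_CAPACITY_SCALE *
                                      SCHED_CAPACITY_SCALE /
                                      (delta_mperf * max_freq_ratio);

                if (freq_scale > SCHED_CAPACITY_SCALE)
                        freq_scale = SCHED_CAPACITY_SCALE;

                printf("ratio=%" PRIu64 " scale=%" PRIu64 "\n",
                       max_freq_ratio, freq_scale);   /* ratio=1536 scale=955 */
                return 0;
        }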
      
       +-------------------------------------------------------------------------+
       | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
       +-------------------------------------------------------------------------+
      
      Once an architecture implements a frequency scale-invariant utilization (the
      PELT signal util_avg), schedutil switches its frequency selection formula from
      
          freq_next = 1.25 * freq_curr * util            [non-invariant util signal]
      
      to
      
          freq_next = 1.25 * freq_max * util             [invariant util signal]
      
      where, in the second formula, freq_max is set to the 1C turbo frequency (max
      turbo). The advantage of the second formula, whose usage we unlock with this
      patch, is that freq_next doesn't depend on the current frequency in an
      iterative fashion, but can jump to any frequency in a single update. This
      absence of feedback in the formula makes it quicker to react to utilization
      changes and more robust against pathological instabilities.
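
      A rough self-contained sketch of the two selection rules (the 1.25 headroom
      is written as freq + freq/4; frequencies and the utilization value are made
      up, and the same util is fed to both formulas for simplicity even though in
      reality the non-invariant signal is itself frequency-dependent):

        #include <stdio.h>

        /* freq_next = 1.25 * freq * util / max */
        static unsigned long next_freq(unsigned long freq, unsigned long util,
                                       unsigned long max)
        {
                return (freq + (freq >> 2)) * util / max;
        }

        int main(void)
        {
                unsigned long max  = 1024;          /* full capacity          */
                unsigned long util = 512;           /* half-utilized CPU      */
                unsigned long freq_curr = 2200000;  /* kHz, currently at base */
                unsigned long freq_max  = 3900000;  /* kHz, 1C turbo          */

                /* Non-invariant: anchored to the current frequency, so it needs
                 * several iterations to ramp up to the right target. */
                printf("non-invariant: %lu kHz\n", next_freq(freq_curr, util, max));

                /* Invariant: can jump straight to the target in one update. */
                printf("invariant:     %lu kHz\n", next_freq(freq_max, util, max));
                return 0;
        }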
      
      Compare it to the update formula of intel_pstate/powersave:
      
          freq_next = 1.25 * freq_max * Busy%
      
      where again freq_max is 1C turbo and Busy% is the percentage of time not spent
      idling (calculated with delta_MPERF / delta_TSC); essentially the same as
      invariant schedutil, and largely responsible for intel_pstate/powersave's good
      reputation. The non-invariant schedutil formula is derived from the invariant
      one by approximating util_inv with util_raw * freq_curr / freq_max, but this
      has limitations.
      
      Testing shows improved performance due to better frequency selections when
      the machine is lightly loaded, and essentially no change in behaviour at
      saturation / overutilization.
      
       +-------------------------------------------------------------------------+
       | 4. KNOWN LIMITATIONS
       +-------------------------------------------------------------------------+
      
      It's been shown that it is possible to create pathological scenarios where a
      CPU-bound task cannot reach max utilization, if the normalizing factor
      freq_max is fixed to a constant value (see [Lelli-2018]).
      
      If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
      cores in a package doing some busywork, and observe that none of those tasks
      will ever reach max util (1024) because they're all running at less than the
      4C turbo frequency.
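
      To put numbers on it using the 80x-BROADWELL-NUMA machine from section 5.1:
      with eight cores pegged, each runs at most at the 8C turbo frequency of
      2900 MHz while freq_max is the 4C turbo of 3300 MHz, so util_avg saturates
      at roughly 2900/3300 * 1024 ~= 900 rather than 1024.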
      
      While this concern still applies, we believe the performance benefit of
      frequency scale-invariant PELT signals outweighs the cost of this limitation.
      
       [Lelli-2018]
       https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/
      
       +-------------------------------------------------------------------------+
       | 5. PERFORMANCE TESTING
       +-------------------------------------------------------------------------+
      
      5.1 MACHINES
      ============
      
      We tested the patch on three machines, with Skylake, Broadwell and Haswell
      CPUs. The details are below, together with the available turbo ratios as
      reported by the appropriate MSRs.
      
      * 8x-SKYLAKE-UMA:
        Single socket E3-1240 v5, Skylake 4 cores/8 threads
        Max EFFiciency, BASE frequency and available turbo levels (MHz):
      
          EFFIC    800 |********
          BASE    3500 |***********************************
          4C      3700 |*************************************
          3C      3800 |**************************************
          2C      3900 |***************************************
          1C      3900 |***************************************
      
      * 80x-BROADWELL-NUMA:
        Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
        Max EFFiciency, BASE frequency and available turbo levels (MHz):
      
          EFFIC   1200 |************
          BASE    2200 |**********************
          8C      2900 |*****************************
          7C      3000 |******************************
          6C      3100 |*******************************
          5C      3200 |********************************
          4C      3300 |*********************************
          3C      3400 |**********************************
          2C      3600 |************************************
          1C      3600 |************************************
      
      * 48x-HASWELL-NUMA
        Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
        Max EFFiciency, BASE frequency and available turbo levels (MHz):
      
          EFFIC   1200 |************
          BASE    2300 |***********************
          12C     2600 |**************************
          11C     2600 |**************************
          10C     2600 |**************************
          9C      2600 |**************************
          8C      2600 |**************************
          7C      2600 |**************************
          6C      2600 |**************************
          5C      2700 |***************************
          4C      2800 |****************************
          3C      2900 |*****************************
          2C      3100 |*******************************
          1C      3100 |*******************************
      
      5.2 SETUP
      =========
      
      * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
        driver in passive mode.
      * The rationale for choosing the various freq_max values to test has been to
        try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
        on all machines), plus one more value closer to base_freq but still in the
        turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
      * In addition we've run all tests with intel_pstate/powersave for comparison.
      * The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
      * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
        with active intel_pstate on this machine use that.
      
      This gives, in terms of combinations tested on each machine:
      
      * 8x-SKYLAKE-UMA
        * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
        * intel_pstate active + powersave + HWP
        * invariant schedutil, freq_max = 1C turbo
        * invariant schedutil, freq_max = 3C turbo
        * invariant schedutil, freq_max = 4C turbo
      
      * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
        * [same as 8x-SKYLAKE-UMA, but not HWP capable]
        * invariant schedutil, freq_max = 8C turbo
          (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")
      
      5.3 BENCHMARK RESULTS
      =====================
      
      5.3.1 NEUTRAL BENCHMARKS
      ------------------------
      
      Tests that didn't show any measurable difference in performance on any of the
      test machines between non-invariant schedutil and our patch are:
      
      * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
        computational kernel
      * flexible I/O (FIO)
      * hackbench (using threads or processes, and using pipes or sockets)
      
      5.3.2 NON-NEUTRAL BENCHMARKS
      ----------------------------
      
      What follows are summary tables where each benchmark result is given a score.
      
      * A tilde (~) means a neutral result, i.e. no difference from baseline.
      * Scores are computed with the ratio result_new / result_baseline, so a tilde
        means a score of 1.00.
      * The results in the score ratio are the geometric means of results running
        the benchmark with different parameters (e.g. for kernbench: using 1, 2, 4,
        ... number of processes; for pgbench: varying the number of clients, and so
        on); a small example of the scoring computation is sketched right after
        this list.
      * The first three tables show higher-is-better kind of tests (i.e. measured in
        operations/second), the subsequent three show lower-is-better kind of tests
        (i.e. the workload is fixed and we measure elapsed time, think kernbench).
      * "gitsource" is a name we made up for the test consisting in running the
        entire unit tests suite of the Git SCM and measuring how long it takes. We
        take it as a typical example of shell-intensive serialized workload.
      * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
        columns show invariant schedutil for different values of freq_max. 4C turbo
        is circled as it's the value we've chosen for the final implementation.
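
      A small self-contained example of the scoring computation, using the first
      three baseline and 3C-turbo tbench4 data points from the 48x-HASWELL-NUMA
      table in section 5.3.3 (the geometric mean of the per-parameter ratios is
      the same as the ratio of the geometric means); the real scores use all the
      data points, of course:

        #include <math.h>
        #include <stdio.h>

        /* Geometric mean of result_new[i] / result_baseline[i]. */
        static double score(const double *new_r, const double *base_r, int n)
        {
                double acc = 0.0;
                for (int i = 0; i < n; i++)
                        acc += log(new_r[i] / base_r[i]);
                return exp(acc / n);
        }

        int main(void)
        {
                double base_r[] = { 126.73, 258.04, 514.30 };  /* baseline MB/sec */
                double new_r[]  = { 128.43, 311.70, 641.98 };  /* 3C-turbo MB/sec */

                printf("score = %.2f\n", score(new_r, base_r, 3));  /* ~1.15 */
                return 0;
        }

      (Build with -lm.)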
      
      80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                               +------+
                       I_PSTATE   1C     3C    | 4C   |  8C
      pgbench-ro           1.14   ~      ~     | 1.11 |  1.14
      pgbench-rw           ~      ~      ~     | ~    |  ~
      netperf-udp          1.06   ~      1.06  | 1.05 |  1.07
      netperf-tcp          ~      1.03   ~     | 1.01 |  1.02
      tbench4              1.57   1.18   1.22  | 1.30 |  1.56
                                               +------+
      
      8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                               +------+
                   I_PSTATE/HWP   1C     3C    | 4C   |
      pgbench-ro           ~      ~      ~     | ~    |
      pgbench-rw           ~      ~      ~     | ~    |
      netperf-udp          ~      ~      ~     | ~    |
      netperf-tcp          ~      ~      ~     | ~    |
      tbench4              1.30   1.14   1.14  | 1.16 |
                                               +------+
      
      48x-HASWELL-NUMA (comparison ratio; higher is better)
                                               +------+
                       I_PSTATE   1C     3C    | 4C   |  12C
      pgbench-ro           1.15   ~      ~     | 1.06 |  1.16
      pgbench-rw           ~      ~      ~     | ~    |  ~
      netperf-udp          1.05   0.97   1.04  | 1.04 |  1.02
      netperf-tcp          0.96   1.01   1.01  | 1.01 |  1.01
      tbench4              1.50   1.05   1.13  | 1.13 |  1.25
                                               +------+
      
      In the table above we see that active intel_pstate is slightly better than our
      4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
      read-only pgbench and much better on tbench. Both cases are notable in that
      they show how lowering our freq_max (to 8C turbo and 12C turbo on
      80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
      schedutil get closer.
      
      If we ignore active intel_pstate and focus on the comparison with baseline
      alone, there are several instances of double-digit performance improvement.
      
      80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                               +------+
                       I_PSTATE   1C     3C    | 4C   |  8C
      dbench4              1.23   0.95   0.95  | 0.95 |  0.95
      kernbench            0.93   0.83   0.83  | 0.83 |  0.82
      gitsource            0.98   0.49   0.49  | 0.49 |  0.48
                                               +------+
      
      8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                               +------+
                   I_PSTATE/HWP   1C     3C    | 4C   |
      dbench4              ~      ~      ~     | ~    |
      kernbench            ~      ~      ~     | ~    |
      gitsource            0.92   0.55   0.55  | 0.55 |
                                               +------+
      
      48x-HASWELL-NUMA (comparison ratio; lower is better)
                                               +------+
                       I_PSTATE   1C     3C    | 4C   |  12C
      dbench4              ~      ~      ~     | ~    |  ~
      kernbench            0.94   0.90   0.89  | 0.90 |  0.90
      gitsource            0.97   0.69   0.69  | 0.69 |  0.69
                                               +------+
      
      dbench is not very remarkable here, unless we notice how poorly active
      intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
      non-invariant schedutil. We repeated that run getting consistent results. Out
      of scope for the patch at hand, but deserving future investigation. Other than
      that, we previously ran this campaign with Linux v5.0 and saw the patch doing
      better on dbench at the time. We haven't checked closely and can only speculate
      at this point.
      
      On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
      the detailed tables that the gains concentrate on low process counts (lightly
      loaded machines).
      
      The test we call "gitsource" (running the git unit test suite, a long-running
      single-threaded shell script) appears rather spectacular in this table (gains
      of 30-50% depending on the machine). It is to be noted, however, that
      gitsource has no adjustable parameters (such as the number of jobs in
      kernbench, which we average over in order to get a single-number summary
      score) and is exactly the kind of low-parallelism workload that benefits the
      most from this patch. When looking at the detailed tables of kernbench or
      tbench4, at low process or client counts one can see similar numbers.
      
      5.3.3 SELECTION OF DETAILED RESULTS
      -----------------------------------
      
      Machine            : 48x-HASWELL-NUMA
      Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
      Varying parameter  : number of clients
      Unit               : MB/sec (higher is better)
      
                         5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean  1        126.73  +- 0.31% (        )      315.91  +- 0.66% ( 149.28%)      125.03  +- 0.76% (  -1.34%)
      Hmean  2        258.04  +- 0.62% (        )      614.16  +- 0.51% ( 138.01%)      269.58  +- 1.45% (   4.47%)
      Hmean  4        514.30  +- 0.67% (        )     1146.58  +- 0.54% ( 122.94%)      533.84  +- 1.99% (   3.80%)
      Hmean  8       1111.38  +- 2.52% (        )     2159.78  +- 0.38% (  94.33%)     1359.92  +- 1.56% (  22.36%)
      Hmean  16      2286.47  +- 1.36% (        )     3338.29  +- 0.21% (  46.00%)     2720.20  +- 0.52% (  18.97%)
      Hmean  32      4704.84  +- 0.35% (        )     4759.03  +- 0.43% (   1.15%)     4774.48  +- 0.30% (   1.48%)
      Hmean  64      7578.04  +- 0.27% (        )     7533.70  +- 0.43% (  -0.59%)     7462.17  +- 0.65% (  -1.53%)
      Hmean  128     6998.52  +- 0.16% (        )     6987.59  +- 0.12% (  -0.16%)     6909.17  +- 0.14% (  -1.28%)
      Hmean  192     6901.35  +- 0.25% (        )     6913.16  +- 0.10% (   0.17%)     6855.47  +- 0.21% (  -0.66%)
      
                                   5.2.0 3C-turbo                   5.2.0 4C-turbo                  5.2.0 12C-turbo
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Hmean  1        128.43  +- 0.28% (   1.34%)      130.64  +- 3.81% (   3.09%)      153.71  +- 5.89% (  21.30%)
      Hmean  2        311.70  +- 6.15% (  20.79%)      281.66  +- 3.40% (   9.15%)      305.08  +- 5.70% (  18.23%)
      Hmean  4        641.98  +- 2.32% (  24.83%)      623.88  +- 5.28% (  21.31%)      906.84  +- 4.65% (  76.32%)
      Hmean  8       1633.31  +- 1.56% (  46.96%)     1714.16  +- 0.93% (  54.24%)     2095.74  +- 0.47% (  88.57%)
      Hmean  16      3047.24  +- 0.42% (  33.27%)     3155.02  +- 0.30% (  37.99%)     3634.58  +- 0.15% (  58.96%)
      Hmean  32      4734.31  +- 0.60% (   0.63%)     4804.38  +- 0.23% (   2.12%)     4674.62  +- 0.27% (  -0.64%)
      Hmean  64      7699.74  +- 0.35% (   1.61%)     7499.72  +- 0.34% (  -1.03%)     7659.03  +- 0.25% (   1.07%)
      Hmean  128     6935.18  +- 0.15% (  -0.91%)     6942.54  +- 0.10% (  -0.80%)     7004.85  +- 0.12% (   0.09%)
      Hmean  192     6901.62  +- 0.12% (   0.00%)     6856.93  +- 0.10% (  -0.64%)     6978.74  +- 0.10% (   1.12%)
      
      This is one of the cases where the patch still can't surpass active
      intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
      visible up to 16 clients and the saturated scenario is the same as baseline.
      
      The scores in the summary table from the previous sections are ratios of
      geometric means of the results over different clients, as seen in this table.
      
      Machine            : 80x-BROADWELL-NUMA
      Benchmark          : kernbench (kernel compilation)
      Varying parameter  : number of jobs
      Unit               : seconds (lower is better)
      
                         5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean  2        379.68  +- 0.06% (        )      330.20  +- 0.43% (  13.03%)      285.93  +- 0.07% (  24.69%)
      Amean  4        200.15  +- 0.24% (        )      175.89  +- 0.22% (  12.12%)      153.78  +- 0.25% (  23.17%)
      Amean  8        106.20  +- 0.31% (        )       95.54  +- 0.23% (  10.03%)       86.74  +- 0.10% (  18.32%)
      Amean  16        56.96  +- 1.31% (        )       53.25  +- 1.22% (   6.50%)       48.34  +- 1.73% (  15.13%)
      Amean  32        34.80  +- 2.46% (        )       33.81  +- 0.77% (   2.83%)       30.28  +- 1.59% (  12.99%)
      Amean  64        26.11  +- 1.63% (        )       25.04  +- 1.07% (   4.10%)       22.41  +- 2.37% (  14.16%)
      Amean  128       24.80  +- 1.36% (        )       23.57  +- 1.23% (   4.93%)       21.44  +- 1.37% (  13.55%)
      Amean  160       24.85  +- 0.56% (        )       23.85  +- 1.17% (   4.06%)       21.25  +- 1.12% (  14.49%)
      
                                   5.2.0 3C-turbo                   5.2.0 4C-turbo                   5.2.0 8C-turbo
      - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean  2        284.08  +- 0.13% (  25.18%)      283.96  +- 0.51% (  25.21%)      285.05  +- 0.21% (  24.92%)
      Amean  4        153.18  +- 0.22% (  23.47%)      154.70  +- 1.64% (  22.71%)      153.64  +- 0.30% (  23.24%)
      Amean  8         87.06  +- 0.28% (  18.02%)       86.77  +- 0.46% (  18.29%)       86.78  +- 0.22% (  18.28%)
      Amean  16        48.03  +- 0.93% (  15.68%)       47.75  +- 1.99% (  16.17%)       47.52  +- 1.61% (  16.57%)
      Amean  32        30.23  +- 1.20% (  13.14%)       30.08  +- 1.67% (  13.57%)       30.07  +- 1.67% (  13.60%)
      Amean  64        22.59  +- 2.02% (  13.50%)       22.63  +- 0.81% (  13.32%)       22.42  +- 0.76% (  14.12%)
      Amean  128       21.37  +- 0.67% (  13.82%)       21.31  +- 1.15% (  14.07%)       21.17  +- 1.93% (  14.63%)
      Amean  160       21.68  +- 0.57% (  12.76%)       21.18  +- 1.74% (  14.77%)       21.22  +- 1.00% (  14.61%)
      
      The patch outperforms active intel_pstate (and baseline) by a considerable
      margin; the summary table from the previous section says 4C turbo and active
      intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
      0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
      noticeable difference with regard to the value of freq_max.
      
      Machine            : 8x-SKYLAKE-UMA
      Benchmark          : gitsource (time to run the git unit test suite)
      Varying parameter  : none
      Unit               : seconds (lower is better)
      
                                  5.2.0 vanilla           5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
      - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean         858.85  +- 1.16% (        )      791.94  +- 0.21% (   7.79%)      474.95 (  44.70%)
      
                                 5.2.0 3C-turbo                   5.2.0 4C-turbo
      - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      Amean         475.26  +- 0.20% (  44.66%)      474.34  +- 0.13% (  44.77%)
      
      In this test, which is of interest as representing shell-intensive
      (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
      intel_pstate/powersave by a whopping 40% margin.
      
      5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
      ---------------------------------------------
      
      The following table shows average power consumption in watts for each
      benchmark. Data comes from turbostat (package average), which in turn is read
      from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
      it's reasonable to ignore other power consumers (such as memory or I/O). Also,
      we don't have a power meter available in the lab so RAPL is the best we have.
      
      turbostat sampled average power every 10 seconds for the entire duration of
      each benchmark. We took all those values and averaged them (i.e. we don't
      have detail at a per-parameter granularity, only for whole benchmarks).
      
      80x-BROADWELL-NUMA (power consumption, watts)
                                                          +--------+
                     BASELINE I_PSTATE       1C       3C  |     4C |      8C
      pgbench-ro       130.01   142.77   131.11   132.45  | 134.65 |  136.84
      pgbench-rw        68.30    60.83    71.45    71.70  |  71.65 |   72.54
      dbench4           90.25    59.06   101.43    99.89  | 101.10 |  102.94
      netperf-udp       65.70    69.81    66.02    68.03  |  68.27 |   68.95
      netperf-tcp       88.08    87.96    88.97    88.89  |  88.85 |   88.20
      tbench4          142.32   176.73   153.02   163.91  | 165.58 |  176.07
      kernbench         92.94   101.95   114.91   115.47  | 115.52 |  115.10
      gitsource         40.92    41.87    75.14    75.20  |  75.40 |   75.70
                                                          +--------+
      8x-SKYLAKE-UMA (power consumption, watts)
                                                          +--------+
                    BASELINE I_PSTATE/HWP    1C       3C  |     4C |
      pgbench-ro        46.49    46.68    46.56    46.59  |  46.52 |
      pgbench-rw        29.34    31.38    30.98    31.00  |  31.00 |
      dbench4           27.28    27.37    27.49    27.41  |  27.38 |
      netperf-udp       22.33    22.41    22.36    22.35  |  22.36 |
      netperf-tcp       27.29    27.29    27.30    27.31  |  27.33 |
      tbench4           41.13    45.61    43.10    43.33  |  43.56 |
      kernbench         42.56    42.63    43.01    43.01  |  43.01 |
      gitsource         13.32    13.69    17.33    17.30  |  17.35 |
                                                          +--------+
      48x-HASWELL-NUMA (power consumption, watts)
                                                          +--------+
                     BASELINE I_PSTATE       1C       3C  |     4C |     12C
      pgbench-ro       128.84   136.04   129.87   132.43  | 132.30 |  134.86
      pgbench-rw        37.68    37.92    37.17    37.74  |  37.73 |   37.31
      dbench4           28.56    28.73    28.60    28.73  |  28.70 |   28.79
      netperf-udp       56.70    60.44    56.79    57.42  |  57.54 |   57.52
      netperf-tcp       75.49    75.27    75.87    76.02  |  76.01 |   75.95
      tbench4          115.44   139.51   119.53   123.07  | 123.97 |  130.22
      kernbench         83.23    91.55    95.58    95.69  |  95.72 |   96.04
      gitsource         36.79    36.99    39.99    40.34  |  40.35 |   40.23
                                                          +--------+
      
      A lower power consumption isn't necessarily better; it depends on what is done
      with that energy. Here are tables with the ratio of performance-per-watt on
      each machine and benchmark. Higher is always better; a tilde (~) means a
      neutral ratio (i.e. 1.00).
      
      80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                           +------+
                   I_PSTATE     1C     3C  |   4C |    8C
      pgbench-ro       1.04   1.06   0.94  | 1.07 |  1.08
      pgbench-rw       1.10   0.97   0.96  | 0.96 |  0.97
      dbench4          1.24   0.94   0.95  | 0.94 |  0.92
      netperf-udp      ~      1.02   1.02  | ~    |  1.02
      netperf-tcp      ~      1.02   ~     | ~    |  1.02
      tbench4          1.26   1.10   1.06  | 1.12 |  1.26
      kernbench        0.98   0.97   0.97  | 0.97 |  0.98
      gitsource        ~      1.11   1.11  | 1.11 |  1.13
                                           +------+
      
      8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                           +------+
               I_PSTATE/HWP     1C     3C  |   4C |
      pgbench-ro       ~      ~      ~     | ~    |
      pgbench-rw       0.95   0.97   0.96  | 0.96 |
      dbench4          ~      ~      ~     | ~    |
      netperf-udp      ~      ~      ~     | ~    |
      netperf-tcp      ~      ~      ~     | ~    |
      tbench4          1.17   1.09   1.08  | 1.10 |
      kernbench        ~      ~      ~     | ~    |
      gitsource        1.06   1.40   1.40  | 1.40 |
                                           +------+
      
      48x-HASWELL-NUMA  (performance-per-watt ratios; higher is better)
                                           +------+
                   I_PSTATE     1C     3C  |   4C |   12C
      pgbench-ro       1.09   ~      1.09  | 1.03 |  1.11
      pgbench-rw       ~      0.86   ~     | ~    |  0.86
      dbench4          ~      1.02   1.02  | 1.02 |  ~
      netperf-udp      ~      0.97   1.03  | 1.02 |  ~
      netperf-tcp      0.96   ~      ~     | ~    |  ~
      tbench4          1.24   ~      1.06  | 1.05 |  1.11
      kernbench        0.97   0.97   0.98  | 0.97 |  0.96
      gitsource        1.03   1.33   1.32  | 1.32 |  1.33
                                           +------+
      
      These results are overall pleasing: in plenty of cases we observe
      performance-per-watt improvements. The few regressions (read/write pgbench and
      dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
      percentage points (it has a 10-15% performance improvement, but apparently the
      increase in power consumption is larger than that). tbench4 and gitsource, which
      benefit the most from the patch, keep a positive score in this table which is
      a welcome surprise; that suggests that in those particular workloads the
      non-invariant schedutil (and active intel_pstate, too) makes some rather
      suboptimal frequency selections.
      
      +-------------------------------------------------------------------------+
      | 6. MICROARCH'ES ADDRESSED HERE
      +-------------------------------------------------------------------------+
      
      The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
      MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
      respectively. This excludes the recent Xeon Scalable Performance processors
      line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.
      
      Subsequent patches will address:
      
      * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
      * Xeon Phi (Knights Landing, Knights Mill)
      * Atom Silvermont
      
      +-------------------------------------------------------------------------+
      | 7. REFERENCES
      +-------------------------------------------------------------------------+
      
      Tests have been run with the help of the MMTests performance testing
      framework, see github.com/gormanm/mmtests. The configuration file names for
      the benchmarks used are:
      
          db-pgbench-timed-ro-small-xfs
          db-pgbench-timed-rw-small-xfs
          io-dbench4-async-xfs
          network-netperf-unbound
          network-tbench
          scheduler-unbound
          workload-kerndevel-xfs
          workload-shellscripts-xfs
          hpc-nas-c-class-mpi-full-xfs
          hpc-nas-c-class-omp-full
      
      All those benchmarks are generally available on the web:
      
      pgbench: https://www.postgresql.org/docs/10/pgbench.html
      netperf: https://hewlettpackard.github.io/netperf/
      dbench/tbench: https://dbench.samba.org/
      gitsource: git unit test suite, github.com/git/git
      NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
      hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Doug Smythies <dsmythies@telus.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
      1567c3e3
    • sched/fair: Prevent unlimited runtime on throttled group · 2a4b03ff
      Vincent Guittot authored
      When a running task is moved onto a throttled task group and there is no
      other task enqueued on the CPU, the task can keep running using 100% of the
      CPU whatever the allocated bandwidth for the group, and even though its
      cfs_rq is throttled. Furthermore, the group entity of the cfs_rq and its parents are
      not enqueued but only set as curr on their respective cfs_rqs.
      
      We have the following sequence:
      
      sched_move_task
        -dequeue_task: dequeue task and group_entities.
        -put_prev_task: put task and group entities.
        -sched_change_group: move task to new group.
        -enqueue_task: enqueue only task but not group entities because cfs_rq is
          throttled.
        -set_next_task : set task and group_entities as current sched_entity of
          their cfs_rq.
      
      Another impact is that the root cfs_rq runnable_load_avg at root rq stays
      null because the group_entities are not enqueued. This situation will stay
      the same until an "external" event triggers a reschedule. Let's trigger it
      immediately instead.
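
      A sketch of the shape such a fix can take (kernel-style fragment, not
      necessarily the literal patch; the idea is simply to force a reschedule of
      the current task once the group change is done):

        if (running) {
                set_next_task(rq, tsk);
                /*
                 * The running task may now sit on a throttled cfs_rq; force a
                 * reschedule so that bandwidth control takes effect immediately.
                 */
                resched_curr(rq);
        }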
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Ben Segall <bsegall@google.com>
      Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
      2a4b03ff
    • sched/nohz: Optimize get_nohz_timer_target() · e938b9c9
      Wanpeng Li authored
      On one machine, CPU 0 is used for housekeeping and the other 39 CPUs in the
      same socket are in nohz_full mode. With ftrace we can observe a huge amount
      of time burned in the loop searching for the nearest busy housekeeping CPU.
      
        2)               |                        get_nohz_timer_target() {
        2)   0.240 us    |                          housekeeping_test_cpu();
        2)   0.458 us    |                          housekeeping_test_cpu();
      
        ...
      
        2)   0.292 us    |                          housekeeping_test_cpu();
        2)   0.240 us    |                          housekeeping_test_cpu();
        2)   0.227 us    |                          housekeeping_any_cpu();
        2) + 43.460 us   |                        }
      
      This patch optimizes the search by finding the nearest busy housekeeping CPU
      directly in the housekeeping cpumask; it brings the worst-case search time
      down from ~44us to under 10us in my testing. In addition, the last iterated
      busy housekeeping CPU was a rather random candidate, while the current CPU
      is a better fallback if it is itself a housekeeping CPU.
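
      A schematic sketch of the new search (not the literal patch; helper names
      as used by the housekeeping/topology code, with HK_FLAG_TIMER assumed to be
      the relevant housekeeping flag):

        int i, cpu = smp_processor_id(), default_cpu = -1;
        struct sched_domain *sd;

        if (housekeeping_cpu(cpu, HK_FLAG_TIMER)) {
                if (!idle_cpu(cpu))
                        return cpu;      /* current CPU is a busy housekeeper */
                default_cpu = cpu;       /* ...or at least a sane fallback    */
        }

        /* Walk outwards through the sched domains, but only visit CPUs in the
         * housekeeping mask instead of testing every CPU of each domain. */
        rcu_read_lock();
        for_each_domain(cpu, sd) {
                for_each_cpu_and(i, sched_domain_span(sd),
                                 housekeeping_cpumask(HK_FLAG_TIMER)) {
                        if (i != cpu && !idle_cpu(i)) {
                                rcu_read_unlock();
                                return i;        /* nearest busy housekeeper */
                        }
                }
        }
        rcu_read_unlock();

        /* Nothing busy nearby: any housekeeping CPU will do. */
        return default_cpu != -1 ? default_cpu : housekeeping_any_cpu(HK_FLAG_TIMER);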
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Link: https://lkml.kernel.org/r/1578876627-11938-1-git-send-email-wanpengli@tencent.com
      e938b9c9
    • sched/uclamp: Reject negative values in cpu_uclamp_write() · b562d140
      Qais Yousef authored
      The check to ensure that the value newly written into cpu.uclamp.{min,max}
      is within range, [0:100], wasn't working because of the signed
      comparison:
      
        if (req.percent > UCLAMP_PERCENT_SCALE) {
                req.ret = -ERANGE;
                return req;
        }
      
      	# echo -1 > cpu.uclamp.min
      	# cat cpu.uclamp.min
      	42949671.96
      
      Cast req.percent into u64 to force the comparison to be unsigned and
      work as intended in capacity_from_percent().
      
      	# echo -1 > cpu.uclamp.min
      	sh: write error: Numerical result out of range
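
      A tiny userspace demonstration of why the cast matters (10000 here stands
      in for the scale constant, i.e. 100.00% expressed in 0.01% units; that
      value is an assumption for the sake of the example):

        #include <stdio.h>

        int main(void)
        {
                long long percent = -1;          /* what "echo -1" parses to   */
                const long long scale = 10000;   /* assumed: 100.00% in 0.01%s */

                /* Signed: -1 > 10000 is false, the bogus value slips through. */
                printf("signed:   %s\n", percent > scale ? "rejected" : "accepted");

                /* Unsigned: (u64)-1 is a huge number, the value is caught. */
                printf("unsigned: %s\n",
                       (unsigned long long)percent > (unsigned long long)scale ?
                       "rejected" : "accepted");
                return 0;
        }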
      
      Fixes: 2480c093 ("sched/uclamp: Extend CPU's cgroup controller")
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
      b562d140
    • timers/nohz: Update NOHZ load in remote tick · ebc0f83c
      Peter Zijlstra (Intel) authored
      The way loadavg is tracked during nohz only pays attention to the load
      upon entering nohz.  This can be particularly noticeable if full nohz is
      entered while non-idle, and then the cpu goes idle and stays that way for
      a long time.
      
      Use the remote tick to ensure that full nohz cpus report their deltas
      within a reasonable time.
      
      [ swood: Added changelog and removed recheck of stopped tick. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Scott Wood <swood@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
      ebc0f83c
    • sched/core: Don't skip remote tick for idle CPUs · 488603b8
      Scott Wood authored
      This will be used in the next patch to get a loadavg update from
      nohz cpus.  The delta check is skipped because idle_sched_class
      doesn't update se.exec_start.
      Signed-off-by: Scott Wood <swood@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
      488603b8
  2. 17 Jan, 2020 2 commits
  3. 25 Dec, 2019 2 commits
  4. 17 Dec, 2019 1 commit
  5. 17 Nov, 2019 1 commit
  6. 15 Nov, 2019 1 commit
  7. 13 Nov, 2019 1 commit
    • sched/core: Avoid spurious lock dependencies · ff51ff84
      Peter Zijlstra authored
      While seemingly harmless, __sched_fork() does hrtimer_init(), which,
      when DEBUG_OBJECTS is enabled, can end up doing allocations.
      
      This then results in the following lock order:
      
        rq->lock
          zone->lock.rlock
            batched_entropy_u64.lock
      
      Which in turn causes deadlocks when we do wakeups while holding that
      batched_entropy lock -- as the random code does.
      
      Solve this by moving __sched_fork() out from under rq->lock. This is
      safe because nothing there relies on rq->lock, as also evident from the
      other __sched_fork() callsite.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: bigeasy@linutronix.de
      Cc: cl@linux.com
      Cc: keescook@chromium.org
      Cc: penberg@kernel.org
      Cc: rientjes@google.com
      Cc: thgarnie@google.com
      Cc: tytso@mit.edu
      Cc: will@kernel.org
      Fixes: b7d5dc21 ("random: add a spinlock_t to struct batched_entropy")
      Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ff51ff84
  8. 11 Nov, 2019 3 commits
  9. 08 Nov, 2019 2 commits
    • sched: Fix pick_next_task() vs 'change' pattern race · 6e2df058
      Peter Zijlstra authored
      Commit 67692435 ("sched: Rework pick_next_task() slow-path")
      inadvertently introduced a race because it changed a previously
      unexplored dependency between dropping the rq->lock and
      sched_class::put_prev_task().
      
      The comments about dropping rq->lock, in for example
      newidle_balance(), only mention the task being current and ->on_cpu
      being set. But when we look at the 'change' pattern (in for example
      sched_setnuma()):
      
      	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
      	running = task_current(rq, p); /* rq->curr == p */
      
      	if (queued)
      		dequeue_task(...);
      	if (running)
      		put_prev_task(...);
      
      	/* change task properties */
      
      	if (queued)
      		enqueue_task(...);
      	if (running)
      		set_next_task(...);
      
      It becomes obvious that if we do this after put_prev_task() has
      already been called on @p, things go sideways. This is exactly what
      the commit in question allows to happen when it does:
      
      	prev->sched_class->put_prev_task(rq, prev, rf);
      	if (!rq->nr_running)
      		newidle_balance(rq, rf);
      
      The newidle_balance() call will drop rq->lock after we've called
      put_prev_task() and that allows the above 'change' pattern to
      interleave and mess up the state.
      
      Furthermore, it turns out we lost the RT-pull when we put the last DL
      task.
      
      Fix both problems by extracting the balancing from put_prev_task() and
      doing a multi-class balance() pass before put_prev_task().
      
      Fixes: 67692435 ("sched: Rework pick_next_task() slow-path")
      Reported-by: Quentin Perret <qperret@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Quentin Perret <qperret@google.com>
      Tested-by: Valentin Schneider <valentin.schneider@arm.com>
      6e2df058
    • sched/core: Fix compilation error when cgroup not selected · e3b8b6a0
      Qais Yousef authored
      When cgroup support is disabled, the following compilation error was hit:
      
      	kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
      	kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
      	  struct css_task_iter it;
      			       ^~
      	kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration]
      	  css_task_iter_start(css, 0, &it);
      	  ^~~~~~~~~~~~~~~~~~~
      	  __sg_page_iter_start
      	kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration]
      	  while ((p = css_task_iter_next(&it))) {
      		      ^~~~~~~~~~~~~~~~~~
      		      __sg_page_iter_next
      	kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration]
      	  css_task_iter_end(&it);
      	  ^~~~~~~~~~~~~~~~~
      	  get_task_cred
      	kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable]
      	  struct css_task_iter it;
      			       ^~
      	cc1: some warnings being treated as errors
      	make[2]: *** [kernel/sched/core.o] Error 1
      
      Fix by protecting uclamp_update_active_tasks() with
      CONFIG_UCLAMP_TASK_GROUP.
      
      Fixes: babbe170 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Ben Segall <bsegall@google.com>
      Link: https://lkml.kernel.org/r/20191105112212.596-1-qais.yousef@arm.com
      e3b8b6a0
  10. 29 Oct, 2019 1 commit
    • io-wq: small threadpool implementation for io_uring · 771b53d0
      Jens Axboe authored
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the lifetime of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      771b53d0
  11. 09 Oct, 2019 1 commit
    • locking/lockdep: Remove unused @nested argument from lock_release() · 5facae4f
      Qian Cai authored
      Since the following commit:
      
        b4adfe8e ("locking/lockdep: Remove unused argument in __lock_release")
      
      @nested is no longer used in lock_release(), so remove it from all
      lock_release() calls and friends.
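
      For a typical call site the change looks roughly like this (illustrative
      before/after, not any specific file):

        /* before: the middle argument was the now-unused @nested flag */
        lock_release(&lock->dep_map, 1, _RET_IP_);

        /* after */
        lock_release(&lock->dep_map, _RET_IP_);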
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: alexander.levin@microsoft.com
      Cc: daniel@iogearbox.net
      Cc: davem@davemloft.net
      Cc: dri-devel@lists.freedesktop.org
      Cc: duyuyang@gmail.com
      Cc: gregkh@linuxfoundation.org
      Cc: hannes@cmpxchg.org
      Cc: intel-gfx@lists.freedesktop.org
      Cc: jack@suse.com
      Cc: jlbec@evilplan.or
      Cc: joonas.lahtinen@linux.intel.com
      Cc: joseph.qi@linux.alibaba.com
      Cc: jslaby@suse.com
      Cc: juri.lelli@redhat.com
      Cc: maarten.lankhorst@linux.intel.com
      Cc: mark@fasheh.com
      Cc: mhocko@kernel.org
      Cc: mripard@kernel.org
      Cc: ocfs2-devel@oss.oracle.com
      Cc: rodrigo.vivi@intel.com
      Cc: sean@poorly.run
      Cc: st@kernel.org
      Cc: tj@kernel.org
      Cc: tytso@mit.edu
      Cc: vdavydov.dev@gmail.com
      Cc: vincent.guittot@linaro.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5facae4f
  12. 01 Oct, 2019 1 commit
  13. 25 Sep, 2019 6 commits
    • sched/core: Remove double update_max_interval() call on CPU startup · 9fc41acc
      Valentin Schneider authored
      update_max_interval() is called in both CPUHP_AP_SCHED_STARTING's startup
      and teardown callbacks, but it turns out it's also called at the end of
      the startup callback of CPUHP_AP_ACTIVE (which is further down the
      startup sequence).
      
      There's no point in repeating this interval update in the startup sequence
      since the CPU will remain online until it goes down the teardown path.
      
      Remove the redundant call in sched_cpu_activate() (CPUHP_AP_ACTIVE).
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: juri.lelli@redhat.com
      Cc: vincent.guittot@linaro.org
      Link: https://lkml.kernel.org/r/20190923093017.11755-1-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9fc41acc
    • sched/core: Fix preempt_schedule() interrupt return comment · a49b4f40
      Valentin Schneider authored
      preempt_schedule_irq() is the one that should be called on return from
      interrupt; clean up the comment to avoid any ambiguity.
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linux-riscv@lists.infradead.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Link: https://lkml.kernel.org/r/20190923143620.29334-2-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a49b4f40
    • sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() · 714e501e
      KeMeng Shi authored
      An oops can be triggered in the scheduler when running qemu on arm64:
      
       Unable to handle kernel paging request at virtual address ffff000008effe40
       Internal error: Oops: 96000007 [#1] SMP
       Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
       pstate: 20000085 (nzCv daIf -PAN -UAO)
       pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
       lr : move_queued_task.isra.21+0x124/0x298
       ...
       Call trace:
        __ll_sc___cmpxchg_case_acq_4+0x4/0x20
        __migrate_task+0xc8/0xe0
        migration_cpu_stop+0x170/0x180
        cpu_stopper_thread+0xec/0x178
        smpboot_thread_fn+0x1ac/0x1e8
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
      
      __set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity mask
      to migrate the process if the process is not currently running on any one of
      the CPUs specified in the affinity mask. However, __set_cpus_allowed_ptr()
      will choose an invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual
      machine) if the CPUs in the affinity mask are deactivated by cpu_down after
      the cpumask_intersects check. The subsequent cpumask_test_cpu() of dest_cpu
      then overflows the cpumask and may pass if the corresponding bit is
      coincidentally set. As a consequence, the kernel will access an invalid rq
      address associated with the invalid CPU in
      migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.
      
      To reproduce the crash:
      
        1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
        sched_setaffinity.
      
        2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
        and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
      
        3) Oops appears if the invalid CPU is set in memory after the cpumask has
        been tested.
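
      The window can be closed with a guard of roughly this shape once the
      destination CPU has been picked (a sketch of the idea, not necessarily the
      literal patch):

        dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
        if (dest_cpu >= nr_cpu_ids) {
                /* All CPUs in the affinity mask went offline under us. */
                ret = -EINVAL;
                goto out;
        }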
      Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      714e501e
    • Mathieu Desnoyers's avatar
      sched/membarrier: Fix p->mm->membarrier_state racy load · 227a4aad
      Mathieu Desnoyers authored
      The membarrier_state field is located within the mm_struct, which
      is not guaranteed to exist when used from runqueue-lock-free iteration
      on runqueues by the membarrier system call.
      
      Copy the membarrier_state from the mm_struct into the scheduler runqueue
      when the scheduler switches between mm.
      
      When registering membarrier for mm, after setting the registration bit
      in the mm membarrier state, issue a synchronize_rcu() to ensure the
      scheduler observes the change. To handle the case where a runqueue
      keeps executing the target mm without switching to another mm, iterate
      over the runqueues and issue an IPI to copy the membarrier_state from
      the mm_struct into each runqueue that is currently running the mm whose
      state has just been modified.
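      
      As a rough, user-space-style sketch of the scheme just described
      (hypothetical types and names, not the kernel code): the per-runqueue
      copy is what lets the membarrier syscall read the state without
      dereferencing a possibly-gone mm_struct.
      
      	#include <stdio.h>
      
      	struct mm_model { int membarrier_state; };
      	struct rq_model { struct mm_model *curr_mm; int membarrier_state; };
      
      	/* On every mm switch, snapshot the state into the runqueue. */
      	static void switch_mm_model(struct rq_model *rq, struct mm_model *next)
      	{
      		rq->curr_mm = next;
      		rq->membarrier_state = next ? next->membarrier_state : 0;
      	}
      
      	/* The syscall's lock-free iteration consults only the rq copy,
      	 * never next->membarrier_state itself. */
      	static int rq_wants_membarrier(const struct rq_model *rq, int flag)
      	{
      		return rq->membarrier_state & flag;
      	}
      
      	int main(void)
      	{
      		struct mm_model mm = { .membarrier_state = 0x1 };
      		struct rq_model rq = { 0 };
      
      		switch_mm_model(&rq, &mm);
      		printf("%d\n", rq_wants_membarrier(&rq, 0x1));	/* prints 1 */
      		return 0;
      	}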
      
      Move the mm membarrier_state field closer to pgd in mm_struct to use
      a cache line already touched by the scheduler switch_mm.
      
      The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
      clear the runqueue's membarrier state in addition to clearing the mm
      membarrier state, so move its implementation into the scheduler
      membarrier code, where it can access the runqueue structure.
      
      Add a memory barrier in membarrier_exec_mmap() prior to clearing
      the membarrier state, ensuring that memory accesses executed prior to
      exec are not reordered with the stores clearing the membarrier state.
      
      As suggested by Linus, move all membarrier.c RCU read-side locks outside
      of the for-each-CPU loops.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      227a4aad
    • Eric W. Biederman's avatar
      tasks, sched/core: RCUify the assignment of rq->curr · 5311a98f
      Eric W. Biederman authored
      The current task on the runqueue is currently read with rcu_dereference().
      
      To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
      to be paired with rcu_assign_pointer() of rq->curr, which provides the
      memory barrier necessary to order assignments to the task_struct
      with the assignment to rq->curr.
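      
      For readers unfamiliar with the pairing, here is a rough C11 illustration
      of the publish/consume pattern referred to here (rcu_assign_pointer() is
      essentially a release store, rcu_dereference() a dependency-ordered load);
      the types and function names below are made up for the example and this
      is not the scheduler code:
      
      	#include <stdatomic.h>
      	#include <stdio.h>
      
      	struct task_model { int flags; };	/* stand-in for task_struct */
      
      	static _Atomic(struct task_model *) curr;	/* models rq->curr */
      
      	/* Publisher: fully initialize the task, then publish it with release
      	 * semantics, roughly what rcu_assign_pointer() provides. */
      	static void publish(struct task_model *t)
      	{
      		t->flags = 42;
      		atomic_store_explicit(&curr, t, memory_order_release);
      	}
      
      	/* Reader: a dependency-ordered load, as rcu_dereference() provides,
      	 * guarantees the initialization above is visible through the pointer. */
      	static void reader(void)
      	{
      		struct task_model *t = atomic_load_explicit(&curr, memory_order_consume);
      
      		if (t)
      			printf("flags=%d\n", t->flags);	/* sees 42, never garbage */
      	}
      
      	int main(void)
      	{
      		static struct task_model t;
      
      		publish(&t);
      		reader();
      		return 0;
      	}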
      
      Unfortunately the assignment of rq->curr in __schedule is a hot path,
      and it has already been shown that additional barriers in that code
      will reduce the performance of the scheduler.  So I will attempt to
      describe below why you can effectively have ordinary RCU semantics
      without any additional barriers.
      
      The assignment of rq->curr in init_idle is a slow path called once
      per CPU and can use rcu_assign_pointer() without any concerns.
      
      As I write this there are effectively two users of rcu_dereference() on
      rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
      that only looks at "->mm" after the rcu_dereference().  Then there is
      task_numa_compare() in kernel/sched/fair.c.  My best reading of the
      code shows that task_numa_compare() only accesses: "->flags",
      "->cpus_ptr", "->numa_group", "->numa_faults[]",
      "->total_numa_faults", and "->se.cfs_rq".
      
      The code in __schedule() essentially does:
      	rq_lock(...);
      	smp_mb__after_spinlock();
      
      	next = pick_next_task(...);
      	rq->curr = next;
      
      	context_switch(prev, next);
      
      At the start of the function the rq_lock/smp_mb__after_spinlock
      pair provides a full memory barrier.  Further there is a full memory barrier
      in context_switch().
      
      This means that any task that has already run and modified itself (the
      common case) has already seen two memory barriers before __schedule()
      runs and begins executing.  A task that modifies itself then sees a
      third full memory barrier pair with the rq_lock().
      
      For a brand new task that is enqueued with wake_up_new_task() there
      are the memory barriers from taking and releasing the pi_lock and the
      rq_lock as the process is enqueued, as well as the full memory barrier
      at the start of __schedule(), assuming __schedule() happens on the
      same cpu.
      
      This means that by the time we reach the assignment of rq->curr,
      except for values on the task struct modified in pick_next_task,
      the code has the same guarantees as if it used rcu_assign_pointer().
      
      Reading through all of the implementations of pick_next_task it
      appears pick_next_task is limited to modifying the task_struct fields
      "->se", "->rt", "->dl".  These fields are the sched_entity structures
      of the various schedulers.
      
      Further "->se.cfs_rq" is only changed in cgroup attach/move operations
      initialized by userspace.
      
      Unless I have missed something, this means that in practice the
      users of "rcu_dereference(rq->curr)" get normal RCU semantics of
      rcu_dereference() for the fields they care about, despite the
      assignment of rq->curr in __schedule() not using rcu_assign_pointer().
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5311a98f
    • Eric W. Biederman's avatar
      tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue · 0ff7b2cf
      Eric W. Biederman authored
      In the ordinary case today the RCU grace period for a task_struct is
      triggered when another process waits for its zombie and causes the
      kernel to call release_task().  As the waiting task has to receive a
      signal and then act upon it before this happens, typically this will
      occur after the original task has been removed from the runqueue.
      
      Unfortunately, in some cases, such as self-reaping tasks, it can be shown
      that release_task() will be called, starting the grace period for the
      task_struct, long before the task leaves the runqueue.
      
      Therefore use put_task_struct_rcu_user() in finish_task_switch() to
      guarantee that there is an RCU lifetime after the task
      leaves the runqueue.
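      
      The lifetime rule being relied on can be sketched, very loosely, as a
      refcount whose final drop defers the real free until after an RCU grace
      period (in the kernel this would go through a call_rcu() callback).
      Everything below is a hypothetical user-space model, not the actual
      helper:
      
      	#include <stdio.h>
      
      	struct task_model {
      		int rcu_users;	/* references that require an RCU-delayed free */
      		int freed;
      	};
      
      	/* In the kernel this would be a call_rcu() callback, i.e. it runs only
      	 * after every pre-existing RCU read-side critical section has ended. */
      	static void delayed_free(struct task_model *t)
      	{
      		t->freed = 1;
      	}
      
      	static void put_task_rcu_user_model(struct task_model *t)
      	{
      		if (--t->rcu_users == 0)
      			delayed_free(t);
      	}
      
      	int main(void)
      	{
      		/* e.g. one reference held for the runqueue, one for the reaper */
      		struct task_model t = { .rcu_users = 2 };
      
      		put_task_rcu_user_model(&t);	/* finish_task_switch()-style drop */
      		printf("freed=%d\n", t.freed);	/* 0: still readable under RCU */
      		put_task_rcu_user_model(&t);	/* release_task()-style drop */
      		printf("freed=%d\n", t.freed);	/* 1: free may happen after a grace period */
      		return 0;
      	}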
      
      Besides the change in the start of the RCU grace period for the
      task_struct, this change may delay perf_event_delayed_put and
      trace_sched_process_free.  The function perf_event_delayed_put boils
      down to just a WARN_ON for cases that I assume never happen.  So
      I don't see any problem with delaying it.
      
      The function trace_sched_process_free is a trace point and thus
      visible to user space.  Occasionally userspace has the strangest
      dependencies so this has a minuscule chance of causing a regression.
      This change only changes the timing of when the tracepoint is called.
      The change in timing arguably gives userspace a more accurate picture
      of what is going on.  So I don't expect there to be a regression.
      
      In the case where a task self-reaps we are pretty much guaranteed that
      the RCU grace period is delayed.  So we should get quite a bit of
      coverage of this worst case for the change in a normal threaded
      workload.  So I expect any issues to turn up quickly or not at all.
      
      I have lightly tested this change and everything appears to work
      fine.
      Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
      Inspired-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0ff7b2cf
  14. 07 Sep, 2019 1 commit
    • Daniel Vetter's avatar
      kernel.h: Add non_block_start/end() · 312364f3
      Daniel Vetter authored
      In some special cases we must not block, but there's no spinlock,
      preempt-off, irqs-off or similar critical section already in place that
      arms the might_sleep() debug checks. Add a non_block_start/end() pair to
      annotate these.
      
      This will be used in the oom paths of mmu-notifiers, where blocking is not
      allowed to make sure there's forward progress. Quoting Michal:
      
      "The notifier is called from quite a restricted context - oom_reaper -
      which shouldn't depend on any locks or sleepable conditionals. The code
      should be swift as well but we mostly do care about it to make a forward
      progress. Checking for sleepable context is the best thing we could come
      up with that would describe these demands at least partially."
      
      Peter also asked whether we want to catch spinlocks on top, but Michal
      said those are less of a problem because spinlocks can't have an indirect
      dependency upon the page allocator and hence close the loop with the oom
      reaper.
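      
      As a user-space sketch of the annotation's intent (a per-thread counter
      that a might_sleep()-style check can look at; the helpers below are
      stand-ins, not the kernel implementation):
      
      	#include <assert.h>
      	#include <stdio.h>
      
      	static __thread int non_block_count;	/* non-zero inside a non-blocking section */
      
      	static void non_block_start(void) { non_block_count++; }
      	static void non_block_end(void)   { non_block_count--; }
      
      	static void might_sleep_model(void)
      	{
      		/* The kernel would WARN here; the model just asserts. */
      		assert(non_block_count == 0 && "sleeping call inside a non-blocking section");
      	}
      
      	static void oom_notifier_model(void)
      	{
      		non_block_start();
      		/* ... reclaim work that must not block goes here ... */
      		non_block_end();
      	}
      
      	int main(void)
      	{
      		oom_notifier_model();
      		might_sleep_model();	/* fine: outside any non-blocking section */
      		printf("ok\n");
      		return 0;
      	}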
      
      Suggested by Michal Hocko.
      
      Link: https://lore.kernel.org/r/20190826201425.17547-4-daniel.vetter@ffwll.ch
      Acked-by: Christian König <christian.koenig@amd.com> (v1)
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      312364f3
  15. 04 Sep, 2019 1 commit
    • Ingo Molnar's avatar
      sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code · 1251201c
      Ingo Molnar authored
      Thadeu Lima de Souza Cascardo reported that 'chrt' broke on recent kernels:
      
        $ chrt -p $$
        chrt: failed to get pid 26306's policy: Argument list too long
      
      and he has root-caused the bug to the following commit increasing sched_attr
      size and breaking sched_read_attr() into returning -EFBIG:
      
        a509a7cd ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
      
      The other, bigger bug is that the whole sched_getattr() and sched_read_attr()
      logic of checking non-zero bits in new ABI components is arguably broken,
      and pretty much any extension of the ABI will spuriously break the ABI.
      That's way too fragile.
      
      Instead, implement the perf syscall's extensible ABI, which we
      already implement on the sched_setattr() side:
      
       - if user-attributes have the same size as kernel attributes then the
         logic is unchanged.
      
       - if user-attributes are larger than the kernel knows about then simply
         skip the extra bits, but set attr->size to the (smaller) kernel size
         so that tooling can (in principle) handle older kernels as well.
      
       - if user-attributes are smaller than the kernel knows about then just
         copy whatever user-space can accept.
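      
      A user-space sketch of these three cases (the struct and function names
      are illustrative, not the kernel's):
      
      	#include <stdint.h>
      	#include <stdio.h>
      	#include <string.h>
      
      	struct kattr_model { uint32_t size; uint32_t data[4]; };	/* kernel-side view */
      
      	static int attr_copy_to_user_model(void *ubuf, uint32_t usize,
      					   struct kattr_model *kattr)
      	{
      		uint32_t ksize = sizeof(*kattr);
      		uint32_t copy = usize < ksize ? usize : ksize;
      
      		/* Report how much was actually filled in, so old tooling can cope
      		 * with newer kernels and newer tooling with older kernels. */
      		kattr->size = copy;
      		memcpy(ubuf, kattr, copy);	/* extra user-space bits are simply skipped */
      		return 0;
      	}
      
      	int main(void)
      	{
      		struct kattr_model k = { .size = sizeof(k), .data = { 1, 2, 3, 4 } };
      		unsigned char small_buf[8] = { 0 };	/* older user-space: smaller struct */
      
      		attr_copy_to_user_model(small_buf, sizeof(small_buf), &k);
      		printf("reported size: %u\n", k.size);	/* 8: the smaller of the two */
      		return 0;
      	}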
      
      Also clean up the whole logic:
      
       - Simplify the code flow - there's no need for 'ret' for example.
      
       - Standardize on 'kattr/uattr' and 'ksize/usize' naming to make sure we
         always know which side we are dealing with.
      
       - Why is it called 'read' when what it does is to copy to user? This
         code is so far away from VFS read() semantics that the naming is
         actively confusing. Name it sched_attr_copy_to_user() instead, which
         mirrors other copy_to_user() functionality.
      
       - Move the attr->size assignment from the head of sched_getattr() to the
         sched_attr_copy_to_user() function. Nothing else within the kernel
         should care about the size of the structure.
      
      With these fixes the sched_getattr() syscall now nicely supports an
      extensible ABI in both a forward and backward compatible fashion, and
      will also fix the chrt bug.
      
      As an added bonus the bogus -EFBIG return is removed as well, which as
      Thadeu noted should have been -E2BIG to begin with.
      Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Tested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
      Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: a509a7cd ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
      Link: https://lkml.kernel.org/r/20190904075532.GA26751@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1251201c
  16. 03 Sep, 2019 6 commits
    • Patrick Bellasi's avatar
      sched/uclamp: Always use 'enum uclamp_id' for clamp_id values · 0413d7f3
      Patrick Bellasi authored
      The supported clamp indexes are defined in 'enum clamp_id'; however, because
      of the code logic in some of the early versions of the utilization clamping
      series, we sometimes needed to use 'unsigned int' to represent indices.
      
      This is no longer required since the final version of the uclamp_* APIs can
      always use the proper enum uclamp_id type.
      
      Fix it with a bulk rename now that we have all the bits merged.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-7-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0413d7f3
    • Patrick Bellasi's avatar
      sched/uclamp: Update CPU's refcount on TG's clamp changes · babbe170
      Patrick Bellasi authored
      On updates of task group (TG) clamp values, ensure that these new values
      are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
      tasks are immediately boosted and/or capped as requested.
      
      Do that each time we update effective clamps from cpu_util_update_eff().
      Use the *cgroup_subsys_state (css) to walk the list of tasks in each
      affected TG and update their RUNNABLE tasks.
      Update each task by using the same mechanism used for cpu affinity masks
      updates, i.e. by taking the rq lock.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-6-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      babbe170
    • Patrick Bellasi's avatar
      sched/uclamp: Use TG's clamps to restrict TASK's clamps · 3eac870a
      Patrick Bellasi authored
      When a task-specific clamp value is configured via sched_setattr(2), this
      value is accounted in the corresponding clamp bucket every time the task is
      {en,de}queued. However, when cgroups are also in use, the task-specific
      clamp values could be restricted by the task_group (TG) clamp values.
      
      Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
      task is enqueued, it is accounted in the clamp bucket tracking the smaller
      clamp between the task-specific value and its TG effective value. This
      allows us to:
      
      1. ensure cgroup clamps are always used to restrict task-specific requests,
         i.e. a task is boosted no more than its TG's effective protection and
         capped at least as strongly as its TG's effective limit.
      
      2. implement a "nice-like" policy, where tasks are still allowed to request
         less than what enforced by their TG effective limits and protections
      
      Do this by exploiting the concept of "effective" clamp, which is already
      used by a TG to track parent enforced restrictions.
      
      Apply task group clamp restrictions only to tasks belonging to a child
      group, while for tasks in the root group or in an autogroup, system
      defaults are still enforced.
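      
      Put differently, the value accounted at enqueue time is the minimum of the
      task request and the TG's effective value for each clamp index. A tiny
      illustrative C model (the names are hypothetical, not the kernel's):
      
      	#include <stdio.h>
      
      	struct uclamp_model { unsigned int min, max; };
      
      	static unsigned int min_u(unsigned int a, unsigned int b)
      	{
      		return a < b ? a : b;
      	}
      
      	/* Aggregate a task request with its task group's effective clamps:
      	 * the smaller value wins for both indexes ("nice-like" semantics). */
      	static struct uclamp_model effective_clamp(struct uclamp_model task_req,
      						   struct uclamp_model tg_eff)
      	{
      		struct uclamp_model eff = {
      			.min = min_u(task_req.min, tg_eff.min),	/* boost no more than the TG protection */
      			.max = min_u(task_req.max, tg_eff.max),	/* cap at least as hard as the TG limit */
      		};
      		return eff;
      	}
      
      	int main(void)
      	{
      		struct uclamp_model task_req = { .min = 40, .max = 100 };
      		struct uclamp_model tg_eff   = { .min = 20, .max =  80 };
      		struct uclamp_model eff = effective_clamp(task_req, tg_eff);
      
      		printf("enqueue clamps: min=%u max=%u\n", eff.min, eff.max);	/* 20, 80 */
      		return 0;
      	}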
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-5-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3eac870a
    • Patrick Bellasi's avatar
      sched/uclamp: Propagate system defaults to the root group · 7274a5c1
      Patrick Bellasi authored
      The clamp values are not tunable at the level of the root task group.
      That's for two main reasons:
      
       - the root group represents "system resources" which are always
         entirely available from the cgroup standpoint.
      
       - when tuning/restricting "system resources" makes sense, tuning must
         be done using a system wide API which should also be available when
         control groups are not.
      
      When a system wide restriction is available, cgroups should be aware of
      its value in order to know exactly how much "system resources" are
      available for the subgroups.
      
      Utilization clamping already supports the concepts of:
      
       - system defaults: which define the maximum possible clamp values
         usable by tasks.
      
       - effective clamps: which allow a parent cgroup to constrain (maybe
         temporarily) its descendants without losing the information about
         the values they "requested".
      
      Exploit these two concepts and bind them together in such a way that,
      whenever system defaults are tuned, the new values are propagated to
      (possibly) restrict or relax the "effective" value of nested cgroups.
      
      When cgroups are in use, force an update of all the RUNNABLE tasks.
      Otherwise, keep things simple and just do a lazy update the next time
      each task is enqueued.
      Do that since we assume stricter resource control is required when
      cgroups are in use. This also keeps the "effective" clamp values
      updated in case we need to expose them to user-space.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-4-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7274a5c1
    • Patrick Bellasi's avatar
      sched/uclamp: Propagate parent clamps · 0b60ba2d
      Patrick Bellasi authored
      In order to properly support hierarchical resource control, the cgroup
      delegation model requires that attribute writes from a child group never
      fail but still be locally consistent and constrained by the parent's
      assigned resources. This requires properly propagating and aggregating
      parent attributes down to the descendants.
      
      Implement this mechanism by adding a new "effective" clamp value for each
      task group. The effective clamp value is defined as the smaller value
      between the clamp value of a group and the effective clamp value of its
      parent. This is the actual clamp value enforced on tasks in a task group.
      
      Since it's possible for a cpu.uclamp.min value to be bigger than the
      cpu.uclamp.max value, ensure local consistency by restricting each
      "protection" (i.e. min utilization) with the corresponding "limit"
      (i.e. max utilization).
      
      Do that at effective clamp propagation time to ensure user-space writes
      never fail while still always tracking the most restrictive values.
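      
      The propagation rule described above can be sketched as follows (a
      simplified model with hypothetical names; the real code walks the css
      hierarchy):
      
      	#include <stdio.h>
      
      	struct tg_model {
      		unsigned int req_min, req_max;	/* values written by user-space */
      		unsigned int eff_min, eff_max;	/* values actually enforced */
      	};
      
      	static unsigned int min_u(unsigned int a, unsigned int b)
      	{
      		return a < b ? a : b;
      	}
      
      	/* effective = min(own request, parent's effective); additionally keep
      	 * the protection (min) within the limit (max) for local consistency. */
      	static void propagate_model(struct tg_model *child, const struct tg_model *parent)
      	{
      		child->eff_min = min_u(child->req_min, parent->eff_min);
      		child->eff_max = min_u(child->req_max, parent->eff_max);
      		child->eff_min = min_u(child->eff_min, child->eff_max);
      	}
      
      	int main(void)
      	{
      		struct tg_model root  = { .eff_min = 100, .eff_max = 100 };
      		struct tg_model child = { .req_min =  80, .req_max =  50 };	/* min > max is allowed */
      
      		propagate_model(&child, &root);
      		printf("effective: min=%u max=%u\n", child.eff_min, child.eff_max);	/* 50, 50 */
      		return 0;
      	}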
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0b60ba2d
    • Patrick Bellasi's avatar
      sched/uclamp: Extend CPU's cgroup controller · 2480c093
      Patrick Bellasi authored
      The cgroup CPU bandwidth controller allows assigning a specified
      (maximum) bandwidth to the tasks of a group. However, this bandwidth is
      defined and enforced only on a temporal basis, without considering the
      actual frequency a CPU is running at. Thus, the amount of computation
      completed by a task within an allocated bandwidth can be very different
      depending on the actual frequency the CPU runs that task at.
      The amount of computation can also be affected by the specific CPU a
      task is running on, especially when running on asymmetric-capacity
      systems like Arm's big.LITTLE.
      
      With the availability of schedutil, the scheduler is now able
      to drive frequency selections based on actual task utilization.
      Moreover, the utilization clamping support provides a mechanism to
      bias the frequency selection operated by schedutil depending on
      constraints assigned to the tasks currently RUNNABLE on a CPU.
      
      Given the mechanisms described above, it is now possible to extend the
      cpu controller to specify the minimum (or maximum) utilization which
      should be considered for the tasks RUNNABLE on a cpu.
      This makes it possible to better define the actual computational
      power assigned to task groups, thus improving on the cgroup CPU bandwidth
      controller, which is currently based on time constraints only.
      
      Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
      which allow enforcing utilization boosting and capping for all the
      tasks in a group.
      
      Specifically:
      
      - uclamp.min: defines the minimum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run at least at a
      	      minimum frequency which corresponds to the uclamp.min
      	      utilization
      
      - uclamp.max: defines the maximum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run up to a
      	      maximum frequency which corresponds to the uclamp.max
      	      utilization
      
      These attributes:
      
      a) are available only for non-root nodes, both on default and legacy
         hierarchies, while system wide clamps are defined by a generic
         interface which does not depend on cgroups. This system wide
         interface enforces constraints on tasks in the root node.
      
      b) enforce effective constraints at each level of the hierarchy which
         are a restriction of the group requests considering its parent's
         effective constraints. Root group effective constraints are defined
         by the system wide interface.
         This mechanism allows each (non-root) level of the hierarchy to:
         - request whatever clamp values it would like to get
         - effectively get only up to the maximum amount allowed by its parent
      
      c) have higher priority than task-specific clamps, defined via
         sched_setattr(), thus allowing to control and restrict task requests.
      
      Add two new attributes to the cpu controller to collect "requested"
      clamp values. Allow that at each non-root level of the hierarchy.
      Keep it simple by not yet handling "effective" value computation
      and propagation along the hierarchy.
      
      Update sysctl_sched_uclamp_handler() to use the newly introduced
      uclamp_mutex so that we serialize system default updates with
      cgroup-related updates.
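      
      For completeness, a hedged usage sketch: assuming a cgroup v2 hierarchy
      mounted at /sys/fs/cgroup and an already-created group named "demo"
      (both assumptions of this example, not requirements of the patch), the
      new attributes can be written like any other cgroup file:
      
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <string.h>
      	#include <unistd.h>
      
      	static int write_str(const char *path, const char *val)
      	{
      		int fd = open(path, O_WRONLY);
      
      		if (fd < 0) {
      			perror(path);
      			return -1;
      		}
      		ssize_t ret = write(fd, val, strlen(val));
      		close(fd);
      		return ret < 0 ? -1 : 0;
      	}
      
      	int main(void)
      	{
      		/* Boost the group's tasks to at least 20% utilization... */
      		write_str("/sys/fs/cgroup/demo/cpu.uclamp.min", "20");
      		/* ...and cap them at 80% utilization. */
      		write_str("/sys/fs/cgroup/demo/cpu.uclamp.max", "80");
      		return 0;
      	}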
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2480c093
  17. 19 Aug, 2019 1 commit
    • Sebastian Andrzej Siewior's avatar
      sched/core: Schedule new worker even if PI-blocked · b0fdc013
      Sebastian Andrzej Siewior authored
      If a task is PI-blocked (blocking on a sleeping spinlock) then we don't want
      to schedule a new kworker if we schedule out due to lock contention, because
      !RT does not do that either. A spinning spinlock disables preemption and a
      worker does not schedule out on lock contention (but spins).
      
      On RT the RW-semaphore implementation uses an rtmutex, so
      tsk_is_pi_blocked() will return true if a task blocks on it. In this case we
      will not start a new worker, which may deadlock if one worker is waiting on
      progress from another worker. Since an RW-semaphore starts a new worker on
      !RT, we should do the same on RT.
      
      XFS is able to trigger this deadlock.
      
      Allow scheduling a new worker if the current worker is PI-blocked.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190816160626.12742-1-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b0fdc013
  18. 08 Aug, 2019 3 commits