1. 03 Oct, 2020 1 commit
  2. 25 Sep, 2020 14 commits
    • rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ · f166b111
      Peter Oskolkov authored
      Based on Google-internal RSEQ work done by Paul Turner and Andrew
      Hunter.
      
      This patch adds a selftest for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
      The test quite often fails without the previous patch in this
      patchset, but consistently passes with it.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lkml.kernel.org/r/20200923233618.2572849-3-posk@google.com
    • rseq/selftests,x86_64: Add rseq_offset_deref_addv() · ea366dd7
      Peter Oskolkov authored
      This patch adds the rseq_offset_deref_addv() function to
      tools/testing/selftests/rseq/rseq-x86.h, to be used in a selftest in
      the next patch in the patchset.
      
      Once an architecture adds support for this function, it should define
      "RSEQ_ARCH_HAS_OFFSET_DEREF_ADDV".
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lkml.kernel.org/r/20200923233618.2572849-2-posk@google.com
    • rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ · 2a36ab71
      Peter Oskolkov authored
      This patchset is based on Google-internal RSEQ work done by Paul
      Turner and Andrew Hunter.
      
      When working with per-CPU RSEQ-based memory allocations, it is
      sometimes important to make sure that a global memory location is no
      longer accessed from RSEQ critical sections. For example, there can be
      two per-CPU lists, one of which is "active" and accessed per-CPU, while
      the other is inactive and worked on asynchronously "off CPU" (e.g.
      garbage collection is performed). Then at some point the two lists are
      swapped, and a fast RCU-like mechanism is required to make sure that
      the previously active list is no longer accessed.
      
      This patch introduces such a mechanism: in short, the membarrier()
      syscall issues an IPI to a CPU, restarting any potentially active RSEQ
      critical section on that CPU.
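
      A hedged userspace sketch of that pattern (minimal error handling; it
      needs uapi headers and a kernel that provide the new commands):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdio.h>

        static int sys_membarrier(int cmd, unsigned int flags, int cpu_id)
        {
                return syscall(__NR_membarrier, cmd, flags, cpu_id);
        }

        int main(void)
        {
                /* Each process registers its intent once before using the
                 * expedited rseq command. */
                if (sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0))
                        perror("membarrier register");

                /* ... swap the "active" and "inactive" per-CPU list pointers ... */

                /* RCU-like fence: once this returns, any rseq critical section
                 * that was still using the old pointers has been restarted. */
                if (sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0))
                        perror("membarrier rseq fence");
                return 0;
        }
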
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com
    • sched/fair: Use dst group while checking imbalance for NUMA balancer · 233e7aca
      Barry Song authored
      Barry Song noted the following
      
      	Something is wrong. In find_busiest_group(), we are checking if
      	src has the higher load; however, in task_numa_find_cpu(), we are
      	checking if dst will have the higher load after balancing. It
      	seems it is not sensible to check src.
      
      	It may cause a wrong imbalance value. For example, if
      	dst_running = env->dst_stats.nr_running + 1 results in 3 or
      	above, and src_running = env->src_stats.nr_running - 1 results
      	in 1, the current code treats the imbalance as 0 since
      	src_running is smaller than 2. This is inconsistent with the
      	load balancer.
      
      Basically, in find_busiest_group(), the NUMA imbalance is ignored if a
      task moves "from an almost idle domain" to a "domain with spare
      capacity". This patch instead forbids movement "from a misplaced
      domain" to "an almost idle domain", as that is closer to what the CPU
      load balancer expects.
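
      A toy model of the quoted example (standalone C, not the kernel code;
      the helper below is hypothetical and only mimics "ignore the imbalance
      while the checked group has fewer than 2 runners"):

        #include <stdio.h>

        static int adjusted_imbalance(int imbalance, int nr_running)
        {
                return nr_running < 2 ? 0 : imbalance;
        }

        int main(void)
        {
                int imbalance = 1;
                int dst_running = 2 + 1;  /* dst_stats.nr_running + 1 after the move */
                int src_running = 2 - 1;  /* src_stats.nr_running - 1 after the move */

                /* Old behaviour: checking src reports no imbalance at all. */
                printf("check src: %d\n", adjusted_imbalance(imbalance, src_running));
                /* Patched behaviour: checking dst keeps the imbalance, which is
                 * consistent with what find_busiest_group() expects. */
                printf("check dst: %d\n", adjusted_imbalance(imbalance, dst_running));
                return 0;
        }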
      
      This patch is not a universal win. The old behaviour was intended to
      allow a task from an almost idle NUMA node to migrate to its preferred
      node if the destination had capacity, but there are corner cases. For
      example, a NAS compute load could be parallelised to use 1/3rd of the
      available CPUs, but not all those potential tasks are active at all
      times, allowing this logic to trigger. An obvious example is specjbb
      2005 running various numbers of warehouses on a 2-socket box with 80
      CPUs.
      
      specjbb
                                     5.9.0-rc4              5.9.0-rc4
                                       vanilla        dstbalance-v1r1
      Hmean     tput-1     46425.00 (   0.00%)    43394.00 *  -6.53%*
      Hmean     tput-2     98416.00 (   0.00%)    96031.00 *  -2.42%*
      Hmean     tput-3    150184.00 (   0.00%)   148783.00 *  -0.93%*
      Hmean     tput-4    200683.00 (   0.00%)   197906.00 *  -1.38%*
      Hmean     tput-5    236305.00 (   0.00%)   245549.00 *   3.91%*
      Hmean     tput-6    281559.00 (   0.00%)   285692.00 *   1.47%*
      Hmean     tput-7    338558.00 (   0.00%)   334467.00 *  -1.21%*
      Hmean     tput-8    340745.00 (   0.00%)   372501.00 *   9.32%*
      Hmean     tput-9    424343.00 (   0.00%)   413006.00 *  -2.67%*
      Hmean     tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
      Hmean     tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
      Hmean     tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
      Hmean     tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
      Hmean     tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
      Hmean     tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
      Hmean     tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
      Hmean     tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
      Hmean     tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
      Hmean     tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
      Hmean     tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
      Hmean     tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
      Hmean     tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
      Hmean     tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
      Hmean     tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
      Hmean     tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
      Hmean     tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
      Hmean     tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
      Hmean     tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
      Hmean     tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
      Hmean     tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
      Hmean     tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
      Hmean     tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
      Hmean     tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
      Hmean     tput-34  1090538.00 (   0.00%)  1090933.00 *   0.04%*
      Hmean     tput-35   967058.00 (   0.00%)  1107421.00 *  14.51%*
      Hmean     tput-36  1051745.00 (   0.00%)  1210663.00 *  15.11%*
      Hmean     tput-37  1019465.00 (   0.00%)  1351446.00 *  32.56%*
      Hmean     tput-38  1083102.00 (   0.00%)  1064541.00 *  -1.71%*
      Hmean     tput-39  1232990.00 (   0.00%)  1303623.00 *   5.73%*
      Hmean     tput-40  1175542.00 (   0.00%)  1340943.00 *  14.07%*
      Hmean     tput-41  1127826.00 (   0.00%)  1339492.00 *  18.77%*
      Hmean     tput-42  1198313.00 (   0.00%)  1411023.00 *  17.75%*
      Hmean     tput-43  1163733.00 (   0.00%)  1228253.00 *   5.54%*
      Hmean     tput-44  1305562.00 (   0.00%)  1357886.00 *   4.01%*
      Hmean     tput-45  1326752.00 (   0.00%)  1406061.00 *   5.98%*
      Hmean     tput-46  1339424.00 (   0.00%)  1418451.00 *   5.90%*
      Hmean     tput-47  1415057.00 (   0.00%)  1381570.00 *  -2.37%*
      Hmean     tput-48  1392003.00 (   0.00%)  1421167.00 *   2.10%*
      Hmean     tput-49  1408374.00 (   0.00%)  1418659.00 *   0.73%*
      Hmean     tput-50  1359822.00 (   0.00%)  1391070.00 *   2.30%*
      Hmean     tput-51  1414246.00 (   0.00%)  1392679.00 *  -1.52%*
      Hmean     tput-52  1432352.00 (   0.00%)  1354020.00 *  -5.47%*
      Hmean     tput-53  1387563.00 (   0.00%)  1409563.00 *   1.59%*
      Hmean     tput-54  1406420.00 (   0.00%)  1388711.00 *  -1.26%*
      Hmean     tput-55  1438804.00 (   0.00%)  1387472.00 *  -3.57%*
      Hmean     tput-56  1399465.00 (   0.00%)  1400296.00 *   0.06%*
      Hmean     tput-57  1428132.00 (   0.00%)  1396399.00 *  -2.22%*
      Hmean     tput-58  1432385.00 (   0.00%)  1386253.00 *  -3.22%*
      Hmean     tput-59  1421612.00 (   0.00%)  1371416.00 *  -3.53%*
      Hmean     tput-60  1429423.00 (   0.00%)  1389412.00 *  -2.80%*
      Hmean     tput-61  1396230.00 (   0.00%)  1351122.00 *  -3.23%*
      Hmean     tput-62  1418396.00 (   0.00%)  1383098.00 *  -2.49%*
      Hmean     tput-63  1409918.00 (   0.00%)  1374662.00 *  -2.50%*
      Hmean     tput-64  1410236.00 (   0.00%)  1376216.00 *  -2.41%*
      Hmean     tput-65  1396405.00 (   0.00%)  1364418.00 *  -2.29%*
      Hmean     tput-66  1395975.00 (   0.00%)  1357326.00 *  -2.77%*
      Hmean     tput-67  1392986.00 (   0.00%)  1349642.00 *  -3.11%*
      Hmean     tput-68  1386541.00 (   0.00%)  1343261.00 *  -3.12%*
      Hmean     tput-69  1374407.00 (   0.00%)  1342588.00 *  -2.32%*
      Hmean     tput-70  1377513.00 (   0.00%)  1334654.00 *  -3.11%*
      Hmean     tput-71  1369319.00 (   0.00%)  1334952.00 *  -2.51%*
      Hmean     tput-72  1354635.00 (   0.00%)  1329005.00 *  -1.89%*
      Hmean     tput-73  1350933.00 (   0.00%)  1318942.00 *  -2.37%*
      Hmean     tput-74  1351714.00 (   0.00%)  1316347.00 *  -2.62%*
      Hmean     tput-75  1352198.00 (   0.00%)  1309974.00 *  -3.12%*
      Hmean     tput-76  1349490.00 (   0.00%)  1286064.00 *  -4.70%*
      Hmean     tput-77  1336131.00 (   0.00%)  1303684.00 *  -2.43%*
      Hmean     tput-78  1308896.00 (   0.00%)  1271024.00 *  -2.89%*
      Hmean     tput-79  1326703.00 (   0.00%)  1290862.00 *  -2.70%*
      Hmean     tput-80  1336199.00 (   0.00%)  1291629.00 *  -3.34%*
      
      The performance at the mid-point is better but not universally better. The
      patch is a mixed bag depending on the workload, machine and overall
      levels of utilisation. Sometimes it's better (sometimes much better),
      other times it is worse (sometimes much worse). Given that there isn't
      a universally good decision in this area and more people seem to prefer
      the patch, it may be best to keep the LB decisions consistent and
      revisit imbalance handling when the load balancer code changes settle
      down.
      
      Jirka Hladky added the following observation.
      
      	Our results are mostly in line with what you see. We observe
      	big gains (20-50%) when the system is loaded to 1/3 of the
      	maximum capacity and mixed results at the full load - some
      	workloads benefit from the patch at the full load, others not,
      	but performance changes at the full load are mostly within the
      	noise of results (+/-5%). Overall, we think this patch is helpful.
      
      [mgorman@techsingularity.net: Rewrote changelog]
      Fixes: fb86f5b2 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net
    • sched/fair: Reduce busy load balance interval · 6e749913
      Vincent Guittot authored
      The busy_factor, which increases the load balance interval when a CPU
      is busy, is set to 32 by default. This value generates some huge LB
      intervals on a large system like the THX2, made of 2 nodes x 28 cores x
      4 threads. For such a system, the interval increases from 112ms to
      3584ms at the MC level, and from 224ms to 7168ms at the NUMA level.
      
      Even on smaller systems, a lower busy factor has shown improvement in
      the fair distribution of the running time, so let's reduce it for all.
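
      The numbers above follow from interval ~= sd_weight in ms multiplied by
      the busy_factor. A small check of that arithmetic; the reduced factor
      of 16 is only an assumption for illustration, the changelog merely says
      the default of 32 is lowered:

        #include <stdio.h>

        static unsigned int busy_interval_ms(unsigned int sd_weight,
                                             unsigned int busy_factor)
        {
                return sd_weight * busy_factor;
        }

        int main(void)
        {
                printf("MC   (112 CPUs): %u ms -> %u ms\n",
                       busy_interval_ms(112, 32), busy_interval_ms(112, 16));
                printf("NUMA (224 CPUs): %u ms -> %u ms\n",
                       busy_interval_ms(224, 32), busy_interval_ms(224, 16));
                return 0;
        }
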
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org
    • sched/fair: Minimize concurrent LBs between domain level · e4d32e4d
      Vincent Guittot authored
      Sched domains tend to trigger the load balance loop simultaneously, but
      the larger domains often need more time to collect statistics. This
      slowness means a larger domain may try to detach tasks from a rq whose
      tasks have already been migrated somewhere else by a sub-domain-level
      balance. This is not a real problem for the idle LB, because the period
      of the smaller domains will increase while their CPUs are busy, which
      leaves time for the higher domains to pull tasks. But it becomes a
      problem when all CPUs are already busy, because then all domains stay
      synced when they trigger their LB.
      
      A simple way to minimize simultaneous LB of all domains is to decrement
      the busy interval by 1 jiffy. Because of the busy_factor, the interval
      of a larger domain will then no longer be a multiple of the smaller
      ones.
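
      A toy illustration of why the 1-jiffy offset helps (toy numbers, not
      kernel code): with busy intervals of 4 and 32 jiffies the two levels
      rebalance on the same tick every 32 jiffies, while 4 and 31 rarely
      coincide:

        #include <stdio.h>

        int main(void)
        {
                int small = 4, large_synced = 32, large_offset = 31;
                int clash_synced = 0, clash_offset = 0;

                for (int t = 1; t <= 1000; t++) {
                        if (t % small == 0 && t % large_synced == 0)
                                clash_synced++;
                        if (t % small == 0 && t % large_offset == 0)
                                clash_offset++;
                }
                printf("synced intervals: %d clashes in 1000 ticks\n", clash_synced);
                printf("offset intervals: %d clashes in 1000 ticks\n", clash_offset);
                return 0;
        }
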
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org
    • sched/fair: Reduce minimal imbalance threshold · 2208cdaa
      Vincent Guittot authored
      The 25% default imbalance threshold for the DIE and NUMA domains is
      large enough to generate significant unfairness between threads. A
      typical example is the case of 11 threads running on 2x4 CPUs. The
      imbalance of 20% between the 2 groups of 4 cores is just low enough not
      to trigger the load balance between the 2 groups. We always end up with
      the same 6 threads on one group of 4 CPUs and the other 5 threads on
      the other group of CPUs. With fair time sharing within each group, each
      thread of the group of 5 ends up with about 20% more running time than
      a thread of the group of 6 (4/5 of a CPU versus 4/6).
      
      Consider decreasing the imbalance threshold for the overloaded case,
      where we use the load to balance tasks and to ensure fair time sharing.
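
      The arithmetic behind that example, worked out:

        #include <stdio.h>

        int main(void)
        {
                double share_grp6 = 4.0 / 6;  /* CPU share per thread, 6-thread group */
                double share_grp5 = 4.0 / 5;  /* CPU share per thread, 5-thread group */

                /* 6 vs 5 runners: a 20% imbalance, just under the 25% threshold. */
                printf("group imbalance:   %.0f%%\n", (6.0 - 5.0) / 5.0 * 100);
                /* Each thread in the group of 5 runs about 20% longer. */
                printf("runtime advantage: %.0f%%\n",
                       (share_grp5 / share_grp6 - 1) * 100);
                return 0;
        }
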
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Acked-by: Hillf Danton <hdanton@sina.com>
      Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org
    • sched/fair: Relax constraint on task's load during load balance · 5a7f5559
      Vincent Guittot authored
      Some use cases, like 9 always-running tasks on 8 CPUs, can't be
      balanced, and the load balancer currently migrates the waiting task
      between the CPUs in an almost random manner. Whether a rq succeeds in
      pulling a task depends on the value of nr_balance_failed of its domains
      and on its ability to detach the task faster than the others. This
      behavior results in an unfair distribution of the running time between
      tasks, because some CPUs will run the same task most of the time, if
      not always, whereas others will share their time between several tasks.
      
      Instead of using nr_balance_failed as a boolean to relax the condition
      for detaching a task, the LB will use nr_balance_failed to relax the
      threshold between the task's load and the imbalance. This mechanism
      prevents the same rq or domain from always winning the load balance
      fight.
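
      A toy model of the idea (the exact kernel condition is not shown in
      this log, so the relaxation below is hypothetical): as
      nr_balance_failed grows, the "task load vs. imbalance" check becomes
      easier to pass, so a different rq can eventually win the fight:

        #include <stdio.h>

        static int can_detach(unsigned long task_load, unsigned long imbalance,
                              unsigned int nr_balance_failed)
        {
                /* hypothetical relaxation: shrink the effective load per failure */
                return (task_load >> nr_balance_failed) / 2 <= imbalance;
        }

        int main(void)
        {
                unsigned long load = 1024, imbalance = 100;

                for (unsigned int fails = 0; fails < 4; fails++)
                        printf("nr_balance_failed=%u -> detach allowed: %s\n",
                               fails,
                               can_detach(load, imbalance, fails) ? "yes" : "no");
                return 0;
        }
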
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org
    • sched/fair: Remove the force parameter of update_tg_load_avg() · fe749158
      Xianting Tian authored
      In fair.c, update_tg_load_avg(cfs_rq, 0) is sometimes used and
      sometimes update_tg_load_avg(cfs_rq, false) is used.
      update_tg_load_avg() has a force parameter, but the current code never
      passes 1 or true for it, so remove the parameter.
      Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com
    • sched/fair: Fix wrong cpu selecting from isolated domain · df3cb4ea
      Xunlei Pang authored
      We have met problems in our production environment where tasks with a
      full cpumask (e.g. because they were put into a cpuset or set to full
      affinity) were occasionally migrated to our isolated CPUs.
      
      After some analysis, we found that this is because the current
      select_idle_smt() does not consider the sched_domain mask.
      
      Steps to reproduce on my 31-CPU hyperthreaded machine:
      1. boot with the parameter "isolcpus=domain,2-31"
         (thread lists: 0,16 and 1,17)
      2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
      3. some threads will be migrated to the isolated CPUs 16-17.
      
      Fix it by checking the valid domain mask in select_idle_smt().
      
      Fixes: 10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jiang Biao <benbjiang@tencent.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com
    • sched/rt: Disable RT_RUNTIME_SHARE by default · 2586af1a
      Daniel Bristot de Oliveira authored
      The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
      between CPUs, allowing a CPU to run a real-time task up to 100% of the
      time while leaving more space for non-real-time tasks to run on the
      CPUs that lent rt_runtime.
      
      The problem is that a CPU can easily borrow enough rt_runtime to allow
      a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
      which are non-real-time by design.
      
      This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
      The feature will still be present for users that want to enable it,
      though.
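
      For users who do want the sharing behaviour, a minimal sketch of
      turning it back on at runtime via the sched_features debugfs interface
      (assumes CONFIG_SCHED_DEBUG and a mounted debugfs; write
      "NO_RT_RUNTIME_SHARE" to disable it again):

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/kernel/debug/sched_features", "w");

                if (!f) {
                        perror("sched_features");
                        return 1;
                }
                fputs("RT_RUNTIME_SHARE", f);
                fclose(f);
                return 0;
        }
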
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Wei Wang <wvw@google.com>
      Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com
    • sched/deadline: Fix stale throttling on de-/boosted tasks · 46fcc4b0
      Lucas Stach authored
      When a boosted task gets throttled, what normally happens is that it's
      immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
      runtime and clears the dl_throttled flag. There is a special case however:
      if the throttling happened on sched-out and the task has been deboosted in
      the meantime, the replenish is skipped as the task will return to its
      normal scheduling class. This leaves the task with the dl_throttled flag
      set.
      
      Now if the task gets boosted up to the deadline scheduling class again
      while it is sleeping, it's still in the throttled state. The normal wakeup
      however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't
      actually place it on the rq. Thus we end up with a task that is runnable
      but not actually on the rq; neither an immediate replenishment happens,
      nor is the replenishment timer set up, so the task is stuck in
      forever-throttled limbo.
      
      Clear the dl_throttled flag before dropping back to the normal scheduling
      class to fix this issue.
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@redhat.com>
      Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
    • sched/numa: Use runnable_avg to classify node · 8e0e0eda
      Vincent Guittot authored
      Use runnable_avg to classify the numa node state, similarly to what is
      done for the normal load balancer. This helps ensure that the numa and
      normal balancers use the same view of the state of the system.
      
      Large arm64 system: 2 nodes / 224 CPUs:
      
        hackbench -l (256000/#grp) -g #grp
      
        grp    tip/sched/core         +patchset              improvement
        1      14.008(+/- 4.99 %)     13.800(+/- 3.88 %)     1.48 %
        4       4.340(+/- 5.35 %)      4.283(+/- 4.85 %)     1.33 %
        16      3.357(+/- 0.55 %)      3.359(+/- 0.54 %)    -0.06 %
        32      3.050(+/- 0.94 %)      3.039(+/- 1.06 %)     0.38 %
        64      2.968(+/- 1.85 %)      3.006(+/- 2.92 %)    -1.27 %
        128     3.290(+/-12.61 %)      3.108(+/- 5.97 %)     5.51 %
        256     3.235(+/- 3.95 %)      3.188(+/- 2.83 %)     1.45 %
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Link: https://lkml.kernel.org/r/20200921072959.16317-1-vincent.guittot@linaro.org
  3. 09 Sep, 2020 1 commit
  4. 04 Sep, 2020 1 commit
  5. 26 Aug, 2020 7 commits
  6. 19 Aug, 2020 16 commits