  1. 20 Aug, 2021 5 commits
  2. 10 Aug, 2021 1 commit
  3. 06 Aug, 2021 2 commits
  4. 04 Aug, 2021 2 commits
  5. 28 Jun, 2021 1 commit
    • sched: Optimize housekeeping_cpumask() in for_each_cpu_and() · 031e3bd8
      Yuan ZhaoXiong authored
      On a 128-core AMD machine, 8 cores run in nohz_full mode and the
      others are used for housekeeping. When many housekeeping CPUs are
      idle, ftrace shows a large amount of time burned in the loop that
      searches for the nearest busy housekeeping CPU.
      
         9)               |              get_nohz_timer_target() {
         9)               |                housekeeping_test_cpu() {
         9)   0.390 us    |                  housekeeping_get_mask.part.1();
         9)   0.561 us    |                }
         9)   0.090 us    |                __rcu_read_lock();
         9)   0.090 us    |                housekeeping_cpumask();
         9)   0.521 us    |                housekeeping_cpumask();
         9)   0.140 us    |                housekeeping_cpumask();
      
         ...
      
         9)   0.500 us    |                housekeeping_cpumask();
         9)               |                housekeeping_any_cpu() {
         9)   0.090 us    |                  housekeeping_get_mask.part.1();
         9)   0.100 us    |                  sched_numa_find_closest();
         9)   0.491 us    |                }
         9)   0.100 us    |                __rcu_read_unlock();
         9) + 76.163 us   |              }
      
      for_each_cpu_and() is a macro, so in the get_nohz_timer_target()
      function the loop
              for_each_cpu_and(i, sched_domain_span(sd),
                      housekeeping_cpumask(HK_FLAG_TIMER))
      expands to:
              for (i = -1; i = cpumask_next_and(i, sched_domain_span(sd),
                      housekeeping_cpumask(HK_FLAG_TIMER)), i < nr_cpu_ids;)
      As a result, housekeeping_cpumask() is invoked on every iteration.
      The housekeeping_cpumask() function returns a constant value, so
      it is unnecessary to invoke it every time. This patch reduces the
      worst-case search time from ~76us to ~16us in my testing.
      
      The find_new_ilb() function has the same problem and is fixed the
      same way; a sketch of the idea follows.
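
      The shape of the fix is to evaluate the invariant mask once and
      reuse it across iterations. A minimal sketch (simplified, not the
      exact diff; the idle_cpu() check stands in for the loop body):

              const struct cpumask *hk_mask = housekeeping_cpumask(HK_FLAG_TIMER);

              /* The mask is fetched once, outside the iteration. */
              for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
                      if (cpu == i)
                              continue;
                      if (!idle_cpu(i))
                              return i;
              }
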
      Co-developed-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1622985115-51007-1-git-send-email-yuanzhaoxiong@baidu.com
  6. 24 Jun, 2021 1 commit
    • sched/fair: Introduce the burstable CFS controller · f4183717
      Huaixin Chang authored
      The CFS bandwidth controller limits the CPU requests of a task
      group to quota during each period. However, parallel workloads
      might be bursty, so they get throttled even when their average
      utilization is under quota. And since they are latency-sensitive
      at the same time, throttling them is undesirable.
      
      We borrow time now against our future underrun, at the cost of increased
      interference against the other system users. All nicely bounded.
      
      Traditional (UP-EDF) bandwidth control is something like:
      
        (U = \Sum u_i) <= 1
      
      This guarantees both that every deadline is met and that the
      system is stable. After all, if U were > 1, then for every second
      of walltime we'd have to run more than a second of program time,
      and obviously miss our deadline; the next deadline would then be
      even further out, there is never time to catch up: unbounded fail.
      
      This work observes that a workload doesn't always execute the full
      quota; this enables one to describe u_i as a statistical
      distribution.
      
      For example, take u_i = {x,e}_i, where x is the p(95) value and
      x+e the p(100) value (the traditional WCET). This effectively
      allows u to be smaller, increasing efficiency (we can pack more
      tasks in the system), but at the cost of missing deadlines when
      all the odds line up. However, it does maintain stability, since
      every overrun must be paired with an underrun as long as our x is
      above the average.
      
      That is, suppose we have 2 tasks, both of which specify a p(95)
      value; then we have a p(95)*p(95) = 90.25% chance both tasks are
      within their quota and everything is good. At the same time we
      have a p(5)*p(5) = 0.25% chance both tasks will exceed their quota
      at the same time (guaranteed deadline fail). Somewhere in between
      there's a threshold where one exceeds and the other doesn't
      underrun enough to compensate; this depends on the specific CDFs.
      
      At the same time, we can say that the worst-case deadline miss
      will be \Sum e_i; that is, there is a bounded tardiness (under the
      assumption that x+e is indeed WCET).
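
      In implementation terms, borrowing against a future underrun
      amounts to letting unused quota accumulate across periods, capped
      by the burst. A minimal standalone sketch of that refill rule
      (hypothetical types and names, not the kernel's exact code):

              #include <stdint.h>

              struct cfs_bw_sketch {
              	uint64_t quota;   /* runtime granted per period */
              	uint64_t burst;   /* extra runtime allowed to carry over */
              	uint64_t runtime; /* runtime currently available */
              };

              /* At each period refresh, grant a fresh quota but cap the
               * accumulated runtime at quota + burst: this is what keeps
               * the borrowing "nicely bounded". */
              static void refill_runtime(struct cfs_bw_sketch *b)
              {
              	uint64_t cap = b->quota + b->burst;

              	b->runtime += b->quota;
              	if (b->runtime > cap)
              		b->runtime = cap;
              }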
      
      The benefit of burst is seen when testing with schbench, using the
      default values of kernel.sched_cfs_bandwidth_slice_us (5ms) and
      CONFIG_HZ (1000):
      
      	mkdir /sys/fs/cgroup/cpu/test
      	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
      	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
      	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
      
      	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
      
      The average CPU usage is at 80%. I ran this 10 times, hit long
      tail latency 6 times, and got throttled 8 times.
      
      Tail latencies are shown below, and this wasn't even the worst
      case.
      
      	Latency percentiles (usec)
      		50.0000th: 19872
      		75.0000th: 21344
      		90.0000th: 22176
      		95.0000th: 22496
      		*99.0000th: 22752
      		99.5000th: 22752
      		99.9000th: 22752
      		min=0, max=22727
      	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
      
      The interference from using burst is evaluated via the probability
      of missing the deadline and the average WCET. Test results showed
      that when there are many cgroups or the CPU is under-utilized, the
      interference is limited. More details are shown in:
      https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

      Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
  7. 22 Jun, 2021 1 commit
    • sched/uclamp: Fix uclamp_tg_restrict() · 0213b708
      Qais Yousef authored
      Now that cpu.uclamp.min acts as a protection, we need to make sure
      that the uclamp request of the task is within the allowed range of
      the cgroup; that is, it is clamp()'ed correctly by
      tg->uclamp[UCLAMP_MIN] and tg->uclamp[UCLAMP_MAX].
      
      As reported by Xuewen [1] we can have some corner cases where
      there is an inversion between the uclamp requested by the task (p)
      and the uclamp values of the taskgroup it's attached to (tg). The
      following table demonstrates two corner cases:
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	CASE 1
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  60%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	CASE 2
      	-----------+-----+------+-----------
      	uclamp_min | 0%  | 30%  |  30%
      	-----------+-----+------+-----------
      	uclamp_max | 20% | 50%  |  20%
      	-----------+-----+------+-----------
      
      With this fix we get:
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	CASE 1
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  50%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	CASE 2
      	-----------+-----+------+-----------
      	uclamp_min | 0%  | 30%  |  30%
      	-----------+-----+------+-----------
      	uclamp_max | 20% | 50%  |  30%
      	-----------+-----+------+-----------
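
      A minimal sketch of the fixed restriction (a hypothetical helper,
      not the kernel's exact function): each of the task's requests is
      clamped into the cgroup's [tg_min, tg_max] range, which reproduces
      the "with this fix" table above.

              static unsigned int uclamp_restrict_sketch(unsigned int req,
              					   unsigned int tg_min,
              					   unsigned int tg_max)
              {
              	if (req < tg_min)
              		return tg_min;	/* raised to the cgroup floor */
              	if (req > tg_max)
              		return tg_max;	/* capped by the cgroup ceiling */
              	return req;
              }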
      
      Additionally, uclamp_update_active_tasks() must now
      unconditionally update both UCLAMP_MIN and UCLAMP_MAX, because
      changing the tg's UCLAMP_MAX, for instance, can have an impact on
      the effective UCLAMP_MIN of the tasks.
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	old
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  50%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	*new*
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   | *60%*
      	-----------+-----+------+-----------
      	uclamp_max | 80% |*70%* | *70%*
      	-----------+-----+------+-----------
      
      [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/
      
      Fixes: 0c18f2ec ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
      Reported-by: Xuewen Yan <xuewen.yan94@gmail.com>
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
  8. 18 Jun, 2021 3 commits
  9. 17 Jun, 2021 1 commit
    • sched/fair: Age the average idle time · 94aafc3e
      Peter Zijlstra authored
      This is a partial forward-port of Peter Zijlstra's work first
      posted at:
      
         https://lore.kernel.org/lkml/20180530142236.667774973@infradead.org/
      
      Currently select_idle_cpu()'s proportional scheme uses the average
      idle time *for when we are idle*, which is temporally challenged:
      when a CPU is not at all idle, we'll happily continue using
      whatever value we last saw when the CPU went idle. To fix this,
      introduce a separate average idle and age it (the existing value
      still makes sense for things like new-idle balancing, which
      happens when we do go idle).
      
      The overall goal is to not spend more time scanning for idle CPUs than
      we're idle for. Otherwise we're inhibiting work. This means that we need to
      consider the cost over all the wake-ups between consecutive idle periods.
      To track this, the scan cost is subtracted from the estimated average
      idle time.
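
      A minimal standalone sketch of the aging idea (hypothetical field
      and function names, not the kernel's exact code): the estimate
      decays while the CPU stays busy, and each scan's cost is
      subtracted from it, so stale idle values cannot keep justifying
      deep searches.

              #include <stdint.h>

              struct rq_sketch {
              	uint64_t wake_avg_idle;	/* aged idle estimate, in ns */
              };

              /* Decay the estimate each time it is consulted while busy. */
              static void age_avg_idle(struct rq_sketch *rq)
              {
              	rq->wake_avg_idle >>= 1;
              }

              /* Charge the time spent scanning for an idle CPU against
               * the estimate, clamping at zero. */
              static void account_scan_cost(struct rq_sketch *rq, uint64_t cost)
              {
              	rq->wake_avg_idle -= (cost < rq->wake_avg_idle) ?
              			     cost : rq->wake_avg_idle;
              }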
      
      The impact of this patch is related to workloads that have domains that
      are fully busy or overloaded. Without the patch, the scan depth may be
      too high because a CPU is not reaching idle.
      
      Due to the nature of the patch, this is a regression magnet. It
      potentially wins when domains are almost fully busy or overloaded:
      at that point searches are likely to fail, but idle time is not
      being aged while the CPUs are active, so the search depth is too
      large and useless. It will potentially show regressions when there
      are idle CPUs and a deep search is beneficial. This tbench result
      on a 2-socket Broadwell machine partially illustrates the problem:
      
                                5.13.0-rc2             5.13.0-rc2
                                   vanilla     sched-avgidle-v1r5
      Hmean     1        445.02 (   0.00%)      451.36 *   1.42%*
      Hmean     2        830.69 (   0.00%)      846.03 *   1.85%*
      Hmean     4       1350.80 (   0.00%)     1505.56 *  11.46%*
      Hmean     8       2888.88 (   0.00%)     2586.40 * -10.47%*
      Hmean     16      5248.18 (   0.00%)     5305.26 *   1.09%*
      Hmean     32      8914.03 (   0.00%)     9191.35 *   3.11%*
      Hmean     64     10663.10 (   0.00%)    10192.65 *  -4.41%*
      Hmean     128    18043.89 (   0.00%)    18478.92 *   2.41%*
      Hmean     256    16530.89 (   0.00%)    17637.16 *   6.69%*
      Hmean     320    16451.13 (   0.00%)    17270.97 *   4.98%*
      
      Note that 8 threads was a regression point where a deeper search
      would have helped, but the patch gains at high thread counts where
      searches are useless. Hackbench is a more extreme example,
      although not a perfect one, as the tasks idle rapidly:
      
      hackbench-process-pipes
                                5.13.0-rc2             5.13.0-rc2
                                   vanilla     sched-avgidle-v1r5
      Amean     1        0.3950 (   0.00%)      0.3887 (   1.60%)
      Amean     4        0.9450 (   0.00%)      0.9677 (  -2.40%)
      Amean     7        1.4737 (   0.00%)      1.4890 (  -1.04%)
      Amean     12       2.3507 (   0.00%)      2.3360 *   0.62%*
      Amean     21       4.0807 (   0.00%)      4.0993 *  -0.46%*
      Amean     30       5.6820 (   0.00%)      5.7510 *  -1.21%*
      Amean     48       8.7913 (   0.00%)      8.7383 (   0.60%)
      Amean     79      14.3880 (   0.00%)     13.9343 *   3.15%*
      Amean     110     21.2233 (   0.00%)     19.4263 *   8.47%*
      Amean     141     28.2930 (   0.00%)     25.1003 *  11.28%*
      Amean     172     34.7570 (   0.00%)     30.7527 *  11.52%*
      Amean     203     41.0083 (   0.00%)     36.4267 *  11.17%*
      Amean     234     47.7133 (   0.00%)     42.0623 *  11.84%*
      Amean     265     53.0353 (   0.00%)     47.7720 *   9.92%*
      Amean     296     60.0170 (   0.00%)     53.4273 *  10.98%*
      Stddev    1        0.0052 (   0.00%)      0.0025 (  51.57%)
      Stddev    4        0.0357 (   0.00%)      0.0370 (  -3.75%)
      Stddev    7        0.0190 (   0.00%)      0.0298 ( -56.64%)
      Stddev    12       0.0064 (   0.00%)      0.0095 ( -48.38%)
      Stddev    21       0.0065 (   0.00%)      0.0097 ( -49.28%)
      Stddev    30       0.0185 (   0.00%)      0.0295 ( -59.54%)
      Stddev    48       0.0559 (   0.00%)      0.0168 (  69.92%)
      Stddev    79       0.1559 (   0.00%)      0.0278 (  82.17%)
      Stddev    110      1.1728 (   0.00%)      0.0532 (  95.47%)
      Stddev    141      0.7867 (   0.00%)      0.0968 (  87.69%)
      Stddev    172      1.0255 (   0.00%)      0.0420 (  95.91%)
      Stddev    203      0.8106 (   0.00%)      0.1384 (  82.92%)
      Stddev    234      1.1949 (   0.00%)      0.1328 (  88.89%)
      Stddev    265      0.9231 (   0.00%)      0.0820 (  91.11%)
      Stddev    296      1.0456 (   0.00%)      0.1327 (  87.31%)
      
      Again, higher thread counts benefit and the standard deviation
      shows that results are also a lot more stable when the idle
      time is aged.
      
      The patch potentially matters when a socket has multiple LLCs, as
      the maximum search depth is lower. However, some of the test
      results were suspiciously good (e.g. specjbb2005 gaining 50% on
      a Zen1 machine) and other results were not dramatically different
      from other machines.
      
      Given the nature of the patch, Peter's full series is not being
      forward-ported, as each part should stand on its own. Preferably
      they would be merged at different times to reduce the risk of
      false bisections.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210615111611.GH30378@techsingularity.net
  10. 04 Jun, 2021 1 commit
  11. 01 Jun, 2021 2 commits
  12. 19 May, 2021 3 commits
    • sched: Fix a stale comment in pick_next_task() · 1699949d
      Masahiro Yamada authored
      fair_sched_class->next no longer exists since commit:
      
        a87e749e ("sched: Remove struct sched_class::next field").
      
      Now the sched_class order is specified by the linker script.
      
      Rewrite the comment in a more generic way.
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20210519063709.323162-1-masahiroy@kernel.org
    • sched/uclamp: Fix locking around cpu_util_update_eff() · 93b73858
      Qais Yousef authored
      cpu_cgroup_css_online() calls cpu_util_update_eff() without
      holding the uclamp_mutex or rcu_read_lock() like the other call
      sites do, which is a mistake.
      
      The uclamp_mutex is required to protect against concurrent reads and
      writes that could update the cgroup hierarchy.
      
      The rcu_read_lock() is required to traverse the cgroup data structures
      in cpu_util_update_eff().
      
      Surround the caller with the required locks and add some asserts to
      better document the dependency in cpu_util_update_eff().
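
      A sketch of the intended call-site shape (simplified; not the
      exact diff):

              /* Match the locking of the other cpu_util_update_eff()
               * callers before walking the cgroup hierarchy. */
              mutex_lock(&uclamp_mutex);
              rcu_read_lock();

              cpu_util_update_eff(css);

              rcu_read_unlock();
              mutex_unlock(&uclamp_mutex);

      Inside cpu_util_update_eff(), an assertion along the lines of
      lockdep_assert_held(&uclamp_mutex) then documents the dependency
      for future callers.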
      
      Fixes: 7226017a ("sched/uclamp: Fix a bug in propagating uclamp value in new cgroups")
      Reported-by: Quentin Perret <qperret@google.com>
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210510145032.1934078-3-qais.yousef@arm.com
    • sched/uclamp: Fix wrong implementation of cpu.uclamp.min · 0c18f2ec
      Qais Yousef authored
      cpu.uclamp.min is a protection, as described in the cgroup-v2
      Resource Distribution Model:

      	Documentation/admin-guide/cgroup-v2.rst

      which means we try our best to preserve the minimum performance
      point of tasks in this group. See the full description of
      cpu.uclamp.min in cgroup-v2.rst.
      
      But the current implementation makes it a limit, which is not what was
      intended.
      
      For example:
      
      	tg->cpu.uclamp.min = 20%
      
      	p0->uclamp[UCLAMP_MIN] = 0
      	p1->uclamp[UCLAMP_MIN] = 50%
      
      	Previous Behavior (limit):
      
      		p0->effective_uclamp = 0
      		p1->effective_uclamp = 20%
      
      	New Behavior (Protection):
      
      		p0->effective_uclamp = 20%
      		p1->effective_uclamp = 50%
      
      This is in line with how protections should work, as the sketch
      below illustrates.
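
      A minimal sketch contrasting the two semantics (hypothetical
      helpers, not the kernel's code): a limit caps the request with
      min(), while a protection raises it to the group's floor with
      max(); applying each to the example above reproduces the two
      behaviors.

              /* Old, wrong: treats cpu.uclamp.min as a limit. */
              static unsigned int min_as_limit(unsigned int req,
              				 unsigned int tg_min)
              {
              	return req < tg_min ? req : tg_min;	/* min() */
              }

              /* New, intended: treats cpu.uclamp.min as a protection. */
              static unsigned int min_as_protection(unsigned int req,
              				      unsigned int tg_min)
              {
              	return req > tg_min ? req : tg_min;	/* max() */
              }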
      
      With this change the cgroup and per-task behaviors are the same, as
      expected.
      
      Additionally, we remove the confusing relationship between the
      cgroup settings and the !user_defined flag.
      
      We don't want, for example, RT tasks that are boosted to max by
      default to change their boost value when they attach to a cgroup.
      If a cgroup wants to limit the max performance point of tasks
      attached to it, then cpu.uclamp.max must be set accordingly.

      Or, if different boost values per cgroup are wanted, then
      sysctl_sched_util_clamp_min_rt_default must be used to NOT boost
      to max, and the right cpu.uclamp.min must be set for each group to
      let the RT tasks obtain the desired boost value when attached to
      that group.
      
      As it stands, the dependency on the !user_defined flag adds an
      extra layer of complexity that is not required now that
      cpu.uclamp.min behaves properly as a protection.
      
      The propagation model of effective cpu.uclamp.min in child cgroups as
      implemented by cpu_util_update_eff() is still correct. The parent
      protection sets an upper limit of what the child cgroups will
      effectively get.
      
      Fixes: 3eac870a ("sched/uclamp: Use TG's clamps to restrict TASK's clamps")
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com
  13. 18 May, 2021 1 commit
  14. 12 May, 2021 16 commits