1. 14 Apr, 2021 5 commits
  2. 09 Apr, 2021 4 commits
    • sched/fair: Introduce a CPU capacity comparison helper · 4aed8aa4
      Valentin Schneider authored
      During load-balance, groups classified as group_misfit_task are filtered
      out if they do not pass
      
        group_smaller_max_cpu_capacity(<candidate group>, <local group>);
      
      which itself employs fits_capacity() to compare the sgc->max_capacity of
      both groups.
      
      Due to the underlying margin, fits_capacity(X, 1024) will return false for
      any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
      {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
      CPUs, misfit migration will never intentionally upmigrate it to a CPU of
      higher capacity due to the aforementioned margin.
      
      One may argue the 20% margin of fits_capacity() is excessive with the advent
      of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
      is that fits_capacity() is meant to compare a utilization value to a
      capacity value, whereas here it is being used to compare two capacity
      values. As CPU capacity and task utilization have different dynamics, a
      sensible approach here would be to add a new helper dedicated to comparing
      CPU capacities.
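      
      For illustration, a sketch of the two kinds of comparison. fits_capacity()
      below reflects the ~20% margin and the 819 cutoff described above; the
      dedicated capacity helper and its ~5% margin are assumptions for the
      example, not necessarily what the patch ends up using:
      
        /*
         * Illustrative sketch only. fits_capacity() reflects the ~20% margin
         * discussed above; capacity_greater() and its ~5% margin are assumed.
         */
        #define fits_capacity(cap, max)      ((cap) * 1280 < (max) * 1024)
        #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
      
        /*
         * fits_capacity(871, 1024): 871 * 1280 = 1114880, which is not below
         * 1024 * 1024 = 1048576, so a medium (871) group is never seen as
         * smaller than a big (1024) one and the misfit task stays put.
         * capacity_greater(1024, 871): 1024 * 1024 = 1048576 > 871 * 1078
         * = 938938, so a dedicated capacity comparison still tells the big
         * CPU apart from the medium one.
         */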
      
      Also note that comparing capacity extrema of local and source sched_group's
      doesn't make much sense when at the end of the day the imbalance will be
      pulled by a known env->dst_cpu, whose capacity can be anywhere within the
      local group's capacity extrema.
      
      While at it, replace group_smaller_{min, max}_cpu_capacity() with
      comparisons of the source group's min/max capacity and the destination
      CPU's capacity.
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Qais Yousef <qais.yousef@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
      Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
    • sched/fair: Clean up active balance nr_balance_failed trickery · 23fb06d9
      Valentin Schneider authored
      When triggering an active load balance, sd->nr_balance_failed is set to
      such a value that any further can_migrate_task() using said sd will ignore
      the output of task_hot().
      
      This behaviour makes sense, as active load balance intentionally preempts a
      rq's running task to migrate it right away, but this asynchronous write is
      a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
      before the sd->nr_balance_failed write either becomes visible to the
      stopper's CPU or even happens on the CPU that appended the stopper work.
      
      Add a struct lb_env flag to denote active balancing, and use it in
      can_migrate_task(). Remove the sd->nr_balance_failed write that served the
      same purpose. Clean up the LBF_DST_PINNED active balance special case.
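      
      As a rough illustration of the idea (a sketch; the flag name and the
      struct layout here are assumptions based on the description, not quotes
      of the patch):
      
        #include <stdbool.h>
      
        #define LBF_ACTIVE_LB 0x10  /* assumed name: env drives an active balance */
      
        struct lb_env_sketch {
                unsigned int flags;
                unsigned int sd_nr_balance_failed;  /* no longer force-written */
        };
      
        /* Can cache hotness be ignored for this migration attempt? */
        static bool ignore_task_hot(const struct lb_env_sketch *env,
                                    unsigned int cache_nice_tries)
        {
                /* Active balancing always overrides task_hot()... */
                if (env->flags & LBF_ACTIVE_LB)
                        return true;
                /* ...otherwise only after enough failed regular balances. */
                return env->sd_nr_balance_failed > cache_nice_tries;
        }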
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
    • sched/fair: Ignore percpu threads for imbalance pulls · 9bcb959d
      Lingutla Chandrasekhar authored
      During load balance, LBF_SOME_PINNED will be set if any candidate task
      cannot be detached due to CPU affinity constraints. This can result in
      setting env->sd->parent->sgc->group_imbalance, which can lead to a group
      being classified as group_imbalanced (rather than any of the other, lower
      group_type) when balancing at a higher level.
      
      In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
      set due to per-CPU kthreads being the only other runnable tasks on any
      given rq. This results in changing the group classification during
      load-balance at higher levels when in reality there is nothing that can be
      done for this affinity constraint: per-CPU kthreads, as the name implies,
      don't get to move around (modulo hotplug shenanigans).
      
      It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
      with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
      an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
      we can leverage here to not set LBF_SOME_PINNED.
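      
      A minimal model of that decision (names other than kthread_is_per_cpu()
      are simplified stand-ins; the real check lives in can_migrate_task()):
      
        #include <stdbool.h>
      
        struct task_model {
                bool is_per_cpu_kthread;   /* kthread_is_per_cpu(p) in the kernel */
                bool allowed_on_dst_cpu;   /* cpumask_test_cpu(dst, p->cpus_ptr) */
        };
      
        /* Should this candidate count towards LBF_SOME_PINNED? */
        static bool counts_as_pinned(const struct task_model *p)
        {
                if (p->allowed_on_dst_cpu)
                        return false;   /* not an affinity miss at all */
                /*
                 * A per-CPU kthread can never run anywhere else, so its
                 * pinned-ness carries no information for the parent domain:
                 * don't let it mark the group as imbalanced.
                 */
                if (p->is_per_cpu_kthread)
                        return false;
                return true;
        }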
      
      Note that the aforementioned classification to group_imbalance (when
      nothing can be done) is especially problematic on big.LITTLE systems, which
      have a topology the likes of:
      
        DIE [          ]
        MC  [    ][    ]
             0  1  2  3
             L  L  B  B
      
        arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
      
      Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
      level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
      the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
      CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
      can significantly delay the upmigration of said misfit tasks. Systems
      relying on ASYM_PACKING are likely to face similar issues.
      Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
      [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
      [Reword changelog]
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
    • sched/fair: Bring back select_idle_smt(), but differently · c722f35b
      Rik van Riel authored
      Mel Gorman did some nice work in 9fe1f127 ("sched/fair: Merge
      select_idle_core/cpu()"), resulting in the kernel being more efficient
      at finding an idle CPU, and in tasks spending less time waiting to be
      run, both according to the schedstats run_delay numbers, and according
      to measured application latencies. Yay.
      
      The flip side of this is that we see more task migrations (about 30%
      more), higher cache misses, higher memory bandwidth utilization, and
      higher CPU use, for the same number of requests/second.
      
      This is most pronounced on a memcache type workload, which saw a
      consistent 1-3% increase in total CPU use on the system, due to those
      increased task migrations leading to higher L2 cache miss numbers, and
      higher memory utilization. The exclusive L3 cache on Skylake does us
      no favors there.
      
      On our web serving workload, that effect is usually negligible.
      
      It appears that the increased number of CPU migrations is generally a
      good thing, since it leads to lower cpu_delay numbers, reflecting the
      fact that tasks get to run faster. However, the reduced locality and
      the corresponding increase in L2 cache misses hurts a little.
      
      The patch below appears to fix the regression, while keeping the
      benefit of the lower cpu_delay numbers, by reintroducing
      select_idle_smt with a twist: when a socket has no idle cores, check
      to see if the sibling of "prev" is idle, before searching all the
      other CPUs.
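      
      A sketch of that twist (kernel helper names from memory; treat the exact
      checks as an approximation rather than the patch itself):
      
        static int select_idle_smt_sketch(struct task_struct *p,
                                          struct sched_domain *sd, int prev)
        {
                int cpu;
      
                /* Only look at prev's SMT siblings, not the whole LLC. */
                for_each_cpu(cpu, cpu_smt_mask(prev)) {
                        if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
                            !cpumask_test_cpu(cpu, sched_domain_span(sd)))
                                continue;
                        if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                                return cpu;
                }
      
                return -1;      /* fall back to the full select_idle_cpu() scan */
        }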
      
      This fixes both the occasional 9% regression on the web serving
      workload, and the continuous 2% CPU use regression on the memcache
      type workload.
      
      With Mel's patches and this patch together, task migrations are still
      high, but L2 cache misses, memory bandwidth, and CPU time used are
      back down to what they were before. The p95 and p99 response times for
      the memcache type application improve by about 10% over what they were
      before Mel's patches got merged.
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
  3. 08 Apr, 2021 1 commit
  4. 25 Mar, 2021 3 commits
  5. 23 Mar, 2021 4 commits
  6. 21 Mar, 2021 1 commit
    • sched: Fix various typos · 3b03706f
      Ingo Molnar authored
      Fix ~42 single-word typos in scheduler code comments.
      
      We have accumulated a few fun ones over the years. :-)
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: linux-kernel@vger.kernel.org
  7. 17 Mar, 2021 1 commit
  8. 10 Mar, 2021 2 commits
  9. 06 Mar, 2021 19 commits
    • psi: Optimize task switch inside shared cgroups · 4117cebf
      Chengming Zhou authored
      Commit 36b238d5 ("psi: Optimize switching tasks inside shared
      cgroups") applies the "only update cgroups whose state actually changes
      during a task switch" optimization only in the task preempt case, not in
      the task sleep case.
      
      We don't actually need to clear and set the TSK_ONCPU state for the common
      cgroups of the next and prev tasks in the sleep case either; skipping them
      saves many psi_group_change() calls, especially when most activity comes
      from one leaf cgroup.
      
      sleep before:
      psi_dequeue()
        while ((group = iterate_groups(prev)))  # all ancestors
          psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
      psi_task_switch()
        while ((group = iterate_groups(next)))  # all ancestors
          psi_group_change(next, .set=TSK_ONCPU)
      
      sleep after:
      psi_dequeue()
        nop
      psi_task_switch()
        while ((group = iterate_groups(next)))  # until (prev & next)
          psi_group_change(next, .set=TSK_ONCPU)
        while ((group = iterate_groups(prev)))  # all ancestors
          psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
      
      When a voluntary sleep switches to another task, we remove one call of
      psi_group_change() for every common cgroup ancestor of the two tasks.
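      
      A self-contained model of the walk (not the kernel's psi code; the
      TSK_RUNNING handling is omitted and only the TSK_ONCPU count is modelled):
      
        #include <stdbool.h>
        #include <stddef.h>
      
        struct psi_group_model {
                struct psi_group_model *parent;
                unsigned int nr_oncpu;          /* stand-in for the TSK_ONCPU count */
        };
      
        static bool is_common(struct psi_group_model *g, struct psi_group_model *other)
        {
                for (; other; other = other->parent)
                        if (other == g)
                                return true;
                return false;
        }
      
        /* Sleep case: prev goes to sleep, next starts running. */
        static void task_switch_model(struct psi_group_model *prev,
                                      struct psi_group_model *next)
        {
                struct psi_group_model *g;
      
                /* Set TSK_ONCPU for next, stopping at the first common ancestor. */
                for (g = next; g && !is_common(g, prev); g = g->parent)
                        g->nr_oncpu++;
      
                /* Clear TSK_ONCPU for prev only on its non-common ancestors. */
                for (g = prev; g; g = g->parent)
                        if (!is_common(g, next))
                                g->nr_oncpu--;
        }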
      Co-developed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
    • psi: Pressure states are unlikely · fddc8bab
      Johannes Weiner authored
      Move the unlikely branches out of line. This eliminates undesirable
      jumps during wakeup and sleeps for workloads that aren't under any
      sort of resource pressure.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
    • psi: Use ONCPU state tracking machinery to detect reclaim · 7fae6c81
      Chengming Zhou authored
      Move the reclaim detection from the timer tick to the task state
      tracking machinery using the recently added ONCPU state. We also check for
      task psi_flags changes in the psi_task_switch() optimization so that the
      parents are updated properly.
      
      In terms of performance and cost, this ONCPU task state tracking
      is not cheaper than the previous timer tick in aggregate. But the code is
      simpler and shorter this way, so it's a maintainability win. Johannes did
      some testing with perf bench; the performance and cost changes should be
      acceptable for real workloads.
      
      Thanks to Johannes Weiner for pointing out the psi_task_switch()
      optimization and for the clearer changelog.
      Co-developed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
    • psi: Add PSI_CPU_FULL state · e7fcd762
      Chengming Zhou authored
      The FULL state doesn't exist for the CPU resource at the system level, but
      it does exist at the cgroup level: it means that all non-idle tasks in a
      cgroup are delayed on the CPU resource, which is used by others outside of
      the cgroup or throttled by the cgroup's cpu.max configuration.
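      
      In terms of the per-group task counts, the new state amounts to roughly the
      following test (an assumption based on the description above, not a quote
      of the patch; NR_RUNNING and NR_ONCPU are psi's task counters):
      
        /* "CPU full": there are runnable tasks, but none of them is on a CPU. */
        static bool test_cpu_full(const unsigned int tasks[])
        {
                return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
        }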
      Co-developed-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lkml.kernel.org/r/20210303034659.91735-2-zhouchengming@bytedance.com
    • sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2 · 585b6d27
      Barry Song authored
      As long as the NUMA diameter is > 2, building a sched_domain from the
      sibling's child domain will definitely create a sched_domain containing a
      sched_group which spans out of the sched_domain:
      
                     +------+         +------+        +-------+       +------+
                     | node |  12     |node  | 20     | node  |  12   |node  |
                     |  0   +---------+1     +--------+ 2     +-------+3     |
                     +------+         +------+        +-------+       +------+
      
      domain0        node0            node1            node2          node3
      
      domain1        node0+1          node0+1          node2+3        node2+3
                                                       +
      domain2        node0+1+2                         |
                   group: node0+1                      |
                     group:node2+3 <-------------------+
      
      When node2 is added into the domain2 of node0, the kernel uses the child
      domain of node2's domain2, which is domain1 (node2+3). Node 3 is thus
      outside the span of the domain, which only includes node0+1+2.
      
      This will make load_balance() run based on screwed avg_load and group_type
      in the sched_group spanning out of the sched_domain, and it also makes
      select_task_rq_fair() pick an idle CPU outside the sched_domain.
      
      Real servers which suffer from this problem include Kunpeng920 and 8-node
      Sun Fire X4600-M2, at least.
      
      Here we move to using the *child* domain of the *child* domain of node2's
      domain2 as the newly added sched_group. At the same time, we re-use the
      lower-level sgc directly.
                     +------+         +------+        +-------+       +------+
                     | node |  12     |node  | 20     | node  |  12   |node  |
                     |  0   +---------+1     +--------+ 2     +-------+3     |
                     +------+         +------+        +-------+       +------+
      
      domain0        node0            node1          +- node2          node3
                                                     |
      domain1        node0+1          node0+1        | node2+3        node2+3
                                                     |
      domain2        node0+1+2                       |
                   group: node0+1                    |
                     group:node2 <-------------------+
      
      While the lower-level sgc is re-used, this patch only changes the remote
      sched_groups of those sched_domains playing the grandchild trick.
      Therefore, sgc->next_update is still safe, since it's only touched by CPUs
      that have the group span as their local group. And sgc->imbalance is also
      safe, because sd_parent remains the same in load_balance() and LB only
      tries other CPUs from the local group.
      Moreover, since local groups are not touched, they still get roughly equal
      size in a TL. And should_we_balance() only matters for local groups, so the
      pull probability of those groups is still roughly equal.
      
      Tested by the below topology:
      qemu-system-aarch64  -M virt -nographic \
       -smp cpus=8 \
       -numa node,cpus=0-1,nodeid=0 \
       -numa node,cpus=2-3,nodeid=1 \
       -numa node,cpus=4-5,nodeid=2 \
       -numa node,cpus=6-7,nodeid=3 \
       -numa dist,src=0,dst=1,val=12 \
       -numa dist,src=0,dst=2,val=20 \
       -numa dist,src=0,dst=3,val=22 \
       -numa dist,src=1,dst=2,val=22 \
       -numa dist,src=2,dst=3,val=12 \
       -numa dist,src=1,dst=3,val=24 \
       -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
      
      w/o patch, we get lots of "groups don't span domain->span":
      [    0.802139] CPU0 attaching sched-domain(s):
      [    0.802193]  domain-0: span=0-1 level=MC
      [    0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
      [    0.802693]   domain-1: span=0-3 level=NUMA
      [    0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
      [    0.802811]    domain-2: span=0-5 level=NUMA
      [    0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
      [    0.802881] ERROR: groups don't span domain->span
      [    0.803058]     domain-3: span=0-7 level=NUMA
      [    0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
      [    0.804055] CPU1 attaching sched-domain(s):
      [    0.804072]  domain-0: span=0-1 level=MC
      [    0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
      [    0.804152]   domain-1: span=0-3 level=NUMA
      [    0.804170]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
      [    0.804219]    domain-2: span=0-5 level=NUMA
      [    0.804236]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
      [    0.804302] ERROR: groups don't span domain->span
      [    0.804520]     domain-3: span=0-7 level=NUMA
      [    0.804546]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
      [    0.804677] CPU2 attaching sched-domain(s):
      [    0.804687]  domain-0: span=2-3 level=MC
      [    0.804705]   groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 }
      [    0.804754]   domain-1: span=0-3 level=NUMA
      [    0.804772]    groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
      [    0.804820]    domain-2: span=0-5 level=NUMA
      [    0.804836]     groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
      [    0.804944] ERROR: groups don't span domain->span
      [    0.805108]     domain-3: span=0-7 level=NUMA
      [    0.805134]      groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
      [    0.805223] CPU3 attaching sched-domain(s):
      [    0.805232]  domain-0: span=2-3 level=MC
      [    0.805249]   groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 }
      [    0.805319]   domain-1: span=0-3 level=NUMA
      [    0.805336]    groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 }
      [    0.805383]    domain-2: span=0-5 level=NUMA
      [    0.805399]     groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 }
      [    0.805458] ERROR: groups don't span domain->span
      [    0.805605]     domain-3: span=0-7 level=NUMA
      [    0.805626]      groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 }
      [    0.805712] CPU4 attaching sched-domain(s):
      [    0.805721]  domain-0: span=4-5 level=MC
      [    0.805738]   groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 }
      [    0.805787]   domain-1: span=4-7 level=NUMA
      [    0.805803]    groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
      [    0.805851]    domain-2: span=0-1,4-7 level=NUMA
      [    0.805867]     groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
      [    0.805915] ERROR: groups don't span domain->span
      [    0.806108]     domain-3: span=0-7 level=NUMA
      [    0.806130]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
      [    0.806214] CPU5 attaching sched-domain(s):
      [    0.806222]  domain-0: span=4-5 level=MC
      [    0.806240]   groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 }
      [    0.806841]   domain-1: span=4-7 level=NUMA
      [    0.806866]    groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 }
      [    0.806934]    domain-2: span=0-1,4-7 level=NUMA
      [    0.806953]     groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 }
      [    0.807004] ERROR: groups don't span domain->span
      [    0.807312]     domain-3: span=0-7 level=NUMA
      [    0.807386]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 }
      [    0.807686] CPU6 attaching sched-domain(s):
      [    0.807710]  domain-0: span=6-7 level=MC
      [    0.807750]   groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 }
      [    0.807840]   domain-1: span=4-7 level=NUMA
      [    0.807870]    groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
      [    0.807952]    domain-2: span=0-1,4-7 level=NUMA
      [    0.807985]     groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
      [    0.808045] ERROR: groups don't span domain->span
      [    0.808257]     domain-3: span=0-7 level=NUMA
      [    0.808571]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 }
      [    0.808848] CPU7 attaching sched-domain(s):
      [    0.808860]  domain-0: span=6-7 level=MC
      [    0.808880]   groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 }
      [    0.808953]   domain-1: span=4-7 level=NUMA
      [    0.808974]    groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 }
      [    0.809034]    domain-2: span=0-1,4-7 level=NUMA
      [    0.809055]     groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 }
      [    0.809128] ERROR: groups don't span domain->span
      [    0.810361]     domain-3: span=0-7 level=NUMA
      [    0.810400]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 }
      
      w/ patch, we don't get "groups don't span domain->span" any more:
      [    1.486271] CPU0 attaching sched-domain(s):
      [    1.486820]  domain-0: span=0-1 level=MC
      [    1.500924]   groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
      [    1.515717]   domain-1: span=0-3 level=NUMA
      [    1.515903]    groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
      [    1.516989]    domain-2: span=0-5 level=NUMA
      [    1.517124]     groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
      [    1.517369]     domain-3: span=0-7 level=NUMA
      [    1.517423]      groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
      [    1.520027] CPU1 attaching sched-domain(s):
      [    1.520097]  domain-0: span=0-1 level=MC
      [    1.520184]   groups: 1:{ span=1 cap=994 }, 0:{ span=0 cap=980 }
      [    1.520429]   domain-1: span=0-3 level=NUMA
      [    1.520487]    groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
      [    1.520687]    domain-2: span=0-5 level=NUMA
      [    1.520744]     groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
      [    1.520948]     domain-3: span=0-7 level=NUMA
      [    1.521038]      groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
      [    1.522068] CPU2 attaching sched-domain(s):
      [    1.522348]  domain-0: span=2-3 level=MC
      [    1.522606]   groups: 2:{ span=2 cap=1003 }, 3:{ span=3 cap=986 }
      [    1.522832]   domain-1: span=0-3 level=NUMA
      [    1.522885]    groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
      [    1.523043]    domain-2: span=0-5 level=NUMA
      [    1.523092]     groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
      [    1.523302]     domain-3: span=0-7 level=NUMA
      [    1.523352]      groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
      [    1.523748] CPU3 attaching sched-domain(s):
      [    1.523774]  domain-0: span=2-3 level=MC
      [    1.523825]   groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 }
      [    1.524009]   domain-1: span=0-3 level=NUMA
      [    1.524086]    groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
      [    1.524281]    domain-2: span=0-5 level=NUMA
      [    1.524331]     groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 }
      [    1.524534]     domain-3: span=0-7 level=NUMA
      [    1.524586]      groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 }
      [    1.524847] CPU4 attaching sched-domain(s):
      [    1.524873]  domain-0: span=4-5 level=MC
      [    1.524954]   groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 }
      [    1.525105]   domain-1: span=4-7 level=NUMA
      [    1.525153]    groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
      [    1.525368]    domain-2: span=0-1,4-7 level=NUMA
      [    1.525428]     groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
      [    1.532726]     domain-3: span=0-7 level=NUMA
      [    1.532811]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=4037 }
      [    1.534125] CPU5 attaching sched-domain(s):
      [    1.534159]  domain-0: span=4-5 level=MC
      [    1.534303]   groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 }
      [    1.534490]   domain-1: span=4-7 level=NUMA
      [    1.534572]    groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
      [    1.534734]    domain-2: span=0-1,4-7 level=NUMA
      [    1.534783]     groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
      [    1.536057]     domain-3: span=0-7 level=NUMA
      [    1.536430]      groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=3896 }
      [    1.536815] CPU6 attaching sched-domain(s):
      [    1.536846]  domain-0: span=6-7 level=MC
      [    1.536934]   groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 }
      [    1.537144]   domain-1: span=4-7 level=NUMA
      [    1.537262]    groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
      [    1.537553]    domain-2: span=0-1,4-7 level=NUMA
      [    1.537613]     groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
      [    1.537872]     domain-3: span=0-7 level=NUMA
      [    1.537998]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
      [    1.538448] CPU7 attaching sched-domain(s):
      [    1.538505]  domain-0: span=6-7 level=MC
      [    1.538586]   groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 }
      [    1.538746]   domain-1: span=4-7 level=NUMA
      [    1.538798]    groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
      [    1.539048]    domain-2: span=0-1,4-7 level=NUMA
      [    1.539111]     groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 }
      [    1.539571]     domain-3: span=0-7 level=NUMA
      [    1.539610]      groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 }
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Tested-by: Meelis Roos <mroos@linux.ee>
      Link: https://lkml.kernel.org/r/20210224030944.15232-1-song.bao.hua@hisilicon.com
    • cpu/hotplug: Add cpuhp_invoke_callback_range() · 453e4108
      Vincent Donnefort authored
      Factorize and unify cpuhp callback range invocations, especially for
      the hotunplug path, where two different ways of decrementing were used. The
      first one decrements before the callback is called:
      
       cpuhp_thread_fun()
           state = st->state;
           st->state--;
           cpuhp_invoke_callback(state);
      
      The second one decrements after:
      
       take_down_cpu()|cpuhp_down_callbacks()
           cpuhp_invoke_callback(st->state);
           st->state--;
      
      This is problematic for rolling back the steps in case of error, as,
      depending on the decrement, the rollback will start from N or N-1. It also
      makes tracing inconsistent between steps run in the cpuhp thread and
      the others.
      
      Additionally, avoid useless cpuhp_thread_fun() loops by skipping empty
      steps.
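      
      A rough sketch of a unified range walk (simplified, with hypothetical
      helpers standing in for the real step-table lookups; only the shape of the
      state bookkeeping is meant to match the description):
      
        /* Hypothetical helpers standing in for the cpuhp step table. */
        bool step_has_callback(int step, bool bringup);
        int  invoke_step(int cpu, int step, bool bringup);
      
        static int invoke_callback_range(int cpu, bool bringup, int *state, int target)
        {
                int ret = 0;
      
                while (*state != target) {
                        int next = bringup ? *state + 1 : *state - 1;
      
                        /* Skip empty steps instead of looping through them. */
                        if (step_has_callback(next, bringup)) {
                                ret = invoke_step(cpu, next, bringup);
                                if (ret)
                                        break;  /* *state names the last completed step */
                        }
                        /* One single, consistent place where the state moves. */
                        *state = next;
                }
      
                return ret;
        }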
      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20210216103506.416286-4-vincent.donnefort@arm.com
    • cpu/hotplug: CPUHP_BRINGUP_CPU failure exception · 62f25069
      Vincent Donnefort authored
      The atomic states (between CPUHP_AP_IDLE_DEAD and CPUHP_AP_ONLINE) are
      triggered by the CPUHP_BRINGUP_CPU step. If the latter fails, no atomic
      state can be rolled back.
      
      DEAD callbacks can't fail either, and they disallow recovery. As a consequence,
      during hotunplug, the fail injection interface should prohibit all states
      from CPUHP_BRINGUP_CPU to CPUHP_ONLINE.
      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20210216103506.416286-3-vincent.donnefort@arm.com
    • cpu/hotplug: Allowing to reset fail injection · 3ae70c25
      Vincent Donnefort authored
      Currently, the only way of resetting the fail injection is to trigger a
      hotplug, hotunplug or both. This is rather annoying for testing
      and, as the default value for this file is -1, it seems pretty natural to
      let a user write it.
      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20210216103506.416286-2-vincent.donnefort@arm.com
    • sched/pelt: Fix task util_est update filtering · b89997aa
      Vincent Donnefort authored
      Being called for each dequeue, util_est reduces the number of its updates
      by filtering out the cases where the EWMA signal differs from the task's
      util_avg by less than 1%. This is a problem for a sudden util_avg ramp-up:
      due to the decay from a previous high util_avg, the EWMA might now be close
      enough to the new util_avg that no update happens, leaving ue.enqueued with
      an out-of-date value.
      
      Taking both util_est members, EWMA and enqueued, into consideration for the
      filtering ensures an up-to-date value for both.
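      
      A simplified model of the resulting filter (the margin constant and the
      within_margin() implementation here are assumptions for the example;
      fair.c has its own versions):
      
        #include <stdbool.h>
      
        #define UTIL_EST_MARGIN (1024 / 100)    /* ~1% of SCHED_CAPACITY_SCALE */
      
        static bool within_margin(int value, int margin)
        {
                return value > -margin && value < margin;
        }
      
        /*
         * Skip the util_est update only when *both* members are already
         * within ~1% of the task's current util_avg.
         */
        static bool skip_util_est_update(unsigned int ewma, unsigned int enqueued,
                                         unsigned int task_util)
        {
                return within_margin((int)ewma - (int)task_util, UTIL_EST_MARGIN) &&
                       within_margin((int)enqueued - (int)task_util, UTIL_EST_MARGIN);
        }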
      
      For now, this is only an issue for the trace probe, which might return the
      stale value. Functionally it isn't a problem, as the value is always
      accessed through max(enqueued, ewma).
      
      This problem has been observed using LISA's UtilConvergence:test_means on
      the sd845c board.
      
      No regression observed with Hackbench on sd845c and Perf-bench sched pipe
      on hikey/hikey960.
      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20210225165820.1377125-1-vincent.donnefort@arm.com
    • sched/fair: Fix shift-out-of-bounds in load_balance() · 39a2a6eb
      Valentin Schneider authored
      Syzbot reported a handful of occurrences where an sd->nr_balance_failed can
      grow to much higher values than one would expect.
      
      A successful load_balance() resets it to 0; a failed one increments
      it. Once it gets to sd->cache_nice_tries + 3, this *should* trigger an
      active balance, which will either set it to sd->cache_nice_tries+1 or reset
      it to 0. However, in case the to-be-active-balanced task is not allowed to
      run on env->dst_cpu, then the increment is done without any further
      modification.
      
      This could then be repeated ad nauseam, and would explain the absurdly high
      values reported by syzbot (86, 149). VincentG noted there is value in
      letting sd->nr_balance_failed grow, so the shift itself should be
      fixed. That means preventing:
      
        """
        If the value of the right operand is negative or is greater than or equal
        to the width of the promoted left operand, the behavior is undefined.
        """
      
      Thus we need to cap the shift exponent to
        BITS_PER_TYPE(typeof(lefthand)) - 1.
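      
      A sketch of such a capped shift, written as a helper macro (the real fix
      may differ in name and placement; min_t() and BITS_PER_TYPE() are existing
      kernel macros):
      
        /* Clamp the exponent so "val >> shift" can never be undefined behaviour. */
        #define shr_bound(val, shift)                                          \
                ((val) >> min_t(typeof(shift), (shift),                        \
                                BITS_PER_TYPE(typeof(val)) - 1))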
      
      I had a look around for other similar cases via coccinelle:
      
        @expr@
        position pos;
        expression E1;
        expression E2;
        @@
        (
        E1 >> E2@pos
        |
  E1 << E2@pos
        )
      
        @cst depends on expr@
        position pos;
        expression expr.E1;
        constant cst;
        @@
        (
        E1 >> cst@pos
        |
        E1 << cst@pos
        )
      
        @script:python depends on !cst@
        pos << expr.pos;
        exp << expr.E2;
        @@
        # Dirty hack to ignore constexpr
        if exp.upper() != exp:
           coccilib.report.print_report(pos[0], "Possible UB shift here")
      
      The only other match in kernel/sched is rq_clock_thermal() which employs
      sched_thermal_decay_shift, and that exponent is already capped to 10, so
      that one is fine.
      
      Fixes: 5a7f5559 ("sched/fair: Relax constraint on task's load during load balance")
      Reported-by: syzbot+d7581744d5fd27c9fbe1@syzkaller.appspotmail.com
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: http://lore.kernel.org/r/000000000000ffac1205b9a2112f@google.com
    • sched/fair: use lsub_positive in cpu_util_next() · 736cc6b3
      Vincent Donnefort authored
      The local version of sub_positive(), lsub_positive(), saves an explicit
      load-store pair and is enough for the cpu_util_next() usage.
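      
      For reference, the two flavours look roughly like this (reproduced from
      memory, so treat it as a sketch rather than a quote of fair.c):
      
        /* Concurrently accessed signal: explicit READ_ONCE()/WRITE_ONCE(). */
        #define sub_positive(_ptr, _val) do {                           \
                typeof(_ptr) ptr = (_ptr);                              \
                typeof(*ptr) val = (_val);                              \
                typeof(*ptr) res, var = READ_ONCE(*ptr);                \
                res = var - val;                                        \
                if (res > var)                                          \
                        res = 0;                                        \
                WRITE_ONCE(*ptr, res);                                  \
        } while (0)
      
        /* Local variable only: no explicit load-store dance needed. */
        #define lsub_positive(_ptr, _val) do {                          \
                typeof(_ptr) ptr = (_ptr);                              \
                *ptr -= min_t(typeof(*ptr), *ptr, _val);                \
        } while (0)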
      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Quentin Perret <qperret@google.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lkml.kernel.org/r/20210225083612.1113823-3-vincent.donnefort@arm.com
    • sched/fair: Fix task utilization accountability in compute_energy() · 0372e1cf
      Vincent Donnefort authored
      find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
      energy delta as follows:
      
        feec(task)
          for_each_pd
            base_energy = compute_energy(task, -1, pd)
              -> for_each_cpu(pd)
                 -> cpu_util_next(cpu, task, -1)
      
            energy_delta = compute_energy(task, dst_cpu, pd)
              -> for_each_cpu(pd)
                 -> cpu_util_next(cpu, task, dst_cpu)
            energy_delta -= base_energy
      
      Then it picks the best CPU as being the one that minimizes energy_delta.
      
      cpu_util_next() estimates the CPU utilization that would happen if the
      task was placed on dst_cpu as follows:
      
        max(cpu_util + task_util, cpu_util_est + _task_util_est)
      
      The task contribution to the energy delta can then be either:
      
        (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
            and _task_util_est > cpu_util.
        (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
      
        (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
         otherwise must be small enough so that feec() takes the CPU as a
         potential target for the task placement)
      
      This is problematic for feec(), as cpu_util_next() might give an unfair
      advantage to a CPU which is mostly busy (2) compared to one which is
      mostly idle (1). Since _task_util_est is always bigger than task_util in
      feec() (as the task is waking up), the task contribution to the energy
      might look smaller on certain CPUs (2), and this breaks the energy
      comparison.
      
      This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
      their cpu_util < _task_util_est (1) while others will maintain cpu_util >
      _task_util_est (2).
      
      Fix this problem by always using max(task_util, _task_util_est) as a task
      contribution to the energy (ENERGY_UTIL). The new estimated CPU
      utilization for the energy would then be:
      
        max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
      
      compute_energy() still needs to know which OPP would be selected if the
      task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
      cpu_util_next() is still used to estimate the maximum util within the pd.
      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Quentin Perret <qperret@google.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
    • sched/fair: Reduce the window for duplicated update · 39b6a429
      Vincent Guittot authored
      Update last_blocked_load_update_tick at the start of the blocked load
      update to reduce the possibility of another CPU starting the update one
      more time.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20210224133007.28644-8-vincent.guittot@linaro.org
    • sched/fair: Trigger the update of blocked load on newly idle cpu · c6f88654
      Vincent Guittot authored
      Instead of waking up a random and already idle CPU, we can take advantage
      of this_cpu being about to enter idle to run the ILB and update the
      blocked load.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20210224133007.28644-7-vincent.guittot@linaro.org
    • sched/fair: Reorder newidle_balance pulled_task tests · 6553fc18
      Vincent Guittot authored
      Reorder the tests and skip useless ones when no load balance has been
      performed and rq lock has not been released.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20210224133007.28644-6-vincent.guittot@linaro.org
    • sched/fair: Merge for each idle cpu loop of ILB · 7a82e5f5
      Vincent Guittot authored
      Remove the specific case for handling this_cpu outside the for_each_cpu()
      loop when running the ILB. Instead, use for_each_cpu_wrap() and start with
      the CPU after this_cpu so that the loop finishes with this_cpu.
      
      update_nohz_stats() is now used for this_cpu too and will prevent
      unnecessary updates. We don't need a special case for handling the update
      of nohz.next_balance for this_cpu anymore because it is now handled by the
      loop like the others.
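      
      Conceptually, the loop becomes something like the following (a simplified
      sketch of the shape, not the exact conditions in _nohz_idle_balance()):
      
        for_each_cpu_wrap(cpu, nohz.idle_cpus_mask, this_cpu + 1) {
                /*
                 * this_cpu runs the ILB (or is about to enter idle) and is
                 * visited last; every other candidate must already be idle.
                 */
                if (cpu != this_cpu && !idle_cpu(cpu))
                        continue;
      
                update_nohz_stats(cpu_rq(cpu)); /* skips recently updated CPUs */
      
                /* ...regular nohz balancing / next_balance tracking here... */
        }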
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20210224133007.28644-5-vincent.guittot@linaro.org
    • sched/fair: Remove unused parameter of update_nohz_stats · 64f84f27
      Vincent Guittot authored
      Idle load balance is the only user of update_nohz_stats() and doesn't use
      the force parameter. Remove it.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20210224133007.28644-4-vincent.guittot@linaro.org
    • ab2dde5e
      Vincent Guittot authored
    • sched/fair: Remove update of blocked load from newidle_balance · 0826530d
      Vincent Guittot authored
      newidle_balance() runs with both preemption and IRQs disabled, which
      prevents local IRQs from running during this period. The duration of the
      blocked load update varies with the number of CPU cgroups with non-decayed
      load, and extends this critical period to an uncontrolled level.
      
      Remove the update from newidle_balance and trigger a normal ILB that
      will take care of the update instead.
      
      This reduces the IRQ latency from O(nr_cgroups * nr_nohz_cpus) to
      O(nr_cgroups).
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20210224133007.28644-2-vincent.guittot@linaro.org