1. 19 Nov, 2020 10 commits
    • sched/topology: Condition EAS enablement on FIE support · fa50e2b4
      Ionela Voinescu authored
      In order to make accurate predictions across CPUs and for all performance
      states, Energy Aware Scheduling (EAS) needs frequency-invariant load
      tracking signals.
      
      EAS task placement aims to minimize energy consumption, and does so in
      part by limiting the search space to only CPUs with the highest spare
      capacity (CPU capacity - CPU utilization) in their performance domain.
      Those candidates are the placement choices that will keep frequency at
      its lowest possible and therefore save the most energy.
      
      But without frequency invariance, a CPU's utilization is relative to the
      CPU's current performance level, and not relative to its maximum
      performance level, which determines its capacity. As a result, it will
      fail to correctly indicate any potential spare capacity obtained by an
      increase in a CPU's performance level. Therefore, a non-invariant
      utilization signal would render the EAS task placement logic invalid.
      
      Now that we properly report support for the Frequency Invariance Engine
      (FIE) through arch_scale_freq_invariant() for arm and arm64 systems,
      while also ensuring a re-evaluation of the EAS use conditions for
      possible invariance status change, we can assert this is the case when
      initializing EAS. Warn and bail out otherwise.
      Suggested-by: Quentin Perret <qperret@google.com>
      Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201027180713.7642-4-ionela.voinescu@arm.com
      fa50e2b4
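The invariance argument above can be illustrated with a minimal sketch in plain C. This is not kernel code: the helper names `scale_util_to_max` and `spare_capacity` and the `CAPACITY_SCALE` constant are assumptions made for illustration. It shows how a utilization sample that is relative to the current frequency hides spare capacity until it is rescaled against the maximum frequency.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative capacity scale (assumption for this sketch). */
#define CAPACITY_SCALE 1024u

/*
 * Hypothetical helper: rescale a utilization value measured relative to the
 * current frequency so that it becomes relative to the maximum frequency.
 */
static uint32_t scale_util_to_max(uint32_t util_at_cur, uint32_t cur_khz,
                                  uint32_t max_khz)
{
    assert(max_khz != 0 && cur_khz <= max_khz);
    return (uint32_t)(((uint64_t)util_at_cur * cur_khz) / max_khz);
}

/* Spare capacity as described above: CPU capacity minus CPU utilization. */
static uint32_t spare_capacity(uint32_t capacity, uint32_t util)
{
    return util >= capacity ? 0 : capacity - util;
}
```

With these helpers, a CPU that looks fully busy (1024) while running at half of its maximum frequency reports no spare capacity from the raw signal, but half of its capacity once the signal is made invariant.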
    • arm64: Rebuild sched domains on invariance status changes · ecec9e86
      Ionela Voinescu authored
      Task scheduler behavior depends on frequency invariance (FI) support and
      the resulting invariant load tracking signals. For example, in order to
      make accurate predictions across CPUs for all performance states, Energy
      Aware Scheduling (EAS) needs frequency-invariant load tracking signals
      and therefore it has a direct dependency on FI. This dependency is known,
      but EAS enablement is not yet conditioned on the presence of FI during
      the build of the scheduling domain hierarchy.
      
      Before this is done, the following must be considered: while
      arch_scale_freq_invariant() will see changes in FI support and could
      be used to condition the use of EAS, it could return different values
      during system initialisation.
      
      For arm64, such a scenario will happen for a system that does not support
      cpufreq driven FI, but does support counter-driven FI. For such a system,
      arch_scale_freq_invariant() will return false if called before counter
      based FI initialisation, but change its status to true after it.
      If EAS becomes explicitly dependent on FI this would affect the task
      scheduler behavior which builds its scheduling domain hierarchy well
      before the late counter-based FI init. During that process, EAS would be
      disabled due to its dependency on FI.
      
      Two points of future early calls to arch_scale_freq_invariant() which
      would determine EAS enablement are:
       - (1) drivers/base/arch_topology.c:126 <<update_topology_flags_workfn>>
      		rebuild_sched_domains();
             This will happen after CPU capacity initialisation.
       - (2) kernel/sched/cpufreq_schedutil.c:917 <<rebuild_sd_workfn>>
      		rebuild_sched_domains_energy();
      		-->rebuild_sched_domains();
             This will happen during sched_cpufreq_governor_change() for the
             schedutil cpufreq governor.
      
      Therefore, before enforcing the presence of FI support for the use of EAS,
      ensure the following: if there is a change in FI support status after
      counter init, use the existing rebuild_sched_domains_energy() function to
      trigger a rebuild of the scheduling and performance domains that in turn
      will determine the enablement of EAS.
      Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Link: https://lkml.kernel.org/r/20201027180713.7642-3-ionela.voinescu@arm.com
      ecec9e86
    • sched/topology,schedutil: Wrap sched domains rebuild · 31f6a8c0
      Ionela Voinescu authored
      Add the rebuild_sched_domains_energy() function to wrap the functionality
      that rebuilds the scheduling domains if any of the Energy Aware Scheduling
      (EAS) initialisation conditions change. This functionality is used when
      schedutil is added or removed or when EAS is enabled or disabled
      through the sched_energy_aware sysctl.
      
      Therefore, create a single function that is used in both these cases and
      that can be later reused.
      Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Quentin Perret <qperret@google.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lkml.kernel.org/r/20201027180713.7642-2-ionela.voinescu@arm.com
      31f6a8c0
    • sched/uclamp: Allow to reset a task uclamp constraint value · 480a6ca2
      Dietmar Eggemann authored
      In case the user wants to stop controlling a uclamp constraint value
      for a task, use the magic value -1 in sched_util_{min,max} with the
      appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
      the reset.
      
      The advantage over the 'additional flag' approach (i.e. introducing
      SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
      exported via uapi. This avoids the need to document how this new flag
      has to be used in conjunction with the existing uclamp related flags.
      
      The following subtle issue is fixed as well. When a uclamp constraint
      value is set on a !user_defined uclamp_se it is currently first reset
      and then set.
      Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
      stands for the 'sched class change' case.
      The related condition 'if (uc_se->user_defined)' moved from
      __setscheduler_uclamp() into uclamp_reset().
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Yun Hsiang <hsiang023167@gmail.com>
      Link: https://lkml.kernel.org/r/20201113113454.25868-1-dietmar.eggemann@arm.com
      480a6ca2
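As a rough illustration of the reset request described above, the following sketch fills in an attribute block with the magic value -1. It deliberately uses a local mirror of the layout (`struct ex_sched_attr`) and locally defined `EX_FLAG_*` constants, both hypothetical names for this example; the flag values are assumed to match the kernel's uapi `SCHED_FLAG_*` definitions, and combining them with the keep-policy/keep-params flags is an assumption about how a caller would avoid touching anything else. The block only builds the request; passing it to sched_setattr(2) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed to mirror the uapi SCHED_FLAG_* values; names are local to this sketch. */
#define EX_FLAG_KEEP_POLICY    0x08u
#define EX_FLAG_KEEP_PARAMS    0x10u
#define EX_FLAG_UTIL_CLAMP_MIN 0x20u
#define EX_FLAG_UTIL_CLAMP_MAX 0x40u

/* Local mirror of the attribute layout, to keep the example self-contained. */
struct ex_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
    uint32_t sched_util_min;
    uint32_t sched_util_max;
};

static_assert(sizeof(struct ex_sched_attr) == 56, "unexpected layout");

/* Build a request that resets the selected clamp(s) and changes nothing else. */
static struct ex_sched_attr make_uclamp_reset(bool reset_min, bool reset_max)
{
    struct ex_sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_flags = EX_FLAG_KEEP_POLICY | EX_FLAG_KEEP_PARAMS;

    if (reset_min) {
        attr.sched_flags |= EX_FLAG_UTIL_CLAMP_MIN;
        attr.sched_util_min = (uint32_t)-1;   /* magic value: reset */
    }
    if (reset_max) {
        attr.sched_flags |= EX_FLAG_UTIL_CLAMP_MAX;
        attr.sched_util_max = (uint32_t)-1;   /* magic value: reset */
    }
    return attr;
}
```

The design point from the commit message is visible here: no dedicated reset flag is needed, because the existing per-clamp flags select which value the -1 applies to.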
    • Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug · 9032dc21
      Barry Song authored
      This document seems to have been out of date for many, many years. It
      has even contained misspellings from the first day:
      ARCH_HASH_SCHED_TUNE should be ARCH_HAS_SCHED_TUNE
      ARCH_HASH_SCHED_DOMAIN should be ARCH_HAS_SCHED_DOMAIN
      
      Since v2.6.14, the kernel has completely deleted the relevant code, and
      even arch_init_sched_domains() was deleted.
      
      Right now, the kernel asks architectures to call set_sched_topology() to
      override the default sched domains.
      
      On the other hand, to print the scheduler debug information, users need to
      set sched_debug on the cmdline or enable it through a sysfs entry. So this
      patch also adds the description for sched_debug.
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20201113115018.1628-1-song.bao.hua@hisilicon.com
      9032dc21
    • sched/topology: Warn when NUMA diameter > 2 · b5b21734
      Valentin Schneider authored
      NUMA topologies where the shortest path between some two nodes requires
      three or more hops (i.e. diameter > 2) end up being misrepresented in the
      scheduler topology structures.
      
      This is currently detected when booting a kernel with CONFIG_SCHED_DEBUG=y
      + sched_debug on the cmdline, although this will only yield a warning about
      sched_group spans not matching sched_domain spans:
      
        ERROR: groups don't span domain->span
      
      Add an explicit warning for that case, triggered regardless of
      CONFIG_SCHED_DEBUG, and decorate it with an appropriate comment.
      
      The topology described in the comment can be booted up on QEMU by appending
      the following to your usual QEMU incantation:
      
          -smp cores=4 \
          -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
          -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
          -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
          -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
          -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20
      
      A somewhat more realistic topology (6-node mesh) with the same affliction
      can be conjured with:
      
          -smp cores=6 \
          -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
          -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
          -numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, \
          -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
          -numa dist,src=0,dst=3,val=40, -numa dist,src=0,dst=4,val=30, \
          -numa dist,src=0,dst=5,val=20, \
          -numa dist,src=1,dst=2,val=20, -numa dist,src=1,dst=3,val=30, \
          -numa dist,src=1,dst=4,val=20, -numa dist,src=1,dst=5,val=30, \
          -numa dist,src=2,dst=3,val=20, -numa dist,src=2,dst=4,val=30, \
          -numa dist,src=2,dst=5,val=40, \
          -numa dist,src=3,dst=4,val=20, -numa dist,src=3,dst=5,val=30, \
          -numa dist,src=4,dst=5,val=20
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com
      b5b21734
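To see why the two QEMU topologies above have a diameter greater than 2, the hop count can be derived from a distance table. The sketch below is a standalone illustration, not the kernel's detection logic: `numa_diameter`, `MAX_NODES` and `UNREACHABLE` are hypothetical names, and it assumes that two nodes are directly connected exactly when their distance equals the smallest remote distance in the table.

```c
#include <assert.h>
#include <limits.h>

#define MAX_NODES 8
#define UNREACHABLE (MAX_NODES + 1)

/* Return the largest shortest-path hop count between any two of the n nodes. */
static int numa_diameter(int n, int dist[][MAX_NODES])
{
    int hops[MAX_NODES][MAX_NODES];
    int min_remote = INT_MAX;
    int diameter = 0;

    assert(n > 0 && n <= MAX_NODES);

    /* Assumption: the smallest remote distance means "one hop away". */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j && dist[i][j] < min_remote)
                min_remote = dist[i][j];

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            hops[i][j] = (i == j) ? 0 :
                         (dist[i][j] == min_remote) ? 1 : UNREACHABLE;

    /* Floyd-Warshall over hop counts. */
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (hops[i][k] + hops[k][j] < hops[i][j])
                    hops[i][j] = hops[i][k] + hops[k][j];

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (hops[i][j] > diameter)
                diameter = hops[i][j];

    return diameter;
}
```

Feeding it the 4-node table from the first command line (remote distances 20, 30 and 40 along a line of nodes, with an assumed local distance of 10) yields a diameter of 3, which is the case the new warning targets.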
    • cpuset: fix race between hotplug work and later CPU offline · 406100f3
      Daniel Jordan authored
      One of our machines keeled over trying to rebuild the scheduler domains.
      Mainline produces the same splat:
      
        BUG: unable to handle page fault for address: 0000607f820054db
        CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
        Workqueue: events cpuset_hotplug_workfn
        RIP: build_sched_domains
        Call Trace:
         partition_sched_domains_locked
         rebuild_sched_domains_locked
         cpuset_hotplug_workfn
      
      It happens with cgroup2 and exclusive cpusets only.  This reproducer
      triggers it on an 8-cpu vm and works most effectively with no
      preexisting child cgroups:
      
        cd $UNIFIED_ROOT
        mkdir cg1
        echo 4-7 > cg1/cpuset.cpus
        echo root > cg1/cpuset.cpus.partition
      
        # with smt/control reading 'on',
        echo off > /sys/devices/system/cpu/smt/control
      
      RIP maps to
      
        sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
      
      from sd_init().  sd_id is calculated earlier in the same function:
      
        cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
        sd_id = cpumask_first(sched_domain_span(sd));
      
      tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
      and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
      value from per_cpu_ptr() above.
      
      The problem is a race between cpuset_hotplug_workfn() and a later
      offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
      when N is still online, the offline clears N from cpu_sibling_map, and
      then the worker uses the stale effective masks that still have N to
      generate the scheduling domains, leading the worker to read
      N's empty cpu_sibling_map in sd_init().
      
      rebuild_sched_domains_locked() prevented the race during the cgroup2
      cpuset series up until the Fixes commit changed its check.  Make the
      check more robust so that it can detect an offline CPU in any exclusive
      cpuset's effective mask, not just the top one.
      
      Fixes: 0ccea8fe ("cpuset: Make generate_sched_domains() work with partition")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
      406100f3
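The failure mode and the shape of the stronger check can be modelled with plain bitmasks. This is a simplified sketch under stated assumptions, not the cpuset code: `mask_first`, `partitions_all_active` and `NR_CPU_IDS` are hypothetical stand-ins, with one `uint64_t` playing the role of a cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPU_IDS 8u

/*
 * Model of a "first CPU in mask" lookup: returns the lowest set bit, or
 * NR_CPU_IDS when the mask is empty. Using that value as a per-CPU index
 * is what produces the bogus pointer described above.
 */
static unsigned int mask_first(uint64_t mask)
{
    for (unsigned int cpu = 0; cpu < NR_CPU_IDS; cpu++)
        if (mask & (UINT64_C(1) << cpu))
            return cpu;
    return NR_CPU_IDS;
}

/*
 * Model of the more robust check: every exclusive cpuset's effective mask,
 * not just the top one, must contain only CPUs that are still active.
 */
static bool partitions_all_active(const uint64_t *effective, int nr,
                                  uint64_t active)
{
    assert(nr >= 0);
    for (int i = 0; i < nr; i++)
        if (effective[i] & ~active)
            return false;
    return true;
}
```

In this model, a rebuild would be skipped whenever `partitions_all_active()` returns false, on the expectation that a later hotplug work item regenerates the domains from up-to-date masks.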
    • sched: Fix migration_cpu_stop() WARN · 1293771e
      Peter Zijlstra authored
      Oleksandr reported hitting the WARN in the 'task_rq(p) != rq' branch
      of migration_cpu_stop(). Valentin noted that using cpu_of(rq) in that
      case is just plain wrong to begin with, since per the earlier branch
      that isn't the actual CPU of the task.
      
      Replace both instances of is_cpu_allowed() by a direct p->cpus_mask
      test using task_cpu().
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Debugged-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      1293771e
    • sched/core: Add missing completion for affine_move_task() waiters · d707faa6
      Valentin Schneider authored
      Qian reported that some fuzzer issuing sched_setaffinity() ends up stuck on
      a wait_for_completion(). The problematic pattern seems to be:
      
        affine_move_task()
            // task_running() case
            stop_one_cpu();
            wait_for_completion(&pending->done);
      
      Combined with, on the stopper side:
      
        migration_cpu_stop()
          // Task moved between unlocks and scheduling the stopper
          task_rq(p) != rq &&
          // task_running() case
          dest_cpu >= 0
      
          => no complete_all()
      
      This can happen with both PREEMPT and !PREEMPT, although !PREEMPT should
      be more likely to see this given the targeted task has a much bigger window
      to block and be woken up elsewhere before the stopper runs.
      
      Make migration_cpu_stop() always look at pending affinity requests; signal
      their completion if the stopper hits a rq mismatch but the task is
      still within its allowed mask. When Migrate-Disable isn't involved, this
      matches the previous set_cpus_allowed_ptr() vs migration_cpu_stop()
      behaviour.
      
      Fixes: 6d337eab ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
      Reported-by: Qian Cai <cai@redhat.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/lkml/8b62fd1ad1b18def27f18e2ee2df3ff5b36d0762.camel@redhat.com
      d707faa6
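The rule added for the rq-mismatch case can be reduced to a small predicate. The sketch below is a deliberately simplified model and not the actual migration_cpu_stop() logic: `stopper_completes_pending` is a hypothetical name, the allowed mask is a single `uint64_t`, and locking is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the rq-mismatch branch: if an affinity request is pending and the
 * task already sits on a CPU inside its allowed mask, the waiter can be
 * completed instead of being left blocked forever.
 */
static bool stopper_completes_pending(bool has_pending, unsigned int task_cpu,
                                      uint64_t allowed_mask)
{
    assert(task_cpu < 64);
    return has_pending && (allowed_mask & (UINT64_C(1) << task_cpu)) != 0;
}
```

A false result with a pending request corresponds to the task having landed somewhere it is not allowed to run, in which case a further migration attempt would still be needed before anyone is woken.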
  2. 10 Nov, 2020 24 commits
  3. 29 Oct, 2020 6 commits
    • sched/fair: Check for idle core in wake_affine · d8fcb81f
      Julia Lawall authored
      In the case of a thread wakeup, wake_affine determines whether a core
      will be chosen for the thread on the socket where the thread ran
      previously or on the socket of the waker.  This is done primarily by
      comparing the load of the core where the thread ran previously (prev)
      and the load of the waker (this).
      
      commit 11f10e54 ("sched/fair: Use load instead of runnable load
      in wakeup path") changed the load computation from the runnable load
      to the load average, where the latter includes the load of threads
      that have already blocked on the core.
      
      When a short-running daemon process happens to run on prev, this
      change created the situation that prev could appear to have a greater
      load than this, even when prev is actually idle.  When prev and this
      are on the same socket, the idle prev is detected later, in
      select_idle_sibling.  But if that does not hold, prev is completely
      ignored, causing the waking thread to move to the socket of the waker.
      In the case of N mostly active threads on N cores, this triggers other
      migrations and hurts performance.
      
      In contrast, before commit 11f10e54, the load on an idle core
      was 0, and in the case of a non-idle waker core, the effect of
      wake_affine was to select prev as the target for searching for a core
      for the waking thread.
      
      To avoid unnecessary migrations, extend wake_affine_idle to check
      whether the core where the thread previously ran is currently idle,
      and if so simply return that core as the target.
      
      [1] commit 11f10e54 ("sched/fair: Use load instead of runnable
      load in wakeup path")
      
      This particularly has an impact when using the ondemand power manager,
      where kworkers run every 0.004 seconds on all cores, increasing the
      likelihood that an idle core will be considered to have a load.
      
      The following numbers were obtained with the benchmarking tool
      hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
      benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
      tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
      2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
      management were used.  Times are in seconds.  All experiments use all
      160 hardware threads.
      
      	v5.9/intel-pstate	v5.9+patch/intel-pstate
      bt.C.c	24.725724+-0.962340	23.349608+-1.607214
      lu.C.x	29.105952+-4.804203	25.249052+-5.561617
      sp.C.x	31.220696+-1.831335	30.227760+-2.429792
      ua.C.x	26.606118+-1.767384	25.778367+-1.263850
      
      	v5.9/ondemand		v5.9+patch/ondemand
      bt.C.c	25.330360+-1.028316	23.544036+-1.020189
      lu.C.x	35.872659+-4.872090	23.719295+-3.883848
      sp.C.x	32.141310+-2.289541	29.125363+-0.872300
      ua.C.x	29.024597+-1.667049	25.728888+-1.539772
      
      On the smaller data sets (A and B) and on the other NAS benchmarks
      there is no impact on performance.
      
      This also has a major impact on the splash2x.volrend benchmark of the
      parsec benchmark suite that goes from 1m25 without this patch to 0m45,
      in active (intel_pstate) mode.
      
      Fixes: 11f10e54 ("sched/fair: Use load instead of runnable load in wakeup path")
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
      d8fcb81f
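The extra check described above amounts to a short-circuit on the previous core's idle state. The following sketch models only that added step, with hypothetical names (`pick_idle_prev`, a `cpu_idle` array standing in for the scheduler's idle test) and with `nr_cpus` returned as an assumed "no decision" value so that a caller would fall through to the load comparison.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * If the core where the thread previously ran is idle, pick it directly and
 * avoid a cross-socket migration; otherwise return nr_cpus ("no decision").
 */
static int pick_idle_prev(const bool *cpu_idle, int nr_cpus, int prev_cpu)
{
    assert(prev_cpu >= 0 && prev_cpu < nr_cpus);
    return cpu_idle[prev_cpu] ? prev_cpu : nr_cpus;
}
```

The point of checking idleness rather than load is that an idle core can still carry a non-zero load average from threads that recently blocked on it, which is exactly the case the load comparison gets wrong.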
    • sched: Remove relyance on STRUCT_ALIGNMENT · 43c31ac0
      Peter Zijlstra authored
      Florian reported that all of kernel/sched/ is rebuilt when
      CONFIG_BLK_DEV_INITRD is changed, which, while not a bug, is
      unexpected. This is due to us including vmlinux.lds.h.
      
      Jakub explained that the problem is that we put the alignment
      requirement on the type instead of on a variable. Type alignment is a
      minimum, the compiler is free to pick any larger alignment for a
      specific instance of the type (e.g. the variable).
      
      So force the type alignment on all individual variable definitions and
      remove the undesired dependency on vmlinux.lds.h.
      
      Fixes: 85c2ce91 ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")
      Reported-by: Florian Fainelli <f.fainelli@gmail.com>
      Suggested-by: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      43c31ac0
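The distinction between aligning a type and aligning each variable can be shown in standard C11. This is a user-space sketch, not the kernel macro from the patch: `struct sched_class_like`, `fair_like`, `idle_like` and `is_aligned_64` are hypothetical names, and the 64-byte figure is taken from the Fixes line above.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* A small stand-in type; its natural alignment is well below 64 bytes. */
struct sched_class_like {
    int priority;
    void (*pick)(void);
};

static_assert(alignof(struct sched_class_like) < 64,
              "the type itself carries no 64-byte requirement");

/* The alignment is forced on every individual variable definition. */
static alignas(64) struct sched_class_like fair_like = { .priority = 1 };
static alignas(64) struct sched_class_like idle_like = { .priority = 0 };

/* Check that an object's address is a multiple of 64. */
static int is_aligned_64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}
```

Because the requirement sits on the definitions, each instance gets exactly the requested alignment, and nothing outside the translation unit (such as a linker-script header) has to agree on it.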
    • sched: Reenable interrupts in do_sched_yield() · 345a957f
      Thomas Gleixner authored
      do_sched_yield() invokes schedule() with interrupts disabled which is
      not allowed. This goes back to the pre git era to commit a6efb709
      ("[PATCH] irqlock patch 2.5.27-H6") in the history tree.
      
      Reenable interrupts and remove the misleading comment which "explains" it.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
      345a957f
    • sched: membarrier: document memory ordering scenarios · 25595eb6
      Mathieu Desnoyers authored
      Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
      Stern for refreshing my memory. Now that I have those in mind, it seems
      appropriate to serialize them to comments for posterity.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoyers@efficios.com
      25595eb6
    • sched: membarrier: cover kthread_use_mm (v4) · 618758ed
      Mathieu Desnoyers authored
      Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
      to allow the effect of membarrier(2) to apply to kthreads accessing
      user-space memory as well.
      
      Given that no prior kthread uses this guarantee and that it only affects
      kthreads, adding this guarantee does not affect user-space ABI.
      
      Refine the check in membarrier_global_expedited to exclude runqueues
      running the idle thread rather than all kthreads from the IPI cpumask.
      
      Now that membarrier_global_expedited can IPI kthreads, the scheduler
      also needs to update the runqueue's membarrier_state when entering lazy
      TLB state.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
      618758ed
    • sched: fix exit_mm vs membarrier (v4) · 5bc78502
      Mathieu Desnoyers authored
      exit_mm should issue memory barriers after user-space memory accesses,
      before clearing current->mm, to order user-space memory accesses
      performed prior to exit_mm before clearing tsk->mm, which has the
      effect of skipping the membarrier private expedited IPIs.
      
      exit_mm should also update the runqueue's membarrier_state so
      membarrier global expedited IPIs are not sent when they are not
      needed.
      
      The membarrier system call can be issued concurrently with do_exit
      if we have thread groups created with CLONE_VM but not CLONE_THREAD.
      
      Here is the scenario I have in mind:
      
      Two thread groups are created, A and B. Thread group B is created by
      issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
      Let's assume we have a single thread within each thread group (Thread A
      and Thread B).
      
      Then, AFAIU, we can have:
      
      Userspace variables:
      
      int x = 0, y = 0;
      
      CPU 0                   CPU 1
      Thread A                Thread B
      (in thread group A)     (in thread group B)
      
      x = 1
      barrier()
      y = 1
      exit()
      exit_mm()
      current->mm = NULL;
                              r1 = load y
                              membarrier()
                                skips CPU 0 (no IPI) because its current mm is NULL
                              r2 = load x
                              BUG_ON(r1 == 1 && r2 == 0)
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
      5bc78502