1. 19 Nov, 2020 6 commits
    • Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug · 9032dc21
      Barry Song authored
      This document seems to have been out of date for many years. It has
      even contained misspellings since the first day:
      ARCH_HASH_SCHED_TUNE should be ARCH_HAS_SCHED_TUNE
      ARCH_HASH_SCHED_DOMAIN should be ARCH_HAS_SCHED_DOMAIN
      
      Since v2.6.14, the kernel has completely removed the relevant code;
      even arch_init_sched_domains() was deleted.
      
      Right now, the kernel asks architectures to call set_sched_topology() to
      override the default sched domains.
      
      On the other hand, to print the scheduler debug information, users need
      to set sched_debug on the cmdline or enable it via a sysfs entry. So this
      patch also adds the description for sched_debug.
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20201113115018.1628-1-song.bao.hua@hisilicon.com
      9032dc21
    • sched/topology: Warn when NUMA diameter > 2 · b5b21734
      Valentin Schneider authored
      NUMA topologies where the shortest path between some two nodes requires
      three or more hops (i.e. diameter > 2) end up being misrepresented in the
      scheduler topology structures.
      
      This is currently detected when booting a kernel with CONFIG_SCHED_DEBUG=y
      + sched_debug on the cmdline, although this will only yield a warning about
      sched_group spans not matching sched_domain spans:
      
        ERROR: groups don't span domain->span
      
      Add an explicit warning for that case, triggered regardless of
      CONFIG_SCHED_DEBUG, and decorate it with an appropriate comment.
      
      The topology described in the comment can be booted up on QEMU by appending
      the following to your usual QEMU incantation:
      
          -smp cores=4 \
          -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
          -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
          -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
          -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
          -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20
      
      A somewhat more realistic topology (6-node mesh) with the same affliction
      can be conjured with:
      
          -smp cores=6 \
          -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
          -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
          -numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, \
          -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
          -numa dist,src=0,dst=3,val=40, -numa dist,src=0,dst=4,val=30, \
          -numa dist,src=0,dst=5,val=20, \
          -numa dist,src=1,dst=2,val=20, -numa dist,src=1,dst=3,val=30, \
          -numa dist,src=1,dst=4,val=20, -numa dist,src=1,dst=5,val=30, \
          -numa dist,src=2,dst=3,val=20, -numa dist,src=2,dst=4,val=30, \
          -numa dist,src=2,dst=5,val=40, \
          -numa dist,src=3,dst=4,val=20, -numa dist,src=3,dst=5,val=30, \
          -numa dist,src=4,dst=5,val=20
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com
      b5b21734
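The "diameter > 2" condition for the 4-node line topology in the commit above can be checked with a small userspace sketch. It assumes that two nodes are direct neighbours when their distance equals the smallest remote distance in the table (20 here) and that the self-distance is 10; the function names and the neighbour rule are illustrative assumptions, not the kernel's implementation.

```c
#include <limits.h>

#define NR_NODES 4
#define LOCAL_DISTANCE 10 /* assumed self-distance; not stated in the commit */

/* Distance table of the 4-node line topology from the QEMU example. */
static const int numa_dist[NR_NODES][NR_NODES] = {
    { LOCAL_DISTANCE, 20, 30, 40 },
    { 20, LOCAL_DISTANCE, 20, 30 },
    { 30, 20, LOCAL_DISTANCE, 20 },
    { 40, 30, 20, LOCAL_DISTANCE },
};

/* Smallest distance between two distinct nodes: assumed to mean "one hop". */
static int min_remote_distance(void)
{
    int best = INT_MAX;

    for (int i = 0; i < NR_NODES; i++)
        for (int j = 0; j < NR_NODES; j++)
            if (i != j && numa_dist[i][j] < best)
                best = numa_dist[i][j];
    return best;
}

/* Largest hop count on any shortest path (Floyd-Warshall over hop counts). */
int numa_diameter(void)
{
    int hops[NR_NODES][NR_NODES];
    const int one_hop = min_remote_distance();
    const int unreachable = NR_NODES; /* longer than any real path */
    int diameter = 0;

    for (int i = 0; i < NR_NODES; i++)
        for (int j = 0; j < NR_NODES; j++)
            hops[i][j] = (i == j) ? 0 :
                         (numa_dist[i][j] == one_hop ? 1 : unreachable);

    for (int k = 0; k < NR_NODES; k++)
        for (int i = 0; i < NR_NODES; i++)
            for (int j = 0; j < NR_NODES; j++)
                if (hops[i][k] + hops[k][j] < hops[i][j])
                    hops[i][j] = hops[i][k] + hops[k][j];

    for (int i = 0; i < NR_NODES; i++)
        for (int j = 0; j < NR_NODES; j++)
            if (hops[i][j] > diameter)
                diameter = hops[i][j];
    return diameter;
}
```

Calling numa_diameter() on this table yields 3 (node 0 to node 3 needs three hops), which is the case the new warning is meant to flag.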
    • cpuset: fix race between hotplug work and later CPU offline · 406100f3
      Daniel Jordan authored
      One of our machines keeled over trying to rebuild the scheduler domains.
      Mainline produces the same splat:
      
        BUG: unable to handle page fault for address: 0000607f820054db
        CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
        Workqueue: events cpuset_hotplug_workfn
        RIP: build_sched_domains
        Call Trace:
         partition_sched_domains_locked
         rebuild_sched_domains_locked
         cpuset_hotplug_workfn
      
      It happens with cgroup2 and exclusive cpusets only.  This reproducer
      triggers it on an 8-cpu vm and works most effectively with no
      preexisting child cgroups:
      
        cd $UNIFIED_ROOT
        mkdir cg1
        echo 4-7 > cg1/cpuset.cpus
        echo root > cg1/cpuset.cpus.partition
      
        # with smt/control reading 'on',
        echo off > /sys/devices/system/cpu/smt/control
      
      RIP maps to
      
        sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
      
      from sd_init().  sd_id is calculated earlier in the same function:
      
        cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
        sd_id = cpumask_first(sched_domain_span(sd));
      
      tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
      and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
      value from per_cpu_ptr() above.
      
      The problem is a race between cpuset_hotplug_workfn() and a later
      offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
      when N is still online, the offline clears N from cpu_sibling_map, and
      then the worker uses the stale effective masks that still have N to
      generate the scheduling domains, leading the worker to read
      N's empty cpu_sibling_map in sd_init().
      
      rebuild_sched_domains_locked() prevented the race during the cgroup2
      cpuset series up until the Fixes commit changed its check.  Make the
      check more robust so that it can detect an offline CPU in any exclusive
      cpuset's effective mask, not just the top one.
      
      Fixes: 0ccea8fe ("cpuset: Make generate_sched_domains() work with partition")
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
      406100f3
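The two ingredients of the cpuset race above can be modelled in userspace: a first-set-bit lookup that returns an out-of-range index for an empty mask, and a check that refuses to rebuild while any partition's effective mask still contains an offline CPU. This is a sketch with hypothetical toy_ names and an 8-bit mask standing in for struct cpumask; it is not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPU_IDS 8

/* Toy cpumask: bit N set means CPU N is in the mask. */
typedef uint8_t toy_cpumask_t;

/* Index of the first set bit, or NR_CPU_IDS when the mask is empty. */
int toy_cpumask_first(toy_cpumask_t mask)
{
    for (int cpu = 0; cpu < NR_CPU_IDS; cpu++)
        if (mask & (1u << cpu))
            return cpu;
    return NR_CPU_IDS;
}

/* True if every CPU in 'sub' is also present in 'super'. */
bool toy_cpumask_subset(toy_cpumask_t sub, toy_cpumask_t super)
{
    return (sub & ~super) == 0;
}

/*
 * Rebuilding is treated as safe only when no partition's effective mask
 * still holds a CPU that is missing from the active mask.
 */
bool toy_partitions_safe_to_rebuild(const toy_cpumask_t *effective,
                                    int nr_partitions,
                                    toy_cpumask_t active)
{
    for (int i = 0; i < nr_partitions; i++)
        if (!toy_cpumask_subset(effective[i], active))
            return false;
    return true;
}
```

With partitions 0-3 and 4-7, taking CPU 5 out of the active mask makes the check fail for the second partition, even though the top-level mask alone would not reveal the stale entry.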
    • sched: Fix migration_cpu_stop() WARN · 1293771e
      Peter Zijlstra authored
      Oleksandr reported hitting the WARN in the 'task_rq(p) != rq' branch
      of migration_cpu_stop(). Valentin noted that using cpu_of(rq) in that
      case is just plain wrong to begin with, since per the earlier branch
      that isn't the actual CPU of the task.
      
      Replace both instances of is_cpu_allowed() by a direct p->cpus_mask
      test using task_cpu().
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Debugged-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      1293771e
    • sched/core: Add missing completion for affine_move_task() waiters · d707faa6
      Valentin Schneider authored
      Qian reported that some fuzzer issuing sched_setaffinity() ends up stuck on
      a wait_for_completion(). The problematic pattern seems to be:
      
        affine_move_task()
            // task_running() case
            stop_one_cpu();
            wait_for_completion(&pending->done);
      
      Combined with, on the stopper side:
      
        migration_cpu_stop()
          // Task moved between unlocks and scheduling the stopper
          task_rq(p) != rq &&
          // task_running() case
          dest_cpu >= 0
      
          => no complete_all()
      
      This can happen with both PREEMPT and !PREEMPT, although !PREEMPT should
      be more likely to see this given the targeted task has a much bigger window
      to block and be woken up elsewhere before the stopper runs.
      
      Make migration_cpu_stop() always look at pending affinity requests; signal
      their completion if the stopper hits a rq mismatch but the task is
      still within its allowed mask. When Migrate-Disable isn't involved, this
      matches the previous set_cpus_allowed_ptr() vs migration_cpu_stop()
      behaviour.
      
      Fixes: 6d337eab ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
      Reported-by: Qian Cai <cai@redhat.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/lkml/8b62fd1ad1b18def27f18e2ee2df3ff5b36d0762.camel@redhat.com
      d707faa6
  2. 10 Nov, 2020 24 commits
  3. 29 Oct, 2020 10 commits
    • sched/fair: Check for idle core in wake_affine · d8fcb81f
      Julia Lawall authored
      In the case of a thread wakeup, wake_affine determines whether a core
      will be chosen for the thread on the socket where the thread ran
      previously or on the socket of the waker.  This is done primarily by
      comparing the load of the core where the thread ran previously (prev)
      and the load of the waker (this).
      
      commit 11f10e54 ("sched/fair: Use load instead of runnable load
      in wakeup path") changed the load computation from the runnable load
      to the load average, where the latter includes the load of threads
      that have already blocked on the core.
      
      When a short-running daemon process happens to run on prev, this
      change created the situation where prev could appear to have a greater
      load than this, even when prev is actually idle.  When prev and this
      are on the same socket, the idle prev is detected later, in
      select_idle_sibling.  But if that does not hold, prev is completely
      ignored, causing the waking thread to move to the socket of the waker.
      In the case of N mostly active threads on N cores, this triggers other
      migrations and hurts performance.
      
      In contrast, before commit 11f10e54, the load on an idle core
      was 0, and in the case of a non-idle waker core, the effect of
      wake_affine was to select prev as the target for searching for a core
      for the waking thread.
      
      To avoid unnecessary migrations, extend wake_affine_idle to check
      whether the core where the thread previously ran is currently idle,
      and if so simply return that core as the target.
      
      [1] commit 11f10e54 ("sched/fair: Use load instead of runnable
      load in wakeup path")
      
      This particularly has an impact when using the ondemand power manager,
      where kworkers run every 0.004 seconds on all cores, increasing the
      likelihood that an idle core will be considered to have a load.
      
      The following numbers were obtained with the benchmarking tool
      hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
      benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
      tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
      2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
      management were used.  Times are in seconds.  All experiments use all
      160 hardware threads.
      
      	v5.9/intel-pstate	v5.9+patch/intel-pstate
      bt.C.c	24.725724+-0.962340	23.349608+-1.607214
      lu.C.x	29.105952+-4.804203	25.249052+-5.561617
      sp.C.x	31.220696+-1.831335	30.227760+-2.429792
      ua.C.x	26.606118+-1.767384	25.778367+-1.263850
      
      	v5.9/ondemand		v5.9+patch/ondemand
      bt.C.c	25.330360+-1.028316	23.544036+-1.020189
      lu.C.x	35.872659+-4.872090	23.719295+-3.883848
      sp.C.x	32.141310+-2.289541	29.125363+-0.872300
      ua.C.x	29.024597+-1.667049	25.728888+-1.539772
      
      On the smaller data sets (A and B) and on the other NAS benchmarks
      there is no impact on performance.
      
      This also has a major impact on the splash2x.volrend benchmark of the
      parsec benchmark suite that goes from 1m25 without this patch to 0m45,
      in active (intel_pstate) mode.
      
      Fixes: 11f10e54 ("sched/fair: Use load instead of runnable load in wakeup path")
      Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
      d8fcb81f
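The idle-prev check described in the commit above can be sketched as a userspace model. The struct, the toy_ names and the first two branches are assumptions about the surrounding decision logic, included only to show where the new check would sit; only the final "prev is idle" test corresponds to what the commit message describes.

```c
#include <stdbool.h>

#define NR_CPUS 4
#define NO_TARGET NR_CPUS /* stand-in for "no decision made here" */

/* Hypothetical per-CPU state for the model. */
struct toy_cpu {
    bool idle;       /* nothing is running on the CPU right now */
    int nr_running;  /* number of runnable tasks */
    int llc_id;      /* CPUs with the same llc_id share a cache */
};

int toy_wake_affine_idle(const struct toy_cpu *cpus, int this_cpu,
                         int prev_cpu, bool sync)
{
    /* Waker CPU is idle and shares a cache with prev: prefer an idle prev. */
    if (cpus[this_cpu].idle &&
        cpus[this_cpu].llc_id == cpus[prev_cpu].llc_id)
        return cpus[prev_cpu].idle ? prev_cpu : this_cpu;

    /* Synchronous wakeup where the waker is the only runnable task. */
    if (sync && cpus[this_cpu].nr_running == 1)
        return this_cpu;

    /* Check described by the commit: an idle prev is returned directly. */
    if (cpus[prev_cpu].idle)
        return prev_cpu;

    return NO_TARGET;
}
```

In this model a busy waker on one socket and an idle prev on another returns prev, so the waking thread stays where it ran before instead of migrating to the waker's socket.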
    • sched: Remove relyance on STRUCT_ALIGNMENT · 43c31ac0
      Peter Zijlstra authored
      Florian reported that all of kernel/sched/ is rebuilt when
      CONFIG_BLK_DEV_INITRD is changed, which, while not a bug, is
      unexpected. This is due to us including vmlinux.lds.h.
      
      Jakub explained that the problem is that we put the alignment
      requirement on the type instead of on a variable. Type alignment is a
      minimum; the compiler is free to pick any larger alignment for a
      specific instance of the type (e.g. the variable).
      
      So force the type alignment on all individual variable definitions and
      remove the undesired dependency on vmlinux.lds.h.
      
      Fixes: 85c2ce91 ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")
      Reported-by: Florian Fainelli <f.fainelli@gmail.com>
      Suggested-by: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      43c31ac0
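The approach in the commit above, stating the alignment on each variable definition rather than only on the type, can be illustrated with a short sketch. It relies on the GCC/Clang __aligned__ and __alignof__ extensions; struct toy_class, the macro and the variable names are hypothetical and do not mirror the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scheduler class descriptor. */
struct toy_class {
    int weight;
    const char *name;
};

/* Alignment stated explicitly on the variable, taken from the type. */
#define TOY_CLASS_ALIGN \
    __attribute__((__aligned__(__alignof__(struct toy_class))))

const struct toy_class toy_fair TOY_CLASS_ALIGN = { 1, "fair" };
const struct toy_class toy_idle TOY_CLASS_ALIGN = { 0, "idle" };

/* Returns nonzero if 'ptr' is a multiple of 'alignment'. */
int toy_is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

The point of the sketch is the placement of the attribute: every definition carries it, so the layout no longer depends on whatever alignment the compiler would otherwise choose for an individual instance.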
    • sched: Reenable interrupts in do_sched_yield() · 345a957f
      Thomas Gleixner authored
      do_sched_yield() invokes schedule() with interrupts disabled, which is
      not allowed. This goes back to the pre-git era, to commit a6efb709
      ("[PATCH] irqlock patch 2.5.27-H6") in the history tree.
      
      Reenable interrupts and remove the misleading comment which "explains" it.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
      345a957f
    • sched: membarrier: document memory ordering scenarios · 25595eb6
      Mathieu Desnoyers authored
      Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
      Stern for refreshing my memory. Now that I have those in mind, it seems
      appropriate to serialize them to comments for posterity.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoyers@efficios.com
      25595eb6
    • sched: membarrier: cover kthread_use_mm (v4) · 618758ed
      Mathieu Desnoyers authored
      Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
      to allow the effect of membarrier(2) to apply to kthreads accessing
      user-space memory as well.
      
      Given that no prior kthread uses this guarantee and that it only affects
      kthreads, adding this guarantee does not affect user-space ABI.
      
      Refine the check in membarrier_global_expedited to exclude runqueues
      running the idle thread rather than all kthreads from the IPI cpumask.
      
      Now that membarrier_global_expedited can IPI kthreads, the scheduler
      also needs to update the runqueue's membarrier_state when entering lazy
      TLB state.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
      618758ed
    • sched: fix exit_mm vs membarrier (v4) · 5bc78502
      Mathieu Desnoyers authored
      exit_mm should issue memory barriers after user-space memory accesses
      and before clearing current->mm. This orders user-space memory accesses
      performed prior to exit_mm before the clearing of tsk->mm, which has the
      effect of skipping the membarrier private expedited IPIs.
      
      exit_mm should also update the runqueue's membarrier_state so
      membarrier global expedited IPIs are not sent when they are not
      needed.
      
      The membarrier system call can be issued concurrently with do_exit
      if we have thread groups created with CLONE_VM but not CLONE_THREAD.
      
      Here is the scenario I have in mind:
      
      Two thread groups are created, A and B. Thread group B is created by
      issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
      Let's assume we have a single thread within each thread group (Thread A
      and Thread B).
      
      Then AFAIU we can have:
      
      Userspace variables:
      
      int x = 0, y = 0;
      
      CPU 0                   CPU 1
      Thread A                Thread B
      (in thread group A)     (in thread group B)
      
      x = 1
      barrier()
      y = 1
      exit()
      exit_mm()
      current->mm = NULL;
                              r1 = load y
                              membarrier()
                                skips CPU 0 (no IPI) because its current mm is NULL
                              r2 = load x
                              BUG_ON(r1 == 1 && r2 == 0)
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
      5bc78502
    • sched/fair: Exclude the current CPU from find_new_ilb() · 45da7a2b
      Peter Zijlstra authored
      It is possible for find_new_ilb() to select the current CPU; however,
      this only happens from newidle balancing, in which case need_resched()
      will be true, and consequently nohz_csd_func() will not trigger the
      softirq.
      
      Exclude the current CPU from becoming an ILB target.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      45da7a2b
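Excluding the current CPU from the ILB search, as the commit above describes, amounts to one extra skip in the candidate loop. The following is a minimal userspace sketch with an 8-bit idle mask; the toy_ name, the mask type and the "return NR_CPU_IDS when nothing is found" convention are assumptions for illustration.

```c
#include <stdint.h>

#define NR_CPU_IDS 8

/*
 * Return the first CPU set in 'idle_mask' that is not 'this_cpu',
 * or NR_CPU_IDS when no other idle CPU exists.
 */
int toy_find_new_ilb(uint8_t idle_mask, int this_cpu)
{
    for (int cpu = 0; cpu < NR_CPU_IDS; cpu++) {
        if (cpu == this_cpu)
            continue; /* never pick ourselves as the ILB target */
        if (idle_mask & (1u << cpu))
            return cpu;
    }
    return NR_CPU_IDS;
}
```

When the caller is the only CPU in the idle mask, the sketch reports that no target is available instead of returning the caller itself.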
    • sched/cpupri: Add CPUPRI_HIGHER · b13772f8
      Peter Zijlstra authored
      Add CPUPRI_HIGHER above the RT99 priority to denote the CPU is in use
      by higher priority tasks (specifically deadline).
      
      XXX: we should probably drive PUSH-PULL from cpupri, that would
      automagically result in an RT-PUSH when DL sets cpupri to CPUPRI_HIGHER.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      b13772f8
    • sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1 · 934fc331
      Peter Zijlstra authored
      This makes the mapping continuous and frees up 100 for other usage.
      
      Prev mapping:
      
      p->rt_priority   p->prio   newpri   cpupri
      
                                     -1       -1 (CPUPRI_INVALID)
      
                                    100        0 (CPUPRI_NORMAL)
      
                   1        98       98        1
                 ...
                  49        50       50       49
                  50        49       49       50
                 ...
                  99         0        0       99
      
      New mapping:
      
      p->rt_priority   p->prio   newpri   cpupri
      
                                     -1       -1 (CPUPRI_INVALID)
      
                                     99        0 (CPUPRI_NORMAL)
      
                   1        98       98        1
                 ...
                  49        50       50       49
                  50        49       49       50
                 ...
                  99         0        0       99
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      934fc331
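The "New mapping" table in the commit above can be written as a small function from newpri to cpupri. This is a sketch that reproduces only the rows shown in the table; the TOY_ constants and the function name are hypothetical, and inputs the table does not list are treated as invalid here, which is an assumption of the sketch.

```c
#define TOY_MAX_RT_PRIO    100
#define TOY_CPUPRI_INVALID (-1)
#define TOY_CPUPRI_NORMAL  0

/* Map newpri to cpupri following the "New mapping" table. */
int toy_convert_prio(int newpri)
{
    if (newpri == TOY_CPUPRI_INVALID)
        return TOY_CPUPRI_INVALID;

    /* Normal (non-RT) CPUs are now reported as MAX_RT_PRIO-1, i.e. 99. */
    if (newpri == TOY_MAX_RT_PRIO - 1)
        return TOY_CPUPRI_NORMAL;

    /* RT range: p->prio 98..0 maps to cpupri 1..99. */
    if (newpri >= 0 && newpri < TOY_MAX_RT_PRIO - 1)
        return TOY_MAX_RT_PRIO - 1 - newpri;

    /* Not covered by the table; treated as invalid in this sketch. */
    return TOY_CPUPRI_INVALID;
}
```

Because 99 now stands for the normal level, the input values 0..99 form one continuous range and 100 is no longer needed for it.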
    • sched/cpupri: Remove pri_to_cpu[1] · 1b08782c
      Dietmar Eggemann authored
      pri_to_cpu[1] isn't used since cpupri_set(..., newpri) is
      never called with newpri = 99.
      
      The valid RT priorities RT1..RT99 (p->rt_priority = [1..99]) map into
      cpupri (idx of pri_to_cpu[]) = [2..100]
      
      Current mapping:
      
      p->rt_priority   p->prio   newpri   cpupri
      
                                     -1       -1 (CPUPRI_INVALID)
      
                                    100        0 (CPUPRI_NORMAL)
      
                   1        98       98        2
                 ...
                  49        50       50       50
                  50        49       49       51
                 ...
                  99         0        0      100
      
      So cpupri = 1 isn't used.
      
      Reduce the size of pri_to_cpu[] by 1 and adapt the cpupri
      implementation accordingly. This will save a useless for loop with an
      atomic_read in cpupri_find_fitness() calling __cpupri_find().
      
      New mapping:
      
      p->rt_priority   p->prio   newpri   cpupri
      
                                     -1       -1 (CPUPRI_INVALID)
      
                                    100        0 (CPUPRI_NORMAL)
      
                   1        98       98        1
                 ...
                  49        50       50       49
                  50        49       49       50
                 ...
                  99         0        0       99
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200922083934.19275-3-dietmar.eggemann@arm.com
      1b08782c
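The claim in the commit above that index 1 of pri_to_cpu[] was never used can be checked by enumerating every newpri value that is actually passed in. The sketch below encodes the "Current mapping" and "New mapping" tables as two hypothetical functions and scans the valid inputs; per the commit message, newpri = 99 is never passed, so it is skipped.

```c
#include <stdbool.h>

/* "Current mapping" table: newpri 98..0 -> cpupri 2..100, 100 -> 0. */
int toy_old_cpupri(int newpri)
{
    if (newpri == -1)
        return -1;
    if (newpri >= 100)
        return 0;
    return 100 - newpri;
}

/* "New mapping" table: newpri 98..0 -> cpupri 1..99, 100 -> 0. */
int toy_new_cpupri(int newpri)
{
    if (newpri == -1)
        return -1;
    if (newpri >= 100)
        return 0;
    return 99 - newpri;
}

/* True if any newpri that is actually passed in maps to index 'idx'. */
bool toy_index_used(int (*map)(int), int idx)
{
    for (int newpri = 0; newpri <= 100; newpri++) {
        if (newpri == 99)
            continue; /* never passed, per the commit message */
        if (map(newpri) == idx)
            return true;
    }
    return false;
}
```

Scanning with toy_old_cpupri finds no input that lands on index 1, while toy_new_cpupri reaches it from newpri = 98, which is why the array can shrink by one entry.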