1. 27 Aug, 2021 1 commit
    • eventfd: Make signal recursion protection a task bit · b542e383
      Thomas Gleixner authored
      The recursion protection for eventfd_signal() is based on a per CPU
      variable and relies on the !RT semantics of spin_lock_irqsave() for
      protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
      disables preemption nor interrupts which allows the spin lock held section
      to be preempted. If the preempting task invokes eventfd_signal() as well,
      then the recursion warning triggers.
      
      Paolo suggested to protect the per CPU variable with a local lock, but
      that's heavyweight and actually not necessary. The goal of this protection
      is to prevent the task stack from overflowing, which can be achieved with a
      per task recursion protection as well.
      
      Replace the per CPU variable with a per task bit similar to other recursion
      protection bits like task_struct::in_page_owner. This works on both !RT and
      RT kernels and removes as a side effect the extra per CPU storage.
      
      No functional change for !RT kernels.
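The per-task guard described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: `struct task`, its `in_signal` bit, `signal_event()` and `reentrant_waiter()` are hypothetical stand-ins for task_struct, the new recursion bit, eventfd_signal() and a wakeup callback that signals again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct: one recursion bit per task. */
struct task {
	unsigned int in_signal:1;	/* set while this task is signalling */
	int delivered;			/* events delivered by this task */
	int rejected;			/* recursive attempts that were refused */
};

static bool signal_event(struct task *t, void (*wake)(struct task *));

/* A waiter callback that tries to signal again from the wakeup path. */
static void reentrant_waiter(struct task *t)
{
	if (!signal_event(t, NULL))
		t->rejected++;
}

/*
 * Deliver one event on behalf of task t. Returns false if the same task
 * is already inside signal_event(), which is the recursion case. Because
 * the flag lives in the task, another task running in between (for
 * example after preemption) is not mistaken for recursion.
 */
static bool signal_event(struct task *t, void (*wake)(struct task *))
{
	assert(t != NULL);

	if (t->in_signal)
		return false;

	t->in_signal = 1;
	t->delivered++;
	if (wake)
		wake(t);
	t->in_signal = 0;
	return true;
}
```

With a per-CPU counter, a second task that preempts the first one inside the protected section would look like recursion; with the per-task bit only genuine nesting on the same task is refused.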
      Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
  2. 26 Aug, 2021 1 commit
    • sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case · 366e7ad6
      Ingo Molnar authored
      It's not actually used in the !CONFIG_FAIR_GROUP_SCHED case:
      
        kernel/sched/fair.c:488:12: warning: ‘tg_is_idle’ defined but not used [-Wunused-function]
      
      Keep around a placeholder nevertheless, for API completeness. Mark it inline,
      so the compiler doesn't think it must be used.
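As a rough illustration of the placeholder pattern, the sketch below shows a stub marked `static inline`. The name `tg_is_idle_stub` and the opaque `struct task_group` declaration are assumptions made for this example; the actual kernel definition may differ.

```c
/* Opaque placeholder type; only pointers to it are used in this sketch. */
struct task_group;

/*
 * Stub for configurations without group scheduling. A plain "static"
 * function that is never called triggers -Wunused-function, whereas a
 * "static inline" one does not, so the placeholder can stay for API
 * completeness.
 */
static inline int tg_is_idle_stub(struct task_group *tg)
{
	(void)tg;	/* unused in the stub */
	return 0;	/* without task groups, no group is idle */
}
```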
      
      Fixes: 30400039 ("sched: Cgroup SCHED_IDLE support")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Don <joshdon@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
  3. 20 Aug, 2021 11 commits
  4. 10 Aug, 2021 1 commit
  5. 06 Aug, 2021 3 commits
  6. 04 Aug, 2021 6 commits
  7. 28 Jun, 2021 5 commits
  8. 24 Jun, 2021 5 commits
    • sched/doc: Update the CPU capacity asymmetry bits · adf3c31e
      Beata Michalska authored
      Update the documentation bits referring to capacity aware scheduling
      with regards to newly introduced SD_ASYM_CPUCAPACITY_FULL sched_domain
      flag.
      Signed-off-by: Beata Michalska <beata.michalska@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lore.kernel.org/r/20210603140627.8409-4-beata.michalska@arm.com
    • sched/topology: Rework CPU capacity asymmetry detection · c744dc4a
      Beata Michalska authored
      Currently the CPU capacity asymmetry detection, performed through
      asym_cpu_capacity_level, tries to identify the lowest topology level
      at which the highest CPU capacity is observed. That is not necessarily
      the level at which all possible capacity values are visible to all
      CPUs, which can be problematic for some valid asymmetric topologies,
      e.g.:
      
      DIE      [                                ]
      MC       [                       ][       ]
      
      CPU       [0] [1] [2] [3] [4] [5]  [6] [7]
      Capacity  |.....| |.....| |.....|  |.....|
                   L       M       B        B
      
      Where:
       arch_scale_cpu_capacity(L) = 512
       arch_scale_cpu_capacity(M) = 871
       arch_scale_cpu_capacity(B) = 1024
      
      In this particular case, the asymmetric topology level will point
      at MC, as all possible CPU masks for that level do cover the CPU
      with the highest capacity. It will work just fine for the first
      cluster, not so much for the second one though (consider
      find_energy_efficient_cpu, which might end up attempting an
      energy-aware wake-up for a domain that does not see any asymmetry
      at all).
      
      Rework the way the capacity asymmetry levels are detected, so that
      the detection points to the lowest topology level (for a given CPU)
      where the full set of available CPU capacities is visible to all CPUs
      within the given domain. As a result, the per-cpu sd_asym_cpucapacity
      might differ across the domains. This will have an impact on EAS
      wake-up placement, in that it might see a different range of CPUs to
      be considered, depending on the given current and target CPUs.
      
      Additionally, those levels, where any range of asymmetry (not
      necessarily full) is being detected will get identified as well.
      The selected asymmetric topology level will be denoted by
      SD_ASYM_CPUCAPACITY_FULL sched domain flag whereas the 'sub-levels'
      would receive the already used SD_ASYM_CPUCAPACITY flag. This allows
      maintaining the current behaviour for asymmetric topologies, with
      misfit migration operating correctly on lower levels, if applicable,
      as any asymmetry is enough to trigger the misfit migration.
      The logic there relies on the SD_ASYM_CPUCAPACITY flag and does not
      relate to the full asymmetry level denoted by the sd_asym_cpucapacity
      pointer.
      
      Detecting the CPU capacity asymmetry is based on the set of
      available CPU capacities for all possible CPUs. This data is
      generated upon init and updated once CPU topology changes are
      detected (through arch_update_cpu_topology). As such, any changes
      to identified CPU capacities (like initializing cpufreq) need to be
      explicitly advertised by the corresponding archs to trigger rebuilding
      the data.
      
      The additional -dflags- parameter, used when building sched domains,
      has been removed as well, as the asymmetry flags are now set directly
      in sd_init.
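The distinction between "no", "partial" and "full" asymmetry can be sketched as below. This is a simplified model, not the kernel implementation: `classify_domain()`, `count_distinct()` and the `ASYM_*` constants are hypothetical names, and a domain is assumed to span a contiguous CPU range. A domain is treated as fully asymmetric when it contains every distinct capacity value present in the system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification results for this sketch. */
enum { ASYM_NONE = 0, ASYM_PARTIAL = 1, ASYM_FULL = 2 };

/* Count distinct values in caps[0..n). O(n^2) is fine for a sketch. */
static size_t count_distinct(const unsigned long *caps, size_t n)
{
	size_t i, j, distinct = 0;

	for (i = 0; i < n; i++) {
		for (j = 0; j < i; j++)
			if (caps[j] == caps[i])
				break;
		if (j == i)	/* caps[i] was not seen earlier */
			distinct++;
	}
	return distinct;
}

/* Classify the domain spanning CPUs [first, last] of an n_cpus system. */
static int classify_domain(const unsigned long *cpu_cap, size_t n_cpus,
			   size_t first, size_t last)
{
	size_t all, seen;

	assert(first <= last && last < n_cpus);

	all = count_distinct(cpu_cap, n_cpus);
	seen = count_distinct(cpu_cap + first, last - first + 1);

	if (seen <= 1)
		return ASYM_NONE;	/* all members share one capacity */
	/* The domain is a subset of the system, so equal counts mean that
	 * every capacity value is visible inside the domain. */
	return seen == all ? ASYM_FULL : ASYM_PARTIAL;
}
```

Applied to the topology from the diagram (capacities 512, 871 and 1024), the MC domain covering CPUs 0-5 is classified as full, the MC domain covering CPUs 6-7 sees no asymmetry, and DIE is full for every CPU.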
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Beata Michalska <beata.michalska@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Tested-by: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lore.kernel.org/r/20210603140627.8409-3-beata.michalska@arm.com
    • sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag · 2309a05d
      Beata Michalska authored
      Introduce a new sched_domain topology flag, complementary to
      SD_ASYM_CPUCAPACITY, to distinguish between sched_domains where any CPU
      capacity asymmetry is detected (SD_ASYM_CPUCAPACITY) and ones where
      a full set of CPU capacities is visible to all domain members
      (SD_ASYM_CPUCAPACITY_FULL).
      
      With the distinction between full and partial CPU capacity asymmetry
      brought in by the newly introduced flag, the scope of the original
      SD_ASYM_CPUCAPACITY flag shifts. It still maintains the existing
      behaviour when asymmetry is detected on a given sched domain, allowing
      misfit migrations within sched domains that do not observe the full
      range of CPU capacities but still have members with different capacity
      values. It loses its meaning, though, when it comes to the per-cpu
      pointer to the lowest CPU asymmetry sched_domain level, which is now
      denoted by the SD_ASYM_CPUCAPACITY_FULL flag.
      Signed-off-by: Beata Michalska <beata.michalska@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lore.kernel.org/r/20210603140627.8409-2-beata.michalska@arm.com
    • psi: Fix race between psi_trigger_create/destroy · 8f91efd8
      Zhaoyang Huang authored
      A race was detected between psi_trigger_destroy/create, as shown below,
      which causes a panic by accessing invalid
      psi_system->poll_wait->wait_queue_entry and
      psi_system->poll_timer->entry->next. With this change, the race window
      is removed by initialising poll_wait and poll_timer in group_init,
      which is executed only once at the beginning.
      
        psi_trigger_destroy()                   psi_trigger_create()
      
        mutex_lock(trigger_lock);
        rcu_assign_pointer(poll_task, NULL);
        mutex_unlock(trigger_lock);
      					  mutex_lock(trigger_lock);
      					  if (!rcu_access_pointer(group->poll_task)) {
      					    timer_setup(poll_timer, poll_timer_fn, 0);
      					    rcu_assign_pointer(poll_task, task);
      					  }
      					  mutex_unlock(trigger_lock);
      
        synchronize_rcu();
        del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
                                        psi_trigger_create()
      
      So, trigger_lock/RCU correctly protects destruction of
      group->poll_task but misses this race affecting poll_timer and
      poll_wait.
      
      Fixes: 461daba0 ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
      Co-developed-by: ziwei.dai <ziwei.dai@unisoc.com>
      Signed-off-by: ziwei.dai <ziwei.dai@unisoc.com>
      Co-developed-by: ke.wang <ke.wang@unisoc.com>
      Signed-off-by: ke.wang <ke.wang@unisoc.com>
      Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com
    • sched/fair: Introduce the burstable CFS controller · f4183717
      Huaixin Chang authored
      The CFS bandwidth controller limits CPU requests of a task group to
      quota during each period. However, parallel workloads might be bursty
      so that they get throttled even when their average utilization is under
      quota. And they are latency sensitive at the same time so that
      throttling them is undesired.
      
      We borrow time now against our future underrun, at the cost of increased
      interference against the other system users. All nicely bounded.
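The borrowing idea can be illustrated with a toy model of a bandwidth pool. This is a sketch under assumptions rather than the scheduler code: `struct toy_bw` and the `toy_*` helpers are hypothetical, and the model assumes that unused runtime carries over between periods up to a cap of quota + burst.

```c
#include <assert.h>

/* Toy model of one bandwidth pool; all times are in microseconds. */
struct toy_bw {
	long long quota;	/* runtime granted per period */
	long long burst;	/* extra runtime that may be accumulated */
	long long runtime;	/* runtime currently available */
};

/* Period start: add one quota, capping the total at quota + burst. */
static void toy_refill(struct toy_bw *b)
{
	b->runtime += b->quota;
	if (b->runtime > b->quota + b->burst)
		b->runtime = b->quota + b->burst;
}

/* Run `demand` for one period; return 1 if the group got throttled. */
static int toy_run_period(struct toy_bw *b, long long demand)
{
	toy_refill(b);
	if (demand > b->runtime) {
		b->runtime = 0;		/* everything consumed, then throttled */
		return 1;
	}
	b->runtime -= demand;		/* the remainder carries over */
	return 0;
}

/* Count throttled periods over a trace of per-period demands. */
static int toy_count_throttled(long long quota, long long burst,
			       const long long *demand, int n)
{
	struct toy_bw b = { quota, burst, 0 };
	int i, throttled = 0;

	assert(quota > 0 && burst >= 0);

	for (i = 0; i < n; i++)
		throttled += toy_run_period(&b, demand[i]);
	return throttled;
}
```

For a demand trace of 60000, 60000, 150000 and 80000 us against a 100000 us quota, the average stays below quota. Without burst the third period is throttled, while with a 100000 us burst the runtime saved in the first two periods absorbs the spike.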
      
      Traditional (UP-EDF) bandwidth control is something like:
      
        (U = \Sum u_i) <= 1
      
      This guarantees both that every deadline is met and that the system is
      stable. After all, if U were > 1, then for every second of walltime,
      we'd have to run more than a second of program time, and obviously miss
      our deadline, but the next deadline will be further out still, there is
      never time to catch up, unbounded fail.
      
      This work observes that a workload doesn't always execute the full
      quota; this enables one to describe u_i as a statistical distribution.
      
      For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
      (the traditional WCET). This effectively allows u to be smaller,
      increasing the efficiency (we can pack more tasks in the system), but at
      the cost of missing deadlines when all the odds line up. However, it
      does maintain stability, since every overrun must be paired with an
      underrun as long as our x is above the average.
      
      That is, suppose we have 2 tasks, both specify a p(95) value, then we
      have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
      everything is good. At the same time we have a p(5)p(5) = 0.25% chance
      both tasks will exceed their quota at the same time (guaranteed deadline
      fail). Somewhere in between there's a threshold where one exceeds and
      the other doesn't underrun enough to compensate; this depends on the
      specific CDFs.
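Spelled out, and assuming the two tasks exceed their quotas independently of each other, the arithmetic for the two-task case is:

```latex
\begin{align*}
P(\text{both within quota}) &= 0.95 \times 0.95 = 0.9025 \\
P(\text{both exceed quota}) &= 0.05 \times 0.05 = 0.0025 \\
P(\text{exactly one exceeds}) &= 1 - 0.9025 - 0.0025 = 0.0950
\end{align*}
```

The remaining 9.5% is the in-between region, where the outcome depends on whether the other task underruns enough to compensate.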
      
      At the same time, we can say that the worst case deadline miss, will be
      \Sum e_i; that is, there is a bounded tardiness (under the assumption
      that x+e is indeed WCET).
      
      The benefit of burst is seen when testing with schbench. Default value of
      kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.
      
      	mkdir /sys/fs/cgroup/cpu/test
      	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
      	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
      	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
      
      	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10
      
      The average CPU usage is at 80%. I ran this 10 times, and got long tail
      latency 6 times and got throttled 8 times.
      
      Tail latencies are shown below, and it wasn't the worst case.
      
      	Latency percentiles (usec)
      		50.0000th: 19872
      		75.0000th: 21344
      		90.0000th: 22176
      		95.0000th: 22496
      		*99.0000th: 22752
      		99.5000th: 22752
      		99.9000th: 22752
      		min=0, max=22727
      	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%
      
      The interference when using burst is measured by the probability of
      missing the deadline and the average WCET. Test results showed that when
      there are many cgroups or the CPU is underutilized, the interference is
      limited. More details are shown in:
      https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
      
      Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
      Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
  9. 22 Jun, 2021 3 commits
    • sched/uclamp: Fix uclamp_tg_restrict() · 0213b708
      Qais Yousef authored
      Now that cpu.uclamp.min acts as a protection, we need to make sure that
      the uclamp request of the task is within the allowed range of the cgroup,
      that is, it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
      tg->uclamp[UCLAMP_MAX].
      
      As reported by Xuewen [1] we can have some corner cases where there's
      inversion between uclamp requested by task (p) and the uclamp values of
      the taskgroup it's attached to (tg). The following table demonstrates
      two corner cases:
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	CASE 1
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  60%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	CASE 2
      	-----------+-----+------+-----------
      	uclamp_min | 0%  | 30%  |  30%
      	-----------+-----+------+-----------
      	uclamp_max | 20% | 50%  |  20%
      	-----------+-----+------+-----------
      
      With this fix we get:
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	CASE 1
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  50%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	CASE 2
      	-----------+-----+------+-----------
      	uclamp_min | 0%  | 30%  |  30%
      	-----------+-----+------+-----------
      	uclamp_max | 20% | 50%  |  30%
      	-----------+-----+------+-----------
      
      Additionally uclamp_update_active_tasks() must now unconditionally
      update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
      instance could have an impact on the effective UCLAMP_MIN of the tasks.
      
      	           |  p  |  tg  |  effective
      	-----------+-----+------+-----------
      	old
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   |  50%
      	-----------+-----+------+-----------
      	uclamp_max | 80% | 50%  |  50%
      	-----------+-----+------+-----------
      	*new*
      	-----------+-----+------+-----------
      	uclamp_min | 60% | 0%   | *60%*
      	-----------+-----+------+-----------
      	uclamp_max | 80% |*70%* | *70%*
      	-----------+-----+------+-----------
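The effective values in the tables above follow from clamping each task request into the task group's [min, max] range. A minimal sketch, assuming percentages as plain integers, is shown below; `clamp_uint()` and `uclamp_effective()` are hypothetical helpers written for this example, not the kernel functions.

```c
#include <assert.h>

/* Clamp val into [lo, hi]. */
static unsigned int clamp_uint(unsigned int val, unsigned int lo,
			       unsigned int hi)
{
	assert(lo <= hi);

	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Effective uclamp value (in percent) of a task request, restricted by
 * the task group's uclamp_min and uclamp_max. The same restriction is
 * applied to both the task's UCLAMP_MIN and UCLAMP_MAX requests.
 */
static unsigned int uclamp_effective(unsigned int task_req,
				     unsigned int tg_min,
				     unsigned int tg_max)
{
	return clamp_uint(task_req, tg_min, tg_max);
}
```

For CASE 1, `uclamp_effective(60, 0, 50)` and `uclamp_effective(80, 0, 50)` both give 50. Raising the group's maximum to 70 changes the effective minimum to 60 as well as the effective maximum to 70, which is why both clamps have to be updated together.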
      
      [1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/
      
      Fixes: 0c18f2ec ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
      Reported-by: Xuewen Yan <xuewen.yan94@gmail.com>
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
    • sched/rt: Fix Deadline utilization tracking during policy change · d7d60709
      Vincent Donnefort authored
      DL keeps track of the utilization on a per-rq basis with the structure
      avg_dl. This utilization is updated during task_tick_dl(),
      put_prev_task_dl() and set_next_task_dl(). However, when the current
      running task changes its policy, set_next_task_dl() which would usually
      take care of updating the utilization when the rq starts running DL
      tasks, will not see such a change, leaving the avg_dl structure outdated.
      When that very same task is dequeued later, put_prev_task_dl() will
      then update the utilization, based on a wrong last_update_time, leading to
      a huge spike in the DL utilization signal.
      
      The signal would eventually recover from this issue after a few ms. Even
      if no DL tasks are run, avg_dl is also updated in
      __update_blocked_others(). But as the CPU capacity depends partly on the
      avg_dl, this issue has nonetheless a significant impact on the scheduler.
      
      Fix this issue by ensuring a load update when a running task changes
      its policy to DL.
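Why a stale last_update_time produces a spike can be seen in a toy model of a decaying utilization signal. The sketch below is an assumption-laden illustration, not the kernel's load-tracking code: `struct toy_avg`, `toy_update()` and `util_at_put()` are hypothetical, time is in milliseconds, and the per-millisecond decay factor is chosen to give a half-life of about 32 ms.

```c
#include <assert.h>

/* Toy per-rq utilization signal; not the kernel implementation. */
struct toy_avg {
	unsigned long long last_update_time;	/* in ms */
	double util;				/* ranges over 0..1024 */
};

/* Per-ms decay factor, chosen so that TOY_DECAY^32 is about 0.5. */
#define TOY_DECAY 0.978572

/* Account the time since the last update as running or not running. */
static void toy_update(struct toy_avg *a, unsigned long long now, int running)
{
	unsigned long long ms;

	assert(now >= a->last_update_time);

	for (ms = a->last_update_time; ms < now; ms++)
		a->util = a->util * TOY_DECAY +
			  (running ? 1024.0 * (1.0 - TOY_DECAY) : 0.0);
	a->last_update_time = now;
}

/*
 * A task switches into the class at switch_ms while running and is put
 * at put_ms. Returns the signal seen at put time, with or without an
 * update at the moment of the policy change.
 */
static double util_at_put(unsigned long long switch_ms,
			  unsigned long long put_ms, int update_on_switch)
{
	struct toy_avg a = { 0, 0.0 };

	if (update_on_switch)
		toy_update(&a, switch_ms, 0);	/* class was idle until now */
	toy_update(&a, put_ms, 1);		/* elapsed time counts as running */
	return a.util;
}
```

With a switch at 1000 ms and a put at 1004 ms, the version without the update treats the whole 1004 ms as running time and saturates the signal near 1024, whereas updating at the switch accounts only 4 ms of running time.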
      
      Fixes: 3727e0e1 ("sched/dl: Add dl_rq utilization tracking")
      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/1624271872-211872-3-git-send-email-vincent.donnefort@arm.com
    • sched/rt: Fix RT utilization tracking during policy change · fecfcbc2
      Vincent Donnefort authored
      RT keeps track of the utilization on a per-rq basis with the structure
      avg_rt. This utilization is updated during task_tick_rt(),
      put_prev_task_rt() and set_next_task_rt(). However, when the current
      running task changes its policy, set_next_task_rt() which would usually
      take care of updating the utilization when the rq starts running RT tasks,
      will not see such a change, leaving the avg_rt structure outdated. When
      that very same task is dequeued later, put_prev_task_rt() will then
      update the utilization, based on a wrong last_update_time, leading to a
      huge spike in the RT utilization signal.
      
      The signal would eventually recover from this issue after a few ms. Even if
      no RT tasks are run, avg_rt is also updated in __update_blocked_others().
      But as the CPU capacity depends partly on the avg_rt, this issue has
      nonetheless a significant impact on the scheduler.
      
      Fix this issue by ensuring a load update when a running task changes
      its policy to RT.
      
      Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
      Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
  10. 18 Jun, 2021 4 commits