1. 02 Oct, 2023 3 commits
  2. 29 Sep, 2023 4 commits
  3. 25 Sep, 2023 1 commit
    • sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask · 612f769e
      Valentin Schneider authored
      Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
      that has an empty pushable_tasks list, which means nothing useful will be
      done in the IPI other than queue the work for the next CPU on the rto_mask.
      
      rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
      but the conditions for that irq_work to be queued (and for a CPU to be
      added to the rto_mask) rely on rt_rq->nr_migratory instead.
      
      nr_migratory is increased whenever an RT task entity is enqueued and it has
      nr_cpus_allowed > 1. Unlike the pushable_tasks list, nr_migratory includes an
      rt_rq's current task. This means an rt_rq can have a migratable current, N
      non-migratable queued tasks, and be flagged as overloaded / have its CPU
      set in the rto_mask, despite having an empty pushable_tasks list.
      
      Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
      Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.
      
      Note that the case where the current task is pushed away to make way for a
      migration-disabled task remains unchanged: the migration-disabled task has
      to be in the pushable_tasks list in the first place, which means it has
      nr_cpus_allowed > 1.
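
      A minimal sketch of this approach, based on the description above rather
      than the verbatim diff (the overloaded flag and helper names follow
      kernel/sched/rt.c, but details such as the highest_prio.next bookkeeping
      are omitted):

        /*
         * Sketch only: drive the RT "overload" state (and hence rd->rto_mask)
         * from the pushable_tasks list itself.
         */
        static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
        {
                plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
                plist_node_init(&p->pushable_tasks, p->prio);
                plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);

                /* At least one pushable task: mark this CPU in rd->rto_mask. */
                if (!rq->rt.overloaded) {
                        rt_set_overload(rq);
                        rq->rt.overloaded = 1;
                }
        }

        static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
        {
                plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);

                /* List emptied: this CPU no longer has anything to push. */
                if (!has_pushable_tasks(rq) && rq->rt.overloaded) {
                        rt_clear_overload(rq);
                        rq->rt.overloaded = 0;
                }
        }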
      Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Valentin Schneider <vschneid@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lore.kernel.org/r/20230811112044.3302588-1-vschneid@redhat.com
  4. 24 Sep, 2023 3 commits
  5. 22 Sep, 2023 2 commits
  6. 21 Sep, 2023 7 commits
  7. 19 Sep, 2023 2 commits
  8. 18 Sep, 2023 4 commits
    • sched/headers: Remove duplicated includes in kernel/sched/sched.h · 7ad0354d
      GUO Zihua authored
      Remove the duplicated includes of linux/cgroup.h and linux/psi.h. Both
      headers are already included unconditionally earlier in the file and are
      protected by include guards, so there is no point in including them again.
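
      For reference, both headers use an include guard of the following shape
      (guard name shown for linux/cgroup.h), so a second #include compiles to
      nothing:

        /* Simplified shape of a guarded kernel header.
         * Because of the #ifndef, only the first #include has any effect; the
         * later, config-dependent duplicates removed here were compiled away
         * anyway. */
        #ifndef _LINUX_CGROUP_H
        #define _LINUX_CGROUP_H
        /* ... declarations ... */
        #endif /* _LINUX_CGROUP_H */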
      Signed-off-by: GUO Zihua <guozihua@huawei.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20230818015633.18370-1-guozihua@huawei.com
    • sched/fair: Ratelimit update to tg->load_avg · 1528c661
      Aaron Lu authored
      When using sysbench to benchmark Postgres in a single Docker instance
      with sysbench's nr_threads set to nr_cpu, update_cfs_group() and
      update_load_avg() at times show noticeable overhead on a
      2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR) machine:
      
          13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
          10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
      
      Annotation shows the cycles are mostly spent accessing tg->load_avg, with
      update_load_avg() being the write side and update_cfs_group() the read
      side. tg->load_avg is per task group, and when different tasks of the
      same task group run on different CPUs and frequently access tg->load_avg,
      it can be heavily contended.
      
      E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
      Sapphire Rapids machine, during a 5s window there are 14 million wakeups
      and 11 million migrations; with each migration, the task's load is
      transferred from the source cfs_rq to the target cfs_rq, and each transfer
      involves an update to tg->load_avg. Since the workload triggers this many
      wakeups and migrations, the accesses (both read and write) to tg->load_avg
      are effectively unbounded. As a result, the two mentioned functions show
      noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is
      worse: during a 5s window there are 21 million wakeups and 14 million
      migrations; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
      
      Reduce the overhead by limiting updates to tg->load_avg to at most once
      per ms. The update frequency is a tradeoff between tracking accuracy and
      overhead. 1ms is chosen because the PELT window is roughly 1ms and it
      delivered good results in the tests I've done. After this change, the
      cost of accessing tg->load_avg is greatly reduced and performance
      improves. Detailed test results follow the sketch below.
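
      A minimal sketch of the ratelimit, assuming a per-cfs_rq timestamp such
      as last_update_tg_load_avg (field name illustrative) that records the
      last propagation to tg->load_avg:

        static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
        {
                long delta;
                u64 now;

                /* The root task group's load_avg is never read. */
                if (cfs_rq->tg == &root_task_group)
                        return;

                /*
                 * For migration-heavy workloads the number of updates is
                 * unbounded; allow at most one update per ms per cfs_rq.
                 */
                now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
                if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
                        return;

                delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
                if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                        atomic_long_add(delta, &cfs_rq->tg->load_avg);
                        cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
                        cfs_rq->last_update_tg_load_avg = now;
                }
        }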
      
        ==============================
        postgres_sysbench on SPR:
        25%
        base:   42382±19.8%
        patch:  50174±9.5%  (noise)
      
        50%
        base:   67626±1.3%
        patch:  67365±3.1%  (noise)
      
        75%
        base:   100216±1.2%
        patch:  112470±0.1% +12.2%
      
        100%
        base:    93671±0.4%
        patch:  113563±0.2% +21.2%
      
        ==============================
        hackbench on ICL:
        group=1
        base:    114912±5.2%
        patch:   117857±2.5%  (noise)
      
        group=4
        base:    359902±1.6%
        patch:   361685±2.7%  (noise)
      
        group=8
        base:    461070±0.8%
        patch:   491713±0.3% +6.6%
      
        group=16
        base:    309032±5.0%
        patch:   378337±1.3% +22.4%
      
        =============================
        hackbench on SPR:
        group=1
        base:    100768±2.9%
        patch:   103134±2.9%  (noise)
      
        group=4
        base:    413830±12.5%
        patch:   378660±16.6% (noise)
      
        group=8
        base:    436124±0.6%
        patch:   490787±3.2% +12.5%
      
        group=16
        base:    457730±3.2%
        patch:   680452±1.3% +48.8%
      
        ============================
        netperf/udp_rr on ICL
        25%
        base:    114413±0.1%
        patch:   115111±0.0% +0.6%
      
        50%
        base:    86803±0.5%
        patch:   86611±0.0%  (noise)
      
        75%
        base:    35959±5.3%
        patch:   49801±0.6% +38.5%
      
        100%
        base:    61951±6.4%
        patch:   70224±0.8% +13.4%
      
        ===========================
        netperf/udp_rr on SPR
        25%
        base:   104954±1.3%
        patch:  107312±2.8%  (noise)
      
        50%
        base:    55394±4.6%
        patch:   54940±7.4%  (noise)
      
        75%
        base:    13779±3.1%
        patch:   36105±1.1% +162%
      
        100%
        base:     9703±3.7%
        patch:   28011±0.2% +189%
      
        ==============================================
        netperf/tcp_stream on ICL (all in noise range)
        25%
        base:    43092±0.1%
        patch:   42891±0.5%
      
        50%
        base:    19278±14.9%
        patch:   22369±7.2%
      
        75%
        base:    16822±3.0%
        patch:   17086±2.3%
      
        100%
        base:    18216±0.6%
        patch:   18078±2.9%
      
        ===============================================
        netperf/tcp_stream on SPR (all in noise range)
        25%
        base:    34491±0.3%
        patch:   34886±0.5%
      
        50%
        base:    19278±14.9%
        patch:   22369±7.2%
      
        75%
        base:    16822±3.0%
        patch:   17086±2.3%
      
        100%
        base:    18216±0.6%
        patch:   18078±2.9%
      Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
      Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: David Vernet <void@manifault.com>
      Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
      Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
    • freezer,sched: Use saved_state to reduce some spurious wakeups · 8f0eed4a
      Elliot Berman authored
      After commit f5d39b02 ("freezer,sched: Rewrite core freezer logic"),
      tasks that transition directly from TASK_FREEZABLE to TASK_FROZEN  are
      always woken up on the thaw path. Prior to that commit, tasks could ask
      freezer to consider them "frozen enough" via freezer_do_not_count(). The
      commit replaced freezer_do_not_count() with a TASK_FREEZABLE state which
      allows freezer to immediately mark the task as TASK_FROZEN without
      waking up the task. This is efficient for the suspend path, but on the
      thaw path the task is always woken up, even if it didn't need to wake up
      and simply goes back to its TASK_(UN)INTERRUPTIBLE state. Although these
      tasks are capable of handling the wakeup, we can observe a power/perf
      impact from the extra wakeups.
      
      On Android we observed many tasks waiting in the TASK_FREEZABLE state
      (particularly because many of them are binder clients). We observed
      nearly 4x the number of such tasks and a corresponding linear increase
      in latency and power consumption when thawing the system: the latency
      increased from ~15ms to ~50ms.
      
      Avoid the spurious wakeups by saving the state of TASK_FREEZABLE tasks.
      The task is woken on thaw only if it was running before entering the
      TASK_FROZEN state (__refrigerator()) or if it received a wakeup for the
      saved state. saved_state from the PREEMPT_RT locks can be reused because
      the freezer does not stomp on the rtlock wait flow: TASK_RTLOCK_WAIT
      isn't considered freezable.
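
      A condensed sketch of the thaw-side logic (simplified from the
      description above; helper names follow kernel/freezer.c, but details are
      trimmed):

        /* Returns 1 if the pre-freeze sleep state was restored (no wakeup needed). */
        static int __restore_freezer_state(struct task_struct *p, void *arg)
        {
                unsigned int state = p->saved_state;

                if (state != TASK_RUNNING) {
                        /* Put the task straight back into its original sleep state. */
                        WRITE_ONCE(p->__state, state);
                        return 1;
                }
                return 0;
        }

        void __thaw_task(struct task_struct *p)
        {
                unsigned long flags;

                spin_lock_irqsave(&freezer_lock, flags);
                if (WARN_ON_ONCE(freezing(p)))
                        goto unlock;

                /* Only tasks that were running before freezing get a real wakeup. */
                if (task_call_func(p, __restore_freezer_state, NULL))
                        goto unlock;

                wake_up_state(p, TASK_FROZEN);
        unlock:
                spin_unlock_irqrestore(&freezer_lock, flags);
        }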
      Reported-by: Prakash Viswalingam <quic_prakashv@quicinc.com>
      Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Remove ifdeffery for saved_state · fbaa6a18
      Elliot Berman authored
      In preparation for freezer to also use saved_state, remove the
      CONFIG_PREEMPT_RT compilation guard around saved_state.
      
      On the arm64 platform I tested, which does not have CONFIG_PREEMPT_RT
      enabled, applying this patch caused no statistically significant deviation.
      
      Test methodology:
      
      perf bench sched message -g 40 -l 40
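
      Illustrative shape of the change, not the verbatim diff: saved_state
      simply loses its CONFIG_PREEMPT_RT guard in task_struct so the freezer
      series can rely on it unconditionally:

        struct task_struct {
                /* ... */
                unsigned int            __state;
                /*
                 * Saved state for "spinlock sleepers"; previously under
                 * #ifdef CONFIG_PREEMPT_RT, now always present.
                 */
                unsigned int            saved_state;
                /* ... */
        };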
      Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 15 Sep, 2023 9 commits
  10. 13 Sep, 2023 5 commits