1. 03 Oct, 2023 1 commit
  2. 02 Oct, 2023 4 commits
• Kir Kolyshkin
      sched/headers: Move 'struct sched_param' out of uapi, to work around glibc/musl breakage · d844fe65
      Kir Kolyshkin authored
Both glibc and musl define 'struct sched_param' in sched.h, while the
kernel has it in uapi/linux/sched/types.h, making it cumbersome to use
sched_getattr(2) or sched_setattr(2) from userspace.
      
      For example, something like this:
      
      	#include <sched.h>
      	#include <linux/sched/types.h>
      
      	struct sched_attr sa;
      
      will result in "error: redefinition of ‘struct sched_param’" (note the
      code doesn't need sched_param at all -- it needs struct sched_attr
      plus some stuff from sched.h).
      
The situation is, glibc is not going to provide a wrapper for
sched_{get,set}attr, hence the need to include linux/sched/types.h
directly, which leads to the above problem.
      
Thus, userspace is left with a few sub-par choices when it wants to
use e.g. sched_setattr(2), such as maintaining a copy of the struct
sched_attr definition, or using some other ugly tricks.
      
OTOH, 'struct sched_param' is well known, defined in POSIX, and it won't
ever be changed (as that would break backward compatibility).
      
So, while 'struct sched_param' is indeed part of the kernel uapi,
exposing it the way it's done now creates an issue, and hiding it
(like this patch does) fixes that issue, hopefully without creating
another one: common userspace software relies on libc headers, and as
for "special" software (like libc), it looks like glibc and musl
do not rely on kernel headers for the 'struct sched_param' definition
(but let's Cc their mailing lists in case it's otherwise).
      
The alternative to this patch would be to move struct sched_attr to,
say, linux/sched.h, or to a new linux/sched/attr.h.
      
      Oh, and here is the previous attempt to fix the issue:
      
        https://lore.kernel.org/all/20200528135552.GA87103@google.com/
      
While I support Linus' arguments, the issue is still here
and needs to be fixed.
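
With the header moved, userspace can include both headers and invoke the
syscall directly. A minimal sketch of what that might look like (the
sys_sched_setattr() wrapper below is mine, since glibc provides none):

	#include <sched.h>              /* libc: struct sched_param, SCHED_* */
	#include <linux/sched/types.h>  /* kernel: struct sched_attr */
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>
	#include <stdio.h>

	/* glibc has no sched_setattr() wrapper, so call the syscall directly */
	static int sys_sched_setattr(pid_t pid, struct sched_attr *attr,
				     unsigned int flags)
	{
		return syscall(SYS_sched_setattr, pid, attr, flags);
	}

	int main(void)
	{
		struct sched_attr sa;

		memset(&sa, 0, sizeof(sa));
		sa.size         = sizeof(sa);
		sa.sched_policy = SCHED_OTHER;
		sa.sched_nice   = 5;

		if (sys_sched_setattr(0, &sa, 0))
			perror("sched_setattr");
		return 0;
	}

The same pattern works for sched_getattr(2) via SYS_sched_getattr.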
      
[ mingo: Linus is right, this shouldn't be needed - but on the other
         hand I agree that this header is not really helpful to
         user-space as-is. So let's pretend that
         <uapi/linux/sched/types.h> is only about sched_attr, and
         call this commit a workaround for the user-space breakage
         that it in reality is ... Also, remove the Fixes tag. ]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20230808030357.1213829-1-kolyshkin@gmail.com
      d844fe65
• Cyril Hrubis
      83494dc5
• Cyril Hrubis
      sched/rt/docs: Clarify & fix sched_rt_* sysctl docs · e6dbdd8f
      Cyril Hrubis authored
- Describe explicitly that sched_rt_runtime_us is allocated from
  sched_rt_period_us and is hence always less than or equal to that
  value (see the sketch after this list).
      
      - The limit for sched_rt_runtime_us is not INT_MAX-1, but rather it's
        limited by the value of sched_rt_period_us. If sched_rt_period_us is
        INT_MAX then sched_rt_runtime_us can be set to INT_MAX as well.
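
To make the relationship concrete, here is a minimal userspace sketch
that reads both values from the usual /proc/sys/kernel files (the
read_sysctl() helper is hypothetical, for illustration only):

	#include <stdio.h>

	/* read a single integer value from a sysctl file */
	static long read_sysctl(const char *path)
	{
		long val = -2;	/* -2 = could not read */
		FILE *f = fopen(path, "r");

		if (f) {
			if (fscanf(f, "%ld", &val) != 1)
				val = -2;
			fclose(f);
		}
		return val;
	}

	int main(void)
	{
		long period  = read_sysctl("/proc/sys/kernel/sched_rt_period_us");
		long runtime = read_sysctl("/proc/sys/kernel/sched_rt_runtime_us");

		printf("period=%ldus runtime=%ldus\n", period, runtime);

		/* runtime == -1 disables RT throttling; otherwise it is
		 * carved out of period and must not exceed it */
		if (runtime >= 0 && runtime > period)
			printf("unexpected: runtime exceeds period\n");
		return 0;
	}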
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20231002115553.3007-3-chrubis@suse.cz
      e6dbdd8f
• Cyril Hrubis
      sched/rt: Disallow writing invalid values to sched_rt_period_us · 079be8fc
      Cyril Hrubis authored
The validation of the value written to sched_rt_period_us was broken
because:

  - sysctl_sched_rt_period is declared as unsigned int
  - it is parsed by proc_dointvec()
  - the range check is performed only after the value has been parsed by
    proc_dointvec()
      
Because of this, negative values written to the file were stored in an
unsigned integer and later interpreted as large positive integers, which
passed the check:

  if (sysctl_sched_rt_period <= 0)
	return -EINVAL;
      
This commit fixes the parsing by setting an explicit range for both
period_us and runtime_us in the sched_rt_sysctls table and by processing
the values with proc_dointvec_minmax() instead.
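
The general shape of such a table entry is sketched below (simplified,
not the exact hunk from this patch; proc_dointvec_minmax() rejects
values outside the [.extra1, .extra2] range at write time):

	{
		.procname	= "sched_rt_period_us",
		.data		= &sysctl_sched_rt_period,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ONE,		/* reject 0 and negative values */
		.extra2		= SYSCTL_INT_MAX,
	},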
      
Alternatively, if we wanted to use the full range of unsigned int for
the period value, we would have to split the proc_handler and use
proc_douintvec() for it; however, even
Documentation/scheduler/sched-rt-group.rst describes the range as 1 to
INT_MAX.
      
As far as I can tell, the only problem this causes is that the sysctl
file allows writing negative values which, when read back, may confuse
userspace.
      
There is also an LTP test being submitted for these sysctl files at:

  http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
      079be8fc
  3. 29 Sep, 2023 4 commits
  4. 25 Sep, 2023 1 commit
• Valentin Schneider
      sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask · 612f769e
      Valentin Schneider authored
      Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
      that has an empty pushable_tasks list, which means nothing useful will be
      done in the IPI other than queue the work for the next CPU on the rto_mask.
      
rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
but the conditions for that irq_work to be queued (and for a CPU to be
added to the rto_mask) rely on rt_rq->rt_nr_migratory instead.
      
rt_nr_migratory is increased whenever an RT task entity is enqueued and it
has nr_cpus_allowed > 1. Unlike the pushable_tasks list, rt_nr_migratory
includes an rt_rq's current task. This means an rt_rq can have a migratable
current, N non-migratable queued tasks, and be flagged as overloaded / have
its CPU set in the rto_mask, despite having an empty pushable_tasks list.
      
      Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
      Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.
      
      Note that the case where the current task is pushed away to make way for a
      migration-disabled task remains unchanged: the migration-disabled task has
      to be in the pushable_tasks list in the first place, which means it has
      nr_cpus_allowed > 1.
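
Roughly, the shape of the change is the following (a simplified sketch,
not the actual patch; locking and the priority bookkeeping of the
pushable list are omitted, and rt_set_overload()/rt_clear_overload() and
has_pushable_tasks() are the existing helpers in kernel/sched/rt.c):

	/* overload state now follows the pushable_tasks list */
	static void enqueue_pushable_task(struct rq *rq, struct task_struct *p)
	{
		plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);
		plist_node_init(&p->pushable_tasks, p->prio);
		plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks);

		if (!rq->rt.overloaded) {
			rt_set_overload(rq);	/* sets this CPU in rd->rto_mask */
			rq->rt.overloaded = 1;
		}
	}

	static void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
	{
		plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks);

		if (!has_pushable_tasks(rq) && rq->rt.overloaded) {
			rt_clear_overload(rq);	/* clears the rto_mask bit */
			rq->rt.overloaded = 0;
		}
	}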
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lore.kernel.org/r/20230811112044.3302588-1-vschneid@redhat.com
      612f769e
  5. 24 Sep, 2023 3 commits
  6. 22 Sep, 2023 2 commits
  7. 21 Sep, 2023 7 commits
  8. 19 Sep, 2023 2 commits
  9. 18 Sep, 2023 4 commits
• GUO Zihua
      sched/headers: Remove duplicated includes in kernel/sched/sched.h · 7ad0354d
      GUO Zihua authored
Remove duplicated includes of linux/cgroup.h and linux/psi.h. Both of
these headers are included regardless of the config, and both are
protected by ifndef guards, so there is no point in including them again.
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20230818015633.18370-1-guozihua@huawei.com
      7ad0354d
• Aaron Lu
      sched/fair: Ratelimit update to tg->load_avg · 1528c661
      Aaron Lu authored
When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on a
2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR) machine:
      
          13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
          10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
      
Annotation shows the cycles are mostly spent accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group, and when different tasks
of the same task group running on different CPUs frequently access it,
it can become heavily contended.
      
E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids, during a 5s window the wakeup count is 14 million and
the migration count is 11 million. With each migration, the task's load
is transferred from the source cfs_rq to the target cfs_rq, and each such
change involves an update to tg->load_avg. Since the workload can trigger
this many wakeups and migrations, the accesses (both read and write) to
tg->load_avg are effectively unbounded. As a result, the two mentioned
functions show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR,
the problem is worse: during a 5s window, the wakeup count is 21 million
and the migration count is 14 million; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.
      
Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because the PELT window is roughly 1ms and it
delivered good results in the tests I've done. After this change, the
cost of accessing tg->load_avg is greatly reduced and performance
improves. Detailed test results below.
      
        ==============================
        postgres_sysbench on SPR:
        25%
        base:   42382±19.8%
        patch:  50174±9.5%  (noise)
      
        50%
        base:   67626±1.3%
        patch:  67365±3.1%  (noise)
      
        75%
        base:   100216±1.2%
        patch:  112470±0.1% +12.2%
      
        100%
        base:    93671±0.4%
        patch:  113563±0.2% +21.2%
      
        ==============================
        hackbench on ICL:
        group=1
        base:    114912±5.2%
        patch:   117857±2.5%  (noise)
      
        group=4
        base:    359902±1.6%
        patch:   361685±2.7%  (noise)
      
        group=8
        base:    461070±0.8%
        patch:   491713±0.3% +6.6%
      
        group=16
        base:    309032±5.0%
        patch:   378337±1.3% +22.4%
      
        =============================
        hackbench on SPR:
        group=1
        base:    100768±2.9%
        patch:   103134±2.9%  (noise)
      
        group=4
        base:    413830±12.5%
        patch:   378660±16.6% (noise)
      
        group=8
        base:    436124±0.6%
        patch:   490787±3.2% +12.5%
      
        group=16
        base:    457730±3.2%
        patch:   680452±1.3% +48.8%
      
        ============================
        netperf/udp_rr on ICL
        25%
        base:    114413±0.1%
        patch:   115111±0.0% +0.6%
      
        50%
        base:    86803±0.5%
        patch:   86611±0.0%  (noise)
      
        75%
        base:    35959±5.3%
        patch:   49801±0.6% +38.5%
      
        100%
        base:    61951±6.4%
        patch:   70224±0.8% +13.4%
      
        ===========================
        netperf/udp_rr on SPR
        25%
        base:   104954±1.3%
        patch:  107312±2.8%  (noise)
      
        50%
        base:    55394±4.6%
        patch:   54940±7.4%  (noise)
      
        75%
        base:    13779±3.1%
        patch:   36105±1.1% +162%
      
        100%
        base:     9703±3.7%
        patch:   28011±0.2% +189%
      
        ==============================================
        netperf/tcp_stream on ICL (all in noise range)
        25%
        base:    43092±0.1%
        patch:   42891±0.5%
      
        50%
        base:    19278±14.9%
        patch:   22369±7.2%
      
        75%
        base:    16822±3.0%
        patch:   17086±2.3%
      
        100%
        base:    18216±0.6%
        patch:   18078±2.9%
      
        ===============================================
        netperf/tcp_stream on SPR (all in noise range)
        25%
        base:    34491±0.3%
        patch:   34886±0.5%
      
        50%
        base:    19278±14.9%
        patch:   22369±7.2%
      
        75%
        base:    16822±3.0%
        patch:   17086±2.3%
      
        100%
        base:    18216±0.6%
        patch:   18078±2.9%
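
In sketch form, the ratelimit described above amounts to something like
this (simplified; the per-cfs_rq timestamp field name is approximate,
and the delta/64 filter is the pre-existing update threshold):

	static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
	{
		long delta;
		u64 now;

		/* root_task_group's load_avg is never read, skip it */
		if (cfs_rq->tg == &root_task_group)
			return;

		/* rate limit updates to the shared, contended counter
		 * to at most once per ms */
		now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
		if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
			return;

		delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
		if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
			atomic_long_add(delta, &cfs_rq->tg->load_avg);
			cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
			cfs_rq->last_update_tg_load_avg = now;
		}
	}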
Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
      Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
      1528c661
• Elliot Berman
      freezer,sched: Use saved_state to reduce some spurious wakeups · 8f0eed4a
      Elliot Berman authored
After commit f5d39b02 ("freezer,sched: Rewrite core freezer logic"),
tasks that transition directly from TASK_FREEZABLE to TASK_FROZEN are
always woken up on the thaw path. Prior to that commit, tasks could ask
the freezer to consider them "frozen enough" via freezer_do_not_count().
The commit replaced freezer_do_not_count() with a TASK_FREEZABLE state
which allows the freezer to immediately mark the task as TASK_FROZEN
without waking it up. This is efficient for the suspend path, but on the
thaw path the task is always woken up, even if it didn't need to wake up
and simply goes back to its TASK_(UN)INTERRUPTIBLE state. Although these
tasks are capable of handling the wakeup, we can observe a power/perf
impact from the extra wakeups.
      
On Android, we observed many tasks waiting in the TASK_FREEZABLE state
(particularly due to many of them being binder clients). We observed
nearly 4x the number of tasks and a corresponding linear increase in
latency and power consumption when thawing the system. The latency
increased from ~15ms to ~50ms.
      
Avoid the spurious wakeups by saving the state of TASK_FREEZABLE tasks.
If the task was running before entering the TASK_FROZEN state
(__refrigerator()), or if the task received a wakeup for the saved
state, then the task is woken on thaw. The saved_state field from the
PREEMPT_RT locks can be reused because the freezer would not stomp on
the rtlock wait flow: TASK_RTLOCK_WAIT isn't considered freezable.
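
Conceptually, the freeze/thaw handling looks something like this (a
heavily simplified sketch, not the actual freezer code; locking and the
handling of wakeups that arrive while the task is frozen are omitted):

	/* freezing: remember what the task was doing, then mark it frozen */
	static void freeze_sketch(struct task_struct *p)
	{
		p->saved_state = READ_ONCE(p->__state);	/* e.g. TASK_INTERRUPTIBLE */
		WRITE_ONCE(p->__state, TASK_FROZEN);
	}

	/* thawing: only wake the task if it actually needs to run */
	static void thaw_sketch(struct task_struct *p)
	{
		if (p->saved_state == TASK_RUNNING) {
			/* it was runnable before freezing: really wake it */
			wake_up_state(p, TASK_FROZEN);
		} else {
			/* otherwise restore the pre-freeze sleep state,
			 * avoiding a spurious wakeup */
			WRITE_ONCE(p->__state, p->saved_state);
		}
	}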
Reported-by: Prakash Viswalingam <quic_prakashv@quicinc.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8f0eed4a
• Elliot Berman
      sched/core: Remove ifdeffery for saved_state · fbaa6a18
      Elliot Berman authored
In preparation for the freezer to also use saved_state, remove the
CONFIG_PREEMPT_RT compilation guard around it.

On the arm64 platform I tested, which did not have CONFIG_PREEMPT_RT
enabled, applying this patch caused no statistically significant
deviation.
      
      Test methodology:
      
      perf bench sched message -g 40 -l 40
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fbaa6a18
  10. 15 Sep, 2023 9 commits
  11. 13 Sep, 2023 3 commits