1. 10 Sep, 2024 1 commit
    • sched_ext: Synchronize bypass state changes with rq lock · 750a40d8
      Tejun Heo authored
      While the BPF scheduler is being unloaded, the following warning messages
      trigger sometimes:
      
       NOHZ tick-stop error: local softirq work is pending, handler #80!!!
      
      This is caused by the CPU entering idle while there are pending softirqs.
      The main culprit is the bypassing state assertion not being synchronized
      with rq operations. As the BPF scheduler cannot be trusted in the disable
      path, the first step is entering the bypass mode where the BPF scheduler is
      ignored and scheduling becomes global FIFO.
      
      This is implemented by turning scx_ops_bypassing() true. However, the
      transition isn't synchronized against anything and it's possible for enqueue
      and dispatch paths to have different ideas on whether bypass mode is on.
      
      Make each rq track its own bypass state with SCX_RQ_BYPASSING which is
      modified while rq is locked.
      
      This removes most of the NOHZ tick-stop messages but not all of them. I
      believe the stragglers are from the sched core bug where pick_task_scx() can
      be called without preceding balance_scx(). Once that bug is fixed, we should
      verify that all occurrences of this error message are gone too.
      
      v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that
          the per-cpu states are always synchronized with the global state.
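
      A minimal user-space sketch of the per-rq idea described above, not the
      kernel implementation. All names (toy_rq, TOY_RQ_BYPASSING, toy_set_bypass)
      are hypothetical, and a plain boolean stands in for the rq lock. The point
      is that the flag is only written and read while the owning rq is locked,
      so enqueue and dispatch cannot disagree about it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 4
#define TOY_RQ_BYPASSING (1u << 0)  /* stands in for SCX_RQ_BYPASSING */

struct toy_rq {
    bool locked;     /* toy stand-in for the rq lock */
    uint32_t flags;
};

static struct toy_rq toy_rqs[TOY_NR_CPUS];

static void toy_rq_lock(struct toy_rq *rq)
{
    assert(!rq->locked);
    rq->locked = true;
}

static void toy_rq_unlock(struct toy_rq *rq)
{
    assert(rq->locked);
    rq->locked = false;
}

/* Flip the bypass state of every rq, each one under its own lock. */
static void toy_set_bypass(bool bypass)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        struct toy_rq *rq = &toy_rqs[cpu];

        toy_rq_lock(rq);
        if (bypass)
            rq->flags |= TOY_RQ_BYPASSING;
        else
            rq->flags &= ~TOY_RQ_BYPASSING;
        toy_rq_unlock(rq);
    }
}

/* Readers must hold the rq lock, so all paths on that rq see one answer. */
static bool toy_rq_bypassing(const struct toy_rq *rq)
{
    assert(rq->locked);
    return rq->flags & TOY_RQ_BYPASSING;
}
```

      An enqueue or dispatch path in this model would call toy_rq_bypassing()
      only after taking the same rq's lock.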
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: David Vernet <void@manifault.com>
  2. 09 Sep, 2024 13 commits
    • scx_qmap: Implement highpri boosting · 2d285d56
      Tejun Heo authored
      Implement a silly boosting mechanism for nice -20 tasks. The only purpose is
      demonstrating and testing scx_bpf_dispatch_from_dsq(). The boosting only
      works within SHARED_DSQ and makes only minor differences with increased
      dispatch batch (-b).
      
      This exercises moving tasks to a user DSQ and all local DSQs from
      ops.dispatch() and BPF timerfn.
      
      v2: - Updated to use scx_bpf_dispatch_from_dsq_set_{slice|vtime}().
      
          - Drop the workaround for the iterated tasks not being trusted by the
            verifier. The issue is fixed on the BPF side.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: David Vernet <void@manifault.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Andrea Righi <andrea.righi@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
    • sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() · 4c30f5ce
      Tejun Heo authored
      Once a task is put into a DSQ, the allowed operations are fairly limited.
      Tasks in the built-in local and global DSQs are executed automatically and,
      ignoring dequeue, there is only one way a task in a user DSQ can be
      manipulated - scx_bpf_consume() moves the first task to the dispatching
      local DSQ. This inflexibility sometimes gets in the way and is an area where
      multiple feature requests have been made.
      
      Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
      DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
      user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
      which doesn't hold a rq lock including BPF timers and SYSCALL programs.
      
      This is an expansion of an earlier patch which only allowed moving into the
      dispatching local DSQ:
      
        http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org
      
      v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
          they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
          count limit and often won't be needed anyway. Instead provide
          scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
          only when needed and override the specified parameter for the subsequent
          dispatch.
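
      The v2 override semantics can be pictured with a small user-space model.
      This is a sketch under assumptions, not the kfunc implementation: the
      toy_* names and the flag layout are hypothetical. A setter records a value
      and a "has" flag on the iterator, the next dispatch applies whichever
      values were set, and the flags are then cleared so later dispatches are
      unaffected:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ITER_HAS_SLICE (1u << 0)
#define TOY_ITER_HAS_VTIME (1u << 1)

struct toy_dtask {
    uint64_t slice;
    uint64_t vtime;
};

struct toy_iter {
    uint32_t flags;
    uint64_t slice;
    uint64_t vtime;
};

/* Record a slice override for the next dispatch only. */
static void toy_set_slice(struct toy_iter *it, uint64_t slice)
{
    it->slice = slice;
    it->flags |= TOY_ITER_HAS_SLICE;
}

/* Record a vtime override for the next dispatch only. */
static void toy_set_vtime(struct toy_iter *it, uint64_t vtime)
{
    it->vtime = vtime;
    it->flags |= TOY_ITER_HAS_VTIME;
}

/* Apply pending overrides to the task, then forget them. */
static void toy_dispatch(struct toy_iter *it, struct toy_dtask *p)
{
    assert(it && p);
    if (it->flags & TOY_ITER_HAS_SLICE)
        p->slice = it->slice;
    if (it->flags & TOY_ITER_HAS_VTIME)
        p->vtime = it->vtime;
    it->flags &= ~(TOY_ITER_HAS_SLICE | TOY_ITER_HAS_VTIME);
}
```

      Keeping the overrides out of the dispatch call itself is what lets the
      dispatch function stay under the argument count limit mentioned above.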
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: David Vernet <void@manifault.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Andrea Righi <andrea.righi@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
    • sched_ext: Compact struct bpf_iter_scx_dsq_kern · 6462dd53
      Tejun Heo authored
      struct bpf_iter_scx_dsq is defined as 6 u64's and bpf_iter_scx_dsq_kern was
      using 5 of them. We want to add two more u64 fields but it's better if we do
      so while staying within bpf_iter_scx_dsq to maintain binary compatibility.

      The way bpf_iter_scx_dsq_kern is laid out is rather inefficient - the node
      field takes up three u64's but only one bit of the last u64 is used. Turn
      the bool into u32 flags and only use the lower 16 bits freeing up 48 bits -
      16 bits for flags, 32 bits for a u32 - for use by struct
      bpf_iter_scx_dsq_kern.
      
      This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
      into the cursor field reducing the struct size by a full u64.
      
      No behavior changes intended.
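
      The size arithmetic above can be checked with a stand-alone sketch. The
      toy_* structs are hypothetical approximations of the layouts described
      in this message, not copies of the kernel definitions, and the sizes
      assume a typical 64-bit ABI where u64 is 8-byte aligned:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Two link words stand in for struct list_head on a 64-bit build. */
struct toy_list_head { u64 next, prev; };

/* Before: a lone bool after the links pads the node to three u64's. */
struct toy_node_old {
    struct toy_list_head node;
    bool is_cursor;
};

/* After: u32 flags plus a spare u32 share the third u64. */
struct toy_node_new {
    struct toy_list_head node;
    u32 flags;  /* lower 16 bits for the node, upper 16 for the iterator */
    u32 priv;   /* spare u32, e.g. for the iterator's sequence number */
};

/* Before: five u64's in total. */
struct toy_iter_old {
    struct toy_node_old cursor;
    u64 dsq;      /* stands in for the DSQ pointer */
    u32 dsq_seq;
    u32 flags;
};

/* After: dsq_seq and flags live inside the cursor, four u64's in total. */
struct toy_iter_new {
    struct toy_node_new cursor;
    u64 dsq;
};

/* Two more u64 fields now fit within the six-u64 opaque struct. */
struct toy_iter_extended {
    struct toy_node_new cursor;
    u64 dsq;
    u64 extra[2];
};

static_assert(sizeof(struct toy_node_old) == 3 * sizeof(u64), "old node size");
static_assert(sizeof(struct toy_node_new) == 3 * sizeof(u64), "new node size");
static_assert(sizeof(struct toy_iter_extended) <= 6 * sizeof(u64), "fits");
```

      The node stays at three u64's either way. The saving comes from reusing
      its padding instead of spending a separate u64 on dsq_seq and flags.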
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() · cf3e9443
      Tejun Heo authored
      - Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().
      
      - Rename consume_local_task() to move_local_task_to_local_dsq() and remove
        task_unlink_from_dsq() and source DSQ unlocking from it.
      
      This is to make the migration code easier to reuse.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Move consume_local_task() upward · d434210e
      Tejun Heo authored
      So that the local case comes first and two CONFIG_SMP blocks can be merged.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() · 6557133e
      Tejun Heo authored
      All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
      task_unlink_from_dsq(). Also move sanity check into it.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Reorder args for consume_local/remote_task() · 1389f490
      Tejun Heo authored
      Reorder args for consistency in the order of:
      
        current_rq, p, src_[rq|dsq], dst_[rq|dsq].
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Restructure dispatch_to_local_dsq() · 18f85699
      Tejun Heo authored
      Now that there's nothing left after the big if block, flip the if condition
      and unindent the body.
      
      No functional changes intended.
      
      v2: Add BUG() to clarify control can't reach the end of
          dispatch_to_local_dsq() in UP kernels per David.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling · 0aab2630
      Tejun Heo authored
      With the preceding update, the only return value which makes a meaningful
      difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
      back to the global DSQ and the other, process_ddsp_deferred_locals(),
      doesn't do anything.

      It should always fall back to the global DSQ. Move the global DSQ fallback
      into dispatch_to_local_dsq() and remove the return value.
      
      v2: Patch title and description updated to reflect the behavior fix for
          process_ddsp_deferred_locals().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON · e683949a
      Tejun Heo authored
      find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
      Instead, each caller is handling SCX_DSQ_LOCAL_ON before calling it. Move
      SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
      code in direct_dispatch() and dispatch_to_local_dsq().
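
      A user-space sketch of a single lookup that also decodes LOCAL_ON IDs.
      The toy_* names are hypothetical, the bit layout of the DSQ ID is an
      assumption based on the sched_ext headers as understood here, and the
      choice to send an out-of-range CPU to the global DSQ is also an assumption
      of this sketch. User DSQ lookup is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS 4u

/* Assumed ID encoding: top bit marks built-in DSQs, next bit marks LOCAL_ON. */
#define TOY_DSQ_FLAG_BUILTIN   (1ULL << 63)
#define TOY_DSQ_FLAG_LOCAL_ON  (1ULL << 62)
#define TOY_DSQ_GLOBAL         (TOY_DSQ_FLAG_BUILTIN | 1ULL)
#define TOY_DSQ_LOCAL          (TOY_DSQ_FLAG_BUILTIN | 2ULL)
#define TOY_DSQ_LOCAL_ON       (TOY_DSQ_FLAG_BUILTIN | TOY_DSQ_FLAG_LOCAL_ON)
#define TOY_DSQ_LOCAL_CPU_MASK 0xffffffffULL

struct toy_dsq { int nr; };

static struct toy_dsq toy_global_dsq;
static struct toy_dsq toy_local_dsqs[TOY_NR_CPUS];

/* One lookup for every built-in ID, so callers need no LOCAL_ON special case. */
static struct toy_dsq *toy_find_dsq(uint64_t dsq_id, unsigned int this_cpu)
{
    assert(this_cpu < TOY_NR_CPUS);

    if (dsq_id == TOY_DSQ_LOCAL)
        return &toy_local_dsqs[this_cpu];

    if ((dsq_id & TOY_DSQ_LOCAL_ON) == TOY_DSQ_LOCAL_ON) {
        uint64_t cpu = dsq_id & TOY_DSQ_LOCAL_CPU_MASK;

        /* Sketch's choice: an invalid CPU falls back to the global DSQ. */
        if (cpu >= TOY_NR_CPUS)
            return &toy_global_dsq;
        return &toy_local_dsqs[cpu];
    }

    if (dsq_id == TOY_DSQ_GLOBAL)
        return &toy_global_dsq;

    return NULL;  /* user DSQ lookup omitted in this sketch */
}
```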
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Refactor consume_remote_task() · 4d3ca89b
      Tejun Heo authored
      The tricky p->scx.holding_cpu handling was split across
      consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:
      
      - All the tricky parts are now in the new unlink_dsq_and_lock_src_rq() with
        consolidated documentation.
      
      - move_task_to_local_dsq() now implements straightforward task migration
        making it easier to use in other places.
      
      - dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The
        usage is updated accordingly. This makes the local and remote cases more
        symmetric.
      
      No functional changes intended.
      
      v2: s/task_rq/src_rq/ for consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate · fdaedba2
      Tejun Heo authored
      Sleepables don't need to be in their own kfunc set as each is tagged with
      KF_SLEEPABLE. Rename to scx_kfunc_set_unlocked indicating that rq lock is
      not held and relocate right above the any set. This will be used to add
      kfuncs that are allowed to be called from SYSCALL but not TRACING.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Add missing static to scx_dump_data · 3ac35279
      Tejun Heo authored
      scx_dump_data is only used inside ext.c but doesn't have static. Add it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/
  3. 06 Sep, 2024 2 commits
  4. 04 Sep, 2024 14 commits
    • Merge branch 'bpf/master' into for-6.12 · 649e980d
      Tejun Heo authored
      Pull bpf/master to receive baebe9aa ("bpf: allow passing struct
      bpf_iter_<type> as kfunc arguments") and related changes in preparation for
      the DSQ iterator patchset.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Add a cgroup scheduler which uses flattened hierarchy · a4103eac
      Tejun Heo authored
      This patch adds scx_flatcg example scheduler which implements hierarchical
      weight-based cgroup CPU control by flattening the cgroup hierarchy into a
      single layer by compounding the active weight share at each level.
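
      As a worked illustration of compounding, the share of a leaf cgroup can be
      computed as the product of its ancestors' weight fractions along the path
      from the root. This is a simplified sketch, not scx_flatcg's code: the
      names are hypothetical and it uses floating point for readability, where a
      real scheduler would more likely use fixed-point arithmetic:

```c
#include <assert.h>
#include <stddef.h>

/* One level on the path from the root to a leaf cgroup. */
struct toy_level {
    double weight;              /* this cgroup's weight */
    double sibling_weight_sum;  /* sum of active weights at this level,
                                   including this cgroup's own weight */
};

/* Compound the per-level fractions into one flattened share in [0, 1]. */
static double toy_flattened_share(const struct toy_level *path, size_t depth)
{
    double share = 1.0;

    for (size_t i = 0; i < depth; i++) {
        assert(path[i].sibling_weight_sum > 0.0);
        share *= path[i].weight / path[i].sibling_weight_sum;
    }
    return share;
}
```

      For example, a cgroup holding half of the active weight at the first level
      and a quarter at the second ends up with 0.5 * 0.25 = 0.125 of the CPU in
      the flattened view.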
      
      This flattening of hierarchy can bring a substantial performance gain when
      the cgroup hierarchy is nested multiple levels. In a simple benchmark using
      wrk[8] on apache serving a CGI script calculating sha1sum of a small file,
      it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two
      apache instances competing with 2:1 weight ratio nested four levels deep.
      
      However, the gain comes at the cost of not being able to properly handle
      thundering herd of cgroups. For example, if many cgroups which are nested
      behind a low priority parent cgroup wake up around the same time, they may
      be able to consume more CPU cycles than they are entitled to. In many use
      cases, this isn't a real concern especially given the performance gain.
      Also, there are ways to mitigate the problem further by e.g. introducing an
      extra scheduling layer on cgroup delegation boundaries.
      
      v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of
            SCX_OPS_KNOB_CGROUP_WEIGHT.
      
      v4: - Revert reference counted kptr for cgv_node as the change caused easily
            reproducible stalls.
      
      v3: - Updated to reflect the core API changes including ops.init/exit_task()
            and direct dispatch from ops.select_cpu(). Fixes and improvements
            including additional statistics.
      
          - Use reference counted kptr for cgv_node instead of xchg'ing against
            stash location.
      
          - Dropped '-p' option.
      
      v2: - Use SCX_BUG[_ON]() to simplify error handling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Add cgroup support · 81951366
      Tejun Heo authored
      Add sched_ext_ops operations to init/exit cgroups, and track task migrations
      and config changes. A BPF scheduler may implement none or only a subset of
      the cgroup features. The implemented features can be indicated using
      %SCX_OPS_HAS_CGROUP_* flags. If cgroup configuration makes use of features
      that are not implemented, a warning is triggered.
      
      While a BPF scheduler is being enabled and disabled, relevant cgroup
      operations are locked out using scx_cgroup_rwsem. This avoids situations
      like task prep taking place while the task is being moved across cgroups,
      making things easier for BPF schedulers.
      
      v7: - cgroup interface file visibility toggling is dropped in favor of just
            warning messages. Dynamically changing interface visibility caused
            more confusion than it resolved.
      
      v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.
      
          - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for
            !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.
      
      v5: - Flipped the locking order between scx_cgroup_rwsem and
            cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
            documentation around locking.
      
          - sched_move_task() takes an early exit if the source and destination
            are identical. This triggered the warning in scx_cgroup_can_attach()
            as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
            migration path so that ops.cgroup_prep_move() is skipped for identity
            migrations so that its invocations always match ops.cgroup_move()
            one-to-one.
      
      v4: - Example schedulers moved into their own patches.
      
          - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.
      
      v3: - Make scx_example_pair switch all tasks by default.
      
          - Convert to BPF inline iterators.
      
          - scx_bpf_task_cgroup() is added to determine the current cgroup from
            CPU controller's POV. This allows BPF schedulers to accurately track
            CPU cgroup membership.
      
          - scx_example_flatcg added. This demonstrates flattened hierarchy
            implementation of CPU cgroup control and shows significant performance
            improvement when cgroups which are nested multiple levels are under
            competition.
      
      v2: - Build fixes for different CONFIG combinations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Andrea Righi <andrea.righi@canonical.com>
    • sched: Introduce CONFIG_GROUP_SCHED_WEIGHT · e179e80c
      Tejun Heo authored
      sched_ext will soon add cgroup cpu.weight support. The cgroup interface code
      is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or
      SCX may implement the feature, put the interface code behind the new
      CONFIG_GROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED.
      This allows either sched class to enable the interface code without adding
      more complex CONFIG tests.
      
      When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares()
      is added to support later CONFIG_GROUP_SCHED_WEIGHT &&
      !CONFIG_FAIR_GROUP_SCHED builds.
      
      No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Make cpu_shares_read_u64() use tg_weight() · 41082c1d
      Tejun Heo authored
      Move tg_weight() upward and make cpu_shares_read_u64() use it too. This
      makes the weight retrieval shared between cgroup v1 and v2 paths and will be
      used to implement cgroup support for sched_ext.
      
      No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched: Expose css_tg() · 859dc4ec
      Tejun Heo authored
      A new BPF extensible sched_class will use css_tg() in the init and exit
      paths to visit all task_groups by walking cgroups.
      
      v4: __setscheduler_prio() is already exposed. Dropped from this patch.
      
      v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup
          mechanism.
      
      v2: Expose SCHED_CHANGE_BLOCK() too and update the description.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable · a8532fac
      Tejun Heo authored
      During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
      on every task. To do this, it calls get_task_struct() on each iterated task,
      drops the lock and then calls ops.init_task().

      However, a TASK_DEAD task may already have lost all its usage count and be
      waiting for an RCU grace period to be freed. If get_task_struct() is called
      on such a task, use-after-free can happen. To avoid such situations,
      scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
      as they are never going to be scheduled again.
      
      Unfortunately, a racing sched_setscheduler(2) can grab the task before the
      task is unhashed and then continue to e.g. move the task from RT to SCX
      after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
      gone through scx_ops_init_task(), scx_ops_enable_task() called from
      switching_to_scx() triggers the following warning:
      
        sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
        WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
        ...
        RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
        ...
         switching_to_scx+0x13/0xa0
         __sched_setscheduler+0x84e/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      As in the ops_disable path, it just doesn't seem like a good idea to leave
      any task in an inconsistent state, even when the task is dead. The root
      cause is ops_enable not being able to tell reliably whether a task is truly
      dead (no one else is looking at it and it's about to be freed) and was
      testing TASK_DEAD instead. Fix it by testing the task's usage count
      directly.
      
      - ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all
        tasks, @include_dead is removed from scx_task_iter_next_locked() along
        with dead task filtering.
      
      - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct()
        fails.
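
      The "take a reference only if the count is still non-zero" pattern behind
      tryget_task_struct() can be sketched in user space with C11 atomics. The
      toy_* names are hypothetical and this is not the kernel helper. It only
      shows why a task whose usage count already hit zero is skipped instead of
      being resurrected:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_task {
    atomic_int usage;  /* stands in for the task's usage count */
};

/* Take a reference only if the count has not already dropped to zero. */
static bool toy_tryget_task(struct toy_task *p)
{
    int old = atomic_load(&p->usage);

    while (old != 0) {
        /* On failure the CAS reloads 'old', so the loop re-checks for zero. */
        if (atomic_compare_exchange_weak(&p->usage, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference previously taken. */
static void toy_put_task(struct toy_task *p)
{
    int old = atomic_fetch_sub(&p->usage, 1);

    assert(old > 0);
}
```

      A plain increment would turn a zero count back into one and hand out a
      pointer to memory that is about to be freed, which is the use-after-free
      described above.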
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • sched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disable · 61eeb9a9
      Tejun Heo authored
      scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while
      calling scx_ops_exit_task() on all tasks including dead ones. This can leave
      a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent.
      
      If another task was in the process of changing the TASK_DEAD task's
      scheduling class and grabs the rq lock after scx_ops_disable_workfn() is
      done with the task, the task ends up calling scx_ops_disable_task() on the
      dead task which is in an inconsistent state triggering a warning:
      
        WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160
        ...
        RIP: 0010:scx_ops_disable_task+0x12c/0x160
        ...
        Call Trace:
         <TASK>
         check_class_changed+0x2c/0x70
         __sched_setscheduler+0x8a0/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
        RIP: 0033:0x7f140d70ea5b
      
      There is no reason to leave dead tasks on SCX when unloading the BPF
      scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including
      the dead ones from SCX.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Remove sched_class->switch_class() · 37cb049e
      Tejun Heo authored
      With sched_ext converted to use put_prev_task() for class switch detection,
      there's no user of switch_class() left. Drop it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • sched_ext: Remove switch_class_scx() · f422316d
      Tejun Heo authored
      Now that put_prev_task_scx() is called with @next on task switches, there's
      no reason to use sched_class.switch_class(). Rename switch_class_scx() to
      switch_class() and call it from put_prev_task_scx().
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Relocate functions in kernel/sched/ext.c · 65aaf905
      Tejun Heo authored
      Relocate functions to ease the removal of switch_class_scx(). No functional
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Unify regular and core-sched pick task paths · 753e2836
      Tejun Heo authored
      Because the BPF scheduler's dispatch path is invoked from balance(),
      sched_ext needs to invoke balance_one() on all sibling rq's before picking
      the next task for core-sched.
      
      Before the recent pick_next_task() updates, sched_ext couldn't share pick
      task between regular and core-sched paths because pick_next_task() depended
      on put_prev_task() being called on the current task. Tasks currently running
      on sibling rq's can't be put when one rq is trying to pick the next task, so
      pick_task_scx() had to have a separate mechanism to pick between a sibling
      rq's current task and the first task in its local DSQ.
      
      However, with the preceding updates, pick_next_task_scx() no longer depends
      on the current task being put and can compare the current task and the next
      in line statelessly, and the pick task logic should be shareable between
      regular and core-sched paths.
      
      Unify regular and core-sched pick task paths:
      
      - There's no reason to distinguish local and sibling picks anymore. @local
        is removed from balance_one().
      
      - pick_next_task_scx() is turned into pick_task_scx() by dropping the
        put_prev_set_next_task() call.
      
      - The old pick_task_scx() is dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP · 8b1451f2
      Tejun Heo authored
      SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to
      keep running the current task. It's not really a task property. Replace it
      with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for
      the usage. Also, the existing clearing rule is unnecessarily strict and
      makes it difficult to use with core-sched. Just clear it on entry to
      balance_one().
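
      A toy model of the flag handling described above. The names are
      hypothetical and the keep condition is simplified to "still runnable with
      slice left", which is an assumption of this sketch. The flag lives on the
      rq, is cleared on every entry to the balance step, and tells the pick step
      whether to keep the current task:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RQ_BAL_KEEP (1u << 0)  /* stands in for SCX_RQ_BAL_KEEP */

struct toy_task2 {
    bool runnable;
    uint64_t slice;
};

struct toy_rq2 {
    uint32_t flags;
    struct toy_task2 *curr;         /* currently running task, may be null */
    struct toy_task2 *local_first;  /* head of the local queue, may be null */
};

/* Clear the flag on entry, then set it again only if curr should keep going. */
static void toy_balance_one(struct toy_rq2 *rq)
{
    rq->flags &= ~TOY_RQ_BAL_KEEP;

    if (rq->curr && rq->curr->runnable && rq->curr->slice > 0)
        rq->flags |= TOY_RQ_BAL_KEEP;
}

/* Keep the current task if balance said so, else take the queue head. */
static struct toy_task2 *toy_pick_task(struct toy_rq2 *rq)
{
    if (rq->flags & TOY_RQ_BAL_KEEP) {
        assert(rq->curr);
        return rq->curr;
    }
    return rq->local_first;
}
```

      Because the flag is recomputed from scratch on each balance pass, no other
      path has to remember to clear it.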
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Don't call put_prev_task_scx() before picking the next task · 7c65ae81
      Tejun Heo authored
      fd03c5b8 ("sched: Rework pick_next_task()") changed the definition of
      pick_next_task() from:
      
        pick_next_task() := pick_task() + set_next_task(.first = true)
      
      to:
      
        pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)
      
      making invoking put_prev_task() pick_next_task()'s responsibility. This
      reordering allows pick_task() to be shared between regular and core-sched
      paths and put_prev_task() to know the next task.
      
      sched_ext depended on put_prev_task_scx() enqueueing the current task before
      pick_next_task_scx() is called. While pulling sched/core changes,
      70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an
      explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx()
      before picking the first task as a workaround.
      
      Clean it up and adopt the conventions that other sched classes are
      following.
      
      The logic for keeping the current task running was spread out and required
      the task to be put on the local DSQ before picking:
      
        - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still
          runnable, hasn't exhausted its slice, and thus should keep running.
      
        - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP
          is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the
          only runnable task. do_enqueue_task() in turn decided whether to use the
          local DSQ depending on SCX_OPS_ENQ_LAST.
      
      Consolidate the logic in balance_one() as it always knows whether it is
      going to keep the current task. balance_one() now considers all conditions
      where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell
      pick_next_task_scx() to keep the current task instead of picking one from
      the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from
      put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is
      updated to pick the current task if SCX_TASK_BAL_KEEP is set.
      
      The workaround put_prev_task[_scx]() calls are replaced with
      put_prev_set_next_task().
      
      This causes two behavior changes observable from the BPF scheduler:
      
      - When a task keeps running, it no longer goes through an enqueue/dequeue
        cycle and thus ops.stopping/running() transitions. The new behavior is
        better and all the existing schedulers should be able to handle it.
      
      - The BPF scheduler cannot keep executing the current task by enqueueing
        SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
        BPF scheduler is responsible for resuming execution after each
        SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling
        decisions are not made on the local CPU - e.g. central or userspace-driven
        scheduling - and the new behavior is more logical and shouldn't pose any
        problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it
        doesn't fit that well anymore and the last task handling is moved to the
        end of qmap_dispatch().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
  5. 03 Sep, 2024 10 commits