An error occurred fetching the project authors.
  1. 04 Sep, 2024 5 commits
    • Tejun Heo's avatar
      sched_ext: Add cgroup support · 81951366
      Tejun Heo authored
      Add sched_ext_ops operations to init/exit cgroups, and track task migrations
      and config changes. A BPF scheduler may not implement or implement only
      subset of cgroup features. The implemented features can be indicated using
      %SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features
      that are not implemented, a warning is triggered.
      
      While a BPF scheduler is being enabled and disabled, relevant cgroup
      operations are locked out using scx_cgroup_rwsem. This avoids situations
      like task prep taking place while the task is being moved across cgroups,
      making things easier for BPF schedulers.
      
      v7: - cgroup interface file visibility toggling is dropped in favor just
            warning messages. Dynamically changing interface visiblity caused more
            confusion than helping.
      
      v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.
      
          - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for
            !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.
      
      v5: - Flipped the locking order between scx_cgroup_rwsem and
            cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
            documentation around locking.
      
          - sched_move_task() takes an early exit if the source and destination
            are identical. This triggered the warning in scx_cgroup_can_attach()
            as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
            migration path so that ops.cgroup_prep_move() is skipped for identity
            migrations so that its invocations always match ops.cgroup_move()
            one-to-one.
      
      v4: - Example schedulers moved into their own patches.
      
          - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.
      
      v3: - Make scx_example_pair switch all tasks by default.
      
          - Convert to BPF inline iterators.
      
          - scx_bpf_task_cgroup() is added to determine the current cgroup from
            CPU controller's POV. This allows BPF schedulers to accurately track
            CPU cgroup membership.
      
          - scx_example_flatcg added. This demonstrates flattened hierarchy
            implementation of CPU cgroup control and shows significant performance
            improvement when cgroups which are nested multiple levels are under
            competition.
      
      v2: - Build fixes for different CONFIG combinations.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Cc: Andrea Righi <andrea.righi@canonical.com>
      81951366
    • Tejun Heo's avatar
      sched: Introduce CONFIG_GROUP_SCHED_WEIGHT · e179e80c
      Tejun Heo authored
      sched_ext will soon add cgroup cpu.weigh support. The cgroup interface code
      is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or
      SCX may implement the feature, put the interface code behind the new
      CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED.
      This allows either sched class to enable the itnerface code without ading
      more complex CONFIG tests.
      
      When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares()
      is added to support later CONFIG_CGROUP_SCHED_WEIGHT &&
      !CONFIG_FAIR_GROUP_SCHED builds.
      
      No functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      e179e80c
    • Tejun Heo's avatar
      sched: Expose css_tg() · 859dc4ec
      Tejun Heo authored
      A new BPF extensible sched_class will use css_tg() in the init and exit
      paths to visit all task_groups by walking cgroups.
      
      v4: __setscheduler_prio() is already exposed. Dropped from this patch.
      
      v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup
          mechanism.
      
      v2: Expose SCHED_CHANGE_BLOCK() too and update the description.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      859dc4ec
    • Tejun Heo's avatar
      sched_ext: Remove sched_class->switch_class() · 37cb049e
      Tejun Heo authored
      With sched_ext converted to use put_prev_task() for class switch detection,
      there's no user of switch_class() left. Drop it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      37cb049e
    • Tejun Heo's avatar
      sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP · 8b1451f2
      Tejun Heo authored
      SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to
      keep running the current task. It's not really a task property. Replace it
      with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for
      the usage. Also, the existing clearing rule is unnecessarily strict and
      makes it difficult to use with core-sched. Just clear it on entry to
      balance_one().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      8b1451f2
  2. 03 Sep, 2024 6 commits
  3. 17 Aug, 2024 6 commits
  4. 06 Aug, 2024 2 commits
  5. 29 Jul, 2024 6 commits
  6. 12 Jul, 2024 3 commits
    • Tejun Heo's avatar
      sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches · 5b26f7b9
      Tejun Heo authored
      In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
      local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
      and ops.enqueue(), this isn't allowed. This is because dispatching to the
      local DSQ of a remote CPU requires locking both the task's current and new
      rq's and such double locking can't be done directly from ops.enqueue().
      
      While waking up a task, as ops.select_cpu() can pick any CPU and both
      ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
      target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
      do whatever it wants to do. However, while a task is being enqueued for a
      different reason, e.g. after its slice expiration, only ops.enqueue() is
      called and there's no way for the BPF scheduler to directly dispatch to the
      local DSQ of a remote CPU. This gap in API forces schedulers into
      work-arounds which are not straightforward or optimal such as skipping
      direct dispatches in such cases.
      
      Implement deferred enqueueing to allow directly dispatching to the local DSQ
      of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
      temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
      safely released, the tasks are taken off the list and queued on the target
      local DSQs using dispatch_to_local_dsq().
      
      v2: - Add missing return after queue_balance_callback() in
            schedule_deferred(). (David).
      
          - dispatch_to_local_dsq() now assumes that @rq is locked but unpinned
            and thus no longer takes @rf. Updated accordingly.
      
          - UP build warning fix.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Tested-by: default avatarAndrea Righi <righi.andrea@gmail.com>
      Acked-by: default avatarDavid Vernet <void@manifault.com>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      Cc: Changwoo Min <changwoo@igalia.com>
      5b26f7b9
    • Tejun Heo's avatar
      sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP · f47a8189
      Tejun Heo authored
      SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
      Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether
      the rq is currently enqueueing for a wakeup. This will be used to implement
      direct dispatching to local DSQ of another CPU.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarDavid Vernet <void@manifault.com>
      f47a8189
    • Tejun Heo's avatar
      sched: Move struct balance_callback definition upward · fc283116
      Tejun Heo authored
      Move struct balance_callback definition upward so that it's visible to
      class-specific rq struct definitions. This will be used to embed a struct
      balance_callback in struct scx_rq.
      
      No functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarDavid Vernet <void@manifault.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      fc283116
  7. 08 Jul, 2024 1 commit
    • Tejun Heo's avatar
      sched, sched_ext: Move some declarations from kernel/sched/ext.h to sched.h · e196c908
      Tejun Heo authored
      While sched_ext was out of tree, everything sched_ext specific which can be
      put in kernel/sched/ext.h was put there to ease forward porting. However,
      kernel/sched/sched.h is the better location for some of them. Relocate.
      
      - struct sched_enq_and_set_ctx, sched_deq_and_put_task() and
        sched_enq_and_set_task().
      
      - scx_enabled() and scx_switched_all().
      
      - for_active_class_range() and for_each_active_class(). sched_class
        declarations are moved above the class iterators for this.
      
      No functional changes intended.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      e196c908
  8. 04 Jul, 2024 1 commit
  9. 01 Jul, 2024 1 commit
    • John Stultz's avatar
      sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath · ddae0ca2
      John Stultz authored
      It was reported that in moving to 6.1, a larger then 10%
      regression was seen in the performance of
      clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).
      
      Using a simple reproducer, I found:
      5.10:
      100000000 calls in 24345994193 ns => 243.460 ns per call
      100000000 calls in 24288172050 ns => 242.882 ns per call
      100000000 calls in 24289135225 ns => 242.891 ns per call
      
      6.1:
      100000000 calls in 28248646742 ns => 282.486 ns per call
      100000000 calls in 28227055067 ns => 282.271 ns per call
      100000000 calls in 28177471287 ns => 281.775 ns per call
      
      The cause of this was finally narrowed down to the addition of
      psi_account_irqtime() in update_rq_clock_task(), in commit
      52b1364b ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
      pressure").
      
      In my initial attempt to resolve this, I leaned towards moving
      all accounting work out of the clock_gettime() call path, but it
      wasn't very pretty, so it will have to wait for a later deeper
      rework. Instead, Peter shared this approach:
      
      Rework psi_account_irqtime() to use its own psi_irq_time base
      for accounting, and move it out of the hotpath, calling it
      instead from sched_tick() and __schedule().
      
      In testing this, we found the importance of ensuring
      psi_account_irqtime() is run under the rq_lock, which Johannes
      Weiner helpfully explained, so also add some lockdep annotations
      to make that requirement clear.
      
      With this change the performance is back in-line with 5.10:
      6.1+fix:
      100000000 calls in 24297324597 ns => 242.973 ns per call
      100000000 calls in 24318869234 ns => 243.189 ns per call
      100000000 calls in 24291564588 ns => 242.916 ns per call
      Reported-by: default avatarJimmy Shiu <jimmyshiu@google.com>
      Originally-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarJohn Stultz <jstultz@google.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarChengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: default avatarQais Yousef <qyousef@layalina.io>
      Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com
      ddae0ca2
  10. 21 Jun, 2024 2 commits
    • Tejun Heo's avatar
      sched_ext: Add cpuperf support · d86adb4f
      Tejun Heo authored
      sched_ext currently does not integrate with schedutil. When schedutil is the
      governor, frequencies are left unregulated and usually get stuck close to
      the highest performance level from running RT tasks.
      
      Add CPU performance monitoring and scaling support by integrating into
      schedutil. The following kfuncs are added:
      
      - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
        different CPUs in the system.
      
      - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
        relative to its max performance.
      
      - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
      
      This gives direct control over CPU performance setting to the BPF scheduler.
      The only changes on the schedutil side are accounting for the utilization
      factor from sched_ext and disabling frequency holding heuristics as it may
      not apply well to sched_ext schedulers which may have a lot weaker
      connection between tasks and their current / last CPU.
      
      With cpuperf support added, there is no reason to block uclamp. Enable while
      at it.
      
      A toy implementation of cpuperf is added to scx_qmap as a demonstration of
      the feature.
      
      v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
          to avoid factoring in stale util metric. (Christian)
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Christian Loehle <christian.loehle@arm.com>
      d86adb4f
    • Tejun Heo's avatar
      sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class() · b999e365
      Tejun Heo authored
      scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when
      a CPU is taken away by a task dispatched from a higher priority sched_class
      so that the BPF scheduler can, e.g., punt the task[s] which was running or
      were waiting for the CPU to other CPUs.
      
      Replace the sched_ext specific hook scx_next_task_picked() with a new
      sched_class operation switch_class().
      
      The changes are straightforward and the code looks better afterwards.
      However, when !CONFIG_SCHED_CLASS_EXT, this ends up adding an unused hook
      which is unlikely to be useful to other sched_classes. For further
      discussion on this subject, please refer to the following:
      
        http://lkml.kernel.org/r/CAHk-=wjFPLqo7AXu8maAGEGnOy6reUg-F4zzFhVB0Kyu22h7pw@mail.gmail.comSigned-off-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      b999e365
  11. 18 Jun, 2024 7 commits
    • Tejun Heo's avatar
      sched_ext: Implement sched_ext_ops.cpu_online/offline() · 60c27fb5
      Tejun Heo authored
      Add ops.cpu_online/offline() which are invoked when CPUs come online and
      offline respectively. As the enqueue path already automatically bypasses
      tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed
      to see tasks only on CPUs which are between online() and offline().
      
      If the BPF scheduler doesn't implement ops.cpu_online/offline(), the
      scheduler is automatically exited with SCX_ECODE_RESTART |
      SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotpplug support
      trivially by simply reinitializing and reloading the scheduler.
      
      scx_qmap is updated to print out online CPUs on hotplug events. Other
      schedulers are updated to restart based on ecode.
      
      v3: - The previous implementation added @reason to
            sched_class.rq_on/offline() to distinguish between CPU hotplug events
            and topology updates. This was buggy and fragile as the methods are
            skipped if the current state equals the target state. Instead, add
            scx_rq_[de]activate() which are directly called from
            sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to
            sleep which can be useful.
      
          - ops.dispatch() could be called on a CPU that the BPF scheduler was
            told to be offline. The dispatch patch is updated to bypass in such
            cases.
      
      v2: - To accommodate lock ordering change between scx_cgroup_rwsem and
            cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI
            block and enabled eariler during scx_ope_enable() so that
            cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem.
      
          - Auto exit with ECODE added.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      60c27fb5
    • David Vernet's avatar
      sched_ext: Implement sched_ext_ops.cpu_acquire/release() · 245254f7
      David Vernet authored
      Scheduler classes are strictly ordered and when a higher priority class has
      tasks to run, the lower priority ones lose access to the CPU. Being able to
      monitor and act on these events are necessary for use cases includling
      strict core-scheduling and latency management.
      
      This patch adds two operations ops.cpu_acquire() and .cpu_release(). The
      former is invoked when a CPU becomes available to the BPF scheduler and the
      opposite for the latter. This patch also implements
      scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
      requeueing of all tasks in the local dsq of the CPU so that the tasks can be
      reassigned to other available CPUs.
      
      scx_pair is updated to use .cpu_acquire/release() along with
      %SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
      is preempted by a higher priority scheduler class.
      
      scx_qmap is updated to use .cpu_acquire/release() to empty the local
      dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
      that want to have a tight control over latency.
      
      v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.
      
      v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces
          access control through the verifier, so the qualifier isn't actually
          operative and only gets in the way when interacting with various
          helpers.
      
      v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
          from ops.cpu_release() nested inside ops.init() and other sleepable
          operations.
      Signed-off-by: default avatarDavid Vernet <dvernet@meta.com>
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      245254f7
    • David Vernet's avatar
      sched_ext: Implement SCX_KICK_WAIT · 90e55164
      David Vernet authored
      If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for
      the kicked cpu to enter the scheduler. See the following for example usage:
      
        https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c
      
      v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation.
      
          - Include SCX_KICK_WAIT related information in debug dump.
      Signed-off-by: default avatarDavid Vernet <dvernet@meta.com>
      Reviewed-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      90e55164
    • Tejun Heo's avatar
      sched_ext: Implement tickless support · 22a92020
      Tejun Heo authored
      Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
      to SCX_SLICE_INF. A CPU whose current task has infinte slice goes into
      tickless operation.
      
      scx_central is updated to use tickless operations for all tasks and
      instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT
      and task state tracking added by the previous patches.
      
      Currently, there is no way to pin the timer on the central CPU, so it may
      end up on one of the worker CPUs; however, outside of that, the worker CPUs
      can go tickless both while running sched_ext tasks and idling.
      
      With schbench running, scx_central shows:
      
        root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
        LOC:     142024        656        664        449   Local timer interrupts
        LOC:     161663        663        665        449   Local timer interrupts
      
      Without it:
      
        root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
        LOC:     188778       3142       3793       3993   Local timer interrupts
        LOC:     198993       5314       6323       6438   Local timer interrupts
      
      While scx_central itself is too barebone to be useful as a
      production scheduler, a more featureful central scheduler can be built using
      the same approach. Google's experience shows that such an approach can have
      significant benefits for certain applications such as VM hosting.
      
      v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available.
      
      v3: Pin the central scheduler's timer on the central_cpu using
          BPF_F_TIMER_CPU_PIN.
      
      v2: Convert to BPF inline iterators.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      22a92020
    • Tejun Heo's avatar
      sched_ext: Implement scx_bpf_kick_cpu() and task preemption support · 81aae789
      Tejun Heo authored
      It's often useful to wake up and/or trigger reschedule on other CPUs. This
      patch adds scx_bpf_kick_cpu() kfunc helper that BPF scheduler can call to
      kick the target CPU into the scheduling path.
      
      As a sched_ext task relinquishes its CPU only after its slice is depleted,
      this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clears the
      slice of the target CPU's current task to guarantee that sched_ext's
      scheduling path runs on the CPU.
      
      If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle
      to guarantee that the target CPU will go through at least one full sched_ext
      scheduling cycle after the kicking. This can be used to wake up idle CPUs
      without incurring unnecessary overhead if it isn't currently idle.
      
      As a demonstration of how backward compatibility can be supported using BPF
      CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides
      __COMPAT_scx_bpf_kick_cpu_IDLE() which uses SCX_KICK_IDLE if available or
      becomes a regular kicking otherwise. This allows schedulers to use the new
      SCX_KICK_IDLE while maintaining support for older kernels. The plan is to
      temporarily use compat helpers to ease API updates and drop them after a few
      kernel releases.
      
      v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for
            schedulers so that they can support kernels without SCX_KICK_IDLE.
            This is useful as a demonstration of how new feature flags can be
            added in a backward compatible way.
      
          - kick_cpus_irq_workfn() reimplemented so that it touches the pending
            cpumasks only as necessary to reduce kicking overhead on machines with
            a lot of CPUs.
      
          - tools/sched_ext/include/scx/compat.bpf.h added.
      
      v4: - Move example scheduler to its own patch.
      
      v3: - Make scx_example_central switch all tasks by default.
      
          - Convert to BPF inline iterators.
      
      v2: - Julia Lawall reported that scx_example_central can overflow the
            dispatch buffer and malfunction. As scheduling for other CPUs can't be
            handled by the automatic retry mechanism, fix by implementing an
            explicit overflow and retry handling.
      
          - Updated to use generic BPF cpumask helpers.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      81aae789
    • Tejun Heo's avatar
      sched_ext: Implement BPF extensible scheduler class · f0e1a064
      Tejun Heo authored
      Implement a new scheduler class sched_ext (SCX), which allows scheduling
      policies to be implemented as BPF programs to achieve the following:
      
      1. Ease of experimentation and exploration: Enabling rapid iteration of new
         scheduling policies.
      
      2. Customization: Building application-specific schedulers which implement
         policies that are not applicable to general-purpose schedulers.
      
      3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
         policies in production environments.
      
      sched_ext leverages BPF’s struct_ops feature to define a structure which
      exports function callbacks and flags to BPF programs that wish to implement
      scheduling policies. The struct_ops structure exported by sched_ext is
      struct sched_ext_ops, and is conceptually similar to struct sched_class. The
      role of sched_ext is to map the complex sched_class callbacks to the more
      simple and ergonomic struct sched_ext_ops callbacks.
      
      For more detailed discussion on the motivations and overview, please refer
      to the cover letter.
      
      Later patches will also add several example schedulers and documentation.
      
      This patch implements the minimum core framework to enable implementation of
      BPF schedulers. Subsequent patches will gradually add functionalities
      including safety guarantee mechanisms, nohz and cgroup support.
      
      include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
      top, each operation should be self-explanatory. The followings are worth
      noting:
      
      - Both "sched_ext" and its shorthand "scx" are used. If the identifier
        already has "sched" in it, "ext" is used; otherwise, "scx".
      
      - In sched_ext_ops, only .name is mandatory. Every operation is optional and
        if omitted a simple but functional default behavior is provided.
      
      - A new policy constant SCHED_EXT is added and a task can select sched_ext
        by invoking sched_setscheduler(2) with the new policy constant. However,
        if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
        and the task is scheduled by CFS. When the BPF scheduler is loaded, all
        tasks which have the SCHED_EXT policy are switched to sched_ext.
      
      - To bridge the workflow imbalance between the scheduler core and
        sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
        queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
        one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
        convenience and need not be used by a scheduler that doesn't require it.
        SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
        the next task on the CPU. The BPF scheduler can manage an arbitrary number
        of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
      
      - sched_ext guarantees system integrity no matter what the BPF scheduler
        does. To enable this, each task's ownership is tracked through
        p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
        can always recover and revert all tasks back to CFS. See p->scx.ops_state
        and scx_tasks.
      
      - A task is not tied to its rq while enqueued. This decouples CPU selection
        from queueing and allows sharing a scheduling queue across an arbitrary
        subset of CPUs. This adds some complexities as a task may need to be
        bounced between rq's right before it starts executing. See
        dispatch_to_local_dsq() and move_task_to_local_dsq().
      
      - One complication that arises from the above weak association between task
        and rq is that synchronizing with dequeue() gets complicated as dequeue()
        may happen anytime while the task is enqueued and the dispatch path might
        need to release the rq lock to transfer the task. Solving this requires a
        bit of complexity. See the logic around p->scx.sticky_cpu and
        p->scx.ops_qseq.
      
      - Both enable and disable paths are a bit complicated. The enable path
        switches all tasks without blocking to avoid issues which can arise from
        partially switched states (e.g. the switching task itself being starved).
        The disable path can't trust the BPF scheduler at all, so it also has to
        guarantee forward progress without blocking. See scx_ops_enable() and
        scx_ops_disable_workfn().
      
      - When sched_ext is disabled, static_branches are used to shut down the
        entry points from hot paths.
      
      v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab
            scx_ops_enable_mutex which can lead to deadlocks in the disable path.
            Fixed.
      
          - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead
            to use-after-free.
      
          - Consolidated per-cpu variable usages and other cleanups.
      
      v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
            groups can be expressed. Later CPU hotplug operations are put into
            their own group.
      
          - SCX_OPS_DISABLING state is replaced with the new bypass mechanism
            which allows temporarily putting the system into simple FIFO
            scheduling mode bypassing the BPF scheduler. In addition to the shut
            down path, this will also be used to isolate the BPF scheduler across
            PM events. Enabling and disabling the bypass mode requires iterating
            all runnable tasks. rq->scx.runnable_list addition is moved from the
            later watchdog patch.
      
          - ops.prep_enable() is replaced with ops.init_task() and
            ops.enable/disable() are now called whenever the task enters and
            leaves sched_ext instead of when the task becomes schedulable on
            sched_ext and stops being so. A new operation - ops.exit_task() - is
            called when the task stops being schedulable on sched_ext.
      
          - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
            removes the need for communicating local dispatch decision made by
            ops.select_cpu() to ops.enqueue() via per-task storage.
            SCX_KF_SELECT_CPU is added to support the change.
      
          - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that
            scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
            was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly
            if it finds a suitable idle CPU. If such behavior is not desired,
            users can use scx_bpf_select_cpu_dfl() which returns the verdict in a
            bool out param.
      
          - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
            queueing many tasks on a local DSQ which makes tasks to execute in
            order while other CPUs stay idle which made some hackbench numbers
            really bad. Fixed.
      
          - The current state of sched_ext can now be monitored through files
            under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
            to enable monitoring on kernels which don't enable debugfs.
      
          - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
            be NULL and a BPF scheduler which derefs the pointer without checking
            could crash the kernel. Tell BPF. This is currently a bit ugly. A
            better way to annotate this is expected in the future.
      
          - scx_exit_info updated to carry pointers to message buffers instead of
            embedding them directly. This decouples buffer sizes from API so that
            they can be changed without breaking compatibility.
      
          - exit_code added to scx_exit_info. This is used to indicate different
            exit conditions on non-error exits and will be used to handle e.g. CPU
            hotplugs.
      
          - The patch "sched_ext: Allow BPF schedulers to switch all eligible
            tasks into sched_ext" is folded in and the interface is changed so
            that partial switching is indicated with a new ops flag
            %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry
            and in turn SCX_KF_INIT. ops.init() is now called with
            SCX_KF_SLEEPABLE.
      
          - Code reorganized so that only the parts necessary to integrate with
            the rest of the kernel are in the header files.
      
          - Changes to reflect the BPF and other kernel changes including the
            addition of bpf_sched_ext_ops.cfi_stubs.
      
      v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
            instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
            load_acquire/store_release is now unsigned long instead of u64.
      
          - Fix the bug where bpf_scx_btf_struct_access() was allowing write
            access to arbitrary fields.
      
          - Distinguish kfuncs which can be called from any sched_ext ops and from
            anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
            sched_ext ops.
      
          - Rename "type" to "kind" in scx_exit_info to make it easier to use on
            languages in which "type" is a reserved keyword.
      
          - Since cff9b233 ("kernel/sched: Modify initial boot task idle
            setup"), PF_IDLE is not set on idle tasks which haven't been online
            yet which made scx_task_iter_next_filtered() include those idle tasks
            in iterations leading to oopses. Update scx_task_iter_next_filtered()
            to directly test p->sched_class against idle_sched_class instead of
            using is_idle_task() which tests PF_IDLE.
      
          - Other updates to match upstream changes such as adding const to
            set_cpumask() param and renaming check_preempt_curr() to
            wakeup_preempt().
      
      v4: - SCHED_CHANGE_BLOCK replaced with the previous
            sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is
            because upstream is adaopting a different generic cleanup mechanism.
            Once that lands, the code will be adapted accordingly.
      
          - task_on_scx() used to test whether a task should be switched into SCX,
            which is confusing. Renamed to task_should_scx(). task_on_scx() now
            tests whether a task is currently on SCX.
      
          - scx_has_idle_cpus is barely used anymore and replaced with direct
            check on the idle cpumask.
      
          - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
            fully idle cores.
      
          - ops.enable() now sees up-to-date p->scx.weight value.
      
          - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
            schedulers expecting ->select_cpu() call.
      
          - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
            of the scheduler.
      
      v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
            without polling p->scx.weight.
      
          - move_task_to_local_dsq() was losing SCX-specific enq_flags when
            enqueueing the task on the target dsq because it goes through
            activate_task() which loses the upper 32bit of the flags. Carry the
            flags through rq->scx.extra_enq_flags.
      
          - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
            and scx_bpf_task_cpu() now use the new KF_RCU instead of
            KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
      
          - The kfunc helper access control mechanism implemented through
            sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
            used when invoking scx_ops operations.
      
      v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
            called from put_prev_taks_scx() and pick_next_task_scx() as necessary.
            To determine whether balance_scx() should be called from
            put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
            comment in put_prev_task_scx() for details.
      
          - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
            with SCHED_CHANGE_BLOCK().
      
          - Unused all_dsqs list removed. This was a left-over from previous
            iterations.
      
          - p->scx.kf_mask is added to track and enforce which kfunc helpers are
            allowed. Also, init/exit sequences are updated to make some kfuncs
            always safe to call regardless of the current BPF scheduler state.
            Combined, this should make all the kfuncs safe.
      
          - BPF now supports sleepable struct_ops operations. Hacky workaround
            removed and operations and kfunc helpers are tagged appropriately.
      
          - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
            and friends are added so that BPF schedulers can use the idle masks
            with the generic helpers. This replaces the hacky kfunc helpers added
            by a separate patch in V1.
      
          - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
            enabled. This restriction will be removed by a later patch which adds
            core-sched support.
      
          - Add MAINTAINERS entries and other misc changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Co-authored-by: default avatarDavid Vernet <dvernet@meta.com>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      Cc: Andrea Righi <andrea.righi@canonical.com>
      f0e1a064
    • Tejun Heo's avatar
      sched_ext: Add boilerplate for extensible scheduler class · a7a9fc54
      Tejun Heo authored
      This adds dummy implementations of sched_ext interfaces which interact with
      the scheduler core and hook them in the correct places. As they're all
      dummies, this doesn't cause any behavior changes. This is split out to help
      reviewing.
      
      v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
      Acked-by: default avatarJosh Don <joshdon@google.com>
      Acked-by: default avatarHao Luo <haoluo@google.com>
      Acked-by: default avatarBarret Rhoden <brho@google.com>
      a7a9fc54