- 09 Sep, 2024 3 commits
-
-
Tejun Heo authored
The tricky p->scx.holding_cpu handling was split across the consume_remote_task() body and move_task_to_local_dsq(). Refactor such that: - All the tricky parts are now in the new unlink_dsq_and_lock_src_rq() with consolidated documentation. - move_task_to_local_dsq() now implements straightforward task migration, making it easier to use in other places. - dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The usage is updated accordingly. This makes the local and remote cases more symmetric. No functional changes intended. v2: s/task_rq/src_rq/ for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
-
Tejun Heo authored
Sleepables don't need to be in their own kfunc set as each is tagged with KF_SLEEPABLE. Rename it to scx_kfunc_set_unlocked, indicating that the rq lock is not held, and relocate it right above the any set. This will be used to add kfuncs that are allowed to be called from SYSCALL but not TRACING. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com>
-
Tejun Heo authored
scx_dump_data is only used inside ext.c but doesn't have static. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/
-
- 06 Sep, 2024 2 commits
-
-
Tejun Heo authored
scx_has_op[] is only used inside ext.c but doesn't have static. Add it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409062337.m7qqI88I-lkp@intel.com/
-
Tejun Heo authored
pick_task_scx() must be preceded by balance_scx() but there currently is a bug where fair could say yes on balance() but no on pick_task(), which then ends up calling pick_task_scx() without a preceding balance_scx(). Work around it by dropping the WARN_ON_ONCE() and ignoring cases which don't make sense. This isn't great and can theoretically lead to stalls. However, for switch_all cases, this happens only while a BPF scheduler is being loaded or unloaded, and, for partial cases, fair will likely keep triggering this CPU. This will be reverted once the fair behavior is fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>
-
- 04 Sep, 2024 14 commits
-
-
Tejun Heo authored
Pull bpf/master to receive baebe9aa ("bpf: allow passing struct bpf_iter_<type> as kfunc arguments") and related changes in preparation for the DSQ iterator patchset. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
This patch adds the scx_flatcg example scheduler, which implements hierarchical weight-based cgroup CPU control by flattening the cgroup hierarchy into a single layer by compounding the active weight share at each level. This flattening of the hierarchy can bring a substantial performance gain when the cgroup hierarchy is nested multiple levels deep. In a simple benchmark using wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it outperforms CFS by ~3% with the CPU controller disabled and by ~10% with two apache instances competing with a 2:1 weight ratio nested four levels deep. However, the gain comes at the cost of not being able to properly handle a thundering herd of cgroups. For example, if many cgroups which are nested behind a low priority parent cgroup wake up around the same time, they may be able to consume more CPU cycles than they are entitled to. In many use cases, this isn't a real concern especially given the performance gain. Also, there are ways to mitigate the problem further by e.g. introducing an extra scheduling layer on cgroup delegation boundaries. v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of SCX_OPS_KNOB_CGROUP_WEIGHT. v4: - Revert reference counted kptr for cgv_node as the change caused easily reproducible stalls. v3: - Updated to reflect the core API changes including ops.init/exit_task() and direct dispatch from ops.select_cpu(). Fixes and improvements including additional statistics. - Use reference counted kptr for cgv_node instead of xchg'ing against stash location. - Dropped '-p' option. v2: - Use SCX_BUG[_ON]() to simplify error handling. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
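To make the compounding concrete, here is a small, self-contained sketch. It is not scx_flatcg itself; the cgroup names, weights, and the 0-10000 scale are made up for illustration. The point is only that a leaf's flattened weight is the product of its share of the active weight at each level of nesting.

	#include <stdio.h>

	/* one level on the path from the root to a leaf cgroup */
	struct level {
		const char *name;
		double weight;			/* this cgroup's weight */
		double active_weight_sum;	/* total active weight among its siblings, itself included */
	};

	int main(void)
	{
		/* hypothetical nesting: root -> workload -> apache -> worker */
		struct level path[] = {
			{ "workload", 200.0, 300.0 },
			{ "apache",   100.0, 150.0 },
			{ "worker",    50.0,  50.0 },
		};
		double share = 1.0;

		/* compound the active weight share at each level */
		for (unsigned int i = 0; i < sizeof(path) / sizeof(path[0]); i++)
			share *= path[i].weight / path[i].active_weight_sum;

		/* the leaf then competes in a single flat layer with this one number */
		printf("flattened share: %.4f (~%.0f on a 0-10000 weight scale)\n",
		       share, share * 10000.0);
		return 0;
	}

scx_flatcg maintains this compounded value incrementally as cgroups become active and idle; the snippet only shows the arithmetic being compounded.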
-
Tejun Heo authored
Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. A BPF scheduler may not implement cgroup features or implement only a subset of them. The implemented features can be indicated using %SCX_OPS_HAS_CGROUP_* flags. If the cgroup configuration makes use of features that are not implemented, a warning is triggered. While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers. v7: - cgroup interface file visibility toggling is dropped in favor of just warning messages. Dynamically changing interface visibility caused more confusion than it helped. v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE. - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED. v5: - Flipped the locking order between scx_cgroup_rwsem and cpus_read_lock() to avoid a locking order conflict w/ cpuset. Better documentation around locking. - sched_move_task() takes an early exit if the source and destination are identical. This triggered the warning in scx_cgroup_can_attach() as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations so that its invocations always match ops.cgroup_move() one-to-one. v4: - Example schedulers moved into their own patches. - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi. v3: - Make scx_example_pair switch all tasks by default. - Convert to BPF inline iterators. - scx_bpf_task_cgroup() is added to determine the current cgroup from the CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership. - scx_example_flatcg added. This demonstrates a flattened hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups which are nested multiple levels are under competition. v2: - Build fixes for different CONFIG combinations. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrea Righi <andrea.righi@canonical.com>
-
Tejun Heo authored
sched_ext will soon add cgroup cpu.weight support. The cgroup interface code is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or SCX may implement the feature, put the interface code behind the new CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED. This allows either sched class to enable the interface code without adding more complex CONFIG tests. When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares() is added to support later CONFIG_CGROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED builds. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
Move tg_weight() upward and make cpu_shares_read_u64() use it too. This makes the weight retrieval shared between cgroup v1 and v2 paths and will be used to implement cgroup support for sched_ext. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
A new BPF extensible sched_class will use css_tg() in the init and exit paths to visit all task_groups by walking cgroups. v4: __setscheduler_prio() is already exposed. Dropped from this patch. v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup mechanism. v2: Expose SCHED_CHANGE_BLOCK() too and update the description. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com>
-
Tejun Heo authored
During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task() on every task. To do this, it does get_task_struct() on each iterated task, drop the lock and then call ops.init_task(). However, a TASK_DEAD task may already have lost all its usage count and be waiting for RCU grace period to be freed. If get_task_struct() is called on such task, use-after-free can happen. To avoid such situations, scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe as they are never going to be scheduled again. Unfortunately, a racing sched_setscheduler(2) can grab the task before the task is unhashed and then continue to e.g. move the task from RT to SCX after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't gone through scx_ops_init_task(), scx_ops_enable_task() called from switching_to_scx() triggers the following warning: sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872] WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0 ... RIP: 0010:scx_ops_enable_task+0x18f/0x1f0 ... switching_to_scx+0x13/0xa0 __sched_setscheduler+0x84e/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e As in the ops_disable path, it just doesn't seem like a good idea to leave any task in an inconsistent state, even when the task is dead. The root cause is ops_enable not being able to tell reliably whether a task is truly dead (no one else is looking at it and it's about to be freed) and was testing TASK_DEAD instead. Fix it by testing the task's usage count directly. - ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all tasks, @include_dead is removed from scx_task_iter_next_locked() along with dead task filtering. - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct() fails. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org>
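The refcount test at the heart of the fix can be sketched in user-space C. This is only an analogue of a refcount_inc_not_zero()-style tryget helper with invented names, not the kernel change itself: a reference is taken only if the usage count is still non-zero, and the object is skipped otherwise.

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct obj { atomic_int usage; };

	/* take a reference only if the object hasn't already dropped to zero */
	static bool obj_tryget(struct obj *o)
	{
		int old = atomic_load(&o->usage);

		while (old != 0) {
			/* on failure, 'old' is reloaded and the zero check reruns */
			if (atomic_compare_exchange_weak(&o->usage, &old, old + 1))
				return true;
		}
		return false;	/* already dying; caller must skip it */
	}

	int main(void)
	{
		struct obj live = { 1 };	/* still referenced */
		struct obj dying = { 0 };	/* all references dropped, awaiting free */

		printf("live:  %s\n", obj_tryget(&live) ? "got ref" : "skipped");
		printf("dying: %s\n", obj_tryget(&dying) ? "got ref" : "skipped");
		return 0;
	}

The key property is that the decision is based on the count itself rather than on a state bit like TASK_DEAD, so a racing observer can never resurrect an object whose count has already reached zero.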
-
Tejun Heo authored
scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while calling scx_ops_exit_task() on all tasks including dead ones. This can leave a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent. If another task was in the process of changing the TASK_DEAD task's scheduling class and grabs the rq lock after scx_ops_disable_workfn() is done with the task, the task ends up calling scx_ops_disable_task() on the dead task which is in an inconsistent state triggering a warning: WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160 ... RIP: 0010:scx_ops_disable_task+0x12c/0x160 ... Call Trace: <TASK> check_class_changed+0x2c/0x70 __sched_setscheduler+0x8a0/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f140d70ea5b There is no reason to leave dead tasks on SCX when unloading the BPF scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including the dead ones from SCX. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
With sched_ext converted to use put_prev_task() for class switch detection, there's no user of switch_class() left. Drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>
-
Tejun Heo authored
Now that put_prev_task_scx() is called with @next on task switches, there's no reason to use sched_class.switch_class(). Rename switch_class_scx() to switch_class() and call it from put_prev_task_scx(). Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
Relocate functions to ease the removal of switch_class_scx(). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
Because the BPF scheduler's dispatch path is invoked from balance(), sched_ext needs to invoke balance_one() on all sibling rq's before picking the next task for core-sched. Before the recent pick_next_task() updates, sched_ext couldn't share pick task between regular and core-sched paths because pick_next_task() depended on put_prev_task() being called on the current task. Tasks currently running on sibling rq's can't be put when one rq is trying to pick the next task, so pick_task_scx() had to have a separate mechanism to pick between a sibling rq's current task and the first task in its local DSQ. However, with the preceding updates, pick_next_task_scx() no longer depends on the current task being put and can compare the current task and the next in line statelessly, and the pick task logic should be shareable between regular and core-sched paths. Unify regular and core-sched pick task paths: - There's no reason to distinguish local and sibling picks anymore. @local is removed from balance_one(). - pick_next_task_scx() is turned into pick_task_scx() by dropping the put_prev_set_next_task() call. - The old pick_task_scx() is dropped. Signed-off-by: Tejun Heo <tj@kernel.org>
-
Tejun Heo authored
SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to keep running the current task. It's not really a task property. Replace it with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for the usage. Also, the existing clearing rule is unnecessarily strict and makes it difficult to use with core-sched. Just clear it on entry to balance_one(). Signed-off-by: Tejun Heo <tj@kernel.org>
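A stand-in sketch of the flag's new life cycle (simplified user-space types, not the sched_ext code; field and function names are invented): the keep flag lives on the runqueue, is cleared on entry to the balance step, and the pick step merely honors it.

	#include <stdbool.h>
	#include <stdio.h>

	struct task { bool queued; unsigned long long slice_left; };
	struct rq { struct task *curr; bool bal_keep; };	/* stand-in for SCX_RQ_BAL_KEEP */

	static void balance_one(struct rq *rq)
	{
		rq->bal_keep = false;	/* cleared unconditionally on entry */

		/* keep the current task iff it is still runnable and has slice left */
		if (rq->curr && rq->curr->queued && rq->curr->slice_left > 0)
			rq->bal_keep = true;
	}

	static struct task *pick_next(struct rq *rq, struct task *local_dsq_head)
	{
		/* the pick path honors the flag instead of re-deriving the decision */
		return rq->bal_keep ? rq->curr : local_dsq_head;
	}

	int main(void)
	{
		struct task curr = { .queued = true, .slice_left = 5000 };
		struct task other = { .queued = true, .slice_left = 20000 };
		struct rq rq = { .curr = &curr };

		balance_one(&rq);
		printf("picked: %s\n", pick_next(&rq, &other) == &curr ? "current" : "local DSQ head");

		curr.slice_left = 0;	/* slice exhausted: pick from the local DSQ instead */
		balance_one(&rq);
		printf("picked: %s\n", pick_next(&rq, &other) == &curr ? "current" : "local DSQ head");
		return 0;
	}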
-
Tejun Heo authored
fd03c5b8 ("sched: Rework pick_next_task()") changed the definition of pick_next_task() from: pick_next_task() := pick_task() + set_next_task(.first = true) to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) making invoking put_prev_task() pick_next_task()'s responsibility. This reordering allows pick_task() to be shared between regular and core-sched paths and put_prev_task() to know the next task. sched_ext depended on put_prev_task_scx() enqueueing the current task before pick_next_task_scx() is called. While pulling sched/core changes, 70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx() before picking the first task as a workaround. Clean it up and adopt the conventions that other sched classes are following. The operation of keeping running the current task was spread and required the task to be put on the local DSQ before picking: - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still runnable, hasn't exhausted its slice, and thus should keep running. - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the only runnable task. do_enqueue_task() in turn decided whether to use the local DSQ depending on SCX_OPS_ENQ_LAST. Consolidate the logic in balance_one() as it always knows whether it is going to keep the current task. balance_one() now considers all conditions where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell pick_next_task_scx() to keep the current task instead of picking one from the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is updated to pick the current task if SCX_TASK_BAL_KEEP is set. The workaround put_prev_task[_scx]() calls are replaced with put_prev_set_next_task(). This causes two behavior changes observable from the BPF scheduler: - When a task keep running, it no longer goes through enqueue/dequeue cycle and thus ops.stopping/running() transitions. The new behavior is better and all the existing schedulers should be able to handle the new behavior. - The BPF scheduler cannot keep executing the current task by enqueueing SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the BPF scheduler is responsible for resuming execution after each SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling decisions are not made on the local CPU - e.g. central or userspace-driven schedulin - and the new behavior is more logical and shouldn't pose any problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it doesn't fit that well anymore and the last task handling is moved to the end of qmap_dispatch(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
-
- 03 Sep, 2024 11 commits
-
-
Tejun Heo authored
- Resolve trivial context conflicts from dl_server clearing being moved around. - Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to match sched/core. - Merge sched_class->switch_class() addition from sched_ext with tip/sched/core changes in __pick_next_task(). - Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous behavior where sched_class->put_prev_task() was called before sched_class->pick_next_task(). While this makes sched_ext build and function, the behavior is not in line with other sched classes. The follow-up patches will address the discrepancies and remove sched_class->switch_class(). Signed-off-by: Tejun Heo <tj@kernel.org>
-
Peter Zijlstra authored
In order to tell the previous sched_class what the next task is, add put_prev_task(.next). Notably, SCX will use this to: 1) determine whether the next task will leave the SCX sched class and push the current task to another CPU if possible; 2) gather statistics on how often and by which other classes it is preempted. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org
-
Peter Zijlstra authored
When a task is selected through a dl_server, it will have p->dl_server set, such that it can account runtime to the dl_server, see update_curr_task(). Currently p->dl_server is set in pick*task() whenever it goes through the dl_server; clearing it is a bit of a mess though. The trivial solution is clearing it on the final put (now that we have this location). However, this gives a problem when: p = pick_task(rq); if (p) put_prev_set_next_task(rq, prev, next); picks the same task but through a different path, notably when it goes from picking through the dl_server to a direct pick or vice-versa. In that case we cannot readily determine whether we should clear or preserve p->dl_server. An additional complication is pick_*task() setting p->dl_server for a remote pick; it might still need to update runtime before it schedules the core_pick. Close all these holes and remove all the random clearing of p->dl_server by: - having pick_*task() manage rq->dl_server - having the final put_prev_task() clear p->dl_server - having the first set_next_task() set p->dl_server = rq->dl_server - complicate the core_sched code to save/restore rq->dl_server where appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
-
Peter Zijlstra authored
Ensure the last put_prev_task() and the first set_next_task() always go together. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org
-
Peter Zijlstra authored
The current rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) And many classes implement it directly as such. Change things around to make pick_next_task() optional while also changing the definition to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) The reason is that sched_ext would like to have a 'final' call that knows the next task. By placing put_prev_task() right next to set_next_task() (as it already is for sched_core) this becomes trivial. As a bonus, this is a nice cleanup on its own. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
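The new composition can be written down as a minimal, self-contained sketch. These are simplified stand-in types and toy hooks, not the kernel/sched/core.c code: the point is that put_prev_task() now sits between pick_task() and set_next_task(.first = true), so the previous task's class learns which task comes next.

	#include <stdbool.h>
	#include <stdio.h>

	/* simplified stand-ins, not the kernel's types */
	struct task_struct { const char *comm; };
	struct rq { struct task_struct *curr; };

	struct sched_class {
		struct task_struct *(*pick_task)(struct rq *rq);
		void (*put_prev_task)(struct rq *rq, struct task_struct *prev,
				      struct task_struct *next);
		void (*set_next_task)(struct rq *rq, struct task_struct *next, bool first);
	};

	/* pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) */
	static struct task_struct *
	pick_next_task_composed(const struct sched_class *class, struct rq *rq,
				struct task_struct *prev)
	{
		struct task_struct *next = class->pick_task(rq);

		if (next) {
			class->put_prev_task(rq, prev, next);	/* prev's class now sees @next */
			class->set_next_task(rq, next, true);
		}
		return next;
	}

	/* toy hooks that only print the call order */
	static struct task_struct waiting = { "waiting-task" };
	static struct task_struct *toy_pick_task(struct rq *rq) { (void)rq; return &waiting; }
	static void toy_put_prev(struct rq *rq, struct task_struct *prev, struct task_struct *next)
	{ (void)rq; printf("put_prev(%s), next will be %s\n", prev->comm, next->comm); }
	static void toy_set_next(struct rq *rq, struct task_struct *next, bool first)
	{ rq->curr = next; printf("set_next(%s, first=%d)\n", next->comm, first); }

	int main(void)
	{
		struct sched_class toy = { toy_pick_task, toy_put_prev, toy_set_next };
		struct task_struct prev_task = { "prev-task" };
		struct rq rq = { &prev_task };

		pick_next_task_composed(&toy, &rq, &prev_task);
		return 0;
	}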
-
Peter Zijlstra authored
With the goal of pushing put_prev_task() after pick_task() / into pick_next_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.943143811@infradead.org
-
Peter Zijlstra authored
Abide by the simple rule: pick_next_task() := pick_task() + set_next_task(.first = true) This allows us to trivially get rid of server_pick_next() and things collapse nicely. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.837303391@infradead.org
-
Peter Zijlstra authored
The rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) Turns out, there's still a few things in pick_next_task() that are missing from that combination. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.724111109@infradead.org
-
Peter Zijlstra authored
Turns out the core_sched bits forgot to use the set_next_task(.first=true) variant. Notably: pick_next_task() := pick_task() + set_next_task(.first = true) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240813224015.614146342@infradead.org
-
Valentin Schneider authored
__sched_setscheduler() goes through an enqueue/dequeue cycle like so: flags := DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; prev_class->dequeue_task(rq, p, flags); new_class->enqueue_task(rq, p, flags); when prev_class := fair_sched_class, this is followed by: dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP); the idea being that since the task has switched classes, we need to drop the sched_delayed logic and have that task be deactivated per its previous dequeue_task(..., DEQUEUE_SLEEP). Unfortunately, this leaves the task on_rq. This is missing the tail end of dequeue_entities() that issues __block_task(), which __sched_setscheduler() won't have done due to not using DEQUEUE_DELAYED - not that it should, as it is pretty much a fair_sched_class specific thing. Make switched_from_fair() properly deactivate sched_delayed tasks upon class changes via __block_task(), as if a dequeue_task(..., DEQUEUE_DELAYED) had been issued. Fixes: 2e0199df ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Reported-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240829135353.1524260-1-vschneid@redhat.com
-
Huang Shijie authored
In dl_server_start(), when schedstats is enabled, the following happens: dl_server_start() dl_se->dl_server = 1; enqueue_dl_entity() update_stats_enqueue_dl() __schedstats_from_dl_se() dl_task_of() BUG_ON(dl_server(dl_se)); Since only tasks have schedstats and internal entries do not, avoid trying to update stats in this case. Fixes: 63ba8422 ("sched/deadline: Introduce deadline servers") Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20240829031111.12142-1-shijie@os.amperecomputing.com
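A minimal user-space sketch of the guard this implies (stand-in types with invented names; the actual fix lives in the deadline schedstats helpers): a dl_server entity isn't backed by a task, so the stats path has to bail out instead of reaching for a task that isn't there.

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdio.h>

	/* stand-in types; only the fields needed for the illustration */
	struct sched_statistics { unsigned long long wait_start; };
	struct task { struct sched_statistics stats; };

	struct dl_entity {
		bool dl_server;		/* true for a server entity, false for a task's entity */
		struct task *task;	/* NULL for server entities */
	};

	/* instead of tripping BUG_ON(dl_server(dl_se)), server entities simply have no stats */
	static struct sched_statistics *stats_of(struct dl_entity *dl_se)
	{
		if (dl_se->dl_server)
			return NULL;
		return &dl_se->task->stats;
	}

	int main(void)
	{
		struct task t = { { 0 } };
		struct dl_entity task_se = { .dl_server = false, .task = &t };
		struct dl_entity server_se = { .dl_server = true, .task = NULL };

		printf("task entity has stats:   %s\n", stats_of(&task_se) ? "yes" : "no");
		printf("server entity has stats: %s\n", stats_of(&server_se) ? "yes" : "no");
		return 0;
	}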
-
- 31 Aug, 2024 2 commits
-
-
Tejun Heo authored
Since 3cf78c5d ("sched_ext: Unpin and repin rq lock from balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost function. This is simpler and in line with what other balance functions are doing but it loses control over rq->clock_update_flags which makes assert_clock_updated() trigger if other CPUs pin the rq lock. The only place this matters is touch_core_sched() which uses the timestamp to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may be better to use a per-core dispatch sequence number. v2: Use sched_clock_cpu() instead of ktime_get_ns() per David. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 3cf78c5d ("sched_ext: Unpin and repin rq lock from balance_scx()") Acked-by: David Vernet <void@manifault.com> Cc: Peter Zijlstra <peterz@infradead.org>
-
Tejun Heo authored
When deciding whether a task can be migrated to a CPU, dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online() tests instead of using task_can_run_on_remote_rq(). This had two problems. - It was missing the is_migration_disabled() check and thus could try to migrate a task that shouldn't be migrated, leading to assertion and scheduling failures. - It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu() and thus failed to consider ISA compatibility. Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq(): - Move scx_ops_error() triggering into task_can_run_on_remote_rq(). - When migration isn't allowed, fall back to the global DSQ instead of the source DSQ by returning DTL_INVALID. This is both simpler and an overall better behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: David Vernet <void@manifault.com>
-
- 30 Aug, 2024 8 commits
-
-
Ihor Solodrai authored
%.bpf.o objects depend on vmlinux.h, which makes them transitively dependent on unnecessary libbpf headers. However, vmlinux.h doesn't actually change that often. When generating vmlinux.h, compare it to a previous version and update it only if there are changes. Example of build time improvement (after first clean build): $ touch ../../../lib/bpf/bpf.h $ time make -j8 Before: real 1m37.592s After: real 0m27.310s Notice that the %.bpf.o gen step is skipped if vmlinux.h hasn't changed. Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/CAEf4BzY1z5cC7BKye8=A8aTVxpsCzD=p1jdTfKC7i0XVuYoHUQ@mail.gmail.com Link: https://lore.kernel.org/bpf/20240828174608.377204-2-ihor.solodrai@pm.me
-
Ihor Solodrai authored
Test %.bpf.o objects actually depend only on some libbpf headers. Define a list of required headers and use it as the TRUNNER_BPF_OBJS dependency. The bpf_*.h list was determined by: $ grep -rh 'include <bpf/bpf_' progs | sort -u Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240828174608.377204-1-ihor.solodrai@pm.me Link: https://lore.kernel.org/bpf/CAEf4BzYQ-j2i_xjs94Nn=8+FVfkWt51mLZyiYKiz9oA4Z=pCeA@mail.gmail.com/
-
Eduard Zingerman authored
Create a BTF with endianness different from host, make a distilled base/split BTF pair from it, dump as raw bytes, import again and verify that endianness is preserved. Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240830173406.1581007-1-eddyz87@gmail.com
-
Tony Ambardar authored
New split BTF needs to preserve base's endianness. Similarly, when creating a distilled BTF, we need to preserve original endianness. Fix by updating libbpf's btf__distill_base() and btf_new_empty() to retain the byte order of any source BTF objects when creating new ones. Fixes: ba451366 ("libbpf: Implement basic split BTF support") Fixes: 58e185a0 ("libbpf: Add btf__distill_base() creating split BTF with distilled base BTF") Reported-by: Song Liu <song@kernel.org> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/6358db36c5f68b07873a0a5be2d062b1af5ea5f8.camel@gmail.com/ Link: https://lore.kernel.org/bpf/20240830095150.278881-1-tony.ambardar@gmail.com
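A hedged usage sketch against the libbpf BTF API (the big-endian choice and the empty BTF objects are just for illustration; the expected behavior follows the commit's description): after the fix, a split BTF created on top of a non-host-endian base reports the base's byte order instead of defaulting to the host's.

	#include <stdio.h>
	#include <bpf/btf.h>	/* build with: cc demo.c -lbpf */

	int main(void)
	{
		struct btf *base, *split;
		int err = 1;

		base = btf__new_empty();
		if (!base)
			return 1;

		/* pretend the source BTF came from a big-endian target */
		btf__set_endianness(base, BTF_BIG_ENDIAN);

		split = btf__new_empty_split(base);
		if (split) {
			/* with the fix, the new split BTF inherits the base's byte order */
			printf("base: %s, split: %s\n",
			       btf__endianness(base) == BTF_BIG_ENDIAN ? "BE" : "LE",
			       btf__endianness(split) == BTF_BIG_ENDIAN ? "BE" : "LE");
			btf__free(split);
			err = 0;
		}

		btf__free(base);
		return err;
	}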
-
Jinjie Ruan authored
Replace fput() with sockfd_put() in bpf_fd_reuseport_array_update_elem(). Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20240830020756.607877-1-ruanjinjie@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexey Gladkov authored
According to the documentation, when building a kernel with the C=2 parameter, all source files should be checked. But this does not happen for the kernel/bpf/ directory. $ touch kernel/bpf/core.o $ make C=2 CHECK=true kernel/bpf/core.o Outputs: CHECK scripts/mod/empty.c CALL scripts/checksyscalls.sh DESCEND objtool INSTALL libsubcmd_headers CC kernel/bpf/core.o As can be seen, the compilation is done, but CHECK is not executed. This happens because kernel/bpf/Makefile has defined its own rule for compilation and forgotten the macro that does the check. There is no need to duplicate the build code, and this rule can be removed to use the generic rules. Acked-by: Masahiro Yamada <masahiroy@kernel.org> Tested-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lore.kernel.org/r/20240830074350.211308-1-legion@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Juntong Deng authored
This patch adds test cases for the iter next method returning a valid pointer, which can also be used as usage examples. Currently the iter next method should return a valid pointer. iter_next_trusted is the correct usage and tests whether the iter next method returns a valid pointer. bpf_iter_task_vma_next has the KF_RET_NULL flag, so the returned pointer may be NULL. We need to check if the pointer is NULL before using it. iter_next_trusted_or_null is the incorrect usage. There is no check before using the pointer, so it will be rejected by the verifier. iter_next_rcu and iter_next_rcu_or_null are similar test cases for KF_RCU_PROTECTED iterators. iter_next_rcu_not_trusted is used to test that the pointer returned by the iter next method of a KF_RCU_PROTECTED iterator cannot be passed to KF_TRUSTED_ARGS kfuncs. iter_next_ptr_mem_not_trusted is used to test that the base type PTR_TO_MEM should not be combined with the type flag PTR_TRUSTED. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Link: https://lore.kernel.org/r/AM6PR03MB5848709758F6922F02AF9F1F99962@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Juntong Deng authored
Currently we cannot pass the pointer returned by the iter next method as an argument to KF_TRUSTED_ARGS or KF_RCU kfuncs, because the pointer returned by the iter next method is not "valid". This patch sets the pointer returned by the iter next method to be valid. This is based on the fact that if the iterator is implemented correctly, then the pointer returned from the iter next method should be valid. This does not make a NULL pointer valid. If the iter next method has the KF_RET_NULL flag, then the verifier will ask the eBPF program to check the pointer for NULL. A KF_RCU_PROTECTED iterator is a special case: the pointer returned by the iter next method should only be valid within an RCU critical section, so it should be marked MEM_RCU, not PTR_TRUSTED. Another special case is bpf_iter_num_next, which returns a pointer with base type PTR_TO_MEM. PTR_TO_MEM should not be combined with the type flag PTR_TRUSTED (PTR_TO_MEM already means the pointer is valid). The pointer returned by the iter next method of other types of iterators is marked PTR_TRUSTED. In addition, this patch adds get_iter_from_state to help us get the current iterator from the current state. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Link: https://lore.kernel.org/r/AM6PR03MB584869F8B448EA1C87B7CDA399962@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-