13 Mar, 2023 3 commits
    • bpf: Disable migration when freeing stashed local kptr using obj drop · 9e36a204
      Dave Marchevsky authored
      When a local kptr is stashed in a map and freed when the map goes away,
      currently an error like the below appears:
      
      [   39.195695] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u32:15/2875
      [   39.196549] caller is bpf_mem_free+0x56/0xc0
      [   39.196958] CPU: 15 PID: 2875 Comm: kworker/u32:15 Tainted: G           O       6.2.0-13016-g22df776a #4477
      [   39.197897] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
      [   39.198949] Workqueue: events_unbound bpf_map_free_deferred
      [   39.199470] Call Trace:
      [   39.199703]  <TASK>
      [   39.199911]  dump_stack_lvl+0x60/0x70
      [   39.200267]  check_preemption_disabled+0xbf/0xe0
      [   39.200704]  bpf_mem_free+0x56/0xc0
      [   39.201032]  ? bpf_obj_new_impl+0xa0/0xa0
      [   39.201430]  bpf_obj_free_fields+0x1cd/0x200
      [   39.201838]  array_map_free+0xad/0x220
      [   39.202193]  ? finish_task_switch+0xe5/0x3c0
      [   39.202614]  bpf_map_free_deferred+0xea/0x210
      [   39.203006]  ? lockdep_hardirqs_on_prepare+0xe/0x220
      [   39.203460]  process_one_work+0x64f/0xbe0
      [   39.203822]  ? pwq_dec_nr_in_flight+0x110/0x110
      [   39.204264]  ? do_raw_spin_lock+0x107/0x1c0
      [   39.204662]  ? lockdep_hardirqs_on_prepare+0xe/0x220
      [   39.205107]  worker_thread+0x74/0x7a0
      [   39.205451]  ? process_one_work+0xbe0/0xbe0
      [   39.205818]  kthread+0x171/0x1a0
      [   39.206111]  ? kthread_complete_and_exit+0x20/0x20
      [   39.206552]  ret_from_fork+0x1f/0x30
      [   39.206886]  </TASK>
      
      This happens because the call to __bpf_obj_drop_impl I added in the
      patch adding support for stashing local kptrs doesn't disable
      migration. Prior to that patch, __bpf_obj_drop_impl logic only ran
      when called by a BPF program, whereas now it can also be called from
      the map free path, so it's necessary to explicitly disable migration.
      
      Also, refactor a bit to call __bpf_obj_drop_impl directly instead of
      going through the dtor union and setting a pointer-to-obj_drop; a
      simplified sketch of the result is below.
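
      A simplified sketch of the shape of the fix in bpf_obj_free_fields()
      (not the verbatim diff; the surrounding kernel context and variable
      declarations are assumed from the original patch):

        case BPF_KPTR_UNREF:
        case BPF_KPTR_REF:
            xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
            if (!xchgd_field)
                break;

            if (!btf_is_kernel(field->kptr.btf)) {
                /* Local kptr: find the struct meta so the drop can
                 * recursively free any special fields it contains. */
                pointee_struct_meta = btf_find_struct_meta(field->kptr.btf,
                                                           field->kptr.btf_id);
                /* bpf_mem_free() uses per-CPU caches, and the map free
                 * path runs from a preemptible workqueue, so migration
                 * must be disabled around the direct drop call. */
                migrate_disable();
                __bpf_obj_drop_impl(xchgd_field,
                                    pointee_struct_meta ?
                                    pointee_struct_meta->record : NULL);
                migrate_enable();
            } else {
                /* Kernel kptr: invoke its registered destructor. */
                field->kptr.dtor(xchgd_field);
            }
            break;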
      
      Fixes: c8e18754 ("bpf: Support __kptr to local kptrs")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230313214641.3731908-1-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tasks: Extract rcu_users out of union · 22df776a
      David Vernet authored
      In commit 3fbd7ee2 ("tasks: Add a count of task RCU users"), a
      count on the number of RCU users was added to struct task_struct. This
      was done so as to enable the removal of task_rcu_dereference(), and
      allow tasks to be protected by RCU even after exiting and being removed
      from the runqueue. In this commit, the 'refcount_t rcu_users' field that
      keeps track of this refcount was put into a union co-located with
      'struct rcu_head rcu', so as to avoid taking up any extra space in
      task_struct. This was possible to do safely, because the field was only
      ever decremented by a static set of specific callers, and then never
      incremented again.
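
      For reference, the pre-patch layout in include/linux/sched.h looked
      roughly like this (simplified excerpt):

        struct task_struct {
            /* ... */
            union {
                refcount_t rcu_users;   /* count of RCU users */
                struct rcu_head rcu;    /* delayed-free callback */
            };
            /* ... */
        };

      Once rcu_users drops to zero, call_rcu() reuses the same storage for
      the rcu_head, clobbering the count.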
      
      While this restriction of there only being a small, static set of users
      of this field has worked fine, it prevents us from leveraging the field
      to use RCU to protect tasks in other contexts.
      
      During tracing, for example, it would be useful to be able to collect
      some tasks that performed a certain operation, put them in a map, and
      then periodically summarize who they are, which cgroup they're in, how
      much CPU time they've utilized, etc. While this can currently be done
      with 'usage', it becomes tricky when a task is already in a map, or if a
      reference should only be taken if a task is valid and will not soon be
      reaped. Ideally, we could do something like pass a reference to a map
      value, and then try to acquire a reference to the task in an RCU read
      region by using refcount_inc_not_zero().
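
      A minimal sketch of that pattern, assuming the union has been split
      so that 'rcu_users' stays valid after exit (the helper name is
      illustrative, not an existing kernel API):

        /* Caller must be inside the RCU read-side critical section in
         * which @p was found, so the task_struct cannot be freed yet. */
        static bool task_tryget_rcu_user(struct task_struct *p)
        {
            /* Fails once rcu_users has already hit zero, i.e. the task
             * is queued for its delayed free; on success, release with
             * put_task_struct_rcu_user(). */
            return refcount_inc_not_zero(&p->rcu_users);
        }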
      
      Similarly, in sched_ext, schedulers are using integer pids to remember
      tasks, and then looking them up with find_task_by_pid_ns(). This is
      slow, error prone, and adds complexity. It would be more convenient and
      performant if BPF schedulers could instead store tasks directly in maps,
      and then leverage RCU to ensure they can be safely accessed with low
      overhead.
      
      Finally, overloading fields like this is error prone. Someone that wants
      to use 'rcu_users' could easily overlook the fact that once the rcu
      callback is scheduled, the refcount will go back to being nonzero, thus
      precluding the use of refcount_inc_not_zero(). Furthermore, as described
      below, it's possible to extract the fields of the union without changing
      the size of task_struct.
      
      There are several possible ways to enable this:
      
      1. The lightest touch approach is likely the one proposed in this patch,
         which is to simply extract 'rcu_users' and 'rcu' from the union, so
         that scheduling the 'rcu' callback doesn't overwrite the 'rcu_users'
         refcount. If we have a trusted task pointer, this would allow us to
          use refcount_inc_not_zero() inside of an RCU region to determine if we
         can safely acquire a reference to the task and store it in a map. As
         mentioned below, this can be done without changing the size of
         task_struct, by moving the location of the union to another location
         that has padding gaps we can fill in.
      
      2. Removing 'refcount_t rcu_users', and instead having the entire task
         be freed in an rcu callback. This is likely the most sound overall
         design, though it changes the behavioral semantics exposed to
         callers, who currently expect that a task that's successfully looked
         up in e.g. the pid_list with find_task_by_pid_ns(), can always have a
         'usage' reference acquired on them, as it's guaranteed to be >
         0 until after the next gp. In order for this approach to work, we'd
         have to audit all callers. This approach also slightly changes
         behavior observed by user space by not invoking
         trace_sched_process_free() until the whole task_struct is actually being
         freed, rather than just after it's exited. It also may change
         timings, as memory will be freed in an RCU callback rather than
         immediately when the final 'usage' refcount drops to 0. This also is
         arguably a benefit, as it provides more predictable performance to
         callers who are refcounting tasks.
      
      3. There may be other solutions as well that don't require changing the
         layout of task_struct. For example, we could possibly do something
         complex from the BPF side, such as listen for task exit and remove a
         task from a map when the task is exiting. This would likely require
         significant custom handling for task_struct in the verifier, so a
         more generalizable solution is likely warranted.
      
      As mentioned above, this patch proposes the lightest-touch approach
      which allows callers elsewhere in the kernel to use 'rcu_users' to
      ensure the lifetime of a task, by extracting 'rcu_users' and 'rcu' from
      the union. There is no size change in task_struct with this patch.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: David Vernet <void@manifault.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20230215233033.889644-1-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix precision propagation verbose logging · 34f0677e
      Andrii Nakryiko authored
      Fix wrong order of frame index vs register/slot index in precision
      propagation verbose (level 2) output; the swapped indices made the
      log very confusing as is.
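
      Illustratively, the log is meant to print the frame index first, so
      the fix amounts to swapping arguments like these (a sketch, not the
      verbatim diff):

        /* Buggy: register index printed where the frame index belongs. */
        verbose(env, "frame %d: propagating r%d\n", i, fr);

        /* Fixed: frame index first, then the register being marked. */
        verbose(env, "frame %d: propagating r%d\n", fr, i);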
      
      Fixes: 529409ea ("bpf: propagate precision across all frames, not just the last one")
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>