1. 01 Apr, 2023 9 commits
    • Merge branch 'Enable RCU semantics for task kptrs' · a033907e
      Alexei Starovoitov authored
      David Vernet says:
      
      ====================
      
      In commit 22df776a ("tasks: Extract rcu_users out of union"), the
      'refcount_t rcu_users' field was extracted out of a union with the
      'struct rcu_head rcu' field. This allows us to use the field for
      refcounting struct task_struct with RCU protection, as the RCU callback
      no longer flips rcu_users to be nonzero after the callback is scheduled.
      
      This patch set leverages this to do a few things:
      
      1. Marks struct task_struct as RCU safe in the verifier, allowing
         referenced kptr tasks stored in maps to be accessed in an RCU
         read region without acquiring a reference (with just a NULL check).
      2. Makes bpf_task_acquire() a KF_ACQUIRE | KF_RCU | KF_RET_NULL kfunc.
      3. Removes bpf_task_kptr_get() and bpf_task_acquire_not_zero(), as
         they're now redundant with the above two changes.
      4. Updates selftests and documentation accordingly.
      ---
      Changelog:
      v1: https://lore.kernel.org/all/20230331005733.406202-1-void@manifault.com/
      v1 -> v2:
      - Remove testcases validating nested trust inheritance. The first
        version used 'struct task_struct __rcu *parent', but because that
        field has the __rcu tag it functions differently on gcc and llvm and
        causes gcc selftests to fail. Alexei is reworking nested trust
        anyway, so let's leave it off for now (Alexei).
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf,docs: Update documentation to reflect new task kfuncs · db9d479a
      David Vernet authored
      Now that struct task_struct objects are RCU safe, and bpf_task_acquire()
      can return NULL, we should update the BPF task kfunc documentation to
      reflect the current state of the API.
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230331195733.699708-4-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Remove now-defunct task kfuncs · f85671c6
      David Vernet authored
      In commit 22df776a ("tasks: Extract rcu_users out of union"), the
      'refcount_t rcu_users' field was extracted out of a union with the
      'struct rcu_head rcu' field. This allows us to safely perform a
      refcount_inc_not_zero() on task->rcu_users when acquiring a reference on
      a task struct. A prior patch leveraged this by making struct task_struct
      an RCU-protected object in the verifier, and by changing bpf_task_acquire() to
      use the task->rcu_users field for synchronization.
      
      Now that we can use RCU to protect tasks, we no longer need
      bpf_task_kptr_get(), or bpf_task_acquire_not_zero(). bpf_task_kptr_get()
      is now completely unnecessary, as we can just use RCU to get the
      object. bpf_task_acquire_not_zero() is now equivalent to
      bpf_task_acquire().
      
      In addition to these changes, this patch also updates the associated
      selftests to no longer use these kfuncs.
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230331195733.699708-3-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Make struct task_struct an RCU-safe type · d02c48fa
      David Vernet authored
      struct task_struct objects are a bit interesting in terms of how their
      lifetime is protected by refcounts. task structs have two refcount
      fields:
      
      1. refcount_t usage: Protects the memory backing the task struct. When
         this refcount drops to 0, the task is immediately freed, without
         waiting for an RCU grace period to elapse. This is the field that
         most callers in the kernel currently use to ensure that a task
         remains valid while it's being referenced, and is what's currently
         tracked with bpf_task_acquire() and bpf_task_release().
      
      2. refcount_t rcu_users: A refcount field which, when it drops to 0,
         schedules an RCU callback that drops a reference held on the 'usage'
         field above (which is acquired when the task is first created). This
         field therefore provides a form of RCU protection on the task by
         ensuring that at least one 'usage' refcount will be held until an RCU
         grace period has elapsed. The qualifier "a form of" is important
         here, as a task can remain valid after task->rcu_users has dropped to
         0 and the subsequent RCU gp has elapsed.
      
      In terms of BPF, we want to use task->rcu_users to protect tasks that
      function as referenced kptrs, and to allow tasks stored as referenced
      kptrs in maps to be accessed with RCU protection.
      
      Let's first determine whether we can safely use task->rcu_users to
      protect tasks stored in maps. All of the bpf_task* kfuncs can only be
      called from tracepoint, struct_ops, or BPF_PROG_TYPE_SCHED_CLS program
      types. For tracepoint and struct_ops programs, the struct task_struct
      passed to a program handler will always be trusted, so it will always be
      safe to call bpf_task_acquire() with any task passed to a program.
      Note, however, that we must update bpf_task_acquire() to be KF_RET_NULL,
      as it is possible that the task has exited by the time the program is
      invoked, even if the pointer is still currently valid because the main
      kernel holds a task->usage refcount. For BPF_PROG_TYPE_SCHED_CLS, tasks
      should never be passed as an argument to any program handler, so it
      should not be relevant.
      
      The second question is whether it's safe to use RCU to access a task
      that was acquired with bpf_task_acquire(), and stored in a map. Because
      bpf_task_acquire() now uses task->rcu_users, it follows that if the task
      is present in the map, that it must have had at least one
      task->rcu_users refcount by the time the current RCU cs was started.
      Therefore, it's safe to access that task until the end of the current
      RCU cs.
      
      With all that said, this patch makes struct task_struct an
      RCU-protected object. In doing so, we also change bpf_task_acquire() to
      be KF_ACQUIRE | KF_RCU | KF_RET_NULL, and adjust any selftests as
      necessary. A subsequent patch will remove bpf_task_kptr_get(), and
      bpf_task_acquire_not_zero() respectively.
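
The way bpf_task_acquire() can return NULL once task->rcu_users has dropped to zero can be sketched in userspace. This is a minimal model, not kernel code: struct task_model, inc_not_zero(), task_acquire() and task_release() are hypothetical names, and C11 atomics stand in for the kernel's refcount_t and refcount_inc_not_zero().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct: only one counter is modeled. */
struct task_model {
	atomic_int rcu_users; /* models task->rcu_users */
};

/* Increment *cnt unless it is already zero; returns true on success. */
static bool inc_not_zero(atomic_int *cnt)
{
	int old = atomic_load(cnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(cnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Models the KF_RET_NULL behaviour: NULL once rcu_users has hit zero. */
static struct task_model *task_acquire(struct task_model *t)
{
	return inc_not_zero(&t->rcu_users) ? t : NULL;
}

/* Drops one reference; returns true when this was the last one. */
static bool task_release(struct task_model *t)
{
	int old = atomic_fetch_sub(&t->rcu_users, 1);

	assert(old > 0); /* releasing without holding a reference is a bug */
	return old == 1;
}
```

In this model a caller treats a NULL return like any other failed acquire: once the count has reached zero, task_acquire() keeps returning NULL and never revives the object.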
      Signed-off-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230331195733.699708-2-void@manifault.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'Prepare veristat for packaging' · 85850058
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      
      This patch set relicenses veristat.c under a dual GPL-2.0/BSD-2 license and
      prepares it to be mirrored to GitHub at the libbpf/veristat repo.
      
      A few small issues in the source code, found during GitHub sync
      preparation, are also fixed.
      
      v2->v3:
        - fix few warnings about uninitialized variable uses;
      v1->v2:
        - drop linux/compiler.h and define own ARRAY_SIZE macro;
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: small fixes found in -O2 mode · ebf390c9
      Andrii Nakryiko authored
      Fix a few potentially uninitialized variable uses, found while building
      veristat.c in release (-O2) mode.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230331222405.3468634-5-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: avoid using kernel-internal headers · e3b65c0c
      Andrii Nakryiko authored
      Drop linux/compiler.h include, which seems to be needed for ARRAY_SIZE
      macro only. Redefine own version of ARRAY_SIZE instead.
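
A self-defined ARRAY_SIZE can be as small as the sketch below. This is the common plain-C form, shown as an assumption about what such a replacement might look like rather than the exact definition added to veristat.c; stat_names and stat_count() are hypothetical and only exercise the macro.

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a true array; gives wrong results if passed a pointer. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical table, present only to exercise the macro. */
static const char *const stat_names[] = { "insns", "states", "duration" };

/* The result is a compile-time constant, so it can be checked statically. */
static_assert(ARRAY_SIZE(stat_names) == 3, "stat_names has three entries");

static size_t stat_count(void)
{
	return ARRAY_SIZE(stat_names);
}
```

Unlike the kernel's version, this form has no built-in guard against being applied to a pointer, so it is only safe on real arrays.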
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230331222405.3468634-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: improve version reporting · 71c8c39f
      Andrii Nakryiko authored
      For packaging, the version of the tool is important, so add a simple way to
      specify the veristat version for the upstream mirror on GitHub.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230331222405.3468634-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: relicense veristat.c as dual GPL-2.0-only or BSD-2-Clause licensed · 3ed85ae8
      Andrii Nakryiko authored
      Dual-license veristat.c under GPL-2.0-only or BSD-2-Clause.
      This is needed to mirror it to GitHub to make it convenient for distro
      packagers to package veristat as a separate package.
      
      Veristat grew into a useful tool by itself, and there are already
      a bunch of users relying on veristat as a generic BPF loading and
      verification helper tool. So making it easy for packagers by providing
      a GitHub mirror, just like we do for bpftool and libbpf, is the next step
      to get veristat into the hands of users.
      
      Apart from a few typo fixes, I'm the sole contributor to veristat.c so
      far, so no extra Acks should be needed for relicensing.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230331222405.3468634-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 31 Mar, 2023 5 commits
  3. 30 Mar, 2023 8 commits
    • selftests/bpf: Add testcases for ptr_*_or_null_ in bpf_kptr_xchg · 67efbd57
      David Vernet authored
      The second argument of the bpf_kptr_xchg() helper function is
      ARG_PTR_TO_BTF_ID_OR_NULL. A recent patch fixed a bug whereby the
      verifier would fail with an internal error message if a program invoked
      the helper with a PTR_TO_BTF_ID | PTR_MAYBE_NULL register. This patch
      adds some testcases to ensure that it fails gracefully moving forward.
      
      Before the fix, these testcases would have failed with an error resembling
      the following:
      
      ; p = bpf_kfunc_call_test_acquire(&(unsigned long){0});
      99: (7b) *(u64 *)(r10 -16) = r7       ; frame1: ...
      100: (bf) r1 = r10                    ; frame1: ...
      101: (07) r1 += -16                   ; frame1: ...
      ; p = bpf_kfunc_call_test_acquire(&(unsigned long){0});
      102: (85) call bpf_kfunc_call_test_acquire#13908
      ; frame1: R0_w=ptr_or_null_prog_test_ref_kfunc...
      ; p = bpf_kptr_xchg(&v->ref_ptr, p);
      103: (bf) r1 = r6                     ; frame1: ...
      104: (bf) r2 = r0
      ; frame1: R0_w=ptr_or_null_prog_test_ref_kfunc...
      105: (85) call bpf_kptr_xchg#194
      verifier internal error: invalid PTR_TO_BTF_ID register for type match
      Signed-off-by: David Vernet <void@manifault.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230330145203.80506-2-void@manifault.com
    • bpf: Handle PTR_MAYBE_NULL case in PTR_TO_BTF_ID helper call arg · e4c2acab
      David Vernet authored
      When validating a helper function argument, we use check_reg_type() to
      ensure that the register containing the argument is of the correct type.
      When the register's base type is PTR_TO_BTF_ID, there is some
      supplemental logic where we do extra checks for various combinations of
      PTR_TO_BTF_ID type modifiers. For example, for PTR_TO_BTF_ID,
      PTR_TO_BTF_ID | PTR_TRUSTED, and PTR_TO_BTF_ID | MEM_RCU, we call
      map_kptr_match_type() for bpf_kptr_xchg() calls, and
      btf_struct_ids_match() for other helper calls.
      
      When an unhandled PTR_TO_BTF_ID type modifier combination is passed to
      check_reg_type(), the verifier fails with an internal verifier error
      message. This can currently be triggered by passing a PTR_MAYBE_NULL
      pointer to helper functions (currently just bpf_kptr_xchg()) with an
      ARG_PTR_TO_BTF_ID_OR_NULL arg type. For example, by calling
      bpf_kptr_xchg(&v->kptr, bpf_cpumask_create()).
      
      Whether or not passing a PTR_MAYBE_NULL arg to an
      ARG_PTR_TO_BTF_ID_OR_NULL argument is valid is an interesting question.
      In a vacuum, it seems fine. A helper function with an
      ARG_PTR_TO_BTF_ID_OR_NULL arg would seem to be implying that it can
      handle either a NULL or non-NULL arg, and has logic in place to detect
      and gracefully handle each. This is the case for bpf_kptr_xchg(), which
      of course simply does an xchg(). On the other hand, bpf_kptr_xchg() also
      specifies OBJ_RELEASE, and refcounting semantics for a PTR_MAYBE_NULL
      pointer are different from those for a NULL _OR_ non-NULL pointer.
      For example, with a non-NULL arg, we should always fail if there was not
      a nonzero refcount for the value in the register being passed to the
      helper. For PTR_MAYBE_NULL on the other hand, it's unclear. If the
      pointer is NULL it would be fine, but if it's not NULL, it would be
      incorrect to load the program.
      
      The current solution to this is to just fail if PTR_MAYBE_NULL is
      passed, and to instead require programs to have a NULL check to
      explicitly handle the NULL and non-NULL cases. This seems reasonable.
      Not only would it possibly be quite complicated to correctly handle
      PTR_MAYBE_NULL refcounting in the verifier, but it's also an arguably
      odd programming pattern in general to not explicitly handle the NULL
      case anyways. For example, it seems odd to not care about whether a
      pointer you're passing to bpf_kptr_xchg() was successfully allocated in
      a program such as the following:
      
      private(MASK) static struct bpf_cpumask __kptr * global_mask;
      
      SEC("tp_btf/task_newtask")
      int BPF_PROG(example, struct task_struct *task, u64 clone_flags)
      {
      	struct bpf_cpumask *prev;
      
      	/* bpf_cpumask_create() returns PTR_MAYBE_NULL */
      	prev = bpf_kptr_xchg(&global_mask, bpf_cpumask_create());
      	if (prev)
      		bpf_cpumask_release(prev);
      
      	return 0;
      }
      
      This patch therefore updates the verifier to explicitly check for
      PTR_MAYBE_NULL in check_reg_type(), and fail gracefully if it's
      observed. This isn't really "fixing" anything unsafe or incorrect. We're
      just updating the verifier to fail gracefully, and explicitly handle
      this pattern rather than unintentionally falling back to an internal
      verifier error path. A subsequent patch will update selftests.
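
The pattern the verifier now steers programs toward, an explicit NULL check before the exchange, can be modeled in userspace. This is only a sketch under stated assumptions: struct mask_model, mask_create(), slot_xchg() and install_fresh_mask() are hypothetical stand-ins for struct bpf_cpumask, bpf_cpumask_create(), bpf_kptr_xchg() and a BPF program body, and calloc()/free() replace the kernel's allocation and release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct bpf_cpumask. */
struct mask_model {
	unsigned long bits;
};

/* Models a kptr slot such as the global_mask variable in the example. */
static _Atomic(struct mask_model *) global_slot;

/* Models bpf_cpumask_create(): may return NULL when allocation fails. */
static struct mask_model *mask_create(void)
{
	return calloc(1, sizeof(struct mask_model));
}

/* Models bpf_kptr_xchg(): installs 'next' and hands back the old pointer. */
static struct mask_model *slot_xchg(struct mask_model *next)
{
	return atomic_exchange(&global_slot, next);
}

/* Returns 0 on success, -1 if the allocation failed. */
static int install_fresh_mask(void)
{
	struct mask_model *mask, *prev;

	mask = mask_create();
	if (!mask) /* explicit NULL check before the exchange */
		return -1;

	prev = slot_xchg(mask);
	free(prev); /* models the release; free(NULL) is a no-op */
	return 0;
}
```

With the check in place, the value handed to the exchange is known to be non-NULL on that path, which is the distinction the commit message draws between a maybe-NULL pointer and an explicitly handled one.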
      Signed-off-by: David Vernet <void@manifault.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20230330145203.80506-1-void@manifault.com
    • veristat: change guess for __sk_buff from CGROUP_SKB to SCHED_CLS · d8161295
      Andrii Nakryiko authored
      SCHED_CLS seems to be a better option as a default guess for freplace
      programs that have __sk_buff as a context type.
      Reported-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230330190115.3942962-1-andrii@kernel.org
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • selftests/bpf: Rewrite two infinite loops in bound check cases · 4ca13d10
      Xu Kuohai authored
      The two infinite loops in bound check cases added by commit
      1a3148fc ("selftests/bpf: Check when bounds are not in the 32-bit range")
      increased the execution time of test_verifier from about 6 seconds to
      about 9 seconds. Rewrite these two infinite loops to finite loops to get
      rid of this extra time cost.
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20230329011048.1721937-1-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'veristat: add better support of freplace programs' · 8a9abe02
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      
      Teach veristat how to deal with freplace BPF programs. As they can't be
      directly loaded by veristat without a custom user-space part that sets the
      correct target program FD, veristat always fails freplace programs. This patch
      set teaches veristat to guess the target program type that will be inherited
      by the freplace program itself, and substitute it for the BPF_PROG_TYPE_EXT
      (freplace) one for the purposes of BPF verification.
      
      Patch #1 fixes bug in libbpf preventing overriding freplace with specific
      program type.
      
      Patch #2 adds a convenient -d flag to request veristat to emit libbpf debug
      logs. It helps with debugging why a specific BPF program fails to load, if the
      problem is not due to BPF verification itself.
      
      v3->v4:
        - fix optional kern_name check when guessing prog type (Alexei);
      v2->v3:
        - fix bpf_obj_id selftest that uses legacy bpf_prog_test_load() helper,
          which always sets program type programmatically; teach the helper to do it
          only if actually necessary (Stanislav);
      v1->v2:
        - fix compilation error reported by old GCC (my GCC v11 doesn't produce even
          a warning) and Clang (see CI failure at [0]):
      
      GCC version:
      
        veristat.c: In function ‘fixup_obj’:
        veristat.c:908:1: error: label at end of compound statement
          908 | skip_freplace_fixup:
              | ^~~~~~~~~~~~~~~~~~~
      
      Clang version:
      
        veristat.c:909:1: error: label at end of compound statement is a C2x extension [-Werror,-Wc2x-extensions]
        }
        ^
        1 error generated.
      
        [0] https://github.com/kernel-patches/bpf/actions/runs/4515972059/jobs/7953845335
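
Both diagnostics above come from a label that is directly followed by a closing brace. Before C23 a label must be followed by a statement, so a lone `;` after it is one common fix. The sketch below illustrates that rule only; count_positive() is a hypothetical helper and is not taken from veristat.c, whose actual fix may differ.

```c
#include <assert.h>

/* Hypothetical helper: counts positive entries, skipping the rest via goto. */
static int count_positive(const int *vals, int n)
{
	int i, cnt = 0;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		if (vals[i] <= 0)
			goto skip;
		cnt++;
skip:
		; /* empty statement: a label needs a statement after it before C23 */
	}
	return cnt;
}
```

Removing the `;` reproduces the kind of error quoted above on older GCC, and the extension warning on Clang.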
      ====================
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: guess and substitute underlying program type for freplace (EXT) progs · fa7cc906
      Andrii Nakryiko authored
      SEC("freplace") (i.e., BPF_PROG_TYPE_EXT) programs are not loadable as
      is through veristat, as kernel expects actual program's FD during
      BPF_PROG_LOAD time, which veristat has no way of knowing.
      
      Unfortunately, freplace programs are a pretty important class of
      programs, especially when dealing with XDP chaining solutions, which
      rely on EXT programs.
      
      So let's do our best and teach veristat to try to guess the original
      program type, based on program's context argument type. And if guessing
      process succeeds, we manually override freplace/EXT with guessed program
      type using bpf_program__set_type() setter to increase chances of proper
      BPF verification.
      
      We rely on BTF and maintain a simple lookup table. This process is
      obviously not 100% bulletproof, as valid program might not use context
      and thus wouldn't have to specify correct type. Also, __sk_buff is very
      ambiguous and is the context type across many different program types.
      We pick BPF_PROG_TYPE_CGROUP_SKB for now, which seems to work fine in
      practice so far. Similarly, some program types require specifying attach
      type, and so we pick one out of possible few variants.
      
      Best effort at its best. But this makes veristat even more widely
      applicable.
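
The "simple lookup table" idea can be sketched as a mapping from the context argument's struct name to a guessed program type. Everything named here is an assumption for illustration: enum guess_type, struct ctx_guess, guesses[] and guess_prog_type() are hypothetical, the three entries are examples rather than veristat's actual table, and real code would read the struct name from BTF and use libbpf's enum bpf_prog_type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enum; real code would use libbpf's enum bpf_prog_type. */
enum guess_type {
	GUESS_UNKNOWN = 0,
	GUESS_XDP,
	GUESS_CGROUP_SKB,
	GUESS_SOCK_OPS,
};

struct ctx_guess {
	const char *ctx_name; /* name of the context argument's struct */
	enum guess_type type; /* program type guessed from it */
};

/* Illustrative entries only, not veristat's actual table. */
static const struct ctx_guess guesses[] = {
	{ "xdp_md",       GUESS_XDP },
	{ "__sk_buff",    GUESS_CGROUP_SKB }, /* ambiguous: shared by many types */
	{ "bpf_sock_ops", GUESS_SOCK_OPS },
};

/* Returns the guessed type, or GUESS_UNKNOWN when no entry matches. */
static enum guess_type guess_prog_type(const char *ctx_name)
{
	size_t i;

	if (!ctx_name)
		return GUESS_UNKNOWN;
	for (i = 0; i < sizeof(guesses) / sizeof(guesses[0]); i++) {
		if (strcmp(guesses[i].ctx_name, ctx_name) == 0)
			return guesses[i].type;
	}
	return GUESS_UNKNOWN;
}
```

A program that never touches its context gives the table nothing to match on, which is why the commit message calls the approach best effort.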
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Tested-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20230327185202.1929145-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: add -d debug mode option to see debug libbpf log · b3c63d7a
      Andrii Nakryiko authored
      Add -d option to allow requesting libbpf debug logs from veristat.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230327185202.1929145-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: disassociate section handler on explicit bpf_program__set_type() call · d6e6286a
      Andrii Nakryiko authored
      If the user explicitly overrides a program's type with the
      bpf_program__set_type() API call, we need to disassociate whatever
      SEC_DEF handler libbpf determined initially based on the program's SEC()
      definition, as it's not going to be valid anymore and could lead to
      crashes and/or confusing failures.
      
      Also, fix up bpf_prog_test_load() helper in selftests/bpf, which is
      force-setting program type (even if that's completely unnecessary; this
      is quite a legacy piece of code), and thus should expect auto-attach to
      not work, yet one of the tests explicitly relies on auto-attach for
      testing.
      
      Instead, force-set program type only if it differs from the desired one.
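
The interaction between an explicit type override and auto-attach can be modeled with a tiny struct. This is a hedged sketch, not libbpf code: struct prog_model and the prog_*() functions are hypothetical, with sec_def standing in for the SEC()-derived handler that the real bpf_program__set_type() now clears.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a program object; not libbpf's struct bpf_program. */
struct prog_model {
	int type;            /* models the program type */
	const void *sec_def; /* models the SEC()-derived handler, NULL if none */
};

/* Models an explicit type override: the SEC() handler no longer applies. */
static void prog_set_type(struct prog_model *p, int type)
{
	assert(p != NULL);
	p->type = type;
	p->sec_def = NULL;
}

/* Models the selftest helper: override only when the type really differs. */
static void prog_ensure_type(struct prog_model *p, int wanted)
{
	if (p->type != wanted)
		prog_set_type(p, wanted);
}

/* Auto-attach is only possible while a section handler is still associated. */
static bool prog_can_auto_attach(const struct prog_model *p)
{
	return p->sec_def != NULL;
}
```

In the model, a program whose type already matches keeps its handler and can still auto-attach, which is the behaviour the affected test relies on.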
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230327185202.1929145-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 29 Mar, 2023 4 commits
  5. 28 Mar, 2023 4 commits
  6. 27 Mar, 2023 2 commits
  7. 26 Mar, 2023 8 commits
    • xsk: allow remap of fill and/or completion rings · 5f5a7d8d
      Nuno Gonçalves authored
      The remap of fill and completion rings was frowned upon as they
      control the usage of UMEM which does not support concurrent use.
      At the same time this would disallow the remap of these rings
      into another process.
      
      A possible use case is that the user wants to transfer the socket/
      UMEM ownership to another process (via SYS_pidfd_getfd) and so
      would need to also remap these rings.
      
      This will have no impact on current usages and just relaxes the
      remap limitation.
      Signed-off-by: Nuno Gonçalves <nunog@fr24.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/r/20230324100222.13434-1-nunog@fr24.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, docs: Add extended call instructions · 8cfee110
      Dave Thaler authored
      Add extended call instructions.  Uses the term "program-local" for
      call by offset.  And there are instructions for calling helper functions
      by "address" (the old way of using integer values), and for calling
      helper functions by BTF ID (for kfuncs).
      
      V1 -> V2: addressed comments from David Vernet
      
      V2 -> V3: make descriptions in table consistent with updated names
      
      V3 -> V4: addressed comments from Alexei
      
      V4 -> V5: fixed alignment
      Signed-off-by: Dave Thaler <dthaler@microsoft.com>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230326033117.1075-1-dthaler1968@googlemail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage' · 8d275960
      Alexei Starovoitov authored
      Martin KaFai Lau says:
      
      ====================
      
      From: Martin KaFai Lau <martin.lau@kernel.org>
      
      This set is a continuation of the effort in using
      bpf_mem_cache_alloc/free in bpf_local_storage [1]
      
      The major change is only using bpf_mem_alloc for task and cgrp storage
      while sk and inode stay with kzalloc/kfree. The details are
      in patch 2.
      
      [1]: https://lore.kernel.org/bpf/20230308065936.1550103-1-martin.lau@linux.dev/
      
      v3:
      - Only use bpf_mem_alloc for task and cgrp storage.
      - sk and inode storage stay with kzalloc/kfree.
      - Check NULL and add comments in bpf_mem_cache_raw_free() in patch 1.
      - Added test and benchmark for task storage.
      
      v2:
      - Added bpf_mem_cache_alloc_flags() and bpf_mem_cache_raw_free()
        to hide the internal data structure of the bpf allocator.
      - Fixed a typo bug in bpf_selem_free()
      - Simplified the test_local_storage test by directly using
        err returned from libbpf
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add bench for task storage creation · cbe9d93d
      Martin KaFai Lau authored
      This patch adds a task storage benchmark to the existing
      local-storage-create benchmark.
      
      For task storage,
      ./bench --storage-type task --batch-size 32:
         bpf_ma: Summary: creates   30.456 ± 0.507k/s ( 30.456k/prod), 6.08 kmallocs/create
      no bpf_ma: Summary: creates   31.962 ± 0.486k/s ( 31.962k/prod), 6.13 kmallocs/create
      
      ./bench --storage-type task --batch-size 64:
         bpf_ma: Summary: creates   30.197 ± 1.476k/s ( 30.197k/prod), 6.08 kmallocs/create
      no bpf_ma: Summary: creates   31.103 ± 0.297k/s ( 31.103k/prod), 6.13 kmallocs/create
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20230322215246.1675516-6-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test task storage when local_storage->smap is NULL · d8db84d7
      Martin KaFai Lau authored
      The current sk storage test ensures the memory free works when
      the local_storage->smap is NULL.
      
      This patch adds a task storage test to ensure the memory free
      code path works when local_storage->smap is NULL.
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20230322215246.1675516-5-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Use bpf_mem_cache_alloc/free for bpf_local_storage · 6ae9d5e9
      Martin KaFai Lau authored
      This patch uses bpf_mem_cache_alloc/free for allocating and freeing
      bpf_local_storage for task and cgroup storage.
      
      The changes are similar to the previous patch. A few things
      worth mentioning for bpf_local_storage:
      
      The local_storage is freed when the last selem is deleted.
      Before deleting a selem from local_storage, it needs to retrieve the
      local_storage->smap because the bpf_selem_unlink_storage_nolock()
      may have set it to NULL. Note that local_storage->smap may have
      already been NULL when the selem that created this local_storage has
      been removed. In this case, call_rcu will be used to free the
      local_storage.
      Also, the bpf_ma (true or false) value is needed before calling
      bpf_local_storage_free(). The bpf_ma can either be obtained from
      the local_storage->smap (if available) or any of its selem's smap.
      A new helper check_storage_bpf_ma() is added to obtain
      bpf_ma for a bpf_local_storage that is being deleted.
      
      When bpf_local_storage_alloc() gets reused memory, all
      fields either already have the correct values or will be initialized.
      'cache[]' must already be all NULLs. 'list' must be empty.
      Others will be initialized.
      
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20230322215246.1675516-4-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem · 08a7ce38
      Martin KaFai Lau authored
      This patch uses bpf_mem_alloc for the task and cgroup local storage, for
      which the bpf prog can easily get hold of the storage owner's PTR_TO_BTF_ID.
      E.g. bpf_get_current_task_btf() can be used in some of the kmalloc code
      paths, which will cause deadlock/recursion. bpf_mem_cache_alloc is
      deadlock free and will solve a legit use case in [1].
      
      For sk storage, its batch creation benchmark shows a few percent
      regression when the sk create/destroy batch size is larger than 32.
      The sk creation/destruction happens much more often and
      depends on external traffic. Since a deadlock with sk storage is
      only hypothetical so far, sk storage can cross that bridge and
      switch to bpf_mem_alloc when a legit (i.e. useful)
      use case comes up.
      
      For inode storage, bpf_local_storage_destroy() is called before
      waiting for a rcu gp and its memory cannot be reused immediately.
      inode stays with kmalloc/kfree after the rcu [or tasks_trace] gp.
      
      A 'bool bpf_ma' argument is added to bpf_local_storage_map_alloc().
      Only task and cgroup storage have 'bpf_ma == true' which
      means to use bpf_mem_cache_alloc/free(). This patch only changes
      selem to use bpf_mem_alloc for task and cgroup. The next patch
      will change the local_storage to use bpf_mem_alloc also for
      task and cgroup.
      
      Here are some more details on the changes:
      
      * memory allocation:
      After bpf_mem_cache_alloc(), the SDATA(selem)->data is zero-ed because
      bpf_mem_cache_alloc() could return a reused selem. It is to keep
      the existing bpf_map_kzalloc() behavior. Only SDATA(selem)->data
      is zero-ed. SDATA(selem)->data is the visible part to the bpf prog.
      No need to use zero_map_value() to do the zeroing because
      bpf_selem_free(..., reuse_now = true) ensures no bpf prog is using
      the selem before returning the selem through bpf_mem_cache_free().
      For the internal fields of selem, they will be initialized when
      linking to the new smap and the new local_storage.
      
      When 'bpf_ma == false', nothing changes in this patch. It will
      stay with the bpf_map_kzalloc().
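
The allocation rule described above, zeroing only the program-visible data of a possibly reused element, can be sketched with a one-slot cache. This is a userspace model under stated assumptions: struct selem_model, DATA_SZ, cache_slot and selem_alloc() are hypothetical, and the single static slot stands in for whatever bpf_mem_cache_alloc() would hand back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DATA_SZ 16 /* hypothetical value size */

/* Hypothetical model of a storage element: internal state plus visible data. */
struct selem_model {
	unsigned long internal;      /* models fields set up later, at link time */
	unsigned char data[DATA_SZ]; /* models SDATA(selem)->data */
};

/* One-slot cache standing in for the allocator; its memory may be stale. */
static struct selem_model cache_slot;

/* Returns the cached element with only its visible data cleared. */
static struct selem_model *selem_alloc(void)
{
	struct selem_model *selem = &cache_slot; /* possibly reused memory */

	assert(sizeof(selem->data) == DATA_SZ);
	/* Zero only the part a program can see; internals are set when linking. */
	memset(selem->data, 0, sizeof(selem->data));
	return selem;
}
```

The internal field is deliberately left untouched here, mirroring the statement that those fields are initialized when the element is linked to its new smap and local_storage.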
      
      * memory free:
      The bpf_selem_free() and bpf_selem_free_rcu() are modified to handle
      the bpf_ma == true case.
      
      For the common selem free path where its owner is also being destroyed,
      the mem is freed in bpf_local_storage_destroy(), the owner (task
      and cgroup) has gone through a rcu gp. The memory can be reused
      immediately, so bpf_local_storage_destroy() will call
      bpf_selem_free(..., reuse_now = true) which will do
      bpf_mem_cache_free() for immediate reuse consideration.
      
      An exception is the delete elem code path. The delete elem code path
      is called from the helper bpf_*_storage_delete() and the syscall
      bpf_map_delete_elem(). This path is an unusual case for local
      storage because the common use case is to have the local storage
      stay with its owner for the owner's lifetime, so that the bpf prog and
      user space do not have to monitor the owner's destruction. For the delete
      elem path, the selem cannot be reused immediately because a bpf prog
      could still be using it. It will call bpf_selem_free(..., reuse_now = false)
      and it will wait for a rcu tasks trace gp before freeing the elem. The
      rcu callback is changed to do bpf_mem_cache_raw_free() instead of kfree().
      
      When 'bpf_ma == false', it should be the same as before.
      __bpf_selem_free() is added to do the kfree_rcu and call_tasks_trace_rcu().
      A few words on the 'reuse_now == true'. When 'reuse_now == true',
      it is still racing with bpf_local_storage_map_free which is under rcu
      protection, so it still needs to wait for a rcu gp instead of kfree().
      Otherwise, the selem may be reused by slab for a totally different struct
      while the bpf_local_storage_map_free() is still using it (as a
      rcu reader). For the inode case, there may be other rcu readers also.
      In short, when bpf_ma == false and reuse_now == true => vanilla rcu.
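
One way to read the free-path rules above is as a decision on two booleans. The sketch below is a hedged summary in plain C, not the kernel implementation: enum free_path and selem_free_path() are hypothetical names, and the row for bpf_ma == false with reuse_now == false is inferred from the mention of call_tasks_trace_rcu() rather than stated outright.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the four free paths described in the text. */
enum free_path {
	FREE_MEM_CACHE_NOW,       /* bpf_mem_cache_free(): immediate reuse */
	FREE_MEM_CACHE_RAW_TRACE, /* tasks trace gp, then bpf_mem_cache_raw_free() */
	FREE_KFREE_RCU,           /* kfree_rcu(): vanilla rcu gp */
	FREE_KFREE_TRACE,         /* tasks trace gp before the memory is freed */
};

/* Maps (bpf_ma, reuse_now) to the free path the commit message describes. */
static enum free_path selem_free_path(bool bpf_ma, bool reuse_now)
{
	if (bpf_ma)
		return reuse_now ? FREE_MEM_CACHE_NOW : FREE_MEM_CACHE_RAW_TRACE;
	return reuse_now ? FREE_KFREE_RCU : FREE_KFREE_TRACE;
}
```

The only path that skips a grace period entirely is bpf_ma == true with reuse_now == true, which matches the case where the owner has already gone through an rcu gp.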
      
      [1]: https://lore.kernel.org/bpf/20221118190109.1512674-1-namhyung@kernel.org/
      
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20230322215246.1675516-3-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add a few bpf mem allocator functions · e65a5c6e
      Martin KaFai Lau authored
      This patch adds a few bpf mem allocator functions which will
      be used in the bpf_local_storage in a later patch.
      
      bpf_mem_cache_alloc_flags(..., gfp_t flags) is added. When the
      flags == GFP_KERNEL, it will fall back to __alloc(..., GFP_KERNEL).
      bpf_local_storage knows its running context is sleepable (GFP_KERNEL)
      and provides a better guarantee on memory allocation.
      
      bpf_local_storage has some uncommon cases in which its selem
      cannot be reused immediately. It handles its own
      rcu_head and goes through a rcu_trace gp before freeing it.
      bpf_mem_cache_raw_free() is added for direct free purpose
      without leaking the LLIST_NODE_SZ internal knowledge.
      During free time, the 'struct bpf_mem_alloc *ma' is no longer
      available. However, the caller should know if it is
      percpu memory or not and it can call different raw_free functions.
      bpf_local_storage does not support percpu value, so only
      the non-percpu 'bpf_mem_cache_raw_free()' is added in
      this patch.
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20230322215246.1675516-2-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>