1. 11 Nov, 2023 1 commit
    • selftests/bpf: Fix pyperf180 compilation failure with clang18 · 100888fb
      Yonghong Song authored
      With latest clang18 (main branch of llvm-project repo), when building bpf selftests,
          [~/work/bpf-next (master)]$ make -C tools/testing/selftests/bpf LLVM=1 -j
      
      The following compilation error happens:
          fatal error: error in backend: Branch target out of insn range
          ...
          Stack dump:
          0.      Program arguments: clang -g -Wall -Werror -D__TARGET_ARCH_x86 -mlittle-endian
            -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf/tools/include
            -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -I/home/yhs/work/bpf-next/tools/include/uapi
            -I/home/yhs/work/bpf-next/tools/testing/selftests/usr/include -idirafter
            /home/yhs/work/llvm-project/llvm/build.18/install/lib/clang/18/include -idirafter /usr/local/include
            -idirafter /usr/include -Wno-compare-distinct-pointer-types -DENABLE_ATOMICS_TESTS -O2 --target=bpf
            -c progs/pyperf180.c -mcpu=v3 -o /home/yhs/work/bpf-next/tools/testing/selftests/bpf/pyperf180.bpf.o
          1.      <eof> parser at end of file
          2.      Code generation
          ...
      
      The compilation failure only happens to cpu=v2 and cpu=v3. cpu=v4 is okay
      since cpu=v4 supports 32-bit branch target offset.
      
      The above failure is due to upstream llvm patch [1], where some inlining
      behavior was changed in clang18.
      
      Previously, all 180 loop iterations were fully unrolled. To work around the
      issue, the bpf macro __BPF_CPU_VERSION__ (implemented in clang18 recently) is
      used to keep the unrolling unchanged if cpu=v4. If __BPF_CPU_VERSION__ is not
      available and the compiler is clang18, the unrolling amount is unconditionally
      reduced.
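
      As a rough sketch (macro names here are illustrative, not the exact
      pyperf180.c change), the guard has this shape:

        /* Hypothetical guard: keep the full unroll only when the target cpu is
         * known to support 32-bit branch offsets (cpu=v4); otherwise, on
         * clang18, unroll fewer iterations to stay within branch range.
         */
        #if defined(__BPF_CPU_VERSION__) && __BPF_CPU_VERSION__ >= 4
        #define UNROLL_COUNT 180
        #elif defined(__clang__) && __clang_major__ >= 18
        #define UNROLL_COUNT 90
        #else
        #define UNROLL_COUNT 180
        #endif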
      
        [1] https://github.com/llvm/llvm-project/commit/1a2e77cf9e11dbf56b5720c607313a566eebb16e

      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Tested-by: Alan Maguire <alan.maguire@oracle.com>
      Link: https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev
      100888fb
  2. 10 Nov, 2023 39 commits
    • bpf: Add crosstask check to __bpf_get_stack · b8e3a87a
      Jordan Rome authored
      Currently get_perf_callchain only supports user stack walking for
      the current task. Passing the correct *crosstask* param makes it
      return 0 frames when the task passed to __bpf_get_stack isn't the
      current one, instead of a single incorrect frame/address. This change
      passes the correct *crosstask* param and also adds a preemptive
      check in __bpf_get_stack that returns -EOPNOTSUPP if the task is
      not current.
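
      A minimal sketch of the preemptive check (not the exact kernel hunk,
      variable names assumed):

        /* Refuse user stack requests for tasks other than current, since
         * get_perf_callchain() cannot walk another task's user stack.
         */
        if (task && user && task != current)
                return -EOPNOTSUPP;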
      
      This issue was found using bpf_get_task_stack inside a BPF
      iterator ("iter/task"), which iterates over all tasks.
      bpf_get_task_stack works fine for fetching kernel stacks
      but because get_perf_callchain relies on the caller to know
      if the requested *task* is the current one (via *crosstask*)
      it was failing in a confusing way.
      
      It might be possible to get user stacks for all tasks utilizing
      something like access_process_vm but that requires the bpf
      program calling bpf_get_task_stack to be sleepable and would
      therefore be a breaking change.
      
      Fixes: fa28dcb8 ("bpf: Introduce helper bpf_get_task_stack()")
      Signed-off-by: Jordan Rome <jordalgo@meta.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
      b8e3a87a
    • Merge branch 'for-6.8-bpf' of... · 92411764
      Alexei Starovoitov authored
      Merge branch 'for-6.8-bpf' of https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into bpf-next
      
      Merge cgroup prerequisite patches.
      
      Link: https://lore.kernel.org/bpf/20231029061438.4215-1-laoar.shao@gmail.com/
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      92411764
    • compiler-gcc: Suppress -Wmissing-prototypes warning for all supported GCC · 689b097a
      Yafang Shao authored
      The kernel supports a minimum GCC version of 5.1.0 for building. However,
      the "__diag_ignore_all" directive only suppresses the
      "-Wmissing-prototypes" warning for GCC versions >= 8.0.0. As a result, when
      building the kernel with older GCC versions, warnings may be triggered. The
      example below illustrates the warnings reported by the kernel test robot
      using GCC 7.5.0:
      
        compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
        All warnings (new ones prefixed by >>):
      
         kernel/bpf/helpers.c:1893:19: warning: no previous prototype for 'bpf_obj_new_impl' [-Wmissing-prototypes]
          __bpf_kfunc void *bpf_obj_new_impl(u64 local_type_id__k, void *meta__ign)
                            ^~~~~~~~~~~~~~~~
         kernel/bpf/helpers.c:1907:19: warning: no previous prototype for 'bpf_percpu_obj_new_impl' [-Wmissing-prototypes]
          __bpf_kfunc void *bpf_percpu_obj_new_impl(u64 local_type_id__k, void *meta__ign)
         [...]
      
      To address this, we should also suppress the "-Wmissing-prototypes" warning
      for older GCC versions. "#pragma GCC diagnostic push" is supported as
      of GCC 4.6, and both "-Wmissing-prototypes" and "-Wmissing-declarations"
      are supported for all the GCC versions that we currently support.
      Therefore, it is reasonable to suppress these warnings for all supported
      GCC versions.
      
      Note that with this adjustment, "__diag_ignore_all" will effectively
      suppress the warnings for all supported GCC versions.
      
      In the future, if you wish to suppress warnings that are only supported on
      higher GCC versions, it is advisable to explicitly use "__diag_ignore" to
      specify the GCC version you are targeting.
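
      For illustration, a simplified push/ignore/pop sequence of this kind
      (not the kernel's actual compiler-gcc.h macros) looks like:

        /* Works on every GCC version the kernel supports (>= 4.6). */
        #pragma GCC diagnostic push
        #pragma GCC diagnostic ignored "-Wmissing-prototypes"
        #pragma GCC diagnostic ignored "-Wmissing-declarations"
        /* definitions that intentionally lack previous prototypes */
        #pragma GCC diagnostic pop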
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202311031651.A7crZEur-lkp@intel.com/
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20231106031802.4188-1-laoar.shao@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      689b097a
    • bpf: Use named fields for certain bpf uapi structs · 155addf0
      Yonghong Song authored
      Martin and Vadim reported a verifier failure with bpf_dynptr usage.
      The issue is mentioned in [1], where Vadim worked around it with a
      source change. The below describes what the issue is and why there
      is a verification failure.
      
        int BPF_PROG(skb_crypto_setup) {
          struct bpf_dynptr algo, key;
          ...
      
          bpf_dynptr_from_mem(..., ..., 0, &algo);
          ...
        }
      
      The bpf program is using vmlinux.h, so we have the following definition in
      vmlinux.h:
        struct bpf_dynptr {
              long: 64;
              long: 64;
        };
      Note that in uapi header bpf.h, we have
        struct bpf_dynptr {
              long: 64;
              long: 64;
      } __attribute__((aligned(8)));
      
      So we lost alignment information for struct bpf_dynptr by using vmlinux.h.
      Let us take a look at a simple program below:
        $ cat align.c
        typedef unsigned long long __u64;
        struct bpf_dynptr_no_align {
              __u64 :64;
              __u64 :64;
        };
        struct bpf_dynptr_yes_align {
              __u64 :64;
              __u64 :64;
        } __attribute__((aligned(8)));
      
        void bar(void *, void *);
        int foo() {
          struct bpf_dynptr_no_align a;
          struct bpf_dynptr_yes_align b;
          bar(&a, &b);
          return 0;
        }
        $ clang --target=bpf -O2 -S -emit-llvm align.c
      
      Look at the generated IR file align.ll:
        ...
        %a = alloca %struct.bpf_dynptr_no_align, align 1
        %b = alloca %struct.bpf_dynptr_yes_align, align 8
        ...
      
      The compiler dictates that the alignment for struct bpf_dynptr_no_align is 1 and
      the alignment for struct bpf_dynptr_yes_align is 8. So theoretically the compiler
      could allocate variable %a with alignment 1, although in reality the compiler
      may choose a different alignment by considering other local variables.
      
      In [1], the verification failure happens because variable 'algo' is allocated
      on the stack with alignment 4 (fp-28). But the verifier wants its alignment
      to be 8.
      
      To fix the issue, the RFC patch ([2]) tried to add '__attribute__((aligned(8)))'
      to struct bpf_dynptr plus other similar structs. Andrii suggested that
      we could directly modify the uapi struct with named fields like struct 'bpf_iter_num':
        struct bpf_iter_num {
              /* opaque iterator state; having __u64 here allows to preserve correct
               * alignment requirements in vmlinux.h, generated from BTF
               */
              __u64 __opaque[1];
        } __attribute__((aligned(8)));
      
      Indeed, adding named fields for those affected structs in this patch preserves
      alignment when a bpf program references them in vmlinux.h. With this patch,
      the verification failure in [1] can also be resolved.
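
      For example, a sketch of the named-field form applied to struct bpf_dynptr
      (the actual uapi patch covers several structs and may differ in detail):

        struct bpf_dynptr {
              __u64 __opaque[2];   /* named field keeps aligned(8) in BTF/vmlinux.h */
        } __attribute__((aligned(8)));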
      
        [1] https://lore.kernel.org/bpf/1b100f73-7625-4c1f-3ae5-50ecf84d3ff0@linux.dev/
        [2] https://lore.kernel.org/bpf/20231103055218.2395034-1-yonghong.song@linux.dev/
      
      Cc: Vadim Fedorenko <vadfed@meta.com>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231104024900.1539182-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      155addf0
    • Merge branch 'allow-bpf_refcount_acquire-of-mapval-obtained-via-direct-ld' · 3f6d04d7
      Alexei Starovoitov authored
      Dave Marchevsky says:
      
      ====================
      Allow bpf_refcount_acquire of mapval obtained via direct LD
      
      Consider this BPF program:
      
        struct cgv_node {
          int d;
          struct bpf_refcount r;
          struct bpf_rb_node rb;
        };
      
        struct val_stash {
          struct cgv_node __kptr *v;
        };
      
        struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __type(key, int);
          __type(value, struct val_stash);
          __uint(max_entries, 10);
        } array_map SEC(".maps");
      
        long bpf_program(void *ctx)
        {
          struct val_stash *mapval;
          struct cgv_node *p;
          int idx = 0;
      
          mapval = bpf_map_lookup_elem(&array_map, &idx);
          if (!mapval || !mapval->v) { /* omitted */ }
      
          p = bpf_refcount_acquire(mapval->v); /* Verification FAILs here */
      
          /* Add p to some tree */
          return 0;
        }
      
      Verification fails on the refcount_acquire:
      
        160: (79) r1 = *(u64 *)(r8 +8)        ; R1_w=untrusted_ptr_or_null_cgv_node(id=11,off=0,imm=0) R8_w=map_value(id=10,off=0,ks=8,vs=16,imm=0) refs=6
        161: (b7) r2 = 0                      ; R2_w=0 refs=6
        162: (85) call bpf_refcount_acquire_impl#117824
        arg#0 is neither owning or non-owning ref
      
      The above verifier dump is actually from sched_ext's scx_flatcg [0],
      which is the motivating usecase for this series' changes. Specifically,
      scx_flatcg stashes a rb_node type w/ cgroup-specific info (struct
      cgv_node) in a map when the cgroup is created, then later puts that
      cgroup's node in a rbtree in .enqueue . Making struct cgv_node
      refcounted would simplify the code a bit by virtue of allowing us to
      remove the kptr_xchg's, but "later puts that cgroups node in a rbtree"
      is not possible without a refcount_acquire, which suffers from the above
      verification failure.
      
      If we get rid of PTR_UNTRUSTED flag, and add MEM_ALLOC | NON_OWN_REF,
      mapval->v would be a non-owning ref and verification would succeed. Due
      to the most recent set of refcount changes [1], which modified
      bpf_obj_drop behavior to not reuse refcounted graph node's underlying
      memory until after RCU grace period, this is safe to do. Once mapval->v
      has the correct flags it _is_ a non-owning reference and verification of
      the motivating example will succeed.
      
        [0]: https://github.com/sched-ext/sched_ext/blob/52911e1040a0f94b9c426dddcc00be5364a7ae9f/tools/sched_ext/scx_flatcg.bpf.c#L275
        [1]: https://lore.kernel.org/bpf/20230821193311.3290257-1-davemarchevsky@fb.com/
      
      Summary of patches:
        * Patch 1 fixes an issue with bpf_refcount_acquire verification
          letting MAYBE_NULL ptrs through
          * Patch 2 tests Patch 1's fix
        * Patch 3 broadens the use of "free only after RCU GP" to all
          user-allocated types
        * Patch 4 is a small nonfunctional refactoring
        * Patch 5 changes verifier to mark direct LD of stashed graph node
          kptr as non-owning ref
          * Patch 6 tests Patch 5's verifier changes
      
      Changelog:
      
      v1 -> v2: https://lore.kernel.org/bpf/20231025214007.2920506-1-davemarchevsky@fb.com/
      
      Series title changed to "Allow bpf_refcount_acquire of mapval obtained via
      direct LD". V1's title was mistakenly truncated.
      
        * Patch 5 ("bpf: Mark direct ld of stashed bpf_{rb,list}_node as non-owning ref")
          * Direct LD of percpu kptr should not have MEM_ALLOC flag (Yonghong)
        * Patch 6 ("selftests/bpf: Test bpf_refcount_acquire of node obtained via direct ld")
          * Test read from stashed value as well
      ====================
      
      Link: https://lore.kernel.org/r/20231107085639.3016113-1-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3f6d04d7
    • bpf: replace register_is_const() with is_reg_const() · 82ce364c
      Shung-Hsi Yu authored
      The addition of is_reg_const() in commit 171de12646d2 ("bpf: generalize
      is_branch_taken to handle all conditional jumps in one place") has made
      register_is_const() redundant. Given that the former has more features, plus
      the fact that the latter is only used in one place, replace register_is_const()
      with is_reg_const(), and remove the definition of register_is_const().
      
      This requires moving the definition of is_reg_const() further up. And since
      the comment of reg_const_value() references is_reg_const(), move it up as
      well.
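
      As a hedged sketch of what such a helper checks (not the verbatim verifier
      code): a register is a usable constant if it is a SCALAR_VALUE whose tnum
      (or 32-bit subreg tnum) is fully known.

        static bool is_reg_const(struct bpf_reg_state *reg, bool subreg32)
        {
              return reg->type == SCALAR_VALUE &&
                     (subreg32 ? tnum_subreg_is_const(reg->var_off)
                               : tnum_is_const(reg->var_off));
        }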
      Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231108140043.12282-1-shung-hsi.yu@suse.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      82ce364c
    • selftests/bpf: Test bpf_refcount_acquire of node obtained via direct ld · e9ed8df7
      Dave Marchevsky authored
      This patch demonstrates that verifier changes earlier in this series
      result in bpf_refcount_acquire(mapval->stashed_kptr) passing
      verification. The added test additionally validates that stashing a kptr
      in mapval and - in a separate BPF program - refcount_acquiring the kptr
      without unstashing works as expected at runtime.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-7-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e9ed8df7
    • veristat: add ability to filter top N results · 27007fae
      Andrii Nakryiko authored
      Add ability to filter top N results, both in replay/verifier mode and
      comparison mode. Just adding `-n10` will emit only the first 10 rows, or
      fewer if there are not enough rows.
      
      This is not just a shortcut instead of passing veristat output through
      `head`, though. Filtering out all the other rows influences final table
      formatting, as table column widths are calculated based on the actually
      emitted text.
      
      To demonstrate the difference, compare two "equivalent" forms below, one
      using head and another using -n argument.
      
      TOP N FEATURE
      =============
      [vmuser@archvm bpf]$ sudo ./veristat -C ~/baseline-results-selftests.csv ~/sanity2-results-selftests.csv -e file,prog,insns,states -s '|insns_diff|' -n10
      File                                      Program                Insns (A)  Insns (B)  Insns (DIFF)  States (A)  States (B)  States (DIFF)
      ----------------------------------------  ---------------------  ---------  ---------  ------------  ----------  ----------  -------------
      test_seg6_loop.bpf.linked3.o              __add_egr_x                12440      12360  -80 (-0.64%)         364         357    -7 (-1.92%)
      async_stack_depth.bpf.linked3.o           async_call_root_check        145        145   +0 (+0.00%)           3           3    +0 (+0.00%)
      async_stack_depth.bpf.linked3.o           pseudo_call_check            139        139   +0 (+0.00%)           3           3    +0 (+0.00%)
      atomic_bounds.bpf.linked3.o               sub                            7          7   +0 (+0.00%)           0           0    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o  kmalloc                        5          5   +0 (+0.00%)           0           0    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o  sched_process_fork            22         22   +0 (+0.00%)           2           2    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o  socket_post_create            23         23   +0 (+0.00%)           2           2    +0 (+0.00%)
      bind4_prog.bpf.linked3.o                  bind_v4_prog                 358        358   +0 (+0.00%)          33          33    +0 (+0.00%)
      bind6_prog.bpf.linked3.o                  bind_v6_prog                 429        429   +0 (+0.00%)          37          37    +0 (+0.00%)
      bind_perm.bpf.linked3.o                   bind_v4_prog                  15         15   +0 (+0.00%)           1           1    +0 (+0.00%)
      
      PIPING TO HEAD
      ==============
      [vmuser@archvm bpf]$ sudo ./veristat -C ~/baseline-results-selftests.csv ~/sanity2-results-selftests.csv -e file,prog,insns,states -s '|insns_diff|' | head -n12
      File                                                   Program                                               Insns (A)  Insns (B)  Insns (DIFF)  States (A)  States (B)  States (DIFF)
      -----------------------------------------------------  ----------------------------------------------------  ---------  ---------  ------------  ----------  ----------  -------------
      test_seg6_loop.bpf.linked3.o                           __add_egr_x                                               12440      12360  -80 (-0.64%)         364         357    -7 (-1.92%)
      async_stack_depth.bpf.linked3.o                        async_call_root_check                                       145        145   +0 (+0.00%)           3           3    +0 (+0.00%)
      async_stack_depth.bpf.linked3.o                        pseudo_call_check                                           139        139   +0 (+0.00%)           3           3    +0 (+0.00%)
      atomic_bounds.bpf.linked3.o                            sub                                                           7          7   +0 (+0.00%)           0           0    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o               kmalloc                                                       5          5   +0 (+0.00%)           0           0    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o               sched_process_fork                                           22         22   +0 (+0.00%)           2           2    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o               socket_post_create                                           23         23   +0 (+0.00%)           2           2    +0 (+0.00%)
      bind4_prog.bpf.linked3.o                               bind_v4_prog                                                358        358   +0 (+0.00%)          33          33    +0 (+0.00%)
      bind6_prog.bpf.linked3.o                               bind_v6_prog                                                429        429   +0 (+0.00%)          37          37    +0 (+0.00%)
      bind_perm.bpf.linked3.o                                bind_v4_prog                                                 15         15   +0 (+0.00%)           1           1    +0 (+0.00%)
      
      Note all the wasted whitespace in the "PIPING TO HEAD" variant.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231108051430.1830950-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      27007fae
    • bpf: Mark direct ld of stashed bpf_{rb,list}_node as non-owning ref · 1b121715
      Dave Marchevsky authored
      This patch enables the following pattern:
      
        /* mapval contains a __kptr pointing to refcounted local kptr */
        mapval = bpf_map_lookup_elem(&map, &idx);
        if (!mapval || !mapval->some_kptr) { /* omitted */ }
      
        p = bpf_refcount_acquire(&mapval->some_kptr);
      
      Currently this doesn't work because bpf_refcount_acquire expects an
      owning or non-owning ref. The verifier defines non-owning ref as a type:
      
        PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF
      
      while mapval->some_kptr is PTR_TO_BTF_ID | PTR_UNTRUSTED. It's possible
      to do the refcount_acquire by first bpf_kptr_xchg'ing mapval->some_kptr
      into a temp kptr, refcount_acquiring that, and xchg'ing back into
      mapval, but this is unwieldy and shouldn't be necessary.
      
      This patch modifies btf_ld_kptr_type such that user-allocated types are
      marked MEM_ALLOC and if those types have a bpf_{rb,list}_node they're
      marked NON_OWN_REF as well. Additionally, due to changes to
      bpf_obj_drop_impl earlier in this series, rcu_protected_object now
      returns true for all user-allocated types, resulting in
      mapval->some_kptr being marked MEM_RCU.
      
      After this patch's changes, mapval->some_kptr is now:
      
        PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
      
      which results in it passing the non-owning ref test, and the motivating
      example passing verification.
      
      Future work will likely get rid of special non-owning ref lifetime logic
      in the verifier, at which point we'll be able to delete the NON_OWN_REF
      flag entirely.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-6-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1b121715
    • veristat: add ability to sort by stat's absolute value · 5d4a7aac
      Andrii Nakryiko authored
      Add ability to sort results by absolute values of specified stats. This
      is especially useful to find biggest deviations in comparison mode. When
      comparing verifier change effect against a large base of BPF object
      files, it's necessary to see big changes both in positive and negative
      directions, as both might be a signal for regressions or bugs.
      
      The syntax is natural, e.g., adding `-s '|insns_diff|'^` will instruct
      veristat to sort by absolute value of instructions difference in
      ascending order.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231108051430.1830950-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5d4a7aac
    • bpf: Move GRAPH_{ROOT,NODE}_MASK macros into btf_field_type enum · 790ce3cf
      Dave Marchevsky authored
      This refactoring patch removes the unused BPF_GRAPH_NODE_OR_ROOT
      btf_field_type and moves BPF_GRAPH_{NODE,ROOT} macros into the
      btf_field_type enum. Further patches in the series will use
      BPF_GRAPH_NODE, so let's move this useful definition out of btf.c.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-5-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      790ce3cf
    • libbpf: Fix potential uninitialized tail padding with LIBBPF_OPTS_RESET · 7f7c4369
      Yonghong Song authored
      Martin reported a libbpf complaint about non-zero-value tail padding with
      the LIBBPF_OPTS_RESET macro if struct bpf_netkit_opts is modified to have
      a 4-byte tail padding. This only happens with the clang compiler.
      The command line is: ./test_progs -t tc_netkit_multi_links
      Martin and I did some investigation and found this is indeed the case;
      the following are the investigation details.
      
      Clang:
        clang version 18.0.0
        <I tried clang15/16/17 and they all have similar results>
      
      tools/lib/bpf/libbpf_common.h:
        #define LIBBPF_OPTS_RESET(NAME, ...)                                      \
              do {                                                                \
                      memset(&NAME, 0, sizeof(NAME));                             \
                      NAME = (typeof(NAME)) {                                     \
                              .sz = sizeof(NAME),                                 \
                              __VA_ARGS__                                         \
                      };                                                          \
              } while (0)
      
        #endif
      
      tools/lib/bpf/libbpf.h:
        struct bpf_netkit_opts {
              /* size of this struct, for forward/backward compatibility */
              size_t sz;
              __u32 flags;
              __u32 relative_fd;
              __u32 relative_id;
              __u64 expected_revision;
              size_t :0;
        };
        #define bpf_netkit_opts__last_field expected_revision
      In the above struct bpf_netkit_opts, there is no tail padding.
      
      prog_tests/tc_netkit.c:
        static void serial_test_tc_netkit_multi_links_target(int mode, int target)
        {
              ...
              LIBBPF_OPTS(bpf_netkit_opts, optl);
              ...
              LIBBPF_OPTS_RESET(optl,
                      .flags = BPF_F_BEFORE,
                      .relative_fd = bpf_program__fd(skel->progs.tc1),
              );
              ...
        }
      
      Let us make the following source change; note that we now have a 4-byte
      tail padding.
        diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
        index 6cd9c501624f..0dd83910ae9a 100644
        --- a/tools/lib/bpf/libbpf.h
        +++ b/tools/lib/bpf/libbpf.h
        @@ -803,13 +803,13 @@ bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
         struct bpf_netkit_opts {
              /* size of this struct, for forward/backward compatibility */
              size_t sz;
        -       __u32 flags;
              __u32 relative_fd;
              __u32 relative_id;
              __u64 expected_revision;
        +       __u32 flags;
              size_t :0;
         };
        -#define bpf_netkit_opts__last_field expected_revision
        +#define bpf_netkit_opts__last_field flags
      
      The clang 18 generated asm code looks like below:
          ;       LIBBPF_OPTS_RESET(optl,
          55e3: 48 8d 7d 98                   leaq    -0x68(%rbp), %rdi
          55e7: 31 f6                         xorl    %esi, %esi
          55e9: ba 20 00 00 00                movl    $0x20, %edx
          55ee: e8 00 00 00 00                callq   0x55f3 <serial_test_tc_netkit_multi_links_target+0x18d3>
          55f3: 48 c7 85 10 fd ff ff 20 00 00 00      movq    $0x20, -0x2f0(%rbp)
          55fe: 48 8b 85 68 ff ff ff          movq    -0x98(%rbp), %rax
          5605: 48 8b 78 18                   movq    0x18(%rax), %rdi
          5609: e8 00 00 00 00                callq   0x560e <serial_test_tc_netkit_multi_links_target+0x18ee>
          560e: 89 85 18 fd ff ff             movl    %eax, -0x2e8(%rbp)
          5614: c7 85 1c fd ff ff 00 00 00 00 movl    $0x0, -0x2e4(%rbp)
          561e: 48 c7 85 20 fd ff ff 00 00 00 00      movq    $0x0, -0x2e0(%rbp)
          5629: c7 85 28 fd ff ff 08 00 00 00 movl    $0x8, -0x2d8(%rbp)
          5633: 48 8b 85 10 fd ff ff          movq    -0x2f0(%rbp), %rax
          563a: 48 89 45 98                   movq    %rax, -0x68(%rbp)
          563e: 48 8b 85 18 fd ff ff          movq    -0x2e8(%rbp), %rax
          5645: 48 89 45 a0                   movq    %rax, -0x60(%rbp)
          5649: 48 8b 85 20 fd ff ff          movq    -0x2e0(%rbp), %rax
          5650: 48 89 45 a8                   movq    %rax, -0x58(%rbp)
          5654: 48 8b 85 28 fd ff ff          movq    -0x2d8(%rbp), %rax
          565b: 48 89 45 b0                   movq    %rax, -0x50(%rbp)
          ;       link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl);
      
      At -O0, the clang compiler creates an intermediate copy.
      The instruction below stores 'flags' with a 4-byte store and leaves the
      other 4 bytes in the same 8-byte-aligned storage undefined,
          5629: c7 85 28 fd ff ff 08 00 00 00 movl    $0x8, -0x2d8(%rbp)
      and later 8 bytes are stored to the original zeroed buffer
          5654: 48 8b 85 28 fd ff ff          movq    -0x2d8(%rbp), %rax
          565b: 48 89 45 b0                   movq    %rax, -0x50(%rbp)
      
      This caused a problem as the 4-byte value at [%rbp-0x2dc, %rbp-0x2e0)
      may be garbage.
      
      gcc (gcc 11.4) does not have this issue as it zeroes the struct first before
      doing assignments:
        ;       LIBBPF_OPTS_RESET(optl,
          50fd: 48 8d 85 40 fc ff ff          leaq    -0x3c0(%rbp), %rax
          5104: ba 20 00 00 00                movl    $0x20, %edx
          5109: be 00 00 00 00                movl    $0x0, %esi
          510e: 48 89 c7                      movq    %rax, %rdi
          5111: e8 00 00 00 00                callq   0x5116 <serial_test_tc_netkit_multi_links_target+0x1522>
          5116: 48 8b 45 f0                   movq    -0x10(%rbp), %rax
          511a: 48 8b 40 18                   movq    0x18(%rax), %rax
          511e: 48 89 c7                      movq    %rax, %rdi
          5121: e8 00 00 00 00                callq   0x5126 <serial_test_tc_netkit_multi_links_target+0x1532>
          5126: 48 c7 85 40 fc ff ff 00 00 00 00      movq    $0x0, -0x3c0(%rbp)
          5131: 48 c7 85 48 fc ff ff 00 00 00 00      movq    $0x0, -0x3b8(%rbp)
          513c: 48 c7 85 50 fc ff ff 00 00 00 00      movq    $0x0, -0x3b0(%rbp)
          5147: 48 c7 85 58 fc ff ff 00 00 00 00      movq    $0x0, -0x3a8(%rbp)
          5152: 48 c7 85 40 fc ff ff 20 00 00 00      movq    $0x20, -0x3c0(%rbp)
          515d: 89 85 48 fc ff ff             movl    %eax, -0x3b8(%rbp)
          5163: c7 85 58 fc ff ff 08 00 00 00 movl    $0x8, -0x3a8(%rbp)
        ;       link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl);
      
      It is not clear how to resolve the compiler code generation, as the compiler
      generates correct code w.r.t. how unnamed padding is handled in the C standard.
      So this patch changes the LIBBPF_OPTS_RESET macro to avoid uninitialized tail
      padding. We already know the LIBBPF_OPTS macro works on both gcc and clang,
      even with tail padding. So LIBBPF_OPTS_RESET is changed to be a
      LIBBPF_OPTS followed by a memcpy(), thus avoiding uninitialized tail padding.
      
      The below is asm code generated with this patch and with clang compiler:
          ;       LIBBPF_OPTS_RESET(optl,
          55e3: 48 8d bd 10 fd ff ff          leaq    -0x2f0(%rbp), %rdi
          55ea: 31 f6                         xorl    %esi, %esi
          55ec: ba 20 00 00 00                movl    $0x20, %edx
          55f1: e8 00 00 00 00                callq   0x55f6 <serial_test_tc_netkit_multi_links_target+0x18d6>
          55f6: 48 c7 85 10 fd ff ff 20 00 00 00      movq    $0x20, -0x2f0(%rbp)
          5601: 48 8b 85 68 ff ff ff          movq    -0x98(%rbp), %rax
          5608: 48 8b 78 18                   movq    0x18(%rax), %rdi
          560c: e8 00 00 00 00                callq   0x5611 <serial_test_tc_netkit_multi_links_target+0x18f1>
          5611: 89 85 18 fd ff ff             movl    %eax, -0x2e8(%rbp)
          5617: c7 85 1c fd ff ff 00 00 00 00 movl    $0x0, -0x2e4(%rbp)
          5621: 48 c7 85 20 fd ff ff 00 00 00 00      movq    $0x0, -0x2e0(%rbp)
          562c: c7 85 28 fd ff ff 08 00 00 00 movl    $0x8, -0x2d8(%rbp)
          5636: 48 8b 85 10 fd ff ff          movq    -0x2f0(%rbp), %rax
          563d: 48 89 45 98                   movq    %rax, -0x68(%rbp)
          5641: 48 8b 85 18 fd ff ff          movq    -0x2e8(%rbp), %rax
          5648: 48 89 45 a0                   movq    %rax, -0x60(%rbp)
          564c: 48 8b 85 20 fd ff ff          movq    -0x2e0(%rbp), %rax
          5653: 48 89 45 a8                   movq    %rax, -0x58(%rbp)
          5657: 48 8b 85 28 fd ff ff          movq    -0x2d8(%rbp), %rax
          565e: 48 89 45 b0                   movq    %rax, -0x50(%rbp)
          ;       link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl);
      
      In the above code, a temporary buffer is zeroed and then has the proper values
      assigned. Finally, the values in the temporary buffer are copied to the original
      variable buffer, hence the tail padding is guaranteed to be 0.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Tested-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/bpf/20231107201511.2548645-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7f7c4369
    • bpf: Use bpf_mem_free_rcu when bpf_obj_dropping non-refcounted nodes · 649924b7
      Dave Marchevsky authored
      The use of bpf_mem_free_rcu to free refcounted local kptrs was added
      in commit 7e26cd12 ("bpf: Use bpf_mem_free_rcu when
      bpf_obj_dropping refcounted nodes"). In the cover letter for the
      series containing that patch [0] I commented:
      
          Perhaps it makes sense to move to mem_free_rcu for _all_
          non-owning refs in the future, not just refcounted. This might
          allow custom non-owning ref lifetime + invalidation logic to be
          entirely subsumed by MEM_RCU handling. IMO this needs a bit more
          thought and should be tackled outside of a fix series, so it's not
          attempted here.
      
      It's time to start moving in the "non-owning refs have MEM_RCU
      lifetime" direction. As mentioned in that comment, using
      bpf_mem_free_rcu for all local kptrs - not just refcounted - is
      necessarily the first step towards that goal. This patch does so.
      
      After this patch the memory pointed to by all local kptrs will not be
      reused until RCU grace period elapses. The verifier's understanding of
      non-owning ref validity and the clobbering logic it uses to enforce
      that understanding are not changed here, that'll happen gradually in
      future work, including further patches in the series.
      
        [0]: https://lore.kernel.org/all/20230821193311.3290257-1-davemarchevsky@fb.com/

      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-4-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      649924b7
    • selftests/bpf: Add test passing MAYBE_NULL reg to bpf_refcount_acquire · f460e7bd
      Dave Marchevsky authored
      The test added in this patch exercises the logic fixed in the previous
      patch in this series. Before the previous patch's changes,
      bpf_refcount_acquire accepts MAYBE_NULL local kptrs; after the change
      the verifier correctly rejects such a call.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-3-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f460e7bd
    • bpf: Add KF_RCU flag to bpf_refcount_acquire_impl · 1500a5d9
      Dave Marchevsky authored
      Refcounted local kptrs are kptrs to user-defined types with a
      bpf_refcount field. Recent commits ([0], [1]) modified the lifetime of
      refcounted local kptrs such that the underlying memory is not reused
      until RCU grace period has elapsed.
      
      Separately, verification of bpf_refcount_acquire calls currently
      succeeds for MAYBE_NULL non-owning reference input, which is a problem
      as bpf_refcount_acquire_impl has no handling for this case.
      
      This patch takes advantage of aforementioned lifetime changes to tag
      bpf_refcount_acquire_impl kfunc KF_RCU, thereby preventing MAYBE_NULL
      input to the kfunc. The KF_RCU flag applies to all kfunc params; it's
      fine for it to apply to the void *meta__ign param as that's populated by
      the verifier and is tagged __ign regardless.
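
      As an illustration of the shape of such a change (the exact BTF ID set
      entry and existing flags may differ), tagging the kfunc KF_RCU looks like:

        BTF_ID_FLAGS(func, bpf_refcount_acquire_impl, KF_ACQUIRE | KF_RET_NULL | KF_RCU)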
      
        [0]: commit 7e26cd12 ("bpf: Use bpf_mem_free_rcu when
             bpf_obj_dropping refcounted nodes") is the actual change to
             allocation behavior
        [1]: commit 0816b8c6 ("bpf: Consider non-owning refs to refcounted
             nodes RCU protected") modified verifier understanding of
             refcounted local kptrs to match [0]'s changes
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Fixes: 7c50b1cb ("bpf: Add bpf_refcount_acquire kfunc")
      Link: https://lore.kernel.org/r/20231107085639.3016113-2-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1500a5d9
    • Merge branch 'bpf: __bpf_dynptr_data* and __str annotation' · b0d1c729
      Andrii Nakryiko authored
      Song Liu says:
      
      ====================
      This set contains the first 3 patches of set [1]. Currently, [1] is waiting
      for [3] to be merged to bpf-next tree. So send these 3 patches first to
      unblock other works, such as [2]. This set is verified with new version of
      [1] in CI run [4].
      
      Changes since v12 of [1]:
      1. Reuse bpf_dynptr_slice() in __bpf_dynptr_data(). (Andrii)
      2. Add Acked-by from Vadim Fedorenko.
      
      [1] https://lore.kernel.org/bpf/20231104001313.3538201-1-song@kernel.org/
      [2] https://lore.kernel.org/bpf/20231031134900.1432945-1-vadfed@meta.com/
      [3] https://lore.kernel.org/bpf/20231031215625.2343848-1-davemarchevsky@fb.com/
      [4] https://github.com/kernel-patches/bpf/actions/runs/6779945063/job/18427926114
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b0d1c729
    • bpf, lpm: Fix check prefixlen before walking trie · 9b75dbeb
      Florian Lehner authored
      When looking up an element in the LPM trie, the condition 'matchlen ==
      trie->max_prefixlen' will never return true if key->prefixlen is larger
      than trie->max_prefixlen. Consequently all elements in the LPM trie will
      be visited and no element is returned in the end.
      
      To resolve this, check key->prefixlen first before walking the LPM trie.
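
      A sketch of the added guard in the lookup path (illustrative, not the
      verbatim lpm_trie.c hunk):

        /* Reject lookups whose prefixlen exceeds what this trie supports. */
        if (key->prefixlen > trie->max_prefixlen)
              return NULL;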
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Signed-off-by: Florian Lehner <dev@der-flo.net>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231105085801.3742-1-dev@der-flo.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9b75dbeb
    • bpf: Introduce KF_ARG_PTR_TO_CONST_STR · 045edee1
      Song Liu authored
      Similar to ARG_PTR_TO_CONST_STR for BPF helpers, KF_ARG_PTR_TO_CONST_STR
      specifies kfunc args that point to const strings. Annotation "__str" is
      used to specify kfunc arg of type KF_ARG_PTR_TO_CONST_STR. Also, add
      documentation for the "__str" annotation.
      
      bpf_get_file_xattr() will be the first kfunc that uses this type.
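
      For illustration, a kfunc prototype using the "__str" argument suffix for
      a const string parameter might look like the following (the exact
      bpf_get_file_xattr() signature may differ):

        __bpf_kfunc int bpf_get_file_xattr(struct file *file, const char *name__str,
                                           struct bpf_dynptr *value_p);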
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
      Link: https://lore.kernel.org/bpf/20231107045725.2278852-4-song@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      045edee1
    • selftests/bpf: Disable CONFIG_DEBUG_INFO_REDUCED in config.aarch64 · f2d2c7e1
      Anders Roxell authored
      When building an arm64 kernel and selftests/bpf with defconfig +
      selftests/bpf/config and selftests/bpf/config.aarch64, the fragment
      CONFIG_DEBUG_INFO_REDUCED is enabled by arm64's defconfig. It should be
      disabled in selftests/bpf/config.aarch64, since if it's not disabled
      CONFIG_DEBUG_INFO_BTF won't be enabled.
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231103220912.333930-1-anders.roxell@linaro.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f2d2c7e1
    • bpf: Factor out helper check_reg_const_str() · 0b519407
      Song Liu authored
      ARG_PTR_TO_CONST_STR is used to specify constant string args for BPF
      helpers. The logic that verifies a reg is ARG_PTR_TO_CONST_STR is
      implemented in check_func_arg().
      
      As we introduce kfuncs with constant string args, it is necessary to
      do the same check for kfuncs (in check_kfunc_args). Factor out the logic
      for ARG_PTR_TO_CONST_STR to a new check_reg_const_str() so that it can be
      reused.
      
      check_func_arg() ensures check_reg_const_str() is only called with a reg of
      type PTR_TO_MAP_VALUE. Add a redundant type check in check_reg_const_str()
      to avoid misuse in the future. Other than this redundant check, there is
      no change in behavior.
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
      Link: https://lore.kernel.org/bpf/20231107045725.2278852-3-song@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0b519407
    • bpftool: Fix prog object type in manpage · a46afaa0
      Artem Savkov authored
      bpftool's man page lists "program" as one of possible values for OBJECT,
      while in fact bpftool accepts "prog" instead.
      Reported-by: Jerry Snitselaar <jsnitsel@redhat.com>
      Signed-off-by: Artem Savkov <asavkov@redhat.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20231103081126.170034-1-asavkov@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a46afaa0
    • bpf: Add __bpf_dynptr_data* for in kernel use · 74523c06
      Song Liu authored
      Different types of bpf dynptr have different internal data storage.
      Specifically, SKB and XDP types of dynptr may have non-contiguous data.
      Therefore, it is not always safe to directly access dynptr->data.
      
      Add __bpf_dynptr_data and __bpf_dynptr_data_rw to replace direct access to
      dynptr->data.
      
      Update bpf_verify_pkcs7_signature to use __bpf_dynptr_data instead of
      dynptr->data.
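
      A hedged sketch of the accessors' shape (the exact prototypes may differ):
      both return NULL when 'len' bytes cannot be served as one contiguous region
      (e.g. non-linear SKB/XDP data) or exceed the dynptr's size.

        void *__bpf_dynptr_data(const struct bpf_dynptr_kern *ptr, u32 len);
        void *__bpf_dynptr_data_rw(const struct bpf_dynptr_kern *ptr, u32 len);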
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
      Link: https://lore.kernel.org/bpf/20231107045725.2278852-2-song@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      74523c06
    • selftests/bpf: Consolidate VIRTIO/9P configs in config.vm file · b0cf0dcd
      Manu Bretelle authored
      Those configs are needed to be able to run VMs somewhat consistently.
      For instance, ATM, s390x is missing `CONFIG_VIRTIO_CONSOLE`, which
      prevents s390x kernels built in CI from leveraging qemu-guest-agent.

      By moving them to `config.vm`, we should have selftest kernels which are
      equal in terms of VM functionality when they include this file.
      
      The set of configs enabled was picked using
      
          grep -h -E '(_9P|_VIRTIO)' config.x86_64 config | sort | uniq
      
      added to `config.vm` and then
          grep -vE '(_9P|_VIRTIO)' config.{x86_64,aarch64,s390x}
      
      As a side-effect, some configs may have disappeared from the aarch64 and
      s390x kernels, but they should not be needed. CI will tell.
      Signed-off-by: Manu Bretelle <chantr4@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231031212717.4037892-1-chantr4@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b0cf0dcd
    • Merge branch 'selftests/bpf: Fixes for map_percpu_stats test' · e3499962
      Andrii Nakryiko authored
      Hou Tao says:
      
      ====================
      From: Hou Tao <houtao1@huawei.com>
      
      Hi,
      
      BPF CI failed due to map_percpu_stats_percpu_hash from time to time [1].
      It seems the failure reason is that the per-cpu bpf memory allocator may
      not be able to allocate a per-cpu pointer successfully and cannot refill
      the free llist in time, so bpf_map_update_elem() will return -ENOMEM.
      
      Patch #1 fixes the size of value passed to per-cpu map update API. The
      problem was found when fixing the ENOMEM problem, so also post it in
      this patchset. Patch #2 & #3 mitigates the ENOMEM problem by retrying
      the update operation for non-preallocated per-cpu map.
      
      Please see individual patches for more details. And comments are always
      welcome.
      
      Regards,
      Tao
      
      [1]: https://github.com/kernel-patches/bpf/actions/runs/6713177520/job/18244865326?pr=5909
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e3499962
    • Merge branch 'bpf-register-bounds-logic-and-testing-improvements' · cd9c1270
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      BPF register bounds logic and testing improvements
      
      This patch set adds a big set of manual and auto-generated test cases
      validating BPF verifier's register bounds tracking and deduction logic. See
      details in the last patch.
      
      We start with building a tester that validates existing <range> vs <scalar>
      verifier logic for range bounds. To make all this work, BPF verifier's logic
      needed a bunch of improvements to handle some cases that previously were not
      covered. This had no implications as to correctness of verifier logic, but it
      was incomplete enough to cause significant disagreements with alternative
      implementation of register bounds logic that tests in this patch set
      implement. So we need BPF verifier logic improvements to make all the tests
      pass. This is what we do in patches #3 through #9.
      
      The end goal of this work, though, is to extend BPF verifier range state
      tracking such as to allow to derive new range bounds when comparing non-const
      registers. There is some more work required to investigate and fix existing
      potential issues with range tracking as part of ALU/ALU64
      operations, so <range> x <range> part of v5 patch set ([0]) is dropped until
      these issues are sorted out.
      
      For now, we include preparatory refactorings and clean ups, that set up BPF
      verifier code base to extend the logic to <range> vs <range> logic in
      subsequent patch set. Patches #10-#16 perform preliminary refactorings without
      functionally changing anything. But they do clean up check_cond_jmp_op() logic
      and generalize a bunch of other pieces in is_branch_taken() logic.
      
        [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=797178&state=*
      
      v5->v6:
        - dropped <range> vs <range> patches (original patches #18 through #23) to
          add more register range sanity checks and fix preexisting issues;
        - comments improvements, addressing other feedback on first 17 patches
          (Eduard, Alexei);
      v4->v5:
        - added entirety of verifier reg bounds tracking changes, now handling
          <range> vs <range> cases (Alexei);
        - added way more comments trying to explain why deductions added are
          correct, hopefully they are useful and clarify things a bit (Daniel,
          Shung-Hsi);
        - added two preliminary selftests fixes necessary for RELEASE=1 build to
          work again, it keeps breaking.
      v3->v4:
        - improvements to reg_bounds tester (progress report, split 32-bit and
          64-bit ranges, fix various verbosity output issues, etc);
      v2->v3:
        - fix a subtle little-endianness assumption inside parse_reg_state() (CI);
      v1->v2:
        - fix compilation when building selftests with llvm-16 toolchain (CI).
      ====================
      
      Link: https://lore.kernel.org/r/20231102033759.2541186-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      cd9c1270
    • selftsets/bpf: Retry map update for non-preallocated per-cpu map · 2f553b03
      Hou Tao authored
      BPF CI failed due to map_percpu_stats_percpu_hash from time to time [1].
      It seems the failure reason is that the per-cpu bpf memory allocator may
      not be able to allocate a per-cpu pointer successfully and cannot refill
      the free llist in time, so bpf_map_update_elem() will return -ENOMEM.
      
      So mitigate the problem by retrying the update operation for
      non-preallocated per-cpu map.
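
      A hedged sketch of the retry idea (a hypothetical helper, not the selftest's
      actual map_update_retriable() implementation):

        #include <errno.h>
        #include <unistd.h>
        #include <bpf/bpf.h>

        /* Retry a bounded number of times when the update fails with -ENOMEM,
         * giving the per-cpu bpf memory allocator time to refill its free list.
         */
        static int update_elem_retry_enomem(int map_fd, const void *key,
                                            const void *value, __u64 flags)
        {
              int i, err = 0;

              for (i = 0; i < 32; i++) {
                    err = bpf_map_update_elem(map_fd, key, value, flags);
                    if (!err || errno != ENOMEM)
                          break;
                    usleep(1000);
              }
              return err;
        }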
      
      [1]: https://github.com/kernel-patches/bpf/actions/runs/6713177520/job/18244865326?pr=5909

      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231101032455.3808547-4-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      2f553b03
    • bpf: generalize reg_set_min_max() to handle two sets of two registers · 4621202a
      Andrii Nakryiko authored
      Change reg_set_min_max() to take FALSE/TRUE sets of two registers each,
      instead of assuming that we are always comparing to a constant. For now
      we still assume that right-hand side registers are constants (and make
      sure that's the case by swapping src/dst regs, if necessary), but
      subsequent patches will remove this limitation.
      
      reg_set_min_max() is now called unconditionally for any register
      comparison, so that might include pointer vs pointer. This makes it
      consistent with is_branch_taken() generality. But we currently only
      support adjustments based on SCALAR vs SCALAR comparisons, so
      reg_set_min_max() has to guard itself against pointers.
      
      Taking two by two registers allows to further unify and simplify
      check_cond_jmp_op() logic. We utilize fake register for BPF_K
      conditional jump case, just like with is_branch_taken() part.
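
      A rough sketch of the BPF_K-as-fake-register idea (helper and field names
      assumed; the verifier's actual code differs in detail):

        /* Materialize the immediate as a known-constant scalar register so the
         * BPF_K case can flow through the same path as BPF_X.
         */
        struct bpf_reg_state fake_reg = {};

        fake_reg.type = SCALAR_VALUE;
        __mark_reg_known(&fake_reg, insn->imm);
        src_reg = &fake_reg;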
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231102033759.2541186-18-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4621202a
    • selftests/bpf: Export map_update_retriable() · b9b79553
      Hou Tao authored
      Export map_update_retriable() to make it usable for other map_test
      cases. These cases may only need retry for specific errno, so add
      a new callback parameter to let map_update_retriable() decide whether or
      not the errno is retriable.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231101032455.3808547-3-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b9b79553
    • bpf: prepare reg_set_min_max for second set of registers · 811476e9
      Andrii Nakryiko authored
      Similarly to is_branch_taken()-related refactorings, start preparing
      reg_set_min_max() to handle more generic case of two non-const
      registers. Start with renaming arguments to accommodate later addition
      of second register as an input argument.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231102033759.2541186-17-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      811476e9
    • selftests/bpf: Use value with enough-size when updating per-cpu map · d79924ca
      Hou Tao authored
      When updating per-cpu map in map_percpu_stats test, patch_map_thread()
      only passes 4-bytes-sized value to bpf_map_update_elem(). The expected
      size of the value is 8 * num_possible_cpus(), so fix it by passing a
      value with enough size for the per-cpu map update.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231101032455.3808547-2-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d79924ca
    • bpf: unify 32-bit and 64-bit is_branch_taken logic · 4d345887
      Andrii Nakryiko authored
      Combine 32-bit and 64-bit is_branch_taken logic for SCALAR_VALUE
      registers. It makes it easier to see parallels between two domains
      (32-bit and 64-bit), and makes subsequent refactoring more
      straightforward.
      
      No functional changes.
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231102033759.2541186-16-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4d345887
    • Andrii Nakryiko's avatar
      bpf: generalize is_branch_taken to handle all conditional jumps in one place · b74c2a84
      Andrii Nakryiko authored
      Make is_branch_taken() a single entry point for branch pruning decision
      making, handling pointer vs pointer, pointer vs scalar, and scalar vs
      scalar cases in one place. This also nicely cleans up check_cond_jmp_op().
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231102033759.2541186-15-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b74c2a84
    • Andrii Nakryiko's avatar
      bpf: move is_branch_taken() down · c697289e
      Andrii Nakryiko authored
      Move is_branch_taken() slightly down. In subsequent patches we'll need
      both flip_opcode() and is_pkt_ptr_branch_taken() for is_branch_taken(),
      but instead of sprinkling forward declarations around, it makes more
      sense to move is_branch_taken() below is_pkt_ptr_branch_taken(),
      and also keep it closer to the very tightly related reg_set_min_max(), as
      they are two critical parts of the same SCALAR range tracking logic.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231102033759.2541186-14-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c697289e
    • Andrii Nakryiko's avatar
      bpf: generalize is_branch_taken() to work with two registers · c3153426
      Andrii Nakryiko authored
      While still assuming that the second register is a constant, generalize
      the is_branch_taken-related code to accept two registers instead of a
      register plus an explicit constant value. As a side effect, this also
      allows us to simplify check_cond_jmp_op() by unifying the BPF_K case
      with the BPF_X case, for which we use a fake register to represent
      BPF_K's imm constant as a register.
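      Conceptually (a simplified sketch, not the verifier's actual code; names
      are illustrative), the BPF_K case just materializes the sign-extended
      immediate as a known-constant scalar and reuses the two-register path:

      #include <stdint.h>

      struct range { uint64_t umin, umax; };   /* toy scalar register state */

      /* BPF_K: represent the sign-extended 32-bit immediate as a register
       * whose range is exactly [imm; imm]. */
      static struct range fake_const_reg(int32_t imm)
      {
              uint64_t v = (uint64_t)(int64_t)imm;
              struct range r = { .umin = v, .umax = v };
              return r;
      }

      /* "Is the branch taken?" for an unsigned "dst < src" comparison:
       * 1 = always, 0 = never, -1 = can't tell. */
      static int branch_taken_ult(const struct range *dst, const struct range *src)
      {
              if (dst->umax < src->umin)
                      return 1;
              if (dst->umin >= src->umax)
                      return 0;
              return -1;
      }

      With that, BPF_JLT dst, imm can be answered exactly like the BPF_X case,
      using fake = fake_const_reg(imm) as the second register.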
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231102033759.2541186-13-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c3153426
    • Andrii Nakryiko's avatar
      bpf: rename is_branch_taken reg arguments to prepare for the second one · c2a3ab09
      Andrii Nakryiko authored
      Just taking mundane refactoring bits out into a separate patch. No
      functional changes.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231102033759.2541186-12-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c2a3ab09
    • Andrii Nakryiko's avatar
      bpf: drop knowledge-losing __reg_combine_{32,64}_into_{64,32} logic · 9e314f5d
      Andrii Nakryiko authored
      When performing a 32-bit conditional operation on the lower 32 bits
      of a full 64-bit register, the register's full value isn't changed. We just
      potentially gain new knowledge about that register's lower 32 bits.
      
      Unfortunately, the __reg_combine_{32,64}_into_{64,32} logic that
      reg_set_min_max() performs as a last step can lose information in some
      cases due to __mark_reg64_unbounded() and __reg_assign_32_into_64().
      That's bad and completely unnecessary. __reg_assign_32_into_64() in
      particular looks completely out of place here, because we are not
      performing a zero-extending subregister assignment during a conditional jump.
      
      So this patch replaces __reg_combine_* with just a normal
      reg_bounds_sync(), which does a proper job of deriving u64/s64 bounds
      from u32/s32, and vice versa (among all other combinations).
      
      __reg_combine_64_into_32() is also used in one more place,
      coerce_reg_to_size(), while handling 1- and 2-byte register loads.
      Looking into this, it seems that besides marking the subregister as
      unbounded before performing reg_bounds_sync(), we were also deducing
      smin32/smax32 and umin32/umax32 bounds from the respective
      smin/smax and umin/umax bounds. That is now redundant, as reg_bounds_sync()
      performs all the same logic more generically (e.g., without the unnecessary
      assumption that the upper 32 bits of the full register should be zero).
      
      Long story short, we remove __reg_combine_64_into_32() completely, and
      coerce_reg_to_size() now only resets the subreg to unbounded and then
      performs reg_bounds_sync() to recover as much information as possible
      from the 64-bit umin/umax and smin/smax bounds set explicitly in
      coerce_reg_to_size() earlier.
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231102033759.2541186-10-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9e314f5d
    • Andrii Nakryiko's avatar
      bpf: try harder to deduce register bounds from different numeric domains · d7f00873
      Andrii Nakryiko authored
      There are cases (caught by subsequent reg_bounds tests in selftests/bpf)
      where performing one round of __reg_deduce_bounds() doesn't propagate
      all the information from, say, s32 to u32 bounds and then from the newly
      learned u32 bounds back to u64 and s64. So perform __reg_deduce_bounds()
      twice to make sure such derivations are propagated fully after
      reg_bounds_sync().
      
      One such example is the test `(s64)[0xffffffff00000001; 0] (u64)<
      0xffffffff00000000` from the selftest patch in this patch set. It demonstrates an
      intricate dance of u64 -> s64 -> u64 -> u32 bounds adjustments, which requires
      two rounds of __reg_deduce_bounds(). Here is the corresponding refinement log from
      the selftest, showing the evolution of knowledge.
      
      REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (u64)DST_OLD=[0; U64_MAX] (u64)DST_NEW=[0xffffffff00000000; U64_MAX]
      REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (s64)DST_OLD=[0xffffffff00000001; 0] (s64)DST_NEW=[0xffffffff00000001; -1]
      REFINING (FALSE R1) (s64)SRC=[0xffffffff00000001; -1] (u64)DST_OLD=[0xffffffff00000000; U64_MAX] (u64)DST_NEW=[0xffffffff00000001; U64_MAX]
      REFINING (FALSE R1) (u64)SRC=[0xffffffff00000001; U64_MAX] (u32)DST_OLD=[0; U32_MAX] (u32)DST_NEW=[1; U32_MAX]
      
      R1 initially has smin/smax set to [0xffffffff00000001; 0], while umin/umax is
      unknown. After the (u64)< comparison, in the FALSE branch we gain knowledge that
      umin/umax is [0xffffffff00000000; U64_MAX]. That causes smin/smax to learn that
      zero can't happen and the upper bound is -1. Then umin/umax is adjusted from
      smin/smax, improving the lower bound from 0xffffffff00000000 to 0xffffffff00000001.
      And then eventually umin32/umax32 bounds are derived from umin/umax and become
      [1; U32_MAX].
      
      The selftest in the last patch actually implements multi-round fixed-point
      convergence logic, but so far all the tests are handled by two rounds of
      reg_bounds_sync() on the verifier state, so we keep it simple for now.
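      As a self-contained sketch of the idea (not the verifier's actual
      __reg_deduce_bounds(), which covers more derivations), a single deduction
      pass in a fixed order can leave information unpropagated, so it is simply
      run twice:

      #include <stdint.h>

      struct bounds {
              uint64_t umin, umax;
              int64_t  smin, smax;
              uint32_t u32min, u32max;
      };

      static void deduce_once(struct bounds *b)
      {
              /* s64 -> u64: only valid if the signed range doesn't cross zero */
              if (b->smin >= 0 || b->smax < 0) {
                      if ((uint64_t)b->smin > b->umin) b->umin = (uint64_t)b->smin;
                      if ((uint64_t)b->smax < b->umax) b->umax = (uint64_t)b->smax;
              }
              /* u64 -> s64: only valid if the range doesn't cross the s64 sign boundary */
              if ((int64_t)b->umin <= (int64_t)b->umax) {
                      if ((int64_t)b->umin > b->smin) b->smin = (int64_t)b->umin;
                      if ((int64_t)b->umax < b->smax) b->smax = (int64_t)b->umax;
              }
              /* u64 -> u32: only valid if all values share the same upper 32 bits */
              if ((b->umin >> 32) == (b->umax >> 32)) {
                      if ((uint32_t)b->umin > b->u32min) b->u32min = (uint32_t)b->umin;
                      if ((uint32_t)b->umax < b->u32max) b->u32max = (uint32_t)b->umax;
              }
      }

      static void deduce_twice(struct bounds *b)
      {
              /* On the example above: pass 1 only learns smax = -1 from the new
               * u64 bounds; pass 2 then pushes that back into umin and down
               * into u32min = 1. */
              deduce_once(b);
              deduce_once(b);
      }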
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231102033759.2541186-9-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d7f00873
    • Andrii Nakryiko's avatar
      bpf: improve deduction of 64-bit bounds from 32-bit bounds · c51d5ad6
      Andrii Nakryiko authored
      Add a few interesting cases in which we can tighten 64-bit bounds based
      on newly learnt information about 32-bit bounds. E.g., when full u64/s64
      registers are used in a BPF program and then eventually compared as
      u32/s32. The latter comparison doesn't change the value of the full
      register, but it does impose new restrictions on the possible lower 32 bits
      of such full registers. And we can use that to derive additional full
      register bounds information.
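      For illustration (a simplified sketch, not the verifier's code): when
      every value in [umin; umax] shares the same upper 32 bits, newly learnt
      u32 bounds translate directly into tighter u64 bounds:

      #include <stdint.h>

      static void tighten_u64_from_u32(uint64_t *umin, uint64_t *umax,
                                       uint32_t u32min, uint32_t u32max)
      {
              uint64_t hi;

              if ((*umin >> 32) != (*umax >> 32))
                      return;         /* low halves may wrap: nothing safe to do */
              hi = *umin & 0xffffffff00000000ULL;
              if ((uint32_t)*umin < u32min)
                      *umin = hi | u32min;
              if ((uint32_t)*umax > u32max)
                      *umax = hi | u32max;
      }

      E.g., a u64 range [0x100000005; 0x1000000f0] plus a newly learnt u32 range
      [0x10; 0x80] tightens the u64 range to [0x100000010; 0x100000080].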
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231102033759.2541186-8-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c51d5ad6
    • Andrii Nakryiko's avatar
      bpf: add special smin32/smax32 derivation from 64-bit bounds · 6593f2e6
      Andrii Nakryiko authored
      Add a special case where we can derive valid s32 bounds from umin/umax
      or smin/smax by stitching together a negative s32 subrange and a
      non-negative s32 subrange. That requires the upper 32 bits to form an [N, N+1]
      range in the u32 domain (taking wrap around into account, so 0xffffffff
      to 0x00000000 is a valid [N, N+1] range in this sense). See the code comment
      for concrete examples.
      
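      Roughly (a simplified sketch, not the verifier's actual code), the
      derivation looks like this; e.g. umin = 0xffffffff, umax = 0x100000005
      yields the s32 range [-1; 5]:

      #include <stdint.h>

      /* Upper halves form [N; N+1], low(umin) falls in the negative s32
       * subrange and low(umax) in the non-negative one, so the low 32 bits
       * wrap exactly once and [(s32)umin; (s32)umax] is a valid s32 range. */
      static int derive_s32(uint64_t umin, uint64_t umax,
                            int32_t *s32min, int32_t *s32max)
      {
              uint32_t hi_min = umin >> 32, hi_max = umax >> 32;

              if (hi_min + 1 == hi_max &&        /* also handles 0xffffffff -> 0 wrap */
                  (int32_t)umin < 0 && (int32_t)umax >= 0) {
                      *s32min = (int32_t)umin;   /* e.g. 0xffffffff -> -1 */
                      *s32max = (int32_t)umax;   /* e.g. 0x00000005 ->  5 */
                      return 0;
              }
              return -1;                         /* no special knowledge gained */
      }
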
      Eduard Zingerman also provided an alternative explanation ([0]) for more
      mathematically inclined readers:
      
      Suppose:
      . there are numbers a, b, c
      . 2**31 <= b < 2**32
      . 0 <= c < 2**31
      . umin = 2**32 * a + b
      . umax = 2**32 * (a + 1) + c
      
      The number of values in the range represented by [umin; umax] is:
      . N = umax - umin + 1 = 2**32 + c - b + 1
      . min(N) = 2**32 + 0 - (2**32-1) + 1 = 2, with b = 2**32-1, c = 0
      . max(N) = 2**32 + (2**31 - 1) - 2**31 + 1 = 2**32, with b = 2**31, c = 2**31-1
      
      Hence [(s32)b; (s32)c] forms a valid range.
      
        [0] https://lore.kernel.org/bpf/d7af631802f0cfae20df77fe70068702d24bbd31.camel@gmail.com/
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231102033759.2541186-7-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6593f2e6