1. 15 Nov, 2023 14 commits
    • Merge branch 'bpf-register-bounds-range-vs-range-support' · 9cea90c0
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      BPF register bounds range vs range support
      
      This patch set is a continuation of work started in [0]. It adds a big set of
      manual, auto-generated, and now also random test cases validating BPF
      verifier's register bounds tracking and deduction logic.
      
      The first few patches generalize the verifier's logic to handle conditional jumps
      and corresponding range adjustments in the case when two non-const registers are
      compared to each other. Patch #1 generalizes the reg_set_min_max() portion, while
      patch #2 does the same for the is_branch_taken() part of the overall solution.
      
      Patch #3 improves equality and inequality for cases when BPF program code
      mixes 64-bit and 32-bit uses of the same register. Depending on specific
      sequence, it's possible to get to the point where u64/s64 bounds will be very
      generic (e.g., after signed 32-bit comparison), while we still keep pretty
      tight u32/s32 bounds. If in such a state we proceed with 32-bit equality or
      inequality comparison, reg_set_min_max() might have to deal with adjusting s32
      bounds for two registers that don't overlap, which breaks reg_set_min_max().
      This doesn't manifest in <range> vs <const> cases, because if that happens
      reg_set_min_max() in effect will force s32 bounds to be a new "impossible"
      constant (from original smin32/smax32 bounds point of view). Things get tricky
      when we have <range> vs <range> adjustments, so instead of trying to somehow
      make sense out of such situations, it's best to detect such impossible
      situations and prune the branch that can't be taken in is_branch_taken()
      logic. This equality/inequality case was the only such category of situations
      found with the auto-generated tests added later in the patch set.
      
      But when we start mixing arithmetic operations in different numeric domains
      and conditionals, things get even hairier. So, patch #4 adds sanity checking
      logic after all ALU/ALU64, JMP/JMP32, and LDX operations. By default, instead
      of failing verification, we conservatively reset range bounds to unknown
      values, reporting the violation in the verifier log (if verbose logs are requested).
      But to aid development, detection, and debugging, we also introduce a new test
      flag, BPF_F_TEST_SANITY_STRICT, which triggers verification failure on range
      sanity violation.
      
      Patch #11 sets BPF_F_TEST_SANITY_STRICT by default for test_progs and
      test_verifier. Patch #12 adds support for controlling this in veristat for
      testing with production BPF object files.
      
      Getting back to the BPF verifier, patches #5 and #6 complete the verifier's range
      tracking logic cleanup. See the respective patches for details.
      
      With kernel-side taken care of, we move to testing. We start with building
      a tester that validates existing <range> vs <scalar> verifier logic for range
      bounds. Patch #7 implements an initial version of such a tester. We guard
      millions of generated tests behind a SLOW_TESTS=1 envvar requirement, but also
      have a relatively small number of tricky cases that came up during development
      and debugging of this work. Those will be executed as part of a normal
      test_progs run.
      
      Patch #8 simulates the more nuanced JEQ/JNE logic we added to the verifier in patch #3.
      Patch #9 adds <range> vs <range> "slow tests".
      
      Patch #10 is a completely new one; it adds a bunch of randomly generated cases
      to be run normally, without the SLOW_TESTS=1 guard. This should help get a bunch
      of coverage and hopefully find some remaining latent problems in the verifier
      proactively, as part of normal BPF CI runs.
      
      Finally, a tiny test which was, amazingly, an initial motivation for this
      whole work is added in lucky patch #13, demonstrating how the verifier is now
      smart enough to track the actual number of elements in the array and won't
      require additional checks on the loop iteration variable inside the bpf_for()
      open-coded iterator loop.
      
        [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=798308&state=*
      
      v1->v2:
        - use x < y => y > x property to minimize reg_set_min_max (Eduard);
        - fix for JEQ/JNE logic in reg_bounds.c (Eduard);
        - split BPF_JSET and !BPF_JSET cases handling (Shung-Hsi);
        - adjustments to reg_bounds.c to make it easier to follow (Alexei);
        - added acks (Eduard, Shung-Hsi).
      ====================
      
      Link: https://lore.kernel.org/r/20231112010609.848406-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add iter test requiring range x range logic · 882e3d87
      Andrii Nakryiko authored
      Add a simple verifier test that requires deriving reg bounds for one
      register from another register that's not a constant. This is
      a realistic example of iterating elements of an array with fixed maximum
      number of elements, but smaller actual number of elements.
      
      This small example was an original motivation for doing this whole patch
      set in the first place, yes.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231112010609.848406-14-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag · a5c57f81
      Andrii Nakryiko authored
      Add a new flag -r (--test-sanity), similar to -t (--test-states), to add
      extra BPF program flags when loading BPF programs.
      
      This allows using veristat to easily catch sanity violations in
      production BPF programs.
      
      reg_bounds tests now also enforce the BPF_F_TEST_SANITY_STRICT flag.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231112010609.848406-13-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: set BPF_F_TEST_SANITY_STRICT by default · 8c5677f8
      Andrii Nakryiko authored
      Make sure to set BPF_F_TEST_SANITY_STRICT program flag by default across
      most verifier tests (and a bunch of others that set custom prog flags).
      
      There are currently two tests that do fail validation, if enforced
      strictly: verifier_bounds/crossing_64_bit_signed_boundary_2 and
      verifier_bounds/crossing_32_bit_signed_boundary_2. To accommodate them,
      we teach test_loader a flag negation:
      
      __flag(!<flagname>) will *clear* specified flag, allowing easy opt-out.
      
      We apply __flag(!BPF_F_TEST_SANITY_STRICT) to these two tests.
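
      A hedged sketch of what the opt-out looks like, assuming the __flag() and
      __description() macros from progs/bpf_misc.h (test body elided):

        SEC("socket")
        __description("crossing 64-bit signed boundary, 2")
        __success __retval(0)
        __flag(!BPF_F_TEST_SANITY_STRICT) /* clear the default-on strict flag */
        __naked void crossing_64_bit_signed_boundary_2(void)
        {
                /* ... */
        }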
      
      Also sprinkle BPF_F_TEST_SANITY_STRICT everywhere where we already set
      test-only BPF_F_TEST_RND_HI32 flag, for completeness.
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231112010609.848406-12-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add randomized reg_bounds tests · dab16659
      Andrii Nakryiko authored
      Add random case generation to reg_bounds.c and run the generated cases without
      SLOW_TESTS=1 to increase the chance of BPF CI catching latent issues.
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231112010609.848406-11-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add range x range test to reg_bounds · 2b0d204e
      Andrii Nakryiko authored
      Now that the verifier supports range vs range bounds adjustments, validate
      that by checking each generated range against every other generated
      range, across all supported operators (everything but JSET).
      
      We also add a few cases that were problematic during development, either
      for the verifier or for the selftest's range tracking implementation.
      
      Note that we utilize the same trick of splitting everything into
      multiple independent parallelizable tests, by init_t and cond_t. This
      brings down verification time in parallel mode from more than 8 hours
      to less than 1.5 hours. 106 million cases were successfully
      validated for the range vs range logic, in addition to about 7 million range
      vs const cases added in an earlier patch.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231112010609.848406-10-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: adjust OP_EQ/OP_NE handling to use subranges for branch taken · 774f94c5
      Andrii Nakryiko authored
      Similar to kernel-side BPF verifier logic enhancements, use 32-bit
      subrange knowledge for is_branch_taken() logic in reg_bounds selftests.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231112010609.848406-9-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: BPF register range bounds tester · 88632389
      Andrii Nakryiko authored
      Add a test to validate the BPF verifier's register range bounds tracking logic.
      
      The main bulk is a lot of auto-generated tests based on a small set of
      seed values for lower and upper 32 bits of full 64-bit values.
      Currently we validate only range vs const comparisons, but the idea is
      to start validating range over range comparisons in subsequent patches.
      
      When setting up initial register ranges we treat registers as one of the
      u64/s64/u32/s32 numeric types, and then independently perform conditional
      comparisons based on a potentially different u64/s64/u32/s32 type. This
      tests lots of tricky cases of deriving bounds information across
      different numeric domains.
      
      Given there are lots of auto-generated cases, we guard them behind the
      SLOW_TESTS=1 envvar requirement, and skip them altogether otherwise.
      With the current full set of upper/lower seed values, all supported
      comparison operators, and all the combinations of u64/s64/u32/s32 number
      domains, we get about 7.7 million tests, which run in about 35 minutes
      on my local qemu instance without parallelization. But we also split
      those tests by init/cond numeric types, which allows relying on
      test_progs's parallelization of tests with the `-j` option, getting run time
      down to about 5 minutes on 8 cores. It's still something that shouldn't
      be run during a normal test_progs run. But we can run it in a reasonable
      time, so perhaps a nightly CI test run (once we have it) would be
      a good option for this.
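
      For example, the slow generated cases might be run like this (the exact
      invocation is illustrative):

        $ sudo SLOW_TESTS=1 ./test_progs -t reg_bounds -j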
      
      We also add a small set of tricky conditions that came up during
      development and triggered various bugs or corner cases in either the
      selftest's reimplementation of range bounds logic or in the verifier's logic
      itself. These are fast enough to be run as part of a normal test_progs
      run and are great for quick sanity checking.
      
      Let's take a look at test output to understand what's going on:
      
        $ sudo ./test_progs -t reg_bounds_crafted
        #191/1   reg_bounds_crafted/(u64)[0; 0xffffffff] (u64)< 0:OK
        ...
        #191/115 reg_bounds_crafted/(u64)[0; 0x17fffffff] (s32)< 0:OK
        ...
        #191/137 reg_bounds_crafted/(u64)[0xffffffff; 0x100000000] (u64)== 0:OK
      
      Each test case is uniquely and fully described by this generated string.
      E.g.: "(u64)[0; 0x17fffffff] (s32)< 0". This means that we
      initialize a register (R6) in such a way that verifier knows that it can
      have a value in [(u64)0; (u64)0x17fffffff] range. Another
      register (R7) is also set up as u64, but this time a constant (zero in
      this case). They are then compared using a 32-bit signed < operation.
      The resulting TRUE/FALSE branches are evaluated (including cases where it's
      known that one of the branches will never be taken, in which case we
      validate that the verifier also determines this to be dead code). The test
      validates that the verifier's final register state matches the expected state
      based on the selftest's own reg_state logic, implemented from scratch for
      cross-checking purposes.
      
      These test names can be conveniently used for further debugging, and if -vv
      verbosity is requested we can get a corresponding verifier log (with
      mark_precise logs filtered out as irrelevant and distracting). The example below
      is slightly redacted for brevity, omitting irrelevant register output in
      some places, marked with [...].
      
        $ sudo ./test_progs -a 'reg_bounds_crafted/(u32)[0; U32_MAX] (s32)< -1' -vv
        ...
        VERIFIER LOG:
        ========================
        func#0 @0
        0: R1=ctx(off=0,imm=0) R10=fp0
        0: (05) goto pc+2
        3: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
        4: (bc) w6 = w0                       ; R0_w=scalar() R6_w=scalar(smin=0,smax=umax=4294967295,var_off=(0x0; 0xffffffff))
        5: (85) call bpf_get_current_pid_tgid#14      ; R0_w=scalar()
        6: (bc) w7 = w0                       ; R0_w=scalar() R7_w=scalar(smin=0,smax=umax=4294967295,var_off=(0x0; 0xffffffff))
        7: (b4) w1 = 0                        ; R1_w=0
        8: (b4) w2 = -1                       ; R2=4294967295
        9: (ae) if w6 < w1 goto pc-9
        9: R1=0 R6=scalar(smin=0,smax=umax=4294967295,var_off=(0x0; 0xffffffff))
        10: (2e) if w6 > w2 goto pc-10
        10: R2=4294967295 R6=scalar(smin=0,smax=umax=4294967295,var_off=(0x0; 0xffffffff))
        11: (b4) w1 = -1                      ; R1_w=4294967295
        12: (b4) w2 = -1                      ; R2_w=4294967295
        13: (ae) if w7 < w1 goto pc-13        ; R1_w=4294967295 R7=4294967295
        14: (2e) if w7 > w2 goto pc-14
        14: R2_w=4294967295 R7=4294967295
        15: (bc) w0 = w6                      ; [...] R6=scalar(id=1,smin=0,smax=umax=4294967295,var_off=(0x0; 0xffffffff))
        16: (bc) w0 = w7                      ; [...] R7=4294967295
        17: (ce) if w6 s< w7 goto pc+3        ; R6=scalar(id=1,smin=0,smax=umax=4294967295,smin32=-1,var_off=(0x0; 0xffffffff)) R7=4294967295
        18: (bc) w0 = w6                      ; [...] R6=scalar(id=1,smin=0,smax=umax=4294967295,smin32=-1,var_off=(0x0; 0xffffffff))
        19: (bc) w0 = w7                      ; [...] R7=4294967295
        20: (95) exit
      
        from 17 to 21: [...]
        21: (bc) w0 = w6                      ; [...] R6=scalar(id=1,smin=umin=umin32=2147483648,smax=umax=umax32=4294967294,smax32=-2,var_off=(0x80000000; 0x7fffffff))
        22: (bc) w0 = w7                      ; [...] R7=4294967295
        23: (95) exit
      
        from 13 to 1: [...]
        1: [...]
        1: (b7) r0 = 0                        ; R0_w=0
        2: (95) exit
        processed 24 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
        =====================
      
      The verifier log above is for the `(u32)[0; U32_MAX] (s32)< -1` case, where a u32
      range is used for initialization, followed by the signed < operator. Note
      how we use w6/w7 in this case for register initialization (it would be
      R6/R7 for 64-bit types) and then `if w6 s< w7` for the comparison at
      instruction #17. It would be `if R6 < R7` for a 64-bit unsigned comparison.
      The above example gives a good impression of the overall structure of the BPF
      programs generated for reg_bounds tests.
      
      In the future, this "framework" can be extended to test not just
      conditional jumps, but also arithmetic operations. Adding randomized
      testing is another possibility.
      
      Some implementation notes. We basically have our own generics-like
      operations on numbers, where all the numbers are stored in u64, but how
      they are interpreted is passed as a runtime argument of enum num_t. Further,
      `struct range` represents a bounds range, and those are collected
      together into a minimal `struct reg_state`, which collects range bounds
      across all four numerical domains: u64, s64, u32, s32.
      
      Based on these primitives and `enum op`, representing the possible
      conditional operations (<, <=, >, >=, ==, !=), there is a set of generic
      helpers to perform "range arithmetics", which is used to maintain struct
      reg_state. We simulate what the verifier will do for the reg bounds of the
      R6 and R7 registers using these range and reg_state primitives. The simulated
      information is used to determine the branch taken conclusion and the expected
      exact register state across all four number domains.
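
      A rough sketch of these primitives (shapes are assumptions that mirror, but
      don't exactly reproduce, the selftest's reg_bounds.c; kernel-style u64/s64
      typedefs assumed):

        enum num_t { U64, U32, S64, S32 };

        struct range {
                u64 a, b; /* [a, b], interpreted according to some num_t */
        };

        struct reg_state {
                struct range r[4]; /* U64, U32, S64, S32 bounds */
                bool valid;
        };

        /* generic three-way comparison within a given numeric domain */
        static int cmp_num(enum num_t t, u64 x, u64 y)
        {
                switch (t) {
                case U64: return x < y ? -1 : x > y;
                case U32: return (u32)x < (u32)y ? -1 : (u32)x > (u32)y;
                case S64: return (s64)x < (s64)y ? -1 : (s64)x > (s64)y;
                case S32: return (s32)x < (s32)y ? -1 : (s32)x > (s32)y;
                }
                return 0;
        }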
      
      Implementation of "range arithmetics" is more generic than what verifier
      is currently performing: it allows range over range comparisons and
      adjustments. This is the intended end goal of this patch set overall and verifier
      logic is enhanced in subsequent patches in this series to handle range
      vs range operations, at which point selftests are extended to validate
      these conditions as well. For now it's range vs const cases only.
      
      Note that tests are split into multiple groups by their numeric types
      for initialization of ranges and for the comparison operation. This allows
      using test_progs's -j parallelization to speed up tests, as we now have
      16 groups of tests running in parallel. The overall reduction in running time
      this allows is pretty good: we go down from more than 30 minutes to
      slightly less than 5 minutes of running time.
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231112010609.848406-8-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: make __reg{32,64}_deduce_bounds logic more robust · cf5fe3c7
      Andrii Nakryiko authored
      This change doesn't seem to have any effect on selftests and production
      BPF object files, but we preemptively try to make it more robust.
      
      First, "learn sign from signed bounds" comment is misleading, as we are
      learning not just sign, but also values.
      
      Second, we simplify the check for determining whether the entire range is
      positive or negative, similarly to other checks added earlier, using the
      appropriate u32/u64 cast and a single comparison. As explained in the comments
      in __reg64_deduce_bounds(), the checks are equivalent.
      
      Last but not least, the smin/smax and s32_min/s32_max reassignment based on
      the min/max of both umin/umax and smin/smax (and 32-bit equivalents) is hard
      to explain and justify. We are updating unsigned bounds from signed
      bounds, so why would we update signed bounds at the same time? This might
      be correct, but it's far from obvious why, and neither the code nor the
      comments try to justify it. Given we've added a separate deduction of signed
      bounds from unsigned bounds earlier, this seems at least redundant, if
      not just wrong.
      
      In short, we remove doubtful pieces, and streamline the rest to follow
      the logic and approach of the rest of reg_bounds_sync() checks.
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231112010609.848406-7-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: remove redundant s{32,64} -> u{32,64} deduction logic · 3cf98cf5
      Andrii Nakryiko authored
      Equivalent checks were recently added in more succinct and, arguably,
      safer form in:
        - f188765f23a5 ("bpf: derive smin32/smax32 from umin32/umax32 bounds");
        - 2e74aef782d3 ("bpf: derive smin/smax from umin/max bounds").
      
      The checks we are removing in this patch similarly detect whether the
      entire u32/u64 range has the sign bit set or not set, but do it with
      two separate checks.
      
      Further, we forcefully overwrite either smin or smax (and the 32-bit equivalents)
      without applying the normal min/max intersection logic. It's not clear why
      that would be correct in all cases and seems to work by accident. This
      logic is also "gated" by previous signed -> unsigned derivation, which
      returns early.
      
      All this is quite confusing and seems error-prone, while we already have
      at least equivalent checks happening earlier. So remove this duplicate
      and error-prone logic to simplify things a bit.
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231112010609.848406-6-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add register bounds sanity checks and sanitization · 5f99f312
      Andrii Nakryiko authored
      Add simple sanity checks that validate well-formed ranges (min <= max)
      across u64, s64, u32, and s32 ranges. Also for cases when the value is
      constant (either 64-bit or 32-bit), we validate that ranges and tnums
      are in agreement.
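
      A hedged sketch of the invariants being enforced (simplified, not the exact
      kernel code):

        static int reg_bounds_sanity_check(struct bpf_reg_state *reg)
        {
                /* each numeric domain must describe a well-formed interval */
                if (reg->umin_value > reg->umax_value ||
                    reg->smin_value > reg->smax_value ||
                    reg->u32_min_value > reg->u32_max_value ||
                    reg->s32_min_value > reg->s32_max_value)
                        return -EFAULT; /* or, by default, reset to unknown */
                return 0;
        }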
      
      These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
      operations, on conditional jumps, and for LDX instructions (where subreg
      zero/sign extension is probably the most important to check). This
      covers most of the interesting cases.
      
      Also, we validate the sanity of the return register when manually
      adjusting it for some special helpers.
      
      By default, a sanity violation will trigger a warning in the verifier log and
      reset register bounds to "unbounded" ones. But to aid development
      and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will
      trigger hard failure of verification with -EFAULT on register bounds
      violations. This allows selftests to catch such issues. veristat will
      also gain a CLI option to enable this behavior.
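
      From the user side, a tool could opt a program into strict mode roughly like
      this (a sketch using existing libbpf setters; BPF_F_TEST_SANITY_STRICT is the
      flag added in this patch):

        __u32 flags = bpf_program__flags(prog);

        /* fail verification with -EFAULT on any range sanity violation */
        bpf_program__set_flags(prog, flags | BPF_F_TEST_SANITY_STRICT);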
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: enhance BPF_JEQ/BPF_JNE is_branch_taken logic · be41a203
      Andrii Nakryiko authored
      Use 32-bit subranges to prune some 64-bit BPF_JEQ/BPF_JNE conditions
      that otherwise would be "inconclusive" (i.e., is_branch_taken() would
      return -1). This can happen, for example, when registers are initialized
      as 64-bit u64/s64, then compared for inequality as 32-bit subregisters,
      and then followed by 64-bit equality/inequality check. That 32-bit
      inequality can establish some pattern for the lower 32 bits of a register
      (e.g., an s< 0 condition determines whether bit #31 is zero or not),
      while the overall 64-bit value could be anything (according to a value range
      representation).
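
      A hedged sketch of the added pruning for a 64-bit BPF_JEQ (simplified):

        /* if 32-bit subranges can't possibly intersect, the 64-bit values
         * can never be equal, so the "equal" branch is never taken
         */
        if (reg1->u32_min_value > reg2->u32_max_value ||
            reg1->u32_max_value < reg2->u32_min_value)
                return 0;
        /* same reasoning applies to the signed 32-bit subranges */
        if (reg1->s32_min_value > reg2->s32_max_value ||
            reg1->s32_max_value < reg2->s32_min_value)
                return 0;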
      
      This is not a fancy quirky special case, but actually handling that's
      necessary to prevent a correctness issue with the BPF verifier's range
      tracking: reg_set_min_max() assumes that register ranges are
      non-overlapping, and if that condition is not guaranteed by
      is_branch_taken() we can end up with invalid ranges, where min > max.
      
        [0] https://lore.kernel.org/bpf/CACkBjsY2q1_fUohD7hRmKGqv1MV=eP2f6XK8kjkYNw7BaiF8iQ@mail.gmail.com/

      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231112010609.848406-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: generalize is_scalar_branch_taken() logic · 96381879
      Andrii Nakryiko authored
      Generalize the is_branch_taken logic for SCALAR_VALUE registers to handle
      cases when both registers are not constants. The previously supported
      <range> vs <scalar> cases are a natural subset of the more generic <range>
      vs <range> set of cases.
      
      Generalized logic relies on straightforward segment intersection checks.
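
      A hedged sketch of the idea for an unsigned "less than" (not the verbatim
      kernel code):

        case BPF_JLT:
                if (reg1->umax_value < reg2->umin_value)
                        return 1;       /* branch is always taken */
                if (reg1->umin_value >= reg2->umax_value)
                        return 0;       /* branch is never taken */
                return -1;              /* ranges overlap: inconclusive */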
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231112010609.848406-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: generalize reg_set_min_max() to handle non-const register comparisons · 67420501
      Andrii Nakryiko authored
      Generalize the bounds adjustment logic of reg_set_min_max() to handle not
      just the register vs constant case, but in general any register vs any
      register case. For most of the operations it's a trivial extension based
      on range vs range comparison logic: we just need to properly pick the
      min/max of a range to compare against the min/max of the other range.
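
      A hedged sketch for the TRUE branch of an unsigned "reg1 < reg2" comparison
      (simplified, not the verbatim kernel code):

        case BPF_JLT:
                /* reg1 < reg2 held, so cap reg1's max and raise reg2's min */
                reg1->umax_value = min(reg1->umax_value, reg2->umax_value - 1);
                reg2->umin_value = max(reg2->umin_value, reg1->umin_value + 1);
                break;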
      
      For BPF_JSET we keep the original capabilities, just making sure JSET is
      integrated in the common framework. This is manifested in the
      internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more
      uniform rev_opcode() handling. See the code for details. This allows
      reusing exactly the same code for both TRUE and FALSE branches without
      explicitly handling both conditions with custom code.
      
      Note also that we now don't need special handling of the BPF_JEQ/BPF_JNE
      case where neither register is a constant. This is now just a normal
      generic case handled by reg_set_min_max().
      
      To make tnum handling cleaner, tnum_with_subreg() helper is added, as
      that's a common operator when dealing with 32-bit subregister bounds.
      This keeps the overall logic much less noisy when it comes to tnums.
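
      A plausible shape for the helper, assuming the existing tnum_clear_subreg()
      and tnum_subreg() primitives (a sketch, not necessarily the exact code):

        static struct tnum tnum_with_subreg(struct tnum reg, struct tnum subreg)
        {
                /* keep the upper 32 bits of 'reg', splice in the lower
                 * 32 bits of 'subreg'
                 */
                return tnum_or(tnum_clear_subreg(reg), tnum_subreg(subreg));
        }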
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/r/20231112010609.848406-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 14 Nov, 2023 8 commits
  3. 11 Nov, 2023 1 commit
    • selftests/bpf: Fix pyperf180 compilation failure with clang18 · 100888fb
      Yonghong Song authored
      With the latest clang18 (main branch of the llvm-project repo), when building bpf selftests,
          [~/work/bpf-next (master)]$ make -C tools/testing/selftests/bpf LLVM=1 -j
      
      The following compilation error happens:
          fatal error: error in backend: Branch target out of insn range
          ...
          Stack dump:
          0.      Program arguments: clang -g -Wall -Werror -D__TARGET_ARCH_x86 -mlittle-endian
            -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf/tools/include
            -I/home/yhs/work/bpf-next/tools/testing/selftests/bpf -I/home/yhs/work/bpf-next/tools/include/uapi
            -I/home/yhs/work/bpf-next/tools/testing/selftests/usr/include -idirafter
            /home/yhs/work/llvm-project/llvm/build.18/install/lib/clang/18/include -idirafter /usr/local/include
            -idirafter /usr/include -Wno-compare-distinct-pointer-types -DENABLE_ATOMICS_TESTS -O2 --target=bpf
            -c progs/pyperf180.c -mcpu=v3 -o /home/yhs/work/bpf-next/tools/testing/selftests/bpf/pyperf180.bpf.o
          1.      <eof> parser at end of file
          2.      Code generation
          ...
      
      The compilation failure only happens for cpu=v2 and cpu=v3; cpu=v4 is okay
      since cpu=v4 supports 32-bit branch target offsets.
      
      The above failure is due to the upstream llvm patch [1], where some inlining
      behavior changed in clang18.
      
      Previously, all 180 loop iterations were fully unrolled. To work around the
      issue, the bpf macro __BPF_CPU_VERSION__ (implemented in clang18 recently) is
      used to avoid unrolling changes if cpu=v4. If __BPF_CPU_VERSION__ is not
      available and the compiler is clang18, the unrolling amount is unconditionally
      reduced.
      
        [1] https://github.com/llvm/llvm-project/commit/1a2e77cf9e11dbf56b5720c607313a566eebb16e

      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Tested-by: Alan Maguire <alan.maguire@oracle.com>
      Link: https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev
  4. 10 Nov, 2023 17 commits
    • bpf: Add crosstask check to __bpf_get_stack · b8e3a87a
      Jordan Rome authored
      Currently get_perf_callchain only supports user stack walking for
      the current task. Passing the correct *crosstask* param will return
      0 frames if the task passed to __bpf_get_stack isn't the current
      one, instead of a single incorrect frame/address. This change
      passes the correct *crosstask* param, but also does a preemptive
      check in __bpf_get_stack for whether the task is current, returning
      -EOPNOTSUPP if it is not.
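
      A hedged sketch of the preemptive check (simplified):

        /* user stacks can only be walked for the current task */
        if ((flags & BPF_F_USER_STACK) && task != current)
                return -EOPNOTSUPP;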
      
      This issue was found using bpf_get_task_stack inside a BPF
      iterator ("iter/task"), which iterates over all tasks.
      bpf_get_task_stack works fine for fetching kernel stacks,
      but because get_perf_callchain relies on the caller to know
      whether the requested *task* is the current one (via *crosstask*),
      it was failing in a confusing way.
      
      It might be possible to get user stacks for all tasks by utilizing
      something like access_process_vm, but that requires the bpf
      program calling bpf_get_task_stack to be sleepable and would
      therefore be a breaking change.
      
      Fixes: fa28dcb8 ("bpf: Introduce helper bpf_get_task_stack()")
      Signed-off-by: Jordan Rome <jordalgo@meta.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
    • Merge branch 'for-6.8-bpf' of... · 92411764
      Alexei Starovoitov authored
      Merge branch 'for-6.8-bpf' of https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into bpf-next
      
      Merge cgroup prerequisite patches.
      
      Link: https://lore.kernel.org/bpf/20231029061438.4215-1-laoar.shao@gmail.com/
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • compiler-gcc: Suppress -Wmissing-prototypes warning for all supported GCC · 689b097a
      Yafang Shao authored
      The kernel supports a minimum GCC version of 5.1.0 for building. However,
      the "__diag_ignore_all" directive only suppresses the
      "-Wmissing-prototypes" warning for GCC versions >= 8.0.0. As a result, when
      building the kernel with older GCC versions, warnings may be triggered. The
      example below illustrates the warnings reported by the kernel test robot
      using GCC 7.5.0:
      
        compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
        All warnings (new ones prefixed by >>):
      
         kernel/bpf/helpers.c:1893:19: warning: no previous prototype for 'bpf_obj_new_impl' [-Wmissing-prototypes]
          __bpf_kfunc void *bpf_obj_new_impl(u64 local_type_id__k, void *meta__ign)
                            ^~~~~~~~~~~~~~~~
         kernel/bpf/helpers.c:1907:19: warning: no previous prototype for 'bpf_percpu_obj_new_impl' [-Wmissing-prototypes]
          __bpf_kfunc void *bpf_percpu_obj_new_impl(u64 local_type_id__k, void *meta__ign)
         [...]
      
      To address this, we should also suppress the "-Wmissing-prototypes" warning
      for older GCC versions. "#pragma GCC diagnostic push" is supported as
      of GCC 4.6, and both "-Wmissing-prototypes" and "-Wmissing-declarations"
      are supported for all the GCC versions that we currently support.
      Therefore, it is reasonable to suppress these warnings for all supported
      GCC versions.
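
      For reference, the suppression pattern in question looks roughly like this in
      kernel/bpf/helpers.c (a sketch of the existing usage, not new API):

        __diag_push();
        __diag_ignore_all("-Wmissing-prototypes",
                          "Global functions as their definitions will be in vmlinux BTF");
        __bpf_kfunc void *bpf_obj_new_impl(u64 local_type_id__k, void *meta__ign)
        {
                /* ... */
        }
        __diag_pop();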
      
      Note that with this adjustment, "__diag_ignore_all" will effectively
      suppress warnings for all the supported GCC versions.
      
      In the future, if you wish to suppress warnings that are only supported on
      higher GCC versions, it is advisable to explicitly use "__diag_ignore" to
      specify the GCC version you are targeting.
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202311031651.A7crZEur-lkp@intel.com/
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20231106031802.4188-1-laoar.shao@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Use named fields for certain bpf uapi structs · 155addf0
      Yonghong Song authored
      Martin and Vadim reported a verifier failure with bpf_dynptr usage.
      The issue is mentioned in [1], where Vadim worked around it with a source
      change. The below describes what the issue is and why there
      is a verification failure.
      
        int BPF_PROG(skb_crypto_setup) {
          struct bpf_dynptr algo, key;
          ...
      
          bpf_dynptr_from_mem(..., ..., 0, &algo);
          ...
        }
      
      The bpf program is using vmlinux.h, so we have the following definition in
      vmlinux.h:
        struct bpf_dynptr {
              long: 64;
              long: 64;
        };
      Note that in uapi header bpf.h, we have
        struct bpf_dynptr {
              long: 64;
              long: 64;
      } __attribute__((aligned(8)));
      
      So we lost alignment information for struct bpf_dynptr by using vmlinux.h.
      Let us take a look at a simple program below:
        $ cat align.c
        typedef unsigned long long __u64;
        struct bpf_dynptr_no_align {
              __u64 :64;
              __u64 :64;
        };
        struct bpf_dynptr_yes_align {
              __u64 :64;
              __u64 :64;
        } __attribute__((aligned(8)));
      
        void bar(void *, void *);
        int foo() {
          struct bpf_dynptr_no_align a;
          struct bpf_dynptr_yes_align b;
          bar(&a, &b);
          return 0;
        }
        $ clang --target=bpf -O2 -S -emit-llvm align.c
      
      Look at the generated IR file align.ll:
        ...
        %a = alloca %struct.bpf_dynptr_no_align, align 1
        %b = alloca %struct.bpf_dynptr_yes_align, align 8
        ...
      
      The compiler dictates that the alignment for struct bpf_dynptr_no_align is 1 and
      the alignment for struct bpf_dynptr_yes_align is 8. So theoretically the compiler
      could allocate variable %a with alignment 1, although in reality the compiler
      may choose a different alignment by considering other local variables.
      
      In [1], the verification failure happens because variable 'algo' is allocated
      on the stack with alignment 4 (fp-28), but the verifier wants its alignment
      to be 8.
      
      To fix the issue, the RFC patch ([2]) tried to add '__attribute__((aligned(8)))'
      to struct bpf_dynptr plus other similar structs. Andrii suggested that
      we could directly modify the uapi structs with named fields, like struct 'bpf_iter_num':
        struct bpf_iter_num {
              /* opaque iterator state; having __u64 here allows to preserve correct
               * alignment requirements in vmlinux.h, generated from BTF
               */
              __u64 __opaque[1];
        } __attribute__((aligned(8)));
      
      Indeed, adding named fields to the affected structs in this patch preserves
      alignment when a bpf program references them in vmlinux.h. With this patch,
      the verification failure in [1] is also resolved.
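
      Applied to bpf_dynptr, the pattern plausibly looks like this (a sketch
      following the bpf_iter_num example above):

        struct bpf_dynptr {
                /* named opaque field preserves the 8-byte alignment
                 * requirement in BTF-generated vmlinux.h
                 */
                __u64 __opaque[2];
        } __attribute__((aligned(8)));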
      
        [1] https://lore.kernel.org/bpf/1b100f73-7625-4c1f-3ae5-50ecf84d3ff0@linux.dev/
        [2] https://lore.kernel.org/bpf/20231103055218.2395034-1-yonghong.song@linux.dev/
      
      Cc: Vadim Fedorenko <vadfed@meta.com>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231104024900.1539182-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'allow-bpf_refcount_acquire-of-mapval-obtained-via-direct-ld' · 3f6d04d7
      Alexei Starovoitov authored
      Dave Marchevsky says:
      
      ====================
      Allow bpf_refcount_acquire of mapval obtained via direct LD
      
      Consider this BPF program:
      
        struct cgv_node {
          int d;
          struct bpf_refcount r;
          struct bpf_rb_node rb;
        };
      
        struct val_stash {
          struct cgv_node __kptr *v;
        };
      
        struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __type(key, int);
          __type(value, struct val_stash);
          __uint(max_entries, 10);
        } array_map SEC(".maps");
      
        long bpf_program(void *ctx)
        {
          struct val_stash *mapval;
          struct cgv_node *p;
          int idx = 0;
      
          mapval = bpf_map_lookup_elem(&array_map, &idx);
          if (!mapval || !mapval->v) { /* omitted */ }
      
          p = bpf_refcount_acquire(mapval->v); /* Verification FAILs here */
      
          /* Add p to some tree */
          return 0;
        }
      
      Verification fails on the refcount_acquire:
      
        160: (79) r1 = *(u64 *)(r8 +8)        ; R1_w=untrusted_ptr_or_null_cgv_node(id=11,off=0,imm=0) R8_w=map_value(id=10,off=0,ks=8,vs=16,imm=0) refs=6
        161: (b7) r2 = 0                      ; R2_w=0 refs=6
        162: (85) call bpf_refcount_acquire_impl#117824
        arg#0 is neither owning or non-owning ref
      
      The above verifier dump is actually from sched_ext's scx_flatcg [0],
      which is the motivating use case for this series' changes. Specifically,
      scx_flatcg stashes a rb_node type w/ cgroup-specific info (struct
      cgv_node) in a map when the cgroup is created, then later puts that
      cgroup's node in a rbtree in .enqueue . Making struct cgv_node
      refcounted would simplify the code a bit by virtue of allowing us to
      remove the kptr_xchg's, but "later puts that cgroup's node in a rbtree"
      is not possible without a refcount_acquire, which suffers from the above
      verification failure.
      
      If we get rid of PTR_UNTRUSTED flag, and add MEM_ALLOC | NON_OWN_REF,
      mapval->v would be a non-owning ref and verification would succeed. Due
      to the most recent set of refcount changes [1], which modified
      bpf_obj_drop behavior to not reuse refcounted graph node's underlying
      memory until after RCU grace period, this is safe to do. Once mapval->v
      has the correct flags it _is_ a non-owning reference and verification of
      the motivating example will succeed.
      
        [0]: https://github.com/sched-ext/sched_ext/blob/52911e1040a0f94b9c426dddcc00be5364a7ae9f/tools/sched_ext/scx_flatcg.bpf.c#L275
        [1]: https://lore.kernel.org/bpf/20230821193311.3290257-1-davemarchevsky@fb.com/
      
      Summary of patches:
        * Patch 1 fixes an issue with bpf_refcount_acquire verification
          letting MAYBE_NULL ptrs through
          * Patch 2 tests Patch 1's fix
        * Patch 3 broadens the use of "free only after RCU GP" to all
          user-allocated types
        * Patch 4 is a small nonfunctional refactoring
        * Patch 5 changes verifier to mark direct LD of stashed graph node
          kptr as non-owning ref
          * Patch 6 tests Patch 5's verifier changes
      
      Changelog:
      
      v1 -> v2: https://lore.kernel.org/bpf/20231025214007.2920506-1-davemarchevsky@fb.com/
      
      Series title changed to "Allow bpf_refcount_acquire of mapval obtained via
      direct LD". V1's title was mistakenly truncated.
      
        * Patch 5 ("bpf: Mark direct ld of stashed bpf_{rb,list}_node as non-owning ref")
          * Direct LD of percpu kptr should not have MEM_ALLOC flag (Yonghong)
        * Patch 6 ("selftests/bpf: Test bpf_refcount_acquire of node obtained via direct ld")
          * Test read from stashed value as well
      ====================
      
      Link: https://lore.kernel.org/r/20231107085639.3016113-1-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: replace register_is_const() with is_reg_const() · 82ce364c
      Shung-Hsi Yu authored
      The addition of is_reg_const() in commit 171de12646d2 ("bpf: generalize
      is_branch_taken to handle all conditional jumps in one place") has made
      register_is_const() redundant. Given the former has more features, plus the
      fact that the latter is only used in one place, replace register_is_const()
      with is_reg_const(), and remove the definition of register_is_const().
      
      This requires moving the definition of is_reg_const() further up. And since
      the comment of reg_const_value() references is_reg_const(), move it up as
      well.
      Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231108140043.12282-1-shung-hsi.yu@suse.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test bpf_refcount_acquire of node obtained via direct ld · e9ed8df7
      Dave Marchevsky authored
      This patch demonstrates that verifier changes earlier in this series
      result in bpf_refcount_acquire(mapval->stashed_kptr) passing
      verification. The added test additionally validates that stashing a kptr
      in mapval and - in a separate BPF program - refcount_acquiring the kptr
      without unstashing works as expected at runtime.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-7-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: add ability to filter top N results · 27007fae
      Andrii Nakryiko authored
      Add the ability to filter the top N results, both in replay/verifier mode and
      comparison mode. Just adding `-n10` will emit only the first 10 rows, or
      fewer if there are not enough rows.
      
      This is not just a shortcut for passing veristat output through
      `head`, though. Filtering out all the other rows influences the final table
      formatting, as table column widths are calculated based on the actually
      emitted rows.
      
      To demonstrate the difference, compare two "equivalent" forms below, one
      using head and another using -n argument.
      
      TOP N FEATURE
      =============
      [vmuser@archvm bpf]$ sudo ./veristat -C ~/baseline-results-selftests.csv ~/sanity2-results-selftests.csv -e file,prog,insns,states -s '|insns_diff|' -n10
      File                                      Program                Insns (A)  Insns (B)  Insns (DIFF)  States (A)  States (B)  States (DIFF)
      ----------------------------------------  ---------------------  ---------  ---------  ------------  ----------  ----------  -------------
      test_seg6_loop.bpf.linked3.o              __add_egr_x                12440      12360  -80 (-0.64%)         364         357    -7 (-1.92%)
      async_stack_depth.bpf.linked3.o           async_call_root_check        145        145   +0 (+0.00%)           3           3    +0 (+0.00%)
      async_stack_depth.bpf.linked3.o           pseudo_call_check            139        139   +0 (+0.00%)           3           3    +0 (+0.00%)
      atomic_bounds.bpf.linked3.o               sub                            7          7   +0 (+0.00%)           0           0    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o  kmalloc                        5          5   +0 (+0.00%)           0           0    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o  sched_process_fork            22         22   +0 (+0.00%)           2           2    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o  socket_post_create            23         23   +0 (+0.00%)           2           2    +0 (+0.00%)
      bind4_prog.bpf.linked3.o                  bind_v4_prog                 358        358   +0 (+0.00%)          33          33    +0 (+0.00%)
      bind6_prog.bpf.linked3.o                  bind_v6_prog                 429        429   +0 (+0.00%)          37          37    +0 (+0.00%)
      bind_perm.bpf.linked3.o                   bind_v4_prog                  15         15   +0 (+0.00%)           1           1    +0 (+0.00%)
      
      PIPING TO HEAD
      ==============
      [vmuser@archvm bpf]$ sudo ./veristat -C ~/baseline-results-selftests.csv ~/sanity2-results-selftests.csv -e file,prog,insns,states -s '|insns_diff|' | head -n12
      File                                                   Program                                               Insns (A)  Insns (B)  Insns (DIFF)  States (A)  States (B)  States (DIFF)
      -----------------------------------------------------  ----------------------------------------------------  ---------  ---------  ------------  ----------  ----------  -------------
      test_seg6_loop.bpf.linked3.o                           __add_egr_x                                               12440      12360  -80 (-0.64%)         364         357    -7 (-1.92%)
      async_stack_depth.bpf.linked3.o                        async_call_root_check                                       145        145   +0 (+0.00%)           3           3    +0 (+0.00%)
      async_stack_depth.bpf.linked3.o                        pseudo_call_check                                           139        139   +0 (+0.00%)           3           3    +0 (+0.00%)
      atomic_bounds.bpf.linked3.o                            sub                                                           7          7   +0 (+0.00%)           0           0    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o               kmalloc                                                       5          5   +0 (+0.00%)           0           0    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o               sched_process_fork                                           22         22   +0 (+0.00%)           2           2    +0 (+0.00%)
      bench_local_storage_create.bpf.linked3.o               socket_post_create                                           23         23   +0 (+0.00%)           2           2    +0 (+0.00%)
      bind4_prog.bpf.linked3.o                               bind_v4_prog                                                358        358   +0 (+0.00%)          33          33    +0 (+0.00%)
      bind6_prog.bpf.linked3.o                               bind_v6_prog                                                429        429   +0 (+0.00%)          37          37    +0 (+0.00%)
      bind_perm.bpf.linked3.o                                bind_v4_prog                                                 15         15   +0 (+0.00%)           1           1    +0 (+0.00%)
      
      Note all the wasted whitespace in the "PIPING TO HEAD" variant.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231108051430.1830950-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Mark direct ld of stashed bpf_{rb,list}_node as non-owning ref · 1b121715
      Dave Marchevsky authored
      This patch enables the following pattern:
      
        /* mapval contains a __kptr pointing to refcounted local kptr */
        mapval = bpf_map_lookup_elem(&map, &idx);
        if (!mapval || !mapval->some_kptr) { /* omitted */ }
      
        p = bpf_refcount_acquire(&mapval->some_kptr);
      
      Currently this doesn't work because bpf_refcount_acquire expects an
      owning or non-owning ref. The verifier defines non-owning ref as a type:
      
        PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF
      
      while mapval->some_kptr is PTR_TO_BTF_ID | PTR_UNTRUSTED. It's possible
      to do the refcount_acquire by first bpf_kptr_xchg'ing mapval->some_kptr
      into a temp kptr, refcount_acquiring that, and xchg'ing back into
      mapval, but this is unwieldy and shouldn't be necessary.
      
      This patch modifies btf_ld_kptr_type such that user-allocated types are
      marked MEM_ALLOC and if those types have a bpf_{rb,list}_node they're
      marked NON_OWN_REF as well. Additionally, due to changes to
      bpf_obj_drop_impl earlier in this series, rcu_protected_object now
      returns true for all user-allocated types, resulting in
      mapval->some_kptr being marked MEM_RCU.
      
      After this patch's changes, mapval->some_kptr is now:
      
        PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
      
      which results in it passing the non-owning ref test, and the motivating
      example passing verification.
      
      Future work will likely get rid of special non-owning ref lifetime logic
      in the verifier, at which point we'll be able to delete the NON_OWN_REF
      flag entirely.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-6-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • veristat: add ability to sort by stat's absolute value · 5d4a7aac
      Andrii Nakryiko authored
      Add the ability to sort results by the absolute values of specified stats. This
      is especially useful for finding the biggest deviations in comparison mode. When
      comparing a verifier change's effect against a large base of BPF object
      files, it's necessary to see big changes in both positive and negative
      directions, as both might be a signal for regressions or bugs.
      
      The syntax is natural, e.g., adding `-s '|insns_diff|'^` will instruct
      veristat to sort by the absolute value of the instructions difference in
      ascending order.
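
      For example, in comparison mode (CSV file names are illustrative):

        $ sudo ./veristat -C baseline.csv new.csv -e file,prog,insns -s '|insns_diff|'^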
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231108051430.1830950-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Move GRAPH_{ROOT,NODE}_MASK macros into btf_field_type enum · 790ce3cf
      Dave Marchevsky authored
      This refactoring patch removes the unused BPF_GRAPH_NODE_OR_ROOT
      btf_field_type and moves BPF_GRAPH_{NODE,ROOT} macros into the
      btf_field_type enum. Further patches in the series will use
      BPF_GRAPH_NODE, so let's move this useful definition out of btf.c.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-5-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: Fix potential uninitialized tail padding with LIBBPF_OPTS_RESET · 7f7c4369
      Yonghong Song authored
      Martin reported that libbpf complains about non-zero-value tail
      padding with the LIBBPF_OPTS_RESET macro if struct bpf_netkit_opts is modified
      to have a 4-byte tail padding. This only happens with the clang compiler.
      The command line is: ./test_progs -t tc_netkit_multi_links
      Martin and I did some investigation and found this is indeed the case;
      the following are the investigation details.
      
      Clang:
        clang version 18.0.0
        <I tried clang15/16/17 and they all have similar results>
      
      tools/lib/bpf/libbpf_common.h:
        #define LIBBPF_OPTS_RESET(NAME, ...)                                      \
              do {                                                                \
                      memset(&NAME, 0, sizeof(NAME));                             \
                      NAME = (typeof(NAME)) {                                     \
                              .sz = sizeof(NAME),                                 \
                              __VA_ARGS__                                         \
                      };                                                          \
              } while (0)
      
        #endif
      
      tools/lib/bpf/libbpf.h:
        struct bpf_netkit_opts {
              /* size of this struct, for forward/backward compatibility */
              size_t sz;
              __u32 flags;
              __u32 relative_fd;
              __u32 relative_id;
              __u64 expected_revision;
              size_t :0;
        };
        #define bpf_netkit_opts__last_field expected_revision
      In the above struct bpf_netkit_opts, there is no tail padding.
      
      prog_tests/tc_netkit.c:
        static void serial_test_tc_netkit_multi_links_target(int mode, int target)
        {
              ...
              LIBBPF_OPTS(bpf_netkit_opts, optl);
              ...
              LIBBPF_OPTS_RESET(optl,
                      .flags = BPF_F_BEFORE,
                      .relative_fd = bpf_program__fd(skel->progs.tc1),
              );
              ...
        }
      
      Let us make the following source change; note that we now have a 4-byte
      tail padding:
        diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
        index 6cd9c501624f..0dd83910ae9a 100644
        --- a/tools/lib/bpf/libbpf.h
        +++ b/tools/lib/bpf/libbpf.h
        @@ -803,13 +803,13 @@ bpf_program__attach_tcx(const struct bpf_program *prog, int ifindex,
         struct bpf_netkit_opts {
              /* size of this struct, for forward/backward compatibility */
              size_t sz;
        -       __u32 flags;
              __u32 relative_fd;
              __u32 relative_id;
              __u64 expected_revision;
        +       __u32 flags;
              size_t :0;
         };
        -#define bpf_netkit_opts__last_field expected_revision
        +#define bpf_netkit_opts__last_field flags
      
      The clang 18 generated asm code looks like below:
          ;       LIBBPF_OPTS_RESET(optl,
          55e3: 48 8d 7d 98                   leaq    -0x68(%rbp), %rdi
          55e7: 31 f6                         xorl    %esi, %esi
          55e9: ba 20 00 00 00                movl    $0x20, %edx
          55ee: e8 00 00 00 00                callq   0x55f3 <serial_test_tc_netkit_multi_links_target+0x18d3>
          55f3: 48 c7 85 10 fd ff ff 20 00 00 00      movq    $0x20, -0x2f0(%rbp)
          55fe: 48 8b 85 68 ff ff ff          movq    -0x98(%rbp), %rax
          5605: 48 8b 78 18                   movq    0x18(%rax), %rdi
          5609: e8 00 00 00 00                callq   0x560e <serial_test_tc_netkit_multi_links_target+0x18ee>
          560e: 89 85 18 fd ff ff             movl    %eax, -0x2e8(%rbp)
          5614: c7 85 1c fd ff ff 00 00 00 00 movl    $0x0, -0x2e4(%rbp)
          561e: 48 c7 85 20 fd ff ff 00 00 00 00      movq    $0x0, -0x2e0(%rbp)
          5629: c7 85 28 fd ff ff 08 00 00 00 movl    $0x8, -0x2d8(%rbp)
          5633: 48 8b 85 10 fd ff ff          movq    -0x2f0(%rbp), %rax
          563a: 48 89 45 98                   movq    %rax, -0x68(%rbp)
          563e: 48 8b 85 18 fd ff ff          movq    -0x2e8(%rbp), %rax
          5645: 48 89 45 a0                   movq    %rax, -0x60(%rbp)
          5649: 48 8b 85 20 fd ff ff          movq    -0x2e0(%rbp), %rax
          5650: 48 89 45 a8                   movq    %rax, -0x58(%rbp)
          5654: 48 8b 85 28 fd ff ff          movq    -0x2d8(%rbp), %rax
          565b: 48 89 45 b0                   movq    %rax, -0x50(%rbp)
          ;       link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl);
      
      At -O0, the clang compiler creates an intermediate copy. Below,
      'flags' is stored with a 4-byte store, leaving the other 4 bytes in
      the same 8-byte-aligned storage undefined,
          5629: c7 85 28 fd ff ff 08 00 00 00 movl    $0x8, -0x2d8(%rbp)
      and later all 8 bytes are stored to the original zeroed buffer
          5654: 48 8b 85 28 fd ff ff          movq    -0x2d8(%rbp), %rax
          565b: 48 89 45 b0                   movq    %rax, -0x50(%rbp)
      
      This caused a problem as the 4-byte value at [%rbp-0x2dc, %rbp-0x2e0)
      may be garbage.
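      The root cause can be reproduced outside of libbpf. The following
      standalone sketch (struct layout chosen only to force 4 bytes of
      tail padding; not taken from the libbpf sources) shows why the
      memset alone is not enough: per the C standard, assigning a
      compound literal may leave the destination's padding bytes
      unspecified.

        #include <stdio.h>
        #include <string.h>

        struct opts {
              size_t sz;
              unsigned long long expected_revision;
              unsigned int flags;   /* 4 bytes of tail padding follow */
        };

        int main(void)
        {
              struct opts o;

              memset(&o, 0, sizeof(o));
              /* Padding bytes of the assigned value are unspecified,
               * so the zeroing above is not guaranteed to survive.
               */
              o = (struct opts) { .sz = sizeof(o), .flags = 8 };

              for (size_t i = 0; i < sizeof(o); i++)
                    printf("%02x ", ((unsigned char *)&o)[i]);
              printf("\n");
              return 0;
        }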
      
      gcc (11.4) does not have this issue, as it zeroes the struct first
      before doing the assignments:
        ;       LIBBPF_OPTS_RESET(optl,
          50fd: 48 8d 85 40 fc ff ff          leaq    -0x3c0(%rbp), %rax
          5104: ba 20 00 00 00                movl    $0x20, %edx
          5109: be 00 00 00 00                movl    $0x0, %esi
          510e: 48 89 c7                      movq    %rax, %rdi
          5111: e8 00 00 00 00                callq   0x5116 <serial_test_tc_netkit_multi_links_target+0x1522>
          5116: 48 8b 45 f0                   movq    -0x10(%rbp), %rax
          511a: 48 8b 40 18                   movq    0x18(%rax), %rax
          511e: 48 89 c7                      movq    %rax, %rdi
          5121: e8 00 00 00 00                callq   0x5126 <serial_test_tc_netkit_multi_links_target+0x1532>
          5126: 48 c7 85 40 fc ff ff 00 00 00 00      movq    $0x0, -0x3c0(%rbp)
          5131: 48 c7 85 48 fc ff ff 00 00 00 00      movq    $0x0, -0x3b8(%rbp)
          513c: 48 c7 85 50 fc ff ff 00 00 00 00      movq    $0x0, -0x3b0(%rbp)
          5147: 48 c7 85 58 fc ff ff 00 00 00 00      movq    $0x0, -0x3a8(%rbp)
          5152: 48 c7 85 40 fc ff ff 20 00 00 00      movq    $0x20, -0x3c0(%rbp)
          515d: 89 85 48 fc ff ff             movl    %eax, -0x3b8(%rbp)
          5163: c7 85 58 fc ff ff 08 00 00 00 movl    $0x8, -0x3a8(%rbp)
        ;       link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl);
      
      It is not clear how to change the compiler's code generation here,
      as the compiler generates correct code with respect to how the
      C standard treats unnamed padding. So this patch changes the
      LIBBPF_OPTS_RESET macro to avoid uninitialized tail padding. We
      already know the LIBBPF_OPTS macro works on both gcc and clang,
      even with tail padding. So LIBBPF_OPTS_RESET is changed to be
      a LIBBPF_OPTS followed by a memcpy(), thus avoiding uninitialized
      tail padding.
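      A sketch of the reworked macro's shape (the exact spelling in
      libbpf_common.h may differ; the point is a LIBBPF_OPTS-style
      temporary that is memcpy()'d into place, padding included):

        #define LIBBPF_OPTS_RESET(NAME, ...)                              \
              do {                                                        \
                      /* temporary built the same way LIBBPF_OPTS does */ \
                      typeof(NAME) ___##NAME = ({                         \
                              memset(&___##NAME, 0, sizeof(___##NAME));   \
                              (typeof(___##NAME)) {                       \
                                      .sz = sizeof(___##NAME),            \
                                      __VA_ARGS__                         \
                              };                                          \
                      });                                                 \
                      /* byte copy preserves the zeroed tail padding */   \
                      memcpy(&NAME, &___##NAME, sizeof(NAME));            \
              } while (0)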
      
      Below is the asm code generated with this patch, using the clang
      compiler:
          ;       LIBBPF_OPTS_RESET(optl,
          55e3: 48 8d bd 10 fd ff ff          leaq    -0x2f0(%rbp), %rdi
          55ea: 31 f6                         xorl    %esi, %esi
          55ec: ba 20 00 00 00                movl    $0x20, %edx
          55f1: e8 00 00 00 00                callq   0x55f6 <serial_test_tc_netkit_multi_links_target+0x18d6>
          55f6: 48 c7 85 10 fd ff ff 20 00 00 00      movq    $0x20, -0x2f0(%rbp)
          5601: 48 8b 85 68 ff ff ff          movq    -0x98(%rbp), %rax
          5608: 48 8b 78 18                   movq    0x18(%rax), %rdi
          560c: e8 00 00 00 00                callq   0x5611 <serial_test_tc_netkit_multi_links_target+0x18f1>
          5611: 89 85 18 fd ff ff             movl    %eax, -0x2e8(%rbp)
          5617: c7 85 1c fd ff ff 00 00 00 00 movl    $0x0, -0x2e4(%rbp)
          5621: 48 c7 85 20 fd ff ff 00 00 00 00      movq    $0x0, -0x2e0(%rbp)
          562c: c7 85 28 fd ff ff 08 00 00 00 movl    $0x8, -0x2d8(%rbp)
          5636: 48 8b 85 10 fd ff ff          movq    -0x2f0(%rbp), %rax
          563d: 48 89 45 98                   movq    %rax, -0x68(%rbp)
          5641: 48 8b 85 18 fd ff ff          movq    -0x2e8(%rbp), %rax
          5648: 48 89 45 a0                   movq    %rax, -0x60(%rbp)
          564c: 48 8b 85 20 fd ff ff          movq    -0x2e0(%rbp), %rax
          5653: 48 89 45 a8                   movq    %rax, -0x58(%rbp)
          5657: 48 8b 85 28 fd ff ff          movq    -0x2d8(%rbp), %rax
          565e: 48 89 45 b0                   movq    %rax, -0x50(%rbp)
          ;       link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl);
      
      In the above code, a temporary buffer is zeroed and then assigned
      the proper values. Finally, the values in the temporary buffer are
      copied to the original variable's buffer, hence the tail padding
      is guaranteed to be 0.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Tested-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/bpf/20231107201511.2548645-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7f7c4369
    • Dave Marchevsky's avatar
      bpf: Use bpf_mem_free_rcu when bpf_obj_dropping non-refcounted nodes · 649924b7
      Dave Marchevsky authored
      The use of bpf_mem_free_rcu to free refcounted local kptrs was added
      in commit 7e26cd12 ("bpf: Use bpf_mem_free_rcu when
      bpf_obj_dropping refcounted nodes"). In the cover letter for the
      series containing that patch [0] I commented:
      
          Perhaps it makes sense to move to mem_free_rcu for _all_
          non-owning refs in the future, not just refcounted. This might
          allow custom non-owning ref lifetime + invalidation logic to be
          entirely subsumed by MEM_RCU handling. IMO this needs a bit more
          thought and should be tackled outside of a fix series, so it's not
          attempted here.
      
      It's time to start moving in the "non-owning refs have MEM_RCU
      lifetime" direction. As mentioned in that comment, using
      bpf_mem_free_rcu for all local kptrs - not just refcounted - is
      necessarily the first step towards that goal. This patch does so.
      
      After this patch, the memory pointed to by all local kptrs will not
      be reused until an RCU grace period elapses. The verifier's
      understanding of non-owning ref validity and the clobbering logic
      it uses to enforce that understanding are not changed here; that
      will happen gradually in future work, including further patches in
      the series.
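      In effect, the shape of the change in __bpf_obj_drop_impl() is
      roughly the following (a sketch, not the verbatim helpers.c diff):

        -       if (rec && rec->refcount_off >= 0)
        -               bpf_mem_free_rcu(&bpf_global_ma, p);
        -       else
        -               bpf_mem_free(&bpf_global_ma, p);
        +       bpf_mem_free_rcu(&bpf_global_ma, p);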
      
        [0]: https://lore.kernel.org/all/20230821193311.3290257-1-davemarchevsky@fb.com/
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-4-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      649924b7
    • Dave Marchevsky's avatar
      selftests/bpf: Add test passing MAYBE_NULL reg to bpf_refcount_acquire · f460e7bd
      Dave Marchevsky authored
      The test added in this patch exercises the logic fixed in the
      previous patch in this series. Before the previous patch's changes,
      bpf_refcount_acquire accepted MAYBE_NULL local kptrs; after the
      change, the verifier correctly rejects such a call.
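      A sketch of the failing pattern the test exercises (names and the
      section are illustrative, not the exact selftest source): a local
      kptr fresh from bpf_obj_new() is possibly NULL until checked, so
      passing it straight to bpf_refcount_acquire() must be rejected.

        struct node_data {
              long key;
              struct bpf_refcount ref;
              struct bpf_rb_node node;
        };

        SEC("?tc")
        int refcount_acquire_maybe_null(void *ctx)
        {
              struct node_data *n, *m;

              n = bpf_obj_new(typeof(*n));
              /* deliberately no "if (!n)" check: n is MAYBE_NULL */
              m = bpf_refcount_acquire(n); /* verifier must reject */
              if (m)
                    bpf_obj_drop(m);
              if (n)
                    bpf_obj_drop(n);
              return 0;
        }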
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20231107085639.3016113-3-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f460e7bd
    • Dave Marchevsky's avatar
      bpf: Add KF_RCU flag to bpf_refcount_acquire_impl · 1500a5d9
      Dave Marchevsky authored
      Refcounted local kptrs are kptrs to user-defined types with a
      bpf_refcount field. Recent commits ([0], [1]) modified the lifetime of
      refcounted local kptrs such that the underlying memory is not reused
      until RCU grace period has elapsed.
      
      Separately, verification of bpf_refcount_acquire calls currently
      succeeds for MAYBE_NULL non-owning reference input, which is a problem
      as bpf_refcount_acquire_impl has no handling for this case.
      
      This patch takes advantage of the aforementioned lifetime changes
      to tag the bpf_refcount_acquire_impl kfunc KF_RCU, thereby
      preventing MAYBE_NULL input to the kfunc. The KF_RCU flag applies
      to all kfunc params; it's fine for it to apply to the void
      *meta__ign param, as that's populated by the verifier and is
      tagged __ign regardless.
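      The mechanical shape of the change is small; roughly (a sketch,
      assuming the kfunc's pre-existing flag set):

        -BTF_ID_FLAGS(func, bpf_refcount_acquire_impl, KF_ACQUIRE | KF_RET_NULL)
        +BTF_ID_FLAGS(func, bpf_refcount_acquire_impl, KF_ACQUIRE | KF_RET_NULL | KF_RCU)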
      
        [0]: commit 7e26cd12 ("bpf: Use bpf_mem_free_rcu when
             bpf_obj_dropping refcounted nodes") is the actual change to
             allocation behavior
        [1]: commit 0816b8c6 ("bpf: Consider non-owning refs to refcounted
             nodes RCU protected") modified verifier understanding of
             refcounted local kptrs to match [0]'s changes
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Fixes: 7c50b1cb ("bpf: Add bpf_refcount_acquire kfunc")
      Link: https://lore.kernel.org/r/20231107085639.3016113-2-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1500a5d9
    • Andrii Nakryiko's avatar
      Merge branch 'bpf: __bpf_dynptr_data* and __str annotation' · b0d1c729
      Andrii Nakryiko authored
      Song Liu says:
      
      ====================
      This set contains the first 3 patches of set [1]. Currently, [1] is
      waiting for [3] to be merged into the bpf-next tree, so these
      3 patches are sent first to unblock other work, such as [2]. This
      set is verified with a new version of [1] in CI run [4].
      
      Changes since v12 of [1]:
      1. Reuse bpf_dynptr_slice() in __bpf_dynptr_data(). (Andrii)
      2. Add Acked-by from Vadim Fedorenko.
      
      [1] https://lore.kernel.org/bpf/20231104001313.3538201-1-song@kernel.org/
      [2] https://lore.kernel.org/bpf/20231031134900.1432945-1-vadfed@meta.com/
      [3] https://lore.kernel.org/bpf/20231031215625.2343848-1-davemarchevsky@fb.com/
      [4] https://github.com/kernel-patches/bpf/actions/runs/6779945063/job/18427926114
      ====================
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b0d1c729
    • Florian Lehner's avatar
      bpf, lpm: Fix check prefixlen before walking trie · 9b75dbeb
      Florian Lehner authored
      When looking up an element in an LPM trie, the condition 'matchlen
      == trie->max_prefixlen' can never become true if key->prefixlen is
      larger than trie->max_prefixlen. Consequently, all elements in the
      LPM trie are visited and no element is returned in the end.
      
      To resolve this, check key->prefixlen first before walking the LPM trie.
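      A sketch of where the check lands in trie_lookup_elem() (assumed
      placement; the actual kernel/bpf/lpm_trie.c diff may differ
      slightly):

        /* Reject impossible prefix lengths up front instead of
         * walking the whole trie and matching nothing.
         */
        if (key->prefixlen > trie->max_prefixlen)
              return NULL;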
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Signed-off-by: Florian Lehner <dev@der-flo.net>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231105085801.3742-1-dev@der-flo.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9b75dbeb