1. 04 Nov, 2022 25 commits
    • Andrii Nakryiko's avatar
      selftests/bpf: make test_align selftest more robust · 4f999b76
      Andrii Nakryiko authored
      test_align selftest relies on BPF verifier log emitting register states
      for specific instructions in expected format. Unfortunately, BPF
      verifier precision backtracking log interferes with such expectations.
      And instruction on which precision propagation happens sometimes don't
      output full expected register states. This does indeed look like
      something to be improved in BPF verifier, but is beyond the scope of
      this patch set.
      
      So to make test_align a bit more robust, inject few dummy R4 = R5
      instructions which capture desired state of R5 and won't have precision
      tracking logs on them. This fixes tests until we can improve BPF
      verifier output in the presence of precision tracking.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221104163649.121784-7-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      4f999b76
    • Andrii Nakryiko's avatar
      bpf: aggressively forget precise markings during state checkpointing · 7a830b53
      Andrii Nakryiko authored
      Exploit the property of about-to-be-checkpointed state to be able to
      forget all precise markings up to that point even more aggressively. We
      now clear all potentially inherited precise markings right before
      checkpointing and branching off into child state. If any of children
      states require precise knowledge of any SCALAR register, those will be
      propagated backwards later on before this state is finalized, preserving
      correctness.
      
      There is a single selftests BPF program change, but tremendous one: 25x
      reduction in number of verified instructions and states in
      trace_virtqueue_add_sgs.
      
      Cilium results are more modest, but happen across wider range of programs.
      
      SELFTESTS RESULTS
      =================
      
      $ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results.csv ~/imprecise-aggressive-results.csv | grep -v '+0'
      File                 Program                  Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
      -------------------  -----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      loop6.bpf.linked1.o  trace_virtqueue_add_sgs           398057            15114   -382943 (-96.20%)              8717               336      -8381 (-96.15%)
      -------------------  -----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      
      CILIUM RESULTS
      ==============
      
      $ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results-cilium.csv ~/imprecise-aggressive-results-cilium.csv | grep -v '+0'
      File           Program                           Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
      -------------  --------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      bpf_host.o     tail_handle_nat_fwd_ipv4                    23426            23221       -205 (-0.88%)              1537              1515         -22 (-1.43%)
      bpf_host.o     tail_handle_nat_fwd_ipv6                    13009            12904       -105 (-0.81%)               719               708         -11 (-1.53%)
      bpf_host.o     tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
      bpf_host.o     tail_nodeport_nat_ipv6_egress                3446             3406        -40 (-1.16%)               203               198          -5 (-2.46%)
      bpf_lxc.o      tail_handle_nat_fwd_ipv4                    23426            23221       -205 (-0.88%)              1537              1515         -22 (-1.43%)
      bpf_lxc.o      tail_handle_nat_fwd_ipv6                    13009            12904       -105 (-0.81%)               719               708         -11 (-1.53%)
      bpf_lxc.o      tail_ipv4_ct_egress                          5074             4897       -177 (-3.49%)               255               248          -7 (-2.75%)
      bpf_lxc.o      tail_ipv4_ct_ingress                         5100             4923       -177 (-3.47%)               255               248          -7 (-2.75%)
      bpf_lxc.o      tail_ipv4_ct_ingress_policy_only             5100             4923       -177 (-3.47%)               255               248          -7 (-2.75%)
      bpf_lxc.o      tail_ipv6_ct_egress                          4558             4536        -22 (-0.48%)               188               187          -1 (-0.53%)
      bpf_lxc.o      tail_ipv6_ct_ingress                         4578             4556        -22 (-0.48%)               188               187          -1 (-0.53%)
      bpf_lxc.o      tail_ipv6_ct_ingress_policy_only             4578             4556        -22 (-0.48%)               188               187          -1 (-0.53%)
      bpf_lxc.o      tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
      bpf_overlay.o  tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
      bpf_overlay.o  tail_nodeport_nat_ipv6_egress                3482             3442        -40 (-1.15%)               204               201          -3 (-1.47%)
      bpf_xdp.o      tail_nodeport_nat_egress_ipv4               17200            15619      -1581 (-9.19%)              1111              1010        -101 (-9.09%)
      -------------  --------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221104163649.121784-6-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7a830b53
    • Andrii Nakryiko's avatar
      bpf: stop setting precise in current state · f63181b6
      Andrii Nakryiko authored
      Setting reg->precise to true in current state is not necessary from
      correctness standpoint, but it does pessimise the whole precision (or
      rather "imprecision", because that's what we want to keep as much as
      possible) tracking. Why is somewhat subtle and my best attempt to
      explain this is recorded in an extensive comment for __mark_chain_precise()
      function. Some more careful thinking and code reading is probably required
      still to grok this completely, unfortunately. Whiteboarding and a bunch
      of extra handwaiving in person would be even more helpful, but is deemed
      impractical in Git commit.
      
      Next patch pushes this imprecision property even further, building on top of
      the insights described in this patch.
      
      End results are pretty nice, we get reduction in number of total instructions
      and states verified due to a better states reuse, as some of the states are now
      more generic and permissive due to less unnecessary precise=true requirements.
      
      SELFTESTS RESULTS
      =================
      
      $ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results.csv ~/imprecise-early-results.csv | grep -v '+0'
      File                                     Program                 Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
      ---------------------------------------  ----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      bpf_iter_ksym.bpf.linked1.o              dump_ksym                           347              285       -62 (-17.87%)                20                19          -1 (-5.00%)
      pyperf600_bpf_loop.bpf.linked1.o         on_event                           3678             3736        +58 (+1.58%)               276               285          +9 (+3.26%)
      setget_sockopt.bpf.linked1.o             skops_sockopt                      4038             3947        -91 (-2.25%)               347               343          -4 (-1.15%)
      test_l4lb.bpf.linked1.o                  balancer_ingress                   4559             2611     -1948 (-42.73%)               118               105        -13 (-11.02%)
      test_l4lb_noinline.bpf.linked1.o         balancer_ingress                   6279             6268        -11 (-0.18%)               237               236          -1 (-0.42%)
      test_misc_tcp_hdr_options.bpf.linked1.o  misc_estab                         1307             1303         -4 (-0.31%)               100                99          -1 (-1.00%)
      test_sk_lookup.bpf.linked1.o             ctx_narrow_access                   456              447         -9 (-1.97%)                39                38          -1 (-2.56%)
      test_sysctl_loop1.bpf.linked1.o          sysctl_tcp_mem                     1389             1384         -5 (-0.36%)                26                25          -1 (-3.85%)
      test_tc_dtime.bpf.linked1.o              egress_fwdns_prio101                518              485        -33 (-6.37%)                51                46          -5 (-9.80%)
      test_tc_dtime.bpf.linked1.o              egress_host                         519              468        -51 (-9.83%)                50                44         -6 (-12.00%)
      test_tc_dtime.bpf.linked1.o              ingress_fwdns_prio101               842             1000      +158 (+18.76%)                73                88        +15 (+20.55%)
      xdp_synproxy_kern.bpf.linked1.o          syncookie_tc                     405757           373173     -32584 (-8.03%)             25735             22882      -2853 (-11.09%)
      xdp_synproxy_kern.bpf.linked1.o          syncookie_xdp                    479055           371590   -107465 (-22.43%)             29145             22207      -6938 (-23.81%)
      ---------------------------------------  ----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      
      Slight regression in test_tc_dtime.bpf.linked1.o/ingress_fwdns_prio101
      is left for a follow up, there might be some more precision-related bugs
      in existing BPF verifier logic.
      
      CILIUM RESULTS
      ==============
      
      $ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results-cilium.csv ~/imprecise-early-results-cilium.csv | grep -v '+0'
      File           Program                         Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
      -------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      bpf_host.o     cil_from_host                               762              556      -206 (-27.03%)                43                37         -6 (-13.95%)
      bpf_host.o     tail_handle_nat_fwd_ipv4                  23541            23426       -115 (-0.49%)              1538              1537          -1 (-0.07%)
      bpf_host.o     tail_nodeport_nat_egress_ipv4             33592            33566        -26 (-0.08%)              2163              2161          -2 (-0.09%)
      bpf_lxc.o      tail_handle_nat_fwd_ipv4                  23541            23426       -115 (-0.49%)              1538              1537          -1 (-0.07%)
      bpf_overlay.o  tail_nodeport_nat_egress_ipv4             33581            33543        -38 (-0.11%)              2160              2157          -3 (-0.14%)
      bpf_xdp.o      tail_handle_nat_fwd_ipv4                  21659            20920       -739 (-3.41%)              1440              1376         -64 (-4.44%)
      bpf_xdp.o      tail_handle_nat_fwd_ipv6                  17084            17039        -45 (-0.26%)               907               905          -2 (-0.22%)
      bpf_xdp.o      tail_lb_ipv4                              73442            73430        -12 (-0.02%)              4370              4369          -1 (-0.02%)
      bpf_xdp.o      tail_lb_ipv6                             152114           151895       -219 (-0.14%)              6493              6479         -14 (-0.22%)
      bpf_xdp.o      tail_nodeport_nat_egress_ipv4             17377            17200       -177 (-1.02%)              1125              1111         -14 (-1.24%)
      bpf_xdp.o      tail_nodeport_nat_ingress_ipv6             6405             6397         -8 (-0.12%)               309               308          -1 (-0.32%)
      bpf_xdp.o      tail_rev_nodeport_lb4                      7126             6934       -192 (-2.69%)               414               402         -12 (-2.90%)
      bpf_xdp.o      tail_rev_nodeport_lb6                     18059            17905       -154 (-0.85%)              1105              1096          -9 (-0.81%)
      -------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221104163649.121784-5-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f63181b6
    • Andrii Nakryiko's avatar
      bpf: allow precision tracking for programs with subprogs · be2ef816
      Andrii Nakryiko authored
      Stop forcing precise=true for SCALAR registers when BPF program has any
      subprograms. Current restriction means that any BPF program, as soon as
      it uses subprograms, will end up not getting any of the precision
      tracking benefits in reduction of number of verified states.
      
      This patch keeps the fallback mark_all_scalars_precise() behavior if
      precise marking has to cross function frames. E.g., if subprogram
      requires R1 (first input arg) to be marked precise, ideally we'd need to
      backtrack to the parent function and keep marking R1 and its
      dependencies as precise. But right now we give up and force all the
      SCALARs in any of the current and parent states to be forced to
      precise=true. We can lift that restriction in the future.
      
      But this patch fixes two issues identified when trying to enable
      precision tracking for subprogs.
      
      First, prevent "escaping" from top-most state in a global subprog. While
      with entry-level BPF program we never end up requesting precision for
      R1-R5 registers, because R2-R5 are not initialized (and so not readable
      in correct BPF program), and R1 is PTR_TO_CTX, not SCALAR, and so is
      implicitly precise. With global subprogs, though, it's different, as
      global subprog a) can have up to 5 SCALAR input arguments, which might
      get marked as precise=true and b) it is validated in isolation from its
      main entry BPF program. b) means that we can end up exhausting parent
      state chain and still not mark all registers in reg_mask as precise,
      which would lead to verifier bug warning.
      
      To handle that, we need to consider two cases. First, if the very first
      state is not immediately "checkpointed" (i.e., stored in state lookup
      hashtable), it will get correct first_insn_idx and last_insn_idx
      instruction set during state checkpointing. As such, this case is
      already handled and __mark_chain_precision() already handles that by
      just doing nothing when we reach to the very first parent state.
      st->parent will be NULL and we'll just stop. Perhaps some extra check
      for reg_mask and stack_mask is due here, but this patch doesn't address
      that issue.
      
      More problematic second case is when global function's initial state is
      immediately checkpointed before we manage to process the very first
      instruction. This is happening because when there is a call to global
      subprog from the main program the very first subprog's instruction is
      marked as pruning point, so before we manage to process first
      instruction we have to check and checkpoint state. This patch adds
      a special handling for such "empty" state, which is identified by having
      st->last_insn_idx set to -1. In such case, we check that we are indeed
      validating global subprog, and with some sanity checking we mark input
      args as precise if requested.
      
      Note that we also initialize state->first_insn_idx with correct start
      insn_idx offset. For main program zero is correct value, but for any
      subprog it's quite confusing to not have first_insn_idx set. This
      doesn't have any functional impact, but helps with debugging and state
      printing. We also explicitly initialize state->last_insns_idx instead of
      relying on is_state_visited() to do this with env->prev_insns_idx, which
      will be -1 on the very first instruction. This concludes necessary
      changes to handle specifically global subprog's precision tracking.
      
      Second identified problem was missed handling of BPF helper functions
      that call into subprogs (e.g., bpf_loop and few others). From precision
      tracking and backtracking logic's standpoint those are effectively calls
      into subprogs and should be called as BPF_PSEUDO_CALL calls.
      
      This patch takes the least intrusive way and just checks against a short
      list of current BPF helpers that do call subprogs, encapsulated in
      is_callback_calling_function() function. But to prevent accidentally
      forgetting to add new BPF helpers to this "list", we also do a sanity
      check in __check_func_call, which has to be called for each such special
      BPF helper, to validate that BPF helper is indeed recognized as
      callback-calling one. This should catch any missed checks in the future.
      Adding some special flags to be added in function proto definitions
      seemed like an overkill in this case.
      
      With the above changes, it's possible to remove forceful setting of
      reg->precise to true in __mark_reg_unknown, which turns on precision
      tracking both inside subprogs and entry progs that have subprogs. No
      warnings or errors were detected across all the selftests, but also when
      validating with veristat against internal Meta BPF objects and Cilium
      objects. Further, in some BPF programs there are noticeable reduction in
      number of states and instructions validated due to more effective
      precision tracking, especially benefiting syncookie test.
      
      $ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/subprog-precise-results.csv  | grep -v '+0'
      File                                      Program                     Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
      ----------------------------------------  --------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      pyperf600_bpf_loop.bpf.linked1.o          on_event                               3966             3678       -288 (-7.26%)               306               276         -30 (-9.80%)
      pyperf_global.bpf.linked1.o               on_event                               7563             7530        -33 (-0.44%)               520               517          -3 (-0.58%)
      pyperf_subprogs.bpf.linked1.o             on_event                              36358            36934       +576 (+1.58%)              2499              2531         +32 (+1.28%)
      setget_sockopt.bpf.linked1.o              skops_sockopt                          3965             4038        +73 (+1.84%)               343               347          +4 (+1.17%)
      test_cls_redirect_subprogs.bpf.linked1.o  cls_redirect                          64965            64901        -64 (-0.10%)              4619              4612          -7 (-0.15%)
      test_misc_tcp_hdr_options.bpf.linked1.o   misc_estab                             1491             1307      -184 (-12.34%)               110               100         -10 (-9.09%)
      test_pkt_access.bpf.linked1.o             test_pkt_access                         354              349         -5 (-1.41%)                25                24          -1 (-4.00%)
      test_sock_fields.bpf.linked1.o            egress_read_sock_fields                 435              375       -60 (-13.79%)                22                20          -2 (-9.09%)
      test_sysctl_loop2.bpf.linked1.o           sysctl_tcp_mem                         1508             1501         -7 (-0.46%)                29                28          -1 (-3.45%)
      test_tc_dtime.bpf.linked1.o               egress_fwdns_prio100                    468              435        -33 (-7.05%)                45                41          -4 (-8.89%)
      test_tc_dtime.bpf.linked1.o               ingress_fwdns_prio100                   398              408        +10 (+2.51%)                42                39          -3 (-7.14%)
      test_tc_dtime.bpf.linked1.o               ingress_fwdns_prio101                  1096              842      -254 (-23.18%)                97                73        -24 (-24.74%)
      test_tcp_hdr_options.bpf.linked1.o        estab                                  2758             2408      -350 (-12.69%)               208               181        -27 (-12.98%)
      test_urandom_usdt.bpf.linked1.o           urand_read_with_sema                    466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
      test_urandom_usdt.bpf.linked1.o           urand_read_without_sema                 466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
      test_urandom_usdt.bpf.linked1.o           urandlib_read_with_sema                 466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
      test_urandom_usdt.bpf.linked1.o           urandlib_read_without_sema              466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
      test_xdp_noinline.bpf.linked1.o           balancer_ingress_v6                    4302             4294         -8 (-0.19%)               257               256          -1 (-0.39%)
      xdp_synproxy_kern.bpf.linked1.o           syncookie_tc                         583722           405757   -177965 (-30.49%)             35846             25735     -10111 (-28.21%)
      xdp_synproxy_kern.bpf.linked1.o           syncookie_xdp                        609123           479055   -130068 (-21.35%)             35452             29145      -6307 (-17.79%)
      ----------------------------------------  --------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221104163649.121784-4-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      be2ef816
    • Andrii Nakryiko's avatar
      bpf: propagate precision across all frames, not just the last one · 529409ea
      Andrii Nakryiko authored
      When equivalent completed state is found and it has additional precision
      restrictions, BPF verifier propagates precision to
      currently-being-verified state chain (i.e., including parent states) so
      that if some of the states in the chain are not yet completed, necessary
      precision restrictions are enforced.
      
      Unfortunately, right now this happens only for the last frame (deepest
      active subprogram's frame), not all the frames. This can lead to
      incorrect matching of states due to missing precision marker. Currently
      this doesn't seem possible as BPF verifier forces everything to precise
      when validated BPF program has any subprograms. But with the next patch
      lifting this restriction, this becomes problematic.
      
      In fact, without this fix, we'll start getting failure in one of the
      existing test_verifier test cases:
      
        #906/p precise: cross frame pruning FAIL
        Unexpected success to load!
        verification time 48 usec
        stack depth 0+0
        processed 26 insns (limit 1000000) max_states_per_insn 3 total_states 17 peak_states 17 mark_read 8
      
      This patch adds precision propagation across all frames.
      
      Fixes: a3ce685d ("bpf: fix precision tracking")
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221104163649.121784-3-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      529409ea
    • Andrii Nakryiko's avatar
      bpf: propagate precision in ALU/ALU64 operations · a3b666bf
      Andrii Nakryiko authored
      When processing ALU/ALU64 operations (apart from BPF_MOV, which is
      handled correctly already; and BPF_NEG and BPF_END are special and don't
      have source register), if destination register is already marked
      precise, this causes problem with potentially missing precision tracking
      for the source register. E.g., when we have r1 >>= r5 and r1 is marked
      precise, but r5 isn't, this will lead to r5 staying as imprecise. This
      is due to the precision backtracking logic stopping early when it sees
      r1 is already marked precise. If r1 wasn't precise, we'd keep
      backtracking and would add r5 to the set of registers that need to be
      marked precise. So there is a discrepancy here which can lead to invalid
      and incompatible states matched due to lack of precision marking on r5.
      If r1 wasn't precise, precision backtracking would correctly mark both
      r1 and r5 as precise.
      
      This is simple to fix, though. During the forward instruction simulation
      pass, for arithmetic operations of `scalar <op>= scalar` form (where
      <op> is ALU or ALU64 operations), if destination register is already
      precise, mark source register as precise. This applies only when both
      involved registers are SCALARs. `ptr += scalar` and `scalar += ptr`
      cases are already handled correctly.
      
      This does have (negative) effect on some selftest programs and few
      Cilium programs.  ~/baseline-tmp-results.csv are veristat results with
      this patch, while ~/baseline-results.csv is without it. See post
      scriptum for instructions on how to make Cilium programs testable with
      veristat. Correctness has a price.
      
      $ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/baseline-tmp-results.csv | grep -v '+0'
      File                     Program               Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
      -----------------------  --------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      bpf_cubic.bpf.linked1.o  bpf_cubic_cong_avoid              997             1700      +703 (+70.51%)                62                90        +28 (+45.16%)
      test_l4lb.bpf.linked1.o  balancer_ingress                 4559             5469      +910 (+19.96%)               118               126          +8 (+6.78%)
      -----------------------  --------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      
      $ ./veristat -C -e file,prog,verdict,insns,states ~/baseline-results-cilium.csv ~/baseline-tmp-results-cilium.csv | grep -v '+0'
      File           Program                         Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
      -------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      bpf_host.o     tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
      bpf_host.o     tail_nodeport_nat_ipv6_egress              3396             3446        +50 (+1.47%)               201               203          +2 (+1.00%)
      bpf_lxc.o      tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
      bpf_overlay.o  tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
      bpf_xdp.o      tail_lb_ipv4                              71736            73442      +1706 (+2.38%)              4295              4370         +75 (+1.75%)
      -------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
      
      P.S. To make Cilium ([0]) programs libbpf-compatible and thus
      veristat-loadable, apply changes from topmost commit in [1], which does
      minimal changes to Cilium source code, mostly around SEC() annotations
      and BPF map definitions.
      
        [0] https://github.com/cilium/cilium/
        [1] https://github.com/anakryiko/cilium/commits/libbpf-friendliness
      
      Fixes: b5dc0163 ("bpf: precise scalar_value tracking")
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221104163649.121784-2-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a3b666bf
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Refactor map->off_arr handling · f71b2f64
      Kumar Kartikeya Dwivedi authored
      Refactor map->off_arr handling into generic functions that can work on
      their own without hardcoding map specific code. The btf_fields_offs
      structure is now returned from btf_parse_field_offs, which can be reused
      later for types in program BTF.
      
      All functions like copy_map_value, zero_map_value call generic
      underlying functions so that they can also be reused later for copying
      to values allocated in programs which encode specific fields.
      
      Later, some helper functions will also require access to this
      btf_field_offs structure to be able to skip over special fields at
      runtime.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f71b2f64
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Consolidate spin_lock, timer management into btf_record · db559117
      Kumar Kartikeya Dwivedi authored
      Now that kptr_off_tab has been refactored into btf_record, and can hold
      more than one specific field type, accomodate bpf_spin_lock and
      bpf_timer as well.
      
      While they don't require any more metadata than offset, having all
      special fields in one place allows us to share the same code for
      allocated user defined types and handle both map values and these
      allocated objects in a similar fashion.
      
      As an optimization, we still keep spin_lock_off and timer_off offsets in
      the btf_record structure, just to avoid having to find the btf_field
      struct each time their offset is needed. This is mostly needed to
      manipulate such objects in a map value at runtime. It's ok to hardcode
      just one offset as more than one field is disallowed.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      db559117
    • Alexei Starovoitov's avatar
      Merge branch 'veristat: replay, filtering, sorting' · af085f55
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      
      This patch set adds a bunch of new featurs and improvements that were sorely
      missing during recent active use of veristat to develop BPF verifier precision
      changes. Individual patches provide justification, explanation and often
      examples showing how new capabilities can be used.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      af085f55
    • Andrii Nakryiko's avatar
      selftests/bpf: support stat filtering in comparison mode in veristat · d5ce4b89
      Andrii Nakryiko authored
      Finally add support for filtering stats values, similar to
      non-comparison mode filtering. For comparison mode 4 variants of stats
      are important for filtering, as they allow to filter either A or B side,
      but even more importantly they allow to filter based on value
      difference, and for verdict stat value difference is MATCH/MISMATCH
      classification. So with these changes it's finally possible to easily
      check if there were any mismatches between failure/success outcomes on
      two separate data sets. Like in an example below:
      
        $ ./veristat -e file,prog,verdict,insns -C ~/baseline-results.csv ~/shortest-results.csv -f verdict_diff=mismatch
        File                                   Program                Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns        (DIFF)
        -------------------------------------  ---------------------  -----------  -----------  --------------  ---------  ---------  -------------------
        dynptr_success.bpf.linked1.o           test_data_slice        success      failure      MISMATCH               85          0       -85 (-100.00%)
        dynptr_success.bpf.linked1.o           test_read_write        success      failure      MISMATCH             1992          0     -1992 (-100.00%)
        dynptr_success.bpf.linked1.o           test_ringbuf           success      failure      MISMATCH               74          0       -74 (-100.00%)
        kprobe_multi.bpf.linked1.o             test_kprobe            failure      success      MISMATCH                0        246      +246 (+100.00%)
        kprobe_multi.bpf.linked1.o             test_kprobe_manual     failure      success      MISMATCH                0        246      +246 (+100.00%)
        kprobe_multi.bpf.linked1.o             test_kretprobe         failure      success      MISMATCH                0        248      +248 (+100.00%)
        kprobe_multi.bpf.linked1.o             test_kretprobe_manual  failure      success      MISMATCH                0        248      +248 (+100.00%)
        kprobe_multi.bpf.linked1.o             trigger                failure      success      MISMATCH                0          2        +2 (+100.00%)
        netcnt_prog.bpf.linked1.o              bpf_nextcnt            failure      success      MISMATCH                0         56       +56 (+100.00%)
        pyperf600_nounroll.bpf.linked1.o       on_event               success      failure      MISMATCH           568128    1000001    +431873 (+76.02%)
        ringbuf_bench.bpf.linked1.o            bench_ringbuf          success      failure      MISMATCH                8          0        -8 (-100.00%)
        strobemeta.bpf.linked1.o               on_event               success      failure      MISMATCH           557149    1000001    +442852 (+79.49%)
        strobemeta_nounroll1.bpf.linked1.o     on_event               success      failure      MISMATCH            57240    1000001  +942761 (+1647.03%)
        strobemeta_nounroll2.bpf.linked1.o     on_event               success      failure      MISMATCH           501725    1000001    +498276 (+99.31%)
        strobemeta_subprogs.bpf.linked1.o      on_event               success      failure      MISMATCH            65420    1000001  +934581 (+1428.59%)
        test_map_in_map_invalid.bpf.linked1.o  xdp_noop0              success      failure      MISMATCH                2          0        -2 (-100.00%)
        test_mmap.bpf.linked1.o                test_mmap              success      failure      MISMATCH               46          0       -46 (-100.00%)
        test_verif_scale3.bpf.linked1.o        balancer_ingress       success      failure      MISMATCH           845499    1000001    +154502 (+18.27%)
        -------------------------------------  ---------------------  -----------  -----------  --------------  ---------  ---------  -------------------
      
      Note that by filtering on verdict_diff=mismatch, it's now extremely easy and
      fast to see any changes in verdict. Example above showcases both failure ->
      success transitions (which are generally surprising) and success -> failure
      transitions (which are expected if bugs are present).
      
      Given veristat allows to query relative percent difference values, internal
      logic for comparison mode is based on floating point numbers, so requires a bit
      of epsilon precision logic, deviating from typical integer simple handling
      rules.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-11-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d5ce4b89
    • Andrii Nakryiko's avatar
      selftests/bpf: support stats ordering in comparison mode in veristat · fa9bb590
      Andrii Nakryiko authored
      Introduce the concept of "stat variant", by which it's possible to
      specify whether to use the value from A (baseline) side, B (comparison
      or control) side, the absolute difference value or relative (percentage)
      difference value.
      
      To support specifying this, veristat recognizes `_a`, `_b`, `_diff`,
      `_pct` suffixes, which can be appended to stat name(s). In
      non-comparison mode variants are ignored (there is only `_a` variant
      effectively), if no variant suffix is provided, `_b` is assumed, as
      control group is of primary interest in comparison mode.
      
      These stat variants can be flexibly combined with asc/desc orders.
      
      Here's an example of ordering results first by verdict match/mismatch (or n/a
      if one of the sides is missing; n/a is always considered to be the lowest
      value), and within each match/mismatch/n/a group further sort by number of
      instructions in B side. In this case we don't have MISMATCH cases, but N/A are
      split from MATCH, demonstrating this custom ordering.
      
        $ ./veristat -e file,prog,verdict,insns -s verdict_diff,insns_b_ -C ~/base.csv ~/comp.csv
        File                Program                         Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns   (DIFF)
        ------------------  ------------------------------  -----------  -----------  --------------  ---------  ---------  --------------
        bpf_xdp.o           tail_lb_ipv6                    N/A          success      N/A                   N/A     151895             N/A
        bpf_xdp.o           tail_nodeport_nat_egress_ipv4   N/A          success      N/A                   N/A      15619             N/A
        bpf_xdp.o           tail_nodeport_ipv6_dsr          N/A          success      N/A                   N/A       1206             N/A
        bpf_xdp.o           tail_nodeport_ipv4_dsr          N/A          success      N/A                   N/A       1162             N/A
        bpf_alignchecker.o  tail_icmp6_send_echo_reply      N/A          failure      N/A                   N/A         74             N/A
        bpf_alignchecker.o  __send_drop_notify              success      N/A          N/A                    53        N/A             N/A
        bpf_host.o          __send_drop_notify              success      N/A          N/A                    53        N/A             N/A
        bpf_host.o          cil_from_host                   success      N/A          N/A                   762        N/A             N/A
        bpf_xdp.o           tail_lb_ipv4                    success      success      MATCH               71736      73430  +1694 (+2.36%)
        bpf_xdp.o           tail_handle_nat_fwd_ipv4        success      success      MATCH               21547      20920   -627 (-2.91%)
        bpf_xdp.o           tail_rev_nodeport_lb6           success      success      MATCH               17954      17905    -49 (-0.27%)
        bpf_xdp.o           tail_handle_nat_fwd_ipv6        success      success      MATCH               16974      17039    +65 (+0.38%)
        bpf_xdp.o           tail_nodeport_nat_ingress_ipv4  success      success      MATCH                7658       7713    +55 (+0.72%)
        bpf_xdp.o           tail_rev_nodeport_lb4           success      success      MATCH                7126       6934   -192 (-2.69%)
        bpf_xdp.o           tail_nodeport_nat_ingress_ipv6  success      success      MATCH                6405       6397     -8 (-0.12%)
        bpf_xdp.o           tail_nodeport_nat_ipv6_egress   failure      failure      MATCH                 752        752     +0 (+0.00%)
        bpf_xdp.o           cil_xdp_entry                   success      success      MATCH                 423        423     +0 (+0.00%)
        bpf_xdp.o           __send_drop_notify              success      success      MATCH                 151        151     +0 (+0.00%)
        bpf_alignchecker.o  tail_icmp6_handle_ns            failure      failure      MATCH                  33         33     +0 (+0.00%)
        ------------------  ------------------------------  -----------  -----------  --------------  ---------  ---------  --------------
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-10-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      fa9bb590
    • Andrii Nakryiko's avatar
      selftests/bpf: handle missing records in comparison mode better in veristat · a5710848
      Andrii Nakryiko authored
      When comparing two datasets, if either side is missing corresponding
      record with the same file and prog name, currently veristat emits
      misleading zeros/failures, and even tried to calculate a difference,
      even though there is no data to compare against.
      
      This patch improves internal logic of handling such situations. Now
      we'll emit "N/A" in places where data is missing and comparison is
      non-sensical.
      
      As an example, in an artificially truncated and mismatched Cilium
      results, the output looks like below:
      
        $ ./veristat -e file,prog,verdict,insns -C ~/base.csv ~/comp.csv
        File                Program                         Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns   (DIFF)
        ------------------  ------------------------------  -----------  -----------  --------------  ---------  ---------  --------------
        bpf_alignchecker.o  __send_drop_notify              success      N/A          N/A                    53        N/A             N/A
        bpf_alignchecker.o  tail_icmp6_handle_ns            failure      failure      MATCH                  33         33     +0 (+0.00%)
        bpf_alignchecker.o  tail_icmp6_send_echo_reply      N/A          failure      N/A                   N/A         74             N/A
        bpf_host.o          __send_drop_notify              success      N/A          N/A                    53        N/A             N/A
        bpf_host.o          cil_from_host                   success      N/A          N/A                   762        N/A             N/A
        bpf_xdp.o           __send_drop_notify              success      success      MATCH                 151        151     +0 (+0.00%)
        bpf_xdp.o           cil_xdp_entry                   success      success      MATCH                 423        423     +0 (+0.00%)
        bpf_xdp.o           tail_handle_nat_fwd_ipv4        success      success      MATCH               21547      20920   -627 (-2.91%)
        bpf_xdp.o           tail_handle_nat_fwd_ipv6        success      success      MATCH               16974      17039    +65 (+0.38%)
        bpf_xdp.o           tail_lb_ipv4                    success      success      MATCH               71736      73430  +1694 (+2.36%)
        bpf_xdp.o           tail_lb_ipv6                    N/A          success      N/A                   N/A     151895             N/A
        bpf_xdp.o           tail_nodeport_ipv4_dsr          N/A          success      N/A                   N/A       1162             N/A
        bpf_xdp.o           tail_nodeport_ipv6_dsr          N/A          success      N/A                   N/A       1206             N/A
        bpf_xdp.o           tail_nodeport_nat_egress_ipv4   N/A          success      N/A                   N/A      15619             N/A
        bpf_xdp.o           tail_nodeport_nat_ingress_ipv4  success      success      MATCH                7658       7713    +55 (+0.72%)
        bpf_xdp.o           tail_nodeport_nat_ingress_ipv6  success      success      MATCH                6405       6397     -8 (-0.12%)
        bpf_xdp.o           tail_nodeport_nat_ipv6_egress   failure      failure      MATCH                 752        752     +0 (+0.00%)
        bpf_xdp.o           tail_rev_nodeport_lb4           success      success      MATCH                7126       6934   -192 (-2.69%)
        bpf_xdp.o           tail_rev_nodeport_lb6           success      success      MATCH               17954      17905    -49 (-0.27%)
        ------------------  ------------------------------  -----------  -----------  --------------  ---------  ---------  --------------
      
      Internally veristat now separates joining two datasets and remembering the
      join, and actually emitting a comparison view. This will come handy when we add
      support for filtering and custom ordering in comparison mode.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-9-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a5710848
    • Andrii Nakryiko's avatar
      selftests/bpf: make veristat emit all stats in CSV mode by default · 77534401
      Andrii Nakryiko authored
      Make veristat distinguish between table and CSV output formats and use
      different default set of stats (columns) that are emitted. While for
      human-readable table output it doesn't make sense to output all known
      stats, it is very useful for CSV mode to record all possible data, so
      that it can later be queried and filtered in replay or comparison mode.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-8-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      77534401
    • Andrii Nakryiko's avatar
      selftests/bpf: support simple filtering of stats in veristat · 1bb4ec81
      Andrii Nakryiko authored
      Define simple expressions to filter not just by file and program name,
      but also by resulting values of collected stats. Support usual
      equality and inequality operators. Verdict, which is a boolean-like
      field can be also filtered either as 0/1, failure/success (with f/s,
      fail/succ, and failure/success aliases) symbols, or as false/true (f/t).
      Aliases are case insensitive.
      
      Currently this filtering is honored only in verification and replay
      modes. Comparison mode support will be added in next patch.
      
      Here's an example of verifying a bunch of BPF object files and emitting
      only results for successfully validated programs that have more than 100
      total instructions processed by BPF verifier, sorted by number of
      instructions in ascending order:
      
        $ sudo ./veristat *.bpf.o -s insns^ -f 'insns>100'
      
      There can be many filters (both allow and deny flavors), all of them are
      combined.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-7-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      1bb4ec81
    • Andrii Nakryiko's avatar
      selftests/bpf: allow to define asc/desc ordering for sort specs in veristat · d68c07e2
      Andrii Nakryiko authored
      Allow to specify '^' at the end of stat name to designate that it should
      be sorted in ascending order. Similarly, allow any of 'v', 'V', '.',
      '!', or '_' suffix "symbols" to designate descending order. It's such
      a zoo for descending order because there is no single intuitive symbol
      that could be used (using 'v' looks pretty weird in practice), so few
      symbols that are "downwards leaning or pointing" were chosen. Either
      way, it shouldn't cause any troubles in practice.
      
      This new feature allows to customize sortering order to match user's
      needs.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-6-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d68c07e2
    • Andrii Nakryiko's avatar
      selftests/bpf: ensure we always have non-ambiguous sorting in veristat · b9670b90
      Andrii Nakryiko authored
      Always fall back to unique file/prog comparison if user's custom order
      specs are ambiguous. This ensures stable output no matter what.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-5-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      b9670b90
    • Andrii Nakryiko's avatar
      selftests/bpf: consolidate and improve file/prog filtering in veristat · 10b1b3f3
      Andrii Nakryiko authored
      Slightly change rules of specifying file/prog glob filters. In practice
      it's quite often inconvenient to do `*/<prog-glob>` if that program glob
      is unique enough and won't accidentally match any file names.
      
      This patch changes the rules so that `-f <glob>` will apply specified
      glob to both file and program names. User still has all the control by
      doing '*/<prog-only-glob>' or '<file-only-glob/*'. We also now allow
      '/<prog-glob>' and '<file-glob/' (all matching wildcard is assumed if
      missing).
      
      Also, internally unify file-only and file+prog checks
      (should_process_file and should_process_prog are now
      should_process_file_prog that can handle prog name as optional). This
      makes maintaining and extending this code easier.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-4-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      10b1b3f3
    • Andrii Nakryiko's avatar
      selftests/bpf: shorten "Total insns/states" column names in veristat · 62d2c08b
      Andrii Nakryiko authored
      In comparison mode the "Total " part is pretty useless, but takes
      a considerable amount of horizontal space. Drop the "Total " parts.
      
      Also make sure that table headers for numerical columns are aligned in
      the same fashion as integer values in those columns. This looks better
      and is now more obvious with shorter "Insns" and "States" column
      headers.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-3-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      62d2c08b
    • Andrii Nakryiko's avatar
      selftests/bpf: add veristat replay mode · 9b5e3536
      Andrii Nakryiko authored
      Replay mode allow to parse previously stored CSV file with verification
      results and present it in desired output (presumable human-readable
      table, but CSV to CSV convertion is supported as well). While doing
      that, it's possible to use veristat's sorting rules, specify subset of
      columns, and filter by file and program name.
      
      In subsequent patches veristat's filtering capabilities will just grow
      making replay mode even more useful in practice for post-processing
      results.
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-2-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      9b5e3536
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Refactor kptr_off_tab into btf_record · aa3496ac
      Kumar Kartikeya Dwivedi authored
      To prepare the BPF verifier to handle special fields in both map values
      and program allocated types coming from program BTF, we need to refactor
      the kptr_off_tab handling code into something more generic and reusable
      across both cases to avoid code duplication.
      
      Later patches also require passing this data to helpers at runtime, so
      that they can work on user defined types, initialize them, destruct
      them, etc.
      
      The main observation is that both map values and such allocated types
      point to a type in program BTF, hence they can be handled similarly. We
      can prepare a field metadata table for both cases and store them in
      struct bpf_map or struct btf depending on the use case.
      
      Hence, refactor the code into generic btf_record and btf_field member
      structs. The btf_record represents the fields of a specific btf_type in
      user BTF. The cnt indicates the number of special fields we successfully
      recognized, and field_mask is a bitmask of fields that were found, to
      enable quick determination of availability of a certain field.
      
      Subsequently, refactor the rest of the code to work with these generic
      types, remove assumptions about kptr and kptr_off_tab, rename variables
      to more meaningful names, etc.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      aa3496ac
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Drop reg_type_may_be_refcounted_or_null · a28ace78
      Kumar Kartikeya Dwivedi authored
      It is not scalable to maintain a list of types that can have non-zero
      ref_obj_id. It is never set for scalars anyway, so just remove the
      conditional on register types and print it whenever it is non-zero.
      Acked-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-6-memxor@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a28ace78
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Fix slot type check in check_stack_write_var_off · f5e477a8
      Kumar Kartikeya Dwivedi authored
      For the case where allow_ptr_leaks is false, code is checking whether
      slot type is STACK_INVALID and STACK_SPILL and rejecting other cases.
      This is a consequence of incorrectly checking for register type instead
      of the slot type (NOT_INIT and SCALAR_VALUE respectively). Fix the
      check.
      
      Fixes: 01f810ac ("bpf: Allow variable-offset stack access")
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-5-memxor@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      f5e477a8
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Clobber stack slot when writing over spilled PTR_TO_BTF_ID · 261f4664
      Kumar Kartikeya Dwivedi authored
      When support was added for spilled PTR_TO_BTF_ID to be accessed by
      helper memory access, the stack slot was not overwritten to STACK_MISC
      (and that too is only safe when env->allow_ptr_leaks is true).
      
      This means that helpers who take ARG_PTR_TO_MEM and write to it may
      essentially overwrite the value while the verifier continues to track
      the slot for spilled register.
      
      This can cause issues when PTR_TO_BTF_ID is spilled to stack, and then
      overwritten by helper write access, which can then be passed to BPF
      helpers or kfuncs.
      
      Handle this by falling back to the case introduced in a later commit,
      which will also handle PTR_TO_BTF_ID along with other pointer types,
      i.e. cd17d38f ("bpf: Permits pointers on stack for helper calls").
      
      Finally, include a comment on why REG_LIVE_WRITTEN is not being set when
      clobber is set to true. In short, the reason is that while when clobber
      is unset, we know that we won't be writing, when it is true, we *may*
      write to any of the stack slots in that range. It may be a partial or
      complete write, to just one or many stack slots.
      
      We cannot be sure, hence to be conservative, we leave things as is and
      never set REG_LIVE_WRITTEN for any stack slot. However, clobber still
      needs to reset them to STACK_MISC assuming writes happened. However read
      marks still need to be propagated upwards from liveness point of view,
      as parent stack slot's contents may still continue to matter to child
      states.
      
      Cc: Yonghong Song <yhs@meta.com>
      Fixes: 1d68f22b ("bpf: Handle spilled PTR_TO_BTF_ID properly when checking stack_boundary")
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-4-memxor@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      261f4664
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Allow specifying volatile type modifier for kptrs · 23da464d
      Kumar Kartikeya Dwivedi authored
      This is useful in particular to mark the pointer as volatile, so that
      compiler treats each load and store to the field as a volatile access.
      The alternative is having to define and use READ_ONCE and WRITE_ONCE in
      the BPF program.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-3-memxor@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      23da464d
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Document UAPI details for special BPF types · 9805af8d
      Kumar Kartikeya Dwivedi authored
      The kernel recognizes some special BPF types in map values or local
      kptrs. Document that only bpf_spin_lock and bpf_timer will preserve
      backwards compatibility, and kptr will preserve backwards compatibility
      for the operations on the pointer, not the types supported for such
      kptrs.
      
      For local kptrs, document that there are no stability guarantees at all.
      
      Finally, document that 'bpf_' namespace is reserved for adding future
      special fields, hence BPF programs must not declare types with such
      names in their programs and still expect backwards compatibility.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-2-memxor@gmail.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      9805af8d
  2. 03 Nov, 2022 15 commits
    • Stanislav Fomichev's avatar
      bpf: make sure skb->len != 0 when redirecting to a tunneling device · 07ec7b50
      Stanislav Fomichev authored
      syzkaller managed to trigger another case where skb->len == 0
      when we enter __dev_queue_xmit:
      
      WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 skb_assert_len include/linux/skbuff.h:2576 [inline]
      WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 __dev_queue_xmit+0x2069/0x35e0 net/core/dev.c:4295
      
      Call Trace:
       dev_queue_xmit+0x17/0x20 net/core/dev.c:4406
       __bpf_tx_skb net/core/filter.c:2115 [inline]
       __bpf_redirect_no_mac net/core/filter.c:2140 [inline]
       __bpf_redirect+0x5fb/0xda0 net/core/filter.c:2163
       ____bpf_clone_redirect net/core/filter.c:2447 [inline]
       bpf_clone_redirect+0x247/0x390 net/core/filter.c:2419
       bpf_prog_48159a89cb4a9a16+0x59/0x5e
       bpf_dispatcher_nop_func include/linux/bpf.h:897 [inline]
       __bpf_prog_run include/linux/filter.h:596 [inline]
       bpf_prog_run include/linux/filter.h:603 [inline]
       bpf_test_run+0x46c/0x890 net/bpf/test_run.c:402
       bpf_prog_test_run_skb+0xbdc/0x14c0 net/bpf/test_run.c:1170
       bpf_prog_test_run+0x345/0x3c0 kernel/bpf/syscall.c:3648
       __sys_bpf+0x43a/0x6c0 kernel/bpf/syscall.c:5005
       __do_sys_bpf kernel/bpf/syscall.c:5091 [inline]
       __se_sys_bpf kernel/bpf/syscall.c:5089 [inline]
       __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5089
       do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x61/0xc6
      
      The reproducer doesn't really reproduce outside of syzkaller
      environment, so I'm taking a guess here. It looks like we
      do generate correct ETH_HLEN-sized packet, but we redirect
      the packet to the tunneling device. Before we do so, we
      __skb_pull l2 header and arrive again at skb->len == 0.
      Doesn't seem like we can do anything better than having
      an explicit check after __skb_pull?
      
      Cc: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+f635e86ec3fa0a37e019@syzkaller.appspotmail.com
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20221027225537.353077-1-sdf@google.comSigned-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      07ec7b50
    • Jakub Kicinski's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · fbeb229a
      Jakub Kicinski authored
      No conflicts.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      fbeb229a
    • Linus Torvalds's avatar
      Merge tag 'net-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 9521c9d6
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from bluetooth and netfilter.
      
        Current release - regressions:
      
         - net: several zerocopy flags fixes
      
         - netfilter: fix possible memory leak in nf_nat_init()
      
         - openvswitch: add missing .resv_start_op
      
        Previous releases - regressions:
      
         - neigh: fix null-ptr-deref in neigh_table_clear()
      
         - sched: fix use after free in red_enqueue()
      
         - dsa: fall back to default tagger if we can't load the one from DT
      
         - bluetooth: fix use-after-free in l2cap_conn_del()
      
        Previous releases - always broken:
      
         - netfilter: netlink notifier might race to release objects
      
         - nfc: fix potential memory leak of skb
      
         - bluetooth: fix use-after-free caused by l2cap_reassemble_sdu
      
         - bluetooth: use skb_put to set length
      
         - eth: tun: fix bugs for oversize packet when napi frags enabled
      
         - eth: lan966x: fixes for when MTU is changed
      
         - eth: dwmac-loongson: fix invalid mdio_node"
      
      * tag 'net-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
        vsock: fix possible infinite sleep in vsock_connectible_wait_data()
        vsock: remove the unused 'wait' in vsock_connectible_recvmsg()
        ipv6: fix WARNING in ip6_route_net_exit_late()
        bridge: Fix flushing of dynamic FDB entries
        net, neigh: Fix null-ptr-deref in neigh_table_clear()
        net/smc: Fix possible leaked pernet namespace in smc_init()
        stmmac: dwmac-loongson: fix invalid mdio_node
        ibmvnic: Free rwi on reset success
        net: mdio: fix undefined behavior in bit shift for __mdiobus_register
        Bluetooth: L2CAP: Fix attempting to access uninitialized memory
        Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm
        Bluetooth: L2CAP: Fix accepting connection request for invalid SPSM
        Bluetooth: hci_conn: Fix not restoring ISO buffer count on disconnect
        Bluetooth: L2CAP: Fix memory leak in vhci_write
        Bluetooth: L2CAP: fix use-after-free in l2cap_conn_del()
        Bluetooth: virtio_bt: Use skb_put to set length
        Bluetooth: hci_conn: Fix CIS connection dst_type handling
        Bluetooth: L2CAP: Fix use-after-free caused by l2cap_reassemble_sdu
        netfilter: ipset: enforce documented limit to prevent allocating huge memory
        isdn: mISDN: netjet: fix wrong check of device registration
        ...
      9521c9d6
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 4d740391
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix an endian thinko in the asm-generic compat_arg_u64() which led to
         syscall arguments being swapped for some compat syscalls.
      
       - Fix syscall wrapper handling of syscalls with 64-bit arguments on
         32-bit kernels, which led to syscall arguments being misplaced.
      
       - A build fix for amdgpu on Book3E with AltiVec disabled.
      
      Thanks to Andreas Schwab, Christian Zigotzky, and Arnd Bergmann.
      
      * tag 'powerpc-6.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/32: Select ARCH_SPLIT_ARG64
        powerpc/32: fix syscall wrappers with 64-bit arguments
        asm-generic: compat: fix compat_arg_u64() and compat_arg_u64_dual()
        powerpc/64e: Fix amdgpu build on Book3E w/o AltiVec
      4d740391
    • Paolo Abeni's avatar
      Merge branch 'add-new-pcp-and-apptrust-attributes-to-dcbnl' · d9095f92
      Paolo Abeni authored
      Daniel Machon says:
      
      ====================
      Add new PCP and APPTRUST attributes to dcbnl
      
      This patch series adds new extension attributes to dcbnl, to support PCP
      prioritization (and thereby hw offloadable pcp-based queue
      classification) and per-selector trust and trust order. Additionally,
      the microchip sparx5 driver has been dcb-enabled to make use of the new
      attributes to offload PCP, DSCP and Default prio to the switch, and
      implement trust order of selectors.
      
      For pre-RFC discussion see:
      https://lore.kernel.org/netdev/Yv9VO1DYAxNduw6A@DEN-LT-70577/
      
      For RFC series see:
      https://lore.kernel.org/netdev/20220915095757.2861822-1-daniel.machon@microchip.com/
      
      In summary: there currently exist no convenient way to offload per-port
      PCP-based queue classification to hardware. The DCB subsystem offers
      different ways to prioritize through its APP table, but lacks an option
      for PCP. Similarly, there is no way to indicate the notion of trust for
      APP table selectors. This patch series addresses both topics.
      
      PCP based queue classification:
        - 8021Q standardizes the Priority Code Point table (see 6.9.3 of IEEE
          Std 802.1Q-2018).  This patch series makes it possible, to offload
          the PCP classification to said table.  The new PCP selector is not a
          standard part of the APP managed object, therefore it is
          encapsulated in a new non-std extension attribute.
      
      Selector trust:
        - ASIC's often has the notion of trust DSCP and trust PCP. The new
          attribute makes it possible to specify a trust order of app
          selectors, which drivers can then react on.
      
      DCB-enable sparx5 driver:
       - Now supports offloading of DSCP, PCP and default priority. Only one
         mapping of protocol:priority is allowed. Consecutive mappings of the
         same protocol to some new priority, will overwrite the previous. This
         is to keep a consistent view of the app table and the hardware.
       - Now supports dscp and pcp trust, by use of the introduced
         dcbnl_set/getapptrust ops. Sparx5 supports trust orders: [], [dscp],
         [pcp] and [dscp, pcp]. For now, only DSCP and PCP selectors are
         supported by the driver, everything else is bounced.
      
      Patch #1 introduces a new PCP selector to the APP object, which makes it
      possible to encode PCP and DEI in the app triplet and offload it to the
      PCP table of the ASIC.
      
      Patch #2 Introduces the new extension attributes
      DCB_ATTR_DCB_APP_TRUST_TABLE and DCB_ATTR_DCB_APP_TRUST. Trusted
      selectors are passed in the nested DCB_ATTR_DCB_APP_TRUST_TABLE
      attribute, and assembled into an array of selectors:
      
        u8 selectors[256];
      
      where lower indexes has higher precedence.  In the array, selectors are
      stored consecutively, starting from index zero. With a maximum number of
      256 unique selectors, the list has the same maximum size.
      
      Patch #3 Sets up the dcbnl ops hook, and adds support for offloading pcp
      app entries, to the PCP table of the switch.
      
      Patch #4 Makes use of the dcbnl_set/getapptrust ops, to set a per-port
      trust order.
      
      Patch #5 Adds support for offloading dscp app entries to the DSCP table
      of the switch.
      
      Patch #6 Adds support for offloading default prio app entries to the
      switch.
      
      ====================
      
      Link: https://lore.kernel.org/r/20221101094834.2726202-1-daniel.machon@microchip.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      d9095f92
    • Daniel Machon's avatar
      net: microchip: sparx5: add support for offloading default prio · c58ff3ed
      Daniel Machon authored
      Add support for offloading default prio {ETHERTYPE, 0, prio}.
      Signed-off-by: default avatarDaniel Machon <daniel.machon@microchip.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      c58ff3ed
    • Daniel Machon's avatar
      net: microchip: sparx5: add support for offloading dscp table · 8dcf69a6
      Daniel Machon authored
      Add support for offloading dscp app entries. Dscp values are global for
      all ports on the sparx5 switch. Therefore, we replicate each dscp app
      entry per-port.
      Signed-off-by: default avatarDaniel Machon <daniel.machon@microchip.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      8dcf69a6
    • Daniel Machon's avatar
      net: microchip: sparx5: add support for apptrust · 23f8382c
      Daniel Machon authored
      Make use of set/getapptrust() to implement per-selector trust and trust
      order.
      Signed-off-by: default avatarDaniel Machon <daniel.machon@microchip.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      23f8382c
    • Daniel Machon's avatar
      net: microchip: sparx5: add support for offloading pcp table · 92ef3d01
      Daniel Machon authored
      Add new registers and functions to support offload of pcp app entries.
      Signed-off-by: default avatarDaniel Machon <daniel.machon@microchip.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      92ef3d01
    • Daniel Machon's avatar
      net: dcb: add new apptrust attribute · 6182d587
      Daniel Machon authored
      Add new apptrust extension attributes to the 8021Qaz APP managed object.
      
      Two new attributes, DCB_ATTR_DCB_APP_TRUST_TABLE and
      DCB_ATTR_DCB_APP_TRUST, has been added. Trusted selectors are passed in
      the nested attribute DCB_ATTR_DCB_APP_TRUST, in order of precedence.
      
      The new attributes are meant to allow drivers, whose hw supports the
      notion of trust, to be able to set whether a particular app selector is
      trusted - and in which order.
      Signed-off-by: default avatarDaniel Machon <daniel.machon@microchip.com>
      Reviewed-by: default avatarPetr Machata <petrm@nvidia.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      6182d587
    • Daniel Machon's avatar
      net: dcb: add new pcp selector to app object · ec32c0c4
      Daniel Machon authored
      Add new PCP selector for the 8021Qaz APP managed object.
      
      As the PCP selector is not part of the 8021Qaz standard, a new non-std
      extension attribute DCB_ATTR_DCB_APP has been introduced. Also two
      helper functions to translate between selector and app attribute type
      has been added. The new selector has been given a value of 255, to
      minimize the risk of future overlap of std- and non-std attributes.
      
      The new DCB_ATTR_DCB_APP is sent alongside the ieee std attribute in the
      app table. This means that the dcb_app struct can now both contain std-
      and non-std app attributes. Currently there is no overlap between the
      selector values of the two attributes.
      
      The purpose of adding the PCP selector, is to be able to offload
      PCP-based queue classification to the 8021Q Priority Code Point table,
      see 6.9.3 of IEEE Std 802.1Q-2018.
      
      PCP and DEI is encoded in the protocol field as 8*dei+pcp, so that a
      mapping of PCP 2 and DEI 1 to priority 3 is encoded as {255, 10, 3}.
      Signed-off-by: default avatarDaniel Machon <daniel.machon@microchip.com>
      Reviewed-by: default avatarPetr Machata <petrm@nvidia.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      ec32c0c4
    • Saurabh Sengar's avatar
      net: mana: Assign interrupts to CPUs based on NUMA nodes · 71fa6887
      Saurabh Sengar authored
      In large VMs with multiple NUMA nodes, network performance is usually
      best if network interrupts are all assigned to the same virtual NUMA
      node. This patch assigns online CPU according to a numa aware policy,
      local cpus are returned first, followed by non-local ones, then it wraps
      around.
      Signed-off-by: default avatarSaurabh Sengar <ssengar@linux.microsoft.com>
      Reviewed-by: default avatarHaiyang Zhang <haiyangz@microsoft.com>
      Link: https://lore.kernel.org/r/1667282761-11547-1-git-send-email-ssengar@linux.microsoft.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      71fa6887
    • Paolo Abeni's avatar
      Merge branch 'vsock-remove-an-unused-variable-and-fix-infinite-sleep' · 715aee0f
      Paolo Abeni authored
      Dexuan Cui says:
      
      ====================
      vsock: remove an unused variable and fix infinite sleep
      
      Patch 1 removes the unused 'wait' variable.
      Patch 2 fixes an infinite sleep issue reported by a hv_sock user.
      ====================
      
      Link: https://lore.kernel.org/r/20221101021706.26152-1-decui@microsoft.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      715aee0f
    • Dexuan Cui's avatar
      vsock: fix possible infinite sleep in vsock_connectible_wait_data() · 466a8533
      Dexuan Cui authored
      Currently vsock_connectible_has_data() may miss a wakeup operation
      between vsock_connectible_has_data() == 0 and the prepare_to_wait().
      
      Fix the race by adding the process to the wait queue before checking
      vsock_connectible_has_data().
      
      Fixes: b3f7fd54 ("af_vsock: separate wait data loop")
      Signed-off-by: default avatarDexuan Cui <decui@microsoft.com>
      Reviewed-by: default avatarStefano Garzarella <sgarzare@redhat.com>
      Reported-by: default avatarFrédéric Dalleau <frederic.dalleau@docker.com>
      Tested-by: default avatarFrédéric Dalleau <frederic.dalleau@docker.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      466a8533
    • Dexuan Cui's avatar
      vsock: remove the unused 'wait' in vsock_connectible_recvmsg() · cf6ff0df
      Dexuan Cui authored
      Remove the unused variable introduced by 19c1b90e.
      
      Fixes: 19c1b90e ("af_vsock: separate receive data loop")
      Signed-off-by: default avatarDexuan Cui <decui@microsoft.com>
      Reviewed-by: default avatarStefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      cf6ff0df