1. 24 Oct, 2023 15 commits
    • libbpf: Add link-based API for netkit · 05c31b4a
      Daniel Borkmann authored
      This adds the bpf_program__attach_netkit() API to libbpf. Overall it is
      very similar to tcx. The API looks as follows:
      
        LIBBPF_API struct bpf_link *
        bpf_program__attach_netkit(const struct bpf_program *prog, int ifindex,
                                   const struct bpf_netkit_opts *opts);
      
      The struct bpf_netkit_opts is designed in a similar way as struct
      bpf_tcx_opts for supporting bpf_mprog control parameters. The attach
      location, primary or peer device, is derived from the program section
      "netkit/primary" or "netkit/peer", respectively.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20231024214904.29825-4-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • tools: Sync if_link uapi header · 5c1b994d
      Daniel Borkmann authored
      Sync the if_link uapi header to the latest version, as we need the
      refresher in tooling for the netkit device. Given it's been a while
      since the last sync and the diff is fairly big, it has been done as its
      own commit.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20231024214904.29825-3-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • netkit, bpf: Add bpf programmable net device · 35dfaad7
      Daniel Borkmann authored
      This work adds a new, minimal BPF-programmable device called "netkit"
      (former PoC code-name "meta") which we recently presented at LSF/MM/BPF.
      The core idea is that BPF programs are executed within the driver's
      xmit routine and therefore, e.g. in the case of containers/Pods, BPF
      processing moves closer to the source.
      
      One of the goals was that, for Pod egress traffic, this allows BPF
      programs to be moved from hostns tcx ingress into the device itself,
      providing earlier drop or forward mechanisms. For example, if the BPF
      program determines that the skb must be sent out of the node, a
      redirect to the physical device can take place directly without going
      through the per-CPU backlog queue. This helps to shift processing for
      such traffic from softirq to process context, leading to better
      scheduling decisions/performance (see measurements in the slides).
      
      In this initial version, the netkit device ships as a pair, but we plan to
      extend this further so it can also operate in single device mode. The pair
      comes with a primary and a peer device. Only the primary device, typically
      residing in hostns, can manage BPF programs for itself and its peer. The
      peer device is designated for containers/Pods and cannot attach/detach
      BPF programs. Upon device creation, the user can set the default policy
      to 'pass' or 'drop' for the case when no BPF program is attached.
      
      Additionally, the device can be operated in L3 (default) or L2 mode. The
      management of BPF programs is done via bpf_mprog, so that multi-attach is
      supported right from the beginning with similar API and dependency controls
      as tcx. For details on the latter see commit 053c8e1f ("bpf: Add generic
      attach/detach/query API for multi-progs"). tc BPF compatibility is provided,
      so that existing programs can be easily migrated.
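
      As a sketch, a minimal netkit program could look as follows (the
      program and function names are hypothetical; the "netkit/primary"
      section name comes from the libbpf patch in this series and the
      TCX_* verdict codes are the tc BPF compatible return values):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        /* Runs in the xmit path of the primary device; TCX_PASS lets the
         * packet continue, TCX_DROP would discard it right at the source.
         */
        SEC("netkit/primary")
        int netkit_primary_prog(struct __sk_buff *skb)
        {
                return TCX_PASS;
        }

        char LICENSE[] SEC("license") = "GPL";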
      
      Going forward, we plan to use netkit devices in Cilium as the main
      device type for connecting Pods. They will be operated in L3 mode in
      order to simplify a Pod's neighbor management, and the peer will operate
      in default drop mode, so that no traffic can leave in the window between
      the Pod being brought up by the CNI plugin and the agent attaching its
      programs.
      Additionally, the programs we attach via tcx on the physical devices
      use bpf_redirect_peer() for inbound traffic into the netkit device,
      hence the latter also supports the ndo_get_peer_dev callback. Similarly,
      we use bpf_redirect_neigh() on the way out, pushing from the netkit
      peer to the physical device directly. BIG TCP is also supported on the
      netkit device. For the follow-up work in single device mode, we plan to
      convert Cilium's cilium_host/_net devices into a single one.
      
      An extensive test suite for checking device operations and the BPF program
      and link management API comes as BPF selftests in this series.
      Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://github.com/borkmann/iproute2/tree/pr/netkit
      Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
      Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
    • bpf: Improve JEQ/JNE branch taken logic · 42d31dd6
      Andrii Nakryiko authored
      When determining if an if/else branch will always or never be taken,
      use signed range knowledge in addition to the currently used unsigned
      range knowledge. If either the signed or the unsigned range suggests
      that the condition is always/never taken, return the corresponding
      branch_taken verdict.
      
      The current use of only the unsigned range for this seems arbitrary and
      unnecessarily incomplete. It is possible for *signed* operations to be
      performed on a register, which could "invalidate" the unsigned range
      for that register. In such a case branch_taken will be artificially
      useless, even if we can still tell that some constant is outside of the
      register's value range based on its signed bounds.
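
      A simplified, self-contained sketch of the combined check (illustrative
      only; the struct below is a pared-down stand-in for the verifier's
      bpf_reg_state and the function is not the kernel's is_branch_taken()):

        #include <stdint.h>

        struct reg_bounds {
                uint64_t umin, umax;    /* known unsigned value range */
                int64_t  smin, smax;    /* known signed value range   */
        };

        /* Verdict for 'reg == val': 1 = always taken, 0 = never taken,
         * -1 = unknown. A constant lying outside EITHER the unsigned or
         * the signed range settles the branch.
         */
        static int jeq_branch_taken(const struct reg_bounds *r,
                                    uint64_t uval, int64_t sval)
        {
                if (uval < r->umin || uval > r->umax)
                        return 0;       /* outside unsigned range */
                if (sval < r->smin || sval > r->smax)
                        return 0;       /* outside signed range */
                if (r->umin == r->umax)
                        return 1;       /* register is exactly 'val' */
                return -1;
        }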
      
      veristat-based validation shows zero differences across selftests, Cilium,
      and Meta-internal BPF object files.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Link: https://lore.kernel.org/bpf/20231022205743.72352-2-andrii@kernel.org
    • bpf: Fold smp_mb__before_atomic() into atomic_set_release() · 06646da0
      Paul E. McKenney authored
      The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set()
      immediately preceded by smp_mb__before_atomic() so as to order storing
      of ring-buffer consumer and producer positions prior to the atomic_set()
      call's clearing of the ->busy flag, as follows:
      
              smp_mb__before_atomic();
              atomic_set(&rb->busy, 0);
      
      Although this works given current architectures and implementations, and
      given that this only needs to order prior writes against a later write,
      it does so by accident: smp_mb__before_atomic() is only guaranteed to
      work with read-modify-write atomic operations, and not at all with
      things like atomic_set() and atomic_read().
      
      Note especially that smp_mb__before_atomic() will not, repeat *not*,
      order the prior write to "a" before the subsequent non-read-modify-write
      atomic read from "b", even on strongly ordered systems such as x86:
      
              WRITE_ONCE(a, 1);
              smp_mb__before_atomic();
              r1 = atomic_read(&b);
      
      Therefore, replace the smp_mb__before_atomic() and atomic_set() with
      atomic_set_release() as follows:
      
              atomic_set_release(&rb->busy, 0);
      
      This is no slower (and sometimes is faster) than the original, and also
      provides a formal guarantee of ordering that the original lacks.
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
    • bpf: Fix unnecessary -EBUSY from htab_lock_bucket · d35381aa
      Song Liu authored
      htab_lock_bucket uses the following logic to avoid recursion:

      1. preempt_disable();
      2. check percpu counter htab->map_locked[hash] for recursion;
         2.1. if map_locked[hash] is already taken, return -EBUSY;
      3. raw_spin_lock_irqsave();

      However, if an IRQ hits between 2 and 3, BPF programs attached to the
      IRQ logic will not be able to access the same hash bucket of the hashtab
      and will get -EBUSY.
      
      This -EBUSY is not really necessary. Fix it by disabling IRQs before
      checking map_locked:

      1. preempt_disable();
      2. local_irq_save();
      3. check percpu counter htab->map_locked[hash] for recursion;
         3.1. if map_locked[hash] is already taken, return -EBUSY;
      4. raw_spin_lock().
      
      Similarly, use raw_spin_unlock() and local_irq_restore() in
      htab_unlock_bucket().
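
      For reference, the resulting lock path can be sketched as follows
      (closely following kernel/bpf/hashtab.c after this change, with
      details trimmed):

        static inline int htab_lock_bucket(const struct bpf_htab *htab,
                                           struct bucket *b, u32 hash,
                                           unsigned long *pflags)
        {
                unsigned long flags;

                hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);

                preempt_disable();
                local_irq_save(flags);  /* IRQs off before the recursion check */
                if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
                        __this_cpu_dec(*(htab->map_locked[hash]));
                        local_irq_restore(flags);
                        preempt_enable();
                        return -EBUSY;  /* genuine recursion only */
                }

                raw_spin_lock(&b->raw_lock);    /* IRQs already disabled */
                *pflags = flags;
                return 0;
        }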
      
      Fixes: 20b6cc34 ("bpf: Avoid hashtab deadlock with map_locked")
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com
      Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
    • xsk: Avoid starving the xsk further down the list · 99b29a49
      Albert Huang authored
      In the previous implementation, when multiple xsk sockets were
      associated with a single xsk_buff_pool, a situation could arise where
      the socket at the front of the xsk_tx_list kept transmitting while the
      xsk sockets at the back of the list were starved. This could result in
      issues such as the inability to transmit packets, increased latency,
      and jitter. To address this problem, we introduce a new variable called
      tx_budget_spent, which limits each xsk socket to transmitting a maximum
      of MAX_PER_SOCKET_BUDGET tx descriptors. This ensures equitable
      opportunities for the subsequent xsk sockets to send tx descriptors.
      The value of MAX_PER_SOCKET_BUDGET is set to 32.
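
      A standalone sketch of the fairness idea (illustrative only; the
      sock_entry struct and next_sendable() helper are hypothetical
      stand-ins, not kernel code):

        #define MAX_PER_SOCKET_BUDGET 32

        struct sock_entry {
                struct sock_entry *next;
                unsigned int tx_budget_spent;
        };

        /* Return the next socket still allowed to send; once every socket
         * has spent its budget, start a new round, so a busy entry at the
         * head of the list can no longer starve the ones behind it.
         */
        static struct sock_entry *next_sendable(struct sock_entry *head)
        {
                struct sock_entry *e;

                for (e = head; e; e = e->next)
                        if (e->tx_budget_spent < MAX_PER_SOCKET_BUDGET)
                                return e;
                for (e = head; e; e = e->next)
                        e->tx_budget_spent = 0;
                return head;
        }
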
      Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20231023125732.82261-1-huangjie.albert@bytedance.com
    • Merge branch 'exact-states-comparison-for-iterator-convergence-checks' · dedd6c89
      Alexei Starovoitov authored
      Eduard Zingerman says:
      
      ====================
      exact states comparison for iterator convergence checks
      
      The iterator convergence logic in is_state_visited() uses states_equal()
      for states with a branches counter > 0 to check if an iterator based
      loop converges. This is not fully correct because states_equal() relies
      on the presence of read and precision marks on registers. These marks
      are not guaranteed to be finalized while a state has branches.
      The commit message for patch #3 describes a program that exhibits such
      behavior.
      
      This patch-set aims to fix the iterator convergence logic by adding a
      notion of exact states comparison. Exact comparison does not rely on
      the presence of read or precision marks and is thus more strict.
      As explained in the commit message for patch #3, exact comparisons
      require the addition of speculative register bounds widening. The end
      result for BPF verifier users could be summarized as follows:

      (!) After this update the verifier will reject programs that conjure an
          imprecise value on the first loop iteration and use it as precise
          on the second (for iterator based loops).
      
      I urge people to at least skim over the commit message for patch #3.
      
      Patches are organized as follows:
      - patches #1,2: moving/extracting utility functions;
      - patch #3: introduces exact mode for states comparison and adds
        widening heuristic;
      - patch #4: adds test-cases that demonstrate why the series is
        necessary;
      - patch #5: extends patch #3 with a notion of state loop entries,
        these entries have to be tracked to correctly identify that
        different verifier states belong to the same states loop;
      - patch #6: adds a test-case that demonstrates a program
        which requires loop entry tracking for correct verification;
      - patch #7: just adds a few debug prints.
      
      The following actions are planned as a followup for this patch-set:
      - the implementation has to be adapted for the callbacks handling logic
        as part of a fix for [1];
      - it is necessary to explore ways to improve the widening heuristic to
        handle the iters_task_vma test w/o the need to insert barrier_var()
        calls;
      - the explored states eviction logic on cache miss has to be extended
        to either:
        - allow eviction of checkpoint states, or
        - be sped up in case there are many active checkpoints associated
          with the same instruction.
      
      The patch-set is a followup for mailing list discussion [1].
      
      Changelog:
      - V2 [3] -> V3:
        - correct check for stack spills in widen_imprecise_scalars(),
          added test case progs/iters.c:widen_spill to check the behavior
          (suggested by Andrii);
        - allow eviction of checkpoint states in is_state_visited() to avoid
          pathological verifier performance when iterator based loop does not
          converge (discussion with Alexei).
      - V1 [2] -> V2, applied changes suggested by Alexei offlist:
        - __explored_state() function removed;
        - same_callsites() function is now used in clean_live_states();
        - patches #1,2 are added as preparatory code movement;
        - in process_iter_next_call() a safeguard is added to verify that
          cur_st->parent exists and has expected insn index / call sites.
      
      [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
      [2] https://lore.kernel.org/bpf/20231021005939.1041-1-eddyz87@gmail.com/
      [3] https://lore.kernel.org/bpf/20231022010812.9201-1-eddyz87@gmail.com/
      ====================
      
      Link: https://lore.kernel.org/r/20231024000917.12153-1-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: print full verifier states on infinite loop detection · b4d82395
      Eduard Zingerman authored
      Additional logging in is_state_visited(): if an infinite loop is
      detected, print the full verifier state for both the current and the
      equivalent states.
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-8-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: test if state loops are detected in a tricky case · 64870fee
      Eduard Zingerman authored
      A convoluted test case for the iterator convergence logic that
      demonstrates that states with a branch count equal to 0 might still be
      part of a not completely explored loop.
      
      E.g. consider the following state diagram:
      
                     initial     Here state 'succ' was processed first,
                       |         it was eventually tracked to produce a
                       V         state identical to 'hdr'.
          .---------> hdr        All branches from 'succ' had been explored
          |            |         and thus 'succ' has its .branches == 0.
          |            V
          |    .------...        Suppose states 'cur' and 'succ' correspond
          |    |       |         to the same instruction + callsites.
          |    V       V         In such case it is necessary to check
          |   ...     ...        whether 'succ' and 'cur' are identical.
          |    |       |         If 'succ' and 'cur' are a part of the same loop
          |    V       V         they have to be compared exactly.
          |   succ <- cur
          |    |
          |    V
          |   ...
          |    |
          '----'
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-7-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: correct loop detection for iterators convergence · 2a099282
      Eduard Zingerman authored
      It turns out that .branches > 0 in is_state_visited() is not a
      sufficient condition to identify whether two verifier states form a
      loop when iterator convergence is computed. This commit adds logic to
      distinguish situations like the ones below:
      
       (I)            initial       (II)            initial
                        |                             |
                        V                             V
           .---------> hdr                           ..
           |            |                             |
           |            V                             V
           |    .------...                    .------..
           |    |       |                     |       |
           |    V       V                     V       V
           |   ...     ...               .-> hdr     ..
           |    |       |                |    |       |
           |    V       V                |    V       V
           |   succ <- cur               |   succ <- cur
           |    |                        |    |
           |    V                        |    V
           |   ...                       |   ...
           |    |                        |    |
           '----'                        '----'
      
      For both (I) and (II) the successor 'succ' of the current state 'cur'
      was previously explored and has a branches count of 0. However, the
      loop entry 'hdr' corresponding to 'succ' might be a part of the current
      DFS path. If that is the case, 'succ' and 'cur' are members of the same
      loop and have to be compared exactly.
      Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: tests with delayed read/precision marks in loop body · 389ede06
      Eduard Zingerman authored
      These test cases try to hide read and precision marks from the loop
      convergence logic: marks would only be assigned on subsequent loop
      iterations or after exploring states pushed to the env->head stack
      first. Without the verifier fix to use exact states comparison logic
      for iterator convergence, these tests (except 'triple_continue') would
      be erroneously marked as safe.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-5-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: exact states comparison for iterator convergence checks · 2793a8b0
      Eduard Zingerman authored
      Convergence for open coded iterators is computed in is_state_visited()
      by examining states with a branches count > 1 and using states_equal().
      states_equal() computes the sub-state relation using read and precision
      marks. Read and precision marks are propagated from children states and
      thus are not guaranteed to be complete inside a loop when the branches
      count > 1. This can be demonstrated using the following unsafe program:
      
           1. r7 = -16
           2. r6 = bpf_get_prandom_u32()
           3. while (bpf_iter_num_next(&fp[-8])) {
           4.   if (r6 != 42) {
           5.     r7 = -32
           6.     r6 = bpf_get_prandom_u32()
           7.     continue
           8.   }
           9.   r0 = r10
          10.   r0 += r7
          11.   r8 = *(u64 *)(r0 + 0)
          12.   r6 = bpf_get_prandom_u32()
          13. }
      
      Here the verifier would first visit path 1-3, create a checkpoint at 3
      with r7=-16, and continue along 4-7,3 with r7=-32.

      Because the instructions at 9-12 had not been visited yet, the existing
      checkpoint at 3 does not have a read or precision mark for r7.
      Thus states_equal() would return true, the verifier would discard the
      current state, and the unsafe memory access at 11 would not be caught.
      
      This commit fixes this loophole by introducing exact state comparisons
      for iterator convergence logic:
      - registers are compared using regs_exact() regardless of read or
        precision marks;
      - stack slots have to have identical type.
      
      Unfortunately, this is too strict even for simple programs like below:
      
          i = 0;
          while(iter_next(&it))
            i++;
      
      At each iteration step i++ would produce a new distinct state and
      eventually instruction processing limit would be reached.
      
      To avoid such behavior, speculatively forget (widen) the range for
      imprecise scalar registers, if those registers were not precise at the
      end of the previous iteration and do not match exactly.

      This is a conservative heuristic that allows a wide range of programs
      to be verified; however, it precludes verification of programs that
      conjure an imprecise value on the first loop iteration and use it as
      precise on the second.
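
      Roughly, the per-register widening step can be sketched as follows
      (a simplified sketch modeled on this patch; idmap plumbing and stack
      slot handling are omitted):

        static int maybe_widen_reg(struct bpf_verifier_env *env,
                                   struct bpf_reg_state *rold,
                                   struct bpf_reg_state *rcur,
                                   struct bpf_idmap *idmap)
        {
                if (rold->type != SCALAR_VALUE)
                        return 0;
                if (rold->type != rcur->type)
                        return 0;
                /* keep bounds the program provably relies upon */
                if (rold->precise || rcur->precise || regs_exact(rold, rcur, idmap))
                        return 0;
                /* otherwise forget the range: treat the scalar as unknown */
                __mark_reg_unknown(env, rcur);
                return 0;
        }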
      
      The test case iter_task_vma_for_each() presents one such case:
      
              unsigned int seen = 0;
              ...
              bpf_for_each(task_vma, vma, task, 0) {
                      if (seen >= 1000)
                              break;
                      ...
                      seen++;
              }
      
      Here clang generates the following code:
      
      <LBB0_4>:
            24:       r8 = r6                          ; stash current value of
                      ... body ...                       'seen'
            29:       r1 = r10
            30:       r1 += -0x8
            31:       call bpf_iter_task_vma_next
            32:       r6 += 0x1                        ; seen++;
            33:       if r0 == 0x0 goto +0x2 <LBB0_6>  ; exit on next() == NULL
            34:       r7 += 0x10
            35:       if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000
      
      <LBB0_6>:
            ... exit ...
      
      Note that the counter in r6 is copied to r8 and then incremented, and
      the conditional jump is done using r8. Because of this, the precision
      mark for r6 lags one state behind the precision mark on r8 and the
      widening logic kicks in.

      Adding barrier_var(seen) after the conditional is sufficient to force
      clang to use the same register for both counting and the conditional
      jump.
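
      That is, the loop body in the test becomes (sketch; barrier_var() is
      the compiler barrier macro from bpf_misc.h in the selftests):

              bpf_for_each(task_vma, vma, task, 0) {
                      if (seen >= 1000)
                              break;
                      barrier_var(seen);      /* pin 'seen' to one register */
                      ...
                      seen++;
              }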
      
      This issue was discussed in the thread [1], which was started by
      Andrew Werner <awerner32@gmail.com> demonstrating a similar bug
      in callback function handling. The callbacks will be addressed
      in a followup patch.
      
      [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
      Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: extract same_callsites() as utility function · 4c97259a
      Eduard Zingerman authored
      Extract same_callsites() from clean_live_states() as a utility
      function. This function will be used by the next patch in the set.
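
      Roughly, the extracted helper walks all frames of both states and
      compares the call sites (a sketch along the lines of verifier.c):

        static bool same_callsites(struct bpf_verifier_state *a,
                                   struct bpf_verifier_state *b)
        {
                int fr;

                if (a->curframe != b->curframe)
                        return false;

                for (fr = a->curframe; fr >= 0; fr--)
                        if (a->frame[fr]->callsite != b->frame[fr]->callsite)
                                return false;

                return true;
        }
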
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-3-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: move explored_state() closer to the beginning of verifier.c · 3c4e420c
      Eduard Zingerman authored
      Subsequent patches will make use of the explored_state() function.
      Move it up to avoid adding an unnecessary prototype.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-2-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 23 Oct, 2023 2 commits
  3. 20 Oct, 2023 18 commits
  4. 19 Oct, 2023 2 commits
  5. 18 Oct, 2023 2 commits
  6. 17 Oct, 2023 1 commit