1. 29 Jul, 2024 21 commits
    • bpf: Check unsupported ops from the bpf_struct_ops's cfi_stubs · e42ac141
      Martin KaFai Lau authored
      The bpf_tcp_ca struct_ops currently uses a "u32 unsupported_ops[]"
      array to track which ops are not supported.
      
      After cfi_stubs had been added, the function pointer in cfi_stubs is
      also NULL for the unsupported ops. Thus, the "u32 unsupported_ops[]"
      becomes redundant. This observation was originally brought up in the
      bpf/cfi discussion:
      https://lore.kernel.org/bpf/CAADnVQJoEkdjyCEJRPASjBw1QGsKYrF33QdMGc1RZa9b88bAEA@mail.gmail.com/
      
      The recent bpf qdisc patch (https://lore.kernel.org/bpf/20240714175130.4051012-6-amery.hung@bytedance.com/)
      also needs to specify quite a few unsupported ops. It is a good time
      to clean it up.
      
      This patch removes the need for "u32 unsupported_ops[]" and tests for null-ness
      in the cfi_stubs instead.
      
      Testing the cfi_stubs is done in a new function bpf_struct_ops_supported().
      The verifier will call bpf_struct_ops_supported() when loading the
      struct_ops program. The ".check_member" is removed from the bpf_tcp_ca
      in this patch. ".check_member" could still be useful for other subsystems
      to enforce other restrictions (e.g. sched_ext checks for prog->sleepable).
      
      To keep the same error return, ENOTSUPP is used.
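
A minimal userspace sketch of the null-ness test described above. The commit message does not show the kernel function's signature, so every name here (struct demo_ops, demo_cfi_stubs, demo_check_op, DEMO_ENOTSUPP) is hypothetical and only models the idea: read the function pointer at a member's offset in the stubs struct and treat NULL as "unsupported".

```c
#include <stddef.h>
#include <string.h>

/* Kernel-internal errno value; userspace <errno.h> does not export it. */
#define DEMO_ENOTSUPP 524

/* Hypothetical ops table standing in for a struct_ops type. */
struct demo_ops {
	int (*init)(void *ctx);
	int (*release)(void *ctx);
	int (*get_info)(void *ctx);	/* not implementable in this model */
};

static int demo_init_stub(void *ctx) { (void)ctx; return 0; }
static int demo_release_stub(void *ctx) { (void)ctx; return 0; }

/* Stubs table: a member left NULL marks an unsupported op. */
static const struct demo_ops demo_cfi_stubs = {
	.init = demo_init_stub,
	.release = demo_release_stub,
	/* .get_info intentionally left NULL */
};

/* Return 0 if the member at byte offset moff has a stub, else -DEMO_ENOTSUPP. */
static int demo_check_op(const struct demo_ops *stubs, size_t moff)
{
	void (*func_ptr)(void);

	/* Copy the pointer out by offset, as a generic checker would. */
	memcpy(&func_ptr, (const char *)stubs + moff, sizeof(func_ptr));
	return func_ptr ? 0 : -DEMO_ENOTSUPP;
}
```

With this model, demo_check_op(&demo_cfi_stubs, offsetof(struct demo_ops, get_info)) reports the op as unsupported without any separate list of offsets to maintain.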
      
      Cc: Amery Hung <ameryhung@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240722183049.2254692-2-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • bpftool: Add document for net attach/detach on tcx subcommand · 0d7c0612
      Tao Chen authored
      This commit adds sample output for net attach/detach on
      tcx subcommand.
      Signed-off-by: Tao Chen <chen.dylane@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Quentin Monnet <qmo@kernel.org>
      Link: https://lore.kernel.org/bpf/20240721144252.96264-1-chen.dylane@gmail.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • bpftool: Add bash-completion for tcx subcommand · 4f88dde0
      Tao Chen authored
      This commit adds bash-completion for attaching a tcx program to an interface.
      Signed-off-by: Tao Chen <chen.dylane@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Quentin Monnet <qmo@kernel.org>
      Link: https://lore.kernel.org/bpf/20240721144238.96246-1-chen.dylane@gmail.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • bpftool: Add net attach/detach command to tcx prog · 3b9d4fee
      Tao Chen authored
      Now that attaching/detaching tcx progs is supported in libbpf, extend
      'bpftool net attach/detach' so that users can attach tcx progs with
      bpftool.
      
       # bpftool prog load tc_prog.bpf.o /sys/fs/bpf/tc_prog
       # bpftool prog show
      	...
      	192: sched_cls  name tc_prog  tag 187aeb611ad00cfc  gpl
      	loaded_at 2024-07-11T15:58:16+0800  uid 0
      	xlated 152B  jited 97B  memlock 4096B  map_ids 100,99,97
      	btf_id 260
       # bpftool net attach tcx_ingress name tc_prog dev lo
       # bpftool net
      	...
      	tc:
      	lo(1) tcx/ingress tc_prog prog_id 29
      
       # bpftool net detach tcx_ingress dev lo
       # bpftool net
      	...
      	tc:
       # bpftool net attach tcx_ingress name tc_prog dev lo
       # bpftool net
      	tc:
      	lo(1) tcx/ingress tc_prog prog_id 29
      
      Test environment: ubuntu_22_04, 6.7.0-060700-generic
      Signed-off-by: Tao Chen <chen.dylane@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Quentin Monnet <qmo@kernel.org>
      Link: https://lore.kernel.org/bpf/20240721144221.96228-1-chen.dylane@gmail.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • bpftool: Refactor xdp attach/detach type judgment · b7264f87
      Tao Chen authored
      This commit makes no logical change; it only improves code readability
      and eases the TCX prog extension, which will be implemented in the next
      patch.
      Signed-off-by: Tao Chen <chen.dylane@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Quentin Monnet <qmo@kernel.org>
      Link: https://lore.kernel.org/bpf/20240721143353.95980-2-chen.dylane@gmail.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • Merge branch 'bpf-fix-tailcall-hierarchy' · 81a0b954
      Alexei Starovoitov authored
      Leon Hwang says:
      
      ====================
      bpf: Fix tailcall hierarchy
      
      This patchset fixes a tailcall hierarchy issue.
      
      The issue is confirmed in the discussions of
      "bpf, x64: Fix tailcall infinite loop" [0].
      
      The issue has been resolved on both x86_64 and arm64 [1].
      
      I provide a long commit message in the "bpf, x64: Fix tailcall hierarchy"
      patch to describe in detail how the issue happens and how this patchset
      resolves it.
      
      How does this patchset resolve the issue?
      
      In short, it stores tail_call_cnt on the stack of main prog, and propagates
      tail_call_cnt_ptr to its subprogs.
      
      First, at the prologue of main prog, it initializes tail_call_cnt and
      prepares tail_call_cnt_ptr. And at the prologue of subprog, it reuses
      the tail_call_cnt_ptr from caller.
      
      Then, when a tailcall happens, it increments tail_call_cnt by its pointer.
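
The effect of sharing one counter through a pointer can be sketched in plain C. This is only a userspace model under stated assumptions: the demo_* names are hypothetical, a tail call is modelled as an ordinary call back into the entry function, and the limit of 33 is taken from the JITed dump in the x64 patch.

```c
#define DEMO_MAX_TAIL_CALL_CNT 33

static void demo_entry(unsigned int *tcc_ptr, unsigned long *tail_calls);

/* Subprog: reuses the caller's tcc_ptr and never re-initializes the count. */
static void demo_subprog_tail(unsigned int *tcc_ptr, unsigned long *tail_calls)
{
	if (*tcc_ptr >= DEMO_MAX_TAIL_CALL_CNT)
		return;			/* limit reached: the tail call is skipped */
	(*tcc_ptr)++;			/* increment the shared counter via its pointer */
	(*tail_calls)++;
	demo_entry(tcc_ptr, tail_calls);	/* models the tail call into entry */
}

/* Entry body with two subprog calls, as in the commit's example program. */
static void demo_entry(unsigned int *tcc_ptr, unsigned long *tail_calls)
{
	demo_subprog_tail(tcc_ptr, tail_calls);
	demo_subprog_tail(tcc_ptr, tail_calls);
}

/* Main prog prologue: the counter lives on this stack frame. */
static unsigned long demo_run(void)
{
	unsigned int tail_call_cnt = 0;
	unsigned long tail_calls = 0;

	demo_entry(&tail_call_cnt, &tail_calls);
	return tail_calls;
}
```

Because every subprog increments the same counter, demo_run() stops after 33 modelled tail calls no matter how many subprog calls the entry body contains.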
      
      v5 -> v6:
        * Address comments from Eduard:
          * Add a JITed dump along with annotating comments
          * Rewrite two selftests with RUN_TESTS macro.
      
      v4 -> v5:
        * Solution changes from tailcall run ctx to tail_call_cnt and its pointer.
          It's because v4 solution is unable to handle the case that there is no
          tailcall in subprog but there is tailcall in EXT prog which attaches to
          the subprog.
      
      v3 -> v4:
        * Solution changes from per-task tail_call_cnt to tailcall run ctx.
          As for per-cpu/per-task solution, there is a case it is unable to handle [2].
      
      v2 -> v3:
        * Solution changes from percpu tail_call_cnt to tail_call_cnt at task_struct.
      
      v1 -> v2:
        * Solution changes from extra run-time call insn to percpu tail_call_cnt.
        * Address comments from Alexei:
          * Use percpu tail_call_cnt.
          * Use asm to make sure no callee saved registers are touched.
      
      RFC v2 -> v1:
        * Solution changes from propagating tail_call_cnt with its pointer to extra
          run-time call insn.
        * Address comments from Maciej:
          * Replace all memcpy(prog, x86_nops[5], X86_PATCH_SIZE) with
              emit_nops(&prog, X86_PATCH_SIZE)
      
      RFC v1 -> RFC v2:
        * Address comments from Stanislav:
          * Separate moving emit_nops() as first patch.
      
      Links:
      [0] https://lore.kernel.org/bpf/6203dd01-789d-f02c-5293-def4c1b18aef@gmail.com/
      [1] https://github.com/kernel-patches/bpf/pull/7350/checks
      [2] https://lore.kernel.org/bpf/CAADnVQK1qF+uBjwom2s2W-yEmgd_3rGi5Nr+KiV3cW0T+UPPfA@mail.gmail.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240714123902.32305-1-hffilwlqm@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • selftests/bpf: Add testcases for tailcall hierarchy fixing · b83b936f
      Leon Hwang authored
      Add some test cases to confirm the tailcall hierarchy issue has been fixed.
      
      On x64, the selftests result is:
      
      cd tools/testing/selftests/bpf && ./test_progs -t tailcalls
      327/18  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
      327/19  tailcalls/tailcall_bpf2bpf_hierarchy_fentry:OK
      327/20  tailcalls/tailcall_bpf2bpf_hierarchy_fexit:OK
      327/21  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_fexit:OK
      327/22  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_entry:OK
      327/23  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
      327/24  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
      327     tailcalls:OK
      Summary: 1/24 PASSED, 0 SKIPPED, 0 FAILED
      
      On arm64, the selftests result is:
      
      cd tools/testing/selftests/bpf && ./test_progs -t tailcalls
      327/18  tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
      327/19  tailcalls/tailcall_bpf2bpf_hierarchy_fentry:OK
      327/20  tailcalls/tailcall_bpf2bpf_hierarchy_fexit:OK
      327/21  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_fexit:OK
      327/22  tailcalls/tailcall_bpf2bpf_hierarchy_fentry_entry:OK
      327/23  tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
      327/24  tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
      327     tailcalls:OK
      Summary: 1/24 PASSED, 0 SKIPPED, 0 FAILED
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
      Link: https://lore.kernel.org/r/20240714123902.32305-4-hffilwlqm@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • bpf, arm64: Fix tailcall hierarchy · 66ff4d61
      Leon Hwang authored
      This patch fixes a tailcall issue caused by abusing the tailcall in
      bpf2bpf feature on arm64, in the same way as "bpf, x64: Fix tailcall
      hierarchy".
      
      On arm64, when a tail call happens, it uses tail_call_cnt_ptr to
      increment tail_call_cnt, too.
      
      At the prologue of main prog, it has to initialize tail_call_cnt and
      prepare tail_call_cnt_ptr.
      
      At the prologue of subprog, it pushes x26 register twice, and does not
      initialize tail_call_cnt.
      
      At the epilogue, it pops x26 twice, no matter whether it is main prog or
      subprog.
      
      Fixes: d4609a5d ("bpf, arm64: Keep tail call count across bpf2bpf calls")
      Acked-by: Puranjay Mohan <puranjay@kernel.org>
      Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
      Link: https://lore.kernel.org/r/20240714123902.32305-3-hffilwlqm@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • bpf, x64: Fix tailcall hierarchy · 116e04ba
      Leon Hwang authored
      This patch fixes a tailcall issue caused by abusing the tailcall in
      bpf2bpf feature.
      
      As we know, tail_call_cnt propagates by rax from caller to callee when
      calling a subprog in tailcall context. But, as the following example
      shows, MAX_TAIL_CALL_CNT won't work because tail_call_cnt is not
      propagated back from callee to caller.
      
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      #include "bpf_legacy.h"
      
      struct {
      	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
      	__uint(max_entries, 1);
      	__uint(key_size, sizeof(__u32));
      	__uint(value_size, sizeof(__u32));
      } jmp_table SEC(".maps");
      
      int count = 0;
      
      static __noinline
      int subprog_tail1(struct __sk_buff *skb)
      {
      	bpf_tail_call_static(skb, &jmp_table, 0);
      	return 0;
      }
      
      static __noinline
      int subprog_tail2(struct __sk_buff *skb)
      {
      	bpf_tail_call_static(skb, &jmp_table, 0);
      	return 0;
      }
      
      SEC("tc")
      int entry(struct __sk_buff *skb)
      {
      	volatile int ret = 1;
      
      	count++;
      	subprog_tail1(skb);
      	subprog_tail2(skb);
      
      	return ret;
      }
      
      char __license[] SEC("license") = "GPL";
      
      At run time, the tail_call_cnt in entry() will be propagated to
      subprog_tail1() and subprog_tail2(). But when bpf_tail_call_static()
      updates the tail_call_cnt in subprog_tail1(), the tail_call_cnt in
      entry() is not updated at the same time. As a result, when the
      tail_call_cnt in entry() is less than MAX_TAIL_CALL_CNT and
      subprog_tail1() returns because of the MAX_TAIL_CALL_CNT limit,
      bpf_tail_call_static() in subprog_tail2() is still able to run because
      the tail_call_cnt in subprog_tail2() propagated from entry() is less
      than MAX_TAIL_CALL_CNT.
      
      So, how many tailcalls are there for this case if no error happens?
      
      Viewed top-down, the tailcalls form a hierarchy, layer upon layer.
      
      With this view, there will be 2+4+8+...+2^33 = 2^34 - 2 = 17,179,869,182
      tailcalls for this case.
      
      What if there are N subprog_tail() calls in entry()? For N > 1 there
      will be N + N^2 + ... + N^33 = (N^34 - N)/(N - 1) tailcalls.
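
The count above can be checked with a small model of the buggy by-value propagation. This is a sketch under stated assumptions: the demo_* names are hypothetical, each subprog receives a copy of the caller's counter, and the limit is a parameter so that the recursion can be run with small values (running it with a limit of 33 would take billions of steps).

```c
#include <stdint.h>

/*
 * Count tail calls when every subprog gets a COPY of the caller's counter.
 * An entry running with counter value tcc calls n_subprogs subprogs; each
 * one that is still under the limit tail-calls entry again with tcc + 1.
 */
static uint64_t demo_count_by_value(unsigned int n_subprogs, unsigned int tcc,
				    unsigned int limit)
{
	uint64_t total = 0;

	for (unsigned int i = 0; i < n_subprogs; i++) {
		if (tcc >= limit)
			continue;	/* this subprog's copy hit the limit */
		total += 1 + demo_count_by_value(n_subprogs, tcc + 1, limit);
	}
	return total;
}

/* Closed form of the same count: n + n^2 + ... + n^limit. */
static uint64_t demo_geometric_sum(uint64_t n, unsigned int limit)
{
	uint64_t term = 1, sum = 0;

	for (unsigned int k = 0; k < limit; k++) {
		term *= n;
		sum += term;
	}
	return sum;
}
```

For two subprogs and a limit of 10 the recursion yields 2046 tail calls, which matches 2 + 4 + ... + 2^10, and demo_geometric_sum(2, 33) gives the 17,179,869,182 quoted in the commit message.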
      
      This patch resolves this case on x86_64.

      Instead of propagating tail_call_cnt from caller to callee, it
      propagates its pointer, tail_call_cnt_ptr, tcc_ptr for short.
      
      However, where does it store tail_call_cnt?
      
      It stores tail_call_cnt on the stack of main prog. When a tail call
      happens in a subprog, it increments tail_call_cnt through tcc_ptr.

      Meanwhile, it stores tail_call_cnt_ptr on the stack of main prog, too.

      And, before jumping to the tail callee, it has to pop tail_call_cnt and
      tail_call_cnt_ptr.

      Then, at the prologue of subprog, it must not set rax to a new
      tail_call_cnt_ptr. It has to reuse tail_call_cnt_ptr from the caller.
      
      As a result, at run time, the prologue has to recognize whether rax is
      tail_call_cnt or tail_call_cnt_ptr by:
      
      1. rax is tail_call_cnt if rax is <= MAX_TAIL_CALL_CNT.
      2. rax is tail_call_cnt_ptr if rax is > MAX_TAIL_CALL_CNT, because a
         pointer won't be <= MAX_TAIL_CALL_CNT.
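
The two rules above amount to a single unsigned comparison. A minimal sketch, assuming hypothetical demo_* names and the limit of 33 that appears in the JITed dump:

```c
#include <stdint.h>

#define DEMO_MAX_TAIL_CALL_CNT 33

enum demo_rax_kind {
	DEMO_RAX_IS_CNT,	/* rax carries tail_call_cnt itself */
	DEMO_RAX_IS_PTR,	/* rax carries tail_call_cnt_ptr */
};

/* Unsigned compare, mirroring the cmpq/ja pair in the prologue. */
static enum demo_rax_kind demo_classify_rax(uint64_t rax)
{
	return rax <= DEMO_MAX_TAIL_CALL_CNT ? DEMO_RAX_IS_CNT : DEMO_RAX_IS_PTR;
}
```

The check relies on no valid stack address being as small as the limit, which is the assumption rule 2 states.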
      
      Here's an example with its JITed dump.
      
      struct {
      	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
      	__uint(max_entries, 1);
      	__uint(key_size, sizeof(__u32));
      	__uint(value_size, sizeof(__u32));
      } jmp_table SEC(".maps");
      
      int count = 0;
      
      static __noinline
      int subprog_tail(struct __sk_buff *skb)
      {
      	bpf_tail_call_static(skb, &jmp_table, 0);
      	return 0;
      }
      
      SEC("tc")
      int entry(struct __sk_buff *skb)
      {
      	int ret = 1;
      
      	count++;
      	subprog_tail(skb);
      	subprog_tail(skb);
      
      	return ret;
      }
      
      Output of bpftool p d j id 42 (short for bpftool prog dump jited id 42):
      
      int entry(struct __sk_buff * skb):
      bpf_prog_0c0f4c2413ef19b1_entry:
      ; int entry(struct __sk_buff *skb)
         0:	endbr64
         4:	nopl	(%rax,%rax)
         9:	xorq	%rax, %rax		;; rax = 0 (tail_call_cnt)
         c:	pushq	%rbp
         d:	movq	%rsp, %rbp
        10:	endbr64
        14:	cmpq	$33, %rax		;; if rax > 33, rax = tcc_ptr
        18:	ja	0x20			;; if rax > 33 goto 0x20 ---+
        1a:	pushq	%rax			;; [rbp - 8] = rax = 0      |
        1b:	movq	%rsp, %rax		;; rax = rbp - 8            |
        1e:	jmp	0x21			;; ---------+               |
        20:	pushq	%rax			;; <--------|---------------+
        21:	pushq	%rax			;; <--------+ [rbp - 16] = rax
        22:	pushq	%rbx			;; callee saved
        23:	movq	%rdi, %rbx		;; rbx = skb (callee saved)
      ; count++;
        26:	movabsq	$-82417199407104, %rdi
        30:	movl	(%rdi), %esi
        33:	addl	$1, %esi
        36:	movl	%esi, (%rdi)
      ; subprog_tail(skb);
        39:	movq	%rbx, %rdi		;; rdi = skb
        3c:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
        43:	callq	0x80			;; call subprog_tail()
      ; subprog_tail(skb);
        48:	movq	%rbx, %rdi		;; rdi = skb
        4b:	movq	-16(%rbp), %rax		;; rax = tcc_ptr
        52:	callq	0x80			;; call subprog_tail()
      ; return ret;
        57:	movl	$1, %eax
        5c:	popq	%rbx
        5d:	leave
        5e:	retq
      
      int subprog_tail(struct __sk_buff * skb):
      bpf_prog_3a140cef239a4b4f_subprog_tail:
      ; int subprog_tail(struct __sk_buff *skb)
         0:	endbr64
         4:	nopl	(%rax,%rax)
         9:	nopl	(%rax)			;; do not touch tail_call_cnt
         c:	pushq	%rbp
         d:	movq	%rsp, %rbp
        10:	endbr64
        14:	pushq	%rax			;; [rbp - 8]  = rax (tcc_ptr)
        15:	pushq	%rax			;; [rbp - 16] = rax (tcc_ptr)
        16:	pushq	%rbx			;; callee saved
        17:	pushq	%r13			;; callee saved
        19:	movq	%rdi, %rbx		;; rbx = skb
      ; asm volatile("r1 = %[ctx]\n\t"
        1c:	movabsq	$-105487587488768, %r13	;; r13 = jmp_table
        26:	movq	%rbx, %rdi		;; 1st arg, skb
        29:	movq	%r13, %rsi		;; 2nd arg, jmp_table
        2c:	xorl	%edx, %edx		;; 3rd arg, index = 0
        2e:	movq	-16(%rbp), %rax		;; rax = [rbp - 16] (tcc_ptr)
        35:	cmpq	$33, (%rax)
        39:	jae	0x4e			;; if *tcc_ptr >= 33 goto 0x4e --------+
        3b:	jmp	0x4e			;; jmp bypass, toggled by poking       |
        40:	addq	$1, (%rax)		;; (*tcc_ptr)++                        |
        44:	popq	%r13			;; callee saved                        |
        46:	popq	%rbx			;; callee saved                        |
        47:	popq	%rax			;; undo rbp-16 push                    |
        48:	popq	%rax			;; undo rbp-8  push                    |
        49:	nopl	(%rax,%rax)		;; tail call target, toggled by poking |
      ; return 0;				;;                                     |
        4e:	popq	%r13			;; restore callee saved <--------------+
        50:	popq	%rbx			;; restore callee saved
        51:	leave
        52:	retq
      
      Furthermore, when trampoline is the caller of bpf prog, which is
      tail_call_reachable, it is required to propagate rax through trampoline.
      
      Fixes: ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
      Fixes: e411901c ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
      Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
      Link: https://lore.kernel.org/r/20240714123902.32305-2-hffilwlqm@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • Merge branch 'bpf-track-find_equal_scalars-history-on-per-instruction-level' · bde0c5a7
      Andrii Nakryiko authored
      Eduard Zingerman says:
      
      ====================
      bpf: track find_equal_scalars history on per-instruction level
      
      This is a fix for the precision tracking bug reported in [0].
      It supersedes my previous attempt to fix a similar issue in commit [1].
      Here is a minimized test case from [0]:
      
          0:  call bpf_get_prandom_u32;
          1:  r7 = r0;
          2:  r8 = r0;
          3:  call bpf_get_prandom_u32;
          4:  if r0 > 1 goto +0;
          /* --- checkpoint #1: r7.id=1, r8.id=1 --- */
          5:  if r8 >= r0 goto 9f;
          6:  r8 += r8;
          /* --- checkpoint #2: r7.id=1, r8.id=0 --- */
          7:  if r7 == 0 goto 9f;
          8:  r0 /= 0;
          /* --- checkpoint #3 --- */
          9:  r0 = 42;
          10: exit;
      
      Without this fix the verifier incorrectly assumes that the instruction
      at label (8) is unreachable. The issue is caused by a failure to infer
      the precision mark for r0 at checkpoint #1:
      - first verification path is:
        - (0-4): r0 range [0,1];
        - (5): r8 range [0,0], propagated to r7;
        - (6): r8.id is reset;
        - (7): jump is predicted to happen;
        - (9-10): safe exit.
      - when jump at (7) is predicted mark_chain_precision() for r7 is
        called and backtrack_insn() proceeds as follows:
        - at (7) r7 is marked as precise;
        - at (5) r8 is not currently tracked and thus r0 is not marked;
        - at (4-5) boundary logic from [1] is triggered and r7,r8 are marked
          as precise;
        - => r0 precision mark is missed.
      - when second branch of (4) is considered, verifier prunes the state
        because r0 is not marked as precise in the visited state.
      
      Basically, backtracking logic fails to notice that at (5)
      range information is gained for both r7 and r8, and thus both
      r8 and r0 have to be marked as precise.
      This happens because [1] can only account for such range
      transfers at parent/child state boundaries.
      
      The solution suggested by Andrii Nakryiko in [0] is to use jump
      history to remember which registers gained range as a result of
      find_equal_scalars() [renamed to sync_linked_regs()] and use
      this information in backtrack_insn().
      Which is what this patch-set does.
      
      The patch-set uses u64 value as a vector of 10-bit values that
      identify registers gaining range in find_equal_scalars().
      This amounts to maximum of 6 possible values.
      To check if such capacity is sufficient I've instrumented kernel
      to track a histogram for maximal amount of registers that gain range
      in find_equal_scalars per program verification [2].
      Measurements done for verifier selftests and Cilium bpf object files
      from [3] show that number of such registers is *always* <= 4 and
      in 98% of cases it is <= 2.
      
      When tested on a subset of selftests identified by
      selftests/bpf/veristat.cfg and Cilium bpf object files from [3]
      this patch-set has minimal verification performance impact:
      
      File                      Program                   Insns   (DIFF)  States (DIFF)
      ------------------------  ------------------------  --------------  -------------
      bpf_host.o                tail_handle_nat_fwd_ipv4    -75 (-0.61%)    -3 (-0.39%)
      pyperf600_nounroll.bpf.o  on_event                  +1673 (+0.33%)    +3 (+0.01%)
      
      [0] https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/
      [1] commit 904e6ddf ("bpf: Use scalar ids in mark_chain_precision()")
      [2] https://github.com/eddyz87/bpf/tree/find-equal-scalars-in-jump-history-with-stats
      [3] https://github.com/anakryiko/cilium
      
      Changes:
      - v2 -> v3:
        A number of stylistic changes suggested by Andrii:
        - renamings:
          - struct reg_or_spill   -> linked_reg;
          - find_equal_scalars()  -> collect_linked_regs;
          - copy_known_reg()      -> sync_linked_regs;
        - collect_linked_regs() now returns linked regs set of
          size 2 or larger;
        - dropped usage of bit fields in struct linked_reg;
        - added a patch changing references to find_equal_scalars() in
          selftests comments.
      - v1 -> v2:
        - patch "bpf: replace env->cur_hist_ent with a getter function" is
          dropped (Andrii);
        - added structure linked_regs and helper functions to [de]serialize
          u64 value as such structure (Andrii);
        - bt_set_equal_scalars() renamed to bt_sync_linked_regs(), moved to
          start and end of backtrack_insn() in order to untie linked
          register logic from conditional jumps backtracking.
          Andrii requested a more radical change of moving linked registers
          processing to bt_set_xxx() functions, I did an experiment in this
          direction:
          https://github.com/eddyz87/bpf/tree/find-equal-scalars-in-jump-history--linked-regs-in-bt-set-reg
          the end result of the experiment seems much uglier than version
          presented in v2.
      
      Revisions:
      - v1: https://lore.kernel.org/bpf/20240222005005.31784-1-eddyz87@gmail.com/
      - v2: https://lore.kernel.org/bpf/20240705205851.2635794-1-eddyz87@gmail.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240718202357.1746514-1-eddyz87@gmail.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    • selftests/bpf: Update comments find_equal_scalars->sync_linked_regs · cfbf2548
      Eduard Zingerman authored
      find_equal_scalars() is renamed to sync_linked_regs(),
      this commit updates existing references in the selftests comments.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240718202357.1746514-5-eddyz87@gmail.com
    • selftests/bpf: Tests for per-insn sync_linked_regs() precision tracking · bebc17b1
      Eduard Zingerman authored
      Add a few test cases to verify precision tracking for scalars gaining
      range because of sync_linked_regs():
      - check what happens when more than 6 registers might gain range in
        sync_linked_regs();
      - check if precision is propagated correctly when operand of
        conditional jump gained range in sync_linked_regs() and one of
        linked registers is marked precise;
      - check if precision is propagated correctly when operand of
        conditional jump gained range in sync_linked_regs() and a
        another linked operand of the conditional jump is marked precise;
      - add a minimized reproducer for precision tracking bug reported in [0];
      - check that mark_chain_precision() for one of the conditional jump
        operands does not trigger equal scalars precision propagation.
      
      [0] https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/

      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240718202357.1746514-4-eddyz87@gmail.com
    • bpf: Remove mark_precise_scalar_ids() · 842edb55
      Eduard Zingerman authored
      Function mark_precise_scalar_ids() is superseded by
      bt_sync_linked_regs() and equal scalars tracking in jump history.
      mark_precise_scalar_ids() propagates precision over registers sharing
      same ID on parent/child state boundaries, while jump history records
      allow bt_sync_linked_regs() to propagate same information with
      instruction level granularity, which is strictly more precise.
      
      This commit removes mark_precise_scalar_ids() and updates test cases
      in progs/verifier_scalar_ids to reflect new verifier behavior.
      
      The tests are updated in the following manner:
      - mark_precise_scalar_ids() propagated precision regardless of
        presence of conditional jumps, while new jump history based logic
        only kicks in when conditional jumps are present.
        Hence test cases are augmented with conditional jumps to still
        trigger precision propagation.
      - As equal scalars tracking no longer relies on parent/child state
        boundaries some test cases are no longer interesting,
        such test cases are removed, namely:
        - precision_same_state and precision_cross_state are superseded by
          linked_regs_bpf_k;
        - precision_same_state_broken_link and equal_scalars_broken_link
          are superseded by linked_regs_broken_link.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240718202357.1746514-3-eddyz87@gmail.com
    • bpf: Track equal scalars history on per-instruction level · 4bf79f9b
      Eduard Zingerman authored
      Use bpf_verifier_state->jmp_history to track which registers were
      updated by find_equal_scalars() (renamed to collect_linked_regs())
      when conditional jump was verified. Use recorded information in
      backtrack_insn() to propagate precision.
      
      E.g. for the following program:
      
                  while verifying instructions
        1: r1 = r0              |
        2: if r1 < 8  goto ...  | push r0,r1 as linked registers in jmp_history
        3: if r0 > 16 goto ...  | push r0,r1 as linked registers in jmp_history
        4: r2 = r10             |
        5: r2 += r0             v mark_chain_precision(r0)
      
                  while doing mark_chain_precision(r0)
        5: r2 += r0             | mark r0 precise
        4: r2 = r10             |
        3: if r0 > 16 goto ...  | mark r0,r1 as precise
        2: if r1 < 8  goto ...  | mark r0,r1 as precise
        1: r1 = r0              v
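
The backtracking half of the diagram can be modelled with register bitmasks. This is a deliberately simplified sketch with hypothetical names (demo_regmask_t, DEMO_REG, demo_sync_linked): it only covers the linked-register step at one recorded conditional jump and ignores everything else backtrack_insn() does for an instruction's own operands.

```c
#include <stdint.h>

/* Bit n set means register rn is a member of the set. */
typedef uint32_t demo_regmask_t;
#define DEMO_REG(n) ((demo_regmask_t)1u << (n))

/*
 * Backtracking step for one recorded conditional jump: if any register
 * recorded as linked at that jump is currently precise, all of them
 * become precise. Otherwise the precise set is left unchanged.
 */
static demo_regmask_t demo_sync_linked(demo_regmask_t precise,
				       demo_regmask_t linked)
{
	if (precise & linked)
		precise |= linked;
	return precise;
}
```

Starting from a precise set of {r0} and the recorded set {r0, r1} at instruction 3, the step returns {r0, r1}, matching the "mark r0,r1 as precise" rows above. A precise register outside the recorded set leaves the linked registers untouched.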
      
      Technically, do this as follows:
      - Use 10 bits to identify each register that gains range because of
        sync_linked_regs():
        - 3 bits for frame number;
        - 6 bits for register or stack slot number;
        - 1 bit to indicate if register is spilled.
      - Use u64 as a vector of 6 such records + 4 bits for vector length.
      - Augment struct bpf_jmp_history_entry with a field 'linked_regs'
        representing such vector.
      - When doing check_cond_jmp_op() remember up to 6 registers that
        gain range because of sync_linked_regs() in such a vector.
      - Don't propagate range information and reset IDs for registers that
        don't fit in 6-value vector.
      - Push a pair {instruction index, linked registers vector}
        to bpf_verifier_state->jmp_history.
      - When doing backtrack_insn() check if any of recorded linked
        registers is currently marked precise, if so mark all linked
        registers as precise.
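
The u64 vector described in the list above can be sketched as follows. The field widths (3 + 6 + 1 bits per record, 6 records, 4 bits of length) come from the commit message; the bit order chosen here (length in the low 4 bits, records packed upwards from bit 4) and all demo_* names are assumptions made for illustration, not the kernel's definitions.

```c
#include <stdint.h>

#define DEMO_LINKED_MAX	6	/* records that fit in one u64 */
#define DEMO_CNT_BITS	4	/* vector length */
#define DEMO_REC_BITS	10	/* 3 frame + 6 reg/slot + 1 spilled */

static unsigned int demo_linked_cnt(uint64_t v)
{
	return (unsigned int)(v & 0xf);
}

/* Append one record; a full vector is returned unchanged. */
static uint64_t demo_linked_push(uint64_t v, unsigned int frameno,
				 unsigned int regno_or_slot,
				 unsigned int is_spilled)
{
	unsigned int cnt = demo_linked_cnt(v);
	uint64_t rec;

	if (cnt >= DEMO_LINKED_MAX)
		return v;	/* does not fit in a 6-value vector */

	rec = ((uint64_t)(frameno & 0x7)) |
	      ((uint64_t)(regno_or_slot & 0x3f) << 3) |
	      ((uint64_t)(is_spilled & 0x1) << 9);
	v |= rec << (DEMO_CNT_BITS + DEMO_REC_BITS * cnt);
	return (v & ~(uint64_t)0xf) | (cnt + 1);
}

/* Extract the raw 10-bit record at index i. */
static uint64_t demo_linked_rec(uint64_t v, unsigned int i)
{
	return (v >> (DEMO_CNT_BITS + DEMO_REC_BITS * i)) & 0x3ff;
}

static unsigned int demo_linked_frameno(uint64_t v, unsigned int i)
{
	return (unsigned int)(demo_linked_rec(v, i) & 0x7);
}

static unsigned int demo_linked_regno(uint64_t v, unsigned int i)
{
	return (unsigned int)((demo_linked_rec(v, i) >> 3) & 0x3f);
}

static unsigned int demo_linked_is_spilled(uint64_t v, unsigned int i)
{
	return (unsigned int)((demo_linked_rec(v, i) >> 9) & 0x1);
}
```

Six records use 60 bits and the length uses 4, so the vector fills the u64 exactly. In this sketch a seventh push leaves the value unchanged, and the caller would then have to handle the register that did not fit.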
      
      This also requires fixes for two test_verifier tests:
      - precise: test 1
      - precise: test 2
      
      Both tests contain the following instruction sequence:
      
      19: (bf) r2 = r9                      ; R2=scalar(id=3) R9=scalar(id=3)
      20: (a5) if r2 < 0x8 goto pc+1        ; R2=scalar(id=3,umin=8)
      21: (95) exit
      22: (07) r2 += 1                      ; R2_w=scalar(id=3+1,...)
      23: (bf) r1 = r10                     ; R1_w=fp0 R10=fp0
      24: (07) r1 += -8                     ; R1_w=fp-8
      25: (b7) r3 = 0                       ; R3_w=0
      26: (85) call bpf_probe_read_kernel#113
      
      The call to bpf_probe_read_kernel() at (26) forces r2 to be precise.
      Previously, this forced all registers with same id to become precise
      immediately when mark_chain_precision() is called.
      After this change, the precision is propagated to registers sharing
      same id only when 'if' instruction is backtracked.
      Hence verification log for both tests is changed:
      regs=r2,r9 -> regs=r2 for instructions 25..20.
      
      Fixes: 904e6ddf ("bpf: Use scalar ids in mark_chain_precision()")
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240718202357.1746514-2-eddyz87@gmail.com
      
      Closes: https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/
    • selftests/bpf: Use auto-dependencies for test objects · 844f7315
      Ihor Solodrai authored
      Make use of -M compiler options when building .test.o objects to
      generate .d files and avoid re-building all tests every time.
      
      Previously, if a single test bpf program under selftests/bpf/progs/*.c
      has changed, make would rebuild all the *.bpf.o, *.skel.h and *.test.o
      objects, which is a lot of unnecessary work.
      
      A typical dependency chain is:
      progs/x.c -> x.bpf.o -> x.skel.h -> x.test.o -> trunner_binary
      
      However for many tests it's not a 1:1 mapping by name, and so far
      %.test.o have been simply dependent on all %.skel.h files, and
      %.skel.h files on all %.bpf.o objects.
      
      Avoid full rebuilds by instructing the compiler (via -MMD) to
      produce *.d files with real dependencies, and by including them
      appropriately. Exploit the make feature that rebuilds included
      makefiles if they have changed by setting %.test.d as a prerequisite
      for %.test.o files.
      
      A couple of examples of compilation time speedup (after the first
      clean build):
      
      $ touch progs/verifier_and.c && time make -j8
      Before: real	0m16.651s
      After:  real	0m2.245s
      $ touch progs/read_vsyscall.c && time make -j8
      Before: real	0m15.743s
      After:  real	0m1.575s
      
      A drawback of this change is the overhead of make processing many
      .d files, which may slow down unrelated targets. However, the time
      to build everything from scratch has not changed significantly:
      
      $ make clean && time make -j8
      Before: real	1m31.148s
      After:  real	1m30.309s
      Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/VJihUTnvtwEgv_mOnpfy7EgD9D2MPNoHO-MlANeLIzLJPGhDeyOuGKIYyKgk0O6KPjfM-MuhtvPwZcngN8WFqbTnTRyCSMc2aMZ1ODm1T_g=@pm.me
      844f7315
    • Markus Elfring's avatar
      bpf: Simplify character output in seq_print_delegate_opts() · f157f9cb
      Markus Elfring authored
      Two calls write a single character to the sequence.
      Use the dedicated function seq_putc() for them.
      
      The transformation was performed with the Coccinelle software.
      Suggested-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/abde0992-3d71-44d2-ab27-75b382933a22@web.de
      f157f9cb
    • Markus Elfring's avatar
      bpf: Replace 8 seq_puts() calls by seq_putc() calls · df862de4
      Markus Elfring authored
      Eight seq_puts() calls write only a single line break to the sequence.
      Use the dedicated function seq_putc() for them.
      
      The transformation was performed with the Coccinelle software.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/e26b7df9-cd63-491f-85e8-8cabe60a85e5@web.de
      df862de4
    • Martin KaFai Lau's avatar
      Merge branch 'use network helpers, part 9' · 22912472
      Martin KaFai Lau authored
      Geliang Tang says:
      
      ====================
      v3:
       - patch 2:
         - clear errno before connect_to_fd_opts.
         - print err logs in run_test.
         - set err to -1 when fd >= 0.
       - patch 3:
         - drop "int err".
      
      v2:
       - update patch 2 as Martin suggested.
      
      This is the 9th part of the series "use network helpers" across all
      BPF selftests.
      
      Patches 1-2 update the network helper interfaces as suggested by Martin.
      Patch 3 adds a new helper connect_to_addr_str() as Martin suggested
      instead of adding connect_fd_to_addr_str().
      Patch 4 uses this newly added helper in make_client().
      Patch 5 uses make_client() in sk_lookup and drops make_socket().
      ====================
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      22912472
    • Geliang Tang's avatar
      selftests/bpf: Add connect_to_addr_str helper · c70b2d90
      Geliang Tang authored
      Similar to connect_to_addr() helper for connecting to a server with the
      given sockaddr_storage type address, this patch adds a new helper named
      connect_to_addr_str() for connecting to a server with the given string
      type address "addr_str", together with its "family" and "port" as other
      parameters of connect_to_addr_str().
      
      In connect_to_addr_str(), the parameters "family", "addr_str" and "port"
      are used to create a sockaddr_storage type address "addr" by invoking
      make_sockaddr(). Then pass this "addr" together with "addrlen", "type"
      and "opts" to connect_to_addr().
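
The two-step flow described above (build a sockaddr_storage from a string address, then connect to it) can be sketched in plain userspace C. The functions sketch_make_sockaddr() and sketch_connect_to_addr_str() are hypothetical stand-ins written for illustration; they are not the selftests' make_sockaddr() or connect_to_addr_str(), and the "opts" argument is omitted.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for make_sockaddr(): fill *addr and *len from a
 * family, a textual address and a port. Returns 0 on success, -1 on error. */
int sketch_make_sockaddr(int family, const char *addr_str, unsigned short port,
			 struct sockaddr_storage *addr, socklen_t *len)
{
	memset(addr, 0, sizeof(*addr));
	if (family == AF_INET) {
		struct sockaddr_in *sin = (struct sockaddr_in *)addr;

		sin->sin_family = AF_INET;
		sin->sin_port = htons(port);
		if (inet_pton(AF_INET, addr_str, &sin->sin_addr) != 1)
			return -1;
		*len = sizeof(*sin);
		return 0;
	}
	if (family == AF_INET6) {
		struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)addr;

		sin6->sin6_family = AF_INET6;
		sin6->sin6_port = htons(port);
		if (inet_pton(AF_INET6, addr_str, &sin6->sin6_addr) != 1)
			return -1;
		*len = sizeof(*sin6);
		return 0;
	}
	return -1;
}

/* Hypothetical stand-in for connect_to_addr_str(): build the address from
 * its string form, then create a socket and connect it. Returns the
 * connected fd, or -1 on error. */
int sketch_connect_to_addr_str(int family, int type, const char *addr_str,
			       unsigned short port)
{
	struct sockaddr_storage addr;
	socklen_t len;
	int fd;

	if (sketch_make_sockaddr(family, addr_str, port, &addr, &len))
		return -1;
	fd = socket(family, type, 0);
	if (fd < 0)
		return -1;
	if (connect(fd, (struct sockaddr *)&addr, len)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Keeping the string-to-sockaddr conversion in its own function means callers that already hold a sockaddr_storage can skip it and connect directly.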
      Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Link: https://lore.kernel.org/r/647e82170831558dbde132a7a3d86df660dba2c4.1721282219.git.tanggeliang@kylinos.cn
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      c70b2d90
    • Geliang Tang's avatar
      selftests/bpf: Drop must_fail from network_helper_opts · e1ee5a48
      Geliang Tang authored
      The struct member "must_fail" of network_helper_opts is only used in
      the cgroup_v1v2 tests, so it makes sense to drop it from
      network_helper_opts.
      
      The return value (fd) of connect_to_fd_opts() and the expected errno
      (EPERM) can be checked in cgroup_v1v2.c directly; there is no need to
      check them in connect_fd_to_addr() in network_helpers.c.
      
      This also makes the connect_fd_to_addr() function unnecessary. It can be
      replaced by connect().
      Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Link: https://lore.kernel.org/r/3faf336019a9a48e2e8951f4cdebf19e3ac6e441.1721282219.git.tanggeliang@kylinos.cn
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      e1ee5a48
    • Geliang Tang's avatar
      selftests/bpf: Drop type of connect_to_fd_opts · a63507f3
      Geliang Tang authored
      The "type" parameter of connect_to_fd_opts() is redundant given
      "server_fd": the type can be obtained internally by calling
      getsockopt(SO_TYPE) instead of being passed in as a parameter.
      
      This patch drops the "type" parameter of connect_to_fd_opts() and updates
      its callers.
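
Recovering the socket type from an existing fd, as described above, might look like the sketch below. The helper name socket_type_of() is hypothetical and not part of network_helpers.c; only the getsockopt(SOL_SOCKET, SO_TYPE) call is the standard socket API.

```c
#include <sys/socket.h>

/* Hypothetical helper: return the socket type of fd (e.g. SOCK_STREAM or
 * SOCK_DGRAM) by querying SO_TYPE, or -1 if the query fails. */
int socket_type_of(int fd)
{
	int type;
	socklen_t len = sizeof(type);

	if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len))
		return -1;
	return type;
}
```

A caller holding only a server fd can use the returned value as the "type" argument when creating the client socket.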
      Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
      Link: https://lore.kernel.org/r/50d8ce7ab7ab0c0f4d211fc7cc4ebe3d3f63424c.1721282219.git.tanggeliang@kylinos.cn
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      a63507f3
  2. 25 Jul, 2024 19 commits
    • Linus Torvalds's avatar
      Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 1722389b
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf and netfilter.
      
        A lot of networking people were at a conference last week, busy
        catching COVID, so relatively short PR.
      
        Current release - regressions:
      
         - tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
      
        Current release - new code bugs:
      
         - l2tp: protect session IDR and tunnel session list with one lock,
           make sure the state is coherent to avoid a warning
      
         - eth: bnxt_en: update xdp_rxq_info in queue restart logic
      
         - eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
      
        Previous releases - regressions:
      
         - xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
           the field reuses previously un-validated pad
      
        Previous releases - always broken:
      
         - tap/tun: drop short frames to prevent crashes later in the stack
      
         - eth: ice: add a per-VF limit on number of FDIR filters
      
         - af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"
      
      * tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
        tun: add missing verification for short frame
        tap: add missing verification for short frame
        mISDN: Fix a use after free in hfcmulti_tx()
        gve: Fix an edge case for TSO skb validity check
        bnxt_en: update xdp_rxq_info in queue restart logic
        tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
        selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
        xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
        bpf: Fix a segment issue when downgrading gso_size
        net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
        MAINTAINERS: make Breno the netconsole maintainer
        MAINTAINERS: Update bonding entry
        net: nexthop: Initialize all fields in dumped nexthops
        net: stmmac: Correct byte order of perfect_match
        selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
        tipc: Return non-zero value from tipc_udp_addr2str() on error
        netfilter: nft_set_pipapo_avx2: disable softinterrupts
        ice: Fix recipe read procedure
        ice: Add a per-VF limit on number of FDIR filters
        net: bonding: correctly annotate RCU in bond_should_notify_peers()
        ...
      1722389b
    • Linus Torvalds's avatar
      Merge tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux · 8bf10009
      Linus Torvalds authored
      Pull printk updates from Petr Mladek:
      
       - trivial printk changes
      
      The bigger "real" printk work is still being discussed.
      
      * tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
        vsprintf: add missing MODULE_DESCRIPTION() macro
        printk: Rename console_replay_all() and update context
      8bf10009
    • Linus Torvalds's avatar
      Merge tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl · b4856250
      Linus Torvalds authored
      Pull sysctl constification from Joel Granados:
       "Treewide constification of the ctl_table argument of proc_handlers
        using a coccinelle script and some manual code formatting fixups.
      
        This is a prerequisite to moving the static ctl_table structs into
        read-only data section which will ensure that proc_handler function
        pointers cannot be modified"
      
      * tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
        sysctl: treewide: constify the ctl_table argument of proc_handlers
      b4856250
    • Linus Torvalds's avatar
      Merge tag 'efi-fixes-for-v6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi · bba959f4
      Linus Torvalds authored
      Pull EFI fixes from Ard Biesheuvel:
      
       - Wipe screen_info after allocating it from the heap - used by arm32
         and EFI zboot, other EFI architectures allocate it statically
      
       - Revert to allocating boot_params from the heap on x86 when entering
         via the native PE entrypoint, to work around a regression on older
         Dell hardware
      
      * tag 'efi-fixes-for-v6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
        x86/efistub: Revert to heap allocated boot_params for PE entrypoint
        efi/libstub: Zero initialize heap allocated struct screen_info
      bba959f4
    • Linus Torvalds's avatar
      Merge tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux · 9b219936
      Linus Torvalds authored
      Pull kgdb updates from Daniel Thompson:
       "Three small changes this cycle:
      
         - Clean up an architecture abstraction that is no longer needed
           because all the architectures have converged.
      
         - Actually use the prompt argument to kdb_position_cursor() instead
           of ignoring it (functionally this fix is a nop but that was due to
           luck rather than good judgement)
      
         - Fix a -Wformat-security warning"
      
      * tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
        kdb: Get rid of redundant kdb_curr_task()
        kdb: Use the passed prompt in kdb_position_cursor()
        kdb: address -Wformat-security warnings
      9b219936
    • Linus Torvalds's avatar
      Merge tag 'mips_6.11_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · 28e7241c
      Linus Torvalds authored
      Pull MIPS updates from Thomas Bogendoerfer:
      
       - Use improved timer sync for Loongson64
      
       - Fix address of GCR_ACCESS register
      
       - Add missing MODULE_DESCRIPTION
      
      * tag 'mips_6.11_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        mips: sibyte: add missing MODULE_DESCRIPTION() macro
        MIPS: SMP-CPS: Fix address for GCR_ACCESS register for CM3 and later
        MIPS: Loongson64: Switch to SYNC_R4K
      28e7241c
    • Linus Torvalds's avatar
      Merge tag 'parisc-for-6.11-rc1' of... · f6464295
      Linus Torvalds authored
      Merge tag 'parisc-for-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
      
      Pull parisc updates from Helge Deller:
       "The gettimeofday() and clock_gettime() syscalls are now available as
        vDSO functions, and Dave added a patch which allows using NVMe cards
        in the PCI slots as a fast and easy alternative to SCSI discs.
      
        Summary:
      
         - add gettimeofday() and clock_gettime() vDSO functions
      
         - enable PCI_MSI_ARCH_FALLBACKS to allow PCI to PCIe bridge adaptor
           with PCIe NVME card to function in parisc machines
      
         - allow users to reduce kernel unaligned runtime warnings
      
         - minor code cleanups"
      
      * tag 'parisc-for-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Add support for CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN
        parisc: Use max() to calculate parisc_tlb_flush_threshold
        parisc: Fix warning at drivers/pci/msi/msi.h:121
        parisc: Add 64-bit gettimeofday() and clock_gettime() vDSO functions
        parisc: Add 32-bit gettimeofday() and clock_gettime() vDSO functions
        parisc: Clean up unistd.h file
      f6464295
    • Linus Torvalds's avatar
      Merge tag 'uml-for-linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux · f9bcc61a
      Linus Torvalds authored
      Pull UML updates from Richard Weinberger:
      
       - Support for preemption
      
       - i386 Rust support
      
       - Huge cleanup by Benjamin Berg
      
       - UBSAN support
      
       - Removal of dead code
      
      * tag 'uml-for-linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (41 commits)
        um: vector: always reset vp->opened
        um: vector: remove vp->lock
        um: register power-off handler
        um: line: always fill *error_out in setup_one_line()
        um: remove pcap driver from documentation
        um: Enable preemption in UML
        um: refactor TLB update handling
        um: simplify and consolidate TLB updates
        um: remove force_flush_all from fork_handler
        um: Do not flush MM in flush_thread
        um: Delay flushing syscalls until the thread is restarted
        um: remove copy_context_skas0
        um: remove LDT support
        um: compress memory related stub syscalls while adding them
        um: Rework syscall handling
        um: Add generic stub_syscall6 function
        um: Create signal stack memory assignment in stub_data
        um: Remove stub-data.h include from common-offsets.h
        um: time-travel: fix signal blocking race/hang
        um: time-travel: remove time_exit()
        ...
      f9bcc61a
    • Linus Torvalds's avatar
      Merge tag 'driver-core-6.11-rc1' of... · c2a96b7f
      Linus Torvalds authored
      Merge tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
      
      Pull driver core updates from Greg KH:
       "Here is the big set of driver core changes for 6.11-rc1.
      
        Lots of stuff in here, with not a huge diffstat, but apis are evolving
        which required lots of files to be touched. Highlights of the changes
        in here are:
      
         - platform remove callback api final fixups (Uwe took many releases
           to get here, finally!)
      
         - Rust bindings for basic firmware apis and initial driver-core
           interactions.
      
           It's not all that useful for a "write a whole driver in rust" type
           of thing, but the firmware bindings do help out the phy rust
           drivers, and the driver core bindings give a solid base on which
           others can start their work.
      
           There is still a long way to go here before we have a multitude of
           rust drivers being added, but it's a great first step.
      
         - driver core const api changes.
      
           This reached across all bus types, and there are some fix-ups for
           some not-common bus types that linux-next and 0-day testing shook
           out.
      
           This work is being done to help make the rust bindings more safe,
           as well as the C code, moving toward the end-goal of allowing us to
           put driver structures into read-only memory. We aren't there yet,
           but are getting closer.
      
         - minor devres cleanups and fixes found by code inspection
      
         - arch_topology minor changes
      
         - other minor driver core cleanups
      
        All of these have been in linux-next for a very long time with no
        reported problems"
      
      * tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits)
        ARM: sa1100: make match function take a const pointer
        sysfs/cpu: Make crash_hotplug attribute world-readable
        dio: Have dio_bus_match() callback take a const *
        zorro: make match function take a const pointer
        driver core: module: make module_[add|remove]_driver take a const *
        driver core: make driver_find_device() take a const *
        driver core: make driver_[create|remove]_file take a const *
        firmware_loader: fix soundness issue in `request_internal`
        firmware_loader: annotate doctests as `no_run`
        devres: Correct code style for functions that return a pointer type
        devres: Initialize an uninitialized struct member
        devres: Fix memory leakage caused by driver API devm_free_percpu()
        devres: Fix devm_krealloc() wasting memory
        driver core: platform: Switch to use kmemdup_array()
        driver core: have match() callback in struct bus_type take a const *
        MAINTAINERS: add Rust device abstractions to DRIVER CORE
        device: rust: improve safety comments
        MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer
        MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER
        firmware: rust: improve safety comments
        ...
      c2a96b7f
    • Linus Torvalds's avatar
      Merge tag 'linux-watchdog-6.11-rc1' of git://www.linux-watchdog.org/linux-watchdog · b2eed733
      Linus Torvalds authored
      Pull watchdog updates from Wim Van Sebroeck:
      
       - make watchdog_class const
      
       - rework of the rzg2l_wdt driver
      
       - other small fixes and improvements
      
      * tag 'linux-watchdog-6.11-rc1' of git://www.linux-watchdog.org/linux-watchdog:
        dt-bindings: watchdog: dlg,da9062-watchdog: Drop blank space
        watchdog: rzn1: Convert comma to semicolon
        watchdog: lenovo_se10_wdt: Convert comma to semicolon
        dt-bindings: watchdog: renesas,wdt: Document RZ/G3S support
        watchdog: rzg2l_wdt: Add suspend/resume support
        watchdog: rzg2l_wdt: Rely on the reset driver for doing proper reset
        watchdog: rzg2l_wdt: Remove comparison with zero
        watchdog: rzg2l_wdt: Remove reset de-assert from probe
        watchdog: rzg2l_wdt: Check return status of pm_runtime_put()
        watchdog: rzg2l_wdt: Use pm_runtime_resume_and_get()
        watchdog: rzg2l_wdt: Make the driver depend on PM
        watchdog: rzg2l_wdt: Restrict the driver to ARCH_RZG2L and ARCH_R9A09G011
        watchdog: imx7ulp_wdt: keep already running watchdog enabled
        watchdog: starfive: Add missing clk_disable_unprepare()
        watchdog: Make watchdog_class const
      b2eed733
    • Linus Torvalds's avatar
      Merge tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping · 9cf601e8
      Linus Torvalds authored
      Pull dma-mapping fix from Christoph Hellwig:
      
       - fix the order of actions in dmam_free_coherent (Lance Richardson)
      
      * tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping:
        dma: fix call order in dmam_free_coherent
      9cf601e8
    • Jakub Kicinski's avatar
      Merge branch 'tap-tun-harden-by-dropping-short-frame' · af65ea42
      Jakub Kicinski authored
      Dongli Zhang says:
      
      ====================
      tap/tun: harden by dropping short frame
      
      This series hardens all of tap/tun by dropping any frame shorter than
      the Ethernet header (ETH_HLEN).
      
      While xen-netback already rejects frames shorter than ETH_HLEN ...
      
      static void xenvif_tx_build_gops(struct xenvif_queue *queue,
                                       int budget,
                                       unsigned *copy_ops,
                                       unsigned *map_ops)
      {
      ... ...
                      if (unlikely(txreq.size < ETH_HLEN)) {
                              netdev_dbg(queue->vif->dev,
                                         "Bad packet size: %d\n", txreq.size);
                              xenvif_tx_err(queue, &txreq, extra_count, idx);
                              break;
                      }
      
      ... the short frame may not be dropped by vhost-net/tap/tun.
      
      This fixes CVE-2024-41090 and CVE-2024-41091.
      ====================
      
      Link: https://patch.msgid.link/20240724170452.16837-1-dongli.zhang@oracle.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      af65ea42
    • Dongli Zhang's avatar
      tun: add missing verification for short frame · 04958480
      Dongli Zhang authored
      The cited commit failed to validate the frame length in the
      tun_xdp_one() path, which could cause a corrupted skb to be sent
      downstack. Even before the skb is transmitted,
      tun_xdp_one-->eth_type_trans() may access the Ethernet header although
      the frame can be shorter than ETH_HLEN. Once transmitted, this could
      either cause out-of-bounds access beyond the actual length, or confuse
      the lower layers with an incorrect or inconsistent header length in
      the skb metadata.
      
      In the alternative path, tun_get_user() already prohibits frames
      shorter than the Ethernet header size from being transmitted for
      IFF_TAP.
      
      Drop any frame shorter than the Ethernet header size, just as
      tun_get_user() does.
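
The length check described above amounts to rejecting a frame before any code touches its Ethernet header. A minimal userspace sketch follows; frame_too_short() and SKETCH_ETH_HLEN are hypothetical names, and the constant simply mirrors the 14-byte Ethernet header length that the kernel calls ETH_HLEN.

```c
#include <stdbool.h>
#include <stddef.h>

/* Ethernet header: 6-byte destination, 6-byte source, 2-byte EtherType.
 * Mirrors the kernel's ETH_HLEN; redefined so the sketch is self-contained. */
#define SKETCH_ETH_HLEN 14

/* Hypothetical helper: true when the frame cannot hold a full Ethernet
 * header and should be dropped before any header parsing takes place. */
bool frame_too_short(size_t frame_len)
{
	return frame_len < SKETCH_ETH_HLEN;
}
```

The point of checking first is ordering: the test has to run before anything reads the header fields, otherwise the read itself is the out-of-bounds access.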
      
      CVE: CVE-2024-41091
      Inspired-by: https://lore.kernel.org/netdev/1717026141-25716-1-git-send-email-si-wei.liu@oracle.com/
      Fixes: 043d222f ("tuntap: accept an array of XDP buffs through sendmsg()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Jason Wang <jasowang@redhat.com>
      Link: https://patch.msgid.link/20240724170452.16837-3-dongli.zhang@oracle.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      04958480
    • Si-Wei Liu's avatar
      tap: add missing verification for short frame · ed7f2afd
      Si-Wei Liu authored
      The cited commit failed to validate the frame length in the
      tap_get_user_xdp() path, which could cause a corrupted skb to be
      sent downstack. Even before the skb is transmitted,
      tap_get_user_xdp()-->skb_set_network_header() may assume the size is
      more than ETH_HLEN. Once transmitted, this could either cause
      out-of-bounds access beyond the actual length, or confuse the lower
      layers with an incorrect or inconsistent header length in the skb
      metadata.
      
      In the alternative path, tap_get_user() already prohibits frames
      shorter than the Ethernet header size from being transmitted.
      
      Drop any frame shorter than the Ethernet header size, just as
      tap_get_user() does.
      
      CVE: CVE-2024-41090
      Link: https://lore.kernel.org/netdev/1717026141-25716-1-git-send-email-si-wei.liu@oracle.com/
      Fixes: 0efac277 ("tap: accept an array of XDP buffs through sendmsg()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Jason Wang <jasowang@redhat.com>
      Link: https://patch.msgid.link/20240724170452.16837-2-dongli.zhang@oracle.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ed7f2afd
    • Dan Carpenter's avatar
      mISDN: Fix a use after free in hfcmulti_tx() · 61ab7514
      Dan Carpenter authored
      Don't dereference *sp after calling dev_kfree_skb(*sp).
      
      Fixes: af69fb3a ("Add mISDN HFC multiport driver")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/8be65f5a-c2dd-4ba0-8a10-bfe5980b8cfb@stanley.mountain
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      61ab7514
    • Bailey Forrest's avatar
      gve: Fix an edge case for TSO skb validity check · 36e3b949
      Bailey Forrest authored
      The NIC requires each TSO segment to not span more than 10
      descriptors. The NIC further requires each descriptor to not exceed
      16KB - 1 (GVE_TX_MAX_BUF_SIZE_DQO).
      
      The descriptors for an skb are generated by
      gve_tx_add_skb_no_copy_dqo() for DQO RDA queue format.
      gve_tx_add_skb_no_copy_dqo() loops through each skb frag and
      generates a descriptor for the entire frag if the frag size is
      not greater than GVE_TX_MAX_BUF_SIZE_DQO. If the frag size is
      greater than GVE_TX_MAX_BUF_SIZE_DQO, it is split into descriptor(s)
      of size GVE_TX_MAX_BUF_SIZE_DQO and a descriptor is generated for
      the remainder (frag size % GVE_TX_MAX_BUF_SIZE_DQO).
      
      gve_can_send_tso() checks if the descriptors thus generated for an
      skb would meet the requirement that each TSO-segment not span more
      than 10 descriptors. However, the current code misses an edge case
      when a TSO segment spans multiple descriptors within a large frag.
      This change fixes the edge case.
      
      gve_can_send_tso() relies on the assumption that max gso size (9728)
      is less than GVE_TX_MAX_BUF_SIZE_DQO and therefore within an skb
      fragment a TSO segment can never span more than 2 descriptors.
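
The per-frag splitting rule described above can be illustrated with a small calculation. This is only a sketch: descs_for_frag() and SKETCH_MAX_BUF_SIZE are hypothetical names, the limit is the 16KB - 1 value quoted in the message, and it is an assumption that a frag whose size is an exact multiple of the limit produces no extra remainder descriptor.

```c
#include <stddef.h>

/* 16KB - 1, the per-descriptor limit quoted in the commit message. */
#define SKETCH_MAX_BUF_SIZE ((size_t)16 * 1024 - 1)

/* Hypothetical helper: number of descriptors needed for one frag, i.e.
 * the full-size chunks plus one more for a non-zero remainder. */
size_t descs_for_frag(size_t frag_size)
{
	size_t full = frag_size / SKETCH_MAX_BUF_SIZE;
	size_t rem = frag_size % SKETCH_MAX_BUF_SIZE;

	return full + (rem ? 1 : 0);
}
```

For example, a 64KB frag splits into four full-size descriptors plus a 4-byte remainder. Because the max gso size (9728) is below the per-descriptor limit, a segment inside such a frag crosses at most one descriptor boundary.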
      
      Fixes: a57e5de4 ("gve: DQO: Add TX path")
      Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
      Signed-off-by: Bailey Forrest <bcf@google.com>
      Reviewed-by: Jeroen de Borst <jeroendb@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://patch.msgid.link/20240724143431.3343722-1-pkaligineedi@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      36e3b949
    • Taehee Yoo's avatar
      bnxt_en: update xdp_rxq_info in queue restart logic · b537633c
      Taehee Yoo authored
      When netdev_rx_queue_restart() restarts queues, the bnxt_en driver
      updates (creates and deletes) a page_pool.
      But it doesn't update xdp_rxq_info, so the xdp_rxq_info is still
      connected to the old page_pool.
      So, bnxt_rx_ring_info->page_pool points to the new page_pool, but
      bnxt_rx_ring_info->xdp_rxq is still connected to the old page_pool.
      
      The old page_pool is no longer used, so it is supposed to be
      deleted by page_pool_destroy(), but it isn't.
      Because the xdp_rxq_info holds a reference on it and the
      xdp_rxq_info is not updated, the old page_pool is not deleted in
      the queue restart logic.
      
      Before restarting 1 queue:
      ./tools/net/ynl/samples/page-pool
      enp10s0f1np1[6] page pools: 4 (zombies: 0)
      	refs: 8192 bytes: 33554432 (refs: 0 bytes: 0)
      	recycling: 0.0% (alloc: 128:8048 recycle: 0:0)
      
      After restarting 1 queue:
      ./tools/net/ynl/samples/page-pool
      enp10s0f1np1[6] page pools: 5 (zombies: 0)
      	refs: 10240 bytes: 41943040 (refs: 0 bytes: 0)
      	recycling: 20.0% (alloc: 160:10080 recycle: 1920:128)
      
      Before restarting queues, the interface has 4 page_pools.
      After restarting one queue, the interface has 5 page_pools, but it
      should have 4.
      The reason is that the queue restart logic creates a new page_pool
      and the old page_pool is not deleted because xdp_rxq_info is not
      updated.
      
      Fixes: 2d694c27 ("bnxt_en: implement netdev_queue_mgmt_ops")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Reviewed-by: David Wei <dw@davidwei.uk>
      Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
      Link: https://patch.msgid.link/20240721053554.1233549-1-ap420073@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b537633c
    • Jakub Kicinski's avatar
      Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · f7578df9
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2024-07-25
      
      We've added 14 non-merge commits during the last 8 day(s) which contain
      a total of 19 files changed, 177 insertions(+), 70 deletions(-).
      
      The main changes are:
      
      1) Fix af_unix to disable MSG_OOB handling for sockets in BPF sockmap and
         BPF sockhash. Also add test coverage for this case, from Michal Luczaj.
      
      2) Fix a segmentation issue when downgrading gso_size in the BPF helper
         bpf_skb_adjust_room(), from Fred Li.
      
      3) Fix a compiler warning in resolve_btfids due to a missing type cast,
         from Liwei Song.
      
      4) Fix stack allocation for arm64 to align the stack pointer at a 16 byte
         boundary in the fexit_sleep BPF selftest, from Puranjay Mohan.
      
      5) Fix an xsk regression to require a flag when actuating tx_metadata_len,
         from Stanislav Fomichev.
      
      6) Fix function prototype BTF dumping in libbpf for prototypes that have
         no input arguments, from Andrii Nakryiko.
      
      7) Fix stacktrace symbol resolution in perf script for BPF programs
         containing subprograms, from Hou Tao.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
        xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
        bpf: Fix a segment issue when downgrading gso_size
        tools/resolve_btfids: Fix comparison of distinct pointer types warning in resolve_btfids
        bpf, events: Use prog to emit ksymbol event for main program
        selftests/bpf: Test sockmap redirect for AF_UNIX MSG_OOB
        selftests/bpf: Parametrize AF_UNIX redir functions to accept send() flags
        selftests/bpf: Support SOCK_STREAM in unix_inet_redir_to_connected()
        af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhash
        bpftool: Fix typo in usage help
        libbpf: Fix no-args func prototype BTF dumping syntax
        MAINTAINERS: Update powerpc BPF JIT maintainers
        MAINTAINERS: Update email address of Naveen
        selftests/bpf: fexit_sleep: Fix stack allocation for arm64
      ====================
      
      Link: https://patch.msgid.link/20240725114312.32197-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f7578df9
    • Matthieu Baerts (NGI0)'s avatar
      tcp: process the 3rd ACK with sk_socket for TFO/MPTCP · c1668292
      Matthieu Baerts (NGI0) authored
      The 'Fixes' commit recently changed the behaviour of TCP by skipping the
      processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
      skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
      unnecessary ACK in case of simultaneous connect(). Unfortunately, that
      had an impact on TFO and MPTCP.
      
      I started to look at the impact on MPTCP, because the MPTCP CI found
      some issues with the MPTCP Packetdrill tests [1]. Then Paolo Abeni
      suggested that I look at the impact on TFO with "plain" TCP.
      
      For MPTCP, when receiving the 3rd ACK of a request adding a new path
      (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
      has been created when the MPTCP connection got established before with
      the first path. The newly added 'goto' will then skip the processing of
      the segment text (step 7) and not go through tcp_data_queue() where the
      MPTCP options are validated, and some actions are triggered, e.g.
      sending the MPJ 4th ACK [2] as demonstrated by the new errors when
      running a packetdrill test [3] establishing a second subflow.
      
      This doesn't fully break MPTCP: the main effect is that the 4th MPJ ACK
      is delayed. Still, we don't want this behaviour, as it delays the
      switch to the fully established mode, and invalid MPTCP options in this
      3rd ACK will no longer be caught. This modification also affects the
      MPTCP + TFO feature, which is why the selftests have been unstable for
      the last few days [4].
      
      For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
      passing: if the 3rd ACK contains data, and the connection is accept()ed
      before receiving them, these data would no longer be processed, and thus
      not ACKed.
      
      One last thing about MPTCP, in case of simultaneous connect(), a
      fallback to TCP will be done, which seems fine:
      
        `../common/defaults.sh`
      
         0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
        +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
      
        +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
        +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
        +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
        +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
        +0 >  . 1:1(0) ack 1           <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
      
      Simultaneous SYN-data crossing is also not supported by TFO, see [6].
      
      Kuniyuki Iwashima suggested restricting the processing to SYN+ACK only:
      that's a more generic solution than the one initially proposed, and
      also enough to fix the issues described above.
      
      Later on, Eric Dumazet mentioned that an ACK should still be sent in
      reaction to the second SYN+ACK that is received: not sending a DUPACK
      here seems wrong and could hurt:
      
         0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
        +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
      
        +0 > S  0:0(0)                <mss 1460, sackOK, TS val 1000 ecr 0,nop,wscale 8>
        +0 < S  0:0(0)       win 1000 <mss 1000, sackOK, nop, nop>
        +0 > S. 0:0(0) ack 1          <mss 1460, sackOK, TS val 3308134035 ecr 0,nop,wscale 8>
        +0 < S. 0:0(0) ack 1 win 1000 <mss 1000, sackOK, nop, nop>
        +0 >  . 1:1(0) ack 1          <nop, nop, sack 0:1>  // <== Here
      
      So in this version, the 'goto consume' is dropped, to always send an ACK
      when switching from TCP_SYN_RECV to TCP_ESTABLISHED. This ACK will be
      seen as a DUPACK -- with DSACK if SACK has been negotiated -- in case of
      simultaneous SYN crossing: that's what is expected here.
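
      The control-flow change can be illustrated with a toy model. It is not
      the kernel code: struct toy_sock, toy_rcv_ack() and the flags are
      hypothetical and only mirror the description above, where the removed
      'goto consume' skipped both the segment text processing (step 7) and
      the ACK check whenever sk->sk_socket was set.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { TOY_SYN_RECV, TOY_ESTABLISHED };

/* Hypothetical, heavily simplified socket: only what the sketch needs. */
struct toy_sock {
	enum toy_state state;
	bool has_sk_socket;  /* stands in for sk->sk_socket being set */
	bool text_processed; /* step 7 (segment text) was reached */
	bool ack_checked;    /* the ACK-sending check was reached */
};

/*
 * Handle the ACK that moves a toy socket from SYN_RECV to ESTABLISHED.
 * skip_when_sk_socket == true models the behaviour this commit removes:
 * jump straight to "consume" when a socket is already attached.
 */
static void toy_rcv_ack(struct toy_sock *sk, bool skip_when_sk_socket)
{
	assert(sk->state == TOY_SYN_RECV);
	sk->state = TOY_ESTABLISHED;

	if (skip_when_sk_socket && sk->has_sk_socket)
		return; /* "goto consume": no text processing, no ACK check */

	sk->text_processed = true;
	sk->ack_checked = true;
}
```

      With the skip enabled, a socket that already has sk_socket set (the
      MP_JOIN and TFO cases described above) reaches neither step; with the
      skip removed, both steps always run on the transition, which in the
      simultaneous SYN crossing case produces the DUPACK.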
      
      Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
      Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
      Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
      Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
      Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
      Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
      Fixes: 23e89e8e ("tcp: Don't drop SYN+ACK for simultaneous connect().")
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://patch.msgid.link/20240724-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v3-1-d48339764ce9@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      c1668292