1. 18 Jan, 2024 5 commits
    • libbpf: warn on unexpected __arg_ctx type when rewriting BTF · 76ec90a9
      Andrii Nakryiko authored
      On kernels that don't support arg:ctx tag, before adjusting global
      subprog BTF information to match kernel's expected canonical type names,
      make sure that types used by user are meaningful, and if not, warn and
      don't do BTF adjustments.
      
      This is similar to checks that kernel performs, but narrower in scope,
      as only a small subset of BPF program types can be accommodated by
      libbpf using canonical type names.
      
      Libbpf unconditionally allows `struct pt_regs *` for perf_event program
      types, unlike kernel, which supports that conditionally on architecture.
      This is done to keep things simple and not cause unnecessary false
      positives. This seems like a minor and harmless deviation, which in
      real-world programs will be caught by kernels with arg:ctx tag support
      anyways. So KISS principle.
      
      This logic is hard to test (especially on latest kernels), so manual
      testing was performed instead. Libbpf emitted the following warning for
      perf_event program with wrong context argument type:
      
        libbpf: prog 'arg_tag_ctx_perf': subprog 'subprog_ctx_tag' arg#0 is expected to be of `struct bpf_perf_event_data *` type

      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-6-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add tests confirming type logic in kernel for __arg_ctx · 989410cd
      Andrii Nakryiko authored
      Add a bunch of global subprogs across variety of program types to
      validate expected kernel type enforcement logic for __arg_ctx arguments.

      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-5-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: enforce types for __arg_ctx-tagged arguments in global subprogs · 0ba97151
      Andrii Nakryiko authored
      Add enforcement of expected types for context arguments tagged with
      arg:ctx (__arg_ctx) tag.
      
      First, any program type will accept generic `void *` context type when
      combined with __arg_ctx tag.
      
      Besides accepting "canonical" struct names and `void *`, for a bunch of
      program types for which program context is actually a named struct, we
      allow a bunch of pragmatic exceptions to match real-world and expected
      usage:
      
        - for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
          canonical context argument type, where `bpf_user_pt_regs_t` is a
          *typedef*, not a struct;
        - for kprobes, we also always accept `struct pt_regs *`, as that's what
          actually is passed as a context to any kprobe program;
        - for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
          down to actual struct type and accept `struct pt_regs *`, or
          `struct user_pt_regs *`, or `struct user_regs_struct *`, depending
          on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
          typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
          expected;
        - for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
          what's expected with BPF_PROG() usage; otherwise, canonical
          `struct bpf_raw_tracepoint_args *` is expected;
        - tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
          formats, both are coded as exceptions as tp_btf is actually a TRACING
          program type, which has no canonical context type;
        - iterator programs accept `struct bpf_iter__xxx *` structs, currently
          with no further iterator-type specific enforcement;
        - fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
        - classic tracepoint programs, as well as syscall and freplace
          programs allow any user-provided type.
      
      In all other cases kernel will enforce exact match of struct name to
      expected canonical type. And if user-provided type doesn't match that
      expectation, verifier will emit helpful message with expected type name.
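
The acceptance rules listed above can be modeled in plain user-space C. This is only a toy sketch under stated assumptions: it compares type names as strings, whereas the kernel inspects BTF types; `enum prog_kind`, `canonical_ctx()` and `ctx_arg_ok()` are hypothetical names invented for illustration; only three program kinds are covered; and `struct xdp_md` as the canonical XDP context type comes from the UAPI headers rather than from the text above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical program kinds for this sketch; these are not kernel enums. */
enum prog_kind { KIND_KPROBE, KIND_RAW_TP, KIND_XDP };

/* Canonical context type name for each modeled program kind. */
static const char *canonical_ctx(enum prog_kind kind)
{
	switch (kind) {
	case KIND_KPROBE: return "bpf_user_pt_regs_t"; /* a typedef, not a struct */
	case KIND_RAW_TP: return "struct bpf_raw_tracepoint_args";
	case KIND_XDP:    return "struct xdp_md";
	}
	return ""; /* unknown kind: nothing matches */
}

/* Returns true if a pointer to `pointee` is acceptable as an __arg_ctx argument. */
static bool ctx_arg_ok(enum prog_kind kind, const char *pointee)
{
	if (strcmp(pointee, "void") == 0)
		return true; /* `void *` is accepted for any program type */
	if (strcmp(pointee, canonical_ctx(kind)) == 0)
		return true; /* exact canonical name */

	switch (kind) {
	case KIND_KPROBE: /* pragmatic exception: what kprobes really receive */
		return strcmp(pointee, "struct pt_regs") == 0;
	case KIND_RAW_TP: /* pragmatic exception: BPF_PROG()-style arguments */
		return strcmp(pointee, "u64") == 0 || strcmp(pointee, "long") == 0;
	default:          /* everything else must match the canonical name */
		return false;
	}
}
```

In this model `ctx_arg_ok(KIND_KPROBE, "struct pt_regs")` is accepted, while the same type for `KIND_XDP` is rejected because XDP has no such exception.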
      
      Note the slightly unnatural way the check is done: after processing all
      the arguments. This is done to avoid conflict between bpf and bpf-next
      trees. Once trees converge, a small follow up patch will place a simple
      btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
      (which bpf-next tree patch refactored already), removing duplicated
      arg:ctx detection logic.

      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: extract bpf_ctx_convert_map logic and make it more reusable · 66967a32
      Andrii Nakryiko authored
      Refactor btf_get_prog_ctx_type() a bit to allow reuse of
      bpf_ctx_convert_map logic in more than one place. Simplify interface by
      returning btf_type instead of btf_member (field reference in BTF).
      
      To do the above we need to touch and start untangling
      btf_translate_to_vmlinux() implementation. We do the bare minimum to
      not regress anything for btf_translate_to_vmlinux(), but its
      implementation is very questionable for what it claims to be doing.
      Mapping kfunc argument types to the corresponding kernel types is
      conceptually quite different from recognizing program context types.
      Fixing this is out of scope for this change, though.

      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • libbpf: feature-detect arg:ctx tag support in kernel · 01b55f4f
      Andrii Nakryiko authored
      Add feature detector of kernel-side arg:ctx (__arg_ctx) tag support. If
      this is detected, libbpf will avoid doing any __arg_ctx-related BTF
      rewriting and checks in favor of letting kernel handle this completely.
      
      test_global_funcs/ctx_arg_rewrite subtest is adjusted to do the same
      feature detection (albeit in a much simpler, though roundabout and
      inefficient, way), and skip the tests. This is done to still be able to
      execute this test on older kernels (like in libbpf CI).
      
      Note, BPF token series ([0]) does a major refactor and code moving of
      libbpf-internal feature detection "framework", so to avoid unnecessary
      conflicts we keep newly added feature detection stand-alone with ad-hoc
      result caching. Once things settle, there will be a small follow up to
      re-integrate everything back and move code into its final place in
      newly-added (by BPF token series) features.c file.
      
        [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=814209&state=*

      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20240118033143.3384355-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 16 Jan, 2024 2 commits
    • selftests/bpf: Add test for alu on PTR_TO_FLOW_KEYS · 33772ff3
      Hao Sun authored
      Add a test case for PTR_TO_FLOW_KEYS alu. It checks that alu with a
      variable offset on flow_keys is rejected. For the fixed offset success
      case, we already have C code coverage to verify (e.g. via bpf_flow.c).

      Signed-off-by: Hao Sun <sunhao.th@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20240115082028.9992-2-sunhao.th@gmail.com
    • bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS · 22c7fa17
      Hao Sun authored
      For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses the fixed off
      for validation. However, variable offset ptr alu is not prohibited
      for this ptr kind, so the variable offset is never checked.
      
      The following prog is accepted:
      
        func#0 @0
        0: R1=ctx() R10=fp0
        0: (bf) r6 = r1                       ; R1=ctx() R6_w=ctx()
        1: (79) r7 = *(u64 *)(r6 +144)        ; R6_w=ctx() R7_w=flow_keys()
        2: (b7) r8 = 1024                     ; R8_w=1024
        3: (37) r8 /= 1                       ; R8_w=scalar()
        4: (57) r8 &= 1024                    ; R8_w=scalar(smin=smin32=0,
        smax=umax=smax32=umax32=1024,var_off=(0x0; 0x400))
        5: (0f) r7 += r8
        mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
        mark_precise: frame0: regs=r8 stack= before 4: (57) r8 &= 1024
        mark_precise: frame0: regs=r8 stack= before 3: (37) r8 /= 1
        mark_precise: frame0: regs=r8 stack= before 2: (b7) r8 = 1024
        6: R7_w=flow_keys(smin=smin32=0,smax=umax=smax32=umax32=1024,var_off
        =(0x0; 0x400)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1024,
        var_off=(0x0; 0x400))
        6: (79) r0 = *(u64 *)(r7 +0)          ; R0_w=scalar()
        7: (95) exit
      
      This prog loads flow_keys to r7, and adds the variable offset r8
      to r7, and finally causes out-of-bounds access:
      
        BUG: unable to handle page fault for address: ffffc90014c80038
        [...]
        Call Trace:
         <TASK>
         bpf_dispatcher_nop_func include/linux/bpf.h:1231 [inline]
         __bpf_prog_run include/linux/filter.h:651 [inline]
         bpf_prog_run include/linux/filter.h:658 [inline]
         bpf_prog_run_pin_on_cpu include/linux/filter.h:675 [inline]
         bpf_flow_dissect+0x15f/0x350 net/core/flow_dissector.c:991
         bpf_prog_test_run_flow_dissector+0x39d/0x620 net/bpf/test_run.c:1359
         bpf_prog_test_run kernel/bpf/syscall.c:4107 [inline]
         __sys_bpf+0xf8f/0x4560 kernel/bpf/syscall.c:5475
         __do_sys_bpf kernel/bpf/syscall.c:5561 [inline]
         __se_sys_bpf kernel/bpf/syscall.c:5559 [inline]
         __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5559
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Fix this by rejecting ptr alu with variable offset on flow_keys.
      Applying the patch rejects the program with "R7 pointer arithmetic
      on flow_keys prohibited".
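
The rule introduced by the fix can be sketched as a small user-space model. It assumes a tracked-number representation where set bits in `mask` are unknown, mirroring the `var_off=(0x0; 0x400)` notation in the log above; `ptr_alu_allowed()` and the `enum reg_type` values are hypothetical stand-ins for illustration, not the verifier's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified register kinds for this sketch only. */
enum reg_type { SCALAR_VALUE, PTR_TO_CTX, PTR_TO_FLOW_KEYS };

/* Tracked number: bits set in `mask` are unknown, the rest equal `value`. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* An offset is constant only if no bit is unknown. */
static bool tnum_is_const(struct tnum t)
{
	return t.mask == 0;
}

/*
 * Model of the fixed rule: pointer arithmetic on flow_keys is allowed
 * only when the added offset is a known constant, because the later
 * access check validates only the fixed offset.
 */
static bool ptr_alu_allowed(enum reg_type ptr_type, struct tnum off)
{
	if (ptr_type == PTR_TO_FLOW_KEYS && !tnum_is_const(off))
		return false; /* "pointer arithmetic on flow_keys prohibited" */
	return true;
}
```

With the offset from the log, `(struct tnum){ 0x0, 0x400 }`, the model rejects `r7 += r8`, while a constant offset such as `(struct tnum){ 8, 0 }` is still accepted.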
      
      Fixes: d58e468b ("flow_dissector: implements flow dissector BPF hook")
      Signed-off-by: Hao Sun <sunhao.th@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20240115082028.9992-1-sunhao.th@gmail.com
  3. 13 Jan, 2024 16 commits
    • Merge branch 'bpf-fix-backward-progress-bug-in-bpf_iter_udp' · 8e33d5db
      Alexei Starovoitov authored
      Martin KaFai Lau says:
      
      ====================
      bpf: Fix backward progress bug in bpf_iter_udp
      
      From: Martin KaFai Lau <martin.lau@kernel.org>
      
      This patch set fixes an issue in bpf_iter_udp that makes backward
      progress and prevents the user space process from finishing. There is
      a test at the end to reproduce the bug.
      
      Please see individual patches for details.
      
      v3:
      - Fixed the iter_fd check and local_port check in the
        patch 3 selftest. (Yonghong)
      - Moved jhash2 to test_jhash.h in the patch 3. (Yonghong)
      - Added explanation in the bucket selection in the patch 3. (Yonghong)
      
      v2:
      - Added patch 1 to fix another bug that goes back to
        the previous bucket
      - Simplify the fix in patch 2 to always reset iter->offset to 0
      - Add a test case to close all udp_sk in a bucket while
        in the middle of the iteration.
      ====================
      
      Link: https://lore.kernel.org/r/20240112190530.3751661-1-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test udp and tcp iter batching · dbd7db77
      Martin KaFai Lau authored
      The patch adds a test to exercise the bpf_iter_udp batching
      logic. It specifically tests the case that there are multiple
      so_reuseport udp_sk in a bucket of the udp_table.
      
      The test creates two sets of so_reuseport sockets, each set on a
      different port, meaning there will be two buckets in the udp_table.
      
      The test does the following:
      1. read() 3 out of 4 sockets in the first bucket.
      2. close() all sockets in the first bucket. This
         will ensure the current bucket's offset in
         the kernel does not affect the read() of the
         following bucket.
      3. read() all 4 sockets in the second bucket.
      
      The test also reads one udp_sk at a time from
      the bpf_iter_udp prog. This is the "true" case in
      "do_test(..., bool onebyone)" and is the buggy case
      that the previous patch fixed.
      
      It also tests the "false" case in "do_test(..., bool onebyone)",
      meaning the userspace reads the whole bucket. There is
      no bug in this case, but the test is added while
      at it.
      
      Considering the way to have multiple tcp_sk in the same
      bucket is similar (by using so_reuseport),
      this patch also tests the bpf_iter_tcp even though the
      bpf_iter_tcp batching logic works correctly.
      
      Both IPv4 and IPv6 exercise the same bpf_iter batching
      code path, so only IPv6 is tested.

      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240112190530.3751661-4-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Avoid iter->offset making backward progress in bpf_iter_udp · 2242fd53
      Martin KaFai Lau authored
      There is a bug in the bpf_iter_udp_batch() function that stops
      the userspace from making forward progress.
      
      The bug is triggered when the userspace passes in
      a very small read buffer. When the bpf prog does bpf_seq_printf,
      the userspace read buffer is not enough to capture the whole bucket.
      
      When the read buffer is not large enough, the kernel will remember
      the offset of the bucket in iter->offset such that the next userspace
      read() can continue from where it left off.
      
      The kernel will skip "iter->offset" sockets in the next read().
      However, the code skips them by directly decrementing the offset
      ("--iter->offset"). This is incorrect because the next read() may
      not consume the whole bucket either and then the next-next read()
      will start from offset 0. The net effect is the userspace will
      keep reading from the beginning of a bucket and the process will
      never finish. "iter->offset" must always go forward until the
      whole bucket is consumed.
      
      This patch fixes it by using a local variable "resume_offset"
      and "resume_bucket". "iter->offset" is always reset to 0 before
      it may be used. "iter->offset" will be advanced to the
      "resume_offset" when it continues from the "resume_bucket" (i.e.
      "state->bucket == resume_bucket"). This brings it closer to
      the bpf_iter_tcp's offset handling which does not suffer
      the same bug.
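
The difference between the two offset-handling schemes can be illustrated with a small user-space simulation. It is a simplified model under stated assumptions, not the kernel code: a single bucket holds `n` sockets, every read() delivers at most `per_read` sockets (`per_read >= 1`), and `struct iter_state`, `read_once()` and `reads_to_finish()` are hypothetical names.

```c
#include <stdbool.h>

struct iter_state {
	int offset; /* models iter->offset for the current bucket */
	bool done;  /* set once the last socket of the bucket was delivered */
};

/* One simulated read(): skip `offset` sockets, then deliver up to `per_read`. */
static void read_once(struct iter_state *it, int n, int per_read, bool fixed)
{
	int skipped = it->offset < n ? it->offset : n;
	int room = n - skipped;
	int delivered = room < per_read ? room : per_read;

	/*
	 * Buggy scheme: skipping counts the offset down to 0, so only the
	 * sockets delivered by this read are remembered.
	 * Fixed scheme: the offset is reset to 0 and counted up to the
	 * resume offset while skipping, so it only ever moves forward.
	 */
	it->offset = (fixed ? skipped : 0) + delivered;
	it->done = skipped + delivered >= n;
}

/* Number of reads needed to consume the bucket, or -1 if it never finishes. */
static int reads_to_finish(int n, int per_read, bool fixed, int max_reads)
{
	struct iter_state it = { 0, false };
	int reads = 0;

	while (!it.done) {
		if (reads == max_reads)
			return -1; /* no forward progress */
		read_once(&it, n, per_read, fixed);
		reads++;
	}
	return reads;
}
```

With four sockets and one socket per read, the fixed scheme finishes in four reads, while the buggy scheme keeps resuming from offset 1 and hits the `max_reads` cap.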
      
      Cc: Aditi Ghag <aditi.ghag@isovalent.com>
      Fixes: c96dac8d ("bpf: udp: Implement batching for sockets iterator")
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20240112190530.3751661-3-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket · 19ca0823
      Martin KaFai Lau authored
      The current logic is to use a default size 16 to batch the whole bucket.
      If it is too small, it will retry with a larger batch size.
      
      The current code accidentally does a state->bucket-- before retrying.
      This goes back to retry with the previous bucket which has already
      been done. This patch fixes it.
      
      It is hard to create a selftest. I added a WARN_ON(state->bucket < 0),
      forced a particular port to be hashed to the first bucket,
      created >16 sockets, and observed the for-loop went back
      to the "-1" bucket.
      
      Cc: Aditi Ghag <aditi.ghag@isovalent.com>
      Fixes: c96dac8d ("bpf: udp: Implement batching for sockets iterator")
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com>
      Link: https://lore.kernel.org/r/20240112190530.3751661-2-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: netdev_queue: netdev_txq_completed_mb(): fix wake condition · 894d7508
      Marc Kleine-Budde authored
      netif_txq_try_stop() uses "get_desc >= start_thrs" as the check for
      the call to netif_tx_start_queue().
      
      Use ">=" in netdev_txq_completed_mb(), too.
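
The boundary case that this change affects can be shown with two tiny predicates. This is a sketch only: it assumes the earlier comparison was a strict ">", which the message above does not state explicitly, and `wake_strict()` / `wake_inclusive()` are hypothetical helpers rather than kernel macros.

```c
#include <stdbool.h>

/* Assumed earlier behaviour: wake only when strictly above the threshold. */
static bool wake_strict(int free_desc, int start_thrs)
{
	return free_desc > start_thrs;
}

/* Behaviour matching the "get_desc >= start_thrs" check on the stop path. */
static bool wake_inclusive(int free_desc, int start_thrs)
{
	return free_desc >= start_thrs;
}
```

The two predicates differ only when the number of free descriptors equals the threshold exactly: the inclusive check wakes the queue there, the strict one does not.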
      
      Fixes: c91c46de ("net: provide macros for commonly copied lockless queue stop/wake code")
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add more sanity check in virtio_net_hdr_to_skb() · 9181d6f8
      Eric Dumazet authored
      syzbot/KMSAN reports access to uninitialized data from gso_features_check() [1]
      
      The repro uses af_packet, injecting a gso packet with hdrlen == 0.
      
      We could fix the issue making gso_features_check() more careful
      while dealing with NETIF_F_TSO_MANGLEID in fast path.
      
      Or we can make sure virtio_net_hdr_to_skb() pulls minimal network and
      transport headers as intended.
      
      Note that for GSO packets coming from untrusted sources, SKB_GSO_DODGY
      bit forces a proper header validation (and pull) before the packet can
      hit any device ndo_start_xmit(), thus we do not need a precise dissection
      at virtio_net_hdr_to_skb() stage.
      
      [1]
      BUG: KMSAN: uninit-value in skb_gso_segment include/net/gso.h:83 [inline]
      BUG: KMSAN: uninit-value in validate_xmit_skb+0x10f2/0x1930 net/core/dev.c:3629
       skb_gso_segment include/net/gso.h:83 [inline]
       validate_xmit_skb+0x10f2/0x1930 net/core/dev.c:3629
       __dev_queue_xmit+0x1eac/0x5130 net/core/dev.c:4341
       dev_queue_xmit include/linux/netdevice.h:3134 [inline]
       packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
       packet_snd net/packet/af_packet.c:3087 [inline]
       packet_sendmsg+0x8b1d/0x9f30 net/packet/af_packet.c:3119
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       ____sys_sendmsg+0x9c2/0xd60 net/socket.c:2584
       ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2638
       __sys_sendmsg net/socket.c:2667 [inline]
       __do_sys_sendmsg net/socket.c:2676 [inline]
       __se_sys_sendmsg net/socket.c:2674 [inline]
       __x64_sys_sendmsg+0x307/0x490 net/socket.c:2674
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Uninit was created at:
       slab_post_alloc_hook+0x129/0xa70 mm/slab.h:768
       slab_alloc_node mm/slub.c:3478 [inline]
       kmem_cache_alloc_node+0x5e9/0xb10 mm/slub.c:3523
       kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:560
       __alloc_skb+0x318/0x740 net/core/skbuff.c:651
       alloc_skb include/linux/skbuff.h:1286 [inline]
       alloc_skb_with_frags+0xc8/0xbd0 net/core/skbuff.c:6334
       sock_alloc_send_pskb+0xa80/0xbf0 net/core/sock.c:2780
       packet_alloc_skb net/packet/af_packet.c:2936 [inline]
       packet_snd net/packet/af_packet.c:3030 [inline]
       packet_sendmsg+0x70e8/0x9f30 net/packet/af_packet.c:3119
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       ____sys_sendmsg+0x9c2/0xd60 net/socket.c:2584
       ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2638
       __sys_sendmsg net/socket.c:2667 [inline]
       __do_sys_sendmsg net/socket.c:2676 [inline]
       __se_sys_sendmsg net/socket.c:2674 [inline]
       __x64_sys_sendmsg+0x307/0x490 net/socket.c:2674
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      CPU: 0 PID: 5025 Comm: syz-executor279 Not tainted 6.7.0-rc7-syzkaller-00003-gfbafc3e6 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      
      Reported-by: syzbot+7f4d0ea3df4d4fa9a65f@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/netdev/0000000000005abd7b060eb160cd@google.com/
      Fixes: 9274124f ("net: stricter validation of untrusted gso packets")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: track device in tcf_block_get/put_ext() only for clsact binder types · e18405d0
      Jiri Pirko authored
      Clsact/ingress qdisc is not the only one using shared block,
      red is also using it. The device tracking was originally introduced
      by commit 913b47d3 ("net/sched: Introduce tc block netdev
      tracking infra") for clsact/ingress only. Commit 94e2557d ("net:
      sched: move block device tracking into tcf_block_get/put_ext()")
      mistakenly enabled that for red as well.
      
      Fix that by adding a check for the binder type being clsact when adding
      device to the block->ports xarray.

      Reported-by: Ido Schimmel <idosch@idosch.org>
      Closes: https://lore.kernel.org/all/ZZ6JE0odnu1lLPtu@shredder/
      Fixes: 94e2557d ("net: sched: move block device tracking into tcf_block_get/put_ext()")
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: annotate data-races around up->pending · 482521d8
      Eric Dumazet authored
      up->pending can be read without holding the socket lock,
      as pointed out by syzbot [1]
      
      Add READ_ONCE() in lockless contexts, and WRITE_ONCE()
      on write side.
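
As a rough user-space illustration of the annotation pattern, the sketch below defines simplified stand-ins for READ_ONCE()/WRITE_ONCE() using volatile accesses. The kernel's real macros are more elaborate; the versions here rely on the GCC/Clang `__typeof__` extension, and `struct fake_udp_sock`, `set_pending()` and `peek_pending()` are hypothetical names.

```c
/* Simplified stand-ins: force exactly one load or store of the object. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical miniature of a socket with a `pending` field. */
struct fake_udp_sock {
	int pending;
};

/* Write side: performed while holding the socket lock in the real code. */
static void set_pending(struct fake_udp_sock *up, int family)
{
	WRITE_ONCE(up->pending, family);
}

/* Lockless read side: the value may change concurrently, so read it once. */
static int peek_pending(struct fake_udp_sock *up)
{
	return READ_ONCE(up->pending);
}
```

The annotations do not add locking; they mark the accesses as intentionally racy and keep the compiler from tearing, repeating or merging them.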
      
      [1]
      BUG: KCSAN: data-race in udpv6_sendmsg / udpv6_sendmsg
      
      write to 0xffff88814e5eadf0 of 4 bytes by task 15547 on cpu 1:
       udpv6_sendmsg+0x1405/0x1530 net/ipv6/udp.c:1596
       inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       __sys_sendto+0x257/0x310 net/socket.c:2192
       __do_sys_sendto net/socket.c:2204 [inline]
       __se_sys_sendto net/socket.c:2200 [inline]
       __x64_sys_sendto+0x78/0x90 net/socket.c:2200
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      read to 0xffff88814e5eadf0 of 4 bytes by task 15551 on cpu 0:
       udpv6_sendmsg+0x22c/0x1530 net/ipv6/udp.c:1373
       inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2586
       ___sys_sendmsg net/socket.c:2640 [inline]
       __sys_sendmmsg+0x269/0x500 net/socket.c:2726
       __do_sys_sendmmsg net/socket.c:2755 [inline]
       __se_sys_sendmmsg net/socket.c:2752 [inline]
       __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2752
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      value changed: 0x00000000 -> 0x0000000a
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 15551 Comm: syz-executor.1 Tainted: G        W          6.7.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot+8d482d0e407f665d9d10@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/netdev/0000000000009e46c3060ebcdffd@google.com/
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: Fix ethool link settings ops for integrated PCS · 08300ada
      Sneh Shah authored
      Currently get/set_link_ksettings ethtool ops are dependent on PCS.
      When PCS is integrated, it will not have separate link config.
      Bypass configuring and checking PCS for integrated PCS.
      
      Fixes: aa571b62 ("net: stmmac: add new switch to struct plat_stmmacenet_data")
      Tested-by: Andrew Halaney <ahalaney@redhat.com> # sa8775p-ride
      Signed-off-by: Sneh Shah <quic_snehshah@quicinc.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mptcp-better-validation-of-mptcpopt_mp_join-option' · 68d27bad
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      mptcp: better validation of MPTCPOPT_MP_JOIN option
      
      Based on a syzbot report (see 4th patch in the series).
      
      We need to be more explicit about which one of the
      following flags is set by mptcp_parse_option():
      
      - OPTION_MPTCP_MPJ_SYN
      - OPTION_MPTCP_MPJ_SYNACK
      - OPTION_MPTCP_MPJ_ACK
      
      Then select the appropriate values instead of OPTIONS_MPTCP_MPJ
      
      Paolo suggested to do the same for OPTIONS_MPTCP_MPC (5th patch)
      ====================
      
      Link: https://lore.kernel.org/r/20240111194917.4044654-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: refine opt_mp_capable determination · 724b00c1
      Eric Dumazet authored
      OPTIONS_MPTCP_MPC is a combination of three flags.
      
      It would be better to be strict about testing what
      flag is expected, at least for code readability.
      
      mptcp_parse_option() already makes the distinction.
      
      - subflow_check_req() should use OPTION_MPTCP_MPC_SYN.
      
      - mptcp_subflow_init_cookie_req() should use OPTION_MPTCP_MPC_ACK.
      
      - subflow_finish_connect() should use OPTION_MPTCP_MPC_SYNACK
      
      - subflow_syn_recv_sock should use OPTION_MPTCP_MPC_ACK
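
Why testing the combined mask is too loose can be seen in a short sketch. The flag names come from the text above; the bit positions are illustrative assumptions, and `is_mpc_syn_loose()` / `is_mpc_syn_strict()` are hypothetical helpers, not functions from net/mptcp.

```c
#include <stdbool.h>

/* Bit positions are assumptions made for this sketch. */
#define OPTION_MPTCP_MPC_SYN    (1u << 0)
#define OPTION_MPTCP_MPC_SYNACK (1u << 1)
#define OPTION_MPTCP_MPC_ACK    (1u << 2)

/* Combination of the three flags, as described above. */
#define OPTIONS_MPTCP_MPC (OPTION_MPTCP_MPC_SYN | \
			   OPTION_MPTCP_MPC_SYNACK | \
			   OPTION_MPTCP_MPC_ACK)

/* Loose test: true for any MP_CAPABLE variant, even an unexpected one. */
static bool is_mpc_syn_loose(unsigned int suboptions)
{
	return (suboptions & OPTIONS_MPTCP_MPC) != 0;
}

/* Strict test: true only for the variant expected on a SYN. */
static bool is_mpc_syn_strict(unsigned int suboptions)
{
	return (suboptions & OPTION_MPTCP_MPC_SYN) != 0;
}
```

If the parser reported `OPTION_MPTCP_MPC_ACK`, the loose test would still pass on a path that expects a SYN, whereas the strict test rejects it.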

      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Fixes: 74c7dfbe ("mptcp: consolidate in_opt sub-options fields in a bitmask")
      Link: https://lore.kernel.org/r/20240111194917.4044654-6-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: use OPTION_MPTCP_MPJ_SYN in subflow_check_req() · 66ff70df
      Eric Dumazet authored
      syzbot reported that subflow_check_req() was using uninitialized
      data [1]
      
      This is because mp_opt.token is only set when OPTION_MPTCP_MPJ_SYN is also set.
      
      While we are at it, fix mptcp_subflow_init_cookie_req()
      to test for OPTION_MPTCP_MPJ_ACK.
      
      [1]
      
      BUG: KMSAN: uninit-value in subflow_token_join_request net/mptcp/subflow.c:91 [inline]
       BUG: KMSAN: uninit-value in subflow_check_req+0x1028/0x15d0 net/mptcp/subflow.c:209
        subflow_token_join_request net/mptcp/subflow.c:91 [inline]
        subflow_check_req+0x1028/0x15d0 net/mptcp/subflow.c:209
        subflow_v6_route_req+0x269/0x410 net/mptcp/subflow.c:367
        tcp_conn_request+0x153a/0x4240 net/ipv4/tcp_input.c:7164
       subflow_v6_conn_request+0x3ee/0x510
        tcp_rcv_state_process+0x2e1/0x4ac0 net/ipv4/tcp_input.c:6659
        tcp_v6_do_rcv+0x11bf/0x1fe0 net/ipv6/tcp_ipv6.c:1669
        tcp_v6_rcv+0x480b/0x4fb0 net/ipv6/tcp_ipv6.c:1900
        ip6_protocol_deliver_rcu+0xda6/0x2a60 net/ipv6/ip6_input.c:438
        ip6_input_finish net/ipv6/ip6_input.c:483 [inline]
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ip6_input+0x15d/0x430 net/ipv6/ip6_input.c:492
        dst_input include/net/dst.h:461 [inline]
        ip6_rcv_finish+0x5db/0x870 net/ipv6/ip6_input.c:79
        NF_HOOK include/linux/netfilter.h:314 [inline]
        ipv6_rcv+0xda/0x390 net/ipv6/ip6_input.c:310
        __netif_receive_skb_one_core net/core/dev.c:5532 [inline]
        __netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5646
        netif_receive_skb_internal net/core/dev.c:5732 [inline]
        netif_receive_skb+0x58/0x660 net/core/dev.c:5791
        tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555
        tun_get_user+0x53af/0x66d0 drivers/net/tun.c:2002
        tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
        call_write_iter include/linux/fs.h:2020 [inline]
        new_sync_write fs/read_write.c:491 [inline]
        vfs_write+0x8ef/0x1490 fs/read_write.c:584
        ksys_write+0x20f/0x4c0 fs/read_write.c:637
        __do_sys_write fs/read_write.c:649 [inline]
        __se_sys_write fs/read_write.c:646 [inline]
        __x64_sys_write+0x93/0xd0 fs/read_write.c:646
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Local variable mp_opt created at:
        subflow_check_req+0x6d/0x15d0 net/mptcp/subflow.c:145
        subflow_v6_route_req+0x269/0x410 net/mptcp/subflow.c:367
      
      CPU: 1 PID: 5924 Comm: syz-executor.3 Not tainted 6.7.0-rc8-syzkaller-00055-g5eff55d7 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      
      Fixes: f296234c ("mptcp: Add handling of incoming MP_JOIN requests")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Peter Krystad <peter.krystad@linux.intel.com>
      Cc: Matthieu Baerts <matttbe@kernel.org>
      Cc: Mat Martineau <martineau@kernel.org>
      Cc: Geliang Tang <geliang.tang@linux.dev>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Link: https://lore.kernel.org/r/20240111194917.4044654-5-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      66ff70df
    • Eric Dumazet's avatar
      mptcp: use OPTION_MPTCP_MPJ_SYNACK in subflow_finish_connect() · be1d9d9d
      Eric Dumazet authored
      subflow_finish_connect() uses four fields (backup, join_id, thmac, nonce)
      that may contain garbage unless OPTION_MPTCP_MPJ_SYNACK has been set
      in mptcp_parse_option().
      
      Fixes: f296234c ("mptcp: Add handling of incoming MP_JOIN requests")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Peter Krystad <peter.krystad@linux.intel.com>
      Cc: Matthieu Baerts <matttbe@kernel.org>
      Cc: Mat Martineau <martineau@kernel.org>
      Cc: Geliang Tang <geliang.tang@linux.dev>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Link: https://lore.kernel.org/r/20240111194917.4044654-4-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      be1d9d9d
    • Eric Dumazet's avatar
      mptcp: strict validation before using mp_opt->hmac · c1665273
      Eric Dumazet authored
      mp_opt->hmac contains uninitialized data unless OPTION_MPTCP_MPJ_ACK
      was set in mptcp_parse_option().
      
      We must refine the condition before we call subflow_hmac_valid().
      
      Fixes: f296234c ("mptcp: Add handling of incoming MP_JOIN requests")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Peter Krystad <peter.krystad@linux.intel.com>
      Cc: Matthieu Baerts <matttbe@kernel.org>
      Cc: Mat Martineau <martineau@kernel.org>
      Cc: Geliang Tang <geliang.tang@linux.dev>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Link: https://lore.kernel.org/r/20240111194917.4044654-3-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c1665273
    • Eric Dumazet's avatar
      mptcp: mptcp_parse_option() fix for MPTCPOPT_MP_JOIN · 89e23277
      Eric Dumazet authored
      mptcp_parse_option() currently sets OPTIONS_MPTCP_MPJ for all three
      possible cases handled for the MPTCPOPT_MP_JOIN option.
      
      OPTIONS_MPTCP_MPJ is the combination of three flags:
      - OPTION_MPTCP_MPJ_SYN
      - OPTION_MPTCP_MPJ_SYNACK
      - OPTION_MPTCP_MPJ_ACK
      
      This is a problem, because backup, join_id, token, nonce and/or hmac fields
      could be left uninitialized in some cases.
      
      Distinguish the three cases, as following patches will need this step.
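
      As a rough illustration of distinguishing the three cases, the sketch
      below maps an MP_JOIN option length to exactly one flag instead of the
      combined mask. The bit positions and the helper mpj_flag_for_len() are
      hypothetical, and the option lengths (12, 16 and 24 bytes) are assumed
      from the MP_JOIN layouts in RFC 8684. This is not the code from the patch.

```c
#include <stdint.h>

/* Illustrative bit positions only; the kernel's own values may differ. */
#define OPTION_MPTCP_MPJ_SYN    (1u << 0)
#define OPTION_MPTCP_MPJ_SYNACK (1u << 1)
#define OPTION_MPTCP_MPJ_ACK    (1u << 2)

/* MP_JOIN option lengths, assumed from the RFC 8684 option layouts. */
#define MPJ_SYN_LEN    12
#define MPJ_SYNACK_LEN 16
#define MPJ_ACK_LEN    24

/*
 * Hypothetical helper: return only the flag for the variant that was
 * actually parsed, or 0 when the length matches no MP_JOIN variant.
 * A caller can then trust that only the fields belonging to that
 * variant were initialized.
 */
static uint32_t mpj_flag_for_len(uint8_t opsize)
{
	switch (opsize) {
	case MPJ_SYN_LEN:
		return OPTION_MPTCP_MPJ_SYN;     /* backup, join_id, token, nonce */
	case MPJ_SYNACK_LEN:
		return OPTION_MPTCP_MPJ_SYNACK;  /* backup, join_id, thmac, nonce */
	case MPJ_ACK_LEN:
		return OPTION_MPTCP_MPJ_ACK;     /* hmac */
	default:
		return 0;
	}
}
```

      With one flag per variant, a later check such as "was the SYN/ACK form
      seen?" no longer passes for a plain SYN or ACK.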
      
      Fixes: f296234c ("mptcp: Add handling of incoming MP_JOIN requests")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Peter Krystad <peter.krystad@linux.intel.com>
      Cc: Matthieu Baerts <matttbe@kernel.org>
      Cc: Mat Martineau <martineau@kernel.org>
      Cc: Geliang Tang <geliang.tang@linux.dev>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Link: https://lore.kernel.org/r/20240111194917.4044654-2-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      89e23277
    • Dmitry Antipov's avatar
      net: liquidio: fix clang-specific W=1 build warnings · cbdd50ec
      Dmitry Antipov authored
      When compiling with clang-18 and W=1, I've noticed the following
      warnings:
      
      drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c:1493:16: warning: cast
      from 'void (*)(struct octeon_device *, struct octeon_mbox_cmd *, void *)' to
      'octeon_mbox_callback_t' (aka 'void (*)(void *, void *, void *)') converts to
      incompatible function type [-Wcast-function-type-strict]
       1493 |         mbox_cmd.fn = (octeon_mbox_callback_t)cn23xx_get_vf_stats_callback;
            |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      and:
      
      drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c:432:16: warning: cast
      from 'void (*)(struct octeon_device *, struct octeon_mbox_cmd *, void *)' to
      'octeon_mbox_callback_t' (aka 'void (*)(void *, void *, void *)') converts to
      incompatible function type [-Wcast-function-type-strict]
        432 |         mbox_cmd.fn = (octeon_mbox_callback_t)octeon_pfvf_hs_callback;
            |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fix both of the above by adjusting 'octeon_mbox_callback_t' to match actual
      callback definitions (at the cost of adding an extra forward declaration).
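
      The idea of making the callback typedef match the real callback
      signature can be sketched as follows. All names here (demo_device,
      demo_mbox_cmd, demo_mbox_callback_t and the functions) are hypothetical
      stand-ins, not the liquidio definitions. The forward declaration is
      needed because the typedef names the command struct before the struct
      is defined.

```c
struct demo_device { int id; };
struct demo_mbox_cmd;  /* forward declaration so the typedef can name it */

/* Callback type that matches the real callback signature exactly. */
typedef void (*demo_mbox_callback_t)(struct demo_device *,
				     struct demo_mbox_cmd *, void *);

struct demo_mbox_cmd {
	demo_mbox_callback_t fn;
	void *fn_arg;
};

/* A callback with typed parameters rather than three void pointers. */
static void demo_stats_callback(struct demo_device *dev,
				struct demo_mbox_cmd *cmd, void *arg)
{
	(void)cmd;
	*(int *)arg = dev->id;
}

/* Register and invoke the callback; returns the value it stored. */
static int demo_dispatch(int id)
{
	struct demo_device dev = { .id = id };
	struct demo_mbox_cmd cmd;
	int result = -1;

	cmd.fn = demo_stats_callback;  /* no cast needed: the types match */
	cmd.fn_arg = &result;
	cmd.fn(&dev, &cmd, cmd.fn_arg);
	return result;
}
```

      Because the assignment needs no cast, the compiler can check every
      callback against the typedef, which is what the strict warning asks for.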
      Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240111162432.124014-1-dmantipov@yandex.ru
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cbdd50ec
  4. 12 Jan, 2024 17 commits
    • Jakub Kicinski's avatar
      net: fill in MODULE_DESCRIPTION()s for wx_lib · 907ee668
      Jakub Kicinski authored
      W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
      Add a description to Wangxun's common code lib.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      907ee668
    • Claudiu Beznea's avatar
      net: phy: micrel: populate .soft_reset for KSZ9131 · e398822c
      Claudiu Beznea authored
      The RZ/G3S SMARC Module has 2 KSZ9131 PHYs. In this setup, the KSZ9131 PHY
      is used with the ravb Ethernet driver. It has been discovered that when
      bringing the Ethernet interface down/up continuously, e.g., with the
      following sh script:
      
      $ while :; do ifconfig eth0 down; ifconfig eth0 up; done
      
      the link speed and duplex are wrong after interrupting the bring down/up
      operation even though the Ethernet interface is up. To recover from this
      state the following configuration sequence is necessary (executed
      manually):
      
      $ ifconfig eth0 down
      $ ifconfig eth0 up
      
      The behavior has been identified also on the Microchip SAMA7G5-EK board
      which runs the macb driver and uses the same PHY.
      
      The order of PHY-related operations in ravb_open() is as follows:
      ravb_open() ->
        ravb_phy_start() ->
          ravb_phy_init() ->
            of_phy_connect() ->
              phy_connect_direct() ->
                phy_attach_direct() ->
                  phy_init_hw() ->
                    phydev->drv->soft_reset()
                    phydev->drv->config_init()
                    phydev->drv->config_intr()
                  phy_resume()
                    kszphy_resume()
      
      The order of PHY-related operations in ravb_close is as follows:
      ravb_close() ->
        phy_stop() ->
          phy_suspend() ->
            kszphy_suspend() ->
              genphy_suspend()
                // set BMCR_PDOWN bit in MII_BMCR
      
      In genphy_suspend(), setting the BMCR_PDOWN bit in MII_BMCR switches the PHY
      to Software Power-Down (SPD) mode (according to the KSZ9131 datasheet).
      Thus, when opening the interface after it has been previously closed (via
      ravb_close()), the phydev->drv->config_init() and
      phydev->drv->config_intr() calls reach the KSZ9131 PHY driver via the
      ksz9131_config_init() and kszphy_config_intr() functions.
      
      KSZ9131 specifies that the MII management interface remains operational
      during SPD (Software Power-Down), but (according to the manual):
      - Only access to the standard registers (0 through 31) is supported.
      - Access to MMD address spaces other than MMD address space 1 is possible
        if the spd_clock_gate_override bit is set.
      - Access to MMD address space 1 is not possible.
      
      The spd_clock_gate_override bit is not used in the KSZ9131 driver.
      
      ksz9131_config_init() configures RGMII delay, pad skews and LEDs by
      accessing MMD registers other than those in address space 1.
      
      The datasheet for the KSZ9131 does not specify what happens if registers
      from an unsupported address space are accessed while the PHY is in SPD.
      
      To fix the issue the .soft_reset method has been instantiated for KSZ9131,
      too. This resets the PHY to the default state before doing any
      configurations to it, thus switching it out of SPD.
      
      Fixes: bff5b4b3 ("net: phy: micrel: add Microchip KSZ9131 initial driver")
      Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
      Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e398822c
    • Horatiu Vultur's avatar
      net: micrel: Fix PTP frame parsing for lan8841 · acd66c21
      Horatiu Vultur authored
      The HW can check, for each frame, whether it is a PTP frame, which
      domain it belongs to, which PTP frame type it is, and which IP addresses
      it carries. If one of these checks fails, the frame is not
      timestamped. Most of these checks were disabled, except the check on the
      minorVersionPTP field inside the PTP header. As a result, once a partner
      sent a frame compliant with 802.1AS, which has minorVersionPTP set to 1,
      the frame was not timestamped because the HW expected by default a value
      of 0 in minorVersionPTP.
      Fix this issue by removing this check so that userspace can decide on this.
      
      Fixes: cafc3662 ("net: micrel: Add PHC support for lan8841")
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Divya Koppera <divya.koppera@microchip.com>
      Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      acd66c21
    • Taehee Yoo's avatar
      amt: do not use overwrapped cb area · bec161ad
      Taehee Yoo authored
      The amt driver uses skb->cb to store tunnel information.
      The information is stored before the TC layer runs, and the amt driver
      loads it from skb->cb after the TC layer.
      So its cb area must not overlap the cb area used by TC.
      To stay clear of TC's cb area, amt skips the biggest cb structure used
      by TC, which used to be qdisc_skb_cb.
      That is no longer the case.
      Currently, the biggest TC cb structure is tc_skb_cb.
      So amt should skip the size of tc_skb_cb instead of qdisc_skb_cb.
      
      Fixes: ec624fe7 ("net/sched: Extend qdisc control block with tc control block")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20240107144241.4169520-1-ap420073@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bec161ad
    • Jakub Kicinski's avatar
      Merge branch 'net-ethernet-ti-am65-cpsw-allow-for-mtu-values' · 66cee759
      Jakub Kicinski authored
      Sanjuán García, Jorge says:
      
      ====================
      net: ethernet: ti: am65-cpsw: Allow for MTU values
      
      The am65-cpsw-nuss driver has a fixed definition for the maximum ethernet
      frame length of 1522 bytes (AM65_CPSW_MAX_PACKET_SIZE). This limits the switch
      ports to only operate at a maximum MTU of 1500 bytes. When combining this CPSW
      switch with a DSA switch connected to one of its ports this limitation shows up.
      The extra 8 bytes the DSA subsystem adds internally to the ethernet frame
      create resulting frames bigger than 1522 bytes (1518 for non VLAN + 8 for DSA
      stuff) so they get dropped by the switch.
      
      One of the issues with the am65-cpsw-nuss driver is that the network device
      max_mtu was being set to the same fixed value defined for the max total frame
      length (1522 bytes). This makes the DSA subsystem believe that the MTU of the
      interface can be set to 1508 bytes to make room for the extra 8 bytes of the DSA
      headers. However, all packets created assuming the 1500 bytes payload get
      dropped by the switch as oversized.
      
      This series offers a solution to this problem. The max_mtu advertised on the
      network device and the actual max frame size configured on the switch registers
      are made consistent by leaving the extra room needed for the ethernet headers
      and the frame checksum (22 bytes including VLAN).
      ====================
      
      Link: https://lore.kernel.org/r/20240105085530.14070-1-jorge.sanjuangarcia@duagon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      66cee759
    • Sanjuán García, Jorge's avatar
      net: ethernet: ti: am65-cpsw: Fix max mtu to fit ethernet frames · 64e47d8a
      Sanjuán García, Jorge authored
      The value of AM65_CPSW_MAX_PACKET_SIZE represents the maximum length
      of a received frame. This value is written to the register
      AM65_CPSW_PORT_REG_RX_MAXLEN.
      
      The maximum MTU configured on the network device should then leave
      some room for the ethernet headers and frame check. Otherwise, if
      the network interface is configured to its maximum mtu possible,
      the frames will be larger than AM65_CPSW_MAX_PACKET_SIZE and will
      get dropped as oversized.
      
      The switch supports ethernet frame sizes between 64 and 2024 bytes
      (including VLAN) as stated in the technical reference manual, so
      define AM65_CPSW_MAX_PACKET_SIZE with that maximum size.
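
      The relationship between the maximum frame length and the largest
      usable MTU can be worked through with a small sketch. The header sizes
      below are the usual Ethernet values (14-byte header, 4-byte VLAN tag,
      4-byte FCS) and the names are hypothetical. Whether the driver computes
      max_mtu in exactly this way is an assumption.

```c
/* Usual Ethernet overhead sizes; hypothetical names for this sketch. */
#define DEMO_ETH_HLEN     14  /* destination MAC + source MAC + EtherType */
#define DEMO_VLAN_HLEN     4  /* one 802.1Q tag */
#define DEMO_ETH_FCS_LEN   4  /* frame check sequence */

/*
 * Largest MTU (payload size) that still fits in a frame of at most
 * max_frame_len bytes once the 22 bytes of overhead are added.
 */
static int demo_max_mtu(int max_frame_len)
{
	return max_frame_len -
	       (DEMO_ETH_HLEN + DEMO_VLAN_HLEN + DEMO_ETH_FCS_LEN);
}
```

      Under these assumptions a 1522-byte frame limit leaves room for a
      1500-byte MTU, while a 2024-byte limit leaves room for 2002 bytes, which
      is enough for the extra 8 bytes a DSA tag adds.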
      
      Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
      Signed-off-by: Jorge Sanjuan Garcia <jorge.sanjuangarcia@duagon.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
      Link: https://lore.kernel.org/r/20240105085530.14070-2-jorge.sanjuangarcia@duagon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      64e47d8a
    • Zhu Yanjun's avatar
      virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings · e3fe8d28
      Zhu Yanjun authored
      Fix the warnings when building virtio_net driver.
      
      "
      drivers/net/virtio_net.c: In function ‘init_vqs’:
      drivers/net/virtio_net.c:4551:48: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 10 [-Wformat-overflow=]
       4551 |                 sprintf(vi->rq[i].name, "input.%d", i);
            |                                                ^~
      In function ‘virtnet_find_vqs’,
          inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
      drivers/net/virtio_net.c:4551:41: note: directive argument in the range [-2147483643, 65534]
       4551 |                 sprintf(vi->rq[i].name, "input.%d", i);
            |                                         ^~~~~~~~~~
      drivers/net/virtio_net.c:4551:17: note: ‘sprintf’ output between 8 and 18 bytes into a destination of size 16
       4551 |                 sprintf(vi->rq[i].name, "input.%d", i);
            |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/net/virtio_net.c: In function ‘init_vqs’:
      drivers/net/virtio_net.c:4552:49: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 9 [-Wformat-overflow=]
       4552 |                 sprintf(vi->sq[i].name, "output.%d", i);
            |                                                 ^~
      In function ‘virtnet_find_vqs’,
          inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
      drivers/net/virtio_net.c:4552:41: note: directive argument in the range [-2147483643, 65534]
       4552 |                 sprintf(vi->sq[i].name, "output.%d", i);
            |                                         ^~~~~~~~~~~
      drivers/net/virtio_net.c:4552:17: note: ‘sprintf’ output between 9 and 19 bytes into a destination of size 16
       4552 |                 sprintf(vi->sq[i].name, "output.%d", i);
      
      "
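
      The commit message does not say how the warning was silenced. One
      generic way to avoid this class of overflow, shown here only as a
      sketch and not as the change made in the patch, is to format with
      snprintf() and check for truncation. The helper names are hypothetical;
      the 16-byte buffer size is taken from the warning text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format "<prefix>.<idx>" into buf. Returns true only if the whole
 * string, including the terminating NUL, fit into len bytes.
 */
static bool demo_format_queue_name(char *buf, size_t len,
				   const char *prefix, int idx)
{
	int n = snprintf(buf, len, "%s.%d", prefix, idx);

	return n >= 0 && (size_t)n < len;
}

/* Convenience wrapper using the 16-byte size reported in the warning. */
static bool demo_fits_in_16(const char *prefix, int idx)
{
	char name[16];

	return demo_format_queue_name(name, sizeof(name), prefix, idx);
}
```

      The compiler warns because it must assume the full int range for the
      index: "output." plus an 11-character negative number needs 19 bytes
      with the terminator, more than the 16 available.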
      Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
      Link: https://lore.kernel.org/r/20240104020902.2753599-1-yanjun.zhu@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e3fe8d28
    • Nithin Dabilpuram's avatar
      octeontx2-af: CN10KB: Fix FIFO length calculation for RPM2 · a0cb76a7
      Nithin Dabilpuram authored
      RPM0 and RPM1 on the CN10KB SoC have 8 LMACs each, whereas RPM2
      has only 4 LMACs. Similarly, the RPM0 and RPM1 have 256KB FIFO,
      whereas RPM2 has 128KB FIFO. This patch fixes an issue with
      improper TX credit programming for the RPM2 link.
      
      Fixes: b9d0fedc ("octeontx2-af: cn10kb: Add RPM_USX MAC support")
      Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
      Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240108073036.8766-1-naveenm@marvell.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a0cb76a7
    • Jakub Kicinski's avatar
      Merge branch 'rtnetlink-allow-to-enslave-with-one-msg-an-up-interface' · 3722a987
      Jakub Kicinski authored
      Nicolas Dichtel says:
      
      ====================
      rtnetlink: allow to enslave with one msg an up interface
      
      The first patch fixes a regression, introduced in linux v6.1, by reverting
      a patch. The second patch adds a test to verify this API.
      ====================
      
      Link: https://lore.kernel.org/r/20240108094103.2001224-1-nicolas.dichtel@6wind.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3722a987
    • Nicolas Dichtel's avatar
      selftests: rtnetlink: check enslaving iface in a bond · a159cbe8
      Nicolas Dichtel authored
      The goal is to check the following two sequences:
      > ip link set dummy0 up
      > ip link set dummy0 master bond0 down
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20240108094103.2001224-3-nicolas.dichtel@6wind.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a159cbe8
    • Nicolas Dichtel's avatar
      Revert "net: rtnetlink: Enslave device before bringing it up" · ec4ffd10
      Nicolas Dichtel authored
      This reverts commit a4abfa62.
      
      The patch broke:
      > ip link set dummy0 up
      > ip link set dummy0 master bond0 down
      
      This last command is useful to be able to enslave an interface with only
      one netlink message.
      
      After discussion, there is no good reason to support:
      > ip link set dummy0 down
      > ip link set dummy0 master bond0 up
      because the bond interface already set the slave up when it is up.
      
      Cc: stable@vger.kernel.org
      Fixes: a4abfa62 ("net: rtnetlink: Enslave device before bringing it up")
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Link: https://lore.kernel.org/r/20240108094103.2001224-2-nicolas.dichtel@6wind.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ec4ffd10
    • David Howells's avatar
      rxrpc: Fix use of Don't Fragment flag · 87220143
      David Howells authored
      rxrpc normally has the Don't Fragment flag set on the UDP packets it
      transmits, except when it has decided that DATA packets aren't getting
      through - in which case it turns it off just for the DATA transmissions.
      This can be a problem, however, for RESPONSE packets that convey
      authentication and crypto data from the client to the server, as the ticket
      may be larger than can fit in the MTU.
      
      In such a case, rxrpc gets itself into an infinite loop as the sendmsg
      returns an error (EMSGSIZE), which causes rxkad_send_response() to return
      -EAGAIN - and the CHALLENGE packet is put back on the Rx queue to retry,
      leading to the I/O thread endlessly attempting to perform the transmission.
      
      Fix this by disabling DF on RESPONSE packets for now.  The use of DF and
      best data MTU determination needs reconsidering at some point in the
      future.
      
      Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: linux-afs@lists.infradead.org
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/1581852.1704813048@warthog.procyon.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      87220143
    • Vladimir Oltean's avatar
      net: dsa: fix netdev_priv() dereference before check on non-DSA netdevice events · 844f1047
      Vladimir Oltean authored
      After the blamed commit, we started doing this dereference for every
      NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER event in the system.
      
      static inline struct dsa_port *dsa_user_to_port(const struct net_device *dev)
      {
      	struct dsa_user_priv *p = netdev_priv(dev);
      
      	return p->dp;
      }
      
      Which is obviously bogus, because not all net_devices have a netdev_priv()
      of type struct dsa_user_priv. But struct dsa_user_priv is fairly small,
      and p->dp means dereferencing 8 bytes starting with offset 16. Most
      drivers allocate that much private memory anyway, making our access not
      fault, and we discard the bogus data quickly afterwards, so this wasn't
      caught.
      
      But the dummy interface is somewhat special in that it calls
      alloc_netdev() with a priv size of 0. So every netdev_priv() dereference
      is invalid, and we get this when we emit a NETDEV_PRECHANGEUPPER event
      with a VLAN as its new upper:
      
      $ ip link add dummy1 type dummy
      $ ip link add link dummy1 name dummy1.100 type vlan id 100
      [   43.309174] ==================================================================
      [   43.316456] BUG: KASAN: slab-out-of-bounds in dsa_user_prechangeupper+0x30/0xe8
      [   43.323835] Read of size 8 at addr ffff3f86481d2990 by task ip/374
      [   43.330058]
      [   43.342436] Call trace:
      [   43.366542]  dsa_user_prechangeupper+0x30/0xe8
      [   43.371024]  dsa_user_netdevice_event+0xb38/0xee8
      [   43.375768]  notifier_call_chain+0xa4/0x210
      [   43.379985]  raw_notifier_call_chain+0x24/0x38
      [   43.384464]  __netdev_upper_dev_link+0x3ec/0x5d8
      [   43.389120]  netdev_upper_dev_link+0x70/0xa8
      [   43.393424]  register_vlan_dev+0x1bc/0x310
      [   43.397554]  vlan_newlink+0x210/0x248
      [   43.401247]  rtnl_newlink+0x9fc/0xe30
      [   43.404942]  rtnetlink_rcv_msg+0x378/0x580
      
      Avoid the kernel oops by dereferencing after the type check, as customary.
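
      The "check the type first, dereference afterwards" ordering can be
      sketched in plain C. Everything below is a simplified, hypothetical
      stand-in: demo_netdev, its is_dsa_user flag and demo_user_to_port() only
      imitate the shape of the kernel code and are not the DSA implementation.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_port { int index; };
struct demo_user_priv { struct demo_port *dp; };

struct demo_netdev {
	bool is_dsa_user;  /* stand-in for the kernel's device type check */
	void *priv;        /* may be NULL or another type on foreign devices */
};

/* Returns the port for a DSA user device, or NULL for any other device. */
static struct demo_port *demo_user_to_port(const struct demo_netdev *dev)
{
	const struct demo_user_priv *p;

	if (!dev->is_dsa_user)
		return NULL;  /* never touch the private area of a foreign device */

	p = dev->priv;        /* safe: the type was verified above */
	return p->dp;
}

/* Example devices: one DSA user port and one dummy-like device. */
static struct demo_port demo_port0 = { .index = 3 };
static struct demo_user_priv demo_priv0 = { .dp = &demo_port0 };
static const struct demo_netdev demo_dsa_dev = {
	.is_dsa_user = true, .priv = &demo_priv0,
};
static const struct demo_netdev demo_dummy_dev = {
	.is_dsa_user = false, .priv = NULL,
};
```

      With the check placed first, a device with no private area, like the
      dummy interface in the report, is rejected before any read happens.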
      
      Fixes: 4c3f80d2 ("net: dsa: walk through all changeupper notifier functions")
      Reported-and-tested-by: syzbot+d81bcd883824180500c8@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/netdev/0000000000001d4255060e87545c@google.com/
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240110003354.2796778-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      844f1047
    • Lin Ma's avatar
      net: qualcomm: rmnet: fix global oob in rmnet_policy · b33fb5b8
      Lin Ma authored
      The variable rmnet_link_ops assigns a *bigger* maxtype, which leads to a
      global out-of-bounds read when parsing the netlink attributes. See the bug
      trace below:
      
      ==================================================================
      BUG: KASAN: global-out-of-bounds in validate_nla lib/nlattr.c:386 [inline]
      BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x24af/0x2750 lib/nlattr.c:600
      Read of size 1 at addr ffffffff92c438d0 by task syz-executor.6/84207
      
      CPU: 0 PID: 84207 Comm: syz-executor.6 Tainted: G                 N 6.1.0 #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x8b/0xb3 lib/dump_stack.c:106
       print_address_description mm/kasan/report.c:284 [inline]
       print_report+0x172/0x475 mm/kasan/report.c:395
       kasan_report+0xbb/0x1c0 mm/kasan/report.c:495
       validate_nla lib/nlattr.c:386 [inline]
       __nla_validate_parse+0x24af/0x2750 lib/nlattr.c:600
       __nla_parse+0x3e/0x50 lib/nlattr.c:697
       nla_parse_nested_deprecated include/net/netlink.h:1248 [inline]
       __rtnl_newlink+0x50a/0x1880 net/core/rtnetlink.c:3485
       rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3594
       rtnetlink_rcv_msg+0x43c/0xd70 net/core/rtnetlink.c:6091
       netlink_rcv_skb+0x14f/0x410 net/netlink/af_netlink.c:2540
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x54e/0x800 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x930/0xe50 net/netlink/af_netlink.c:1921
       sock_sendmsg_nosec net/socket.c:714 [inline]
       sock_sendmsg+0x154/0x190 net/socket.c:734
       ____sys_sendmsg+0x6df/0x840 net/socket.c:2482
       ___sys_sendmsg+0x110/0x1b0 net/socket.c:2536
       __sys_sendmsg+0xf3/0x1c0 net/socket.c:2565
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7fdcf2072359
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fdcf13e3168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007fdcf219ff80 RCX: 00007fdcf2072359
      RDX: 0000000000000000 RSI: 0000000020000200 RDI: 0000000000000003
      RBP: 00007fdcf20bd493 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fffbb8d7bdf R14: 00007fdcf13e3300 R15: 0000000000022000
       </TASK>
      
      The buggy address belongs to the variable:
       rmnet_policy+0x30/0xe0
      
      The buggy address belongs to the physical page:
      page:0000000065bdeb3c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x155243
      flags: 0x200000000001000(reserved|node=0|zone=2)
      raw: 0200000000001000 ffffea00055490c8 ffffea00055490c8 0000000000000000
      raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffffffff92c43780: f9 f9 f9 f9 00 00 00 02 f9 f9 f9 f9 00 00 00 07
       ffffffff92c43800: f9 f9 f9 f9 00 00 00 05 f9 f9 f9 f9 06 f9 f9 f9
      >ffffffff92c43880: f9 f9 f9 f9 00 00 00 00 00 00 f9 f9 f9 f9 f9 f9
                                                       ^
       ffffffff92c43900: 00 00 00 00 00 00 00 00 07 f9 f9 f9 f9 f9 f9 f9
       ffffffff92c43980: 00 00 00 07 f9 f9 f9 f9 00 00 00 05 f9 f9 f9 f9
      
      According to the comment of `nla_parse_nested_deprecated`, the maxtype
      should be len(destination array) - 1. Hence use `IFLA_RMNET_MAX` here.
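
      The "maxtype is array length minus one" rule can be illustrated with a
      small table lookup. The enum, the table contents and
      demo_policy_lookup() are hypothetical and only mirror the usual
      UNSPEC/.../MAX pattern of netlink attribute enums. They are not the
      rmnet definitions.

```c
enum {
	DEMO_ATTR_UNSPEC,
	DEMO_ATTR_MUX_ID,
	DEMO_ATTR_FLAGS,
	DEMO_ATTR_COUNT,  /* number of attribute types, one past the last */
};
#define DEMO_ATTR_MAX (DEMO_ATTR_COUNT - 1)  /* highest valid type */

/* One slot per type 0..DEMO_ATTR_MAX; the lengths are illustrative. */
static const int demo_policy_len[DEMO_ATTR_MAX + 1] = {
	[DEMO_ATTR_MUX_ID] = 2,
	[DEMO_ATTR_FLAGS]  = 8,
};

/*
 * Returns the expected payload length for type, or -1 when type lies
 * outside 0..maxtype. The lookup is only safe when maxtype is at most
 * DEMO_ATTR_MAX; passing DEMO_ATTR_COUNT would let type index one
 * element past the end of the table.
 */
static int demo_policy_lookup(int type, int maxtype)
{
	if (type < 0 || type > maxtype)
		return -1;
	return demo_policy_len[type];
}
```

      Passing the count instead of the maximum is the off-by-one that turns a
      rejected attribute into an out-of-bounds read of the policy table.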
      
      Fixes: 14452ca3 ("net: qualcomm: rmnet: Export mux_id and flags to netlink")
      Signed-off-by: Lin Ma <linma@zju.edu.cn>
      Reviewed-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20240110061400.3356108-1-linma@zju.edu.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b33fb5b8
    • Dmitry Safonov's avatar
      selftests/net/tcp-ao: Use LDLIBS instead of LDFLAGS · e689a876
      Dmitry Safonov authored
      The rules to link selftests are:
      
      > $(OUTPUT)/%_ipv4: %.c
      > 	$(LINK.c) $^ $(LDLIBS) -o $@
      >
      > $(OUTPUT)/%_ipv6: %.c
      > 	$(LINK.c) -DIPV6_TEST $^ $(LDLIBS) -o $@
      
      The intel test robot uses only selftest's Makefile, not the top linux
      Makefile:
      
      > make W=1 O=/tmp/kselftest -C tools/testing/selftests
      
      So, $(LINK.c) is determined by environment, rather than by kernel
      Makefiles. On my machine (as well as other people that ran tcp-ao
      selftests) GNU/Make implicit definition does use $(LDFLAGS):
      
      > [dima@Mindolluin ~]$ make -p -f/dev/null | grep '^LINK.c\>'
      > make: *** No targets.  Stop.
      > LINK.c = $(CC) $(CFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH)
      
      But, according to build robot report, it's not the case for them.
      While I could just avoid using pre-defined $(LINK.c), it's also used by
      selftests/lib.mk by default.
      
      Anyways, according to GNU/Make documentation [1], I should have used
      $(LDLIBS) instead of $(LDFLAGS) in the first place, so let's just do it:
      
      > LDFLAGS
      >     Extra flags to give to compilers when they are supposed to invoke
      >     the linker, ‘ld’, such as -L. Libraries (-lfoo) should be added
      >     to the LDLIBS variable instead.
      > LDLIBS
      >     Library flags or names given to compilers when they are supposed
      >     to invoke the linker, ‘ld’. LOADLIBES is a deprecated (but still
      >     supported) alternative to LDLIBS. Non-library linker flags, such
      >     as -L, should go in the LDFLAGS variable.
      
      [1]: https://www.gnu.org/software/make/manual/html_node/Implicit-Variables.html
      
      Fixes: cfbab37b ("selftests/net: Add TCP-AO library")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202401011151.veyYTJzq-lkp@intel.com/
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20240110-tcp_ao-selftests-makefile-v1-1-aa07d043f052@arista.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e689a876
    • Arnd Bergmann's avatar
      wangxunx: select CONFIG_PHYLINK where needed · b3739fb3
      Arnd Bergmann authored
      The ngbe driver needs phylink:
      
      arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/libwx/wx_ethtool.o: in function `wx_nway_reset':
      wx_ethtool.c:(.text+0x458): undefined reference to `phylink_ethtool_nway_reset'
      arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_main.o: in function `ngbe_remove':
      ngbe_main.c:(.text+0x7c): undefined reference to `phylink_destroy'
      arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_main.o: in function `ngbe_open':
      ngbe_main.c:(.text+0xf90): undefined reference to `phylink_connect_phy'
      arm-linux-gnueabi-ld: drivers/net/ethernet/wangxun/ngbe/ngbe_mdio.o: in function `ngbe_mdio_init':
      ngbe_mdio.c:(.text+0x314): undefined reference to `phylink_create'
      
      Add the missing Kconfig select for this.
      
      Fixes: bc2426d7 ("net: ngbe: convert phylib to phylink")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20240111162828.68564-1-arnd@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b3739fb3
    • Jakub Kicinski's avatar
      MAINTAINERS: ibmvnic: drop Dany from reviewers · f9678f58
      Jakub Kicinski authored
      I missed that Dany uses a different email address
      when tagging patches (drt@linux.ibm.com)
      and asked him if he's still actively working on ibmvnic.
      He doesn't really fall under our removal criteria,
      but he admitted that he already moved on to other projects.
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240109164517.3063131-8-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f9678f58