1. 06 Sep, 2023 6 commits
    • selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc · a96d1cfb
      Martin KaFai Lau authored
      This patch checks that sk_omem_alloc has been uncharged by bpf_sk_storage
      during __sk_destruct().
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-4-martin.lau@linux.dev
    • bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc · 55d49f75
      Martin KaFai Lau authored
      The commit c83597fa ("bpf: Refactor some inode/task/sk storage functions
      for reuse") refactored the bpf_{sk,task,inode}_storage_free() into
      bpf_local_storage_unlink_nolock(), which was later renamed to
      bpf_local_storage_destroy(). The commit accidentally passed the
      "bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock(),
      which stopped the uncharge from happening to the sk->sk_omem_alloc.
      
      This missing uncharge only happens when the sk is going away (during
      __sk_destruct).
      
      This patch fixes it by always passing "uncharge_mem = true". It is a
      noop to the task/inode/cgroup storage because they do not have the
      map_local_storage_(un)charge enabled in the map_ops. A followup patch
      will be done in bpf-next to remove the uncharge_mem argument.
      
      A selftest is added in the next patch.
      
      Fixes: c83597fa ("bpf: Refactor some inode/task/sk storage functions for reuse")
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
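The accounting problem described in this commit can be illustrated with a small userspace model. This is only a sketch, not kernel code: `toy_sock`, `toy_selem_link()`, `toy_selem_unlink()` and `toy_destroy_leftover()` are hypothetical names, and the single counter stands in for sk->sk_omem_alloc.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock; only the accounting counter is modeled. */
struct toy_sock {
    long omem_alloc; /* models sk->sk_omem_alloc */
};

/* Linking an element charges its size to the owner. */
static void toy_selem_link(struct toy_sock *sk, size_t size)
{
    sk->omem_alloc += (long)size;
}

/* Unlinking returns the charge only when uncharge_mem is true. */
static void toy_selem_unlink(struct toy_sock *sk, size_t size, bool uncharge_mem)
{
    if (uncharge_mem)
        sk->omem_alloc -= (long)size;
}

/*
 * Models the destroy path: link n elements, unlink all of them,
 * and report whatever charge is left on the owner afterwards.
 */
static long toy_destroy_leftover(size_t n, size_t size, bool uncharge_mem)
{
    struct toy_sock sk = { 0 };
    size_t i;

    for (i = 0; i < n; i++)
        toy_selem_link(&sk, size);
    for (i = 0; i < n; i++)
        toy_selem_unlink(&sk, size, uncharge_mem);

    return sk.omem_alloc;
}
```

In this model, destroying with `uncharge_mem = false` leaves the full charge behind, while `uncharge_mem = true` brings the counter back to zero, which is the state the selftest in the previous entry checks for.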
    • bpf: bpf_sk_storage: Fix invalid wait context lockdep report · a96a44ab
      Martin KaFai Lau authored
      './test_progs -t test_local_storage' reported a splat:
      
      [   27.137569] =============================
      [   27.138122] [ BUG: Invalid wait context ]
      [   27.138650] 6.5.0-03980-gd11ae1b1 #247 Tainted: G           O
      [   27.139542] -----------------------------
      [   27.140106] test_progs/1729 is trying to lock:
      [   27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
      [   27.141834] other info that might help us debug this:
      [   27.142437] context-{5:5}
      [   27.142856] 2 locks held by test_progs/1729:
      [   27.143352]  #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
      [   27.144492]  #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
      [   27.145855] stack backtrace:
      [   27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G           O       6.5.0-03980-gd11ae1b1 #247
      [   27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [   27.149127] Call Trace:
      [   27.149490]  <TASK>
      [   27.149867]  dump_stack_lvl+0x130/0x1d0
      [   27.152609]  dump_stack+0x14/0x20
      [   27.153131]  __lock_acquire+0x1657/0x2220
      [   27.153677]  lock_acquire+0x1b8/0x510
      [   27.157908]  local_lock_acquire+0x29/0x130
      [   27.159048]  obj_cgroup_charge+0xf4/0x3c0
      [   27.160794]  slab_pre_alloc_hook+0x28e/0x2b0
      [   27.161931]  __kmem_cache_alloc_node+0x51/0x210
      [   27.163557]  __kmalloc+0xaa/0x210
      [   27.164593]  bpf_map_kzalloc+0xbc/0x170
      [   27.165147]  bpf_selem_alloc+0x130/0x510
      [   27.166295]  bpf_local_storage_update+0x5aa/0x8e0
      [   27.167042]  bpf_fd_sk_storage_update_elem+0xdb/0x1a0
      [   27.169199]  bpf_map_update_value+0x415/0x4f0
      [   27.169871]  map_update_elem+0x413/0x550
      [   27.170330]  __sys_bpf+0x5e9/0x640
      [   27.174065]  __x64_sys_bpf+0x80/0x90
      [   27.174568]  do_syscall_64+0x48/0xa0
      [   27.175201]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      [   27.175932] RIP: 0033:0x7effb40e41ad
      [   27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
      [   27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
      [   27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
      [   27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
      [   27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
      [   27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
      [   27.184958]  </TASK>
      
      It complains about acquiring a local_lock while holding a raw_spin_lock.
      It means it should not allocate memory while holding a raw_spin_lock
      since it is not safe for RT.
      
      raw_spin_lock is needed because bpf_local_storage supports tracing
      context. In particular for task local storage, it is easy to
      get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
      However, task (and cgroup) local storage has already been moved to
      bpf mem allocator which can be used after raw_spin_lock.
      
      The splat is for the sk storage. The sk (and inode) storage
      has not been moved to bpf mem allocator. Using raw_spin_lock or not,
      kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
      However, the local storage helper requires a verifier accepted
      sk pointer (PTR_TO_BTF_ID), so it is hypothetical whether that (meaning
      running a bpf prog in a kzalloc unsafe context while also being able to
      hold a verifier accepted sk pointer) could happen.
      
      This patch avoids kzalloc after raw_spin_lock to silence the splat.
      There is an existing kzalloc before the raw_spin_lock. At that point,
      a kzalloc is very likely required because a lookup has just been done
      before. Thus, this patch always does the kzalloc before acquiring
      the raw_spin_lock and removes the later kzalloc usage after the
      raw_spin_lock. After this change, it will have a charge and then
      uncharge during the syscall bpf_map_update_elem() code path.
      This patch opts for simplicity and does not continue the old
      optimization to save one charge and uncharge.
      
      This issue dates back to the very first commit of bpf_sk_storage
      which had been refactored multiple times to create task, inode, and
      cgroup storage. This patch uses a Fixes tag with a more recent
      commit that should be easier to backport.
      
      Fixes: b00fa38a ("bpf: Enable non-atomic allocations in local storage")
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
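The "allocate first, then lock" ordering described in this commit can be sketched in userspace as follows. This is an illustrative model under stated assumptions, not the kernel change itself: `toy_storage`, `toy_alloc()` and `toy_update()` are hypothetical, the `lock_held` flag stands in for the raw_spin_lock, and `calloc()` stands in for kzalloc().

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a storage object protected by a lock. */
struct toy_storage {
    bool lock_held;        /* models holding the raw_spin_lock */
    int allocs_under_lock; /* counts allocations made while "locked" */
    void *elem;            /* the single stored element, or NULL */
};

/* Allocation helper that records whether it ran inside the locked region. */
static void *toy_alloc(struct toy_storage *s, size_t size)
{
    if (s->lock_held)
        s->allocs_under_lock++;
    return calloc(1, size);
}

/* Update that always allocates before taking the lock. */
static int toy_update(struct toy_storage *s, size_t size)
{
    void *new_elem = toy_alloc(s, size); /* allocation happens unlocked */
    void *old_elem;

    if (!new_elem)
        return -1;

    s->lock_held = true;  /* lock */
    old_elem = s->elem;   /* only pointer swaps inside the locked region */
    s->elem = new_elem;
    s->lock_held = false; /* unlock */

    free(old_elem);       /* release the replaced element after unlocking */
    return 0;
}
```

The trade-off the commit message mentions shows up here as well: the new element is always allocated up front, even on paths where an older design might have avoided the allocation.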
    • s390/bpf: Pass through tail call counter in trampolines · a192103a
      Ilya Leoshkevich authored
      s390x eBPF programs use the following extension to the s390x calling
      convention: tail call counter is passed on stack at offset
      STK_OFF_TCCNT, which callees otherwise use as scratch space.
      
      Currently trampoline does not respect this and clobbers tail call
      counter. This breaks enforcing tail call limits in eBPF programs, which
      have trampolines attached to them.
      
      Fix by forwarding a copy of the tail call counter to the original eBPF
      program in the trampoline (for fexit), and by restoring it at the end
      of the trampoline (for fentry).
      
      Fixes: 528eb2cb ("s390/bpf: Implement arch_prepare_bpf_trampoline()")
      Reported-by: Leon Hwang <hffilwlqm@gmail.com>
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230906004448.111674-1-iii@linux.ibm.com
    • bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check. · 6764e767
      Sebastian Andrzej Siewior authored
      __bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
      performing the recursion check which means in case of a recursion
      __bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
      value.
      
      __bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
      after the recursion check which means in case of a recursion
      __bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
      look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
      isn't initialized upfront.
      
      Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
      set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
      Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
      a few cycles later.
      
      Fixes: e384c7b7 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
    • bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf(). · 7645629f
      Sebastian Andrzej Siewior authored
      If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
      0 without undoing rcu_read_lock_trace(), migrate_disable() or
      decrementing the recursion counter. This is fine in the JIT case because
      the JIT code will jump in the 0 case to the end and invoke the matching
      exit trampoline (__bpf_prog_exit_sleepable_recur()).
      
      This is not the case in kern_sys_bpf() which returns directly to the
      caller with an error code.
      
      Add __bpf_prog_exit_sleepable_recur() as clean up in the recursion case.
      
      Fixes: b1d18a75 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
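The enter/exit imbalance described in this commit can be modeled with two counters. The sketch below is a simplified userspace illustration: `toy_prog`, `toy_enter()`, `toy_exit()` and `toy_run()` are hypothetical names, and the counters only approximate the recursion counter and migrate_disable() state mentioned in the commit message.

```c
#include <errno.h>

/* Hypothetical per-program state. */
struct toy_prog {
    int active;           /* models the recursion counter */
    int migrate_disabled; /* models migrate_disable() nesting */
};

/*
 * Returns 0 when recursion is detected. As in the commit message,
 * the side effects are NOT undone here; the caller must still
 * invoke the matching exit function.
 */
static unsigned long toy_enter(struct toy_prog *p)
{
    p->migrate_disabled++;
    if (++p->active != 1)
        return 0;
    return 1;
}

/* Undoes everything toy_enter() did. */
static void toy_exit(struct toy_prog *p)
{
    p->active--;
    p->migrate_disabled--;
}

/* Caller in the style of the fixed code path: clean up on recursion too. */
static int toy_run(struct toy_prog *p)
{
    if (!toy_enter(p)) {
        toy_exit(p); /* the cleanup this commit adds */
        return -EBUSY;
    }
    /* ... the program body would run here ... */
    toy_exit(p);
    return 0;
}
```

Without the `toy_exit()` call in the recursion branch, each rejected call would leave both counters one higher than before, which is the leak the commit describes for kern_sys_bpf().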
  2. 04 Sep, 2023 1 commit
    • bpf, sockmap: Fix skb refcnt race after locking changes · a454d84e
      John Fastabend authored
      There is a race where skb's from the sk_psock_backlog can be referenced
      after userspace side has already called consume_skb() on the sk_buff and
      its refcnt dropped to zero, causing use after free.
      
      The flow is the following:
      
        while ((skb = skb_peek(&psock->ingress_skb))
          sk_psock_handle_skb(psock, skb, ..., ingress)
          if (!ingress) ...
          sk_psock_skb_ingress
             sk_psock_skb_ingress_enqueue(skb)
                msg->skb = skb
                sk_psock_queue_msg(psock, msg)
          skb_dequeue(&psock->ingress_skb)
      
      The sk_psock_queue_msg() puts the msg on the ingress_msg queue. This is
      what the application reads when recvmsg() is called. An application can
      read this anytime after the msg is placed on the queue. The recvmsg hook
      will also read msg->skb and then after user space reads the msg will call
      consume_skb(skb) on it effectively free'ing it.
      
      But the race is in the flow above, where the backlog queue still has a
      reference to the skb and calls skb_dequeue(). If the skb_dequeue() happens
      after the user reads and frees the skb, we have a use after free.
      
      The !ingress case does not suffer from this problem because it uses
      sendmsg_*(sk, msg) which does not pass the sk_buff further down the
      stack.
      
      The following splat was observed with 'test_progs -t sockmap_listen':
      
        [ 1022.710250][ T2556] general protection fault, ...
        [...]
        [ 1022.712830][ T2556] Workqueue: events sk_psock_backlog
        [ 1022.713262][ T2556] RIP: 0010:skb_dequeue+0x4c/0x80
        [ 1022.713653][ T2556] Code: ...
        [...]
        [ 1022.720699][ T2556] Call Trace:
        [ 1022.720984][ T2556]  <TASK>
        [ 1022.721254][ T2556]  ? die_addr+0x32/0x80
        [ 1022.721589][ T2556]  ? exc_general_protection+0x25a/0x4b0
        [ 1022.722026][ T2556]  ? asm_exc_general_protection+0x22/0x30
        [ 1022.722489][ T2556]  ? skb_dequeue+0x4c/0x80
        [ 1022.722854][ T2556]  sk_psock_backlog+0x27a/0x300
        [ 1022.723243][ T2556]  process_one_work+0x2a7/0x5b0
        [ 1022.723633][ T2556]  worker_thread+0x4f/0x3a0
        [ 1022.723998][ T2556]  ? __pfx_worker_thread+0x10/0x10
        [ 1022.724386][ T2556]  kthread+0xfd/0x130
        [ 1022.724709][ T2556]  ? __pfx_kthread+0x10/0x10
        [ 1022.725066][ T2556]  ret_from_fork+0x2d/0x50
        [ 1022.725409][ T2556]  ? __pfx_kthread+0x10/0x10
        [ 1022.725799][ T2556]  ret_from_fork_asm+0x1b/0x30
        [ 1022.726201][ T2556]  </TASK>
      
      To fix this, we add an skb_get() before passing the skb to be enqueued in
      the ingress queue. This bumps the skb->users refcnt so that consume_skb()
      and kfree_skb() will not immediately free the sk_buff. With this we can
      be sure the skb is still around when we do the dequeue. Then we just
      need to decrement the refcnt or free the skb in the backlog case, which
      we do by calling kfree_skb() on the ingress case as well as the sendmsg
      case.
      
      Before the locking change from the Fixes tag we had the sock locked, so we
      couldn't race with user and there was no issue here.
      
      Fixes: 799aa7f9 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
      Reported-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Xu Kuohai <xukuohai@huawei.com>
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Link: https://lore.kernel.org/bpf/20230901202137.214666-1-john.fastabend@gmail.com
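The effect of taking an extra reference before enqueueing can be shown with a toy reference counter. This is a single-threaded sketch that serializes the race into a fixed order; `toy_skb`, `toy_skb_get()`, `toy_skb_put()` and `toy_backlog_step()` are hypothetical and only mimic the roles of skb_get(), consume_skb()/kfree_skb() and the backlog loop.

```c
/* Hypothetical buffer with a reference count. */
struct toy_skb {
    int users; /* models skb->users */
    int freed; /* set once the last reference is dropped */
};

/* Take one more reference. */
static struct toy_skb *toy_skb_get(struct toy_skb *skb)
{
    skb->users++;
    return skb;
}

/* Drop one reference; the buffer is "freed" when the count hits zero. */
static void toy_skb_put(struct toy_skb *skb)
{
    if (--skb->users == 0)
        skb->freed = 1;
}

/*
 * One backlog iteration with the reader winning the race.
 * Returns 1 if the buffer is still alive when the backlog dequeues it.
 */
static int toy_backlog_step(struct toy_skb *skb, int take_extra_ref)
{
    int alive_at_dequeue;

    if (take_extra_ref)
        toy_skb_get(skb); /* reference held by the backlog */

    /* The message is queued; the reader consumes it immediately. */
    toy_skb_put(skb);

    /* The backlog now dequeues the buffer from its own queue. */
    alive_at_dequeue = !skb->freed;

    if (take_extra_ref)
        toy_skb_put(skb); /* backlog drops its reference after dequeue */

    return alive_at_dequeue;
}
```

With `take_extra_ref == 0` the buffer is already freed by the time of the dequeue; with the extra reference it survives until the backlog itself drops it.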
  3. 01 Sep, 2023 16 commits
    • docs/bpf: Fix "file doesn't exist" warnings in {llvm_reloc,btf}.rst · 3888fa13
      Eduard Zingerman authored
      scripts/documentation-file-ref-check reports warnings for (valid) cross-links
      of form:
      
        :ref:`Documentation/bpf/btf <BTF_Ext_Section>`
      
      Adding extension to the file name helps to avoid the warning, e.g:
      
        :ref:`Documentation/bpf/btf.rst <BTF_Ext_Section>`
      
      Fixes: be4033d3 ("docs/bpf: Add description for CO-RE relocations")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Closes: https://lore.kernel.org/oe-kbuild-all/202309010804.G3MpXo59-lkp@intel.com
      Link: https://lore.kernel.org/bpf/20230901125935.487972-1-eddyz87@gmail.com
    • selftests/bpf: Fix a CI failure caused by vsock write · c1970e26
      Xu Kuohai authored
      While commit 90f0074c ("selftests/bpf: fix a CI failure caused by vsock sockmap test")
      fixes a receive failure of vsock sockmap test, there is still a write failure:
      
      Error: #211/79 sockmap_listen/sockmap VSOCK test_vsock_redir
        ./test_progs:vsock_unix_redir_connectible:1501: egress: write: Transport endpoint is not connected
        vsock_unix_redir_connectible:FAIL:1501
        ./test_progs:vsock_unix_redir_connectible:1501: ingress: write: Transport endpoint is not connected
        vsock_unix_redir_connectible:FAIL:1501
        ./test_progs:vsock_unix_redir_connectible:1501: egress: write: Transport endpoint is not connected
        vsock_unix_redir_connectible:FAIL:1501
      
      The reason is that the vsock connection in the test is set to ESTABLISHED state
      by function virtio_transport_recv_pkt, which is executed in a workqueue thread,
      so when the user space test thread runs before the workqueue thread, this
      problem occurs.
      
      To fix it, before writing to the connection, wait for it to be connected.
      
      Fixes: d61bd8c1 ("selftests/bpf: add a test case for vsock sockmap")
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230901031037.3314007-1-xukuohai@huaweicloud.com
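One common way to wait for a socket to become connected before writing is to poll for writability and then check SO_ERROR. The commit message does not say how the selftest waits, so the helper below is an assumption: `wait_connected()` is a hypothetical name and this is a generic POSIX sketch, not the code from the patch.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/*
 * Wait until fd is writable, then check the pending socket error.
 * Returns 0 when the socket looks connected, -1 on error or timeout
 * (errno is set in both cases).
 */
static int wait_connected(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int err = 0;
    socklen_t len = sizeof(err);
    int n;

    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    if (n == 0) {
        errno = ETIMEDOUT;
        return -1;
    }

    /* A failed connection attempt is reported through SO_ERROR. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    if (err) {
        errno = err;
        return -1;
    }
    return 0;
}
```

A test would call such a helper on the connecting end right before the first write, so that a connection state change that is still pending in a workqueue has time to complete.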
    • sfc: check for zero length in EF10 RX prefix · ae074e2b
      Edward Cree authored
      When EF10 RXDP firmware is operating in cut-through mode, packet length
       is not known at the time the RX prefix is generated, so it is left as
       zero and RX event merging is inhibited to ensure that the length is
       available in the RX event.  However, it has been found that in certain
       circumstances the RX events for these packets still get merged,
       meaning the driver cannot read the length from the RX event, and tries
       to use the length from the prefix.
      The resulting zero-length SKBs cause crashes in GRO since commit
       1d11fa69 ("net-gro: remove GRO_DROP"), so add a check to the driver
       to detect these zero-length RX events and discard the packet.
      Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dst-hint-multipath' · d8a30706
      David S. Miller authored
      Sriram Yagnaraman says:
      
      ====================
      Avoid TCP resets when using ECMP for load-balancing between multiple servers.
      
      All packets in the same flow (L3/L4 depending on multipath hash policy)
      should be directed to the same target, but after [0]/[1] we see stray
      packets directed towards other targets. This, for instance, causes RST
      to be sent on TCP connections.
      
      The first two patches solve the problem by ignoring route hints for
      destinations that are part of multipath group, by using new SKB flags
      for IPv4 and IPv6. The third patch is a selftest that tests the
      scenario.
      
      Thanks to Ido, for reviewing and suggesting a way forward in [2] and
      also suggesting how to write a selftest for this.
      
      v4->v5:
      - Fixed review comments from Ido
      v3->v4:
      - Remove single path test
      - Rebase to latest
      v2->v3:
      - Add NULL check for skb in fib6_select_path (Ido Schimmel)
      - Use fib_tests.sh for selftest instead of the forwarding suite (Ido
        Schimmel)
      v1->v2:
      - Update to commit messages describing the solution (Ido Schimmel)
      - Use perf stat to count fib table lookups in selftest (Ido Schimmel)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: fib_tests: Add multipath list receive tests · 8ae9efb8
      Sriram Yagnaraman authored
      The test uses perf stat to count the number of fib:fib_table_lookup
      tracepoint hits for IPv4 and the number of fib6:fib6_table_lookup for
      IPv6. The measured count is checked to be within 5% of the total number
      of packets sent via veth1.
      Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: ignore dst hint for multipath routes · 8423be89
      Sriram Yagnaraman authored
      Route hints, when the nexthop is part of a multipath group, cause packets
      in the same receive batch to be sent to the same nexthop irrespective of
      the multipath hash of the packet. So, do not extract route hint for
      packets whose destination is part of a multipath group.
      
      A new SKB flag IP6SKB_MULTIPATH is introduced for this purpose, set the
      flag when route is looked up in fib6_select_path() and use it in
      ip6_can_use_hint() to check for the existence of the flag.
      
      Fixes: 197dbf24 ("ipv6: introduce and uses route look hints for list input.")
      Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: ignore dst hint for multipath routes · 6ac66cb0
      Sriram Yagnaraman authored
      Route hints, when the nexthop is part of a multipath group, cause packets
      in the same receive batch to be sent to the same nexthop irrespective of
      the multipath hash of the packet. So, do not extract route hint for
      packets whose destination is part of a multipath group.
      
      A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the
      flag when route is looked up in ip_mkroute_input() and use it in
      ip_extract_route_hint() to check for the existence of the flag.
      
      Fixes: 02b24941 ("ipv4: use dst hint for ipv4 list receive")
      Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
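The flag-based approach used by the IPv4 and IPv6 patches above can be sketched as a set-on-lookup, check-before-hint pattern. The names here are hypothetical (`TOY_SKB_MULTIPATH`, `toy_route_lookup()`, `toy_can_use_hint()`); they only mirror the roles of IPSKB_MULTIPATH / IP6SKB_MULTIPATH and the functions named in the commit messages.

```c
/* Hypothetical per-packet flag, in the role of IPSKB_MULTIPATH. */
#define TOY_SKB_MULTIPATH (1u << 0)

/* Hypothetical packet carrying only a flags word. */
struct toy_skb {
    unsigned int flags;
};

/* Route lookup marks the packet when the route has several nexthops. */
static void toy_route_lookup(struct toy_skb *skb, int nexthops)
{
    if (nexthops > 1)
        skb->flags |= TOY_SKB_MULTIPATH;
}

/* A route hint may be reused only for packets not marked multipath. */
static int toy_can_use_hint(const struct toy_skb *skb)
{
    return !(skb->flags & TOY_SKB_MULTIPATH);
}
```

In this model, a packet routed through a multipath group never donates a hint to the next packet in the batch, so each packet goes through its own lookup and its own multipath hash.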
    • skbuff: skb_segment, Call zero copy functions before using skbuff frags · 2ea35288
      Mohamed Khalfella authored
      Commit bf5c25d6 ("skbuff: in skb_segment, call zerocopy functions
      once per nskb") added the call to zero copy functions in skb_segment().
      The change introduced a bug in skb_segment() because skb_orphan_frags()
      may possibly change the number of fragments or allocate new fragments
      altogether leaving nrfrags and frag to point to the old values. This can
      cause a panic with stacktrace like the one below.
      
      [  193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
      [  193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G           O      5.15.123+ #26
      [  193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
      [  194.021892] Call Trace:
      [  194.027422]  <TASK>
      [  194.072861]  tcp_gso_segment+0x107/0x540
      [  194.082031]  inet_gso_segment+0x15c/0x3d0
      [  194.090783]  skb_mac_gso_segment+0x9f/0x110
      [  194.095016]  __skb_gso_segment+0xc1/0x190
      [  194.103131]  netem_enqueue+0x290/0xb10 [sch_netem]
      [  194.107071]  dev_qdisc_enqueue+0x16/0x70
      [  194.110884]  __dev_queue_xmit+0x63b/0xb30
      [  194.121670]  bond_start_xmit+0x159/0x380 [bonding]
      [  194.128506]  dev_hard_start_xmit+0xc3/0x1e0
      [  194.131787]  __dev_queue_xmit+0x8a0/0xb30
      [  194.138225]  macvlan_start_xmit+0x4f/0x100 [macvlan]
      [  194.141477]  dev_hard_start_xmit+0xc3/0x1e0
      [  194.144622]  sch_direct_xmit+0xe3/0x280
      [  194.147748]  __dev_queue_xmit+0x54a/0xb30
      [  194.154131]  tap_get_user+0x2a8/0x9c0 [tap]
      [  194.157358]  tap_sendmsg+0x52/0x8e0 [tap]
      [  194.167049]  handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
      [  194.173631]  handle_tx+0xcd/0xe0 [vhost_net]
      [  194.176959]  vhost_worker+0x76/0xb0 [vhost]
      [  194.183667]  kthread+0x118/0x140
      [  194.190358]  ret_from_fork+0x1f/0x30
      [  194.193670]  </TASK>
      
      In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags
      local variable in skb_segment() stale. This resulted in the code hitting
      i >= nrfrags prematurely and trying to move to next frag_skb using
      list_skb pointer, which was NULL, and caused kernel panic. Move the call
      to zero copy functions before using frags and nr_frags.
      
      Fixes: bf5c25d6 ("skbuff: in skb_segment, call zerocopy functions once per nskb")
      Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
      Reported-by: Amit Goyal <agoyal@purestorage.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
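The stale-local problem described in this commit reduces to caching a value before calling a function that may change it. The sketch below is a minimal model under assumptions: `toy_skb`, `toy_orphan_frags()`, `toy_count_buggy()` and `toy_count_fixed()` are hypothetical, and the fragment count collapsing to 1 is just one arbitrary example of a change.

```c
/* Hypothetical buffer that only tracks its fragment count. */
struct toy_skb {
    int nr_frags;
};

/* Stand-in for a call that may change the number of fragments. */
static void toy_orphan_frags(struct toy_skb *skb)
{
    skb->nr_frags = 1;
}

/* Buggy order: the count is cached before the call that may change it. */
static int toy_count_buggy(struct toy_skb *skb)
{
    int nrfrags = skb->nr_frags; /* becomes stale after the next line */

    toy_orphan_frags(skb);
    return nrfrags;
}

/* Fixed order: call first, then read the count. */
static int toy_count_fixed(struct toy_skb *skb)
{
    toy_orphan_frags(skb);
    return skb->nr_frags;
}
```

The buggy variant keeps iterating with the old count even though the buffer now has a different layout, which is the kind of mismatch that led to the NULL dereference in the trace above.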
    • Merge branch 'net-data-race-annotations' · f2e977f3
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: another round of data-race annotations
      
      Series inspired by some syzbot reports, taking care
      of 4 socket fields that can be read locklessly.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate data-races around sk->sk_bind_phc · 251cd405
      Eric Dumazet authored
      sk->sk_bind_phc is read locklessly. Add corresponding annotations.
      
      Fixes: d463126e ("net: sock: extend SO_TIMESTAMPING for PHC binding")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate data-races around sk->sk_tsflags · e3390b30
      Eric Dumazet authored
      sk->sk_tsflags can be read locklessly, add corresponding annotations.
      
      Fixes: b9f40e21 ("net-timestamp: move timestamp flags out of sk_flags")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: annotate data-races around msk->rmem_fwd_alloc · 9531e4a8
      Eric Dumazet authored
      msk->rmem_fwd_alloc can be read locklessly.
      
      Add mptcp_rmem_fwd_alloc_add(), similar to sk_forward_alloc_add(),
      and appropriate READ_ONCE()/WRITE_ONCE() annotations.
      
      Fixes: 6511882c ("mptcp: allocate fwd memory separately on the rx and tx path")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate data-races around sk->sk_forward_alloc · 5e6300e7
      Eric Dumazet authored
      Every time sk->sk_forward_alloc is read locklessly,
      add a READ_ONCE().
      
      Add sk_forward_alloc_add() helper to centralize updates,
      to reduce number of WRITE_ONCE().
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
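The idea of funnelling all updates through one annotated helper can be sketched in userspace with volatile accesses. This is an approximation, not the kernel macros: `TOY_READ_ONCE()`, `TOY_WRITE_ONCE()`, `toy_sock`, `toy_forward_alloc_add()` and `toy_forward_alloc_get()` are hypothetical names, and `__typeof__` assumes GCC or Clang.

```c
/* Simplified stand-ins for READ_ONCE()/WRITE_ONCE(); GCC/Clang only. */
#define TOY_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical socket with a single accounting field. */
struct toy_sock {
    int forward_alloc; /* models sk->sk_forward_alloc */
};

/* Single place where the field is updated, so only one annotated write exists. */
static void toy_forward_alloc_add(struct toy_sock *sk, int val)
{
    TOY_WRITE_ONCE(sk->forward_alloc, sk->forward_alloc + val);
}

/* Lockless readers go through an annotated read. */
static int toy_forward_alloc_get(const struct toy_sock *sk)
{
    return TOY_READ_ONCE(sk->forward_alloc);
}
```

Centralizing the update in one helper keeps the annotation in a single spot instead of repeating it at every call site that adjusts the counter.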
    • net: use sk_forward_alloc_get() in sk_get_meminfo() · 66d58f04
      Eric Dumazet authored
      inet_sk_diag_fill() has been changed to use sk_forward_alloc_get(),
      but sk_get_meminfo() was forgotten.
      
      Fixes: 292e6077 ("net: introduce sk_forward_alloc_get()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/handshake: fix null-ptr-deref in handshake_nl_done_doit() · 82ba0ff7
      Eric Dumazet authored
      We should not call trace_handshake_cmd_done_err() if socket lookup has failed.
      
      Also we should call trace_handshake_cmd_done_err() before releasing the file,
      otherwise dereferencing sock->sk can return garbage.
      
      This also reverts 7afc6d0a ("net/handshake: Fix uninitialized local variable")
      
      Unable to handle kernel paging request at virtual address dfff800000000003
      KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
      Mem abort info:
      ESR = 0x0000000096000005
      EC = 0x25: DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
      FSC = 0x05: level 1 translation fault
      Data abort info:
      ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
      CM = 0, WnR = 0, TnD = 0, TagAccess = 0
      GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
      [dfff800000000003] address between user and kernel address ranges
      Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 1 PID: 5986 Comm: syz-executor292 Not tainted 6.5.0-rc7-syzkaller-gfe4469582053 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
      pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      pc : handshake_nl_done_doit+0x198/0x9c8 net/handshake/netlink.c:193
      lr : handshake_nl_done_doit+0x180/0x9c8
      sp : ffff800096e37180
      x29: ffff800096e37200 x28: 1ffff00012dc6e34 x27: dfff800000000000
      x26: ffff800096e373d0 x25: 0000000000000000 x24: 00000000ffffffa8
      x23: ffff800096e373f0 x22: 1ffff00012dc6e38 x21: 0000000000000000
      x20: ffff800096e371c0 x19: 0000000000000018 x18: 0000000000000000
      x17: 0000000000000000 x16: ffff800080516cc4 x15: 0000000000000001
      x14: 1fffe0001b14aa3b x13: 0000000000000000 x12: 0000000000000000
      x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000003
      x8 : 0000000000000003 x7 : ffff800080afe47c x6 : 0000000000000000
      x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800080a88078
      x2 : 0000000000000001 x1 : 00000000ffffffa8 x0 : 0000000000000000
      Call trace:
      handshake_nl_done_doit+0x198/0x9c8 net/handshake/netlink.c:193
      genl_family_rcv_msg_doit net/netlink/genetlink.c:970 [inline]
      genl_family_rcv_msg net/netlink/genetlink.c:1050 [inline]
      genl_rcv_msg+0x96c/0xc50 net/netlink/genetlink.c:1067
      netlink_rcv_skb+0x214/0x3c4 net/netlink/af_netlink.c:2549
      genl_rcv+0x38/0x50 net/netlink/genetlink.c:1078
      netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
      netlink_unicast+0x660/0x8d4 net/netlink/af_netlink.c:1365
      netlink_sendmsg+0x834/0xb18 net/netlink/af_netlink.c:1914
      sock_sendmsg_nosec net/socket.c:725 [inline]
      sock_sendmsg net/socket.c:748 [inline]
      ____sys_sendmsg+0x56c/0x840 net/socket.c:2494
      ___sys_sendmsg net/socket.c:2548 [inline]
      __sys_sendmsg+0x26c/0x33c net/socket.c:2577
      __do_sys_sendmsg net/socket.c:2586 [inline]
      __se_sys_sendmsg net/socket.c:2584 [inline]
      __arm64_sys_sendmsg+0x80/0x94 net/socket.c:2584
      __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
      invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51
      el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136
      do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155
      el0_svc+0x58/0x16c arch/arm64/kernel/entry-common.c:678
      el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696
      el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591
      Code: 12800108 b90043e8 910062b3 d343fe68 (387b6908)
      
      Fixes: 3b3009ea ("net/handshake: Create a NETLINK service for handling handshake requests")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      82ba0ff7
    • Jakub Kicinski's avatar
      Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · ddaa935d
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2023-08-31
      
      We've added 15 non-merge commits during the last 3 day(s) which contain
      a total of 17 files changed, 468 insertions(+), 97 deletions(-).
      
      The main changes are:
      
      1) BPF selftest fixes: one flake and one related to clang18 testing,
         from Yonghong Song.
      
      2) Fix a d_path BPF selftest failure after fast-forward from Linus'
         tree, from Jiri Olsa.
      
      3) Fix a preempt_rt splat in sockmap when using raw_spin_lock_t,
         from John Fastabend.
      
      4) Fix a xsk_diag_fill use-after-free race during socket cleanup,
         from Magnus Karlsson.
      
      5) Fix xsk_build_skb to address a buggy dereference of an ERR_PTR(),
         from Tirthendu Sarkar.
      
      6) Fix a bpftool build warning when compiled with -Wtype-limits,
         from Yafang Shao.
      
      7) Several misc fixes and cleanups in standardization docs,
         from David Vernet.
      
      8) Fix BPF selftest install to consider no_alu32/cpuv4/bpf-gcc flavors,
         from Björn Töpel.
      
      9) Annotate a data race in bpf_long_memcpy for KCSAN, from Daniel Borkmann.
      
      10) Extend documentation with a description for CO-RE relocations,
          from Eduard Zingerman.
      
      11) Fix several invalid escape sequence warnings in bpf_doc.py script,
          from Vishal Chourasia.
      
      12) Fix the instruction set doc wrt offset of BPF-to-BPF call,
          from Will Hawkins.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        selftests/bpf: Include build flavors for install target
        bpf: Annotate bpf_long_memcpy with data_race
        selftests/bpf: Fix d_path test
        bpf, docs: Fix invalid escape sequence warnings in bpf_doc.py
        xsk: Fix xsk_diag use-after-free error during socket cleanup
        bpf, docs: s/eBPF/BPF in standards documents
        bpf, docs: Add abi.rst document to standardization subdirectory
        bpf, docs: Move linux-notes.rst to root bpf docs tree
        bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_t
        docs/bpf: Add description for CO-RE relocations
        bpf, docs: Correct source of offset for program-local call
        selftests/bpf: Fix flaky cgroup_iter_sleepable subtest
        xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()
        bpftool: Fix build warnings with -Wtype-limits
        bpf: Prevent inlining of bpf_fentry_test7()
      ====================
      
      Link: https://lore.kernel.org/r/20230831210019.14417-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ddaa935d
  4. 31 Aug, 2023 14 commits
    • Björn Töpel's avatar
      selftests/bpf: Include build flavors for install target · be8e754c
      Björn Töpel authored
      When using the "install" target, or targets that depend on it such as
      "gen_tar", the BPF machine flavors weren't included.
      
      A command like:
        | make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- O=/workspace/kbuild \
        |    HOSTCC=gcc FORMAT= SKIP_TARGETS="arm64 ia64 powerpc sparc64 x86 sgx" \
        |    -C tools/testing/selftests gen_tar
      would not include bpf/no_alu32, bpf/cpuv4, or bpf/bpf-gcc.
      
      Include the BPF machine flavors for the "install" make target.
      Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230831162954.111485-1-bjorn@kernel.org
      be8e754c
    • Daniel Borkmann's avatar
      bpf: Annotate bpf_long_memcpy with data_race · 6a86b5b5
      Daniel Borkmann authored
      syzbot reported a data race splat between two processes trying to
      update the same BPF map value via syscall on different CPUs:
      
        BUG: KCSAN: data-race in bpf_percpu_array_update / bpf_percpu_array_update
      
        write to 0xffffe8fffe7425d8 of 8 bytes by task 8257 on cpu 1:
         bpf_long_memcpy include/linux/bpf.h:428 [inline]
         bpf_obj_memcpy include/linux/bpf.h:441 [inline]
         copy_map_value_long include/linux/bpf.h:464 [inline]
         bpf_percpu_array_update+0x3bb/0x500 kernel/bpf/arraymap.c:380
         bpf_map_update_value+0x190/0x370 kernel/bpf/syscall.c:175
         generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1749
         bpf_map_do_batch+0x2df/0x3d0 kernel/bpf/syscall.c:4648
         __sys_bpf+0x28a/0x780
         __do_sys_bpf kernel/bpf/syscall.c:5241 [inline]
         __se_sys_bpf kernel/bpf/syscall.c:5239 [inline]
         __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5239
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        write to 0xffffe8fffe7425d8 of 8 bytes by task 8268 on cpu 0:
         bpf_long_memcpy include/linux/bpf.h:428 [inline]
         bpf_obj_memcpy include/linux/bpf.h:441 [inline]
         copy_map_value_long include/linux/bpf.h:464 [inline]
         bpf_percpu_array_update+0x3bb/0x500 kernel/bpf/arraymap.c:380
         bpf_map_update_value+0x190/0x370 kernel/bpf/syscall.c:175
         generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1749
         bpf_map_do_batch+0x2df/0x3d0 kernel/bpf/syscall.c:4648
         __sys_bpf+0x28a/0x780
         __do_sys_bpf kernel/bpf/syscall.c:5241 [inline]
         __se_sys_bpf kernel/bpf/syscall.c:5239 [inline]
         __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5239
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        value changed: 0x0000000000000000 -> 0xfffffff000002788
      
      The bpf_long_memcpy is used with 8-byte aligned pointers, power-of-8 size
      and forced to use long read/writes to try to atomically copy long counters.
      It is best-effort only and no barriers are here since it _will_ race with
      concurrent updates from BPF programs. The bpf_long_memcpy() is called from
      bpf(2) syscall. Marco suggested that the best way to make this known to
      KCSAN would be to use data_race() annotation.
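
As a rough illustration of the annotation described above, the following userspace sketch copies a buffer one long at a time and wraps the intentionally racy store in a stand-in data_race() macro. The macro here is only a placeholder (the kernel's version is what informs KCSAN), and long_memcpy_sketch is a hypothetical name rather than the kernel function itself.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder: the kernel's data_race() marks an intentional race for KCSAN.
 * In this userspace sketch it only evaluates its argument. */
#define data_race(expr) (expr)

/* Hypothetical helper: copy `size` bytes using long-sized loads and stores.
 * Assumes both pointers are long-aligned and that size is a multiple of
 * sizeof(long). No barriers are used, so this is best-effort only. */
static void long_memcpy_sketch(void *dst, const void *src, size_t size)
{
	const long *lsrc = src;
	long *ldst = dst;

	for (size /= sizeof(long); size > 0; size--)
		data_race(*ldst++ = *lsrc++);
}
```

A caller would pass two long-aligned buffers, for example long_memcpy_sketch(dst, src, sizeof(src)) with both declared as arrays of long.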
      
      Reported-by: syzbot+97522333291430dd277f@syzkaller.appspotmail.com
      Suggested-by: Marco Elver <elver@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Marco Elver <elver@google.com>
      Link: https://lore.kernel.org/bpf/000000000000d87a7f06040c970c@google.com
      Link: https://lore.kernel.org/bpf/57628f7a15e20d502247c3b55fceb1cb2b31f266.1693342186.git.daniel@iogearbox.net
      6a86b5b5
    • Jiri Olsa's avatar
      selftests/bpf: Fix d_path test · d11ae1b1
      Jiri Olsa authored
      Recent commit [1] broke the d_path test, because filp_close is no longer
      called directly from sys_close, but later, when the file is finally
      released.
      
      As suggested by Hou Tao, we don't need to re-hook the BPF program;
      instead we can use sys_close_range to trigger filp_close synchronously.
      
        [1] 021a160a ("fs: use __fput_sync in close(2)")
      Suggested-by: Hou Tao <houtao@huaweicloud.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230831141103.359810-1-jolsa@kernel.org
      d11ae1b1
    • Vishal Chourasia's avatar
      bpf, docs: Fix invalid escape sequence warnings in bpf_doc.py · 121fd33b
      Vishal Chourasia authored
      The script bpf_doc.py generates multiple SyntaxWarnings related to invalid
      escape sequences when executed with Python 3.12. These warnings do not appear
      in Python 3.10 and 3.11 and do not affect the kernel build, which completes
      successfully.
      
      This patch resolves these SyntaxWarnings by converting the relevant string
      literals to raw strings or by escaping backslashes. This ensures that
      backslashes are interpreted as literal characters, eliminating the warnings.
      Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20230829074931.2511204-1-vishalc@linux.ibm.com
      121fd33b
    • Magnus Karlsson's avatar
      xsk: Fix xsk_diag use-after-free error during socket cleanup · 3e019d8a
      Magnus Karlsson authored
      Fix a use-after-free error that is possible if the xsk_diag interface
      is used after the socket has been unbound from the device. This can
      happen either due to the socket being closed or the device
      disappearing. In the early days of AF_XDP, the way we tested that a
      socket was not bound to a device was to simply check if the netdevice
      pointer in the xsk socket structure was NULL. Later, a better system
      was introduced by having an explicit state variable in the xsk socket
      struct. For example, the state of a socket that is on the way to being
      closed and has been unbound from the device is XSK_UNBOUND.
      
      The commit in the Fixes tag below deleted the old way of signalling
      that a socket is unbound, setting dev to NULL. This was done in the
      belief that all code using the old way had been removed. That was
      unfortunately not true, as the xsk diagnostics code was still using the
      old way and thus does not work as intended when a socket is going
      down. Fix this by introducing a test against the state variable. If
      the socket is in the state XSK_UNBOUND, simply abort the diagnostic's
      netlink operation.
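
The state test described above can be sketched in userspace as follows. All names here (xsk_sketch, SKETCH_BOUND, SKETCH_UNBOUND, diag_fill_sketch) are hypothetical stand-ins, only the two states relevant to this fix are modelled, and the error code is chosen purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified socket states: only the two relevant here. */
enum xsk_state_sketch { SKETCH_BOUND, SKETCH_UNBOUND };

struct xsk_sketch {
	enum xsk_state_sketch state;
	const int *dev_ifindex; /* may be stale once the socket is unbound */
};

/* Fill *ifindex_out for a diagnostics reply, refusing unbound sockets. */
static int diag_fill_sketch(const struct xsk_sketch *xs, int *ifindex_out)
{
	/* Test the explicit state; do not rely on the pointer being NULL. */
	if (xs->state == SKETCH_UNBOUND)
		return -ENODEV; /* error code chosen for illustration */

	*ifindex_out = *xs->dev_ifindex;
	return 0;
}
```

The point of the sketch is that the device pointer is never dereferenced for an unbound socket, even when the pointer itself is still non-NULL.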
      
      Fixes: 18b1ab7a ("xsk: Fix race at socket teardown")
      Reported-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com
      Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/bpf/20230831100119.17408-1-magnus.karlsson@gmail.com
      3e019d8a
    • Florian Westphal's avatar
      net: fib: avoid warn splat in flow dissector · 8aae7625
      Florian Westphal authored
      New skbs allocated via nf_send_reset() have skb->dev == NULL.
      
      fib*_rules_early_flow_dissect helpers already have a 'struct net'
      argument, but it's not passed down to the flow dissector core, which
      will then WARN as it can't derive a net namespace to use:
      
       WARNING: CPU: 0 PID: 0 at net/core/flow_dissector.c:1016 __skb_flow_dissect+0xa91/0x1cd0
       [..]
        ip_route_me_harder+0x143/0x330
        nf_send_reset+0x17c/0x2d0 [nf_reject_ipv4]
        nft_reject_inet_eval+0xa9/0xf2 [nft_reject_inet]
        nft_do_chain+0x198/0x5d0 [nf_tables]
        nft_do_chain_inet+0xa4/0x110 [nf_tables]
        nf_hook_slow+0x41/0xc0
        ip_local_deliver+0xce/0x110
        ..
      
      Cc: Stanislav Fomichev <sdf@google.com>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Ido Schimmel <idosch@nvidia.com>
      Fixes: 812fa71f ("netfilter: Dissect flow after packet mangling")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217826
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230830110043.30497-1-fw@strlen.de
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      8aae7625
    • Eric Dumazet's avatar
      net: read sk->sk_family once in sk_mc_loop() · a3e0fdf7
      Eric Dumazet authored
      syzbot is playing with IPV6_ADDRFORM quite a lot these days,
      and managed to hit the WARN_ON_ONCE(1) in sk_mc_loop().
      
      We have many more similar issues to fix.
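
A minimal userspace sketch of the "read once" idea named in the commit title: a field that another thread may change is snapshotted into a local variable, and every later decision uses that snapshot. The structure, the helper read_once_u16 and the function mc_loop_sketch are hypothetical; the kernel uses its own READ_ONCE() macro and socket types.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical stand-in for a socket whose family can change concurrently. */
struct sock_sketch {
	unsigned short family;
	bool mc_loop4; /* multicast loop setting used for AF_INET */
	bool mc_loop6; /* multicast loop setting used for AF_INET6 */
};

/* Force a single load of the field through a volatile access. */
static unsigned short read_once_u16(const unsigned short *p)
{
	return *(const volatile unsigned short *)p;
}

static bool mc_loop_sketch(const struct sock_sketch *sk)
{
	/* One snapshot: the switch below cannot observe two different values. */
	unsigned short family = read_once_u16(&sk->family);

	switch (family) {
	case AF_INET:
		return sk->mc_loop4;
	case AF_INET6:
		return sk->mc_loop6;
	}
	return true; /* unknown family: default chosen for illustration */
}
```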
      
      WARNING: CPU: 1 PID: 1593 at net/core/sock.c:782 sk_mc_loop+0x165/0x260
      Modules linked in:
      CPU: 1 PID: 1593 Comm: kworker/1:3 Not tainted 6.1.40-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
      Workqueue: events_power_efficient gc_worker
      RIP: 0010:sk_mc_loop+0x165/0x260 net/core/sock.c:782
      Code: 34 1b fd 49 81 c7 18 05 00 00 4c 89 f8 48 c1 e8 03 42 80 3c 20 00 74 08 4c 89 ff e8 25 36 6d fd 4d 8b 37 eb 13 e8 db 33 1b fd <0f> 0b b3 01 eb 34 e8 d0 33 1b fd 45 31 f6 49 83 c6 38 4c 89 f0 48
      RSP: 0018:ffffc90000388530 EFLAGS: 00010246
      RAX: ffffffff846d9b55 RBX: 0000000000000011 RCX: ffff88814f884980
      RDX: 0000000000000102 RSI: ffffffff87ae5160 RDI: 0000000000000011
      RBP: ffffc90000388550 R08: 0000000000000003 R09: ffffffff846d9a65
      R10: 0000000000000002 R11: ffff88814f884980 R12: dffffc0000000000
      R13: ffff88810dbee000 R14: 0000000000000010 R15: ffff888150084000
      FS: 0000000000000000(0000) GS:ffff8881f6b00000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000180 CR3: 000000014ee5b000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
      <IRQ>
      [<ffffffff8507734f>] ip6_finish_output2+0x33f/0x1ae0 net/ipv6/ip6_output.c:83
      [<ffffffff85062766>] __ip6_finish_output net/ipv6/ip6_output.c:200 [inline]
      [<ffffffff85062766>] ip6_finish_output+0x6c6/0xb10 net/ipv6/ip6_output.c:211
      [<ffffffff85061f8c>] NF_HOOK_COND include/linux/netfilter.h:298 [inline]
      [<ffffffff85061f8c>] ip6_output+0x2bc/0x3d0 net/ipv6/ip6_output.c:232
      [<ffffffff852071cf>] dst_output include/net/dst.h:444 [inline]
      [<ffffffff852071cf>] ip6_local_out+0x10f/0x140 net/ipv6/output_core.c:161
      [<ffffffff83618fb4>] ipvlan_process_v6_outbound drivers/net/ipvlan/ipvlan_core.c:483 [inline]
      [<ffffffff83618fb4>] ipvlan_process_outbound drivers/net/ipvlan/ipvlan_core.c:529 [inline]
      [<ffffffff83618fb4>] ipvlan_xmit_mode_l3 drivers/net/ipvlan/ipvlan_core.c:602 [inline]
      [<ffffffff83618fb4>] ipvlan_queue_xmit+0x1174/0x1be0 drivers/net/ipvlan/ipvlan_core.c:677
      [<ffffffff8361ddd9>] ipvlan_start_xmit+0x49/0x100 drivers/net/ipvlan/ipvlan_main.c:229
      [<ffffffff84763fc0>] netdev_start_xmit include/linux/netdevice.h:4925 [inline]
      [<ffffffff84763fc0>] xmit_one net/core/dev.c:3644 [inline]
      [<ffffffff84763fc0>] dev_hard_start_xmit+0x320/0x980 net/core/dev.c:3660
      [<ffffffff8494c650>] sch_direct_xmit+0x2a0/0x9c0 net/sched/sch_generic.c:342
      [<ffffffff8494d883>] qdisc_restart net/sched/sch_generic.c:407 [inline]
      [<ffffffff8494d883>] __qdisc_run+0xb13/0x1e70 net/sched/sch_generic.c:415
      [<ffffffff8478c426>] qdisc_run+0xd6/0x260 include/net/pkt_sched.h:125
      [<ffffffff84796eac>] net_tx_action+0x7ac/0x940 net/core/dev.c:5247
      [<ffffffff858002bd>] __do_softirq+0x2bd/0x9bd kernel/softirq.c:599
      [<ffffffff814c3fe8>] invoke_softirq kernel/softirq.c:430 [inline]
      [<ffffffff814c3fe8>] __irq_exit_rcu+0xc8/0x170 kernel/softirq.c:683
      [<ffffffff814c3f09>] irq_exit_rcu+0x9/0x20 kernel/softirq.c:695
      
      Fixes: 7ad6848c ("ip: fix mc_loop checks for tunnels with multicast outer addresses")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230830101244.1146934-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      a3e0fdf7
    • Eric Dumazet's avatar
      ipv4: annotate data-races around fi->fib_dead · fce92af1
      Eric Dumazet authored
      syzbot complained about a data-race in fib_table_lookup() [1]
      
      Add appropriate annotations to document it.
      
      [1]
      BUG: KCSAN: data-race in fib_release_info / fib_table_lookup
      
      write to 0xffff888150f31744 of 1 bytes by task 1189 on cpu 0:
      fib_release_info+0x3a0/0x460 net/ipv4/fib_semantics.c:281
      fib_table_delete+0x8d2/0x900 net/ipv4/fib_trie.c:1777
      fib_magic+0x1c1/0x1f0 net/ipv4/fib_frontend.c:1106
      fib_del_ifaddr+0x8cf/0xa60 net/ipv4/fib_frontend.c:1317
      fib_inetaddr_event+0x77/0x200 net/ipv4/fib_frontend.c:1448
      notifier_call_chain kernel/notifier.c:93 [inline]
      blocking_notifier_call_chain+0x90/0x200 kernel/notifier.c:388
      __inet_del_ifa+0x4df/0x800 net/ipv4/devinet.c:432
      inet_del_ifa net/ipv4/devinet.c:469 [inline]
      inetdev_destroy net/ipv4/devinet.c:322 [inline]
      inetdev_event+0x553/0xaf0 net/ipv4/devinet.c:1606
      notifier_call_chain kernel/notifier.c:93 [inline]
      raw_notifier_call_chain+0x6b/0x1c0 kernel/notifier.c:461
      call_netdevice_notifiers_info net/core/dev.c:1962 [inline]
      call_netdevice_notifiers_mtu+0xd2/0x130 net/core/dev.c:2037
      dev_set_mtu_ext+0x30b/0x3e0 net/core/dev.c:8673
      do_setlink+0x5be/0x2430 net/core/rtnetlink.c:2837
      rtnl_setlink+0x255/0x300 net/core/rtnetlink.c:3177
      rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6445
      netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2549
      rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6463
      netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
      netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
      netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1914
      sock_sendmsg_nosec net/socket.c:725 [inline]
      sock_sendmsg net/socket.c:748 [inline]
      sock_write_iter+0x1aa/0x230 net/socket.c:1129
      do_iter_write+0x4b4/0x7b0 fs/read_write.c:860
      vfs_writev+0x1a8/0x320 fs/read_write.c:933
      do_writev+0xf8/0x220 fs/read_write.c:976
      __do_sys_writev fs/read_write.c:1049 [inline]
      __se_sys_writev fs/read_write.c:1046 [inline]
      __x64_sys_writev+0x45/0x50 fs/read_write.c:1046
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffff888150f31744 of 1 bytes by task 21839 on cpu 1:
      fib_table_lookup+0x2bf/0xd50 net/ipv4/fib_trie.c:1585
      fib_lookup include/net/ip_fib.h:383 [inline]
      ip_route_output_key_hash_rcu+0x38c/0x12c0 net/ipv4/route.c:2751
      ip_route_output_key_hash net/ipv4/route.c:2641 [inline]
      __ip_route_output_key include/net/route.h:134 [inline]
      ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2869
      send4+0x1e7/0x500 drivers/net/wireguard/socket.c:61
      wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175
      wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200
      wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline]
      wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51
      process_one_work+0x434/0x860 kernel/workqueue.c:2600
      worker_thread+0x5f2/0xa10 kernel/workqueue.c:2751
      kthread+0x1d7/0x210 kernel/kthread.c:389
      ret_from_fork+0x2e/0x40 arch/x86/kernel/process.c:145
      ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
      
      value changed: 0x00 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 21839 Comm: kworker/u4:18 Tainted: G W 6.5.0-syzkaller #0
      
      Fixes: dccd9ecc ("ipv4: Do not use dead fib_info entries.")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230830095520.1046984-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      fce92af1
    • Eric Dumazet's avatar
      sctp: annotate data-races around sk->sk_wmem_queued · dc9511dd
      Eric Dumazet authored
      sk->sk_wmem_queued can be read locklessly from sctp_poll().
      
      Use sk_wmem_queued_add() when the field is changed,
      and add READ_ONCE() annotations in sctp_writeable()
      and sctp_assocs_seq_show().
      
      syzbot reported:
      
      BUG: KCSAN: data-race in sctp_poll / sctp_wfree
      
      read-write to 0xffff888149d77810 of 4 bytes by interrupt on cpu 0:
      sctp_wfree+0x170/0x4a0 net/sctp/socket.c:9147
      skb_release_head_state+0xb7/0x1a0 net/core/skbuff.c:988
      skb_release_all net/core/skbuff.c:1000 [inline]
      __kfree_skb+0x16/0x140 net/core/skbuff.c:1016
      consume_skb+0x57/0x180 net/core/skbuff.c:1232
      sctp_chunk_destroy net/sctp/sm_make_chunk.c:1503 [inline]
      sctp_chunk_put+0xcd/0x130 net/sctp/sm_make_chunk.c:1530
      sctp_datamsg_put+0x29a/0x300 net/sctp/chunk.c:128
      sctp_chunk_free+0x34/0x50 net/sctp/sm_make_chunk.c:1515
      sctp_outq_sack+0xafa/0xd70 net/sctp/outqueue.c:1381
      sctp_cmd_process_sack net/sctp/sm_sideeffect.c:834 [inline]
      sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1366 [inline]
      sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline]
      sctp_do_sm+0x12c7/0x31b0 net/sctp/sm_sideeffect.c:1169
      sctp_assoc_bh_rcv+0x2b2/0x430 net/sctp/associola.c:1051
      sctp_inq_push+0x108/0x120 net/sctp/inqueue.c:80
      sctp_rcv+0x116e/0x1340 net/sctp/input.c:243
      sctp6_rcv+0x25/0x40 net/sctp/ipv6.c:1120
      ip6_protocol_deliver_rcu+0x92f/0xf30 net/ipv6/ip6_input.c:437
      ip6_input_finish net/ipv6/ip6_input.c:482 [inline]
      NF_HOOK include/linux/netfilter.h:303 [inline]
      ip6_input+0xbd/0x1b0 net/ipv6/ip6_input.c:491
      dst_input include/net/dst.h:468 [inline]
      ip6_rcv_finish+0x1e2/0x2e0 net/ipv6/ip6_input.c:79
      NF_HOOK include/linux/netfilter.h:303 [inline]
      ipv6_rcv+0x74/0x150 net/ipv6/ip6_input.c:309
      __netif_receive_skb_one_core net/core/dev.c:5452 [inline]
      __netif_receive_skb+0x90/0x1b0 net/core/dev.c:5566
      process_backlog+0x21f/0x380 net/core/dev.c:5894
      __napi_poll+0x60/0x3b0 net/core/dev.c:6460
      napi_poll net/core/dev.c:6527 [inline]
      net_rx_action+0x32b/0x750 net/core/dev.c:6660
      __do_softirq+0xc1/0x265 kernel/softirq.c:553
      run_ksoftirqd+0x17/0x20 kernel/softirq.c:921
      smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164
      kthread+0x1d7/0x210 kernel/kthread.c:389
      ret_from_fork+0x2e/0x40 arch/x86/kernel/process.c:145
      ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304
      
      read to 0xffff888149d77810 of 4 bytes by task 17828 on cpu 1:
      sctp_writeable net/sctp/socket.c:9304 [inline]
      sctp_poll+0x265/0x410 net/sctp/socket.c:8671
      sock_poll+0x253/0x270 net/socket.c:1374
      vfs_poll include/linux/poll.h:88 [inline]
      do_pollfd fs/select.c:873 [inline]
      do_poll fs/select.c:921 [inline]
      do_sys_poll+0x636/0xc00 fs/select.c:1015
      __do_sys_ppoll fs/select.c:1121 [inline]
      __se_sys_ppoll+0x1af/0x1f0 fs/select.c:1101
      __x64_sys_ppoll+0x67/0x80 fs/select.c:1101
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x00019e80 -> 0x0000cc80
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 17828 Comm: syz-executor.1 Not tainted 6.5.0-rc7-syzkaller-00185-g28f20a19 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Xin Long <lucien.xin@gmail.com>
      Link: https://lore.kernel.org/r/20230830094519.950007-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      dc9511dd
    • Eric Dumazet's avatar
      net/sched: fq_pie: avoid stalls in fq_pie_timer() · 8c21ab1b
      Eric Dumazet authored
      When setting a high number of flows (limit being 65536),
      fq_pie_timer() is currently using too much time as syzbot reported.
      
      Add logic to yield the cpu every 2048 flows (less than 150 usec
      on debug kernels).
      It should also help by not blocking qdisc fast paths for too long.
      Worst case (65536 flows) would need 31 jiffies for a complete scan.
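
The batching described above can be sketched in userspace as a scan that keeps a cursor and processes at most 2048 flows per invocation. The structure and function names are hypothetical, and the per-flow work is reduced to a counter; this is not the scheduler's actual timer code.

```c
#include <assert.h>
#include <stdbool.h>

#define FLOWS_PER_BATCH 2048u /* batch size named in the commit message */

struct scan_sketch {
	unsigned int flows_cnt; /* total number of flows to visit */
	unsigned int cursor;    /* next flow index to visit */
	unsigned long visited;  /* stand-in for the per-flow work */
};

/* Visit at most FLOWS_PER_BATCH flows. Returns true when the caller should
 * re-arm the timer to continue, false when the full scan is complete. */
static bool scan_batch_sketch(struct scan_sketch *s)
{
	unsigned int remaining = s->flows_cnt - s->cursor;
	unsigned int n = remaining < FLOWS_PER_BATCH ? remaining : FLOWS_PER_BATCH;

	for (unsigned int i = 0; i < n; i++) {
		s->visited++;
		s->cursor++;
	}
	if (s->cursor < s->flows_cnt)
		return true;
	s->cursor = 0; /* the next call starts a fresh scan */
	return false;
}
```

With 65536 flows the scan takes 32 batches and is therefore re-armed 31 times; if each re-arm waits one jiffy (an assumption of this sketch), that is consistent with the 31 jiffies quoted above.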
      
      Relevant extract from syzbot report:
      
      rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 0-.... } 2663 jiffies s: 873 root: 0x1/.
      rcu: blocking rcu_node structures (internal RCU debug):
      Sending NMI from CPU 1 to CPUs 0:
      NMI backtrace for cpu 0
      CPU: 0 PID: 5177 Comm: syz-executor273 Not tainted 6.5.0-syzkaller-00453-g727dbda1 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
      RIP: 0010:check_kcov_mode kernel/kcov.c:173 [inline]
      RIP: 0010:write_comp_data+0x21/0x90 kernel/kcov.c:236
      Code: 2e 0f 1f 84 00 00 00 00 00 65 8b 05 01 b2 7d 7e 49 89 f1 89 c6 49 89 d2 81 e6 00 01 00 00 49 89 f8 65 48 8b 14 25 80 b9 03 00 <a9> 00 01 ff 00 74 0e 85 f6 74 59 8b 82 04 16 00 00 85 c0 74 4f 8b
      RSP: 0018:ffffc90000007bb8 EFLAGS: 00000206
      RAX: 0000000000000101 RBX: ffffc9000dc0d140 RCX: ffffffff885893b0
      RDX: ffff88807c075940 RSI: 0000000000000100 RDI: 0000000000000001
      RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000dc0d178
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000555555d54380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b442f6130 CR3: 000000006fe1c000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <NMI>
       </NMI>
       <IRQ>
       pie_calculate_probability+0x480/0x850 net/sched/sch_pie.c:415
       fq_pie_timer+0x1da/0x4f0 net/sched/sch_fq_pie.c:387
       call_timer_fn+0x1a0/0x580 kernel/time/timer.c:1700
      
      Fixes: ec97ecf1 ("net: sched: add Flow Queue PIE packet scheduler")
      Link: https://lore.kernel.org/lkml/00000000000017ad3f06040bf394@google.com/
      Reported-by: syzbot+e46fbd5289363464bc13@syzkaller.appspotmail.com
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Link: https://lore.kernel.org/r/20230829123541.3745013-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      8c21ab1b
    • Jakub Kicinski's avatar
      Merge tag 'nf-23-08-31' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 4e60de1e
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix mangling of TCP options with non-linear skbuff, from Xiao Liang.
      
      2) OOB read in xt_sctp due to missing sanitization of array length field.
         From Wander Lairson Costa.
      
      3) OOB read in xt_u32 due to missing sanitization of array length field.
         Also from Wander Lairson Costa.
      
      All of the above have been broken for several releases.
      
      4) Missing audit log for set element reset command, from Phil Sutter.
      
      5) Missing audit log for rule reset command, also from Phil.
      
      This audit log support is missing in 6.5.
      
      * tag 'nf-23-08-31' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: Audit log rule reset
        netfilter: nf_tables: Audit log setelem reset
        netfilter: xt_u32: validate user space input
        netfilter: xt_sctp: validate the flag_info count
        netfilter: nft_exthdr: Fix non-linear header modification
      ====================
      
      Link: https://lore.kernel.org/r/20230830235935.465690-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4e60de1e
    • Donald Hunter's avatar
      doc/netlink: Fix missing classic_netlink doc reference · ee940b57
      Donald Hunter authored
      Add missing cross-reference label for classic_netlink.
      
      Fixes: 2db8abf0 ("doc/netlink: Document the netlink-raw schema extensions")
      Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
      Link: https://lore.kernel.org/r/20230829085539.36354-1-donald.hunter@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ee940b57
    • Oliver Neukum's avatar
    • Russell King (Oracle)'s avatar
      net: stmmac: failure to probe without MAC interface specified · b5947239
      Russell King (Oracle) authored
      Alexander Stein reports that commit a014c355 ("net: stmmac: clarify
      difference between "interface" and "phy_interface"") caused breakage,
      because plat->mac_interface will never be negative. Fix this by using
      the "rc" temporary variable in stmmac_probe_config_dt().
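
The underlying pitfall can be sketched in userspace: once an error return is stored in a field that cannot hold negative values, a later "< 0" check on that field never fires, so the result has to be tested in a signed temporary first. The structure, the lookup helper and the fallback to phy_interface below are assumptions for illustration, not the driver's code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in: the fields are unsigned and can never compare < 0. */
struct plat_sketch {
	unsigned int phy_interface;
	unsigned int mac_interface;
};

/* Hypothetical lookup: returns a mode >= 0, or a negative errno when no
 * MAC interface mode was specified. */
static int get_mac_mode_sketch(int specified_mode)
{
	return specified_mode >= 0 ? specified_mode : -ENODEV;
}

static void probe_config_sketch(struct plat_sketch *plat, int specified_mode)
{
	/* Keep the result in a signed temporary so the error check works. */
	int rc = get_mac_mode_sketch(specified_mode);

	/* Fallback value is an assumption made for this sketch. */
	plat->mac_interface = rc < 0 ? plat->phy_interface : (unsigned int)rc;
}
```
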
      Reported-by: Alexander Stein <alexander.stein@ew.tq-group.com>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com>
      Link: https://lore.kernel.org/r/E1qayn0-006Q8J-GE@rmk-PC.armlinux.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b5947239
  5. 30 Aug, 2023 3 commits