1. 10 Jun, 2021 32 commits
    • Merge branch 'mptcp-fixes' · 232e3683
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      mptcp: More v5.13 fixes
      
      Here's another batch of MPTCP fixes for v5.13.
      
      Patch 1 cleans up memory accounting between the MPTCP-level socket and
      the subflows to more reliably transfer forward allocated memory under
      pressure.
      
      Patch 2 wakes up socket readers more reliably.
      
      Patch 3 changes a WARN_ONCE to a pr_debug.
      
      Patch 4 changes the selftests to only use syncookies in test cases where
      they do not cause spurious failures.
      
      Patch 5 modifies socket error reporting to avoid a possible soft lockup.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: fix soft lookup in subflow_error_report() · 499ada50
      Paolo Abeni authored
      Maxim reported a soft lockup in subflow_error_report():
      
       watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
       RIP: 0010:native_queued_spin_lock_slowpath
       RSP: 0018:ffffa859c0003bc0 EFLAGS: 00000202
       RAX: 0000000000000101 RBX: 0000000000000001 RCX: 0000000000000000
       RDX: ffff9195c2772d88 RSI: 0000000000000000 RDI: ffff9195c2772d88
       RBP: ffff9195c2772d00 R08: 00000000000067b0 R09: c6e31da9eb1e44f4
       R10: ffff9195ef379700 R11: ffff9195edb50710 R12: ffff9195c2772d88
       R13: ffff9195f500e3d0 R14: ffff9195ef379700 R15: ffff9195ef379700
       FS:  0000000000000000(0000) GS:ffff91961f400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000000c000407000 CR3: 0000000002988000 CR4: 00000000000006f0
       Call Trace:
        <IRQ>
       _raw_spin_lock_bh
       subflow_error_report
       mptcp_subflow_data_available
       __mptcp_move_skbs_from_subflow
       mptcp_data_ready
       tcp_data_queue
       tcp_rcv_established
       tcp_v4_do_rcv
       tcp_v4_rcv
       ip_protocol_deliver_rcu
       ip_local_deliver_finish
       __netif_receive_skb_one_core
       netif_receive_skb
       rtl8139_poll 8139too
       __napi_poll
       net_rx_action
       __do_softirq
       __irq_exit_rcu
       common_interrupt
        </IRQ>
      
      The calling function - mptcp_subflow_data_available() - can be invoked
      from different contexts:
      - plain ssk socket lock
      - ssk socket lock + mptcp_data_lock
      - ssk socket lock + mptcp_data_lock + msk socket lock.
      
      Since subflow_error_report() tries to acquire the mptcp_data_lock, the
      latter two call chains will cause a soft lockup.
      
      This change addresses the issue by moving the error reporting call to
      the outer functions, where the list of held locks is known and we can
      acquire only the needed one.
      Reported-by: Maxim Galaganov <max@internet.ru>
      Fixes: 15cc1045 ("mptcp: deliver ssk errors to msk")
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/199
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: enable syncookie only in absence of reorders · 2395da0e
      Paolo Abeni authored
      Syncookie validation may fail for OoO packets, causing spurious
      resets and self-test failures, so let's force syncookies only
      for test iterations with no OoO.
      
      Fixes: fed61c4b ("selftests: mptcp: make 2nd net namespace use tcp syn cookies unconditionally")
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/198
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: do not warn on bad input from the network · 61e71022
      Paolo Abeni authored
      warn_bad_map() produces a kernel WARN on bad input coming
      from the network. Use pr_debug() to avoid spamming the system
      log.
      
      Additionally, when the right bound check fails, warn_bad_map() reports
      the wrong ssn value; fix that as well.
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/107
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: wake-up readers only for in sequence data · 99d1055c
      Paolo Abeni authored
      Currently we rely on the subflow->data_avail field, which is subject to
      races:
      
      	ssk1
      		skb len = 500 DSS(seq=1, len=1000, off=0)
      		# data_avail == MPTCP_SUBFLOW_DATA_AVAIL
      
      	ssk2
      		skb len = 500 DSS(seq = 501, len=1000)
      		# data_avail == MPTCP_SUBFLOW_DATA_AVAIL
      
      	ssk1
      		skb len = 500 DSS(seq = 1, len=1000, off =500)
      		# still data_avail == MPTCP_SUBFLOW_DATA_AVAIL,
      		# as the skb is covered by a pre-existing map,
      		# which was in-sequence at reception time.
      
      Instead we can explicitly check if some data has been received in-sequence,
      propagating the info from __mptcp_move_skbs_from_subflow().
      
      Additionally add the 'ONCE' annotation to the 'data_avail' memory
      access, as msk will read it outside the subflow socket lock.
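
The explicit in-sequence check can be pictured with a minimal user-space sketch. It assumes a heavily simplified model: `struct msk_model`, `move_skb()` and the single `ack_seq` counter are hypothetical names rather than the kernel's data structures, and out-of-order data is only reported, not queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the MPTCP-level receive state. */
struct msk_model {
	uint64_t ack_seq;	/* next data sequence number expected in order */
};

/*
 * Hypothetical helper: account one skb whose payload starts at data
 * sequence number map_seq + off and spans len bytes. Returns true only
 * when the payload is in sequence, i.e. when a reader should be woken.
 */
static bool move_skb(struct msk_model *msk, uint64_t map_seq, uint32_t off,
		     uint32_t len)
{
	assert(msk != NULL);

	if (map_seq + off != msk->ack_seq)
		return false;	/* not in sequence: no wake-up */

	msk->ack_seq += len;	/* in sequence: advance and report it */
	return true;
}
```

Replaying the three skbs from the scenario above against this model reports in-sequence data for the first two only; the third lands on sequence space that was already received, so no wake-up is signalled for it.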
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: try harder to borrow memory from subflow under pressure · 72f96132
      Paolo Abeni authored
      If the host is under severe memory pressure, and RX forward
      memory allocation for the msk fails, we try to borrow the
      required memory from the ingress subflow.
      
      The current attempt is a bit flaky: if skb->truesize is less
      than SK_MEM_QUANTUM, the ssk will not release any memory, and
      the next schedule will fail again.
      
      Instead, directly move the required amount of pages from the
      ssk to the msk, if available.
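
A minimal user-space sketch of the direct transfer, under stated assumptions: `MEM_QUANTUM` stands in for SK_MEM_QUANTUM with an assumed value of 4096, and `struct sock_model`, `mem_pages()` and `borrow_from_subflow()` are hypothetical names modelling only the forward-allocation counters. It illustrates the idea, not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MEM_QUANTUM 4096	/* assumed value, stands in for SK_MEM_QUANTUM */

/* Hypothetical model: only the forward-allocated byte count is tracked. */
struct sock_model {
	int forward_alloc;
};

/* Number of whole quanta needed to cover size bytes (rounds up). */
static int mem_pages(int size)
{
	return (size + MEM_QUANTUM - 1) / MEM_QUANTUM;
}

/*
 * Move the quanta needed for size bytes from the subflow (ssk) to the
 * MPTCP socket (msk). Fails without side effects if the subflow does
 * not hold enough forward-allocated memory.
 */
static bool borrow_from_subflow(struct sock_model *msk, struct sock_model *ssk,
				int size)
{
	int amount;

	assert(msk != NULL && ssk != NULL);
	amount = mem_pages(size) * MEM_QUANTUM;
	if (ssk->forward_alloc < amount)
		return false;

	ssk->forward_alloc -= amount;
	msk->forward_alloc += amount;
	return true;
}
```

Because the amount is rounded up to whole quanta, a 500-byte skb still moves one full quantum, which is the case the commit message describes as failing when relying on the ssk to release memory by itself.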
      
      Fixes: 9c3f94e1 ("mptcp: add missing memory scheduling in the rx path")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 22488e45
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix a crash when stateful expression with its own gc callback
         is used in a set definition.
      
      2) Skip IPv6 packets from any link-local address in IPv6 fib expression.
         Add a selftest for this scenario, from Florian Westphal.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-options-oob-fixes' · 0280f429
      David S. Miller authored
      Maxim Mikityanskiy says:
      
      ====================
      Fix out of bounds when parsing TCP options
      
      This series fixes out-of-bounds access in various places in the kernel
      where parsing of TCP options takes place. Fortunately, many more
      occurrences don't have this bug.
      
      v2 changes:
      
      synproxy: Added an early return when length < 0 to avoid calling
      skb_header_pointer with negative length.
      
      sch_cake: Added doff validation to avoid parsing garbage.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sch_cake: Fix out of bounds when parsing TCP options and header · ba91c49d
      Maxim Mikityanskiy authored
      The TCP option parser in cake qdisc (cake_get_tcpopt and
      cake_tcph_may_drop) could read one byte out of bounds. When the length
      is 1, the execution flow gets into the loop, reads one byte of the
      opcode, and if the opcode is neither TCPOPT_EOL nor TCPOPT_NOP, it reads
      one more byte, which exceeds the length of 1.
      
      This fix is inspired by commit 9609dad2 ("ipv4: tcp_input: fix stack
      out of bounds when parsing TCP options.").
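
The class of bug and its guard can be shown with a small user-space sketch. `find_tcp_option()` is a hypothetical helper written for illustration, not one of the kernel parsers named above, and the two option kinds are defined locally with their standard values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL 0	/* end of option list */
#define TCPOPT_NOP 1	/* one-byte padding */

/*
 * Hypothetical helper: return a pointer to the first option of kind
 * `code` inside opts[0..length), or NULL if it is absent or the option
 * list is malformed.
 */
static const uint8_t *find_tcp_option(const uint8_t *opts, int length,
				      uint8_t code)
{
	assert(opts != NULL);

	while (length > 0) {
		uint8_t opcode = *opts++;
		int opsize;

		if (opcode == TCPOPT_EOL)
			return NULL;
		if (opcode == TCPOPT_NOP) {	/* single byte, no size field */
			length--;
			continue;
		}
		/* The guard: the size byte must still be inside the buffer. */
		if (length < 2)
			return NULL;
		opsize = *opts++;
		if (opsize < 2 || opsize > length)
			return NULL;	/* silly or truncated option */
		if (opcode == code)
			return opts - 2;	/* start of the matching option */
		opts += opsize - 2;
		length -= opsize;
	}
	return NULL;
}
```

Without the `length < 2` check, a one-byte buffer holding any opcode other than EOL or NOP would make the parser read the size byte from beyond the buffer, which is the one-byte overread described above.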
      
      v2 changes:
      
      Added doff validation in cake_get_tcphdr to avoid parsing garbage as TCP
      header. Although it wasn't strictly an out-of-bounds access (memory was
      allocated), garbage values could be read where CAKE expected the TCP
      header if doff was smaller than 5.
      
      Cc: Young Xiao <92siuyang@gmail.com>
      Fixes: 8b713881 ("sch_cake: Add optional ACK filter")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: Fix out of bounds when parsing TCP options · 07718be2
      Maxim Mikityanskiy authored
      The TCP option parser in mptcp (mptcp_get_options) could read one byte
      out of bounds. When the length is 1, the execution flow gets into the
      loop, reads one byte of the opcode, and if the opcode is neither
      TCPOPT_EOL nor TCPOPT_NOP, it reads one more byte, which exceeds the
      length of 1.
      
      This fix is inspired by commit 9609dad2 ("ipv4: tcp_input: fix stack
      out of bounds when parsing TCP options.").
      
      Cc: Young Xiao <92siuyang@gmail.com>
      Fixes: cec37a6e ("mptcp: Handle MP_CAPABLE options for outgoing connections")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: synproxy: Fix out of bounds when parsing TCP options · 5fc177ab
      Maxim Mikityanskiy authored
      The TCP option parser in synproxy (synproxy_parse_options) could read
      one byte out of bounds. When the length is 1, the execution flow gets
      into the loop, reads one byte of the opcode, and if the opcode is
      neither TCPOPT_EOL nor TCPOPT_NOP, it reads one more byte, which exceeds
      the length of 1.
      
      This fix is inspired by commit 9609dad2 ("ipv4: tcp_input: fix stack
      out of bounds when parsing TCP options.").
      
      v2 changes:
      
      Added an early return when length < 0 to avoid calling
      skb_header_pointer with negative length.
      
      Cc: Young Xiao <92siuyang@gmail.com>
      Fixes: 48b1de4c ("netfilter: add SYNPROXY core/target")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/packet: annotate data race in packet_sendmsg() · d1b5bee4
      Eric Dumazet authored
      There is a known race in packet_sendmsg(), addressed
      in commit 32d3182c ("net/packet: fix race in tpacket_snd()").
      
      Now that we have data_race(), we can use it to avoid a future KCSAN warning,
      as syzbot loves stressing af_packet sockets :)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: annotate date races around sk->sk_txhash · b71eaed8
      Eric Dumazet authored
      The UDP sendmsg() path can be lockless, so it is possible for another
      thread to re-connect and change sk->sk_txhash under us.
      
      There is no serious impact, but we can use READ_ONCE()/WRITE_ONCE()
      pair to document the race.
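
As a rough user-space analogue of such a marked access pair, the sketch below uses C11 relaxed atomics. This is a swapped-in technique: the kernel's READ_ONCE()/WRITE_ONCE() are not C11 atomics, and `struct sock_model`, `set_txhash()` and `get_txhash()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a socket with a lockless hash field. */
struct sock_model {
	_Atomic uint32_t txhash;
};

/* Writer side: one marked store, no ordering implied. */
static void set_txhash(struct sock_model *sk, uint32_t hash)
{
	assert(sk != NULL);
	atomic_store_explicit(&sk->txhash, hash, memory_order_relaxed);
}

/* Reader side: one marked load; it may observe the old or new value. */
static uint32_t get_txhash(struct sock_model *sk)
{
	assert(sk != NULL);
	return atomic_load_explicit(&sk->txhash, memory_order_relaxed);
}
```

The point of the marking is documentation plus a guarantee that each access happens as a single load or store; a concurrent reader still sees either the old or the new hash, which the commit message considers harmless.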
      
      BUG: KCSAN: data-race in __ip4_datagram_connect / skb_set_owner_w
      
      write to 0xffff88813397920c of 4 bytes by task 30997 on cpu 1:
       sk_set_txhash include/net/sock.h:1937 [inline]
       __ip4_datagram_connect+0x69e/0x710 net/ipv4/datagram.c:75
       __ip6_datagram_connect+0x551/0x840 net/ipv6/datagram.c:189
       ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
       inet_dgram_connect+0xfd/0x180 net/ipv4/af_inet.c:580
       __sys_connect_file net/socket.c:1837 [inline]
       __sys_connect+0x245/0x280 net/socket.c:1854
       __do_sys_connect net/socket.c:1864 [inline]
       __se_sys_connect net/socket.c:1861 [inline]
       __x64_sys_connect+0x3d/0x50 net/socket.c:1861
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88813397920c of 4 bytes by task 31039 on cpu 0:
       skb_set_hash_from_sk include/net/sock.h:2211 [inline]
       skb_set_owner_w+0x118/0x220 net/core/sock.c:2101
       sock_alloc_send_pskb+0x452/0x4e0 net/core/sock.c:2359
       sock_alloc_send_skb+0x2d/0x40 net/core/sock.c:2373
       __ip6_append_data+0x1743/0x21a0 net/ipv6/ip6_output.c:1621
       ip6_make_skb+0x258/0x420 net/ipv6/ip6_output.c:1983
       udpv6_sendmsg+0x160a/0x16b0 net/ipv6/udp.c:1527
       inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0xbca3c43d -> 0xfdb309e0
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 31039 Comm: syz-executor.2 Not tainted 5.13.0-rc3-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate data race in sock_error() · f13ef100
      Eric Dumazet authored
      sock_error() is known to be racy. The code avoids
      an atomic operation if sk_err is zero, and this field
      could be changed under us; this is fine.
      
      Syzbot reported:
      
      BUG: KCSAN: data-race in sock_alloc_send_pskb / unix_release_sock
      
      write to 0xffff888131855630 of 4 bytes by task 9365 on cpu 1:
       unix_release_sock+0x2e9/0x6e0 net/unix/af_unix.c:550
       unix_release+0x2f/0x50 net/unix/af_unix.c:859
       __sock_release net/socket.c:599 [inline]
       sock_close+0x6c/0x150 net/socket.c:1258
       __fput+0x25b/0x4e0 fs/file_table.c:280
       ____fput+0x11/0x20 fs/file_table.c:313
       task_work_run+0xae/0x130 kernel/task_work.c:164
       tracehook_notify_resume include/linux/tracehook.h:189 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
       exit_to_user_mode_prepare+0x156/0x190 kernel/entry/common.c:208
       __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
       syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:301
       do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888131855630 of 4 bytes by task 9385 on cpu 0:
       sock_error include/net/sock.h:2269 [inline]
       sock_alloc_send_pskb+0xe4/0x4e0 net/core/sock.c:2336
       unix_dgram_sendmsg+0x478/0x1610 net/unix/af_unix.c:1671
       unix_seqpacket_sendmsg+0xc2/0x100 net/unix/af_unix.c:2055
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       __sys_sendmsg_sock+0x25/0x30 net/socket.c:2416
       io_sendmsg fs/io_uring.c:4367 [inline]
       io_issue_sqe+0x231a/0x6750 fs/io_uring.c:6135
       __io_queue_sqe+0xe9/0x360 fs/io_uring.c:6414
       __io_req_task_submit fs/io_uring.c:2039 [inline]
       io_async_task_func+0x312/0x590 fs/io_uring.c:5074
       __tctx_task_work fs/io_uring.c:1910 [inline]
       tctx_task_work+0x1d4/0x3d0 fs/io_uring.c:1924
       task_work_run+0xae/0x130 kernel/task_work.c:164
       tracehook_notify_signal include/linux/tracehook.h:212 [inline]
       handle_signal_work kernel/entry/common.c:145 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0xf8/0x190 kernel/entry/common.c:208
       __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
       syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:301
       do_syscall_64+0x56/0x90 arch/x86/entry/common.c:57
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x00000000 -> 0x00000068
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 9385 Comm: syz-executor.3 Not tainted 5.13.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridge-egress-fixes' · 172947ac
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: vlan tunnel egress path fixes
      
      These two fixes take care of tunnel_dst problems in the vlan tunnel egress
      path. Patch 01 fixes a null ptr deref due to the lockless use of tunnel_dst
      pointer without checking it first, and patch 02 fixes a use-after-free
      issue due to wrong dst refcounting (dst_clone() -> dst_hold_safe()).
      
      Both fix the same commit and should be queued for stable backports:
      Fixes: 11538d03 ("bridge: vlan dst_metadata hooks in ingress and egress paths")
      
      v2: no changes, added stable list to CC
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: fix vlan tunnel dst refcnt when egressing · cfc579f9
      Nikolay Aleksandrov authored
      The egress tunnel code uses dst_clone() and directly sets the result,
      which is wrong because the entry might have 0 refcnt or be already deleted,
      causing a number of problems. It also triggers the WARN_ON() in dst_hold()[1]
      when a refcnt couldn't be taken. Fix it by using dst_hold_safe() and
      checking if a reference was actually taken before setting the dst.
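
The "take a reference only if the object is still alive" pattern can be sketched in user space with C11 atomics. `struct entry`, `hold_safe()` and `attach()` are hypothetical names for illustration and do not reproduce the kernel's dst code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object; refcnt == 0 means it is being freed. */
struct entry {
	atomic_int refcnt;
};

/* Increment the refcount unless it already dropped to zero. */
static bool hold_safe(struct entry *e)
{
	int old;

	assert(e != NULL);
	old = atomic_load(&e->refcnt);
	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&e->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Use the entry only when a reference was actually taken. */
static struct entry *attach(struct entry *e)
{
	return (e != NULL && hold_safe(e)) ? e : NULL;
}
```

An unconditional increment would turn a refcount of 0 back into 1 on an object that is already on its way to being freed; the conditional form lets the caller see the failure and skip setting the pointer.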
      
      [1] dmesg WARN_ON log and following refcnt errors
       WARNING: CPU: 5 PID: 38 at include/net/dst.h:230 br_handle_egress_vlan_tunnel+0x10b/0x134 [bridge]
       Modules linked in: 8021q garp mrp bridge stp llc bonding ipv6 virtio_net
       CPU: 5 PID: 38 Comm: ksoftirqd/5 Kdump: loaded Tainted: G        W         5.13.0-rc3+ #360
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
       RIP: 0010:br_handle_egress_vlan_tunnel+0x10b/0x134 [bridge]
       Code: e8 85 bc 01 e1 45 84 f6 74 90 45 31 f6 85 db 48 c7 c7 a0 02 19 a0 41 0f 94 c6 31 c9 31 d2 44 89 f6 e8 64 bc 01 e1 85 db 75 02 <0f> 0b 31 c9 31 d2 44 89 f6 48 c7 c7 70 02 19 a0 e8 4b bc 01 e1 49
       RSP: 0018:ffff8881003d39e8 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa01902a0
       RBP: ffff8881040c6700 R08: 0000000000000000 R09: 0000000000000001
       R10: 2ce93d0054fe0d00 R11: 54fe0d00000e0000 R12: ffff888109515000
       R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000401
       FS:  0000000000000000(0000) GS:ffff88822bf40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f42ba70f030 CR3: 0000000109926000 CR4: 00000000000006e0
       Call Trace:
        br_handle_vlan+0xbc/0xca [bridge]
        __br_forward+0x23/0x164 [bridge]
        deliver_clone+0x41/0x48 [bridge]
        br_handle_frame_finish+0x36f/0x3aa [bridge]
        ? skb_dst+0x2e/0x38 [bridge]
        ? br_handle_ingress_vlan_tunnel+0x3e/0x1c8 [bridge]
        ? br_handle_frame_finish+0x3aa/0x3aa [bridge]
        br_handle_frame+0x2c3/0x377 [bridge]
        ? __skb_pull+0x33/0x51
        ? vlan_do_receive+0x4f/0x36a
        ? br_handle_frame_finish+0x3aa/0x3aa [bridge]
        __netif_receive_skb_core+0x539/0x7c6
        ? __list_del_entry_valid+0x16e/0x1c2
        __netif_receive_skb_list_core+0x6d/0xd6
        netif_receive_skb_list_internal+0x1d9/0x1fa
        gro_normal_list+0x22/0x3e
        dev_gro_receive+0x55b/0x600
        ? detach_buf_split+0x58/0x140
        napi_gro_receive+0x94/0x12e
        virtnet_poll+0x15d/0x315 [virtio_net]
        __napi_poll+0x2c/0x1c9
        net_rx_action+0xe6/0x1fb
        __do_softirq+0x115/0x2d8
        run_ksoftirqd+0x18/0x20
        smpboot_thread_fn+0x183/0x19c
        ? smpboot_unregister_percpu_thread+0x66/0x66
        kthread+0x10a/0x10f
        ? kthread_mod_delayed_work+0xb6/0xb6
        ret_from_fork+0x22/0x30
       ---[ end trace 49f61b07f775fd2b ]---
       dst_release: dst:00000000c02d677a refcnt:-1
       dst_release underflow
      
      Cc: stable@vger.kernel.org
      Fixes: 11538d03 ("bridge: vlan dst_metadata hooks in ingress and egress paths")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: fix vlan tunnel dst null pointer dereference · 58e20717
      Nikolay Aleksandrov authored
      This patch fixes a tunnel_dst null pointer dereference due to lockless
      access in the tunnel egress path. When deleting a vlan tunnel the
      tunnel_dst pointer is set to NULL without waiting a grace period (i.e.
      while it's still usable) and packets egressing are dereferencing it
      without checking. Use READ/WRITE_ONCE to annotate the lockless use of
      tunnel_id, use RCU for accessing tunnel_dst and make sure it is read
      only once and checked in the egress path. The dst is already properly RCU
      protected so we don't need to do anything fancier than making sure
      tunnel_id and tunnel_dst are read only once and checked in the egress path.
      
      Cc: stable@vger.kernel.org
      Fixes: 11538d03 ("bridge: vlan dst_metadata hooks in ingress and egress paths")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ping: Check return value of function 'ping_queue_rcv_skb' · 9d44fa3e
      Zheng Yongjun authored
      The function 'ping_queue_rcv_skb' does not always succeed; it can also
      return failure. If its return value is not checked, the function
      `ping_rcv` returns success even when queuing failed.
      Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: fix incorrect msg_zerocopy copy notifications · 3bdd5ee0
      Willem de Bruijn authored
      msg_zerocopy signals if a send operation required copying with a flag
      in serr->ee.ee_code.
      
      This field can be incorrect as of the below commit, as a result of
      both structs uarg and serr pointing into the same skb->cb[].
      
      uarg->zerocopy must be read before skb->cb[] is reinitialized to hold
      serr. Similar to other fields len, hi and lo, use a local variable to
      temporarily hold the value.
      
      This was not a problem before, when the value was passed as a function
      argument.
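
The aliasing hazard can be reproduced in a small user-space sketch. The layouts below are hypothetical: `union cb` models two views over one scratch buffer, and `COPIED_FLAG` stands in for the flag reported in ee_code; none of this is the kernel's actual structure layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COPIED_FLAG 1	/* stand-in for the "data was copied" code */

/* Two hypothetical views sharing one control buffer. */
struct uarg_view { uint32_t id; uint16_t len; bool zerocopy; };
struct serr_view { uint8_t ee_code; uint32_t ee_info; uint32_t ee_data; };

union cb {
	struct uarg_view uarg;
	struct serr_view serr;
	uint8_t raw[48];
};

/* Turn the uarg view into a completion notification in place. */
static void build_notification(union cb *cb)
{
	bool is_zerocopy;
	uint32_t lo, hi;

	assert(cb != NULL);
	/* Snapshot every uarg field needed before the buffer is reused. */
	is_zerocopy = cb->uarg.zerocopy;
	lo = cb->uarg.id;
	hi = cb->uarg.id + cb->uarg.len - 1;

	memset(cb, 0, sizeof(*cb));	/* the uarg contents are gone now */
	cb->serr.ee_info = lo;
	cb->serr.ee_data = hi;
	if (!is_zerocopy)
		cb->serr.ee_code |= COPIED_FLAG;
}
```

Reading `cb->uarg.zerocopy` after the `memset()` would always yield false in this model and flag every send as copied, which is why the value is held in a local variable like len, hi and lo.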
      
      Fixes: 75518851 ("skbuff: Push status and refcounts into sock_zerocopy_callback")
      Reported-by: Talal Ahmad <talalahmad@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-fixes-2021-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 388fa7f1
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-fixes-2021-06-09
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Block offload of outer header csum for GRE tunnel · 54e1217b
      Aya Levin authored
      The device is able to offload either the outer header csum or inner
      header csum. The driver utilizes the inner csum offload. So, prohibit
      setting of tx-gre-csum-segmentation and let it be: off[fixed].
      
      Fixes: 27299841 ("net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Block offload of outer header csum for UDP tunnels · 6d6727dd
      Aya Levin authored
      The device is able to offload either the outer header csum or inner
      header csum. The driver utilizes the inner csum offload. Hence, block
      setting of tx-udp_tnl-csum-segmentation and set it to off[fixed].
      
      Fixes: b49663c8 ("net/mlx5e: Add support for UDP tunnel segmentation with outer checksum offload")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • Revert "net/mlx5: Arm only EQs with EQEs" · 7a545077
      Shay Drory authored
      In the scenario described below, an EQ can remain in FIRED state which
      can result in missing an interrupt generation.
      
      The scenario:
      
      device                       mlx5_core driver
      ------                       ----------------
      EQ1.eqe generated
      EQ1.MSI-X sent
      EQ1.state = FIRED
      EQ2.eqe generated
                                   mlx5_irq()
                                     polls - eq1_eqes()
                                     arm eq1
                                     polls - eq2_eqes()
                                     arm eq2
      EQ2.MSI-X sent
      EQ2.state = FIRED
                                    mlx5_irq()
                                    polls - eq2_eqes() -- no eqes found
                                    driver skips EQ arming;
      
      ->EQ2 remains fired, misses generating interrupt.
      
      Hence, always arm the EQ by reverting the cited commit in fixes tag.
      
      Fixes: d894892d ("net/mlx5: Arm only EQs with EQEs")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Fix select queue to consider SKBTX_HW_TSTAMP · a6ee6f5f
      Aya Levin authored
      Steering packets to PTP-SQ should be done only if the SKB has
      SKBTX_HW_TSTAMP set in the tx_flags. While here, take the function into
      a header and inline it.
      Set the whole condition to select the PTP-SQ to unlikely.
      
      Fixes: 24c22dd0 ("net/mlx5e: Add states to PTP channel")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Don't update netdev RQs with PTP-RQ · 9ae8c18c
      Aya Levin authored
      Since the driver opens the PTP-RQ under channel 0, it appears to the
      stack as if the SKB was received on rxq0. So from the stack POV there
      is still the same number of RX queues.
      
      Fixes: 960fbfe2 ("net/mlx5e: Allow coexistence of CQE compression and HW TS PTP")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Verify dev is present in get devlink port ndo · 11f5ac3e
      Chris Mi authored
      When changing eswitch mode, the netdev is detached from the
      hardware resources. So verify dev is present in get devlink
      port ndo. Otherwise, we will hit the following panic:
      
      [241535.973539] RIP: 0010:__devlink_port_phys_port_name_get+0x13/0x1b0
      [241535.976471] RSP: 0018:ffff9eaf0ae1b7c8 EFLAGS: 00010292
      [241535.977471] RAX: 000000000002d370 RBX: 000000000002d370 RCX: 0000000000000000
      [241535.978479] RDX: 0000000000000010 RSI: ffff9eaf0ae1b858 RDI: 000000000002d370
      [241535.979482] RBP: ffff9eaf0ae1b7e0 R08: 000000000000002a R09: ffff8888d54d13da
      [241535.980486] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8888e6700000
      [241535.981491] R13: ffff9eaf0ae1b858 R14: 0000000000000010 R15: 0000000000000000
      [241535.982489] FS:  00007fd374ef3740(0000) GS:ffff88909ea00000(0000) knlGS:0000000000000000
      [241535.983494] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [241535.984487] CR2: 000000000002d444 CR3: 000000089fd26006 CR4: 00000000003706e0
      [241535.985502] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [241535.986499] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [241535.987477] Call Trace:
      [241535.988426]  ? nla_put_64bit+0x71/0xa0
      [241535.989368]  devlink_compat_phys_port_name_get+0x50/0xa0
      [241535.990312]  dev_get_phys_port_name+0x4b/0x60
      [241535.991252]  rtnl_fill_ifinfo+0x57b/0xcb0
      [241535.992192]  rtnl_dump_ifinfo+0x58f/0x6d0
      [241535.993123]  ? ksize+0x14/0x20
      [241535.994033]  ? __alloc_skb+0xe8/0x250
      [241535.994935]  netlink_dump+0x17c/0x300
      [241535.995821]  netlink_recvmsg+0x1de/0x2c0
      [241535.996677]  sock_recvmsg+0x70/0x80
      [241535.997518]  ____sys_recvmsg+0x9b/0x1b0
      [241535.998360]  ? iovec_from_user+0x82/0x120
      [241535.999202]  ? __import_iovec+0x2c/0x130
      [241536.000031]  ___sys_recvmsg+0x94/0x130
      [241536.000850]  ? __handle_mm_fault+0x56d/0x6e0
      [241536.001668]  __sys_recvmsg+0x5f/0xb0
      [241536.002464]  ? syscall_enter_from_user_mode+0x2b/0x80
      [241536.003242]  __x64_sys_recvmsg+0x1f/0x30
      [241536.004008]  do_syscall_64+0x38/0x50
      [241536.004767]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [241536.005532] RIP: 0033:0x7fd375014f47
      
      Fixes: 2ff349c5 ("net/mlx5e: Verify dev is present in some ndos")
      Signed-off-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Chris Mi <cmi@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      11f5ac3e
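The presence check described in this commit can be illustrated with a minimal user-space sketch. It is not the mlx5 code: `struct fake_netdev`, `struct fake_port`, and `get_port_checked()` are hypothetical stand-ins, and the `present` field only models the state the kernel tracks for a detached netdev.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the mlx5 types. */
struct fake_port {
    int index;
};

struct fake_netdev {
    bool present;            /* cleared while the device is detached */
    struct fake_port *port;  /* only safe to hand out while present */
};

/* Return the port, or NULL while the device is detached from hardware. */
struct fake_port *get_port_checked(const struct fake_netdev *dev)
{
    if (!dev->present)
        return NULL;
    return dev->port;
}
```

Callers in this sketch must treat a NULL return as "no port available right now" rather than dereferencing it.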
    • Maor Gottlieb's avatar
      net/mlx5: DR, Don't use SW steering when RoCE is not supported · 4aaf96ac
      Maor Gottlieb authored
      SW steering uses an RC QP to read from and write to ICM, so it
      cannot be supported when RoCE is not supported.
      
      Fixes: 70605ea5 ("net/mlx5: DR, Expose APIs for direct rule managing")
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Reviewed-by: Alex Vesker <valex@nvidia.com>
      Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      4aaf96ac
    • Maor Gottlieb's avatar
      net/mlx5: Consider RoCE cap before init RDMA resources · c189716b
      Maor Gottlieb authored
      Check whether RoCE is supported by the device before enabling it in
      the vport context and creating all the RDMA steering objects.
      
      Fixes: 80f09dfc ("net/mlx5: Eswitch, enable RoCE loopback traffic")
      Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      c189716b
    • Dima Chumak's avatar
      net/mlx5e: Fix page reclaim for dead peer hairpin · a3e5fd93
      Dima Chumak authored
      When adding a hairpin flow, a firmware-side send queue is created for
      the peer net device, which claims some host memory pages for its
      internal ring buffer. If the peer net device is removed/unbound before
      the hairpin flow is deleted, then the send queue is not destroyed which
      leads to a stack trace on pci device remove:
      
      [ 748.005230] mlx5_core 0000:08:00.2: wait_func:1094:(pid 12985): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
      [ 748.005231] mlx5_core 0000:08:00.2: reclaim_pages:514:(pid 12985): failed reclaiming pages: err -110
      [ 748.001835] mlx5_core 0000:08:00.2: mlx5_reclaim_root_pages:653:(pid 12985): failed reclaiming pages (-110) for func id 0x0
      [ 748.002171] ------------[ cut here ]------------
      [ 748.001177] FW pages counter is 4 after reclaiming all pages
      [ 748.001186] WARNING: CPU: 1 PID: 12985 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:685 mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core]
      [  +0.002771] Modules linked in: cls_flower mlx5_ib mlx5_core ptp pps_core act_mirred sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay fuse [last unloaded: pps_core]
      [ 748.007225] CPU: 1 PID: 12985 Comm: tee Not tainted 5.12.0+ #1
      [ 748.001376] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [ 748.002315] RIP: 0010:mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core]
      [ 748.001679] Code: 28 00 00 00 0f 85 22 01 00 00 48 81 c4 b0 00 00 00 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 40 cc 19 a1 e8 9f 71 0e e2 <0f> 0b e9 30 ff ff ff 48 c7 c7 a0 cc 19 a1 e8 8c 71 0e e2 0f 0b e9
      [ 748.003781] RSP: 0018:ffff88815220faf8 EFLAGS: 00010286
      [ 748.001149] RAX: 0000000000000000 RBX: ffff8881b4900280 RCX: 0000000000000000
      [ 748.001445] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102a441f51
      [ 748.001614] RBP: 00000000000032b9 R08: 0000000000000001 R09: ffffed1054a15ee8
      [ 748.001446] R10: ffff8882a50af73b R11: ffffed1054a15ee7 R12: fffffbfff07c1e30
      [ 748.001447] R13: dffffc0000000000 R14: ffff8881b492cba8 R15: 0000000000000000
      [ 748.001429] FS:  00007f58bd08b580(0000) GS:ffff8882a5080000(0000) knlGS:0000000000000000
      [ 748.001695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 748.001309] CR2: 000055a026351740 CR3: 00000001d3b48006 CR4: 0000000000370ea0
      [ 748.001506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 748.001483] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 748.001654] Call Trace:
      [ 748.000576]  ? mlx5_satisfy_startup_pages+0x290/0x290 [mlx5_core]
      [ 748.001416]  ? mlx5_cmd_teardown_hca+0xa2/0xd0 [mlx5_core]
      [ 748.001354]  ? mlx5_cmd_init_hca+0x280/0x280 [mlx5_core]
      [ 748.001203]  mlx5_function_teardown+0x30/0x60 [mlx5_core]
      [ 748.001275]  mlx5_uninit_one+0xa7/0xc0 [mlx5_core]
      [ 748.001200]  remove_one+0x5f/0xc0 [mlx5_core]
      [ 748.001075]  pci_device_remove+0x9f/0x1d0
      [ 748.000833]  device_release_driver_internal+0x1e0/0x490
      [ 748.001207]  unbind_store+0x19f/0x200
      [ 748.000942]  ? sysfs_file_ops+0x170/0x170
      [ 748.001000]  kernfs_fop_write_iter+0x2bc/0x450
      [ 748.000970]  new_sync_write+0x373/0x610
      [ 748.001124]  ? new_sync_read+0x600/0x600
      [ 748.001057]  ? lock_acquire+0x4d6/0x700
      [ 748.000908]  ? lockdep_hardirqs_on_prepare+0x400/0x400
      [ 748.001126]  ? fd_install+0x1c9/0x4d0
      [ 748.000951]  vfs_write+0x4d0/0x800
      [ 748.000804]  ksys_write+0xf9/0x1d0
      [ 748.000868]  ? __x64_sys_read+0xb0/0xb0
      [ 748.000811]  ? filp_open+0x50/0x50
      [ 748.000919]  ? syscall_enter_from_user_mode+0x1d/0x50
      [ 748.001223]  do_syscall_64+0x3f/0x80
      [ 748.000892]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 748.001026] RIP: 0033:0x7f58bcfb22f7
      [ 748.000944] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      [ 748.003925] RSP: 002b:00007fffd7f2aaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [ 748.001732] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f58bcfb22f7
      [ 748.001426] RDX: 000000000000000d RSI: 00007fffd7f2abc0 RDI: 0000000000000003
      [ 748.001746] RBP: 00007fffd7f2abc0 R08: 0000000000000000 R09: 0000000000000001
      [ 748.001631] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000000d
      [ 748.001537] R13: 00005597ac2c24a0 R14: 000000000000000d R15: 00007f58bd084700
      [ 748.001564] irq event stamp: 0
      [ 748.000787] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [ 748.001399] hardirqs last disabled at (0): [<ffffffff813132cf>] copy_process+0x146f/0x5eb0
      [ 748.001854] softirqs last  enabled at (0): [<ffffffff8131330e>] copy_process+0x14ae/0x5eb0
      [ 748.013431] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [ 748.001492] ---[ end trace a6fabd773d1c51ae ]---
      
      Fix by destroying the send queue of a hairpin peer net device that is
      being removed/unbound, which returns the allocated ring buffer pages to
      the host.
      
      Fixes: 4d8fcf21 ("net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rules")
      Signed-off-by: Dima Chumak <dchumak@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      a3e5fd93
    • Huy Nguyen's avatar
      net/mlx5e: Remove dependency in IPsec initialization flows · 8ad893e5
      Huy Nguyen authored
      Currently, the IPsec feature is disabled because mlx5e_build_nic_netdev
      is required to be called after mlx5e_ipsec_init. This requirement is
      invalid, as mlx5e_build_nic_netdev and mlx5e_ipsec_init initialize
      independent resources.
      
      Remove the ipsec pointer check in mlx5e_build_nic_netdev so that the
      two functions can be called in any order.
      
      Fixes: 547eede0 ("net/mlx5e: IPSec, Innova IPSec offload infrastructure")
      Signed-off-by: Huy Nguyen <huyn@nvidia.com>
      Reviewed-by: Raed Salem <raeds@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      8ad893e5
    • Vlad Buslov's avatar
      net/mlx5e: Fix use-after-free of encap entry in neigh update handler · fb1a3132
      Vlad Buslov authored
      Function mlx5e_rep_neigh_update() wasn't updated to accommodate rtnl lock
      removal from the TC filter update path and to properly handle concurrent
      encap entry insertion/deletion, which can lead to the following
      use-after-free:
      
       [23827.464923] ==================================================================
       [23827.469446] BUG: KASAN: use-after-free in mlx5e_encap_take+0x72/0x140 [mlx5_core]
       [23827.470971] Read of size 4 at addr ffff8881d132228c by task kworker/u20:6/21635
       [23827.472251]
       [23827.472615] CPU: 9 PID: 21635 Comm: kworker/u20:6 Not tainted 5.13.0-rc3+ #5
       [23827.473788] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       [23827.475639] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
       [23827.476731] Call Trace:
       [23827.477260]  dump_stack+0xbb/0x107
       [23827.477906]  print_address_description.constprop.0+0x18/0x140
       [23827.478896]  ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
       [23827.479879]  ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
       [23827.480905]  kasan_report.cold+0x7c/0xd8
       [23827.481701]  ? mlx5e_encap_take+0x72/0x140 [mlx5_core]
       [23827.482744]  kasan_check_range+0x145/0x1a0
       [23827.493112]  mlx5e_encap_take+0x72/0x140 [mlx5_core]
       [23827.494054]  ? mlx5e_tc_tun_encap_info_equal_generic+0x140/0x140 [mlx5_core]
       [23827.495296]  mlx5e_rep_neigh_update+0x41e/0x5e0 [mlx5_core]
       [23827.496338]  ? mlx5e_rep_neigh_entry_release+0xb80/0xb80 [mlx5_core]
       [23827.497486]  ? read_word_at_a_time+0xe/0x20
       [23827.498250]  ? strscpy+0xa0/0x2a0
       [23827.498889]  process_one_work+0x8ac/0x14e0
       [23827.499638]  ? lockdep_hardirqs_on_prepare+0x400/0x400
       [23827.500537]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
       [23827.501359]  ? rwlock_bug.part.0+0x90/0x90
       [23827.502116]  worker_thread+0x53b/0x1220
       [23827.502831]  ? process_one_work+0x14e0/0x14e0
       [23827.503627]  kthread+0x328/0x3f0
       [23827.504254]  ? _raw_spin_unlock_irq+0x24/0x40
       [23827.505065]  ? __kthread_bind_mask+0x90/0x90
       [23827.505912]  ret_from_fork+0x1f/0x30
       [23827.506621]
       [23827.506987] Allocated by task 28248:
       [23827.507694]  kasan_save_stack+0x1b/0x40
       [23827.508476]  __kasan_kmalloc+0x7c/0x90
       [23827.509197]  mlx5e_attach_encap+0xde1/0x1d40 [mlx5_core]
       [23827.510194]  mlx5e_tc_add_fdb_flow+0x397/0xc40 [mlx5_core]
       [23827.511218]  __mlx5e_add_fdb_flow+0x519/0xb30 [mlx5_core]
       [23827.512234]  mlx5e_configure_flower+0x191c/0x4870 [mlx5_core]
       [23827.513298]  tc_setup_cb_add+0x1d5/0x420
       [23827.514023]  fl_hw_replace_filter+0x382/0x6a0 [cls_flower]
       [23827.514975]  fl_change+0x2ceb/0x4a51 [cls_flower]
       [23827.515821]  tc_new_tfilter+0x89a/0x2070
       [23827.516548]  rtnetlink_rcv_msg+0x644/0x8c0
       [23827.517300]  netlink_rcv_skb+0x11d/0x340
       [23827.518021]  netlink_unicast+0x42b/0x700
       [23827.518742]  netlink_sendmsg+0x743/0xc20
       [23827.519467]  sock_sendmsg+0xb2/0xe0
       [23827.520131]  ____sys_sendmsg+0x590/0x770
       [23827.520851]  ___sys_sendmsg+0xd8/0x160
       [23827.521552]  __sys_sendmsg+0xb7/0x140
       [23827.522238]  do_syscall_64+0x3a/0x70
       [23827.522907]  entry_SYSCALL_64_after_hwframe+0x44/0xae
       [23827.523797]
       [23827.524163] Freed by task 25948:
       [23827.524780]  kasan_save_stack+0x1b/0x40
       [23827.525488]  kasan_set_track+0x1c/0x30
       [23827.526187]  kasan_set_free_info+0x20/0x30
       [23827.526968]  __kasan_slab_free+0xed/0x130
       [23827.527709]  slab_free_freelist_hook+0xcf/0x1d0
       [23827.528528]  kmem_cache_free_bulk+0x33a/0x6e0
       [23827.529317]  kfree_rcu_work+0x55f/0xb70
       [23827.530024]  process_one_work+0x8ac/0x14e0
       [23827.530770]  worker_thread+0x53b/0x1220
       [23827.531480]  kthread+0x328/0x3f0
       [23827.532114]  ret_from_fork+0x1f/0x30
       [23827.532785]
       [23827.533147] Last potentially related work creation:
       [23827.534007]  kasan_save_stack+0x1b/0x40
       [23827.534710]  kasan_record_aux_stack+0xab/0xc0
       [23827.535492]  kvfree_call_rcu+0x31/0x7b0
       [23827.536206]  mlx5e_tc_del_fdb_flow+0x577/0xef0 [mlx5_core]
       [23827.537305]  mlx5e_flow_put+0x49/0x80 [mlx5_core]
       [23827.538290]  mlx5e_delete_flower+0x6d1/0xe60 [mlx5_core]
       [23827.539300]  tc_setup_cb_destroy+0x18e/0x2f0
       [23827.540144]  fl_hw_destroy_filter+0x1d2/0x310 [cls_flower]
       [23827.541148]  __fl_delete+0x4dc/0x660 [cls_flower]
       [23827.541985]  fl_delete+0x97/0x160 [cls_flower]
       [23827.542782]  tc_del_tfilter+0x7ab/0x13d0
       [23827.543503]  rtnetlink_rcv_msg+0x644/0x8c0
       [23827.544257]  netlink_rcv_skb+0x11d/0x340
       [23827.544981]  netlink_unicast+0x42b/0x700
       [23827.545700]  netlink_sendmsg+0x743/0xc20
       [23827.546424]  sock_sendmsg+0xb2/0xe0
       [23827.547084]  ____sys_sendmsg+0x590/0x770
       [23827.547850]  ___sys_sendmsg+0xd8/0x160
       [23827.548606]  __sys_sendmsg+0xb7/0x140
       [23827.549303]  do_syscall_64+0x3a/0x70
       [23827.549969]  entry_SYSCALL_64_after_hwframe+0x44/0xae
       [23827.550853]
       [23827.551217] The buggy address belongs to the object at ffff8881d1322200
       [23827.551217]  which belongs to the cache kmalloc-256 of size 256
       [23827.553341] The buggy address is located 140 bytes inside of
       [23827.553341]  256-byte region [ffff8881d1322200, ffff8881d1322300)
       [23827.555747] The buggy address belongs to the page:
       [23827.556847] page:00000000898762aa refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1d1320
       [23827.558651] head:00000000898762aa order:2 compound_mapcount:0 compound_pincount:0
       [23827.559961] flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
       [23827.561243] raw: 002ffff800010200 dead000000000100 dead000000000122 ffff888100042b40
       [23827.562653] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
       [23827.564112] page dumped because: kasan: bad access detected
       [23827.565439]
       [23827.565932] Memory state around the buggy address:
       [23827.566917]  ffff8881d1322180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       [23827.568485]  ffff8881d1322200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       [23827.569818] >ffff8881d1322280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       [23827.571143]                       ^
       [23827.571879]  ffff8881d1322300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       [23827.573283]  ffff8881d1322380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       [23827.574654] ==================================================================
      
      Most of the necessary logic is already correctly implemented by
      mlx5e_get_next_valid_encap() helper that is used in neigh stats update
      handler. Make the handler generic by renaming it to
      mlx5e_get_next_matching_encap() and use callback to test whether flow is
      matching instead of hardcoded check for 'valid' flag value. Implement
      mlx5e_get_next_valid_encap() by calling mlx5e_get_next_matching_encap()
      with callback that tests encap MLX5_ENCAP_ENTRY_VALID flag. Implement new
      mlx5e_get_next_init_encap() helper by calling
      mlx5e_get_next_matching_encap() with callback that tests encap completion
      result to be non-error and use it in mlx5e_rep_neigh_update() to safely
      iterate over nhe->encap_list.
      
      Remove encap completion logic from mlx5e_rep_update_flows() since the encap
      entries passed to this function are already guaranteed to be properly
      initialized by similar code in mlx5e_get_next_init_encap().
      
      Fixes: 2a1f1768 ("net/mlx5e: Refactor neigh update for concurrent execution")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      fb1a3132
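The refactoring described in this commit (one generic "next matching" iterator, with thin wrappers that differ only in the callback) can be sketched in plain C. This is only an illustration under simplifying assumptions: `struct entry`, `next_matching()`, `next_valid()`, and `next_init()` are hypothetical names, the list is a flat array, and the reference counting and RCU protection that the real helpers need are left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified entry; the real encap entry carries much more. */
struct entry {
    bool valid;        /* models the MLX5_ENCAP_ENTRY_VALID flag */
    int  init_result;  /* models the completion result: negative means error */
};

/* Callback type that decides whether an entry matches. */
typedef bool (*match_fn)(const struct entry *e);

/* Return the index of the first match at or after start, or n if none. */
size_t next_matching(const struct entry *list, size_t n, size_t start,
                     match_fn match)
{
    for (size_t i = start; i < n; i++)
        if (match(&list[i]))
            return i;
    return n;
}

static bool entry_is_valid(const struct entry *e)
{
    return e->valid;
}

static bool entry_is_init(const struct entry *e)
{
    return e->init_result >= 0;
}

/* Wrapper that only accepts entries with the valid flag set. */
size_t next_valid(const struct entry *list, size_t n, size_t start)
{
    return next_matching(list, n, start, entry_is_valid);
}

/* Wrapper that accepts any entry whose initialization did not fail. */
size_t next_init(const struct entry *list, size_t n, size_t start)
{
    return next_matching(list, n, start, entry_is_init);
}
```

Keeping the traversal in one place means a fix to the iteration logic applies to every wrapper at once, which is the design point the commit message makes.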
    • Yang Li's avatar
      net/mlx5e: Fix an error code in mlx5e_arfs_create_tables() · 2bf8d2ae
      Yang Li authored
      When the code executes the 'if (!priv->fs.arfs->wq)' branch, the value
      of err is 0. Set err to -ENOMEM to indicate that
      create_singlethread_workqueue() returned NULL.
      
      Clean up smatch warning:
      drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c:373
      mlx5e_arfs_create_tables() warn: missing error code 'err'.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Fixes: f6755b80 ("net/mlx5e: Dynamic alloc arfs table for netdev when needed")
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      2bf8d2ae
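The "missing error code" pattern that smatch flags here can be reproduced in a small user-space sketch. The names `struct fake_arfs`, `create_tables()`, `alloc_ok()`, and `alloc_fail()` are hypothetical and only stand in for the driver's setup function and workqueue allocation.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the aRFS state; not the mlx5 structure. */
struct fake_arfs {
    void *wq;
};

/* Hypothetical allocator hook so the failure path can be exercised. */
typedef void *(*alloc_fn)(void);

int create_tables(struct fake_arfs *arfs, alloc_fn alloc_wq)
{
    int err = 0;

    arfs->wq = alloc_wq();
    if (!arfs->wq) {
        /* Without this assignment the function would return 0 on failure. */
        err = -ENOMEM;
        goto out;
    }
    /* ... further setup steps would go here ... */
out:
    return err;
}

void *alloc_ok(void)
{
    return malloc(1);
}

void *alloc_fail(void)
{
    return NULL;
}
```

In this sketch a caller that checks only the return value now sees the allocation failure instead of a false success.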
  2. 09 Jun, 2021 8 commits
    • David S. Miller's avatar
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 6cde05ab
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2021-06-09
      
      This series contains updates to ice driver only.
      
      Maciej informs the user when XDP is not supported due to the driver
      being in the 'safe mode' state. He also adds a parameter to Tx queue
      configuration to resolve an issue in configuring XDP queues as it cannot
      rely on using the number of Tx or Rx queues.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6cde05ab
    • Marcelo Ricardo Leitner's avatar
      net/sched: act_ct: handle DNAT tuple collision · 13c62f53
      Marcelo Ricardo Leitner authored
      This is the counterpart of 8aa7b526 ("openvswitch: handle DNAT
      tuple collision") for act_ct. From that commit changelog:
      
      """
      With multiple DNAT rules it's possible that after destination
      translation the resulting tuples collide.
      
      ...
      
      Netfilter handles this case by allocating a null binding for SNAT at
      egress by default.  Perform the same operation in openvswitch for DNAT
      if no explicit SNAT is requested by the user and allocate a null binding
      for SNAT for packets in the "original" direction.
      """
      
      Fixes: 95219afb ("act_ct: support asymmetric conntrack")
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13c62f53
    • Ido Schimmel's avatar
      rtnetlink: Fix regression in bridge VLAN configuration · d2e381c4
      Ido Schimmel authored
      Cited commit started returning errors when notification info is not
      filled by the bridge driver, resulting in the following regression:
      
       # ip link add name br1 type bridge vlan_filtering 1
       # bridge vlan add dev br1 vid 555 self pvid untagged
       RTNETLINK answers: Invalid argument
      
      As long as the bridge driver does not fill notification info for the
      bridge device itself, an empty notification should not be considered as
      an error. This is explained in commit 59ccaaaa ("bridge: dont send
      notification when skb->len == 0 in rtnl_bridge_notify").
      
      Fix by removing the error and adding a comment to avoid future bugs.
      
      Fixes: a8db57c1 ("rtnetlink: Fix missing error code in rtnl_bridge_notify()")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d2e381c4
    • David S. Miller's avatar
      Merge tag 'mac80211-for-net-2021-06-09' of... · 93124d4a
      David S. Miller authored
      Merge tag 'mac80211-for-net-2021-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
      
      Johannes Berg says:
      
      ====================
      A fair number of fixes:
       * fix more fallout from RTNL locking changes
       * fixes for some of the bugs found by syzbot
       * drop multicast fragments in mac80211 to align
         with the spec and what drivers are doing now
       * fix NULL-ptr deref in radiotap injection
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93124d4a
    • Paolo Abeni's avatar
      udp: fix race between close() and udp_abort() · a8b897c7
      Paolo Abeni authored
      Kaustubh reported and diagnosed a panic in udp_lib_lookup().
      The root cause is udp_abort() racing with close(). Both
      racing functions acquire the socket lock, but udp{v6}_destroy_sock()
      releases it before performing destructive actions.
      
      We can't easily extend the socket lock scope to avoid the race.
      Instead, use the SOCK_DEAD flag to prevent udp_abort() from taking
      any action when the critical race happens.
      Diagnosed-and-tested-by: Kaustubh Pandey <kapandey@codeaurora.org>
      Fixes: 5d77dca8 ("net: diag: support SOCK_DESTROY for UDP sockets")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a8b897c7
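The idea of using a "dead" flag so that the abort path becomes a no-op once close() has started teardown can be modelled in a few lines. This is a single-threaded sketch with hypothetical names (`struct fake_sock`, `fake_destroy()`, `fake_abort()`); it does not reproduce the socket lock, and the `dead` field only stands in for SOCK_DEAD.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of the two racing paths. */
struct fake_sock {
    bool dead;    /* models SOCK_DEAD */
    bool hashed;  /* models presence in the lookup table */
    int  err;     /* error reported to the socket owner */
};

/* close() path: mark the socket dead before destructive teardown. */
void fake_destroy(struct fake_sock *sk)
{
    sk->dead = true;
    sk->hashed = false;
}

/* abort path: returns true if it acted, false if it backed off. */
bool fake_abort(struct fake_sock *sk, int err)
{
    if (sk->dead)
        return false;  /* teardown already started; touch nothing */
    sk->err = err;
    sk->hashed = false;
    return true;
}
```

The flag check only helps if both paths read and write it under the same lock, which this sketch assumes rather than implements.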
    • Eric Dumazet's avatar
      inet: annotate data race in inet_send_prepare() and inet_dgram_connect() · dcd01eea
      Eric Dumazet authored
      Both functions are known to be racy when reading inet_num,
      as we do not want to grab locks for the common case where the socket
      has already been bound. The race is resolved in inet_autobind()
      by reading inet_num again under the socket lock.
      
      syzbot reported:
      BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port
      
      write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0:
       udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308
       udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89
       inet_autobind net/ipv4/af_inet.c:183 [inline]
       inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807
       inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1:
       inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806
       inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000 -> 0x9db4
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dcd01eea
    • Austin Kim's avatar
      net: ethtool: clear heap allocations for ethtool function · 80ec82e3
      Austin Kim authored
      Several ethtool functions allocate heap buffers that drivers may
      leave partly unfilled. The unused portion of the buffer then stays
      uninitialized, and the full contents might be copied back to userspace.
      Signed-off-by: Austin Kim <austindh.kim@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      80ec82e3
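The risk described in this commit, and the effect of zeroing the allocation, can be shown with a user-space analogue. The names `partial_fill()` and `get_dump()` are hypothetical; `calloc()` here plays the role that a zeroing allocator plays in the kernel.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver callback that fills only the first half of the buffer. */
static void partial_fill(unsigned char *buf, size_t len)
{
    memset(buf, 0xAB, len / 2);
}

/*
 * Allocate a zeroed buffer and let the "driver" fill it.
 * Returns NULL on allocation failure; the caller frees the result.
 */
unsigned char *get_dump(size_t len)
{
    unsigned char *buf = calloc(1, len);

    if (!buf)
        return NULL;
    partial_fill(buf, len);
    return buf;
}
```

Because the buffer starts zeroed, the bytes the callback never writes are zero rather than stale heap contents when the whole buffer is handed back.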
    • Maciej Fijalkowski's avatar
      ice: parameterize functions responsible for Tx ring management · 2e84f6b3
      Maciej Fijalkowski authored
      Commit ae15e0ba ("ice: Change number of XDP Tx queues to match
      number of Rx queues") tried to address the incorrect setting of XDP
      queue count that was based on the Tx queue count, whereas in theory we
      should provide the XDP queue per Rx queue. However, the routines that
      set up and destroy the set of Tx resources are still based on
      vsi->num_txq.
      
      Ice supports asymmetric Tx/Rx queue counts, so for a setup where
      vsi->num_txq > vsi->num_rxq, ice_vsi_stop_tx_rings and ice_vsi_cfg_txqs
      will access vsi->xdp_rings out of bounds.
      
      Parameterize the two mentioned functions so they get the size of the
      Tx resources array as input.
      
      Fixes: ae15e0ba ("ice: Change number of XDP Tx queues to match number of Rx queues")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      2e84f6b3
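The out-of-bounds risk and the parameterization that avoids it can be sketched as follows. The types `struct fake_ring` and `struct fake_vsi` and the function `stop_rings()` are hypothetical simplifications, not the ice driver's definitions; the array sizes are chosen only to model a setup with more Tx queues than Rx queues.

```c
#include <stddef.h>

struct fake_ring {
    int stopped;
};

/* Hypothetical VSI with more Tx queues than Rx queues. */
struct fake_vsi {
    size_t num_txq;
    size_t num_rxq;
    struct fake_ring tx_rings[4];
    struct fake_ring xdp_rings[2];  /* one XDP ring per Rx queue */
};

/*
 * Stop count rings. The caller passes the length of the array it hands in,
 * so the loop never depends on a count that belongs to a different array.
 */
size_t stop_rings(struct fake_ring *rings, size_t count)
{
    for (size_t i = 0; i < count; i++)
        rings[i].stopped = 1;
    return count;
}
```

In this sketch, calling `stop_rings(vsi.xdp_rings, vsi.num_txq)` would walk past the two-element array, while passing `vsi.num_rxq` keeps the loop within bounds.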