1. 10 Oct, 2024 24 commits
    • slip: make slhc_remember() more robust against malicious packets · 7d3fce8c
      Eric Dumazet authored
      syzbot found that slhc_remember() was missing checks against
      malicious packets [1].
      
      slhc_remember() only checked the size of the packet was at least 20,
      which is not good enough.
      
      We need to make sure the packet includes the IPv4 and TCP header
      that are supposed to be carried.
      
      Add iph and th pointers to make the code more readable.
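
The check described above can be sketched in user space as follows. This is only an illustration, not the kernel patch: the helper name ip_tcp_header_len() and its return convention are assumptions made for this example. It confirms that the buffer is long enough for the IPv4 header and for the TCP header that the length fields announce.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space helper (not the kernel code): returns the combined
 * IPv4 + TCP header length if buf really holds both headers, or -1 if the
 * packet is too short or a header length field is inconsistent.
 */
int ip_tcp_header_len(const uint8_t *buf, size_t len)
{
    size_t ihl, thl;

    if (buf == NULL || len < 20)              /* minimal IPv4 header */
        return -1;

    ihl = (size_t)(buf[0] & 0x0f) * 4;        /* IPv4 header length in bytes */
    if (ihl < 20 || len < ihl + 20)           /* room for a minimal TCP header */
        return -1;

    thl = (size_t)(buf[ihl + 12] >> 4) * 4;   /* TCP data offset in bytes */
    if (thl < 20 || len < ihl + thl)
        return -1;

    return (int)(ihl + thl);
}
```

Checking only len >= 20 would accept a packet whose IPv4 header claims options that are not present, which is the kind of input the report describes.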
      
      [1]
      
      BUG: KMSAN: uninit-value in slhc_remember+0x2e8/0x7b0 drivers/net/slip/slhc.c:666
        slhc_remember+0x2e8/0x7b0 drivers/net/slip/slhc.c:666
        ppp_receive_nonmp_frame+0xe45/0x35e0 drivers/net/ppp/ppp_generic.c:2455
        ppp_receive_frame drivers/net/ppp/ppp_generic.c:2372 [inline]
        ppp_do_recv+0x65f/0x40d0 drivers/net/ppp/ppp_generic.c:2212
        ppp_input+0x7dc/0xe60 drivers/net/ppp/ppp_generic.c:2327
        pppoe_rcv_core+0x1d3/0x720 drivers/net/ppp/pppoe.c:379
        sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1113
        __release_sock+0x1da/0x330 net/core/sock.c:3072
        release_sock+0x6b/0x250 net/core/sock.c:3626
        pppoe_sendmsg+0x2b8/0xb90 drivers/net/ppp/pppoe.c:903
        sock_sendmsg_nosec net/socket.c:729 [inline]
        __sock_sendmsg+0x30f/0x380 net/socket.c:744
        ____sys_sendmsg+0x903/0xb60 net/socket.c:2602
        ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656
        __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742
        __do_sys_sendmmsg net/socket.c:2771 [inline]
        __se_sys_sendmmsg net/socket.c:2768 [inline]
        __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768
        x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      Uninit was created at:
        slab_post_alloc_hook mm/slub.c:4091 [inline]
        slab_alloc_node mm/slub.c:4134 [inline]
        kmem_cache_alloc_node_noprof+0x6bf/0xb80 mm/slub.c:4186
        kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:587
        __alloc_skb+0x363/0x7b0 net/core/skbuff.c:678
        alloc_skb include/linux/skbuff.h:1322 [inline]
        sock_wmalloc+0xfe/0x1a0 net/core/sock.c:2732
        pppoe_sendmsg+0x3a7/0xb90 drivers/net/ppp/pppoe.c:867
        sock_sendmsg_nosec net/socket.c:729 [inline]
        __sock_sendmsg+0x30f/0x380 net/socket.c:744
        ____sys_sendmsg+0x903/0xb60 net/socket.c:2602
        ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656
        __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742
        __do_sys_sendmmsg net/socket.c:2771 [inline]
        __se_sys_sendmmsg net/socket.c:2768 [inline]
        __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768
        x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      CPU: 0 UID: 0 PID: 5460 Comm: syz.2.33 Not tainted 6.12.0-rc2-syzkaller-00006-g87d6aab2 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
      
      Fixes: b5451d78 ("slip: Move the SLIP drivers")
      Reported-by: syzbot+2ada1bc857496353be5a@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/netdev/670646db.050a0220.3f80e.0027.GAE@google.com/T/#u
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://patch.msgid.link/20241009091132.2136321-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC · 6fd27ea1
      D. Wythe authored
      Eric reported a panic on IPPROTO_SMC and pointed out that when
      INET_PROTOSW_ICSK is set, icsk->icsk_sync_mss must be set too.
      
      Bug: Unable to handle kernel NULL pointer dereference at virtual address
      0000000000000000
      Mem abort info:
      ESR = 0x0000000086000005
      EC = 0x21: IABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
      FSC = 0x05: level 1 translation fault
      user pgtable: 4k pages, 48-bit VAs, pgdp=00000001195d1000
      [0000000000000000] pgd=0800000109c46003, p4d=0800000109c46003,
      pud=0000000000000000
      Internal error: Oops: 0000000086000005 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 1 UID: 0 PID: 8037 Comm: syz.3.265 Not tainted
      6.11.0-rc7-syzkaller-g5f5673607153 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS Google 08/06/2024
      pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      pc : 0x0
      lr : cipso_v4_sock_setattr+0x2a8/0x3c0 net/ipv4/cipso_ipv4.c:1910
      sp : ffff80009b887a90
      x29: ffff80009b887aa0 x28: ffff80008db94050 x27: 0000000000000000
      x26: 1fffe0001aa6f5b3 x25: dfff800000000000 x24: ffff0000db75da00
      x23: 0000000000000000 x22: ffff0000d8b78518 x21: 0000000000000000
      x20: ffff0000d537ad80 x19: ffff0000d8b78000 x18: 1fffe000366d79ee
      x17: ffff8000800614a8 x16: ffff800080569b84 x15: 0000000000000001
      x14: 000000008b336894 x13: 00000000cd96feaa x12: 0000000000000003
      x11: 0000000000040000 x10: 00000000000020a3 x9 : 1fffe0001b16f0f1
      x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000000000000003f
      x5 : 0000000000000040 x4 : 0000000000000001 x3 : 0000000000000000
      x2 : 0000000000000002 x1 : 0000000000000000 x0 : ffff0000d8b78000
      Call trace:
      0x0
      netlbl_sock_setattr+0x2e4/0x338 net/netlabel/netlabel_kapi.c:1000
      smack_netlbl_add+0xa4/0x154 security/smack/smack_lsm.c:2593
      smack_socket_post_create+0xa8/0x14c security/smack/smack_lsm.c:2973
      security_socket_post_create+0x94/0xd4 security/security.c:4425
      __sock_create+0x4c8/0x884 net/socket.c:1587
      sock_create net/socket.c:1622 [inline]
      __sys_socket_create net/socket.c:1659 [inline]
      __sys_socket+0x134/0x340 net/socket.c:1706
      __do_sys_socket net/socket.c:1720 [inline]
      __se_sys_socket net/socket.c:1718 [inline]
      __arm64_sys_socket+0x7c/0x94 net/socket.c:1718
      __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
      invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
      el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
      do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
      el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712
      el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
      el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
      Code: ???????? ???????? ???????? ???????? (????????)
      ---[ end trace 0000000000000000 ]---
      
      This patch adds a toy implementation that simply returns, which
      prevents the panic. This is because MSS can be set in sock_create_kern
      or smc_setsockopt, similar to how it's done in AF_SMC. However, for
      AF_SMC, there is currently no way to synchronize MSS within
      __sys_connect_file. This toy implementation lays the groundwork for us
      to support such a feature for IPPROTO_SMC in the future.
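
The idea of a placeholder callback can be sketched in plain C. All names here (sock_sketch, conn_ops_sketch, toy_sync_mss, generic_update_mss) are hypothetical stand-ins, and the return value of 0 is an assumption; the point is only that a caller which invokes the function pointer without a NULL check needs some implementation to be installed.

```c
#include <stddef.h>

struct sock_sketch;   /* hypothetical stand-in for a socket object */

/* Hypothetical ops table: generic code calls sync_mss without a NULL check. */
struct conn_ops_sketch {
    unsigned int (*sync_mss)(struct sock_sketch *sk, unsigned int pmtu);
};

/* Toy implementation: does nothing and returns (0 is assumed here). */
static unsigned int toy_sync_mss(struct sock_sketch *sk, unsigned int pmtu)
{
    (void)sk;
    (void)pmtu;
    return 0;
}

/* Installing the toy callback keeps the pointer non-NULL. */
const struct conn_ops_sketch smc_like_ops = { .sync_mss = toy_sync_mss };

/* Models a generic caller that dereferences the pointer unconditionally. */
unsigned int generic_update_mss(const struct conn_ops_sketch *ops,
                                struct sock_sketch *sk, unsigned int pmtu)
{
    return ops->sync_mss(sk, pmtu);
}
```

With .sync_mss left unset, the call in generic_update_mss() would jump to address 0, which matches the "pc : 0x0" line in the trace above.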
      
      Fixes: d25a92cc ("net/smc: Introduce IPPROTO_SMC")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Link: https://patch.msgid.link/1728456916-67035-1-git-send-email-alibuda@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ppp: fix ppp_async_encode() illegal access · 40dddd4b
      Eric Dumazet authored
      syzbot reported an issue in ppp_async_encode() [1]
      
      In this case, pppoe_sendmsg() is called with a zero size.
      Then ppp_async_encode() is called with an empty skb.
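
A minimal user-space sketch of the kind of guard this calls for, assuming the encoder starts by reading a 2-byte big-endian protocol field from the frame. The helper frame_proto() is hypothetical and is not the actual fix.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the 16-bit big-endian protocol field at the
 * start of a frame, or -1 when the frame is too short to contain it.
 */
int frame_proto(const uint8_t *data, size_t len)
{
    if (data == NULL || len < 2)   /* empty or 1-byte frame: nothing to read */
        return -1;
    return (data[0] << 8) | data[1];
}
```

Without the length check, a zero-length frame would make the two byte reads touch memory that was allocated but never written, which is what KMSAN flags as an uninit-value.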
      
      BUG: KMSAN: uninit-value in ppp_async_encode drivers/net/ppp/ppp_async.c:545 [inline]
       BUG: KMSAN: uninit-value in ppp_async_push+0xb4f/0x2660 drivers/net/ppp/ppp_async.c:675
        ppp_async_encode drivers/net/ppp/ppp_async.c:545 [inline]
        ppp_async_push+0xb4f/0x2660 drivers/net/ppp/ppp_async.c:675
        ppp_async_send+0x130/0x1b0 drivers/net/ppp/ppp_async.c:634
        ppp_channel_bridge_input drivers/net/ppp/ppp_generic.c:2280 [inline]
        ppp_input+0x1f1/0xe60 drivers/net/ppp/ppp_generic.c:2304
        pppoe_rcv_core+0x1d3/0x720 drivers/net/ppp/pppoe.c:379
        sk_backlog_rcv+0x13b/0x420 include/net/sock.h:1113
        __release_sock+0x1da/0x330 net/core/sock.c:3072
        release_sock+0x6b/0x250 net/core/sock.c:3626
        pppoe_sendmsg+0x2b8/0xb90 drivers/net/ppp/pppoe.c:903
        sock_sendmsg_nosec net/socket.c:729 [inline]
        __sock_sendmsg+0x30f/0x380 net/socket.c:744
        ____sys_sendmsg+0x903/0xb60 net/socket.c:2602
        ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656
        __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742
        __do_sys_sendmmsg net/socket.c:2771 [inline]
        __se_sys_sendmmsg net/socket.c:2768 [inline]
        __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768
        x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      Uninit was created at:
        slab_post_alloc_hook mm/slub.c:4092 [inline]
        slab_alloc_node mm/slub.c:4135 [inline]
        kmem_cache_alloc_node_noprof+0x6bf/0xb80 mm/slub.c:4187
        kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:587
        __alloc_skb+0x363/0x7b0 net/core/skbuff.c:678
        alloc_skb include/linux/skbuff.h:1322 [inline]
        sock_wmalloc+0xfe/0x1a0 net/core/sock.c:2732
        pppoe_sendmsg+0x3a7/0xb90 drivers/net/ppp/pppoe.c:867
        sock_sendmsg_nosec net/socket.c:729 [inline]
        __sock_sendmsg+0x30f/0x380 net/socket.c:744
        ____sys_sendmsg+0x903/0xb60 net/socket.c:2602
        ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2656
        __sys_sendmmsg+0x3c1/0x960 net/socket.c:2742
        __do_sys_sendmmsg net/socket.c:2771 [inline]
        __se_sys_sendmmsg net/socket.c:2768 [inline]
        __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2768
        x64_sys_call+0xb6e/0x3ba0 arch/x86/include/generated/asm/syscalls_64.h:308
        do_syscall_x64 arch/x86/entry/common.c:52 [inline]
        do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
      CPU: 1 UID: 0 PID: 5411 Comm: syz.1.14 Not tainted 6.12.0-rc1-syzkaller-00165-g360c1f1f #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot+1d121645899e7692f92a@syzkaller.appspotmail.com
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20241009185802.3763282-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • docs: netdev: document guidance on cleanup patches · aeb218d9
      Simon Horman authored
      The purpose of this section is to document what is the current practice
      regarding clean-up patches which address checkpatch warnings and similar
      problems. I feel there is a value in having this documented so others
      can easily refer to it.
      
      Clearly this topic is subjective. And to some extent the current
      practice discourages a wider range of patches than is described here.
      But I feel it is best to start somewhere, with the most well established
      part of the current practice.

      Signed-off-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20241009-doc-mc-clean-v2-1-e637b665fa81@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'rtnetlink-handle-error-of-rtnl_register_module' · ffc8fa91
      Paolo Abeni authored
      Kuniyuki Iwashima says:
      
      ====================
      rtnetlink: Handle error of rtnl_register_module().
      
      While converting phonet to per-netns RTNL, I found a weird comment
      
        /* Further rtnl_register_module() cannot fail */
      
      that was true but no longer true after commit addf9b90 ("net:
      rtnetlink: use rcu to free rtnl message handlers").
      
      Many callers of rtnl_register_module() just ignore the return
      value but should handle it properly.
      
      This series introduces two helpers, rtnl_register_many() and
      rtnl_unregister_many(), to do that easily and fix such callers.
      
      All rtnl_register() and rtnl_register_module() will be converted
      to _many() variant and some rtnl_lock() will be saved in _many()
      later in net-next.
      
      Changes:
        v4:
          * Add more context in changelog of each patch
      
        v3: https://lore.kernel.org/all/20241007124459.5727-1-kuniyu@amazon.com/
          * Move module *owner to struct rtnl_msg_handler
          * Make struct rtnl_msg_handler args/vars const
          * Update mctp goto labels
      
        v2: https://lore.kernel.org/netdev/20241004222358.79129-1-kuniyu@amazon.com/
          * Remove __exit from mctp_neigh_exit().
      
        v1: https://lore.kernel.org/netdev/20241003205725.5612-1-kuniyu@amazon.com/
      ====================
      
      Link: https://patch.msgid.link/20241008184737.9619-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • phonet: Handle error of rtnl_register_module(). · b5e837c8
      Kuniyuki Iwashima authored
      Before commit addf9b90 ("net: rtnetlink: use rcu to free rtnl
      message handlers"), once the first rtnl_register_module() allocated
      rtnl_msg_handlers[PF_PHONET], the following calls never failed.
      
      However, after the commit, rtnl_register_module() could fail silently
      to allocate rtnl_msg_handlers[PF_PHONET][msgtype] and requires error
      handling for each call.
      
      Handling the error allows users to view a module as an all-or-nothing
      thing in terms of the rtnetlink functionality.  This prevents syzkaller
      from reporting spurious errors from its tests, where OOM often occurs
      and module is automatically loaded.
      
      Let's use rtnl_register_many() to handle the errors easily.
      
      Fixes: addf9b90 ("net: rtnetlink: use rcu to free rtnl message handlers")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Rémi Denis-Courmont <courmisch@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • mpls: Handle error of rtnl_register_module(). · 5be2062e
      Kuniyuki Iwashima authored
      Since introduced, mpls_init() has been ignoring the returned
      value of rtnl_register_module(), which could fail silently.
      
      Handling the error allows users to view a module as an all-or-nothing
      thing in terms of the rtnetlink functionality.  This prevents syzkaller
      from reporting spurious errors from its tests, where OOM often occurs
      and module is automatically loaded.
      
      Let's handle the errors by rtnl_register_many().
      
      Fixes: 03c05665 ("mpls: Netlink commands to add, remove, and dump routes")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • mctp: Handle error of rtnl_register_module(). · d5170561
      Kuniyuki Iwashima authored
      Since introduced, mctp has been ignoring the returned value of
      rtnl_register_module(), which could fail silently.
      
      Handling the error allows users to view a module as an all-or-nothing
      thing in terms of the rtnetlink functionality.  This prevents syzkaller
      from reporting spurious errors from its tests, where OOM often occurs
      and module is automatically loaded.
      
      Let's handle the errors by rtnl_register_many().
      
      Fixes: 583be982 ("mctp: Add device handling and netlink interface")
      Fixes: 831119f8 ("mctp: Add neighbour netlink interface")
      Fixes: 06d2f4c5 ("mctp: Add netlink route management")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • bridge: Handle error of rtnl_register_module(). · cba5e43b
      Kuniyuki Iwashima authored
      Since introduced, br_vlan_rtnl_init() has been ignoring the returned
      value of rtnl_register_module(), which could fail silently.
      
      Handling the error allows users to view a module as an all-or-nothing
      thing in terms of the rtnetlink functionality.  This prevents syzkaller
      from reporting spurious errors from its tests, where OOM often occurs
      and module is automatically loaded.
      
      Let's handle the errors by rtnl_register_many().
      
      Fixes: 8dcea187 ("net: bridge: vlan: add rtm definitions and dump support")
      Fixes: f26b2965 ("net: bridge: vlan: add new rtm message support")
      Fixes: adb3ce9b ("net: bridge: vlan: add del rtm message support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • vxlan: Handle error of rtnl_register_module(). · 78b7b991
      Kuniyuki Iwashima authored
      Since introduced, vxlan_vnifilter_init() has been ignoring the
      returned value of rtnl_register_module(), which could fail silently.
      
      Handling the error allows users to view a module as an all-or-nothing
      thing in terms of the rtnetlink functionality.  This prevents syzkaller
      from reporting spurious errors from its tests, where OOM often occurs
      and module is automatically loaded.
      
      Let's handle the errors by rtnl_register_many().
      
      Fixes: f9c4bb0b ("vxlan: vni filtering support on collect metadata device")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • rtnetlink: Add bulk registration helpers for rtnetlink message handlers. · 07cc7b0b
      Kuniyuki Iwashima authored
      Before commit addf9b90 ("net: rtnetlink: use rcu to free rtnl message
      handlers"), once rtnl_msg_handlers[protocol] was allocated, the following
      rtnl_register_module() for the same protocol never failed.
      
      However, after the commit, rtnl_msg_handlers[protocol][msgtype] needs to
      be allocated in each rtnl_register_module(), so each call could fail.
      
      Many callers of rtnl_register_module() do not handle the returned error,
      so error handling needs to be added in many places.
      
      To handle that easily, let's add wrapper functions for bulk registration
      of rtnetlink message handlers.
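
The all-or-nothing pattern behind such bulk helpers can be sketched in user space. This is not the rtnetlink API: handler_sketch, register_many(), unregister_many() and the fixed-size table are hypothetical names chosen for the illustration. Registration walks an array and, on the first failure, unregisters everything registered so far.

```c
#include <stddef.h>

#define MAX_MSGTYPE 8

/* Hypothetical handler descriptor. */
struct handler_sketch {
    int msgtype;
    int (*doit)(void);
};

static const struct handler_sketch *table[MAX_MSGTYPE];

/* Sample handler so callers have something to register. */
int noop_handler(void) { return 0; }

/* Registers one handler; fails for an invalid type or a missing callback. */
static int register_one(const struct handler_sketch *h)
{
    if (h->msgtype < 0 || h->msgtype >= MAX_MSGTYPE || h->doit == NULL)
        return -1;
    table[h->msgtype] = h;
    return 0;
}

static void unregister_one(const struct handler_sketch *h)
{
    if (h->msgtype >= 0 && h->msgtype < MAX_MSGTYPE && table[h->msgtype] == h)
        table[h->msgtype] = NULL;
}

/* Unregisters the first n handlers, in reverse order. */
void unregister_many(const struct handler_sketch *hs, size_t n)
{
    while (n > 0)
        unregister_one(&hs[--n]);
}

/* Registers all n handlers, or none: rolls back on the first failure. */
int register_many(const struct handler_sketch *hs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int err = register_one(&hs[i]);

        if (err) {
            unregister_many(hs, i);
            return err;
        }
    }
    return 0;
}

/* Counts currently registered handlers. */
int registered_count(void)
{
    int count = 0;

    for (int i = 0; i < MAX_MSGTYPE; i++)
        if (table[i] != NULL)
            count++;
    return count;
}
```

A caller then needs a single error check per module instead of one per handler.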

      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 9a3cd877
      Paolo Abeni authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Restrict xtables extensions to families that are safe, syzbot found
         a way to combine ebtables with extensions that are never used by
         userspace tools. From Florian Westphal.
      
      2) Set l3mdev unconditionally whenever possible in nft_fib to fix lookup
         mismatch, also from Florian.
      
      netfilter pull request 24-10-09
      
      * tag 'nf-24-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        selftests: netfilter: conntrack_vrf.sh: add fib test case
        netfilter: fib: check correct rtable in vrf setups
        netfilter: xtables: avoid NFPROTO_UNSPEC where needed
      ====================
      
      Link: https://patch.msgid.link/20241009213858.3565808-1-pablo@netfilter.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: do not delay dst_entries_add() in dst_release() · ac888d58
      Eric Dumazet authored
      dst_entries_add() uses per-cpu data that might be freed at netns
      dismantle from ip6_route_net_exit() calling dst_entries_destroy()
      
      Before ip6_route_net_exit() can be called, we release all
      the dsts associated with this netns, via calls to dst_release(),
      which waits an rcu grace period before calling dst_destroy()
      
      dst_entries_add() use in dst_destroy() is racy, because
      dst_entries_destroy() could have been called already.
      
      Decrementing the number of dsts must happen sooner.
      
      Notes:
      
      1) in CONFIG_XFRM case, dst_destroy() can call
         dst_release_immediate(child), this might also cause UAF
         if the child does not have DST_NOCOUNT set.
         IPSEC maintainers might take a look and see how to address this.
      
      2) There is also discussion about removing this count of dst,
         which might happen in future kernels.
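
The ordering change can be illustrated with a simplified user-space model. The types and function names below (ns_sketch, dst_sketch, dst_release_sketch, dst_destroy_deferred_sketch) are hypothetical, and the deferred step merely stands in for an RCU callback. The counter is updated when the last reference is dropped, so the deferred part never touches per-namespace data.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-namespace accounting. */
struct ns_sketch {
    long dst_entries;
};

/* Hypothetical route cache entry. */
struct dst_sketch {
    struct ns_sketch *ns;
    int refcnt;
    bool destroy_pending;
};

/*
 * Drops one reference. On the last put, the entry count is decremented
 * immediately, while the namespace counters are still guaranteed to exist,
 * and the actual destruction is deferred.
 */
void dst_release_sketch(struct dst_sketch *dst)
{
    if (--dst->refcnt == 0) {
        dst->ns->dst_entries--;
        dst->destroy_pending = true;   /* stands in for scheduling a callback */
    }
}

/* Deferred part: must not touch the namespace counters, which may be gone. */
void dst_destroy_deferred_sketch(struct dst_sketch *dst)
{
    dst->destroy_pending = false;
    dst->ns = NULL;
}
```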
      
      Fixes: f8864972 ("ipv4: fix dst race in sk_dst_get()")
      Closes: https://lore.kernel.org/lkml/CANn89iLCCGsP7SFn9HKpvnKu96Td4KD08xf7aGtiYgZnkjaL=w@mail.gmail.com/T/
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Link: https://patch.msgid.link/20241008143110.1064899-1-edumazet@google.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · a354733c
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2024-10-08 (ice, i40e, igb, e1000e)
      
      This series contains updates to ice, i40e, igb, and e1000e drivers.
      
      For ice:
      
      Marcin allows driver to load, into safe mode, when DDP package is
      missing or corrupted and adjusts the netif_is_ice() check to
      account for when the device is in safe mode. He also fixes an
      out-of-bounds issue when MSI-X are increased for VFs.
      
      Wojciech clears FDB entries on reset to match the hardware state.
      
      For i40e:
      
      Aleksandr adds locking around MACVLAN filters to prevent memory leaks
      due to concurrency issues.
      
      For igb:
      
      Mohamed Khalfella adds a check to not attempt to bring up an already
      running interface on non-fatal PCIe errors.
      
      For e1000e:
      
      Vitaly changes board type for I219 to more closely match the hardware
      and stop PHY issues.
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
        e1000e: change I219 (19) devices to ADP
        igb: Do not bring the device up after non-fatal error
        i40e: Fix macvlan leak by synchronizing access to mac_filter_hash
        ice: Fix increasing MSI-X on VF
        ice: Flush FDB entries before reset
        ice: Fix netif_is_ice() in Safe Mode
        ice: Fix entering Safe Mode
      ====================
      
      Link: https://patch.msgid.link/20241008230050.928245-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'mptcp-misc-fixes-involving-fallback-to-tcp' · 5151a35c
      Jakub Kicinski authored
      Matthieu Baerts says:
      
      ====================
      mptcp: misc. fixes involving fallback to TCP
      
      - Patch 1: better handle DSS corruptions from a bugged peer: reducing
        warnings, doing a fallback or a reset depending on the subflow state.
        For >= v5.7.
      
      - Patch 2: fix DSS corruption due to large pmtu xmit, where MPTCP was
        not taken into account. For >= v5.6.
      
      - Patch 3: fallback when MPTCP opts are dropped after the first data
        packet, instead of resetting the connection. For >= v5.6.
      
      - Patch 4: restrict the removal of a subflow to other closing states, a
        better fix, for a recent one. For >= v5.10.
      ====================
      
      Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-0-c6fb8e93e551@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: pm: do not remove closing subflows · db0a37b7
      Matthieu Baerts (NGI0) authored
      In a previous fix, the in-kernel path-manager has been modified not to
      retrigger the removal of a subflow if it was already closed, e.g. when
      the initial subflow is removed, but kept in the subflows list.
      
      To be complete, this fix should also skip the subflows that are in any
      closing state: mptcp_close_ssk() will initiate the closure, but the
      switch to the TCP_CLOSE state depends on the other peer.
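
A small sketch of such a state filter, written as a bitmask test. The enum mirrors the usual TCP state numbering, but the ST_ names, the helper skip_subflow_removal() and the exact set of states treated as "closing" are assumptions of this example, not taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical enum mirroring the conventional TCP state numbering. */
enum tcp_state_sketch {
    ST_ESTABLISHED = 1,
    ST_SYN_SENT,
    ST_SYN_RECV,
    ST_FIN_WAIT1,
    ST_FIN_WAIT2,
    ST_TIME_WAIT,
    ST_CLOSE,
    ST_CLOSE_WAIT,
    ST_LAST_ACK,
    ST_LISTEN,
    ST_CLOSING
};

#define STF(s) (1u << (s))

/* Assumed set of states in which the closure has already been initiated. */
#define CLOSING_STATES \
    (STF(ST_FIN_WAIT1) | STF(ST_FIN_WAIT2) | STF(ST_CLOSING) | STF(ST_CLOSE))

/* Returns true when the subflow should be skipped by the removal logic. */
bool skip_subflow_removal(enum tcp_state_sketch state)
{
    return (STF(state) & CLOSING_STATES) != 0;
}
```

Testing only for ST_CLOSE would miss a subflow that has sent its FIN but is still waiting for the peer, which is the gap described above.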
      
      Fixes: 58e1b66b ("mptcp: pm: do not remove already closed subflows")
      Cc: stable@vger.kernel.org
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-4-c6fb8e93e551@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: fallback when MPTCP opts are dropped after 1st data · 119d51e2
      Matthieu Baerts (NGI0) authored
      As reported by Christoph [1], before this patch, an MPTCP connection was
      wrongly reset when a host received a first data packet with MPTCP
      options after the 3wHS, but got the next ones without.
      
      According to the MPTCP v1 specs [2], a fallback should happen in this
      case, because the host didn't receive a DATA_ACK from the other peer,
      nor receive data for more than the initial window which implies a
      DATA_ACK being received by the other peer.
      
      The patch here re-uses the same logic as the one used in other places:
      by looking at allow_infinite_fallback, which is disabled at the creation
      of an additional subflow. It's not looking at the first DATA_ACK (or
      implying one received from the other side) as suggested by the RFC, but
      it is in continuation with what was already done, which is safer, and it
      fixes the reported issue. The next step, looking at this first DATA_ACK,
      is tracked in [4].
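
The decision described here can be reduced to a small sketch. The function on_data_packet() and the mapping_action enum are hypothetical; only the allow_infinite_fallback flag comes from the text above. A data packet without MPTCP options leads to a fallback while the flag is still set, and to a reset once it has been cleared, for example after an additional subflow was created.

```c
#include <stdbool.h>

enum mapping_action {
    ACTION_CONTINUE,   /* MPTCP options present: keep going as MPTCP */
    ACTION_FALLBACK,   /* options missing, fallback still allowed */
    ACTION_RESET       /* options missing, fallback no longer allowed */
};

/* Hypothetical decision helper for a received data packet. */
enum mapping_action on_data_packet(bool has_mptcp_opts,
                                   bool allow_infinite_fallback)
{
    if (has_mptcp_opts)
        return ACTION_CONTINUE;
    return allow_infinite_fallback ? ACTION_FALLBACK : ACTION_RESET;
}
```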
      
      This patch has been validated using the following Packetdrill script:
      
         0 socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3
        +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
        +0 bind(3, ..., ...) = 0
        +0 listen(3, 1) = 0
      
        // 3WHS is OK
        +0.0 < S  0:0(0)       win 65535  <mss 1460, sackOK, nop, nop, nop, wscale 6, mpcapable v1 flags[flag_h] nokey>
        +0.0 > S. 0:0(0) ack 1            <mss 1460, nop, nop, sackOK, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey]>
        +0.1 <  . 1:1(0) ack 1 win 2048                                              <mpcapable v1 flags[flag_h] key[ckey=2, skey]>
        +0 accept(3, ..., ...) = 4
      
        // Data from the client with valid MPTCP options (no DATA_ACK: normal)
        +0.1 < P. 1:501(500) ack 1 win 2048 <mpcapable v1 flags[flag_h] key[skey, ckey] mpcdatalen 500, nop, nop>
        // From here, the MPTCP options will be dropped by a middlebox
        +0.0 >  . 1:1(0)     ack 501        <dss dack8=501 dll=0 nocs>
      
        +0.1 read(4, ..., 500) = 500
        +0   write(4, ..., 100) = 100
      
        // The server replies with data, still thinking MPTCP is being used
        +0.0 > P. 1:101(100)   ack 501          <dss dack8=501 dsn8=1 ssn=1 dll=100 nocs, nop, nop>
        // But the client already did a fallback to TCP, because the two previous packets have been received without MPTCP options
        +0.1 <  . 501:501(0)   ack 101 win 2048
      
        +0.0 < P. 501:601(100) ack 101 win 2048
        // The server should fallback to TCP, not reset: it didn't get a DATA_ACK, nor data for more than the initial window
        +0.0 >  . 101:101(0)   ack 601
      
      Note that this script requires Packetdrill with MPTCP support, see [3].
      
      Fixes: dea2b1ea ("mptcp: do not reset MP_CAPABLE subflow on mapping errors")
      Cc: stable@vger.kernel.org
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/518 [1]
      Link: https://datatracker.ietf.org/doc/html/rfc8684#name-fallback [2]
      Link: https://github.com/multipath-tcp/packetdrill [3]
      Link: https://github.com/multipath-tcp/mptcp_net-next/issues/519 [4]
      Reviewed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-3-c6fb8e93e551@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: fix mptcp DSS corruption due to large pmtu xmit · 4dabcdf5
      Paolo Abeni authored
      Syzkaller was able to trigger a DSS corruption:
      
        TCP: request_sock_subflow_v4: Possible SYN flooding on port [::]:20002. Sending cookies.
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 5227 at net/mptcp/protocol.c:695 __mptcp_move_skbs_from_subflow+0x20a9/0x21f0 net/mptcp/protocol.c:695
        Modules linked in:
        CPU: 0 UID: 0 PID: 5227 Comm: syz-executor350 Not tainted 6.11.0-syzkaller-08829-gaf9c191a #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
        RIP: 0010:__mptcp_move_skbs_from_subflow+0x20a9/0x21f0 net/mptcp/protocol.c:695
        Code: 0f b6 dc 31 ff 89 de e8 b5 dd ea f5 89 d8 48 81 c4 50 01 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 98 da ea f5 90 <0f> 0b 90 e9 47 ff ff ff e8 8a da ea f5 90 0f 0b 90 e9 99 e0 ff ff
        RSP: 0018:ffffc90000006db8 EFLAGS: 00010246
        RAX: ffffffff8ba9df18 RBX: 00000000000055f0 RCX: ffff888030023c00
        RDX: 0000000000000100 RSI: 00000000000081e5 RDI: 00000000000055f0
        RBP: 1ffff110062bf1ae R08: ffffffff8ba9cf12 R09: 1ffff110062bf1b8
        R10: dffffc0000000000 R11: ffffed10062bf1b9 R12: 0000000000000000
        R13: dffffc0000000000 R14: 00000000700cec61 R15: 00000000000081e5
        FS:  000055556679c380(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000020287000 CR3: 0000000077892000 CR4: 00000000003506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         move_skbs_to_msk net/mptcp/protocol.c:811 [inline]
         mptcp_data_ready+0x29c/0xa90 net/mptcp/protocol.c:854
         subflow_data_ready+0x34a/0x920 net/mptcp/subflow.c:1490
         tcp_data_queue+0x20fd/0x76c0 net/ipv4/tcp_input.c:5283
         tcp_rcv_established+0xfba/0x2020 net/ipv4/tcp_input.c:6237
         tcp_v4_do_rcv+0x96d/0xc70 net/ipv4/tcp_ipv4.c:1915
         tcp_v4_rcv+0x2dc0/0x37f0 net/ipv4/tcp_ipv4.c:2350
         ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
         ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
         NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
         NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
         __netif_receive_skb_one_core net/core/dev.c:5662 [inline]
         __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
         process_backlog+0x662/0x15b0 net/core/dev.c:6107
         __napi_poll+0xcb/0x490 net/core/dev.c:6771
         napi_poll net/core/dev.c:6840 [inline]
         net_rx_action+0x89b/0x1240 net/core/dev.c:6962
         handle_softirqs+0x2c5/0x980 kernel/softirq.c:554
         do_softirq+0x11b/0x1e0 kernel/softirq.c:455
         </IRQ>
         <TASK>
         __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
         local_bh_enable include/linux/bottom_half.h:33 [inline]
         rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline]
         __dev_queue_xmit+0x1764/0x3e80 net/core/dev.c:4451
         dev_queue_xmit include/linux/netdevice.h:3094 [inline]
         neigh_hh_output include/net/neighbour.h:526 [inline]
         neigh_output include/net/neighbour.h:540 [inline]
         ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:236
         ip_local_out net/ipv4/ip_output.c:130 [inline]
         __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:536
         __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
         tcp_transmit_skb net/ipv4/tcp_output.c:1484 [inline]
         tcp_mtu_probe net/ipv4/tcp_output.c:2547 [inline]
         tcp_write_xmit+0x641d/0x6bf0 net/ipv4/tcp_output.c:2752
         __tcp_push_pending_frames+0x9b/0x360 net/ipv4/tcp_output.c:3015
         tcp_push_pending_frames include/net/tcp.h:2107 [inline]
         tcp_data_snd_check net/ipv4/tcp_input.c:5714 [inline]
         tcp_rcv_established+0x1026/0x2020 net/ipv4/tcp_input.c:6239
         tcp_v4_do_rcv+0x96d/0xc70 net/ipv4/tcp_ipv4.c:1915
         sk_backlog_rcv include/net/sock.h:1113 [inline]
         __release_sock+0x214/0x350 net/core/sock.c:3072
         release_sock+0x61/0x1f0 net/core/sock.c:3626
         mptcp_push_release net/mptcp/protocol.c:1486 [inline]
         __mptcp_push_pending+0x6b5/0x9f0 net/mptcp/protocol.c:1625
         mptcp_sendmsg+0x10bb/0x1b10 net/mptcp/protocol.c:1903
         sock_sendmsg_nosec net/socket.c:730 [inline]
         __sock_sendmsg+0x1a6/0x270 net/socket.c:745
         ____sys_sendmsg+0x52a/0x7e0 net/socket.c:2603
         ___sys_sendmsg net/socket.c:2657 [inline]
         __sys_sendmsg+0x2aa/0x390 net/socket.c:2686
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
        RIP: 0033:0x7fb06e9317f9
        Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
        RSP: 002b:00007ffe2cfd4f98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
        RAX: ffffffffffffffda RBX: 00007fb06e97f468 RCX: 00007fb06e9317f9
        RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000005
        RBP: 00007fb06e97f446 R08: 0000555500000000 R09: 0000555500000000
        R10: 0000555500000000 R11: 0000000000000246 R12: 00007fb06e97f406
        R13: 0000000000000001 R14: 00007ffe2cfd4fe0 R15: 0000000000000003
         </TASK>
      
      Additionally syzkaller provided a nice reproducer. The repro enables
      pmtu on the loopback device, leading to tcp_mtu_probe() generating
      very large probe packets.
      
      tcp_can_coalesce_send_queue_head() currently does not check for
      mptcp-level invariants, and allowed the creation of cross-DSS probes,
      leading to the mentioned corruption.
      
      Address the issue by teaching tcp_can_coalesce_send_queue_head() about
      mptcp via tcp_skb_can_collapse(), which also reduces code
      duplication.
      
      Fixes: 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions")
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+d1bff73460e33101f0e7@syzkaller.appspotmail.com
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/513
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-2-c6fb8e93e551@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Paolo Abeni's avatar
      mptcp: handle consistently DSS corruption · e32d262c
      Paolo Abeni authored
      A buggy peer implementation can send corrupted DSS options, consistently
      hitting a few warnings in the data path. Use DEBUG_NET assertions to
      avoid the splat on some builds, and handle the error consistently,
      dumping the related MIBs and performing fallback and/or reset according
      to the subflow type.
      
      Fixes: 6771bfd9 ("mptcp: update mptcp ack sequence from work queue")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Link: https://patch.msgid.link/20241008-net-mptcp-fallback-fixes-v1-1-c6fb8e93e551@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Breno Leitao's avatar
      net: netconsole: fix wrong warning · d94785bb
      Breno Leitao authored
      A warning is triggered when there is insufficient space in the buffer
      for userdata. However, this is not an issue since userdata will be sent
      in the next iteration.
      
      Current warning message:
      
          ------------[ cut here ]------------
           WARNING: CPU: 13 PID: 3013042 at drivers/net/netconsole.c:1122 write_ext_msg+0x3b6/0x3d0
            ? write_ext_msg+0x3b6/0x3d0
            console_flush_all+0x1e9/0x330
      
      The code incorrectly issues a warning when this_chunk is zero, which is
      a valid scenario. The warning should only be triggered when this_chunk
      is negative.
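
      The distinction between the two cases can be sketched in plain C,
      outside the kernel. chunk_is_bug() is a hypothetical helper used only
      for illustration; it is not the netconsole code itself.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of netconsole): decide whether the space
 * left for userdata in the current fragment indicates a real problem.
 */
static bool chunk_is_bug(int this_chunk)
{
	/*
	 * Zero means the fragment is full; userdata is simply sent in the
	 * next iteration. Only a negative value means the length accounting
	 * went wrong and deserves a warning.
	 */
	return this_chunk < 0;
}
```

      With this check, a value of 0 is treated as valid and only values
      below zero would warn.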
      
      Fixes: 1ec9daf9 ("net: netconsole: append userdata to fragmented netconsole messages")
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20241008094325.896208-1-leitao@debian.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Vladimir Oltean's avatar
      net: dsa: refuse cross-chip mirroring operations · 8c924369
      Vladimir Oltean authored
      In case of a tc mirred action from one switch to another, the behavior
      is not correct. We simply tell the source switch driver to program a
      mirroring entry towards mirror->to_local_port = to_dp->index, but it is
      not even guaranteed that the to_dp belongs to the same switch as dp.
      
      For proper cross-chip support, we would need to go through the
      cross-chip notifier layer in switch.c, program the entry on cascade
      ports, and introduce new, explicit API for cross-chip mirroring, given
      that intermediary switches should have introspection into the DSA tags
      passed through the cascade port (and not just program a port mirror on
      the entire cascade port). None of that exists today.
      
      Reject what is not implemented so that user space is not misled into
      thinking it works.
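
      A minimal user-space sketch of such a rejection is shown below. The
      sketch_* types and the choice of -EOPNOTSUPP are assumptions made for
      illustration; they are not the DSA structures or necessarily the error
      code used by the patch.

```c
#include <errno.h>

/* Hypothetical stand-ins for a switch and one of its ports. */
struct sketch_switch {
	int index;
};

struct sketch_port {
	struct sketch_switch *ds;	/* switch this port belongs to */
	int index;			/* port number within that switch */
};

/*
 * Accept a mirror from dp to to_dp only when both ports sit on the same
 * switch; anything else would need cross-chip support.
 */
static int sketch_check_mirror(const struct sketch_port *dp,
			       const struct sketch_port *to_dp)
{
	if (dp->ds != to_dp->ds)
		return -EOPNOTSUPP;	/* assumed error code */

	return 0;
}
```

      The point of the check is that to_dp->index is only meaningful to the
      source switch when to_dp lives on that same switch.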
      
      Fixes: f50f2127 ("net: dsa: Add plumbing for port mirroring")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://patch.msgid.link/20241008094320.3340980-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Wei Fang's avatar
      net: fec: don't save PTP state if PTP is unsupported · 6be06307
      Wei Fang authored
      Some platforms (such as i.MX25 and i.MX27) do not support PTP, so on
      these platforms fec_ptp_init() is not called and the related members
      in fep are not initialized. However, fec_ptp_save_state() is called
      unconditionally, which causes the kernel to panic. Therefore, add a
      condition so that fec_ptp_save_state() is not called if PTP is not
      supported.
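
      The shape of such a guard might look like the following user-space
      sketch. The sketch_fep structure, its ptp_supported flag, and the
      sketch_* functions are hypothetical; the real driver may test a
      different field.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the driver's private data. */
struct sketch_fep {
	bool ptp_supported;	/* assumed flag: set only when PTP was initialized */
	int saved_states;	/* counts how often PTP state was saved */
};

/* Stand-in for saving PTP state; only valid when PTP was initialized. */
static void sketch_ptp_save_state(struct sketch_fep *fep)
{
	fep->saved_states++;
}

/* Stand-in for a restart path that used to save PTP state unconditionally. */
static void sketch_restart(struct sketch_fep *fep)
{
	if (fep->ptp_supported)
		sketch_ptp_save_state(fep);

	/* ... remainder of the restart sequence would follow here ... */
}
```

      On a platform without PTP the save step is skipped, so uninitialized
      PTP members are never touched.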
      
      Fixes: a1477dc8 ("net: fec: Restart PPS after link state change")
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Closes: https://lore.kernel.org/lkml/353e41fe-6bb4-4ee9-9980-2da2a9c1c508@roeck-us.net/
      Signed-off-by: Wei Fang <wei.fang@nxp.com>
      Reviewed-by: Csókás, Bence <csokas.bence@prolan.hu>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Link: https://patch.msgid.link/20241008061153.1977930-1-wei.fang@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Rosen Penev's avatar
      net: ibm: emac: mal: add dcr_unmap to _remove · 080ddc22
      Rosen Penev authored
      It's done in probe so it should be undone here.
      
      Fixes: 1d3bb996 ("Device tree aware EMAC driver")
      Signed-off-by: Rosen Penev <rosenp@gmail.com>
      Reviewed-by: Breno Leitao <leitao@debian.org>
      Link: https://patch.msgid.link/20241008233050.9422-1-rosenp@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Jacky Chou's avatar
      net: ftgmac100: fixed not check status from fixed phy · 70a0da8c
      Jacky Chou authored
      Add error handling for the call to fixed_phy_register().
      It may return an error, so the status needs to be checked.

      fixed_phy_register() also needs a device node to bind for mdio, so
      pass the mac device node to it.
      This follows what of_phy_register_fixed_link() does.
      
      Fixes: e24a6c87 ("net: ftgmac100: Get link speed and duplex for NC-SI")
      Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
      Link: https://patch.msgid.link/20241007032435.787892-1-jacky_chou@aspeedtech.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 09 Oct, 2024 9 commits
  3. 08 Oct, 2024 7 commits
    • Eric Dumazet's avatar
      net/sched: accept TCA_STAB only for root qdisc · 3cb7cf15
      Eric Dumazet authored
      Most qdiscs maintain their backlog using qdisc_pkt_len(skb)
      on the assumption it is invariant between the enqueue()
      and dequeue() handlers.
      
      Unfortunately syzbot can crash a host rather easily using
      a TBF + SFQ combination, with an STAB on SFQ [1]
      
      We can't support TCA_STAB at an arbitrary level; this would
      require maintaining per-qdisc storage.
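
      As a rough user-space illustration of a root-only rule, assuming a
      hypothetical sketch_check_stab() helper and an assumed -EINVAL return
      (neither is taken from the patch):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local constant mirroring the value of TC_H_ROOT from the uapi headers. */
#define SKETCH_TC_H_ROOT 0xFFFFFFFFU

/*
 * Hypothetical check: a size table is accepted only when the qdisc is
 * attached at the root; a qdisc without a size table is always fine.
 */
static int sketch_check_stab(uint32_t parent, bool has_stab)
{
	if (has_stab && parent != SKETCH_TC_H_ROOT)
		return -EINVAL;	/* assumed error code */

	return 0;
}
```

      Under this rule the TBF + SFQ setup from the report would be refused
      when the size table is requested on the child SFQ.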
      
      [1]
      [   88.796496] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [   88.798611] #PF: supervisor read access in kernel mode
      [   88.799014] #PF: error_code(0x0000) - not-present page
      [   88.799506] PGD 0 P4D 0
      [   88.799829] Oops: Oops: 0000 [#1] SMP NOPTI
      [   88.800569] CPU: 14 UID: 0 PID: 2053 Comm: b371744477 Not tainted 6.12.0-rc1-virtme #1117
      [   88.801107] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
      [   88.801779] RIP: 0010:sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq
      [ 88.802544] Code: 0f b7 50 12 48 8d 04 d5 00 00 00 00 48 89 d6 48 29 d0 48 8b 91 c0 01 00 00 48 c1 e0 03 48 01 c2 66 83 7a 1a 00 7e c0 48 8b 3a <4c> 8b 07 4c 89 02 49 89 50 08 48 c7 47 08 00 00 00 00 48 c7 07 00
      All code
      ========
         0:	0f b7 50 12          	movzwl 0x12(%rax),%edx
         4:	48 8d 04 d5 00 00 00 	lea    0x0(,%rdx,8),%rax
         b:	00
         c:	48 89 d6             	mov    %rdx,%rsi
         f:	48 29 d0             	sub    %rdx,%rax
        12:	48 8b 91 c0 01 00 00 	mov    0x1c0(%rcx),%rdx
        19:	48 c1 e0 03          	shl    $0x3,%rax
        1d:	48 01 c2             	add    %rax,%rdx
        20:	66 83 7a 1a 00       	cmpw   $0x0,0x1a(%rdx)
        25:	7e c0                	jle    0xffffffffffffffe7
        27:	48 8b 3a             	mov    (%rdx),%rdi
        2a:*	4c 8b 07             	mov    (%rdi),%r8		<-- trapping instruction
        2d:	4c 89 02             	mov    %r8,(%rdx)
        30:	49 89 50 08          	mov    %rdx,0x8(%r8)
        34:	48 c7 47 08 00 00 00 	movq   $0x0,0x8(%rdi)
        3b:	00
        3c:	48                   	rex.W
        3d:	c7                   	.byte 0xc7
        3e:	07                   	(bad)
      	...
      
      Code starting with the faulting instruction
      ===========================================
         0:	4c 8b 07             	mov    (%rdi),%r8
         3:	4c 89 02             	mov    %r8,(%rdx)
         6:	49 89 50 08          	mov    %rdx,0x8(%r8)
         a:	48 c7 47 08 00 00 00 	movq   $0x0,0x8(%rdi)
        11:	00
        12:	48                   	rex.W
        13:	c7                   	.byte 0xc7
        14:	07                   	(bad)
      	...
      [   88.803721] RSP: 0018:ffff9a1f892b7d58 EFLAGS: 00000206
      [   88.804032] RAX: 0000000000000000 RBX: ffff9a1f8420c800 RCX: ffff9a1f8420c800
      [   88.804560] RDX: ffff9a1f81bc1440 RSI: 0000000000000000 RDI: 0000000000000000
      [   88.805056] RBP: ffffffffc04bb0e0 R08: 0000000000000001 R09: 00000000ff7f9a1f
      [   88.805473] R10: 000000000001001b R11: 0000000000009a1f R12: 0000000000000140
      [   88.806194] R13: 0000000000000001 R14: ffff9a1f886df400 R15: ffff9a1f886df4ac
      [   88.806734] FS:  00007f445601a740(0000) GS:ffff9a2e7fd80000(0000) knlGS:0000000000000000
      [   88.807225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   88.807672] CR2: 0000000000000000 CR3: 000000050cc46000 CR4: 00000000000006f0
      [   88.808165] Call Trace:
      [   88.808459]  <TASK>
      [   88.808710] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434)
      [   88.809261] ? page_fault_oops (arch/x86/mm/fault.c:715)
      [   88.809561] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539)
      [   88.809806] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
      [   88.810074] ? sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq
      [   88.810411] sfq_reset (net/sched/sch_sfq.c:525) sch_sfq
      [   88.810671] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036)
      [   88.810950] tbf_reset (./include/linux/timekeeping.h:169 net/sched/sch_tbf.c:334) sch_tbf
      [   88.811208] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036)
      [   88.811484] netif_set_real_num_tx_queues (./include/linux/spinlock.h:396 ./include/net/sch_generic.h:768 net/core/dev.c:2958)
      [   88.811870] __tun_detach (drivers/net/tun.c:590 drivers/net/tun.c:673)
      [   88.812271] tun_chr_close (drivers/net/tun.c:702 drivers/net/tun.c:3517)
      [   88.812505] __fput (fs/file_table.c:432 (discriminator 1))
      [   88.812735] task_work_run (kernel/task_work.c:230)
      [   88.813016] do_exit (kernel/exit.c:940)
      [   88.813372] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:58 (discriminator 4))
      [   88.813639] ? handle_mm_fault (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/memcontrol.h:1022 ./include/linux/memcontrol.h:1045 ./include/linux/memcontrol.h:1052 mm/memory.c:5928 mm/memory.c:6088)
      [   88.813867] do_group_exit (kernel/exit.c:1070)
      [   88.814138] __x64_sys_exit_group (kernel/exit.c:1099)
      [   88.814490] x64_sys_call (??:?)
      [   88.814791] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1))
      [   88.815012] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
      [   88.815495] RIP: 0033:0x7f44560f1975
      
      Fixes: 175f9c1b ("net_sched: Add size table for qdiscs")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://patch.msgid.link/20241007184130.3960565-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Vitaly Lifshits's avatar
      e1000e: change I219 (19) devices to ADP · 9d9e5347
      Vitaly Lifshits authored
      Sporadic issues, such as PHY access loss, have been observed on I219 (19)
      devices. It was found that these devices have hardware more closely
      related to ADP than MTP and the issues were caused by taking MTP-specific
      flows.
      
      Change the MAC and board types of these devices from MTP to ADP to
      correctly reflect the LAN hardware, and flows, of these devices.
      
      Fixes: db2d737d ("e1000e: Separate MTP board type from ADP")
      Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
      Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Mohamed Khalfella's avatar
      igb: Do not bring the device up after non-fatal error · 330a699e
      Mohamed Khalfella authored
      Commit 004d2506 ("igb: Fix igb_down hung on surprise removal")
      changed igb_io_error_detected() to ignore non-fatal PCIe errors in order
      to avoid a hung task that can happen when igb_down() is called multiple
      times. This caused an issue when processing transient non-fatal errors.
      igb_io_resume(), which is called after igb_io_error_detected(), assumes
      that the device was brought down by igb_io_error_detected() if the
      interface is up. This resulted in a panic with the stack trace below.
      
      [ T3256] igb 0000:09:00.0 haeth0: igb: haeth0 NIC Link is Down
      [  T292] pcieport 0000:00:1c.5: AER: Uncorrected (Non-Fatal) error received: 0000:09:00.0
      [  T292] igb 0000:09:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
      [  T292] igb 0000:09:00.0:   device [8086:1537] error status/mask=00004000/00000000
      [  T292] igb 0000:09:00.0:    [14] CmpltTO
      [  200.105524,009][  T292] igb 0000:09:00.0: AER:   TLP Header: 00000000 00000000 00000000 00000000
      [  T292] pcieport 0000:00:1c.5: AER: broadcast error_detected message
      [  T292] igb 0000:09:00.0: Non-correctable non-fatal error reported.
      [  T292] pcieport 0000:00:1c.5: AER: broadcast mmio_enabled message
      [  T292] pcieport 0000:00:1c.5: AER: broadcast resume message
      [  T292] ------------[ cut here ]------------
      [  T292] kernel BUG at net/core/dev.c:6539!
      [  T292] invalid opcode: 0000 [#1] PREEMPT SMP
      [  T292] RIP: 0010:napi_enable+0x37/0x40
      [  T292] Call Trace:
      [  T292]  <TASK>
      [  T292]  ? die+0x33/0x90
      [  T292]  ? do_trap+0xdc/0x110
      [  T292]  ? napi_enable+0x37/0x40
      [  T292]  ? do_error_trap+0x70/0xb0
      [  T292]  ? napi_enable+0x37/0x40
      [  T292]  ? napi_enable+0x37/0x40
      [  T292]  ? exc_invalid_op+0x4e/0x70
      [  T292]  ? napi_enable+0x37/0x40
      [  T292]  ? asm_exc_invalid_op+0x16/0x20
      [  T292]  ? napi_enable+0x37/0x40
      [  T292]  igb_up+0x41/0x150
      [  T292]  igb_io_resume+0x25/0x70
      [  T292]  report_resume+0x54/0x70
      [  T292]  ? report_frozen_detected+0x20/0x20
      [  T292]  pci_walk_bus+0x6c/0x90
      [  T292]  ? aer_print_port_info+0xa0/0xa0
      [  T292]  pcie_do_recovery+0x22f/0x380
      [  T292]  aer_process_err_devices+0x110/0x160
      [  T292]  aer_isr+0x1c1/0x1e0
      [  T292]  ? disable_irq_nosync+0x10/0x10
      [  T292]  irq_thread_fn+0x1a/0x60
      [  T292]  irq_thread+0xe3/0x1a0
      [  T292]  ? irq_set_affinity_notifier+0x120/0x120
      [  T292]  ? irq_affinity_notify+0x100/0x100
      [  T292]  kthread+0xe2/0x110
      [  T292]  ? kthread_complete_and_exit+0x20/0x20
      [  T292]  ret_from_fork+0x2d/0x50
      [  T292]  ? kthread_complete_and_exit+0x20/0x20
      [  T292]  ret_from_fork_asm+0x11/0x20
      [  T292]  </TASK>
      
      To fix this issue, igb_io_resume() checks whether the interface is
      running and the device is not down. If so, igb_io_error_detected() did
      not bring the device down and there is no need to bring it up.
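
      The resume logic described above can be sketched in user-space C as
      follows. The sketch_adapter structure and its fields are hypothetical
      stand-ins for the driver state, not the igb data structures.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the adapter state. */
struct sketch_adapter {
	bool netif_running;	/* interface is administratively up */
	bool down;		/* device was brought down by the error handler */
	int up_calls;		/* counts how often the device was brought up */
};

/* Stand-in for the resume callback that runs after error detection. */
static void sketch_io_resume(struct sketch_adapter *adapter)
{
	if (!adapter->netif_running)
		return;

	/*
	 * Interface is running and the device is not down: the error
	 * handler ignored a non-fatal error, so bringing the device up
	 * again would enable NAPI twice.
	 */
	if (!adapter->down)
		return;

	adapter->up_calls++;
	adapter->down = false;
}
```

      Only the case where the error handler really took the device down
      leads to a bring-up in this sketch.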

      Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
      Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
      Fixes: 004d2506 ("igb: Fix igb_down hung on surprise removal")
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Aleksandr Loktionov's avatar
      i40e: Fix macvlan leak by synchronizing access to mac_filter_hash · dac6c7b3
      Aleksandr Loktionov authored
      This patch addresses a macvlan leak issue in the i40e driver caused by
      concurrent access to vsi->mac_filter_hash. The leak occurs when multiple
      threads attempt to modify the mac_filter_hash simultaneously, leading to
      inconsistent state and potential memory leaks.
      
      To fix this, we now wrap the calls to i40e_del_mac_filter() and zeroing
      vf->default_lan_addr.addr with spin_lock/unlock_bh(&vsi->mac_filter_hash_lock),
      ensuring atomic operations and preventing concurrent access.
      
      Additionally, we add lockdep_assert_held(&vsi->mac_filter_hash_lock) in
      i40e_add_mac_filter() to help catch similar issues in the future.
      
      Reproduction steps:
      1. Spawn VFs and configure port vlan on them.
      2. Trigger concurrent macvlan operations (e.g., adding and deleting
      	portvlan and/or mac filters).
      3. Observe the potential memory leak and inconsistent state in the
      	mac_filter_hash.
      
      This synchronization ensures the integrity of the mac_filter_hash and prevents
      the described leak.
      
      Fixes: fed0d9f1 ("i40e: Fix VF's MAC Address change on VM")
      Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Marcin Szycik's avatar
      ice: Fix increasing MSI-X on VF · bce9af1b
      Marcin Szycik authored
      Increasing MSI-X value on a VF leads to invalid memory operations. This
      is caused by not reallocating some arrays.
      
      Reproducer:
        modprobe ice
        echo 0 > /sys/bus/pci/devices/$PF_PCI/sriov_drivers_autoprobe
        echo 1 > /sys/bus/pci/devices/$PF_PCI/sriov_numvfs
        echo 17 > /sys/bus/pci/devices/$VF0_PCI/sriov_vf_msix_count
      
      Default MSI-X is 16, so 17 and above triggers this issue.
      
      KASAN reports:
      
        BUG: KASAN: slab-out-of-bounds in ice_vsi_alloc_ring_stats+0x38d/0x4b0 [ice]
        Read of size 8 at addr ffff8888b937d180 by task bash/28433
        (...)
      
        Call Trace:
         (...)
         ? ice_vsi_alloc_ring_stats+0x38d/0x4b0 [ice]
         kasan_report+0xed/0x120
         ? ice_vsi_alloc_ring_stats+0x38d/0x4b0 [ice]
         ice_vsi_alloc_ring_stats+0x38d/0x4b0 [ice]
         ice_vsi_cfg_def+0x3360/0x4770 [ice]
         ? mutex_unlock+0x83/0xd0
         ? __pfx_ice_vsi_cfg_def+0x10/0x10 [ice]
         ? __pfx_ice_remove_vsi_lkup_fltr+0x10/0x10 [ice]
         ice_vsi_cfg+0x7f/0x3b0 [ice]
         ice_vf_reconfig_vsi+0x114/0x210 [ice]
         ice_sriov_set_msix_vec_count+0x3d0/0x960 [ice]
         sriov_vf_msix_count_store+0x21c/0x300
         (...)
      
        Allocated by task 28201:
         (...)
         ice_vsi_cfg_def+0x1c8e/0x4770 [ice]
         ice_vsi_cfg+0x7f/0x3b0 [ice]
         ice_vsi_setup+0x179/0xa30 [ice]
         ice_sriov_configure+0xcaa/0x1520 [ice]
         sriov_numvfs_store+0x212/0x390
         (...)
      
      To fix it, use ice_vsi_rebuild() instead of ice_vf_reconfig_vsi(). This
      causes the required arrays to be reallocated taking the new queue count
      into account (ice_vsi_realloc_stat_arrays()). Set req_txq and req_rxq
      before ice_vsi_rebuild(), so that realloc uses the newly set queue
      count.
      
      Additionally, ice_vsi_rebuild() does not remove VSI filters
      (ice_fltr_remove_all()), so ice_vf_init_host_cfg() is no longer
      necessary.

      Reported-by: Jacob Keller <jacob.e.keller@intel.com>
      Fixes: 2a2cb4c6 ("ice: replace ice_vf_recreate_vsi() with ice_vf_reconfig_vsi()")
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Wojciech Drewek's avatar
      ice: Flush FDB entries before reset · fbcb968a
      Wojciech Drewek authored
      Triggering a reset while in switchdev mode causes errors [1]. By the
      time the notifications arrive, the rules are already gone because the
      switch content is flushed on reset. The rules were deleted from HW but
      SW still thinks they exist, so when we get a
      SWITCHDEV_FDB_DEL_TO_DEVICE notification we try to delete a rule that
      no longer exists.

      We can avoid these errors by clearing the rules early in the reset
      flow, before they are removed from HW. The switchdev API is then
      notified that the rules were removed, so we won't get a
      SWITCHDEV_FDB_DEL_TO_DEVICE notification.
      Remove the unnecessary ice_clear_sw_switch_recipes().
      
      [1]
      ice 0000:01:00.0: Failed to delete FDB forward rule, err: -2
      ice 0000:01:00.0: Failed to delete FDB guard rule, err: -2
      
      Fixes: 7c945a1a ("ice: Switchdev FDB events support")
      Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
      Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
      Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Marcin Szycik's avatar
      ice: Fix netif_is_ice() in Safe Mode · 8e60dbcb
      Marcin Szycik authored
      netif_is_ice() works by checking the pointer to netdev ops. However, it
      only checks for the default ice_netdev_ops, not ice_netdev_safe_mode_ops,
      so in Safe Mode it always returns false, which is unintuitive. While it
      doesn't look like netif_is_ice() is currently being called anywhere in Safe
      Mode, this could change and potentially lead to unexpected behaviour.
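
      A user-space sketch of an ops-pointer check that also covers Safe Mode
      could look like this. All sketch_* names are hypothetical stand-ins
      for the netdev and ops structures named above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a netdev ops table and a netdev. */
struct sketch_netdev_ops {
	int unused;
};

struct sketch_netdev {
	const struct sketch_netdev_ops *ops;
};

/* One ops table for normal operation, one for Safe Mode. */
static const struct sketch_netdev_ops sketch_ice_ops = { 0 };
static const struct sketch_netdev_ops sketch_ice_safe_mode_ops = { 0 };

/* Identify the driver by comparing against both of its ops tables. */
static bool sketch_netif_is_ice(const struct sketch_netdev *dev)
{
	return dev != NULL &&
	       (dev->ops == &sketch_ice_ops ||
		dev->ops == &sketch_ice_safe_mode_ops);
}
```

      Comparing against only the first table is what makes a Safe Mode
      device look like it belongs to another driver.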
      
      Fixes: df006dd4 ("ice: Add initial support framework for LAG")
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>