1. 10 Apr, 2021 2 commits
    • netfilter: bridge: add pre_exit hooks for ebtable unregistration · 7ee3c61d
      Florian Westphal authored
      Just like ip/ip6/arptables, the hooks have to be removed, then
      synchronize_rcu() has to be called to make sure no more packets are being
      processed before the ruleset data is released.
      
      Place the hook unregistration in the pre_exit hook, then call the new
      ebtables pre_exit function from there.
      
      Years ago, when netns support was first added for netfilter+ebtables,
      this used an older (now removed) netfilter hook unregister API that did
      an unconditional synchronize_rcu().
      
      Now that everything is done with call_rcu, the ebtable_{filter,nat,broute}
      pernet exit handlers may free the ebtable ruleset while packets are still
      in flight.
      
      This can only happen on module removal, not during netns exit.
      
      The new function expects the table name, not the table struct.
      
      This is because an upcoming patch set (targeting -next) will remove all
      net->xt.{nat,filter,broute}_table instances, which makes it necessary
      to avoid external references to those member variables.
      
      The existing APIs will be converted, so follow the upcoming scheme of
      passing name + hook type instead.
      
      Fixes: aee12a0a ("ebtables: remove nf_hook_register usage")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
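The teardown order this commit relies on can be modelled in userspace. In the sketch below every name (model_table, model_pernet_ops, model_synchronize_rcu() and so on) is hypothetical rather than kernel API: all pre_exit handlers run first to unhook the tables, a single simulated grace period stands in for synchronize_rcu(), and only then do the exit handlers free the rulesets. The assertions in model_exit() encode the invariant that was violated before the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an ebtables table: hook state plus ruleset memory. */
struct model_table {
	bool hooks_registered;
	int *ruleset;
};

/* Hypothetical stand-in for a pernet_operations-like pair of callbacks. */
struct model_pernet_ops {
	void (*pre_exit)(struct model_table *t);
	void (*exit)(struct model_table *t);
};

static int grace_periods;	/* counts simulated synchronize_rcu() calls */

void model_synchronize_rcu(void)
{
	grace_periods++;
}

/* pre_exit: unhook the table so no new packets can enter it. */
void model_pre_exit(struct model_table *t)
{
	t->hooks_registered = false;
}

/* exit: freeing is only safe once hooks are gone and a grace period passed. */
void model_exit(struct model_table *t)
{
	assert(!t->hooks_registered);
	assert(grace_periods > 0);
	free(t->ruleset);
	t->ruleset = NULL;
}

/* Teardown: every pre_exit, then one grace period, then every exit. */
void model_cleanup(const struct model_pernet_ops *ops,
		   struct model_table *tables, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (ops->pre_exit)
			ops->pre_exit(&tables[i]);
	model_synchronize_rcu();
	for (size_t i = 0; i < n; i++)
		ops->exit(&tables[i]);
}
```

In this model, the single grace period between the two loops covers every table being torn down, rather than each exit handler having to wait on its own.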
    • netfilter: nft_limit: avoid possible divide error in nft_limit_init · b895bdf5
      Eric Dumazet authored
      div_u64() divides a u64 by a u32.
      
      nft_limit_init() needs to divide a u64 by a u64, so use the appropriate
      math function, div64_u64().
      
      divide error: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 8390 Comm: syz-executor188 Not tainted 5.12.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:div_u64_rem include/linux/math64.h:28 [inline]
      RIP: 0010:div_u64 include/linux/math64.h:127 [inline]
      RIP: 0010:nft_limit_init+0x2a2/0x5e0 net/netfilter/nft_limit.c:85
      Code: ef 4c 01 eb 41 0f 92 c7 48 89 de e8 38 a5 22 fa 4d 85 ff 0f 85 97 02 00 00 e8 ea 9e 22 fa 4c 0f af f3 45 89 ed 31 d2 4c 89 f0 <49> f7 f5 49 89 c6 e8 d3 9e 22 fa 48 8d 7d 48 48 b8 00 00 00 00 00
      RSP: 0018:ffffc90009447198 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 0000200000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff875152e6 RDI: 0000000000000003
      RBP: ffff888020f80908 R08: 0000200000000000 R09: 0000000000000000
      R10: ffffffff875152d8 R11: 0000000000000000 R12: ffffc90009447270
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  000000000097a300(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000200001c4 CR3: 0000000026a52000 CR4: 00000000001506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       nf_tables_newexpr net/netfilter/nf_tables_api.c:2675 [inline]
       nft_expr_init+0x145/0x2d0 net/netfilter/nf_tables_api.c:2713
       nft_set_elem_expr_alloc+0x27/0x280 net/netfilter/nf_tables_api.c:5160
       nf_tables_newset+0x1997/0x3150 net/netfilter/nf_tables_api.c:4321
       nfnetlink_rcv_batch+0x85a/0x21b0 net/netfilter/nfnetlink.c:456
       nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:580 [inline]
       nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:598
       netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
       netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:674
       ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2404
       __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c26844ed ("netfilter: nf_tables: Fix nft limit burst handling")
      Fixes: 3e0f64b7 ("netfilter: nft_limit: fix packet ratelimiting")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Diagnosed-by: Luigi Rizzo <lrizzo@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
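The failure mode can be reproduced in a few lines of userspace C. The model_* functions below are hypothetical stand-ins that mirror the argument types of the kernel's div_u64() (u32 divisor) and div64_u64() (u64 divisor); they are not the nft_limit code. The key point is that passing a 64-bit value to a 32-bit divisor parameter silently drops the upper half, so a divisor such as 1ULL << 32 arrives as 0.

```c
#include <assert.h>
#include <stdint.h>

/* Models div_u64(u64 dividend, u32 divisor): the divisor is only 32 bits. */
uint64_t model_div_u64(uint64_t dividend, uint32_t divisor)
{
	assert(divisor != 0);	/* the kernel raises a divide error instead */
	return dividend / divisor;
}

/* Models div64_u64(u64 dividend, u64 divisor): the full divisor is kept. */
uint64_t model_div64_u64(uint64_t dividend, uint64_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}

/* The value a u32 divisor parameter actually receives from a u64 argument. */
uint32_t divisor_seen_by_div_u64(uint64_t divisor)
{
	return (uint32_t)divisor;	/* upper 32 bits are discarded */
}
```

With the u64 helper, model_div64_u64(1ULL << 45, 1ULL << 32) returns 8192; handing the same divisor to model_div_u64() would give it a divisor of 0.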
  2. 30 Mar, 2021 8 commits
    • netfilter: conntrack: do not print icmpv6 as unknown via /proc · fbea3180
      Pablo Neira Ayuso authored
      /proc/net/nf_conntrack shows icmpv6 as unknown.
      
      Fixes: 09ec82f5 ("netfilter: conntrack: remove protocol name from l4proto struct")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: flowtable: fix NAT IPv6 offload mangling · 0e07e25b
      Pablo Neira Ayuso authored
      Fix an out-of-bounds access in the address array.
      
      Fixes: 5c27d8d7 ("netfilter: nf_flow_table_offload: add IPv6 support")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
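The commit message does not spell out the bug, but a common way to overrun an IPv6 address array is to use a byte offset as the index into an array of four 32-bit words. The sketch below illustrates that pitfall under that assumption; struct mangle_op and mangle_ipv6_addr() are hypothetical and do not reproduce the nf_flow_table_offload code. It keeps the header byte offset and the word index in separate variables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPV6_ADDR_WORDS 4	/* 128-bit address = four 32-bit words */

/* Hypothetical record of one 32-bit header rewrite. */
struct mangle_op {
	unsigned int offset;	/* byte offset of the word in the header */
	uint32_t value;		/* new 32-bit value */
};

/*
 * Emit one mangle op per 32-bit word of an IPv6 address starting at byte
 * offset 'base'. 'byte' advances by 4 for the header offset while 'word'
 * advances by 1 for the array index, so addr[] is never read past addr[3].
 */
size_t mangle_ipv6_addr(struct mangle_op *ops, unsigned int base,
			const uint32_t addr[IPV6_ADDR_WORDS])
{
	size_t n = 0;

	for (unsigned int byte = 0, word = 0;
	     byte < IPV6_ADDR_WORDS * sizeof(uint32_t);
	     byte += sizeof(uint32_t), word++) {
		ops[n].offset = base + byte;
		ops[n].value = addr[word];
		n++;
	}
	return n;
}
```

Calling mangle_ipv6_addr(ops, 8, addr) emits rewrites at byte offsets 8, 12, 16 and 20, where the source address sits in the fixed IPv6 header. A variant that indexed addr[byte] would read addr[4], addr[8] and addr[12], all past the end of the array.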
    • net: let skb_orphan_partial wake-up waiters. · 9adc89af
      Paolo Abeni authored
      Currently skb_orphan_partial() can end up freeing the socket wmem
      without waking up any processes waiting for more write memory.
      
      If the partially orphaned skb is attached to a UDP (or raw) socket,
      the missing wake-up can hang user space.
      
      Even for TCP sockets, not calling the sk destructor could have bad
      effects on TSQ (TCP Small Queues).
      
      Address the issue by using skb_orphan() to release the sk wmem before
      setting the new sock_efree destructor. Additionally, bundle the
      whole ownership update in a new helper, so that other potential
      users can later avoid duplicating code.
      
      v1 -> v2:
       - use skb_orphan() instead of open coding it (Eric)
       - provide a helper for the ownership change (Eric)
      
      Fixes: f6ba8d33 ("netem: fix skb_orphan_partial()")
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
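The difference between adjusting the write-memory counter by hand and running the original destructor can be shown with a small userspace model. All names below are hypothetical; model_set_owner_safe() only mirrors the sequence the commit describes (orphan first, then install a sock_efree-like destructor), and model_set_owner_open_coded() shows the problematic shape for contrast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_sock {
	int wmem_alloc;		/* bytes charged to the socket */
	int refcnt;
	bool writer_woken;	/* set when a blocked writer would be woken */
};

struct model_skb {
	struct model_sock *sk;
	int truesize;
	void (*destructor)(struct model_skb *skb);
};

/* Like a wfree destructor: release the wmem charge and wake writers. */
void model_sock_wfree(struct model_skb *skb)
{
	skb->sk->wmem_alloc -= skb->truesize;
	skb->sk->writer_woken = true;
}

/* Like sock_efree: only drops the socket reference. */
void model_sock_efree(struct model_skb *skb)
{
	skb->sk->refcnt--;
}

/* Analogue of skb_orphan(): run the current destructor, then detach. */
void model_orphan(struct model_skb *skb)
{
	if (skb->destructor) {
		skb->destructor(skb);
		skb->destructor = NULL;
		skb->sk = NULL;
	}
}

/* Problematic shape: wmem is released by hand, so nobody is woken. */
void model_set_owner_open_coded(struct model_skb *skb, struct model_sock *sk)
{
	sk->refcnt++;
	skb->sk->wmem_alloc -= skb->truesize;
	skb->destructor = model_sock_efree;
	skb->sk = sk;
}

/* Fixed shape: orphan through the normal path, then take ownership. */
void model_set_owner_safe(struct model_skb *skb, struct model_sock *sk)
{
	sk->refcnt++;
	model_orphan(skb);
	skb->destructor = model_sock_efree;
	skb->sk = sk;
}
```

With the open-coded variant, wmem_alloc drops to zero but writer_woken stays false, which is the lost wake-up that can hang a UDP or raw socket writer.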
    • sch_htb: fix null pointer dereference on a null new_q · ae81feb7
      Yunjian Wang authored
      Currently, if new_q is NULL, the NULL new_q pointer will be
      dereferenced when 'q->offload' is true. Fix this by adding braces
      around htb_parent_to_leaf_offload() so that it is only called when
      new_q is not NULL.
      
      Addresses-Coverity: ("Dereference after null check")
      Fixes: d03b195b ("sch_htb: Hierarchical QoS hardware offload")
      Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
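A userspace sketch of the control-flow pattern follows. The names are hypothetical and the exact shape is an assumption based on the commit message; the point is only that the offload call must sit inside the same new_q != NULL block as the other uses of new_q.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_qdisc {
	int setup_calls;
	int offload_calls;
};

/* Hypothetical per-child setup step. */
void model_setup_child(struct model_qdisc *q)
{
	q->setup_calls++;
}

/* Hypothetical stand-in for a call that dereferences its argument. */
void model_parent_to_leaf_offload(struct model_qdisc *new_q)
{
	assert(new_q != NULL);	/* the kernel would oops here instead */
	new_q->offload_calls++;
}

void model_graft_leaf(struct model_qdisc *new_q, bool offload)
{
	/*
	 * Buggy shape: without braces only the first call is guarded.
	 *	if (new_q)
	 *		model_setup_child(new_q);
	 *	model_parent_to_leaf_offload(new_q);   // new_q may be NULL
	 */
	if (offload) {
		if (new_q) {
			model_setup_child(new_q);
			model_parent_to_leaf_offload(new_q);
		}
	}
}
```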
    • net: qrtr: Fix memory leak on qrtr_tx_wait failure · 8a03dd92
      Loic Poulain authored
      qrtr_tx_wait() does not check for radix_tree_insert() failure, leaving
      the 'flow' object unreferenced after qrtr_tx_wait() returns. Fix that
      by releasing the flow on radix_tree_insert() failure.
      
      Fixes: 5fdeb0d3 ("net: qrtr: Implement outgoing flow control")
      Reported-by: syzbot+739016799a89c530b32a@syzkaller.appspotmail.com
      Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
      Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
      Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
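The general rule behind this fix is that an object allocated for insertion into a lookup structure is owned by the caller until the insertion succeeds, so the caller must free it on failure. The sketch below uses a tiny fixed-size table as a hypothetical stand-in for the radix tree; model_table_insert() returns -ENOMEM when full, in the same spirit as radix_tree_insert() returning an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MODEL_TABLE_SLOTS 2

struct model_flow {
	unsigned long key;
};

struct model_flow_table {
	struct model_flow *slot[MODEL_TABLE_SLOTS];
	int used;
};

static int live_flows;	/* allocated and not yet freed */

/* Hypothetical insert that can fail. */
int model_table_insert(struct model_flow_table *t, struct model_flow *f)
{
	if (t->used == MODEL_TABLE_SLOTS)
		return -ENOMEM;
	t->slot[t->used++] = f;
	return 0;
}

/* Find the flow for key, allocating and inserting one if it is missing. */
struct model_flow *model_flow_lookup_or_create(struct model_flow_table *t,
					       unsigned long key)
{
	for (int i = 0; i < t->used; i++)
		if (t->slot[i]->key == key)
			return t->slot[i];

	struct model_flow *flow = calloc(1, sizeof(*flow));
	if (!flow)
		return NULL;
	live_flows++;
	flow->key = key;

	if (model_table_insert(t, flow)) {
		/* Nothing else references flow, so free it here or it leaks. */
		free(flow);
		live_flows--;
		return NULL;
	}
	return flow;
}
```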
    • net: sched: bump refcount for new action in ACT replace mode · 6855e821
      Kumar Kartikeya Dwivedi authored
      Currently, action creation using the ACT API in replace mode is buggy.
      When invoked for a non-existent action index 42,
      
      	tc action replace action bpf obj foo.o sec <xyz> index 42
      
      the kernel creates the action, fills up the netlink response, and then
      just deletes the action after notifying userspace.
      
      	tc action show action bpf
      
      doesn't list the action.
      
      This happens due to the following sequence when ovr = 1 (replace mode)
      is enabled:
      
      tcf_idr_check_alloc is used to atomically check and either obtain a
      reference for an existing action at the index, or reserve the index
      slot using a dummy entry (ERR_PTR(-EBUSY)).
      
      This is necessary because pointers to these actions are held after
      dropping the idrinfo lock; the reference count must be bumped because
      we need to insert the actions and notify userspace by dumping their
      attributes. Finally, we drop the reference we took using the
      tcf_action_put_many() call in tcf_action_add(). However, when a new
      action is created because the index is free, its refcount remains one.
      Paired with the put_many call, this leads to the kernel setting up the
      action, notifying userspace of its creation, and then tearing it down.
      For existing actions, the original reference is still held, so they
      remain unaffected.
      
      Fortunately, due to the rtnl_lock serialization requirement, such an
      action with refcount == 1 will not be concurrently deleted by anything
      else; at most, the CLS API can move its refcount up and down by binding
      to it after it has been published by tcf_idr_insert_many(). Since the
      refcount is at least one until the put_many call, the CLS API cannot
      delete it. Also, the __tcf_action_put() release path already ensures a
      deterministic outcome (either a new action will be created or the
      existing action will be reused if the CLS API tries to bind to the
      action concurrently) due to idr lock serialization.
      
      We fix this by setting the refcount of newly created actions to 2 in
      ACT API replace mode. A relaxed store suffices, as the action only
      becomes visible after the tcf_idr_insert_many() call.
      
      Note that for creation or overwriting using only the CLS API (i.e.
      bind = 1), overwriting an existing action object is not allowed, and
      any such request is silently ignored (without error).
      
      The refcount bump that occurs in the tcf_idr_check_alloc() call there
      for an existing action pairs with the tcf_exts_destroy() call made from
      the owner module for the same action. When an action is created, there
      is no existing action, so no tcf_exts_destroy() callback happens.
      
      This means no code changes are needed for the CLS API.
      
      Fixes: cae422f3 ("net: sched: use reference counting action init")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
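The reference counting described above can be checked with a small model. All names are hypothetical: an existing action gets an extra reference from the lookup, and the end-of-request put drops one reference from every action. A new action that starts at 1 is freed by that put; starting it at 2, as the fix does, leaves the table's reference in place.

```c
#include <assert.h>
#include <stdbool.h>

struct model_action {
	int refcnt;
	bool freed;
};

/* Drop one reference; the action is destroyed when none remain. */
void model_action_put(struct model_action *a)
{
	if (--a->refcnt == 0)
		a->freed = true;
}

/* Lookup found an existing action at the index: take an extra reference. */
void model_check_alloc_existing(struct model_action *a)
{
	a->refcnt++;
}

/* New action created at a free index in replace mode. */
void model_create_new(struct model_action *a, bool with_fix)
{
	a->freed = false;
	/* One reference for the table; the fix adds one for the request. */
	a->refcnt = with_fix ? 2 : 1;
}
```

Existing actions survive the final put either way, which matches the commit's observation that only newly created actions were affected.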
    • net/ncsi: Avoid channel_monitor hrtimer deadlock · 03cb4d05
      Milton Miller authored
      Calling ncsi_stop_channel_monitor() from channel_monitor is a guaranteed
      deadlock on SMP, because the stop function calls del_timer_sync() on the
      timer that invoked channel_monitor as its timer function.
      
      Handle the inherent race of marking the monitor disabled before
      deleting the timer by simply returning if enabled was cleared. After
      a timeout (the default case; the state is reset to START when a
      response is received), just set monitor.enabled to false.
      
      If the channel has an entry on the channel_queue list, or if the
      state is not ACTIVE or INACTIVE, then warn, mark the timer stopped,
      and don't restart it, as the locking is broken somehow.
      
      Fixes: 0795fb20 ("net/ncsi: Stop monitor if channel times out or is inactive")
      Signed-off-by: Milton Miller <miltonm@us.ibm.com>
      Signed-off-by: Eddie James <eajames@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
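The callback logic described above can be sketched as a small state machine. Everything here is hypothetical (the states, the fields, and the MODEL_TIMER_* return values are not ncsi or kernel timer API); what it illustrates is that the timer function never waits for its own timer to finish, it only decides whether to rearm.

```c
#include <assert.h>
#include <stdbool.h>

enum model_mon_state { MON_START, MON_WAIT, MON_TIMEOUT };
enum model_timer_ret { MODEL_TIMER_STOP, MODEL_TIMER_REARM };

struct model_channel {
	bool monitor_enabled;
	enum model_mon_state state;
	bool on_queue;			/* hypothetical: on channel_queue */
	bool active_or_inactive;	/* hypothetical: state is ACTIVE/INACTIVE */
	int warnings;
};

/* Timer callback: never stops its own timer synchronously. */
enum model_timer_ret model_channel_monitor(struct model_channel *nc)
{
	/* Stop was requested concurrently: accept the race and just return. */
	if (!nc->monitor_enabled)
		return MODEL_TIMER_STOP;

	/* Unexpected channel state means locking is broken: warn and stop. */
	if (nc->on_queue || !nc->active_or_inactive) {
		nc->warnings++;
		nc->monitor_enabled = false;
		return MODEL_TIMER_STOP;
	}

	switch (nc->state) {
	case MON_START:
		nc->state = MON_WAIT;		/* e.g. send a probe, then wait */
		return MODEL_TIMER_REARM;
	case MON_WAIT:
		nc->state = MON_TIMEOUT;	/* still waiting for a response */
		return MODEL_TIMER_REARM;
	default:
		/* Timed out: a response would have reset state to MON_START. */
		nc->monitor_enabled = false;
		return MODEL_TIMER_STOP;
	}
}
```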
    • ethernet/netronome/nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx · 6e5a03bc
      Lv Yunlong authored
      In nfp_bpf_ctrl_msg_rx(), if
      nfp_ccm_get_type(skb) == NFP_CCM_TYPE_BPF_BPF_EVENT is true, the skb
      is freed, but it is still used afterwards by nfp_ccm_rx(&bpf->ccm, skb).
      
      This patch adds a return after the skb is freed.
      
      Fixes: bcf0cafa ("nfp: split out common control message handling code")
      Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
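The fix follows the usual rule that a buffer must not be used after the branch that frees it. In this sketch, with hypothetical message types and handlers standing in for the nfp functions, the early return keeps a consumed event message from reaching the generic receive path.

```c
#include <assert.h>
#include <stdlib.h>

enum model_msg_type { MSG_EVENT, MSG_REPLY };

struct model_msg {
	enum model_msg_type type;
};

static int events_handled, replies_handled;

/* Stands in for a generic receive path that reads and then frees the message. */
void model_generic_rx(struct model_msg *m)
{
	assert(m->type == MSG_REPLY);
	replies_handled++;
	free(m);
}

void model_ctrl_msg_rx(struct model_msg *m)
{
	if (m->type == MSG_EVENT) {
		events_handled++;
		free(m);
		return;	/* m is freed: it must not reach model_generic_rx() */
	}
	model_generic_rx(m);
}
```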
  3. 29 Mar, 2021 26 commits
  4. 26 Mar, 2021 4 commits
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 75887e88
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2021-03-25
      
      This series contains updates to the virtchnl header file and the i40e
      driver.
      
      Norbert removes the padding added to the virtchnl RSS structures, as it
      causes issues when iterating over the arrays.
      
      Mateusz adds Asym_Pause as supported to allow these settings to be set,
      as the hardware supports it.
      
      Eryk fixes an issue where a VF reset occurring while VFs are being
      released could cause a call trace.
      
      Arkadiusz moves TC setup before resource setup, as previously it was
      possible to enter with a NULL q_vector, causing a kernel oops.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sch_red: fix off-by-one checks in red_check_params() · 3a87571f
      Eric Dumazet authored
      This fixes the following syzbot report:
      
      UBSAN: shift-out-of-bounds in ./include/net/red.h:237:23
      shift exponent 32 is too large for 32-bit type 'unsigned int'
      CPU: 1 PID: 8418 Comm: syz-executor170 Not tainted 5.12.0-rc4-next-20210324-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x141/0x1d7 lib/dump_stack.c:120
       ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
       __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327
       red_set_parms include/net/red.h:237 [inline]
       choke_change.cold+0x3c/0xc8 net/sched/sch_choke.c:414
       qdisc_create+0x475/0x12f0 net/sched/sch_api.c:1247
       tc_modify_qdisc+0x4c8/0x1a50 net/sched/sch_api.c:1663
       rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5553
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
       netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
       netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:674
       ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2404
       __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x43f039
      Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffdfa725168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000400488 RCX: 000000000043f039
      RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000004
      RBP: 0000000000403020 R08: 0000000000400488 R09: 0000000000400488
      R10: 0000000000400488 R11: 0000000000000246 R12: 00000000004030b0
      R13: 0000000000000000 R14: 00000000004ac018 R15: 0000000000400488
      
      Fixes: 8afa10cb ("net_sched: red: Avoid illegal values")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
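Shifting a 32-bit unsigned value by 32 or more is undefined behaviour in C, which is what UBSAN flags above. A parameter check therefore has to reject exponents that are >= 32, not only those > 32. The helpers below are a hypothetical illustration of such a bound, not the actual red_check_params().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A shift of a 32-bit value is only defined for exponents 0..31. */
bool model_shift_ok(uint8_t shift)
{
	return shift < 32;	/* an off-by-one "shift <= 32" would accept 32 */
}

/* Performs the shift only after validation; returns false otherwise. */
bool model_scaled(uint32_t base, uint8_t shift, uint32_t *out)
{
	if (!model_shift_ok(shift))
		return false;
	*out = base << shift;
	return true;
}
```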
    • Merge branch 'tunnel-shinfo' · 3cec1921
      David S. Miller authored
      Antoine Tenart says:
      
      ====================
      net: do not modify the shared tunnel info when PMTU triggers an ICMP reply
      
      The series fixes an issue where a shared ip_tunnel_info is modified when
      PMTU triggers an ICMP reply in vxlan and geneve, causing subsequent
      packets in that flow to have a wrong destination address if the flow
      isn't updated. Detailed information is given in each of the two
      commits.
      
      This was tested manually with OVS, and I ran the PMTU selftests with
      kmemleak enabled (all OK, none skipped).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • geneve: do not modify the shared tunnel info when PMTU triggers an ICMP reply · 68c1a943
      Antoine Tenart authored
      When the interface is part of a bridge or an Open vSwitch port and a
      packet exceeds a PMTU estimate, an ICMP reply is sent to the sender.
      When using the external mode (collect metadata), the source and
      destination addresses are reversed, so that Open vSwitch can match the
      packet against an existing (reverse) flow.
      
      But inverting the source and destination addresses in the shared
      ip_tunnel_info makes subsequent packets of the flow use a wrong
      destination address (packets are tunnelled to the host itself) if the
      flow isn't updated, which is what happens with Open vSwitch until the
      flow times out.
      
      Fix this by uncloning the skb's ip_tunnel_info before inverting its
      source and destination addresses, so that the modification is only
      made for the PMTU packet, not the following ones.
      
      Fixes: c1a800e8 ("geneve: Support for PMTU discovery on directly bridged links")
      Tested-by: Eelco Chaudron <echaudro@redhat.com>
      Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
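The fix is a copy-on-write step: before swapping the addresses for the ICMP reply, the packet obtains tunnel metadata it owns exclusively. This userspace model uses hypothetical names and a plain integer reference count; it does not reproduce how the kernel stores ip_tunnel_info.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct model_tun_info {
	int refcnt;	/* number of packets sharing this metadata */
	uint32_t src;
	uint32_t dst;
};

/* Return metadata that is safe to modify, copying it if it is shared. */
struct model_tun_info *model_tun_info_unclone(struct model_tun_info **pinfo)
{
	struct model_tun_info *info = *pinfo;

	if (info->refcnt == 1)
		return info;	/* sole owner: modify in place */

	struct model_tun_info *copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;
	*copy = *info;
	copy->refcnt = 1;
	info->refcnt--;		/* this packet drops its shared reference */
	*pinfo = copy;
	return copy;
}

/* Reverse addresses for the ICMP reply without touching shared state. */
int model_prepare_icmp_reply(struct model_tun_info **pinfo)
{
	struct model_tun_info *info = model_tun_info_unclone(pinfo);

	if (!info)
		return -1;
	uint32_t tmp = info->src;
	info->src = info->dst;
	info->dst = tmp;
	return 0;
}
```

After model_prepare_icmp_reply(), the packet holds a private copy with swapped addresses, while the shared metadata used by the rest of the flow keeps its original source and destination.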