1. 31 Mar, 2021 40 commits
    • net: enetc: add support for XDP_TX · 7ed2bc80
      Vladimir Oltean authored
      For reflecting packets back into the interface they came from, we create
      an array of TX software BDs derived from the RX software BDs. Therefore,
      we need to extend the TX software BD structure to contain most of the
      stuff that's already present in the RX software BD structure, for
      reasons that will become evident in a moment.
      
      For a frame with the XDP_TX verdict, we don't reuse any buffer right
      away as we do for XDP_DROP (the same page half) or XDP_PASS (the other
      page half, same as the skb code path).
      
      Because the buffer transfers ownership from the RX ring to the TX ring,
      reusing any page half right away is very dangerous. So what we can do is
      we can recycle the same page half as soon as TX is complete.
      
      The code path is:
      enetc_poll
      -> enetc_clean_rx_ring_xdp
         -> enetc_xdp_tx
         -> enetc_refill_rx_ring
      (time passes, another MSI interrupt is raised)
      enetc_poll
      -> enetc_clean_tx_ring
         -> enetc_recycle_xdp_tx_buff
      
      But that creates a problem, because there is a potentially large time
      window between enetc_xdp_tx and enetc_recycle_xdp_tx_buff, a period in
      which we'll have fewer and fewer RX buffers.
      
      Basically, when the ship starts sinking, the knee-jerk reaction is to
      let enetc_refill_rx_ring do what it does for the standard skb code path
      (refill every 16 consumed buffers), but that turns out to be very
      inefficient. The problem is that we have no rx_swbd->page at our
      disposal from the enetc_reuse_page path, so enetc_refill_rx_ring would
      have to call enetc_new_page for every buffer that we refill (if we
      choose to refill at this early stage). Very inefficient, it only makes
      the problem worse, because page allocation is an expensive process, and
      CPU time is exactly what we're lacking.
      
      Additionally, there is an even bigger problem: if we let
      enetc_refill_rx_ring top up the ring's buffers again from the RX path,
      remember that the buffers sent to transmission haven't disappeared
      anywhere. They will be eventually sent, and processed in
      enetc_clean_tx_ring, and an attempt will be made to recycle them.
      But surprise, the RX ring is already full of new buffers, because we
      were premature in deciding that we should refill. So not only did we
      take the expensive decision of allocating new pages, but now we must
      throw away perfectly good and reusable buffers.
      
      So what we do is we implement an elastic refill mechanism, which keeps
      track of the number of in-flight XDP_TX buffer descriptors. We top up
      the RX ring only up to the total ring capacity minus the number of BDs
      that are in flight (because we know that those BDs will return to us
      eventually).
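
The elastic refill rule above can be sketched as a few lines of bookkeeping. This is a minimal model, not the driver's code: `struct rx_ring_model`, its fields and the three helpers are hypothetical names made up for the illustration, and the ring is assumed to be usable up to its full capacity.

```c
#include <assert.h>

/* Hypothetical, simplified bookkeeping for one RX ring (not the driver's struct). */
struct rx_ring_model {
	int bd_count;         /* total ring capacity */
	int populated;        /* BDs that currently hold a buffer for hardware */
	int xdp_tx_in_flight; /* buffers lent to the TX ring, not yet recycled */
};

/* A frame got the XDP_TX verdict: its buffer moves from the RX ring to TX. */
static void rx_xdp_tx(struct rx_ring_model *r)
{
	assert(r->populated > 0);
	r->populated--;
	r->xdp_tx_in_flight++;
}

/* TX completed: the same page half goes back into the RX ring. */
static void tx_recycle(struct rx_ring_model *r)
{
	assert(r->xdp_tx_in_flight > 0);
	r->xdp_tx_in_flight--;
	r->populated++;
}

/* Number of buffers the refill path is allowed to add right now. */
static int rx_refill_budget(const struct rx_ring_model *r)
{
	int budget = r->bd_count - r->xdp_tx_in_flight - r->populated;

	return budget > 0 ? budget : 0;
}
```

With a full ring of 64 buffers, ten XDP_TX verdicts leave a refill budget of zero, so nothing is allocated while those ten buffers are on their way back.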
      
      The enetc driver manages 1 RX ring per CPU, and the default TX ring
      management is the same. So we do XDP_TX towards the TX ring of the same
      index, because it is affined to the same CPU. This will probably not
      produce great results when we have a tc-taprio/tc-mqprio qdisc on the
      interface, because in that case, the number of TX rings might be
      greater, but I didn't add any checks for that yet (mostly because I
      didn't know what checks to add).
      
      It should also be noted that we need to change the DMA mapping direction
      for RX buffers, since they may now be reflected into the TX ring of the
      same device. We choose to use DMA_BIDIRECTIONAL instead of unmapping and
      remapping as DMA_TO_DEVICE, because performance is better this way.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: add support for XDP_DROP and XDP_PASS · d1b15102
      Vladimir Oltean authored
      For the RX ring, enetc uses an allocation scheme based on pages split
      into two buffers, which is already very efficient in terms of preventing
      reallocations / maximizing reuse, so I see no reason why I would change
      that.
      
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half B | half B | half B | half B | half B | half B | half B |
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half A | half A | half A | half A | half A | half A | half A | RX ring
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
           ^                                                     ^
           |                                                     |
       next_to_clean                                       next_to_alloc
                                                            next_to_use
      
                         +--------+--------+--------+--------+--------+
                         |        |        |        |        |        |
                         | half B | half B | half B | half B | half B |
                         |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |        |        |        |        |        |
       | half B | half B | half A | half A | half A | half A | half A | RX ring
       |        |        |        |        |        |        |        |
       +--------+--------+--------+--------+--------+--------+--------+
       |        |        |   ^                                   ^
       | half A | half A |   |                                   |
       |        |        | next_to_clean                   next_to_use
       +--------+--------+
                    ^
                    |
               next_to_alloc
      
      then when enetc_refill_rx_ring is called, whose purpose is to advance
      next_to_use, it sees that it can take buffers up to next_to_alloc, and
      it says "oh, hey, rx_swbd->page isn't NULL, I don't need to allocate
      one!".
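
A rough sketch of the page-half scheme in the figures, under stated assumptions: `struct rx_swbd_model`, `flip_half` and `refill_needs_alloc` are hypothetical names, the page is 4096 bytes, and "half A" is arbitrarily taken to be offset 0. The point is only that flipping an offset is much cheaper than allocating a page.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_HALF_SIZE (MODEL_PAGE_SIZE / 2)

/* Hypothetical stand-in for an RX software BD. */
struct rx_swbd_model {
	void *page;               /* NULL means no page is attached yet */
	unsigned int page_offset; /* 0 for half A, MODEL_HALF_SIZE for half B */
};

/* Switch the BD to the other half of the same page. */
static void flip_half(struct rx_swbd_model *b)
{
	assert(b->page != NULL);
	b->page_offset ^= MODEL_HALF_SIZE;
}

/* The refill path only has to allocate when no page is attached. */
static int refill_needs_alloc(const struct rx_swbd_model *b)
{
	return b->page == NULL;
}
```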
      
      The only problem is that for default PAGE_SIZE values of 4096, buffer
      sizes are 2048 bytes. While this is enough for normal skb allocations at
      an MTU of 1500 bytes, for XDP it isn't, because the XDP headroom is 256
      bytes, and including skb_shared_info and alignment, we end up being able
      to make use of only 1472 bytes, which is insufficient for the default
      MTU.
      
      To solve that problem, we implement scatter/gather processing in the
      driver, because we would really like to keep the existing allocation
      scheme. A packet of 1500 bytes is received in a buffer of 1472 bytes and
      another one of 28 bytes.
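
The numbers above can be checked with a little arithmetic. In this sketch the 2048, 256 and 1472 byte figures come from the message; the 320 bytes of tailroom (skb_shared_info plus alignment) is only what those figures imply, and the helper names are hypothetical.

```c
#include <assert.h>

#define HALF_PAGE_BYTES 2048 /* PAGE_SIZE of 4096 split in two */
#define XDP_HEADROOM     256 /* from the message */
#define USABLE_BYTES    1472 /* from the message */
/* Implied by the three figures above, not read from any header. */
#define IMPLIED_TAILROOM (HALF_PAGE_BYTES - XDP_HEADROOM - USABLE_BYTES)

/* How many RX buffers a frame of frame_len bytes is spread over. */
static int buffers_needed(int frame_len)
{
	assert(frame_len > 0);
	return (frame_len + USABLE_BYTES - 1) / USABLE_BYTES;
}

/* How many bytes land in the last buffer of the frame. */
static int last_buffer_len(int frame_len)
{
	int rem = frame_len % USABLE_BYTES;

	assert(frame_len > 0);
	return rem ? rem : USABLE_BYTES;
}
```

For a 1500 byte packet this gives two buffers, with 28 bytes in the second one.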
      
      Because the headroom required by XDP is different (and much larger) than
      the one required by the network stack, whenever a BPF program is added
      or deleted on the port, we drain the existing RX buffers and seed new
      ones with the required headroom. We also keep the required headroom in
      rx_ring->buffer_offset.
      
      The simplest way to implement XDP_PASS, where an skb must be created, is
      to create an xdp_buff based on the next_to_clean RX BDs, but not clear
      those BDs from the RX ring yet, just keep the original index at which
      the BDs for this frame started. Then, if the verdict is XDP_PASS,
      instead of converting the xdp_buff to an skb, we replay a call to
      enetc_build_skb (just as in the normal enetc_clean_rx_ring case),
      starting from the original BD index.
      
      We would also like to be minimally invasive to the regular RX data path,
      and not check whether there is a BPF program attached to the ring on
      every packet. So we create a separate RX ring processing function for
      XDP.
      
      Because we only install/remove the BPF program while the interface is
      down, we forgo the rcu_read_lock() in enetc_clean_rx_ring, since there
      shouldn't be any circumstance in which we are processing packets and
      there is a potentially freed BPF program attached to the RX ring.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: move up enetc_reuse_page and enetc_page_reusable · 65d0cbb4
      Vladimir Oltean authored
      For XDP_TX, we need to call enetc_reuse_page from enetc_clean_tx_ring,
      so we need to avoid a forward declaration.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: clean the TX software BD on the TX confirmation path · 1ee8d6f3
      Vladimir Oltean authored
      With the future introduction of some new fields into enetc_tx_swbd such
      as is_xdp_tx, is_xdp_redirect etc, we need not only to set these bits
      to true from the XDP_TX/XDP_REDIRECT code path, but also to false from
      the old code paths.
      
      This is because TX software buffer descriptors are kept in a ring that
      is a shadow of the hardware TX ring, so these structures keep getting
      reused, and there is always the possibility that when a software BD is
      reused (after we ran a full circle through the TX ring), the old user of
      the tx_swbd had set is_xdp_tx = true, and now we are sending a regular
      skb, which would need to set is_xdp_tx = false.
      
      To be minimally invasive to the old code paths, let's just scrub the
      software TX BD in the TX confirmation path (enetc_clean_tx_ring), once
      we know that nobody uses this software TX BD (tx_ring->next_to_clean
      hasn't yet been updated, and the TX paths check enetc_bd_unused which
      tells them if there's any more space in the TX ring for a new enqueue).
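
To illustrate why scrubbing on the confirmation path is enough, here is a small model. The struct and function names are hypothetical; only the field names is_xdp_tx and is_xdp_redirect come from the message. The old skb transmit path never touches the flags, yet it never sees a stale one, because the slot was zeroed when its previous user completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a TX software BD. */
struct tx_swbd_model {
	void *skb;
	bool is_xdp_tx;
	bool is_xdp_redirect;
};

/* XDP_TX path: marks the slot as carrying an XDP buffer. */
static void xmit_xdp_tx(struct tx_swbd_model *swbd)
{
	swbd->is_xdp_tx = true;
}

/* Old skb path: deliberately unaware of the new flags. */
static void xmit_skb(struct tx_swbd_model *swbd, void *skb)
{
	swbd->skb = skb;
}

/* TX confirmation: after the buffer is released, scrub the whole slot. */
static void tx_confirm(struct tx_swbd_model *swbd)
{
	assert(swbd != NULL);
	memset(swbd, 0, sizeof(*swbd));
}
```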
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: add a dedicated is_eof bit in the TX software BD · d504498d
      Vladimir Oltean authored
      In the transmit path, if we have a scatter/gather frame, it is put into
      multiple software buffer descriptors, the last of which has the skb
      pointer populated (which is necessary for rearming the TX MSI vector and
      for collecting the two-step TX timestamp from the TX confirmation path).
      
      At the moment, this is sufficient, but with XDP_TX, we'll need to
      service TX software buffer descriptors that don't have an skb pointer,
      however they might be final nonetheless. So add a dedicated bit for
      final software BDs that we populate and check explicitly. Also, we keep
      looking just for an skb when doing TX timestamping, because we don't
      want/need that for XDP.
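
A small sketch of the difference, with hypothetical names throughout (`struct tx_bd_model` and both counting helpers are made up for the illustration). Counting frames by "has an skb pointer" misses frames that carry no skb, while counting by the explicit end-of-frame bit does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a TX software BD. */
struct tx_bd_model {
	const void *skb; /* set only on the last BD of an skb frame */
	bool is_eof;     /* set on the last BD of any frame */
};

/* Old heuristic: a BD closes a frame if it carries the skb pointer. */
static int frames_by_skb(const struct tx_bd_model *bds, int n)
{
	int frames = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (bds[i].skb != NULL)
			frames++;
	return frames;
}

/* New rule: a BD closes a frame if its explicit end-of-frame bit is set. */
static int frames_by_eof(const struct tx_bd_model *bds, int n)
{
	int frames = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (bds[i].is_eof)
			frames++;
	return frames;
}
```

Given one two-BD skb frame followed by one two-BD XDP_TX frame, the first helper reports one frame and the second reports two.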
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: move skb creation into enetc_build_skb · a800abd3
      Vladimir Oltean authored
      We need to build an skb from two code paths now: from the plain RX data
      path and from the XDP data path when the verdict is XDP_PASS.
      
      Create a new enetc_build_skb function which contains the essential steps
      for building an skb based on the first and last positions of buffer
      descriptors within the RX ring.
      
      We also squash the enetc_process_skb function into enetc_build_skb,
      because what that function did wasn't very meaningful on its own.
      
      The "rx_frm_cnt++" instruction has been moved around napi_gro_receive
      for cosmetic reasons, to be in the same spot as rx_byte_cnt++, which
      itself must be before napi_gro_receive, because that's when we lose
      ownership of the skb.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: enetc: consume the error RX buffer descriptors in a dedicated function · 2fa423f5
      Vladimir Oltean authored
      We can and should check the RX BD errors before starting to build the
      skb. The only apparent reason why things are done in this backwards
      order is to spare one call to enetc_rxbd_next.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: remove extra dev_hold() for fallback tunnels · 0d7a7b20
      Eric Dumazet authored
      My previous commits added a dev_hold() in tunnels ndo_init(),
      but forgot to remove it from special functions setting up fallback tunnels.
      
      Fallback tunnels do call their respective ndo_init().
      
      This leads to various reports like :
      
      unregister_netdevice: waiting for ip6gre0 to become free. Usage count = 2
      
      Fixes: 48bb5697 ("ip6_tunnel: sit: proper dev_{hold|put} in ndo_[un]init methods")
      Fixes: 6289a98f ("sit: proper dev_{hold|put} in ndo_[un]init methods")
      Fixes: 40cb881b ("ip6_vti: proper dev_{hold|put} in ndo_[un]init methods")
      Fixes: 7f700334 ("ip6_gre: proper dev_{hold|put} in ndo_[un]init methods")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tipc: fix missing destroy_workqueue() on error in tipc_crypto_start() · ac1db7ac
      Yang Yingliang authored
      Add the missing destroy_workqueue() before return from
      tipc_crypto_start() in the error handling case.
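
The general shape of such a fix is the usual goto-based unwind. The sketch below is a user-space model only: `crypto_start_model`, `res_create` and `res_destroy` are hypothetical, malloc stands in for the workqueue, and which later step fails in the real function is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

static int live_resources; /* how many model resources are currently held */

static void *res_create(bool fail)
{
	void *p;

	if (fail)
		return NULL;
	p = malloc(1);
	if (p)
		live_resources++;
	return p;
}

static void res_destroy(void *p)
{
	if (!p)
		return;
	free(p);
	live_resources--;
}

struct crypto_model {
	void *wq;    /* stands in for the workqueue */
	void *later; /* stands in for whatever is set up after it */
};

static int crypto_start_model(struct crypto_model *c, bool fail_late)
{
	assert(c != NULL);
	c->later = NULL;
	c->wq = res_create(false);
	if (!c->wq)
		return -ENOMEM;

	c->later = res_create(fail_late);
	if (!c->later)
		goto err_destroy_wq;
	return 0;

err_destroy_wq:
	/* Without this step the first resource would leak on error. */
	res_destroy(c->wq);
	c->wq = NULL;
	return -ENOMEM;
}
```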
      
      Fixes: 1ef6f7c9 ("tipc: add automatic session key exchange")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'inet-shrink-netns' · ab1b4f0a
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      inet: shrink netns_ipv{4|6}
      
      This patch series works on reducing the footprint of netns_ipv4
      and netns_ipv6. Some sysctls are converted to bytes,
      and some fields are moved to reduce the number of holes
      and paddings.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: move ip6_dst_ops first in netns_ipv6 · 0dd39d95
      Eric Dumazet authored
      ip6_dst_ops has cache line alignment.
      
      Moving it to the beginning of netns_ipv6
      removes a 48 byte hole, and shrinks netns_ipv6
      from 12 to 11 cache lines.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: convert elligible sysctls to u8 · a6175633
      Eric Dumazet authored
      Convert most sysctls that can fit in a byte.
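
As a rough illustration of the saving, assuming a 4-byte int as on common Linux targets: four values stored as int take 16 bytes, while the same four stored as u8 take 4. The struct and helper names below are hypothetical and have nothing to do with the real netns_ipv6 layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical group of four small sysctl values, stored as int. */
struct sysctls_as_int {
	int a, b, c, d;
};

/* The same four values stored as bytes. */
struct sysctls_as_u8 {
	uint8_t a, b, c, d;
};

static_assert(sizeof(struct sysctls_as_u8) < sizeof(struct sysctls_as_int),
	      "byte-sized fields should shrink the struct");

/* A value is only eligible for conversion if it fits in 0..255. */
static bool fits_in_u8(long v)
{
	return v >= 0 && v <= UINT8_MAX;
}
```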
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: convert tcp_comp_sack_nr sysctl to u8 · 1c3289c9
      Eric Dumazet authored
      tcp_comp_sack_nr max value was already 255.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: convert igmp_link_local_mcast_reports sysctl to u8 · 7d4b37eb
      Eric Dumazet authored
      This sysctl is a bool, can use less storage.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: convert fib_multipath_{use_neigh|hash_policy} sysctls to u8 · be205fe6
      Eric Dumazet authored
      Make room for better packing of netns_ipv4
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: convert udp_l3mdev_accept sysctl to u8 · cd04bd02
      Eric Dumazet authored
      Reduce footprint of sysctls.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: convert fib_notify_on_flag_change sysctl to u8 · b2908fac
      Eric Dumazet authored
      Reduce footprint of sysctls.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: shrink netns_ipv4 by another cache line · 490f33c4
      Eric Dumazet authored
      By shuffling around some fields to remove 8 bytes of hole,
      we can save one cache line.
      
      pahole result before/after the patch :
      
      /* size: 768, cachelines: 12, members: 139 */
      /* sum members: 673, holes: 11, sum holes: 39 */
      /* padding: 56 */
      /* paddings: 2, sum paddings: 7 */
      /* forced alignments: 1 */
      
      ->
      
      /* size: 704, cachelines: 11, members: 139 */
      /* sum members: 673, holes: 10, sum holes: 31 */
      /* paddings: 2, sum paddings: 7 */
      /* forced alignments: 1 */
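
The pahole figures above add up as follows: the struct size is the sum of members, holes and tail padding. (The "sum paddings" line does not enter the sum, on the assumption that it counts padding inside member structs, which is already part of "sum members".) The two helpers are hypothetical and a 64-byte cache line is assumed.

```c
#include <assert.h>

#define MODEL_CACHELINE 64 /* assumed cache line size */

/* Struct size as reported by pahole: members + holes + tail padding. */
static int struct_size(int sum_members, int sum_holes, int tail_padding)
{
	return sum_members + sum_holes + tail_padding;
}

/* Number of cache lines a struct of the given size spans. */
static int cachelines(int size)
{
	assert(size >= 0);
	return (size + MODEL_CACHELINE - 1) / MODEL_CACHELINE;
}
```

Before the patch this gives 673 + 39 + 56 = 768 bytes, or 12 cache lines; after it, 673 + 31 + 0 = 704 bytes, or 11. Removing 8 bytes of hole also removes the 56 bytes of tail padding, which together make up the saved cache line.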
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: shrink inet_timewait_death_row by 48 bytes · 1caf8d39
      Eric Dumazet authored
      struct inet_timewait_death_row uses two cache lines, because we want
      tw_count to use a full cache line to avoid false sharing.
      
      Rework its definition and placement in netns_ipv4 so that:
      
      1) We add 60 bytes of padding after tw_count to avoid
        false sharing, knowing that tcp_death_row will
        have ____cacheline_aligned_in_smp attribute.
      
      2) We do not risk padding before tcp_death_row, because
        we move it at the beginning of netns_ipv4, even if new
        fields are added later.
      
      3) We do not waste 48 bytes of padding after it.
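
The three points can be sketched with C11 alignment. This is a model under stated assumptions, not the kernel definition: the struct names and `other_field` are hypothetical, a 64-byte cache line and a 4-byte int are assumed, and `alignas` on the member stands in for the ____cacheline_aligned_in_smp attribute.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define MODEL_CACHELINE 64 /* assumed cache line size */

/* Hypothetical model: a hot counter padded out to a full cache line. */
struct death_row_model {
	int tw_count;                               /* written often */
	char tw_pad[MODEL_CACHELINE - sizeof(int)]; /* 60 bytes for a 4-byte int */
	int sysctl_max_tw_buckets;                  /* read-mostly */
};

/* The read-mostly field starts on the next cache line (point 1). */
static_assert(offsetof(struct death_row_model, sysctl_max_tw_buckets) ==
	      MODEL_CACHELINE, "counter must own its cache line");

/* Hypothetical container: the aligned member comes first (point 2). */
struct netns_model {
	alignas(MODEL_CACHELINE) struct death_row_model tcp_death_row;
	int other_field; /* packs right behind it, no padding wasted (point 3) */
};
```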
      
      Note that I have not changed dccp.
      
      pahole result for struct netns_ipv4 before/after the patch :
      
      /* size: 832, cachelines: 13, members: 139 */
      /* sum members: 721, holes: 12, sum holes: 95 */
      /* padding: 16 */
      /* paddings: 2, sum paddings: 55 */
      
      ->
      
      /* size: 768, cachelines: 12, members: 139 */
      /* sum members: 673, holes: 11, sum holes: 39 */
      /* padding: 56 */
      /* paddings: 2, sum paddings: 7 */
      /* forced alignments: 1 */
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-coding-style' · 30b8817f
      David S. Miller authored
      Weihang Li says:
      
      ====================
      net: fix some coding style issues
      
      Do some cleanups according to the kernel coding style, including a wrong
      print type, redundant and missing spaces, and so on.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: lpc_eth: fix format warnings of block comments · 44d043b5
      Yangyang Li authored
      Fix the following format warnings:
      1. Block comments use * on subsequent lines
      2. Block comments use a trailing */ on a separate line
      Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
      Signed-off-by: Weihang Li <liweihang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: toshiba: fix the trailing format of some block comments · 142c1d2e
      Yixing Liu authored
      Use a trailing */ on a separate line for block comments.
      Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
      Signed-off-by: Weihang Li <liweihang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ocelot: fix a trailling format issue with block comments · 1f78ff4f
      Yixing Liu authored
      Use a trailing */ on a separate line for block comments.
      Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
      Signed-off-by: Weihang Li <liweihang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: amd: correct some format issues · 3f6ebcff
      Yixing Liu authored
      There should be a blank line after declarations.
      Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
      Signed-off-by: Weihang Li <liweihang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: amd8111e: fix inappropriate spaces · ca3fc0aa
      Yixing Liu authored
      Delete unnecessary spaces and add some reasonable spaces according to the
      kernel coding style.
      Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
      Signed-off-by: Weihang Li <liweihang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ena: remove extra words from comments · e355fa6a
      Yixing Liu authored
      Remove the redundant "for" from the comment.
      Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
      Signed-off-by: Weihang Li <liweihang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ena: fix inaccurate print type · b788ff0a
      Yixing Liu authored
      Use "%u" to replace "%hu".
      Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
      Signed-off-by: Weihang Li <liweihang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qrtr: Convert qrtr_ports from IDR to XArray · 3cbf7530
      Matthew Wilcox (Oracle) authored
      The XArray interface is easier for this driver to use.  Also fixes a
      bug reported by the improper use of GFP_ATOMIC.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: stmicro: Remove duplicate struct declaration · 53f7c5e1
      Wan Jiabing authored
      struct stmmac_safety_stats is declared twice. It is already
      declared on line 29. Remove the duplicate.
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_tunnel: sit: proper dev_{hold|put} in ndo_[un]init methods · 48bb5697
      Eric Dumazet authored
      Same reasons as for the previous commits:
      6289a98f ("sit: proper dev_{hold|put} in ndo_[un]init methods")
      40cb881b ("ip6_vti: proper dev_{hold|put} in ndo_[un]init methods")
      7f700334 ("ip6_gre: proper dev_{hold|put} in ndo_[un]init methods")
      
      After adopting CONFIG_PCPU_DEV_REFCNT=n option, syzbot was able to trigger
      a warning [1]
      
      Issue here is that:
      
      - all dev_put() should be paired with a corresponding prior dev_hold().
      
      - A driver doing a dev_put() in its ndo_uninit() MUST also
        do a dev_hold() in its ndo_init(), only when ndo_init()
        is returning 0.
      
      Otherwise, register_netdevice() would call ndo_uninit()
      in its error path and release a refcount too soon.
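
The pairing rule can be modelled with a plain counter. Everything here is a hypothetical user-space sketch (`netdev_model`, `register_model` and the `*_model` helpers are invented names); it only mimics the shape described above, where a failure after a successful init runs the uninit hook.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device with a plain reference counter. */
struct netdev_model {
	int refcnt;
};

static void dev_hold_model(struct netdev_model *dev)
{
	dev->refcnt++;
}

static void dev_put_model(struct netdev_model *dev)
{
	assert(dev->refcnt > 0); /* a put must follow an earlier hold */
	dev->refcnt--;
}

/* init hook: takes its reference only if it is about to return 0. */
static int tunnel_init_model(struct netdev_model *dev, bool hold_in_init)
{
	if (hold_in_init)
		dev_hold_model(dev);
	return 0;
}

/* uninit hook: always drops one reference. */
static void tunnel_uninit_model(struct netdev_model *dev)
{
	dev_put_model(dev);
}

/* Registration: a failure after init runs the uninit hook. */
static int register_model(struct netdev_model *dev, bool hold_in_init,
			  bool fail_after_init)
{
	int err = tunnel_init_model(dev, hold_in_init);

	if (err)
		return err;
	if (fail_after_init) {
		tunnel_uninit_model(dev);
		return -1;
	}
	return 0;
}
```

Starting from one reference held by the caller, a failed registration leaves the count at one when init took its own reference, and drops it to zero too soon when it did not.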
      
      [1]
      WARNING: CPU: 1 PID: 21059 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
      Modules linked in:
      CPU: 1 PID: 21059 Comm: syz-executor.4 Not tainted 5.12.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
      Code: 1d 6a 5a e8 09 31 ff 89 de e8 8d 1a ab fd 84 db 75 e0 e8 d4 13 ab fd 48 c7 c7 a0 e1 c1 89 c6 05 4a 5a e8 09 01 e8 2e 36 fb 04 <0f> 0b eb c4 e8 b8 13 ab fd 0f b6 1d 39 5a e8 09 31 ff 89 de e8 58
      RSP: 0018:ffffc900025aefe8 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000040000 RSI: ffffffff815c51f5 RDI: fffff520004b5def
      RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff815bdf8e R11: 0000000000000000 R12: ffff888023488568
      R13: ffff8880254e9000 R14: 00000000dfd82cfd R15: ffff88802ee2d7c0
      FS:  00007f13bc590700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f0943e74000 CR3: 0000000025273000 CR4: 00000000001506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __refcount_dec include/linux/refcount.h:344 [inline]
       refcount_dec include/linux/refcount.h:359 [inline]
       dev_put include/linux/netdevice.h:4135 [inline]
       ip6_tnl_dev_uninit+0x370/0x3d0 net/ipv6/ip6_tunnel.c:387
       register_netdevice+0xadf/0x1500 net/core/dev.c:10308
       ip6_tnl_create2+0x1b5/0x400 net/ipv6/ip6_tunnel.c:263
       ip6_tnl_newlink+0x312/0x580 net/ipv6/ip6_tunnel.c:2052
       __rtnl_newlink+0x1062/0x1710 net/core/rtnetlink.c:3443
       rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3491
       rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5553
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
       netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
       netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:674
       ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2404
       __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: 919067cc ("net: add CONFIG_PCPU_DEV_REFCNT")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ethtool-fec-netlink' · e3f685aa
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      ethtool: support FEC configuration over netlink
      
      This series adds support for the equivalents of ETHTOOL_GFECPARAM
      and ETHTOOL_SFECPARAM over netlink.
      
      As a reminder - this is an API which allows user to query current
      FEC mode, as well as set FEC manually if autoneg is disabled.
      It does not configure anything if autoneg is enabled (that said
      few/no drivers currently reject .set_fecparam calls while autoneg
      is enabled, hopefully FW will just ignore the settings).
      
      The existing functionality is mostly preserved in the new API.
      The ioctl interface uses a set of flags, and link modes to tell
      user which modes are supported. Here is how the flags translate
      to the new interface (skipping descriptions for actual FEC modes):
      
        ioctl flag      |   description         |  new API
      ================================================================
      ETHTOOL_FEC_OFF   | disabled (supported)  | \
      ETHTOOL_FEC_RS    |                       |  ` link mode bitset
      ETHTOOL_FEC_BASER |                       |  / .._A_FEC_MODES
      ETHTOOL_FEC_LLRS  |                       | /
      ETHTOOL_FEC_AUTO  | pick based on cable   | bool .._A_FEC_AUTO
      ETHTOOL_FEC_NONE  | not supported         | no bit, no AUTO reported
      
      Since link modes are already depended on (although somewhat implicitly)
      for expressing supported modes - the new interface uses them for
      the manual configuration, as well as uses link mode bit number
      to communicate the active mode.
      
      Use of link modes allows us to define any number of FEC modes we want,
      and reuse the strset we already have defined.
      
      Separating AUTO as its own attribute is the biggest change compared
      to the ioctl. It means drivers can no longer report AUTO as the
      active FEC mode because there is no link mode for AUTO.
      active_fec == AUTO makes little sense in the first place IMHO,
      active_fec should be the actual mode, so hopefully this is fine.
      
      The other minor departure is that None is no longer explicitly
      expressed in the API. But drivers are reasonable in handling of
      this somewhat pointless bit, so I'm not expecting any issues there.
      
      One extension which could be considered would be moving active FEC
      to ETHTOOL_MSG_LINKMODE_*, but then why not move all of FEC into
      link modes? I don't know where to draw the line.
      
      netdevsim support and a simple self test are included.
      
      Next step is adding stats similar to the ones added for pause.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: ethtool: add a netdevsim FEC test · 1da07e5d
      Jakub Kicinski authored
      Test FEC settings, iterate over configs.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevsim: add FEC settings support · 0d7f76dc
      Jakub Kicinski authored
      Add support for ethtool FEC and some ethtool error injection.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: support FEC settings over netlink · 1e5d1f69
      Jakub Kicinski authored
      Add FEC API to netlink.
      
      This is not a 1-to-1 conversion.
      
      FEC settings already depend on link modes to tell the user which
      modes are supported. Take this further and use link modes for
      manual configuration. The old struct ethtool_fecparam is still
      used to talk to the drivers, so we need to translate back
      and forth. We can revisit the internal API if the number of FEC
      encodings starts to grow.
      
      Enforce only one active FEC bit (by using a bit position
      rather than another mask).
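
The "bit position rather than another mask" choice can be illustrated with a minimal sketch. The enum values and helper names below are hypothetical and are not the kernel's link mode bits or ethtool helpers; the point is only that an index can name exactly one mode, while a mask may carry zero or several and therefore needs validation when translating back.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical FEC mode indices (assumption: NOT the kernel's link mode
 * bit numbers, chosen only for illustration). */
enum fec_mode_idx {
	FEC_IDX_NONE,
	FEC_IDX_RS,
	FEC_IDX_BASER,
	FEC_IDX_COUNT,
};

/* An index always names exactly one mode, so this direction cannot fail
 * for a valid enum value. */
static uint32_t fec_idx_to_mask(enum fec_mode_idx idx)
{
	return UINT32_C(1) << idx;
}

/* A mask may have zero or several bits set, so translating it back to a
 * single "active" mode has to reject those cases. */
static bool fec_mask_to_idx(uint32_t mask, enum fec_mode_idx *idx)
{
	unsigned int i;

	/* Exactly one bit set: non-zero and a power of two. */
	if (mask == 0 || (mask & (mask - 1)) != 0)
		return false;

	for (i = 0; i < FEC_IDX_COUNT; i++) {
		if (mask == (UINT32_C(1) << i)) {
			*idx = (enum fec_mode_idx)i;
			return true;
		}
	}

	/* One bit set, but not one of the known modes. */
	return false;
}
```

Reporting the active mode as an index removes the failure cases of `fec_mask_to_idx` from the API by construction, which is one plausible reading of why the message prefers a bit position.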
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e5d1f69
    • Ido Schimmel's avatar
      mlxsw: spectrum_router: Only perform atomic nexthop bucket replacement when requested · 7866f265
      Ido Schimmel authored
      When cleared, the 'force' parameter in nexthop bucket replacement
      notifications indicates that a driver should try to perform an atomic
      replacement. Meaning, only update the contents of the bucket if it is
      inactive.
      
      Since mlxsw only queries buckets' activity once every second, there is
      no point in trying an atomic replacement if the idle timer interval is
      smaller than 1 second.
      
      Currently, mlxsw ignores the original value of 'force' and will always
      try an atomic replacement if the idle timer is not smaller than 1
      second.
      
      Fix this by taking the original value of 'force' into account and never
      promoting a non-atomic replacement to an atomic one.
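
The decision described above can be sketched as a small predicate. This is an illustration under stated assumptions, not the driver's code: the function name and the constant are hypothetical, and the 1000 ms value simply encodes the "once every second" activity query interval mentioned in the message.

```c
#include <stdbool.h>

/* Assumption: bucket activity is polled once per second, per the commit
 * message. The macro name is hypothetical. */
#define ACTIVITY_QUERY_INTERVAL_MS 1000U

/*
 * Returns true when the bucket must be replaced unconditionally
 * (non-atomic), false when an atomic replacement may be attempted.
 */
static bool bucket_replace_force(bool requested_force,
				 unsigned int idle_timer_ms)
{
	/* Honour the caller: a forced replacement is never downgraded
	 * to an atomic one. */
	if (requested_force)
		return true;

	/* An atomic replacement was requested, but activity data that is
	 * only refreshed every second cannot support an idle timer
	 * shorter than that, so fall back to a forced replacement. */
	return idle_timer_ms < ACTIVITY_QUERY_INTERVAL_MS;
}
```

Under this sketch, the pre-fix behaviour corresponds to ignoring `requested_force` and returning only the idle timer comparison.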
      
      Fixes: 617a77f0 ("mlxsw: spectrum_router: Add nexthop bucket replacement support")
      Reported-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7866f265
    • David S. Miller's avatar
      Merge branch 'mptcp-subflow-disconnected' · 65550f03
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      MPTCP: Allow initial subflow to be disconnected
      
      An MPTCP connection is aggregated from multiple TCP subflows, and can
      involve multiple IP addresses on either peer. The addresses used in the
      initial subflow connection are assigned address id 0 on each side of the
      link. More addresses can be added and shared with the peer using address
      IDs of 1 or larger. MPTCP in Linux shares non-zero address IDs across
      all MPTCP connections in a net namespace, which allows userspace to
      manage subflow connections across a number of sockets. However, this
      makes the address with id 0 a special case, since the IP address
      associated with id 0 is potentially different for each socket.
      
      This patch set allows the initial subflow to be disconnected when
      userspace specifies an address to remove using both id 0 and an IP
      address, or when the peer sends an RM_ADDR for id 0.
      
      Patches 1 and 3 implement the change for requests from the peer and
      userspace, respectively.
      
      Patch 2 consolidates some code for disconnecting subflows.
      
      Patches 4-6 update the self tests to cover removal of subflows using
      address id 0.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65550f03
    • Geliang Tang's avatar
      selftests: mptcp: remove id 0 address testcases · 5e287fe7
      Geliang Tang authored
      This patch adds test cases for removing the id 0 subflow and the id 0
      address.
      
      In do_transfer, use '9' as the number of addresses to remove in order
      to delete the id 0 address.
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e287fe7
    • Geliang Tang's avatar
      selftests: mptcp: add addr argument for del_addr · 2d121c9a
      Geliang Tang authored
      For the id 0 address, different MPTCP connections could be using
      different IP addresses for id 0.
      
      This patch adds an extra IP address argument to del_addr, for use
      with id 0.
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d121c9a
    • Matthieu Baerts's avatar
      selftests: mptcp: avoid calling pm_nl_ctl with bad IDs · 6254ad40
      Matthieu Baerts authored
      IDs are supposed to be between 0 and 255.
      
      In pm_nl_ctl, for both the 'add' and 'get' instructions, the ID is cast
      to a u_int8_t. So if we give 256, we will delete ID 0. Obviously, the
      goal is not to delete this ID by giving 256.
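
The wrap-around is plain C integer conversion and can be reproduced in isolation. The helpers below are hypothetical and are not taken from pm_nl_ctl; the first mimics storing an unchecked command-line value in 8 bits, the second shows the kind of range check that the message discusses.

```c
#include <stdint.h>
#include <stdlib.h>

/* Mimics a tool that stores a command-line ID in 8 bits without a range
 * check: converting to uint8_t reduces the value modulo 256, so "256"
 * silently becomes 0. */
static uint8_t parse_id_unchecked(const char *arg)
{
	return (uint8_t)atoi(arg);
}

/* Range-checked variant: returns 0 and stores the ID on success, or -1
 * if the argument is not a decimal number in [0, 255]. */
static int parse_id_checked(const char *arg, uint8_t *id)
{
	char *end;
	long value = strtol(arg, &end, 10);

	if (end == arg || *end != '\0' || value < 0 || value > 255)
		return -1;

	*id = (uint8_t)value;
	return 0;
}
```

Here `parse_id_unchecked("256")` yields 0, which is how a request for ID 256 ends up addressing ID 0.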
      
      We could modify pm_nl_ctl and stop if the ID is negative or higher than
      255 but probably better not to increase the number of lines for such
      things in this tool which is only used in selftests. Instead, we use it
      within the limits.
      
      This modification also means that we will no longer add a new ID for the
      2nd entry. That's why we removed an expected entry from the dump, one
      that was introduced with
      commit dc8eb10e ("selftests: mptcp: add testcases for setting the address ID").
      
      So now we delete ID 9 like before and we add entries for IDs 10 to 255
      that are deleted just after.
      
      Note that this could be seen as a fix but it was not really an issue so
      far: we were simply playing with ID 0/1 once again. With the following
      commit ("selftests: mptcp: add addr argument for del_addr"), it will be
      different because ID 0 is going to require an address. We don't want
      errors when trying to delete ID 0 without the address argument.
      Acked-and-tested-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6254ad40