1. 31 Mar, 2021 38 commits
    • qrtr: Convert qrtr_ports from IDR to XArray · 3cbf7530
      Matthew Wilcox (Oracle) authored
      The XArray interface is easier for this driver to use.  Also fixes a
      bug caused by the improper use of GFP_ATOMIC.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: stmicro: Remove duplicate struct declaration · 53f7c5e1
      Wan Jiabing authored
      struct stmmac_safety_stats is declared twice. One declaration is
      already at line 29. Remove the duplicate.
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_tunnel: sit: proper dev_{hold|put} in ndo_[un]init methods · 48bb5697
      Eric Dumazet authored
      Same reasons as for the previous commits:
      6289a98f ("sit: proper dev_{hold|put} in ndo_[un]init methods")
      40cb881b ("ip6_vti: proper dev_{hold|put} in ndo_[un]init methods")
      7f700334 ("ip6_gre: proper dev_{hold|put} in ndo_[un]init methods")
      
      After adopting CONFIG_PCPU_DEV_REFCNT=n option, syzbot was able to trigger
      a warning [1]
      
      Issue here is that:
      
      - all dev_put() should be paired with a corresponding prior dev_hold().
      
      - A driver doing a dev_put() in its ndo_uninit() MUST also
        do a dev_hold() in its ndo_init(), only when ndo_init()
        is returning 0.
      
      Otherwise, register_netdevice() would call ndo_uninit()
      in its error path and release a refcount too soon.
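The pairing rule above can be modelled in plain userspace C. This is only a toy sketch: `toy_dev`, `toy_ndo_init`, `toy_ndo_uninit` and `toy_register` are hypothetical stand-ins, not kernel functions, and the registration flow is reduced to the one property the message describes (uninit runs when a step after a successful init fails).

```c
#include <stdbool.h>

/* Toy stand-in for struct net_device: only the reference count is modelled. */
struct toy_dev {
    int refcnt;
};

static void toy_hold(struct toy_dev *dev) { dev->refcnt++; }
static void toy_put(struct toy_dev *dev)  { dev->refcnt--; }

/* Init hook: takes its reference only when it is about to return 0. */
static int toy_ndo_init(struct toy_dev *dev, bool init_fails)
{
    if (init_fails)
        return -1;      /* no hold taken, so no put is owed */
    toy_hold(dev);
    return 0;
}

/* Uninit hook: drops the reference taken by a successful init. */
static void toy_ndo_uninit(struct toy_dev *dev)
{
    toy_put(dev);
}

/* Simplified registration: a failure after init runs uninit in the error path. */
static int toy_register(struct toy_dev *dev, bool init_fails, bool later_step_fails)
{
    int err = toy_ndo_init(dev, init_fails);

    if (err)
        return err;
    if (later_step_fails) {
        toy_ndo_uninit(dev);    /* balanced by the hold in toy_ndo_init() */
        return -1;
    }
    return 0;
}
```

With the hold inside the init hook, every path leaves the count balanced. If the hold were instead taken by the caller only after a successful registration, the error path would drop a reference that was never taken.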
      
      [1]
      WARNING: CPU: 1 PID: 21059 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
      Modules linked in:
      CPU: 1 PID: 21059 Comm: syz-executor.4 Not tainted 5.12.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
      Code: 1d 6a 5a e8 09 31 ff 89 de e8 8d 1a ab fd 84 db 75 e0 e8 d4 13 ab fd 48 c7 c7 a0 e1 c1 89 c6 05 4a 5a e8 09 01 e8 2e 36 fb 04 <0f> 0b eb c4 e8 b8 13 ab fd 0f b6 1d 39 5a e8 09 31 ff 89 de e8 58
      RSP: 0018:ffffc900025aefe8 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000040000 RSI: ffffffff815c51f5 RDI: fffff520004b5def
      RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff815bdf8e R11: 0000000000000000 R12: ffff888023488568
      R13: ffff8880254e9000 R14: 00000000dfd82cfd R15: ffff88802ee2d7c0
      FS:  00007f13bc590700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f0943e74000 CR3: 0000000025273000 CR4: 00000000001506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       __refcount_dec include/linux/refcount.h:344 [inline]
       refcount_dec include/linux/refcount.h:359 [inline]
       dev_put include/linux/netdevice.h:4135 [inline]
       ip6_tnl_dev_uninit+0x370/0x3d0 net/ipv6/ip6_tunnel.c:387
       register_netdevice+0xadf/0x1500 net/core/dev.c:10308
       ip6_tnl_create2+0x1b5/0x400 net/ipv6/ip6_tunnel.c:263
       ip6_tnl_newlink+0x312/0x580 net/ipv6/ip6_tunnel.c:2052
       __rtnl_newlink+0x1062/0x1710 net/core/rtnetlink.c:3443
       rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3491
       rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5553
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502
       netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline]
       netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338
       netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:674
       ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2404
       __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: 919067cc ("net: add CONFIG_PCPU_DEV_REFCNT")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ethtool-fec-netlink' · e3f685aa
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      ethtool: support FEC configuration over netlink
      
      This series adds support for the equivalents of ETHTOOL_GFECPARAM
      and ETHTOOL_SFECPARAM over netlink.
      
      As a reminder - this is an API which allows the user to query the current
      FEC mode, as well as set FEC manually if autoneg is disabled.
      It does not configure anything if autoneg is enabled (that said
      few/no drivers currently reject .set_fecparam calls while autoneg
      is enabled, hopefully FW will just ignore the settings).
      
      The existing functionality is mostly preserved in the new API.
      The ioctl interface uses a set of flags, and link modes to tell
      user which modes are supported. Here is how the flags translate
      to the new interface (skipping descriptions for actual FEC modes):
      
        ioctl flag      |   description         |  new API
      ================================================================
      ETHTOOL_FEC_OFF   | disabled (supported)  | \
      ETHTOOL_FEC_RS    |                       |  ` link mode bitset
      ETHTOOL_FEC_BASER |                       |  / .._A_FEC_MODES
      ETHTOOL_FEC_LLRS  |                       | /
      ETHTOOL_FEC_AUTO  | pick based on cable   | bool .._A_FEC_AUTO
      ETHTOOL_FEC_NONE  | not supported         | no bit, no AUTO reported
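The flag table above can be read as a small translation function. The sketch below is illustrative only: the `TOY_FEC_*` constants, their bit values, and `struct toy_fec_nl` are assumptions made for this example and are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ioctl flags; bit values are illustrative. */
enum {
    TOY_FEC_NONE  = 1u << 0,  /* FEC not supported */
    TOY_FEC_AUTO  = 1u << 1,  /* pick based on cable */
    TOY_FEC_OFF   = 1u << 2,
    TOY_FEC_RS    = 1u << 3,
    TOY_FEC_BASER = 1u << 4,
    TOY_FEC_LLRS  = 1u << 5,
};

/* Shape of the new API in this sketch: a mode bitset plus a separate AUTO flag. */
struct toy_fec_nl {
    uint32_t modes;    /* stands in for the link mode bitset */
    bool     autosel;  /* stands in for the boolean AUTO attribute */
};

static struct toy_fec_nl toy_fec_from_ioctl(uint32_t flags)
{
    struct toy_fec_nl out = { 0, false };

    /* OFF, RS, BASER and LLRS each become a bit in the mode bitset. */
    out.modes = flags & (TOY_FEC_OFF | TOY_FEC_RS | TOY_FEC_BASER | TOY_FEC_LLRS);
    /* AUTO becomes its own boolean instead of a mode bit. */
    out.autosel = (flags & TOY_FEC_AUTO) != 0;
    /* NONE maps to nothing: no mode bit and no AUTO reported. */
    return out;
}
```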
      
      Since link modes are already depended on (although somewhat implicitly)
      for expressing supported modes - the new interface uses them for
      the manual configuration, as well as uses link mode bit number
      to communicate the active mode.
      
      Use of link modes allows us to define any number of FEC modes we want,
      and reuse the strset we already have defined.
      
      Separating AUTO as its own attribute is the biggest change compared
      to the ioctl. It means drivers can no longer report AUTO as the
      active FEC mode because there is no link mode for AUTO.
      active_fec == AUTO makes little sense in the first place IMHO,
      active_fec should be the actual mode, so hopefully this is fine.
      
      The other minor departure is that None is no longer explicitly
      expressed in the API. But drivers are reasonable in handling of
      this somewhat pointless bit, so I'm not expecting any issues there.
      
      One extension which could be considered would be moving active FEC
      to ETHTOOL_MSG_LINKMODE_*, but then why not move all of FEC into
      link modes? I don't know where to draw the line.
      
      netdevsim support and a simple self test are included.
      
      Next step is adding stats similar to the ones added for pause.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: ethtool: add a netdevsim FEC test · 1da07e5d
      Jakub Kicinski authored
      Test FEC settings, iterate over configs.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevsim: add FEC settings support · 0d7f76dc
      Jakub Kicinski authored
      Add support for ethtool FEC and some ethtool error injection.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: support FEC settings over netlink · 1e5d1f69
      Jakub Kicinski authored
      Add FEC API to netlink.
      
      This is not a 1-to-1 conversion.
      
      FEC settings already depend on link modes to tell user which
      modes are supported. Take this further and use link modes for
      manual configuration. Old struct ethtool_fecparam is still
      used to talk to the drivers, so we need to translate back
      and forth. We can revisit the internal API if number of FEC
      encodings starts to grow.
      
      Enforce only one active FEC bit (by using a bit position
      rather than another mask).
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Lin's avatar
    • mlxsw: spectrum_router: Only perform atomic nexthop bucket replacement when requested · 7866f265
      Ido Schimmel authored
      When cleared, the 'force' parameter in nexthop bucket replacement
      notifications indicates that a driver should try to perform an atomic
      replacement. Meaning, only update the contents of the bucket if it is
      inactive.
      
      Since mlxsw only queries buckets' activity once every second, there is
      no point in trying an atomic replacement if the idle timer interval is
      smaller than 1 second.
      
      Currently, mlxsw ignores the original value of 'force' and will always
      try an atomic replacement if the idle timer is not smaller than 1
      second.
      
      Fix this by taking the original value of 'force' into account and never
      promoting a non-atomic replacement to an atomic one.
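The decision described above can be sketched as a single helper. `toy_effective_force` and `TOY_ACTIVITY_INTERVAL_MS` are hypothetical names for this illustration; the one-second figure comes from the message itself.

```c
#include <stdbool.h>

/* Assumed activity polling period, taken from the commit text (1 second). */
#define TOY_ACTIVITY_INTERVAL_MS 1000u

/*
 * Returns the 'force' value to act on. A forced (non-atomic) request stays
 * forced; an atomic request is downgraded to forced only when the idle timer
 * is shorter than the polling period, where an atomic attempt is pointless.
 */
static bool toy_effective_force(bool force, unsigned int idle_timer_ms)
{
    if (!force && idle_timer_ms < TOY_ACTIVITY_INTERVAL_MS)
        force = true;
    return force;
}
```

The behaviour the commit removes would correspond to computing the result from the timer alone, which turns a forced request into an atomic one whenever the timer is one second or longer.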
      
      Fixes: 617a77f0 ("mlxsw: spectrum_router: Add nexthop bucket replacement support")
      Reported-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mptcp-subflow-disconnected' · 65550f03
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      MPTCP: Allow initial subflow to be disconnected
      
      An MPTCP connection is aggregated from multiple TCP subflows, and can
      involve multiple IP addresses on either peer. The addresses used in the
      initial subflow connection are assigned address id 0 on each side of the
      link. More addresses can be added and shared with the peer using address
      IDs of 1 or larger. MPTCP in Linux shares non-zero address IDs across
      all MPTCP connections in a net namespace, which allows userspace to
      manage subflow connections across a number of sockets. However, this
      makes the address with id 0 a special case, since the IP address
      associated with id 0 is potentially different for each socket.
      
      This patch set allows the initial subflow to be disconnected when
      userspace specifies an address to remove using both id 0 and an IP
      address, or when the peer sends an RM_ADDR for id 0.
      
      Patches 1 and 3 implement the change for requests from the peer and
      userspace, respectively.
      
      Patch 2 consolidates some code for disconnecting subflows.
      
      Patches 4-6 update the self tests to cover removal of subflows using
      address id 0.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: remove id 0 address testcases · 5e287fe7
      Geliang Tang authored
      This patch added the testcases for removing the id 0 subflow and the id 0
      address.
      
      In do_transfer, use the removing addresses number '9' for deleting the id
      0 address.
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: add addr argument for del_addr · 2d121c9a
      Geliang Tang authored
      For the id 0 address, different MPTCP connections could be using
      different IP addresses for id 0.
      
      This patch added an extra argument IP address for del_addr when
      using id 0.
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: avoid calling pm_nl_ctl with bad IDs · 6254ad40
      Matthieu Baerts authored
      IDs are supposed to be between 0 and 255.
      
      In pm_nl_ctl, for both the 'add' and 'get' instructions, the ID is cast
      to a u_int8_t. So if we give 256, we will delete ID 0. Obviously, the
      goal is not to delete this ID by giving 256.
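The wraparound is ordinary unsigned 8-bit truncation and can be reproduced in a few lines. Both helpers below are hypothetical and are not taken from pm_nl_ctl; the checked variant only illustrates the range test that the message mentions and deliberately does not add to the tool.

```c
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the problem: the parsed number is truncated to 8 bits. */
static uint8_t toy_parse_id_unchecked(const char *arg)
{
    return (uint8_t)atoi(arg);  /* 256 becomes 0, 257 becomes 1, ... */
}

/* Range-checked alternative: returns -1 for anything outside 0..255. */
static int toy_parse_id_checked(const char *arg)
{
    long value = strtol(arg, NULL, 10);

    if (value < 0 || value > 255)
        return -1;
    return (int)value;
}
```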
      
      We could modify pm_nl_ctl and stop if the ID is negative or higher than
      255 but probably better not to increase the number of lines for such
      things in this tool which is only used in selftests. Instead, we use it
      within the limits.
      
      This modification also means that we will no longer add a new ID for the
      2nd entry. That's why we removed an expected entry from the dump, which
      was introduced with
      commit dc8eb10e ("selftests: mptcp: add testcases for setting the address ID").
      
      So now we delete ID 9 like before and we add entries for IDs 10 to 255
      that are deleted just after.
      
      Note that this could be seen as a fix but it was not really an issue so
      far: we were simply playing with ID 0/1 once again. With the following
      commit ("selftests: mptcp: add addr argument for del_addr"), it will be
      different because ID 0 is going to require an address. We don't want
      errors when trying to delete ID 0 without the address argument.
      Acked-and-tested-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: remove id 0 address · 740d798e
      Geliang Tang authored
      This patch added a new function mptcp_nl_remove_id_zero_address to
      remove the id 0 address.
      
      In this function, traverse all the existing msk sockets to find the
      msk that matches the input IP address. Then fill the removing list with
      id 0, and pass it to mptcp_pm_remove_addr and mptcp_pm_remove_subflow.
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: unify RM_ADDR and RM_SUBFLOW receiving · 9f12e97b
      Geliang Tang authored
      There is some duplicate code in mptcp_pm_nl_rm_addr_received and
      mptcp_pm_nl_rm_subflow_received. This patch unifies them into a new
      function named mptcp_pm_nl_rm_addr_or_subflow. In it, the input
      parameter rm_type identifies whether an address or a subflow is being removed.
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: remove all subflows involving id 0 address · 774c8a8d
      Geliang Tang authored
      There's only one subflow involving the non-zero id address, but there
      may be multiple subflows involving the id 0 address.
      
      Here's an example:
      
       local_id=0, remote_id=0
       local_id=1, remote_id=0
       local_id=0, remote_id=1
      
      If the removing address id is 0, all the subflows involving the id 0
      address need to be removed.
      
      In mptcp_pm_nl_rm_addr_received/mptcp_pm_nl_rm_subflow_received, the
      "break" prevents the iteration to the next subflow, so this patch
      dropped them.
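The effect of dropping the "break" can be shown with a toy loop over the three example subflows. `struct toy_subflow` and `toy_rm_addr` are hypothetical; the sketch matches on the remote id only, as one plausible reading of handling a removal announced by the peer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_subflow {
    uint8_t local_id;
    uint8_t remote_id;
    bool    closed;
};

/* Close every subflow whose remote id matches rm_id; returns how many closed. */
static size_t toy_rm_addr(struct toy_subflow *flows, size_t n, uint8_t rm_id)
{
    size_t closed = 0;

    for (size_t i = 0; i < n; i++) {
        if (flows[i].remote_id != rm_id || flows[i].closed)
            continue;
        flows[i].closed = true;
        closed++;
        /* No break: several subflows may share the same address id. */
    }
    return closed;
}
```

For the example list, removing remote id 0 closes the first two subflows and leaves the third open. A loop that stopped at the first match would close only one.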
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix icmp_echo_enable_probe sysctl · b8128656
      Eric Dumazet authored
      sysctl_icmp_echo_enable_probe is a u8.

      ipv4_net_table entry should use
       .maxlen       = sizeof(u8),
       .proc_handler = proc_dou8vec_minmax,
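Why the declared size matters can be illustrated in userspace: if a table entry is sized for an int while the variable is a single byte, a store of that width also touches whatever sits next to it. The struct and helpers below are hypothetical and only model the memory effect, not the sysctl machinery.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct toy_netns {
    uint8_t probe;         /* stands in for the u8 sysctl variable */
    uint8_t neighbour[3];  /* unrelated fields that happen to follow it */
};

_Static_assert(sizeof(int) <= sizeof(struct toy_netns),
               "sketch assumes an int fits inside the toy struct");

/* Correctly sized store: only the u8 itself is written. */
static void toy_store_u8(struct toy_netns *ns, int value)
{
    ns->probe = (uint8_t)value;
}

/* Int-sized store at the same offset: the bytes after the u8 are overwritten too. */
static void toy_store_int(struct toy_netns *ns, int value)
{
    memcpy((unsigned char *)ns + offsetof(struct toy_netns, probe),
           &value, sizeof(value));
}
```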
      
      Fixes: f1b8fa9f ("net: add sysctl for enabling RFC 8335 PROBE messages")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ionic-cleanups' · 3c7a83fa
      David S. Miller authored
      Shannon Nelson says:
      
      ====================
      ionic: code cleanup for heartbeat, dma error counts, sizeof, stats
      
      These patches are a few more bits of code cleanup found in
      testing and review: count all our dma error instances, make
      better use of sizeof, fix a race in our device heartbeat check,
      and clean up code formatting in the ethtool stats collection.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: pull per-q stats work out of queue loops · aa620993
      Shannon Nelson authored
      Abstract out the per-queue data collection work into separate
      functions from the per-queue loops in the stats reporting,
      similar to what Alex did for the data label strings in
      commit acebe5b6 ("ionic: Update driver to use ethtool_sprintf")
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: avoid races in ionic_heartbeat_check · b2b9a8d7
      Shannon Nelson authored
      Rework the heartbeat checks to be sure that we're getting an
      atomic operation.  Through testing we found occasions where a
      separate thread could clash with this check and cause erroneous
      heartbeat check results.
      Signed-off-by: Allen Hubbe <allenbh@pensando.io>
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: fix sizeof usage · 230efff4
      Shannon Nelson authored
      Use the actual pointer that we care about as the subject of the
      sizeof, rather than a struct name.
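A minimal illustration of the idiom, using a hypothetical `struct toy_desc` rather than any ionic type: sizing the allocation from the pointer keeps it correct even if the pointer's type is changed later.

```c
#include <stdlib.h>

struct toy_desc {
    unsigned long addr;
    unsigned int  len;
};

/* Allocate a zeroed array of descriptors; the caller frees it with free(). */
static struct toy_desc *toy_alloc_descs(size_t count)
{
    struct toy_desc *descs;

    /* sizeof(*descs) follows the pointer's type; sizeof(struct toy_desc) would not. */
    descs = calloc(count, sizeof(*descs));
    return descs;
}
```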
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: count dma errors · 0f4e7f4e
      Shannon Nelson authored
      Increment our dma-error counter in a couple of spots
      that were missed before.
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dpaa2-switch-STP' · 578c97b0
      David S. Miller authored
      Ioana Ciornei says:
      
      ====================
      dpaa2-switch: add STP support
      
      This patch set adds support for STP to the dpaa2-switch.
      
      First of all, it fixes a bug caused by the improper usage
      of bridge BR_STATE_* values directly in the MC ABI.
      The next patches deal with creating an ACL table per port and trapping
      the STP frames to the control interface by adding an entry into each
      table.
      The last patch configures proper learning state depending on the STP
      state.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dpaa2-switch: setup learning state on STP state change · bc96781a
      Ioana Ciornei authored
      Depending on what STP state a port is in, the learning on that port
      should be enabled or disabled.
      
      When the STP state is DISABLED, BLOCKING or LISTENING no learning should
      be happening irrespective of what the bridge previously requested. The
      learning state is changed to be the one setup by the bridge when the STP
      state is LEARNING or FORWARDING.
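The rule above reduces to a small decision function. The enum and `toy_learning_enabled` are hypothetical names for this sketch and do not come from the driver.

```c
#include <stdbool.h>

enum toy_stp_state {
    TOY_STP_DISABLED,
    TOY_STP_BLOCKING,
    TOY_STP_LISTENING,
    TOY_STP_LEARNING,
    TOY_STP_FORWARDING,
};

/* Learning follows the bridge's request only in LEARNING and FORWARDING. */
static bool toy_learning_enabled(enum toy_stp_state state, bool bridge_requested)
{
    switch (state) {
    case TOY_STP_LEARNING:
    case TOY_STP_FORWARDING:
        return bridge_requested;
    default:
        return false;  /* DISABLED, BLOCKING, LISTENING: never learn */
    }
}
```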
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dpaa2-switch: trap STP frames to the CPU · 1a64ed12
      Ioana Ciornei authored
      Add an ACL entry in each port's ACL table to redirect any frame that
      has the destination MAC address equal to the STP dmac to the control
      interface.
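The match condition can be sketched as a plain comparison against the bridge group address 01:80:C2:00:00:00, which is the destination MAC used by STP BPDUs. `toy_should_trap` is a hypothetical helper; the real driver expresses this as an ACL entry programmed into the hardware rather than a software check.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* IEEE 802.1D bridge group address, the destination MAC of STP BPDUs. */
static const uint8_t toy_stp_dmac[6] = { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x00 };

/* True when a frame with this destination MAC should go to the control interface. */
static bool toy_should_trap(const uint8_t dmac[6])
{
    return memcmp(dmac, toy_stp_dmac, sizeof(toy_stp_dmac)) == 0;
}
```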
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dpaa2-switch: keep track of the current learning state per port · 62734c74
      Ioana Ciornei authored
      Keep track of the current learning state per port so that we can
      reference it in the next patches when setting up a STP state.
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dpaa2-switch: create and assign an ACL table per port · 90f07102
      Ioana Ciornei authored
      In order to trap frames to the CPU, the DPAA2 switch uses the ACL table.
      At probe time, create an ACL table for each switch port so that in the
      next patches we can use this to trap STP frames and redirect them to the
      control interface.
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dpaa2-switch: fix the translation between the bridge and dpsw STP states · 6aa6791d
      Ioana Ciornei authored
      The numerical values used for STP states differ between the
      bridge and the MC ABI. Therefore, using the BR_STATE_* macros
      directly in the structures passed to the firmware is
      incorrect.
      
      Create a separate function that translates between the bridge STP states
      and the enum that holds the STP state as seen by the Management Complex.
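A translation helper of the kind described might look like the sketch below. Both enums are hypothetical and their numeric values were chosen only to differ from each other; they are not the bridge's or the firmware's actual encodings.

```c
/* Toy bridge-side states (values are illustrative). */
enum toy_br_state {
    TOY_BR_DISABLED   = 0,
    TOY_BR_LISTENING  = 1,
    TOY_BR_LEARNING   = 2,
    TOY_BR_FORWARDING = 3,
    TOY_BR_BLOCKING   = 4,
};

/* Toy firmware-side states with a deliberately different numbering. */
enum toy_fw_state {
    TOY_FW_DISABLED   = 0,
    TOY_FW_BLOCKING   = 1,
    TOY_FW_LISTENING  = 2,
    TOY_FW_LEARNING   = 3,
    TOY_FW_FORWARDING = 4,
};

/* Map by name, never by numeric value; unknown input falls back to disabled. */
static enum toy_fw_state toy_br_to_fw(enum toy_br_state state)
{
    switch (state) {
    case TOY_BR_LISTENING:  return TOY_FW_LISTENING;
    case TOY_BR_LEARNING:   return TOY_FW_LEARNING;
    case TOY_BR_FORWARDING: return TOY_FW_FORWARDING;
    case TOY_BR_BLOCKING:   return TOY_FW_BLOCKING;
    case TOY_BR_DISABLED:
    default:                return TOY_FW_DISABLED;
    }
}
```

Passing the bridge value straight through would, in this toy numbering, turn a blocking port (4) into a forwarding one (4), which is the class of bug the commit describes.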
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tc-testing: add simple action change test · e48792a9
      Vlad Buslov authored
      Use act_simple to verify that action created with 'tc actions change'
      command exists after command returns. The goal is to verify internal action
      API reference counting to ensure that the case when netlink message has
      NLM_F_REPLACE flag set but action with specified index doesn't exist is
      handled correctly.
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'udp-gro-L4' · df82e9c6
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      udp: GRO L4 improvements
      
      This series improves the UDP L4 - either 'forward' or 'frag_list' -
      co-existence with UDP tunnel GRO, allowing the first to take place
      correctly even for encapsulated UDP traffic.
      
      The first four patches are mostly bugfixes, addressing some GRO
      edge-cases when both tunnels and L4 are present, enabled and in use.
      
      The next 3 patches avoid unneeded segmentation when UDP GRO
      traffic traverses UDP tunnels in the receive path.
      
      Finally, some self-tests are included, covering the relevant
      GRO scenarios.
      
      Even if most patches are actually bugfixes, this series is
      targeting net-next, as overall it makes available a new feature.
      
      v2 -> v3:
       - no code changes, more verbose commit messages and comment in
         patch 1/8
      
      v1 -> v2:
       - restrict post segmentation csum fixup to only the relevant pkts
       - use individual 'accept_gso_type' fields instead of whole gso bitmask
         (Willem)
       - use only ipv6 addresses from test range in self-tests (Willem)
       - hopefully clarified most individual patches commit messages
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: net: add UDP GRO forwarding self-tests · a062260a
      Paolo Abeni authored
      Create a bunch of virtual topologies and verify that
      NETIF_F_GRO_FRAGLIST or NETIF_F_GRO_UDP_FWD-enabled
      devices aggregate the ingress packets as expected.
      Additionally check that the aggregate packets are
      segmented correctly when landing on a socket
      
      Also test SKB_GSO_FRAGLIST and SKB_GSO_UDP_L4 aggregation
      on top of UDP tunnel (vxlan)
      
      v1 -> v2:
       - hopefully clarify the commit message
       - moved the overlay network ipv6 range into the 'documentation'
         reserved range (Willem)
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bareudp: allow UDP L4 GRO passthrou · b03ef676
      Paolo Abeni authored
      Similar to the previous commit, let bareudp also
      pass through the L4 GRO packets
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • geneve: allow UDP L4 GRO passthrou · 61630c4f
      Paolo Abeni authored
      Similar to the previous commit, let geneve also
      pass through the L4 GRO packets
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: allow L4 GRO passthrough · d18931a9
      Paolo Abeni authored
      When passing up a UDP GSO packet with L4 aggregation, there is
      no need to segment it at the vxlan level. We can propagate the
      packet untouched and let it be segmented later, if needed.
      
      Introduce a helper to let the UDP socket accept any
      L4 aggregation and use it in the vxlan driver.
      
      v1 -> v2:
       - updated to use the newly introduced UDP socket 'accept*' fields
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: never accept GSO_FRAGLIST packets · 78352f73
      Paolo Abeni authored
      Currently the UDP protocol delivers GSO_FRAGLIST packets to
      the sockets without the expected segmentation.
      
      This change addresses the issue by introducing and maintaining
      a couple of new fields to explicitly accept SKB_GSO_UDP_L4
      or GSO_FRAGLIST packets. It also updates udp_unexpected_gso()
      accordingly.

      UDP sockets enabling UDP_GRO still keep accept_udp_fraglist
      zeroed.
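The per-type acceptance check can be sketched with two one-bit fields. Everything prefixed `toy_` or `TOY_` is hypothetical; only the field names accept_udp_l4 and accept_udp_fraglist are taken from the commit messages in this series, and the logic is a guess at the shape of the check rather than a copy of the kernel helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy GSO type bits (values are illustrative). */
enum {
    TOY_GSO_UDP_L4   = 1u << 0,
    TOY_GSO_FRAGLIST = 1u << 1,
};

struct toy_udp_sock {
    unsigned int accept_udp_l4       : 1;
    unsigned int accept_udp_fraglist : 1;
};

struct toy_skb {
    bool     is_gso;
    uint32_t gso_type;
};

/* True when the socket did not opt in to this aggregation type, so the
 * packet would have to be segmented before delivery. */
static bool toy_unexpected_gso(const struct toy_udp_sock *sk,
                               const struct toy_skb *skb)
{
    if (!skb->is_gso)
        return false;
    if ((skb->gso_type & TOY_GSO_UDP_L4) && !sk->accept_udp_l4)
        return true;
    if ((skb->gso_type & TOY_GSO_FRAGLIST) && !sk->accept_udp_fraglist)
        return true;
    return false;
}
```

A socket modelled as having enabled UDP_GRO would set accept_udp_l4 and leave accept_udp_fraglist at zero, so a fraglist packet is still reported as unexpected.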
      
      v1 -> v2:
       - use 2 bits instead of a whole GSO bitmask (Willem)
      
      Fixes: 9fd1ff5d ("udp: Support UDP fraglist GRO/GSO.")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: properly complete L4 GRO over UDP tunnel packet · e0e3070a
      Paolo Abeni authored
      After the previous patch, the stack can do L4 UDP aggregation
      on top of a UDP tunnel.
      
      In such scenario, udp{4,6}_gro_complete will be called twice. This function
      will enter its is_flist branch immediately, even though that is only
      correct on the second call, as GSO_FRAGLIST is only relevant for the
      inner packet.
      
      Instead, we first need to try UDP tunnel-based aggregation, if the GRO
      packet requires that.

      This patch changes udp{4,6}_gro_complete to skip the frag list processing
      while encap_mark == 1, identifying processing of the outer tunnel
      header.
      Additionally, it clears the field in udp_gro_complete() so that we can enter
      the frag list path on the next round, for the inner header.
      
      v1 -> v2:
       - hopefully clarified the commit message
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: skip L4 aggregation for UDP tunnel packets · 18f25dc3
      Paolo Abeni authored
      If NETIF_F_GRO_FRAGLIST or NETIF_F_GRO_UDP_FWD are enabled, and there
      are UDP tunnels available in the system, udp_gro_receive() could end up
      doing L4 aggregation (either SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST) at
      the outer UDP tunnel level for packets effectively carrying a UDP
      tunnel header.
      
      That could cause inner protocol corruption. If e.g. the relevant
      packets carry a vxlan header, different vxlan ids will be ignored/
      aggregated to the same GSO packet. Inner headers will be ignored, too,
      so that e.g. TCP over vxlan push packets will be held in the GRO
      engine till the next flush, etc.
      
      Just skip the SKB_GSO_UDP_L4 and SKB_GSO_FRAGLIST code path if the
      current packet could land in a UDP tunnel, and let udp_gro_receive()
      do GRO via udp_sk(sk)->gro_receive.
      
      The check implemented in this patch is broader than what is strictly
      needed, as the existing UDP tunnel could be e.g. configured on top of
      a different device: we could end up skipping GRO altogether for some packets.
      
      Anyhow, that is a very thin corner case and covering it will add quite
      a bit of complexity.
      
      v1 -> v2:
       - hopefully clarify the commit message
      
      Fixes: 9fd1ff5d ("udp: Support UDP fraglist GRO/GSO.")
      Fixes: 36707061 ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets")
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: fixup csum for GSO receive slow path · 000ac44d
      Paolo Abeni authored
      When UDP packets generated locally by a socket with UDP_SEGMENT
      traverse the following path:
      
      UDP tunnel(xmit) -> veth (segmentation) -> veth (gro) ->
      	UDP tunnel (rx) -> UDP socket (no UDP_GRO)
      
      ip_summed will be set to CHECKSUM_PARTIAL at creation time and
      such checksum mode will be preserved in the above path up to the
      UDP tunnel receive code where we have:
      
       __iptunnel_pull_header() -> skb_pull_rcsum() ->
      skb_postpull_rcsum() -> __skb_postpull_rcsum()
      
      The latter will convert the skb to CHECKSUM_NONE.
      
      The UDP GSO packet will be later segmented as part of the rx socket
      receive operation, and will present a CHECKSUM_NONE after segmentation.
      
      Additionally, the segmented packets' UDP CB still refers to the original
      GSO packet len. Overall that causes unexpected/wrong csum validation
      errors later in the UDP receive path.
      
      We could possibly address the issue with some additional checks and
      csum mangling in the UDP tunnel code. Since the issue affects only
      this UDP receive slow path, let's set a suitable csum status there.
      
      Note that SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST packets lacking a UDP
      encapsulation present a valid checksum when landing in udp_queue_rcv_skb(),
      as the UDP checksum has been validated by the GRO engine.
      
      v2 -> v3:
       - even more verbose commit message and comments
      
      v1 -> v2:
       - restrict the csum update to the packets strictly needing them
       - hopefully clarify the commit message and code comments
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 30 Mar, 2021 2 commits