1. 15 Sep, 2020 6 commits
    • s390/qeth: Reset address notification in case of buffer overflow · 817741a8
      Alexandra Winter authored
      In case hardware sends more device-to-bridge-address-change notifications
      than the qeth-l2 driver can handle, the hardware will send an overflow
      event and then stop sending any events. It expects software to flush its
      FDB and start over again. Re-enabling address-change-notification will
      report all current addresses.
      
      In order to re-enable address-change-notification this patch defines
      the functions qeth_l2_dev2br_an_set() and qeth_l2_dev2br_an_set_cb
      to enable or disable dev-to-bridge-address-notification.
      
      A following patch will use the learning_sync bridgeport flag to trigger
      enabling or disabling of address-change-notification, so we define
      priv->brport_features to store the current setting. BRIDGE_INFO and
      ADDR_INFO functionality are mutually exclusive, whereas ADDR_INFO and
      qeth_l2_vnicc* can be used together.
      
      Alternative implementations to handle buffer overflow:
      Just re-enabling notification and adding all newly reported addresses
      would cover any lost 'add' events, but not the lost 'delete' events.
      Then these invalid addresses would stay in the bridge FDB as long as the
      device exists.
      Setting the net device down and up would be an alternative, but is a bit
      drastic. If the net device has many secondary addresses this will create
      many delete/add events at its peers which could de-stabilize the
      network segment.
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
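
      The trade-off between the chosen overflow handling and the rejected
      "just re-enable" alternative can be modelled in a few lines of Python.
      This is only a sketch under stated assumptions: BridgeFdb,
      handle_overflow(), readd_only() and the reenable_notification callable
      are hypothetical names for illustration, not qeth or bridge APIs.

```python
class BridgeFdb:
    """Toy model of the bridge FDB entries that belong to one port."""

    def __init__(self):
        self.learned = set()    # non-permanent entries reported by hardware
        self.permanent = set()  # static entries, untouched by a flush

    def flush(self):
        # Drop only the non-permanent entries.
        self.learned.clear()


def handle_overflow(fdb, reenable_notification):
    """Flush first, then start over with what the hardware reports now.

    reenable_notification is a hypothetical callable standing in for the
    re-enable step; it returns the addresses that are currently reachable.
    """
    fdb.flush()
    fdb.learned.update(reenable_notification())
    return fdb


def readd_only(fdb, reenable_notification):
    """Rejected alternative: re-enable without flushing.

    Lost 'add' events are recovered, but entries whose 'delete' event was
    lost stay in the FDB.
    """
    fdb.learned.update(reenable_notification())
    return fdb
```

      With a stale entry present, handle_overflow() leaves only the addresses
      reported after re-enabling, while readd_only() keeps the stale one.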
    • bridge: Add SWITCHDEV_FDB_FLUSH_TO_BRIDGE notifier · d05e8e68
      Alexandra Winter authored
      so the switchdev can notify the bridge to flush non-permanent fdb entries
      for this port. This is useful whenever the hardware fdb of the switchdev
      is reset, but the netdev and the bridgeport are not deleted.
      
      Note that this has the same effect as the IFLA_BRPORT_FLUSH attribute.
      
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Ivan Vecera <ivecera@redhat.com>
      CC: Roopa Prabhu <roopa@nvidia.com>
      CC: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Acked-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: Translate address events into switchdev notifiers · 10a6cfc0
      Alexandra Winter authored
      A qeth-l2 HiperSockets card can show switch-ish behaviour in the sense
      that it can report all MACs that are reachable via this interface. Just
      like a switch device, it can notify the software bridge about changes
      to its fdb. This patch exploits this device-to-bridge-notification and
      extracts the relevant information from the hardware events to generate
      notifications to an attached software bridge.
      
      There are 2 sources for this information:
      1) The reply message of Perform-Network-Subchannel-Operations (PNSO)
      (operation code ADDR_INFO) reports all addresses that are currently
      reachable (implemented in a later patch).
      2) As long as device-to-bridge-notification is enabled, hardware will
      generate address change notification events, whenever the content of
      the hardware fdb changes (this patch).
      
      The bridge_hostnotify feature (PNSO operation code BRIDGE_INFO) uses
      the same address change notification events. We need to distinguish
      between qeth_pnso_mode QETH_PNSO_BRIDGEPORT and QETH_PNSO_ADDR_INFO
      and call a different handler. In both cases deadlocks must be
      prevented, if the workqueue is drained under lock and QETH_PNSO_NONE,
      when notification is disabled.
      
      bridge_hostnotify generates udev events; there is no intention to do the same
      for dev2br. Instead this patch will generate SWITCHDEV_FDB_ADD_TO_BRIDGE
      and SWITCHDEV_FDB_DEL_TO_BRIDGE notifications, that will cause the
      software bridge to add (or delete) entries to its fdb as 'extern_learn
      offload'.
      
      Documentation/networking/switchdev.txt proposes to add
      "depends NET_SWITCHDEV" to driver's Kconfig. This is not done here,
      so even in absence of the NET_SWITCHDEV module, the QETH_L2 module will
      still be built, but then the switchdev notifiers will have no effect.
      
      No VLAN filtering is done on the entries and VLAN information is not
      passed on to the bridge fdb entries. This could be added later.
      For now VLAN interfaces can be defined on the upper bridge interface.
      
      Multicast entries are not passed on to the bridge fdb.
      This could be added later. For now mcast flooding can be used in the
      bridge.
      
      The card reports all MACs that are in its FDB, but we must not pass on
      MACs that are registered for this interface.
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
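
      The filtering rules stated in this commit message (no multicast
      entries, no MACs registered for the interface itself, add/delete mapped
      to the two notifier types) can be sketched as a small Python model. The
      function names and the event representation are assumptions made for
      illustration; only the two notifier names are taken from the text, and
      they are used here as plain string labels.

```python
SWITCHDEV_FDB_ADD_TO_BRIDGE = "SWITCHDEV_FDB_ADD_TO_BRIDGE"
SWITCHDEV_FDB_DEL_TO_BRIDGE = "SWITCHDEV_FDB_DEL_TO_BRIDGE"


def is_multicast(mac: bytes) -> bool:
    # The least significant bit of the first octet marks a group address.
    return bool(mac[0] & 0x01)


def translate_event(mac: bytes, removed: bool, own_macs: set):
    """Map one address-change event to a notifier, or None to drop it.

    mac      -- 6-byte MAC address reported by the hardware
    removed  -- True for a 'delete' event, False for an 'add' event
    own_macs -- MACs registered for this interface (must not be passed on)
    """
    if is_multicast(mac):
        return None  # multicast entries are not passed to the bridge fdb
    if mac in own_macs:
        return None  # addresses of the interface itself are skipped
    kind = SWITCHDEV_FDB_DEL_TO_BRIDGE if removed else SWITCHDEV_FDB_ADD_TO_BRIDGE
    return (kind, mac)
```

      VLAN information is deliberately absent from the model, matching the
      statement that it is not passed on to the bridge fdb entries.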
    • s390/qeth: Detect PNSO OC3 capability · fa115adf
      Alexandra Winter authored
      This patch detects whether device-to-bridge-notification, provided
      by the Perform Network Subchannel Operation (PNSO) operation code
      ADDR_INFO (OC3), is supported by this card. A following patch will
      map this to the learning_sync bridgeport flag, so we store it in
      priv->brport_hw_features in bridgeport flag format.
      
      Only IQD cards provide PNSO.
      There is a feature bit to indicate whether the machine provides OC3,
      unfortunately it is not set on old machines.
      So PNSO is called to find out. As this will disable notification
      and is exclusive with bridgeport_notification, this must be done
      during card initialisation before previous settings are restored.
      
      PNSO functionality requires some configuration values that are added to
      the qeth_card.info structure. Some helper functions are defined to fill
      them out when the card is brought online and some other places are
      adapted, that can also benefit from these fields.
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/cio: Helper functions to read CSSID, IID, and CHID · b983aa1f
      Alexandra Winter authored
      Add helper functions to expose Channel Subsystem ID (CSSID), MIF Image Id
      (IID), Channel ID (CHID) and Channel Path ID (CHPID).
      These values are required by the qeth driver's exploitation of network-
      address-change-notifications to determine which entries belong to this
      interface.
      
      Store the Partition identifier in System log, as this may be used to map
      a Linux view to a Hardware view for debugging purpose.
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/cio: Add new Operation Code OC3 to PNSO · 4fea49a7
      Alexandra Winter authored
      Add support for operation code 3 (OC3) of the
      Perform-Network-Subchannel-Operations (PNSO) function
      of the Channel-Subsystem-Call (CHSC) instruction.
      
      PNSO provides 2 operation codes:
      OC0 - BRIDGE_INFO
      OC3 - ADDR_INFO (new)
      
      Extend the function calls to *pnso* to pass the OC and
      add new response code 0108.
      
      Support for OC3 is indicated by a flag in the css_general_characteristics.
      Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
      Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
      Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
      Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 Sep, 2020 34 commits
    • tcp: schedule EPOLLOUT after a partial sendmsg · afb83012
      Soheil Hassas Yeganeh authored
      For EPOLLET, applications must call sendmsg until they get EAGAIN.
      Otherwise, there is no guarantee that EPOLLOUT is sent if there was
      a failure upon memory allocation.
      
      As a result on high-speed NICs, userspace observes multiple small
      sendmsgs after a partial sendmsg until EAGAIN, since TCP can send
      1-2 TSOs in between two sendmsg syscalls:
      
      // One large partial send due to memory allocation failure.
      sendmsg(20MB)   = 2MB
      // Many small sends until EAGAIN.
      sendmsg(18MB)   = 64KB
      sendmsg(17.9MB) = 128KB
      sendmsg(17.8MB) = 64KB
      ...
      sendmsg(...)    = EAGAIN
      // At this point, userspace can assume an EPOLLOUT.
      
      To fix this, set the SOCK_NOSPACE on all partial sendmsg scenarios
      to guarantee that we send EPOLLOUT after partial sendmsg.
      
      After this commit userspace can assume that it will receive an EPOLLOUT
      after the first partial sendmsg. This EPOLLOUT will benefit from
      sk_stream_write_space() logic delaying the EPOLLOUT until significant
      space is available in write queue.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
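
      From the application side, the two strategies described above look
      roughly like the following Python sketch. send_some() is a hypothetical
      helper, not part of any library: with stop_on_partial=False it sends
      until EAGAIN (the loop that edge-triggered users needed before this
      commit), with stop_on_partial=True it stops at the first short write
      and relies on a later EPOLLOUT (which is what this commit is meant to
      guarantee for TCP sockets).

```python
import socket


def send_some(sock: socket.socket, data: bytes, stop_on_partial: bool = False) -> int:
    """Send on a non-blocking socket; return the number of bytes queued.

    The caller is expected to wait for EPOLLOUT before sending the rest.
    """
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        try:
            n = sock.send(view[sent:])
        except BlockingIOError:
            break  # EAGAIN: the kernel will signal EPOLLOUT later
        remaining_before = len(view) - sent
        sent += n
        if stop_on_partial and n < remaining_before:
            break  # first partial send: stop and wait for EPOLLOUT
    return sent
```

      A caller would typically register the socket with select.epoll() using
      select.EPOLLOUT | select.EPOLLET and call send_some() again with the
      unsent tail once the event fires.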
    • tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit · 8ba3c9d1
      Soheil Hassas Yeganeh authored
      If there was any event available on the TCP socket, tcp_poll()
      will be called to retrieve all the events.  In tcp_poll(), we call
      sk_stream_is_writeable() which returns true as long as we are at least
      one byte below notsent_lowat.  This will result in quite a few
      spurious EPOLLOUT and frequent tiny sendmsg() calls as a result.
      
      Similar to sk_stream_write_space(), use __sk_stream_is_writeable
      with a wake value of 1, so that we set EPOLLOUT only if half the
      space is available for write.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
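
      The effect of the wake value can be illustrated with a simplified
      Python model. It assumes that the wake value is applied as a left shift
      of the not-yet-sent byte count before the comparison with
      notsent_lowat; the function name is hypothetical and the model ignores
      the separate send-buffer space check.

```python
def notsent_below_limit(notsent_bytes: int, notsent_lowat: int, wake: int) -> bool:
    """Simplified model of the writeability check.

    wake=0: writeable as soon as we are at least one byte below the limit.
    wake=1: writeable only once the unsent data is below half the limit.
    """
    return (notsent_bytes << wake) < notsent_lowat
```

      With a limit of 131072 bytes and 100000 unsent bytes, the model reports
      writeable for wake=0 but not for wake=1, which is the case that
      produced the spurious events.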
    • ionic: fix up debugfs after queue swap · ed6d9b02
      Shannon Nelson authored
      Clean and rebuild the debugfs info for the queues being swapped.
      
      Fixes: a34e25ab ("ionic: change the descriptor ring length without full reset")
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • __netif_receive_skb_core: don't untag vlan from skb on DSA master · b14a9fc4
      Vladimir Oltean authored
      A DSA master interface has upper network devices, each representing an
      Ethernet switch port attached to it. Demultiplexing the source ports and
      setting skb->dev accordingly is done through the catch-all ETH_P_XDSA
      packet_type handler. Catch-all because DSA vendors have various header
      implementations, which can be placed anywhere in the frame: before the
      DMAC, before the EtherType, before the FCS, etc. So, the ETH_P_XDSA
      handler acts like an rx_handler more than anything.
      
      It is unlikely for the DSA master interface to have any other upper than
      the DSA switch interfaces themselves. Only maybe a bridge upper*, but it
      is very likely that the DSA master will have no 8021q upper. So
      __netif_receive_skb_core() will try to untag the VLAN, despite the fact
      that the DSA switch interface might have an 8021q upper. So the skb will
      never reach that.
      
      So far, this hasn't been a problem because most of the possible
      placements of the DSA switch header mentioned in the first paragraph
      will displace the VLAN header when the DSA master receives the frame, so
      __netif_receive_skb_core() will not actually execute any VLAN-specific
      code for it. This only becomes a problem when the DSA switch header does
      not displace the VLAN header (for example with a tail tag).
      
      What the patch does is it bypasses the untagging of the skb when there
      is a DSA switch attached to this net device. So, DSA is the only
      packet_type handler which requires seeing the VLAN header. Once skb->dev
      will be changed, __netif_receive_skb_core() will be invoked again and
      untagging, or delivery to an 8021q upper, will happen in the RX of the
      DSA switch interface itself.
      
      *see commit 9eb8eff0 ("net: bridge: allow enslaving some DSA master
      network devices"). This is actually the reason why I prefer keeping DSA
      as a packet_type handler of ETH_P_XDSA rather than converting to an
      rx_handler. Currently the rx_handler code doesn't support chaining, and
      this is a problem because a DSA master might be bridged.
      Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
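
      The receive flow described in this commit message can be traced with a
      toy Python model. All names are hypothetical and the model only
      captures the ordering of the steps: on a DSA master the untagging is
      skipped, the DSA handler re-targets the frame to the switch port, and
      untagging then happens in the receive pass of that port.

```python
def should_untag_on_rx(has_vlan_tag: bool, dev_is_dsa_master: bool) -> bool:
    # After the patch: never untag on a device that has a DSA switch attached.
    return has_vlan_tag and not dev_is_dsa_master


def rx_passes(has_vlan_tag: bool, on_dsa_master: bool):
    """Return the (device, action) steps a received frame goes through."""
    steps = []
    if should_untag_on_rx(has_vlan_tag, on_dsa_master):
        steps.append(("plain", "untag"))
        return steps
    if on_dsa_master:
        # The DSA handler still sees the VLAN header and changes skb->dev.
        steps.append(("master", "dsa_demux"))
        if has_vlan_tag:
            # Second receive pass, now on the DSA switch interface.
            steps.append(("switch_port", "untag"))
    return steps
```
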
    • Merge branch 'net-next-dsa-mt7530-add-support-for-MT7531' · 0ca6d8b7
      David S. Miller authored
      Landen Chao says:
      
      ====================
      net-next: dsa: mt7530: add support for MT7531
      
      This patch series adds support for MT7531.
      
      MT7531 is the next generation of MT7530 which could be found on Mediatek
      router platforms such as MT7622 or MT7629.
      
      It is also a 7-port switch with 5 giga embedded phys, 2 cpu ports, and
      the same MAC logic of MT7530. Cpu port 6 only supports SGMII interface.
      Cpu port 5 supports either RGMII or SGMII in different HW SKU, but cannot
      be muxed to PHY of port 0/4 like mt7530. Due to support for SGMII
      interface, pll, and pad setting are different from MT7530.
      
      MT7531 SGMII interface can be configured in following mode:
      - 'SGMII AN mode' with in-band negotiation capability
          which is compatible with PHY_INTERFACE_MODE_SGMII.
      - 'SGMII force mode' without in-band negotiation
          which is compatible with 10B/8B encoding of
          PHY_INTERFACE_MODE_1000BASEX with fixed full-duplex and fixed pause.
      - 2.5 times faster clocked 'SGMII force mode' without in-band negotiation
          which is compatible with 10B/8B encoding of
          PHY_INTERFACE_MODE_2500BASEX with fixed full-duplex and fixed pause.
      
      v4 -> v5
      - Add fixed-link node to dsa cpu port in dts file by suggestion of
        Vladimir Oltean.
      
      v3 -> v4
      - Adjust the coding style by suggestion of Jakub Kicinski.
        Remove unnecessary jumping label, merge continuous numeric 'switch
        cases' into one line, and keep the variables longest to shortest
        (reverse xmas tree).
      
      v2 -> v3
      - Keep the same setup logic of mt7530/mt7621 because this series of
        patches is for adding mt7531 hardware.
      - Do not adjust rgmii delay when vendor phy driver presents in order to
        prevent double adjustment by suggestion of Andrew Lunn.
      - Remove redundant 'Example 4' from dt-bindings by suggestion of
        Rob Herring.
      - Fix typo.
      
      v1 -> v2
      - change phylink_validate callback function to support full-duplex
        gigabit only to match hardware capability.
      - add description of SGMII interface.
      - configure mt7531 cpu port in fastest speed by default.
      - parse SGMII control word for in-band negotiation mode.
      - configure RGMII delay based on phy.rst.
      - Rename the definition in the header file to avoid potential conflicts.
      - Add wrapper function for mdio read/write to support both C22 and C45.
      - correct fixed-link speed of 2500base-x in dts.
      - add MT7531 port mirror setting.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • arm64: dts: mt7622: add mt7531 dsa to bananapi-bpi-r64 board · 79a675e6
      Landen Chao authored
      Add mt7531 dsa to bananapi-bpi-r64 board for 5 giga Ethernet ports support.
      Signed-off-by: Landen Chao <landen.chao@mediatek.com>
      Tested-By: Frank Wunderlich <frank-w@public-files.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • arm64: dts: mt7622: add mt7531 dsa to mt7622-rfb1 board · 6af06448
      Landen Chao authored
      Add mt7531 dsa to mt7622-rfb1 board for 5 giga Ethernet ports support.
      mt7622 only supports 1 sgmii interface, so either gmac0 or gmac1 can be
      configured as sgmii interface. In this patch, change to connect mt7622
      gmac0 and mt7531 port6 through sgmii interface.
      Signed-off-by: Landen Chao <landen.chao@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mt7530: Add the support of MT7531 switch · c288575f
      Landen Chao authored
      Add new support for MT7531:
      
      MT7531 is the next generation of MT7530. It is also a 7-port switch with
      5 giga embedded phys, 2 cpu ports, and the same MAC logic of MT7530. Cpu
      port 6 only supports SGMII interface. Cpu port 5 supports either RGMII
      or SGMII in different HW sku, but cannot be muxed to PHY of port 0/4 like
      mt7530. Due to SGMII interface support, pll, and pad setting are different
      from MT7530. This patch adds different initial setting, and SGMII phylink
      handlers of MT7531.
      
      MT7531 SGMII interface can be configured in following mode:
      - 'SGMII AN mode' with in-band negotiation capability
          which is compatible with PHY_INTERFACE_MODE_SGMII.
      - 'SGMII force mode' without in-band negotiation
          which is compatible with 10B/8B encoding of
          PHY_INTERFACE_MODE_1000BASEX with fixed full-duplex and fixed pause.
      - 2.5 times faster clocked 'SGMII force mode' without in-band negotiation
          which is compatible with 10B/8B encoding of
          PHY_INTERFACE_MODE_2500BASEX with fixed full-duplex and fixed pause.
      Signed-off-by: Landen Chao <landen.chao@mediatek.com>
      Signed-off-by: Sean Wang <sean.wang@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: dsa: add new MT7531 binding to support MT7531 · 27834b02
      Landen Chao authored
      Add devicetree binding to support the compatible mt7531 switch as used
      in the MediaTek MT7531 switch.
      Signed-off-by: Sean Wang <sean.wang@mediatek.com>
      Signed-off-by: Landen Chao <landen.chao@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mt7530: Extend device data ready for adding a new hardware · 88bdef8b
      Landen Chao authored
      Add a structure holding required operations for each device such as device
      initialization, PHY port read or write, a checker whether PHY interface is
      supported on a certain port, MAC port setup for either bus pad or a
      specific PHY interface.
      
      This patch prepares for adding the new MT7531 hardware while keeping the
      same setup logic for the existing hardware.
      Signed-off-by: Landen Chao <landen.chao@mediatek.com>
      Signed-off-by: Sean Wang <sean.wang@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mt7530: Refine message in Kconfig · dc8ef938
      Landen Chao authored
      Refine the Kconfig message by fixing a typo and explicitly mentioning MT7621 support.
      Signed-off-by: Landen Chao <landen.chao@mediatek.com>
      Signed-off-by: Sean Wang <sean.wang@mediatek.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net/wan/x25_asy: Remove an unnecessary x25_type_trans call · 4b468385
      Xie He authored
      x25_type_trans only needs to be called before we call netif_rx to pass
      the skb to upper layers.
      
      It does not need to be called before lapb_data_received. The LAPB module
      does not need the fields that are set by calling it.
      
      In the other two X.25 drivers, lapbether and hdlc_x25, x25_type_trans
      is only called before netif_rx and not before lapb_data_received.
      
      Cc: Martin Schiller <ms@dev.tdt.de>
      Signed-off-by: Xie He <xie.he.0141@gmail.com>
      Acked-by: Martin Schiller <ms@dev.tdt.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: try to avoid unneeded backlog flush · 2de79ee2
      Paolo Abeni authored
      flush_all_backlogs() may cause deadlock on systems
      running processes with FIFO scheduling policy.
      
      The above is critical in -RT scenarios, where user-space
      specifically ensures no network activity is scheduled on
      the CPU running the mentioned FIFO process, but still gets
      stuck.
      
      This commit tries to address the problem checking the
      backlog status on the remote CPUs before scheduling the
      flush operation. If the backlog is empty, we can skip it.
      
      v1 -> v2:
       - explicitly clear flushed cpu mask - Eric
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
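
      The idea of checking the backlog before scheduling a flush can be
      sketched in Python. The data layout (a dict mapping each CPU to the
      lengths of two per-CPU queues) and the function name are assumptions
      made for illustration only; the point is that CPUs with nothing queued
      are left alone, so an isolated CPU running a FIFO task is never asked
      to run flush work.

```python
def cpus_needing_flush(backlogs: dict) -> set:
    """Return the CPUs whose backlog is non-empty.

    backlogs maps cpu id -> (input_queue_len, process_queue_len).
    Only the returned CPUs get flush work scheduled; the rest are skipped.
    """
    return {
        cpu
        for cpu, (input_len, process_len) in backlogs.items()
        if input_len or process_len
    }
```

      For example, with backlogs {0: (3, 0), 1: (0, 0), 2: (0, 1)} only CPUs
      0 and 2 are selected, and CPU 1 is not disturbed.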
    • Merge branch 'mlxsw-Derive-SBIB-from-maximum-port-speed-and-MTU' · 7b2d1b8d
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Derive SBIB from maximum port speed & MTU
      
      Petr says:
      
      Internal buffer is a part of port headroom used for packets that are
      mirrored due to triggers that the Spectrum ASIC considers "egress". Besides
      ACL mirroring on port egress this includes also packets mirrored due to
      ECN marking.
      
      This patchset changes the way the internal mirroring buffer is reserved.
      Currently the buffer reflects port MTU and speed accurately. In the future,
      mlxsw should support dcbnl_setbuffer hook to allow the users to set buffer
      sizes by hand. In that case, there might not be enough space for growth of
      the internal mirroring buffer due to MTU and speed changes. While vetoing
      MTU changes would be merely confusing, port speed changes cannot be vetoed,
      and such change would simply lead to issues in packet mirroring.
      
      For these reasons, with these patches the internal mirroring buffer is
      derived from maximum MTU and maximum speed achievable on the port.
      
      Patches #1 and #2 introduce a new callback to determine the maximum speed a
      given port can achieve.
      
      With patches #3 and #4, the information about, respectively, maximum MTU
      and maximum port speed, is kept in struct mlxsw_sp_port.
      
      In patch #5, maximum MTU and maximum speed are used to determine the size
      of the internal buffer. MTU update and speed update hooks are dropped,
      because they are no longer necessary.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_span: Derive SBIB from maximum port speed & MTU · 532b49e4
      Petr Machata authored
      The SBIB register configures the size of an internal buffer that the
      Spectrum ASICs use when mirroring traffic on egress. This size should be
      taken into account when validating that the port headroom buffers are not
      larger than the chip can handle. Up until now this was not done, which is
      incidentally not a problem, because the priority group buffers that mlxsw
      auto-configures are small enough that the boundary condition could not be
      violated.
      
      However when dcbnl_setbuffer is implemented, the user has control over
      sizes of PG buffers, and they might overshoot the headroom capacity.
      However the size of the SBIB buffer depends on port speed, and that cannot
      be vetoed. Therefore SBIB size should be deduced from maximum port speed.
      
      Additionally, once the buffers are configured by hand, the user could get
      into an uncomfortable situation where their MTU change requests get vetoed,
      because the SBIB does not fit anymore. Therefore derive SBIB size from
      maximum permissible MTU as well.
      
      Remove all the code that adjusted the SBIB size whenever speed or MTU
      changed.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Keep maximum speed around · 3232e8c6
      Petr Machata authored
      The maximum port speed depends on link modes supported by the port, and for
      Ethernet ports is constant. The maximum speed will be handy when setting
      SBIB, the internal buffer used for traffic mirroring. Therefore, keep it in
      struct mlxsw_sp_port for easy access.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Keep maximum MTU around · 2ecf87ae
      Petr Machata authored
      The maximum port MTU depends on port type. On Spectrum, mlxsw configures
      all ports as Ethernet ports, and the maximum MTU therefore never changes.
      Besides checking MTU configuration, maximum MTU will also be handy when
      setting SBIB, the internal buffer used for traffic mirroring. Therefore,
      keep it in struct mlxsw_sp_port for easy access.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_ethtool: Introduce ptys_max_speed callback · 60fbc521
      Petr Machata authored
      The SBIB register configures the size of an internal buffer that the
      Spectrum ASICs use when mirroring traffic on egress. This size should be
      taken into account when validating that the port headroom buffers are not
      larger than the chip can handle. Up until now this was not done, which is
      incidentally not a problem, because the priority group buffers that mlxsw
      auto-configures are small enough that the boundary condition could not be
      violated.
      
      When dcbnl_setbuffer is implemented, the user gets control over sizes of PG
      buffers, and they might overshoot the headroom capacity. However the size
      of the SBIB buffer depends on port speed, which cannot be vetoed. There is
      obviously no way to retroactively push back on requests for overlarge PG
      buffers, or reject an overlarge MTU, or cancel losslessness of a certain
      PG.
      
      Therefore, instead of taking into account the current speed when
      calculating SBIB buffer size, take into account the maximum speed that a
      port with given Ethernet protocol capabilities can have.
      
      To that end, add a new ethtool callback, ptys_max_speed, which determines
      this maximum speed.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_ethtool: Extract a helper to get Ethernet attributes · d24ca6c0
      Petr Machata authored
      In order to allow reusing the logic, extract from
      mlxsw_sp_port_get_link_ksettings() the code to obtain Ethernet protocol
      attributes, mlxsw_sp_port_ptys_query().
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 7952d7ed
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2020-09-14
      
      This series contains updates to i40e driver only.
      
      Li RongQing removes binding affinity mask to a fixed CPU and sets
      prefetch of Rx buffer page to occur conditionally.
      
      Björn provides AF_XDP performance improvements by not prefetching HW
      descriptors, using 16 byte descriptors, and moving buffer allocation
      out of Rx processing loop.
      
      v2: Define prefetch_page_address in a common header for patch 2.
      Dropped, previous, patch 5 as it is being reworked to be more
      generalized.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'rxrpc-next-20200914' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · e0d9ae69
      David S. Miller authored
      David Howells says:
      
      ====================
      rxrpc: Fixes for the connection manager rewrite
      
      Here are some fixes for the connection manager rewrite:
      
       (1) Fix a goto to the wrong place in error handling.
      
       (2) Fix a missing NULL pointer check.
      
       (3) The stored allocation error needs to be stored signed.
      
       (4) Fix a leak of connection bundle when clearing connections due to
           net namespace exit.
      
       (5) Fix an overget of the bundle when setting up a new client conn.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0d9ae69
    • hinic: add vxlan segmentation and cs offload support · 33acd755
      Luo bin authored
      Add NETIF_F_GSO_UDP_TUNNEL and NETIF_F_GSO_UDP_TUNNEL_CSUM features
      to support vxlan segmentation and checksum offload. IPIP and IPv6
      tunnel packets are treated as non-tunnel packets by the hardware; for
      other types of tunnel packets, checksum offload is disabled.
      Signed-off-by: Luo bin <luobin9@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33acd755
    • net: qlcnic: remove unused variable 'val' in qlcnic_83xx_cam_unlock() · f3694707
      Zhang Changzhong authored
      Fixes the following W=1 kernel build warning(s):
      
      drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c:661:6: warning:
       variable 'val' set but not used [-Wunused-but-set-variable]
        661 |  u32 val;
            |      ^~~
      
      After commit 7f966452 ("qlcnic: 83xx memory map and HW access
      routines"), variable 'val' is never used in qlcnic_83xx_cam_unlock(), so
      remove it to avoid the build warning.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3694707
    • net: pxa168_eth: remove unused variable 'retval' in pxa168_eth_change_mtu() · f7ab0f04
      Zhang Changzhong authored
      Fixes the following W=1 kernel build warning(s):
      
      drivers/net/ethernet/marvell/pxa168_eth.c:1190:6: warning:
       variable 'retval' set but not used [-Wunused-but-set-variable]
       1190 |  int retval;
            |      ^~~~~~
      
      Function pxa168_eth_change_mtu() always returns zero, so the variable
      'retval' is redundant; remove it.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7ab0f04
    • net: fec: ptp: remove unused variable 'ns' in fec_time_keep() · 992bae7e
      Zhang Changzhong authored
      Fixes the following W=1 kernel build warning(s):
      
      drivers/net/ethernet/freescale/fec_ptp.c:523:6: warning:
       variable 'ns' set but not used [-Wunused-but-set-variable]
        523 |  u64 ns;
            |      ^~
      
      After commit 6605b730 ("FEC: Add time stamping code and a PTP
      hardware clock"), variable 'ns' is never used in fec_time_keep(),
      so remove it to avoid the build warning.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      Acked-by: Fugang Duan <fugang.duan@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      992bae7e
    • net: dnet: remove unused variable 'tx_status' in dnet_start_xmit() · 85743cea
      Zhang Changzhong authored
      Fixes the following W=1 kernel build warning(s):
      
      drivers/net/ethernet/dnet.c:510:6: warning:
       variable 'tx_status' set but not used [-Wunused-but-set-variable]
        u32 tx_status, irq_enable;
            ^~~~~~~~~
      
      After commit 47964174 ("dnet: Dave DNET ethernet controller driver
      (updated)"), variable 'tx_status' is never used in dnet_start_xmit(),
      so remove it to avoid the build warning.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      85743cea
    • tcp: remove SOCK_QUEUE_SHRUNK · 0cbe6a8f
      Eric Dumazet authored
      SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
      that remembers if some room has been made in the rtx queue
      by an incoming ACK packet.
      
      This is later used from tcp_check_space() before
      considering to send EPOLLOUT.
      
      The problem is: if we receive SACK packets and no packet
      is removed from the rtx queue, we can send fresh packets, thus
      moving them from the write queue to the rtx queue and eventually
      emptying the write queue.
      
      This stall can happen if TCP_NOTSENT_LOWAT is used.
      
      With this fix, we no longer risk stalling sends while holes
      are repaired, and we can fully use socket sndbuf.
      
      This also removes a cache line dirtying for typical RPC
      workloads.
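
      For context, a minimal sketch of how an application opts in to
      TCP_NOTSENT_LOWAT, the option under which the stall described above
      can occur. The helper name set_notsent_lowat() and the threshold
      chosen by the caller are illustrative assumptions; only the
      setsockopt() call itself is the standard interface.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: ask the kernel to report the socket as writable
 * only while fewer than 'bytes' of not-yet-sent data sit in its write
 * queue. Returns 0 on success, -1 on error (errno set by setsockopt).
 */
static int set_notsent_lowat(int fd, int bytes)
{
    assert(bytes >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &bytes, sizeof(bytes));
}
```

      A caller would typically apply this right after creating the TCP
      socket and then wait for EPOLLOUT before queueing more data.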
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0cbe6a8f
    • net/packet: Fix a comment about hard_header_len and headroom allocation · b4c58814
      Xie He authored
      This comment is outdated and no longer reflects the actual implementation
      of af_packet.c.
      
      Reasons for the new comment:
      
      1.
      
      In af_packet.c, the function packet_snd first reserves a headroom of
      length (dev->hard_header_len + dev->needed_headroom).
      Then if the socket is a SOCK_DGRAM socket, it calls dev_hard_header,
      which calls dev->header_ops->create, to create the link layer header.
      If the socket is a SOCK_RAW socket, it "un-reserves" a headroom of
      length (dev->hard_header_len), and checks if the user has provided a
      header sized between (dev->min_header_len) and (dev->hard_header_len)
      (in dev_validate_header).
      This shows the developers of af_packet.c expect hard_header_len to
      be consistent with header_ops.
      
      2.
      
      In af_packet.c, the function packet_sendmsg_spkt has a FIXME comment.
      That comment states that prepending an LL header internally in a driver
      is considered a bug. I believe this bug can be fixed by setting
      hard_header_len to 0, making the internal header completely invisible
      to af_packet.c (and requesting the headroom in needed_headroom instead).
      
      3.
      
      There is a commit for a WiFi driver:
      commit 9454f7a8 ("mwifiex: set needed_headroom, not hard_header_len")
      According to the discussion about it at:
        https://patchwork.kernel.org/patch/11407493/
      The author tried to set the WiFi driver's hard_header_len to the Ethernet
      header length, and request additional header space internally needed by
      setting needed_headroom.
      This means this usage is already adopted by driver developers.
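
      The headroom arithmetic described in point 1 can be sketched as a
      small userspace model. This is only an illustration of the commit
      message's description, not the kernel's code: struct dev_model and
      the three helpers are hypothetical names, and the range check is a
      simplification of what dev_validate_header does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the net_device fields discussed above. */
struct dev_model {
    size_t hard_header_len;  /* largest LL header visible to af_packet */
    size_t min_header_len;   /* smallest LL header accepted from users */
    size_t needed_headroom;  /* extra space the driver needs internally */
};

/* Headroom reserved first, for both SOCK_DGRAM and SOCK_RAW. */
static size_t reserved_headroom(const struct dev_model *dev)
{
    return dev->hard_header_len + dev->needed_headroom;
}

/*
 * SOCK_RAW "un-reserves" hard_header_len, because the user's data
 * already starts with the link layer header.
 */
static size_t raw_data_offset(const struct dev_model *dev)
{
    return reserved_headroom(dev) - dev->hard_header_len;
}

/* Simplified check: user header sized within [min, hard] header length. */
static bool raw_header_len_ok(const struct dev_model *dev, size_t len)
{
    return len >= dev->min_header_len && len <= dev->hard_header_len;
}
```

      With hard_header_len set to the Ethernet header length and the
      driver's private needs moved to needed_headroom, the raw data offset
      equals exactly the space the driver asked for.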
      
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Brian Norris <briannorris@chromium.org>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Xie He <xie.he.0141@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4c58814
    • Merge branch 'mptcp-introduce-support-for-real-multipath-xmit' · b91c06c5
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      mptcp: introduce support for real multipath xmit
      
      This series enables MPTCP sockets to transmit data on multiple subflows
      concurrently in a load balancing scenario.
      
      First the receive code path is refactored to better deal with out-of-order
      data (patches 1-7). An RB-tree is introduced to queue MPTCP-level out-of-order
      data, closely resembling the TCP level OoO handling.
      
      When data is sent on multiple subflows, the peer can easily see OoO - "future"
      data at the MPTCP level, especially if speeds, delay, or jitter are not
      symmetric.
      
      The other major change regards the netlink PM, which is extended to allow
      creating non backup subflows in patches 9-11.
      
      There are a few smaller additions, like the introduction of OoO related mibs,
      send buffer autotuning and better ack handling.
      
      Finally, a bunch of new self-tests is introduced. The new feature is tested
      by ensuring that the B/W used by an MPTCP socket using multiple subflows
      matches the link-aggregated B/W; we use low-B/W virtual links to ensure the
      tests are not CPU bound.
      
      v1 -> v2:
        - fix 32 bit build breakage
        - fix a bunch of checkpatch issues
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b91c06c5
    • mptcp: simult flow self-tests · 1a418cb8
      Paolo Abeni authored
      Add a bunch of test cases for multiple subflow xmit:
      create multiple subflows, simulating different link
      conditions via netem, and verify that the msk is able
      to fully use the aggregated bandwidth.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a418cb8
    • mptcp: call tcp_cleanup_rbuf on subflows · c76c6956
      Paolo Abeni authored
      That is needed to let the subflows announce promptly when new
      space is available in the receive buffer.
      
      tcp_cleanup_rbuf() is currently a static function, drop the
      scope modifier and add a declaration in the TCP header.
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c76c6956
    • mptcp: allow picking different xmit subflows · d5f49190
      Paolo Abeni authored
      Update the scheduler to a less trivial heuristic: cache
      the last used subflow, and try to send a reasonably
      long burst of data on it.
      
      When the burst or the subflow send space is exhausted, pick
      the subflow with the lower ratio between write space and
      send buffer - that is, the subflow with the greater relative
      amount of free space.
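
      A rough userspace sketch of the selection step, reading the criterion
      as "greatest relative amount of free space". struct subflow_model,
      pick_subflow() and the 32-bit fixed-point scaling are assumptions made
      for illustration; they are not taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-subflow view: free write space and send buffer size. */
struct subflow_model {
    uint32_t wspace;  /* bytes currently free in the send buffer */
    uint32_t sndbuf;  /* total send buffer size in bytes */
};

/*
 * Return the index of the subflow with the largest wspace/sndbuf ratio,
 * or -1 if no subflow has any free space. The ratio is kept as a 64-bit
 * fixed-point value so that no floating point is needed.
 */
static int pick_subflow(const struct subflow_model *sf, size_t n)
{
    int best = -1;
    uint64_t best_ratio = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t ratio;

        if (sf[i].sndbuf == 0 || sf[i].wspace == 0)
            continue;  /* nothing can be queued on this subflow */

        ratio = ((uint64_t)sf[i].wspace << 32) / sf[i].sndbuf;
        if (best < 0 || ratio > best_ratio) {
            best = (int)i;
            best_ratio = ratio;
        }
    }
    return best;
}
```

      The 64-bit division used here is the kind of operation that needs a
      dedicated helper on 32-bit kernel builds, which is what the v2 note
      below refers to.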
      
      v1 -> v2:
       - fix 32 bit build breakage due to 64bits div
       - fix checkpatch issues (uint64_t -> u64)
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5f49190
    • mptcp: allow creating non-backup subflows · 4596a2c1
      Paolo Abeni authored
      Currently the 'backup' attribute of the local endpoint
      is ignored. Let's use it for the MP_JOIN handshake.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4596a2c1
    • mptcp: move address attribute into mptcp_addr_info · ef0da3b8
      Paolo Abeni authored
      So that it can be accessed easily from the subflow creation
      helper. No functional change intended.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef0da3b8