1. 10 Jan, 2018 19 commits
    • net/mlx5e: Move AM logic enums · f5e7f67d
      Andy Gospodarek authored
      More movement to help make this code more generic.
      Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
      Acked-by: Tal Gilboa <talgi@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Remove rq references in mlx5e_rx_am · 138968e9
      Andy Gospodarek authored
      This makes mlx5e_am_sample more generic so that it can be called easily
      from a driver that does not use the same data structure to store these
      values in a single structure.
      Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
      Acked-by: Tal Gilboa <talgi@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Move interrupt moderation forward declarations · f58ee099
      Andy Gospodarek authored
      Move these to newly created file to prepare to move these functions to a
      library.
      Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
      Acked-by: Tal Gilboa <talgi@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Move interrupt moderation structs to new file · 98dd1edf
      Andy Gospodarek authored
      Create new header file to prepare to move code that handles irq
      moderation to a library that lives in a header file.
      Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
      Acked-by: Tal Gilboa <talgi@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipv6-Add-support-for-non-equal-cost-multipath' · 8448f91f
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      ipv6: Add support for non-equal-cost multipath
      
      This set aims to add support for IPv6 non-equal-cost multipath routes.
      The first three patches convert multipath selection to use the
      hash-threshold method (RFC 2992) instead of modulo-N. The same method is
      employed by the IPv4 routing code since commit 0e884c78 ("ipv4: L3
      hash-based multipath").
      
      Unlike modulo-N, with hash-threshold only the flows near the region
      boundaries are affected when a nexthop is added or removed. In addition,
      it allows us to easily add support for non-equal-cost multipath in the
      last patch by sizing the different regions according to the provided
      weights.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Add support for non-equal-cost multipath · 398958ae
      Ido Schimmel authored
      The use of hash-threshold instead of modulo-N makes it trivial to add
      support for non-equal-cost multipath.
      
      Instead of dividing the multipath hash function's output space equally
      between the nexthops, each nexthop is assigned a region size which is
      proportional to its weight.
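
As a rough illustration of sizing regions by weight, here is a userspace sketch (not the kernel code). The helper name `nh_upper_bounds` and its parameters are hypothetical, and it assumes a 31-bit hash space and weights small enough that the cumulative weight shifted left by 31 bits fits in 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill bounds[i] with the inclusive upper bound of
 * nexthop i's region in the hash space [0, 2^31 - 1]. The size of each
 * region is proportional to weights[i]. */
static void nh_upper_bounds(const uint32_t *weights, size_t n, int32_t *bounds)
{
	uint64_t total = 0, cum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += weights[i];
	assert(total > 0);

	for (i = 0; i < n; i++) {
		cum += weights[i];
		/* Cumulative share of 2^31, rounded to closest, then
		 * minus 1 so that the bound is inclusive. */
		bounds[i] = (int32_t)((((cum << 31) + total / 2) / total) - 1);
	}
}
```

With weights 1 and 3, the first nexthop gets the lowest quarter of the hash space and the second gets the rest.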
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Use hash-threshold instead of modulo-N · 3d709f69
      Ido Schimmel authored
      Now that each nexthop stores its region boundary in the multipath hash
      function's output space, we can use hash-threshold instead of modulo-N
      in multipath selection.
      
      This reduces the number of checks we need to perform during lookup, as
      dead and linkdown nexthops are assigned a negative region boundary. In
      addition, in contrast to modulo-N, only flows near region boundaries are
      affected when a nexthop is added or removed.
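
The lookup described above can be sketched in userspace C as follows. This is an illustration under assumptions, not the kernel implementation: `select_nexthop` is a hypothetical name, and the boundaries are passed as a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first nexthop whose upper bound is >= hash,
 * or -1 if none qualifies. Inactive nexthops carry a bound of -1, which
 * a non-negative hash can never satisfy, so the loop needs no separate
 * dead or linkdown check. */
static int select_nexthop(const int32_t *bounds, size_t n, int32_t hash)
{
	size_t i;

	assert(hash >= 0);
	for (i = 0; i < n; i++) {
		if (hash <= bounds[i])
			return (int)i;
	}
	return -1;
}
```

A single comparison per nexthop both skips inactive entries and finds the matching region.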
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Use a 31-bit multipath hash · 7696c06a
      Ido Schimmel authored
      The hash thresholds assigned to IPv6 nexthops are in the range of
      [-1, 2^31 - 1], where a negative value is assigned to nexthops that
      should not be considered during multipath selection.
      
      Therefore, in a similar fashion to IPv4, we need to use the upper
      31-bits of the multipath hash for multipath selection.
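
A minimal sketch of what keeping the upper 31 bits means, assuming a 32-bit input hash (`mp_hash31` is a hypothetical name): shifting right by one bit yields a value in [0, 2^31 - 1], which can be compared against signed thresholds where -1 marks a nexthop to skip.

```c
#include <assert.h>
#include <stdint.h>

/* Drop the lowest bit of a 32-bit hash so that the result always fits
 * in a non-negative signed 32-bit integer. */
static int32_t mp_hash31(uint32_t hash32)
{
	return (int32_t)(hash32 >> 1);
}
```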
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Calculate hash thresholds for IPv6 nexthops · d7dedee1
      Ido Schimmel authored
      Before we convert IPv6 to use hash-threshold instead of modulo-N, we
      first need each nexthop to store its region boundary in the hash
      function's output space.
      
      The boundary is calculated by dividing the output space equally between
      the different active nexthops. That is, nexthops that are not dead or
      linkdown.
      
      The boundaries are rebalanced whenever a nexthop is added or removed to
      a multipath route and whenever a nexthop becomes active or inactive.
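
The rebalancing step might look like the following userspace sketch. It is an illustration only: `rebalance_bounds` is a hypothetical name, a 31-bit hash space is assumed, and inactive nexthops are marked with -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Divide the hash space [0, 2^31 - 1] equally between active nexthops.
 * Inactive (dead or linkdown) nexthops get a bound of -1. Calling this
 * again after any change of membership or state rebalances the regions. */
static void rebalance_bounds(const bool *active, size_t n, int32_t *bounds)
{
	uint64_t total = 0, seen = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += active[i] ? 1 : 0;

	for (i = 0; i < n; i++) {
		if (!active[i]) {
			bounds[i] = -1;
			continue;
		}
		seen++;
		/* total is non-zero here because at least one nexthop is active. */
		bounds[i] = (int32_t)((((seen << 31) + total / 2) / total) - 1);
	}
}
```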
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vhost_net: batch used ring update in rx · e2b3b35e
      Jason Wang authored
      This patch batches used ring updates during RX. This fits the case
      where the guest is much faster (e.g. a DPDK-based backend). In this
      case, the used ring is almost empty:

      - we may get serious cache line misses/contention on both the used
        ring and used idx.
      - at most 1 packet can be dequeued at a time, so batching in the
        guest has little effect.

      Updating the used ring in a batch helps because the guest won't
      access the used ring until used idx has advanced by several
      descriptors. And since we advance the used ring every N packets, the
      guest only needs to access used idx once every N packets, as it can
      cache the used idx. To interact well with both batch dequeuing and
      DPDK batching, VHOST_RX_BATCH is used as the maximum number of
      descriptors that can be batched.
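
To make the batching idea concrete, here is a small userspace model. It is a sketch under assumptions, not the vhost code: the struct and function names are hypothetical, and `RX_BATCH` is set to 64 only as an assumed stand-in for VHOST_RX_BATCH.

```c
#include <assert.h>
#include <stdint.h>

#define RX_BATCH 64 /* assumed value, stands in for VHOST_RX_BATCH */

struct used_ring_model {
	uint16_t used_idx;       /* index visible to the guest */
	unsigned int pending;    /* entries written but not yet published */
	unsigned int idx_writes; /* how many times used_idx was updated */
};

/* Publish all pending entries with a single used_idx update. */
static void flush_used(struct used_ring_model *r)
{
	if (!r->pending)
		return;
	r->used_idx = (uint16_t)(r->used_idx + r->pending);
	r->pending = 0;
	r->idx_writes++;
}

/* Record one received packet; publish only when a full batch is ready. */
static void add_used(struct used_ring_model *r)
{
	r->pending++;
	if (r->pending >= RX_BATCH)
		flush_used(r);
}
```

In this model, 130 packets cause two index updates instead of 130, plus one final flush for the remaining two entries.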
      
      Tests were done between two machines with 2.40GHz Intel(R) Xeon(R) CPU
      E5-2630 connected back to back through ixgbe. Traffic was generated
      on one remote ixgbe through MoonGen, and RX pps was measured through
      testpmd in the guest while doing xdp_redirect_map from the local ixgbe
      to tap. RX pps increased from 3.05 Mpps to 4.00 Mpps (about 31%
      improvement).
      
      One possible concern is the implication for TCP (especially
      latency-sensitive workloads). Result[1] does not show obvious changes
      for most of the netperf tests (RR, TX, and RX). And we do get some
      improvements for RX at some specific sizes.
      
      Guest RX:
      
      size/sessions/+thu%/+normalize%
         64/     1/   +2%/   +2%
         64/     2/   +2%/   -1%
         64/     4/   +1%/   +1%
         64/     8/    0%/    0%
        256/     1/   +6%/   -3%
        256/     2/   -3%/   +2%
        256/     4/  +11%/  +11%
        256/     8/    0%/    0%
        512/     1/   +4%/    0%
        512/     2/   +2%/   +2%
        512/     4/    0%/   -1%
        512/     8/   -8%/   -8%
       1024/     1/   -7%/  -17%
       1024/     2/   -8%/   -7%
       1024/     4/   +1%/    0%
       1024/     8/    0%/    0%
       2048/     1/  +30%/  +14%
       2048/     2/  +46%/  +40%
       2048/     4/    0%/    0%
       2048/     8/    0%/    0%
       4096/     1/  +23%/  +22%
       4096/     2/  +26%/  +23%
       4096/     4/    0%/   +1%
       4096/     8/    0%/    0%
      16384/     1/   -2%/   -3%
      16384/     2/   +1%/   -4%
      16384/     4/   -1%/   -3%
      16384/     8/    0%/   -1%
      65535/     1/  +15%/   +7%
      65535/     2/   +4%/   +7%
      65535/     4/    0%/   +1%
      65535/     8/    0%/    0%
      
      TCP_RR:
      
      size/sessions/+thu%/+normalize%
          1/     1/    0%/   +1%
          1/    25/   +2%/   +1%
          1/    50/   +4%/   +1%
         64/     1/    0%/   -4%
         64/    25/   +2%/   +1%
         64/    50/    0%/   -1%
        256/     1/    0%/    0%
        256/    25/    0%/    0%
        256/    50/   +4%/   +2%
      
      Guest TX:
      
      size/sessions/+thu%/+normalize%
         64/     1/   +4%/   -2%
         64/     2/   -6%/   -5%
         64/     4/   +3%/   +6%
         64/     8/    0%/   +3%
        256/     1/  +15%/  +16%
        256/     2/  +11%/  +12%
        256/     4/   +1%/    0%
        256/     8/   +5%/   +5%
        512/     1/   -1%/   -6%
        512/     2/    0%/   -8%
        512/     4/   -2%/   +4%
        512/     8/   +6%/   +9%
       1024/     1/   +3%/   +1%
       1024/     2/   +3%/   +9%
       1024/     4/    0%/   +7%
       1024/     8/    0%/   +7%
       2048/     1/   +8%/   +2%
       2048/     2/   +3%/   -1%
       2048/     4/   -1%/  +11%
       2048/     8/   +3%/   +9%
       4096/     1/   +8%/   +8%
       4096/     2/    0%/   -7%
       4096/     4/   +4%/   +4%
       4096/     8/   +2%/   +5%
      16384/     1/   -3%/   +1%
      16384/     2/   -1%/  -12%
      16384/     4/   -1%/   +5%
      16384/     8/    0%/   +1%
      65535/     1/    0%/   -3%
      65535/     2/   +5%/  +16%
      65535/     4/   +1%/   +2%
      65535/     8/   +1%/   -1%
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-updates-2018-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 65d51f26
      David S. Miller authored
      mlx5-updates-2018-01-08
      
      Four patches from Or that add Hairpin support to mlx5:
      ===========================================================
      From:  Or Gerlitz <ogerlitz@mellanox.com>
      
      We refer to the ability of NIC HW to forward a packet received on one
      port to the other port (also from a port to itself) as hairpin. The
      application API is based on ingress tc/flower rules set on the NIC
      with the mirred redirect action. Other actions can apply to packets
      during the redirect.

      Hairpin allows offloading the data path of various SW DDoS gateways,
      load balancers, etc. to HW. Packets go through all the required
      processing in HW (header re-write, encap/decap, push/pop vlan) and
      are then forwarded, while the CPU stays at practically zero usage. HW
      flow counters are used by the control plane for monitoring and
      accounting.

      Hairpin is implemented by pairing a receive queue (RQ) to a send queue (SQ).
      All the flows that share <recv NIC, mirred NIC> are redirected through
      the same hairpin pair. Currently, only header-rewrite is supported as a
      packet modification action.

      I'd like to thank Elijah Shakkour <elijahs@mellanox.com> for implementing
      this functionality on the HW simulator before it was available in the FW,
      so the driver code could be tested early.
      ===========================================================
      
      From Feras, three patches that provide very small changes that allow IPoIB
      to support RX timestamping for child interfaces, simply by hooking the mlx5e
      timestamping PTP ioctl to the IPoIB child interface netdev profile.

      One patch from Gal to fix a spelling mistake.

      Two patches from Eugenia add drop counters to VF statistics, to be
      reported in netlink (iproute2), and implement them in the mlx5 eswitch.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'hns3-next' · 45f89822
      David S. Miller authored
      Peng Li says:
      
      ====================
      code improvements in HNS3 driver
      
      This patchset addresses 2 comments from community review.
      [patch 1/2] reverts "net: hns3: Add packet statistics of netdev",
      as reported by Jakub Kicinski and David Miller.
      [patch 2/2] puts the function type on the same line as
      hns3_nic_get_stats64, as reported by Andrew Lunn.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: report the function type the same line with hns3_nic_get_stats64 · 6c88d9d7
      Peng Li authored
      The function type should be on the same line as the function
      name, or it may cause a display error if a patch edits the
      function. An example follows:
      https://www.spinics.net/lists/netdev/msg476141.html
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "net: hns3: Add packet statistics of netdev" · bf909456
      Peng Li authored
      This reverts commit 84910007.
      
      It is redundant to add netdev statistics to ethtool -S.
      Signed-off-by: Peng Li <lipeng321@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Socionext-Synquacer-NETSEC-driver' · 68d5c265
      David S. Miller authored
      Jassi Brar says:
      
      ====================
      Socionext Synquacer NETSEC driver
      
      Changes since v5
              # Removed helper macros
              # Removed 'inline' qualifier
              # Changed multiline empty comment to single line
              # Added 'clock-names' property in DT binding example
              # Ignore 'clock-names' property in driver until firmware in the
                wild is upgraded or we support instances that take in more
                than one clock.
              # Rebased the patchset onto net-next
      
      Changes since v4
              # Fixed ucode indexing as a word, instead of byte
              # Removed redundant clocks, keep only phy rate reference clock
                and expect it to be 'phy_ref_clk'
      
      Changes since v3
              # Discard 'socionext,snq-mdio', and simply use 'mdio' subnode.
              # Use ioremap on ucode region as well, instead of memremap.
      
      Changes since v2
              # Use 'mdio' subnode in DT bindings.
              # Use phy_interface_mode_is_rgmii(), instead of open coding the check.
              # Use readl/b with eeprom_base pointer.
              # Unregister mdio bus upon failure in probe.
      
      Changes since v1
              # Switched from using memremap to ioremap
              # Implemented ndo_do_ioctl callback
              # Defined optional 'dma-coherent' DT property
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • MAINTAINERS: Add entry for Socionext ethernet driver · 919e66a2
      Jassi Brar authored
      Add entry for the Socionext Netsec controller driver and DT bindings.
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: socionext: Add Synquacer NetSec driver · 533dd11a
      Jassi Brar authored
      This driver adds support for Socionext "netsec" IP Gigabit
      Ethernet + PHY IP used in the Synquacer SC2A11 SoC.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: Add DT bindings for Socionext Netsec · f78f4107
      Jassi Brar authored
      This patch adds documentation for Device-Tree bindings for the
      Socionext NetSec Controller driver.
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · c215dae4
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      10GbE Intel Wired LAN Driver Updates 2018-01-09
      
      This series contains updates to ixgbe and ixgbevf only.
      
      Emil fixes an issue with "wake on LAN"(WoL) where we need to ensure we
      enable the reception of multicast packets so that WoL works for IPv6
      magic packets.  Cleaned up code no longer needed with the update to
      adaptive ITR.
      
      Paul updates the driver to advertise the highest capable link speed
      when a module gets inserted.  Also extended the displaying of firmware
      version to include the iSCSI and OEM block in the EEPROM to better
      identify firmware versions/images.
      
      Tonghao Zhang cleans up a code comment that no longer applies since
      InterruptThrottleRate has been removed from the driver.
      
      Alex fixes SR-IOV and MACVLAN offload interaction, where the MACVLAN
      offload was incorrectly configuring several filters with the wrong
      pool value which resulted in MACVLAN interfaces not being able to
      receive traffic that had to pass over the physical interface.  Fixed
      transmit hangs and dropped receive frames when the number of VFs
      changed.  Added support for RSS on MACVLAN pools for X550 devices.
      Fixed up the MACVLAN limitations so we can now support 63 offloaded
      devices.  Cleaned up MACVLAN code that is no longer needed with the
      recent changes and fixes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 09 Jan, 2018 21 commits
    • Merge branch 'r8169-improve-runtime-pm' · 61ad6408
      David S. Miller authored
      Heiner Kallweit says:
      
      ====================
      r8169: improve runtime pm
      
      On my system with two network ports I found that runtime PM didn't
      suspend the unused port. Therefore I checked runtime pm in this driver
      in somewhat more detail and this series improves runtime pm in general
      and solves the mentioned issue.
      
      Tested on a system with RTL8168evl (MAC version 34).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8169: improve runtime pm in general and suspend unused ports · a92a0849
      Heiner Kallweit authored
      So far rpm doesn't cover cases like unused ports which are never
      brought up. If they are active at probe time they remain in this state.
      Included in this patch:
      
      - Let the idle notification check whether we can suspend and let it
        schedule the suspend. This way we don't need to have calls to
        pm_schedule_suspend in different places.
      
      - At the end of rtl_open and rtl_init_one send an idle notification
        to allow suspending if the link is down. If a cable is plugged in
        aneg is finished before the suspend timer expires and the suspend
        request is cancelled.
      
      - Change rtl8169_runtime_suspend to power down the chip if the
        interface is down.
      
      Successfully tested on a RTL8168evl (mac version 34).
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8169: improve runtime pm in rtl8169_check_link_status · ef4d5fcc
      Heiner Kallweit authored
      This patch partially reverts commit e4fbce74 "r8169: Fix runtime
      power management" from 2010. At that time the suspend delay was 100ms
      and therefore suspending happened during initial aneg. Currently
      suspend delay is 5s, so suspend starts after aneg and the issue
      doesn't exist any longer. On my system aneg takes almost 3s, so to be
      on the safe side let's increase the suspend delay to 10s.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8169: remove unneeded rpm ops in rtl_shutdown · b9aa1c75
      Heiner Kallweit authored
      This patch reverts commit 2a15cd2f "r8169: runtime resume before
      shutdown" from 2012. A few months after this change the underlying issue
      was solved in the PCI core with commit 3ff2de9b "PCI/PM: Resume
      device before shutdown".
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tipc-improvements-to-group-messaging' · fdb533c3
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: improvements to group messaging
      
      We make a number of simplifications and improvements to the group
      messaging service. They aim at readability/maintainability of the code
      as well as scalability.
      
      The series is based on commit f9c935db ("tipc: fix problems with
      multipoint-to-point flow control") which has been applied to 'net' but
      not yet to 'net-next'.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: improve poll() for group member socket · eb929a91
      Jon Maloy authored
      The current criteria for returning POLLOUT from a group member socket is
      too simplistic. It basically returns POLLOUT as soon as the group has
      external destinations, something obviously leading to a lot of spinning
      during destination congestion situations. At the same time, the internal
      congestion handling is unnecessarily complex.
      
      We now change this as follows.
      
      - We introduce an 'open' flag in struct tipc_group. This flag is used
        only to help poll() get the setting of POLLOUT right, and *not* for
        congestion handling as such. This means that a user can choose to
        ignore an EAGAIN for a destination and go on sending messages to
        other destinations in the group if he wants to.
      
      - The flag is set to false every time we return EAGAIN on a send call.
      
      - The flag is set to true every time any member, i.e., not necessarily
        the member that caused EAGAIN, is removed from the small_win list.
      
      - We remove the group member 'usr_pending' flag. The size of the send
        window and presence in the 'small_win' list is sufficient criteria
        for recognizing congestion.
      
      This solution seems to be a reasonable compromise between 'anycast',
      which is normally not waiting for POLLOUT for a specific destination,
      and the other three send modes, which are.
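
The flag handling listed above can be modelled in a few lines of userspace C. This is a sketch with hypothetical names (`group_model` and the three functions); it only mirrors the rules stated in this message and is not the TIPC code.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>

struct group_model {
	bool open; /* mirrors the 'open' flag described above */
};

/* A send call that returns EAGAIN closes the group for POLLOUT. */
static void on_send_eagain(struct group_model *g)
{
	g->open = false;
}

/* Any member leaving the small_win list reopens the group. */
static void on_small_win_remove(struct group_model *g)
{
	g->open = true;
}

/* poll() reports POLLOUT only while the flag is set. */
static short group_poll_mask(const struct group_model *g)
{
	return g->open ? POLLOUT : 0;
}
```

Because the flag only affects what poll() reports, a sender in this model can still ignore an EAGAIN and try another destination.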
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: improve groupcast scope handling · 232d07b7
      Jon Maloy authored
      When a member joins a group, it also indicates a binding scope. This
      makes it possible to create both node local groups, invisible to other
      nodes, as well as cluster global groups, visible everywhere.
      
      In order to avoid that different members end up having permanently
      differing views of group size and membership, we must inhibit locally
      and globally bound members from joining the same group.
      
      We do this by using the binding scope as an additional separator between
      groups. I.e., a member must ignore all membership events from sockets
      using a different scope than itself, and all lookups for message
      destinations must require an exact match between the message's lookup
      scope and the potential target's binding scope.
      
      Apart from making it possible to create local groups using the same
      identity on different nodes, a side effect of this is that it now also
      becomes possible to create a cluster global group with the same identity
      across the same nodes, without interfering with the local groups.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: add option to suppress PUBLISH events for pre-existing publications · 8348500f
      Jon Maloy authored
      Currently, when a user is subscribing for binding table publications,
      he will receive a PUBLISH event for all already existing matching items
      in the binding table.
      
      However, a group socket making a subscription doesn't need this initial
      status update from the binding table, because it has already scanned it
      during the join operation. Worse, the multiplicative effect of issuing
      mutual events for dozens or hundreds of group members within a short time
      frame puts a heavy load on the topology server, with the end result that
      scale out operations on a big group tend to take much longer than needed.
      
      We now add a new filter option, TIPC_SUB_NO_STATUS, for topology server
      subscriptions, so that this initial avalanche of events is suppressed.
      This change, along with the previous commit, significantly improves the
      range and speed of group scale out operations.
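
The effect of the filter can be sketched as follows. The bit value of `SUB_NO_STATUS` and the function name are hypothetical and chosen only for illustration; the real TIPC_SUB_NO_STATUS definition is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SUB_NO_STATUS 0x80u /* hypothetical bit value, illustration only */

/* Number of PUBLISH events a new subscription receives for publications
 * that already exist in the binding table when it is created. */
static size_t initial_publish_events(uint32_t filter, size_t existing_matches)
{
	if (filter & SUB_NO_STATUS)
		return 0;
	return existing_matches;
}
```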
      
      We keep the new option internal for the tipc driver, at least for now.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: send out join messages as soon as new member is discovered · d12d2e12
      Jon Maloy authored
      When a socket is joining a group, we look up in the binding table to
      find if there are already other members of the group present. This is
      used for being able to return EAGAIN instead of EHOSTUNREACH if the
      user proceeds directly to a send attempt.
      
      However, the information in the binding table can be used to directly
      set the created member in state MBR_PUBLISHED and send a JOIN message
      to the peer, instead of waiting for a topology PUBLISH event to do this.
      When there are many members in a group, the propagation time for such
      events can be significant, and we can save time during the join
      operation if we use the initial lookup result fully.
      
      In this commit, we eliminate the member state MBR_DISCOVERED which has
      been the result of the initial lookup, and do instead go directly to
      MBR_PUBLISHED, which initiates the setup.
      
      After this change, the tipc_member FSM looks as follows:
      
           +-----------+
      ---->| PUBLISHED |-----------------------------------------------+
      PUB- +-----------+                                LEAVE/WITHDRAW |
      LISH       |JOIN                                                 |
                 |     +-------------------------------------------+   |
                 |     |                            LEAVE/WITHDRAW |   |
                 |     |                +------------+             |   |
                 |     |   +----------->|  PENDING   |---------+   |   |
                 |     |   |msg/maxactv +-+---+------+  LEAVE/ |   |   |
                 |     |   |              |   |       WITHDRAW |   |   |
                 |     |   |   +----------+   |                |   |   |
                 |     |   |   |revert/maxactv|                |   |   |
                 |     |   |   V              V                V   V   V
                 |   +----------+  msg  +------------+       +-----------+
                 +-->|  JOINED  |------>|   ACTIVE   |------>|  LEAVING  |--->
                 |   +----------+       +-----+------+ LEAVE/+-----------+DOWN
                 |        A   A               |      WITHDRAW A   A    A   EVT
                 |        |   |               |RECLAIM        |   |    |
                 |        |   |REMIT          V               |   |    |
                 |        |   |== adv   +------------+        |   |    |
                 |        |   +---------| RECLAIMING |--------+   |    |
                 |        |             +-----+------+  LEAVE/    |    |
                 |        |                   |REMIT   WITHDRAW   |    |
                 |        |                   |< adv              |    |
                 |        |msg/               V            LEAVE/ |    |
                 |        |adv==ADV_IDLE+------------+   WITHDRAW |    |
                 |        +-------------|  REMITTED  |------------+    |
                 |                      +------------+                 |
                 |PUBLISH                                              |
      JOIN +-----------+                                LEAVE/WITHDRAW |
      ---->|  JOINING  |-----------------------------------------------+
           +-----------+
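
A partial model of the diagram, written as a transition function, may help when reading it. This sketch covers only the entry paths, the JOINED to ACTIVE step and the LEAVE/WITHDRAW edges; the PENDING, RECLAIMING and REMITTED states are left out on purpose. All names, including the artificial `ST_NONE` start state, are hypothetical.

```c
#include <assert.h>

enum mbr_state { ST_NONE, ST_PUBLISHED, ST_JOINING, ST_JOINED, ST_ACTIVE, ST_LEAVING };
enum mbr_event { EV_PUBLISH, EV_JOIN, EV_MSG, EV_LEAVE_OR_WITHDRAW };

/* Return the next state for a subset of the member FSM shown above. */
static enum mbr_state mbr_next(enum mbr_state s, enum mbr_event e)
{
	/* Every existing member goes to LEAVING on LEAVE or WITHDRAW. */
	if (e == EV_LEAVE_OR_WITHDRAW)
		return s == ST_NONE ? ST_NONE : ST_LEAVING;

	switch (s) {
	case ST_NONE:
		if (e == EV_PUBLISH)
			return ST_PUBLISHED;
		if (e == EV_JOIN)
			return ST_JOINING;
		break;
	case ST_PUBLISHED:
		if (e == EV_JOIN)
			return ST_JOINED;
		break;
	case ST_JOINING:
		if (e == EV_PUBLISH)
			return ST_JOINED;
		break;
	case ST_JOINED:
		if (e == EV_MSG)
			return ST_ACTIVE;
		break;
	default:
		break;
	}
	return s; /* events not modelled here leave the state unchanged */
}
```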
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: simplify group LEAVE sequence · c2b22bcf
      Jon Maloy authored
      After the changes in the previous commit the group LEAVE sequence
      can be simplified.
      
      We now let the arrival of a LEAVE message unconditionally issue a group
      DOWN event to the user. When a topology WITHDRAW event is received, the
      member, if it is still there, is set to state LEAVING, but we only issue a
      group DOWN event when the link to the peer node is gone, so that no
      LEAVE message is to be expected.
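
The rule for when a group DOWN event reaches the user can be written as a small predicate. This is a sketch of the rule as worded above, with hypothetical names, and not the TIPC implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum peer_event { PEER_LEAVE_MSG, PEER_TOPOLOGY_WITHDRAW };

/* Return true if a group DOWN event should be issued to the user now. */
static bool should_issue_down(enum peer_event ev, bool link_to_peer_up)
{
	if (ev == PEER_LEAVE_MSG)
		return true; /* unconditional on LEAVE */

	/* WITHDRAW: the member moves to LEAVING. DOWN is issued only if
	 * the link is gone, so that no LEAVE message can still arrive. */
	return !link_to_peer_up;
}
```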
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: create group member event messages when they are needed · 7ad32bcb
      Jon Maloy authored
      In the current implementation, a group socket receiving topology
      events about other members just converts the topology event message
      into a group event message and stores it until it reaches the right
      state to issue it to the user. This complicates the code unnecessarily,
      and becomes impractical when, in the coming commits, we need to
      create and issue membership events independently.
      
      In this commit, we change this so that we just notice the type and
      origin of the incoming topology event, and then drop the buffer. Only
      when it is time to actually send a group event to the user do we
      explicitly create a new message and send it upwards.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
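The "note the event, build the message later" approach from this commit might look like the sketch below. It is a Python illustration under stated assumptions: the `Member` fields, the function names, and the mapping of PUBLISH to an "UP" event and WITHDRAW to a "DOWN" event are hypothetical and do not come from the kernel source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    origin: str                        # where the topology event came from
    noted_event: Optional[str] = None  # "PUBLISH" or "WITHDRAW"


def on_topology_event(member, event_type, buf):
    # Only the event type is recorded; the incoming buffer is not stored.
    member.noted_event = event_type


def make_group_event(member):
    # Called only when the member has reached a state where the event
    # may be delivered; a new message is created at that point.
    if member.noted_event is None:
        return None
    kind = "UP" if member.noted_event == "PUBLISH" else "DOWN"
    member.noted_event = None
    return {"type": kind, "origin": member.origin}
```

Because the message is built on demand, the same helper can also serve callers that need to issue a membership event without any incoming topology buffer.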
    • tipc: adjustment to group member FSM · 0233493a
      Jon Maloy authored
      Analysis reveals that the member state MBR_QURANTINED in reality is
      unnecessary, and can be replaced by the state MBR_JOINING at all
      occurrences.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: let group member stay in JOINED mode if unable to reclaim · 4ea5dab5
      Jon Maloy authored
      We handle a corner case in the function tipc_group_update_rcv_win().
      During extreme pressure it might happen that a message receiver has all
      its active senders in RECLAIMING or REMITTED mode, meaning that there
      is nobody to reclaim advertisements from if an additional sender tries
      to go active.
      
      Currently we just set the new sender to ACTIVE anyway, hence at least
      theoretically opening up for a receiver queue overflow by exceeding the
      MAX_ACTIVE limit. The correct solution to this is to instead add the
      member to the pending queue, while letting the oldest member in that
      queue revert to JOINED state.
      
      In this commit we refactor the code for handling message arrival from
      a JOINED member, both to make it more comprehensible and to cover the
      case described above.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
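The corner case in this commit can be sketched as a small model of what happens when a message arrives from a JOINED member. This is a simplified Python illustration, not the code in tipc_group_update_rcv_win(): the `Receiver` class, the value of `MAX_ACTIVE`, and the assumption that `active_cnt` keeps counting members that are being reclaimed are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto

MAX_ACTIVE = 2  # illustrative value only


class State(Enum):
    JOINED = auto()
    ACTIVE = auto()
    PENDING = auto()
    RECLAIMING = auto()


@dataclass
class Member:
    name: str
    state: State = State.JOINED


@dataclass
class Receiver:
    active: deque = field(default_factory=deque)   # ACTIVE members, oldest first
    pending: deque = field(default_factory=deque)  # PENDING members, oldest first
    active_cnt: int = 0  # assumed to include members still being reclaimed

    def on_msg_from_joined(self, m):
        if self.active_cnt < MAX_ACTIVE:
            m.state = State.ACTIVE
            self.active.append(m)
            self.active_cnt += 1
            return
        # Limit reached: the new sender waits in the pending queue.
        m.state = State.PENDING
        self.pending.append(m)
        if self.active:
            # Reclaim advertisements from the oldest ACTIVE member.
            self.active.popleft().state = State.RECLAIMING
            return
        # Nobody to reclaim from: the oldest pending member reverts to JOINED.
        self.pending.popleft().state = State.JOINED
```

If the pending queue was empty beforehand, the member that reverts is the new sender itself, so it simply stays in JOINED and the active count never exceeds the limit.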
    • tipc: a couple of cleanups · 8d5dee21
      Jon Maloy authored
      - We remove the 'reclaiming' member list in struct tipc_group, since
        it doesn't serve any purpose.
      
      - We simplify the GRP_REMIT_MSG branch of tipc_group_protocol_rcv().
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ethtool-ringparam-upper-bound' · a67c01e2
      David S. Miller authored
      Tariq Toukan says:
      
      ====================
      ethtool ringparam upper bound
      
      This patchset by Jenny adds sanity checks in ethtool ringparam
      operation for input upper bounds, similarly to what's done in
      ethtool_set_channels.
      
      The checks are added in patch 1, using a call to get_ringparam
      prior to calling set_ringparam NDO.
      
      Patch 2 changes the function's behavior in mlx4_en so that, like
      mlx5e, it returns an error for out-of-range input instead of
      rounding it to the closest valid value.
      
      Patch 3 removes the upper bound checks in mlx5e_ethtool_set_ringparam
      as they become redundant.
      
      Series generated against net-next commit:
      f66faae2 Merge branch 'ipv6-ipv4-nexthop-align'
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Remove redundant checks in set_ringparam · bacc7943
      Eugenia Emantayev authored
      Since the checks are done in the upper-layer ethtool code,
      the checks in the driver are no longer needed.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: Align behavior of set ring size flow via ethtool · 7589fd5c
      Eugenia Emantayev authored
      In the current implementation, any requested RX/TX ring size value
      that is less than the minimum is silently rounded to the nearest valid
      value. Update this behavior to align with mlx5 behavior by printing a
      warning in dmesg and leaving the size unchanged.
      The kernel is responsible for verifying against the maximum.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: Ensure new ring parameters are within bounds during SRINGPARAM · 37e2d99b
      Eugenia Emantayev authored
      Add a sanity check to ensure that all requested ring parameters
      are within bounds, which should reduce errors in driver implementation.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
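The bounds check added here, which queries the device limits before calling the driver's set operation, can be illustrated with a short model. This is a Python sketch, not the ethtool core code: the field names follow struct ethtool_ringparam, while `FakeDriver`, `set_ringparam_checked`, and the choice of -EINVAL as the error value are assumptions made for the example.

```python
import errno
from dataclasses import dataclass


@dataclass
class RingParam:
    rx_pending: int = 0
    rx_mini_pending: int = 0
    rx_jumbo_pending: int = 0
    tx_pending: int = 0
    rx_max_pending: int = 0
    rx_mini_max_pending: int = 0
    rx_jumbo_max_pending: int = 0
    tx_max_pending: int = 0


class FakeDriver:
    """Hypothetical driver exposing get/set ring parameter callbacks."""

    def __init__(self, rx_max, tx_max):
        self.current = RingParam(rx_pending=256, tx_pending=256,
                                 rx_max_pending=rx_max, tx_max_pending=tx_max)

    def get_ringparam(self):
        return self.current

    def set_ringparam(self, req):
        self.current.rx_pending = req.rx_pending
        self.current.tx_pending = req.tx_pending
        return 0


def set_ringparam_checked(dev, req):
    # Ask the driver for its maximums first, then reject anything above them
    # before the driver's own set callback is ever invoked.
    limits = dev.get_ringparam()
    if (req.rx_pending > limits.rx_max_pending
            or req.rx_mini_pending > limits.rx_mini_max_pending
            or req.rx_jumbo_pending > limits.rx_jumbo_max_pending
            or req.tx_pending > limits.tx_max_pending):
        return -errno.EINVAL
    return dev.set_ringparam(req)
```

Doing the check once in the common layer means individual drivers no longer need to repeat the upper-bound test themselves.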
    • ixgbe: Drop l2_accel_priv data pointer from ring struct · 68ae7424
      Alexander Duyck authored
      The l2 acceleration private pointer isn't needed in the ring struct. It
      isn't really used anywhere other than to test and see if we are supporting
      an offloaded macvlan netdev, and it is much easier to test netdev for not
      being ixgbe based to verify that.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • ixgbe: Use ring values to test for Tx pending · 1489542b
      Alexander Duyck authored
      This patch simplifies the check for Tx pending traffic and makes it more
      holistic as there being any difference between next_to_use and
      next_to_clean is much more informative than if head and tail are equal, as
      it is possible for us to either not update tail, or not be notified of
      completed work in which case next_to_clean would not be equal to head.
      
      In addition the simplification makes it so that we don't have to read
      hardware which allows us to drop a number of variables that were previously
      being used in the call.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
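The software-only pending check described in this commit comes down to comparing two ring indices. The sketch below is a Python model under stated assumptions: the field names next_to_use and next_to_clean come from the commit message, while the `TxRing` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TxRing:
    next_to_use: int    # next descriptor software will fill
    next_to_clean: int  # next descriptor software will reclaim


def ring_tx_pending(ring):
    # Any gap between the two indices means work is still outstanding,
    # and no hardware head/tail register has to be read to see it.
    return ring.next_to_use != ring.next_to_clean


def any_tx_pending(rings):
    return any(ring_tx_pending(r) for r in rings)
```

Since both indices live in driver memory, the check still reports outstanding work when the tail was not updated or a completion has not yet been processed.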
    • ixgbe: Fix limitations on macvlan so we can support up to 63 offloaded devices · 4e039c16
      Alexander Duyck authored
      This change is a fix of the macvlan offload so that we correctly handle
      macvlan offloaded devices. Specifically we were configuring our limits based
      on the assumption that we were going to max out the RSS indices for every
      mode. As a result when we went to 15 or more macvlan interfaces we were
      forced into the 2 queue RSS mode on VFs even though they could have still
      supported 4.
      
      This change splits the logic up so that we limit either the total number of
      macvlan instances if DCB is enabled, or limit the number of RSS queues used
      per macvlan (instead of per pool) if SR-IOV is enabled. By doing this we
      can make best use of the part.
      
      In addition I have increased the maximum number of supported interfaces to
      63 with one queue per offloaded interface as this more closely reflects the
      actual values supported by the interface.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>