1. 20 Jan, 2017 11 commits
    • Merge branch 'bus-agnostic-num-vf' · 07604628
      David S. Miller authored
      Phil Sutter says:
      
      ====================
      Retrieve number of VFs in a bus-agnostic way
      
      Previously, it was assumed that only PCI NICs would be capable of having
      virtual functions - with my proposed enhancement of dummy NIC driver
      implementing (fake) ones for testing purposes, this is no longer true.
      
      Discussion of said patch has led to the suggestion of implementing a
      bus-agnostic method for VF count retrieval so rtnetlink could work with
      both real VF-capable PCI NICs as well as my dummy modifications without
      introducing ugly hacks.
      
      The following series tries to achieve just that by introducing a bus
      type callback to retrieve a device's number of VFs, implementing this
      callback for PCI bus and finally adjusting rtnetlink to make use of the
      generalized infrastructure.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • device: Implement a bus agnostic dev_num_vf routine · 9af15c38
      Phil Sutter authored
      Now that pci_bus_type has num_vf callback set, dev_num_vf can be
      implemented in a bus type independent way and the check for whether a
      PCI device is being handled in rtnetlink can be dropped.
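
      The dispatch described above can be sketched in plain userspace C. This
      is only an illustration, not the kernel's code: struct device and
      struct bus_type are cut down to the fields needed here, and demo_bus,
      plain_bus and vf_count are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct device;

/* Cut-down stand-in for the kernel's struct bus_type. */
struct bus_type {
	const char *name;
	/* Optional callback: number of VFs of a device on this bus. */
	int (*num_vf)(struct device *dev);
};

/* Cut-down stand-in for struct device. */
struct device {
	const struct bus_type *bus; /* NULL if the device sits on no bus */
	int vf_count;               /* hypothetical per-device VF count */
};

/* Hypothetical bus implementation of the callback. */
static int demo_bus_num_vf(struct device *dev)
{
	return dev->vf_count;
}

static const struct bus_type demo_bus  = { .name = "demo",  .num_vf = demo_bus_num_vf };
static const struct bus_type plain_bus = { .name = "plain", .num_vf = NULL };

/* Bus-agnostic lookup: ask the bus if it has a callback, else report 0. */
static int dev_num_vf(struct device *dev)
{
	assert(dev != NULL);
	if (dev->bus && dev->bus->num_vf)
		return dev->bus->num_vf(dev);
	return 0;
}
```

      A caller such as rtnetlink then needs no knowledge of the bus type: a
      device on a bus without the callback, or on no bus at all, simply
      reports zero VFs.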
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • PCI: implement num_vf bus type callback · 02e0bea6
      Phil Sutter authored
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • device: bus_type: Introduce num_vf callback · 582a686f
      Phil Sutter authored
      This allows for bus types to implement their own method of retrieving
      the number of virtual functions a NIC on that type of bus supports.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: use hlist_entry_safe · 6c59ebd3
      Geliang Tang authored
      Use hlist_entry_safe() instead of open-coding it.
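
      For readers unfamiliar with the helper, the following userspace sketch
      contrasts the open-coded form with a NULL-tolerant macro. It relies on
      GNU C extensions (statement expressions, __typeof__), and struct
      demo_sock, open_coded, with_helper and owner_id are hypothetical names
      used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct hlist_node { struct hlist_node *next, **pprev; };

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Like container_of(), but yields NULL when the node pointer is NULL. */
#define hlist_entry_safe(ptr, type, member) \
	({ __typeof__(ptr) ptr_ = (ptr); \
	   ptr_ ? container_of(ptr_, type, member) : NULL; })

/* Hypothetical structure embedding a list node. */
struct demo_sock {
	int id;
	struct hlist_node node;
};

/* Open-coded variant: explicit NULL test at the call site. */
static struct demo_sock *open_coded(struct hlist_node *n)
{
	return n ? container_of(n, struct demo_sock, node) : NULL;
}

/* Same result through the helper macro. */
static struct demo_sock *with_helper(struct hlist_node *n)
{
	return hlist_entry_safe(n, struct demo_sock, node);
}

/* Returns the id of the owner of node n, or -1 for a NULL node. */
static int owner_id(struct hlist_node *n)
{
	struct demo_sock *sk = with_helper(n);

	assert(sk == open_coded(n)); /* both forms agree */
	return sk ? sk->id : -1;
}
```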
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre6: Clean up unused struct ipv6_tel_txoption definition · c10aa71b
      Jakub Sitnicki authored
      Commit b05229f4 ("gre6: Cleanup GREv6 transmit path, call common GRE
      functions") removed the ip6gre specific transmit function, but left the
      struct ipv6_tel_txoption definition. Clean it up.
      Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove bh disabling around percpu_counter accesses · c2a2efbb
      Eric Dumazet authored
      Shaohua Li made percpu_counter irq safe in commit 098faf58
      ("percpu_counter: make APIs irq safe")
      
      We can safely remove BH disable/enable sections around various
      percpu_counter manipulations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: hide unused warnings · 0a327889
      Arnd Bergmann authored
      The two new variables are only used inside of an #ifdef and cause
      harmless warnings when that is disabled:
      
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c: In function 'init_one':
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4646:9: error: unused variable 'port_vec' [-Werror=unused-variable]
      drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:4646:6: error: unused variable 'v' [-Werror=unused-variable]
      
      This adds another #ifdef around the declarations.
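
      The pattern can be sketched as follows. This is not the cxgb4 code:
      DEMO_FEATURE and init_one_demo are hypothetical, and the point is only
      that a variable used solely inside an #ifdef is declared under the same
      guard, so the compiler never sees an unused declaration.

```c
#include <assert.h>

/* Hypothetical feature switch; define it to compile the guarded path. */
/* #define DEMO_FEATURE 1 */

static int init_one_demo(int base)
{
#ifdef DEMO_FEATURE
	int port_vec = base * 2; /* declared only when it is actually used */
#endif
	int ret = base;

	assert(base >= 0);
#ifdef DEMO_FEATURE
	ret += port_vec;
#endif
	return ret;
}
```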
      
      Fixes: 96fe11f2 ("cxgb4: Implement ndo_get_phys_port_id for mgmt dev")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: Keep nexthop of multipath route on admin down · a1a22c12
      David Ahern authored
      IPv6 deletes route entries associated with multipath routes on an
      admin down where IPv4 does not. For example:
          $ ip ro ls vrf red
          unreachable default metric 8192
          1.1.1.0/24 metric 64
                  nexthop via 10.100.1.254  dev eth1 weight 1
                  nexthop via 10.100.2.254  dev eth2 weight 1
          10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.4
          10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4
      
          $ ip -6 ro ls vrf red
          2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
          2001:db8:2:: dev red proto none metric 0  pref medium
          2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
          2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024  pref medium
          2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024  pref medium
          ...
      
      Set link down:
          $ ip li set eth1 down
      
      IPv4 retains the multipath route but flags the eth1 nexthop as dead:
      
          $ ip ro ls vrf red
          unreachable default metric 8192
          1.1.1.0/24
                  nexthop via 10.100.1.16  dev eth1 weight 1 dead linkdown
                  nexthop via 10.100.2.16  dev eth2 weight 1
          10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4
      
      and IPv6 deletes the route as part of flushing all routes for the device:
      
          $ ip -6 ro ls vrf red
          2001:db8:2:: dev red proto none metric 0  pref medium
          2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
          2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024  pref medium
          ...
      
      Worse, on admin up of the device the multipath route has to be deleted
      to get this leg of the route re-added.
      
      If ignore_routes_with_linkdown is set, this patch keeps routes that are
      part of a multipath route and marks them with the dead and linkdown
      flags, making IPv4 and IPv6 behave consistently:
      
          $ ip -6 ro ls vrf red
          2001:db8:2:: dev red proto none metric 0  pref medium
          2001:db8:2::/120 dev eth2 proto kernel metric 256  pref medium
          2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 dead linkdown  pref medium
          2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024  pref medium
          ...
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: support __GFP_MEMALLOC for rx · dceeab0e
      Eric Dumazet authored
      Commit 04aeb56a ("net/mlx4_en: allocate non 0-order pages for RX
      ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
      that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
      
      As using memory reserves is a must in some situations (swap over NFS or
      iSCSI), this patch adds this flag.
      
      Note that this driver does not reuse pages (yet) so we do not have to
      add anything else.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "net: qcom/emac: configure the external phy to allow pause frames" · 8a43c052
      Timur Tabi authored
      This reverts commit 3e884493.
      
      With commit 529ed127 ("net: phy: phy drivers should not set
      SUPPORTED_[Asym_]Pause"), phylib now handles automatically enabling
      pause frame support in the PHY, and the MAC driver should follow suit.
      
      Since the EMAC driver does this, we no longer need to force
      pause frames support.
      Signed-off-by: Timur Tabi <timur@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 19 Jan, 2017 4 commits
  3. 18 Jan, 2017 25 commits
    • net: Remove usage of net_device last_rx member · 4a7c9726
      Tobias Klauser authored
      The network stack no longer uses the last_rx member of struct net_device
      since the bonding driver switched to use its own private last_rx in
      commit 9f242738 ("bonding: use last_arp_rx in slave_last_rx()").
      
      However, some drivers still (ab)use the field for their own purposes and
      some drivers just update it without actually using it.
      
      Previously, there was an accompanying comment for the last_rx member
      added in commit 4dc89133 ("net: add a comment on netdev->last_rx")
      which asked drivers not to update it, unless really needed. However,
      this comment was removed in commit f8ff080d ("bonding: remove
      useless updating of slave->dev->last_rx"), so some drivers added later
      on still did update last_rx.
      
      Remove all usage of last_rx and switch three drivers (sky2, atp and
      smc91c92_cs) which actually read and write it to use their own private
      copy in netdev_priv.
      
      Compile-tested with allyesconfig and allmodconfig on x86 and arm.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Mirko Lindner <mlindner@marvell.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: use cpu_switch instead of ds[0] · 9520ed8f
      Vivien Didelot authored
      Now that the DSA Ethernet switches are true Linux devices, the CPU
      switch is not necessarily the first one. If its address is higher than
      the second switch on the same MDIO bus, its index will be 1, not 0.
      
      Avoid any confusion by using dst->cpu_switch instead of dst->ds[0].
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: store CPU switch structure in the tree · b22de490
      Vivien Didelot authored
      Store a dsa_switch pointer to the CPU switch in the tree instead of only
      its index. This avoids the need to initialize it to -1.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: ti: davinci_cpdma: correct check on NULL in set rate · e33c2ef1
      Ivan Khoronzhuk authored
      Check "ch" on NULL first, then get ctlr.
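
      The ordering matters because reading a member through a pointer before
      testing it for NULL defeats the test. A minimal sketch of the corrected
      order, using hypothetical types and names (demo_chan, demo_ctlr,
      demo_chan_set_rate) rather than the driver's own:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical controller and channel types for illustration only. */
struct demo_ctlr { int max_rate; };
struct demo_chan { struct demo_ctlr *ctlr; int rate; };

static int demo_chan_set_rate(struct demo_chan *ch, int rate)
{
	struct demo_ctlr *ctlr;

	if (!ch)                 /* validate the pointer first ... */
		return -EINVAL;

	ctlr = ch->ctlr;         /* ... and only then dereference it */
	assert(ctlr != NULL);    /* assumed invariant of this sketch */

	if (rate > ctlr->max_rate)
		return -EINVAL;

	ch->rate = rate;
	return 0;
}
```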
      Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'vhost_net-batching' · e3e37e70
      David S. Miller authored
      Jason Wang says:
      
      ====================
      vhost_net tx batching
      
      This series tries to implement tx batching support for vhost. This is
      done by using MSG_MORE as a hint for the underlying socket. The backend
      (e.g. tap) can then batch the packets temporarily in a list and
      submit them all once the number of batched packets exceeds a limit.
      
      Tests show a clear improvement for guest pktgen over
      mlx4(noqueue) on host:
      
                                           Mpps  -+%
              rx-frames = 0                0.91  +0%
              rx-frames = 4                1.00  +9.8%
              rx-frames = 8                1.00  +9.8%
              rx-frames = 16               1.01  +10.9%
              rx-frames = 32               1.07  +17.5%
              rx-frames = 48               1.07  +17.5%
              rx-frames = 64               1.08  +18.6%
              rx-frames = 64 (no MSG_MORE) 0.91  +0%
      
      Changes from V4:
      - stick to NAPI_POLL_WEIGHT for rx-frames if the user specifies a value
        greater than it.
      Changes from V3:
      - use ethtool instead of module parameter to control the maximum
        number of batched packets
      - avoid overhead when MSG_MORE was not set and no packet is queued
      Changes from V2:
      - remove useless queue limitation check (and we don't drop any packet now)
      Changes from V1:
      - drop NAPI handler since we don't use NAPI now
      - fix the issues that may exceed max pending of zerocopy
      - more improvement on available buffer detection
      - move the limitation of batched packets from vhost to tuntap
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: rx batching · 5503fcec
      Jason Wang authored
      We can only process 1 packet at a time during sendmsg(). This often
      leads to bad cache utilization under heavy load. So this patch tries to do
      some batching during rx before submitting the packets to the host network
      stack. This is done by accepting MSG_MORE as a hint from the
      sendmsg() caller: if it is set, batch the packet temporarily in a
      linked list, and submit them all once MSG_MORE is cleared.
      
      Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:
      
                                       Mpps  -+%
          rx-frames = 0                0.91  +0%
          rx-frames = 4                1.00  +9.8%
          rx-frames = 8                1.00  +9.8%
          rx-frames = 16               1.01  +10.9%
          rx-frames = 32               1.07  +17.5%
          rx-frames = 48               1.07  +17.5%
          rx-frames = 64               1.08  +18.6%
          rx-frames = 64 (no MSG_MORE) 0.91  +0%
      
      Users can change the per-device number of batched packets through
      ethtool -C rx-frames. NAPI_POLL_WEIGHT is used as the upper limit
      to prevent bh from being disabled too long.
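
      The batching decision can be modelled in a few lines of userspace C.
      This is a simplified sketch, not the tun driver's code: packets are
      only counted rather than kept in a linked list, and struct demo_tun,
      DEMO_MAX_BATCH, demo_set_rx_frames and demo_tun_rx are hypothetical
      names. DEMO_MAX_BATCH stands in for NAPI_POLL_WEIGHT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_BATCH 64 /* stand-in for NAPI_POLL_WEIGHT */

struct demo_tun {
	int rx_batched;  /* batch limit, as set via "ethtool -C rx-frames" */
	int queued;      /* packets currently held back */
	int delivered;   /* packets handed to the network stack so far */
	int submissions; /* how many times a batch was handed over */
};

/* Clamp the requested limit to the upper bound. */
static void demo_set_rx_frames(struct demo_tun *t, int frames)
{
	assert(t != NULL && frames >= 0);
	t->rx_batched = frames > DEMO_MAX_BATCH ? DEMO_MAX_BATCH : frames;
}

/* Handle one incoming packet; 'more' mirrors the MSG_MORE hint. */
static void demo_tun_rx(struct demo_tun *t, bool more)
{
	assert(t != NULL);
	/*
	 * Deliver right away if batching is off, the sender has nothing
	 * more to send, or the batch already holds rx_batched packets.
	 */
	if (t->rx_batched == 0 || !more || t->queued == t->rx_batched) {
		t->delivered += t->queued + 1; /* held packets plus this one */
		t->queued = 0;
		t->submissions++;
		return;
	}
	t->queued++; /* hold the packet back for a later flush */
}
```

      With a limit of 2, three packets sent with the hint set are delivered
      in a single submission, and a final packet without the hint flushes
      whatever is still held.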
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vhost_net: tx batching · 0ed005ce
      Jason Wang authored
      This patch tries to utilize tuntap rx batching by peeking at the tx
      virtqueue during transmission: if there are more available buffers in
      the virtqueue, set the MSG_MORE flag as a hint for the backend (e.g. tuntap)
      to batch the packets.
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vhost: better detection of available buffers · 275bf960
      Jason Wang authored
      This patch tries to do several tweaks on vhost_vq_avail_empty() for a
      better performance:
      
      - check cached avail index first which could avoid userspace memory access.
      - using unlikely() for the failure of userspace access
      - check vq->last_avail_idx instead of cached avail index as the last
        step.
      
      This patch is needed for batching support, which needs to peek whether
      or not there's still available buffers in the ring.
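
      The three tweaks listed above can be illustrated with a userspace
      model. It is a sketch under assumptions, not the vhost code: guest
      memory is modelled by a plain pointer (NULL meaning a failed access),
      and struct demo_vq, demo_get_user and demo_vq_avail_empty are
      hypothetical names. unlikely() is defined here via the GNU
      __builtin_expect builtin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

struct demo_vq {
	uint16_t avail_idx;        /* cached copy of the guest's avail index */
	uint16_t last_avail_idx;   /* next entry the host will consume */
	const uint16_t *guest_idx; /* models guest memory; NULL = access fault */
	int guest_reads;           /* counts the expensive accesses */
};

/* Models the userspace access; returns 0 on success, -1 on a fault. */
static int demo_get_user(struct demo_vq *vq, uint16_t *out)
{
	vq->guest_reads++;
	if (!vq->guest_idx)
		return -1;
	*out = *vq->guest_idx;
	return 0;
}

static bool demo_vq_avail_empty(struct demo_vq *vq)
{
	uint16_t idx;

	assert(vq != NULL);

	/* 1. The cached index already shows pending buffers: no guest access. */
	if (vq->avail_idx != vq->last_avail_idx)
		return false;

	/* 2. A failed access is treated as the rare case. */
	if (unlikely(demo_get_user(vq, &idx)))
		return false;

	/* 3. Refresh the cache and compare against last_avail_idx last. */
	vq->avail_idx = idx;
	return vq->avail_idx == vq->last_avail_idx;
}
```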
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net:add one common config ARCH_WANT_RELAX_ORDER to support relax ordering · 1a8b6d76
      Mao Wenan authored
      Relaxed ordering (RO) is a feature of the 82599 NIC; enabling it can
      improve performance on some CPU architectures, such as SPARC.
      Currently the 82599 driver enables RO for only one CPU architecture
      (SPARC), which does not help other CPU architectures that also need
      the feature.
      This patch adds a common config option, CONFIG_ARCH_WANT_RELAX_ORDER, to
      enable RO, and defines it in the sparc Kconfig first.
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Reviewed-by: Alexander Duyck <alexander.duyck@gmail.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipv6-simplify-rt6_fill_node' · 1e48aac1
      David S. Miller authored
      David Ahern says:
      
      ====================
      net: ipv6: simplify rt6_fill_node
      
      Remove a couple of unnecessary input arguments to rt6_fill_node.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: remove prefix arg to rt6_fill_node · f8cfe2ce
      David Ahern authored
      The prefix arg to rt6_fill_node is non-0 in only 1 path - rt6_dump_route
      where a user is requesting a prefix only dump. Simplify rt6_fill_node
      by removing the prefix arg and moving the prefix check to rt6_dump_route.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: remove nowait arg to rt6_fill_node · fd61c6ba
      David Ahern authored
      All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and
      simplify rt6_fill_node accordingly.
      
      rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the
      nowait arg from it as well.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sctp-sender-side-stream-reconf-ssn-reset-request-chunk' · 1ce463dd
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: add sender-side procedures for stream reconf ssn reset request chunk
      
      Patch 6/6 is to implement sender-side procedures for the Outgoing
      and Incoming SSN Reset Request Parameter described in rfc6525
      section 5.1.2 and 5.1.3
      
      Patches 1-5/6 are ahead of it to define some apis and asoc members
      for it.
      
      Note that with this patchset, asoc->reconf_enable has no chance yet to
      be set, until the patch "sctp: add get and set sockopt for reconf_enable"
      is applied in the future, as we cannot just enable it while sctp is not
      yet capable of processing reconf chunks.
      
      v1->v2:
        - put these into a smaller group.
        - rename some temporary variables in the codes.
        - rename the titles of the commits and improve some changelogs.
      v2->v3:
        - re-split the patchset and make sure it has no dead codes for review.
      v3->v4:
        - move sctp_make_reconf() into patch 1/6 to avoid kbuild warning.
        - drop unused struct sctp_strreset_req.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: implement sender-side procedures for SSN Reset Request Parameter · 7f9d68ac
      Xin Long authored
      This patch is to implement sender-side procedures for the Outgoing
      and Incoming SSN Reset Request Parameter described in rfc6525 section
      5.1.2 and 5.1.3.
      
      It also adds sockopt SCTP_RESET_STREAMS from rfc6525 section 6.3.2
      for users.
      
      Note that the new asoc member strreset_outstanding makes sure that
      only one reconf request chunk is in flight, as rfc6525 section 5.1.1
      demands.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add sockopt SCTP_ENABLE_STREAM_RESET · 9fb657ae
      Xin Long authored
      This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set
      strreset_enable to indicate which reconf request type it supports,
      which is described in rfc6525 section 6.3.1.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add reconf_enable in asoc ep and netns · c28445c3
      Xin Long authored
      This patch adds a reconf_enable field to each of asoc, ep and netns
      to indicate if they support stream reset.
      
      When initializing, asoc reconf_enable gets its default value from ep
      reconf_enable, which in turn defaults to netns reconf_enable.
      
      It also adds reconf_capable to the asoc peer part to record whether the
      peer supports reconf_enable. The value is set if the ext params include
      reconf chunk support when processing the init chunk, just as rfc6525
      section 5.1.1 demands.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add stream reconf primitive · 7a090b04
      Xin Long authored
      This patch adds a primitive based on the sctp primitive frame for
      sending a stream reconf request. It works like the other primitives,
      and creates a SCTP_CMD_REPLY command to send the request chunk out.
      
      sctp_primitive_RECONF would be the api to send a reconf request
      chunk.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add stream reconf timer · 7b9438de
      Xin Long authored
      This patch is to add a per transport timer based on sctp timer frame
      for stream reconf chunk retransmission. It would start after sending
      a reconf request chunk, and stop after receiving the response chunk.
      
      If the timer expires, besides retransmitting the reconf request chunk,
      it also does the same things as the data RTO timer, such as increasing
      the appropriate error counts and performing threshold management, possibly
      destroying the asoc if sctp retransmission thresholds are exceeded, just
      as section 5.1.1 describes.
      
      This patch is also to add asoc strreset_chunk, it is used to save the
      reconf request chunk, so that it can be retransmitted, and to check if
      the response is really for this request by comparing the information
      inside with the response chunk as well.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add support for generating stream reconf ssn reset request chunk · cc16f00f
      Xin Long authored
      This patch adds asoc strreset_outseq and strreset_inseq for
      saving the reconf request sequence, initializes them when creating
      the assoc and processing init, and also defines the Incoming and Outgoing
      SSN Reset Request Parameters described in rfc6525 sections 4.1 and
      4.2. As they can be in the same chunk, as rfc6525 section 3.1-3
      describes, one function builds both of them.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'rework-inet_csk_get_port' · b16ed2b1
      David S. Miller authored
      Josef Bacik says:
      
      ====================
      Rework inet_csk_get_port
      
      V3->V4:
      -Removed the random include of addrconf.h that is no longer needed.
      
      V2->V3:
      -Dropped the fastsock from the tb and instead just carry the saddrs, family, and
       ipv6 only flag.
      -Reworked the helper functions to deal with this change so I could still use
       them when checking the fast path.
      -Killed tb->num_owners as per Eric's request.
      -Attached a reproducer to the bottom of this email.
      
      V1->V2:
      -Added a new patch 'inet: collapse ipv4/v6 rcv_saddr_equal functions into one'
       at Hannes' suggestion.
      -Dropped ->bind_conflict and just use the new helper.
      -Fixed a compile bug from the original ->bind_conflict patch.
      
      The original description of the series follows:
      
      At some point recently the guys working on our load balancer added the ability
      to use SO_REUSEPORT.  When they restarted their app with this option enabled
      they immediately hit a softlockup on what appeared to be the
      inet_bind_bucket->lock.  Eventually what all of our debugging and discussion led
      us to was the fact that the application comes up without SO_REUSEPORT, shuts
      down which creates around 100k twsk's, and then comes up and tries to open a
      bunch of sockets using SO_REUSEPORT, which meant traversing the inet_bind_bucket
      owners list under the lock.  Since this lock is needed for dealing with the
      twsk's and basically anything else related to connections we would softlockup,
      and sometimes not ever recover.
      
      To solve this problem I did what you see in Patch 5/5.  Once we have a
      SO_REUSEPORT socket on the tb->owners list we know that the socket has no
      conflicts with any of the other sockets on that list.  So we can add a copy of
      the sock_common (really all we need is the recv_saddr but it seemed ugly to copy
      just the ipv6, ipv4, and flag to indicate if we were ipv6 only in there so I've
      copied the whole common) in order to check subsequent SO_REUSEPORT sockets.  If
      they match the previous one then we can skip the expensive
      inet_csk_bind_conflict check.  This is what eliminated the soft lockup that we
      were seeing.
      
      Patches 1-4 are cleanups and re-workings.  For instance when we specify port ==
      0 we need to find an open port, but we would do two passes through
      inet_csk_bind_conflict every time we found a possible port.  We would also keep
      track of the smallest_port value in order to try and use it if we found no
      port our first run through.  This however made no sense as it would have had to
      fail the first pass through inet_csk_bind_conflict, so would not actually pass
      the second pass through either.  Finally I split the function into two functions
      in order to make it easier to read and to distinguish between the two behaviors.
      
      I have tested this on one of our load balancing boxes during peak traffic and it
      hasn't fallen over.  But this is not my area, so obviously feel free to point
      out where I'm being stupid and I'll get it fixed up and retested.  Thanks,
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: reset tb->fastreuseport when adding a reuseport sk · 637bc8bb
      Josef Bacik authored
      If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
      never set it again.  Which means that in the future if we end up adding a bunch
      of reuseport sk's to that tb we'll have to do the expensive scan every time.
      Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
      so we know what comparison to make, and the ipv6 only setting so we can make
      sure to compare with new sockets appropriately.  Once one sk has made it onto
      the list we know that there are no potential bind conflicts on the owners list
      that match that sk's rcv_addr.  So copy the sk's information into our bind
      bucket and set tb->fastreuseport to FASTREUSESOCK_STRICT so we know we have to do
      an extra check for subsequent reuseport sockets and skip the expensive bind
      conflict check.
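
      The fast path described above can be sketched as a small userspace
      model. It is deliberately simplified and is not the kernel's logic: it
      handles IPv4 addresses only, ignores the family and ipv6-only checks,
      and the owners-list walk is a placeholder that never finds a conflict.
      All names (demo_sk, demo_bucket, demo_bind_conflict, demo_bind) are
      hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_sk {
	uint32_t rcv_saddr; /* bound IPv4 address */
	bool reuseport;     /* SO_REUSEPORT set on the socket */
};

struct demo_bucket {
	bool fast_strict;    /* models the "strict" fastreuseport state */
	uint32_t fast_saddr; /* address copied from the first reuseport sk */
	int expensive_scans; /* how often the owners list had to be walked */
};

/* Placeholder for the expensive owners-list walk; reports no conflict. */
static bool demo_bind_conflict(struct demo_bucket *tb, const struct demo_sk *sk)
{
	(void)sk;
	tb->expensive_scans++;
	return false;
}

/* Returns true if the socket may bind to this bucket's port. */
static bool demo_bind(struct demo_bucket *tb, const struct demo_sk *sk)
{
	assert(tb != NULL && sk != NULL);

	/* Fast path: same address as a reuseport sk already on the list. */
	if (sk->reuseport && tb->fast_strict && tb->fast_saddr == sk->rcv_saddr)
		return true;

	if (demo_bind_conflict(tb, sk))
		return false;

	/* Remember the first reuseport socket for later comparisons. */
	if (sk->reuseport && !tb->fast_strict) {
		tb->fast_strict = true;
		tb->fast_saddr = sk->rcv_saddr;
	}
	return true;
}
```

      In this model the second reuseport socket with the same address skips
      the scan entirely, while a different address or a non-reuseport socket
      still takes the slow path.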
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: split inet_csk_get_port into two functions · 289141b7
      Josef Bacik authored
      inet_csk_get_port does two different things, it either scans for an open port,
      or it tries to see if the specified port is available for use.  Since these two
      operations have different rules and are basically independent let's split them
      into two different functions to make them both more readable.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: don't check for bind conflicts twice when searching for a port · 6cd66616
      Josef Bacik authored
      This is just wasted time, we've already found a tb that doesn't have a bind
      conflict, and we don't drop the head lock so scanning again isn't going to give
      us a different answer.  Instead move the tb->reuse setting logic outside of the
      found_tb path and put it in the success: path.  Then make it so that we don't
      goto again if we find a bind conflict in the found_tb path as we won't reach
      this anymore when we are scanning for an ephemeral port.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: kill smallest_size and smallest_port · b9470c27
      Josef Bacik authored
      In inet_csk_get_port we seem to be using smallest_port to figure out where the
      best place to look for a SO_REUSEPORT sk that matches with an existing set of
      SO_REUSEPORT's.  However if we get to the logic
      
      if (smallest_size != -1) {
      	port = smallest_port;
      	goto have_port;
      }
      
      we will do a useless search, because we would have already done the
      inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
      would have gone to found_tb and succeeded.  Since this logic makes us do yet
      another trip through inet_csk_bind_conflict for a port we know won't work just
      delete this code and save us the time.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: drop ->bind_conflict · aa078842
      Josef Bacik authored
      The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
      is how they check the rcv_saddr, so delete this callback and simply
      change inet_csk_bind_conflict to call inet_rcv_saddr_equal.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>