1. 16 Jun, 2016 23 commits
  2. 15 Jun, 2016 17 commits
    • David S. Miller's avatar
      Merge branch 'cxgb4-sriov-sysfs' · 60100978
      David S. Miller authored
      Hariprasad Shenai says:
      
      ====================
      Add SRIOV configuration via sysfs and few fixes
      
      This series adds support to configure SR-IOV via PCI sysfs interface,
      reduces resource allocation in kdump kernel by disabling offload. Also
      synchronize unicast and multicast mac address, even in the interface is in
      Promiscuous mode.
      
      This patch series has been created against net-next tree and includes
      patches on cxgb4 and cxgb4vf driver.
      
      We have included all the maintainers of respective drivers. Kindly review
      the change and let us know in case of any review comments.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      60100978
    • Hariprasad Shenai's avatar
      cxgb4/cxgb4vf: Synchronize all MAC addresses · d01f7abc
      Hariprasad Shenai authored
      Even if interface is in Promiscuous mode/Allmulti mode synchronize
      MAC addresses.
      Signed-off-by: default avatarHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d01f7abc
    • Hariprasad Shenai's avatar
      cxgb4: Enable SR-IOV configuration via PCI sysfs interface · b6244201
      Hariprasad Shenai authored
      Implement callback in the driver for the new PCI bus driver
      interface that allows the user to enable/disable SR-IOV
      virtual functions in a device via the sysfs interface.
      
      Deprecate module parameter used to configure SRIOV
      Signed-off-by: default avatarHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b6244201
    • Hariprasad Shenai's avatar
      cxgb4: Force cxgb4 driver as MASTER in kdump kernel · c5a8c0f3
      Hariprasad Shenai authored
      When is_kdump_kernel() is true, Forcing cxgb4 driver as Master so we can
      reinitialize the Firmware/Chip. Also reduce memory usage by disabling
      offload.
      Signed-off-by: default avatarHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c5a8c0f3
    • David S. Miller's avatar
      Merge branch 'sched_skb_free_defer' · 88da48f4
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net_sched: defer skb freeing while changing qdiscs
      
      qdiscs/classes are changed under RTNL protection and often
      while blocking BH and root qdisc spinlock.
      
      When lots of skbs need to be dropped, we free
      them under these locks causing TX/RX freezes,
      and more generally latency spikes.
      
      I saw spikes of 50+ ms on quite fast hardware...
      
      This patch series adds a simple queue protected by RTNL
      where skbs can be placed until RTNL is released.
      
      Note that this might also serve in the future for optional
      reinjection of packets when a qdisc is replaced.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      88da48f4
    • Eric Dumazet's avatar
      net_sched: sch_fq: defer skb freeing · fea02478
      Eric Dumazet authored
      sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fea02478
    • Eric Dumazet's avatar
      net_sched: sch_pie: defer skb freeing · db4879d9
      Eric Dumazet authored
      pie_change() can use rtnl_qdisc_drop() to benefit from
      deferred freeing.
      
      pie_reset() is already using qdisc_reset_queue()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      db4879d9
    • Eric Dumazet's avatar
      net_sched: sch_netem: defer skb freeing · 2f08a9a1
      Eric Dumazet authored
      rtnl_kfree_skbs() can be used in tfifo_reset()
      
      It would be nice if we could iterate through rb tree instead
      of removing one skb at a time, and build a single skb chain.
      But this is left for a future patch.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2f08a9a1
    • Eric Dumazet's avatar
      net_sched: sch_htb: defer skb freeing · a5a9f534
      Eric Dumazet authored
      Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
      instead of __skb_queue_purge() to defer skb freeing of internal
      queues.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a5a9f534
    • Eric Dumazet's avatar
      net_sched: sch_hhf: defer skb freeing · e7e424cd
      Eric Dumazet authored
      Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7e424cd
    • Eric Dumazet's avatar
      net_sched: fq_codel: defer skb freeing · ece5d4c7
      Eric Dumazet authored
      Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ece5d4c7
    • Eric Dumazet's avatar
      net_sched: sch_fq: defer skb freeing · e14ffdfd
      Eric Dumazet authored
      Both fq_change() and fq_reset() can use rtnl_kfree_skbs()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e14ffdfd
    • Eric Dumazet's avatar
      net_sched: sch_codel: defer skb freeing in codel_change() · b3d7e2b2
      Eric Dumazet authored
      codel_change() can use rtnl_qdisc_drop()
      to defer expensive skb freeing after locks are released.
      
      codel_reset() already has support for deferred skb freeing
      because it uses qdisc_reset_queue()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b3d7e2b2
    • Eric Dumazet's avatar
      net_sched: sch_choke: defer skb freeing · f9aed311
      Eric Dumazet authored
      choke_reset() and choke_change() can use rtnl_qdisc_drop()
      to defer expensive skb freeing after locks are released.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f9aed311
    • Eric Dumazet's avatar
      net_sched: add the ability to defer skb freeing · 1b5c5493
      Eric Dumazet authored
      qdisc are changed under RTNL protection and often
      while blocking BH and root qdisc spinlock.
      
      When lots of skbs need to be dropped, we free
      them under these locks causing TX/RX freezes,
      and more generally latency spikes.
      
      This commit adds rtnl_kfree_skbs(), used to queue
      skbs for deferred freeing.
      
      Actual freeing happens right after RTNL is released,
      with appropriate scheduling points.
      
      rtnl_qdisc_drop() can also be used in place
      of disc_drop() when RTNL is held.
      
      qdisc_reset_queue() and __qdisc_reset_queue() get
      the new behavior, so standard qdiscs like pfifo, pfifo_fast...
      have their ->reset() method automatically handled.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1b5c5493
    • Jon Paul Maloy's avatar
      tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy authored
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the member of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying  unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      35c55c98
    • David Ahern's avatar
      net: vrf: Update flags and features settings · 7889681f
      David Ahern authored
      1. Default VRF devices to not having a qdisc (IFF_NO_QUEUE). Users
         can add one as desired.
      
      2. Disable adding a VLAN to a VRF device.
      
      3. Enable offloads and hardware features similar to other logical
         devices (e.g., dummy, veth)
      
      Change provides a significant boost in TCP stream Tx performance,
      from ~2,700 Mbps to ~18,100 Mbps and makes throughput close to the
      performance without a VRF (18,500 Mbps). netperf TCP_STREAM benchmark
      using qemu with virtio+vhost for the NICs
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7889681f