1. 20 Jul, 2016 37 commits
    • net/mlx4_en: add xdp forwarding and data write support · 9ecc2d86
      Brenden Blanco authored
      A user will now be able to loop packets back out of the same port using
      a bpf program attached to the xdp hook. Updates to the packet contents
      from the bpf program are also supported.
      
      For the packet write feature to work, the rx buffers are now mapped as
      bidirectional when the page is allocated. This occurs only when the xdp
      hook is active.
      
      When the program returns a TX action, enqueue the packet directly to a
      dedicated tx ring, so as to avoid locking entirely. This requires the tx
      ring to be allocated 1:1 for each rx ring, and the tx completion to run
      in the same softirq.
      
      Upon tx completion, this dedicated tx ring recycles pages directly back
      to the original rx ring without unmapping them. In a steady-state tx/drop
      workload, effectively 0 page allocs/frees will occur.
      
      In order to separate out the paths between free and recycle, a
      free_tx_desc func pointer is introduced that is optionally updated
      whenever recycle_ring is activated. By default the original free
      function is always initialized.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
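The free/recycle split described in this commit can be sketched as a small model. This is an illustration only, not the mlx4 code: the class `TxRing`, its methods, and the use of plain lists for rings are hypothetical; only the names `free_tx_desc` and `recycle_ring` come from the commit message.

```python
class TxRing:
    """Hypothetical tx ring whose completion handler is a swappable pointer."""

    def __init__(self):
        self.freed = []            # pages released through the normal path
        self.recycle_ring = None   # rx ring to hand pages back to, if any
        # By default the original free function is installed.
        self.free_tx_desc = self._free

    def _free(self, page):
        # Normal path: the page would be unmapped and freed here.
        self.freed.append(page)

    def _recycle(self, page):
        # XDP path: hand the still-mapped page straight back to the rx ring.
        self.recycle_ring.append(page)

    def activate_recycle(self, rx_pages):
        # Switching the pointer once keeps the completion path branch-free.
        self.recycle_ring = rx_pages
        self.free_tx_desc = self._recycle

    def complete(self, page):
        # Tx completion only ever calls through the pointer.
        self.free_tx_desc(page)
```

Calling `activate_recycle()` once at setup time means the per-packet completion path never has to test whether recycling is enabled.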
    • net/mlx4_en: break out tx_desc write into separate function · 224e92e0
      Brenden Blanco authored
      In preparation for writing the tx descriptor from multiple functions,
      create a helper for both normal and blueflame access.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add XDP_TX xdp_action for direct forwarding · 6ce96ca3
      Brenden Blanco authored
      XDP enabled drivers must transmit received packets back out on the same
      port they were received on when a program returns this action.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_en: add page recycle to prepare rx ring for tx support · d576acf0
      Brenden Blanco authored
      The mlx4 driver by default allocates order-3 pages for the ring to
      consume in multiple fragments. When the device has an xdp program, this
      behavior will prevent tx actions since the page must be re-mapped in
      TODEVICE mode, which cannot be done if the page is still shared.
      
      Start by making the allocator configurable based on whether xdp is
      running, such that order-0 pages are always used and never shared.
      
      Since this will stress the page allocator, add a simple page cache to
      each rx ring. Pages in the cache are left dma-mapped, and in drop-only
      stress tests the page allocator is eliminated from the perf report.
      
      Note that setting an xdp program will now require the rings to be
      reconfigured.
      
      Before:
       26.91%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
       17.88%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
        6.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
        4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
        3.21%  swapper      [kernel.vmlinux]  [k] intel_idle
        2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
        2.57%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
      
      After:
       31.72%  swapper      [kernel.vmlinux]       [k] intel_idle
        8.79%  swapper      [mlx4_en]              [k] mlx4_en_process_rx_cq
        7.54%  swapper      [kernel.vmlinux]       [k] poll_idle
        6.36%  swapper      [mlx4_core]            [k] mlx4_eq_int
        4.21%  swapper      [kernel.vmlinux]       [k] tasklet_action
        4.03%  swapper      [kernel.vmlinux]       [k] cpuidle_enter_state
        3.43%  swapper      [mlx4_en]              [k] mlx4_en_prepare_rx_desc
        2.18%  swapper      [kernel.vmlinux]       [k] native_irq_return_iret
        1.37%  swapper      [kernel.vmlinux]       [k] menu_select
        1.09%  swapper      [kernel.vmlinux]       [k] bpf_map_lookup_elem
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
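To illustrate why a per-ring page cache removes the page allocator from a drop-only profile, here is a minimal sketch. The class `RxPageCache`, its capacity of 4, and the counters are assumptions made for the example; they do not reflect the driver's actual data structures or sizes.

```python
class RxPageCache:
    """Hypothetical per-rx-ring cache of order-0 pages that stay dma-mapped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = []
        self.allocs = 0   # trips to the page allocator
        self.frees = 0    # pages handed back to the page allocator

    def get(self):
        if self.pages:
            return self.pages.pop()   # reuse a cached, still-mapped page
        self.allocs += 1
        return object()               # stands in for a freshly mapped page

    def put(self, page):
        if len(self.pages) < self.capacity:
            self.pages.append(page)   # keep the mapping for the next packet
        else:
            self.frees += 1           # cache full: really free the page


# Drop-only workload: every received page is dropped and returned at once.
cache = RxPageCache(capacity=4)
for _ in range(1000):
    cache.put(cache.get())
```

In this model only the very first packet reaches the allocator; every later packet reuses the cached page.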
    • Add sample for adding simple drop program to link · 86af8b41
      Brenden Blanco authored
      Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
      hook of a link. With the drop-only program, observed single core rate is
      ~20Mpps.
      
      Other tests were run, for instance without the dropcnt increment or
      without reading from the packet header; the packet rate was mostly
      unchanged.
      
      $ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
      proto 17:   20403027 drops/s
      
      ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
      Running... ctrl^C to stop
      Device: eth4@0
      Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags)
        5056638pps 2427Mb/sec (2427186240bps) errors: 0
      Device: eth4@1
      Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags)
        5133311pps 2463Mb/sec (2463989280bps) errors: 0
      Device: eth4@2
      Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags)
        5077431pps 2437Mb/sec (2437166880bps) errors: 0
      Device: eth4@3
      Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags)
        5043067pps 2420Mb/sec (2420672160bps) errors: 0
      
      perf report --no-children:
       26.05%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
       17.84%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
        5.52%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
        4.90%  swapper      [kernel.vmlinux]  [k] poll_idle
        4.14%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
        2.78%  ksoftirqd/0  [kernel.vmlinux]  [k] __free_pages_ok
        2.57%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
        2.51%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
        1.94%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
        1.45%  swapper      [mlx4_en]         [k] mlx4_en_alloc_frags
        1.35%  ksoftirqd/0  [kernel.vmlinux]  [k] free_one_page
        1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
        1.04%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5c5
        0.96%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c58d
        0.93%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6ee
        0.92%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6b9
        0.89%  ksoftirqd/0  [kernel.vmlinux]  [k] __alloc_pages_nodemask
        0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c686
        0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5d5
        0.78%  ksoftirqd/0  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
        0.77%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5b4
        0.77%  ksoftirqd/0  [kernel.vmlinux]  [k] net_rx_action
      
      machine specs:
       receiver - Intel E5-1630 v3 @ 3.70GHz
       sender - Intel E5645 @ 2.40GHz
       Mellanox ConnectX-3 @40G
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
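The figures in this commit message can be cross-checked with a little arithmetic. The sketch below only uses numbers quoted above; the variable names are ours.

```python
# Per-thread pktgen send rates (pps) for eth4@0..eth4@3, as quoted above.
sender_pps = [5056638, 5133311, 5077431, 5043067]
# Per-thread bit rates reported by pktgen on the same lines.
reported_bps = [2427186240, 2463989280, 2437166880, 2420672160]
# Drop rate reported by samples/bpf/xdp1 on the receiver.
receiver_drops_per_sec = 20403027

frame_bytes = 60
total_pps = sum(sender_pps)

# pktgen's bps figure is pps * 60 bytes * 8 bits for each thread.
computed_bps = [pps * frame_bytes * 8 for pps in sender_pps]

# Relative gap between the summed send rate and the receiver's drop rate.
gap = abs(receiver_drops_per_sec - total_pps) / total_pps
```

The four sender threads add up to roughly 20.3 Mpps, which is within half a percent of the ~20.4 Mdrops/s seen on the receiver, so the single receiving core keeps up with everything the sender generates.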
    • net/mlx4_en: add support for fast rx drop bpf program · 47a38e15
      Brenden Blanco authored
      Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
      
      In tc/socket bpf programs, helpers linearize skb fragments as needed
      when the program touches the packet data. However, in the pursuit of
      speed, XDP programs will not be allowed to use these slower functions,
      especially if they involve allocating an skb.
      
      Therefore, disallow MTU settings that would produce a multi-fragment
      packet that XDP programs would fail to access. Future enhancements could
      be done to increase the allowable MTU.
      
      The xdp program is present as a per-ring data structure, but as of yet
      it is not possible to set at that granularity through any ndo.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnl: add option for setting link xdp prog · d1fdd913
      Brenden Blanco authored
      Sets the bpf program represented by fd as an early filter in the rx path
      of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
      Providing a negative value as fd clears the program. Getting the fd back
      via rtnl is not possible, so reading this value merely returns a bool
      indicating whether the program is valid on the link or not.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add ndo to setup/query xdp prog in adapter rx · a7862b45
      Brenden Blanco authored
      Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP
      filter. The single op is used for both setup/query of the xdp program,
      modelled after ndo_setup_tc.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add XDP prog type for early driver filter · 6a773a15
      Brenden Blanco authored
      Add a new bpf prog type that is intended to run in early stages of the
      packet rx path. Only minimal packet metadata will be available, hence a
      new context type, struct xdp_md, is exposed to userspace. So far only
      expose the packet start and end pointers, and only in read mode.
      
      An XDP program must return one of the well-known enum values; all other
      return codes are reserved for future use. Unfortunately, this
      restriction is hard to enforce at verification time, so take the
      approach of warning at runtime when such programs are encountered.
      Out-of-bounds return codes should alias to XDP_ABORTED.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
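The return-code rule in this commit (warn at runtime, treat unknown codes as XDP_ABORTED) can be modelled as follows. This is a sketch under assumptions: the numeric values are what we believe the uapi enum uses and are not stated in this log, and `resolve_action` is a hypothetical helper, not a kernel function.

```python
from enum import IntEnum


class XdpAction(IntEnum):
    # Numeric values are an assumption for this sketch.
    XDP_ABORTED = 0
    XDP_DROP = 1
    XDP_PASS = 2
    XDP_TX = 3   # introduced by 6ce96ca3, listed earlier in this log


def resolve_action(ret, warn=print):
    """Map a program's raw return code to an action (hypothetical helper)."""
    try:
        return XdpAction(ret)
    except ValueError:
        # Cannot be rejected at verification time, so warn at runtime
        # and alias the out-of-bounds code to XDP_ABORTED.
        warn(f"unknown XDP return code {ret}, treating as XDP_ABORTED")
        return XdpAction.XDP_ABORTED
```

Aliasing to XDP_ABORTED keeps codes that are reserved for future use from being silently interpreted as a pass or a forward.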
    • bpf: add bpf_prog_add api for bulk prog refcnt · 59d3656d
      Brenden Blanco authored
      A subsystem may need to store many copies of a bpf program, each
      deserving its own reference. Rather than requiring the caller to loop
      one by one (with possible mid-loop failure), add a bulk bpf_prog_add
      api.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
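The advantage of a bulk reference API over a one-by-one loop is that it either takes all references or none. A minimal model, assuming a reference cap (`MAX_REFCNT` here is an illustrative value) and using hypothetical names `Prog` and `prog_add`:

```python
MAX_REFCNT = 32768  # illustrative cap, assumed for this sketch


class Prog:
    """Stand-in for a bpf program; created holding one reference."""

    def __init__(self):
        self.refcnt = 1


def prog_add(prog, n):
    """Take n extra references at once, all-or-nothing (hypothetical model)."""
    if prog.refcnt + n > MAX_REFCNT:
        # Nothing was taken, so the caller has nothing to unwind.
        raise OverflowError("too many references")
    prog.refcnt += n
    return prog
```

A driver that stores one copy of the program per rx ring could, for example, call `prog_add(prog, rings - 1)` once, since it already holds the first reference; with a loop, a failure on the fifth ring would require releasing the four references already taken.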
    • Merge branch 'ncsi' · ddbcb794
      David S. Miller authored
      Gavin Shan says:
      
      ====================
      NCSI Support
      
      This series rebases on David's linux-net git repo ("master" branch). It
      adds support for the NCSI stack to drivers/net/ethernet/faraday/ftgmac100.c.
      The implementation is based on the NCSI spec (version: 1.1.0):
      https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
      
      As the following figure shows, and as defined in the NCSI spec:
      
       * NC-SI (aka NCSI) is defined as the interface between a (Base)
         Management Controller (BMC) and one or multiple Network Interface
         Controllers (NICs) on the host side. The interface is responsible for
         providing external network connectivity for the BMC.
       * Each BMC can connect to multiple packages, up to 8. Each package can have
         multiple channels, up to 32. Packages and channels are identified by
         3-bit and 5-bit IDs, respectively, in the NCSI packet.
       * An NCSI packet, encapsulated in an ethernet frame, has 0x88F8 in the
         protocol field. The destination MAC address should be all 0xFF's, while
         the source MAC address can be arbitrary.
       * NCSI packets are classified as command, response, or AEN (Asynchronous
         Event Notification). Commands are sent from the BMC to the host (NIC) for
         configuration and information retrieval. Responses, corresponding to
         commands, are sent from the host to the BMC for confirmation and requested
         information. One command should have one and only one response. An AEN is
         sent from the host to the BMC for notification (e.g. link down on the
         active channel) so that the BMC can take appropriate action.
      
         +------------------+        +----------------------------------------------+
         |                  |        |                     Host                     |
         |        BMC       |        |                                              |
         |                  |        | +-------------------+  +-------------------+ |
         |    +---------+   |        | |     Package-A     |  |     Package-B     | |
         |    |         |   |        | +---------+---------+  +-------------------+ |
         |    |ftgmac100|   |        | | Channel | Channel |  | Channel | Channel | |
         +----+----+----+---+        +-+---------+---------+--+---------+---------+-+
                   |                             |                      |
                   |                             |                      |
                   +-----------------------------+----------------------+
      
      The design of the patchset is highlighted below:
      
       * The network driver uses 3 interfaces exported from NCSI stack:
         ncsi_register_dev() - Register (create) an associated NCSI device.
         ncsi_start_dev() - Bring up the NCSI device.
         ncsi_unregister_dev() - Destroy the registered NCSI device.
       * There are several data structures introduced for different objects:
         struct ncsi_dev - NCSI device seen by network device driver.
         struct ncsi_dev_priv - NCSI device seen by NCSI stack.
         struct ncsi_package - NCSI package which can have multiple channels.
         struct ncsi_channel - NCSI channel.
       * The NCSI stack is driven by workqueue and state machine internally.
       * All available NCSI packages and channels are enumerated (probed) on
         the first call to ncsi_start_dev(). The NCSI topology won't change until
         the NCSI device is destroyed.
       * All available channels will be brought up when hardware arbitration
         is enabled. Otherwise, only one channel is selected as the active one.
         Internally, NCSI is driven by a state machine with the help of a
         workqueue. Each channel has 3 states and can be put into a queue to
         request configuration or suspending. Channels in the queue in the
         inactive state will be configured (bringup), while channels in the
         queue in the active state will be suspended (teardown). A channel is in
         the invisible state while the requested configuration or suspending is
         being applied to it.
       * Failover, where another inactive channel is selected as active, can
         happen when hardware arbitration is disabled. Failover can be caused by
         a timeout on the link monitor or by an AEN.
       * The NCSI stack should be configurable through netlink or another
         mechanism, but this is not implemented in this patchset. It's something
         TBD.
       * The first NIC driver that is aware of NCSI: drivers/net/ethernet/faraday/ftgmac100.c
      
      Changelog
      =========
      v2 -> v3:
       * Include (one line) change in include/uapi/linux/if_ether.h to fix build
         error.
      v1 -> v2:
       * Support NCSI spec v1.1.0 (3 more commands and 4 hardware arbitration
         modes added).
       * Enable AEN packets according to the supported list.
       * Introduce NCSI channel states and processing queue in order to support
         the hardware arbitration.
       * The hardware arbitration is supported (tested with emulated environment).
       * Introduce link monitor with GLS (Get Link Status) command/response as part
         of the error handling defined in NCSI spec.
       * Support IPv6 address discovery when CONFIG_IPV6 is enabled.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/faraday: Mask PHY interrupt with NCSI mode · fc6061cf
      Gavin Shan authored
      Bogus PHY interrupts are observed. This masks the PHY interrupt
      when the interface works in NCSI mode, as there is no attached
      PHY in that case.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/faraday: Match driver according to compatible property · bb168e2e
      Gavin Shan authored
      This matches the driver with devices compatible with "faraday,ftgmac100"
      declared in the device tree. Originally, the device's name from the
      device tree was used for matching.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/faraday: Support NCSI mode · bd466c3f
      Gavin Shan authored
      This makes the ftgmac100 driver support NCSI mode. NCSI is enabled
      on the interface if the property "use-nc-si" or "use-ncsi" is found in
      the device node in the device tree.
      
         * No PHY device is used when NCSI mode is enabled.
         * The NCSI device (struct ncsi_dev) is created when probing the
           device, while it's enabled/started when the interface is brought
           up.
         * Hardware IP checksum doesn't work when NCSI mode is enabled. It
           is disabled when NCSI is enabled.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/faraday: Read MAC address from chip · 113ce107
      Gavin Shan authored
      The device is assigned a random MAC address, which isn't reasonable.
      A valid MAC address might have been provided by (U-Boot) firmware via the
      device tree or in the chip. It's reasonable to use it to maintain consistency.
      
      This uses the MAC address from the device tree, or the one in the chip, if
      it's valid. Otherwise, a random MAC address is given as before.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
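The fallback order described in this commit can be sketched as below. The function names are hypothetical, the order "device tree first, then chip" is our reading of the message, and "valid" is assumed to mean a 6-byte address that is neither all-zero nor multicast.

```python
import os


def is_valid_ether_addr(mac):
    """Assumed validity rule: 6 bytes, not all-zero, not multicast."""
    return len(mac) == 6 and any(mac) and not (mac[0] & 0x01)


def random_ether_addr():
    raw = bytearray(os.urandom(6))
    raw[0] &= 0xFE   # clear the multicast bit
    raw[0] |= 0x02   # mark the address as locally administered
    return bytes(raw)


def pick_mac(dt_mac, chip_mac):
    """Prefer the device-tree address, then the chip's, else a random one."""
    for candidate in (dt_mac, chip_mac):
        if candidate is not None and is_valid_ether_addr(candidate):
            return candidate
    return random_ether_addr()
```

Passing `None` for a source models the case where firmware did not provide an address there.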
    • net/faraday: Helper functions to create or destroy MDIO interface · eb418184
      Gavin Shan authored
      This introduces two helper functions to create or destroy the MDIO
      interface. No logical changes are introduced, except that proper MDIO
      names are given when there is more than one MDIO bus.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ncsi: NCSI AEN packet handler · 7a82ecf4
      Gavin Shan authored
      This introduces NCSI AEN packet handlers that result in either (A) the
      currently active channel being reconfigured, or (B) the currently
      active channel being deconfigured and disabled, and another channel
      being chosen as the active one and configured. Case (B) won't happen if
      hardware arbitration has been enabled; the channel that was in the
      active state is simply suspended.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ncsi: Package and channel management · e6f44ed6
      Gavin Shan authored
      This manages NCSI packages and channels:
      
       * The available packages and channels are enumerated on the first
         call to ncsi_start_dev(). The channels' capabilities are probed
         at the same time. The NCSI network topology won't change
         until the NCSI device is destroyed.
       * There is a queue in every NCSI device. Each element in the queue,
         a channel, is waiting for configuration (bringup) or suspending
         (teardown). The channel's state (inactive/active) indicates which
         further action (configuration or suspending) will be applied to the
         channel. Another channel state (invisible) means the requested
         action is being applied.
       * Hardware arbitration will be enabled if all available packages
         and channels support it. All available channels try to provide
         service when hardware arbitration is enabled. Otherwise, only one
         channel is selected as the active one at a time.
       * When a channel is in the active state, meaning it's providing service,
         a timer is started to retrieve the channel's link status. If the
         channel's link status fails to be updated in the determined period,
         the channel is going to be reconfigured. This is the error handling
         implementation as defined in the NCSI spec.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
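The per-device channel queue described in this commit can be modelled roughly as follows. This is a synchronous simplification with hypothetical names (`State`, `Channel`, `process_queue`); the real stack applies requests asynchronously through a workqueue and NCSI command/response exchanges.

```python
from collections import deque
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"
    INVISIBLE = "invisible"


class Channel:
    def __init__(self, name, state=State.INACTIVE):
        self.name = name
        self.state = state


def process_queue(queue, log):
    """Apply the pending action to every queued channel (simplified model)."""
    while queue:
        ch = queue.popleft()
        # The state at dequeue time decides which action is applied.
        action = "configure" if ch.state is State.INACTIVE else "suspend"
        ch.state = State.INVISIBLE   # the request is being applied
        log.append((ch.name, action))
        # Once the action completes the channel becomes visible again.
        ch.state = State.ACTIVE if action == "configure" else State.INACTIVE
```

For example, queueing an inactive channel and an active one results in the first being configured and the second being suspended.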
    • net/ncsi: NCSI response packet handler · 138635cc
      Gavin Shan authored
      NCSI response packets are sent to the MC (Management Controller)
      from the remote end. They are responses to NCSI command packets and
      serve multiple purposes: reporting the completion status of NCSI
      command packets, returning an NCSI channel's capability or
      configuration, etc.
      
      This defines a struct to represent NCSI response packets and introduces
      the function ncsi_rcv_rsp(), which will be used to receive NCSI response
      packets and parse them.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ncsi: NCSI command packet handler · 6389eaa7
      Gavin Shan authored
      NCSI command packets are sent from the MC (Management Controller)
      to the remote end. They are used for multiple purposes: probing existing
      NCSI packages/channels, retrieving an NCSI channel's capability,
      configuring an NCSI channel, etc.
      
      This defines a struct to represent NCSI command packets and introduces
      the function ncsi_xmit_cmd(), which will be used to transmit an NCSI
      command packet according to the request. The request is represented by
      struct ncsi_cmd_arg.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ncsi: Resource management · 2d283bdd
      Gavin Shan authored
      The NCSI spec (DSP0222) defines several objects: package, channel, mode,
      filter, version, statistics, etc. This introduces the data structs
      to represent those objects and implements functions to manage them.
      Also, this introduces CONFIG_NET_NCSI for the newly implemented NCSI
      stack.
      
         * The user (e.g. netdev driver) references the NCSI device through
           "struct ncsi_dev", which is embedded in "struct ncsi_dev_priv".
           The latter is used by the NCSI stack internally.
         * Every NCSI device can have multiple packages simultaneously, up
           to 8 packages. Each is represented by "struct ncsi_package" and
           identified by a 3-bit ID.
         * Every NCSI package can have multiple channels, up to 32. Each is
           represented by "struct ncsi_channel" and identified by a 5-bit ID.
         * Every NCSI channel has a version, statistics, various modes and
           filters. They are represented by "struct ncsi_channel_version",
           "struct ncsi_channel_stats", "struct ncsi_channel_mode" and
           "struct ncsi_channel_filter" respectively.
         * Apart from AEN (Asynchronous Event Notification), the NCSI stack
           works in terms of command and response. This introduces "struct
           ncsi_req" to represent a complete NCSI transaction made of an NCSI
           request and response.
      
      link: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
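Since a package ID takes 3 bits and a channel ID takes 5 bits, the pair fits in a single byte. The sketch below assumes the package ID occupies the upper 3 bits and the channel ID the lower 5; that layout and the helper names are assumptions for illustration, not taken from this log.

```python
def to_channel_id(package, channel):
    """Pack a 3-bit package ID and a 5-bit channel ID into one byte."""
    if not 0 <= package < 8:
        raise ValueError("package ID must fit in 3 bits")
    if not 0 <= channel < 32:
        raise ValueError("channel ID must fit in 5 bits")
    return (package << 5) | channel


def package_index(channel_id):
    return (channel_id >> 5) & 0x7


def channel_index(channel_id):
    return channel_id & 0x1F
```

This also shows where the limits of 8 packages and 32 channels per package come from: 2^3 and 2^5.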
    • Merge branch 'dsa-mv88e6xxx-g2-cleanup-stp' · 5e31c701
      David S. Miller authored
      Vivien Didelot says:
      
      ====================
      net: dsa: mv88e6xxx: Global2 cleanup and STP
      
      The Marvell switch registers are organized in distinct internal SMI
      devices, such as the PHY, Port, Global 1 or Global 2 register sets.
      
      Since not all chips support every register set, or have slight
      differences in them (such as old 88E6060 or new 88E6390 likely to be
      supported soon), make the setup code clearer now by removing a few
      family checks and adding flags to describe the Global 2 register map.
      
      This patchset enables basic STP support and bridging on most chips by
      getting rid of a few inconsistencies in chip descriptions (patch 1), and
      adds bridge Ageing Time support to DSA and the mv88e6xxx driver.
      
      Changes v2 -> v3:
        - rename mv88e6xxx_update_write to mv88e6xxx_update
        - set fastest ageing time in use in the chip for multiple bridges,
          tested with a few printk
      
      Changes v1 -> v2:
        - add a write helper for pointer-data Update registers
        - add ageing time support
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add support for DSA ageing time · 2cfcd964
      Vivien Didelot authored
      Implement the DSA driver function to configure the bridge ageing time.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add G1 helper for ageing time · acddbd21
      Vivien Didelot authored
      All Marvell switch chips (from 88E6060 to 88E6390) have an ATU Control
      register whose bits 11:4 configure an ATU Age Time quotient.
      
      However, the coefficient used to calculate the ATU Age Time varies with
      the models. E.g. 88E6060, 88E6352 and 88E6390 use respectively 16, 15 and
      3.75 seconds.
      
      Add an age_time_coeff to the info structure to handle this and a Global 1
      helper to set the default age time of 5 minutes in the setup code.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
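As a worked example of the quotient calculation, the sketch below converts the 5-minute default into a register field for the three coefficients quoted above. Rounding to the nearest quotient and the helper names are assumptions; only the bit range 11:4 and the coefficients come from the commit message.

```python
AGE_TIME_MASK = 0x0FF0   # bits 11:4 of the ATU Control register


def age_time_quotient(seconds, coeff_seconds):
    """Quotient such that quotient * coeff is close to the requested time."""
    quotient = round(seconds / coeff_seconds)   # rounding mode is assumed
    if not 0 <= quotient <= 0xFF:
        raise ValueError("age time does not fit in an 8-bit quotient")
    return quotient


def set_age_time(reg, quotient):
    """Return the 16-bit register value with bits 11:4 replaced."""
    return (reg & ~AGE_TIME_MASK & 0xFFFF) | (quotient << 4)


default_seconds = 5 * 60
quotients = {
    "88E6060": age_time_quotient(default_seconds, 16),
    "88E6352": age_time_quotient(default_seconds, 15),
    "88E6390": age_time_quotient(default_seconds, 3.75),
}
```

With a 16-second coefficient the 5-minute default cannot be represented exactly; a quotient of 19 gives 304 seconds, whereas the 15 and 3.75 second coefficients hit 300 seconds exactly.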
    • net: dsa: support switchdev ageing time attr · 34a79f63
      Vivien Didelot authored
      Add a new function for DSA drivers to handle the switchdev
      SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute.
      
      The ageing time is passed as milliseconds.
      
      Also, because we can have multiple logical bridges on top of a physical
      switch and the ageing time is switch-wide, call the driver function with
      the fastest ageing time in use on the chip instead of the requested one.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
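The "fastest ageing time in use" rule can be sketched as a minimum over all bridges sharing the chip. The function and the dictionary keyed by bridge name are hypothetical; the DSA core tracks this differently.

```python
def fastest_ageing_time(ageing_ms_by_bridge, bridge, requested_ms):
    """Ageing time (ms) to program into the switch after a request.

    ageing_ms_by_bridge maps each logical bridge on the chip to its
    currently requested ageing time in milliseconds.
    """
    updated = dict(ageing_ms_by_bridge)
    updated[bridge] = requested_ms
    # The setting is switch-wide, so the shortest request wins.
    return min(updated.values())
```

For instance, if br1 already asked for 60 s, raising br0 to 120 s still programs 60 s into the chip.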
    • net: dsa: mv88e6xxx: add cap for IRL · 8ec61c7f
      Vivien Didelot authored
      Add capability flags to describe the presence of Ingress Rate Limit unit
      registers and a helper function to clear it.
      
      In the meantime, fix a few harmless issues:
      
        - 6185 and 6095 don't have such registers (reserved)
        - the previous code didn't wait for the IRL operation to complete
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add cap for Priority Override · 9bda889f
      Vivien Didelot authored
      Add flags and helpers to describe the presence of Priority Override
      Table (POT) related registers and simplify the setup of Global 2.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add cap for PVT · 63ed880d
      Vivien Didelot authored
      Add flags to describe the presence of Cross-chip Port VLAN Table (PVT)
      related registers and simplify the setup of Global 2.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: rework Switch MAC setter · 3b4caa1b
      Vivien Didelot authored
      Switches such as 88E6185 have 3 Switch MAC registers in Global 1. Newer
      chips such as 88E6352 have freed these registers in favor of an indirect
      access in a Switch MAC/WoL/WoF register in Global 2.
      
      Make this difference explicit with G1 and G2 helpers and flags.
      
      Also, note that this indirect access is a single-register access which
      doesn't require waiting for the operation to complete (like Switch MAC,
      Trunk Mapping, etc.), contrary to multi-register indirect accesses with
      several operations and a busy bit.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b4caa1b
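      The two access styles contrasted in the Switch MAC commit above can be
      sketched roughly as follows. This is only an illustration: the function
      names, the byte order of the three Global 1 words, and the Global 2
      command layout (update bit in bit 15, pointer above the data byte) are
      assumptions made for the example, not taken from the driver or a
      datasheet.

```c
#include <stdint.h>

/*
 * Direct style (hypothetical layout): the 6-byte MAC is spread over
 * three 16-bit registers, two bytes per register, high byte first.
 */
static void g1_pack_mac(const uint8_t mac[6], uint16_t words[3])
{
    for (int i = 0; i < 3; i++)
        words[i] = (uint16_t)((mac[2 * i] << 8) | mac[2 * i + 1]);
}

/*
 * Indirect single-register style (hypothetical layout): each MAC byte
 * is written with one register write that carries an update bit, a
 * pointer selecting the byte, and the data itself. No busy bit is
 * polled afterwards.
 */
#define G2_UPDATE_BIT 0x8000u

static uint16_t g2_mac_byte_cmd(unsigned int pointer, uint8_t data)
{
    return (uint16_t)(G2_UPDATE_BIT | ((pointer & 0x1fu) << 8) | data);
}
```

      In this sketch a full MAC update is three plain writes in the direct
      style, or six command writes (one per byte) in the indirect style.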
    • Vivien Didelot's avatar
      net: dsa: mv88e6xxx: add cap for MGMT Enables bits · 47395ed2
      Vivien Didelot authored
      Some switches provide a Rsvd2CPU mechanism used to choose which of the
      16 reserved multicast destination addresses matching 01:80:c2:00:00:0x
      should be considered as MGMT and thus forwarded to the CPU port.
      
      Other switches extend this mechanism to also configure as MGMT the
      additional 16 reserved multicast addresses matching 01:80:c2:00:00:2x.
      
      This mechanism is exposed via two registers in Global 2, and an Rsvd2CPU
      enable bit in the management register.
      
      Newer chips (such as 88E6390) have replaced these registers with a new
      indirect MGMT mechanism in Global 1.
      
      The patch adds two MV88E6XXX_FLAG_G2_MGMT_EN_{0,2}X flags to describe
      the presence of these Global 2 registers. If 88E6390 support is added, a
      MV88E6XXX_FLAG_G1_MGMT_CTRL flag will be needed to set up Rsvd2CPU.
      
      Note: all switches still support, in parallel, the ATU Load operation
      with an MGMT Entry State to forward such frames in a less convenient way.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      47395ed2
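      The Rsvd2CPU matching described in the MGMT Enables commit above can be
      sketched as a small classifier. The function names are hypothetical, and
      the mapping of one enable bit per address (bit n for the address ending
      in 0n or 2n) is an assumption made for the illustration; the commit only
      states that two Global 2 registers cover the two 16-address ranges.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the index (0-15) of a reserved multicast address within the
 * range 01:80:c2:00:00:<range_base>..<range_base + 0xf>, or -1 if the
 * address is outside that range. range_base is 0x00 or 0x20.
 */
static int rsvd2cpu_bit(const uint8_t mac[6], uint8_t range_base)
{
    static const uint8_t prefix[5] = {0x01, 0x80, 0xc2, 0x00, 0x00};

    for (int i = 0; i < 5; i++)
        if (mac[i] != prefix[i])
            return -1;
    if ((mac[5] & 0xf0) != range_base)
        return -1;
    return mac[5] & 0x0f;
}

/*
 * Decide whether a frame would be treated as MGMT, given the two
 * (assumed) 16-bit enable masks for the 0x and 2x ranges.
 */
static bool is_mgmt(const uint8_t mac[6], uint16_t en_0x, uint16_t en_2x)
{
    int bit = rsvd2cpu_bit(mac, 0x00);

    if (bit >= 0)
        return (en_0x >> bit) & 1;
    bit = rsvd2cpu_bit(mac, 0x20);
    if (bit >= 0)
        return (en_2x >> bit) & 1;
    return false;
}
```

      A switch that only has the 0x register would, in this model, simply be
      called with en_2x set to 0.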
    • Vivien Didelot's avatar
      net: dsa: mv88e6xxx: extract trunk mapping · 5154041f
      Vivien Didelot authored
      The Trunk Mask and Trunk Mapping registers are two Global 2 indirect
      accesses to trunking configuration.
      
      Add helpers for these tables and simplify the Global 2 setup.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5154041f
    • Vivien Didelot's avatar
      net: dsa: mv88e6xxx: extract device mapping · f22ab641
      Vivien Didelot authored
      The Device Mapping register is an indirect table access.
      
      Provide helpers to access this table and make the check for the new
      DSA_RTABLE_NONE routing table value explicit.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f22ab641
    • Vivien Didelot's avatar
      net: dsa: mv88e6xxx: split setup of Global 1 and 2 · 9729934c
      Vivien Didelot authored
      Separate the setup of Global 1 and Global 2 internal SMI devices and add
      a flag to describe the presence of this second register set.
      
      Also rearrange the G1 setup to follow register order.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9729934c
    • Vivien Didelot's avatar
      net: dsa: mv88e6xxx: remove basic function flags · d51c542b
      Vivien Didelot authored
      All 88E6xxx Marvell switches (even the old, not yet supported 88E6060)
      have at least an ATU, per-port STP states and VLAN map, to run basic
      switch functions such as Spanning Tree and port based VLANs.
      
      Get rid of the related MV88E6XXX_FLAG_{ATU,PORTSTATE,VLANTABLE} flags,
      as these features are common to every chip.
      
      This enables STP on 6185 and removes many inconsistencies on others.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d51c542b
    • Andrew Morton's avatar
      kernel/trace/bpf_trace.c: work around gcc-4.4.4 anon union initialization bug · 183fc153
      Andrew Morton authored
      kernel/trace/bpf_trace.c: In function 'bpf_event_output':
      kernel/trace/bpf_trace.c:312: error: unknown field 'next' specified in initializer
      kernel/trace/bpf_trace.c:312: warning: missing braces around initializer
      kernel/trace/bpf_trace.c:312: warning: (near initialization for 'raw.frag.<anonymous>')
      
      Fixes: 555c8a86 ("bpf: avoid stack copy and use skb ctx for event output")
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      183fc153
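      The errors quoted in the gcc-4.4.4 commit above are typical of older
      compilers that cannot reach a member of an anonymous union through a
      designated initializer on the enclosing struct. One common workaround,
      sketched below, is to give the anonymous union its own pair of braces.
      The struct definitions and the make_record() helper are hypothetical
      stand-ins written for this example; they are not the kernel's types.

```c
#include <stddef.h>

/* Hypothetical fragment type whose first member is an anonymous union. */
struct frag {
    union {
        struct frag *next;
        unsigned long pad;
    };
    void *data;
    unsigned int size;
};

struct record {
    struct frag frag;
};

static struct record make_record(struct frag *next, void *data,
                                 unsigned int size)
{
    /*
     * Writing ".frag = { .next = next, ... }" names a member of the
     * anonymous union directly, which is what trips old compilers.
     * The extra braces below initialize the union as the first member
     * of struct frag, and .next is then resolved inside the union.
     */
    struct record raw = {
        .frag = {
            {
                .next = next,
            },
            .data = data,
            .size = size,
        },
    };

    return raw;
}
```

      Both spellings are accepted by current compilers and yield the same
      object; the braced form is only more tolerant of old toolchains.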
    • Andy Lutomirski's avatar
      virtio-net: Remove more stack DMA · a725ee3e
      Andy Lutomirski authored
      VLAN and MQ control was doing DMA from the stack.  Fix it.
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a725ee3e
    • Florian Fainelli's avatar
      bnxt_en: Remove locking around txr->dev_state · cbce91ca
      Florian Fainelli authored
      txr->dev_state was not consistently manipulated with the acquisition of
      the per-queue lock. After further inspection, the lock does not seem
      necessary: the value is read as either BNXT_DEV_STATE_CLOSING or 0.
      
      Reported-by: coverity (CID 1339583)
      Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbce91ca
  2. 19 Jul, 2016 3 commits
    • David S. Miller's avatar
      Merge branch 'frag-udp-tunneled-skbs' · 7e0433b3
      David S. Miller authored
      Shmulik Ladkani says:
      
      ====================
      net: Consider fragmentation of udp tunneled skbs in 'ip_finish_output_gso'
      
      Currently IP fragmentation of GSO segments that exceed dst mtu is
      considered only in the ipv4 forwarding case.
      
      There are cases where GSO skbs that are bridged and then udp-tunneled
      may have gso_size exceeding the egress device mtu.
      It makes sense to fragment them, as in the non GSOed code path.
      
      The exact cases where this behavior is needed are described and addressed
      in the 2nd patch.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e0433b3
    • Shmulik Ladkani's avatar
      net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow... · b8247f09
      Shmulik Ladkani authored
      net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs
      
      Given:
       - tap0 and vxlan0 are bridged
       - vxlan0 stacked on eth0, eth0 having small mtu (e.g. 1400)
      
      Assume GSO skbs arriving from tap0 having a gso_size as determined by
      user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).
      
      After encapsulation these skbs have an skb_gso_network_seglen that
      exceeds eth0's ip_skb_dst_mtu.
      
      These skbs are accidentally passed to ip_finish_output2 AS IS.
      Alas, each final segment (segmented either by validate_xmit_skb or by
      hardware UFO) would be larger than eth0 mtu.
      As a result, those above-mtu segments get dropped on certain networks.
      
      This behavior is not aligned with the NON-GSO case:
      Assume a non-gso 1500-sized IP packet arrives from tap0. After
      encapsulation, the vxlan datagram is fragmented normally at the
      ip_finish_output-->ip_fragment code path.
      
      The expected behavior for the GSO case would be segmenting the
      "gso-oversized" skb first, then fragmenting each segment according to
      dst mtu, and finally passing the resulting fragments to ip_finish_output2.
      
      'ip_finish_output_gso' already supports this "Slowpath" behavior,
      according to the IPSKB_FRAG_SEGS flag, which is only set during ipv4
      forwarding (not set in the bridged case).
      
      In order to support the bridged case, we'll mark skbs arriving from an
      ingress interface that get udp-encapsulated as "allowed to be fragmented",
      causing their network_seglen to be validated by 'ip_finish_output_gso'
      (and the skbs to be fragmented if needed).
      
      Note the TUNNEL_DONT_FRAGMENT tun_flag is still honoured (both in the
      gso and non-gso cases), which serves users wishing to forbid
      fragmentation at the udp tunnel endpoint.
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8247f09
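      The decision described in the ip_finish_output_gso commit above can be
      modelled in a few lines. This is a simplified sketch, not the kernel
      code: struct gso_pkt, its fields, and choose_gso_path() are hypothetical
      names, and the frag_segs_allowed field stands in for the
      IPSKB_FRAG_SEGS flag mentioned in the message.

```c
#include <stdbool.h>

enum gso_path {
    GSO_PASS_THROUGH,          /* hand the skb on as is (fast path) */
    GSO_SEGMENT_THEN_FRAGMENT, /* segment first, then fragment each segment */
};

/* Hypothetical, minimal view of the fields that drive the decision. */
struct gso_pkt {
    unsigned int network_seglen; /* per-segment length at the network layer */
    bool frag_segs_allowed;      /* models the IPSKB_FRAG_SEGS flag */
};

static enum gso_path choose_gso_path(const struct gso_pkt *pkt,
                                     unsigned int mtu)
{
    /* Common case: fragmenting segments is not allowed, or they fit. */
    if (!pkt->frag_segs_allowed || pkt->network_seglen <= mtu)
        return GSO_PASS_THROUGH;

    /* Slow path: oversized segments that may be fragmented. */
    return GSO_SEGMENT_THEN_FRAGMENT;
}
```

      In this model, a bridged and udp-tunneled skb with a hypothetical
      1550-byte segment length and a 1400-byte egress MTU takes the slow path
      only once it has been marked as allowed to be fragmented; without the
      mark it is passed through unchanged, which is the behavior the commit
      sets out to fix.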
    • Shmulik Ladkani's avatar
      net/ipv4: Introduce IPSKB_FRAG_SEGS bit to inet_skb_parm.flags · 359ebda2
      Shmulik Ladkani authored
      This flag indicates whether fragmentation of segments is allowed.
      
      Formerly this policy was hardcoded according to IPSKB_FORWARDED (set by
      either ip_forward or ipmr_forward).
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      359ebda2