20 Jul, 2016 (40 commits)
    • fm10k: return proper error code when pci_enable_msix_range fails · 30e23b71
      Jacob Keller authored
      The pci_enable_msix_range() function returns a positive value of the
      number of allocated vectors if it succeeds. On failure it returns
      a negative error code. Return this code properly so that the error
      message printed by the driver will show the actual error code instead of
      being masked by -ENOMEM.
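
      As a rough illustration (not the driver's literal code), the fix amounts
      to passing along whatever pci_enable_msix_range() returned rather than
      substituting -ENOMEM; the wrapper function below is made up for the
      example:

        #include <linux/pci.h>

        static int example_enable_msix(struct pci_dev *pdev,
                                       struct msix_entry *entries,
                                       int minvec, int maxvec)
        {
            int ret = pci_enable_msix_range(pdev, entries, minvec, maxvec);

            /* ret is the vector count on success or a negative errno on
             * failure; return it as-is instead of substituting -ENOMEM. */
            return ret;
        }
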
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      30e23b71
    • fm10k: force link to remain down for at least a second on resume events · 0356b23b
      Jacob Keller authored
      When we resume from an AER recovery with many active VFs, the PF sees
      many spurious link up and link down events. Prevent this by forcing the
      link to remain down for at least one second after a resume event.
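
      A minimal sketch of the idea using the standard jiffies helpers (the
      field name link_down_event is illustrative, not necessarily the
      driver's):

        #include <linux/jiffies.h>

        /* at resume time: hold the link down for roughly one second */
        static void example_note_resume(unsigned long *link_down_event)
        {
            *link_down_event = jiffies + HZ;
        }

        /* in the link watchdog: only report link up after the hold-off */
        static bool example_link_may_go_up(unsigned long link_down_event)
        {
            return time_after(jiffies, link_down_event);
        }
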
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      0356b23b
    • fm10k: implement request_lport_map pointer · 0afd20e5
      Jacob Keller authored
      If the fm10k interface is brought up, but the switch manager software is
      not running, the driver will continuously request the lport map every
      few seconds in the base driver watchdog routine. Eventually after
      several minutes the switch mailbox Tx fifo will fill up and the mailbox
      will timeout, resulting in a reset. This reset will appear as if for no
      reason, and occurs regularly every few minutes until the switch manager
      software is loaded.
      
      Prevent this from happening by only requesting the lport map after we've
      verified the switch mailbox is tx_ready. In order to simplify code logic
      and reduce code duplication, implement this as a new function pointer,
      "mac.ops.request_lport_map", which the VF will not implement. Otherwise,
      we would have to duplicate the tx_ready check outside of
      fm10k_get_host_state_generic, or re-implement most of
      fm10k_get_host_state_generic in the pf version.
      
      The resulting code is simpler and easier to understand, and prevents the
      PF from continuously requesting lport map and filling the Tx fifo of
      a switch mailbox that isn't ready.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      0afd20e5
    • fm10k: check if PCIe link is restored · 9d7dbf06
      Jacob Keller authored
      Sometimes, a VF driver will lose PCIe address access, such as due to
      a PF FLR event. In fm10k_detach_subtask, poll and check whether the
      PCIe register space is active again and restore the device when it has.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      9d7dbf06
    • fm10k: enable bus master after every reset · 0d63a8f5
      Jacob Keller authored
      If an FLR occurs, VF devices will be knocked out of bus master mode, and
      the driver will be unable to recover from the reset properly, resulting
      in malicious driver events and an infinite reset loop. In the normal
      case, the bus master mode will already be enabled and this call will
      essentially be a no-op. Since we're doing this on every reset, we could
      probably remove the other calls to pci_set_master(), but it seems
      harmless to leave them in place.
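
      Sketch of the intent (the surrounding function is illustrative):

        #include <linux/pci.h>

        static void example_post_reset(struct pci_dev *pdev)
        {
            /* harmless if already enabled; required after an FLR knocked
             * the VF out of bus master mode */
            pci_set_master(pdev);
        }
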
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      0d63a8f5
    • fm10k: use common flow for suspend and resume · 7756c08b
      Jacob Keller authored
      Continuing the effort to commonize the similar suspend/resume flows,
      finish up by using the new fm10k_prepare_suspend and fm10k_handle_resume
      functions for the standard suspend/resume flow.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      7756c08b
    • fm10k: implement reset_notify handler for PCIe FLR events · 0593186a
      Jacob Keller authored
      When a function level PCI reset is triggered using sysfs, it calls the
      driver's .reset_notify error handler. Implement a handler based on the
      now split fm10k_prepare_for_reset and fm10k_handle_reset functions, so
      that we fully reset the driver when the PCI function level reset occurs.
      This also ensures the reset is handled in a clean way, by disabling all
      the driver bits first and then restoring them after the function reset.
      Previously the stack simply performed a blind function reset and
      our driver didn't take any part in the process.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      0593186a
    • fm10k: use common reset flow when handling io errors from PCI stack · 820c91aa
      Jacob Keller authored
      Now that we have extracted the necessary steps for a split
      suspend/resume flow, re-use these functions instead of using the current
      open coded flow. This ensures that we don't miss any steps. It also
      ensures that we have the correct driver states set.
      
      Since we'll be handling all of the reset flow ourselves, we no longer
      need to request a reset in the io_slot_reset() function.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      820c91aa
    • fm10k: implement prepare_suspend and handle_resume · dc4b76c0
      Jacob Keller authored
      Implement fm10k_prepare_suspend and fm10k_handle_resume functions which
      abstract around the now existing fm10k_prepare_for_reset and
      fm10k_handle_reset. The new functions also handle stopping the service
      task, which is something that the original re-init flow does not need.
      
      Every other location that does a suspend/resume type flow is expected to
      use these functions, because otherwise they may have conflicts with the
      running watchdog routines. This also has the effect of preventing
      possible surprise remove events during handling of FLR events and PCIe
      errors.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      dc4b76c0
    • fm10k: split fm10k_reinit into two functions · 40de1fad
      Jacob Keller authored
      There are several flows in the driver which perform the similar function
      of tearing down software and restoring software to recover from certain
      errors or PCIe events, including:
      
        * fm10k_reinit
        * fm10k_suspend/resume
        * fm10k_io_error_detected/fm10k_io_resume
      
      In addition, we want to implement a .reset_notify() handler as well
      which will also perform similar function.
      
      Rework how the driver codes reset and resume flows by separating out the
      reinit logic into two functions "fm10k_prepare_for_reset" and
      "fm10k_handle_reset". This first step will allow us to re-use this
      functionality in the similar blocks of code instead of re-coding the
      same sequence of events slightly differently.
      
      The end result should be more maintainable and correct, fixing several
      inconsistencies with the work flow.
      
      The new functions expect to take the rtnl_lock() themselves, which has
      the unfortunate side effect that the reinit flow now takes, releases,
      and then re-takes the rtnl_lock. However, this minor downside is
      outweighed by the benefits of code reduction and of reducing needless
      differences between these flows.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      40de1fad
    • fm10k: wait for queues to drain if stop_hw() fails once · 94877768
      Jacob Keller authored
      It turns out that sometimes during a reset the Tx queues will be
      temporarily stuck longer than .stop_hw() expects. Work around this issue
      by attempting to .stop_hw() first. If it fails, wait a number of
      attempts until the Tx queues appear to be drained. After this, attempt
      stop_hw() again. This ensures that we avoid waiting if we don't need to,
      such as during the first initialization of a VF, and give the proper
      amount of time necessary to recover from most situations. It is possible
      that the hardware is actually stuck. For PFs, this is usually fixed by
      a datapath reset. Unfortunately the VF cannot request a similar reset
      for itself.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      94877768
    • fm10k: only warn when stop_hw fails with FM10K_ERR_REQUESTS_PENDING · 106ca423
      Jacob Keller authored
      When stop_hw() routine fails with FM10K_ERR_REQUESTS_PENDING, this
      indicates that the Tx or Rx queues did not shutdown within the time
      limit. Print a more suitable message at the dev_info level instead of
      dev_err.
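
      A sketch of the resulting logging split (the message text and the
      wrapper function are illustrative; FM10K_ERR_REQUESTS_PENDING is the
      driver's own error code):

        #include <linux/pci.h>

        static void example_report_stop_hw(struct pci_dev *pdev, s32 err)
        {
            if (err == FM10K_ERR_REQUESTS_PENDING)
                dev_info(&pdev->dev,
                         "queues did not drain within the timeout\n");
            else if (err)
                dev_err(&pdev->dev, "stop_hw failed: %d\n", err);
        }
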
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      106ca423
    • fm10k: perform data path reset even when switch is not ready · 892c9e08
      Jacob Keller authored
      A while ago, an additional check for the switch being ready was added to
      reset_hw. A recent refactor accidentally made this check return an error
      code on failure which caused fm10k_probe to fail when the switch wasn't
      brought up first. The original reasoning for the check was to prevent
      additional data path reset when the fabric wasn't ready yet. However,
      there isn't a compelling reason to keep the check, as the data path
      reset will restore hardware to a known good state. Remove the check and
      perform the data path reset regardless of the switch manager state.
      
      An alternative fix is to return FM10K_SUCCESS instead, and bypass the
      actual data path reset. This should be fine as we will perform
      a reset_hw once the switch is active. However, since data path reset
      will reset many parts of the hardware it seems better to just perform
      the reset regardless of switch state.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      892c9e08
    • fm10k: don't stop reset due to FM10K_ERR_REQUESTS_PENDING · ce33624f
      Jacob Keller authored
      Don't report FM10K_ERR_REQUESTS_PENDING when we fail to disable queues
      within the timeout. This can occur due to a hardware Tx hang, or when
      the switch ethernet fabric is resetting while we are transmitting
      traffic. It can sometimes take up to 500ms before the Tx DMA engine
      gives up. Instead, just skip the DMA engine check and perform
      a data path reset anyway. Add a statistic counter to keep track of the
      number of resets occurring while we have pending DMA on the rings.
      
      In order to prevent having to re-assign err to 0, re-order the
      last few items of the reset_hw_pf function so that we don't perform
      "return err" at the end.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      ce33624f
    • fm10k: Reset mailbox global interrupts · 5e93cbad
      Ngai-Mint Kwan authored
      When a data path reset is initiated, write control to the PCIE_GMBX is
      yanked from the switch manager. The switch manager writes to this
      register to clear mailbox global interrupt bits as part of its mailbox
      interrupt handling routine. When the device recovers from the data path
      reset and these bits are not cleared, it will prevent future mailbox
      global interrupts from being triggered. Upon confirming that the device
      has exited from a data path reset, clear these bits to ensure the proper
      functioning of the mailbox global interrupt.
      Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      5e93cbad
    • fm10k: prevent multiple threads updating statistics · 9d73edee
      Jacob Keller authored
      Also prevent updating stats while the interface is down. If we're
      already updating stats, just return without doing anything. When we take the
      device down, block stat updates until we come back up. This ensures that
      we avoid tearing down rings when we're updating statistics, and prevents
      updating statistics until we're up.
      
      We can't re-use the __FM10K_DOWN bit for this, because it wouldn't
      prevent multiple threads from accessing statistics, nor would it handle
      the case where we start updating stats and then start going down in
      another thread.
      
      fm10k_get_stats64 is exempt from this, because it has a completely
      different flow which does not suffer from the same issues that
      fm10k_update_stats might.
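
      A minimal sketch of the scheme described above, assuming a dedicated
      state bit (the name __FM10K_UPDATING_STATS and the field layout are
      illustrative, not copied from the driver):

        static void example_update_stats(struct fm10k_intfc *interface)
        {
            /* no updates while down, and only one updater at a time */
            if (test_bit(__FM10K_DOWN, &interface->state) ||
                test_and_set_bit(__FM10K_UPDATING_STATS, &interface->state))
                return;

            /* ... walk the rings and accumulate counters ... */

            clear_bit(__FM10K_UPDATING_STATS, &interface->state);
        }
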
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      9d73edee
    • fm10k: avoid possible null pointer dereference in fm10k_update_stats · b624714b
      Jacob Keller authored
      It's currently possible for fm10k_update_stats to be called during the
      window when we go down and the rings are removed. This can result in
      a null pointer dereference. In fm10k_get_stats64 we work around this by
      using ACCESS_ONCE and a null pointer check inside the loop. Use this
      same flow in fm10k_update_stats to avoid the potential null pointer
      dereference.
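
      The guarded loop looks roughly like this (field names follow the
      driver's general shape but are not copied verbatim):

        static void example_sum_rx(struct fm10k_intfc *interface,
                                   u64 *pkts, u64 *bytes)
        {
            int i;

            for (i = 0; i < interface->num_rx_queues; i++) {
                /* the ring pointer may be cleared while we run */
                struct fm10k_ring *ring = ACCESS_ONCE(interface->rx_ring[i]);

                if (!ring)
                    continue;

                *pkts += ring->stats.packets;
                *bytes += ring->stats.bytes;
            }
        }
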
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      b624714b
    • fm10k: no need to continue in fm10k_down if __FM10K_DOWN already set · 1b00c6c0
      Jacob Keller authored
      Return early from fm10k_down() when we are already down, since that
      means another thread has either already finished going down or has just
      started doing so, and we shouldn't conflict with it.
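
      Sketch of the early return (the exact test in the driver may differ):

        void example_down(struct fm10k_intfc *interface)
        {
            /* another thread already owns the down path */
            if (test_and_set_bit(__FM10K_DOWN, &interface->state))
                return;

            /* ... disable queues, NAPI, watchdog, etc. ... */
        }
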
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Krishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      1b00c6c0
    • Merge branch 'mlxsw-per-prio-tc-counters' · c0d661ca
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      mlxsw: Add per-{Prio,TC} counters
      
      Ido says:
      
      Add per-priority and per-tc counters, which are very useful for debugging
      purposes and fine-tuning.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0d661ca
    • mlxsw: spectrum: Expose per-tc counters via ethtool · df4750e8
      Ido Schimmel authored
      Expose the transmit queue length of each traffic class and the amount of
      unicast packets discarded due to insufficient room in the shared buffer.
      
      The first counter allows us to debug user priority to traffic class
      mapping, whereas the drop counter is useful when determining shared buffer
      configuration.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df4750e8
    • mlxsw: spectrum: Expose per-priority counters via ethtool · 7ed674bc
      Ido Schimmel authored
      Expose per-priority bytes / packets / PFC packets counters via ethtool.
      
      These counters are very useful when debugging QoS functionality and
      provide a better insight into the device's forwarding plane.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7ed674bc
    • net: cpmac: fix error handling of cpmac_probe() · 09714275
      Wei Yongjun authored
      Add the missing free_netdev() before returning from cpmac_probe() in
      the error handling case.
      This patch reverts commit 0465be8f ("net: cpmac: fix in
      releasing resources"), which changed the code to only call free_netdev()
      when register_netdev() failed.
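
      The restored error-path shape is roughly the following (the body of the
      probe is condensed for illustration):

        #include <linux/etherdevice.h>
        #include <linux/platform_device.h>

        static int example_probe(struct platform_device *pdev)
        {
            struct net_device *dev = alloc_etherdev(sizeof(struct cpmac_priv));
            int rc;

            if (!dev)
                return -ENOMEM;

            /* ... resource setup that can fail ... */

            rc = register_netdev(dev);
            if (rc)
                goto fail;

            return 0;

        fail:
            free_netdev(dev);    /* the previously missing cleanup */
            return rc;
        }
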
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09714275
    • net/mlx5: Use PTR_ERR_OR_ZERO() to simplify the code · 44fafdaa
      Wei Yongjun authored
      Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR.
      
      Generated by: scripts/coccinelle/api/ptr_ret.cocci
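
      The transformation is mechanical; roughly (with an illustrative pointer
      variable):

        /* before */
        if (IS_ERR(ptr))
            return PTR_ERR(ptr);
        return 0;

        /* after */
        return PTR_ERR_OR_ZERO(ptr);
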
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      44fafdaa
    • net: ethernet: nb8800: fix error handling of nb8800_probe() · 9a7bae8a
      Wei Yongjun authored
      In the ops->reset() error handling case, clk_disable_unprepare() is
      missing before returning from this function.
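
      Excerpt-style sketch of the fix (priv->clk and the error label are
      illustrative, not the driver's literal diff):

        ret = ops->reset(dev);
        if (ret) {
            clk_disable_unprepare(priv->clk);   /* previously leaked */
            goto err_free_dev;
        }
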
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: Mans Rullgard <mans@mansr.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9a7bae8a
    • wan/fsl_ucc_hdlc: use module_platform_driver to simplify the code · 459421cc
      Wei Yongjun authored
      module_platform_driver() makes the code simpler by eliminating
      boilerplate code.
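
      In other words, the explicit module_init()/module_exit() pair that only
      called platform_driver_register()/unregister() collapses into a single
      line; a sketch (the struct and callback names stand in for the
      driver's):

        #include <linux/module.h>
        #include <linux/platform_device.h>

        static struct platform_driver ucc_hdlc_driver = {
            .probe  = ucc_hdlc_probe,
            .remove = ucc_hdlc_remove,
            .driver = {
                .name = "ucc_hdlc",
            },
        };

        module_platform_driver(ucc_hdlc_driver);
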
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      459421cc
    • wan/fsl_ucc_hdlc: remove .owner field for driver · 9d5658e6
      Wei Yongjun authored
      Remove the .owner field, since the platform driver registration call
      sets it automatically.
      
      Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d5658e6
    • net: axienet: Fix return value check in axienet_probe() · 3ad7b147
      Wei Yongjun authored
      In case of error, the function of_parse_phandle() returns a NULL
      pointer, not ERR_PTR(). The IS_ERR() test in the return value
      check should be replaced with a NULL test.
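
      Excerpt-style sketch of the corrected check (the property name, error
      code, and label are illustrative):

        np = of_parse_phandle(pdev->dev.of_node, "phy-handle", 0);
        if (!np) {                       /* was: if (IS_ERR(np)) */
            ret = -EINVAL;
            goto free_netdev;
        }
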
      
      Fixes: 46aa27df ('net: axienet: Use devm_* calls')
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ad7b147
    • Merge branch 'for-upstream' of... · 4599f772
      David S. Miller authored
      Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
      
      Johan Hedberg says:
      
      ====================
      pull request: bluetooth-next 2016-07-19
      
      Here's likely the last bluetooth-next pull request for the 4.8 kernel:
      
       - Fix for L2CAP setsockopt
       - Fix for is_suspending flag handling in btmrvl driver
       - Addition of Bluetooth HW & FW info fields to debugfs
       - Fix to use int instead of char for callback status.
      
      The last one (from Geert Uytterhoeven) is actually not purely a
      Bluetooth (or 802.15.4) patch, but it was agreed with other maintainers
      that we take it through the bluetooth-next tree.
      
      Please let me know if there are any issues pulling. Thanks.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4599f772
    • bpf, elf: add official ELF machine define for eBPF · b02b94b3
      Daniel Borkmann authored
      Add the official BPF ELF e_machine value that was assigned recently [1,2]
      and will be propagated to glibc, et al. LLVM is switching to it in the
      3.9 release.
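
      For reference, the assigned value is 247, so the header addition is
      essentially:

        #define EM_BPF  247    /* Linux BPF -- in-kernel virtual machine */
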
      
        [1] https://github.com/llvm-mirror/llvm/commit/36b9c09330bfb5e771914cfe307588f30d5510d2
        [2] http://lists.iovisor.org/pipermail/iovisor-dev/2016-June/000266.html
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b02b94b3
    • bpf: fix implicit declaration of bpf_prog_add · cc2e0b3f
      Brenden Blanco authored
      For the ifndef case of CONFIG_BPF_SYSCALL, an inline version of
      bpf_prog_add needs to exist otherwise the build breaks on some configs.
      
       drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2544:10: error: implicit declaration of function 'bpf_prog_add'
             prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
      
      The function is introduced in
      59d3656d ("bpf: add bpf_prog_add api for bulk prog refcnt")
      and first used in
      47f1afdba2b87 ("net/mlx4_en: add support for fast rx drop bpf program").
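
      The fix is a stub for the !CONFIG_BPF_SYSCALL branch of <linux/bpf.h>,
      along the lines of the following sketch (the exact errno may differ):

        static inline struct bpf_prog *bpf_prog_add(struct bpf_prog *prog,
                                                    int i)
        {
            return ERR_PTR(-EOPNOTSUPP);
        }
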
      
      Fixes: 47f1afdba2b87 ("net/mlx4_en: add support for fast rx drop bpf program")
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Reported-by: Tariq Toukan <ttoukan.linux@gmail.com>
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cc2e0b3f
    • Merge branch 'xdp' · 22b35488
      David S. Miller authored
      Brenden Blanco says:
      
      ====================
      Add driver bpf hook for early packet drop and forwarding
      
      This patch set introduces new infrastructure for programmatically
      processing packets in the earliest stages of rx, as part of an effort
      others are calling eXpress Data Path (XDP) [1]. Start this effort by
      introducing a new bpf program type for early packet filtering, before
      even an skb has been allocated.
      
      Extend on this with the ability to modify packet data and send back out
      on the same port.
      
       Patch 1 adds an API for bulk bpf prog refcnt increment.
      Patch 2 introduces the new prog type and helpers for validating the bpf
        program. A new userspace struct is defined containing only data and
        data_end as fields, with others to follow in the future.
      In patch 3, create a new ndo to pass the fd to supported drivers.
      In patch 4, expose a new rtnl option to userspace.
      In patch 5, enable support in mlx4 driver.
      In patch 6, create a sample drop and count program. With single core,
        achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
        packet data access, bpf array lookup, and increment.
      In patch 7, add a page recycle facility to mlx4 rx, enabled when xdp is
        active.
      In patch 8, add the XDP_TX type to bpf.h
       In patch 9, add a helper in the tx path for writing the tx_desc
      In patch 10, add support in mlx4 for packet data write and forwarding
      In patch 11, turn on packet write support in the bpf verifier
      In patch 12, add a sample program for packet write and forwarding. With
        single core, achieved ~10 Mpps rewrite and forwarding.
      
      [1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
      
      v10:
       1/12: Add bulk refcnt api.
       5/12: Move prog from priv to ring. This attribute is still only set
         globally, but the path to finer granularity should be clear. No lock
         is taken, so some rings may operate on older programs for a time (one
         napi loop). Looked into options such as napi_synchronize, but they
         were deemed too slow (calls to msleep).
         Rename prog to xdp_prog. Add xdp_ring_num to help with accounting,
         used more heavily in later patches.
       7/12: Adjust to use per-ring xdp prog. Use priv->xdp_ring_num where
         before priv->prog was used to determine buffer allocations.
        9/12: Add cpu_to_be16 to vlan_tag in mlx4_en_xmit(). Remove unused variable
         from mlx4_en_xmit and unused params from build_inline_wqe.
      
      v9:
       4/11: Add missing newline in en_err message.
       6/11: Move page_cache cleanup from mlx4_en_destroy_rx_ring to
         mlx4_en_deactivate_rx_ring. Move mlx4_en_moderation_update back to
         static. Remove calls to mlx4_en_alloc/free_resources in mlx4_xdp_set.
         Adopt instead the approach of mlx4_en_change_mtu to use a watchdog.
       9/11: Use a per-ring function pointer in tx to separate out the code
         for regular and recycle paths of tx completion handling. Add a helper
         function to init the recycle ring and callback, called just after
         activating tx. Remove extra tx ring resource requirement, and instead
         steal from the upper rings. This helps to avoid needing
         mlx4_en_alloc_resources. Add some hopefully meaningful error
         messages for the various error cases. Reverted some of the
         hard-to-follow logic that was accounting for the extra tx rings.
      
      v8:
       1/11: Reduce WARN_ONCE to single line. Also, change act param of that
         function to u32 to match return type of bpf_prog_run_xdp.
       2/11: Clarify locking semantics in ndo comment.
       4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation.
      
      v7:
       Addressing two of the major discussion points: return codes and ndo.
       The rest will be taken as todo items for separate patches.
      
       Add an XDP_ABORTED type, which explicitly falls through to DROP. The
       same result must be taken for the default case as well, as it is now
       well-defined API behavior.
      
       Merge ndo_xdp_* into a single ndo. The style is similar to
       ndo_setup_tc, but with less unidirectional naming convention. The IFLA
       parameter names are unchanged.
      
       TODOs:
       Add ethtool per-ring stats for aborted, default cases, maybe even drop
       and tx as well.
       Avoid duplicate dma sync operation in XDP_PASS case as mentioned by
       Saeed.
      
        1/12: Add XDP_ABORTED enum, reword API comment, and update commit
         message.
        2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style
          calling convention.
        3/12: Switch to ndo_xdp callback.
        4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement
          ndo_xdp.
       12/12: Dropped, this will need some more work.
      
      v6:
        2/12: drop unnecessary netif_device_present check
        4/12, 6/12, 9/12: Reorder default case statement above drop case to
          remove some copy/paste.
      
      v5:
        0/12: Rebase and remove previous 1/13 patch
        1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
          in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
          out of bounds action is returned by the program. Add a comment to
          bpf.h denoting the undefined nature of out of bounds returns.
        2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
          ndo_xdp_attached().
        3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
          for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED.
        4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
          bpf_warn_invalid_xdp_action helper.
        5/12: Adjust to using the nested netlink options.
        6/12: kbuild was complaining about overflow of u16 on tile
          architecture...bump frag_stride to u32. The page_offset member that
          is computed from this was already u32.
      
      v4:
        2/12: Add inline helper for calling xdp bpf prog under rcu
        3/12: Add detail to ndo comments
        5/12: Remove mlx4_call_xdp and use inline helper instead.
        6/12: Fix checkpatch complaints
        9/12: Introduce new patch 9/12 with common helper for tx_desc write
          Refactor to use common tx_desc write helper
       11/12: Fix checkpatch complaints
      
      v3:
        Rewrite from v2 trying to incorporate feedback from multiple sources.
        Specifically, add ability to forward packets out the same port and
          allow packet modification.
        For packet forwarding, the driver reserves a dedicated set of tx rings
          for exclusive use by xdp. Upon completion, the pages on this ring are
          recycled directly back to a small per-rx-ring page cache without
          being dma unmapped.
        Use of the percpu skb is dropped in favor of a lightweight struct
          xdp_buff. The direct packet access feature is leveraged to remove
          dependence on the skb.
        The mlx4 driver implementation allocates a page-per-packet and maps it
          in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated.
        Naming is converted to use "xdp" instead of "phys_dev".
      
      v2:
        1/5: Drop xdp from types, instead consistently use bpf_phys_dev_
          Introduce enum for return values from phys_dev hook
        2/5: Move prog->type check to just before invoking ndo
          Change ndo to take a bpf_prog * instead of fd
          Add ndo_bpf_get rather than keeping a bool in the netdev struct
        3/5: Use ndo_bpf_get to fetch bool
        4/5: Enforce that only 1 frag is ever given to bpf prog by disallowing
          mtu to increase beyond FRAG_SZ0 when bpf prog is running, or conversely
          to set a bpf prog when priv->num_frags > 1
          Rename pseudo_skb to bpf_phys_dev_md
          Implement ndo_bpf_get
          Add dma sync just before invoking prog
          Check for explicit bpf return code rather than nonzero
          Remove increment of rx_dropped
        5/5: Use explicit bpf return code in example
          Update commit log with higher pps numbers
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22b35488
    • bpf: add sample for xdp forwarding and rewrite · 764cbcce
      Brenden Blanco authored
      Add a sample that rewrites and forwards packets out on the same
      interface. Observed single core forwarding performance of ~10Mpps.
      
      Since the mlx4 driver under test recycles every single packet page, the
      perf output shows almost exclusively just the ring management and bpf
      program work. Slowdowns are likely occurring due to cache misses.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      764cbcce
    • bpf: enable direct packet data write for xdp progs · 4acf6c0b
      Brenden Blanco authored
      For forwarding to be effective, XDP programs should be allowed to
      rewrite packet data.
      
      This requires that the drivers supporting XDP must all map the packet
      memory as TODEVICE or BIDIRECTIONAL before invoking the program.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4acf6c0b
    • net/mlx4_en: add xdp forwarding and data write support · 9ecc2d86
      Brenden Blanco authored
      A user will now be able to loop packets back out of the same port using
      a bpf program attached to the xdp hook. Updates to the packet contents
      from the bpf program are also supported.
      
      For the packet write feature to work, the rx buffers are now mapped as
      bidirectional when the page is allocated. This occurs only when the xdp
      hook is active.
      
      When the program returns a TX action, enqueue the packet directly to a
      dedicated tx ring, so as to completely avoid any locking. This requires
      the tx ring to be allocated 1:1 for each rx ring, as well as the tx
      completion running in the same softirq.
      
      Upon tx completion, this dedicated tx ring recycles pages without
      unmapping directly back to the original rx ring. In steady state tx/drop
      workload, effectively 0 page allocs/frees will occur.
      
      In order to separate out the paths between free and recycle, a
      free_tx_desc func pointer is introduced that is optionally updated
      whenever recycle_ring is activated. By default the original free
      function is always initialized.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ecc2d86
    • net/mlx4_en: break out tx_desc write into separate function · 224e92e0
      Brenden Blanco authored
      In preparation for writing the tx descriptor from multiple functions,
      create a helper for both normal and blueflame access.
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      224e92e0
    • bpf: add XDP_TX xdp_action for direct forwarding · 6ce96ca3
      Brenden Blanco authored
      XDP enabled drivers must transmit received packets back out on the same
      port they were received on when a program returns this action.
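
      A minimal program using the new action might look like this (section
      and helper conventions follow samples/bpf of this era; this is a
      sketch, not one of the patches in the series):

        #include <uapi/linux/bpf.h>
        #include "bpf_helpers.h"

        SEC("prog")
        int xdp_tx_everything(struct xdp_md *ctx)
        {
            /* bounce every received frame back out the same port */
            return XDP_TX;
        }

        char _license[] SEC("license") = "GPL";
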
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ce96ca3
    • net/mlx4_en: add page recycle to prepare rx ring for tx support · d576acf0
      Brenden Blanco authored
      The mlx4 driver by default allocates order-3 pages for the ring to
      consume in multiple fragments. When the device has an xdp program, this
      behavior will prevent tx actions since the page must be re-mapped in
      TODEVICE mode, which cannot be done if the page is still shared.
      
      Start by making the allocator configurable based on whether xdp is
      running, such that order-0 pages are always used and never shared.
      
      Since this will stress the page allocator, add a simple page cache to
      each rx ring. Pages in the cache are left dma-mapped, and in drop-only
      stress tests the page allocator is eliminated from the perf report.
      
      Note that setting an xdp program will now require the rings to be
      reconfigured.
      
      Before:
       26.91%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
       17.88%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
        6.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
        4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
        3.21%  swapper      [kernel.vmlinux]  [k] intel_idle
        2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
        2.57%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
      
      After:
       31.72%  swapper      [kernel.vmlinux]       [k] intel_idle
        8.79%  swapper      [mlx4_en]              [k] mlx4_en_process_rx_cq
        7.54%  swapper      [kernel.vmlinux]       [k] poll_idle
        6.36%  swapper      [mlx4_core]            [k] mlx4_eq_int
        4.21%  swapper      [kernel.vmlinux]       [k] tasklet_action
        4.03%  swapper      [kernel.vmlinux]       [k] cpuidle_enter_state
        3.43%  swapper      [mlx4_en]              [k] mlx4_en_prepare_rx_desc
        2.18%  swapper      [kernel.vmlinux]       [k] native_irq_return_iret
        1.37%  swapper      [kernel.vmlinux]       [k] menu_select
        1.09%  swapper      [kernel.vmlinux]       [k] bpf_map_lookup_elem
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d576acf0
    • Add sample for adding simple drop program to link · 86af8b41
      Brenden Blanco authored
      Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
      hook of a link. With the drop-only program, observed single core rate is
      ~20Mpps.
      
      Other tests were run as well; for instance, without the dropcnt increment
      or without reading from the packet header, the packet rate was mostly
      unchanged.
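
      A condensed sketch of the sample's structure (map size and header
      parsing are simplified relative to the real samples/bpf program):

        #include <uapi/linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/ip.h>
        #include "bpf_helpers.h"

        struct bpf_map_def SEC("maps") dropcnt = {
            .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
            .key_size    = sizeof(__u32),
            .value_size  = sizeof(long),
            .max_entries = 256,
        };

        SEC("prog")
        int xdp_drop_count(struct xdp_md *ctx)
        {
            void *data_end = (void *)(long)ctx->data_end;
            void *data = (void *)(long)ctx->data;
            struct iphdr *iph = data + sizeof(struct ethhdr);
            __u32 proto;
            long *value;

            if ((void *)(iph + 1) > data_end)
                return XDP_DROP;

            proto = iph->protocol;    /* e.g. 17 for the UDP test traffic */
            value = bpf_map_lookup_elem(&dropcnt, &proto);
            if (value)
                *value += 1;

            return XDP_DROP;
        }

        char _license[] SEC("license") = "GPL";
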
      
      $ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
      proto 17:   20403027 drops/s
      
      ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
      Running... ctrl^C to stop
      Device: eth4@0
      Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags)
        5056638pps 2427Mb/sec (2427186240bps) errors: 0
      Device: eth4@1
      Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags)
        5133311pps 2463Mb/sec (2463989280bps) errors: 0
      Device: eth4@2
      Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags)
        5077431pps 2437Mb/sec (2437166880bps) errors: 0
      Device: eth4@3
      Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags)
        5043067pps 2420Mb/sec (2420672160bps) errors: 0
      
      perf report --no-children:
       26.05%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
       17.84%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
        5.52%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
        4.90%  swapper      [kernel.vmlinux]  [k] poll_idle
        4.14%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
        2.78%  ksoftirqd/0  [kernel.vmlinux]  [k] __free_pages_ok
        2.57%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
        2.51%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
        1.94%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
        1.45%  swapper      [mlx4_en]         [k] mlx4_en_alloc_frags
        1.35%  ksoftirqd/0  [kernel.vmlinux]  [k] free_one_page
        1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
        1.04%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5c5
        0.96%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c58d
        0.93%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6ee
        0.92%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6b9
        0.89%  ksoftirqd/0  [kernel.vmlinux]  [k] __alloc_pages_nodemask
        0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c686
        0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5d5
        0.78%  ksoftirqd/0  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
        0.77%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5b4
        0.77%  ksoftirqd/0  [kernel.vmlinux]  [k] net_rx_action
      
      machine specs:
       receiver - Intel E5-1630 v3 @ 3.70GHz
       sender - Intel E5645 @ 2.40GHz
       Mellanox ConnectX-3 @40G
      Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86af8b41